Project 9: The Operational Readiness Review (ORR) System
Build a system that automates the “handover” or “promotion” of a service from experimental to production-ready through automated checks and quality gates.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 2 Weeks (20-30 hours) |
| Primary Language | YAML / Python |
| Alternative Languages | Go, Bash |
| Prerequisites | CI/CD familiarity, DevOps concepts |
| Key Topics | Governance, Service Maturity, Shift-Left Operations |
1. Learning Objectives
By completing this project, you will:
- Define production readiness criteria for your organization
- Automate operational checks (documentation, monitoring, ownership)
- Implement maturity levels for services
- Integrate governance into CI/CD without creating bottlenecks
- Balance freedom and safety in team autonomy
2. Theoretical Foundation
2.1 Core Concepts
The Production Readiness Problem
CURRENT STATE DESIRED STATE
┌─────────────────────────────┐ ┌─────────────────────────────┐
│ "Ship it and pray" │ │ "Ship with confidence" │
│ │ │ │
│ - No runbook │ │ ✓ Runbook exists │
│ - No alerts │ │ ✓ Alerts configured │
│ - No owner listed │ │ ✓ Owner in catalog │
│ - Breaks at 3 AM │ │ ✓ On-call rotation set │
│ - "Who wrote this?" │ │ ✓ Dependencies documented │
└─────────────────────────────┘ └─────────────────────────────┘
Service Maturity Model
┌─────────────────────────────────────────────────────────────────┐
│ SERVICE MATURITY LEVELS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LEVEL 0: Experimental │
│ ───────────────────────────── │
│ - No guarantees │
│ - May be deleted without notice │
│ - Not in production │
│ │
│ LEVEL 1: Development │
│ ───────────────────────────── │
│ ✓ Has owner │
│ ✓ Has README │
│ ✓ Has basic logging │
│ │
│ LEVEL 2: Staging │
│ ───────────────────────────── │
│ ✓ All Level 1 requirements │
│ ✓ Has health endpoints │
│ ✓ Has monitoring dashboard │
│ ✓ Has on-call rotation │
│ │
│ LEVEL 3: Production │
│ ───────────────────────────── │
│ ✓ All Level 2 requirements │
│ ✓ Has runbook │
│ ✓ Has SLO defined │
│ ✓ Has security review │
│ ✓ Has load tested │
│ │
│ LEVEL 4: Critical │
│ ───────────────────────────── │
│ ✓ All Level 3 requirements │
│ ✓ Has disaster recovery plan │
│ ✓ Has multi-region deployment │
│ ✓ Has chaos testing │
│ │
└─────────────────────────────────────────────────────────────────┘
Shift-Left Operations
TRADITIONAL
Code ──► Build ──► Test ──► Deploy ──► Operate
                                          │
                              Problems found here
                              (expensive to fix)
SHIFT-LEFT
Code ──► Operate-Check ──► Build ──► Test ──► Deploy
              │
   Problems found here
   (cheap to fix)
2.2 Why This Matters
Without ORR:
- Services launch with missing documentation
- On-call gets paged for services they don’t know exist
- Incidents take longer because runbooks don’t exist
- Security vulnerabilities go undetected
With ORR:
- Minimum bar for production is clear and automated
- Teams know exactly what’s required before launch
- Operations get better over time (continuous improvement)
2.3 Historical Context
- Google ORR (2010s): Formalized production readiness at scale
- SRE Book (2016): Published ORR concepts
- Platform Engineering (2020s): Self-service ORR becomes standard
2.4 Common Misconceptions
| Misconception | Reality |
|---|---|
| “ORR slows us down” | ORR prevents 3 AM incidents that slow you down more |
| “One size fits all” | Different services need different maturity levels |
| “Manual review is enough” | Automated checks are faster and more consistent |
| “Developers will resist” | Developers appreciate clear, automated expectations |
3. Project Specification
3.1 What You Will Build
- Checklist Definition: YAML-based requirements per maturity level
- Automated Checker: CLI tool that validates services
- CI/CD Integration: Block deploys that don’t meet level
- Dashboard: Show maturity status across all services
- Gamification: Badges/leaderboard for operational excellence
3.2 Functional Requirements
- Checklist Schema
- Define checks per maturity level
- Support automated checks (file exists, URL responds)
- Support manual checks (security review approved)
- Configurable per service type
- Checker Tool
- orr-check <service> validates against requirements
- Returns pass/fail with specific failures
- Suggests fixes for each failure
- CI/CD Integration
- Run on PR to main branch
- Block merge if below required level
- Allow override with explicit approval
- Dashboard
- Show all services by maturity level
- Drill down to specific failures
- Track improvement over time
3.3 Non-Functional Requirements
- Checks must complete in < 30 seconds
- Must support 100+ services
- Must be extensible for custom checks
- Must integrate with existing CI (GitHub Actions, GitLab CI)
3.4 Example Usage / Output
Checklist Definition (orr-checklist.yaml):
maturity_levels:
level_1_development:
name: "Development"
required_for: ["staging-deploy"]
checks:
- id: owner_defined
type: file_exists
path: "OWNERS"
message: "Service must have an OWNERS file"
- id: readme_exists
type: file_exists
path: "README.md"
message: "Service must have a README"
- id: logging_configured
type: file_contains
path: "config/logging.yaml"
pattern: "structured_logging: true"
message: "Structured logging must be enabled"
level_2_staging:
name: "Staging"
required_for: ["production-deploy"]
inherits: level_1_development
checks:
- id: health_endpoint
type: url_responds
path: "/health"
status: 200
message: "Service must expose /health endpoint"
- id: metrics_endpoint
type: url_responds
path: "/metrics"
status: 200
message: "Service must expose /metrics endpoint"
- id: oncall_configured
type: external_check
api: "https://pagerduty.com/api/schedules/{service_id}"
expect: "schedule_exists"
message: "On-call rotation must be configured in PagerDuty"
- id: dashboard_exists
type: external_check
api: "https://grafana.internal/api/search?query={service_name}"
expect: "result_count > 0"
message: "Grafana dashboard must exist"
level_3_production:
name: "Production"
required_for: ["critical-tier"]
inherits: level_2_staging
checks:
- id: runbook_exists
type: file_exists
path: "docs/runbook.md"
message: "Runbook documentation required"
- id: slo_defined
type: file_exists
path: "slo.yaml"
message: "SLO definition required"
- id: security_review
type: manual
approval_system: "jira"
ticket_type: "SECURITY-REVIEW"
message: "Security review must be approved"
- id: load_tested
type: manual
approval_system: "jira"
ticket_type: "LOAD-TEST"
message: "Load test results must be documented"
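The `inherits` key above can be resolved at load time by accumulating parent checks before a level's own. A minimal loader sketch (the `load_checklist` name matches section 4.4; the return shape is an assumption):

```python
# checklist.py -- load orr-checklist.yaml and resolve `inherits` chains
import yaml

def load_checklist(path: str) -> dict:
    """Return {level_key: [check dicts]}, with inherited checks included."""
    with open(path) as f:
        levels = yaml.safe_load(f)["maturity_levels"]

    resolved: dict = {}

    def resolve(key: str) -> list:
        if key not in resolved:
            level = levels[key]
            checks = []
            if "inherits" in level:
                checks.extend(resolve(level["inherits"]))  # parent checks first
            checks.extend(level.get("checks", []))
            resolved[key] = checks
        return resolved[key]

    for key in levels:
        resolve(key)
    return resolved
```

With the checklist above, `load_checklist("orr-checklist.yaml")["level_3_production"]` would return the Level 1, 2, and 3 checks in order, so the runner never has to know about inheritance.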
CLI Output:
$ orr-check --service checkout-api --target-level production
╔══════════════════════════════════════════════════════════════════╗
║ OPERATIONAL READINESS REVIEW: checkout-api ║
╚══════════════════════════════════════════════════════════════════╝
Target Level: Level 3 (Production)
Current Level: Level 2 (Staging)
══════════════════════════════════════════════════════════════════
LEVEL 1 (Development) - ALL PASSED ✓
══════════════════════════════════════════════════════════════════
✓ owner_defined OWNERS file exists
✓ readme_exists README.md exists
✓ logging_configured Structured logging enabled
══════════════════════════════════════════════════════════════════
LEVEL 2 (Staging) - ALL PASSED ✓
══════════════════════════════════════════════════════════════════
✓ health_endpoint /health returns 200
✓ metrics_endpoint /metrics returns 200
✓ oncall_configured PagerDuty schedule exists
✓ dashboard_exists Grafana dashboard found
══════════════════════════════════════════════════════════════════
LEVEL 3 (Production) - 2 FAILED ✗
══════════════════════════════════════════════════════════════════
✓ runbook_exists docs/runbook.md exists
✓ slo_defined slo.yaml exists
✗ security_review MISSING: No approved SECURITY-REVIEW ticket
→ Create ticket: jira.example.com/secure-review
✗ load_tested MISSING: No approved LOAD-TEST ticket
→ Create ticket: jira.example.com/load-test
══════════════════════════════════════════════════════════════════
SUMMARY
══════════════════════════════════════════════════════════════════
Passed: 9/11
Failed: 2/11
Status: BLOCKED - Cannot deploy to production
Next Steps:
1. Complete security review (SECURITY-REVIEW ticket)
2. Run load test and document results (LOAD-TEST ticket)
Estimated time to Level 3: ~3-5 days
$ echo $?
1 # Exit code 1 = failed
Dashboard (Conceptual):
╔══════════════════════════════════════════════════════════════════╗
║ SERVICE MATURITY DASHBOARD ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ OVERALL MATURITY ║
║ ┌────────────────────────────────────────────────────────────┐ ║
║ │ Level 4 ██ 2 services (4%) │ ║
║ │ Level 3 ██████████████████ 18 services (36%) │ ║
║ │ Level 2 ████████████████████████ 24 services (48%) │ ║
║ │ Level 1 ██████ 6 services (12%) │ ║
║ └────────────────────────────────────────────────────────────┘ ║
║ ║
║ TOP BLOCKERS ║
║ ┌────────────────────────────────────────────────────────────┐ ║
║ │ 1. Missing runbook - 12 services │ ║
║ │ 2. No SLO defined - 10 services │ ║
║ │ 3. Security review pending - 8 services │ ║
║ │ 4. Missing on-call rotation - 4 services │ ║
║ └────────────────────────────────────────────────────────────┘ ║
║ ║
║ LEADERBOARD (Most Level 3+ Services) ║
║ ┌────────────────────────────────────────────────────────────┐ ║
║ │ 🥇 Platform Team - 8/8 services at Level 3+ │ ║
║ │ 🥈 Payments Team - 5/6 services at Level 3+ │ ║
║ │ 🥉 Checkout Team - 4/5 services at Level 3+ │ ║
║ └────────────────────────────────────────────────────────────┘ ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
3.5 Real World Outcome
After implementing ORR:
- All production services have documented owners
- On-call has runbooks for every service they support
- Security reviews happen before launch, not after incident
- Teams compete for operational excellence badges
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────────┐
│ ORR SYSTEM │
└─────────────────────────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ CHECKLIST │ │ CHECKER │ │ DASHBOARD │
│ (YAML) │ │ (Python) │ │ (Web) │
│ │ │ │ │ │
│ Defines │────►│ Executes │────►│ Visualizes │
│ requirements │ │ checks │ │ status │
└───────────────┘ └───────────────┘ └───────────────┘
│
▼
┌───────────────────┐
│ CI/CD GATE │
│ │
│ Block deploys │
│ if checks fail │
└───────────────────┘
4.2 Key Components
- Checklist Definition: YAML schema for requirements
- Check Executors: Plugins for different check types
- Aggregator: Combines results, determines level
- CI Integration: GitHub Actions / GitLab CI
- Dashboard: Web UI for visualization
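The "Check Executors" plugin idea can be sketched as a small registry keyed by check type, so new check types drop in without touching the runner. The decorator pattern and the `(passed, details)` return shape are illustrative assumptions:

```python
# runner.py -- registry so new check types plug in without touching core logic
import os

EXECUTORS = {}

def register(check_type: str):
    """Class decorator that maps a check-type string to an executor instance."""
    def wrap(cls):
        EXECUTORS[check_type] = cls()
        return cls
    return wrap

def get_executor(check_type: str):
    return EXECUTORS[check_type]

@register("file_exists")
class FileExistsExecutor:
    def run(self, service_dir: str, params: dict):
        path = os.path.join(service_dir, params["path"])
        return os.path.isfile(path), f"checked {path}"
```

Adding a new check type is then one decorated class; `get_executor("file_exists").run(repo_dir, {"path": "OWNERS"})` is all the runner needs to call.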
4.3 Data Structures
# models.py
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional
class CheckType(Enum):
FILE_EXISTS = "file_exists"
FILE_CONTAINS = "file_contains"
URL_RESPONDS = "url_responds"
EXTERNAL_CHECK = "external_check"
MANUAL = "manual"
class CheckResult(Enum):
PASS = "pass"
FAIL = "fail"
SKIP = "skip"
@dataclass
class Check:
id: str
type: CheckType
message: str
params: dict # Type-specific parameters
@dataclass
class CheckExecution:
    check_id: str
    result: CheckResult
    details: str = ""
    fix_suggestion: Optional[str] = None
    level: int = 0  # Maturity level the check belongs to (set by the runner)
@dataclass
class ServiceMaturity:
service_id: str
current_level: int
target_level: int
passed_checks: List[str]
failed_checks: List[CheckExecution]
blocked: bool
4.4 Algorithm Overview
def run_orr_check(service_id: str, target_level: int) -> ServiceMaturity:
# Load checklist
checklist = load_checklist("orr-checklist.yaml")
    # Get all checks up to target level, tracking which level each belongs to
    all_checks = []
    for level in range(1, target_level + 1):
        level_def = checklist[f"level_{level}"]
        all_checks.extend((level, check) for check in level_def.checks)
    # Execute each check and tag the result with its level
    results = []
    for level, check in all_checks:
        executor = get_executor(check.type)
        result = executor.run(service_id, check)
        result.level = level
        results.append(result)
# Determine current level
current_level = 0
for level in range(1, target_level + 1):
level_checks = [c for c in results if c.level == level]
if all(c.result == CheckResult.PASS for c in level_checks):
current_level = level
else:
break
# Build maturity report
passed = [r.check_id for r in results if r.result == CheckResult.PASS]
failed = [r for r in results if r.result == CheckResult.FAIL]
return ServiceMaturity(
service_id=service_id,
current_level=current_level,
target_level=target_level,
passed_checks=passed,
failed_checks=failed,
blocked=(current_level < target_level)
)
5. Implementation Guide
5.1 Development Environment Setup
# Create project
mkdir orr-system && cd orr-system
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install pyyaml click requests colorama
5.2 Project Structure
orr-system/
├── config/
│ ├── orr-checklist.yaml
│ └── service-registry.yaml
├── src/
│ ├── __init__.py
│ ├── models.py
│ ├── checklist.py # Load checklist definition
│ ├── executors/
│ │ ├── file.py # file_exists, file_contains
│ │ ├── http.py # url_responds
│ │ ├── external.py # external_check
│ │ └── manual.py # manual approval lookup
│ ├── runner.py # Execute all checks
│ └── report.py # Generate output
├── cli.py # Command-line interface
├── ci/
│ └── github-action.yaml # CI integration
└── dashboard/
└── app.py # Web dashboard (Flask/FastAPI)
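The `executors/http.py` module handles `url_responds` checks. A sketch using only the standard library (field names follow the checklist YAML above; the return shape is an assumption):

```python
# executors/http.py -- url_responds executor (sketch)
from urllib import error, request

class UrlRespondsExecutor:
    def run(self, base_url: str, check: dict):
        url = base_url.rstrip("/") + check["path"]
        try:
            with request.urlopen(url, timeout=5) as resp:  # short timeout keeps checks fast
                status = resp.status
        except error.HTTPError as exc:
            status = exc.code  # non-2xx responses raise HTTPError; we still got a status
        except (error.URLError, OSError) as exc:
            return False, f"{url} unreachable: {exc}"
        expected = check.get("status", 200)
        return status == expected, f"{url} returned {status} (expected {expected})"
```

Note the two failure modes are reported differently: a wrong status code ("returned 503") and an unreachable service produce distinct details, which makes fix suggestions more useful.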
5.3 The Core Question You’re Answering
“How do we ensure that ‘Freedom’ for teams doesn’t lead to ‘Chaos’ for the organization?”
Modern operating models give teams autonomy, but autonomy without standards is dangerous. The ORR system is the “Policy” that makes autonomy safe.
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Shift-Left Operations
- Why should operational checks happen during development?
- Book Reference: “The Phoenix Project”
- Service Maturity Models
- What are the levels of maturity for a microservice?
- Book Reference: “The Site Reliability Workbook” Ch. 11
- Continuous Compliance
- How do you enforce policy without being a bottleneck?
- Reference: Policy-as-code literature
5.5 Questions to Guide Your Design
Before implementing, think through these:
Automation vs. Manual
- Which checks can be done by script (file exists)?
- Which require human approval (security review)?
- How do you handle “exceptions”?
Incentives
- Why would a developer want to pass the ORR?
- Does passing unlock something valuable (production deploy)?
- How do you celebrate achievement (badges, leaderboard)?
Evolution
- How do you add new checks without breaking existing services?
- Should legacy services be grandfathered?
5.6 Thinking Exercise
The “Day 0” Disaster
Imagine you launch a service and it crashes within 5 minutes.
Questions:
- What information would you need to fix it? (Logs? Metrics? Owner?)
- If that information isn’t available, whose fault is it?
- Could an automated check have caught the missing info before launch?
Write down:
- 5 pieces of information you’d need
- For each, the ORR check that would ensure it exists
5.7 Hints in Layers
Hint 1: Start Simple. Begin with a Markdown checklist of 10 things every service must have.
Hint 2: Use GitHub Branch Protection. Require that a specific check (like orr-status) passes before merging to main.
Hint 3: Build the Checker Incrementally. Start with file_exists checks. Add HTTP checks. Add external API checks last.
Hint 4: Gamify It. Create a dashboard showing which teams have the most "Gold Level" services.
5.8 The Interview Questions They’ll Ask
Prepare to answer these:
- “What is an Operational Readiness Review?”
- A checklist of requirements a service must meet before production.
- “How do you automate governance without slowing teams down?”
- Make checks fast, provide clear fix suggestions, integrate into existing workflow.
- “Should ORR be centralized or self-service?”
- Self-service checks with centralized standards. Teams run checks; central team defines rules.
- “What are the top 3 items on a production-readiness checklist?”
- Owner defined, on-call configured, runbook exists.
- “How do you handle legacy services that don’t meet new standards?”
- Grandfather with timeline, prioritize based on risk, track improvement.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Production Readiness | “The Site Reliability Workbook” | Ch. 11 |
| Shift-Left | “The Phoenix Project” | Part 2 |
| Governance | “Modern Software Engineering” | Ch. 11 |
5.10 Implementation Phases
Phase 1: Checklist Design (3-4 hours)
- Define 4 maturity levels
- Write 3-5 checks per level
- Review with ops team
Phase 2: Checker Core (5-7 hours)
- Build file_exists executor
- Build file_contains executor
- Build url_responds executor
- Aggregate results by level
Phase 3: CI Integration (3-4 hours)
- Create GitHub Action
- Add as required check
- Test with sample repo
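A minimal workflow sketch for the GitHub Action (the `python cli.py` invocation and `--target-level` flag follow the CLI example in section 3.4; paths and action versions are assumptions):

```yaml
# .github/workflows/orr.yaml -- run the ORR check on every PR to main
name: ORR Gate
on:
  pull_request:
    branches: [main]
jobs:
  orr-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pyyaml click requests colorama
      # Exit code 1 fails the job; marking "orr-check" as a required
      # status check in branch protection then blocks the merge.
      - run: python cli.py --service ${{ github.event.repository.name }} --target-level production
```

Start by making this job optional (advisory mode), then flip it to a required check once most services pass.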
Phase 4: Dashboard (5-7 hours)
- Build simple Flask/FastAPI app
- Show all services with levels
- Add leaderboard
Phase 5: Rollout (3-4 hours)
- Run against all services
- Share results with teams
- Set timeline for compliance
5.11 Key Implementation Decisions
| Decision | Option A | Option B | Recommendation |
|---|---|---|---|
| Enforcement | Block deploys | Advisory only | Start advisory, move to blocking |
| Storage | Files in repo | Central DB | Files in repo (GitOps) |
| Manual checks | Jira tickets | Google Forms | Jira (audit trail) |
| Exceptions | No exceptions | Documented exceptions | Documented with expiration |
6. Testing Strategy
Unit Tests
def test_file_exists_check():
    executor = FileExistsExecutor()
    result = executor.run("./test-service", Check(
        id="test",
        type=CheckType.FILE_EXISTS,
        message="README must exist",
        params={"path": "README.md"}
    ))
    assert result.result == CheckResult.PASS
def test_level_determination():
results = [
CheckExecution(check_id="a", result=CheckResult.PASS, level=1),
CheckExecution(check_id="b", result=CheckResult.PASS, level=1),
CheckExecution(check_id="c", result=CheckResult.FAIL, level=2),
]
assert determine_level(results) == 1
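The `determine_level` helper exercised above walks levels in order and stops at the first level with a failure. A sketch, consistent with the algorithm in section 4.4 (it only assumes results carry `.level` and a `.result` enum whose value is "pass"):

```python
# src/runner.py -- derive the highest fully-passing level from check results
def determine_level(results) -> int:
    """Return the highest maturity level whose checks all passed.

    Levels must pass in order: a failure at level N caps the service at
    N-1, even if every level N+1 check happens to pass.
    """
    if not results:
        return 0
    current = 0
    for level in range(1, max(r.level for r in results) + 1):
        level_results = [r for r in results if r.level == level]
        if level_results and all(r.result.value == "pass" for r in level_results):
            current = level
        else:
            break
    return current
```

The early `break` is what makes the test above return 1: the level 2 failure stops the walk even before any level 3 checks are considered.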
Integration Tests
- Run against real service repository
- Verify all check types work
- Verify CI integration blocks correctly
Rollout Testing
- Run in advisory mode for 2 weeks
- Review false positives/negatives
- Adjust checks before enforcement
7. Common Pitfalls & Debugging
| Problem | Symptom | Root Cause | Fix |
|---|---|---|---|
| Too strict | No services pass | Checks don’t match reality | Start with current state as baseline |
| Too lenient | Everything passes | Checks too simple | Add meaningful checks |
| Slow checks | CI times out | External API calls slow | Add caching, parallel execution |
| Manual bottleneck | Tickets pile up | Too many manual checks | Automate more, reduce manual |
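The "parallel execution" fix for slow checks can be sketched with a thread pool; check executors are I/O-bound (files, HTTP, external APIs), so threads are sufficient and the 30-second budget becomes roughly the slowest single check rather than the sum:

```python
# Run independent checks concurrently to stay under the 30s budget
from concurrent.futures import ThreadPoolExecutor

def run_checks_parallel(checks, run_one, max_workers=8):
    """Execute run_one(check) for every check concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, checks))
```

`pool.map` preserves input order, so the report still lists results in checklist order. Caching external API responses (e.g. the PagerDuty and Grafana lookups) for a few minutes compounds the win on repeated CI runs.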
8. Extensions & Challenges
Extension 1: Policy-as-Code
Use Open Policy Agent (OPA) for complex rules.
Extension 2: Trend Tracking
Track maturity levels over time. Show improvement graphs.
Extension 3: Auto-Remediation
For some checks, automatically create PRs to fix issues.
Extension 4: Integration with Backstage
Publish maturity badges to service catalog.
9. Real-World Connections
How Big Tech Does This:
- Google: Production Readiness Review (PRR)
- Netflix: Chaos Engineering integrated with readiness
- Spotify: Squad Health Check for non-technical readiness
Tools:
- OpsLevel - Service maturity platform
- Backstage - Can display maturity badges
- Open Policy Agent - Policy-as-code
10. Resources
SRE Resources
Policy-as-Code
- Open Policy Agent
- Conftest - Test structured data against OPA
Related Projects
- P03: Ownership Mapper - Define owners
- P07: SLE Agreement - Define SLOs
11. Self-Assessment Checklist
Before considering this project complete, verify:
- Checklist has 4 maturity levels with clear criteria
- At least 3 check types are implemented
- CLI tool returns clear pass/fail with fix suggestions
- CI integration blocks deploys (or at least warns)
- Dashboard shows all services with their levels
- At least one team has used this to improve a service
- Legacy services have a documented path to compliance
12. Submission / Completion Criteria
This project is complete when you have:
- orr-checklist.yaml with 4 levels and 15+ checks
- CLI tool that runs checks and reports results
- CI integration (GitHub Action or GitLab CI)
- Dashboard showing service maturity overview
- Documentation for adding new checks
- Rollout plan for organization adoption
Previous Project: P08: Dependency Spaghetti Visualizer Next Project: P10: Incident Response Battle Cards