Project 9: The Operational Readiness Review (ORR) System

Build a system that governs the “handover” (promotion) of a service from experimental to production-ready using automated checks and quality gates.

Quick Reference

Attribute              Value
---------------------  ---------------------------------------------------
Difficulty             Intermediate
Time Estimate          2 Weeks (20-30 hours)
Primary Language       YAML / Python
Alternative Languages  Go, Bash
Prerequisites          CI/CD familiarity, DevOps concepts
Key Topics             Governance, Service Maturity, Shift-Left Operations

1. Learning Objectives

By completing this project, you will:

  1. Define production readiness criteria for your organization
  2. Automate operational checks (documentation, monitoring, ownership)
  3. Implement maturity levels for services
  4. Integrate governance into CI/CD without creating bottlenecks
  5. Balance freedom and safety in team autonomy

2. Theoretical Foundation

2.1 Core Concepts

The Production Readiness Problem

CURRENT STATE                       DESIRED STATE
┌─────────────────────────────┐    ┌─────────────────────────────┐
│ "Ship it and pray"          │    │ "Ship with confidence"      │
│                             │    │                             │
│ - No runbook               │    │ ✓ Runbook exists            │
│ - No alerts                │    │ ✓ Alerts configured         │
│ - No owner listed          │    │ ✓ Owner in catalog          │
│ - Breaks at 3 AM           │    │ ✓ On-call rotation set      │
│ - "Who wrote this?"        │    │ ✓ Dependencies documented   │
└─────────────────────────────┘    └─────────────────────────────┘

Service Maturity Model

┌─────────────────────────────────────────────────────────────────┐
│                    SERVICE MATURITY LEVELS                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  LEVEL 0: Experimental                                         │
│  ─────────────────────────────                                 │
│  - No guarantees                                               │
│  - May be deleted without notice                               │
│  - Not in production                                           │
│                                                                 │
│  LEVEL 1: Development                                          │
│  ─────────────────────────────                                 │
│  ✓ Has owner                                                   │
│  ✓ Has README                                                  │
│  ✓ Has basic logging                                           │
│                                                                 │
│  LEVEL 2: Staging                                              │
│  ─────────────────────────────                                 │
│  ✓ All Level 1 requirements                                   │
│  ✓ Has health endpoints                                       │
│  ✓ Has monitoring dashboard                                   │
│  ✓ Has on-call rotation                                       │
│                                                                 │
│  LEVEL 3: Production                                           │
│  ─────────────────────────────                                 │
│  ✓ All Level 2 requirements                                   │
│  ✓ Has runbook                                                │
│  ✓ Has SLO defined                                            │
│  ✓ Has security review                                        │
│  ✓ Has load tested                                            │
│                                                                 │
│  LEVEL 4: Critical                                             │
│  ─────────────────────────────                                 │
│  ✓ All Level 3 requirements                                   │
│  ✓ Has disaster recovery plan                                 │
│  ✓ Has multi-region deployment                                │
│  ✓ Has chaos testing                                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Shift-Left Operations

TRADITIONAL                         SHIFT-LEFT

Code ──► Build ──► Test ──► Deploy ──► Operate
                                          │
                                    Problems found here
                                    (expensive to fix)

Code ──► Operate-Check ──► Build ──► Test ──► Deploy
              │
        Problems found here
        (cheap to fix)

2.2 Why This Matters

Without ORR:

  • Services launch with missing documentation
  • On-call gets paged for services they don’t know exist
  • Incidents take longer because runbooks don’t exist
  • Security vulnerabilities go undetected

With ORR:

  • Minimum bar for production is clear and automated
  • Teams know exactly what’s required before launch
  • Operations get better over time (continuous improvement)

2.3 Historical Context

  • Google PRR (2010s): Production Readiness Reviews formalized production readiness at scale
  • SRE Book (2016): Published Google’s Production Readiness Review process
  • Platform Engineering (2020s): Self-service ORR becomes standard

2.4 Common Misconceptions

Misconception              Reality
-------------------------  ---------------------------------------------------
“ORR slows us down”        ORR prevents 3 AM incidents that slow you down more
“One size fits all”        Different services need different maturity levels
“Manual review is enough”  Automated checks are faster and more consistent
“Developers will resist”   Developers appreciate clear, automated expectations

3. Project Specification

3.1 What You Will Build

  1. Checklist Definition: YAML-based requirements per maturity level
  2. Automated Checker: CLI tool that validates services
  3. CI/CD Integration: Block deploys that don’t meet the required level
  4. Dashboard: Show maturity status across all services
  5. Gamification: Badges/leaderboard for operational excellence

3.2 Functional Requirements

  1. Checklist Schema
    • Define checks per maturity level
    • Support automated checks (file exists, URL responds)
    • Support manual checks (security review approved)
    • Configurable per service type
  2. Checker Tool
    • orr-check <service> validates against requirements
    • Returns pass/fail with specific failures
    • Suggests fixes for each failure
  3. CI/CD Integration
    • Run on PR to main branch
    • Block merge if below required level
    • Allow override with explicit approval
  4. Dashboard
    • Show all services by maturity level
    • Drill down to specific failures
    • Track improvement over time

3.3 Non-Functional Requirements

  • Checks must complete in < 30 seconds
  • Must support 100+ services
  • Must be extensible for custom checks
  • Must integrate with existing CI (GitHub Actions, GitLab CI)

3.4 Example Usage / Output

Checklist Definition (orr-checklist.yaml):

maturity_levels:
  level_1_development:
    name: "Development"
    required_for: ["staging-deploy"]
    checks:
      - id: owner_defined
        type: file_exists
        path: "OWNERS"
        message: "Service must have an OWNERS file"

      - id: readme_exists
        type: file_exists
        path: "README.md"
        message: "Service must have a README"

      - id: logging_configured
        type: file_contains
        path: "config/logging.yaml"
        pattern: "structured_logging: true"
        message: "Structured logging must be enabled"

  level_2_staging:
    name: "Staging"
    required_for: ["production-deploy"]
    inherits: level_1_development
    checks:
      - id: health_endpoint
        type: url_responds
        path: "/health"
        status: 200
        message: "Service must expose /health endpoint"

      - id: metrics_endpoint
        type: url_responds
        path: "/metrics"
        status: 200
        message: "Service must expose /metrics endpoint"

      - id: oncall_configured
        type: external_check
        api: "https://pagerduty.com/api/schedules/{service_id}"
        expect: "schedule_exists"
        message: "On-call rotation must be configured in PagerDuty"

      - id: dashboard_exists
        type: external_check
        api: "https://grafana.internal/api/search?query={service_name}"
        expect: "result_count > 0"
        message: "Grafana dashboard must exist"

  level_3_production:
    name: "Production"
    required_for: ["critical-tier"]
    inherits: level_2_staging
    checks:
      - id: runbook_exists
        type: file_exists
        path: "docs/runbook.md"
        message: "Runbook documentation required"

      - id: slo_defined
        type: file_exists
        path: "slo.yaml"
        message: "SLO definition required"

      - id: security_review
        type: manual
        approval_system: "jira"
        ticket_type: "SECURITY-REVIEW"
        message: "Security review must be approved"

      - id: load_tested
        type: manual
        approval_system: "jira"
        ticket_type: "LOAD-TEST"
        message: "Load test results must be documented"
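
The `inherits` key implies that a level's effective checklist is its own checks plus everything from the levels below it. A minimal sketch of that resolution follows, assuming the YAML has already been parsed into a dict (for example with `yaml.safe_load`). The function name `resolve_checks` and the trimmed sample data are illustrative, not part of the specification.

```python
from typing import Dict, List


def resolve_checks(levels: Dict[str, dict], level_key: str) -> List[dict]:
    """Return the checks for level_key, preceded by every inherited check."""
    chain: List[str] = []
    key = level_key
    while key is not None:
        if key in chain:
            raise ValueError(f"inheritance cycle at {key!r}")
        if key not in levels:
            raise KeyError(f"unknown maturity level {key!r}")
        chain.append(key)
        key = levels[key].get("inherits")  # None ends the chain

    checks: List[dict] = []
    for key in reversed(chain):  # lowest level first
        checks.extend(levels[key].get("checks", []))
    return checks


# Trimmed stand-in for yaml.safe_load(text)["maturity_levels"]
levels = {
    "level_1_development": {
        "checks": [{"id": "owner_defined"}, {"id": "readme_exists"}],
    },
    "level_2_staging": {
        "inherits": "level_1_development",
        "checks": [{"id": "health_endpoint"}],
    },
}

print([c["id"] for c in resolve_checks(levels, "level_2_staging")])
# → ['owner_defined', 'readme_exists', 'health_endpoint']
```

Walking the chain explicitly makes it easy to reject cycles and misspelled level names early, before any check runs.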

CLI Output:

$ orr-check --service checkout-api --target-level production

╔══════════════════════════════════════════════════════════════════╗
║           OPERATIONAL READINESS REVIEW: checkout-api             ║
╚══════════════════════════════════════════════════════════════════╝

Target Level: Level 3 (Production)
Current Level: Level 2 (Staging)

══════════════════════════════════════════════════════════════════
LEVEL 1 (Development) - ALL PASSED ✓
══════════════════════════════════════════════════════════════════
  ✓ owner_defined       OWNERS file exists
  ✓ readme_exists       README.md exists
  ✓ logging_configured  Structured logging enabled

══════════════════════════════════════════════════════════════════
LEVEL 2 (Staging) - ALL PASSED ✓
══════════════════════════════════════════════════════════════════
  ✓ health_endpoint     /health returns 200
  ✓ metrics_endpoint    /metrics returns 200
  ✓ oncall_configured   PagerDuty schedule exists
  ✓ dashboard_exists    Grafana dashboard found

══════════════════════════════════════════════════════════════════
LEVEL 3 (Production) - 2 FAILED ✗
══════════════════════════════════════════════════════════════════
  ✓ runbook_exists      docs/runbook.md exists
  ✓ slo_defined         slo.yaml exists
  ✗ security_review     MISSING: No approved SECURITY-REVIEW ticket
                        → Create ticket: jira.example.com/security-review
  ✗ load_tested         MISSING: No approved LOAD-TEST ticket
                        → Create ticket: jira.example.com/load-test

══════════════════════════════════════════════════════════════════
SUMMARY
══════════════════════════════════════════════════════════════════
  Passed: 9/11
  Failed: 2/11

  Status: BLOCKED - Cannot deploy to production

  Next Steps:
  1. Complete security review (SECURITY-REVIEW ticket)
  2. Run load test and document results (LOAD-TEST ticket)

  Estimated time to Level 3: ~3-5 days

$ echo $?
1  # Exit code 1 = failed
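
The summary block and the non-zero exit code are what the CI gate actually consumes. One possible way to derive both from the check outcomes is sketched below. The function name `summarize` and the "READY" status text are assumptions for illustration, since the sample output only shows the blocked case.

```python
from typing import Dict, Tuple


def summarize(results: Dict[str, bool]) -> Tuple[str, int]:
    """Build the summary text and a process exit code from check outcomes."""
    total = len(results)
    failed = [check_id for check_id, ok in results.items() if not ok]
    lines = [
        f"Passed: {total - len(failed)}/{total}",
        f"Failed: {len(failed)}/{total}",
        "Status: BLOCKED" if failed else "Status: READY",  # READY is assumed wording
    ]
    lines.extend(f"  ✗ {check_id}" for check_id in failed)
    return "\n".join(lines), (1 if failed else 0)


# Outcomes matching the checkout-api example above
results = {
    "owner_defined": True, "readme_exists": True, "logging_configured": True,
    "health_endpoint": True, "metrics_endpoint": True,
    "oncall_configured": True, "dashboard_exists": True,
    "runbook_exists": True, "slo_defined": True,
    "security_review": False, "load_tested": False,
}
text, exit_code = summarize(results)
print(text)
```

A CLI entry point would then finish with `sys.exit(exit_code)`, so that CI systems treat any failed check as a failed job.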

Dashboard (Conceptual):

╔══════════════════════════════════════════════════════════════════╗
║                 SERVICE MATURITY DASHBOARD                       ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  OVERALL MATURITY                                               ║
║  ┌────────────────────────────────────────────────────────────┐ ║
║  │ Level 4 ██                              2 services (4%)    │ ║
║  │ Level 3 ██████████████████              18 services (36%) │ ║
║  │ Level 2 ████████████████████████        24 services (48%) │ ║
║  │ Level 1 ██████                          6 services (12%)  │ ║
║  └────────────────────────────────────────────────────────────┘ ║
║                                                                  ║
║  TOP BLOCKERS                                                   ║
║  ┌────────────────────────────────────────────────────────────┐ ║
║  │ 1. Missing runbook           - 12 services               │ ║
║  │ 2. No SLO defined            - 10 services               │ ║
║  │ 3. Security review pending   - 8 services                │ ║
║  │ 4. Missing on-call rotation  - 4 services                │ ║
║  └────────────────────────────────────────────────────────────┘ ║
║                                                                  ║
║  LEADERBOARD (Most Level 3+ Services)                           ║
║  ┌────────────────────────────────────────────────────────────┐ ║
║  │ 🥇 Platform Team      - 8/8 services at Level 3+         │ ║
║  │ 🥈 Payments Team      - 5/6 services at Level 3+         │ ║
║  │ 🥉 Checkout Team      - 4/5 services at Level 3+         │ ║
║  └────────────────────────────────────────────────────────────┘ ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝
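
The two aggregate panels of the dashboard reduce to counting. A rough sketch, assuming the per-service results are available as plain dicts, could look like the following. The function names and input shapes are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def maturity_distribution(levels: Dict[str, int]) -> List[Tuple[int, int, int]]:
    """Return (level, service_count, percent) rows, highest level first."""
    if not levels:
        return []
    counts = Counter(levels.values())
    total = len(levels)
    return [
        (level, count, round(100 * count / total))
        for level, count in sorted(counts.items(), reverse=True)
    ]


def top_blockers(failures: Dict[str, List[str]], limit: int = 4) -> List[Tuple[str, int]]:
    """Count how many services fail each check, most common first."""
    counts = Counter(
        check_id for failed in failures.values() for check_id in set(failed)
    )
    return counts.most_common(limit)


print(maturity_distribution({"a": 3, "b": 2, "c": 2, "d": 1}))
# → [(3, 1, 25), (2, 2, 50), (1, 1, 25)]
```

Keeping the aggregation separate from the web layer means the same functions can feed a Flask or FastAPI view, a CLI report, or a test.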

3.5 Real World Outcome

After implementing ORR:

  • All production services have documented owners
  • On-call has runbooks for every service they support
  • Security reviews happen before launch, not after an incident
  • Teams compete for operational excellence badges

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                       ORR SYSTEM                                 │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  CHECKLIST    │     │  CHECKER      │     │  DASHBOARD    │
│  (YAML)       │     │  (Python)     │     │  (Web)        │
│               │     │               │     │               │
│  Defines      │────►│  Executes     │────►│  Visualizes   │
│  requirements │     │  checks       │     │  status       │
└───────────────┘     └───────────────┘     └───────────────┘
                              │
                              ▼
                    ┌───────────────────┐
                    │  CI/CD GATE       │
                    │                   │
                    │  Block deploys    │
                    │  if checks fail   │
                    └───────────────────┘

4.2 Key Components

  1. Checklist Definition: YAML schema for requirements
  2. Check Executors: Plugins for different check types
  3. Aggregator: Combines results, determines level
  4. CI Integration: GitHub Actions / GitLab CI
  5. Dashboard: Web UI for visualization
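
The "plugins for different check types" idea can be approximated with a small registry keyed by the `type` string from the checklist. The sketch below uses plain functions and hypothetical names (`EXECUTORS`, `register`, `run_check`). It treats `pattern` as a regular expression, which is an assumption. Class-based executors could be registered the same way.

```python
import os
import re
from typing import Callable, Dict, Tuple

# An executor takes (service_dir, check) and returns (passed, details).
Executor = Callable[[str, dict], Tuple[bool, str]]
EXECUTORS: Dict[str, Executor] = {}


def register(check_type: str) -> Callable[[Executor], Executor]:
    """Decorator that adds an executor to the registry under check_type."""
    def decorator(func: Executor) -> Executor:
        EXECUTORS[check_type] = func
        return func
    return decorator


@register("file_exists")
def file_exists(service_dir: str, check: dict) -> Tuple[bool, str]:
    ok = os.path.isfile(os.path.join(service_dir, check["path"]))
    return ok, f"{check['path']} {'exists' if ok else 'is missing'}"


@register("file_contains")
def file_contains(service_dir: str, check: dict) -> Tuple[bool, str]:
    path = os.path.join(service_dir, check["path"])
    if not os.path.isfile(path):
        return False, f"{check['path']} is missing"
    with open(path, encoding="utf-8") as handle:
        found = re.search(check["pattern"], handle.read()) is not None
    state = "found" if found else "not found"
    return found, f"pattern {state} in {check['path']}"


def run_check(service_dir: str, check: dict) -> Tuple[bool, str]:
    """Dispatch a check to its executor; unknown types fail loudly."""
    executor = EXECUTORS.get(check["type"])
    if executor is None:
        return False, f"no executor registered for type {check['type']!r}"
    return executor(service_dir, check)
```

Adding a custom check type then only requires writing one decorated function, which keeps the runner itself unchanged.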

4.3 Data Structures

# models.py
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Optional

class CheckType(Enum):
    FILE_EXISTS = "file_exists"
    FILE_CONTAINS = "file_contains"
    URL_RESPONDS = "url_responds"
    EXTERNAL_CHECK = "external_check"
    MANUAL = "manual"

class CheckResult(Enum):
    PASS = "pass"
    FAIL = "fail"
    SKIP = "skip"

@dataclass
class Check:
    id: str
    type: CheckType
    message: str
    params: Dict[str, Any]  # Type-specific parameters (path, pattern, status, ...)

@dataclass
class CheckExecution:
    check_id: str
    result: CheckResult
    details: str = ""
    fix_suggestion: Optional[str] = None
    level: int = 0  # Maturity level the check belongs to; used to determine the current level

@dataclass
class ServiceMaturity:
    service_id: str
    current_level: int
    target_level: int
    passed_checks: List[str]
    failed_checks: List[CheckExecution]
    blocked: bool
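
Since `params` holds whatever is specific to a check type, one way to populate `Check` from a raw checklist entry is to treat `id`, `type`, and `message` as common fields and pass everything else through. The helper name `parse_check` is hypothetical, and the two types are repeated here only to keep the sketch self-contained.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class CheckType(Enum):
    FILE_EXISTS = "file_exists"
    FILE_CONTAINS = "file_contains"
    URL_RESPONDS = "url_responds"
    EXTERNAL_CHECK = "external_check"
    MANUAL = "manual"


@dataclass
class Check:
    id: str
    type: CheckType
    message: str
    params: Dict[str, Any]


COMMON_KEYS = {"id", "type", "message"}


def parse_check(raw: Dict[str, Any]) -> Check:
    """Split a raw checklist entry into common fields and type-specific params."""
    missing = COMMON_KEYS - set(raw)
    if missing:
        raise ValueError(f"check is missing required keys: {sorted(missing)}")
    return Check(
        id=raw["id"],
        type=CheckType(raw["type"]),  # raises ValueError for an unknown type
        message=raw["message"],
        params={k: v for k, v in raw.items() if k not in COMMON_KEYS},
    )


check = parse_check({
    "id": "logging_configured",
    "type": "file_contains",
    "path": "config/logging.yaml",
    "pattern": "structured_logging: true",
    "message": "Structured logging must be enabled",
})
print(check.params)
# → {'path': 'config/logging.yaml', 'pattern': 'structured_logging: true'}
```

Validating at parse time means a typo in the checklist surfaces as a configuration error rather than as a mysterious failing check.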

4.4 Algorithm Overview

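# runner.py
from typing import List

from checklist import load_checklist  # hypothetical loader (checklist.py in section 5.2)
from executors import get_executor    # hypothetical lookup from CheckType to executor
from models import CheckExecution, CheckResult, ServiceMaturity


def determine_level(results: List[CheckExecution]) -> int:
    """Return the highest level N for which every check in levels 1..N passed."""
    current_level = 0
    for level in sorted({r.level for r in results}):
        level_results = [r for r in results if r.level == level]
        if all(r.result == CheckResult.PASS for r in level_results):
            current_level = level
        else:
            break
    return current_level


def run_orr_check(service_id: str, target_level: int) -> ServiceMaturity:
    # Load checklist. Assumed here to map level number -> list of Check,
    # with YAML keys such as "level_1_development" already resolved.
    checklist = load_checklist("orr-checklist.yaml")

    # Execute every check up to the target level, tagging each result
    # with the level it belongs to
    results = []
    for level in range(1, target_level + 1):
        for check in checklist[level]:
            executor = get_executor(check.type)
            result = executor.run(service_id, check)
            result.level = level
            results.append(result)

    # Determine current level
    current_level = determine_level(results)

    # Build maturity report
    passed = [r.check_id for r in results if r.result == CheckResult.PASS]
    failed = [r for r in results if r.result == CheckResult.FAIL]

    return ServiceMaturity(
        service_id=service_id,
        current_level=current_level,
        target_level=target_level,
        passed_checks=passed,
        failed_checks=failed,
        blocked=(current_level < target_level)
    )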

5. Implementation Guide

5.1 Development Environment Setup

# Create project
mkdir orr-system && cd orr-system
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install pyyaml click requests colorama

5.2 Project Structure

orr-system/
├── config/
│   ├── orr-checklist.yaml
│   └── service-registry.yaml
├── src/
│   ├── __init__.py
│   ├── models.py
│   ├── checklist.py       # Load checklist definition
│   ├── executors/
│   │   ├── file.py        # file_exists, file_contains
│   │   ├── http.py        # url_responds
│   │   ├── external.py    # external_check
│   │   └── manual.py      # manual approval lookup
│   ├── runner.py          # Execute all checks
│   └── report.py          # Generate output
├── cli.py                 # Command-line interface
├── ci/
│   └── github-action.yaml # CI integration
└── dashboard/
    └── app.py             # Web dashboard (Flask/FastAPI)

5.3 The Core Question You’re Answering

“How do we ensure that ‘Freedom’ for teams doesn’t lead to ‘Chaos’ for the organization?”

Modern operating models give teams autonomy, but autonomy without standards is dangerous. The ORR system is the “Policy” that makes autonomy safe.

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Shift-Left Operations
    • Why should operational checks happen during development?
    • Book Reference: “The Phoenix Project”
  2. Service Maturity Models
    • What are the levels of maturity for a microservice?
    • Book Reference: “The Site Reliability Workbook” Ch. 11
  3. Continuous Compliance
    • How do you enforce policy without being a bottleneck?
    • Reference: Policy-as-code literature

5.5 Questions to Guide Your Design

Before implementing, think through these:

Automation vs. Manual

  • Which checks can be done by script (file exists)?
  • Which require human approval (security review)?
  • How do you handle “exceptions”?

Incentives

  • Why would a developer want to pass the ORR?
  • Does passing unlock something valuable (production deploy)?
  • How do you celebrate achievement (badges, leaderboard)?

Evolution

  • How do you add new checks without breaking existing services?
  • Should legacy services be grandfathered?

5.6 Thinking Exercise

The “Day 0” Disaster

Imagine you launch a service and it crashes within 5 minutes.

Questions:

  1. What information would you need to fix it? (Logs? Metrics? Owner?)
  2. If that information isn’t available, whose fault is it?
  3. Could an automated check have caught the missing info before launch?

Write down:

  • 5 pieces of information you’d need
  • For each, the ORR check that would ensure it exists

5.7 Hints in Layers

Hint 1: Start Simple
Begin with a Markdown checklist of 10 things every service must have.

Hint 2: Use GitHub Branch Protection
Require that a specific check (like orr-status) passes before merging to main.

Hint 3: Build the Checker Incrementally
Start with file_exists checks. Add HTTP checks. Add external API checks last.

Hint 4: Gamify It
Create a dashboard showing which teams have the most “Gold Level” services.

5.8 The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is an Operational Readiness Review?”
    • A checklist of requirements a service must meet before production.
  2. “How do you automate governance without slowing teams down?”
    • Make checks fast, provide clear fix suggestions, integrate into existing workflow.
  3. “Should ORR be centralized or self-service?”
    • Self-service checks with centralized standards. Teams run checks; central team defines rules.
  4. “What are the top 3 items on a production-readiness checklist?”
    • Owner defined, on-call configured, runbook exists.
  5. “How do you handle legacy services that don’t meet new standards?”
    • Grandfather with timeline, prioritize based on risk, track improvement.

5.9 Books That Will Help

Topic                 Book                             Chapter
--------------------  -------------------------------  -------
Production Readiness  “The Site Reliability Workbook”  Ch. 11
Shift-Left            “The Phoenix Project”            Part 2
Governance            “Modern Software Engineering”    Ch. 11

5.10 Implementation Phases

Phase 1: Checklist Design (3-4 hours)

  1. Define 4 maturity levels
  2. Write 3-5 checks per level
  3. Review with ops team

Phase 2: Checker Core (5-7 hours)

  1. Build file_exists executor
  2. Build file_contains executor
  3. Build url_responds executor
  4. Aggregate results by level
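
For the `url_responds` step, a dependency-free sketch using only the standard library is shown below. The `base_url` parameter is an assumption: the checklist only gives a path such as `/health`, so the service's address has to come from somewhere, for example the service registry. The `requests` package installed in 5.1 would work equally well.

```python
import urllib.error
import urllib.request
from typing import Tuple


def url_responds(base_url: str, path: str, expected_status: int = 200,
                 timeout: float = 5.0) -> Tuple[bool, str]:
    """Issue a GET request and compare the HTTP status with the expected one."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # urllib reports 4xx/5xx responses as exceptions
    except (urllib.error.URLError, OSError) as exc:
        return False, f"{url} unreachable: {exc}"
    return status == expected_status, f"{path} returns {status}"
```

Calling `url_responds("http://localhost:8080", "/health")` against a running service would return a `(passed, details)` pair. The explicit timeout matters because a hanging endpoint would otherwise eat into the 30-second budget from section 3.3.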

Phase 3: CI Integration (3-4 hours)

  1. Create GitHub Action
  2. Add as required check
  3. Test with sample repo

Phase 4: Dashboard (5-7 hours)

  1. Build simple Flask/FastAPI app
  2. Show all services with levels
  3. Add leaderboard

Phase 5: Rollout (3-4 hours)

  1. Run against all services
  2. Share results with teams
  3. Set timeline for compliance

5.11 Key Implementation Decisions

Decision       Option A       Option B               Recommendation
-------------  -------------  ---------------------  --------------------------------
Enforcement    Block deploys  Advisory only          Start advisory, move to blocking
Storage        Files in repo  Central DB             Files in repo (GitOps)
Manual checks  Jira tickets   Google Forms           Jira (audit trail)
Exceptions     No exceptions  Documented exceptions  Documented with expiration
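
The enforcement and exception recommendations combine naturally in the gate decision: a failed check blocks the deploy unless it has an unexpired, documented waiver, and advisory mode reports without blocking. The sketch below is one possible shape. The name `gate_decision` and the waiver format (check id mapped to an expiry date) are assumptions.

```python
from datetime import date
from typing import Dict, List


def gate_decision(failed_checks: List[str], waivers: Dict[str, date],
                  today: date, advisory: bool = False) -> dict:
    """Decide whether a deploy is blocked, honouring unexpired waivers."""
    waived = [c for c in failed_checks if c in waivers and waivers[c] >= today]
    blocking = [c for c in failed_checks if c not in waived]
    return {
        "blocking": blocking,
        "waived": waived,
        # Advisory mode still reports blocking checks but never blocks
        "blocked": bool(blocking) and not advisory,
    }


decision = gate_decision(
    failed_checks=["security_review", "load_tested"],
    waivers={"load_tested": date(2024, 2, 1), "security_review": date(2024, 1, 1)},
    today=date(2024, 1, 15),
)
print(decision)
# → {'blocking': ['security_review'], 'waived': ['load_tested'], 'blocked': True}
```

Because the waiver for `security_review` expired on 2024-01-01, it blocks again automatically, which is the point of attaching an expiration to every exception.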

6. Testing Strategy

Unit Tests

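# test_orr.py (run with pytest; module paths assume the layout in section 5.2)
from executors.file import FileExistsExecutor
from models import Check, CheckExecution, CheckResult, CheckType
from runner import determine_level

def test_file_exists_check(tmp_path):
    # Create a throwaway service directory that contains a README
    (tmp_path / "README.md").write_text("# test-service\n")
    executor = FileExistsExecutor()
    result = executor.run(str(tmp_path), Check(
        id="test",
        type=CheckType.FILE_EXISTS,
        message="Service must have a README",
        params={"path": "README.md"}
    ))
    assert result.result == CheckResult.PASS

def test_level_determination():
    # Level 1 passes completely and level 2 has a failure, so the level is 1
    results = [
        CheckExecution(check_id="a", result=CheckResult.PASS, details="", level=1),
        CheckExecution(check_id="b", result=CheckResult.PASS, details="", level=1),
        CheckExecution(check_id="c", result=CheckResult.FAIL, details="", level=2),
    ]
    assert determine_level(results) == 1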

Integration Tests

  • Run against real service repository
  • Verify all check types work
  • Verify CI integration blocks correctly

Rollout Testing

  • Run in advisory mode for 2 weeks
  • Review false positives/negatives
  • Adjust checks before enforcement

7. Common Pitfalls & Debugging

Problem            Symptom            Root Cause                  Fix
-----------------  -----------------  --------------------------  ------------------------------------
Too strict         No services pass   Checks don’t match reality  Start with current state as baseline
Too lenient        Everything passes  Checks too simple           Add meaningful checks
Slow checks        CI times out       External API calls slow     Add caching, parallel execution
Manual bottleneck  Tickets pile up    Too many manual checks      Automate more, reduce manual
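
For the "slow checks" row, parallel execution is usually the cheapest fix because most of the time goes to waiting on network calls. A minimal sketch with a thread pool follows. `run_parallel` is a hypothetical helper that takes zero-argument callables, and caching is not shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_parallel(checks: Dict[str, Callable[[], bool]],
                 max_workers: int = 8) -> Dict[str, bool]:
    """Run independent checks concurrently and collect pass/fail per check id."""
    results: Dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {check_id: pool.submit(func) for check_id, func in checks.items()}
        for check_id, future in futures.items():
            try:
                results[check_id] = bool(future.result())
            except Exception:
                results[check_id] = False  # a crashing check counts as a failure
    return results


def slow_check() -> bool:
    time.sleep(0.2)  # stands in for a slow external API call
    return True


def broken_check() -> bool:
    raise RuntimeError("simulated API outage")


outcome = run_parallel({"a": slow_check, "b": slow_check, "c": broken_check})
print(outcome)
# → {'a': True, 'b': True, 'c': False}
```

The pool does not cancel a check that hangs, so each check should still carry its own timeout.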

8. Extensions & Challenges

Extension 1: Policy-as-Code

Use Open Policy Agent (OPA) for complex rules.

Extension 2: Trend Tracking

Track maturity levels over time. Show improvement graphs.
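
Trend tracking needs little more than dated snapshots of each service's level. The sketch below appends snapshots to a JSON Lines file and compares the first and last entries. The file format and function names are assumptions, chosen because they fit the files-in-repo storage recommendation.

```python
import json
from datetime import date
from pathlib import Path
from typing import Dict


def record_snapshot(history_file: Path, day: date, levels: Dict[str, int]) -> None:
    """Append one dated snapshot of service levels as a JSON line."""
    with history_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps({"date": day.isoformat(), "levels": levels}) + "\n")


def level_changes(history_file: Path) -> Dict[str, int]:
    """Return each service's level change between the first and last snapshot."""
    lines = history_file.read_text(encoding="utf-8").splitlines()
    snapshots = [json.loads(line) for line in lines if line.strip()]
    if len(snapshots) < 2:
        return {}
    first, last = snapshots[0]["levels"], snapshots[-1]["levels"]
    # Services that appear in only one snapshot are ignored
    return {svc: last[svc] - first[svc] for svc in last if svc in first}
```

Running `record_snapshot` from a scheduled CI job would build up the history that an improvement graph can later be drawn from.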

Extension 3: Auto-Remediation

For some checks, automatically create PRs to fix issues.

Extension 4: Integration with Backstage

Publish maturity badges to service catalog.


9. Real-World Connections

How Big Tech Does This:

  • Google: Production Readiness Review (PRR)
  • Netflix: Chaos Engineering integrated with readiness
  • Spotify: Squad Health Check for non-technical readiness

Tools:


10. Resources

SRE Resources

Policy-as-Code


11. Self-Assessment Checklist

Before considering this project complete, verify:

  • Checklist has 4 maturity levels with clear criteria
  • At least 3 check types are implemented
  • CLI tool returns clear pass/fail with fix suggestions
  • CI integration blocks deploys (or at least warns)
  • Dashboard shows all services with their levels
  • At least one team has used this to improve a service
  • Legacy services have a documented path to compliance

12. Submission / Completion Criteria

This project is complete when you have:

  1. orr-checklist.yaml with 4 levels and 15+ checks
  2. CLI tool that runs checks and reports results
  3. CI integration (GitHub Action or GitLab CI)
  4. Dashboard showing service maturity overview
  5. Documentation for adding new checks
  6. Rollout plan for organization adoption
