Project 9: The Operational Readiness Review (ORR) System

Build a system that automates the “handover” or “promotion” of a service from experimental to production-ready through automated checks and quality gates.

Quick Reference

Attribute	Value
Difficulty	Intermediate
Time Estimate	2 Weeks (20-30 hours)
Primary Language	YAML / Python
Alternative Languages	Go, Bash
Prerequisites	CI/CD familiarity, DevOps concepts
Key Topics	Governance, Service Maturity, Shift-Left Operations

1. Learning Objectives

By completing this project, you will:

Define production readiness criteria for your organization
Automate operational checks (documentation, monitoring, ownership)
Implement maturity levels for services
Integrate governance into CI/CD without creating bottlenecks
Balance freedom and safety in team autonomy

2. Theoretical Foundation

2.1 Core Concepts

The Production Readiness Problem

CURRENT STATE                       DESIRED STATE
┌─────────────────────────────┐    ┌─────────────────────────────┐
│ "Ship it and pray"          │    │ "Ship with confidence"      │
│                             │    │                             │
│ - No runbook                │    │ ✓ Runbook exists            │
│ - No alerts                 │    │ ✓ Alerts configured         │
│ - No owner listed           │    │ ✓ Owner in catalog          │
│ - Breaks at 3 AM            │    │ ✓ On-call rotation set      │
│ - "Who wrote this?"         │    │ ✓ Dependencies documented   │
└─────────────────────────────┘    └─────────────────────────────┘

Service Maturity Model

┌─────────────────────────────────────────────────────────────────┐
│                    SERVICE MATURITY LEVELS                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  LEVEL 0: Experimental                                         │
│  ─────────────────────────────                                 │
│  - No guarantees                                               │
│  - May be deleted without notice                               │
│  - Not in production                                           │
│                                                                 │
│  LEVEL 1: Development                                          │
│  ─────────────────────────────                                 │
│  ✓ Has owner                                                   │
│  ✓ Has README                                                  │
│  ✓ Has basic logging                                           │
│                                                                 │
│  LEVEL 2: Staging                                              │
│  ─────────────────────────────                                 │
│  ✓ All Level 1 requirements                                   │
│  ✓ Has health endpoints                                       │
│  ✓ Has monitoring dashboard                                   │
│  ✓ Has on-call rotation                                       │
│                                                                 │
│  LEVEL 3: Production                                           │
│  ─────────────────────────────                                 │
│  ✓ All Level 2 requirements                                   │
│  ✓ Has runbook                                                │
│  ✓ Has SLO defined                                            │
│  ✓ Has security review                                        │
│  ✓ Has load tested                                            │
│                                                                 │
│  LEVEL 4: Critical                                             │
│  ─────────────────────────────                                 │
│  ✓ All Level 3 requirements                                   │
│  ✓ Has disaster recovery plan                                 │
│  ✓ Has multi-region deployment                                │
│  ✓ Has chaos testing                                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
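
The cumulative nature of the levels ("All Level N requirements" plus new ones) can be expressed directly in code. The following is a minimal sketch, not a prescribed implementation: the requirement labels and the names `MaturityLevel`, `LEVEL_REQUIREMENTS`, `cumulative_requirements`, and `achieved_level` are illustrative assumptions derived from the diagram above.

```python
from enum import IntEnum
from typing import Dict, List, Set


class MaturityLevel(IntEnum):
    EXPERIMENTAL = 0
    DEVELOPMENT = 1
    STAGING = 2
    PRODUCTION = 3
    CRITICAL = 4


# Requirements introduced at each level (labels are illustrative, taken from the diagram)
LEVEL_REQUIREMENTS: Dict[MaturityLevel, List[str]] = {
    MaturityLevel.EXPERIMENTAL: [],
    MaturityLevel.DEVELOPMENT: ["owner", "readme", "basic_logging"],
    MaturityLevel.STAGING: ["health_endpoints", "monitoring_dashboard", "oncall_rotation"],
    MaturityLevel.PRODUCTION: ["runbook", "slo", "security_review", "load_test"],
    MaturityLevel.CRITICAL: ["disaster_recovery_plan", "multi_region", "chaos_testing"],
}


def cumulative_requirements(level: MaturityLevel) -> List[str]:
    """Return every requirement a service must satisfy to hold `level`."""
    required: List[str] = []
    for lvl in MaturityLevel:  # IntEnum members iterate in definition order (0..4)
        if lvl > level:
            break
        required.extend(LEVEL_REQUIREMENTS[lvl])
    return required


def achieved_level(satisfied: Set[str]) -> MaturityLevel:
    """Return the highest level whose cumulative requirements are all satisfied."""
    achieved = MaturityLevel.EXPERIMENTAL
    for lvl in MaturityLevel:
        if set(cumulative_requirements(lvl)) <= satisfied:
            achieved = lvl
        else:
            break  # a gap at this level caps the service here
    return achieved
```

Note that a service with a runbook but no README still sits at Level 0 in this sketch: higher-level items do not count until every lower level is complete.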

Shift-Left Operations

TRADITIONAL                         SHIFT-LEFT

Code ──► Build ──► Test ──► Deploy ──► Operate
                                          │
                                    Problems found here
                                    (expensive to fix)

Code ──► Operate-Check ──► Build ──► Test ──► Deploy
              │
        Problems found here
        (cheap to fix)

2.2 Why This Matters

Without ORR:

Services launch with missing documentation
On-call gets paged for services they don’t know exist
Incidents take longer because runbooks don’t exist
Security vulnerabilities go undetected

With ORR:

Minimum bar for production is clear and automated
Teams know exactly what’s required before launch
Operations get better over time (continuous improvement)

2.3 Historical Context

Google PRR (2010s): Formalized Production Readiness Reviews at scale
SRE Book (2016): Described Google's Production Readiness Review (PRR) process publicly
Platform Engineering (2020s): Self-service ORR becomes standard

2.4 Common Misconceptions

Misconception	Reality
“ORR slows us down”	ORR prevents 3 AM incidents that slow you down more
“One size fits all”	Different services need different maturity levels
“Manual review is enough”	Automated checks are faster and more consistent
“Developers will resist”	Developers appreciate clear, automated expectations

3. Project Specification

3.1 What You Will Build

Checklist Definition: YAML-based requirements per maturity level
Automated Checker: CLI tool that validates services
CI/CD Integration: Block deploys that don’t meet level
Dashboard: Show maturity status across all services
Gamification: Badges/leaderboard for operational excellence

3.2 Functional Requirements

Checklist Schema
- Define checks per maturity level
- Support automated checks (file exists, URL responds)
- Support manual checks (security review approved)
- Configurable per service type
Checker Tool
- orr-check --service <service> --target-level <level> validates a service against the requirements
- Returns pass/fail with specific failures
- Suggests fixes for each failure
CI/CD Integration
- Run on PR to main branch
- Block merge if below required level
- Allow override with explicit approval
Dashboard
- Show all services by maturity level
- Drill down to specific failures
- Track improvement over time
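
The CI/CD integration requirements above (block when below the required level, allow an explicitly approved override) reduce to a small decision function. This is a sketch under assumptions: `GateDecision`, `evaluate_gate`, and the `override_approver` parameter are hypothetical names, and how an approval is actually recorded (PR label, ticket, sign-off) is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateDecision:
    allowed: bool
    reason: str


def evaluate_gate(current_level: int, required_level: int,
                  override_approver: Optional[str] = None) -> GateDecision:
    """Decide whether a merge or deploy may proceed."""
    if current_level >= required_level:
        return GateDecision(True, f"Level {current_level} meets required level {required_level}")
    if override_approver:
        # Overrides are allowed but always attributed to a named approver
        return GateDecision(
            True,
            f"Level {current_level} is below required level {required_level}; "
            f"override approved by {override_approver}",
        )
    return GateDecision(False, f"Level {current_level} is below required level {required_level}")


def gate_exit_code(decision: GateDecision) -> int:
    """Map the decision to a process exit code that a CI job can act on."""
    return 0 if decision.allowed else 1
```

Keeping the approver's name in the reason string means the override shows up in the CI log, which gives a basic audit trail without extra tooling.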

3.3 Non-Functional Requirements

Checks must complete in < 30 seconds
Must support 100+ services
Must be extensible for custom checks
Must integrate with existing CI (GitHub Actions, GitLab CI)

3.4 Example Usage / Output

Checklist Definition (orr-checklist.yaml):

maturity_levels:
  level_1_development:
    name: "Development"
    required_for: ["staging-deploy"]
    checks:
      - id: owner_defined
        type: file_exists
        path: "OWNERS"
        message: "Service must have an OWNERS file"

      - id: readme_exists
        type: file_exists
        path: "README.md"
        message: "Service must have a README"

      - id: logging_configured
        type: file_contains
        path: "config/logging.yaml"
        pattern: "structured_logging: true"
        message: "Structured logging must be enabled"

  level_2_staging:
    name: "Staging"
    required_for: ["production-deploy"]
    inherits: level_1_development
    checks:
      - id: health_endpoint
        type: url_responds
        path: "/health"
        status: 200
        message: "Service must expose /health endpoint"

      - id: metrics_endpoint
        type: url_responds
        path: "/metrics"
        status: 200
        message: "Service must expose /metrics endpoint"

      - id: oncall_configured
        type: external_check
        api: "https://api.pagerduty.com/schedules/{service_id}"
        expect: "schedule_exists"
        message: "On-call rotation must be configured in PagerDuty"

      - id: dashboard_exists
        type: external_check
        api: "https://grafana.internal/api/search?query={service_name}"
        expect: "result_count > 0"
        message: "Grafana dashboard must exist"

  level_3_production:
    name: "Production"
    required_for: ["critical-tier"]
    inherits: level_2_staging
    checks:
      - id: runbook_exists
        type: file_exists
        path: "docs/runbook.md"
        message: "Runbook documentation required"

      - id: slo_defined
        type: file_exists
        path: "slo.yaml"
        message: "SLO definition required"

      - id: security_review
        type: manual
        approval_system: "jira"
        ticket_type: "SECURITY-REVIEW"
        message: "Security review must be approved"

      - id: load_tested
        type: manual
        approval_system: "jira"
        ticket_type: "LOAD-TEST"
        message: "Load test results must be documented"
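
The `inherits` key in the checklist above implies that a level's effective checks include those of every level it inherits from. One way to resolve that chain is sketched below. It operates on the already-parsed `maturity_levels` mapping (the kind of nested dict a YAML parser such as `yaml.safe_load` would return); the function name `resolve_checks` is an assumption for illustration.

```python
from typing import Any, Dict, List, Optional


def resolve_checks(maturity_levels: Dict[str, Dict[str, Any]],
                   level_key: str) -> List[Dict[str, Any]]:
    """Return checks for `level_key`, preceded by those of every inherited level."""
    chain: List[str] = []
    seen = set()
    key: Optional[str] = level_key
    while key is not None:
        if key in seen:
            raise ValueError(f"Circular 'inherits' chain at {key!r}")
        if key not in maturity_levels:
            raise KeyError(f"Unknown maturity level {key!r}")
        seen.add(key)
        chain.append(key)
        key = maturity_levels[key].get("inherits")

    checks: List[Dict[str, Any]] = []
    for name in reversed(chain):  # lowest level first, so output order matches the levels
        checks.extend(maturity_levels[name].get("checks", []))
    return checks


# Trimmed-down version of the checklist above, as parsed data
EXAMPLE_LEVELS = {
    "level_1_development": {
        "checks": [{"id": "owner_defined", "type": "file_exists", "path": "OWNERS"}],
    },
    "level_2_staging": {
        "inherits": "level_1_development",
        "checks": [{"id": "health_endpoint", "type": "url_responds", "path": "/health"}],
    },
}
```

Calling `resolve_checks(EXAMPLE_LEVELS, "level_2_staging")` yields the `owner_defined` check followed by `health_endpoint`. Detecting cycles up front avoids an infinite loop if someone mis-edits the checklist.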

CLI Output:

$ orr-check --service checkout-api --target-level production

╔══════════════════════════════════════════════════════════════════╗
║           OPERATIONAL READINESS REVIEW: checkout-api             ║
╚══════════════════════════════════════════════════════════════════╝

Target Level: Level 3 (Production)
Current Level: Level 2 (Staging)

══════════════════════════════════════════════════════════════════
LEVEL 1 (Development) - ALL PASSED ✓
══════════════════════════════════════════════════════════════════
  ✓ owner_defined       OWNERS file exists
  ✓ readme_exists       README.md exists
  ✓ logging_configured  Structured logging enabled

══════════════════════════════════════════════════════════════════
LEVEL 2 (Staging) - ALL PASSED ✓
══════════════════════════════════════════════════════════════════
  ✓ health_endpoint     /health returns 200
  ✓ metrics_endpoint    /metrics returns 200
  ✓ oncall_configured   PagerDuty schedule exists
  ✓ dashboard_exists    Grafana dashboard found

══════════════════════════════════════════════════════════════════
LEVEL 3 (Production) - 2 FAILED ✗
══════════════════════════════════════════════════════════════════
  ✓ runbook_exists      docs/runbook.md exists
  ✓ slo_defined         slo.yaml exists
  ✗ security_review     MISSING: No approved SECURITY-REVIEW ticket
                        → Create ticket: jira.example.com/security-review
  ✗ load_tested         MISSING: No approved LOAD-TEST ticket
                        → Create ticket: jira.example.com/load-test

══════════════════════════════════════════════════════════════════
SUMMARY
══════════════════════════════════════════════════════════════════
  Passed: 9/11
  Failed: 2/11

  Status: BLOCKED - Cannot deploy to production

  Next Steps:
  1. Complete security review (SECURITY-REVIEW ticket)
  2. Run load test and document results (LOAD-TEST ticket)

  Estimated time to Level 3: ~3-5 days

$ echo $?
1  # Exit code 1 = failed
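
The summary block and exit code shown above can be derived from the raw results in a few lines. The sketch below assumes results arrive as `(check_id, passed)` pairs; the function name `summarize` and the wording of the success status line are illustrative, not taken from the tool's actual output.

```python
from typing import List, Tuple


def summarize(results: List[Tuple[str, bool]], target: str) -> Tuple[str, int]:
    """Build the SUMMARY text and the process exit code from (check_id, passed) pairs."""
    total = len(results)
    passed = sum(1 for _, ok in results if ok)
    failed = total - passed

    lines = [f"Passed: {passed}/{total}", f"Failed: {failed}/{total}"]
    if failed:
        lines.append(f"Status: BLOCKED - Cannot deploy to {target}")
        exit_code = 1  # non-zero so CI treats the step as failed
    else:
        # Success wording is an assumption; the sample output only shows the blocked case
        lines.append(f"Status: PASSED - Ready to deploy to {target}")
        exit_code = 0
    return "\n".join(lines), exit_code
```

Returning the exit code instead of calling `sys.exit` inside the function keeps it easy to unit test; the CLI entry point can pass the code to `sys.exit` itself.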

Dashboard (Conceptual):

╔══════════════════════════════════════════════════════════════════╗
║                 SERVICE MATURITY DASHBOARD                       ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  OVERALL MATURITY                                               ║
║  ┌────────────────────────────────────────────────────────────┐ ║
║  │ Level 4 ██                              2 services (4%)    │ ║
║  │ Level 3 ██████████████████              18 services (36%) │ ║
║  │ Level 2 ████████████████████████        24 services (48%) │ ║
║  │ Level 1 ██████                          6 services (12%)  │ ║
║  └────────────────────────────────────────────────────────────┘ ║
║                                                                  ║
║  TOP BLOCKERS                                                   ║
║  ┌────────────────────────────────────────────────────────────┐ ║
║  │ 1. Missing runbook           - 12 services               │ ║
║  │ 2. No SLO defined            - 10 services               │ ║
║  │ 3. Security review pending   - 8 services                │ ║
║  │ 4. Missing on-call rotation  - 4 services                │ ║
║  └────────────────────────────────────────────────────────────┘ ║
║                                                                  ║
║  LEADERBOARD (Most Level 3+ Services)                           ║
║  ┌────────────────────────────────────────────────────────────┐ ║
║  │ 🥇 Platform Team      - 8/8 services at Level 3+         │ ║
║  │ 🥈 Payments Team      - 5/6 services at Level 3+         │ ║
║  │ 🥉 Checkout Team      - 4/5 services at Level 3+         │ ║
║  └────────────────────────────────────────────────────────────┘ ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝
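
The two top panels of the dashboard are simple aggregations over per-service results. A minimal sketch follows, assuming the dashboard receives a mapping of service name to current level and a mapping of service name to failed check IDs; both input shapes and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def level_distribution(levels: Dict[str, int]) -> Dict[int, Tuple[int, float]]:
    """Map each maturity level to (service count, percentage of all services)."""
    if not levels:
        return {}
    counts = Counter(levels.values())
    total = len(levels)
    # Highest level first, matching the dashboard layout
    return {
        lvl: (n, round(100 * n / total, 1))
        for lvl, n in sorted(counts.items(), reverse=True)
    }


def top_blockers(failures: Dict[str, List[str]], limit: int = 4) -> List[Tuple[str, int]]:
    """Return the failed check IDs that affect the most services."""
    counts = Counter(check_id for failed in failures.values() for check_id in failed)
    return counts.most_common(limit)
```

For example, four services at levels 3, 3, 2, and 1 give a distribution of 50% at Level 3 and 25% each at Levels 2 and 1.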

3.5 Real World Outcome

After implementing ORR:

All production services have documented owners
On-call has runbooks for every service they support
Security reviews happen before launch, not after incident

Teams compete for operational excellence badges

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────────┐
│                       ORR SYSTEM                                 │
└─────────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        │                     │                     │
        ▼                     ▼                     ▼
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│  CHECKLIST    │     │  CHECKER      │     │  DASHBOARD    │
│  (YAML)       │     │  (Python)     │     │  (Web)        │
│               │     │               │     │               │
│  Defines      │────►│  Executes     │────►│  Visualizes   │
│  requirements │     │  checks       │     │  status       │
└───────────────┘     └───────────────┘     └───────────────┘
                              │
                              ▼
                    ┌───────────────────┐
                    │  CI/CD GATE       │
                    │                   │
                    │  Block deploys    │
                    │  if checks fail   │
                    └───────────────────┘

4.2 Key Components

Checklist Definition: YAML schema for requirements
Check Executors: Plugins for different check types
Aggregator: Combines results, determines level
CI Integration: GitHub Actions / GitLab CI
Dashboard: Web UI for visualization
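
Treating check executors as plugins usually means a registry keyed by the checklist's `type` value. The sketch below shows one possible shape using a decorator. It simplifies executors to plain functions returning a boolean, and the `env_var_set` check type is a hypothetical custom check that does not appear in the checklist.

```python
import os
from typing import Callable, Dict

# An executor takes the service directory and the check's parameters and
# returns True when the check passes. (Simplified signature for this sketch.)
Executor = Callable[[str, dict], bool]

_EXECUTORS: Dict[str, Executor] = {}


def register(check_type: str) -> Callable[[Executor], Executor]:
    """Decorator that registers an executor under a checklist `type` value."""
    def decorator(func: Executor) -> Executor:
        _EXECUTORS[check_type] = func
        return func
    return decorator


def get_executor(check_type: str) -> Executor:
    """Look up the executor for a check type, failing loudly on unknown types."""
    try:
        return _EXECUTORS[check_type]
    except KeyError:
        raise ValueError(f"No executor registered for check type {check_type!r}") from None


# Hypothetical custom check type, added without touching the runner
@register("env_var_set")
def env_var_set(service_dir: str, params: dict) -> bool:
    return bool(os.environ.get(params["name"]))
```

With this layout, adding a new check type is a matter of writing one decorated function, which is one way to meet the extensibility requirement.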

4.3 Data Structures

# models.py
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class CheckType(Enum):
    FILE_EXISTS = "file_exists"
    FILE_CONTAINS = "file_contains"
    URL_RESPONDS = "url_responds"
    EXTERNAL_CHECK = "external_check"
    MANUAL = "manual"

class CheckResult(Enum):
    PASS = "pass"
    FAIL = "fail"
    SKIP = "skip"

@dataclass
class Check:
    id: str
    type: CheckType
    message: str
    params: dict  # Type-specific parameters

@dataclass
class CheckExecution:
    check_id: str
    result: CheckResult
    details: str = ""
    fix_suggestion: Optional[str] = None
    level: int = 0  # Maturity level the check belongs to (used when aggregating by level)

@dataclass
class ServiceMaturity:
    service_id: str
    current_level: int
    target_level: int
    passed_checks: List[str]
    failed_checks: List[CheckExecution]
    blocked: bool

4.4 Algorithm Overview

from typing import Dict, List

def run_orr_check(service_id: str, target_level: int) -> ServiceMaturity:
    # Load checklist.
    # Assumption: load_checklist returns {level_number: [Check, ...]}, mapping YAML
    # keys such as "level_1_development" to 1. Each list holds only the checks that
    # level introduces, so inherited checks are not duplicated.
    checks_by_level: Dict[int, List[Check]] = load_checklist("orr-checklist.yaml")

    # Execute every check up to the target level, keeping results grouped by level
    results_by_level: Dict[int, List[CheckExecution]] = {}
    for level in range(1, target_level + 1):
        level_results = []
        for check in checks_by_level[level]:
            executor = get_executor(check.type)
            level_results.append(executor.run(service_id, check))
        results_by_level[level] = level_results

    # Determine current level: the highest level for which that level and
    # every level below it pass completely
    current_level = 0
    for level in range(1, target_level + 1):
        if all(r.result == CheckResult.PASS for r in results_by_level[level]):
            current_level = level
        else:
            break

    # Build maturity report
    results = [r for level_results in results_by_level.values() for r in level_results]
    passed = [r.check_id for r in results if r.result == CheckResult.PASS]
    failed = [r for r in results if r.result == CheckResult.FAIL]

    return ServiceMaturity(
        service_id=service_id,
        current_level=current_level,
        target_level=target_level,
        passed_checks=passed,
        failed_checks=failed,
        blocked=(current_level < target_level)
    )
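
The level-determination step can also be factored into a standalone helper, which makes it testable in isolation. The sketch below is self-contained, so it defines minimal stand-ins for `CheckResult` and `CheckExecution`; it assumes each execution record carries the `level` its check belongs to, and the helper name `determine_level` is illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class CheckResult(Enum):
    PASS = "pass"
    FAIL = "fail"
    SKIP = "skip"


@dataclass
class CheckExecution:
    # Minimal stand-in; `level` is an assumed field for grouping results
    check_id: str
    result: CheckResult
    level: int


def determine_level(results: Iterable[CheckExecution]) -> int:
    """Return the highest level N such that levels 1..N all pass completely."""
    by_level: Dict[int, List[CheckExecution]] = defaultdict(list)
    for r in results:
        by_level[r.level].append(r)

    current = 0
    for level in sorted(by_level):
        if level != current + 1:
            break  # a level with no recorded checks cannot be skipped over
        if all(r.result is CheckResult.PASS for r in by_level[level]):
            current = level
        else:
            break  # any FAIL or SKIP caps the service at the previous level
    return current
```

In this sketch a `SKIP` counts as not passing, which is the conservative choice; a real system might instead treat skipped checks as neutral.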

5. Implementation Guide

5.1 Development Environment Setup

# Create project
mkdir orr-system && cd orr-system
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install pyyaml click requests colorama

5.2 Project Structure

orr-system/
├── config/
│   ├── orr-checklist.yaml
│   └── service-registry.yaml
├── src/
│   ├── __init__.py
│   ├── models.py
│   ├── checklist.py       # Load checklist definition
│   ├── executors/
│   │   ├── file.py        # file_exists, file_contains
│   │   ├── http.py        # url_responds
│   │   ├── external.py    # external_check
│   │   └── manual.py      # manual approval lookup
│   ├── runner.py          # Execute all checks
│   └── report.py          # Generate output
├── cli.py                 # Command-line interface
├── ci/
│   └── github-action.yaml # CI integration
└── dashboard/
    └── app.py             # Web dashboard (Flask/FastAPI)

5.3 The Core Question You’re Answering

“How do we ensure that ‘Freedom’ for teams doesn’t lead to ‘Chaos’ for the organization?”

Modern operating models give teams autonomy, but autonomy without standards is dangerous. The ORR system is the “Policy” that makes autonomy safe.

5.4 Concepts You Must Understand First

Stop and research these before coding:

Shift-Left Operations
- Why should operational checks happen during development?
- Book Reference: “The Phoenix Project”
Service Maturity Models
- What are the levels of maturity for a microservice?
- Book Reference: “The Site Reliability Workbook” Ch. 18 (SRE Engagement Model)
Continuous Compliance
- How do you enforce policy without being a bottleneck?
- Reference: Policy-as-code literature

5.5 Questions to Guide Your Design

Before implementing, think through these:

Automation vs. Manual

Which checks can be done by script (file exists)?
Which require human approval (security review)?
How do you handle “exceptions”?

Incentives

Why would a developer want to pass the ORR?
Does passing unlock something valuable (production deploy)?
How do you celebrate achievement (badges, leaderboard)?

Evolution

How do you add new checks without breaking existing services?
Should legacy services be grandfathered?

5.6 Thinking Exercise

The “Day 0” Disaster

Imagine you launch a service and it crashes within 5 minutes.

Questions:

What information would you need to fix it? (Logs? Metrics? Owner?)
If that information isn’t available, whose fault is it?
Could an automated check have caught the missing info before launch?

Write down:

5 pieces of information you’d need
For each, the ORR check that would ensure it exists

5.7 Hints in Layers

Hint 1: Start Simple. Begin with a Markdown checklist of 10 things every service must have.

Hint 2: Use GitHub Branch Protection. Require that a specific status check (like orr-status) passes before merging to main.

Hint 3: Build the Checker Incrementally. Start with file_exists checks, then add HTTP checks, and add external API checks last.

Hint 4: Gamify It. Create a dashboard showing which teams have the most Level 3+ services.

5.8 The Interview Questions They’ll Ask

Prepare to answer these:

“What is an Operational Readiness Review?”
- A checklist of requirements a service must meet before production.
“How do you automate governance without slowing teams down?”
- Make checks fast, provide clear fix suggestions, integrate into existing workflow.
“Should ORR be centralized or self-service?”
- Self-service checks with centralized standards. Teams run checks; central team defines rules.
“What are the top 3 items on a production-readiness checklist?”
- Owner defined, on-call configured, runbook exists.
“How do you handle legacy services that don’t meet new standards?”
- Grandfather with timeline, prioritize based on risk, track improvement.

5.9 Books That Will Help

Topic	Book	Chapter
Production Readiness	“The Site Reliability Workbook”	Ch. 18
Shift-Left	“The Phoenix Project”	Part 2
Governance	“Modern Software Engineering”	Ch. 11

5.10 Implementation Phases

Phase 1: Checklist Design (3-4 hours)

Define 4 maturity levels
Write 3-5 checks per level
Review with ops team

Phase 2: Checker Core (5-7 hours)

Build file_exists executor
Build file_contains executor
Build url_responds executor
Aggregate results by level
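
For the first two executors in this phase, a minimal sketch might look like the following. It uses a simplified `Outcome` record in place of the full `CheckExecution` model, and it treats the checklist's `pattern` as a regular expression, which is an assumption (a plain substring match would also satisfy the example checklist).

```python
import os
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    passed: bool
    details: str
    fix_suggestion: Optional[str] = None


def check_file_exists(service_dir: str, path: str) -> Outcome:
    """Pass when `path` exists as a regular file inside the service directory."""
    if os.path.isfile(os.path.join(service_dir, path)):
        return Outcome(True, f"{path} exists")
    return Outcome(False, f"{path} not found", f"Create {path} in the service repository")


def check_file_contains(service_dir: str, path: str, pattern: str) -> Outcome:
    """Pass when the file exists and its text matches `pattern` (treated as a regex)."""
    full_path = os.path.join(service_dir, path)
    if not os.path.isfile(full_path):
        return Outcome(False, f"{path} not found", f"Create {path} in the service repository")
    with open(full_path, encoding="utf-8") as fh:
        text = fh.read()
    if re.search(pattern, text):
        return Outcome(True, f"{path} matches {pattern!r}")
    return Outcome(
        False,
        f"{path} does not match {pattern!r}",
        f"Add a line matching {pattern!r} to {path}",
    )
```

Returning a fix suggestion alongside every failure is what lets the CLI print the "→" hints shown in the example output.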

Phase 3: CI Integration (3-4 hours)

Create GitHub Action
Add as required check
Test with sample repo

Phase 4: Dashboard (5-7 hours)

Build simple Flask/FastAPI app
Show all services with levels
Add leaderboard

Phase 5: Rollout (3-4 hours)

Run against all services
Share results with teams
Set timeline for compliance

5.11 Key Implementation Decisions

Decision	Option A	Option B	Recommendation
Enforcement	Block deploys	Advisory only	Start advisory, move to blocking
Storage	Files in repo	Central DB	Files in repo (GitOps)
Manual checks	Jira tickets	Google Forms	Jira (audit trail)
Exceptions	No exceptions	Documented exceptions	Documented with expiration
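
The "documented with expiration" recommendation for exceptions can be modeled as waivers that stop applying after a date. The following is a sketch under assumptions: the `Waiver` record, its fields, and `blocking_failures` are hypothetical names, and where waivers are stored (a file in the repo, a ticket system) is left open.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Waiver:
    service_id: str
    check_id: str
    reason: str     # documented justification for the exception
    expires: date   # last day on which the waiver still applies


def blocking_failures(service_id: str,
                      failed_check_ids: Iterable[str],
                      waivers: Iterable[Waiver],
                      today: date) -> List[str]:
    """Return the failed checks that still block, after applying unexpired waivers."""
    active = {
        w.check_id
        for w in waivers
        if w.service_id == service_id and w.expires >= today
    }
    return [check_id for check_id in failed_check_ids if check_id not in active]
```

Passing `today` in explicitly, instead of calling `date.today()` inside the function, keeps the expiry logic deterministic in tests.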

6. Testing Strategy

Unit Tests

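# Import paths assume the project layout from section 5.2;
# determine_level is assumed to live in src/runner.py
from src.models import Check, CheckExecution, CheckResult, CheckType
from src.executors.file import FileExistsExecutor
from src.runner import determine_level

def test_file_exists_check(tmp_path):
    # Create a throwaway service directory containing a README (pytest tmp_path fixture)
    (tmp_path / "README.md").write_text("# test-service\n")
    executor = FileExistsExecutor()
    result = executor.run(str(tmp_path), Check(
        id="test",
        type=CheckType.FILE_EXISTS,
        message="Service must have a README",
        params={"path": "README.md"},
    ))
    assert result.result == CheckResult.PASS

def test_level_determination():
    # Assumes CheckExecution records the level each check belongs to
    results = [
        CheckExecution(check_id="a", result=CheckResult.PASS, details="", level=1),
        CheckExecution(check_id="b", result=CheckResult.PASS, details="", level=1),
        CheckExecution(check_id="c", result=CheckResult.FAIL, details="", level=2),
    ]
    # Level 1 fully passes and Level 2 has a failure, so the service is at Level 1
    assert determine_level(results) == 1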

Integration Tests

Run against real service repository
Verify all check types work
Verify CI integration blocks correctly

Rollout Testing

Run in advisory mode for 2 weeks
Review false positives/negatives
Adjust checks before enforcement

7. Common Pitfalls & Debugging

Problem	Symptom	Root Cause	Fix
Too strict	No services pass	Checks don’t match reality	Start with current state as baseline
Too lenient	Everything passes	Checks too simple	Add meaningful checks
Slow checks	CI times out	External API calls slow	Add caching, parallel execution
Manual bottleneck	Tickets pile up	Too many manual checks	Automate more, reduce manual
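
For the "Slow checks" row, parallel execution under a shared deadline can be sketched with the standard library's thread pool. This is an illustration under assumptions: checks are modeled as zero-argument callables returning a boolean, and `run_checks_parallel` is a hypothetical name. Python threads cannot be forcibly stopped, so a timed-out check keeps running in the background until it returns.

```python
import concurrent.futures as cf
import time
from typing import Callable, Dict


def run_checks_parallel(checks: Dict[str, Callable[[], bool]],
                        timeout_s: float = 30.0,
                        max_workers: int = 8) -> Dict[str, str]:
    """Run checks concurrently; the whole batch shares one deadline of `timeout_s`."""
    results: Dict[str, str] = {}
    deadline = time.monotonic() + timeout_s
    pool = cf.ThreadPoolExecutor(max_workers=max_workers)
    try:
        futures = {check_id: pool.submit(fn) for check_id, fn in checks.items()}
        for check_id, future in futures.items():
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[check_id] = "pass" if future.result(timeout=remaining) else "fail"
            except cf.TimeoutError:
                results[check_id] = "timeout"
            except Exception as exc:  # a crashing check must not abort the batch
                results[check_id] = f"error: {exc}"
    finally:
        # Do not wait for stragglers; drop anything not yet started (Python 3.9+)
        pool.shutdown(wait=False, cancel_futures=True)
    return results
```

Reporting "timeout" and "error" separately from "fail" helps distinguish a service that is not ready from a checker that could not reach an external API.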

8. Extensions & Challenges

Extension 1: Policy-as-Code

Use Open Policy Agent (OPA) for complex rules.

Extension 2: Trend Tracking

Track maturity levels over time. Show improvement graphs.

Extension 3: Auto-Remediation

For some checks, automatically create PRs to fix issues.

Extension 4: Integration with Backstage

Publish maturity badges to service catalog.

9. Real-World Connections

How Big Tech Does This:

Google: Production Readiness Review (PRR)
Netflix: Chaos Engineering integrated with readiness
Spotify: Squad Health Check for non-technical readiness

Tools:

OpsLevel - Service maturity platform
Backstage - Can display maturity badges
Open Policy Agent - Policy-as-code

10. Resources

SRE Resources

Policy-as-Code

Open Policy Agent
Conftest - Test structured data against OPA

Related Projects

P03: Ownership Mapper - Define owners
P07: SLE Agreement - Define SLOs

11. Self-Assessment Checklist

Before considering this project complete, verify:

Checklist has 4 maturity levels with clear criteria
At least 3 check types are implemented
CLI tool returns clear pass/fail with fix suggestions
CI integration blocks deploys (or at least warns)
Dashboard shows all services with their levels
At least one team has used this to improve a service
Legacy services have a documented path to compliance

12. Submission / Completion Criteria

This project is complete when you have:

orr-checklist.yaml with 4 levels and 15+ checks
CLI tool that runs checks and reports results
CI integration (GitHub Action or GitLab CI)
Dashboard showing service maturity overview
Documentation for adding new checks
Rollout plan for organization adoption
