Project 18: Production Prompt Platform Capstone

Integrated platform demo with release gates, observability, and incident drills.

Quick Reference

Attribute                          Value
Difficulty                         Level 4: Expert
Time Estimate                      5-10 days (capstone: 3-5 weeks)
Main Programming Language          TypeScript
Alternative Programming Languages  Python, Go
Coolness Level                     Level 5: Career Defining
Business Potential                 5. Startup-Ready Product
Knowledge Area                     End-to-End AI Platform
Software or Tool                   PromptOps control plane
Main Book                          Site Reliability Engineering (Google)
Concept Clusters                   All concept clusters

1. Learning Objectives

By completing this project, you will:

  1. Design a reliable artifact: A complete PromptOps control plane integrating registry, router, evaluator, rollout controller, and incident responder into a unified platform.
  2. Architect a control plane that cleanly separates policy, routing, and versioning decisions from model inference (the data plane).
  3. Build cross-functional quality gates that combine contract compliance, security posture, performance, and cost signals into a unified release verdict.
  4. Implement operational resilience through chaos engineering drills, automated incident playbooks, and runbook-driven auto-remediation.
  5. Produce a system where end-to-end request lifecycle is traceable from prompt registry through model inference to post-inference evaluation.
  6. Demonstrate production readiness through game day exercises that inject failures and measure mean time to recovery.

2. All Theory Needed (Per-Concept Breakdown)

Control-Plane Architecture for PromptOps

Fundamentals Control-plane architecture for PromptOps separates the decisions about what to do (which prompt version to serve, which model to route to, which policy to enforce) from the execution of model inference itself. This mirrors the control-plane / data-plane split used in networking (SDN controllers vs switches), Kubernetes (API server vs kubelet), and service meshes (Istio control plane vs Envoy sidecar). In a PromptOps context, the control plane manages prompt versions, routing rules, policy configurations, rollout state, and incident response workflows. The data plane handles the actual LLM API calls, token processing, and response delivery. This separation matters because control-plane operations are low-throughput, high-importance decisions (promote a prompt version, trigger a rollback, update a routing rule) while data-plane operations are high-throughput, latency-sensitive workloads (serving thousands of prompt completions per second). Mixing them creates systems where an operational decision (like updating a routing rule) can destabilize the inference path, or where inference load prevents operators from executing rollbacks during incidents.

Deep Dive into the concept The control-plane / data-plane split in PromptOps has four distinct subsystems that coordinate through well-defined interfaces.

The Configuration Subsystem is the source of truth for all runtime behavior. It stores which prompt versions are active, which models are available, what routing rules apply, and what policy gates are enforced. Changes to configuration flow through an approval workflow (from Project 15’s registry) before taking effect. The configuration subsystem exposes a declarative API: operators describe the desired state (“prompt refund_assistant should serve version 2.3.1 with 80% traffic on claude-3.5-sonnet and 20% on gpt-4o”), and the control plane reconciles the actual state to match. This declarative model, borrowed from Kubernetes, means the control plane continuously monitors drift between desired and actual state and takes corrective action automatically.
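The reconciliation idea can be sketched in a few lines of TypeScript. Everything here is illustrative, not a real API: the `DesiredState` shape and the action strings are assumptions made for the sketch.

```typescript
// Hypothetical desired-state shape, as an operator might declare it.
interface DesiredState {
  promptName: string;
  version: string;
  routing: Record<string, number>; // model name -> traffic weight
}

// One reconciliation pass: compare desired vs. actual state and return
// the corrective actions needed to close the gap (empty when in sync).
function reconcile(desired: DesiredState, actual: DesiredState): string[] {
  const actions: string[] = [];
  if (desired.version !== actual.version) {
    actions.push(`set-version:${desired.promptName}:${desired.version}`);
  }
  for (const [model, weight] of Object.entries(desired.routing)) {
    if (actual.routing[model] !== weight) {
      actions.push(`set-weight:${model}:${weight}`);
    }
  }
  return actions;
}

const desired: DesiredState = {
  promptName: "refund_assistant",
  version: "2.3.1",
  routing: { "claude-3.5-sonnet": 0.8, "gpt-4o": 0.2 },
};
const actual: DesiredState = {
  promptName: "refund_assistant",
  version: "2.3.0",
  routing: { "claude-3.5-sonnet": 1.0 },
};

// Drift on the version and on both routing weights yields corrective actions.
console.log(reconcile(desired, actual));
```

Running this pass in a loop, against live state, is what turns a static config store into a self-correcting control plane.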

The Decision Subsystem processes incoming requests and determines how to handle them. For each prompt completion request, it resolves: which prompt version to use (from the registry), which model to route to (from the router), which policies to enforce (from the firewall), and whether the request falls within rate limits and cost budgets. These decisions are made before any LLM API call happens, keeping the decision logic fast and deterministic. The decision subsystem emits a decision record for every request, containing the resolved prompt version, target model, applied policies, and decision latency. This record is the starting point for end-to-end tracing.

The Observation Subsystem collects signals from both the control plane and the data plane. From the control plane, it captures configuration changes, decision records, and workflow state transitions. From the data plane, it captures inference latency, token usage, error rates, output quality scores (from post-inference evaluation), and cost metrics. The observation subsystem aggregates these signals into dashboards, alerts, and trend reports. It is the foundation for incident detection (when do metrics deviate from baselines?) and capacity planning (how is usage growing?).

The Remediation Subsystem acts on signals from the observation subsystem. When an alert fires (e.g., error rate exceeds threshold), the remediation subsystem executes a predefined playbook: roll back to the previous prompt version, shift traffic to a different model, increase rate limits on a backup path, or page an on-call operator. Playbooks are versioned and tested through game day drills, ensuring that the remediation path works before a real incident occurs. The remediation subsystem also handles scheduled operations like migration window enforcement, deprecation deadlines, and cost budget adjustments.
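The playbook behavior described above (try automated steps in order, notify even on success, escalate when everything fails) can be sketched as follows; all names are hypothetical:

```typescript
// Sketch of a playbook runner: try automated remediation steps in order,
// page operators even on success, and escalate when every step fails.
interface StepResult {
  ok: boolean;
  detail: string;
}

function runPlaybook(
  steps: Array<() => StepResult>,
  page: (msg: string) => void,
): StepResult {
  for (const step of steps) {
    const result = step();
    if (result.ok) {
      // Successful auto-remediation still notifies operators.
      page(`auto-remediation succeeded: ${result.detail}`);
      return result;
    }
  }
  page("all automated steps failed; escalating to on-call operator");
  return { ok: false, detail: "escalated" };
}

const pages: string[] = [];
const outcome = runPlaybook(
  [
    // Rollback fails because the previous version was deprecated...
    () => ({ ok: false, detail: "rollback blocked: previous version deprecated" }),
    // ...so the next step shifts traffic to the fallback model.
    () => ({ ok: true, detail: "traffic shifted to fallback model" }),
  ],
  (msg) => pages.push(msg),
);
```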

The interfaces between these subsystems are critical. The configuration subsystem publishes state changes to a message bus (or polling endpoint) that the decision subsystem watches. The decision subsystem writes decision records to a log stream that the observation subsystem consumes. The observation subsystem emits alerts to the remediation subsystem through a well-defined alert schema. Each interface has a defined contract, versioned independently, so that subsystems can evolve without breaking each other.

Failure isolation is a design principle. If the observation subsystem goes down, the decision subsystem continues serving requests using cached configuration. If the remediation subsystem fails, alerts still fire and operators receive pages. If the configuration subsystem is temporarily unavailable, the decision subsystem operates on the last-known-good configuration. This graceful degradation prevents cascading failures, which is essential for a system that manages production prompt traffic.
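A minimal sketch of the last-known-good fallback, with a hypothetical fetcher and staleness hook (neither is a real API):

```typescript
// Sketch of graceful degradation in the decision subsystem: keep serving
// from a cached last-known-good configuration when the config store is
// unreachable.
interface ActiveConfig {
  version: string;
}

function makeConfigResolver(
  fetchConfig: () => ActiveConfig,
  onStale: () => void,
): () => ActiveConfig {
  let lastKnownGood: ActiveConfig | null = null;
  return () => {
    try {
      lastKnownGood = fetchConfig();
    } catch {
      if (lastKnownGood === null) {
        throw new Error("no configuration available (cold start during outage)");
      }
      onStale(); // emit a staleness warning metric, but keep serving
    }
    return lastKnownGood as ActiveConfig;
  };
}

let staleCount = 0;
let storeDown = false;
const resolveConfig = makeConfigResolver(() => {
  if (storeDown) throw new Error("config store unavailable");
  return { version: "2.3.1" };
}, () => { staleCount++; });

resolveConfig();              // healthy fetch populates the cache
storeDown = true;
const cfg = resolveConfig();  // outage: falls back to the cached config
```

Note the one case that still fails: a cold start during an outage, when there is no cached configuration to fall back to.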

How this fits into the project Control-plane architecture is the central organizing principle of Project 18. Every other concept (quality gates, observability, incident drills) is implemented as a subsystem within this architecture. The project’s primary deliverable is the working control plane with all four subsystems integrated.

Definitions & key terms

  • Control plane: The subsystem that makes decisions about system behavior (what to serve, how to route, when to roll back) without directly handling inference traffic.
  • Data plane: The subsystem that executes model inference, handling the actual LLM API calls and response processing.
  • Declarative configuration: Describing the desired state (“serve prompt v2.3.1”) rather than imperative commands (“stop v2.3.0, then start v2.3.1”), allowing the control plane to handle reconciliation.
  • Decision record: A structured log entry capturing what the decision subsystem resolved for a specific request (prompt version, model, policies, latency).
  • Reconciliation loop: The continuous process of comparing desired state to actual state and taking corrective action to close the gap.
  • Graceful degradation: The ability of each subsystem to continue operating (with reduced capability) when other subsystems are unavailable.

Mental model diagram (ASCII)

+===========================================================================+
||                     CONTROL PLANE                                       ||
||                                                                         ||
||  +---------------------+        +---------------------+                 ||
||  | Configuration       |------->| Decision            |                 ||
||  | Subsystem           |        | Subsystem           |                 ||
||  |                     |        |                     |                 ||
||  | - Prompt versions   |        | - Version resolution|                 ||
||  | - Routing rules     |        | - Model routing     |                 ||
||  | - Policy configs    |        | - Policy enforcement|                 ||
||  | - Desired state     |        | - Rate limiting     |                 ||
||  +---------------------+        +---------------------+                 ||
||          ^                              |                               ||
||          |                              | decision records              ||
||          |                              v                               ||
||  +---------------------+        +---------------------+                 ||
||  | Remediation         |<-------| Observation         |                 ||
||  | Subsystem           |        | Subsystem           |                 ||
||  |                     |        |                     |                 ||
||  | - Rollback playbooks|        | - Metric aggregation|                 ||
||  | - Traffic shifting  |        | - Alert evaluation  |                 ||
||  | - Auto-remediation  |        | - Trend analysis    |                 ||
||  | - Operator paging   |        | - Dashboard data    |                 ||
||  +---------------------+        +---------------------+                 ||
||                                         ^                               ||
+===========================================================================+
                                           |
                                    inference signals
                                    (latency, tokens,
                                     errors, quality)
                                           |
+===========================================================================+
||                      DATA PLANE                                         ||
||                                                                         ||
||  +---------------------+        +---------------------+                 ||
||  | Request Handler     |------->| LLM Gateway         |                 ||
||  | (apply decision:    |        | (API calls to       |                 ||
||  |  version, model,    |        |  Claude, GPT, etc.) |                 ||
||  |  policy)            |        +---------------------+                 ||
||  +---------------------+                |                               ||
||                                         v                               ||
||                                 +---------------------+                 ||
||                                 | Post-Inference Eval |                 ||
||                                 | (quality scoring,   |                 ||
||                                 |  contract checks)   |                 ||
||                                 +---------------------+                 ||
||                                                                         ||
+===========================================================================+

How it works (step-by-step, with invariants and failure modes)

  1. An operator declares the desired state via the configuration API: “prompt refund_assistant v2.3.1 should serve 100% of production traffic via claude-3.5-sonnet with policy profile strict.” The configuration subsystem validates this against the registry (does v2.3.1 exist and have PROMOTED status?) and persists it. Invariant: only PROMOTED prompt versions can be referenced in desired state. Failure mode: referencing a DRAFT or DEPRECATED version returns a validation error.
  2. The decision subsystem watches for configuration changes. When it detects a new desired state, it updates its internal routing table. For each incoming request, it resolves the prompt version, selects the model, and evaluates policy gates. Invariant: the decision subsystem always uses the latest acknowledged configuration. Failure mode: if configuration cannot be fetched, the decision subsystem uses the last-known-good configuration and emits a staleness warning metric.
  3. The decision subsystem emits a decision record for every request, containing: trace_id, resolved prompt version, content hash, model target, applied policies, decision latency, and timestamp. These records flow to the observation subsystem. Invariant: every request produces exactly one decision record. Failure mode: if the decision record write fails, the request still proceeds (observation is best-effort, not blocking).
  4. The data plane executes the LLM API call using the parameters from the decision record, then runs post-inference evaluation. The observation subsystem receives inference signals: latency, token count, error code (if any), quality score, and cost. Invariant: inference signals are correlated with decision records via trace_id. Failure mode: missing trace_id on an inference signal creates an orphaned record that cannot be correlated.
  5. The observation subsystem evaluates alert rules against aggregated metrics. If the 5-minute error rate exceeds 5%, it fires an alert to the remediation subsystem. Invariant: alert rules are evaluated on sliding windows, not point-in-time values, to reduce false positives. Failure mode: a metric pipeline delay causes alert evaluation to use stale data, potentially missing a spike.
  6. The remediation subsystem receives the alert and executes the associated playbook: “if error_rate > 5% and current_version != previous_version, roll back to previous_version.” The rollback updates the configuration subsystem’s desired state and logs the action in the audit trail. Invariant: automated rollbacks trigger operator notification even if they succeed. Failure mode: if the rollback itself fails (e.g., previous version was deprecated), the playbook escalates to human operator.
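The best-effort observability write from step 3 can be sketched as below; the record shape mirrors the decision record in the text, while the sink is hypothetical:

```typescript
// Sketch of best-effort decision-record emission: an observability write
// failure must never block the inference path.
interface DecisionRecord {
  traceId: string;
  promptVersion: string;
  modelTarget: string;
  decisionLatencyMs: number;
}

function emitBestEffort(
  record: DecisionRecord,
  write: (r: DecisionRecord) => void,
): boolean {
  try {
    write(record);
    return true;
  } catch {
    // Swallow the failure; surface it as a dropped-record metric instead.
    return false;
  }
}

const record: DecisionRecord = {
  traceId: "trc_p18_1001",
  promptVersion: "2.3.1",
  modelTarget: "claude-3.5-sonnet",
  decisionLatencyMs: 3,
};

const ok = emitBestEffort(record, () => {});   // healthy sink: record persisted
const dropped = emitBestEffort(record, () => {
  throw new Error("log stream down");          // broken sink: request proceeds
});
```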

Minimal concrete example

Desired State Declaration:
  PUT /v1/config/desired-state
  {
    "prompt_name": "refund_policy_assistant",
    "version": "2.3.1",
    "model_routing": {
      "primary": "claude-3.5-sonnet",
      "fallback": "gpt-4o",
      "primary_weight": 0.9
    },
    "policy_profile": "strict",
    "rate_limit": { "rpm": 1000, "tpm": 100000 }
  }

Decision Record (per request):
  {
    "trace_id": "trc_p18_1001",
    "prompt_version": "2.3.1",
    "content_hash": "sha256:a1b2c3...",
    "model_target": "claude-3.5-sonnet",
    "policies_applied": ["no-pii", "max-tokens-4096"],
    "decision_latency_ms": 3,
    "timestamp": "2026-02-01T10:00:01.234Z"
  }

Alert Rule:
  {
    "rule_id": "high_error_rate",
    "condition": "avg(error_rate, 5m) > 0.05",
    "playbook": "auto_rollback_to_previous",
    "severity": "critical",
    "notification_channels": ["pagerduty", "slack-ops"]
  }
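The alert rule above is evaluated on a sliding window rather than a point-in-time value (step 5's invariant). A sketch of that evaluation, with illustrative samples and timestamps:

```typescript
// Sketch of sliding-window evaluation for avg(error_rate, 5m) > 0.05.
interface Sample {
  t: number;         // timestamp in ms
  errorRate: number;
}

function evaluateRule(
  samples: Sample[],
  now: number,
  windowMs: number,
  threshold: number,
): boolean {
  const recent = samples.filter((s) => now - s.t <= windowMs);
  if (recent.length === 0) return false; // no data: do not fire on nothing
  const avg = recent.reduce((sum, s) => sum + s.errorRate, 0) / recent.length;
  return avg > threshold;
}

const now = 600_000;
const samples: Sample[] = [
  { t: 100_000, errorRate: 0.01 }, // outside the 5-minute window: ignored
  { t: 350_000, errorRate: 0.04 },
  { t: 500_000, errorRate: 0.09 }, // spike inside the window
];

// Average over the two in-window samples is 0.065, above the 0.05 threshold.
const fires = evaluateRule(samples, now, 5 * 60_000, 0.05);
```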

Common misconceptions

  • “The control plane needs to be in the request path for every completion.” The control plane sets the rules; the data plane follows them. Decision resolution should be cached at the data plane with periodic refresh, not fetched from the control plane on every request. This keeps inference latency low.
  • “Declarative configuration is harder than imperative commands.” Declarative configuration is harder to implement but easier to operate. An imperative approach (“stop version A, then start version B”) fails if step 1 succeeds and step 2 fails, leaving the system in an undefined state. A declarative approach (“desired state is version B”) lets the reconciliation loop handle partial failures.
  • “If the observation subsystem is down, we should stop serving traffic.” Observation is important but not critical-path. The data plane should continue serving requests using the last-known-good configuration. Missing observability is a degraded state, not a failure state. Alert the operators that observability is down so they know they are flying blind.
  • “Control plane and data plane can share a database.” Sharing a database creates coupling: a control-plane schema migration can lock tables that the data plane reads, causing inference latency spikes. Separate data stores with well-defined replication patterns provide better isolation.
  • “Automated remediation replaces on-call operators.” Automated remediation handles known failure modes with tested playbooks. Novel failures, cascading incidents, and judgment calls still require human operators. Automation buys time and reduces MTTR for routine incidents; it does not eliminate the need for humans.

Check-your-understanding questions

  1. Why does the control plane use declarative configuration rather than imperative commands?
  2. What happens when the configuration subsystem is temporarily unavailable?
  3. Why should decision records be written on a best-effort basis rather than blocking the inference path?
  4. How does the reconciliation loop detect drift between desired and actual state?
  5. What is the risk of sharing a database between the control plane and data plane?

Check-your-understanding answers

  1. Declarative configuration describes the desired end state, allowing the reconciliation loop to handle partial failures and ordering issues automatically. Imperative commands require the operator to specify the exact sequence of steps, and any failure in the sequence leaves the system in an inconsistent state that requires manual recovery.
  2. The decision subsystem continues serving requests using the last-known-good configuration that it has cached locally. It emits a staleness metric that the observation subsystem tracks. When the configuration subsystem recovers, the decision subsystem re-syncs to the latest desired state. This is the graceful degradation principle.
  3. Decision records are observability data. If writing them blocks the inference path, a slow observation pipeline causes latency spikes for user-facing requests. Best-effort writes mean occasional missing records during pipeline issues, which is an acceptable tradeoff for protecting inference latency.
  4. The reconciliation loop periodically compares the desired state (from the configuration store) with the actual state (from the decision subsystem’s routing table and the data plane’s health endpoints). Drift is detected when the actual state differs from the desired state (e.g., the data plane is still serving v2.3.0 when the desired state specifies v2.3.1). The loop then takes corrective action.
  5. A shared database creates tight coupling. Control-plane operations (schema migrations, bulk audit queries) can cause lock contention that degrades data-plane read latency. An outage in the shared database takes down both planes simultaneously. Separate stores provide failure isolation: a control-plane database issue does not affect inference serving.

Real-world applications

  • Kubernetes uses a control-plane / data-plane split where the API server, scheduler, and controller manager (control plane) make decisions about pod placement and scaling, while kubelets and kube-proxy (data plane) execute those decisions on each node.
  • Istio service mesh has a control plane (Pilot, Mixer, Citadel) that manages routing rules and security policies, while Envoy sidecars (data plane) enforce them at each service. PromptOps follows the same pattern for prompt routing and policy enforcement.
  • Netflix’s Zuul gateway separates routing decision logic (which backend, which canary percentage) from request forwarding, allowing routing changes without affecting request processing latency.
  • PromptLayer, Langfuse, and Helicone implement observability layers that capture prompt execution data separately from the inference path, following the observation subsystem pattern.
  • Google’s Borg and Omega cluster managers use declarative desired-state configuration with reconciliation loops, the same pattern applied to prompt version management in this project.

Where you’ll apply it

  • Phase 1: implement the configuration subsystem and decision subsystem with a basic routing table.
  • Phase 2: integrate the observation subsystem and build the dashboard.
  • Phase 3: add the remediation subsystem with playbook execution and game day drills.

References

  • “Site Reliability Engineering” by Google - Chapters on managing systems at scale and change management
  • “Designing Data-Intensive Applications” by Martin Kleppmann - Chapters on distributed systems coordination
  • Kubernetes architecture documentation (control plane components)
  • Istio service mesh architecture documentation (control plane / data plane split)

Key insights The control-plane / data-plane split is not an architectural luxury; it is the foundation that enables every other operational capability (rollbacks, canary routing, policy enforcement, incident response) without destabilizing inference traffic.

Summary Control-plane architecture for PromptOps separates decision-making (configuration, routing, policy, remediation) from execution (inference, evaluation, response delivery). The control plane has four subsystems: configuration (desired state), decision (per-request resolution), observation (metrics and alerts), and remediation (playbooks and auto-recovery). Each subsystem communicates through well-defined interfaces and degrades gracefully when other subsystems are unavailable. This architecture enables all the operational capabilities required for production prompt management.

Homework/Exercises to practice the concept

  • Draw a detailed architecture diagram of a PromptOps control plane with all four subsystems, labeling the interfaces between them and the data that flows across each interface.
  • Design the failure modes for each subsystem (configuration down, decision degraded, observation delayed, remediation failed) and describe how the system behaves in each case.
  • Compare the PromptOps control-plane architecture to Kubernetes architecture: map each Kubernetes component (API server, etcd, scheduler, controller manager, kubelet) to its PromptOps equivalent and explain the analogy.

Solutions to the homework/exercises

  • The architecture diagram should show: Configuration Subsystem publishing desired state to a message bus, Decision Subsystem subscribing to state changes and writing decision records to a log stream, Observation Subsystem consuming decision records and data-plane signals to compute metrics and evaluate alert rules, and Remediation Subsystem receiving alerts and updating configuration. Each interface should be labeled with the data schema (desired state schema, decision record schema, alert schema, remediation action schema).
  • Failure modes: Configuration down -> Decision uses cached state, emits staleness metric, operators notified. Decision degraded -> Requests fail or fall back to a default prompt version, error rate spike triggers remediation. Observation delayed -> Alerts fire late, operators lose visibility but traffic continues. Remediation failed -> Alert still fires, escalates to human operator via PagerDuty. In all cases, the data plane continues serving with last-known-good state.
  • Kubernetes mapping: API server -> Configuration Subsystem (desired state storage and validation). etcd -> Configuration store (persistent state). Scheduler -> Decision Subsystem (placement decisions). Controller Manager -> Remediation Subsystem (reconciliation loops). Kubelet -> Data Plane (execution engine). The analogy holds because both systems use declarative configuration with reconciliation to manage complex distributed behavior.

Cross-Functional Quality Gates

Fundamentals Cross-functional quality gates combine signals from multiple domains (contract compliance, security posture, performance metrics, cost budget, evaluation scores) into a unified release verdict that determines whether a prompt version can be promoted to the next deployment stage. In traditional CI/CD, quality gates are often single-domain: “do the unit tests pass?” For prompt-driven systems, a single domain is insufficient because a prompt can pass contract validation but fail security checks, or pass all automated tests but blow through cost budgets. Cross-functional gates ensure that every dimension of quality is evaluated before a prompt reaches production. They are the mechanism that transforms the approval workflow (from Project 15) into a data-driven decision process where the release verdict is backed by evidence from every relevant domain.

Deep Dive into the concept A cross-functional quality gate aggregates verdicts from multiple evaluators into a single release decision. Each evaluator runs independently and produces a typed verdict: PASS, WARN, or FAIL, along with a detailed report. The gate engine collects all verdicts and applies a composition rule to produce the final decision. The simplest composition rule is “all must pass”: if any evaluator returns FAIL, the gate blocks promotion. More sophisticated rules allow WARNs to pass with additional review requirements, or allow specific evaluators to be advisory (their verdict is recorded but does not block).

The evaluators for a PromptOps quality gate typically include:

Contract Compliance Evaluator: Runs the prompt against the evaluation suite from the prompt contract (Project 1). Checks that outputs match the declared output schema, required fields are present, and behavioral invariants hold. The verdict includes the pass rate, the list of failing test cases, and comparison against the previous version’s pass rate.

Security Posture Evaluator: Runs the prompt injection red-team suite (Project 3) against the new prompt version. Checks for injection vulnerabilities, PII leakage, unauthorized tool calls, and policy violations. This evaluator uses adversarial test cases generated by the eval forge (Project 14). The verdict includes the attack success rate, the list of successful attack vectors, and a comparison against the security baseline.

Performance Evaluator: Measures inference latency, token consumption, and throughput for the new prompt version under simulated load. Compares against the SLA requirements (p50, p95, p99 latency targets) and the previous version’s performance profile. The verdict includes latency percentiles, token usage per request, and any regressions.

Cost Evaluator: Estimates the cost impact of the new prompt version based on token usage, model pricing, and projected traffic volume. Compares against the cost budget and the previous version’s cost profile. The verdict includes estimated daily/monthly cost, cost per request, and budget utilization percentage.

Evaluation Score Aggregator: Collects quality scores from the evaluation harness (human eval scores, LLM-as-judge scores, retrieval quality metrics) and compares against quality thresholds. The verdict includes the aggregate quality score, per-dimension scores, and any regressions from the previous version.

The gate engine’s composition logic is configurable per deployment stage. For promotion to staging, the rules might be relaxed (WARNs allowed, cost evaluation advisory). For promotion to production, the rules are strict (all evaluators must PASS, no WARNs except in pre-approved exemption categories). For emergency rollbacks, all gates are bypassed (but the bypass is recorded in the audit trail for post-incident review).
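The stage-dependent composition logic might look like the following sketch. The verdict names mirror the text; the composition function itself is illustrative:

```typescript
// Sketch of stage-dependent verdict composition.
type Verdict = "PASS" | "WARN" | "FAIL";
type GateDecision = "PASS" | "PASS_WITH_REVIEW" | "BLOCKED";

function composeVerdict(
  verdicts: Verdict[],
  stage: "staging" | "production",
): GateDecision {
  if (verdicts.includes("FAIL")) return "BLOCKED"; // any FAIL always blocks
  if (verdicts.includes("WARN")) {
    // Staging tolerates warnings; production requires reviewer sign-off.
    return stage === "staging" ? "PASS" : "PASS_WITH_REVIEW";
  }
  return "PASS";
}

// A cost warning passes staging outright but needs review for production.
const verdicts: Verdict[] = ["PASS", "PASS", "WARN", "PASS"];
console.log(composeVerdict(verdicts, "production"));
```

Keeping this function pure (verdicts in, decision out) makes each stage's composition rule trivially testable and versionable as configuration.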

Gate results become part of the prompt version’s metadata in the registry. When a reviewer evaluates a promotion request, they see the complete gate report: which evaluators passed, which warned, which failed, and the detailed evidence for each verdict. This turns the review process from a subjective assessment into a data-driven decision.

Temporal gates add a time dimension. Some quality properties can only be measured after the prompt has been serving traffic for a period (e.g., “no quality regression over 24 hours of canary traffic”). Temporal gates work in conjunction with the rollout controller (Project 11): the prompt is deployed to canary, metrics are collected over the observation period, and the temporal gate evaluates the accumulated data before allowing full promotion.
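A temporal gate can be sketched as a pure function of the observation window: PENDING until the window closes, INCONCLUSIVE on thin data, then a final verdict. The thresholds and parameter names below are illustrative assumptions:

```typescript
// Sketch of a temporal gate's state as a function of the observation window.
type TemporalVerdict = "PENDING" | "INCONCLUSIVE" | "PASS" | "FAIL";

function temporalGateVerdict(
  windowHours: number,
  elapsedHours: number,
  sampleCount: number,
  minSamples: number,
  regressionDetected: boolean,
): TemporalVerdict {
  if (elapsedHours < windowHours) return "PENDING";    // still observing
  if (sampleCount < minSamples) return "INCONCLUSIVE"; // extend the window
  return regressionDetected ? "FAIL" : "PASS";
}
```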

How this fits into the project Cross-functional quality gates are the decision layer between the registry (Project 15) and the rollout controller (Project 11) in the capstone platform. They determine whether a registered, approved prompt version is safe to promote through the deployment stages.

Definitions & key terms

  • Gate evaluator: An independent component that assesses one dimension of quality (security, performance, cost, etc.) and produces a typed verdict.
  • Composition rule: The logic that combines individual evaluator verdicts into a single release decision (e.g., “all must pass”, “majority pass with no security failures”).
  • Temporal gate: A quality gate that requires observation over a time window before producing its verdict.
  • Gate report: The structured output of all evaluator verdicts, evidence, and the final release decision, attached to the prompt version’s metadata.
  • Exemption: A documented exception allowing a prompt to bypass a specific evaluator, with a justification and an expiration date.

Mental model diagram (ASCII)

Prompt Version v2.3.1 (APPROVED in Registry)
                |
                v
+===============================================================+
|              CROSS-FUNCTIONAL QUALITY GATE                     |
|                                                                |
|  +-----------------+  +-----------------+  +-----------------+ |
|  | Contract        |  | Security        |  | Performance     | |
|  | Compliance      |  | Posture         |  | Evaluator       | |
|  |                 |  |                 |  |                 | |
|  | eval pass rate: |  | injection tests:|  | p95 latency:    | |
|  |   98.5% (PASS)  |  |   0/50 (PASS)   |  |   180ms (PASS)  | |
|  | schema valid:   |  | PII leakage:    |  | tokens/req:     | |
|  |   100% (PASS)   |  |   0 (PASS)      |  |   850 (PASS)    | |
|  +-----------------+  +-----------------+  +-----------------+ |
|                                                                |
|  +-----------------+  +-----------------+                      |
|  | Cost            |  | Quality Score   |                      |
|  | Evaluator       |  | Aggregator      |                      |
|  |                 |  |                 |                      |
|  | est. daily:     |  | overall: 4.2/5  |                      |
|  |   $142 (WARN)   |  |   (PASS)        |                      |
|  | budget util:    |  | regression:     |                      |
|  |   87% (WARN)    |  |   none (PASS)   |                      |
|  +-----------------+  +-----------------+                      |
|                                                                |
|  Composition Rule: all_pass_or_warn_with_review                |
|  Final Verdict: PASS_WITH_REVIEW (cost warnings)               |
+===============================================================+
                |
                v
        Promotion to Canary (with reviewer sign-off on cost)

How it works (step-by-step, with invariants and failure modes)

  1. A promotion request triggers the quality gate for the target deployment stage. The gate engine loads the composition rule for that stage and spawns all evaluators in parallel. Invariant: all evaluators run against the exact same prompt version (identified by content hash). Failure mode: if an evaluator crashes, its verdict defaults to FAIL (fail-closed) and the error is included in the gate report.
  2. Each evaluator runs its test suite or analysis against the prompt version and produces a verdict with structured evidence. Invariant: evaluator verdicts include the evaluator version, the test suite version, and the timestamp, ensuring reproducibility. Failure mode: non-deterministic evaluators (e.g., LLM-as-judge) run multiple iterations and report confidence intervals; if confidence is too low, the verdict is INCONCLUSIVE (treated as WARN).
  3. The gate engine collects all verdicts and applies the composition rule. For a “strict” stage, any FAIL blocks promotion. For a “standard” stage, WARNs generate review tasks but do not block. Invariant: the composition rule is defined in configuration and versioned, not hardcoded. Failure mode: if the composition rule references an evaluator that did not produce a verdict (e.g., new evaluator added but not yet integrated), the gate engine reports an incomplete gate with the missing evaluator identified.
  4. The gate report is attached to the prompt version’s metadata in the registry and included in the promotion request. Reviewers see every evaluator’s verdict and evidence. Invariant: the gate report is immutable once generated; re-running the gate creates a new report rather than overwriting. Failure mode: a reviewer approves promotion based on an outdated gate report; the system checks that the gate report timestamp is within a configured freshness window.
  5. For temporal gates, the gate remains in PENDING state during the observation window. The observation subsystem feeds metrics to the temporal evaluator, which updates its assessment as data accumulates. At the end of the window, the temporal evaluator produces its final verdict. Invariant: the observation window cannot be shortened without platform lead approval. Failure mode: if insufficient data is collected during the window (e.g., traffic was too low), the temporal evaluator returns INCONCLUSIVE and the window is extended.
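The parallel, fail-closed execution and composition described in the steps above can be sketched in TypeScript. The `Evaluator`/`EvaluatorResult` shapes and the `runGate` function are hypothetical illustrations of the design, not a real API:

```typescript
// Sketch of a fail-closed gate engine with "all_required_pass" composition.
// All type and function names here are illustrative, not a defined API.
type Verdict = "PASS" | "WARN" | "FAIL" | "INCONCLUSIVE";
type GateVerdict = "PASS" | "PASS_WITH_REVIEW" | "FAIL";

interface EvaluatorResult {
  evaluatorId: string;
  verdict: Verdict;
  evidence: Record<string, unknown>;
  timestamp: string;
}

interface Evaluator {
  id: string;
  required: boolean;
  run(contentHash: string): Promise<EvaluatorResult>;
}

async function runGate(
  evaluators: Evaluator[],
  contentHash: string
): Promise<{ results: EvaluatorResult[]; finalVerdict: GateVerdict }> {
  // Step 1: spawn all evaluators in parallel against the same content hash.
  const settled = await Promise.allSettled(
    evaluators.map((e) => e.run(contentHash))
  );
  const results = settled.map((outcome, i) =>
    outcome.status === "fulfilled"
      ? outcome.value
      : {
          // Fail-closed: a crashed evaluator defaults to FAIL, with the
          // error recorded as evidence in the gate report.
          evaluatorId: evaluators[i].id,
          verdict: "FAIL" as Verdict,
          evidence: { error: String(outcome.reason) },
          timestamp: new Date().toISOString(),
        }
  );
  // Step 2: INCONCLUSIVE verdicts are treated as WARN.
  const effective = results.map((r) =>
    r.verdict === "INCONCLUSIVE" ? "WARN" : r.verdict
  );
  // Step 3: any FAIL on a required evaluator blocks promotion; WARNs
  // (and FAILs on advisory evaluators) demand reviewer sign-off.
  const blocked = effective.some(
    (v, i) => evaluators[i].required && v === "FAIL"
  );
  const needsReview = effective.some(
    (v, i) => v === "WARN" || (v === "FAIL" && !evaluators[i].required)
  );
  const finalVerdict: GateVerdict = blocked
    ? "FAIL"
    : needsReview
    ? "PASS_WITH_REVIEW"
    : "PASS";
  return { results, finalVerdict };
}
```

`Promise.allSettled` (rather than `Promise.all`) is what makes the fail-closed invariant straightforward: one crashing evaluator cannot abort the others, and its rejection reason is captured as evidence.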

Minimal concrete example

Gate configuration for "production" stage:
  {
    "stage": "production",
    "evaluators": [
      { "id": "contract_compliance", "required": true, "fail_threshold": "pass_rate < 95%" },
      { "id": "security_posture", "required": true, "fail_threshold": "any_injection_success" },
      { "id": "performance", "required": true, "fail_threshold": "p95_latency > 500ms" },
      { "id": "cost", "required": false, "warn_threshold": "budget_util > 80%" },
      { "id": "quality_score", "required": true, "fail_threshold": "score < 3.5" }
    ],
    "composition": "all_required_pass",
    "temporal_gate": {
      "enabled": true,
      "window_hours": 24,
      "metric": "canary_error_rate",
      "fail_threshold": "avg > 0.02"
    }
  }

Gate report:
  {
    "gate_id": "gate_20260201_001",
    "prompt_name": "refund_policy_assistant",
    "prompt_version": "2.3.1",
    "content_hash": "sha256:a1b2c3...",
    "stage": "production",
    "evaluators": [
      { "id": "contract_compliance", "verdict": "PASS", "pass_rate": 0.985 },
      { "id": "security_posture", "verdict": "PASS", "attacks_succeeded": 0, "attacks_total": 50 },
      { "id": "performance", "verdict": "PASS", "p95_latency_ms": 180 },
      { "id": "cost", "verdict": "WARN", "daily_estimate_usd": 142, "budget_util": 0.87 },
      { "id": "quality_score", "verdict": "PASS", "score": 4.2 }
    ],
    "temporal_gate": { "status": "PENDING", "window_remaining_hours": 18 },
    "final_verdict": "PENDING_TEMPORAL",
    "generated_at": "2026-02-01T10:00:00Z"
  }

Common misconceptions

  • “If all tests pass, the prompt is ready for production.” Tests validate known scenarios. Quality gates add dimensions that tests do not cover: cost projections, security posture against novel attacks, and performance under production load patterns. Tests are necessary but not sufficient.
  • “Cost evaluation can be done after deployment.” By the time you discover a cost problem in production, you may have already consumed a significant budget. Pre-deployment cost estimation using projected traffic and token usage prevents budget surprises.
  • “Security testing is a one-time gate.” Security posture must be re-evaluated for every prompt version because wording changes can open new injection vectors. A prompt that was secure in v2.3.0 might be vulnerable in v2.3.1 due to a subtle instruction change.
  • “Quality gates create bottlenecks.” Parallel evaluator execution means the gate latency is bounded by the slowest evaluator. Well-designed evaluators complete in minutes. The alternative (discovering quality issues in production) is far more expensive in terms of time and impact.
  • “Temporal gates are unnecessary if pre-deployment tests are thorough.” Pre-deployment tests run against synthetic traffic. Temporal gates validate against real production traffic patterns, which often differ from synthetic benchmarks in distribution, edge cases, and load patterns.

Check-your-understanding questions

  1. Why do quality gate evaluators run in parallel rather than sequentially?
  2. How does the gate engine handle a new evaluator that has not yet been integrated?
  3. What is the purpose of a freshness window on gate reports?
  4. Why should the cost evaluator be advisory (WARN) rather than blocking (FAIL) for staging promotions?
  5. How do temporal gates interact with canary rollouts?

Check-your-understanding answers

  1. Parallel execution minimizes gate latency. If evaluators ran sequentially and each took 2 minutes, a 5-evaluator gate would take 10 minutes. Running in parallel, the gate completes in the time of the slowest evaluator (2 minutes). This reduces the promotion pipeline duration.
  2. The gate engine detects that a configured evaluator did not produce a verdict and marks the gate as INCOMPLETE. It reports which evaluator is missing and blocks promotion until the evaluator is either integrated or explicitly removed from the gate configuration. This prevents accidental bypasses.
  3. Gate reports become stale as the codebase and environment change. A report generated a week ago may not reflect the current security posture or performance characteristics. The freshness window (e.g., 24 hours) ensures that promotion decisions are based on recent evidence. If the report is stale, the gate must be re-run.
  4. In staging, cost estimates are based on projected traffic, which may be inaccurate. Blocking on an inaccurate cost estimate would slow iteration. In production, after temporal gates have provided actual cost data, the cost evaluator can be upgraded to blocking. The advisory approach gives teams visibility without creating false-positive blocks.
  5. The canary rollout controller deploys the new prompt version to a percentage of traffic. During the canary period, the temporal gate collects metrics (error rate, latency, quality scores) on the canary traffic. At the end of the observation window, the temporal gate produces its verdict. If it passes, the rollout controller proceeds to full promotion. If it fails, the controller rolls back the canary automatically.

Real-world applications

  • Google’s change management process requires multiple review types (code review, security review, privacy review, launch review) before production deployment. Cross-functional quality gates automate this multi-reviewer pattern.
  • Datadog’s LLM Observability provides integrated dashboards that track latency, token usage, error rates, and quality scores in a single view, providing the data that quality gate evaluators consume.
  • LaunchDarkly feature flags enable gradual rollouts with quality checks at each traffic percentage, functioning as temporal gates.
  • SOC 2 compliance requires evidence that every production change was evaluated against security criteria, directly mapping to the security posture evaluator in the quality gate.

Where you’ll apply it

  • Phase 2: implement the gate engine, individual evaluators, and composition rules.
  • Phase 3: add temporal gates integrated with the canary rollout system and build the gate report dashboard.

References

  • “Site Reliability Engineering” by Google - Chapters on release engineering and change management
  • “Accelerate” by Forsgren et al. - Chapters on continuous delivery and deployment practices
  • “AI Engineering” by Chip Huyen - Chapters on evaluation and monitoring
  • OWASP LLM Top 10 (security evaluation criteria)

Key insights A quality gate is only as strong as its weakest evaluator; cross-functional design ensures that no dimension of quality (security, cost, performance, compliance) can be silently bypassed because it was not measured.

Summary Cross-functional quality gates aggregate verdicts from multiple independent evaluators (contract, security, performance, cost, quality) into a unified release decision. The gate engine runs evaluators in parallel, applies configurable composition rules, and produces immutable gate reports that become part of the prompt version’s metadata. Temporal gates add time-based validation against real production traffic. Together, these gates transform the promotion process from a subjective review into a data-driven decision backed by evidence from every relevant domain.

Homework/Exercises to practice the concept

  • Design the evaluator interface (input schema, output verdict schema) and implement pseudocode for at least three evaluators: contract compliance, security posture, and cost estimation.
  • Define composition rules for three deployment stages (dev, staging, production) with increasing strictness. Specify which evaluators are required vs advisory at each stage.
  • Create a temporal gate configuration for a 24-hour canary observation window. Define the metrics collected, the thresholds for pass/warn/fail, and the actions taken at each verdict.

Solutions to the homework/exercises

  • Evaluator interface: input is { prompt_name, prompt_version, content_hash, stage, previous_version }. Output is { evaluator_id, evaluator_version, verdict: PASS|WARN|FAIL|INCONCLUSIVE, evidence: { ... }, timestamp }. Contract compliance evidence includes pass_rate, failing_cases, regression_from_previous. Security posture evidence includes attack_success_rate, successful_vectors, baseline_comparison. Cost estimation evidence includes tokens_per_request, estimated_daily_cost_usd, budget_utilization, cost_change_from_previous.
  • Composition rules: Dev stage has all evaluators advisory (no blockers). Staging requires contract_compliance PASS and security_posture PASS; others advisory. Production requires all PASS except cost which is advisory with mandatory reviewer acknowledgment if WARN. Each rule is a configuration object specifying evaluator IDs, required/advisory status, and WARN handling policy.
  • Temporal gate config: metric is canary_error_rate (5-minute rolling average). PASS if avg < 0.01 over 24h. WARN if avg between 0.01 and 0.03. FAIL if avg > 0.03 or any 5-minute window > 0.10. On PASS, proceed to full promotion. On WARN, extend window by 12 hours and notify team. On FAIL, auto-rollback canary and open incident ticket.
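The temporal-gate thresholds in the last solution reduce to a small verdict function. A sketch using those illustrative thresholds:

```typescript
type TemporalVerdict = "PASS" | "WARN" | "FAIL";

// Evaluate the canary temporal gate from the rolling-average error rate over
// the window and the worst single 5-minute window, per the thresholds above:
// PASS if avg < 0.01; WARN if 0.01 <= avg <= 0.03; FAIL if avg > 0.03 or
// any 5-minute window exceeded 0.10.
function temporalVerdict(
  avgErrorRate: number,
  worstWindowRate: number
): TemporalVerdict {
  if (avgErrorRate > 0.03 || worstWindowRate > 0.1) return "FAIL";
  if (avgErrorRate >= 0.01) return "WARN";
  return "PASS";
}
```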

Operational Resilience and Incident Drills

Fundamentals Operational resilience for a PromptOps platform means the system can absorb failures, recover quickly, and improve from each incident. Unlike traditional software where failures are often deterministic (a bug either exists or it does not), prompt-driven systems face stochastic failures: a model may degrade gradually, a prompt change may cause subtle quality drift that only manifests under specific traffic patterns, or an injection attack may succeed intermittently. Incident drills (game days) are structured exercises where the team intentionally injects failures into the production-like environment and practices the detection, diagnosis, and recovery workflow. The goal is not to prevent all failures (impossible in probabilistic systems) but to build the organizational muscle memory to detect them quickly, contain them effectively, and learn from them systematically. A platform that has never practiced failure recovery will fumble during a real incident; a platform that drills regularly will respond with confidence and speed.

Deep Dive into the concept Operational resilience has four layers: detection, diagnosis, remediation, and learning. Each layer requires specific tooling and practice.

Detection is the ability to notice that something is wrong before users complain. For prompt systems, detection relies on the observability stack: metrics (error rate, latency, quality scores), alerts (threshold-based and anomaly-based), and health checks (periodic synthetic requests that verify end-to-end functionality). The challenge is tuning detection sensitivity: too sensitive creates alert fatigue (paging operators for transient blips), too insensitive misses real degradation. A well-tuned detection system uses sliding-window aggregation (5-minute averages rather than per-request checks), burn-rate alerts (measuring how quickly the error budget is depleting), and composite signals (combining multiple metrics into a single health score).

Error budgets are borrowed from SRE practice. For each prompt, define an SLO (Service Level Objective): “the refund_policy_assistant prompt must produce valid responses with < 1% error rate over a rolling 30-day window.” The error budget is the allowed 1% failure. When the error budget is healthy, teams can deploy aggressively. When the error budget is depleted, deployments are frozen until quality is restored. Error budgets create a quantitative framework for balancing reliability and velocity.
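The error-budget and burn-rate arithmetic can be sketched directly (TypeScript; the function names and example values are illustrative):

```typescript
// Burn rate per the SRE definition: a burn rate of 1x consumes the error
// budget exactly over the full window; 10x exhausts it in a tenth of it.
function burnRate(observedErrorRate: number, sloSuccessRate: number): number {
  const allowedErrorRate = 1 - sloSuccessRate; // the error budget as a rate
  return observedErrorRate / allowedErrorRate;
}

// Hours until the budget is exhausted if the current burn rate holds.
function hoursToExhaustion(
  budgetRemaining: number, // fraction of budget left, e.g. 0.5
  rate: number,            // current burn rate, e.g. 2
  windowHours: number      // e.g. 30 * 24 for a 30-day window
): number {
  return (budgetRemaining * windowHours) / rate;
}

// Example: a 2% error rate against a 99% SLO is roughly a 2x burn rate,
// so a half-spent 30-day budget would last about 180 more hours.
```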

Diagnosis is the ability to determine the root cause of a detected issue. For prompt systems, root causes include: a new prompt version that changed output behavior, a model provider degradation (higher latency, different quality), a traffic pattern shift (new user segment, different query distribution), a policy misconfiguration, or an upstream data quality issue (RAG context retrieval returning stale or irrelevant documents). End-to-end tracing (from request ingestion through decision, inference, evaluation, and response) is the primary diagnostic tool. Each trace carries a correlation ID that links every step, enabling operators to reconstruct the exact path of a failing request.

The diagnostic workflow for prompt incidents follows a structured runbook:

  1. Check the alert details: which metric triggered, what is the current value, what is the threshold?
  2. Identify the scope: is this affecting all traffic or a specific prompt, model, or traffic segment?
  3. Correlate with recent changes: was a new prompt version promoted, a policy updated, or a model rotated within the alert window?
  4. Inspect sample traces: pull 5-10 traces from the affected time window and look for common patterns (same error code, same model, same prompt version).
  5. Determine the root cause category: prompt regression, model degradation, configuration error, or external dependency failure.
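Step 4 of the workflow above, looking for common patterns across sampled traces, is essentially a group-by. A minimal sketch with a hypothetical `Trace` shape:

```typescript
// Cluster sampled failing traces by (error code, model, prompt version)
// and surface the dominant pattern. The Trace shape is illustrative.
interface Trace {
  traceId: string;
  promptVersion: string;
  model: string;
  errorCode: string;
}

function dominantPattern(traces: Trace[]): { key: string; count: number } {
  const counts = new Map<string, number>();
  for (const t of traces) {
    const key = `${t.errorCode}|${t.model}|${t.promptVersion}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  let best = { key: "", count: 0 };
  for (const [key, count] of counts) {
    if (count > best.count) best = { key, count };
  }
  return best;
}
```

If most of the sampled traces share one key, the root-cause category (prompt regression vs model degradation vs configuration error) usually falls out immediately.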

Remediation is the ability to contain and resolve the issue. For prompt systems, the primary remediation actions are: roll back to the previous prompt version (if the issue is prompt-related), shift traffic to a different model (if the issue is model-related), tighten policy gates (if the issue is security-related), or scale resources (if the issue is capacity-related). Automated remediation playbooks handle common, well-understood failure modes. Manual remediation handles novel or complex failures. The key metric is MTTR (Mean Time to Recovery): the time from alert to resolution.

Learning is the ability to improve from each incident. Post-incident reviews (postmortems) analyze what happened, why it happened, what worked well in the response, what did not, and what concrete actions will prevent recurrence. A blameless postmortem culture focuses on system improvements rather than individual blame. Each postmortem produces action items that are tracked to completion, such as: adding a new alert rule, improving a runbook step, adding a new evaluation test case, or hardening a remediation playbook.

Game day drills simulate incidents in a controlled environment. A drill scenario specifies: the failure to inject (e.g., “increase the error rate of the primary model by 50%”), the expected detection time (e.g., “alert should fire within 5 minutes”), the expected diagnosis steps (e.g., “operator should correlate with model health dashboard”), and the expected remediation (e.g., “shift 100% traffic to fallback model within 10 minutes”). Drills are run periodically (monthly or quarterly) and the results are compared against targets to track operational readiness improvement over time.
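Scoring a drill against its targets is a small calculation over the recorded timestamps; a sketch, with field names that are illustrative rather than a defined schema:

```typescript
// Score a game day drill: MTTD is onset-to-alert, MTTR is onset-to-restored,
// and a pass requires both within target plus the correct remediation.
interface DrillTargets {
  maxMttdMinutes: number;
  maxMttrMinutes: number;
}

interface DrillObservation {
  injectedAt: number;   // epoch ms when the fault was injected
  alertFiredAt: number; // epoch ms when the alert fired
  remediatedAt: number; // epoch ms when service was restored
  correctRemediation: boolean;
}

function scoreDrill(t: DrillTargets, o: DrillObservation) {
  const mttdMinutes = (o.alertFiredAt - o.injectedAt) / 60_000;
  const mttrMinutes = (o.remediatedAt - o.injectedAt) / 60_000;
  return {
    mttdMinutes,
    mttrMinutes,
    pass:
      mttdMinutes <= t.maxMttdMinutes &&
      mttrMinutes <= t.maxMttrMinutes &&
      o.correctRemediation,
  };
}
```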

How this fits into the project Operational resilience and incident drills are the culminating capability of Project 18. The platform proves its production readiness by running game day exercises that demonstrate detection, diagnosis, remediation, and learning in an integrated, end-to-end workflow.

Definitions & key terms

  • Error budget: The allowed failure margin defined by the SLO (e.g., if SLO is 99% success rate, the error budget is 1% failures over the measurement window).
  • Burn rate: The rate at which the error budget is being consumed. A burn rate of 1x means the budget will be exhausted exactly at the end of the window. A burn rate of 10x means the budget will be exhausted in 1/10th of the window.
  • MTTR (Mean Time to Recovery): The average time from incident detection to resolution. A primary operational health metric.
  • MTTD (Mean Time to Detect): The average time from incident onset to alert firing.
  • Game day drill: A structured exercise where failures are intentionally injected to practice and measure the incident response workflow.
  • Runbook: A documented procedure for diagnosing and remediating a specific class of incidents.
  • Blameless postmortem: A post-incident review focused on system improvements rather than individual fault.

Mental model diagram (ASCII)

Incident Lifecycle:

  Failure Occurs          Detection              Diagnosis
  (model degrades,   -->  (alert fires,     -->  (check traces,
   prompt regresses,       metric breaches        correlate with
   injection attack)       threshold)             recent changes)
       |                       |                       |
       |                       |                       |
       v                       v                       v
  +----------+           +----------+           +----------+
  | ONSET    |           | DETECTED |           | DIAGNOSED|
  | (unknown |  MTTD     | (alert   |  diag     | (root    |
  |  to ops) | --------> |  fired)  | --------> |  cause   |
  +----------+           +----------+           |  known)  |
                                                +----------+
                                                      |
                              remediation             |
                              action                  |
                                                      v
                                                +----------+
                                                |REMEDIATED|
                                                | (service |
                                                | restored)|
                              MTTR              +----------+
                         (onset to                    |
                          remediated)                 |
                                                      v
                                                +----------+
                                                | REVIEWED |
                                                | (postmor-|
                                                | tem done)|
                                                +----------+

Game Day Drill Workflow:

  +-------------------+      +-------------------+      +-------------------+
  | 1. Define         |      | 2. Execute        |      | 3. Measure        |
  |    Scenario       |----->|    Injection      |----->|    Response       |
  |                   |      |                   |      |                   |
  | - failure type    |      | - inject fault    |      | - MTTD achieved?  |
  | - expected MTTD   |      | - start timer     |      | - MTTR achieved?  |
  | - expected MTTR   |      | - observe team    |      | - correct root    |
  | - success criteria|      |   response        |      |   cause found?    |
  +-------------------+      +-------------------+      +-------------------+
                                                               |
                                                               v
                                                         +-------------------+
                                                         | 4. Debrief        |
                                                         |                   |
                                                         | - what worked     |
                                                         | - what didn't     |
                                                         | - action items    |
                                                         | - update runbooks |
                                                         +-------------------+

How it works (step-by-step, with invariants and failure modes)

  1. The platform defines SLOs for each prompt: success rate, latency percentiles, and quality score minimums. Error budgets are computed from SLOs over rolling 30-day windows. Invariant: SLOs are reviewed quarterly and updated based on business requirements. Failure mode: SLOs set too tight cause constant alerting; SLOs set too loose miss real degradation.
  2. The observation subsystem evaluates alert rules continuously. When a metric breaches its threshold (e.g., 5-minute error rate > 2%), an alert is fired with severity, affected prompt, current metric value, and threshold. Invariant: alerts include the trace_ids of sample failing requests for immediate diagnosis. Failure mode: alert pipeline delay means the metric breach happened 3 minutes before the alert, increasing effective MTTD.
  3. On-call operators follow the runbook for the alert type. The runbook specifies diagnostic steps (which dashboards to check, which queries to run, which traces to pull) and decision criteria (when to escalate, when to auto-remediate). Invariant: runbooks are versioned and tested through drills. Failure mode: an outdated runbook references a dashboard that no longer exists, causing diagnostic delay.
  4. Remediation is executed: rollback, traffic shift, policy change, or escalation to engineering. The remediation action is recorded in the audit trail with the incident ID, operator, action taken, and timestamp. Invariant: every remediation action has a verification step (check that the metric returns to normal within 5 minutes). Failure mode: the remediation fixes the symptom but not the root cause, leading to recurrence.
  5. After the incident, a postmortem is conducted within 48 hours. The postmortem documents: timeline, root cause, impact, response quality, contributing factors, and action items with owners and deadlines. Invariant: action items are tracked in a task system and reviewed in the next weekly operations meeting. Failure mode: action items are documented but never completed, causing the same incident to recur.
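The verification invariant in step 4 above (confirm the metric returns to normal within five minutes) can be sketched as a poll loop; `readMetric` is a hypothetical callback into the observability layer:

```typescript
// Verify a remediation: poll the metric until it drops below the threshold
// or the verification window expires. Returns false if the metric never
// recovered, signalling that the incident should be escalated.
async function verifyRemediation(
  readMetric: () => Promise<number>, // hypothetical observability callback
  threshold: number,
  windowMs = 5 * 60_000,
  intervalMs = 15_000
): Promise<boolean> {
  const deadline = Date.now() + windowMs;
  while (Date.now() < deadline) {
    if ((await readMetric()) < threshold) return true; // metric recovered
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // symptom persists: the remediation did not take effect
}
```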

Minimal concrete example

Game Day Drill Scenario:
  {
    "drill_id": "dr_901",
    "scenario": "injection_spike",
    "description": "Simulate a sudden increase in prompt injection attacks against the refund assistant.",
    "injection": {
      "type": "increase_attack_rate",
      "target_prompt": "refund_policy_assistant",
      "attack_rate_multiplier": 10,
      "duration_minutes": 15
    },
    "expected_detection": {
      "alert_name": "security_posture_degradation",
      "max_mttd_minutes": 5
    },
    "expected_remediation": {
      "action": "enable_strict_policy_mode",
      "max_mttr_minutes": 10
    },
    "success_criteria": {
      "mttd_within_target": true,
      "mttr_within_target": true,
      "correct_remediation_action": true,
      "no_data_leakage_during_incident": true
    }
  }

Drill Results:
  {
    "drill_id": "dr_901",
    "status": "COMPLETED",
    "actual_mttd_minutes": 3.2,
    "actual_mttr_minutes": 7.5,
    "correct_root_cause_identified": true,
    "correct_remediation_action": true,
    "data_leakage_detected": false,
    "score": "PASS",
    "debrief_notes": "Detection was fast due to injection-specific alert rule. Remediation delayed 2 min because runbook did not specify which policy profile to activate. Action item: update runbook with explicit policy profile names."
  }

SLO Dashboard:
  {
    "prompt": "refund_policy_assistant",
    "slo_success_rate": "99%",
    "current_success_rate": "99.7%",
    "error_budget_remaining": "70%",
    "error_budget_burn_rate": "0.8x",
    "mttr_30day_avg_minutes": 8.2,
    "drills_completed_this_quarter": 3,
    "open_postmortem_action_items": 2
  }

Common misconceptions

  • “We only need incident drills after launch.” Drills before launch identify gaps in detection, diagnosis, and remediation tooling that would be much more expensive to discover during a real incident. Drilling early is cheaper than debugging under pressure.
  • “If automated remediation handles common cases, we do not need runbooks.” Automated remediation handles the first 80% of incidents. Runbooks handle the remaining 20% that require human judgment. Without runbooks, operators improvise during the hardest incidents, increasing MTTR.
  • “Postmortems are optional for minor incidents.” Minor incidents often share root causes with major incidents. A pattern of minor incidents that are not investigated can be a leading indicator of a major outage. Lightweight postmortems (15-minute template) for minor incidents are a worthwhile investment.
  • “Error budgets are just SRE jargon.” Error budgets create a shared language between engineering and product teams about the tradeoff between reliability and deployment velocity. Without error budgets, discussions about “how much testing is enough” are subjective and unresolvable.
  • “MTTR is the only metric that matters.” MTTD is equally important. An incident that takes 2 minutes to fix but 30 minutes to detect has an effective MTTR of 32 minutes. Investing in detection sensitivity often has a larger impact on total incident duration than investing in faster remediation.

Check-your-understanding questions

  1. Why is MTTD measured separately from MTTR, and which is typically harder to improve?
  2. How do error budgets create alignment between reliability and deployment velocity?
  3. What makes a game day drill effective versus a trivial exercise?
  4. Why should postmortem action items have owners and deadlines rather than being left as suggestions?
  5. How does the burn rate metric provide earlier warning than a simple threshold alert?

Check-your-understanding answers

  1. MTTD measures how quickly the system notices a problem; MTTR measures the total time from onset to resolution (including detection). MTTD is typically harder to improve because it requires investment in observability instrumentation, alert tuning, and synthetic health checks, while remediation speed improves through automation and runbook practice.
  2. When the error budget is healthy (plenty of remaining budget), teams can deploy frequently, accepting the small risk of regressions. When the error budget is depleted, deployments freeze until reliability is restored. This creates a self-balancing incentive: teams that ship reliable code get more deployment freedom, teams that ship buggy code get temporarily slowed down.
  3. An effective drill uses a realistic scenario (based on actual incident patterns), has specific measurable targets (MTTD < 5 min, MTTR < 15 min), exercises the actual tools and runbooks (not a simplified walkthrough), and produces actionable findings (specific runbook updates, new alert rules, tooling gaps). A trivial drill uses an artificial scenario, has no targets, and produces no action items.
  4. Unowned action items are suggestions that decay. Postmortem action items represent concrete improvements needed to prevent recurrence. Without owners and deadlines, they accumulate in a backlog and are never completed, meaning the same incident class recurs. Weekly operations reviews should track action item completion rate as an operational health metric.
  5. A simple threshold alert fires when the metric crosses a fixed boundary (e.g., error rate > 5%). This misses slow degradation that stays below the threshold for a long time before suddenly spiking. A burn rate alert fires when the rate of error budget consumption exceeds a multiple of the sustainable rate (e.g., 10x burn rate means the 30-day budget will be exhausted in 3 days). This catches gradual degradation early because even a slow burn at 3x will trigger before the budget is fully consumed.

Real-world applications

  • Netflix’s Chaos Monkey randomly terminates instances to verify that services handle failures gracefully; the same principle applies to prompt-system drills (simulate model failures, injection spikes, traffic surges).
  • Google’s DiRT (Disaster Recovery Testing) program runs company-wide drills that simulate infrastructure failures, measuring detection and recovery times against defined targets.
  • PagerDuty incident response practices define severity levels, escalation policies, and postmortem templates, directly applicable to PromptOps incident management.
  • Langfuse and Helicone provide the observability data (traces, metrics, dashboards) that power the detection and diagnosis layers of the operational resilience framework.
  • AWS Well-Architected Framework’s Reliability Pillar defines practices for fault tolerance, recovery, and testing that map directly to PromptOps operational resilience.

Where you’ll apply it

  • Phase 3: build the drill runner, define drill scenarios, implement the SLO/error budget dashboard, and conduct at least two game day exercises as part of the capstone demonstration.

References

  • “Site Reliability Engineering” by Google - Chapters on monitoring, alerting, incident management, and postmortems
  • “Designing Data-Intensive Applications” by Martin Kleppmann - Chapter 1: Reliability
  • “AI Engineering” by Chip Huyen - Chapters on monitoring and continuous improvement
  • Netflix Chaos Engineering principles
  • PagerDuty incident response documentation

Key insights Operational resilience is not a feature you build once; it is a practice you maintain through regular drills, honest postmortems, and continuous improvement of detection, diagnosis, remediation, and learning capabilities.

Summary Operational resilience for a PromptOps platform encompasses four layers: detection (SLOs, error budgets, alerts), diagnosis (tracing, runbooks, structured root cause analysis), remediation (automated playbooks, traffic shifting, rollback), and learning (postmortems, action item tracking, drill improvements). Game day drills are the mechanism for validating and improving all four layers in a controlled environment, measuring MTTD and MTTR against targets, and building the organizational muscle memory needed for effective real-incident response.

Homework/Exercises to practice the concept

  • Define SLOs for three different prompt types (simple Q&A, multi-step reasoning, tool-calling agent) with appropriate error budgets and measurement windows. Explain why each SLO is different.
  • Design three game day drill scenarios (model degradation, prompt injection spike, configuration error) with specific injection parameters, expected detection times, expected remediation actions, and measurable success criteria.
  • Write a postmortem template with sections for: incident summary, timeline, root cause analysis (using the “5 Whys” technique), impact assessment, response evaluation, contributing factors, and action items with owners and deadlines.

Solutions to the homework/exercises

  • Simple Q&A prompt: SLO 99.5% success rate, p95 latency < 2s, quality score > 3.5/5. Error budget: 0.5% over 30 days. Justification: high volume, low complexity, users expect fast responses. Multi-step reasoning: SLO 97% success rate, p95 latency < 10s, quality score > 4.0/5. Error budget: 3% over 30 days. Justification: lower volume, higher complexity, users accept longer latency, quality standards are higher because errors in reasoning are more impactful. Tool-calling agent: SLO 95% success rate, p95 latency < 30s, quality score > 4.2/5, tool execution accuracy > 98%. Error budget: 5% over 30 days. Justification: tool calls introduce additional failure modes (tool errors, permission issues), latency is inherently higher, and accuracy is critical because tool actions may be irreversible.
  • Drill scenarios: (1) Model degradation: inject 200ms additional latency on primary model for 10 minutes. Expected MTTD < 3 min (latency alert fires). Expected remediation: shift traffic to fallback model within 5 min. Success: MTTR < 8 min, no SLO breach. (2) Injection spike: multiply injection attempts 10x for 15 minutes. Expected MTTD < 5 min (security alert fires). Expected remediation: enable strict policy mode within 10 min. Success: no successful injections, MTTR < 12 min. (3) Configuration error: push a misconfigured routing rule that sends all traffic to a deprecated model. Expected MTTD < 2 min (error rate spike). Expected remediation: rollback configuration within 5 min. Success: MTTR < 7 min, audit trail records the rollback.
  • Postmortem template: Title, date, severity, duration, author. Incident Summary (2-3 sentences). Timeline (timestamp: event, including detection, diagnosis steps, remediation actions, resolution). Root Cause Analysis (5 Whys from the triggering event to the systemic cause). Impact (users affected, requests failed, SLO impact, error budget consumed). Response Evaluation (what worked well, what could be improved, was the runbook followed). Contributing Factors (systemic issues: missing tests, insufficient monitoring, unclear ownership). Action Items (each with: description, owner, deadline, priority, tracking ticket ID).
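The first drill scenario above can be captured as a machine-readable fixture plus a scoring rule. A minimal TypeScript sketch; the field names and the way pass criteria are encoded are assumptions, not a fixed schema:

```typescript
// Hypothetical drill fixture for the model-degradation scenario above.
// Field names are illustrative; adapt them to your own DrillScenario schema.
interface DrillScenario {
  drillId: string;
  scenario: string;
  injection: { type: string; target: string; parameters: Record<string, number>; durationMinutes: number };
  expectedDetection: { alertName: string; maxMttdMinutes: number };
  expectedRemediation: { action: string; maxMttrMinutes: number };
  successCriteria: { maxMttrMinutes: number; sloBreachAllowed: boolean };
}

const modelDegradationDrill: DrillScenario = {
  drillId: "dr_model_degradation_01",
  scenario: "model_degradation",
  injection: {
    type: "latency",
    target: "primary_model",
    parameters: { addedLatencyMs: 200 },
    durationMinutes: 10,
  },
  expectedDetection: { alertName: "p95_latency_breach", maxMttdMinutes: 3 },
  expectedRemediation: { action: "shift_traffic_to_fallback", maxMttrMinutes: 5 },
  successCriteria: { maxMttrMinutes: 8, sloBreachAllowed: false },
};

// A drill passes when measured detection and recovery times stay within bounds.
function scoreDrill(s: DrillScenario, mttdMinutes: number, mttrMinutes: number): "PASS" | "FAIL" {
  return mttdMinutes <= s.expectedDetection.maxMttdMinutes &&
    mttrMinutes <= s.successCriteria.maxMttrMinutes
    ? "PASS"
    : "FAIL";
}
```

Keeping scenarios as data rather than code lets the drill runner re-run them with different seeds while the scoring stays constant.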

3. Project Specification

3.1 What You Will Build

An end-to-end PromptOps control plane that integrates contracts, evals, rollouts, routing, and incident response.

3.2 Functional Requirements

  1. Integrate registry, eval harness, router, policy firewall, and rollout controller.
  2. Provide unified operator UI and API for release and incident actions.
  3. Run deterministic incident drills and capture response metrics.
  4. Enforce governance gates before production promotion.

3.3 Non-Functional Requirements

  • Performance: Critical control-plane actions respond under 300 ms p95.
  • Reliability: Drill scenarios produce repeatable outcomes for same seed/config.
  • Security/Policy: High-risk operations require multi-step confirmation and audit trail.

3.4 Example Usage / Output

Browser URL: http://localhost:3018/control-plane

+--------------------------------------------------------------------------------+
| PromptOps Control Plane                                                        |
| Version: refund:v2.3.0   Canary: 20%   Safety Incidents (24h): 0   MTTR: 7m   |
+--------------------------------------------------------------------------------+
| Rollout Timeline            | Incident Feed                                    |
| Draft -> Eval -> Canary     | [RESOLVED] Injection spike rollback (dr_901)    |
| -> Promote                  | [OPEN] Citation grounding dip (sev-2)           |
+--------------------------------------------------------------------------------+
| Actions: [Run Drill] [Promote] [Rollback] [Open Trace Explorer]               |
+--------------------------------------------------------------------------------+
$ curl -s http://localhost:3000/v1/platform/drills \
  -H 'content-type: application/json' \
  -d '{
  "scenario": "injection_spike",
  "traffic_profile": "support_peak",
  "seed": 2026
}' | jq
{
  "drill_id": "dr_901",
  "status": "COMPLETED",
  "rollback_triggered": true,
  "mttr_minutes": 7,
  "trace_id": "trc_p18_901"
}

3.5 Data Formats / Schemas / Protocols

  • Unified event log stream with typed events across all subsystems.
  • Drill report JSON with timeline, decisions, and MTTR metrics.
  • Runbook markdown templates for incident classes.
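The typed event stream can be sketched as a discriminated union, which makes validation at subsystem boundaries cheap. A minimal TypeScript sketch; the event kinds and fields are illustrative assumptions, not a fixed schema:

```typescript
// Hedged sketch: one possible shape for the unified event log stream.
// Every event carries a trace id so cross-service correlation (Section 3.6)
// never breaks silently.
type PlatformEvent =
  | { kind: "rollout.stage_changed"; traceId: string; promptVersion: string;
      stage: "draft" | "eval" | "canary" | "promoted" | "rolled_back"; at: string }
  | { kind: "alert.fired"; traceId: string; ruleId: string;
      severity: "critical" | "warning" | "info"; at: string }
  | { kind: "remediation.executed"; traceId: string; playbook: string; action: string; at: string }
  | { kind: "drill.completed"; traceId: string; drillId: string; mttrMinutes: number; at: string };

// Validate at subsystem boundaries: reject events with no trace id or a
// malformed timestamp before they enter the shared log.
function validateEvent(e: PlatformEvent): boolean {
  return e.traceId.length > 0 && !Number.isNaN(Date.parse(e.at));
}
```

A discriminated union lets each subsystem `switch` on `kind` exhaustively, so adding a new event type surfaces every unhandled consumer at compile time.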

3.6 Edge Cases

  • Rollout and incident drill overlap on same prompt family.
  • One subsystem unavailable while operator initiates rollback.
  • Cross-service trace ids become inconsistent.
  • Policy updates conflict with active canary state.

3.7 Real World Outcome

This project is complete when both the UI workflow and the backend policy enforcement are visible and auditable.

3.7.1 How to Run (Copy/Paste)

$ npm run dev --workspace p18-control-plane

3.7.2 Golden Path Demo (Deterministic)

Use the provided fixture payload and pre-seeded queue/data so UI counts and API responses are reproducible.

3.7.3 Browser Flow

  • Open: http://localhost:3018/control-plane
  • Verify these visible states:
  • Top row cards show Current Prompt Version, Canary State, Safety Incidents (24h), and Mean Recovery Time.
  • Middle panel has live rollout timeline with states Draft -> Eval -> Canary -> Promote/Rollback.
  • Right panel lists active incidents and links to trace bundles and rollback buttons.
+--------------------------------------------------------------------------------+
| PromptOps Control Plane                                                        |
| Version: refund:v2.3.0   Canary: 20%   Safety Incidents (24h): 0   MTTR: 7m   |
+--------------------------------------------------------------------------------+
| Rollout Timeline            | Incident Feed                                    |
| Draft -> Eval -> Canary     | [RESOLVED] Injection spike rollback (dr_901)    |
| -> Promote                  | [OPEN] Citation grounding dip (sev-2)           |
+--------------------------------------------------------------------------------+
| Actions: [Run Drill] [Promote] [Rollback] [Open Trace Explorer]               |
+--------------------------------------------------------------------------------+

3.7.4 API Behavior (Success + Error)

$ curl -s http://localhost:3000/v1/platform/drills \
  -H 'content-type: application/json' \
  -d '{
  "scenario": "injection_spike",
  "traffic_profile": "support_peak",
  "seed": 2026
}' | jq
{
  "drill_id": "dr_901",
  "status": "COMPLETED",
  "rollback_triggered": true,
  "mttr_minutes": 7,
  "trace_id": "trc_p18_901"
}
$ curl -s http://localhost:3000/v1/platform/drills \
  -H 'content-type: application/json' \
  -d '{
  "scenario": "unknown",
  "traffic_profile": "support_peak",
  "seed": 2026
}' | jq
{
  "error": {
    "code": "INVALID_DRILL_SCENARIO",
    "message": "Scenario \"unknown\" is not registered.",
    "trace_id": "trc_p18_902",
    "project": "P18"
  }
}
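The success and error responses above can be modeled as a framework-agnostic handler. A sketch in TypeScript; the scenario registry contents, the 422 status code, and the stubbed success body are assumptions mirroring the example output:

```typescript
// Framework-agnostic sketch of the drills endpoint behavior shown above.
// Registered scenario names and the HTTP 422 choice are assumptions.
const registeredScenarios = new Set(["injection_spike", "model_degradation", "config_error"]);

interface DrillRequest { scenario: string; traffic_profile: string; seed: number }
type DrillResponse =
  | { status: 200; body: { drill_id: string; status: string; rollback_triggered: boolean;
      mttr_minutes: number; trace_id: string } }
  | { status: 422; body: { error: { code: string; message: string; trace_id: string; project: string } } };

function handleDrillRequest(req: DrillRequest, traceId: string): DrillResponse {
  if (!registeredScenarios.has(req.scenario)) {
    // Mirror the error envelope from the example: code, message, trace_id, project.
    return {
      status: 422,
      body: {
        error: {
          code: "INVALID_DRILL_SCENARIO",
          message: `Scenario "${req.scenario}" is not registered.`,
          trace_id: traceId,
          project: "P18",
        },
      },
    };
  }
  // In the real platform this enqueues the drill runner; here we return a stub result.
  return {
    status: 200,
    body: { drill_id: "dr_901", status: "COMPLETED", rollback_triggered: true,
      mttr_minutes: 7, trace_id: traceId },
  };
}
```

Validating the scenario name before any subsystem is touched keeps bad drill requests from ever reaching the remediation or observation stacks.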

4. Solution Architecture

4.1 High-Level Design

+===========================================================================+
|                     CONTROL PLANE                                         |
|                                                                           |
|  +---------------------+     +---------------------+                     |
|  | Configuration API   |---->| Decision Engine     |                     |
|  | (desired state,     |     | (version resolution,|                     |
|  |  routing rules,     |     |  model routing,     |                     |
|  |  policy configs)    |     |  policy enforcement)|                     |
|  +---------------------+     +---------------------+                     |
|           |                           |                                   |
|           v                           v                                   |
|  +---------------------+     +---------------------+                     |
|  | Remediation Engine  |<----| Observation Stack   |                     |
|  | (playbooks,         |     | (metrics, alerts,   |                     |
|  |  auto-rollback,     |     |  dashboards, SLO    |                     |
|  |  operator paging)   |     |  tracking)          |                     |
|  +---------------------+     +---------------------+                     |
|           |                           ^                                   |
|           v                           |                                   |
|  +---------------------+     +---------------------+                     |
|  | Drill Runner        |     | Quality Gate Engine |                     |
|  | (scenario injection,|     | (evaluators,        |                     |
|  |  MTTD/MTTR scoring) |     |  composition rules) |                     |
|  +---------------------+     +---------------------+                     |
|                                                                           |
+===========================================================================+
            |                           ^
            v                           |
+===========================================================================+
|                      DATA PLANE                                           |
|  Request Handler -> LLM Gateway -> Post-Inference Eval -> Response       |
+===========================================================================+

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Configuration API | Stores desired state and routing rules. | Declarative model with reconciliation loop. |
| Decision Engine | Resolves prompt version, model, and policies per request. | Cached configuration with periodic refresh. |
| Observation Stack | Collects metrics, evaluates alerts, tracks SLOs. | Sliding-window aggregation, burn-rate alerts. |
| Remediation Engine | Executes playbooks for automated recovery. | Fail-safe: always notify operator, even on auto-fix. |
| Quality Gate Engine | Aggregates evaluator verdicts for promotion decisions. | Parallel evaluators with configurable composition rules. |
| Drill Runner | Injects failures and measures response quality. | Deterministic scenarios with seeded randomness. |
| Operator Console | Presents live platform state and action buttons. | Clarity under stress: minimal, high-signal UI. |

4.3 Data Structures (No Full Code)

DesiredState:
- prompt_name: string
- version: semver
- model_routing: { primary: string, fallback: string, primary_weight: float }
- policy_profile: string
- rate_limit: { rpm: int, tpm: int }

DecisionRecord:
- trace_id: string
- prompt_version: semver
- content_hash: sha256
- model_target: string
- policies_applied: string[]
- decision_latency_ms: int
- timestamp: iso8601

AlertRule:
- rule_id: string
- condition: string (metric expression)
- playbook: string
- severity: critical | warning | info
- notification_channels: string[]

DrillScenario:
- drill_id: string
- scenario: string
- injection: { type, target, parameters, duration }
- expected_detection: { alert_name, max_mttd_minutes }
- expected_remediation: { action, max_mttr_minutes }
- success_criteria: { ... }

DrillResult:
- drill_id: string
- actual_mttd_minutes: float
- actual_mttr_minutes: float
- correct_root_cause: boolean
- correct_remediation: boolean
- score: PASS | PARTIAL | FAIL
- debrief_notes: string

4.4 Algorithm Overview

Key algorithm: Incident detection and automated remediation

  1. Observation stack evaluates alert rules on sliding-window metrics every 30 seconds.
  2. When a rule fires, the remediation engine looks up the associated playbook.
  3. Playbook executes a sequence of actions: check preconditions, execute remediation (rollback/traffic shift/policy change), verify recovery (metric returns to normal).
  4. Record all actions in the audit trail with incident correlation ID.
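The detect-remediate-verify sequence in steps 2–4 can be sketched as a small playbook executor that audits every step and stops fail-safe when a precondition does not hold. The interface names and the audit shape are illustrative assumptions:

```typescript
// Minimal sketch of playbook execution with audit-trail recording.
// Step names and the Playbook interface are assumptions, not a fixed API.
interface Playbook {
  precondition: () => boolean;  // e.g. the fallback model is healthy
  remediate: () => void;        // rollback / traffic shift / policy change
  verify: () => boolean;        // the breached metric is back under threshold
}

interface AuditEntry { incidentId: string; step: string; ok: boolean }

function runPlaybook(incidentId: string, pb: Playbook, audit: AuditEntry[]): boolean {
  const pre = pb.precondition();
  audit.push({ incidentId, step: "precondition", ok: pre });
  if (!pre) return false;       // fail safe: stop here and page the operator
  pb.remediate();
  audit.push({ incidentId, step: "remediate", ok: true });
  const recovered = pb.verify();
  audit.push({ incidentId, step: "verify", ok: recovered });
  return recovered;
}
```

Because every step is appended to the audit trail with the incident correlation ID, a failed verification leaves an exact record of what was attempted before the operator is paged.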

Key algorithm: Quality gate evaluation

  1. Receive promotion request with prompt version and target stage.
  2. Spawn all configured evaluators in parallel.
  3. Collect verdicts and apply composition rule.
  4. Generate immutable gate report, attach to version metadata.
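Steps 2–3, parallel evaluators plus a composition rule, can be sketched as follows. The `strictCompose` rule (any FAIL blocks, any WARN downgrades) is one assumed policy, not the only option:

```typescript
// Sketch of parallel gate evaluation with a pluggable composition rule.
type Verdict = "PASS" | "WARN" | "FAIL";
interface EvaluatorResult { name: string; verdict: Verdict }

async function runGate(
  evaluators: Array<() => Promise<EvaluatorResult>>,
  compose: (results: EvaluatorResult[]) => Verdict,
): Promise<{ verdict: Verdict; results: EvaluatorResult[] }> {
  // All evaluators run in parallel, so gate latency is bounded by the slowest one.
  const results = await Promise.all(evaluators.map((e) => e()));
  return { verdict: compose(results), results };
}

// One possible composition rule: any FAIL blocks promotion, any WARN downgrades.
function strictCompose(results: EvaluatorResult[]): Verdict {
  if (results.some((r) => r.verdict === "FAIL")) return "FAIL";
  if (results.some((r) => r.verdict === "WARN")) return "WARN";
  return "PASS";
}
```

Making the composition rule a parameter keeps gate policy configurable per deployment stage without touching the evaluators themselves.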

Complexity Analysis (conceptual):

  • Alert evaluation: O(R * M) per cycle where R is alert rules and M is metrics per rule.
  • Quality gate: O(E) parallel time where E is the slowest evaluator.
  • Drill execution: O(D) per drill where D is drill duration (bounded by configuration).

5. Implementation Guide

5.1 Development Environment Setup

# 1) Install dependencies for all sub-projects
# 2) Prepare shared fixtures under fixtures/
# 3) Start the control plane: npm run dev --workspace p18-control-plane
# 4) Verify: open http://localhost:3018/control-plane

5.2 Project Structure

p18/
├── src/
│   ├── config/           # Configuration API, desired state management
│   ├── decision/         # Decision engine, version resolution, routing
│   ├── observation/      # Metrics collection, alert evaluation, SLO tracking
│   ├── remediation/      # Playbook engine, auto-rollback, notifications
│   ├── gates/            # Quality gate engine, evaluators, composition
│   ├── drills/           # Drill runner, scenario definitions, scoring
│   ├── integration/      # Cross-subsystem interfaces, event bus
│   └── ui/               # Operator console (web dashboard)
├── fixtures/
│   ├── prompts/          # Prompt version fixtures
│   ├── scenarios/        # Drill scenario definitions
│   ├── traffic/          # Traffic profile fixtures
│   └── configs/          # Platform configuration fixtures
├── runbooks/
│   ├── model-degradation.md
│   ├── injection-spike.md
│   └── config-error.md
├── policies/
├── out/
└── README.md

5.3 The Core Question You’re Answering

“Can I operate prompt-driven features with the same rigor as mission-critical software services?”

This question matters because it tests whether you can integrate all the specialized primitives (registry, eval, routing, rollout, firewall) into a coherent operational platform with real-time visibility, automated recovery, and measurable resilience.

5.4 Concepts You Must Understand First

  1. Control-plane / data-plane architecture
    • How do systems like Kubernetes and Istio separate decision-making from execution?
    • Book Reference: “Site Reliability Engineering” by Google - Operations chapters
  2. SLOs, error budgets, and burn rates
    • How does Google define and enforce service level objectives?
    • Book Reference: “Site Reliability Engineering” by Google - SLO chapters
  3. Cross-functional quality gates
    • How do mature CI/CD pipelines combine multiple quality signals into release decisions?
    • Book Reference: “Accelerate” by Forsgren et al. - Continuous delivery chapters
  4. Chaos engineering and game day drills
    • How does Netflix validate system resilience through intentional failure injection?
    • Book Reference: Netflix Chaos Engineering principles
  5. Incident management and postmortems
    • How do SRE teams detect, diagnose, remediate, and learn from production incidents?
    • Book Reference: “Site Reliability Engineering” by Google - Incident management chapters

5.5 Questions to Guide Your Design

  1. Subsystem integration
    • How will the four control-plane subsystems communicate (message bus, polling, shared state)?
    • What happens when one subsystem is unavailable?
    • How do you maintain trace correlation across subsystem boundaries?
  2. Quality gates and release flow
    • Which evaluators are needed for each deployment stage?
    • How do temporal gates interact with canary traffic percentage?
    • How do you handle gate failures during an active rollout?
  3. Operational resilience
    • What SLOs are appropriate for each prompt type?
    • How do you tune alert sensitivity to minimize false positives without missing real incidents?
    • What drill scenarios cover the most common and most dangerous failure modes?
  4. Platform ergonomics
    • What does an operator need to see on the dashboard during normal operations vs during an incident?
    • How do you make the rollback action as fast and safe as possible?
    • How do you prevent operator error during high-stress incident response?

5.6 Thinking Exercise

Pre-Mortem for Production Prompt Platform Capstone

Before implementing, write down 15 ways this platform can fail in production. Classify each failure into: integration, configuration, observability, remediation, or governance. For each, specify whether the failure is preventable (through design) or must be detected (through monitoring).

Questions to answer:

  • Which failures cascade across subsystems?
  • Which failures are silent (no alert fires)?
  • Which failures require human judgment to resolve?

5.7 The Interview Questions They’ll Ask

  1. “How do you separate the control plane from the data plane in a PromptOps system, and why does it matter?”
  2. “Describe how you would design cross-functional quality gates for prompt releases.”
  3. “What is an error budget and how does it balance reliability with deployment velocity?”
  4. “How would you design an incident drill for an LLM-powered customer service system?”
  5. “What metrics would you track on a PromptOps dashboard, and what alerts would you set?”
  6. “How do you handle a situation where automated remediation fails during a real incident?”

5.8 Hints in Layers

Hint 1: Start with the integration backbone Define the event schemas and interfaces between subsystems before building any subsystem. The interfaces are harder to change later than the implementations.

Hint 2: Build observability before remediation You cannot automate recovery until you can reliably detect problems. Build the metrics pipeline, dashboards, and alerts first, then layer remediation playbooks on top.

Hint 3: Use fixtures to simulate the full stack You do not need real LLM API calls to test the platform. Use deterministic fixtures that simulate model responses, evaluation results, and traffic patterns to test the entire control-plane workflow.

Hint 4: Drill early and drill often Define your first drill scenario before the platform is “complete.” Running a drill on an incomplete platform reveals integration gaps faster than code review.

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Production reliability | “Site Reliability Engineering” by Google | Operations, monitoring, and incident response chapters |
| Platform velocity | “Accelerate” by Forsgren et al. | Delivery performance and lean management chapters |
| Systems design | “Designing Data-Intensive Applications” by Martin Kleppmann | Distributed systems, reliability, and data integration chapters |
| AI systems | “AI Engineering” by Chip Huyen | Evaluation, monitoring, and deployment chapters |

5.10 Implementation Phases

Phase 1: Foundation (Integration Backbone)

  • Define shared event schemas for all subsystem interfaces.
  • Implement the configuration API with desired state management.
  • Build the decision engine with basic version resolution and model routing.
  • Set up the operator console skeleton with dashboard layout.
  • Checkpoint: Declaring a desired state updates the decision engine’s routing, and the dashboard shows the current state.

Phase 2: Quality and Observability

  • Implement the observation stack: metrics collection, alert evaluation, SLO tracking.
  • Build the quality gate engine with at least three evaluators (contract, security, performance).
  • Integrate gate results with the promotion workflow.
  • Populate the dashboard with live metrics, SLO gauges, and alert history.
  • Checkpoint: A simulated metric breach fires an alert visible on the dashboard, and a quality gate blocks an insufficiently tested promotion.

Phase 3: Resilience and Drills

  • Implement the remediation engine with playbook execution and auto-rollback.
  • Build the drill runner with at least two scenario types (model degradation, injection spike).
  • Connect all subsystems: drill injects fault -> observation detects -> remediation acts -> audit records.
  • Conduct game day exercises and measure MTTD/MTTR against targets.
  • Checkpoint: A complete drill executes end-to-end with measured results, and the postmortem identifies at least one improvement action.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Subsystem communication | Message bus vs polling vs shared DB | Event bus (in-process for single-node, Redis/NATS for distributed) | Loose coupling, subsystem independence, failure isolation |
| Metrics storage | In-memory vs time-series DB | In-memory for demo, Prometheus-compatible for production | Start simple, migrate when needed |
| Alert evaluation | Push-based vs pull-based | Pull-based (poll metrics every 30s) | Simpler to implement, sufficient for demo refresh rates |
| Quality gate evaluators | Sequential vs parallel | Parallel with timeout | Minimize gate latency, handle slow evaluators gracefully |
| Drill execution | Real fault injection vs simulated | Simulated (fixture-based) | Safe, deterministic, reproducible |
| Dashboard framework | Server-rendered vs SPA | Lightweight SPA with WebSocket for live updates | Responsive UI during incidents |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Validate individual components | Alert rule evaluation, gate composition logic, version resolution |
| Integration Tests | Verify subsystem interactions | Config change -> decision update -> metric emission |
| End-to-End Tests | Validate complete platform workflows | Promotion flow with gate checks, drill execution with recovery |
| Drill Tests | Verify operational resilience | Seeded drill scenarios with deterministic outcomes |

6.2 Critical Test Cases

  1. Declaring a desired state propagates to the decision engine within the refresh interval.
  2. A quality gate correctly blocks promotion when the security evaluator returns FAIL.
  3. A metric breach fires the correct alert and triggers the associated playbook.
  4. Auto-rollback restores the previous prompt version and the metric recovers.
  5. A drill scenario produces the same MTTD/MTTR scores when re-run with the same seed.
  6. The audit trail records every action across all subsystems with consistent trace correlation.
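Test case 5 hinges on seeded determinism. A tiny sketch of how a drill runner might derive reproducible injected traffic from a seed; the LCG constants are standard 32-bit parameters, and the traffic model is a deliberately simplistic stand-in:

```typescript
// Seeded pseudo-random generator: same seed -> same sequence, so the
// drill runner's injected load (and thus MTTD/MTTR) is reproducible.
function seededRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;  // 32-bit linear congruential step
    return state / 2 ** 32;                        // uniform in [0, 1)
  };
}

// Toy traffic model: each request receives a pseudo-random injected
// latency in [0, 200) ms derived entirely from the seed.
function simulateDrillTraffic(seed: number, requests: number): number[] {
  const rng = seededRng(seed);
  return Array.from({ length: requests }, () => Math.round(rng() * 200));
}
```

If the runner draws all randomness from one seeded generator like this, re-running a drill with the same seed and config must reproduce the same timeline byte for byte.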

6.3 Test Data

fixtures/prompts/refund_assistant_v2.3.0.json
fixtures/prompts/refund_assistant_v2.3.1.json
fixtures/scenarios/injection_spike.json
fixtures/scenarios/model_degradation.json
fixtures/scenarios/config_error.json
fixtures/traffic/support_peak_profile.json
fixtures/configs/desired_state.json
fixtures/configs/alert_rules.json
fixtures/configs/gate_configs.json

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| “Integrated demo works but operations are brittle” | Cross-service event schemas are inconsistent or undocumented. | Define shared event schemas first, validate at subsystem boundaries. |
| “Rollback took too long” | No single-action rollback path; operator had to manually update config, restart services, verify recovery. | Automate rollback as a single API call that updates config, notifies subsystems, and verifies recovery. |
| “Incidents are hard to investigate” | Trace correlation IDs are missing or inconsistent across subsystems. | Propagate a global trace_id through every event and log entry from ingestion to response. |
| “Alerts fire too often (fatigue)” | Threshold too tight or no debounce/sliding window. | Use 5-minute sliding windows, burn-rate alerts, and severity-based notification routing. |
| “Drills are not representative” | Scenarios are too simple or do not exercise the actual tooling. | Base drill scenarios on real incident patterns; use production-like traffic profiles. |

7.2 Debugging Strategies

  • Trace a single request through all subsystems: decision record -> inference log -> evaluation result -> metric emission. Verify trace_id consistency at each boundary.
  • Compare the desired state in the configuration store with the actual routing table in the decision engine to detect reconciliation lag.
  • Replay a drill scenario with verbose logging to identify where detection or remediation deviated from expectations.
  • Check the quality gate report for the most recently promoted version to verify all evaluators ran and produced valid verdicts.

7.3 Performance Traps

  • Alert evaluation scanning all rules against all metrics every 30 seconds can be expensive with many rules. Index metrics by rule subscription to evaluate only relevant rules.
  • Quality gate evaluators that make external API calls (LLM-as-judge, model benchmarks) can slow the promotion pipeline. Cache evaluator results keyed by (prompt_content_hash, evaluator_version).
  • The operator dashboard polling for updates every second can create unnecessary load. Use WebSocket for push-based updates.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a third drill scenario (e.g., upstream RAG context degradation).
  • Add an SLO history chart to the dashboard showing error budget consumption over 30 days.

8.2 Intermediate Extensions

  • Implement multi-tenant support: separate configurations, SLOs, and dashboards per team.
  • Add cost chargeback reporting: track per-team, per-prompt cost allocation.
  • Build a compliance report generator for SOC 2 audits from the audit trail.

8.3 Advanced Extensions

  • Implement cross-region deployment with region-specific desired state and global rollout coordination.
  • Add predictive alerting using anomaly detection on metric time series.
  • Build a chaos engineering framework that generates novel drill scenarios from historical incident patterns.

9. Real-World Connections

9.1 Industry Applications

  • PromptOps platform teams at enterprises operating AI features under compliance constraints (finance, healthcare, government).
  • Internal AI governance tooling for organizations deploying hundreds of prompts across multiple products.
  • AI infrastructure startups building prompt management platforms as a service (PromptLayer, Langfuse, Helicone, Agenta, Braintrust).
  • Langfuse: open-source LLM observability with tracing, prompt management, and evaluation.
  • OpenLLMetry (Traceloop): OpenTelemetry-based observability that integrates with existing APM tools.
  • Agenta: open-source prompt management with evaluation and deployment pipelines.
  • LiteLLM: LLM gateway with model routing, fallback, and cost tracking.

9.2 Interview Relevance

  • Demonstrates end-to-end systems thinking: integrating multiple subsystems into a coherent platform.
  • Shows SRE discipline: SLOs, error budgets, incident management, game day drills.
  • Proves production readiness: quality gates, automated remediation, audit trails, compliance.
  • Demonstrates leadership capability: designing processes (postmortems, drill programs) not just code.

10. Resources

10.1 Essential Reading

  • “Site Reliability Engineering” by Google - Chapters on monitoring, alerting, incident management, SLOs, and change management.
  • “Designing Data-Intensive Applications” by Martin Kleppmann - Chapters on reliability, distributed systems coordination, and data integration.
  • “AI Engineering” by Chip Huyen - Chapters on evaluation, monitoring, and production deployment.
  • “Accelerate” by Forsgren et al. - Chapters on delivery performance and lean management.

10.2 Video Resources

  • Google Cloud Next talks on SRE practices and error budgets.
  • Netflix engineering talks on chaos engineering and resilience testing.
  • Conference talks on LLM observability and production AI operations.

10.3 Tools & Documentation

  • Langfuse documentation (open-source LLM observability).
  • OpenTelemetry specification (distributed tracing and metrics).
  • Prometheus and Grafana documentation (metrics and dashboards).
  • PagerDuty incident response guide (incident management best practices).
  • Project 1 (Prompt Contract Harness): provides the contract evaluation that feeds the compliance quality gate.
  • Project 3 (Prompt Injection Red-Team Lab): provides the security test suite for the security quality gate.
  • Project 7 (Temperature Sweeper): provides performance benchmarking methodology for the performance quality gate.
  • Project 11 (Canary Prompt Rollout Controller): provides the traffic management that the remediation engine controls.
  • Project 13 (Tool Permission Firewall): provides the policy enforcement integrated into the decision engine.
  • Project 14 (Adversarial Eval Forge): provides adversarial test cases for the security quality gate.
  • Project 15 (Prompt Registry + Versioning Service): provides the prompt artifact management that the configuration subsystem depends on.
  • Project 16 (Human-in-the-Loop Escalation Queue): provides the escalation path for incidents requiring human judgment.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the control-plane / data-plane split and why it matters for PromptOps.
  • I can describe how cross-functional quality gates aggregate multiple evaluator verdicts into a release decision.
  • I can define SLOs, error budgets, and burn rates for a prompt-driven system.
  • I can design game day drill scenarios with measurable success criteria.
  • I can describe the four layers of operational resilience (detection, diagnosis, remediation, learning).

11.2 Implementation

  • The configuration API accepts desired state declarations and propagates them to the decision engine.
  • Quality gates correctly block or allow promotions based on evaluator verdicts and composition rules.
  • The observation stack detects metric breaches and fires alerts within configured thresholds.
  • Automated remediation playbooks execute rollback/traffic shift when triggered by alerts.
  • At least two drill scenarios run end-to-end with measured MTTD and MTTR.
  • The operator dashboard shows live platform state with actionable buttons.

11.3 Growth

  • I can explain the tradeoffs in my subsystem communication design (event bus vs polling vs shared state).
  • I can describe how this platform would scale to support multiple teams and hundreds of prompts.
  • I can present this project as a capstone achievement in an interview, connecting it to all prior projects.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Control-plane architecture with configuration, decision, and observation subsystems working end-to-end.
  • At least one quality gate evaluator producing typed verdicts that gate promotions.
  • At least one drill scenario executing with measured MTTD and MTTR.
  • Operator dashboard showing platform state.

Full Completion:

  • All four control-plane subsystems (configuration, decision, observation, remediation) integrated.
  • Cross-functional quality gates with at least three evaluators and configurable composition rules.
  • At least two drill scenarios with automated remediation and postmortem documentation.
  • SLO tracking with error budget visualization.
  • Full audit trail across all subsystem actions.

Excellence (Above & Beyond):

  • Multi-tenant support with per-team configurations and dashboards.
  • Temporal gates integrated with canary rollouts.
  • Compliance report generator from audit trail data.
  • Predictive alerting or anomaly detection on metric time series.
  • Demonstrated game day exercise with measured results, postmortem, and improvement actions implemented.