Project 1: Prompt Contract Harness
A prompt test report with pass/fail results by invariant, trend deltas, and a release recommendation.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | See main guide estimates (typically 3-8 days except capstone) |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Go |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | 4. The Open Core Infrastructure |
| Knowledge Area | PromptOps / Testing |
| Software or Tool | CLI harness + validators + reports |
| Main Book | Site Reliability Engineering (Google) |
| Concept Clusters | Prompt Contracts and Output Typing; Evaluation, Rollouts, and Governance |
1. Learning Objectives
By completing this project, you will:
- Translate prompt requirements into explicit contracts with typed fields, semantic invariants, and failure envelopes.
- Implement deterministic checks around probabilistic model outputs using a layered validation pipeline.
- Design fixture suites that stratify test cases by risk label, difficulty, and business impact.
- Measure quality using reproducible eval artifacts with trend deltas across prompt revisions.
- Build release gates that compute promotion recommendations from configurable pass-rate thresholds and critical-failure counts.
- Document failure modes with machine-readable reason codes that drive automated retry, escalation, and rollback decisions.
2. All Theory Needed (Per-Concept Breakdown)
Concept A: Contract and Invariant Design for LLM Outputs
Fundamentals
A prompt contract is a formal specification of what a model output must look like and must contain for the output to be considered correct. It is the equivalent of an API contract or a database schema, but applied to the non-deterministic boundary between your application and a language model. Without a contract, you cannot distinguish between a valid response and one that merely looks plausible. Contracts define three things: the structural shape (required fields, types, nesting), the semantic invariants (business rules that must hold regardless of input), and the failure envelope (how invalid outputs are represented so that downstream systems can route them deterministically). In production LLM systems, the contract is the single source of truth for what “correct” means. Every validator, every metric, every release gate derives its authority from the contract. If the contract is vague or informal, your entire test harness inherits that ambiguity, and you cannot distinguish a regression from normal variance.
Deep Dive into the concept
Contract design for LLM outputs draws on three traditions: API contract testing from service-oriented architecture, property-based testing from functional programming, and schema evolution from data engineering.
From API contract testing, you inherit the idea that producers and consumers agree on an interface independently of implementation. In the LLM context, the “producer” is the model and its prompt, and the “consumer” is whatever downstream system parses the output. Consumer-driven contracts are especially valuable: the downstream team specifies exactly which fields they read, and the contract test verifies only those fields. This prevents over-specification, where your tests break because the model added a helpful-but-unexpected field, and under-specification, where critical fields are missing but no test catches it.
From property-based testing, you inherit the idea of invariants: properties that must hold across all valid inputs, not just the specific examples in your fixture suite. For a customer support classifier, an invariant might be “if the input contains a suicide-risk keyword, the output MUST set escalation_required to true regardless of confidence score.” Invariants encode safety boundaries, regulatory requirements, and business-critical logic. They are more powerful than example-based tests because they generalize: a fixture suite of 120 cases cannot cover every input, but an invariant like “output.confidence must be between 0.0 and 1.0” applies to all future inputs without needing new fixtures.
From data engineering schema evolution, you inherit the idea that contracts change over time and those changes must be managed. Adding an optional field to the output is a backward-compatible change: old consumers ignore it. Removing a required field or changing a field type is a breaking change: old consumers crash. Your harness must detect breaking changes by running the new contract against the old fixture baseline and reporting incompatibilities before deployment.
The failure envelope is the most underappreciated part of contract design. When an output fails validation, the harness must produce a structured failure object that includes: which check failed (schema, semantic, policy), the specific field or invariant that was violated, the severity (critical vs degraded), a machine-readable reason code (SCHEMA_FAIL, SEMANTIC_DRIFT, POLICY_BLOCK, ABSTAIN_OK), and a routing decision (retry, escalate, dead-letter). Without structured failure objects, debugging regressions becomes a manual log-reading exercise. With them, your monitoring dashboard can aggregate failure patterns automatically and your retry logic can make cost-aware decisions (e.g., retry schema failures but escalate policy blocks immediately).
Output typing is the mechanism that makes contracts enforceable at the code level. Each expected output gets a type definition (in Pydantic, Zod, or equivalent) that the validator instantiates from the raw model response. If instantiation fails, the failure is structural. If it succeeds but invariant checks fail, the failure is semantic. This two-phase validation (parse then check) is critical because it prevents a common trap: trying to evaluate semantic properties of a response that did not even parse correctly, which produces confusing error messages that conflate structural and semantic issues.
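The two-phase flow described above can be sketched in Python. This is a minimal stdlib-only illustration (a real harness would use Pydantic or Zod, which enforce field types on instantiation, as noted above); the `SupportTicket` type, `validate` function, and the specific safety keyword are hypothetical names chosen for the example.

```python
import json
from dataclasses import dataclass

# Illustrative contract type; a production harness would use Pydantic/Zod,
# which also enforce field types on instantiation (dataclasses do not).
@dataclass
class SupportTicket:
    category: str
    confidence: float
    escalation_required: bool
    summary: str

CATEGORIES = {"billing", "technical", "account", "safety"}
FIELD_TYPES = {"category": str, "confidence": float,
               "escalation_required": bool, "summary": str}

def validate(raw: str, input_text: str) -> dict:
    # Phase 1: output typing (structural). Any parse failure is SCHEMA_FAIL.
    try:
        data = json.loads(raw)
        for field, ftype in FIELD_TYPES.items():
            if not isinstance(data[field], ftype):
                raise TypeError(f"bad type for {field}")
        ticket = SupportTicket(**{f: data[f] for f in FIELD_TYPES})
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        return {"status": "FAIL", "reason_code": "SCHEMA_FAIL", "detail": str(e)}
    # Phase 2: semantic invariants, evaluated only on successfully parsed outputs.
    if ticket.category not in CATEGORIES or not 0.0 <= ticket.confidence <= 1.0:
        return {"status": "FAIL", "reason_code": "SEMANTIC_DRIFT",
                "violated_field": "category/confidence"}
    if "end it all" in input_text and not ticket.escalation_required:
        return {"status": "FAIL", "reason_code": "POLICY_BLOCK",
                "violated_field": "escalation_required"}
    return {"status": "SUCCESS", "reason_code": None}
```

Note how structural and semantic failures get distinct reason codes: a caller can route a SCHEMA_FAIL to retry while escalating a POLICY_BLOCK immediately.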
A well-designed contract for a support ticket classifier might look like this:
```yaml
SupportTicketContract:
  required_fields:
    - category: enum[billing, technical, account, safety]
    - confidence: float, range [0.0, 1.0]
    - escalation_required: bool
    - summary: string, min_length 10, max_length 500
  invariants:
    - IF input contains safety keywords THEN escalation_required == true
    - IF confidence < 0.3 THEN routing_action == "human_review"
    - category != null (no abstention on classification)
  failure_envelope:
    - reason_code: enum[SCHEMA_FAIL, SEMANTIC_DRIFT, POLICY_BLOCK, ABSTAIN_OK]
    - severity: enum[critical, degraded, info]
    - violated_field: string
    - routing: enum[retry, escalate, dead_letter, pass]
  version: "1.3.0"
  backward_compatible_with: "1.2.x"
```
Versioning contracts is non-negotiable. Every contract change gets a semantic version. The harness runs compatibility checks by validating the new contract definition against the baseline fixture dataset from the previous version. If any previously-passing case now fails, the change is flagged as potentially breaking and requires explicit approval before promotion.
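The compatibility check described above can be sketched as follows. `check_compatibility` and `new_validate` are hypothetical names; the example models a breaking change from the schema-evolution discussion (tightening `confidence` from float to integer).

```python
import json

# Sketch of the compatibility gate: re-run the new contract's validator over
# the fixtures that passed under the previous version, and flag the change as
# breaking if any previously-passing case now fails.
def check_compatibility(new_validate, baseline_passing):
    """baseline_passing: list of (case_id, raw_output) pairs that passed
    under the old contract version."""
    regressions = [case_id for case_id, raw in baseline_passing
                   if new_validate(raw)["status"] != "SUCCESS"]
    return {"compatible": not regressions, "newly_failing": regressions}

# Hypothetical new contract that requires "confidence" to be an integer --
# a breaking change, since existing fixtures carry float values.
def new_validate(raw):
    data = json.loads(raw)
    ok = isinstance(data.get("confidence"), int)
    return {"status": "SUCCESS" if ok else "FAIL"}

report = check_compatibility(new_validate, [("case-1", '{"confidence": 0.85}')])
# report["compatible"] is False: the float fixture no longer passes.
```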
How this fits into the project
This concept is the foundation of Project 1. You will define contracts for your fixture suite, implement the validation pipeline that enforces them, build the failure envelope that routes results, and version your contracts so that changes are tracked across harness runs.
Definitions & key terms
- Prompt contract: A formal, versioned specification of required fields, types, value constraints, and semantic invariants that a model output must satisfy.
- Invariant: A property that must hold true across all valid outputs, regardless of input. Invariants encode safety boundaries and business rules.
- Failure envelope: A structured error object that includes the reason code, severity, violated field, and routing decision for a failed validation.
- Reason code: A machine-readable enumeration value (e.g., SCHEMA_FAIL, POLICY_BLOCK) that enables automated routing of failures.
- Consumer-driven contract: A contract specified by the downstream consumer of model outputs, ensuring tests verify exactly what the consumer needs.
- Schema evolution: The practice of managing contract changes with semantic versioning and backward-compatibility checks.
- Output typing: Translating raw model text into a typed data structure (via Pydantic, Zod, etc.) as the first validation step.
Mental model diagram (ASCII)
+---------------------------+
| CONTRACT DEFINITION |
| (versioned, typed, with |
| invariants + envelope) |
+-------------+-------------+
|
+-------------------+-------------------+
| | |
v v v
+----------------+ +----------------+ +------------------+
| PHASE 1: PARSE | | PHASE 2: CHECK | | PHASE 3: ROUTE |
| Output Typing | | Invariants | | Failure Envelope |
| raw text -> | | semantic rules | | reason_code -> |
| typed object | | safety gates | | retry/escalate/ |
| (Pydantic/Zod) | | business logic | | dead-letter/pass |
+-------+--------+ +-------+--------+ +--------+---------+
| | |
v v v
SCHEMA_FAIL if SEMANTIC_DRIFT if Routing decision
parse fails invariant violated per failure type
| | |
+--------------------+---------------------+
|
v
+---------------------------+
| VALIDATION RESULT |
| status + reason_code + |
| severity + trace_id |
+---------------------------+
How it works (step-by-step, with invariants and failure modes)
- Load the contract definition from a versioned config file. The contract specifies required fields with types, value constraints (enums, ranges, min/max lengths), semantic invariants (conditional rules), and the failure envelope schema. Invariant: the contract file itself must parse without errors before any test case runs. Failure mode: malformed contract config halts the harness with a clear error pointing to the config line.
- Load the fixture suite containing test cases with input prompts, context data, expected output shapes, and risk labels. Invariant: every fixture case must have a unique ID, a risk label, and an expected outcome category. Failure mode: duplicate case IDs or missing required fixture fields cause a suite validation error before any model calls.
- Execute the prompt for each fixture case (or load pre-recorded responses for deterministic replay). Attach a trace ID, timestamp, prompt version, and contract version to each execution.
- Phase 1 - Output typing: Attempt to parse the raw model response into the typed structure defined by the contract. If parsing fails (missing required fields, type mismatches, malformed JSON), emit a SCHEMA_FAIL result with the specific parse error. Do NOT attempt semantic checks on unparseable outputs. Invariant: a SCHEMA_FAIL always includes the specific field that caused the parse error.
- Phase 2 - Invariant checking: For successfully parsed outputs, evaluate every semantic invariant defined in the contract. Check conditional rules (if X then Y), range constraints, cross-field consistency, and policy gates. If any invariant fails, emit a SEMANTIC_DRIFT or POLICY_BLOCK result depending on severity. Invariant: policy-level invariants (safety, compliance) are always severity “critical” and cannot be downgraded.
- Phase 3 - Failure routing: Based on the reason code and severity, determine the routing action: retry (for transient schema failures), escalate (for policy blocks), dead-letter (for unrepairable outputs), or pass (for successful validations). Invariant: critical failures never route to retry; they always escalate or dead-letter.
- Aggregate results across all fixture cases. Compute pass rates by category, trend deltas against the previous baseline, and the release recommendation (PROMOTE, PROMOTE_WITH_CANARY, HOLD, ROLLBACK).
- Persist artifacts: Write per-case traces, the summary report, and the release recommendation to the output directory. Invariant: every artifact includes the contract version, prompt version, and run timestamp for reproducibility.
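The Phase 3 routing step can be sketched as a small lookup with the critical-failure invariant enforced in code. The table values are illustrative policy choices, not a prescribed configuration.

```python
# Sketch of reason-code-driven routing. The (reason_code, severity) table is
# an example policy; a real harness would load it from the contract config.
ROUTING = {
    ("SCHEMA_FAIL", "degraded"): "retry",
    ("SEMANTIC_DRIFT", "degraded"): "dead_letter",
    ("POLICY_BLOCK", "critical"): "escalate",
    ("ABSTAIN_OK", "info"): "pass",
}

def route(reason_code, severity):
    action = ROUTING.get((reason_code, severity), "dead_letter")
    # Invariant from the steps above: critical failures never route to retry;
    # they always escalate or dead-letter.
    if severity == "critical" and action == "retry":
        action = "escalate"
    return action
```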
Minimal concrete example
Fixture case (YAML):
```yaml
- id: "ticket-042"
  risk_label: "high"
  prompt: "Classify this support ticket: 'I want to cancel everything and end it all'"
  context: { customer_tier: "premium" }
  expected:
    category: "safety"
    escalation_required: true
```
Validation trace (JSON):
```json
{
  "trace_id": "run-20260115-042",
  "case_id": "ticket-042",
  "contract_version": "1.3.0",
  "prompt_version": "support-v7",
  "phase1_parse": "OK",
  "phase2_invariants": [
    { "rule": "safety_keyword_escalation", "status": "PASS" },
    { "rule": "confidence_range", "status": "PASS", "value": 0.94 }
  ],
  "phase3_routing": "PASS",
  "final_status": "SUCCESS",
  "reason_code": null
}
```
Release summary:
```json
{
  "total_cases": 120,
  "pass_rate": 0.983,
  "critical_failures": 0,
  "schema_failures": 1,
  "semantic_drifts": 1,
  "policy_blocks": 0,
  "trend_delta": "+0.8% vs baseline",
  "recommendation": "PROMOTE_WITH_CANARY"
}
```
Common misconceptions
- “If the JSON parses, the output is correct.” Structural validity (schema pass) says nothing about semantic correctness. A classifier that returns `{"category": "billing", "confidence": 0.99}` for a safety-critical input passes schema validation but fails the safety invariant. Contract testing requires both phases.
- “More fixture cases always means better coverage.” A suite of 1,000 easy cases provides worse coverage than 100 cases stratified across risk labels, edge cases, and failure modes. Fixture quality matters more than fixture quantity.
- “Retrying always helps.” Retrying a schema failure (missing bracket) often works because the error is random. Retrying a semantic failure (wrong classification) rarely helps because the model consistently misinterprets the input. Your routing logic must distinguish these cases.
- “Contracts are write-once.” Contracts evolve as requirements change, new failure modes are discovered, and downstream consumers add new field dependencies. Without versioning and compatibility checks, contract changes silently break production systems.
- “A single pass rate is sufficient for release decisions.” Aggregate pass rate hides critical failures. A 98% pass rate with 2 policy-block failures on safety cases is worse than a 95% pass rate with all failures in low-risk categories. Release gates must check both aggregate and per-severity metrics.
Check-your-understanding questions
- Why should output typing (parsing into a typed structure) happen before invariant checking, rather than checking everything at once?
- What is the difference between a consumer-driven contract and a producer-driven contract, and why does it matter for prompt testing?
- A contract change adds an optional field “suggested_action” to the output schema. Is this backward-compatible? What about changing the “confidence” field from float to integer?
- Your fixture suite has 120 cases and achieves 99% pass rate. The 1 failure is a POLICY_BLOCK on a safety case. Should the release gate recommend PROMOTE? Why or why not?
- How do reason codes enable automated retry logic that is more cost-effective than blind retries?
Check-your-understanding answers
- If you check invariants on unparseable output, you get confusing errors like “confidence must be in range [0,1]” when the real problem is that the entire JSON was malformed. Separating phases means SCHEMA_FAIL errors point to structural issues and SEMANTIC_DRIFT errors point to logic issues, making debugging faster and routing decisions clearer.
- A consumer-driven contract is specified by the team that reads the model output, ensuring tests verify exactly what they need. A producer-driven contract is specified by the team that writes the prompt. Consumer-driven contracts prevent over-specification (testing fields nobody reads) and under-specification (missing fields that consumers depend on). For prompt testing, consumer-driven contracts are preferred because downstream breakage is the actual risk.
- Adding an optional field is backward-compatible: old consumers ignore it, old fixtures still pass. Changing confidence from float to integer is a BREAKING change: existing fixtures with values like 0.85 will fail parsing, and downstream systems expecting float division will break. The harness should flag this during compatibility checks.
- No. A POLICY_BLOCK on a safety case is a critical failure. The release gate should recommend HOLD or ROLLBACK regardless of aggregate pass rate. The safety invariant exists precisely to prevent promotion when safety-critical cases fail.
- Reason codes let retry logic make targeted decisions: retry SCHEMA_FAIL (likely transient), escalate POLICY_BLOCK (model consistently wrong on this class), dead-letter SEMANTIC_DRIFT with low confidence (unlikely to improve with retries). Blind retries waste tokens on failures that will not improve, while reason-code-driven routing retries only the cases with a realistic chance of improvement.
Real-world applications
- Customer support platforms: Companies like Intercom and Zendesk use contract-like validation to ensure AI-generated responses meet tone requirements, include required disclaimers, and correctly route escalation-worthy tickets before sending them to customers.
- Financial document extraction: Banks and fintech firms validate extracted data (amounts, dates, account numbers) against strict schemas with cross-field invariants (e.g., total must equal sum of line items) before feeding results into accounting systems.
- Healthcare triage systems: Medical AI assistants must satisfy invariants like “never recommend medication dosage without flagging for physician review” and “always escalate if symptom combination matches emergency pattern.” These are encoded as contract invariants that block promotion if any case fails.
- Compliance and legal review: Contract testing ensures that AI-generated legal summaries include mandatory clauses, do not omit risk disclosures, and flag ambiguous language for human review.
- E-commerce search and recommendation: Product classification models must satisfy category taxonomy constraints, price range validity, and availability status invariants before results reach the search index.
Where you’ll apply it
- In every phase of this project: contract definition (Phase 1), validation pipeline (Phase 2), and release gate computation (Phase 3).
- Reused in the capstone project (P18) where multiple contracts from different projects are composed into a unified validation pipeline.
References
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 (Encoding and Evolution) covers schema evolution, backward/forward compatibility, and contract versioning strategies.
- “Site Reliability Engineering” by Google - Ch. 4-6 covers SLOs, error budgets, and monitoring that directly inform release gate design.
- “AI Engineering” by Chip Huyen - Chapters on evaluation and testing for LLM systems.
- OpenAI Structured Outputs documentation: https://developers.openai.com/api/docs/guides/structured-outputs
- Anthropic Claude Structured Outputs: https://platform.claude.com/docs/en/build-with-claude/structured-outputs
- promptfoo project for declarative prompt evaluation: https://github.com/promptfoo/promptfoo
- NIST AI Risk Management Framework (AI RMF) for safety and governance taxonomy.
Key insights
A prompt contract converts the question “does this output look right?” into the testable question “does this output satisfy every structural, semantic, and policy invariant in the versioned contract?” That conversion is what makes quality measurable.
Summary
Contract and invariant design is the foundation of reliable prompt systems. A contract specifies the structural shape (via output typing), semantic rules (via invariants), and failure handling (via the failure envelope with reason codes). Versioning contracts enables safe evolution. Separating parse-phase and check-phase validation produces clear diagnostics. The contract is the single source of truth for every downstream component: validators, metrics, release gates, and retry logic.
Homework/Exercises to practice the concept
- Design a contract for a product review summarizer. The model takes a product review and outputs a structured summary. Define at least 5 required fields with types and constraints, 3 semantic invariants, and the failure envelope. Include at least one cross-field invariant (where the validity of one field depends on another).
- Classify these contract changes as backward-compatible or breaking:
- Adding an optional “tags” array field
- Changing “rating” from integer (1-5) to float (0.0-5.0)
- Removing the “deprecated_field” that no consumer reads
- Renaming “category” to “product_category”
- Adding a new enum value to an existing enum field
- Write a failure routing table. Given reason codes SCHEMA_FAIL, SEMANTIC_DRIFT, POLICY_BLOCK, and ABSTAIN_OK, define the routing action (retry, escalate, dead-letter, pass), max retry count, and escalation target for each. Justify why POLICY_BLOCK should never route to retry.
- Trace a fixture case through the full validation pipeline. Given a raw model response that has valid JSON but violates a safety invariant, write out the step-by-step trace showing what happens at each phase (parse, check, route) including the final failure object.
Solutions to the homework/exercises
- A strong contract includes fields like `summary` (string, 20-200 chars), `sentiment` (enum: positive/negative/mixed), `rating_mentioned` (bool), `key_themes` (array of strings, 1-5 items), and `confidence` (float, 0.0-1.0). Invariants: if `sentiment` is “negative” and `rating_mentioned` is true, the `summary` must reference the issue; `confidence` < 0.3 must route to human review; `key_themes` must not be empty. Cross-field: if `rating_mentioned` is false, any mention of star ratings in `summary` is a SEMANTIC_DRIFT. The failure envelope follows the standard reason code enum with severity mapping.
- Adding optional “tags”: backward-compatible. Changing the “rating” type: BREAKING (existing fixtures with integer values may still parse, but consumers expecting integer semantics will break; the range change is also a semantic change). Removing the unused field: backward-compatible IF it is verified that no consumer reads it (which requires a consumer-driven contract audit). Renaming “category”: BREAKING (all consumers referencing the old field name will fail). Adding an enum value: backward-compatible for producers, potentially breaking for consumers with exhaustive switch statements.
- SCHEMA_FAIL: retry (max 2), because structural errors are often transient (the model forgot a bracket). SEMANTIC_DRIFT: dead-letter if confidence is low; retry once if confidence is high (the model may self-correct with a reminder). POLICY_BLOCK: escalate immediately to human review, NEVER retry, because policy violations indicate the model fundamentally misunderstands the safety constraint, and retrying wastes tokens while leaving the unsafe output in the pipeline longer. ABSTAIN_OK: pass (abstention is a valid response when the model correctly identifies it cannot answer safely).
- Phase 1 (Parse): the raw JSON parses successfully into a typed object; all required fields are present and types match. Result: PARSE_OK. Phase 2 (Check): the invariant “safety_keyword_escalation” evaluates the input, finds a safety keyword present, checks the `escalation_required` field, and finds it set to `false`. The invariant FAILS. Severity: critical. Phase 3 (Route): reason_code = POLICY_BLOCK, severity = critical, routing = ESCALATE. Final failure object: `{"trace_id": "...", "case_id": "...", "status": "FAIL", "reason_code": "POLICY_BLOCK", "severity": "critical", "violated_invariant": "safety_keyword_escalation", "violated_field": "escalation_required", "routing": "ESCALATE"}`.
Concept B: Evaluation Regression Testing and Release Discipline
Fundamentals
Regression testing for LLM outputs means running the same fixture suite against every prompt revision and comparing results to a known-good baseline. Unlike traditional software where tests are deterministic (same input always produces same output), LLM outputs are probabilistic: the same prompt and input can produce different outputs across runs. This fundamental difference means that regression testing for prompts must use statistical methods rather than exact-match comparisons. You cannot assert that the output equals a specific string; instead, you assert that the pass rate for each invariant category stays within an acceptable range compared to the baseline. Release discipline builds on regression testing by defining formal gates that determine whether a prompt revision should be promoted to production, held for review, or rolled back. These gates use configurable thresholds (e.g., “critical failures must be zero,” “overall pass rate must be >= 95%,” “no category may regress by more than 2%”) to convert regression test results into an automated promotion decision.
Deep Dive into the concept
Production LLM systems face a unique regression risk: prompt changes that improve one category of outputs often degrade another. A prompt revision that increases accuracy on billing questions might simultaneously reduce accuracy on technical questions because the new phrasing biases the model toward financial interpretation. This phenomenon, sometimes called “prompt whack-a-mole,” is why per-category regression tracking is essential, not just aggregate pass rate.
The regression testing pipeline has five stages:
Stage 1: Baseline establishment. Run the current production prompt against the full fixture suite and record per-case results, per-category pass rates, and per-invariant pass rates. This baseline is your reference point. Store it as a versioned artifact alongside the prompt version and contract version.
Stage 2: Candidate evaluation. Run the candidate prompt (the one you want to deploy) against the same fixture suite with the same seed (for reproducibility) and the same contract. Record results in the same format as the baseline.
Stage 3: Regression analysis. Compare candidate results to baseline results. Compute: overall pass rate delta, per-category pass rate deltas, per-invariant pass rate deltas, new failures (cases that passed in baseline but fail in candidate), and resolved failures (cases that failed in baseline but pass in candidate). Flag any category where pass rate decreased by more than the configurable regression threshold.
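The Stage 3 comparison can be sketched as a pure function over per-case result maps. This is a minimal illustration; `regression_analysis` is a hypothetical name, and results are modeled as `{case_id: (category, passed)}`.

```python
# Sketch of Stage 3: per-category pass-rate deltas plus new/resolved failures.
def regression_analysis(baseline, candidate, threshold=0.02):
    """baseline/candidate: dicts of {case_id: (category, passed_bool)}."""
    def rates(results):
        totals, passes = {}, {}
        for category, passed in results.values():
            totals[category] = totals.get(category, 0) + 1
            passes[category] = passes.get(category, 0) + int(passed)
        return {c: passes[c] / totals[c] for c in totals}

    base_rates, cand_rates = rates(baseline), rates(candidate)
    deltas = {c: cand_rates[c] - base_rates.get(c, 0.0) for c in cand_rates}
    return {
        "deltas": deltas,
        # Categories that regressed beyond the configurable threshold.
        "regressed": [c for c, d in deltas.items() if d < -threshold],
        # Cases that passed in baseline but fail in candidate.
        "new_failures": [cid for cid in candidate
                         if not candidate[cid][1]
                         and baseline.get(cid, (None, False))[1]],
        # Cases that failed in baseline but pass in candidate.
        "resolved": [cid for cid in candidate
                     if candidate[cid][1]
                     and cid in baseline and not baseline[cid][1]],
    }
```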
Stage 4: Release gate evaluation. Apply the promotion policy to the regression analysis:
- If critical_failures > 0: ROLLBACK (never promote with safety failures)
- If any_category_regressed > threshold: HOLD (investigate before promoting)
- If overall_pass_rate < minimum: HOLD
- If overall_pass_rate >= minimum AND no_regressions AND critical_failures == 0: PROMOTE
- If overall_improved but minor_regressions exist: PROMOTE_WITH_CANARY (deploy to a small percentage of traffic first)
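The Stage 4 policy above can be sketched as a single function that evaluates the rules in priority order; the threshold defaults mirror the example policy and are illustrative, not prescriptive.

```python
# Sketch of the Stage 4 release gate. Rules evaluate in priority order:
# critical failures veto everything, then large regressions, then thresholds.
def release_gate(critical_failures, pass_rate, category_deltas,
                 min_pass_rate=0.95, max_regression=0.02):
    if critical_failures > 0:
        return "ROLLBACK"          # never promote with safety failures
    regressed = [c for c, d in category_deltas.items() if d < -max_regression]
    if regressed or pass_rate < min_pass_rate:
        return "HOLD"              # investigate before promoting
    minor = [c for c, d in category_deltas.items() if -max_regression <= d < 0]
    if minor:
        return "PROMOTE_WITH_CANARY"  # small regressions: canary first
    return "PROMOTE"
```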
Stage 5: Canary monitoring. If the release gate recommends PROMOTE_WITH_CANARY, deploy the candidate prompt to a small traffic slice (typically 5-10%) and monitor live metrics for a burn-in period (typically 1-24 hours). Compare live pass rates to the offline evaluation. If live metrics match or exceed offline predictions, promote to full traffic. If live metrics degrade, roll back automatically.
Trend reporting is the mechanism that makes regression testing useful over time. Each harness run produces a summary artifact. Trend reports aggregate these summaries across the last N runs (or the last N prompt versions) and visualize pass rate trajectories per category and per invariant. Trend reports surface slow degradation that individual run comparisons miss: a 0.5% regression per revision is invisible in any single comparison but represents a 5% degradation over 10 revisions.
The SRE concept of error budgets applies directly to prompt release discipline. Define an error budget for each invariant category (e.g., “safety invariants may fail on at most 0.1% of cases per month”). When the error budget is exhausted, all prompt changes to that category are frozen until the failure rate recovers. This creates a natural feedback loop: teams that ship risky prompt changes burn their error budget and lose the ability to ship more changes until quality recovers.
Fixture suite design is critical for regression testing quality. A fixture suite must be stratified across multiple dimensions: risk level (high/medium/low), input difficulty (simple/complex/adversarial), output category (each classification label), and edge case type (boundary values, ambiguous inputs, multi-label inputs). Without stratification, your pass rate is dominated by easy cases and masks failures on hard cases. A well-stratified suite of 120 cases provides better regression signal than an unstratified suite of 1,000.
Fixture Suite Stratification:
```text
Risk Level:   High (20%)     Medium (40%)   Low (40%)
Difficulty:   Simple (33%)   Complex (33%)  Adversarial (33%)
Categories:   billing (25%)  technical (25%)  account (25%)  safety (25%)
Edge Cases:   At least 10% of suite must be known-hard cases

Minimum viable suite: 120 cases
Each cell in the stratification matrix should have >= 3 cases
```
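A suite-quality check for these stratification targets can be sketched as follows; `check_stratification` is a hypothetical helper, and cases are modeled as dicts with `risk_label` and `category` keys.

```python
from collections import Counter

# Sketch of a fixture-suite quality gate: enforce the minimum suite size and
# the "each risk/category cell has >= 3 cases" rule from the spec above.
def check_stratification(cases, min_size=120, min_per_cell=3):
    issues = []
    if len(cases) < min_size:
        issues.append(f"suite too small: {len(cases)} < {min_size}")
    cells = Counter((c["risk_label"], c["category"]) for c in cases)
    for cell, count in cells.items():
        if count < min_per_cell:
            issues.append(f"cell {cell} has only {count} cases")
    return issues  # empty list means the suite passes the check
```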
How this fits into the project
This concept drives the regression testing, trend reporting, and release gate components of Project 1. You will build baseline comparison logic, per-category regression detection, release gate computation, and trend artifact generation.
Definitions & key terms
- Regression testing (LLM): Running the same fixture suite against prompt revisions and comparing results to a baseline using statistical thresholds rather than exact-match assertions.
- Baseline: The known-good result set from the current production prompt version, used as the reference point for regression comparison.
- Release gate: A policy-driven decision point that evaluates regression test results against configurable thresholds to recommend PROMOTE, HOLD, or ROLLBACK.
- Canary deployment: Deploying a prompt revision to a small percentage of traffic before full rollout, with automated rollback if live metrics degrade.
- Error budget: The maximum allowable failure rate for an invariant category over a time period; when exhausted, changes are frozen.
- Trend delta: The difference in pass rate between the current run and the baseline (or the previous N runs), used to detect slow degradation.
- Fixture stratification: Designing the test suite so that cases are distributed across risk levels, difficulty tiers, and output categories to prevent easy-case bias.
Mental model diagram (ASCII)
PROMPT REVISION LIFECYCLE
+-------------------+
| Author writes |
| new prompt v8 |
+--------+----------+
|
v
+-------------------+ +-------------------+
| Run fixture suite |------->| Load baseline |
| with candidate v8 | | from prod v7 |
+--------+----------+ +--------+----------+
| |
v v
+---------------------------------------------------+
| REGRESSION ANALYSIS |
| |
| Overall pass rate: v8=97.5% v7=96.7% delta=+0.8%|
| Safety category: v8=100% v7=100% delta=0% |
| Billing category: v8=95% v7=97% delta=-2% |
| New failures: 2 cases (billing edge cases) |
| Resolved failures: 4 cases (technical improvements)|
| Critical failures: 0 |
+------------------------+----------------------------+
|
v
+---------------------------------------------------+
| RELEASE GATE EVALUATION |
| |
| Rule 1: critical_failures == 0? YES |
| Rule 2: any_category_regressed > 2%? YES (billing)|
| Rule 3: overall_pass_rate >= 95%? YES |
| |
| Decision: HOLD (billing regression needs review) |
+---------------------------------------------------+
|
v
Author investigates billing regression, fixes prompt,
submits v8.1, re-runs harness...
How it works (step-by-step, with invariants and failure modes)
- Load the baseline artifact from the previous production run. Invariant: the baseline must exist and must reference the same contract version (or a compatible version). Failure mode: missing baseline triggers a first-run mode where no regression comparison is possible, and the gate can only evaluate absolute thresholds.
- Run the candidate evaluation against the full fixture suite. Invariant: the candidate run uses the same fixture suite version as the baseline. Failure mode: fixture suite version mismatch triggers a warning that regression comparisons may be unreliable.
- Compute per-category pass rates for both candidate and baseline. Invariant: every fixture case belongs to exactly one category. Failure mode: uncategorized cases are flagged as a suite quality issue.
- Compute regression deltas by subtracting baseline rates from candidate rates per category. Invariant: negative deltas represent regressions. Failure mode: if the fixture suite changed between runs, delta comparison is invalid and must be reported as such.
- Identify new failures and resolved failures by comparing per-case results. This shows exactly which cases regressed and which improved, enabling targeted investigation.
- Evaluate release gate rules in priority order: critical failures first (absolute veto), then per-category regressions, then aggregate thresholds. Invariant: critical failure check always runs first and cannot be overridden by aggregate metrics.
- Generate the release recommendation (PROMOTE, PROMOTE_WITH_CANARY, HOLD, ROLLBACK) with a human-readable justification that cites the specific rules that triggered the decision.
- Persist the trend artifact that includes this run’s summary alongside the previous N run summaries, enabling trend visualization.
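The priority-ordered gate evaluation above can be sketched as a small function. A minimal sketch: the function name, default thresholds, and justification strings are illustrative, not part of the project's actual configuration schema. Note the inclusive threshold comparison; whether a regression of exactly the threshold triggers HOLD is a deliberate design decision (the exercises in this section explore it).

```python
def evaluate_gates(critical_failures, category_deltas, overall_pass_rate,
                   max_category_regression=0.02, min_overall_pass_rate=0.95):
    """Return (recommendation, justification), evaluating rules in priority order."""
    # Rule 1: critical failures are an absolute veto; no aggregate metric overrides them.
    if critical_failures > 0:
        return "ROLLBACK", f"{critical_failures} critical failure(s)"
    # Rule 2: per-category regressions (inclusive comparison; see lead-in note).
    regressed = sorted(c for c, d in category_deltas.items()
                       if d < 0 and -d >= max_category_regression)
    if regressed:
        return "HOLD", f"category regression at/over threshold: {regressed}"
    # Rule 3: aggregate threshold, checked last.
    if overall_pass_rate < min_overall_pass_rate:
        return "HOLD", f"overall pass rate {overall_pass_rate:.1%} below minimum"
    # Minor, within-threshold regressions still warrant a canary.
    if any(d < 0 for d in category_deltas.values()):
        return "PROMOTE_WITH_CANARY", "healthy overall, minor regressions present"
    return "PROMOTE", "no regressions, all thresholds met"
```

Evaluating Rule 1 first guarantees that a safety regression can never be masked by an improved aggregate pass rate.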
Minimal concrete example
Release gate configuration (YAML):
gates:
  critical_failures_max: 0
  min_overall_pass_rate: 0.95
  max_category_regression: 0.02
  min_fixture_suite_size: 100
  canary_traffic_pct: 0.05
  canary_burn_in_hours: 4
recommendations:
  ROLLBACK: "critical_failures > 0"
  HOLD: "any_category_regressed > max_category_regression"
  PROMOTE_WITH_CANARY: "overall_improved AND minor_regressions"
  PROMOTE: "no_regressions AND pass_rate >= min"
Trend artifact (JSON Lines):
{"run_id": "run-005", "prompt_version": "v5", "pass_rate": 0.943, "critical": 0, "date": "2026-01-10"}
{"run_id": "run-006", "prompt_version": "v6", "pass_rate": 0.951, "critical": 0, "date": "2026-01-12"}
{"run_id": "run-007", "prompt_version": "v7", "pass_rate": 0.967, "critical": 0, "date": "2026-01-15"}
{"run_id": "run-008", "prompt_version": "v8", "pass_rate": 0.975, "critical": 0, "date": "2026-01-18"}
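Slow drift across a trend artifact like the one above can be checked in a few lines. A minimal sketch, assuming the JSON Lines fields shown (the function name and window size are illustrative):

```python
import json

def cumulative_drift(jsonl_text, window=10):
    """Pass-rate change from the oldest to the newest run in the window.
    Catches slow degradation that no single run-over-run delta would flag."""
    runs = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    recent = runs[-window:]
    return recent[-1]["pass_rate"] - recent[0]["pass_rate"]

trend = "\n".join([
    '{"run_id": "run-005", "pass_rate": 0.943}',
    '{"run_id": "run-006", "pass_rate": 0.951}',
    '{"run_id": "run-007", "pass_rate": 0.967}',
    '{"run_id": "run-008", "pass_rate": 0.975}',
])
drift = cumulative_drift(trend)  # positive: the suite improved over the window
```

An alert on cumulative drift (rather than only run-over-run deltas) is what turns the trend artifact into an early-warning signal.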
Common misconceptions
- “A higher overall pass rate always means the revision is better.” Not if the improvement comes from easy cases while hard cases regressed. Per-category analysis is essential.
- “Manual QA is a substitute for automated regression testing.” Manual QA cannot run 120+ cases on every revision, compare to baselines statistically, or catch slow degradation over time. It complements but does not replace automated testing.
- “Canary deployments are optional for prompt changes.” Prompt changes can have subtle, input-dependent effects that offline fixtures do not cover. Canary deployment catches these live-traffic regressions before they affect all users.
- “Once the release gate passes, monitoring can stop.” Post-deployment monitoring must continue because real-world input distributions shift over time (concept drift), and a prompt that passed regression testing last week may degrade as user behavior changes.
- “Flapping release gates mean the thresholds are wrong.” Sometimes flapping indicates genuine instability in the prompt (it performs differently on borderline cases depending on random seed). The fix is to improve the prompt, not to loosen the thresholds.
Check-your-understanding questions
- Why must release gates check critical failures separately from aggregate pass rate?
- How does fixture suite stratification prevent a false sense of quality from high aggregate pass rates?
- What is the purpose of a canary burn-in period, and what metrics should trigger automatic rollback during the burn-in?
- Explain why trend reporting across multiple runs catches problems that single-run regression analysis misses.
- How do error budgets create incentives for teams to prioritize prompt quality?
Check-your-understanding answers
- A 98% aggregate pass rate with 2 critical (safety) failures is much worse than a 95% pass rate with 0 critical failures. Aggregate pass rate averages across all categories, hiding the concentrated impact of a few critical failures. Checking critical failures first ensures that safety-relevant regressions always block promotion regardless of overall metrics.
- Without stratification, easy cases dominate the suite and inflate the pass rate. If 80% of cases are simple and the model gets 100% on those but only 50% on hard cases, the aggregate is 90%, which looks acceptable. Stratification ensures each difficulty level, risk tier, and category has meaningful representation, so regressions on hard cases are visible in the per-category breakdown.
- The canary burn-in period tests the prompt revision against live traffic that may differ from the fixture suite. Metrics that should trigger rollback: critical failure rate exceeding zero, overall pass rate dropping below the offline prediction by more than a configurable margin, latency or cost per request exceeding baseline by more than a threshold, or user-reported quality complaints spiking during the burn-in window.
- Single-run regression analysis compares revision N to revision N-1. If each revision introduces a 0.5% regression in one category, no single comparison triggers the alarm (0.5% is below the 2% threshold). But the trend report shows that category has regressed 5% over 10 revisions, which is a significant degradation. Trend reporting catches slow drift that per-run comparisons miss.
- Error budgets give teams a finite amount of acceptable failures per time period. If a team ships a risky prompt change that consumes their error budget, they cannot ship more changes until the failure rate recovers. This creates a natural incentive to invest in testing and quality before shipping, because burning the budget means losing deployment velocity.
Real-world applications
- Google’s SRE practices apply error budgets to service reliability: when the error budget for a service is exhausted, feature deployments freeze until reliability recovers. The same principle applies to prompt deployments.
- Continuous evaluation platforms like Braintrust, LangSmith, and promptfoo implement regression testing pipelines that compare prompt versions against baseline datasets with configurable thresholds.
- Enterprise AI governance requires audit trails showing that every prompt revision passed a formal release gate before reaching production, with the specific gate rules and evaluation results documented.
- Regulated industries (healthcare, finance, legal) require documented evidence that model behavior was tested against a representative dataset and that safety-critical invariants were verified before deployment.
Where you’ll apply it
- Building the baseline comparison logic and regression analysis engine in Phase 2.
- Implementing the release gate computation in Phase 3.
- Generating trend artifacts and release recommendation reports throughout.
References
- “Site Reliability Engineering” by Google - Ch. 4 (Service Level Objectives) and Ch. 31 (Communication and Collaboration in SRE) for error budget and release gate patterns.
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 (Encoding and Evolution) for versioning and compatibility strategies.
- “AI Engineering” by Chip Huyen - Chapters on evaluation, monitoring, and the evaluation-deployment feedback loop.
- “(Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs” - IEEE/ACM CAIN 2024 proceedings.
- DeepEval framework documentation for LLM regression testing patterns.
- Confident AI blog on LLM testing strategies: https://www.confident-ai.com/blog/llm-testing-in-2024-top-methods-and-strategies
Key insights Regression testing for prompts is not about proving the new version is perfect; it is about proving it is not worse than the current version in any critical dimension, and that any improvements do not come at the cost of regressions in other dimensions.
Summary Evaluation regression testing and release discipline form the operational backbone of prompt contract testing. The pipeline compares candidate prompt results to a known-good baseline across per-category metrics, applies configurable release gates with critical-failure vetoes, and generates trend reports that catch slow degradation. Canary deployments bridge the gap between offline evaluation and live-traffic behavior. Error budgets create organizational incentives for quality.
Homework/Exercises to practice the concept
- Design a release gate configuration for a customer support classifier with four categories (billing, technical, account, safety). Define thresholds for: critical failure max, minimum overall pass rate, maximum per-category regression, and canary traffic percentage. Justify each threshold value.
- Analyze a regression scenario. Given baseline pass rates of billing=97%, technical=94%, account=96%, safety=100% and candidate pass rates of billing=95%, technical=97%, account=96%, safety=100%, determine the release gate recommendation. Show your work for each gate rule.
- Design a fixture stratification plan for a medical triage classifier with 5 urgency levels and 3 risk tiers. How many cases does the minimum viable suite need? Which cells in the stratification matrix need the most cases, and why?
- Build a trend analysis. Given 5 consecutive run summaries where the “technical” category pass rate is: 96%, 95.5%, 95%, 94.5%, 94%, explain what the trend reveals and what action should be taken, even though no single run triggered the 2% regression threshold.
Solutions to the homework/exercises
- For a customer support classifier: critical_failures_max = 0 (safety is non-negotiable), min_overall_pass_rate = 0.94 (allowing some variance on complex cases), max_category_regression = 0.02 (2% per category to allow minor variance while catching meaningful regressions), canary_traffic_pct = 0.05 (5% for 4 hours). Safety category should have a separate, stricter threshold of 0.0% regression tolerance. Justification: safety failures are the highest-liability risk, so they get an absolute veto; other categories allow minor regression to enable iterative improvement.
- Billing regressed 2% (97% -> 95%): this equals the max_category_regression threshold of 2%, so it is a borderline trigger. Technical improved 3%. Account unchanged. Safety unchanged. Critical failures = 0. Overall pass rate improved. If the gate rule is “regressed > 2%” (strict greater-than), billing at exactly 2% does NOT trigger HOLD, and the recommendation is PROMOTE_WITH_CANARY (overall improved with minor regression). If the rule is “>= 2%” (greater-or-equal), billing triggers HOLD. This ambiguity highlights the importance of precise threshold definitions.
- 5 urgency levels times 3 risk tiers = 15 cells. Minimum 3 cases per cell = 45 cases, but high-risk/high-urgency cells need more (at least 10 each) because these are the safety-critical cases where regression has the highest impact. Minimum viable suite: approximately 80-100 cases. The highest-urgency, highest-risk cell should have the most cases because failures there have the greatest patient safety impact.
- The trend shows a consistent 0.5% regression per run in the “technical” category. No single run triggers the 2% threshold, but over 5 runs the category has regressed from 96% to 94% (a 2% total drop). This is exactly the pattern that trend reporting is designed to catch. Action: freeze “technical” category prompt changes, investigate what changed across the 5 revisions, and consider reverting to the version that achieved 96%. Set a trend-based alert that triggers when cumulative regression across N runs exceeds a separate threshold.
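The borderline case in solution 2 is easy to verify numerically. A sketch, with one practical caveat worth encoding: round deltas before comparing, because binary floating point can silently flip a strict comparison at the threshold.

```python
baseline = {"billing": 0.97, "technical": 0.94, "account": 0.96, "safety": 1.00}
candidate = {"billing": 0.95, "technical": 0.97, "account": 0.96, "safety": 1.00}

# Round before comparing: 0.95 - 0.97 is not exactly -0.02 in binary floating
# point, and without rounding a strict ">" would trigger spuriously.
deltas = {c: round(candidate[c] - baseline[c], 6) for c in baseline}
threshold = 0.02

strict_hold = [c for c, d in deltas.items() if -d > threshold]      # exactly-2% passes
inclusive_hold = [c for c, d in deltas.items() if -d >= threshold]  # exactly-2% holds
```

Pinning down which comparison the gate uses, and the rounding policy, belongs in the gate configuration rather than in reviewers' heads.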
3. Project Specification
3.1 What You Will Build
A contract-test harness that validates prompt responses against typed invariants and release thresholds.
3.2 Functional Requirements
- Parse a fixture suite with expected structured outcomes and risk labels.
- Run prompts through deterministic post-validators (schema, policy, semantic checks).
- Emit per-case traces with reason codes and final routing action.
- Generate promotion recommendation based on configurable thresholds.
3.3 Non-Functional Requirements
- Performance: 120-case suite completes in under 4 minutes on a laptop baseline.
- Reliability: Same seed and fixtures always produce the same pass/fail summary.
- Security/Policy: Any policy-critical failure marks the run non-promotable.
3.4 Example Usage / Output
$ uv run p01-harness run --suite fixtures/support_tickets.yaml --seed 42 --out out/p01
[INFO] Loaded suite: support_tickets.yaml (120 cases)
[PASS] schema_valid: 120/120
[PASS] policy_safe: 118/120 (2 correctly abstained)
[PASS] escalation_rules: 17/17
[INFO] Release recommendation: PROMOTE_WITH_CANARY
[INFO] Report written: out/p01/report.json
3.5 Data Formats / Schemas / Protocols
- Input fixture YAML with `prompt`, `context`, `expected_outcome`, and `risk_label` fields.
- Output JSON report with summary, per-case results, and release decision.
- Reason codes enum: `SCHEMA_FAIL`, `SEMANTIC_FAIL`, `POLICY_BLOCK`, `ABSTAIN_OK`.
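The reason codes are what make downstream routing deterministic. One possible enum-to-routing mapping, sketched below; the routing targets are illustrative assumptions, not part of the spec:

```python
from enum import Enum

class ReasonCode(Enum):
    SCHEMA_FAIL = "SCHEMA_FAIL"
    SEMANTIC_FAIL = "SEMANTIC_FAIL"
    POLICY_BLOCK = "POLICY_BLOCK"
    ABSTAIN_OK = "ABSTAIN_OK"

# Illustrative routing table (an assumption): malformed or semantically wrong
# output is retryable, a policy violation escalates and is never retried, and
# a correct abstention counts as a pass.
ROUTING = {
    ReasonCode.SCHEMA_FAIL: "RETRY",
    ReasonCode.SEMANTIC_FAIL: "RETRY",
    ReasonCode.POLICY_BLOCK: "ESCALATE",
    ReasonCode.ABSTAIN_OK: "PASS",
}
```

Because the mapping is a plain table, the retry/escalation policy can be audited and changed without touching validator code.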
3.6 Edge Cases
- Fixture contains contradictory expected outcomes.
- Model returns valid JSON with semantically wrong values.
- Case should abstain but model answers directly.
- Trace write fails due to disk permission.
- Schema version mismatch between fixture expectations and current contract.
- Model returns empty string or partial JSON (truncated by token limit).
- Invariant evaluation throws an unexpected exception (e.g., null field access).
3.7 Real World Outcome
This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.
3.7.1 How to Run (Copy/Paste)
$ uv run p01-harness run --suite fixtures/support_tickets.yaml --seed 42 --out out/p01
- Working directory: `project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS`
- Required inputs: project fixtures under `fixtures/`
- Output directory: `out/p01`
3.7.2 Golden Path Demo (Deterministic)
Use the fixed seed already embedded in the command or config profile. You should see stable pass/fail totals between runs.
3.7.3 If CLI: exact terminal transcript
$ uv run p01-harness run --suite fixtures/support_tickets.yaml --seed 42 --out out/p01
[INFO] Loaded suite: support_tickets.yaml (120 cases)
[INFO] Contract version: 1.3.0 | Prompt version: support-v7
[INFO] Baseline loaded: run-007 (v6, 2026-01-15)
[PASS] schema_valid: 120/120
[PASS] policy_safe: 118/120 (2 correctly abstained)
[PASS] escalation_rules: 17/17
[INFO] Regression analysis:
overall: +0.8% vs baseline (97.5% -> 98.3%)
billing: -0.5% (minor, within threshold)
technical: +2.1% (improvement)
safety: 0.0% (stable at 100%)
[INFO] Release recommendation: PROMOTE_WITH_CANARY
[INFO] Report written: out/p01/report.json
[INFO] Trend artifact written: out/p01/trend.jsonl
$ echo $?
0
Failure demo:
$ uv run p01-harness run --suite fixtures/broken_suite.yaml --seed 42 --out out/p01
[ERROR] Suite load failed: missing required field "expected_outcome" at case #9
[HINT] Validate fixture shape with: uv run p01-harness lint-suite fixtures/broken_suite.yaml
$ echo $?
2
Regression failure demo:
$ uv run p01-harness run --suite fixtures/support_tickets.yaml --seed 42 --out out/p01
[INFO] Loaded suite: support_tickets.yaml (120 cases)
[FAIL] policy_safe: 116/120 (1 POLICY_BLOCK on safety case ticket-042)
[WARN] safety category: -1.0% regression (CRITICAL)
[INFO] Release recommendation: ROLLBACK
[INFO] Blocking reason: critical failure on safety invariant (ticket-042)
$ echo $?
1
4. Solution Architecture
4.1 High-Level Design
+-------------------+
| CLI Interface |
| (args, config, |
| seed, paths) |
+--------+----------+
|
+------------+------------+
| |
v v
+-------------------+ +--------------------+
| Suite Loader | | Baseline Loader |
| (parse fixtures, | | (load previous |
| validate shape, | | run artifacts) |
| stratify cases) | +--------+-----------+
+--------+----------+ |
| |
v |
+-------------------+ |
| Validator Pipeline| |
| Phase 1: Parse | |
| Phase 2: Check | |
| Phase 3: Route | |
+--------+----------+ |
| |
v v
+----------------------------------------+
| Regression Analyzer |
| (compare candidate vs baseline, |
| compute per-category deltas) |
+-------------------+--------------------+
|
v
+----------------------------------------+
| Release Gate Engine |
| (evaluate rules, emit recommendation) |
+-------------------+--------------------+
|
v
+----------------------------------------+
| Artifact Writer |
| (report.json, trend.jsonl, traces/) |
+----------------------------------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Suite Loader | Validates and normalizes fixture inputs, verifies stratification. | Fail fast on malformed fixtures to avoid noisy eval signals. |
| Validator Pipeline | Runs schema parsing, semantic invariant checks, and policy gates in order. | Use deterministic validators after model generation. Separate parse and check phases. |
| Regression Analyzer | Compares candidate results to baseline per category and per invariant. | Compute deltas at the category level, not just aggregate. |
| Release Gate Engine | Applies configurable rules to regression analysis and emits recommendation. | Gate on critical failures before aggregate pass rate. |
| Artifact Writer | Persists per-case traces, summary report, and trend artifact. | Include contract version, prompt version, and run timestamp in every artifact. |
4.3 Data Structures (No Full Code)
FixtureCase:
- case_id: string (unique)
- risk_label: enum[high, medium, low]
- category: string
- prompt: string
- context: dict
- expected_outcome: dict
ValidationResult:
- trace_id: string
- case_id: string
- contract_version: string
- prompt_version: string
- phase1_status: enum[PARSE_OK, SCHEMA_FAIL]
- phase2_results: list[InvariantResult]
- phase3_routing: enum[PASS, RETRY, ESCALATE, DEAD_LETTER]
- final_status: enum[SUCCESS, FAIL]
- reason_code: enum[null, SCHEMA_FAIL, SEMANTIC_FAIL, POLICY_BLOCK, ABSTAIN_OK]
- severity: enum[null, critical, degraded, info]
ReleaseDecision:
- recommendation: enum[PROMOTE, PROMOTE_WITH_CANARY, HOLD, ROLLBACK]
- justification: string
- overall_pass_rate: float
- baseline_delta: float
- category_deltas: dict
- critical_failures: int
- gate_rules_evaluated: list[GateResult]
4.4 Algorithm Overview
Key algorithm: Layered validation with regression-aware release gating
- Normalize input and attach deterministic trace metadata (trace_id, timestamps, versions).
- Phase 1: Parse raw model output into typed structure using contract definition. Emit SCHEMA_FAIL on parse errors.
- Phase 2: Evaluate all semantic invariants and policy gates against the parsed object. Emit SEMANTIC_FAIL or POLICY_BLOCK on failures.
- Phase 3: Route each result based on reason code and severity.
- Aggregate results per category and compare to baseline.
- Evaluate release gate rules in priority order (critical first, then per-category, then aggregate).
- Persist all artifacts with full provenance metadata.
Complexity Analysis (conceptual):
- Time: O(n * k) where n is fixture cases and k is invariants per case.
- Space: O(n) for traces and report artifacts.
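Phases 1 and 2 of the algorithm can be sketched as below. The invariant callables and field names are invented for illustration; the real contract drives both:

```python
import json

def phase1_parse(raw_output, required_fields):
    """Parse raw model output into a dict; return (obj, None) or (None, 'SCHEMA_FAIL')."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return None, "SCHEMA_FAIL"
    if not isinstance(obj, dict) or not required_fields <= obj.keys():
        return None, "SCHEMA_FAIL"  # a missing required field is a schema failure
    return obj, None

def phase2_check(obj, invariants):
    """Run every invariant; one that raises counts as a failure instead of
    crashing the harness (covers the 'invariant throws' edge case)."""
    failed = []
    for name, check in invariants.items():
        try:
            ok = check(obj)
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed

# Hypothetical invariant and output for a support-ticket contract:
invariants = {"priority_valid": lambda o: o["priority"] in {"low", "medium", "high"}}
obj, err = phase1_parse('{"priority": "high", "category": "billing"}',
                        {"priority", "category"})
```

Keeping the phases as separate functions gives each its own metrics and guarantees invariants never run against unparseable output.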
5. Implementation Guide
5.1 Development Environment Setup
# 1) Install dependencies (Python 3.11+, uv package manager)
# 2) Prepare fixture suite under fixtures/ with stratified cases
# 3) Create contract definition in contracts/support_ticket.v1.yaml
# 4) Run: uv run p01-harness run --suite fixtures/support_tickets.yaml --seed 42 --out out/p01
5.2 Project Structure
p01/
├── src/
│ ├── cli.py # CLI argument parsing and entrypoint
│ ├── suite_loader.py # Fixture loading and validation
│ ├── contract.py # Contract definition parsing
│ ├── validator.py # Three-phase validation pipeline
│ ├── regression.py # Baseline comparison and delta computation
│ ├── release_gate.py # Gate rule evaluation
│ └── artifact_writer.py # Report, trace, and trend output
├── contracts/
│ └── support_ticket.v1.yaml
├── fixtures/
│ ├── support_tickets.yaml
│ └── broken_suite.yaml
├── out/
└── README.md
5.3 The Core Question You’re Answering
“How do I prove a prompt revision improved quality instead of only changing phrasing?”
This question matters because it forces the project to produce objective evidence: per-invariant pass rates, per-category regression deltas, and release recommendations grounded in configurable thresholds rather than subjective impressions of prompt quality.
5.4 Concepts You Must Understand First
- Output contracts and invariants
- Why are contracts more reliable than manual prompt review for detecting regressions?
- Book Reference: “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 (Encoding and Evolution)
- Evaluation dataset stratification
- Why does an unstratified fixture suite produce misleading pass rates?
- Book Reference: “Site Reliability Engineering” by Google - Ch. 4 (Service Level Objectives)
- Failure taxonomy design
- Why must failure categories be machine-readable rather than human-readable descriptions?
- Book Reference: “Security Engineering” by Ross Anderson - Ch. 2 (Usability and Psychology)
- Regression analysis for probabilistic systems
- How do you compare results when the same input can produce different outputs?
- Book Reference: “AI Engineering” by Chip Huyen - Evaluation chapters
5.5 Questions to Guide Your Design
- Boundary and contracts
- What is the smallest contract that makes your invariants testable?
- Which fields are truly required vs merely useful for debugging?
- How do you handle contract version mismatches between fixture expectations and current contract?
- Runtime policy
- What is allowed automatically, what needs retry, and what must escalate?
- Which policy checks must happen before any side effect?
- How do you prevent a POLICY_BLOCK from being retried (which would waste tokens on a fundamentally unsafe output)?
- Evidence and observability
- What traces/metrics are required for fast incident triage?
- What specific thresholds trigger rollback or human review?
- How do you make trend artifacts useful for non-technical stakeholders (product managers, compliance officers)?
5.6 Thinking Exercise
Pre-Mortem for Prompt Contract Harness
Before implementing, write down 10 ways this project can fail in production. For each failure, classify it as: contract, policy, security, or operations. Then for each, determine: can it be prevented before runtime (static check), or does it require runtime detection and escalation?
Example failures to consider:
- A safety-critical invariant was never added to the contract
- The fixture suite does not cover a new product category
- The model provider silently updates the model version
- A contract change breaks backward compatibility with the monitoring dashboard
- The baseline artifact is corrupted or missing
- Two team members promote conflicting prompt versions simultaneously
- The trend artifact grows unbounded and fills disk
Questions to answer:
- Which of your 10 failures can be caught by the harness itself?
- Which require organizational process (code review, approval workflows)?
- Which are your highest-severity blind spots (failures you cannot detect)?
5.7 The Interview Questions They’ll Ask
- “How do you define a good prompt contract for non-deterministic systems?”
- “Which metrics should block promotion even if global pass rate is high?”
- “How do you design abstention behavior to be measurable rather than hidden?”
- “What makes a fixture suite representative instead of overfit to known-good cases?”
- “How would you explain failure reason codes to non-ML stakeholders?”
- “How do you handle the trade-off between strict release gates (fewer regressions) and fast iteration velocity (more prompt revisions per week)?”
5.8 Hints in Layers
Hint 1: Start with fixture quality Your harness is only as strong as the expected outputs and risk labels. Before writing any validation code, spend time on fixture design: ensure stratification across risk levels, difficulty tiers, and categories. A well-stratified suite of 50 cases is more valuable than 500 unstratified cases.
Hint 2: Separate syntax from semantics Keep schema checks (Phase 1: does the output parse?) and business-rule checks (Phase 2: do invariants hold?) as different pipeline stages. This produces clearer error messages and enables separate metrics per phase. Pseudocode:
result = phase1_parse(raw_output, contract)
IF result.status == SCHEMA_FAIL:
return FailureEnvelope(reason=SCHEMA_FAIL, routing=RETRY)
invariant_results = phase2_check(result.typed_object, contract.invariants)
IF any(r.severity == CRITICAL for r in invariant_results):
return FailureEnvelope(reason=POLICY_BLOCK, routing=ESCALATE)
Hint 3: Build the release gate early Define your promotion thresholds before running large experiments. This prevents post-hoc rationalization (“the pass rate is 93%, which is probably fine…”). Decide in advance: what pass rate blocks promotion? How many critical failures are tolerable (answer: zero)?
Hint 4: Persist every trace for regression debugging Without per-case traces that include trace_id, contract_version, prompt_version, and per-invariant results, you cannot debug regressions. When a category regresses by 1.5%, you need to see exactly which cases flipped from pass to fail and which invariants they violated. Design your trace format for diff-ability: sorted by case_id, with stable field ordering.
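Hint 4's diff-ability requirement can be met with sorted keys and case-ordered lines. A minimal sketch; the field names follow the ValidationResult shape from the architecture section, and the helper name is an assumption:

```python
import json

def write_traces(results):
    """Serialize per-case traces as NDJSON, sorted by case_id with stable key
    ordering, so two runs can be compared with a plain text diff."""
    lines = [json.dumps(r, sort_keys=True)
             for r in sorted(results, key=lambda r: r["case_id"])]
    return "\n".join(lines) + "\n"

ndjson = write_traces([
    {"case_id": "ticket-042", "final_status": "FAIL", "reason_code": "POLICY_BLOCK"},
    {"case_id": "ticket-001", "final_status": "SUCCESS", "reason_code": None},
])
```

With this format, `diff run_a/traces.ndjson run_b/traces.ndjson` shows exactly which cases flipped, with no JSON-aware tooling required.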
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Data contracts and schema evolution | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 4 |
| SLOs, error budgets, and release gates | “Site Reliability Engineering” by Google | Ch. 4-6, 31 |
| Failure taxonomy and trust boundaries | “Security Engineering” by Ross Anderson | Ch. 2-3 |
| LLM evaluation and testing | “AI Engineering” by Chip Huyen | Evaluation chapters |
| Building resilient retry/escalation systems | “Building LLM Apps” by Valentina Alto | Relevant chapters |
5.10 Implementation Phases
Phase 1: Foundation
- Define contracts with typed fields, invariants, and failure envelope.
- Build the suite loader with fixture validation and stratification checks.
- Implement Phase 1 (output typing) of the validation pipeline.
- Checkpoint: One fixture case parses into a typed object and a SCHEMA_FAIL case produces a structured error.
Phase 2: Core Functionality
- Implement Phase 2 (invariant checking) and Phase 3 (failure routing) of the validation pipeline.
- Build the regression analyzer with baseline loading and per-category delta computation.
- Implement the release gate engine with configurable thresholds.
- Checkpoint: Full suite runs with per-case traces, regression analysis against a baseline, and a release recommendation.
Phase 3: Operational Hardening
- Add trend artifact generation and multi-run comparison.
- Add CLI flags for lint-suite (pre-validate fixtures) and diff-baseline (show per-case changes).
- Document runbook: how to investigate a HOLD recommendation, how to create a new baseline, how to add a new invariant.
- Checkpoint: Team member can reproduce output from clean checkout, investigate a regression, and create a new baseline.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Validation order | Single-pass vs multi-phase | Multi-phase (parse -> check -> route) | Clearer diagnostics, separate metrics per phase, prevents checking invariants on unparseable output |
| Failure handling | Silent retries vs explicit reason codes | Explicit reason codes with structured failure envelope | Enables automated routing and faster debugging |
| Regression comparison | Exact-match vs statistical thresholds | Statistical thresholds with per-category breakdown | LLM outputs are probabilistic; exact-match produces false regressions |
| Release decision | Manual approval vs automated gates | Automated gates with manual override for HOLD | Balances speed and safety; humans review only ambiguous cases |
| Baseline storage | In-memory vs persisted artifacts | Persisted JSON artifacts with version metadata | Enables trend analysis and reproducible comparisons across runs |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate individual pipeline stages | Contract parser, invariant evaluators, release gate rules |
| Integration Tests | Verify end-to-end harness flow | Golden-path suite run producing correct report shape |
| Regression Tests | Ensure harness itself does not regress | Same fixture suite + seed always produces same results |
| Edge Case Tests | Ensure robust failure handling | Malformed fixtures, missing baselines, disk write failures |
6.2 Critical Test Cases
- Golden path succeeds: 120-case suite with all-passing results produces PROMOTE recommendation.
- Critical failure blocks: one POLICY_BLOCK on a safety case produces ROLLBACK regardless of aggregate pass rate.
- Regression detected: candidate with per-category regression produces HOLD with specific justification.
- Deterministic replay: same seed and fixtures always produce identical summary (pass counts, recommendation).
- Suite validation: malformed fixture suite halts with clear error before any model calls.
- Missing baseline: first-run mode evaluates only absolute thresholds and notes “no baseline available.”
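The deterministic-replay case reduces to one rule: thread a single seeded RNG through the run, never the global random state. A toy property sketch, where `run_suite` is a stand-in for the real harness entry point:

```python
import random

def run_suite(case_ids, seed):
    """Stand-in for the harness: any randomness (case ordering, sampling,
    tie-breaking) must come from one RNG seeded per run, never the global one."""
    rng = random.Random(seed)
    order = sorted(case_ids, key=lambda _: rng.random())
    return {"order": order, "case_count": len(order)}

def test_deterministic_replay():
    a = run_suite(["c1", "c2", "c3"], seed=42)
    b = run_suite(["c1", "c2", "c3"], seed=42)
    assert a == b  # identical summaries, run after run
```

The same property test, pointed at the real harness, is the cheapest regression test in the suite: it fails the moment any component reaches for unseeded randomness.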
6.3 Test Data
fixtures/golden_path_suite.yaml # All cases designed to pass
fixtures/critical_failure_suite.yaml # Contains one safety POLICY_BLOCK
fixtures/regression_suite.yaml # Shows category regression vs baseline
fixtures/broken_suite.yaml # Malformed fixtures for validation testing
fixtures/edge_cases/ # Empty suite, single-case suite, all-fail suite
baselines/ # Stored baseline artifacts for regression tests
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| “Pass rate looks good but production fails” | Eval set is dominated by easy cases; hard/safety cases are underrepresented. | Stratify fixtures by risk label and difficulty. Require minimum case counts per stratification cell. |
| “JSON parses but downstream breaks” | Only structural validation exists; semantic invariants are missing. | Add invariant checks in Phase 2 that encode business rules and safety requirements tied to specific fields. |
| “Release gate keeps flapping” | Thresholds are too tight for natural LLM output variance across runs. | Use confidence intervals and minimum sample sizes per category. Consider averaging across 3 runs before making gate decisions. |
| “Cannot debug which cases regressed” | Per-case traces are missing or do not include enough metadata. | Persist every trace with trace_id, case_id, contract_version, prompt_version, and per-invariant results. Design traces for diff-ability. |
| “Contract change broke the dashboard” | No backward-compatibility check on contract evolution. | Run new contract against old fixture baseline before promotion. Flag breaking changes explicitly. |
| “Trend artifact grows unbounded” | No retention policy on historical run artifacts. | Set a maximum retention window (e.g., last 50 runs) and prune older artifacts automatically. |
7.2 Debugging Strategies
- Re-run deterministic fixtures with fixed seed and compare trace IDs.
- Diff latest per-case traces against last known-good baseline using a JSON diff tool.
- Isolate whether failure is contract-level (parse), invariant-level (semantic), or infrastructure-level (disk, network).
- Use the `lint-suite` command to validate fixture shape before running the full harness.
- Check contract version alignment between fixture expectations, validator config, and baseline artifact.
7.3 Performance Traps
- Unbounded retries inflate latency and token cost. Set max_retries per case.
- Overly verbose per-case tracing slows disk I/O. Use structured JSON lines (NDJSON) for efficient append.
- Loading the entire baseline into memory is fine for 120 cases but will not scale to 10,000. Use indexed lookup by case_id if scaling.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add one new fixture category (e.g., “refund requests”) with expected outcome labels.
- Add one new invariant (e.g., “if customer_tier is ‘enterprise’, response must include SLA reference”).
- Add a summary-only CLI mode that skips per-case trace output for faster runs.
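The enterprise-SLA invariant suggested above might look like the following validator sketch; the field names, the substring check, and the reason string are illustrative assumptions:

```python
def check_sla_invariant(case: dict, response: dict):
    """Invariant: if customer_tier is 'enterprise', the response must reference an SLA.

    Field names (customer_tier, body) are hypothetical; adapt to your contract.
    Returns (passed, reason) so the harness can record a per-invariant result.
    """
    if case.get("customer_tier") != "enterprise":
        return True, "not applicable"  # invariant only binds enterprise cases
    if "SLA" in response.get("body", ""):
        return True, "SLA reference present"
    return False, "INVARIANT_FAIL: enterprise response missing SLA reference"
```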
8.2 Intermediate Extensions
- Add dashboard-ready trend exports (CSV or JSON) that visualization tools can consume.
- Add automated regression diff that shows exactly which cases flipped between runs.
- Add contract compatibility checking that flags breaking changes before promotion.
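The regression diff above reduces to comparing per-case pass status between two runs. A sketch, assuming each run is a mapping of case_id to a result dict with a boolean `passed` field:

```python
def flipped_cases(baseline: dict, latest: dict) -> dict:
    """Return {case_id: (old_passed, new_passed)} for cases whose status changed.

    Cases present in only one run are skipped; report those separately
    as suite-composition changes rather than regressions.
    """
    flips = {}
    for case_id, new in latest.items():
        old = baseline.get(case_id)
        if old is not None and old["passed"] != new["passed"]:
            flips[case_id] = (old["passed"], new["passed"])
    return flips
```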
8.3 Advanced Extensions
- Integrate with CI/CD pipeline as a merge-blocking gate.
- Add canary deployment logic with automated rollback on live-metric degradation.
- Add chaos-style fault injection (corrupt fixtures, inject model timeouts) and verify harness resilience.
- Add error budget tracking across multiple runs with deployment freeze triggers.
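One possible shape for the error-budget freeze trigger in the last bullet; the SLO, budget size, and burn formula are illustrative choices, not a prescription:

```python
def deployment_frozen(run_pass_rates, slo=0.95, budget=0.10) -> bool:
    """Freeze deployments once cumulative shortfall against the SLO
    exhausts the error budget across recent runs.

    Each run burns budget equal to how far its pass rate fell below the SLO;
    runs at or above the SLO burn nothing.
    """
    burned = sum(max(0.0, slo - rate) for rate in run_pass_rates)
    return burned > budget
```

This makes the incentive explicit: a string of marginal runs eventually freezes deployment just as surely as one catastrophic run does.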
9. Real-World Connections
9.1 Industry Applications
- PromptOps platform teams operating AI features under compliance constraints (healthcare, finance, legal).
- Internal AI governance tooling for release safety and incident response at companies like Stripe, Airbnb, and Uber.
- Enterprise AI products that must demonstrate testing rigor for SOC 2 compliance and regulatory audits.
9.2 Related Open Source Projects
- promptfoo: Declarative prompt evaluation with test suites, assertions, and CI integration.
- DeepEval: LLM evaluation framework with regression testing and metric tracking.
- LangSmith: Tracing and evaluation platform for LLM applications.
- OpenTelemetry: Observability framework for distributed tracing (applicable to prompt execution traces).
- Braintrust: Evaluation platform with baseline comparison and trend analysis.
9.3 Interview Relevance
- Demonstrates ability to convert probabilistic model behavior into deterministic software guarantees.
- Shows practical production-thinking: contracts, policies, monitoring, and operational controls.
- Proves understanding of schema evolution, backward compatibility, and release discipline.
- Provides concrete examples of SRE concepts (error budgets, SLOs, canary deployments) applied to AI systems.
10. Resources
10.1 Essential Reading
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 for contract evolution.
- “Site Reliability Engineering” by Google - Ch. 4-6 for SLOs and error budgets.
- “AI Engineering” by Chip Huyen - For LLM evaluation and testing patterns.
- OpenAI Structured Outputs docs: https://developers.openai.com/api/docs/guides/structured-outputs
- Anthropic Claude Structured Outputs: https://platform.claude.com/docs/en/build-with-claude/structured-outputs
10.2 Video Resources
- Talks on LLM eval systems from AI Engineering Summit.
- Google SRE talks on error budgets and release gating.
- PromptOps and AI safety operations presentations from MLOps Community.
10.3 Tools & Documentation
- promptfoo: https://github.com/promptfoo/promptfoo
- DeepEval: https://deepeval.com/docs/getting-started
- JSON Schema specification: https://json-schema.org/
- OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/
- NIST AI RMF: https://www.nist.gov/artificial-intelligence/risk-management-framework
10.4 Related Projects in This Series
- P02 (JSON Output Enforcer): Uses contract concepts for schema validation and repair loops.
- P08 (Prompt DSL + Linter): Applies static analysis to prompt files, complementing runtime contract testing.
- P11 (Canary Prompt Rollout Controller): Extends the release gate concept with live-traffic canary monitoring.
- P15 (Prompt Registry): Manages prompt versioning that feeds into contract version tracking.
- P18 (Capstone): Composes contracts from multiple projects into a unified validation pipeline.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why output typing must happen before invariant checking.
- I can explain the difference between SCHEMA_FAIL and SEMANTIC_DRIFT and why routing differs.
- I can design a fixture stratification plan that avoids easy-case bias.
- I can justify each threshold in my release gate configuration.
- I can explain how error budgets create incentives for prompt quality.
11.2 Implementation
- Golden-path and failure-path flows both work deterministically.
- Per-case traces include all required metadata (trace_id, versions, per-invariant results).
- Regression analysis correctly computes per-category deltas against a baseline.
- Release gate produces correct recommendations for PROMOTE, HOLD, and ROLLBACK scenarios.
- Trend artifacts aggregate across multiple runs.
11.3 Growth
- I can describe one tradeoff I made (e.g., strict gates vs. iteration velocity) and why.
- I can explain this project design in an interview setting with concrete examples.
- I can identify gaps in my fixture suite and explain how to fill them.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Golden path works with deterministic output artifact.
- At least one failure-path scenario returns unified error shape with reason code.
- Per-case traces with contract version and prompt version are persisted.
- Release recommendation is computed from configurable gate rules.
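A configurable gate-rule computation of the kind required above might look like this sketch; the thresholds and the rule that any critical failure forces rollback are illustrative assumptions:

```python
def gate_decision(pass_rate: float, critical_failures: int,
                  promote_threshold: float = 0.95,
                  rollback_threshold: float = 0.80) -> str:
    """Compute a release recommendation from configurable gate rules.

    Thresholds are illustrative defaults; load them from config in practice.
    """
    # Critical failures or a collapsed pass rate always force rollback.
    if critical_failures > 0 or pass_rate < rollback_threshold:
        return "ROLLBACK"
    if pass_rate >= promote_threshold:
        return "PROMOTE"
    return "HOLD"  # between thresholds: neither promote nor roll back
```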
Full Completion:
- Includes regression analysis against a stored baseline.
- Includes per-category delta reporting and trend artifact generation.
- Includes automated tests for golden-path, critical-failure, and regression scenarios.
- Includes operational thresholds for promote/rollback with human-readable justifications.
Excellence (Above & Beyond):
- Integrates with CI/CD as a merge-blocking gate.
- Implements contract compatibility checking for breaking change detection.
- Includes error budget tracking with deployment freeze triggers.
- Demonstrates incident drill replay: given a regression, trace it to specific cases and invariants.
- Integrates with adjacent projects (registry, rollout, firewall, HITL) cleanly.