Project 2: JSON Output Enforcer (Schema + Repair Loop)
Stable JSON response flow with schema pass rates and controlled retry outcomes.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | See main guide estimates (typically 3-8 days except capstone) |
| Main Programming Language | TypeScript |
| Alternative Programming Languages | Python, Go |
| Coolness Level | Level 3: Useful and Reusable |
| Business Potential | 3. Internal Platform Utility |
| Knowledge Area | Structured Output Engineering |
| Software or Tool | Schema validator + retry orchestrator |
| Main Book | Designing Data-Intensive Applications |
| Concept Clusters | Prompt Contracts and Output Typing; Evaluation, Rollouts, and Governance |
1. Learning Objectives
By completing this project, you will:
- Implement strict JSON Schema validation against raw LLM outputs using a schema-first approach.
- Design a bounded repair loop that feeds parser errors back to the model for self-correction with configurable max attempts.
- Build a dead-letter pipeline that captures unrepairable outputs with full diagnostic metadata for manual triage.
- Track schema pass rates by version and measure the repair loop’s uplift contribution.
- Handle real-world output pathologies: trailing text after JSON, truncated responses, nested schema violations, and enum constraint failures.
- Produce deterministic, reproducible results that enable regression comparison across schema versions.
2. All Theory Needed (Per-Concept Breakdown)
Concept A: JSON Schema Validation for LLM Outputs
Fundamentals JSON Schema is a declarative specification language that describes the structure, types, and constraints of JSON data. When applied to LLM outputs, JSON Schema serves as the enforcement layer between the probabilistic model and the deterministic downstream system that consumes its output. Without schema validation, you are trusting that the model will always produce correctly-structured data, which it will not. Models drift, prompts change, context windows get truncated, and output formats degrade in subtle ways. JSON Schema validation converts the question “did the model return good JSON?” into a deterministic, automatable check that reports exactly which field failed which constraint. This is the first line of defense in any structured output pipeline.
The JSON Schema specification (currently draft 2020-12) defines a rich vocabulary of constraints: type validation (string, number, boolean, array, object, null), format validation (email, date-time, URI), numeric constraints (minimum, maximum, multipleOf), string constraints (minLength, maxLength, pattern), array constraints (minItems, maxItems, uniqueItems, items), object constraints (required, properties, additionalProperties, patternProperties), and composition keywords (allOf, anyOf, oneOf, not). For LLM output validation, the most critical constraints are: required (ensures the model did not omit a field), type (catches type confusion like returning a string where a number is expected), enum (constrains categorical fields to valid values), additionalProperties: false (prevents the model from inventing extra fields that confuse downstream parsers), and pattern (validates string formats like UUIDs or date strings).
Deep Dive into the concept Validating LLM outputs against JSON Schema involves several layers of complexity that do not exist in traditional API validation.
Layer 1: JSON extraction. The model may not return pure JSON. Common pathologies include: markdown code fences wrapping the JSON (```json ... ```), explanatory text before or after the JSON block (“Here is the result: {…}”), trailing analysis text after a valid JSON object, multiple JSON objects in a single response, and truncated JSON due to token limits. Before schema validation can begin, you need a JSON extraction layer that handles these cases. The extraction layer should try multiple strategies in order: (1) parse the entire response as JSON, (2) find the first { and last } and parse the substring, (3) strip markdown code fences and try again, (4) report extraction failure with the specific pathology detected.
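The strategy ladder above can be sketched in TypeScript as follows. This is a minimal illustration, not a library API; the `Extraction` type and `extractJson` name are assumptions for this sketch.

```typescript
// Illustrative sketch of the extraction ladder; names are not from any library.
type Extraction =
  | { ok: true; value: unknown; strategy: string }
  | { ok: false; error: "json_extraction_failed" };

function extractJson(raw: string): Extraction {
  // Strategy 1: the whole response is already valid JSON.
  try {
    return { ok: true, value: JSON.parse(raw), strategy: "direct" };
  } catch { /* fall through */ }

  // Strategy 2: take the substring between the first '{' and the last '}'.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start !== -1 && end > start) {
    try {
      return { ok: true, value: JSON.parse(raw.slice(start, end + 1)), strategy: "substring" };
    } catch { /* fall through */ }
  }

  // Strategy 3: strip markdown code fences and parse the fenced body.
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fenced) {
    try {
      return { ok: true, value: JSON.parse(fenced[1]), strategy: "code_fence" };
    } catch { /* fall through */ }
  }

  // All strategies failed: report the pathology to the repair loop.
  return { ok: false, error: "json_extraction_failed" };
}
```

Recording which strategy succeeded (the `strategy` field) gives you the metric mentioned above: if a large share of responses need fence stripping, the prompt should be adjusted.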
Layer 2: Structural validation. Once you have a parsed JSON object, validate it against the schema. A good validator library (ajv for TypeScript/JavaScript, jsonschema or pydantic for Python) will report ALL violations in a single pass, not just the first one. This is critical for the repair loop: if the output has three violations, you want to report all three so the repair prompt can fix them all in one attempt, rather than requiring three separate repair iterations. Each violation should include: the JSON path to the violated field (e.g., /items/2/quantity), the constraint that was violated (e.g., “must be >= 1”), and the actual value that was found.
Layer 3: Semantic validation beyond schema. JSON Schema cannot express all business constraints. Cross-field validations (e.g., “if status is ‘shipped’ then tracking_number must be non-null”), computed field checks (e.g., “total must equal sum of line_items[*].amount”), and referential integrity checks (e.g., “category_id must exist in the known categories list”) require a post-schema validation layer. This project focuses on structural schema validation but should provide hooks for semantic validators (which are covered more deeply in P01).
Layer 4: Schema versioning. Schemas evolve over time. A new version might add optional fields, tighten constraints, or change enum values. Your enforcer must track which schema version was used for validation so that metrics are comparable across versions. When a schema update changes validation behavior, you need to re-run the fixture suite against both old and new schemas to understand the impact.
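A minimal sketch of per-version metric tracking, assuming each validation result is tagged with the schema version it was checked against (the `ValidationResult` shape and `passRateByVersion` name are illustrative):

```typescript
// Tag every validation result with the schema version so pass rates
// stay comparable across versions. Names here are illustrative.
interface ValidationResult {
  schemaVersion: string;   // e.g. "invoice.v2"
  valid: boolean;
  violations: string[];
}

function passRateByVersion(results: ValidationResult[]): Record<string, number> {
  const byVersion: Record<string, { passed: number; total: number }> = {};
  for (const r of results) {
    const bucket = (byVersion[r.schemaVersion] ??= { passed: 0, total: 0 });
    bucket.total++;
    if (r.valid) bucket.passed++;
  }
  // Convert counts to pass rates per version.
  return Object.fromEntries(
    Object.entries(byVersion).map(([version, b]) => [version, b.passed / b.total]),
  );
}
```

Running the fixture suite against both old and new schemas and comparing the two entries in this map is the regression check described above.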
Provider-specific structured output modes change the validation landscape. OpenAI’s Structured Outputs feature (introduced in 2024) uses constrained decoding to guarantee that model output conforms to a JSON Schema at the token generation level. It converts the schema into a context-free grammar (CFG) and dynamically constrains which tokens the model can emit at each generation step. This means the model literally cannot produce invalid JSON for supported schema types. Anthropic’s Claude offers similar structured output capabilities using constrained decoding with support for Pydantic (Python) and Zod (TypeScript) schema definitions. Google’s Gemini provides JSON mode with schema enforcement.
However, constrained decoding has limitations: it does not support all JSON Schema features (recursive schemas, some composition keywords), it may increase latency due to the constraint checking at each token, and it is provider-specific. Your enforcer should work with AND without provider-level constraints: use constrained decoding when available (as an optimization that reduces repair loop invocations), but always validate the output against your schema as a defense-in-depth measure. The provider’s constraint might have bugs, the schema might not be fully supported, or the provider might change behavior silently.
VALIDATION APPROACH COMPARISON
Provider-Level Constrained Decoding:
+ Guarantees valid JSON structure at token level
+ No repair loop needed for structural issues
- Provider-specific (not portable)
- Limited schema feature support
- May increase generation latency
- Cannot enforce semantic constraints
Application-Level Schema Validation:
+ Works with any provider
+ Supports full JSON Schema spec
+ Can layer semantic checks on top
- Requires repair loop for failures
- Post-hoc (invalid tokens already generated)
Recommended: Use both. Constrained decoding reduces repair loops.
Application-level validation catches what constrained decoding misses.
A critical implementation choice is how to configure additionalProperties. When set to false, the validator rejects any field the model adds beyond what the schema defines. This is important for LLM outputs because models frequently add “helpful” extra fields (like explanation, reasoning, or confidence) that are not in the schema. While these fields might seem harmless, they can break downstream parsers that use strict deserialization, inflate payload sizes, and leak internal reasoning into user-facing responses. Set additionalProperties: false by default and require explicit opt-in for schemas that allow extra fields.
How this fits into the project JSON Schema validation is the core of Project 2. You will build the extraction layer that handles LLM output pathologies, the validation layer that reports all violations with field paths, and the schema versioning system that tracks metrics per version.
Definitions & key terms
- JSON Schema: A declarative specification that describes the structure and constraints of JSON data, using keywords like `type`, `required`, `properties`, `enum`, and `additionalProperties`.
- Schema draft: A version of the JSON Schema specification. Draft 2020-12 is current. Different validator libraries support different drafts.
- Constrained decoding: A technique where the model’s token generation is dynamically restricted to only emit tokens that produce valid JSON according to a schema. Used by OpenAI Structured Outputs, Anthropic Claude, and vLLM.
- ajv: “Another JSON Schema Validator,” the most widely-used JSON Schema validation library for JavaScript/TypeScript. Supports drafts 4 through 2020-12.
- Validation error path: The JSON Pointer (e.g., `/items/2/quantity`) that identifies exactly where in the document a constraint was violated.
- additionalProperties: A JSON Schema keyword that controls whether an object may contain properties not listed in the `properties` definition. Setting it to `false` enforces strict schema adherence.
Mental model diagram (ASCII)
Raw LLM Response
(may contain text + JSON + text)
|
v
+----------------------------+
| LAYER 1: JSON EXTRACTION |
| |
| Try: full parse |
| Try: find { ... } |
| Try: strip code fences |
| Fail: EXTRACTION_ERROR |
+-------------+--------------+
|
parsed JSON object
|
v
+----------------------------+
| LAYER 2: SCHEMA |
| VALIDATION |
| |
| required fields? |
| correct types? |
| enum values valid? |
| no extra properties? |
| array constraints met? |
| string patterns match? |
| |
| Output: ALL violations |
| with field paths |
+-------------+--------------+
|
+------------+------------+
| |
ALL PASS VIOLATIONS FOUND
| |
v v
+------------------+ +------------------------+
| LAYER 3: SEMANTIC| | Feed violations to |
| VALIDATION | | REPAIR LOOP (Concept B)|
| (post-schema | +------------------------+
| business rules) |
+------------------+
|
v
VALID OUTPUT
(typed, verified, ready
for downstream system)
How it works (step-by-step, with invariants and failure modes)
1. Receive raw model response. The response is a string that may or may not be valid JSON. Invariant: the response is never null (the model always returns something, even if empty). Failure mode: empty string or whitespace-only response triggers an immediate EXTRACTION_ERROR.
2. JSON extraction. Try parsing strategies in order: direct parse, substring extraction, code fence stripping. Invariant: exactly one extraction strategy succeeds, or extraction fails. Failure mode: all strategies fail and the response is sent to the repair loop with error type “json_extraction_failed.” The extraction layer records which strategy succeeded (useful for metrics: if 30% of responses need code fence stripping, the prompt should be adjusted).
3. Schema validation. Run the parsed JSON through the validator configured with the target schema version. Collect ALL violations in a single pass. Invariant: the validator never crashes on malformed input (it reports errors, it does not throw). Failure mode: the schema file itself is invalid, which should be caught at harness startup, not at validation time.
4. Report violations with paths. Each violation includes the JSON Pointer path, the violated constraint, and the actual value. This report is used both for metrics (which constraints fail most often?) and for the repair loop (which fields need fixing?). Invariant: every violation has a path that can be resolved against the original JSON.
5. Route the result. If validation passes, the typed object proceeds to downstream consumers. If validation fails, the violations are sent to the repair loop (Concept B). If the repair loop exhausts its attempts, the output goes to the dead-letter sink. Invariant: no output reaches downstream consumers without passing schema validation.
Minimal concrete example
Schema (invoice.v2.json):
{
"type": "object",
"required": ["invoice_id", "items", "total", "currency"],
"additionalProperties": false,
"properties": {
"invoice_id": { "type": "string", "pattern": "^INV-[0-9]{6}$" },
"items": {
"type": "array",
"minItems": 1,
"items": {
"type": "object",
"required": ["description", "quantity", "unit_price"],
"properties": {
"description": { "type": "string", "minLength": 1 },
"quantity": { "type": "integer", "minimum": 1 },
"unit_price": { "type": "number", "minimum": 0 }
}
}
},
"total": { "type": "number", "minimum": 0 },
"currency": { "type": "string", "enum": ["USD", "EUR", "GBP"] }
}
}
Raw LLM response (with pathologies):
"Here is the extracted invoice:
```json
{
"invoice_id": "INV-00042",
"items": [
{"description": "Widget A", "quantity": 3, "unit_price": 12.50},
{"description": "", "quantity": 0, "unit_price": -5}
],
"total": 32.50,
"currency": "JPY",
"notes": "Processed by AI"
}
```
Let me know if you need anything else!"
Extraction: code fence stripping succeeds.
Validation violations:
1. /items/1/description: minLength 1, got ""
2. /items/1/quantity: minimum 1, got 0
3. /items/1/unit_price: minimum 0, got -5
4. /currency: enum ["USD","EUR","GBP"], got "JPY"
5. /notes: additionalProperties false, unexpected property "notes"
Common misconceptions
- “If the model returns valid JSON, schema validation is unnecessary.” Valid JSON is not the same as valid-according-to-your-schema. `{"foo": "bar"}` is valid JSON but fails validation against an invoice schema. The model can produce syntactically correct JSON that misses required fields, uses wrong types, or includes extra properties.
- “JSON Schema validation is slow and adds latency.” Schema validation takes microseconds for typical LLM output sizes (a few KB). The model generation itself takes 100-5000 ms. Validation overhead is negligible.
- “Constrained decoding makes schema validation redundant.” Constrained decoding handles structural correctness but may not support all schema features, may have implementation bugs, and cannot enforce semantic constraints. Defense-in-depth requires application-level validation even when using constrained decoding.
- “Reporting only the first validation error is sufficient.” If the repair loop only sees one error, it fixes that error but may leave four others. Reporting all errors enables single-attempt repair, which is more cost-effective and faster.
- “additionalProperties should be true to be flexible.” In LLM output contexts, extra properties are usually unintended model hallucinations, not useful data. They can break strict deserializers, leak reasoning, and inflate payloads. Default to false.
Check-your-understanding questions
- Why should the JSON extraction layer try multiple strategies rather than assuming the model returns pure JSON?
- What is the difference between JSON Schema’s `type: "integer"` and `type: "number"`, and why does it matter for LLM outputs?
- When would you choose `additionalProperties: true` for an LLM output schema, and what risks does it introduce?
- Why is it important to report ALL schema violations in a single pass rather than stopping at the first one?
- How does constrained decoding reduce but not eliminate the need for application-level schema validation?
Check-your-understanding answers
- Because LLMs frequently wrap JSON in markdown code fences, add explanatory text before/after the JSON, or produce multiple JSON objects. A pure JSON parse fails on these common outputs. Trying multiple extraction strategies (direct parse, substring, code fence strip) handles most pathologies without requiring the repair loop, which is more expensive.
"integer"requires a whole number (e.g., 3), while"number"allows decimals (e.g., 3.14). LLMs sometimes return3.0for a quantity field that should be an integer, which passes"number"but may fail"integer"depending on the validator’s strict mode. Choose carefully based on what your downstream consumer expects.- You would choose
additionalProperties: truewhen the schema is intentionally extensible (e.g., a metadata object where the model can add arbitrary key-value pairs). The risks: downstream parsers may crash on unexpected fields, extra fields may leak internal reasoning to end users, payload sizes become unpredictable, and you cannot distinguish intentional extra fields from hallucinated ones. - The repair loop sends violation details to the model as context for self-correction. If only one error is reported, the model fixes that one but the output still fails on the other four. Reporting all errors enables the model to fix everything in a single repair attempt, which saves tokens and latency (one repair call vs four).
- Constrained decoding guarantees that the model only emits tokens that form valid JSON according to a schema. However: (a) not all schema features are supported by all providers, (b) the implementation may have bugs, (c) schema changes may not be immediately reflected in the constraint, (d) semantic constraints (cross-field rules, computed values) cannot be expressed in JSON Schema. Application-level validation catches these gaps.
Real-world applications
- E-commerce order processing: Invoice extraction pipelines validate LLM-extracted data against strict schemas to ensure that required fields (order_id, line items, total) are present and correctly typed before entering the ERP system.
- Healthcare data extraction: Clinical note summarization systems validate extracted FHIR-compatible JSON against schemas that enforce required patient identifiers, medication codes, and dosage formats.
- Financial report generation: Automated financial reports validate extracted figures against schemas with numeric precision constraints, currency enum validation, and required regulatory fields.
- API response generation: LLM-powered API endpoints use schema validation to guarantee that generated responses conform to the published API contract, preventing 500 errors from reaching consumers.
- Content management systems: Article metadata extraction (title, author, tags, categories) validates against CMS schemas before publishing, catching enum violations on category fields and missing required metadata.
Where you’ll apply it
- In the validation layer of this project: building the extraction, validation, and error reporting pipeline.
- Schema versioning and metrics tracking throughout the project.
- The validation results feed directly into the repair loop (Concept B).
References
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 (Encoding and Evolution) for schema evolution and compatibility.
- JSON Schema specification: https://json-schema.org/specification
- OpenAI Structured Outputs: https://developers.openai.com/api/docs/guides/structured-outputs
- Anthropic Claude Structured Outputs: https://platform.claude.com/docs/en/build-with-claude/structured-outputs
- ajv validator documentation: https://ajv.js.org/
- Pydantic documentation for Python schema validation: https://docs.pydantic.dev/
- Zod documentation for TypeScript schema validation: https://zod.dev/
- “A Guide to Structured Outputs Using Constrained Decoding” - Aidan Cooper’s technical deep-dive on CFG-based constraint systems.
Key insights JSON Schema validation for LLM outputs is not about checking if the JSON is valid; it is about checking if the JSON satisfies the contract your downstream system requires, and reporting exactly which fields fail which constraints so that the repair loop can fix them efficiently.
Summary
JSON Schema validation is the first enforcement layer between an LLM and a downstream system. It requires handling LLM-specific output pathologies (code fences, trailing text, truncation) in an extraction layer, running full-spec validation that reports all violations with field paths, and tracking metrics by schema version. Constrained decoding at the provider level reduces but does not eliminate the need for application-level validation. Setting additionalProperties: false and reporting all violations in a single pass are critical for production-grade LLM output pipelines.
Homework/Exercises to practice the concept
1. Design a JSON Schema for a product review summary. Include at least 6 fields with diverse constraint types: string with pattern, integer with range, enum, array with minItems, nested object with required fields. Set `additionalProperties: false`. Then list 5 realistic ways an LLM might violate this schema.
2. Write a JSON extraction strategy ladder. Given a raw LLM response that may contain markdown, explanatory text, or multiple JSON objects, describe each extraction strategy in order, what pathology it handles, and when it should give up and send the response to the repair loop.
3. Analyze schema evolution impact. You have invoice.v1.json (5 required fields, currency enum: [“USD”, “EUR”]) and invoice.v2.json (6 required fields, added “tax_rate”, currency enum: [“USD”, “EUR”, “GBP”]). Which changes are backward-compatible? If you have 500 responses validated against v1, how many might fail against v2 and why?
4. Compare validator libraries. For TypeScript (ajv vs zod) and Python (jsonschema vs pydantic), list the trade-offs for LLM output validation: schema definition style, error reporting detail, performance, and strictness defaults.
Solutions to the homework/exercises
1. Schema should include fields like: `product_name` (string, minLength: 1), `rating` (integer, min: 1, max: 5), `sentiment` (enum: [“positive”, “negative”, “mixed”]), `key_points` (array of strings, minItems: 1, maxItems: 5), `word_count` (integer, min: 10), `metadata` (object with required `source` and `date` fields). Five violations: (a) model returns rating as float 4.5 instead of integer, (b) sentiment is “neutral” not in enum, (c) key_points is empty array, (d) model adds unrequested “confidence” field blocked by additionalProperties, (e) word_count is negative.
2. Strategy ladder: (1) Direct JSON.parse on full response, handles pure JSON. (2) Find first `{` and last `}`, parse substring, handles leading/trailing text. (3) Strip the ```json ... ``` fence markers, try strategy 1, handles markdown code fences. (4) Split on double newlines, try each chunk, handles multiple JSON objects (take the first valid one). (5) All failed: send to repair loop with error “json_extraction_failed” and the original raw response. Each strategy should log which one succeeded for metrics.
3. Adding “tax_rate” as a required field is a BREAKING change: all 500 existing responses lack this field and will fail validation against v2 unless “tax_rate” has a default. Adding “GBP” to the currency enum is backward-compatible: existing responses with “USD” or “EUR” still pass. If “tax_rate” is made optional instead of required, the change becomes backward-compatible. Best practice: make new fields optional in the initial version, then upgrade to required once all producers have been updated.
4. ajv (TS): schema-first (write JSON Schema directly), detailed error paths with JSON Pointers, fastest JS validator, strictness configurable. Best for: validating against existing JSON Schema specs. Zod (TS): code-first (define schema in TypeScript), excellent TypeScript type inference, less detailed error paths by default, slower than ajv. Best for: TypeScript-heavy projects where type safety is primary. jsonschema (Python): schema-first, standard-compliant, moderate error detail, moderate speed. Best for: validating against JSON Schema specs. Pydantic (Python): code-first (Python classes), excellent error messages, fast (Rust core in v2), strict mode available. Best for: Python projects that want schema + serialization + validation in one tool.
Concept B: Bounded Repair Loops and Dead-Letter Handling
Fundamentals A repair loop is a controlled feedback cycle where a failed LLM output is sent back to the model along with the specific validation errors, asking the model to correct its previous attempt. The loop is “bounded” because it has a maximum number of attempts (typically 2-3) to prevent infinite retries, runaway token costs, and cascading latency. When the repair loop exhausts its attempts without producing a valid output, the failed item is sent to a dead-letter sink: a separate storage location that captures the raw output, all validation errors from each attempt, and diagnostic metadata for manual triage.
The repair loop is not a retry. A retry re-executes the same prompt and hopes for a different result. A repair loop provides the model with explicit information about what went wrong (the validation errors, the specific fields, the expected constraints) so it can make a targeted correction. This distinction is critical: blind retries have a low success rate for structural issues (the model tends to make the same structural mistakes repeatedly), while informed repairs have a much higher success rate because the model receives corrective feedback.
The dead-letter pattern originates from message queue systems (Kafka, SQS, RabbitMQ) where messages that cannot be processed after a configurable number of attempts are moved to a separate queue rather than blocking the pipeline. In the LLM context, the dead-letter sink prevents unrepairable outputs from either blocking the processing pipeline or silently being dropped. Every item in the dead-letter sink represents a case that needs human attention: either the schema is too strict, the prompt is inadequate for this input class, or the model genuinely cannot produce the required output.
Deep Dive into the concept The repair loop architecture has four components: the error formatter, the repair prompt builder, the attempt tracker, and the dead-letter router.
Error formatter. The error formatter takes the raw validation errors from the schema validator and converts them into a format that the model can understand and act on. This is not trivial: raw validation library errors often use technical jargon (JSON Pointer paths, schema keyword references) that the model may not interpret optimally. A good error formatter translates errors into natural language with enough specificity for correction:
Raw validator error:
{ "path": "/items/1/quantity", "keyword": "minimum", "params": {"limit": 1}, "message": "must be >= 1" }
Formatted for repair prompt:
"Error at items[1].quantity: The value 0 must be at least 1. This field represents the quantity of an item and must be a positive integer."
Including the field’s semantic meaning (“represents the quantity”) alongside the constraint (“must be at least 1”) helps the model understand not just what failed but why the constraint exists, leading to better corrections.
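This translation step can be sketched as a small pure function. It operates on validator errors treated as plain data (the `ValidationError` shape below resembles ajv's error objects but is an assumption of this sketch, as are the function name and the `fieldMeaning` map):

```typescript
// Illustrative error formatter; shapes and names are not from any library.
interface ValidationError {
  path: string;      // JSON Pointer, e.g. "/items/1/quantity"
  keyword: string;   // violated constraint, e.g. "minimum"
  message: string;   // validator message, e.g. "must be >= 1"
  actual?: unknown;  // the value found at the path
}

// Optional semantic descriptions, keyed by the path with indices wildcarded.
const fieldMeaning: Record<string, string> = {
  "items/*/quantity": "the quantity of an item and must be a positive integer",
};

function formatError(err: ValidationError): string {
  const segments = err.path.slice(1).split("/");
  // "/items/1/quantity" -> "items[1].quantity" for readability.
  const friendly = segments
    .map((seg) => (/^\d+$/.test(seg) ? `[${seg}]` : `.${seg}`))
    .join("")
    .replace(/^\./, "");
  // Normalize numeric indices to "*" to look up the semantic meaning.
  const key = segments.map((s) => (/^\d+$/.test(s) ? "*" : s)).join("/");
  const meaning = fieldMeaning[key];

  const base = `Error at ${friendly}: The value ${JSON.stringify(err.actual)} ${err.message}.`;
  return meaning ? `${base} This field represents ${meaning}.` : base;
}
```

For deterministic repair prompts, sort the formatted errors before inserting them into the prompt.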
Repair prompt builder. The repair prompt includes: (1) the original task description, (2) the model’s previous invalid output, (3) the formatted validation errors, and (4) explicit instructions to fix only the errors while preserving valid parts. Critical design decision: should the repair prompt include the full previous output or just the errors? Including the full output lets the model make targeted fixes (change only the violated fields). Excluding it forces the model to regenerate from scratch, which may fix the reported errors but introduce new ones. Recommended approach: include the full previous output with the errors annotated inline.
Repair prompt structure:
"Your previous response had validation errors. Fix ONLY the fields
listed below. Keep all other fields exactly as they are.
Previous output:
{the JSON with errors}
Errors to fix:
1. items[1].description: must not be empty (minLength: 1)
2. items[1].quantity: value 0 must be >= 1
3. items[1].unit_price: value -5 must be >= 0
4. currency: value 'JPY' must be one of: USD, EUR, GBP
5. Remove unexpected field 'notes' (not in schema)
Return the corrected JSON only, with no explanation."
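A minimal builder for this prompt structure might look like the following (the function name and signature are illustrative):

```typescript
// Illustrative repair prompt builder matching the structure shown above.
function buildRepairPrompt(previousOutput: string, formattedErrors: string[]): string {
  const errorList = formattedErrors.map((e, i) => `${i + 1}. ${e}`).join("\n");
  return [
    "Your previous response had validation errors. Fix ONLY the fields",
    "listed below. Keep all other fields exactly as they are.",
    "",
    "Previous output:",
    previousOutput,           // full previous output enables targeted fixes
    "",
    "Errors to fix:",
    errorList,
    "",
    "Return the corrected JSON only, with no explanation.",
  ].join("\n");
}
```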
Attempt tracker. The attempt tracker records each repair attempt: the attempt number, the validation errors that triggered it, the model’s repair response, the new validation result, and the token cost. This metadata is essential for three reasons: (1) debugging why a specific item ended up in dead-letter, (2) measuring repair loop effectiveness (what percentage of items are fixed per attempt?), and (3) detecting systematic failures (if the same error type always exhausts the repair loop, the schema or prompt needs adjustment, not more retries).
A key metric is the repair funnel:
REPAIR FUNNEL METRICS
Initial validation: 500 items
Passed first try: 442 (88.4%) <- prompt effectiveness
Failed first try: 58 (11.6%) <- enter repair loop
Repair attempt 1: 58 items
Fixed in attempt 1: 39 (67.2%) <- repair loop value
Still failing: 19
Repair attempt 2: 19 items
Fixed in attempt 2: 10 (52.6%) <- diminishing returns
Still failing: 9
Dead-lettered: 9 (1.8%) <- human triage needed
Final pass rate: 491/500 (98.2%)
Repair loop uplift: +49 items (+9.8 percentage points)
Cost: 58+19 = 77 repair calls (average 1.33 attempts per repaired item)
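The funnel figures above can be derived mechanically from three raw counts. A sketch (the `FunnelCounts` shape and `funnelReport` name are assumptions of this example):

```typescript
// Illustrative funnel arithmetic; names are not from any library.
interface FunnelCounts {
  total: number;              // items entering initial validation
  passedFirstTry: number;     // valid without repair
  fixedPerAttempt: number[];  // items fixed at each repair attempt, e.g. [39, 10]
}

function funnelReport(c: FunnelCounts) {
  const repaired = c.fixedPerAttempt.reduce((a, b) => a + b, 0);
  let entering = c.total - c.passedFirstTry; // items entering the repair loop
  let repairCalls = 0;
  for (const fixed of c.fixedPerAttempt) {
    repairCalls += entering; // every item still failing costs one repair call
    entering -= fixed;
  }
  return {
    finalPassRate: (c.passedFirstTry + repaired) / c.total,
    upliftPoints: (repaired / c.total) * 100, // repair loop contribution
    deadLettered: entering,                   // items that exhausted attempts
    repairCalls,
  };
}
```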
Dead-letter router. When an item exhausts its repair attempts, the dead-letter router captures: the original input, the raw model response from each attempt, the validation errors from each attempt, the total token cost, the schema version, and a timestamp. The dead-letter sink should be a structured file (NDJSON) or database table that supports querying by error type, date range, and schema version. This enables systematic triage: if 80% of dead-letter items fail on the same constraint, that constraint should be reviewed (is it too strict? Is the prompt inadequate for that input class?).
The choice of max repair attempts is a cost/quality trade-off. Each attempt costs tokens (the repair prompt is often as long as the original prompt) and adds latency. Empirical evidence suggests diminishing returns after 2-3 attempts: if the model cannot fix the error with explicit feedback in 2 tries, a third try rarely helps. A max of 2 is recommended for most use cases, with 3 for high-value items where the cost of dead-lettering is significant.
Escalation vs dead-letter. Not all failures should go to the same place. Schema failures (wrong type, missing field) that exhaust the repair loop go to dead-letter for batch triage. Policy failures (safety constraint violated, PII detected in output) should never enter the repair loop at all; they should escalate immediately to human review. The repair loop should only attempt to fix structural and constraint errors, not semantic or policy errors.
FAILURE ROUTING DECISION TREE
Validation fails
|
+-- Is it a POLICY failure?
| |
| YES --> ESCALATE immediately (no repair loop)
|
+-- Is it a SCHEMA/CONSTRAINT failure?
|
YES --> Enter repair loop
|
+-- Attempt 1: repair with error feedback
| |
| PASS --> Done (emit repaired output)
| FAIL --> Continue
|
+-- Attempt 2: repair with cumulative errors
| |
| PASS --> Done (emit repaired output)
| FAIL --> Dead-letter
|
+-- DEAD-LETTER: store with full audit trail
Idempotency and determinism. For reproducible testing, the repair loop must be deterministic when given the same input, seed, and max attempts. This means using fixed seeds for model calls during testing, deterministic error formatting (sorted error lists), and stable repair prompt construction. In production, determinism is relaxed (temperature > 0), but the attempt tracker still records all inputs and outputs for debugging.
Cost awareness. Each repair attempt has a token cost. Your enforcer should track: tokens per initial generation, tokens per repair attempt, total tokens per item (including repairs), and cost per successful output. This enables cost optimization: if a specific schema constraint causes 80% of repair loop entries and each repair costs $0.02, the total cost of that constraint across 10,000 items is significant. It might be cheaper to relax the constraint or improve the prompt than to pay for repairs.
How this fits into the project The repair loop and dead-letter handling are the core differentiating features of Project 2. You will build the error formatter, the repair prompt builder, the bounded attempt tracker, and the dead-letter router with full diagnostic metadata.
Definitions & key terms
- Repair loop: A bounded feedback cycle where validation errors from a failed output are sent back to the model along with the previous output, asking for targeted correction.
- Bounded: Having a configurable maximum number of attempts (typically 2-3) to prevent infinite retries and runaway costs.
- Dead-letter sink: A storage location for items that could not be repaired within the maximum attempts, containing full diagnostic metadata for manual triage.
- Repair funnel: A metric showing how many items pass at each stage: initial validation, repair attempt 1, repair attempt 2, and dead-letter.
- Uplift: The percentage point improvement in pass rate attributable to the repair loop (e.g., 88% initial -> 98% final = 10 percentage points of uplift).
- Error formatter: The component that translates raw validator errors into human-readable (and model-readable) descriptions with field paths and constraint explanations.
- Attempt tracker: A log of each repair attempt including the errors that triggered it, the model’s response, the validation result, and the token cost.
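The error formatter defined above can be sketched in a few lines of TypeScript. The `Violation` fields mirror the data structures later in this guide but are illustrative here; the key property is determinism, achieved by sorting on field path.

```typescript
// Deterministic, model-readable error list: sorting by JSON Pointer path
// guarantees the same violations always yield the same repair prompt text.
interface Violation {
  path: string;       // JSON Pointer, e.g. "/items/1/quantity"
  constraint: string; // e.g. "enum", "minimum", "additionalProperties"
  expected: unknown;
  actual: unknown;
}

function formatErrors(violations: Violation[]): string {
  return [...violations]
    .sort((a, b) => a.path.localeCompare(b.path))
    .map((v, i) =>
      `${i + 1}. ${v.path}: violates "${v.constraint}" ` +
      `(expected ${JSON.stringify(v.expected)}, got ${JSON.stringify(v.actual)})`)
    .join("\n");
}
```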
Mental model diagram (ASCII)
Raw LLM Response
|
v
+--------------------+
| Schema Validation |
+--------+-----------+
|
+-----+-----+
| |
PASS FAIL
| |
v v
Output +-------------------+
to | Is it a POLICY |
consumer | failure? |
+----+----+---------+
| |
YES NO
| |
v v
ESCALATE +----------------------------+
(no | REPAIR LOOP |
repair) | |
| Attempt 1: |
| +----------------------+ |
| | Format errors | |
| | Build repair prompt | |
| | Call model | |
| | Validate result | |
| +----------+-----------+ |
| | |
| +-----+-----+ |
| | | |
| PASS FAIL |
| | | |
| v v |
| Output Attempt 2: |
| (track (same steps) |
| uplift) | |
| +-----+-----+ |
| | | |
| PASS FAIL |
| | | |
| v v |
| Output DEAD |
| LETTER |
+----------------------------+
|
v
+----------------------------+
| DEAD-LETTER SINK |
| |
| - original input |
| - raw response per attempt |
| - errors per attempt |
| - token cost per attempt |
| - schema version |
| - timestamp |
+----------------------------+
How it works (step-by-step, with invariants and failure modes)
1. Receive a validation failure from the schema validation layer (Concept A). The failure includes all violation details with field paths. Invariant: the failure object is always non-empty (at least one violation). Failure mode: an empty violation list indicates a bug in the validator.
2. Check failure type. If any violation is a policy failure (safety, PII, compliance), route immediately to escalation. Do NOT enter the repair loop for policy failures. Invariant: policy failures never enter the repair loop. Failure mode: a miscategorized policy failure entering the repair loop could produce an output that passes schema validation but violates safety constraints.
3. Format errors for the repair prompt. Translate each violation into model-readable text with field path, constraint description, actual value, and expected constraint. Sort errors by field path for deterministic prompt construction. Invariant: the error format is deterministic for the same set of violations.
4. Build the repair prompt. Include the original task, the previous invalid output, and the formatted errors. Instruct the model to fix ONLY the listed errors and return corrected JSON with no explanation. Invariant: the repair prompt never includes PII or sensitive data beyond what was in the original task. Failure mode: an overly long repair prompt (large previous output + many errors) may exceed the model’s context window. Truncate the previous output if necessary, keeping the error-relevant fields.
5. Execute the repair call with the repair prompt. Track the token usage. Invariant: the repair call uses the same model as the original generation (to avoid model-specific formatting differences). Failure mode: the model returns a non-JSON response (explanation instead of corrected JSON). Handle this as a repair failure and decrement the remaining attempts.
6. Validate the repair result against the same schema. If it passes, emit the repaired output and record it as a repair success with the attempt number. If it fails, check if the new violations are different from the old ones (progress was made but new errors were introduced) or the same (the model failed to fix the reported issues). Invariant: the validation step after repair uses the exact same schema and validator configuration as the initial validation.
7. If attempts remain, go to step 3 with the cumulative error history (errors from all previous attempts, not just the latest). Including cumulative errors prevents the model from oscillating between two invalid states. Failure mode: the model “fixes” error A but reintroduces error B from the first attempt, creating an oscillation. Cumulative error feedback mitigates this.
8. If attempts exhausted, route to dead-letter. Write the full attempt history (raw responses, errors, token costs per attempt) to the dead-letter sink. Invariant: every dead-lettered item includes the complete attempt history, not just the final failure.
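The steps above can be condensed into a bounded loop. This is a minimal sketch: `validate` and `callModel` are injected stubs (assumptions for illustration, not a real model client), and real attempt records would also carry token costs and timestamps.

```typescript
// Bounded repair loop sketch. Cumulative errors from every prior attempt
// are passed back to the model to discourage oscillation between states.
interface AttemptRecord {
  attempt: number;
  response: string;  // what the model returned for this attempt
  errors: string[];  // the errors that triggered this attempt
}

interface RepairOutcome {
  status: "repaired" | "dead_letter";
  output?: string;
  attempts: AttemptRecord[];
}

function repairLoop(
  initialResponse: string,
  validate: (s: string) => string[],                     // [] means PASS
  callModel: (prev: string, errors: string[]) => string, // injected stub
  maxAttempts = 2,
): RepairOutcome {
  const attempts: AttemptRecord[] = [];
  const cumulative: string[] = [];
  let response = initialResponse;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const errors = validate(response);
    if (errors.length === 0) return { status: "repaired", output: response, attempts };
    cumulative.push(...errors);
    response = callModel(response, [...cumulative]);
    attempts.push({ attempt, response, errors });
  }
  // Validate the final repair attempt before giving up.
  if (validate(response).length === 0) {
    return { status: "repaired", output: response, attempts };
  }
  return { status: "dead_letter", attempts };
}
```

With `maxAttempts = 2` the loop makes at most two model calls before dead-lettering, matching the recommendation earlier in this concept.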
Minimal concrete example
REPAIR LOOP TRACE (invoice extraction)
--- Attempt 0 (initial generation) ---
Response: {"invoice_id": "INV-00042", "items": [...], "total": 32.50, "currency": "JPY", "notes": "AI processed"}
Validation: FAIL (2 errors)
Error 1: /currency - enum violation, "JPY" not in ["USD","EUR","GBP"]
Error 2: /notes - additionalProperties violation, unexpected field
--- Attempt 1 (repair) ---
Repair prompt: "Fix these errors in your JSON: 1) currency must be USD/EUR/GBP, 2) remove unexpected 'notes' field. Return corrected JSON only."
Response: {"invoice_id": "INV-00042", "items": [...], "total": 32.50, "currency": "USD"}
Validation: PASS
Result: REPAIRED (attempt 1), uplift recorded
--- Dead-letter example ---
Item exhausted 2 repair attempts. Errors oscillated between:
Attempt 1 errors: /items/0/quantity type mismatch (string vs integer)
Attempt 2 errors: /items/0/quantity type ok, but /total now negative
Dead-letter payload:
{
"original_input": "...",
"attempts": [
{"attempt": 0, "response": "...", "errors": [...], "tokens": 450},
{"attempt": 1, "response": "...", "errors": [...], "tokens": 380},
{"attempt": 2, "response": "...", "errors": [...], "tokens": 395}
],
"total_tokens": 1225,
"schema_version": "invoice.v2",
"dead_letter_reason": "max_attempts_exhausted",
"timestamp": "2026-01-18T14:30:00Z"
}
Common misconceptions
- “A repair loop is just a retry.” A retry re-runs the same prompt and hopes for different output. A repair loop provides explicit error feedback so the model can make targeted corrections. Retries work for transient errors (rate limits, timeouts). Repairs work for structural errors (wrong type, missing field) where the model needs to know what went wrong.
- “More repair attempts always improve results.” Empirically, repair effectiveness drops sharply after 2 attempts. If the model cannot fix the error with explicit feedback in 2 tries, a third try rarely helps. Increasing max attempts mostly increases cost without improving pass rate.
- “Dead-letter items can be ignored.” Dead-letter items represent systematic failures that require investigation. If 5% of items are dead-lettered, either the schema is too strict for the model’s capability, the prompt is inadequate for certain input classes, or the model has a consistent blind spot. Ignoring dead-letter defeats the purpose of the dead-letter sink.
- “The repair prompt should regenerate from scratch.” Regenerating from scratch often introduces new errors while fixing old ones. Including the previous output and asking for targeted fixes preserves valid fields and produces higher repair success rates.
- “All failures should enter the repair loop.” Policy failures (safety, compliance, PII) should never be retried or repaired. The model consistently produced an unsafe output; retrying it wastes tokens and delays escalation to human review. Only structural/constraint failures benefit from the repair loop.
Check-your-understanding questions
- Why should the repair prompt include the previous invalid output rather than asking the model to regenerate from scratch?
- How does cumulative error feedback (including errors from ALL previous attempts) prevent oscillation?
- What information should a dead-letter entry contain, and how would you use it for systematic triage?
- Why should policy failures bypass the repair loop entirely?
- How do you measure the cost-effectiveness of the repair loop, and at what point does the loop become more expensive than alternatives (like improving the prompt)?
Check-your-understanding answers
- Including the previous output enables targeted fixes: the model can change only the violated fields while preserving everything else. Regenerating from scratch often fixes the reported errors but introduces new ones, because the model is making independent generation decisions for every field. Targeted repair has a higher single-attempt success rate.
- Without cumulative feedback, the model might fix error A in attempt 1 but reintroduce error B (which was in the original response but fixed in attempt 1’s repair). With cumulative feedback, the repair prompt says “avoid ALL of these errors” including ones from previous attempts, which prevents the model from oscillating between two invalid states.
- A dead-letter entry should contain: original input, raw model response from each attempt, validation errors from each attempt, token cost per attempt, schema version, and timestamp. For triage: group dead-letter items by the most common error type. If 80% fail on the same constraint (e.g., currency enum), investigate whether the prompt mentions valid currencies, whether the schema enum is too restrictive, or whether the model has a training bias toward unsupported currencies.
- Policy failures indicate the model fundamentally misunderstands a safety or compliance constraint. Retrying a policy failure wastes tokens (the model will likely produce another unsafe output) and delays escalation to human review. The unsafe output should be flagged for human attention immediately, not fed back to the model for attempted repair.
- Cost-effectiveness = (repair_loop_uplift_in_pass_rate * value_per_successful_item) / (total_repair_tokens * cost_per_token). If improving the prompt by 5% would cost $200 of engineer time but the repair loop costs $0.02 per item across 10,000 daily items ($200/day), prompt improvement pays for itself in one day. Track the repair funnel metrics and compute break-even points to decide when to invest in prompt improvement vs continued repair.
Real-world applications
- Haystack framework: Implements a structured output loop with auto-correction, where validation errors are fed back to the model in a generator-validator loop with configurable max retries.
- Instructor library: Python library built on Pydantic that provides built-in retry mechanisms for structured LLM outputs, automatically validating against type definitions and retrying with error context.
- LangChain OutputFixingParser: Provides an output-fixing parser that catches parsing errors and asks the LLM to correct its output, using the error message as context.
- Apache Kafka dead-letter queues: The dead-letter pattern used in message processing systems where messages that cannot be deserialized or processed are routed to a separate topic for manual investigation.
- Payment processing systems: Financial transaction systems use bounded retry with dead-letter for failed transactions, ensuring no transaction is silently dropped while preventing infinite retry loops.
Where you’ll apply it
- Building the repair loop orchestrator with error formatting, prompt building, and attempt tracking.
- Implementing the dead-letter router with full diagnostic metadata.
- Measuring repair funnel metrics and computing cost-effectiveness.
References
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 (Encoding and Evolution) and Ch. 11 (Stream Processing) for dead-letter patterns.
- “Site Reliability Engineering” by Google - Ch. 22 (Addressing Cascading Failures) for bounded retry design.
- “Release It!” by Michael Nygard - Ch. 5 (Stability Patterns) for circuit breakers and bounded retries.
- “Building LLM Apps” by Valentina Alto - Chapters on output handling and error recovery.
- Haystack tutorial on structured output with loop-based auto-correction: https://haystack.deepset.ai/tutorials/28_structured_output_with_loop
- Instructor library documentation: https://python.useinstructor.com/
- json_repair Python library: https://github.com/mangiucugna/json_repair
- LangChain OutputFixingParser: https://python.langchain.com/docs/how_to/output_parser_fixing/
- AWS Dead Letter Queue documentation: https://aws.amazon.com/what-is/dead-letter-queue/
Key insights A repair loop transforms a binary outcome (valid/invalid) into a graduated recovery process with measurable cost and diminishing returns, while the dead-letter sink ensures that no failure is silently lost and every systematic failure pattern becomes visible for root-cause analysis.
Summary Bounded repair loops provide targeted error feedback to the model for self-correction, with a configurable maximum number of attempts to prevent infinite retries and runaway costs. The error formatter, repair prompt builder, attempt tracker, and dead-letter router form a complete pipeline. Policy failures bypass the repair loop entirely. Dead-letter items must contain full attempt history for systematic triage. Measuring the repair funnel (initial pass rate, per-attempt fix rate, dead-letter rate, token cost) enables cost-effectiveness analysis and informs decisions about when to invest in prompt improvement instead of continued repairs.
Homework/Exercises to practice the concept
1. Design a repair prompt template. Given an invoice extraction task, write the repair prompt for a response that has 3 validation errors: wrong currency enum, negative unit_price, and an extra “notes” field. Include the previous output, the formatted errors, and clear instructions. Explain why you formatted the errors the way you did.
2. Calculate repair loop cost-effectiveness. Given: 10,000 items/day, initial pass rate 88%, repair attempt 1 fixes 67% of failures, repair attempt 2 fixes 52% of remaining failures, each repair call costs $0.015 in tokens. Calculate: daily repair cost, dead-letter rate, and the break-even point for prompt improvement (if a prompt engineer spending $500 could increase initial pass rate to 94%).
3. Design a dead-letter triage workflow. Describe how a team would review dead-letter items: how often to review, how to categorize items, what actions to take for each category (schema too strict, prompt inadequate, model limitation), and how to close the feedback loop.
4. Implement oscillation detection. Describe an algorithm that detects when the repair loop is oscillating between two invalid states (attempt 1 fixes error A but introduces error B, attempt 2 fixes error B but reintroduces error A). What action should the algorithm take when oscillation is detected?
Solutions to the homework/exercises
1. The repair prompt should include: original task context (“Extract invoice data from the following text…”), the full previous JSON output, and errors formatted as a numbered list with field path, actual value, expected constraint, and semantic explanation. Example: “1. currency: you used ‘JPY’ but must be one of: USD, EUR, GBP. 2. items[1].unit_price: value is -5 but must be >= 0 (prices cannot be negative). 3. Remove the ‘notes’ field (not in schema).” The semantic explanation (“prices cannot be negative”) is included because it helps the model understand the business reason, not just the technical constraint.
2. Daily items: 10,000. Initial pass: 8,800. Failures: 1,200. Attempt 1: fixes 804, leaving 396. Attempt 2: fixes 206, leaving 190. Dead-letter: 190 (1.9%). Total repair calls: 1,200 + 396 = 1,596. Daily cost: 1,596 * $0.015 = $23.94. With improved prompt (94% initial pass): failures drop to 600. Attempt 1: fixes 402, leaving 198. Attempt 2: fixes 103, leaving 95. Repair calls: 600 + 198 = 798. Daily cost: $11.97. Savings: $11.97/day. Break-even: $500 / $11.97 = 42 days. The prompt improvement pays for itself in about 6 weeks, plus reduces dead-letter volume from 190 to 95 items/day.
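The funnel arithmetic in this solution can be checked with a small calculator. A sketch under stated assumptions: rates and prices come from the exercise, and the rounding of fractional items to whole counts is a modeling choice.

```typescript
// Repair funnel calculator for the cost exercise. Every failure gets
// attempt 1; only survivors of attempt 1 get attempt 2.
function funnel(
  items: number,
  initialPassRate: number,
  fixRate1: number,
  fixRate2: number,
  costPerCall: number,
) {
  const failures = Math.round(items * (1 - initialPassRate));
  const fixed1 = Math.round(failures * fixRate1);
  const remaining1 = failures - fixed1;
  const fixed2 = Math.round(remaining1 * fixRate2);
  const deadLetter = remaining1 - fixed2;
  const repairCalls = failures + remaining1;
  return { failures, fixed1, fixed2, deadLetter, dailyCost: repairCalls * costPerCall };
}
```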
3. Triage workflow: Review daily (or weekly for low-volume). Categorize by most common error type. Actions: (a) Schema too strict: if >30% of dead-letters fail on one constraint, review whether the constraint matches real-world data (e.g., is “JPY” actually a valid currency for this use case?). (b) Prompt inadequate: if dead-letters cluster around one input pattern, add examples of that pattern to the prompt. (c) Model limitation: if the model consistently cannot produce a specific structure, consider simplifying the schema or using constrained decoding. Feedback loop: after each fix, measure whether dead-letter rate decreased in subsequent runs.
4. Oscillation detection: after each repair attempt, compare the current error set to all previous error sets. If the current errors are a subset of errors from attempt N-2 (two steps back), oscillation is detected. Action: immediately dead-letter the item with reason “oscillation_detected” instead of continuing attempts. Include all attempt histories in the dead-letter entry. Oscillation indicates the model is trapped between two local optima and additional attempts will not converge.
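The comparison described in this solution can be implemented by canonicalizing each error set before comparing. A sketch; this variant flags any exact repeat of an earlier error set, which covers the two-state oscillation case.

```typescript
// Oscillation check: canonicalize each error set (sort + join) and flag a
// repeat of any earlier attempt's errors as oscillation.
function canonical(errors: string[]): string {
  return [...errors].sort().join("|");
}

function isOscillating(history: string[][], current: string[]): boolean {
  const key = canonical(current);
  return history.some((past) => canonical(past) === key);
}
```

On detection, the item should be dead-lettered immediately with reason "oscillation_detected" rather than spending another attempt.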
3. Project Specification
3.1 What You Will Build
A structured-output gateway that enforces JSON schemas and runs bounded repair retries.
3.2 Functional Requirements
- Validate every model output against a strict JSON schema with extraction handling for common LLM pathologies.
- If validation fails, run bounded repair prompts with explicit parser errors formatted for model correction.
- Stop after max attempts and emit dead-letter payload with full attempt history for manual review.
- Track pass-rate metrics by schema version, including repair funnel breakdowns.
3.3 Non-Functional Requirements
- Performance: 500 items processed under 2 minutes with parallel validation workers.
- Reliability: Repair behavior is deterministic for identical input, seed, and max-repair values.
- Security/Policy: Repair prompts must not execute tool calls or side effects. Policy failures bypass the repair loop entirely.
3.4 Example Usage / Output
$ uv run p02-enforcer validate --input fixtures/invoices.ndjson --schema schemas/invoice.v2.json --max-repair 2 --out out/p02
[INFO] Loaded 500 responses from fixtures/invoices.ndjson
[INFO] Schema: invoice.v2.json (draft 2020-12)
[INFO] Initial schema pass: 442/500 (88.4%)
[INFO] Repair loop started for 58 items (max 2 attempts each)
[INFO] Repair attempt 1: +39 fixed (67.2%)
[INFO] Repair attempt 2: +10 fixed (52.6% of remaining)
[PASS] Final schema pass: 491/500 (98.2%)
[INFO] Repair loop uplift: +49 items (+9.8pp)
[INFO] Dead-letter items: 9 (exported to out/p02/dead_letter.ndjson)
[INFO] Total repair token cost: 29,250 tokens
[INFO] Report written: out/p02/report.json
3.5 Data Formats / Schemas / Protocols
- NDJSON input records containing raw model output text.
- JSON Schema documents versioned by semantic version.
- Dead-letter NDJSON with raw responses, parser errors per attempt, and token costs.
- Report JSON with repair funnel metrics, per-schema-version pass rates, and cost breakdown.
3.6 Edge Cases
- Output is valid JSON but fails enum constraints.
- Output includes trailing analysis text after JSON block (“Here is the result: {…} Let me know…”).
- Repair loop oscillates between two invalid forms (fix A breaks B, fix B breaks A).
- Schema version mismatch between producer and enforcer.
- Model returns explanation text instead of corrected JSON during repair.
- Token limit truncates the JSON output mid-field.
- Repair prompt exceeds model context window (large previous output + many errors).
- Dead-letter sink write fails due to disk space.
3.7 Real World Outcome
This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.
3.7.1 How to Run (Copy/Paste)
$ uv run p02-enforcer validate --input fixtures/invoices.ndjson --schema schemas/invoice.v2.json --max-repair 2 --out out/p02
- Working directory: project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS
- Required inputs: project fixtures under fixtures/
- Output directory: out/p02
3.7.2 Golden Path Demo (Deterministic)
Use the fixed seed already embedded in the command or config profile. You should see stable pass/fail totals between runs.
3.7.3 If CLI: exact terminal transcript
$ uv run p02-enforcer validate --input fixtures/invoices.ndjson --schema schemas/invoice.v2.json --max-repair 2 --out out/p02
[INFO] Loaded 500 responses from fixtures/invoices.ndjson
[INFO] Schema: invoice.v2.json (draft 2020-12)
[INFO] Extraction stats: 470 direct parse, 22 code-fence strip, 8 substring extract
[INFO] Initial schema pass: 442/500 (88.4%)
[INFO] Repair loop started for 58 items (max 2 attempts each)
[INFO] Repair attempt 1: +39 fixed (67.2% of 58)
[INFO] Repair attempt 2: +10 fixed (52.6% of 19)
[PASS] Final schema pass: 491/500 (98.2%)
[INFO] Repair loop uplift: +49 items (+9.8pp)
[INFO] Dead-letter items: 9 (1.8%)
[INFO] Dead-letter breakdown: 4 oscillation, 3 type_mismatch, 2 enum_violation
[INFO] Total repair token cost: 29,250 tokens ($0.44)
[INFO] Report written: out/p02/report.json
[INFO] Dead-letter file: out/p02/dead_letter.ndjson
$ echo $?
0
Failure demo:
$ uv run p02-enforcer validate --input fixtures/invoices.ndjson --schema schemas/missing.json --max-repair 2 --out out/p02
[ERROR] Schema file not found: schemas/missing.json
[HINT] Available schemas: schemas/invoice.v1.json, schemas/invoice.v2.json
$ echo $?
2
Dead-letter inspection:
$ uv run p02-enforcer inspect-dead-letter --file out/p02/dead_letter.ndjson --group-by error_type
[INFO] Dead-letter summary (9 items):
oscillation: 4 items (44.4%)
type_mismatch: 3 items (33.3%)
enum_violation: 2 items (22.2%)
[INFO] Most common failing field: /items/*/quantity (5 occurrences)
[HINT] Consider reviewing schema constraint for items[].quantity or adding examples to prompt.
4. Solution Architecture
4.1 High-Level Design
+-------------------+
| CLI Interface |
| (input, schema, |
| max-repair, out) |
+--------+----------+
|
+------------+------------+
| |
v v
+-------------------+ +--------------------+
| Input Loader | | Schema Loader |
| (NDJSON records) | | (versioned JSON |
| | | Schema files) |
+--------+----------+ +--------+-----------+
| |
+------------+-----------+
|
v
+----------------------------+
| JSON Extraction Layer |
| (direct, substring, fence) |
+-------------+--------------+
|
v
+----------------------------+
| Schema Validator |
| (all violations, paths) |
+-------------+--------------+
|
+---------+---------+
| |
PASS FAIL
| |
v v
+----------+ +-------------------+
| Output | | Failure Router |
| Emitter | | (policy vs schema)|
+----------+ +--------+----------+
|
+--------+---------+
| |
POLICY SCHEMA
| |
v v
ESCALATE +------------------+
| Repair Loop |
| (format errors, |
| build prompt, |
| track attempts) |
+--------+---------+
|
+---------+---------+
| |
FIXED EXHAUSTED
| |
v v
+----------+ +-------------------+
| Output | | Dead-Letter Sink |
| Emitter | | (full audit trail)|
| (uplift) | +-------------------+
+----------+
|
v
+----------------------------+
| Metrics & Report |
| (funnel, cost, pass rate) |
+----------------------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| JSON Extraction Layer | Handles LLM output pathologies (code fences, trailing text, truncation). | Try multiple strategies in order; log which strategy succeeded for metrics. |
| Schema Validator | Evaluates payloads against selected schema version, reports ALL violations with paths. | Use ajv (TS) or jsonschema/pydantic (Python). Set additionalProperties: false by default. |
| Failure Router | Separates policy failures (escalate) from schema failures (repair loop). | Policy failures NEVER enter the repair loop. |
| Repair Loop | Formats errors, builds repair prompts, tracks attempts, detects oscillation. | Max 2 attempts. Include previous output + all errors. Detect oscillation early. |
| Dead-Letter Sink | Stores unrepairable outputs with full attempt history for triage. | NDJSON format. Include raw responses, errors, and token costs per attempt. |
| Metrics Reporter | Computes repair funnel, cost breakdown, and per-schema pass rates. | Export as JSON for dashboard consumption. |
4.3 Data Structures (No Full Code)
EnforcerInput:
- item_id: string
- raw_response: string
- schema_version: string
ExtractionResult:
- status: enum[DIRECT_PARSE, SUBSTRING, CODE_FENCE, EXTRACTION_FAILED]
- parsed_json: object | null
- extraction_method: string
ValidationResult:
- status: enum[PASS, FAIL]
- violations: list[Violation]
- schema_version: string
Violation:
- path: string (JSON Pointer, e.g., "/items/1/quantity")
- constraint: string (e.g., "minimum")
- expected: any
- actual: any
- message: string
RepairAttempt:
- attempt_number: integer
- repair_prompt: string (truncated for storage)
- raw_response: string
- validation_result: ValidationResult
- tokens_used: integer
DeadLetterEntry:
- item_id: string
- original_input: string
- attempts: list[RepairAttempt]
- total_tokens: integer
- dead_letter_reason: enum[MAX_ATTEMPTS, OSCILLATION, POLICY_BLOCK]
- schema_version: string
- timestamp: string
RepairFunnel:
- total_items: integer
- initial_pass: integer
- per_attempt_fixes: list[integer]
- dead_lettered: integer
- final_pass_rate: float
- uplift_pp: float
- total_repair_tokens: integer
- total_repair_cost: float
4.4 Algorithm Overview
Key algorithm: Extract -> Validate -> Route -> Repair/Dead-letter
- Extract JSON from raw LLM response using strategy ladder.
- Validate extracted JSON against versioned schema, collecting all violations.
- Route: PASS outputs to emitter, POLICY failures to escalation, SCHEMA failures to repair loop.
- Repair loop: format errors, build repair prompt, call model, validate result, track attempt.
- If fixed: emit repaired output with uplift metadata. If oscillation detected or attempts exhausted: dead-letter.
- Aggregate metrics: repair funnel, cost breakdown, per-schema pass rates.
Complexity Analysis (conceptual):
- Time: O(n * r * v) where n is items, r is max repair attempts, v is validation cost per item.
- Space: O(n) for results + O(d * r) for dead-letter items with r attempts each.
5. Implementation Guide
5.1 Development Environment Setup
# 1) Install dependencies (TypeScript/Node.js 20+, ajv, or Python 3.11+, pydantic)
# 2) Prepare fixtures under fixtures/ (NDJSON with raw model responses)
# 3) Place schemas under schemas/ (versioned JSON Schema files)
# 4) Run: uv run p02-enforcer validate --input fixtures/invoices.ndjson --schema schemas/invoice.v2.json --max-repair 2 --out out/p02
5.2 Project Structure
p02/
├── src/
│ ├── cli.ts # CLI argument parsing and entrypoint
│ ├── extraction.ts # JSON extraction strategy ladder
│ ├── validator.ts # Schema validation with all-violations reporting
│ ├── repair_loop.ts # Error formatting, prompt building, attempt tracking
│ ├── dead_letter.ts # Dead-letter sink and triage support
│ ├── metrics.ts # Repair funnel and cost computation
│ └── report.ts # JSON report generation
├── schemas/
│ ├── invoice.v1.json
│ └── invoice.v2.json
├── fixtures/
│ └── invoices.ndjson
├── out/
└── README.md
5.3 The Core Question You’re Answering
“How do I force typed outputs when the model sometimes drifts into free-form text?”
This question matters because downstream systems expect deterministic, schema-conformant data, and any output that deviates silently corrupts data pipelines. The answer is not just “validate,” but “validate, repair with feedback, and dead-letter what cannot be fixed.”
5.4 Concepts You Must Understand First
- JSON Schema specification and validation
- Why is additionalProperties: false important for LLM outputs?
- Book Reference: “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 (Encoding and Evolution)
- Retry and repair loop design
- What is the difference between a blind retry and an informed repair?
- Book Reference: “Site Reliability Engineering” by Google - Ch. 22 (Addressing Cascading Failures)
- Dead-letter queue patterns
- Why move unrepairable items to a dead-letter sink instead of dropping them?
- Book Reference: “Release It!” by Michael Nygard - Ch. 5 (Stability Patterns)
- Constrained decoding for structured outputs
- How do provider-level structured output features change your validation strategy?
- Book Reference: “AI Engineering” by Chip Huyen - Structured output chapters
5.5 Questions to Guide Your Design
- Extraction and validation
- What LLM output pathologies must your extraction layer handle?
- How do you configure your validator to report ALL violations in a single pass?
- Should you use additionalProperties: false by default or make it configurable?
- Repair loop design
- How do you format validation errors so the model can fix them effectively?
- Should the repair prompt include the full previous output or just the errors?
- How do you detect oscillation (fix A breaks B, fix B breaks A)?
- Dead-letter and metrics
- What metadata does each dead-letter entry need for effective triage?
- How do you compute repair loop cost-effectiveness to decide when prompt improvement is better than more repairs?
- How do you make repair funnel metrics useful for non-technical stakeholders?
5.6 Thinking Exercise
Pre-Mortem for JSON Output Enforcer
Before implementing, map out the 10 most likely failure modes for a schema-validated LLM output pipeline in production. For each, classify whether it is: extraction failure, schema validation failure, repair loop failure, or operational failure.
Specific scenarios to analyze:
- The model wraps JSON in triple backticks with a language identifier
- The model returns a valid JSON array when the schema expects an object
- The repair prompt is so long it gets truncated by the context window
- Two different schema versions are deployed simultaneously
- The dead-letter file grows to 100GB because of a model regression
- A schema update changes “required” to include a field that 60% of existing responses lack
- The repair loop fixes structural errors but introduces semantic errors (wrong values)
Questions to answer:
- Which failures should trigger alerts?
- Which failures should be self-healing?
- Which require manual intervention and process changes?
5.7 The Interview Questions They’ll Ask
- “Why is schema validation necessary but not sufficient for LLM output quality?”
- “How do you design bounded retries without hiding defects behind successful repairs?”
- “What should go into a dead-letter queue for AI outputs, and how do you triage it?”
- “How would you version schemas safely across teams that consume the same model outputs?”
- “How do you prevent repair prompts from introducing new risks (hallucinated values, PII leakage)?”
- “Compare constrained decoding (provider-level) vs application-level validation. When do you need both?”
5.8 Hints in Layers
Hint 1: Build the extraction layer first Before you validate, you need clean JSON. Build and test the extraction strategy ladder (direct parse, substring, code fence strip) with realistic LLM outputs. Track which strategy succeeds per item – this metric tells you how clean the model’s formatting is.
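One way to sketch the strategy ladder from this hint in TypeScript. The code-fence regex and the method names are illustrative assumptions; the point is trying strategies from cheapest to most permissive and recording which one succeeded.

```typescript
type ExtractionMethod = "direct_parse" | "code_fence" | "substring" | "failed";

interface Extraction {
  method: ExtractionMethod; // which ladder rung succeeded (feeds metrics)
  value: unknown;
}

// Strategy ladder: direct parse, then code-fence strip, then substring
// between the first "{" and the last "}".
function extractJson(raw: string): Extraction {
  try {
    return { method: "direct_parse", value: JSON.parse(raw) };
  } catch { /* fall through */ }
  const fence = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  if (fence) {
    try {
      return { method: "code_fence", value: JSON.parse(fence[1]) };
    } catch { /* fall through */ }
  }
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start !== -1 && end > start) {
    try {
      return { method: "substring", value: JSON.parse(raw.slice(start, end + 1)) };
    } catch { /* fall through */ }
  }
  return { method: "failed", value: null };
}
```

Aggregating `method` counts per run produces the extraction stats line shown in the golden transcript (direct parse vs code-fence strip vs substring extract).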
Hint 2: Report all violations in one pass. Configure your validator (ajv, pydantic, jsonschema) to collect ALL violations, not just the first. This is critical for repair efficiency. With ajv, for example:

```typescript
import Ajv from "ajv";

const ajv = new Ajv({ allErrors: true }); // report every violation, not just the first
const validate = ajv.compile(schema);

if (!validate(parsedJson)) {
  // Each ErrorObject carries instancePath, keyword, and params; format all of
  // them (path, constraint, actual value) for use in the repair prompt.
  const formatted = formatAllErrors(validate.errors ?? []);
}
```
Hint 3: Include the previous output in repair prompts. The repair prompt should include the original task, the previous invalid JSON, and all formatted errors. Tell the model to fix ONLY the errors and return JSON with no explanation. This targeted approach has a much higher success rate than full regeneration.
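A sketch of such a targeted repair-prompt builder; the field names and wording are illustrative, not a fixed template:

```typescript
// Inputs for one repair attempt: the original instruction, the invalid
// output from the last attempt, and every formatted violation.
interface RepairInput {
  task: string;
  previousJson: string;
  errors: string[]; // each entry: path, violated constraint, actual value
}

function buildRepairPrompt({ task, previousJson, errors }: RepairInput): string {
  return [
    `Original task:\n${task}`,
    `Your previous response was invalid for the required schema:\n${previousJson}`,
    `Violations:\n${errors.map(e => `- ${e}`).join("\n")}`,
    "Fix ONLY these violations. Return the corrected JSON and nothing else.",
  ].join("\n\n");
}
```

Keeping the previous output in the prompt is what lets the model preserve already-valid fields instead of regenerating them.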
Hint 4: Detect oscillation early. After each repair attempt, compare the current error set to the error sets from all previous attempts. If you see a repeated pattern (the same errors returning after being fixed), dead-letter immediately instead of wasting another attempt.
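One way to make error sets comparable is to reduce each attempt's validator output to a canonical signature (the field names below follow ajv's `ErrorObject`; the signature format itself is an assumption):

```typescript
// Canonical signature of one attempt's error set: sorted path:keyword pairs.
function errorSignature(errors: Array<{ instancePath: string; keyword: string }>): string {
  return errors
    .map(e => `${e.instancePath}:${e.keyword}`)
    .sort()
    .join("|");
}

// If the current signature matches any earlier attempt, the loop is cycling
// rather than converging: dead-letter now instead of burning another attempt.
function isOscillating(history: string[], current: string): boolean {
  return history.includes(current);
}
```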
5.9 Books That Will Help
| Topic | Book | Chapter |
|-------|------|---------|
| Schema evolution and data contracts | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 4 |
| Bounded retry and stability patterns | “Release It!” by Michael Nygard | Ch. 5 |
| Cascading failure prevention | “Site Reliability Engineering” by Google | Ch. 22 |
| Structured outputs for LLMs | “AI Engineering” by Chip Huyen | Output handling chapters |
| LLM application patterns | “Building LLM Apps” by Valentina Alto | Relevant chapters |
5.10 Implementation Phases
Phase 1: Foundation
- Build the JSON extraction layer with strategy ladder and extraction metrics.
- Implement schema validation with all-violations reporting and field path tracking.
- Build the output emitter for passing items.
- Checkpoint: Load NDJSON fixtures, extract JSON, validate against schema, report all violations with paths for failing items.
Phase 2: Core Functionality
- Build the repair loop: error formatter, repair prompt builder, attempt tracker.
- Implement oscillation detection.
- Build the dead-letter sink with full audit trail.
- Checkpoint: Run full pipeline: validate -> repair (max 2) -> dead-letter. Repair funnel metrics are computed.
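A possible shape for one dead-letter NDJSON line, with enough context to triage an item without re-running the pipeline (all field names here are illustrative):

```typescript
// One dead-letter record: why the item failed, plus the full attempt history.
interface DeadLetterEntry {
  itemId: string;
  schemaVersion: string;                 // e.g. "invoice.v2"
  rawOutput: string;                     // model output, truncated for storage
  attempts: Array<{
    attempt: number;
    errors: string[];                    // formatted violations for this attempt
    promptTokens: number;
    completionTokens: number;
  }>;
  reason: "max_attempts" | "oscillation" | "extraction_failed";
  timestamp: string;                     // ISO 8601
}

const entry: DeadLetterEntry = {
  itemId: "inv-0042",
  schemaVersion: "invoice.v2",
  rawOutput: '{"total": "forty-two"}',
  attempts: [
    { attempt: 1, errors: ["/total: must be number"], promptTokens: 812, completionTokens: 64 },
  ],
  reason: "max_attempts",
  timestamp: new Date().toISOString(),
};

// One JSON object per line keeps the sink appendable and line-inspectable.
const ndjsonLine = JSON.stringify(entry) + "\n";
```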
Phase 3: Operational Hardening
- Add repair funnel metrics, cost breakdown, and per-schema pass rate tracking.
- Add `inspect-dead-letter` CLI command for triage support.
- Add schema version tracking in all artifacts.
- Document runbook: how to investigate dead-letter patterns, how to update schemas, how to tune repair parameters.
- Checkpoint: Complete CLI with validate and inspect-dead-letter commands. Full report.json with funnel and cost metrics. Reproducible results with fixed seed.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Extraction strategy | Single parse vs strategy ladder | Strategy ladder (3 strategies) | Handles real-world LLM output pathologies without entering repair loop |
| Validation library | ajv (TS) / pydantic (Python) vs custom | Library-based (ajv or pydantic) | Battle-tested, full spec support, all-violations mode available |
| additionalProperties | true vs false | false by default | Prevents hallucinated extra fields from reaching downstream systems |
| Repair prompt style | Regenerate vs targeted fix | Targeted fix with previous output | Higher single-attempt success rate, preserves valid fields |
| Max repair attempts | 1 / 2 / 3 | 2 | Diminishing returns after 2; 3 rarely improves over 2 but costs more |
| Dead-letter format | Database vs NDJSON file | NDJSON file | Simple, portable, supports streaming writes and line-based inspection |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Validate individual components | Extraction strategies, error formatting, oscillation detection |
| Integration Tests | Verify end-to-end pipeline | Full validate command with fixtures producing correct report |
| Regression Tests | Ensure enforcer itself is deterministic | Same input + seed + schema always produces same results |
| Edge Case Tests | Ensure robust pathology handling | Truncated JSON, multiple JSON objects, empty response |
6.2 Critical Test Cases
- Golden path: 500-item fixture with 88% initial pass, repair loop brings it to 98%+.
- All-pass: fixture where every item passes initial validation (repair loop is skipped).
- All-fail: fixture where every item fails (measures dead-letter sink capacity).
- Oscillation: fixture where repair attempts oscillate (detected and dead-lettered early).
- Extraction pathologies: code fences, trailing text, multiple JSON objects, truncated JSON.
- Schema not found: graceful error with available schema list.
- Deterministic replay: same seed + input always produces identical report.
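The golden-path pass-rate bookkeeping above (88% initial, 98%+ after repair) can be sketched as a small funnel computation; the counter names are assumptions:

```typescript
// Raw counters collected over one run of the pipeline.
interface FunnelCounts {
  total: number;        // items processed
  initialPass: number;  // passed schema validation on first try
  repairedPass: number; // failed initially, passed after repair
  deadLettered: number; // exhausted attempts or oscillated
}

function funnelMetrics(c: FunnelCounts) {
  const initialRate = c.initialPass / c.total;
  const finalRate = (c.initialPass + c.repairedPass) / c.total;
  return {
    initialRate,
    finalRate,
    repairUplift: finalRate - initialRate,   // the repair loop's contribution
    deadLetterRate: c.deadLettered / c.total,
  };
}
```

On the golden-path fixture this would report an initial rate of 0.88, a final rate of 0.98, and an uplift of 0.10.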
6.3 Test Data
```
fixtures/golden_path_invoices.ndjson   # 500 items, ~88% initial pass rate
fixtures/all_pass_invoices.ndjson      # 100 items, all valid
fixtures/pathology_samples.ndjson      # Various extraction challenges
fixtures/oscillation_cases.ndjson      # Items that trigger oscillation
schemas/invoice.v1.json                # Older schema version
schemas/invoice.v2.json                # Current schema version
schemas/invoice.v3-draft.json          # Future schema for compatibility testing
```
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---------|---------|----------|
| “Repair loop is expensive” | Token costs spike with high failure rates. | Cap attempts at 2. Measure repair funnel and invest in prompt improvement when cost exceeds threshold. |
| “Schema passes but values are nonsense” | Business-level validation is missing (e.g., negative prices pass type check). | Add semantic validators (or numeric range constraints in schema) after schema pass. |
| “Breakage after schema update” | No compatibility test suite exists. | Run contract tests against prior fixtures before promoting new schema version. |
| “Repair prompt exceeds context window” | Large previous output + many errors overflow the model’s token limit. | Truncate previous output to error-relevant fields. Prioritize errors by severity. |
| “Dead-letter fills disk” | Model regression causes massive dead-letter volume. | Set alerting on dead-letter rate. If rate exceeds 5%, pause processing and investigate. |
| “Extraction fails on new model version” | Model changed its output formatting (e.g., new code fence style). | Add the new format to the extraction strategy ladder. Monitor extraction method metrics. |
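Many of the semantic rules mentioned in the table can be pushed into the schema itself, using standard JSON Schema keywords. A hypothetical invoice line-item fragment:

```typescript
// Illustrative schema fragment: type checks alone would accept a negative
// price, so range and enum constraints encode the business rules directly.
const lineItemSchema = {
  type: "object",
  properties: {
    description: { type: "string", minLength: 1 },
    price: { type: "number", minimum: 0 },          // rejects negative prices
    currency: { type: "string", enum: ["USD", "EUR", "GBP"] },
  },
  required: ["description", "price", "currency"],
  additionalProperties: false,                      // block hallucinated extra fields
} as const;
```

Rules that span fields (say, line totals summing to the invoice total) still need a semantic validation pass after the schema check.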
7.2 Debugging Strategies
- Re-run a single failing item with verbose logging to trace through extraction -> validation -> repair.
- Diff validator errors between attempts to check if the repair is making progress or oscillating.
- Inspect dead-letter entries grouped by most common error type to find systematic patterns.
- Use `inspect-dead-letter --group-by error_type` to identify the highest-impact constraint to fix.
- Check extraction method metrics: if code-fence stripping is increasing, the prompt might need a “return JSON only” instruction.
7.3 Performance Traps
- Unbounded retries inflate latency and cost without limit. Always cap attempts at max_repair.
- Serial repair processing bottlenecks on items that need multiple attempts. Parallelize initial validation; serialize repair per item.
- Storing full repair prompts in dead-letter entries bloats storage. Truncate prompts but keep full error details.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add extraction metrics (count per strategy) to the report.
- Add a `--dry-run` flag that validates without entering the repair loop.
- Support one additional schema format (e.g., YAML-defined schemas).
8.2 Intermediate Extensions
- Add oscillation detection that dead-letters early when detected.
- Add schema compatibility checking (v1 vs v2) with migration impact report.
- Add cost-per-item tracking and cost-effectiveness dashboard.
8.3 Advanced Extensions
- Integrate constrained decoding (OpenAI Structured Outputs) as an optimization and measure reduction in repair loop invocations.
- Add streaming validation for large NDJSON files (process items as they arrive, no full-file load).
- Build a dead-letter triage UI that groups items by error pattern and suggests schema/prompt changes.
- Add A/B testing support: run two schema versions side-by-side and compare pass rates.
9. Real-World Connections
9.1 Industry Applications
- Data extraction pipelines at fintech companies validating LLM-extracted transaction data.
- Healthcare NLP systems validating extracted clinical data against FHIR schemas.
- E-commerce platforms validating product classification outputs before indexing.
- Content moderation systems validating structured analysis before routing decisions.
9.2 Related Open Source Projects
- Instructor: Python library for structured LLM outputs with automatic validation and retry.
- json_repair: Python library for fixing common JSON syntax errors from LLMs.
- ajv: JavaScript JSON Schema validator used in production by millions of applications.
- Pydantic: Python data validation library with JSON Schema generation.
- Haystack: Framework with structured output auto-correction loops.
- LangChain OutputFixingParser: Output repair with LLM-based correction.
9.3 Interview Relevance
- Demonstrates understanding of the structured output problem and multiple solution layers (provider, application, repair).
- Shows practical cost-awareness: measuring repair loop ROI and knowing when to invest in prompt improvement.
- Proves knowledge of production patterns: dead-letter queues, bounded retries, schema versioning.
- Provides concrete examples of defense-in-depth thinking.
10. Resources
10.1 Essential Reading
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 for schema evolution.
- “Release It!” by Michael Nygard - Ch. 5 for stability patterns and bounded retries.
- “Site Reliability Engineering” by Google - Ch. 22 for cascading failure prevention.
- JSON Schema specification: https://json-schema.org/specification
- OpenAI Structured Outputs: https://developers.openai.com/api/docs/guides/structured-outputs
- Anthropic Claude Structured Outputs: https://platform.claude.com/docs/en/build-with-claude/structured-outputs
10.2 Video Resources
- Talks on structured output engineering from AI Engineering Summit.
- Constrained decoding deep-dives from model provider conferences.
- Dead-letter queue patterns from Kafka Summit and distributed systems conferences.
10.3 Tools & Documentation
- ajv (TypeScript): https://ajv.js.org/
- Pydantic (Python): https://docs.pydantic.dev/
- Zod (TypeScript): https://zod.dev/
- Instructor: https://python.useinstructor.com/
- json_repair: https://github.com/mangiucugna/json_repair
- Haystack structured output tutorial: https://haystack.deepset.ai/tutorials/28_structured_output_with_loop
10.4 Related Projects in This Series
- P01 (Prompt Contract Harness): Provides the contract and invariant framework that feeds into schema validation.
- P08 (Prompt DSL + Linter): Static analysis of prompt files complements runtime output validation.
- P11 (Canary Prompt Rollout Controller): Uses schema pass rates as a canary metric for prompt deployments.
- P15 (Prompt Registry): Manages schema versions alongside prompt versions.
- P18 (Capstone): Composes the enforcer with other project components into a unified pipeline.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the difference between JSON extraction, schema validation, and semantic validation.
- I can explain why the repair loop is not a retry and why the distinction matters.
- I can design a repair prompt that includes previous output and formatted errors.
- I can explain when constrained decoding helps and when application-level validation is still needed.
- I can compute repair loop cost-effectiveness and identify the break-even point for prompt improvement.
11.2 Implementation
- JSON extraction handles code fences, trailing text, and truncated responses.
- Schema validation reports ALL violations with field paths in a single pass.
- Repair loop is bounded with configurable max attempts.
- Oscillation detection dead-letters items early when detected.
- Dead-letter entries contain full attempt history with token costs.
- Repair funnel metrics are computed and included in the report.
11.3 Growth
- I can describe one tradeoff I made (e.g., max attempts, additionalProperties) and why.
- I can explain this project design in an interview setting with cost-effectiveness analysis.
- I can identify dead-letter patterns and propose fixes (prompt, schema, or model changes).
12. Submission / Completion Criteria
Minimum Viable Completion:
- JSON extraction handles at least 3 pathologies (direct parse, code fences, trailing text).
- Schema validation reports all violations with field paths.
- Repair loop runs with configurable max attempts and produces deterministic results.
- Dead-letter sink captures unrepairable items with attempt history.
- Report includes initial pass rate, final pass rate, and repair uplift.
Full Completion:
- Oscillation detection dead-letters items early.
- Repair funnel metrics with per-attempt fix rates and cost breakdown.
- `inspect-dead-letter` command for triage support.
- Schema version tracking in all artifacts.
- Automated tests for golden path, pathology handling, oscillation, and deterministic replay.
Excellence (Above & Beyond):
- Constrained decoding integration with comparative metrics (with vs without).
- Cost-effectiveness analysis with break-even computation for prompt improvement.
- Streaming validation for large NDJSON files.
- Dead-letter triage UI or dashboard with grouped error patterns.
- Integrates with adjacent projects (P01 contracts, P11 canary, P15 registry) cleanly.