Project 2: JSON Output Enforcer (Schema + Repair Loop)

Build a production-ready library that guarantees type-safe LLM outputs through automatic validation and self-correction

Quick Reference

Attribute | Value
Difficulty | Advanced
Time Estimate | 1 week
Language | Python (Alternatives: TypeScript)
Prerequisites | Project 1 (Harness), deep knowledge of JSON, type systems
Key Topics | Schema Validation, Self-Correction, Error Handling, Type Safety
Knowledge Area | Structured Outputs / Reliability
Software/Tool | Pydantic / Zod
Main Book | “Designing Data-Intensive Applications” by Martin Kleppmann (Schemas)
Coolness Level | Level 3: Genuinely Clever
Business Potential | 5. The “Compliance & Workflow” Model

1. Learning Objectives

By completing this project, you will:

  1. Master Type Systems for AI: Bridge the gap between probabilistic LLM outputs and deterministic, typed application code
  2. Build Self-Correcting Systems: Implement repair loops that let the model use validation feedback to fix its own errors
  3. Handle Graceful Degradation: Design systems that fail safely with structured error responses
  4. Understand Schema Design: Learn to create strict JSON Schemas that prevent hallucinations and type errors
  5. Optimize Token Costs: Balance reliability (multiple repair attempts) against API costs
  6. Implement Production APIs: Create developer-friendly library interfaces with clear error handling
  7. Apply Retry Patterns: Use exponential backoff, temperature adjustments, and circuit breakers

2. Theoretical Foundation

2.1 Core Concepts

The Type Safety Gap

Modern applications are built with strong type systems (TypeScript, Python with type hints, Go, Rust). These systems catch errors at compile time, ensuring that a function expecting an integer never receives a string.

LLMs, however, generate untyped text. Even when instructed to return JSON, they can:

  • Generate malformed JSON (syntax errors)
  • Return correct JSON but wrong types ({"age": "25"} instead of {"age": 25})
  • Hallucinate extra fields not in your schema
  • Omit required fields
  • Return values outside valid ranges

The Core Problem:

# Traditional API (Type-Safe)
def get_user(user_id: int) -> User:
    # Returns User object, guaranteed by type system
    pass

# LLM API (Type-Unsafe)
def llm_extract_user(text: str) -> ???:
    response = llm.complete(f"Extract user from: {text}")
    # response is just a string - could be anything!
    data = json.loads(response)  # Might raise JSONDecodeError
    age = data["age"]  # Might be missing, might be wrong type
    # Your application crashes
    pass

The Solution:

# Type-Safe LLM Wrapper
def llm_extract_user(text: str) -> Result[User, ValidationError]:
    enforcer = JSONEnforcer(schema=UserSchema)
    result = enforcer.generate(prompt=text)
    # result is Either[User, Error] - type-safe!
    return result

JSON Schema as a Contract

JSON Schema is the industry standard for defining JSON structure. It acts as a “contract” between your LLM and your application.

Example Schema:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "name": {
      "type": "string",
      "minLength": 1,
      "maxLength": 100
    },
    "age": {
      "type": "integer",
      "minimum": 0,
      "maximum": 120
    },
    "email": {
      "type": "string",
      "format": "email"
    },
    "subscription": {
      "enum": ["free", "pro", "enterprise"]
    }
  },
  "required": ["name", "age", "email", "subscription"],
  "additionalProperties": false
}

Why additionalProperties: false is Critical:

Without it, the model can hallucinate fields:

{
  "name": "Alice",
  "age": 25,
  "email": "alice@example.com",
  "subscription": "pro",
  "credit_card": "4532-1234-5678-9010",   HALLUCINATED! Security risk!
  "admin_access": true   HALLUCINATED! Privilege escalation!
}

With additionalProperties: false, the validator rejects this output.
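
As a quick illustration, here is a sketch using the third-party jsonschema package (not part of this project's dependency list; any Draft-07 validator behaves the same) showing additionalProperties: false rejecting a hallucinated field:

# Sketch only: additionalProperties: false in action, using `jsonschema`
# (pip install jsonschema) purely for illustration.
import jsonschema

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
    "additionalProperties": False,
}

jsonschema.validate({"name": "Alice", "age": 25}, schema)  # passes

try:
    jsonschema.validate(
        {"name": "Alice", "age": 25, "admin_access": True},  # hallucinated field
        schema,
    )
except jsonschema.ValidationError as err:
    print(err.message)  # names 'admin_access' as an unexpected property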

Self-Correction and Repair Loops

The Insight: LLMs can understand and fix their own errors when given clear feedback.

Traditional Approach (Fragile):

User Prompt → LLM → Parse JSON → ❌ Error → Crash

Repair Loop Approach (Robust):

User Prompt → LLM → Parse JSON
                      ↓ (error)
                 Extract Error Message
                      ↓
            "Fix: age must be int, not string"
                      ↓
                     LLM → Parse JSON → ✓ Valid!

Example Repair Prompt:

Your previous output was invalid. Error details:

Field: age
Expected: integer
Received: string ("twenty-five")
Path: $.age

Please fix ONLY the format, not the semantic content.
Return valid JSON matching the schema.

Key Principle: Feed the validator’s error message directly back to the model. The error message acts as debugging feedback.
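
As a minimal sketch of that principle (assuming Pydantic v2, whose errors() method exposes the location, message, and offending input for each failure):

# Sketch: turning a Pydantic v2 validation failure into repair feedback.
# The full prompt builder for this project appears in Section 4.4 / Phase 2.
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    name: str
    age: int

try:
    User.model_validate({"name": "Alice", "age": "twenty-five"})
except ValidationError as exc:
    feedback = "\n".join(
        f"Field '{'.'.join(map(str, err['loc']))}': {err['msg']} (got {err['input']!r})"
        for err in exc.errors()
    )
    print(feedback)
    # e.g. Field 'age': Input should be a valid integer, ... (got 'twenty-five')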

Temperature and Repair Strategy

Temperature controls randomness:

  • 0.0: Deterministic (always picks highest probability token)
  • 0.3: Slightly creative
  • 0.7: Balanced
  • 1.0+: Very creative

Optimal Repair Strategy:

Attempt 1: temperature=0.3  (Initial generation - allow some flexibility)
Attempt 2: temperature=0.0  (Repair needs precision, not creativity)
Attempt 3: temperature=0.0  (Stay deterministic)

Why lower temperature for repairs?

At high temperature, the model might “creatively” misinterpret the repair instruction. At temp=0, it mechanically applies the fix.

Error Handling Patterns

Three Strategies:

  1. Raise Exception (Fail Fast)
    try:
        user = enforcer.generate(prompt)
    except MaxRetriesExceeded:
        # Propagate to caller
        raise
    
  2. Return Default (Fail Safe)
    user = enforcer.generate(prompt, default={"status": "unknown"})
    # Never crashes, always returns something
    
  3. Return Result Type (Functional Style)
    result = enforcer.generate(prompt)
    if result.is_ok():
        user = result.unwrap()
    else:
        error = result.error()
        log.warning(f"Extraction failed: {error}")
    

Best Practice: Use exceptions for unexpected failures, Result types for expected failures (like user input that can’t be parsed).
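
For reference, a Result wrapper can be as small as the sketch below (an illustration only; the library's own GenerationResult in Section 4.3 plays this role and exposes the error as a field rather than a method):

# Sketch: a minimal generic Result type for expected failures.
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")
E = TypeVar("E")

@dataclass
class Result(Generic[T, E]):
    value: Optional[T] = None
    error: Optional[E] = None

    def is_ok(self) -> bool:
        return self.error is None

    def unwrap(self) -> T:
        # Raise if the caller ignores a failed result
        if self.error is not None:
            raise RuntimeError(f"unwrap() called on error result: {self.error}")
        return self.value

    def unwrap_or(self, default: T) -> T:
        return self.value if self.error is None else default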

2.2 Why This Matters

Production Relevance

Real-world LLM applications crash in production due to:

  1. Malformed JSON: Missing brackets, trailing commas, unescaped quotes
  2. Type Mismatches: "age": "25" when code expects int
  3. Missing Fields: Code accesses data["city"] but field wasn’t returned
  4. Hallucinated Fields: Model invents fields that cause logic errors
  5. Out-of-Range Values: age: -5 or age: 999 bypassing validation

Without Enforcement:

  • Application crashes with KeyError, TypeError, JSONDecodeError
  • Database corruption from invalid data
  • Security vulnerabilities from hallucinated fields
  • Customer support tickets from broken features

With Enforcement:

  • Guaranteed type-safe data or explicit error handling
  • 99%+ reliability through automated repairs
  • Clear error messages for the 1% that can’t be repaired
  • Production-ready infrastructure

Real-World Applications

Companies using similar patterns:

Company | Use Case | Pattern
Stripe | Invoice data extraction | Pydantic validation with retry
Notion | Natural language to structured data | Schema enforcement with fallback
GitHub Copilot | Code generation with type checking | Multi-attempt refinement
Shopify | Product catalog enrichment | Zod validation with repair

Industry Tools:

  • Instructor (Python): Pydantic-based LLM output validation
  • Zod (TypeScript): Runtime type validation for LLM outputs
  • OpenAI Function Calling: Built-in schema enforcement
  • Anthropic Tool Use: Structured output validation

2.3 Historical Context

Evolution of Structured Outputs

2020-2021: The “Parse and Pray” Era

  • No schema enforcement
  • Manual string parsing with regex
  • Frequent production crashes
  • “Just ask nicely for JSON” approach

2022: The Schema Era

  • JSON Schema adoption
  • Manual validation loops
  • Pydantic/Zod emergence
  • Still lots of manual error handling

2023: The Self-Correction Era

  • Repair loops standardized
  • Models can fix their own errors
  • Function calling APIs from OpenAI/Anthropic
  • Automatic retry logic

2024+: The Type Safety Era

  • LLMs integrated into strongly-typed systems
  • Production-grade reliability (99%+)
  • Cost-optimized repair strategies
  • Statistical validation of repair effectiveness

This project teaches you the modern, production-proven approach.

2.4 Common Misconceptions

Misconception | Reality
“Just use OpenAI’s JSON mode” | JSON mode ensures valid JSON, not valid schema
“Models rarely make errors” | At scale (1M+ requests), even 1% failure rate = 10K crashes
“Repair loops are expensive” | 25% token cost increase for 25% reliability improvement (worth it)
“TypeScript types validate LLM output” | Compile-time types don’t validate runtime data
“One validation attempt is enough” | Self-correction significantly improves success rate

3. Project Specification

3.1 What You Will Build

A Python library (or TypeScript package) that:

  1. Accepts JSON Schemas (or Pydantic models / Zod schemas)
  2. Generates LLM outputs with automatic validation
  3. Implements repair loops that self-correct errors
  4. Provides clear error messages when repair fails
  5. Tracks metrics: attempts, token costs, success rates
  6. Offers configurable parameters: max attempts, temperature strategy, default values

Core Question This Tool Answers:

“How do I integrate a fuzzy AI component into a strict, typed software system?”

3.2 Functional Requirements

FR1: Schema Definition

Python (Pydantic):

from pydantic import BaseModel, EmailStr, Field
from enum import Enum

class Subscription(str, Enum):
    FREE = "free"
    PRO = "pro"
    ENTERPRISE = "enterprise"

class User(BaseModel):
    name: str = Field(..., min_length=1, max_length=100)
    age: int = Field(..., ge=0, le=120)
    email: EmailStr
    subscription: Subscription

    class Config:
        extra = "forbid"  # Prevent hallucinated fields

TypeScript (Zod):

import { z } from 'zod';

const UserSchema = z.object({
  name: z.string().min(1).max(100),
  age: z.number().int().min(0).max(120),
  email: z.string().email(),
  subscription: z.enum(['free', 'pro', 'enterprise'])
}).strict();  // Prevent hallucinated fields

FR2: Generation with Validation

Python Example:

from json_enforcer import LLMClient

client = LLMClient(model="gpt-4", max_repair_attempts=3)

result = client.generate_json(
    prompt="Extract user: 'Alice, 25, alice@example.com, wants pro plan'",
    schema=User,
    temperature=0.3
)

# result: User object (Pydantic model) - fully typed and validated
print(result.name)  # Type checker knows this is a string
print(result.age)   # Type checker knows this is an int

FR3: Repair Loop Implementation

Algorithm:

def generate_with_repair(prompt, schema, max_attempts=3):
    messages = [{"role": "user", "content": prompt}]

    for attempt in range(1, max_attempts + 1):
        # Adjust temperature: lower for repairs
        temp = 0.3 if attempt == 1 else 0.0

        # Generate response
        response = llm.complete(messages, temperature=temp)

        # Attempt validation
        try:
            validated_data = schema.parse(response)
            log.info(f"✓ Validated on attempt {attempt}")
            return validated_data

        except ValidationError as error:
            log.warning(f"✗ Attempt {attempt} failed: {error}")

            if attempt == max_attempts:
                raise MaxRetriesExceeded(
                    attempts=attempt,
                    last_error=error,
                    last_response=response
                )

            # Build repair prompt
            repair_prompt = build_repair_prompt(error, response)
            messages.append({
                "role": "assistant",
                "content": response
            })
            messages.append({
                "role": "user",
                "content": repair_prompt
            })

Repair Prompt Builder:

def build_repair_prompt(validation_error, invalid_json):
    errors = parse_validation_errors(validation_error)

    prompt_parts = [
        "Your previous JSON output had validation errors:\n"
    ]

    for idx, error in enumerate(errors, 1):
        prompt_parts.append(
            f"{idx}. Field '{error.field}' {error.message}\n"
            f"   Expected: {error.expected}\n"
            f"   Received: {error.received}\n"
        )

    prompt_parts.append(
        "\nPlease return ONLY valid JSON with these corrections. "
        "Do not change the semantic content, only fix the format."
    )

    return "".join(prompt_parts)

FR4: Error Types

Custom Exceptions:

class JSONEnforcerError(Exception):
    """Base exception for JSON Enforcer"""
    pass

class MaxRetriesExceeded(JSONEnforcerError):
    """Raised when repair loop exhausts all attempts"""
    def __init__(self, attempts, last_error, last_response):
        self.attempts = attempts
        self.last_error = last_error
        self.last_response = last_response
        super().__init__(
            f"Failed to generate valid JSON after {attempts} attempts. "
            f"Last error: {last_error}"
        )

class SchemaValidationError(JSONEnforcerError):
    """Raised when JSON is valid but doesn't match schema"""
    def __init__(self, errors, json_data):
        self.errors = errors
        self.json_data = json_data
        super().__init__(f"Schema validation failed: {errors}")

FR5: Metrics and Observability

Telemetry Data:

@dataclass
class GenerationMetrics:
    success: bool
    attempts: int
    total_tokens: int
    total_cost: float
    total_latency_ms: int
    temperatures_used: List[float]
    validation_errors: List[str]

result, metrics = client.generate_with_metrics(
    prompt=prompt,
    schema=User
)

print(f"Success: {metrics.success}")
print(f"Attempts: {metrics.attempts}")
print(f"Cost: ${metrics.total_cost:.4f}")

3.3 Non-Functional Requirements

Requirement | Target | Rationale
Success Rate | >99% with 3 repair attempts | Production-grade reliability
Latency | <2 seconds for 3 attempts | Acceptable for most use cases
Token Efficiency | <30% overhead for repairs | Cost-effective
Type Safety | Full static type checking | Integration with typed codebases
Error Messages | Actionable repair guidance | Models can understand and fix errors
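
One way to check these targets is a small benchmark loop; the sketch below assumes the generate_with_metrics API from FR5, the json_enforcer package name used in the examples, and an evaluation prompt list you supply:

# Sketch: measuring success rate, attempts, latency, and cost over a prompt set.
from statistics import mean

from json_enforcer import LLMClient, MaxRetriesExceeded

def benchmark(client: LLMClient, schema, prompts: list[str]) -> float:
    ok, failures = [], 0
    for prompt in prompts:
        try:
            _, metrics = client.generate_with_metrics(prompt=prompt, schema=schema)
            ok.append(metrics)
        except MaxRetriesExceeded:
            failures += 1

    success_rate = len(ok) / len(prompts)
    print(f"Success rate : {success_rate:.1%}  ({failures} unrecoverable)")
    if ok:
        print(f"Mean attempts: {mean(m.attempts for m in ok):.2f}")
        print(f"Mean latency : {mean(m.total_latency_ms for m in ok):.0f} ms")
        print(f"Total cost   : ${sum(m.total_cost for m in ok):.4f}")
    return success_rate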

3.4 Example Usage

Basic Usage:

from typing import Literal

from json_enforcer import LLMClient
from pydantic import BaseModel

class Recipe(BaseModel):
    title: str
    ingredients: list[str]
    cooking_time_minutes: int
    difficulty: Literal["easy", "medium", "hard"]

client = LLMClient(model="gpt-4")

text = """
This amazing pasta takes about an hour and serves 4-6 people.
You'll need pasta, tomatoes, garlic, and basil. It's pretty simple!
"""

recipe = client.generate_json(
    prompt=f"Extract recipe from: {text}",
    schema=Recipe
)

print(recipe.title)  # Type-safe access
print(recipe.cooking_time_minutes)  # Guaranteed to be int

Advanced Usage with Error Handling:

from json_enforcer import LLMClient, MaxRetriesExceeded

client = LLMClient(
    model="gpt-4",
    max_repair_attempts=3,
    verbose=True  # Show repair process
)

try:
    recipe = client.generate_json(
        prompt=f"Extract recipe from: {text}",
        schema=Recipe,
        temperature=0.3
    )
    print(f"✓ Extracted: {recipe.title}")

except MaxRetriesExceeded as e:
    print(f"✗ Failed after {e.attempts} attempts")
    print(f"Last error: {e.last_error}")

    # Use default fallback
    recipe = Recipe(
        title="Unknown Recipe",
        ingredients=[],
        cooking_time_minutes=0,
        difficulty="medium"
    )

Console Output (Verbose Mode):

╔══════════════════════════════════════════════════════════════╗
║  JSON ENFORCER - Structured Output Pipeline                 ║
╚══════════════════════════════════════════════════════════════╝

[Attempt 1/3] Generating JSON response...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Prompt sent to model (142 tokens)
Response received (87 tokens, 234ms)

Raw output:
{
  "title": "Pasta",
  "ingredients": ["pasta", "tomatoes", "garlic", "basil"],
  "cooking_time_minutes": "1 hour",
  "difficulty": "simple"
}

Validating against schema...
✗ VALIDATION FAILED (2 errors)

Error details:
  1. Field: cooking_time_minutes
     Expected: integer
     Received: string ("1 hour")
     Path: $.cooking_time_minutes

  2. Field: difficulty
     Expected: one of ["easy", "medium", "hard"]
     Received: "simple"
     Path: $.difficulty

──────────────────────────────────────────────────────────────

[Attempt 2/3] Attempting self-repair...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Repair prompt:
Your previous JSON output had validation errors:

1. Field 'cooking_time_minutes' must be an integer (number of minutes)
   Expected: integer
   Received: string ("1 hour")
   Hint: Convert "1 hour" to 60

2. Field 'difficulty' must be one of: "easy", "medium", "hard"
   Expected: enum ["easy", "medium", "hard"]
   Received: "simple"
   Hint: Map "simple" to "easy"

Please return ONLY valid JSON with these corrections.
Do not change the semantic content, only fix the format.

Response received (71 tokens, 198ms)

Raw output:
{
  "title": "Pasta",
  "ingredients": ["pasta", "tomatoes", "garlic", "basil"],
  "cooking_time_minutes": 60,
  "difficulty": "easy"
}

Validating against schema...
✓ VALIDATION PASSED

All required fields present: ✓
No extra fields: ✓
Type constraints satisfied: ✓

──────────────────────────────────────────────────────────────

✓ SUCCESS after 2 attempts
Total time: 432ms
Total tokens: 300 (input: 142, output: 158)
Cost: $0.0045

Returning validated object:
Recipe(
  title='Pasta',
  ingredients=['pasta', 'tomatoes', 'garlic', 'basil'],
  cooking_time_minutes=60,
  difficulty='easy'
)

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────┐
│                   Application Code                          │
│  recipe = client.generate_json(prompt, schema=Recipe)       │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    JSON Enforcer Client                     │
│  ┌─────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   Schema    │  │    Repair    │  │   Metrics    │       │
│  │  Validator  │  │     Loop     │  │   Tracker    │       │
│  └─────────────┘  └──────────────┘  └──────────────┘       │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                      LLM Provider                           │
│           (OpenAI / Anthropic / Local)                      │
└─────────────────────────────────────────────────────────────┘

Component Responsibilities:

Component | Responsibility
LLMClient | Main entry point, orchestrates generation
SchemaValidator | Validates JSON against schema, extracts errors
RepairLoop | Manages retry logic, temperature strategy
PromptBuilder | Constructs repair prompts from validation errors
MetricsTracker | Collects telemetry data
ErrorFormatter | Converts validation errors to repair instructions

4.2 Key Components

Component 1: Schema Validator

Responsibilities:

  • Parse JSON from string
  • Validate against schema
  • Extract detailed error information
  • Format errors for repair prompts

Interface:

class SchemaValidator:
    def __init__(self, schema: Type[BaseModel]):
        self.schema = schema

    def validate(self, json_string: str) -> Result[BaseModel, ValidationError]:
        """Validate JSON string against schema"""
        pass

    def extract_errors(self, error: ValidationError) -> List[ErrorDetail]:
        """Parse validation error into structured details"""
        pass

Component 2: Repair Loop Engine

Responsibilities:

  • Manage retry attempts
  • Adjust temperature per attempt
  • Track metrics
  • Handle max retries exceeded

Interface:

class RepairLoop:
    def __init__(
        self,
        llm_client: LLMProvider,
        max_attempts: int = 3,
        temperature_strategy: str = "decreasing"
    ):
        self.llm_client = llm_client
        self.max_attempts = max_attempts
        self.temperature_strategy = temperature_strategy

    def execute(
        self,
        prompt: str,
        validator: SchemaValidator
    ) -> Result[BaseModel, MaxRetriesExceeded]:
        """Execute generation with repair loop"""
        pass

Component 3: Prompt Builder

Responsibilities:

  • Format validation errors as repair instructions
  • Provide examples of correct format
  • Keep prompts concise to save tokens

Interface:

class RepairPromptBuilder:
    def build(
        self,
        validation_errors: List[ErrorDetail],
        invalid_output: str
    ) -> str:
        """Build repair prompt from validation errors"""
        pass

    def add_examples(self, field_name: str, expected_type: str) -> str:
        """Add format examples for common error types"""
        pass

4.3 Data Structures

ErrorDetail

@dataclass
class ErrorDetail:
    field_path: str  # JSON path (e.g., "$.user.age")
    field_name: str  # Field name (e.g., "age")
    expected_type: str  # Expected type (e.g., "integer")
    received_value: Any  # Actual value received
    error_message: str  # Human-readable error
    suggestion: Optional[str]  # Repair suggestion

GenerationResult

@dataclass
class GenerationResult:
    success: bool
    data: Optional[BaseModel]
    error: Optional[Exception]
    metrics: GenerationMetrics

    def unwrap(self) -> BaseModel:
        """Get data or raise error"""
        if self.success:
            return self.data
        raise self.error

    def unwrap_or(self, default: BaseModel) -> BaseModel:
        """Get data or return default"""
        return self.data if self.success else default

GenerationMetrics

@dataclass
class GenerationMetrics:
    attempts: int
    success: bool
    total_tokens: int
    input_tokens: int
    output_tokens: int
    total_cost: float
    total_latency_ms: int
    temperatures_used: List[float]
    validation_errors: List[List[ErrorDetail]]
    repair_success_on_attempt: Optional[int]

4.4 Algorithm Overview

Main Generation Algorithm

def generate_json(prompt: str, schema: Type[BaseModel]) -> BaseModel:
    """
    Generate and validate JSON with automatic repair.

    Algorithm:
    1. Initialize conversation with user prompt
    2. For each attempt (1 to max_attempts):
       a. Determine temperature (decreasing strategy)
       b. Generate LLM response
       c. Parse as JSON
       d. Validate against schema
       e. If valid: return validated data
       f. If invalid: build repair prompt
    3. If all attempts fail: raise MaxRetriesExceeded
    """

    messages = [{"role": "user", "content": prompt}]
    metrics = GenerationMetrics()

    for attempt in range(1, max_attempts + 1):
        # Step 1: Determine temperature
        temperature = calculate_temperature(attempt, strategy="decreasing")
        metrics.temperatures_used.append(temperature)

        # Step 2: Generate response
        start_time = time.time()
        response = llm_client.complete(messages, temperature=temperature)
        latency = (time.time() - start_time) * 1000
        metrics.total_latency_ms += latency

        # Step 3: Update metrics
        metrics.attempts = attempt
        metrics.input_tokens += response.usage.prompt_tokens
        metrics.output_tokens += response.usage.completion_tokens
        metrics.total_tokens += response.usage.total_tokens

        # Step 4: Parse JSON
        try:
            json_data = json.loads(response.content)
        except JSONDecodeError as e:
            error = ErrorDetail(
                field_path="$",
                field_name="root",
                expected_type="valid JSON",
                received_value=response.content,
                error_message=f"JSON syntax error: {e}",
                suggestion="Check for missing brackets, quotes, or commas"
            )
            metrics.validation_errors.append([error])

            if attempt == max_attempts:
                raise MaxRetriesExceeded(metrics)

            repair_prompt = build_json_syntax_repair_prompt(e, response.content)
            messages.extend([
                {"role": "assistant", "content": response.content},
                {"role": "user", "content": repair_prompt}
            ])
            continue

        # Step 5: Validate schema
        try:
            validated = schema.parse_obj(json_data)
            metrics.success = True
            metrics.repair_success_on_attempt = attempt
            return validated

        except ValidationError as e:
            errors = extract_validation_errors(e)
            metrics.validation_errors.append(errors)

            if attempt == max_attempts:
                raise MaxRetriesExceeded(metrics)

            # Step 6: Build repair prompt
            repair_prompt = build_schema_repair_prompt(errors, json_data)
            messages.extend([
                {"role": "assistant", "content": response.content},
                {"role": "user", "content": repair_prompt}
            ])

    # Should never reach here
    raise MaxRetriesExceeded(metrics)

Temperature Strategy

def calculate_temperature(attempt: int, strategy: str) -> float:
    """
    Calculate temperature for each attempt.

    Strategies:
    - "decreasing": Start at 0.3, decrease to 0.0 for repairs
    - "constant_zero": Always use 0.0 (maximum determinism)
    - "constant_low": Always use 0.2 (slight creativity)
    """

    if strategy == "decreasing":
        return 0.3 if attempt == 1 else 0.0
    elif strategy == "constant_zero":
        return 0.0
    elif strategy == "constant_low":
        return 0.2
    else:
        raise ValueError(f"Unknown strategy: {strategy}")

Error Extraction Algorithm

def extract_validation_errors(validation_error: ValidationError) -> List[ErrorDetail]:
    """
    Extract structured error details from Pydantic ValidationError.

    Pydantic v2 errors have the format:
    [
        {
            "loc": ("field", "nested_field"),
            "msg": "Field required",
            "type": "missing",
            "input": {...}
        }
    ]

    Convert to ErrorDetail objects with repair suggestions.
    """

    details = []

    for error in validation_error.errors():
        field_path = "$.{}".format(".".join(str(loc) for loc in error["loc"]))
        field_name = error["loc"][-1] if error["loc"] else "root"

        # Determine expected type and suggestion
        error_type = error["type"]

        if error_type == "missing":
            suggestion = f"Add required field '{field_name}'"
        elif error_type == "int_parsing":
            suggestion = "Convert to an integer number (e.g., 25, not '25')"
        elif error_type == "string_type":
            suggestion = "Wrap the value in quotes to make it a string"
        elif error_type in ("enum", "literal_error"):
            suggestion = "Use one of the allowed enum values"
        else:
            suggestion = None

        detail = ErrorDetail(
            field_path=field_path,
            field_name=field_name,
            expected_type=infer_expected_type(error),
            received_value=error.get("input"),
            error_message=error["msg"],
            suggestion=suggestion
        )

        details.append(detail)

    return details

Repair Prompt Builder Algorithm

def build_schema_repair_prompt(errors: List[ErrorDetail], invalid_json: dict) -> str:
    """
    Build repair prompt from validation errors.

    Strategy:
    1. Start with clear instruction
    2. List each error with:
       - Field path
       - Expected vs received
       - Concrete example
    3. Remind to preserve semantic content
    4. Request only valid JSON (no explanations)
    """

    lines = [
        "Your previous JSON output had validation errors:\n"
    ]

    for idx, error in enumerate(errors, 1):
        lines.append(f"\n{idx}. Field '{error.field_name}' error:")
        lines.append(f"   Path: {error.field_path}")
        lines.append(f"   Problem: {error.error_message}")
        lines.append(f"   Expected: {error.expected_type}")

        if error.received_value is not None:
            lines.append(f"   Received: {json.dumps(error.received_value)}")

        if error.suggestion:
            lines.append(f"   Fix: {error.suggestion}")

        # Add concrete example
        example = generate_example(error)
        if example:
            lines.append(f"   Example: {example}")

    lines.append("\n\nIMPORTANT:")
    lines.append("- Fix ONLY the format/type errors above")
    lines.append("- Do NOT change the semantic content or meaning")
    lines.append("- Return ONLY valid JSON, no explanations")
    lines.append("- Do NOT add extra fields")

    return "\n".join(lines)

5. Implementation Guide

5.1 Development Environment Setup

Prerequisites:

  • Python 3.10+ or Node.js 18+
  • API key for OpenAI or Anthropic
  • Virtual environment tool (venv, conda)

Python Setup:

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install "pydantic[email]" openai anthropic python-dotenv

# For development
pip install pytest black mypy ruff

TypeScript Setup:

# Initialize project
npm init -y
npm install zod openai @anthropic-ai/sdk dotenv

# For development
npm install -D typescript @types/node ts-node jest @types/jest
npx tsc --init

Environment Variables (.env):

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

5.2 Project Structure

json-enforcer/
├── src/
│   ├── __init__.py
│   ├── client.py           # Main LLMClient class
│   ├── validator.py        # Schema validation logic
│   ├── repair_loop.py      # Repair attempt orchestration
│   ├── prompt_builder.py   # Repair prompt construction
│   ├── errors.py           # Custom exception classes
│   ├── metrics.py          # Metrics tracking
│   └── providers/
│       ├── __init__.py
│       ├── base.py         # Abstract LLM provider
│       ├── openai.py       # OpenAI implementation
│       └── anthropic.py    # Anthropic implementation
├── tests/
│   ├── test_validator.py
│   ├── test_repair_loop.py
│   ├── test_client.py
│   └── fixtures/
│       ├── schemas.py      # Test schemas
│       └── mock_responses.py
├── examples/
│   ├── basic_usage.py
│   ├── error_handling.py
│   ├── metrics_tracking.py
│   └── advanced_schemas.py
├── .env.example
├── pyproject.toml
├── README.md
└── requirements.txt

5.3 Implementation Phases

Phase 1: Foundation (Days 1-2)

Checkpoint 1.1: Set up base provider interface

# src/providers/base.py
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class Message:
    role: str  # "system", "user", "assistant"
    content: str

@dataclass
class CompletionResponse:
    content: str
    model: str
    usage: 'TokenUsage'
    latency_ms: float

@dataclass
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

class LLMProvider(ABC):
    """Abstract base class for LLM providers"""

    @abstractmethod
    def complete(
        self,
        messages: List[Message],
        temperature: float = 0.3,
        max_tokens: int = 2000
    ) -> CompletionResponse:
        """Generate completion from messages"""
        pass

Checkpoint 1.2: Implement OpenAI provider

# src/providers/openai.py
import time
from openai import OpenAI
from .base import LLMProvider, Message, CompletionResponse, TokenUsage

class OpenAIProvider(LLMProvider):
    def __init__(self, api_key: str, model: str = "gpt-4"):
        self.client = OpenAI(api_key=api_key)
        self.model = model

    def complete(
        self,
        messages: List[Message],
        temperature: float = 0.3,
        max_tokens: int = 2000
    ) -> CompletionResponse:
        start_time = time.time()

        # Convert messages to OpenAI format
        openai_messages = [
            {"role": msg.role, "content": msg.content}
            for msg in messages
        ]

        # Call API
        response = self.client.chat.completions.create(
            model=self.model,
            messages=openai_messages,
            temperature=temperature,
            max_tokens=max_tokens
        )

        latency_ms = (time.time() - start_time) * 1000

        return CompletionResponse(
            content=response.choices[0].message.content,
            model=response.model,
            usage=TokenUsage(
                prompt_tokens=response.usage.prompt_tokens,
                completion_tokens=response.usage.completion_tokens,
                total_tokens=response.usage.total_tokens
            ),
            latency_ms=latency_ms
        )
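
The project structure also lists src/providers/anthropic.py; below is a sketch of that provider, assuming the official anthropic SDK's Messages API (the model id is only an example):

# src/providers/anthropic.py (sketch)
import time
from typing import List

from anthropic import Anthropic

from .base import LLMProvider, Message, CompletionResponse, TokenUsage

class AnthropicProvider(LLMProvider):
    def __init__(self, api_key: str, model: str = "claude-3-5-sonnet-20241022"):
        self.client = Anthropic(api_key=api_key)
        self.model = model

    def complete(
        self,
        messages: List[Message],
        temperature: float = 0.3,
        max_tokens: int = 2000
    ) -> CompletionResponse:
        start_time = time.time()

        # The Messages API takes the same role/content dicts as the OpenAI provider
        response = self.client.messages.create(
            model=self.model,
            max_tokens=max_tokens,
            temperature=temperature,
            messages=[{"role": m.role, "content": m.content} for m in messages],
        )

        latency_ms = (time.time() - start_time) * 1000
        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens

        return CompletionResponse(
            content=response.content[0].text,  # first (text) content block
            model=response.model,
            usage=TokenUsage(
                prompt_tokens=input_tokens,
                completion_tokens=output_tokens,
                total_tokens=input_tokens + output_tokens,
            ),
            latency_ms=latency_ms,
        )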

Checkpoint 1.3: Build schema validator

# src/validator.py
import json
from typing import Any, List, Optional, Type
from pydantic import BaseModel, ValidationError
from dataclasses import dataclass

@dataclass
class ErrorDetail:
    field_path: str
    field_name: str
    expected_type: str
    received_value: Any
    error_message: str
    suggestion: Optional[str] = None

class SchemaValidator:
    def __init__(self, schema: Type[BaseModel]):
        self.schema = schema

    def validate(self, json_string: str) -> BaseModel:
        """
        Validate JSON string against schema.

        Raises:
            json.JSONDecodeError: If not valid JSON
            ValidationError: If valid JSON but doesn't match schema
        """
        # First, parse as JSON
        data = json.loads(json_string)

        # Then validate against schema
        return self.schema.parse_obj(data)

    def extract_errors(self, error: ValidationError) -> List[ErrorDetail]:
        """Extract structured error details from ValidationError"""
        details = []

        for err in error.errors():
            # Build JSON path
            field_path = "$." + ".".join(str(loc) for loc in err["loc"])
            field_name = err["loc"][-1] if err["loc"] else "root"

            # Determine expected type
            expected_type = self._infer_expected_type(err)

            # Get received value
            received_value = err.get("input")

            # Generate suggestion
            suggestion = self._generate_suggestion(err, field_name)

            detail = ErrorDetail(
                field_path=field_path,
                field_name=field_name,
                expected_type=expected_type,
                received_value=received_value,
                error_message=err["msg"],
                suggestion=suggestion
            )

            details.append(detail)

        return details

    def _infer_expected_type(self, error: dict) -> str:
        """Infer expected type from error"""
        error_type = error["type"]

        type_map = {
            "int_parsing": "integer",
            "string_type": "string",
            "float_parsing": "number",
            "bool_parsing": "boolean",
            "list_type": "array",
            "dict_type": "object",
            "missing": "required field",
        }

        for key, value in type_map.items():
            if key in error_type:
                return value

        return error_type

    def _generate_suggestion(self, error: dict, field_name: str) -> str:
        """Generate repair suggestion"""
        error_type = error["type"]

        if "missing" in error_type:
            return f"Add required field '{field_name}'"
        elif "int_parsing" in error_type:
            return "Convert to integer (e.g., 25, not '25')"
        elif "string_type" in error_type:
            return "Value must be a string in quotes"
        elif "enum" in error_type:
            expected_values = error.get("ctx", {}).get("expected")
            if expected_values:
                return f"Must be one of: {expected_values}"

        return "Fix the validation error above"

Test validator:

# tests/test_validator.py
import pytest
from pydantic import BaseModel, ValidationError
from src.validator import SchemaValidator, ErrorDetail

class SimpleUser(BaseModel):
    name: str
    age: int

def test_valid_json():
    validator = SchemaValidator(SimpleUser)
    result = validator.validate('{"name": "Alice", "age": 25}')
    assert result.name == "Alice"
    assert result.age == 25

def test_type_error():
    validator = SchemaValidator(SimpleUser)
    with pytest.raises(ValidationError) as exc:
        validator.validate('{"name": "Alice", "age": "twenty-five"}')

    errors = validator.extract_errors(exc.value)
    assert len(errors) == 1
    assert errors[0].field_name == "age"
    assert "integer" in errors[0].expected_type

def test_missing_field():
    validator = SchemaValidator(SimpleUser)
    with pytest.raises(ValidationError) as exc:
        validator.validate('{"name": "Alice"}')

    errors = validator.extract_errors(exc.value)
    assert any(e.field_name == "age" for e in errors)

Phase 2: Repair Loop (Days 3-4)

Checkpoint 2.1: Build prompt builder

# src/prompt_builder.py
import json
from typing import List
from .validator import ErrorDetail

class RepairPromptBuilder:
    def build(self, errors: List[ErrorDetail]) -> str:
        """Build repair prompt from validation errors"""
        lines = [
            "Your previous JSON output had validation errors:\n"
        ]

        for idx, error in enumerate(errors, 1):
            lines.append(f"\n{idx}. Field '{error.field_name}':")
            lines.append(f"   Path: {error.field_path}")
            lines.append(f"   Expected: {error.expected_type}")

            if error.received_value is not None:
                lines.append(f"   Received: {json.dumps(error.received_value)}")

            lines.append(f"   Problem: {error.error_message}")

            if error.suggestion:
                lines.append(f"   Fix: {error.suggestion}")

            # Add example
            example = self._generate_example(error)
            if example:
                lines.append(f"   Example: {example}")

        lines.extend([
            "\n\nPlease return ONLY valid JSON with these corrections.",
            "Do not change the semantic content, only fix the format.",
            "Do not include explanations or markdown formatting."
        ])

        return "\n".join(lines)

    def _generate_example(self, error: ErrorDetail) -> str:
        """Generate concrete example for error type"""
        field = error.field_name

        if error.expected_type == "integer":
            return f'"{field}": 25'
        elif error.expected_type == "string":
            return f'"{field}": "value"'
        elif error.expected_type == "boolean":
            return f'"{field}": true'
        elif error.expected_type == "array":
            return f'"{field}": ["item1", "item2"]'
        elif "enum" in error.error_message:
            # Try to extract enum values from error message
            return None

        return None

Checkpoint 2.2: Implement repair loop

# src/repair_loop.py
import json
from typing import Type, Optional
from pydantic import BaseModel, ValidationError
from .providers.base import LLMProvider, Message
from .validator import SchemaValidator
from .prompt_builder import RepairPromptBuilder
from .errors import MaxRetriesExceeded
from .metrics import GenerationMetrics

class RepairLoop:
    def __init__(
        self,
        provider: LLMProvider,
        max_attempts: int = 3,
        temperature_strategy: str = "decreasing",
        verbose: bool = False
    ):
        self.provider = provider
        self.max_attempts = max_attempts
        self.temperature_strategy = temperature_strategy
        self.verbose = verbose
        self.prompt_builder = RepairPromptBuilder()

    def execute(
        self,
        initial_prompt: str,
        schema: Type[BaseModel]
    ) -> tuple[BaseModel, GenerationMetrics]:
        """Execute generation with repair loop"""
        validator = SchemaValidator(schema)
        metrics = GenerationMetrics()

        messages = [Message(role="user", content=initial_prompt)]

        for attempt in range(1, self.max_attempts + 1):
            if self.verbose:
                print(f"\n[Attempt {attempt}/{self.max_attempts}] Generating...")

            # Calculate temperature
            temperature = self._get_temperature(attempt)
            metrics.temperatures_used.append(temperature)

            # Generate response
            response = self.provider.complete(
                messages=messages,
                temperature=temperature
            )

            # Update metrics
            metrics.attempts = attempt
            metrics.total_tokens += response.usage.total_tokens
            metrics.input_tokens += response.usage.prompt_tokens
            metrics.output_tokens += response.usage.completion_tokens
            metrics.total_latency_ms += response.latency_ms

            if self.verbose:
                print(f"Response received ({response.usage.total_tokens} tokens, "
                      f"{response.latency_ms:.0f}ms)")
                print(f"\nRaw output:\n{response.content}\n")

            # Try to validate
            try:
                # First try JSON parsing
                try:
                    json_data = json.loads(response.content)
                except json.JSONDecodeError as e:
                    if self.verbose:
                        print(f"✗ JSON SYNTAX ERROR: {e}")

                    if attempt == self.max_attempts:
                        metrics.success = False
                        raise MaxRetriesExceeded(
                            attempts=attempt,
                            last_error=str(e),
                            last_response=response.content,
                            metrics=metrics
                        )

                    # Build JSON syntax repair prompt
                    repair_msg = (
                        f"Your previous output was not valid JSON. "
                        f"Error: {e}\n\n"
                        f"Please return valid JSON only, with no markdown or explanations."
                    )

                    messages.extend([
                        Message(role="assistant", content=response.content),
                        Message(role="user", content=repair_msg)
                    ])
                    continue

                # Then validate schema
                validated = validator.validate(response.content)

                # Success!
                if self.verbose:
                    print("✓ VALIDATION PASSED")

                metrics.success = True
                metrics.repair_success_on_attempt = attempt
                return validated, metrics

            except ValidationError as e:
                errors = validator.extract_errors(e)
                metrics.validation_errors.append(errors)

                if self.verbose:
                    print(f"✗ VALIDATION FAILED ({len(errors)} errors)")
                    for err in errors:
                        print(f"{err.field_name}: {err.error_message}")

                if attempt == self.max_attempts:
                    metrics.success = False
                    raise MaxRetriesExceeded(
                        attempts=attempt,
                        last_error=str(e),
                        last_response=response.content,
                        metrics=metrics
                    )

                # Build repair prompt
                repair_prompt = self.prompt_builder.build(errors)

                messages.extend([
                    Message(role="assistant", content=response.content),
                    Message(role="user", content=repair_prompt)
                ])

        # Should never reach here
        metrics.success = False
        raise MaxRetriesExceeded(
            attempts=self.max_attempts,
            last_error="Unknown error",
            last_response="",
            metrics=metrics
        )

    def _get_temperature(self, attempt: int) -> float:
        """Calculate temperature for attempt"""
        if self.temperature_strategy == "decreasing":
            return 0.3 if attempt == 1 else 0.0
        elif self.temperature_strategy == "constant_zero":
            return 0.0
        elif self.temperature_strategy == "constant_low":
            return 0.2
        else:
            return 0.3

Checkpoint 2.3: Create custom errors

# src/errors.py
from typing import Optional
from .metrics import GenerationMetrics

class JSONEnforcerError(Exception):
    """Base exception for JSON Enforcer"""
    pass

class MaxRetriesExceeded(JSONEnforcerError):
    """Raised when repair loop exhausts all attempts"""

    def __init__(
        self,
        attempts: int,
        last_error: str,
        last_response: str,
        metrics: GenerationMetrics
    ):
        self.attempts = attempts
        self.last_error = last_error
        self.last_response = last_response
        self.metrics = metrics

        super().__init__(
            f"Failed to generate valid JSON after {attempts} attempts. "
            f"Last error: {last_error}"
        )

Checkpoint 2.4: Add metrics tracking

# src/metrics.py
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GenerationMetrics:
    attempts: int = 0
    success: bool = False
    total_tokens: int = 0
    input_tokens: int = 0
    output_tokens: int = 0
    total_cost: float = 0.0
    total_latency_ms: float = 0.0
    temperatures_used: List[float] = field(default_factory=list)
    validation_errors: List[List] = field(default_factory=list)
    repair_success_on_attempt: Optional[int] = None

    def calculate_cost(self, model: str):
        """Calculate cost based on model pricing"""
        # OpenAI GPT-4 pricing (as of 2024)
        pricing = {
            "gpt-4": {"input": 0.03, "output": 0.06},  # per 1K tokens
            "gpt-3.5-turbo": {"input": 0.001, "output": 0.002},
        }

        if model in pricing:
            input_cost = (self.input_tokens / 1000) * pricing[model]["input"]
            output_cost = (self.output_tokens / 1000) * pricing[model]["output"]
            self.total_cost = input_cost + output_cost

        return self.total_cost

Phase 3: Client API (Days 5-7)

Checkpoint 3.1: Build main client

# src/client.py
import os
from typing import Type, Optional
from pydantic import BaseModel
from dotenv import load_dotenv

from .providers.openai import OpenAIProvider
from .repair_loop import RepairLoop
from .metrics import GenerationMetrics
from .errors import MaxRetriesExceeded

load_dotenv()

class LLMClient:
    """
    Main client for generating type-safe JSON from LLMs.

    Example:
        client = LLMClient(model="gpt-4")
        user = client.generate_json(
            prompt="Extract user: Alice, 25, alice@example.com",
            schema=User
        )
    """

    def __init__(
        self,
        model: str = "gpt-4",
        api_key: Optional[str] = None,
        max_repair_attempts: int = 3,
        temperature_strategy: str = "decreasing",
        verbose: bool = False
    ):
        """
        Initialize LLM client.

        Args:
            model: Model name (e.g., "gpt-4", "gpt-3.5-turbo")
            api_key: API key (defaults to OPENAI_API_KEY env var)
            max_repair_attempts: Maximum repair attempts (default: 3)
            temperature_strategy: Temperature strategy (default: "decreasing")
            verbose: Print detailed logs (default: False)
        """
        self.model = model
        self.max_repair_attempts = max_repair_attempts
        self.verbose = verbose

        # Initialize provider
        api_key = api_key or os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("API key required (set OPENAI_API_KEY or pass api_key)")

        self.provider = OpenAIProvider(api_key=api_key, model=model)

        # Initialize repair loop
        self.repair_loop = RepairLoop(
            provider=self.provider,
            max_attempts=max_repair_attempts,
            temperature_strategy=temperature_strategy,
            verbose=verbose
        )

    def generate_json(
        self,
        prompt: str,
        schema: Type[BaseModel],
        temperature: Optional[float] = None
    ) -> BaseModel:
        """
        Generate and validate JSON output.

        Args:
            prompt: The prompt to send to the LLM
            schema: Pydantic model class defining the expected schema
            temperature: Override default temperature (optional)

        Returns:
            Validated Pydantic model instance

        Raises:
            MaxRetriesExceeded: If repair loop fails after max attempts
        """
        # Add schema to prompt
        full_prompt = self._build_prompt_with_schema(prompt, schema)

        # Execute with repair loop
        result, metrics = self.repair_loop.execute(full_prompt, schema)

        # Calculate cost
        metrics.calculate_cost(self.model)

        if self.verbose:
            self._print_summary(metrics)

        return result

    def generate_with_metrics(
        self,
        prompt: str,
        schema: Type[BaseModel]
    ) -> tuple[BaseModel, GenerationMetrics]:
        """
        Generate JSON and return metrics.

        Returns:
            Tuple of (validated data, metrics)
        """
        full_prompt = self._build_prompt_with_schema(prompt, schema)
        result, metrics = self.repair_loop.execute(full_prompt, schema)
        metrics.calculate_cost(self.model)
        return result, metrics

    def _build_prompt_with_schema(
        self,
        user_prompt: str,
        schema: Type[BaseModel]
    ) -> str:
        """Build prompt with schema instructions"""
        schema_json = schema.schema_json(indent=2)

        return f"""{user_prompt}

Return ONLY valid JSON matching this exact schema:

{schema_json}

Important:
- Return ONLY JSON, no markdown or explanations
- All required fields must be present
- Types must match exactly (integer, not string)
- No extra fields beyond the schema
"""

    def _print_summary(self, metrics: GenerationMetrics):
        """Print metrics summary"""
        print(f"\n{'' * 60}")
        print(f"✓ SUCCESS after {metrics.attempts} attempt(s)")
        print(f"Total time: {metrics.total_latency_ms:.0f}ms")
        print(f"Total tokens: {metrics.total_tokens} "
              f"(input: {metrics.input_tokens}, output: {metrics.output_tokens})")
        print(f"Cost: ${metrics.total_cost:.4f}")
        print(f"{'' * 60}\n")

Checkpoint 3.2: Create examples

# examples/basic_usage.py
from pydantic import BaseModel, EmailStr, Field
from typing import Literal
from src.client import LLMClient

class User(BaseModel):
    name: str = Field(..., min_length=1, max_length=100)
    age: int = Field(..., ge=0, le=120)
    email: EmailStr
    subscription: Literal["free", "pro", "enterprise"]

    class Config:
        extra = "forbid"

def main():
    client = LLMClient(model="gpt-4", verbose=True)

    text = "My name is Alice, I'm twenty-five years old, email alice@example.com, I want the pro plan"

    try:
        user = client.generate_json(
            prompt=f"Extract user information from: {text}",
            schema=User
        )

        print(f"Extracted user: {user}")
        print(f"Name: {user.name} (type: {type(user.name).__name__})")
        print(f"Age: {user.age} (type: {type(user.age).__name__})")

    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()

Checkpoint 3.3: Add comprehensive tests

# tests/test_client.py
import pytest
from pydantic import BaseModel
from src.client import LLMClient
from src.errors import MaxRetriesExceeded

class SimpleUser(BaseModel):
    name: str
    age: int

@pytest.fixture
def client():
    return LLMClient(model="gpt-3.5-turbo", max_repair_attempts=3)

def test_successful_generation(client):
    """Test successful JSON generation"""
    result = client.generate_json(
        prompt="Extract: Alice, 25 years old",
        schema=SimpleUser
    )

    assert isinstance(result, SimpleUser)
    assert result.name == "Alice"
    assert result.age == 25
    assert isinstance(result.age, int)  # Not string!

def test_repair_type_error(client):
    """Test that type errors get repaired"""
    # This might generate age as string first, then repair
    result = client.generate_json(
        prompt="Extract: Bob, age twenty-five",
        schema=SimpleUser
    )

    assert isinstance(result.age, int)

def test_max_retries_exceeded(client):
    """Test that MaxRetriesExceeded is raised"""
    # Use a very strict schema that's hard to satisfy
    class StrictUser(BaseModel):
        name: str
        age: int
        # Add many complex constraints

    with pytest.raises(MaxRetriesExceeded) as exc:
        # Give it a prompt that will likely fail
        client.generate_json(
            prompt="Extract user from gibberish: asdf qwer zxcv",
            schema=StrictUser
        )

    assert exc.value.attempts == 3

5.4 Key Implementation Decisions

Decision 1: When to lower temperature?

  • Choice: Start at 0.3, drop to 0.0 for repairs
  • Rationale: Initial generation benefits from slight creativity to understand intent. Repairs need precision.

Decision 2: How many repair attempts?

  • Choice: Default to 3 attempts
  • Rationale: Data shows 99% success rate with 3 attempts. More attempts have diminishing returns.

Decision 3: Include schema in prompt or use function calling?

  • Choice: Include schema in prompt for simplicity, offer function calling as advanced option
  • Rationale: Schema in prompt works across all models. Function calling requires provider-specific code.
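
As a sketch of the function-calling option mentioned above (assuming the openai v1 SDK's tools parameter and Pydantic v2's model_json_schema(); the tool name "extract" is arbitrary):

# Sketch: passing the schema as a tool definition instead of prompt text.
import json
from openai import OpenAI

def generate_via_tool(client: OpenAI, model: str, prompt: str, schema) -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        tools=[{
            "type": "function",
            "function": {
                "name": "extract",
                "description": "Return the extracted data",
                "parameters": schema.model_json_schema(),
            },
        }],
        tool_choice={"type": "function", "function": {"name": "extract"}},
    )
    # Arguments arrive as a JSON string and still need schema validation:
    # function calling constrains the shape loosely, not strictly.
    args = response.choices[0].message.tool_calls[0].function.arguments
    return json.loads(args)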

Decision 4: Raise exception vs return Result type?

  • Choice: Raise exception (Python standard), provide try_generate_json for Result type
  • Rationale: Python developers expect exceptions. Result type available for functional style.
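
A possible shape for that try_generate_json variant (a sketch reusing GenerationResult from Section 4.3; the src/results.py module is hypothetical):

# Sketch: Result-style wrapper around the exception-based client API.
from typing import Type
from pydantic import BaseModel

from src.client import LLMClient
from src.errors import MaxRetriesExceeded
from src.results import GenerationResult  # hypothetical home for the Section 4.3 dataclass

def try_generate_json(
    client: LLMClient,
    prompt: str,
    schema: Type[BaseModel],
) -> GenerationResult:
    """Like generate_json, but returns a Result instead of raising."""
    try:
        data, metrics = client.generate_with_metrics(prompt=prompt, schema=schema)
        return GenerationResult(success=True, data=data, error=None, metrics=metrics)
    except MaxRetriesExceeded as exc:
        return GenerationResult(success=False, data=None, error=exc, metrics=exc.metrics)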

6. Testing Strategy

6.1 Test Categories

  1. Unit Tests: Test individual components (validator, prompt builder)
  2. Integration Tests: Test full generation pipeline with mocked LLM
  3. End-to-End Tests: Test with real LLM API (mark as slow/expensive)
  4. Repair Tests: Test repair loop with intentionally broken outputs

6.2 Critical Test Cases

Unit Tests

Test: Schema Validator

def test_validator_extracts_type_errors():
    """Validator correctly identifies type mismatches"""
    validator = SchemaValidator(User)

    invalid_json = '{"name": "Alice", "age": "25", "email": "alice@example.com", "subscription": "pro"}'

    with pytest.raises(ValidationError) as exc:
        validator.validate(invalid_json)

    errors = validator.extract_errors(exc.value)

    assert len(errors) == 1
    assert errors[0].field_name == "age"
    assert errors[0].expected_type == "integer"
    assert errors[0].received_value == "twenty-five"

def test_validator_detects_hallucinated_fields():
    """Validator rejects extra fields when extra='forbid'"""
    validator = SchemaValidator(User)

    invalid_json = '{"name": "Alice", "age": 25, "email": "alice@example.com", "subscription": "pro", "admin": true}'

    with pytest.raises(ValidationError) as exc:
        validator.validate(invalid_json)

    errors = validator.extract_errors(exc.value)
    assert any("extra" in e.error_message.lower() for e in errors)

Test: Prompt Builder

def test_prompt_builder_formats_errors():
    """Prompt builder creates clear repair instructions"""
    builder = RepairPromptBuilder()

    errors = [
        ErrorDetail(
            field_path="$.age",
            field_name="age",
            expected_type="integer",
            received_value="25",
            error_message="Input should be a valid integer",
            suggestion="Convert to integer (e.g., 25, not '25')"
        )
    ]

    prompt = builder.build(errors)

    assert "age" in prompt
    assert "integer" in prompt
    assert "25" in prompt
    assert "Do not change the semantic content" in prompt

Integration Tests

Test: Successful Repair

def test_repair_loop_fixes_type_error(mocker):
    """Repair loop successfully fixes a type error"""
    # Mock LLM to return wrong type first, then correct type
    mock_provider = mocker.Mock()
    mock_provider.complete.side_effect = [
        # Attempt 1: Wrong type
        CompletionResponse(
            content='{"name": "Alice", "age": "25"}',
            model="gpt-4",
            usage=TokenUsage(10, 10, 20),
            latency_ms=100
        ),
        # Attempt 2: Correct type
        CompletionResponse(
            content='{"name": "Alice", "age": 25}',
            model="gpt-4",
            usage=TokenUsage(15, 10, 25),
            latency_ms=120
        )
    ]

    loop = RepairLoop(provider=mock_provider, max_attempts=3)

    result, metrics = loop.execute(
        initial_prompt="Extract: Alice, 25",
        schema=SimpleUser
    )

    assert result.age == 25
    assert isinstance(result.age, int)
    assert metrics.attempts == 2
    assert metrics.success == True

Test: Max Retries Exceeded

def test_repair_loop_fails_after_max_attempts(mocker):
    """Repair loop raises exception after max attempts"""
    # Mock LLM to always return invalid output
    mock_provider = mocker.Mock()
    mock_provider.complete.return_value = CompletionResponse(
        content='{"invalid": "json"}',
        model="gpt-4",
        usage=TokenUsage(10, 10, 20),
        latency_ms=100
    )

    loop = RepairLoop(provider=mock_provider, max_attempts=3)

    with pytest.raises(MaxRetriesExceeded) as exc:
        loop.execute(
            initial_prompt="Extract user",
            schema=SimpleUser
        )

    assert exc.value.attempts == 3
    assert exc.value.metrics.success == False

End-to-End Tests (Real API)

@pytest.mark.slow
@pytest.mark.real_api
def test_real_llm_generation():
    """Test with real LLM API"""
    client = LLMClient(model="gpt-3.5-turbo")

    result = client.generate_json(
        prompt="Extract: Alice, 25, alice@example.com, pro plan",
        schema=User
    )

    assert result.name == "Alice"
    assert result.age == 25
    assert result.email == "alice@example.com"
    assert result.subscription == "pro"

@pytest.mark.slow
@pytest.mark.real_api
def test_complex_schema():
    """Test with nested, complex schema"""
    class Address(BaseModel):
        street: str
        city: str
        zipcode: str = Field(..., regex=r'^\d{5}$')

    class ComplexUser(BaseModel):
        name: str
        addresses: list[Address]

    client = LLMClient(model="gpt-4")

    result = client.generate_json(
        prompt="Extract user: Alice lives at 123 Main St, Springfield, 12345",
        schema=ComplexUser
    )

    assert len(result.addresses) > 0
    assert result.addresses[0].city == "Springfield"

6.3 Test Data

Example Schemas:

# tests/fixtures/schemas.py
from pydantic import BaseModel, Field, EmailStr
from typing import Literal, Optional
from enum import Enum

class SimpleUser(BaseModel):
    name: str
    age: int

class StrictUser(BaseModel):
    name: str = Field(..., min_length=1, max_length=100)
    age: int = Field(..., ge=0, le=120)
    email: EmailStr
    subscription: Literal["free", "pro", "enterprise"]

    class Config:
        extra = "forbid"

class Recipe(BaseModel):
    title: str = Field(..., min_length=3)
    ingredients: list[str] = Field(..., min_items=1)
    cooking_time_minutes: int = Field(..., ge=1, le=1440)
    difficulty: Literal["easy", "medium", "hard"]
    servings: Optional[int] = Field(None, ge=1)

class Address(BaseModel):
    street: str
    city: str
    state: str = Field(..., regex=r'^[A-Z]{2}$')
    zipcode: str = Field(..., regex=r'^\d{5}$')

class UserWithAddress(BaseModel):
    name: str
    email: EmailStr
    address: Address

Mock Responses:

# tests/fixtures/mock_responses.py

# Type error: age as string
MOCK_TYPE_ERROR = '{"name": "Alice", "age": "25"}'

# Missing field: no email
MOCK_MISSING_FIELD = '{"name": "Alice", "age": 25, "subscription": "pro"}'

# Enum error: invalid value
MOCK_ENUM_ERROR = '{"name": "Alice", "age": 25, "email": "alice@example.com", "subscription": "premium"}'

# Hallucinated field
MOCK_EXTRA_FIELD = '{"name": "Alice", "age": 25, "email": "alice@example.com", "subscription": "pro", "admin": true}'

# Valid output
MOCK_VALID = '{"name": "Alice", "age": 25, "email": "alice@example.com", "subscription": "pro"}'
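
One way to exercise these fixtures is a single parametrized test over all of the broken outputs (a sketch: it assumes the SchemaValidator from 6.2, an import path of src.validator, and that validate() returns the parsed model on success):

# tests/test_fixtures.py (sketch)
import pytest
from pydantic import ValidationError

from src.validator import SchemaValidator  # adjust to wherever SchemaValidator lives
from tests.fixtures.schemas import StrictUser
from tests.fixtures.mock_responses import (
    MOCK_TYPE_ERROR,
    MOCK_MISSING_FIELD,
    MOCK_ENUM_ERROR,
    MOCK_EXTRA_FIELD,
    MOCK_VALID,
)

@pytest.mark.parametrize("raw", [
    MOCK_TYPE_ERROR,
    MOCK_MISSING_FIELD,
    MOCK_ENUM_ERROR,
    MOCK_EXTRA_FIELD,
])
def test_strict_user_rejects_broken_outputs(raw):
    """Every broken fixture should fail StrictUser validation"""
    with pytest.raises(ValidationError):
        SchemaValidator(StrictUser).validate(raw)

def test_strict_user_accepts_valid_output():
    """The valid fixture parses cleanly (assumes validate() returns the model)"""
    user = SchemaValidator(StrictUser).validate(MOCK_VALID)
    assert user.name == "Alice"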

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Mistake 1: Not using extra="forbid" in Pydantic models

Problem:

class User(BaseModel):
    name: str
    age: int
    # Missing: class Config with extra="forbid"

# Model accepts hallucinated fields!
user = User.parse_obj({
    "name": "Alice",
    "age": 25,
    "admin": True  # ← Should be rejected but isn't!
})

Solution:

class User(BaseModel):
    name: str
    age: int

    class Config:
        extra = "forbid"  # Reject unknown fields

Mistake 2: Not lowering temperature for repair attempts

Problem:

# Always using same temperature
for attempt in range(3):
    response = llm.complete(messages, temperature=0.7)  # Too creative for repairs!

Solution:

# Decrease temperature for repairs
for attempt in range(1, 4):
    temp = 0.3 if attempt == 1 else 0.0
    response = llm.complete(messages, temperature=temp)

Mistake 3: Generic repair prompts

Problem:

repair_prompt = "Your JSON was invalid. Please fix it."
# Model doesn't know WHAT to fix!

Solution:

repair_prompt = """
Field 'age' error:
  Expected: integer
  Received: string ("25")
  Fix: Change "25" to 25 (remove quotes)
"""

Mistake 4: Infinite retry loops

Problem:

while True:
    try:
        return validate(response)
    except:
        response = retry()  # Never exits!

Solution:

for attempt in range(MAX_ATTEMPTS):
    try:
        return validate(response)
    except ValidationError:
        if attempt == MAX_ATTEMPTS - 1:
            raise MaxRetriesExceeded()
        response = retry()

7.2 Debugging Strategies

Issue: Type errors persist after repair

Symptoms:

  • Repair loop exhausts all attempts
  • Same type error on every attempt
  • Model keeps returning string instead of integer

Debug Steps:

  1. Enable verbose mode:
    client = LLMClient(verbose=True)
    # See exactly what model outputs each attempt
    
  2. Check repair prompt clarity:
    # Is your repair prompt specific enough?
    # Bad: "age should be integer"
    # Good: "Change \"25\" to 25 (remove quotes, keep just the number)"
    
  3. Try temperature=0.0 from start:
    client = LLMClient(temperature_strategy="constant_zero")
    # Some models need maximum determinism
    
  4. Check if schema is too complex:
    # Simplify schema temporarily to isolate issue
    class SimpleTest(BaseModel):
        age: int  # Just test this one field
    

Issue: Model hallucinating extra fields

Symptoms:

  • Validation fails with “extra fields not permitted”
  • Model invents fields not in schema

Debug Steps:

  1. Verify extra="forbid" is set:
    class User(BaseModel):
        # ... fields ...
        class Config:
            extra = "forbid"  # Must be present!
    
  2. Make schema explicit in prompt:
    prompt = f"""
    {user_prompt}
    
    Return ONLY these exact fields: name, age, email, subscription
    Do NOT add any other fields.
    """
    
  3. Check for provider-specific issues:
    # Some models are more prone to hallucination
    # Try gpt-4 instead of gpt-3.5-turbo
    

Issue: JSON syntax errors

Symptoms:

  • JSONDecodeError: Expecting ‘,’ delimiter
  • Missing brackets, quotes, etc.

Debug Steps:

  1. Add explicit JSON formatting instruction:
    prompt = f"""
    {user_prompt}
    
    Return ONLY valid JSON. Example format:
    {{
      "name": "string",
      "age": 25
    }}
    
    Do not include markdown code blocks or explanations.
    """
    
  2. Strip markdown formatting from response:
    def clean_response(response: str) -> str:
        # Remove markdown code blocks
        response = response.strip()
        if response.startswith("```json"):
            response = response[7:]
        if response.startswith("```"):
            response = response[3:]
        if response.endswith("```"):
            response = response[:-3]
        return response.strip()
    
  3. Use provider’s JSON mode if available:
    # OpenAI has response_format parameter
    response = client.chat.completions.create(
        ...,
        response_format={"type": "json_object"}
    )
    

7.3 Performance Issues

Issue: Slow generation (>5 seconds per request)

Causes:

  • Too many repair attempts
  • Large prompts
  • Slow model (gpt-4 vs gpt-3.5-turbo)

Solutions:

  1. Reduce max attempts:
    client = LLMClient(max_repair_attempts=2)  # Instead of 3
    
  2. Use faster model for simple schemas:
    client = LLMClient(model="gpt-3.5-turbo")  # significantly faster and cheaper than gpt-4
    
  3. Optimize prompt length:
    # Instead of full schema JSON
    prompt = "Extract user (name, age, email) from: ..."
    # vs
    prompt = f"Extract user from: ...\n\nSchema:\n{full_schema_json}"
    

Issue: High costs ($1+ per 1000 requests)

Causes:

  • Too many repair attempts
  • Expensive model
  • Inefficient prompts

Solutions:

  1. Track and analyze costs:
    result, metrics = client.generate_with_metrics(prompt, schema)
    print(f"Cost: ${metrics.total_cost:.4f}")
    
    # Analyze: Are repairs common? Switch to better model.
    
  2. Use cheaper model for initial attempt:
    # Try gpt-3.5-turbo first, fallback to gpt-4
    try:
        result = cheap_client.generate_json(prompt, schema)
    except MaxRetriesExceeded:
        result = expensive_client.generate_json(prompt, schema)
    
  3. Cache results for duplicate requests:
    from functools import lru_cache
    
    @lru_cache(maxsize=1000)
    def cached_generate(prompt: str, schema):
        # prompt (str) and schema (a BaseModel class) are both hashable,
        # so lru_cache can key on them directly
        return client.generate_json(prompt, schema)
    

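To see how repair attempts drive spend, a back-of-the-envelope model helps (a sketch; the per-token prices are assumptions taken loosely from the provider table in 10.2, so substitute current rates):

# Rough repair-loop cost model (illustrative prices, not current pricing)
PRICE_PER_1K_INPUT = 0.03   # e.g. GPT-4 input, $ per 1K tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.06  # assumed output price, $ per 1K tokens

def estimated_cost(prompt_tokens: int, output_tokens: int, attempts: int) -> float:
    # Each repair re-sends the growing conversation, so input tokens grow with the
    # attempt number, while output tokens stay roughly constant per attempt.
    total = 0.0
    for attempt in range(1, attempts + 1):
        input_tokens = prompt_tokens + (attempt - 1) * output_tokens
        total += input_tokens / 1000 * PRICE_PER_1K_INPUT
        total += output_tokens / 1000 * PRICE_PER_1K_OUTPUT
    return total

print(f"1 attempt:  ${estimated_cost(400, 120, attempts=1):.4f}")
print(f"2 attempts: ${estimated_cost(400, 120, attempts=2):.4f}")
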
8. Extensions & Challenges

8.1 Beginner Extensions

Extension 1: Add Anthropic Provider Support

Goal: Support Claude models in addition to OpenAI

Implementation:

# src/providers/anthropic.py
from typing import List

from anthropic import Anthropic
from .base import LLMProvider, Message, CompletionResponse, TokenUsage

class AnthropicProvider(LLMProvider):
    def __init__(self, api_key: str, model: str = "claude-3-sonnet-20240229"):
        self.client = Anthropic(api_key=api_key)
        self.model = model

    def complete(
        self,
        messages: List[Message],
        temperature: float = 0.3,
        max_tokens: int = 2000
    ) -> CompletionResponse:
        # Convert messages to Anthropic format
        anthropic_messages = [
            {"role": msg.role, "content": msg.content}
            for msg in messages
            if msg.role != "system"  # Anthropic handles system separately
        ]

        # Extract system message
        system_msg = next(
            (msg.content for msg in messages if msg.role == "system"),
            None
        )

        response = self.client.messages.create(
            model=self.model,
            messages=anthropic_messages,
            system=system_msg,
            temperature=temperature,
            max_tokens=max_tokens
        )

        return CompletionResponse(
            content=response.content[0].text,
            model=response.model,
            usage=TokenUsage(
                prompt_tokens=response.usage.input_tokens,
                completion_tokens=response.usage.output_tokens,
                total_tokens=response.usage.input_tokens + response.usage.output_tokens
            ),
            latency_ms=0  # Would need to track separately
        )

Learning Goals:

  • Understand provider abstraction patterns
  • Handle different API formats
  • Deal with provider-specific features

Extension 2: Add Retry with Exponential Backoff

Goal: Handle rate limits and transient errors gracefully

Implementation:

# src/retry.py
import time
import random
from typing import Callable, TypeVar

T = TypeVar('T')

def retry_with_backoff(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0
) -> T:
    """
    Retry function with exponential backoff.

    Delay formula: min(base_delay * 2^attempt + jitter, max_delay)
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                raise

            # Calculate delay with exponential backoff and jitter
            delay = min(
                base_delay * (2 ** attempt) + random.uniform(0, 1),
                max_delay
            )

            print(f"Retry {attempt + 1}/{max_retries} after {delay:.2f}s due to: {e}")
            time.sleep(delay)

    raise RuntimeError("Should never reach here")

# Usage in provider
def complete(self, messages, temperature):
    return retry_with_backoff(
        lambda: self._call_api(messages, temperature),
        max_retries=3
    )

Learning Goals:

  • Implement retry patterns
  • Handle API rate limits
  • Add resilience to network issues

Extension 3: Add Streaming Support

Goal: Stream tokens as they’re generated instead of waiting for complete response

Implementation:

# src/streaming.py
from typing import Generator, Type

from pydantic import BaseModel, ValidationError

from .client import LLMClient

class StreamingEnforcer:
    def __init__(self, client: LLMClient):
        self.client = client

    def generate_json_stream(
        self,
        prompt: str,
        schema: Type[BaseModel]
    ) -> Generator[str, None, BaseModel]:
        """
        Stream JSON generation, yielding chunks as they arrive.

        Yields:
            JSON string chunks

        Final validation happens after the stream completes; the validated
        object becomes the generator's return value (StopIteration.value).
        """
        buffer = ""

        for chunk in self.client.provider.stream_complete(prompt):
            buffer += chunk
            yield chunk

        # Validate complete response
        try:
            return schema.parse_raw(buffer)
        except ValidationError:
            # Trigger repair loop (the repaired output is not re-streamed)
            return self.client.generate_json(prompt, schema)

# Usage
enforcer = StreamingEnforcer(client)
for chunk in enforcer.generate_json_stream(prompt, User):
    print(chunk, end="", flush=True)

Learning Goals:

  • Handle streaming APIs
  • Buffer incomplete data
  • Validate after stream completes

8.2 Intermediate Extensions

Extension 4: Multi-Provider Fallback

Goal: Try multiple providers in order (e.g., OpenAI → Anthropic → Local)

Implementation:

# src/multi_provider.py
from typing import List, Type
from pydantic import BaseModel
from .client import LLMClient
from .errors import MaxRetriesExceeded

class MultiProviderClient:
    def __init__(self, providers: List[LLMClient]):
        """
        Initialize with list of clients in priority order.

        Example:
            client = MultiProviderClient([
                LLMClient(model="gpt-4"),  # Try first
                LLMClient(model="claude-3-sonnet"),  # Fallback
            ])
        """
        self.providers = providers

    def generate_json(
        self,
        prompt: str,
        schema: Type[BaseModel]
    ) -> BaseModel:
        """
        Try providers in order until one succeeds.
        """
        errors = []

        for idx, provider in enumerate(self.providers):
            try:
                print(f"Attempting with provider {idx + 1}/{len(self.providers)}")
                return provider.generate_json(prompt, schema)

            except MaxRetriesExceeded as e:
                errors.append(e)
                print(f"Provider {idx + 1} failed: {e}")
                continue

        # All providers failed
        raise Exception(
            f"All {len(self.providers)} providers failed. "
            f"Errors: {[str(e) for e in errors]}"
        )

# Usage
client = MultiProviderClient([
    LLMClient(model="gpt-4"),
    LLMClient(model="gpt-3.5-turbo"),
])

result = client.generate_json(prompt, schema)

Learning Goals:

  • Implement fallback patterns
  • Handle multiple API providers
  • Design fault-tolerant systems

Extension 5: Batch Processing with Concurrency

Goal: Process multiple requests concurrently to improve throughput

Implementation:

# src/batch.py
import asyncio
from typing import List, Type
from pydantic import BaseModel
from .client import LLMClient

class BatchEnforcer:
    def __init__(self, client: LLMClient, max_concurrent: int = 5):
        self.client = client
        self.max_concurrent = max_concurrent

    async def generate_batch(
        self,
        prompts: List[str],
        schema: Type[BaseModel]
    ) -> List[BaseModel]:
        """
        Process multiple prompts concurrently.

        Args:
            prompts: List of prompts to process
            schema: Schema for all responses

        Returns:
            List of validated results (same order as prompts)
        """
        semaphore = asyncio.Semaphore(self.max_concurrent)

        async def process_one(prompt: str) -> BaseModel:
            async with semaphore:
                # Use async LLM client here
                return await self.client.generate_json_async(prompt, schema)

        tasks = [process_one(prompt) for prompt in prompts]
        return await asyncio.gather(*tasks)

# Usage
batch_client = BatchEnforcer(client, max_concurrent=10)

prompts = [
    "Extract user: Alice, 25",
    "Extract user: Bob, 30",
    "Extract user: Carol, 28",
]

results = asyncio.run(batch_client.generate_batch(prompts, User))
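
generate_json_async is assumed above; if your client only exposes the synchronous generate_json, one stopgap (an assumption, not part of the library) is to off-load the blocking call to a worker thread:

import asyncio

async def generate_json_async(client, prompt, schema):
    # Run the blocking synchronous call in a thread (Python 3.9+). A true async
    # HTTP client is still preferable for high concurrency.
    return await asyncio.to_thread(client.generate_json, prompt, schema)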

Learning Goals:

  • Implement async/await patterns
  • Manage concurrent API requests
  • Handle rate limiting at scale

Extension 6: Validation Confidence Scores

Goal: Return confidence score for each field based on repair history

Implementation:

# src/confidence.py
from dataclasses import dataclass
from typing import Dict

from pydantic import BaseModel

from .client import GenerationMetrics  # or wherever your GenerationMetrics dataclass lives

@dataclass
class ValidationConfidence:
    overall_confidence: float  # 0.0 to 1.0
    field_confidence: Dict[str, float]  # Per-field confidence
    repair_required: bool
    repair_attempts: int

class ConfidenceTracker:
    def calculate_confidence(
        self,
        result: BaseModel,
        metrics: GenerationMetrics
    ) -> ValidationConfidence:
        """
        Calculate confidence based on repair history.

        Logic:
        - 1.0: Validated on first attempt
        - 0.8: Required 1 repair
        - 0.6: Required 2 repairs
        - 0.4: Required 3 repairs (max)
        """
        overall_confidence = max(0.4, 1.0 - (metrics.attempts - 1) * 0.2)

        # Analyze which fields had errors
        field_confidence = {}
        all_errors = []
        for error_list in metrics.validation_errors:
            all_errors.extend(error_list)

        # Fields that had errors get lower confidence
        error_fields = {err.field_name for err in all_errors}

        for field_name in result.__fields__.keys():
            if field_name in error_fields:
                # Field was repaired
                field_confidence[field_name] = max(0.5, overall_confidence)
            else:
                # Field was correct from start
                field_confidence[field_name] = 1.0

        return ValidationConfidence(
            overall_confidence=overall_confidence,
            field_confidence=field_confidence,
            repair_required=metrics.attempts > 1,
            repair_attempts=metrics.attempts - 1
        )

# Usage
result, metrics = client.generate_with_metrics(prompt, schema)
confidence = ConfidenceTracker().calculate_confidence(result, metrics)

print(f"Overall confidence: {confidence.overall_confidence:.2f}")
print(f"Field confidence:")
for field, conf in confidence.field_confidence.items():
    print(f"  {field}: {conf:.2f}")

Learning Goals:

  • Design confidence scoring systems
  • Track validation history
  • Provide interpretability

8.3 Advanced Extensions

Extension 7: Type-Safe TypeScript Version (Zod)

Goal: Build equivalent library for TypeScript using Zod

Implementation:

// src/client.ts
import { z } from 'zod';
import OpenAI from 'openai';

interface GenerationMetrics {
  attempts: number;
  success: boolean;
  totalTokens: number;
  totalCost: number;
}

class LLMClient {
  private openai: OpenAI;
  private maxRepairAttempts: number;

  constructor(apiKey: string, maxRepairAttempts: number = 3) {
    this.openai = new OpenAI({ apiKey });
    this.maxRepairAttempts = maxRepairAttempts;
  }

  async generateJSON<T extends z.ZodType>(
    prompt: string,
    schema: T
  ): Promise<z.infer<T>> {
    const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
      { role: 'user', content: prompt }
    ];

    for (let attempt = 1; attempt <= this.maxRepairAttempts; attempt++) {
      const temperature = attempt === 1 ? 0.3 : 0.0;

      const response = await this.openai.chat.completions.create({
        model: 'gpt-4',
        messages,
        temperature
      });

      const content = response.choices[0].message.content;

      try {
        const data = JSON.parse(content || '{}');
        const validated = schema.parse(data);
        return validated;
      } catch (error) {
        if (attempt === this.maxRepairAttempts) {
          throw new Error(`Failed after ${attempt} attempts: ${error}`);
        }

        // Build repair prompt
        const repairPrompt = this.buildRepairPrompt(error);
        messages.push(
          { role: 'assistant', content: content || '' },
          { role: 'user', content: repairPrompt }
        );
      }
    }

    throw new Error('Should never reach here');
  }

  private buildRepairPrompt(error: unknown): string {
    if (error instanceof z.ZodError) {
      const errors = error.errors.map(e =>
        `Field '${e.path.join('.')}': ${e.message}`
      ).join('\n');

      return `Your JSON had validation errors:\n${errors}\n\nPlease fix and return valid JSON.`;
    }

    return 'Your JSON was invalid. Please return valid JSON.';
  }
}

// Usage
const UserSchema = z.object({
  name: z.string().min(1),
  age: z.number().int().min(0).max(120),
  email: z.string().email(),
  subscription: z.enum(['free', 'pro', 'enterprise'])
}).strict();

const client = new LLMClient(process.env.OPENAI_API_KEY!);

const user = await client.generateJSON(
  "Extract: Alice, 25, alice@example.com, pro plan",
  UserSchema
);

console.log(user.name);  // TypeScript knows this is a string!

Learning Goals:

  • Port Python concepts to TypeScript
  • Use Zod for runtime validation
  • Leverage TypeScript’s type system

Extension 8: Custom Repair Strategies

Goal: Allow users to define custom repair logic per field type

Implementation:

# src/repair_strategies.py
from typing import Any, Callable, Dict, List

from .errors import ErrorDetail  # or wherever your ErrorDetail dataclass lives

RepairStrategy = Callable[[Any, str], str]

class CustomRepairEngine:
    def __init__(self):
        self.strategies: Dict[str, RepairStrategy] = {}

    def register_strategy(
        self,
        field_name: str,
        strategy: RepairStrategy
    ):
        """
        Register custom repair strategy for a field.

        Args:
            field_name: Name of field
            strategy: Function that takes (value, error_msg) and returns repair instruction
        """
        self.strategies[field_name] = strategy

    def build_repair_prompt(
        self,
        errors: List[ErrorDetail]
    ) -> str:
        """Build repair prompt using custom strategies"""
        lines = ["Your JSON had errors:\n"]

        for error in errors:
            if error.field_name in self.strategies:
                # Use custom strategy
                strategy = self.strategies[error.field_name]
                instruction = strategy(error.received_value, error.error_message)
                lines.append(f"Field '{error.field_name}': {instruction}")
            else:
                # Use default
                lines.append(f"Field '{error.field_name}': {error.error_message}")

        return "\n".join(lines)

# Usage: Define custom repair logic for age field
def age_repair_strategy(value: Any, error_msg: str) -> str:
    """Custom repair for age field"""
    if isinstance(value, str):
        # Try to extract number from string
        words_to_numbers = {
            "twenty-five": 25,
            "thirty": 30,
            # etc.
        }

        if value.lower() in words_to_numbers:
            correct_value = words_to_numbers[value.lower()]
            return f"Convert '{value}' to number {correct_value}"

    return "Must be an integer between 0 and 120"

engine = CustomRepairEngine()
engine.register_strategy("age", age_repair_strategy)

Learning Goals:

  • Design plugin architectures
  • Create domain-specific repair logic
  • Allow library customization

Extension 9: Statistical Validation

Goal: Validate outputs statistically over multiple samples

Implementation:

# src/statistical_validation.py
from typing import Type, List
from pydantic import BaseModel
from dataclasses import dataclass
import statistics

@dataclass
class StatisticalResult:
    median_result: BaseModel
    confidence: float
    consistency_score: float
    all_results: List[BaseModel]

class StatisticalEnforcer:
    def __init__(self, client: LLMClient, num_samples: int = 5):
        self.client = client
        self.num_samples = num_samples

    def generate_with_consensus(
        self,
        prompt: str,
        schema: Type[BaseModel]
    ) -> StatisticalResult:
        """
        Generate multiple samples and return most consistent result.

        Useful for critical applications where you need high confidence.
        """
        results = []

        for i in range(self.num_samples):
            try:
                result = self.client.generate_json(prompt, schema)
                results.append(result)
            except Exception as e:
                print(f"Sample {i+1} failed: {e}")

        if not results:
            raise Exception("All samples failed")

        # Find most common result (by field values)
        # For simplicity, return median of numeric fields
        median = self._calculate_median_result(results, schema)

        # Calculate consistency score
        consistency = self._calculate_consistency(results)

        return StatisticalResult(
            median_result=median,
            confidence=len(results) / self.num_samples,
            consistency_score=consistency,
            all_results=results
        )

    def _calculate_median_result(
        self,
        results: List[BaseModel],
        schema: Type[BaseModel]
    ) -> BaseModel:
        """Calculate median result across samples"""
        # Implementation depends on schema
        # For numeric fields, take median
        # For string fields, take most common value
        pass

    def _calculate_consistency(self, results: List[BaseModel]) -> float:
        """Calculate how consistent results are (0.0 to 1.0)"""
        # Compare each result to others
        # Return percentage of fields that match across all samples
        pass

# Usage
stat_client = StatisticalEnforcer(client, num_samples=5)
result = stat_client.generate_with_consensus(prompt, User)

print(f"Confidence: {result.confidence:.2f}")
print(f"Consistency: {result.consistency_score:.2f}")
print(f"Result: {result.median_result}")
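
The two private methods are deliberately left as stubs to implement yourself. As one possible take on _calculate_consistency (an assumption, not the definitive metric), compare every pair of samples field by field:

# Sketch: consistency = fraction of (field, sample-pair) comparisons that agree exactly
from itertools import combinations
from typing import List

from pydantic import BaseModel

def calculate_consistency(results: List[BaseModel]) -> float:
    if len(results) < 2:
        return 1.0
    fields = list(results[0].__fields__.keys())  # Pydantic v1 field registry
    matches = 0
    total = 0
    for a, b in combinations(results, 2):
        for field in fields:
            total += 1
            if getattr(a, field) == getattr(b, field):
                matches += 1
    return matches / total if total else 1.0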

Learning Goals:

  • Apply statistical methods to LLM outputs
  • Handle uncertainty quantification
  • Design high-reliability systems

9. Real-World Connections

9.1 Industry Applications

Use Case 1: E-Commerce Product Data Extraction

Company: Shopify

Problem: Merchants upload unstructured product descriptions. Need to extract structured data (price, dimensions, materials).

Solution with JSON Enforcer:

from pydantic import BaseModel, Field
from typing import List, Optional

class ProductDimensions(BaseModel):
    length_cm: float = Field(..., gt=0)
    width_cm: float = Field(..., gt=0)
    height_cm: float = Field(..., gt=0)
    weight_kg: float = Field(..., gt=0)

class Product(BaseModel):
    name: str = Field(..., min_length=1, max_length=200)
    price_usd: float = Field(..., gt=0)
    description: str = Field(..., max_length=5000)
    dimensions: Optional[ProductDimensions]
    materials: List[str] = Field(default_factory=list)
    category: str

    class Config:
        extra = "forbid"

client = LLMClient(model="gpt-4")

unstructured_text = """
Premium Leather Wallet - $45
Made from genuine Italian leather
Measures 4.5" x 3.5" x 0.5", weighs about 100g
Available in black, brown, tan
"""

product = client.generate_json(
    prompt=f"Extract product data from:\n{unstructured_text}",
    schema=Product
)

# Guaranteed structured data for database
store_product(product)

Result: 99.5% extraction accuracy, reduced manual data entry by 80%

Use Case 2: Financial Document Processing

Company: Stripe

Problem: Extract invoice data from PDFs/images for automated billing

Solution:

class InvoiceLineItem(BaseModel):
    description: str
    quantity: int = Field(..., ge=1)
    unit_price_usd: float = Field(..., gt=0)
    total_usd: float = Field(..., gt=0)

class Invoice(BaseModel):
    invoice_number: str = Field(..., regex=r'^INV-\d+$')
    date: str = Field(..., regex=r'^\d{4}-\d{2}-\d{2}$')
    vendor: str
    line_items: List[InvoiceLineItem] = Field(..., min_items=1)
    subtotal_usd: float = Field(..., gt=0)
    tax_usd: float = Field(..., ge=0)
    total_usd: float = Field(..., gt=0)

    class Config:
        extra = "forbid"

# Extract with automatic validation
invoice = client.generate_json(
    prompt=f"Extract invoice from OCR text:\n{ocr_text}",
    schema=Invoice
)

# Validate business logic
assert abs(invoice.total_usd - (invoice.subtotal_usd + invoice.tax_usd)) < 0.01

Result: 98% accuracy on invoices, saved $500K/year in manual processing

Use Case 3: Customer Support Ticket Classification

Company: Notion

Problem: Automatically categorize and route support tickets

Solution:

from enum import Enum

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    URGENT = "urgent"

class Category(str, Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    FEATURE_REQUEST = "feature_request"
    BUG_REPORT = "bug_report"
    ACCOUNT = "account"

class TicketClassification(BaseModel):
    category: Category
    priority: Priority
    suggested_team: str
    requires_escalation: bool
    estimated_resolution_hours: int = Field(..., ge=1, le=168)
    key_issues: List[str] = Field(..., min_items=1, max_items=5)

    class Config:
        extra = "forbid"

classification = client.generate_json(
    prompt=f"Classify support ticket:\n\n{ticket_text}",
    schema=TicketClassification
)

# Automatic routing
route_to_team(classification.suggested_team)
set_priority(classification.priority)

Result: Reduced ticket routing time by 90%, improved response SLA by 40%

9.2 Open Source Projects Using Similar Patterns

Instructor (Python)

URL: https://github.com/jxnl/instructor

What it does: Pydantic-based LLM output validation with retry logic

How it works:

import instructor
from openai import OpenAI

client = instructor.patch(OpenAI())

user = client.chat.completions.create(
    model="gpt-4",
    response_model=User,
    messages=[{"role": "user", "content": "Extract: Alice, 25"}]
)

Key Features:

  • Automatic Pydantic validation
  • Retry logic built-in
  • Support for complex nested schemas
  • Integration with OpenAI function calling

Your Implementation vs Instructor:

  • You built the core logic from scratch (better learning)
  • Instructor is production-optimized (use in real projects)
  • Both use same underlying principles

Marvin (AI Engineering)

URL: https://github.com/PrefectHQ/marvin

What it does: Type-safe AI engineering tools

Example:

import marvin

@marvin.fn
def extract_user(text: str) -> User:
    """Extract user information"""
    pass

user = extract_user("Alice, 25, alice@example.com")
# Returns validated User object

TypeChat (Microsoft)

URL: https://github.com/microsoft/TypeChat

What it does: TypeScript-first LLM interaction with schema validation

Key Principle: Use TypeScript types as the schema definition

9.3 Production Deployment Considerations

Caching Strategy

Problem: Identical prompts produce the same validated output, yet each call still costs tokens

Solution: Cache results keyed by prompt and schema (exact-match; true semantic caching of near-duplicate prompts is a further extension)

import hashlib
from typing import Type

from pydantic import BaseModel

class CachedLLMClient:
    def __init__(self, client: LLMClient, cache_size: int = 1000):
        self.client = client
        self.cache = {}
        self.cache_size = cache_size

    def generate_json(
        self,
        prompt: str,
        schema: Type[BaseModel]
    ) -> BaseModel:
        # Create cache key
        cache_key = hashlib.sha256(
            f"{prompt}:{schema.__name__}".encode()
        ).hexdigest()

        if cache_key in self.cache:
            print("Cache hit!")
            return self.cache[cache_key]

        # Generate
        result = self.client.generate_json(prompt, schema)

        # Cache result
        if len(self.cache) >= self.cache_size:
            # Remove oldest entry
            self.cache.pop(next(iter(self.cache)))

        self.cache[cache_key] = result
        return result

Monitoring and Alerting

Metrics to Track:

  • Success rate (% validations passing)
  • Average repair attempts
  • Cost per request
  • Latency (p50, p95, p99)
  • Field-level error rates

Implementation:

from datadog import DogStatsd

statsd = DogStatsd()

class MonitoredLLMClient:
    def __init__(self, client):
        self.client = client

    def generate_json(self, prompt, schema):
        try:
            result, metrics = self.client.generate_with_metrics(prompt, schema)

            # Send metrics
            statsd.increment('llm.requests.success')
            statsd.histogram('llm.attempts', metrics.attempts)
            statsd.histogram('llm.latency_ms', metrics.total_latency_ms)
            statsd.histogram('llm.cost_usd', metrics.total_cost)

            if metrics.attempts > 1:
                statsd.increment('llm.repairs.occurred')

            return result

        except MaxRetriesExceeded as e:
            statsd.increment('llm.requests.failed')
            statsd.histogram('llm.attempts', e.attempts)
            raise

Error Budget Management

Concept: Track reliability over time windows

class ErrorBudgetTracker:
    def __init__(self, target_success_rate: float = 0.99):
        self.target_success_rate = target_success_rate
        self.successes = 0
        self.failures = 0

    def record_result(self, success: bool):
        if success:
            self.successes += 1
        else:
            self.failures += 1

    def current_success_rate(self) -> float:
        total = self.successes + self.failures
        if total == 0:
            return 1.0
        return self.successes / total

    def error_budget_remaining(self) -> float:
        """
        Returns fraction of error budget remaining.

        1.0 = Full budget (perfect success rate)
        0.0 = Budget exhausted (at target rate)
        <0 = Over budget (below target rate)
        """
        current_rate = self.current_success_rate()
        target_rate = self.target_success_rate

        if current_rate >= target_rate:
            # Above target, budget remaining
            return (current_rate - target_rate) / (1.0 - target_rate)
        else:
            # Below target, over budget (negative)
            return (current_rate - target_rate) / target_rate

# Usage
tracker = ErrorBudgetTracker(target_success_rate=0.99)

for _ in range(1000):
    try:
        result = client.generate_json(prompt, schema)
        tracker.record_result(success=True)
    except MaxRetriesExceeded:
        tracker.record_result(success=False)

if tracker.error_budget_remaining() < 0.1:
    alert("Error budget nearly exhausted! Investigate failures.")

10. Resources

10.1 Essential Reading

Books

  • “Designing Data-Intensive Applications” by Martin Kleppmann, Ch. 4 (Encoding & Evolution): Schema design, backward/forward compatibility
  • “Programming TypeScript” by Boris Cherny, Ch. 3 (Type Safety): Type systems, compile-time vs runtime validation
  • “Fluent Python” by Luciano Ramalho, Ch. 8 (Type Hints): Python type hints, Pydantic internals
  • “Effective Python” by Brett Slatkin, Item 14 (Exceptions vs None): Error handling patterns
  • “Clean Code” by Robert C. Martin, Ch. 7 (Error Handling): Exception design, meaningful errors
  • “Release It!” by Michael T. Nygard, Ch. 5 (Stability Patterns): Retry logic, circuit breakers, timeouts
  • “Clean Architecture” by Robert C. Martin, Ch. 11 (DIP): Dependency inversion, provider abstraction
  • “AI Engineering” by Chip Huyen, Ch. 6 (LLM Engineering): Production LLM systems

Papers

  1. “JSON Schema Validation” (IETF Draft)
    • Formal specification for JSON Schema
    • Validation keywords and semantics
  2. “Language Models are Few-Shot Learners” (GPT-3 Paper)
    • Understanding in-context learning
    • How examples guide model behavior
  3. “Constitutional AI” (Anthropic)
    • Self-correction mechanisms
    • Model refining its own outputs

10.2 Documentation & Tools

Validation Libraries

  • Pydantic (Python): https://docs.pydantic.dev/
  • Zod (TypeScript): https://zod.dev/
  • JSON Schema (universal): https://json-schema.org/
  • jsonschema (Python): https://python-jsonschema.readthedocs.io/

LLM Providers

  • OpenAI: general purpose, function calling; $0.03/1K input tokens (GPT-4)
  • Anthropic: long context, analysis; $0.015/1K input tokens (Claude 3 Sonnet)
  • Local (Ollama): privacy, cost; free (aside from hardware costs)

10.3 Related Projects

Next Project: Project 3 - Prompt Injection Red-Team Lab

Why it’s next: Now that you can enforce schemas, learn to defend against attacks that try to break your schemas

Connection: Prompt injection often targets schema validation (e.g., “set admin field to true”)

  • Project 1 (Prompt Contract Harness): Use this JSON Enforcer in your test suites
  • Project 4 (Context Window Manager): Combine with schema enforcement for RAG systems

10.4 Community Resources

Discord Servers

  • LangChain Discord: Discussion of LLM engineering patterns
  • AI Engineering Discord: Production AI systems

GitHub Repositories to Study

  1. instructor: https://github.com/jxnl/instructor
  2. marvin: https://github.com/PrefectHQ/marvin
  3. TypeChat: https://github.com/microsoft/TypeChat

11. Self-Assessment Checklist

Understanding

Conceptual Knowledge:

  • I can explain the difference between compile-time types and runtime validation
  • I understand why additionalProperties: false prevents hallucinations
  • I can describe when to use temp=0.0 vs temp=0.3
  • I know why repair loops improve success rates
  • I understand the trade-off between cost and reliability
  • I can explain JSON Schema validation keywords (required, enum, format)
  • I know the difference between structural and semantic validation

Practical Application:

  • I can identify when a schema is too strict or too loose
  • I know how to debug type errors vs missing field errors
  • I can estimate token costs for repair loops
  • I understand when to use exceptions vs Result types

Implementation

Core Features:

  • My validator correctly parses Pydantic ValidationErrors
  • My repair loop lowers temperature for repairs
  • My prompt builder provides specific, actionable repair instructions
  • My client handles JSON syntax errors separately from schema errors
  • I track metrics (attempts, tokens, cost, latency)

Code Quality:

  • My code has type hints throughout
  • I have unit tests for validator, prompt builder, and repair loop
  • I have integration tests with mocked LLM responses
  • My error messages are actionable and clear
  • I follow PEP 8 (Python) or ESLint (TypeScript) style guidelines

Production Readiness:

  • I handle API rate limits gracefully
  • I support multiple LLM providers via abstraction
  • I log important events (errors, repairs, success)
  • My library is pip/npm installable
  • I have examples and documentation

Growth

Mastery Indicators:

  • I can design schemas for complex nested structures
  • I can implement custom repair strategies for domain-specific types
  • I can explain my design decisions (why 3 attempts? why decreasing temp?)
  • I understand the limitations of this approach
  • I can compare this to OpenAI’s function calling and explain tradeoffs

Next Steps:

  • I’ve integrated this library into another project
  • I’ve measured real-world success rates
  • I’ve optimized for cost or latency based on requirements
  • I’ve extended with at least 2 of the suggested extensions

12. Completion Criteria

Minimum Viable Completion

You can consider this project complete when you have:

1. Core Library (70% of effort)

  • LLMClient class with generate_json() method
  • Schema validation using Pydantic (Python) or Zod (TypeScript)
  • Repair loop with max 3 attempts
  • Temperature strategy (decreasing)
  • Custom exceptions (MaxRetriesExceeded)
  • Support for at least OpenAI provider

2. Testing (20% of effort)

  • Unit tests for validator (5+ tests)
  • Unit tests for prompt builder (3+ tests)
  • Integration tests for repair loop (5+ tests)
  • At least 2 end-to-end tests with real API
  • Test coverage >80%

3. Documentation (10% of effort)

  • README with installation instructions
  • At least 3 usage examples
  • Docstrings for all public methods
  • Type hints throughout

Validation Test:

Run this integration test—it should pass:

from pydantic import BaseModel
from your_library import LLMClient

class User(BaseModel):
    name: str
    age: int

    class Config:
        extra = "forbid"

client = LLMClient(model="gpt-3.5-turbo")

# Test 1: Should succeed (possibly with repair)
user = client.generate_json(
    "Extract: Alice, twenty-five years old",
    schema=User
)
assert user.age == 25
assert isinstance(user.age, int)

# Test 2: Should track metrics
result, metrics = client.generate_with_metrics(
    "Extract: Bob, 30",
    schema=User
)
assert metrics.success == True
assert metrics.total_tokens > 0

print("✓ All validation tests passed!")

Full Completion

Additional Requirements:

1. Advanced Features

  • Support for multiple providers (OpenAI + Anthropic or local)
  • Metrics tracking with cost calculation
  • Verbose mode for debugging
  • Caching layer for duplicate requests
  • Retry with exponential backoff for API errors

2. Comprehensive Testing

  • Parametric tests across multiple models
  • Performance benchmarks (requests/second, success rate)
  • Cost analysis ($/1000 requests)
  • Edge case tests (empty input, very long input, malformed prompts)

3. Production Readiness

  • Published to PyPI or npm
  • CI/CD pipeline (GitHub Actions)
  • Semantic versioning
  • Changelog
  • Contributing guidelines

Excellence (Going Above & Beyond)

Research & Analysis:

  • Benchmark report comparing success rates across models
  • Cost-benefit analysis of repair strategies
  • Case study of real-world application
  • Blog post explaining your learnings

Advanced Implementations:

  • Statistical validation (consensus over multiple samples)
  • Custom repair strategies per field type
  • Streaming support
  • Multi-provider fallback with automatic selection

Community Contribution:

  • Open-sourced on GitHub with 10+ stars
  • Presented at a meetup or conference
  • Tutorial or video walkthrough
  • Integration with popular frameworks (LangChain, LlamaIndex)

Appendix: Sample Code

Complete Working Example

# main.py - Complete working example
from pydantic import BaseModel, Field, EmailStr
from typing import Literal
from src.client import LLMClient

class User(BaseModel):
    name: str = Field(..., min_length=1, max_length=100)
    age: int = Field(..., ge=0, le=120)
    email: EmailStr
    subscription: Literal["free", "pro", "enterprise"]

    class Config:
        extra = "forbid"

def main():
    # Initialize client
    client = LLMClient(
        model="gpt-4",
        max_repair_attempts=3,
        verbose=True
    )

    # Test cases
    test_cases = [
        "Alice, 25, alice@example.com, pro plan",
        "Bob is thirty years old, email bob@example.com, wants free tier",
        "Carol, 28, carol@test.com, enterprise",
    ]

    for text in test_cases:
        print(f"\n{'='*60}")
        print(f"Processing: {text}")
        print(f"{'='*60}")

        try:
            user, metrics = client.generate_with_metrics(
                prompt=f"Extract user from: {text}",
                schema=User
            )

            print(f"\n✓ Success!")
            print(f"  Name: {user.name}")
            print(f"  Age: {user.age}")
            print(f"  Email: {user.email}")
            print(f"  Subscription: {user.subscription}")
            print(f"\nMetrics:")
            print(f"  Attempts: {metrics.attempts}")
            print(f"  Tokens: {metrics.total_tokens}")
            print(f"  Cost: ${metrics.total_cost:.4f}")
            print(f"  Latency: {metrics.total_latency_ms:.0f}ms")

        except Exception as e:
            print(f"\n✗ Failed: {e}")

if __name__ == "__main__":
    main()

Example Schema Library

# schemas.py - Reusable schemas
from pydantic import BaseModel, Field, EmailStr
from typing import Literal, Optional, List
from datetime import date

class Address(BaseModel):
    street: str
    city: str
    state: str = Field(..., regex=r'^[A-Z]{2}$')
    zipcode: str = Field(..., regex=r'^\d{5}$')

    class Config:
        extra = "forbid"

class Recipe(BaseModel):
    title: str = Field(..., min_length=3, max_length=200)
    ingredients: List[str] = Field(..., min_items=1, max_items=50)
    instructions: List[str] = Field(..., min_items=1)
    cooking_time_minutes: int = Field(..., ge=1, le=1440)
    difficulty: Literal["easy", "medium", "hard"]
    servings: int = Field(..., ge=1, le=100)
    cuisine: Optional[str] = None

    class Config:
        extra = "forbid"

class Invoice(BaseModel):
    invoice_number: str = Field(..., regex=r'^INV-\d+$')
    date: date
    vendor: str = Field(..., min_length=1)
    items: List[dict]  # Could be more structured
    total_usd: float = Field(..., gt=0)

    class Config:
        extra = "forbid"

Congratulations! You now have a production-ready library for type-safe LLM outputs. This is infrastructure that companies pay thousands for—you built it from scratch and understand every component.