Project 2: JSON Output Enforcer (Schema + Repair Loop)
Build a production-ready library that guarantees type-safe LLM outputs through automatic validation and self-correction
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1 week |
| Language | Python (Alternatives: TypeScript) |
| Prerequisites | Project 1 (Harness), deep knowledge of JSON, type systems |
| Key Topics | Schema Validation, Self-Correction, Error Handling, Type Safety |
| Knowledge Area | Structured Outputs / Reliability |
| Software/Tool | Pydantic / Zod |
| Main Book | “Designing Data-Intensive Applications” by Martin Kleppmann (Schemas) |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 5. The “Compliance & Workflow” Model |
1. Learning Objectives
By completing this project, you will:
- Master Type Systems for AI: Bridge the gap between probabilistic LLM outputs and deterministic, typed application code
- Build Self-Correcting Systems: Implement repair loops that let the model fix its own errors using validator feedback
- Handle Graceful Degradation: Design systems that fail safely with structured error responses
- Understand Schema Design: Learn to create strict JSON Schemas that prevent hallucinations and type errors
- Optimize Token Costs: Balance reliability (multiple repair attempts) against API costs
- Implement Production APIs: Create developer-friendly library interfaces with clear error handling
- Apply Retry Patterns: Use exponential backoff, temperature adjustments, and circuit breakers
2. Theoretical Foundation
2.1 Core Concepts
The Type Safety Gap
Modern applications are built with strong type systems (TypeScript, Python with type hints, Go, Rust). These systems catch errors at compile time, ensuring that a function expecting an integer never receives a string.
LLMs, however, generate untyped text. Even when instructed to return JSON, they can:
- Generate malformed JSON (syntax errors)
- Return correct JSON but wrong types ({"age": "25"} instead of {"age": 25})
- Hallucinate extra fields not in your schema
- Omit required fields
- Return values outside valid ranges
The Core Problem:
# Traditional API (Type-Safe)
def get_user(user_id: int) -> User:
# Returns User object, guaranteed by type system
pass
# LLM API (Type-Unsafe)
def llm_extract_user(text: str) -> ???:
response = llm.complete(f"Extract user from: {text}")
# response is just a string - could be anything!
data = json.loads(response) # Might raise JSONDecodeError
age = data["age"] # Might be missing, might be wrong type
# Your application crashes
pass
The Solution:
# Type-Safe LLM Wrapper
def llm_extract_user(text: str) -> Result[User, ValidationError]:
enforcer = JSONEnforcer(schema=UserSchema)
result = enforcer.generate(prompt=text)
# result is Either[User, Error] - type-safe!
return result
JSON Schema as a Contract
JSON Schema is the industry standard for defining JSON structure. It acts as a “contract” between your LLM and your application.
Example Schema:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"name": {
"type": "string",
"minLength": 1,
"maxLength": 100
},
"age": {
"type": "integer",
"minimum": 0,
"maximum": 120
},
"email": {
"type": "string",
"format": "email"
},
"subscription": {
"enum": ["free", "pro", "enterprise"]
}
},
"required": ["name", "age", "email", "subscription"],
"additionalProperties": false
}
Why additionalProperties: false is Critical:
Without it, the model can hallucinate fields:
{
"name": "Alice",
"age": 25,
"email": "alice@example.com",
"subscription": "pro",
"credit_card": "4532-1234-5678-9010", ← HALLUCINATED! Security risk!
"admin_access": true ← HALLUCINATED! Privilege escalation!
}
With additionalProperties: false, the validator rejects this output.
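To see the contract enforced outside the LLM pipeline, the sketch below runs a hallucinated output through the jsonschema package (an assumed extra dependency, not part of this project's requirements); any JSON Schema validator behaves the same way:
import jsonschema

user_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0, "maximum": 120},
    },
    "required": ["name", "age"],
    "additionalProperties": False,  # reject hallucinated fields
}

hallucinated = {"name": "Alice", "age": 25, "admin_access": True}

try:
    jsonschema.validate(instance=hallucinated, schema=user_schema)
except jsonschema.ValidationError as error:
    # The message names the unexpected property ('admin_access')
    print(f"Rejected: {error.message}")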
Self-Correction and Repair Loops
The Insight: LLMs can understand and fix their own errors when given clear feedback.
Traditional Approach (Fragile):
User Prompt → LLM → Parse JSON → ❌ Error → Crash
Repair Loop Approach (Robust):
User Prompt → LLM → Parse JSON
↓ (error)
Extract Error Message
↓
"Fix: age must be int, not string"
↓
LLM → Parse JSON → ✓ Valid!
Example Repair Prompt:
Your previous output was invalid. Error details:
Field: age
Expected: integer
Received: string ("twenty-five")
Path: $.age
Please fix ONLY the format, not the semantic content.
Return valid JSON matching the schema.
Key Principle: Feed the validator’s error message directly back to the model. The error message acts as debugging feedback.
Temperature and Repair Strategy
Temperature controls randomness:
- 0.0: Deterministic (always picks highest probability token)
- 0.3: Slightly creative
- 0.7: Balanced
- 1.0+: Very creative
Optimal Repair Strategy:
Attempt 1: temperature=0.3 (Initial generation - allow some flexibility)
Attempt 2: temperature=0.0 (Repair needs precision, not creativity)
Attempt 3: temperature=0.0 (Stay deterministic)
Why lower temperature for repairs?
At high temperature, the model might “creatively” misinterpret the repair instruction. At temp=0, it mechanically applies the fix.
Error Handling Patterns
Three Strategies:
- Raise Exception (Fail Fast)

  try:
      user = enforcer.generate(prompt)
  except MaxRetriesExceeded:
      # Propagate to caller
      raise

- Return Default (Fail Safe)

  user = enforcer.generate(prompt, default={"status": "unknown"})
  # Never crashes, always returns something

- Return Result Type (Functional Style)

  result = enforcer.generate(prompt)
  if result.is_ok():
      user = result.unwrap()
  else:
      error = result.error()
      log.warning(f"Extraction failed: {error}")
Best Practice: Use exceptions for unexpected failures, Result types for expected failures (like user input that can’t be parsed).
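A minimal sketch of the Result wrapper assumed by strategy 3 above (the names is_ok, unwrap, and error are illustrative; Section 4.3 defines the fuller GenerationResult used by the library):
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

@dataclass
class Result(Generic[T]):
    value: Optional[T] = None
    err: Optional[Exception] = None

    def is_ok(self) -> bool:
        return self.err is None

    def unwrap(self) -> T:
        # Raise the stored error, or return the value
        if self.err is not None:
            raise self.err
        return self.value

    def error(self) -> Optional[Exception]:
        return self.err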
2.2 Why This Matters
Production Relevance
Real-world LLM applications crash in production due to:
- Malformed JSON: Missing brackets, trailing commas, unescaped quotes
- Type Mismatches: "age": "25" when code expects int
- Missing Fields: Code accesses data["city"] but the field wasn't returned
- Hallucinated Fields: Model invents fields that cause logic errors
- Out-of-Range Values: age: -5 or age: 999 bypassing validation
Without Enforcement:
- Application crashes with KeyError, TypeError, or JSONDecodeError
- Database corruption from invalid data
- Security vulnerabilities from hallucinated fields
- Customer support tickets from broken features
With Enforcement:
- Guaranteed type-safe data or explicit error handling
- 99%+ reliability through automated repairs
- Clear error messages for the 1% that can’t be repaired
- Production-ready infrastructure
Real-World Applications
Companies using similar patterns:
| Company | Use Case | Pattern |
|---|---|---|
| Stripe | Invoice data extraction | Pydantic validation with retry |
| Notion | Natural language to structured data | Schema enforcement with fallback |
| GitHub Copilot | Code generation with type checking | Multi-attempt refinement |
| Shopify | Product catalog enrichment | Zod validation with repair |
Industry Tools:
- Instructor (Python): Pydantic-based LLM output validation
- Zod (TypeScript): Runtime type validation for LLM outputs
- OpenAI Function Calling: Built-in schema enforcement
- Anthropic Tool Use: Structured output validation
2.3 Historical Context
Evolution of Structured Outputs
2020-2021: The “Parse and Pray” Era
- No schema enforcement
- Manual string parsing with regex
- Frequent production crashes
- “Just ask nicely for JSON” approach
2022: The Schema Era
- JSON Schema adoption
- Manual validation loops
- Pydantic/Zod emergence
- Still lots of manual error handling
2023: The Self-Correction Era
- Repair loops standardized
- Models can fix their own errors
- Function calling APIs from OpenAI/Anthropic
- Automatic retry logic
2024+: The Type Safety Era
- LLMs integrated into strongly-typed systems
- Production-grade reliability (99%+)
- Cost-optimized repair strategies
- Statistical validation of repair effectiveness
This project teaches you the modern, production-proven approach.
2.4 Common Misconceptions
| Misconception | Reality |
|---|---|
| “Just use OpenAI’s JSON mode” | JSON mode ensures valid JSON, not valid schema |
| “Models rarely make errors” | At scale (1M+ requests), even 1% failure rate = 10K crashes |
| “Repair loops are expensive” | 25% token cost increase for 25% reliability improvement (worth it) |
| “TypeScript types validate LLM output” | Compile-time types don’t validate runtime data |
| “One validation attempt is enough” | Self-correction significantly improves success rate |
3. Project Specification
3.1 What You Will Build
A Python library (or TypeScript package) that:
- Accepts JSON Schemas (or Pydantic models / Zod schemas)
- Generates LLM outputs with automatic validation
- Implements repair loops that self-correct errors
- Provides clear error messages when repair fails
- Tracks metrics: attempts, token costs, success rates
- Offers configurable parameters: max attempts, temperature strategy, default values
Core Question This Tool Answers:
“How do I integrate a fuzzy AI component into a strict, typed software system?”
3.2 Functional Requirements
FR1: Schema Definition
Python (Pydantic):
from pydantic import BaseModel, EmailStr, Field
from enum import Enum
class Subscription(str, Enum):
FREE = "free"
PRO = "pro"
ENTERPRISE = "enterprise"
class User(BaseModel):
name: str = Field(..., min_length=1, max_length=100)
age: int = Field(..., ge=0, le=120)
email: EmailStr
subscription: Subscription
class Config:
extra = "forbid" # Prevent hallucinated fields
TypeScript (Zod):
import { z } from 'zod';
const UserSchema = z.object({
name: z.string().min(1).max(100),
age: z.number().int().min(0).max(120),
email: z.string().email(),
subscription: z.enum(['free', 'pro', 'enterprise'])
}).strict(); // Prevent hallucinated fields
FR2: Generation with Validation
Python Example:
from json_enforcer import LLMClient
client = LLMClient(model="gpt-4", max_repair_attempts=3)
result = client.generate_json(
prompt="Extract user: 'Alice, 25, alice@example.com, wants pro plan'",
schema=User,
temperature=0.3
)
# result: User object (Pydantic model) - fully typed and validated
print(result.name) # Type checker knows this is a string
print(result.age) # Type checker knows this is an int
FR3: Repair Loop Implementation
Algorithm:
def generate_with_repair(prompt, schema, max_attempts=3):
messages = [{"role": "user", "content": prompt}]
for attempt in range(1, max_attempts + 1):
# Adjust temperature: lower for repairs
temp = 0.3 if attempt == 1 else 0.0
# Generate response
response = llm.complete(messages, temperature=temp)
# Attempt validation
try:
validated_data = schema.parse(response)
log.info(f"✓ Validated on attempt {attempt}")
return validated_data
except ValidationError as error:
log.warning(f"✗ Attempt {attempt} failed: {error}")
if attempt == max_attempts:
raise MaxRetriesExceeded(
attempts=attempt,
last_error=error,
last_response=response
)
# Build repair prompt
repair_prompt = build_repair_prompt(error, response)
messages.append({
"role": "assistant",
"content": response
})
messages.append({
"role": "user",
"content": repair_prompt
})
Repair Prompt Builder:
def build_repair_prompt(validation_error, invalid_json):
errors = parse_validation_errors(validation_error)
prompt_parts = [
"Your previous JSON output had validation errors:\n"
]
for idx, error in enumerate(errors, 1):
prompt_parts.append(
f"{idx}. Field '{error.field}' {error.message}\n"
f" Expected: {error.expected}\n"
f" Received: {error.received}\n"
)
prompt_parts.append(
"\nPlease return ONLY valid JSON with these corrections. "
"Do not change the semantic content, only fix the format."
)
return "".join(prompt_parts)
FR4: Error Types
Custom Exceptions:
class JSONEnforcerError(Exception):
"""Base exception for JSON Enforcer"""
pass
class MaxRetriesExceeded(JSONEnforcerError):
"""Raised when repair loop exhausts all attempts"""
def __init__(self, attempts, last_error, last_response):
self.attempts = attempts
self.last_error = last_error
self.last_response = last_response
super().__init__(
f"Failed to generate valid JSON after {attempts} attempts. "
f"Last error: {last_error}"
)
class SchemaValidationError(JSONEnforcerError):
"""Raised when JSON is valid but doesn't match schema"""
def __init__(self, errors, json_data):
self.errors = errors
self.json_data = json_data
super().__init__(f"Schema validation failed: {errors}")
FR5: Metrics and Observability
Telemetry Data:
@dataclass
class GenerationMetrics:
success: bool
attempts: int
total_tokens: int
total_cost: float
total_latency_ms: int
temperature_used: List[float]
validation_errors: List[str]
result, metrics = client.generate_with_metrics(
prompt=prompt,
schema=User
)
print(f"Success: {metrics.success}")
print(f"Attempts: {metrics.attempts}")
print(f"Cost: ${metrics.total_cost:.4f}")
3.3 Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Success Rate | >99% with 3 repair attempts | Production-grade reliability |
| Latency | <2 seconds for 3 attempts | Acceptable for most use cases |
| Token Efficiency | <30% overhead for repairs | Cost-effective |
| Type Safety | Full static type checking | Integration with typed codebases |
| Error Messages | Actionable repair guidance | Models can understand and fix errors |
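One way to check these targets is a small benchmark over a batch of representative prompts, using the generate_with_metrics API from FR5 (a sketch; the prompt list and thresholds are placeholders):
from statistics import mean

def benchmark(client, prompts, schema):
    """Measure success rate, latency, and repair frequency over a batch of prompts."""
    collected = []
    for prompt in prompts:
        try:
            _, metrics = client.generate_with_metrics(prompt=prompt, schema=schema)
            collected.append(metrics)
        except Exception as exc:  # e.g. MaxRetriesExceeded
            print(f"Failed: {exc}")

    success_rate = sum(m.success for m in collected) / len(prompts)
    avg_latency_ms = mean(m.total_latency_ms for m in collected)
    repair_share = sum(m.attempts > 1 for m in collected) / len(collected)

    print(f"Success rate:   {success_rate:.1%}  (target: >99%)")
    print(f"Avg latency:    {avg_latency_ms:.0f} ms  (target: <2000 ms)")
    print(f"Needed repairs: {repair_share:.1%} of successful requests")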
3.4 Example Usage
Basic Usage:
from json_enforcer import LLMClient
from pydantic import BaseModel
from typing import Literal
class Recipe(BaseModel):
title: str
ingredients: list[str]
cooking_time_minutes: int
difficulty: Literal["easy", "medium", "hard"]
client = LLMClient(model="gpt-4")
text = """
This amazing pasta takes about an hour and serves 4-6 people.
You'll need pasta, tomatoes, garlic, and basil. It's pretty simple!
"""
recipe = client.generate_json(
prompt=f"Extract recipe from: {text}",
schema=Recipe
)
print(recipe.title) # Type-safe access
print(recipe.cooking_time_minutes) # Guaranteed to be int
Advanced Usage with Error Handling:
from json_enforcer import LLMClient, MaxRetriesExceeded
client = LLMClient(
model="gpt-4",
max_repair_attempts=3,
verbose=True # Show repair process
)
try:
recipe = client.generate_json(
prompt=f"Extract recipe from: {text}",
schema=Recipe,
temperature=0.3
)
print(f"✓ Extracted: {recipe.title}")
except MaxRetriesExceeded as e:
print(f"✗ Failed after {e.attempts} attempts")
print(f"Last error: {e.last_error}")
# Use default fallback
recipe = Recipe(
title="Unknown Recipe",
ingredients=[],
cooking_time_minutes=0,
difficulty="medium"
)
Console Output (Verbose Mode):
╔══════════════════════════════════════════════════════════════╗
║ JSON ENFORCER - Structured Output Pipeline ║
╚══════════════════════════════════════════════════════════════╝
[Attempt 1/3] Generating JSON response...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Prompt sent to model (142 tokens)
Response received (87 tokens, 234ms)
Raw output:
{
"title": "Pasta",
"ingredients": ["pasta", "tomatoes", "garlic", "basil"],
"cooking_time_minutes": "1 hour",
"difficulty": "simple"
}
Validating against schema...
✗ VALIDATION FAILED (2 errors)
Error details:
1. Field: cooking_time_minutes
Expected: integer
Received: string ("1 hour")
Path: $.cooking_time_minutes
2. Field: difficulty
Expected: one of ["easy", "medium", "hard"]
Received: "simple"
Path: $.difficulty
──────────────────────────────────────────────────────────────
[Attempt 2/3] Attempting self-repair...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Repair prompt:
Your previous JSON output had validation errors:
1. Field 'cooking_time_minutes' must be an integer (number of minutes)
Expected: integer
Received: string ("1 hour")
Hint: Convert "1 hour" to 60
2. Field 'difficulty' must be one of: "easy", "medium", "hard"
Expected: enum ["easy", "medium", "hard"]
Received: "simple"
Hint: Map "simple" to "easy"
Please return ONLY valid JSON with these corrections.
Do not change the semantic content, only fix the format.
Response received (71 tokens, 198ms)
Raw output:
{
"title": "Pasta",
"ingredients": ["pasta", "tomatoes", "garlic", "basil"],
"cooking_time_minutes": 60,
"difficulty": "easy"
}
Validating against schema...
✓ VALIDATION PASSED
All required fields present: ✓
No extra fields: ✓
Type constraints satisfied: ✓
──────────────────────────────────────────────────────────────
✓ SUCCESS after 2 attempts
Total time: 432ms
Total tokens: 300 (input: 213, output: 87)
Cost: $0.0045
Returning validated object:
Recipe(
title='Pasta',
ingredients=['pasta', 'tomatoes', 'garlic', 'basil'],
cooking_time_minutes=60,
difficulty='easy'
)
4. Solution Architecture
4.1 High-Level Design
┌─────────────────────────────────────────────────────────────┐
│ Application Code │
│ recipe = client.generate_json(prompt, schema=Recipe) │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ JSON Enforcer Client │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Schema │ │ Repair │ │ Metrics │ │
│ │ Validator │ │ Loop │ │ Tracker │ │
│ └─────────────┘ └──────────────┘ └──────────────┘ │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LLM Provider │
│ (OpenAI / Anthropic / Local) │
└─────────────────────────────────────────────────────────────┘
Component Responsibilities:
| Component | Responsibility |
|---|---|
| LLMClient | Main entry point, orchestrates generation |
| SchemaValidator | Validates JSON against schema, extracts errors |
| RepairLoop | Manages retry logic, temperature strategy |
| PromptBuilder | Constructs repair prompts from validation errors |
| MetricsTracker | Collects telemetry data |
| ErrorFormatter | Converts validation errors to repair instructions |
4.2 Key Components
Component 1: Schema Validator
Responsibilities:
- Parse JSON from string
- Validate against schema
- Extract detailed error information
- Format errors for repair prompts
Interface:
class SchemaValidator:
def __init__(self, schema: Type[BaseModel]):
self.schema = schema
def validate(self, json_string: str) -> Result[BaseModel, ValidationError]:
"""Validate JSON string against schema"""
pass
def extract_errors(self, error: ValidationError) -> List[ErrorDetail]:
"""Parse validation error into structured details"""
pass
Component 2: Repair Loop Engine
Responsibilities:
- Manage retry attempts
- Adjust temperature per attempt
- Track metrics
- Handle max retries exceeded
Interface:
class RepairLoop:
def __init__(
self,
llm_client: LLMProvider,
max_attempts: int = 3,
temperature_strategy: str = "decreasing"
):
self.llm_client = llm_client
self.max_attempts = max_attempts
self.temperature_strategy = temperature_strategy
def execute(
self,
prompt: str,
validator: SchemaValidator
) -> Result[BaseModel, MaxRetriesExceeded]:
"""Execute generation with repair loop"""
pass
Component 3: Prompt Builder
Responsibilities:
- Format validation errors as repair instructions
- Provide examples of correct format
- Keep prompts concise to save tokens
Interface:
class RepairPromptBuilder:
def build(
self,
validation_errors: List[ErrorDetail],
invalid_output: str
) -> str:
"""Build repair prompt from validation errors"""
pass
def add_examples(self, field_name: str, expected_type: str) -> str:
"""Add format examples for common error types"""
pass
4.3 Data Structures
ErrorDetail
@dataclass
class ErrorDetail:
field_path: str # JSON path (e.g., "$.user.age")
field_name: str # Field name (e.g., "age")
expected_type: str # Expected type (e.g., "integer")
received_value: Any # Actual value received
error_message: str # Human-readable error
suggestion: Optional[str] # Repair suggestion
GenerationResult
@dataclass
class GenerationResult:
success: bool
data: Optional[BaseModel]
error: Optional[Exception]
metrics: GenerationMetrics
def unwrap(self) -> BaseModel:
"""Get data or raise error"""
if self.success:
return self.data
raise self.error
def unwrap_or(self, default: BaseModel) -> BaseModel:
"""Get data or return default"""
return self.data if self.success else default
GenerationMetrics
@dataclass
class GenerationMetrics:
attempts: int
success: bool
total_tokens: int
input_tokens: int
output_tokens: int
total_cost: float
total_latency_ms: int
temperatures_used: List[float]
validation_errors: List[List[ErrorDetail]]
repair_success_on_attempt: Optional[int]
4.4 Algorithm Overview
Main Generation Algorithm
def generate_json(prompt: str, schema: Type[BaseModel]) -> BaseModel:
"""
Generate and validate JSON with automatic repair.
Algorithm:
1. Initialize conversation with user prompt
2. For each attempt (1 to max_attempts):
a. Determine temperature (decreasing strategy)
b. Generate LLM response
c. Parse as JSON
d. Validate against schema
e. If valid: return validated data
f. If invalid: build repair prompt
3. If all attempts fail: raise MaxRetriesExceeded
"""
messages = [{"role": "user", "content": prompt}]
metrics = GenerationMetrics()
for attempt in range(1, max_attempts + 1):
# Step 1: Determine temperature
temperature = calculate_temperature(attempt, strategy="decreasing")
metrics.temperatures_used.append(temperature)
# Step 2: Generate response
start_time = time.time()
response = llm_client.complete(messages, temperature=temperature)
latency = (time.time() - start_time) * 1000
metrics.total_latency_ms += latency
# Step 3: Update metrics
metrics.attempts = attempt
metrics.input_tokens += response.usage.prompt_tokens
metrics.output_tokens += response.usage.completion_tokens
metrics.total_tokens += response.usage.total_tokens
# Step 4: Parse JSON
try:
json_data = json.loads(response.content)
except JSONDecodeError as e:
error = ErrorDetail(
field_path="$",
field_name="root",
expected_type="valid JSON",
received_value=response.content,
error_message=f"JSON syntax error: {e}",
suggestion="Check for missing brackets, quotes, or commas"
)
metrics.validation_errors.append([error])
if attempt == max_attempts:
raise MaxRetriesExceeded(metrics)
repair_prompt = build_json_syntax_repair_prompt(e, response.content)
messages.extend([
{"role": "assistant", "content": response.content},
{"role": "user", "content": repair_prompt}
])
continue
# Step 5: Validate schema
try:
validated = schema.parse_obj(json_data)
metrics.success = True
metrics.repair_success_on_attempt = attempt
return validated
except ValidationError as e:
errors = extract_validation_errors(e)
metrics.validation_errors.append(errors)
if attempt == max_attempts:
raise MaxRetriesExceeded(metrics)
# Step 6: Build repair prompt
repair_prompt = build_schema_repair_prompt(errors, json_data)
messages.extend([
{"role": "assistant", "content": response.content},
{"role": "user", "content": repair_prompt}
])
# Should never reach here
raise MaxRetriesExceeded(metrics)
Temperature Strategy
def calculate_temperature(attempt: int, strategy: str) -> float:
"""
Calculate temperature for each attempt.
Strategies:
- "decreasing": Start at 0.3, decrease to 0.0 for repairs
- "constant_zero": Always use 0.0 (maximum determinism)
- "constant_low": Always use 0.2 (slight creativity)
"""
if strategy == "decreasing":
return 0.3 if attempt == 1 else 0.0
elif strategy == "constant_zero":
return 0.0
elif strategy == "constant_low":
return 0.2
else:
raise ValueError(f"Unknown strategy: {strategy}")
Error Extraction Algorithm
def extract_validation_errors(validation_error: ValidationError) -> List[ErrorDetail]:
"""
Extract structured error details from Pydantic ValidationError.
Pydantic errors have format:
[
{
"loc": ("field", "nested_field"),
"msg": "field required",
"type": "value_error.missing"
}
]
Convert to ErrorDetail objects with repair suggestions.
"""
details = []
for error in validation_error.errors():
field_path = "$.{}".format(".".join(str(loc) for loc in error["loc"]))
field_name = error["loc"][-1] if error["loc"] else "root"
# Determine expected type and suggestion
error_type = error["type"]
if error_type == "value_error.missing":
suggestion = f"Add required field '{field_name}'"
elif error_type == "type_error.integer":
suggestion = f"Convert to integer number (e.g., 25, not '25')"
elif error_type == "type_error.string":
suggestion = f"Wrap in quotes to make string"
elif error_type.startswith("value_error.const"):
suggestion = f"Use one of the allowed enum values"
else:
suggestion = None
detail = ErrorDetail(
field_path=field_path,
field_name=field_name,
expected_type=infer_expected_type(error),
received_value=error.get("ctx", {}).get("given"),
error_message=error["msg"],
suggestion=suggestion
)
details.append(detail)
return details
Repair Prompt Builder Algorithm
def build_schema_repair_prompt(errors: List[ErrorDetail], invalid_json: dict) -> str:
"""
Build repair prompt from validation errors.
Strategy:
1. Start with clear instruction
2. List each error with:
- Field path
- Expected vs received
- Concrete example
3. Remind to preserve semantic content
4. Request only valid JSON (no explanations)
"""
lines = [
"Your previous JSON output had validation errors:\n"
]
for idx, error in enumerate(errors, 1):
lines.append(f"\n{idx}. Field '{error.field_name}' error:")
lines.append(f" Path: {error.field_path}")
lines.append(f" Problem: {error.error_message}")
lines.append(f" Expected: {error.expected_type}")
if error.received_value is not None:
lines.append(f" Received: {json.dumps(error.received_value)}")
if error.suggestion:
lines.append(f" Fix: {error.suggestion}")
# Add concrete example
example = generate_example(error)
if example:
lines.append(f" Example: {example}")
lines.append("\n\nIMPORTANT:")
lines.append("- Fix ONLY the format/type errors above")
lines.append("- Do NOT change the semantic content or meaning")
lines.append("- Return ONLY valid JSON, no explanations")
lines.append("- Do NOT add extra fields")
return "\n".join(lines)
5. Implementation Guide
5.1 Development Environment Setup
Prerequisites:
- Python 3.10+ or Node.js 18+
- API key for OpenAI or Anthropic
- Virtual environment tool (venv, conda)
Python Setup:
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install "pydantic[email]" openai anthropic python-dotenv
# For development
pip install pytest black mypy ruff
TypeScript Setup:
# Initialize project
npm init -y
npm install zod openai @anthropic-ai/sdk dotenv
# For development
npm install -D typescript @types/node ts-node jest @types/jest
npx tsc --init
Environment Variables (.env):
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
5.2 Project Structure
json-enforcer/
├── src/
│ ├── __init__.py
│ ├── client.py # Main LLMClient class
│ ├── validator.py # Schema validation logic
│ ├── repair_loop.py # Repair attempt orchestration
│ ├── prompt_builder.py # Repair prompt construction
│ ├── errors.py # Custom exception classes
│ ├── metrics.py # Metrics tracking
│ └── providers/
│ ├── __init__.py
│ ├── base.py # Abstract LLM provider
│ ├── openai.py # OpenAI implementation
│ └── anthropic.py # Anthropic implementation
├── tests/
│ ├── test_validator.py
│ ├── test_repair_loop.py
│ ├── test_client.py
│ └── fixtures/
│ ├── schemas.py # Test schemas
│ └── mock_responses.py
├── examples/
│ ├── basic_usage.py
│ ├── error_handling.py
│ ├── metrics_tracking.py
│ └── advanced_schemas.py
├── .env.example
├── pyproject.toml
├── README.md
└── requirements.txt
5.3 Implementation Phases
Phase 1: Foundation (Days 1-2)
Checkpoint 1.1: Set up base provider interface
# src/providers/base.py
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Dict
@dataclass
class Message:
role: str # "system", "user", "assistant"
content: str
@dataclass
class CompletionResponse:
content: str
model: str
usage: 'TokenUsage'
latency_ms: float
@dataclass
class TokenUsage:
prompt_tokens: int
completion_tokens: int
total_tokens: int
class LLMProvider(ABC):
"""Abstract base class for LLM providers"""
@abstractmethod
def complete(
self,
messages: List[Message],
temperature: float = 0.3,
max_tokens: int = 2000
) -> CompletionResponse:
"""Generate completion from messages"""
pass
Checkpoint 1.2: Implement OpenAI provider
# src/providers/openai.py
import time
from typing import List
from openai import OpenAI
from .base import LLMProvider, Message, CompletionResponse, TokenUsage
class OpenAIProvider(LLMProvider):
def __init__(self, api_key: str, model: str = "gpt-4"):
self.client = OpenAI(api_key=api_key)
self.model = model
def complete(
self,
messages: List[Message],
temperature: float = 0.3,
max_tokens: int = 2000
) -> CompletionResponse:
start_time = time.time()
# Convert messages to OpenAI format
openai_messages = [
{"role": msg.role, "content": msg.content}
for msg in messages
]
# Call API
response = self.client.chat.completions.create(
model=self.model,
messages=openai_messages,
temperature=temperature,
max_tokens=max_tokens
)
latency_ms = (time.time() - start_time) * 1000
return CompletionResponse(
content=response.choices[0].message.content,
model=response.model,
usage=TokenUsage(
prompt_tokens=response.usage.prompt_tokens,
completion_tokens=response.usage.completion_tokens,
total_tokens=response.usage.total_tokens
),
latency_ms=latency_ms
)
Checkpoint 1.3: Build schema validator
# src/validator.py
import json
from typing import Any, List, Optional, Type
from pydantic import BaseModel, ValidationError
from dataclasses import dataclass
@dataclass
class ErrorDetail:
field_path: str
field_name: str
expected_type: str
    received_value: Any
    error_message: str
    suggestion: Optional[str] = None
class SchemaValidator:
def __init__(self, schema: Type[BaseModel]):
self.schema = schema
def validate(self, json_string: str) -> BaseModel:
"""
Validate JSON string against schema.
Raises:
json.JSONDecodeError: If not valid JSON
ValidationError: If valid JSON but doesn't match schema
"""
# First, parse as JSON
data = json.loads(json_string)
# Then validate against schema
return self.schema.parse_obj(data)
def extract_errors(self, error: ValidationError) -> List[ErrorDetail]:
"""Extract structured error details from ValidationError"""
details = []
for err in error.errors():
# Build JSON path
field_path = "$." + ".".join(str(loc) for loc in err["loc"])
field_name = err["loc"][-1] if err["loc"] else "root"
# Determine expected type
expected_type = self._infer_expected_type(err)
# Get received value
received_value = err.get("input")
# Generate suggestion
suggestion = self._generate_suggestion(err, field_name)
detail = ErrorDetail(
field_path=field_path,
field_name=field_name,
expected_type=expected_type,
received_value=received_value,
error_message=err["msg"],
suggestion=suggestion
)
details.append(detail)
return details
def _infer_expected_type(self, error: dict) -> str:
"""Infer expected type from error"""
error_type = error["type"]
type_map = {
"int_parsing": "integer",
"string_type": "string",
"float_parsing": "number",
"bool_parsing": "boolean",
"list_type": "array",
"dict_type": "object",
"missing": "required field",
}
for key, value in type_map.items():
if key in error_type:
return value
return error_type
def _generate_suggestion(self, error: dict, field_name: str) -> str:
"""Generate repair suggestion"""
error_type = error["type"]
if "missing" in error_type:
return f"Add required field '{field_name}'"
elif "int_parsing" in error_type:
return "Convert to integer (e.g., 25, not '25')"
elif "string_type" in error_type:
return "Value must be a string in quotes"
elif "enum" in error_type:
expected_values = error.get("ctx", {}).get("expected")
if expected_values:
return f"Must be one of: {expected_values}"
return "Fix the validation error above"
Test validator:
# tests/test_validator.py
import pytest
from pydantic import BaseModel, ValidationError
from src.validator import SchemaValidator, ErrorDetail
class SimpleUser(BaseModel):
name: str
age: int
def test_valid_json():
validator = SchemaValidator(SimpleUser)
result = validator.validate('{"name": "Alice", "age": 25}')
assert result.name == "Alice"
assert result.age == 25
def test_type_error():
validator = SchemaValidator(SimpleUser)
with pytest.raises(ValidationError) as exc:
validator.validate('{"name": "Alice", "age": "twenty-five"}')
errors = validator.extract_errors(exc.value)
assert len(errors) == 1
assert errors[0].field_name == "age"
assert "integer" in errors[0].expected_type
def test_missing_field():
validator = SchemaValidator(SimpleUser)
with pytest.raises(ValidationError) as exc:
validator.validate('{"name": "Alice"}')
errors = validator.extract_errors(exc.value)
assert any(e.field_name == "age" for e in errors)
Phase 2: Repair Loop (Days 3-4)
Checkpoint 2.1: Build prompt builder
# src/prompt_builder.py
import json
from typing import List
from .validator import ErrorDetail
class RepairPromptBuilder:
def build(self, errors: List[ErrorDetail]) -> str:
"""Build repair prompt from validation errors"""
lines = [
"Your previous JSON output had validation errors:\n"
]
for idx, error in enumerate(errors, 1):
lines.append(f"\n{idx}. Field '{error.field_name}':")
lines.append(f" Path: {error.field_path}")
lines.append(f" Expected: {error.expected_type}")
if error.received_value is not None:
lines.append(f" Received: {json.dumps(error.received_value)}")
lines.append(f" Problem: {error.error_message}")
if error.suggestion:
lines.append(f" Fix: {error.suggestion}")
# Add example
example = self._generate_example(error)
if example:
lines.append(f" Example: {example}")
lines.extend([
"\n\nPlease return ONLY valid JSON with these corrections.",
"Do not change the semantic content, only fix the format.",
"Do not include explanations or markdown formatting."
])
return "\n".join(lines)
def _generate_example(self, error: ErrorDetail) -> str:
"""Generate concrete example for error type"""
field = error.field_name
if error.expected_type == "integer":
return f'"{field}": 25'
elif error.expected_type == "string":
return f'"{field}": "value"'
elif error.expected_type == "boolean":
return f'"{field}": true'
elif error.expected_type == "array":
return f'"{field}": ["item1", "item2"]'
elif "enum" in error.error_message:
# Try to extract enum values from error message
return None
return None
Checkpoint 2.2: Implement repair loop
# src/repair_loop.py
import json
from typing import Type, Optional
from pydantic import BaseModel, ValidationError
from .providers.base import LLMProvider, Message
from .validator import SchemaValidator
from .prompt_builder import RepairPromptBuilder
from .errors import MaxRetriesExceeded
from .metrics import GenerationMetrics
class RepairLoop:
def __init__(
self,
provider: LLMProvider,
max_attempts: int = 3,
temperature_strategy: str = "decreasing",
verbose: bool = False
):
self.provider = provider
self.max_attempts = max_attempts
self.temperature_strategy = temperature_strategy
self.verbose = verbose
self.prompt_builder = RepairPromptBuilder()
def execute(
self,
initial_prompt: str,
schema: Type[BaseModel]
) -> tuple[BaseModel, GenerationMetrics]:
"""Execute generation with repair loop"""
validator = SchemaValidator(schema)
metrics = GenerationMetrics()
messages = [Message(role="user", content=initial_prompt)]
for attempt in range(1, self.max_attempts + 1):
if self.verbose:
print(f"\n[Attempt {attempt}/{self.max_attempts}] Generating...")
# Calculate temperature
temperature = self._get_temperature(attempt)
metrics.temperatures_used.append(temperature)
# Generate response
response = self.provider.complete(
messages=messages,
temperature=temperature
)
# Update metrics
metrics.attempts = attempt
metrics.total_tokens += response.usage.total_tokens
metrics.input_tokens += response.usage.prompt_tokens
metrics.output_tokens += response.usage.completion_tokens
metrics.total_latency_ms += response.latency_ms
if self.verbose:
print(f"Response received ({response.usage.total_tokens} tokens, "
f"{response.latency_ms:.0f}ms)")
print(f"\nRaw output:\n{response.content}\n")
# Try to validate
try:
# First try JSON parsing
try:
json_data = json.loads(response.content)
except json.JSONDecodeError as e:
if self.verbose:
print(f"✗ JSON SYNTAX ERROR: {e}")
if attempt == self.max_attempts:
metrics.success = False
raise MaxRetriesExceeded(
attempts=attempt,
last_error=str(e),
last_response=response.content,
metrics=metrics
)
# Build JSON syntax repair prompt
repair_msg = (
f"Your previous output was not valid JSON. "
f"Error: {e}\n\n"
f"Please return valid JSON only, with no markdown or explanations."
)
messages.extend([
Message(role="assistant", content=response.content),
Message(role="user", content=repair_msg)
])
continue
# Then validate schema
validated = validator.validate(response.content)
# Success!
if self.verbose:
print("✓ VALIDATION PASSED")
metrics.success = True
metrics.repair_success_on_attempt = attempt
return validated, metrics
except ValidationError as e:
errors = validator.extract_errors(e)
metrics.validation_errors.append(errors)
if self.verbose:
print(f"✗ VALIDATION FAILED ({len(errors)} errors)")
for err in errors:
print(f" • {err.field_name}: {err.error_message}")
if attempt == self.max_attempts:
metrics.success = False
raise MaxRetriesExceeded(
attempts=attempt,
last_error=str(e),
last_response=response.content,
metrics=metrics
)
# Build repair prompt
repair_prompt = self.prompt_builder.build(errors)
messages.extend([
Message(role="assistant", content=response.content),
Message(role="user", content=repair_prompt)
])
# Should never reach here
metrics.success = False
raise MaxRetriesExceeded(
attempts=self.max_attempts,
last_error="Unknown error",
last_response="",
metrics=metrics
)
def _get_temperature(self, attempt: int) -> float:
"""Calculate temperature for attempt"""
if self.temperature_strategy == "decreasing":
return 0.3 if attempt == 1 else 0.0
elif self.temperature_strategy == "constant_zero":
return 0.0
elif self.temperature_strategy == "constant_low":
return 0.2
else:
return 0.3
Checkpoint 2.3: Create custom errors
# src/errors.py
from typing import Optional
from .metrics import GenerationMetrics
class JSONEnforcerError(Exception):
"""Base exception for JSON Enforcer"""
pass
class MaxRetriesExceeded(JSONEnforcerError):
"""Raised when repair loop exhausts all attempts"""
def __init__(
self,
attempts: int,
last_error: str,
last_response: str,
metrics: GenerationMetrics
):
self.attempts = attempts
self.last_error = last_error
self.last_response = last_response
self.metrics = metrics
super().__init__(
f"Failed to generate valid JSON after {attempts} attempts. "
f"Last error: {last_error}"
)
Checkpoint 2.4: Add metrics tracking
# src/metrics.py
from dataclasses import dataclass, field
from typing import List, Optional
@dataclass
class GenerationMetrics:
attempts: int = 0
success: bool = False
total_tokens: int = 0
input_tokens: int = 0
output_tokens: int = 0
total_cost: float = 0.0
total_latency_ms: float = 0.0
temperatures_used: List[float] = field(default_factory=list)
validation_errors: List[List] = field(default_factory=list)
repair_success_on_attempt: Optional[int] = None
def calculate_cost(self, model: str):
"""Calculate cost based on model pricing"""
# OpenAI GPT-4 pricing (as of 2024)
pricing = {
"gpt-4": {"input": 0.03, "output": 0.06}, # per 1K tokens
"gpt-3.5-turbo": {"input": 0.001, "output": 0.002},
}
if model in pricing:
input_cost = (self.input_tokens / 1000) * pricing[model]["input"]
output_cost = (self.output_tokens / 1000) * pricing[model]["output"]
self.total_cost = input_cost + output_cost
return self.total_cost
Phase 3: Client API (Days 5-7)
Checkpoint 3.1: Build main client
# src/client.py
import os
from typing import Type, Optional
from pydantic import BaseModel
from dotenv import load_dotenv
from .providers.openai import OpenAIProvider
from .repair_loop import RepairLoop
from .metrics import GenerationMetrics
from .errors import MaxRetriesExceeded
load_dotenv()
class LLMClient:
"""
Main client for generating type-safe JSON from LLMs.
Example:
client = LLMClient(model="gpt-4")
user = client.generate_json(
prompt="Extract user: Alice, 25, alice@example.com",
schema=User
)
"""
def __init__(
self,
model: str = "gpt-4",
api_key: Optional[str] = None,
max_repair_attempts: int = 3,
temperature_strategy: str = "decreasing",
verbose: bool = False
):
"""
Initialize LLM client.
Args:
model: Model name (e.g., "gpt-4", "gpt-3.5-turbo")
api_key: API key (defaults to OPENAI_API_KEY env var)
max_repair_attempts: Maximum repair attempts (default: 3)
temperature_strategy: Temperature strategy (default: "decreasing")
verbose: Print detailed logs (default: False)
"""
self.model = model
self.max_repair_attempts = max_repair_attempts
self.verbose = verbose
# Initialize provider
api_key = api_key or os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("API key required (set OPENAI_API_KEY or pass api_key)")
self.provider = OpenAIProvider(api_key=api_key, model=model)
# Initialize repair loop
self.repair_loop = RepairLoop(
provider=self.provider,
max_attempts=max_repair_attempts,
temperature_strategy=temperature_strategy,
verbose=verbose
)
def generate_json(
self,
prompt: str,
schema: Type[BaseModel],
temperature: Optional[float] = None
) -> BaseModel:
"""
Generate and validate JSON output.
Args:
prompt: The prompt to send to the LLM
schema: Pydantic model class defining the expected schema
temperature: Override default temperature (optional)
Returns:
Validated Pydantic model instance
Raises:
MaxRetriesExceeded: If repair loop fails after max attempts
"""
# Add schema to prompt
full_prompt = self._build_prompt_with_schema(prompt, schema)
# Execute with repair loop
result, metrics = self.repair_loop.execute(full_prompt, schema)
# Calculate cost
metrics.calculate_cost(self.model)
if self.verbose:
self._print_summary(metrics)
return result
def generate_with_metrics(
self,
prompt: str,
schema: Type[BaseModel]
) -> tuple[BaseModel, GenerationMetrics]:
"""
Generate JSON and return metrics.
Returns:
Tuple of (validated data, metrics)
"""
full_prompt = self._build_prompt_with_schema(prompt, schema)
result, metrics = self.repair_loop.execute(full_prompt, schema)
metrics.calculate_cost(self.model)
return result, metrics
def _build_prompt_with_schema(
self,
user_prompt: str,
schema: Type[BaseModel]
) -> str:
"""Build prompt with schema instructions"""
schema_json = schema.schema_json(indent=2)
return f"""{user_prompt}
Return ONLY valid JSON matching this exact schema:
{schema_json}
Important:
- Return ONLY JSON, no markdown or explanations
- All required fields must be present
- Types must match exactly (integer, not string)
- No extra fields beyond the schema
"""
def _print_summary(self, metrics: GenerationMetrics):
"""Print metrics summary"""
print(f"\n{'─' * 60}")
print(f"✓ SUCCESS after {metrics.attempts} attempt(s)")
print(f"Total time: {metrics.total_latency_ms:.0f}ms")
print(f"Total tokens: {metrics.total_tokens} "
f"(input: {metrics.input_tokens}, output: {metrics.output_tokens})")
print(f"Cost: ${metrics.total_cost:.4f}")
print(f"{'─' * 60}\n")
Checkpoint 3.2: Create examples
# examples/basic_usage.py
from pydantic import BaseModel, EmailStr, Field
from typing import Literal
from src.client import LLMClient
class User(BaseModel):
name: str = Field(..., min_length=1, max_length=100)
age: int = Field(..., ge=0, le=120)
email: EmailStr
subscription: Literal["free", "pro", "enterprise"]
class Config:
extra = "forbid"
def main():
client = LLMClient(model="gpt-4", verbose=True)
text = "My name is Alice, I'm twenty-five years old, email alice@example.com, I want the pro plan"
try:
user = client.generate_json(
prompt=f"Extract user information from: {text}",
schema=User
)
print(f"Extracted user: {user}")
print(f"Name: {user.name} (type: {type(user.name).__name__})")
print(f"Age: {user.age} (type: {type(user.age).__name__})")
except Exception as e:
print(f"Error: {e}")
if __name__ == "__main__":
main()
Checkpoint 3.3: Add comprehensive tests
# tests/test_client.py
import pytest
from pydantic import BaseModel
from src.client import LLMClient
from src.errors import MaxRetriesExceeded
class SimpleUser(BaseModel):
name: str
age: int
@pytest.fixture
def client():
return LLMClient(model="gpt-3.5-turbo", max_repair_attempts=3)
def test_successful_generation(client):
"""Test successful JSON generation"""
result = client.generate_json(
prompt="Extract: Alice, 25 years old",
schema=SimpleUser
)
assert isinstance(result, SimpleUser)
assert result.name == "Alice"
assert result.age == 25
assert isinstance(result.age, int) # Not string!
def test_repair_type_error(client):
"""Test that type errors get repaired"""
# This might generate age as string first, then repair
result = client.generate_json(
prompt="Extract: Bob, age twenty-five",
schema=SimpleUser
)
assert isinstance(result.age, int)
def test_max_retries_exceeded(client):
"""Test that MaxRetriesExceeded is raised"""
# Use a very strict schema that's hard to satisfy
class StrictUser(BaseModel):
name: str
age: int
# Add many complex constraints
with pytest.raises(MaxRetriesExceeded) as exc:
# Give it a prompt that will likely fail
client.generate_json(
prompt="Extract user from gibberish: asdf qwer zxcv",
schema=StrictUser
)
assert exc.value.attempts == 3
5.4 Key Implementation Decisions
Decision 1: When to lower temperature?
- Choice: Start at 0.3, drop to 0.0 for repairs
- Rationale: Initial generation benefits from slight creativity to understand intent. Repairs need precision.
Decision 2: How many repair attempts?
- Choice: Default to 3 attempts
- Rationale: In practice, three attempts typically push the success rate above 99%; additional attempts show diminishing returns.
Decision 3: Include schema in prompt or use function calling?
- Choice: Include schema in prompt for simplicity, offer function calling as advanced option
- Rationale: Schema in prompt works across all models. Function calling requires provider-specific code.
Decision 4: Raise exception vs return Result type?
- Choice: Raise exceptions (the Python standard); provide try_generate_json for the Result style (sketched below)
- Rationale: Python developers expect exceptions. A Result type remains available for functional-style error handling.
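A sketch of that try_generate_json variant as an LLMClient method, built on the GenerationResult type from Section 4.3 (the body is an assumption, not the final implementation):
def try_generate_json(
    self,
    prompt: str,
    schema: Type[BaseModel],
) -> GenerationResult:
    """Like generate_json, but returns a Result instead of raising."""
    try:
        data, metrics = self.generate_with_metrics(prompt, schema)
        return GenerationResult(success=True, data=data, error=None, metrics=metrics)
    except MaxRetriesExceeded as exc:
        return GenerationResult(success=False, data=None, error=exc, metrics=exc.metrics)
Callers then choose how to degrade: result.unwrap() to re-raise, or result.unwrap_or(fallback) to fail safe.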
6. Testing Strategy
6.1 Test Categories
- Unit Tests: Test individual components (validator, prompt builder)
- Integration Tests: Test full generation pipeline with mocked LLM
- End-to-End Tests: Test with real LLM API (mark as slow/expensive; marker setup sketched after this list)
- Repair Tests: Test repair loop with intentionally broken outputs
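The end-to-end category relies on pytest markers; a minimal configuration for registering and filtering them (marker names follow the examples in 6.2):
# pytest.ini (or the [tool.pytest.ini_options] table in pyproject.toml)
[pytest]
markers =
    slow: tests that take several seconds to run
    real_api: tests that call a real LLM API and incur cost

# Fast, offline suite (suitable for CI):
#   pytest -m "not real_api"
# Full suite, including paid API calls:
#   pytest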
6.2 Critical Test Cases
Unit Tests
Test: Schema Validator
def test_validator_extracts_type_errors():
"""Validator correctly identifies type mismatches"""
validator = SchemaValidator(User)
invalid_json = '{"name": "Alice", "age": "25", "email": "alice@example.com", "subscription": "pro"}'
with pytest.raises(ValidationError) as exc:
validator.validate(invalid_json)
errors = validator.extract_errors(exc.value)
assert len(errors) == 1
assert errors[0].field_name == "age"
assert errors[0].expected_type == "integer"
assert errors[0].received_value == "25"
def test_validator_detects_hallucinated_fields():
"""Validator rejects extra fields when extra='forbid'"""
validator = SchemaValidator(User)
invalid_json = '{"name": "Alice", "age": 25, "email": "alice@example.com", "subscription": "pro", "admin": true}'
with pytest.raises(ValidationError) as exc:
validator.validate(invalid_json)
errors = validator.extract_errors(exc.value)
assert any("extra" in e.error_message.lower() for e in errors)
Test: Prompt Builder
def test_prompt_builder_formats_errors():
"""Prompt builder creates clear repair instructions"""
builder = RepairPromptBuilder()
errors = [
ErrorDetail(
field_path="$.age",
field_name="age",
expected_type="integer",
received_value="25",
error_message="Input should be a valid integer",
suggestion="Convert to integer (e.g., 25, not '25')"
)
]
prompt = builder.build(errors)
assert "age" in prompt
assert "integer" in prompt
assert "25" in prompt
assert "Do not change the semantic content" in prompt
Integration Tests
Test: Successful Repair
def test_repair_loop_fixes_type_error(mocker):
"""Repair loop successfully fixes a type error"""
# Mock LLM to return wrong type first, then correct type
mock_provider = mocker.Mock()
mock_provider.complete.side_effect = [
# Attempt 1: Wrong type
CompletionResponse(
content='{"name": "Alice", "age": "25"}',
model="gpt-4",
usage=TokenUsage(10, 10, 20),
latency_ms=100
),
# Attempt 2: Correct type
CompletionResponse(
content='{"name": "Alice", "age": 25}',
model="gpt-4",
usage=TokenUsage(15, 10, 25),
latency_ms=120
)
]
loop = RepairLoop(provider=mock_provider, max_attempts=3)
result, metrics = loop.execute(
initial_prompt="Extract: Alice, 25",
schema=SimpleUser
)
assert result.age == 25
assert isinstance(result.age, int)
assert metrics.attempts == 2
assert metrics.success == True
Test: Max Retries Exceeded
def test_repair_loop_fails_after_max_attempts(mocker):
"""Repair loop raises exception after max attempts"""
# Mock LLM to always return invalid output
mock_provider = mocker.Mock()
mock_provider.complete.return_value = CompletionResponse(
content='{"invalid": "json"}',
model="gpt-4",
usage=TokenUsage(10, 10, 20),
latency_ms=100
)
loop = RepairLoop(provider=mock_provider, max_attempts=3)
with pytest.raises(MaxRetriesExceeded) as exc:
loop.execute(
initial_prompt="Extract user",
schema=SimpleUser
)
assert exc.value.attempts == 3
assert exc.value.metrics.success == False
End-to-End Tests (Real API)
@pytest.mark.slow
@pytest.mark.real_api
def test_real_llm_generation():
"""Test with real LLM API"""
client = LLMClient(model="gpt-3.5-turbo")
result = client.generate_json(
prompt="Extract: Alice, 25, alice@example.com, pro plan",
schema=User
)
assert result.name == "Alice"
assert result.age == 25
assert result.email == "alice@example.com"
assert result.subscription == "pro"
@pytest.mark.slow
@pytest.mark.real_api
def test_complex_schema():
"""Test with nested, complex schema"""
class Address(BaseModel):
street: str
city: str
zipcode: str = Field(..., regex=r'^\d{5}$')
class ComplexUser(BaseModel):
name: str
addresses: list[Address]
client = LLMClient(model="gpt-4")
result = client.generate_json(
prompt="Extract user: Alice lives at 123 Main St, Springfield, 12345",
schema=ComplexUser
)
assert len(result.addresses) > 0
assert result.addresses[0].city == "Springfield"
6.3 Test Data
Example Schemas:
# tests/fixtures/schemas.py
from pydantic import BaseModel, Field, EmailStr
from typing import Literal, Optional
from enum import Enum
class SimpleUser(BaseModel):
name: str
age: int
class StrictUser(BaseModel):
name: str = Field(..., min_length=1, max_length=100)
age: int = Field(..., ge=0, le=120)
email: EmailStr
subscription: Literal["free", "pro", "enterprise"]
class Config:
extra = "forbid"
class Recipe(BaseModel):
title: str = Field(..., min_length=3)
ingredients: list[str] = Field(..., min_items=1)
cooking_time_minutes: int = Field(..., ge=1, le=1440)
difficulty: Literal["easy", "medium", "hard"]
servings: Optional[int] = Field(None, ge=1)
class Address(BaseModel):
street: str
city: str
state: str = Field(..., regex=r'^[A-Z]{2}$')
zipcode: str = Field(..., regex=r'^\d{5}$')
class UserWithAddress(BaseModel):
name: str
email: EmailStr
address: Address
Mock Responses:
# tests/fixtures/mock_responses.py
# Type error: age as string
MOCK_TYPE_ERROR = '{"name": "Alice", "age": "25"}'
# Missing field: no email
MOCK_MISSING_FIELD = '{"name": "Alice", "age": 25, "subscription": "pro"}'
# Enum error: invalid value
MOCK_ENUM_ERROR = '{"name": "Alice", "age": 25, "email": "alice@example.com", "subscription": "premium"}'
# Hallucinated field
MOCK_EXTRA_FIELD = '{"name": "Alice", "age": 25, "email": "alice@example.com", "subscription": "pro", "admin": true}'
# Valid output
MOCK_VALID = '{"name": "Alice", "age": 25, "email": "alice@example.com", "subscription": "pro"}'
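These fixtures can be replayed through the repair loop with a small fake provider, so repair behavior is testable without any real API calls (a sketch against the LLMProvider interface from Phase 1; token counts and latency are dummy values):
# tests/fixtures/fake_provider.py
from typing import List

from src.providers.base import LLMProvider, CompletionResponse, TokenUsage
from tests.fixtures.mock_responses import MOCK_MISSING_FIELD, MOCK_VALID

class FakeProvider(LLMProvider):
    """Replays a scripted sequence of responses, one per attempt."""

    def __init__(self, scripted_responses: List[str]):
        self.scripted_responses = list(scripted_responses)
        self.calls = 0

    def complete(self, messages, temperature=0.3, max_tokens=2000) -> CompletionResponse:
        # Return the next scripted response; repeat the last one if attempts run long
        content = self.scripted_responses[min(self.calls, len(self.scripted_responses) - 1)]
        self.calls += 1
        return CompletionResponse(
            content=content,
            model="fake-model",
            usage=TokenUsage(prompt_tokens=10, completion_tokens=10, total_tokens=20),
            latency_ms=1.0,
        )

# Usage: the first response fails StrictUser validation (missing email),
# the repaired second response passes.
# provider = FakeProvider([MOCK_MISSING_FIELD, MOCK_VALID])
# loop = RepairLoop(provider=provider, max_attempts=3)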
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
Mistake 1: Not using extra="forbid" in Pydantic models
Problem:
class User(BaseModel):
name: str
age: int
# Missing: class Config with extra="forbid"
# Model accepts hallucinated fields!
user = User.parse_obj({
"name": "Alice",
"age": 25,
"admin": True # ← Should be rejected but isn't!
})
Solution:
class User(BaseModel):
name: str
age: int
class Config:
extra = "forbid" # Reject unknown fields
Mistake 2: Not lowering temperature for repair attempts
Problem:
# Always using same temperature
for attempt in range(3):
response = llm.complete(messages, temperature=0.7) # Too creative for repairs!
Solution:
# Decrease temperature for repairs
for attempt in range(1, 4):
temp = 0.3 if attempt == 1 else 0.0
response = llm.complete(messages, temperature=temp)
Mistake 3: Generic repair prompts
Problem:
repair_prompt = "Your JSON was invalid. Please fix it."
# Model doesn't know WHAT to fix!
Solution:
repair_prompt = """
Field 'age' error:
Expected: integer
Received: string ("25")
Fix: Change "25" to 25 (remove quotes)
"""
Mistake 4: Infinite retry loops
Problem:
while True:
try:
return validate(response)
except:
response = retry() # Never exits!
Solution:
for attempt in range(MAX_ATTEMPTS):
try:
return validate(response)
except:
if attempt == MAX_ATTEMPTS - 1:
raise MaxRetriesExceeded()
response = retry()
7.2 Debugging Strategies
Issue: Type errors persist after repair
Symptoms:
- Repair loop exhausts all attempts
- Same type error on every attempt
- Model keeps returning string instead of integer
Debug Steps:
- Enable verbose mode:

  client = LLMClient(verbose=True)  # See exactly what the model outputs on each attempt

- Check repair prompt clarity:

  # Is your repair prompt specific enough?
  # Bad:  "age should be integer"
  # Good: "Change \"25\" to 25 (remove quotes, keep just the number)"

- Try temperature=0.0 from the start:

  client = LLMClient(temperature_strategy="constant_zero")
  # Some models need maximum determinism

- Check if the schema is too complex:

  # Simplify the schema temporarily to isolate the issue
  class SimpleTest(BaseModel):
      age: int  # Just test this one field
Issue: Model hallucinating extra fields
Symptoms:
- Validation fails with “extra fields not permitted”
- Model invents fields not in schema
Debug Steps:
- Verify extra="forbid" is set:

  class User(BaseModel):
      # ... fields ...
      class Config:
          extra = "forbid"  # Must be present!

- Make the schema explicit in the prompt:

  prompt = f"""
  {user_prompt}

  Return ONLY these exact fields: name, age, email, subscription
  Do NOT add any other fields.
  """

- Check for provider-specific issues:

  # Some models are more prone to hallucination
  # Try gpt-4 instead of gpt-3.5-turbo
Issue: JSON syntax errors
Symptoms:
- JSONDecodeError: Expecting ',' delimiter
- Missing brackets, quotes, etc.
Debug Steps:
- Add an explicit JSON formatting instruction:

  prompt = f"""
  {user_prompt}

  Return ONLY valid JSON. Example format:
  {{
      "name": "string",
      "age": 25
  }}

  Do not include markdown code blocks or explanations.
  """

- Strip markdown formatting from the response:

  def clean_response(response: str) -> str:
      # Remove markdown code blocks
      response = response.strip()
      if response.startswith("```json"):
          response = response[7:]
      if response.startswith("```"):
          response = response[3:]
      if response.endswith("```"):
          response = response[:-3]
      return response.strip()

- Use the provider's JSON mode if available:

  # OpenAI has a response_format parameter
  response = client.chat.completions.create(
      ...,
      response_format={"type": "json_object"}
  )
7.3 Performance Issues
Issue: Slow generation (>5 seconds per request)
Causes:
- Too many repair attempts
- Large prompts
- Slow model (gpt-4 vs gpt-3.5-turbo)
Solutions:
- Reduce max attempts:

  client = LLMClient(max_repair_attempts=2)  # Instead of 3

- Use a faster model for simple schemas:

  client = LLMClient(model="gpt-3.5-turbo")  # 10x faster than gpt-4

- Optimize prompt length:

  # Instead of the full schema JSON
  prompt = "Extract user (name, age, email) from: ..."
  # vs
  prompt = f"Extract user from: ...\n\nSchema:\n{full_schema_json}"
Issue: High costs ($1+ per 1000 requests)
Causes:
- Too many repair attempts
- Expensive model
- Inefficient prompts
Solutions:
- Track and analyze costs:

  result, metrics = client.generate_with_metrics(prompt, schema)
  print(f"Cost: ${metrics.total_cost:.4f}")
  # Analyze: Are repairs common? Switch to a better model.

- Use a cheaper model for the initial attempt:

  # Try gpt-3.5-turbo first, fall back to gpt-4
  try:
      result = cheap_client.generate_json(prompt, schema)
  except MaxRetriesExceeded:
      result = expensive_client.generate_json(prompt, schema)

- Cache results for duplicate requests:

  from functools import lru_cache

  @lru_cache(maxsize=1000)
  def cached_generate(prompt: str, schema: type) -> BaseModel:
      # Prompt strings and schema classes are hashable, so they can key the cache
      return client.generate_json(prompt, schema)
8. Extensions & Challenges
8.1 Beginner Extensions
Extension 1: Add Anthropic Provider Support
Goal: Support Claude models in addition to OpenAI
Implementation:
# src/providers/anthropic.py
from typing import List
from anthropic import Anthropic
from .base import LLMProvider, Message, CompletionResponse, TokenUsage
class AnthropicProvider(LLMProvider):
def __init__(self, api_key: str, model: str = "claude-3-sonnet-20240229"):
self.client = Anthropic(api_key=api_key)
self.model = model
def complete(
self,
messages: List[Message],
temperature: float = 0.3,
max_tokens: int = 2000
) -> CompletionResponse:
# Convert messages to Anthropic format
anthropic_messages = [
{"role": msg.role, "content": msg.content}
for msg in messages
if msg.role != "system" # Anthropic handles system separately
]
# Extract system message
system_msg = next(
(msg.content for msg in messages if msg.role == "system"),
None
)
response = self.client.messages.create(
model=self.model,
messages=anthropic_messages,
system=system_msg,
temperature=temperature,
max_tokens=max_tokens
)
return CompletionResponse(
content=response.content[0].text,
model=response.model,
usage=TokenUsage(
prompt_tokens=response.usage.input_tokens,
completion_tokens=response.usage.output_tokens,
total_tokens=response.usage.input_tokens + response.usage.output_tokens
),
latency_ms=0 # Would need to track separately
)
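The latency_ms=0 placeholder above can be filled by timing the provider call. A minimal, provider-agnostic sketch (timed_call is a hypothetical helper, not part of the library):
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

def timed_call(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn() and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000

# Inside AnthropicProvider.complete(), for example:
#   response, latency_ms = timed_call(lambda: self.client.messages.create(...))
#   ...then pass latency_ms into CompletionResponse instead of 0.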
Learning Goals:
- Understand provider abstraction patterns
- Handle different API formats
- Deal with provider-specific features
Extension 2: Add Retry with Exponential Backoff
Goal: Handle rate limits and transient errors gracefully
Implementation:
# src/retry.py
import time
import random
from typing import Callable, TypeVar
T = TypeVar('T')
def retry_with_backoff(
func: Callable[[], T],
max_retries: int = 3,
base_delay: float = 1.0,
max_delay: float = 60.0
) -> T:
"""
Retry function with exponential backoff.
Delay formula: min(base_delay * 2^attempt + jitter, max_delay)
"""
for attempt in range(max_retries):
try:
return func()
except Exception as e:
if attempt == max_retries - 1:
raise
# Calculate delay with exponential backoff and jitter
delay = min(
base_delay * (2 ** attempt) + random.uniform(0, 1),
max_delay
)
print(f"Retry {attempt + 1}/{max_retries} after {delay:.2f}s due to: {e}")
time.sleep(delay)
raise RuntimeError("Should never reach here")
# Usage in provider
def complete(self, messages, temperature):
return retry_with_backoff(
lambda: self._call_api(messages, temperature),
max_retries=3
)
Learning Goals:
- Implement retry patterns
- Handle API rate limits
- Add resilience to network issues
Extension 3: Add Streaming Support
Goal: Stream tokens as they’re generated instead of waiting for complete response
Implementation:
# src/streaming.py
from typing import Iterator, Optional, Type
from pydantic import BaseModel, ValidationError
from .client import LLMClient
class StreamingEnforcer:
    def __init__(self, client: LLMClient):
        self.client = client
        self.validated_result: Optional[BaseModel] = None
    def generate_json_stream(
        self,
        prompt: str,
        schema: Type[BaseModel]
    ) -> Iterator[str]:
        """
        Stream JSON generation, yielding chunks as they arrive.
        Yields:
            JSON string chunks
        Final validation happens after the stream completes; the
        validated object is stored in self.validated_result.
        """
        buffer = ""
        for chunk in self.client.provider.stream_complete(prompt):
            buffer += chunk
            yield chunk
        # Validate the complete response
        try:
            self.validated_result = schema.parse_raw(buffer)
        except ValidationError:
            # Fall back to the non-streaming repair loop
            self.validated_result = self.client.generate_json(prompt, schema)
# Usage
enforcer = StreamingEnforcer(client)
for chunk in enforcer.generate_json_stream(prompt, User):
    print(chunk, end="", flush=True)
user = enforcer.validated_result
Learning Goals:
- Handle streaming APIs
- Buffer incomplete data
- Validate after stream completes
8.2 Intermediate Extensions
Extension 4: Multi-Provider Fallback
Goal: Try multiple providers in order (e.g., OpenAI → Anthropic → Local)
Implementation:
# src/multi_provider.py
from typing import List, Type
from pydantic import BaseModel
from .client import LLMClient
from .errors import MaxRetriesExceeded
class MultiProviderClient:
def __init__(self, providers: List[LLMClient]):
"""
Initialize with list of clients in priority order.
Example:
client = MultiProviderClient([
LLMClient(model="gpt-4"), # Try first
LLMClient(model="claude-3-sonnet"), # Fallback
])
"""
self.providers = providers
def generate_json(
self,
prompt: str,
schema: Type[BaseModel]
) -> BaseModel:
"""
Try providers in order until one succeeds.
"""
errors = []
for idx, provider in enumerate(self.providers):
try:
print(f"Attempting with provider {idx + 1}/{len(self.providers)}")
return provider.generate_json(prompt, schema)
except MaxRetriesExceeded as e:
errors.append(e)
print(f"Provider {idx + 1} failed: {e}")
continue
# All providers failed
raise Exception(
f"All {len(self.providers)} providers failed. "
f"Errors: {[str(e) for e in errors]}"
)
# Usage
client = MultiProviderClient([
LLMClient(model="gpt-4"),
LLMClient(model="gpt-3.5-turbo"),
])
result = client.generate_json(prompt, schema)
Learning Goals:
- Implement fallback patterns
- Handle multiple API providers
- Design fault-tolerant systems
Extension 5: Batch Processing with Concurrency
Goal: Process multiple requests concurrently to improve throughput
Implementation:
# src/batch.py
import asyncio
from typing import List, Type
from pydantic import BaseModel
from .client import LLMClient
class BatchEnforcer:
def __init__(self, client: LLMClient, max_concurrent: int = 5):
self.client = client
self.max_concurrent = max_concurrent
async def generate_batch(
self,
prompts: List[str],
schema: Type[BaseModel]
) -> List[BaseModel]:
"""
Process multiple prompts concurrently.
Args:
prompts: List of prompts to process
schema: Schema for all responses
Returns:
List of validated results (same order as prompts)
"""
semaphore = asyncio.Semaphore(self.max_concurrent)
async def process_one(prompt: str) -> BaseModel:
async with semaphore:
# Use async LLM client here
return await self.client.generate_json_async(prompt, schema)
tasks = [process_one(prompt) for prompt in prompts]
return await asyncio.gather(*tasks)
# Usage
batch_client = BatchEnforcer(client, max_concurrent=10)
prompts = [
"Extract user: Alice, 25",
"Extract user: Bob, 30",
"Extract user: Carol, 28",
]
results = asyncio.run(batch_client.generate_batch(prompts, User))
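The batch code assumes an async generate_json_async method on the client. If your client only exposes the synchronous generate_json, one option is to push the blocking call onto a worker thread; a sketch under that assumption (the function names here are illustrative, not part of the library):
import asyncio
from typing import List, Type
from pydantic import BaseModel

async def generate_json_async(client, prompt: str, schema: Type[BaseModel]) -> BaseModel:
    # Run the blocking client call in a worker thread so the event loop stays responsive
    return await asyncio.to_thread(client.generate_json, prompt, schema)

async def generate_batch_sync_client(
    client,
    prompts: List[str],
    schema: Type[BaseModel],
    max_concurrent: int = 5,
) -> List[BaseModel]:
    semaphore = asyncio.Semaphore(max_concurrent)
    async def one(prompt: str) -> BaseModel:
        async with semaphore:
            return await generate_json_async(client, prompt, schema)
    return await asyncio.gather(*(one(p) for p in prompts))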
Learning Goals:
- Implement async/await patterns
- Manage concurrent API requests
- Handle rate limiting at scale
Extension 6: Validation Confidence Scores
Goal: Return confidence score for each field based on repair history
Implementation:
# src/confidence.py
from dataclasses import dataclass
from typing import Dict
from pydantic import BaseModel
@dataclass
class ValidationConfidence:
overall_confidence: float # 0.0 to 1.0
field_confidence: Dict[str, float] # Per-field confidence
repair_required: bool
repair_attempts: int
class ConfidenceTracker:
def calculate_confidence(
self,
result: BaseModel,
metrics: GenerationMetrics
) -> ValidationConfidence:
"""
Calculate confidence based on repair history.
Logic:
- 1.0: Validated on first attempt
- 0.8: Required 1 repair
- 0.6: Required 2 repairs
- 0.4: Required 3 repairs (max)
"""
overall_confidence = max(0.4, 1.0 - (metrics.attempts - 1) * 0.2)
# Analyze which fields had errors
field_confidence = {}
all_errors = []
for error_list in metrics.validation_errors:
all_errors.extend(error_list)
# Fields that had errors get lower confidence
error_fields = {err.field_name for err in all_errors}
for field_name in result.__fields__.keys():
if field_name in error_fields:
# Field was repaired
field_confidence[field_name] = max(0.5, overall_confidence)
else:
# Field was correct from start
field_confidence[field_name] = 1.0
return ValidationConfidence(
overall_confidence=overall_confidence,
field_confidence=field_confidence,
repair_required=metrics.attempts > 1,
repair_attempts=metrics.attempts - 1
)
# Usage
result, metrics = client.generate_with_metrics(prompt, schema)
confidence = ConfidenceTracker().calculate_confidence(result, metrics)
print(f"Overall confidence: {confidence.overall_confidence:.2f}")
print(f"Field confidence:")
for field, conf in confidence.field_confidence.items():
print(f" {field}: {conf:.2f}")
Learning Goals:
- Design confidence scoring systems
- Track validation history
- Provide interpretability
8.3 Advanced Extensions
Extension 7: Type-Safe TypeScript Version (Zod)
Goal: Build equivalent library for TypeScript using Zod
Implementation:
// src/client.ts
import { z } from 'zod';
import OpenAI from 'openai';
interface GenerationMetrics {
attempts: number;
success: boolean;
totalTokens: number;
totalCost: number;
}
class LLMClient {
private openai: OpenAI;
private maxRepairAttempts: number;
constructor(apiKey: string, maxRepairAttempts: number = 3) {
this.openai = new OpenAI({ apiKey });
this.maxRepairAttempts = maxRepairAttempts;
}
async generateJSON<T extends z.ZodType>(
prompt: string,
schema: T
): Promise<z.infer<T>> {
const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
{ role: 'user', content: prompt }
];
for (let attempt = 1; attempt <= this.maxRepairAttempts; attempt++) {
const temperature = attempt === 1 ? 0.3 : 0.0;
const response = await this.openai.chat.completions.create({
model: 'gpt-4',
messages,
temperature
});
const content = response.choices[0].message.content;
try {
const data = JSON.parse(content || '{}');
const validated = schema.parse(data);
return validated;
} catch (error) {
if (attempt === this.maxRepairAttempts) {
throw new Error(`Failed after ${attempt} attempts: ${error}`);
}
// Build repair prompt
const repairPrompt = this.buildRepairPrompt(error);
messages.push(
{ role: 'assistant', content: content || '' },
{ role: 'user', content: repairPrompt }
);
}
}
throw new Error('Should never reach here');
}
private buildRepairPrompt(error: unknown): string {
if (error instanceof z.ZodError) {
const errors = error.errors.map(e =>
`Field '${e.path.join('.')}': ${e.message}`
).join('\n');
return `Your JSON had validation errors:\n${errors}\n\nPlease fix and return valid JSON.`;
}
return 'Your JSON was invalid. Please return valid JSON.';
}
}
// Usage
const UserSchema = z.object({
name: z.string().min(1),
age: z.number().int().min(0).max(120),
email: z.string().email(),
subscription: z.enum(['free', 'pro', 'enterprise'])
}).strict();
const client = new LLMClient(process.env.OPENAI_API_KEY!);
const user = await client.generateJSON(
"Extract: Alice, 25, alice@example.com, pro plan",
UserSchema
);
console.log(user.name); // TypeScript knows this is a string!
Learning Goals:
- Port Python concepts to TypeScript
- Use Zod for runtime validation
- Leverage TypeScript’s type system
Extension 8: Custom Repair Strategies
Goal: Allow users to define custom repair logic per field type
Implementation:
# src/repair_strategies.py
from typing import Any, Callable, Dict, List
from pydantic import BaseModel
from .validator import ErrorDetail  # structured error type built earlier in this project (module path assumed)
RepairStrategy = Callable[[Any, str], str]
class CustomRepairEngine:
def __init__(self):
self.strategies: Dict[str, RepairStrategy] = {}
def register_strategy(
self,
field_name: str,
strategy: RepairStrategy
):
"""
Register custom repair strategy for a field.
Args:
field_name: Name of field
strategy: Function that takes (value, error_msg) and returns repair instruction
"""
self.strategies[field_name] = strategy
def build_repair_prompt(
self,
errors: List[ErrorDetail]
) -> str:
"""Build repair prompt using custom strategies"""
lines = ["Your JSON had errors:\n"]
for error in errors:
if error.field_name in self.strategies:
# Use custom strategy
strategy = self.strategies[error.field_name]
instruction = strategy(error.received_value, error.error_message)
lines.append(f"Field '{error.field_name}': {instruction}")
else:
# Use default
lines.append(f"Field '{error.field_name}': {error.error_message}")
return "\n".join(lines)
# Usage: Define custom repair logic for age field
def age_repair_strategy(value: Any, error_msg: str) -> str:
"""Custom repair for age field"""
if isinstance(value, str):
# Try to extract number from string
words_to_numbers = {
"twenty-five": 25,
"thirty": 30,
# etc.
}
if value.lower() in words_to_numbers:
correct_value = words_to_numbers[value.lower()]
return f"Convert '{value}' to number {correct_value}"
return f"Must be an integer between 0 and 120"
engine = CustomRepairEngine()
engine.register_strategy("age", age_repair_strategy)
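To see what the engine produces, here is a small standalone demo. The attribute names (field_name, received_value, error_message) match what build_repair_prompt reads, but the dataclass below is a stand-in for the project's own ErrorDetail type:
from dataclasses import dataclass
from typing import Any

@dataclass
class FakeErrorDetail:  # stand-in for ErrorDetail, for demonstration only
    field_name: str
    received_value: Any
    error_message: str

errors = [
    FakeErrorDetail("age", "twenty-five", "value is not a valid integer"),
    FakeErrorDetail("email", None, "field required"),
]
print(engine.build_repair_prompt(errors))
# Output:
# Your JSON had errors:
#
# Field 'age': Convert 'twenty-five' to number 25
# Field 'email': field required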
Learning Goals:
- Design plugin architectures
- Create domain-specific repair logic
- Allow library customization
Extension 9: Statistical Validation
Goal: Validate outputs statistically over multiple samples
Implementation:
# src/statistical_validation.py
from typing import Type, List
from pydantic import BaseModel
from dataclasses import dataclass
import statistics
@dataclass
class StatisticalResult:
median_result: BaseModel
confidence: float
consistency_score: float
all_results: List[BaseModel]
class StatisticalEnforcer:
def __init__(self, client: LLMClient, num_samples: int = 5):
self.client = client
self.num_samples = num_samples
def generate_with_consensus(
self,
prompt: str,
schema: Type[BaseModel]
) -> StatisticalResult:
"""
Generate multiple samples and return most consistent result.
Useful for critical applications where you need high confidence.
"""
results = []
for i in range(self.num_samples):
try:
result = self.client.generate_json(prompt, schema)
results.append(result)
except Exception as e:
print(f"Sample {i+1} failed: {e}")
if not results:
raise Exception("All samples failed")
# Find most common result (by field values)
# For simplicity, return median of numeric fields
median = self._calculate_median_result(results, schema)
# Calculate consistency score
consistency = self._calculate_consistency(results)
return StatisticalResult(
median_result=median,
confidence=len(results) / self.num_samples,
consistency_score=consistency,
all_results=results
)
def _calculate_median_result(
self,
results: List[BaseModel],
schema: Type[BaseModel]
) -> BaseModel:
"""Calculate median result across samples"""
# Implementation depends on schema
# For numeric fields, take median
# For string fields, take most common value
pass
def _calculate_consistency(self, results: List[BaseModel]) -> float:
"""Calculate how consistent results are (0.0 to 1.0)"""
# Compare each result to others
# Return percentage of fields that match across all samples
pass
# Usage
stat_client = StatisticalEnforcer(client, num_samples=5)
result = stat_client.generate_with_consensus(prompt, User)
print(f"Confidence: {result.confidence:.2f}")
print(f"Consistency: {result.consistency_score:.2f}")
print(f"Result: {result.median_result}")
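The two helper methods above are left as stubs because the right aggregation depends on your schema. A minimal sketch of one way to fill them, written as free functions for brevity and assuming Pydantic v1's .dict() plus a flat schema with hashable field values; the bodies drop into _calculate_median_result and _calculate_consistency directly:
import statistics
from collections import Counter
from typing import List, Type
from pydantic import BaseModel

def median_result(results: List[BaseModel], schema: Type[BaseModel]) -> BaseModel:
    """Median for purely numeric fields, majority vote for everything else."""
    dicts = [r.dict() for r in results]
    merged = {}
    for field in schema.__fields__:
        values = [d[field] for d in dicts]
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            # Cast back to the original type (median of ints may be a float)
            merged[field] = type(values[0])(statistics.median(values))
        else:
            # Most common value wins; ties resolve arbitrarily
            merged[field] = Counter(values).most_common(1)[0][0]
    return schema(**merged)

def consistency_score(results: List[BaseModel]) -> float:
    """Fraction of fields whose value is identical across every sample."""
    dicts = [r.dict() for r in results]
    fields = list(dicts[0].keys())
    matching = sum(1 for f in fields if len({repr(d[f]) for d in dicts}) == 1)
    return matching / len(fields)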
Learning Goals:
- Apply statistical methods to LLM outputs
- Handle uncertainty quantification
- Design high-reliability systems
9. Real-World Connections
9.1 Industry Applications
Use Case 1: E-Commerce Product Data Extraction
Company: Shopify
Problem: Merchants upload unstructured product descriptions. Need to extract structured data (price, dimensions, materials).
Solution with JSON Enforcer:
from pydantic import BaseModel, Field
from typing import List, Optional
class ProductDimensions(BaseModel):
length_cm: float = Field(..., gt=0)
width_cm: float = Field(..., gt=0)
height_cm: float = Field(..., gt=0)
weight_kg: float = Field(..., gt=0)
class Product(BaseModel):
name: str = Field(..., min_length=1, max_length=200)
price_usd: float = Field(..., gt=0)
description: str = Field(..., max_length=5000)
dimensions: Optional[ProductDimensions]
materials: List[str] = Field(default_factory=list)
category: str
class Config:
extra = "forbid"
client = LLMClient(model="gpt-4")
unstructured_text = """
Premium Leather Wallet - $45
Made from genuine Italian leather
Measures 4.5" x 3.5" x 0.5", weighs about 100g
Available in black, brown, tan
"""
product = client.generate_json(
prompt=f"Extract product data from:\n{unstructured_text}",
schema=Product
)
# Guaranteed structured data for database
store_product(product)
Result: 99.5% extraction accuracy, reduced manual data entry by 80%
Use Case 2: Financial Document Processing
Company: Stripe
Problem: Extract invoice data from PDFs/images for automated billing
Solution:
class InvoiceLineItem(BaseModel):
description: str
quantity: int = Field(..., ge=1)
unit_price_usd: float = Field(..., gt=0)
total_usd: float = Field(..., gt=0)
class Invoice(BaseModel):
invoice_number: str = Field(..., regex=r'^INV-\d+$')
date: str = Field(..., regex=r'^\d{4}-\d{2}-\d{2}$')
vendor: str
line_items: List[InvoiceLineItem] = Field(..., min_items=1)
subtotal_usd: float = Field(..., gt=0)
tax_usd: float = Field(..., ge=0)
total_usd: float = Field(..., gt=0)
class Config:
extra = "forbid"
# Extract with automatic validation
invoice = client.generate_json(
prompt=f"Extract invoice from OCR text:\n{ocr_text}",
schema=Invoice
)
# Validate business logic
assert abs(invoice.total_usd - (invoice.subtotal_usd + invoice.tax_usd)) < 0.01
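The assert works, but the same rule can live inside the schema so the repair loop sees the violation too. A sketch using Pydantic v1's root_validator; CheckedInvoice is an illustrative subclass, not a required part of the design:
from pydantic import root_validator

class CheckedInvoice(Invoice):
    # Same fields as Invoice, plus a cross-field consistency rule
    @root_validator(skip_on_failure=True)
    def totals_must_add_up(cls, values):
        if abs(values["total_usd"] - (values["subtotal_usd"] + values["tax_usd"])) >= 0.01:
            raise ValueError("total_usd must equal subtotal_usd + tax_usd")
        return values
Passing schema=CheckedInvoice to generate_json turns a totals mismatch into a ValidationError, so the repair loop gets a chance to fix it instead of the assert failing downstream.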
Result: 98% accuracy on invoices, saved $500K/year in manual processing
Use Case 3: Customer Support Ticket Classification
Company: Notion
Problem: Automatically categorize and route support tickets
Solution:
from enum import Enum
class Priority(str, Enum):
LOW = "low"
MEDIUM = "medium"
HIGH = "high"
URGENT = "urgent"
class Category(str, Enum):
BILLING = "billing"
TECHNICAL = "technical"
FEATURE_REQUEST = "feature_request"
BUG_REPORT = "bug_report"
ACCOUNT = "account"
class TicketClassification(BaseModel):
category: Category
priority: Priority
suggested_team: str
requires_escalation: bool
estimated_resolution_hours: int = Field(..., ge=1, le=168)
key_issues: List[str] = Field(..., min_items=1, max_items=5)
class Config:
extra = "forbid"
classification = client.generate_json(
prompt=f"Classify support ticket:\n\n{ticket_text}",
schema=TicketClassification
)
# Automatic routing
route_to_team(classification.suggested_team)
set_priority(classification.priority)
Result: Reduced ticket routing time by 90%, improved response SLA by 40%
9.2 Open Source Projects Using Similar Patterns
Instructor (Python)
URL: https://github.com/jxnl/instructor
What it does: Pydantic-based LLM output validation with retry logic
How it works:
import instructor
from openai import OpenAI
client = instructor.patch(OpenAI())
user = client.chat.completions.create(
model="gpt-4",
response_model=User,
messages=[{"role": "user", "content": "Extract: Alice, 25"}]
)
Key Features:
- Automatic Pydantic validation
- Retry logic built-in
- Support for complex nested schemas
- Integration with OpenAI function calling
Your Implementation vs Instructor:
- You built the core logic from scratch (better learning)
- Instructor is production-optimized (use in real projects)
- Both use same underlying principles
Marvin (AI Engineering)
URL: https://github.com/PrefectHQ/marvin
What it does: Type-safe AI engineering tools
Example:
import marvin
@marvin.fn
def extract_user(text: str) -> User:
"""Extract user information"""
pass
user = extract_user("Alice, 25, alice@example.com")
# Returns validated User object
TypeChat (Microsoft)
URL: https://github.com/microsoft/TypeChat
What it does: TypeScript-first LLM interaction with schema validation
Key Principle: Use TypeScript types as the schema definition
9.3 Production Deployment Considerations
Caching Strategy
Problem: Same prompts generate same outputs but cost API calls
Solution: Cache repeated prompts. The example below is an exact-match cache; a semantic variant is sketched after it.
import hashlib
import json
from functools import lru_cache
class CachedLLMClient:
def __init__(self, client: LLMClient, cache_size: int = 1000):
self.client = client
self.cache = {}
self.cache_size = cache_size
def generate_json(
self,
prompt: str,
schema: Type[BaseModel]
) -> BaseModel:
# Create cache key
cache_key = hashlib.sha256(
f"{prompt}:{schema.__name__}".encode()
).hexdigest()
if cache_key in self.cache:
print("Cache hit!")
return self.cache[cache_key]
# Generate
result = self.client.generate_json(prompt, schema)
# Cache result
if len(self.cache) >= self.cache_size:
# Remove oldest entry
self.cache.pop(next(iter(self.cache)))
self.cache[cache_key] = result
return result
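The cache above is exact-match: a single changed character misses. Semantic caching compares embeddings instead. A rough sketch using the OpenAI embeddings endpoint and numpy for cosine similarity; the model name and threshold are illustrative choices, not part of the library:
import numpy as np
from openai import OpenAI

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.embedder = OpenAI()
        self.threshold = threshold
        self.entries = []  # list of (embedding, result) pairs

    def _embed(self, text: str) -> np.ndarray:
        resp = self.embedder.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(resp.data[0].embedding)

    def get(self, prompt: str):
        query = self._embed(prompt)
        for emb, result in self.entries:
            similarity = float(np.dot(query, emb) / (np.linalg.norm(query) * np.linalg.norm(emb)))
            if similarity >= self.threshold:
                return result  # close enough: reuse the cached result
        return None

    def put(self, prompt: str, result):
        self.entries.append((self._embed(prompt), result))
Call get() before hitting the LLM and put() after a successful generate_json; tune the threshold against your own prompts.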
Monitoring and Alerting
Metrics to Track:
- Success rate (% validations passing)
- Average repair attempts
- Cost per request
- Latency (p50, p95, p99)
- Field-level error rates
Implementation:
from datadog import DogStatsd
statsd = DogStatsd()
class MonitoredLLMClient:
def generate_json(self, prompt, schema):
start = time.time()
try:
result, metrics = self.client.generate_with_metrics(prompt, schema)
# Send metrics
statsd.increment('llm.requests.success')
statsd.histogram('llm.attempts', metrics.attempts)
statsd.histogram('llm.latency_ms', metrics.total_latency_ms)
statsd.histogram('llm.cost_usd', metrics.total_cost)
if metrics.attempts > 1:
statsd.increment('llm.repairs.occurred')
return result
except MaxRetriesExceeded as e:
statsd.increment('llm.requests.failed')
statsd.histogram('llm.attempts', e.attempts)
raise
Error Budget Management
Concept: Track reliability over time windows
class ErrorBudgetTracker:
def __init__(self, target_success_rate: float = 0.99):
self.target_success_rate = target_success_rate
self.successes = 0
self.failures = 0
def record_result(self, success: bool):
if success:
self.successes += 1
else:
self.failures += 1
def current_success_rate(self) -> float:
total = self.successes + self.failures
if total == 0:
return 1.0
return self.successes / total
def error_budget_remaining(self) -> float:
"""
Returns fraction of error budget remaining.
1.0 = Full budget (perfect success rate)
0.0 = Budget exhausted (at target rate)
<0 = Over budget (below target rate)
"""
current_rate = self.current_success_rate()
target_rate = self.target_success_rate
if current_rate >= target_rate:
# Above target, budget remaining
return (current_rate - target_rate) / (1.0 - target_rate)
else:
# Below target, over budget (negative)
return (current_rate - target_rate) / target_rate
# Usage
tracker = ErrorBudgetTracker(target_success_rate=0.99)
for _ in range(1000):
try:
result = client.generate_json(prompt, schema)
tracker.record_result(success=True)
except:
tracker.record_result(success=False)
if tracker.error_budget_remaining() < 0.1:
alert("Error budget nearly exhausted! Investigate failures.")
10. Resources
10.1 Essential Reading
Books
| Book | Chapter | Key Takeaway |
|---|---|---|
| “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 4 (Encoding & Evolution) | Schema design, backward/forward compatibility |
| “Programming TypeScript” by Boris Cherny | Ch. 3 (Type Safety) | Type systems, compile-time vs runtime validation |
| “Fluent Python” by Luciano Ramalho | Ch. 8 (Type Hints) | Python type hints, Pydantic internals |
| “Effective Python” by Brett Slatkin | Item 14 (Exceptions vs None) | Error handling patterns |
| “Clean Code” by Robert C. Martin | Ch. 7 (Error Handling) | Exception design, meaningful errors |
| “Release It!” by Michael T. Nygard | Ch. 5 (Stability Patterns) | Retry logic, circuit breakers, timeouts |
| “Clean Architecture” by Robert C. Martin | Ch. 11 (DIP) | Dependency inversion, provider abstraction |
| “AI Engineering” by Chip Huyen | Ch. 6 (LLM Engineering) | Production LLM systems |
Papers
- “JSON Schema Validation” (IETF Draft)
- Formal specification for JSON Schema
- Validation keywords and semantics
- “Language Models are Few-Shot Learners” (GPT-3 Paper)
- Understanding in-context learning
- How examples guide model behavior
- “Constitutional AI” (Anthropic)
- Self-correction mechanisms
- Model refining its own outputs
10.2 Documentation & Tools
Validation Libraries
| Tool | Language | URL |
|---|---|---|
| Pydantic | Python | https://docs.pydantic.dev/ |
| Zod | TypeScript | https://zod.dev/ |
| JSON Schema | Universal | https://json-schema.org/ |
| jsonschema (Python) | Python | https://python-jsonschema.readthedocs.io/ |
LLM Providers
| Provider | Best For | Pricing |
|---|---|---|
| OpenAI | General purpose, function calling | $0.03/1K input tokens (GPT-4) |
| Anthropic | Long context, analysis | $0.003/1K input tokens (Claude 3 Sonnet) |
| Local (Ollama) | Privacy, cost | Free (hardware costs) |
10.3 Related Projects in This Series
Next Project: Project 3 - Prompt Injection Red-Team Lab
Why it’s next: Now that you can enforce schemas, learn to defend against attacks that try to break your schemas
Connection: Prompt injection often targets schema validation (e.g., “set admin field to true”)
Related Projects
- Project 1 (Prompt Contract Harness): Use this JSON Enforcer in your test suites
- Project 4 (Context Window Manager): Combine with schema enforcement for RAG systems
10.4 Community Resources
Discord Servers
- LangChain Discord: Discussion of LLM engineering patterns
- AI Engineering Discord: Production AI systems
GitHub Repositories to Study
- instructor: https://github.com/jxnl/instructor
- marvin: https://github.com/PrefectHQ/marvin
- TypeChat: https://github.com/microsoft/TypeChat
11. Self-Assessment Checklist
Understanding
Conceptual Knowledge:
- I can explain the difference between compile-time types and runtime validation
- I understand why additionalProperties: false prevents hallucinations
- I can describe when to use temp=0.0 vs temp=0.3
- I know why repair loops improve success rates
- I understand the trade-off between cost and reliability
- I can explain JSON Schema validation keywords (required, enum, format)
- I know the difference between structural and semantic validation
Practical Application:
- I can identify when a schema is too strict or too loose
- I know how to debug type errors vs missing field errors
- I can estimate token costs for repair loops
- I understand when to use exceptions vs Result types
Implementation
Core Features:
- My validator correctly parses Pydantic ValidationErrors
- My repair loop lowers temperature for repairs
- My prompt builder provides specific, actionable repair instructions
- My client handles JSON syntax errors separately from schema errors
- I track metrics (attempts, tokens, cost, latency)
Code Quality:
- My code has type hints throughout
- I have unit tests for validator, prompt builder, and repair loop
- I have integration tests with mocked LLM responses
- My error messages are actionable and clear
- I follow PEP 8 (Python) or ESLint (TypeScript) style guidelines
Production Readiness:
- I handle API rate limits gracefully
- I support multiple LLM providers via abstraction
- I log important events (errors, repairs, success)
- My library is pip/npm installable
- I have examples and documentation
Growth
Mastery Indicators:
- I can design schemas for complex nested structures
- I can implement custom repair strategies for domain-specific types
- I can explain my design decisions (why 3 attempts? why decreasing temp?)
- I understand the limitations of this approach
- I can compare this to OpenAI’s function calling and explain tradeoffs
Next Steps:
- I’ve integrated this library into another project
- I’ve measured real-world success rates
- I’ve optimized for cost or latency based on requirements
- I’ve extended with at least 2 of the suggested extensions
12. Completion Criteria
Minimum Viable Completion
You can consider this project complete when you have:
1. Core Library (70% of effort)
- LLMClient class with generate_json() method
- Schema validation using Pydantic (Python) or Zod (TypeScript)
- Repair loop with max 3 attempts
- Temperature strategy (decreasing)
- Custom exceptions (MaxRetriesExceeded)
- Support for at least OpenAI provider
2. Testing (20% of effort)
- Unit tests for validator (5+ tests)
- Unit tests for prompt builder (3+ tests)
- Integration tests for repair loop (5+ tests)
- At least 2 end-to-end tests with real API
- Test coverage >80%
3. Documentation (10% of effort)
- README with installation instructions
- At least 3 usage examples
- Docstrings for all public methods
- Type hints throughout
Validation Test:
Run this integration test—it should pass:
from pydantic import BaseModel
from your_library import LLMClient
class User(BaseModel):
name: str
age: int
class Config:
extra = "forbid"
client = LLMClient(model="gpt-3.5-turbo")
# Test 1: Should succeed (possibly with repair)
user = client.generate_json(
"Extract: Alice, twenty-five years old",
schema=User
)
assert user.age == 25
assert isinstance(user.age, int)
# Test 2: Should track metrics
result, metrics = client.generate_with_metrics(
"Extract: Bob, 30",
schema=User
)
assert metrics.success == True
assert metrics.total_tokens > 0
print("✓ All validation tests passed!")
Full Completion
Additional Requirements:
1. Advanced Features
- Support for multiple providers (OpenAI + Anthropic or local)
- Metrics tracking with cost calculation
- Verbose mode for debugging
- Caching layer for duplicate requests
- Retry with exponential backoff for API errors
2. Comprehensive Testing
- Parametric tests across multiple models
- Performance benchmarks (requests/second, success rate)
- Cost analysis ($/1000 requests)
- Edge case tests (empty input, very long input, malformed prompts)
3. Production Readiness
- Published to PyPI or npm
- CI/CD pipeline (GitHub Actions)
- Semantic versioning
- Changelog
- Contributing guidelines
Excellence (Going Above & Beyond)
Research & Analysis:
- Benchmark report comparing success rates across models
- Cost-benefit analysis of repair strategies
- Case study of real-world application
- Blog post explaining your learnings
Advanced Implementations:
- Statistical validation (consensus over multiple samples)
- Custom repair strategies per field type
- Streaming support
- Multi-provider fallback with automatic selection
Community Contribution:
- Open-sourced on GitHub with 10+ stars
- Presented at a meetup or conference
- Tutorial or video walkthrough
- Integration with popular frameworks (LangChain, LlamaIndex)
Appendix: Sample Code
Complete Working Example
# main.py - Complete working example
from pydantic import BaseModel, Field, EmailStr
from typing import Literal
from src.client import LLMClient
class User(BaseModel):
name: str = Field(..., min_length=1, max_length=100)
age: int = Field(..., ge=0, le=120)
email: EmailStr
subscription: Literal["free", "pro", "enterprise"]
class Config:
extra = "forbid"
def main():
# Initialize client
client = LLMClient(
model="gpt-4",
max_repair_attempts=3,
verbose=True
)
# Test cases
test_cases = [
"Alice, 25, alice@example.com, pro plan",
"Bob is thirty years old, email bob@example.com, wants free tier",
"Carol, 28, carol@test.com, enterprise",
]
for text in test_cases:
print(f"\n{'='*60}")
print(f"Processing: {text}")
print(f"{'='*60}")
try:
user, metrics = client.generate_with_metrics(
prompt=f"Extract user from: {text}",
schema=User
)
print(f"\n✓ Success!")
print(f" Name: {user.name}")
print(f" Age: {user.age}")
print(f" Email: {user.email}")
print(f" Subscription: {user.subscription}")
print(f"\nMetrics:")
print(f" Attempts: {metrics.attempts}")
print(f" Tokens: {metrics.total_tokens}")
print(f" Cost: ${metrics.total_cost:.4f}")
print(f" Latency: {metrics.total_latency_ms:.0f}ms")
except Exception as e:
print(f"\n✗ Failed: {e}")
if __name__ == "__main__":
main()
Example Schema Library
# schemas.py - Reusable schemas
from pydantic import BaseModel, Field, EmailStr
from typing import Literal, Optional, List
from datetime import date
class Address(BaseModel):
street: str
city: str
state: str = Field(..., regex=r'^[A-Z]{2}$')
zipcode: str = Field(..., regex=r'^\d{5}$')
class Config:
extra = "forbid"
class Recipe(BaseModel):
title: str = Field(..., min_length=3, max_length=200)
ingredients: List[str] = Field(..., min_items=1, max_items=50)
instructions: List[str] = Field(..., min_items=1)
cooking_time_minutes: int = Field(..., ge=1, le=1440)
difficulty: Literal["easy", "medium", "hard"]
servings: int = Field(..., ge=1, le=100)
cuisine: Optional[str] = None
class Config:
extra = "forbid"
class Invoice(BaseModel):
invoice_number: str = Field(..., regex=r'^INV-\d+$')
date: date
vendor: str = Field(..., min_length=1)
items: List[dict] # Could be more structured
total_usd: float = Field(..., gt=0)
class Config:
extra = "forbid"
Congratulations! You now have a production-ready library for type-safe LLM outputs. This is infrastructure that companies pay thousands for—you built it from scratch and understand every component.