Project 27: Schema-Validated Output - Structured Data Extraction
Build a structured data extraction pipeline using --json-schema to ensure Claude's output matches expected formats: extract API specs from code, generate typed data from unstructured input, and validate outputs against schemas.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1-2 weeks |
| Language | Python (Alternatives: TypeScript, Go) |
| Prerequisites | Projects 24-26 completed, JSON Schema understanding |
| Key Topics | JSON Schema, structured output, data extraction, schema validation |
| Main Book | "Understanding JSON Schema" (json-schema.org) |
1. Learning Objectives
By completing this project, you will:
- Master JSON Schema design: Create schemas that capture complex data structures
- Use schema-validated output: Apply --json-schema for guaranteed structure
- Handle validation failures: Implement retry and fallback strategies
- Design for extraction: Craft prompts and schemas for reliable data extraction
- Evolve schemas safely: Version schemas without breaking existing data
- Build robust pipelines: Create end-to-end extraction workflows with validation
2. Real World Outcome
When complete, you'll have a pipeline that extracts structured data reliably:
$ python extract_api.py --source ./src/routes --schema api-spec.json
Extracting API specification...
Schema: api-spec.json
- endpoints: array of objects
- each endpoint: method, path, parameters, response
Validation passed!
Extracted API Specification:
{
"endpoints": [
{
"method": "GET",
"path": "/users/{id}",
"parameters": [
{"name": "id", "type": "string", "required": true}
],
"response": {
"type": "object",
"properties": {
"id": "string",
"name": "string",
"email": "string"
}
}
},
{
"method": "POST",
"path": "/users",
"parameters": [
{"name": "name", "type": "string", "required": true},
{"name": "email", "type": "string", "required": true}
],
"response": {
"type": "object",
"properties": {
"id": "string",
"created": "boolean"
}
}
}
]
}
Saved to: api-spec-output.json
Why Schema Validation Matters
Without schema validation:
Claude: "The API has GET /users and POST /users endpoints..."
You: *manually parse this text*
Result: Fragile, error-prone extraction
With schema validation:
Claude: {"endpoints": [{"method": "GET", ...}, ...]}
You: *validate against schema, use directly*
Result: Reliable, typed data
3. The Core Question You're Answering
"How do I ensure Claude's output matches a specific structure, enabling reliable data extraction and integration?"
Unstructured LLM output is hard to parse reliably. JSON Schema validation ensures you get exactly the data structure you expect, every time. This enables:
- Type-safe downstream processing
- Consistent data pipelines
- Automated error detection
- Version-controlled output formats
4. Concepts You Must Understand First
Stop and research these before coding:
4.1 JSON Schema Basics
JSON Schema defines the structure of JSON documents:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 0},
"email": {"type": "string", "format": "email"}
},
"required": ["name", "email"]
}
Key constructs:
| Construct | Purpose | Example |
|---|---|---|
| type | Data type | "string", "object", "array" |
| properties | Object fields | {"name": {"type": "string"}} |
| required | Mandatory fields | ["name", "email"] |
| items | Array element schema | {"items": {"type": "string"}} |
| enum | Fixed values | ["GET", "POST", "PUT"] |
| $ref | Schema reuse | "$ref": "#/definitions/User" |
Reference: json-schema.org - "Understanding JSON Schema"
4.2 Claude's Schema Support
# Use --json-schema to enforce output structure
claude -p "Extract API endpoints" \
--json-schema ./schemas/api.json \
--output-format json
# Claude will output data matching the schema
# Validation happens automatically
Key Questions:
- What happens when output doesn't match schema?
- Are there schema complexity limits?
- How does Claude use the schema to guide generation?
Reference: Claude Code documentation - "--json-schema"
4.3 Schema Design Patterns
SCHEMA DESIGN PATTERNS

FLAT SCHEMA (Simple)
├── Single-level properties
├── Easy to understand and validate
└── Good for simple data extraction

NESTED SCHEMA (Complex)
├── Objects within objects
├── Captures relationships
└── Requires careful path navigation

ARRAY SCHEMA (Collections)
├── Lists of structured items
├── Define item schema once
└── Handles variable-length output

UNION SCHEMA (Variants)
├── oneOf, anyOf for alternatives
├── Handles different output types
└── More complex validation logic
5. Questions to Guide Your Design
5.1 What Data to Extract?
| Use Case | Extraction Target | Schema Complexity |
|---|---|---|
| API documentation | Endpoints, parameters, responses | Medium |
| Code comments | TODO items, deprecation notices | Low |
| Type definitions | Interfaces, types, relationships | High |
| Configuration | Settings, defaults, options | Medium |
| Dependencies | Package names, versions | Low |
5.2 How Strict Should the Schema Be?
// LOOSE: Allows additional properties
{
"type": "object",
"properties": {"name": {"type": "string"}},
"additionalProperties": true // Claude can add extra fields
}
// STRICT: Only defined properties
{
"type": "object",
"properties": {"name": {"type": "string"}},
"additionalProperties": false, // No extra fields allowed
"required": ["name"]
}
// ENUM: Fixed values only
{
"method": {
"type": "string",
"enum": ["GET", "POST", "PUT", "DELETE", "PATCH"]
}
}
Trade-offs:
- Strict = More predictable, less flexible
- Loose = More flexible, harder to process
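The trade-off can be made concrete without a validator library. The following is a minimal sketch, assuming two hypothetical helpers (`unexpected_keys` and `check_extra_fields`, neither of which comes from the project), that reports which top-level fields a strict schema would reject and a loose schema would let through. It inspects only the top level and treats any non-false `additionalProperties` value as permissive, so it is an illustration rather than a replacement for real validation.

```python
def unexpected_keys(data: dict, schema: dict) -> list[str]:
    """Return keys in `data` that the schema's `properties` do not declare.

    Hypothetical helper for illustration: it only looks at the top level.
    """
    declared = set(schema.get("properties", {}))
    return sorted(key for key in data if key not in declared)


def check_extra_fields(data: dict, schema: dict) -> tuple[bool, list[str]]:
    """Mimic how `additionalProperties` treats undeclared top-level keys."""
    extras = unexpected_keys(data, schema)
    # JSON Schema treats a missing additionalProperties as true.
    allowed = schema.get("additionalProperties", True) is not False
    return (allowed or not extras, extras)


loose = {"type": "object", "properties": {"name": {"type": "string"}}}
strict = {**loose, "additionalProperties": False, "required": ["name"]}
payload = {"name": "users-api", "notes": "added by the model"}

print(check_extra_fields(payload, loose))   # (True, ['notes'])
print(check_extra_fields(payload, strict))  # (False, ['notes'])
```

With the loose schema the extra field is accepted but still reported, which lets downstream code log what it is ignoring. With the strict schema the same payload fails outright.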
5.3 How to Handle Failures?
FAILURE HANDLING STRATEGY

Schema validation fails
├── Retry with simpler prompt
│   "Please output ONLY valid JSON matching this schema"
├── Retry with simpler schema
│   Remove optional fields, loosen constraints
├── Extract partial results
│   Use what validates, flag rest as incomplete
└── Return detailed error
    Show what failed and why
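The "extract partial results" branch is the least obvious of the four, so here is one way it might look. This is a sketch under stated assumptions: `endpoint_is_valid` is a hand-rolled stand-in for validating a single endpoint against the item schema, and `split_partial` is a hypothetical helper name. In a real pipeline the predicate would wrap a proper JSON Schema validator.

```python
from typing import Callable

# A tuple, not a set, so the membership test also works for unhashable values.
ALLOWED_METHODS = ("GET", "POST", "PUT", "DELETE", "PATCH")


def endpoint_is_valid(endpoint: object) -> bool:
    """Hand-rolled stand-in for validating one endpoint against a schema."""
    return (
        isinstance(endpoint, dict)
        and endpoint.get("method") in ALLOWED_METHODS
        and isinstance(endpoint.get("path"), str)
        and endpoint["path"].startswith("/")
    )


def split_partial(
    endpoints: list, is_valid: Callable[[object], bool] = endpoint_is_valid
) -> tuple[list, list]:
    """Separate items that validate from items to flag as incomplete."""
    valid, rejected = [], []
    for index, item in enumerate(endpoints):
        if is_valid(item):
            valid.append(item)
        else:
            # Keep the position so the failure can be reported precisely.
            rejected.append({"index": index, "item": item})
    return valid, rejected


raw = [
    {"method": "GET", "path": "/users/{id}"},
    {"method": "FETCH", "path": "/users"},  # method not in the enum
    {"method": "POST"},                     # missing path
]
valid, rejected = split_partial(raw)
print(len(valid), [r["index"] for r in rejected])  # 1 [1, 2]
```

Recording the index of each rejected item makes the final "return detailed error" step cheap: the report can point at the exact entries that failed.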
6. Thinking Exercise
Design an API Spec Schema
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "API Specification",
"type": "object",
"properties": {
"endpoints": {
"type": "array",
"items": {
"type": "object",
"properties": {
"method": {
"type": "string",
"enum": ["GET", "POST", "PUT", "DELETE", "PATCH"]
},
"path": {
"type": "string",
"pattern": "^/.*"
},
"description": {
"type": "string"
},
"parameters": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"type": {"type": "string"},
"required": {"type": "boolean"},
"location": {
"type": "string",
"enum": ["path", "query", "body", "header"]
}
},
"required": ["name", "type"]
}
},
"response": {
"type": "object",
"properties": {
"type": {"type": "string"},
"properties": {
"type": "object",
"additionalProperties": {"type": "string"}
}
}
}
},
"required": ["method", "path"]
}
},
"version": {
"type": "string"
}
},
"required": ["endpoints"]
}
Design Questions
- What if an endpoint has no parameters?
  - Parameters array can be empty ([])
  - Or make it optional (not in required)
  - Handle both in downstream code
- How do you handle response types?
  - Simple: Just record the type name
  - Complex: Full recursive schema
  - Practical: Use OpenAPI-like references
- Should you allow unknown methods?
  - Strict (enum): Only known methods
  - Loose (no enum): Accept any string
  - Consider custom methods like SUBSCRIBE
7. The Interview Questions They'll Ask
7.1 "How do you ensure structured output from an LLM?"
Good answer:
Use JSON Schema validation with Claude's --json-schema flag. This:
- Guides Claude to produce the expected structure
- Validates output automatically
- Fails fast on malformed data
- Enables type-safe downstream processing
7.2 "What is JSON Schema and how would you use it?"
Key points:
- Declarative format for describing JSON structure
- Defines types, required fields, constraints
- Enables automated validation
- Standard format (draft-07 most common)
from jsonschema import validate, ValidationError

schema = {"type": "object", "required": ["name"]}
data = {"name": "test"}

try:
    validate(instance=data, schema=schema)
    print("Valid!")
except ValidationError as e:
    print(f"Invalid: {e.message}")
7.3 "How do you handle schema validation failures?"
Strategies:
- Retry: Ask Claude again with emphasis on structure
- Simplify: Use a less strict schema
- Partial: Accept valid portions, flag rest
- Transform: Fix common errors programmatically
- Fallback: Return error with details for human review
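The "Transform" strategy is worth a concrete illustration, since it can save a retry. The sketch below assumes two failure modes that are plausible but not documented in this guide: a JSON payload wrapped in a Markdown code fence, and HTTP methods returned in lower case. The helper names (`strip_fence`, `normalize_methods`, `repair`) are hypothetical.

```python
import json

FENCE = "`" * 3  # a Markdown code-fence marker


def strip_fence(text: str) -> str:
    """Remove a Markdown code fence wrapped around a JSON payload, if present."""
    stripped = text.strip()
    if stripped.startswith(FENCE) and stripped.endswith(FENCE):
        lines = stripped.splitlines()
        # Drop the opening fence line (it may carry a language tag) and the closing one.
        return "\n".join(lines[1:-1])
    return stripped


def normalize_methods(spec: dict) -> dict:
    """Upper-case HTTP methods so 'get' passes an enum of 'GET', 'POST', ..."""
    for endpoint in spec.get("endpoints", []):
        if isinstance(endpoint, dict) and isinstance(endpoint.get("method"), str):
            endpoint["method"] = endpoint["method"].strip().upper()
    return spec


def repair(text: str) -> dict:
    """Apply cheap fixes before validating; raises json.JSONDecodeError if still unparseable."""
    return normalize_methods(json.loads(strip_fence(text)))


raw = FENCE + 'json\n{"endpoints": [{"method": "get", "path": "/users"}]}\n' + FENCE
print(repair(raw))  # {'endpoints': [{'method': 'GET', 'path': '/users'}]}
```

Repairs like these should run before schema validation, not instead of it: the repaired document still has to pass the schema.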
7.4 "What are the trade-offs of strict vs loose schemas?"
| Aspect | Strict Schema | Loose Schema |
|---|---|---|
| Validation | All or nothing | Partial success |
| Flexibility | Low | High |
| Predictability | High | Low |
| Error detection | Immediate | Deferred |
| Evolution | Breaking changes | Easier migration |
7.5 "How would you version schemas for evolving data?"
Approaches:
- Additive: Add optional fields (backward compatible)
- Deprecation: Mark fields as deprecated before removal
- Version header: Include schema version in output
- Multiple schemas: Support multiple versions simultaneously
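The version-header and additive approaches combine naturally. As a rough sketch, assume stored documents carry a hypothetical `schemaVersion` field and that a hypothetical version 2 adds an optional `location` to each parameter. The default of "query" is an illustrative choice, not something the schemas in this guide prescribe.

```python
import json

CURRENT_VERSION = 2


def migrate_v1_to_v2(doc: dict) -> dict:
    """Hypothetical v2 adds an optional `location` to each parameter."""
    upgraded = json.loads(json.dumps(doc))  # deep copy via a JSON round trip
    for endpoint in upgraded.get("endpoints", []):
        for param in endpoint.get("parameters", []):
            param.setdefault("location", "query")  # assumed default
    upgraded["schemaVersion"] = 2
    return upgraded


# One migration function per version step; later versions extend this mapping.
MIGRATIONS = {1: migrate_v1_to_v2}


def upgrade(doc: dict) -> dict:
    """Bring a stored document up to CURRENT_VERSION, one step at a time."""
    version = doc.get("schemaVersion", 1)  # no header is treated as v1
    while version < CURRENT_VERSION:
        doc = MIGRATIONS[version](doc)
        version = doc["schemaVersion"]
    return doc


old = {"endpoints": [{"method": "GET", "path": "/users",
                      "parameters": [{"name": "page", "type": "integer"}]}]}
new = upgrade(old)
print(new["schemaVersion"], new["endpoints"][0]["parameters"][0]["location"])  # 2 query
```

Because each migration copies its input, older documents stay untouched on disk and can be re-validated against the schema version they were written for.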
8. Hints in Layers
Hint 1: Start Simple
Begin with a flat schema:
{
"type": "object",
"properties": {
"endpoints": {
"type": "array",
"items": {"type": "string"}
}
}
}
Just extract endpoint paths as strings first.
Hint 2: Use Enums for Fixed Values
Constrain method values:
{
"method": {
"type": "string",
"enum": ["GET", "POST", "PUT", "DELETE", "PATCH"]
}
}
This catches invalid methods immediately.
Hint 3: Handle Arrays Carefully
Define item schemas for consistency:
{
"parameters": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"type": {"type": "string"}
},
"required": ["name", "type"]
}
}
}
Hint 4: Test with Edge Cases
Try:
- Empty arrays
- Missing optional fields
- Unexpected values
- Very long strings
- Unicode characters
- Nested objects at max depth
9. Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| JSON Schema | "Understanding JSON Schema" | All (free online) |
| Data validation | "Designing Data-Intensive Applications" by Kleppmann | Ch. 4: Encoding and Evolution |
| API design | "RESTful Web APIs" by Richardson & Ruby | Ch. 3-4: Resources and Representations |
| Type systems | "Types and Programming Languages" by Pierce | Ch. 1-3: Introduction to types |
| Data modeling | "Data Model Patterns" by Hay | Ch. 2-4: Entity patterns |
10. Implementation Guide
10.1 Complete Extraction Pipeline
import json
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional

from jsonschema import Draft7Validator


@dataclass
class ExtractionResult:
    success: bool
    data: Optional[dict] = None
    error: Optional[str] = None
    validation_errors: Optional[list] = None


class SchemaExtractor:
    def __init__(self, schema_path: str):
        with open(schema_path) as f:
            self.schema = json.load(f)
        self.validator = Draft7Validator(self.schema)

    def extract(self, prompt: str, max_retries: int = 3) -> ExtractionResult:
        """Extract data with schema validation and retry."""
        # Defined up front so the final return is safe even if max_retries is 0
        result = ExtractionResult(success=False, error="No attempts made")
        for attempt in range(max_retries):
            result = self._run_extraction(prompt)
            if result.success:
                return result
            # Modify prompt for retry
            if attempt < max_retries - 1:
                prompt = self._enhance_prompt(prompt, result.validation_errors)
                print(f"Retry {attempt + 1}: Enhancing prompt...")
        return result

    def _run_extraction(self, prompt: str) -> ExtractionResult:
        """Run single extraction attempt."""
        try:
            # Run Claude with schema
            result = subprocess.run(
                [
                    "claude", "-p", prompt,
                    "--json-schema", json.dumps(self.schema),
                    "--output-format", "json"
                ],
                capture_output=True,
                text=True,
                timeout=120
            )
            if result.returncode != 0:
                return ExtractionResult(
                    success=False,
                    error=f"Claude error: {result.stderr}"
                )
            # Parse response
            response = json.loads(result.stdout)
            data = response.get("result")
            # If result is a string, try to parse as JSON
            if isinstance(data, str):
                try:
                    data = json.loads(data)
                except json.JSONDecodeError:
                    return ExtractionResult(
                        success=False,
                        error="Result is not valid JSON"
                    )
            # Validate against schema (collect every error, not just the first)
            validation_errors = list(self.validator.iter_errors(data))
            if validation_errors:
                return ExtractionResult(
                    success=False,
                    validation_errors=[str(e) for e in validation_errors]
                )
            return ExtractionResult(success=True, data=data)
        except subprocess.TimeoutExpired:
            return ExtractionResult(success=False, error="Timeout")
        except json.JSONDecodeError as e:
            return ExtractionResult(success=False, error=f"JSON parse error: {e}")

    def _enhance_prompt(self, original: str, errors: Optional[list]) -> str:
        """Enhance prompt based on validation errors."""
        # Only the first three errors are included to keep the prompt short
        error_summary = "\n".join(errors[:3]) if errors else "Schema mismatch"
        return f"""{original}

IMPORTANT: Your previous response had validation errors:
{error_summary}

Please ensure your response EXACTLY matches the JSON schema.
Output ONLY valid JSON, no additional text."""


def extract_api_spec(source_dir: str, schema_path: str) -> dict:
    """Extract API specification from source code."""
    extractor = SchemaExtractor(schema_path)
    prompt = f"""Analyze the source code in {source_dir} and extract the API specification.

Look for:
- Route definitions (GET, POST, PUT, DELETE endpoints)
- Path parameters (like :id or {{id}})
- Query parameters
- Request body parameters
- Response types

Output a JSON object matching the provided schema with all endpoints found."""
    result = extractor.extract(prompt)
    if result.success:
        return result.data
    else:
        raise Exception(f"Extraction failed: {result.error or result.validation_errors}")


# Example usage
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--source", required=True, help="Source directory")
    parser.add_argument("--schema", required=True, help="JSON Schema file")
    parser.add_argument("--output", default="output.json", help="Output file")
    args = parser.parse_args()

    print(f"Extracting from {args.source} using {args.schema}...")
    try:
        data = extract_api_spec(args.source, args.schema)
        with open(args.output, "w") as f:
            json.dump(data, f, indent=2)
        print(f"\nSuccess! Saved to {args.output}")
        print(f"Found {len(data.get('endpoints', []))} endpoints")
    except Exception as e:
        print(f"\nError: {e}")
        sys.exit(1)
10.2 Schema Definitions
Save as schemas/api-spec.json:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "API Specification",
"type": "object",
"properties": {
"endpoints": {
"type": "array",
"items": {
"$ref": "#/definitions/Endpoint"
}
},
"version": {
"type": "string"
},
"baseUrl": {
"type": "string"
}
},
"required": ["endpoints"],
"definitions": {
"Endpoint": {
"type": "object",
"properties": {
"method": {
"type": "string",
"enum": ["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS", "HEAD"]
},
"path": {
"type": "string"
},
"description": {
"type": "string"
},
"parameters": {
"type": "array",
"items": {
"$ref": "#/definitions/Parameter"
}
},
"requestBody": {
"$ref": "#/definitions/RequestBody"
},
"response": {
"$ref": "#/definitions/Response"
}
},
"required": ["method", "path"]
},
"Parameter": {
"type": "object",
"properties": {
"name": {"type": "string"},
"type": {"type": "string"},
"required": {"type": "boolean", "default": false},
"location": {
"type": "string",
"enum": ["path", "query", "header", "cookie"]
},
"description": {"type": "string"}
},
"required": ["name", "type", "location"]
},
"RequestBody": {
"type": "object",
"properties": {
"contentType": {"type": "string"},
"schema": {"type": "object"}
}
},
"Response": {
"type": "object",
"properties": {
"statusCode": {"type": "integer"},
"contentType": {"type": "string"},
"schema": {"type": "object"}
}
}
}
}
10.3 Type Generation from Schema
from typing import TypedDict, List, Optional


# Generated from schema - can be auto-generated
class Parameter(TypedDict, total=False):
    name: str
    type: str
    required: bool
    location: str
    description: str


class Response(TypedDict, total=False):
    statusCode: int
    contentType: str
    schema: dict


class Endpoint(TypedDict, total=False):
    method: str
    path: str
    description: str
    parameters: List[Parameter]
    response: Response


class APISpec(TypedDict, total=False):
    endpoints: List[Endpoint]
    version: str
    baseUrl: str


def process_spec(spec: APISpec):
    """Type-safe processing of extracted spec."""
    for endpoint in spec["endpoints"]:
        method = endpoint["method"]  # Type: str
        path = endpoint["path"]  # Type: str
        params = endpoint.get("parameters", [])
        for param in params:
            print(f"  {param['location']}: {param['name']} ({param['type']})")
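The comment "can be auto-generated" deserves a small demonstration. The following is a deliberately simplified sketch of such a generator, with hypothetical helper names (`py_type`, `typed_dict_source`). It handles primitive types, arrays, and local `$ref` names only, and it maps anything else to `object`. A production generator would also need to cover `enum`, `oneOf`, and required versus optional fields.

```python
JSON_TO_PY = {"string": "str", "integer": "int", "number": "float",
              "boolean": "bool", "object": "dict", "array": "list"}


def py_type(prop: dict) -> str:
    """Map one property schema to a Python annotation (simplified)."""
    if "$ref" in prop:
        # "#/definitions/Parameter" -> "Parameter"
        return prop["$ref"].rsplit("/", 1)[-1]
    if prop.get("type") == "array" and "items" in prop:
        return f"List[{py_type(prop['items'])}]"
    json_type = prop.get("type")
    # Union types such as ["string", "null"] fall back to `object` here.
    return JSON_TO_PY.get(json_type, "object") if isinstance(json_type, str) else "object"


def typed_dict_source(name: str, schema: dict) -> str:
    """Render a TypedDict class definition for an object schema."""
    lines = [f"class {name}(TypedDict, total=False):"]
    for field, prop in schema.get("properties", {}).items():
        lines.append(f"    {field}: {py_type(prop)}")
    if len(lines) == 1:
        lines.append("    pass")  # keep the class body valid when there are no fields
    return "\n".join(lines)


endpoint_schema = {
    "type": "object",
    "properties": {
        "method": {"type": "string"},
        "path": {"type": "string"},
        "parameters": {"type": "array", "items": {"$ref": "#/definitions/Parameter"}},
    },
}
print(typed_dict_source("Endpoint", endpoint_schema))
```

Running the generator over every entry in the schema's `definitions` and writing the result to a module would keep the types and the schema from drifting apart.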
11. Learning Milestones
| Milestone | Description | Verification |
|---|---|---|
| 1 | Simple schema works | Flat object validates |
| 2 | Nested schema works | Objects within objects |
| 3 | Array schema works | List of endpoints extracted |
| 4 | Enum validation works | Invalid methods rejected |
| 5 | Retry logic works | Failed extraction retries |
| 6 | Error messages helpful | Clear validation errors |
| 7 | Real extraction works | API spec from real code |
12. Common Pitfalls
12.1 Overly Complex Schemas
// TOO COMPLEX: Claude may struggle
{
"type": "object",
"properties": {
"level1": {
"type": "object",
"properties": {
"level2": {
"type": "object",
"properties": {
"level3": { /* ... */ }
}
}
}
}
}
}
// BETTER: Flatten or use references
{
"type": "object",
"properties": {
"items": {
"type": "array",
"items": {"$ref": "#/definitions/Item"}
}
},
"definitions": {
"Item": { /* ... */ }
}
}
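To spot this pitfall before sending a schema, a rough lint can measure how deeply object and array schemas are nested inline. This is a sketch with a hypothetical helper, `nesting_depth`. It deliberately does not expand `$ref`, so a reference counts as a leaf, and any threshold you compare the result against is your own assumption rather than a documented limit.

```python
def nesting_depth(schema: object) -> int:
    """Count levels of inline `properties` / `items` nesting in a schema.

    `$ref` targets are not followed, so referenced definitions count as leaves.
    """
    if not isinstance(schema, dict):
        return 0
    children = list(schema.get("properties", {}).values())
    if isinstance(schema.get("items"), dict):
        children.append(schema["items"])
    if not children:
        return 0
    return 1 + max(nesting_depth(child) for child in children)


deep = {
    "type": "object",
    "properties": {
        "level1": {"type": "object", "properties": {
            "level2": {"type": "object", "properties": {
                "level3": {"type": "string"}}}}}
    },
}
flattened = {
    "type": "object",
    "properties": {
        "items": {"type": "array", "items": {"$ref": "#/definitions/Item"}}
    },
    "definitions": {
        "Item": {"type": "object", "properties": {"name": {"type": "string"}}}
    },
}
print(nesting_depth(deep), nesting_depth(flattened))  # 3 2
```

The count for the flattened schema stays constant as `Item` grows more detailed, because the detail lives in `definitions` and not in the inline tree.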
12.2 Missing Required Fields
// WRONG: Forgot to mark required
{
"properties": {
"name": {"type": "string"},
"id": {"type": "integer"}
}
// id should be required!
}
// RIGHT: Explicit required array
{
"properties": {
"name": {"type": "string"},
"id": {"type": "integer"}
},
"required": ["id"]
}
12.3 Not Handling Empty Arrays
# WRONG: Assumes non-empty
endpoints = result["endpoints"][0]  # IndexError if empty
# RIGHT: Check first
endpoints = result.get("endpoints", [])
if endpoints:
    first = endpoints[0]
else:
    print("No endpoints found")
12.4 Schema Mismatch with Prompt
# WRONG: Prompt asks for X, schema expects Y
prompt = "List all functions"
schema = {"properties": {"endpoints": ...}} # Mismatch!
# RIGHT: Align prompt and schema
prompt = "Extract API endpoints as JSON"
schema = {"properties": {"endpoints": ...}} # Match!
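One way to keep prompt and schema aligned is to derive part of the prompt from the schema itself. Below is a minimal sketch, assuming hypothetical helpers `describe_fields` and `build_prompt`. It lists only top-level properties and their declared types, which is often enough to stop the two from drifting apart.

```python
def describe_fields(schema: dict) -> str:
    """List the top-level fields of an object schema for use in a prompt."""
    required = set(schema.get("required", []))
    lines = []
    for name, prop in schema.get("properties", {}).items():
        kind = prop.get("type", "any")
        flag = "required" if name in required else "optional"
        lines.append(f"- {name} ({kind}, {flag})")
    return "\n".join(lines)


def build_prompt(task: str, schema: dict) -> str:
    """Combine a task description with a field list derived from the schema."""
    return f"{task}\n\nReturn a JSON object with these fields:\n{describe_fields(schema)}"


schema = {
    "type": "object",
    "properties": {
        "endpoints": {"type": "array"},
        "version": {"type": "string"},
    },
    "required": ["endpoints"],
}
print(build_prompt("Extract the API endpoints from the source code.", schema))
```

If a field is renamed in the schema, the prompt changes with it, so the mismatch shown above cannot be introduced by editing only one of the two.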
13. Extension Ideas
- OpenAPI generator: Output OpenAPI/Swagger format
- TypeScript types: Generate TypeScript interfaces from schema
- Schema inference: Learn schema from examples
- Multi-format: Extract to JSON, YAML, or XML
- Incremental extraction: Update spec when code changes
- Validation reports: Detailed schema compliance reports
14. Summary
This project teaches you to:
- Design JSON Schemas for data extraction
- Use --json-schema for validated output
- Handle validation failures gracefully
- Build type-safe extraction pipelines
- Version schemas for evolving data
The key insight is that schemas transform extraction from "parsing free text" to "validating structured data." Instead of hoping Claude's output is parseable, you define exactly what you expect and validate it automatically.
Key takeaway: Schema validation is about contracts. You define a contract (the schema) that Claude's output must satisfy. When it does, you have guaranteed structure. When it doesn't, you know exactly what went wrong. This transforms unreliable text extraction into a robust data pipeline.