Project 27: Schema-Validated Output - Structured Data Extraction

Build a structured data extraction pipeline using --json-schema to ensure Claude’s output matches expected formats: extract API specs from code, generate typed data from unstructured input, and validate outputs against schemas.

Quick Reference

Attribute	Value
Difficulty	Advanced
Time Estimate	1-2 weeks
Language	Python (Alternatives: TypeScript, Go)
Prerequisites	Projects 24-26 completed, JSON Schema understanding
Key Topics	JSON Schema, structured output, data extraction, schema validation
Main Book	“Understanding JSON Schema” (json-schema.org)

1. Learning Objectives

By completing this project, you will:

Master JSON Schema design: Create schemas that capture complex data structures
Use schema-validated output: Apply --json-schema for guaranteed structure
Handle validation failures: Implement retry and fallback strategies
Design for extraction: Craft prompts and schemas for reliable data extraction
Evolve schemas safely: Version schemas without breaking existing data
Build robust pipelines: Create end-to-end extraction workflows with validation

2. Real World Outcome

When complete, you’ll have a pipeline that extracts structured data reliably:

$ python extract_api.py --source ./src/routes --schema api-spec.json

Extracting API specification...

Schema: api-spec.json
- endpoints: array of objects
- each endpoint: method, path, parameters, response

Validation passed!

Extracted API Specification:

{
  "endpoints": [
    {
      "method": "GET",
      "path": "/users/{id}",
      "parameters": [
        {"name": "id", "type": "string", "required": true}
      ],
      "response": {
        "type": "object",
        "properties": {
          "id": "string",
          "name": "string",
          "email": "string"
        }
      }
    },
    {
      "method": "POST",
      "path": "/users",
      "parameters": [
        {"name": "name", "type": "string", "required": true},
        {"name": "email", "type": "string", "required": true}
      ],
      "response": {
        "type": "object",
        "properties": {
          "id": "string",
          "created": "boolean"
        }
      }
    }
  ]
}

Saved to: api-spec-output.json

Why Schema Validation Matters

Without schema validation:

Claude: "The API has GET /users and POST /users endpoints..."
You: *manually parse this text*
Result: Fragile, error-prone extraction

With schema validation:

Claude: {"endpoints": [{"method": "GET", ...}, ...]}
You: *validate against schema, use directly*
Result: Reliable, typed data

3. The Core Question You’re Answering

“How do I ensure Claude’s output matches a specific structure, enabling reliable data extraction and integration?”

Unstructured LLM output is hard to parse reliably. Validating against a JSON Schema means you either get the data structure you expect or an explicit validation error you can act on. This enables:

Type-safe downstream processing
Consistent data pipelines
Automated error detection
Version-controlled output formats

4. Concepts You Must Understand First

Stop and research these before coding:

4.1 JSON Schema Basics

JSON Schema defines the structure of JSON documents:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "integer", "minimum": 0},
    "email": {"type": "string", "format": "email"}
  },
  "required": ["name", "email"]
}

Key constructs:

Construct	Purpose	Example
`type`	Data type	`"string"`, `"object"`, `"array"`
`properties`	Object fields	`{"name": {"type": "string"}}`
`required`	Mandatory fields	`["name", "email"]`
`items`	Array element schema	`{"items": {"type": "string"}}`
`enum`	Fixed values	`["GET", "POST", "PUT"]`
`$ref`	Schema reuse	`"$ref": "#/definitions/User"`

Reference: json-schema.org - “Understanding JSON Schema”
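
To make the `$ref` row concrete, here is a minimal sketch of how a local reference such as `#/definitions/User` points at a location inside the same schema document. The helper name `resolve_local_ref` and the sample schema are illustrative assumptions, and the sketch handles only same-document references; a real validator library does considerably more.

```python
def resolve_local_ref(schema: dict, ref: str) -> dict:
    """Follow a same-document reference like '#/definitions/User'.

    Illustrative helper (not part of any library): it only handles
    references that start with '#/'.
    """
    if not ref.startswith("#/"):
        raise ValueError(f"Only local references are supported: {ref}")
    node = schema
    for part in ref[2:].split("/"):
        # JSON Pointer escapes: ~1 means '/', ~0 means '~'
        part = part.replace("~1", "/").replace("~0", "~")
        node = node[part]
    return node


# Hypothetical schema used only for this demonstration
schema = {
    "type": "object",
    "properties": {"owner": {"$ref": "#/definitions/User"}},
    "definitions": {
        "User": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        }
    },
}

user_schema = resolve_local_ref(schema, schema["properties"]["owner"]["$ref"])
print(user_schema["required"])
```

The point of the exercise is that a reference is just a path through the schema dictionary, which is why defining an item once under `definitions` and reusing it keeps larger schemas readable.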

4.2 Claude’s Schema Support

# Use --json-schema to enforce output structure.
# The schema is passed inline as a JSON string, matching the pipeline in section 10.1.
claude -p "Extract API endpoints" \
  --json-schema "$(cat ./schemas/api.json)" \
  --output-format json

# Claude will output data matching the schema
# Validation happens automatically

Key Questions:

What happens when output doesn’t match schema?
Are there schema complexity limits?
How does Claude use the schema to guide generation?

Reference: Claude Code documentation - “--json-schema”

4.3 Schema Design Patterns

┌─────────────────────────────────────────────────────────────────┐
│                    SCHEMA DESIGN PATTERNS                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  FLAT SCHEMA (Simple)                                           │
│  ├── Single-level properties                                    │
│  ├── Easy to understand and validate                            │
│  └── Good for simple data extraction                            │
│                                                                  │
│  NESTED SCHEMA (Complex)                                        │
│  ├── Objects within objects                                     │
│  ├── Captures relationships                                     │
│  └── Requires careful path navigation                           │
│                                                                  │
│  ARRAY SCHEMA (Collections)                                     │
│  ├── Lists of structured items                                  │
│  ├── Define item schema once                                    │
│  └── Handles variable-length output                             │
│                                                                  │
│  UNION SCHEMA (Variants)                                        │
│  ├── oneOf, anyOf for alternatives                              │
│  ├── Handles different output types                             │
│  └── More complex validation logic                              │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

5. Questions to Guide Your Design

5.1 What Data to Extract?

Use Case	Extraction Target	Schema Complexity
API documentation	Endpoints, parameters, responses	Medium
Code comments	TODO items, deprecation notices	Low
Type definitions	Interfaces, types, relationships	High
Configuration	Settings, defaults, options	Medium
Dependencies	Package names, versions	Low

5.2 How Strict Should the Schema Be?

// LOOSE: Allows additional properties
{
  "type": "object",
  "properties": {"name": {"type": "string"}},
  "additionalProperties": true  // Claude can add extra fields
}

// STRICT: Only defined properties
{
  "type": "object",
  "properties": {"name": {"type": "string"}},
  "additionalProperties": false,  // No extra fields allowed
  "required": ["name"]
}

// ENUM: Fixed values only
{
  "method": {
    "type": "string",
    "enum": ["GET", "POST", "PUT", "DELETE", "PATCH"]
  }
}

Trade-offs:

Strict = More predictable, less flexible
Loose = More flexible, harder to process
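
If you start with a loose schema and later want the strict behaviour, the change can be applied mechanically. The following is a rough sketch, assuming a hypothetical helper named `make_strict`; it adds `additionalProperties: false` to every object schema that declares `properties` and leaves any explicit setting untouched.

```python
import copy


def make_strict(schema: dict) -> dict:
    """Return a copy of `schema` in which object schemas reject extra fields."""
    strict = copy.deepcopy(schema)

    def visit(node):
        if isinstance(node, dict):
            if node.get("type") == "object" and "properties" in node:
                # setdefault keeps an explicit true/false chosen by the author
                node.setdefault("additionalProperties", False)
            for value in node.values():
                visit(value)
        elif isinstance(node, list):
            for item in node:
                visit(item)

    visit(strict)
    return strict


# Hypothetical loose schema used only for this demonstration
loose = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "address": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
        },
    },
}

strict = make_strict(loose)
print(strict["additionalProperties"])
print(strict["properties"]["address"]["additionalProperties"])
```

Keeping the loose schema as the source and deriving the strict one lets you validate with either, depending on how much flexibility a given consumer can tolerate.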

5.3 How to Handle Failures?

┌─────────────────────────────────────────────────────────────────┐
│                    FAILURE HANDLING STRATEGY                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Schema validation fails                                         │
│       │                                                          │
│       ├── Retry with stricter prompt                            │
│       │   "Please output ONLY valid JSON matching this schema"   │
│       │                                                          │
│       ├── Retry with simpler schema                              │
│       │   Remove optional fields, loosen constraints             │
│       │                                                          │
│       ├── Extract partial results                                │
│       │   Use what validates, flag rest as incomplete            │
│       │                                                          │
│       └── Return detailed error                                  │
│           Show what failed and why                               │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
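
The strategy above can be sketched as an ordered list of attempts that stops at the first success and otherwise returns a detailed error. The function name `extract_with_fallbacks` and the two stand-in attempts are assumptions made for illustration; in a real pipeline each attempt would call Claude with a different prompt or schema.

```python
from typing import Callable, List, Optional, Tuple

Attempt = Callable[[], Optional[dict]]


def extract_with_fallbacks(strategies: List[Tuple[str, Attempt]]) -> dict:
    """Try each strategy in order; return the first result or all failures."""
    failures = []
    for name, attempt in strategies:
        try:
            data = attempt()
        except Exception as exc:
            failures.append(f"{name}: {exc}")
            continue
        if data is not None:
            return {"ok": True, "strategy": name, "data": data}
        failures.append(f"{name}: no valid data")
    # Nothing worked: report what failed and why
    return {"ok": False, "errors": failures}


# Stand-in attempts; real ones would run an extraction and validate it
def strict_attempt():
    raise ValueError("'method' is a required property")


def simpler_schema_attempt():
    return {"endpoints": ["/users"]}


result = extract_with_fallbacks([
    ("strict schema", strict_attempt),
    ("simpler schema", simpler_schema_attempt),
])
print(result["strategy"])
```

Recording which strategy produced the data matters downstream: output obtained from a simplified schema may be missing fields that the strict schema would have guaranteed.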

6. Thinking Exercise

Design an API Spec Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "API Specification",
  "type": "object",
  "properties": {
    "endpoints": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "method": {
            "type": "string",
            "enum": ["GET", "POST", "PUT", "DELETE", "PATCH"]
          },
          "path": {
            "type": "string",
            "pattern": "^/.*"
          },
          "description": {
            "type": "string"
          },
          "parameters": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": {"type": "string"},
                "type": {"type": "string"},
                "required": {"type": "boolean"},
                "location": {
                  "type": "string",
                  "enum": ["path", "query", "body", "header"]
                }
              },
              "required": ["name", "type"]
            }
          },
          "response": {
            "type": "object",
            "properties": {
              "type": {"type": "string"},
              "properties": {
                "type": "object",
                "additionalProperties": {"type": "string"}
              }
            }
          }
        },
        "required": ["method", "path"]
      }
    },
    "version": {
      "type": "string"
    }
  },
  "required": ["endpoints"]
}

Design Questions

What if an endpoint has no parameters?
- Parameters array can be empty []
- Or make it optional (not in required)
- Handle both in downstream code
How do you handle response types?
- Simple: Just record the type name
- Complex: Full recursive schema
- Practical: Use OpenAPI-like references
Should you allow unknown methods?
- Strict (enum): Only known methods
- Loose (no enum): Accept any string
- Consider custom methods like SUBSCRIBE
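
"Handle both in downstream code" can be as small as one normalization step applied right after validation. This sketch assumes a hypothetical helper named `normalize_endpoint`; it treats a missing `parameters` key and an empty array the same way.

```python
def normalize_endpoint(endpoint: dict) -> dict:
    """Return a copy in which 'parameters' is always a list."""
    normalized = dict(endpoint)
    normalized["parameters"] = list(endpoint.get("parameters") or [])
    return normalized


without_key = normalize_endpoint({"method": "GET", "path": "/health"})
with_empty = normalize_endpoint({"method": "GET", "path": "/health", "parameters": []})

print(without_key["parameters"] == with_empty["parameters"])
```

After this step the rest of the code can iterate over `parameters` without checking whether the key exists.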

7. The Interview Questions They’ll Ask

7.1 “How do you ensure structured output from an LLM?”

Good answer: Use JSON Schema validation with Claude’s --json-schema flag. This:

Guides Claude to produce the expected structure
Validates output automatically
Fails fast on malformed data
Enables type-safe downstream processing

7.2 “What is JSON Schema and how would you use it?”

Key points:

Declarative format for describing JSON structure
Defines types, required fields, constraints
Enables automated validation
Standardized format (draft-07, used throughout this project, is widely supported; newer drafts such as 2020-12 also exist)

from jsonschema import validate, ValidationError

schema = {"type": "object", "required": ["name"]}
data = {"name": "test"}

try:
    validate(instance=data, schema=schema)
    print("Valid!")
except ValidationError as e:
    print(f"Invalid: {e.message}")

7.3 “How do you handle schema validation failures?”

Strategies:

Retry: Ask Claude again with emphasis on structure
Simplify: Use a less strict schema
Partial: Accept valid portions, flag rest
Transform: Fix common errors programmatically
Fallback: Return error with details for human review
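
As an illustration of the "Transform" strategy, the sketch below repairs one common failure: JSON wrapped in a Markdown code fence or surrounded by prose. The function name `parse_json_leniently` is an assumption, and the approach is deliberately simple; it will not rescue output that is structurally broken.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep this listing readable


def parse_json_leniently(text: str) -> dict:
    """Parse a JSON object that may be wrapped in a fence or extra prose."""
    # 1. If the text contains a fenced block, keep only its contents
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # 2. Keep only the span from the first '{' to the last '}'
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found")
    return json.loads(text[start:end + 1])


# Hypothetical model output with prose around a fenced JSON block
raw = (
    "Here is the spec:\n"
    + FENCE + 'json\n{"endpoints": []}\n' + FENCE
    + "\nLet me know if you need more."
)
print(parse_json_leniently(raw))
```

Run schema validation on the repaired value afterwards; a transform only makes the output parseable, it does not make it correct.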

7.4 “What are the trade-offs of strict vs loose schemas?”

Aspect	Strict Schema	Loose Schema
Validation	All or nothing	Partial success
Flexibility	Low	High
Predictability	High	Low
Error detection	Immediate	Deferred
Evolution	Breaking changes	Easier migration

7.5 “How would you version schemas for evolving data?”

Approaches:

Additive: Add optional fields (backward compatible)
Deprecation: Mark fields as deprecated before removal
Version header: Include schema version in output
Multiple schemas: Support multiple versions simultaneously
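
The "version header" and "multiple schemas" approaches combine naturally: stored documents carry a version number, and small migration functions upgrade old documents step by step. In this sketch the field name `schemaVersion` and the v1-to-v2 rename of `url` to `path` are hypothetical choices made purely for illustration.

```python
CURRENT_VERSION = 2


def migrate_v1_to_v2(doc: dict) -> dict:
    """Hypothetical change: v1 endpoints used 'url', v2 uses 'path'."""
    endpoints = []
    for endpoint in doc.get("endpoints", []):
        endpoint = dict(endpoint)  # copy so the input is not modified
        if "url" in endpoint:
            endpoint["path"] = endpoint.pop("url")
        endpoints.append(endpoint)
    return {**doc, "endpoints": endpoints, "schemaVersion": 2}


# Maps a version number to the function that upgrades it by one step
MIGRATIONS = {1: migrate_v1_to_v2}


def upgrade(doc: dict) -> dict:
    """Apply migrations until the document reaches CURRENT_VERSION."""
    version = doc.get("schemaVersion", 1)  # treat unversioned data as v1
    while version < CURRENT_VERSION:
        doc = MIGRATIONS[version](doc)
        version = doc["schemaVersion"]
    return doc


old = {"endpoints": [{"method": "GET", "url": "/users"}]}
new = upgrade(old)
print(new["schemaVersion"], new["endpoints"][0]["path"])
```

With this arrangement, downstream code only ever handles the current version, while previously extracted data stays usable.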

8. Hints in Layers

Hint 1: Start Simple

Begin with a flat schema:

{
  "type": "object",
  "properties": {
    "endpoints": {
      "type": "array",
      "items": {"type": "string"}
    }
  }
}

Just extract endpoint paths as strings first.

Hint 2: Use Enums for Fixed Values

Constrain method values:

{
  "method": {
    "type": "string",
    "enum": ["GET", "POST", "PUT", "DELETE", "PATCH"]
  }
}

This catches invalid methods immediately.

Hint 3: Handle Arrays Carefully

Define item schemas for consistency:

{
  "parameters": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "type": {"type": "string"}
      },
      "required": ["name", "type"]
    }
  }
}

Hint 4: Test with Edge Cases

Try:

Empty arrays
Missing optional fields
Unexpected values
Very long strings
Unicode characters
Nested objects at max depth
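
One way to organize these checks is a small table of named cases with the expected verdict for each. To stay dependency-free, this sketch uses a hand-written check, `matches_flat_schema`, that mirrors the flat schema from Hint 1; the function and the case list are illustrative assumptions, and in the real project you would run the same table through your schema validator.

```python
def matches_flat_schema(doc) -> bool:
    """Hand-written stand-in for the Hint 1 schema:
    an object whose optional 'endpoints' field is an array of strings."""
    if not isinstance(doc, dict):
        return False
    endpoints = doc.get("endpoints", [])
    return isinstance(endpoints, list) and all(
        isinstance(item, str) for item in endpoints
    )


# (label, document, should it validate?)
EDGE_CASES = [
    ("empty array", {"endpoints": []}, True),
    ("missing optional field", {}, True),
    ("unexpected value", {"endpoints": [42]}, False),
    ("very long string", {"endpoints": ["/" + "a" * 10_000]}, True),
    ("unicode characters", {"endpoints": ["/użytkownicy/日本語"]}, True),
    ("wrong container type", {"endpoints": "/users"}, False),
]

for label, doc, expected in EDGE_CASES:
    assert matches_flat_schema(doc) == expected, label

print(f"{len(EDGE_CASES)} edge cases behaved as expected")
```

Adding a case is a one-line change, which makes it cheap to capture every surprising output you encounter during development.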

9. Books That Will Help

Topic	Book	Chapter/Section
JSON Schema	“Understanding JSON Schema”	All (free online)
Data validation	“Designing Data-Intensive Applications” by Kleppmann	Ch. 4: Encoding and Evolution
API design	“RESTful Web APIs” by Richardson & Ruby	Ch. 3-4: Resources and Representations
Type systems	“Types and Programming Languages” by Pierce	Ch. 1-3: Introduction to types
Data modeling	“Data Model Patterns” by Hay	Ch. 2-4: Entity patterns

10. Implementation Guide

10.1 Complete Extraction Pipeline

import subprocess
import json
from jsonschema import Draft7Validator
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ExtractionResult:
    success: bool
    data: Optional[dict] = None
    error: Optional[str] = None
    validation_errors: Optional[List[str]] = None

class SchemaExtractor:
    def __init__(self, schema_path: str):
        with open(schema_path) as f:
            self.schema = json.load(f)
        self.validator = Draft7Validator(self.schema)

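    def extract(self, prompt: str, max_retries: int = 3) -> ExtractionResult:
        """Extract data with schema validation and retry."""
        # Default result so the method still returns something if max_retries is 0
        result = ExtractionResult(success=False, error="No attempts were made")
        current_prompt = prompt

        for attempt in range(max_retries):
            result = self._run_extraction(current_prompt)

            if result.success:
                return result

            # Rebuild the retry prompt from the original so that
            # error notes from earlier attempts do not pile up
            if attempt < max_retries - 1:
                current_prompt = self._enhance_prompt(prompt, result.validation_errors)
                print(f"Retry {attempt + 1}: Enhancing prompt...")

        return result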

    def _run_extraction(self, prompt: str) -> ExtractionResult:
        """Run single extraction attempt."""
        try:
            # Run Claude with schema
            result = subprocess.run(
                [
                    "claude", "-p", prompt,
                    "--json-schema", json.dumps(self.schema),
                    "--output-format", "json"
                ],
                capture_output=True,
                text=True,
                timeout=120
            )

            if result.returncode != 0:
                return ExtractionResult(
                    success=False,
                    error=f"Claude error: {result.stderr}"
                )

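            # Parse the JSON envelope printed by the CLI
            response = json.loads(result.stdout)
            # Assumption: depending on the CLI version, the schema-validated
            # payload may appear under "structured_output"; fall back to "result"
            data = response.get("structured_output")
            if data is None:
                data = response.get("result")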

            # If result is a string, try to parse as JSON
            if isinstance(data, str):
                try:
                    data = json.loads(data)
                except json.JSONDecodeError:
                    return ExtractionResult(
                        success=False,
                        error="Result is not valid JSON"
                    )

            # Validate against schema
            validation_errors = list(self.validator.iter_errors(data))

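            if validation_errors:
                return ExtractionResult(
                    success=False,
                    # e.message is the short description; str(e) would also
                    # dump the failing schema and instance into the retry prompt
                    validation_errors=[e.message for e in validation_errors]
                )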

            return ExtractionResult(success=True, data=data)

        except subprocess.TimeoutExpired:
            return ExtractionResult(success=False, error="Timeout")
        except json.JSONDecodeError as e:
            return ExtractionResult(success=False, error=f"JSON parse error: {e}")

    def _enhance_prompt(self, original: str, errors: list) -> str:
        """Enhance prompt based on validation errors."""
        error_summary = "\n".join(errors[:3]) if errors else "Schema mismatch"

        return f"""{original}

IMPORTANT: Your previous response had validation errors:
{error_summary}

Please ensure your response EXACTLY matches the JSON schema.
Output ONLY valid JSON, no additional text."""


def extract_api_spec(source_dir: str, schema_path: str) -> dict:
    """Extract API specification from source code."""
    extractor = SchemaExtractor(schema_path)

    prompt = f"""Analyze the source code in {source_dir} and extract the API specification.

Look for:
- Route definitions (GET, POST, PUT, DELETE endpoints)
- Path parameters (like :id or {{id}})
- Query parameters
- Request body parameters
- Response types

Output a JSON object matching the provided schema with all endpoints found."""

    result = extractor.extract(prompt)

    if result.success:
        return result.data

    raise RuntimeError(f"Extraction failed: {result.error or result.validation_errors}")


# Example usage
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--source", required=True, help="Source directory")
    parser.add_argument("--schema", required=True, help="JSON Schema file")
    parser.add_argument("--output", default="output.json", help="Output file")
    args = parser.parse_args()

    print(f"Extracting from {args.source} using {args.schema}...")

    try:
        data = extract_api_spec(args.source, args.schema)

        with open(args.output, "w") as f:
            json.dump(data, f, indent=2)

        print(f"\nSuccess! Saved to {args.output}")
        print(f"Found {len(data.get('endpoints', []))} endpoints")

    except Exception as e:
        print(f"\nError: {e}")
        raise SystemExit(1)

10.2 Schema Definitions

Save as schemas/api-spec.json:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "API Specification",
  "type": "object",
  "properties": {
    "endpoints": {
      "type": "array",
      "items": {
        "$ref": "#/definitions/Endpoint"
      }
    },
    "version": {
      "type": "string"
    },
    "baseUrl": {
      "type": "string"
    }
  },
  "required": ["endpoints"],
  "definitions": {
    "Endpoint": {
      "type": "object",
      "properties": {
        "method": {
          "type": "string",
          "enum": ["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS", "HEAD"]
        },
        "path": {
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "parameters": {
          "type": "array",
          "items": {
            "$ref": "#/definitions/Parameter"
          }
        },
        "requestBody": {
          "$ref": "#/definitions/RequestBody"
        },
        "response": {
          "$ref": "#/definitions/Response"
        }
      },
      "required": ["method", "path"]
    },
    "Parameter": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "type": {"type": "string"},
        "required": {"type": "boolean", "default": false},
        "location": {
          "type": "string",
          "enum": ["path", "query", "header", "cookie"]
        },
        "description": {"type": "string"}
      },
      "required": ["name", "type", "location"]
    },
    "RequestBody": {
      "type": "object",
      "properties": {
        "contentType": {"type": "string"},
        "schema": {"type": "object"}
      }
    },
    "Response": {
      "type": "object",
      "properties": {
        "statusCode": {"type": "integer"},
        "contentType": {"type": "string"},
        "schema": {"type": "object"}
      }
    }
  }
}

10.3 Type Generation from Schema

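from typing import List, TypedDict

# Written by hand to mirror schemas/api-spec.json - can be auto-generated.
# Keys listed in a schema's "required" array go in the *Required base class;
# the total=False subclass adds the optional keys.

class _ParameterRequired(TypedDict):
    name: str
    type: str
    location: str

class Parameter(_ParameterRequired, total=False):
    required: bool
    description: str

class RequestBody(TypedDict, total=False):
    contentType: str
    schema: dict

class Response(TypedDict, total=False):
    statusCode: int
    contentType: str
    schema: dict

class _EndpointRequired(TypedDict):
    method: str
    path: str

class Endpoint(_EndpointRequired, total=False):
    description: str
    parameters: List[Parameter]
    requestBody: RequestBody
    response: Response

class _APISpecRequired(TypedDict):
    endpoints: List[Endpoint]

class APISpec(_APISpecRequired, total=False):
    version: str
    baseUrl: str


def process_spec(spec: APISpec) -> None:
    """Type-safe processing of extracted spec."""
    for endpoint in spec["endpoints"]:
        # Required keys can be indexed directly
        print(f"{endpoint['method']} {endpoint['path']}")

        # Optional keys need a default
        for param in endpoint.get("parameters", []):
            print(f"  {param['location']}: {param['name']} ({param['type']})")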
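
Since the comment above notes that these types can be auto-generated, here is a minimal sketch of what such a generator might look like. The names `python_type` and `typed_dict_source` are assumptions, and for brevity the sketch emits every field as optional (`total=False`), ignores the `required` array, and expects property names that are valid Python identifiers.

```python
JSON_TO_PYTHON = {
    "string": "str",
    "integer": "int",
    "number": "float",
    "boolean": "bool",
    "object": "dict",
    "array": "list",
}


def python_type(prop: dict) -> str:
    """Map one property schema to a Python type annotation (as text)."""
    if "$ref" in prop:
        # '#/definitions/Response' becomes 'Response'
        return prop["$ref"].rsplit("/", 1)[-1]
    json_type = prop.get("type")
    if json_type == "array" and isinstance(prop.get("items"), dict):
        return f"List[{python_type(prop['items'])}]"
    if isinstance(json_type, str):
        return JSON_TO_PYTHON.get(json_type, "object")
    return "object"  # unions and untyped properties are left generic


def typed_dict_source(name: str, definition: dict) -> str:
    """Render a TypedDict class definition for one object schema."""
    lines = [f"class {name}(TypedDict, total=False):"]
    for field, prop in definition.get("properties", {}).items():
        lines.append(f"    {field}: {python_type(prop)}")
    if len(lines) == 1:
        lines.append("    pass")
    return "\n".join(lines)


# Hypothetical definition used only for this demonstration
parameter_def = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "required": {"type": "boolean"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

print(typed_dict_source("Parameter", parameter_def))
```

Generating the types from the schema keeps the two from drifting apart, which is the main risk of maintaining them by hand.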

11. Learning Milestones

Milestone	Description	Verification
1	Simple schema works	Flat object validates
2	Nested schema works	Objects within objects
3	Array schema works	List of endpoints extracted
4	Enum validation works	Invalid methods rejected
5	Retry logic works	Failed extraction retries
6	Error messages helpful	Clear validation errors
7	Real extraction works	API spec from real code

12. Common Pitfalls

12.1 Overly Complex Schemas

// TOO COMPLEX: Claude may struggle
{
  "type": "object",
  "properties": {
    "level1": {
      "type": "object",
      "properties": {
        "level2": {
          "type": "object",
          "properties": {
            "level3": { /* ... */ }
          }
        }
      }
    }
  }
}

// BETTER: Flatten or use references
{
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {"$ref": "#/definitions/Item"}
    }
  },
  "definitions": {
    "Item": { /* ... */ }
  }
}
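
To spot this pitfall before sending a schema to Claude, you can measure how deeply objects are nested. This is a rough sketch assuming a hypothetical helper named `object_depth`; it walks `properties` and `items` only and does not follow `$ref`, so referenced definitions count as depth zero.

```python
def object_depth(schema) -> int:
    """Count the levels of nested object schemas, including the root."""
    if not isinstance(schema, dict):
        return 0
    children = list(schema.get("properties", {}).values())
    if isinstance(schema.get("items"), dict):
        children.append(schema["items"])
    deepest = max((object_depth(child) for child in children), default=0)
    own = 1 if schema.get("type") == "object" else 0
    return own + deepest


# The nested shape from the example above, with a string at the bottom
nested = {
    "type": "object",
    "properties": {
        "level1": {
            "type": "object",
            "properties": {
                "level2": {
                    "type": "object",
                    "properties": {"level3": {"type": "string"}},
                }
            },
        }
    },
}

flat = {"type": "object", "properties": {"name": {"type": "string"}}}

print(object_depth(nested), object_depth(flat))
```

What counts as too deep is a judgment call for your project; the number is most useful as a warning threshold in a schema review or CI check.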

12.2 Missing Required Fields

// WRONG: Forgot to mark required
{
  "properties": {
    "name": {"type": "string"},
    "id": {"type": "integer"}
  }
  // id should be required!
}

// RIGHT: Explicit required array
{
  "properties": {
    "name": {"type": "string"},
    "id": {"type": "integer"}
  },
  "required": ["id"]
}
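
A small lint pass can list every declared property that is not marked as required, so that each omission is a deliberate decision. The helper name `optional_properties` and the `$`-prefixed path format are assumptions made for this sketch.

```python
def optional_properties(schema, path: str = "$") -> list:
    """List declared properties missing from their object's 'required' array."""
    findings = []
    if not isinstance(schema, dict):
        return findings
    required = set(schema.get("required", []))
    for name, sub_schema in schema.get("properties", {}).items():
        if name not in required:
            findings.append(f"{path}.{name}")
        # Recurse into nested object schemas
        findings.extend(optional_properties(sub_schema, f"{path}.{name}"))
    if isinstance(schema.get("items"), dict):
        findings.extend(optional_properties(schema["items"], f"{path}[]"))
    return findings


# The two schemas from the example above
wrong = {"properties": {"name": {"type": "string"}, "id": {"type": "integer"}}}
right = {**wrong, "required": ["id"]}

print(optional_properties(wrong))
print(optional_properties(right))
```

Optional fields are not errors in themselves; the list is a prompt to confirm that each one really is allowed to be absent.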

12.3 Not Handling Empty Arrays

# WRONG: Assumes non-empty
first = result["endpoints"][0]  # IndexError if empty

# RIGHT: Check first
endpoints = result.get("endpoints", [])
if endpoints:
    first = endpoints[0]
else:
    print("No endpoints found")

12.4 Schema Mismatch with Prompt

# WRONG: Prompt asks for X, schema expects Y
prompt = "List all functions"
schema = {"properties": {"endpoints": ...}}  # Mismatch!

# RIGHT: Align prompt and schema
prompt = "Extract API endpoints as JSON"
schema = {"properties": {"endpoints": ...}}  # Match!

13. Extension Ideas

OpenAPI generator: Output OpenAPI/Swagger format
TypeScript types: Generate TypeScript interfaces from schema
Schema inference: Learn schema from examples
Multi-format: Extract to JSON, YAML, or XML
Incremental extraction: Update spec when code changes
Validation reports: Detailed schema compliance reports

14. Summary

This project teaches you to:

Design JSON Schemas for data extraction
Use --json-schema for validated output
Handle validation failures gracefully
Build type-safe extraction pipelines
Version schemas for evolving data

The key insight is that schemas transform extraction from “parsing free text” to “validating structured data.” Instead of hoping Claude’s output is parseable, you define exactly what you expect and validate it automatically.

Key takeaway: Schema validation is about contracts. You define a contract (the schema) that Claude’s output must satisfy. When it does, you have guaranteed structure. When it doesn’t, you know exactly what went wrong. This transforms unreliable text extraction into a robust data pipeline.