Project 27: Schema-Validated Output - Structured Data Extraction

Build a structured data extraction pipeline using --json-schema to ensure Claude’s output matches expected formats: extract API specs from code, generate typed data from unstructured input, and validate outputs against schemas.

Quick Reference

Attribute       Value
Difficulty      Advanced
Time Estimate   1-2 weeks
Language        Python (Alternatives: TypeScript, Go)
Prerequisites   Projects 24-26 completed, JSON Schema understanding
Key Topics      JSON Schema, structured output, data extraction, schema validation
Main Book       “Understanding JSON Schema” (json-schema.org)

1. Learning Objectives

By completing this project, you will:

  1. Master JSON Schema design: Create schemas that capture complex data structures
  2. Use schema-validated output: Apply --json-schema for guaranteed structure
  3. Handle validation failures: Implement retry and fallback strategies
  4. Design for extraction: Craft prompts and schemas for reliable data extraction
  5. Evolve schemas safely: Version schemas without breaking existing data
  6. Build robust pipelines: Create end-to-end extraction workflows with validation

2. Real World Outcome

When complete, you’ll have a pipeline that extracts structured data reliably:

$ python extract_api.py --source ./src/routes --schema api-spec.json

Extracting API specification...

Schema: api-spec.json
- endpoints: array of objects
- each endpoint: method, path, parameters, response

Validation passed!

Extracted API Specification:

{
  "endpoints": [
    {
      "method": "GET",
      "path": "/users/{id}",
      "parameters": [
        {"name": "id", "type": "string", "required": true}
      ],
      "response": {
        "type": "object",
        "properties": {
          "id": "string",
          "name": "string",
          "email": "string"
        }
      }
    },
    {
      "method": "POST",
      "path": "/users",
      "parameters": [
        {"name": "name", "type": "string", "required": true},
        {"name": "email", "type": "string", "required": true}
      ],
      "response": {
        "type": "object",
        "properties": {
          "id": "string",
          "created": "boolean"
        }
      }
    }
  ]
}

Saved to: api-spec-output.json

Why Schema Validation Matters

Without schema validation:

Claude: "The API has GET /users and POST /users endpoints..."
You: *manually parse this text*
Result: Fragile, error-prone extraction

With schema validation:

Claude: {"endpoints": [{"method": "GET", ...}, ...]}
You: *validate against schema, use directly*
Result: Reliable, typed data

3. The Core Question You’re Answering

“How do I ensure Claude’s output matches a specific structure, enabling reliable data extraction and integration?”

Unstructured LLM output is hard to parse reliably. JSON Schema validation confirms that each response has the structure you expect and rejects the ones that do not. This enables:

  • Type-safe downstream processing
  • Consistent data pipelines
  • Automated error detection
  • Version-controlled output formats

4. Concepts You Must Understand First

Stop and research these before coding:

4.1 JSON Schema Basics

JSON Schema defines the structure of JSON documents:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "integer", "minimum": 0},
    "email": {"type": "string", "format": "email"}
  },
  "required": ["name", "email"]
}

Key constructs:

Construct    Purpose               Example
type         Data type             "string", "object", "array"
properties   Object fields         {"name": {"type": "string"}}
required     Mandatory fields      ["name", "email"]
items        Array element schema  {"items": {"type": "string"}}
enum         Fixed values          ["GET", "POST", "PUT"]
$ref         Schema reuse          {"$ref": "#/definitions/User"}

Reference: json-schema.org - “Understanding JSON Schema”
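
To make the constructs in the table concrete, here is a minimal sketch of how a validator might apply type, enum, required, properties, and items. It is a standard-library stand-in that covers only that subset; the function name check and its error format are illustrative assumptions, and a real project would use the jsonschema package instead.

```python
from typing import Any

# JSON Schema type names mapped to Python types (subset only).
_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
    "null": type(None),
}


def check(instance: Any, schema: dict, path: str = "$") -> list[str]:
    """Return a list of error strings; an empty list means the instance passed."""
    errors: list[str] = []

    expected = schema.get("type")
    if expected is not None:
        ok = isinstance(instance, _TYPES[expected])
        # In Python, bool is a subclass of int, but JSON treats them as distinct.
        if expected in ("integer", "number") and isinstance(instance, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}"]

    if "enum" in schema and instance not in schema["enum"]:
        errors.append(f"{path}: {instance!r} is not one of {schema['enum']}")

    if isinstance(instance, dict):
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"{path}: missing required field '{name}'")
        for name, subschema in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(check(instance[name], subschema, f"{path}.{name}"))

    if isinstance(instance, list) and "items" in schema:
        for index, item in enumerate(instance):
            errors.extend(check(item, schema["items"], f"{path}[{index}]"))

    return errors


user_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name"],
}

print(check({"name": "Ada", "age": 36, "tags": ["admin"]}, user_schema))  # []
print(check({"age": "36", "tags": ["admin", 7]}, user_schema))
```

The second call reports three problems: the missing name, the string-valued age, and the non-string entry at tags[1].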

4.2 Claude’s Schema Support

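# Use --json-schema to enforce output structure.
# The schema is passed inline as a JSON string, as the pipeline in section 10 does;
# confirm the exact argument form with `claude --help` for your installed version.
claude -p "Extract API endpoints" \
  --json-schema "$(cat ./schemas/api.json)" \
  --output-format json

# Claude's output is constrained to the schema.
# Still validate the result yourself before using it downstream.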

Key Questions:

  • What happens when output doesn’t match the schema?
  • Are there schema complexity limits?
  • How does Claude use the schema to guide generation?

Reference: Claude Code documentation - “--json-schema”
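
The pipeline in section 10 passes the schema to --json-schema as an inline JSON string. A minimal sketch of assembling that command from Python follows; the helper name build_claude_command is an assumption for illustration, and only the flags already shown in this guide are used. Nothing is executed here, so the list can be inspected or logged before it is handed to subprocess.run.

```python
import json


def build_claude_command(prompt: str, schema: dict) -> list[str]:
    """Return the argv list for one schema-constrained, non-interactive run."""
    return [
        "claude", "-p", prompt,
        # Inline JSON string, mirroring the subprocess call in the section 10 pipeline.
        "--json-schema", json.dumps(schema),
        "--output-format", "json",
    ]


schema = {
    "type": "object",
    "properties": {"endpoints": {"type": "array"}},
    "required": ["endpoints"],
}
command = build_claude_command("Extract API endpoints", schema)
print(command[:4])
```

Passing the arguments as a list, rather than one shell string, avoids quoting problems when the schema contains quotes or braces.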

4.3 Schema Design Patterns

┌──────────────────────────────────────────────────────────────────┐
│                    SCHEMA DESIGN PATTERNS                        │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  FLAT SCHEMA (Simple)                                            │
│  ├── Single-level properties                                     │
│  ├── Easy to understand and validate                             │
│  └── Good for simple data extraction                             │
│                                                                  │
│  NESTED SCHEMA (Complex)                                         │
│  ├── Objects within objects                                      │
│  ├── Captures relationships                                      │
│  └── Requires careful path navigation                            │
│                                                                  │
│  ARRAY SCHEMA (Collections)                                      │
│  ├── Lists of structured items                                   │
│  ├── Define item schema once                                     │
│  └── Handles variable-length output                              │
│                                                                  │
│  UNION SCHEMA (Variants)                                         │
│  ├── oneOf, anyOf for alternatives                               │
│  ├── Handles different output types                              │
│  └── More complex validation logic                               │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
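
As a concrete companion to the four patterns, the sketch below writes one small schema of each kind as a Python dict. The field names (package, license, kind, since) are made up for illustration and do not come from any particular project.

```python
import json

# FLAT: single-level properties.
flat = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "version": {"type": "string"}},
}

# NESTED: an object inside an object.
nested = {
    "type": "object",
    "properties": {
        "package": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "license": {"type": "string"}},
        }
    },
}

# ARRAY: the item schema is defined once and applies to every element.
array = {
    "type": "array",
    "items": {"type": "object", "properties": {"path": {"type": "string"}}},
}

# UNION: oneOf accepts a value that matches exactly one alternative.
union = {
    "oneOf": [
        {
            "type": "object",
            "properties": {"kind": {"enum": ["todo"]}, "text": {"type": "string"}},
            "required": ["kind", "text"],
        },
        {
            "type": "object",
            "properties": {"kind": {"enum": ["deprecation"]}, "since": {"type": "string"}},
            "required": ["kind", "since"],
        },
    ]
}

for label, schema in [("flat", flat), ("nested", nested), ("array", array), ("union", union)]:
    print(f"{label}: {json.dumps(schema)[:60]}")
```

In the union example, the single-value enum on kind acts as a discriminator, which tends to make the alternatives easier to tell apart than two loosely overlapping object shapes.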

5. Questions to Guide Your Design

5.1 What Data to Extract?

Use Case            Extraction Target                  Schema Complexity
API documentation   Endpoints, parameters, responses   Medium
Code comments       TODO items, deprecation notices    Low
Type definitions    Interfaces, types, relationships   High
Configuration       Settings, defaults, options        Medium
Dependencies        Package names, versions            Low

5.2 How Strict Should the Schema Be?

// LOOSE: Allows additional properties
{
  "type": "object",
  "properties": {"name": {"type": "string"}},
  "additionalProperties": true  // Claude can add extra fields
}

// STRICT: Only defined properties
{
  "type": "object",
  "properties": {"name": {"type": "string"}},
  "additionalProperties": false,  // No extra fields allowed
  "required": ["name"]
}

// ENUM: Fixed values only
{
  "method": {
    "type": "string",
    "enum": ["GET", "POST", "PUT", "DELETE", "PATCH"]
  }
}

Trade-offs:

  • Strict = More predictable, less flexible
  • Loose = More flexible, harder to process
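
To see the strict/loose difference in practice, the sketch below lists the keys that a schema with "additionalProperties": false would reject. It is a simplification under stated assumptions: the helper name unexpected_fields is hypothetical, and the cases where additionalProperties is itself a schema, or where patternProperties is used, are treated as "anything goes".

```python
def unexpected_fields(data: dict, schema: dict) -> list[str]:
    """List keys that a schema with additionalProperties: false would reject."""
    if schema.get("additionalProperties", True) is not False:
        return []  # loose schema: extra keys are allowed
    declared = set(schema.get("properties", {}))
    return sorted(key for key in data if key not in declared)


loose = {"type": "object", "properties": {"name": {"type": "string"}}}
strict = {**loose, "additionalProperties": False, "required": ["name"]}

payload = {"name": "users-api", "notes": "added by the model"}
print(unexpected_fields(payload, loose))   # []
print(unexpected_fields(payload, strict))  # ['notes']
```

The same payload passes the loose schema and fails the strict one, which is the trade-off in miniature: the strict schema surfaces the surprise immediately, while the loose one leaves it for downstream code to handle.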

5.3 How to Handle Failures?

┌──────────────────────────────────────────────────────────────────┐
│                    FAILURE HANDLING STRATEGY                     │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Schema validation fails                                         │
│       │                                                          │
│       ├── Retry with stricter prompt                             │
│       │   "Please output ONLY valid JSON matching this schema"   │
│       │                                                          │
│       ├── Retry with simpler schema                              │
│       │   Remove optional fields, loosen constraints             │
│       │                                                          │
│       ├── Extract partial results                                │
│       │   Use what validates, flag rest as incomplete            │
│       │                                                          │
│       └── Return detailed error                                  │
│           Show what failed and why                               │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
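
The strategies in the diagram can be chained in order. The sketch below is one possible arrangement, not a prescribed one: run is a hypothetical callable standing in for a real Claude call plus validation (it returns the validated data, or None on failure), and the partial-results branch is left out to keep the example short.

```python
from typing import Callable, Optional

# A runner takes (prompt, schema) and returns validated data, or None on failure.
Runner = Callable[[str, dict], Optional[dict]]


def extract_with_fallbacks(run: Runner, prompt: str, schema: dict, simpler_schema: dict) -> dict:
    """Try each strategy in order and report which one succeeded."""
    attempts = [
        ("original", prompt, schema),
        ("prompt retry", prompt + "\nPlease output ONLY valid JSON matching this schema.", schema),
        ("simpler schema", prompt, simpler_schema),
    ]
    for strategy, attempt_prompt, attempt_schema in attempts:
        data = run(attempt_prompt, attempt_schema)
        if data is not None:
            return {"ok": True, "strategy": strategy, "data": data}
    # Last branch of the diagram: return a detailed error for human review.
    return {"ok": False, "strategy": None, "error": "all strategies failed validation"}


def fake_run(prompt: str, schema: dict) -> Optional[dict]:
    # Stand-in for a real call: pretend only a schema without
    # required fields can be satisfied.
    return {"endpoints": []} if "required" not in schema else None


result = extract_with_fallbacks(
    fake_run,
    "Extract API endpoints",
    {"type": "object", "required": ["endpoints", "version"]},
    {"type": "object"},
)
print(result["strategy"])  # simpler schema
```

Recording which strategy succeeded is useful later: data that only validated against the simplified schema may need a different level of trust downstream.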

6. Thinking Exercise

Design an API Spec Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "API Specification",
  "type": "object",
  "properties": {
    "endpoints": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "method": {
            "type": "string",
            "enum": ["GET", "POST", "PUT", "DELETE", "PATCH"]
          },
          "path": {
            "type": "string",
            "pattern": "^/.*"
          },
          "description": {
            "type": "string"
          },
          "parameters": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": {"type": "string"},
                "type": {"type": "string"},
                "required": {"type": "boolean"},
                "location": {
                  "type": "string",
                  "enum": ["path", "query", "body", "header"]
                }
              },
              "required": ["name", "type"]
            }
          },
          "response": {
            "type": "object",
            "properties": {
              "type": {"type": "string"},
              "properties": {
                "type": "object",
                "additionalProperties": {"type": "string"}
              }
            }
          }
        },
        "required": ["method", "path"]
      }
    },
    "version": {
      "type": "string"
    }
  },
  "required": ["endpoints"]
}

Design Questions

  1. What if an endpoint has no parameters?
    • Parameters array can be empty []
    • Or make it optional (not in required)
    • Handle both in downstream code
  2. How do you handle response types?
    • Simple: Just record the type name
    • Complex: Full recursive schema
    • Practical: Use OpenAPI-like references
  3. Should you allow unknown methods?
    • Strict (enum): Only known methods
    • Loose (no enum): Accept any string
    • Consider custom methods like SUBSCRIBE
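
Design questions 1 and 3 both end up as small decisions in downstream code. A minimal sketch, assuming a hypothetical normalize_endpoint helper and an allow_custom_methods switch that are not part of the original pipeline:

```python
KNOWN_METHODS = {"GET", "POST", "PUT", "DELETE", "PATCH"}


def normalize_endpoint(endpoint: dict, allow_custom_methods: bool = False) -> dict:
    """Return a copy with 'parameters' always present and the method checked."""
    method = endpoint["method"].upper()
    if method not in KNOWN_METHODS and not allow_custom_methods:
        raise ValueError(f"unknown HTTP method: {method}")
    normalized = dict(endpoint, method=method)
    # Treat a missing key and an explicit [] the same way downstream.
    normalized["parameters"] = list(endpoint.get("parameters") or [])
    return normalized


print(normalize_endpoint({"method": "GET", "path": "/health"}))
print(normalize_endpoint({"method": "SUBSCRIBE", "path": "/events"}, allow_custom_methods=True))
```

Normalizing once at the boundary means the rest of the code can iterate over endpoint["parameters"] without checking whether the key exists.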

7. The Interview Questions They’ll Ask

7.1 “How do you ensure structured output from an LLM?”

Good answer: Use JSON Schema validation with Claude’s --json-schema flag. This:

  1. Guides Claude to produce the expected structure
  2. Validates output automatically
  3. Fails fast on malformed data
  4. Enables type-safe downstream processing

7.2 “What is JSON Schema and how would you use it?”

Key points:

  • Declarative format for describing JSON structure
  • Defines types, required fields, constraints
  • Enables automated validation
  • Standard format (draft-07 is widely used; 2020-12 is the most recent draft)

A minimal validation example using the jsonschema package:

from jsonschema import validate, ValidationError

schema = {"type": "object", "required": ["name"]}
data = {"name": "test"}

try:
    validate(instance=data, schema=schema)
    print("Valid!")
except ValidationError as e:
    print(f"Invalid: {e.message}")

7.3 “How do you handle schema validation failures?”

Strategies:

  1. Retry: Ask Claude again with emphasis on structure
  2. Simplify: Use a less strict schema
  3. Partial: Accept valid portions, flag rest
  4. Transform: Fix common errors programmatically
  5. Fallback: Return error with details for human review
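
Strategies 3 and 4 (Partial and Transform) can be combined in one pass over the extracted endpoints. The sketch below is an illustration under simple assumptions: the function name split_valid_endpoints is hypothetical, the only repair attempted is normalizing the method's case, and validity is reduced to "known method plus a path starting with /".

```python
ALLOWED_METHODS = {"GET", "POST", "PUT", "DELETE", "PATCH"}


def split_valid_endpoints(endpoints: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (accepted, flagged); flagged entries carry a 'reason' for review."""
    accepted, flagged = [], []
    for endpoint in endpoints:
        fixed = dict(endpoint)
        method = fixed.get("method")
        # Transform: repair a common, unambiguous error (method in lowercase).
        if isinstance(method, str):
            method = method.strip().upper()
            fixed["method"] = method
        path = fixed.get("path")
        if not isinstance(method, str) or method not in ALLOWED_METHODS:
            flagged.append({"endpoint": endpoint, "reason": "invalid or missing method"})
        elif not isinstance(path, str) or not path.startswith("/"):
            flagged.append({"endpoint": endpoint, "reason": "invalid or missing path"})
        else:
            accepted.append(fixed)
    return accepted, flagged


raw = [
    {"method": "get", "path": "/users"},
    {"method": "FETCH", "path": "/x"},
    {"method": "POST"},
]
accepted, flagged = split_valid_endpoints(raw)
print(accepted)                        # [{'method': 'GET', 'path': '/users'}]
print([f["reason"] for f in flagged])  # ['invalid or missing method', 'invalid or missing path']
```

Flagged entries keep the original, unmodified endpoint so a reviewer sees exactly what the model produced.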

7.4 โ€œWhat are the trade-offs of strict vs loose schemas?โ€

Aspect            Strict Schema      Loose Schema
Validation        All or nothing     Partial success
Flexibility       Low                High
Predictability    High               Low
Error detection   Immediate          Deferred
Evolution         Breaking changes   Easier migration

7.5 “How would you version schemas for evolving data?”

Approaches:

  • Additive: Add optional fields (backward compatible)
  • Deprecation: Mark fields as deprecated before removal
  • Version header: Include schema version in output
  • Multiple schemas: Support multiple versions simultaneously
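
The "version header" and "additive" approaches work well together: each stored document records its schema version, and small migration functions bring older documents up to date. A minimal sketch, where the field name schemaVersion and the v1-to-v2 change (adding an optional description) are assumptions chosen for illustration:

```python
CURRENT_VERSION = 2


def migrate_v1_to_v2(doc: dict) -> dict:
    """v2 adds an optional 'description' per endpoint (an additive change)."""
    endpoints = [dict(e, description=e.get("description", "")) for e in doc["endpoints"]]
    return {**doc, "schemaVersion": 2, "endpoints": endpoints}


# One migration per version step; new versions append an entry here.
MIGRATIONS = {1: migrate_v1_to_v2}


def upgrade(doc: dict) -> dict:
    """Apply migrations one step at a time until the document is current."""
    version = doc.get("schemaVersion", 1)  # documents without a header are treated as v1
    while version < CURRENT_VERSION:
        doc = MIGRATIONS[version](doc)
        version = doc["schemaVersion"]
    if version > CURRENT_VERSION:
        raise ValueError(f"unsupported schema version: {version}")
    return doc


old = {"endpoints": [{"method": "GET", "path": "/users"}]}
print(upgrade(old))
```

Because each migration handles a single step, supporting several versions at once only requires keeping the chain of step functions intact.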

8. Hints in Layers

Hint 1: Start Simple

Begin with a flat schema:

{
  "type": "object",
  "properties": {
    "endpoints": {
      "type": "array",
      "items": {"type": "string"}
    }
  }
}

Just extract endpoint paths as strings first.

Hint 2: Use Enums for Fixed Values

Constrain method values:

{
  "method": {
    "type": "string",
    "enum": ["GET", "POST", "PUT", "DELETE", "PATCH"]
  }
}

This catches invalid methods immediately.

Hint 3: Handle Arrays Carefully

Define item schemas for consistency:

{
  "parameters": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "type": {"type": "string"}
      },
      "required": ["name", "type"]
    }
  }
}

Hint 4: Test with Edge Cases

Try:

  • Empty arrays
  • Missing optional fields
  • Unexpected values
  • Very long strings
  • Unicode characters
  • Nested objects at max depth
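
One way to exercise these edge cases is a small table of inputs run through the code that consumes the extracted data. The describe function below is a hypothetical consumer written for this example; the point is the shape of the test table, not the function itself.

```python
def describe(endpoint: dict) -> str:
    """One-line summary that must not fail on sparse or unusual input."""
    params = endpoint.get("parameters") or []
    names = ", ".join(str(p.get("name", "?")) for p in params)
    path = str(endpoint.get("path", ""))
    if len(path) > 60:
        path = path[:57] + "..."  # keep log lines readable
    return f"{endpoint.get('method', '?')} {path} ({names or 'no parameters'})"


EDGE_CASES = [
    {"method": "GET", "path": "/users", "parameters": []},    # empty array
    {"method": "GET", "path": "/users"},                       # missing optional field
    {"method": "TRACE", "path": "/debug"},                     # unexpected value
    {"method": "GET", "path": "/" + "a" * 5000},               # very long string
    {"method": "GET", "path": "/café/{id}", "parameters": [{"name": "id"}]},  # Unicode
]

for case in EDGE_CASES:
    print(describe(case))
```

Each new failure found in real extractions can be added to the table as one more entry, so the list grows into a regression suite.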

9. Books That Will Help

Topic             Book                                                    Chapter/Section
JSON Schema       “Understanding JSON Schema”                             All (free online)
Data validation   “Designing Data-Intensive Applications” by Kleppmann    Ch. 4: Encoding and Evolution
API design        “RESTful Web APIs” by Richardson & Ruby                 Ch. 3-4: Resources and Representations
Type systems      “Types and Programming Languages” by Pierce             Ch. 1-3: Introduction to types
Data modeling     “Data Model Patterns” by Hay                            Ch. 2-4: Entity patterns

10. Implementation Guide

10.1 Complete Extraction Pipeline

import json
import subprocess
from dataclasses import dataclass, field
from typing import Optional

from jsonschema import Draft7Validator


@dataclass
class ExtractionResult:
    """Outcome of a single extraction attempt."""
    success: bool
    data: Optional[dict] = None
    error: Optional[str] = None
    # default_factory gives every result its own list instead of a None default
    validation_errors: list = field(default_factory=list)

class SchemaExtractor:
    def __init__(self, schema_path: str):
        with open(schema_path) as f:
            self.schema = json.load(f)
        self.validator = Draft7Validator(self.schema)

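    def extract(self, prompt: str, max_retries: int = 3) -> ExtractionResult:
        """Extract data with schema validation, retrying with error feedback."""
        result = ExtractionResult(success=False, error="No attempts were made")
        attempt_prompt = prompt
        for attempt in range(max_retries):
            result = self._run_extraction(attempt_prompt)

            if result.success:
                return result

            # Rebuild the retry prompt from the original each time so that
            # feedback from earlier attempts does not pile up.
            if attempt < max_retries - 1:
                attempt_prompt = self._enhance_prompt(prompt, result.validation_errors)
                print(f"Retry {attempt + 1}: Enhancing prompt...")

        return result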

    def _run_extraction(self, prompt: str) -> ExtractionResult:
        """Run single extraction attempt."""
        try:
            # Run Claude with schema
            result = subprocess.run(
                [
                    "claude", "-p", prompt,
                    "--json-schema", json.dumps(self.schema),
                    "--output-format", "json"
                ],
                capture_output=True,
                text=True,
                timeout=120
            )

            if result.returncode != 0:
                return ExtractionResult(
                    success=False,
                    error=f"Claude error: {result.stderr}"
                )

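            # Parse the CLI's JSON envelope
            response = json.loads(result.stdout)
            # Assumption: depending on the CLI version, schema-constrained output
            # may arrive under "structured_output" rather than "result"; check both.
            data = response.get("structured_output")
            if data is None:
                data = response.get("result")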

            # If result is a string, try to parse as JSON
            if isinstance(data, str):
                try:
                    data = json.loads(data)
                except json.JSONDecodeError:
                    return ExtractionResult(
                        success=False,
                        error="Result is not valid JSON"
                    )

            # Validate against schema
            validation_errors = list(self.validator.iter_errors(data))

            if validation_errors:
                return ExtractionResult(
                    success=False,
                    validation_errors=[str(e) for e in validation_errors]
                )

            return ExtractionResult(success=True, data=data)

        except subprocess.TimeoutExpired:
            return ExtractionResult(success=False, error="Timeout")
        except json.JSONDecodeError as e:
            return ExtractionResult(success=False, error=f"JSON parse error: {e}")

    def _enhance_prompt(self, original: str, errors: list) -> str:
        """Enhance prompt based on validation errors."""
        error_summary = "\n".join(errors[:3]) if errors else "Schema mismatch"

        return f"""{original}

IMPORTANT: Your previous response had validation errors:
{error_summary}

Please ensure your response EXACTLY matches the JSON schema.
Output ONLY valid JSON, no additional text."""


def extract_api_spec(source_dir: str, schema_path: str) -> dict:
    """Extract API specification from source code."""
    extractor = SchemaExtractor(schema_path)

    prompt = f"""Analyze the source code in {source_dir} and extract the API specification.

Look for:
- Route definitions (GET, POST, PUT, DELETE endpoints)
- Path parameters (like :id or {{id}})
- Query parameters
- Request body parameters
- Response types

Output a JSON object matching the provided schema with all endpoints found."""

    result = extractor.extract(prompt)

    if result.success:
        return result.data
    else:
        raise Exception(f"Extraction failed: {result.error or result.validation_errors}")


# Example usage
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--source", required=True, help="Source directory")
    parser.add_argument("--schema", required=True, help="JSON Schema file")
    parser.add_argument("--output", default="output.json", help="Output file")
    args = parser.parse_args()

    print(f"Extracting from {args.source} using {args.schema}...")

    try:
        data = extract_api_spec(args.source, args.schema)

        with open(args.output, "w") as f:
            json.dump(data, f, indent=2)

        print(f"\nSuccess! Saved to {args.output}")
        print(f"Found {len(data.get('endpoints', []))} endpoints")

    except Exception as e:
        print(f"\nError: {e}")
        # SystemExit works without relying on the interactive-only exit() helper
        raise SystemExit(1)

10.2 Schema Definitions

Save as schemas/api-spec.json:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "API Specification",
  "type": "object",
  "properties": {
    "endpoints": {
      "type": "array",
      "items": {
        "$ref": "#/definitions/Endpoint"
      }
    },
    "version": {
      "type": "string"
    },
    "baseUrl": {
      "type": "string"
    }
  },
  "required": ["endpoints"],
  "definitions": {
    "Endpoint": {
      "type": "object",
      "properties": {
        "method": {
          "type": "string",
          "enum": ["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS", "HEAD"]
        },
        "path": {
          "type": "string"
        },
        "description": {
          "type": "string"
        },
        "parameters": {
          "type": "array",
          "items": {
            "$ref": "#/definitions/Parameter"
          }
        },
        "requestBody": {
          "$ref": "#/definitions/RequestBody"
        },
        "response": {
          "$ref": "#/definitions/Response"
        }
      },
      "required": ["method", "path"]
    },
    "Parameter": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "type": {"type": "string"},
        "required": {"type": "boolean", "default": false},
        "location": {
          "type": "string",
          "enum": ["path", "query", "header", "cookie"]
        },
        "description": {"type": "string"}
      },
      "required": ["name", "type", "location"]
    },
    "RequestBody": {
      "type": "object",
      "properties": {
        "contentType": {"type": "string"},
        "schema": {"type": "object"}
      }
    },
    "Response": {
      "type": "object",
      "properties": {
        "statusCode": {"type": "integer"},
        "contentType": {"type": "string"},
        "schema": {"type": "object"}
      }
    }
  }
}
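
The schema above leans on $ref to reuse definitions. To show what a validator does with those references, here is a minimal sketch that resolves local references by hand. It handles only references of the form #/definitions/..., the function names are illustrative, and in practice the jsonschema package resolves references for you.

```python
def resolve_ref(root: dict, ref: str) -> dict:
    """Follow a local reference such as '#/definitions/Endpoint' inside root."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local references are supported: {ref}")
    node = root
    for part in ref[2:].split("/"):
        # JSON Pointer escapes: '~1' stands for '/', '~0' stands for '~'.
        node = node[part.replace("~1", "/").replace("~0", "~")]
    return node


def expand(schema, root: dict, _seen: tuple = ()):
    """Return a copy of schema with local $ref entries replaced by their targets."""
    if isinstance(schema, list):
        return [expand(item, root, _seen) for item in schema]
    if not isinstance(schema, dict):
        return schema
    if "$ref" in schema:
        ref = schema["$ref"]
        if ref in _seen:
            raise ValueError(f"recursive reference: {ref}")
        return expand(resolve_ref(root, ref), root, _seen + (ref,))
    return {key: expand(value, root, _seen) for key, value in schema.items()}


api_schema = {
    "type": "object",
    "properties": {
        "endpoints": {"type": "array", "items": {"$ref": "#/definitions/Endpoint"}}
    },
    "definitions": {
        "Endpoint": {
            "type": "object",
            "properties": {
                "parameters": {"type": "array", "items": {"$ref": "#/definitions/Parameter"}}
            },
        },
        "Parameter": {"type": "object", "properties": {"name": {"type": "string"}}},
    },
}

inlined = expand(api_schema["properties"], api_schema)
print(inlined["endpoints"]["items"]["properties"]["parameters"]["items"])
# {'type': 'object', 'properties': {'name': {'type': 'string'}}}
```

Inlining like this is mainly useful for inspection or for tools that do not understand $ref; recursive schemas cannot be inlined, which is why the sketch raises on a cycle.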

10.3 Type Generation from Schema

from typing import TypedDict, List, Optional

# Generated from schema - can be auto-generated
class Parameter(TypedDict, total=False):
    name: str
    type: str
    required: bool
    location: str
    description: str

class Response(TypedDict, total=False):
    statusCode: int
    contentType: str
    schema: dict

class RequestBody(TypedDict, total=False):
    contentType: str
    schema: dict

class Endpoint(TypedDict, total=False):
    method: str
    path: str
    description: str
    parameters: List[Parameter]
    requestBody: RequestBody
    response: Response

class APISpec(TypedDict, total=False):
    endpoints: List[Endpoint]
    version: str
    baseUrl: str


def process_spec(spec: APISpec) -> None:
    """Type-safe processing of extracted spec."""
    for endpoint in spec["endpoints"]:
        method = endpoint["method"]  # Type: str
        path = endpoint["path"]      # Type: str
        print(f"{method} {path}")

        params = endpoint.get("parameters", [])
        for param in params:
            print(f"  {param['location']}: {param['name']} ({param['type']})")
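
The comment above notes that these types can be auto-generated. A minimal sketch of such a generator follows, assuming the schema keeps its reusable types under "definitions" as in section 10.2. The function names and the type mapping are illustrative; nested inline objects, enums, and required-versus-optional distinctions are deliberately ignored.

```python
PY_TYPES = {"string": "str", "integer": "int", "number": "float", "boolean": "bool", "object": "dict"}


def python_type(prop: dict) -> str:
    """Map one property schema to a Python annotation string."""
    if "$ref" in prop:
        return prop["$ref"].rsplit("/", 1)[-1]  # '#/definitions/Parameter' -> 'Parameter'
    if prop.get("type") == "array":
        return f"List[{python_type(prop.get('items', {}))}]"
    return PY_TYPES.get(prop.get("type"), "Any")


def generate_typeddicts(schema: dict) -> str:
    """Emit TypedDict source code for each entry under 'definitions'."""
    lines = [
        # Postponed evaluation lets a class refer to one defined further down.
        "from __future__ import annotations",
        "from typing import Any, List, TypedDict",
        "",
    ]
    for name, definition in schema.get("definitions", {}).items():
        lines.append(f"class {name}(TypedDict, total=False):")
        props = definition.get("properties", {})
        if not props:
            lines.append("    pass")
        for field_name, prop in props.items():
            lines.append(f"    {field_name}: {python_type(prop)}")
        lines.append("")
    return "\n".join(lines)


demo_schema = {
    "definitions": {
        "Endpoint": {
            "properties": {
                "path": {"type": "string"},
                "parameters": {"type": "array", "items": {"$ref": "#/definitions/Parameter"}},
            }
        },
        "Parameter": {
            "properties": {"name": {"type": "string"}, "required": {"type": "boolean"}}
        },
    }
}

source = generate_typeddicts(demo_schema)
print(source)
```

Writing the result to a module and regenerating it whenever the schema changes keeps the types and the schema from drifting apart.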

11. Learning Milestones

Milestone   Description              Verification
1           Simple schema works      Flat object validates
2           Nested schema works      Objects within objects
3           Array schema works       List of endpoints extracted
4           Enum validation works    Invalid methods rejected
5           Retry logic works        Failed extraction retries
6           Error messages helpful   Clear validation errors
7           Real extraction works    API spec from real code

12. Common Pitfalls

12.1 Overly Complex Schemas

// TOO COMPLEX: Claude may struggle
{
  "type": "object",
  "properties": {
    "level1": {
      "type": "object",
      "properties": {
        "level2": {
          "type": "object",
          "properties": {
            "level3": { /* ... */ }
          }
        }
      }
    }
  }
}

// BETTER: Flatten or use references
{
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {"$ref": "#/definitions/Item"}
    }
  },
  "definitions": {
    "Item": { /* ... */ }
  }
}
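
A quick way to notice a schema drifting toward the "too complex" shape is to measure how deeply it nests. The sketch below counts container levels through properties and items; the function name nesting_depth is an assumption, and $ref targets are intentionally not followed, so a schema that uses references reports a shallow depth.

```python
def nesting_depth(schema: dict) -> int:
    """Count container levels reachable through 'properties' and 'items'."""
    children = list(schema.get("properties", {}).values())
    if isinstance(schema.get("items"), dict):
        children.append(schema["items"])
    if not children:
        return 0  # scalar, empty object, or a $ref that is not followed
    return 1 + max(nesting_depth(child) for child in children)


deep = {
    "type": "object",
    "properties": {
        "level1": {
            "type": "object",
            "properties": {
                "level2": {
                    "type": "object",
                    "properties": {
                        "level3": {
                            "type": "object",
                            "properties": {"value": {"type": "string"}},
                        }
                    },
                }
            },
        }
    },
}

flat_with_refs = {
    "type": "object",
    "properties": {"items": {"type": "array", "items": {"$ref": "#/definitions/Item"}}},
    "definitions": {"Item": {"type": "object", "properties": {"value": {"type": "string"}}}},
}

print(nesting_depth(deep))            # 4
print(nesting_depth(flat_with_refs))  # 2
```

What counts as "too deep" is a judgment call for your own data; the number is only a prompt to consider flattening or extracting a definition.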

12.2 Missing Required Fields

// WRONG: Forgot to mark required
{
  "properties": {
    "name": {"type": "string"},
    "id": {"type": "integer"}
  }
  // id should be required!
}

// RIGHT: Explicit required array
{
  "properties": {
    "name": {"type": "string"},
    "id": {"type": "integer"}
  },
  "required": ["id"]
}
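
Forgotten required entries are easy to miss by eye. The sketch below lists every declared property that is not marked required, so each one can be reviewed as a deliberate choice. The helper name optional_fields and the path notation are assumptions for this example; $ref targets are not followed.

```python
def optional_fields(schema: dict, path: str = "$") -> list[str]:
    """List every declared property that is not in its object's 'required' array."""
    found = []
    required = set(schema.get("required", []))
    for name, prop in schema.get("properties", {}).items():
        child = f"{path}.{name}"
        if name not in required:
            found.append(child)
        found.extend(optional_fields(prop, child))
    if isinstance(schema.get("items"), dict):
        found.extend(optional_fields(schema["items"], f"{path}[]"))
    return found


wrong = {"properties": {"name": {"type": "string"}, "id": {"type": "integer"}}}
right = {**wrong, "required": ["id"]}

print(optional_fields(wrong))  # ['$.name', '$.id']
print(optional_fields(right))  # ['$.name']
```

Seeing $.id in the first list is the signal that the schema would accept a record without an identifier.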

12.3 Not Handling Empty Arrays

# WRONG: Assumes non-empty
endpoints = result["endpoints"][0]  # IndexError if empty

# RIGHT: Check first
endpoints = result.get("endpoints", [])
if endpoints:
    first = endpoints[0]
else:
    print("No endpoints found")

12.4 Schema Mismatch with Prompt

# WRONG: Prompt asks for X, schema expects Y
prompt = "List all functions"
schema = {"properties": {"endpoints": ...}}  # Mismatch!

# RIGHT: Align prompt and schema
prompt = "Extract API endpoints as JSON"
schema = {"properties": {"endpoints": ...}}  # Match!

13. Extension Ideas

  1. OpenAPI generator: Output OpenAPI/Swagger format
  2. TypeScript types: Generate TypeScript interfaces from schema
  3. Schema inference: Learn schema from examples
  4. Multi-format: Extract to JSON, YAML, or XML
  5. Incremental extraction: Update spec when code changes
  6. Validation reports: Detailed schema compliance reports

14. Summary

This project teaches you to:

  • Design JSON Schemas for data extraction
  • Use --json-schema for validated output
  • Handle validation failures gracefully
  • Build type-safe extraction pipelines
  • Version schemas for evolving data

The key insight is that schemas transform extraction from “parsing free text” to “validating structured data.” Instead of hoping Claude’s output is parseable, you define exactly what you expect and validate it automatically.

Key takeaway: Schema validation is about contracts. You define a contract (the schema) that Claude’s output must satisfy. When it does, you have guaranteed structure. When it doesn’t, you know exactly what went wrong. This transforms unreliable text extraction into a robust data pipeline.