Project 1: Schema Validator CLI
Build a CLI tool that validates JSON/YAML files against Pydantic schemas, showing detailed error messages and suggesting fixes, like a linter for your data.
Learning Objectives
By completing this project, you will:
- Understand Pydantic's core validation model - How data flows from raw input to validated Python objects
- Master error handling - Parse ValidationError structure and create user-friendly error messages
- Work with nested models - Build complex schemas with relationships between models
- Use Field configuration - Apply constraints like min_length, max_length, patterns, and custom descriptions
- Implement dynamic model loading - Load Pydantic models from Python files at runtime
- Build production CLI tools - Use Click/Typer for professional command-line interfaces
Theoretical Foundation
The Philosophy of Data Validation
Data validation is one of the most critical aspects of software engineering. Consider these scenarios:
- A user submits a form with an email like "not-an-email"
- An API receives JSON with a negative age value
- A configuration file contains an invalid database URL
- A CSV import has dates in the wrong format
Without proper validation, these errors propagate through your system, causing bugs that are hard to trace, security vulnerabilities, and corrupted data.
The Principle of Early Failure: Validate data at system boundaries, as soon as it enters your application. This is called "fail fast" or "parse, don't validate." Instead of checking data validity throughout your code, validate once at entry and work with guaranteed-valid data afterward.
```
┌─────────────────────────────────────────────────────────────┐
│                      SYSTEM BOUNDARY                        │
│                                                             │
│   External World                     Your Application       │
│   (Untrusted)                        (Trusted)              │
│                                                             │
│   ┌───────────┐      Validation      ┌──────────────────┐   │
│   │ Raw JSON  │    ┌──────────┐      │ Validated Model  │   │
│   │ {"age":   │───▶│ Pydantic │─────▶│ User(age=25)     │   │
│   │   "-5"}   │    │ Schema   │      │ # Type-safe!     │   │
│   └───────────┘    └──────────┘      └──────────────────┘   │
│                         │                                   │
│                         ▼                                   │
│                  ValidationError                            │
│                    (Rejected!)                              │
└─────────────────────────────────────────────────────────────┘
```
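This boundary pattern takes only a few lines in practice. The `User` model and `handle_request` function below are illustrative, not part of the project code:

```python
from pydantic import BaseModel, Field, ValidationError


class User(BaseModel):
    age: int = Field(ge=0)


def handle_request(raw: dict) -> User:
    """Validate at the boundary; everything past this point is trusted."""
    return User.model_validate(raw)


user = handle_request({"age": "25"})  # "25" is coerced to 25
print(user.age)

try:
    handle_request({"age": -5})  # rejected at the boundary
except ValidationError as e:
    print(e.errors()[0]["type"])
```

Code deeper in the application can now assume `user.age` is a non-negative `int` without re-checking.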
How Pydantic Validates Data
Pydantic V2 uses a multi-stage validation pipeline:
Stage 1: Input Parsing
Raw input (JSON string, dict, etc.) is converted to a Python dictionary. For model_validate_json(), this uses a Rust-based JSON parser for speed.
Stage 2: Type Coercion (Default Mode)
Pydantic attempts to convert values to their target types:
- `"123"` → `123` (string to int)
- `"true"` → `True` (string to bool)
- `1.0` → `1` (float to int; a float with a fractional part like `1.5` is rejected in Pydantic V2)
- `"2024-01-15"` → `date(2024, 1, 15)` (string to date)
Strict Mode disables coercion: types must match exactly.
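The difference is easy to see side by side (the `Item` model here is illustrative):

```python
from pydantic import BaseModel, ValidationError


class Item(BaseModel):
    count: int


# Default (lax) mode coerces the numeric string
coerced = Item.model_validate({"count": "3"})
print(coerced.count)  # 3

# Strict mode requires an actual int
try:
    Item.model_validate({"count": "3"}, strict=True)
except ValidationError as e:
    print(e.errors()[0]["type"])
```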
Stage 3: Constraint Validation
Field constraints are checked:
- `min_length`, `max_length` for strings
- `ge`, `gt`, `le`, `lt` for numbers
- `pattern` for regex matching
- Custom validators
Stage 4: Field Validators
Custom @field_validator functions run on individual fields.
Stage 5: Model Validators
@model_validator functions run with access to all fields for cross-field validation.
Stage 6: Object Construction
If all validation passes, the Pydantic model instance is created.
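Stages 4 and 5 can be sketched with a small model (the `Booking` fields here are illustrative):

```python
from pydantic import BaseModel, field_validator, model_validator


class Booking(BaseModel):
    start: int
    end: int

    # Stage 4: runs on individual fields
    @field_validator("start", "end")
    @classmethod
    def non_negative(cls, v: int) -> int:
        if v < 0:
            raise ValueError("must be non-negative")
        return v

    # Stage 5: runs with access to all fields
    @model_validator(mode="after")
    def end_after_start(self) -> "Booking":
        if self.end <= self.start:
            raise ValueError("end must be after start")
        return self


print(Booking(start=1, end=5))
```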
Understanding ValidationError
When validation fails, Pydantic raises ValidationError with rich error information:
```python
from pydantic import BaseModel, ValidationError, Field, EmailStr  # EmailStr needs the email-validator package


class User(BaseModel):
    name: str = Field(min_length=1)
    age: int = Field(ge=0)
    email: EmailStr


try:
    User(name="", age=-5, email="not-an-email")
except ValidationError as e:
    print(e.json(indent=2))
```
Output:
```json
[
  {
    "type": "string_too_short",
    "loc": ["name"],
    "msg": "String should have at least 1 character",
    "input": "",
    "ctx": {"min_length": 1}
  },
  {
    "type": "greater_than_equal",
    "loc": ["age"],
    "msg": "Input should be greater than or equal to 0",
    "input": -5,
    "ctx": {"ge": 0}
  },
  {
    "type": "value_error",
    "loc": ["email"],
    "msg": "value is not a valid email address: An email address must have an @-sign.",
    "input": "not-an-email"
  }
]
```
Error Structure Anatomy:
- `type`: Machine-readable error identifier (e.g., `"string_too_short"`)
- `loc`: Tuple path to the field (e.g., `["address", "city"]` for nested fields)
- `msg`: Human-readable error message
- `input`: The invalid value that was provided
- `ctx`: Additional context (constraint values, expected formats)
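A formatter can turn these entries into readable lines by joining `loc` with dots. This sketch operates on plain dicts of the shape returned by `e.errors()` (the sample entry is taken from the country-code error used throughout this guide):

```python
def format_error(err: dict) -> str:
    """Render one ValidationError.errors() entry as a single line."""
    # Nested field paths join with "." for display
    path = ".".join(str(part) for part in err["loc"])
    return f"{path}: {err['msg']} (got {err['input']!r})"


sample = {
    "type": "string_too_long",
    "loc": ("address", "country"),
    "msg": "String should have at most 2 characters",
    "input": "United States",
}
print(format_error(sample))
```

Note that `loc` parts can be integers (list indices), which is why each part goes through `str()` before joining.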
Field Configuration Deep Dive
The Field() function is how you configure individual fields:
```python
from pydantic import BaseModel, Field
from typing import Optional


class Product(BaseModel):
    # Required field with constraints
    name: str = Field(
        min_length=1,
        max_length=100,
        description="Product name",
        examples=["Widget Pro", "Gadget X"],
    )

    # Numeric constraints
    price: float = Field(
        gt=0,      # greater than
        le=10000,  # less than or equal
        description="Price in USD",
    )

    # Default value
    quantity: int = Field(
        default=0,
        ge=0,
        description="Stock quantity",
    )

    # Pattern matching
    sku: str = Field(
        pattern=r'^[A-Z]{2}-\d{4}$',
        description="Stock Keeping Unit (e.g., AB-1234)",
    )

    # Optional with None default
    description: Optional[str] = Field(
        default=None,
        max_length=1000,
    )
```
Available Constraints:
| Constraint | Applies To | Description |
|---|---|---|
| `gt` | Numbers | Greater than |
| `ge` | Numbers | Greater than or equal |
| `lt` | Numbers | Less than |
| `le` | Numbers | Less than or equal |
| `multiple_of` | Numbers | Must be divisible by |
| `min_length` | Strings, Lists | Minimum length |
| `max_length` | Strings, Lists | Maximum length |
| `pattern` | Strings | Regex pattern |
| `strict` | Any | Disable type coercion |
Nested Models and Complex Schemas
Real-world data is rarely flat. Pydantic handles nested models elegantly:
```python
from pydantic import BaseModel, Field
from typing import Optional, List
from datetime import datetime


class Address(BaseModel):
    street: str
    city: str
    state: str = Field(min_length=2, max_length=2)  # State code
    zip_code: str = Field(pattern=r'^\d{5}(-\d{4})?$')
    country: str = Field(default="US")


class ContactInfo(BaseModel):
    email: str
    phone: Optional[str] = None
    address: Optional[Address] = None


class Company(BaseModel):
    name: str
    founded: datetime
    employees: int = Field(ge=1)
    headquarters: Address
    contacts: List[ContactInfo] = Field(default_factory=list)
```
When validation fails on nested models, the loc tuple shows the full path:
```python
# Error loc: ["headquarters", "zip_code"]
# Means: company.headquarters.zip_code failed validation
```
Dynamic Model Loading
For a CLI tool, you need to load Pydantic models from user-provided files. Python's importlib makes this possible:
```python
import importlib.util
import sys
from pathlib import Path

from pydantic import BaseModel


def load_model_from_file(file_path: str, model_name: str):
    """Load a Pydantic model class from a Python file."""
    path = Path(file_path)

    # Create a module spec
    spec = importlib.util.spec_from_file_location(
        path.stem,  # module name from filename
        path,
    )
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot create a module spec for {file_path}")

    # Create the module and register it so imports inside it resolve
    module = importlib.util.module_from_spec(spec)
    sys.modules[path.stem] = module

    # Execute the module (runs the code)
    spec.loader.exec_module(module)

    # Get the model class
    if not hasattr(module, model_name):
        raise ValueError(f"Model '{model_name}' not found in {file_path}")
    model_class = getattr(module, model_name)

    # Verify it's a Pydantic model
    if not isinstance(model_class, type) or not issubclass(model_class, BaseModel):
        raise TypeError(f"'{model_name}' is not a Pydantic BaseModel")

    return model_class
```
Project Specification
Functional Requirements
Build a CLI tool called `pydantic-validate` that:

- Validates data files against Pydantic schemas
  - Accepts a schema file (a Python file with Pydantic models)
  - Accepts a data file (JSON or YAML)
  - Reports validation success or detailed errors
- Provides helpful error messages
  - Shows the field path for nested errors
  - Displays the invalid value
  - Suggests corrections when possible
  - Uses color-coding for readability
- Supports multiple records
  - Validates arrays of objects
  - Reports per-record errors
  - Shows summary statistics
- Handles multiple formats
  - JSON files
  - YAML files
  - Detects the format from the file extension
CLI Interface
```shell
# Basic usage
pydantic-validate --schema path/to/schema.py --model User --file data.json

# Validate YAML
pydantic-validate -s schema.py -m Config -f settings.yaml

# Verbose output
pydantic-validate -s schema.py -m User -f users.json --verbose

# Output as JSON (for automation)
pydantic-validate -s schema.py -m User -f data.json --output json
```
Example Schema File
```python
# schemas/user.py
from pydantic import BaseModel, Field, EmailStr
from typing import Optional, List
from datetime import date
from enum import Enum


class UserRole(str, Enum):
    admin = "admin"
    user = "user"
    guest = "guest"


class Address(BaseModel):
    street: str = Field(min_length=1)
    city: str = Field(min_length=1)
    country: str = Field(min_length=2, max_length=2, description="ISO 3166-1 alpha-2")
    postal_code: Optional[str] = None


class User(BaseModel):
    id: int = Field(ge=1)
    name: str = Field(min_length=1, max_length=100)
    email: EmailStr
    age: int = Field(ge=0, le=150)
    role: UserRole = Field(default=UserRole.user)
    tags: List[str] = Field(default_factory=list, max_length=10)
    address: Optional[Address] = None
    created_at: date
```
Example Data File
```json
[
  {
    "id": 1,
    "name": "John Doe",
    "email": "john@example.com",
    "age": 30,
    "role": "admin",
    "created_at": "2024-01-15"
  },
  {
    "id": -1,
    "name": "",
    "email": "not-an-email",
    "age": 200,
    "role": "superuser",
    "address": {
      "street": "123 Main St",
      "city": "NYC",
      "country": "United States"
    },
    "created_at": "invalid-date"
  }
]
```
Expected Output
```
Validating data.json against User schema...

✓ Record 1: Valid
✗ Record 2: 7 validation errors

  ├── id
  │   └── Input should be greater than or equal to 1 [type=greater_than_equal]
  │       Got: -1
  │       Expected: id ≥ 1
  │
  ├── name
  │   └── String should have at least 1 character [type=string_too_short]
  │       Got: "" (empty string)
  │       Expected: 1 ≤ length ≤ 100
  │
  ├── email
  │   └── value is not a valid email address [type=value_error]
  │       Got: "not-an-email"
  │       Suggestion: Add @ and domain (e.g., user@example.com)
  │
  ├── age
  │   └── Input should be less than or equal to 150 [type=less_than_equal]
  │       Got: 200
  │       Expected: 0 ≤ age ≤ 150
  │
  ├── role
  │   └── Input should be 'admin', 'user' or 'guest' [type=enum]
  │       Got: "superuser"
  │       Valid values: admin, user, guest
  │
  ├── address.country
  │   └── String should have at most 2 characters [type=string_too_long]
  │       Got: "United States" (13 characters)
  │       Suggestion: Use ISO 3166-1 alpha-2 code (e.g., "US")
  │
  └── created_at
      └── Input should be a valid date [type=date_parsing]
          Got: "invalid-date"
          Expected: ISO 8601 format (YYYY-MM-DD)

────────────────────────────────────────────────────
Summary: 1/2 records valid (50.0%)
────────────────────────────────────────────────────
```
Solution Architecture
Component Design
```
┌─────────────────────────────────────────────┐
│                  CLI Layer                  │
│              (Click/Typer App)              │
│  - Parse command line arguments             │
│  - Handle --verbose, --output flags         │
│  - Coordinate validation flow               │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│                Schema Loader                │
│  - Load Python files dynamically            │
│  - Extract Pydantic model classes           │
│  - Validate model is BaseModel subclass     │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│                 Data Reader                 │
│  - Detect format (JSON/YAML)                │
│  - Parse file contents                      │
│  - Handle single object vs array            │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│                  Validator                  │
│  - Apply Pydantic model to each record      │
│  - Collect ValidationErrors                 │
│  - Track success/failure counts             │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│               Error Formatter               │
│  - Parse error location paths               │
│  - Generate human-friendly messages         │
│  - Add suggestions based on error type      │
│  - Apply color coding                       │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│                  Reporter                   │
│  - Console output (tree format)             │
│  - JSON output (for automation)             │
│  - Summary statistics                       │
└─────────────────────────────────────────────┘
```
Key Design Decisions
- Separation of Concerns: Each component has a single responsibility
- Error Enrichment: Transform raw ValidationError into user-friendly format
- Pluggable Output: Support both human-readable and machine-readable output
- Graceful Degradation: Handle partial failures (some records valid, some not)
Data Flow
```python
# Pseudo-code for main validation flow
def validate_command(schema_path, model_name, data_path, output_format):
    # 1. Load the Pydantic model
    model_class = schema_loader.load(schema_path, model_name)

    # 2. Read and parse data
    data = data_reader.read(data_path)  # Returns dict or list of dicts

    # 3. Ensure we have a list
    records = data if isinstance(data, list) else [data]

    # 4. Validate each record
    results = []
    for i, record in enumerate(records):
        result = validator.validate(model_class, record)
        results.append(ValidationResult(
            index=i,
            record=record,
            success=result.success,
            errors=result.errors if not result.success else None,
        ))

    # 5. Format and output
    if output_format == "json":
        reporter.output_json(results)
    else:
        reporter.output_console(results)
```
Phased Implementation Guide
Phase 1: Basic CLI Structure (1-2 hours)
Goal: Create a working CLI that accepts arguments.
- Set up the project structure:

```
pydantic-validate/
├── pyproject.toml
├── src/
│   └── pydantic_validate/
│       ├── __init__.py
│       ├── cli.py
│       ├── loader.py
│       ├── reader.py
│       ├── validator.py
│       └── formatter.py
└── tests/
```

- Install dependencies:

```shell
pip install pydantic click pyyaml rich
```

- Create a basic CLI with Click:
  - Define `--schema`, `--model`, `--file` options
  - Add `--verbose` and `--output` flags
  - Validate that files exist

Checkpoint: `pydantic-validate --help` shows usage.
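A minimal Click skeleton for this phase might look like the following; the option names match the CLI spec above, and the command body is a stub to be filled in during later phases:

```python
import click


@click.command()
@click.option("--schema", "-s", "schema_path", required=True,
              type=click.Path(exists=True), help="Python file with Pydantic models")
@click.option("--model", "-m", "model_name", required=True,
              help="Name of the model class to validate against")
@click.option("--file", "-f", "data_path", required=True,
              type=click.Path(exists=True), help="JSON or YAML data file")
@click.option("--verbose", is_flag=True, help="Show extra detail")
@click.option("--output", type=click.Choice(["console", "json"]), default="console",
              help="Output format")
def cli(schema_path, model_name, data_path, verbose, output):
    """Validate a data file against a Pydantic schema."""
    click.echo(f"Validating {data_path} against {model_name}...")


if __name__ == "__main__":
    cli()
```

`click.Path(exists=True)` handles the "validate that files exist" requirement for free: Click rejects missing paths before the command body runs.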
Phase 2: Schema Loading (1-2 hours)
Goal: Dynamically load Pydantic models from files.
- Implement `loader.py`:
  - Use `importlib.util` to load Python files
  - Extract the specified model class
  - Validate it's a BaseModel subclass
- Handle errors:
  - File not found
  - Syntax errors in the schema file
  - Model not found in the file
  - Not a Pydantic model
Checkpoint: Can load a User model from a schema file.
Phase 3: Data Reading (1 hour)
Goal: Read JSON and YAML files.
- Implement `reader.py`:
  - Detect the format from the file extension
  - Parse JSON with `json.load()`
  - Parse YAML with `yaml.safe_load()`
  - Handle both single objects and arrays
- Error handling:
  - Invalid JSON/YAML syntax
  - Empty files
Checkpoint: Can read and parse sample data files.
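A sketch of `reader.py` under these requirements. YAML support is guarded by a local import so the JSON path works without pyyaml installed:

```python
import json
from pathlib import Path


def read_data(file_path: str):
    """Parse a JSON or YAML file, detecting the format from the extension."""
    path = Path(file_path)
    text = path.read_text(encoding="utf-8")

    if not text.strip():
        raise ValueError(f"{file_path} is empty")

    suffix = path.suffix.lower()
    if suffix in (".yaml", ".yml"):
        import yaml  # requires pyyaml
        return yaml.safe_load(text)
    if suffix == ".json":
        return json.loads(text)
    raise ValueError(f"Unsupported format: {suffix}")
```

Parse errors (`json.JSONDecodeError`, `yaml.YAMLError`) deliberately propagate so the CLI layer can report them with the file name.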
Phase 4: Core Validation (2 hours)
Goal: Validate records and collect errors.
- Implement `validator.py`:
  - Create a `ValidationResult` dataclass
  - Try to instantiate the model with each record
  - Catch `ValidationError` and extract error details
- Process errors:
  - Parse the `e.errors()` list
  - Store location, message, type, and input
Checkpoint: Can detect invalid records and list errors.
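A sketch of `validator.py` along these lines (the field names mirror the pseudo-code in the Data Flow section):

```python
from dataclasses import dataclass
from typing import Optional

from pydantic import BaseModel, ValidationError


@dataclass
class ValidationResult:
    index: int
    success: bool
    errors: Optional[list] = None  # entries from ValidationError.errors()


def validate_record(model_class: type, record: dict, index: int = 0) -> ValidationResult:
    """Validate one record, capturing error details instead of raising."""
    try:
        model_class.model_validate(record)
        return ValidationResult(index=index, success=True)
    except ValidationError as e:
        # e.errors() yields dicts with type, loc, msg, input (and often ctx)
        return ValidationResult(index=index, success=False, errors=e.errors())
```

Because failures are captured rather than raised, the caller can keep validating the remaining records and report everything at the end.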
Phase 5: Error Formatting (2-3 hours)
Goal: Create beautiful, helpful error output.
- Implement `formatter.py`:
  - Format error paths (join nested locations)
  - Add suggestions based on error type
  - Use the `rich` library for colors and tree structure
- Error suggestions:
  - `string_too_short` → Show the required length
  - `string_too_long` → Show the maximum length
  - `greater_than_equal` → Show the minimum value
  - `value_error` for email → Suggest the format
  - `enum` → List valid values
Checkpoint: Errors display with color and suggestions.
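The suggestion table above can be a plain mapping from error `type` to a message builder. This sketch reads the `ctx` values Pydantic attaches to each error (the exact `ctx` keys, such as `expected` for enum errors, are an assumption to verify against your Pydantic version):

```python
def suggest(err: dict) -> str:
    """Return a fix suggestion for one ValidationError.errors() entry, or ''."""
    ctx = err.get("ctx", {})
    builders = {
        "string_too_short": lambda: f"Needs at least {ctx.get('min_length')} character(s)",
        "string_too_long": lambda: f"Must be at most {ctx.get('max_length')} character(s)",
        "greater_than_equal": lambda: f"Use a value >= {ctx.get('ge')}",
        "enum": lambda: f"Valid values: {ctx.get('expected')}",
        "value_error": lambda: "Check the format (e.g., user@example.com for emails)",
    }
    builder = builders.get(err["type"])
    return builder() if builder else ""


print(suggest({"type": "string_too_short", "ctx": {"min_length": 1}}))
```

Unknown error types fall through to an empty string, so the formatter can simply skip the suggestion line when there is nothing useful to say.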
Phase 6: Summary and JSON Output (1 hour)
Goal: Complete the user experience.
- Add summary statistics:
  - Count valid/invalid records
  - Calculate the percentage
  - Display a summary line
- Implement JSON output:
  - Create structured output for automation
  - Include all error details
Checkpoint: Full working tool with both output formats.
Testing Strategy
Unit Tests
```python
# tests/test_loader.py
import pytest
from pydantic import BaseModel

from pydantic_validate.loader import load_model


def test_load_valid_model():
    model = load_model("schemas/user.py", "User")
    assert issubclass(model, BaseModel)


def test_load_missing_file():
    with pytest.raises(FileNotFoundError):
        load_model("nonexistent.py", "User")


def test_load_invalid_model_name():
    with pytest.raises(ValueError, match="not found"):
        load_model("schemas/user.py", "NonexistentModel")
```

```python
# tests/test_validator.py
def test_validate_valid_record():
    result = validate(User, {"name": "John", "email": "john@example.com", ...})
    assert result.success is True


def test_validate_invalid_record():
    result = validate(User, {"name": "", "email": "invalid"})
    assert result.success is False
    assert len(result.errors) == 2
```

```python
# tests/test_formatter.py
def test_format_nested_error():
    error = {"loc": ("address", "city"), "msg": "required"}
    formatted = format_error(error)
    assert "address.city" in formatted
```
Integration Tests
```python
# tests/test_cli.py
from click.testing import CliRunner

from pydantic_validate.cli import cli


def test_validate_valid_file():
    runner = CliRunner()
    result = runner.invoke(cli, [
        "--schema", "tests/fixtures/schema.py",
        "--model", "User",
        "--file", "tests/fixtures/valid_users.json",
    ])
    assert result.exit_code == 0
    assert "100.0%" in result.output


def test_validate_invalid_file():
    runner = CliRunner()
    result = runner.invoke(cli, [
        "--schema", "tests/fixtures/schema.py",
        "--model", "User",
        "--file", "tests/fixtures/invalid_users.json",
    ])
    assert result.exit_code == 1
    assert "validation error" in result.output
```
Test Fixtures
Create sample schema and data files for testing:
- `valid_users.json` - All records valid
- `invalid_users.json` - Mix of valid and invalid
- `empty.json` - Empty array
- `malformed.json` - Invalid JSON syntax
Common Pitfalls and Debugging
Pitfall 1: Import Errors in Schema Files
Problem: Schema file imports fail because dependencies aren't installed.
Solution:
- Catch `ImportError` and report which dependency is missing
- Suggest `pip install <package>`
```python
try:
    spec.loader.exec_module(module)
except ImportError as e:
    print(f"Schema requires: {e.name}")
    print(f"Install with: pip install {e.name}")
```
Pitfall 2: Forward References
Problem: Models reference each other before definition.
Solution: Use model_rebuild() after loading:
```python
model_class.model_rebuild()
```
Pitfall 3: Circular Imports
Problem: Schema file has circular dependencies.
Solution: Recommend using TYPE_CHECKING guard:
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from .other import OtherModel
```
Pitfall 4: Path Handling
Problem: Relative paths in error output are confusing.
Solution: Use absolute paths internally, display relative:
```python
from pathlib import Path

display_path = Path(full_path).relative_to(Path.cwd())
```
Pitfall 5: Unicode in Output
Problem: Special characters cause encoding errors.
Solution:
- Use `encoding="utf-8"` when reading files
- The Rich library handles Unicode output properly
Extensions and Challenges
Extension 1: Schema Discovery
Auto-discover all BaseModel subclasses in a file instead of requiring --model:
```shell
pydantic-validate --schema schema.py --file data.json
# Outputs: Found models: User, Address, Company. Use --model to specify.
```
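Discovery can reuse the loader: after executing the module, scan its namespace for classes defined in that module that subclass `BaseModel`. The helper below is written generically over any base class so the idea stands on its own:

```python
import inspect
import types


def discover_models(module: types.ModuleType, base: type) -> list:
    """Names of classes defined in `module` itself that subclass `base`."""
    names = []
    for name, obj in vars(module).items():
        if (inspect.isclass(obj)
                and issubclass(obj, base)
                and obj is not base
                # skip classes merely imported into the schema file
                and obj.__module__ == module.__name__):
            names.append(name)
    return sorted(names)
```

In the CLI, you would call `discover_models(module, BaseModel)` right after `spec.loader.exec_module(module)` and print the result when `--model` is omitted.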
Extension 2: Fix Suggestions
For certain errors, suggest the corrected value:
```
✗ address.country: "United States" should be 2 characters
  Suggestion: Use "US" (ISO 3166-1 alpha-2 country code)
```
Extension 3: Watch Mode
Re-validate when files change:
```shell
pydantic-validate --schema schema.py --model User --file data.json --watch
```
Extension 4: Schema Documentation
Generate documentation from schema:
```shell
pydantic-validate --schema schema.py --model User --docs
# Outputs markdown documentation for the User schema
```
Extension 5: Partial Validation
Validate only specific fields:
```shell
pydantic-validate --schema schema.py --model User --file data.json --fields email,age
```
Real-World Connections
Where This Pattern Appears
- CI/CD Pipelines: Validate configuration files before deployment
- Data Import Tools: Validate CSV/JSON before database insertion
- API Testing: Validate request/response payloads
- Schema Evolution: Check if old data matches new schemas
Industry Examples
- Great Expectations: Data validation for data pipelines
- JSON Schema validators: CLI tools like `ajv-cli`
- Kubernetes: YAML manifest validation
Production Considerations
- Performance: For large files, consider streaming validation
- Memory: Donโt load entire file into memory
- Parallelism: Validate multiple records concurrently
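One pragmatic way to keep memory bounded is to accept JSON Lines input (one record per line) and validate lazily. This is a sketch of the idea, not part of the core spec:

```python
import json
from typing import Iterator


def iter_records(file_path: str) -> Iterator[dict]:
    """Yield one parsed record per line from a JSON Lines file."""
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Each yielded record can be handed to `model_class.model_validate` as it streams in, so memory usage is bounded by the size of a single record rather than the whole file.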
Self-Assessment Checklist
Core Understanding
- Can I explain the difference between type coercion and strict mode?
- Can I describe the stages of Pydantic's validation pipeline?
- Can I parse a ValidationError and extract field paths, messages, and values?
- Can I use Field() to apply constraints like min_length, pattern, ge/le?
Implementation Skills
- Can I dynamically load a Python module and extract a class?
- Can I build a CLI with Click/Typer with options and flags?
- Can I format nested error paths into readable strings?
- Can I handle both JSON and YAML input formats?
Design Understanding
- Can I explain why validation should happen at system boundaries?
- Can I design a component architecture for a validation tool?
- Can I create user-friendly error messages with suggestions?
- Can I support both human and machine output formats?
Mastery Indicators
- Tool handles edge cases gracefully (empty files, syntax errors)
- Error messages are genuinely helpful for fixing issues
- Code is well-organized and testable
- Output is beautiful with color and formatting
Resources
Documentation
Books
- "Robust Python" by Patrick Viafore - Chapter 4 on Type Hints
- "Python Testing with pytest" by Brian Okken - For testing strategies
Related Projects
- pydantic-cli - Similar concept
- jsonschema CLI - JSON Schema validation