Project 1: Schema Validator CLI

Build a CLI tool that validates JSON/YAML files against Pydantic schemas, showing detailed error messages and suggesting fixes, like a linter for your data.


Learning Objectives

By completing this project, you will:

  1. Understand Pydantic's core validation model - How data flows from raw input to validated Python objects
  2. Master error handling - Parse ValidationError structure and create user-friendly error messages
  3. Work with nested models - Build complex schemas with relationships between models
  4. Use Field configuration - Apply constraints like min_length, max_length, patterns, and custom descriptions
  5. Implement dynamic model loading - Load Pydantic models from Python files at runtime
  6. Build production CLI tools - Use Click/Typer for professional command-line interfaces

Theoretical Foundation

The Philosophy of Data Validation

Data validation is one of the most critical aspects of software engineering. Consider these scenarios:

  • A user submits a form with an email like "not-an-email"
  • An API receives JSON with a negative age value
  • A configuration file contains an invalid database URL
  • A CSV import has dates in the wrong format

Without proper validation, these errors propagate through your system, causing bugs that are hard to trace, security vulnerabilities, and corrupted data.

The Principle of Early Failure: Validate data at system boundaries, as soon as it enters your application. This is called "fail fast" or "parse, don't validate." Instead of checking data validity throughout your code, validate once at entry and work with guaranteed-valid data afterward.

┌─────────────────────────────────────────────────────────────┐
│                       SYSTEM BOUNDARY                       │
│                                                             │
│   External World                    Your Application        │
│   (Untrusted)                       (Trusted)               │
│                                                             │
│   ┌──────────┐      Validation     ┌──────────────────┐     │
│   │ Raw JSON │ ─────────────────►  │ Validated Model  │     │
│   │ {"age":  │      ┌──────────┐   │                  │     │
│   │  "-5"}   │ ──►  │ Pydantic │ ► │ User(age=25)     │     │
│   └──────────┘      │ Schema   │   │ # Type-safe!     │     │
│                     └──────────┘   └──────────────────┘     │
│                          │                                  │
│                          ▼                                  │
│                  ValidationError                            │
│                  (Rejected!)                                │
└─────────────────────────────────────────────────────────────┘

How Pydantic Validates Data

Pydantic V2 uses a multi-stage validation pipeline:

Stage 1: Input Parsing

Raw input (JSON string, dict, etc.) is converted to a Python dictionary. For model_validate_json(), this uses a Rust-based JSON parser for speed.

Stage 2: Type Coercion (Default Mode)

Pydantic attempts to convert values to their target types:

  • "123" → 123 (string to int)
  • "true" → True (string to bool)
  • 1.0 → 1 (float to int, only when the fractional part is zero; 1.5 is rejected)
  • "2024-01-15" → date(2024, 1, 15) (string to date)

Strict Mode disables coercion; types must match exactly.
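A quick way to see the difference, assuming Pydantic V2 is installed (the model and field names here are illustrative):

```python
from pydantic import BaseModel, ValidationError


class Reading(BaseModel):
    value: int
    active: bool


# Lax (default) mode: compatible strings are coerced to the target types.
r = Reading.model_validate({"value": "123", "active": "true"})
print(r.value, r.active)  # 123 True

# Strict mode: the same input is rejected because the types do not match exactly.
try:
    Reading.model_validate({"value": "123", "active": "true"}, strict=True)
except ValidationError as e:
    print(f"{e.error_count()} errors in strict mode")  # 2 errors in strict mode
```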

Stage 3: Constraint Validation

Field constraints are checked:

  • min_length, max_length for strings
  • ge, gt, le, lt for numbers
  • pattern for regex matching
  • Custom validators

Stage 4: Field Validators

Custom @field_validator functions run on individual fields.

Stage 5: Model Validators

@model_validator functions run with access to all fields for cross-field validation.
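Stages 4 and 5 can be seen together in a small sketch (the model and validator names are made up for the demo):

```python
from pydantic import BaseModel, ValidationError, field_validator, model_validator


class DateRange(BaseModel):
    start: int
    end: int

    # Stage 4: runs per field, after coercion and constraint checks.
    @field_validator("start", "end")
    @classmethod
    def check_year(cls, v: int) -> int:
        if v < 1900:
            raise ValueError("year must be 1900 or later")
        return v

    # Stage 5: runs last, with access to every validated field.
    @model_validator(mode="after")
    def check_order(self) -> "DateRange":
        if self.end < self.start:
            raise ValueError("end must not be before start")
        return self


DateRange(start=2020, end=2024)      # passes both stages
try:
    DateRange(start=2024, end=2020)  # fails at Stage 5
except ValidationError as e:
    print(e.errors()[0]["msg"])
```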

Stage 6: Object Construction

If all validation passes, the Pydantic model instance is created.

Understanding ValidationError

When validation fails, Pydantic raises ValidationError with rich error information:

from pydantic import BaseModel, ValidationError, Field, EmailStr  # EmailStr needs the email-validator package

class User(BaseModel):
    name: str = Field(min_length=1)
    age: int = Field(ge=0)
    email: EmailStr

try:
    User(name="", age=-5, email="not-an-email")
except ValidationError as e:
    print(e.json(indent=2))

Output:

[
  {
    "type": "string_too_short",
    "loc": ["name"],
    "msg": "String should have at least 1 character",
    "input": "",
    "ctx": {"min_length": 1}
  },
  {
    "type": "greater_than_equal",
    "loc": ["age"],
    "msg": "Input should be greater than or equal to 0",
    "input": -5,
    "ctx": {"ge": 0}
  },
  {
    "type": "value_error",
    "loc": ["email"],
    "msg": "value is not a valid email address: An email address must have an @-sign.",
    "input": "not-an-email"
  }
]

Error Structure Anatomy:

  • type: Machine-readable error identifier (e.g., "string_too_short")
  • loc: Tuple path to the field (e.g., ["address", "city"] for nested fields)
  • msg: Human-readable error message
  • input: The invalid value that was provided
  • ctx: Additional context (constraint values, expected formats)
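The same fields are available programmatically via e.errors(), which returns a list of dicts; this is what the CLI's formatter will consume. A trimmed-down model for illustration:

```python
from pydantic import BaseModel, Field, ValidationError


class User(BaseModel):
    name: str = Field(min_length=1)
    age: int = Field(ge=0)


try:
    User(name="", age=-5)
except ValidationError as e:
    for err in e.errors():
        # loc is a tuple; joining it yields a dotted path like "address.city"
        path = ".".join(str(part) for part in err["loc"])
        print(f"{path}: {err['msg']} (got {err['input']!r})")
```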

Field Configuration Deep Dive

The Field() function is how you configure individual fields:

from pydantic import BaseModel, Field
from typing import Optional

class Product(BaseModel):
    # Required field with constraints
    name: str = Field(
        min_length=1,
        max_length=100,
        description="Product name",
        examples=["Widget Pro", "Gadget X"]
    )

    # Numeric constraints
    price: float = Field(
        gt=0,  # greater than
        le=10000,  # less than or equal
        description="Price in USD"
    )

    # Default value
    quantity: int = Field(
        default=0,
        ge=0,
        description="Stock quantity"
    )

    # Pattern matching
    sku: str = Field(
        pattern=r'^[A-Z]{2}-\d{4}$',
        description="Stock Keeping Unit (e.g., AB-1234)"
    )

    # Optional with None default
    description: Optional[str] = Field(
        default=None,
        max_length=1000
    )

Available Constraints:

Constraint    Applies To       Description
gt            Numbers          Greater than
ge            Numbers          Greater than or equal
lt            Numbers          Less than
le            Numbers          Less than or equal
multiple_of   Numbers          Must be divisible by
min_length    Strings, Lists   Minimum length
max_length    Strings, Lists   Maximum length
pattern       Strings          Regex pattern
strict        Any              Disable type coercion

Nested Models and Complex Schemas

Real-world data is rarely flat. Pydantic handles nested models elegantly:

from pydantic import BaseModel, Field
from typing import Optional, List
from datetime import datetime

class Address(BaseModel):
    street: str
    city: str
    state: str = Field(min_length=2, max_length=2)  # State code
    zip_code: str = Field(pattern=r'^\d{5}(-\d{4})?$')
    country: str = Field(default="US")

class ContactInfo(BaseModel):
    email: str
    phone: Optional[str] = None
    address: Optional[Address] = None

class Company(BaseModel):
    name: str
    founded: datetime
    employees: int = Field(ge=1)
    headquarters: Address
    contacts: List[ContactInfo] = Field(default_factory=list)

When validation fails on nested models, the loc tuple shows the full path:

# Error loc: ["headquarters", "zip_code"]
# Means: company.headquarters.zip_code failed validation
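A minimal reproduction, using trimmed versions of the models above:

```python
from pydantic import BaseModel, Field, ValidationError


class Address(BaseModel):
    city: str
    zip_code: str = Field(pattern=r"^\d{5}$")


class Company(BaseModel):
    name: str
    headquarters: Address


try:
    Company(name="Acme", headquarters={"city": "NYC", "zip_code": "abc"})
except ValidationError as e:
    # The loc tuple spells out the nested path to the failing field.
    print(e.errors()[0]["loc"])  # ('headquarters', 'zip_code')
```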

Dynamic Model Loading

For a CLI tool, you need to load Pydantic models from user-provided files. Python's importlib makes this possible:

import importlib.util
import sys
from pathlib import Path

def load_model_from_file(file_path: str, model_name: str):
    """Load a Pydantic model class from a Python file."""
    path = Path(file_path)

    # Create a module spec
    spec = importlib.util.spec_from_file_location(
        path.stem,  # module name from filename
        path
    )

    # Create the module
    module = importlib.util.module_from_spec(spec)
    sys.modules[path.stem] = module

    # Execute the module (runs the code)
    spec.loader.exec_module(module)

    # Get the model class
    if not hasattr(module, model_name):
        raise ValueError(f"Model '{model_name}' not found in {file_path}")

    model_class = getattr(module, model_name)

    # Verify it's a Pydantic model
    from pydantic import BaseModel
    if not isinstance(model_class, type) or not issubclass(model_class, BaseModel):
        raise TypeError(f"'{model_name}' is not a Pydantic BaseModel")

    return model_class
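A usage sketch: write a throwaway schema to a temporary file, then load it back. The schema source and model name are made up for the demo, and the loader here is a condensed copy of the function above so the snippet runs on its own:

```python
import importlib.util
import sys
import tempfile
from pathlib import Path

from pydantic import BaseModel

SCHEMA_SOURCE = '''
from pydantic import BaseModel

class Item(BaseModel):
    name: str
'''


def load_model(file_path: str, model_name: str):
    """Condensed loader: import the file as a module and pull out the class."""
    path = Path(file_path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[path.stem] = module
    spec.loader.exec_module(module)
    model = getattr(module, model_name)
    if not (isinstance(model, type) and issubclass(model, BaseModel)):
        raise TypeError(f"'{model_name}' is not a Pydantic BaseModel")
    return model


with tempfile.TemporaryDirectory() as tmp:
    schema_file = Path(tmp) / "item_schema.py"
    schema_file.write_text(SCHEMA_SOURCE, encoding="utf-8")
    Item = load_model(str(schema_file), "Item")
    print(Item(name="widget"))
```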

Project Specification

Functional Requirements

Build a CLI tool called pydantic-validate that:

  1. Validates data files against Pydantic schemas
    • Accept a schema file (Python file with Pydantic models)
    • Accept a data file (JSON or YAML)
    • Report validation success or detailed errors
  2. Provides helpful error messages
    • Show field path for nested errors
    • Display the invalid value
    • Suggest corrections when possible
    • Use color-coding for readability
  3. Supports multiple records
    • Validate arrays of objects
    • Report per-record errors
    • Show summary statistics
  4. Handles multiple formats
    • JSON files
    • YAML files
    • Detect format from extension

CLI Interface

# Basic usage
pydantic-validate --schema path/to/schema.py --model User --file data.json

# Validate YAML
pydantic-validate -s schema.py -m Config -f settings.yaml

# Verbose output
pydantic-validate -s schema.py -m User -f users.json --verbose

# Output as JSON (for automation)
pydantic-validate -s schema.py -m User -f data.json --output json

Example Schema File

# schemas/user.py
from pydantic import BaseModel, Field, EmailStr
from typing import Optional, List
from datetime import date
from enum import Enum

class UserRole(str, Enum):
    admin = "admin"
    user = "user"
    guest = "guest"

class Address(BaseModel):
    street: str = Field(min_length=1)
    city: str = Field(min_length=1)
    country: str = Field(min_length=2, max_length=2, description="ISO 3166-1 alpha-2")
    postal_code: Optional[str] = None

class User(BaseModel):
    id: int = Field(ge=1)
    name: str = Field(min_length=1, max_length=100)
    email: EmailStr
    age: int = Field(ge=0, le=150)
    role: UserRole = Field(default=UserRole.user)
    tags: List[str] = Field(default_factory=list, max_length=10)
    address: Optional[Address] = None
    created_at: date

Example Data File

[
  {
    "id": 1,
    "name": "John Doe",
    "email": "john@example.com",
    "age": 30,
    "role": "admin",
    "created_at": "2024-01-15"
  },
  {
    "id": -1,
    "name": "",
    "email": "not-an-email",
    "age": 200,
    "role": "superuser",
    "address": {
      "street": "123 Main St",
      "city": "NYC",
      "country": "United States"
    },
    "created_at": "invalid-date"
  }
]

Expected Output

Validating data.json against User schema...

✓ Record 1: Valid

✗ Record 2: 7 validation errors
  ├── id
  │   └── Input should be greater than or equal to 1 [type=greater_than_equal]
  │       Got: -1
  │       Expected: id ≥ 1
  │
  ├── name
  │   └── String should have at least 1 character [type=string_too_short]
  │       Got: "" (empty string)
  │       Expected: 1 ≤ length ≤ 100
  │
  ├── email
  │   └── value is not a valid email address [type=value_error]
  │       Got: "not-an-email"
  │       Suggestion: Add @ and domain (e.g., user@example.com)
  │
  ├── age
  │   └── Input should be less than or equal to 150 [type=less_than_equal]
  │       Got: 200
  │       Expected: 0 ≤ age ≤ 150
  │
  ├── role
  │   └── Input should be 'admin', 'user' or 'guest' [type=enum]
  │       Got: "superuser"
  │       Valid values: admin, user, guest
  │
  ├── address.country
  │   └── String should have at most 2 characters [type=string_too_long]
  │       Got: "United States" (13 characters)
  │       Suggestion: Use ISO 3166-1 alpha-2 code (e.g., "US")
  │
  └── created_at
      └── Input should be a valid date [type=date_parsing]
          Got: "invalid-date"
          Expected: ISO 8601 format (YYYY-MM-DD)

────────────────────────────────────────────────────
Summary: 1/2 records valid (50.0%)
────────────────────────────────────────────────────

Solution Architecture

Component Design

┌─────────────────────────────────────────────────────────────┐
│                         CLI Layer                           │
│                     (Click/Typer App)                       │
│  - Parse command line arguments                             │
│  - Handle --verbose, --output flags                         │
│  - Coordinate validation flow                               │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                       Schema Loader                         │
│  - Load Python files dynamically                            │
│  - Extract Pydantic model classes                           │
│  - Validate model is BaseModel subclass                     │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                        Data Reader                          │
│  - Detect format (JSON/YAML)                                │
│  - Parse file contents                                      │
│  - Handle single object vs array                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                         Validator                           │
│  - Apply Pydantic model to each record                      │
│  - Collect ValidationErrors                                 │
│  - Track success/failure counts                             │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                      Error Formatter                        │
│  - Parse error location paths                               │
│  - Generate human-friendly messages                         │
│  - Add suggestions based on error type                      │
│  - Apply color coding                                       │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                         Reporter                            │
│  - Console output (tree format)                             │
│  - JSON output (for automation)                             │
│  - Summary statistics                                       │
└─────────────────────────────────────────────────────────────┘

Key Design Decisions

  1. Separation of Concerns: Each component has a single responsibility
  2. Error Enrichment: Transform raw ValidationError into user-friendly format
  3. Pluggable Output: Support both human-readable and machine-readable output
  4. Graceful Degradation: Handle partial failures (some records valid, some not)

Data Flow

# Pseudo-code for main validation flow

def validate_command(schema_path, model_name, data_path, output_format):
    # 1. Load the Pydantic model
    model_class = schema_loader.load(schema_path, model_name)

    # 2. Read and parse data
    data = data_reader.read(data_path)  # Returns dict or list of dicts

    # 3. Ensure we have a list
    records = data if isinstance(data, list) else [data]

    # 4. Validate each record
    results = []
    for i, record in enumerate(records):
        result = validator.validate(model_class, record)
        results.append(ValidationResult(
            index=i,
            record=record,
            success=result.success,
            errors=result.errors if not result.success else None
        ))

    # 5. Format and output
    if output_format == "json":
        reporter.output_json(results)
    else:
        reporter.output_console(results)

Phased Implementation Guide

Phase 1: Basic CLI Structure (1-2 hours)

Goal: Create a working CLI that accepts arguments.

  1. Set up project structure:
    pydantic-validate/
    ├── pyproject.toml
    ├── src/
    │   └── pydantic_validate/
    │       ├── __init__.py
    │       ├── cli.py
    │       ├── loader.py
    │       ├── reader.py
    │       ├── validator.py
    │       └── formatter.py
    └── tests/
    
  2. Install dependencies:
    pip install pydantic click pyyaml rich
    
  3. Create basic CLI with Click:
    • Define --schema, --model, --file options
    • Add --verbose and --output flags
    • Validate that files exist

Checkpoint: pydantic-validate --help shows usage.
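A minimal Phase 1 skeleton, assuming Click is installed; the option names follow the CLI spec above, and the command body is a placeholder:

```python
import click


@click.command()
@click.option("--schema", "-s", "schema_path", required=True,
              type=click.Path(exists=True), help="Python file with Pydantic models.")
@click.option("--model", "-m", "model_name", required=True,
              help="Model class to validate against.")
@click.option("--file", "-f", "data_path", required=True,
              type=click.Path(exists=True), help="JSON or YAML data file.")
@click.option("--verbose", is_flag=True, help="Show extra detail per record.")
@click.option("--output", type=click.Choice(["console", "json"]),
              default="console", help="Output format.")
def cli(schema_path, model_name, data_path, verbose, output):
    """Validate a data file against a Pydantic schema."""
    # Placeholder: later phases wire in loading, reading, and validation.
    click.echo(f"Would validate {data_path} against {model_name} from {schema_path}")
```

Registering `cli` as a console script entry point in pyproject.toml gives the `pydantic-validate` command name.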

Phase 2: Schema Loading (1-2 hours)

Goal: Dynamically load Pydantic models from files.

  1. Implement loader.py:
    • Use importlib.util to load Python files
    • Extract specified model class
    • Validate it's a BaseModel subclass
  2. Handle errors:
    • File not found
    • Syntax errors in schema file
    • Model not found in file
    • Not a Pydantic model

Checkpoint: Can load a User model from a schema file.

Phase 3: Data Reading (1 hour)

Goal: Read JSON and YAML files.

  1. Implement reader.py:
    • Detect format from file extension
    • Parse JSON with json.load()
    • Parse YAML with yaml.safe_load()
    • Handle both single objects and arrays
  2. Error handling:
    • Invalid JSON/YAML syntax
    • Empty files

Checkpoint: Can read and parse sample data files.
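A sketch of reader.py along these lines (the function name is illustrative; YAML parsing is imported lazily so the JSON path works without PyYAML installed):

```python
import json
from pathlib import Path


def read_data(file_path: str):
    """Parse a JSON or YAML file into a dict or list of dicts, chosen by extension."""
    path = Path(file_path)
    text = path.read_text(encoding="utf-8")
    if not text.strip():
        raise ValueError(f"{file_path} is empty")
    if path.suffix.lower() in (".yaml", ".yml"):
        import yaml  # third-party PyYAML, only needed for YAML input
        return yaml.safe_load(text)
    return json.loads(text)
```

Syntax errors surface as json.JSONDecodeError or yaml.YAMLError, which the CLI layer can catch and report.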

Phase 4: Core Validation (2 hours)

Goal: Validate records and collect errors.

  1. Implement validator.py:
    • Create ValidationResult dataclass
    • Try to instantiate model with each record
    • Catch ValidationError and extract error details
  2. Process errors:
    • Parse e.errors() list
    • Store location, message, type, input

Checkpoint: Can detect invalid records and list errors.
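A sketch of validator.py in this shape (names are illustrative, assuming Pydantic V2):

```python
from dataclasses import dataclass, field
from typing import Any

from pydantic import BaseModel, ValidationError


@dataclass
class ValidationResult:
    index: int
    success: bool
    errors: list[dict[str, Any]] = field(default_factory=list)


def validate_record(model_class: type[BaseModel], record: dict,
                    index: int = 0) -> ValidationResult:
    """Instantiate the model; on failure, capture the error dicts from e.errors()."""
    try:
        model_class.model_validate(record)
        return ValidationResult(index=index, success=True)
    except ValidationError as e:
        return ValidationResult(index=index, success=False, errors=e.errors())
```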

Phase 5: Error Formatting (2-3 hours)

Goal: Create beautiful, helpful error output.

  1. Implement formatter.py:
    • Format error paths (join nested locations)
    • Add suggestions based on error type
    • Use rich library for colors and tree structure
  2. Error suggestions:
    • string_too_short → Show required length
    • string_too_long → Show max length
    • greater_than_equal → Show minimum value
    • value_error for email → Suggest format
    • enum → List valid values

Checkpoint: Errors display with color and suggestions.
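The suggestion logic can be a simple lookup from error type to hint builder; a sketch with plain strings (rich styling can be layered on afterwards; the wording is illustrative):

```python
# Map Pydantic error types to hint builders that read the error's ctx dict.
SUGGESTIONS = {
    "string_too_short": lambda ctx: f"Use at least {ctx['min_length']} character(s)",
    "string_too_long": lambda ctx: f"Use at most {ctx['max_length']} character(s)",
    "greater_than_equal": lambda ctx: f"Use a value ≥ {ctx['ge']}",
    "less_than_equal": lambda ctx: f"Use a value ≤ {ctx['le']}",
}


def format_error(error: dict) -> str:
    """Turn one entry from e.errors() into a readable line with an optional hint."""
    path = ".".join(str(part) for part in error["loc"])
    line = f"{path}: {error['msg']} (got {error.get('input')!r})"
    build_hint = SUGGESTIONS.get(error.get("type"))
    if build_hint and error.get("ctx"):
        line += f"\n  Suggestion: {build_hint(error['ctx'])}"
    return line
```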

Phase 6: Summary and JSON Output (1 hour)

Goal: Complete the user experience.

  1. Add summary statistics:
    • Count valid/invalid records
    • Calculate percentage
    • Display summary line
  2. Implement JSON output:
    • Create structured output for automation
    • Include all error details

Checkpoint: Full working tool with both output formats.


Testing Strategy

Unit Tests

# tests/test_loader.py
def test_load_valid_model():
    model = load_model("schemas/user.py", "User")
    assert issubclass(model, BaseModel)

def test_load_missing_file():
    with pytest.raises(FileNotFoundError):
        load_model("nonexistent.py", "User")

def test_load_invalid_model_name():
    with pytest.raises(ValueError, match="not found"):
        load_model("schemas/user.py", "NonexistentModel")

# tests/test_validator.py
def test_validate_valid_record():
    result = validate(User, {"name": "John", "email": "john@example.com", ...})
    assert result.success is True

def test_validate_invalid_record():
    result = validate(User, {"name": "", "email": "invalid"})
    assert result.success is False
    assert len(result.errors) == 2

# tests/test_formatter.py
def test_format_nested_error():
    error = {"loc": ("address", "city"), "msg": "required"}
    formatted = format_error(error)
    assert "address.city" in formatted

Integration Tests

# tests/test_cli.py
from click.testing import CliRunner

def test_validate_valid_file():
    runner = CliRunner()
    result = runner.invoke(cli, [
        "--schema", "tests/fixtures/schema.py",
        "--model", "User",
        "--file", "tests/fixtures/valid_users.json"
    ])
    assert result.exit_code == 0
    assert "100.0%" in result.output

def test_validate_invalid_file():
    runner = CliRunner()
    result = runner.invoke(cli, [
        "--schema", "tests/fixtures/schema.py",
        "--model", "User",
        "--file", "tests/fixtures/invalid_users.json"
    ])
    assert result.exit_code == 1
    assert "validation error" in result.output

Test Fixtures

Create sample schema and data files for testing:

  • valid_users.json - All records valid
  • invalid_users.json - Mix of valid and invalid
  • empty.json - Empty array
  • malformed.json - Invalid JSON syntax

Common Pitfalls and Debugging

Pitfall 1: Import Errors in Schema Files

Problem: Schema file imports fail because dependencies aren't installed.

Solution:

  • Catch ImportError and report which dependency is missing
  • Suggest pip install <package>
try:
    spec.loader.exec_module(module)
except ImportError as e:
    print(f"Schema requires: {e.name}")
    print(f"Install with: pip install {e.name}")

Pitfall 2: Forward References

Problem: Models reference each other before definition.

Solution: Use model_rebuild() after loading:

model_class.model_rebuild()

Pitfall 3: Circular Imports

Problem: Schema file has circular dependencies.

Solution: Recommend using TYPE_CHECKING guard:

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from .other import OtherModel

Pitfall 4: Path Handling

Problem: Relative paths in error output are confusing.

Solution: Use absolute paths internally, display relative:

from pathlib import Path

display_path = Path(full_path).relative_to(Path.cwd())

Pitfall 5: Unicode in Output

Problem: Special characters cause encoding errors.

Solution:

  • Use encoding="utf-8" when reading files
  • Rich library handles unicode output properly

Extensions and Challenges

Extension 1: Schema Discovery

Auto-discover all BaseModel subclasses in a file instead of requiring --model:

pydantic-validate --schema schema.py --file data.json
# Outputs: Found models: User, Address, Company. Use --model to specify.
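One way to sketch discovery, reusing the importlib approach from the loader (the function name is illustrative):

```python
import importlib.util
import inspect
import sys
from pathlib import Path

from pydantic import BaseModel


def discover_models(schema_path: str) -> list[str]:
    """Return the names of all BaseModel subclasses defined in a schema file."""
    path = Path(schema_path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[path.stem] = module
    spec.loader.exec_module(module)
    return [
        name
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if issubclass(obj, BaseModel)
        and obj.__module__ == module.__name__  # skip classes imported from elsewhere
    ]
```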

Extension 2: Fix Suggestions

For certain errors, suggest the corrected value:

✗ address.country: "United States" should be 2 characters
  Suggestion: Use "US" (ISO 3166-1 alpha-2 country code)

Extension 3: Watch Mode

Re-validate when files change:

pydantic-validate --schema schema.py --model User --file data.json --watch

Extension 4: Schema Documentation

Generate documentation from schema:

pydantic-validate --schema schema.py --model User --docs
# Outputs markdown documentation for the User schema

Extension 5: Partial Validation

Validate only specific fields:

pydantic-validate --schema schema.py --model User --file data.json --fields email,age

Real-World Connections

Where This Pattern Appears

  1. CI/CD Pipelines: Validate configuration files before deployment
  2. Data Import Tools: Validate CSV/JSON before database insertion
  3. API Testing: Validate request/response payloads
  4. Schema Evolution: Check if old data matches new schemas

Industry Examples

  • Great Expectations: Data validation for data pipelines
  • JSON Schema validators: CLI tools like ajv-cli
  • Terraform validate: Configuration validation
  • Kubernetes: YAML manifest validation

Production Considerations

  • Performance: For large files, consider streaming validation
  • Memory: Donโ€™t load entire file into memory
  • Parallelism: Validate multiple records concurrently

Self-Assessment Checklist

Core Understanding

  • Can I explain the difference between type coercion and strict mode?
  • Can I describe the stages of Pydantic's validation pipeline?
  • Can I parse a ValidationError and extract field paths, messages, and values?
  • Can I use Field() to apply constraints like min_length, pattern, ge/le?

Implementation Skills

  • Can I dynamically load a Python module and extract a class?
  • Can I build a CLI with Click/Typer with options and flags?
  • Can I format nested error paths into readable strings?
  • Can I handle both JSON and YAML input formats?

Design Understanding

  • Can I explain why validation should happen at system boundaries?
  • Can I design a component architecture for a validation tool?
  • Can I create user-friendly error messages with suggestions?
  • Can I support both human and machine output formats?

Mastery Indicators

  • Tool handles edge cases gracefully (empty files, syntax errors)
  • Error messages are genuinely helpful for fixing issues
  • Code is well-organized and testable
  • Output is beautiful with color and formatting

Resources

Books

  • "Robust Python" by Patrick Viafore - Chapter 4 on Type Hints
  • "Python Testing with pytest" by Brian Okken - For testing strategies