Project 1: Schema Validator CLI
Build a CLI tool that validates JSON/YAML files against Pydantic schemas, showing detailed error messages and suggesting fixes, like a linter for your data.
Learning Objectives
By completing this project, you will:
- Understand Pydantic's core validation model - How data flows from raw input to validated Python objects
- Master error handling - Parse ValidationError structure and create user-friendly error messages
- Work with nested models - Build complex schemas with relationships between models
- Use Field configuration - Apply constraints like min_length, max_length, patterns, and custom descriptions
- Implement dynamic model loading - Load Pydantic models from Python files at runtime
- Build production CLI tools - Use Click/Typer for professional command-line interfaces
Theoretical Foundation
The Philosophy of Data Validation
Data validation is one of the most critical aspects of software engineering. Consider these scenarios:
- A user submits a form with an email like "not-an-email"
- An API receives JSON with a negative age value
- A configuration file contains an invalid database URL
- A CSV import has dates in the wrong format
Without proper validation, these errors propagate through your system, causing bugs that are hard to trace, security vulnerabilities, and corrupted data.
The Principle of Early Failure: Validate data at system boundaries, as soon as it enters your application. This is called "fail fast" or "parse, don't validate." Instead of checking data validity throughout your code, validate once at entry and work with guaranteed-valid data afterward.
```
┌─────────────────────────────────────────────────────────────┐
│                      SYSTEM BOUNDARY                        │
│                                                             │
│   External World                     Your Application       │
│   (Untrusted)                        (Trusted)              │
│                                                             │
│   ┌───────────┐      Validation      ┌──────────────────┐   │
│   │ Raw JSON  │    ┌──────────┐      │ Validated Model  │   │
│   │ {"age":   │───▶│ Pydantic │─────▶│ User(age=25)     │   │
│   │   "-5"}   │    │ Schema   │      │ # Type-safe!     │   │
│   └───────────┘    └──────────┘      └──────────────────┘   │
│                         │                                   │
│                         ▼                                   │
│                  ValidationError                            │
│                    (Rejected!)                              │
└─────────────────────────────────────────────────────────────┘
```
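This boundary pattern takes only a few lines in practice. The `User` model and `handle_request` function below are illustrative, not part of the project code:

```python
from pydantic import BaseModel, Field, ValidationError


class User(BaseModel):
    age: int = Field(ge=0)


def handle_request(raw: dict) -> User:
    """Validate at the boundary; everything past this point is trusted."""
    return User.model_validate(raw)


user = handle_request({"age": "25"})  # "25" is coerced to 25
print(user.age)

try:
    handle_request({"age": -5})  # rejected at the boundary
except ValidationError as e:
    print(e.errors()[0]["type"])
```

Code deeper in the application can now assume `user.age` is a non-negative `int` without re-checking.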
How Pydantic Validates Data
Pydantic V2 uses a multi-stage validation pipeline:
Stage 1: Input Parsing
Raw input (JSON string, dict, etc.) is converted to a Python dictionary. For model_validate_json(), this uses a Rust-based JSON parser for speed.
Stage 2: Type Coercion (Default Mode)
Pydantic attempts to convert values to their target types:
- `"123"` → `123` (string to int)
- `"true"` → `True` (string to bool)
- `1.0` → `1` (float to int; a float with a fractional part like `1.5` is rejected in Pydantic V2)
- `"2024-01-15"` → `date(2024, 1, 15)` (string to date)
Strict Mode disables coercion: types must match exactly.
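The difference is easy to see side by side (the `Item` model here is illustrative):

```python
from pydantic import BaseModel, ValidationError


class Item(BaseModel):
    count: int


# Default (lax) mode coerces the numeric string
coerced = Item.model_validate({"count": "3"})
print(coerced.count)  # 3

# Strict mode requires an actual int
try:
    Item.model_validate({"count": "3"}, strict=True)
except ValidationError as e:
    print(e.errors()[0]["type"])
```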
Stage 3: Constraint Validation
Field constraints are checked:
- `min_length`, `max_length` for strings
- `ge`, `gt`, `le`, `lt` for numbers
- `pattern` for regex matching
- Custom validators
Stage 4: Field Validators
Custom @field_validator functions run on individual fields.
Stage 5: Model Validators
@model_validator functions run with access to all fields for cross-field validation.
Stage 6: Object Construction
If all validation passes, the Pydantic model instance is created.
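Stages 4 and 5 can be sketched with a small model (the `Booking` fields here are illustrative):

```python
from pydantic import BaseModel, field_validator, model_validator


class Booking(BaseModel):
    start: int
    end: int

    # Stage 4: runs on individual fields
    @field_validator("start", "end")
    @classmethod
    def non_negative(cls, v: int) -> int:
        if v < 0:
            raise ValueError("must be non-negative")
        return v

    # Stage 5: runs with access to all fields
    @model_validator(mode="after")
    def end_after_start(self) -> "Booking":
        if self.end <= self.start:
            raise ValueError("end must be after start")
        return self


print(Booking(start=1, end=5))
```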
Understanding ValidationError
When validation fails, Pydantic raises ValidationError with rich error information:
```python
from pydantic import BaseModel, ValidationError, Field, EmailStr  # EmailStr needs the email-validator package


class User(BaseModel):
    name: str = Field(min_length=1)
    age: int = Field(ge=0)
    email: EmailStr


try:
    User(name="", age=-5, email="not-an-email")
except ValidationError as e:
    print(e.json(indent=2))
```
Output:
```json
[
  {
    "type": "string_too_short",
    "loc": ["name"],
    "msg": "String should have at least 1 character",
    "input": "",
    "ctx": {"min_length": 1}
  },
  {
    "type": "greater_than_equal",
    "loc": ["age"],
    "msg": "Input should be greater than or equal to 0",
    "input": -5,
    "ctx": {"ge": 0}
  },
  {
    "type": "value_error",
    "loc": ["email"],
    "msg": "value is not a valid email address: An email address must have an @-sign.",
    "input": "not-an-email"
  }
]
```
Error Structure Anatomy:
- `type`: Machine-readable error identifier (e.g., `"string_too_short"`)
- `loc`: Tuple path to the field (e.g., `["address", "city"]` for nested fields)
- `msg`: Human-readable error message
- `input`: The invalid value that was provided
- `ctx`: Additional context (constraint values, expected formats)
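A formatter can turn these entries into readable lines by joining `loc` with dots. This sketch operates on plain dicts of the shape returned by `e.errors()` (the sample entry is taken from the country-code error used throughout this guide):

```python
def format_error(err: dict) -> str:
    """Render one ValidationError.errors() entry as a single line."""
    # Nested field paths join with "." for display
    path = ".".join(str(part) for part in err["loc"])
    return f"{path}: {err['msg']} (got {err['input']!r})"


sample = {
    "type": "string_too_long",
    "loc": ("address", "country"),
    "msg": "String should have at most 2 characters",
    "input": "United States",
}
print(format_error(sample))
```

Note that `loc` parts can be integers (list indices), which is why each part goes through `str()` before joining.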
Field Configuration Deep Dive
The Field() function is how you configure individual fields:
```python
from pydantic import BaseModel, Field
from typing import Optional


class Product(BaseModel):
    # Required field with constraints
    name: str = Field(
        min_length=1,
        max_length=100,
        description="Product name",
        examples=["Widget Pro", "Gadget X"],
    )

    # Numeric constraints
    price: float = Field(
        gt=0,      # greater than
        le=10000,  # less than or equal
        description="Price in USD",
    )

    # Default value
    quantity: int = Field(
        default=0,
        ge=0,
        description="Stock quantity",
    )

    # Pattern matching
    sku: str = Field(
        pattern=r'^[A-Z]{2}-\d{4}$',
        description="Stock Keeping Unit (e.g., AB-1234)",
    )

    # Optional with None default
    description: Optional[str] = Field(
        default=None,
        max_length=1000,
    )
```
Available Constraints:
| Constraint | Applies To | Description |
|---|---|---|
| `gt` | Numbers | Greater than |
| `ge` | Numbers | Greater than or equal |
| `lt` | Numbers | Less than |
| `le` | Numbers | Less than or equal |
| `multiple_of` | Numbers | Must be divisible by |
| `min_length` | Strings, Lists | Minimum length |
| `max_length` | Strings, Lists | Maximum length |
| `pattern` | Strings | Regex pattern |
| `strict` | Any | Disable type coercion |
Nested Models and Complex Schemas
Real-world data is rarely flat. Pydantic handles nested models elegantly:
```python
from pydantic import BaseModel, Field
from typing import Optional, List
from datetime import datetime


class Address(BaseModel):
    street: str
    city: str
    state: str = Field(min_length=2, max_length=2)  # State code
    zip_code: str = Field(pattern=r'^\d{5}(-\d{4})?$')
    country: str = Field(default="US")


class ContactInfo(BaseModel):
    email: str
    phone: Optional[str] = None
    address: Optional[Address] = None


class Company(BaseModel):
    name: str
    founded: datetime
    employees: int = Field(ge=1)
    headquarters: Address
    contacts: List[ContactInfo] = Field(default_factory=list)
```
When validation fails on nested models, the loc tuple shows the full path:
```python
# Error loc: ["headquarters", "zip_code"]
# Means: company.headquarters.zip_code failed validation
```
Dynamic Model Loading
For a CLI tool, you need to load Pydantic models from user-provided files. Python's importlib makes this possible:
```python
import importlib.util
import sys
from pathlib import Path

from pydantic import BaseModel


def load_model_from_file(file_path: str, model_name: str):
    """Load a Pydantic model class from a Python file."""
    path = Path(file_path)

    # Create a module spec
    spec = importlib.util.spec_from_file_location(
        path.stem,  # module name from filename
        path,
    )
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot create a module spec for {file_path}")

    # Create the module and register it so imports inside it resolve
    module = importlib.util.module_from_spec(spec)
    sys.modules[path.stem] = module

    # Execute the module (runs the code)
    spec.loader.exec_module(module)

    # Get the model class
    if not hasattr(module, model_name):
        raise ValueError(f"Model '{model_name}' not found in {file_path}")
    model_class = getattr(module, model_name)

    # Verify it's a Pydantic model
    if not isinstance(model_class, type) or not issubclass(model_class, BaseModel):
        raise TypeError(f"'{model_name}' is not a Pydantic BaseModel")

    return model_class
```
Project Specification
Functional Requirements
Build a CLI tool called `pydantic-validate` that:

- Validates data files against Pydantic schemas
  - Accepts a schema file (a Python file with Pydantic models)
  - Accepts a data file (JSON or YAML)
  - Reports validation success or detailed errors
- Provides helpful error messages
  - Shows the field path for nested errors
  - Displays the invalid value
  - Suggests corrections when possible
  - Uses color-coding for readability
- Supports multiple records
  - Validates arrays of objects
  - Reports per-record errors
  - Shows summary statistics
- Handles multiple formats
  - JSON files
  - YAML files
  - Detects the format from the file extension
CLI Interface
```shell
# Basic usage
pydantic-validate --schema path/to/schema.py --model User --file data.json

# Validate YAML
pydantic-validate -s schema.py -m Config -f settings.yaml

# Verbose output
pydantic-validate -s schema.py -m User -f users.json --verbose

# Output as JSON (for automation)
pydantic-validate -s schema.py -m User -f data.json --output json
```
Example Schema File
```python
# schemas/user.py
from pydantic import BaseModel, Field, EmailStr
from typing import Optional, List
from datetime import date
from enum import Enum


class UserRole(str, Enum):
    admin = "admin"
    user = "user"
    guest = "guest"


class Address(BaseModel):
    street: str = Field(min_length=1)
    city: str = Field(min_length=1)
    country: str = Field(min_length=2, max_length=2, description="ISO 3166-1 alpha-2")
    postal_code: Optional[str] = None


class User(BaseModel):
    id: int = Field(ge=1)
    name: str = Field(min_length=1, max_length=100)
    email: EmailStr
    age: int = Field(ge=0, le=150)
    role: UserRole = Field(default=UserRole.user)
    tags: List[str] = Field(default_factory=list, max_length=10)
    address: Optional[Address] = None
    created_at: date
```
Example Data File
```json
[
  {
    "id": 1,
    "name": "John Doe",
    "email": "john@example.com",
    "age": 30,
    "role": "admin",
    "created_at": "2024-01-15"
  },
  {
    "id": -1,
    "name": "",
    "email": "not-an-email",
    "age": 200,
    "role": "superuser",
    "address": {
      "street": "123 Main St",
      "city": "NYC",
      "country": "United States"
    },
    "created_at": "invalid-date"
  }
]
```
Expected Output
```
Validating data.json against User schema...

✓ Record 1: Valid
✗ Record 2: 7 validation errors

  ├── id
  │   └── Input should be greater than or equal to 1 [type=greater_than_equal]
  │       Got: -1
  │       Expected: id ≥ 1
  │
  ├── name
  │   └── String should have at least 1 character [type=string_too_short]
  │       Got: "" (empty string)
  │       Expected: 1 ≤ length ≤ 100
  │
  ├── email
  │   └── value is not a valid email address [type=value_error]
  │       Got: "not-an-email"
  │       Suggestion: Add @ and domain (e.g., user@example.com)
  │
  ├── age
  │   └── Input should be less than or equal to 150 [type=less_than_equal]
  │       Got: 200
  │       Expected: 0 ≤ age ≤ 150
  │
  ├── role
  │   └── Input should be 'admin', 'user' or 'guest' [type=enum]
  │       Got: "superuser"
  │       Valid values: admin, user, guest
  │
  ├── address.country
  │   └── String should have at most 2 characters [type=string_too_long]
  │       Got: "United States" (13 characters)
  │       Suggestion: Use ISO 3166-1 alpha-2 code (e.g., "US")
  │
  └── created_at
      └── Input should be a valid date [type=date_parsing]
          Got: "invalid-date"
          Expected: ISO 8601 format (YYYY-MM-DD)

────────────────────────────────────────────────────
Summary: 1/2 records valid (50.0%)
────────────────────────────────────────────────────
```
Solution Architecture
Component Design
```
┌─────────────────────────────────────────────┐
│                  CLI Layer                  │
│              (Click/Typer App)              │
│  - Parse command line arguments             │
│  - Handle --verbose, --output flags         │
│  - Coordinate validation flow               │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│                Schema Loader                │
│  - Load Python files dynamically            │
│  - Extract Pydantic model classes           │
│  - Validate model is BaseModel subclass     │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│                 Data Reader                 │
│  - Detect format (JSON/YAML)                │
│  - Parse file contents                      │
│  - Handle single object vs array            │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│                  Validator                  │
│  - Apply Pydantic model to each record      │
│  - Collect ValidationErrors                 │
│  - Track success/failure counts             │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│               Error Formatter               │
│  - Parse error location paths               │
│  - Generate human-friendly messages         │
│  - Add suggestions based on error type      │
│  - Apply color coding                       │
└─────────────────────┬───────────────────────┘
                      ▼
┌─────────────────────────────────────────────┐
│                  Reporter                   │
│  - Console output (tree format)             │
│  - JSON output (for automation)             │
│  - Summary statistics                       │
└─────────────────────────────────────────────┘
```
Key Design Decisions
- Separation of Concerns: Each component has a single responsibility
- Error Enrichment: Transform raw ValidationError into user-friendly format
- Pluggable Output: Support both human-readable and machine-readable output
- Graceful Degradation: Handle partial failures (some records valid, some not)
Data Flow
```python
# Pseudo-code for main validation flow
def validate_command(schema_path, model_name, data_path, output_format):
    # 1. Load the Pydantic model
    model_class = schema_loader.load(schema_path, model_name)

    # 2. Read and parse data
    data = data_reader.read(data_path)  # Returns dict or list of dicts

    # 3. Ensure we have a list
    records = data if isinstance(data, list) else [data]

    # 4. Validate each record
    results = []
    for i, record in enumerate(records):
        result = validator.validate(model_class, record)
        results.append(ValidationResult(
            index=i,
            record=record,
            success=result.success,
            errors=result.errors if not result.success else None,
        ))

    # 5. Format and output
    if output_format == "json":
        reporter.output_json(results)
    else:
        reporter.output_console(results)
```
Phased Implementation Guide
Phase 1: Basic CLI Structure (1-2 hours)
Goal: Create a working CLI that accepts arguments.
- Set up the project structure:

```
pydantic-validate/
├── pyproject.toml
├── src/
│   └── pydantic_validate/
│       ├── __init__.py
│       ├── cli.py
│       ├── loader.py
│       ├── reader.py
│       ├── validator.py
│       └── formatter.py
└── tests/
```

- Install dependencies:

```shell
pip install pydantic click pyyaml rich
```

- Create a basic CLI with Click:
  - Define `--schema`, `--model`, `--file` options
  - Add `--verbose` and `--output` flags
  - Validate that files exist

Checkpoint: `pydantic-validate --help` shows usage.
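A minimal Click skeleton for this phase might look like the following; the option names match the CLI spec above, and the command body is a stub to be filled in during later phases:

```python
import click


@click.command()
@click.option("--schema", "-s", "schema_path", required=True,
              type=click.Path(exists=True), help="Python file with Pydantic models")
@click.option("--model", "-m", "model_name", required=True,
              help="Name of the model class to validate against")
@click.option("--file", "-f", "data_path", required=True,
              type=click.Path(exists=True), help="JSON or YAML data file")
@click.option("--verbose", is_flag=True, help="Show extra detail")
@click.option("--output", type=click.Choice(["console", "json"]), default="console",
              help="Output format")
def cli(schema_path, model_name, data_path, verbose, output):
    """Validate a data file against a Pydantic schema."""
    click.echo(f"Validating {data_path} against {model_name}...")


if __name__ == "__main__":
    cli()
```

`click.Path(exists=True)` handles the "validate that files exist" requirement for free: Click rejects missing paths before the command body runs.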
Phase 2: Schema Loading (1-2 hours)
Goal: Dynamically load Pydantic models from files.
- Implement `loader.py`:
  - Use `importlib.util` to load Python files
  - Extract the specified model class
  - Validate it's a BaseModel subclass
- Handle errors:
  - File not found
  - Syntax errors in the schema file
  - Model not found in the file
  - Not a Pydantic model
Checkpoint: Can load a User model from a schema file.
Phase 3: Data Reading (1 hour)
Goal: Read JSON and YAML files.
- Implement `reader.py`:
  - Detect the format from the file extension
  - Parse JSON with `json.load()`
  - Parse YAML with `yaml.safe_load()`
  - Handle both single objects and arrays
- Error handling:
  - Invalid JSON/YAML syntax
  - Empty files
Checkpoint: Can read and parse sample data files.
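A sketch of `reader.py` under these requirements. YAML support is guarded by a local import so the JSON path works without pyyaml installed:

```python
import json
from pathlib import Path


def read_data(file_path: str):
    """Parse a JSON or YAML file, detecting the format from the extension."""
    path = Path(file_path)
    text = path.read_text(encoding="utf-8")

    if not text.strip():
        raise ValueError(f"{file_path} is empty")

    suffix = path.suffix.lower()
    if suffix in (".yaml", ".yml"):
        import yaml  # requires pyyaml
        return yaml.safe_load(text)
    if suffix == ".json":
        return json.loads(text)
    raise ValueError(f"Unsupported format: {suffix}")
```

Parse errors (`json.JSONDecodeError`, `yaml.YAMLError`) deliberately propagate so the CLI layer can report them with the file name.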
Phase 4: Core Validation (2 hours)
Goal: Validate records and collect errors.
- Implement `validator.py`:
  - Create a `ValidationResult` dataclass
  - Try to instantiate the model with each record
  - Catch `ValidationError` and extract error details
- Process errors:
  - Parse the `e.errors()` list
  - Store location, message, type, and input
Checkpoint: Can detect invalid records and list errors.
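A sketch of `validator.py` along these lines (the field names mirror the pseudo-code in the Data Flow section):

```python
from dataclasses import dataclass
from typing import Optional

from pydantic import BaseModel, ValidationError


@dataclass
class ValidationResult:
    index: int
    success: bool
    errors: Optional[list] = None  # entries from ValidationError.errors()


def validate_record(model_class: type, record: dict, index: int = 0) -> ValidationResult:
    """Validate one record, capturing error details instead of raising."""
    try:
        model_class.model_validate(record)
        return ValidationResult(index=index, success=True)
    except ValidationError as e:
        # e.errors() yields dicts with type, loc, msg, input (and often ctx)
        return ValidationResult(index=index, success=False, errors=e.errors())
```

Because failures are captured rather than raised, the caller can keep validating the remaining records and report everything at the end.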
Phase 5: Error Formatting (2-3 hours)
Goal: Create beautiful, helpful error output.
- Implement `formatter.py`:
  - Format error paths (join nested locations)
  - Add suggestions based on error type
  - Use the `rich` library for colors and tree structure
- Error suggestions:
  - `string_too_short` → Show the required length
  - `string_too_long` → Show the maximum length
  - `greater_than_equal` → Show the minimum value
  - `value_error` for email → Suggest the format
  - `enum` → List valid values
Checkpoint: Errors display with color and suggestions.
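The suggestion table above can be a plain mapping from error `type` to a message builder. This sketch reads the `ctx` values Pydantic attaches to each error (the exact `ctx` keys, such as `expected` for enum errors, are an assumption to verify against your Pydantic version):

```python
def suggest(err: dict) -> str:
    """Return a fix suggestion for one ValidationError.errors() entry, or ''."""
    ctx = err.get("ctx", {})
    builders = {
        "string_too_short": lambda: f"Needs at least {ctx.get('min_length')} character(s)",
        "string_too_long": lambda: f"Must be at most {ctx.get('max_length')} character(s)",
        "greater_than_equal": lambda: f"Use a value >= {ctx.get('ge')}",
        "enum": lambda: f"Valid values: {ctx.get('expected')}",
        "value_error": lambda: "Check the format (e.g., user@example.com for emails)",
    }
    builder = builders.get(err["type"])
    return builder() if builder else ""


print(suggest({"type": "string_too_short", "ctx": {"min_length": 1}}))
```

Unknown error types fall through to an empty string, so the formatter can simply skip the suggestion line when there is nothing useful to say.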
Phase 6: Summary and JSON Output (1 hour)
Goal: Complete the user experience.
- Add summary statistics:
  - Count valid/invalid records
  - Calculate the percentage
  - Display a summary line
- Implement JSON output:
  - Create structured output for automation
  - Include all error details
Checkpoint: Full working tool with both output formats.
Testing Strategy
Unit Tests
```python
# tests/test_loader.py
import pytest
from pydantic import BaseModel

from pydantic_validate.loader import load_model


def test_load_valid_model():
    model = load_model("schemas/user.py", "User")
    assert issubclass(model, BaseModel)


def test_load_missing_file():
    with pytest.raises(FileNotFoundError):
        load_model("nonexistent.py", "User")


def test_load_invalid_model_name():
    with pytest.raises(ValueError, match="not found"):
        load_model("schemas/user.py", "NonexistentModel")
```

```python
# tests/test_validator.py
def test_validate_valid_record():
    result = validate(User, {"name": "John", "email": "john@example.com", ...})
    assert result.success is True


def test_validate_invalid_record():
    result = validate(User, {"name": "", "email": "invalid"})
    assert result.success is False
    assert len(result.errors) == 2
```

```python
# tests/test_formatter.py
def test_format_nested_error():
    error = {"loc": ("address", "city"), "msg": "required"}
    formatted = format_error(error)
    assert "address.city" in formatted
```
Integration Tests
```python
# tests/test_cli.py
from click.testing import CliRunner

from pydantic_validate.cli import cli


def test_validate_valid_file():
    runner = CliRunner()
    result = runner.invoke(cli, [
        "--schema", "tests/fixtures/schema.py",
        "--model", "User",
        "--file", "tests/fixtures/valid_users.json",
    ])
    assert result.exit_code == 0
    assert "100.0%" in result.output


def test_validate_invalid_file():
    runner = CliRunner()
    result = runner.invoke(cli, [
        "--schema", "tests/fixtures/schema.py",
        "--model", "User",
        "--file", "tests/fixtures/invalid_users.json",
    ])
    assert result.exit_code == 1
    assert "validation error" in result.output
```
Test Fixtures
Create sample schema and data files for testing:
- `valid_users.json` - All records valid
- `invalid_users.json` - Mix of valid and invalid
- `empty.json` - Empty array
- `malformed.json` - Invalid JSON syntax
Common Pitfalls and Debugging
Pitfall 1: Import Errors in Schema Files
Problem: Schema file imports fail because dependencies aren't installed.
Solution:
- Catch `ImportError` and report which dependency is missing
- Suggest `pip install <package>`
```python
try:
    spec.loader.exec_module(module)
except ImportError as e:
    print(f"Schema requires: {e.name}")
    print(f"Install with: pip install {e.name}")
```
Pitfall 2: Forward References
Problem: Models reference each other before definition.
Solution: Use model_rebuild() after loading:
```python
model_class.model_rebuild()
```
Pitfall 3: Circular Imports
Problem: Schema file has circular dependencies.
Solution: Recommend using TYPE_CHECKING guard:
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from .other import OtherModel
```
Pitfall 4: Path Handling
Problem: Relative paths in error output are confusing.
Solution: Use absolute paths internally, display relative:
```python
from pathlib import Path

display_path = Path(full_path).relative_to(Path.cwd())
```
Pitfall 5: Unicode in Output
Problem: Special characters cause encoding errors.
Solution:
- Use `encoding="utf-8"` when reading files
- The Rich library handles Unicode output properly
Extensions and Challenges
Extension 1: Schema Discovery
Auto-discover all BaseModel subclasses in a file instead of requiring --model:
```shell
pydantic-validate --schema schema.py --file data.json
# Outputs: Found models: User, Address, Company. Use --model to specify.
```
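Discovery can reuse the loader: after executing the module, scan its namespace for classes defined in that module that subclass `BaseModel`. The helper below is written generically over any base class so the idea stands on its own:

```python
import inspect
import types


def discover_models(module: types.ModuleType, base: type) -> list:
    """Names of classes defined in `module` itself that subclass `base`."""
    names = []
    for name, obj in vars(module).items():
        if (inspect.isclass(obj)
                and issubclass(obj, base)
                and obj is not base
                # skip classes merely imported into the schema file
                and obj.__module__ == module.__name__):
            names.append(name)
    return sorted(names)
```

In the CLI, you would call `discover_models(module, BaseModel)` right after `spec.loader.exec_module(module)` and print the result when `--model` is omitted.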
Extension 2: Fix Suggestions
For certain errors, suggest the corrected value:
```
✗ address.country: "United States" should be 2 characters
  Suggestion: Use "US" (ISO 3166-1 alpha-2 country code)
```
Extension 3: Watch Mode
Re-validate when files change:
```shell
pydantic-validate --schema schema.py --model User --file data.json --watch
```
Extension 4: Schema Documentation
Generate documentation from schema:
```shell
pydantic-validate --schema schema.py --model User --docs
# Outputs markdown documentation for the User schema
```
Extension 5: Partial Validation
Validate only specific fields:
```shell
pydantic-validate --schema schema.py --model User --file data.json --fields email,age
```
Real-World Connections
Where This Pattern Appears
- CI/CD Pipelines: Validate configuration files before deployment
- Data Import Tools: Validate CSV/JSON before database insertion
- API Testing: Validate request/response payloads
- Schema Evolution: Check if old data matches new schemas
Industry Examples
- Great Expectations: Data validation for data pipelines
- JSON Schema validators: CLI tools like `ajv-cli`
- Kubernetes: YAML manifest validation
Production Considerations
- Performance: For large files, consider streaming validation
- Memory: Donโt load entire file into memory
- Parallelism: Validate multiple records concurrently
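One pragmatic way to keep memory bounded is to accept JSON Lines input (one record per line) and validate lazily. This is a sketch of the idea, not part of the core spec:

```python
import json
from typing import Iterator


def iter_records(file_path: str) -> Iterator[dict]:
    """Yield one parsed record per line from a JSON Lines file."""
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Each yielded record can be handed to `model_class.model_validate` as it streams in, so memory usage is bounded by the size of a single record rather than the whole file.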
Self-Assessment Checklist
Core Understanding
- Can I explain the difference between type coercion and strict mode?
- Can I describe the stages of Pydantic's validation pipeline?
- Can I parse a ValidationError and extract field paths, messages, and values?
- Can I use Field() to apply constraints like min_length, pattern, ge/le?
Implementation Skills
- Can I dynamically load a Python module and extract a class?
- Can I build a CLI with Click/Typer with options and flags?
- Can I format nested error paths into readable strings?
- Can I handle both JSON and YAML input formats?
Design Understanding
- Can I explain why validation should happen at system boundaries?
- Can I design a component architecture for a validation tool?
- Can I create user-friendly error messages with suggestions?
- Can I support both human and machine output formats?
Mastery Indicators
- Tool handles edge cases gracefully (empty files, syntax errors)
- Error messages are genuinely helpful for fixing issues
- Code is well-organized and testable
- Output is beautiful with color and formatting
Resources
Documentation
Books
- "Robust Python" by Patrick Viafore - Chapter 4 on Type Hints
- "Python Testing with pytest" by Brian Okken - For testing strategies
Related Projects
- pydantic-cli - Similar concept
- jsonschema CLI - JSON Schema validation