Project 1: The Basic Information Extractor
Build a PydanticAI pipeline that extracts structured fields from text with strict validation.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | 5-7 hours |
| Language | Python |
| Prerequisites | Pydantic basics, JSON schema |
| Key Topics | structured output, validation, retries |
1. Learning Objectives
By completing this project, you will:
- Define Pydantic models for extraction.
- Validate LLM outputs against schema.
- Implement retry logic for invalid outputs.
- Log extraction failures for debugging.
- Measure extraction accuracy on sample text.
2. Theoretical Foundation
2.1 Why PydanticAI
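LLM output is probabilistic free text. Schema-first extraction declares the target fields and types up front, parses each response into that schema, and rejects anything that does not conform, so downstream code only sees typed, validated data.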
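As a minimal sketch of that idea, assuming Pydantic v2, the snippet below validates two hand-written JSON strings that stand in for model responses. The `Ticket` model and its field names are hypothetical and chosen only for illustration.

```python
from pydantic import BaseModel, ValidationError


class Ticket(BaseModel):
    """Hypothetical schema for a support ticket (illustrative field names)."""

    customer: str
    priority: int
    summary: str


# A well-formed response parses into a typed object.
good = '{"customer": "Ada", "priority": 2, "summary": "Login fails"}'
ticket = Ticket.model_validate_json(good)

# A response with a wrong type and a missing field is rejected
# instead of silently flowing downstream.
bad = '{"customer": "Ada", "priority": "urgent"}'
try:
    Ticket.model_validate_json(bad)
    errors = []
except ValidationError as exc:
    errors = exc.errors()  # one entry per failing field
```

Each entry in `errors` names the failing field in its `loc` tuple, which is the information a retry prompt or a failure log needs.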
3. Project Specification
3.1 What You Will Build
A CLI tool that takes unstructured text (emails, tickets) and outputs validated JSON.
3.2 Functional Requirements
- Pydantic model with required fields.
- Output parser tied to schema.
- Retry strategy on validation errors.
- Error logs that capture the raw model output whenever validation fails.
- Evaluation on a small dataset.
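One way to tie the output parser to the schema is to derive the prompt instructions from the same model that later validates the response. The sketch below assumes Pydantic v2; the `Email` model and the prompt wording are hypothetical.

```python
import json

from pydantic import BaseModel, Field


class Email(BaseModel):
    """Hypothetical extraction target for an inbound email."""

    sender: str = Field(description="Address the message came from")
    subject: str
    urgent: bool = False  # has a default, so it is not a required field


# The JSON schema is generated from the model, so the prompt and the
# validator cannot drift apart.
schema = Email.model_json_schema()
prompt = (
    "Extract the fields described by this JSON schema and reply with JSON only:\n"
    + json.dumps(schema, indent=2)
)
```

Fields without a default appear under the schema's `required` key, so the required-field requirement is visible to the model as well as enforced at validation time.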
3.3 Non-Functional Requirements
- Deterministic mode for testing (for example, fixed or stubbed model responses).
- Clear error messages for invalid outputs.
- Configurable schema for new domains.
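A configurable schema could be built at runtime from a plain mapping, for instance one loaded from a config file. This is a sketch under the assumption that Pydantic v2's `create_model` is acceptable for the project; `build_schema`, `FIELD_TYPES`, and the `Invoice` example are hypothetical names.

```python
from pydantic import BaseModel, ValidationError, create_model

# Hypothetical mapping from config-file type names to Python types.
FIELD_TYPES = {"str": str, "int": int, "float": float, "bool": bool}


def build_schema(name: str, fields: dict) -> type:
    """Create a Pydantic model whose fields are all required."""
    # (type, ...) marks a field as required in create_model.
    definitions = {
        field: (FIELD_TYPES[type_name], ...) for field, type_name in fields.items()
    }
    return create_model(name, **definitions)


# A new domain only needs a new mapping, not new code.
Invoice = build_schema("Invoice", {"vendor": "str", "total": "float"})
invoice = Invoice.model_validate({"vendor": "Acme", "total": 12.5})

try:
    Invoice.model_validate({"vendor": "Acme"})
    missing_rejected = False
except ValidationError:
    missing_rejected = True
```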
4. Solution Architecture
4.1 Components
| Component | Responsibility |
|---|---|
| Schema Model | Define fields and types |
| Agent | Generate structured output |
| Validator | Enforce schema |
| Logger | Track failures |
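The four components can be wired together roughly as follows. This is only a sketch: the agent is a stub that returns a fixed string (which also gives a deterministic mode for tests), and `Ticket`, `stub_agent`, and `run_pipeline` are hypothetical names. A real agent would call the LLM through PydanticAI instead.

```python
import logging
from typing import Callable, Optional

from pydantic import BaseModel, ValidationError

logger = logging.getLogger("extractor")  # Logger: tracks failures


class Ticket(BaseModel):  # Schema Model: fields and types
    customer: str
    issue: str


def stub_agent(text: str) -> str:
    """Agent stand-in: always returns the same JSON (assumption for testing)."""
    return '{"customer": "Ada", "issue": "Cannot log in"}'


def validate(raw: str) -> Ticket:  # Validator: enforces the schema
    return Ticket.model_validate_json(raw)


def run_pipeline(
    text: str, agent: Callable[[str], str] = stub_agent
) -> Optional[Ticket]:
    """Run agent then validator; log and return None on invalid output."""
    raw = agent(text)
    try:
        return validate(raw)
    except ValidationError as exc:
        logger.error("invalid output: %s | raw=%r", exc.errors(), raw)
        return None


result = run_pipeline("Ada wrote: I cannot log in since Monday.")
```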
5. Implementation Guide
5.1 Project Structure
LEARN_PYDANTIC_AI/P01-basic-extractor/
├── src/
│ ├── models.py
│ ├── agent.py
│ ├── validate.py
│ └── cli.py
5.2 Implementation Phases
Phase 1: Schema (2h)
- Define extraction fields.
- Checkpoint: sample data validates.
Phase 2: Agent + validation (2-3h)
- Run extraction with Pydantic validation.
- Checkpoint: invalid output triggers retry.
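The Phase 2 checkpoint can be exercised with a hand-rolled retry loop like the one below. It is a sketch, not PydanticAI's own retry mechanism: `call_model` is a hypothetical callable standing in for the LLM, and the corrective prompt wording is an assumption. Check the PydanticAI docs for the retry settings the library itself provides.

```python
from typing import Callable

from pydantic import BaseModel, ValidationError


class Contact(BaseModel):
    """Hypothetical extraction target."""

    name: str
    email: str


def extract_with_retry(
    call_model: Callable[[str], str], text: str, max_attempts: int = 3
) -> Contact:
    """Call the model until its output validates or attempts run out."""
    prompt = text
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return Contact.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            # Feed the validation errors back so the next attempt can fix them.
            prompt = (
                f"{text}\n\nYour previous output was invalid:\n{exc}\n"
                "Return corrected JSON only."
            )
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


def make_flaky_model() -> Callable[[str], str]:
    """Stub model: first reply lacks a required field, second reply is valid."""
    replies = iter(
        ['{"name": "Ada"}', '{"name": "Ada", "email": "ada@example.com"}']
    )
    return lambda prompt: next(replies)


contact = extract_with_retry(make_flaky_model(), "Contact: Ada, ada@example.com")
```

Capping the attempts matters: a model that never produces valid output should surface as an error rather than loop forever.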
Phase 3: Evaluation (1-2h)
- Score accuracy on sample inputs.
- Checkpoint: report shows success rate.
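A small scoring helper is enough for the Phase 3 report. The sketch below assumes predictions are plain dicts (or `None` when extraction failed) and that accuracy means exact match per expected field; both are assumptions, and the sample data is made up for illustration.

```python
from typing import Optional


def success_rate(predictions: list) -> float:
    """Fraction of inputs that produced a validated result (not None)."""
    if not predictions:
        return 0.0
    return sum(p is not None for p in predictions) / len(predictions)


def field_accuracy(predictions: list, expected: list) -> float:
    """Fraction of expected fields whose predicted value matches exactly."""
    total = 0
    correct = 0
    for pred, gold in zip(predictions, expected):
        for field, value in gold.items():
            total += 1
            # A failed extraction (None) counts every field as wrong.
            if pred is not None and pred.get(field) == value:
                correct += 1
    return correct / total if total else 0.0


# Made-up sample: one correct, one failed, one with a wrong email.
predictions: list = [
    {"name": "Ada", "email": "ada@example.com"},
    None,
    {"name": "Bob", "email": "wrong@example.com"},
]
expected = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Cy", "email": "cy@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
]

report = {
    "success_rate": success_rate(predictions),
    "field_accuracy": field_accuracy(predictions, expected),
}
```

Reporting both numbers separates "the output validated" from "the output was right", which a schema check alone cannot tell apart.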
6. Testing Strategy
6.1 Test Categories
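| Category | Purpose | Example |
|---|---|---|
| Unit | Check schema rules in isolation | A missing required field fails validation |
| Integration | Run the agent end to end | Valid JSON is produced for a sample input |
| Regression | Guard the retry behavior | The pipeline recovers from an invalid output |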
6.2 Critical Test Cases
- Missing required field triggers retry.
- Extra fields are either rejected or ignored, depending on how the schema is configured.
- Output matches schema exactly.
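The schema-level cases above can be written as plain test functions. This sketch assumes Pydantic v2, where extra fields are ignored by default and rejected when the model sets `extra="forbid"`; the two ticket models are hypothetical. The functions use bare `assert` so they run with or without a test runner.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LenientTicket(BaseModel):
    """Default configuration: unknown fields are dropped."""

    title: str


class StrictTicket(BaseModel):
    """Unknown fields cause a validation error."""

    model_config = ConfigDict(extra="forbid")

    title: str


def test_missing_required_field_fails():
    try:
        StrictTicket.model_validate({})
    except ValidationError as exc:
        assert exc.errors()[0]["type"] == "missing"
    else:
        raise AssertionError("expected a ValidationError")


def test_extra_field_is_ignored_by_default():
    ticket = LenientTicket.model_validate({"title": "x", "mood": "angry"})
    assert ticket.model_dump() == {"title": "x"}


def test_extra_field_is_rejected_when_forbidden():
    try:
        StrictTicket.model_validate({"title": "x", "mood": "angry"})
    except ValidationError as exc:
        assert exc.errors()[0]["type"] == "extra_forbidden"
    else:
        raise AssertionError("expected a ValidationError")
```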
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Loose schema | Inconsistent output across runs | Tighten field types |
| Over-strict schema | Many validation failures | Make non-essential fields optional |
| Hidden errors | Failures are hard to debug | Log the raw output |
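For the hidden-errors pitfall, one option is to write each failure as a JSON line that keeps the raw output next to the validation errors. The record layout, the `parse_or_record` helper, and the `Ticket` model below are assumptions made for this sketch.

```python
import io
import json
from typing import Optional, TextIO

from pydantic import BaseModel, ValidationError


class Ticket(BaseModel):
    """Hypothetical extraction target."""

    title: str
    priority: int


def parse_or_record(raw: str, sink: TextIO) -> Optional[Ticket]:
    """Validate raw output; on failure append a JSON-lines record to sink."""
    try:
        return Ticket.model_validate_json(raw)
    except ValidationError as exc:
        record = {
            "raw_output": raw,  # exactly what the model returned
            "errors": [
                {"loc": list(e["loc"]), "type": e["type"], "msg": e["msg"]}
                for e in exc.errors()
            ],
        }
        sink.write(json.dumps(record) + "\n")
        return None


# An in-memory sink stands in for a log file here.
sink = io.StringIO()
outcome = parse_or_record('{"title": "Printer jam"}', sink)
logged = json.loads(sink.getvalue())
```

Because every record is a single JSON line, failures can later be grepped, counted by error type, or replayed as regression inputs.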
8. Extensions & Challenges
Beginner
- Add a second schema type.
- Add CSV export.
Intermediate
- Add confidence scores.
- Add human review queue.
Advanced
- Add schema auto-inference.
- Add active learning for hard cases.
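As a starting point for the confidence-score and review-queue extensions, the sketch below adds a bounded `confidence` field and routes low-confidence results to a list. It assumes the score is self-reported by the model, which is not a calibrated probability; `ScoredTicket`, `REVIEW_THRESHOLD`, and `route` are hypothetical names.

```python
from pydantic import BaseModel, Field, ValidationError


class ScoredTicket(BaseModel):
    """Hypothetical schema with a model-reported confidence in [0, 1]."""

    title: str
    confidence: float = Field(ge=0.0, le=1.0)


REVIEW_THRESHOLD = 0.7  # assumed cut-off; tune it on your own data


def route(ticket: ScoredTicket, review_queue: list, accepted: list) -> None:
    """Send low-confidence extractions to human review."""
    if ticket.confidence < REVIEW_THRESHOLD:
        review_queue.append(ticket)
    else:
        accepted.append(ticket)


review_queue: list = []
accepted: list = []
route(ScoredTicket(title="Refund request", confidence=0.95), review_queue, accepted)
route(ScoredTicket(title="Unclear complaint", confidence=0.4), review_queue, accepted)

# Out-of-range scores are rejected by the schema itself.
try:
    ScoredTicket(title="Bad score", confidence=1.5)
    out_of_range_rejected = False
except ValidationError:
    out_of_range_rejected = True
```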
9. Real-World Connections
- Support workflows rely on structured extraction to turn emails and tickets into usable records.
- Compliance processes need validated fields.
10. Resources
- PydanticAI docs
- JSON schema references
11. Self-Assessment Checklist
- I can define Pydantic schemas for extraction.
- I can validate and retry invalid outputs.
- I can measure extraction accuracy.
12. Submission / Completion Criteria
Minimum Completion:
- Schema-validated extraction
- Retry on invalid output
Full Completion:
- Evaluation report
- Error logging
Excellence:
- Confidence scoring
- Human review workflow
This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/LEARN_PYDANTIC_AI.md.