Project 1: The Basic Information Extractor

Build a PydanticAI pipeline that extracts structured fields from text with strict validation.

Quick Reference

Attribute      | Value
Difficulty     | Level 1: Beginner
Time Estimate  | 4-6 hours
Language       | Python
Prerequisites  | Pydantic basics, JSON Schema
Key Topics     | structured output, validation, retries

1. Learning Objectives

By completing this project, you will:

  1. Define Pydantic models for extraction.
  2. Validate LLM outputs against a schema.
  3. Implement retry logic for invalid outputs.
  4. Log extraction failures for debugging.
  5. Measure extraction accuracy on sample text.

2. Theoretical Foundation

2.1 Why PydanticAI

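Schema-first extraction turns probabilistic model text into typed data you can trust: each response is either parsed into a Pydantic model or rejected with a specific validation error, so downstream code never handles unchecked output.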
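To make the idea concrete, here is a minimal sketch of schema-first validation using plain Pydantic (v2 API assumed). The `TicketInfo` model and its fields are hypothetical and exist only for illustration; no LLM is called, and the JSON strings stand in for model replies.

```python
from pydantic import BaseModel, ValidationError


class TicketInfo(BaseModel):
    """Hypothetical schema for a support ticket; the fields are illustrative."""

    customer_name: str
    order_id: int
    urgent: bool


# A well-formed reply parses into a typed object.
good = TicketInfo.model_validate_json(
    '{"customer_name": "Ada", "order_id": 1042, "urgent": true}'
)
print(type(good.order_id).__name__, good.order_id)

# A reply with a wrong type and a missing field is rejected as a whole.
rejected = False
try:
    TicketInfo.model_validate_json('{"customer_name": "Ada", "order_id": "soon"}')
except ValidationError as exc:
    rejected = True
    print(f"rejected with {exc.error_count()} validation errors")
```

Note that Pydantic's default mode coerces compatible values (for example, the string "1042" would be accepted for `order_id`). If that is not wanted, strict mode can be enabled, for instance by passing `strict=True` to `model_validate_json`.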


3. Project Specification

3.1 What You Will Build

A CLI tool that takes unstructured text (emails, tickets) and outputs validated JSON.
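One possible shape for the command-line entry point is sketched below using only the standard library. The `extract` function is a hypothetical placeholder for the real extraction step, and the argument names (`path`, `--indent`) are assumptions rather than part of the project specification.

```python
import argparse
import json
import sys


def extract(text: str) -> dict:
    """Hypothetical placeholder for the real extraction step.

    It returns simple statistics so the CLI can be exercised without a model.
    """
    return {"char_count": len(text), "line_count": len(text.splitlines())}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Extract structured fields from text.")
    parser.add_argument("path", help="file containing the email or ticket text")
    parser.add_argument("--indent", type=int, default=2, help="JSON indentation")
    return parser


def main(argv: list[str]) -> int:
    """Read the input file, run extraction, and print JSON to stdout."""
    args = build_parser().parse_args(argv)
    with open(args.path, encoding="utf-8") as handle:
        text = handle.read()
    json.dump(extract(text), sys.stdout, indent=args.indent)
    sys.stdout.write("\n")
    return 0
```

In `cli.py`, `main(sys.argv[1:])` would typically be called under an `if __name__ == "__main__":` guard, with its return value passed to `sys.exit`.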

3.2 Functional Requirements

  1. A Pydantic model with required fields.
  2. An output parser tied to the schema.
  3. A retry strategy for validation errors.
  4. Error logs that capture the raw model output.
  5. Evaluation on a small dataset.

3.3 Non-Functional Requirements

  • Deterministic mode for testing.
  • Clear error messages for invalid outputs.
  • Configurable schema for new domains.
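For the "clear error messages" requirement, one option is to flatten Pydantic's `ValidationError` into one short line per failing field. The sketch below assumes Pydantic v2; the `Invoice` model and the `describe_errors` helper are hypothetical names introduced for illustration.

```python
from pydantic import BaseModel, ValidationError


class Invoice(BaseModel):
    """Hypothetical schema used only to produce validation errors."""

    vendor: str
    total: float


def describe_errors(exc: ValidationError) -> list[str]:
    """Turn a ValidationError into one short message per failing field."""
    messages = []
    for error in exc.errors():
        # "loc" is a tuple path to the field; joining it also covers nested models.
        field = ".".join(str(part) for part in error["loc"])
        messages.append(f"{field}: {error['msg']}")
    return messages


problems: list[str] = []
try:
    # "vendor" is missing and "total" cannot be parsed as a number.
    Invoice.model_validate({"total": "twelve"})
except ValidationError as exc:
    problems = describe_errors(exc)

for line in problems:
    print(line)
```

The same messages can be shown to the user, written to the failure log, or fed back to the model in a retry prompt.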

4. Solution Architecture

4.1 Components

Component     | Responsibility
Schema Model  | Define fields and types
Agent         | Generate structured output
Validator     | Enforce schema
Logger        | Track failures

5. Implementation Guide

5.1 Project Structure

LEARN_PYDANTIC_AI/P01-basic-extractor/
├── src/
│   ├── models.py
│   ├── agent.py
│   ├── validate.py
│   └── cli.py

5.2 Implementation Phases

Phase 1: Schema (2h)

  • Define extraction fields.
  • Checkpoint: sample data validates.

Phase 2: Agent + validation (2-3h)

  • Run extraction with Pydantic validation.
  • Checkpoint: invalid output triggers retry.
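The Phase 2 checkpoint can be exercised without a live model. The following is a library-agnostic sketch of a validate-and-retry loop, not PydanticAI's own retry mechanism: `call_model` is a hypothetical callable that takes a prompt and returns raw text, and `Contact` is an illustrative schema. PydanticAI's agent has its own retry handling, so in the real project this loop may be unnecessary.

```python
from typing import Callable

from pydantic import BaseModel, ValidationError


class Contact(BaseModel):
    """Hypothetical extraction target."""

    name: str
    email: str


def extract_with_retry(
    text: str,
    call_model: Callable[[str], str],
    max_attempts: int = 3,
) -> Contact:
    """Ask for JSON and re-prompt with the validation error on failure."""
    prompt = f"Extract name and email as JSON from:\n{text}"
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return Contact.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            # Feed the error back so the next attempt can correct itself.
            prompt = (
                f"{prompt}\n\nYour previous reply was invalid:\n{exc}\n"
                "Return corrected JSON only."
            )
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# A scripted stand-in for the LLM: the first reply is invalid, the second is valid.
replies = iter(['{"name": "Ada"}', '{"name": "Ada", "email": "ada@example.com"}'])
contact = extract_with_retry("Ada <ada@example.com>", lambda prompt: next(replies))
print(contact)
```

Scripting the replies this way also gives a deterministic test: the first reply lacks `email`, so the loop must retry exactly once before succeeding.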

Phase 3: Evaluation (1-2h)

  • Score accuracy on sample inputs.
  • Checkpoint: report shows success rate.
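A small scoring function is enough for the Phase 3 report. The sketch below assumes each prediction is a plain dict (or `None` when extraction failed validation) and compares it field by field with a hand-labeled expected record. The metric names and sample records are illustrative, not prescribed by the project.

```python
from typing import Optional


def score(predictions: list[Optional[dict]], expected: list[dict]) -> dict:
    """Compute success rate, exact-match rate, and per-field accuracy.

    A prediction of None stands for an extraction that failed validation.
    """
    if not expected or len(predictions) != len(expected):
        raise ValueError("need one prediction per expected record")
    validated = exact = field_hits = field_total = 0
    for pred, gold in zip(predictions, expected):
        if pred is not None:
            validated += 1
        hits = sum(1 for key, value in gold.items() if (pred or {}).get(key) == value)
        field_hits += hits
        field_total += len(gold)
        if hits == len(gold):
            exact += 1
    return {
        "success_rate": validated / len(expected),
        "exact_match": exact / len(expected),
        "field_accuracy": field_hits / field_total,
    }


expected = [
    {"name": "Ada", "order_id": 1042},
    {"name": "Grace", "order_id": 7},
    {"name": "Linus", "order_id": 99},
]
predictions = [
    {"name": "Ada", "order_id": 1042},  # fully correct
    {"name": "Grace", "order_id": 8},   # one field wrong
    None,                               # validation failed after retries
]
report = score(predictions, expected)
print(report)
```

Reporting the three numbers separately helps distinguish outputs that fail validation from outputs that validate but contain wrong values.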

6. Testing Strategy

6.1 Test Categories

Category     | Purpose  | Examples
Unit         | schema   | missing field fails
Integration  | agent    | valid JSON produced
Regression   | retries  | recover from invalid output

6.2 Critical Test Cases

  1. Missing required field triggers retry.
  2. Extra fields are rejected or ignored.
  3. Output matches schema exactly.
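The schema-level parts of these cases can be written as ordinary test functions. This sketch assumes Pydantic v2, where unknown keys are ignored by default and `extra="forbid"` rejects them. The two ticket models are hypothetical, and the retry behavior in case 1 is not covered here, only the validation failure that would trigger it.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LenientTicket(BaseModel):
    """Default behaviour: unknown keys are silently dropped."""

    subject: str


class StrictTicket(BaseModel):
    """extra="forbid" turns unknown keys into validation errors."""

    model_config = ConfigDict(extra="forbid")

    subject: str


def is_valid(model: type[BaseModel], payload: dict) -> bool:
    """Return True if the payload validates against the model."""
    try:
        model.model_validate(payload)
        return True
    except ValidationError:
        return False


def test_missing_required_field_fails():
    assert not is_valid(LenientTicket, {})


def test_extra_field_is_ignored_by_default():
    ticket = LenientTicket.model_validate({"subject": "Refund", "mood": "angry"})
    assert ticket.model_dump() == {"subject": "Refund"}


def test_extra_field_is_rejected_when_forbidden():
    assert not is_valid(StrictTicket, {"subject": "Refund", "mood": "angry"})
```

The functions follow the `test_` naming convention, so a runner such as pytest can collect them. Whether extra fields should be rejected or ignored is a design decision for the project, and the tests should pin down whichever is chosen.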

7. Common Pitfalls & Debugging

Pitfall             | Symptom              | Fix
Loose schema        | inconsistent output  | tighten field types
Over-strict schema  | many failures        | allow optional fields
Hidden errors       | hard to debug        | log raw output
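For the "hidden errors" pitfall, the fix amounts to recording the exact text the model returned whenever parsing fails. Below is a minimal sketch using the standard `logging` module. The logger name, the `parse_or_log` helper, and the message format are assumptions.

```python
import json
import logging

logger = logging.getLogger("extractor")


def parse_or_log(raw: str, attempt: int):
    """Parse model output as JSON; on failure, log the raw text and return None."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # %r keeps whitespace and control characters visible in the log.
        logger.warning("attempt %d: invalid JSON (%s); raw output: %r", attempt, exc, raw)
        return None


logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s %(message)s")
parse_or_log('{"name": "Ada"}', attempt=1)               # parses, nothing logged
parse_or_log("Sure! Here is the JSON: {...", attempt=2)  # logged with the raw text
```

Logging the raw output together with the attempt number makes it possible to see afterwards whether failures come from malformed JSON, chatty preambles, or genuine schema mismatches.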

8. Extensions & Challenges

Beginner

  • Add a second schema type.
  • Add CSV export.

Intermediate

  • Add confidence scores.
  • Add human review queue.

Advanced

  • Add schema auto-inference.
  • Add active learning for hard cases.

9. Real-World Connections

  • Support workflows rely on structured extraction.
  • Compliance needs validated fields.

10. Resources

  • PydanticAI docs
  • JSON schema references

11. Self-Assessment Checklist

  • I can define Pydantic schemas for extraction.
  • I can validate and retry invalid outputs.
  • I can measure extraction accuracy.

12. Submission / Completion Criteria

Minimum Completion:

  • Schema-validated extraction
  • Retry on invalid output

Full Completion:

  • Evaluation report
  • Error logging

Excellence:

  • Confidence scoring
  • Human review workflow

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/LEARN_PYDANTIC_AI.md.