Project 3: Entity Extraction Pipeline

Build an LLM-powered pipeline that extracts entities and relationships from conversation text, transforming raw episodes into structured graph data.

Quick Reference

Attribute       Value
Difficulty      Level 2: Intermediate
Time Estimate   1 week (15-20 hours)
Language        Python (alternatives: TypeScript)
Prerequisites   Projects 1-2, LLM API basics, JSON schema understanding
Key Topics      Named entity recognition, relationship extraction, structured LLM output, prompt engineering, schema design

1. Learning Objectives

By completing this project, you will:

  1. Design prompts that extract structured data from unstructured text.
  2. Use Pydantic models to enforce LLM output schemas.
  3. Handle extraction errors and ambiguities gracefully.
  4. Build a pipeline that processes episodes into graph-ready data.
  5. Understand the tradeoffs between precision and recall in extraction.

2. Theoretical Foundation

2.1 Core Concepts

  • Named Entity Recognition (NER): Identifying and classifying named entities (people, organizations, dates, concepts) in text.

  • Relationship Extraction: Identifying semantic relationships between entities (e.g., “Alice works at Acme”).

  • Structured Output: Using JSON schemas or Pydantic models to constrain LLM output format (see the sketch after this list).

  • Coreference Resolution: Linking pronouns and references to their antecedents (“she” → “Alice”).
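
To make "structured output" concrete: a Pydantic model can be rendered as a JSON schema, and that schema is what the LLM is constrained to (a minimal sketch; the full Entity model used in this project appears in Section 4.3):

from pydantic import BaseModel

class Entity(BaseModel):
    name: str
    type: str

# The JSON schema derived from the model is what the LLM is asked
# (or forced, via tool/function calling) to conform to.
print(Entity.model_json_schema())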

2.2 Why This Matters

Raw conversations contain implicit knowledge that must be made explicit for graph storage:

  • “I started using Rust last month” → (User)-[:USES {since: "2024-11"}]->(Rust)
  • “My manager Alice approved the budget” → (Alice)-[:MANAGES]->(User), (Alice)-[:APPROVED]->(Budget)

2.3 Common Misconceptions

  • “LLMs are perfect extractors.” They hallucinate entities and miss subtle relationships.
  • “One prompt handles everything.” Different entity types need different extraction strategies.
  • “Schema guarantees correctness.” Schema enforces format, not accuracy.

2.4 ASCII Diagram: Extraction Pipeline

INPUT TEXT
==========
"Alice mentioned she's been working on the API
redesign with Bob since October. They're using
the new microservices architecture."

          │
          ▼
┌─────────────────────────────────────────┐
│         ENTITY EXTRACTION               │
│                                         │
│  LLM Prompt: "Extract entities..."      │
│                                         │
│  Output:                                │
│  - Alice (Person)                       │
│  - Bob (Person)                         │
│  - API redesign (Project)               │
│  - microservices architecture (Tech)    │
│  - October (Date)                       │
└─────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────┐
│      RELATIONSHIP EXTRACTION            │
│                                         │
│  LLM Prompt: "Extract relationships..." │
│                                         │
│  Output:                                │
│  - Alice WORKS_ON API_redesign          │
│  - Bob WORKS_ON API_redesign            │
│  - Alice COLLABORATES_WITH Bob          │
│  - API_redesign USES microservices      │
│  - API_redesign STARTED_IN October      │
└─────────────────────────────────────────┘
          │
          ▼
┌─────────────────────────────────────────┐
│         GRAPH OUTPUT                    │
│                                         │
│     (Alice)──WORKS_ON──►(API_redesign)  │
│        │                     │          │
│   COLLABORATES_WITH        USES         │
│        │                     │          │
│        ▼                     ▼          │
│      (Bob)      (microservices_arch)    │
└─────────────────────────────────────────┘

3. Project Specification

3.1 What You Will Build

A Python pipeline that:

  • Takes conversation text as input
  • Extracts entities with types and properties
  • Extracts relationships between entities
  • Outputs graph-ready structured data

3.2 Functional Requirements

  1. Extract entities: pipeline.extract_entities(text) → List[Entity]
  2. Extract relationships: pipeline.extract_relationships(text, entities) → List[Relationship]
  3. Full extraction: pipeline.extract(text) → GraphData
  4. Batch processing: pipeline.process_episodes(episodes) → List[GraphData]
  5. Configurable schema: Support custom entity types and relationship types (an interface sketch covering these requirements follows)
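
A minimal interface sketch covering these requirements (the method names come from the list above; Entity, Relationship, and GraphData are the Pydantic models defined in Section 4.3):

class ExtractionPipeline:
    def __init__(self, model: str = "gpt-4o-mini", entity_types: list[str] | None = None):
        self.model = model
        # Configurable schema (requirement 5): callers may override the
        # default entity types.
        self.entity_types = entity_types or [
            "Person", "Organization", "Project", "Technology", "Date", "Concept",
        ]

    def extract_entities(self, text: str) -> list[Entity]:
        ...  # LLM call with structured output (Phase 1)

    def extract_relationships(self, text: str, entities: list[Entity]) -> list[Relationship]:
        ...  # LLM call constrained to the extracted entities (Phase 2)

    def extract(self, text: str) -> GraphData:
        entities = self.extract_entities(text)
        relationships = self.extract_relationships(text, entities)
        return GraphData(entities=entities, relationships=relationships, source_text=text)

    def process_episodes(self, episodes: list[str]) -> list[GraphData]:
        return [self.extract(episode) for episode in episodes]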

3.3 Example Usage / Output

from extraction_pipeline import ExtractionPipeline

pipeline = ExtractionPipeline(model="gpt-4o-mini")

text = """
Alice mentioned she's been working on the API redesign with Bob since October.
They're using the new microservices architecture. The deadline is December 15th.
"""

result = pipeline.extract(text)

print("Entities:")
for entity in result.entities:
    print(f"  {entity.name} ({entity.type})")
    # Alice (Person)
    # Bob (Person)
    # API redesign (Project)
    # microservices architecture (Technology)
    # December 15th (Date)

print("\nRelationships:")
for rel in result.relationships:
    print(f"  {rel.subject} --{rel.type}--> {rel.object}")
    # Alice --WORKS_ON--> API redesign
    # Bob --WORKS_ON--> API redesign
    # Alice --COLLABORATES_WITH--> Bob
    # API redesign --USES--> microservices architecture
    # API redesign --HAS_DEADLINE--> December 15th

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Text      │────▶│   Entity     │────▶│ Relationship │
│    Input     │     │  Extractor   │     │  Extractor   │
└──────────────┘     └──────────────┘     └──────────────┘
                            │                     │
                            ▼                     ▼
                     ┌──────────────┐     ┌──────────────┐
                     │   Schema     │     │   Schema     │
                     │  Validator   │     │  Validator   │
                     └──────────────┘     └──────────────┘
                            │                     │
                            └──────────┬──────────┘
                                       ▼
                              ┌──────────────┐
                              │  GraphData   │
                              │   Output     │
                              └──────────────┘

4.2 Key Components

Component              Responsibility                Technology
ExtractionPipeline     Orchestration                 Python class
EntityExtractor        Extract entities from text    LLM + Pydantic
RelationshipExtractor  Extract relationships         LLM + Pydantic
SchemaValidator        Validate output format        Pydantic models
PromptTemplates        Entity/relationship prompts   Jinja2 templates
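
For the PromptTemplates component, a sketch of what prompts/entity.jinja2 might contain (the template variables entity_types and text are illustrative, not fixed names):

Extract all named entities from the text below.
Allowed entity types: {{ entity_types | join(", ") }}.
For each entity, return its name, its type, and the exact span of source text it came from.

Text:
{{ text }}

Rendering it is standard Jinja2:

from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("src/prompts"))
prompt = env.get_template("entity.jinja2").render(
    entity_types=["Person", "Organization", "Project", "Technology", "Date", "Concept"],
    text="Alice mentioned she's been working on the API redesign with Bob.",
)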

4.3 Data Models

from pydantic import BaseModel
from typing import Literal

class Entity(BaseModel):
    name: str
    type: Literal["Person", "Organization", "Project", "Technology", "Date", "Concept"]
    properties: dict = {}
    confidence: float = 1.0
    source_span: str | None = None

class Relationship(BaseModel):
    subject: str  # Entity name
    type: str     # Relationship type
    object: str   # Entity name
    properties: dict = {}
    confidence: float = 1.0

class GraphData(BaseModel):
    entities: list[Entity]
    relationships: list[Relationship]
    source_text: str
    extraction_metadata: dict = {}

5. Implementation Guide

5.1 Development Environment Setup

mkdir extraction-pipeline && cd extraction-pipeline
python -m venv .venv && source .venv/bin/activate
pip install openai anthropic pydantic instructor jinja2

5.2 Project Structure

extraction-pipeline/
├── src/
│   ├── pipeline.py       # Main pipeline class
│   ├── extractors.py     # Entity and relationship extractors
│   ├── prompts/          # Prompt templates
│   │   ├── entity.jinja2
│   │   └── relationship.jinja2
│   ├── models.py         # Pydantic models
│   └── validators.py     # Post-processing validation
├── tests/
│   ├── test_extraction.py
│   └── fixtures/         # Test conversations
└── README.md

5.3 Implementation Phases

Phase 1: Entity Extraction (5-6h)

Goals:

  • Extract basic entities with types
  • Use structured output (Pydantic)

Tasks:

  1. Create entity extraction prompt template
  2. Use the instructor library for structured LLM output (see the sketch after the checkpoint)
  3. Implement entity type classification
  4. Handle extraction failures gracefully

Checkpoint: Entities extracted from sample text.
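
One way Task 2 can look with the instructor library, which wraps the OpenAI client so that a Pydantic model (or a list of models) is passed as response_model, validated, and automatically re-asked on validation failure (a sketch, assuming the Entity model from Section 4.3 and an OPENAI_API_KEY in the environment):

import instructor
from openai import OpenAI

from src.models import Entity

client = instructor.from_openai(OpenAI())

def extract_entities(text: str) -> list[Entity]:
    # response_model=list[Entity] makes instructor validate the LLM output
    # against the Entity schema and retry (up to max_retries) if it fails.
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=list[Entity],
        max_retries=2,
        messages=[{"role": "user", "content": f"Extract all entities from:\n{text}"}],
    )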

Phase 2: Relationship Extraction (5-6h)

Goals:

  • Extract relationships between entities
  • Handle complex relationship types

Tasks:

  1. Create relationship extraction prompt
  2. Link relationships to extracted entities (see the sketch after the checkpoint)
  3. Handle coreference (pronouns)
  4. Validate relationship consistency

Checkpoint: Relationships link entities correctly.
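
A sketch combining Tasks 2-4: feed the already-extracted entity names into the relationship prompt so pronouns must resolve to known entities, then drop anything that references an unknown name (assumes the same instructor setup as Phase 1):

import instructor
from openai import OpenAI

from src.models import Entity, Relationship

client = instructor.from_openai(OpenAI())

def extract_relationships(text: str, entities: list[Entity]) -> list[Relationship]:
    names = [e.name for e in entities]
    rels = client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=list[Relationship],
        messages=[{
            "role": "user",
            "content": (
                f"Known entities: {names}\n\nText:\n{text}\n\n"
                "Extract relationships. Use only the listed entity names as "
                "subject and object, resolving pronouns to those names."
            ),
        }],
    )
    # Consistency check (Task 4): discard relationships that mention
    # entities we never extracted.
    return [r for r in rels if r.subject in names and r.object in names]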

Phase 3: Pipeline Integration (4-5h)

Goals:

  • Combine extractors into pipeline
  • Add batch processing
  • Handle edge cases

Tasks:

  1. Build pipeline orchestration
  2. Add batch processing for episodes
  3. Implement retry logic (see the sketch after the checkpoint)
  4. Add confidence scoring

Checkpoint: Full pipeline processes episodes.
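
For Tasks 2 and 3, a minimal sketch of batch processing with retries and exponential backoff (pipeline here is the ExtractionPipeline sketched in Section 3.2):

import time

from src.models import GraphData

def process_episodes(pipeline, episodes: list[str], max_retries: int = 3) -> list[GraphData]:
    results = []
    for text in episodes:
        for attempt in range(max_retries):
            try:
                results.append(pipeline.extract(text))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up after the final attempt
                time.sleep(2 ** attempt)  # back off before retrying
    return results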


6. Testing Strategy

6.1 Test Categories

Category     Purpose                   Examples
Unit         Test prompt construction  Template rendering
Extraction   Test extraction accuracy  Known entity/relationship sets
Integration  Test full pipeline        Episode → graph data

6.2 Critical Test Cases

  1. Entity coverage: All mentioned entities are extracted (example test below)
  2. Relationship accuracy: Relationships match text semantics
  3. Coreference: Pronouns resolved correctly
  4. Ambiguity: Ambiguous cases flagged appropriately
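
Test case 1 might be written like this with pytest (a sketch: the fixture text and expected names are illustrative, and "integration" is a custom marker that would need registering in the pytest configuration):

import pytest

from src.pipeline import ExtractionPipeline

FIXTURE_TEXT = "Alice mentioned she's been working on the API redesign with Bob since October."
EXPECTED_NAMES = {"Alice", "Bob", "API redesign", "October"}

@pytest.mark.integration  # hits a live LLM, so exclude it from fast CI runs
def test_entity_coverage():
    result = ExtractionPipeline(model="gpt-4o-mini").extract(FIXTURE_TEXT)
    extracted = {e.name for e in result.entities}
    missing = EXPECTED_NAMES - extracted
    assert not missing, f"Entities not extracted: {missing}"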

7. Common Pitfalls & Debugging

Pitfall                Symptom                       Solution
Schema too strict      Frequent extraction failures  Add flexibility, use Optional
Missing context        Coreference fails             Include surrounding text
Hallucinated entities  Entities not in text          Add source_span verification
Relationship cycles    Infinite loops                Add cycle detection
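
For the hallucinated-entities row, a sketch of source_span verification as a post-processing step (this would live in src/validators.py):

from src.models import Entity

def verify_source_spans(entities: list[Entity], text: str) -> list[Entity]:
    # Keep only entities whose reported source_span actually occurs in the
    # input text; a span the LLM cannot ground is likely a hallucination.
    # Entities without a span are kept here; a stricter pipeline would
    # reject them or mark them low-confidence instead.
    return [e for e in entities if e.source_span is None or e.source_span in text]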

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add entity deduplication within text (a starting point is sketched below)
  • Add confidence thresholding
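
A starting point for the deduplication extension: collapse case and whitespace variants of the same (name, type) pair, keeping the highest-confidence occurrence:

from src.models import Entity

def dedupe_entities(entities: list[Entity]) -> list[Entity]:
    best: dict[tuple[str, str], Entity] = {}
    for e in entities:
        key = (e.name.strip().lower(), e.type)
        # Keep whichever variant the extractor was most confident about.
        if key not in best or e.confidence > best[key].confidence:
            best[key] = e
    return list(best.values())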

8.2 Intermediate Extensions

  • Implement coreference resolution
  • Add temporal expression parsing

8.3 Advanced Extensions

  • Fine-tune extraction model
  • Add entity linking to knowledge base

9. Real-World Connections

9.1 Industry Applications

  • Knowledge Graph Construction: Google, Amazon product graphs
  • AI Memory Systems: Zep, Graphiti
  • Information Extraction: News aggregation, financial analysis

9.2 Interview Relevance

  • Explain structured LLM output techniques
  • Discuss extraction vs. generation trade-offs

10. Resources

10.1 Essential Reading

  • Instructor Library Documentation — Structured LLM outputs
  • “AI Engineering” by Chip Huyen — Ch. on Tool Use
  • Pydantic Documentation — Schema validation
  • Previous: Project 2 (Episode Store)
  • Next: Project 4 (Entity Resolution)

11. Self-Assessment Checklist

  • I can design prompts for entity extraction
  • I understand structured output with Pydantic
  • I can handle extraction failures gracefully
  • I know when entities need deduplication

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Entity extraction with 5+ types
  • Relationship extraction working
  • Structured output with Pydantic

Full Completion:

  • Batch processing pipeline
  • Confidence scoring
  • Error handling

Excellence:

  • Coreference resolution
  • Entity linking
  • Performance benchmarks