Project 3: Entity Extraction Pipeline
Build an LLM-powered pipeline that extracts entities and relationships from conversation text, transforming raw episodes into structured graph data.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1 week (15-20 hours) |
| Language | Python (Alternatives: TypeScript) |
| Prerequisites | Projects 1-2, LLM API basics, JSON schema understanding |
| Key Topics | Named entity recognition, relationship extraction, structured LLM output, prompt engineering, schema design |
1. Learning Objectives
By completing this project, you will:
- Design prompts that extract structured data from unstructured text.
- Use Pydantic models to enforce LLM output schemas.
- Handle extraction errors and ambiguities gracefully.
- Build a pipeline that processes episodes into graph-ready data.
- Understand the tradeoffs between precision and recall in extraction.
2. Theoretical Foundation
2.1 Core Concepts
- Named Entity Recognition (NER): Identifying and classifying named entities (people, organizations, dates, concepts) in text.
- Relationship Extraction: Identifying semantic relationships between entities (e.g., “Alice works at Acme”).
- Structured Output: Using JSON schemas or Pydantic models to constrain LLM output format.
- Coreference Resolution: Linking pronouns and references to their antecedents (“she” → “Alice”).
2.2 Why This Matters
Raw conversations contain implicit knowledge that must be made explicit for graph storage:
- "I started using Rust last month" → (User)-[:USES {since: "2024-11"}]->(Rust)
- "My manager Alice approved the budget" → (Alice)-[:MANAGES]->(User), (Alice)-[:APPROVED]->(Budget)
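As a concrete target for the second example above, the same fact can be written as plain Python data before any graph database is involved. This is only an illustration of the shape extraction should produce; the key names are not a fixed schema.

```python
# Illustrative structured form for "My manager Alice approved the budget".
extracted = {
    "entities": [
        {"name": "Alice", "type": "Person"},
        {"name": "User", "type": "Person"},
        {"name": "Budget", "type": "Concept"},
    ],
    "relationships": [
        {"subject": "Alice", "type": "MANAGES", "object": "User"},
        {"subject": "Alice", "type": "APPROVED", "object": "Budget"},
    ],
}
```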
2.3 Common Misconceptions
- “LLMs are perfect extractors.” They hallucinate entities and miss subtle relationships.
- “One prompt handles everything.” Different entity types need different extraction strategies.
- “Schema guarantees correctness.” Schema enforces format, not accuracy.
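The last misconception is worth seeing in two lines. The model below is a hypothetical minimal schema, not the one used later in the project; the point is that validation checks shape, not grounding.

```python
from pydantic import BaseModel

class Entity(BaseModel):  # hypothetical minimal model, just for this illustration
    name: str
    type: str

# Validation succeeds even if "Carol" never appears in the conversation:
# the schema constrains format, not whether the entity is real or in the text.
print(Entity(name="Carol", type="Person"))  # name='Carol' type='Person'
```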
2.4 ASCII Diagram: Extraction Pipeline
INPUT TEXT
==========
"Alice mentioned she's been working on the API
redesign with Bob since October. They're using
the new microservices architecture."
│
▼
┌─────────────────────────────────────────┐
│ ENTITY EXTRACTION │
│ │
│ LLM Prompt: "Extract entities..." │
│ │
│ Output: │
│ - Alice (Person) │
│ - Bob (Person) │
│ - API redesign (Project) │
│ - microservices architecture (Tech) │
│ - October (Date) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ RELATIONSHIP EXTRACTION │
│ │
│ LLM Prompt: "Extract relationships..." │
│ │
│ Output: │
│ - Alice WORKS_ON API_redesign │
│ - Bob WORKS_ON API_redesign │
│ - Alice COLLABORATES_WITH Bob │
│ - API_redesign USES microservices │
│ - API_redesign STARTED_IN October │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ GRAPH OUTPUT │
│ │
│ (Alice)──WORKS_ON──►(API_redesign) │
│ │ │ │
│ COLLABORATES_WITH USES │
│ │ │ │
│ ▼ ▼ │
│ (Bob) (microservices_arch) │
└─────────────────────────────────────────┘
3. Project Specification
3.1 What You Will Build
A Python pipeline that:
- Takes conversation text as input
- Extracts entities with types and properties
- Extracts relationships between entities
- Outputs graph-ready structured data
3.2 Functional Requirements
- Extract entities: pipeline.extract_entities(text) → List[Entity]
- Extract relationships: pipeline.extract_relationships(text, entities) → List[Relationship]
- Full extraction: pipeline.extract(text) → GraphData
- Batch processing: pipeline.process_episodes(episodes) → List[GraphData]
- Configurable schema: Support custom entity types and relationship types
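A minimal class skeleton matching these requirements could look as follows. The method bodies are stubs and nothing here is a prescribed interface; Entity, Relationship, and GraphData refer to the Pydantic models defined in section 4.3.

```python
from __future__ import annotations  # annotations stay as strings, so the model
                                    # classes need not be imported for this sketch

class ExtractionPipeline:
    """Sketch of the public surface described in 3.2."""

    def __init__(self, model: str = "gpt-4o-mini") -> None:
        self.model = model

    def extract_entities(self, text: str) -> list[Entity]: ...

    def extract_relationships(self, text: str, entities: list[Entity]) -> list[Relationship]: ...

    def extract(self, text: str) -> GraphData: ...

    def process_episodes(self, episodes: list[str]) -> list[GraphData]: ...
```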
3.3 Example Usage / Output
from extraction_pipeline import ExtractionPipeline
pipeline = ExtractionPipeline(model="gpt-4o-mini")
text = """
Alice mentioned she's been working on the API redesign with Bob since October.
They're using the new microservices architecture. The deadline is December 15th.
"""
result = pipeline.extract(text)
print("Entities:")
for entity in result.entities:
    print(f"  {entity.name} ({entity.type})")
# Alice (Person)
# Bob (Person)
# API redesign (Project)
# microservices architecture (Technology)
# December 15th (Date)
print("\nRelationships:")
for rel in result.relationships:
    print(f"  {rel.subject} --{rel.type}--> {rel.object}")
# Alice --WORKS_ON--> API redesign
# Bob --WORKS_ON--> API redesign
# Alice --COLLABORATES_WITH--> Bob
# API redesign --USES--> microservices architecture
# API redesign --HAS_DEADLINE--> December 15th
4. Solution Architecture
4.1 High-Level Design
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Text │────▶│ Entity │────▶│ Relationship │
│ Input │ │ Extractor │ │ Extractor │
└──────────────┘ └──────────────┘ └──────────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Schema │ │ Schema │
│ Validator │ │ Validator │
└──────────────┘ └──────────────┘
│ │
└──────────┬──────────┘
▼
┌──────────────┐
│ GraphData │
│ Output │
└──────────────┘
4.2 Key Components
| Component | Responsibility | Technology |
|---|---|---|
| ExtractionPipeline | Orchestration | Python class |
| EntityExtractor | Extract entities from text | LLM + Pydantic |
| RelationshipExtractor | Extract relationships | LLM + Pydantic |
| SchemaValidator | Validate output format | Pydantic models |
| PromptTemplates | Entity/relationship prompts | Jinja2 templates |
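To make the PromptTemplates row concrete, here is one way an entity-extraction prompt could be templated with Jinja2. The template wording and variable names are illustrative; in the project layout (5.2) the templates live in src/prompts/.

```python
from jinja2 import Template

# Illustrative entity-extraction prompt template.
ENTITY_PROMPT = Template(
    "Extract all named entities from the text below.\n"
    "Allowed types: {{ entity_types | join(', ') }}.\n"
    "Only include entities that literally appear in the text.\n\n"
    "Text:\n{{ text }}"
)

prompt = ENTITY_PROMPT.render(
    entity_types=["Person", "Organization", "Project", "Technology", "Date", "Concept"],
    text="Alice mentioned she's been working on the API redesign with Bob since October.",
)
print(prompt)
```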
4.3 Data Models
from pydantic import BaseModel
from typing import Literal
class Entity(BaseModel):
    name: str
    type: Literal["Person", "Organization", "Project", "Technology", "Date", "Concept"]
    properties: dict = {}
    confidence: float = 1.0
    source_span: str | None = None

class Relationship(BaseModel):
    subject: str  # Entity name
    type: str  # Relationship type
    object: str  # Entity name
    properties: dict = {}
    confidence: float = 1.0

class GraphData(BaseModel):
    entities: list[Entity]
    relationships: list[Relationship]
    source_text: str
    extraction_metadata: dict = {}
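Continuing from the models above, a raw LLM reply can be validated before it enters the pipeline. The JSON string below is a stand-in for a real model response, and the snippet uses the Pydantic v2 API.

```python
from pydantic import ValidationError

# Stand-in for a raw model reply; "Manager" is not in the Literal of allowed types.
raw_reply = '{"name": "Alice", "type": "Manager"}'

try:
    entity = Entity.model_validate_json(raw_reply)
except ValidationError as err:
    # The schema rejects the unknown type, but it cannot tell you whether
    # "Alice" actually appears in the source text (see 2.3).
    print(err)
```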
5. Implementation Guide
5.1 Development Environment Setup
mkdir extraction-pipeline && cd extraction-pipeline
python -m venv .venv && source .venv/bin/activate
pip install openai anthropic pydantic instructor jinja2
5.2 Project Structure
extraction-pipeline/
├── src/
│ ├── pipeline.py # Main pipeline class
│ ├── extractors.py # Entity and relationship extractors
│ ├── prompts/ # Prompt templates
│ │ ├── entity.jinja2
│ │ └── relationship.jinja2
│ ├── models.py # Pydantic models
│ └── validators.py # Post-processing validation
├── tests/
│ ├── test_extraction.py
│ └── fixtures/ # Test conversations
└── README.md
5.3 Implementation Phases
Phase 1: Entity Extraction (5-6h)
Goals:
- Extract basic entities with types
- Use structured output (Pydantic)
Tasks:
- Create entity extraction prompt template
- Use the instructor library for structured LLM output
- Implement entity type classification
- Handle extraction failures gracefully
Checkpoint: Entities extracted from sample text.
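A minimal sketch of Phase 1 using instructor with the OpenAI client is shown below. The prompt wording is a placeholder, the import of Entity assumes the section 4.3 models live in models.py as in 5.2, and the instructor API should be checked against the current documentation.

```python
import instructor
from openai import OpenAI
from models import Entity  # Pydantic model from 4.3, assumed importable

client = instructor.from_openai(OpenAI())

def extract_entities(text: str, model: str = "gpt-4o-mini") -> list[Entity]:
    # response_model makes instructor parse and validate the reply into Entity
    # objects, retrying once if validation fails.
    return client.chat.completions.create(
        model=model,
        response_model=list[Entity],
        max_retries=1,
        messages=[
            {"role": "system", "content": "Extract named entities from the user's text. "
                                          "Only include entities that literally appear in the text."},
            {"role": "user", "content": text},
        ],
    )
```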
Phase 2: Relationship Extraction (5-6h)
Goals:
- Extract relationships between entities
- Handle complex relationship types
Tasks:
- Create relationship extraction prompt
- Link relationships to extracted entities
- Handle coreference (pronouns)
- Validate relationship consistency
Checkpoint: Relationships link entities correctly.
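One way to handle the "link relationships to extracted entities" and "validate relationship consistency" tasks is a post-hoc filter. The import assumes the 4.3 models live in models.py; the check itself is deliberately simple.

```python
from models import Entity, Relationship  # section 4.3 models, assumed importable

def link_to_entities(relationships: list[Relationship], entities: list[Entity]) -> list[Relationship]:
    """Drop relationships whose endpoints were never extracted as entities.

    This catches the common failure where the LLM names a subject or object
    that does not appear in the entity list at all.
    """
    known = {e.name for e in entities}
    return [r for r in relationships if r.subject in known and r.object in known]
```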
Phase 3: Pipeline Integration (4-5h)
Goals:
- Combine extractors into pipeline
- Add batch processing
- Handle edge cases
Tasks:
- Build pipeline orchestration
- Add batch processing for episodes
- Implement retry logic
- Add confidence scoring
Checkpoint: Full pipeline processes episodes.
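For the retry-logic task, a generic wrapper like the one below is often enough. The pipeline object and the broad exception handling are placeholders; a real implementation would narrow the caught errors to API and validation failures.

```python
import time

def extract_with_retry(pipeline, text: str, attempts: int = 3, backoff: float = 2.0) -> "GraphData":
    """Call pipeline.extract(text), retrying failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return pipeline.extract(text)
        except Exception:  # in a real pipeline, narrow this to API/validation errors
            if attempt == attempts - 1:
                raise  # all attempts exhausted; surface the last error
            time.sleep(backoff ** attempt)  # 1s, 2s, 4s, ... between attempts
```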
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Test prompt construction | Template rendering |
| Extraction | Test extraction accuracy | Known entity/relationship sets |
| Integration | Test full pipeline | Episode → graph data |
6.2 Critical Test Cases
- Entity coverage: All mentioned entities are extracted
- Relationship accuracy: Relationships match text semantics
- Coreference: Pronouns resolved correctly
- Ambiguity: Ambiguous cases flagged appropriately
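An entity-coverage test might look like the sketch below. The import path and fixture text are assumptions, and in practice the LLM call would be mocked or the test marked as an integration test.

```python
# tests/test_extraction.py — sketch only
import pytest
from pipeline import ExtractionPipeline  # assumed import path, per the layout in 5.2

@pytest.mark.parametrize("text,expected_names", [
    ("Alice works with Bob on the API redesign.", {"Alice", "Bob", "API redesign"}),
])
def test_entity_coverage(text, expected_names):
    result = ExtractionPipeline(model="gpt-4o-mini").extract(text)
    extracted = {e.name for e in result.entities}
    # Coverage means "nothing expected is missing"; exact-name matching is brittle,
    # so some normalization may be needed until entity resolution (Project 4).
    assert expected_names <= extracted
```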
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| Schema too strict | Frequent extraction failures | Add flexibility, use Optional |
| Missing context | Coreference fails | Include surrounding text |
| Hallucinated entities | Entities not in text | Add source_span verification |
| Relationship cycles | Infinite loops | Add cycle detection |
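The source_span verification mentioned in the table can be a blunt post-processing filter like the sketch below, which assumes the 4.3 models. It removes only the most obvious hallucinations; fuzzier grounding checks are left as an extension.

```python
from models import Entity, GraphData  # section 4.3 models, assumed importable

def drop_ungrounded(result: GraphData) -> GraphData:
    """Remove entities whose source_span (or name) never occurs in the source text,
    then drop any relationships that reference a removed entity."""
    grounded = [
        e for e in result.entities
        if (e.source_span or e.name).lower() in result.source_text.lower()
    ]
    kept = {e.name for e in grounded}
    return GraphData(
        entities=grounded,
        relationships=[r for r in result.relationships if r.subject in kept and r.object in kept],
        source_text=result.source_text,
        extraction_metadata=result.extraction_metadata,
    )
```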
8. Extensions & Challenges
8.1 Beginner Extensions
- Add entity deduplication within text
- Add confidence thresholding
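Both beginner extensions can be combined in a single pass; the sketch below assumes the Entity model from 4.3 and treats case-insensitive (name, type) pairs as duplicates, which is a simplification.

```python
from models import Entity  # section 4.3 model, assumed importable

def dedupe_and_threshold(entities: list[Entity], min_confidence: float = 0.5) -> list[Entity]:
    """Keep the highest-confidence entity per (lowercased name, type) pair, then threshold."""
    best: dict[tuple[str, str], Entity] = {}
    for e in entities:
        key = (e.name.lower(), e.type)
        if key not in best or e.confidence > best[key].confidence:
            best[key] = e
    return [e for e in best.values() if e.confidence >= min_confidence]
```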
8.2 Intermediate Extensions
- Implement coreference resolution
- Add temporal expression parsing
8.3 Advanced Extensions
- Fine-tune extraction model
- Add entity linking to knowledge base
9. Real-World Connections
9.1 Industry Applications
- Knowledge Graph Construction: Google, Amazon product graphs
- AI Memory Systems: Zep, Graphiti
- Information Extraction: News aggregation, financial analysis
9.2 Interview Relevance
- Explain structured LLM output techniques
- Discuss extraction vs. generation trade-offs
10. Resources
10.1 Essential Reading
- Instructor Library Documentation — Structured LLM outputs
- “AI Engineering” by Chip Huyen — Ch. on Tool Use
- Pydantic Documentation — Schema validation
10.2 Related Projects
- Previous: Project 2 (Episode Store)
- Next: Project 4 (Entity Resolution)
11. Self-Assessment Checklist
- I can design prompts for entity extraction
- I understand structured output with Pydantic
- I can handle extraction failures gracefully
- I know when entities need deduplication
12. Submission / Completion Criteria
Minimum Viable Completion:
- Entity extraction with 5+ types
- Relationship extraction working
- Structured output with Pydantic
Full Completion:
- Batch processing pipeline
- Confidence scoring
- Error handling
Excellence:
- Coreference resolution
- Entity linking
- Performance benchmarks