Project 3: Entity Extraction Pipeline
Build an LLM-powered pipeline that extracts entities and relationships from conversation text, transforming raw episodes into structured graph data.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1 week (15-20 hours) |
| Language | Python (Alternatives: TypeScript) |
| Prerequisites | Projects 1-2, LLM API basics, JSON schema understanding |
| Key Topics | Named entity recognition, relationship extraction, structured LLM output, prompt engineering, schema design |
1. Learning Objectives
By completing this project, you will:
- Design prompts that extract structured data from unstructured text.
- Use Pydantic models to enforce LLM output schemas.
- Handle extraction errors and ambiguities gracefully.
- Build a pipeline that processes episodes into graph-ready data.
- Understand the tradeoffs between precision and recall in extraction.
2. Theoretical Foundation
2.1 Core Concepts
- Named Entity Recognition (NER): Identifying and classifying named entities (people, organizations, dates, concepts) in text.
- Relationship Extraction: Identifying semantic relationships between entities (e.g., “Alice works at Acme”).
- Structured Output: Using JSON schemas or Pydantic models to constrain LLM output format.
- Coreference Resolution: Linking pronouns and references to their antecedents (“she” → “Alice”).
2.2 Why This Matters
Raw conversations contain implicit knowledge that must be made explicit for graph storage:
- "I started using Rust last month" → (User)-[:USES {since: "2024-11"}]->(Rust)
- "My manager Alice approved the budget" → (Alice)-[:MANAGES]->(User), (Alice)-[:APPROVED]->(Budget)
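As a concrete target for the second example above, the same fact can be written as plain Python data before any graph database is involved. This is only an illustration of the shape extraction should produce; the key names are not a fixed schema.

```python
# Illustrative structured form for "My manager Alice approved the budget".
extracted = {
    "entities": [
        {"name": "Alice", "type": "Person"},
        {"name": "User", "type": "Person"},
        {"name": "Budget", "type": "Concept"},
    ],
    "relationships": [
        {"subject": "Alice", "type": "MANAGES", "object": "User"},
        {"subject": "Alice", "type": "APPROVED", "object": "Budget"},
    ],
}
```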
2.3 Common Misconceptions
- “LLMs are perfect extractors.” They hallucinate entities and miss subtle relationships.
- “One prompt handles everything.” Different entity types need different extraction strategies.
- “Schema guarantees correctness.” Schema enforces format, not accuracy.
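The last misconception is worth seeing in two lines. The model below is a hypothetical minimal schema, not the one used later in the project; the point is that validation checks shape, not grounding.

```python
from pydantic import BaseModel

class Entity(BaseModel):  # hypothetical minimal model, just for this illustration
    name: str
    type: str

# Validation succeeds even if "Carol" never appears in the conversation:
# the schema constrains format, not whether the entity is real or in the text.
print(Entity(name="Carol", type="Person"))  # name='Carol' type='Person'
```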
2.4 ASCII Diagram: Extraction Pipeline
INPUT TEXT
==========
"Alice mentioned she's been working on the API
redesign with Bob since October. They're using
the new microservices architecture."
│
▼
┌─────────────────────────────────────────┐
│ ENTITY EXTRACTION │
│ │
│ LLM Prompt: "Extract entities..." │
│ │
│ Output: │
│ - Alice (Person) │
│ - Bob (Person) │
│ - API redesign (Project) │
│ - microservices architecture (Tech) │
│ - October (Date) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ RELATIONSHIP EXTRACTION │
│ │
│ LLM Prompt: "Extract relationships..." │
│ │
│ Output: │
│ - Alice WORKS_ON API_redesign │
│ - Bob WORKS_ON API_redesign │
│ - Alice COLLABORATES_WITH Bob │
│ - API_redesign USES microservices │
│ - API_redesign STARTED_IN October │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ GRAPH OUTPUT │
│ │
│ (Alice)──WORKS_ON──►(API_redesign) │
│ │ │ │
│ COLLABORATES_WITH USES │
│ │ │ │
│ ▼ ▼ │
│ (Bob) (microservices_arch) │
└─────────────────────────────────────────┘
3. Project Specification
3.1 What You Will Build
A Python pipeline that:
- Takes conversation text as input
- Extracts entities with types and properties
- Extracts relationships between entities
- Outputs graph-ready structured data
3.2 Functional Requirements
- Extract entities: pipeline.extract_entities(text) → List[Entity]
- Extract relationships: pipeline.extract_relationships(text, entities) → List[Relationship]
- Full extraction: pipeline.extract(text) → GraphData
- Batch processing: pipeline.process_episodes(episodes) → List[GraphData]
- Configurable schema: Support custom entity types and relationship types
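A minimal class skeleton matching these requirements could look as follows. The method bodies are stubs and nothing here is a prescribed interface; Entity, Relationship, and GraphData refer to the Pydantic models defined in section 4.3.

```python
from __future__ import annotations  # annotations stay as strings, so the model
                                    # classes need not be imported for this sketch

class ExtractionPipeline:
    """Sketch of the public surface described in 3.2."""

    def __init__(self, model: str = "gpt-4o-mini") -> None:
        self.model = model

    def extract_entities(self, text: str) -> list[Entity]: ...

    def extract_relationships(self, text: str, entities: list[Entity]) -> list[Relationship]: ...

    def extract(self, text: str) -> GraphData: ...

    def process_episodes(self, episodes: list[str]) -> list[GraphData]: ...
```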
3.3 Example Usage / Output
from extraction_pipeline import ExtractionPipeline
pipeline = ExtractionPipeline(model="gpt-4o-mini")
text = """
Alice mentioned she's been working on the API redesign with Bob since October.
They're using the new microservices architecture. The deadline is December 15th.
"""
result = pipeline.extract(text)
print("Entities:")
for entity in result.entities:
    print(f"  {entity.name} ({entity.type})")
# Alice (Person)
# Bob (Person)
# API redesign (Project)
# microservices architecture (Technology)
# December 15th (Date)
print("\nRelationships:")
for rel in result.relationships:
    print(f"  {rel.subject} --{rel.type}--> {rel.object}")
# Alice --WORKS_ON--> API redesign
# Bob --WORKS_ON--> API redesign
# Alice --COLLABORATES_WITH--> Bob
# API redesign --USES--> microservices architecture
# API redesign --HAS_DEADLINE--> December 15th
4. Solution Architecture
4.1 High-Level Design
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Text │────▶│ Entity │────▶│ Relationship │
│ Input │ │ Extractor │ │ Extractor │
└──────────────┘ └──────────────┘ └──────────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Schema │ │ Schema │
│ Validator │ │ Validator │
└──────────────┘ └──────────────┘
│ │
└──────────┬──────────┘
▼
┌──────────────┐
│ GraphData │
│ Output │
└──────────────┘
4.2 Key Components
| Component | Responsibility | Technology |
|---|---|---|
| ExtractionPipeline | Orchestration | Python class |
| EntityExtractor | Extract entities from text | LLM + Pydantic |
| RelationshipExtractor | Extract relationships | LLM + Pydantic |
| SchemaValidator | Validate output format | Pydantic models |
| PromptTemplates | Entity/relationship prompts | Jinja2 templates |
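To make the PromptTemplates row concrete, here is one way an entity-extraction prompt could be templated with Jinja2. The template wording and variable names are illustrative; in the project layout (5.2) the templates live in src/prompts/.

```python
from jinja2 import Template

# Illustrative entity-extraction prompt template.
ENTITY_PROMPT = Template(
    "Extract all named entities from the text below.\n"
    "Allowed types: {{ entity_types | join(', ') }}.\n"
    "Only include entities that literally appear in the text.\n\n"
    "Text:\n{{ text }}"
)

prompt = ENTITY_PROMPT.render(
    entity_types=["Person", "Organization", "Project", "Technology", "Date", "Concept"],
    text="Alice mentioned she's been working on the API redesign with Bob since October.",
)
print(prompt)
```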
4.3 Data Models
from pydantic import BaseModel
from typing import Literal
class Entity(BaseModel):
    name: str
    type: Literal["Person", "Organization", "Project", "Technology", "Date", "Concept"]
    properties: dict = {}
    confidence: float = 1.0
    source_span: str | None = None

class Relationship(BaseModel):
    subject: str  # Entity name
    type: str  # Relationship type
    object: str  # Entity name
    properties: dict = {}
    confidence: float = 1.0

class GraphData(BaseModel):
    entities: list[Entity]
    relationships: list[Relationship]
    source_text: str
    extraction_metadata: dict = {}
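Continuing from the models above, a raw LLM reply can be validated before it enters the pipeline. The JSON string below is a stand-in for a real model response, and the snippet uses the Pydantic v2 API.

```python
from pydantic import ValidationError

# Stand-in for a raw model reply; "Manager" is not in the Literal of allowed types.
raw_reply = '{"name": "Alice", "type": "Manager"}'

try:
    entity = Entity.model_validate_json(raw_reply)
except ValidationError as err:
    # The schema rejects the unknown type, but it cannot tell you whether
    # "Alice" actually appears in the source text (see 2.3).
    print(err)
```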
5. Implementation Guide
5.1 Development Environment Setup
mkdir extraction-pipeline && cd extraction-pipeline
python -m venv .venv && source .venv/bin/activate
pip install openai anthropic pydantic instructor jinja2
5.2 Project Structure
extraction-pipeline/
├── src/
│ ├── pipeline.py # Main pipeline class
│ ├── extractors.py # Entity and relationship extractors
│ ├── prompts/ # Prompt templates
│ │ ├── entity.jinja2
│ │ └── relationship.jinja2
│ ├── models.py # Pydantic models
│ └── validators.py # Post-processing validation
├── tests/
│ ├── test_extraction.py
│ └── fixtures/ # Test conversations
└── README.md
5.3 Implementation Phases
Phase 1: Entity Extraction (5-6h)
Goals:
- Extract basic entities with types
- Use structured output (Pydantic)
Tasks:
- Create entity extraction prompt template
- Use the instructor library for structured LLM output
- Implement entity type classification
- Handle extraction failures gracefully
Checkpoint: Entities extracted from sample text.
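A minimal sketch of Phase 1 using instructor with the OpenAI client is shown below. The prompt wording is a placeholder, the import of Entity assumes the section 4.3 models live in models.py as in 5.2, and the instructor API should be checked against the current documentation.

```python
import instructor
from openai import OpenAI
from models import Entity  # Pydantic model from 4.3, assumed importable

client = instructor.from_openai(OpenAI())

def extract_entities(text: str, model: str = "gpt-4o-mini") -> list[Entity]:
    # response_model makes instructor parse and validate the reply into Entity
    # objects, retrying once if validation fails.
    return client.chat.completions.create(
        model=model,
        response_model=list[Entity],
        max_retries=1,
        messages=[
            {"role": "system", "content": "Extract named entities from the user's text. "
                                          "Only include entities that literally appear in the text."},
            {"role": "user", "content": text},
        ],
    )
```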
Phase 2: Relationship Extraction (5-6h)
Goals:
- Extract relationships between entities
- Handle complex relationship types
Tasks:
- Create relationship extraction prompt
- Link relationships to extracted entities
- Handle coreference (pronouns)
- Validate relationship consistency
Checkpoint: Relationships link entities correctly.
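One way to handle the "link relationships to extracted entities" and "validate relationship consistency" tasks is a post-hoc filter. The import assumes the 4.3 models live in models.py; the check itself is deliberately simple.

```python
from models import Entity, Relationship  # section 4.3 models, assumed importable

def link_to_entities(relationships: list[Relationship], entities: list[Entity]) -> list[Relationship]:
    """Drop relationships whose endpoints were never extracted as entities.

    This catches the common failure where the LLM names a subject or object
    that does not appear in the entity list at all.
    """
    known = {e.name for e in entities}
    return [r for r in relationships if r.subject in known and r.object in known]
```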
Phase 3: Pipeline Integration (4-5h)
Goals:
- Combine extractors into pipeline
- Add batch processing
- Handle edge cases
Tasks:
- Build pipeline orchestration
- Add batch processing for episodes
- Implement retry logic
- Add confidence scoring
Checkpoint: Full pipeline processes episodes.
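For the retry-logic task, a generic wrapper like the one below is often enough. The pipeline object and the broad exception handling are placeholders; a real implementation would narrow the caught errors to API and validation failures.

```python
import time

def extract_with_retry(pipeline, text: str, attempts: int = 3, backoff: float = 2.0) -> "GraphData":
    """Call pipeline.extract(text), retrying failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return pipeline.extract(text)
        except Exception:  # in a real pipeline, narrow this to API/validation errors
            if attempt == attempts - 1:
                raise  # all attempts exhausted; surface the last error
            time.sleep(backoff ** attempt)  # 1s, 2s, 4s, ... between attempts
```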
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Test prompt construction | Template rendering |
| Extraction | Test extraction accuracy | Known entity/relationship sets |
| Integration | Test full pipeline | Episode → graph data |
6.2 Critical Test Cases
- Entity coverage: All mentioned entities are extracted
- Relationship accuracy: Relationships match text semantics
- Coreference: Pronouns resolved correctly
- Ambiguity: Ambiguous cases flagged appropriately
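An entity-coverage test might look like the sketch below. The import path and fixture text are assumptions, and in practice the LLM call would be mocked or the test marked as an integration test.

```python
# tests/test_extraction.py — sketch only
import pytest
from pipeline import ExtractionPipeline  # assumed import path, per the layout in 5.2

@pytest.mark.parametrize("text,expected_names", [
    ("Alice works with Bob on the API redesign.", {"Alice", "Bob", "API redesign"}),
])
def test_entity_coverage(text, expected_names):
    result = ExtractionPipeline(model="gpt-4o-mini").extract(text)
    extracted = {e.name for e in result.entities}
    # Coverage means "nothing expected is missing"; exact-name matching is brittle,
    # so some normalization may be needed until entity resolution (Project 4).
    assert expected_names <= extracted
```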
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| Schema too strict | Frequent extraction failures | Add flexibility, use Optional |
| Missing context | Coreference fails | Include surrounding text |
| Hallucinated entities | Entities not in text | Add source_span verification |
| Relationship cycles | Infinite loops | Add cycle detection |
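The source_span verification mentioned in the table can be a blunt post-processing filter like the sketch below, which assumes the 4.3 models. It removes only the most obvious hallucinations; fuzzier grounding checks are left as an extension.

```python
from models import Entity, GraphData  # section 4.3 models, assumed importable

def drop_ungrounded(result: GraphData) -> GraphData:
    """Remove entities whose source_span (or name) never occurs in the source text,
    then drop any relationships that reference a removed entity."""
    grounded = [
        e for e in result.entities
        if (e.source_span or e.name).lower() in result.source_text.lower()
    ]
    kept = {e.name for e in grounded}
    return GraphData(
        entities=grounded,
        relationships=[r for r in result.relationships if r.subject in kept and r.object in kept],
        source_text=result.source_text,
        extraction_metadata=result.extraction_metadata,
    )
```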
8. Extensions & Challenges
8.1 Beginner Extensions
- Add entity deduplication within text
- Add confidence thresholding
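Both beginner extensions can be combined in a single pass; the sketch below assumes the Entity model from 4.3 and treats case-insensitive (name, type) pairs as duplicates, which is a simplification.

```python
from models import Entity  # section 4.3 model, assumed importable

def dedupe_and_threshold(entities: list[Entity], min_confidence: float = 0.5) -> list[Entity]:
    """Keep the highest-confidence entity per (lowercased name, type) pair, then threshold."""
    best: dict[tuple[str, str], Entity] = {}
    for e in entities:
        key = (e.name.lower(), e.type)
        if key not in best or e.confidence > best[key].confidence:
            best[key] = e
    return [e for e in best.values() if e.confidence >= min_confidence]
```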
8.2 Intermediate Extensions
- Implement coreference resolution
- Add temporal expression parsing
8.3 Advanced Extensions
- Fine-tune extraction model
- Add entity linking to knowledge base
9. Real-World Connections
9.1 Industry Applications
- Knowledge Graph Construction: Google, Amazon product graphs
- AI Memory Systems: Zep, Graphiti
- Information Extraction: News aggregation, financial analysis
9.2 Interview Relevance
- Explain structured LLM output techniques
- Discuss extraction vs. generation trade-offs
10. Resources
10.1 Essential Reading
- Instructor Library Documentation — Structured LLM outputs
- “AI Engineering” by Chip Huyen — Ch. on Tool Use
- Pydantic Documentation — Schema validation
10.2 Related Projects
- Previous: Project 2 (Episode Store)
- Next: Project 4 (Entity Resolution)
11. Self-Assessment Checklist
- I can design prompts for entity extraction
- I understand structured output with Pydantic
- I can handle extraction failures gracefully
- I know when entities need deduplication
12. Submission / Completion Criteria
Minimum Viable Completion:
- Entity extraction with 5+ types
- Relationship extraction working
- Structured output with Pydantic
Full Completion:
- Batch processing pipeline
- Confidence scoring
- Error handling
Excellence:
- Coreference resolution
- Entity linking
- Performance benchmarks