Project 4: Entity Resolution System
Build an entity resolution pipeline that deduplicates and merges entities across extraction runs, maintaining canonical references while preserving provenance.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 1-2 weeks (20-30 hours) |
| Language | Python (Alternatives: TypeScript, Go) |
| Prerequisites | Projects 1-3, graph algorithms, embedding similarity |
| Key Topics | Entity deduplication, string similarity, embedding comparison, merge strategies, canonical references, provenance tracking |
1. Learning Objectives
By completing this project, you will:
- Understand why entity resolution is critical for knowledge graph quality.
- Implement multiple similarity metrics (string, semantic, structural).
- Design merge strategies that preserve information and provenance.
- Build a system that handles both batch and incremental resolution.
- Learn techniques for human-in-the-loop disambiguation.
2. Theoretical Foundation
2.1 Core Concepts
- Entity Resolution (ER): The process of identifying when two or more records refer to the same real-world entity. Also called record linkage, deduplication, or entity matching.
- Canonical Reference: A single authoritative representation of an entity to which all duplicates point.
- Blocking: Pre-filtering technique to reduce the comparison space and avoid O(n²) pairwise comparisons (a minimal sketch follows this list).
- Similarity Metrics:
- String similarity: Levenshtein, Jaro-Winkler, fuzzy matching
- Semantic similarity: Embedding cosine distance
- Structural similarity: Shared neighbors in graph
- Merge Strategies:
- Winner-take-all: Choose one representation
- Union: Combine all properties
- Confidence-weighted: Weight by extraction confidence
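As a concrete illustration of blocking, here is a minimal sketch that keys blocks on the first letter of the name plus the entity type. The block_key helper and the Entity attributes it reads are assumptions for illustration, not a prescribed API:

from collections import defaultdict

def block_key(entity) -> tuple[str, str]:
    # Illustrative key: first letter of the name plus the entity type.
    # Production blockers often use phonetic codes (Soundex/Metaphone) or LSH instead.
    return (entity.name[:1].upper(), entity.type)

def block(entities) -> dict:
    # Group entities so that pairwise comparison only happens within a block,
    # reducing O(n²) comparisons to roughly O(b × k²) for b blocks of size k.
    blocks = defaultdict(list)
    for entity in entities:
        blocks[block_key(entity)].append(entity)
    return dict(blocks)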
2.2 Why This Matters
Without entity resolution, your knowledge graph degrades quickly:
- “Alice Smith”, “A. Smith”, “Alice S.” become three separate people
- Facts get scattered across duplicates
- Graph queries return incomplete results
- Memory retrieval misses relevant context
2.3 Common Misconceptions
- “Exact string matching is enough.” Names vary in spelling, abbreviation, and typos.
- “Embeddings solve everything.” Embeddings can assign high similarity to entities that are related but distinct (e.g., two different engineers at the same company).
- “Merge everything above threshold.” Aggressive merging produces false merges, which are harder to detect and undo than leftover duplicates.
2.4 ASCII Diagram: Resolution Pipeline
EXTRACTION OUTPUT
=================
Episode 1: "Alice from Acme" → (Alice) [confidence: 0.9]
Episode 2: "A. Smith at ACME" → (A. Smith) [confidence: 0.85]
Episode 3: "Alice Smith" → (Alice Smith) [confidence: 0.95]
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                        BLOCKING                         │
│                                                         │
│  Block by: first letter + entity type                   │
│  Block "A": [Alice, A. Smith, Alice Smith]              │
│  (Reduces O(n²) to O(b × k²) where k << n)              │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                   PAIRWISE COMPARISON                   │
│                                                         │
│  Alice ↔ A. Smith:                                      │
│    - String: Jaro-Winkler = 0.72                        │
│    - Semantic: cosine = 0.89                            │
│    - Structural: shared neighbors = 1 (Acme)            │
│    - Combined score: 0.83                               │
│                                                         │
│  Alice ↔ Alice Smith:                                   │
│    - String: Jaro-Winkler = 0.91                        │
│    - Semantic: cosine = 0.96                            │
│    - Combined score: 0.94                               │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                       CLUSTERING                        │
│                                                         │
│  Threshold: 0.80                                        │
│  Cluster: {Alice, A. Smith, Alice Smith}                │
│                                                         │
│  Select canonical: "Alice Smith" (highest confidence)   │
└─────────────────────────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│                          MERGE                          │
│                                                         │
│  Canonical: (Alice Smith)                               │
│    - aliases: ["Alice", "A. Smith"]                     │
│    - sources: [episode_1, episode_2, episode_3]         │
│    - merged_properties: {...}                           │
│                                                         │
│  Redirects:                                             │
│    (Alice) ──SAME_AS──► (Alice Smith)                   │
│    (A. Smith) ──SAME_AS──► (Alice Smith)                │
└─────────────────────────────────────────────────────────┘
3. Project Specification
3.1 What You Will Build
A Python library that:
- Compares entities using multiple similarity metrics
- Clusters similar entities into resolution groups
- Merges entities while preserving provenance
- Supports both batch and incremental resolution
3.2 Functional Requirements
- Blocking: resolver.block(entities) → Dict[str, List[Entity]]
- Pairwise similarity: resolver.compare(entity_a, entity_b) → SimilarityScore
- Cluster entities: resolver.cluster(entities, threshold) → List[Cluster]
- Merge cluster: resolver.merge(cluster) → CanonicalEntity
- Incremental resolution: resolver.resolve_new(entity, existing) → Resolution
- Export redirects: resolver.get_redirects() → List[Redirect]
3.3 Example Usage / Output
from entity_resolver import Entity, EntityResolver
resolver = EntityResolver(
    similarity_weights={"string": 0.3, "semantic": 0.5, "structural": 0.2},
    threshold=0.80
)
# Batch resolution
entities = [
    Entity(name="Alice Smith", type="Person", properties={"company": "Acme"}),
    Entity(name="A. Smith", type="Person", properties={"role": "Engineer"}),
    Entity(name="Alice", type="Person", properties={"company": "Acme Corp"}),
]
result = resolver.resolve_batch(entities)
print(f"Found {len(result.clusters)} clusters")
# Found 1 clusters
for cluster in result.clusters:
    print(f"Canonical: {cluster.canonical.name}")
    print(f"Aliases: {cluster.aliases}")
    print(f"Merged properties: {cluster.canonical.properties}")
# Canonical: Alice Smith
# Aliases: ['A. Smith', 'Alice']
# Merged properties: {'company': 'Acme', 'role': 'Engineer'}
# Incremental resolution
new_entity = Entity(name="Alice S.", type="Person")
resolution = resolver.resolve_new(new_entity)
print(f"Matched to: {resolution.canonical.name} (score: {resolution.score:.2f})")
# Matched to: Alice Smith (score: 0.87)
4. Solution Architecture
4.1 High-Level Design
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Blocker    │───▶│  Comparator  │───▶│  Clusterer   │
└──────────────┘    └──────────────┘    └──────────────┘
                                               │
                                               ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Provenance  │◀───│    Merger    │◀───│    Ranker    │
│    Store     │    └──────────────┘    └──────────────┘
└──────────────┘
4.2 Key Components
| Component | Responsibility | Technology |
|---|---|---|
| Blocker | Reduce comparison space | LSH, phonetic encoding |
| Comparator | Compute similarity scores | rapidfuzz, sentence-transformers |
| Clusterer | Group similar entities | Union-Find, agglomerative |
| Ranker | Select canonical entity | Confidence scoring |
| Merger | Combine properties | Custom merge strategies |
| ProvenanceStore | Track origins | SQLite/JSON |
4.3 Data Models
from uuid import uuid4
from pydantic import BaseModel, Field
from typing import Literal

class Entity(BaseModel):
    id: str = Field(default_factory=lambda: uuid4().hex)  # auto-generated so the 3.3 examples work without explicit ids
    name: str
    type: str
    properties: dict = {}
    embedding: list[float] | None = None
    confidence: float = 1.0
    source_episode: str | None = None

class SimilarityScore(BaseModel):
    string_score: float
    semantic_score: float
    structural_score: float
    combined_score: float

class Cluster(BaseModel):
    canonical: Entity
    members: list[Entity]
    aliases: list[str]
    merge_strategy: Literal["union", "confidence_weighted", "latest"]

class Resolution(BaseModel):
    entity: Entity
    canonical: Entity | None
    score: float
    is_new: bool
5. Implementation Guide
5.1 Development Environment Setup
mkdir entity-resolver && cd entity-resolver
python -m venv .venv && source .venv/bin/activate
pip install rapidfuzz sentence-transformers pydantic numpy scipy
5.2 Project Structure
entity-resolver/
├── src/
│ ├── resolver.py # Main EntityResolver class
│ ├── blocking.py # Blocking strategies
│ ├── similarity.py # Similarity metrics
│ ├── clustering.py # Clustering algorithms
│ ├── merging.py # Merge strategies
│ ├── provenance.py # Source tracking
│ └── models.py # Pydantic models
├── tests/
│ ├── test_similarity.py
│ ├── test_clustering.py
│ └── fixtures/
└── README.md
5.3 Implementation Phases
Phase 1: Similarity Metrics (6-8h)
Goals:
- Implement string similarity (Jaro-Winkler, fuzzy)
- Add semantic similarity via embeddings
- Create weighted combination
Tasks:
- Implement string similarity using rapidfuzz
- Generate embeddings with sentence-transformers
- Compute cosine similarity for embeddings
- Create weighted score combiner
Checkpoint: Compare two entities and get similarity score.
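A possible shape for this checkpoint, assuming rapidfuzz for string similarity and sentence-transformers for embeddings; the model name and the weights below are illustrative defaults, not requirements:

import numpy as np
from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model

def string_similarity(a: str, b: str) -> float:
    # rapidfuzz scores 0-100; normalize to 0-1
    return fuzz.token_sort_ratio(a, b) / 100.0

def semantic_similarity(a: str, b: str) -> float:
    # Cosine similarity between the two name embeddings
    emb_a, emb_b = model.encode([a, b])
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def combined_score(a: str, b: str, weights=None) -> float:
    # Weighted combination; structural similarity is omitted here because it
    # needs graph context that only exists once extraction output is loaded.
    weights = weights or {"string": 0.4, "semantic": 0.6}
    return weights["string"] * string_similarity(a, b) + weights["semantic"] * semantic_similarity(a, b)

print(round(combined_score("Alice Smith", "A. Smith"), 2))

rapidfuzz also ships a Jaro-Winkler implementation (in its distance module) if you want the exact metric named in the section 2.4 diagram.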
Phase 2: Blocking and Clustering (6-8h)
Goals:
- Reduce comparison space with blocking
- Cluster entities above threshold
Tasks:
- Implement phonetic blocking (Soundex, Metaphone)
- Implement LSH blocking for embeddings
- Build Union-Find clustering
- Add threshold-based cluster formation
Checkpoint: Cluster 100 entities efficiently.
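One way to sketch the Union-Find step for this checkpoint. It assumes a compare(a, b) callable from Phase 1 that returns a combined score, and in practice it runs over one block at a time rather than all entities:

def cluster(entities, compare, threshold=0.80):
    # Union-Find: every entity starts in its own set; pairs scoring above
    # the threshold are unioned, so transitive matches land in one cluster.
    parent = list(range(len(entities)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            if compare(entities[i], entities[j]) >= threshold:
                union(i, j)

    # Group members by their root
    groups = {}
    for i, entity in enumerate(entities):
        groups.setdefault(find(i), []).append(entity)
    return list(groups.values())

Running this per block keeps the pairwise loop small, and the transitive-closure behavior tested in section 6.2 falls out of the union step for free.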
Phase 3: Merge and Provenance (6-8h)
Goals:
- Merge clusters into canonical entities
- Track provenance through resolution
Tasks:
- Implement canonical selection strategies
- Build property merge logic
- Create SAME_AS redirect tracking
- Add incremental resolution for new entities
Checkpoint: Full resolution pipeline with provenance.
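A rough sketch of canonical selection plus a union merge with provenance, assuming the Pydantic v2 models from section 4.3 (model_copy) and the src/ layout from 5.2 on the Python path; the redirect dict shape is an assumption, not part of the spec:

from models import Cluster, Entity

def merge(members: list[Entity]) -> tuple[Cluster, list[dict]]:
    # Canonical selection: highest extraction confidence wins
    canonical = max(members, key=lambda e: e.confidence)
    merged = dict(canonical.properties)
    aliases, redirects = [], []

    for member in members:
        if member is canonical:
            continue
        aliases.append(member.name)
        # Union merge: keep the canonical's value when keys collide
        for key, value in member.properties.items():
            merged.setdefault(key, value)
        # SAME_AS redirect so old references still resolve to the canonical entity
        redirects.append({"from": member.id, "to": canonical.id, "relation": "SAME_AS"})

    # Provenance: members are kept on the cluster, so each source_episode survives the merge
    canonical = canonical.model_copy(update={"properties": merged})
    cluster = Cluster(canonical=canonical, members=members, aliases=aliases, merge_strategy="union")
    return cluster, redirects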
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Test similarity functions | Known string pairs |
| Integration | Test full pipeline | Batch resolution |
| Quality | Measure precision/recall | Labeled test set |
6.2 Critical Test Cases
- Obvious matches: “John Smith” ↔ “John Smith” → score ≈ 1.0
- Abbreviations: “J. Smith” ↔ “John Smith” → high score
- Different entities: “John Smith” ↔ “Jane Smith” → below threshold
- Transitive closure: If A≈B and B≈C, cluster {A,B,C}
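The cases above translate naturally into a small pytest module. This sketch assumes Entity and EntityResolver are importable as in the section 3.3 example and that compare/cluster follow the section 3.2 signatures; the exact score cutoffs will depend on your metrics and weights:

import pytest
from entity_resolver import Entity, EntityResolver  # import path follows the 3.3 example

@pytest.fixture
def resolver():
    # Structural similarity is omitted: these unit tests have no graph context
    return EntityResolver(similarity_weights={"string": 0.4, "semantic": 0.6}, threshold=0.80)

def person(name: str) -> Entity:
    return Entity(name=name, type="Person")

def test_exact_match_scores_high(resolver):
    score = resolver.compare(person("John Smith"), person("John Smith"))
    assert score.combined_score >= 0.95

def test_abbreviation_clears_threshold(resolver):
    score = resolver.compare(person("J. Smith"), person("John Smith"))
    assert score.combined_score >= 0.80

def test_different_people_stay_below_threshold(resolver):
    score = resolver.compare(person("John Smith"), person("Jane Smith"))
    assert score.combined_score < 0.80

def test_transitive_closure_forms_one_cluster(resolver):
    entities = [person("Alice Smith"), person("A. Smith"), person("Alice S.")]
    clusters = resolver.cluster(entities, threshold=0.80)
    assert len(clusters) == 1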
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| O(n²) comparisons | Resolution takes hours | Add blocking |
| Over-merging | Unrelated entities merged | Raise threshold, add type constraints |
| Under-merging | Obvious duplicates remain | Lower threshold, add more metrics |
| Lost provenance | Can’t trace merged data | Store source_episode on all properties |
8. Extensions & Challenges
8.1 Beginner Extensions
- Add human review queue for uncertain matches
- Implement alias suggestion from merged data
8.2 Intermediate Extensions
- Add active learning to improve thresholds
- Implement graph-based structural similarity
8.3 Advanced Extensions
- Train entity matching model on your domain
- Add distributed resolution for large graphs
9. Real-World Connections
9.1 Industry Applications
- Master Data Management: Enterprise customer/product deduplication
- Knowledge Graph Construction: Google, Amazon, LinkedIn
- AI Memory Systems: Zep, Graphiti entity consolidation
9.2 Interview Relevance
- Explain blocking strategies and their complexity tradeoffs
- Discuss precision/recall tradeoffs in entity matching
- Describe how to handle ambiguous cases
10. Resources
10.1 Essential Reading
- “Data Matching” by Peter Christen — Comprehensive ER techniques
- “AI Engineering” by Chip Huyen — Ch. on Data Quality
- rapidfuzz documentation — String similarity algorithms
10.2 Related Projects
- Previous: Project 3 (Entity Extraction)
- Next: Project 5 (Bi-Temporal Fact Store)
11. Self-Assessment Checklist
- I can explain why blocking is necessary for scalability
- I understand the tradeoffs between different similarity metrics
- I can design a merge strategy that preserves provenance
- I know when to use batch vs. incremental resolution
12. Submission / Completion Criteria
Minimum Viable Completion:
- String + semantic similarity working
- Basic clustering with threshold
- Canonical selection implemented
Full Completion:
- Blocking strategies implemented
- Multiple merge strategies
- Provenance tracking complete
Excellence:
- Human review queue
- Active learning threshold tuning
- Performance benchmarks on 10K+ entities