Project 4: Entity Resolution System

Build an entity resolution pipeline that deduplicates and merges entities across extraction runs, maintaining canonical references while preserving provenance.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate 1-2 weeks (20-30 hours)
Language Python (Alternatives: TypeScript, Go)
Prerequisites Projects 1-3, graph algorithms, embedding similarity
Key Topics Entity deduplication, string similarity, embedding comparison, merge strategies, canonical references, provenance tracking

1. Learning Objectives

By completing this project, you will:

  1. Understand why entity resolution is critical for knowledge graph quality.
  2. Implement multiple similarity metrics (string, semantic, structural).
  3. Design merge strategies that preserve information and provenance.
  4. Build a system that handles both batch and incremental resolution.
  5. Learn techniques for human-in-the-loop disambiguation.

2. Theoretical Foundation

2.1 Core Concepts

  • Entity Resolution (ER): The process of identifying when two or more records refer to the same real-world entity. Also called record linkage, deduplication, or entity matching.

  • Canonical Reference: A single authoritative representation of an entity to which all duplicates point.

  • Blocking: A pre-filtering technique that reduces the comparison space by only comparing entities within the same block, avoiding O(n²) pairwise comparisons.

  • Similarity Metrics:
    • String similarity: Levenshtein, Jaro-Winkler, fuzzy matching
    • Semantic similarity: Embedding cosine distance
    • Structural similarity: Shared neighbors in graph
  • Merge Strategies:
    • Winner-take-all: Choose one representation
    • Union: Combine all properties
    • Confidence-weighted: Weight by extraction confidence
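The metric families above can be combined into a single weighted score. A minimal sketch, using stdlib difflib as a stand-in for Jaro-Winkler/rapidfuzz and a plain dict for entities; the helper names, dict keys, and default weights are illustrative assumptions, not a prescribed API:

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    # Stand-in for Jaro-Winkler; SequenceMatcher.ratio() is a rough
    # stdlib approximation of edit-based similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def structural_sim(neighbors_a: set, neighbors_b: set) -> float:
    # Jaccard overlap of shared graph neighbors.
    if not neighbors_a or not neighbors_b:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

def combined_score(a: dict, b: dict, semantic: float,
                   weights=(0.3, 0.5, 0.2)) -> float:
    # semantic is assumed precomputed (embedding cosine similarity).
    ws, we, wg = weights
    return (ws * string_sim(a["name"], b["name"])
            + we * semantic
            + wg * structural_sim(a["neighbors"], b["neighbors"]))
```

Because the weights sum to 1 and each component lies in [0, 1], the combined score is directly comparable against a single threshold.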

2.2 Why This Matters

Without entity resolution, your knowledge graph degrades quickly:

  • “Alice Smith”, “A. Smith”, “Alice S.” become three separate people
  • Facts get scattered across duplicates
  • Graph queries return incomplete results
  • Memory retrieval misses relevant context

2.3 Common Misconceptions

  • “Exact string matching is enough.” Names vary through spelling differences, abbreviations, and typos.
  • “Embeddings solve everything.” Embeddings can match semantically similar but different entities.
  • “Merge everything above threshold.” Aggressive merging produces false positives, and a wrong merge is harder to undo than a leftover duplicate.

2.4 ASCII Diagram: Resolution Pipeline

EXTRACTION OUTPUT
=================
Episode 1: "Alice from Acme"  →  (Alice) [confidence: 0.9]
Episode 2: "A. Smith at ACME" →  (A. Smith) [confidence: 0.85]
Episode 3: "Alice Smith"      →  (Alice Smith) [confidence: 0.95]

                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│                    BLOCKING                                  │
│                                                              │
│  Block by: first letter + entity type                        │
│  Block "A": [Alice, A. Smith, Alice Smith]                  │
│  (Reduces O(n²) to O(b × k²) where k << n)                  │
└─────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│                 PAIRWISE COMPARISON                          │
│                                                              │
│  Alice ↔ A. Smith:                                          │
│    - String: Jaro-Winkler = 0.72                            │
│    - Semantic: cosine = 0.89                                │
│    - Structural: shared neighbors = 1 (Acme)                │
│    - Combined score: 0.83                                    │
│                                                              │
│  Alice ↔ Alice Smith:                                        │
│    - String: Jaro-Winkler = 0.91                            │
│    - Semantic: cosine = 0.96                                │
│    - Combined score: 0.94                                    │
└─────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│                    CLUSTERING                                │
│                                                              │
│  Threshold: 0.80                                             │
│  Cluster: {Alice, A. Smith, Alice Smith}                    │
│                                                              │
│  Select canonical: "Alice Smith" (highest confidence)        │
└─────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│                     MERGE                                    │
│                                                              │
│  Canonical: (Alice Smith)                                    │
│    - aliases: ["Alice", "A. Smith"]                         │
│    - sources: [episode_1, episode_2, episode_3]             │
│    - merged_properties: {...}                                │
│                                                              │
│  Redirects:                                                  │
│    (Alice) ──SAME_AS──► (Alice Smith)                       │
│    (A. Smith) ──SAME_AS──► (Alice Smith)                    │
└─────────────────────────────────────────────────────────────┘
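The blocking stage of the diagram can be sketched in a few lines. This follows the diagram's "first letter + entity type" key; the function names and dict-based entities are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations

def block(entities: list[dict]) -> dict:
    """Group entities by (first letter of name, entity type).

    Only pairs inside the same block are compared afterwards,
    cutting the O(n^2) pairwise step down to per-block work."""
    blocks = defaultdict(list)
    for e in entities:
        key = (e["name"][0].upper(), e["type"])
        blocks[key].append(e)
    return dict(blocks)

def candidate_pairs(blocks: dict):
    """Yield only within-block pairs for the comparison stage."""
    for members in blocks.values():
        yield from combinations(members, 2)
```

Production blockers typically use phonetic codes or LSH rather than a first-letter key, which misses pairs like “Bob” vs. “Robert”; the structure is the same.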

3. Project Specification

3.1 What You Will Build

A Python library that:

  • Compares entities using multiple similarity metrics
  • Clusters similar entities into resolution groups
  • Merges entities while preserving provenance
  • Supports both batch and incremental resolution

3.2 Functional Requirements

  1. Blocking: resolver.block(entities) → Dict[str, List[Entity]]
  2. Pairwise similarity: resolver.compare(entity_a, entity_b) → SimilarityScore
  3. Cluster entities: resolver.cluster(entities, threshold) → List[Cluster]
  4. Merge cluster: resolver.merge(cluster) → CanonicalEntity
  5. Incremental resolution: resolver.resolve_new(entity) → Resolution (matches against previously resolved entities held by the resolver)
  6. Export redirects: resolver.get_redirects() → List[Redirect]

3.3 Example Usage / Output

from entity_resolver import EntityResolver

resolver = EntityResolver(
    similarity_weights={"string": 0.3, "semantic": 0.5, "structural": 0.2},
    threshold=0.80
)

# Batch resolution
entities = [
    Entity(name="Alice Smith", type="Person", properties={"company": "Acme"}),
    Entity(name="A. Smith", type="Person", properties={"role": "Engineer"}),
    Entity(name="Alice", type="Person", properties={"company": "Acme Corp"}),
]

result = resolver.resolve_batch(entities)

print(f"Found {len(result.clusters)} clusters")
# Found 1 clusters

for cluster in result.clusters:
    print(f"Canonical: {cluster.canonical.name}")
    print(f"Aliases: {cluster.aliases}")
    print(f"Merged properties: {cluster.canonical.properties}")
    # Canonical: Alice Smith
    # Aliases: ['A. Smith', 'Alice']
    # Merged properties: {'company': 'Acme', 'role': 'Engineer'}

# Incremental resolution
new_entity = Entity(name="Alice S.", type="Person")
resolution = resolver.resolve_new(new_entity)
print(f"Matched to: {resolution.canonical.name} (score: {resolution.score:.2f})")
# Matched to: Alice Smith (score: 0.87)

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Blocker    │───▶│  Comparator  │───▶│  Clusterer   │
└──────────────┘    └──────────────┘    └──────────────┘
                                               │
                                               ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Provenance  │◀───│    Merger    │◀───│   Ranker     │
│    Store     │    └──────────────┘    └──────────────┘
└──────────────┘

4.2 Key Components

Component Responsibility Technology
Blocker Reduce comparison space LSH, phonetic encoding
Comparator Compute similarity scores rapidfuzz, sentence-transformers
Clusterer Group similar entities Union-Find, agglomerative
Ranker Select canonical entity Confidence scoring
Merger Combine properties Custom merge strategies
ProvenanceStore Track origins SQLite/JSON

4.3 Data Models

from pydantic import BaseModel
from typing import Literal

class Entity(BaseModel):
    id: str
    name: str
    type: str
    properties: dict = {}
    embedding: list[float] | None = None
    confidence: float = 1.0
    source_episode: str | None = None

class SimilarityScore(BaseModel):
    string_score: float
    semantic_score: float
    structural_score: float
    combined_score: float

class Cluster(BaseModel):
    canonical: Entity
    members: list[Entity]
    aliases: list[str]
    merge_strategy: Literal["union", "confidence_weighted", "latest"]

class Resolution(BaseModel):
    entity: Entity
    canonical: Entity | None
    score: float
    is_new: bool

5. Implementation Guide

5.1 Development Environment Setup

mkdir entity-resolver && cd entity-resolver
python -m venv .venv && source .venv/bin/activate
pip install rapidfuzz sentence-transformers pydantic numpy scipy

5.2 Project Structure

entity-resolver/
├── src/
│   ├── resolver.py       # Main EntityResolver class
│   ├── blocking.py       # Blocking strategies
│   ├── similarity.py     # Similarity metrics
│   ├── clustering.py     # Clustering algorithms
│   ├── merging.py        # Merge strategies
│   ├── provenance.py     # Source tracking
│   └── models.py         # Pydantic models
├── tests/
│   ├── test_similarity.py
│   ├── test_clustering.py
│   └── fixtures/
└── README.md

5.3 Implementation Phases

Phase 1: Similarity Metrics (6-8h)

Goals:

  • Implement string similarity (Jaro-Winkler, fuzzy)
  • Add semantic similarity via embeddings
  • Create weighted combination

Tasks:

  1. Implement string similarity using rapidfuzz
  2. Generate embeddings with sentence-transformers
  3. Compute cosine similarity for embeddings
  4. Create weighted score combiner

Checkpoint: Compare two entities and get similarity score.
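For task 3, cosine similarity between embedding vectors reduces to a dot product over the product of norms. A dependency-free sketch (with numpy you would use dot and linalg.norm instead):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors.

    Returns a value in [-1, 1]; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```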

Phase 2: Blocking and Clustering (6-8h)

Goals:

  • Reduce comparison space with blocking
  • Cluster entities above threshold

Tasks:

  1. Implement phonetic blocking (Soundex, Metaphone)
  2. Implement LSH blocking for embeddings
  3. Build Union-Find clustering
  4. Add threshold-based cluster formation

Checkpoint: Cluster 100 entities efficiently.
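Tasks 3 and 4 can be sketched together: Union-Find merges every pair scoring at or above the threshold, and transitive closure falls out for free (if A≈B and B≈C, all three land in one cluster). Entities are referenced by index here; the function names are illustrative:

```python
class UnionFind:
    """Minimal Union-Find with path halving."""
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def cluster(n: int, scored_pairs, threshold: float = 0.80) -> list[list[int]]:
    """scored_pairs: iterable of (i, j, score) from the comparison stage.

    Merges pairs at or above threshold; returns clusters of entity indices."""
    uf = UnionFind(n)
    for i, j, score in scored_pairs:
        if score >= threshold:
            uf.union(i, j)
    groups: dict[int, list[int]] = {}
    for i in range(n):
        groups.setdefault(uf.find(i), []).append(i)
    return list(groups.values())
```

Note that transitivity cuts both ways: one spurious high-scoring pair can chain two otherwise-distinct clusters together, which is why over-merging is listed as a pitfall.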

Phase 3: Merge and Provenance (6-8h)

Goals:

  • Merge clusters into canonical entities
  • Track provenance through resolution

Tasks:

  1. Implement canonical selection strategies
  2. Build property merge logic
  3. Create SAME_AS redirect tracking
  4. Add incremental resolution for new entities

Checkpoint: Full resolution pipeline with provenance.
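A confidence-weighted union merge (tasks 1-2) can be sketched as below. The canonical entity is the highest-confidence member, property conflicts resolve in favor of the higher-confidence source, and each merged property remembers its source episode; the dict shapes and key names are assumptions:

```python
def merge_cluster(members: list[dict]) -> dict:
    """Merge a cluster into one canonical record with provenance.

    members: dicts with name, confidence, properties, source_episode."""
    canonical = max(members, key=lambda m: m["confidence"])
    merged: dict = {}
    provenance: dict = {}
    # Visit members from lowest to highest confidence so that on a key
    # conflict the higher-confidence value overwrites the lower one.
    for m in sorted(members, key=lambda m: m["confidence"]):
        for key, value in m["properties"].items():
            merged[key] = value
            provenance[key] = m["source_episode"]
    return {
        "canonical": canonical["name"],
        "aliases": [m["name"] for m in members if m is not canonical],
        "properties": merged,
        "provenance": provenance,  # per-property source episode
        "sources": [m["source_episode"] for m in members],
    }
```

Run on the Alice example, "Acme" (confidence 0.95) wins the company conflict over "Acme Corp" (0.9), matching the merged output shown in section 3.3.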


6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Test similarity functions Known string pairs
Integration Test full pipeline Batch resolution
Quality Measure precision/recall Labeled test set

6.2 Critical Test Cases

  1. Obvious matches: “John Smith” ↔ “John Smith” → score ≈ 1.0
  2. Abbreviations: “J. Smith” ↔ “John Smith” → high score
  3. Different entities: “John Smith” ↔ “Jane Smith” → below threshold
  4. Transitive closure: If A≈B and B≈C, cluster {A,B,C}

7. Common Pitfalls & Debugging

Pitfall Symptom Solution
O(n²) comparisons Resolution takes hours Add blocking
Over-merging Unrelated entities merged Raise threshold, add type constraints
Under-merging Obvious duplicates remain Lower threshold, add more metrics
Lost provenance Can’t trace merged data Store source_episode on all properties

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add human review queue for uncertain matches
  • Implement alias suggestion from merged data

8.2 Intermediate Extensions

  • Add active learning to improve thresholds
  • Implement graph-based structural similarity

8.3 Advanced Extensions

  • Train entity matching model on your domain
  • Add distributed resolution for large graphs

9. Real-World Connections

9.1 Industry Applications

  • Master Data Management: Enterprise customer/product deduplication
  • Knowledge Graph Construction: Google, Amazon, LinkedIn
  • AI Memory Systems: Zep, Graphiti entity consolidation

9.2 Interview Relevance

  • Explain blocking strategies and their complexity tradeoffs
  • Discuss precision/recall tradeoffs in entity matching
  • Describe how to handle ambiguous cases

10. Resources

10.1 Essential Reading

  • “Data Matching” by Peter Christen — Comprehensive ER techniques
  • “AI Engineering” by Chip Huyen — chapter on data quality
  • rapidfuzz documentation — String similarity algorithms
  • Previous: Project 3 (Entity Extraction)
  • Next: Project 5 (Bi-Temporal Fact Store)

11. Self-Assessment Checklist

  • I can explain why blocking is necessary for scalability
  • I understand the tradeoffs between different similarity metrics
  • I can design a merge strategy that preserves provenance
  • I know when to use batch vs. incremental resolution

12. Submission / Completion Criteria

Minimum Viable Completion:

  • String + semantic similarity working
  • Basic clustering with threshold
  • Canonical selection implemented

Full Completion:

  • Blocking strategies implemented
  • Multiple merge strategies
  • Provenance tracking complete

Excellence:

  • Human review queue
  • Active learning threshold tuning
  • Performance benchmarks on 10K+ entities