Project 4: Entity Resolution System

Build an entity resolution pipeline that deduplicates and merges entities across extraction runs, maintaining canonical references while preserving provenance.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate 1-2 weeks (20-30 hours)
Language Python (Alternatives: TypeScript, Go)
Prerequisites Projects 1-3, graph algorithms, embedding similarity
Key Topics Entity deduplication, string similarity, embedding comparison, merge strategies, canonical references, provenance tracking

1. Learning Objectives

By completing this project, you will:

  1. Understand why entity resolution is critical for knowledge graph quality.
  2. Implement multiple similarity metrics (string, semantic, structural).
  3. Design merge strategies that preserve information and provenance.
  4. Build a system that handles both batch and incremental resolution.
  5. Learn techniques for human-in-the-loop disambiguation.

2. Theoretical Foundation

2.1 Core Concepts

  • Entity Resolution (ER): The process of identifying when two or more records refer to the same real-world entity. Also called record linkage, deduplication, or entity matching.

  • Canonical Reference: A single authoritative representation of an entity to which all duplicates point.

  • Blocking: A pre-filtering technique that reduces the comparison space by only comparing entities within the same block, avoiding O(n²) pairwise comparisons.

  • Similarity Metrics:
    • String similarity: Levenshtein, Jaro-Winkler, fuzzy matching
    • Semantic similarity: Embedding cosine distance
    • Structural similarity: Shared neighbors in graph
  • Merge Strategies:
    • Winner-take-all: Choose one representation
    • Union: Combine all properties
    • Confidence-weighted: Weight by extraction confidence
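The metric families above can be combined into a single weighted score. A minimal sketch, using stdlib difflib as a stand-in for Jaro-Winkler/rapidfuzz and a plain dict for entities; the helper names, dict keys, and default weights are illustrative assumptions, not a prescribed API:

```python
from difflib import SequenceMatcher

def string_sim(a: str, b: str) -> float:
    # Stand-in for Jaro-Winkler; SequenceMatcher.ratio() is a rough
    # stdlib approximation of edit-based similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def structural_sim(neighbors_a: set, neighbors_b: set) -> float:
    # Jaccard overlap of shared graph neighbors.
    if not neighbors_a or not neighbors_b:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

def combined_score(a: dict, b: dict, semantic: float,
                   weights=(0.3, 0.5, 0.2)) -> float:
    # semantic is assumed precomputed (embedding cosine similarity).
    ws, we, wg = weights
    return (ws * string_sim(a["name"], b["name"])
            + we * semantic
            + wg * structural_sim(a["neighbors"], b["neighbors"]))
```

Because the weights sum to 1 and each component lies in [0, 1], the combined score is directly comparable against a single threshold.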

2.2 Why This Matters

Without entity resolution, your knowledge graph degrades quickly:

  • “Alice Smith”, “A. Smith”, “Alice S.” become three separate people
  • Facts get scattered across duplicates
  • Graph queries return incomplete results
  • Memory retrieval misses relevant context

2.3 Common Misconceptions

  • “Exact string matching is enough.” Names vary through spelling differences, abbreviations, and typos.
  • “Embeddings solve everything.” Embeddings can match semantically similar but different entities.
  • “Merge everything above threshold.” Aggressive merging produces false positives, and a wrong merge is harder to undo than a leftover duplicate.

2.4 ASCII Diagram: Resolution Pipeline

EXTRACTION OUTPUT
=================
Episode 1: "Alice from Acme"  →  (Alice) [confidence: 0.9]
Episode 2: "A. Smith at ACME" →  (A. Smith) [confidence: 0.85]
Episode 3: "Alice Smith"      →  (Alice Smith) [confidence: 0.95]

                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│                    BLOCKING                                  │
│                                                              │
│  Block by: first letter + entity type                        │
│  Block "A": [Alice, A. Smith, Alice Smith]                  │
│  (Reduces O(n²) to O(b × k²) where k << n)                  │
└─────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│                 PAIRWISE COMPARISON                          │
│                                                              │
│  Alice ↔ A. Smith:                                          │
│    - String: Jaro-Winkler = 0.72                            │
│    - Semantic: cosine = 0.89                                │
│    - Structural: shared neighbors = 1 (Acme)                │
│    - Combined score: 0.83                                    │
│                                                              │
│  Alice ↔ Alice Smith:                                        │
│    - String: Jaro-Winkler = 0.91                            │
│    - Semantic: cosine = 0.96                                │
│    - Combined score: 0.94                                    │
└─────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│                    CLUSTERING                                │
│                                                              │
│  Threshold: 0.80                                             │
│  Cluster: {Alice, A. Smith, Alice Smith}                    │
│                                                              │
│  Select canonical: "Alice Smith" (highest confidence)        │
└─────────────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────────────┐
│                     MERGE                                    │
│                                                              │
│  Canonical: (Alice Smith)                                    │
│    - aliases: ["Alice", "A. Smith"]                         │
│    - sources: [episode_1, episode_2, episode_3]             │
│    - merged_properties: {...}                                │
│                                                              │
│  Redirects:                                                  │
│    (Alice) ──SAME_AS──► (Alice Smith)                       │
│    (A. Smith) ──SAME_AS──► (Alice Smith)                    │
└─────────────────────────────────────────────────────────────┘
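The blocking stage of the diagram can be sketched in a few lines. This follows the diagram's "first letter + entity type" key; the function names and dict-based entities are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations

def block(entities: list[dict]) -> dict:
    """Group entities by (first letter of name, entity type).

    Only pairs inside the same block are compared afterwards,
    cutting the O(n^2) pairwise step down to per-block work."""
    blocks = defaultdict(list)
    for e in entities:
        key = (e["name"][0].upper(), e["type"])
        blocks[key].append(e)
    return dict(blocks)

def candidate_pairs(blocks: dict):
    """Yield only within-block pairs for the comparison stage."""
    for members in blocks.values():
        yield from combinations(members, 2)
```

Production blockers typically use phonetic codes or LSH rather than a first-letter key, which misses pairs like “Bob” vs. “Robert”; the structure is the same.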

3. Project Specification

3.1 What You Will Build

A Python library that:

  • Compares entities using multiple similarity metrics
  • Clusters similar entities into resolution groups
  • Merges entities while preserving provenance
  • Supports both batch and incremental resolution

3.2 Functional Requirements

  1. Blocking: resolver.block(entities) → Dict[str, List[Entity]]
  2. Pairwise similarity: resolver.compare(entity_a, entity_b) → SimilarityScore
  3. Cluster entities: resolver.cluster(entities, threshold) → List[Cluster]
  4. Merge cluster: resolver.merge(cluster) → CanonicalEntity
  5. Incremental resolution: resolver.resolve_new(entity) → Resolution (matches against previously resolved entities held by the resolver)
  6. Export redirects: resolver.get_redirects() → List[Redirect]

3.3 Example Usage / Output

from entity_resolver import EntityResolver

resolver = EntityResolver(
    similarity_weights={"string": 0.3, "semantic": 0.5, "structural": 0.2},
    threshold=0.80
)

# Batch resolution
entities = [
    Entity(name="Alice Smith", type="Person", properties={"company": "Acme"}),
    Entity(name="A. Smith", type="Person", properties={"role": "Engineer"}),
    Entity(name="Alice", type="Person", properties={"company": "Acme Corp"}),
]

result = resolver.resolve_batch(entities)

print(f"Found {len(result.clusters)} clusters")
# Found 1 clusters

for cluster in result.clusters:
    print(f"Canonical: {cluster.canonical.name}")
    print(f"Aliases: {cluster.aliases}")
    print(f"Merged properties: {cluster.canonical.properties}")
    # Canonical: Alice Smith
    # Aliases: ['A. Smith', 'Alice']
    # Merged properties: {'company': 'Acme', 'role': 'Engineer'}

# Incremental resolution
new_entity = Entity(name="Alice S.", type="Person")
resolution = resolver.resolve_new(new_entity)
print(f"Matched to: {resolution.canonical.name} (score: {resolution.score:.2f})")
# Matched to: Alice Smith (score: 0.87)

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Blocker    │───▶│  Comparator  │───▶│  Clusterer   │
└──────────────┘    └──────────────┘    └──────────────┘
                                               │
                                               ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Provenance  │◀───│    Merger    │◀───│   Ranker     │
│    Store     │    └──────────────┘    └──────────────┘
└──────────────┘

4.2 Key Components

Component Responsibility Technology
Blocker Reduce comparison space LSH, phonetic encoding
Comparator Compute similarity scores rapidfuzz, sentence-transformers
Clusterer Group similar entities Union-Find, agglomerative
Ranker Select canonical entity Confidence scoring
Merger Combine properties Custom merge strategies
ProvenanceStore Track origins SQLite/JSON

4.3 Data Models

from pydantic import BaseModel
from typing import Literal

class Entity(BaseModel):
    id: str
    name: str
    type: str
    properties: dict = {}
    embedding: list[float] | None = None
    confidence: float = 1.0
    source_episode: str | None = None

class SimilarityScore(BaseModel):
    string_score: float
    semantic_score: float
    structural_score: float
    combined_score: float

class Cluster(BaseModel):
    canonical: Entity
    members: list[Entity]
    aliases: list[str]
    merge_strategy: Literal["union", "confidence_weighted", "latest"]

class Resolution(BaseModel):
    entity: Entity
    canonical: Entity | None
    score: float
    is_new: bool

5. Implementation Guide

5.1 Development Environment Setup

mkdir entity-resolver && cd entity-resolver
python -m venv .venv && source .venv/bin/activate
pip install rapidfuzz sentence-transformers pydantic numpy scipy

5.2 Project Structure

entity-resolver/
├── src/
│   ├── resolver.py       # Main EntityResolver class
│   ├── blocking.py       # Blocking strategies
│   ├── similarity.py     # Similarity metrics
│   ├── clustering.py     # Clustering algorithms
│   ├── merging.py        # Merge strategies
│   ├── provenance.py     # Source tracking
│   └── models.py         # Pydantic models
├── tests/
│   ├── test_similarity.py
│   ├── test_clustering.py
│   └── fixtures/
└── README.md

5.3 Implementation Phases

Phase 1: Similarity Metrics (6-8h)

Goals:

  • Implement string similarity (Jaro-Winkler, fuzzy)
  • Add semantic similarity via embeddings
  • Create weighted combination

Tasks:

  1. Implement string similarity using rapidfuzz
  2. Generate embeddings with sentence-transformers
  3. Compute cosine similarity for embeddings
  4. Create weighted score combiner

Checkpoint: Compare two entities and get similarity score.
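For task 3, cosine similarity between embedding vectors reduces to a dot product over the product of norms. A dependency-free sketch (with numpy you would use dot and linalg.norm instead):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors.

    Returns a value in [-1, 1]; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```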

Phase 2: Blocking and Clustering (6-8h)

Goals:

  • Reduce comparison space with blocking
  • Cluster entities above threshold

Tasks:

  1. Implement phonetic blocking (Soundex, Metaphone)
  2. Implement LSH blocking for embeddings
  3. Build Union-Find clustering
  4. Add threshold-based cluster formation

Checkpoint: Cluster 100 entities efficiently.
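Tasks 3 and 4 can be sketched together: Union-Find merges every pair scoring at or above the threshold, and transitive closure falls out for free (if A≈B and B≈C, all three land in one cluster). Entities are referenced by index here; the function names are illustrative:

```python
class UnionFind:
    """Minimal Union-Find with path halving."""
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def cluster(n: int, scored_pairs, threshold: float = 0.80) -> list[list[int]]:
    """scored_pairs: iterable of (i, j, score) from the comparison stage.

    Merges pairs at or above threshold; returns clusters of entity indices."""
    uf = UnionFind(n)
    for i, j, score in scored_pairs:
        if score >= threshold:
            uf.union(i, j)
    groups: dict[int, list[int]] = {}
    for i in range(n):
        groups.setdefault(uf.find(i), []).append(i)
    return list(groups.values())
```

Note that transitivity cuts both ways: one spurious high-scoring pair can chain two otherwise-distinct clusters together, which is why over-merging is listed as a pitfall.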

Phase 3: Merge and Provenance (6-8h)

Goals:

  • Merge clusters into canonical entities
  • Track provenance through resolution

Tasks:

  1. Implement canonical selection strategies
  2. Build property merge logic
  3. Create SAME_AS redirect tracking
  4. Add incremental resolution for new entities

Checkpoint: Full resolution pipeline with provenance.
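A confidence-weighted union merge (tasks 1-2) can be sketched as below. The canonical entity is the highest-confidence member, property conflicts resolve in favor of the higher-confidence source, and each merged property remembers its source episode; the dict shapes and key names are assumptions:

```python
def merge_cluster(members: list[dict]) -> dict:
    """Merge a cluster into one canonical record with provenance.

    members: dicts with name, confidence, properties, source_episode."""
    canonical = max(members, key=lambda m: m["confidence"])
    merged: dict = {}
    provenance: dict = {}
    # Visit members from lowest to highest confidence so that on a key
    # conflict the higher-confidence value overwrites the lower one.
    for m in sorted(members, key=lambda m: m["confidence"]):
        for key, value in m["properties"].items():
            merged[key] = value
            provenance[key] = m["source_episode"]
    return {
        "canonical": canonical["name"],
        "aliases": [m["name"] for m in members if m is not canonical],
        "properties": merged,
        "provenance": provenance,  # per-property source episode
        "sources": [m["source_episode"] for m in members],
    }
```

Run on the Alice example, "Acme" (confidence 0.95) wins the company conflict over "Acme Corp" (0.9), matching the merged output shown in section 3.3.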


6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Test similarity functions Known string pairs
Integration Test full pipeline Batch resolution
Quality Measure precision/recall Labeled test set

6.2 Critical Test Cases

  1. Obvious matches: “John Smith” ↔ “John Smith” → score ≈ 1.0
  2. Abbreviations: “J. Smith” ↔ “John Smith” → high score
  3. Different entities: “John Smith” ↔ “Jane Smith” → below threshold
  4. Transitive closure: If A≈B and B≈C, cluster {A,B,C}

7. Common Pitfalls & Debugging

Pitfall Symptom Solution
O(n²) comparisons Resolution takes hours Add blocking
Over-merging Unrelated entities merged Raise threshold, add type constraints
Under-merging Obvious duplicates remain Lower threshold, add more metrics
Lost provenance Can’t trace merged data Store source_episode on all properties

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add human review queue for uncertain matches
  • Implement alias suggestion from merged data

8.2 Intermediate Extensions

  • Add active learning to improve thresholds
  • Implement graph-based structural similarity

8.3 Advanced Extensions

  • Train entity matching model on your domain
  • Add distributed resolution for large graphs

9. Real-World Connections

9.1 Industry Applications

  • Master Data Management: Enterprise customer/product deduplication
  • Knowledge Graph Construction: Google, Amazon, LinkedIn
  • AI Memory Systems: Zep, Graphiti entity consolidation

9.2 Interview Relevance

  • Explain blocking strategies and their complexity tradeoffs
  • Discuss precision/recall tradeoffs in entity matching
  • Describe how to handle ambiguous cases

10. Resources

10.1 Essential Reading

  • “Data Matching” by Peter Christen — Comprehensive ER techniques
  • “AI Engineering” by Chip Huyen — chapter on data quality
  • rapidfuzz documentation — String similarity algorithms
  • Previous: Project 3 (Entity Extraction)
  • Next: Project 5 (Bi-Temporal Fact Store)

11. Self-Assessment Checklist

  • I can explain why blocking is necessary for scalability
  • I understand the tradeoffs between different similarity metrics
  • I can design a merge strategy that preserves provenance
  • I know when to use batch vs. incremental resolution

12. Submission / Completion Criteria

Minimum Viable Completion:

  • String + semantic similarity working
  • Basic clustering with threshold
  • Canonical selection implemented

Full Completion:

  • Blocking strategies implemented
  • Multiple merge strategies
  • Provenance tracking complete

Excellence:

  • Human review queue
  • Active learning threshold tuning
  • Performance benchmarks on 10K+ entities