Project 2: Locality Sensitive Hashing (LSH)
Build an LSH index to speed up approximate nearest neighbor search with controllable recall.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 10-16 hours |
| Language | Python |
| Prerequisites | Project 1, probability basics |
| Key Topics | hashing, ANN, recall/precision tradeoffs |
1. Learning Objectives
By completing this project, you will:
- Implement LSH with random hyperplanes.
- Tune the number of hash tables and hyperplanes to trade recall against latency.
- Compare LSH results against a brute-force baseline.
- Measure recall@k and speedup.
- Visualize bucket distributions.
2. Theoretical Foundation
2.1 LSH Intuition
LSH hashes vectors so that similar vectors land in the same bucket with high probability while dissimilar vectors rarely collide. At query time you score only the vectors that share a bucket with the query, which makes search sublinear in the dataset size.
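For random hyperplane hashing (Charikar's SimHash), the probability that a single hyperplane assigns the same bit to vectors x and y depends only on the angle θ(x, y) between them:

P[h(x) = h(y)] = 1 − θ(x, y)/π

Concatenating k hyperplanes into one table key and querying L independent tables then gives

P[x is a candidate for y] = 1 − (1 − (1 − θ/π)^k)^L

Larger k shrinks buckets and cuts false positives; more tables L recovers the recall that a larger k sacrifices. These two knobs drive every tuning decision in this project.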
3. Project Specification
3.1 What You Will Build
An LSH-based vector search tool with configurable hash tables and probes.
3.2 Functional Requirements
- Hash functions using random hyperplanes.
- Multiple hash tables for recall tuning.
- Query pipeline that probes buckets.
- Recall evaluation against a brute-force baseline.
- Benchmarking for latency.
3.3 Non-Functional Requirements
- Deterministic seeds for repeatable runs.
- Configurable parameters for tuning (see the configuration sketch after this list).
- Clear metrics reporting.
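A minimal configuration sketch covering both requirement lists; the parameter names (num_tables, num_hyperplanes, num_probes, seed) are illustrative, not prescribed:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LSHConfig:
    """Tunable LSH parameters; names are illustrative."""
    dim: int                    # vector dimensionality
    num_tables: int = 8         # L: independent tables (more -> higher recall)
    num_hyperplanes: int = 12   # k: bits per table key (more -> smaller buckets)
    num_probes: int = 1         # buckets probed per table at query time
    seed: int = 42              # fixed seed -> repeatable bucket assignments
```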
4. Solution Architecture
4.1 Components
| Component | Responsibility |
|---|---|
| Hasher | Map vectors to per-table bucket keys via random hyperplanes |
| Tables | Map bucket keys to stored vector IDs |
| Query Engine | Probe buckets and re-rank candidates by exact similarity |
| Evaluator | Compute recall@k and latency metrics |
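One way these components can fit together; the class and method signatures below are a sketch, not a prescribed interface:

```python
import numpy as np

class Hasher:
    """Maps a vector to one bucket key per table using random hyperplanes."""
    def keys(self, vector: np.ndarray) -> list[int]: ...

class Tables:
    """One dict per table, mapping bucket key -> list of vector IDs."""
    def add(self, keys: list[int], vec_id: int) -> None: ...
    def probe(self, keys: list[int]) -> set[int]: ...

class QueryEngine:
    """Unions candidates from probed buckets and re-ranks them exactly."""
    def search(self, query: np.ndarray, k: int) -> list[int]: ...

class Evaluator:
    """Compares LSH output to brute force: recall@k and latency."""
    def recall_at_k(self, approx: list[int], exact: list[int]) -> float: ...
```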
5. Implementation Guide
5.1 Project Structure
LEARN_VECTOR_DATABASES/P02-lsh/
├── src/
│   ├── hash.py     (random hyperplane hashing)
│   ├── index.py    (hash tables and insertion)
│   ├── query.py    (bucket probing and re-ranking)
│   └── eval.py     (recall@k and latency benchmarks)
5.2 Implementation Phases
Phase 1: Hashing (4-6h)
- Implement random hyperplane hashes.
- Checkpoint: bucket assignments are stable across runs with a fixed seed (see the hashing sketch below).
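A minimal sketch of Phase 1, assuming NumPy and the illustrative parameter names from the configuration above:

```python
import numpy as np

class HyperplaneHasher:
    """Random hyperplane (SimHash-style) hashing; names are illustrative."""

    def __init__(self, dim: int, num_tables: int, num_hyperplanes: int, seed: int = 42):
        rng = np.random.default_rng(seed)  # fixed seed -> repeatable buckets
        # One (num_hyperplanes, dim) matrix of Gaussian hyperplane normals per table.
        self.planes = rng.standard_normal((num_tables, num_hyperplanes, dim))

    def keys(self, vector: np.ndarray) -> list[int]:
        """One integer bucket key per table: the sign pattern of the projections."""
        out = []
        for table_planes in self.planes:
            bits = (table_planes @ vector) >= 0  # which side of each hyperplane
            out.append(sum(int(b) << i for i, b in enumerate(bits)))
        return out
```

Calling keys twice on the same vector must return identical keys; that is exactly the Phase 1 checkpoint.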
Phase 2: Querying (3-5h)
- Probe buckets and compute candidates.
- Checkpoint: top results approximate the brute-force ranking (see the query sketch below).
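A sketch of the probe-and-rerank pipeline, reusing the hypothetical HyperplaneHasher above. Exact cosine re-ranking over the candidate union is one reasonable design choice, not the only one:

```python
import numpy as np

class LSHIndex:
    def __init__(self, hasher: "HyperplaneHasher", num_tables: int):
        self.hasher = hasher
        self.tables = [dict() for _ in range(num_tables)]  # bucket key -> vector IDs
        self.vectors = {}                                  # vector ID -> vector

    def add(self, vec_id: int, vector: np.ndarray) -> None:
        self.vectors[vec_id] = vector
        for table, key in zip(self.tables, self.hasher.keys(vector)):
            table.setdefault(key, []).append(vec_id)

    def search(self, query: np.ndarray, k: int) -> list[int]:
        # Union of every probed bucket across all tables.
        candidates = set()
        for table, key in zip(self.tables, self.hasher.keys(query)):
            candidates.update(table.get(key, []))
        # Exact cosine re-ranking of the (hopefully small) candidate set.
        def cos(vec_id: int) -> float:
            x = self.vectors[vec_id]
            return float(query @ x) / (np.linalg.norm(query) * np.linalg.norm(x) + 1e-12)
        return sorted(candidates, key=cos, reverse=True)[:k]
```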
Phase 3: Evaluation (3-5h)
- Compute recall@k and speedup.
- Checkpoint: the report shows the recall/latency tradeoff (see the evaluation sketch below).
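A sketch of the evaluation helpers; the brute-force cosine scan and time.perf_counter timing are assumptions, not requirements:

```python
import time
import numpy as np

def brute_force_topk(query: np.ndarray, matrix: np.ndarray, k: int) -> list[int]:
    """Exact top-k row indices by cosine similarity."""
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query) + 1e-12)
    return [int(i) for i in np.argsort(-sims)[:k]]

def recall_at_k(approx: list[int], exact: list[int]) -> float:
    """Fraction of the true top-k that the approximate search recovered."""
    return len(set(approx) & set(exact)) / max(len(exact), 1)

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Report mean recall@k and the ratio of brute-force time to LSH time over many queries; a single query is too noisy to characterize either.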
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Hashing correctness | Same vector always maps to the same bucket |
| Integration | End-to-end querying | Candidate set is non-empty and bounded |
| Regression | Evaluation stability | Recall stays within tolerance across runs |
6.2 Critical Test Cases
- The same input always hashes to the same bucket (see the pytest sketch below).
- Recall@k does not decrease as tables are added.
- Query latency beats brute force once the dataset is large enough.
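Pytest sketches for the first two cases, assuming the earlier sketches are importable from the src modules; sizes, seeds, and names are illustrative:

```python
import numpy as np
# Assumed imports from the sketches above, e.g.:
# from src.hash import HyperplaneHasher
# from src.index import LSHIndex
# from src.eval import brute_force_topk, recall_at_k

def test_same_input_same_bucket():
    h = HyperplaneHasher(dim=64, num_tables=4, num_hyperplanes=8, seed=7)
    v = np.random.default_rng(0).standard_normal(64)
    assert h.keys(v) == h.keys(v)  # hashing must be deterministic

def test_recall_does_not_drop_with_more_tables():
    rng = np.random.default_rng(1)
    data = rng.standard_normal((500, 32))
    query = rng.standard_normal(32)
    exact = brute_force_topk(query, data, k=10)
    recalls = []
    for num_tables in (1, 4, 16):
        hasher = HyperplaneHasher(dim=32, num_tables=num_tables, num_hyperplanes=8, seed=7)
        index = LSHIndex(hasher, num_tables=num_tables)
        for i, row in enumerate(data):
            index.add(i, row)
        recalls.append(recall_at_k(index.search(query, k=10), exact))
    # With a shared seed the smaller index's tables are a prefix of the larger
    # one's, so more tables can only add candidates.
    assert recalls == sorted(recalls)
```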
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Low recall | true neighbors missing from results | add tables or probes |
| Large buckets | slow queries | increase hash length (more hyperplanes per table) |
| Unbalanced buckets | a few buckets dominate query time | mean-center the data and tune the projection count (see the diagnostic below) |
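A quick diagnostic for the last two pitfalls, assuming the table-of-dicts layout sketched earlier:

```python
from collections import Counter

def bucket_size_histogram(index: "LSHIndex") -> Counter:
    """Map bucket size -> number of buckets of that size, across all tables."""
    sizes = Counter()
    for table in index.tables:
        for ids in table.values():
            sizes[len(ids)] += 1
    return sizes
```

If a handful of buckets holds most of the dataset, check whether your vectors are mean-centered: hyperplanes through the origin split off-center data poorly.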
8. Extensions & Challenges
Beginner
- Compare cosine similarity against dot-product scoring.
- Add bucket size histograms.
Intermediate
- Add multi-probe LSH.
- Add adaptive table selection.
Advanced
- Add product quantization after LSH.
- Compare with HNSW.
9. Real-World Connections
- Early large-scale ANN systems relied on LSH before graph-based indexes such as HNSW became dominant.
- Production retrieval systems depend on the same hash-parameter tuning you practice here.
10. Resources
- LSH research papers (e.g., Indyk & Motwani, 1998; Charikar's SimHash, 2002)
- ANN benchmark datasets (e.g., the ann-benchmarks suite)
11. Self-Assessment Checklist
- I can implement LSH indexing.
- I can measure recall vs speed.
- I can tune hash parameters.
12. Submission / Completion Criteria
Minimum Completion:
- LSH indexing and querying
Full Completion:
- Recall and latency benchmarks
Excellence:
- Multi-probe LSH
- Hybrid indexing experiments
This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/LEARN_VECTOR_DATABASES.md.