Project 2: Locality Sensitive Hashing (LSH)

Build an LSH index to speed up approximate nearest neighbor search with controllable recall.

Quick Reference

Attribute     | Value
Difficulty    | Level 3: Advanced
Time Estimate | 10-16 hours
Language      | Python
Prerequisites | Project 1, probability basics
Key Topics    | hashing, ANN, recall/precision tradeoffs

1. Learning Objectives

By completing this project, you will:

  1. Implement LSH with random hyperplanes.
  2. Tune hash tables for recall/latency.
  3. Compare LSH results to brute-force baseline.
  4. Measure recall@k and speedup.
  5. Visualize bucket distributions.

2. Theoretical Foundation

2.1 LSH Intuition

LSH maps similar vectors to the same bucket with high probability and dissimilar vectors to the same bucket with low probability, so a query only needs to scan the few buckets it lands in, giving sublinear search. For cosine similarity, the classic family uses random hyperplanes: each hyperplane contributes one bit (the sign of the vector's projection onto the hyperplane's normal), and two vectors separated by angle θ agree on a given bit with probability 1 − θ/π.
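A minimal sketch of the random-hyperplane family; the function names and signature here are illustrative, not a required API:

```python
import numpy as np

def make_hasher(dim, n_bits, seed=0):
    """Return a function mapping a vector to an integer bucket key."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, dim))  # one random hyperplane per bit

    def hash_vector(v):
        bits = (planes @ v) > 0  # sign of each projection gives one bit
        key = 0
        for b in bits:           # pack the bits into a single integer
            key = (key << 1) | int(b)
        return key

    return hash_vector
```

Vectors that fall on the same side of every hyperplane share a key; with `n_bits` bits there are at most `2**n_bits` buckets.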


3. Project Specification

3.1 What You Will Build

An LSH-based vector search tool with configurable hash tables and probes.

3.2 Functional Requirements

  1. Hash functions using random hyperplanes.
  2. Multiple hash tables for recall tuning.
  3. Query pipeline that probes buckets.
  4. Recall evaluation against a brute-force baseline.
  5. Benchmarking for latency.
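Requirements 1-3 can be combined into a minimal multi-table index; the class and method names below are an illustrative sketch, not the required design:

```python
import numpy as np
from collections import defaultdict

class LSHIndex:
    """Minimal multi-table LSH index (sketch under assumed naming)."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        rng = np.random.default_rng(seed)
        # Independent hyperplanes per table raise the chance that
        # true neighbors collide in at least one table.
        self.planes = [rng.standard_normal((n_bits, dim)) for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.vectors = []

    def _key(self, planes, v):
        return ((planes @ v) > 0).tobytes()  # hashable bucket key

    def add(self, v):
        vid = len(self.vectors)
        self.vectors.append(np.asarray(v, dtype=float))
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, v)].append(vid)
        return vid

    def candidates(self, q):
        """Union of bucket members across all tables."""
        out = set()
        for planes, table in zip(self.planes, self.tables):
            out.update(table.get(self._key(planes, q), ()))
        return out
```

More tables raise recall (more chances to collide with true neighbors) but also raise memory use and candidate counts.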

3.3 Non-Functional Requirements

  • Deterministic seeds for repeatable runs.
  • Configurable parameters for tuning.
  • Clear metrics reporting.

4. Solution Architecture

4.1 Components

Component    | Responsibility
Hasher       | Map vectors to bucket keys
Tables       | Store vector IDs grouped by bucket
Query Engine | Probe a query's buckets and collect candidates
Evaluator    | Compute recall and latency metrics

5. Implementation Guide

5.1 Project Structure

LEARN_VECTOR_DATABASES/P02-lsh/
├── src/
│   ├── hash.py
│   ├── index.py
│   ├── query.py
│   └── eval.py

5.2 Implementation Phases

Phase 1: Hashing (4-6h)

  • Implement random hyperplane hashes.
  • Checkpoint: bucket assignments stable.
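The checkpoint ("bucket assignments stable") amounts to a determinism check over the seed; a sketch, assuming a seeded hyperplane hasher with illustrative names:

```python
import numpy as np

def bucket_keys(data, n_bits, seed):
    """Assign every row of `data` a bucket key under one seeded hasher."""
    planes = np.random.default_rng(seed).standard_normal((n_bits, data.shape[1]))
    return [((planes @ v) > 0).tobytes() for v in data]

data = np.random.default_rng(7).standard_normal((100, 32))
run_a = bucket_keys(data, n_bits=8, seed=0)
run_b = bucket_keys(data, n_bits=8, seed=0)
assert run_a == run_b  # same seed, same buckets: checkpoint passes
```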

Phase 2: Querying (3-5h)

  • Probe buckets and compute candidates.
  • Checkpoint: results approximate brute-force.
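Probing returns a candidate set that still needs an exact rerank before results can approximate brute-force; a sketch of that step, with illustrative names:

```python
import numpy as np

def rerank(q, candidate_ids, vectors, k=5):
    """Rank candidates by exact cosine similarity and keep the top k."""
    if not candidate_ids:
        return []
    ids = np.array(sorted(candidate_ids))  # deterministic ordering
    cand = vectors[ids]
    sims = cand @ q / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q))
    return [int(i) for i in ids[np.argsort(-sims)[:k]]]
```

Only the candidate set is scanned exactly, which is where the speedup over full brute-force comes from.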

Phase 3: Evaluation (3-5h)

  • Compute recall@k and speedup.
  • Checkpoint: report shows tradeoffs.
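Recall@k compares the approximate top-k against the exact top-k; speedup is then mean brute-force query time divided by mean LSH query time over the same queries. A sketch with illustrative names:

```python
import numpy as np

def brute_force_topk(q, vectors, k):
    """Exact cosine top-k over all vectors (the baseline)."""
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [int(i) for i in np.argsort(-sims)[:k]]

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k recovered by the approximate search."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k
```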

6. Testing Strategy

6.1 Test Categories

Category    | Purpose              | Examples
Unit        | hashing correctness  | same input yields the same bucket
Integration | end-to-end querying  | candidate set size is reasonable
Regression  | evaluation stability | recall does not drift across runs

6.2 Critical Test Cases

  1. Same input hashes to same bucket.
  2. Recall improves with more tables.
  3. Query latency is lower than brute-force search.
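Test case 1 is directly expressible as a unit test; a pytest-style sketch with illustrative names:

```python
import numpy as np

def bucket_key(planes, v):
    return tuple(((planes @ v) > 0).tolist())

def test_same_input_same_bucket():
    planes = np.random.default_rng(0).standard_normal((10, 16))
    v = np.random.default_rng(1).standard_normal(16)
    # Hashing is deterministic: an identical copy lands in the same bucket.
    assert bucket_key(planes, v) == bucket_key(planes, v.copy())
    # Negating a vector flips every projection sign, so the key differs.
    assert bucket_key(planes, -v) != bucket_key(planes, v)

test_same_input_same_bucket()
```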

7. Common Pitfalls & Debugging

Pitfall            | Symptom                  | Fix
Low recall         | missed neighbors         | add tables or probe more buckets
Oversized buckets  | slow queries             | increase hash length (more bits)
Unbalanced buckets | uneven query performance | tune the projection count

8. Extensions & Challenges

Beginner

  • Add cosine vs dot comparisons.
  • Add bucket size histograms.

Intermediate

  • Add multi-probe LSH.
  • Add adaptive table selection.
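For the multi-probe extension: instead of adding tables, also check buckets whose keys are near the query's key. A sketch that generates all keys within Hamming distance 1 (names illustrative):

```python
def probe_keys(bits):
    """Yield the query's own key, then every key one bit-flip away."""
    base = tuple(bits)
    yield base
    for i in range(len(base)):
        flipped = list(base)
        flipped[i] = not flipped[i]
        yield tuple(flipped)

keys = list(probe_keys([True, False, True]))
assert len(keys) == 4  # base key + 3 single-bit flips
```

Probing these extra buckets raises recall without the memory cost of more tables, at the price of more candidates per query.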

Advanced

  • Add product quantization after LSH.
  • Compare with HNSW.

9. Real-World Connections

  • Early large-scale ANN and near-duplicate detection systems used LSH; SimHash-style random-hyperplane hashing is a well-known example.
  • Production retrieval systems tune table count and hash length to trade recall against latency.

10. Resources

  • Indyk & Motwani (1998), "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality"
  • Charikar (2002), "Similarity Estimation Techniques from Rounding Algorithms" (random-hyperplane LSH)
  • The ann-benchmarks suite for standard ANN datasets and recall/latency comparisons

11. Self-Assessment Checklist

  • I can implement LSH indexing.
  • I can measure recall vs speed.
  • I can tune hash parameters.

12. Submission / Completion Criteria

Minimum Completion:

  • LSH indexing and querying

Full Completion:

  • Recall and latency benchmarks

Excellence:

  • Multi-probe LSH
  • Hybrid indexing experiments

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/LEARN_VECTOR_DATABASES.md.