Learn Vector Databases: From Zero to Building Your Own
Goal: Deeply understand how vector databases work internally, from the mathematics of similarity search through indexing algorithms like HNSW and IVF, to building a production-ready vector search engine. Master the technology powering modern AI applications.
Why Vector Databases Matter
Every time you use semantic search, image similarity, recommendation systems, or RAG (Retrieval-Augmented Generation) with LLMs, a vector database is doing the heavy lifting. Traditional databases search by exact match; vector databases search by meaning.
The challenge: How do you efficiently find the 10 most similar items among a billion vectors, each with 1536 dimensions? Brute force would take hours. Vector databases do it in milliseconds.
After completing these projects, you will:
- Understand exactly how similarity search works at scale
- Implement HNSW, IVF, LSH, and other ANN algorithms from scratch
- Know when to use each algorithm and what tradeoffs to make
- Build a complete vector database that handles real workloads
- Appreciate the engineering behind Pinecone, Milvus, Qdrant, and others
Core Concept Analysis
What is a Vector?
In the context of vector databases, a vector (or embedding) is a list of numbers that represents the meaning or features of something:
"The quick brown fox" โ [0.12, -0.34, 0.56, ..., 0.78] (1536 dimensions)
[Image of a cat] โ [0.23, 0.45, -0.67, ..., 0.89] (512 dimensions)
[Product attributes] โ [0.91, -0.23, 0.45, ..., 0.12] (256 dimensions)
These vectors come from ML models (OpenAI embeddings, CLIP, sentence-transformers) that position similar items close together in high-dimensional space.
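To make this concrete, here is a minimal sketch of producing and comparing text embeddings with the sentence-transformers library (the specific model and its 384-dimensional output are illustrative choices, not requirements):

import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")   # outputs 384-dim vectors
texts = ["The quick brown fox", "A fast auburn fox", "Quarterly revenue report"]
emb = model.encode(texts)                         # shape: (3, 384)

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(emb[0], emb[1]))  # high: similar meaning
print(cos(emb[0], emb[2]))  # lower: unrelated meaning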
Distance Metrics
How do we measure "closeness" between vectors?
- Euclidean Distance (L2): Straight-line distance in space
  - d(a, b) = √(Σ(aᵢ - bᵢ)²)
  - Good for: absolute position matters
  - Range: [0, ∞)
- Cosine Similarity: Angle between vectors (normalized)
  - cos(a, b) = (a · b) / (||a|| × ||b||)
  - Good for: direction matters, magnitude doesn't
  - Range: [-1, 1] (1 = identical, 0 = orthogonal, -1 = opposite)
- Dot Product (Inner Product): Combines magnitude and direction
  - a · b = Σ(aᵢ × bᵢ)
  - Good for: vectors that are already normalized
  - Range: (-∞, ∞)
- Manhattan Distance (L1): Sum of absolute differences
  - d(a, b) = Σ|aᵢ - bᵢ|
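A small NumPy sketch of all four metrics (function names are my own; the same example vectors reappear in the Project 1 thinking exercise):

import numpy as np

def euclidean(a, b):          # L2: straight-line distance
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):          # L1: sum of absolute differences
    return np.sum(np.abs(a - b))

def dot_product(a, b):        # higher = more similar; not a distance
    return np.dot(a, b)

def cosine_similarity(a, b):  # angle-based, magnitude-invariant
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([3.0, 4.0]), np.array([6.0, 8.0])
print(euclidean(a, b), manhattan(a, b), dot_product(a, b), cosine_similarity(a, b))
# 5.0 7.0 50.0 1.0  -- b is a scaled copy of a, so cosine similarity is exactly 1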
The Curse of Dimensionality
Why is vector search hard?
In high dimensions (100+), counterintuitive things happen:
- All points are "far": The distance between any two random points converges
- Volume explodes: A hypercube with side 1 has volume 1, but almost all that volume is in the "corners"
- Brute force is slow: Comparing a query to 1 billion vectors = 1 billion distance calculations
This is why we need Approximate Nearest Neighbor (ANN) algorithms, trading perfect accuracy for massive speedups.
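You can observe distance concentration directly; a quick illustrative experiment (exact numbers vary by run):

import numpy as np

rng = np.random.default_rng(0)
for dim in [2, 10, 100, 1000]:
    points = rng.random((1000, dim))   # random points in the unit hypercube
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    # As dim grows, max/min falls toward 1: everything is almost equally far away
    print(f"dim={dim:4d}  min={d.min():.3f}  max={d.max():.3f}  max/min={d.max()/d.min():.2f}")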
The ANN Tradeoff
Accuracy (Recall) ◄──────────────────────────► Speed
100%                                          0.001ms

Brute Force (exact)     Too slow
HNSW (graph)            Great balance
IVF (clustering)        Faster, less accurate
LSH (hashing)           Fastest, least accurate
Recall@K: Of the true K nearest neighbors, what fraction did we find?
- Recall@10 = 0.95 means we found 95% of the true top-10 neighbors
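Computing Recall@K is straightforward once exact search gives you ground truth; a minimal sketch:

def recall_at_k(approx_ids, true_ids):
    # Fraction of the true top-K neighbors that the ANN index actually returned
    return len(set(approx_ids) & set(true_ids)) / len(true_ids)

# ANN found 9 of the true top-10 -> 0.9
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 99], list(range(1, 11))))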
Vector Database Architecture
┌───────────────────────────────────────────────────────────┐
│                       Client Layer                        │
│          (REST API, gRPC, SDKs, Query Language)           │
├───────────────────────────────────────────────────────────┤
│                       Query Layer                         │
│   (Query Planning, Filtering, Caching, Result Ranking)    │
├───────────────────────────────────────────────────────────┤
│                       Index Layer                         │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   │
│   │  HNSW   │   │   IVF   │   │   LSH   │   │  Flat   │   │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘   │
├───────────────────────────────────────────────────────────┤
│                      Storage Layer                        │
│     (Vector Storage, Metadata, WAL, Compression/PQ)       │
├───────────────────────────────────────────────────────────┤
│                    Distributed Layer                      │
│     (Sharding, Replication, Consistency, Routing)         │
└───────────────────────────────────────────────────────────┘
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Vector Embeddings | Vectors are numeric representations of meaning. Similar items map to nearby points in high-dimensional space. Embeddings come from ML models (transformers, CNNs) that compress semantics into fixed-size arrays. |
| Distance Metrics | Similarity = inverse of distance. Euclidean measures absolute position, cosine measures direction (angle), dot product combines both. Choice depends on whether magnitude matters for your domain. |
| Curse of Dimensionality | In high dimensions (100+), all points become "far" from each other, volume concentrates in corners, and brute force becomes impossibly slow. This is why approximate search exists. |
| ANN vs Exact Search | Exact k-NN guarantees finding the true nearest neighbors but doesn't scale. ANN (Approximate Nearest Neighbor) trades accuracy for speed: 99% accuracy at 1000x speedup is the typical tradeoff. |
| LSH (Locality Sensitive Hashing) | Hash similar vectors to same buckets using random projections. Multiple hash tables = OR (higher recall), multiple hashes per table = AND (higher precision). Probabilistic but fast. |
| IVF (Inverted File Index) | Cluster vectors with k-means, then search only nearby clusters. Nprobe controls recall/speed tradeoff. Foundation of FAISS and most production systems. Simple but effective. |
| Product Quantization | Compress vectors by splitting dimensions into subspaces and quantizing each independently. 128D float32 (512 bytes) → 128 uint8 codes (128 bytes) = 4x compression. Lossy but enables billion-scale search. |
| HNSW (Hierarchical NSW) | Multi-layer proximity graph. Upper layers = express highways, lower layers = local streets. Greedy search with backtracking. Best all-around algorithm for accuracy and speed. |
| Index Composition | Real systems combine techniques: IVF for partitioning + PQ for compression + HNSW for graph search. Each layer optimizes a different bottleneck. |
| Metadata Filtering | Vector search + filters (price < $100, category = shoes). Pre-filtering wastes recall, post-filtering wastes compute. Hybrid indexes and filtered IVF are the solutions. |
| Distributed Systems | Sharding for scale (partition by vector ID or space), replication for availability, consistency for correctness. CAP theorem applies. Leader-follower or quorum-based writes. |
| Disk-Based Indexes | RAM is expensive at billion-vector scale. Disk-based indexes use SSD-friendly layouts: compressed vectors, graph edges on disk, selective caching. DiskANN achieves 95%+ recall. |
Deep Dive Reading By Concept
This section maps each concept from above to specific book chapters and papers for deeper understanding. Read these before or alongside the projects to build strong mental models.
Vector Embeddings & Representation Learning
| Concept | Resource |
|---|---|
| What are embeddings and how are they created | Speech and Language Processing by Jurafsky & Martin – Ch. 6: "Vector Semantics and Embeddings" |
| Word2Vec and semantic spaces | Neural Network Methods for Natural Language Processing by Yoav Goldberg – Ch. 10: "Pre-trained Word Representations" |
| Modern transformer embeddings | Natural Language Processing with Transformers by Tunstall, von Werra & Wolf – Ch. 2: "Text Classification" (BERT embeddings section) |
| Image embeddings (CNNs) | Deep Learning by Goodfellow, Bengio & Courville – Ch. 9: "Convolutional Networks" |
| Multimodal embeddings (CLIP) | Paper: "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021) |
Distance Metrics & Similarity Search
| Concept | Resource |
|---|---|
| Distance functions and metric spaces | Mining of Massive Datasets by Leskovec, Rajaraman & Ullman – Ch. 3.5: "Distance Measures" |
| Cosine vs Euclidean: when to use which | Introduction to Information Retrieval by Manning, Raghavan & Schütze – Ch. 6: "Scoring, term weighting, and the vector space model" |
| Dot product for normalized vectors | Foundations of Statistical Natural Language Processing by Manning & Schütze – Ch. 8: "Lexical Semantics" |
| Metric properties (triangle inequality, symmetry) | Pattern Recognition and Machine Learning by Bishop – Ch. 2.5: "Nonparametric Methods" |
Curse of Dimensionality & High-Dimensional Geometry
| Concept | Resource |
|---|---|
| Why high dimensions are different | The Elements of Statistical Learning by Hastie, Tibshirani & Friedman – Ch. 2.5: "Local Methods in High Dimensions" |
| Volume concentration in hyperspheres | Mining of Massive Datasets by Leskovec, Rajaraman & Ullman – Ch. 7: "Clustering" (curse of dimensionality section) |
| Distance concentration phenomenon | Paper: "When Is 'Nearest Neighbor' Meaningful?" (Beyer et al., 1999) |
| Implications for machine learning | Pattern Recognition and Machine Learning by Bishop – Ch. 1.4: "The Curse of Dimensionality" |
Approximate Nearest Neighbor Algorithms
| Concept | Resource |
|---|---|
| ANN survey and taxonomy | Paper: "A Survey on Nearest Neighbor Search Methods" (Wang et al., 2021) |
| Recall/precision/latency tradeoffs | Mining of Massive Datasets by Leskovec, Rajaraman & Ullman – Ch. 3.6: "Locality-Sensitive Hashing for Documents" |
| Benchmarking ANN algorithms | Website: ann-benchmarks.com – comprehensive comparisons |
Locality Sensitive Hashing (LSH)
| Concept | Resource |
|---|---|
| LSH fundamentals and theory | Mining of Massive Datasets by Leskovec, Rajaraman & Ullman – Ch. 3.4: "Locality-Sensitive Hashing" |
| Random hyperplanes for cosine similarity | Paper: "Similarity Estimation Techniques from Rounding Algorithms" (Charikar, 2002) |
| LSH families for different metrics | Foundations of Data Science by Blum, Hopcroft & Kannan – Ch. 2.5: "Locality Sensitive Hashing" |
| Multi-probe LSH | Paper: "Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search" (Lv et al., 2007) |
| Practical LSH implementations | Tutorial: Pinecone LSH Guide |
Clustering & K-Means
| Concept | Resource |
|---|---|
| K-means algorithm and convergence | Pattern Recognition and Machine Learning by Bishop – Ch. 9.1: "K-means Clustering" |
| Choosing K (number of clusters) | The Elements of Statistical Learning by Hastie, Tibshirani & Friedman – Ch. 14.3: "Cluster Analysis" |
| K-means++ initialization | Paper: "k-means++: The Advantages of Careful Seeding" (Arthur & Vassilvitskii, 2007) |
| Mini-batch k-means for large datasets | Scikit-learn documentation on Mini-Batch K-Means |
Inverted File Index (IVF)
| Concept | Resource |
|---|---|
| Inverted indexes in text search | Introduction to Information Retrieval by Manning, Raghavan & Schütze – Ch. 1: "Boolean Retrieval" |
| IVF for vector search | Paper: "Product Quantization for Nearest Neighbor Search" (Jegou et al., 2011) – section on IVF |
| FAISS IVF indexes | FAISS Wiki: Index Types |
| Nprobe tuning strategies | Pinecone blog: "Vector Indexes" series |
Product Quantization (PQ)
| Concept | Resource |
|---|---|
| Vector quantization fundamentals | Pattern Recognition and Machine Learning by Bishop – Ch. 9.2: "Mixtures of Gaussians" (VQ as special case) |
| Product Quantization paper (must-read) | Paper: "Product Quantization for Nearest Neighbor Search" (Jegou et al., 2011) |
| Asymmetric distance computation | Same paper, Section 3.3 |
| Optimized PQ (OPQ) | Paper: "Optimized Product Quantization" (Ge et al., 2013) |
| PQ in FAISS | FAISS Wiki: Product Quantizers |
HNSW (Hierarchical Navigable Small World)
| Concept | Resource |
|---|---|
| Small-world networks | Networks by Newman – Ch. 11: "Small-World Networks" |
| Navigable Small World graphs | Paper: "Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs" (Malkov et al., 2014) |
| HNSW algorithm (must-read) | Paper: "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (Malkov & Yashunin, 2018) |
| HNSW implementation details | hnswlib GitHub |
| HNSW tuning (M and ef parameters) | Pinecone blog: "HNSW Index Deep Dive" |
Graph-Based Search
| Concept | Resource |
|---|---|
| Proximity graphs for search | Nearest Neighbor Search: A Database Perspective by Shasha & Wang – Ch. 4 |
| Navigating graphs efficiently | Paper: "A Comparative Study of Graph-based Approximate Nearest Neighbor Algorithms" (Wang et al., 2021) |
| Graph construction strategies | HNSW paper, Section 3.2 |
Compression & Quantization
| Concept | Resource |
|---|---|
| Vector quantization theory | The Elements of Statistical Learning by Hastie, Tibshirani & Friedman – Ch. 14.3.12: "Vector Quantization" |
| Scalar quantization | Understanding Digital Signal Processing by Lyons – Ch. 12: "Data Compression" |
| Additive quantization | Paper: "Additive Quantization for Extreme Vector Compression" (Babenko & Lempitsky, 2014) |
Disk-Based Vector Search
| Concept | Resource |
|---|---|
| SSD-friendly data structures | Database Internals by Petrov – Ch. 3: "File Formats" |
| DiskANN algorithm | Paper: "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node" (Jayaram Subramanya et al., 2019) |
| Memory-mapped files | The Linux Programming Interface by Kerrisk – Ch. 49: "Memory Mappings" |
| Vamana graph (DiskANN's graph) | DiskANN paper, Section 3 |
Distributed Systems for Vector Search
| Concept | Resource |
|---|---|
| Distributed system fundamentals | Designing Data-Intensive Applications by Kleppmann – Ch. 5-9 (Replication, Partitioning, Transactions, Consistency) |
| Sharding strategies | Designing Data-Intensive Applications by Kleppmann – Ch. 6: "Partitioning" |
| CAP theorem and consistency | Distributed Systems by Tanenbaum & van Steen – Ch. 7: "Consistency and Replication" |
| Leader election (Raft/Paxos) | Paper: "In Search of an Understandable Consensus Algorithm (Extended Version)" (Ongaro & Ousterhout, 2014) – Raft |
| Distributed vector databases | Milvus documentation: Architecture Overview |
Metadata Filtering & Hybrid Search
| Concept | Resource |
|---|---|
| Filtered vector search | Paper: "Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters" (Singh et al., 2023) |
| Hybrid search (sparse + dense) | Pinecone blog: "Hybrid Search Explained" |
| Inverted indexes for metadata | Introduction to Information Retrieval by Manning, Raghavan & Schütze – Ch. 2: "The Term Vocabulary and Postings Lists" |
Storage Engines & Persistence
| Concept | Resource |
|---|---|
| LSM-trees for write-heavy workloads | Designing Data-Intensive Applications by Kleppmann – Ch. 3: "Storage and Retrieval" (LSM-trees section) |
| WAL (Write-Ahead Logging) | Database Internals by Petrov – Ch. 4: "Implementing B-Trees" (WAL section) |
| Memory management in databases | Database Internals by Petrov – Ch. 5: "Transaction Processing and Recovery" |
Performance & Benchmarking
| Concept | Resource |
|---|---|
| Measuring recall and precision | Introduction to Information Retrieval by Manning, Raghavan & Schütze – Ch. 8: "Evaluation in Information Retrieval" |
| Latency percentiles (p99, p95) | Designing Data-Intensive Applications by Kleppmann – Ch. 1: "Reliable, Scalable, and Maintainable Applications" |
| Benchmarking methodology | Paper: "ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms" (Aumüller et al., 2020) |
Essential Reading Order
For maximum comprehension, read in this order:
- Foundation (Week 1):
- Mining of Massive Datasets Ch. 3 (distance metrics, LSH fundamentals)
- Speech and Language Processing Ch. 6 (what embeddings are)
- Pattern Recognition and Machine Learning Ch. 9.1 (k-means)
- Core Algorithms (Week 2-3):
- LSH paper (Charikar, 2002)
- Product Quantization paper (Jegou et al., 2011) – must-read
- HNSW paper (Malkov & Yashunin, 2018) – must-read
- Advanced Techniques (Week 4):
- DiskANN paper (for billion-scale search)
- Filtered-DiskANN paper (for metadata filtering)
- Milvus/Qdrant architecture docs (for distributed systems)
- Systems & Production (Week 5+):
- Designing Data-Intensive Applications Ch. 3-6 (storage, replication, partitioning)
- Database Internals (storage engine details)
- Production vector DB docs (Pinecone, Qdrant, Milvus)
Project List
Projects progress from fundamental algorithms to a complete vector database implementation.
Project 1: Brute Force k-NN and Distance Metrics
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: Rust, C++, Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The "Resume Gold"
- Difficulty: Level 1: Beginner
- Knowledge Area: Vector Mathematics / Similarity Search
- Software or Tool: Distance Calculator
- Main Book: "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman (Chapter 3)
What you'll build: A brute-force k-nearest neighbors search engine supporting multiple distance metrics (Euclidean, cosine, dot product), with optimizations using NumPy vectorization and SIMD.
Why it teaches fundamentals: Before learning approximate algorithms, you must understand exact search. This establishes the baseline that all other algorithms try to beat. You'll also learn why brute force becomes impossible at scale.
Core challenges you'll face:
- Implementing distance metrics correctly → maps to vector mathematics
- Vectorizing with NumPy → maps to batch operations for speed
- Memory management for large datasets → maps to chunked processing
- Benchmarking accurately → maps to measuring recall and latency
Key Concepts:
- Vector Operations: Linear Algebra fundamentals
- Distance Metrics: "Mining of Massive Datasets" Chapter 3 - Leskovec et al.
- NumPy Broadcasting: NumPy documentation
Difficulty: Beginner. Time estimate: one weekend. Prerequisites: basic Python and an understanding of vectors/matrices; NumPy familiarity helps.
Real world outcome:
$ ./vector_search --dataset random_10k_128d.npy --query query.npy --k 10
Loading dataset: 10,000 vectors × 128 dimensions
Query vector: 128 dimensions
Brute Force k-NN Results (k=10):
┌──────┬───────────┬───────────────┐
│ Rank │ Vector ID │ Distance (L2) │
├──────┼───────────┼───────────────┤
│  1   │   4,521   │     0.823     │
│  2   │   7,892   │     0.891     │
│  3   │   2,103   │     0.912     │
│ ...  │    ...    │      ...      │
│  10  │   9,234   │     1.102     │
└──────┴───────────┴───────────────┘
Performance:
Distance calculations: 10,000
Time: 2.3ms (vectorized), 156ms (naive loop)
Throughput: 4,347 queries/second
$ ./vector_search --dataset random_1M_768d.npy --query query.npy --k 10
Loading dataset: 1,000,000 vectors × 768 dimensions
Performance:
Time: 1,847ms (too slow for production!)
This is why we need approximate algorithms...
Implementation Hints:
Distance metric implementations:
Euclidean Distance (L2):
d(q, v) = sqrt(sum((q[i] - v[i])^2 for i in range(dim)))
Cosine Similarity:
sim(q, v) = dot(q, v) / (norm(q) * norm(v))
distance = 1 - sim (to convert to distance)
Dot Product:
score(q, v) = sum(q[i] * v[i] for i in range(dim))
(higher is better, not a distance)
Naive implementation (slow):
def knn_naive(query, vectors, k):
    distances = []
    for i, v in enumerate(vectors):
        d = euclidean_distance(query, v)
        distances.append((i, d))
    distances.sort(key=lambda x: x[1])
    return distances[:k]
Vectorized implementation (fast):
import numpy as np

def knn_vectorized(query, vectors, k):
    # Compute all distances at once:
    # ||query - v||^2 = ||query||^2 - 2*(query . v) + ||v||^2
    query_sq = np.sum(query ** 2)
    vectors_sq = np.sum(vectors ** 2, axis=1)
    cross_term = 2 * np.dot(vectors, query)
    distances = query_sq - cross_term + vectors_sq
    # Skip sqrt for ranking (monotonic transformation)
    # Partial sort for top-k
    indices = np.argpartition(distances, k)[:k]
    top_k = indices[np.argsort(distances[indices])]
    return top_k, distances[top_k]
Key optimization insight: For ranking, you don't need the actual distance, just the ordering. So skip expensive operations like sqrt() that don't change the ranking.
Chunked processing for large datasets:
def knn_chunked(query, vectors, k, chunk_size=100000):
    # Keep the best k from each chunk, then merge across chunks
    # (assumes each chunk holds more than k vectors)
    candidates = []
    for i in range(0, len(vectors), chunk_size):
        chunk = vectors[i:i+chunk_size]
        indices, dists = knn_vectorized(query, chunk, k)
        candidates.extend((idx + i, dist) for idx, dist in zip(indices, dists))
    candidates.sort(key=lambda x: x[1])
    return candidates[:k]
The Core Question You're Answering
"How do distance metrics actually work, and why does vectorization make them 50x faster?"
Before writing code, understand what you're measuring. Distance isn't just math; it's the foundation of similarity. Every vector database operation depends on computing distances correctly and efficiently.
Concepts You Must Understand First
Stop and research these before coding:
- Linear Algebra Fundamentals
- What is a vector in mathematical terms vs in programming?
- How do dot products relate to projections?
- What does "norm" mean (L1, L2, infinity norm)?
- Book Reference: "Introduction to Linear Algebra" by Gilbert Strang – Ch. 1-2
- NumPy Broadcasting and Vectorization
- Why is a loop over 1M items slow but a NumPy operation fast?
- What is SIMD (Single Instruction Multiple Data)?
- How does NumPy use CPU vector instructions?
- Book Reference: "Python for Data Analysis" by Wes McKinney – Ch. 4: NumPy Basics
- Distance Metric Properties
- What makes a function a "metric" (triangle inequality, symmetry)?
- Why is cosine similarity NOT a distance (hint: it's a similarity)?
- When should you use Euclidean vs cosine?
- Book Reference: "Mining of Massive Datasets" Ch. 3.5
- Big-O Complexity
- What is the complexity of brute force k-NN? O(N × D) for what operations?
- Why does the curse of dimensionality make D matter?
- Book Reference: "Introduction to Algorithms" (CLRS) – Ch. 3
Questions to Guide Your Design
Before implementing, think through these:
- API Design
- Should your search function return indices, distances, or both?
- How do you handle the case where k > dataset size?
- Should you support batched queries (multiple query vectors at once)?
- Distance Metrics
- For cosine similarity, do you normalize vectors once (at index time) or per query?
- Can you avoid computing sqrt() for Euclidean distance? (Hint: yes, for ranking)
- How do you handle zero vectors in cosine similarity?
- Memory Management
- Can you load the entire dataset in RAM? If not, how do you chunk it?
- Should you memory-map the data file or load it entirely?
- Where does the bottleneck appear: CPU or memory bandwidth?
- Benchmarking
- What datasets should you test on? (SIFT1M, GIST1M are standard)
- How do you verify your results are correct? (Check against sklearn)
- What metrics matter: latency, throughput, or both?
Thinking Exercise
Trace Distance Calculation By Hand
Before coding, compute these manually:
Vector A = [3, 4]
Vector B = [6, 8]
Questions:
1. Euclidean distance between A and B?
2. Cosine similarity between A and B?
3. Dot product of A and B?
4. If you normalize both vectors first, what's the dot product?
5. Why are (3) and (4) different? What does this tell you?
Work this out on paper. The insight from question 5 is crucial.
Performance Mental Model
Predict the speedup:
- Naive Python loops over 10,000 vectors × 128 dimensions
- NumPy vectorized version with SIMD
Make a guess: 2x? 10x? 100x?
Then implement both and measure. The actual speedup will teach you why vectorization matters.
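One possible timing harness for this experiment (a sketch; the absolute numbers depend on your machine):

import time
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

t0 = time.perf_counter()
naive = [sum((q - v) ** 2 for q, v in zip(query, vec)) for vec in vectors]  # pure-Python loops
t1 = time.perf_counter()
fast = np.sum((vectors - query) ** 2, axis=1)                               # vectorized
t2 = time.perf_counter()
print(f"naive: {t1 - t0:.3f}s  numpy: {t2 - t1:.5f}s  speedup: {(t1 - t0) / (t2 - t1):.0f}x")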
The Interview Questions They'll Ask
Prepare to answer these:
- "What's the difference between Euclidean distance and cosine similarity?"
- Euclidean measures absolute distance in space; cosine measures angle between vectors
- Euclidean is magnitude-sensitive; cosine is magnitude-invariant
- Use Euclidean when magnitude matters (e.g., image brightness), cosine for orientation (e.g., text embeddings)
- "Why is brute force k-NN O(N) and not O(N log N)?"
- Computing all N distances is O(N × D)
- Finding top-k with argpartition is O(N), not O(N log N)
- Full sort would be O(N log N), but we don't need it
- "Can you skip the sqrt() in Euclidean distance? Why or why not?"
- Yes for ranking (sqrt is monotonic, preserves order)
- No if you need actual distance values
- Saves computation: sqrt is ~10-20x slower than multiply
- "Your search is slow. Is it CPU-bound or memory-bound? How would you check?"
- Profile with the shell time command and check CPU utilization
- Memory-bound: low CPU usage, high memory bandwidth
- CPU-bound: high CPU usage, bottleneck in computation
- Tools: top, htop, perf, Python profilers
- "How would you find k-NN for 1000 query vectors efficiently?"
- Batch compute: distances matrix of shape (1000, N)
- Use matrix multiplication instead of 1000 loops
- Exploit CPU cache locality and SIMD
- "What happens to distance distributions as dimensionality increases?"
- Curse of dimensionality: all points become equidistant
- Distance concentration: min and max distances converge
- Relative contrast (max/min) decreases exponentially
- This is why high-dimensional spaces need special techniques
- "How does NumPy make array operations fast?"
- Uses compiled C/Fortran libraries (BLAS, LAPACK)
- SIMD vectorization (compute 4-8 values simultaneously)
- Contiguous memory layout (cache-friendly)
- Avoids Python interpreter overhead
Hints in Layers
Hint 1: Start Simple
Implement naive Euclidean distance with Python loops first. Make sure it works correctly before optimizing.
import math

def euclidean_naive(v1, v2):
    dist = 0
    for i in range(len(v1)):
        dist += (v1[i] - v2[i]) ** 2
    return math.sqrt(dist)
Test it on small examples where you know the answer.
Hint 2: Vectorize with NumPy
Replace loops with NumPy operations:
import numpy as np

def euclidean_vectorized(v1, v2):
    return np.linalg.norm(v1 - v2)
    # Or: return np.sqrt(np.sum((v1 - v2) ** 2))
Benchmark the difference using timeit or time.perf_counter().
Hint 3: Batch Computation
To find k-NN, you need distances from query to ALL vectors. Compute all at once:
# query: (D,)
# vectors: (N, D)
# Result: (N,) distances
dists = np.linalg.norm(vectors - query, axis=1)
This is the key insight: broadcast the query and compute N distances in one operation.
Hint 4: Skip Sqrt for Ranking
For finding top-k, you only need the ORDERING, not actual distances:
# These give the same ordering:
distances = np.sqrt(np.sum((vectors - query)**2, axis=1))
distances_squared = np.sum((vectors - query)**2, axis=1)
# So skip the sqrt:
top_k_indices = np.argpartition(distances_squared, k)[:k]
Sqrt is expensive. Avoid it when you can.
Hint 5: Cosine Similarity Trick
Pre-normalize all vectors once. Then cosine similarity = dot product:
# At index time:
vectors_normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
# At query time:
query_normalized = query / np.linalg.norm(query)
similarities = np.dot(vectors_normalized, query_normalized)
# No need for division anymore!
This converts O(N × D) divisions to O(N × D) multiplications (which you'd do anyway for the dot product).
Hint 6: Use argpartition, Not argsort
Finding top-k doesn't require a full sort:
# O(N log N) - wasteful
top_k = np.argsort(distances)[:k]
# O(N) - better
top_k = np.argpartition(distances, k)[:k]
top_k_sorted = top_k[np.argsort(distances[top_k])]
For k ≪ N, this is a massive speedup.
Hint 7: Memory-Map Large Datasets
For datasets that don't fit in RAM:
# Memory-mapped array (doesn't load entire file)
vectors = np.load('large_dataset.npy', mmap_mode='r')
# Process in chunks
chunk_size = 100000
for i in range(0, len(vectors), chunk_size):
    chunk = vectors[i:i+chunk_size]
    # Compute distances for this chunk
This lets you work with datasets larger than RAM.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Vector and matrix operations | "Introduction to Linear Algebra" by Gilbert Strang | Ch. 1-2 |
| Distance metrics in depth | "Mining of Massive Datasets" by Leskovec et al. | Ch. 3: Finding Similar Items |
| NumPy fundamentals | "Python for Data Analysis" by Wes McKinney | Ch. 4: NumPy Basics |
| Performance and vectorization | "High Performance Python" by Gorelick & Ozsvald | Ch. 6: Matrix and Vector Computation |
| K-NN algorithms | "Pattern Recognition and Machine Learning" by Bishop | Ch. 2.5: Nearest Neighbor Methods |
| Big-O analysis | "Introduction to Algorithms" (CLRS) | Ch. 3: Growth of Functions |
Learning milestones:
- Distance metrics compute correctly → You understand vector math
- Vectorized version is 50x+ faster → You understand batch operations
- You hit the wall at 1M vectors → You understand why ANN is needed
- Recall@10 is always 100% → You understand the exact search baseline
Project 2: Locality Sensitive Hashing (LSH)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: Rust, C++, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The "Service & Support" Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Hashing / Probabilistic Data Structures
- Software or Tool: LSH Index
- Main Book: "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman (Chapter 3)
What you'll build: A Locality Sensitive Hashing index for approximate nearest neighbor search, using random hyperplanes for cosine similarity or random projections for Euclidean distance.
Why it teaches LSH: LSH is the simplest ANN algorithm to understand. It's based on a beautiful insight: if we hash vectors such that similar vectors likely get the same hash, we can use hash tables for fast lookup. Understanding LSH makes other algorithms clearer.
Core challenges you'll face:
- Designing hash functions → maps to random hyperplanes and projections
- Amplification (AND/OR) → maps to balancing precision and recall
- Multi-probe LSH → maps to checking nearby buckets
- Parameter tuning → maps to number of tables and hash functions
Resources for key challenges:
- Pinecone LSH Guide - Illustrated explanation
- MIT LSH Page - Original research and implementations
Key Concepts:
- LSH Fundamentals: "Mining of Massive Datasets" Chapter 3 - Leskovec et al.
- Random Projections: Tyler Neylon's LSH Introduction
- Amplification: LSH literature on AND/OR constructions
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1 completed; understanding of hash tables and probability.
Real world outcome:
$ ./lsh_search --build --dataset sift_1M.npy --tables 20 --hashes 8
Building LSH Index:
Dataset: 1,000,000 vectors × 128 dimensions
Hash tables: 20
Hashes per table: 8
Total hash functions: 160
Build time: 12.3s
Index size: 245 MB (vs 512 MB raw data)
$ ./lsh_search --query query.npy --k 10
LSH Query Results:
┌──────┬───────────┬──────────┬────────────────┐
│ Rank │ Vector ID │ Distance │ Found in Table │
├──────┼───────────┼──────────┼────────────────┤
│  1   │   4,521   │  0.823   │    3, 7, 12    │
│  2   │   7,892   │  0.891   │     7, 15      │
│ ...  │    ...    │   ...    │      ...       │
└──────┴───────────┴──────────┴────────────────┘
Performance:
Candidates examined: 2,042 (vs 1,000,000 brute force)
Time: 4.2ms (vs 1,847ms brute force)
Speedup: 440x
Recall@10: 0.87 (87% of true neighbors found)
$ ./lsh_search --sweep-params
Sweeping parameters for optimal recall/speed...
Tables │ Hashes │ Recall@10 │ Query Time │ Index Size
───────┼────────┼───────────┼────────────┼───────────
   5   │   4    │   0.62    │   1.1ms    │   61 MB
  10   │   6    │   0.78    │   2.3ms    │  122 MB
  20   │   8    │   0.87    │   4.2ms    │  245 MB
  40   │  10    │   0.94    │   8.7ms    │  490 MB
Implementation Hints:
The key insight of LSH:
Hash function property:
P(h(a) = h(b)) ∝ similarity(a, b)
If vectors are similar → high probability of same hash
If vectors are different → low probability of same hash
For cosine similarity, use random hyperplanes:
def random_hyperplane_hash(vector, hyperplane):
    # Returns 1 if vector is on positive side, 0 otherwise
    return 1 if np.dot(vector, hyperplane) >= 0 else 0

def compute_hash(vector, hyperplanes):
    # Combine multiple hyperplane hashes into one integer
    hash_value = 0
    for i, hp in enumerate(hyperplanes):
        bit = random_hyperplane_hash(vector, hp)
        hash_value |= (bit << i)
    return hash_value
LSH Index structure:
import numpy as np
from collections import defaultdict

class LSHIndex:
    def __init__(self, num_tables, hashes_per_table, dim):
        self.vectors = {}   # vector_id -> vector, needed for re-ranking
        self.tables = []
        for _ in range(num_tables):
            hyperplanes = np.random.randn(hashes_per_table, dim)
            hyperplanes /= np.linalg.norm(hyperplanes, axis=1, keepdims=True)
            table = {
                'hyperplanes': hyperplanes,
                'buckets': defaultdict(list)  # hash -> [vector_ids]
            }
            self.tables.append(table)

    def insert(self, vector_id, vector):
        self.vectors[vector_id] = vector
        for table in self.tables:
            hash_val = compute_hash(vector, table['hyperplanes'])
            table['buckets'][hash_val].append(vector_id)

    def query(self, vector, k):
        candidates = set()
        for table in self.tables:
            hash_val = compute_hash(vector, table['hyperplanes'])
            candidates.update(table['buckets'][hash_val])
        # Re-rank candidates by actual distance
        results = []
        for cand_id in candidates:
            dist = distance(vector, self.vectors[cand_id])
            results.append((cand_id, dist))
        results.sort(key=lambda x: x[1])
        return results[:k]
Amplification (AND/OR construction):
- More hashes per table (AND): Fewer candidates, higher precision
- More tables (OR): More candidates, higher recall
- Balance: Tune based on your recall requirements
Multi-probe LSH: Instead of only checking the exact bucket, also check nearby buckets (flip one bit at a time). Increases recall without adding tables.
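A minimal sketch of a Hamming-distance-1 probe sequence (the helper name is my own, not from any library):

def probe_sequence(hash_val, num_bits):
    # Yield the exact bucket first, then every bucket at Hamming distance 1
    yield hash_val
    for i in range(num_bits):
        yield hash_val ^ (1 << i)   # flip bit i

# Usage sketch: collect candidates from the exact bucket plus its 1-bit neighbors
# for h in probe_sequence(hash_val, hashes_per_table):
#     candidates.update(table['buckets'].get(h, []))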
The Core Question You're Answering
Why does hashing similar vectors to the same buckets work for finding nearest neighbors?
This is the fundamental insight of LSH. Unlike traditional hash functions (which try to distribute items uniformly), LSH hash functions are designed so that:
- Similar items collide with HIGH probability
- Dissimilar items collide with LOW probability
The probability of collision is directly related to similarity. For random hyperplane hashing and cosine similarity:
P(h(a) = h(b)) = 1 - (angle(a, b) / π)
This means: if two vectors point in similar directions (small angle), they're likely to be on the same side of a random hyperplane. By using multiple hyperplanes, you create a hash code where similar vectors get similar (often identical) codes.
The "aha!" moment: You're not trying to find the EXACT nearest neighbor. You're willing to accept approximate answers in exchange for massive speedups. LSH gives you probabilistic guarantees: "With probability 1-δ, I'll find a neighbor within (1+ε) of optimal."
Concepts You Must Understand First
Before implementing LSH, make sure you understand:
1. Random Projections
- Projecting high-dimensional vectors onto random lines/hyperplanes
- The Johnson-Lindenstrauss lemma (distances are approximately preserved)
- Why random hyperplanes work for cosine similarity
2. Hash Functions and Collision
- Traditional hash functions minimize collisions
- LSH hash functions MAXIMIZE collisions for similar items
- Hash codes as binary strings (each bit = one hyperplane test)
3. Probability Basics
- P(h(a) = h(b)) depends on similarity(a, b)
- Independence of hash functions
- Amplification through AND/OR combinations
4. AND/OR Amplification
- AND construction: All k hashes must match → Higher precision, lower recall
- OR construction: Match in any of L tables → Higher recall, lower precision
- Combined: L tables, each with k hashes → Tunable precision/recall (a small calculator is sketched below)
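The calculator referenced above, assuming random-hyperplane LSH where a single hash matches with probability p = 1 - arccos(sim)/π:

import math

def collision_prob(sim, k, L):
    # AND over k hashes within a table, OR over L tables
    p = 1 - math.acos(sim) / math.pi
    return 1 - (1 - p ** k) ** L

# Two vectors with cosine similarity 0.9:
print(collision_prob(0.9, k=1, L=1))   # ~0.86  one hyperplane
print(collision_prob(0.9, k=4, L=1))   # ~0.54  AND: all 4 hashes must match
print(collision_prob(0.9, k=4, L=10))  # ~1.00  OR: a match in any of 10 tables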
Study these first:
- "Mining of Massive Datasets" Chapter 3.4 (LSH for Cosine Distance)
- Tyler Neylon's LSH Tutorial - excellent visual explanations
- Pinecone LSH Guide
Questions to Guide Your Design
Hash Function Design:
- How many hash functions (hyperplanes) should each table use?
- More hashes → smaller buckets → higher precision, lower recall
- Fewer hashes → larger buckets → lower precision, higher recall
- Typical: 6-12 hashes per table
- How do you generate random hyperplanes?
- Sample from a Gaussian: np.random.randn(dim)
- Normalize to unit length: hyperplane / ||hyperplane||
- Each hyperplane is independent
- How do you combine multiple bits into a hash?
- Treat bits as a binary number: hash = bit₀ + 2·bit₁ + 4·bit₂ + ...
- Use bitwise operations for speed: hash |= (bit << i)
Amplification Design:
- How many tables should you use?
- More tables → more chances to find neighbors → higher recall
- But also more memory and slower indexing
- Typical: 10-50 tables depending on recall requirements
- How do you balance precision and recall?
- High recall needed? → Many tables (L), fewer hashes per table (k)
- High precision needed? → Fewer tables, more hashes per table
- Formula: P_collision = 1 - (1 - p^k)^L, where p is the per-hash collision probability (a function of similarity)
Multi-Probe Strategy:
- Which nearby buckets should you probe?
- Hamming distance 1: flip each bit individually (k probes)
- Hamming distance 2: flip pairs of bits (k choose 2 probes)
- Prioritize bits closest to decision boundary
- How many probes before diminishing returns?
- Empirically: 2-10 probes per table
- Track marginal increase in candidates found
- Stop when cost exceeds benefit
Memory vs Speed:
- How do you store hash tables efficiently?
- Dict of lists: buckets[hash] = [id1, id2, ...]
- Problem: empty buckets waste space
- Solution: only store non-empty buckets
- Should you store vectors or just IDs?
- IDs only → compact, but needs separate vector storage
- Vectors included → faster re-ranking, but larger memory
Thinking Exercise: Understanding Hash Collisions
Exercise 1: Manual Hash Computation
Given two 3D vectors and one random hyperplane, compute if they hash to the same value:
v1 = [1, 2, 1] (normalized: [0.408, 0.816, 0.408])
v2 = [1, 2, 2] (normalized: [0.333, 0.667, 0.667])
hyperplane = [1, 0, 0]
Steps:
- Compute the dot products: v1 · h and v2 · h
- Check the sign: positive → bit=1, negative → bit=0
- Do they match?
Answer: v1·h = 0.408 (bit=1), v2·h = 0.333 (bit=1) → COLLISION ✓
Exercise 2: Multi-Bit Hash
Now use 3 hyperplanes:
h1 = [1, 0, 0]
h2 = [0, 1, 0]
h3 = [0, 0, 1]
Compute the 3-bit hash for both v1 and v2:
- For each hyperplane, compute sign(v · h)
- Combine into binary: bit₀ + 2·bit₁ + 4·bit₂
Answer:
- v1: [+, +, +] → 111₂ → 7
- v2: [+, +, +] → 111₂ → 7
- Still a collision! ✓
Exercise 3: Probability Calculation
If two vectors have cosine similarity 0.9, what's the probability they collide with:
- 1 random hyperplane?
- 4 random hyperplanes (all must match)?
- 10 tables of 4 hashes each (match in ANY table)?
Answers:
- 1 hyperplane: P = 1 - arccos(0.9)/π ≈ 0.86
- 4 hyperplanes: P ≈ 0.86^4 ≈ 0.54
- 10 tables: P = 1 - (1 - 0.54)^10 ≈ 0.9996 (very high recall!)
Exercise 4: Understanding the Tradeoff
You have 1M vectors, target recall@10 of 0.85. Compare configurations:
| Tables | Hashes | Expected Candidates | Recall | Memory |
|---|---|---|---|---|
| 5 | 8 | 500 | 0.65 | 60 MB |
| 20 | 6 | 2000 | 0.85 | 240 MB |
| 50 | 4 | 5000 | 0.95 | 600 MB |
Question: Which would you choose for a latency-critical application?
Hint: More candidates = slower re-ranking. Look at candidates vs recall ratio.
The Interview Questions They'll Ask
Conceptual Understanding:
- "Explain how LSH works to a software engineer unfamiliar with it."
- Good answer: "LSH uses special hash functions where similar items collide frequently. By hashing vectors with random hyperplanes and checking which side they fall on, we create codes where similar vectors get similar codes. Then we use hash tables for fast lookup."
- Great answer: Add probability guarantees, AND/OR amplification, and a concrete example with numbers.
- "Why use random hyperplanes for cosine similarity?"
- The probability two vectors are on the same side equals their angular similarity
- Mathematically: P(sign(v·h) = sign(u·h)) = 1 - θ/π, where θ = angle(v, u)
- This directly connects hash collision to cosine similarity
- "What's the difference between AND and OR construction?"
- AND: All k hashes must match → fewer collisions → higher precision
- OR: Match in any of L tables → more chances → higher recall
- Combined: (k AND) OR L → tunable precision/recall tradeoff
Implementation Details:
- "How do you choose the number of tables and hashes per table?"
- Start with the target recall requirement (e.g., 0.85)
- Estimate collision probability based on expected similarity
- Use the formula: P_recall ≈ 1 - (1 - p^k)^L
- Solve for L and k given the recall target
- Empirically tune on a validation set
- "What's multi-probe LSH and why does it help?"
- Instead of checking only the exact hash bucket, also check nearby buckets
- "Nearby" = Hamming distance 1 or 2 (flip 1-2 bits)
- Helps because vectors near decision boundaries might be in adjacent buckets
- Increases recall without adding tables (saves memory)
- "How would you handle updates (insertions/deletions)?"
- Insertions: Hash the new vector, add to a bucket in each of the L tables
- Deletions: Remove from all L buckets (requires storing a vector→bucket mapping)
- Challenge: bucket lists can become fragmented
- Solution: periodic rebuilding or more sophisticated data structures
Performance & Scaling:
- "Why is LSH typically slower than IVF or HNSW in practice?"
- High-dimensional spaces: need many bits for good precision
- Many tables needed for good recall
- Re-ranking cost: examine many candidates even if most are false positives
- Modern alternatives (HNSW, IVF) have better recall/speed tradeoffs
- "When would you choose LSH over other ANN algorithms?"
- Very high dimensionality (>1000D) where clustering is hard
- Streaming/online scenarios with frequent insertions
- When you need theoretical guarantees on approximation
- When simplicity and debuggability matter more than peak performance
- "How does LSH scale with dimensionality?"
- Need O(log n / ε²) hash functions for a (1+ε) approximation
- Higher dimensions → need more bits → larger hash codes
- But the curse of dimensionality affects ALL ANN methods
- LSH degrades more gracefully than some alternatives
Advanced Topics:
- "Explain data-dependent LSH vs data-independent LSH."
- Data-independent: random hyperplanes (what we've implemented)
- Data-dependent: learn hyperplanes from data to maximize separation
- Example: use PCA directions or train with neural networks
- Tradeoff: better performance vs complexity and adaptability
- "How would you implement LSH for other distance metrics?"
- L2 (Euclidean): p-stable distributions (Gaussian for L2; Cauchy gives L1)
- Hamming: random bit sampling
- Jaccard: MinHash
- Each metric needs a different hash family where collision probability tracks similarity
- "What's the theoretical guarantee of LSH?"
- An (r₁, r₂, p₁, p₂)-sensitive hash family
- If d(a,b) ≤ r₁ → P(h(a)=h(b)) ≥ p₁ (nearby items collide often)
- If d(a,b) ≥ r₂ → P(h(a)=h(b)) ≤ p₂ (distant items rarely collide)
- With amplification: guarantees on finding (1+ε)-approximate neighbors
Hints in Layers
Layer 1: Single Hash Table (Start Here)
- Implement one table with k=8 random hyperplanes
- Hash vectors by checking the sign of each projection
- Store in a dict: buckets[hash_value] = [vector_ids]
- Query: hash the query vector, return all vectors in the same bucket
- Test: verify similar vectors often share buckets
Layer 2: Multiple Tables (OR Construction)
- Create L=20 independent tables (different random hyperplanes)
- Insert: add vector to bucket in each table
- Query: collect candidates from all L tables (union)
- Re-rank candidates by actual distance
- Test: measure recall@10 vs single table
Layer 3: Parameter Tuning
- Implement parameter sweep: vary L and k
- For each configuration, measure:
- Recall@10
- Query time
- Memory usage
- Number of candidates examined
- Plot Pareto frontier: recall vs query time
- Test: find optimal L,k for target recall of 0.85
Layer 4: Multi-Probe LSH
- Generate probe sequence: flip 1 bit, 2 bits, etc.
- For each table, check exact bucket + probes
- Implement early stopping: stop if enough candidates found
- Test: compare recall with/without multi-probe at same #candidates
Layer 5: Optimizations
- Vectorized hash computation: batch process hyperplane projections (see the sketch after this list)
- Compact bucket storage: use numpy arrays instead of lists
- Early termination: stop search after examining N candidates
- Test: benchmark against baseline, aim for 2-3x speedup
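The vectorized hash computation from Layer 5 might look like this sketch (names are illustrative): one matrix multiply performs all hyperplane projections, and the sign bits are packed into integer codes:

import numpy as np

def batch_hashes(vectors, hyperplanes):
    # vectors: (N, D), hyperplanes: (k, D) -> one k-bit integer code per vector
    bits = (vectors @ hyperplanes.T) >= 0           # (N, k) sign tests
    powers = 1 << np.arange(hyperplanes.shape[0])   # [1, 2, 4, ...]
    return bits.astype(np.int64) @ powers           # (N,) hash codes

rng = np.random.default_rng(0)
codes = batch_hashes(rng.standard_normal((1000, 128)), rng.standard_normal((8, 128)))
print(codes.shape, codes.min(), codes.max())  # 1000 codes in [0, 255]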
Debugging Tips:
- Low recall? → Increase L (more tables) or decrease k (fewer hashes per table)
- Too slow? → Decrease L or use multi-probe instead of more tables
- Memory issues? → Decrease L or use approximate storage (Bloom filters)
- Verification: Always re-rank candidates with exact distances!
Books That Will Help
Core LSH Theory:
- "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman
- Chapter 3: Finding Similar Items
- Section 3.4: LSH for Cosine Distance (essential reading)
- Section 3.6: Applications of LSH
- Why: Best introduction to LSH with clear examples and proofs
Random Projections & Theory:
- "Foundations of Data Science" by Blum, Hopcroft, Kannan
- Chapter 2: Random Projection and Dimensionality Reduction
- Johnson-Lindenstrauss lemma and applications
- Why: Theoretical foundation for why LSH works
Algorithms & Probability:
- "Randomized Algorithms" by Motwani and Raghavan
- Chapter 6: Hashing
- Probabilistic analysis of hash functions
- Why: Deep dive into probability tools needed for LSH analysis
Practical Implementation:
- "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
- Chapter 19: Web Search Basics
- Near-duplicate detection using LSH
- Why: Real-world applications and engineering considerations
Advanced Topics:
- "Nearest-Neighbor Methods in Learning and Vision" edited by Shakhnarovich, Darrell, Indyk
- Chapter 5: LSH by Indyk
- Data-dependent LSH
- Why: Written by LSH pioneer, covers cutting-edge techniques
Online Resources:
- MIT OCW: Algorithmic Lower Bounds - lecture on LSH
- Alex Andoni's LSH Page - original research and code
- Tyler Neylon's LSH Tutorial Series - extremely clear visual explanations
Papers to Read (after implementing):
- "Similarity Search in High Dimensions via Hashing" (Gionis et al., 1999) - the original LSH paper
- "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions" (Datar et al., 2004) - LSH for L2 distance
- "Multi-Probe LSH" (Lv et al., 2007) - improves efficiency significantly
Learning milestones:
- Hash function maps similar vectors together → You understand LSH basics
- Multiple tables improve recall → You understand amplification
- Parameter tuning affects recall/speed → You understand the tradeoff
- Multi-probe improves efficiency → You understand advanced LSH
Project 3: Inverted File Index (IVF)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The "Open Core" Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Clustering / Inverted Indexes
- Software or Tool: IVF Index
- Main Book: "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
What you'll build: An Inverted File (IVF) index that clusters vectors with k-means, then searches only nearby clusters at query time; the foundation of production vector search.
Why it teaches IVF: IVF is the workhorse of production vector databases. It's conceptually simple (just clustering!) but has nuances around cluster count, nprobe tuning, and training data. Understanding IVF unlocks FAISS and most production systems.
Core challenges you'll face:
- K-means clustering at scale → maps to efficient training
- Choosing the number of clusters → maps to the sqrt(N) heuristic
- nprobe tuning → maps to the recall vs speed tradeoff
- Non-uniform cluster sizes → maps to data distribution challenges
Resources for key challenges:
- Pinecone Vector Indexes Guide - IVF explanation
- FAISS IVF Documentation
- Zilliz Vector Index Tutorial
Key Concepts:
- Inverted Index: "Introduction to Information Retrieval" Chapter 1 - Manning et al.
- K-Means Clustering: "Pattern Recognition and Machine Learning" - Bishop
- IVF in FAISS: FAISS documentation
Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Project 1 completed; understanding of k-means clustering.
Real world outcome:
$ ./ivf_search --build --dataset sift_1M.npy --clusters 1000
Building IVF Index:
Dataset: 1,000,000 vectors × 128 dimensions
Training k-means with 100,000 sample vectors...
K-means iterations: 25, converged: True
Clusters: 1,000 (avg 1,000 vectors per cluster)
Cluster size distribution:
Min: 423, Max: 2,156, Median: 987, StdDev: 312
Build time: 45.2s
Index size: 134 MB (centroids) + 512 MB (data)
$ ./ivf_search --query query.npy --k 10 --nprobe 10
IVF Query Results (nprobe=10):
Clusters searched: 10 (1% of index)
Vectors compared: 9,870
┌──────┬───────────┬──────────┬────────────┐
│ Rank │ Vector ID │ Distance │ Cluster ID │
├──────┼───────────┼──────────┼────────────┤
│  1   │   4,521   │  0.823   │    127     │
│  2   │   7,892   │  0.891   │    127     │
│ ...  │    ...    │   ...    │    ...     │
└──────┴───────────┴──────────┴────────────┘
Performance:
Time: 8.3ms (vs 1,847ms brute force)
Speedup: 222x
Recall@10: 0.92
$ ./ivf_search --nprobe-sweep
nprobe │ Recall@10 │ Query Time │ Vectors Compared
───────┼───────────┼────────────┼─────────────────
   1   │   0.52    │   1.2ms    │       987
   5   │   0.81    │   4.8ms    │     4,935
  10   │   0.92    │   8.3ms    │     9,870
  20   │   0.97    │  15.2ms    │    19,740
  50   │   0.99    │  38.1ms    │    49,350
Implementation Hints:
IVF structure:
import numpy as np
from sklearn.cluster import KMeans

class IVFIndex:
    def __init__(self, num_clusters, dim):
        self.num_clusters = num_clusters
        self.dim = dim
        self.centroids = None  # (num_clusters, dim)
        self.inverted_lists = [[] for _ in range(num_clusters)]
        # inverted_lists[i] = [(vector_id, vector), ...]

    def train(self, training_vectors):
        # Run k-means to find cluster centroids
        kmeans = KMeans(n_clusters=self.num_clusters, n_init=1)
        kmeans.fit(training_vectors)
        self.centroids = kmeans.cluster_centers_

    def add(self, vector_id, vector):
        # Find the nearest centroid
        distances = np.linalg.norm(self.centroids - vector, axis=1)
        cluster_id = np.argmin(distances)
        self.inverted_lists[cluster_id].append((vector_id, vector))

    def search(self, query, k, nprobe):
        # Find the nprobe nearest centroids
        centroid_distances = np.linalg.norm(self.centroids - query, axis=1)
        nearest_clusters = np.argsort(centroid_distances)[:nprobe]
        # Search within those clusters
        candidates = []
        for cluster_id in nearest_clusters:
            for vector_id, vector in self.inverted_lists[cluster_id]:
                dist = np.linalg.norm(query - vector)
                candidates.append((vector_id, dist))
        # Sort and return top-k
        candidates.sort(key=lambda x: x[1])
        return candidates[:k]
Choosing number of clusters:
Rule of thumb: num_clusters ≈ sqrt(num_vectors)
1M vectors → ~1,000 clusters
100M vectors → ~10,000 clusters
Too few clusters: each cluster too large, slow search
Too many clusters: centroid comparison overhead, missed neighbors
Training considerations:
- Train on a representative sample (10-20% of data is often enough)
- Ensure training data has same distribution as full dataset
- More k-means iterations = better centroids, but slower training
Non-exhaustive search pitfall:
If nearest neighbor is in cluster 5, but query is closest to
centroid of cluster 3, we might miss it with low nprobe.
Solution: Higher nprobe, or use HNSW to find nearby centroids faster
The Core Question You're Answering
"How does clustering enable fast search in high-dimensional spaces?"
This is the foundational insight of IVF: if you partition your dataset into regions (clusters), you can search only the regions near your query instead of the entire dataset. The key challenge is: how do you partition the space efficiently, and how do you decide which regions to search?
Think about it:
- Without clustering: You must compare your query to all N vectors (O(N) time)
- With clustering: You compare to K centroids, then search only nprobe clusters (O(K + nprobe * N/K) time)
- The tradeoff: Speed vs accuracy - fewer clusters searched = faster but might miss the true nearest neighbor
Concepts You Must Understand First
Before implementing IVF, you need solid understanding of:
- K-means Clustering
- How Lloyd's algorithm iteratively assigns points to centroids and updates centroids
- Why k-means minimizes within-cluster variance
- Convergence criteria and initialization strategies (k-means++)
- Time complexity: O(iterations × N × K × D) where K = number of clusters, D = dimensions
- Voronoi Partitions
- Each centroid defines a region (Voronoi cell) of all points closer to it than any other centroid
- The partition is exhaustive (every point belongs to exactly one cell)
- Voronoi boundaries are hyperplanes in D-dimensional space
- Understanding: "Which cluster does this query belong to?" = "Which centroid is nearest?"
- The nprobe Parameter
- nprobe = number of clusters to search at query time
- nprobe = 1: Search only the nearest cluster (fast, low recall)
- nprobe = K: Search all clusters (slow, perfect recall = brute force)
- nprobe = 10-50: Typical production values (good recall/speed tradeoff)
- Recall vs Speed Tradeoff
- Recall@k = fraction of true top-k neighbors found
- As nprobe increases: recall ↑, query time ↑
- The goal: Find minimum nprobe that achieves acceptable recall (often 90-95%)
- Cluster Imbalance
- Real data doesn't cluster uniformly: some clusters are huge, others tiny
- Imbalanced clusters hurt performance (searching a 5,000-vector cluster vs a 200-vector cluster); a measurement sketch follows this list
- Solutions: Use more clusters, or switch to techniques like IVF-PQ or HNSW
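The measurement sketch referenced above, assuming cluster assignments such as sklearn's kmeans.labels_:

import numpy as np

def cluster_stats(labels, num_clusters):
    # labels: cluster assignment per vector, e.g. kmeans.labels_
    sizes = np.bincount(labels, minlength=num_clusters)
    print(f"min={sizes.min()}  max={sizes.max()}  "
          f"median={np.median(sizes):.0f}  stddev={sizes.std():.0f}")
    # A max/min ratio above ~10x signals problematic imbalance
    return sizes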
Questions to Guide Your Design
As you build IVF, constantly ask yourself:
- How many clusters should I use?
- Too few (K = 10): Huge clusters, slow search even with nprobe=1
- Too many (K = 100,000): Centroid comparison overhead, might need high nprobe
- Rule of thumb: K ≈ sqrt(N) - but why? (Balances centroid comparison vs in-cluster search)
- For 1M vectors: K ≈ 1000 clusters → ~1,000 vectors per cluster
- How much training data do I need?
- K-means needs enough samples per cluster (at least 30-50 per cluster)
- For K=1000 clusters: need at least 30,000-50,000 training vectors
- But k-means is expensive on large datasets - can you subsample?
- Answer: Yes! 10-20% of data is often sufficient if representative
- What if clusters are imbalanced?
- Measure cluster size distribution: min, max, median, stddev
- If max/min > 10x, you have imbalance
- Options: Increase K, use hierarchical clustering, or accept it
- Real-world data is always imbalanced - design for it!
- How do I tune nprobe?
- Sweep nprobe from 1 to 100, measure recall@10 and query time
- Plot recall vs nprobe and time vs nprobe
- Find the "knee" of the curve, where recall plateaus (diminishing returns)
- Production systems often expose nprobe as a per-query parameter; a sweep sketch follows this list
- What distance metric should I use?
- K-means assumes Euclidean distance (L2)
- For cosine similarity: normalize vectors to unit length, then use L2
- For other metrics (Manhattan, etc.): k-means may not work well
- IVF is fundamentally Euclidean - that's a limitation!
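The sweep sketch referenced in question 4 above (assumes the IVFIndex class from the implementation hints, plus exact top-k IDs from brute force as ground truth):

import time
import numpy as np

def sweep_nprobe(index, queries, true_neighbors, k=10):
    # true_neighbors[i] = exact top-k IDs for queries[i], computed by brute force
    for nprobe in [1, 5, 10, 20, 50]:
        recalls, t0 = [], time.perf_counter()
        for q, truth in zip(queries, true_neighbors):
            ids = [vid for vid, _ in index.search(q, k, nprobe)]
            recalls.append(len(set(ids) & set(truth)) / k)
        ms = (time.perf_counter() - t0) * 1000 / len(queries)
        print(f"nprobe={nprobe:3d}  recall@{k}={np.mean(recalls):.2f}  {ms:.1f}ms/query")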
Thinking Exercise
Before coding, do this by hand to internalize the algorithm:
Setup: You have 20 2D vectors and K=4 clusters.
Vectors (x, y):
[1,1], [2,1], [1,2], [2,2], # Cluster 1 (bottom-left)
[8,1], [9,1], [8,2], [9,2], # Cluster 2 (bottom-right)
[1,8], [2,8], [1,9], [2,9], # Cluster 3 (top-left)
[8,8], [9,8], [8,9], [9,9], # Cluster 4 (top-right)
[5,5], [5,6], [6,5], [6,6] # Middle vectors
Part 1: K-means
- Initialize 4 random centroids
- Assign each vector to nearest centroid
- Recompute centroids as mean of assigned vectors
- Repeat until convergence
- What are the final centroids? (Should be near [1.5,1.5], [8.5,1.5], [1.5,8.5], [8.5,8.5])
Part 2: Search with nprobe=1
- Query: [5.5, 5.5]
- Which centroid is nearest? (Could be any; they're equidistant!)
- Search only that cluster
- What if the true nearest neighbor is in a different cluster? MISSED!
Part 3: Search with nprobe=2
- Query: [5.5, 5.5]
- Find 2 nearest centroids
- Search both clusters
- Higher chance of finding the true nearest neighbor
Part 4: Understanding the tradeoff
- nprobe=1: Search ~5 vectors (1/4 of dataset)
- nprobe=2: Search ~10 vectors (1/2 of dataset)
- nprobe=4: Search all 20 vectors (brute force!)
Key insight: The middle vectors [5,5], [5,6], [6,5], [6,6] are equidistant from all centroids. Queries near them NEED higher nprobe. This is the fundamental limitation of IVF!
The Interview Questions They'll Ask
When you understand IVF deeply, you can answer:
- Theory Questions:
- "Explain how IVF reduces search time complexity from O(N) to O(K + nprobe * N/K)."
- "Why do we use k-means for IVF? What properties make it suitable?"
- "What's the difference between IVF and HNSW? When would you use each?"
- "How does IVF handle new data points? Is retraining required?"
- Design Questions:
- "You have 100M 768-dim vectors. How many clusters would you use? Why?"
- "Your IVF index has 90% recall with nprobe=20 but you need 95%. What do you do?"
- "You notice cluster sizes range from 100 to 50,000 vectors. Is this a problem? How do you fix it?"
- "How would you handle streaming inserts to an IVF index?"
- Tradeoff Questions:
- "Given a query latency SLA of 10ms, how do you choose nprobe?"
- "Should you train k-means on all 100M vectors or a sample? What's the tradeoff?"
- "Your index is 50GB. How does IVF help with memory? How could you compress further?"
- "Compare IVF with nprobe=10 vs HNSW with ef=10. Which is faster? More accurate?"
- Implementation Questions:
- "How do you implement efficient centroid search? (Hint: it's brute force for small K!)"
- "What data structure do you use for inverted lists? Array? Linked list?"
- "How do you parallelize IVF search across multiple queries?"
- "What's the index build time for 1M vectors, 1000 clusters, 50 k-means iterations?"
- Production Questions:
- "Your IVF index has high query latency variance. Why might this happen?"
- Answer: Cluster imbalance - some clusters are 10x larger than others
- "How do you handle deletes in IVF? Tombstones? Rebuilding?"
- "Can you combine IVF with PQ? What's the memory/accuracy tradeoff?"
- "How would you shard an IVF index across 10 machines?"
Hints in Layers
Layer 1: If you're completely stuck starting
- Start with toy data: 1000 2D vectors, K=10 clusters
- Use sklearn.cluster.KMeans for training (don't implement k-means from scratch yet)
- Store clusters as a list of lists: clusters[i] = [(vector_id, vector), ...]
- For search: find nearest centroid with brute force, then search that cluster (see the sketch below)
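Putting the Layer 1 hints together, a minimal sketch assuming NumPy and scikit-learn (the class and variable names are illustrative, not from any library):

```python
import numpy as np
from sklearn.cluster import KMeans

class MinimalIVF:
    """Toy IVF: k-means partitioning + brute force inside probed clusters."""

    def __init__(self, n_clusters=10):
        self.km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        self.clusters = None  # clusters[i] = [(vector_id, vector), ...]

    def build(self, vectors):
        labels = self.km.fit_predict(vectors)
        self.clusters = [[] for _ in range(self.km.n_clusters)]
        for vid, (vec, lab) in enumerate(zip(vectors, labels)):
            self.clusters[lab].append((vid, vec))

    def search(self, query, k=10, nprobe=1):
        # Brute force over centroids is fine: K is small
        d = np.linalg.norm(self.km.cluster_centers_ - query, axis=1)
        probe = np.argsort(d)[:nprobe]
        # Exact scan, but only inside the probed clusters
        cands = [(float(np.linalg.norm(vec - query)), vid)
                 for c in probe for vid, vec in self.clusters[c]]
        return sorted(cands)[:k]

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
ivf = MinimalIVF(n_clusters=10)
ivf.build(data)
print(ivf.search(data[0], k=5, nprobe=2))  # (distance, id) pairs; id 0 first
```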
Layer 2: When basic IVF works
- Implement nprobe: search multiple clusters
- Measure recall by comparing to brute force results
- Plot recall vs nprobe curve
- Implement cluster size distribution statistics
Layer 3: When you want to optimize
- Implement your own k-means (Lloyd's algorithm)
- Add k-means++ initialization
- Use numpy vectorization for centroid distances
- Parallelize search across clusters
Layer 4: When you want production features
- Serialize index to disk (save centroids + inverted lists)
- Add batch query support
- Implement cluster rebalancing (split large clusters)
- Add IVF + flat refinement (search with IVF, refine with exact distances)
Common mistakes to avoid:
- Mistake: Using too few clusters (K < 100 for 1M vectors)
- Result: Slow search even with nprobe=1
- Fix: Use K ≈ sqrt(N)
- Mistake: Training k-means on biased sample
- Result: Poor cluster quality, low recall
- Fix: Ensure training sample is representative (random sample)
- Mistake: Not handling empty clusters
- Result: Some centroids have no assigned vectors
- Fix: Reassign empty centroids to split large clusters
- Mistake: Computing exact distances to all centroids
- Result: Slow centroid search
- Fix: For K < 10,000, brute force is fine. For larger K, use approximate methods
- Mistake: Storing full vectors in inverted lists
- Result: High memory usage
- Fix: Store only vector IDs, load vectors on-demand (or use PQ!)
Books That Will Help
- "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
- Chapter 1: Inverted indexes (the inspiration for IVF)
- Chapter 6: Scoring and ranking
- Chapter 18: Hierarchical clustering
- Why it helps: IVF is literally an inverted index for vectors
- "Pattern Recognition and Machine Learning" by Christopher Bishop
- Chapter 9: Mixture Models and EM (includes k-means)
- Chapter 12: Continuous Latent Variables
- Why it helps: Deep understanding of k-means and clustering theory
- "Foundations of Multidimensional and Metric Data Structures" by Hanan Samet
- Chapter 4: Point Access Methods
- Chapter 5: Multidimensional Point Data
- Why it helps: IVF in the broader context of spatial indexing
- Research Papers:
- "Video Google: A Text Retrieval Approach to Object Matching in Videos" (Sivic & Zisserman, 2003)
- Original IVF for visual search
- "Product Quantization for Nearest Neighbor Search" (Jégou et al., 2011)
- IVF-PQ, the production combination
- FAISS paper: "Billion-scale similarity search with GPUs" (Johnson et al., 2017)
- Modern IVF implementations
Learning milestones:
- K-means clusters vectors → You understand the preprocessing
- nprobe controls recall → You understand the search phase
- Cluster imbalance affects performance → You understand data distribution
- You compare with FAISS → You validate your implementation
Project 4: Product Quantization (PQ)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The "Open Core" Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Vector Compression / Quantization
- Software or Tool: PQ Encoder/Decoder
- Main Book: "Product Quantization for Nearest Neighbor Search" - Jégou et al.
What you'll build: A Product Quantization system that compresses 128-dimensional float vectors into just 8-16 bytes, enabling 97% memory reduction while maintaining search quality.
Why it teaches quantization: Memory is often the bottleneck for large vector databases. PQ is the key technique that makes billion-scale search possible on a single machine. Understanding PQ unlocks IVF-PQ, the most common production configuration.
Core challenges youโll face:
- Subspace decomposition → maps to splitting vectors into segments
- Codebook training → maps to k-means per subspace
- Asymmetric distance computation → maps to query vs database accuracy
- Distance lookup tables → maps to precomputed distances for speed
Resources for key challenges:
- Pinecone PQ Guide - 97% compression explained
- Zilliz PQ Tutorial - Hands-on Python
Key Concepts:
- Product Quantization: Original paper by Jรฉgou, Douze, Schmid
- Vector Quantization: Lloydโs algorithm / k-means
- Asymmetric Distance: ADC (Asymmetric Distance Computation)
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1, 3 completed. Understanding of k-means.
Real world outcome:
$ ./pq_compress --dataset sift_1M.npy --segments 8 --bits 8
Training Product Quantization:
Original: 1,000,000 vectors × 128 dims × 4 bytes = 512 MB
Segments: 8 (16 dims each)
Centroids per segment: 256 (8 bits)
Training codebooks (k-means per segment)...
Segment 0: 256 centroids, trained on 1M subvectors
Segment 1: 256 centroids, trained on 1M subvectors
...
Segment 7: 256 centroids, trained on 1M subvectors
Compressing vectors...
Compressed: 1,000,000 vectors × 8 bytes = 8 MB
Compression ratio: 64x (98.4% reduction!)
Codebook size: 8 segments × 256 centroids × 16 dims × 4 bytes = 128 KB
$ ./pq_search --query query.npy --k 10
PQ Search with ADC (Asymmetric Distance Computation):
Building distance lookup table: 0.02ms
Scanning 1M compressed codes: 12.3ms
Results:
Rank | Vector ID | Approx Dist | True Dist
-----|-----------|-------------|----------
1    | 4,521     | 0.841       | 0.823
2    | 7,102     | 0.923       | 0.905
...  | ...       | ...         | ...
Quality:
Recall@10: 0.89 (vs exact search)
Mean distance error: 3.2%
Performance:
Time: 12.3ms (vs 1,847ms brute force on raw vectors)
Memory: 8 MB (vs 512 MB raw)
Speedup: 150x
Implementation Hints:
Product Quantization structure:
Original vector (128 dims):
[v0, v1, v2, ..., v127]
Split into M=8 subvectors (16 dims each):
[v0-v15], [v16-v31], ..., [v112-v127]
Each subvector → quantized to one of 256 centroids
Compressed code: [c0, c1, c2, c3, c4, c5, c6, c7] (8 bytes total)
Implementation:
import numpy as np
from sklearn.cluster import KMeans

class ProductQuantizer:
    def __init__(self, dim, num_segments, num_centroids=256):
        self.dim = dim
        self.num_segments = num_segments
        self.segment_dim = dim // num_segments
        self.num_centroids = num_centroids
        self.codebooks = []  # One codebook per segment
def train(self, vectors):
for seg in range(self.num_segments):
start = seg * self.segment_dim
end = start + self.segment_dim
subvectors = vectors[:, start:end]
kmeans = KMeans(n_clusters=self.num_centroids)
kmeans.fit(subvectors)
self.codebooks.append(kmeans.cluster_centers_)
def encode(self, vector):
codes = []
for seg in range(self.num_segments):
start = seg * self.segment_dim
end = start + self.segment_dim
subvector = vector[start:end]
# Find nearest centroid
distances = np.linalg.norm(self.codebooks[seg] - subvector, axis=1)
code = np.argmin(distances)
codes.append(code)
return np.array(codes, dtype=np.uint8)
def decode(self, codes):
vector = []
for seg, code in enumerate(codes):
vector.extend(self.codebooks[seg][code])
return np.array(vector)
Asymmetric Distance Computation (ADC):
def compute_distance_tables(self, query):
"""Precompute distances from query subvectors to all centroids"""
tables = []
for seg in range(self.num_segments):
start = seg * self.segment_dim
end = start + self.segment_dim
query_sub = query[start:end]
# Distance from query subvector to each centroid
distances = np.linalg.norm(self.codebooks[seg] - query_sub, axis=1)
tables.append(distances)
return np.array(tables) # Shape: (num_segments, num_centroids)
def search_pq(self, query, codes, k):
"""Search using precomputed distance tables"""
tables = self.compute_distance_tables(query)
# Compute approximate distance to each code
distances = np.zeros(len(codes))
for seg in range(self.num_segments):
distances += tables[seg, codes[:, seg]] ** 2
# Top-k
indices = np.argpartition(distances, k)[:k]
return indices[np.argsort(distances[indices])]
Why ADC is fast:
- Query: Compute 8 × 256 = 2,048 subvector distances (one time)
- Per vector: 8 table lookups + 8 additions
- 1M vectors: 8M lookups vs 128M multiplications
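A short end-to-end run of the sketch above (assuming the ProductQuantizer class exactly as defined there, plus NumPy and scikit-learn):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(5000, 32)).astype(np.float32)

pq = ProductQuantizer(dim=32, num_segments=4, num_centroids=16)
pq.train(train)

codes = np.stack([pq.encode(v) for v in train])  # (5000, 4) uint8 codes
query = rng.normal(size=32).astype(np.float32)

top10 = pq.search_pq(query, codes, k=10)  # ADC scan over compressed codes

# Compare against the exact top-10 on the raw vectors
exact = np.argsort(np.linalg.norm(train - query, axis=1))[:10]
print(f"recall@10 vs brute force: {len(set(top10) & set(exact)) / 10:.2f}")
```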
The Core Question Youโre Answering
"How do we compress 512-byte vectors to 128 bytes (or even 8 bytes) without losing search quality?"
This is the central problem of Product Quantization. When you have billions of vectors, memory becomes the primary bottleneck. Loading 512GB of vectors into RAM is expensive; loading 8GB is feasible. But naive compression destroys the distance relationships we need for similarity search.
The insight: Instead of compressing the entire vector as one unit, split it into independent subspaces and quantize each separately. This preserves local structure while achieving massive compression.
Why this matters in production:
- Billion-scale search on single machines: 1B vectors × 512 bytes = 512GB (impossible) vs 1B × 8 bytes = 8GB (feasible)
- Cache efficiency: Compressed codes fit in CPU cache, enabling faster scanning
- Cost reduction: Less RAM = cheaper infrastructure
- IVF-PQ combination: The most common production config in vector databases
Concepts You Must Understand First
Before implementing Product Quantization, ensure you understand these foundational concepts:
1. Vector Quantization (VQ)
- What it is: Replacing continuous values with discrete codes from a codebook
- Classic example: k-means clustering where each point is represented by its cluster centroid
- The tradeoff: Memory savings vs reconstruction error
- Why you need it: PQ is just VQ applied to subvectors instead of full vectors
2. K-Means Clustering
- Lloyd's algorithm: Iterative refinement (assign → update → repeat)
- Codebook: The set of cluster centroids
- Quantization error: Distance between original point and assigned centroid
- Why you need it: Training PQ codebooks is running k-means on subvectors
3. Subspace Decomposition
- Concept: Splitting a 128D vector into 8 × 16D independent subvectors
- Independence assumption: Distance in each subspace contributes independently to total distance
- Why it works: For high-dimensional random vectors, this assumption holds reasonably well
- Why you need it: This is what makes "product" quantization different from regular VQ
4. Asymmetric Distance Computation (ADC)
- Symmetric: Both query and database vectors are quantized
- Asymmetric: Only database vectors are quantized; query remains exact
- Why asymmetric is better: Preserves query accuracy while compressing database
- Why you need it: This is the secret to maintaining search quality with PQ
5. Distance Decomposition
- For squared Euclidean distance: ||q - v||² = Σᵢ ||qᵢ - vᵢ||², where i indexes subspaces
- This enables ADC: Compute distance per subspace, then sum
- Precomputation: Build lookup tables of subspace distances once per query
- Why you need it: This is how PQ achieves fast approximate distance computation
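The decomposition is easy to verify numerically; a minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
q, v = rng.normal(size=128), rng.normal(size=128)

# Sum per-subspace squared distances over M=8 subvectors
M = 8
per_subspace = sum(np.sum((qs - vs) ** 2)
                   for qs, vs in zip(np.split(q, M), np.split(v, M)))
full = np.sum((q - v) ** 2)

assert np.isclose(per_subspace, full)  # exact for squared L2, by definition
```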
Study resources:
- Original PQ paper: Jégou et al., "Product Quantization for Nearest Neighbor Search"
- K-means: Chapter 9 of โPattern Recognition and Machine Learningโ by Bishop
- Vector quantization: โVector Quantization and Signal Compressionโ by Gersho & Gray
Questions to Guide Your Design
1. How many subspaces (M) should I use?
- Tradeoff: More subspaces = better compression, but larger overhead
- Common choices: M = 8, 16, 32, 64
- Rule of thumb: Each subspace should have at least 8-16 dimensions
- Example: For 128D vectors, M=8 gives 16D subspaces; M=16 gives 8D subspaces
2. How many centroids (K) per codebook?
- Determines code size: K=256 requires 8 bits per subspace
- Common choices: K = 256 (8 bits), K = 16 (4 bits), K = 4096 (12 bits)
- Tradeoff: More centroids = better reconstruction, but larger lookup tables
- Memory impact: Each codebook is K × (D/M) × 4 bytes
3. How much training data do I need?
- Minimum: At least 256 × 20 = 5,120 vectors for k-means to converge (if K=256)
- Recommended: 100K-1M vectors for stable codebooks
- Quality: More training data = codebooks better represent the distribution
- Tradeoff: Training time vs codebook quality
4. Symmetric vs Asymmetric distance?
- Symmetric (SDC): Quantize both query and database
- Pros: Simpler, slightly faster
- Cons: Lower accuracy (query quantization error)
- Asymmetric (ADC): Keep query exact, quantize database only
- Pros: Better accuracy (no query error)
- Cons: Requires distance tables
- Recommendation: Always use ADC for better recall
5. How do I balance compression vs accuracy?
- Compression ratio = (D × 4 bytes) / (M × bits_per_code / 8); see the helper sketched after this list
- Example: 128D, M=8, 8 bits → (128×4) / (8×1) = 64× compression
- Accuracy: Measured by recall@k compared to exact search
- Tuning: Increase M or K to improve recall (at cost of compression/speed)
6. Should I normalize vectors before PQ?
- Yes, if using cosine similarity: Normalize to unit length first
- Euclidean distance: Normalization optional but often helps
- Why: Prevents one subspace from dominating distance computation
- How: L2 normalize each vector before training and encoding
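The compression arithmetic from question 5 as a tiny hypothetical helper (the function name is illustrative):

```python
def pq_compression_ratio(dim, m_segments, bits_per_code, bytes_per_float=4):
    """Raw float storage divided by PQ code storage."""
    raw_bytes = dim * bytes_per_float
    code_bytes = m_segments * bits_per_code / 8
    return raw_bytes / code_bytes

print(pq_compression_ratio(128, 8, 8))   # 64.0 -> 512 B down to 8 B
print(pq_compression_ratio(128, 16, 8))  # 32.0 -> better recall, less compression
print(pq_compression_ratio(768, 32, 8))  # 96.0
```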
Thinking Exercise
Manual Product Quantization Walkthrough
Let's manually apply PQ to understand the process:
Given vector (16D for simplicity):
v = [0.5, 0.3, 0.8, 0.2, 0.1, 0.9, 0.4, 0.7, 0.6, 0.2, 0.3, 0.8, 0.9, 0.1, 0.5, 0.4]
Step 1: Split into M=4 subvectors (4D each):
v1 = [0.5, 0.3, 0.8, 0.2]
v2 = [0.1, 0.9, 0.4, 0.7]
v3 = [0.6, 0.2, 0.3, 0.8]
v4 = [0.9, 0.1, 0.5, 0.4]
Step 2: Assume we trained codebooks (K=4 centroids each):
Codebook 1:
c1,0 = [0.1, 0.1, 0.2, 0.1]
c1,1 = [0.5, 0.3, 0.7, 0.2] ← closest to v1
c1,2 = [0.8, 0.7, 0.9, 0.8]
c1,3 = [0.3, 0.5, 0.4, 0.6]
Codebook 2:
c2,0 = [0.1, 0.8, 0.3, 0.6]
c2,1 = [0.2, 0.9, 0.5, 0.7] ← closest to v2
c2,2 = [0.7, 0.3, 0.8, 0.2]
c2,3 = [0.9, 0.1, 0.1, 0.9]
Codebook 3:
c3,0 = [0.2, 0.1, 0.1, 0.3]
c3,1 = [0.6, 0.2, 0.4, 0.7] ← closest to v3
c3,2 = [0.9, 0.8, 0.7, 0.9]
c3,3 = [0.4, 0.5, 0.6, 0.5]
Codebook 4:
c4,0 = [0.1, 0.1, 0.2, 0.1]
c4,1 = [0.4, 0.3, 0.5, 0.3]
c4,2 = [0.8, 0.2, 0.4, 0.5] ← closest to v4
c4,3 = [0.9, 0.9, 0.8, 0.9]
Step 3: Encode vector (find nearest centroid per subspace):
v1 → c1,1 (code 1)
v2 → c2,1 (code 1)
v3 → c3,1 (code 1)
v4 → c4,2 (code 2)
Compressed code: [1, 1, 1, 2] (4 codes, 2 bits each = 1 byte total)
Step 4: Decode to approximate vector:
v' = [c1,1, c2,1, c3,1, c4,2]
   = [0.5, 0.3, 0.7, 0.2, 0.2, 0.9, 0.5, 0.7, 0.6, 0.2, 0.4, 0.7, 0.8, 0.2, 0.4, 0.5]
Step 5: Compute reconstruction error:
||v - v'||² = (0.8-0.7)² + (0.1-0.2)² + (0.4-0.5)² + (0.3-0.4)² + (0.8-0.7)² + (0.9-0.8)² + (0.1-0.2)² + (0.5-0.4)² + (0.4-0.5)²
            = 9 × 0.01
            = 0.09
Step 6: Asymmetric distance to query:
Given query: q = [0.4, 0.2, 0.7, 0.1, 0.3, 0.8, 0.5, 0.6, 0.5, 0.3, 0.2, 0.9, 0.8, 0.2, 0.4, 0.3]
Build distance tables:
Table 1 (distances from q1=[0.4,0.2,0.7,0.1] to each centroid in Codebook 1):
d(q1, c1,0) = 0.59
d(q1, c1,1) = 0.17 ← v1 encoded as code 1
d(q1, c1,2) = 0.97
d(q1, c1,3) = 0.66
Table 2 (distances from q2=[0.3,0.8,0.5,0.6] to each centroid in Codebook 2):
d(q2, c2,0) = 0.28
d(q2, c2,1) = 0.17 ← v2 encoded as code 1
d(q2, c2,2) = 0.81
d(q2, c2,3) = 1.05
Table 3 (distances from q3=[0.5,0.3,0.2,0.9] to each centroid in Codebook 3):
d(q3, c3,0) = 0.71
d(q3, c3,1) = 0.32 ← v3 encoded as code 1
d(q3, c3,2) = 0.81
d(q3, c3,3) = 0.61
Table 4 (distances from q4=[0.8,0.2,0.4,0.3] to each centroid in Codebook 4):
d(q4, c4,0) = 0.76
d(q4, c4,1) = 0.42
d(q4, c4,2) = 0.20 ← v4 encoded as code 2
d(q4, c4,3) = 1.01
Approximate distance:
d(q, v)² ≈ Table1[1]² + Table2[1]² + Table3[1]² + Table4[2]²
         = 0.03 + 0.03 + 0.10 + 0.04
         = 0.20
Key insights:
- Compression: 16D × 4 bytes = 64 bytes → 4 codes × 0.25 bytes = 1 byte (64× compression!)
- Reconstruction error: 0.09 (some information lost)
- Asymmetric distance: Only 4 table lookups + 4 additions per vector
- Codebook storage: 4 codebooks × 4 centroids × 4D × 4 bytes = 256 bytes
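The whole walkthrough can be replayed in a few lines of NumPy (the numbers are exactly those above):

```python
import numpy as np

v = np.array([0.5,0.3,0.8,0.2, 0.1,0.9,0.4,0.7,
              0.6,0.2,0.3,0.8, 0.9,0.1,0.5,0.4])
q = np.array([0.4,0.2,0.7,0.1, 0.3,0.8,0.5,0.6,
              0.5,0.3,0.2,0.9, 0.8,0.2,0.4,0.3])

# The four 4-centroid codebooks from the exercise, shape (M=4, K=4, 4)
codebooks = np.array([
    [[0.1,0.1,0.2,0.1],[0.5,0.3,0.7,0.2],[0.8,0.7,0.9,0.8],[0.3,0.5,0.4,0.6]],
    [[0.1,0.8,0.3,0.6],[0.2,0.9,0.5,0.7],[0.7,0.3,0.8,0.2],[0.9,0.1,0.1,0.9]],
    [[0.2,0.1,0.1,0.3],[0.6,0.2,0.4,0.7],[0.9,0.8,0.7,0.9],[0.4,0.5,0.6,0.5]],
    [[0.1,0.1,0.2,0.1],[0.4,0.3,0.5,0.3],[0.8,0.2,0.4,0.5],[0.9,0.9,0.8,0.9]],
])

subs = v.reshape(4, 4)
codes = [int(np.argmin(np.linalg.norm(cb - s, axis=1)))
         for cb, s in zip(codebooks, subs)]
print(codes)  # [1, 1, 1, 2], matching Step 3

recon = np.concatenate([cb[c] for cb, c in zip(codebooks, codes)])
print(round(float(np.sum((v - recon) ** 2)), 2))  # 0.09, matching Step 5

tables = np.array([np.linalg.norm(cb - s, axis=1)
                   for cb, s in zip(codebooks, q.reshape(4, 4))])
adc = sum(tables[m, c] ** 2 for m, c in enumerate(codes))
print(round(float(adc), 2))  # 0.20, matching Step 6
```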
The Interview Questions They'll Ask
Theory Questions:
- Explain Product Quantization in one sentence.
- Answer: "PQ splits vectors into subspaces, quantizes each subspace independently, and represents vectors as concatenated codes for massive compression with minimal search quality loss."
- Why is it called "product" quantization?
- Answer: The set of possible reconstructed vectors is the Cartesian product of the M codebooks. With M subspaces and K centroids each, you have K^M possible codes (e.g., 256^8 ≈ 1.8 × 10^19 for M=8, K=256).
- What's the compression ratio for 128D float vectors with M=8, K=256?
- Answer: Original: 128 dims × 4 bytes = 512 bytes. Compressed: 8 codes × 1 byte = 8 bytes. Ratio: 64× (98.4% reduction).
- Why use asymmetric distance instead of symmetric?
- Answer: Symmetric quantizes both query and database, introducing error on both sides. Asymmetric keeps the query exact, only quantizing database vectors, which preserves better search accuracy with minimal overhead (building distance tables).
- How does PQ affect recall?
- Answer: Recall decreases because quantization introduces approximation error. Typical recall@10: 85-95% with proper tuning. Increasing M or K improves recall but reduces compression.
Design Questions:
- You have 1B vectors of 512D. Design a PQ configuration.
- Considerations:
- Memory budget: 1B × X bytes (choose X based on RAM)
- M: 512/32 = 16D per subspace with M=32
- K: 256 (8 bits) for good reconstruction
- Result: 1B × 32 bytes = 32GB + codebooks (32 × 256 × 16 × 4 bytes = 512KB)
- Training: Sample 1M vectors for codebook training
- How would you debug low recall in PQ search?
- Check reconstruction error: Decode some vectors and compare to originals
- Visualize codebook coverage: Are centroids well-distributed?
- Check training data: Is it representative of query distribution?
- Try different M: Too few subspaces may lose structure
- Compare ADC vs SDC: Should see higher recall with ADC
- How does PQ integrate with IVF?
- Answer: IVF-PQ combines coarse quantization (IVF for partitioning) with fine quantization (PQ for compression). IVF narrows search to a few partitions (~100-1000), then PQ enables fast scanning within partitions. This is the most common production setup (used by FAISS, Milvus, etc.).
Implementation Questions:
- Walk me through the training process.
- Answer:
- Sample training vectors (e.g., 100K-1M)
- For each subspace m = 0..M-1:
  a. Extract subvectors: subvecs = train[:, m*D/M : (m+1)*D/M]
  b. Run k-means with K centroids
  c. Store centroids as codebook_m
- Save all M codebooks
- How do you implement fast ADC search?
- Answer:
- Precompute distance tables: For each subspace, compute the distance from the query subvector to all K centroids → M × K distances
- For each database vector code [c0, c1, ..., c_{M-1}]:
  a. Look up distances: d = [table[0][c0], table[1][c1], ..., table[M-1][c_{M-1}]]
  b. Sum squared distances: dist = sum(d²)
- Return top-k by dist
- Complexity: O(M×K) table computation + O(N×M) scanning vs O(N×D) brute force
- How would you optimize PQ for cache efficiency?
- Answer:
- Store codes contiguously in memory (good spatial locality)
- Align distance tables to cache lines (64 bytes)
- Use SIMD for parallel table lookups and summation
- Batch queries to amortize table computation
- Interleave codes by subspace for better access patterns
Tradeoff Questions:
- PQ vs scalar quantization (SQ)?
- PQ: Better compression (64×), more complex, requires training
- SQ: Simple (min/max per dimension), moderate compression (4×), no training needed
- Use PQ when: Memory is critical, can afford training time
- Use SQ when: Need simple deployment, moderate compression sufficient
- What happens if query distribution differs from training data?
- Answer: Codebooks may not represent query-relevant regions well, leading to higher quantization error and lower recall. Solutions: Retrain periodically, use larger K, or use query-aware training (sample training data from past queries).
Hints in Layers
Layer 1: Getting Started (Beginner)
- Start with a tiny dataset: 1000 vectors, 16D, M=4, K=16
- Implement basic encoding: Split vector → find nearest centroid per subspace → save codes
- Test reconstruction: Decode codes back to vectors, compute error
- Verify: Reconstruction error should be small but non-zero
Layer 2: Core PQ (Intermediate)
- Implement k-means training: Use sklearn or write simple Lloyd's algorithm
- Train one codebook per subspace on real data (e.g., 10K vectors)
- Encode entire dataset, measure compression ratio
- Implement decoding and visualize reconstruction error distribution
- Verify: Codebooks should have reasonable spread, reconstruction error < 10%
Layer 3: Asymmetric Distance (Intermediate-Advanced)
- Implement distance table computation: Query subvector to all centroids
- Implement ADC search: Table lookups + summation
- Compare with brute force on original vectors: Measure recall@10
- Optimize table computation: Vectorize with NumPy
- Verify: ADC should be 10-100× faster than brute force, recall > 80%
Layer 4: Optimization (Advanced)
- Batch table computation: Compute for multiple queries at once
- Optimize code storage: Use np.uint8 for codes, contiguous arrays
- Add SIMD hints: Align arrays, use numba for JIT compilation
- Implement early termination: Stop after finding k candidates
- Verify: Should achieve <1ms per query on 1M dataset
Layer 5: Production Ready (Expert)
- Add incremental updates: Insert new vectors without full retraining
- Implement codebook persistence: Save/load codebooks efficiently
- Add monitoring: Track recall, latency, compression ratio over time
- Combine with IVF: Partition space, then apply PQ within partitions
- Tune hyperparameters: Grid search over M, K, training size
- Verify: Recall > 90%, latency < 10ms on 10M+ dataset
Debugging tips:
- Low recall? Check if codebooks are well-trained (visualize centroids)
- Slow search? Profile table computation vs scanning (should be scanning-dominated)
- High memory? Verify code dtype is uint8/uint16, not float
- Poor reconstruction? Increase K or M, or normalize vectors before PQ
Books That Will Help
Primary Resource (Deep Dive):
- "Product Quantization for Nearest Neighbor Search" - Jégou, Douze, Schmid (TPAMI 2011)
- The original paper - only 12 pages but incredibly dense
- Sections to focus on: Section III (Product Quantization), Section IV (Asymmetric Distance)
- Contains all the math and theory you need
- Available: IEEE Xplore or search "Jégou PQ paper"
Foundational Books:
- "Vector Quantization and Signal Compression" - Gersho & Gray
- Chapter 1-4: Foundations of quantization theory
- Chapter 6: Vector quantization and k-means
- Why it helps: Provides deep background on quantization, distortion, rate-distortion theory
- "Pattern Recognition and Machine Learning" - Christopher Bishop
- Chapter 9: K-means clustering and EM algorithm
- Chapter 12: PCA and dimensionality reduction
- Why it helps: Builds intuition for clustering and subspace methods
- "Foundations of Multidimensional and Metric Data Structures" - Hanan Samet
- Chapter 4: High-dimensional indexing
- Why it helps: Context for where PQ fits in the landscape of ANN algorithms
Practical Resources:
- FAISS Documentation - Facebook AI Research
- FAISS Wiki on PQ
- Real-world implementation details and tuning guidelines
- C++ code to study for optimization techniques
- "Billion-scale similarity search with GPUs" - Johnson, Douze, Jégou (2017)
- How PQ is used at billion-scale in production
- GPU optimization techniques
- Available: arXiv:1702.08734
Online Tutorials:
- Pinecone Learning Center
- Product Quantization Guide
- Visual explanations and interactive examples
- Great for building intuition before diving into papers
- Towards Data Science Articles
- Search for "Product Quantization" and "Vector Quantization"
- Many practical tutorials with code examples
- Good for seeing different implementation approaches
Research Papers (Advanced):
- "Optimized Product Quantization" - Ge et al. (CVPR 2013)
- Improvements to basic PQ: non-orthogonal subspace decomposition
- Why it helps: Shows how to push PQ further for higher recall
- "Polysemous Codes" - Douze et al. (ECCV 2016)
- Combines PQ codes with binary codes for faster filtering
- Why it helps: State-of-the-art extension of PQ
How to Use These Resources:
- Week 1: Read the Jégou PQ paper (section by section), implement basic encoding/decoding
- Week 2: Study k-means (Bishop Ch 9), implement codebook training
- Week 3: Read ADC section carefully, implement asymmetric distance
- Week 4: Study FAISS code, optimize your implementation
- Week 5+: Explore advanced papers (OPQ, polysemous codes) for extensions
Learning milestones:
- Codebooks train successfully → You understand quantization
- Encoding/decoding works → You understand the representation
- ADC is much faster than brute force → You understand the speedup
- Recall drops gracefully → You understand the accuracy tradeoff
Project 5: HNSW (Hierarchical Navigable Small World)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The "Open Core" Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Graph-Based Search / Skip Lists
- Software or Tool: HNSW Index
- Main Book: "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" - Malkov & Yashunin
What you'll build: A complete HNSW implementation: the state-of-the-art algorithm for vector search, combining navigable small-world graphs with a hierarchical structure for logarithmic search complexity.
Why it teaches HNSW: HNSW is the default choice for most vector databases (Pinecone, Qdrant, Weaviate, pgvector). It offers the best balance of speed, accuracy, and flexibility. Understanding it deeply means understanding modern vector search.
Core challenges youโll face:
- Graph construction → maps to finding good neighbors during insertion
- Hierarchical layers → maps to skip-list-like navigation
- Greedy search → maps to navigating the graph efficiently
- Parameter tuning (M, efConstruction, efSearch) → maps to quality vs speed
Resources for key challenges:
- Pinecone HNSW Guide - Step-by-step explanation
- Zilliz HNSW Tutorial - Visual explanation
- Original HNSW Paper - The source
Key Concepts:
- Small World Graphs: Watts-Strogatz model
- Skip Lists: Probabilistic data structure
- Greedy Search: Local optimization in graphs
- HNSW Paper: Malkov & Yashunin, 2016
Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-2 completed. Understanding of graph data structures.
Real world outcome:
$ ./hnsw_build --dataset sift_1M.npy --M 16 --efConstruction 200
Building HNSW Index:
Dataset: 1,000,000 vectors × 128 dimensions
M (max connections): 16
efConstruction: 200
Construction progress:
Layer 3: 12 nodes (entry points)
Layer 2: 234 nodes
Layer 1: 4,891 nodes
Layer 0: 1,000,000 nodes (all vectors)
Build time: 8m 32s
Index size: 287 MB (graph) + 512 MB (vectors) = 799 MB
Graph statistics:
Avg connections at layer 0: 14.2
Max layer reached: 4
Graph diameter (estimated): 6 hops
$ ./hnsw_search --query query.npy --k 10 --ef 50
HNSW Search (ef=50):
Starting from entry point at layer 3
Layer 3: 2 hops → found node 8,234
Layer 2: 3 hops → found node 12,891
Layer 1: 4 hops → found node 4,521
Layer 0: Expanded 50 candidates
Results:
Rank | Vector ID | Distance
-----|-----------|---------
1    | 4,521     | 0.823
2    | 7,892     | 0.891
...  | ...       | ...
Performance:
Nodes visited: 143
Distance calculations: 143 (vs 1,000,000 brute force)
Time: 0.8ms
Recall@10: 0.98
$ ./hnsw_search --ef-sweep
ef   | Recall@10 | Query Time | Nodes Visited
-----|-----------|------------|--------------
10   | 0.72      | 0.2ms      | 38
20   | 0.88      | 0.4ms      | 67
50   | 0.98      | 0.8ms      | 143
100  | 0.995     | 1.4ms      | 271
200  | 0.999     | 2.6ms      | 523
Implementation Hints:
HNSW key insight: Combine two ideas:
- Navigable Small World (NSW): A graph where you can reach any node in few hops via greedy routing
- Hierarchy (Skip List): Multiple layers where upper layers have fewer nodes, enabling fast coarse search
Layer 2: [A]----------[B]----------[C] (few nodes, long edges)
| | |
Layer 1: [A]---[D]---[B]---[E]---[C] (more nodes)
| | | | |
Layer 0: [A][F][D][G][B][H][E][I][C] (all nodes, short edges)
Core data structures:
import numpy as np
from heapq import heapify, heappush, heappop

class HNSWNode:
    def __init__(self, vector_id, vector, level):
        self.id = vector_id
        self.vector = vector
        self.level = level
        self.neighbors = [[] for _ in range(level + 1)]
        # neighbors[layer] = list of neighbor node ids
class HNSWIndex:
def __init__(self, dim, M=16, ef_construction=200):
self.dim = dim
self.M = M # Max connections per node
self.M_max0 = M * 2 # Max connections at layer 0
self.ef_construction = ef_construction
self.ml = 1 / np.log(M) # Level multiplier
self.nodes = {} # id -> HNSWNode
self.entry_point = None
self.max_level = -1
Insertion algorithm:
def insert(self, vector_id, vector):
# 1. Determine level for new node (probabilistic)
level = self._random_level()
node = HNSWNode(vector_id, vector, level)
if self.entry_point is None:
self.entry_point = node
self.max_level = level
self.nodes[vector_id] = node
return
# 2. Find entry point at each level via greedy search
current = self.entry_point
# 3. Search from top to node's level (greedy, single result)
for lc in range(self.max_level, level, -1):
current = self._search_layer_greedy(vector, current, lc)
# 4. At node's level and below, find ef_construction neighbors
for lc in range(min(level, self.max_level), -1, -1):
candidates = self._search_layer(vector, current, self.ef_construction, lc)
# Select M best neighbors
neighbors = self._select_neighbors(vector, candidates, self.M if lc > 0 else self.M_max0)
# Bidirectional connections
for neighbor in neighbors:
node.neighbors[lc].append(neighbor.id)
neighbor.neighbors[lc].append(vector_id)
# Prune if too many connections
if len(neighbor.neighbors[lc]) > (self.M if lc > 0 else self.M_max0):
neighbor.neighbors[lc] = self._select_neighbors(
neighbor.vector,
[self.nodes[n] for n in neighbor.neighbors[lc]],
self.M if lc > 0 else self.M_max0
)
if len(candidates) > 0:
current = candidates[0] # Closest candidate
# 5. Update entry point if new node is higher
if level > self.max_level:
self.entry_point = node
self.max_level = level
self.nodes[vector_id] = node
def _random_level(self):
    # Standard HNSW rule: level = floor(-ln(U) * mL) with mL = 1/ln(M),
    # which promotes a node past each layer with probability 1/M
    level = int(-np.log(np.random.random()) * self.ml)
    return min(level, 16)
Search algorithm:
def search(self, query, k, ef):
if self.entry_point is None:
return []
current = self.entry_point
# Greedy search from top to layer 1
for lc in range(self.max_level, 0, -1):
current = self._search_layer_greedy(query, current, lc)
# At layer 0, do beam search with ef candidates
candidates = self._search_layer(query, current, ef, 0)
# Return top-k
return sorted(candidates, key=lambda n: distance(query, n.vector))[:k]
def _search_layer(self, query, entry, ef, layer):
    # 'distance' is the shared metric helper; in practice heap entries
    # need an id tie-breaker, since nodes don't compare on equal distances
    visited = {entry.id}
    candidates = [(distance(query, entry.vector), entry)]   # min-heap
    results = [(-distance(query, entry.vector), entry)]     # max-heap (negated)
heapify(candidates)
heapify(results)
while candidates:
dist_c, current = heappop(candidates)
# Stop if closest candidate is worse than worst result
if dist_c > -results[0][0]:
break
for neighbor_id in current.neighbors[layer]:
if neighbor_id in visited:
continue
visited.add(neighbor_id)
neighbor = self.nodes[neighbor_id]
dist_n = distance(query, neighbor.vector)
if dist_n < -results[0][0] or len(results) < ef:
heappush(candidates, (dist_n, neighbor))
heappush(results, (-dist_n, neighbor))
if len(results) > ef:
heappop(results)
return [node for _, node in results]
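Both insert and search above call _search_layer_greedy, which the sketch never defines; a minimal version, assuming the same distance helper, simply hill-climbs until no neighbor improves:

```python
def _search_layer_greedy(self, query, entry, layer):
    # Greedy descent used at the upper layers: repeatedly move to the
    # neighbor closest to the query, stop at a local minimum
    current = entry
    current_dist = distance(query, current.vector)
    improved = True
    while improved:
        improved = False
        for neighbor_id in current.neighbors[layer]:
            neighbor = self.nodes[neighbor_id]
            d = distance(query, neighbor.vector)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current
```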
Parameters explained:
- M: Max connections per node. Higher = more accurate, more memory
- ef_construction: Beam width during build. Higher = better graph, slower build
- ef_search: Beam width during search. Higher = better recall, slower search
The Core Question Youโre Answering
"How does a multi-layer graph enable logarithmic search in high-dimensional spaces?"
This is the heart of HNSW. Unlike flat graphs (NSW) that require examining many nodes, or tree structures that fail in high dimensions, HNSW uses a hierarchical graph structure inspired by skip lists. Upper layers have fewer nodes with long-range connections for coarse navigation, while lower layers have dense connections for fine-grained search.
The key insight: Start at the top sparse layer to quickly traverse the space, then progressively refine your search by moving down layers until you reach layer 0 where all vectors exist. This gives you O(log N) search complexity instead of O(N).
Why this matters for vector databases:
- Enables sub-millisecond search on millions of vectors
- Scales logarithmically rather than linearly
- Maintains high recall (>95%) while being 100x-1000x faster than brute force
- Powers production systems like Pinecone, Weaviate, and Qdrant
Concepts You Must Understand First
Before building HNSW, ensure you understand these foundational concepts:
1. Small-World Networks (Watts-Strogatz Model)
What it is: A graph where most nodes can be reached from any other node in a small number of hops.
Why it matters: HNSW builds on the Navigable Small World (NSW) concept where greedy routing (always moving to the neighbor closest to the target) can find approximate nearest neighbors efficiently.
Key property: High clustering coefficient + short average path length
Learn it:
- Study the Watts-Strogatz model and how random edges create shortcuts
- Understand why social networks are small-world graphs ("six degrees of separation")
- Read about the NSW algorithm (precursor to HNSW)
2. Skip Lists
What it is: A probabilistic data structure that allows O(log n) search in sorted linked lists by maintaining multiple layers of increasingly sparse elements.
Why it matters: HNSW's hierarchical structure is directly inspired by skip lists. Each node is assigned a random layer level, and upper layers contain progressively fewer nodes.
Key property: Probabilistic layer assignment creates logarithmic height
Learn it:
- Implement a basic skip list for 1D data
- Understand probabilistic level assignment: level = floor(-ln(random()) * ml)
- See how search starts at top layer and descends to find elements
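You can sanity-check the level rule by simulation; a small sketch assuming only the standard library:

```python
import math
import random
from collections import Counter

def random_level(ml, max_level=16):
    # level = floor(-ln(U) * ml), the rule HNSW borrows from skip lists
    return min(int(-math.log(random.random()) * ml), max_level)

M = 16
ml = 1 / math.log(M)
counts = Counter(random_level(ml) for _ in range(1_000_000))
print(sorted(counts.items()))  # each level holds roughly 1/M of the level below
```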
3. Navigable Graphs
What it is: A graph where greedy routing (always moving to the best neighbor) reliably finds good paths to targets.
Why it matters: HNSW search works because the graph maintains navigability at each layer.
Key property: At each node, there exists a neighbor that's closer to any query point
Learn it:
- Understand greedy search on graphs (always move to closest neighbor)
- Study Delaunay graphs (naturally navigable but expensive to build)
- Learn why random graphs are NOT navigable
4. Graph Layers and Zoom Levels
What it is: Different levels of granularity in representing the same graph structure.
Why it matters: Upper layers in HNSW act like "zoom out" views with long-range connections, while layer 0 is fully "zoomed in".
Analogy: Like Google Maps - start with country view (layer 3), zoom to city (layer 2), then street (layer 1), then buildings (layer 0)
Questions to Guide Your Design
Before coding, think through these design questions:
1. M Parameter (Max Connections per Node)
- Q: If M=16, each node has max 16 neighbors per layer. How does this affect:
- Graph connectivity?
- Memory usage?
- Search quality?
- Build time?
- Think about: What happens if M is too small (4)? Too large (128)?
- Answer: M controls the tradeoff between recall and memory. Typical values: 12-48.
2. efConstruction (Beam Width During Build)
- Q: During insertion, you search for efConstruction candidates, then select M best. Why not just keep the first M you find?
- Think about: How does exploring more candidates during build affect the final graph quality?
- Answer: Higher efConstruction creates a better-connected graph but slows down insertion. Typical: 100-500.
3. Layer Selection Strategy
- Q: How do you decide which layer a new node belongs to?
- Think about:
- Should it be deterministic or random?
- What probability distribution?
- How does this affect the layer structure?
- Answer: Use exponential decay: P(level = l) ∝ e^(-l * ln(M)). This creates a skip-list-like structure.
4. Neighbor Selection Heuristic
- Q: You have efConstruction=200 candidates but can only keep M=16 neighbors. Which do you pick?
- The 16 closest by distance?
- 16 that maximize diversity?
- 16 that maintain graph connectivity?
- Think about: Pure distance-based selection can create "clusters" that hurt navigability.
- Answer: Use a heuristic that balances closeness and diversity (e.g., avoid neighbors too close to each other).
5. Greedy Routing at Upper Layers
- Q: At layer 3 (few nodes), you do greedy search (single path). At layer 0 (all nodes), you do beam search (ef paths). Why this difference?
- Think about: Computational cost vs accuracy at different layers.
- Answer: Upper layers are for coarse navigation (speed), layer 0 is for finding actual neighbors (accuracy).
Thinking Exercise: Trace Search by Hand
Setup: Imagine a 2D dataset (for visualization) with 1000 points and this HNSW structure:
- Layer 2: 4 nodes
- Layer 1: 40 nodes
- Layer 0: 1000 nodes (all points)
- M = 8, ef = 20
Exercise: Trace this search for query Q at position (7.5, 3.2):
Layer 2: [A:(0,0), B:(10,0), C:(0,10), D:(10,10)]
Entry point: A
Query Q: (7.5, 3.2)
Step-by-step:
- Layer 2 - Greedy Search:
- Current: A (0,0), distance to Q = 8.2
- Neighbors of A at layer 2: [B, C]
- Check B: (10,0), distance = 4.1 → closer!
- Check C: (0,10), distance = 10.1 → worse
- Move to B. Check B's neighbors: [A, D]
- Check D: (10,10), distance = 7.2 → worse than current
- Result: Best node at layer 2 = B (10, 0)
- Layer 1 - Greedy Search:
- Start from B's layer 1 representation
- B's neighbors at layer 1: [B1:(9,0), B2:(10,1), ...]
- Greedy walk until local minimum
- Result: Best node at layer 1 = (8, 3)
- Layer 0 - Beam Search:
- Start from (8, 3)
- Maintain ef=20 candidates
- Explore neighbors in a priority queue
- Keep top-20 closest candidates
- Result: Final 10 nearest neighbors
Key observations:
- Each layer narrows the search region
- Upper layers: O(1) hops because few nodes
- Lower layers: O(log N) hops
- Total complexity: O(log N)
Your task: Draw this scenario on paper. Create 4 nodes at layer 2, 40 at layer 1 (include the 4), and mark query Q. Trace the greedy path down the layers.
The Interview Questions Theyโll Ask
If you understand HNSW deeply, you should be able to answer these questions:
Theory Questions
- "Explain why HNSW has O(log N) search complexity while NSW is O(N)."
- Expected answer: The hierarchical structure reduces the search space at each layer. Upper layers have fewer nodes (sparse), so greedy search reaches the target region in O(1) hops. Descending layers increases nodes but maintains logarithmic hops. Without hierarchy (NSW), you must traverse a flat dense graph.
- "What happens if you set M=2? What about M=128?"
- M=2: Graph becomes a chain-like structure, poor connectivity, searches fail to find good paths
- M=128: Excellent connectivity and recall, but massive memory overhead and slower search (more neighbors to check)
- "Why use exponential decay for layer assignment instead of uniform random?"
- Exponential decay creates a skip-list structure where upper layers have exponentially fewer nodes. This ensures O(log N) layers. Uniform random would create O(N) layers or unbalanced structures.
- "HNSW vs IVF: when would you choose each?"
- HNSW: Better recall, consistent latency, online updates, smaller datasets (<100M)
- IVF: Better for billion-scale, lower memory (with PQ), batch workloads, GPU acceleration
Implementation Questions
- "How do you handle the entry point initialization?"
- First node becomes entry point at its random level
- Each new node that reaches a higher level becomes the new entry point
- Entry point is the starting node for all searches
- "Explain the pruning step during insertion."
- When adding bidirectional edge, neighbor might exceed M connections
- Must prune to M using selection heuristic (keep diverse neighbors)
- Without pruning, memory grows unbounded
- "What's the difference between efConstruction and efSearch?"
- efConstruction: beam width during index building (affects graph quality)
- efSearch: beam width during query (affects recall/speed tradeoff)
- Higher efConstruction → better graph → can use lower efSearch
- "How would you make HNSW distributed?"
- Partition vectors by ID or cluster
- Each shard builds local HNSW
- Search: scatter query to all shards, gather top-k from each, merge
- Challenge: no global entry point
Debugging Questions
- "Your recall is only 60% but hnswlib gets 95%. What's wrong?"
- Check: Are you doing bidirectional connections?
- Check: Is neighbor selection too greedy? (need diversity)
- Check: Is efConstruction too low?
- Check: Is distance calculation correct?
- "Build time is 10x slower than expected. Where's the bottleneck?"
- Profile: Is _search_layer called too often?
- Check: Is efConstruction too high?
- Check: Are you using a heap for candidate selection?
- Check: Distance computation vectorized?
Hints in Layers
Build your understanding progressively:
Layer 1: Build a Flat Navigable Graph (NSW)
Goal: Understand greedy graph search before adding hierarchy.
Start here:
class NSWGraph:
def __init__(self, M=16):
self.M = M
self.nodes = {}
self.entry_point = None
def insert(self, vector_id, vector):
# Find M nearest neighbors via greedy search
# Add bidirectional edges
# That's it! No layers yet.
What you'll learn:
- Greedy search on graphs
- Bidirectional edge maintenance
- Why M matters for connectivity
Success criteria: Can search 10K vectors with 70-80% recall
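To make the skeleton concrete, here is one minimal NSW sketch assuming NumPy, using a greedy walk to find the link candidates as the hint describes (all names are illustrative):

```python
import numpy as np

def distance(a, b):
    return float(np.linalg.norm(a - b))

class NSWGraph:
    """Flat navigable graph: greedy routing, bidirectional edges, no layers."""

    def __init__(self, M=16):
        self.M = M
        self.vectors = {}    # id -> vector
        self.neighbors = {}  # id -> list of neighbor ids

    def greedy_search(self, query, entry):
        current, best = entry, distance(self.vectors[entry], query)
        while True:
            if not self.neighbors[current]:
                return current
            nxt = min(self.neighbors[current],
                      key=lambda i: distance(self.vectors[i], query))
            d = distance(self.vectors[nxt], query)
            if d >= best:
                return current
            current, best = nxt, d

    def insert(self, vid, vector):
        links = []
        if self.vectors:
            entry = next(iter(self.vectors))
            near = self.greedy_search(vector, entry)
            # Link to the found node and its closest neighbors (approximate)
            cands = sorted({near, *self.neighbors[near]},
                           key=lambda i: distance(self.vectors[i], vector))
            links = cands[:self.M]
        self.vectors[vid] = vector
        self.neighbors[vid] = list(links)
        for n in links:  # bidirectional edges
            self.neighbors[n].append(vid)
```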
Layer 2: Add the Hierarchy (Skip List Structure)
Goal: Add probabilistic layers like skip lists.
Add this:
import random

def _random_level(self):
    level = 0
    # Promote past each layer with probability 1/M (geometric levels)
    while random.random() < 1 / self.M and level < 16:
        level += 1
    return level
class HNSWNode:
def __init__(self, vector_id, vector, level):
self.level = level
self.neighbors = [[] for _ in range(level + 1)]
What you'll learn:
- How layer assignment creates hierarchy
- Multi-layer graph structure
- Entry point management
Success criteria: Can build multi-layer graph, verify layer distribution
Layer 3: Implement Layer-by-Layer Search
Goal: Start at top, greedily descend to layer 0.
Add this:
def search(self, query, k, ef):
current = self.entry_point
# Greedy at upper layers
for lc in range(self.max_level, 0, -1):
current = self._search_layer_greedy(query, current, lc)
# Beam search at layer 0
candidates = self._search_layer(query, current, ef, 0)
return top_k(candidates)
What you'll learn:
- Why greedy search at top layers
- Why beam search at layer 0
- How entry point guides search
Success criteria: Search is faster than flat NSW, recall still 70-80%
Layer 4: Add Pruning and Selection Heuristics
Goal: Improve graph quality with smart neighbor selection.
Add this:
def _select_neighbors_heuristic(self, base_vector, candidates, M):
    # Don't just pick the M closest: pick M that maximize diversity
    # and avoid redundancy (too_close_to_any is an assumed helper, e.g.
    # "closer to an already-selected neighbor than to the base vector")
    selected = []
    for candidate in sorted(candidates,
                            key=lambda c: distance(base_vector, c.vector)):
        if len(selected) >= M:
            break
        # Check: is candidate too close to one already selected?
        if not too_close_to_any(candidate, selected):
            selected.append(candidate)
    return selected
What you'll learn:
- Why pure distance is insufficient
- How diversity improves navigability
- Graph connectivity vs local clustering
Success criteria: Recall jumps to 90-95%
Layer 5: Optimize and Match Production Quality
Goal: Make it fast and robust.
Add this:
- Vectorize distance computations
- Use heapq for candidate management
- Pre-allocate numpy arrays
- Add build progress monitoring
- Parameter validation
Success criteria: Match hnswlib within 2x on speed and 99% on recall
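For the Layer 5 target, hnswlib makes a convenient reference implementation; a small benchmark sketch assuming NumPy and the hnswlib package:

```python
import numpy as np
import hnswlib

dim, n, k = 64, 20_000, 10
rng = np.random.default_rng(0)
data = rng.normal(size=(n, dim)).astype(np.float32)
queries = data[:100]

index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(50)

labels, _ = index.knn_query(queries, k=k)

# Exact ground truth per query for recall measurement
gt = np.stack([np.argsort(np.linalg.norm(data - q, axis=1))[:k] for q in queries])
recall = np.mean([len(set(l) & set(g)) / k for l, g in zip(labels, gt)])
print(f"hnswlib recall@{k}: {recall:.3f}")  # run your index on the same data
```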
Books That Will Help
Primary Resources
- "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" - Malkov & Yashunin (2016)
- The original HNSW paper
- Read sections: Abstract, Algorithm 1 (insertion), Algorithm 2 (search)
- Skip: Heavy theory sections on first read
- Focus on: Figure 1 (layer structure), Algorithm descriptions
- "Searching in Metric Spaces" - Zezula, Amato, Dohnal, Batko
- Chapter on navigable small worlds
- Background on metric space searching
- Focus on: NSW foundations, greedy routing
Supplementary Materials
- Skip Lists papers
- "Skip Lists: A Probabilistic Alternative to Balanced Trees" - Pugh (1990)
- Understanding probabilistic level assignment
- See parallels to HNSW layer selection
- Small-World Networks
- "Collective dynamics of 'small-world' networks" - Watts & Strogatz (1998)
- Understand why small-world graphs enable efficient navigation
- Key insight: High clustering + short path length
Practical Guides
- Pinecone Learning Center - HNSW
- URL: https://www.pinecone.io/learn/series/faiss/hnsw/
- Best practical explanation with visualizations
- Focus on: Layer-by-layer search visualization
- hnswlib Source Code
- URL: https://github.com/nmslib/hnswlib
- Production C++ implementation
- Focus on: hnswlib/hnswalg.h - core algorithms
- Compare your Python to their optimized C++
- FAISS Source Code
- URL: https://github.com/facebookresearch/faiss
- File: faiss/IndexHNSW.cpp
Video Resources
- "Billion-scale approximate nearest neighbor search" - Matthijs Douze (Facebook AI)
- YouTube search for FAISS talks
- Understanding HNSW in context of other methods
- "Vector Search & HNSW" - Pinecone
- Explains intuition before math
- Good for visual learners
Papers for Deep Dive
- "Navigating the Landscape of Metric Space Searching" - Malkov et al.
- Comprehensive comparison of graph-based methods
- See how NSW evolved to HNSW
Reading Order:
- Start with Pinecone guide (intuition)
- Read HNSW paper abstract + algorithms
- Implement Layer 1-2 from hints
- Read Skip Lists paper (understand probabilistic layers)
- Implement Layer 3-4
- Study hnswlib source code (optimizations)
- Read Small-World Networks paper (why it works)
Learning milestones:
- Hierarchical structure builds correctly → You understand the layers
- Greedy search finds good paths → You understand navigation
- Recall matches hnswlib → You've implemented correctly
- Parameter tuning works → You understand the tradeoffs
Project 6: IVF-PQ Composite Index
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The "Open Core" Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Combined Indexing / Large-Scale Search
- Software or Tool: IVF-PQ Index
- Main Book: "Billion-scale similarity search with GPUs" - Johnson et al.
What you'll build: A composite IVF-PQ index that combines clustering (IVF) with compression (PQ) for billion-scale search with minimal memory usage.
Why it teaches composite indexes: No single technique is enough at scale. IVF-PQ combines the best of both worlds: IVF reduces search scope, PQ reduces memory. This is how FAISS handles billion-vector datasets.
Core challenges youโll face:
- Training pipeline → maps to IVF clustering then PQ on residuals
- Residual encoding → maps to PQ on (vector - centroid)
- Search optimization → maps to precomputed tables + SIMD
- Memory layout → maps to efficient storage for cache
Key Concepts:
- Residual Quantization: PQ on the difference from centroid (see the sketch after this list)
- Two-Stage Training: IVF first, then PQ
- ADC with IVF: Combining both speedups
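Why residuals help is easy to see on clustered data; a toy sketch of the two-stage pipeline assuming NumPy and scikit-learn (a Project 4-style ProductQuantizer would then be trained on the residuals):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic clustered data: 64 well-separated centers plus small noise
rng = np.random.default_rng(0)
centers = rng.normal(size=(64, 32)) * 5
assign = rng.integers(0, 64, size=10_000)
vectors = (centers[assign] + rng.normal(size=(10_000, 32)) * 0.5).astype(np.float32)

# Stage 1: IVF coarse quantizer
ivf = KMeans(n_clusters=64, n_init=4, random_state=0).fit(vectors)
labels = ivf.labels_

# Stage 2: residuals relative to each vector's own centroid
residuals = vectors - ivf.cluster_centers_[labels]

# Residuals have far smaller magnitude than raw vectors, so the same
# PQ codebook budget spends its precision where it matters
print(np.linalg.norm(vectors, axis=1).mean())    # large
print(np.linalg.norm(residuals, axis=1).mean())  # much smaller
```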
Learning milestones:
- Two-stage training works → You understand the pipeline
- Residual encoding improves accuracy → You understand why it helps
- Memory reduction is massive → You understand the compression
- Search matches FAISS performance → You've implemented correctly
Project 7: Disk-Based Vector Index (DiskANN-style)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The โOpen Coreโ Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Storage Systems / SSD Optimization
- Software or Tool: Disk-Based Vector Index
- Main Book: "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node" - Microsoft Research
What you'll build: A vector index that stores vectors on SSD with an in-memory graph for navigation, enabling billion-scale search on a single commodity machine.
Why it teaches disk-based search: Memory is expensive. Storing 1B vectors in RAM requires 500GB+. DiskANN-style indexes keep only the graph in memory (~10GB) and fetch vectors from SSD during search, reducing cost by 50x.
Core challenges youโll face:
- Graph fits in memory, vectors on disk → maps to memory-mapped I/O
- Minimizing disk reads → maps to graph design for locality
- SSD-friendly access patterns → maps to sequential vs random reads
- Beam search with async I/O → maps to pipelining disk access
Key Concepts:
- DiskANN Paper: Microsoft Research, 2019
- Memory-Mapped Files: OS-level I/O optimization
- SSD Performance: Random vs sequential read patterns
Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Project 5 completed. C++ proficiency, understanding of file I/O.
Real world outcome:
$ ./diskann_build --dataset deep_1B.npy --memory-limit 16GB
Building DiskANN Index:
Dataset: 1,000,000,000 vectors × 96 dimensions
Vector data size: 384 GB (will be on disk)
Memory limit: 16 GB (for graph)
Building graph with memory-efficient construction...
Pass 1: Sample-based graph initialization
Pass 2: Full graph refinement (streaming)
Pass 3: Graph compaction
Index files:
vectors.bin: 384 GB (memory-mapped)
graph.bin: 12 GB (in-memory)
metadata.json: 1 KB
$ ./diskann_search --query query.npy --k 10 --beam-width 50
DiskANN Search:
Graph navigation: 0.4ms (in-memory)
Disk reads: 23 vectors fetched
SSD latency: 2.1ms (avg 0.09ms per read)
Total: 2.5ms
Performance analysis:
Graph hops: 8
Candidate set size: 50
Disk reads: 23 (< beam width due to caching)
Cache hits: 12
Recall@10: 0.96
$ ./diskann_search --benchmark --queries 10000
Benchmark (10K queries):
Mean latency: 2.8ms
P99 latency: 5.2ms
Throughput: 357 queries/second
SSD bandwidth: 1.2 GB/s
Implementation Hints:
Architecture:
+------------------------------------------+
|             In-Memory Graph              |
|   (Node ID -> [Neighbor IDs], 12 GB)     |
+------------------------------------------+
|          Memory-Mapped Vectors           |
|   (Node ID -> Vector, 384 GB on SSD)     |
+------------------------------------------+
Core structures:
#include <cstdint>
#include <vector>
using namespace std;

struct DiskANNIndex {
    // In-memory: graph only
    vector<vector<uint32_t>> graph;  // Neighbor lists
    // Memory-mapped: vectors on disk
    float* vectors;                  // mmap'd from file
    size_t num_vectors;
    size_t dim;
    // Caching (LRUCache is an assumed helper type, not a std container)
    LRUCache<uint32_t, vector<float>> vector_cache;
};
Memory-mapping vectors:
#include <fcntl.h>     // open, O_RDONLY
#include <sys/mman.h>  // mmap, madvise

void open_vectors(const string& path) {
    int fd = open(path.c_str(), O_RDONLY);
    size_t file_size = num_vectors * dim * sizeof(float);
    vectors = (float*)mmap(
        nullptr,
        file_size,
        PROT_READ,
        MAP_PRIVATE,  // private, copy-on-write mapping
        fd,
        0
    );
    // Advise OS that access will be random, not sequential
    madvise(vectors, file_size, MADV_RANDOM);
}
vector<float> get_vector(uint32_t id) {
// Check cache first
if (auto cached = vector_cache.get(id)) {
return *cached;
}
// Load from mmap (may trigger disk read)
float* ptr = vectors + id * dim;
vector<float> vec(ptr, ptr + dim);
vector_cache.put(id, vec);
return vec;
}
Search with prefetching:
vector<pair<uint32_t, float>> search(
    const vector<float>& query,
    size_t k,
    size_t beam_width
) {
    // candidates: closest-first frontier (distances negated so the
    // default max-heap behaves like a min-heap)
    priority_queue<pair<float, uint32_t>> candidates;
    // results: max-heap by distance, so top() is the current WORST result
    priority_queue<pair<float, uint32_t>> results;
    unordered_set<uint32_t> visited;
    // Start from entry point
    auto entry_vec = get_vector(entry_point);
    float dist = distance(query, entry_vec);
    candidates.push({-dist, entry_point});
    results.push({dist, entry_point});
    visited.insert(entry_point);
    while (!candidates.empty()) {
        auto [neg_dist, current] = candidates.top();
        candidates.pop();
        // Stop when the closest unexplored candidate is worse than the
        // worst result we are keeping
        if (-neg_dist > results.top().first && results.size() >= beam_width) {
            break;
        }
        // Prefetch neighbor vectors before touching them
        auto& neighbors = graph[current];
        for (uint32_t neighbor : neighbors) {
            // Hint to OS to prefetch this page
            prefetch(&vectors[neighbor * dim]);
        }
        // Process neighbors
        for (uint32_t neighbor : neighbors) {
            if (visited.count(neighbor)) continue;
            visited.insert(neighbor);
            auto neighbor_vec = get_vector(neighbor);
            float d = distance(query, neighbor_vec);
            if (d < results.top().first || results.size() < beam_width) {
                candidates.push({-d, neighbor});
                results.push({d, neighbor});
                if (results.size() > beam_width) {
                    results.pop();  // evict the current worst result
                }
            }
        }
    }
    // Drain the max-heap (worst first), reverse, keep the k best
    vector<pair<uint32_t, float>> top_k;
    while (!results.empty()) {
        top_k.push_back({results.top().second, results.top().first});
        results.pop();
    }
    reverse(top_k.begin(), top_k.end());
    if (top_k.size() > k) top_k.resize(k);
    return top_k;
}
Graph construction (memory-efficient):
// R = max degree, L = candidate list size during construction.
// random_neighbors, load_chunk, beam_search, and select_diverse_neighbors
// are assumed helpers in this sketch.
void build_graph(const string& vectors_path, size_t R, size_t L) {
    // Initialize with random graph
for (size_t i = 0; i < num_vectors; i++) {
graph[i] = random_neighbors(R);
}
// Iterative refinement (process vectors in chunks to limit memory)
size_t chunk_size = 1000000; // 1M vectors at a time
for (size_t offset = 0; offset < num_vectors; offset += chunk_size) {
size_t end = min(offset + chunk_size, num_vectors);
// Load chunk into memory
vector<vector<float>> chunk_vectors = load_chunk(vectors_path, offset, end);
// For each vector in chunk, improve its neighbors
for (size_t i = offset; i < end; i++) {
auto candidates = beam_search(chunk_vectors[i - offset], L);
graph[i] = select_diverse_neighbors(candidates, R);
// Add reverse edges
for (uint32_t neighbor : graph[i]) {
if (graph[neighbor].size() < R) {
graph[neighbor].push_back(i);
}
}
}
}
}
The Core Question You're Answering:
"How do we search billions of vectors that don't fit in RAM?"
This is the fundamental scaling challenge in vector search. A billion 96-dimensional float vectors require 384GB of RAM just for storage, plus additional memory for the index structure. At $10/GB/month for cloud RAM, that's $3,840/month just for storage. DiskANN solves this by keeping only the navigation graph (~10-20GB) in memory and storing vectors on SSD (~$0.10/GB/month), reducing costs by 50x while maintaining sub-10ms search latency.
The insight: You donโt need all vectors in RAM. You only need the graph structure that tells you which vectors to examine. With careful graph design and SSD-aware access patterns, you can achieve 95%+ recall with just 20-50 disk reads per query.
Concepts You Must Understand First:
Before building DiskANN, you need to understand:
- SSD Performance Characteristics:
- Sequential read: 3-7 GB/s (fast)
- Random read: 500K-1M IOPS, ~50-100 µs latency per read (slower)
- Page granularity: SSDs read in 4KB pages, not individual bytes
- Why it matters: Reading 96 floats (384 bytes) costs the same as reading 4KB. You pay for the whole page.
- Memory-Mapped Files (mmap):
- OS maps file content to virtual memory addresses
- Reading a mapped address triggers a page fault → the OS loads the page from disk → returns data
- OS manages the page cache automatically (LRU eviction)
- Why it matters: You get disk-backed arrays without manual I/O code (see the sketch after this list).
- Vamana Graph Structure:
- Undirected graph with bounded degree R (typically 64-128)
- Built to maximize navigability: short path to any target
- Pruning strategy maintains both connectivity and diversity
- Why it matters: Good graph structure = fewer disk reads per search.
- Beam Search Navigation:
- Maintains beam_width candidates (e.g., 50-100) during search
- Explores candidates in best-first order
- Greedy: expands most promising nodes first
- Why it matters: Limits number of nodes visited = limits disk reads.
- Caching Strategies:
- Hot vectors (entry points, frequently visited nodes) should stay in cache
- LRU cache for recently accessed vectors
- Pre-warming: load entry point and high-degree nodes at startup
- Why it matters: Cache hit = no disk read (0.1ms saved per hit).
- Async I/O and Prefetching:
- Issue prefetch hints for upcoming neighbor accesses
- OS can fetch multiple pages in parallel
- Overlaps computation with I/O latency
- Why it matters: Can reduce total latency by 30-50%.
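The mmap-plus-cache combination above is small enough to sketch. A minimal Python version using numpy's memmap; the file layout (one float32 row per vector) and every name here are illustrative, not DiskANN's real format:
from collections import OrderedDict
import numpy as np

class MmapVectorStore:
    def __init__(self, path, num_vectors, dim, cache_size=100_000):
        # np.memmap gives a disk-backed array; reads fault pages in on demand
        self.vecs = np.memmap(path, dtype=np.float32, mode="r",
                              shape=(num_vectors, dim))
        self.cache = OrderedDict()        # id -> np.ndarray, in LRU order
        self.cache_size = cache_size

    def get(self, vid):
        if vid in self.cache:
            self.cache.move_to_end(vid)   # cache hit: mark as recently used
            return self.cache[vid]
        vec = np.array(self.vecs[vid])    # copy out; may trigger a page fault
        self.cache[vid] = vec
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used
        return vec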
Questions to Guide Your Design:
Ask yourself these questions while building:
- Page Size and Alignment:
- Should vectors be aligned to 4KB page boundaries?
- Should you pack multiple small vectors per page?
- How does dimension affect packing efficiency?
- Graph Degree R:
- Higher R = more neighbors = better recall but more disk reads
- Lower R = faster search but may get stuck in local minima
- What's the sweet spot for your dataset?
- Beam Width:
- Larger beam = more candidates = more disk reads but better recall
- Smaller beam = faster but may miss the target
- How does beam width affect the recall/latency tradeoff?
- What to Cache:
- Entry point (always accessed)?
- High-degree nodes (hubs in the graph)?
- Recently accessed vectors (temporal locality)?
- What cache size gives diminishing returns?
- Graph Construction:
- Can you build the graph without loading all vectors into RAM?
- Can you process vectors in streaming fashion?
- How do you avoid out-of-memory during construction?
- Disk Access Patterns:
- Are your disk reads sequential or random?
- Can you batch reads to improve throughput?
- Are you triggering unnecessary page faults?
Thinking Exercise:
Before coding, work through these calculations:
- Memory Budget:
- 1B vectors ร 96 dims ร 4 bytes = 384 GB (vectors on disk)
- 1B nodes ร R neighbors ร 4 bytes (node ID)
- If R=64: 1B ร 64 ร 4 = 256 GB (graph structure)
- That's already over your 16GB memory limit!
- Key insight: The full graph can't live in RAM either. DiskANN stores each node's neighbor list on SSD next to its vector and keeps only compressed vectors (PQ codes) and hot nodes in memory.
- Disk Reads Per Query:
- Beam width = 50
- Assume 8 graph hops to reach target
- Worst case: 50 candidates ร 8 hops = 400 disk reads
- But! Many candidates share neighbors → actual reads ≈ 100-150
- With caching: reduce to 20-50 unique disk reads
- At 0.1ms per read: 2-5ms disk latency
- Page Fault Analysis:
- One vector = 96 floats ร 4 bytes = 384 bytes
- SSD page = 4KB
- Vectors per page: 4096 / 384 ≈ 10 vectors
- If vectors are randomly placed: 1 page fault per vector
- If vectors are co-located by graph proximity: 1 page fault per 10 vectors (10x speedup!)
- Cache Size Impact:
- 1GB cache = ~2.7M vectors (at 384 bytes each)
- If top 0.1% of vectors account for 20% of accesses (Zipf distribution)
- Caching 1M vectors (0.1%) could eliminate 20% of disk reads
- Diminishing returns after ~2-4GB cache
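These estimates are easy to sanity-check. A small Python script that reproduces the arithmetic above (constants are the exercise's assumptions; GB here means 10^9 bytes, matching the 384 GB figure):
n, dim, R = 1_000_000_000, 96, 64
GB = 10**9

print(f"vectors on disk: {n * dim * 4 / GB:.0f} GB")        # 384 GB
print(f"graph structure (R=64): {n * R * 4 / GB:.0f} GB")   # 256 GB

beam, hops, read_ms = 50, 8, 0.1
print(f"worst-case reads per query: {beam * hops}")         # 400
for reads in (150, 50, 20):   # shared neighbors, then caching
    print(f"{reads} reads -> {reads * read_ms:.1f} ms of disk latency")

page_bytes, vec_bytes = 4096, dim * 4
print(f"vectors per 4KB page: {page_bytes // vec_bytes}")   # 10
print(f"vectors in a 1GB cache: {2**30 // vec_bytes:,}")    # ~2.8M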
The Interview Questions They'll Ask:
Practice answering these:
- "Why use DiskANN instead of HNSW for billion-scale search?"
- HNSW requires all vectors in RAM → 384GB for 1B vectors → $3,840/month
- DiskANN uses 10-20GB RAM + 400GB SSD → ~$200/month (19x cheaper)
- Latency: HNSW 1-2ms, DiskANN 3-8ms (3-4x slower but acceptable)
- When to use: DiskANN when cost matters, HNSW when latency is critical
- "How do you minimize disk reads during search?"
- Good graph structure: short paths (6-8 hops) to any target
- Beam search: prune candidates aggressively
- Caching: keep hot vectors (entry points, hubs) in memory
- Prefetching: hint OS to load neighbors in parallel
- Result: 20-50 disk reads per query instead of 400+
- "What's the difference between random and sequential SSD access?"
- Sequential: 3-7 GB/s (reading contiguous blocks)
- Random: 500K IOPS × 4KB = 2 GB/s (jumping around)
- DiskANN is random access by nature (graph navigation)
- Mitigation: Pre-sort vectors by graph proximity, batch reads
- "How do you build the graph without running out of memory?"
- Donโt load all vectors at once
- Process in chunks (1M vectors per iteration)
- For each chunk: improve graph connections using beam search
- Multiple passes: gradually refine graph quality
- Final graph fits in memory, but construction is streaming
- "What happens if your cache is too small?"
- More cache misses → more disk reads → higher latency
- If cache < entry point + critical nodes: every query hits disk
- Sweet spot: 1-4GB (0.1-0.4% of vectors) captures 20-40% of accesses
- Beyond 10GB: diminishing returns (<5% reduction in disk reads)
- "How do memory-mapped files work?"
- OS maps file to virtual address range
- Reading address triggers page fault if page not in RAM
- OS loads 4KB page from disk into RAM
- Subsequent reads in same page are free (already in RAM)
- OS evicts cold pages using LRU policy
Hints in Layers (Reveal When Stuck):
Layer 1 - Start Simple: Build in-memory HNSW first
- Start with Project 5's HNSW implementation
- Verify it works: build index, search, measure recall
- Understand the baseline: how much RAM does it use?
- This is your reference point for correctness
Layer 2 - Add Memory Mapping: Move vectors to disk
- Keep graph in memory, move vectors to memory-mapped file
- Use mmap() to map the vector file into the address space
- Measure: how many page faults per query? (use perf stat)
- You should see ~50-200 page faults per query initially
Layer 3 - Add Caching: Reduce disk reads
- Implement LRU cache for vectors (1-2GB)
- Cache the entry point and high-degree nodes at startup
- Measure cache hit rate: aim for 30-50%
- This should reduce latency by 30-40%
Layer 4 - Optimize Graph: Improve locality
- Re-order vector IDs by graph proximity (BFS from entry point)
- Vectors that are neighbors in graph should be close in file
- This improves page reuse: 10 vectors per page instead of 1
- Should reduce page faults by 5-10x
Layer 5 - Add Prefetching: Overlap I/O with computation
- Use madvise(MADV_WILLNEED) to hint upcoming accesses
- When visiting node N, prefetch all neighbors of N
- Measure: latency should drop 20-30% as CPU and I/O overlap
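Layer 5's hint can be tried directly from Python (3.8+ exposes madvise on mmap objects; MADV_WILLNEED is Linux-specific). A sketch against a hypothetical vectors.bin file, with offsets rounded down to page boundaries:
import mmap, os

DIM = 96
VEC_BYTES = DIM * 4                               # float32 vectors
PAGE = mmap.PAGESIZE

fd = os.open("vectors.bin", os.O_RDONLY)          # hypothetical vector file
mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)        # map the whole file

def prefetch(neighbor_ids):
    # Ask the kernel to start loading these pages before we touch them
    for vid in neighbor_ids:
        start = (vid * VEC_BYTES // PAGE) * PAGE  # round down to page start
        mm.madvise(mmap.MADV_WILLNEED, start, PAGE)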
Layer 6 - Compression (Advanced): Reduce disk bandwidth
- Store vectors in compressed format (Product Quantization)
- Load compressed vectors (1/8 size), decompress in CPU cache
- Final reranking: fetch full-precision vectors for top-K
- This is how production systems (Bing, Meta) scale to 10B+ vectors
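Layer 6's two-pass idea in miniature. This is a hedged sketch, not a real Product Quantization implementation: pq_distance and full_store stand in for a trained codebook and the on-disk vector file:
import numpy as np

def search_with_rerank(query, candidate_ids, pq_distance, full_store,
                       k=10, rerank_factor=4):
    # Cheap pass: rank all candidates by approximate (compressed) distance
    shortlist = sorted(candidate_ids, key=lambda vid: pq_distance(query, vid))
    shortlist = shortlist[:k * rerank_factor]
    # Expensive pass: exact distance, one full-precision fetch per survivor
    exact = [(vid, float(np.linalg.norm(query - full_store.get(vid))))
             for vid in shortlist]
    exact.sort(key=lambda pair: pair[1])
    return exact[:k]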
Debugging Tips:
- Use iostat to monitor disk I/O: are you reading more than expected?
- Use perf stat to count page faults: should be ~20-50 per query
- Profile with perf record: where is time spent? (should be I/O wait)
- Verify graph connectivity: run BFS from entry point, ensure all nodes reachable
Books That Will Help:
- "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node" - Subramanya et al., NeurIPS 2019
- The original paper (30 pages)
- Covers Vamana graph, graph construction, search algorithm
- Read sections 3-4 for graph design, section 5 for implementation
- Available free at: https://papers.nips.cc/paper/2019/hash/09853c7fb1d3f8ee67a61b6bf4a7f8e8-Abstract.html
- "Operating Systems: Three Easy Pieces" - Remzi Arpaci-Dusseau
- Chapter 13: Address Spaces
- Chapter 36: I/O Devices (SSD characteristics)
- Chapter 39: Files and Directories (memory-mapped I/O)
- Free online: https://pages.cs.wisc.edu/~remzi/OSTEP/
- "Database Internals" - Alex Petrov (O'Reilly, 2019)
- Chapter 4: Implementing B-Trees (page-based storage)
- Chapter 5: Transaction Processing and Recovery (disk I/O patterns)
- Shows how databases manage disk-backed data structures
- "Systems Performance" - Brendan Gregg (Pearson, 2020)
- Chapter 8: File Systems (page cache, read-ahead)
- Chapter 9: Disks (SSD internals, I/O scheduling)
- Essential for understanding I/O bottlenecks and profiling
- "The Art of Computer Systems Performance Analysis" - Raj Jain (Wiley, 1991)
- Chapter 12: Queueing Models (I/O queuing, latency analysis)
- How to reason about disk read latency distributions
- Classic text for performance modeling
Papers to Read:
- "Approximate Nearest Neighbor Search on High Dimensional Data" - Li et al., ICDE 2020 (DiskANN improvements)
- "SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search" - Chen et al., NeurIPS 2021 (Microsoft's production system)
- "HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory" - Ren et al., NeurIPS 2020 (combines RAM + SSD + persistent memory)
Learning milestones:
- Memory-mapped I/O works → You understand OS-level optimization
- Graph fits in memory limit → You understand memory management
- Latency is SSD-bound → You understand the bottleneck
- Prefetching helps → You understand pipelining
Project 8: Metadata Filtering with Vector Search
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The โOpen Coreโ Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Hybrid Search / Inverted Indexes
- Software or Tool: Filtered Vector Search
- Main Book: "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
What you'll build: A vector search system that supports metadata filtering (e.g., "find similar products where price < $100 and category = 'electronics'") efficiently, not just post-filtering.
Why it teaches filtered search: Real applications always have filters. Naive post-filtering is slow (scan 1M vectors, filter to 1K, return top-10). Smart pre-filtering or integrated filtering is essential for production systems.
Core challenges youโll face:
- Pre-filter vs post-filter → maps to when to apply filters
- Bitmap indexes for metadata → maps to efficient set operations
- Filtered HNSW traversal → maps to skipping invalid nodes
- Cost estimation → maps to choosing the right strategy
Key Concepts:
- Inverted Indexes: "Introduction to Information Retrieval" - Manning et al.
- Bitmap Indexes: Database index fundamentals
- Filtered Graph Search: Qdrant, Pinecone approaches
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Project 5 completed; understanding of database indexes.
Real world outcome:
$ ./filtered_search "
SELECT * FROM products
WHERE vector NEAR query_embedding
AND category = 'electronics'
AND price < 100
AND rating >= 4.0
LIMIT 10
"
Query Analysis:
Filter selectivity: 2.3% (23,000 of 1,000,000 match)
Strategy: Pre-filter + HNSW on filtered subset
Execution:
Step 1: Apply filters (bitmap intersection): 0.8ms
Step 2: Build filtered entry points: 0.1ms
Step 3: HNSW search on filtered graph: 1.2ms
Results:
┌──────┬────────────┬──────────┬────────┬─────────────┐
│ Rank │ Product ID │ Distance │ Price  │ Category    │
├──────┼────────────┼──────────┼────────┼─────────────┤
│ 1    │ PROD-34521 │ 0.234    │ $49.99 │ electronics │
│ 2    │ PROD-78234 │ 0.287    │ $89.99 │ electronics │
│ ...  │ ...        │ ...      │ ...    │ ...         │
└──────┴────────────┴──────────┴────────┴─────────────┘
Performance: 2.1ms total
(vs 847ms with post-filtering on full HNSW traversal)
$ ./filtered_search --analyze-strategies
Filter Selectivity │ Best Strategy  │ Expected Time
───────────────────┼────────────────┼──────────────
> 50%              │ Post-filter    │ ~2ms
10% - 50%          │ Filtered HNSW  │ ~3ms
1% - 10%           │ Pre-filter     │ ~2ms
< 1%               │ Scan filtered  │ ~1ms
Implementation Hints:
Metadata storage with bitmap indexes:
class MetadataIndex:
def __init__(self):
# For categorical: inverted index of bitmaps
self.categorical = {} # field -> {value -> bitmap}
# For numerical: sorted index
self.numerical = {} # field -> sorted [(value, vector_id)]
def add(self, vector_id, metadata):
for field, value in metadata.items():
if isinstance(value, str):
if field not in self.categorical:
self.categorical[field] = {}
if value not in self.categorical[field]:
self.categorical[field][value] = BitSet()
self.categorical[field][value].set(vector_id)
else:
if field not in self.numerical:
self.numerical[field] = []
self.numerical[field].append((value, vector_id))
def build_indexes(self):
for field in self.numerical:
self.numerical[field].sort()
def filter(self, conditions):
result = None
for field, op, value in conditions:
if field in self.categorical:
if op == '==':
bits = self.categorical[field].get(value, BitSet())
elif op == 'in':
bits = BitSet()
for v in value:
bits |= self.categorical[field].get(v, BitSet())
else: # numerical
bits = self._range_filter(field, op, value)
if result is None:
result = bits
else:
result &= bits # Intersection
return result
Filtered HNSW search:
class FilteredHNSW(HNSWIndex):
def search_filtered(self, query, k, ef, valid_ids):
"""Search considering only nodes in valid_ids"""
if len(valid_ids) == 0:
return []
# Convert to set for O(1) lookup
valid_set = set(valid_ids)
# Find a valid entry point
entry = self._find_valid_entry(valid_set)
if entry is None:
# Fall back to scanning
return self._scan_filtered(query, k, valid_set)
# Modified beam search that skips invalid nodes
visited = {entry.id}
candidates = [(distance(query, entry.vector), entry)]
results = []
while candidates:
dist_c, current = heappop(candidates)
# Add to results if valid
if current.id in valid_set:
heappush(results, (-dist_c, current))
if len(results) > ef:
heappop(results)
# Explore neighbors (including invalid ones for navigation)
for neighbor_id in current.neighbors[0]:
if neighbor_id in visited:
continue
visited.add(neighbor_id)
neighbor = self.nodes[neighbor_id]
dist_n = distance(query, neighbor.vector)
# Always add to candidates for navigation
if len(results) < ef or dist_n < -results[0][0]:
heappush(candidates, (dist_n, neighbor))
# Extract top-k valid results
return sorted(
[(n.id, -d) for d, n in results],
key=lambda x: x[1]
)[:k]
Cost-based strategy selection:
def choose_strategy(self, query, filters, k):
# Estimate filter selectivity
selectivity = self.estimate_selectivity(filters)
num_valid = int(self.num_vectors * selectivity)
# Estimate costs (a full planner would pick the cheapest option; the thresholds below are a calibrated shortcut)
post_filter_cost = self.hnsw_search_cost() + num_valid * 0.001
pre_filter_cost = self.filter_cost(filters) + self.scan_cost(num_valid)
filtered_hnsw_cost = self.filter_cost(filters) + self.hnsw_search_cost() * (1 / selectivity)
if selectivity > 0.5:
return 'post_filter'
elif selectivity < 0.01:
return 'pre_filter_scan'
else:
return 'filtered_hnsw'
The Core Question You're Answering
"How do we search by similarity AND attributes without losing accuracy?"
This is the fundamental tension in filtered vector search. Three naive approaches all fail:
- Post-filtering: Search all vectors, then filter → Fast search, but wastes 99% of work when filters are selective
- Pre-filtering + brute force: Filter first, scan matches → Accurate, but O(n) when you have millions of filtered results
- Separate systems: Vector index + SQL database → Two queries, consistency nightmares, network overhead
The real question is: Can we integrate filtering into the vector search itself, making the index aware of metadata?
This requires understanding:
- When filtering BEFORE search is faster (few vectors match)
- When filtering DURING search is better (a moderate fraction matches)
- When filtering AFTER search wins (most vectors match)
- How to traverse HNSW graphs while respecting filter constraints
- Why bitmap indexes enable sub-millisecond filter evaluation
Real-world impact: Qdrant's filtered search is 100-400x faster than post-filtering. Pinecone's metadata filtering enables production e-commerce search. Understanding this makes you valuable to any company doing RAG or semantic search.
Concepts You Must Understand First
1. Pre-filtering vs Post-filtering
Pre-filtering:
Filter (1M → 10K) → Vector search on 10K → Top 10
Cost: Filter time + search(10K)
Post-filtering:
Vector search on 1M → Filter top 100 → Return matches
Cost: search(1M) + filter(100)
Problem: Might only get 3 results after filtering top 100
Key insight: Pre-filtering works when selectivity is low (< 5%). Post-filtering works when selectivity is high (> 50%).
2. Filtered IVF (Inverted File Index)
Standard IVF: Cluster vectors, search the nearest clusters.
Filtered IVF: Each cluster stores metadata bitmaps.
Cluster 1: [vec1, vec2, vec5] + {category: electronics → [1,0,1]}
Cluster 2: [vec3, vec4] + {category: electronics → [0,1]}
Search with filter "category = electronics":
- Visit cluster 1, consider only vec1 and vec5
- Visit cluster 2, consider only vec4
- Skip non-matching vectors during traversal
3. Hybrid Indexes
Combine inverted index (for filters) with vector index (for similarity):
Inverted Index: Vector Index (HNSW):
category=electronics → [1,5,9]     1 → [2,3,5]
price<100 → [1,2,3,8,9]            2 → [1,4,6]
...
Intersection: [1,9]                Search only among [1,9]
Fast path: Bitmap intersection in microseconds, vector search on subset.
4. Filter Selectivity
Selectivity = matching_vectors / total_vectors
- High selectivity (80%): Most vectors match → post-filter is fine
- Medium selectivity (10-50%): Some vectors match → filtered HNSW
- Low selectivity (< 5%): Few vectors match → pre-filter + scan/small index
Estimating selectivity:
# Categorical: exact count from bitmap
selectivity = bitmap[category='electronics'].count() / total
# Numerical: histogram approximation
selectivity = histogram.estimate_range(price < 100) / total
5. Filtered Graph Traversal
Key challenge: HNSW navigation needs to traverse invalid nodes to reach valid ones.
Entry → [invalid] → [invalid] → [VALID TARGET]
Two strategies:
- Strict filtering: Only traverse valid nodes → May get trapped, poor recall
- Relaxed filtering: Traverse all, return only valid → Better recall, slightly slower
Qdrant uses relaxed filtering with "visited valid count" early termination.
6. Bitmap Index Fundamentals
Store set membership as bit vectors:
category = 'electronics': 10110010 (vectors 0,2,3,6 match)
price < 100: 11010010 (vectors 0,1,3,6 match)
Intersection (AND): 10010010 (vectors 0,3,6 match both)
Operations compile down to cache-friendly bitwise/SIMD instructions (on the order of microseconds per million bits).
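The semantics are easy to demonstrate with plain Python ints as bitsets (here bit i represents vector i, so the strings above read right-to-left). Production engines use Roaring bitmaps or hand-written SIMD, but the operations are identical:
electronics = 0b01001101      # vectors 0, 2, 3, 6
cheap       = 0b01001011      # vectors 0, 1, 3, 6

both = electronics & cheap    # bitwise AND = set intersection
print(bin(both))              # 0b1001001 -> vectors 0, 3, 6
print(both.bit_count())       # 3 matches: an exact selectivity count (3.10+)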
Resources:
- "Database Internals" (Chapter 5: Transaction Processing and Recovery) - bitmap indexes
- "Introduction to Information Retrieval" (Chapter 1) - inverted indexes
- Qdrant documentation on filtered search strategies
Questions to Guide Your Design
1. Filter Selectivity Analysis
- How do you estimate selectivity BEFORE executing the filter?
- What data structures enable fast selectivity estimation?
- Histograms for numerical ranges?
- Cardinality sketches for categorical values?
- When is the estimate good enough vs exact count needed?
2. Index Strategy Selection
Three main strategies:
Post-filter: search_all(k * overfetch) → filter → take_top_k
Pre-filter: filter → search_subset(k)
Hybrid: filter → search_with_constraint(k)
For each query, which strategy?
- What's the decision boundary in terms of selectivity?
- How do you handle multiple filters (category AND price AND rating)?
- What if selectivity estimation is wrong?
3. Bitmap vs Range Index
Categorical filters: bitmap is the obvious choice.
Numerical filters: bitmap or sorted array?
# Bitmap approach
price_ranges[0-10] = 0101010
price_ranges[10-20] = 1010001
# Fast intersection, memory intensive
# Sorted array approach
[(5, vec1), (12, vec2), (18, vec3), ...]
# Binary search for range, less memory
When to use which?
- Cardinality of numerical values?
- Range query patterns?
- Memory constraints?
4. HNSW Traversal with Filters
How do you modify HNSW search?
- Start from filtered entry point or original entry?
- Count only valid nodes toward ef parameter?
- Expand search if not enough valid results?
Edge case: What if filter matches 5 vectors but you need top-10?
- Return fewer results?
- Relax filter?
- Warn the user?
5. Index Maintenance
Vectors are added/updated/deleted with metadata:
- How do you update bitmaps incrementally?
- Do you rebuild indexes periodically?
- How do you handle metadata schema changes?
6. Cost Model Accuracy
Your strategy selection depends on cost estimates:
post_filter_cost = hnsw_time(N) + filter_time(k * overfetch)
pre_filter_cost = filter_time(N) + search_time(filtered_count)
How do you calibrate these costs?
- Offline benchmarking?
- Online adaptive tuning?
- What if workload changes?
Think about these before implementing. The best engineers reason about trade-offs before writing code.
Thinking Exercise
Scenario: 1 million product vectors with metadata:
{
"category": "electronics", # 10% of products
"price": 79.99, # 30% are < $100
"rating": 4.5 # 40% are >= 4.0
}
Query: โFind top-10 similar to [query_vec] where category=โelectronicsโ AND price < 100 AND rating >= 4.0โ
Part 1: Trace Pre-filtering Approach
Step-by-step execution:
1. category='electronics' → 100K matches (10%)
2. Intersect price<100 → ? matches
3. Intersect rating>=4.0 → ? matches
4. Search algorithm on filtered set → ?
Calculate:
- How many vectors after each filter?
- Assuming independent distributions, what's the final selectivity?
- If you use brute force on filtered set, how many distance calculations?
- If you use HNSW on full graph with filter, how many nodes visited?
Part 2: Trace Post-filtering Approach
1. HNSW search on 1M vectors with ef=200 → top 200 candidates
2. Filter candidates → how many match all conditions?
3. If < 10 matches, what do you do?
Calculate:
- Probability that top-200 contains >= 10 filtered matches
- Expected number of matches in top-200
- Overfetch factor needed to get 10 results with 95% confidence
Part 3: Compare Recall
Pre-filtering with HNSW:
- True top-10 from filtered set: A = [...]
- Your results: B = [...]
- Recall = |A ∩ B| / 10
Post-filtering with overfetch:
- True top-10 considering filters: C = [...]
- Your results: D = [...]
- Recall = |C ∩ D| / 10
Why might they differ?
- HNSW approximation errors
- Filter position in pipeline
- Overfetch amount
Part 4: Calculate Break-even Point
When does pre-filter become faster than post-filter?
Pre-filter cost:
T_filter + T_search(N_filtered)
Post-filter cost:
T_search(N_total) + T_filter(ef)
Solve for selectivity where costs are equal.
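You can also locate the break-even numerically. A Python sketch with assumed constants (1M vectors, ~2ms full HNSW search, ~0.5ms bitmap filter); treating HNSW latency as logarithmic in index size is this sketch's own modeling choice:
import math

N = 1_000_000
T_FILTER = 0.5                      # ms, bitmap pass over all vectors
T_SEARCH_FULL = 2.0                 # ms, HNSW over all N vectors

def t_search(n):
    # Assume HNSW latency grows ~log(n), scaled from the full-index point
    return T_SEARCH_FULL * math.log(max(n, 2)) / math.log(N)

for sel in (0.001, 0.01, 0.05, 0.2, 0.5, 0.9):
    pre = T_FILTER + t_search(int(N * sel))   # filter, then search the subset
    post = T_SEARCH_FULL + 0.01               # search all, filter top results
    print(f"selectivity {sel:>5}: pre={pre:.2f}ms "
          f"post={post:.2f}ms -> {'pre' if pre < post else 'post'}")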
Part 5: Design Hybrid Strategy
Can you combine both?
if selectivity < 0.05:
pre_filter + scan
elif selectivity < 0.3:
pre_filter + HNSW on filtered graph
else:
HNSW + post-filter with overfetch
For the given query (selectivity ≈ 0.012), which strategy?
- Show your calculation
- Estimate latency for each approach
- Which would you choose?
This exercise forces you to think quantitatively about filtering strategies.
The Interview Questions They'll Ask
Q1: "Explain pre-filtering vs post-filtering in vector search. When would you use each?"
Bad answer: "Pre-filtering filters first, post-filtering filters after."
Good answer:
- Pre-filtering: Apply metadata filter → search filtered subset
- Best when: Filter is highly selective (< 5% match)
- Cost: Filter is cheap, but searching small index might lack structure
- Risk: If too few vectors match, accuracy suffers
- Post-filtering: Search all vectors → filter top-k * overfetch
- Best when: Filter is not selective (> 50% match)
- Cost: Full search is expensive, but filtering small result set is cheap
- Risk: Need overfetch factor to ensure enough filtered results
- Decision: Estimate selectivity, compare costs, choose dynamically
Q2: "How do bitmap indexes enable fast filtering?"
Bad answer: "They're just bit arrays that are fast."
Good answer:
- Represent set membership as bit vectors: category='electronics' → 10110010
- Set operations map to bitwise AND/OR: extremely cache-friendly
- Modern CPUs: SIMD instructions process 256 bits at once
- Intersection of 1M vectors: ~1ms (vs ~10ms with hash set)
- Space: 1 bit per vector per categorical value
- Trade-off: Memory vs speed (use for low-cardinality categoricals)
Show you understand the implementation details.
Q3: "How would you search HNSW with filters?"
Bad answer: "Just skip nodes that don't match the filter."
Good answer - Three approaches:
- Strict filtering: Only traverse nodes matching filter
- Problem: Can get stuck in filtered subgraph, poor connectivity
- Recall suffers when filter is selective
- Relaxed filtering: Traverse all nodes, return only filtered
- Better graph connectivity, better recall
- Slightly more node visits
- Hybrid with entry points: Pre-filter to get entry points, HNSW from there
- Find valid nodes in each layer
- Start search from closest valid entry
- Qdrant's approach
Implementation detail: Track "valid nodes visited" separately from "total visited" for the ef parameter.
Q4: "How do you estimate filter selectivity without scanning all vectors?"
Bad answer: "Count how many match."
Good answer:
- Categorical: Maintain exact counts in bitmap indexes (instant)
  selectivity = bitmap[category='electronics'].popcount() / total_vectors
- Numerical ranges: Use histograms
  # 100 buckets for price [0, 1000]
  # histogram[70-80] = 5243 vectors, histogram[80-90] = 6891 vectors
  # Estimate price < 100: sum the first 10 buckets
- Multiple filters: Assume independence (overestimate) or maintain joint histograms (memory intensive)
- Validation: Compare estimate vs actual, adjust over time
Q5: "Filtered-DiskANN extends DiskANN with filter support. How does it work?"
This tests whether you've read the papers deeply.
Good answer:
- DiskANN already uses graph navigation on SSD
- Filtering adds: Bitmap index on SSD + filtered graph traversal
- Key insight: During beam search, only expand nodes matching filter
- Challenge: SSD random reads are slow, need prefetching
- Solution: Prefetch neighbor metadata + vectors in single I/O
- Trade-off: Larger I/O (fetch metadata too) vs more I/Os (separate metadata reads)
Layout on disk:
[Node ID | Vector | Neighbors | Metadata Bitmap]
Single 4KB read gets everything needed for filtering decision.
Q6: "Your filter matches 50% of vectors. Which strategy and why?"
Bad answer: "Post-filtering because it's simple."
Good answer - Walk through the calculation:
Assume: 1M vectors, HNSW search ~2ms, bitmap filter ~0.5ms
Post-filter:
Cost = search(1M) + filter(top-200)
= 2ms + 0.01ms ≈ 2ms
Expected matches in top-200: ~100 (plenty for top-10)
Pre-filter + HNSW:
Filtered set: 500K vectors
Cost = filter(1M) + search(500K)
= 0.5ms + ~1.5ms ≈ 2ms
Conclusion: Both are similar! But post-filter is simpler and has better recall (it avoids the connectivity problems of HNSW search restricted to a filtered graph).
Choose post-filter with small overfetch (ef=50 is enough).
Q7: "How would you handle a filter that matches only 10 vectors?"
Bad answer: "Use pre-filtering."
Good answer:
- Recognize this is extreme selectivity (0.001%)
- HNSW is overkill for 10 vectors
- Best approach: Pre-filter → brute-force scan of the 10 vectors
  valid_ids = bitmap.intersect_all(filters)  # 10 vectors
  if len(valid_ids) < 100:  # Just scan them
      return sorted([(id, distance(query, vectors[id])) for id in valid_ids])[:k]
- Cost: Filter (~0.5ms) + 10 distance calculations (~0.001ms) ≈ 0.5ms
- Much faster than HNSW search
Q8: "What if your selectivity estimate is wrong?"
This tests production thinking.
Good answer:
- Adaptive execution: Start with the estimated strategy, measure the actual cost
  strategy = choose_strategy(estimated_selectivity)
  results, actual_cost = execute(strategy)
  if actual_cost > threshold:  # Log for calibration
      update_cost_model(estimated_selectivity, actual_cost)
- Graceful degradation: If post-filter doesn't return enough results, fall back to pre-filter
  results = post_filter_search(k=10, ef=100)
  if len(results) < 10:  # Not enough matches, try pre-filter
      results = pre_filter_search(k=10)
- Cost-based optimizer: Like SQL databases, maintain statistics and update periodically
These questions test depth of understanding. Study the trade-offs, not just the algorithms.
Hints in Layers
Layer 1: Start with Post-filtering (Simplest baseline)
Build a naive implementation:
class PostFilterSearch:
def search(self, query, k, filters, overfetch=5):
# Search without filters
candidates = self.hnsw.search(query, k * overfetch)
# Filter results
matches = [c for c in candidates if self.matches_filter(c.id, filters)]
return matches[:k]
What you'll learn:
- How overfetch factor affects recall
- When you get < k results (not enough matches in top-N)
- That this works fine for non-selective filters
Test with varying selectivity (90%, 50%, 10%, 1%) and measure:
- Latency
- Recall (compare to ground truth)
- Required overfetch for 95% chance of k results
Layer 2: Add Bitmap Pre-filtering
Implement inverted index for categorical metadata:
class BitmapIndex:
def __init__(self, num_vectors):
self.num_vectors = num_vectors
self.indexes = {} # field -> value -> bitset
def add(self, vector_id, field, value):
if field not in self.indexes:
self.indexes[field] = {}
if value not in self.indexes[field]:
self.indexes[field][value] = Bitset(self.num_vectors)
self.indexes[field][value].set(vector_id)
def query(self, field, value):
return self.indexes[field].get(value, Bitset(self.num_vectors))
def intersect(self, filters):
result = None
for field, value in filters:
bits = self.query(field, value)
result = bits if result is None else result & bits
return result.to_list() # List of matching IDs
What you'll learn:
- Bitmap intersection is incredibly fast
- Memory usage scales with cardinality * num_vectors
- Can estimate selectivity for free (popcount)
Layer 3: Implement Strategy Selection
Add a query planner:
class FilteredSearch:
def search(self, query, k, filters):
selectivity = self.estimate_selectivity(filters)
if selectivity > 0.5:
return self.post_filter(query, k, filters)
elif selectivity < 0.01:
valid_ids = self.bitmap.intersect(filters)
return self.brute_force_scan(query, k, valid_ids)
else:
return self.filtered_hnsw(query, k, filters)
def estimate_selectivity(self, filters):
result = None
for field, value in filters:
bits = self.bitmap.query(field, value)
count = bits.popcount()
sel = count / self.num_vectors
result = sel if result is None else result * sel
return result
What you'll learn:
- Selectivity estimation is critical
- Different strategies have different sweet spots
- Cost model needs calibration with real data
Layer 4: Filtered HNSW Traversal
This is the hard part. Modify HNSW to respect filters during search:
def search_filtered(self, query, k, ef, valid_ids):
valid_set = set(valid_ids)
# Find entry point in filtered set
entry = self._find_valid_entry(valid_set)
visited = set()
candidates = [(distance(query, self.vectors[entry]), entry)]
results = []
while candidates:
dist, current = heappop(candidates)
if current in visited:
continue
visited.add(current)
# Add to results if valid
if current in valid_set:
results.append((current, dist))
if len(results) >= ef:
break
# Explore neighbors (even invalid ones for navigation)
for neighbor in self.graph[current]:
if neighbor not in visited:
dist_n = distance(query, self.vectors[neighbor])
heappush(candidates, (dist_n, neighbor))
return sorted(results, key=lambda x: x[1])[:k]
Key design decisions:
- Should you traverse invalid nodes? (Yes, for connectivity)
- How to count toward ef? (Only valid nodes)
- Entry point selection? (Closest valid node to original entry)
What you'll learn:
- Graph connectivity matters for filtered search
- Relaxed filtering beats strict filtering
- Entry point quality affects recall
Layer 5: Add Numerical Range Indexes
Bitmaps work for categorical fields, but numerical ranges need a different approach:
from bisect import bisect_left, bisect_right

class RangeIndex:
def __init__(self):
self.sorted_values = [] # [(value, vector_id), ...]
def build(self, vectors_metadata, field):
self.sorted_values = sorted(
[(meta[field], vid) for vid, meta in vectors_metadata.items()]
)
def range_query(self, min_val, max_val):
# Binary search for range
start = bisect_left(self.sorted_values, (min_val, 0))
end = bisect_right(self.sorted_values, (max_val, float('inf')))
return [vid for val, vid in self.sorted_values[start:end]]
For very selective range queries, this beats bitmap.
What you'll learn:
- Different index structures for different data types
- Trade-offs: bitmap (memory, fast AND) vs sorted (space-efficient, slower intersection)
Layer 6: Implement Hybrid Strategy (Pre-filter + Small HNSW)
For medium selectivity (5-20%), build HNSW on filtered subset:
def hybrid_search(self, query, k, filters):
# Pre-filter
valid_ids = self.bitmap.intersect(filters)
if len(valid_ids) < 1000:
# Too small for HNSW, brute force
return self.brute_force_scan(query, k, valid_ids)
# Build temporary HNSW on filtered subset
filtered_vectors = [self.vectors[vid] for vid in valid_ids]
temp_hnsw = HNSW(filtered_vectors, M=16, ef_construction=100)
# Search
results = temp_hnsw.search(query, k)
return [(valid_ids[r.id], r.distance) for r in results]
This is expensive (building HNSW), but for repeated queries on same filter, you can cache.
Layer 7: Add Cost-Based Optimization
Calibrate your cost model with benchmarks:
class CostModel:
def __init__(self):
# Calibrated constants (microseconds)
self.hnsw_base = 2000 # 2ms base HNSW search
self.hnsw_per_node = 0.01 # 10ns per node
self.bitmap_filter = 500 # 0.5ms to filter 1M vectors
self.distance_calc = 0.1 # 100ns per distance
def estimate_cost(self, strategy, selectivity, num_vectors):
if strategy == 'post_filter':
return self.hnsw_base + self.bitmap_filter * 0.001
elif strategy == 'pre_filter_scan':
filtered = int(num_vectors * selectivity)
return self.bitmap_filter + filtered * self.distance_calc
elif strategy == 'filtered_hnsw':
return self.bitmap_filter + self.hnsw_base * (1/selectivity)
def choose_strategy(self, selectivity, num_vectors):
costs = {
'post_filter': self.estimate_cost('post_filter', selectivity, num_vectors),
'pre_filter_scan': self.estimate_cost('pre_filter_scan', selectivity, num_vectors),
'filtered_hnsw': self.estimate_cost('filtered_hnsw', selectivity, num_vectors),
}
return min(costs, key=costs.get)
Run experiments, measure actual costs, tune constants.
Layer 8: Production Features
Add robustness:
- Graceful degradation when estimate is wrong
- Query timeout and fallback strategies
- Logging and observability (which strategy chosen, actual vs estimated cost)
- Adaptive cost model that learns from query patterns
By building layer by layer, you understand the trade-offs deeply.
Books That Will Help
Core Books:
- "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
- Chapter 1: Boolean Retrieval (inverted indexes, bitmaps)
- Chapter 6: Scoring and Ranking (cost models)
- Why: Foundational understanding of inverted indexes, which are the basis of bitmap filtering
- "Database Internals" by Alex Petrov
- Chapter 5: Transaction Processing and Recovery (bitmap indexes)
- Chapter 7: Log-Structured Storage (LSM trees for metadata)
- Why: Deep dive into bitmap index implementation, how databases do filtering efficiently
- "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman
- Chapter 3: Finding Similar Items (LSH with filtering)
- Chapter 9: Recommendation Systems (hybrid filtering)
- Why: Understand how filtering works at scale, selectivity estimation
Papers:
- "Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters" (2022)
- Core paper on integrating filters into graph-based ANN
- Explains pre-filter vs post-filter vs integrated filtering
- Shows how to modify graph traversal for filters
- Read: https://harsha-simhadri.org/pubs/Filtered-DiskANN22.pdf
- "Qdrant: Vector Search Engine with Extended Filtering Support" (Documentation)
- Production system implementing filtered search
- Explains payload indexes, filtered HNSW strategy
- Read: https://qdrant.tech/documentation/concepts/filtering/
- "PANN: Personalized Approximate Nearest Neighbor Search" (2020)
- Filtered ANN for recommendation systems
- User-based filtering strategies
- Shows real-world cost models
Supplementary:
- "Designing Data-Intensive Applications" by Martin Kleppmann
- Chapter 3: Storage and Retrieval (indexes)
- Why: Understand trade-offs in index design, when to use what
- "Database Systems: The Complete Book" by Garcia-Molina, Ullman, Widom
- Chapter 14: Index Structures (B-trees, bitmaps, spatial indexes)
- Why: Comprehensive coverage of all index types, helps you choose right structure
Blogs and Documentation:
- Pinecone's "Metadata Filtering Guide"
- https://www.pinecone.io/learn/vector-search-filtering/
- Production perspective on filtered vector search
- Weaviate's "Filtered Vector Search"
- How they combine inverted index + HNSW
- Real-world performance numbers
- PostgreSQL Documentation: Index Types
- Understand GIN (Generalized Inverted Index)
- See how SQL databases integrate full-text search with filtering
Reading Order:
- Start: Manning's "Introduction to Information Retrieval" (Chapter 1) - understand inverted indexes
- Then: Filtered-DiskANN paper - see how it applies to vector search
- Next: Qdrant documentation - see production implementation
- Deep dive: "Database Internals" (Chapter 5) - understand bitmap indexes deeply
- Advanced: PANN paper - see personalized filtering
These books and papers will make you an expert in filtered search, not just an implementer.
Learning milestones:
- Bitmap filters are fast → You understand inverted indexes
- Strategy selection matters → You understand cost-based optimization
- Filtered HNSW works → You understand graph traversal with constraints
- Performance beats the naive approach → You've built an efficient system
Project 9: Vector Database Storage Engine
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Rust
- Alternative Programming Languages: C++, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The โIndustry Disruptorโ
- Difficulty: Level 4: Expert
- Knowledge Area: Storage Systems / Write-Ahead Logging
- Software or Tool: Storage Engine
- Main Book: "Designing Data-Intensive Applications" by Martin Kleppmann
What you'll build: A durable storage engine for vectors with write-ahead logging (WAL), crash recovery, compaction, and efficient serialization: the persistence layer of a real vector database.
Why it teaches storage: Indexes are useless if they disappear on restart. A real vector database needs ACID properties: atomic writes, crash recovery, and efficient disk formats. This is what makes the difference between a prototype and a production system.
Core challenges youโll face:
- Write-ahead logging → maps to durability before acknowledging writes
- Crash recovery → maps to replaying the WAL on startup
- Compaction → maps to merging and garbage collection
- Efficient serialization → maps to binary formats for vectors
Key Concepts:
- WAL: "Designing Data-Intensive Applications" Chapter 3 - Kleppmann
- LSM Trees: For append-friendly storage
- Memory-Mapped Files: OS-level optimization
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-6, experience with file I/O and binary formats.
Real world outcome:
$ ./vecdb_storage --init --path ./data
Initializing VectorDB Storage Engine:
Data directory: ./data
WAL segment size: 64 MB
Compaction threshold: 1 GB
Files created:
./data/wal/segment_000000.wal
./data/vectors/vectors_000000.bin
./data/index/hnsw_000000.idx
./data/metadata.json
$ ./vecdb_storage --insert --count 100000
Inserting 100,000 vectors...
WAL writes: 100,000
Batch flushes: 10
Time: 4.2s (23,800 inserts/sec)
$ kill -9 $PID # Simulate crash
$ ./vecdb_storage --recover
Recovery started:
Scanning WAL segments...
Found 1 segment with 100,000 entries
Replaying entries: [████████████████████] 100%
Rebuilding index: 2.3s
Recovery complete!
Verified:
Vectors recovered: 100,000 ✓
Index consistency: OK ✓
No data loss ✓
$ ./vecdb_storage --compact
Compaction:
Before: 3 segments, 1.2 GB
Merging...
After: 1 segment, 0.9 GB
Space reclaimed: 300 MB (25%)
Implementation Hints:
Storage architecture:
./data/
├── wal/
│   ├── segment_000000.wal     # Write-ahead log
│   ├── segment_000001.wal
│   └── ...
├── vectors/
│   └── vectors_000000.bin     # Actual vector data
├── index/
│   └── hnsw_000000.idx        # Serialized index
└── metadata.json              # DB state
WAL entry format:
struct WALEntry {
sequence_number: u64,
operation: Operation, // Insert, Delete, Update
timestamp: u64,
vector_id: u64,
vector_data: Vec<f32>, // Empty for Delete
metadata: HashMap<String, Value>,
checksum: u32,
}
enum Operation {
Insert = 1,
Delete = 2,
Update = 3,
}
WAL writer:
struct WALWriter {
current_segment: File,
segment_size: usize,
current_size: usize,
sequence: AtomicU64,
}
impl WALWriter {
fn append(&mut self, entry: WALEntry) -> Result<u64> {
let seq = self.sequence.fetch_add(1, Ordering::SeqCst);
let mut entry = entry;
entry.sequence_number = seq;
        // Serialize the payload, then checksum the serialized bytes;
        // the checksum is framed after the payload, not inside it
        let bytes = entry.serialize();
        entry.checksum = crc32(&bytes);
// Write length + data + checksum
self.current_segment.write_all(&(bytes.len() as u32).to_le_bytes())?;
self.current_segment.write_all(&bytes)?;
self.current_segment.write_all(&entry.checksum.to_le_bytes())?;
// Flush to disk (durability!)
self.current_segment.sync_all()?;
self.current_size += bytes.len() + 8;
// Rotate segment if needed
if self.current_size >= self.segment_size {
self.rotate_segment()?;
}
Ok(seq)
}
}
Recovery:
fn recover(data_dir: &Path) -> Result<VectorDB> {
let mut db = VectorDB::new();
// Read metadata to find last checkpoint
let metadata = read_metadata(data_dir)?;
let last_checkpoint = metadata.last_checkpoint_sequence;
// Replay WAL from last checkpoint
for segment in list_wal_segments(data_dir)? {
let reader = WALReader::open(segment)?;
for entry in reader.entries() {
let entry = entry?;
// Verify checksum
if crc32(&entry.data) != entry.checksum {
log::error!("Corrupt WAL entry at seq {}", entry.sequence_number);
continue;
}
// Skip already applied entries
if entry.sequence_number <= last_checkpoint {
continue;
}
// Apply to in-memory state
match entry.operation {
Operation::Insert => db.insert_recovered(entry.vector_id, entry.vector_data),
Operation::Delete => db.delete_recovered(entry.vector_id),
Operation::Update => db.update_recovered(entry.vector_id, entry.vector_data),
}
}
}
// Rebuild index if needed
if db.index_needs_rebuild() {
db.rebuild_index()?;
}
Ok(db)
}
Compaction:
fn compact(&mut self) -> Result<()> {
// Create new segment file
let new_segment = create_new_segment()?;
// Write live vectors to new segment
for (id, vector) in self.vectors.iter() {
if !self.deleted.contains(id) {
write_vector(&new_segment, id, vector)?;
}
}
// Atomically swap segments
new_segment.sync_all()?;
rename(new_segment.path(), current_segment_path())?;
// Delete old segments
for old in old_segments {
remove_file(old)?;
}
// Truncate WAL (everything is now in new segment)
self.wal.truncate_before(self.sequence)?;
Ok(())
}
The Core Question You're Answering
"How do we persist vectors with ACID guarantees and fast recovery?"
When your vector database crashes mid-write, how do you ensure:
- Durability: Acknowledged writes survive crashes
- Atomicity: No partial writes corrupt the index
- Recovery: Restart quickly without full reindexing
- Performance: Donโt sacrifice write throughput for durability
This is the difference between a toy and a production system. Understanding storage engines means understanding the tradeoffs between:
- Write amplification (how many times does data hit disk?)
- Read amplification (how many files to read for one query?)
- Space amplification (how much extra space for metadata/logs?)
Youโre building the persistence layer that makes vectors survive power failures.
Concepts You Must Understand First
Before writing code, understand these storage fundamentals:
1. Write-Ahead Logging (WAL)
The durability guarantee:
- Every modification is written to an append-only log before updating in-memory structures
- Log entries are fsynced to disk before acknowledging the write
- On crash, replay the log to reconstruct state
- Key insight: Sequential writes are 100x faster than random writes
Critical details:
- WAL entries need checksums (detect corruption)
- Log segmentation (rotate to new files periodically)
- Idempotency (replaying same entry twice is safe)
Read: "Designing Data-Intensive Applications" Chapter 3 (Storage and Retrieval) - Kleppmann
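A minimal Python sketch of that contract (the project itself targets Rust): length-prefixed records, a CRC32 per record, and an fsync before the write is acknowledged. The framing here is illustrative, not a standard format:
import os, struct, zlib

class TinyWAL:
    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, payload: bytes):
        self.f.write(struct.pack("<I", len(payload)))          # 4-byte length
        self.f.write(payload)
        self.f.write(struct.pack("<I", zlib.crc32(payload)))   # 4-byte CRC32
        self.f.flush()
        os.fsync(self.f.fileno())   # durable before we acknowledge the write

    @staticmethod
    def replay(path):
        with open(path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break                                # clean end of log
                (length,) = struct.unpack("<I", header)
                payload, tail = f.read(length), f.read(4)
                if len(payload) < length or len(tail) < 4:
                    break                                # torn tail: stop
                if zlib.crc32(payload) != struct.unpack("<I", tail)[0]:
                    break                                # corruption: stop
                yield payload
Replay stops at the first bad record, so a torn final write never corrupts recovered state; replaying the log twice is safe as long as the apply step is idempotent.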
2. LSM Trees (Log-Structured Merge Trees)
The write-optimized architecture:
- Writes go to memtable (in-memory sorted structure)
- When memtable fills, flush to SSTable (sorted string table on disk)
- Background compaction merges SSTables to reduce read amplification
- Bloom filters avoid reading files that donโt contain a key
Why LSM for vectors?
- Vector inserts are write-heavy (ingestion pipelines)
- Immutable SSTables enable safe concurrent reads
- Compaction reclaims space from deleted vectors
Read: "Database Internals" Chapter 7 (Log-Structured Storage) - Alex Petrov
3. Memtable and SSTable Structure
Memtable (in-memory):
- Red-black tree or skip list (sorted structure)
- Fast insertions O(log n)
- Periodically flushed to disk
SSTable (on disk):
- Immutable sorted file
- Binary format: [key | value_offset | value_size] index block + data blocks
- Optional compression (Snappy, LZ4)
4. Compaction Strategies
Size-tiered compaction:
- Merge SSTables of similar size
- Good for write-heavy workloads
- Higher space amplification
Leveled compaction:
- Organize SSTables into levels (L0, L1, L2, ...)
- Each level is 10x larger than previous
- Lower space amplification, more I/O
Which to use?
- Start with size-tiered (simpler)
- Switch to leveled if space is constrained
5. Crash Recovery Protocol
Recovery steps:
- Read manifest file (which SSTables are valid)
- Replay WAL from last checkpoint
- Rebuild memtable from WAL
- Optionally rebuild index from SSTables
- Mark recovery complete, accept writes
Key insight: Recovery time depends on WAL size, not database size
Questions to Guide Your Design
WAL Format Design
- What goes in a WAL entry?
- Sequence number, operation type, vector ID, vector data, timestamp, checksum
- How do you serialize variable-length vectors efficiently?
- Should you compress WAL entries?
- How do you handle corrupted entries?
- CRC32 checksums for integrity
- What if corruption is in the middle of the log?
- Skip and continue, or stop recovery?
- When do you fsync()?
- After every write? (slow but durable)
- Batched every N writes? (fast but loss window)
- Configurable durability vs performance tradeoff
Flush Policy
- When do you flush memtable to SSTable?
- Size threshold (e.g., 64 MB memtable)
- Time threshold (e.g., every 60 seconds)
- Memory pressure (adaptive flushing)
- What if you crash during flush?
- Flush is not atomic (partial SSTable on disk)
- Solution: Write to temp file, rename on completion (atomic on POSIX)
- Keep WAL entries until flush completes
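The temp-file-plus-rename trick, sketched in Python. rename() is atomic on POSIX filesystems, so readers see either the old file or the complete new one; the directory fsync makes the rename itself durable:
import os

def atomic_write(path, data: bytes):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # contents durable before publishing
    os.rename(tmp, path)              # atomic swap: no partial file visible
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)              # persist the rename itself
    finally:
        os.close(dir_fd)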
Compaction Strategy
- How often do you compact?
- Too frequent: wastes CPU/disk I/O
- Too rare: read amplification hurts queries
- Heuristic: compact when you have N SSTables at same level
- What about tombstones (deleted vectors)?
- Mark deleted in WAL, filter during compaction
- Bloom filters prevent reading SSTables with deleted keys
Crash Recovery
- How do you know where to start recovery?
- Checkpoint sequence number in manifest file
- Replay WAL entries with seq > last_checkpoint
- What if the index is corrupted?
- Can you rebuild HNSW from vectors alone?
- Trade-off: fast recovery vs full reindex
Thinking Exercise
Trace an insert through the storage engine:
User calls: db.insert(vector_id=42, vector=[0.1, 0.2, ...])
Step 1: Write to WAL
- Serialize WAL entry (seq=100, op=Insert, id=42, data=[...])
- Write to wal/segment_000000.wal
- fsync() to guarantee durability
- Return sequence number
Step 2: Update memtable
- Insert into in-memory sorted structure
- memtable.insert(42, vector)
Step 3: Check flush condition
- if memtable.size() > 64 MB:
- Freeze memtable (make immutable)
- Start new memtable for writes
- Background thread flushes frozen memtable to SSTable
Step 4: Flush to SSTable (background)
- Create temp file: vectors/vectors_000001.tmp
- Write sorted entries from frozen memtable
- Write index block (key -> offset mapping)
- Rename to vectors/vectors_000001.sst (atomic!)
- Update manifest: "SSTable 000001 contains seq 0-100"
- Truncate WAL entries <= 100 (no longer needed)
Step 5: Compaction (background, periodically)
- Scan for SSTables at same level with overlap
- Merge-sort entries from multiple SSTables
- Write new compacted SSTable
- Delete old SSTables
- Update manifest
Now trace a crash scenario:
Crash happens during Step 4 (flush in progress)
On restart:
1. Read manifest -> last valid SSTable is 000000 (seq 0-50)
2. Scan WAL -> find entries seq 51-100
3. Replay:
- entry seq=51: insert(id=10, ...)
- entry seq=52: insert(id=11, ...)
- ...
- entry seq=100: insert(id=42, ...)
4. Memtable now contains all vectors from seq 51-100
5. Check for partial SSTable 000001.tmp -> delete (incomplete flush)
6. Rebuild index from vectors 0-100
7. Ready to accept writes!
Questions to consider:
- What if crash happens during compaction?
- How do you handle concurrent reads during flush?
- What if WAL itself is corrupted?
The Interview Questions They'll Ask
Fundamental Understanding
- "Explain the difference between WAL and journaling in filesystems."
- Answer: Both are write-ahead logs, but:
- Filesystem journaling: logs metadata changes (file renames, permission changes)
- Database WAL: logs data modifications (actual record changes)
- WAL is application-level, journaling is OS-level
- WAL provides logical recovery, journaling provides crash consistency
- "Why use LSM trees instead of B-trees for vectors?"
- LSM advantages:
- Sequential writes (no random I/O)
- Higher write throughput
- Better compression (immutable files)
- B-tree advantages:
- Lower read amplification
- Predictable read latency
- For vectors: LSM wins because ingestion is write-heavy, and vectors are rarely updated in-place
- "What's the difference between size-tiered and leveled compaction?"
- Size-tiered: Merge SSTables of similar size → less I/O, more space
- Leveled: Fixed size levels, compact to next level → more I/O, less space
- Example: RocksDB uses leveled by default, Cassandra uses size-tiered
Design Tradeoffs
- "How do you minimize write amplification?"
- Techniques:
- Larger memtable (fewer flushes)
- Less aggressive compaction (fewer rewrites)
- Compression (write less data)
- Tradeoff: Write amplification vs space amplification
- "What if your WAL grows unbounded?"
- Solutions:
- Checkpoint: flush memtable, mark WAL prefix as safe to delete
- Segment rotation: create new WAL file periodically
- Truncation: delete old segments after checkpoint
- Key: WAL size is bounded by checkpoint frequency
- "How do you handle read-your-own-writes consistency?"
- Problem: Write to WAL, query before flush → vector not in an SSTable yet
- Solution: Check memtable first, then SSTables (layered query)
- Ordering: memtable > newest SSTable > oldest SSTable
Crash Recovery
- "What happens if you crash during compaction?"
- Safe approach: Compaction writes to new SSTables, deletes old ones only after success
- Crash scenario: Old SSTables still exist, new ones partially written
- Recovery: Read manifest → ignore partial SSTables → use old SSTables
- Key: Never delete source files until output is complete
- "How do you detect corrupted WAL entries?"
- Checksums: CRC32 or xxHash on each entry
- On mismatch: Skip entry and log warning, or stop recovery (configurable)
- Best practice: Also checksum the entire segment (detect tail corruption)
Performance
- "How do you optimize fsync latency?"
- Group commit: Batch multiple writes behind a single fsync (see the sketch after this list)
- Async replication: Acknowledge write after replicating to N nodes (donโt wait for fsync)
- Battery-backed write cache: Hardware acceleration (risky!)
- "Why are sequential writes so much faster than random writes?"
- HDD: Sequential access avoids seek time (200 MB/s sequential vs ~1 MB/s random)
- SSD: Wear leveling, internal parallelism optimized for sequential
- OS: Readahead and writeback buffers help sequential I/O
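A Python sketch of the group-commit technique itself (not any particular engine's code): writers append under a lock, and whichever thread reaches the sync step first issues one fsync covering every entry written so far:
import os, threading

class GroupCommitLog:
    def __init__(self, path):
        self.f = open(path, "ab")
        self.write_lock = threading.Lock()
        self.sync_lock = threading.Lock()
        self.written = 0     # entries handed to the OS buffer
        self.synced = 0      # entries covered by some fsync

    def append(self, payload: bytes):
        with self.write_lock:
            self.f.write(payload)
            self.written += 1
            my_seq = self.written
        with self.sync_lock:
            if self.synced >= my_seq:
                return       # a concurrent fsync already covered this entry
            with self.write_lock:
                batch_end = self.written   # everything written so far
            self.f.flush()
            os.fsync(self.f.fileno())      # one fsync covers the whole batch
            self.synced = batch_end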
Hints in Layers
Layer 1: In-memory only (ignore crashes)
struct VectorDB {
vectors: HashMap<u64, Vec<f32>>, // id -> vector
}
impl VectorDB {
fn insert(&mut self, id: u64, vector: Vec<f32>) {
self.vectors.insert(id, vector);
}
fn search(&self, query: Vec<f32>, k: usize) -> Vec<u64> {
// Brute-force search
}
}
Test: Insert 100k vectors, search works, restart → all data lost ✓ (expected: no durability yet)
Layer 2: Add write-ahead log (durability)
struct WALEntry {
seq: u64,
op: Operation,
id: u64,
vector: Vec<f32>,
}
struct VectorDB {
vectors: HashMap<u64, Vec<f32>>,
wal: WALWriter,
}
impl VectorDB {
fn insert(&mut self, id: u64, vector: Vec<f32>) -> Result<()> {
// Write to WAL first!
let entry = WALEntry { seq: self.next_seq(), op: Insert, id, vector: vector.clone() };
self.wal.append(&entry)?;
self.wal.fsync()?; // Durability!
// Now safe to update memory
self.vectors.insert(id, vector);
Ok(())
}
}
fn recover(wal_path: &Path) -> VectorDB {
let mut db = VectorDB::new();
for entry in WALReader::read(wal_path) {
db.vectors.insert(entry.id, entry.vector);
}
db
}
Test: Insert vectors, kill -9, restart → data recovered ✓
Layer 3: Add memtable + flush (reduce WAL size)
struct VectorDB {
memtable: BTreeMap<u64, Vec<f32>>, // Sorted!
sstables: Vec<SSTableReader>,
wal: WALWriter,
}
impl VectorDB {
fn insert(&mut self, id: u64, vector: Vec<f32>) -> Result<()> {
self.wal.append(&entry)?;
self.memtable.insert(id, vector);
        if self.memtable.size() > 64 * 1024 * 1024 { // size() here means tracked bytes, not entry count
self.flush_memtable()?;
}
Ok(())
}
fn flush_memtable(&mut self) -> Result<()> {
let tmp_path = "vectors_temp.sst";
let mut writer = SSTableWriter::new(tmp_path)?;
for (id, vector) in self.memtable.iter() {
writer.write(id, vector)?;
}
writer.finish()?;
// Atomic rename!
rename(tmp_path, "vectors_000000.sst")?;
self.sstables.push(SSTableReader::open("vectors_000000.sst")?);
self.memtable.clear();
self.wal.checkpoint()?; // Safe to truncate WAL now
Ok(())
}
}
Test: Insert 1M vectors → WAL doesn't grow unbounded ✓
Layer 4: Add compaction (reduce read amplification)
impl VectorDB {
fn compact(&mut self) -> Result<()> {
// Size-tiered: merge SSTables of similar size
let to_merge = self.find_mergeable_sstables();
let mut writer = SSTableWriter::new("compacted_temp.sst")?;
let mut heap = BinaryHeap::new(); // K-way merge (wrap entries in std::cmp::Reverse so the smallest key pops first)
// Initialize heap with first entry from each SSTable
for sst in &to_merge {
if let Some(entry) = sst.iter().next() {
heap.push(entry);
}
}
// Merge-sort
while let Some(entry) = heap.pop() {
writer.write(entry.id, entry.vector)?;
if let Some(next) = entry.source.iter().next() {
heap.push(next);
}
}
writer.finish()?;
rename("compacted_temp.sst", "compacted_000000.sst")?;
// Delete old SSTables
for sst in to_merge {
remove_file(sst.path())?;
}
Ok(())
}
}
Test: Insert 1M, delete 500k, compact → disk space reclaimed ✓
Layer 5: Add checksums and error handling (production-ready)
struct WALEntry {
seq: u64,
op: Operation,
id: u64,
vector: Vec<f32>,
checksum: u32, // CRC32
}
impl WALReader {
fn read_entry(&mut self) -> Result<WALEntry> {
let len = self.read_u32()?;
let data = self.read_bytes(len)?;
let checksum = self.read_u32()?;
if crc32(&data) != checksum {
return Err(CorruptedEntry);
}
WALEntry::deserialize(&data)
}
}
Test: Corrupt WAL file → recovery skips bad entries, logs warning ✓
Books That Will Help
Essential Reading
- "Designing Data-Intensive Applications" by Martin Kleppmann
- Chapter 3: Storage and Retrieval (WAL, LSM trees, B-trees)
- Chapter 11: Stream Processing (log-based systems)
- Why: The definitive guide to storage engine design
- "Database Internals" by Alex Petrov
- Part I: Storage Engines (file formats, indexing)
- Chapter 7: Log-Structured Storage (LSM deep dive)
- Chapter 9: Failure Detection (crash recovery)
- Why: Implementation-focused, covers RocksDB and LevelDB internals
Deep Dives
- "The Log-Structured Merge-Tree (LSM-Tree)" - O'Neil et al. (1996)
- The original LSM paper
- Available free online
- Why: Understand the theory behind modern storage engines
- โRocksDB Tuning Guideโ (online documentation)
- Real-world compaction strategies
- Write amplification benchmarks
- Why: Learn from a production LSM implementation
Reference Implementation
- Study LevelDB source code (C++, ~5000 lines)
  - db/log_writer.cc - WAL implementation
  - db/memtable.cc - Skip list memtable
  - table/table_builder.cc - SSTable format
  - Why: Clean, well-commented reference implementation
Learning milestones:
- WAL writes are durable → You understand write-ahead logging
- Crash recovery works → You understand durability guarantees
- Compaction reclaims space → You understand garbage collection
- Performance is acceptable → You've built an efficient storage layer
Project 10: Distributed Vector Search Cluster
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The "Industry Disruptor"
- Difficulty: Level 5: Master
- Knowledge Area: Distributed Systems / Sharding
- Software or Tool: Distributed Vector Database
- Main Book: "Designing Data-Intensive Applications" by Martin Kleppmann
What you'll build: A distributed vector search cluster with sharding, replication, routing, and distributed query execution, turning a single-node index into a scalable system.
Why it teaches distributed systems: No single machine can handle billions of vectors with low latency. Distributed systems introduce new challenges: consistency, partition tolerance, coordination. Understanding these is essential for production systems.
Core challenges you'll face:
- Sharding strategies → maps to distributing vectors across nodes
- Query routing → maps to finding which shards to query
- Result merging → maps to combining results from multiple shards
- Replication → maps to fault tolerance and read scaling
- Consistency → maps to handling concurrent updates
Key Concepts:
- Sharding: "Designing Data-Intensive Applications" Chapter 6 - Kleppmann
- Replication: "Designing Data-Intensive Applications" Chapter 5 - Kleppmann
- Distributed Consensus: Raft algorithm
Difficulty: Master Time estimate: 4-6 weeks Prerequisites: All previous projects, distributed systems experience.
Real world outcome:
$ ./vecdb cluster init --nodes 3 --shards 9 --replicas 2
Initializing distributed cluster:
Nodes: node-0 (coordinator), node-1, node-2
Shards: 9 (3 per node primary, 6 replicas distributed)
Replication factor: 2
Shard distribution:
┌────────┬──────────────────┬──────────────────┐
│ Shard  │ Primary Node     │ Replica Nodes    │
├────────┼──────────────────┼──────────────────┤
│ 0      │ node-0           │ node-1, node-2   │
│ 1      │ node-0           │ node-1, node-2   │
│ 2      │ node-0           │ node-1, node-2   │
│ 3      │ node-1           │ node-0, node-2   │
│ ...    │ ...              │ ...              │
└────────┴──────────────────┴──────────────────┘
Cluster ready!
$ ./vecdb insert --vectors 10000000 --distributed
Inserting 10M vectors across cluster:
Routing: hash(vector_id) mod 9
Progress: [████████████████████] 100%
Time: 45s (222K inserts/sec)
$ ./vecdb search --query query.npy --k 10
Distributed Query Execution:
Step 1: Route query to all 9 shards
Step 2: Parallel search on each shard (local k=30)
Step 3: Merge 270 candidates from 9 shards
Step 4: Re-rank and return top-10
Performance:
Shard latencies: [1.2ms, 1.4ms, 1.1ms, 1.3ms, 1.5ms, 1.2ms, 1.0ms, 1.4ms, 1.3ms]
Merge time: 0.3ms
Total: 1.8ms (limited by slowest shard + merge)
$ ./vecdb failover --kill node-1
Simulating node-1 failure...
Detecting failure: 2.1s (heartbeat timeout)
Promoting replicas:
Shard 3: node-0 promoted to primary
Shard 4: node-2 promoted to primary
Shard 5: node-0 promoted to primary
Cluster healthy (degraded: missing replicas)
$ ./vecdb search --query query.npy --k 10
Search still works! (using remaining nodes)
Latency: 2.1ms (slightly higher due to fewer nodes)
Implementation Hints:
Cluster architecture:
┌──────────────────────────────────────────────────────────────┐
│                         Coordinator                          │
│       (Query routing, shard map, cluster membership)         │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│   │   Node-0    │    │   Node-1    │    │   Node-2    │      │
│   │             │    │             │    │             │      │
│   │ Shard 0 (P) │    │ Shard 3 (P) │    │ Shard 6 (P) │      │
│   │ Shard 1 (P) │    │ Shard 4 (P) │    │ Shard 7 (P) │      │
│   │ Shard 2 (P) │    │ Shard 5 (P) │    │ Shard 8 (P) │      │
│   │ Shard 3 (R) │    │ Shard 0 (R) │    │ Shard 0 (R) │      │
│   │ Shard 4 (R) │    │ Shard 1 (R) │    │ Shard 1 (R) │      │
│   │     ...     │    │     ...     │    │     ...     │      │
│   └─────────────┘    └─────────────┘    └─────────────┘      │
│                                                              │
└──────────────────────────────────────────────────────────────┘
P = Primary, R = Replica
Sharding:
type ShardRouter struct {
numShards int
shardMap map[int][]NodeAddress // shard -> [primary, replicas]
}
func (r *ShardRouter) GetShard(vectorID uint64) int {
return int(vectorID % uint64(r.numShards))
}
func (r *ShardRouter) RouteInsert(vectorID uint64, vector []float32) error {
shardID := r.GetShard(vectorID)
primary := r.shardMap[shardID][0]
// Send to primary (it will replicate)
return r.sendInsert(primary, vectorID, vector)
}
func (r *ShardRouter) RouteSearch(query []float32, k int) ([]SearchResult, error) {
// Search all shards in parallel
results := make(chan []SearchResult, r.numShards)
errors := make(chan error, r.numShards)
for shardID := 0; shardID < r.numShards; shardID++ {
go func(sid int) {
node := r.pickNode(sid) // Primary or replica
res, err := r.sendSearch(node, query, k*3) // Over-fetch for merge
if err != nil {
errors <- err
return
}
results <- res
}(shardID)
}
// Collect and merge
allResults := []SearchResult{}
for i := 0; i < r.numShards; i++ {
select {
case res := <-results:
allResults = append(allResults, res...)
case err := <-errors:
return nil, err
}
}
// Sort by distance and take top-k
sort.Slice(allResults, func(i, j int) bool {
return allResults[i].Distance < allResults[j].Distance
})
if len(allResults) > k {
allResults = allResults[:k]
}
return allResults, nil
}
Replication:
type ReplicationManager struct {
primary *Shard
replicas []*RemoteReplica
}
func (rm *ReplicationManager) Insert(id uint64, vector []float32) error {
// Write to primary first
if err := rm.primary.Insert(id, vector); err != nil {
return err
}
// Replicate asynchronously (eventual consistency)
// Or synchronously (strong consistency)
for _, replica := range rm.replicas {
go replica.Replicate(id, vector)
}
return nil
}
Failure detection:
func (c *Cluster) monitorHealth() {
ticker := time.NewTicker(1 * time.Second)
for range ticker.C {
for _, node := range c.nodes {
if !c.ping(node) {
node.missedHeartbeats++
if node.missedHeartbeats > 3 {
c.handleNodeFailure(node)
}
} else {
node.missedHeartbeats = 0
}
}
}
}
func (c *Cluster) handleNodeFailure(failed *Node) {
log.Printf("Node %s failed, promoting replicas", failed.ID)
for _, shard := range failed.PrimaryShards {
// Find a replica to promote
replica := c.findReplica(shard)
if replica != nil {
c.promoteToPrimary(replica, shard)
}
}
c.updateShardMap()
c.notifyClients()
}
The Core Question You're Answering
"How do we scale vector search across machines while maintaining consistency?"
This isn't about making a single node faster; that's optimization. This is about fundamental distributed systems questions:
- How do we partition billions of vectors across nodes so queries stay fast?
- How do we ensure writes are durable and consistent across replicas?
- How do we handle node failures without losing data or availability?
- How do we route queries to the right shards and merge results correctly?
The single biggest misconception: "Distributed systems are just running multiple copies." Wrong. Distribution introduces coordination overhead, network partitions, partial failures, and consistency challenges that don't exist in single-node systems.
What makes this hard: You're balancing the CAP theorem tradeoffs. Strong consistency means higher latency and lower availability. Eventual consistency means dealing with stale reads and conflict resolution. There's no perfect answer, only informed tradeoffs.
Concepts You Must Understand First
Before you write distributed code, you need mental models for:
- Sharding vs Replication
- Sharding = Horizontal partitioning. Split data across nodes to distribute load.
- Replication = Copy data to multiple nodes for fault tolerance and read scaling.
- You need BOTH: shard for capacity, replicate for availability.
- Consistency Models
- Strong consistency: All replicas agree before acknowledging writes (slow, safe)
- Eventual consistency: Replicas converge over time (fast, risky)
  - Quorum reads/writes: W + R > N guarantees consistency (W = write acks required, R = replicas consulted per read, N = total replicas). Because W + R > N, every read set overlaps every write set, so at least one consulted replica holds the latest write.
- CAP Theorem
- Consistency: All nodes see the same data
- Availability: Every request gets a response
- Partition tolerance: System works despite network failures
  - You can only guarantee 2. In distributed systems, partitions WILL happen, so the real choice is CP or AP.
- Partitioning Strategies
  - Hash-based sharding: hash(key) % num_shards → Uniform distribution, but no range queries
  - Range-based sharding: Assign key ranges to shards → Supports range queries, but risk of hot shards
  - Consistent hashing: Virtual nodes minimize resharding when nodes are added/removed (see the sketch after this list)
- Distributed Consensus
- Raft/Paxos: Algorithms for agreeing on state across nodes
- Leader election: One node coordinates writes
  - Log replication: Leader's log is replicated to followers
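To make the consistent-hashing bullet concrete, here is a minimal Go sketch of a hash ring with virtual nodes. The names (`Ring`, `AddNode`, `Locate`) and the FNV hash are illustrative choices for this guide, not part of any project code above.

```go
import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	vnodes int               // virtual nodes per physical node
	hashes []uint32          // sorted vnode positions on the ring
	owners map[uint32]string // vnode position -> physical node
}

func NewRing(vnodes int) *Ring {
	return &Ring{vnodes: vnodes, owners: map[uint32]string{}}
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// AddNode places `vnodes` copies of the node on the ring so load spreads evenly.
func (r *Ring) AddNode(node string) {
	for i := 0; i < r.vnodes; i++ {
		pos := hash32(fmt.Sprintf("%s#%d", node, i))
		r.owners[pos] = node
		r.hashes = append(r.hashes, pos)
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
}

// Locate returns the node owning a key: the first vnode clockwise from the key's hash.
func (r *Ring) Locate(key string) string {
	pos := hash32(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= pos })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.owners[r.hashes[i]]
}
```

The payoff: adding a node only reassigns the keys that fall between its vnode positions and their predecessors, so roughly 1/N of keys move, instead of nearly all of them as with hash(key) % N.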
Study these chapters:
- "Designing Data-Intensive Applications" Chapter 5 (Replication)
- "Designing Data-Intensive Applications" Chapter 6 (Partitioning)
- "Designing Data-Intensive Applications" Chapter 9 (Consistency and Consensus)
- Raft paper: "In Search of an Understandable Consensus Algorithm"
Questions to Guide Your Design
Before writing code, answer these architectural questions:
Sharding:
- What's your shard key? (vector ID? category? timestamp?)
- How many shards? (Too few = hot spots. Too many = coordination overhead.)
- Hash-based or range-based partitioning?
- How do you handle shard rebalancing when nodes are added/removed?
Replication:
- What's your replication factor? (RF=3 is common: 1 primary + 2 replicas)
- Synchronous or asynchronous replication?
- How do you handle replication lag?
- What happens if a replica falls behind?
Consistency:
- Strong or eventual consistency?
- What are your quorum sizes? (e.g., W=2, R=2, N=3)
- How do you handle write conflicts?
- Do you need read-after-write consistency?
Query Routing:
- How does the client know which shard to query?
- Option A: Client-side routing (client knows shard map)
- Option B: Coordinator routing (coordinator routes to shards)
- Option C: Random node routing (any node can forward)
- For vector search, do you query ALL shards or use a routing hint?
- How do you merge results from multiple shards?
Failure Handling:
- How do you detect node failures? (Heartbeats? Gossip protocol?)
- How do you handle a primary failure? (Promote a replica? Re-elect leader?)
- What happens during a network partition?
- How do you prevent split-brain scenarios?
Thinking Exercise
Trace a query through the distributed cluster:
Client sends: search(query_vector, k=10)
Step 1: Router receives query
  → Looks up shard map (9 shards, RF=3)
  → Decision: Query ALL shards (can't route ANN search to subset)
Step 2: Router fans out to 9 shards
  → Sends search(query, k=30) to each shard (over-fetch for accurate merge)
  → Sends to PRIMARY or REPLICA? (Read from replica for load balancing)
Step 3: Each shard searches locally
  → Shard 0: finds 30 nearest neighbors
  → Shard 1: finds 30 nearest neighbors
  → ... (9 parallel searches)
Step 4: Router collects 270 results (9 shards × 30 results)
  → Merges by distance
  → Returns top-10
Total latency = max(shard_latencies) + merge_time + network_overhead (with the sample run above: 1.5ms + 0.3ms ≈ 1.8ms)
Now trace a write:
Client sends: insert(vector_id=12345, vector=[0.1, 0.2, ...])
Step 1: Router determines shard
  → shard_id = hash(12345) % 9 = 3
  → Looks up shard 3: primary=node-1, replicas=[node-0, node-2]
Step 2: Router sends to primary (node-1)
  → Primary writes to its local index
  → Primary replicates to 2 replicas
Step 3: Replication strategy (choose one):
  A) Synchronous: Wait for W=2 replicas to acknowledge (slow, safe)
  B) Asynchronous: Acknowledge immediately, replicate in background (fast, risky)
Step 4: Router acknowledges to client
  → If synchronous: only after quorum
  → If async: immediately after primary write
Failure scenario:
Node-1 fails (it was primary for shards 3, 4, 5)
Step 1: Failure detection
  → Heartbeat monitor notices node-1 missed 3 consecutive pings
  → Marks node-1 as DEAD
Step 2: Replica promotion
  → For shard 3: promote node-0's replica to primary
  → For shard 4: promote node-2's replica to primary
  → For shard 5: promote node-0's replica to primary
Step 3: Update shard map
  → Router updates routing table
  → Clients see new primaries
Step 4: Cluster is now degraded
  → RF=2 instead of RF=3 (missing one replica)
  → Background process creates new replicas to restore RF=3
The Interview Questions They'll Ask
Q1: "What's the difference between sharding and replication?"
Bad answer: "Sharding is horizontal scaling, replication is copying data."
Good answer: "Sharding partitions data across nodes to distribute LOAD; it's about capacity. Replication copies data to multiple nodes for FAULT TOLERANCE; it's about availability. You need both: shard to handle more data than one node can hold, replicate so a node failure doesn't lose data. For example, with 9 shards and RF=3, each shard exists on 3 nodes. Total nodes = 3, with each node holding 3 primary shards + 6 replica shards."
Q2: "How would you shard a vector database?"
Bad answer: "Just hash the vector ID and mod by number of shards."
Good answer: "It depends on the query pattern. For pure ANN search where you must query all vectors, hash-based sharding by vector ID works well: it distributes uniformly and is simple. But if you have metadata filters (e.g., 'category=images'), you might shard by category to enable shard pruning, querying only relevant shards. The tradeoff: category-based sharding risks hot shards if one category dominates. In practice, hybrid approaches work: hash-shard within category, or use consistent hashing with virtual nodes to balance load."
Q3: "Strong consistency or eventual consistency for vector search?"
Bad answer: "Strong consistency because we want correct results."
Good answer: "It depends on the use case. For vector search specifically, eventual consistency is often acceptable because:
- Vector embeddings rarely change; they're mostly append-only
- Slight staleness (a few seconds) doesn't break user experience
- Strong consistency adds latency (wait for quorum), which hurts search UX
However, for METADATA updates (e.g., deleting a user's data for GDPR), you need strong consistency: you can't show deleted data. So hybrid: eventual consistency for inserts, strong consistency for deletes. This is what Pinecone and Weaviate do."
Q4: "How do you handle merging results from multiple shards?"
Bad answer: "Just sort all results by distance and take top-k."
Good answer: "That works, but you should over-fetch from each shard. With exact per-shard search, merging each shard's top-k is already sufficient, but production shards run approximate search, so each one may miss some of its true top-k. Solution: fetch top-k×C from each shard (C is a safety factor, typically 2-3), then merge and re-rank. Example: for k=10, fetch top-30 from each of 9 shards (270 candidates), merge by distance, return top-10. The tradeoff: larger C = better recall but more data transferred. Production systems tune C based on data distribution."
Q5: "What happens during a network partition?"
Bad answer: "The system stops working."
Good answer: "It depends on your CAP choice. If you chose CP (consistency + partition tolerance), you sacrifice availability: nodes in the minority partition refuse writes to prevent split-brain. If you chose AP (availability + partition tolerance), you sacrifice consistency: both partitions accept writes, and you resolve conflicts later (e.g., last-write-wins, vector timestamps). For vector search, AP is often better; serving slightly stale results beats being unavailable. But you need a reconciliation strategy when the partition heals."
Q6: "How does leader election work in Raft?"
Bad answer: "Nodes vote for a leader."
Good answer: "Raft uses term-based elections. When a follower's heartbeat timeout expires, it increments its term, transitions to CANDIDATE, votes for itself, and requests votes from other nodes. A candidate wins if it receives votes from a majority (quorum). Once elected, the leader sends heartbeats to prevent new elections. Key details: randomized timeouts prevent split votes, the log completeness rule ensures the leader has the most up-to-date log, and higher term numbers always win. This guarantees at most one leader per term."
Hints in Layers
Layer 1: Single-Node Baseline (Week 1)
Start simple. Build a single-node vector database with HNSW index (from Project 5). Wrap it in a network service:
type VectorNode struct {
index *HNSW
}
func (n *VectorNode) Insert(ctx context.Context, req *InsertRequest) error {
return n.index.Insert(req.VectorID, req.Vector)
}
func (n *VectorNode) Search(ctx context.Context, req *SearchRequest) ([]*SearchResult, error) {
return n.index.Search(req.Query, req.K)
}
func main() {
node := NewVectorNode()
    // Pseudocode: with gRPC-Go you would create a grpc.Server, register the
    // VectorNode service, and call Serve on a TCP listener
    grpc.Serve(":50051", node)
}
Why start here: You need a working single-node system before distributing it. This baseline lets you measure distributed overhead later.
Layer 2: Add Sharding (Week 2)
Partition data across multiple nodes. Implement hash-based sharding:
type ShardRouter struct {
nodes []*NodeClient // RPC clients to nodes
numShards int
}
func (r *ShardRouter) GetShardID(vectorID uint64) int {
return int(vectorID % uint64(r.numShards))
}
func (r *ShardRouter) GetNode(shardID int) *NodeClient {
// Assuming equal shard distribution
return r.nodes[shardID % len(r.nodes)]
}
func (r *ShardRouter) Insert(vectorID uint64, vector []float32) error {
shardID := r.GetShardID(vectorID)
node := r.GetNode(shardID)
return node.Insert(vectorID, vector)
}
func (r *ShardRouter) Search(query []float32, k int) ([]*SearchResult, error) {
// Fan out to ALL shards (can't route ANN search)
resultsChan := make(chan []*SearchResult, len(r.nodes))
for _, node := range r.nodes {
go func(n *NodeClient) {
            res, _ := n.Search(query, k*3) // Over-fetch; errors ignored to keep the sketch short
resultsChan <- res
}(node)
}
// Merge results
var allResults []*SearchResult
for i := 0; i < len(r.nodes); i++ {
allResults = append(allResults, <-resultsChan...)
}
    sort.Slice(allResults, func(i, j int) bool {
        return allResults[i].Distance < allResults[j].Distance
    })
    if len(allResults) > k {
        allResults = allResults[:k] // Guard: don't slice past the end when shards return fewer than k
    }
    return allResults, nil
}
Test it: Insert 1M vectors, verify they're distributed across shards. Run searches, verify results are correct.
Key insight: Sharding reduces per-node load but INCREASES query complexity (must query all shards).
Layer 3: Add Replication (Week 3)
Each shard now has replicas. Implement synchronous replication with quorum writes:
type ReplicatedShard struct {
primary *NodeClient
replicas []*NodeClient
W int // Write quorum
R int // Read quorum
}
func (s *ReplicatedShard) Insert(vectorID uint64, vector []float32) error {
// Send to all replicas
ackChan := make(chan error, len(s.replicas)+1)
// Write to primary
go func() {
ackChan <- s.primary.Insert(vectorID, vector)
}()
// Write to replicas
for _, replica := range s.replicas {
go func(r *NodeClient) {
ackChan <- r.Insert(vectorID, vector)
}(replica)
}
// Wait for W acknowledgments
acks := 0
for i := 0; i < len(s.replicas)+1; i++ {
if <-ackChan == nil {
acks++
if acks >= s.W {
return nil // Quorum reached
}
}
}
return errors.New("quorum not reached")
}
Test it: Kill a replica mid-write, verify quorum still succeeds. Kill N-W+1 nodes (so fewer than W remain), verify writes fail (correct behavior).
Key insight: Replication trades latency for durability. W=1 is fast but risky. W=N is safe but slow.
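The ReplicatedShard above declares a read quorum R but never uses it. A companion quorum read could look like this sketch: `Get` on NodeClient and the `VersionedValue` shape are assumptions for illustration.

```go
// VersionedValue pairs a vector with a monotonically increasing sequence
// number assigned by the primary on each write.
type VersionedValue struct {
	Vector []float32
	Seq    uint64
}

func (s *ReplicatedShard) Get(vectorID uint64) ([]float32, error) {
	nodes := append([]*NodeClient{s.primary}, s.replicas...)
	respChan := make(chan VersionedValue, len(nodes))
	for _, n := range nodes {
		go func(n *NodeClient) {
			if v, err := n.Get(vectorID); err == nil {
				respChan <- v
			}
		}(n)
	}
	// Wait for R responses and return the freshest; because W + R > N,
	// at least one of them reflects the latest acknowledged write.
	// A production version needs a timeout and an error path for the
	// case where fewer than R replicas ever respond.
	var freshest VersionedValue
	for i := 0; i < s.R; i++ {
		v := <-respChan
		if v.Seq >= freshest.Seq {
			freshest = v
		}
	}
	return freshest.Vector, nil
}
```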
Layer 4: Add Failure Detection (Week 4)
Implement heartbeat-based failure detection:
type HealthMonitor struct {
nodes []*Node
heartbeatTicker *time.Ticker
}
func (h *HealthMonitor) Start() {
h.heartbeatTicker = time.NewTicker(1 * time.Second)
go func() {
for range h.heartbeatTicker.C {
for _, node := range h.nodes {
if err := node.Ping(); err != nil {
node.FailureCount++
if node.FailureCount >= 3 {
h.handleFailure(node)
}
} else {
node.FailureCount = 0
}
}
}
}()
}
func (h *HealthMonitor) handleFailure(failed *Node) {
log.Printf("Node %s failed", failed.ID)
// Promote replicas to primary
for _, shard := range failed.PrimaryShards {
replica := h.findHealthyReplica(shard)
h.promoteToPrimary(replica, shard)
}
// Update routing table
h.updateShardMap()
}
Test it: Kill a primary node, verify replica promotion. Kill multiple nodes, verify the system degrades gracefully.
Key insight: Failure detection has false positives (slow network != dead node). Tune timeouts carefully.
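One way to reduce those false positives is to bound each probe with an explicit deadline instead of letting a slow RPC count as a miss. A minimal sketch, assuming the node client exposes a context-aware `Ping` (not defined above):

```go
import (
	"context"
	"time"
)

// probe returns true if the node answered within the deadline; a slow
// response increments the miss counter rather than blocking the monitor loop.
func (h *HealthMonitor) probe(node *Node) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	return node.Client.Ping(ctx) == nil
}
```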
Layer 5: Add Consensus (Weeks 5-6)
Implement Raft for consistent shard map updates and leader election. This is the hardest part.
Option A: Use an existing Raft library (etcd's raft, Hashicorp's raft).
Option B: Implement Raft yourself (educational but time-consuming).
// Using Hashicorp's raft library
func (c *Cluster) InitRaft() error {
config := raft.DefaultConfig()
config.LocalID = raft.ServerID(c.nodeID)
store := raft.NewInmemStore()
snapshots := raft.NewInmemSnapshotStore()
transport, _ := raft.NewTCPTransport(c.bindAddr, nil, 3, 10*time.Second, os.Stderr)
r, err := raft.NewRaft(config, c, store, store, snapshots, transport)
if err != nil {
return err
}
c.raft = r
return nil
}
// Apply shard map updates through Raft
func (c *Cluster) UpdateShardMap(update *ShardMapUpdate) error {
if c.raft.State() != raft.Leader {
return c.forwardToLeader(update)
}
data, _ := json.Marshal(update)
future := c.raft.Apply(data, 10*time.Second)
return future.Error()
}
Test it: Kill the leader, verify a new leader is elected. Make writes during leader election, verify they're queued and applied after election.
Key insight: Consensus is expensive. Only use it for critical state (shard map, membership), not every write.
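One piece the snippet above glosses over: raft.NewRaft receives the cluster as its FSM, so the cluster must implement hashicorp/raft's FSM interface (Apply, Snapshot, Restore). A minimal sketch, assuming the ShardMapUpdate fields shown below:

```go
import (
	"encoding/json"
	"io"
	"sync"

	"github.com/hashicorp/raft"
)

// ShardMapUpdate fields are assumptions for illustration.
type ShardMapUpdate struct {
	ShardID int
	Nodes   []string // primary first, then replicas
}

type clusterFSM struct {
	mu       sync.Mutex
	shardMap map[int][]string
}

func newClusterFSM() *clusterFSM {
	return &clusterFSM{shardMap: map[int][]string{}}
}

// Apply is invoked once a log entry is committed on a quorum of nodes.
func (f *clusterFSM) Apply(l *raft.Log) interface{} {
	var u ShardMapUpdate
	if err := json.Unmarshal(l.Data, &u); err != nil {
		return err
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	f.shardMap[u.ShardID] = u.Nodes
	return nil
}

// Snapshot/Restore let Raft compact its log by checkpointing the shard map.
func (f *clusterFSM) Snapshot() (raft.FSMSnapshot, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	data, err := json.Marshal(f.shardMap)
	return &mapSnapshot{data: data}, err
}

func (f *clusterFSM) Restore(rc io.ReadCloser) error {
	defer rc.Close()
	return json.NewDecoder(rc).Decode(&f.shardMap)
}

type mapSnapshot struct{ data []byte }

func (s *mapSnapshot) Persist(sink raft.SnapshotSink) error {
	if _, err := sink.Write(s.data); err != nil {
		sink.Cancel()
		return err
	}
	return sink.Close()
}
func (s *mapSnapshot) Release() {}
```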
Books That Will Help
- "Designing Data-Intensive Applications" by Martin Kleppmann
  - Chapter 5: Replication → Covers leader-follower replication, multi-leader, leaderless (Dynamo-style)
  - Chapter 6: Partitioning → Covers sharding strategies, rebalancing, routing
  - Chapter 7: Transactions → Covers consistency guarantees
  - Chapter 8: The Trouble with Distributed Systems → Covers network faults, clocks, Byzantine faults
  - Chapter 9: Consistency and Consensus → Covers linearizability, CAP theorem, Raft/Paxos
  - This is THE book for distributed systems. Read chapters 5-9 before starting this project.
- "Database Internals" by Alex Petrov
  - Part II: Distributed Systems → Covers failure detection, leader election, replication, anti-entropy
  - More implementation-focused than Kleppmann.
- Raft Paper: "In Search of an Understandable Consensus Algorithm"
  - Original Raft paper by Diego Ongaro and John Ousterhout
  - 16 pages, very readable (unlike Paxos)
  - Available at: https://raft.github.io/raft.pdf
- "Distributed Systems" by Maarten van Steen and Andrew S. Tanenbaum
  - Chapter 6: Synchronization → Covers clocks, mutual exclusion
  - Chapter 7: Consistency and Replication → Covers consistency models
  - More academic, but comprehensive.
- Raft Visualization
- Interactive visualization: https://raft.github.io/
- Play with leader election, log replication, network partitions
- Real-world systems to study:
- Cassandra: Leaderless, eventual consistency, tunable quorums
- MongoDB: Leader-based, strong consistency, automatic sharding
- Elasticsearch: Sharding, replication, Lucene-based (similar to vector DBs)
- Pinecone/Weaviate: Production vector databases (study their architecture blogs)
Reading plan:
- Week 1: DDIA Chapters 5-6 (Replication, Partitioning)
- Week 2: DDIA Chapters 8-9 (Distributed Systems, Consensus)
- Week 3: Raft paper + visualization
- Weeks 4-6: Build while referencing "Database Internals" Part II
Learning milestones:
- Sharding distributes load → You understand partitioning
- Parallel search works → You understand distributed query execution
- Failover promotes replicas → You understand fault tolerance
- Consistency is maintained → You understand distributed state
Capstone Project: Full-Featured Vector Database
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Rust
- Alternative Programming Languages: C++, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The "Industry Disruptor"
- Difficulty: Level 5: Master
- Knowledge Area: Complete Database System
- Software or Tool: Production Vector Database
- Main Book: All previous + "Database Internals" by Alex Petrov
What you'll build: A complete, production-ready vector database with: HNSW and IVF-PQ indexes, durable storage, metadata filtering, distributed clustering, gRPC API, and client SDKs.
Why this is the ultimate test: This combines everything: algorithms, storage, networking, distributed systems, and API design. Building a complete vector database proves you truly understand how these systems work.
Core challenges you'll face:
- Integrating all components → maps to system design
- API design → maps to user experience
- Performance tuning → maps to profiling and optimization
- Operational features → maps to monitoring, backups, upgrades
Key Concepts:
- Everything from previous projects
- API Design: gRPC best practices
- Operations: Observability, deployment
Difficulty: Master Time estimate: 3-6 months Prerequisites: All previous projects completed.
Real world outcome:
$ vecdb-server --config production.yaml
╔════════════════════════════════════════════════════════════╗
║                     VecDB Server v1.0                      ║
╠════════════════════════════════════════════════════════════╣
║ Configuration:                                             ║
║   Storage: /data/vecdb                                     ║
║   Index: HNSW (M=32, ef=200)                               ║
║   Shards: 12 (distributed across 4 nodes)                  ║
║   Replicas: 2                                              ║
║                                                            ║
║ Endpoints:                                                 ║
║   gRPC: 0.0.0.0:6333                                       ║
║   HTTP: 0.0.0.0:6334                                       ║
║   Metrics: 0.0.0.0:9090/metrics                            ║
║                                                            ║
║ Status: Ready to accept connections                        ║
╚════════════════════════════════════════════════════════════╝
# Python client usage
from vecdb import VecDBClient
client = VecDBClient("localhost:6333")
# Create collection
client.create_collection(
name="products",
dimension=768,
metric="cosine",
index_config={
"type": "hnsw",
"m": 32,
"ef_construction": 200
}
)
# Insert vectors
client.upsert(
collection="products",
vectors=[
{"id": "prod_001", "vector": [...], "metadata": {"category": "electronics", "price": 99.99}},
{"id": "prod_002", "vector": [...], "metadata": {"category": "clothing", "price": 49.99}},
...
]
)
# Search with filters
results = client.search(
collection="products",
vector=query_embedding,
limit=10,
filter={"category": "electronics", "price": {"$lt": 100}}
)
# Results
[
{"id": "prod_001", "score": 0.95, "metadata": {...}},
{"id": "prod_042", "score": 0.91, "metadata": {...}},
...
]
Implementation Hints:
System architecture:
┌──────────────────────────────────────────────────────────────┐
│                         Client SDKs                          │
│                (Python, JavaScript, Rust, Go)                │
├──────────────────────────────────────────────────────────────┤
│                   API Layer (gRPC + HTTP)                    │
│   ┌────────────┬────────────┬────────────┬────────────┐      │
│   │   Insert   │   Search   │   Delete   │   Admin    │      │
│   └────────────┴────────────┴────────────┴────────────┘      │
├──────────────────────────────────────────────────────────────┤
│                       Query Processor                        │
│   ┌────────────┬────────────┬────────────┐                   │
│   │  Planner   │ Optimizer  │  Executor  │                   │
│   └────────────┴────────────┴────────────┘                   │
├──────────────────────────────────────────────────────────────┤
│                         Index Layer                          │
│   ┌────────────┬────────────┬────────────┐                   │
│   │    HNSW    │   IVF-PQ   │  Metadata  │                   │
│   └────────────┴────────────┴────────────┘                   │
├──────────────────────────────────────────────────────────────┤
│                        Storage Layer                         │
│   ┌────────────┬────────────┬────────────┐                   │
│   │    WAL     │  Vectors   │   Index    │                   │
│   └────────────┴────────────┴────────────┘                   │
├──────────────────────────────────────────────────────────────┤
│                      Distributed Layer                       │
│   ┌────────────┬────────────┬────────────┐                   │
│   │  Sharding  │Replication │ Consensus  │                   │
│   └────────────┴────────────┴────────────┘                   │
└──────────────────────────────────────────────────────────────┘
gRPC API definition:
service VecDB {
rpc CreateCollection(CreateCollectionRequest) returns (CreateCollectionResponse);
rpc Upsert(UpsertRequest) returns (UpsertResponse);
rpc Search(SearchRequest) returns (SearchResponse);
rpc Delete(DeleteRequest) returns (DeleteResponse);
rpc GetCollection(GetCollectionRequest) returns (GetCollectionResponse);
}
message SearchRequest {
string collection = 1;
repeated float vector = 2;
uint32 limit = 3;
Filter filter = 4;
SearchParams params = 5;
}
message SearchResponse {
repeated SearchResult results = 1;
uint64 searched_vectors = 2;
float latency_ms = 3;
}
Collection management:
struct Collection {
name: String,
dimension: usize,
metric: DistanceMetric,
// Index
index: Box<dyn VectorIndex>,
// Storage
storage: StorageEngine,
// Metadata
metadata_index: MetadataIndex,
// Configuration
config: CollectionConfig,
}
impl Collection {
fn upsert(&mut self, vectors: Vec<VectorRecord>) -> Result<()> {
// Write to WAL first
self.storage.wal_append(&vectors)?;
// Update index
for record in &vectors {
self.index.insert(record.id, &record.vector)?;
self.metadata_index.insert(record.id, &record.metadata)?;
}
// Flush periodically
if self.should_flush() {
self.flush()?;
}
Ok(())
}
fn search(&self, query: &[f32], k: usize, filter: Option<Filter>) -> Result<Vec<SearchResult>> {
let valid_ids = match filter {
Some(f) => Some(self.metadata_index.filter(&f)?),
None => None,
};
self.index.search(query, k, valid_ids)
}
}
Metrics and monitoring:
struct Metrics {
insert_latency: Histogram,
search_latency: Histogram,
vectors_total: Gauge,
index_size_bytes: Gauge,
queries_per_second: Counter,
}
impl Metrics {
fn record_search(&self, latency: Duration, results: usize) {
self.search_latency.observe(latency.as_secs_f64());
self.queries_per_second.inc();
}
}
Learning milestones:
- Basic CRUD works → Core functionality complete
- Filtered search works → Advanced querying done
- Distributed mode works → Scalability achieved
- Production deployed → Operational readiness
- Handles real workloads → Production quality
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Brute Force k-NN | Beginner | Weekend | ⭐⭐ | ⭐⭐⭐ |
| 2. LSH | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3. IVF | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 4. Product Quantization | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. HNSW | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 6. IVF-PQ | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. DiskANN-style | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 8. Filtered Search | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Storage Engine | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. Distributed Cluster | Master | 4-6 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Capstone: Full DB | Master | 3-6 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Recommended Learning Path
Path A: The Complete Journey (6-12 months)
- Weeks 1-2: Projects 1-2 (Fundamentals + LSH)
- Weeks 3-5: Projects 3-4 (IVF + PQ)
- Weeks 6-9: Project 5 (HNSW - the key algorithm)
- Weeks 10-12: Project 6 (IVF-PQ composite)
- Weeks 13-16: Projects 7-8 (Disk-based + Filtering)
- Weeks 17-20: Project 9 (Storage Engine)
- Weeks 21-26: Project 10 (Distributed)
- Months 7-12: Capstone
Path B: Practical Focus (2-3 months)
If you want to understand production systems quickly:
1 → 3 → 5 → 8 → (use existing storage/distribution)
This covers: brute force baseline, IVF, HNSW, and filtering, enough to work with any production vector database.
Path C: Algorithm Deep Dive (2-3 months)
If you want to understand the algorithms deeply:
1 → 2 → 3 → 4 → 5 → 6
This covers all the major ANN algorithms without the systems complexity.
Recommendation Based on Your Goals
If you're building RAG applications: Do Projects 1, 3, 5, 8. You'll understand how to choose and tune indexes.
If you want to contribute to vector DBs: Do the full path. Open-source projects like Qdrant and Milvus need contributors who understand internals.
If you're researching ANN algorithms: Focus on Projects 2-6, read the original papers, benchmark against ann-benchmarks.com.
Essential Resources
Papers
- HNSW Paper - Malkov & Yashunin
- Product Quantization Paper - Jégou, Douze, Schmid
- DiskANN Paper - Microsoft Research
- Billion-scale ANN - FAISS paper
Tutorials and Guides
- Pinecone Learning Center - Excellent visual explanations
- Zilliz Learn - Milvus team's tutorials
- FAISS Wiki - Official FAISS documentation
Benchmarks
- ANN-Benchmarks - Standard benchmark for ANN algorithms
- Big-ANN Benchmarks - Billion-scale benchmarks
Books
- "Designing Data-Intensive Applications" by Martin Kleppmann
- "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman
- "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
Libraries to Study
- FAISS - Facebook's library (C++ with Python bindings)
- hnswlib - Original HNSW implementation
- Qdrant - Rust vector database
- Milvus - Distributed vector database
Summary
| # | Project Name | Main Language |
|---|---|---|
| 1 | Brute Force k-NN and Distance Metrics | Python |
| 2 | Locality Sensitive Hashing (LSH) | Python |
| 3 | Inverted File Index (IVF) | Python |
| 4 | Product Quantization (PQ) | Python |
| 5 | HNSW Index | Python |
| 6 | IVF-PQ Composite Index | Python |
| 7 | Disk-Based Vector Index (DiskANN-style) | C++ |
| 8 | Metadata Filtering with Vector Search | Python |
| 9 | Vector Database Storage Engine | Rust |
| 10 | Distributed Vector Search Cluster | Go |
| Capstone | Full-Featured Vector Database | Rust |
Vector databases are where algorithms meet systems engineering. The best way to understand them is to build one. Start with brute force, feel the pain, then appreciate the elegance of HNSW and the efficiency of IVF-PQ. By the end, you'll know exactly what happens when you call collection.search().