Learn Vector Databases: From Zero to Building Your Own
Goal: Deeply understand how vector databases work internally, from the mathematics of similarity search through indexing algorithms like HNSW and IVF, to building a production-ready vector search engine. Master the technology powering modern AI applications.
Why Vector Databases Matter
Every time you use semantic search, image similarity, recommendation systems, or RAG (Retrieval-Augmented Generation) with LLMs, a vector database is doing the heavy lifting. Traditional databases search by exact match; vector databases search by meaning.
The challenge: How do you efficiently find the 10 most similar items among a billion vectors, each with 1536 dimensions? Brute force would take hours. Vector databases do it in milliseconds.
After completing these projects, you will:
- Understand exactly how similarity search works at scale
- Implement HNSW, IVF, LSH, and other ANN algorithms from scratch
- Know when to use each algorithm and what tradeoffs to make
- Build a complete vector database that handles real workloads
- Appreciate the engineering behind Pinecone, Milvus, Qdrant, and others
Core Concept Analysis
What is a Vector?
In the context of vector databases, a vector (or embedding) is a list of numbers that represents the meaning or features of something:
"The quick brown fox" โ [0.12, -0.34, 0.56, ..., 0.78] (1536 dimensions)
[Image of a cat] โ [0.23, 0.45, -0.67, ..., 0.89] (512 dimensions)
[Product attributes] โ [0.91, -0.23, 0.45, ..., 0.12] (256 dimensions)
These vectors come from ML models (OpenAI embeddings, CLIP, sentence-transformers) that position similar items close together in high-dimensional space.
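To make this concrete, here is a minimal sketch of producing and comparing text embeddings with the sentence-transformers library (the specific model and its 384-dimensional output are illustrative choices, not requirements):

import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")   # outputs 384-dim vectors
texts = ["The quick brown fox", "A fast auburn fox", "Quarterly revenue report"]
emb = model.encode(texts)                         # shape: (3, 384)

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cos(emb[0], emb[1]))  # high: similar meaning
print(cos(emb[0], emb[2]))  # lower: unrelated meaning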
Distance Metrics
How do we measure "closeness" between vectors?
- Euclidean Distance (L2): Straight-line distance in space
  - d(a, b) = √(Σ(aᵢ - bᵢ)²)
  - Good for: absolute position matters
  - Range: [0, ∞)
- Cosine Similarity: Angle between vectors (normalized)
  - cos(a, b) = (a · b) / (||a|| × ||b||)
  - Good for: direction matters, magnitude doesn't
  - Range: [-1, 1] (1 = identical, 0 = orthogonal, -1 = opposite)
- Dot Product (Inner Product): Combines magnitude and direction
  - a · b = Σ(aᵢ × bᵢ)
  - Good for: vectors that are already normalized
  - Range: (-∞, ∞)
- Manhattan Distance (L1): Sum of absolute differences
  - d(a, b) = Σ|aᵢ - bᵢ|
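A small NumPy sketch of all four metrics (function names are my own; the same example vectors reappear in the Project 1 thinking exercise):

import numpy as np

def euclidean(a, b):          # L2: straight-line distance
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):          # L1: sum of absolute differences
    return np.sum(np.abs(a - b))

def dot_product(a, b):        # higher = more similar; not a distance
    return np.dot(a, b)

def cosine_similarity(a, b):  # angle-based, magnitude-invariant
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([3.0, 4.0]), np.array([6.0, 8.0])
print(euclidean(a, b), manhattan(a, b), dot_product(a, b), cosine_similarity(a, b))
# 5.0 7.0 50.0 1.0  -- b is a scaled copy of a, so cosine similarity is exactly 1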
The Curse of Dimensionality
Why is vector search hard?
In high dimensions (100+), counterintuitive things happen:
- All points are "far": The distance between any two random points converges
- Volume explodes: A hypercube with side 1 has volume 1, but almost all that volume is in the "corners"
- Brute force is slow: Comparing a query to 1 billion vectors = 1 billion distance calculations
This is why we need Approximate Nearest Neighbor (ANN) algorithms, trading perfect accuracy for massive speedups.
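You can observe distance concentration directly; a quick illustrative experiment (exact numbers vary by run):

import numpy as np

rng = np.random.default_rng(0)
for dim in [2, 10, 100, 1000]:
    points = rng.random((1000, dim))   # random points in the unit hypercube
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    # As dim grows, max/min falls toward 1: everything is almost equally far away
    print(f"dim={dim:4d}  min={d.min():.3f}  max={d.max():.3f}  max/min={d.max()/d.min():.2f}")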
The ANN Tradeoff
Accuracy (Recall) ◄──────────────────────────► Speed
100%                                          0.001ms

Brute Force (exact)     Too slow
HNSW (graph)            Great balance
IVF (clustering)        Faster, less accurate
LSH (hashing)           Fastest, least accurate
Recall@K: Of the true K nearest neighbors, what fraction did we find?
- Recall@10 = 0.95 means we found 95% of the true top-10 neighbors
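Computing Recall@K is straightforward once exact search gives you ground truth; a minimal sketch:

def recall_at_k(approx_ids, true_ids):
    # Fraction of the true top-K neighbors that the ANN index actually returned
    return len(set(approx_ids) & set(true_ids)) / len(true_ids)

# ANN found 9 of the true top-10 -> 0.9
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 99], list(range(1, 11))))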
Vector Database Architecture
┌───────────────────────────────────────────────────────────┐
│                       Client Layer                        │
│          (REST API, gRPC, SDKs, Query Language)           │
├───────────────────────────────────────────────────────────┤
│                       Query Layer                         │
│   (Query Planning, Filtering, Caching, Result Ranking)    │
├───────────────────────────────────────────────────────────┤
│                       Index Layer                         │
│   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   │
│   │  HNSW   │   │   IVF   │   │   LSH   │   │  Flat   │   │
│   └─────────┘   └─────────┘   └─────────┘   └─────────┘   │
├───────────────────────────────────────────────────────────┤
│                      Storage Layer                        │
│     (Vector Storage, Metadata, WAL, Compression/PQ)       │
├───────────────────────────────────────────────────────────┤
│                    Distributed Layer                      │
│     (Sharding, Replication, Consistency, Routing)         │
└───────────────────────────────────────────────────────────┘
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Vector Embeddings | Vectors are numeric representations of meaning. Similar items map to nearby points in high-dimensional space. Embeddings come from ML models (transformers, CNNs) that compress semantics into fixed-size arrays. |
| Distance Metrics | Similarity = inverse of distance. Euclidean measures absolute position, cosine measures direction (angle), dot product combines both. Choice depends on whether magnitude matters for your domain. |
| Curse of Dimensionality | In high dimensions (100+), all points become "far" from each other, volume concentrates in corners, and brute force becomes impossibly slow. This is why approximate search exists. |
| ANN vs Exact Search | Exact k-NN guarantees finding the true nearest neighbors but doesn't scale. ANN (Approximate Nearest Neighbor) trades accuracy for speed: 99% accuracy at 1000x speedup is the typical tradeoff. |
| LSH (Locality Sensitive Hashing) | Hash similar vectors to same buckets using random projections. Multiple hash tables = OR (higher recall), multiple hashes per table = AND (higher precision). Probabilistic but fast. |
| IVF (Inverted File Index) | Cluster vectors with k-means, then search only nearby clusters. Nprobe controls recall/speed tradeoff. Foundation of FAISS and most production systems. Simple but effective. |
| Product Quantization | Compress vectors by splitting dimensions into subspaces and quantizing each independently. 128D float32 (512 bytes) → 128 uint8 codes (128 bytes) = 4x compression. Lossy but enables billion-scale search. |
| HNSW (Hierarchical NSW) | Multi-layer proximity graph. Upper layers = express highways, lower layers = local streets. Greedy search with backtracking. Best all-around algorithm for accuracy and speed. |
| Index Composition | Real systems combine techniques: IVF for partitioning + PQ for compression + HNSW for graph search. Each layer optimizes a different bottleneck. |
| Metadata Filtering | Vector search + filters (price < $100, category = shoes). Pre-filtering wastes recall, post-filtering wastes compute. Hybrid indexes and filtered IVF are the solutions. |
| Distributed Systems | Sharding for scale (partition by vector ID or space), replication for availability, consistency for correctness. CAP theorem applies. Leader-follower or quorum-based writes. |
| Disk-Based Indexes | RAM is expensive at billion-vector scale. Disk-based indexes use SSD-friendly layouts: compressed vectors, graph edges on disk, selective caching. DiskANN achieves 95%+ recall. |
Deep Dive Reading By Concept
This section maps each concept from above to specific book chapters and papers for deeper understanding. Read these before or alongside the projects to build strong mental models.
Vector Embeddings & Representation Learning
| Concept | Resource |
|---|---|
| What are embeddings and how are they created | Speech and Language Processing by Jurafsky & Martin – Ch. 6: "Vector Semantics and Embeddings" |
| Word2Vec and semantic spaces | Neural Network Methods for Natural Language Processing by Yoav Goldberg – Ch. 10: "Pre-trained Word Representations" |
| Modern transformer embeddings | Natural Language Processing with Transformers by Tunstall, von Werra & Wolf – Ch. 2: "Text Classification" (BERT embeddings section) |
| Image embeddings (CNNs) | Deep Learning by Goodfellow, Bengio & Courville – Ch. 9: "Convolutional Networks" |
| Multimodal embeddings (CLIP) | Paper: "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021) |
Distance Metrics & Similarity Search
| Concept | Resource |
|---|---|
| Distance functions and metric spaces | Mining of Massive Datasets by Leskovec, Rajaraman & Ullman – Ch. 3.5: "Distance Measures" |
| Cosine vs Euclidean: when to use which | Introduction to Information Retrieval by Manning, Raghavan & Schütze – Ch. 6: "Scoring, term weighting, and the vector space model" |
| Dot product for normalized vectors | Foundations of Statistical Natural Language Processing by Manning & Schütze – Ch. 8: "Lexical Semantics" |
| Metric properties (triangle inequality, symmetry) | Pattern Recognition and Machine Learning by Bishop – Ch. 2.5: "Nonparametric Methods" |
Curse of Dimensionality & High-Dimensional Geometry
| Concept | Resource |
|---|---|
| Why high dimensions are different | The Elements of Statistical Learning by Hastie, Tibshirani & Friedman – Ch. 2.5: "Local Methods in High Dimensions" |
| Volume concentration in hyperspheres | Mining of Massive Datasets by Leskovec, Rajaraman & Ullman – Ch. 7: "Clustering" (curse of dimensionality section) |
| Distance concentration phenomenon | Paper: "When Is 'Nearest Neighbor' Meaningful?" (Beyer et al., 1999) |
| Implications for machine learning | Pattern Recognition and Machine Learning by Bishop – Ch. 1.4: "The Curse of Dimensionality" |
Approximate Nearest Neighbor Algorithms
| Concept | Resource |
|---|---|
| ANN survey and taxonomy | Paper: "A Survey on Nearest Neighbor Search Methods" (Wang et al., 2021) |
| Recall/precision/latency tradeoffs | Mining of Massive Datasets by Leskovec, Rajaraman & Ullman – Ch. 3.6: "Locality-Sensitive Hashing for Documents" |
| Benchmarking ANN algorithms | Website: ann-benchmarks.com – comprehensive comparisons |
Locality Sensitive Hashing (LSH)
| Concept | Resource |
|---|---|
| LSH fundamentals and theory | Mining of Massive Datasets by Leskovec, Rajaraman & Ullman – Ch. 3.4: "Locality-Sensitive Hashing" |
| Random hyperplanes for cosine similarity | Paper: "Similarity Estimation Techniques from Rounding Algorithms" (Charikar, 2002) |
| LSH families for different metrics | Foundations of Data Science by Blum, Hopcroft & Kannan – Ch. 2.5: "Locality Sensitive Hashing" |
| Multi-probe LSH | Paper: "Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search" (Lv et al., 2007) |
| Practical LSH implementations | Tutorial: Pinecone LSH Guide |
Clustering & K-Means
| Concept | Resource |
|---|---|
| K-means algorithm and convergence | Pattern Recognition and Machine Learning by Bishop – Ch. 9.1: "K-means Clustering" |
| Choosing K (number of clusters) | The Elements of Statistical Learning by Hastie, Tibshirani & Friedman – Ch. 14.3: "Cluster Analysis" |
| K-means++ initialization | Paper: "k-means++: The Advantages of Careful Seeding" (Arthur & Vassilvitskii, 2007) |
| Mini-batch k-means for large datasets | Scikit-learn documentation on Mini-Batch K-Means |
Inverted File Index (IVF)
| Concept | Resource |
|---|---|
| Inverted indexes in text search | Introduction to Information Retrieval by Manning, Raghavan & Schütze – Ch. 1: "Boolean Retrieval" |
| IVF for vector search | Paper: "Product Quantization for Nearest Neighbor Search" (Jegou et al., 2011) – section on IVF |
| FAISS IVF indexes | FAISS Wiki: Index Types |
| Nprobe tuning strategies | Pinecone blog: "Vector Indexes" series |
Product Quantization (PQ)
| Concept | Resource |
|---|---|
| Vector quantization fundamentals | Pattern Recognition and Machine Learning by Bishop – Ch. 9.2: "Mixtures of Gaussians" (VQ as special case) |
| Product Quantization paper (must-read) | Paper: "Product Quantization for Nearest Neighbor Search" (Jegou et al., 2011) |
| Asymmetric distance computation | Same paper, Section 3.3 |
| Optimized PQ (OPQ) | Paper: "Optimized Product Quantization" (Ge et al., 2013) |
| PQ in FAISS | FAISS Wiki: Product Quantizers |
HNSW (Hierarchical Navigable Small World)
| Concept | Resource |
|---|---|
| Small-world networks | Networks by Newman – Ch. 11: "Small-World Networks" |
| Navigable Small World graphs | Paper: "Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs" (Malkov et al., 2014) |
| HNSW algorithm (must-read) | Paper: "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (Malkov & Yashunin, 2018) |
| HNSW implementation details | hnswlib GitHub |
| HNSW tuning (M and ef parameters) | Pinecone blog: "HNSW Index Deep Dive" |
Graph-Based Search
| Concept | Resource |
|---|---|
| Proximity graphs for search | Nearest Neighbor Search: A Database Perspective by Shasha & Wang – Ch. 4 |
| Navigating graphs efficiently | Paper: "A Comparative Study of Graph-based Approximate Nearest Neighbor Algorithms" (Wang et al., 2021) |
| Graph construction strategies | HNSW paper, Section 3.2 |
Compression & Quantization
| Concept | Resource |
|---|---|
| Vector quantization theory | The Elements of Statistical Learning by Hastie, Tibshirani & Friedman – Ch. 14.3.12: "Vector Quantization" |
| Scalar quantization | Understanding Digital Signal Processing by Lyons – Ch. 12: "Data Compression" |
| Additive quantization | Paper: "Additive Quantization for Extreme Vector Compression" (Babenko & Lempitsky, 2014) |
Disk-Based Vector Search
| Concept | Resource |
|---|---|
| SSD-friendly data structures | Database Internals by Petrov – Ch. 3: "File Formats" |
| DiskANN algorithm | Paper: "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node" (Jayaram Subramanya et al., 2019) |
| Memory-mapped files | The Linux Programming Interface by Kerrisk – Ch. 49: "Memory Mappings" |
| Vamana graph (DiskANN's graph) | DiskANN paper, Section 3 |
Distributed Systems for Vector Search
| Concept | Resource |
|---|---|
| Distributed system fundamentals | Designing Data-Intensive Applications by Kleppmann – Ch. 5-9 (Replication, Partitioning, Transactions, Consistency) |
| Sharding strategies | Designing Data-Intensive Applications by Kleppmann – Ch. 6: "Partitioning" |
| CAP theorem and consistency | Distributed Systems by Tanenbaum & van Steen – Ch. 7: "Consistency and Replication" |
| Leader election (Raft/Paxos) | Paper: "In Search of an Understandable Consensus Algorithm (Extended Version)" (Ongaro & Ousterhout, 2014) – Raft |
| Distributed vector databases | Milvus documentation: Architecture Overview |
Metadata Filtering & Hybrid Search
| Concept | Resource |
|---|---|
| Filtered vector search | Paper: "Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters" (Singh et al., 2023) |
| Hybrid search (sparse + dense) | Pinecone blog: "Hybrid Search Explained" |
| Inverted indexes for metadata | Introduction to Information Retrieval by Manning, Raghavan & Schütze – Ch. 2: "The Term Vocabulary and Postings Lists" |
Storage Engines & Persistence
| Concept | Resource |
|---|---|
| LSM-trees for write-heavy workloads | Designing Data-Intensive Applications by Kleppmann – Ch. 3: "Storage and Retrieval" (LSM-trees section) |
| WAL (Write-Ahead Logging) | Database Internals by Petrov – Ch. 4: "Implementing B-Trees" (WAL section) |
| Memory management in databases | Database Internals by Petrov – Ch. 5: "Transaction Processing and Recovery" |
Performance & Benchmarking
| Concept | Resource |
|---|---|
| Measuring recall and precision | Introduction to Information Retrieval by Manning, Raghavan & Schütze – Ch. 8: "Evaluation in Information Retrieval" |
| Latency percentiles (p99, p95) | Designing Data-Intensive Applications by Kleppmann – Ch. 1: "Reliable, Scalable, and Maintainable Applications" |
| Benchmarking methodology | Paper: "ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms" (Aumüller et al., 2020) |
Essential Reading Order
For maximum comprehension, read in this order:
- Foundation (Week 1):
- Mining of Massive Datasets Ch. 3 (distance metrics, LSH fundamentals)
- Speech and Language Processing Ch. 6 (what embeddings are)
- Pattern Recognition and Machine Learning Ch. 9.1 (k-means)
- Core Algorithms (Week 2-3):
- LSH paper (Charikar, 2002)
- Product Quantization paper (Jegou et al., 2011) – must-read
- HNSW paper (Malkov & Yashunin, 2018) – must-read
- Advanced Techniques (Week 4):
- DiskANN paper (for billion-scale search)
- Filtered-DiskANN paper (for metadata filtering)
- Milvus/Qdrant architecture docs (for distributed systems)
- Systems & Production (Week 5+):
- Designing Data-Intensive Applications Ch. 3-6 (storage, replication, partitioning)
- Database Internals (storage engine details)
- Production vector DB docs (Pinecone, Qdrant, Milvus)
Project List
Projects progress from fundamental algorithms to a complete vector database implementation.
Project 1: Brute Force k-NN and Distance Metrics
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: Rust, C++, Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The "Resume Gold"
- Difficulty: Level 1: Beginner
- Knowledge Area: Vector Mathematics / Similarity Search
- Software or Tool: Distance Calculator
- Main Book: "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman (Chapter 3)
What you'll build: A brute-force k-nearest neighbors search engine supporting multiple distance metrics (Euclidean, cosine, dot product), with optimizations using NumPy vectorization and SIMD.
Why it teaches fundamentals: Before learning approximate algorithms, you must understand exact search. This establishes the baseline that all other algorithms try to beat. You'll also learn why brute force becomes impossible at scale.
Core challenges you'll face:
- Implementing distance metrics correctly → maps to vector mathematics
- Vectorizing with NumPy → maps to batch operations for speed
- Memory management for large datasets → maps to chunked processing
- Benchmarking accurately → maps to measuring recall and latency
Key Concepts:
- Vector Operations: Linear Algebra fundamentals
- Distance Metrics: "Mining of Massive Datasets" Chapter 3 - Leskovec et al.
- NumPy Broadcasting: NumPy documentation
Difficulty: Beginner. Time estimate: one weekend. Prerequisites: basic Python and an understanding of vectors/matrices; NumPy familiarity helps.
Real world outcome:
$ ./vector_search --dataset random_10k_128d.npy --query query.npy --k 10
Loading dataset: 10,000 vectors × 128 dimensions
Query vector: 128 dimensions
Brute Force k-NN Results (k=10):
┌──────┬───────────┬───────────────┐
│ Rank │ Vector ID │ Distance (L2) │
├──────┼───────────┼───────────────┤
│  1   │   4,521   │     0.823     │
│  2   │   7,892   │     0.891     │
│  3   │   2,103   │     0.912     │
│ ...  │    ...    │      ...      │
│  10  │   9,234   │     1.102     │
└──────┴───────────┴───────────────┘
Performance:
Distance calculations: 10,000
Time: 2.3ms (vectorized), 156ms (naive loop)
Throughput: 4,347 queries/second
$ ./vector_search --dataset random_1M_768d.npy --query query.npy --k 10
Loading dataset: 1,000,000 vectors × 768 dimensions
Performance:
Time: 1,847ms (too slow for production!)
This is why we need approximate algorithms...
Implementation Hints:
Distance metric implementations:
Euclidean Distance (L2):
d(q, v) = sqrt(sum((q[i] - v[i])^2 for i in range(dim)))
Cosine Similarity:
sim(q, v) = dot(q, v) / (norm(q) * norm(v))
distance = 1 - sim (to convert to distance)
Dot Product:
score(q, v) = sum(q[i] * v[i] for i in range(dim))
(higher is better, not a distance)
Naive implementation (slow):
def knn_naive(query, vectors, k):
    distances = []
    for i, v in enumerate(vectors):
        d = euclidean_distance(query, v)
        distances.append((i, d))
    distances.sort(key=lambda x: x[1])
    return distances[:k]
Vectorized implementation (fast):
import numpy as np

def knn_vectorized(query, vectors, k):
    # Compute all distances at once:
    # ||query - v||^2 = ||query||^2 - 2*(query . v) + ||v||^2
    query_sq = np.sum(query ** 2)
    vectors_sq = np.sum(vectors ** 2, axis=1)
    cross_term = 2 * np.dot(vectors, query)
    distances = query_sq - cross_term + vectors_sq
    # Skip sqrt for ranking (monotonic transformation)
    # Partial sort for top-k
    indices = np.argpartition(distances, k)[:k]
    top_k = indices[np.argsort(distances[indices])]
    return top_k, distances[top_k]
Key optimization insight: For ranking, you don't need the actual distance, just the ordering. So skip expensive operations like sqrt() that don't change the ranking.
Chunked processing for large datasets:
def knn_chunked(query, vectors, k, chunk_size=100000):
    # Keep the best k from each chunk, then merge across chunks
    # (assumes each chunk holds more than k vectors)
    candidates = []
    for i in range(0, len(vectors), chunk_size):
        chunk = vectors[i:i+chunk_size]
        indices, dists = knn_vectorized(query, chunk, k)
        candidates.extend((idx + i, dist) for idx, dist in zip(indices, dists))
    candidates.sort(key=lambda x: x[1])
    return candidates[:k]
The Core Question You're Answering
"How do distance metrics actually work, and why does vectorization make them 50x faster?"
Before writing code, understand what you're measuring. Distance isn't just math; it's the foundation of similarity. Every vector database operation depends on computing distances correctly and efficiently.
Concepts You Must Understand First
Stop and research these before coding:
- Linear Algebra Fundamentals
- What is a vector in mathematical terms vs in programming?
- How do dot products relate to projections?
- What does "norm" mean (L1, L2, infinity norm)?
- Book Reference: "Introduction to Linear Algebra" by Gilbert Strang – Ch. 1-2
- NumPy Broadcasting and Vectorization
- Why is a loop over 1M items slow but a NumPy operation fast?
- What is SIMD (Single Instruction Multiple Data)?
- How does NumPy use CPU vector instructions?
- Book Reference: "Python for Data Analysis" by Wes McKinney – Ch. 4: NumPy Basics
- Distance Metric Properties
- What makes a function a "metric" (triangle inequality, symmetry)?
- Why is cosine similarity NOT a distance (hint: it's a similarity)?
- When should you use Euclidean vs cosine?
- Book Reference: "Mining of Massive Datasets" Ch. 3.5
- Big-O Complexity
- What is the complexity of brute force k-NN? O(N × D) for what operations?
- Why does the curse of dimensionality make D matter?
- Book Reference: "Introduction to Algorithms" (CLRS) – Ch. 3
Questions to Guide Your Design
Before implementing, think through these:
- API Design
- Should your search function return indices, distances, or both?
- How do you handle the case where k > dataset size?
- Should you support batched queries (multiple query vectors at once)?
- Distance Metrics
- For cosine similarity, do you normalize vectors once (at index time) or per query?
- Can you avoid computing sqrt() for Euclidean distance? (Hint: yes, for ranking)
- How do you handle zero vectors in cosine similarity?
- Memory Management
- Can you load the entire dataset in RAM? If not, how do you chunk it?
- Should you memory-map the data file or load it entirely?
- Where does the bottleneck appear: CPU or memory bandwidth?
- Benchmarking
- What datasets should you test on? (SIFT1M, GIST1M are standard)
- How do you verify your results are correct? (Check against sklearn)
- What metrics matter: latency, throughput, or both?
Thinking Exercise
Trace Distance Calculation By Hand
Before coding, compute these manually:
Vector A = [3, 4]
Vector B = [6, 8]
Questions:
1. Euclidean distance between A and B?
2. Cosine similarity between A and B?
3. Dot product of A and B?
4. If you normalize both vectors first, what's the dot product?
5. Why are (3) and (4) different? What does this tell you?
Work this out on paper. The insight from question 5 is crucial.
Performance Mental Model
Predict the speedup:
- Naive Python loops over 10,000 vectors × 128 dimensions
- NumPy vectorized version with SIMD
Make a guess: 2x? 10x? 100x?
Then implement both and measure. The actual speedup will teach you why vectorization matters.
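One possible timing harness for this experiment (a sketch; the absolute numbers depend on your machine):

import time
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

t0 = time.perf_counter()
naive = [sum((q - v) ** 2 for q, v in zip(query, vec)) for vec in vectors]  # pure-Python loops
t1 = time.perf_counter()
fast = np.sum((vectors - query) ** 2, axis=1)                               # vectorized
t2 = time.perf_counter()
print(f"naive: {t1 - t0:.3f}s  numpy: {t2 - t1:.5f}s  speedup: {(t1 - t0) / (t2 - t1):.0f}x")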
The Interview Questions They'll Ask
Prepare to answer these:
- "What's the difference between Euclidean distance and cosine similarity?"
- Euclidean measures absolute distance in space; cosine measures angle between vectors
- Euclidean is magnitude-sensitive; cosine is magnitude-invariant
- Use Euclidean when magnitude matters (e.g., image brightness), cosine for orientation (e.g., text embeddings)
- "Why is brute force k-NN O(N) and not O(N log N)?"
- Computing all N distances is O(N × D)
- Finding top-k with argpartition is O(N), not O(N log N)
- Full sort would be O(N log N), but we don't need it
- "Can you skip the sqrt() in Euclidean distance? Why or why not?"
- Yes for ranking (sqrt is monotonic, preserves order)
- No if you need actual distance values
- Saves computation: sqrt is ~10-20x slower than multiply
- "Your search is slow. Is it CPU-bound or memory-bound? How would you check?"
- Profile with the shell time command and check CPU utilization
- Memory-bound: low CPU usage, high memory bandwidth
- CPU-bound: high CPU usage, bottleneck in computation
- Tools: top, htop, perf, Python profilers
- "How would you find k-NN for 1000 query vectors efficiently?"
- Batch compute: distances matrix of shape (1000, N)
- Use matrix multiplication instead of 1000 loops
- Exploit CPU cache locality and SIMD
- "What happens to distance distributions as dimensionality increases?"
- Curse of dimensionality: all points become equidistant
- Distance concentration: min and max distances converge
- Relative contrast (max/min) decreases exponentially
- This is why high-dimensional spaces need special techniques
- "How does NumPy make array operations fast?"
- Uses compiled C/Fortran libraries (BLAS, LAPACK)
- SIMD vectorization (compute 4-8 values simultaneously)
- Contiguous memory layout (cache-friendly)
- Avoids Python interpreter overhead
Hints in Layers
Hint 1: Start Simple
Implement naive Euclidean distance with Python loops first. Make sure it works correctly before optimizing.
import math

def euclidean_naive(v1, v2):
    dist = 0
    for i in range(len(v1)):
        dist += (v1[i] - v2[i]) ** 2
    return math.sqrt(dist)
Test it on small examples where you know the answer.
Hint 2: Vectorize with NumPy
Replace loops with NumPy operations:
import numpy as np

def euclidean_vectorized(v1, v2):
    return np.linalg.norm(v1 - v2)
    # Or: return np.sqrt(np.sum((v1 - v2) ** 2))
Benchmark the difference using timeit or time.perf_counter().
Hint 3: Batch Computation
To find k-NN, you need distances from query to ALL vectors. Compute all at once:
# query: (D,)
# vectors: (N, D)
# Result: (N,) distances
dists = np.linalg.norm(vectors - query, axis=1)
This is the key insight: broadcast the query and compute N distances in one operation.
Hint 4: Skip Sqrt for Ranking
For finding top-k, you only need the ORDERING, not actual distances:
# These give the same ordering:
distances = np.sqrt(np.sum((vectors - query)**2, axis=1))
distances_squared = np.sum((vectors - query)**2, axis=1)
# So skip the sqrt:
top_k_indices = np.argpartition(distances_squared, k)[:k]
Sqrt is expensive. Avoid it when you can.
Hint 5: Cosine Similarity Trick
Pre-normalize all vectors once. Then cosine similarity = dot product:
# At index time:
vectors_normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
# At query time:
query_normalized = query / np.linalg.norm(query)
similarities = np.dot(vectors_normalized, query_normalized)
# No need for division anymore!
This converts O(N × D) divisions to O(N × D) multiplications (which you'd do anyway for the dot product).
Hint 6: Use argpartition, Not argsort
Finding top-k doesn't require a full sort:
# O(N log N) - wasteful
top_k = np.argsort(distances)[:k]
# O(N) - better
top_k = np.argpartition(distances, k)[:k]
top_k_sorted = top_k[np.argsort(distances[top_k])]
For k ≪ N, this is a massive speedup.
Hint 7: Memory-Map Large Datasets
For datasets that don't fit in RAM:
# Memory-mapped array (doesn't load entire file)
vectors = np.load('large_dataset.npy', mmap_mode='r')
# Process in chunks
chunk_size = 100000
for i in range(0, len(vectors), chunk_size):
    chunk = vectors[i:i+chunk_size]
    # Compute distances for this chunk
This lets you work with datasets larger than RAM.
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Vector and matrix operations | "Introduction to Linear Algebra" by Gilbert Strang | Ch. 1-2 |
| Distance metrics in depth | "Mining of Massive Datasets" by Leskovec et al. | Ch. 3: Finding Similar Items |
| NumPy fundamentals | "Python for Data Analysis" by Wes McKinney | Ch. 4: NumPy Basics |
| Performance and vectorization | "High Performance Python" by Gorelick & Ozsvald | Ch. 6: Matrix and Vector Computation |
| K-NN algorithms | "Pattern Recognition and Machine Learning" by Bishop | Ch. 2.5: Nearest Neighbor Methods |
| Big-O analysis | "Introduction to Algorithms" (CLRS) | Ch. 3: Growth of Functions |
Learning milestones:
- Distance metrics compute correctly → You understand vector math
- Vectorized version is 50x+ faster → You understand batch operations
- You hit the wall at 1M vectors → You understand why ANN is needed
- Recall@10 is always 100% → You understand the exact search baseline
Project 2: Locality Sensitive Hashing (LSH)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: Rust, C++, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The "Service & Support" Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Hashing / Probabilistic Data Structures
- Software or Tool: LSH Index
- Main Book: "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman (Chapter 3)
What you'll build: A Locality Sensitive Hashing index for approximate nearest neighbor search, using random hyperplanes for cosine similarity or random projections for Euclidean distance.
Why it teaches LSH: LSH is the simplest ANN algorithm to understand. It's based on a beautiful insight: if we hash vectors such that similar vectors likely get the same hash, we can use hash tables for fast lookup. Understanding LSH makes other algorithms clearer.
Core challenges you'll face:
- Designing hash functions → maps to random hyperplanes and projections
- Amplification (AND/OR) → maps to balancing precision and recall
- Multi-probe LSH → maps to checking nearby buckets
- Parameter tuning → maps to number of tables and hash functions
Resources for key challenges:
- Pinecone LSH Guide - Illustrated explanation
- MIT LSH Page - Original research and implementations
Key Concepts:
- LSH Fundamentals: "Mining of Massive Datasets" Chapter 3 - Leskovec et al.
- Random Projections: Tyler Neylon's LSH Introduction
- Amplification: LSH literature on AND/OR constructions
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1 completed; understanding of hash tables and probability.
Real world outcome:
$ ./lsh_search --build --dataset sift_1M.npy --tables 20 --hashes 8
Building LSH Index:
Dataset: 1,000,000 vectors × 128 dimensions
Hash tables: 20
Hashes per table: 8
Total hash functions: 160
Build time: 12.3s
Index size: 245 MB (vs 512 MB raw data)
$ ./lsh_search --query query.npy --k 10
LSH Query Results:
┌──────┬───────────┬──────────┬────────────────┐
│ Rank │ Vector ID │ Distance │ Found in Table │
├──────┼───────────┼──────────┼────────────────┤
│  1   │   4,521   │  0.823   │    3, 7, 12    │
│  2   │   7,892   │  0.891   │     7, 15      │
│ ...  │    ...    │   ...    │      ...       │
└──────┴───────────┴──────────┴────────────────┘
Performance:
Candidates examined: 2,042 (vs 1,000,000 brute force)
Time: 4.2ms (vs 1,847ms brute force)
Speedup: 440x
Recall@10: 0.87 (87% of true neighbors found)
$ ./lsh_search --sweep-params
Sweeping parameters for optimal recall/speed...
Tables │ Hashes │ Recall@10 │ Query Time │ Index Size
───────┼────────┼───────────┼────────────┼───────────
   5   │   4    │   0.62    │   1.1ms    │   61 MB
  10   │   6    │   0.78    │   2.3ms    │  122 MB
  20   │   8    │   0.87    │   4.2ms    │  245 MB
  40   │  10    │   0.94    │   8.7ms    │  490 MB
Implementation Hints:
The key insight of LSH:
Hash function property:
P(h(a) = h(b)) ∝ similarity(a, b)
If vectors are similar → high probability of same hash
If vectors are different → low probability of same hash
For cosine similarity, use random hyperplanes:
def random_hyperplane_hash(vector, hyperplane):
    # Returns 1 if vector is on positive side, 0 otherwise
    return 1 if np.dot(vector, hyperplane) >= 0 else 0

def compute_hash(vector, hyperplanes):
    # Combine multiple hyperplane hashes into one integer
    hash_value = 0
    for i, hp in enumerate(hyperplanes):
        bit = random_hyperplane_hash(vector, hp)
        hash_value |= (bit << i)
    return hash_value
LSH Index structure:
import numpy as np
from collections import defaultdict

class LSHIndex:
    def __init__(self, num_tables, hashes_per_table, dim):
        self.vectors = {}   # vector_id -> vector, needed for re-ranking
        self.tables = []
        for _ in range(num_tables):
            hyperplanes = np.random.randn(hashes_per_table, dim)
            hyperplanes /= np.linalg.norm(hyperplanes, axis=1, keepdims=True)
            table = {
                'hyperplanes': hyperplanes,
                'buckets': defaultdict(list)  # hash -> [vector_ids]
            }
            self.tables.append(table)

    def insert(self, vector_id, vector):
        self.vectors[vector_id] = vector
        for table in self.tables:
            hash_val = compute_hash(vector, table['hyperplanes'])
            table['buckets'][hash_val].append(vector_id)

    def query(self, vector, k):
        candidates = set()
        for table in self.tables:
            hash_val = compute_hash(vector, table['hyperplanes'])
            candidates.update(table['buckets'][hash_val])
        # Re-rank candidates by actual distance
        results = []
        for cand_id in candidates:
            dist = distance(vector, self.vectors[cand_id])
            results.append((cand_id, dist))
        results.sort(key=lambda x: x[1])
        return results[:k]
Amplification (AND/OR construction):
- More hashes per table (AND): Fewer candidates, higher precision
- More tables (OR): More candidates, higher recall
- Balance: Tune based on your recall requirements
Multi-probe LSH: Instead of only checking the exact bucket, also check nearby buckets (flip one bit at a time). Increases recall without adding tables.
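A minimal sketch of a Hamming-distance-1 probe sequence (the helper name is my own, not from any library):

def probe_sequence(hash_val, num_bits):
    # Yield the exact bucket first, then every bucket at Hamming distance 1
    yield hash_val
    for i in range(num_bits):
        yield hash_val ^ (1 << i)   # flip bit i

# Usage sketch: collect candidates from the exact bucket plus its 1-bit neighbors
# for h in probe_sequence(hash_val, hashes_per_table):
#     candidates.update(table['buckets'].get(h, []))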
The Core Question You're Answering
Why does hashing similar vectors to the same buckets work for finding nearest neighbors?
This is the fundamental insight of LSH. Unlike traditional hash functions (which try to distribute items uniformly), LSH hash functions are designed so that:
- Similar items collide with HIGH probability
- Dissimilar items collide with LOW probability
The probability of collision is directly related to similarity. For random hyperplane hashing and cosine similarity:
P(h(a) = h(b)) = 1 - (angle(a, b) / π)
This means: if two vectors point in similar directions (small angle), they're likely to be on the same side of a random hyperplane. By using multiple hyperplanes, you create a hash code where similar vectors get similar (often identical) codes.
The "aha!" moment: You're not trying to find the EXACT nearest neighbor. You're willing to accept approximate answers in exchange for massive speedups. LSH gives you probabilistic guarantees: "With probability 1-δ, I'll find a neighbor within (1+ε) of optimal."
Concepts You Must Understand First
Before implementing LSH, make sure you understand:
1. Random Projections
- Projecting high-dimensional vectors onto random lines/hyperplanes
- The Johnson-Lindenstrauss lemma (distances are approximately preserved)
- Why random hyperplanes work for cosine similarity
2. Hash Functions and Collision
- Traditional hash functions minimize collisions
- LSH hash functions MAXIMIZE collisions for similar items
- Hash codes as binary strings (each bit = one hyperplane test)
3. Probability Basics
- P(h(a) = h(b)) depends on similarity(a, b)
- Independence of hash functions
- Amplification through AND/OR combinations
4. AND/OR Amplification
- AND construction: All k hashes must match → Higher precision, lower recall
- OR construction: Match in any of L tables → Higher recall, lower precision
- Combined: L tables, each with k hashes → Tunable precision/recall (a small calculator is sketched below)
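The calculator referenced above, assuming random-hyperplane LSH where a single hash matches with probability p = 1 - arccos(sim)/π:

import math

def collision_prob(sim, k, L):
    # AND over k hashes within a table, OR over L tables
    p = 1 - math.acos(sim) / math.pi
    return 1 - (1 - p ** k) ** L

# Two vectors with cosine similarity 0.9:
print(collision_prob(0.9, k=1, L=1))   # ~0.86  one hyperplane
print(collision_prob(0.9, k=4, L=1))   # ~0.54  AND: all 4 hashes must match
print(collision_prob(0.9, k=4, L=10))  # ~1.00  OR: a match in any of 10 tables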
Study these first:
- "Mining of Massive Datasets" Chapter 3.4 (LSH for Cosine Distance)
- Tyler Neylon's LSH Tutorial - excellent visual explanations
- Pinecone LSH Guide
Questions to Guide Your Design
Hash Function Design:
- How many hash functions (hyperplanes) should each table use?
- More hashes → smaller buckets → higher precision, lower recall
- Fewer hashes → larger buckets → lower precision, higher recall
- Typical: 6-12 hashes per table
- How do you generate random hyperplanes?
- Sample from a Gaussian: np.random.randn(dim)
- Normalize to unit length: hyperplane / ||hyperplane||
- Each hyperplane is independent
- How do you combine multiple bits into a hash?
- Treat bits as a binary number: hash = bit₀ + 2·bit₁ + 4·bit₂ + ...
- Use bitwise operations for speed: hash |= (bit << i)
Amplification Design:
- How many tables should you use?
- More tables → more chances to find neighbors → higher recall
- But also more memory and slower indexing
- Typical: 10-50 tables depending on recall requirements
- How do you balance precision and recall?
- High recall needed? → Many tables (L), fewer hashes per table (k)
- High precision needed? → Fewer tables, more hashes per table
- Formula: P_collision = 1 - (1 - p^k)^L, where p is the per-hash collision probability (a function of similarity)
Multi-Probe Strategy:
- Which nearby buckets should you probe?
- Hamming distance 1: flip each bit individually (k probes)
- Hamming distance 2: flip pairs of bits (k choose 2 probes)
- Prioritize bits closest to decision boundary
- How many probes before diminishing returns?
- Empirically: 2-10 probes per table
- Track marginal increase in candidates found
- Stop when cost exceeds benefit
Memory vs Speed:
- How do you store hash tables efficiently?
- Dict of lists: buckets[hash] = [id1, id2, ...]
- Problem: empty buckets waste space
- Solution: only store non-empty buckets
- Should you store vectors or just IDs?
- IDs only → compact, but needs separate vector storage
- Vectors included → faster re-ranking, but larger memory
Thinking Exercise: Understanding Hash Collisions
Exercise 1: Manual Hash Computation
Given two 3D vectors and one random hyperplane, compute if they hash to the same value:
v1 = [1, 2, 1] (normalized: [0.408, 0.816, 0.408])
v2 = [1, 2, 2] (normalized: [0.333, 0.667, 0.667])
hyperplane = [1, 0, 0]
Steps:
- Compute the dot products: v1 · h and v2 · h
- Check the sign: positive → bit=1, negative → bit=0
- Do they match?
Answer: v1·h = 0.408 (bit=1), v2·h = 0.333 (bit=1) → COLLISION ✓
Exercise 2: Multi-Bit Hash
Now use 3 hyperplanes:
h1 = [1, 0, 0]
h2 = [0, 1, 0]
h3 = [0, 0, 1]
Compute the 3-bit hash for both v1 and v2:
- For each hyperplane, compute sign(v · h)
- Combine into binary: bit₀ + 2·bit₁ + 4·bit₂
Answer:
- v1: [+, +, +] → 111₂ → 7
- v2: [+, +, +] → 111₂ → 7
- Still a collision! ✓
Exercise 3: Probability Calculation
If two vectors have cosine similarity 0.9, what's the probability they collide with:
- 1 random hyperplane?
- 4 random hyperplanes (all must match)?
- 10 tables of 4 hashes each (match in ANY table)?
Answers:
- 1 hyperplane: P = 1 - arccos(0.9)/π ≈ 0.86
- 4 hyperplanes: P ≈ 0.86^4 ≈ 0.54
- 10 tables: P = 1 - (1 - 0.54)^10 ≈ 0.9996 (very high recall!)
Exercise 4: Understanding the Tradeoff
You have 1M vectors, target recall@10 of 0.85. Compare configurations:
| Tables | Hashes | Expected Candidates | Recall | Memory |
|---|---|---|---|---|
| 5 | 8 | 500 | 0.65 | 60 MB |
| 20 | 6 | 2000 | 0.85 | 240 MB |
| 50 | 4 | 5000 | 0.95 | 600 MB |
Question: Which would you choose for a latency-critical application?
Hint: More candidates = slower re-ranking. Look at candidates vs recall ratio.
The Interview Questions They'll Ask
Conceptual Understanding:
- "Explain how LSH works to a software engineer unfamiliar with it."
- Good answer: "LSH uses special hash functions where similar items collide frequently. By hashing vectors with random hyperplanes and checking which side they fall on, we create codes where similar vectors get similar codes. Then we use hash tables for fast lookup."
- Great answer: Add probability guarantees, AND/OR amplification, and a concrete example with numbers.
- "Why use random hyperplanes for cosine similarity?"
- The probability two vectors are on the same side equals their angular similarity
- Mathematically: P(sign(v·h) = sign(u·h)) = 1 - θ/π, where θ = angle(v, u)
- This directly connects hash collision to cosine similarity
- "What's the difference between AND and OR construction?"
- AND: All k hashes must match → fewer collisions → higher precision
- OR: Match in any of L tables → more chances → higher recall
- Combined: (k AND) OR L → tunable precision/recall tradeoff
Implementation Details:
- "How do you choose the number of tables and hashes per table?"
- Start with the target recall requirement (e.g., 0.85)
- Estimate collision probability based on expected similarity
- Use the formula: P_recall ≈ 1 - (1 - p^k)^L
- Solve for L and k given the recall target
- Empirically tune on a validation set
- "What's multi-probe LSH and why does it help?"
- Instead of checking only the exact hash bucket, also check nearby buckets
- "Nearby" = Hamming distance 1 or 2 (flip 1-2 bits)
- Helps because vectors near decision boundaries might be in adjacent buckets
- Increases recall without adding tables (saves memory)
- "How would you handle updates (insertions/deletions)?"
- Insertions: Hash the new vector, add to a bucket in each of the L tables
- Deletions: Remove from all L buckets (requires storing a vector→bucket mapping)
- Challenge: bucket lists can become fragmented
- Solution: periodic rebuilding or more sophisticated data structures
Performance & Scaling:
- "Why is LSH typically slower than IVF or HNSW in practice?"
- High-dimensional spaces: need many bits for good precision
- Many tables needed for good recall
- Re-ranking cost: examine many candidates even if most are false positives
- Modern alternatives (HNSW, IVF) have better recall/speed tradeoffs
- "When would you choose LSH over other ANN algorithms?"
- Very high dimensionality (>1000D) where clustering is hard
- Streaming/online scenarios with frequent insertions
- When you need theoretical guarantees on approximation
- When simplicity and debuggability matter more than peak performance
- "How does LSH scale with dimensionality?"
- Need O(log n / ε²) hash functions for a (1+ε) approximation
- Higher dimensions → need more bits → larger hash codes
- But the curse of dimensionality affects ALL ANN methods
- LSH degrades more gracefully than some alternatives
Advanced Topics:
- "Explain data-dependent LSH vs data-independent LSH."
- Data-independent: random hyperplanes (what we've implemented)
- Data-dependent: learn hyperplanes from data to maximize separation
- Example: use PCA directions or train with neural networks
- Tradeoff: better performance vs complexity and adaptability
- "How would you implement LSH for other distance metrics?"
- L2 (Euclidean): p-stable distributions (Gaussian for L2; Cauchy gives L1)
- Hamming: random bit sampling
- Jaccard: MinHash
- Each metric needs a different hash family where collision probability tracks similarity
- "What's the theoretical guarantee of LSH?"
- An (r₁, r₂, p₁, p₂)-sensitive hash family
- If d(a,b) ≤ r₁ → P(h(a)=h(b)) ≥ p₁ (nearby items collide often)
- If d(a,b) ≥ r₂ → P(h(a)=h(b)) ≤ p₂ (distant items rarely collide)
- With amplification: guarantees on finding (1+ε)-approximate neighbors
Hints in Layers
Layer 1: Single Hash Table (Start Here)
- Implement one table with k=8 random hyperplanes
- Hash vectors by checking the sign of each projection
- Store in a dict: buckets[hash_value] = [vector_ids]
- Query: hash the query vector, return all vectors in the same bucket
- Test: verify similar vectors often share buckets
Layer 2: Multiple Tables (OR Construction)
- Create L=20 independent tables (different random hyperplanes)
- Insert: add vector to bucket in each table
- Query: collect candidates from all L tables (union)
- Re-rank candidates by actual distance
- Test: measure recall@10 vs single table
Layer 3: Parameter Tuning
- Implement parameter sweep: vary L and k
- For each configuration, measure:
- Recall@10
- Query time
- Memory usage
- Number of candidates examined
- Plot Pareto frontier: recall vs query time
- Test: find optimal L,k for target recall of 0.85
Layer 4: Multi-Probe LSH
- Generate probe sequence: flip 1 bit, 2 bits, etc.
- For each table, check exact bucket + probes
- Implement early stopping: stop if enough candidates found
- Test: compare recall with/without multi-probe at same #candidates
Layer 5: Optimizations
- Vectorized hash computation: batch process hyperplane projections (see the sketch after this list)
- Compact bucket storage: use numpy arrays instead of lists
- Early termination: stop search after examining N candidates
- Test: benchmark against baseline, aim for 2-3x speedup
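The vectorized hash computation from Layer 5 might look like this sketch (names are illustrative): one matrix multiply performs all hyperplane projections, and the sign bits are packed into integer codes:

import numpy as np

def batch_hashes(vectors, hyperplanes):
    # vectors: (N, D), hyperplanes: (k, D) -> one k-bit integer code per vector
    bits = (vectors @ hyperplanes.T) >= 0           # (N, k) sign tests
    powers = 1 << np.arange(hyperplanes.shape[0])   # [1, 2, 4, ...]
    return bits.astype(np.int64) @ powers           # (N,) hash codes

rng = np.random.default_rng(0)
codes = batch_hashes(rng.standard_normal((1000, 128)), rng.standard_normal((8, 128)))
print(codes.shape, codes.min(), codes.max())  # 1000 codes in [0, 255]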
Debugging Tips:
- Low recall? → Increase L (more tables) or decrease k (fewer hashes per table)
- Too slow? → Decrease L or use multi-probe instead of more tables
- Memory issues? → Decrease L or use approximate storage (Bloom filters)
- Verification: Always re-rank candidates with exact distances!
Books That Will Help
Core LSH Theory:
- "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman
- Chapter 3: Finding Similar Items
- Section 3.4: LSH for Cosine Distance (essential reading)
- Section 3.6: Applications of LSH
- Why: Best introduction to LSH with clear examples and proofs
Random Projections & Theory:
- "Foundations of Data Science" by Blum, Hopcroft, Kannan
- Chapter 2: Random Projection and Dimensionality Reduction
- Johnson-Lindenstrauss lemma and applications
- Why: Theoretical foundation for why LSH works
Algorithms & Probability:
- "Randomized Algorithms" by Motwani and Raghavan
- Chapter 6: Hashing
- Probabilistic analysis of hash functions
- Why: Deep dive into probability tools needed for LSH analysis
Practical Implementation:
- "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
- Chapter 19: Web Search Basics
- Near-duplicate detection using LSH
- Why: Real-world applications and engineering considerations
Advanced Topics:
- "Nearest-Neighbor Methods in Learning and Vision" edited by Shakhnarovich, Darrell, Indyk
- Chapter 5: LSH by Indyk
- Data-dependent LSH
- Why: Written by LSH pioneer, covers cutting-edge techniques
Online Resources:
- MIT OCW: Algorithmic Lower Bounds - lecture on LSH
- Alex Andoni's LSH Page - original research and code
- Tyler Neylon's LSH Tutorial Series - extremely clear visual explanations
Papers to Read (after implementing):
- "Similarity Search in High Dimensions via Hashing" (Gionis et al., 1999) - the original LSH paper
- "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions" (Datar et al., 2004) - LSH for L2 distance
- "Multi-Probe LSH" (Lv et al., 2007) - improves efficiency significantly
Learning milestones:
- Hash function maps similar vectors together → You understand LSH basics
- Multiple tables improve recall → You understand amplification
- Parameter tuning affects recall/speed → You understand the tradeoff
- Multi-probe improves efficiency → You understand advanced LSH
Project 3: Inverted File Index (IVF)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The "Open Core" Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Clustering / Inverted Indexes
- Software or Tool: IVF Index
- Main Book: "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
What you'll build: An Inverted File (IVF) index that clusters vectors with k-means, then searches only nearby clusters at query time; the foundation of production vector search.
Why it teaches IVF: IVF is the workhorse of production vector databases. It's conceptually simple (just clustering!) but has nuances around cluster count, nprobe tuning, and training data. Understanding IVF unlocks FAISS and most production systems.
Core challenges you'll face:
- K-means clustering at scale → maps to efficient training
- Choosing the number of clusters → maps to the sqrt(N) heuristic
- nprobe tuning → maps to the recall vs speed tradeoff
- Non-uniform cluster sizes → maps to data distribution challenges
Resources for key challenges:
- Pinecone Vector Indexes Guide - IVF explanation
- FAISS IVF Documentation
- Zilliz Vector Index Tutorial
Key Concepts:
- Inverted Index: "Introduction to Information Retrieval" Chapter 1 - Manning et al.
- K-Means Clustering: "Pattern Recognition and Machine Learning" - Bishop
- IVF in FAISS: FAISS documentation
Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Project 1 completed; understanding of k-means clustering.
Real world outcome:
$ ./ivf_search --build --dataset sift_1M.npy --clusters 1000
Building IVF Index:
Dataset: 1,000,000 vectors × 128 dimensions
Training k-means with 100,000 sample vectors...
K-means iterations: 25, converged: True
Clusters: 1,000 (avg 1,000 vectors per cluster)
Cluster size distribution:
Min: 423, Max: 2,156, Median: 987, StdDev: 312
Build time: 45.2s
Index size: 134 MB (centroids) + 512 MB (data)
$ ./ivf_search --query query.npy --k 10 --nprobe 10
IVF Query Results (nprobe=10):
Clusters searched: 10 (1% of index)
Vectors compared: 9,870
┌──────┬───────────┬──────────┬────────────┐
│ Rank │ Vector ID │ Distance │ Cluster ID │
├──────┼───────────┼──────────┼────────────┤
│  1   │   4,521   │  0.823   │    127     │
│  2   │   7,892   │  0.891   │    127     │
│ ...  │    ...    │   ...    │    ...     │
└──────┴───────────┴──────────┴────────────┘
Performance:
Time: 8.3ms (vs 1,847ms brute force)
Speedup: 222x
Recall@10: 0.92
$ ./ivf_search --nprobe-sweep
nprobe │ Recall@10 │ Query Time │ Vectors Compared
───────┼───────────┼────────────┼─────────────────
   1   │   0.52    │   1.2ms    │       987
   5   │   0.81    │   4.8ms    │     4,935
  10   │   0.92    │   8.3ms    │     9,870
  20   │   0.97    │  15.2ms    │    19,740
  50   │   0.99    │  38.1ms    │    49,350
Implementation Hints:
IVF structure:
import numpy as np
from sklearn.cluster import KMeans

class IVFIndex:
    def __init__(self, num_clusters, dim):
        self.num_clusters = num_clusters
        self.dim = dim
        self.centroids = None  # (num_clusters, dim)
        self.inverted_lists = [[] for _ in range(num_clusters)]
        # inverted_lists[i] = [(vector_id, vector), ...]

    def train(self, training_vectors):
        # Run k-means to find cluster centroids
        kmeans = KMeans(n_clusters=self.num_clusters, n_init=1)
        kmeans.fit(training_vectors)
        self.centroids = kmeans.cluster_centers_

    def add(self, vector_id, vector):
        # Find the nearest centroid
        distances = np.linalg.norm(self.centroids - vector, axis=1)
        cluster_id = np.argmin(distances)
        self.inverted_lists[cluster_id].append((vector_id, vector))

    def search(self, query, k, nprobe):
        # Find the nprobe nearest centroids
        centroid_distances = np.linalg.norm(self.centroids - query, axis=1)
        nearest_clusters = np.argsort(centroid_distances)[:nprobe]
        # Search within those clusters
        candidates = []
        for cluster_id in nearest_clusters:
            for vector_id, vector in self.inverted_lists[cluster_id]:
                dist = np.linalg.norm(query - vector)
                candidates.append((vector_id, dist))
        # Sort and return top-k
        candidates.sort(key=lambda x: x[1])
        return candidates[:k]
Choosing number of clusters:
Rule of thumb: num_clusters ≈ sqrt(num_vectors)
1M vectors → ~1,000 clusters
100M vectors → ~10,000 clusters
Too few clusters: each cluster too large, slow search
Too many clusters: centroid comparison overhead, missed neighbors
Training considerations:
- Train on a representative sample (10-20% of data is often enough)
- Ensure training data has same distribution as full dataset
- More k-means iterations = better centroids, but slower training
Non-exhaustive search pitfall:
If nearest neighbor is in cluster 5, but query is closest to
centroid of cluster 3, we might miss it with low nprobe.
Solution: Higher nprobe, or use HNSW to find nearby centroids faster
The Core Question You're Answering
"How does clustering enable fast search in high-dimensional spaces?"
This is the foundational insight of IVF: if you partition your dataset into regions (clusters), you can search only the regions near your query instead of the entire dataset. The key challenge is: how do you partition the space efficiently, and how do you decide which regions to search?
Think about it:
- Without clustering: You must compare your query to all N vectors (O(N) time)
- With clustering: You compare to K centroids, then search only nprobe clusters (O(K + nprobe * N/K) time)
- The tradeoff: Speed vs accuracy - fewer clusters searched = faster but might miss the true nearest neighbor
Concepts You Must Understand First
Before implementing IVF, you need solid understanding of:
- K-means Clustering
- How Lloyd's algorithm iteratively assigns points to centroids and updates centroids
- Why k-means minimizes within-cluster variance
- Convergence criteria and initialization strategies (k-means++)
- Time complexity: O(iterations × N × K × D) where K = number of clusters, D = dimensions
- Voronoi Partitions
- Each centroid defines a region (Voronoi cell) of all points closer to it than any other centroid
- The partition is exhaustive (every point belongs to exactly one cell)
- Voronoi boundaries are hyperplanes in D-dimensional space
- Understanding: "Which cluster does this query belong to?" = "Which centroid is nearest?"
- The nprobe Parameter
- nprobe = number of clusters to search at query time
- nprobe = 1: Search only the nearest cluster (fast, low recall)
- nprobe = K: Search all clusters (slow, perfect recall = brute force)
- nprobe = 10-50: Typical production values (good recall/speed tradeoff)
- Recall vs Speed Tradeoff
- Recall@k = fraction of true top-k neighbors found
- As nprobe increases: recall ↑, query time ↑
- The goal: Find minimum nprobe that achieves acceptable recall (often 90-95%)
- Cluster Imbalance
- Real data doesn't cluster uniformly: some clusters are huge, others tiny
- Imbalanced clusters hurt performance (searching a 5,000-vector cluster vs a 200-vector cluster); a measurement sketch follows this list
- Solutions: Use more clusters, or switch to techniques like IVF-PQ or HNSW
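The measurement sketch referenced above, assuming cluster assignments such as sklearn's kmeans.labels_:

import numpy as np

def cluster_stats(labels, num_clusters):
    # labels: cluster assignment per vector, e.g. kmeans.labels_
    sizes = np.bincount(labels, minlength=num_clusters)
    print(f"min={sizes.min()}  max={sizes.max()}  "
          f"median={np.median(sizes):.0f}  stddev={sizes.std():.0f}")
    # A max/min ratio above ~10x signals problematic imbalance
    return sizes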
Questions to Guide Your Design
As you build IVF, constantly ask yourself:
- How many clusters should I use?
- Too few (K = 10): Huge clusters, slow search even with nprobe=1
- Too many (K = 100,000): Centroid comparison overhead, might need high nprobe
- Rule of thumb: K ≈ sqrt(N) - but why? (Balances centroid comparison vs in-cluster search)
- For 1M vectors: K ≈ 1000 clusters → ~1,000 vectors per cluster
- How much training data do I need?
- K-means needs enough samples per cluster (at least 30-50 per cluster)
- For K=1000 clusters: need at least 30,000-50,000 training vectors
- But k-means is expensive on large datasets - can you subsample?
- Answer: Yes! 10-20% of data is often sufficient if representative
- What if clusters are imbalanced?
- Measure cluster size distribution: min, max, median, stddev
- If max/min > 10x, you have imbalance
- Options: Increase K, use hierarchical clustering, or accept it
- Real-world data is always imbalanced - design for it!
- How do I tune nprobe?
- Sweep nprobe from 1 to 100, measure recall@10 and query time
- Plot recall vs nprobe and time vs nprobe
- Find the "knee" of the curve, where recall plateaus (diminishing returns)
- Production systems often expose nprobe as a per-query parameter; a sweep sketch follows this list
- What distance metric should I use?
- K-means assumes Euclidean distance (L2)
- For cosine similarity: normalize vectors to unit length, then use L2
- For other metrics (Manhattan, etc.): k-means may not work well
- IVF is fundamentally Euclidean - that's a limitation!
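The sweep sketch referenced in question 4 above (assumes the IVFIndex class from the implementation hints, plus exact top-k IDs from brute force as ground truth):

import time
import numpy as np

def sweep_nprobe(index, queries, true_neighbors, k=10):
    # true_neighbors[i] = exact top-k IDs for queries[i], computed by brute force
    for nprobe in [1, 5, 10, 20, 50]:
        recalls, t0 = [], time.perf_counter()
        for q, truth in zip(queries, true_neighbors):
            ids = [vid for vid, _ in index.search(q, k, nprobe)]
            recalls.append(len(set(ids) & set(truth)) / k)
        ms = (time.perf_counter() - t0) * 1000 / len(queries)
        print(f"nprobe={nprobe:3d}  recall@{k}={np.mean(recalls):.2f}  {ms:.1f}ms/query")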
Thinking Exercise
Before coding, do this by hand to internalize the algorithm:
Setup: You have 20 2D vectors and K=4 clusters.
Vectors (x, y):
[1,1], [2,1], [1,2], [2,2], # Cluster 1 (bottom-left)
[8,1], [9,1], [8,2], [9,2], # Cluster 2 (bottom-right)
[1,8], [2,8], [1,9], [2,9], # Cluster 3 (top-left)
[8,8], [9,8], [8,9], [9,9], # Cluster 4 (top-right)
[5,5], [5,6], [6,5], [6,6] # Middle vectors
Part 1: K-means
- Initialize 4 random centroids
- Assign each vector to nearest centroid
- Recompute centroids as mean of assigned vectors
- Repeat until convergence
- What are the final centroids? (Should be near [1.5,1.5], [8.5,1.5], [1.5,8.5], [8.5,8.5])
Part 2: Search with nprobe=1
- Query: [5.5, 5.5]
- Which centroid is nearest? (Could be any; they're equidistant!)
- Search only that cluster
- What if the true nearest neighbor is in a different cluster? MISSED!
Part 3: Search with nprobe=2
- Query: [5.5, 5.5]
- Find 2 nearest centroids
- Search both clusters
- Higher chance of finding the true nearest neighbor
Part 4: Understanding the tradeoff
- nprobe=1: Search ~5 vectors (1/4 of dataset)
- nprobe=2: Search ~10 vectors (1/2 of dataset)
- nprobe=4: Search all 20 vectors (brute force!)
Key insight: The middle vectors [5,5], [5,6], [6,5], [6,6] are equidistant from all centroids. Queries near them NEED higher nprobe. This is the fundamental limitation of IVF!
The Interview Questions They'll Ask
When you understand IVF deeply, you can answer:
- Theory Questions:
- "Explain how IVF reduces search time complexity from O(N) to O(K + nprobe * N/K)."
- "Why do we use k-means for IVF? What properties make it suitable?"
- "What's the difference between IVF and HNSW? When would you use each?"
- "How does IVF handle new data points? Is retraining required?"
- Design Questions:
- "You have 100M 768-dim vectors. How many clusters would you use? Why?"
- "Your IVF index has 90% recall with nprobe=20 but you need 95%. What do you do?"
- "You notice cluster sizes range from 100 to 50,000 vectors. Is this a problem? How do you fix it?"
- "How would you handle streaming inserts to an IVF index?"
- Tradeoff Questions:
- "Given a query latency SLA of 10ms, how do you choose nprobe?"
- "Should you train k-means on all 100M vectors or a sample? What's the tradeoff?"
- "Your index is 50GB. How does IVF help with memory? How could you compress further?"
- "Compare IVF with nprobe=10 vs HNSW with ef=10. Which is faster? More accurate?"
- Implementation Questions:
- "How do you implement efficient centroid search? (Hint: it's brute force for small K!)"
- "What data structure do you use for inverted lists? Array? Linked list?"
- "How do you parallelize IVF search across multiple queries?"
- "What's the index build time for 1M vectors, 1000 clusters, 50 k-means iterations?"
- Production Questions:
- "Your IVF index has high query latency variance. Why might this happen?"
- Answer: Cluster imbalance - some clusters are 10x larger than others
- "How do you handle deletes in IVF? Tombstones? Rebuilding?"
- "Can you combine IVF with PQ? What's the memory/accuracy tradeoff?"
- "How would you shard an IVF index across 10 machines?"
Hints in Layers
Layer 1: If you're completely stuck starting
- Start with toy data: 1000 2D vectors, K=10 clusters
- Use sklearn.cluster.KMeans for training (don't implement k-means from scratch yet)
- Store clusters as a list of lists: clusters[i] = [(vector_id, vector), ...]
- For search: find nearest centroid with brute force, then search that cluster (see the sketch below)
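Putting the Layer 1 hints together, a minimal sketch assuming NumPy and scikit-learn (the class and variable names are illustrative, not from any library):

```python
import numpy as np
from sklearn.cluster import KMeans

class MinimalIVF:
    """Toy IVF: k-means partitioning + brute force inside probed clusters."""

    def __init__(self, n_clusters=10):
        self.km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        self.clusters = None  # clusters[i] = [(vector_id, vector), ...]

    def build(self, vectors):
        labels = self.km.fit_predict(vectors)
        self.clusters = [[] for _ in range(self.km.n_clusters)]
        for vid, (vec, lab) in enumerate(zip(vectors, labels)):
            self.clusters[lab].append((vid, vec))

    def search(self, query, k=10, nprobe=1):
        # Brute force over centroids is fine: K is small
        d = np.linalg.norm(self.km.cluster_centers_ - query, axis=1)
        probe = np.argsort(d)[:nprobe]
        # Exact scan, but only inside the probed clusters
        cands = [(float(np.linalg.norm(vec - query)), vid)
                 for c in probe for vid, vec in self.clusters[c]]
        return sorted(cands)[:k]

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
ivf = MinimalIVF(n_clusters=10)
ivf.build(data)
print(ivf.search(data[0], k=5, nprobe=2))  # (distance, id) pairs; id 0 first
```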
Layer 2: When basic IVF works
- Implement nprobe: search multiple clusters
- Measure recall by comparing to brute force results
- Plot recall vs nprobe curve
- Implement cluster size distribution statistics
Layer 3: When you want to optimize
- Implement your own k-means (Lloyd's algorithm)
- Add k-means++ initialization
- Use numpy vectorization for centroid distances
- Parallelize search across clusters
Layer 4: When you want production features
- Serialize index to disk (save centroids + inverted lists)
- Add batch query support
- Implement cluster rebalancing (split large clusters)
- Add IVF + flat refinement (search with IVF, refine with exact distances)
Common mistakes to avoid:
- Mistake: Using too few clusters (K < 100 for 1M vectors)
- Result: Slow search even with nprobe=1
- Fix: Use K ≈ sqrt(N)
- Mistake: Training k-means on biased sample
- Result: Poor cluster quality, low recall
- Fix: Ensure training sample is representative (random sample)
- Mistake: Not handling empty clusters
- Result: Some centroids have no assigned vectors
- Fix: Reassign empty centroids to split large clusters
- Mistake: Computing exact distances to all centroids
- Result: Slow centroid search
- Fix: For K < 10,000, brute force is fine. For larger K, use approximate methods
- Mistake: Storing full vectors in inverted lists
- Result: High memory usage
- Fix: Store only vector IDs, load vectors on-demand (or use PQ!)
Books That Will Help
- "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
- Chapter 1: Inverted indexes (the inspiration for IVF)
- Chapter 6: Scoring and ranking
- Chapter 18: Hierarchical clustering
- Why it helps: IVF is literally an inverted index for vectors
- "Pattern Recognition and Machine Learning" by Christopher Bishop
- Chapter 9: Mixture Models and EM (includes k-means)
- Chapter 12: Continuous Latent Variables
- Why it helps: Deep understanding of k-means and clustering theory
- "Foundations of Multidimensional and Metric Data Structures" by Hanan Samet
- Chapter 4: Point Access Methods
- Chapter 5: Multidimensional Point Data
- Why it helps: IVF in the broader context of spatial indexing
- Research Papers:
- "Video Google: A Text Retrieval Approach to Object Matching in Videos" (Sivic & Zisserman, 2003)
- Original IVF for visual search
- "Product Quantization for Nearest Neighbor Search" (Jégou et al., 2011)
- IVF-PQ, the production combination
- FAISS paper: "Billion-scale similarity search with GPUs" (Johnson et al., 2017)
- Modern IVF implementations
Learning milestones:
- K-means clusters vectors → You understand the preprocessing
- nprobe controls recall → You understand the search phase
- Cluster imbalance affects performance → You understand data distribution
- You compare with FAISS → You validate your implementation
Project 4: Product Quantization (PQ)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The "Open Core" Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Vector Compression / Quantization
- Software or Tool: PQ Encoder/Decoder
- Main Book: "Product Quantization for Nearest Neighbor Search" - Jégou et al.
What you'll build: A Product Quantization system that compresses 128-dimensional float vectors into just 8-16 bytes, enabling 97% memory reduction while maintaining search quality.
Why it teaches quantization: Memory is often the bottleneck for large vector databases. PQ is the key technique that makes billion-scale search possible on a single machine. Understanding PQ unlocks IVF-PQ, the most common production configuration.
Core challenges youโll face:
- Subspace decomposition → maps to splitting vectors into segments
- Codebook training → maps to k-means per subspace
- Asymmetric distance computation → maps to query vs database accuracy
- Distance lookup tables → maps to precomputed distances for speed
Resources for key challenges:
- Pinecone PQ Guide - 97% compression explained
- Zilliz PQ Tutorial - Hands-on Python
Key Concepts:
- Product Quantization: Original paper by Jรฉgou, Douze, Schmid
- Vector Quantization: Lloydโs algorithm / k-means
- Asymmetric Distance: ADC (Asymmetric Distance Computation)
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1, 3 completed. Understanding of k-means.
Real world outcome:
$ ./pq_compress --dataset sift_1M.npy --segments 8 --bits 8
Training Product Quantization:
Original: 1,000,000 vectors × 128 dims × 4 bytes = 512 MB
Segments: 8 (16 dims each)
Centroids per segment: 256 (8 bits)
Training codebooks (k-means per segment)...
Segment 0: 256 centroids, trained on 1M subvectors
Segment 1: 256 centroids, trained on 1M subvectors
...
Segment 7: 256 centroids, trained on 1M subvectors
Compressing vectors...
Compressed: 1,000,000 vectors × 8 bytes = 8 MB
Compression ratio: 64x (98.4% reduction!)
Codebook size: 8 segments × 256 centroids × 16 dims × 4 bytes = 128 KB
$ ./pq_search --query query.npy --k 10
PQ Search with ADC (Asymmetric Distance Computation):
Building distance lookup table: 0.02ms
Scanning 1M compressed codes: 12.3ms
Results:
Rank | Vector ID | Approx Dist | True Dist
-----|-----------|-------------|----------
1    | 4,521     | 0.841       | 0.823
2    | 7,102     | 0.923       | 0.905
...  | ...       | ...         | ...
Quality:
Recall@10: 0.89 (vs exact search)
Mean distance error: 3.2%
Performance:
Time: 12.3ms (vs 1,847ms brute force on raw vectors)
Memory: 8 MB (vs 512 MB raw)
Speedup: 150x
Implementation Hints:
Product Quantization structure:
Original vector (128 dims):
[v0, v1, v2, ..., v127]
Split into M=8 subvectors (16 dims each):
[v0-v15], [v16-v31], ..., [v112-v127]
Each subvector → quantized to one of 256 centroids
Compressed code: [c0, c1, c2, c3, c4, c5, c6, c7] (8 bytes total)
Implementation:
import numpy as np
from sklearn.cluster import KMeans

class ProductQuantizer:
    def __init__(self, dim, num_segments, num_centroids=256):
        self.dim = dim
        self.num_segments = num_segments
        self.segment_dim = dim // num_segments
        self.num_centroids = num_centroids
        self.codebooks = []  # One codebook per segment
def train(self, vectors):
for seg in range(self.num_segments):
start = seg * self.segment_dim
end = start + self.segment_dim
subvectors = vectors[:, start:end]
kmeans = KMeans(n_clusters=self.num_centroids)
kmeans.fit(subvectors)
self.codebooks.append(kmeans.cluster_centers_)
def encode(self, vector):
codes = []
for seg in range(self.num_segments):
start = seg * self.segment_dim
end = start + self.segment_dim
subvector = vector[start:end]
# Find nearest centroid
distances = np.linalg.norm(self.codebooks[seg] - subvector, axis=1)
code = np.argmin(distances)
codes.append(code)
return np.array(codes, dtype=np.uint8)
def decode(self, codes):
vector = []
for seg, code in enumerate(codes):
vector.extend(self.codebooks[seg][code])
return np.array(vector)
Asymmetric Distance Computation (ADC):
def compute_distance_tables(self, query):
"""Precompute distances from query subvectors to all centroids"""
tables = []
for seg in range(self.num_segments):
start = seg * self.segment_dim
end = start + self.segment_dim
query_sub = query[start:end]
# Distance from query subvector to each centroid
distances = np.linalg.norm(self.codebooks[seg] - query_sub, axis=1)
tables.append(distances)
return np.array(tables) # Shape: (num_segments, num_centroids)
def search_pq(self, query, codes, k):
"""Search using precomputed distance tables"""
tables = self.compute_distance_tables(query)
# Compute approximate distance to each code
distances = np.zeros(len(codes))
for seg in range(self.num_segments):
distances += tables[seg, codes[:, seg]] ** 2
# Top-k
indices = np.argpartition(distances, k)[:k]
return indices[np.argsort(distances[indices])]
Why ADC is fast:
- Query: Compute 8 × 256 = 2,048 subvector distances (one time)
- Per vector: 8 table lookups + 8 additions
- 1M vectors: 8M lookups vs 128M multiplications
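A short end-to-end run of the sketch above (assuming the ProductQuantizer class exactly as defined there, plus NumPy and scikit-learn):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(5000, 32)).astype(np.float32)

pq = ProductQuantizer(dim=32, num_segments=4, num_centroids=16)
pq.train(train)

codes = np.stack([pq.encode(v) for v in train])  # (5000, 4) uint8 codes
query = rng.normal(size=32).astype(np.float32)

top10 = pq.search_pq(query, codes, k=10)  # ADC scan over compressed codes

# Compare against the exact top-10 on the raw vectors
exact = np.argsort(np.linalg.norm(train - query, axis=1))[:10]
print(f"recall@10 vs brute force: {len(set(top10) & set(exact)) / 10:.2f}")
```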
The Core Question Youโre Answering
"How do we compress 512-byte vectors to 128 bytes (or even 8 bytes) without losing search quality?"
This is the central problem of Product Quantization. When you have billions of vectors, memory becomes the primary bottleneck. Loading 512GB of vectors into RAM is expensive; loading 8GB is feasible. But naive compression destroys the distance relationships we need for similarity search.
The insight: Instead of compressing the entire vector as one unit, split it into independent subspaces and quantize each separately. This preserves local structure while achieving massive compression.
Why this matters in production:
- Billion-scale search on single machines: 1B vectors × 512 bytes = 512GB (impossible) vs 1B × 8 bytes = 8GB (feasible)
- Cache efficiency: Compressed codes fit in CPU cache, enabling faster scanning
- Cost reduction: Less RAM = cheaper infrastructure
- IVF-PQ combination: The most common production config in vector databases
Concepts You Must Understand First
Before implementing Product Quantization, ensure you understand these foundational concepts:
1. Vector Quantization (VQ)
- What it is: Replacing continuous values with discrete codes from a codebook
- Classic example: k-means clustering where each point is represented by its cluster centroid
- The tradeoff: Memory savings vs reconstruction error
- Why you need it: PQ is just VQ applied to subvectors instead of full vectors
2. K-Means Clustering
- Lloyd's algorithm: Iterative refinement (assign → update → repeat)
- Codebook: The set of cluster centroids
- Quantization error: Distance between original point and assigned centroid
- Why you need it: Training PQ codebooks is running k-means on subvectors
3. Subspace Decomposition
- Concept: Splitting a 128D vector into 8 × 16D independent subvectors
- Independence assumption: Distance in each subspace contributes independently to total distance
- Why it works: For high-dimensional random vectors, this assumption holds reasonably well
- Why you need it: This is what makes "product" quantization different from regular VQ
4. Asymmetric Distance Computation (ADC)
- Symmetric: Both query and database vectors are quantized
- Asymmetric: Only database vectors are quantized; query remains exact
- Why asymmetric is better: Preserves query accuracy while compressing database
- Why you need it: This is the secret to maintaining search quality with PQ
5. Distance Decomposition
- For squared Euclidean distance: ||q - v||² = Σᵢ ||qᵢ - vᵢ||², where i indexes subspaces
- This enables ADC: Compute distance per subspace, then sum
- Precomputation: Build lookup tables of subspace distances once per query
- Why you need it: This is how PQ achieves fast approximate distance computation
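The decomposition is easy to verify numerically; a minimal sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
q, v = rng.normal(size=128), rng.normal(size=128)

# Sum per-subspace squared distances over M=8 subvectors
M = 8
per_subspace = sum(np.sum((qs - vs) ** 2)
                   for qs, vs in zip(np.split(q, M), np.split(v, M)))
full = np.sum((q - v) ** 2)

assert np.isclose(per_subspace, full)  # exact for squared L2, by definition
```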
Study resources:
- Original PQ paper: Jégou et al., "Product Quantization for Nearest Neighbor Search"
- K-means: Chapter 9 of โPattern Recognition and Machine Learningโ by Bishop
- Vector quantization: โVector Quantization and Signal Compressionโ by Gersho & Gray
Questions to Guide Your Design
1. How many subspaces (M) should I use?
- Tradeoff: More subspaces = better compression, but larger overhead
- Common choices: M = 8, 16, 32, 64
- Rule of thumb: Each subspace should have at least 8-16 dimensions
- Example: For 128D vectors, M=8 gives 16D subspaces; M=16 gives 8D subspaces
2. How many centroids (K) per codebook?
- Determines code size: K=256 requires 8 bits per subspace
- Common choices: K = 256 (8 bits), K = 16 (4 bits), K = 4096 (12 bits)
- Tradeoff: More centroids = better reconstruction, but larger lookup tables
- Memory impact: Each codebook is K × (D/M) × 4 bytes
3. How much training data do I need?
- Minimum: At least 256 × 20 = 5,120 vectors for k-means to converge (if K=256)
- Recommended: 100K-1M vectors for stable codebooks
- Quality: More training data = codebooks better represent the distribution
- Tradeoff: Training time vs codebook quality
4. Symmetric vs Asymmetric distance?
- Symmetric (SDC): Quantize both query and database
- Pros: Simpler, slightly faster
- Cons: Lower accuracy (query quantization error)
- Asymmetric (ADC): Keep query exact, quantize database only
- Pros: Better accuracy (no query error)
- Cons: Requires distance tables
- Recommendation: Always use ADC for better recall
5. How do I balance compression vs accuracy?
- Compression ratio = (D × 4 bytes) / (M × bits_per_code / 8); see the helper sketched after this list
- Example: 128D, M=8, 8 bits → (128×4) / (8×1) = 64× compression
- Accuracy: Measured by recall@k compared to exact search
- Tuning: Increase M or K to improve recall (at cost of compression/speed)
6. Should I normalize vectors before PQ?
- Yes, if using cosine similarity: Normalize to unit length first
- Euclidean distance: Normalization optional but often helps
- Why: Prevents one subspace from dominating distance computation
- How: L2 normalize each vector before training and encoding
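The compression arithmetic from question 5 as a tiny hypothetical helper (the function name is illustrative):

```python
def pq_compression_ratio(dim, m_segments, bits_per_code, bytes_per_float=4):
    """Raw float storage divided by PQ code storage."""
    raw_bytes = dim * bytes_per_float
    code_bytes = m_segments * bits_per_code / 8
    return raw_bytes / code_bytes

print(pq_compression_ratio(128, 8, 8))   # 64.0 -> 512 B down to 8 B
print(pq_compression_ratio(128, 16, 8))  # 32.0 -> better recall, less compression
print(pq_compression_ratio(768, 32, 8))  # 96.0
```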
Thinking Exercise
Manual Product Quantization Walkthrough
Let's manually apply PQ to understand the process:
Given vector (16D for simplicity):
v = [0.5, 0.3, 0.8, 0.2, 0.1, 0.9, 0.4, 0.7, 0.6, 0.2, 0.3, 0.8, 0.9, 0.1, 0.5, 0.4]
Step 1: Split into M=4 subvectors (4D each):
v1 = [0.5, 0.3, 0.8, 0.2]
v2 = [0.1, 0.9, 0.4, 0.7]
v3 = [0.6, 0.2, 0.3, 0.8]
v4 = [0.9, 0.1, 0.5, 0.4]
Step 2: Assume we trained codebooks (K=4 centroids each):
Codebook 1:
c1,0 = [0.1, 0.1, 0.2, 0.1]
c1,1 = [0.5, 0.3, 0.7, 0.2] ← closest to v1
c1,2 = [0.8, 0.7, 0.9, 0.8]
c1,3 = [0.3, 0.5, 0.4, 0.6]
Codebook 2:
c2,0 = [0.1, 0.8, 0.3, 0.6]
c2,1 = [0.2, 0.9, 0.5, 0.7] ← closest to v2
c2,2 = [0.7, 0.3, 0.8, 0.2]
c2,3 = [0.9, 0.1, 0.1, 0.9]
Codebook 3:
c3,0 = [0.2, 0.1, 0.1, 0.3]
c3,1 = [0.6, 0.2, 0.4, 0.7] ← closest to v3
c3,2 = [0.9, 0.8, 0.7, 0.9]
c3,3 = [0.4, 0.5, 0.6, 0.5]
Codebook 4:
c4,0 = [0.1, 0.1, 0.2, 0.1]
c4,1 = [0.4, 0.3, 0.5, 0.3]
c4,2 = [0.8, 0.2, 0.4, 0.5] ← closest to v4
c4,3 = [0.9, 0.9, 0.8, 0.9]
Step 3: Encode vector (find nearest centroid per subspace):
v1 → c1,1 (code 1)
v2 → c2,1 (code 1)
v3 → c3,1 (code 1)
v4 → c4,2 (code 2)
Compressed code: [1, 1, 1, 2] (4 codes, 2 bits each = 1 byte total)
Step 4: Decode to approximate vector:
v' = [c1,1, c2,1, c3,1, c4,2]
   = [0.5, 0.3, 0.7, 0.2, 0.2, 0.9, 0.5, 0.7, 0.6, 0.2, 0.4, 0.7, 0.8, 0.2, 0.4, 0.5]
Step 5: Compute reconstruction error:
||v - v'||² = (0.8-0.7)² + (0.1-0.2)² + (0.4-0.5)² + (0.3-0.4)² + (0.8-0.7)² + (0.9-0.8)² + (0.1-0.2)² + (0.5-0.4)² + (0.4-0.5)²
            = 9 × 0.01
            = 0.09
Step 6: Asymmetric distance to query:
Given query: q = [0.4, 0.2, 0.7, 0.1, 0.3, 0.8, 0.5, 0.6, 0.5, 0.3, 0.2, 0.9, 0.8, 0.2, 0.4, 0.3]
Build distance tables:
Table 1 (distances from q1=[0.4,0.2,0.7,0.1] to each centroid in Codebook 1):
d(q1, c1,0) = 0.59
d(q1, c1,1) = 0.17 ← v1 encoded as code 1
d(q1, c1,2) = 0.97
d(q1, c1,3) = 0.66
Table 2 (distances from q2=[0.3,0.8,0.5,0.6] to each centroid in Codebook 2):
d(q2, c2,0) = 0.28
d(q2, c2,1) = 0.17 ← v2 encoded as code 1
d(q2, c2,2) = 0.81
d(q2, c2,3) = 1.05
Table 3 (distances from q3=[0.5,0.3,0.2,0.9] to each centroid in Codebook 3):
d(q3, c3,0) = 0.71
d(q3, c3,1) = 0.32 ← v3 encoded as code 1
d(q3, c3,2) = 0.81
d(q3, c3,3) = 0.61
Table 4 (distances from q4=[0.8,0.2,0.4,0.3] to each centroid in Codebook 4):
d(q4, c4,0) = 0.76
d(q4, c4,1) = 0.42
d(q4, c4,2) = 0.20 ← v4 encoded as code 2
d(q4, c4,3) = 1.01
Approximate distance:
d(q, v)² ≈ Table1[1]² + Table2[1]² + Table3[1]² + Table4[2]²
         = 0.03 + 0.03 + 0.10 + 0.04
         = 0.20
Key insights:
- Compression: 16D × 4 bytes = 64 bytes → 4 codes × 0.25 bytes = 1 byte (64× compression!)
- Reconstruction error: 0.09 (some information lost)
- Asymmetric distance: Only 4 table lookups + 4 additions per vector
- Codebook storage: 4 codebooks × 4 centroids × 4D × 4 bytes = 256 bytes
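The whole walkthrough can be replayed in a few lines of NumPy (the numbers are exactly those above):

```python
import numpy as np

v = np.array([0.5,0.3,0.8,0.2, 0.1,0.9,0.4,0.7,
              0.6,0.2,0.3,0.8, 0.9,0.1,0.5,0.4])
q = np.array([0.4,0.2,0.7,0.1, 0.3,0.8,0.5,0.6,
              0.5,0.3,0.2,0.9, 0.8,0.2,0.4,0.3])

# The four 4-centroid codebooks from the exercise, shape (M=4, K=4, 4)
codebooks = np.array([
    [[0.1,0.1,0.2,0.1],[0.5,0.3,0.7,0.2],[0.8,0.7,0.9,0.8],[0.3,0.5,0.4,0.6]],
    [[0.1,0.8,0.3,0.6],[0.2,0.9,0.5,0.7],[0.7,0.3,0.8,0.2],[0.9,0.1,0.1,0.9]],
    [[0.2,0.1,0.1,0.3],[0.6,0.2,0.4,0.7],[0.9,0.8,0.7,0.9],[0.4,0.5,0.6,0.5]],
    [[0.1,0.1,0.2,0.1],[0.4,0.3,0.5,0.3],[0.8,0.2,0.4,0.5],[0.9,0.9,0.8,0.9]],
])

subs = v.reshape(4, 4)
codes = [int(np.argmin(np.linalg.norm(cb - s, axis=1)))
         for cb, s in zip(codebooks, subs)]
print(codes)  # [1, 1, 1, 2], matching Step 3

recon = np.concatenate([cb[c] for cb, c in zip(codebooks, codes)])
print(round(float(np.sum((v - recon) ** 2)), 2))  # 0.09, matching Step 5

tables = np.array([np.linalg.norm(cb - s, axis=1)
                   for cb, s in zip(codebooks, q.reshape(4, 4))])
adc = sum(tables[m, c] ** 2 for m, c in enumerate(codes))
print(round(float(adc), 2))  # 0.20, matching Step 6
```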
The Interview Questions They'll Ask
Theory Questions:
- Explain Product Quantization in one sentence.
- Answer: "PQ splits vectors into subspaces, quantizes each subspace independently, and represents vectors as concatenated codes for massive compression with minimal search quality loss."
- Why is it called "product" quantization?
- Answer: The set of possible reconstructed vectors is the Cartesian product of the M codebooks. With M subspaces and K centroids each, you have K^M possible codes (e.g., 256^8 ≈ 1.8 × 10^19 for M=8, K=256).
- What's the compression ratio for 128D float vectors with M=8, K=256?
- Answer: Original: 128 dims × 4 bytes = 512 bytes. Compressed: 8 codes × 1 byte = 8 bytes. Ratio: 64× (98.4% reduction).
- Why use asymmetric distance instead of symmetric?
- Answer: Symmetric quantizes both query and database, introducing error on both sides. Asymmetric keeps the query exact, only quantizing database vectors, which preserves better search accuracy with minimal overhead (building distance tables).
- How does PQ affect recall?
- Answer: Recall decreases because quantization introduces approximation error. Typical recall@10: 85-95% with proper tuning. Increasing M or K improves recall but reduces compression.
Design Questions:
- You have 1B vectors of 512D. Design a PQ configuration.
- Considerations:
- Memory budget: 1B × X bytes (choose X based on RAM)
- M: 512/32 = 16D per subspace with M=32
- K: 256 (8 bits) for good reconstruction
- Result: 1B × 32 bytes = 32GB + codebooks (32 × 256 × 16 × 4 bytes = 512KB)
- Training: Sample 1M vectors for codebook training
- How would you debug low recall in PQ search?
- Check reconstruction error: Decode some vectors and compare to originals
- Visualize codebook coverage: Are centroids well-distributed?
- Check training data: Is it representative of query distribution?
- Try different M: Too few subspaces may lose structure
- Compare ADC vs SDC: Should see higher recall with ADC
- How does PQ integrate with IVF?
- Answer: IVF-PQ combines coarse quantization (IVF for partitioning) with fine quantization (PQ for compression). IVF narrows search to a few partitions (~100-1000), then PQ enables fast scanning within partitions. This is the most common production setup (used by FAISS, Milvus, etc.).
Implementation Questions:
- Walk me through the training process.
- Answer:
- Sample training vectors (e.g., 100K-1M)
- For each subspace m = 0..M-1:
  a. Extract subvectors: subvecs = train[:, m*D/M : (m+1)*D/M]
  b. Run k-means with K centroids
  c. Store centroids as codebook_m
- Save all M codebooks
- How do you implement fast ADC search?
- Answer:
- Precompute distance tables: For each subspace, compute the distance from the query subvector to all K centroids → M × K distances
- For each database vector code [c0, c1, ..., c_{M-1}]:
  a. Look up distances: d = [table[0][c0], table[1][c1], ..., table[M-1][c_{M-1}]]
  b. Sum squared distances: dist = sum(d²)
- Return top-k by dist
- Complexity: O(M×K) table computation + O(N×M) scanning vs O(N×D) brute force
- How would you optimize PQ for cache efficiency?
- Answer:
- Store codes contiguously in memory (good spatial locality)
- Align distance tables to cache lines (64 bytes)
- Use SIMD for parallel table lookups and summation
- Batch queries to amortize table computation
- Interleave codes by subspace for better access patterns
Tradeoff Questions:
- PQ vs scalar quantization (SQ)?
- PQ: Better compression (64×), more complex, requires training
- SQ: Simple (min/max per dimension), moderate compression (4×), no training needed
- Use PQ when: Memory is critical, can afford training time
- Use SQ when: Need simple deployment, moderate compression sufficient
- What happens if query distribution differs from training data?
- Answer: Codebooks may not represent query-relevant regions well, leading to higher quantization error and lower recall. Solutions: Retrain periodically, use larger K, or use query-aware training (sample training data from past queries).
Hints in Layers
Layer 1: Getting Started (Beginner)
- Start with a tiny dataset: 1000 vectors, 16D, M=4, K=16
- Implement basic encoding: Split vector → find nearest centroid per subspace → save codes
- Test reconstruction: Decode codes back to vectors, compute error
- Verify: Reconstruction error should be small but non-zero
Layer 2: Core PQ (Intermediate)
- Implement k-means training: Use sklearn or write simple Lloyd's algorithm
- Train one codebook per subspace on real data (e.g., 10K vectors)
- Encode entire dataset, measure compression ratio
- Implement decoding and visualize reconstruction error distribution
- Verify: Codebooks should have reasonable spread, reconstruction error < 10%
Layer 3: Asymmetric Distance (Intermediate-Advanced)
- Implement distance table computation: Query subvector to all centroids
- Implement ADC search: Table lookups + summation
- Compare with brute force on original vectors: Measure recall@10
- Optimize table computation: Vectorize with NumPy
- Verify: ADC should be 10-100× faster than brute force, recall > 80%
Layer 4: Optimization (Advanced)
- Batch table computation: Compute for multiple queries at once
- Optimize code storage: Use np.uint8 for codes, contiguous arrays
- Add SIMD hints: Align arrays, use numba for JIT compilation
- Implement early termination: Stop after finding k candidates
- Verify: Should achieve <1ms per query on 1M dataset
Layer 5: Production Ready (Expert)
- Add incremental updates: Insert new vectors without full retraining
- Implement codebook persistence: Save/load codebooks efficiently
- Add monitoring: Track recall, latency, compression ratio over time
- Combine with IVF: Partition space, then apply PQ within partitions
- Tune hyperparameters: Grid search over M, K, training size
- Verify: Recall > 90%, latency < 10ms on 10M+ dataset
Debugging tips:
- Low recall? Check if codebooks are well-trained (visualize centroids)
- Slow search? Profile table computation vs scanning (should be scanning-dominated)
- High memory? Verify code dtype is uint8/uint16, not float
- Poor reconstruction? Increase K or M, or normalize vectors before PQ
Books That Will Help
Primary Resource (Deep Dive):
- "Product Quantization for Nearest Neighbor Search" - Jégou, Douze, Schmid (TPAMI 2011)
- The original paper - only 12 pages but incredibly dense
- Sections to focus on: Section III (Product Quantization), Section IV (Asymmetric Distance)
- Contains all the math and theory you need
- Available: IEEE Xplore or search "Jégou PQ paper"
Foundational Books:
- "Vector Quantization and Signal Compression" - Gersho & Gray
- Chapter 1-4: Foundations of quantization theory
- Chapter 6: Vector quantization and k-means
- Why it helps: Provides deep background on quantization, distortion, rate-distortion theory
- "Pattern Recognition and Machine Learning" - Christopher Bishop
- Chapter 9: K-means clustering and EM algorithm
- Chapter 12: PCA and dimensionality reduction
- Why it helps: Builds intuition for clustering and subspace methods
- "Foundations of Multidimensional and Metric Data Structures" - Hanan Samet
- Chapter 4: High-dimensional indexing
- Why it helps: Context for where PQ fits in the landscape of ANN algorithms
Practical Resources:
- FAISS Documentation - Facebook AI Research
- FAISS Wiki on PQ
- Real-world implementation details and tuning guidelines
- C++ code to study for optimization techniques
- "Billion-scale similarity search with GPUs" - Johnson, Douze, Jégou (2017)
- How PQ is used at billion-scale in production
- GPU optimization techniques
- Available: arXiv:1702.08734
Online Tutorials:
- Pinecone Learning Center
- Product Quantization Guide
- Visual explanations and interactive examples
- Great for building intuition before diving into papers
- Towards Data Science Articles
- Search for "Product Quantization" and "Vector Quantization"
- Many practical tutorials with code examples
- Good for seeing different implementation approaches
Research Papers (Advanced):
- "Optimized Product Quantization" - Ge et al. (CVPR 2013)
- Improvements to basic PQ: non-orthogonal subspace decomposition
- Why it helps: Shows how to push PQ further for higher recall
- "Polysemous Codes" - Douze et al. (ECCV 2016)
- Combines PQ codes with binary codes for faster filtering
- Why it helps: State-of-the-art extension of PQ
How to Use These Resources:
- Week 1: Read the Jégou PQ paper (section by section), implement basic encoding/decoding
- Week 2: Study k-means (Bishop Ch 9), implement codebook training
- Week 3: Read ADC section carefully, implement asymmetric distance
- Week 4: Study FAISS code, optimize your implementation
- Week 5+: Explore advanced papers (OPQ, polysemous codes) for extensions
Learning milestones:
- Codebooks train successfully → You understand quantization
- Encoding/decoding works → You understand the representation
- ADC is much faster than brute force → You understand the speedup
- Recall drops gracefully → You understand the accuracy tradeoff
Project 5: HNSW (Hierarchical Navigable Small World)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The "Open Core" Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Graph-Based Search / Skip Lists
- Software or Tool: HNSW Index
- Main Book: "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" - Malkov & Yashunin
What you'll build: A complete HNSW implementation: the state-of-the-art algorithm for vector search, combining navigable small-world graphs with a hierarchical structure for logarithmic search complexity.
Why it teaches HNSW: HNSW is the default choice for most vector databases (Pinecone, Qdrant, Weaviate, pgvector). It offers the best balance of speed, accuracy, and flexibility. Understanding it deeply means understanding modern vector search.
Core challenges youโll face:
- Graph construction → maps to finding good neighbors during insertion
- Hierarchical layers → maps to skip-list-like navigation
- Greedy search → maps to navigating the graph efficiently
- Parameter tuning (M, efConstruction, efSearch) → maps to quality vs speed
Resources for key challenges:
- Pinecone HNSW Guide - Step-by-step explanation
- Zilliz HNSW Tutorial - Visual explanation
- Original HNSW Paper - The source
Key Concepts:
- Small World Graphs: Watts-Strogatz model
- Skip Lists: Probabilistic data structure
- Greedy Search: Local optimization in graphs
- HNSW Paper: Malkov & Yashunin, 2016
Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-2 completed. Understanding of graph data structures.
Real world outcome:
$ ./hnsw_build --dataset sift_1M.npy --M 16 --efConstruction 200
Building HNSW Index:
Dataset: 1,000,000 vectors × 128 dimensions
M (max connections): 16
efConstruction: 200
Construction progress:
Layer 3: 12 nodes (entry points)
Layer 2: 234 nodes
Layer 1: 4,891 nodes
Layer 0: 1,000,000 nodes (all vectors)
Build time: 8m 32s
Index size: 287 MB (graph) + 512 MB (vectors) = 799 MB
Graph statistics:
Avg connections at layer 0: 14.2
Max layer reached: 4
Graph diameter (estimated): 6 hops
$ ./hnsw_search --query query.npy --k 10 --ef 50
HNSW Search (ef=50):
Starting from entry point at layer 3
Layer 3: 2 hops → found node 8,234
Layer 2: 3 hops → found node 12,891
Layer 1: 4 hops → found node 4,521
Layer 0: Expanded 50 candidates
Results:
Rank | Vector ID | Distance
-----|-----------|---------
1    | 4,521     | 0.823
2    | 7,892     | 0.891
...  | ...       | ...
Performance:
Nodes visited: 143
Distance calculations: 143 (vs 1,000,000 brute force)
Time: 0.8ms
Recall@10: 0.98
$ ./hnsw_search --ef-sweep
ef   | Recall@10 | Query Time | Nodes Visited
-----|-----------|------------|--------------
10   | 0.72      | 0.2ms      | 38
20   | 0.88      | 0.4ms      | 67
50   | 0.98      | 0.8ms      | 143
100  | 0.995     | 1.4ms      | 271
200  | 0.999     | 2.6ms      | 523
Implementation Hints:
HNSW key insight: Combine two ideas:
- Navigable Small World (NSW): A graph where you can reach any node in few hops via greedy routing
- Hierarchy (Skip List): Multiple layers where upper layers have fewer nodes, enabling fast coarse search
Layer 2: [A]----------[B]----------[C] (few nodes, long edges)
| | |
Layer 1: [A]---[D]---[B]---[E]---[C] (more nodes)
| | | | |
Layer 0: [A][F][D][G][B][H][E][I][C] (all nodes, short edges)
Core data structures:
import numpy as np
from heapq import heapify, heappush, heappop

class HNSWNode:
    def __init__(self, vector_id, vector, level):
        self.id = vector_id
        self.vector = vector
        self.level = level
        self.neighbors = [[] for _ in range(level + 1)]
        # neighbors[layer] = list of neighbor node ids
class HNSWIndex:
def __init__(self, dim, M=16, ef_construction=200):
self.dim = dim
self.M = M # Max connections per node
self.M_max0 = M * 2 # Max connections at layer 0
self.ef_construction = ef_construction
self.ml = 1 / np.log(M) # Level multiplier
self.nodes = {} # id -> HNSWNode
self.entry_point = None
self.max_level = -1
Insertion algorithm:
def insert(self, vector_id, vector):
# 1. Determine level for new node (probabilistic)
level = self._random_level()
node = HNSWNode(vector_id, vector, level)
if self.entry_point is None:
self.entry_point = node
self.max_level = level
self.nodes[vector_id] = node
return
# 2. Find entry point at each level via greedy search
current = self.entry_point
# 3. Search from top to node's level (greedy, single result)
for lc in range(self.max_level, level, -1):
current = self._search_layer_greedy(vector, current, lc)
# 4. At node's level and below, find ef_construction neighbors
for lc in range(min(level, self.max_level), -1, -1):
candidates = self._search_layer(vector, current, self.ef_construction, lc)
# Select M best neighbors
neighbors = self._select_neighbors(vector, candidates, self.M if lc > 0 else self.M_max0)
# Bidirectional connections
for neighbor in neighbors:
node.neighbors[lc].append(neighbor.id)
neighbor.neighbors[lc].append(vector_id)
# Prune if too many connections
if len(neighbor.neighbors[lc]) > (self.M if lc > 0 else self.M_max0):
neighbor.neighbors[lc] = self._select_neighbors(
neighbor.vector,
[self.nodes[n] for n in neighbor.neighbors[lc]],
self.M if lc > 0 else self.M_max0
)
if len(candidates) > 0:
current = candidates[0] # Closest candidate
# 5. Update entry point if new node is higher
if level > self.max_level:
self.entry_point = node
self.max_level = level
self.nodes[vector_id] = node
def _random_level(self):
    # Standard HNSW rule: level = floor(-ln(U) * mL) with mL = 1/ln(M),
    # which promotes a node past each layer with probability 1/M
    level = int(-np.log(np.random.random()) * self.ml)
    return min(level, 16)
Search algorithm:
def search(self, query, k, ef):
if self.entry_point is None:
return []
current = self.entry_point
# Greedy search from top to layer 1
for lc in range(self.max_level, 0, -1):
current = self._search_layer_greedy(query, current, lc)
# At layer 0, do beam search with ef candidates
candidates = self._search_layer(query, current, ef, 0)
# Return top-k
return sorted(candidates, key=lambda n: distance(query, n.vector))[:k]
def _search_layer(self, query, entry, ef, layer):
    # 'distance' is the shared metric helper; in practice heap entries
    # need an id tie-breaker, since nodes don't compare on equal distances
    visited = {entry.id}
    candidates = [(distance(query, entry.vector), entry)]   # min-heap
    results = [(-distance(query, entry.vector), entry)]     # max-heap (negated)
heapify(candidates)
heapify(results)
while candidates:
dist_c, current = heappop(candidates)
# Stop if closest candidate is worse than worst result
if dist_c > -results[0][0]:
break
for neighbor_id in current.neighbors[layer]:
if neighbor_id in visited:
continue
visited.add(neighbor_id)
neighbor = self.nodes[neighbor_id]
dist_n = distance(query, neighbor.vector)
if dist_n < -results[0][0] or len(results) < ef:
heappush(candidates, (dist_n, neighbor))
heappush(results, (-dist_n, neighbor))
if len(results) > ef:
heappop(results)
return [node for _, node in results]
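Both insert and search above call _search_layer_greedy, which the sketch never defines; a minimal version, assuming the same distance helper, simply hill-climbs until no neighbor improves:

```python
def _search_layer_greedy(self, query, entry, layer):
    # Greedy descent used at the upper layers: repeatedly move to the
    # neighbor closest to the query, stop at a local minimum
    current = entry
    current_dist = distance(query, current.vector)
    improved = True
    while improved:
        improved = False
        for neighbor_id in current.neighbors[layer]:
            neighbor = self.nodes[neighbor_id]
            d = distance(query, neighbor.vector)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current
```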
Parameters explained:
- M: Max connections per node. Higher = more accurate, more memory
- ef_construction: Beam width during build. Higher = better graph, slower build
- ef_search: Beam width during search. Higher = better recall, slower search
The Core Question Youโre Answering
"How does a multi-layer graph enable logarithmic search in high-dimensional spaces?"
This is the heart of HNSW. Unlike flat graphs (NSW) that require examining many nodes, or tree structures that fail in high dimensions, HNSW uses a hierarchical graph structure inspired by skip lists. Upper layers have fewer nodes with long-range connections for coarse navigation, while lower layers have dense connections for fine-grained search.
The key insight: Start at the top sparse layer to quickly traverse the space, then progressively refine your search by moving down layers until you reach layer 0 where all vectors exist. This gives you O(log N) search complexity instead of O(N).
Why this matters for vector databases:
- Enables sub-millisecond search on millions of vectors
- Scales logarithmically rather than linearly
- Maintains high recall (>95%) while being 100x-1000x faster than brute force
- Powers production systems like Pinecone, Weaviate, and Qdrant
Concepts You Must Understand First
Before building HNSW, ensure you understand these foundational concepts:
1. Small-World Networks (Watts-Strogatz Model)
What it is: A graph where most nodes can be reached from any other node in a small number of hops.
Why it matters: HNSW builds on the Navigable Small World (NSW) concept where greedy routing (always moving to the neighbor closest to the target) can find approximate nearest neighbors efficiently.
Key property: High clustering coefficient + short average path length
Learn it:
- Study the Watts-Strogatz model and how random edges create shortcuts
- Understand why social networks are small-world graphs ("six degrees of separation")
- Read about the NSW algorithm (precursor to HNSW)
2. Skip Lists
What it is: A probabilistic data structure that allows O(log n) search in sorted linked lists by maintaining multiple layers of increasingly sparse elements.
Why it matters: HNSW's hierarchical structure is directly inspired by skip lists. Each node is assigned a random layer level, and upper layers contain progressively fewer nodes.
Key property: Probabilistic layer assignment creates logarithmic height
Learn it:
- Implement a basic skip list for 1D data
- Understand probabilistic level assignment: level = floor(-ln(random()) * ml)
- See how search starts at top layer and descends to find elements
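You can sanity-check the level rule by simulation; a small sketch assuming only the standard library:

```python
import math
import random
from collections import Counter

def random_level(ml, max_level=16):
    # level = floor(-ln(U) * ml), the rule HNSW borrows from skip lists
    return min(int(-math.log(random.random()) * ml), max_level)

M = 16
ml = 1 / math.log(M)
counts = Counter(random_level(ml) for _ in range(1_000_000))
print(sorted(counts.items()))  # each level holds roughly 1/M of the level below
```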
3. Navigable Graphs
What it is: A graph where greedy routing (always moving to the best neighbor) reliably finds good paths to targets.
Why it matters: HNSW search works because the graph maintains navigability at each layer.
Key property: At each node, there exists a neighbor that's closer to any query point
Learn it:
- Understand greedy search on graphs (always move to closest neighbor)
- Study Delaunay graphs (naturally navigable but expensive to build)
- Learn why random graphs are NOT navigable
4. Graph Layers and Zoom Levels
What it is: Different levels of granularity in representing the same graph structure.
Why it matters: Upper layers in HNSW act like "zoom out" views with long-range connections, while layer 0 is fully "zoomed in".
Analogy: Like Google Maps - start with country view (layer 3), zoom to city (layer 2), then street (layer 1), then buildings (layer 0)
Questions to Guide Your Design
Before coding, think through these design questions:
1. M Parameter (Max Connections per Node)
- Q: If M=16, each node has max 16 neighbors per layer. How does this affect:
- Graph connectivity?
- Memory usage?
- Search quality?
- Build time?
- Think about: What happens if M is too small (4)? Too large (128)?
- Answer: M controls the tradeoff between recall and memory. Typical values: 12-48.
2. efConstruction (Beam Width During Build)
- Q: During insertion, you search for efConstruction candidates, then select M best. Why not just keep the first M you find?
- Think about: How does exploring more candidates during build affect the final graph quality?
- Answer: Higher efConstruction creates a better-connected graph but slows down insertion. Typical: 100-500.
3. Layer Selection Strategy
- Q: How do you decide which layer a new node belongs to?
- Think about:
- Should it be deterministic or random?
- What probability distribution?
- How does this affect the layer structure?
- Answer: Use exponential decay: P(level = l) ∝ e^(-l * ln(M)). This creates a skip-list-like structure.
4. Neighbor Selection Heuristic
- Q: You have efConstruction=200 candidates but can only keep M=16 neighbors. Which do you pick?
- The 16 closest by distance?
- 16 that maximize diversity?
- 16 that maintain graph connectivity?
- Think about: Pure distance-based selection can create "clusters" that hurt navigability.
- Answer: Use a heuristic that balances closeness and diversity (e.g., avoid neighbors too close to each other).
5. Greedy Routing at Upper Layers
- Q: At layer 3 (few nodes), you do greedy search (single path). At layer 0 (all nodes), you do beam search (ef paths). Why this difference?
- Think about: Computational cost vs accuracy at different layers.
- Answer: Upper layers are for coarse navigation (speed), layer 0 is for finding actual neighbors (accuracy).
Thinking Exercise: Trace Search by Hand
Setup: Imagine a 2D dataset (for visualization) with 1000 points and this HNSW structure:
- Layer 2: 4 nodes
- Layer 1: 40 nodes
- Layer 0: 1000 nodes (all points)
- M = 8, ef = 20
Exercise: Trace this search for query Q at position (7.5, 3.2):
Layer 2: [A:(0,0), B:(10,0), C:(0,10), D:(10,10)]
Entry point: A
Query Q: (7.5, 3.2)
Step-by-step:
- Layer 2 - Greedy Search:
- Current: A (0,0), distance to Q = 8.2
- Neighbors of A at layer 2: [B, C]
- Check B: (10,0), distance = 4.1 → closer!
- Check C: (0,10), distance = 10.1 → worse
- Move to B. Check B's neighbors: [A, D]
- Check D: (10,10), distance = 7.2 → worse than current
- Result: Best node at layer 2 = B (10, 0)
- Layer 1 - Greedy Search:
- Start from B's layer 1 representation
- B's neighbors at layer 1: [B1:(9,0), B2:(10,1), ...]
- Greedy walk until local minimum
- Result: Best node at layer 1 = (8, 3)
- Layer 0 - Beam Search:
- Start from (8, 3)
- Maintain ef=20 candidates
- Explore neighbors in a priority queue
- Keep top-20 closest candidates
- Result: Final 10 nearest neighbors
Key observations:
- Each layer narrows the search region
- Upper layers: O(1) hops because few nodes
- Lower layers: O(log N) hops
- Total complexity: O(log N)
Your task: Draw this scenario on paper. Create 4 nodes at layer 2, 40 at layer 1 (include the 4), and mark query Q. Trace the greedy path down the layers.
The Interview Questions Theyโll Ask
If you understand HNSW deeply, you should be able to answer these questions:
Theory Questions
- "Explain why HNSW has O(log N) search complexity while NSW is O(N)."
- Expected answer: The hierarchical structure reduces the search space at each layer. Upper layers have fewer nodes (sparse), so greedy search reaches the target region in O(1) hops. Descending layers increases nodes but maintains logarithmic hops. Without hierarchy (NSW), you must traverse a flat dense graph.
- "What happens if you set M=2? What about M=128?"
- M=2: Graph becomes a chain-like structure, poor connectivity, searches fail to find good paths
- M=128: Excellent connectivity and recall, but massive memory overhead and slower search (more neighbors to check)
- "Why use exponential decay for layer assignment instead of uniform random?"
- Exponential decay creates a skip-list structure where upper layers have exponentially fewer nodes. This ensures O(log N) layers. Uniform random would create O(N) layers or unbalanced structures.
- "HNSW vs IVF: when would you choose each?"
- HNSW: Better recall, consistent latency, online updates, smaller datasets (<100M)
- IVF: Better for billion-scale, lower memory (with PQ), batch workloads, GPU acceleration
Implementation Questions
- "How do you handle the entry point initialization?"
- First node becomes entry point at its random level
- Each new node that reaches a higher level becomes the new entry point
- Entry point is the starting node for all searches
- "Explain the pruning step during insertion."
- When adding bidirectional edge, neighbor might exceed M connections
- Must prune to M using selection heuristic (keep diverse neighbors)
- Without pruning, memory grows unbounded
- "What's the difference between efConstruction and efSearch?"
- efConstruction: beam width during index building (affects graph quality)
- efSearch: beam width during query (affects recall/speed tradeoff)
- Higher efConstruction → better graph → can use lower efSearch
- "How would you make HNSW distributed?"
- Partition vectors by ID or cluster
- Each shard builds local HNSW
- Search: scatter query to all shards, gather top-k from each, merge
- Challenge: no global entry point
Debugging Questions
- "Your recall is only 60% but hnswlib gets 95%. What's wrong?"
- Check: Are you doing bidirectional connections?
- Check: Is neighbor selection too greedy? (need diversity)
- Check: Is efConstruction too low?
- Check: Is distance calculation correct?
- "Build time is 10x slower than expected. Where's the bottleneck?"
- Profile: Is _search_layer called too often?
- Check: Is efConstruction too high?
- Check: Are you using a heap for candidate selection?
- Check: Distance computation vectorized?
Hints in Layers
Build your understanding progressively:
Layer 1: Build a Flat Navigable Graph (NSW)
Goal: Understand greedy graph search before adding hierarchy.
Start here:
class NSWGraph:
def __init__(self, M=16):
self.M = M
self.nodes = {}
self.entry_point = None
def insert(self, vector_id, vector):
# Find M nearest neighbors via greedy search
# Add bidirectional edges
# That's it! No layers yet.
What you'll learn:
- Greedy search on graphs
- Bidirectional edge maintenance
- Why M matters for connectivity
Success criteria: Can search 10K vectors with 70-80% recall
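To make the skeleton concrete, here is one minimal NSW sketch assuming NumPy, using a greedy walk to find the link candidates as the hint describes (all names are illustrative):

```python
import numpy as np

def distance(a, b):
    return float(np.linalg.norm(a - b))

class NSWGraph:
    """Flat navigable graph: greedy routing, bidirectional edges, no layers."""

    def __init__(self, M=16):
        self.M = M
        self.vectors = {}    # id -> vector
        self.neighbors = {}  # id -> list of neighbor ids

    def greedy_search(self, query, entry):
        current, best = entry, distance(self.vectors[entry], query)
        while True:
            if not self.neighbors[current]:
                return current
            nxt = min(self.neighbors[current],
                      key=lambda i: distance(self.vectors[i], query))
            d = distance(self.vectors[nxt], query)
            if d >= best:
                return current
            current, best = nxt, d

    def insert(self, vid, vector):
        links = []
        if self.vectors:
            entry = next(iter(self.vectors))
            near = self.greedy_search(vector, entry)
            # Link to the found node and its closest neighbors (approximate)
            cands = sorted({near, *self.neighbors[near]},
                           key=lambda i: distance(self.vectors[i], vector))
            links = cands[:self.M]
        self.vectors[vid] = vector
        self.neighbors[vid] = list(links)
        for n in links:  # bidirectional edges
            self.neighbors[n].append(vid)
```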
Layer 2: Add the Hierarchy (Skip List Structure)
Goal: Add probabilistic layers like skip lists.
Add this:
import random

def _random_level(self):
    level = 0
    # Promote past each layer with probability 1/M (geometric levels)
    while random.random() < 1 / self.M and level < 16:
        level += 1
    return level
class HNSWNode:
def __init__(self, vector_id, vector, level):
self.level = level
self.neighbors = [[] for _ in range(level + 1)]
What you'll learn:
- How layer assignment creates hierarchy
- Multi-layer graph structure
- Entry point management
Success criteria: Can build multi-layer graph, verify layer distribution
Layer 3: Implement Layer-by-Layer Search
Goal: Start at top, greedily descend to layer 0.
Add this:
def search(self, query, k, ef):
current = self.entry_point
# Greedy at upper layers
for lc in range(self.max_level, 0, -1):
current = self._search_layer_greedy(query, current, lc)
# Beam search at layer 0
candidates = self._search_layer(query, current, ef, 0)
return top_k(candidates)
What you'll learn:
- Why greedy search at top layers
- Why beam search at layer 0
- How entry point guides search
Success criteria: Search is faster than flat NSW, recall still 70-80%
Layer 4: Add Pruning and Selection Heuristics
Goal: Improve graph quality with smart neighbor selection.
Add this:
def _select_neighbors_heuristic(self, base_vector, candidates, M):
    # Don't just pick the M closest: pick M that maximize diversity
    # and avoid redundancy (too_close_to_any is an assumed helper, e.g.
    # "closer to an already-selected neighbor than to the base vector")
    selected = []
    for candidate in sorted(candidates,
                            key=lambda c: distance(base_vector, c.vector)):
        if len(selected) >= M:
            break
        # Check: is candidate too close to one already selected?
        if not too_close_to_any(candidate, selected):
            selected.append(candidate)
    return selected
What you'll learn:
- Why pure distance is insufficient
- How diversity improves navigability
- Graph connectivity vs local clustering
Success criteria: Recall jumps to 90-95%
Layer 5: Optimize and Match Production Quality
Goal: Make it fast and robust.
Add this:
- Vectorize distance computations
- Use heapq for candidate management
- Pre-allocate numpy arrays
- Add build progress monitoring
- Parameter validation
Success criteria: Match hnswlib within 2x on speed and 99% on recall
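For the Layer 5 target, hnswlib makes a convenient reference implementation; a small benchmark sketch assuming NumPy and the hnswlib package:

```python
import numpy as np
import hnswlib

dim, n, k = 64, 20_000, 10
rng = np.random.default_rng(0)
data = rng.normal(size=(n, dim)).astype(np.float32)
queries = data[:100]

index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(50)

labels, _ = index.knn_query(queries, k=k)

# Exact ground truth per query for recall measurement
gt = np.stack([np.argsort(np.linalg.norm(data - q, axis=1))[:k] for q in queries])
recall = np.mean([len(set(l) & set(g)) / k for l, g in zip(labels, gt)])
print(f"hnswlib recall@{k}: {recall:.3f}")  # run your index on the same data
```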
Books That Will Help
Primary Resources
- "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" - Malkov & Yashunin (2016)
- The original HNSW paper
- Read sections: Abstract, Algorithm 1 (insertion), Algorithm 2 (search)
- Skip: Heavy theory sections on first read
- Focus on: Figure 1 (layer structure), Algorithm descriptions
- "Searching in Metric Spaces" - Zezula, Amato, Dohnal, Batko
- Chapter on navigable small worlds
- Background on metric space searching
- Focus on: NSW foundations, greedy routing
Supplementary Materials
- Skip Lists papers
- "Skip Lists: A Probabilistic Alternative to Balanced Trees" - Pugh (1990)
- Understanding probabilistic level assignment
- See parallels to HNSW layer selection
- Small-World Networks
- "Collective dynamics of 'small-world' networks" - Watts & Strogatz (1998)
- Understand why small-world graphs enable efficient navigation
- Key insight: High clustering + short path length
Practical Guides
- Pinecone Learning Center - HNSW
- URL: https://www.pinecone.io/learn/series/faiss/hnsw/
- Best practical explanation with visualizations
- Focus on: Layer-by-layer search visualization
- hnswlib Source Code
- URL: https://github.com/nmslib/hnswlib
- Production C++ implementation
- Focus on: hnswlib/hnswalg.h - core algorithms
- Compare your Python to their optimized C++
- FAISS Source Code
- URL: https://github.com/facebookresearch/faiss
- File: faiss/IndexHNSW.cpp
Video Resources
- "Billion-scale approximate nearest neighbor search" - Matthijs Douze (Facebook AI)
- YouTube search for FAISS talks
- Understanding HNSW in context of other methods
- "Vector Search & HNSW" - Pinecone
- Explains intuition before math
- Good for visual learners
Papers for Deep Dive
- "Navigating the Landscape of Metric Space Searching" - Malkov et al.
- Comprehensive comparison of graph-based methods
- See how NSW evolved to HNSW
Reading Order:
- Start with Pinecone guide (intuition)
- Read HNSW paper abstract + algorithms
- Implement Layer 1-2 from hints
- Read Skip Lists paper (understand probabilistic layers)
- Implement Layer 3-4
- Study hnswlib source code (optimizations)
- Read Small-World Networks paper (why it works)
Learning milestones:
- Hierarchical structure builds correctly → You understand the layers
- Greedy search finds good paths → You understand navigation
- Recall matches hnswlib → You've implemented correctly
- Parameter tuning works → You understand the tradeoffs
Project 6: IVF-PQ Composite Index
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The "Open Core" Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Combined Indexing / Large-Scale Search
- Software or Tool: IVF-PQ Index
- Main Book: "Billion-scale similarity search with GPUs" - Johnson et al.
What you'll build: A composite IVF-PQ index that combines clustering (IVF) with compression (PQ) for billion-scale search with minimal memory usage.
Why it teaches composite indexes: No single technique is enough at scale. IVF-PQ combines the best of both worlds: IVF reduces search scope, PQ reduces memory. This is how FAISS handles billion-vector datasets.
Core challenges youโll face:
- Training pipeline → maps to IVF clustering then PQ on residuals
- Residual encoding → maps to PQ on (vector - centroid)
- Search optimization → maps to precomputed tables + SIMD
- Memory layout → maps to efficient storage for cache
Key Concepts:
- Residual Quantization: PQ on the difference from centroid (see the sketch after this list)
- Two-Stage Training: IVF first, then PQ
- ADC with IVF: Combining both speedups
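Why residuals help is easy to see on clustered data; a toy sketch of the two-stage pipeline assuming NumPy and scikit-learn (a Project 4-style ProductQuantizer would then be trained on the residuals):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic clustered data: 64 well-separated centers plus small noise
rng = np.random.default_rng(0)
centers = rng.normal(size=(64, 32)) * 5
assign = rng.integers(0, 64, size=10_000)
vectors = (centers[assign] + rng.normal(size=(10_000, 32)) * 0.5).astype(np.float32)

# Stage 1: IVF coarse quantizer
ivf = KMeans(n_clusters=64, n_init=4, random_state=0).fit(vectors)
labels = ivf.labels_

# Stage 2: residuals relative to each vector's own centroid
residuals = vectors - ivf.cluster_centers_[labels]

# Residuals have far smaller magnitude than raw vectors, so the same
# PQ codebook budget spends its precision where it matters
print(np.linalg.norm(vectors, axis=1).mean())    # large
print(np.linalg.norm(residuals, axis=1).mean())  # much smaller
```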
Learning milestones:
- Two-stage training works → You understand the pipeline
- Residual encoding improves accuracy → You understand why it helps
- Memory reduction is massive → You understand the compression
- Search matches FAISS performance → You've implemented correctly
Project 7: Disk-Based Vector Index (DiskANN-style)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The โOpen Coreโ Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Storage Systems / SSD Optimization
- Software or Tool: Disk-Based Vector Index
- Main Book: "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node" - Microsoft Research
What you'll build: A vector index that stores vectors on SSD with an in-memory graph for navigation, enabling billion-scale search on a single commodity machine.
Why it teaches disk-based search: Memory is expensive. Storing 1B vectors in RAM requires 500GB+. DiskANN-style indexes keep only the graph in memory (~10GB) and fetch vectors from SSD during search, reducing cost by 50x.
Core challenges youโll face:
- Graph fits in memory, vectors on disk → maps to memory-mapped I/O
- Minimizing disk reads → maps to graph design for locality
- SSD-friendly access patterns → maps to sequential vs random reads
- Beam search with async I/O → maps to pipelining disk access
Key Concepts:
- DiskANN Paper: Microsoft Research, 2019
- Memory-Mapped Files: OS-level I/O optimization
- SSD Performance: Random vs sequential read patterns
Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Project 5 completed. C++ proficiency, understanding of file I/O.
Real world outcome:
$ ./diskann_build --dataset deep_1B.npy --memory-limit 16GB
Building DiskANN Index:
Dataset: 1,000,000,000 vectors × 96 dimensions
Vector data size: 384 GB (will be on disk)
Memory limit: 16 GB (for graph)
Building graph with memory-efficient construction...
Pass 1: Sample-based graph initialization
Pass 2: Full graph refinement (streaming)
Pass 3: Graph compaction
Index files:
vectors.bin: 384 GB (memory-mapped)
graph.bin: 12 GB (in-memory)
metadata.json: 1 KB
$ ./diskann_search --query query.npy --k 10 --beam-width 50
DiskANN Search:
Graph navigation: 0.4ms (in-memory)
Disk reads: 23 vectors fetched
SSD latency: 2.1ms (avg 0.09ms per read)
Total: 2.5ms
Performance analysis:
Graph hops: 8
Candidate set size: 50
Disk reads: 23 (< beam width due to caching)
Cache hits: 12
Recall@10: 0.96
$ ./diskann_search --benchmark --queries 10000
Benchmark (10K queries):
Mean latency: 2.8ms
P99 latency: 5.2ms
Throughput: 357 queries/second
SSD bandwidth: 1.2 GB/s
Implementation Hints:
Architecture:
+------------------------------------------+
|             In-Memory Graph              |
|   (Node ID -> [Neighbor IDs], 12 GB)     |
+------------------------------------------+
|          Memory-Mapped Vectors           |
|   (Node ID -> Vector, 384 GB on SSD)     |
+------------------------------------------+
Core structures:
#include <cstdint>
#include <vector>
using namespace std;

struct DiskANNIndex {
    // In-memory: graph only
    vector<vector<uint32_t>> graph;  // Neighbor lists
    // Memory-mapped: vectors on disk
    float* vectors;                  // mmap'd from file
    size_t num_vectors;
    size_t dim;
    // Caching (LRUCache is an assumed helper type, not a std container)
    LRUCache<uint32_t, vector<float>> vector_cache;
};
Memory-mapping vectors:
#include <fcntl.h>     // open, O_RDONLY
#include <sys/mman.h>  // mmap, madvise

void open_vectors(const string& path) {
    int fd = open(path.c_str(), O_RDONLY);
    size_t file_size = num_vectors * dim * sizeof(float);
    vectors = (float*)mmap(
        nullptr,
        file_size,
        PROT_READ,
        MAP_PRIVATE,  // private, copy-on-write mapping
        fd,
        0
    );
    // Advise OS that access will be random, not sequential
    madvise(vectors, file_size, MADV_RANDOM);
}
vector<float> get_vector(uint32_t id) {
// Check cache first
if (auto cached = vector_cache.get(id)) {
return *cached;
}
// Load from mmap (may trigger disk read)
float* ptr = vectors + id * dim;
vector<float> vec(ptr, ptr + dim);
vector_cache.put(id, vec);
return vec;
}
Search with prefetching:
vector<pair<uint32_t, float>> search(
    const vector<float>& query,
    size_t k,
    size_t beam_width
) {
    // candidates: closest-first frontier (distances negated so the
    // default max-heap behaves like a min-heap)
    priority_queue<pair<float, uint32_t>> candidates;
    // results: max-heap by distance, so top() is the current WORST result
    priority_queue<pair<float, uint32_t>> results;
    unordered_set<uint32_t> visited;
    // Start from entry point
    auto entry_vec = get_vector(entry_point);
    float dist = distance(query, entry_vec);
    candidates.push({-dist, entry_point});
    results.push({dist, entry_point});
    visited.insert(entry_point);
    while (!candidates.empty()) {
        auto [neg_dist, current] = candidates.top();
        candidates.pop();
        // Stop when the closest unexplored candidate is worse than the
        // worst result we are keeping
        if (-neg_dist > results.top().first && results.size() >= beam_width) {
            break;
        }
        // Prefetch neighbor vectors before touching them
        auto& neighbors = graph[current];
        for (uint32_t neighbor : neighbors) {
            // Hint to OS to prefetch this page
            prefetch(&vectors[neighbor * dim]);
        }
        // Process neighbors
        for (uint32_t neighbor : neighbors) {
            if (visited.count(neighbor)) continue;
            visited.insert(neighbor);
            auto neighbor_vec = get_vector(neighbor);
            float d = distance(query, neighbor_vec);
            if (d < results.top().first || results.size() < beam_width) {
                candidates.push({-d, neighbor});
                results.push({d, neighbor});
                if (results.size() > beam_width) {
                    results.pop();  // evict the current worst result
                }
            }
        }
    }
    // Drain the max-heap (worst first), reverse, keep the k best
    vector<pair<uint32_t, float>> top_k;
    while (!results.empty()) {
        top_k.push_back({results.top().second, results.top().first});
        results.pop();
    }
    reverse(top_k.begin(), top_k.end());
    if (top_k.size() > k) top_k.resize(k);
    return top_k;
}
Graph construction (memory-efficient):
// R = max degree, L = candidate list size during construction.
// random_neighbors, load_chunk, beam_search, and select_diverse_neighbors
// are assumed helpers in this sketch.
void build_graph(const string& vectors_path, size_t R, size_t L) {
    // Initialize with random graph
for (size_t i = 0; i < num_vectors; i++) {
graph[i] = random_neighbors(R);
}
// Iterative refinement (process vectors in chunks to limit memory)
size_t chunk_size = 1000000; // 1M vectors at a time
for (size_t offset = 0; offset < num_vectors; offset += chunk_size) {
size_t end = min(offset + chunk_size, num_vectors);
// Load chunk into memory
vector<vector<float>> chunk_vectors = load_chunk(vectors_path, offset, end);
// For each vector in chunk, improve its neighbors
for (size_t i = offset; i < end; i++) {
auto candidates = beam_search(chunk_vectors[i - offset], L);
graph[i] = select_diverse_neighbors(candidates, R);
// Add reverse edges
for (uint32_t neighbor : graph[i]) {
if (graph[neighbor].size() < R) {
graph[neighbor].push_back(i);
}
}
}
}
}
The Core Question You're Answering:
"How do we search billions of vectors that don't fit in RAM?"
This is the fundamental scaling challenge in vector search. A billion 96-dimensional float vectors require 384GB of RAM just for storage, plus additional memory for the index structure. At $10/GB/month for cloud RAM, that's $3,840/month just for storage. DiskANN solves this by keeping only the navigation graph (~10-20GB) in memory and storing vectors on SSD (~$0.10/GB/month), reducing costs by 50x while maintaining sub-10ms search latency.
The insight: You donโt need all vectors in RAM. You only need the graph structure that tells you which vectors to examine. With careful graph design and SSD-aware access patterns, you can achieve 95%+ recall with just 20-50 disk reads per query.
Concepts You Must Understand First:
Before building DiskANN, you need to understand:
- SSD Performance Characteristics:
- Sequential read: 3-7 GB/s (fast)
- Random read: 500K-1M IOPS, ~50-100 µs latency per read (slower)
- Page granularity: SSDs read in 4KB pages, not individual bytes
- Why it matters: Reading 96 floats (384 bytes) costs the same as reading 4KB. You pay for the whole page.
- Memory-Mapped Files (mmap):
- OS maps file content to virtual memory addresses
- Reading a mapped address triggers a page fault → the OS loads the page from disk → returns data
- OS manages the page cache automatically (LRU eviction)
- Why it matters: You get disk-backed arrays without manual I/O code (see the sketch after this list).
- Vamana Graph Structure:
- Undirected graph with bounded degree R (typically 64-128)
- Built to maximize navigability: short path to any target
- Pruning strategy maintains both connectivity and diversity
- Why it matters: Good graph structure = fewer disk reads per search.
- Beam Search Navigation:
- Maintains beam_width candidates (e.g., 50-100) during search
- Explores candidates in best-first order
- Greedy: expands most promising nodes first
- Why it matters: Limits number of nodes visited = limits disk reads.
- Caching Strategies:
- Hot vectors (entry points, frequently visited nodes) should stay in cache
- LRU cache for recently accessed vectors
- Pre-warming: load entry point and high-degree nodes at startup
- Why it matters: Cache hit = no disk read (0.1ms saved per hit).
- Async I/O and Prefetching:
- Issue prefetch hints for upcoming neighbor accesses
- OS can fetch multiple pages in parallel
- Overlaps computation with I/O latency
- Why it matters: Can reduce total latency by 30-50%.
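The mmap-plus-cache combination above is small enough to sketch. A minimal Python version using numpy's memmap; the file layout (one float32 row per vector) and every name here are illustrative, not DiskANN's real format:
from collections import OrderedDict
import numpy as np

class MmapVectorStore:
    def __init__(self, path, num_vectors, dim, cache_size=100_000):
        # np.memmap gives a disk-backed array; reads fault pages in on demand
        self.vecs = np.memmap(path, dtype=np.float32, mode="r",
                              shape=(num_vectors, dim))
        self.cache = OrderedDict()        # id -> np.ndarray, in LRU order
        self.cache_size = cache_size

    def get(self, vid):
        if vid in self.cache:
            self.cache.move_to_end(vid)   # cache hit: mark as recently used
            return self.cache[vid]
        vec = np.array(self.vecs[vid])    # copy out; may trigger a page fault
        self.cache[vid] = vec
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used
        return vec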
Questions to Guide Your Design:
Ask yourself these questions while building:
- Page Size and Alignment:
- Should vectors be aligned to 4KB page boundaries?
- Should you pack multiple small vectors per page?
- How does dimension affect packing efficiency?
- Graph Degree R:
- Higher R = more neighbors = better recall but more disk reads
- Lower R = faster search but may get stuck in local minima
- What's the sweet spot for your dataset?
- Beam Width:
- Larger beam = more candidates = more disk reads but better recall
- Smaller beam = faster but may miss the target
- How does beam width affect the recall/latency tradeoff?
- What to Cache:
- Entry point (always accessed)?
- High-degree nodes (hubs in the graph)?
- Recently accessed vectors (temporal locality)?
- What cache size gives diminishing returns?
- Graph Construction:
- Can you build the graph without loading all vectors into RAM?
- Can you process vectors in streaming fashion?
- How do you avoid out-of-memory during construction?
- Disk Access Patterns:
- Are your disk reads sequential or random?
- Can you batch reads to improve throughput?
- Are you triggering unnecessary page faults?
Thinking Exercise:
Before coding, work through these calculations:
- Memory Budget:
- 1B vectors ร 96 dims ร 4 bytes = 384 GB (vectors on disk)
- 1B nodes ร R neighbors ร 4 bytes (node ID)
- If R=64: 1B ร 64 ร 4 = 256 GB (graph structure)
- That's already over your 16GB memory limit!
- Key insight: The full graph can't live in RAM either. DiskANN stores each node's neighbor list on SSD next to its vector and keeps only compressed vectors (PQ codes) and hot nodes in memory.
- Disk Reads Per Query:
- Beam width = 50
- Assume 8 graph hops to reach target
- Worst case: 50 candidates ร 8 hops = 400 disk reads
- But! Many candidates share neighbors → actual reads ≈ 100-150
- With caching: reduce to 20-50 unique disk reads
- At 0.1ms per read: 2-5ms disk latency
- Page Fault Analysis:
- One vector = 96 floats ร 4 bytes = 384 bytes
- SSD page = 4KB
- Vectors per page: 4096 / 384 ≈ 10 vectors
- If vectors are randomly placed: 1 page fault per vector
- If vectors are co-located by graph proximity: 1 page fault per 10 vectors (10x speedup!)
- Cache Size Impact:
- 1GB cache = ~2.7M vectors (at 384 bytes each)
- If top 0.1% of vectors account for 20% of accesses (Zipf distribution)
- Caching 1M vectors (0.1%) could eliminate 20% of disk reads
- Diminishing returns after ~2-4GB cache
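These estimates are easy to sanity-check. A small Python script that reproduces the arithmetic above (constants are the exercise's assumptions; GB here means 10^9 bytes, matching the 384 GB figure):
n, dim, R = 1_000_000_000, 96, 64
GB = 10**9

print(f"vectors on disk: {n * dim * 4 / GB:.0f} GB")        # 384 GB
print(f"graph structure (R=64): {n * R * 4 / GB:.0f} GB")   # 256 GB

beam, hops, read_ms = 50, 8, 0.1
print(f"worst-case reads per query: {beam * hops}")         # 400
for reads in (150, 50, 20):   # shared neighbors, then caching
    print(f"{reads} reads -> {reads * read_ms:.1f} ms of disk latency")

page_bytes, vec_bytes = 4096, dim * 4
print(f"vectors per 4KB page: {page_bytes // vec_bytes}")   # 10
print(f"vectors in a 1GB cache: {2**30 // vec_bytes:,}")    # ~2.8M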
The Interview Questions They'll Ask:
Practice answering these:
- "Why use DiskANN instead of HNSW for billion-scale search?"
- HNSW requires all vectors in RAM → 384GB for 1B vectors → $3,840/month
- DiskANN uses 10-20GB RAM + 400GB SSD → ~$200/month (19x cheaper)
- Latency: HNSW 1-2ms, DiskANN 3-8ms (3-4x slower but acceptable)
- When to use: DiskANN when cost matters, HNSW when latency is critical
- "How do you minimize disk reads during search?"
- Good graph structure: short paths (6-8 hops) to any target
- Beam search: prune candidates aggressively
- Caching: keep hot vectors (entry points, hubs) in memory
- Prefetching: hint OS to load neighbors in parallel
- Result: 20-50 disk reads per query instead of 400+
- "What's the difference between random and sequential SSD access?"
- Sequential: 3-7 GB/s (reading contiguous blocks)
- Random: 500K IOPS × 4KB = 2 GB/s (jumping around)
- DiskANN is random access by nature (graph navigation)
- Mitigation: Pre-sort vectors by graph proximity, batch reads
- "How do you build the graph without running out of memory?"
- Donโt load all vectors at once
- Process in chunks (1M vectors per iteration)
- For each chunk: improve graph connections using beam search
- Multiple passes: gradually refine graph quality
- Final graph fits in memory, but construction is streaming
- "What happens if your cache is too small?"
- More cache misses → more disk reads → higher latency
- If cache < entry point + critical nodes: every query hits disk
- Sweet spot: 1-4GB (0.1-0.4% of vectors) captures 20-40% of accesses
- Beyond 10GB: diminishing returns (<5% reduction in disk reads)
- "How do memory-mapped files work?"
- OS maps file to virtual address range
- Reading address triggers page fault if page not in RAM
- OS loads 4KB page from disk into RAM
- Subsequent reads in same page are free (already in RAM)
- OS evicts cold pages using LRU policy
Hints in Layers (Reveal When Stuck):
Layer 1 - Start Simple: Build in-memory HNSW first
- Start with Project 5's HNSW implementation
- Verify it works: build index, search, measure recall
- Understand the baseline: how much RAM does it use?
- This is your reference point for correctness
Layer 2 - Add Memory Mapping: Move vectors to disk
- Keep graph in memory, move vectors to memory-mapped file
- Use mmap() to map the vector file into the address space
- Measure: how many page faults per query? (use perf stat)
- You should see ~50-200 page faults per query initially
Layer 3 - Add Caching: Reduce disk reads
- Implement LRU cache for vectors (1-2GB)
- Cache the entry point and high-degree nodes at startup
- Measure cache hit rate: aim for 30-50%
- This should reduce latency by 30-40%
Layer 4 - Optimize Graph: Improve locality
- Re-order vector IDs by graph proximity (BFS from entry point)
- Vectors that are neighbors in graph should be close in file
- This improves page reuse: 10 vectors per page instead of 1
- Should reduce page faults by 5-10x
Layer 5 - Add Prefetching: Overlap I/O with computation
- Use madvise(MADV_WILLNEED) to hint upcoming accesses
- When visiting node N, prefetch all neighbors of N
- Measure: latency should drop 20-30% as CPU and I/O overlap
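Layer 5's hint can be tried directly from Python (3.8+ exposes madvise on mmap objects; MADV_WILLNEED is Linux-specific). A sketch against a hypothetical vectors.bin file, with offsets rounded down to page boundaries:
import mmap, os

DIM = 96
VEC_BYTES = DIM * 4                               # float32 vectors
PAGE = mmap.PAGESIZE

fd = os.open("vectors.bin", os.O_RDONLY)          # hypothetical vector file
mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)        # map the whole file

def prefetch(neighbor_ids):
    # Ask the kernel to start loading these pages before we touch them
    for vid in neighbor_ids:
        start = (vid * VEC_BYTES // PAGE) * PAGE  # round down to page start
        mm.madvise(mmap.MADV_WILLNEED, start, PAGE)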
Layer 6 - Compression (Advanced): Reduce disk bandwidth
- Store vectors in compressed format (Product Quantization)
- Load compressed vectors (1/8 size), decompress in CPU cache
- Final reranking: fetch full-precision vectors for top-K
- This is how production systems (Bing, Meta) scale to 10B+ vectors
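Layer 6's two-pass idea in miniature. This is a hedged sketch, not a real Product Quantization implementation: pq_distance and full_store stand in for a trained codebook and the on-disk vector file:
import numpy as np

def search_with_rerank(query, candidate_ids, pq_distance, full_store,
                       k=10, rerank_factor=4):
    # Cheap pass: rank all candidates by approximate (compressed) distance
    shortlist = sorted(candidate_ids, key=lambda vid: pq_distance(query, vid))
    shortlist = shortlist[:k * rerank_factor]
    # Expensive pass: exact distance, one full-precision fetch per survivor
    exact = [(vid, float(np.linalg.norm(query - full_store.get(vid))))
             for vid in shortlist]
    exact.sort(key=lambda pair: pair[1])
    return exact[:k]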
Debugging Tips:
- Use iostat to monitor disk I/O: are you reading more than expected?
- Use perf stat to count page faults: should be ~20-50 per query
- Profile with perf record: where is time spent? (should be I/O wait)
- Verify graph connectivity: run BFS from entry point, ensure all nodes reachable
Books That Will Help:
- "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node" - Subramanya et al., NeurIPS 2019
- The original paper (30 pages)
- Covers Vamana graph, graph construction, search algorithm
- Read sections 3-4 for graph design, section 5 for implementation
- Available free at: https://papers.nips.cc/paper/2019/hash/09853c7fb1d3f8ee67a61b6bf4a7f8e8-Abstract.html
- "Operating Systems: Three Easy Pieces" - Remzi Arpaci-Dusseau
- Chapter 13: Address Spaces
- Chapter 36: I/O Devices (SSD characteristics)
- Chapter 39: Files and Directories (memory-mapped I/O)
- Free online: https://pages.cs.wisc.edu/~remzi/OSTEP/
- "Database Internals" - Alex Petrov (O'Reilly, 2019)
- Chapter 4: Implementing B-Trees (page-based storage)
- Chapter 5: Transaction Processing and Recovery (disk I/O patterns)
- Shows how databases manage disk-backed data structures
- "Systems Performance" - Brendan Gregg (Pearson, 2020)
- Chapter 8: File Systems (page cache, read-ahead)
- Chapter 9: Disks (SSD internals, I/O scheduling)
- Essential for understanding I/O bottlenecks and profiling
- "The Art of Computer Systems Performance Analysis" - Raj Jain (Wiley, 1991)
- Chapter 12: Queueing Models (I/O queuing, latency analysis)
- How to reason about disk read latency distributions
- Classic text for performance modeling
Papers to Read:
- "Approximate Nearest Neighbor Search on High Dimensional Data" - Li et al., ICDE 2020 (DiskANN improvements)
- "SPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Search" - Chen et al., NeurIPS 2021 (Microsoft's production system)
- "HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory" - Ren et al., NeurIPS 2020 (combines RAM + SSD + persistent memory)
Learning milestones:
- Memory-mapped I/O works → You understand OS-level optimization
- Graph fits in memory limit → You understand memory management
- Latency is SSD-bound → You understand the bottleneck
- Prefetching helps → You understand pipelining
Project 8: Metadata Filtering with Vector Search
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The โOpen Coreโ Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Hybrid Search / Inverted Indexes
- Software or Tool: Filtered Vector Search
- Main Book: "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
What you'll build: A vector search system that supports metadata filtering (e.g., "find similar products where price < $100 and category = 'electronics'") efficiently, not just post-filtering.
Why it teaches filtered search: Real applications always have filters. Naive post-filtering is slow (scan 1M vectors, filter to 1K, return top-10). Smart pre-filtering or integrated filtering is essential for production systems.
Core challenges youโll face:
- Pre-filter vs post-filter → maps to when to apply filters
- Bitmap indexes for metadata → maps to efficient set operations
- Filtered HNSW traversal → maps to skipping invalid nodes
- Cost estimation → maps to choosing the right strategy
Key Concepts:
- Inverted Indexes: "Introduction to Information Retrieval" - Manning et al.
- Bitmap Indexes: Database index fundamentals
- Filtered Graph Search: Qdrant, Pinecone approaches
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Project 5 completed; understanding of database indexes.
Real world outcome:
$ ./filtered_search "
SELECT * FROM products
WHERE vector NEAR query_embedding
AND category = 'electronics'
AND price < 100
AND rating >= 4.0
LIMIT 10
"
Query Analysis:
Filter selectivity: 2.3% (23,000 of 1,000,000 match)
Strategy: Pre-filter + HNSW on filtered subset
Execution:
Step 1: Apply filters (bitmap intersection): 0.8ms
Step 2: Build filtered entry points: 0.1ms
Step 3: HNSW search on filtered graph: 1.2ms
Results:
┌──────┬────────────┬──────────┬────────┬─────────────┐
│ Rank │ Product ID │ Distance │ Price  │ Category    │
├──────┼────────────┼──────────┼────────┼─────────────┤
│ 1    │ PROD-34521 │ 0.234    │ $49.99 │ electronics │
│ 2    │ PROD-78234 │ 0.287    │ $89.99 │ electronics │
│ ...  │ ...        │ ...      │ ...    │ ...         │
└──────┴────────────┴──────────┴────────┴─────────────┘
Performance: 2.1ms total
(vs 847ms with post-filtering on full HNSW traversal)
$ ./filtered_search --analyze-strategies
Filter Selectivity │ Best Strategy  │ Expected Time
───────────────────┼────────────────┼──────────────
> 50%              │ Post-filter    │ ~2ms
10% - 50%          │ Filtered HNSW  │ ~3ms
1% - 10%           │ Pre-filter     │ ~2ms
< 1%               │ Scan filtered  │ ~1ms
Implementation Hints:
Metadata storage with bitmap indexes:
class MetadataIndex:
def __init__(self):
# For categorical: inverted index of bitmaps
self.categorical = {} # field -> {value -> bitmap}
# For numerical: sorted index
self.numerical = {} # field -> sorted [(value, vector_id)]
def add(self, vector_id, metadata):
for field, value in metadata.items():
if isinstance(value, str):
if field not in self.categorical:
self.categorical[field] = {}
if value not in self.categorical[field]:
self.categorical[field][value] = BitSet()
self.categorical[field][value].set(vector_id)
else:
if field not in self.numerical:
self.numerical[field] = []
self.numerical[field].append((value, vector_id))
def build_indexes(self):
for field in self.numerical:
self.numerical[field].sort()
def filter(self, conditions):
result = None
for field, op, value in conditions:
if field in self.categorical:
if op == '==':
bits = self.categorical[field].get(value, BitSet())
elif op == 'in':
bits = BitSet()
for v in value:
bits |= self.categorical[field].get(v, BitSet())
else: # numerical
bits = self._range_filter(field, op, value)
if result is None:
result = bits
else:
result &= bits # Intersection
return result
Filtered HNSW search:
class FilteredHNSW(HNSWIndex):
def search_filtered(self, query, k, ef, valid_ids):
"""Search considering only nodes in valid_ids"""
if len(valid_ids) == 0:
return []
# Convert to set for O(1) lookup
valid_set = set(valid_ids)
# Find a valid entry point
entry = self._find_valid_entry(valid_set)
if entry is None:
# Fall back to scanning
return self._scan_filtered(query, k, valid_set)
# Modified beam search that skips invalid nodes
visited = {entry.id}
candidates = [(distance(query, entry.vector), entry)]
results = []
while candidates:
dist_c, current = heappop(candidates)
# Add to results if valid
if current.id in valid_set:
heappush(results, (-dist_c, current))
if len(results) > ef:
heappop(results)
# Explore neighbors (including invalid ones for navigation)
for neighbor_id in current.neighbors[0]:
if neighbor_id in visited:
continue
visited.add(neighbor_id)
neighbor = self.nodes[neighbor_id]
dist_n = distance(query, neighbor.vector)
# Always add to candidates for navigation
if len(results) < ef or dist_n < -results[0][0]:
heappush(candidates, (dist_n, neighbor))
# Extract top-k valid results
return sorted(
[(n.id, -d) for d, n in results],
key=lambda x: x[1]
)[:k]
Cost-based strategy selection:
def choose_strategy(self, query, filters, k):
# Estimate filter selectivity
selectivity = self.estimate_selectivity(filters)
num_valid = int(self.num_vectors * selectivity)
# Estimate costs (a full planner would pick the cheapest option; the thresholds below are a calibrated shortcut)
post_filter_cost = self.hnsw_search_cost() + num_valid * 0.001
pre_filter_cost = self.filter_cost(filters) + self.scan_cost(num_valid)
filtered_hnsw_cost = self.filter_cost(filters) + self.hnsw_search_cost() * (1 / selectivity)
if selectivity > 0.5:
return 'post_filter'
elif selectivity < 0.01:
return 'pre_filter_scan'
else:
return 'filtered_hnsw'
The Core Question You're Answering
"How do we search by similarity AND attributes without losing accuracy?"
This is the fundamental tension in filtered vector search. Three naive approaches all fail:
- Post-filtering: Search all vectors, then filter → Fast search, but wastes 99% of work when filters are selective
- Pre-filtering + brute force: Filter first, scan matches → Accurate, but O(n) when you have millions of filtered results
- Separate systems: Vector index + SQL database → Two queries, consistency nightmares, network overhead
The real question is: Can we integrate filtering into the vector search itself, making the index aware of metadata?
This requires understanding:
- When filtering BEFORE search is faster (few vectors match)
- When filtering DURING search is better (a moderate fraction matches)
- When filtering AFTER search wins (most vectors match)
- How to traverse HNSW graphs while respecting filter constraints
- Why bitmap indexes enable sub-millisecond filter evaluation
Real-world impact: Qdrant's filtered search is 100-400x faster than post-filtering. Pinecone's metadata filtering enables production e-commerce search. Understanding this makes you valuable to any company doing RAG or semantic search.
Concepts You Must Understand First
1. Pre-filtering vs Post-filtering
Pre-filtering:
Filter (1M → 10K) → Vector search on 10K → Top 10
Cost: Filter time + search(10K)
Post-filtering:
Vector search on 1M → Filter top 100 → Return matches
Cost: search(1M) + filter(100)
Problem: Might only get 3 results after filtering top 100
Key insight: Pre-filtering works when selectivity is low (< 5%). Post-filtering works when selectivity is high (> 50%).
2. Filtered IVF (Inverted File Index)
Standard IVF: Cluster vectors, search the nearest clusters.
Filtered IVF: Each cluster stores metadata bitmaps.
Cluster 1: [vec1, vec2, vec5] + {category: electronics → [1,0,1]}
Cluster 2: [vec3, vec4] + {category: electronics → [0,1]}
Search with filter "category = electronics":
- Visit cluster 1, consider only vec1 and vec5
- Visit cluster 2, consider only vec4
- Skip non-matching vectors during traversal
3. Hybrid Indexes
Combine inverted index (for filters) with vector index (for similarity):
Inverted Index: Vector Index (HNSW):
category=electronics → [1,5,9]     1 → [2,3,5]
price<100 → [1,2,3,8,9]            2 → [1,4,6]
...
Intersection: [1,9]                Search only among [1,9]
Fast path: Bitmap intersection in microseconds, vector search on subset.
4. Filter Selectivity
Selectivity = matching_vectors / total_vectors
- High selectivity (80%): Most vectors match → post-filter is fine
- Medium selectivity (10-50%): Some vectors match → filtered HNSW
- Low selectivity (< 5%): Few vectors match → pre-filter + scan/small index
Estimating selectivity:
# Categorical: exact count from bitmap
selectivity = bitmap[category='electronics'].count() / total
# Numerical: histogram approximation
selectivity = histogram.estimate_range(price < 100) / total
5. Filtered Graph Traversal
Key challenge: HNSW navigation needs to traverse invalid nodes to reach valid ones.
Entry → [invalid] → [invalid] → [VALID TARGET]
Two strategies:
- Strict filtering: Only traverse valid nodes → May get trapped, poor recall
- Relaxed filtering: Traverse all, return only valid → Better recall, slightly slower
Qdrant uses relaxed filtering with "visited valid count" early termination.
6. Bitmap Index Fundamentals
Store set membership as bit vectors:
category = 'electronics': 10110010 (vectors 0,2,3,6 match)
price < 100: 11010010 (vectors 0,1,3,6 match)
Intersection (AND): 10010010 (vectors 0,3,6 match both)
Operations compile down to cache-friendly bitwise/SIMD instructions (on the order of microseconds per million bits).
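The semantics are easy to demonstrate with plain Python ints as bitsets (here bit i represents vector i, so the strings above read right-to-left). Production engines use Roaring bitmaps or hand-written SIMD, but the operations are identical:
electronics = 0b01001101      # vectors 0, 2, 3, 6
cheap       = 0b01001011      # vectors 0, 1, 3, 6

both = electronics & cheap    # bitwise AND = set intersection
print(bin(both))              # 0b1001001 -> vectors 0, 3, 6
print(both.bit_count())       # 3 matches: an exact selectivity count (3.10+)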
Resources:
- "Database Internals" (Chapter 5: Transaction Processing and Recovery) - bitmap indexes
- "Introduction to Information Retrieval" (Chapter 1) - inverted indexes
- Qdrant documentation on filtered search strategies
Questions to Guide Your Design
1. Filter Selectivity Analysis
- How do you estimate selectivity BEFORE executing the filter?
- What data structures enable fast selectivity estimation?
- Histograms for numerical ranges?
- Cardinality sketches for categorical values?
- When is the estimate good enough vs exact count needed?
2. Index Strategy Selection
Three main strategies:
Post-filter: search_all(k * overfetch) → filter → take_top_k
Pre-filter: filter → search_subset(k)
Hybrid: filter → search_with_constraint(k)
For each query, which strategy?
- What's the decision boundary in terms of selectivity?
- How do you handle multiple filters (category AND price AND rating)?
- What if selectivity estimation is wrong?
3. Bitmap vs Range Index
Categorical filters: bitmap is the obvious choice.
Numerical filters: bitmap or sorted array?
# Bitmap approach
price_ranges[0-10] = 0101010
price_ranges[10-20] = 1010001
# Fast intersection, memory intensive
# Sorted array approach
[(5, vec1), (12, vec2), (18, vec3), ...]
# Binary search for range, less memory
When to use which?
- Cardinality of numerical values?
- Range query patterns?
- Memory constraints?
4. HNSW Traversal with Filters
How do you modify HNSW search?
- Start from filtered entry point or original entry?
- Count only valid nodes toward ef parameter?
- Expand search if not enough valid results?
Edge case: What if filter matches 5 vectors but you need top-10?
- Return fewer results?
- Relax filter?
- Warn the user?
5. Index Maintenance
Vectors are added/updated/deleted with metadata:
- How do you update bitmaps incrementally?
- Do you rebuild indexes periodically?
- How do you handle metadata schema changes?
6. Cost Model Accuracy
Your strategy selection depends on cost estimates:
post_filter_cost = hnsw_time(N) + filter_time(k * overfetch)
pre_filter_cost = filter_time(N) + search_time(filtered_count)
How do you calibrate these costs?
- Offline benchmarking?
- Online adaptive tuning?
- What if workload changes?
Think about these before implementing. The best engineers reason about trade-offs before writing code.
Thinking Exercise
Scenario: 1 million product vectors with metadata:
{
"category": "electronics", # 10% of products
"price": 79.99, # 30% are < $100
"rating": 4.5 # 40% are >= 4.0
}
Query: โFind top-10 similar to [query_vec] where category=โelectronicsโ AND price < 100 AND rating >= 4.0โ
Part 1: Trace Pre-filtering Approach
Step-by-step execution:
1. category='electronics' → 100K matches (10%)
2. Intersect price<100 → ? matches
3. Intersect rating>=4.0 → ? matches
4. Search algorithm on filtered set → ?
Calculate:
- How many vectors after each filter?
- Assuming independent distributions, what's the final selectivity?
- If you use brute force on filtered set, how many distance calculations?
- If you use HNSW on full graph with filter, how many nodes visited?
Part 2: Trace Post-filtering Approach
1. HNSW search on 1M vectors with ef=200 → top 200 candidates
2. Filter candidates → how many match all conditions?
3. If < 10 matches, what do you do?
Calculate:
- Probability that top-200 contains >= 10 filtered matches
- Expected number of matches in top-200
- Overfetch factor needed to get 10 results with 95% confidence
Part 3: Compare Recall
Pre-filtering with HNSW:
- True top-10 from filtered set: A = [...]
- Your results: B = [...]
- Recall = |A ∩ B| / 10
Post-filtering with overfetch:
- True top-10 considering filters: C = [...]
- Your results: D = [...]
- Recall = |C ∩ D| / 10
Why might they differ?
- HNSW approximation errors
- Filter position in pipeline
- Overfetch amount
Part 4: Calculate Break-even Point
When does pre-filter become faster than post-filter?
Pre-filter cost:
T_filter + T_search(N_filtered)
Post-filter cost:
T_search(N_total) + T_filter(ef)
Solve for selectivity where costs are equal.
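You can also locate the break-even numerically. A Python sketch with assumed constants (1M vectors, ~2ms full HNSW search, ~0.5ms bitmap filter); treating HNSW latency as logarithmic in index size is this sketch's own modeling choice:
import math

N = 1_000_000
T_FILTER = 0.5                      # ms, bitmap pass over all vectors
T_SEARCH_FULL = 2.0                 # ms, HNSW over all N vectors

def t_search(n):
    # Assume HNSW latency grows ~log(n), scaled from the full-index point
    return T_SEARCH_FULL * math.log(max(n, 2)) / math.log(N)

for sel in (0.001, 0.01, 0.05, 0.2, 0.5, 0.9):
    pre = T_FILTER + t_search(int(N * sel))   # filter, then search the subset
    post = T_SEARCH_FULL + 0.01               # search all, filter top results
    print(f"selectivity {sel:>5}: pre={pre:.2f}ms "
          f"post={post:.2f}ms -> {'pre' if pre < post else 'post'}")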
Part 5: Design Hybrid Strategy
Can you combine both?
if selectivity < 0.05:
pre_filter + scan
elif selectivity < 0.3:
pre_filter + HNSW on filtered graph
else:
HNSW + post-filter with overfetch
For the given query (selectivity ≈ 0.012), which strategy?
- Show your calculation
- Estimate latency for each approach
- Which would you choose?
This exercise forces you to think quantitatively about filtering strategies.
The Interview Questions They'll Ask
Q1: "Explain pre-filtering vs post-filtering in vector search. When would you use each?"
Bad answer: "Pre-filtering filters first, post-filtering filters after."
Good answer:
- Pre-filtering: Apply metadata filter → search filtered subset
- Best when: Filter is highly selective (< 5% match)
- Cost: Filter is cheap, but searching small index might lack structure
- Risk: If too few vectors match, accuracy suffers
- Post-filtering: Search all vectors → filter top-k * overfetch
- Best when: Filter is not selective (> 50% match)
- Cost: Full search is expensive, but filtering small result set is cheap
- Risk: Need overfetch factor to ensure enough filtered results
- Decision: Estimate selectivity, compare costs, choose dynamically
Q2: "How do bitmap indexes enable fast filtering?"
Bad answer: "They're just bit arrays that are fast."
Good answer:
- Represent set membership as bit vectors: category='electronics' → 10110010
- Set operations map to bitwise AND/OR: extremely cache-friendly
- Modern CPUs: SIMD instructions process 256 bits at once
- Intersection of 1M vectors: ~1ms (vs ~10ms with hash set)
- Space: 1 bit per vector per categorical value
- Trade-off: Memory vs speed (use for low-cardinality categoricals)
Show you understand the implementation details.
Q3: "How would you search HNSW with filters?"
Bad answer: "Just skip nodes that don't match the filter."
Good answer - Three approaches:
- Strict filtering: Only traverse nodes matching filter
- Problem: Can get stuck in filtered subgraph, poor connectivity
- Recall suffers when filter is selective
- Relaxed filtering: Traverse all nodes, return only filtered
- Better graph connectivity, better recall
- Slightly more node visits
- Hybrid with entry points: Pre-filter to get entry points, HNSW from there
- Find valid nodes in each layer
- Start search from closest valid entry
- Qdrant's approach
Implementation detail: Track "valid nodes visited" separately from "total visited" for the ef parameter.
Q4: "How do you estimate filter selectivity without scanning all vectors?"
Bad answer: "Count how many match."
Good answer:
- Categorical: Maintain exact counts in bitmap indexes (instant)
  selectivity = bitmap[category='electronics'].popcount() / total_vectors
- Numerical ranges: Use histograms
  # 100 buckets for price [0, 1000]
  # histogram[70-80] = 5243 vectors, histogram[80-90] = 6891 vectors
  # Estimate price < 100: sum the first 10 buckets
- Multiple filters: Assume independence (overestimate) or maintain joint histograms (memory intensive)
- Validation: Compare estimate vs actual, adjust over time
Q5: "Filtered-DiskANN extends DiskANN with filter support. How does it work?"
This tests whether you've read the papers deeply.
Good answer:
- DiskANN already uses graph navigation on SSD
- Filtering adds: Bitmap index on SSD + filtered graph traversal
- Key insight: During beam search, only expand nodes matching filter
- Challenge: SSD random reads are slow, need prefetching
- Solution: Prefetch neighbor metadata + vectors in single I/O
- Trade-off: Larger I/O (fetch metadata too) vs more I/Os (separate metadata reads)
Layout on disk:
[Node ID | Vector | Neighbors | Metadata Bitmap]
Single 4KB read gets everything needed for filtering decision.
Q6: "Your filter matches 50% of vectors. Which strategy and why?"
Bad answer: "Post-filtering because it's simple."
Good answer - Walk through the calculation:
Assume: 1M vectors, HNSW search ~2ms, bitmap filter ~0.5ms
Post-filter:
Cost = search(1M) + filter(top-200)
= 2ms + 0.01ms ≈ 2ms
Expected matches in top-200: ~100 (plenty for top-10)
Pre-filter + HNSW:
Filtered set: 500K vectors
Cost = filter(1M) + search(500K)
= 0.5ms + ~1.5ms ≈ 2ms
Conclusion: Both are similar! But post-filter is simpler and has better recall (it avoids the connectivity problems of HNSW search restricted to a filtered graph).
Choose post-filter with small overfetch (ef=50 is enough).
Q7: "How would you handle a filter that matches only 10 vectors?"
Bad answer: "Use pre-filtering."
Good answer:
- Recognize this is extreme selectivity (0.001%)
- HNSW is overkill for 10 vectors
- Best approach: Pre-filter → brute-force scan of the 10 vectors
  valid_ids = bitmap.intersect_all(filters)  # 10 vectors
  if len(valid_ids) < 100:  # Just scan them
      return sorted([(id, distance(query, vectors[id])) for id in valid_ids])[:k]
- Cost: Filter (~0.5ms) + 10 distance calculations (~0.001ms) ≈ 0.5ms
- Much faster than HNSW search
Q8: "What if your selectivity estimate is wrong?"
This tests production thinking.
Good answer:
- Adaptive execution: Start with the estimated strategy, measure the actual cost
  strategy = choose_strategy(estimated_selectivity)
  results, actual_cost = execute(strategy)
  if actual_cost > threshold:  # Log for calibration
      update_cost_model(estimated_selectivity, actual_cost)
- Graceful degradation: If post-filter doesn't return enough results, fall back to pre-filter
  results = post_filter_search(k=10, ef=100)
  if len(results) < 10:  # Not enough matches, try pre-filter
      results = pre_filter_search(k=10)
- Cost-based optimizer: Like SQL databases, maintain statistics and update periodically
These questions test depth of understanding. Study the trade-offs, not just the algorithms.
Hints in Layers
Layer 1: Start with Post-filtering (Simplest baseline)
Build a naive implementation:
class PostFilterSearch:
def search(self, query, k, filters, overfetch=5):
# Search without filters
candidates = self.hnsw.search(query, k * overfetch)
# Filter results
matches = [c for c in candidates if self.matches_filter(c.id, filters)]
return matches[:k]
What you'll learn:
- How overfetch factor affects recall
- When you get < k results (not enough matches in top-N)
- That this works fine for non-selective filters
Test with varying selectivity (90%, 50%, 10%, 1%) and measure:
- Latency
- Recall (compare to ground truth)
- Required overfetch for 95% chance of k results
Layer 2: Add Bitmap Pre-filtering
Implement inverted index for categorical metadata:
class BitmapIndex:
def __init__(self, num_vectors):
self.num_vectors = num_vectors
self.indexes = {} # field -> value -> bitset
def add(self, vector_id, field, value):
if field not in self.indexes:
self.indexes[field] = {}
if value not in self.indexes[field]:
self.indexes[field][value] = Bitset(self.num_vectors)
self.indexes[field][value].set(vector_id)
def query(self, field, value):
return self.indexes[field].get(value, Bitset(self.num_vectors))
def intersect(self, filters):
result = None
for field, value in filters:
bits = self.query(field, value)
result = bits if result is None else result & bits
return result.to_list() # List of matching IDs
What you'll learn:
- Bitmap intersection is incredibly fast
- Memory usage scales with cardinality * num_vectors
- Can estimate selectivity for free (popcount)
Layer 3: Implement Strategy Selection
Add a query planner:
class FilteredSearch:
def search(self, query, k, filters):
selectivity = self.estimate_selectivity(filters)
if selectivity > 0.5:
return self.post_filter(query, k, filters)
elif selectivity < 0.01:
valid_ids = self.bitmap.intersect(filters)
return self.brute_force_scan(query, k, valid_ids)
else:
return self.filtered_hnsw(query, k, filters)
def estimate_selectivity(self, filters):
result = None
for field, value in filters:
bits = self.bitmap.query(field, value)
count = bits.popcount()
sel = count / self.num_vectors
result = sel if result is None else result * sel
return result
What you'll learn:
- Selectivity estimation is critical
- Different strategies have different sweet spots
- Cost model needs calibration with real data
Layer 4: Filtered HNSW Traversal
This is the hard part. Modify HNSW to respect filters during search:
def search_filtered(self, query, k, ef, valid_ids):
valid_set = set(valid_ids)
# Find entry point in filtered set
entry = self._find_valid_entry(valid_set)
visited = set()
candidates = [(distance(query, self.vectors[entry]), entry)]
results = []
while candidates:
dist, current = heappop(candidates)
if current in visited:
continue
visited.add(current)
# Add to results if valid
if current in valid_set:
results.append((current, dist))
if len(results) >= ef:
break
# Explore neighbors (even invalid ones for navigation)
for neighbor in self.graph[current]:
if neighbor not in visited:
dist_n = distance(query, self.vectors[neighbor])
heappush(candidates, (dist_n, neighbor))
return sorted(results, key=lambda x: x[1])[:k]
Key design decisions:
- Should you traverse invalid nodes? (Yes, for connectivity)
- How to count toward ef? (Only valid nodes)
- Entry point selection? (Closest valid node to original entry)
What you'll learn:
- Graph connectivity matters for filtered search
- Relaxed filtering beats strict filtering
- Entry point quality affects recall
Layer 5: Add Numerical Range Indexes
Bitmaps work for categorical fields, but numerical ranges need a different approach:
from bisect import bisect_left, bisect_right

class RangeIndex:
def __init__(self):
self.sorted_values = [] # [(value, vector_id), ...]
def build(self, vectors_metadata, field):
self.sorted_values = sorted(
[(meta[field], vid) for vid, meta in vectors_metadata.items()]
)
def range_query(self, min_val, max_val):
# Binary search for range
start = bisect_left(self.sorted_values, (min_val, 0))
end = bisect_right(self.sorted_values, (max_val, float('inf')))
return [vid for val, vid in self.sorted_values[start:end]]
For very selective range queries, this beats bitmap.
What you'll learn:
- Different index structures for different data types
- Trade-offs: bitmap (memory, fast AND) vs sorted (space-efficient, slower intersection)
Layer 6: Implement Hybrid Strategy (Pre-filter + Small HNSW)
For medium selectivity (5-20%), build HNSW on filtered subset:
def hybrid_search(self, query, k, filters):
# Pre-filter
valid_ids = self.bitmap.intersect(filters)
if len(valid_ids) < 1000:
# Too small for HNSW, brute force
return self.brute_force_scan(query, k, valid_ids)
# Build temporary HNSW on filtered subset
filtered_vectors = [self.vectors[vid] for vid in valid_ids]
temp_hnsw = HNSW(filtered_vectors, M=16, ef_construction=100)
# Search
results = temp_hnsw.search(query, k)
return [(valid_ids[r.id], r.distance) for r in results]
This is expensive (building HNSW), but for repeated queries on same filter, you can cache.
Layer 7: Add Cost-Based Optimization
Calibrate your cost model with benchmarks:
class CostModel:
def __init__(self):
# Calibrated constants (microseconds)
self.hnsw_base = 2000 # 2ms base HNSW search
self.hnsw_per_node = 0.01 # 10ns per node
self.bitmap_filter = 500 # 0.5ms to filter 1M vectors
self.distance_calc = 0.1 # 100ns per distance
def estimate_cost(self, strategy, selectivity, num_vectors):
if strategy == 'post_filter':
return self.hnsw_base + self.bitmap_filter * 0.001
elif strategy == 'pre_filter_scan':
filtered = int(num_vectors * selectivity)
return self.bitmap_filter + filtered * self.distance_calc
elif strategy == 'filtered_hnsw':
return self.bitmap_filter + self.hnsw_base * (1/selectivity)
def choose_strategy(self, selectivity, num_vectors):
costs = {
'post_filter': self.estimate_cost('post_filter', selectivity, num_vectors),
'pre_filter_scan': self.estimate_cost('pre_filter_scan', selectivity, num_vectors),
'filtered_hnsw': self.estimate_cost('filtered_hnsw', selectivity, num_vectors),
}
return min(costs, key=costs.get)
Run experiments, measure actual costs, tune constants.
Layer 8: Production Features
Add robustness:
- Graceful degradation when estimate is wrong
- Query timeout and fallback strategies
- Logging and observability (which strategy chosen, actual vs estimated cost)
- Adaptive cost model that learns from query patterns
By building layer by layer, you understand the trade-offs deeply.
Books That Will Help
Core Books:
- "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
- Chapter 1: Boolean Retrieval (inverted indexes, bitmaps)
- Chapter 6: Scoring and Ranking (cost models)
- Why: Foundational understanding of inverted indexes, which are the basis of bitmap filtering
- "Database Internals" by Alex Petrov
- Chapter 5: Transaction Processing and Recovery (bitmap indexes)
- Chapter 7: Log-Structured Storage (LSM trees for metadata)
- Why: Deep dive into bitmap index implementation, how databases do filtering efficiently
- "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman
- Chapter 3: Finding Similar Items (LSH with filtering)
- Chapter 9: Recommendation Systems (hybrid filtering)
- Why: Understand how filtering works at scale, selectivity estimation
Papers:
- "Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters" (2022)
- Core paper on integrating filters into graph-based ANN
- Explains pre-filter vs post-filter vs integrated filtering
- Shows how to modify graph traversal for filters
- Read: https://harsha-simhadri.org/pubs/Filtered-DiskANN22.pdf
- "Qdrant: Vector Search Engine with Extended Filtering Support" (Documentation)
- Production system implementing filtered search
- Explains payload indexes, filtered HNSW strategy
- Read: https://qdrant.tech/documentation/concepts/filtering/
- "PANN: Personalized Approximate Nearest Neighbor Search" (2020)
- Filtered ANN for recommendation systems
- User-based filtering strategies
- Shows real-world cost models
Supplementary:
- "Designing Data-Intensive Applications" by Martin Kleppmann
- Chapter 3: Storage and Retrieval (indexes)
- Why: Understand trade-offs in index design, when to use what
- "Database Systems: The Complete Book" by Garcia-Molina, Ullman, Widom
- Chapter 14: Index Structures (B-trees, bitmaps, spatial indexes)
- Why: Comprehensive coverage of all index types, helps you choose right structure
Blogs and Documentation:
- Pinecone's "Metadata Filtering Guide"
- https://www.pinecone.io/learn/vector-search-filtering/
- Production perspective on filtered vector search
- Weaviate's "Filtered Vector Search"
- How they combine inverted index + HNSW
- Real-world performance numbers
- PostgreSQL Documentation: Index Types
- Understand GIN (Generalized Inverted Index)
- See how SQL databases integrate full-text search with filtering
Reading Order:
- Start: Manning's "Introduction to Information Retrieval" (Chapter 1) - understand inverted indexes
- Then: Filtered-DiskANN paper - see how it applies to vector search
- Next: Qdrant documentation - see production implementation
- Deep dive: "Database Internals" (Chapter 5) - understand bitmap indexes deeply
- Advanced: PANN paper - see personalized filtering
These books and papers will make you an expert in filtered search, not just an implementer.
Learning milestones:
- Bitmap filters are fast → You understand inverted indexes
- Strategy selection matters → You understand cost-based optimization
- Filtered HNSW works → You understand graph traversal with constraints
- Performance beats the naive approach → You've built an efficient system
Project 9: Vector Database Storage Engine
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Rust
- Alternative Programming Languages: C++, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The โIndustry Disruptorโ
- Difficulty: Level 4: Expert
- Knowledge Area: Storage Systems / Write-Ahead Logging
- Software or Tool: Storage Engine
- Main Book: "Designing Data-Intensive Applications" by Martin Kleppmann
What you'll build: A durable storage engine for vectors with write-ahead logging (WAL), crash recovery, compaction, and efficient serialization: the persistence layer of a real vector database.
Why it teaches storage: Indexes are useless if they disappear on restart. A real vector database needs ACID properties: atomic writes, crash recovery, and efficient disk formats. This is what makes the difference between a prototype and a production system.
Core challenges youโll face:
- Write-ahead logging → maps to durability before acknowledging writes
- Crash recovery → maps to replaying the WAL on startup
- Compaction → maps to merging and garbage collection
- Efficient serialization → maps to binary formats for vectors
Key Concepts:
- WAL: "Designing Data-Intensive Applications" Chapter 3 - Kleppmann
- LSM Trees: For append-friendly storage
- Memory-Mapped Files: OS-level optimization
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-6, experience with file I/O and binary formats.
Real world outcome:
$ ./vecdb_storage --init --path ./data
Initializing VectorDB Storage Engine:
Data directory: ./data
WAL segment size: 64 MB
Compaction threshold: 1 GB
Files created:
./data/wal/segment_000000.wal
./data/vectors/vectors_000000.bin
./data/index/hnsw_000000.idx
./data/metadata.json
$ ./vecdb_storage --insert --count 100000
Inserting 100,000 vectors...
WAL writes: 100,000
Batch flushes: 10
Time: 4.2s (23,800 inserts/sec)
$ kill -9 $PID # Simulate crash
$ ./vecdb_storage --recover
Recovery started:
Scanning WAL segments...
Found 1 segment with 100,000 entries
Replaying entries: [████████████████████] 100%
Rebuilding index: 2.3s
Recovery complete!
Verified:
Vectors recovered: 100,000 ✓
Index consistency: OK ✓
No data loss ✓
$ ./vecdb_storage --compact
Compaction:
Before: 3 segments, 1.2 GB
Merging...
After: 1 segment, 0.9 GB
Space reclaimed: 300 MB (25%)
Implementation Hints:
Storage architecture:
./data/
├── wal/
│   ├── segment_000000.wal     # Write-ahead log
│   ├── segment_000001.wal
│   └── ...
├── vectors/
│   └── vectors_000000.bin     # Actual vector data
├── index/
│   └── hnsw_000000.idx        # Serialized index
└── metadata.json              # DB state
WAL entry format:
struct WALEntry {
sequence_number: u64,
operation: Operation, // Insert, Delete, Update
timestamp: u64,
vector_id: u64,
vector_data: Vec<f32>, // Empty for Delete
metadata: HashMap<String, Value>,
checksum: u32,
}
enum Operation {
Insert = 1,
Delete = 2,
Update = 3,
}
WAL writer:
struct WALWriter {
current_segment: File,
segment_size: usize,
current_size: usize,
sequence: AtomicU64,
}
impl WALWriter {
fn append(&mut self, entry: WALEntry) -> Result<u64> {
let seq = self.sequence.fetch_add(1, Ordering::SeqCst);
let mut entry = entry;
entry.sequence_number = seq;
        // Serialize the payload, then checksum the serialized bytes;
        // the checksum is framed after the payload, not inside it
        let bytes = entry.serialize();
        entry.checksum = crc32(&bytes);
// Write length + data + checksum
self.current_segment.write_all(&(bytes.len() as u32).to_le_bytes())?;
self.current_segment.write_all(&bytes)?;
self.current_segment.write_all(&entry.checksum.to_le_bytes())?;
// Flush to disk (durability!)
self.current_segment.sync_all()?;
self.current_size += bytes.len() + 8;
// Rotate segment if needed
if self.current_size >= self.segment_size {
self.rotate_segment()?;
}
Ok(seq)
}
}
Recovery:
fn recover(data_dir: &Path) -> Result<VectorDB> {
let mut db = VectorDB::new();
// Read metadata to find last checkpoint
let metadata = read_metadata(data_dir)?;
let last_checkpoint = metadata.last_checkpoint_sequence;
// Replay WAL from last checkpoint
for segment in list_wal_segments(data_dir)? {
let reader = WALReader::open(segment)?;
for entry in reader.entries() {
let entry = entry?;
// Verify checksum
if crc32(&entry.data) != entry.checksum {
log::error!("Corrupt WAL entry at seq {}", entry.sequence_number);
continue;
}
// Skip already applied entries
if entry.sequence_number <= last_checkpoint {
continue;
}
// Apply to in-memory state
match entry.operation {
Operation::Insert => db.insert_recovered(entry.vector_id, entry.vector_data),
Operation::Delete => db.delete_recovered(entry.vector_id),
Operation::Update => db.update_recovered(entry.vector_id, entry.vector_data),
}
}
}
// Rebuild index if needed
if db.index_needs_rebuild() {
db.rebuild_index()?;
}
Ok(db)
}
Compaction:
fn compact(&mut self) -> Result<()> {
// Create new segment file
let new_segment = create_new_segment()?;
// Write live vectors to new segment
for (id, vector) in self.vectors.iter() {
if !self.deleted.contains(id) {
write_vector(&new_segment, id, vector)?;
}
}
// Atomically swap segments
new_segment.sync_all()?;
rename(new_segment.path(), current_segment_path())?;
// Delete old segments
for old in old_segments {
remove_file(old)?;
}
// Truncate WAL (everything is now in new segment)
self.wal.truncate_before(self.sequence)?;
Ok(())
}
The Core Question You're Answering
"How do we persist vectors with ACID guarantees and fast recovery?"
When your vector database crashes mid-write, how do you ensure:
- Durability: Acknowledged writes survive crashes
- Atomicity: No partial writes corrupt the index
- Recovery: Restart quickly without full reindexing
- Performance: Donโt sacrifice write throughput for durability
This is the difference between a toy and a production system. Understanding storage engines means understanding the tradeoffs between:
- Write amplification (how many times does data hit disk?)
- Read amplification (how many files to read for one query?)
- Space amplification (how much extra space for metadata/logs?)
Youโre building the persistence layer that makes vectors survive power failures.
Concepts You Must Understand First
Before writing code, understand these storage fundamentals:
1. Write-Ahead Logging (WAL)
The durability guarantee:
- Every modification is written to an append-only log before updating in-memory structures
- Log entries are fsynced to disk before acknowledging the write
- On crash, replay the log to reconstruct state
- Key insight: Sequential writes are 100x faster than random writes
Critical details:
- WAL entries need checksums (detect corruption)
- Log segmentation (rotate to new files periodically)
- Idempotency (replaying same entry twice is safe)
Read: "Designing Data-Intensive Applications" Chapter 3 (Storage and Retrieval) - Kleppmann
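A minimal Python sketch of that contract (the project itself targets Rust): length-prefixed records, a CRC32 per record, and an fsync before the write is acknowledged. The framing here is illustrative, not a standard format:
import os, struct, zlib

class TinyWAL:
    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, payload: bytes):
        self.f.write(struct.pack("<I", len(payload)))          # 4-byte length
        self.f.write(payload)
        self.f.write(struct.pack("<I", zlib.crc32(payload)))   # 4-byte CRC32
        self.f.flush()
        os.fsync(self.f.fileno())   # durable before we acknowledge the write

    @staticmethod
    def replay(path):
        with open(path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break                                # clean end of log
                (length,) = struct.unpack("<I", header)
                payload, tail = f.read(length), f.read(4)
                if len(payload) < length or len(tail) < 4:
                    break                                # torn tail: stop
                if zlib.crc32(payload) != struct.unpack("<I", tail)[0]:
                    break                                # corruption: stop
                yield payload
Replay stops at the first bad record, so a torn final write never corrupts recovered state; replaying the log twice is safe as long as the apply step is idempotent.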
2. LSM Trees (Log-Structured Merge Trees)
The write-optimized architecture:
- Writes go to memtable (in-memory sorted structure)
- When memtable fills, flush to SSTable (sorted string table on disk)
- Background compaction merges SSTables to reduce read amplification
- Bloom filters avoid reading files that donโt contain a key
Why LSM for vectors?
- Vector inserts are write-heavy (ingestion pipelines)
- Immutable SSTables enable safe concurrent reads
- Compaction reclaims space from deleted vectors
Read: "Database Internals" Chapter 7 (Log-Structured Storage) - Alex Petrov
3. Memtable and SSTable Structure
Memtable (in-memory):
- Red-black tree or skip list (sorted structure)
- Fast insertions O(log n)
- Periodically flushed to disk
SSTable (on disk):
- Immutable sorted file
- Binary format: [key | value_offset | value_size] index block + data blocks
- Optional compression (Snappy, LZ4)
4. Compaction Strategies
Size-tiered compaction:
- Merge SSTables of similar size
- Good for write-heavy workloads
- Higher space amplification
Leveled compaction:
- Organize SSTables into levels (L0, L1, L2, ...)
- Each level is 10x larger than previous
- Lower space amplification, more I/O
Which to use?
- Start with size-tiered (simpler)
- Switch to leveled if space is constrained
5. Crash Recovery Protocol
Recovery steps:
- Read manifest file (which SSTables are valid)
- Replay WAL from last checkpoint
- Rebuild memtable from WAL
- Optionally rebuild index from SSTables
- Mark recovery complete, accept writes
Key insight: Recovery time depends on WAL size, not database size
Questions to Guide Your Design
WAL Format Design
- What goes in a WAL entry?
- Sequence number, operation type, vector ID, vector data, timestamp, checksum
- How do you serialize variable-length vectors efficiently?
- Should you compress WAL entries?
- How do you handle corrupted entries?
- CRC32 checksums for integrity
- What if corruption is in the middle of the log?
- Skip and continue, or stop recovery?
- When do you fsync()?
- After every write? (slow but durable)
- Batched every N writes? (fast but loss window)
- Configurable durability vs performance tradeoff
Flush Policy
- When do you flush memtable to SSTable?
- Size threshold (e.g., 64 MB memtable)
- Time threshold (e.g., every 60 seconds)
- Memory pressure (adaptive flushing)
- What if you crash during flush?
- Flush is not atomic (partial SSTable on disk)
- Solution: Write to temp file, rename on completion (atomic on POSIX)
- Keep WAL entries until flush completes
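The temp-file-plus-rename trick, sketched in Python. rename() is atomic on POSIX filesystems, so readers see either the old file or the complete new one; the directory fsync makes the rename itself durable:
import os

def atomic_write(path, data: bytes):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # contents durable before publishing
    os.rename(tmp, path)              # atomic swap: no partial file visible
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)              # persist the rename itself
    finally:
        os.close(dir_fd)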
Compaction Strategy
- How often do you compact?
- Too frequent: wastes CPU/disk I/O
- Too rare: read amplification hurts queries
- Heuristic: compact when you have N SSTables at same level
- What about tombstones (deleted vectors)?
- Mark deleted in WAL, filter during compaction
- Bloom filters prevent reading SSTables with deleted keys
Crash Recovery
- How do you know where to start recovery?
- Checkpoint sequence number in manifest file
- Replay WAL entries with seq > last_checkpoint
- What if the index is corrupted?
- Can you rebuild HNSW from vectors alone?
- Trade-off: fast recovery vs full reindex
Thinking Exercise
Trace an insert through the storage engine:
User calls: db.insert(vector_id=42, vector=[0.1, 0.2, ...])
Step 1: Write to WAL
- Serialize WAL entry (seq=100, op=Insert, id=42, data=[...])
- Write to wal/segment_000000.wal
- fsync() to guarantee durability
- Return sequence number
Step 2: Update memtable
- Insert into in-memory sorted structure
- memtable.insert(42, vector)
Step 3: Check flush condition
- if memtable.size() > 64 MB:
- Freeze memtable (make immutable)
- Start new memtable for writes
- Background thread flushes frozen memtable to SSTable
Step 4: Flush to SSTable (background)
- Create temp file: vectors/vectors_000001.tmp
- Write sorted entries from frozen memtable
- Write index block (key -> offset mapping)
- Rename to vectors/vectors_000001.sst (atomic!)
- Update manifest: "SSTable 000001 contains seq 0-100"
- Truncate WAL entries <= 100 (no longer needed)
Step 5: Compaction (background, periodically)
- Scan for SSTables at same level with overlap
- Merge-sort entries from multiple SSTables
- Write new compacted SSTable
- Delete old SSTables
- Update manifest
Now trace a crash scenario:
Crash happens during Step 4 (flush in progress)
On restart:
1. Read manifest -> last valid SSTable is 000000 (seq 0-50)
2. Scan WAL -> find entries seq 51-100
3. Replay:
- entry seq=51: insert(id=10, ...)
- entry seq=52: insert(id=11, ...)
- ...
- entry seq=100: insert(id=42, ...)
4. Memtable now contains all vectors from seq 51-100
5. Check for partial SSTable 000001.tmp -> delete (incomplete flush)
6. Rebuild index from vectors 0-100
7. Ready to accept writes!
Questions to consider:
- What if crash happens during compaction?
- How do you handle concurrent reads during flush?
- What if WAL itself is corrupted?
The Interview Questions They'll Ask
Fundamental Understanding
- "Explain the difference between WAL and journaling in filesystems."
- Answer: Both are write-ahead logs, but:
- Filesystem journaling: logs metadata changes (file renames, permission changes)
- Database WAL: logs data modifications (actual record changes)
- WAL is application-level, journaling is OS-level
- WAL provides logical recovery, journaling provides crash consistency
- "Why use LSM trees instead of B-trees for vectors?"
- LSM advantages:
- Sequential writes (no random I/O)
- Higher write throughput
- Better compression (immutable files)
- B-tree advantages:
- Lower read amplification
- Predictable read latency
- For vectors: LSM wins because ingestion is write-heavy, and vectors are rarely updated in-place
- "What's the difference between size-tiered and leveled compaction?"
- Size-tiered: Merge SSTables of similar size → less I/O, more space
- Leveled: Fixed size levels, compact to next level → more I/O, less space
- Example: RocksDB uses leveled by default, Cassandra uses size-tiered
Design Tradeoffs
- "How do you minimize write amplification?"
- Techniques:
- Larger memtable (fewer flushes)
- Less aggressive compaction (fewer rewrites)
- Compression (write less data)
- Tradeoff: Write amplification vs space amplification
- "What if your WAL grows unbounded?"
- Solutions:
- Checkpoint: flush memtable, mark WAL prefix as safe to delete
- Segment rotation: create new WAL file periodically
- Truncation: delete old segments after checkpoint
- Key: WAL size is bounded by checkpoint frequency
- "How do you handle read-your-own-writes consistency?"
- Problem: Write to WAL, query before flush → vector not in an SSTable yet
- Solution: Check memtable first, then SSTables (layered query)
- Ordering: memtable > newest SSTable > oldest SSTable
Crash Recovery
- "What happens if you crash during compaction?"
- Safe approach: Compaction writes to new SSTables, deletes old ones only after success
- Crash scenario: Old SSTables still exist, new ones partially written
- Recovery: Read manifest → ignore partial SSTables → use old SSTables
- Key: Never delete source files until output is complete
- "How do you detect corrupted WAL entries?"
- Checksums: CRC32 or xxHash on each entry
- On mismatch: Skip entry and log warning, or stop recovery (configurable)
- Best practice: Also checksum the entire segment (detect tail corruption)
Performance
- "How do you optimize fsync latency?"
- Group commit: Batch multiple writes behind a single fsync (see the sketch after this list)
- Async replication: Acknowledge write after replicating to N nodes (donโt wait for fsync)
- Battery-backed write cache: Hardware acceleration (risky!)
- "Why are sequential writes so much faster than random writes?"
- HDD: Sequential access avoids seek time (200 MB/s sequential vs ~1 MB/s random)
- SSD: Wear leveling, internal parallelism optimized for sequential
- OS: Readahead and writeback buffers help sequential I/O
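A Python sketch of the group-commit technique itself (not any particular engine's code): writers append under a lock, and whichever thread reaches the sync step first issues one fsync covering every entry written so far:
import os, threading

class GroupCommitLog:
    def __init__(self, path):
        self.f = open(path, "ab")
        self.write_lock = threading.Lock()
        self.sync_lock = threading.Lock()
        self.written = 0     # entries handed to the OS buffer
        self.synced = 0      # entries covered by some fsync

    def append(self, payload: bytes):
        with self.write_lock:
            self.f.write(payload)
            self.written += 1
            my_seq = self.written
        with self.sync_lock:
            if self.synced >= my_seq:
                return       # a concurrent fsync already covered this entry
            with self.write_lock:
                batch_end = self.written   # everything written so far
            self.f.flush()
            os.fsync(self.f.fileno())      # one fsync covers the whole batch
            self.synced = batch_end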
Hints in Layers
Layer 1: In-memory only (ignore crashes)
struct VectorDB {
vectors: HashMap<u64, Vec<f32>>, // id -> vector
}
impl VectorDB {
fn insert(&mut self, id: u64, vector: Vec<f32>) {
self.vectors.insert(id, vector);
}
fn search(&self, query: Vec<f32>, k: usize) -> Vec<u64> {
// Brute-force search
}
}
Test: Insert 100k vectors, search works, restart → all data lost ✓ (expected: no durability yet)
Layer 2: Add write-ahead log (durability)
struct WALEntry {
seq: u64,
op: Operation,
id: u64,
vector: Vec<f32>,
}
struct VectorDB {
vectors: HashMap<u64, Vec<f32>>,
wal: WALWriter,
}
impl VectorDB {
fn insert(&mut self, id: u64, vector: Vec<f32>) -> Result<()> {
// Write to WAL first!
let entry = WALEntry { seq: self.next_seq(), op: Insert, id, vector: vector.clone() };
self.wal.append(&entry)?;
self.wal.fsync()?; // Durability!
// Now safe to update memory
self.vectors.insert(id, vector);
Ok(())
}
}
fn recover(wal_path: &Path) -> VectorDB {
let mut db = VectorDB::new();
for entry in WALReader::read(wal_path) {
db.vectors.insert(entry.id, entry.vector);
}
db
}
Test: Insert vectors, kill -9, restart → data recovered ✓
Layer 3: Add memtable + flush (reduce WAL size)
struct VectorDB {
memtable: BTreeMap<u64, Vec<f32>>, // Sorted!
sstables: Vec<SSTableReader>,
wal: WALWriter,
}
impl VectorDB {
fn insert(&mut self, id: u64, vector: Vec<f32>) -> Result<()> {
self.wal.append(&entry)?;
self.memtable.insert(id, vector);
        if self.memtable.size() > 64 * 1024 * 1024 { // size() here means tracked bytes, not entry count
self.flush_memtable()?;
}
Ok(())
}
fn flush_memtable(&mut self) -> Result<()> {
let tmp_path = "vectors_temp.sst";
let mut writer = SSTableWriter::new(tmp_path)?;
for (id, vector) in self.memtable.iter() {
writer.write(id, vector)?;
}
writer.finish()?;
// Atomic rename!
rename(tmp_path, "vectors_000000.sst")?;
self.sstables.push(SSTableReader::open("vectors_000000.sst")?);
self.memtable.clear();
self.wal.checkpoint()?; // Safe to truncate WAL now
Ok(())
}
}
Test: Insert 1M vectors → WAL doesn't grow unbounded ✓
Layer 4: Add compaction (reduce read amplification)
impl VectorDB {
fn compact(&mut self) -> Result<()> {
// Size-tiered: merge SSTables of similar size
let to_merge = self.find_mergeable_sstables();
let mut writer = SSTableWriter::new("compacted_temp.sst")?;
let mut heap = BinaryHeap::new(); // K-way merge (wrap entries in std::cmp::Reverse so the smallest key pops first)
// Initialize heap with first entry from each SSTable
for sst in &to_merge {
if let Some(entry) = sst.iter().next() {
heap.push(entry);
}
}
// Merge-sort
while let Some(entry) = heap.pop() {
writer.write(entry.id, entry.vector)?;
if let Some(next) = entry.source.iter().next() {
heap.push(next);
}
}
writer.finish()?;
rename("compacted_temp.sst", "compacted_000000.sst")?;
// Delete old SSTables
for sst in to_merge {
remove_file(sst.path())?;
}
Ok(())
}
}
Test: Insert 1M, delete 500k, compact → disk space reclaimed ✓
Layer 5: Add checksums and error handling (production-ready)
struct WALEntry {
seq: u64,
op: Operation,
id: u64,
vector: Vec<f32>,
checksum: u32, // CRC32
}
impl WALReader {
fn read_entry(&mut self) -> Result<WALEntry> {
let len = self.read_u32()?;
let data = self.read_bytes(len)?;
let checksum = self.read_u32()?;
if crc32(&data) != checksum {
return Err(CorruptedEntry);
}
WALEntry::deserialize(&data)
}
}
Test: Corrupt WAL file → recovery skips bad entries, logs warning ✓
Books That Will Help
Essential Reading
- "Designing Data-Intensive Applications" by Martin Kleppmann
- Chapter 3: Storage and Retrieval (WAL, LSM trees, B-trees)
- Chapter 11: Stream Processing (log-based systems)
- Why: The definitive guide to storage engine design
- "Database Internals" by Alex Petrov
- Part I: Storage Engines (file formats, indexing)
- Chapter 7: Log-Structured Storage (LSM deep dive)
- Chapter 9: Failure Detection (crash recovery)
- Why: Implementation-focused, covers RocksDB and LevelDB internals
Deep Dives
- "The Log-Structured Merge-Tree (LSM-Tree)" - O'Neil et al. (1996)
- The original LSM paper
- Available free online
- Why: Understand the theory behind modern storage engines
- โRocksDB Tuning Guideโ (online documentation)
- Real-world compaction strategies
- Write amplification benchmarks
- Why: Learn from a production LSM implementation
Reference Implementation
- Study LevelDB source code (C++, ~5000 lines)
  - db/log_writer.cc - WAL implementation
  - db/memtable.cc - Skip list memtable
  - table/table_builder.cc - SSTable format
  - Why: Clean, well-commented reference implementation
Learning milestones:
- WAL writes are durable → You understand write-ahead logging
- Crash recovery works → You understand durability guarantees
- Compaction reclaims space → You understand garbage collection
- Performance is acceptable → You've built an efficient storage layer
Project 10: Distributed Vector Search Cluster
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The "Industry Disruptor"
- Difficulty: Level 5: Master
- Knowledge Area: Distributed Systems / Sharding
- Software or Tool: Distributed Vector Database
- Main Book: "Designing Data-Intensive Applications" by Martin Kleppmann
What you'll build: A distributed vector search cluster with sharding, replication, routing, and distributed query execution, turning a single-node index into a scalable system.
Why it teaches distributed systems: No single machine can handle billions of vectors with low latency. Distributed systems introduce new challenges: consistency, partition tolerance, coordination. Understanding these is essential for production systems.
Core challenges you'll face:
- Sharding strategies → maps to distributing vectors across nodes
- Query routing → maps to finding which shards to query
- Result merging → maps to combining results from multiple shards
- Replication → maps to fault tolerance and read scaling
- Consistency → maps to handling concurrent updates
Key Concepts:
- Sharding: "Designing Data-Intensive Applications" Chapter 6 - Kleppmann
- Replication: "Designing Data-Intensive Applications" Chapter 5 - Kleppmann
- Distributed Consensus: Raft algorithm
Difficulty: Master Time estimate: 4-6 weeks Prerequisites: All previous projects, distributed systems experience.
Real world outcome:
$ ./vecdb cluster init --nodes 3 --shards 9 --replicas 2
Initializing distributed cluster:
Nodes: node-0 (coordinator), node-1, node-2
Shards: 9 (3 per node primary, 6 replicas distributed)
Replication factor: 2
Shard distribution:
┌────────┬──────────────────┬──────────────────┐
│ Shard  │ Primary Node     │ Replica Nodes    │
├────────┼──────────────────┼──────────────────┤
│ 0      │ node-0           │ node-1, node-2   │
│ 1      │ node-0           │ node-1, node-2   │
│ 2      │ node-0           │ node-1, node-2   │
│ 3      │ node-1           │ node-0, node-2   │
│ ...    │ ...              │ ...              │
└────────┴──────────────────┴──────────────────┘
Cluster ready!
$ ./vecdb insert --vectors 10000000 --distributed
Inserting 10M vectors across cluster:
Routing: hash(vector_id) mod 9
Progress: [████████████████████] 100%
Time: 45s (222K inserts/sec)
$ ./vecdb search --query query.npy --k 10
Distributed Query Execution:
Step 1: Route query to all 9 shards
Step 2: Parallel search on each shard (local k=30)
Step 3: Merge 270 candidates from 9 shards
Step 4: Re-rank and return top-10
Performance:
Shard latencies: [1.2ms, 1.4ms, 1.1ms, 1.3ms, 1.5ms, 1.2ms, 1.0ms, 1.4ms, 1.3ms]
Merge time: 0.3ms
Total: 1.8ms (limited by slowest shard + merge)
$ ./vecdb failover --kill node-1
Simulating node-1 failure...
Detecting failure: 2.1s (heartbeat timeout)
Promoting replicas:
Shard 3: node-0 promoted to primary
Shard 4: node-2 promoted to primary
Shard 5: node-0 promoted to primary
Cluster healthy (degraded: missing replicas)
$ ./vecdb search --query query.npy --k 10
Search still works! (using remaining nodes)
Latency: 2.1ms (slightly higher due to fewer nodes)
Implementation Hints:
Cluster architecture:
┌──────────────────────────────────────────────────────────────┐
│                         Coordinator                          │
│       (Query routing, shard map, cluster membership)         │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐      │
│   │   Node-0    │    │   Node-1    │    │   Node-2    │      │
│   │             │    │             │    │             │      │
│   │ Shard 0 (P) │    │ Shard 3 (P) │    │ Shard 6 (P) │      │
│   │ Shard 1 (P) │    │ Shard 4 (P) │    │ Shard 7 (P) │      │
│   │ Shard 2 (P) │    │ Shard 5 (P) │    │ Shard 8 (P) │      │
│   │ Shard 3 (R) │    │ Shard 0 (R) │    │ Shard 0 (R) │      │
│   │ Shard 4 (R) │    │ Shard 1 (R) │    │ Shard 1 (R) │      │
│   │     ...     │    │     ...     │    │     ...     │      │
│   └─────────────┘    └─────────────┘    └─────────────┘      │
│                                                              │
└──────────────────────────────────────────────────────────────┘
P = Primary, R = Replica
Sharding:
type ShardRouter struct {
numShards int
shardMap map[int][]NodeAddress // shard -> [primary, replicas]
}
func (r *ShardRouter) GetShard(vectorID uint64) int {
return int(vectorID % uint64(r.numShards))
}
func (r *ShardRouter) RouteInsert(vectorID uint64, vector []float32) error {
shardID := r.GetShard(vectorID)
primary := r.shardMap[shardID][0]
// Send to primary (it will replicate)
return r.sendInsert(primary, vectorID, vector)
}
func (r *ShardRouter) RouteSearch(query []float32, k int) ([]SearchResult, error) {
// Search all shards in parallel
results := make(chan []SearchResult, r.numShards)
errors := make(chan error, r.numShards)
for shardID := 0; shardID < r.numShards; shardID++ {
go func(sid int) {
node := r.pickNode(sid) // Primary or replica
res, err := r.sendSearch(node, query, k*3) // Over-fetch for merge
if err != nil {
errors <- err
return
}
results <- res
}(shardID)
}
// Collect and merge
allResults := []SearchResult{}
for i := 0; i < r.numShards; i++ {
select {
case res := <-results:
allResults = append(allResults, res...)
case err := <-errors:
return nil, err
}
}
// Sort by distance and take top-k
sort.Slice(allResults, func(i, j int) bool {
return allResults[i].Distance < allResults[j].Distance
})
if len(allResults) > k {
allResults = allResults[:k]
}
return allResults, nil
}
Replication:
type ReplicationManager struct {
primary *Shard
replicas []*RemoteReplica
}
func (rm *ReplicationManager) Insert(id uint64, vector []float32) error {
// Write to primary first
if err := rm.primary.Insert(id, vector); err != nil {
return err
}
// Replicate asynchronously (eventual consistency)
// Or synchronously (strong consistency)
for _, replica := range rm.replicas {
go replica.Replicate(id, vector)
}
return nil
}
Failure detection:
func (c *Cluster) monitorHealth() {
ticker := time.NewTicker(1 * time.Second)
for range ticker.C {
for _, node := range c.nodes {
if !c.ping(node) {
node.missedHeartbeats++
if node.missedHeartbeats > 3 {
c.handleNodeFailure(node)
}
} else {
node.missedHeartbeats = 0
}
}
}
}
func (c *Cluster) handleNodeFailure(failed *Node) {
log.Printf("Node %s failed, promoting replicas", failed.ID)
for _, shard := range failed.PrimaryShards {
// Find a replica to promote
replica := c.findReplica(shard)
if replica != nil {
c.promoteToPrimary(replica, shard)
}
}
c.updateShardMap()
c.notifyClients()
}
The Core Question You're Answering
"How do we scale vector search across machines while maintaining consistency?"
This isn't about making a single node faster; that's optimization. This is about fundamental distributed systems questions:
- How do we partition billions of vectors across nodes so queries stay fast?
- How do we ensure writes are durable and consistent across replicas?
- How do we handle node failures without losing data or availability?
- How do we route queries to the right shards and merge results correctly?
The single biggest misconception: "Distributed systems are just running multiple copies." Wrong. Distribution introduces coordination overhead, network partitions, partial failures, and consistency challenges that don't exist in single-node systems.
What makes this hard: You're balancing the CAP theorem tradeoffs. Strong consistency means higher latency and lower availability. Eventual consistency means dealing with stale reads and conflict resolution. There's no perfect answer, only informed tradeoffs.
Concepts You Must Understand First
Before you write distributed code, you need mental models for:
- Sharding vs Replication
- Sharding = Horizontal partitioning. Split data across nodes to distribute load.
- Replication = Copy data to multiple nodes for fault tolerance and read scaling.
- You need BOTH: shard for capacity, replicate for availability.
- Consistency Models
- Strong consistency: All replicas agree before acknowledging writes (slow, safe)
- Eventual consistency: Replicas converge over time (fast, risky)
  - Quorum reads/writes: W + R > N guarantees consistency (W = write acks required, R = replicas consulted per read, N = total replicas). Because W + R > N, every read set overlaps every write set, so at least one consulted replica holds the latest write.
- CAP Theorem
- Consistency: All nodes see the same data
- Availability: Every request gets a response
- Partition tolerance: System works despite network failures
  - You can only guarantee 2. In distributed systems, partitions WILL happen, so the real choice is CP or AP.
- Partitioning Strategies
  - Hash-based sharding: hash(key) % num_shards → Uniform distribution, but no range queries
  - Range-based sharding: Assign key ranges to shards → Supports range queries, but risk of hot shards
  - Consistent hashing: Virtual nodes minimize resharding when nodes are added/removed (see the sketch after this list)
- Distributed Consensus
- Raft/Paxos: Algorithms for agreeing on state across nodes
- Leader election: One node coordinates writes
  - Log replication: Leader's log is replicated to followers
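To make the consistent-hashing bullet concrete, here is a minimal Go sketch of a hash ring with virtual nodes. The names (`Ring`, `AddNode`, `Locate`) and the FNV hash are illustrative choices for this guide, not part of any project code above.

```go
import (
	"fmt"
	"hash/fnv"
	"sort"
)

type Ring struct {
	vnodes int               // virtual nodes per physical node
	hashes []uint32          // sorted vnode positions on the ring
	owners map[uint32]string // vnode position -> physical node
}

func NewRing(vnodes int) *Ring {
	return &Ring{vnodes: vnodes, owners: map[uint32]string{}}
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// AddNode places `vnodes` copies of the node on the ring so load spreads evenly.
func (r *Ring) AddNode(node string) {
	for i := 0; i < r.vnodes; i++ {
		pos := hash32(fmt.Sprintf("%s#%d", node, i))
		r.owners[pos] = node
		r.hashes = append(r.hashes, pos)
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
}

// Locate returns the node owning a key: the first vnode clockwise from the key's hash.
func (r *Ring) Locate(key string) string {
	pos := hash32(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= pos })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.owners[r.hashes[i]]
}
```

The payoff: adding a node only reassigns the keys that fall between its vnode positions and their predecessors, so roughly 1/N of keys move, instead of nearly all of them as with hash(key) % N.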
Study these chapters:
- "Designing Data-Intensive Applications" Chapter 5 (Replication)
- "Designing Data-Intensive Applications" Chapter 6 (Partitioning)
- "Designing Data-Intensive Applications" Chapter 9 (Consistency and Consensus)
- Raft paper: "In Search of an Understandable Consensus Algorithm"
Questions to Guide Your Design
Before writing code, answer these architectural questions:
Sharding:
- What's your shard key? (vector ID? category? timestamp?)
- How many shards? (Too few = hot spots. Too many = coordination overhead.)
- Hash-based or range-based partitioning?
- How do you handle shard rebalancing when nodes are added/removed?
Replication:
- What's your replication factor? (RF=3 is common: 1 primary + 2 replicas)
- Synchronous or asynchronous replication?
- How do you handle replication lag?
- What happens if a replica falls behind?
Consistency:
- Strong or eventual consistency?
- What are your quorum sizes? (e.g., W=2, R=2, N=3)
- How do you handle write conflicts?
- Do you need read-after-write consistency?
Query Routing:
- How does the client know which shard to query?
- Option A: Client-side routing (client knows shard map)
- Option B: Coordinator routing (coordinator routes to shards)
- Option C: Random node routing (any node can forward)
- For vector search, do you query ALL shards or use a routing hint?
- How do you merge results from multiple shards?
Failure Handling:
- How do you detect node failures? (Heartbeats? Gossip protocol?)
- How do you handle a primary failure? (Promote a replica? Re-elect leader?)
- What happens during a network partition?
- How do you prevent split-brain scenarios?
Thinking Exercise
Trace a query through the distributed cluster:
Client sends: search(query_vector, k=10)
Step 1: Router receives query
  → Looks up shard map (9 shards, RF=3)
  → Decision: Query ALL shards (can't route ANN search to subset)
Step 2: Router fans out to 9 shards
  → Sends search(query, k=30) to each shard (over-fetch for accurate merge)
  → Sends to PRIMARY or REPLICA? (Read from replica for load balancing)
Step 3: Each shard searches locally
  → Shard 0: finds 30 nearest neighbors
  → Shard 1: finds 30 nearest neighbors
  → ... (9 parallel searches)
Step 4: Router collects 270 results (9 shards × 30 results)
  → Merges by distance
  → Returns top-10
Total latency = max(shard_latencies) + merge_time + network_overhead (with the sample run above: 1.5ms + 0.3ms ≈ 1.8ms)
Now trace a write:
Client sends: insert(vector_id=12345, vector=[0.1, 0.2, ...])
Step 1: Router determines shard
  → shard_id = hash(12345) % 9 = 3
  → Looks up shard 3: primary=node-1, replicas=[node-0, node-2]
Step 2: Router sends to primary (node-1)
  → Primary writes to its local index
  → Primary replicates to 2 replicas
Step 3: Replication strategy (choose one):
  A) Synchronous: Wait for W=2 replicas to acknowledge (slow, safe)
  B) Asynchronous: Acknowledge immediately, replicate in background (fast, risky)
Step 4: Router acknowledges to client
  → If synchronous: only after quorum
  → If async: immediately after primary write
Failure scenario:
Node-1 fails (it was primary for shards 3, 4, 5)
Step 1: Failure detection
  → Heartbeat monitor notices node-1 missed 3 consecutive pings
  → Marks node-1 as DEAD
Step 2: Replica promotion
  → For shard 3: promote node-0's replica to primary
  → For shard 4: promote node-2's replica to primary
  → For shard 5: promote node-0's replica to primary
Step 3: Update shard map
  → Router updates routing table
  → Clients see new primaries
Step 4: Cluster is now degraded
  → RF=2 instead of RF=3 (missing one replica)
  → Background process creates new replicas to restore RF=3
The Interview Questions They'll Ask
Q1: "What's the difference between sharding and replication?"
Bad answer: "Sharding is horizontal scaling, replication is copying data."
Good answer: "Sharding partitions data across nodes to distribute LOAD; it's about capacity. Replication copies data to multiple nodes for FAULT TOLERANCE; it's about availability. You need both: shard to handle more data than one node can hold, replicate so a node failure doesn't lose data. For example, with 9 shards and RF=3, each shard exists on 3 nodes. Total nodes = 3, with each node holding 3 primary shards + 6 replica shards."
Q2: "How would you shard a vector database?"
Bad answer: "Just hash the vector ID and mod by number of shards."
Good answer: "It depends on the query pattern. For pure ANN search where you must query all vectors, hash-based sharding by vector ID works well: it distributes uniformly and is simple. But if you have metadata filters (e.g., 'category=images'), you might shard by category to enable shard pruning, querying only relevant shards. The tradeoff: category-based sharding risks hot shards if one category dominates. In practice, hybrid approaches work: hash-shard within category, or use consistent hashing with virtual nodes to balance load."
Q3: "Strong consistency or eventual consistency for vector search?"
Bad answer: "Strong consistency because we want correct results."
Good answer: "It depends on the use case. For vector search specifically, eventual consistency is often acceptable because:
- Vector embeddings rarely change; they're mostly append-only
- Slight staleness (a few seconds) doesn't break user experience
- Strong consistency adds latency (wait for quorum), which hurts search UX
However, for METADATA updates (e.g., deleting a user's data for GDPR), you need strong consistency: you can't show deleted data. So hybrid: eventual consistency for inserts, strong consistency for deletes. This is what Pinecone and Weaviate do."
Q4: "How do you handle merging results from multiple shards?"
Bad answer: "Just sort all results by distance and take top-k."
Good answer: "That works, but you should over-fetch from each shard. With exact per-shard search, merging each shard's top-k is already sufficient, but production shards run approximate search, so each one may miss some of its true top-k. Solution: fetch top-k×C from each shard (C is a safety factor, typically 2-3), then merge and re-rank. Example: for k=10, fetch top-30 from each of 9 shards (270 candidates), merge by distance, return top-10. The tradeoff: larger C = better recall but more data transferred. Production systems tune C based on data distribution."
Q5: "What happens during a network partition?"
Bad answer: "The system stops working."
Good answer: "It depends on your CAP choice. If you chose CP (consistency + partition tolerance), you sacrifice availability: nodes in the minority partition refuse writes to prevent split-brain. If you chose AP (availability + partition tolerance), you sacrifice consistency: both partitions accept writes, and you resolve conflicts later (e.g., last-write-wins, vector timestamps). For vector search, AP is often better; serving slightly stale results beats being unavailable. But you need a reconciliation strategy when the partition heals."
Q6: "How does leader election work in Raft?"
Bad answer: "Nodes vote for a leader."
Good answer: "Raft uses term-based elections. When a follower's heartbeat timeout expires, it increments its term, transitions to CANDIDATE, votes for itself, and requests votes from other nodes. A candidate wins if it receives votes from a majority (quorum). Once elected, the leader sends heartbeats to prevent new elections. Key details: randomized timeouts prevent split votes, the log completeness rule ensures the leader has the most up-to-date log, and higher term numbers always win. This guarantees at most one leader per term."
Hints in Layers
Layer 1: Single-Node Baseline (Week 1)
Start simple. Build a single-node vector database with HNSW index (from Project 5). Wrap it in a network service:
type VectorNode struct {
index *HNSW
}
func (n *VectorNode) Insert(ctx context.Context, req *InsertRequest) error {
return n.index.Insert(req.VectorID, req.Vector)
}
func (n *VectorNode) Search(ctx context.Context, req *SearchRequest) ([]*SearchResult, error) {
return n.index.Search(req.Query, req.K)
}
func main() {
node := NewVectorNode()
    // Pseudocode: with gRPC-Go you would create a grpc.Server, register the
    // VectorNode service, and call Serve on a TCP listener
    grpc.Serve(":50051", node)
}
Why start here: You need a working single-node system before distributing it. This baseline lets you measure distributed overhead later.
Layer 2: Add Sharding (Week 2)
Partition data across multiple nodes. Implement hash-based sharding:
type ShardRouter struct {
nodes []*NodeClient // RPC clients to nodes
numShards int
}
func (r *ShardRouter) GetShardID(vectorID uint64) int {
return int(vectorID % uint64(r.numShards))
}
func (r *ShardRouter) GetNode(shardID int) *NodeClient {
// Assuming equal shard distribution
return r.nodes[shardID % len(r.nodes)]
}
func (r *ShardRouter) Insert(vectorID uint64, vector []float32) error {
shardID := r.GetShardID(vectorID)
node := r.GetNode(shardID)
return node.Insert(vectorID, vector)
}
func (r *ShardRouter) Search(query []float32, k int) ([]*SearchResult, error) {
// Fan out to ALL shards (can't route ANN search)
resultsChan := make(chan []*SearchResult, len(r.nodes))
for _, node := range r.nodes {
go func(n *NodeClient) {
            res, _ := n.Search(query, k*3) // Over-fetch; errors ignored to keep the sketch short
resultsChan <- res
}(node)
}
// Merge results
var allResults []*SearchResult
for i := 0; i < len(r.nodes); i++ {
allResults = append(allResults, <-resultsChan...)
}
    sort.Slice(allResults, func(i, j int) bool {
        return allResults[i].Distance < allResults[j].Distance
    })
    if len(allResults) > k {
        allResults = allResults[:k] // Guard: don't slice past the end when shards return fewer than k
    }
    return allResults, nil
}
Test it: Insert 1M vectors, verify they're distributed across shards. Run searches, verify results are correct.
Key insight: Sharding reduces per-node load but INCREASES query complexity (must query all shards).
Layer 3: Add Replication (Week 3)
Each shard now has replicas. Implement synchronous replication with quorum writes:
type ReplicatedShard struct {
primary *NodeClient
replicas []*NodeClient
W int // Write quorum
R int // Read quorum
}
func (s *ReplicatedShard) Insert(vectorID uint64, vector []float32) error {
// Send to all replicas
ackChan := make(chan error, len(s.replicas)+1)
// Write to primary
go func() {
ackChan <- s.primary.Insert(vectorID, vector)
}()
// Write to replicas
for _, replica := range s.replicas {
go func(r *NodeClient) {
ackChan <- r.Insert(vectorID, vector)
}(replica)
}
// Wait for W acknowledgments
acks := 0
for i := 0; i < len(s.replicas)+1; i++ {
if <-ackChan == nil {
acks++
if acks >= s.W {
return nil // Quorum reached
}
}
}
return errors.New("quorum not reached")
}
Test it: Kill a replica mid-write, verify quorum still succeeds. Kill N-W+1 nodes (so fewer than W remain), verify writes fail (correct behavior).
Key insight: Replication trades latency for durability. W=1 is fast but risky. W=N is safe but slow.
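The ReplicatedShard above declares a read quorum R but never uses it. A companion quorum read could look like this sketch: `Get` on NodeClient and the `VersionedValue` shape are assumptions for illustration.

```go
// VersionedValue pairs a vector with a monotonically increasing sequence
// number assigned by the primary on each write.
type VersionedValue struct {
	Vector []float32
	Seq    uint64
}

func (s *ReplicatedShard) Get(vectorID uint64) ([]float32, error) {
	nodes := append([]*NodeClient{s.primary}, s.replicas...)
	respChan := make(chan VersionedValue, len(nodes))
	for _, n := range nodes {
		go func(n *NodeClient) {
			if v, err := n.Get(vectorID); err == nil {
				respChan <- v
			}
		}(n)
	}
	// Wait for R responses and return the freshest; because W + R > N,
	// at least one of them reflects the latest acknowledged write.
	// A production version needs a timeout and an error path for the
	// case where fewer than R replicas ever respond.
	var freshest VersionedValue
	for i := 0; i < s.R; i++ {
		v := <-respChan
		if v.Seq >= freshest.Seq {
			freshest = v
		}
	}
	return freshest.Vector, nil
}
```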
Layer 4: Add Failure Detection (Week 4)
Implement heartbeat-based failure detection:
type HealthMonitor struct {
nodes []*Node
heartbeatTicker *time.Ticker
}
func (h *HealthMonitor) Start() {
h.heartbeatTicker = time.NewTicker(1 * time.Second)
go func() {
for range h.heartbeatTicker.C {
for _, node := range h.nodes {
if err := node.Ping(); err != nil {
node.FailureCount++
if node.FailureCount >= 3 {
h.handleFailure(node)
}
} else {
node.FailureCount = 0
}
}
}
}()
}
func (h *HealthMonitor) handleFailure(failed *Node) {
log.Printf("Node %s failed", failed.ID)
// Promote replicas to primary
for _, shard := range failed.PrimaryShards {
replica := h.findHealthyReplica(shard)
h.promoteToPrimary(replica, shard)
}
// Update routing table
h.updateShardMap()
}
Test it: Kill a primary node, verify replica promotion. Kill multiple nodes, verify the system degrades gracefully.
Key insight: Failure detection has false positives (slow network != dead node). Tune timeouts carefully.
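One way to reduce those false positives is to bound each probe with an explicit deadline instead of letting a slow RPC count as a miss. A minimal sketch, assuming the node client exposes a context-aware `Ping` (not defined above):

```go
import (
	"context"
	"time"
)

// probe returns true if the node answered within the deadline; a slow
// response increments the miss counter rather than blocking the monitor loop.
func (h *HealthMonitor) probe(node *Node) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	return node.Client.Ping(ctx) == nil
}
```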
Layer 5: Add Consensus (Weeks 5-6)
Implement Raft for consistent shard map updates and leader election. This is the hardest part.
Option A: Use an existing Raft library (etcd's raft, Hashicorp's raft).
Option B: Implement Raft yourself (educational but time-consuming).
// Using Hashicorp's raft library
func (c *Cluster) InitRaft() error {
config := raft.DefaultConfig()
config.LocalID = raft.ServerID(c.nodeID)
store := raft.NewInmemStore()
snapshots := raft.NewInmemSnapshotStore()
transport, _ := raft.NewTCPTransport(c.bindAddr, nil, 3, 10*time.Second, os.Stderr)
r, err := raft.NewRaft(config, c, store, store, snapshots, transport)
if err != nil {
return err
}
c.raft = r
return nil
}
// Apply shard map updates through Raft
func (c *Cluster) UpdateShardMap(update *ShardMapUpdate) error {
if c.raft.State() != raft.Leader {
return c.forwardToLeader(update)
}
data, _ := json.Marshal(update)
future := c.raft.Apply(data, 10*time.Second)
return future.Error()
}
Test it: Kill the leader, verify a new leader is elected. Make writes during leader election, verify they're queued and applied after election.
Key insight: Consensus is expensive. Only use it for critical state (shard map, membership), not every write.
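One piece the snippet above glosses over: raft.NewRaft receives the cluster as its FSM, so the cluster must implement hashicorp/raft's FSM interface (Apply, Snapshot, Restore). A minimal sketch, assuming the ShardMapUpdate fields shown below:

```go
import (
	"encoding/json"
	"io"
	"sync"

	"github.com/hashicorp/raft"
)

// ShardMapUpdate fields are assumptions for illustration.
type ShardMapUpdate struct {
	ShardID int
	Nodes   []string // primary first, then replicas
}

type clusterFSM struct {
	mu       sync.Mutex
	shardMap map[int][]string
}

func newClusterFSM() *clusterFSM {
	return &clusterFSM{shardMap: map[int][]string{}}
}

// Apply is invoked once a log entry is committed on a quorum of nodes.
func (f *clusterFSM) Apply(l *raft.Log) interface{} {
	var u ShardMapUpdate
	if err := json.Unmarshal(l.Data, &u); err != nil {
		return err
	}
	f.mu.Lock()
	defer f.mu.Unlock()
	f.shardMap[u.ShardID] = u.Nodes
	return nil
}

// Snapshot/Restore let Raft compact its log by checkpointing the shard map.
func (f *clusterFSM) Snapshot() (raft.FSMSnapshot, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	data, err := json.Marshal(f.shardMap)
	return &mapSnapshot{data: data}, err
}

func (f *clusterFSM) Restore(rc io.ReadCloser) error {
	defer rc.Close()
	return json.NewDecoder(rc).Decode(&f.shardMap)
}

type mapSnapshot struct{ data []byte }

func (s *mapSnapshot) Persist(sink raft.SnapshotSink) error {
	if _, err := sink.Write(s.data); err != nil {
		sink.Cancel()
		return err
	}
	return sink.Close()
}
func (s *mapSnapshot) Release() {}
```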
Books That Will Help
- "Designing Data-Intensive Applications" by Martin Kleppmann
  - Chapter 5: Replication → Covers leader-follower replication, multi-leader, leaderless (Dynamo-style)
  - Chapter 6: Partitioning → Covers sharding strategies, rebalancing, routing
  - Chapter 7: Transactions → Covers consistency guarantees
  - Chapter 8: The Trouble with Distributed Systems → Covers network faults, clocks, Byzantine faults
  - Chapter 9: Consistency and Consensus → Covers linearizability, CAP theorem, Raft/Paxos
  - This is THE book for distributed systems. Read chapters 5-9 before starting this project.
- "Database Internals" by Alex Petrov
  - Part II: Distributed Systems → Covers failure detection, leader election, replication, anti-entropy
  - More implementation-focused than Kleppmann.
- Raft Paper: "In Search of an Understandable Consensus Algorithm"
  - Original Raft paper by Diego Ongaro and John Ousterhout
  - 16 pages, very readable (unlike Paxos)
  - Available at: https://raft.github.io/raft.pdf
- "Distributed Systems" by Maarten van Steen and Andrew S. Tanenbaum
  - Chapter 6: Synchronization → Covers clocks, mutual exclusion
  - Chapter 7: Consistency and Replication → Covers consistency models
  - More academic, but comprehensive.
- Raft Visualization
- Interactive visualization: https://raft.github.io/
- Play with leader election, log replication, network partitions
- Real-world systems to study:
- Cassandra: Leaderless, eventual consistency, tunable quorums
- MongoDB: Leader-based, strong consistency, automatic sharding
- Elasticsearch: Sharding, replication, Lucene-based (similar to vector DBs)
- Pinecone/Weaviate: Production vector databases (study their architecture blogs)
Reading plan:
- Week 1: DDIA Chapters 5-6 (Replication, Partitioning)
- Week 2: DDIA Chapters 8-9 (Distributed Systems, Consensus)
- Week 3: Raft paper + visualization
- Weeks 4-6: Build while referencing "Database Internals" Part II
Learning milestones:
- Sharding distributes load → You understand partitioning
- Parallel search works → You understand distributed query execution
- Failover promotes replicas → You understand fault tolerance
- Consistency is maintained → You understand distributed state
Capstone Project: Full-Featured Vector Database
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Rust
- Alternative Programming Languages: C++, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The "Industry Disruptor"
- Difficulty: Level 5: Master
- Knowledge Area: Complete Database System
- Software or Tool: Production Vector Database
- Main Book: All previous + "Database Internals" by Alex Petrov
What you'll build: A complete, production-ready vector database with: HNSW and IVF-PQ indexes, durable storage, metadata filtering, distributed clustering, gRPC API, and client SDKs.
Why this is the ultimate test: This combines everything: algorithms, storage, networking, distributed systems, and API design. Building a complete vector database proves you truly understand how these systems work.
Core challenges you'll face:
- Integrating all components → maps to system design
- API design → maps to user experience
- Performance tuning → maps to profiling and optimization
- Operational features → maps to monitoring, backups, upgrades
Key Concepts:
- Everything from previous projects
- API Design: gRPC best practices
- Operations: Observability, deployment
Difficulty: Master Time estimate: 3-6 months Prerequisites: All previous projects completed.
Real world outcome:
$ vecdb-server --config production.yaml
╔════════════════════════════════════════════════════════════╗
║                     VecDB Server v1.0                      ║
╠════════════════════════════════════════════════════════════╣
║ Configuration:                                             ║
║   Storage: /data/vecdb                                     ║
║   Index: HNSW (M=32, ef=200)                               ║
║   Shards: 12 (distributed across 4 nodes)                  ║
║   Replicas: 2                                              ║
║                                                            ║
║ Endpoints:                                                 ║
║   gRPC: 0.0.0.0:6333                                       ║
║   HTTP: 0.0.0.0:6334                                       ║
║   Metrics: 0.0.0.0:9090/metrics                            ║
║                                                            ║
║ Status: Ready to accept connections                        ║
╚════════════════════════════════════════════════════════════╝
# Python client usage
from vecdb import VecDBClient
client = VecDBClient("localhost:6333")
# Create collection
client.create_collection(
name="products",
dimension=768,
metric="cosine",
index_config={
"type": "hnsw",
"m": 32,
"ef_construction": 200
}
)
# Insert vectors
client.upsert(
collection="products",
vectors=[
{"id": "prod_001", "vector": [...], "metadata": {"category": "electronics", "price": 99.99}},
{"id": "prod_002", "vector": [...], "metadata": {"category": "clothing", "price": 49.99}},
...
]
)
# Search with filters
results = client.search(
collection="products",
vector=query_embedding,
limit=10,
filter={"category": "electronics", "price": {"$lt": 100}}
)
# Results
[
{"id": "prod_001", "score": 0.95, "metadata": {...}},
{"id": "prod_042", "score": 0.91, "metadata": {...}},
...
]
Implementation Hints:
System architecture:
┌──────────────────────────────────────────────────────────────┐
│                         Client SDKs                          │
│                (Python, JavaScript, Rust, Go)                │
├──────────────────────────────────────────────────────────────┤
│                   API Layer (gRPC + HTTP)                    │
│   ┌────────────┬────────────┬────────────┬────────────┐      │
│   │   Insert   │   Search   │   Delete   │   Admin    │      │
│   └────────────┴────────────┴────────────┴────────────┘      │
├──────────────────────────────────────────────────────────────┤
│                       Query Processor                        │
│   ┌────────────┬────────────┬────────────┐                   │
│   │  Planner   │ Optimizer  │  Executor  │                   │
│   └────────────┴────────────┴────────────┘                   │
├──────────────────────────────────────────────────────────────┤
│                         Index Layer                          │
│   ┌────────────┬────────────┬────────────┐                   │
│   │    HNSW    │   IVF-PQ   │  Metadata  │                   │
│   └────────────┴────────────┴────────────┘                   │
├──────────────────────────────────────────────────────────────┤
│                        Storage Layer                         │
│   ┌────────────┬────────────┬────────────┐                   │
│   │    WAL     │  Vectors   │   Index    │                   │
│   └────────────┴────────────┴────────────┘                   │
├──────────────────────────────────────────────────────────────┤
│                      Distributed Layer                       │
│   ┌────────────┬────────────┬────────────┐                   │
│   │  Sharding  │Replication │ Consensus  │                   │
│   └────────────┴────────────┴────────────┘                   │
└──────────────────────────────────────────────────────────────┘
gRPC API definition:
service VecDB {
rpc CreateCollection(CreateCollectionRequest) returns (CreateCollectionResponse);
rpc Upsert(UpsertRequest) returns (UpsertResponse);
rpc Search(SearchRequest) returns (SearchResponse);
rpc Delete(DeleteRequest) returns (DeleteResponse);
rpc GetCollection(GetCollectionRequest) returns (GetCollectionResponse);
}
message SearchRequest {
string collection = 1;
repeated float vector = 2;
uint32 limit = 3;
Filter filter = 4;
SearchParams params = 5;
}
message SearchResponse {
repeated SearchResult results = 1;
uint64 searched_vectors = 2;
float latency_ms = 3;
}
Collection management:
struct Collection {
name: String,
dimension: usize,
metric: DistanceMetric,
// Index
index: Box<dyn VectorIndex>,
// Storage
storage: StorageEngine,
// Metadata
metadata_index: MetadataIndex,
// Configuration
config: CollectionConfig,
}
impl Collection {
fn upsert(&mut self, vectors: Vec<VectorRecord>) -> Result<()> {
// Write to WAL first
self.storage.wal_append(&vectors)?;
// Update index
for record in &vectors {
self.index.insert(record.id, &record.vector)?;
self.metadata_index.insert(record.id, &record.metadata)?;
}
// Flush periodically
if self.should_flush() {
self.flush()?;
}
Ok(())
}
fn search(&self, query: &[f32], k: usize, filter: Option<Filter>) -> Result<Vec<SearchResult>> {
let valid_ids = match filter {
Some(f) => Some(self.metadata_index.filter(&f)?),
None => None,
};
self.index.search(query, k, valid_ids)
}
}
Metrics and monitoring:
struct Metrics {
insert_latency: Histogram,
search_latency: Histogram,
vectors_total: Gauge,
index_size_bytes: Gauge,
queries_per_second: Counter,
}
impl Metrics {
fn record_search(&self, latency: Duration, results: usize) {
self.search_latency.observe(latency.as_secs_f64());
self.queries_per_second.inc();
}
}
Learning milestones:
- Basic CRUD works → Core functionality complete
- Filtered search works → Advanced querying done
- Distributed mode works → Scalability achieved
- Production deployed → Operational readiness
- Handles real workloads → Production quality
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Brute Force k-NN | Beginner | Weekend | ⭐⭐ | ⭐⭐⭐ |
| 2. LSH | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3. IVF | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 4. Product Quantization | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. HNSW | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 6. IVF-PQ | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. DiskANN-style | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 8. Filtered Search | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Storage Engine | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. Distributed Cluster | Master | 4-6 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Capstone: Full DB | Master | 3-6 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Recommended Learning Path
Path A: The Complete Journey (6-12 months)
- Weeks 1-2: Projects 1-2 (Fundamentals + LSH)
- Weeks 3-5: Projects 3-4 (IVF + PQ)
- Weeks 6-9: Project 5 (HNSW - the key algorithm)
- Weeks 10-12: Project 6 (IVF-PQ composite)
- Weeks 13-16: Projects 7-8 (Disk-based + Filtering)
- Weeks 17-20: Project 9 (Storage Engine)
- Weeks 21-26: Project 10 (Distributed)
- Months 7-12: Capstone
Path B: Practical Focus (2-3 months)
If you want to understand production systems quickly:
1 → 3 → 5 → 8 → (use existing storage/distribution)
This covers: brute force baseline, IVF, HNSW, and filtering, enough to work with any production vector database.
Path C: Algorithm Deep Dive (2-3 months)
If you want to understand the algorithms deeply:
1 → 2 → 3 → 4 → 5 → 6
This covers all the major ANN algorithms without the systems complexity.
Recommendation Based on Your Goals
If you're building RAG applications: Do Projects 1, 3, 5, 8. You'll understand how to choose and tune indexes.
If you want to contribute to vector DBs: Do the full path. Open-source projects like Qdrant and Milvus need contributors who understand internals.
If you're researching ANN algorithms: Focus on Projects 2-6, read the original papers, benchmark against ann-benchmarks.com.
Essential Resources
Papers
- HNSW Paper - Malkov & Yashunin
- Product Quantization Paper - Jégou, Douze, Schmid
- DiskANN Paper - Microsoft Research
- Billion-scale ANN - FAISS paper
Tutorials and Guides
- Pinecone Learning Center - Excellent visual explanations
- Zilliz Learn - Milvus team's tutorials
- FAISS Wiki - Official FAISS documentation
Benchmarks
- ANN-Benchmarks - Standard benchmark for ANN algorithms
- Big-ANN Benchmarks - Billion-scale benchmarks
Books
- "Designing Data-Intensive Applications" by Martin Kleppmann
- "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman
- "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
Libraries to Study
- FAISS - Facebook's library (C++ with Python bindings)
- hnswlib - Original HNSW implementation
- Qdrant - Rust vector database
- Milvus - Distributed vector database
Summary
| # | Project Name | Main Language |
|---|---|---|
| 1 | Brute Force k-NN and Distance Metrics | Python |
| 2 | Locality Sensitive Hashing (LSH) | Python |
| 3 | Inverted File Index (IVF) | Python |
| 4 | Product Quantization (PQ) | Python |
| 5 | HNSW Index | Python |
| 6 | IVF-PQ Composite Index | Python |
| 7 | Disk-Based Vector Index (DiskANN-style) | C++ |
| 8 | Metadata Filtering with Vector Search | Python |
| 9 | Vector Database Storage Engine | Rust |
| 10 | Distributed Vector Search Cluster | Go |
| Capstone | Full-Featured Vector Database | Rust |
Vector databases are where algorithms meet systems engineering. The best way to understand them is to build one. Start with brute force, feel the pain, then appreciate the elegance of HNSW and the efficiency of IVF-PQ. By the end, you'll know exactly what happens when you call collection.search().