
LEARN VECTOR DATABASES


Learn Vector Databases: From Zero to Building Your Own

Goal: Deeply understand how vector databases work internally, from the mathematics of similarity search through indexing algorithms like HNSW and IVF to building a production-ready vector search engine. Master the technology powering modern AI applications.


Why Vector Databases Matter

Every time you use semantic search, image similarity, recommendation systems, or RAG (Retrieval-Augmented Generation) with LLMs, a vector database is doing the heavy lifting. Traditional databases search by exact match; vector databases search by meaning.

The challenge: How do you efficiently find the 10 most similar items among a billion vectors, each with 1536 dimensions? Brute force would take hours. Vector databases do it in milliseconds.

After completing these projects, you will:

  • Understand exactly how similarity search works at scale
  • Implement HNSW, IVF, LSH, and other ANN algorithms from scratch
  • Know when to use each algorithm and what tradeoffs to make
  • Build a complete vector database that handles real workloads
  • Appreciate the engineering behind Pinecone, Milvus, Qdrant, and others

Core Concept Analysis

What is a Vector?

In the context of vector databases, a vector (or embedding) is a list of numbers that represents the meaning or features of something:

"The quick brown fox" โ†’ [0.12, -0.34, 0.56, ..., 0.78]  (1536 dimensions)
[Image of a cat]      โ†’ [0.23, 0.45, -0.67, ..., 0.89]  (512 dimensions)
[Product attributes]  โ†’ [0.91, -0.23, 0.45, ..., 0.12]  (256 dimensions)

These vectors come from ML models (OpenAI embeddings, CLIP, sentence-transformers) that position similar items close together in high-dimensional space.

Distance Metrics

How do we measure "closeness" between vectors?

  1. Euclidean Distance (L2): Straight-line distance in space
    d(a, b) = √(Σ(aᵢ - bᵢ)²)

    • Good for: Absolute position matters
    • Range: [0, ∞)
  2. Cosine Similarity: Angle between vectors (normalized)
    cos(a, b) = (a · b) / (||a|| × ||b||)

    • Good for: Direction matters, magnitude doesn't
    • Range: [-1, 1] (1 = identical, 0 = orthogonal, -1 = opposite)
  3. Dot Product (Inner Product): Combines magnitude and direction
    a · b = Σ(aᵢ × bᵢ)

    • Good for: When vectors are already normalized
    • Range: (-∞, ∞)
  4. Manhattan Distance (L1): Sum of absolute differences
    d(a, b) = Σ|aᵢ - bᵢ|
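
For concreteness, here is a minimal NumPy sketch of these metrics (function names are illustrative, not from any particular library):

import numpy as np

def euclidean(a, b):
    # L2: straight-line distance
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # L1: sum of absolute differences
    return np.sum(np.abs(a - b))

def dot_product(a, b):
    # Inner product: combines magnitude and direction (higher = more similar)
    return np.dot(a, b)

def cosine_similarity(a, b):
    # Angle-based similarity in [-1, 1]; ignores magnitude
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([3.0, 4.0]), np.array([6.0, 8.0])
print(euclidean(a, b), manhattan(a, b), dot_product(a, b), cosine_similarity(a, b))
# 5.0 7.0 50.0 1.0  (b is a scaled copy of a, so the angle between them is zero)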
    

The Curse of Dimensionality

Why is vector search hard?

In high dimensions (100+), counterintuitive things happen:

  • All points are "far": The distance between any two random points converges
  • Volume explodes: A hypercube with side 1 has volume 1, but almost all that volume is in the "corners"
  • Brute force is slow: Comparing a query to 1 billion vectors = 1 billion distance calculations

This is why we need Approximate Nearest Neighbor (ANN) algorithms, which trade perfect accuracy for massive speedups.

The ANN Tradeoff

Accuracy (Recall)  ←──────────────────────→  Speed
    100%                                    0.001ms

    Brute Force (exact)                     Too slow
    HNSW (graph)                            Great balance
    IVF (clustering)                        Faster, less accurate
    LSH (hashing)                           Fastest, least accurate

Recall@K: Of the true K nearest neighbors, what fraction did we find?

  • Recall@10 = 0.95 means we found 95% of the true top-10 neighbors
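
A minimal sketch of how this is typically measured, assuming you already have ground-truth neighbors from an exact search (names here are illustrative):

def recall_at_k(approx_ids, true_ids, k):
    # Fraction of the true top-k neighbors that the approximate search returned
    return len(set(approx_ids[:k]) & set(true_ids[:k])) / k

# 9 of the true top-10 were found:
print(recall_at_k([1, 2, 3, 4, 5, 6, 7, 8, 9, 99], list(range(10)), 10))  # 0.9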

Vector Database Architecture

┌──────────────────────────────────────────────────────────┐
│                       Client Layer                       │
│          (REST API, gRPC, SDKs, Query Language)          │
├──────────────────────────────────────────────────────────┤
│                       Query Layer                        │
│   (Query Planning, Filtering, Caching, Result Ranking)   │
├──────────────────────────────────────────────────────────┤
│                       Index Layer                        │
│   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐     │
│   │  HNSW   │  │   IVF   │  │   LSH   │  │  Flat   │     │
│   └─────────┘  └─────────┘  └─────────┘  └─────────┘     │
├──────────────────────────────────────────────────────────┤
│                      Storage Layer                       │
│     (Vector Storage, Metadata, WAL, Compression/PQ)      │
├──────────────────────────────────────────────────────────┤
│                    Distributed Layer                     │
│      (Sharding, Replication, Consistency, Routing)       │
└──────────────────────────────────────────────────────────┘

Concept Summary Table

  • Vector Embeddings: Vectors are numeric representations of meaning. Similar items map to nearby points in high-dimensional space. Embeddings come from ML models (transformers, CNNs) that compress semantics into fixed-size arrays.
  • Distance Metrics: Similarity = inverse of distance. Euclidean measures absolute position, cosine measures direction (angle), dot product combines both. Choice depends on whether magnitude matters for your domain.
  • Curse of Dimensionality: In high dimensions (100+), all points become "far" from each other, volume concentrates in corners, and brute force becomes impossibly slow. This is why approximate search exists.
  • ANN vs Exact Search: Exact k-NN guarantees finding the true nearest neighbors but doesn't scale. ANN (Approximate Nearest Neighbor) trades accuracy for speed: 99% accuracy at 1000x speedup is the typical tradeoff.
  • LSH (Locality Sensitive Hashing): Hash similar vectors to the same buckets using random projections. Multiple hash tables = OR (higher recall), multiple hashes per table = AND (higher precision). Probabilistic but fast.
  • IVF (Inverted File Index): Cluster vectors with k-means, then search only nearby clusters. nprobe controls the recall/speed tradeoff. Foundation of FAISS and most production systems. Simple but effective.
  • Product Quantization: Compress vectors by splitting dimensions into subspaces and quantizing each independently. 128D float32 (512 bytes) → 128 uint8 codes (128 bytes) = 4x compression. Lossy but enables billion-scale search.
  • HNSW (Hierarchical NSW): Multi-layer proximity graph. Upper layers = express highways, lower layers = local streets. Greedy search with backtracking. Best all-around algorithm for accuracy and speed.
  • Index Composition: Real systems combine techniques: IVF for partitioning + PQ for compression + HNSW for graph search. Each layer optimizes a different bottleneck.
  • Metadata Filtering: Vector search + filters (price < $100, category = shoes). Pre-filtering wastes recall, post-filtering wastes compute. Hybrid indexes and filtered IVF are the solutions.
  • Distributed Systems: Sharding for scale (partition by vector ID or space), replication for availability, consistency for correctness. CAP theorem applies. Leader-follower or quorum-based writes.
  • Disk-Based Indexes: RAM is expensive at billion-vector scale. Disk-based indexes use SSD-friendly layouts: compressed vectors, graph edges on disk, selective caching. DiskANN achieves 95%+ recall.

Deep Dive Reading By Concept

This section maps each concept from above to specific book chapters and papers for deeper understanding. Read these before or alongside the projects to build strong mental models.

Vector Embeddings & Representation Learning

  • What are embeddings and how are they created: Speech and Language Processing by Jurafsky & Martin, Ch. 6: "Vector Semantics and Embeddings"
  • Word2Vec and semantic spaces: Neural Network Methods for Natural Language Processing by Yoav Goldberg, Ch. 10: "Pre-trained Word Representations"
  • Modern transformer embeddings: Natural Language Processing with Transformers by Tunstall, von Werra & Wolf, Ch. 2: "Text Classification" (BERT embeddings section)
  • Image embeddings (CNNs): Deep Learning by Goodfellow, Bengio & Courville, Ch. 9: "Convolutional Networks"
  • Multimodal embeddings (CLIP): Paper: "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021)

Distance Metrics

  • Distance functions and metric spaces: Mining of Massive Datasets by Leskovec, Rajaraman & Ullman, Ch. 3.5: "Distance Measures"
  • Cosine vs Euclidean (when to use which): Introduction to Information Retrieval by Manning, Raghavan & Schütze, Ch. 6: "Scoring, term weighting, and the vector space model"
  • Dot product for normalized vectors: Foundations of Statistical Natural Language Processing by Manning & Schütze, Ch. 8: "Lexical Semantics"
  • Metric properties (triangle inequality, symmetry): Pattern Recognition and Machine Learning by Bishop, Ch. 2.5: "Bayesian Decision Theory"

Curse of Dimensionality & High-Dimensional Geometry

  • Why high dimensions are different: The Elements of Statistical Learning by Hastie, Tibshirani & Friedman, Ch. 2.5: "Local Methods in High Dimensions"
  • Volume concentration in hyperspheres: Mining of Massive Datasets by Leskovec, Rajaraman & Ullman, Ch. 7: "Clustering" (curse of dimensionality section)
  • Distance concentration phenomenon: Paper: "When Is 'Nearest Neighbor' Meaningful?" (Beyer et al., 1999)
  • Implications for machine learning: Pattern Recognition and Machine Learning by Bishop, Ch. 1.4: "The Curse of Dimensionality"

Approximate Nearest Neighbor Algorithms

  • ANN survey and taxonomy: Paper: "A Survey on Nearest Neighbor Search Methods" (Wang et al., 2021)
  • Recall/precision/latency tradeoffs: Mining of Massive Datasets by Leskovec, Rajaraman & Ullman, Ch. 3.6: "Locality-Sensitive Hashing for Documents"
  • Benchmarking ANN algorithms: Website: ann-benchmarks.com, comprehensive comparisons

Locality Sensitive Hashing (LSH)

  • LSH fundamentals and theory: Mining of Massive Datasets by Leskovec, Rajaraman & Ullman, Ch. 3.4: "Locality-Sensitive Hashing"
  • Random hyperplanes for cosine similarity: Paper: "Similarity Estimation Techniques from Rounding Algorithms" (Charikar, 2002)
  • LSH families for different metrics: Foundations of Data Science by Blum, Hopcroft & Kannan, Ch. 2.5: "Locality Sensitive Hashing"
  • Multi-probe LSH: Paper: "Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search" (Lv et al., 2007)
  • Practical LSH implementations: Tutorial: Pinecone LSH Guide

Clustering & K-Means

  • K-means algorithm and convergence: Pattern Recognition and Machine Learning by Bishop, Ch. 9.1: "K-means Clustering"
  • Choosing K (number of clusters): The Elements of Statistical Learning by Hastie, Tibshirani & Friedman, Ch. 14.3: "Cluster Analysis"
  • K-means++ initialization: Paper: "k-means++: The Advantages of Careful Seeding" (Arthur & Vassilvitskii, 2007)
  • Mini-batch k-means for large datasets: Scikit-learn documentation on Mini-Batch K-Means

Inverted File Index (IVF)

  • Inverted indexes in text search: Introduction to Information Retrieval by Manning, Raghavan & Schütze, Ch. 1: "Boolean Retrieval"
  • IVF for vector search: Paper: "Product Quantization for Nearest Neighbor Search" (Jegou et al., 2011), section on IVF
  • FAISS IVF indexes: FAISS Wiki: Index Types
  • nprobe tuning strategies: Pinecone blog: "Vector Indexes" series

Product Quantization (PQ)

  • Vector quantization fundamentals: Pattern Recognition and Machine Learning by Bishop, Ch. 9.2: "Mixtures of Gaussians" (VQ as special case)
  • Product Quantization paper (must-read): Paper: "Product Quantization for Nearest Neighbor Search" (Jegou et al., 2011)
  • Asymmetric distance computation: Same paper, Section 3.3
  • Optimized PQ (OPQ): Paper: "Optimized Product Quantization" (Ge et al., 2013)
  • PQ in FAISS: FAISS Wiki: Product Quantizers

HNSW (Hierarchical Navigable Small World)

  • Small-world networks: Networks by Newman, Ch. 11: "Small-World Networks"
  • Navigable Small World graphs: Paper: "Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs" (Malkov et al., 2014)
  • HNSW algorithm (must-read): Paper: "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs" (Malkov & Yashunin, 2018)
  • HNSW implementation details: hnswlib GitHub
  • HNSW tuning (M and ef parameters): Pinecone blog: "HNSW Index Deep Dive"

Graph-Based Search

  • Proximity graphs for search: Nearest Neighbor Search: A Database Perspective by Shasha & Wang, Ch. 4
  • Navigating graphs efficiently: Paper: "A Comparative Study of Graph-based Approximate Nearest Neighbor Algorithms" (Wang et al., 2021)
  • Graph construction strategies: HNSW paper, Section 3.2

Compression & Quantization

  • Vector quantization theory: The Elements of Statistical Learning by Hastie, Tibshirani & Friedman, Ch. 14.3.12: "Vector Quantization"
  • Scalar quantization: Understanding Digital Signal Processing by Lyons, Ch. 12: "Data Compression"
  • Additive quantization: Paper: "Additive Quantization for Extreme Vector Compression" (Babenko & Lempitsky, 2014)

Disk-Based Indexes

  • SSD-friendly data structures: Database Internals by Petrov, Ch. 3: "File Formats"
  • DiskANN algorithm: Paper: "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node" (Jayaram Subramanya et al., 2019)
  • Memory-mapped files: The Linux Programming Interface by Kerrisk, Ch. 49: "Memory Mappings"
  • Vamana graph (DiskANN's graph): DiskANN paper, Section 3

Distributed Systems & Scaling

  • Distributed system fundamentals: Designing Data-Intensive Applications by Kleppmann, Ch. 5-9 (Replication, Partitioning, Transactions, Consistency)
  • Sharding strategies: Designing Data-Intensive Applications by Kleppmann, Ch. 6: "Partitioning"
  • CAP theorem and consistency: Distributed Systems by Tanenbaum & van Steen, Ch. 7: "Consistency and Replication"
  • Leader election (Raft/Paxos): Paper: "In Search of an Understandable Consensus Algorithm (Extended Version)" (Ongaro & Ousterhout, 2014) - Raft
  • Distributed vector databases: Milvus documentation: Architecture Overview

Metadata Filtering & Hybrid Search

  • Filtered vector search: Paper: "Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters" (Singh et al., 2023)
  • Hybrid search (sparse + dense): Pinecone blog: "Hybrid Search Explained"
  • Inverted indexes for metadata: Introduction to Information Retrieval by Manning, Raghavan & Schütze, Ch. 2: "The Term Vocabulary and Postings Lists"

Storage Engines & Persistence

  • LSM-trees for write-heavy workloads: Designing Data-Intensive Applications by Kleppmann, Ch. 3: "Storage and Retrieval" (LSM-trees section)
  • WAL (Write-Ahead Logging): Database Internals by Petrov, Ch. 4: "Implementing B-Trees" (WAL section)
  • Memory management in databases: Database Internals by Petrov, Ch. 5: "Transaction Processing and Recovery"

Performance & Benchmarking

  • Measuring recall and precision: Introduction to Information Retrieval by Manning, Raghavan & Schütze, Ch. 8: "Evaluation in Information Retrieval"
  • Latency percentiles (p99, p95): Designing Data-Intensive Applications by Kleppmann, Ch. 1: "Reliable, Scalable, and Maintainable Applications"
  • Benchmarking methodology: Paper: "ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms" (Aumüller et al., 2020)

Essential Reading Order

For maximum comprehension, read in this order:

  1. Foundation (Week 1):
    • Mining of Massive Datasets Ch. 3 (distance metrics, LSH fundamentals)
    • Speech and Language Processing Ch. 6 (what embeddings are)
    • Pattern Recognition and Machine Learning Ch. 9.1 (k-means)
  2. Core Algorithms (Week 2-3):
    • LSH paper (Charikar, 2002)
    • Product Quantization paper (Jegou et al., 2011), must-read
    • HNSW paper (Malkov & Yashunin, 2018), must-read
  3. Advanced Techniques (Week 4):
    • DiskANN paper (for billion-scale search)
    • Filtered-DiskANN paper (for metadata filtering)
    • Milvus/Qdrant architecture docs (for distributed systems)
  4. Systems & Production (Week 5+):
    • Designing Data-Intensive Applications Ch. 3-6 (storage, replication, partitioning)
    • Database Internals (storage engine details)
    • Production vector DB docs (Pinecone, Qdrant, Milvus)

Project List

Projects progress from fundamental algorithms to a complete vector database implementation.


Project 1: Brute Force k-NN and Distance Metrics

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The "Resume Gold"
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Vector Mathematics / Similarity Search
  • Software or Tool: Distance Calculator
  • Main Book: "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman (Chapter 3)

What you'll build: A brute-force k-nearest neighbors search engine supporting multiple distance metrics (Euclidean, cosine, dot product), with optimizations using NumPy vectorization and SIMD.

Why it teaches fundamentals: Before learning approximate algorithms, you must understand exact search. This establishes the baseline that all other algorithms try to beat. You'll also learn why brute force becomes impossible at scale.

Core challenges you'll face:

  • Implementing distance metrics correctly → maps to vector mathematics
  • Vectorizing with NumPy → maps to batch operations for speed
  • Memory management for large datasets → maps to chunked processing
  • Benchmarking accurately → maps to measuring recall and latency

Key Concepts:

  • Vector Operations: Linear Algebra fundamentals
  • Distance Metrics: "Mining of Massive Datasets" Chapter 3 - Leskovec et al.
  • NumPy Broadcasting: NumPy documentation

Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic Python, understanding of vectors/matrices. NumPy familiarity helpful.

Real world outcome:

$ ./vector_search --dataset random_10k_128d.npy --query query.npy --k 10

Loading dataset: 10,000 vectors × 128 dimensions
Query vector: 128 dimensions

Brute Force k-NN Results (k=10):
╔═══════╤═══════════╤════════════════╗
║ Rank  │ Vector ID │ Distance (L2)  ║
╠═══════╪═══════════╪════════════════╣
║   1   │   4,521   │    0.823       ║
║   2   │   7,892   │    0.891       ║
║   3   │   2,103   │    0.912       ║
║  ...  │    ...    │     ...        ║
║  10   │   9,234   │    1.102       ║
╚═══════╧═══════════╧════════════════╝

Performance:
  Distance calculations: 10,000
  Time: 2.3ms (vectorized), 156ms (naive loop)
  Throughput: 4,347 queries/second

$ ./vector_search --dataset random_1M_768d.npy --query query.npy --k 10
Loading dataset: 1,000,000 vectors × 768 dimensions

Performance:
  Time: 1,847ms (too slow for production!)
  This is why we need approximate algorithms...

Implementation Hints:

Distance metric implementations:

Euclidean Distance (L2):
  d(q, v) = sqrt(sum((q[i] - v[i])^2 for i in range(dim)))

Cosine Similarity:
  sim(q, v) = dot(q, v) / (norm(q) * norm(v))
  distance = 1 - sim  (to convert to distance)

Dot Product:
  score(q, v) = sum(q[i] * v[i] for i in range(dim))
  (higher is better, not a distance)

Naive implementation (slow):

def knn_naive(query, vectors, k):
    distances = []
    for i, v in enumerate(vectors):
        d = euclidean_distance(query, v)
        distances.append((i, d))
    distances.sort(key=lambda x: x[1])
    return distances[:k]

Vectorized implementation (fast):

import numpy as np

def knn_vectorized(query, vectors, k):
    # Compute all distances at once
    # (query - vectors)^2 = query^2 - 2*query*vectors + vectors^2
    query_sq = np.sum(query ** 2)
    vectors_sq = np.sum(vectors ** 2, axis=1)
    cross_term = 2 * np.dot(vectors, query)

    distances = query_sq - cross_term + vectors_sq
    # Skip sqrt for ranking (monotonic transformation)

    # Partial sort for top-k
    indices = np.argpartition(distances, k)[:k]
    top_k = indices[np.argsort(distances[indices])]

    return top_k, distances[top_k]

Key optimization insight: For ranking, you don't need the actual distance, just the ordering. So skip expensive operations like sqrt() that don't change the ranking.

Chunked processing for large datasets:

def knn_chunked(query, vectors, k, chunk_size=100000):
    candidates = []
    for i in range(0, len(vectors), chunk_size):
        chunk = vectors[i:i+chunk_size]
        top_idx, top_dist = knn_vectorized(query, chunk, k)
        # Shift chunk-local indices back to global indices
        candidates.extend((int(idx) + i, dist) for idx, dist in zip(top_idx, top_dist))

    candidates.sort(key=lambda x: x[1])
    return candidates[:k]

The Core Question You're Answering

"How do distance metrics actually work, and why does vectorization make them 50x faster?"

Before writing code, understand what you're measuring. Distance isn't just math; it's the foundation of similarity. Every vector database operation depends on computing distances correctly and efficiently.

Concepts You Must Understand First

Stop and research these before coding:

  1. Linear Algebra Fundamentals
    • What is a vector in mathematical terms vs in programming?
    • How do dot products relate to projections?
    • What does "norm" mean (L1, L2, infinity norm)?
    • Book Reference: "Introduction to Linear Algebra" by Gilbert Strang, Ch. 1-2
  2. NumPy Broadcasting and Vectorization
    • Why is a loop over 1M items slow but a NumPy operation fast?
    • What is SIMD (Single Instruction Multiple Data)?
    • How does NumPy use CPU vector instructions?
    • Book Reference: "Python for Data Analysis" by Wes McKinney, Ch. 4: NumPy Basics
  3. Distance Metric Properties
    • What makes a function a "metric" (triangle inequality, symmetry)?
    • Why is cosine similarity NOT a distance (hint: it's a similarity)?
    • When should you use Euclidean vs cosine?
    • Book Reference: "Mining of Massive Datasets" Ch. 3.5
  4. Big-O Complexity
    • What is the complexity of brute force k-NN? O(N*D) for what operations?
    • Why does the curse of dimensionality make D matter?
    • Book Reference: "Introduction to Algorithms" (CLRS), Ch. 3

Questions to Guide Your Design

Before implementing, think through these:

  1. API Design
    • Should your search function return indices, distances, or both?
    • How do you handle the case where k > dataset size?
    • Should you support batched queries (multiple query vectors at once)?
  2. Distance Metrics
    • For cosine similarity, do you normalize vectors once (at index time) or per query?
    • Can you avoid computing sqrt() for Euclidean distance? (Hint: yes, for ranking)
    • How do you handle zero vectors in cosine similarity?
  3. Memory Management
    • Can you load the entire dataset in RAM? If not, how do you chunk it?
    • Should you memory-map the data file or load it entirely?
    • Where does the bottleneck appear: CPU or memory bandwidth?
  4. Benchmarking
    • What datasets should you test on? (SIFT1M, GIST1M are standard)
    • How do you verify your results are correct? (Check against sklearn)
    • What metrics matter: latency, throughput, or both?

Thinking Exercise

Trace Distance Calculation By Hand

Before coding, compute these manually:

Vector A = [3, 4]
Vector B = [6, 8]

Questions:
1. Euclidean distance between A and B?
2. Cosine similarity between A and B?
3. Dot product of A and B?
4. If you normalize both vectors first, what's the dot product?
5. Why are (3) and (4) different? What does this tell you?

Work this out on paper. The insight from question 5 is crucial.

Performance Mental Model

Predict the speedup:

  • Naive Python loops over 10,000 vectors × 128 dimensions
  • NumPy vectorized version with SIMD

Make a guess: 2x? 10x? 100x?

Then implement both and measure. The actual speedup will teach you why vectorization matters.
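
If you want a starting point, a rough timing harness might look like the sketch below (sizes and structure are just an example, not part of the project spec):

import time
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((10_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

t0 = time.perf_counter()
naive = [sum((q - v) ** 2 for q, v in zip(query, vec)) for vec in vectors]  # pure-Python loops
t1 = time.perf_counter()
fast = np.sum((vectors - query) ** 2, axis=1)                               # vectorized NumPy
t2 = time.perf_counter()

print(f"naive: {t1 - t0:.3f}s  vectorized: {(t2 - t1) * 1000:.2f}ms  speedup: {(t1 - t0) / (t2 - t1):.0f}x")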

The Interview Questions They'll Ask

Prepare to answer these:

  1. "What's the difference between Euclidean distance and cosine similarity?"
    • Euclidean measures absolute distance in space; cosine measures angle between vectors
    • Euclidean is magnitude-sensitive; cosine is magnitude-invariant
    • Use Euclidean when magnitude matters (e.g., image brightness), cosine for orientation (e.g., text embeddings)
  2. "Why is brute force k-NN O(N) and not O(N log N)?"
    • Computing all N distances is O(N × D)
    • Finding top-k with argpartition is O(N), not O(N log N)
    • Full sort would be O(N log N), but we don't need it
  3. "Can you skip the sqrt() in Euclidean distance? Why or why not?"
    • Yes for ranking (sqrt is monotonic, preserves order)
    • No if you need actual distance values
    • Saves computation: sqrt is ~10-20x slower than multiply
  4. "Your search is slow. Is it CPU-bound or memory-bound? How would you check?"
    • Profile with time and check CPU utilization
    • Memory-bound: low CPU usage, high memory bandwidth
    • CPU-bound: high CPU usage, bottleneck in computation
    • Tools: top, htop, perf, Python profilers
  5. "How would you find k-NN for 1000 query vectors efficiently?"
    • Batch compute: distances matrix of shape (1000, N)
    • Use matrix multiplication instead of 1000 loops
    • Exploit CPU cache locality and SIMD
  6. "What happens to distance distributions as dimensionality increases?"
    • Curse of dimensionality: all points become equidistant
    • Distance concentration: min and max distances converge
    • Relative contrast (max/min) decreases exponentially
    • This is why high-dimensional spaces need special techniques
  7. "How does NumPy make array operations fast?"
    • Uses compiled C/Fortran libraries (BLAS, LAPACK)
    • SIMD vectorization (compute 4-8 values simultaneously)
    • Contiguous memory layout (cache-friendly)
    • Avoids Python interpreter overhead

Hints in Layers

Hint 1: Start Simple

Implement naive Euclidean distance with Python loops first. Make sure it works correctly before optimizing.

import math

def euclidean_naive(v1, v2):
    dist = 0
    for i in range(len(v1)):
        dist += (v1[i] - v2[i]) ** 2
    return math.sqrt(dist)

Test it on small examples where you know the answer.

Hint 2: Vectorize with NumPy

Replace loops with NumPy operations:

import numpy as np

def euclidean_vectorized(v1, v2):
    return np.linalg.norm(v1 - v2)
    # Or: return np.sqrt(np.sum((v1 - v2) ** 2))

Benchmark the difference using timeit or time.perf_counter().

Hint 3: Batch Computation

To find k-NN, you need distances from query to ALL vectors. Compute all at once:

# query: (D,)
# vectors: (N, D)
# Result: (N,) distances
dists = np.linalg.norm(vectors - query, axis=1)

This is the key insight: broadcast the query and compute N distances in one operation.

Hint 4: Skip Sqrt for Ranking

For finding top-k, you only need the ORDERING, not actual distances:

# These give the same ordering:
distances = np.sqrt(np.sum((vectors - query)**2, axis=1))
distances_squared = np.sum((vectors - query)**2, axis=1)

# So skip the sqrt:
top_k_indices = np.argpartition(distances_squared, k)[:k]

Sqrt is expensive. Avoid it when you can.

Hint 5: Cosine Similarity Trick

Pre-normalize all vectors once. Then cosine similarity = dot product:

# At index time:
vectors_normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# At query time:
query_normalized = query / np.linalg.norm(query)
similarities = np.dot(vectors_normalized, query_normalized)
# No need for division anymore!

This converts O(N × D) divisions to O(N × D) multiplications (which you'd do anyway for dot product).

Hint 6: Use argpartition, Not argsort

Finding top-k doesnโ€™t require full sort:

# O(N log N) - wasteful
top_k = np.argsort(distances)[:k]

# O(N) - better
top_k = np.argpartition(distances, k)[:k]
top_k_sorted = top_k[np.argsort(distances[top_k])]

For k « N, this is a massive speedup.

Hint 7: Memory-Map Large Datasets

For datasets that don't fit in RAM:

# Memory-mapped array (doesn't load entire file)
vectors = np.load('large_dataset.npy', mmap_mode='r')

# Process in chunks
chunk_size = 100000
for i in range(0, len(vectors), chunk_size):
    chunk = vectors[i:i+chunk_size]
    # Compute distances for this chunk

This lets you work with datasets larger than RAM.

Books That Will Help

  • Vector and matrix operations: "Introduction to Linear Algebra" by Gilbert Strang, Ch. 1-2
  • Distance metrics in depth: "Mining of Massive Datasets" by Leskovec et al., Ch. 3: Finding Similar Items
  • NumPy fundamentals: "Python for Data Analysis" by Wes McKinney, Ch. 4: NumPy Basics
  • Performance and vectorization: "High Performance Python" by Gorelick & Ozsvald, Ch. 6: Matrix and Vector Computation
  • K-NN algorithms: "Pattern Recognition and Machine Learning" by Bishop, Ch. 2.5: Nearest Neighbor Methods
  • Big-O analysis: "Introduction to Algorithms" (CLRS), Ch. 3: Growth of Functions

Learning milestones:

  1. Distance metrics compute correctly → You understand vector math
  2. Vectorized version is 50x+ faster → You understand batch operations
  3. You hit the wall at 1M vectors → You understand why ANN is needed
  4. Recall@10 is always 100% → You understand the exact search baseline

Project 2: Locality Sensitive Hashing (LSH)

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, C++, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The "Service & Support" Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Hashing / Probabilistic Data Structures
  • Software or Tool: LSH Index
  • Main Book: "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman (Chapter 3)

What you'll build: A Locality Sensitive Hashing index for approximate nearest neighbor search, using random hyperplanes for cosine similarity or random projections for Euclidean distance.

Why it teaches LSH: LSH is the simplest ANN algorithm to understand. It's based on a beautiful insight: if we hash vectors such that similar vectors likely get the same hash, we can use hash tables for fast lookup. Understanding LSH makes other algorithms clearer.

Core challenges you'll face:

  • Designing hash functions → maps to random hyperplanes and projections
  • Amplification (AND/OR) → maps to balancing precision and recall
  • Multi-probe LSH → maps to checking nearby buckets
  • Parameter tuning → maps to number of tables and hash functions

Resources for key challenges:

Key Concepts:

  • LSH Fundamentals: "Mining of Massive Datasets" Chapter 3 - Leskovec et al.
  • Random Projections: Tyler Neylon's LSH Introduction
  • Amplification: LSH literature on AND/OR constructions

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Project 1 completed. Understanding of hash tables and probability.

Real world outcome:

$ ./lsh_search --build --dataset sift_1M.npy --tables 20 --hashes 8

Building LSH Index:
  Dataset: 1,000,000 vectors × 128 dimensions
  Hash tables: 20
  Hashes per table: 8
  Total hash functions: 160

Build time: 12.3s
Index size: 245 MB (vs 512 MB raw data)

$ ./lsh_search --query query.npy --k 10

LSH Query Results:
╔═══════╤═══════════╤══════════════╤═════════════════╗
║ Rank  │ Vector ID │ Distance     │ Found in Table  ║
╠═══════╪═══════════╪══════════════╪═════════════════╣
║   1   │   4,521   │    0.823     │      3, 7, 12   ║
║   2   │   7,892   │    0.891     │      7, 15      ║
║  ...  │    ...    │     ...      │       ...       ║
╚═══════╧═══════════╧══════════════╧═════════════════╝

Performance:
  Candidates examined: 2,042 (vs 1,000,000 brute force)
  Time: 4.2ms (vs 1,847ms brute force)
  Speedup: 440x
  Recall@10: 0.87 (87% of true neighbors found)

$ ./lsh_search --sweep-params
Sweeping parameters for optimal recall/speed...

Tables │ Hashes │ Recall@10 │ Query Time │ Index Size
───────┼────────┼───────────┼────────────┼───────────
   5   │   4    │   0.62    │   1.1ms    │   61 MB
  10   │   6    │   0.78    │   2.3ms    │  122 MB
  20   │   8    │   0.87    │   4.2ms    │  245 MB
  40   │  10    │   0.94    │   8.7ms    │  490 MB

Implementation Hints:

The key insight of LSH:

Hash function property:
  P(h(a) = h(b)) ∝ similarity(a, b)

If vectors are similar → high probability of same hash
If vectors are different → low probability of same hash

For cosine similarity, use random hyperplanes:

def random_hyperplane_hash(vector, hyperplane):
    # Returns 1 if vector is on positive side, 0 otherwise
    return 1 if np.dot(vector, hyperplane) >= 0 else 0

def compute_hash(vector, hyperplanes):
    # Combine multiple hyperplane hashes into one integer
    hash_value = 0
    for i, hp in enumerate(hyperplanes):
        bit = random_hyperplane_hash(vector, hp)
        hash_value |= (bit << i)
    return hash_value

LSH Index structure:

import numpy as np
from collections import defaultdict

class LSHIndex:
    def __init__(self, num_tables, hashes_per_table, dim):
        self.vectors = {}  # vector_id -> vector, needed for re-ranking in query()
        self.tables = []
        for _ in range(num_tables):
            hyperplanes = np.random.randn(hashes_per_table, dim)
            hyperplanes /= np.linalg.norm(hyperplanes, axis=1, keepdims=True)
            table = {
                'hyperplanes': hyperplanes,
                'buckets': defaultdict(list)  # hash -> [vector_ids]
            }
            self.tables.append(table)

    def insert(self, vector_id, vector):
        self.vectors[vector_id] = vector
        for table in self.tables:
            hash_val = compute_hash(vector, table['hyperplanes'])
            table['buckets'][hash_val].append(vector_id)

    def query(self, vector, k):
        candidates = set()
        for table in self.tables:
            hash_val = compute_hash(vector, table['hyperplanes'])
            candidates.update(table['buckets'][hash_val])

        # Re-rank candidates by actual (Euclidean) distance
        results = []
        for cand_id in candidates:
            dist = np.linalg.norm(vector - self.vectors[cand_id])
            results.append((cand_id, dist))

        results.sort(key=lambda x: x[1])
        return results[:k]

Amplification (AND/OR construction):

  • More hashes per table (AND): Fewer candidates, higher precision
  • More tables (OR): More candidates, higher recall
  • Balance: Tune based on your recall requirements

Multi-probe LSH: Instead of only checking the exact bucket, also check nearby buckets (flip one bit at a time). Increases recall without adding tables.
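
A sketch of the probe-generation idea (it assumes the integer hash layout built by compute_hash above; this is an illustration, not a standard API):

def probe_sequence(hash_value, num_bits):
    # Exact bucket first, then every bucket at Hamming distance 1 (one flipped bit)
    yield hash_value
    for i in range(num_bits):
        yield hash_value ^ (1 << i)

# With 8 hashes per table, this yields 1 + 8 = 9 buckets to check in that table
print(list(probe_sequence(0b10110010, 8)))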


The Core Question You're Answering

Why does hashing similar vectors to the same buckets work for finding nearest neighbors?

This is the fundamental insight of LSH. Unlike traditional hash functions (which try to distribute items uniformly), LSH hash functions are designed so that:

  • Similar items collide with HIGH probability
  • Dissimilar items collide with LOW probability

The probability of collision is directly related to similarity. For random hyperplane hashing and cosine similarity:

P(h(a) = h(b)) = 1 - (angle(a, b) / π)

This means: if two vectors point in similar directions (small angle), they're likely to be on the same side of a random hyperplane. By using multiple hyperplanes, you create a hash code where similar vectors get similar (often identical) codes.

The "aha!" moment: You're not trying to find the EXACT nearest neighbor. You're willing to accept approximate answers in exchange for massive speedups. LSH gives you probabilistic guarantees: "With probability 1-δ, I'll find a neighbor within (1+ε) of optimal."
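
You can sanity-check this relationship empirically. A quick Monte Carlo sketch (2D vectors at a known angle; illustrative only):

import numpy as np

rng = np.random.default_rng(0)
theta = np.deg2rad(30)                       # 30 degrees between the two vectors
a = np.array([1.0, 0.0])
b = np.array([np.cos(theta), np.sin(theta)])

hyperplanes = rng.standard_normal((100_000, 2))
collision_rate = np.mean(np.sign(hyperplanes @ a) == np.sign(hyperplanes @ b))

print(collision_rate, 1 - theta / np.pi)     # both come out near 0.833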


Concepts You Must Understand First

Before implementing LSH, make sure you understand:

1. Random Projections

  • Projecting high-dimensional vectors onto random lines/hyperplanes
  • The Johnson-Lindenstrauss lemma (distances are approximately preserved)
  • Why random hyperplanes work for cosine similarity

2. Hash Functions and Collision

  • Traditional hash functions minimize collisions
  • LSH hash functions MAXIMIZE collisions for similar items
  • Hash codes as binary strings (each bit = one hyperplane test)

3. Probability Basics

  • P(h(a) = h(b)) depends on similarity(a, b)
  • Independence of hash functions
  • Amplification through AND/OR combinations

4. AND/OR Amplification

  • AND construction: All k hashes must match → Higher precision, lower recall
  • OR construction: Match in any of L tables → Higher recall, lower precision
  • Combined: L tables, each with k hashes → Tunable precision/recall

Study these four concepts before writing any code.


Questions to Guide Your Design

Hash Function Design:

  1. How many hash functions (hyperplanes) should each table use?
    • More hashes → smaller buckets → higher precision, lower recall
    • Fewer hashes → larger buckets → lower precision, higher recall
    • Typical: 6-12 hashes per table
  2. How do you generate random hyperplanes?
    • Sample from Gaussian: np.random.randn(dim)
    • Normalize to unit length: hyperplane / ||hyperplane||
    • Each hyperplane is independent
  3. How do you combine multiple bits into a hash?
    • Treat bits as binary number: hash = bit₀ + 2*bit₁ + 4*bit₂ + ...
    • Use bitwise operations for speed: hash |= (bit << i)

Amplification Design:

  1. How many tables should you use?
    • More tables → more chances to find neighbors → higher recall
    • But also more memory and slower indexing
    • Typical: 10-50 tables depending on recall requirements
  2. How do you balance precision and recall?
    • High recall needed? → Many tables (L), fewer hashes per table (k)
    • High precision needed? → Fewer tables, more hashes per table
    • Formula: P_collision = 1 - (1 - p^k)^L, where p = per-hash collision probability (see the sketch after this section)

Multi-Probe Strategy:

  1. Which nearby buckets should you probe?
    • Hamming distance 1: flip each bit individually (k probes)
    • Hamming distance 2: flip pairs of bits (k choose 2 probes)
    • Prioritize bits closest to decision boundary
  2. How many probes before diminishing returns?
    • Empirically: 2-10 probes per table
    • Track marginal increase in candidates found
    • Stop when cost exceeds benefit

Memory vs Speed:

  1. How do you store hash tables efficiently?
    • Dict of lists: buckets[hash] = [id1, id2, ...]
    • Problem: empty buckets waste space
    • Solution: only store non-empty buckets
  2. Should you store vectors or just IDs?
    • IDs only → compact, but need separate vector storage
    • Vectors included → faster re-ranking, but larger memory
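
To make the amplification formula concrete, here is a small sketch (it assumes the cosine-similarity collision probability from the previous section; the numbers are illustrative):

import math

def lsh_recall_estimate(cos_sim, k, L):
    # Per-hyperplane collision probability for cosine similarity cos_sim
    p = 1 - math.acos(cos_sim) / math.pi
    # AND over k hashes per table, OR over L tables
    return 1 - (1 - p ** k) ** L

for num_tables in (5, 10, 20, 40):
    print(num_tables, round(lsh_recall_estimate(0.9, k=8, L=num_tables), 3))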

Thinking Exercise: Understanding Hash Collisions

Exercise 1: Manual Hash Computation

Given two 3D vectors and one random hyperplane, compute if they hash to the same value:

v1 = [1, 2, 1]      (normalized: [0.408, 0.816, 0.408])
v2 = [1, 2, 2]      (normalized: [0.333, 0.667, 0.667])
hyperplane = [1, 0, 0]

Steps:

  1. Compute dot product: v1 · h and v2 · h
  2. Check sign: positive → bit=1, negative → bit=0
  3. Do they match?

Answer: v1·h = 0.408 (bit=1), v2·h = 0.333 (bit=1) → COLLISION ✓

Exercise 2: Multi-Bit Hash

Now use 3 hyperplanes:

h1 = [1, 0, 0]
h2 = [0, 1, 0]
h3 = [0, 0, 1]

Compute 3-bit hash for both v1 and v2:

  • For each hyperplane, compute sign(v · h)
  • Combine into binary: bit₀ + 2bit₁ + 4bit₂

Answer:

  • v1: [+, +, +] → 111₂ → 7
  • v2: [+, +, +] → 111₂ → 7
  • Still collision! ✓

Exercise 3: Probability Calculation

If two vectors have cosine similarity 0.9, what's the probability they collide with:

  • 1 random hyperplane?
  • 4 random hyperplanes (all must match)?
  • 10 tables of 4 hashes each (match in ANY table)?

Answers:

  • 1 hyperplane: P = 1 - arccos(0.9)/π ≈ 0.86
  • 4 hyperplanes: P = 0.86^4 ≈ 0.54
  • 10 tables: P = 1 - (1 - 0.54)^10 ≈ 0.9996 (very high recall!)
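
If you want to check these numbers yourself, a tiny sketch:

import math

p1 = 1 - math.acos(0.9) / math.pi   # one hyperplane        ~0.86
p4 = p1 ** 4                        # 4 hyperplanes (AND)    ~0.54
p_or = 1 - (1 - p4) ** 10           # any of 10 tables (OR)  ~0.9996
print(round(p1, 2), round(p4, 2), round(p_or, 4))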

Exercise 4: Understanding the Tradeoff

You have 1M vectors, target recall@10 of 0.85. Compare configurations:

Tables │ Hashes │ Expected Candidates │ Recall │ Memory
───────┼────────┼─────────────────────┼────────┼────────
   5   │   8    │         500         │  0.65  │  60 MB
  20   │   6    │        2000         │  0.85  │ 240 MB
  50   │   4    │        5000         │  0.95  │ 600 MB

Question: Which would you choose for a latency-critical application?

Hint: More candidates = slower re-ranking. Look at candidates vs recall ratio.


The Interview Questions They'll Ask

Conceptual Understanding:

  1. "Explain how LSH works to a software engineer unfamiliar with it."
    • Good answer: "LSH uses special hash functions where similar items collide frequently. By hashing vectors with random hyperplanes and checking which side they fall on, we create codes where similar vectors get similar codes. Then we use hash tables for fast lookup."
    • Great answer: Add probability guarantees, AND/OR amplification, and a concrete example with numbers.
  2. "Why use random hyperplanes for cosine similarity?"
    • The probability two vectors are on the same side equals their angular similarity
    • Mathematically: P(sign(v·h) = sign(u·h)) = 1 - θ/π where θ = angle(v,u)
    • This directly connects hash collision to cosine similarity
  3. "What's the difference between AND and OR construction?"
    • AND: All k hashes must match → fewer collisions → higher precision
    • OR: Match in any of L tables → more chances → higher recall
    • Combined: (k AND) OR L → tunable precision/recall tradeoff

Implementation Details:

  1. "How do you choose the number of tables and hashes per table?"
    • Start with target recall requirement (e.g., 0.85)
    • Estimate collision probability based on expected similarity
    • Use formula: P_recall ≈ 1 - (1 - p^k)^L
    • Solve for L and k given recall target
    • Empirically tune on validation set
  2. "What's multi-probe LSH and why does it help?"
    • Instead of checking only the exact hash bucket, also check nearby buckets
    • "Nearby" = Hamming distance 1 or 2 (flip 1-2 bits)
    • Helps because vectors near decision boundaries might be in adjacent buckets
    • Increases recall without adding tables (saves memory)
  3. "How would you handle updates (insertions/deletions)?"
    • Insertions: Hash new vector, add to all L buckets
    • Deletions: Remove from all L buckets (requires storing a vector→bucket mapping)
    • Challenge: bucket lists can become fragmented
    • Solution: periodic rebuilding or use more sophisticated data structures

Performance & Scaling:

  1. "Why is LSH typically slower than IVF or HNSW in practice?"
    • High-dimensional spaces: need many bits for good precision
    • Many tables needed for good recall
    • Re-ranking cost: examine many candidates even if most are false positives
    • Modern alternatives (HNSW, IVF) have better recall/speed tradeoffs
  2. "When would you choose LSH over other ANN algorithms?"
    • Very high dimensionality (>1000D) where clustering is hard
    • Streaming/online scenarios with frequent insertions
    • When you need theoretical guarantees on approximation
    • When simplicity and debuggability matter more than peak performance
  3. "How does LSH scale with dimensionality?"
    • Need O(log n / ε²) hash functions for (1+ε) approximation
    • Higher dimensions → need more bits → larger hash codes
    • But curse of dimensionality affects ALL ANN methods
    • LSH degrades more gracefully than some alternatives

Advanced Topics:

  1. "Explain data-dependent LSH vs data-independent LSH."
    • Data-independent: random hyperplanes (what we've implemented)
    • Data-dependent: learn hyperplanes from data to maximize separation
    • Example: use PCA directions or train with neural networks
    • Tradeoff: better performance vs complexity and adaptability
  2. "How would you implement LSH for other distance metrics?"
    • L2 (Euclidean): p-stable distributions (Gaussian is 2-stable for L2; Cauchy is 1-stable, used for L1)
    • Hamming: random bit sampling
    • Jaccard: MinHash
    • Each metric needs a different hash family with collision probability ∝ similarity
  3. "What's the theoretical guarantee of LSH?"
    • (c, r₁, r₂, p₁, p₂)-sensitive hash family
    • If d(a,b) ≤ r₁ → P(h(a)=h(b)) ≥ p₁ (nearby items collide often)
    • If d(a,b) ≥ r₂ → P(h(a)=h(b)) ≤ p₂ (distant items rarely collide)
    • With amplification: guarantees on finding (1+ε)-approximate neighbors

Hints in Layers

Layer 1: Single Hash Table (Start Here)

  • Implement one table with k=8 random hyperplanes
  • Hash vectors by checking sign of each projection
  • Store in dict: buckets[hash_value] = [vector_ids]
  • Query: hash query vector, return all vectors in same bucket
  • Test: verify similar vectors often share buckets

Layer 2: Multiple Tables (OR Construction)

  • Create L=20 independent tables (different random hyperplanes)
  • Insert: add vector to bucket in each table
  • Query: collect candidates from all L tables (union)
  • Re-rank candidates by actual distance
  • Test: measure recall@10 vs single table

Layer 3: Parameter Tuning

  • Implement parameter sweep: vary L and k
  • For each configuration, measure:
    • Recall@10
    • Query time
    • Memory usage
    • Number of candidates examined
  • Plot Pareto frontier: recall vs query time
  • Test: find optimal L,k for target recall of 0.85

Layer 4: Multi-Probe LSH

  • Generate probe sequence: flip 1 bit, 2 bits, etc.
  • For each table, check exact bucket + probes
  • Implement early stopping: stop if enough candidates found
  • Test: compare recall with/without multi-probe at same #candidates

Layer 5: Optimizations

  • Vectorized hash computation: batch process hyperplane projections (see the sketch after this list)
  • Compact bucket storage: use numpy arrays instead of lists
  • Early termination: stop search after examining N candidates
  • Test: benchmark against baseline, aim for 2-3x speedup
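
A possible shape for the vectorized hash computation mentioned above (a sketch under the same random-hyperplane scheme as earlier, not a library API):

import numpy as np

def compute_hashes_batch(vectors, hyperplanes):
    # vectors: (N, D), hyperplanes: (k, D) -> one k-bit integer hash per vector
    bits = (vectors @ hyperplanes.T) >= 0              # (N, k) booleans, one per hyperplane
    weights = 1 << np.arange(hyperplanes.shape[0])     # [1, 2, 4, ...]
    return bits.astype(np.int64) @ weights             # (N,) integer hash codes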

Debugging Tips:

  • Low recall? → Increase L (more tables) or decrease k (fewer hashes per table)
  • Too slow? → Decrease L or use multi-probe instead of more tables
  • Memory issues? → Decrease L or use approximate storage (bloom filters)
  • Verification: Always re-rank candidates with exact distances!

Books That Will Help

Core LSH Theory:

  • "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman
    • Chapter 3: Finding Similar Items
    • Section 3.4: LSH for Cosine Distance (essential reading)
    • Section 3.6: Applications of LSH
    • Why: Best introduction to LSH with clear examples and proofs

Random Projections & Theory:

  • "Foundations of Data Science" by Blum, Hopcroft, Kannan
    • Chapter 2: Random Projection and Dimensionality Reduction
    • Johnson-Lindenstrauss lemma and applications
    • Why: Theoretical foundation for why LSH works

Algorithms & Probability:

  • "Randomized Algorithms" by Motwani and Raghavan
    • Chapter 6: Hashing
    • Probabilistic analysis of hash functions
    • Why: Deep dive into probability tools needed for LSH analysis

Practical Implementation:

  • "Introduction to Information Retrieval" by Manning, Raghavan, Schütze
    • Chapter 19: Web Search Basics
    • Near-duplicate detection using LSH
    • Why: Real-world applications and engineering considerations

Advanced Topics:

  • "Nearest-Neighbor Methods in Learning and Vision" edited by Shakhnarovich, Darrell, Indyk
    • Chapter 5: LSH by Indyk
    • Data-dependent LSH
    • Why: Written by LSH pioneer, covers cutting-edge techniques

Online Resources:

Papers to Read (after implementing):

  • "Similarity Search in High Dimensions via Hashing" (Gionis et al., 1999) - Original LSH paper
  • "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions" (Datar et al., 2004) - LSH for L2 distance
  • "Multi-Probe LSH" (Lv et al., 2007) - Improves efficiency significantly

Learning milestones:

  1. Hash function maps similar vectors together → You understand LSH basics
  2. Multiple tables improve recall → You understand amplification
  3. Parameter tuning affects recall/speed → You understand the tradeoff
  4. Multi-probe improves efficiency → You understand advanced LSH

Project 3: Inverted File Index (IVF)

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: C++, Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The "Open Core" Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Clustering / Inverted Indexes
  • Software or Tool: IVF Index
  • Main Book: "Introduction to Information Retrieval" by Manning, Raghavan, Schütze

What you'll build: An Inverted File (IVF) index that clusters vectors with k-means, then searches only nearby clusters at query time: the foundation of production vector search.

Why it teaches IVF: IVF is the workhorse of production vector databases. It's conceptually simple (just clustering!) but has nuances around cluster count, nprobe tuning, and training data. Understanding IVF unlocks FAISS and most production systems.

Core challenges you'll face:

  • K-means clustering at scale → maps to efficient training
  • Choosing number of clusters → maps to the sqrt(N) heuristic
  • nprobe tuning → maps to the recall vs speed tradeoff
  • Non-uniform cluster sizes → maps to data distribution challenges

Resources for key challenges:

Key Concepts:

  • Inverted Index: "Introduction to Information Retrieval" Chapter 1 - Manning et al.
  • K-Means Clustering: "Pattern Recognition and Machine Learning" - Bishop
  • IVF in FAISS: FAISS documentation

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Project 1 completed. Understanding of k-means clustering.

Real world outcome:

$ ./ivf_search --build --dataset sift_1M.npy --clusters 1000

Building IVF Index:
  Dataset: 1,000,000 vectors × 128 dimensions
  Training k-means with 100,000 sample vectors...
  K-means iterations: 25, converged: True
  Clusters: 1,000 (avg 1,000 vectors per cluster)

Cluster size distribution:
  Min: 423, Max: 2,156, Median: 987, StdDev: 312

Build time: 45.2s
Index size: 134 MB (centroids) + 512 MB (data)

$ ./ivf_search --query query.npy --k 10 --nprobe 10

IVF Query Results (nprobe=10):
  Clusters searched: 10 (1% of index)
  Vectors compared: 9,870

╔═══════╤═══════════╤══════════════╤═════════════════╗
║ Rank  │ Vector ID │ Distance     │ Cluster ID      ║
╠═══════╪═══════════╪══════════════╪═════════════════╣
║   1   │   4,521   │    0.823     │      127        ║
║   2   │   7,892   │    0.891     │      127        ║
║  ...  │    ...    │     ...      │       ...       ║
╚═══════╧═══════════╧══════════════╧═════════════════╝

Performance:
  Time: 8.3ms (vs 1,847ms brute force)
  Speedup: 222x
  Recall@10: 0.92

$ ./ivf_search --nprobe-sweep
nprobe │ Recall@10 │ Query Time │ Vectors Compared
───────┼───────────┼────────────┼─────────────────
    1  │   0.52    │   1.2ms    │      987
    5  │   0.81    │   4.8ms    │    4,935
   10  │   0.92    │   8.3ms    │    9,870
   20  │   0.97    │  15.2ms    │   19,740
   50  │   0.99    │  38.1ms    │   49,350

Implementation Hints:

IVF structure:

import numpy as np

class IVFIndex:
    def __init__(self, num_clusters, dim):
        self.num_clusters = num_clusters
        self.dim = dim
        self.centroids = None  # (num_clusters, dim)
        self.inverted_lists = [[] for _ in range(num_clusters)]
        # inverted_lists[i] = [(vector_id, vector), ...]

    def train(self, training_vectors):
        # Run k-means to find cluster centroids
        from sklearn.cluster import KMeans
        kmeans = KMeans(n_clusters=self.num_clusters, n_init=1)
        kmeans.fit(training_vectors)
        self.centroids = kmeans.cluster_centers_

    def add(self, vector_id, vector):
        # Find nearest centroid
        distances = np.linalg.norm(self.centroids - vector, axis=1)
        cluster_id = np.argmin(distances)
        self.inverted_lists[cluster_id].append((vector_id, vector))

    def search(self, query, k, nprobe):
        # Find nprobe nearest centroids
        centroid_distances = np.linalg.norm(self.centroids - query, axis=1)
        nearest_clusters = np.argsort(centroid_distances)[:nprobe]

        # Search within those clusters
        candidates = []
        for cluster_id in nearest_clusters:
            for vector_id, vector in self.inverted_lists[cluster_id]:
                dist = np.linalg.norm(query - vector)
                candidates.append((vector_id, dist))

        # Sort and return top-k
        candidates.sort(key=lambda x: x[1])
        return candidates[:k]

Choosing number of clusters:

Rule of thumb: num_clusters ≈ sqrt(num_vectors)

1M vectors → ~1,000 clusters
100M vectors → ~10,000 clusters

Too few clusters: Each cluster too large, slow search
Too many clusters: Centroid comparison overhead, missed neighbors

Training considerations:

  • Train on a representative sample (10-20% of data is often enough)
  • Ensure training data has same distribution as full dataset
  • More k-means iterations = better centroids, but slower training

Non-exhaustive search pitfall:

If nearest neighbor is in cluster 5, but query is closest to
centroid of cluster 3, we might miss it with low nprobe.

Solution: Higher nprobe, or use HNSW to find nearby centroids faster
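
A rough sketch of how an nprobe sweep like the output above could be measured (it assumes the IVFIndex sketch plus ground-truth results from an exact brute-force search; names are illustrative):

import numpy as np

def recall_at_k(approx_ids, true_ids, k):
    return len(set(approx_ids) & set(true_ids)) / k

def sweep_nprobe(index, queries, ground_truth, k=10):
    # ground_truth[i] = true top-k ids for queries[i], computed by brute force
    for nprobe in (1, 5, 10, 20, 50):
        recalls = []
        for query, true_ids in zip(queries, ground_truth):
            found = [vid for vid, _ in index.search(query, k, nprobe)]
            recalls.append(recall_at_k(found, true_ids, k))
        print(nprobe, round(float(np.mean(recalls)), 3))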

The Core Question You're Answering

"How does clustering enable fast search in high-dimensional spaces?"

This is the foundational insight of IVF: if you partition your dataset into regions (clusters), you can search only the regions near your query instead of the entire dataset. The key challenge is: how do you partition the space efficiently, and how do you decide which regions to search?

Think about it:

  • Without clustering: You must compare your query to all N vectors (O(N) time)
  • With clustering: You compare to K centroids, then search only nprobe clusters (O(K + nprobe * N/K) time)
  • The tradeoff: Speed vs accuracy - fewer clusters searched = faster but might miss the true nearest neighbor

Concepts You Must Understand First

Before implementing IVF, you need solid understanding of:

  1. K-means Clustering
    • How Lloyd's algorithm iteratively assigns points to centroids and updates centroids (see the sketch after this list)
    • Why k-means minimizes within-cluster variance
    • Convergence criteria and initialization strategies (k-means++)
    • Time complexity: O(iterations × N × K × D) where K = number of clusters, D = dimensions
  2. Voronoi Partitions
    • Each centroid defines a region (Voronoi cell) of all points closer to it than any other centroid
    • The partition is exhaustive (every point belongs to exactly one cell)
    • Voronoi boundaries are hyperplanes in D-dimensional space
    • Understanding: โ€œWhich cluster does this query belong to?โ€ = โ€œWhich centroid is nearest?โ€
  3. The nprobe Parameter
    • nprobe = number of clusters to search at query time
    • nprobe = 1: Search only the nearest cluster (fast, low recall)
    • nprobe = K: Search all clusters (slow, perfect recall = brute force)
    • nprobe = 10-50: Typical production values (good recall/speed tradeoff)
  4. Recall vs Speed Tradeoff
    • Recall@k = fraction of true top-k neighbors found
    • As nprobe increases: recall โ†‘, query time โ†‘
    • The goal: Find minimum nprobe that achieves acceptable recall (often 90-95%)
  5. Cluster Imbalance
    • Real data doesnโ€™t cluster uniformly - some clusters are huge, others tiny
    • Imbalanced clusters hurt performance (searching a 5000-vector cluster vs a 200-vector cluster)
    • Solutions: Use more clusters, or switch to techniques like IVF-PQ or HNSW

Questions to Guide Your Design

As you build IVF, constantly ask yourself:

  1. How many clusters should I use?
    • Too few (K = 10): Huge clusters, slow search even with nprobe=1
    • Too many (K = 100,000): Centroid comparison overhead, might need high nprobe
    • Rule of thumb: K โ‰ˆ sqrt(N) - but why? (Balances centroid comparison vs in-cluster search)
    • For 1M vectors: K โ‰ˆ 1000 clusters โ‰ˆ 1000 vectors per cluster
  2. How much training data do I need?
    • K-means needs enough samples per cluster (at least 30-50 per cluster)
    • For K=1000 clusters: need at least 30,000-50,000 training vectors
    • But k-means is expensive on large datasets - can you subsample?
    • Answer: Yes! 10-20% of data is often sufficient if representative
  3. What if clusters are imbalanced?
    • Measure cluster size distribution: min, max, median, stddev
    • If max/min > 10x, you have imbalance
    • Options: Increase K, use hierarchical clustering, or accept it
    • Real-world data is always imbalanced - design for it!
  4. How do I tune nprobe?
    • Sweep nprobe from 1 to 100, measure recall@10 and query time (see the sketch after this list)
    • Plot recall vs nprobe and time vs nprobe
    • Find โ€œkneeโ€ of curve - where recall plateaus (diminishing returns)
    • Production systems often expose nprobe as a per-query parameter
  5. What distance metric should I use?
    • K-means assumes Euclidean distance (L2)
    • For cosine similarity: normalize vectors to unit length, then use L2
    • For other metrics (Manhattan, etc.): k-means may not work well
    • IVF is fundamentally Euclidean - thatโ€™s a limitation!
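
To make question 4 concrete, here is a minimal sketch of an nprobe sweep measured against brute-force ground truth. It assumes the IVFIndex from the hints above plus `data` and `queries` arrays, and that vectors were added with their row index as vector_id (as in the usage sketch earlier); the names are illustrative:

import numpy as np

def recall_at_k(index, data, queries, k, nprobe):
    """Fraction of the true top-k neighbors that the IVF search returns."""
    hits = 0
    for q in queries:
        true_ids = set(np.argsort(np.linalg.norm(data - q, axis=1))[:k])
        found_ids = {vid for vid, _ in index.search(q, k, nprobe)}
        hits += len(true_ids & found_ids)
    return hits / (k * len(queries))

for nprobe in (1, 5, 10, 20, 50):
    print(nprobe, recall_at_k(index, data, queries, k=10, nprobe=nprobe))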

Thinking Exercise

Before coding, do this by hand to internalize the algorithm:

Setup: You have 20 2D vectors and K=4 clusters.

Vectors (x, y):
[1,1], [2,1], [1,2], [2,2],       # Cluster 1 (bottom-left)
[8,1], [9,1], [8,2], [9,2],       # Cluster 2 (bottom-right)
[1,8], [2,8], [1,9], [2,9],       # Cluster 3 (top-left)
[8,8], [9,8], [8,9], [9,9],       # Cluster 4 (top-right)
[5,5], [5,6], [6,5], [6,6]        # Middle vectors

Part 1: K-means

  1. Initialize 4 random centroids
  2. Assign each vector to nearest centroid
  3. Recompute centroids as mean of assigned vectors
  4. Repeat until convergence
  5. What are the final centroids? (Should be near [1.5,1.5], [8.5,1.5], [1.5,8.5], [8.5,8.5])

Part 2: Search with nprobe=1

  • Query: [5.5, 5.5]
  • Which centroid is nearest? (Could be any, theyโ€™re equidistant!)
  • Search only that cluster
  • What if the true nearest neighbor is in a different cluster? MISSED!

Part 3: Search with nprobe=2

  • Query: [5.5, 5.5]
  • Find 2 nearest centroids
  • Search both clusters
  • Higher chance of finding the true nearest neighbor

Part 4: Understanding the tradeoff

  • nprobe=1: Search ~5 vectors (1/4 of dataset)
  • nprobe=2: Search ~10 vectors (1/2 of dataset)
  • nprobe=4: Search all 20 vectors (brute force!)

Key insight: The middle vectors [5,5], [5,6], [6,5], [6,6] are equidistant from all centroids. Queries near them NEED higher nprobe. This is the fundamental limitation of IVF!


The Interview Questions Theyโ€™ll Ask

When you understand IVF deeply, you can answer:

  1. Theory Questions:
    • โ€œExplain how IVF reduces search time complexity from O(N) to O(K + nprobe * N/K)โ€
    • โ€œWhy do we use k-means for IVF? What properties make it suitable?โ€
    • โ€œWhatโ€™s the difference between IVF and HNSW? When would you use each?โ€
    • โ€œHow does IVF handle new data points? Is retraining required?โ€
  2. Design Questions:
    • โ€œYou have 100M 768-dim vectors. How many clusters would you use? Why?โ€
    • โ€œYour IVF index has 90% recall with nprobe=20 but you need 95%. What do you do?โ€
    • โ€œYou notice cluster sizes range from 100 to 50,000 vectors. Is this a problem? How do you fix it?โ€
    • โ€œHow would you handle streaming inserts to an IVF index?โ€
  3. Tradeoff Questions:
    • โ€œGiven a query latency SLA of 10ms, how do you choose nprobe?โ€
    • โ€œShould you train k-means on all 100M vectors or a sample? Whatโ€™s the tradeoff?โ€
    • โ€œYour index is 50GB. How does IVF help with memory? How could you compress further?โ€
    • โ€œCompare IVF with nprobe=10 vs HNSW with ef=10. Which is faster? More accurate?โ€
  4. Implementation Questions:
    • โ€œHow do you implement efficient centroid search? (Hint: itโ€™s brute force for small K!)โ€
    • โ€œWhat data structure do you use for inverted lists? Array? Linked list?โ€
    • โ€œHow do you parallelize IVF search across multiple queries?โ€
    • โ€œWhatโ€™s the index build time for 1M vectors, 1000 clusters, 50 k-means iterations?โ€
  5. Production Questions:
    • โ€œYour IVF index has high query latency variance. Why might this happen?โ€
      • Answer: Cluster imbalance - some clusters are 10x larger than others
    • โ€œHow do you handle deletes in IVF? Tombstones? Rebuilding?โ€
    • โ€œCan you combine IVF with PQ? Whatโ€™s the memory/accuracy tradeoff?โ€
    • โ€œHow would you shard an IVF index across 10 machines?โ€

Hints in Layers

Layer 1: If youโ€™re completely stuck starting

  1. Start with toy data: 1000 2D vectors, K=10 clusters
  2. Use sklearn.cluster.KMeans for training (donโ€™t implement k-means from scratch yet)
  3. Store clusters as a list of lists: clusters[i] = [(vector_id, vector), ...]
  4. For search: find nearest centroid with brute force, search that cluster

Layer 2: When basic IVF works

  1. Implement nprobe: search multiple clusters
  2. Measure recall by comparing to brute force results
  3. Plot recall vs nprobe curve
  4. Implement cluster size distribution statistics

Layer 3: When you want to optimize

  1. Implement your own k-means (Lloydโ€™s algorithm)
  2. Add k-means++ initialization
  3. Use numpy vectorization for centroid distances
  4. Parallelize search across clusters

Layer 4: When you want production features

  1. Serialize index to disk (save centroids + inverted lists)
  2. Add batch query support
  3. Implement cluster rebalancing (split large clusters)
  4. Add IVF + flat refinement (search with IVF, refine with exact distances)

Common mistakes to avoid:

  • Mistake: Using too few clusters (K < 100 for 1M vectors)
    • Result: Slow search even with nprobe=1
    • Fix: Use K โ‰ˆ sqrt(N)
  • Mistake: Training k-means on biased sample
    • Result: Poor cluster quality, low recall
    • Fix: Ensure training sample is representative (random sample)
  • Mistake: Not handling empty clusters
    • Result: Some centroids have no assigned vectors
    • Fix: Reassign empty centroids to split large clusters
  • Mistake: Computing exact distances to all centroids
    • Result: Slow centroid search
    • Fix: For K < 10,000, brute force is fine. For larger K, use approximate methods
  • Mistake: Storing full vectors in inverted lists
    • Result: High memory usage
    • Fix: Store only vector IDs, load vectors on-demand (or use PQ!)

Books That Will Help

  1. โ€œIntroduction to Information Retrievalโ€ by Manning, Raghavan, Schรผtze
    • Chapter 1: Inverted indexes (the inspiration for IVF)
    • Chapter 6: Scoring and ranking
    • Chapter 18: Hierarchical clustering
    • Why it helps: IVF is literally an inverted index for vectors
  2. โ€œPattern Recognition and Machine Learningโ€ by Christopher Bishop
    • Chapter 9: Mixture Models and EM (includes k-means)
    • Chapter 12: Continuous Latent Variables
    • Why it helps: Deep understanding of k-means and clustering theory
  3. โ€œFoundations of Multidimensional and Metric Data Structuresโ€ by Hanan Samet
    • Chapter 4: Point Access Methods
    • Chapter 5: Multidimensional Point Data
    • Why it helps: IVF in the broader context of spatial indexing
  4. Research Papers:
    • โ€œVideo Google: A Text Retrieval Approach to Object Matching in Videosโ€ (Sivic & Zisserman, 2003)
      • Original IVF for visual search
    • โ€œProduct Quantization for Nearest Neighbor Searchโ€ (Jรฉgou et al., 2011)
      • IVF-PQ, the production combination
    • FAISS paper: โ€œBillion-scale similarity search with GPUsโ€ (Johnson et al., 2017)
      • Modern IVF implementations
  5. Online Resources:

Learning milestones:

  1. K-means clusters vectors โ†’ You understand the preprocessing
  2. nprobe controls recall โ†’ You understand the search phase
  3. Cluster imbalance affects performance โ†’ You understand data distribution
  4. You compare with FAISS โ†’ You validate your implementation

Project 4: Product Quantization (PQ)

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The โ€œOpen Coreโ€ Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Vector Compression / Quantization
  • Software or Tool: PQ Encoder/Decoder
  • Main Book: โ€œProduct Quantization for Nearest Neighbor Searchโ€ - Jรฉgou et al.

What youโ€™ll build: A Product Quantization system that compresses 128-dimensional float vectors into just 8-16 bytes, enabling 97% memory reduction while maintaining search quality.

Why it teaches quantization: Memory is often the bottleneck for large vector databases. PQ is the key technique that makes billion-scale search possible on a single machine. Understanding PQ unlocks IVF-PQ, the most common production configuration.

Core challenges youโ€™ll face:

  • Subspace decomposition โ†’ maps to splitting vectors into segments
  • Codebook training โ†’ maps to k-means per subspace
  • Asymmetric distance computation โ†’ maps to query vs database accuracy
  • Distance lookup tables โ†’ maps to precomputed distances for speed

Resources for key challenges:

Key Concepts:

  • Product Quantization: Original paper by Jรฉgou, Douze, Schmid
  • Vector Quantization: Lloydโ€™s algorithm / k-means
  • Asymmetric Distance: ADC (Asymmetric Distance Computation)

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1, 3 completed. Understanding of k-means.

Real world outcome:

$ ./pq_compress --dataset sift_1M.npy --segments 8 --bits 8

Training Product Quantization:
  Original: 1,000,000 vectors ร— 128 dims ร— 4 bytes = 512 MB
  Segments: 8 (16 dims each)
  Centroids per segment: 256 (8 bits)

Training codebooks (k-means per segment)...
  Segment 0: 256 centroids, trained on 1M subvectors
  Segment 1: 256 centroids, trained on 1M subvectors
  ...
  Segment 7: 256 centroids, trained on 1M subvectors

Compressing vectors...
  Compressed: 1,000,000 vectors ร— 8 bytes = 8 MB
  Compression ratio: 64x (98.4% reduction!)

Codebook size: 8 segments ร— 256 centroids ร— 16 dims ร— 4 bytes = 128 KB

$ ./pq_search --query query.npy --k 10

PQ Search with ADC (Asymmetric Distance Computation):
  Building distance lookup table: 0.02ms
  Scanning 1M compressed codes: 12.3ms

Results:
โ•”โ•โ•โ•โ•โ•โ•โ•โ•คโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•คโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•คโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘ Rank  โ”‚ Vector ID โ”‚ Approx Dist    โ”‚ True Dist       โ•‘
โ• โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฃ
โ•‘   1   โ”‚   4,521   โ”‚    0.841       โ”‚    0.823        โ•‘
โ•‘   2   โ”‚   7,102   โ”‚    0.923       โ”‚    0.905        โ•‘
โ•‘  ...  โ”‚    ...    โ”‚     ...        โ”‚      ...        โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

Quality:
  Recall@10: 0.89 (vs exact search)
  Mean distance error: 3.2%

Performance:
  Time: 12.3ms (vs 1,847ms brute force on raw vectors)
  Memory: 8 MB (vs 512 MB raw)
  Speedup: 150x

Implementation Hints:

Product Quantization structure:

Original vector (128 dims):
[v0, v1, v2, ..., v127]

Split into M=8 subvectors (16 dims each):
[v0-v15], [v16-v31], ..., [v112-v127]

Each subvector โ†’ quantized to one of 256 centroids
Compressed code: [c0, c1, c2, c3, c4, c5, c6, c7]  (8 bytes total)

Implementation:

import numpy as np
from sklearn.cluster import KMeans


class ProductQuantizer:
    def __init__(self, dim, num_segments, num_centroids=256):
        self.dim = dim
        self.num_segments = num_segments
        self.segment_dim = dim // num_segments
        self.num_centroids = num_centroids
        self.codebooks = []  # One codebook per segment

    def train(self, vectors):
        for seg in range(self.num_segments):
            start = seg * self.segment_dim
            end = start + self.segment_dim
            subvectors = vectors[:, start:end]

            kmeans = KMeans(n_clusters=self.num_centroids)
            kmeans.fit(subvectors)
            self.codebooks.append(kmeans.cluster_centers_)

    def encode(self, vector):
        codes = []
        for seg in range(self.num_segments):
            start = seg * self.segment_dim
            end = start + self.segment_dim
            subvector = vector[start:end]

            # Find nearest centroid
            distances = np.linalg.norm(self.codebooks[seg] - subvector, axis=1)
            code = np.argmin(distances)
            codes.append(code)

        return np.array(codes, dtype=np.uint8)

    def decode(self, codes):
        vector = []
        for seg, code in enumerate(codes):
            vector.extend(self.codebooks[seg][code])
        return np.array(vector)
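
A quick usage sketch (illustrative sizes; training on random data just to exercise the API):

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((5_000, 128)).astype(np.float32)

pq = ProductQuantizer(dim=128, num_segments=8, num_centroids=256)
pq.train(data)                                    # 8 codebooks of 256 centroids
codes = np.vstack([pq.encode(v) for v in data])   # shape (5000, 8), dtype uint8
approx = pq.decode(codes[0])
print(np.linalg.norm(data[0] - approx))           # reconstruction error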

Asymmetric Distance Computation (ADC):

def compute_distance_tables(self, query):
    """Precompute distances from query subvectors to all centroids"""
    tables = []
    for seg in range(self.num_segments):
        start = seg * self.segment_dim
        end = start + self.segment_dim
        query_sub = query[start:end]

        # Distance from query subvector to each centroid
        distances = np.linalg.norm(self.codebooks[seg] - query_sub, axis=1)
        tables.append(distances)

    return np.array(tables)  # Shape: (num_segments, num_centroids)

def search_pq(self, query, codes, k):
    """Search using precomputed distance tables"""
    tables = self.compute_distance_tables(query)

    # Compute approximate distance to each code
    distances = np.zeros(len(codes))
    for seg in range(self.num_segments):
        distances += tables[seg, codes[:, seg]] ** 2

    # Top-k
    indices = np.argpartition(distances, k)[:k]
    return indices[np.argsort(distances[indices])]

Why ADC is fast:

  • Query: Compute 8 ร— 256 = 2,048 subvector distances (one time)
  • Per vector: 8 table lookups + 8 additions
  • 1M vectors: 8M lookups vs 128M multiplications

The Core Question Youโ€™re Answering

โ€œHow do we compress 512-byte vectors to 128 bytes (or even 8 bytes) without losing search quality?โ€

This is the central problem of Product Quantization. When you have billions of vectors, memory becomes the primary bottleneck. Loading 512GB of vectors into RAM is expensive; loading 8GB is feasible. But naive compression destroys the distance relationships we need for similarity search.

The insight: Instead of compressing the entire vector as one unit, split it into independent subspaces and quantize each separately. This preserves local structure while achieving massive compression.

Why this matters in production:

  • Billion-scale search on single machines: 1B vectors ร— 512 bytes = 512GB (impossible) vs 1B ร— 8 bytes = 8GB (feasible)
  • Cache efficiency: Compressed codes fit in CPU cache, enabling faster scanning
  • Cost reduction: Less RAM = cheaper infrastructure
  • IVF-PQ combination: The most common production config in vector databases

Concepts You Must Understand First

Before implementing Product Quantization, ensure you understand these foundational concepts:

1. Vector Quantization (VQ)

  • What it is: Replacing continuous values with discrete codes from a codebook
  • Classic example: k-means clustering where each point is represented by its cluster centroid
  • The tradeoff: Memory savings vs reconstruction error
  • Why you need it: PQ is just VQ applied to subvectors instead of full vectors

2. K-Means Clustering

  • Lloydโ€™s algorithm: Iterative refinement (assign โ†’ update โ†’ repeat)
  • Codebook: The set of cluster centroids
  • Quantization error: Distance between original point and assigned centroid
  • Why you need it: Training PQ codebooks is running k-means on subvectors

3. Subspace Decomposition

  • Concept: Splitting a 128D vector into 8 ร— 16D independent subvectors
  • Independence assumption: Distance in each subspace contributes independently to total distance
  • Why it works: For high-dimensional random vectors, this assumption holds reasonably well
  • Why you need it: This is what makes โ€œproductโ€ quantization different from regular VQ

4. Asymmetric Distance Computation (ADC)

  • Symmetric: Both query and database vectors are quantized
  • Asymmetric: Only database vectors are quantized; query remains exact
  • Why asymmetric is better: Preserves query accuracy while compressing database
  • Why you need it: This is the secret to maintaining search quality with PQ

5. Distance Decomposition

  • For squared Euclidean distance: ||q - v||ยฒ = ฮฃแตข ||qแตข - vแตข||ยฒ where i indexes subspaces
  • This enables ADC: Compute distance per subspace, then sum
  • Precomputation: Build lookup tables of subspace distances once per query
  • Why you need it: This is how PQ achieves fast approximate distance computation
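
A quick numpy check of that decomposition (a minimal sketch; the 128-dim vector and 16-dim subspaces are just example sizes):

import numpy as np

q, v = np.random.rand(128), np.random.rand(128)
full = np.sum((q - v) ** 2)                              # ||q - v||²
per_subspace = sum(np.sum((q[s:s+16] - v[s:s+16]) ** 2)  # Σᵢ ||qᵢ - vᵢ||²
                   for s in range(0, 128, 16))
assert np.isclose(full, per_subspace)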

Study resources:

  • Original PQ paper: Jรฉgou et al., โ€œProduct Quantization for Nearest Neighbor Searchโ€
  • K-means: Chapter 9 of โ€œPattern Recognition and Machine Learningโ€ by Bishop
  • Vector quantization: โ€œVector Quantization and Signal Compressionโ€ by Gersho & Gray

Questions to Guide Your Design

1. How many subspaces (M) should I use?

  • Tradeoff: More subspaces = finer quantization (better recall), but longer codes (less compression) and larger lookup tables
  • Common choices: M = 8, 16, 32, 64
  • Rule of thumb: Each subspace should have at least 8-16 dimensions
  • Example: For 128D vectors, M=8 gives 16D subspaces; M=16 gives 8D subspaces

2. How many centroids (K) per codebook?

  • Determines code size: K=256 requires 8 bits per subspace
  • Common choices: K = 256 (8 bits), K = 16 (4 bits), K = 4096 (12 bits)
  • Tradeoff: More centroids = better reconstruction, but larger lookup tables
  • Memory impact: Each codebook is K ร— (D/M) ร— 4 bytes

3. How much training data do I need?

  • Minimum: At least 256 ร— 20 = 5,120 vectors for k-means to converge (if K=256)
  • Recommended: 100K-1M vectors for stable codebooks
  • Quality: More training data = codebooks better represent the distribution
  • Tradeoff: Training time vs codebook quality

4. Symmetric vs Asymmetric distance?

  • Symmetric (SDC): Quantize both query and database
    • Pros: Simpler, slightly faster
    • Cons: Lower accuracy (query quantization error)
  • Asymmetric (ADC): Keep query exact, quantize database only
    • Pros: Better accuracy (no query error)
    • Cons: Requires distance tables
  • Recommendation: Always use ADC for better recall

5. How do I balance compression vs accuracy?

  • Compression ratio = (D ร— 4 bytes) / (M ร— bits_per_code / 8)
  • Example: 128D, M=8, 8 bits โ†’ (128ร—4) / (8ร—1) = 64ร— compression
  • Accuracy: Measured by recall@k compared to exact search
  • Tuning: Increase M or K to improve recall (at cost of compression/speed)

6. Should I normalize vectors before PQ?

  • Yes, if using cosine similarity: Normalize to unit length first
  • Euclidean distance: Normalization optional but often helps
  • Why: Prevents one subspace from dominating distance computation
  • How: L2 normalize each vector before training and encoding

Thinking Exercise

Manual Product Quantization Walkthrough

Letโ€™s manually apply PQ to understand the process:

Given vector (16D for simplicity):

v = [0.5, 0.3, 0.8, 0.2,  0.1, 0.9, 0.4, 0.7,  0.6, 0.2, 0.3, 0.8,  0.9, 0.1, 0.5, 0.4]

Step 1: Split into M=4 subvectors (4D each):

vโ‚ = [0.5, 0.3, 0.8, 0.2]
vโ‚‚ = [0.1, 0.9, 0.4, 0.7]
vโ‚ƒ = [0.6, 0.2, 0.3, 0.8]
vโ‚„ = [0.9, 0.1, 0.5, 0.4]

Step 2: Assume we trained codebooks (K=4 centroids each):

Codebook 1:
  cโ‚โ‚€ = [0.1, 0.1, 0.2, 0.1]
  cโ‚โ‚ = [0.5, 0.3, 0.7, 0.2]  โ† closest to vโ‚
  cโ‚โ‚‚ = [0.8, 0.7, 0.9, 0.8]
  cโ‚โ‚ƒ = [0.3, 0.5, 0.4, 0.6]

Codebook 2:
  cโ‚‚โ‚€ = [0.1, 0.8, 0.3, 0.6]
  cโ‚‚โ‚ = [0.2, 0.9, 0.5, 0.7]  โ† closest to vโ‚‚
  cโ‚‚โ‚‚ = [0.7, 0.3, 0.8, 0.2]
  cโ‚‚โ‚ƒ = [0.9, 0.1, 0.1, 0.9]

Codebook 3:
  cโ‚ƒโ‚€ = [0.2, 0.1, 0.1, 0.3]
  cโ‚ƒโ‚ = [0.6, 0.2, 0.4, 0.7]  โ† closest to vโ‚ƒ
  cโ‚ƒโ‚‚ = [0.9, 0.8, 0.7, 0.9]
  cโ‚ƒโ‚ƒ = [0.4, 0.5, 0.6, 0.5]

Codebook 4:
  cโ‚„โ‚€ = [0.1, 0.1, 0.2, 0.1]
  cโ‚„โ‚ = [0.4, 0.3, 0.5, 0.3]
  cโ‚„โ‚‚ = [0.8, 0.2, 0.4, 0.5]  โ† closest to vโ‚„
  cโ‚„โ‚ƒ = [0.9, 0.9, 0.8, 0.9]

Step 3: Encode vector (find nearest centroid per subspace):

vโ‚ โ†’ cโ‚โ‚ (code 1)
vโ‚‚ โ†’ cโ‚‚โ‚ (code 1)
vโ‚ƒ โ†’ cโ‚ƒโ‚ (code 1)
vโ‚„ โ†’ cโ‚„โ‚‚ (code 2)

Compressed code: [1, 1, 1, 2]  (4 codes, 2 bits each = 1 byte total)

Step 4: Decode to approximate vector:

v' = [cโ‚โ‚, cโ‚‚โ‚, cโ‚ƒโ‚, cโ‚„โ‚‚]
   = [0.5, 0.3, 0.7, 0.2,  0.2, 0.9, 0.5, 0.7,  0.6, 0.2, 0.4, 0.7,  0.8, 0.2, 0.4, 0.5]

Step 5: Compute reconstruction error:

||v - v'||² = (0.8-0.7)² + (0.1-0.2)² + (0.4-0.5)² + (0.3-0.4)² + (0.8-0.7)² + (0.9-0.8)² + (0.1-0.2)² + (0.5-0.4)² + (0.4-0.5)²
            = 9 × 0.01
            = 0.09

Step 6: Asymmetric distance to query:

Given query: q = [0.4, 0.2, 0.7, 0.1, 0.3, 0.8, 0.5, 0.6, 0.5, 0.3, 0.2, 0.9, 0.8, 0.2, 0.4, 0.3]

Build distance tables:

Table 1 (distances from q₁=[0.4,0.2,0.7,0.1] to each centroid in Codebook 1):
  d(q₁, c₁₀) = 0.59
  d(q₁, c₁₁) = 0.17  ← v₁ encoded as code 1
  d(q₁, c₁₂) = 0.97
  d(q₁, c₁₃) = 0.66

Table 2 (distances from q₂=[0.3,0.8,0.5,0.6] to each centroid in Codebook 2):
  d(q₂, c₂₀) = 0.28
  d(q₂, c₂₁) = 0.17  ← v₂ encoded as code 1
  d(q₂, c₂₂) = 0.81
  d(q₂, c₂₃) = 1.05

Table 3 (distances from q₃=[0.5,0.3,0.2,0.9] to each centroid in Codebook 3):
  d(q₃, c₃₀) = 0.71
  d(q₃, c₃₁) = 0.32  ← v₃ encoded as code 1
  d(q₃, c₃₂) = 0.81
  d(q₃, c₃₃) = 0.61

Table 4 (distances from q₄=[0.8,0.2,0.4,0.3] to each centroid in Codebook 4):
  d(q₄, c₄₀) = 0.76
  d(q₄, c₄₁) = 0.42
  d(q₄, c₄₂) = 0.20  ← v₄ encoded as code 2
  d(q₄, c₄₃) = 1.01

Approximate distance:

d(q, v) โ‰ˆ Tableโ‚[1]ยฒ + Tableโ‚‚[1]ยฒ + Tableโ‚ƒ[1]ยฒ + Tableโ‚„[2]ยฒ
        = 0.13ยฒ + 0.11ยฒ + 0.23ยฒ + 0.05ยฒ
        = 0.0169 + 0.0121 + 0.0529 + 0.0025
        = 0.0844

Key insights:

  1. Compression: 16D ร— 4 bytes = 64 bytes โ†’ 4 codes ร— 0.25 bytes = 1 byte (64ร— compression!)
  2. Reconstruction error: 0.09 (some information lost)
  3. Asymmetric distance: Only 4 table lookups + 4 additions per vector
  4. Codebook storage: 4 codebooks ร— 4 centroids ร— 4D ร— 4 bytes = 256 bytes

The Interview Questions Theyโ€™ll Ask

Theory Questions:

  1. Explain Product Quantization in one sentence.
    • Answer: โ€œPQ splits vectors into subspaces, quantizes each subspace independently, and represents vectors as concatenated codes for massive compression with minimal search quality loss.โ€
  2. Why is it called โ€˜productโ€™ quantization?
    • Answer: The set of possible reconstructed vectors is the Cartesian product of the M codebooks. With M subspaces and K centroids each, you have K^M possible codes (e.g., 256^8 โ‰ˆ 10^19 for M=8, K=256).
  3. Whatโ€™s the compression ratio for 128D float vectors with M=8, K=256?
    • Answer: Original: 128 dims ร— 4 bytes = 512 bytes. Compressed: 8 codes ร— 1 byte = 8 bytes. Ratio: 64ร— (98.4% reduction).
  4. Why use asymmetric distance instead of symmetric?
    • Answer: Symmetric quantizes both query and database, introducing error on both sides. Asymmetric keeps the query exact, only quantizing database vectors, which preserves better search accuracy with minimal overhead (building distance tables).
  5. How does PQ affect recall?
    • Answer: Recall decreases because quantization introduces approximation error. Typical recall@10: 85-95% with proper tuning. Increasing M or K improves recall but reduces compression.

Design Questions:

  1. You have 1B vectors of 512D. Design a PQ configuration.
    • Considerations:
      • Memory budget: 1B ร— X bytes (choose X based on RAM)
      • M: 512/32 = 16D per subspace with M=32
      • K: 256 (8 bits) for good reconstruction
      • Result: 1B ร— 32 bytes = 32GB + codebooks (32 ร— 256 ร— 16 ร— 4 bytes = 512KB)
      • Training: Sample 1M vectors for codebook training
  2. How would you debug low recall in PQ search?
    • Check reconstruction error: Decode some vectors and compare to originals
    • Visualize codebook coverage: Are centroids well-distributed?
    • Check training data: Is it representative of query distribution?
    • Try different M: Too few subspaces may lose structure
    • Compare ADC vs SDC: Should see higher recall with ADC
  3. How does PQ integrate with IVF?
    • Answer: IVF-PQ combines coarse quantization (IVF for partitioning) with fine quantization (PQ for compression). IVF narrows search to a few partitions (~100-1000), then PQ enables fast scanning within partitions. This is the most common production setup (used by FAISS, Milvus, etc.).

Implementation Questions:

  1. Walk me through the training process.
    • Answer:
      1. Sample training vectors (e.g., 100K-1M)
      2. For each subspace m = 0..M-1:
        a. Extract subvectors: subvecs = train[:, m*D/M : (m+1)*D/M]
        b. Run k-means with K centroids
        c. Store centroids as codebook_m
      3. Save all M codebooks
  2. How do you implement fast ADC search?
    • Answer:
      1. Precompute distance tables: For each subspace, compute distance from query subvector to all K centroids โ†’ M ร— K distances
      2. For each database vector code [c₀, c₁, …, c_{M-1}]:
        a. Look up distances: d = [table[0][c₀], table[1][c₁], ..., table[M-1][c_{M-1}]]
        b. Sum squared distances: dist = sum(d²)
      3. Return top-k by dist
        • Complexity: O(Mร—K) table computation + O(Nร—M) scanning vs O(Nร—D) brute force
  3. How would you optimize PQ for cache efficiency?
    • Answer:
      • Store codes contiguously in memory (good spatial locality)
      • Align distance tables to cache lines (64 bytes)
      • Use SIMD for parallel table lookups and summation
      • Batch queries to amortize table computation
      • Interleave codes by subspace for better access patterns

Tradeoff Questions:

  1. PQ vs scalar quantization (SQ)?
    • PQ: Better compression (64ร—), more complex, requires training
    • SQ: Simple (min/max per dimension), moderate compression (4ร—), no training needed
    • Use PQ when: Memory is critical, can afford training time
    • Use SQ when: Need simple deployment, moderate compression sufficient
  2. What happens if query distribution differs from training data?
    • Answer: Codebooks may not represent query-relevant regions well, leading to higher quantization error and lower recall. Solutions: Retrain periodically, use larger K, or use query-aware training (sample training data from past queries).

Hints in Layers

Layer 1: Getting Started (Beginner)

  • Start with a tiny dataset: 1000 vectors, 16D, M=4, K=16
  • Implement basic encoding: Split vector โ†’ find nearest centroid per subspace โ†’ save codes
  • Test reconstruction: Decode codes back to vectors, compute error
  • Verify: Reconstruction error should be small but non-zero

Layer 2: Core PQ (Intermediate)

  • Implement k-means training: Use sklearn or write simple Lloydโ€™s algorithm
  • Train one codebook per subspace on real data (e.g., 10K vectors)
  • Encode entire dataset, measure compression ratio
  • Implement decoding and visualize reconstruction error distribution
  • Verify: Codebooks should have reasonable spread, reconstruction error < 10%

Layer 3: Asymmetric Distance (Intermediate-Advanced)

  • Implement distance table computation: Query subvector to all centroids
  • Implement ADC search: Table lookups + summation
  • Compare with brute force on original vectors: Measure recall@10
  • Optimize table computation: Vectorize with NumPy
  • Verify: ADC should be 10-100ร— faster than brute force, recall > 80%

Layer 4: Optimization (Advanced)

  • Batch table computation: Compute for multiple queries at once
  • Optimize code storage: Use np.uint8 for codes, contiguous arrays
  • Add SIMD hints: Align arrays, use numba for JIT compilation
  • Implement early termination: Stop after finding k candidates
  • Verify: Should achieve <1ms per query on 1M dataset
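
If you try the numba route, a minimal sketch of a JIT-compiled ADC scan might look like this. It assumes `codes` is an (N, M) uint8 array of PQ codes and `tables` an (M, K) float32 array of squared subspace distances; these names are illustrative:

import numpy as np
from numba import njit

@njit(cache=True)
def adc_scan(codes, tables):
    # Approximate squared distances for all N codes via table lookups
    n, m = codes.shape
    dists = np.zeros(n, dtype=np.float32)
    for i in range(n):
        total = 0.0
        for seg in range(m):
            total += tables[seg, codes[i, seg]]   # one lookup per subspace
        dists[i] = total
    return dists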

Layer 5: Production Ready (Expert)

  • Add incremental updates: Insert new vectors without full retraining
  • Implement codebook persistence: Save/load codebooks efficiently
  • Add monitoring: Track recall, latency, compression ratio over time
  • Combine with IVF: Partition space, then apply PQ within partitions
  • Tune hyperparameters: Grid search over M, K, training size
  • Verify: Recall > 90%, latency < 10ms on 10M+ dataset

Debugging tips:

  • Low recall? Check if codebooks are well-trained (visualize centroids)
  • Slow search? Profile table computation vs scanning (should be scanning-dominated)
  • High memory? Verify code dtype is uint8/uint16, not float
  • Poor reconstruction? Increase K or M, or normalize vectors before PQ

Books That Will Help

Primary Resource (Deep Dive):

  1. โ€œProduct Quantization for Nearest Neighbor Searchโ€ - Jรฉgou, Douze, Schmid (TPAMI 2011)
    • The original paper - only 12 pages but incredibly dense
    • Sections to focus on: Section III (Product Quantization), Section IV (Asymmetric Distance)
    • Contains all the math and theory you need
    • Available: IEEE Xplore or search โ€œJรฉgou PQ paperโ€

Foundational Books:

  1. โ€œVector Quantization and Signal Compressionโ€ - Gersho & Gray
    • Chapter 1-4: Foundations of quantization theory
    • Chapter 6: Vector quantization and k-means
    • Why it helps: Provides deep background on quantization, distortion, rate-distortion theory
  2. โ€œPattern Recognition and Machine Learningโ€ - Christopher Bishop
    • Chapter 9: K-means clustering and EM algorithm
    • Chapter 12: PCA and dimensionality reduction
    • Why it helps: Builds intuition for clustering and subspace methods
  3. โ€œFoundations of Multidimensional and Metric Data Structuresโ€ - Hanan Samet
    • Chapter 4: High-dimensional indexing
    • Why it helps: Context for where PQ fits in the landscape of ANN algorithms

Practical Resources:

  1. FAISS Documentation - Facebook AI Research
    • FAISS Wiki on PQ
    • Real-world implementation details and tuning guidelines
    • C++ code to study for optimization techniques
  2. โ€œBillion-scale similarity search with GPUsโ€ - Johnson, Douze, Jรฉgou (2017)
    • How PQ is used at billion-scale in production
    • GPU optimization techniques
    • Available: arXiv:1702.08734

Online Tutorials:

  1. Pinecone Learning Center
  2. Towards Data Science Articles
    • Search for โ€œProduct Quantizationโ€ and โ€œVector Quantizationโ€
    • Many practical tutorials with code examples
    • Good for seeing different implementation approaches

Research Papers (Advanced):

  1. โ€œOptimized Product Quantizationโ€ - Ge et al. (CVPR 2013)
    • Improvements to basic PQ: non-orthogonal subspace decomposition
    • Why it helps: Shows how to push PQ further for higher recall
  2. โ€œPolysemous Codesโ€ - Douze et al. (ECCV 2016)
    • Combines PQ codes with binary codes for faster filtering
    • Why it helps: State-of-the-art extension of PQ

How to Use These Resources:

  • Week 1: Read Jรฉgou PQ paper (section by section), implement basic encoding/decoding
  • Week 2: Study k-means (Bishop Ch 9), implement codebook training
  • Week 3: Read ADC section carefully, implement asymmetric distance
  • Week 4: Study FAISS code, optimize your implementation
  • Week 5+: Explore advanced papers (OPQ, polysemous codes) for extensions

Learning milestones:

  1. Codebooks train successfully โ†’ You understand quantization
  2. Encoding/decoding works โ†’ You understand the representation
  3. ADC is much faster than brute force โ†’ You understand the speedup
  4. Recall drops gracefully โ†’ You understand the accuracy tradeoff

Project 5: HNSW (Hierarchical Navigable Small World)

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: C++, Rust, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The โ€œOpen Coreโ€ Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Graph-Based Search / Skip Lists
  • Software or Tool: HNSW Index
  • Main Book: โ€œEfficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphsโ€ - Malkov & Yashunin

What youโ€™ll build: A complete HNSW implementationโ€”the state-of-the-art algorithm for vector search, combining navigable small-world graphs with a hierarchical structure for logarithmic search complexity.

Why it teaches HNSW: HNSW is the default choice for most vector databases (Pinecone, Qdrant, Weaviate, pgvector). It offers the best balance of speed, accuracy, and flexibility. Understanding it deeply means understanding modern vector search.

Core challenges youโ€™ll face:

  • Graph construction โ†’ maps to finding good neighbors during insertion
  • Hierarchical layers โ†’ maps to skip-list-like navigation
  • Greedy search โ†’ maps to navigating the graph efficiently
  • Parameter tuning (M, efConstruction, efSearch) โ†’ maps to quality vs speed

Resources for key challenges:

Key Concepts:

  • Small World Graphs: Watts-Strogatz model
  • Skip Lists: Probabilistic data structure
  • Greedy Search: Local optimization in graphs
  • HNSW Paper: Malkov & Yashunin, 2016

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-2 completed. Understanding of graph data structures.

Real world outcome:

$ ./hnsw_build --dataset sift_1M.npy --M 16 --efConstruction 200

Building HNSW Index:
  Dataset: 1,000,000 vectors ร— 128 dimensions
  M (max connections): 16
  efConstruction: 200

Construction progress:
  Layer 3:        12 nodes (entry points)
  Layer 2:       234 nodes
  Layer 1:     4,891 nodes
  Layer 0: 1,000,000 nodes (all vectors)

Build time: 8m 32s
Index size: 287 MB (graph) + 512 MB (vectors) = 799 MB

Graph statistics:
  Avg connections at layer 0: 14.2
  Max layer reached: 4
  Graph diameter (estimated): 6 hops

$ ./hnsw_search --query query.npy --k 10 --ef 50

HNSW Search (ef=50):
  Starting from entry point at layer 3
  Layer 3: 2 hops โ†’ found node 8,234
  Layer 2: 3 hops โ†’ found node 12,891
  Layer 1: 4 hops โ†’ found node 4,521
  Layer 0: Expanded 50 candidates

Results:
โ•”โ•โ•โ•โ•โ•โ•โ•โ•คโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•คโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘ Rank  โ”‚ Vector ID โ”‚ Distance     โ•‘
โ• โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ชโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฃ
โ•‘   1   โ”‚   4,521   โ”‚    0.823     โ•‘
โ•‘   2   โ”‚   7,892   โ”‚    0.891     โ•‘
โ•‘  ...  โ”‚    ...    โ”‚     ...      โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•งโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

Performance:
  Nodes visited: 143
  Distance calculations: 143 (vs 1,000,000 brute force)
  Time: 0.8ms
  Recall@10: 0.98

$ ./hnsw_search --ef-sweep
ef   โ”‚ Recall@10 โ”‚ Query Time โ”‚ Nodes Visited
โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  10 โ”‚   0.72    โ”‚   0.2ms    โ”‚       38
  20 โ”‚   0.88    โ”‚   0.4ms    โ”‚       67
  50 โ”‚   0.98    โ”‚   0.8ms    โ”‚      143
 100 โ”‚   0.995   โ”‚   1.4ms    โ”‚      271
 200 โ”‚   0.999   โ”‚   2.6ms    โ”‚      523

Implementation Hints:

HNSW key insight: Combine two ideas:

  1. Navigable Small World (NSW): A graph where you can reach any node in few hops via greedy routing
  2. Hierarchy (Skip List): Multiple layers where upper layers have fewer nodes, enabling fast coarse search

Layer 2:  [A]----------[B]----------[C]  (few nodes, long edges)
           |            |            |
Layer 1:  [A]---[D]---[B]---[E]---[C]  (more nodes)
           |    |      |    |      |
Layer 0:  [A][F][D][G][B][H][E][I][C]  (all nodes, short edges)

Core data structures:

import numpy as np
from heapq import heapify, heappush, heappop


def distance(a, b):
    # Euclidean distance; all of the snippets below assume this helper
    return np.linalg.norm(a - b)


class HNSWNode:
    def __init__(self, vector_id, vector, level):
        self.id = vector_id
        self.vector = vector
        self.level = level
        self.neighbors = [[] for _ in range(level + 1)]
        # neighbors[layer] = list of neighbor node ids

class HNSWIndex:
    def __init__(self, dim, M=16, ef_construction=200):
        self.dim = dim
        self.M = M  # Max connections per node
        self.M_max0 = M * 2  # Max connections at layer 0
        self.ef_construction = ef_construction
        self.ml = 1 / np.log(M)  # Level multiplier
        self.nodes = {}  # id -> HNSWNode
        self.entry_point = None
        self.max_level = -1

Insertion algorithm:

def insert(self, vector_id, vector):
    # 1. Determine level for new node (probabilistic)
    level = self._random_level()
    node = HNSWNode(vector_id, vector, level)

    if self.entry_point is None:
        self.entry_point = node
        self.max_level = level
        self.nodes[vector_id] = node
        return

    # 2. Find entry point at each level via greedy search
    current = self.entry_point

    # 3. Search from top to node's level (greedy, single result)
    for lc in range(self.max_level, level, -1):
        current = self._search_layer_greedy(vector, current, lc)

    # 4. At node's level and below, find ef_construction neighbors
    for lc in range(min(level, self.max_level), -1, -1):
        candidates = self._search_layer(vector, current, self.ef_construction, lc)

        # Select M best neighbors
        neighbors = self._select_neighbors(vector, candidates, self.M if lc > 0 else self.M_max0)

        # Bidirectional connections
        for neighbor in neighbors:
            node.neighbors[lc].append(neighbor.id)
            neighbor.neighbors[lc].append(vector_id)

            # Prune if too many connections (keep storing ids, not node objects)
            if len(neighbor.neighbors[lc]) > (self.M if lc > 0 else self.M_max0):
                pruned = self._select_neighbors(
                    neighbor.vector,
                    [self.nodes[n] for n in neighbor.neighbors[lc]],
                    self.M if lc > 0 else self.M_max0
                )
                neighbor.neighbors[lc] = [n.id for n in pruned]

        if len(candidates) > 0:
            # Continue the descent from the candidate closest to the new vector
            current = min(candidates, key=lambda n: distance(vector, n.vector))

    # 5. Update entry point if new node is higher
    if level > self.max_level:
        self.entry_point = node
        self.max_level = level

    self.nodes[vector_id] = node

def _random_level(self):
    # Skip-list-style assignment: P(level >= l) = M^(-l),
    # equivalently level = floor(-ln(U) * mL) with mL = 1/ln(M)
    u = 1.0 - np.random.random()          # uniform in (0, 1], avoids log(0)
    return min(int(-np.log(u) * self.ml), 16)

Search algorithm:

def search(self, query, k, ef):
    if self.entry_point is None:
        return []

    current = self.entry_point

    # Greedy search from top to layer 1
    for lc in range(self.max_level, 0, -1):
        current = self._search_layer_greedy(query, current, lc)

    # At layer 0, do beam search with ef candidates
    candidates = self._search_layer(query, current, ef, 0)

    # Return top-k
    return sorted(candidates, key=lambda n: distance(query, n.vector))[:k]

def _search_layer(self, query, entry, ef, layer):
    visited = {entry.id}
    candidates = [(distance(query, entry.vector), entry)]  # min-heap
    results = [(-distance(query, entry.vector), entry)]  # max-heap (negative for max)

    heapify(candidates)
    heapify(results)

    while candidates:
        dist_c, current = heappop(candidates)

        # Stop if closest candidate is worse than worst result
        if dist_c > -results[0][0]:
            break

        for neighbor_id in current.neighbors[layer]:
            if neighbor_id in visited:
                continue
            visited.add(neighbor_id)

            neighbor = self.nodes[neighbor_id]
            dist_n = distance(query, neighbor.vector)

            if dist_n < -results[0][0] or len(results) < ef:
                heappush(candidates, (dist_n, neighbor))
                heappush(results, (-dist_n, neighbor))

                if len(results) > ef:
                    heappop(results)

    return [node for _, node in results]
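
The snippets above call _select_neighbors and _search_layer_greedy without showing them. Minimal sketches of both are below; the distance-only selection is deliberately naive, and the diversity heuristic discussed later (design question 4 and Layer 4 of the hints) improves on it:

def _select_neighbors(self, base_vector, candidates, M):
    # Naive selection: keep the M candidates closest to base_vector
    return sorted(candidates, key=lambda n: distance(base_vector, n.vector))[:M]

def _search_layer_greedy(self, query, entry, layer):
    # Greedy walk: move to whichever neighbor is closest to the query,
    # stop when no neighbor improves on the current node
    current, current_dist = entry, distance(query, entry.vector)
    improved = True
    while improved:
        improved = False
        for neighbor_id in current.neighbors[layer]:
            neighbor = self.nodes[neighbor_id]
            d = distance(query, neighbor.vector)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current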

Parameters explained:

  • M: Max connections per node. Higher = more accurate, more memory
  • ef_construction: Beam width during build. Higher = better graph, slower build
  • ef_search: Beam width during search. Higher = better recall, slower search

The Core Question Youโ€™re Answering

โ€œHow does a multi-layer graph enable logarithmic search in high-dimensional spaces?โ€

This is the heart of HNSW. Unlike flat graphs (NSW) that require examining many nodes, or tree structures that fail in high dimensions, HNSW uses a hierarchical graph structure inspired by skip lists. Upper layers have fewer nodes with long-range connections for coarse navigation, while lower layers have dense connections for fine-grained search.

The key insight: Start at the top sparse layer to quickly traverse the space, then progressively refine your search by moving down layers until you reach layer 0 where all vectors exist. This gives you O(log N) search complexity instead of O(N).

Why this matters for vector databases:

  • Enables sub-millisecond search on millions of vectors
  • Scales logarithmically rather than linearly
  • Maintains high recall (>95%) while being 100x-1000x faster than brute force
  • Powers production systems like Pinecone, Weaviate, and Qdrant

Concepts You Must Understand First

Before building HNSW, ensure you understand these foundational concepts:

1. Small-World Networks (Watts-Strogatz Model)

What it is: A graph where most nodes can be reached from any other node in a small number of hops.

Why it matters: HNSW builds on the Navigable Small World (NSW) concept where greedy routing (always moving to the neighbor closest to the target) can find approximate nearest neighbors efficiently.

Key property: High clustering coefficient + short average path length

Learn it:

  • Study the Watts-Strogatz model and how random edges create shortcuts
  • Understand why social networks are small-world graphs (โ€œsix degrees of separationโ€)
  • Read about the NSW algorithm (precursor to HNSW)

2. Skip Lists

What it is: A probabilistic data structure that allows O(log n) search in sorted linked lists by maintaining multiple layers of increasingly sparse elements.

Why it matters: HNSWโ€™s hierarchical structure is directly inspired by skip lists. Each node is assigned a random layer level, and upper layers contain progressively fewer nodes.

Key property: Probabilistic layer assignment creates logarithmic height

Learn it:

  • Implement a basic skip list for 1D data
  • Understand probabilistic level assignment: level = floor(-ln(random()) * ml)
  • See how search starts at top layer and descends to find elements
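
You can see what this assignment rule produces with a few lines of simulation (a minimal sketch, assuming M = 16 and the mL = 1/ln(M) multiplier used in the hints above):

import numpy as np

M = 16
ml = 1 / np.log(M)                                    # level multiplier mL
u = np.random.default_rng(0).random(1_000_000)
levels = np.floor(-np.log(1 - u) * ml).astype(int)    # 1-u avoids log(0)

for l in range(levels.max() + 1):
    print(f"level >= {l}: {(levels >= l).sum():>9,}")  # drops ~16x per layer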

3. Navigable Graphs

What it is: A graph where greedy routing (always moving to the best neighbor) reliably finds good paths to targets.

Why it matters: HNSW search works because the graph maintains navigability at each layer.

Key property: At each node, there exists a neighbor thatโ€™s closer to any query point

Learn it:

  • Understand greedy search on graphs (always move to closest neighbor)
  • Study Delaunay graphs (naturally navigable but expensive to build)
  • Learn why random graphs are NOT navigable

4. Graph Layers and Zoom Levels

What it is: Different levels of granularity in representing the same graph structure.

Why it matters: Upper layers in HNSW act like โ€œzoom outโ€ views with long-range connections, while layer 0 is fully โ€œzoomed inโ€.

Analogy: Like Google Maps - start with country view (layer 3), zoom to city (layer 2), then street (layer 1), then buildings (layer 0)


Questions to Guide Your Design

Before coding, think through these design questions:

1. M Parameter (Max Connections per Node)

  • Q: If M=16, each node has max 16 neighbors per layer. How does this affect:
    • Graph connectivity?
    • Memory usage?
    • Search quality?
    • Build time?
  • Think about: What happens if M is too small (4)? Too large (128)?
  • Answer: M controls the tradeoff between recall and memory. Typical values: 12-48.

2. efConstruction (Beam Width During Build)

  • Q: During insertion, you search for efConstruction candidates, then select M best. Why not just keep the first M you find?

  • Think about: How does exploring more candidates during build affect the final graph quality?
  • Answer: Higher efConstruction creates a better-connected graph but slows down insertion. Typical: 100-500.

3. Layer Selection Strategy

  • Q: How do you decide which layer a new node belongs to?
  • Think about:
    • Should it be deterministic or random?
    • What probability distribution?
    • How does this affect the layer structure?
  • Answer: Use exponential decay: P(level ≥ l) = e^(-l · ln M) = M^(-l). This creates a skip-list-like structure.

4. Neighbor Selection Heuristic

  • Q: You have efConstruction=200 candidates but can only keep M=16 neighbors. Which do you pick?
    • The 16 closest by distance?
    • 16 that maximize diversity?
    • 16 that maintain graph connectivity?
  • Think about: Pure distance-based selection can create โ€œclustersโ€ that hurt navigability.
  • Answer: Use a heuristic that balances closeness and diversity (e.g., avoid neighbors too close to each other).

5. Greedy Routing at Upper Layers

  • Q: At layer 3 (few nodes), you do greedy search (single path). At layer 0 (all nodes), you do beam search (ef paths). Why this difference?

  • Think about: Computational cost vs accuracy at different layers.
  • Answer: Upper layers are for coarse navigation (speed), layer 0 is for finding actual neighbors (accuracy).

Thinking Exercise: Trace Search by Hand

Setup: Imagine a 2D dataset (for visualization) with 1000 points and this HNSW structure:

  • Layer 2: 4 nodes
  • Layer 1: 40 nodes
  • Layer 0: 1000 nodes (all points)
  • M = 8, ef = 20

Exercise: Trace this search for query Q at position (7.5, 3.2):

Layer 2: [A:(0,0), B:(10,0), C:(0,10), D:(10,10)]
Entry point: A

Query Q: (7.5, 3.2)

Step-by-step:

  1. Layer 2 - Greedy Search:
    • Current: A (0,0), distance to Q = 8.2
    • Neighbors of A at layer 2: [B, C]
    • Check B: (10,0), distance = 4.1 ✓ closer!
    • Check C: (0,10), distance = 10.1 ✗ worse
    • Move to B. Check B’s neighbors: [A, D]
    • Check D: (10,10), distance = 7.2 ✗ worse than current (B at 4.1)
    • Result: Best node at layer 2 = B (10, 0)
  2. Layer 1 - Greedy Search:
    • Start from Bโ€™s layer 1 representation
    • Bโ€™s neighbors at layer 1: [B1:(9,0), B2:(10,1), โ€ฆ]
    • Greedy walk until local minimum
    • Result: Best node at layer 1 = (8, 3)
  3. Layer 0 - Beam Search:
    • Start from (8, 3)
    • Maintain ef=20 candidates
    • Explore neighbors in a priority queue
    • Keep top-20 closest candidates
    • Result: Final 10 nearest neighbors

Key observations:

  • Each layer narrows the search region
  • Each layer takes only a small, roughly constant number of greedy hops
  • The number of layers grows as O(log N)
  • Total complexity: O(log N)

Your task: Draw this scenario on paper. Create 4 nodes at layer 2, 40 at layer 1 (include the 4), and mark query Q. Trace the greedy path down the layers.


The Interview Questions Theyโ€™ll Ask

If you understand HNSW deeply, you should be able to answer these questions:

Theory Questions

  1. โ€œExplain why HNSW has O(log N) search complexity while NSW is O(N).โ€
    • Expected answer: The hierarchical structure reduces the search space at each layer. Upper layers have fewer nodes (sparse), so greedy search reaches the target region in O(1) hops. Descending layers increases nodes but maintains logarithmic hops. Without hierarchy (NSW), you must traverse a flat dense graph.
  2. โ€œWhat happens if you set M=2? What about M=128?โ€
    • M=2: Graph becomes a chain-like structure, poor connectivity, searches fail to find good paths
    • M=128: Excellent connectivity and recall, but massive memory overhead and slower search (more neighbors to check)
  3. โ€œWhy use exponential decay for layer assignment instead of uniform random?โ€
    • Exponential decay creates a skip-list structure where upper layers have exponentially fewer nodes. This ensures O(log N) layers. Uniform random would create O(N) layers or unbalanced structures.
  4. โ€œHNSW vs IVF: when would you choose each?โ€
    • HNSW: Better recall, consistent latency, online updates, smaller datasets (<100M)
    • IVF: Better for billion-scale, lower memory (with PQ), batch workloads, GPU acceleration

Implementation Questions

  1. โ€œHow do you handle the entry point initialization?โ€
    • First node becomes entry point at its random level
    • Each new node that reaches a higher level becomes the new entry point
    • Entry point is the starting node for all searches
  2. โ€œExplain the pruning step during insertion.โ€
    • When adding bidirectional edge, neighbor might exceed M connections
    • Must prune to M using selection heuristic (keep diverse neighbors)
    • Without pruning, memory grows unbounded
  3. โ€œWhatโ€™s the difference between efConstruction and efSearch?โ€
    • efConstruction: beam width during index building (affects graph quality)
    • efSearch: beam width during query (affects recall/speed tradeoff)
    • Higher efConstruction โ†’ better graph โ†’ can use lower efSearch
  4. โ€œHow would you make HNSW distributed?โ€
    • Partition vectors by ID or cluster
    • Each shard builds local HNSW
    • Search: scatter query to all shards, gather top-k from each, merge
    • Challenge: no global entry point

Debugging Questions

  1. โ€œYour recall is only 60% but hnswlib gets 95%. Whatโ€™s wrong?โ€
    • Check: Are you doing bidirectional connections?
    • Check: Is neighbor selection too greedy? (need diversity)
    • Check: Is efConstruction too low?
    • Check: Is distance calculation correct?
  2. โ€œBuild time is 10x slower than expected. Whereโ€™s the bottleneck?โ€
    • Profile: Is _search_layer called too often?
    • Check: Is efConstruction too high?
    • Check: Are you using a heap for candidate selection?
    • Check: Distance computation vectorized?

Hints in Layers

Build your understanding progressively:

Layer 1: Build a Flat Navigable Graph (NSW)

Goal: Understand greedy graph search before adding hierarchy.

Start here:

class NSWGraph:
    def __init__(self, M=16):
        self.M = M
        self.nodes = {}
        self.entry_point = None

    def insert(self, vector_id, vector):
        # Find M nearest neighbors via greedy search
        # Add bidirectional edges
        # That's it! No layers yet.

What youโ€™ll learn:

  • Greedy search on graphs
  • Bidirectional edge maintenance
  • Why M matters for connectivity

Success criteria: Can search 10K vectors with 70-80% recall

Layer 2: Add the Hierarchy (Skip List Structure)

Goal: Add probabilistic layers like skip lists.

Add this:

def _random_level(self):
    level = 0
    while random.random() < 1/self.M and level < 16:
        level += 1
    return level

class HNSWNode:
    def __init__(self, vector_id, vector, level):
        self.level = level
        self.neighbors = [[] for _ in range(level + 1)]

What youโ€™ll learn:

  • How layer assignment creates hierarchy
  • Multi-layer graph structure
  • Entry point management

Success criteria: Can build multi-layer graph, verify layer distribution

Layer 3: Add Hierarchical Search

Goal: Start at top, greedily descend to layer 0.

Add this:

def search(self, query, k, ef):
    current = self.entry_point

    # Greedy at upper layers
    for lc in range(self.max_level, 0, -1):
        current = self._search_layer_greedy(query, current, lc)

    # Beam search at layer 0
    candidates = self._search_layer(query, current, ef, 0)
    return top_k(candidates)

What youโ€™ll learn:

  • Why greedy search at top layers
  • Why beam search at layer 0
  • How entry point guides search

Success criteria: Search is faster than flat NSW, recall still 70-80%

Layer 4: Add Pruning and Selection Heuristics

Goal: Improve graph quality with smart neighbor selection.

Add this:

def _select_neighbors_heuristic(self, base_vector, candidates, M):
    # Don't just pick M closest
    # Pick M that maximize diversity and avoid redundancy
    # Example: avoid neighbors too close to each other
    selected = []
    for candidate in sorted(candidates, key=distance):
        if len(selected) >= M:
            break
        # Check: is candidate too close to already selected?
        if not too_close_to_any(candidate, selected):
            selected.append(candidate)
    return selected

What youโ€™ll learn:

  • Why pure distance is insufficient
  • How diversity improves navigability
  • Graph connectivity vs local clustering

Success criteria: Recall jumps to 90-95%

Layer 5: Optimize and Match Production Quality

Goal: Make it fast and robust.

Add this:

  • Vectorize distance computations
  • Use heapq for candidate management
  • Pre-allocate numpy arrays
  • Add build progress monitoring
  • Parameter validation

Success criteria: Match hnswlib within 2x on speed and 99% on recall
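
One way to check that criterion is to benchmark against hnswlib on the same data (a minimal sketch; `my_index` stands in for your own implementation and is hypothetical, the sizes are arbitrary):

import numpy as np
import hnswlib   # pip install hnswlib

dim, n = 128, 100_000
rng = np.random.default_rng(0)
data = rng.standard_normal((n, dim)).astype(np.float32)
queries = rng.standard_normal((100, dim)).astype(np.float32)

ref = hnswlib.Index(space='l2', dim=dim)
ref.init_index(max_elements=n, ef_construction=200, M=16)
ref.add_items(data, np.arange(n))
ref.set_ef(50)
ref_labels, _ = ref.knn_query(queries, k=10)

# Overlap with your own index's results (my_index is hypothetical):
# my_labels = [[node.id for node in my_index.search(q, k=10, ef=50)] for q in queries]
# print(np.mean([len(set(a) & set(b)) / 10 for a, b in zip(ref_labels, my_labels)]))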


Books That Will Help

Primary Resources

  1. โ€œEfficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphsโ€ - Malkov & Yashunin (2016)
    • The original HNSW paper
    • Read sections: Abstract, Algorithm 1 (insertion), Algorithm 2 (search)
    • Skip: Heavy theory sections on first read
    • Focus on: Figure 1 (layer structure), Algorithm descriptions
  2. โ€œSearching in Metric Spacesโ€ - Zezula, Amato, Dohnal, Batko
    • Chapter on navigable small worlds
    • Background on metric space searching
    • Focus on: NSW foundations, greedy routing

Supplementary Materials

  1. Skip Lists papers
    • โ€œSkip Lists: A Probabilistic Alternative to Balanced Treesโ€ - Pugh (1990)
    • Understanding probabilistic level assignment
    • See parallels to HNSW layer selection
  2. Small-World Networks
    • โ€œCollective dynamics of โ€˜small-worldโ€™ networksโ€ - Watts & Strogatz (1998)
    • Understand why small-world graphs enable efficient navigation
    • Key insight: High clustering + short path length

Practical Guides

  1. Pinecone Learning Center - HNSW
    • URL: https://www.pinecone.io/learn/series/faiss/hnsw/
    • Best practical explanation with visualizations
    • Focus on: Layer-by-layer search visualization
  2. hnswlib Source Code
    • URL: https://github.com/nmslib/hnswlib
    • Production C++ implementation
    • Focus on: hnswlib/hnswalg.h - core algorithms
    • Compare your Python to their optimized C++
  3. FAISS Source Code
    • URL: https://github.com/facebookresearch/faiss
    • File: faiss/IndexHNSW.cpp
    • See how Facebook implements HNSW at scale

Video Resources

  1. โ€œBillion-scale approximate nearest neighbor searchโ€ - Matthijs Douze (Facebook AI)
    • YouTube search for FAISS talks
    • Understanding HNSW in context of other methods
  2. โ€œVector Search & HNSWโ€ - Pinecone
    • Explains intuition before math
    • Good for visual learners

Papers for Deep Dive

  1. โ€œNavigating the Landscape of Metric Space Searchingโ€ - Malkov et al.
    • Comprehensive comparison of graph-based methods
    • See how NSW evolved to HNSW

Reading Order:

  1. Start with Pinecone guide (intuition)
  2. Read HNSW paper abstract + algorithms
  3. Implement Layer 1-2 from hints
  4. Read Skip Lists paper (understand probabilistic layers)
  5. Implement Layer 3-4
  6. Study hnswlib source code (optimizations)
  7. Read Small-World Networks paper (why it works)

Learning milestones:

  1. Hierarchical structure builds correctly โ†’ You understand the layers
  2. Greedy search finds good paths โ†’ You understand navigation
  3. Recall matches hnswlib โ†’ Youโ€™ve implemented correctly
  4. Parameter tuning works โ†’ You understand the tradeoffs

Project 6: IVF-PQ Composite Index

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The โ€œOpen Coreโ€ Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Combined Indexing / Large-Scale Search
  • Software or Tool: IVF-PQ Index
  • Main Book: โ€œBillion-scale similarity search with GPUsโ€ - Johnson et al.

What youโ€™ll build: A composite IVF-PQ index that combines clustering (IVF) with compression (PQ) for billion-scale search with minimal memory usage.

Why it teaches composite indexes: No single technique is enough at scale. IVF-PQ combines the best of both worlds: IVF reduces search scope, PQ reduces memory. This is how FAISS handles billion-vector datasets.
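
A minimal sketch of that pipeline, assuming scikit-learn's KMeans for both the coarse quantizer and the PQ codebooks and a dimension divisible by m (function names like train/encode/adc_distance are illustrative, not FAISS APIs):

import numpy as np
from sklearn.cluster import KMeans

def train(X, nlist=256, m=8, nbits=8):
    # Stage 1 (IVF): coarse k-means over the raw vectors
    coarse = KMeans(n_clusters=nlist, n_init=4).fit(X)
    residuals = X - coarse.cluster_centers_[coarse.labels_]

    # Stage 2 (PQ): one codebook per sub-vector of the *residuals*
    sub = X.shape[1] // m
    codebooks = [
        KMeans(n_clusters=2 ** nbits, n_init=4)
        .fit(residuals[:, i * sub:(i + 1) * sub])
        for i in range(m)
    ]
    return coarse, codebooks

def encode(v, coarse, codebooks):
    # Assign to the nearest coarse centroid, then PQ-encode the residual
    list_id = int(coarse.predict(v[None, :])[0])
    r = v - coarse.cluster_centers_[list_id]
    sub = len(r) // len(codebooks)
    codes = [int(cb.predict(r[None, i * sub:(i + 1) * sub])[0])
             for i, cb in enumerate(codebooks)]
    return list_id, np.array(codes, dtype=np.uint8)

def adc_distance(query, list_id, codes, coarse, codebooks):
    # Asymmetric distance: query residual vs. stored codes via
    # per-subspace lookup tables (the "precomputed tables" idea)
    q_res = query - coarse.cluster_centers_[list_id]
    sub = len(q_res) // len(codebooks)
    dist = 0.0
    for i, cb in enumerate(codebooks):
        table = ((cb.cluster_centers_ - q_res[i * sub:(i + 1) * sub]) ** 2).sum(axis=1)
        dist += table[codes[i]]
    return dist

At query time you probe a few nearest coarse lists and score every stored code with table lookups instead of touching original vectors, which is the ADC-with-IVF combination described below.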

Core challenges youโ€™ll face:

  • Training pipeline โ†’ maps to IVF clustering then PQ on residuals
  • Residual encoding โ†’ maps to PQ on (vector - centroid)
  • Search optimization โ†’ maps to precomputed tables + SIMD
  • Memory layout โ†’ maps to efficient storage for cache

Key Concepts:

  • Residual Quantization: PQ on the difference from centroid
  • Two-Stage Training: IVF first, then PQ
  • ADC with IVF: Combining both speedups

Learning milestones:

  1. Two-stage training works โ†’ You understand the pipeline
  2. Residual encoding improves accuracy โ†’ You understand why it helps
  3. Memory reduction is massive โ†’ You understand the compression
  4. Search matches FAISS performance โ†’ Youโ€™ve implemented correctly

Project 7: Disk-Based Vector Index (DiskANN-style)

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: C++
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The โ€œOpen Coreโ€ Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Storage Systems / SSD Optimization
  • Software or Tool: Disk-Based Vector Index
  • Main Book: โ€œDiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Nodeโ€ - Microsoft Research

What youโ€™ll build: A vector index that stores vectors on SSD with an in-memory graph for navigation, enabling billion-scale search on a single commodity machine.

Why it teaches disk-based search: Memory is expensive. Storing 1B vectors in RAM requires 500GB+. DiskANN-style indexes keep only the graph in memory (~10GB) and fetch vectors from SSD during search, reducing cost by 50x.

Core challenges youโ€™ll face:

  • Graph fits in memory, vectors on disk โ†’ maps to memory-mapped I/O
  • Minimizing disk reads โ†’ maps to graph design for locality
  • SSD-friendly access patterns โ†’ maps to sequential vs random reads
  • Beam search with async I/O โ†’ maps to pipelining disk access

Key Concepts:

  • DiskANN Paper: Microsoft Research, 2019
  • Memory-Mapped Files: OS-level I/O optimization
  • SSD Performance: Random vs sequential read patterns

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Project 5 completed. C++ proficiency, understanding of file I/O.

Real world outcome:

$ ./diskann_build --dataset deep_1B.npy --memory-limit 16GB

Building DiskANN Index:
  Dataset: 1,000,000,000 vectors ร— 96 dimensions
  Vector data size: 384 GB (will be on disk)
  Memory limit: 16 GB (for graph)

Building graph with memory-efficient construction...
  Pass 1: Sample-based graph initialization
  Pass 2: Full graph refinement (streaming)
  Pass 3: Graph compaction

Index files:
  vectors.bin: 384 GB (memory-mapped)
  graph.bin: 12 GB (in-memory)
  metadata.json: 1 KB

$ ./diskann_search --query query.npy --k 10 --beam-width 50

DiskANN Search:
  Graph navigation: 0.4ms (in-memory)
  Disk reads: 23 vectors fetched
  SSD latency: 2.1ms (avg 0.09ms per read)
  Total: 2.5ms

Performance analysis:
  Graph hops: 8
  Candidate set size: 50
  Disk reads: 23 (< beam width due to caching)
  Cache hits: 12

Recall@10: 0.96

$ ./diskann_search --benchmark --queries 10000
Benchmark (10K queries):
  Mean latency: 2.8ms
  P99 latency: 5.2ms
  Throughput: 357 queries/second
  SSD bandwidth: 1.2 GB/s

Implementation Hints:

Architecture:

┌──────────────────────────────────────────┐
│             In-Memory Graph              │
│   (Node ID → [Neighbor IDs], 12 GB)      │
├──────────────────────────────────────────┤
│          Memory-Mapped Vectors           │
│   (Node ID → Vector, 384 GB on SSD)      │
└──────────────────────────────────────────┘

Core structures:

struct DiskANNIndex {
    // In-memory: graph only
    vector<vector<uint32_t>> graph;  // Neighbor lists

    // Memory-mapped: vectors on disk
    float* vectors;  // mmap'd from file
    size_t num_vectors;
    size_t dim;

    // Caching
    LRUCache<uint32_t, vector<float>> vector_cache;
};

Memory-mapping vectors:

void open_vectors(const string& path) {
    int fd = open(path.c_str(), O_RDONLY);
    size_t file_size = num_vectors * dim * sizeof(float);

    vectors = (float*)mmap(
        nullptr,
        file_size,
        PROT_READ,
        MAP_PRIVATE,  // Copy-on-write
        fd,
        0
    );

    // Advise OS about access pattern
    madvise(vectors, file_size, MADV_RANDOM);
}

vector<float> get_vector(uint32_t id) {
    // Check cache first
    if (auto cached = vector_cache.get(id)) {
        return *cached;
    }

    // Load from mmap (may trigger disk read)
    float* ptr = vectors + id * dim;
    vector<float> vec(ptr, ptr + dim);

    vector_cache.put(id, vec);
    return vec;
}

Search with prefetching:

vector<pair<uint32_t, float>> search(
    const vector<float>& query,
    size_t k,
    size_t beam_width
) {
    priority_queue<pair<float, uint32_t>> candidates;  // max-heap over negated distances (acts as a min-heap on distance)
    priority_queue<pair<float, uint32_t>> results;     // max-heap on distance: top() is the worst result kept
    unordered_set<uint32_t> visited;

    // Start from entry point
    auto entry_vec = get_vector(entry_point);
    float dist = distance(query, entry_vec);
    candidates.push({-dist, entry_point});  // Negative for min behavior
    results.push({dist, entry_point});
    visited.insert(entry_point);

    while (!candidates.empty()) {
        auto [neg_dist, current] = candidates.top();
        candidates.pop();

        if (-neg_dist > results.top().first && results.size() >= beam_width) {
            break;
        }

        // Prefetch neighbor vectors asynchronously
        auto& neighbors = graph[current];
        for (uint32_t neighbor : neighbors) {
            // Hint the OS to page this vector in before we need it
            // (e.g. madvise(MADV_WILLNEED) on the containing page)
            prefetch(&vectors[neighbor * dim]);
        }

        // Process neighbors
        for (uint32_t neighbor : neighbors) {
            if (visited.count(neighbor)) continue;
            visited.insert(neighbor);

            auto neighbor_vec = get_vector(neighbor);
            float d = distance(query, neighbor_vec);

            if (d < results.top().first || results.size() < beam_width) {
                candidates.push({-d, neighbor});
                results.push({d, neighbor});
                if (results.size() > beam_width) {
                    results.pop();
                }
            }
        }
    }

    // Extract top-k: the max-heap pops worst-first, so collect everything,
    // reverse to closest-first, then keep only the k best
    vector<pair<uint32_t, float>> top_k;
    while (!results.empty()) {
        top_k.push_back({results.top().second, results.top().first});
        results.pop();
    }
    reverse(top_k.begin(), top_k.end());
    if (top_k.size() > k) top_k.resize(k);
    return top_k;
}

Graph construction (memory-efficient):

void build_graph(const string& vectors_path, size_t R, size_t L) {
    // R = max degree, L = candidate list size during construction

    // Initialize with random graph
    for (size_t i = 0; i < num_vectors; i++) {
        graph[i] = random_neighbors(R);
    }

    // Iterative refinement (process vectors in chunks to limit memory)
    size_t chunk_size = 1000000;  // 1M vectors at a time
    for (size_t offset = 0; offset < num_vectors; offset += chunk_size) {
        size_t end = min(offset + chunk_size, num_vectors);

        // Load chunk into memory
        vector<vector<float>> chunk_vectors = load_chunk(vectors_path, offset, end);

        // For each vector in chunk, improve its neighbors
        for (size_t i = offset; i < end; i++) {
            auto candidates = beam_search(chunk_vectors[i - offset], L);
            graph[i] = select_diverse_neighbors(candidates, R);

            // Add reverse edges
            for (uint32_t neighbor : graph[i]) {
                if (graph[neighbor].size() < R) {
                    graph[neighbor].push_back(i);
                }
            }
        }
    }
}

The Core Question Youโ€™re Answering:

โ€œHow do we search billions of vectors that donโ€™t fit in RAM?โ€

This is the fundamental scaling challenge in vector search. A billion 96-dimensional float vectors require 384GB of RAM just for storage, plus additional memory for the index structure. At $10/GB/month for cloud RAM, thatโ€™s $3,840/month just for storage. DiskANN solves this by keeping only the navigation graph (~10-20GB) in memory and storing vectors on SSD (~$0.10/GB/month), reducing costs by 50x while maintaining sub-10ms search latency.

The insight: You donโ€™t need all vectors in RAM. You only need the graph structure that tells you which vectors to examine. With careful graph design and SSD-aware access patterns, you can achieve 95%+ recall with just 20-50 disk reads per query.

Concepts You Must Understand First:

Before building DiskANN, you need to understand:

  1. SSD Performance Characteristics:
    • Sequential read: 3-7 GB/s (fast)
    • Random read: 500K-1M IOPS, ~50-100 ฮผs latency per read (slower)
    • Page granularity: SSDs read in 4KB pages, not individual bytes
    • Why it matters: Reading 96 floats (384 bytes) costs the same as reading 4KB. You pay for the whole page.
  2. Memory-Mapped Files (mmap):
    • OS maps file content to virtual memory addresses
    • Reading a mapped address triggers page fault โ†’ OS loads page from disk โ†’ returns data
    • OS manages page cache automatically (LRU eviction)
    • Why it matters: You get disk-backed arrays without manual I/O code.
  3. Vamana Graph Structure:
    • Undirected graph with bounded degree R (typically 64-128)
    • Built to maximize navigability: short path to any target
    • Pruning strategy maintains both connectivity and diversity
    • Why it matters: Good graph structure = fewer disk reads per search.
  4. Beam Search Navigation:
    • Maintains beam_width candidates (e.g., 50-100) during search
    • Explores candidates in best-first order
    • Greedy: expands most promising nodes first
    • Why it matters: Limits number of nodes visited = limits disk reads.
  5. Caching Strategies:
    • Hot vectors (entry points, frequently visited nodes) should stay in cache
    • LRU cache for recently accessed vectors (a minimal sketch follows this list)
    • Pre-warming: load entry point and high-degree nodes at startup
    • Why it matters: Cache hit = no disk read (0.1ms saved per hit).
  6. Async I/O and Prefetching:
    • Issue prefetch hints for upcoming neighbor accesses
    • OS can fetch multiple pages in parallel
    • Overlaps computation with I/O latency
    • Why it matters: Can reduce total latency by 30-50%.
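
Concept 5 is worth prototyping up front. A minimal Python sketch of an LRU cache (production code would use a fixed-size, concurrency-friendly structure in C++):

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                      # miss: caller fetches from mmap/SSD
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used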

Questions to Guide Your Design:

Ask yourself these questions while building:

  1. Page Size and Alignment:
    • Should vectors be aligned to 4KB page boundaries?
    • Should you pack multiple small vectors per page?
    • How does dimension affect packing efficiency?
  2. Graph Degree R:
    • Higher R = more neighbors = better recall but more disk reads
    • Lower R = faster search but may get stuck in local minima
    • Whatโ€™s the sweet spot for your dataset?
  3. Beam Width:
    • Larger beam = more candidates = more disk reads but better recall
    • Smaller beam = faster but may miss the target
    • How does beam width affect the recall/latency tradeoff?
  4. What to Cache:
    • Entry point (always accessed)?
    • High-degree nodes (hubs in the graph)?
    • Recently accessed vectors (temporal locality)?
    • What cache size gives diminishing returns?
  5. Graph Construction:
    • Can you build the graph without loading all vectors into RAM?
    • Can you process vectors in streaming fashion?
    • How do you avoid out-of-memory during construction?
  6. Disk Access Patterns:
    • Are your disk reads sequential or random?
    • Can you batch reads to improve throughput?
    • Are you triggering unnecessary page faults?

Thinking Exercise:

Before coding, work through these calculations:

  1. Memory Budget:
    • 1B vectors × 96 dims × 4 bytes = 384 GB (vectors on disk)
    • Graph: 1B nodes × R neighbors × 4 bytes per neighbor ID
    • If R=64: 1B × 64 × 4 = 256 GB (graph structure)
    • That's already over your 16GB memory limit!
    • Key insight: even bare 4-byte neighbor IDs blow the budget at R=64. You must lower the degree, compress the adjacency lists, and/or keep the graph itself on SSD and cache only its hot portion (the DiskANN paper stores the graph on disk next to each vector and keeps compressed PQ codes in RAM).
  2. Disk Reads Per Query:
    • Beam width = 50
    • Assume 8 graph hops to reach target
    • Worst case: 50 candidates ร— 8 hops = 400 disk reads
    • But! Many candidates share neighbors โ†’ actual reads โ‰ˆ 100-150
    • With caching: reduce to 20-50 unique disk reads
    • At 0.1ms per read: 2-5ms disk latency
  3. Page Fault Analysis:
    • One vector = 96 floats ร— 4 bytes = 384 bytes
    • SSD page = 4KB
    • Vectors per page: 4096 / 384 โ‰ˆ 10 vectors
    • If vectors are randomly placed: 1 page fault per vector
    • If vectors are co-located by graph proximity: 1 page fault per 10 vectors (10x speedup!)
  4. Cache Size Impact:
    • 1GB cache = ~2.7M vectors (at 384 bytes each)
    • If top 0.1% of vectors account for 20% of accesses (Zipf distribution)
    • Caching 1M vectors (0.1%) could eliminate 20% of disk reads
    • Diminishing returns after ~2-4GB cache

The Interview Questions Theyโ€™ll Ask:

Practice answering these:

  1. โ€œWhy use DiskANN instead of HNSW for billion-scale search?โ€
    • HNSW requires all vectors in RAM โ†’ 384GB for 1B vectors โ†’ $3,840/month
    • DiskANN uses 10-20GB RAM + 400GB SSD โ†’ $200/month (19x cheaper)
    • Latency: HNSW 1-2ms, DiskANN 3-8ms (3-4x slower but acceptable)
    • When to use: DiskANN when cost matters, HNSW when latency is critical
  2. โ€œHow do you minimize disk reads during search?โ€
    • Good graph structure: short paths (6-8 hops) to any target
    • Beam search: prune candidates aggressively
    • Caching: keep hot vectors (entry points, hubs) in memory
    • Prefetching: hint OS to load neighbors in parallel
    • Result: 20-50 disk reads per query instead of 400+
  3. โ€œWhatโ€™s the difference between random and sequential SSD access?โ€
    • Sequential: 3-7 GB/s (reading contiguous blocks)
    • Random: 500K IOPS ร— 4KB = 2 GB/s (jumping around)
    • DiskANN is random access by nature (graph navigation)
    • Mitigation: Pre-sort vectors by graph proximity, batch reads
  4. โ€œHow do you build the graph without running out of memory?โ€
    • Donโ€™t load all vectors at once
    • Process in chunks (1M vectors per iteration)
    • For each chunk: improve graph connections using beam search
    • Multiple passes: gradually refine graph quality
    • Final graph fits in memory, but construction is streaming
  5. โ€œWhat happens if your cache is too small?โ€
    • More cache misses โ†’ more disk reads โ†’ higher latency
    • If cache < entry point + critical nodes: every query hits disk
    • Sweet spot: 1-4GB (0.1-0.4% of vectors) captures 20-40% of accesses
    • Beyond 10GB: diminishing returns (<5% reduction in disk reads)
  6. โ€œHow do memory-mapped files work?โ€
    • OS maps file to virtual address range
    • Reading address triggers page fault if page not in RAM
    • OS loads 4KB page from disk into RAM
    • Subsequent reads in same page are free (already in RAM)
    • OS evicts cold pages using LRU policy

Hints in Layers (Reveal When Stuck):

Layer 1 - Start Simple: Build in-memory HNSW first

  • Start with Project 5โ€™s HNSW implementation
  • Verify it works: build index, search, measure recall
  • Understand the baseline: how much RAM does it use?
  • This is your reference point for correctness

Layer 2 - Add Memory Mapping: Move vectors to disk

  • Keep graph in memory, move vectors to memory-mapped file
  • Use mmap() to map the vector file into the address space (a quick Python prototype with numpy.memmap is sketched below)
  • Measure: how many page faults per query? (use perf stat)
  • You should see ~50-200 page faults per query initially
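
The same idea can be prototyped in a few lines of Python with numpy.memmap before committing to the C++ mmap() code (the file name, shape, and dtype are assumptions about your own layout):

import numpy as np

num_vectors, dim = 1_000_000_000, 96

# Maps the file into the address space; nothing is read until a row is touched.
vectors = np.memmap("vectors.bin", dtype=np.float32, mode="r",
                    shape=(num_vectors, dim))

v = np.array(vectors[123_456])   # touching this row may trigger a page fault + SSD read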

Layer 3 - Add Caching: Reduce disk reads

  • Implement LRU cache for vectors (1-2GB)
  • Cache the entry point and high-degree nodes at startup
  • Measure cache hit rate: aim for 30-50%
  • This should reduce latency by 30-40%

Layer 4 - Optimize Graph: Improve locality

  • Re-order vector IDs by graph proximity (BFS from entry point; see the sketch below)
  • Vectors that are neighbors in graph should be close in file
  • This improves page reuse: 10 vectors per page instead of 1
  • Should reduce page faults by 5-10x
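
A minimal sketch of that re-ordering, assuming the graph is a dict mapping node ID to a list of neighbor IDs (names are illustrative):

from collections import deque

def bfs_relabel(graph, entry_point):
    # Breadth-first order from the entry point, so graph neighbors get
    # nearby IDs and therefore nearby offsets in the vector file.
    order, seen = [], {entry_point}
    queue = deque([entry_point])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    order.extend(n for n in graph if n not in seen)      # unreachable stragglers last
    return {old: new for new, old in enumerate(order)}   # old ID -> new ID

Rewrite vectors.bin in this order and remap every neighbor list with the returned dictionary.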

Layer 5 - Add Prefetching: Overlap I/O with computation

  • Use madvise(MADV_WILLNEED) to hint upcoming accesses
  • When visiting node N, prefetch all neighbors of N
  • Measure: latency should drop 20-30% as CPU and I/O overlap

Layer 6 - Compression (Advanced): Reduce disk bandwidth

  • Store vectors in compressed format (Product Quantization)
  • Load compressed vectors (1/8 size), decompress in CPU cache
  • Final reranking: fetch full-precision vectors for top-K
  • This is how production systems (Bing, Meta) scale to 10B+ vectors

Debugging Tips:

  • Use iostat to monitor disk I/O: are you reading more than expected?
  • Use perf stat to count page faults: should be ~20-50 per query
  • Profile with perf record: where is time spent? (should be I/O wait)
  • Verify graph connectivity: run BFS from entry point, ensure all nodes reachable

Books That Will Help:

  1. โ€œDiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Nodeโ€ - Subramanya et al., NeurIPS 2019
    • The original paper (30 pages)
    • Covers Vamana graph, graph construction, search algorithm
    • Read sections 3-4 for graph design, section 5 for implementation
    • Available free at: https://papers.nips.cc/paper/2019/hash/09853c7fb1d3f8ee67a61b6bf4a7f8e8-Abstract.html
  2. โ€œOperating Systems: Three Easy Piecesโ€ - Remzi Arpaci-Dusseau
    • Chapter 13: Address Spaces
    • Chapter 36: I/O Devices (SSD characteristics)
    • Chapter 39: Files and Directories (memory-mapped I/O)
    • Free online: https://pages.cs.wisc.edu/~remzi/OSTEP/
  3. โ€œDatabase Internalsโ€ - Alex Petrov (Oโ€™Reilly, 2019)
    • Chapter 4: Implementing B-Trees (page-based storage)
    • Chapter 5: Transaction Processing and Recovery (disk I/O patterns)
    • Shows how databases manage disk-backed data structures
  4. โ€œSystems Performanceโ€ - Brendan Gregg (Pearson, 2020)
    • Chapter 8: File Systems (page cache, read-ahead)
    • Chapter 9: Disks (SSD internals, I/O scheduling)
    • Essential for understanding I/O bottlenecks and profiling
  5. โ€œThe Art of Computer Systems Performance Analysisโ€ - Raj Jain (Wiley, 1991)
    • Chapter 12: Queueing Models (I/O queuing, latency analysis)
    • How to reason about disk read latency distributions
    • Classic text for performance modeling

Papers to Read:

  • โ€œApproximate Nearest Neighbor Search on High Dimensional Dataโ€ - Li et al., ICDE 2020 (DiskANN improvements)
  • โ€œSPANN: Highly-efficient Billion-scale Approximate Nearest Neighbor Searchโ€ - Chen et al., NeurIPS 2021 (Microsoftโ€™s production system)
  • โ€œHM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memoryโ€ - Wang et al., NeurIPS 2022 (combines RAM + SSD + persistent memory)

Learning milestones:

  1. Memory-mapped I/O works โ†’ You understand OS-level optimization
  2. Graph fits in memory limit โ†’ You understand memory management
  3. Latency is SSD-bound โ†’ You understand the bottleneck
  4. Prefetching helps โ†’ You understand pipelining

Project 8: Filtered Vector Search

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The โ€œOpen Coreโ€ Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Hybrid Search / Inverted Indexes
  • Software or Tool: Filtered Vector Search
  • Main Book: โ€œIntroduction to Information Retrievalโ€ by Manning, Raghavan, Schรผtze

What youโ€™ll build: A vector search system that supports metadata filtering (e.g., โ€œfind similar products where price < $100 and category = โ€˜electronicsโ€™โ€) efficiently, not just post-filtering.

Why it teaches filtered search: Real applications always have filters. Naive post-filtering is slow (scan 1M vectors, filter to 1K, return top-10). Smart pre-filtering or integrated filtering is essential for production systems.

Core challenges youโ€™ll face:

  • Pre-filter vs post-filter โ†’ maps to when to apply filters
  • Bitmap indexes for metadata โ†’ maps to efficient set operations
  • Filtered HNSW traversal โ†’ maps to skipping invalid nodes
  • Cost estimation โ†’ maps to choosing the right strategy

Key Concepts:

  • Inverted Indexes: โ€œIntroduction to Information Retrievalโ€ - Manning et al.
  • Bitmap Indexes: Database index fundamentals
  • Filtered Graph Search: Qdrant, Pinecone approaches

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Project 5 completed. Understanding of database indexes.

Real world outcome:

$ ./filtered_search "
  SELECT * FROM products
  WHERE vector NEAR query_embedding
    AND category = 'electronics'
    AND price < 100
    AND rating >= 4.0
  LIMIT 10
"

Query Analysis:
  Filter selectivity: 2.3% (23,000 of 1,000,000 match)
  Strategy: Pre-filter + HNSW on filtered subset

Execution:
  Step 1: Apply filters (bitmap intersection): 0.8ms
  Step 2: Build filtered entry points: 0.1ms
  Step 3: HNSW search on filtered graph: 1.2ms

Results:
╔═══════╤═══════════════╤══════════════╤════════╤═════════════╗
║ Rank  │ Product ID    │ Distance     │ Price  │ Category    ║
╠═══════╪═══════════════╪══════════════╪════════╪═════════════╣
║   1   │ PROD-34521    │    0.234     │ $49.99 │ electronics ║
║   2   │ PROD-78234    │    0.287     │ $89.99 │ electronics ║
║  ...  │     ...       │     ...      │   ...  │     ...     ║
╚═══════╧═══════════════╧══════════════╧════════╧═════════════╝

Performance: 2.1ms total
  (vs 847ms with post-filtering on full HNSW traversal)

$ ./filtered_search --analyze-strategies
Filter Selectivity │ Best Strategy    │ Expected Time
───────────────────┼──────────────────┼──────────────
  > 50%            │ Post-filter      │  ~2ms
  10% - 50%        │ Filtered HNSW    │  ~3ms
  1% - 10%         │ Pre-filter       │  ~2ms
  < 1%             │ Scan filtered    │  ~1ms

Implementation Hints:

Metadata storage with bitmap indexes:

class MetadataIndex:
    def __init__(self):
        # For categorical: inverted index of bitmaps
        self.categorical = {}  # field -> {value -> bitmap}

        # For numerical: sorted index
        self.numerical = {}  # field -> sorted [(value, vector_id)]

    def add(self, vector_id, metadata):
        for field, value in metadata.items():
            if isinstance(value, str):
                if field not in self.categorical:
                    self.categorical[field] = {}
                if value not in self.categorical[field]:
                    self.categorical[field][value] = BitSet()
                self.categorical[field][value].set(vector_id)
            else:
                if field not in self.numerical:
                    self.numerical[field] = []
                self.numerical[field].append((value, vector_id))

    def build_indexes(self):
        for field in self.numerical:
            self.numerical[field].sort()

    def filter(self, conditions):
        result = None
        for field, op, value in conditions:
            if field in self.categorical:
                if op == '==':
                    bits = self.categorical[field].get(value, BitSet())
                elif op == 'in':
                    bits = BitSet()
                    for v in value:
                        bits |= self.categorical[field].get(v, BitSet())
            else:  # numerical
                bits = self._range_filter(field, op, value)

            if result is None:
                result = bits
            else:
                result &= bits  # Intersection

        return result

Filtered HNSW search:

class FilteredHNSW(HNSWIndex):
    def search_filtered(self, query, k, ef, valid_ids):
        """Search considering only nodes in valid_ids"""
        if len(valid_ids) == 0:
            return []

        # Convert to set for O(1) lookup
        valid_set = set(valid_ids)

        # Find a valid entry point
        entry = self._find_valid_entry(valid_set)
        if entry is None:
            # Fall back to scanning
            return self._scan_filtered(query, k, valid_set)

        # Modified beam search that skips invalid nodes
        visited = {entry.id}
        candidates = [(distance(query, entry.vector), entry)]
        results = []

        while candidates:
            dist_c, current = heappop(candidates)

            # Add to results if valid
            if current.id in valid_set:
                heappush(results, (-dist_c, current))
                if len(results) > ef:
                    heappop(results)

            # Explore neighbors (including invalid ones for navigation)
            for neighbor_id in current.neighbors[0]:
                if neighbor_id in visited:
                    continue
                visited.add(neighbor_id)

                neighbor = self.nodes[neighbor_id]
                dist_n = distance(query, neighbor.vector)

                # Always add to candidates for navigation
                if len(results) < ef or dist_n < -results[0][0]:
                    heappush(candidates, (dist_n, neighbor))

        # Extract top-k valid results
        return sorted(
            [(n.id, -d) for d, n in results],
            key=lambda x: x[1]
        )[:k]

Cost-based strategy selection:

def choose_strategy(self, query, filters, k):
    # Estimate filter selectivity (fraction of vectors that match)
    selectivity = self.estimate_selectivity(filters)
    num_valid = int(self.num_vectors * selectivity)

    # Estimate the cost of each strategy and pick the cheapest
    costs = {
        'post_filter': self.hnsw_search_cost() + num_valid * 0.001,
        'pre_filter_scan': self.filter_cost(filters) + self.scan_cost(num_valid),
        'filtered_hnsw': self.filter_cost(filters)
                         + self.hnsw_search_cost() / max(selectivity, 1e-6),
    }
    return min(costs, key=costs.get)

The Core Question Youโ€™re Answering

โ€œHow do we search by similarity AND attributes without losing accuracy?โ€

This is the fundamental tension in filtered vector search. Three naive approaches all fail:

  1. Post-filtering: Search all vectors, then filter โ†’ Fast search, but wastes 99% of work when filters are selective
  2. Pre-filtering + brute force: Filter first, scan matches โ†’ Accurate, but O(n) when you have millions of filtered results
  3. Separate systems: Vector index + SQL database โ†’ Two queries, consistency nightmares, network overhead

The real question is: Can we integrate filtering into the vector search itself, making the index aware of metadata?

This requires understanding:

  • When filtering BEFORE search is faster (few vectors match the filter)
  • When filtering DURING search is better (a moderate fraction matches)
  • When filtering AFTER search wins (most vectors match the filter)
  • How to traverse HNSW graphs while respecting filter constraints
  • Why bitmap indexes enable sub-millisecond filter evaluation

Real-world impact: Qdrantโ€™s filtered search is 100-400x faster than post-filtering. Pineconeโ€™s metadata filtering enables production e-commerce search. Understanding this makes you valuable to any company doing RAG or semantic search.


Concepts You Must Understand First

1. Pre-filtering vs Post-filtering

Pre-filtering:

Filter (1M → 10K) → Vector search on 10K → Top 10
Cost: Filter time + search(10K)

Post-filtering:

Vector search on 1M → Filter top 100 → Return matches
Cost: search(1M) + filter(100)
Problem: Might only get 3 results after filtering top 100

Key insight: Pre-filtering works when selectivity is low (< 5%). Post-filtering works when selectivity is high (> 50%).

2. Filtered IVF (Inverted File Index)

Standard IVF: Cluster vectors, search nearest clusters Filtered IVF: Each cluster stores metadata bitmaps

Cluster 1: [vec1, vec2, vec5] + {category: electronics → [1,0,1]}
Cluster 2: [vec3, vec4] + {category: electronics → [0,1]}

Search with filter "category = electronics" (a code sketch follows this list):

  • Visit cluster 1, consider only vec1 and vec5
  • Visit cluster 2, consider only vec4
  • Skip non-matching vectors during traversal
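
A minimal sketch of that filtered probe, assuming per-cluster ID and vector lists plus an `allowed` set produced by the metadata filter (all names are illustrative):

import numpy as np

def ivf_filtered_search(query, centroids, cluster_ids, cluster_vecs, allowed,
                        k=10, nprobe=2):
    # Probe the nprobe nearest clusters, skipping vectors that fail the filter
    probe = np.argsort(((centroids - query) ** 2).sum(axis=1))[:nprobe]
    hits = []
    for c in probe:
        for vid, vec in zip(cluster_ids[c], cluster_vecs[c]):
            if vid not in allowed:                        # filter check during traversal
                continue
            hits.append((float(((vec - query) ** 2).sum()), vid))
    hits.sort()
    return [(vid, dist) for dist, vid in hits[:k]]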

3. Hybrid Indexes

Combine inverted index (for filters) with vector index (for similarity):

Inverted Index:                    Vector Index (HNSW):
category=electronics → [1,5,9]     1 → [2,3,5]
price<100 → [1,2,3,8,9]            2 → [1,4,6]
                                   ...
Intersection: [1,9]                Search only among [1,9]

Fast path: Bitmap intersection in microseconds, vector search on subset.

4. Filter Selectivity

Selectivity = matching_vectors / total_vectors

  • High selectivity (80%): Most vectors match โ†’ post-filter is fine
  • Medium selectivity (10-50%): Some vectors match โ†’ filtered HNSW
  • Low selectivity (< 5%): Few vectors match โ†’ pre-filter + scan/small index

Estimating selectivity:

# Categorical: exact count from bitmap
selectivity = bitmap[category='electronics'].count() / total

# Numerical: histogram approximation
selectivity = histogram.estimate_range(price < 100) / total
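
A small sketch of the histogram estimate using numpy.histogram with 100 price buckets (the uniform prices are stand-in data, not a real dataset):

import numpy as np

prices = np.random.uniform(0, 1000, size=1_000_000)      # stand-in metadata column
counts, edges = np.histogram(prices, bins=100, range=(0, 1000))

def estimate_lt(threshold):
    # Sum only the buckets whose right edge is below the threshold;
    # the partially covered bucket is ignored (a slight underestimate).
    return counts[edges[1:] <= threshold].sum() / len(prices)

print(estimate_lt(100))   # ~0.10 for uniformly distributed prices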

5. Filtered Graph Traversal

Key challenge: HNSW navigation needs to traverse invalid nodes to reach valid ones.

Entry → [invalid] → [invalid] → [VALID TARGET]

Two strategies:

  • Strict filtering: Only traverse valid nodes โ†’ May get trapped, poor recall
  • Relaxed filtering: Traverse all, return only valid โ†’ Better recall, slightly slower

Qdrant uses relaxed with โ€œvisited valid countโ€ early termination.

6. Bitmap Index Fundamentals

Store set membership as bit vectors:

category = 'electronics': 10110010 (vectors 0,2,3,6 match)
price < 100:              11010010 (vectors 0,1,3,6 match)
Intersection (AND):       10010010 (vectors 0,3,6 match both)

Operations map to CPU-cache-friendly bitwise/SIMD instructions: intersecting a million-bit bitmap takes on the order of a microsecond.
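
A tiny Python sketch of the same idea, using arbitrary-precision integers as bitsets (a real engine would use packed 64-bit words or Roaring bitmaps):

def bit(i):
    return 1 << i

electronics = bit(0) | bit(2) | bit(3) | bit(6)     # vectors 0, 2, 3, 6 match
cheap       = bit(0) | bit(1) | bit(3) | bit(6)     # vectors 0, 1, 3, 6 match

both = electronics & cheap                          # single AND = set intersection
count = bin(both).count("1")                        # popcount -> 3 matches
ids = [i for i in range(8) if (both >> i) & 1]      # [0, 3, 6]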

Resources:

  • โ€œDatabase Internalsโ€ (Chapter 5: Transaction Processing and Recovery) - bitmap indexes
  • โ€œIntroduction to Information Retrievalโ€ (Chapter 1) - inverted indexes
  • Qdrant documentation on filtered search strategies

Questions to Guide Your Design

1. Filter Selectivity Analysis

  • How do you estimate selectivity BEFORE executing the filter?
  • What data structures enable fast selectivity estimation?
    • Histograms for numerical ranges?
    • Cardinality sketches for categorical values?
  • When is the estimate good enough vs exact count needed?

2. Index Strategy Selection

Three main strategies:

Post-filter: search_all(k * overfetch) → filter → take_top_k
Pre-filter:  filter → search_subset(k)
Hybrid:      filter → search_with_constraint(k)

For each query, which strategy?

  • Whatโ€™s the decision boundary in terms of selectivity?
  • How do you handle multiple filters (category AND price AND rating)?
  • What if selectivity estimation is wrong?

3. Bitmap vs Range Index

Categorical filters: bitmap is obvious choice Numerical filters: bitmap or sorted array?

# Bitmap approach
price_ranges[0-10] = 0101010
price_ranges[10-20] = 1010001
# Fast intersection, memory intensive

# Sorted array approach
[(5, vec1), (12, vec2), (18, vec3), ...]
# Binary search for range, less memory

When to use which?

  • Cardinality of numerical values?
  • Range query patterns?
  • Memory constraints?

4. HNSW Traversal with Filters

How do you modify HNSW search?

  • Start from filtered entry point or original entry?
  • Count only valid nodes toward ef parameter?
  • Expand search if not enough valid results?

Edge case: What if filter matches 5 vectors but you need top-10?

  • Return fewer results?
  • Relax filter?
  • Warn the user?

5. Index Maintenance

Vectors are added/updated/deleted with metadata:

  • How do you update bitmaps incrementally?
  • Do you rebuild indexes periodically?
  • How do you handle metadata schema changes?

6. Cost Model Accuracy

Your strategy selection depends on cost estimates:

post_filter_cost = hnsw_time(N) + filter_time(k * overfetch)
pre_filter_cost = filter_time(N) + search_time(filtered_count)

How do you calibrate these costs?

  • Offline benchmarking?
  • Online adaptive tuning?
  • What if workload changes?

Think about these before implementing. The best engineers reason about trade-offs before writing code.


Thinking Exercise

Scenario: 1 million product vectors with metadata:

{
  "category": "electronics",  # 10% of products
  "price": 79.99,             # 30% are < $100
  "rating": 4.5               # 40% are >= 4.0
}

Query: โ€œFind top-10 similar to [query_vec] where category=โ€™electronicsโ€™ AND price < 100 AND rating >= 4.0โ€

Part 1: Trace Pre-filtering Approach

Step-by-step execution:

1. category='electronics' → 100K matches (10%)
2. Intersect price<100 → ? matches
3. Intersect rating>=4.0 → ? matches
4. Search algorithm on filtered set → ?

Calculate:

  • How many vectors after each filter?
  • Assuming independent distributions, whatโ€™s final selectivity?
  • If you use brute force on filtered set, how many distance calculations?
  • If you use HNSW on full graph with filter, how many nodes visited?

Part 2: Trace Post-filtering Approach

1. HNSW search on 1M vectors with ef=200 โ†’ top 200 candidates
2. Filter candidates โ†’ how many match all conditions?
3. If < 10 matches, what do you do?

Calculate:

  • Probability that top-200 contains >= 10 filtered matches
  • Expected number of matches in top-200
  • Overfetch factor needed to get 10 results with 95% confidence

Part 3: Compare Recall

Pre-filtering with HNSW:

  • True top-10 from filtered set: A = [โ€ฆ]
  • Your results: B = [โ€ฆ]
  • Recall = |A ∩ B| / 10

Post-filtering with overfetch:

  • True top-10 considering filters: C = [โ€ฆ]
  • Your results: D = [โ€ฆ]
  • Recall = |C ∩ D| / 10

Why might they differ?

  • HNSW approximation errors
  • Filter position in pipeline
  • Overfetch amount

Part 4: Calculate Break-even Point

When does pre-filter become faster than post-filter?

Pre-filter cost:

T_filter + T_search(N_filtered)

Post-filter cost:

T_search(N_total) + T_filter(ef)

Solve for selectivity where costs are equal.

Part 5: Design Hybrid Strategy

Can you combine both?

if selectivity < 0.05:
    pre_filter + scan
elif selectivity < 0.3:
    pre_filter + HNSW on filtered graph
else:
    HNSW + post-filter with overfetch

For the given query (selectivity โ‰ˆ 0.012), which strategy?

  • Show your calculation
  • Estimate latency for each approach
  • Which would you choose?

This exercise forces you to think quantitatively about filtering strategies.


The Interview Questions Theyโ€™ll Ask

Q1: โ€œExplain pre-filtering vs post-filtering in vector search. When would you use each?โ€

Bad answer: โ€œPre-filtering filters first, post-filtering filters after.โ€

Good answer:

  • Pre-filtering: Apply metadata filter โ†’ search filtered subset
    • Best when: Filter is highly selective (< 5% match)
    • Cost: Filter is cheap, but searching small index might lack structure
    • Risk: If too few vectors match, accuracy suffers
  • Post-filtering: Search all vectors โ†’ filter top-k * overfetch
    • Best when: Filter is not selective (> 50% match)
    • Cost: Full search is expensive, but filtering small result set is cheap
    • Risk: Need overfetch factor to ensure enough filtered results
  • Decision: Estimate selectivity, compare costs, choose dynamically

Q2: โ€œHow do bitmap indexes enable fast filtering?โ€

Bad answer: โ€œTheyโ€™re just bit arrays that are fast.โ€

Good answer:

  • Represent set membership as bit vectors: category=โ€™electronicsโ€™ โ†’ 10110010
  • Set operations map to bitwise AND/OR: extremely cache-friendly
  • Modern CPUs: SIMD instructions process 256 bits at once
  • Intersection of 1M vectors: ~1ms (vs ~10ms with hash set)
  • Space: 1 bit per vector per categorical value
  • Trade-off: Memory vs speed (use for low-cardinality categoricals)

Show you understand the implementation details.

Q3: โ€œHow would you search HNSW with filters?โ€

Bad answer: โ€œJust skip nodes that donโ€™t match the filter.โ€

Good answer - Three approaches:

  1. Strict filtering: Only traverse nodes matching filter
    • Problem: Can get stuck in filtered subgraph, poor connectivity
    • Recall suffers when filter is selective
  2. Relaxed filtering: Traverse all nodes, return only filtered
    • Better graph connectivity, better recall
    • Slightly more node visits
  3. Hybrid with entry points: Pre-filter to get entry points, HNSW from there
    • Find valid nodes in each layer
    • Start search from closest valid entry
    • Qdrantโ€™s approach

Implementation detail: Track โ€œvalid nodes visitedโ€ separately from โ€œtotal visitedโ€ for ef parameter.

Q4: โ€œHow do you estimate filter selectivity without scanning all vectors?โ€

Bad answer: โ€œCount how many match.โ€

Good answer:

  • Categorical: Maintain exact counts in bitmap indexes (instant)
    selectivity = bitmap[category='electronics'].popcount() / total_vectors
    
  • Numerical ranges: Use histograms
    # 100 buckets for price [0, 1000]
    histogram[70-80] = 5243 vectors
    histogram[80-90] = 6891 vectors
    # Estimate price < 100: sum first 10 buckets
    
  • Multiple filters: Assume independence (overestimate) or maintain joint histograms (memory intensive)

  • Validation: Compare estimate vs actual, adjust over time

Q5: โ€œFiltered-DiskANN is mentioned in the DiskANN paper. How does it work?โ€

This tests if youโ€™ve read the papers deeply.

Good answer:

  • DiskANN already uses graph navigation on SSD
  • Filtering adds: Bitmap index on SSD + filtered graph traversal
  • Key insight: During beam search, only expand nodes matching filter
  • Challenge: SSD random reads are slow, need prefetching
  • Solution: Prefetch neighbor metadata + vectors in single I/O
  • Trade-off: Larger I/O (fetch metadata too) vs more I/Os (separate metadata reads)

Layout on disk:

[Node ID | Vector | Neighbors | Metadata Bitmap]

Single 4KB read gets everything needed for filtering decision.

Q6: โ€œYour filter matches 50% of vectors. Which strategy and why?โ€

Bad answer: โ€œPost-filtering because itโ€™s simple.โ€

Good answer - Walk through the calculation:

Assume: 1M vectors, HNSW search ~2ms, bitmap filter ~0.5ms

Post-filter:

Cost = search(1M) + filter(top-200)
     = 2ms + 0.01ms โ‰ˆ 2ms
Expected matches in top-200: ~100 (plenty for top-10)

Pre-filter + HNSW:

Filtered set: 500K vectors
Cost = filter(1M) + search(500K)
     = 0.5ms + ~1.5ms โ‰ˆ 2ms

Conclusion: Both are similar! But post-filter is simpler and has better recall (no HNSW approximation on filtered graph connectivity issues).

Choose post-filter with small overfetch (ef=50 is enough).

Q7: โ€œHow would you handle a filter that matches only 10 vectors?โ€

Bad answer: โ€œUse pre-filtering.โ€

Good answer:

  • Recognize this is extreme selectivity (0.001%)
  • HNSW is overkill for 10 vectors
  • Best approach: Pre-filter โ†’ brute force scan of 10 vectors
    valid_ids = bitmap.intersect_all(filters)  # 10 vectors
    if len(valid_ids) < 100:
        # Just scan them
        return sorted([(id, distance(query, vectors[id]))
                       for id in valid_ids])[:k]
    
  • Cost: Filter (~0.5ms) + 10 distance calculations (~0.001ms) โ‰ˆ 0.5ms
  • Much faster than HNSW search

Q8: โ€œWhat if your selectivity estimate is wrong?โ€

This tests production thinking.

Good answer:

  • Adaptive execution: Start with estimated strategy, measure actual
    strategy = choose_strategy(estimated_selectivity)
    results, actual_cost = execute(strategy)
    if actual_cost > threshold:
        # Log for calibration
        update_cost_model(estimated_selectivity, actual_cost)
    
  • Graceful degradation: If post-filter doesnโ€™t get enough results, fall back to pre-filter
    results = post_filter_search(k=10, ef=100)
    if len(results) < 10:
        # Not enough matches, try pre-filter
        results = pre_filter_search(k=10)
    
  • Cost-based optimizer: Like SQL databases, maintain statistics and update periodically

These questions test depth of understanding. Study the trade-offs, not just the algorithms.


Hints in Layers

Layer 1: Start with Post-filtering (Simplest baseline)

Build a naive implementation:

class PostFilterSearch:
    def search(self, query, k, filters, overfetch=5):
        # Search without filters
        candidates = self.hnsw.search(query, k * overfetch)

        # Filter results
        matches = [c for c in candidates if self.matches_filter(c.id, filters)]

        return matches[:k]

What youโ€™ll learn:

  • How overfetch factor affects recall
  • When you get < k results (not enough matches in top-N)
  • That this works fine for non-selective filters

Test with varying selectivity (90%, 50%, 10%, 1%) and measure:

  • Latency
  • Recall (compare to ground truth; see the helper below)
  • Required overfetch for 95% chance of k results
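
A small helper for the recall measurement, assuming ground truth comes from a brute-force scan of the filtered set:

def recall_at_k(approx_ids, exact_ids, k=10):
    # Fraction of the true top-k that the approximate search also returned
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# e.g. recall_at_k(post_filter_results, brute_force_results, k=10)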

Layer 2: Add Bitmap Pre-filtering

Implement inverted index for categorical metadata:

class BitmapIndex:
    def __init__(self, num_vectors):
        self.num_vectors = num_vectors
        self.indexes = {}  # field -> value -> bitset

    def add(self, vector_id, field, value):
        if field not in self.indexes:
            self.indexes[field] = {}
        if value not in self.indexes[field]:
            self.indexes[field][value] = Bitset(self.num_vectors)
        self.indexes[field][value].set(vector_id)

    def query(self, field, value):
        return self.indexes[field].get(value, Bitset(self.num_vectors))

    def intersect(self, filters):
        result = None
        for field, value in filters:
            bits = self.query(field, value)
            result = bits if result is None else result & bits
        return result.to_list()  # List of matching IDs

What youโ€™ll learn:

  • Bitmap intersection is incredibly fast
  • Memory usage scales with cardinality * num_vectors
  • Can estimate selectivity for free (popcount)

Layer 3: Implement Strategy Selection

Add a query planner:

class FilteredSearch:
    def search(self, query, k, filters):
        selectivity = self.estimate_selectivity(filters)

        if selectivity > 0.5:
            return self.post_filter(query, k, filters)
        elif selectivity < 0.01:
            valid_ids = self.bitmap.intersect(filters)
            return self.brute_force_scan(query, k, valid_ids)
        else:
            return self.filtered_hnsw(query, k, filters)

    def estimate_selectivity(self, filters):
        result = None
        for field, value in filters:
            bits = self.bitmap.query(field, value)
            count = bits.popcount()
            sel = count / self.num_vectors
            result = sel if result is None else result * sel
        return result

What youโ€™ll learn:

  • Selectivity estimation is critical
  • Different strategies have different sweet spots
  • Cost model needs calibration with real data

Layer 4: Filtered HNSW Traversal

This is the hard part. Modify HNSW to respect filters during search:

def search_filtered(self, query, k, ef, valid_ids):
    valid_set = set(valid_ids)

    # Find entry point in filtered set
    entry = self._find_valid_entry(valid_set)

    visited = set()
    candidates = [(distance(query, self.vectors[entry]), entry)]
    results = []

    while candidates:
        dist, current = heappop(candidates)

        if current in visited:
            continue
        visited.add(current)

        # Add to results if valid
        if current in valid_set:
            results.append((current, dist))
            if len(results) >= ef:
                break

        # Explore neighbors (even invalid ones for navigation)
        for neighbor in self.graph[current]:
            if neighbor not in visited:
                dist_n = distance(query, self.vectors[neighbor])
                heappush(candidates, (dist_n, neighbor))

    return sorted(results, key=lambda x: x[1])[:k]

Key design decisions:

  • Should you traverse invalid nodes? (Yes, for connectivity)
  • How to count toward ef? (Only valid nodes)
  • Entry point selection? (Closest valid node to original entry)

What youโ€™ll learn:

  • Graph connectivity matters for filtered search
  • Relaxed filtering beats strict filtering
  • Entry point quality affects recall

Layer 5: Add Numerical Range Indexes

Bitmaps work for categorical, but numerical ranges need different approach:

class RangeIndex:
    def __init__(self):
        self.sorted_values = []  # [(value, vector_id), ...]

    def build(self, vectors_metadata, field):
        self.sorted_values = sorted(
            [(meta[field], vid) for vid, meta in vectors_metadata.items()]
        )

    def range_query(self, min_val, max_val):
        # Binary search for range
        start = bisect_left(self.sorted_values, (min_val, 0))
        end = bisect_right(self.sorted_values, (max_val, float('inf')))
        return [vid for val, vid in self.sorted_values[start:end]]

For very selective range queries, this beats bitmap.

What youโ€™ll learn:

  • Different index structures for different data types
  • Trade-offs: bitmap (memory, fast AND) vs sorted (space-efficient, slower intersection)

Layer 6: Implement Hybrid Strategy (Pre-filter + Small HNSW)

For medium selectivity (5-20%), build HNSW on filtered subset:

def hybrid_search(self, query, k, filters):
    # Pre-filter
    valid_ids = self.bitmap.intersect(filters)

    if len(valid_ids) < 1000:
        # Too small for HNSW, brute force
        return self.brute_force_scan(query, k, valid_ids)

    # Build temporary HNSW on filtered subset
    filtered_vectors = [self.vectors[vid] for vid in valid_ids]
    temp_hnsw = HNSW(filtered_vectors, M=16, ef_construction=100)

    # Search
    results = temp_hnsw.search(query, k)
    return [(valid_ids[r.id], r.distance) for r in results]

This is expensive (building HNSW), but for repeated queries on same filter, you can cache.

Layer 7: Add Cost-Based Optimization

Calibrate your cost model with benchmarks:

class CostModel:
    def __init__(self):
        # Calibrated constants (microseconds)
        self.hnsw_base = 2000  # 2ms base HNSW search
        self.hnsw_per_node = 0.01  # 10ns per node
        self.bitmap_filter = 500  # 0.5ms to filter 1M vectors
        self.distance_calc = 0.1  # 100ns per distance

    def estimate_cost(self, strategy, selectivity, num_vectors):
        if strategy == 'post_filter':
            return self.hnsw_base + self.bitmap_filter * 0.001
        elif strategy == 'pre_filter_scan':
            filtered = int(num_vectors * selectivity)
            return self.bitmap_filter + filtered * self.distance_calc
        elif strategy == 'filtered_hnsw':
            return self.bitmap_filter + self.hnsw_base * (1/selectivity)

    def choose_strategy(self, selectivity, num_vectors):
        costs = {
            'post_filter': self.estimate_cost('post_filter', selectivity, num_vectors),
            'pre_filter_scan': self.estimate_cost('pre_filter_scan', selectivity, num_vectors),
            'filtered_hnsw': self.estimate_cost('filtered_hnsw', selectivity, num_vectors),
        }
        return min(costs, key=costs.get)

Run experiments, measure actual costs, tune constants.

Layer 8: Production Features

Add robustness:

  • Graceful degradation when estimate is wrong
  • Query timeout and fallback strategies
  • Logging and observability (which strategy chosen, actual vs estimated cost)
  • Adaptive cost model that learns from query patterns

By building layer by layer, you understand the trade-offs deeply.


Books That Will Help

Core Books:

  1. โ€œIntroduction to Information Retrievalโ€ by Manning, Raghavan, Schรผtze
    • Chapter 1: Boolean Retrieval (inverted indexes, bitmaps)
    • Chapter 6: Scoring and Ranking (cost models)
    • Why: Foundational understanding of inverted indexes, which are the basis of bitmap filtering
  2. โ€œDatabase Internalsโ€ by Alex Petrov
    • Chapter 5: Transaction Processing and Recovery (bitmap indexes)
    • Chapter 7: Log-Structured Storage (LSM trees for metadata)
    • Why: Deep dive into bitmap index implementation, how databases do filtering efficiently
  3. โ€œMining of Massive Datasetsโ€ by Leskovec, Rajaraman, Ullman
    • Chapter 3: Finding Similar Items (LSH with filtering)
    • Chapter 9: Recommendation Systems (hybrid filtering)
    • Why: Understand how filtering works at scale, selectivity estimation

Papers:

  1. โ€œFiltered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filtersโ€ (2022)
    • Core paper on integrating filters into graph-based ANN
    • Explains pre-filter vs post-filter vs integrated filtering
    • Shows how to modify graph traversal for filters
    • Read: https://harsha-simhadri.org/pubs/Filtered-DiskANN22.pdf
  2. โ€œQdrant: Vector Search Engine with Extended Filtering Supportโ€ (Documentation)
    • Production system implementing filtered search
    • Explains payload indexes, filtered HNSW strategy
    • Read: https://qdrant.tech/documentation/concepts/filtering/
  3. โ€œPANN: Personalized Approximate Nearest Neighbor Searchโ€ (2020)
    • Filtered ANN for recommendation systems
    • User-based filtering strategies
    • Shows real-world cost models

Supplementary:

  1. โ€œDesigning Data-Intensive Applicationsโ€ by Martin Kleppmann
    • Chapter 3: Storage and Retrieval (indexes)
    • Why: Understand trade-offs in index design, when to use what
  2. โ€œDatabase Systems: The Complete Bookโ€ by Garcia-Molina, Ullman, Widom
    • Chapter 14: Index Structures (B-trees, bitmaps, spatial indexes)
    • Why: Comprehensive coverage of all index types, helps you choose right structure

Blogs and Documentation:

  1. Pineconeโ€™s โ€œMetadata Filtering Guideโ€
    • https://www.pinecone.io/learn/vector-search-filtering/
    • Production perspective on filtered vector search
  2. Weaviateโ€™s โ€œFiltered Vector Searchโ€
    • How they combine inverted index + HNSW
    • Real-world performance numbers
  3. PostgreSQL Documentation: Index Types
    • Understand GIN (Generalized Inverted Index)
    • See how SQL databases integrate full-text search with filtering

Reading Order:

  1. Start: Manningโ€™s โ€œIntroduction to Information Retrievalโ€ (Chapter 1) - understand inverted indexes
  2. Then: Filtered-DiskANN paper - see how it applies to vector search
  3. Next: Qdrant documentation - see production implementation
  4. Deep dive: โ€œDatabase Internalsโ€ (Chapter 5) - understand bitmap indexes deeply
  5. Advanced: PANN paper - see personalized filtering

These books and papers will make you an expert in filtered search, not just implementer.


Learning milestones:

  1. Bitmap filters are fast โ†’ You understand inverted indexes
  2. Strategy selection matters โ†’ You understand cost-based optimization
  3. Filtered HNSW works โ†’ You understand graph traversal with constraints
  4. Performance beats naive approach โ†’ Youโ€™ve built an efficient system

Project 9: Vector Database Storage Engine

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The โ€œIndustry Disruptorโ€
  • Difficulty: Level 4: Expert
  • Knowledge Area: Storage Systems / Write-Ahead Logging
  • Software or Tool: Storage Engine
  • Main Book: โ€œDesigning Data-Intensive Applicationsโ€ by Martin Kleppmann

What youโ€™ll build: A durable storage engine for vectors with write-ahead logging (WAL), crash recovery, compaction, and efficient serializationโ€”the persistence layer of a real vector database.

Why it teaches storage: Indexes are useless if they disappear on restart. A real vector database needs ACID properties: atomic writes, crash recovery, and efficient disk formats. This is what makes the difference between a prototype and a production system.

Core challenges youโ€™ll face:

  • Write-ahead logging โ†’ maps to durability before acknowledging writes
  • Crash recovery โ†’ maps to replaying WAL on startup
  • Compaction โ†’ maps to merging and garbage collection
  • Efficient serialization โ†’ maps to binary formats for vectors

Key Concepts:

  • WAL: โ€œDesigning Data-Intensive Applicationsโ€ Chapter 3 - Kleppmann
  • LSM Trees: For append-friendly storage
  • Memory-Mapped Files: OS-level optimization

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-6, experience with file I/O and binary formats.

Real world outcome:

$ ./vecdb_storage --init --path ./data

Initializing VectorDB Storage Engine:
  Data directory: ./data
  WAL segment size: 64 MB
  Compaction threshold: 1 GB

Files created:
  ./data/wal/segment_000000.wal
  ./data/vectors/vectors_000000.bin
  ./data/index/hnsw_000000.idx
  ./data/metadata.json

$ ./vecdb_storage --insert --count 100000
Inserting 100,000 vectors...
  WAL writes: 100,000
  Batch flushes: 10
  Time: 4.2s (23,800 inserts/sec)

$ kill -9 $PID  # Simulate crash

$ ./vecdb_storage --recover
Recovery started:
  Scanning WAL segments...
  Found 1 segment with 100,000 entries
  Replaying entries: [████████████████████] 100%
  Rebuilding index: 2.3s
  Recovery complete!

Verified:
  Vectors recovered: 100,000 ✓
  Index consistency: OK ✓
  No data loss ✓

$ ./vecdb_storage --compact
Compaction:
  Before: 3 segments, 1.2 GB
  Merging...
  After: 1 segment, 0.9 GB
  Space reclaimed: 300 MB (25%)

Implementation Hints:

Storage architecture:

./data/
โ”œโ”€โ”€ wal/
โ”‚   โ”œโ”€โ”€ segment_000000.wal  # Write-ahead log
โ”‚   โ”œโ”€โ”€ segment_000001.wal
โ”‚   โ””โ”€โ”€ ...
โ”œโ”€โ”€ vectors/
โ”‚   โ””โ”€โ”€ vectors_000000.bin  # Actual vector data
โ”œโ”€โ”€ index/
โ”‚   โ””โ”€โ”€ hnsw_000000.idx     # Serialized index
โ””โ”€โ”€ metadata.json           # DB state

WAL entry format:

struct WALEntry {
    sequence_number: u64,
    operation: Operation,  // Insert, Delete, Update
    timestamp: u64,
    vector_id: u64,
    vector_data: Vec<f32>,  // Empty for Delete
    metadata: HashMap<String, Value>,
    checksum: u32,
}

enum Operation {
    Insert = 1,
    Delete = 2,
    Update = 3,
}

WAL writer:

struct WALWriter {
    current_segment: File,
    segment_size: usize,
    current_size: usize,
    sequence: AtomicU64,
}

impl WALWriter {
    fn append(&mut self, entry: WALEntry) -> Result<u64> {
        let seq = self.sequence.fetch_add(1, Ordering::SeqCst);
        let mut entry = entry;
        entry.sequence_number = seq;

        // Serialize entry
        let bytes = entry.serialize();
        entry.checksum = crc32(&bytes);

        // Write length + data + checksum
        self.current_segment.write_all(&(bytes.len() as u32).to_le_bytes())?;
        self.current_segment.write_all(&bytes)?;
        self.current_segment.write_all(&entry.checksum.to_le_bytes())?;

        // Flush to disk (durability!)
        self.current_segment.sync_all()?;

        self.current_size += bytes.len() + 8;

        // Rotate segment if needed
        if self.current_size >= self.segment_size {
            self.rotate_segment()?;
        }

        Ok(seq)
    }
}

Recovery:

fn recover(data_dir: &Path) -> Result<VectorDB> {
    let mut db = VectorDB::new();

    // Read metadata to find last checkpoint
    let metadata = read_metadata(data_dir)?;
    let last_checkpoint = metadata.last_checkpoint_sequence;

    // Replay WAL from last checkpoint
    for segment in list_wal_segments(data_dir)? {
        let reader = WALReader::open(segment)?;

        for entry in reader.entries() {
            let entry = entry?;

            // Verify checksum over the raw serialized bytes of the entry
            // (the reader keeps them alongside the decoded fields as entry.data)
            if crc32(&entry.data) != entry.checksum {
                log::error!("Corrupt WAL entry at seq {}", entry.sequence_number);
                continue;
            }

            // Skip already applied entries
            if entry.sequence_number <= last_checkpoint {
                continue;
            }

            // Apply to in-memory state
            match entry.operation {
                Operation::Insert => db.insert_recovered(entry.vector_id, entry.vector_data),
                Operation::Delete => db.delete_recovered(entry.vector_id),
                Operation::Update => db.update_recovered(entry.vector_id, entry.vector_data),
            }
        }
    }

    // Rebuild index if needed
    if db.index_needs_rebuild() {
        db.rebuild_index()?;
    }

    Ok(db)
}

Compaction:

fn compact(&mut self) -> Result<()> {
    // Create new segment file
    let new_segment = create_new_segment()?;

    // Write live vectors to new segment
    for (id, vector) in self.vectors.iter() {
        if !self.deleted.contains(id) {
            write_vector(&new_segment, id, vector)?;
        }
    }

    // Atomically swap segments
    new_segment.sync_all()?;
    rename(new_segment.path(), current_segment_path())?;

    // Delete old segments
    for old in old_segments {
        remove_file(old)?;
    }

    // Truncate WAL (everything is now in new segment)
    self.wal.truncate_before(self.sequence)?;

    Ok(())
}

The Core Question You're Answering

"How do we persist vectors with ACID guarantees and fast recovery?"

When your vector database crashes mid-write, how do you ensure:

  • Durability: Acknowledged writes survive crashes
  • Atomicity: No partial writes corrupt the index
  • Recovery: Restart quickly without full reindexing
  • Performance: Donโ€™t sacrifice write throughput for durability

This is the difference between a toy and a production system. Understanding storage engines means understanding the tradeoffs between:

  • Write amplification (how many times does data hit disk?)
  • Read amplification (how many files to read for one query?)
  • Space amplification (how much extra space for metadata/logs?)

You're building the persistence layer that makes vectors survive power failures.
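To make those three amplification factors concrete, here is a back-of-the-envelope sketch in Python. The numbers are invented for illustration, not measurements of any real engine:

# Rough amplification math for a hypothetical LSM-style vector store.
# All figures below are made-up inputs, purely for illustration.

user_bytes = 1_000_000_000            # 1 GB of vectors written by the application

wal_bytes = user_bytes                 # every write hits the WAL once
flush_bytes = user_bytes               # memtable flushed to SSTables once
compaction_rewrites = 3                # assume each byte is rewritten ~3 times over its lifetime
compaction_bytes = user_bytes * compaction_rewrites

disk_writes = wal_bytes + flush_bytes + compaction_bytes
write_amplification = disk_writes / user_bytes       # -> 5.0

sstables_per_query = 4                 # files consulted before the key is found
read_amplification = sstables_per_query              # -> 4 reads per lookup

live_bytes = 0.8 * user_bytes          # 20% of data deleted but not yet compacted away
space_on_disk = user_bytes + wal_bytes * 0.1         # stale data + retained WAL segments
space_amplification = space_on_disk / live_bytes     # -> ~1.38

print(write_amplification, read_amplification, round(space_amplification, 2))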


Concepts You Must Understand First

Before writing code, understand these storage fundamentals:

1. Write-Ahead Logging (WAL)

The durability guarantee:

  • Every modification is written to an append-only log before updating in-memory structures
  • Log entries are fsynced to disk before acknowledging the write
  • On crash, replay the log to reconstruct state
  • Key insight: Sequential writes are 100x faster than random writes

Critical details:

  • WAL entries need checksums (detect corruption)
  • Log segmentation (rotate to new files periodically)
  • Idempotency (replaying same entry twice is safe)

Read: "Designing Data-Intensive Applications" Chapter 3 (Storage and Retrieval) - Kleppmann

2. LSM Trees (Log-Structured Merge Trees)

The write-optimized architecture:

  • Writes go to memtable (in-memory sorted structure)
  • When memtable fills, flush to SSTable (sorted string table on disk)
  • Background compaction merges SSTables to reduce read amplification
  • Bloom filters avoid reading files that donโ€™t contain a key

Why LSM for vectors?

  • Vector inserts are write-heavy (ingestion pipelines)
  • Immutable SSTables enable safe concurrent reads
  • Compaction reclaims space from deleted vectors

Read: "Database Internals" Chapter 7 (Log-Structured Storage) - Alex Petrov
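The bullet about Bloom filters deserves a concrete picture: a Bloom filter answers "definitely not in this SSTable" or "possibly in this SSTable", so a negative lookup can skip the file entirely. A minimal, illustrative implementation in Python (the project itself is in Rust; the bit-array size and hash count here are arbitrary choices):

import hashlib

class BloomFilter:
    """Minimal Bloom filter: answers 'definitely absent' or 'possibly present'."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Double hashing: derive k bit positions from two base hashes
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# One filter per SSTable: skip reading files that cannot contain the vector ID
sstable_filter = BloomFilter()
for vector_id in range(10_000):
    sstable_filter.add(vector_id.to_bytes(8, "little"))

assert sstable_filter.might_contain((42).to_bytes(8, "little"))       # present, never a false negative
print(sstable_filter.might_contain((999_999).to_bytes(8, "little")))  # almost certainly False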

3. Memtable and SSTable Structure

Memtable (in-memory):

  • Red-black tree or skip list (sorted structure)
  • Fast insertions O(log n)
  • Periodically flushed to disk

SSTable (on disk):

  • Immutable sorted file
  • Binary format: [key value_offset value_size] index + data blocks
  • Optional compression (Snappy, LZ4)
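As a concrete (and deliberately toy) interpretation of that on-disk layout, the sketch below writes sorted records followed by a key-to-offset index and a small footer. It is Python, and the exact field widths (little-endian u64 keys, float32 payloads) are assumptions for illustration, not a prescribed format:

import struct

def write_sstable(path: str, records: dict[int, list[float]]) -> None:
    """Write a toy SSTable: data section of (id, vector) records, then an index of id -> offset."""
    index = []  # (vector_id, byte offset of its record)
    with open(path, "wb") as f:
        for vector_id in sorted(records):          # SSTables are sorted by key
            vec = records[vector_id]
            index.append((vector_id, f.tell()))
            f.write(struct.pack("<QI", vector_id, len(vec)))   # key + vector length
            f.write(struct.pack(f"<{len(vec)}f", *vec))        # float32 payload
        index_offset = f.tell()
        for vector_id, offset in index:                        # index block: key -> offset
            f.write(struct.pack("<QQ", vector_id, offset))
        # footer: where the index starts and how many entries it holds
        f.write(struct.pack("<QQ", index_offset, len(index)))

write_sstable("vectors_000000.sst", {7: [0.1, 0.2], 3: [0.3, 0.4]})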

4. Compaction Strategies

Size-tiered compaction:

  • Merge SSTables of similar size
  • Good for write-heavy workloads
  • Higher space amplification

Leveled compaction:

  • Organize SSTables into levels (L0, L1, L2, ...)
  • Each level is 10x larger than previous
  • Lower space amplification, more I/O

Which to use?

  • Start with size-tiered (simpler)
  • Switch to leveled if space is constrained
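A minimal sketch of how size-tiered candidate selection might look, assuming SSTables are grouped into buckets that each span roughly a 2x size range (the 1 MB base tier and the 4-file minimum are arbitrary choices for the example):

from collections import defaultdict

def pick_size_tiered_candidates(sstable_sizes: dict[str, int],
                                bucket_ratio: float = 2.0,
                                min_files: int = 4) -> list[str]:
    """Group SSTables into size buckets and return the first bucket with at
    least `min_files` files as a merge candidate."""
    buckets = defaultdict(list)
    for name, size in sstable_sizes.items():
        tier = 0
        threshold = 1 << 20          # 1 MB base tier (arbitrary for the sketch)
        while size > threshold:
            threshold = int(threshold * bucket_ratio)
            tier += 1
        buckets[tier].append(name)

    for tier in sorted(buckets):
        if len(buckets[tier]) >= min_files:
            return buckets[tier]     # merge these into one larger SSTable
    return []

sizes = {"a.sst": 900_000, "b.sst": 950_000, "c.sst": 800_000,
         "d.sst": 1_000_000, "big.sst": 60_000_000}
print(pick_size_tiered_candidates(sizes))   # ['a.sst', 'b.sst', 'c.sst', 'd.sst']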

5. Crash Recovery Protocol

Recovery steps:

  1. Read manifest file (which SSTables are valid)
  2. Replay WAL from last checkpoint
  3. Rebuild memtable from WAL
  4. Optionally rebuild index from SSTables
  5. Mark recovery complete, accept writes

Key insight: Recovery time depends on WAL size, not database size


Questions to Guide Your Design

WAL Format Design

  1. What goes in a WAL entry?
    • Sequence number, operation type, vector ID, vector data, timestamp, checksum
    • How do you serialize variable-length vectors efficiently?
    • Should you compress WAL entries?
  2. How do you handle corrupted entries?
    • CRC32 checksums for integrity
    • What if corruption is in the middle of the log?
    • Skip and continue, or stop recovery?
  3. When do you fsync()?
    • After every write? (slow but durable)
    • Batched every N writes? (fast but loss window)
    • Configurable durability vs performance tradeoff
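One common answer to the fsync question is group commit: many writers block on the same batch, and a single fsync makes the whole batch durable. A small Python sketch of the idea (a real WAL would also length-prefix and checksum each entry, as in the hints above):

import os, threading, queue

class GroupCommitWAL:
    """Toy group commit: writers enqueue serialized entries; a background thread
    writes a whole batch, issues one fsync, then wakes every waiter in the batch."""

    def __init__(self, path: str, max_batch: int = 64):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.queue: queue.Queue = queue.Queue()
        self.max_batch = max_batch
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def append(self, payload: bytes) -> None:
        done = threading.Event()
        self.queue.put((payload, done))
        done.wait()                      # returns only after the batch is fsynced

    def _flush_loop(self) -> None:
        while True:
            batch = [self.queue.get()]   # block for the first entry
            while len(batch) < self.max_batch:
                try:
                    batch.append(self.queue.get_nowait())
                except queue.Empty:
                    break
            os.write(self.fd, b"".join(p for p, _ in batch))
            os.fsync(self.fd)            # one fsync covers the whole batch
            for _, done in batch:
                done.set()

wal = GroupCommitWAL("group_commit.wal")
wal.append(b"entry-1\n")
wal.append(b"entry-2\n")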

Flush Policy

  1. When do you flush memtable to SSTable?
    • Size threshold (e.g., 64 MB memtable)
    • Time threshold (e.g., every 60 seconds)
    • Memory pressure (adaptive flushing)
  2. What if you crash during flush?
    • Flush is not atomic (partial SSTable on disk)
    • Solution: Write to temp file, rename on completion (atomic on POSIX)
    • Keep WAL entries until flush completes

Compaction Strategy

  1. How often do you compact?
    • Too frequent: wastes CPU/disk I/O
    • Too rare: read amplification hurts queries
    • Heuristic: compact when you have N SSTables at same level
  2. What about tombstones (deleted vectors)?
    • Mark deleted in WAL, filter during compaction
    • Bloom filters prevent reading SSTables with deleted keys

Crash Recovery

  1. How do you know where to start recovery?
    • Checkpoint sequence number in manifest file
    • Replay WAL entries with seq > last_checkpoint
  2. What if the index is corrupted?
    • Can you rebuild HNSW from vectors alone?
    • Trade-off: fast recovery vs full reindex

Thinking Exercise

Trace an insert through the storage engine:

User calls: db.insert(vector_id=42, vector=[0.1, 0.2, ...])

Step 1: Write to WAL
  - Serialize WAL entry (seq=100, op=Insert, id=42, data=[...])
  - Write to wal/segment_000000.wal
  - fsync() to guarantee durability
  - Return sequence number

Step 2: Update memtable
  - Insert into in-memory sorted structure
  - memtable.insert(42, vector)

Step 3: Check flush condition
  - if memtable.size() > 64 MB:
      - Freeze memtable (make immutable)
      - Start new memtable for writes
      - Background thread flushes frozen memtable to SSTable

Step 4: Flush to SSTable (background)
  - Create temp file: vectors/vectors_000001.tmp
  - Write sorted entries from frozen memtable
  - Write index block (key -> offset mapping)
  - Rename to vectors/vectors_000001.sst (atomic!)
  - Update manifest: "SSTable 000001 contains seq 0-100"
  - Truncate WAL entries <= 100 (no longer needed)

Step 5: Compaction (background, periodically)
  - Scan for SSTables at same level with overlap
  - Merge-sort entries from multiple SSTables
  - Write new compacted SSTable
  - Delete old SSTables
  - Update manifest

Now trace a crash scenario:

Crash happens during Step 4 (flush in progress)

On restart:
1. Read manifest -> last valid SSTable is 000000 (seq 0-50)
2. Scan WAL -> find entries seq 51-100
3. Replay:
   - entry seq=51: insert(id=10, ...)
   - entry seq=52: insert(id=11, ...)
   - ...
   - entry seq=100: insert(id=42, ...)
4. Memtable now contains all vectors from seq 51-100
5. Check for partial SSTable 000001.tmp -> delete (incomplete flush)
6. Rebuild index from vectors 0-100
7. Ready to accept writes!

Questions to consider:

  • What if crash happens during compaction?
  • How do you handle concurrent reads during flush?
  • What if WAL itself is corrupted?
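One way to answer the last question, sketched in Python: treat the first truncated or checksum-failing record as the torn tail of the log and cut the file there. This assumes the length + data + checksum layout used in the hints above; stopping versus skipping is a policy choice:

import struct, zlib

def read_wal_until_corruption(path: str) -> list[bytes]:
    """Read length-prefixed, CRC32-protected records; stop at the first record that is
    truncated or fails its checksum and truncate the file there (treat it as the torn tail)."""
    entries = []
    with open(path, "r+b") as f:
        while True:
            good_offset = f.tell()
            header = f.read(4)
            if len(header) < 4:
                break                                   # clean end of log
            (length,) = struct.unpack("<I", header)
            body = f.read(length)
            crc_bytes = f.read(4)
            if len(body) < length or len(crc_bytes) < 4:
                f.truncate(good_offset)                 # torn write at the tail
                break
            (crc,) = struct.unpack("<I", crc_bytes)
            if zlib.crc32(body) != crc:
                f.truncate(good_offset)                 # corrupted record: cut the log here
                break
            entries.append(body)
    return entries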

The Interview Questions They'll Ask

Fundamental Understanding

  1. โ€œExplain the difference between WAL and journaling in filesystems.โ€
    • Answer: Both are write-ahead logs, but:
      • Filesystem journaling: logs metadata changes (file renames, permission changes)
      • Database WAL: logs data modifications (actual record changes)
      • WAL is application-level, journaling is OS-level
      • WAL provides logical recovery, journaling provides crash consistency
  2. โ€œWhy use LSM trees instead of B-trees for vectors?โ€
    • LSM advantages:
      • Sequential writes (no random I/O)
      • Higher write throughput
      • Better compression (immutable files)
    • B-tree advantages:
      • Lower read amplification
      • Predictable read latency
    • For vectors: LSM wins because ingestion is write-heavy, and vectors are rarely updated in-place
  3. โ€œWhatโ€™s the difference between size-tiered and leveled compaction?โ€
    • Size-tiered: Merge SSTables of similar size โ†’ less I/O, more space
    • Leveled: Fixed size levels, compact to next level โ†’ more I/O, less space
    • Example: RocksDB uses leveled by default, Cassandra uses size-tiered

Design Tradeoffs

  1. โ€œHow do you minimize write amplification?โ€
    • Techniques:
      • Larger memtable (fewer flushes)
      • Less aggressive compaction (fewer rewrites)
      • Compression (write less data)
    • Tradeoff: Write amplification vs space amplification
  2. โ€œWhat if your WAL grows unbounded?โ€
    • Solutions:
      • Checkpoint: flush memtable, mark WAL prefix as safe to delete
      • Segment rotation: create new WAL file periodically
      • Truncation: delete old segments after checkpoint
    • Key: WAL size is bounded by checkpoint frequency
  3. โ€œHow do you handle read-your-own-writes consistency?โ€
    • Problem: Write to WAL, query before flush โ†’ vector not in SSTable yet
    • Solution: Check memtable first, then SSTables (layered query)
    • Ordering: memtable > newest SSTable > oldest SSTable

Crash Recovery

  1. โ€œWhat happens if you crash during compaction?โ€
    • Safe approach: Compaction writes to new SSTables, deletes old ones only after success
    • Crash scenario: Old SSTables still exist, new ones partially written
    • Recovery: Read manifest โ†’ ignore partial SSTables โ†’ use old SSTables
    • Key: Never delete source files until output is complete
  2. โ€œHow do you detect corrupted WAL entries?โ€
    • Checksums: CRC32 or xxHash on each entry
    • On mismatch: Skip entry and log warning, or stop recovery (configurable)
    • Best practice: Also checksum the entire segment (detect tail corruption)

Performance

  1. โ€œHow do you optimize fsync latency?โ€
    • Group commit: Batch multiple writes, single fsync
    • Async replication: Acknowledge write after replicating to N nodes (donโ€™t wait for fsync)
    • Battery-backed write cache: Hardware acceleration (risky!)
  2. โ€œWhy are sequential writes so much faster than random writes?โ€
    • HDD: No seek time (200 MB/s sequential vs 1 MB/s random)
    • SSD: Wear leveling, internal parallelism optimized for sequential
    • OS: Readahead and writeback buffers help sequential I/O

Hints in Layers

Layer 1: In-memory only (ignore crashes)

struct VectorDB {
    vectors: HashMap<u64, Vec<f32>>,  // id -> vector
}

impl VectorDB {
    fn insert(&mut self, id: u64, vector: Vec<f32>) {
        self.vectors.insert(id, vector);
    }

    fn search(&self, query: Vec<f32>, k: usize) -> Vec<u64> {
        // Brute-force search: score every vector by squared L2 distance, return the k closest ids
        let mut scored: Vec<(u64, f32)> = self.vectors.iter()
            .map(|(id, v)| {
                let d: f32 = v.iter().zip(query.iter()).map(|(a, b)| (a - b) * (a - b)).sum();
                (*id, d)
            })
            .collect();
        scored.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
        scored.into_iter().take(k).map(|(id, _)| id).collect()
    }
}

Test: Insert 100k vectors, search works, restart → all data lost ✗


Layer 2: Add write-ahead log (durability)

struct WALEntry {
    seq: u64,
    op: Operation,
    id: u64,
    vector: Vec<f32>,
}

struct VectorDB {
    vectors: HashMap<u64, Vec<f32>>,
    wal: WALWriter,
}

impl VectorDB {
    fn insert(&mut self, id: u64, vector: Vec<f32>) -> Result<()> {
        // Write to WAL first!
        let entry = WALEntry { seq: self.next_seq(), op: Insert, id, vector: vector.clone() };
        self.wal.append(&entry)?;
        self.wal.fsync()?;  // Durability!

        // Now safe to update memory
        self.vectors.insert(id, vector);
        Ok(())
    }
}

fn recover(wal_path: &Path) -> VectorDB {
    let mut db = VectorDB::new();
    for entry in WALReader::read(wal_path) {
        db.vectors.insert(entry.id, entry.vector);
    }
    db
}

Test: Insert vectors, kill -9, restart → data recovered ✓


Layer 3: Add memtable + flush (reduce WAL size)

struct VectorDB {
    memtable: BTreeMap<u64, Vec<f32>>,  // Sorted!
    sstables: Vec<SSTableReader>,
    wal: WALWriter,
}

impl VectorDB {
    fn insert(&mut self, id: u64, vector: Vec<f32>) -> Result<()> {
        // WAL first (same as Layer 2), then the in-memory sorted structure
        let entry = WALEntry { seq: self.next_seq(), op: Insert, id, vector: vector.clone() };
        self.wal.append(&entry)?;
        self.memtable.insert(id, vector);

        // size() here means tracked bytes in the memtable, not entry count
        if self.memtable.size() > 64 * 1024 * 1024 {
            self.flush_memtable()?;
        }
        Ok(())
    }

    fn flush_memtable(&mut self) -> Result<()> {
        let tmp_path = "vectors_temp.sst";
        let mut writer = SSTableWriter::new(tmp_path)?;

        for (id, vector) in self.memtable.iter() {
            writer.write(id, vector)?;
        }
        writer.finish()?;

        // Atomic rename!
        rename(tmp_path, "vectors_000000.sst")?;

        self.sstables.push(SSTableReader::open("vectors_000000.sst")?);
        self.memtable.clear();
        self.wal.checkpoint()?;  // Safe to truncate WAL now
        Ok(())
    }
}

Test: Insert 1M vectors → WAL doesn't grow unbounded ✓


Layer 4: Add compaction (reduce read amplification)

impl VectorDB {
    fn compact(&mut self) -> Result<()> {
        // Size-tiered: merge SSTables of similar size
        let to_merge = self.find_mergeable_sstables();

        let mut writer = SSTableWriter::new("compacted_temp.sst")?;
        // K-way merge. Note: Rust's BinaryHeap is a max-heap, so order heap entries
        // so that the smallest key pops first (e.g. via std::cmp::Reverse)
        let mut heap = BinaryHeap::new();

        // Initialize the heap with the first entry from each SSTable; each heap entry
        // keeps a handle to its source iterator so it can be refilled below
        for sst in &to_merge {
            if let Some(entry) = sst.iter().next() {
                heap.push(entry);
            }
        }

        // Merge-sort: pop the smallest key, write it, pull the next entry from the same source
        while let Some(entry) = heap.pop() {
            writer.write(entry.id, entry.vector)?;
            if let Some(next) = entry.source.next() {
                heap.push(next);
            }
        }
        }

        writer.finish()?;
        rename("compacted_temp.sst", "compacted_000000.sst")?;

        // Delete old SSTables
        for sst in to_merge {
            remove_file(sst.path())?;
        }

        Ok(())
    }
}

Test: Insert 1M, delete 500k, compact → disk space reclaimed ✓


Layer 5: Add checksums and error handling (production-ready)

struct WALEntry {
    seq: u64,
    op: Operation,
    id: u64,
    vector: Vec<f32>,
    checksum: u32,  // CRC32
}

impl WALReader {
    fn read_entry(&mut self) -> Result<WALEntry> {
        let len = self.read_u32()?;
        let data = self.read_bytes(len)?;
        let checksum = self.read_u32()?;

        if crc32(&data) != checksum {
            return Err(CorruptedEntry);
        }

        WALEntry::deserialize(&data)
    }
}

Test: Corrupt WAL file → recovery skips bad entries, logs warning ✓


Books That Will Help

Essential Reading

  1. โ€œDesigning Data-Intensive Applicationsโ€ by Martin Kleppmann
    • Chapter 3: Storage and Retrieval (WAL, LSM trees, B-trees)
    • Chapter 11: Stream Processing (log-based systems)
    • Why: The definitive guide to storage engine design
  2. โ€œDatabase Internalsโ€ by Alex Petrov
    • Part I: Storage Engines (file formats, indexing)
    • Chapter 7: Log-Structured Storage (LSM deep dive)
    • Chapter 9: Failure Detection (crash recovery)
    • Why: Implementation-focused, covers RocksDB and LevelDB internals

Deep Dives

  1. โ€œThe Log-Structured Merge-Tree (LSM-Tree)โ€ - Oโ€™Neil et al. (1996)
    • The original LSM paper
    • Available free online
    • Why: Understand the theory behind modern storage engines
  2. โ€œRocksDB Tuning Guideโ€ (online documentation)
    • Real-world compaction strategies
    • Write amplification benchmarks
    • Why: Learn from a production LSM implementation

Reference Implementation

  1. Study LevelDB source code (C++, ~5000 lines)
    • db/log_writer.cc - WAL implementation
    • db/memtable.cc - Skip list memtable
    • table/table_builder.cc - SSTable format
    • Why: Clean, well-commented reference implementation

Learning milestones:

  1. WAL writes are durable → You understand write-ahead logging
  2. Crash recovery works → You understand durability guarantees
  3. Compaction reclaims space → You understand garbage collection
  4. Performance is acceptable → You've built an efficient storage layer

Project 10: Distributed Vector Search Cluster

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The โ€œIndustry Disruptorโ€
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Systems / Sharding
  • Software or Tool: Distributed Vector Database
  • Main Book: โ€œDesigning Data-Intensive Applicationsโ€ by Martin Kleppmann

What you'll build: A distributed vector search cluster with sharding, replication, routing, and distributed query execution, turning a single-node index into a scalable system.

Why it teaches distributed systems: No single machine can handle billions of vectors with low latency. Distributed systems introduce new challenges: consistency, partition tolerance, coordination. Understanding these is essential for production systems.

Core challenges you'll face:

  • Sharding strategies → maps to distributing vectors across nodes
  • Query routing → maps to finding which shards to query
  • Result merging → maps to combining results from multiple shards
  • Replication → maps to fault tolerance and read scaling
  • Consistency → maps to handling concurrent updates

Key Concepts:

  • Sharding: โ€œDesigning Data-Intensive Applicationsโ€ Chapter 6 - Kleppmann
  • Replication: โ€œDesigning Data-Intensive Applicationsโ€ Chapter 5 - Kleppmann
  • Distributed Consensus: Raft algorithm

Difficulty: Master Time estimate: 4-6 weeks Prerequisites: All previous projects, distributed systems experience.

Real world outcome:

$ ./vecdb cluster init --nodes 3 --shards 9 --replicas 2

Initializing distributed cluster:
  Nodes: node-0 (coordinator), node-1, node-2
  Shards: 9 (3 primary + 6 replica shards per node)
  Replication factor: 2

Shard distribution:
╔════════╤═══════════════════╤══════════════════╗
║ Shard  │ Primary Node      │ Replica Nodes    ║
╠════════╪═══════════════════╪══════════════════╣
║   0    │ node-0            │ node-1, node-2   ║
║   1    │ node-0            │ node-1, node-2   ║
║   2    │ node-0            │ node-1, node-2   ║
║   3    │ node-1            │ node-0, node-2   ║
║  ...   │  ...              │  ...             ║
╚════════╧═══════════════════╧══════════════════╝

Cluster ready!

$ ./vecdb insert --vectors 10000000 --distributed
Inserting 10M vectors across cluster:
  Routing: hash(vector_id) mod 9
  Progress: [████████████████████] 100%
  Time: 45s (222K inserts/sec)

$ ./vecdb search --query query.npy --k 10

Distributed Query Execution:
  Step 1: Route query to all 9 shards
  Step 2: Parallel search on each shard (local k=30)
  Step 3: Merge 270 candidates from 9 shards
  Step 4: Re-rank and return top-10

Performance:
  Shard latencies: [1.2ms, 1.4ms, 1.1ms, 1.3ms, 1.5ms, 1.2ms, 1.0ms, 1.4ms, 1.3ms]
  Merge time: 0.3ms
  Total: 1.8ms (limited by slowest shard + merge)

$ ./vecdb failover --kill node-1
Simulating node-1 failure...

  Detecting failure: 2.1s (heartbeat timeout)
  Promoting replicas:
    Shard 3: node-0 promoted to primary
    Shard 4: node-2 promoted to primary
    Shard 5: node-0 promoted to primary
  Cluster healthy (degraded: missing replicas)

$ ./vecdb search --query query.npy --k 10
Search still works! (using remaining nodes)
  Latency: 2.1ms (slightly higher due to fewer nodes)

Implementation Hints:

Cluster architecture:

┌──────────────────────────────────────────────────────────────┐
│                         Coordinator                          │
│        (Query routing, shard map, cluster membership)        │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐         │
│  │   Node-0    │   │   Node-1    │   │   Node-2    │         │
│  │             │   │             │   │             │         │
│  │ Shard 0 (P) │   │ Shard 3 (P) │   │ Shard 6 (P) │         │
│  │ Shard 1 (P) │   │ Shard 4 (P) │   │ Shard 7 (P) │         │
│  │ Shard 2 (P) │   │ Shard 5 (P) │   │ Shard 8 (P) │         │
│  │ Shard 3 (R) │   │ Shard 0 (R) │   │ Shard 0 (R) │         │
│  │ Shard 4 (R) │   │ Shard 1 (R) │   │ Shard 1 (R) │         │
│  │ ...         │   │ ...         │   │ ...         │         │
│  └─────────────┘   └─────────────┘   └─────────────┘         │
│                                                              │
└──────────────────────────────────────────────────────────────┘
P = Primary, R = Replica
Sharding:

type ShardRouter struct {
    numShards int
    shardMap  map[int][]NodeAddress  // shard -> [primary, replicas]
}

func (r *ShardRouter) GetShard(vectorID uint64) int {
    return int(vectorID % uint64(r.numShards))
}

func (r *ShardRouter) RouteInsert(vectorID uint64, vector []float32) error {
    shardID := r.GetShard(vectorID)
    primary := r.shardMap[shardID][0]

    // Send to primary (it will replicate)
    return r.sendInsert(primary, vectorID, vector)
}

func (r *ShardRouter) RouteSearch(query []float32, k int) ([]SearchResult, error) {
    // Search all shards in parallel
    results := make(chan []SearchResult, r.numShards)
    errors := make(chan error, r.numShards)

    for shardID := 0; shardID < r.numShards; shardID++ {
        go func(sid int) {
            node := r.pickNode(sid)  // Primary or replica
            res, err := r.sendSearch(node, query, k*3)  // Over-fetch for merge
            if err != nil {
                errors <- err
                return
            }
            results <- res
        }(shardID)
    }

    // Collect and merge
    allResults := []SearchResult{}
    for i := 0; i < r.numShards; i++ {
        select {
        case res := <-results:
            allResults = append(allResults, res...)
        case err := <-errors:
            return nil, err
        }
    }

    // Sort by distance and take top-k
    sort.Slice(allResults, func(i, j int) bool {
        return allResults[i].Distance < allResults[j].Distance
    })

    if len(allResults) > k {
        allResults = allResults[:k]
    }

    return allResults, nil
}

Replication:

type ReplicationManager struct {
    primary  *Shard
    replicas []*RemoteReplica
}

func (rm *ReplicationManager) Insert(id uint64, vector []float32) error {
    // Write to primary first
    if err := rm.primary.Insert(id, vector); err != nil {
        return err
    }

    // Replicate asynchronously (eventual consistency)
    // Or synchronously (strong consistency)
    for _, replica := range rm.replicas {
        go replica.Replicate(id, vector)
    }

    return nil
}

Failure detection:

func (c *Cluster) monitorHealth() {
    ticker := time.NewTicker(1 * time.Second)

    for range ticker.C {
        for _, node := range c.nodes {
            if !c.ping(node) {
                node.missedHeartbeats++
                if node.missedHeartbeats > 3 {
                    c.handleNodeFailure(node)
                }
            } else {
                node.missedHeartbeats = 0
            }
        }
    }
}

func (c *Cluster) handleNodeFailure(failed *Node) {
    log.Printf("Node %s failed, promoting replicas", failed.ID)

    for _, shard := range failed.PrimaryShards {
        // Find a replica to promote
        replica := c.findReplica(shard)
        if replica != nil {
            c.promoteToPrimary(replica, shard)
        }
    }

    c.updateShardMap()
    c.notifyClients()
}

The Core Question You're Answering

"How do we scale vector search across machines while maintaining consistency?"

This isn't about making a single node faster; that's optimization. This is about fundamental distributed systems questions:

  • How do we partition billions of vectors across nodes so queries stay fast?
  • How do we ensure writes are durable and consistent across replicas?
  • How do we handle node failures without losing data or availability?
  • How do we route queries to the right shards and merge results correctly?

The single biggest misconception: "Distributed systems are just running multiple copies." Wrong. Distribution introduces coordination overhead, network partitions, partial failures, and consistency challenges that don't exist in single-node systems.

What makes this hard: You're balancing the CAP theorem tradeoffs. Strong consistency means higher latency and lower availability. Eventual consistency means dealing with stale reads and conflict resolution. There's no perfect answer, only informed tradeoffs.


Concepts You Must Understand First

Before you write distributed code, you need mental models for:

  1. Sharding vs Replication
    • Sharding = Horizontal partitioning. Split data across nodes to distribute load.
    • Replication = Copy data to multiple nodes for fault tolerance and read scaling.
    • You need BOTH: shard for capacity, replicate for availability.
  2. Consistency Models
    • Strong consistency: All replicas agree before acknowledging writes (slow, safe)
    • Eventual consistency: Replicas converge over time (fast, risky)
    • Quorum reads/writes: W + R > N guarantees consistency (W=write replicas, R=read replicas, N=total replicas)
  3. CAP Theorem
    • Consistency: All nodes see the same data
    • Availability: Every request gets a response
    • Partition tolerance: System works despite network failures
    • You can only pick 2. In distributed systems, partitions WILL happen, so the real choice is between CP and AP.
  4. Partitioning Strategies
    • Hash-based sharding: hash(key) % num_shards → Uniform distribution, but no range queries
    • Range-based sharding: Assign key ranges to shards → Supports range queries, but risk of hot shards
    • Consistent hashing: Virtual nodes minimize resharding when nodes are added/removed (see the sketch after the reading list below)
  5. Distributed Consensus
    • Raft/Paxos: Algorithms for agreeing on state across nodes
    • Leader election: One node coordinates writes
    • Log replication: Leader's log is replicated to followers

Study these chapters:

  • "Designing Data-Intensive Applications" Chapter 5 (Replication)
  • "Designing Data-Intensive Applications" Chapter 6 (Partitioning)
  • "Designing Data-Intensive Applications" Chapter 9 (Consistency and Consensus)
  • Raft paper: "In Search of an Understandable Consensus Algorithm"
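Consistent hashing with virtual nodes, mentioned in the concepts above, is easy to see in a few lines. A Python sketch (the project itself is in Go; the vnode count and hash choice are arbitrary): adding a fourth node moves only about a quarter of the keys, whereas hash(id) % N would move most of them.

import bisect, hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: each physical node owns many points on the
    ring, so adding or removing a node only remaps roughly 1/N of the keys."""

    def __init__(self, nodes: list[str], vnodes_per_node: int = 100):
        self.ring = []                                # sorted list of (hash, node)
        for node in nodes:
            for v in range(vnodes_per_node):
                self.ring.append((self._hash(f"{node}#{v}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def node_for(self, vector_id: int) -> str:
        h = self._hash(str(vector_id))
        idx = bisect.bisect_right(self.keys, h) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-0", "node-1", "node-2"])
before = {vid: ring.node_for(vid) for vid in range(10_000)}
ring = ConsistentHashRing(["node-0", "node-1", "node-2", "node-3"])   # add a node
moved = sum(1 for vid in before if ring.node_for(vid) != before[vid])
print(f"{moved / 10_000:.0%} of keys moved")   # roughly 25%, not ~75% as with hash % N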

Questions to Guide Your Design

Before writing code, answer these architectural questions:

Sharding:

  1. What's your shard key? (vector ID? category? timestamp?)
  2. How many shards? (Too few = hot spots. Too many = coordination overhead.)
  3. Hash-based or range-based partitioning?
  4. How do you handle shard rebalancing when nodes are added/removed?

Replication:

  1. What's your replication factor? (RF=3 is common: 1 primary + 2 replicas)
  2. Synchronous or asynchronous replication?
  3. How do you handle replication lag?
  4. What happens if a replica falls behind?

Consistency:

  1. Strong or eventual consistency?
  2. What are your quorum sizes? (e.g., W=2, R=2, N=3)
  3. How do you handle write conflicts?
  4. Do you need read-after-write consistency?

Query Routing:

  1. How does the client know which shard to query?
    • Option A: Client-side routing (client knows shard map)
    • Option B: Coordinator routing (coordinator routes to shards)
    • Option C: Random node routing (any node can forward)
  2. For vector search, do you query ALL shards or use a routing hint?
  3. How do you merge results from multiple shards?

Failure Handling:

  1. How do you detect node failures? (Heartbeats? Gossip protocol?)
  2. How do you handle a primary failure? (Promote a replica? Re-elect leader?)
  3. What happens during a network partition?
  4. How do you prevent split-brain scenarios?
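For the split-brain question, the usual guard is a strict-majority check: only the side of a partition that can reach a majority of the cluster is allowed to promote replicas or accept writes. A tiny illustration in Python:

def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """A partition may only elect a new primary if it sees a strict majority of the
    cluster; the minority side refuses writes, which is what prevents split-brain."""
    return reachable_nodes > cluster_size // 2

# 5-node cluster split 3 / 2 by a network partition
print(has_quorum(3, 5))   # True  -> this side may promote replicas and accept writes
print(has_quorum(2, 5))   # False -> this side stays read-only (or rejects requests)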

Thinking Exercise

Trace a query through the distributed cluster:

Client sends: search(query_vector, k=10)

Step 1: Router receives query
  → Looks up shard map (9 shards, RF=3: one primary + two replicas per shard)
  → Decision: Query ALL shards (can't route ANN search to subset)

Step 2: Router fans out to 9 shards
  → Sends search(query, k=30) to each shard (over-fetch for accurate merge)
  → Sends to PRIMARY or REPLICA? (Read from replica for load balancing)

Step 3: Each shard searches locally
  → Shard 0: finds 30 nearest neighbors
  → Shard 1: finds 30 nearest neighbors
  → ... (9 parallel searches)

Step 4: Router collects 270 results (9 shards × 30 results)
  → Merges by distance
  → Returns top-10

Total latency = max(shard_latencies) + merge_time + network_overhead
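Step 4 is just a k-way merge of already-sorted per-shard candidate lists. A small Python illustration of the merge (distances and IDs are made up; the per-shard k'=30 over-fetch from Step 2 is what makes the global top-k trustworthy):

import heapq

# Each shard returns its local candidates as (distance, vector_id), sorted ascending
shard_results = [
    [(0.12, 101), (0.35, 102), (0.90, 103)],
    [(0.08, 201), (0.40, 202), (0.95, 203)],
    [(0.20, 301), (0.25, 302), (0.99, 303)],
]

def merge_topk(per_shard: list[list[tuple[float, int]]], k: int) -> list[tuple[float, int]]:
    """Merge already-sorted per-shard lists and keep the global top-k.
    heapq.merge streams the lists, so all candidates are never materialized at once."""
    merged = heapq.merge(*per_shard)          # lazily merges by distance
    return [item for _, item in zip(range(k), merged)]

print(merge_topk(shard_results, k=4))
# [(0.08, 201), (0.12, 101), (0.2, 301), (0.25, 302)]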

Now trace a write:

Client sends: insert(vector_id=12345, vector=[0.1, 0.2, ...])

Step 1: Router determines shard
  → shard_id = hash(12345) % 9 = 3
  → Looks up shard 3: primary=node-1, replicas=[node-0, node-2]

Step 2: Router sends to primary (node-1)
  → Primary writes to its local index
  → Primary replicates to 2 replicas

Step 3: Replication strategy (choose one):
  A) Synchronous: Wait for W=2 replicas to acknowledge (slow, safe)
  B) Asynchronous: Acknowledge immediately, replicate in background (fast, risky)

Step 4: Router acknowledges to client
  → If synchronous: only after quorum
  → If async: immediately after primary write

Failure scenario:

Node-1 fails (it was primary for shards 3, 4, 5)

Step 1: Failure detection
  → Heartbeat monitor notices node-1 missed 3 consecutive pings
  → Marks node-1 as DEAD

Step 2: Replica promotion
  → For shard 3: promote node-0's replica to primary
  → For shard 4: promote node-2's replica to primary
  → For shard 5: promote node-0's replica to primary

Step 3: Update shard map
  → Router updates routing table
  → Clients see new primaries

Step 4: Cluster is now degraded
  → RF=2 instead of RF=3 (missing one replica)
  → Background process creates new replicas to restore RF=3

The Interview Questions They'll Ask

Q1: โ€œWhatโ€™s the difference between sharding and replication?โ€

Bad answer: โ€œSharding is horizontal scaling, replication is copying data.โ€

Good answer: โ€œSharding partitions data across nodes to distribute LOADโ€”itโ€™s about capacity. Replication copies data to multiple nodes for FAULT TOLERANCEโ€”itโ€™s about availability. You need both: shard to handle more data than one node can hold, replicate so a node failure doesnโ€™t lose data. For example, with 9 shards and RF=3, each shard exists on 3 nodes. Total nodes = 3, with each node holding 3 primary shards + 6 replica shards.โ€


Q2: โ€œHow would you shard a vector database?โ€

Bad answer: โ€œJust hash the vector ID and mod by number of shards.โ€

Good answer: โ€œIt depends on the query pattern. For pure ANN search where you must query all vectors, hash-based sharding by vector ID works wellโ€”it distributes uniformly and is simple. But if you have metadata filters (e.g., โ€˜category=imagesโ€™), you might shard by category to enable shard pruningโ€”query only relevant shards. The tradeoff: category-based sharding risks hot shards if one category dominates. In practice, hybrid approaches work: hash-shard within category, or use consistent hashing with virtual nodes to balance load.โ€


Q3: โ€œStrong consistency or eventual consistency for vector search?โ€

Bad answer: โ€œStrong consistency because we want correct results.โ€

Good answer: โ€œIt depends on the use case. For vector search specifically, eventual consistency is often acceptable because:

  1. Vector embeddings rarely changeโ€”theyโ€™re mostly append-only
  2. Slight staleness (a few seconds) doesnโ€™t break user experience
  3. Strong consistency adds latency (wait for quorum), which hurts search UX

However, for METADATA updates (e.g., deleting a userโ€™s data for GDPR), you need strong consistencyโ€”canโ€™t show deleted data. So hybrid: eventual consistency for inserts, strong consistency for deletes. This is what Pinecone and Weaviate do.โ€


Q4: โ€œHow do you handle merging results from multiple shards?โ€

Bad answer: โ€œJust sort all results by distance and take top-k.โ€

Good answer: โ€œThat works, but you must over-fetch from each shard. If you want global top-10, you canโ€™t just get top-10 from each shardโ€”a shard might have all 10 global bests. Solution: fetch top-kร—C from each shard (C is a safety factor, typically 2-3), then merge and re-rank. Example: for k=10, fetch top-30 from each of 9 shards (270 candidates), merge by distance, return top-10. The tradeoff: larger C = better recall but more data transferred. Production systems tune C based on data distribution.โ€


Q5: โ€œWhat happens during a network partition?โ€

Bad answer: โ€œThe system stops working.โ€

Good answer: โ€œIt depends on your CAP choice. If you chose CP (consistency + partition tolerance), you sacrifice availabilityโ€”nodes in the minority partition refuse writes to prevent split-brain. If you chose AP (availability + partition tolerance), you sacrifice consistencyโ€”both partitions accept writes, and you resolve conflicts later (e.g., last-write-wins, vector timestamps). For vector search, AP is often betterโ€”itโ€™s better to serve slightly stale results than be unavailable. But you need a reconciliation strategy when the partition heals.โ€


Q6: โ€œHow does leader election work in Raft?โ€

Bad answer: โ€œNodes vote for a leader.โ€

Good answer: โ€œRaft uses term-based elections. When a followerโ€™s heartbeat timeout expires, it increments its term, transitions to CANDIDATE, votes for itself, and requests votes from other nodes. A candidate wins if it receives votes from a majority (quorum). Once elected, the leader sends heartbeats to prevent new elections. Key details: randomized timeouts prevent split votes, log completeness rule ensures the leader has the most up-to-date log, and higher term numbers always win. This guarantees at most one leader per term.โ€


Hints in Layers

Layer 1: Single-Node Baseline (Week 1)

Start simple. Build a single-node vector database with HNSW index (from Project 5). Wrap it in a network service:

type VectorNode struct {
    index *HNSW
}

func (n *VectorNode) Insert(ctx context.Context, req *InsertRequest) error {
    return n.index.Insert(req.VectorID, req.Vector)
}

func (n *VectorNode) Search(ctx context.Context, req *SearchRequest) ([]*SearchResult, error) {
    return n.index.Search(req.Query, req.K)
}

func main() {
    node := NewVectorNode()
    grpc.Serve(":50051", node)  // Use gRPC for inter-node communication
}

Why start here: You need a working single-node system before distributing it. This baseline lets you measure distributed overhead later.


Layer 2: Add Sharding (Week 2)

Partition data across multiple nodes. Implement hash-based sharding:

type ShardRouter struct {
    nodes     []*NodeClient  // RPC clients to nodes
    numShards int
}

func (r *ShardRouter) GetShardID(vectorID uint64) int {
    return int(vectorID % uint64(r.numShards))
}

func (r *ShardRouter) GetNode(shardID int) *NodeClient {
    // Assuming equal shard distribution
    return r.nodes[shardID % len(r.nodes)]
}

func (r *ShardRouter) Insert(vectorID uint64, vector []float32) error {
    shardID := r.GetShardID(vectorID)
    node := r.GetNode(shardID)
    return node.Insert(vectorID, vector)
}

func (r *ShardRouter) Search(query []float32, k int) ([]*SearchResult, error) {
    // Fan out to ALL shards (can't route ANN search)
    resultsChan := make(chan []*SearchResult, len(r.nodes))

    for _, node := range r.nodes {
        go func(n *NodeClient) {
            res, _ := n.Search(query, k*3)  // Over-fetch
            resultsChan <- res
        }(node)
    }

    // Merge results
    var allResults []*SearchResult
    for i := 0; i < len(r.nodes); i++ {
        allResults = append(allResults, <-resultsChan...)
    }

    sort.Slice(allResults, func(i, j int) bool {
        return allResults[i].Distance < allResults[j].Distance
    })

    if len(allResults) > k {
        allResults = allResults[:k]
    }
    return allResults, nil
}

Test it: Insert 1M vectors, verify they're distributed across shards. Run searches, verify results are correct.

Key insight: Sharding reduces per-node load but INCREASES query complexity (must query all shards).


Layer 3: Add Replication (Week 3)

Each shard now has replicas. Implement synchronous replication with quorum writes:

type ReplicatedShard struct {
    primary  *NodeClient
    replicas []*NodeClient
    W        int  // Write quorum
    R        int  // Read quorum
}

func (s *ReplicatedShard) Insert(vectorID uint64, vector []float32) error {
    // Send to all replicas
    ackChan := make(chan error, len(s.replicas)+1)

    // Write to primary
    go func() {
        ackChan <- s.primary.Insert(vectorID, vector)
    }()

    // Write to replicas
    for _, replica := range s.replicas {
        go func(r *NodeClient) {
            ackChan <- r.Insert(vectorID, vector)
        }(replica)
    }

    // Wait for W acknowledgments
    acks := 0
    for i := 0; i < len(s.replicas)+1; i++ {
        if <-ackChan == nil {
            acks++
            if acks >= s.W {
                return nil  // Quorum reached
            }
        }
    }

    return errors.New("quorum not reached")
}

Test it: Kill a replica mid-write, verify quorum still succeeds. Kill W nodes, verify writes fail (correct behavior).

Key insight: Replication trades latency for durability. W=1 is fast but risky. W=N is safe but slow.


Layer 4: Add Failure Detection (Week 4)

Implement heartbeat-based failure detection:

type HealthMonitor struct {
    nodes           []*Node
    heartbeatTicker *time.Ticker
}

func (h *HealthMonitor) Start() {
    h.heartbeatTicker = time.NewTicker(1 * time.Second)

    go func() {
        for range h.heartbeatTicker.C {
            for _, node := range h.nodes {
                if err := node.Ping(); err != nil {
                    node.FailureCount++
                    if node.FailureCount >= 3 {
                        h.handleFailure(node)
                    }
                } else {
                    node.FailureCount = 0
                }
            }
        }
    }()
}

func (h *HealthMonitor) handleFailure(failed *Node) {
    log.Printf("Node %s failed", failed.ID)

    // Promote replicas to primary
    for _, shard := range failed.PrimaryShards {
        replica := h.findHealthyReplica(shard)
        h.promoteToPrimary(replica, shard)
    }

    // Update routing table
    h.updateShardMap()
}

Test it: Kill a primary node, verify replica promotion. Kill multiple nodes, verify the system degrades gracefully.

Key insight: Failure detection has false positives (slow network != dead node). Tune timeouts carefully.


Layer 5: Add Consensus (Weeks 5-6)

Implement Raft for consistent shard map updates and leader election. This is the hardest part.

Option A: Use an existing Raft library (etcd's raft, Hashicorp's raft) Option B: Implement Raft yourself (educational but time-consuming)

// Using Hashicorp's raft library
func (c *Cluster) InitRaft() error {
    config := raft.DefaultConfig()
    config.LocalID = raft.ServerID(c.nodeID)

    store := raft.NewInmemStore()
    snapshots := raft.NewInmemSnapshotStore()
    transport, _ := raft.NewTCPTransport(c.bindAddr, nil, 3, 10*time.Second, os.Stderr)

    r, err := raft.NewRaft(config, c, store, store, snapshots, transport)
    if err != nil {
        return err
    }

    c.raft = r
    return nil
}

// Apply shard map updates through Raft
func (c *Cluster) UpdateShardMap(update *ShardMapUpdate) error {
    if c.raft.State() != raft.Leader {
        return c.forwardToLeader(update)
    }

    data, _ := json.Marshal(update)
    future := c.raft.Apply(data, 10*time.Second)
    return future.Error()
}

Test it: Kill the leader, verify a new leader is elected. Make writes during leader election, verify they're queued and applied after election.

Key insight: Consensus is expensive. Only use it for critical state (shard map, membership), not every write.


Books That Will Help

  1. โ€œDesigning Data-Intensive Applicationsโ€ by Martin Kleppmann
    • Chapter 5: Replication โ†’ Covers leader-follower replication, multi-leader, leaderless (Dynamo-style)
    • Chapter 6: Partitioning โ†’ Covers sharding strategies, rebalancing, routing
    • Chapter 7: Transactions โ†’ Covers consistency guarantees
    • Chapter 8: The Trouble with Distributed Systems โ†’ Covers network faults, clocks, Byzantine faults
    • Chapter 9: Consistency and Consensus โ†’ Covers linearizability, CAP theorem, Raft/Paxos
    • This is THE book for distributed systems. Read chapters 5-9 before starting this project.
  2. โ€œDatabase Internalsโ€ by Alex Petrov
    • Part II: Distributed Systems โ†’ Covers failure detection, leader election, replication, anti-entropy
    • More implementation-focused than Kleppmann.
  3. Raft Paper: โ€œIn Search of an Understandable Consensus Algorithmโ€
    • Original Raft paper by Diego Ongaro and John Ousterhout
    • 16 pages, very readable (unlike Paxos)
    • Available at: https://raft.github.io/raft.pdf
  4. โ€œDistributed Systemsโ€ by Maarten van Steen and Andrew S. Tanenbaum
    • Chapter 6: Synchronization โ†’ Covers clocks, mutual exclusion
    • Chapter 7: Consistency and Replication โ†’ Covers consistency models
    • More academic, but comprehensive.
  5. Raft Visualization
    • Interactive visualization: https://raft.github.io/
    • Play with leader election, log replication, network partitions
  6. Real-world systems to study:
    • Cassandra: Leaderless, eventual consistency, tunable quorums
    • MongoDB: Leader-based, strong consistency, automatic sharding
    • Elasticsearch: Sharding, replication, Lucene-based (similar to vector DBs)
    • Pinecone/Weaviate: Production vector databases (study their architecture blogs)

Reading plan:

  • Week 1: DDIA Chapters 5-6 (Replication, Partitioning)
  • Week 2: DDIA Chapters 8-9 (Distributed Systems, Consensus)
  • Week 3: Raft paper + visualization
  • Weeks 4-6: Build while referencing โ€œDatabase Internalsโ€ Part II

Learning milestones:

  1. Sharding distributes load → You understand partitioning
  2. Parallel search works → You understand distributed query execution
  3. Failover promotes replicas → You understand fault tolerance
  4. Consistency is maintained → You understand distributed state

Capstone Project: Full-Featured Vector Database

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The โ€œIndustry Disruptorโ€
  • Difficulty: Level 5: Master
  • Knowledge Area: Complete Database System
  • Software or Tool: Production Vector Database
  • Main Book: All previous + โ€œDatabase Internalsโ€ by Alex Petrov

What you'll build: A complete, production-ready vector database with: HNSW and IVF-PQ indexes, durable storage, metadata filtering, distributed clustering, gRPC API, and client SDKs.

Why this is the ultimate test: This combines everything: algorithms, storage, networking, distributed systems, and API design. Building a complete vector database proves you truly understand how these systems work.

Core challenges you'll face:

  • Integrating all components → maps to system design
  • API design → maps to user experience
  • Performance tuning → maps to profiling and optimization
  • Operational features → maps to monitoring, backups, upgrades

Key Concepts:

  • Everything from previous projects
  • API Design: gRPC best practices
  • Operations: Observability, deployment

Difficulty: Master Time estimate: 3-6 months Prerequisites: All previous projects completed.

Real world outcome:

$ vecdb-server --config production.yaml

╔════════════════════════════════════════════════════════════╗
║                     VecDB Server v1.0                      ║
╠════════════════════════════════════════════════════════════╣
║ Configuration:                                             ║
║   Storage: /data/vecdb                                     ║
║   Index: HNSW (M=32, ef=200)                               ║
║   Shards: 12 (distributed across 4 nodes)                  ║
║   Replicas: 2                                              ║
║                                                            ║
║ Endpoints:                                                 ║
║   gRPC: 0.0.0.0:6333                                       ║
║   HTTP: 0.0.0.0:6334                                       ║
║   Metrics: 0.0.0.0:9090/metrics                            ║
║                                                            ║
║ Status: Ready to accept connections                        ║
╚════════════════════════════════════════════════════════════╝

# Python client usage
from vecdb import VecDBClient

client = VecDBClient("localhost:6333")

# Create collection
client.create_collection(
    name="products",
    dimension=768,
    metric="cosine",
    index_config={
        "type": "hnsw",
        "m": 32,
        "ef_construction": 200
    }
)

# Insert vectors
client.upsert(
    collection="products",
    vectors=[
        {"id": "prod_001", "vector": [...], "metadata": {"category": "electronics", "price": 99.99}},
        {"id": "prod_002", "vector": [...], "metadata": {"category": "clothing", "price": 49.99}},
        ...
    ]
)

# Search with filters
results = client.search(
    collection="products",
    vector=query_embedding,
    limit=10,
    filter={"category": "electronics", "price": {"$lt": 100}}
)

# Results
[
    {"id": "prod_001", "score": 0.95, "metadata": {...}},
    {"id": "prod_042", "score": 0.91, "metadata": {...}},
    ...
]
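For intuition, here is a minimal sketch of how a filter like the one above could be evaluated against a single record's metadata. It is illustrative only: a real engine would use a metadata/inverted index to get candidate IDs rather than checking records one by one, and only equality plus $lt/$gt are handled here:

def matches(metadata: dict, flt: dict) -> bool:
    """Evaluate a filter like {"category": "electronics", "price": {"$lt": 100}}
    against one record's metadata."""
    for field, condition in flt.items():
        value = metadata.get(field)
        if isinstance(condition, dict):
            if "$lt" in condition and not (value is not None and value < condition["$lt"]):
                return False
            if "$gt" in condition and not (value is not None and value > condition["$gt"]):
                return False
        elif value != condition:
            return False
    return True

record = {"category": "electronics", "price": 99.99}
print(matches(record, {"category": "electronics", "price": {"$lt": 100}}))   # True
print(matches(record, {"category": "clothing"}))                              # False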

Implementation Hints:

System architecture:

┌──────────────────────────────────────────────────────────────┐
│                         Client SDKs                          │
│                (Python, JavaScript, Rust, Go)                │
├──────────────────────────────────────────────────────────────┤
│                     API Layer (gRPC + HTTP)                  │
│    ┌────────────┬────────────┬────────────┬────────────┐     │
│    │   Insert   │   Search   │   Delete   │   Admin    │     │
│    └────────────┴────────────┴────────────┴────────────┘     │
├──────────────────────────────────────────────────────────────┤
│                        Query Processor                       │
│    ┌────────────┬────────────┬────────────┐                  │
│    │  Planner   │ Optimizer  │  Executor  │                  │
│    └────────────┴────────────┴────────────┘                  │
├──────────────────────────────────────────────────────────────┤
│                         Index Layer                          │
│    ┌────────────┬────────────┬────────────┐                  │
│    │    HNSW    │   IVF-PQ   │  Metadata  │                  │
│    └────────────┴────────────┴────────────┘                  │
├──────────────────────────────────────────────────────────────┤
│                        Storage Layer                         │
│    ┌────────────┬────────────┬────────────┐                  │
│    │    WAL     │  Vectors   │   Index    │                  │
│    └────────────┴────────────┴────────────┘                  │
├──────────────────────────────────────────────────────────────┤
│                      Distributed Layer                       │
│    ┌────────────┬────────────┬────────────┐                  │
│    │  Sharding  │Replication │ Consensus  │                  │
│    └────────────┴────────────┴────────────┘                  │
└──────────────────────────────────────────────────────────────┘

gRPC API definition:

service VecDB {
    rpc CreateCollection(CreateCollectionRequest) returns (CreateCollectionResponse);
    rpc Upsert(UpsertRequest) returns (UpsertResponse);
    rpc Search(SearchRequest) returns (SearchResponse);
    rpc Delete(DeleteRequest) returns (DeleteResponse);
    rpc GetCollection(GetCollectionRequest) returns (GetCollectionResponse);
}

message SearchRequest {
    string collection = 1;
    repeated float vector = 2;
    uint32 limit = 3;
    Filter filter = 4;
    SearchParams params = 5;
}

message SearchResponse {
    repeated SearchResult results = 1;
    uint64 searched_vectors = 2;
    float latency_ms = 3;
}

Collection management:

struct Collection {
    name: String,
    dimension: usize,
    metric: DistanceMetric,

    // Index
    index: Box<dyn VectorIndex>,

    // Storage
    storage: StorageEngine,

    // Metadata
    metadata_index: MetadataIndex,

    // Configuration
    config: CollectionConfig,
}

impl Collection {
    fn upsert(&mut self, vectors: Vec<VectorRecord>) -> Result<()> {
        // Write to WAL first
        self.storage.wal_append(&vectors)?;

        // Update index
        for record in &vectors {
            self.index.insert(record.id, &record.vector)?;
            self.metadata_index.insert(record.id, &record.metadata)?;
        }

        // Flush periodically
        if self.should_flush() {
            self.flush()?;
        }

        Ok(())
    }

    fn search(&self, query: &[f32], k: usize, filter: Option<Filter>) -> Result<Vec<SearchResult>> {
        let valid_ids = match filter {
            Some(f) => Some(self.metadata_index.filter(&f)?),
            None => None,
        };

        self.index.search(query, k, valid_ids)
    }
}

Metrics and monitoring:

struct Metrics {
    insert_latency: Histogram,
    search_latency: Histogram,
    vectors_total: Gauge,
    index_size_bytes: Gauge,
    queries_per_second: Counter,
}

impl Metrics {
    fn record_search(&self, latency: Duration, results: usize) {
        self.search_latency.observe(latency.as_secs_f64());
        self.queries_per_second.inc();
    }
}

Learning milestones:

  1. Basic CRUD works → Core functionality complete
  2. Filtered search works → Advanced querying done
  3. Distributed mode works → Scalability achieved
  4. Production deployed → Operational readiness
  5. Handles real workloads → Production quality

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Brute Force k-NN | Beginner | Weekend | ⭐⭐ | ⭐⭐⭐ |
| 2. LSH | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3. IVF | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 4. Product Quantization | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. HNSW | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 6. IVF-PQ | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. DiskANN-style | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 8. Filtered Search | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Storage Engine | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. Distributed Cluster | Master | 4-6 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Capstone: Full DB | Master | 3-6 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

Path A: The Complete Journey (6-12 months)

  1. Weeks 1-2: Projects 1-2 (Fundamentals + LSH)
  2. Weeks 3-5: Projects 3-4 (IVF + PQ)
  3. Weeks 6-9: Project 5 (HNSW - the key algorithm)
  4. Weeks 10-12: Project 6 (IVF-PQ composite)
  5. Weeks 13-16: Projects 7-8 (Disk-based + Filtering)
  6. Weeks 17-20: Project 9 (Storage Engine)
  7. Weeks 21-26: Project 10 (Distributed)
  8. Months 7-12: Capstone

Path B: Practical Focus (2-3 months)

If you want to understand production systems quickly:

1 → 3 → 5 → 8 → (use existing storage/distribution)

This covers: brute force baseline, IVF, HNSW, and filtering: enough to work with any production vector database.

Path C: Algorithm Deep Dive (2-3 months)

If you want to understand the algorithms deeply:

1 → 2 → 3 → 4 → 5 → 6

This covers all the major ANN algorithms without the systems complexity.

Recommendation Based on Your Goals

If you're building RAG applications: Do Projects 1, 3, 5, 8. You'll understand how to choose and tune indexes.

If you want to contribute to vector DBs: Do the full path. Open-source projects like Qdrant, Milvus need contributors who understand internals.

If you're researching ANN algorithms: Focus on Projects 2-6, read the original papers, benchmark against ann-benchmarks.com.


Essential Resources

Papers

Tutorials and Guides

Benchmarks

Books

  • "Designing Data-Intensive Applications" by Martin Kleppmann
  • "Mining of Massive Datasets" by Leskovec, Rajaraman, Ullman
  • "Introduction to Information Retrieval" by Manning, Raghavan, Schütze

Libraries to Study

  • FAISS - Facebook's library (C++ with Python bindings)
  • hnswlib - Original HNSW implementation
  • Qdrant - Rust vector database
  • Milvus - Distributed vector database

Summary

| # | Project Name | Main Language |
|---|---|---|
| 1 | Brute Force k-NN and Distance Metrics | Python |
| 2 | Locality Sensitive Hashing (LSH) | Python |
| 3 | Inverted File Index (IVF) | Python |
| 4 | Product Quantization (PQ) | Python |
| 5 | HNSW Index | Python |
| 6 | IVF-PQ Composite Index | Python |
| 7 | Disk-Based Vector Index (DiskANN-style) | C++ |
| 8 | Metadata Filtering with Vector Search | Python |
| 9 | Vector Database Storage Engine | Rust |
| 10 | Distributed Vector Search Cluster | Go |
| Capstone | Full-Featured Vector Database | Rust |

Vector databases are where algorithms meet systems engineering. The best way to understand them is to build one. Start with brute force, feel the pain, then appreciate the elegance of HNSW and the efficiency of IVF-PQ. By the end, you'll know exactly what happens when you call collection.search().