Learn Vector Databases: From Zero to Building Your Own
Goal: Deeply understand how vector databases work internally—from the mathematics of similarity search through indexing algorithms like HNSW and IVF, to building a production-ready vector search engine. Master the technology powering modern AI applications.
Why Vector Databases Matter
Every time you use semantic search, image similarity, recommendation systems, or RAG (Retrieval-Augmented Generation) with LLMs, a vector database is doing the heavy lifting. Traditional databases search by exact match; vector databases search by meaning.
The challenge: How do you efficiently find the 10 most similar items among a billion vectors, each with 1536 dimensions? Brute force would take hours. Vector databases do it in milliseconds.
After completing these projects, you will:
- Understand exactly how similarity search works at scale
- Implement HNSW, IVF, LSH, and other ANN algorithms from scratch
- Know when to use each algorithm and what tradeoffs to make
- Build a complete vector database that handles real workloads
- Appreciate the engineering behind Pinecone, Milvus, Qdrant, and others
Core Concept Analysis
What is a Vector?
In the context of vector databases, a vector (or embedding) is a list of numbers that represents the meaning or features of something:
"The quick brown fox" → [0.12, -0.34, 0.56, ..., 0.78] (1536 dimensions)
[Image of a cat] → [0.23, 0.45, -0.67, ..., 0.89] (512 dimensions)
[Product attributes] → [0.91, -0.23, 0.45, ..., 0.12] (256 dimensions)
These vectors come from ML models (OpenAI embeddings, CLIP, sentence-transformers) that position similar items close together in high-dimensional space.
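For instance, a minimal sketch using the sentence-transformers library (the model name is just an illustrative choice; this one produces 384-dimensional vectors):
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
vectors = model.encode(["The quick brown fox", "A fast auburn canine"])
print(vectors.shape)  # (2, 384); semantically similar sentences land close together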
Distance Metrics
How do we measure “closeness” between vectors? Four metrics dominate (a NumPy sketch follows the list):
- Euclidean Distance (L2): straight-line distance in space
  - d(a, b) = √(Σᵢ(aᵢ - bᵢ)²)
  - Good for: when absolute position matters
  - Range: [0, ∞)
- Cosine Similarity: angle between vectors (magnitude-invariant)
  - cos(a, b) = (a · b) / (||a|| × ||b||)
  - Good for: when direction matters and magnitude doesn’t
  - Range: [-1, 1] (1 = identical direction, 0 = orthogonal, -1 = opposite)
- Dot Product (Inner Product): combines magnitude and direction
  - a · b = Σᵢ aᵢ × bᵢ
  - Good for: vectors that are already normalized (it then equals cosine similarity), or when magnitude carries signal
  - Range: (-∞, ∞)
- Manhattan Distance (L1): sum of absolute differences
  - d(a, b) = Σᵢ |aᵢ - bᵢ|
  - Range: [0, ∞)
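A minimal NumPy sketch of these four metrics, assuming a and b are 1-D arrays of equal length:
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    return np.dot(a, b)          # higher = more similar (a score, not a distance)

def manhattan(a, b):
    return np.sum(np.abs(a - b))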
The Curse of Dimensionality
Why is vector search hard?
In high dimensions (100+), counterintuitive things happen:
- All points are “far”: distances between random point pairs concentrate around the same value, so the nearest neighbor is barely closer than the farthest
- Volume hides in corners: a hypercube with side 1 always has volume 1, but as dimension grows almost all of that volume sits near the corners, far from the center
- Brute force is slow: Comparing a query to 1 billion vectors = 1 billion distance calculations
This is why we need Approximate Nearest Neighbor (ANN) algorithms—trading perfect accuracy for massive speedups.
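A quick experiment makes the distance concentration visible; a sketch (exact numbers vary by run):
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 32, 512):
    points = rng.random((2000, dim))
    d = np.linalg.norm(points[:1000] - points[1000:], axis=1)
    # The relative spread of pairwise distances shrinks as dimension grows
    print(f"dim={dim:4d}  std/mean of distances = {d.std() / d.mean():.3f}")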
The ANN Tradeoff
Accuracy (Recall) ←——————————————————→ Speed
(100% recall)                    (sub-millisecond)
- Brute Force (exact): perfect recall, too slow at scale
- HNSW (graph): great balance of recall and speed
- IVF (clustering): faster, less accurate
- LSH (hashing): fastest, least accurate
Recall@K: Of the true K nearest neighbors, what fraction did we find?
- Recall@10 = 0.95 means we found 95% of the true top-10 neighbors
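Measuring it is a one-liner; a sketch, given id lists from an ANN index and from exact search:
def recall_at_k(approx_ids, true_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(true_ids[:k])) / k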
Vector Database Architecture
┌─────────────────────────────────────────────────────────┐
│ Client Layer │
│ (REST API, gRPC, SDKs, Query Language) │
├─────────────────────────────────────────────────────────┤
│ Query Layer │
│ (Query Planning, Filtering, Caching, Result Ranking) │
├─────────────────────────────────────────────────────────┤
│ Index Layer │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ HNSW │ │ IVF │ │ LSH │ │ Flat │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
├─────────────────────────────────────────────────────────┤
│ Storage Layer │
│ (Vector Storage, Metadata, WAL, Compression/PQ) │
├─────────────────────────────────────────────────────────┤
│ Distributed Layer │
│ (Sharding, Replication, Consistency, Routing) │
└─────────────────────────────────────────────────────────┘
Project List
Projects progress from fundamental algorithms to a complete vector database implementation.
Project 1: Brute Force k-NN and Distance Metrics
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: Rust, C++, Go
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 1: Beginner
- Knowledge Area: Vector Mathematics / Similarity Search
- Software or Tool: Distance Calculator
- Main Book: “Mining of Massive Datasets” by Leskovec, Rajaraman, Ullman (Chapter 3)
What you’ll build: A brute-force k-nearest neighbors search engine supporting multiple distance metrics (Euclidean, cosine, dot product), with optimizations using NumPy vectorization and SIMD.
Why it teaches fundamentals: Before learning approximate algorithms, you must understand exact search. This establishes the baseline that all other algorithms try to beat. You’ll also learn why brute force becomes impossible at scale.
Core challenges you’ll face:
- Implementing distance metrics correctly → maps to vector mathematics
- Vectorizing with NumPy → maps to batch operations for speed
- Memory management for large datasets → maps to chunked processing
- Benchmarking accurately → maps to measuring recall and latency
Key Concepts:
- Vector Operations: Linear Algebra fundamentals
- Distance Metrics: “Mining of Massive Datasets” Chapter 3 - Leskovec et al.
- NumPy Broadcasting: NumPy documentation
Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Basic Python and an understanding of vectors/matrices; NumPy familiarity helps.
Real world outcome:
$ ./vector_search --dataset random_10k_128d.npy --query query.npy --k 10
Loading dataset: 10,000 vectors × 128 dimensions
Query vector: 128 dimensions
Brute Force k-NN Results (k=10):
╔═══════╤═══════════╤════════════════╗
║ Rank │ Vector ID │ Distance (L2) ║
╠═══════╪═══════════╪════════════════╣
║ 1 │ 4,521 │ 0.823 ║
║ 2 │ 7,892 │ 0.891 ║
║ 3 │ 2,103 │ 0.912 ║
║ ... │ ... │ ... ║
║ 10 │ 9,234 │ 1.102 ║
╚═══════╧═══════════╧════════════════╝
Performance:
Distance calculations: 10,000
Time: 2.3ms (vectorized), 156ms (naive loop)
Throughput: ~435 queries/second
$ ./vector_search --dataset random_1M_768d.npy --query query.npy --k 10
Loading dataset: 1,000,000 vectors × 768 dimensions
Performance:
Time: 1,847ms (too slow for production!)
This is why we need approximate algorithms...
Implementation Hints:
Distance metric implementations:
Euclidean Distance (L2):
d(q, v) = sqrt(sum((q[i] - v[i])^2 for i in range(dim)))
Cosine Similarity:
sim(q, v) = dot(q, v) / (norm(q) * norm(v))
distance = 1 - sim (to convert to distance)
Dot Product:
score(q, v) = sum(q[i] * v[i] for i in range(dim))
(higher is better, not a distance)
Naive implementation (slow):
import numpy as np

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def knn_naive(query, vectors, k):
    distances = []
    for i, v in enumerate(vectors):
        d = euclidean_distance(query, v)
        distances.append((i, d))
    distances.sort(key=lambda x: x[1])  # ascending by distance
    return distances[:k]
Vectorized implementation (fast):
def knn_vectorized(query, vectors, k):
    # Compute all distances at once:
    # ||query - v||^2 = ||query||^2 - 2*query·v + ||v||^2
    query_sq = np.sum(query ** 2)
    vectors_sq = np.sum(vectors ** 2, axis=1)
    cross_term = 2 * np.dot(vectors, query)
    distances = query_sq - cross_term + vectors_sq
    # Skip sqrt for ranking (a monotonic transformation)
    # Partial sort for top-k
    indices = np.argpartition(distances, k)[:k]
    top_k = indices[np.argsort(distances[indices])]
    return top_k, distances[top_k]
Key optimization insight: For ranking, you don’t need the actual distance—just the ordering. So skip expensive operations like sqrt() that don’t change the ranking.
Chunked processing for large datasets:
def knn_chunked(query, vectors, k, chunk_size=100000):
    candidates = []
    for i in range(0, len(vectors), chunk_size):
        chunk = vectors[i:i+chunk_size]
        indices, dists = knn_vectorized(query, chunk, k)
        # Shift chunk-local indices back to global ids
        candidates.extend((idx + i, dist) for idx, dist in zip(indices, dists))
    candidates.sort(key=lambda x: x[1])
    return candidates[:k]
Learning milestones:
- Distance metrics compute correctly → You understand vector math
- Vectorized version is 50x+ faster → You understand batch operations
- You hit the wall at 1M vectors → You understand why ANN is needed
- Recall@10 is always 100% → You understand exact search baseline
Project 2: Locality Sensitive Hashing (LSH)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: Rust, C++, Java
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Hashing / Probabilistic Data Structures
- Software or Tool: LSH Index
- Main Book: “Mining of Massive Datasets” by Leskovec, Rajaraman, Ullman (Chapter 3)
What you’ll build: A Locality Sensitive Hashing index for approximate nearest neighbor search, using random hyperplanes for cosine similarity or random projections for Euclidean distance.
Why it teaches LSH: LSH is the simplest ANN algorithm to understand. It’s based on a beautiful insight: if we hash vectors such that similar vectors likely get the same hash, we can use hash tables for fast lookup. Understanding LSH makes other algorithms clearer.
Core challenges you’ll face:
- Designing hash functions → maps to random hyperplanes and projections
- Amplification (AND/OR) → maps to balancing precision and recall
- Multi-probe LSH → maps to checking nearby buckets
- Parameter tuning → maps to number of tables and hash functions
Resources for key challenges:
- Pinecone LSH Guide - Illustrated explanation
- MIT LSH Page - Original research and implementations
Key Concepts:
- LSH Fundamentals: “Mining of Massive Datasets” Chapter 3 - Leskovec et al.
- Random Projections: Tyler Neylon’s LSH Introduction
- Amplification: LSH literature on AND/OR constructions
Difficulty: Intermediate. Time estimate: 1-2 weeks. Prerequisites: Project 1 completed; understanding of hash tables and probability.
Real world outcome:
$ ./lsh_search --build --dataset sift_1M.npy --tables 20 --hashes 8
Building LSH Index:
Dataset: 1,000,000 vectors × 128 dimensions
Hash tables: 20
Hashes per table: 8
Total hash functions: 160
Build time: 12.3s
Index size: 245 MB (vs 512 MB raw data)
$ ./lsh_search --query query.npy --k 10
LSH Query Results:
╔═══════╤═══════════╤══════════════╤═════════════════╗
║ Rank │ Vector ID │ Distance │ Found in Table ║
╠═══════╪═══════════╪══════════════╪═════════════════╣
║ 1 │ 4,521 │ 0.823 │ 3, 7, 12 ║
║ 2 │ 7,892 │ 0.891 │ 7, 15 ║
║ ... │ ... │ ... │ ... ║
╚═══════╧═══════════╧══════════════╧═════════════════╝
Performance:
Candidates examined: 2,042 (vs 1,000,000 brute force)
Time: 4.2ms (vs 1,847ms brute force)
Speedup: 440x
Recall@10: 0.87 (87% of true neighbors found)
$ ./lsh_search --sweep-params
Sweeping parameters for optimal recall/speed...
Tables │ Hashes │ Recall@10 │ Query Time │ Index Size
───────┼────────┼───────────┼────────────┼───────────
5 │ 4 │ 0.62 │ 1.1ms │ 61 MB
10 │ 6 │ 0.78 │ 2.3ms │ 122 MB
20 │ 8 │ 0.87 │ 4.2ms │ 245 MB
40 │ 10 │ 0.94 │ 8.7ms │ 490 MB
Implementation Hints:
The key insight of LSH:
Hash function property:
P(h(a) = h(b)) ∝ similarity(a, b)
If vectors are similar → high probability of same hash
If vectors are different → low probability of same hash
For cosine similarity, use random hyperplanes:
import numpy as np

def random_hyperplane_hash(vector, hyperplane):
    # 1 if the vector lies on the positive side of the hyperplane, else 0
    return 1 if np.dot(vector, hyperplane) >= 0 else 0

def compute_hash(vector, hyperplanes):
    # Combine multiple hyperplane bits into one integer bucket key
    hash_value = 0
    for i, hp in enumerate(hyperplanes):
        bit = random_hyperplane_hash(vector, hp)
        hash_value |= (bit << i)
    return hash_value
LSH Index structure:
from collections import defaultdict

class LSHIndex:
    def __init__(self, num_tables, hashes_per_table, dim):
        self.vectors = {}  # vector_id -> vector (needed for re-ranking)
        self.tables = []
        for _ in range(num_tables):
            hyperplanes = np.random.randn(hashes_per_table, dim)
            hyperplanes /= np.linalg.norm(hyperplanes, axis=1, keepdims=True)
            table = {
                'hyperplanes': hyperplanes,
                'buckets': defaultdict(list)  # hash -> [vector_ids]
            }
            self.tables.append(table)

    def insert(self, vector_id, vector):
        self.vectors[vector_id] = vector
        for table in self.tables:
            hash_val = compute_hash(vector, table['hyperplanes'])
            table['buckets'][hash_val].append(vector_id)

    def query(self, vector, k):
        candidates = set()
        for table in self.tables:
            hash_val = compute_hash(vector, table['hyperplanes'])
            candidates.update(table['buckets'][hash_val])
        # Re-rank candidates by actual distance (any metric from Project 1)
        results = []
        for cand_id in candidates:
            dist = distance(vector, self.vectors[cand_id])
            results.append((cand_id, dist))
        results.sort(key=lambda x: x[1])
        return results[:k]
Amplification (AND/OR construction):
- More hashes per table (AND): Fewer candidates, higher precision
- More tables (OR): More candidates, higher recall
- Balance: tune tables and hashes against your recall requirements (see the probability sketch below)
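The tradeoff has a closed form; a sketch, where p is the probability that a single hash agrees for a given pair (for random hyperplanes, p = 1 - θ/π with θ the angle between the vectors):
def candidate_probability(p, hashes_per_table, num_tables):
    """P(a pair becomes a candidate): AND over hashes in a table, OR across tables."""
    return 1 - (1 - p ** hashes_per_table) ** num_tables

# e.g. with k=8 hashes per table and L=20 tables:
print(candidate_probability(0.9, 8, 20))  # similar pair -> ~1.00, almost always a candidate
print(candidate_probability(0.5, 8, 20))  # random pair  -> ~0.08, rarely a candidate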
Multi-probe LSH: Instead of only checking the exact bucket, also check nearby buckets (flip one bit at a time). Increases recall without adding tables.
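A minimal sketch of the probing order (bucket keys are the integer hashes from compute_hash above; the query loop unions candidates across these buckets):
def probe_sequence(hash_value, num_bits):
    """Yield the exact bucket first, then every bucket one bit-flip away."""
    yield hash_value
    for i in range(num_bits):
        yield hash_value ^ (1 << i)

# Inside LSHIndex.query, the single-bucket lookup becomes:
#   for h in probe_sequence(hash_val, hashes_per_table):
#       candidates.update(table['buckets'].get(h, []))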
Learning milestones:
- Hash function maps similar vectors together → You understand LSH basics
- Multiple tables improve recall → You understand amplification
- Parameter tuning affects recall/speed → You understand the tradeoff
- Multi-probe improves efficiency → You understand advanced LSH
Project 3: Inverted File Index (IVF)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Clustering / Inverted Indexes
- Software or Tool: IVF Index
- Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze
What you’ll build: An Inverted File (IVF) index that clusters vectors with k-means, then searches only nearby clusters at query time—the foundation of production vector search.
Why it teaches IVF: IVF is the workhorse of production vector databases. It’s conceptually simple (just clustering!) but has nuances around cluster count, nprobe tuning, and training data. Understanding IVF unlocks FAISS and most production systems.
Core challenges you’ll face:
- K-means clustering at scale → maps to efficient training
- Choosing number of clusters → maps to sqrt(N) heuristic
- nprobe tuning → maps to recall vs speed tradeoff
- Non-uniform cluster sizes → maps to data distribution challenges
Resources for key challenges:
- Pinecone Vector Indexes Guide - IVF explanation
- FAISS IVF Documentation
- Zilliz Vector Index Tutorial
Key Concepts:
- Inverted Index: “Introduction to Information Retrieval” Chapter 1 - Manning et al.
- K-Means Clustering: “Pattern Recognition and Machine Learning” - Bishop
- IVF in FAISS: FAISS documentation
Difficulty: Advanced. Time estimate: 2 weeks. Prerequisites: Project 1 completed; understanding of k-means clustering.
Real world outcome:
$ ./ivf_search --build --dataset sift_1M.npy --clusters 1000
Building IVF Index:
Dataset: 1,000,000 vectors × 128 dimensions
Training k-means with 100,000 sample vectors...
K-means iterations: 25, converged: True
Clusters: 1,000 (avg 1,000 vectors per cluster)
Cluster size distribution:
Min: 423, Max: 2,156, Median: 987, StdDev: 312
Build time: 45.2s
Index size: 134 MB (centroids) + 512 MB (data)
$ ./ivf_search --query query.npy --k 10 --nprobe 10
IVF Query Results (nprobe=10):
Clusters searched: 10 (1% of index)
Vectors compared: 9,870
╔═══════╤═══════════╤══════════════╤═════════════════╗
║ Rank │ Vector ID │ Distance │ Cluster ID ║
╠═══════╪═══════════╪══════════════╪═════════════════╣
║ 1 │ 4,521 │ 0.823 │ 127 ║
║ 2 │ 7,892 │ 0.891 │ 127 ║
║ ... │ ... │ ... │ ... ║
╚═══════╧═══════════╧══════════════╧═════════════════╝
Performance:
Time: 8.3ms (vs 1,847ms brute force)
Speedup: 222x
Recall@10: 0.92
$ ./ivf_search --nprobe-sweep
nprobe │ Recall@10 │ Query Time │ Vectors Compared
───────┼───────────┼────────────┼─────────────────
1 │ 0.52 │ 1.2ms │ 987
5 │ 0.81 │ 4.8ms │ 4,935
10 │ 0.92 │ 8.3ms │ 9,870
20 │ 0.97 │ 15.2ms │ 19,740
50 │ 0.99 │ 38.1ms │ 49,350
Implementation Hints:
IVF structure:
import numpy as np
from sklearn.cluster import KMeans

class IVFIndex:
    def __init__(self, num_clusters, dim):
        self.num_clusters = num_clusters
        self.dim = dim
        self.centroids = None  # (num_clusters, dim)
        self.inverted_lists = [[] for _ in range(num_clusters)]
        # inverted_lists[i] = [(vector_id, vector), ...]

    def train(self, training_vectors):
        # Run k-means to find cluster centroids
        kmeans = KMeans(n_clusters=self.num_clusters, n_init=1)
        kmeans.fit(training_vectors)
        self.centroids = kmeans.cluster_centers_

    def add(self, vector_id, vector):
        # Assign to the nearest centroid
        distances = np.linalg.norm(self.centroids - vector, axis=1)
        cluster_id = np.argmin(distances)
        self.inverted_lists[cluster_id].append((vector_id, vector))

    def search(self, query, k, nprobe):
        # Find the nprobe nearest centroids
        centroid_distances = np.linalg.norm(self.centroids - query, axis=1)
        nearest_clusters = np.argsort(centroid_distances)[:nprobe]
        # Scan only those clusters' inverted lists
        candidates = []
        for cluster_id in nearest_clusters:
            for vector_id, vector in self.inverted_lists[cluster_id]:
                dist = np.linalg.norm(query - vector)
                candidates.append((vector_id, dist))
        # Sort and return top-k
        candidates.sort(key=lambda x: x[1])
        return candidates[:k]
Choosing number of clusters:
Rule of thumb: num_clusters ≈ sqrt(num_vectors)
1M vectors → ~1,000 clusters
100M vectors → ~10,000 clusters
Too few clusters: Each cluster too large, slow search
Too many clusters: Centroid comparison overhead, missed neighbors
Training considerations:
- Train on a representative sample (10-20% of data is often enough)
- Ensure training data has same distribution as full dataset
- More k-means iterations = better centroids, but slower training
Non-exhaustive search pitfall:
If nearest neighbor is in cluster 5, but query is closest to
centroid of cluster 3, we might miss it with low nprobe.
Solution: Higher nprobe, or use HNSW to find nearby centroids faster
Learning milestones:
- K-means clusters vectors → You understand the preprocessing
- nprobe controls recall → You understand the search phase
- Cluster imbalance affects performance → You understand data distribution
- You compare with FAISS → You validate your implementation
Project 4: Product Quantization (PQ)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Vector Compression / Quantization
- Software or Tool: PQ Encoder/Decoder
- Main Book: “Product Quantization for Nearest Neighbor Search” - Jégou et al.
What you’ll build: A Product Quantization system that compresses 128-dimensional float vectors into just 8-16 bytes, enabling 97% memory reduction while maintaining search quality.
Why it teaches quantization: Memory is often the bottleneck for large vector databases. PQ is the key technique that makes billion-scale search possible on a single machine. Understanding PQ unlocks IVF-PQ, the most common production configuration.
Core challenges you’ll face:
- Subspace decomposition → maps to splitting vectors into segments
- Codebook training → maps to k-means per subspace
- Asymmetric distance computation → maps to query vs database accuracy
- Distance lookup tables → maps to precomputed distances for speed
Resources for key challenges:
- Pinecone PQ Guide - 97% compression explained
- Zilliz PQ Tutorial - Hands-on Python
Key Concepts:
- Product Quantization: Original paper by Jégou, Douze, Schmid
- Vector Quantization: Lloyd’s algorithm / k-means
- Asymmetric Distance: ADC (Asymmetric Distance Computation)
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Projects 1 and 3 completed; understanding of k-means.
Real world outcome:
$ ./pq_compress --dataset sift_1M.npy --segments 8 --bits 8
Training Product Quantization:
Original: 1,000,000 vectors × 128 dims × 4 bytes = 512 MB
Segments: 8 (16 dims each)
Centroids per segment: 256 (8 bits)
Training codebooks (k-means per segment)...
Segment 0: 256 centroids, trained on 1M subvectors
Segment 1: 256 centroids, trained on 1M subvectors
...
Segment 7: 256 centroids, trained on 1M subvectors
Compressing vectors...
Compressed: 1,000,000 vectors × 8 bytes = 8 MB
Compression ratio: 64x (98.4% reduction!)
Codebook size: 8 segments × 256 centroids × 16 dims × 4 bytes = 128 KB
$ ./pq_search --query query.npy --k 10
PQ Search with ADC (Asymmetric Distance Computation):
Building distance lookup table: 0.02ms
Scanning 1M compressed codes: 12.3ms
Results:
╔═══════╤═══════════╤════════════════╤═════════════════╗
║ Rank │ Vector ID │ Approx Dist │ True Dist ║
╠═══════╪═══════════╪════════════════╪═════════════════╣
║ 1 │ 4,521 │ 0.841 │ 0.823 ║
║ 2 │ 7,102 │ 0.923 │ 0.905 ║
║ ... │ ... │ ... │ ... ║
╚═══════╧═══════════╧════════════════╧═════════════════╝
Quality:
Recall@10: 0.89 (vs exact search)
Mean distance error: 3.2%
Performance:
Time: 12.3ms (vs 1,847ms brute force on raw vectors)
Memory: 8 MB (vs 512 MB raw)
Speedup: 150x
Implementation Hints:
Product Quantization structure:
Original vector (128 dims):
[v0, v1, v2, ..., v127]
Split into M=8 subvectors (16 dims each):
[v0-v15], [v16-v31], ..., [v112-v127]
Each subvector → quantized to one of 256 centroids
Compressed code: [c0, c1, c2, c3, c4, c5, c6, c7] (8 bytes total)
Implementation:
import numpy as np
from sklearn.cluster import KMeans

class ProductQuantizer:
    def __init__(self, dim, num_segments, num_centroids=256):
        self.dim = dim
        self.num_segments = num_segments
        self.segment_dim = dim // num_segments
        self.num_centroids = num_centroids
        self.codebooks = []  # One codebook per segment

    def train(self, vectors):
        for seg in range(self.num_segments):
            start = seg * self.segment_dim
            end = start + self.segment_dim
            subvectors = vectors[:, start:end]
            kmeans = KMeans(n_clusters=self.num_centroids)
            kmeans.fit(subvectors)
            self.codebooks.append(kmeans.cluster_centers_)

    def encode(self, vector):
        codes = []
        for seg in range(self.num_segments):
            start = seg * self.segment_dim
            end = start + self.segment_dim
            subvector = vector[start:end]
            # Nearest centroid in this segment's codebook
            distances = np.linalg.norm(self.codebooks[seg] - subvector, axis=1)
            codes.append(np.argmin(distances))
        return np.array(codes, dtype=np.uint8)

    def decode(self, codes):
        vector = []
        for seg, code in enumerate(codes):
            vector.extend(self.codebooks[seg][code])
        return np.array(vector)
Asymmetric Distance Computation (ADC):
def compute_distance_tables(self, query):
    """Precompute distances from query subvectors to all centroids."""
    tables = []
    for seg in range(self.num_segments):
        start = seg * self.segment_dim
        end = start + self.segment_dim
        query_sub = query[start:end]
        # Distance from the query subvector to each centroid
        distances = np.linalg.norm(self.codebooks[seg] - query_sub, axis=1)
        tables.append(distances)
    return np.array(tables)  # Shape: (num_segments, num_centroids)

def search_pq(self, query, codes, k):
    """Search using precomputed distance tables; codes has shape (N, num_segments)."""
    tables = self.compute_distance_tables(query)
    # Approximate squared distance = sum of squared per-segment distances
    distances = np.zeros(len(codes))
    for seg in range(self.num_segments):
        distances += tables[seg, codes[:, seg]] ** 2
    # Top-k
    indices = np.argpartition(distances, k)[:k]
    return indices[np.argsort(distances[indices])]
Why ADC is fast:
- Query: Compute 8 × 256 = 2,048 subvector distances (one time)
- Per vector: 8 table lookups + 8 additions
- 1M vectors: 8M lookups vs 128M multiplications
Learning milestones:
- Codebooks train successfully → You understand quantization
- Encoding/decoding works → You understand the representation
- ADC is much faster than brute force → You understand the speedup
- Recall drops gracefully → You understand the accuracy tradeoff
Project 5: HNSW (Hierarchical Navigable Small World)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Graph-Based Search / Skip Lists
- Software or Tool: HNSW Index
- Main Book: “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs” - Malkov & Yashunin
What you’ll build: A complete HNSW implementation—the state-of-the-art algorithm for vector search, combining navigable small-world graphs with a hierarchical structure for logarithmic search complexity.
Why it teaches HNSW: HNSW is the default choice for most vector databases (Pinecone, Qdrant, Weaviate, pgvector). It offers the best balance of speed, accuracy, and flexibility. Understanding it deeply means understanding modern vector search.
Core challenges you’ll face:
- Graph construction → maps to finding good neighbors during insertion
- Hierarchical layers → maps to skip-list-like navigation
- Greedy search → maps to navigating the graph efficiently
- Parameter tuning (M, efConstruction, efSearch) → maps to quality vs speed
Resources for key challenges:
- Pinecone HNSW Guide - Step-by-step explanation
- Zilliz HNSW Tutorial - Visual explanation
- Original HNSW Paper - The source
Key Concepts:
- Small World Graphs: Watts-Strogatz model
- Skip Lists: Probabilistic data structure
- Greedy Search: Local optimization in graphs
- HNSW Paper: Malkov & Yashunin, 2016
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-2 completed; understanding of graph data structures.
Real world outcome:
$ ./hnsw_build --dataset sift_1M.npy --M 16 --efConstruction 200
Building HNSW Index:
Dataset: 1,000,000 vectors × 128 dimensions
M (max connections): 16
efConstruction: 200
Construction progress:
Layer 3: 12 nodes (entry points)
Layer 2: 234 nodes
Layer 1: 4,891 nodes
Layer 0: 1,000,000 nodes (all vectors)
Build time: 8m 32s
Index size: 287 MB (graph) + 512 MB (vectors) = 799 MB
Graph statistics:
Avg connections at layer 0: 14.2
Max layer reached: 4
Graph diameter (estimated): 6 hops
$ ./hnsw_search --query query.npy --k 10 --ef 50
HNSW Search (ef=50):
Starting from entry point at layer 3
Layer 3: 2 hops → found node 8,234
Layer 2: 3 hops → found node 12,891
Layer 1: 4 hops → found node 4,521
Layer 0: Expanded 50 candidates
Results:
╔═══════╤═══════════╤══════════════╗
║ Rank │ Vector ID │ Distance ║
╠═══════╪═══════════╪══════════════╣
║ 1 │ 4,521 │ 0.823 ║
║ 2 │ 7,892 │ 0.891 ║
║ ... │ ... │ ... ║
╚═══════╧═══════════╧══════════════╝
Performance:
Nodes visited: 143
Distance calculations: 143 (vs 1,000,000 brute force)
Time: 0.8ms
Recall@10: 0.98
$ ./hnsw_search --ef-sweep
ef │ Recall@10 │ Query Time │ Nodes Visited
─────┼───────────┼────────────┼──────────────
10 │ 0.72 │ 0.2ms │ 38
20 │ 0.88 │ 0.4ms │ 67
50 │ 0.98 │ 0.8ms │ 143
100 │ 0.995 │ 1.4ms │ 271
200 │ 0.999 │ 2.6ms │ 523
Implementation Hints:
HNSW key insight: Combine two ideas:
- Navigable Small World (NSW): A graph where you can reach any node in few hops via greedy routing
- Hierarchy (Skip List): Multiple layers where upper layers have fewer nodes, enabling fast coarse search
Layer 2: [A]----------[B]----------[C] (few nodes, long edges)
| | |
Layer 1: [A]---[D]---[B]---[E]---[C] (more nodes)
| | | | |
Layer 0: [A][F][D][G][B][H][E][I][C] (all nodes, short edges)
Core data structures:
import numpy as np

class HNSWNode:
    def __init__(self, vector_id, vector, level):
        self.id = vector_id
        self.vector = vector
        self.level = level
        self.neighbors = [[] for _ in range(level + 1)]
        # neighbors[layer] = list of neighbor node ids

class HNSWIndex:
    def __init__(self, dim, M=16, ef_construction=200):
        self.dim = dim
        self.M = M                       # Max connections per node
        self.M_max0 = M * 2              # Max connections at layer 0
        self.ef_construction = ef_construction
        self.ml = 1 / np.log(M)          # Level multiplier (the paper's mL)
        self.nodes = {}                  # id -> HNSWNode
        self.entry_point = None
        self.max_level = -1
Insertion algorithm:
def insert(self, vector_id, vector):
    # 1. Determine the level for the new node (probabilistic, skip-list style)
    level = self._random_level()
    node = HNSWNode(vector_id, vector, level)
    if self.entry_point is None:
        self.entry_point = node
        self.max_level = level
        self.nodes[vector_id] = node
        return
    # 2. Greedy descent from the top layer down to level+1 (single best node)
    current = self.entry_point
    for lc in range(self.max_level, level, -1):
        current = self._search_layer_greedy(vector, current, lc)
    # 3. At the node's level and below, collect ef_construction candidates
    for lc in range(min(level, self.max_level), -1, -1):
        candidates = self._search_layer(vector, current, self.ef_construction, lc)
        max_conn = self.M if lc > 0 else self.M_max0
        # Select the best neighbors among the candidates
        neighbors = self._select_neighbors(vector, candidates, max_conn)
        # Bidirectional connections
        for neighbor in neighbors:
            node.neighbors[lc].append(neighbor.id)
            neighbor.neighbors[lc].append(vector_id)
            # Prune if the neighbor now has too many connections
            if len(neighbor.neighbors[lc]) > max_conn:
                pruned = self._select_neighbors(
                    neighbor.vector,
                    [self.nodes[n] if n != vector_id else node
                     for n in neighbor.neighbors[lc]],
                    max_conn,
                )
                neighbor.neighbors[lc] = [n.id for n in pruned]
        if candidates:
            current = min(candidates, key=lambda n: distance(vector, n.vector))
    # 4. Update the entry point if the new node reached a higher layer
    if level > self.max_level:
        self.entry_point = node
        self.max_level = level
    self.nodes[vector_id] = node

def _random_level(self):
    level = 0
    while np.random.random() < self.ml and level < 16:
        level += 1
    return level
Search algorithm:
from heapq import heappush, heappop

def search(self, query, k, ef):
    if self.entry_point is None:
        return []
    current = self.entry_point
    # Greedy descent from the top layer down to layer 1
    for lc in range(self.max_level, 0, -1):
        current = self._search_layer_greedy(query, current, lc)
    # At layer 0, run beam search with ef candidates
    candidates = self._search_layer(query, current, ef, 0)
    # Return top-k
    return sorted(candidates, key=lambda n: distance(query, n.vector))[:k]

def _search_layer(self, query, entry, ef, layer):
    visited = {entry.id}
    d0 = distance(query, entry.vector)
    candidates = [(d0, entry.id, entry)]   # min-heap; the id breaks distance ties
    results = [(-d0, entry.id, entry)]     # max-heap via negated distances
    while candidates:
        dist_c, _, current = heappop(candidates)
        # Stop when the closest open candidate is worse than the worst result
        if dist_c > -results[0][0]:
            break
        for neighbor_id in current.neighbors[layer]:
            if neighbor_id in visited:
                continue
            visited.add(neighbor_id)
            neighbor = self.nodes[neighbor_id]
            dist_n = distance(query, neighbor.vector)
            if len(results) < ef or dist_n < -results[0][0]:
                heappush(candidates, (dist_n, neighbor_id, neighbor))
                heappush(results, (-dist_n, neighbor_id, neighbor))
                if len(results) > ef:
                    heappop(results)
    return [node for _, _, node in results]
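The snippets above call two helpers they don't define; minimal sketches follow. The selection here is simplified to plain closest-M, while the paper's heuristic also favors diverse directions:
def _search_layer_greedy(self, query, entry, layer):
    """Hill-climb: move to the closest neighbor until nothing improves."""
    current, best = entry, distance(query, entry.vector)
    improved = True
    while improved:
        improved = False
        for neighbor_id in current.neighbors[layer]:
            d = distance(query, self.nodes[neighbor_id].vector)
            if d < best:
                best, current, improved = d, self.nodes[neighbor_id], True
    return current

def _select_neighbors(self, vector, candidates, M):
    """Simplified selection: the M closest candidate nodes."""
    return sorted(candidates, key=lambda n: distance(vector, n.vector))[:M]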
Parameters explained:
- M: Max connections per node. Higher = more accurate, more memory
- ef_construction: Beam width during build. Higher = better graph, slower build
- ef_search: Beam width during search. Higher = better recall, slower search
Learning milestones:
- Hierarchical structure builds correctly → You understand the layers
- Greedy search finds good paths → You understand navigation
- Recall matches hnswlib → You’ve implemented correctly
- Parameter tuning works → You understand the tradeoffs
Project 6: IVF-PQ Composite Index
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: C++, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Combined Indexing / Large-Scale Search
- Software or Tool: IVF-PQ Index
- Main Book: “Billion-scale similarity search with GPUs” - Johnson et al.
What you’ll build: A composite IVF-PQ index that combines clustering (IVF) with compression (PQ) for billion-scale search with minimal memory usage.
Why it teaches composite indexes: No single technique is enough at scale. IVF-PQ combines the best of both worlds: IVF reduces search scope, PQ reduces memory. This is how FAISS handles billion-vector datasets.
Core challenges you’ll face:
- Training pipeline → maps to IVF clustering then PQ on residuals
- Residual encoding → maps to PQ on (vector - centroid)
- Search optimization → maps to precomputed tables + SIMD
- Memory layout → maps to efficient storage for cache
Key Concepts:
- Residual Quantization: PQ on the difference from centroid
- Two-Stage Training: IVF first, then PQ
- ADC with IVF: Combining both speedups
Difficulty: Expert. Time estimate: 2-3 weeks. Prerequisites: Projects 3-4 completed.
Real world outcome:
$ ./ivfpq_build --dataset deep_1B.npy --clusters 65536 --pq_segments 64
Building IVF-PQ Index for 1 BILLION vectors:
Dimensions: 96
IVF clusters: 65,536
PQ segments: 64 (1.5 dims each)
PQ bits: 8 (256 centroids per segment)
Phase 1: Training IVF (k-means on 10M samples)
Iterations: 20, converged
Time: 15m 23s
Phase 2: Training PQ codebooks (on residuals)
Training 64 codebooks of 256 centroids each
Time: 8m 45s
Phase 3: Encoding vectors
Progress: [████████████████████] 100%
Time: 2h 34m
Index statistics:
Raw data: 1B × 96 × 4 bytes = 384 GB
Compressed: 1B × 64 bytes = 64 GB (6x compression)
Centroids: 65K × 96 × 4 bytes = 25 MB
Codebooks: 64 × 256 × 1.5 × 4 bytes = 98 KB
Total: 64 GB (84% reduction!)
$ ./ivfpq_search --query query.npy --k 10 --nprobe 64
IVF-PQ Search:
Centroid search: 0.3ms (found 64 clusters)
PQ scan: 4.2ms (scanned ~15M codes)
Re-ranking: 0.1ms
Performance:
Total: 4.6ms for 1B vectors!
Recall@10: 0.87
$ ./ivfpq_rerank --k 100 --rerank-k 10
With re-ranking (decode top-100, compute exact distances):
Total: 6.2ms
Recall@10: 0.94
Implementation Hints:
IVF-PQ structure:
class IVFPQIndex:
    def __init__(self, dim, num_clusters, pq_segments, pq_bits=8):
        self.dim = dim
        self.num_clusters = num_clusters
        # IVF component
        self.centroids = None  # (num_clusters, dim)
        # PQ component (encodes residuals)
        self.pq = ProductQuantizer(dim, pq_segments, 2 ** pq_bits)
        # Storage: one PQ code array per cluster
        self.inverted_lists = [
            {'ids': [], 'codes': []}
            for _ in range(num_clusters)
        ]
Training (two-stage):
from sklearn.cluster import KMeans

def train(self, training_vectors):
    # Stage 1: Train IVF (k-means)
    kmeans = KMeans(n_clusters=self.num_clusters)
    kmeans.fit(training_vectors)
    self.centroids = kmeans.cluster_centers_
    # Compute residuals: vector - nearest_centroid
    assignments = kmeans.predict(training_vectors)
    residuals = training_vectors - self.centroids[assignments]
    # Stage 2: Train PQ on the residuals
    self.pq.train(residuals)
Adding vectors:
from scipy.spatial.distance import cdist

def add(self, vector_ids, vectors):
    # Find the nearest centroids
    distances = cdist(vectors, self.centroids)
    assignments = np.argmin(distances, axis=1)
    # Compute and encode residuals
    residuals = vectors - self.centroids[assignments]
    codes = self.pq.encode_batch(residuals)
    # Store in the inverted lists
    for vid, cluster, code in zip(vector_ids, assignments, codes):
        self.inverted_lists[cluster]['ids'].append(vid)
        self.inverted_lists[cluster]['codes'].append(code)
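Note that encode_batch isn't defined in Project 4's ProductQuantizer; a vectorized sketch that could be added as a method there:
import numpy as np
from scipy.spatial.distance import cdist

def encode_batch(self, vectors):
    """Vectorized PQ encoding: one nearest-centroid search per segment."""
    codes = np.empty((len(vectors), self.num_segments), dtype=np.uint8)
    for seg in range(self.num_segments):
        start = seg * self.segment_dim
        sub = vectors[:, start:start + self.segment_dim]
        # (N, num_centroids) distance matrix, then argmin per row
        codes[:, seg] = np.argmin(cdist(sub, self.codebooks[seg]), axis=1)
    return codes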
Search:
def search(self, query, k, nprobe):
    # Step 1: Find the nprobe nearest clusters
    centroid_distances = np.linalg.norm(self.centroids - query, axis=1)
    nearest_clusters = np.argsort(centroid_distances)[:nprobe]
    # Step 2: For each cluster, compute the residual query
    all_candidates = []
    for cluster_id in nearest_clusters:
        centroid = self.centroids[cluster_id]
        residual_query = query - centroid
        # Step 3: Compute the PQ distance table for the residual
        distance_table = self.pq.compute_distance_tables(residual_query)
        # Step 4: Scan codes and compute approximate distances
        codes = np.array(self.inverted_lists[cluster_id]['codes'])
        ids = self.inverted_lists[cluster_id]['ids']
        if len(codes) == 0:
            continue
        # Fast PQ distance computation (squared per segment, summed)
        pq_distances = np.sum(
            distance_table[np.arange(self.pq.num_segments), codes] ** 2,
            axis=1
        )
        for vid, dist in zip(ids, pq_distances):
            all_candidates.append((vid, dist))
    # Step 5: Return top-k
    all_candidates.sort(key=lambda x: x[1])
    return all_candidates[:k]
Optimization: Store codes contiguously per cluster for cache efficiency:
# Instead of a Python list of codes, use one contiguous array per cluster:
codes = np.array([...], dtype=np.uint8).reshape(-1, pq_segments)
# Shape: (num_vectors_in_cluster, pq_segments)
# This enables the vectorized distance computation shown above
Learning milestones:
- Two-stage training works → You understand the pipeline
- Residual encoding improves accuracy → You understand why it helps
- Memory reduction is massive → You understand the compression
- Search matches FAISS performance → You’ve implemented correctly
Project 7: Disk-Based Vector Index (DiskANN-style)
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: C++
- Alternative Programming Languages: Rust, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Storage Systems / SSD Optimization
- Software or Tool: Disk-Based Vector Index
- Main Book: “DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node” - Microsoft Research
What you’ll build: A vector index that stores vectors on SSD with an in-memory graph for navigation, enabling billion-scale search on a single commodity machine.
Why it teaches disk-based search: Memory is expensive. Storing 1B vectors in RAM requires 500GB+. DiskANN-style indexes keep only the graph in memory (~10GB) and fetch vectors from SSD during search, reducing cost by 50x.
Core challenges you’ll face:
- Graph fits in memory, vectors on disk → maps to memory-mapped I/O
- Minimizing disk reads → maps to graph design for locality
- SSD-friendly access patterns → maps to sequential vs random reads
- Beam search with async I/O → maps to pipelining disk access
Key Concepts:
- DiskANN Paper: Microsoft Research, 2019
- Memory-Mapped Files: OS-level I/O optimization
- SSD Performance: Random vs sequential read patterns
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Project 5 completed; C++ proficiency and an understanding of file I/O.
Real world outcome:
$ ./diskann_build --dataset deep_1B.npy --memory-limit 16GB
Building DiskANN Index:
Dataset: 1,000,000,000 vectors × 96 dimensions
Vector data size: 384 GB (will be on disk)
Memory limit: 16 GB (for graph)
Building graph with memory-efficient construction...
Pass 1: Sample-based graph initialization
Pass 2: Full graph refinement (streaming)
Pass 3: Graph compaction
Index files:
vectors.bin: 384 GB (memory-mapped)
graph.bin: 12 GB (in-memory)
metadata.json: 1 KB
$ ./diskann_search --query query.npy --k 10 --beam-width 50
DiskANN Search:
Graph navigation: 0.4ms (in-memory)
Disk reads: 23 vectors fetched
SSD latency: 2.1ms (avg 0.09ms per read)
Total: 2.5ms
Performance analysis:
Graph hops: 8
Candidate set size: 50
Disk reads: 23 (< beam width due to caching)
Cache hits: 12
Recall@10: 0.96
$ ./diskann_search --benchmark --queries 10000
Benchmark (10K queries):
Mean latency: 2.8ms
P99 latency: 5.2ms
Throughput: 357 queries/second
SSD bandwidth: 1.2 GB/s
Implementation Hints:
Architecture:
┌─────────────────────────────────────────┐
│ In-Memory Graph │
│ (Node ID → [Neighbor IDs], 12 GB) │
├─────────────────────────────────────────┤
│ Memory-Mapped Vectors │
│ (Node ID → Vector, 384 GB on SSD) │
└─────────────────────────────────────────┘
Core structures:
struct DiskANNIndex {
    // In-memory: graph only
    vector<vector<uint32_t>> graph;   // Neighbor lists
    // Memory-mapped: vectors on disk
    float* vectors;                   // mmap'd from file
    size_t num_vectors;
    size_t dim;
    // Caching
    LRUCache<uint32_t, vector<float>> vector_cache;
};
Memory-mapping vectors:
void open_vectors(const string& path) {
    int fd = open(path.c_str(), O_RDONLY);
    size_t file_size = num_vectors * dim * sizeof(float);
    vectors = (float*)mmap(
        nullptr,
        file_size,
        PROT_READ,
        MAP_PRIVATE,   // Copy-on-write
        fd,
        0
    );
    // Advise the OS about the access pattern
    madvise(vectors, file_size, MADV_RANDOM);
}

vector<float> get_vector(uint32_t id) {
    // Check the cache first
    if (auto cached = vector_cache.get(id)) {
        return *cached;
    }
    // Load from the mmap (may trigger a disk read)
    float* ptr = vectors + id * dim;
    vector<float> vec(ptr, ptr + dim);
    vector_cache.put(id, vec);
    return vec;
}
Search with prefetching:
vector<pair<uint32_t, float>> search(
    const vector<float>& query,
    size_t k,
    size_t beam_width
) {
    // candidates: min-heap, explore the closest node first
    priority_queue<pair<float, uint32_t>,
                   vector<pair<float, uint32_t>>, greater<>> candidates;
    // results: max-heap, top() is the worst entry of the current beam
    priority_queue<pair<float, uint32_t>> results;
    unordered_set<uint32_t> visited;

    // Start from the entry point
    auto entry_vec = get_vector(entry_point);
    float dist = distance(query, entry_vec);
    candidates.push({dist, entry_point});
    results.push({dist, entry_point});
    visited.insert(entry_point);

    while (!candidates.empty()) {
        auto [dist_c, current] = candidates.top();
        candidates.pop();
        // Stop when the closest candidate is worse than the worst beam entry
        if (dist_c > results.top().first && results.size() >= beam_width) {
            break;
        }
        // Hint the OS to prefetch neighbor vectors (pipelines SSD reads)
        auto& neighbors = graph[current];
        for (uint32_t neighbor : neighbors) {
            prefetch(&vectors[neighbor * dim]);
        }
        // Process neighbors
        for (uint32_t neighbor : neighbors) {
            if (visited.count(neighbor)) continue;
            visited.insert(neighbor);
            auto neighbor_vec = get_vector(neighbor);
            float d = distance(query, neighbor_vec);
            if (results.size() < beam_width || d < results.top().first) {
                candidates.push({d, neighbor});
                results.push({d, neighbor});
                if (results.size() > beam_width) {
                    results.pop();
                }
            }
        }
    }
    // Drain the beam: the max-heap pops worst-first, so reverse for ascending order
    vector<pair<uint32_t, float>> ordered;
    while (!results.empty()) {
        ordered.push_back({results.top().second, results.top().first});
        results.pop();
    }
    reverse(ordered.begin(), ordered.end());
    if (ordered.size() > k) ordered.resize(k);
    return ordered;
}
Graph construction (memory-efficient):
void build_graph(const string& vectors_path, size_t R, size_t L) {
    // R = max degree, L = candidate list size during construction
    // Initialize with a random graph
    for (size_t i = 0; i < num_vectors; i++) {
        graph[i] = random_neighbors(R);
    }
    // Iterative refinement (process vectors in chunks to limit memory)
    size_t chunk_size = 1000000;  // 1M vectors at a time
    for (size_t offset = 0; offset < num_vectors; offset += chunk_size) {
        size_t end = min(offset + chunk_size, num_vectors);
        // Load the chunk into memory
        vector<vector<float>> chunk_vectors = load_chunk(vectors_path, offset, end);
        // For each vector in the chunk, improve its neighbor list
        for (size_t i = offset; i < end; i++) {
            auto candidates = beam_search(chunk_vectors[i - offset], L);
            graph[i] = select_diverse_neighbors(candidates, R);
            // Add reverse edges
            for (uint32_t neighbor : graph[i]) {
                if (graph[neighbor].size() < R) {
                    graph[neighbor].push_back(i);
                }
            }
        }
    }
}
Learning milestones:
- Memory-mapped I/O works → You understand OS-level optimization
- Graph fits in memory limit → You understand memory management
- Latency is SSD-bound → You understand the bottleneck
- Prefetching helps → You understand pipelining
Project 8: Metadata Filtering with Vector Search
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Python
- Alternative Programming Languages: Rust, Go, C++
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 3: Advanced
- Knowledge Area: Hybrid Search / Inverted Indexes
- Software or Tool: Filtered Vector Search
- Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze
What you’ll build: A vector search system that supports metadata filtering (e.g., “find similar products where price < $100 and category = ‘electronics’”) efficiently, not just post-filtering.
Why it teaches filtered search: Real applications always have filters. Naive post-filtering is slow (scan 1M vectors, filter to 1K, return top-10). Smart pre-filtering or integrated filtering is essential for production systems.
Core challenges you’ll face:
- Pre-filter vs post-filter → maps to when to apply filters
- Bitmap indexes for metadata → maps to efficient set operations
- Filtered HNSW traversal → maps to skipping invalid nodes
- Cost estimation → maps to choosing the right strategy
Key Concepts:
- Inverted Indexes: “Introduction to Information Retrieval” - Manning et al.
- Bitmap Indexes: Database index fundamentals
- Filtered Graph Search: Qdrant, Pinecone approaches
Difficulty: Advanced. Time estimate: 2-3 weeks. Prerequisites: Project 5 completed; understanding of database indexes.
Real world outcome:
$ ./filtered_search "
SELECT * FROM products
WHERE vector NEAR query_embedding
AND category = 'electronics'
AND price < 100
AND rating >= 4.0
LIMIT 10
"
Query Analysis:
Filter selectivity: 2.3% (23,000 of 1,000,000 match)
Strategy: Pre-filter + HNSW on filtered subset
Execution:
Step 1: Apply filters (bitmap intersection): 0.8ms
Step 2: Build filtered entry points: 0.1ms
Step 3: HNSW search on filtered graph: 1.2ms
Results:
╔═══════╤═══════════════╤══════════════╤════════╤════════════╗
║ Rank │ Product ID │ Distance │ Price │ Category ║
╠═══════╪═══════════════╪══════════════╪════════╪════════════╣
║ 1 │ PROD-34521 │ 0.234 │ $49.99 │ electronics║
║ 2 │ PROD-78234 │ 0.287 │ $89.99 │ electronics║
║ ... │ ... │ ... │ ... │ ... ║
╚═══════╧═══════════════╧══════════════╧════════╧════════════╝
Performance: 2.1ms total
(vs 847ms with post-filtering on full HNSW traversal)
$ ./filtered_search --analyze-strategies
Filter Selectivity │ Best Strategy │ Expected Time
───────────────────┼──────────────────┼──────────────
> 50% │ Post-filter │ ~2ms
10% - 50% │ Filtered HNSW │ ~3ms
1% - 10% │ Pre-filter │ ~2ms
< 1% │ Scan filtered │ ~1ms
Implementation Hints:
Metadata storage with bitmap indexes:
class MetadataIndex:
    def __init__(self):
        # Categorical fields: inverted index of bitmaps
        self.categorical = {}  # field -> {value -> bitmap}
        # Numerical fields: sorted index
        self.numerical = {}    # field -> sorted [(value, vector_id)]

    def add(self, vector_id, metadata):
        for field, value in metadata.items():
            if isinstance(value, str):
                if field not in self.categorical:
                    self.categorical[field] = {}
                if value not in self.categorical[field]:
                    self.categorical[field][value] = BitSet()
                self.categorical[field][value].set(vector_id)
            else:
                if field not in self.numerical:
                    self.numerical[field] = []
                self.numerical[field].append((value, vector_id))

    def build_indexes(self):
        for field in self.numerical:
            self.numerical[field].sort()

    def filter(self, conditions):
        result = None
        for field, op, value in conditions:
            if field in self.categorical:
                if op == '==':
                    bits = self.categorical[field].get(value, BitSet())
                elif op == 'in':
                    bits = BitSet()
                    for v in value:
                        bits |= self.categorical[field].get(v, BitSet())
            else:
                # Numerical: binary search over the sorted index (helper sketched elsewhere)
                bits = self._range_filter(field, op, value)
            if result is None:
                result = bits
            else:
                result &= bits  # Intersection of all conditions
        return result
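The sketch above assumes a BitSet type; a minimal stand-in backed by a Python arbitrary-precision int (production systems typically use roaring bitmaps instead):
class BitSet:
    """Minimal bitmap over vector ids, backed by one Python int."""
    def __init__(self, bits=0):
        self.bits = bits
    def set(self, i):
        self.bits |= 1 << i
    def __or__(self, other):
        return BitSet(self.bits | other.bits)
    def __ior__(self, other):
        self.bits |= other.bits
        return self
    def __and__(self, other):
        return BitSet(self.bits & other.bits)
    def __iand__(self, other):
        self.bits &= other.bits
        return self
    def ids(self):
        """Iterate the positions of set bits (the matching vector ids)."""
        b, i = self.bits, 0
        while b:
            if b & 1:
                yield i
            b >>= 1
            i += 1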
Filtered HNSW search:
from heapq import heappush, heappop

class FilteredHNSW(HNSWIndex):
    def search_filtered(self, query, k, ef, valid_ids):
        """Search considering only nodes in valid_ids."""
        if len(valid_ids) == 0:
            return []
        valid_set = set(valid_ids)  # O(1) membership tests
        # Find a valid entry point
        entry = self._find_valid_entry(valid_set)
        if entry is None:
            # Fall back to scanning the filtered subset
            return self._scan_filtered(query, k, valid_set)
        # Modified beam search: navigate through invalid nodes,
        # but only admit valid ones into the result set
        visited = {entry.id}
        candidates = [(distance(query, entry.vector), entry.id, entry)]
        results = []
        while candidates:
            dist_c, _, current = heappop(candidates)
            # Add to results only if it passes the filter
            if current.id in valid_set:
                heappush(results, (-dist_c, current.id, current))
                if len(results) > ef:
                    heappop(results)
            # Explore neighbors (including invalid ones, for navigation)
            for neighbor_id in current.neighbors[0]:
                if neighbor_id in visited:
                    continue
                visited.add(neighbor_id)
                neighbor = self.nodes[neighbor_id]
                dist_n = distance(query, neighbor.vector)
                if len(results) < ef or dist_n < -results[0][0]:
                    heappush(candidates, (dist_n, neighbor_id, neighbor))
        # Extract the top-k valid results, closest first
        return sorted(
            [(n.id, -d) for d, _, n in results],
            key=lambda x: x[1]
        )[:k]
Cost-based strategy selection:
def choose_strategy(self, query, filters, k):
    # Estimate filter selectivity from metadata statistics
    selectivity = self.estimate_selectivity(filters)
    num_valid = int(self.num_vectors * selectivity)
    # Rough cost models (these estimates could replace the fixed thresholds below)
    post_filter_cost = self.hnsw_search_cost() + num_valid * 0.001
    pre_filter_cost = self.filter_cost(filters) + self.scan_cost(num_valid)
    filtered_hnsw_cost = self.filter_cost(filters) + self.hnsw_search_cost() * (1 / selectivity)
    if selectivity > 0.5:
        return 'post_filter'
    elif selectivity < 0.01:
        return 'pre_filter_scan'
    else:
        return 'filtered_hnsw'
Learning milestones:
- Bitmap filters are fast → You understand inverted indexes
- Strategy selection matters → You understand cost-based optimization
- Filtered HNSW works → You understand graph traversal with constraints
- Performance beats naive approach → You’ve built an efficient system
Project 9: Vector Database Storage Engine
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Rust
- Alternative Programming Languages: C++, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 4: Expert
- Knowledge Area: Storage Systems / Write-Ahead Logging
- Software or Tool: Storage Engine
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A durable storage engine for vectors with write-ahead logging (WAL), crash recovery, compaction, and efficient serialization—the persistence layer of a real vector database.
Why it teaches storage: Indexes are useless if they disappear on restart. A real vector database needs ACID properties: atomic writes, crash recovery, and efficient disk formats. This is what makes the difference between a prototype and a production system.
Core challenges you’ll face:
- Write-ahead logging → maps to durability before acknowledging writes
- Crash recovery → maps to replaying WAL on startup
- Compaction → maps to merging and garbage collection
- Efficient serialization → maps to binary formats for vectors
Key Concepts:
- WAL: “Designing Data-Intensive Applications” Chapter 3 - Kleppmann
- LSM Trees: For append-friendly storage
- Memory-Mapped Files: OS-level optimization
Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Projects 1-6 completed; experience with file I/O and binary formats.
Real world outcome:
$ ./vecdb_storage --init --path ./data
Initializing VectorDB Storage Engine:
Data directory: ./data
WAL segment size: 64 MB
Compaction threshold: 1 GB
Files created:
./data/wal/segment_000000.wal
./data/vectors/vectors_000000.bin
./data/index/hnsw_000000.idx
./data/metadata.json
$ ./vecdb_storage --insert --count 100000
Inserting 100,000 vectors...
WAL writes: 100,000
Batch flushes: 10
Time: 4.2s (23,800 inserts/sec)
$ kill -9 $PID # Simulate crash
$ ./vecdb_storage --recover
Recovery started:
Scanning WAL segments...
Found 1 segment with 100,000 entries
Replaying entries: [████████████████████] 100%
Rebuilding index: 2.3s
Recovery complete!
Verified:
Vectors recovered: 100,000 ✓
Index consistency: OK ✓
No data loss ✓
$ ./vecdb_storage --compact
Compaction:
Before: 3 segments, 1.2 GB
Merging...
After: 1 segment, 0.9 GB
Space reclaimed: 300 MB (25%)
Implementation Hints:
Storage architecture:
./data/
├── wal/
│ ├── segment_000000.wal # Write-ahead log
│ ├── segment_000001.wal
│ └── ...
├── vectors/
│ └── vectors_000000.bin # Actual vector data
├── index/
│ └── hnsw_000000.idx # Serialized index
└── metadata.json # DB state
WAL entry format:
struct WALEntry {
    sequence_number: u64,
    operation: Operation,   // Insert, Delete, Update
    timestamp: u64,
    vector_id: u64,
    vector_data: Vec<f32>,  // Empty for Delete
    metadata: HashMap<String, Value>,
    checksum: u32,
}

enum Operation {
    Insert = 1,
    Delete = 2,
    Update = 3,
}
WAL writer:
struct WALWriter {
    current_segment: File,
    segment_size: usize,
    current_size: usize,
    sequence: AtomicU64,
}

impl WALWriter {
    fn append(&mut self, entry: WALEntry) -> Result<u64> {
        let seq = self.sequence.fetch_add(1, Ordering::SeqCst);
        let mut entry = entry;
        entry.sequence_number = seq;
        // Serialize the entry, then checksum the serialized bytes
        let bytes = entry.serialize();
        entry.checksum = crc32(&bytes);
        // Write length + data + checksum
        self.current_segment.write_all(&(bytes.len() as u32).to_le_bytes())?;
        self.current_segment.write_all(&bytes)?;
        self.current_segment.write_all(&entry.checksum.to_le_bytes())?;
        // Flush to disk before acknowledging (durability!)
        self.current_segment.sync_all()?;
        self.current_size += bytes.len() + 8;
        // Rotate the segment if needed
        if self.current_size >= self.segment_size {
            self.rotate_segment()?;
        }
        Ok(seq)
    }
}
Recovery:
fn recover(data_dir: &Path) -> Result<VectorDB> {
    let mut db = VectorDB::new();
    // Read the metadata to find the last checkpoint
    let metadata = read_metadata(data_dir)?;
    let last_checkpoint = metadata.last_checkpoint_sequence;
    // Replay the WAL from the last checkpoint
    for segment in list_wal_segments(data_dir)? {
        let reader = WALReader::open(segment)?;
        for entry in reader.entries() {
            let entry = entry?;
            // Verify the checksum
            if crc32(&entry.data) != entry.checksum {
                log::error!("Corrupt WAL entry at seq {}", entry.sequence_number);
                continue;
            }
            // Skip entries already applied before the checkpoint
            if entry.sequence_number <= last_checkpoint {
                continue;
            }
            // Apply to the in-memory state
            match entry.operation {
                Operation::Insert => db.insert_recovered(entry.vector_id, entry.vector_data),
                Operation::Delete => db.delete_recovered(entry.vector_id),
                Operation::Update => db.update_recovered(entry.vector_id, entry.vector_data),
            }
        }
    }
    // Rebuild the index if needed
    if db.index_needs_rebuild() {
        db.rebuild_index()?;
    }
    Ok(db)
}
Compaction:
fn compact(&mut self) -> Result<()> {
    // Create a new segment file
    let new_segment = create_new_segment()?;
    // Write only live (non-deleted) vectors to the new segment
    for (id, vector) in self.vectors.iter() {
        if !self.deleted.contains(id) {
            write_vector(&new_segment, id, vector)?;
        }
    }
    // Atomically swap segments
    new_segment.sync_all()?;
    rename(new_segment.path(), current_segment_path())?;
    // Delete the old segments
    for old in old_segments {
        remove_file(old)?;
    }
    // Truncate the WAL (everything is now in the new segment)
    self.wal.truncate_before(self.sequence)?;
    Ok(())
}
Learning milestones:
- WAL writes are durable → You understand write-ahead logging
- Crash recovery works → You understand durability guarantees
- Compaction reclaims space → You understand garbage collection
- Performance is acceptable → You’ve built an efficient storage layer
Project 10: Distributed Vector Search Cluster
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Go
- Alternative Programming Languages: Rust, Java
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Distributed Systems / Sharding
- Software or Tool: Distributed Vector Database
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A distributed vector search cluster with sharding, replication, routing, and distributed query execution—turning a single-node index into a scalable system.
Why it teaches distributed systems: No single machine can handle billions of vectors with low latency. Distributed systems introduce new challenges: consistency, partition tolerance, coordination. Understanding these is essential for production systems.
Core challenges you’ll face:
- Sharding strategies → maps to distributing vectors across nodes
- Query routing → maps to finding which shards to query
- Result merging → maps to combining results from multiple shards
- Replication → maps to fault tolerance and read scaling
- Consistency → maps to handling concurrent updates
Key Concepts:
- Sharding: “Designing Data-Intensive Applications” Chapter 6 - Kleppmann
- Replication: “Designing Data-Intensive Applications” Chapter 5 - Kleppmann
- Distributed Consensus: Raft algorithm
Difficulty: Master. Time estimate: 4-6 weeks. Prerequisites: All previous projects; distributed systems experience.
Real world outcome:
$ ./vecdb cluster init --nodes 3 --shards 9 --replicas 2
Initializing distributed cluster:
Nodes: node-0 (coordinator), node-1, node-2
Shards: 9 (each node holds 3 primaries plus 6 replicas of other nodes' shards)
Replication factor: 2
Shard distribution:
╔════════╤═══════════════════╤══════════════════╗
║ Shard │ Primary Node │ Replica Nodes ║
╠════════╪═══════════════════╪══════════════════╣
║ 0 │ node-0 │ node-1, node-2 ║
║ 1 │ node-0 │ node-1, node-2 ║
║ 2 │ node-0 │ node-1, node-2 ║
║ 3 │ node-1 │ node-0, node-2 ║
║ ... │ ... │ ... ║
╚════════╧═══════════════════╧══════════════════╝
Cluster ready!
$ ./vecdb insert --vectors 10000000 --distributed
Inserting 10M vectors across cluster:
Routing: hash(vector_id) mod 9
Progress: [████████████████████] 100%
Time: 45s (222K inserts/sec)
$ ./vecdb search --query query.npy --k 10
Distributed Query Execution:
Step 1: Route query to all 9 shards
Step 2: Parallel search on each shard (local k=30)
Step 3: Merge 270 candidates from 9 shards
Step 4: Re-rank and return top-10
Performance:
Shard latencies: [1.2ms, 1.4ms, 1.1ms, 1.3ms, 1.5ms, 1.2ms, 1.0ms, 1.4ms, 1.3ms]
Merge time: 0.3ms
Total: 1.8ms (slowest shard 1.5ms + merge 0.3ms; latency is gated by the slowest shard)
$ ./vecdb failover --kill node-1
Simulating node-1 failure...
Detecting failure: 2.1s (heartbeat timeout)
Promoting replicas:
Shard 3: node-0 promoted to primary
Shard 4: node-2 promoted to primary
Shard 5: node-0 promoted to primary
Cluster healthy (degraded: missing replicas)
$ ./vecdb search --query query.npy --k 10
Search still works! (using remaining nodes)
Latency: 2.1ms (slightly higher due to fewer nodes)
Implementation Hints:
Cluster architecture:
┌─────────────────────────────────────────────────────────────┐
│ Coordinator │
│ (Query routing, shard map, cluster membership) │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Node-0 │ │ Node-1 │ │ Node-2 │ │
│ │ │ │ │ │ │ │
│ │ Shard 0 (P) │ │ Shard 3 (P) │ │ Shard 6 (P) │ │
│ │ Shard 1 (P) │ │ Shard 4 (P) │ │ Shard 7 (P) │ │
│ │ Shard 2 (P) │ │ Shard 5 (P) │ │ Shard 8 (P) │ │
│ │ Shard 3 (R) │ │ Shard 0 (R) │ │ Shard 0 (R) │ │
│ │ Shard 4 (R) │ │ Shard 1 (R) │ │ Shard 1 (R) │ │
│ │ ... │ │ ... │ │ ... │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
P = Primary, R = Replica
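To make the placement concrete, here is a minimal sketch of building such a shard map: primaries are assigned in contiguous blocks (node-0 gets shards 0-2, node-1 gets 3-5, and so on), and each shard's replicas land on the following nodes. buildShardMap is a hypothetical helper, not part of the router code below; it assumes node IDs 0..numNodes-1 and numShards divisible by numNodes.
// buildShardMap returns shard -> [primary, replica, replica, ...] node IDs,
// matching the block layout in the diagram above.
func buildShardMap(numShards, numNodes, replicaCount int) map[int][]int {
    shardsPerNode := numShards / numNodes
    shardMap := make(map[int][]int, numShards)
    for shard := 0; shard < numShards; shard++ {
        primary := shard / shardsPerNode
        placement := []int{primary} // index 0 is always the primary
        for i := 1; i <= replicaCount; i++ {
            // Replicas go on the next nodes, wrapping around the cluster
            placement = append(placement, (primary+i)%numNodes)
        }
        shardMap[shard] = placement
    }
    return shardMap
}
For 9 shards, 3 nodes, and 2 replicas this reproduces the table from cluster init: shard 0 maps to [node-0, node-1, node-2], shard 3 to [node-1, node-2, node-0], and so on.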
Sharding:
type ShardRouter struct {
numShards int
shardMap map[int][]NodeAddress // shard -> [primary, replicas]
}
func (r *ShardRouter) GetShard(vectorID uint64) int {
return int(vectorID % uint64(r.numShards))
}
func (r *ShardRouter) RouteInsert(vectorID uint64, vector []float32) error {
shardID := r.GetShard(vectorID)
primary := r.shardMap[shardID][0]
// Send to primary (it will replicate)
return r.sendInsert(primary, vectorID, vector)
}
func (r *ShardRouter) RouteSearch(query []float32, k int) ([]SearchResult, error) {
// Search all shards in parallel
results := make(chan []SearchResult, r.numShards)
errors := make(chan error, r.numShards)
for shardID := 0; shardID < r.numShards; shardID++ {
go func(sid int) {
node := r.pickNode(sid) // Primary or replica
res, err := r.sendSearch(node, query, k*3) // Over-fetch: per-shard ANN results are approximate, so extra candidates preserve recall after the merge
if err != nil {
errors <- err
return
}
results <- res
}(shardID)
}
// Collect and merge
allResults := []SearchResult{}
for i := 0; i < r.numShards; i++ {
select {
case res := <-results:
allResults = append(allResults, res...)
case err := <-errors:
return nil, err
}
}
// Sort by distance and take top-k
sort.Slice(allResults, func(i, j int) bool {
return allResults[i].Distance < allResults[j].Distance
})
if len(allResults) > k {
allResults = allResults[:k]
}
return allResults, nil
}
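Sorting is fine at this scale (9 shards × 30 candidates = 270 items). If per-shard candidate lists grow large, a bounded max-heap keeps the merge at O(n log k) instead of O(n log n). A sketch using Go's container/heap (add "container/heap" to imports), reusing the SearchResult type from above:
// resultHeap is a max-heap on Distance: the worst candidate kept so far
// sits at the root, so it can be evicted in O(log k).
type resultHeap []SearchResult

func (h resultHeap) Len() int           { return len(h) }
func (h resultHeap) Less(i, j int) bool { return h[i].Distance > h[j].Distance }
func (h resultHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *resultHeap) Push(x any)        { *h = append(*h, x.(SearchResult)) }
func (h *resultHeap) Pop() any {
    old := *h
    item := old[len(old)-1]
    *h = old[:len(old)-1]
    return item
}

// mergeTopK keeps only the k best candidates while streaming through
// every shard's result list.
func mergeTopK(shardResults [][]SearchResult, k int) []SearchResult {
    h := &resultHeap{}
    for _, results := range shardResults {
        for _, r := range results {
            if h.Len() < k {
                heap.Push(h, r)
            } else if r.Distance < (*h)[0].Distance {
                (*h)[0] = r    // replace the current worst...
                heap.Fix(h, 0) // ...and restore the heap property
            }
        }
    }
    // Pop yields worst-first; fill the slice backwards for ascending distance
    out := make([]SearchResult, h.Len())
    for i := len(out) - 1; i >= 0; i-- {
        out[i] = heap.Pop(h).(SearchResult)
    }
    return out
}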
Replication:
type ReplicationManager struct {
primary *Shard
replicas []*RemoteReplica
}
func (rm *ReplicationManager) Insert(id uint64, vector []float32) error {
// Write to primary first
if err := rm.primary.Insert(id, vector); err != nil {
return err
}
// Replicate asynchronously (eventual consistency)
// Or synchronously (strong consistency)
for _, replica := range rm.replicas {
go replica.Replicate(id, vector)
}
return nil
}
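The sketch above replicates asynchronously and silently drops replica errors. A synchronous variant waits for a write quorum before acknowledging, trading latency for stronger durability. A minimal sketch, assuming Replicate returns an error and is safe to call concurrently (add "fmt" to imports):
func (rm *ReplicationManager) InsertSync(id uint64, vector []float32) error {
    // Write to the primary first; fail fast if it rejects the write
    if err := rm.primary.Insert(id, vector); err != nil {
        return err
    }
    total := 1 + len(rm.replicas) // copies = primary + replicas
    quorum := total/2 + 1
    acks := 1 // the primary's local write is the first ack
    if acks >= quorum {
        return nil // single-copy setup: nothing to wait for
    }
    errs := make(chan error, len(rm.replicas)) // buffered: no goroutine leaks
    for _, replica := range rm.replicas {
        go func(r *RemoteReplica) { errs <- r.Replicate(id, vector) }(replica)
    }
    for i := 0; i < len(rm.replicas); i++ {
        if err := <-errs; err == nil {
            acks++
        }
        if acks >= quorum {
            return nil // a majority of copies is durable
        }
    }
    return fmt.Errorf("write quorum not reached: %d/%d acks", acks, total)
}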
Failure detection:
func (c *Cluster) monitorHealth() {
ticker := time.NewTicker(1 * time.Second)
for range ticker.C {
for _, node := range c.nodes {
if !c.ping(node) {
node.missedHeartbeats++
if node.missedHeartbeats >= 2 { // ~2s to detect at a 1s heartbeat, matching the failover demo above
c.handleNodeFailure(node)
}
} else {
node.missedHeartbeats = 0
}
}
}
}
func (c *Cluster) handleNodeFailure(failed *Node) {
log.Printf("Node %s failed, promoting replicas", failed.ID)
for _, shard := range failed.PrimaryShards {
// Find a replica to promote
replica := c.findReplica(shard)
if replica != nil {
c.promoteToPrimary(replica, shard)
}
}
c.updateShardMap()
c.notifyClients()
}
Learning milestones:
- Sharding distributes load → You understand partitioning
- Parallel search works → You understand distributed query execution
- Failover promotes replicas → You understand fault tolerance
- Consistency is maintained → You understand distributed state
Capstone Project: Full-Featured Vector Database
- File: LEARN_VECTOR_DATABASES.md
- Main Programming Language: Rust
- Alternative Programming Languages: C++, Go
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The “Industry Disruptor”
- Difficulty: Level 5: Master
- Knowledge Area: Complete Database System
- Software or Tool: Production Vector Database
- Main Book: All previous + “Database Internals” by Alex Petrov
What you’ll build: A complete, production-ready vector database with: HNSW and IVF-PQ indexes, durable storage, metadata filtering, distributed clustering, gRPC API, and client SDKs.
Why this is the ultimate test: This project brings together algorithms, storage, networking, distributed systems, and API design. Building a complete vector database proves you truly understand how these systems work.
Core challenges you’ll face:
- Integrating all components → maps to system design
- API design → maps to user experience
- Performance tuning → maps to profiling and optimization
- Operational features → maps to monitoring, backups, upgrades
Key Concepts:
- Everything from previous projects
- API Design: gRPC best practices
- Operations: Observability, deployment
Difficulty: Master Time estimate: 3-6 months Prerequisites: All previous projects completed.
Real world outcome:
$ vecdb-server --config production.yaml
╔═══════════════════════════════════════════════════════════╗
║ VecDB Server v1.0 ║
╠═══════════════════════════════════════════════════════════╣
║ Configuration: ║
║ Storage: /data/vecdb ║
║ Index: HNSW (M=32, ef=200) ║
║ Shards: 12 (distributed across 4 nodes) ║
║ Replicas: 2 ║
║ ║
║ Endpoints: ║
║ gRPC: 0.0.0.0:6333 ║
║ HTTP: 0.0.0.0:6334 ║
║ Metrics: 0.0.0.0:9090/metrics ║
║ ║
║ Status: Ready to accept connections ║
╚═══════════════════════════════════════════════════════════╝
# Python client usage
from vecdb import VecDBClient
client = VecDBClient("localhost:6333")
# Create collection
client.create_collection(
name="products",
dimension=768,
metric="cosine",
index_config={
"type": "hnsw",
"m": 32,
"ef_construction": 200
}
)
# Insert vectors
client.upsert(
collection="products",
vectors=[
{"id": "prod_001", "vector": [...], "metadata": {"category": "electronics", "price": 99.99}},
{"id": "prod_002", "vector": [...], "metadata": {"category": "clothing", "price": 49.99}},
...
]
)
# Search with filters
results = client.search(
collection="products",
vector=query_embedding,
limit=10,
filter={"category": "electronics", "price": {"$lt": 100}}
)
# Results
[
{"id": "prod_001", "score": 0.95, "metadata": {...}},
{"id": "prod_042", "score": 0.91, "metadata": {...}},
...
]
Implementation Hints:
System architecture:
┌─────────────────────────────────────────────────────────────┐
│ Client SDKs │
│ (Python, JavaScript, Rust, Go) │
├─────────────────────────────────────────────────────────────┤
│ API Layer (gRPC + HTTP) │
│ ┌────────────┬────────────┬────────────┬────────────┐ │
│ │ Insert │ Search │ Delete │ Admin │ │
│ └────────────┴────────────┴────────────┴────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Query Processor │
│ ┌────────────┬────────────┬────────────┐ │
│ │ Planner │ Optimizer │ Executor │ │
│ └────────────┴────────────┴────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Index Layer │
│ ┌────────────┬────────────┬────────────┐ │
│ │ HNSW │ IVF-PQ │ Metadata │ │
│ └────────────┴────────────┴────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Storage Layer │
│ ┌────────────┬────────────┬────────────┐ │
│ │ WAL │ Vectors │ Index │ │
│ └────────────┴────────────┴────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Distributed Layer │
│ ┌────────────┬────────────┬────────────┐ │
│ │ Sharding │Replication │ Consensus │ │
│ └────────────┴────────────┴────────────┘ │
└─────────────────────────────────────────────────────────────┘
gRPC API definition:
service VecDB {
rpc CreateCollection(CreateCollectionRequest) returns (CreateCollectionResponse);
rpc Upsert(UpsertRequest) returns (UpsertResponse);
rpc Search(SearchRequest) returns (SearchResponse);
rpc Delete(DeleteRequest) returns (DeleteResponse);
rpc GetCollection(GetCollectionRequest) returns (GetCollectionResponse);
}
message SearchRequest {
string collection = 1;
repeated float vector = 2;
uint32 limit = 3;
Filter filter = 4;
SearchParams params = 5;
}
message SearchResponse {
repeated SearchResult results = 1;
uint64 searched_vectors = 2;
float latency_ms = 3;
}
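To get a feel for the wire API, here is a minimal Go client against these definitions. This is a sketch: it assumes protoc-generated stubs in a hypothetical pb package (grpc-go derives NewVecDBClient from `service VecDB`), and the Filter, SearchParams, and SearchResult messages referenced above are deliberately left undefined in the hint.
package main

import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"

    pb "example.com/vecdb/gen" // hypothetical protoc-generated package
)

func main() {
    // Connect to the gRPC endpoint from the server banner above
    conn, err := grpc.Dial("localhost:6333",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    if err != nil {
        log.Fatalf("dial: %v", err)
    }
    defer conn.Close()

    client := pb.NewVecDBClient(conn) // generated from `service VecDB`
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()

    queryVec := []float32{0.12, -0.34, 0.56} // 768 dims in practice; truncated here
    resp, err := client.Search(ctx, &pb.SearchRequest{
        Collection: "products",
        Vector:     queryVec,
        Limit:      10,
    })
    if err != nil {
        log.Fatalf("search: %v", err)
    }
    log.Printf("%d results in %.2f ms", len(resp.Results), resp.LatencyMs)
}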
Collection management:
struct Collection {
name: String,
dimension: usize,
metric: DistanceMetric,
// Index
index: Box<dyn VectorIndex>,
// Storage
storage: StorageEngine,
// Metadata
metadata_index: MetadataIndex,
// Configuration
config: CollectionConfig,
}
impl Collection {
fn upsert(&mut self, vectors: Vec<VectorRecord>) -> Result<()> {
// Write to WAL first
self.storage.wal_append(&vectors)?;
// Update index
for record in &vectors {
self.index.insert(record.id, &record.vector)?;
self.metadata_index.insert(record.id, &record.metadata)?;
}
// Flush periodically
if self.should_flush() {
self.flush()?;
}
Ok(())
}
fn search(&self, query: &[f32], k: usize, filter: Option<Filter>) -> Result<Vec<SearchResult>> {
let valid_ids = match filter {
Some(f) => Some(self.metadata_index.filter(&f)?),
None => None,
};
self.index.search(query, k, valid_ids)
}
}
Metrics and monitoring:
struct Metrics {
    insert_latency: Histogram,
    search_latency: Histogram,
    vectors_total: Gauge,
    index_size_bytes: Gauge,
    // QPS is derived from this counter by the monitoring system (e.g., Prometheus rate())
    queries_total: Counter,
}
impl Metrics {
    fn record_search(&self, latency: Duration) {
        self.search_latency.observe(latency.as_secs_f64());
        self.queries_total.inc();
    }
}
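The Rust sketch above is library-agnostic. For reference, a minimal sketch of exposing equivalent metrics on the :9090/metrics endpoint from the server banner, using Go's prometheus/client_golang (metric names are illustrative, not from the project):
package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    searchLatency = promauto.NewHistogram(prometheus.HistogramOpts{
        Name: "vecdb_search_latency_seconds",
        Help: "Distribution of search latencies.",
    })
    queriesTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "vecdb_queries_total",
        Help: "Total queries served; rate() over this yields QPS.",
    })
)

func main() {
    // Example observation, as record_search would do on every query
    searchLatency.Observe(0.0018) // 1.8ms
    queriesTotal.Inc()

    // Expose /metrics for the Prometheus scraper
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}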
Learning milestones:
- Basic CRUD works → Core functionality complete
- Filtered search works → Advanced querying done
- Distributed mode works → Scalability achieved
- Production deployed → Operational readiness
- Handles real workloads → Production quality
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Brute Force k-NN | Beginner | Weekend | ⭐⭐ | ⭐⭐⭐ |
| 2. LSH | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3. IVF | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 4. Product Quantization | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. HNSW | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 6. IVF-PQ | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. DiskANN-style | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 8. Filtered Search | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Storage Engine | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. Distributed Cluster | Master | 4-6 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Capstone: Full DB | Master | 3-6 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
Recommended Learning Path
Path A: The Complete Journey (6-12 months)
- Weeks 1-2: Projects 1-2 (Fundamentals + LSH)
- Weeks 3-5: Projects 3-4 (IVF + PQ)
- Weeks 6-9: Project 5 (HNSW - the key algorithm)
- Weeks 10-12: Project 6 (IVF-PQ composite)
- Weeks 13-16: Projects 7-8 (Disk-based + Filtering)
- Weeks 17-20: Project 9 (Storage Engine)
- Weeks 21-26: Project 10 (Distributed)
- Months 7-12: Capstone
Path B: Practical Focus (2-3 months)
If you want to understand production systems quickly:
1 → 3 → 5 → 8 → (use existing storage/distribution)
This covers: brute force baseline, IVF, HNSW, and filtering—enough to work with any production vector database.
Path C: Algorithm Deep Dive (2-3 months)
If you want to understand the algorithms deeply:
1 → 2 → 3 → 4 → 5 → 6
This covers all the major ANN algorithms without the systems complexity.
Recommendation Based on Your Goals
If you’re building RAG applications: Do Projects 1, 3, 5, 8. You’ll understand how to choose and tune indexes.
If you want to contribute to vector DBs: Do the full path. Open-source projects like Qdrant and Milvus need contributors who understand internals.
If you’re researching ANN algorithms: Focus on Projects 2-6, read the original papers, benchmark against ann-benchmarks.com.
Essential Resources
Papers
- HNSW Paper - Malkov & Yashunin
- Product Quantization Paper - Jégou, Douze, Schmid
- DiskANN Paper - Microsoft Research
- Billion-scale ANN - FAISS paper
Tutorials and Guides
- Pinecone Learning Center - Excellent visual explanations
- Zilliz Learn - Milvus team’s tutorials
- FAISS Wiki - Official FAISS documentation
Benchmarks
- ANN-Benchmarks - Standard benchmark for ANN algorithms
- Big-ANN Benchmarks - Billion-scale benchmarks
Books
- “Designing Data-Intensive Applications” by Martin Kleppmann
- “Mining of Massive Datasets” by Leskovec, Rajaraman, Ullman
- “Introduction to Information Retrieval” by Manning, Raghavan, Schütze
Libraries to Study
- FAISS - Facebook’s library (C++ with Python bindings)
- hnswlib - Original HNSW implementation
- Qdrant - Rust vector database
- Milvus - Distributed vector database
Summary
| # | Project Name | Main Language |
|---|---|---|
| 1 | Brute Force k-NN and Distance Metrics | Python |
| 2 | Locality Sensitive Hashing (LSH) | Python |
| 3 | Inverted File Index (IVF) | Python |
| 4 | Product Quantization (PQ) | Python |
| 5 | HNSW Index | Python |
| 6 | IVF-PQ Composite Index | Python |
| 7 | Disk-Based Vector Index (DiskANN-style) | C++ |
| 8 | Metadata Filtering with Vector Search | Python |
| 9 | Vector Database Storage Engine | Rust |
| 10 | Distributed Vector Search Cluster | Go |
| Capstone | Full-Featured Vector Database | Rust |
Vector databases are where algorithms meet systems engineering. The best way to understand them is to build one. Start with brute force, feel the pain, then appreciate the elegance of HNSW and the efficiency of IVF-PQ. By the end, you’ll know exactly what happens when you call collection.search().