

Learn Vector Databases: From Zero to Building Your Own

Goal: Deeply understand how vector databases work internally—from the mathematics of similarity search through indexing algorithms like HNSW and IVF, to building a production-ready vector search engine. Master the technology powering modern AI applications.


Why Vector Databases Matter

Every time you use semantic search, image similarity, recommendation systems, or RAG (Retrieval-Augmented Generation) with LLMs, a vector database is doing the heavy lifting. Traditional databases search by exact match; vector databases search by meaning.

The challenge: How do you efficiently find the 10 most similar items among a billion vectors, each with 1536 dimensions? Brute force would take hours. Vector databases do it in milliseconds.

After completing these projects, you will:

  • Understand exactly how similarity search works at scale
  • Implement HNSW, IVF, LSH, and other ANN algorithms from scratch
  • Know when to use each algorithm and what tradeoffs to make
  • Build a complete vector database that handles real workloads
  • Appreciate the engineering behind Pinecone, Milvus, Qdrant, and others

Core Concept Analysis

What is a Vector?

In the context of vector databases, a vector (or embedding) is a list of numbers that represents the meaning or features of something:

"The quick brown fox" → [0.12, -0.34, 0.56, ..., 0.78]  (1536 dimensions)
[Image of a cat]      → [0.23, 0.45, -0.67, ..., 0.89]  (512 dimensions)
[Product attributes]  → [0.91, -0.23, 0.45, ..., 0.12]  (256 dimensions)

These vectors come from ML models (OpenAI embeddings, CLIP, sentence-transformers) that position similar items close together in high-dimensional space.
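
For example, text embeddings can be produced like this (a minimal sketch, assuming the sentence-transformers package and its all-MiniLM-L6-v2 model, which outputs 384-dimensional vectors):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["The quick brown fox", "A fast auburn fox"])
print(embeddings.shape)  # (2, 384): one vector per sentence; similar sentences land close together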

Distance Metrics

How do we measure “closeness” between vectors?

  1. Euclidean Distance (L2): Straight-line distance in space
    d(a, b) = √(Σ(aᵢ - bᵢ)²)
    
    • Good for: Absolute position matters
    • Range: [0, ∞)
  2. Cosine Similarity: Angle between vectors (normalized)
    cos(a, b) = (a · b) / (||a|| × ||b||)
    
    • Good for: Direction matters, magnitude doesn’t
    • Range: [-1, 1] (1 = identical, 0 = orthogonal, -1 = opposite)
  3. Dot Product (Inner Product): Combines magnitude and direction
    a · b = Σ(aᵢ × bᵢ)
    
    • Good for: When vectors are already normalized
    • Range: (-∞, ∞)
  4. Manhattan Distance (L1): Sum of absolute differences
    d(a, b) = Σ|aᵢ - bᵢ|
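
As a concrete reference, here are all four metrics in NumPy (a minimal sketch; a and b are equal-length 1-D arrays):

import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    return np.dot(a, b)

def manhattan(a, b):
    return np.sum(np.abs(a - b))

a, b = np.array([1.0, 0.0]), np.array([1.0, 1.0])
print(euclidean(a, b), cosine_similarity(a, b), dot_product(a, b), manhattan(a, b))
# 1.0  0.7071...  1.0  1.0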
    

The Curse of Dimensionality

Why is vector search hard?

In high dimensions (100+), counterintuitive things happen:

  • All points are “far”: distances between random pairs of points concentrate around nearly the same value
  • Volume explodes: A hypercube with side 1 has volume 1, yet the ball inscribed in it captures a vanishing fraction of that volume as dimensions grow: almost all of it sits in the “corners”
  • Brute force is slow: Comparing a query to 1 billion vectors = 1 billion distance calculations

This is why we need Approximate Nearest Neighbor (ANN) algorithms—trading perfect accuracy for massive speedups.
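
The distance-concentration effect is easy to verify empirically (a small simulation sketch): as dimensionality grows, the spread of pairwise distances shrinks relative to their mean, so "nearest" becomes a fragile notion.

import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    points = rng.random((1000, dim))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)  # distances to one reference point
    print(f"dim={dim:5d}  relative spread = {dists.std() / dists.mean():.3f}")
# The ratio falls toward zero: in high dimensions, everything looks roughly equidistant.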

The ANN Tradeoff

Accuracy (Recall)  ←——————————————————→  Speed
    100%                                    0.001ms

    Brute Force (exact)                     Too slow
    HNSW (graph)                            Great balance
    IVF (clustering)                        Faster, less accurate
    LSH (hashing)                           Fastest, least accurate

Recall@K: Of the true K nearest neighbors, what fraction did we find?

  • Recall@10 = 0.95 means we found 95% of the true top-10 neighbors
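
Measuring it is a one-liner (a minimal sketch, comparing ANN output against an exact brute-force baseline):

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# recall_at_k(ann_result, brute_force_result, 10) == 0.9 -> 9 of the true top-10 found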

Vector Database Architecture

┌─────────────────────────────────────────────────────────┐
│                     Client Layer                         │
│         (REST API, gRPC, SDKs, Query Language)          │
├─────────────────────────────────────────────────────────┤
│                     Query Layer                          │
│    (Query Planning, Filtering, Caching, Result Ranking) │
├─────────────────────────────────────────────────────────┤
│                     Index Layer                          │
│   ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐   │
│   │  HNSW   │  │   IVF   │  │   LSH   │  │  Flat   │   │
│   └─────────┘  └─────────┘  └─────────┘  └─────────┘   │
├─────────────────────────────────────────────────────────┤
│                    Storage Layer                         │
│    (Vector Storage, Metadata, WAL, Compression/PQ)      │
├─────────────────────────────────────────────────────────┤
│                  Distributed Layer                       │
│      (Sharding, Replication, Consistency, Routing)      │
└─────────────────────────────────────────────────────────┘

Project List

Projects progress from fundamental algorithms to a complete vector database implementation.


Project 1: Brute Force k-NN and Distance Metrics

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, C++, Go
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Vector Mathematics / Similarity Search
  • Software or Tool: Distance Calculator
  • Main Book: “Mining of Massive Datasets” by Leskovec, Rajaraman, Ullman (Chapter 3)

What you’ll build: A brute-force k-nearest neighbors search engine supporting multiple distance metrics (Euclidean, cosine, dot product), with optimizations using NumPy vectorization and SIMD.

Why it teaches fundamentals: Before learning approximate algorithms, you must understand exact search. This establishes the baseline that all other algorithms try to beat. You’ll also learn why brute force becomes impossible at scale.

Core challenges you’ll face:

  • Implementing distance metrics correctly → maps to vector mathematics
  • Vectorizing with NumPy → maps to batch operations for speed
  • Memory management for large datasets → maps to chunked processing
  • Benchmarking accurately → maps to measuring recall and latency

Key Concepts:

  • Vector Operations: Linear Algebra fundamentals
  • Distance Metrics: “Mining of Massive Datasets” Chapter 3 - Leskovec et al.
  • NumPy Broadcasting: NumPy documentation

Difficulty: Beginner
Time estimate: Weekend
Prerequisites: Basic Python, understanding of vectors/matrices. NumPy familiarity helpful.

Real world outcome:

$ ./vector_search --dataset random_10k_128d.npy --query query.npy --k 10

Loading dataset: 10,000 vectors × 128 dimensions
Query vector: 128 dimensions

Brute Force k-NN Results (k=10):
╔═══════╤═══════════╤════════════════╗
║ Rank  │ Vector ID │ Distance (L2)  ║
╠═══════╪═══════════╪════════════════╣
║   1   │   4,521   │    0.823       ║
║   2   │   7,892   │    0.891       ║
║   3   │   2,103   │    0.912       ║
║  ...  │    ...    │     ...        ║
║  10   │   9,234   │    1.102       ║
╚═══════╧═══════════╧════════════════╝

Performance:
  Distance calculations: 10,000
  Time: 2.3ms (vectorized), 156ms (naive loop)
  Throughput: 4,347 queries/second

$ ./vector_search --dataset random_1M_768d.npy --query query.npy --k 10
Loading dataset: 1,000,000 vectors × 768 dimensions

Performance:
  Time: 1,847ms (too slow for production!)
  This is why we need approximate algorithms...

Implementation Hints:

Distance metric implementations:

Euclidean Distance (L2):
  d(q, v) = sqrt(sum((q[i] - v[i])^2 for i in range(dim)))

Cosine Similarity:
  sim(q, v) = dot(q, v) / (norm(q) * norm(v))
  distance = 1 - sim  (to convert to distance)

Dot Product:
  score(q, v) = sum(q[i] * v[i] for i in range(dim))
  (higher is better, not a distance)

Naive implementation (slow):

def knn_naive(query, vectors, k):
    distances = []
    for i, v in enumerate(vectors):
        d = euclidean_distance(query, v)
        distances.append((i, d))
    distances.sort(key=lambda x: x[1])
    return distances[:k]

Vectorized implementation (fast):

def knn_vectorized(query, vectors, k):
    # Compute all distances at once
    # (query - vectors)^2 = query^2 - 2*query*vectors + vectors^2
    query_sq = np.sum(query ** 2)
    vectors_sq = np.sum(vectors ** 2, axis=1)
    cross_term = 2 * np.dot(vectors, query)

    distances = query_sq - cross_term + vectors_sq
    # Skip sqrt for ranking (monotonic transformation)

    # Partial sort for top-k
    indices = np.argpartition(distances, k)[:k]
    top_k = indices[np.argsort(distances[indices])]

    return top_k, distances[top_k]

Key optimization insight: For ranking, you don’t need the actual distance—just the ordering. So skip expensive operations like sqrt() that don’t change the ranking.

Chunked processing for large datasets:

def knn_chunked(query, vectors, k, chunk_size=100_000):
    candidates = []
    for i in range(0, len(vectors), chunk_size):
        chunk = vectors[i:i+chunk_size]
        # knn_vectorized returns (indices, distances); offset indices into the full dataset
        indices, dists = knn_vectorized(query, chunk, k)
        candidates.extend((idx + i, d) for idx, d in zip(indices, dists))

    candidates.sort(key=lambda x: x[1])
    return candidates[:k]

Learning milestones:

  1. Distance metrics compute correctly → You understand vector math
  2. Vectorized version is 50x+ faster → You understand batch operations
  3. You hit the wall at 1M vectors → You understand why ANN is needed
  4. Recall@10 is always 100% → You understand exact search baseline

Project 2: Locality Sensitive Hashing (LSH)

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, C++, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Hashing / Probabilistic Data Structures
  • Software or Tool: LSH Index
  • Main Book: “Mining of Massive Datasets” by Leskovec, Rajaraman, Ullman (Chapter 3)

What you’ll build: A Locality Sensitive Hashing index for approximate nearest neighbor search, using random hyperplanes for cosine similarity or random projections for Euclidean distance.

Why it teaches LSH: LSH is the simplest ANN algorithm to understand. It’s based on a beautiful insight: if we hash vectors such that similar vectors likely get the same hash, we can use hash tables for fast lookup. Understanding LSH makes other algorithms clearer.

Core challenges you’ll face:

  • Designing hash functions → maps to random hyperplanes and projections
  • Amplification (AND/OR) → maps to balancing precision and recall
  • Multi-probe LSH → maps to checking nearby buckets
  • Parameter tuning → maps to number of tables and hash functions

Key Concepts:

  • LSH Fundamentals: “Mining of Massive Datasets” Chapter 3 - Leskovec et al.
  • Random Projections: Tyler Neylon’s LSH Introduction
  • Amplification: LSH literature on AND/OR constructions

Difficulty: Intermediate
Time estimate: 1-2 weeks
Prerequisites: Project 1 completed. Understanding of hash tables and probability.

Real world outcome:

$ ./lsh_search --build --dataset sift_1M.npy --tables 20 --hashes 8

Building LSH Index:
  Dataset: 1,000,000 vectors × 128 dimensions
  Hash tables: 20
  Hashes per table: 8
  Total hash functions: 160

Build time: 12.3s
Index size: 245 MB (vs 512 MB raw data)

$ ./lsh_search --query query.npy --k 10

LSH Query Results:
╔═══════╤═══════════╤══════════════╤═════════════════╗
║ Rank  │ Vector ID │ Distance     │ Found in Table  ║
╠═══════╪═══════════╪══════════════╪═════════════════╣
║   1   │   4,521   │    0.823     │      3, 7, 12   ║
║   2   │   7,892   │    0.891     │      7, 15      ║
║  ...  │    ...    │     ...      │       ...       ║
╚═══════╧═══════════╧══════════════╧═════════════════╝

Performance:
  Candidates examined: 2,042 (vs 1,000,000 brute force)
  Time: 4.2ms (vs 1,847ms brute force)
  Speedup: 440x
  Recall@10: 0.87 (87% of true neighbors found)

$ ./lsh_search --sweep-params
Sweeping parameters for optimal recall/speed...

Tables │ Hashes │ Recall@10 │ Query Time │ Index Size
───────┼────────┼───────────┼────────────┼───────────
   5   │   4    │   0.62    │   1.1ms    │   61 MB
  10   │   6    │   0.78    │   2.3ms    │  122 MB
  20   │   8    │   0.87    │   4.2ms    │  245 MB
  40   │  10    │   0.94    │   8.7ms    │  490 MB

Implementation Hints:

The key insight of LSH:

Hash function property:
  P(h(a) = h(b)) ∝ similarity(a, b)

If vectors are similar → high probability of same hash
If vectors are different → low probability of same hash

For cosine similarity, use random hyperplanes:

def random_hyperplane_hash(vector, hyperplane):
    # Returns 1 if vector is on positive side, 0 otherwise
    return 1 if np.dot(vector, hyperplane) >= 0 else 0

def compute_hash(vector, hyperplanes):
    # Combine multiple hyperplane hashes into one integer
    hash_value = 0
    for i, hp in enumerate(hyperplanes):
        bit = random_hyperplane_hash(vector, hp)
        hash_value |= (bit << i)
    return hash_value

LSH Index structure:

from collections import defaultdict

class LSHIndex:
    def __init__(self, num_tables, hashes_per_table, dim):
        self.vectors = {}  # vector_id -> vector, kept for re-ranking
        self.tables = []
        for _ in range(num_tables):
            hyperplanes = np.random.randn(hashes_per_table, dim)
            hyperplanes /= np.linalg.norm(hyperplanes, axis=1, keepdims=True)
            table = {
                'hyperplanes': hyperplanes,
                'buckets': defaultdict(list)  # hash -> [vector_ids]
            }
            self.tables.append(table)

    def insert(self, vector_id, vector):
        self.vectors[vector_id] = vector
        for table in self.tables:
            hash_val = compute_hash(vector, table['hyperplanes'])
            table['buckets'][hash_val].append(vector_id)

    def query(self, vector, k):
        candidates = set()
        for table in self.tables:
            hash_val = compute_hash(vector, table['hyperplanes'])
            candidates.update(table['buckets'][hash_val])

        # Re-rank candidates by actual distance
        results = []
        for cand_id in candidates:
            dist = distance(vector, self.vectors[cand_id])
            results.append((cand_id, dist))

        results.sort(key=lambda x: x[1])
        return results[:k]

Amplification (AND/OR construction):

  • More hashes per table (AND): Fewer candidates, higher precision
  • More tables (OR): More candidates, higher recall
  • Balance: Tune based on your recall requirements

Multi-probe LSH: Instead of only checking the exact bucket, also check nearby buckets (flip one bit at a time). Increases recall without adding tables.
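
Both ideas can be made concrete. For random hyperplanes, two vectors at angle θ agree on one hash bit with probability 1 - θ/π; AND-ing h bits per table and OR-ing across L tables amplifies that, and multi-probe adds the buckets one bit-flip away (a sketch; the helper names are illustrative):

import numpy as np

def candidate_probability(theta, hashes_per_table, num_tables):
    """P(a vector at angle theta to the query collides in at least one table)."""
    p_bit = 1 - theta / np.pi               # one hyperplane bit agrees
    p_table = p_bit ** hashes_per_table     # AND: all bits in a table agree
    return 1 - (1 - p_table) ** num_tables  # OR: any of the tables collides

print(candidate_probability(np.pi / 8, 8, 20))  # ~1.00: similar vectors almost always surface
print(candidate_probability(np.pi / 3, 8, 20))  # ~0.55: dissimilar vectors are often pruned

def multiprobe_buckets(hash_value, num_bits):
    """The exact bucket plus every bucket at Hamming distance 1."""
    return [hash_value] + [hash_value ^ (1 << i) for i in range(num_bits)]

# query() would then union table['buckets'][h] over multiprobe_buckets(h, 8)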

Learning milestones:

  1. Hash function maps similar vectors together → You understand LSH basics
  2. Multiple tables improve recall → You understand amplification
  3. Parameter tuning affects recall/speed → You understand the tradeoff
  4. Multi-probe improves efficiency → You understand advanced LSH

Project 3: Inverted File Index (IVF)

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: C++, Rust, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Clustering / Inverted Indexes
  • Software or Tool: IVF Index
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

What you’ll build: An Inverted File (IVF) index that clusters vectors with k-means, then searches only nearby clusters at query time—the foundation of production vector search.

Why it teaches IVF: IVF is the workhorse of production vector databases. It’s conceptually simple (just clustering!) but has nuances around cluster count, nprobe tuning, and training data. Understanding IVF unlocks FAISS and most production systems.

Core challenges you’ll face:

  • K-means clustering at scale → maps to efficient training
  • Choosing number of clusters → maps to sqrt(N) heuristic
  • nprobe tuning → maps to recall vs speed tradeoff
  • Non-uniform cluster sizes → maps to data distribution challenges

Key Concepts:

  • Inverted Index: “Introduction to Information Retrieval” Chapter 1 - Manning et al.
  • K-Means Clustering: “Pattern Recognition and Machine Learning” - Bishop
  • IVF in FAISS: FAISS documentation

Difficulty: Advanced
Time estimate: 2 weeks
Prerequisites: Project 1 completed. Understanding of k-means clustering.

Real world outcome:

$ ./ivf_search --build --dataset sift_1M.npy --clusters 1000

Building IVF Index:
  Dataset: 1,000,000 vectors × 128 dimensions
  Training k-means with 100,000 sample vectors...
  K-means iterations: 25, converged: True
  Clusters: 1,000 (avg 1,000 vectors per cluster)

Cluster size distribution:
  Min: 423, Max: 2,156, Median: 987, StdDev: 312

Build time: 45.2s
Index size: 134 MB (centroids) + 512 MB (data)

$ ./ivf_search --query query.npy --k 10 --nprobe 10

IVF Query Results (nprobe=10):
  Clusters searched: 10 (1% of index)
  Vectors compared: 9,870

╔═══════╤═══════════╤══════════════╤═════════════════╗
║ Rank  │ Vector ID │ Distance     │ Cluster ID      ║
╠═══════╪═══════════╪══════════════╪═════════════════╣
║   1   │   4,521   │    0.823     │      127        ║
║   2   │   7,892   │    0.891     │      127        ║
║  ...  │    ...    │     ...      │       ...       ║
╚═══════╧═══════════╧══════════════╧═════════════════╝

Performance:
  Time: 8.3ms (vs 1,847ms brute force)
  Speedup: 222x
  Recall@10: 0.92

$ ./ivf_search --nprobe-sweep
nprobe │ Recall@10 │ Query Time │ Vectors Compared
───────┼───────────┼────────────┼─────────────────
    1  │   0.52    │   1.2ms    │      987
    5  │   0.81    │   4.8ms    │    4,935
   10  │   0.92    │   8.3ms    │    9,870
   20  │   0.97    │  15.2ms    │   19,740
   50  │   0.99    │  38.1ms    │   49,350

Implementation Hints:

IVF structure:

class IVFIndex:
    def __init__(self, num_clusters, dim):
        self.num_clusters = num_clusters
        self.dim = dim
        self.centroids = None  # (num_clusters, dim)
        self.inverted_lists = [[] for _ in range(num_clusters)]
        # inverted_lists[i] = [(vector_id, vector), ...]

    def train(self, training_vectors):
        # Run k-means to find cluster centroids
        from sklearn.cluster import KMeans
        kmeans = KMeans(n_clusters=self.num_clusters, n_init=1)
        kmeans.fit(training_vectors)
        self.centroids = kmeans.cluster_centers_

    def add(self, vector_id, vector):
        # Find nearest centroid
        distances = np.linalg.norm(self.centroids - vector, axis=1)
        cluster_id = np.argmin(distances)
        self.inverted_lists[cluster_id].append((vector_id, vector))

    def search(self, query, k, nprobe):
        # Find nprobe nearest centroids
        centroid_distances = np.linalg.norm(self.centroids - query, axis=1)
        nearest_clusters = np.argsort(centroid_distances)[:nprobe]

        # Search within those clusters
        candidates = []
        for cluster_id in nearest_clusters:
            for vector_id, vector in self.inverted_lists[cluster_id]:
                dist = np.linalg.norm(query - vector)
                candidates.append((vector_id, dist))

        # Sort and return top-k
        candidates.sort(key=lambda x: x[1])
        return candidates[:k]
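
A usage sketch tying the pieces together (synthetic data; the cluster count follows the heuristic below):

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((100_000, 128)).astype(np.float32)

index = IVFIndex(num_clusters=316, dim=128)   # ~sqrt(100k) clusters
index.train(data[:20_000])                    # train centroids on a sample
for i, vec in enumerate(data):
    index.add(i, vec)

query = rng.standard_normal(128).astype(np.float32)
results = index.search(query, k=10, nprobe=8)  # raise nprobe for higher recall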

Choosing number of clusters:

Rule of thumb: num_clusters ≈ sqrt(num_vectors)

1M vectors → ~1,000 clusters
100M vectors → ~10,000 clusters

Too few clusters: Each cluster too large, slow search
Too many clusters: Centroid comparison overhead, missed neighbors

Training considerations:

  • Train on a representative sample (10-20% of data is often enough)
  • Ensure training data has same distribution as full dataset
  • More k-means iterations = better centroids, but slower training

Non-exhaustive search pitfall:

If nearest neighbor is in cluster 5, but query is closest to
centroid of cluster 3, we might miss it with low nprobe.

Solution: Higher nprobe, or use HNSW to find nearby centroids faster

Learning milestones:

  1. K-means clusters vectors → You understand the preprocessing
  2. nprobe controls recall → You understand the search phase
  3. Cluster imbalance affects performance → You understand data distribution
  4. You compare with FAISS → You validate your implementation

Project 4: Product Quantization (PQ)

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Vector Compression / Quantization
  • Software or Tool: PQ Encoder/Decoder
  • Main Book: “Product Quantization for Nearest Neighbor Search” - Jégou et al.

What you’ll build: A Product Quantization system that compresses 128-dimensional float vectors into just 8-16 bytes, enabling 97% memory reduction while maintaining search quality.

Why it teaches quantization: Memory is often the bottleneck for large vector databases. PQ is the key technique that makes billion-scale search possible on a single machine. Understanding PQ unlocks IVF-PQ, the most common production configuration.

Core challenges you’ll face:

  • Subspace decomposition → maps to splitting vectors into segments
  • Codebook training → maps to k-means per subspace
  • Asymmetric distance computation → maps to query vs database accuracy
  • Distance lookup tables → maps to precomputed distances for speed

Key Concepts:

  • Product Quantization: Original paper by Jégou, Douze, Schmid
  • Vector Quantization: Lloyd’s algorithm / k-means
  • Asymmetric Distance: ADC (Asymmetric Distance Computation)

Difficulty: Advanced
Time estimate: 2-3 weeks
Prerequisites: Projects 1, 3 completed. Understanding of k-means.

Real world outcome:

$ ./pq_compress --dataset sift_1M.npy --segments 8 --bits 8

Training Product Quantization:
  Original: 1,000,000 vectors × 128 dims × 4 bytes = 512 MB
  Segments: 8 (16 dims each)
  Centroids per segment: 256 (8 bits)

Training codebooks (k-means per segment)...
  Segment 0: 256 centroids, trained on 1M subvectors
  Segment 1: 256 centroids, trained on 1M subvectors
  ...
  Segment 7: 256 centroids, trained on 1M subvectors

Compressing vectors...
  Compressed: 1,000,000 vectors × 8 bytes = 8 MB
  Compression ratio: 64x (98.4% reduction!)

Codebook size: 8 segments × 256 centroids × 16 dims × 4 bytes = 128 KB

$ ./pq_search --query query.npy --k 10

PQ Search with ADC (Asymmetric Distance Computation):
  Building distance lookup table: 0.02ms
  Scanning 1M compressed codes: 12.3ms

Results:
╔═══════╤═══════════╤════════════════╤═════════════════╗
║ Rank  │ Vector ID │ Approx Dist    │ True Dist       ║
╠═══════╪═══════════╪════════════════╪═════════════════╣
║   1   │   4,521   │    0.841       │    0.823        ║
║   2   │   7,102   │    0.923       │    0.905        ║
║  ...  │    ...    │     ...        │      ...        ║
╚═══════╧═══════════╧════════════════╧═════════════════╝

Quality:
  Recall@10: 0.89 (vs exact search)
  Mean distance error: 3.2%

Performance:
  Time: 12.3ms (vs 1,847ms brute force on raw vectors)
  Memory: 8 MB (vs 512 MB raw)
  Speedup: 150x

Implementation Hints:

Product Quantization structure:

Original vector (128 dims):
[v0, v1, v2, ..., v127]

Split into M=8 subvectors (16 dims each):
[v0-v15], [v16-v31], ..., [v112-v127]

Each subvector → quantized to one of 256 centroids
Compressed code: [c0, c1, c2, c3, c4, c5, c6, c7]  (8 bytes total)

Implementation:

from sklearn.cluster import KMeans

class ProductQuantizer:
    def __init__(self, dim, num_segments, num_centroids=256):
        self.dim = dim
        self.num_segments = num_segments
        self.segment_dim = dim // num_segments
        self.num_centroids = num_centroids
        self.codebooks = []  # One codebook per segment

    def train(self, vectors):
        for seg in range(self.num_segments):
            start = seg * self.segment_dim
            end = start + self.segment_dim
            subvectors = vectors[:, start:end]

            kmeans = KMeans(n_clusters=self.num_centroids)
            kmeans.fit(subvectors)
            self.codebooks.append(kmeans.cluster_centers_)

    def encode(self, vector):
        codes = []
        for seg in range(self.num_segments):
            start = seg * self.segment_dim
            end = start + self.segment_dim
            subvector = vector[start:end]

            # Find nearest centroid
            distances = np.linalg.norm(self.codebooks[seg] - subvector, axis=1)
            code = np.argmin(distances)
            codes.append(code)

        return np.array(codes, dtype=np.uint8)

    def decode(self, codes):
        vector = []
        for seg, code in enumerate(codes):
            vector.extend(self.codebooks[seg][code])
        return np.array(vector)

Asymmetric Distance Computation (ADC):

def compute_distance_tables(self, query):
    """Precompute distances from query subvectors to all centroids"""
    tables = []
    for seg in range(self.num_segments):
        start = seg * self.segment_dim
        end = start + self.segment_dim
        query_sub = query[start:end]

        # Distance from query subvector to each centroid
        distances = np.linalg.norm(self.codebooks[seg] - query_sub, axis=1)
        tables.append(distances)

    return np.array(tables)  # Shape: (num_segments, num_centroids)

def search_pq(self, query, codes, k):
    """Search using precomputed distance tables"""
    tables = self.compute_distance_tables(query)

    # Compute approximate distance to each code
    distances = np.zeros(len(codes))
    for seg in range(self.num_segments):
        distances += tables[seg, codes[:, seg]] ** 2

    # Top-k
    indices = np.argpartition(distances, k)[:k]
    return indices[np.argsort(distances[indices])]

Why ADC is fast:

  • Query: Compute 8 × 256 = 2,048 subvector distances (one time)
  • Per vector: 8 table lookups + 8 additions
  • 1M vectors: 8M lookups vs 128M multiplications
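
A usage sketch of the quantizer end to end (synthetic data; assumes compute_distance_tables and search_pq above are attached to ProductQuantizer as methods):

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((50_000, 128)).astype(np.float32)

pq = ProductQuantizer(dim=128, num_segments=8)   # 8 bytes per compressed vector
pq.train(data[:10_000])

codes = np.stack([pq.encode(v) for v in data])   # shape (50_000, 8), dtype uint8
query = rng.standard_normal(128).astype(np.float32)
top = pq.search_pq(query, codes, k=10)           # ADC scan over compressed codes

# What 64x compression costs in fidelity:
relative_error = np.linalg.norm(data[0] - pq.decode(codes[0])) / np.linalg.norm(data[0])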

Learning milestones:

  1. Codebooks train successfully → You understand quantization
  2. Encoding/decoding works → You understand the representation
  3. ADC is much faster than brute force → You understand the speedup
  4. Recall drops gracefully → You understand the accuracy tradeoff

Project 5: HNSW (Hierarchical Navigable Small World)

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: C++, Rust, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Graph-Based Search / Skip Lists
  • Software or Tool: HNSW Index
  • Main Book: “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs” - Malkov & Yashunin

What you’ll build: A complete HNSW implementation—the state-of-the-art algorithm for vector search, combining navigable small-world graphs with a hierarchical structure for logarithmic search complexity.

Why it teaches HNSW: HNSW is the default choice for most vector databases (Pinecone, Qdrant, Weaviate, pgvector). It offers the best balance of speed, accuracy, and flexibility. Understanding it deeply means understanding modern vector search.

Core challenges you’ll face:

  • Graph construction → maps to finding good neighbors during insertion
  • Hierarchical layers → maps to skip-list-like navigation
  • Greedy search → maps to navigating the graph efficiently
  • Parameter tuning (M, efConstruction, efSearch) → maps to quality vs speed

Key Concepts:

  • Small World Graphs: Watts-Strogatz model
  • Skip Lists: Probabilistic data structure
  • Greedy Search: Local optimization in graphs
  • HNSW Paper: Malkov & Yashunin, 2016

Difficulty: Expert
Time estimate: 3-4 weeks
Prerequisites: Projects 1-2 completed. Understanding of graph data structures.

Real world outcome:

$ ./hnsw_build --dataset sift_1M.npy --M 16 --efConstruction 200

Building HNSW Index:
  Dataset: 1,000,000 vectors × 128 dimensions
  M (max connections): 16
  efConstruction: 200

Construction progress:
  Layer 3:        12 nodes (entry points)
  Layer 2:       234 nodes
  Layer 1:     4,891 nodes
  Layer 0: 1,000,000 nodes (all vectors)

Build time: 8m 32s
Index size: 287 MB (graph) + 512 MB (vectors) = 799 MB

Graph statistics:
  Avg connections at layer 0: 14.2
  Max layer reached: 4
  Graph diameter (estimated): 6 hops

$ ./hnsw_search --query query.npy --k 10 --ef 50

HNSW Search (ef=50):
  Starting from entry point at layer 3
  Layer 3: 2 hops → found node 8,234
  Layer 2: 3 hops → found node 12,891
  Layer 1: 4 hops → found node 4,521
  Layer 0: Expanded 50 candidates

Results:
╔═══════╤═══════════╤══════════════╗
║ Rank  │ Vector ID │ Distance     ║
╠═══════╪═══════════╪══════════════╣
║   1   │   4,521   │    0.823     ║
║   2   │   7,892   │    0.891     ║
║  ...  │    ...    │     ...      ║
╚═══════╧═══════════╧══════════════╝

Performance:
  Nodes visited: 143
  Distance calculations: 143 (vs 1,000,000 brute force)
  Time: 0.8ms
  Recall@10: 0.98

$ ./hnsw_search --ef-sweep
ef   │ Recall@10 │ Query Time │ Nodes Visited
─────┼───────────┼────────────┼──────────────
  10 │   0.72    │   0.2ms    │       38
  20 │   0.88    │   0.4ms    │       67
  50 │   0.98    │   0.8ms    │      143
 100 │   0.995   │   1.4ms    │      271
 200 │   0.999   │   2.6ms    │      523

Implementation Hints:

HNSW key insight: Combine two ideas:

  1. Navigable Small World (NSW): A graph where you can reach any node in few hops via greedy routing
  2. Hierarchy (Skip List): Multiple layers where upper layers have fewer nodes, enabling fast coarse search
Layer 2:  [A]----------[B]----------[C]  (few nodes, long edges)
           |            |            |
Layer 1:  [A]---[D]---[B]---[E]---[C]  (more nodes)
           |    |      |    |      |
Layer 0:  [A][F][D][G][B][H][E][I][C]  (all nodes, short edges)

Core data structures:

class HNSWNode:
    def __init__(self, vector_id, vector, level):
        self.id = vector_id
        self.vector = vector
        self.level = level
        self.neighbors = [[] for _ in range(level + 1)]
        # neighbors[layer] = list of neighbor node ids

class HNSWIndex:
    def __init__(self, dim, M=16, ef_construction=200):
        self.dim = dim
        self.M = M  # Max connections per node
        self.M_max0 = M * 2  # Max connections at layer 0
        self.ef_construction = ef_construction
        self.ml = 1 / np.log(M)  # Level multiplier
        self.nodes = {}  # id -> HNSWNode
        self.entry_point = None
        self.max_level = -1

Insertion algorithm:

def insert(self, vector_id, vector):
    # 1. Determine level for new node (probabilistic)
    level = self._random_level()
    node = HNSWNode(vector_id, vector, level)

    if self.entry_point is None:
        self.entry_point = node
        self.max_level = level
        self.nodes[vector_id] = node
        return

    # 2. Find entry point at each level via greedy search
    current = self.entry_point

    # 3. Search from top to node's level (greedy, single result)
    for lc in range(self.max_level, level, -1):
        current = self._search_layer_greedy(vector, current, lc)

    # 4. At node's level and below, find ef_construction neighbors
    for lc in range(min(level, self.max_level), -1, -1):
        candidates = self._search_layer(vector, current, self.ef_construction, lc)

        # Select M best neighbors
        neighbors = self._select_neighbors(vector, candidates, self.M if lc > 0 else self.M_max0)

        # Bidirectional connections
        for neighbor in neighbors:
            node.neighbors[lc].append(neighbor.id)
            neighbor.neighbors[lc].append(vector_id)

            # Prune if too many connections
            if len(neighbor.neighbors[lc]) > (self.M if lc > 0 else self.M_max0):
                neighbor.neighbors[lc] = self._select_neighbors(
                    neighbor.vector,
                    [self.nodes[n] for n in neighbor.neighbors[lc]],
                    self.M if lc > 0 else self.M_max0
                )

        if len(candidates) > 0:
            current = candidates[0]  # Closest candidate

    # 5. Update entry point if new node is higher
    if level > self.max_level:
        self.entry_point = node
        self.max_level = level

    self.nodes[vector_id] = node

def _random_level(self):
    # Standard HNSW level draw: floor(-ln(U) * mL), capped to bound memory
    return min(int(-np.log(np.random.random()) * self.ml), 16)

Search algorithm:

def search(self, query, k, ef):
    if self.entry_point is None:
        return []

    current = self.entry_point

    # Greedy search from top to layer 1
    for lc in range(self.max_level, 0, -1):
        current = self._search_layer_greedy(query, current, lc)

    # At layer 0, do beam search with ef candidates
    candidates = self._search_layer(query, current, ef, 0)

    # Return top-k
    return sorted(candidates, key=lambda n: distance(query, n.vector))[:k]

from heapq import heapify, heappop, heappush

def _search_layer(self, query, entry, ef, layer):
    visited = {entry.id}
    candidates = [(distance(query, entry.vector), entry)]   # min-heap by distance
    results = [(-distance(query, entry.vector), entry)]     # max-heap via negated distances

    heapify(candidates)
    heapify(results)

    while candidates:
        dist_c, current = heappop(candidates)

        # Stop if closest candidate is worse than worst result
        if dist_c > -results[0][0]:
            break

        for neighbor_id in current.neighbors[layer]:
            if neighbor_id in visited:
                continue
            visited.add(neighbor_id)

            neighbor = self.nodes[neighbor_id]
            dist_n = distance(query, neighbor.vector)

            if dist_n < -results[0][0] or len(results) < ef:
                heappush(candidates, (dist_n, neighbor))
                heappush(results, (-dist_n, neighbor))

                if len(results) > ef:
                    heappop(results)

    return [node for _, node in results]

Parameters explained:

  • M: Max connections per node. Higher = more accurate, more memory
  • ef_construction: Beam width during build. Higher = better graph, slower build
  • ef_search: Beam width during search. Higher = better recall, slower search
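
A usage sketch (synthetic data; assumes the insert/search methods above are attached to HNSWIndex):

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((10_000, 128)).astype(np.float32)

index = HNSWIndex(dim=128, M=16, ef_construction=200)
for i, vec in enumerate(data):
    index.insert(i, vec)

query = rng.standard_normal(128).astype(np.float32)
neighbors = index.search(query, k=10, ef=50)  # raise ef for higher recall, slower queries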

Learning milestones:

  1. Hierarchical structure builds correctly → You understand the layers
  2. Greedy search finds good paths → You understand navigation
  3. Recall matches hnswlib → You’ve implemented correctly
  4. Parameter tuning works → You understand the tradeoffs

Project 6: IVF-PQ Composite Index

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: C++, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Combined Indexing / Large-Scale Search
  • Software or Tool: IVF-PQ Index
  • Main Book: “Billion-scale similarity search with GPUs” - Johnson et al.

What you’ll build: A composite IVF-PQ index that combines clustering (IVF) with compression (PQ) for billion-scale search with minimal memory usage.

Why it teaches composite indexes: No single technique is enough at scale. IVF-PQ combines the best of both worlds: IVF reduces search scope, PQ reduces memory. This is how FAISS handles billion-vector datasets.

Core challenges you’ll face:

  • Training pipeline → maps to IVF clustering then PQ on residuals
  • Residual encoding → maps to PQ on (vector - centroid)
  • Search optimization → maps to precomputed tables + SIMD
  • Memory layout → maps to efficient storage for cache

Key Concepts:

  • Residual Quantization: PQ on the difference from centroid
  • Two-Stage Training: IVF first, then PQ
  • ADC with IVF: Combining both speedups

Difficulty: Expert
Time estimate: 2-3 weeks
Prerequisites: Projects 3-4 completed.

Real world outcome:

$ ./ivfpq_build --dataset deep_1B.npy --clusters 65536 --pq_segments 48

Building IVF-PQ Index for 1 BILLION vectors:
  Dimensions: 96
  IVF clusters: 65,536
  PQ segments: 48 (2 dims each)
  PQ bits: 8 (256 centroids per segment)

Phase 1: Training IVF (k-means on 10M samples)
  Iterations: 20, converged
  Time: 15m 23s

Phase 2: Training PQ codebooks (on residuals)
  Training 48 codebooks of 256 centroids each
  Time: 8m 45s

Phase 3: Encoding vectors
  Progress: [████████████████████] 100%
  Time: 2h 34m

Index statistics:
  Raw data: 1B × 96 × 4 bytes = 384 GB
  Compressed: 1B × 48 bytes = 48 GB (8x compression)
  Centroids: 65K × 96 × 4 bytes = 25 MB
  Codebooks: 48 × 256 × 2 × 4 bytes = 96 KB
  Total: 48 GB (87.5% reduction!)

$ ./ivfpq_search --query query.npy --k 10 --nprobe 64

IVF-PQ Search:
  Centroid search: 0.3ms (found 64 clusters)
  PQ scan: 4.2ms (scanned ~15M codes)
  Re-ranking: 0.1ms

Performance:
  Total: 4.6ms for 1B vectors!
  Recall@10: 0.87

$ ./ivfpq_rerank --k 100 --rerank-k 10
With re-ranking (decode top-100, compute exact distances):
  Total: 6.2ms
  Recall@10: 0.94

Implementation Hints:

IVF-PQ structure:

class IVFPQIndex:
    def __init__(self, dim, num_clusters, pq_segments, pq_bits=8):
        self.dim = dim
        self.num_clusters = num_clusters

        # IVF component
        self.centroids = None  # (num_clusters, dim)

        # PQ component (encodes residuals)
        self.pq = ProductQuantizer(dim, pq_segments, 2 ** pq_bits)

        # Storage: one PQ code array per cluster
        self.inverted_lists = [
            {'ids': [], 'codes': []}
            for _ in range(num_clusters)
        ]

Training (two-stage):

def train(self, training_vectors):
    # Stage 1: Train IVF (k-means)
    kmeans = KMeans(n_clusters=self.num_clusters)
    kmeans.fit(training_vectors)
    self.centroids = kmeans.cluster_centers_

    # Compute residuals: vector - nearest_centroid
    assignments = kmeans.predict(training_vectors)
    residuals = training_vectors - self.centroids[assignments]

    # Stage 2: Train PQ on residuals
    self.pq.train(residuals)

Adding vectors:

def add(self, vector_ids, vectors):
    from scipy.spatial.distance import cdist

    # Find nearest centroids
    distances = cdist(vectors, self.centroids)
    assignments = np.argmin(distances, axis=1)

    # Compute and encode residuals
    residuals = vectors - self.centroids[assignments]
    codes = self.pq.encode_batch(residuals)  # batch version of pq.encode

    # Store in inverted lists
    for vid, cluster, code in zip(vector_ids, assignments, codes):
        self.inverted_lists[cluster]['ids'].append(vid)
        self.inverted_lists[cluster]['codes'].append(code)

Search:

def search(self, query, k, nprobe):
    # Step 1: Find nprobe nearest clusters
    centroid_distances = np.linalg.norm(self.centroids - query, axis=1)
    nearest_clusters = np.argsort(centroid_distances)[:nprobe]

    # Step 2: For each cluster, compute residual query
    all_candidates = []

    for cluster_id in nearest_clusters:
        centroid = self.centroids[cluster_id]
        residual_query = query - centroid

        # Step 3: Compute PQ distance table for residual
        distance_table = self.pq.compute_distance_tables(residual_query)

        # Step 4: Scan codes and compute approximate distances
        codes = np.array(self.inverted_lists[cluster_id]['codes'])
        ids = self.inverted_lists[cluster_id]['ids']

        if len(codes) == 0:
            continue

        # Fast PQ distance computation (sum of squared subspace distances, as in ADC)
        pq_distances = np.sum(
            distance_table[np.arange(self.pq.num_segments), codes] ** 2,
            axis=1
        )

        for vid, dist in zip(ids, pq_distances):
            all_candidates.append((vid, dist))

    # Step 5: Return top-k
    all_candidates.sort(key=lambda x: x[1])
    return all_candidates[:k]

Optimization: Store codes contiguously per cluster for cache efficiency:

# Instead of list of codes:
codes = np.array([...], dtype=np.uint8).reshape(-1, pq_segments)
# Shape: (num_vectors_in_cluster, pq_segments)
# This enables vectorized distance computation

Learning milestones:

  1. Two-stage training works → You understand the pipeline
  2. Residual encoding improves accuracy → You understand why it helps
  3. Memory reduction is massive → You understand the compression
  4. Search matches FAISS performance → You’ve implemented correctly

Project 7: Disk-Based Vector Index (DiskANN-style)

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: C++
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Storage Systems / SSD Optimization
  • Software or Tool: Disk-Based Vector Index
  • Main Book: “DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node” - Microsoft Research

What you’ll build: A vector index that stores vectors on SSD with an in-memory graph for navigation, enabling billion-scale search on a single commodity machine.

Why it teaches disk-based search: Memory is expensive. Storing 1B vectors in RAM requires 500GB+. DiskANN-style indexes keep only the graph in memory (~10GB) and fetch vectors from SSD during search, reducing cost by 50x.
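
The core trick is easy to prototype before committing to C++ (a Python sketch using NumPy's memory mapping; file name and sizes are illustrative):

import numpy as np

NUM_VECTORS, DIM = 1_000_000, 96

# Map the vector file without loading it into RAM; the OS pages in only what is touched
vectors = np.memmap("vectors.bin", dtype=np.float32, mode="r",
                    shape=(NUM_VECTORS, DIM))

def fetch(vector_id):
    # Reading one row costs at most a few page-sized SSD reads
    return np.array(vectors[vector_id])  # copy out of the mapping

# The navigation graph (neighbor lists) stays in ordinary RAM, so search alternates
# cheap in-memory hops with a handful of targeted SSD reads per query.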

Core challenges you’ll face:

  • Graph fits in memory, vectors on disk → maps to memory-mapped I/O
  • Minimizing disk reads → maps to graph design for locality
  • SSD-friendly access patterns → maps to sequential vs random reads
  • Beam search with async I/O → maps to pipelining disk access

Key Concepts:

  • DiskANN Paper: Microsoft Research, 2019
  • Memory-Mapped Files: OS-level I/O optimization
  • SSD Performance: Random vs sequential read patterns

Difficulty: Expert
Time estimate: 3-4 weeks
Prerequisites: Project 5 completed. C++ proficiency, understanding of file I/O.

Real world outcome:

$ ./diskann_build --dataset deep_1B.npy --memory-limit 16GB

Building DiskANN Index:
  Dataset: 1,000,000,000 vectors × 96 dimensions
  Vector data size: 384 GB (will be on disk)
  Memory limit: 16 GB (for graph)

Building graph with memory-efficient construction...
  Pass 1: Sample-based graph initialization
  Pass 2: Full graph refinement (streaming)
  Pass 3: Graph compaction

Index files:
  vectors.bin: 384 GB (memory-mapped)
  graph.bin: 12 GB (in-memory)
  metadata.json: 1 KB

$ ./diskann_search --query query.npy --k 10 --beam-width 50

DiskANN Search:
  Graph navigation: 0.4ms (in-memory)
  Disk reads: 23 vectors fetched
  SSD latency: 2.1ms (avg 0.09ms per read)
  Total: 2.5ms

Performance analysis:
  Graph hops: 8
  Candidate set size: 50
  Disk reads: 23 (< beam width due to caching)
  Cache hits: 12

Recall@10: 0.96

$ ./diskann_search --benchmark --queries 10000
Benchmark (10K queries):
  Mean latency: 2.8ms
  P99 latency: 5.2ms
  Throughput: 357 queries/second
  SSD bandwidth: 1.2 GB/s

Implementation Hints:

Architecture:

┌─────────────────────────────────────────┐
│             In-Memory Graph              │
│  (Node ID → [Neighbor IDs], 12 GB)      │
├─────────────────────────────────────────┤
│           Memory-Mapped Vectors          │
│    (Node ID → Vector, 384 GB on SSD)    │
└─────────────────────────────────────────┘

Core structures:

struct DiskANNIndex {
    // In-memory: graph only
    vector<vector<uint32_t>> graph;  // Neighbor lists

    // Memory-mapped: vectors on disk
    float* vectors;  // mmap'd from file
    size_t num_vectors;
    size_t dim;

    // Caching
    LRUCache<uint32_t, vector<float>> vector_cache;
};

Memory-mapping vectors:

void open_vectors(const string& path) {
    int fd = open(path.c_str(), O_RDONLY);
    size_t file_size = num_vectors * dim * sizeof(float);

    vectors = (float*)mmap(
        nullptr,
        file_size,
        PROT_READ,
        MAP_PRIVATE,  // Copy-on-write
        fd,
        0
    );

    // Advise OS about access pattern
    madvise(vectors, file_size, MADV_RANDOM);
}

vector<float> get_vector(uint32_t id) {
    // Check cache first
    if (auto cached = vector_cache.get(id)) {
        return *cached;
    }

    // Load from mmap (may trigger disk read)
    float* ptr = vectors + id * dim;
    vector<float> vec(ptr, ptr + dim);

    vector_cache.put(id, vec);
    return vec;
}

Search with prefetching:

vector<pair<uint32_t, float>> search(
    const vector<float>& query,
    size_t k,
    size_t beam_width
) {
    priority_queue<pair<float, uint32_t>> candidates;  // holds {-dist, id}: top() is the CLOSEST candidate
    priority_queue<pair<float, uint32_t>> results;     // holds {dist, id}: top() is the WORST kept result
    unordered_set<uint32_t> visited;

    // Start from entry point
    auto entry_vec = get_vector(entry_point);
    float dist = distance(query, entry_vec);
    candidates.push({-dist, entry_point});  // Negate so the max-heap behaves as a min-heap
    results.push({dist, entry_point});
    visited.insert(entry_point);

    while (!candidates.empty()) {
        auto [neg_dist, current] = candidates.top();
        candidates.pop();

        // Stop when the best remaining candidate is worse than the worst kept result
        if (-neg_dist > results.top().first && results.size() >= beam_width) {
            break;
        }

        // Prefetch neighbor vectors asynchronously
        auto& neighbors = graph[current];
        for (uint32_t neighbor : neighbors) {
            // Hint to OS to prefetch this page
            prefetch(&vectors[neighbor * dim]);
        }

        // Process neighbors
        for (uint32_t neighbor : neighbors) {
            if (visited.count(neighbor)) continue;
            visited.insert(neighbor);

            auto neighbor_vec = get_vector(neighbor);
            float d = distance(query, neighbor_vec);

            if (d < results.top().first || results.size() < beam_width) {
                candidates.push({-d, neighbor});
                results.push({d, neighbor});
                if (results.size() > beam_width) {
                    results.pop();  // Evict the worst result to keep the beam bounded
                }
            }
        }
    }

    // Drain the max-heap (worst first), reverse to ascending order, keep top-k
    vector<pair<uint32_t, float>> top_k;
    while (!results.empty()) {
        top_k.push_back({results.top().second, results.top().first});
        results.pop();
    }
    reverse(top_k.begin(), top_k.end());
    if (top_k.size() > k) top_k.resize(k);
    return top_k;
}

Graph construction (memory-efficient):

void build_graph(const string& vectors_path, size_t R, size_t L) {
    // R = max degree, L = candidate list size during construction

    // Initialize with random graph
    for (size_t i = 0; i < num_vectors; i++) {
        graph[i] = random_neighbors(R);
    }

    // Iterative refinement (process vectors in chunks to limit memory)
    size_t chunk_size = 1000000;  // 1M vectors at a time
    for (size_t offset = 0; offset < num_vectors; offset += chunk_size) {
        size_t end = min(offset + chunk_size, num_vectors);

        // Load chunk into memory
        vector<vector<float>> chunk_vectors = load_chunk(vectors_path, offset, end);

        // For each vector in chunk, improve its neighbors
        for (size_t i = offset; i < end; i++) {
            auto candidates = beam_search(chunk_vectors[i - offset], L);
            graph[i] = select_diverse_neighbors(candidates, R);

            // Add reverse edges
            for (uint32_t neighbor : graph[i]) {
                if (graph[neighbor].size() < R) {
                    graph[neighbor].push_back(i);
                }
            }
        }
    }
}

Learning milestones:

  1. Memory-mapped I/O works → You understand OS-level optimization
  2. Graph fits in memory limit → You understand memory management
  3. Latency is SSD-bound → You understand the bottleneck
  4. Prefetching helps → You understand pipelining

Project 8: Filtered Vector Search

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Hybrid Search / Inverted Indexes
  • Software or Tool: Filtered Vector Search
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

What you’ll build: A vector search system that supports metadata filtering (e.g., “find similar products where price < $100 and category = ‘electronics’”) efficiently, not just post-filtering.

Why it teaches filtered search: Real applications always have filters. Naive post-filtering is slow (scan 1M vectors, filter to 1K, return top-10). Smart pre-filtering or integrated filtering is essential for production systems.
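
The naive baseline is worth writing down first (a sketch of post-filtering, assuming an ANN index and a metadata predicate):

def search_post_filter(index, query, k, predicate, overfetch=10):
    # Ask the ANN index for far more results than needed...
    candidates = index.search(query, k * overfetch)
    # ...then discard everything the filter rejects
    hits = [(vid, dist) for vid, dist in candidates if predicate(vid)]
    return hits[:k]

# With a 2% selective filter, even a 10x overfetch usually yields fewer than k
# survivors, forcing repeated, ever-larger searches; that is what pre-filtering
# and integrated filtering avoid.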

Core challenges you’ll face:

  • Pre-filter vs post-filter → maps to when to apply filters
  • Bitmap indexes for metadata → maps to efficient set operations
  • Filtered HNSW traversal → maps to skipping invalid nodes
  • Cost estimation → maps to choosing the right strategy

Key Concepts:

  • Inverted Indexes: “Introduction to Information Retrieval” - Manning et al.
  • Bitmap Indexes: Database index fundamentals
  • Filtered Graph Search: Qdrant, Pinecone approaches

Difficulty: Advanced
Time estimate: 2-3 weeks
Prerequisites: Project 5 completed. Understanding of database indexes.

Real world outcome:

$ ./filtered_search "
  SELECT * FROM products
  WHERE vector NEAR query_embedding
    AND category = 'electronics'
    AND price < 100
    AND rating >= 4.0
  LIMIT 10
"

Query Analysis:
  Filter selectivity: 2.3% (23,000 of 1,000,000 match)
  Strategy: Pre-filter + HNSW on filtered subset

Execution:
  Step 1: Apply filters (bitmap intersection): 0.8ms
  Step 2: Build filtered entry points: 0.1ms
  Step 3: HNSW search on filtered graph: 1.2ms

Results:
╔═══════╤═══════════════╤══════════════╤════════╤════════════╗
║ Rank  │ Product ID    │ Distance     │ Price  │ Category   ║
╠═══════╪═══════════════╪══════════════╪════════╪════════════╣
║   1   │ PROD-34521    │    0.234     │ $49.99 │ electronics║
║   2   │ PROD-78234    │    0.287     │ $89.99 │ electronics║
║  ...  │     ...       │     ...      │   ...  │    ...     ║
╚═══════╧═══════════════╧══════════════╧════════╧════════════╝

Performance: 2.1ms total
  (vs 847ms with post-filtering on full HNSW traversal)

$ ./filtered_search --analyze-strategies
Filter Selectivity │ Best Strategy    │ Expected Time
───────────────────┼──────────────────┼──────────────
  > 50%            │ Post-filter      │  ~2ms
  10% - 50%        │ Filtered HNSW    │  ~3ms
  1% - 10%         │ Pre-filter       │  ~2ms
  < 1%             │ Scan filtered    │  ~1ms

Implementation Hints:

Metadata storage with bitmap indexes:

class MetadataIndex:
    def __init__(self):
        # For categorical: inverted index of bitmaps
        self.categorical = {}  # field -> {value -> bitmap}

        # For numerical: sorted index
        self.numerical = {}  # field -> sorted [(value, vector_id)]

    def add(self, vector_id, metadata):
        for field, value in metadata.items():
            if isinstance(value, str):
                if field not in self.categorical:
                    self.categorical[field] = {}
                if value not in self.categorical[field]:
                    self.categorical[field][value] = BitSet()
                self.categorical[field][value].set(vector_id)
            else:
                if field not in self.numerical:
                    self.numerical[field] = []
                self.numerical[field].append((value, vector_id))

    def build_indexes(self):
        for field in self.numerical:
            self.numerical[field].sort()

    def filter(self, conditions):
        result = None
        for field, op, value in conditions:
            if field in self.categorical:
                if op == '==':
                    bits = self.categorical[field].get(value, BitSet())
                elif op == 'in':
                    bits = BitSet()
                    for v in value:
                        bits |= self.categorical[field].get(v, BitSet())
            else:  # numerical
                bits = self._range_filter(field, op, value)

            if result is None:
                result = bits
            else:
                result &= bits  # Intersection

        return result

Filtered HNSW search:

class FilteredHNSW(HNSWIndex):
    def search_filtered(self, query, k, ef, valid_ids):
        """Search considering only nodes in valid_ids"""
        if len(valid_ids) == 0:
            return []

        # Convert to set for O(1) lookup
        valid_set = set(valid_ids)

        # Find a valid entry point
        entry = self._find_valid_entry(valid_set)
        if entry is None:
            # Fall back to scanning
            return self._scan_filtered(query, k, valid_set)

        # Modified beam search that skips invalid nodes
        visited = {entry.id}
        candidates = [(distance(query, entry.vector), entry)]
        results = []

        while candidates:
            dist_c, current = heappop(candidates)

            # Add to results if valid
            if current.id in valid_set:
                heappush(results, (-dist_c, current))
                if len(results) > ef:
                    heappop(results)

            # Explore neighbors (including invalid ones for navigation)
            for neighbor_id in current.neighbors[0]:
                if neighbor_id in visited:
                    continue
                visited.add(neighbor_id)

                neighbor = self.nodes[neighbor_id]
                dist_n = distance(query, neighbor.vector)

                # Always add to candidates for navigation
                if len(results) < ef or dist_n < -results[0][0]:
                    heappush(candidates, (dist_n, neighbor))

        # Extract top-k valid results
        return sorted(
            [(n.id, -d) for d, n in results],
            key=lambda x: x[1]
        )[:k]

Cost-based strategy selection:

def choose_strategy(self, query, filters, k):
    # Estimate filter selectivity
    selectivity = self.estimate_selectivity(filters)
    num_valid = int(self.num_vectors * selectivity)

    # Estimate the cost of each strategy (illustrative; a fuller implementation
    # would pick the cheapest estimate instead of the fixed thresholds below)
    post_filter_cost = self.hnsw_search_cost() + num_valid * 0.001
    pre_filter_cost = self.filter_cost(filters) + self.scan_cost(num_valid)
    filtered_hnsw_cost = self.filter_cost(filters) + self.hnsw_search_cost() * (1 / selectivity)

    if selectivity > 0.5:
        return 'post_filter'
    elif selectivity < 0.01:
        return 'pre_filter_scan'
    else:
        return 'filtered_hnsw'

Learning milestones:

  1. Bitmap filters are fast → You understand inverted indexes
  2. Strategy selection matters → You understand cost-based optimization
  3. Filtered HNSW works → You understand graph traversal with constraints
  4. Performance beats naive approach → You’ve built an efficient system

Project 9: Vector Database Storage Engine

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Storage Systems / Write-Ahead Logging
  • Software or Tool: Storage Engine
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A durable storage engine for vectors with write-ahead logging (WAL), crash recovery, compaction, and efficient serialization—the persistence layer of a real vector database.

Why it teaches storage: Indexes are useless if they disappear on restart. A real vector database needs ACID properties: atomic writes, crash recovery, and efficient disk formats. This is what makes the difference between a prototype and a production system.
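
The durability contract is small enough to sketch up front (a Python stand-in for the Rust engine; record format and names are illustrative):

import os, struct, zlib

def wal_append(f, payload: bytes):
    """Append one length-prefixed, checksummed record, then force it to disk."""
    f.write(struct.pack("<I", len(payload)))
    f.write(payload)
    f.write(struct.pack("<I", zlib.crc32(payload)))
    f.flush()
    os.fsync(f.fileno())  # only after this is it safe to acknowledge the write

def wal_replay(path):
    """Yield payloads of intact records; stop at the first torn/corrupt one."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack("<I", header)
            payload, crc = f.read(length), f.read(4)
            if len(crc) < 4 or zlib.crc32(payload) != struct.unpack("<I", crc)[0]:
                break  # torn tail from a crash: discard and stop replay
            yield payload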

Core challenges you’ll face:

  • Write-ahead logging → maps to durability before acknowledging writes
  • Crash recovery → maps to replaying WAL on startup
  • Compaction → maps to merging and garbage collection
  • Efficient serialization → maps to binary formats for vectors

Key Concepts:

  • WAL: “Designing Data-Intensive Applications” Chapter 3 - Kleppmann
  • LSM Trees: For append-friendly storage
  • Memory-Mapped Files: OS-level optimization

Difficulty: Expert
Time estimate: 3-4 weeks
Prerequisites: Projects 1-6, experience with file I/O and binary formats.

Real world outcome:

$ ./vecdb_storage --init --path ./data

Initializing VectorDB Storage Engine:
  Data directory: ./data
  WAL segment size: 64 MB
  Compaction threshold: 1 GB

Files created:
  ./data/wal/segment_000000.wal
  ./data/vectors/vectors_000000.bin
  ./data/index/hnsw_000000.idx
  ./data/metadata.json

$ ./vecdb_storage --insert --count 100000
Inserting 100,000 vectors...
  WAL writes: 100,000
  Batch flushes: 10
  Time: 4.2s (23,800 inserts/sec)

$ kill -9 $PID  # Simulate crash

$ ./vecdb_storage --recover
Recovery started:
  Scanning WAL segments...
  Found 1 segment with 100,000 entries
  Replaying entries: [████████████████████] 100%
  Rebuilding index: 2.3s
  Recovery complete!

Verified:
  Vectors recovered: 100,000 ✓
  Index consistency: OK ✓
  No data loss ✓

$ ./vecdb_storage --compact
Compaction:
  Before: 3 segments, 1.2 GB
  Merging...
  After: 1 segment, 0.9 GB
  Space reclaimed: 300 MB (25%)

Implementation Hints:

Storage architecture:

./data/
├── wal/
│   ├── segment_000000.wal  # Write-ahead log
│   ├── segment_000001.wal
│   └── ...
├── vectors/
│   └── vectors_000000.bin  # Actual vector data
├── index/
│   └── hnsw_000000.idx     # Serialized index
└── metadata.json           # DB state

WAL entry format:

struct WALEntry {
    sequence_number: u64,
    operation: Operation,  // Insert, Delete, Update
    timestamp: u64,
    vector_id: u64,
    vector_data: Vec<f32>,  // Empty for Delete
    metadata: HashMap<String, Value>,
    checksum: u32,
}

enum Operation {
    Insert = 1,
    Delete = 2,
    Update = 3,
}
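
If the Rust layout feels abstract, here is a minimal Python sketch of the same on-disk frame format (4-byte length + payload + 4-byte CRC32); the exact field layout inside the payload is an assumption for illustration:

import struct
import zlib

def encode_frame(payload: bytes) -> bytes:
    # Frame: little-endian length prefix, payload, CRC32 of the payload
    return (struct.pack("<I", len(payload))
            + payload
            + struct.pack("<I", zlib.crc32(payload)))

def decode_frames(buf: bytes):
    # Yield payloads in order, stopping at the first torn or corrupt frame
    off = 0
    while off + 4 <= len(buf):
        (length,) = struct.unpack_from("<I", buf, off)
        end = off + 4 + length + 4
        if end > len(buf):
            break  # incomplete frame: torn write during a crash
        payload = buf[off + 4 : off + 4 + length]
        (crc,) = struct.unpack_from("<I", buf, off + 4 + length)
        if zlib.crc32(payload) != crc:
            break  # corrupt frame: everything after it is suspect
        yield payload
        off = end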

WAL writer:

struct WALWriter {
    current_segment: File,
    segment_size: usize,
    current_size: usize,
    sequence: AtomicU64,
}

impl WALWriter {
    fn append(&mut self, entry: WALEntry) -> Result<u64> {
        let seq = self.sequence.fetch_add(1, Ordering::SeqCst);
        let mut entry = entry;
        entry.sequence_number = seq;

        // Serialize the entry, then checksum the serialized bytes
        let bytes = entry.serialize();
        let checksum = crc32(&bytes);

        // Frame: 4-byte length + payload + 4-byte checksum
        self.current_segment.write_all(&(bytes.len() as u32).to_le_bytes())?;
        self.current_segment.write_all(&bytes)?;
        self.current_segment.write_all(&checksum.to_le_bytes())?;

        // fsync before acknowledging the write (durability!)
        self.current_segment.sync_all()?;

        self.current_size += bytes.len() + 8;  // payload + length + checksum

        // Rotate segment if needed
        if self.current_size >= self.segment_size {
            self.rotate_segment()?;
        }

        Ok(seq)
    }
}

Recovery:

fn recover(data_dir: &Path) -> Result<VectorDB> {
    let mut db = VectorDB::new();

    // Read metadata to find last checkpoint
    let metadata = read_metadata(data_dir)?;
    let last_checkpoint = metadata.last_checkpoint_sequence;

    // Replay WAL from last checkpoint
    for segment in list_wal_segments(data_dir)? {
        let reader = WALReader::open(segment)?;

        for entry in reader.entries() {
            let entry = entry?;

            // Verify the checksum against the serialized payload
            if crc32(&entry.data) != entry.checksum {
                // A torn write from the crash: everything after this
                // point in the segment is suspect, so stop replaying
                log::error!("Corrupt WAL entry at seq {}", entry.sequence_number);
                break;
            }

            // Skip already applied entries
            if entry.sequence_number <= last_checkpoint {
                continue;
            }

            // Apply to in-memory state
            match entry.operation {
                Operation::Insert => db.insert_recovered(entry.vector_id, entry.vector_data),
                Operation::Delete => db.delete_recovered(entry.vector_id),
                Operation::Update => db.update_recovered(entry.vector_id, entry.vector_data),
            }
        }
    }

    // Rebuild index if needed
    if db.index_needs_rebuild() {
        db.rebuild_index()?;
    }

    Ok(db)
}

Compaction:

fn compact(&mut self) -> Result<()> {
    // Create new segment file
    let new_segment = create_new_segment()?;

    // Copy only live (non-deleted) vectors into the new segment
    for (id, vector) in self.vectors.iter() {
        if !self.deleted.contains(id) {
            write_vector(&new_segment, id, vector)?;
        }
    }

    // Atomically swap: fsync the new segment, then rename it over the
    // old path (fsync the directory too so the rename survives a crash)
    new_segment.sync_all()?;
    rename(new_segment.path(), current_segment_path())?;

    // Delete the now-obsolete segments
    for old in self.old_segments() {
        remove_file(old)?;
    }

    // Truncate the WAL (everything live is now in the new segment)
    self.wal.truncate_before(self.sequence)?;

    Ok(())
}
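
The atomic-swap step is the crux: write the new segment to a temporary file, fsync it, then rename over the old path, and fsync the directory so the rename itself is durable. A minimal Python sketch of this pattern (POSIX semantics assumed):

import os

def atomic_rewrite(path: str, data: bytes) -> None:
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # data is durable before the swap
    os.replace(tmp, path)     # atomic rename on POSIX filesystems
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)      # make the rename itself durable
    finally:
        os.close(dir_fd)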

Learning milestones:

  1. WAL writes are durable → You understand write-ahead logging
  2. Crash recovery works → You understand durability guarantees
  3. Compaction reclaims space → You understand garbage collection
  4. Performance is acceptable → You’ve built an efficient storage layer

Project 10: Distributed Vector Search Cluster

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Systems / Sharding
  • Software or Tool: Distributed Vector Database
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A distributed vector search cluster with sharding, replication, routing, and distributed query execution—turning a single-node index into a scalable system.

Why it teaches distributed systems: No single machine can handle billions of vectors with low latency. Distributed systems introduce new challenges: consistency, partition tolerance, coordination. Understanding these is essential for production systems.

Core challenges you’ll face:

  • Sharding strategies → maps to distributing vectors across nodes
  • Query routing → maps to finding which shards to query
  • Result merging → maps to combining results from multiple shards
  • Replication → maps to fault tolerance and read scaling
  • Consistency → maps to handling concurrent updates

Key Concepts:

  • Sharding: “Designing Data-Intensive Applications” Chapter 6 - Kleppmann
  • Replication: “Designing Data-Intensive Applications” Chapter 5 - Kleppmann
  • Distributed Consensus: Raft algorithm

Difficulty: Master Time estimate: 4-6 weeks Prerequisites: All previous projects, distributed systems experience.

Real world outcome:

$ ./vecdb cluster init --nodes 3 --shards 9 --replicas 2

Initializing distributed cluster:
  Nodes: node-0 (coordinator), node-1, node-2
  Shards: 9 (3 primaries per node, replicas spread across the other nodes)
  Replication factor: 2

Shard distribution:
╔════════╤═══════════════════╤══════════════════╗
║ Shard  │ Primary Node      │ Replica Nodes    ║
╠════════╪═══════════════════╪══════════════════╣
║   0    │ node-0            │ node-1, node-2   ║
║   1    │ node-0            │ node-1, node-2   ║
║   2    │ node-0            │ node-1, node-2   ║
║   3    │ node-1            │ node-0, node-2   ║
║  ...   │  ...              │  ...             ║
╚════════╧═══════════════════╧══════════════════╝

Cluster ready!

$ ./vecdb insert --vectors 10000000 --distributed
Inserting 10M vectors across cluster:
  Routing: hash(vector_id) mod 9
  Progress: [████████████████████] 100%
  Time: 45s (222K inserts/sec)

$ ./vecdb search --query query.npy --k 10

Distributed Query Execution:
  Step 1: Route query to all 9 shards
  Step 2: Parallel search on each shard (local k=30)
  Step 3: Merge 270 candidates from 9 shards
  Step 4: Re-rank and return top-10

Performance:
  Shard latencies: [1.2ms, 1.4ms, 1.1ms, 1.3ms, 1.5ms, 1.2ms, 1.0ms, 1.4ms, 1.3ms]
  Merge time: 0.3ms
  Total: 1.8ms (limited by slowest shard + merge)

$ ./vecdb failover --kill node-1
Simulating node-1 failure...

  Detecting failure: 2.1s (heartbeat timeout)
  Promoting replicas:
    Shard 3: node-0 promoted to primary
    Shard 4: node-2 promoted to primary
    Shard 5: node-0 promoted to primary
  Cluster healthy (degraded: missing replicas)

$ ./vecdb search --query query.npy --k 10
Search still works! (using remaining nodes)
  Latency: 2.1ms (slightly higher due to fewer nodes)

Implementation Hints:

Cluster architecture:

┌─────────────────────────────────────────────────────────────┐
│                       Coordinator                            │
│  (Query routing, shard map, cluster membership)             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Node-0    │  │   Node-1    │  │   Node-2    │         │
│  │             │  │             │  │             │         │
│  │ Shard 0 (P) │  │ Shard 3 (P) │  │ Shard 6 (P) │         │
│  │ Shard 1 (P) │  │ Shard 4 (P) │  │ Shard 7 (P) │         │
│  │ Shard 2 (P) │  │ Shard 5 (P) │  │ Shard 8 (P) │         │
│  │ Shard 3 (R) │  │ Shard 0 (R) │  │ Shard 0 (R) │         │
│  │ Shard 4 (R) │  │ Shard 1 (R) │  │ Shard 1 (R) │         │
│  │ ...         │  │ ...         │  │ ...         │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│                                                              │
└─────────────────────────────────────────────────────────────┘
P = Primary, R = Replica

Sharding:

type ShardRouter struct {
    numShards int
    shardMap  map[int][]NodeAddress  // shard -> [primary, replicas]
}

func (r *ShardRouter) GetShard(vectorID uint64) int {
    return int(vectorID % uint64(r.numShards))
}

func (r *ShardRouter) RouteInsert(vectorID uint64, vector []float32) error {
    shardID := r.GetShard(vectorID)
    primary := r.shardMap[shardID][0]

    // Send to primary (it will replicate)
    return r.sendInsert(primary, vectorID, vector)
}

func (r *ShardRouter) RouteSearch(query []float32, k int) ([]SearchResult, error) {
    // Search all shards in parallel
    results := make(chan []SearchResult, r.numShards)
    errors := make(chan error, r.numShards)

    for shardID := 0; shardID < r.numShards; shardID++ {
        go func(sid int) {
            node := r.pickNode(sid)  // Primary or replica
            res, err := r.sendSearch(node, query, k*3)  // Over-fetch for merge
            if err != nil {
                errors <- err
                return
            }
            results <- res
        }(shardID)
    }

    // Collect and merge
    allResults := []SearchResult{}
    for i := 0; i < r.numShards; i++ {
        select {
        case res := <-results:
            allResults = append(allResults, res...)
        case err := <-errors:
            return nil, err
        }
    }

    // Sort by distance and take top-k
    sort.Slice(allResults, func(i, j int) bool {
        return allResults[i].Distance < allResults[j].Distance
    })

    if len(allResults) > k {
        allResults = allResults[:k]
    }

    return allResults, nil
}
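
Since each shard returns its candidates already sorted by distance, the merge step is a k-way selection. In Python terms (hypothetical (distance, vector_id) tuples):

import heapq
from itertools import islice

def merge_shard_results(shard_results, k):
    # shard_results: one list per shard of (distance, vector_id) pairs,
    # each sorted ascending by distance; merge lazily and take the top-k
    return list(islice(heapq.merge(*shard_results), k))

# Example: three shards, k=2
print(merge_shard_results([[(0.1, 7), (0.5, 2)],
                           [(0.2, 9)],
                           [(0.3, 4), (0.4, 1)]], 2))  # [(0.1, 7), (0.2, 9)]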

Replication:

type ReplicationManager struct {
    primary  *Shard
    replicas []*RemoteReplica
}

func (rm *ReplicationManager) Insert(id uint64, vector []float32) error {
    // Write to primary first
    if err := rm.primary.Insert(id, vector); err != nil {
        return err
    }

    // Replicate asynchronously (eventual consistency)
    // Or synchronously (strong consistency)
    for _, replica := range rm.replicas {
        go replica.Replicate(id, vector)
    }

    return nil
}

Failure detection:

func (c *Cluster) monitorHealth() {
    ticker := time.NewTicker(1 * time.Second)

    for range ticker.C {
        for _, node := range c.nodes {
            if !c.ping(node) {
                node.missedHeartbeats++
                if node.missedHeartbeats > 3 {
                    c.handleNodeFailure(node)
                }
            } else {
                node.missedHeartbeats = 0
            }
        }
    }
}

func (c *Cluster) handleNodeFailure(failed *Node) {
    log.Printf("Node %s failed, promoting replicas", failed.ID)

    for _, shard := range failed.PrimaryShards {
        // Find a replica to promote
        replica := c.findReplica(shard)
        if replica != nil {
            c.promoteToPrimary(replica, shard)
        }
    }

    c.updateShardMap()
    c.notifyClients()
}

Learning milestones:

  1. Sharding distributes load → You understand partitioning
  2. Parallel search works → You understand distributed query execution
  3. Failover promotes replicas → You understand fault tolerance
  4. Consistency is maintained → You understand distributed state

Capstone Project: Full-Featured Vector Database

  • File: LEARN_VECTOR_DATABASES.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Complete Database System
  • Software or Tool: Production Vector Database
  • Main Book: All previous + “Database Internals” by Alex Petrov

What you’ll build: A complete, production-ready vector database with HNSW and IVF-PQ indexes, durable storage, metadata filtering, distributed clustering, a gRPC API, and client SDKs.

Why this is the ultimate test: This combines everything: algorithms, storage, networking, distributed systems, and API design. Building a complete vector database proves you truly understand how these systems work.

Core challenges you’ll face:

  • Integrating all components → maps to system design
  • API design → maps to user experience
  • Performance tuning → maps to profiling and optimization
  • Operational features → maps to monitoring, backups, upgrades

Key Concepts:

  • Everything from previous projects
  • API Design: gRPC best practices
  • Operations: Observability, deployment

Difficulty: Master Time estimate: 3-6 months Prerequisites: All previous projects completed.

Real world outcome:

$ vecdb-server --config production.yaml

╔═══════════════════════════════════════════════════════════╗
║                    VecDB Server v1.0                       ║
╠═══════════════════════════════════════════════════════════╣
║ Configuration:                                             ║
║   Storage: /data/vecdb                                     ║
║   Index: HNSW (M=32, ef=200)                              ║
║   Shards: 12 (distributed across 4 nodes)                 ║
║   Replicas: 2                                              ║
║                                                            ║
║ Endpoints:                                                 ║
║   gRPC: 0.0.0.0:6333                                      ║
║   HTTP: 0.0.0.0:6334                                      ║
║   Metrics: 0.0.0.0:9090/metrics                           ║
║                                                            ║
║ Status: Ready to accept connections                        ║
╚═══════════════════════════════════════════════════════════╝

# Python client usage
from vecdb import VecDBClient

client = VecDBClient("localhost:6333")

# Create collection
client.create_collection(
    name="products",
    dimension=768,
    metric="cosine",
    index_config={
        "type": "hnsw",
        "m": 32,
        "ef_construction": 200
    }
)

# Insert vectors
client.upsert(
    collection="products",
    vectors=[
        {"id": "prod_001", "vector": [...], "metadata": {"category": "electronics", "price": 99.99}},
        {"id": "prod_002", "vector": [...], "metadata": {"category": "clothing", "price": 49.99}},
        ...
    ]
)

# Search with filters
results = client.search(
    collection="products",
    vector=query_embedding,
    limit=10,
    filter={"category": "electronics", "price": {"$lt": 100}}
)

# Results
[
    {"id": "prod_001", "score": 0.95, "metadata": {...}},
    {"id": "prod_042", "score": 0.91, "metadata": {...}},
    ...
]
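
The filter syntax above is easy to pin down with a tiny evaluator: bare values mean equality, nested objects hold operators such as $lt. A minimal sketch (a real engine would consult the metadata index rather than scan):

def matches(metadata: dict, flt: dict) -> bool:
    for field, cond in flt.items():
        value = metadata.get(field)
        if isinstance(cond, dict):
            # Operator clause, e.g. {"$lt": 100}
            if "$lt" in cond and not (value is not None and value < cond["$lt"]):
                return False
        elif value != cond:
            # Bare value means equality
            return False
    return True

matches({"category": "electronics", "price": 99.99},
        {"category": "electronics", "price": {"$lt": 100}})  # True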

Implementation Hints:

System architecture:

┌─────────────────────────────────────────────────────────────┐
│                         Client SDKs                          │
│              (Python, JavaScript, Rust, Go)                  │
├─────────────────────────────────────────────────────────────┤
│                     API Layer (gRPC + HTTP)                  │
│    ┌────────────┬────────────┬────────────┬────────────┐    │
│    │  Insert    │  Search    │  Delete    │   Admin    │    │
│    └────────────┴────────────┴────────────┴────────────┘    │
├─────────────────────────────────────────────────────────────┤
│                       Query Processor                        │
│    ┌────────────┬────────────┬────────────┐                 │
│    │  Planner   │ Optimizer  │  Executor  │                 │
│    └────────────┴────────────┴────────────┘                 │
├─────────────────────────────────────────────────────────────┤
│                        Index Layer                           │
│    ┌────────────┬────────────┬────────────┐                 │
│    │   HNSW     │  IVF-PQ    │  Metadata  │                 │
│    └────────────┴────────────┴────────────┘                 │
├─────────────────────────────────────────────────────────────┤
│                       Storage Layer                          │
│    ┌────────────┬────────────┬────────────┐                 │
│    │    WAL     │  Vectors   │  Index     │                 │
│    └────────────┴────────────┴────────────┘                 │
├─────────────────────────────────────────────────────────────┤
│                    Distributed Layer                         │
│    ┌────────────┬────────────┬────────────┐                 │
│    │  Sharding  │Replication │ Consensus  │                 │
│    └────────────┴────────────┴────────────┘                 │
└─────────────────────────────────────────────────────────────┘

gRPC API definition:

service VecDB {
    rpc CreateCollection(CreateCollectionRequest) returns (CreateCollectionResponse);
    rpc Upsert(UpsertRequest) returns (UpsertResponse);
    rpc Search(SearchRequest) returns (SearchResponse);
    rpc Delete(DeleteRequest) returns (DeleteResponse);
    rpc GetCollection(GetCollectionRequest) returns (GetCollectionResponse);
}

message SearchRequest {
    string collection = 1;
    repeated float vector = 2;
    uint32 limit = 3;
    Filter filter = 4;
    SearchParams params = 5;
}

message SearchResponse {
    repeated SearchResult results = 1;
    uint64 searched_vectors = 2;
    float latency_ms = 3;
}
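
Compiling that proto with grpcio-tools would yield Python modules along these lines; the module names vecdb_pb2 / vecdb_pb2_grpc follow the standard generator convention but are assumptions here:

import grpc
import vecdb_pb2
import vecdb_pb2_grpc  # hypothetical modules generated from the proto above

channel = grpc.insecure_channel("localhost:6333")
stub = vecdb_pb2_grpc.VecDBStub(channel)

response = stub.Search(vecdb_pb2.SearchRequest(
    collection="products",
    vector=[0.1] * 768,  # query embedding
    limit=10,
))
print(response.latency_ms, len(response.results))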

Collection management:

struct Collection {
    name: String,
    dimension: usize,
    metric: DistanceMetric,

    // Index
    index: Box<dyn VectorIndex>,

    // Storage
    storage: StorageEngine,

    // Metadata
    metadata_index: MetadataIndex,

    // Configuration
    config: CollectionConfig,
}

impl Collection {
    fn upsert(&mut self, vectors: Vec<VectorRecord>) -> Result<()> {
        // Write to WAL first
        self.storage.wal_append(&vectors)?;

        // Update index
        for record in &vectors {
            self.index.insert(record.id, &record.vector)?;
            self.metadata_index.insert(record.id, &record.metadata)?;
        }

        // Flush periodically
        if self.should_flush() {
            self.flush()?;
        }

        Ok(())
    }

    fn search(&self, query: &[f32], k: usize, filter: Option<Filter>) -> Result<Vec<SearchResult>> {
        let valid_ids = match filter {
            Some(f) => Some(self.metadata_index.filter(&f)?),
            None => None,
        };

        self.index.search(query, k, valid_ids)
    }
}

Metrics and monitoring:

struct Metrics {
    insert_latency: Histogram,
    search_latency: Histogram,
    vectors_total: Gauge,
    index_size_bytes: Gauge,
    queries_total: Counter,  // QPS is derived from this counter via rate()
}

impl Metrics {
    fn record_search(&self, latency: Duration) {
        self.search_latency.observe(latency.as_secs_f64());
        self.queries_total.inc();
    }
}
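
Exposed through Python's prometheus_client, the same metrics might look like this (metric names are hypothetical; QPS comes from applying rate() to the counter at query time):

from prometheus_client import Counter, Histogram, start_http_server

SEARCH_LATENCY = Histogram("vecdb_search_latency_seconds", "Search latency")
QUERIES_TOTAL = Counter("vecdb_queries_total", "Total search queries")

def record_search(latency_seconds: float) -> None:
    SEARCH_LATENCY.observe(latency_seconds)
    QUERIES_TOTAL.inc()

start_http_server(9090)  # serves /metrics, as in the banner above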

Learning milestones:

  1. Basic CRUD works → Core functionality complete
  2. Filtered search works → Advanced querying done
  3. Distributed mode works → Scalability achieved
  4. Production deployed → Operational readiness
  5. Handles real workloads → Production quality

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---------|------------|------|------------------------|------------|
| 1. Brute Force k-NN | Beginner | Weekend | ⭐⭐ | ⭐⭐⭐ |
| 2. LSH | Intermediate | 1-2 weeks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| 3. IVF | Advanced | 2 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| 4. Product Quantization | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 5. HNSW | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| 6. IVF-PQ | Expert | 2-3 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 7. DiskANN-style | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 8. Filtered Search | Advanced | 2-3 weeks | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 9. Storage Engine | Expert | 3-4 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| 10. Distributed Cluster | Master | 4-6 weeks | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Capstone: Full DB | Master | 3-6 months | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

Path A: The Complete Journey (6-12 months)

  1. Weeks 1-2: Projects 1-2 (Fundamentals + LSH)
  2. Weeks 3-5: Projects 3-4 (IVF + PQ)
  3. Weeks 6-9: Project 5 (HNSW - the key algorithm)
  4. Weeks 10-12: Project 6 (IVF-PQ composite)
  5. Weeks 13-16: Projects 7-8 (Disk-based + Filtering)
  6. Weeks 17-20: Project 9 (Storage Engine)
  7. Weeks 21-26: Project 10 (Distributed)
  8. Months 7-12: Capstone

Path B: Practical Focus (2-3 months)

If you want to understand production systems quickly:

1 → 3 → 5 → 8 → (use existing storage/distribution)

This covers: brute force baseline, IVF, HNSW, and filtering—enough to work with any production vector database.

Path C: Algorithm Deep Dive (2-3 months)

If you want to understand the algorithms deeply:

1 → 2 → 3 → 4 → 5 → 6

This covers all the major ANN algorithms without the systems complexity.

Recommendation Based on Your Goals

If you’re building RAG applications: Do Projects 1, 3, 5, 8. You’ll understand how to choose and tune indexes.

If you want to contribute to vector DBs: Do the full path. Open-source projects like Qdrant and Milvus need contributors who understand the internals.

If you’re researching ANN algorithms: Focus on Projects 2-6, read the original papers, benchmark against ann-benchmarks.com.


Essential Resources

Books

  • “Designing Data-Intensive Applications” by Martin Kleppmann
  • “Mining of Massive Datasets” by Leskovec, Rajaraman, Ullman
  • “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

Libraries to Study

  • FAISS - Facebook’s library (C++ with Python bindings)
  • hnswlib - Original HNSW implementation
  • Qdrant - Rust vector database
  • Milvus - Distributed vector database

Summary

| # | Project Name | Main Language |
|---|--------------|---------------|
| 1 | Brute Force k-NN and Distance Metrics | Python |
| 2 | Locality Sensitive Hashing (LSH) | Python |
| 3 | Inverted File Index (IVF) | Python |
| 4 | Product Quantization (PQ) | Python |
| 5 | HNSW Index | Python |
| 6 | IVF-PQ Composite Index | Python |
| 7 | Disk-Based Vector Index (DiskANN-style) | C++ |
| 8 | Metadata Filtering with Vector Search | Python |
| 9 | Vector Database Storage Engine | Rust |
| 10 | Distributed Vector Search Cluster | Go |
| Capstone | Full-Featured Vector Database | Rust |

Vector databases are where algorithms meet systems engineering. The best way to understand them is to build one. Start with brute force, feel the pain, then appreciate the elegance of HNSW and the efficiency of IVF-PQ. By the end, you’ll know exactly what happens when you call collection.search().