Project 2: Build Your Own Vector Database Engine

Implement a minimal vector database with HNSW indexing, metadata filters, and fast approximate search.


Quick Reference

Attribute        Value
Difficulty       Level 4: Expert
Time Estimate    2–3 weeks
Language         Python
Prerequisites    Graph data structures, embeddings basics
Key Topics       ANN search, HNSW, similarity metrics, persistence

Learning Objectives

By completing this project, you will:

  1. Implement HNSW graph construction and search.
  2. Compare brute-force vs ANN retrieval accuracy.
  3. Add metadata filtering to vector queries.
  4. Persist indexes and reload reliably.
  5. Benchmark recall vs latency tradeoffs.

The Core Question You’re Answering

“How does a vector database achieve sublinear search without losing too much accuracy?”

This project makes ANN mechanisms tangible.


Concepts You Must Understand First

Concept             Why It Matters         Where to Learn
HNSW graphs         Core ANN structure     HNSW paper
Cosine similarity   Common metric          Vector math basics
Recall vs latency   Real-world tradeoff    IR benchmarks
Index persistence   Production readiness   Storage systems
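
Both similarity metrics from the table are one-liners over NumPy arrays. A minimal sketch for reference (the function names are ours, not a library API):

import numpy as np

def dot_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Raw inner product; larger means more similar."""
    return float(np.dot(a, b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Inner product scaled by both norms, giving a score in [-1, 1]."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0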

Theoretical Foundation

HNSW Search Intuition

Coarse Layer -> Navigate neighbors -> Descend -> Fine Layer

HNSW searches a hierarchy of proximity graphs: a greedy walk across the sparse upper layers quickly narrows the search to a promising region, and the dense bottom layer refines that region into the final candidates.
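
A minimal sketch of that descent, assuming vectors live in an array, per-layer adjacency lists live in neighbors[layer], and distance is any metric (all names are illustrative). The full algorithm replaces the greedy step on layer 0 with an ef-bounded beam search:

def hnsw_descent(query, entry_point, top_layer, neighbors, vectors, distance):
    """Greedy descent from the sparse top layer down to layer 0.
    At each layer, hop to a closer neighbor until none improves,
    then drop one layer; the final node seeds the fine search."""
    current = entry_point
    for layer in range(top_layer, -1, -1):
        improved = True
        while improved:
            improved = False
            for candidate in neighbors[layer].get(current, []):
                if distance(vectors[candidate], query) < distance(vectors[current], query):
                    current = candidate
                    improved = True
    return current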


Project Specification

What You’ll Build

A CLI vector database that supports add, search, filter, and benchmark operations.

Functional Requirements

  1. HNSW index with configurable M/ef
  2. Cosine and dot similarity
  3. Metadata filters (tag, date, type)
  4. Index persistence to disk
  5. Benchmark mode vs brute-force

Non-Functional Requirements

  • Deterministic tests with fixed seeds
  • Clear metrics output
  • Safe handling of missing index

Real World Outcome

Example usage:

$ vecdb add "The quick brown fox"
$ vecdb search "fast animal" --top-k 5

Output:

1. [0.91] "The quick brown fox"
2. [0.88] "Fast animals in the forest"
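
One way to get that command-line surface is a thin argparse wrapper. This sketch only parses and prints the arguments; wiring them to your store, index, and query engine is the project itself:

import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="vecdb")
    sub = parser.add_subparsers(dest="command", required=True)

    add_p = sub.add_parser("add", help="embed and index a document")
    add_p.add_argument("text")

    search_p = sub.add_parser("search", help="query the index")
    search_p.add_argument("query")
    search_p.add_argument("--top-k", type=int, default=5)

    sub.add_parser("benchmark", help="compare HNSW against brute-force")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # replace with calls into your store/index/query components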

Architecture Overview

┌──────────────┐  vectors   ┌──────────────┐
│ Vector Store │───────────▶│  HNSW Index  │
└──────┬───────┘            └──────┬───────┘
       │                           ▼
       ▼                    ┌──────────────┐
┌──────────────┐            │ Query Engine │
│ Persistence  │◀───────────│  + Filters   │
└──────────────┘            └──────────────┘
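
One way to map those boxes onto code is a handful of small classes with narrow interfaces. The names below mirror the diagram and are only a suggested skeleton, not a required design:

from dataclasses import dataclass, field

@dataclass
class Record:
    """A stored item: its id, embedding, and arbitrary metadata."""
    id: int
    vector: list[float]
    metadata: dict = field(default_factory=dict)

class VectorStore:
    """Owns the raw records; the index only ever sees ids and vectors."""
    def __init__(self):
        self.records: dict[int, Record] = {}

    def add(self, record: Record) -> None:
        self.records[record.id] = record

class HNSWIndex:
    """Approximate nearest-neighbor structure over the store's vectors."""
    def insert(self, id: int, vector: list[float]) -> None: ...
    def search(self, vector: list[float], ef: int) -> list[int]: ...

class QueryEngine:
    """Runs index search, then applies metadata filters to the candidates."""
    def __init__(self, store: VectorStore, index: HNSWIndex):
        self.store, self.index = store, index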

Implementation Guide

Phase 1: Brute-Force Baseline (3–5h)

  • Implement naive search
  • Checkpoint: correct top-k results
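
The brute-force baseline fits in a few lines of NumPy and doubles as ground truth for later recall measurements. A minimal sketch, assuming embeddings are L2-normalized:

import numpy as np

def brute_force_search(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    """Exact top-k by cosine similarity; O(n*d) per query.
    vectors is an (n, d) matrix and query a length-d vector, both unit-normalized,
    so the dot product equals cosine similarity."""
    scores = vectors @ query
    top = np.argsort(-scores)[:k]              # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top]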

Phase 2: HNSW Index (8–12h)

  • Build graph layers and search
  • Checkpoint: recall near brute-force
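
The heart of Phase 2 is the ef-bounded search within a single layer; insertion reuses it to pick each new node's neighbors. A compressed sketch, assuming unit vectors and the adjacency layout from the earlier descent example:

import heapq
import numpy as np

def search_layer(query, entry, neighbors, vectors, ef):
    """Beam search within one layer: keep the ef best nodes seen so far and
    expand the closest unexplored candidate until none can improve the result."""
    dist = lambda i: float(1.0 - vectors[i] @ query)   # cosine distance for unit vectors
    visited = {entry}
    candidates = [(dist(entry), entry)]                # min-heap of nodes to expand
    best = [(-dist(entry), entry)]                     # max-heap (negated) of current results
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:                            # closest candidate is worse than worst result
            break
        for nb in neighbors.get(node, []):
            if nb not in visited:
                visited.add(nb)
                d_nb = dist(nb)
                if len(best) < ef or d_nb < -best[0][0]:
                    heapq.heappush(candidates, (d_nb, nb))
                    heapq.heappush(best, (-d_nb, nb))
                    if len(best) > ef:
                        heapq.heappop(best)
    return sorted((-d, n) for d, n in best)            # (distance, node) pairs, closest first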

Phase 3: Persistence + Benchmarks (6–10h)

  • Save/load index
  • Run benchmarks
  • Checkpoint: report shows latency gains
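
For Phase 3, the simplest persistence layer is pickle plus a version tag, and recall is just overlap against the brute-force ground truth. A minimal sketch (names and format are ours):

import pickle

FORMAT_VERSION = 1  # bump whenever the on-disk layout changes

def save_index(index, path: str) -> None:
    """Write the index with a version tag so future loaders can reject old formats."""
    with open(path, "wb") as f:
        pickle.dump({"version": FORMAT_VERSION, "index": index}, f)

def load_index(path: str):
    with open(path, "rb") as f:
        payload = pickle.load(f)
    if payload.get("version") != FORMAT_VERSION:
        raise ValueError(f"unsupported index format: {payload.get('version')}")
    return payload["index"]

def recall_at_k(approx_ids: list, exact_ids: list) -> float:
    """Fraction of the exact top-k that the ANN search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / max(len(exact_ids), 1)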

Common Pitfalls & Debugging

Pitfall       Symptom                 Fix
Low recall    Wrong or missing hits   Increase ef (and possibly M)
Slow build    Long indexing time      Lower ef_construction or M
Broken load   Index fails to reload   Version the on-disk format

Interview Questions They’ll Ask

  1. Why is HNSW faster than brute-force?
  2. What parameters control recall vs latency?
  3. How do you support metadata filtering efficiently?

Hints in Layers

  • Hint 1: Implement brute-force first for ground truth.
  • Hint 2: Build HNSW insertion + search.
  • Hint 3: Add filters as post-processing.
  • Hint 4: Benchmark recall vs latency.
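
For Hint 3, post-filtering is the simplest correct approach: over-fetch from the index, then drop candidates whose metadata fails the predicate. A sketch, reusing the illustrative index/store skeleton from the architecture section:

def filtered_search(query_vec, index, store, filters: dict, k: int = 5, overfetch: int = 4):
    """Post-filter ANN results: retrieve roughly k * overfetch candidates, keep those
    whose metadata matches every filter key, and truncate to k. Raise overfetch (or
    fall back to brute-force) when filters are very selective."""
    candidate_ids = index.search(query_vec, ef=k * overfetch)
    hits = []
    for cid in candidate_ids:
        record = store.records[cid]
        if all(record.metadata.get(key) == value for key, value in filters.items()):
            hits.append(record)
        if len(hits) == k:
            break
    return hits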

Learning Milestones

  1. Baseline Works: brute-force returns correct results.
  2. ANN Works: HNSW returns near-correct results faster.
  3. Production Ready: persistence + filters stable.

Submission / Completion Criteria

Minimum Completion

  • HNSW index + search

Full Completion

  • Persistence + benchmarks

Excellence

  • Hybrid search or visualization

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/GENERATIVE_AI_LLM_RAG_LEARNING_PROJECTS.md.