# Project 2: Build Your Own Vector Database Engine
Implement a minimal vector database with HNSW indexing, metadata filters, and fast approximate search.
## Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2–3 weeks |
| Language | Python |
| Prerequisites | Graph data structures, embeddings basics |
| Key Topics | ANN search, HNSW, similarity metrics, persistence |
## Learning Objectives
By completing this project, you will:
- Implement HNSW graph construction and search.
- Compare brute-force vs ANN retrieval accuracy.
- Add metadata filtering to vector queries.
- Persist indexes and reload reliably.
- Benchmark recall vs latency tradeoffs.
## The Core Question You’re Answering
“How does a vector database achieve sublinear search without losing too much accuracy?”
This project makes the mechanics of approximate nearest-neighbor (ANN) search tangible: you build both the exact baseline and the approximation, then measure what the speedup costs in recall.
## Concepts You Must Understand First
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| HNSW graphs | Core ANN index structure | The HNSW paper (Malkov & Yashunin, arXiv:1603.09320) |
| Cosine similarity | Common metric | Vector math basics |
| Recall vs latency | Real-world tradeoff | IR benchmarks |
| Index persistence | Production readiness | Storage systems |
## Theoretical Foundation
### HNSW Search Intuition
Coarse top layer → greedy neighbor hops → descend a layer → repeat → fine bottom layer (layer 0)
HNSW searches a hierarchy of proximity graphs: it enters at a sparse top layer to reach the right region quickly, greedily hops to whichever neighbor is closest to the query, then descends layer by layer, refining candidates on the denser layers below.
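To make that concrete, here is a minimal sketch of the per-layer greedy hop, assuming each layer is a plain adjacency dict and vectors are unit-normalized (function and argument names are illustrative, not from any library):

```python
import numpy as np

def greedy_descend(layers, vectors, query, entry_point):
    """Greedy HNSW-style descent. `layers` is ordered top (sparsest) to
    bottom (layer 0); each layer maps node_id -> list of neighbor ids.
    Hop to the closest neighbor until no hop improves, then drop a layer."""
    current = entry_point
    for graph in layers:
        improved = True
        while improved:
            improved = False
            for neighbor in graph.get(current, []):
                # Dot product as similarity; assumes unit-normalized vectors.
                if np.dot(vectors[neighbor], query) > np.dot(vectors[current], query):
                    current = neighbor
                    improved = True
    # Real HNSW widens this to an ef-sized candidate beam on layer 0.
    return current
```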
## Project Specification
### What You’ll Build
A CLI vector database that supports add, search, filter, and benchmark operations.
### Functional Requirements
- HNSW index with configurable `M` and `ef`
- Cosine and dot-product similarity (see the sketch after this list)
- Metadata filters (tag, date, type)
- Index persistence to disk
- Benchmark mode comparing against brute-force search
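For the similarity requirement, a minimal sketch of both metrics; note that on unit-normalized vectors cosine and dot product give identical rankings:

```python
import numpy as np

def dot_scores(matrix: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Dot-product similarity of every stored vector against the query."""
    return matrix @ query

def cosine_scores(matrix: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity: dot product divided by both L2 norms."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return (matrix @ query) / np.clip(norms, 1e-12, None)
```

One workable design choice is to normalize vectors once at insert time, so the engine can use the cheaper dot product everywhere.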
### Non-Functional Requirements
- Deterministic tests with fixed seeds (see the snippet after this list)
- Clear metrics output (recall, latency)
- Graceful handling of a missing or corrupt index file
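One way to satisfy the determinism requirement, assuming NumPy supplies both the test data and the randomness for HNSW level assignment (a sketch, not the only approach):

```python
import numpy as np

SEED = 42  # one fixed seed shared by test data and index construction

rng = np.random.default_rng(SEED)
corpus = rng.normal(size=(1000, 64)).astype(np.float32)  # synthetic vectors

# Pass this same `rng` into the HNSW index so its random level
# assignment is reproducible across test runs.
```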
## Real-World Outcome
Example usage:

```bash
$ vecdb add "The quick brown fox"
$ vecdb search "fast animal" --top-k 5
```

Output:

```text
1. [0.91] "The quick brown fox"
2. [0.88] "Fast animals in the forest"
```
## Architecture Overview

```text
┌──────────────┐  vectors  ┌──────────────┐
│ Vector Store │──────────▶│  HNSW Index  │
└──────┬───────┘           └──────┬───────┘
       │                          ▼
       ▼                   ┌──────────────┐
┌──────────────┐           │ Query Engine │
│ Persistence  │◀──────────│  + Filters   │
└──────────────┘           └──────────────┘
```
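One hypothetical way to map the diagram onto code; all names here are illustrative, under the assumption that the HNSW index exposes a `search(query_vec, k)` method:

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    vector: list[float]
    text: str
    metadata: dict = field(default_factory=dict)

class VectorStore:
    """Owns raw vectors and metadata; hands vectors to the HNSW index."""
    def __init__(self):
        self.records: list[Record] = []

    def add(self, record: Record) -> int:
        self.records.append(record)
        return len(self.records) - 1  # this id is what the index stores

class QueryEngine:
    """Runs the index search, then applies metadata filters."""
    def __init__(self, store: VectorStore, index):
        self.store, self.index = store, index

    def search(self, query_vec, top_k=5, filters=None):
        # Over-fetch so post-filtering still leaves enough results.
        candidate_ids = self.index.search(query_vec, top_k * 4)
        results = [self.store.records[i] for i in candidate_ids]
        if filters:
            results = [r for r in results
                       if all(r.metadata.get(k) == v for k, v in filters.items())]
        return results[:top_k]
```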
## Implementation Guide
### Phase 1: Brute-Force Baseline (3–5h)
- Implement naive exact search over all stored vectors (see the sketch below)
- Checkpoint: correct top-k results on a small test corpus
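A minimal baseline sketch, assuming embeddings are stored row-wise in a NumPy matrix and unit-normalized:

```python
import numpy as np

def brute_force_search(matrix: np.ndarray, query: np.ndarray, top_k: int = 5):
    """Exact top-k by scoring every stored vector: O(n*d) per query.
    This is the ground truth HNSW recall is measured against later."""
    scores = matrix @ query                    # dot == cosine on unit vectors
    k = min(top_k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k in O(n)
    top = top[np.argsort(-scores[top])]        # sort just the k winners
    return [(int(i), float(scores[i])) for i in top]
```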
### Phase 2: HNSW Index (8–12h)
- Build the layered graph, insertion, and search (see the sketch below)
- Checkpoint: recall close to the brute-force ground truth
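Two small pieces worth getting right in this phase: the exponentially decaying level assignment from the HNSW paper, and the recall metric for the checkpoint. A sketch, assuming the seeded `rng` from the non-functional requirements:

```python
import math
import numpy as np

def random_level(rng: np.random.Generator, M: int = 16) -> int:
    """Exponentially decaying layer draw from the HNSW paper: most nodes
    stay on layer 0, so the upper layers remain sparse 'express lanes'."""
    m_L = 1.0 / math.log(M)                          # the paper's suggested scale
    return int(-math.log(1.0 - rng.random()) * m_L)  # 1 - U avoids log(0)

def recall_at_k(ann_ids, exact_ids) -> float:
    """Fraction of the exact top-k that the ANN search also returned."""
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)
```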
### Phase 3: Persistence + Benchmarks (6–10h)
- Save and load the index (see the sketch below)
- Run recall and latency benchmarks against the Phase 1 baseline
- Checkpoint: report shows latency gains at acceptable recall
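A hedged sketch of versioned persistence, assuming vectors are stored with NumPy and the graph plus a version stamp as JSON; the version check is what the “Broken load” pitfall below is about, and the missing-file check covers the safe-handling requirement:

```python
import json
from pathlib import Path

import numpy as np

FORMAT_VERSION = 1  # bump whenever the on-disk layout changes

def save_index(path: str, vectors: np.ndarray, graph: dict) -> None:
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    np.save(p / "vectors.npy", vectors)
    meta = {"version": FORMAT_VERSION,
            "graph": {str(node): nbrs for node, nbrs in graph.items()}}
    (p / "index.json").write_text(json.dumps(meta))

def load_index(path: str):
    p = Path(path)
    if not (p / "index.json").exists():
        raise FileNotFoundError(f"no index at {path}; run 'vecdb add' first")
    meta = json.loads((p / "index.json").read_text())
    if meta["version"] != FORMAT_VERSION:
        raise ValueError(f"index format v{meta['version']} != v{FORMAT_VERSION}; rebuild")
    vectors = np.load(p / "vectors.npy")
    graph = {int(node): nbrs for node, nbrs in meta["graph"].items()}
    return vectors, graph
```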
## Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Low recall | Results miss true nearest neighbors | Increase `ef` at search time; raise `M` |
| Slow build | Indexing takes far too long | Lower `ef_construction`; profile distance calls |
| Broken load | Index fails to reload from disk | Version the on-disk format and validate it on load |
## Interview Questions They’ll Ask
- Why is HNSW faster than brute-force?
- What parameters control recall vs latency?
- How do you support metadata filtering efficiently?
## Hints in Layers
- Hint 1: Implement brute-force search first to establish ground truth.
- Hint 2: Build HNSW insertion and search on top of it.
- Hint 3: Add metadata filters as post-processing on an over-fetched candidate list (see the sketch after this list).
- Hint 4: Benchmark recall vs latency across `ef` settings.
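Hint 3 in code form: a minimal post-filtering sketch. Over-fetch candidates from the index first, because filtering before the graph walk would require filter-aware traversal:

```python
def post_filter(candidates, metadata, filters, top_k):
    """candidates: [(id, score), ...] from HNSW, already over-fetched
    (e.g. 4x top_k). metadata: {id: {...}}. Keep candidates that match
    every filter key, then truncate back to top_k."""
    kept = [(i, score) for i, score in candidates
            if all(metadata[i].get(key) == value for key, value in filters.items())]
    return kept[:top_k]
```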
## Learning Milestones
- Baseline Works: brute-force returns correct results.
- ANN Works: HNSW returns near-correct results faster.
- Production Ready: persistence + filters stable.
## Submission / Completion Criteria
### Minimum Completion
- HNSW index + search
### Full Completion
- Persistence + benchmarks
### Excellence
- Hybrid (keyword + vector) search or an index visualization
This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/GENERATIVE_AI_LLM_RAG_LEARNING_PROJECTS.md.