Project 2: Build Your Own Vector Database Engine

Implement a minimal vector database with HNSW indexing, metadata filters, and fast approximate search.


Quick Reference

Attribute        Value
Difficulty       Level 4: Expert
Time Estimate    2–3 weeks
Language         Python
Prerequisites    Graph data structures, embeddings basics
Key Topics       ANN search, HNSW, similarity metrics, persistence

Learning Objectives

By completing this project, you will:

  1. Implement HNSW graph construction and search.
  2. Compare brute-force vs ANN retrieval accuracy.
  3. Add metadata filtering to vector queries.
  4. Persist indexes and reload reliably.
  5. Benchmark recall vs latency tradeoffs.

The Core Question You’re Answering

“How does a vector database achieve sublinear search without losing too much accuracy?”

This project makes ANN mechanisms tangible.


Concepts You Must Understand First

Concept             Why It Matters         Where to Learn
HNSW graphs         Core ANN structure     HNSW paper
Cosine similarity   Common metric          Vector math basics
Recall vs latency   Real-world tradeoff    IR benchmarks
Index persistence   Production readiness   Storage systems
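
Both similarity metrics from the table are one-liners over NumPy arrays. A minimal sketch for reference (the function names are ours, not a library API):

import numpy as np

def dot_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Raw inner product; larger means more similar."""
    return float(np.dot(a, b))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Inner product scaled by both norms, giving a score in [-1, 1]."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0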

Theoretical Foundation

HNSW Search Intuition

Coarse Layer -> Navigate neighbors -> Descend -> Fine Layer

HNSW searches a hierarchy of proximity graphs: a greedy walk across the sparse upper layers quickly narrows the search to a promising region, and the dense bottom layer refines that region into the final candidates.
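
A minimal sketch of that descent, assuming vectors live in an array, per-layer adjacency lists live in neighbors[layer], and distance is any metric (all names are illustrative). The full algorithm replaces the greedy step on layer 0 with an ef-bounded beam search:

def hnsw_descent(query, entry_point, top_layer, neighbors, vectors, distance):
    """Greedy descent from the sparse top layer down to layer 0.
    At each layer, hop to a closer neighbor until none improves,
    then drop one layer; the final node seeds the fine search."""
    current = entry_point
    for layer in range(top_layer, -1, -1):
        improved = True
        while improved:
            improved = False
            for candidate in neighbors[layer].get(current, []):
                if distance(vectors[candidate], query) < distance(vectors[current], query):
                    current = candidate
                    improved = True
    return current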


Project Specification

What You’ll Build

A CLI vector database that supports add, search, filter, and benchmark operations.

Functional Requirements

  1. HNSW index with configurable M/ef
  2. Cosine and dot similarity
  3. Metadata filters (tag, date, type)
  4. Index persistence to disk
  5. Benchmark mode vs brute-force

Non-Functional Requirements

  • Deterministic tests with fixed seeds
  • Clear metrics output
  • Safe handling of missing index

Real World Outcome

Example usage:

$ vecdb add "The quick brown fox"
$ vecdb search "fast animal" --top-k 5

Output:

1. [0.91] "The quick brown fox"
2. [0.88] "Fast animals in the forest"
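
One way to get that command-line surface is a thin argparse wrapper. This sketch only parses and prints the arguments; wiring them to your store, index, and query engine is the project itself:

import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="vecdb")
    sub = parser.add_subparsers(dest="command", required=True)

    add_p = sub.add_parser("add", help="embed and index a document")
    add_p.add_argument("text")

    search_p = sub.add_parser("search", help="query the index")
    search_p.add_argument("query")
    search_p.add_argument("--top-k", type=int, default=5)

    sub.add_parser("benchmark", help="compare HNSW against brute-force")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)  # replace with calls into your store/index/query components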

Architecture Overview

┌──────────────┐  vectors   ┌──────────────┐
│ Vector Store │───────────▶│  HNSW Index  │
└──────┬───────┘            └──────┬───────┘
       │                           ▼
       ▼                    ┌──────────────┐
┌──────────────┐            │ Query Engine │
│ Persistence  │◀───────────│  + Filters   │
└──────────────┘            └──────────────┘
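
One way to map those boxes onto code is a handful of small classes with narrow interfaces. The names below mirror the diagram and are only a suggested skeleton, not a required design:

from dataclasses import dataclass, field

@dataclass
class Record:
    """A stored item: its id, embedding, and arbitrary metadata."""
    id: int
    vector: list[float]
    metadata: dict = field(default_factory=dict)

class VectorStore:
    """Owns the raw records; the index only ever sees ids and vectors."""
    def __init__(self):
        self.records: dict[int, Record] = {}

    def add(self, record: Record) -> None:
        self.records[record.id] = record

class HNSWIndex:
    """Approximate nearest-neighbor structure over the store's vectors."""
    def insert(self, id: int, vector: list[float]) -> None: ...
    def search(self, vector: list[float], ef: int) -> list[int]: ...

class QueryEngine:
    """Runs index search, then applies metadata filters to the candidates."""
    def __init__(self, store: VectorStore, index: HNSWIndex):
        self.store, self.index = store, index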

Implementation Guide

Phase 1: Brute-Force Baseline (3–5h)

  • Implement naive search
  • Checkpoint: correct top-k results
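
The brute-force baseline fits in a few lines of NumPy and doubles as ground truth for later recall measurements. A minimal sketch, assuming embeddings are L2-normalized:

import numpy as np

def brute_force_search(query: np.ndarray, vectors: np.ndarray, k: int = 5):
    """Exact top-k by cosine similarity; O(n*d) per query.
    vectors is an (n, d) matrix and query a length-d vector, both unit-normalized,
    so the dot product equals cosine similarity."""
    scores = vectors @ query
    top = np.argsort(-scores)[:k]              # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top]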

Phase 2: HNSW Index (8–12h)

  • Build graph layers and search
  • Checkpoint: recall near brute-force
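
The heart of Phase 2 is the ef-bounded search within a single layer; insertion reuses it to pick each new node's neighbors. A compressed sketch, assuming unit vectors and the adjacency layout from the earlier descent example:

import heapq
import numpy as np

def search_layer(query, entry, neighbors, vectors, ef):
    """Beam search within one layer: keep the ef best nodes seen so far and
    expand the closest unexplored candidate until none can improve the result."""
    dist = lambda i: float(1.0 - vectors[i] @ query)   # cosine distance for unit vectors
    visited = {entry}
    candidates = [(dist(entry), entry)]                # min-heap of nodes to expand
    best = [(-dist(entry), entry)]                     # max-heap (negated) of current results
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0]:                            # closest candidate is worse than worst result
            break
        for nb in neighbors.get(node, []):
            if nb not in visited:
                visited.add(nb)
                d_nb = dist(nb)
                if len(best) < ef or d_nb < -best[0][0]:
                    heapq.heappush(candidates, (d_nb, nb))
                    heapq.heappush(best, (-d_nb, nb))
                    if len(best) > ef:
                        heapq.heappop(best)
    return sorted((-d, n) for d, n in best)            # (distance, node) pairs, closest first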

Phase 3: Persistence + Benchmarks (6–10h)

  • Save/load index
  • Run benchmarks
  • Checkpoint: report shows latency gains
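
For Phase 3, the simplest persistence layer is pickle plus a version tag, and recall is just overlap against the brute-force ground truth. A minimal sketch (names and format are ours):

import pickle

FORMAT_VERSION = 1  # bump whenever the on-disk layout changes

def save_index(index, path: str) -> None:
    """Write the index with a version tag so future loaders can reject old formats."""
    with open(path, "wb") as f:
        pickle.dump({"version": FORMAT_VERSION, "index": index}, f)

def load_index(path: str):
    with open(path, "rb") as f:
        payload = pickle.load(f)
    if payload.get("version") != FORMAT_VERSION:
        raise ValueError(f"unsupported index format: {payload.get('version')}")
    return payload["index"]

def recall_at_k(approx_ids: list, exact_ids: list) -> float:
    """Fraction of the exact top-k that the ANN search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / max(len(exact_ids), 1)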

Common Pitfalls & Debugging

Pitfall       Symptom                 Fix
Low recall    Wrong or missing hits   Increase ef (and possibly M)
Slow build    Long indexing time      Lower ef_construction or M
Broken load   Index fails to reload   Version the on-disk format

Interview Questions They’ll Ask

  1. Why is HNSW faster than brute-force?
  2. What parameters control recall vs latency?
  3. How do you support metadata filtering efficiently?

Hints in Layers

  • Hint 1: Implement brute-force first for ground truth.
  • Hint 2: Build HNSW insertion + search.
  • Hint 3: Add filters as post-processing.
  • Hint 4: Benchmark recall vs latency.
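
For Hint 3, post-filtering is the simplest correct approach: over-fetch from the index, then drop candidates whose metadata fails the predicate. A sketch, reusing the illustrative index/store skeleton from the architecture section:

def filtered_search(query_vec, index, store, filters: dict, k: int = 5, overfetch: int = 4):
    """Post-filter ANN results: retrieve roughly k * overfetch candidates, keep those
    whose metadata matches every filter key, and truncate to k. Raise overfetch (or
    fall back to brute-force) when filters are very selective."""
    candidate_ids = index.search(query_vec, ef=k * overfetch)
    hits = []
    for cid in candidate_ids:
        record = store.records[cid]
        if all(record.metadata.get(key) == value for key, value in filters.items()):
            hits.append(record)
        if len(hits) == k:
            break
    return hits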

Learning Milestones

  1. Baseline Works: brute-force returns correct results.
  2. ANN Works: HNSW returns near-correct results faster.
  3. Production Ready: persistence + filters stable.

Submission / Completion Criteria

Minimum Completion

  • HNSW index + search

Full Completion

  • Persistence + benchmarks

Excellence

  • Hybrid search or visualization

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/GENERATIVE_AI_LLM_RAG_LEARNING_PROJECTS.md.