GENERATIVE AI LLM RAG LEARNING PROJECTS
Learning Generative AI, LLMs, RAG, Vector Databases & Reranking
Goal: Deeply understand Generative AI, LLMs, RAG (Retrieval-Augmented Generation), Vector Databases, and Reranking—not just as black-box APIs, but as systems you can build, debug, and optimize from first principles.
Why This Matters in 2025
Every AI application you interact with—ChatGPT, GitHub Copilot, Notion AI, customer support chatbots—is built on these five technologies. Understanding them is no longer optional for developers:
- OpenAI’s revenue: $3.7B in 2024, projected $11.6B in 2025
- Market size: The global generative AI market is expected to reach $1.3 trillion by 2032, growing at 42% CAGR
- Job demand: “RAG Engineer” and “LLM Engineer” roles have grown 850% year-over-year on LinkedIn (2024)
- Production usage: 92% of Fortune 500 companies are experimenting with or deploying LLM-based applications (Gartner, 2024)
But here’s the critical insight: Most developers only know how to call APIs. If you understand how transformers work, how embeddings encode meaning, how vector databases achieve sub-linear search, and why reranking matters—you’re in the top 5%.
What You’ll Be Able to Do
After completing these projects, you will:
- Understand LLMs at the architectural level - Know exactly what happens when you type a prompt and hit “Send”
- Debug RAG systems - Diagnose why retrieval fails and know exactly where to optimize
- Build production-grade search - Implement two-stage retrieval pipelines that outperform naive approaches
- Optimize embedding models - Fine-tune embeddings for your domain and measure the impact
- Design AI systems - Make informed architecture decisions about latency, accuracy, and cost tradeoffs
Why This Stack Matters
The Modern AI Architecture
Every modern AI application follows the same pattern:
User Query
↓
Embedding Model (converts text → vector)
↓
Vector Database (retrieves similar documents)
↓
Reranker (refines top results)
↓
LLM (generates answer using retrieved context)
↓
Response
This isn’t just ChatGPT. This is:
- GitHub Copilot: Retrieves similar code from your codebase, generates completions
- Notion AI: Retrieves relevant notes, summarizes or answers questions
- Customer support bots: Retrieves documentation, generates responses
- Legal AI: Retrieves case law, generates arguments
- Medical AI: Retrieves patient records + literature, suggests diagnoses
The stack is universal. The implementation details are where mastery lives.
Why Not Just Use APIs?
Using LangChain + OpenAI + Pinecone is fast to prototype. But:
- Black boxes fail silently - Your RAG system returns bad answers. Is it chunking? Retrieval? The LLM? You have no idea.
- Costs explode at scale - API providers bill per token, so at 100M queries/month even the embedding calls alone add up to thousands of dollars. Can you optimize?
- Latency kills UX - Your search takes 2 seconds. Where’s the bottleneck? You can’t fix what you don’t understand.
- Vendor lock-in - OpenAI changes their API. Your entire system breaks. Do you understand the primitives well enough to switch?
Building from scratch once teaches you debugging, optimization, and architecture forever.
Detailed Concept Explanations
1. The Transformer Architecture
Transformers are the foundation of all modern LLMs (GPT, Claude, Llama). The core innovation: self-attention, which lets the model decide which tokens to “look at” when processing each token.
How Self-Attention Works
Input sentence: "The cat sat on the mat"
Step 1: Convert to embeddings
["The", "cat", "sat", "on", "the", "mat"]
↓ ↓ ↓ ↓ ↓ ↓
[e1] [e2] [e3] [e4] [e5] [e6] (each eᵢ is a 768-dim vector)
Step 2: For each token, compute attention scores to every other token
When processing "sat":
"sat" looks at "The" → score: 0.1
"sat" looks at "cat" → score: 0.8 (high! "cat" is the subject)
"sat" looks at "sat" → score: 0.3
"sat" looks at "on" → score: 0.6 (high! indicates direction)
"sat" looks at "the" → score: 0.1
"sat" looks at "mat" → score: 0.7 (high! "mat" is the object)
Step 3: Create context-aware representation
"sat" embedding = 0.1*e₁ + 0.8*e₂ + 0.3*e₃ + 0.6*e₄ + 0.1*e₅ + 0.7*e₆
(a weighted sum using the attention scores; in the full mechanism the scores are first softmax-normalized to sum to 1 and weight the value vectors, as shown below)
Attention Mechanism (ASCII Visualization)
Query, Key, Value Matrices (the heart of attention):
Token: "sat"
↓
[Query vector q]
|
| Compare via dot product
|
┌─────────┼─────────┐
↓ ↓ ↓
[Key₁] [Key₂] [Key₃] ... [Keyₙ]
("The") ("cat") ("sat") ("mat")
↓ ↓ ↓ ↓
score₁ score₂ score₃ scoreₙ
0.1 0.8 0.3 0.7
↓
Softmax (normalize to sum to 1)
↓
[attention weights]
↓
Weighted sum of Value vectors
↓
[Context-aware representation of "sat"]
Key Insight: The model learns what to attend to during training. It’s not hard-coded—gradient descent discovers that verbs should attend to subjects and objects.
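To make the Query/Key/Value flow concrete, here is a minimal sketch of single-head scaled dot-product attention in PyTorch. The weight matrices and the 6-token input are random placeholders, not trained values; the point is the shape of the computation, not the numbers.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(x, W_q, W_k, W_v):
    # x: (seq_len, d_model) token embeddings
    Q, K, V = x @ W_q, x @ W_k, x @ W_v        # project into queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5              # (seq_len, seq_len) raw attention scores
    weights = F.softmax(scores, dim=-1)        # each row now sums to 1
    return weights @ V, weights                # context-aware representations

torch.manual_seed(0)
d_model = 16
x = torch.randn(6, d_model)                    # 6 tokens, e.g. "The cat sat on the mat"
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
out, attn = scaled_dot_product_attention(x, W_q, W_k, W_v)
print(attn[2])                                 # how token 3 ("sat") attends to every token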
The Full Transformer Block
Input Tokens
↓
Embedding Layer (word → vector)
↓
Positional Encoding (add position info)
↓
╔═══════════════════════════════════╗
║ Transformer Block (×N layers) ║
║ ║
║ ┌─────────────────────────┐ ║
║ │ Multi-Head Attention │ ║ ← 8-16 attention heads in parallel
║ └─────────────────────────┘ ║
║ ↓ ║
║ ┌─────────────────────────┐ ║
║ │ Add & Normalize │ ║ ← Residual connection + LayerNorm
║ └─────────────────────────┘ ║
║ ↓ ║
║ ┌─────────────────────────┐ ║
║ │ Feed-Forward Network │ ║ ← 2-layer MLP
║ └─────────────────────────┘ ║
║ ↓ ║
║ ┌─────────────────────────┐ ║
║ │ Add & Normalize │ ║ ← Residual connection + LayerNorm
║ └─────────────────────────┘ ║
╚═══════════════════════════════════╝
↓
Output Logits (vocabulary probabilities)
↓
Softmax → Next Token
Why Transformers Dominate:
- Parallelizable: Unlike RNNs, all tokens can be processed simultaneously
- Long-range dependencies: Attention connects distant tokens directly
- Scalable: Performance improves predictably with more parameters (scaling laws)
Key Papers:
- “Attention Is All You Need” - Vaswani et al. (2017) - The original transformer paper
- “Language Models are Few-Shot Learners” - Brown et al. (2020) - GPT-3, showing transformer scaling
2. Embeddings and Vector Spaces
Embeddings convert text into numbers—but not just any numbers. They create a semantic space where similar meanings are close together.
How Embeddings Encode Meaning
Sentence: "The dog is running"
↓
Embedding Model (e.g., BERT, Sentence-BERT)
↓
768-dimensional vector: [0.23, -0.47, 0.81, ..., 0.12]
Sentence: "A puppy is jogging"
↓
Embedding Model
↓
768-dimensional vector: [0.21, -0.43, 0.79, ..., 0.09]
↑ Very close to "dog running"!
Sentence: "The database is slow"
↓
Embedding Model
↓
768-dimensional vector: [-0.82, 0.34, -0.15, ..., 0.67]
↑ Far from "dog running"
Vector Similarity in Semantic Space
Imagine 2D for visualization (real embeddings are 768+ dims):
"dog running"
●
\
\ cosine distance = 0.02 (very similar!)
\
● "puppy jogging"
● "database slow"
(distance = 0.98, very different)
Mathematical similarity:
cosine_similarity(v₁, v₂) = (v₁ · v₂) / (||v₁|| × ||v₂||)
Range: [-1, 1]
 1.0  = identical meaning
 ≥0.9 = very similar
 ≥0.7 = related
 <0.5 = different
 -1.0 = opposite meaning
Why Embeddings Are Powerful
Traditional keyword search fails:
Query: "dog running"
Document: "A puppy is jogging in the park"
Keyword overlap: 0 words! ❌ No match
Embedding search succeeds:
Query embedding: [0.23, -0.47, 0.81, ...]
Document embedding: [0.21, -0.43, 0.79, ...]
Cosine similarity: 0.94 ✅ High match!
The Magic: Embeddings learned from massive text corpora capture:
- Synonyms (“dog” ≈ “puppy”, “running” ≈ “jogging”)
- Semantic relationships (“king” - “man” + “woman” ≈ “queen”)
- Context (“bank” has different embeddings in “river bank” vs “money bank”)
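To see this behavior concretely, here is a minimal sketch using the sentence-transformers library with the all-MiniLM-L6-v2 checkpoint (one common choice; any sentence-embedding model behaves similarly):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim sentence embeddings
sentences = ["The dog is running", "A puppy is jogging", "The database is slow"]
emb = model.encode(sentences, normalize_embeddings=True)

print(util.cos_sim(emb[0], emb[1]))   # high similarity: same meaning, zero shared keywords
print(util.cos_sim(emb[0], emb[2]))   # low similarity: unrelated topics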
How Embedding Models Are Trained
Training Objective: Contrastive Learning
Positive pairs (similar):
"dog running" ←→ "puppy jogging" (pull together)
"database slow" ←→ "sluggish database" (pull together)
Negative pairs (different):
"dog running" ←→ "database slow" (push apart)
"puppy jogging" ←→ "sluggish database" (push apart)
Loss function (InfoNCE):
L = -log( exp(sim(q, p⁺)) / [exp(sim(q, p⁺)) + Σ exp(sim(q, pⁱ⁻))] )
Where:
q = query embedding
p⁺ = positive example embedding
pⁱ⁻ = negative example embeddings
Goal: Maximize similarity to positive, minimize similarity to negatives
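A minimal PyTorch sketch of this objective, for one query with one positive and a handful of negatives. The temperature term is an assumption (commonly added in practice, though not written in the formula above):

import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.05):
    # Treat the positive as "class 0" among all candidates and apply cross-entropy
    q = F.normalize(query, dim=-1)
    cands = F.normalize(torch.vstack([positive, negatives]), dim=-1)  # (1 + n_neg, d)
    logits = (cands @ q) / temperature           # cosine similarities as logits
    target = torch.zeros(1, dtype=torch.long)    # index of the positive example
    return F.cross_entropy(logits.unsqueeze(0), target)

d = 128
q, p = torch.randn(d), torch.randn(d)
negs = torch.randn(8, d)
print(info_nce_loss(q, p, negs))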
Key Papers:
- “Sentence-BERT” - Reimers & Gurevych (2019) - Efficient sentence embeddings
- “SimCSE” - Gao et al. (2021) - Simple contrastive learning for sentence embeddings
3. RAG Architecture
RAG solves the LLM’s fundamental limitation: knowledge cutoff. LLMs know only what they saw during training. RAG dynamically injects relevant information.
The RAG Pipeline
User Query: "What is our refund policy?"
↓
┌────────────────────────────────────────┐
│ Step 1: Query Embedding │
│ "What is our refund policy?" │
│ → [0.34, -0.21, 0.67, ...] │
└────────────────────────────────────────┘
↓
┌────────────────────────────────────────┐
│ Step 2: Retrieve Similar Documents │
│ Search vector DB for nearest vectors │
│ │
│ Results: │
│ [1] "Refund Policy.pdf" (score: 0.92)│
│ [2] "FAQ.md" (score: 0.81) │
│ [3] "Terms.pdf" (score: 0.73) │
└────────────────────────────────────────┘
↓
┌────────────────────────────────────────┐
│ Step 3: Construct Prompt │
│ │
│ System: You are a helpful assistant. │
│ Use the following context: │
│ │
│ [Context from retrieved docs] │
│ Refund Policy.pdf: "We offer..." │
│ FAQ.md: "Customers can request..." │
│ │
│ User Question: What is our refund │
│ policy? │
└────────────────────────────────────────┘
↓
┌────────────────────────────────────────┐
│ Step 4: LLM Generation │
│ GPT-4 / Claude / Llama │
│ → "Based on our refund policy..." │
└────────────────────────────────────────┘
↓
Answer with citations
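As a concrete illustration of Step 3, here is a minimal sketch of prompt construction from retrieved chunks. The document names and strings are made up for the example:

def build_rag_prompt(question, retrieved):
    # retrieved: list of (source_name, chunk_text) pairs returned by the vector DB
    context = "\n\n".join(f"[{src}] {text}" for src, text in retrieved)
    return (
        "You are a helpful assistant. Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

retrieved = [
    ("Refund Policy.pdf", "We offer refunds within 30 days of purchase."),
    ("FAQ.md", "Customers can request a refund via the support portal."),
]
print(build_rag_prompt("What is our refund policy?", retrieved))
# This string is what gets sent to the LLM in Step 4.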
Document Chunking: The Hidden Complexity
Before documents enter the vector DB, they must be split into chunks. This is critical for quality.
Original document (5000 words):
"Our company was founded in 2010. [...2000 words...]
Refund Policy: Customers can request refunds within 30 days. [...3000 words...]"
❌ Bad chunking (entire doc = 1 chunk):
- Embedding diluted across unrelated content
- LLM context window wasted on irrelevant text
- Result: Poor retrieval, poor generation
✅ Good chunking (semantic sections):
Chunk 1: "Our company was founded..." (400 words)
Chunk 2: "Refund Policy: Customers can..." (300 words) ← HIGH RELEVANCE
Chunk 3: "Shipping information..." (350 words)
Result: Chunk 2 retrieved with high score, LLM gets exactly what it needs
Chunking Strategies:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fixed-size | Split every N tokens | Simple, fast | Breaks sentences, no semantic awareness |
| Sentence-based | Split at sentence boundaries | Preserves meaning | Chunks vary wildly in size |
| Semantic | Detect topic shifts (e.g., TextTiling) | Best retrieval quality | Slow, complex |
| Recursive | Split by sections → paragraphs → sentences | Hierarchical structure | Implementation complexity |
Real-world impact: Improving chunking alone can increase RAG answer quality by 20-40% (source: Enhancing RAG study, 2025)
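For reference, a minimal sketch of the fixed-size strategy from the table above, with a small overlap added (the overlap is an illustration choice, not part of the table; it keeps sentences cut at a boundary intact in the neighboring chunk):

def chunk_fixed(text, chunk_size=400, overlap=50):
    # Word-based fixed-size chunking; chunk_size and overlap are in words
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        start = end - overlap   # step back to create the overlap
    return chunks

doc = "Our company was founded in 2010. " * 300 + "Refund Policy: Customers can request refunds within 30 days."
print(len(chunk_fixed(doc)), "chunks")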
The “Lost in the Middle” Problem
LLMs have a bias: they pay more attention to the beginning and end of context.
Context with 10 retrieved chunks:
Chunk 1: High attention ✅
Chunk 2: Medium attention
Chunk 3: Low attention
Chunk 4: Very low attention
Chunk 5: LOWEST attention ← "Lost in the middle"
Chunk 6: Very low attention
Chunk 7: Low attention
Chunk 8: Medium attention
Chunk 9: High attention ✅
Chunk 10: High attention ✅
Solution: Only retrieve the most relevant chunks. This is where reranking helps.
Key Papers:
- “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” - Lewis et al. (2020) - Original RAG paper
- “Lost in the Middle” - Liu et al. (2023) - LLM context position bias study
4. Vector Database Internals
Vector databases (Pinecone, Weaviate, Chroma) solve one problem: fast nearest neighbor search over millions/billions of vectors.
The Fundamental Problem
Naive search (brute force):
Query: [0.5, 0.3, 0.8, ...]
Database: 10 million vectors
Algorithm:
for each vector in database:
compute cosine_similarity(query, vector)
return top-k by similarity
Time complexity: O(n × d)
n = 10M vectors
d = 768 dimensions
= 7.68 billion operations per query
= ~500ms latency
❌ Unacceptable for production (users expect <100ms)
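The same brute-force scan as a runnable NumPy sketch (random vectors stand in for real embeddings; exact timings depend on hardware):

import numpy as np

def brute_force_topk(query, vectors, k=10):
    # Exact search: score every vector, keep the best k
    # Assumes rows of `vectors` and `query` are L2-normalized, so dot product = cosine
    scores = vectors @ query                    # O(n * d)
    top = np.argpartition(-scores, k)[:k]       # unordered top-k in O(n)
    return top[np.argsort(-scores[top])]        # sort only the k winners

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100_000, 768)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = vectors[42]
print(brute_force_topk(query, vectors, k=5))    # every query touches all 100K vectors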
Solution: Approximate Nearest Neighbors (ANN) algorithms trade perfect accuracy for speed.
HNSW: Hierarchical Navigable Small World Graphs
HNSW is the gold standard for ANN search. It builds a multi-layer graph where:
- Lower layers: Dense connections, fine-grained navigation
- Upper layers: Sparse connections, coarse-grained navigation (express lanes)
HNSW Graph Structure (3 layers):
Layer 2 (sparse):
A ←→ E ←→ K
↓ ↓ ↓
Layer 1 (medium):
A ←→ B ←→ E ←→ H ←→ K ←→ N
↓ ↓ ↓ ↓ ↓ ↓
Layer 0 (dense):
A ↔ B ↔ C ↔ D ↔ E ↔ F ↔ G ↔ H ↔ I ↔ J ↔ K ↔ L ↔ M ↔ N ↔ O ↔ P
(all vectors exist here)
Search algorithm:
1. Enter at top layer, find closest node
2. Jump to next layer down
3. Navigate locally to closer nodes
4. Repeat until bottom layer
5. Return k nearest neighbors
Example search for query Q (closest to vector G):
Layer 2: Start at A → jump to E (closer to Q) → jump to K (not closer, go down)
Layer 1: At E → navigate to H (closer) → navigate to K (not closer, go down)
Layer 0: At H → navigate to G (closest!) → return G + neighbors
Comparisons: ~12 (instead of 10M!)
Speedup: >100,000× faster than brute force
Accuracy: ~95% recall@10 (finds 9.5 out of 10 true nearest neighbors)
The Precision-Recall-Latency Tradeoff
Vector DB Settings (HNSW parameters):
ef_construction = 200 (build-time search depth)
ef_search = 100 (query-time search depth)
M = 16 (connections per node)
High M, High ef_search
↑
│
┌───────────┼───────────┐
│ │ │
Slow│ Perfect │ Good │ Fast
│ Recall │ Recall │ Recall
│ 100% │ 95% │ 80%
│ │ │
└───────────┼───────────┘
│
↓
Low M, Low ef_search
Production choice: 95% recall, <50ms latency
(Missing 5% is acceptable; 500ms latency is not)
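These knobs are easiest to get a feel for with an off-the-shelf HNSW implementation. A minimal sketch using the hnswlib library (one possible choice; the parameter names match the settings above):

import numpy as np
import hnswlib

dim, n = 384, 100_000
data = np.float32(np.random.random((n, dim)))

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # build-time knobs
index.add_items(data, np.arange(n))

index.set_ef(100)                                # query-time recall/latency knob
labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape)                              # (5, 10) nearest-neighbor ids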
Key Papers:
- “Efficient and robust approximate nearest neighbor search using HNSW” - Malkov & Yashunin (2016)
- “Billion-scale similarity search with GPUs” - Johnson et al. (2017) - FAISS library
5. Two-Stage Retrieval (Bi-Encoder → Cross-Encoder)
The problem with embeddings: they compress an entire document into a single vector. This loses nuance.
Bi-Encoders vs Cross-Encoders
Bi-Encoder (e.g., Sentence-BERT):
Query: "dog running"
↓
Encoder
↓
[q_embedding]
Document: "puppy jogging"
↓
Encoder
↓
[d_embedding]
Similarity = cosine(q_embedding, d_embedding)
Pros: ✅ Fast (encode once, store in vector DB)
Cons: ❌ Independent encoding loses query-document interaction
Cross-Encoder (e.g., BERT):
Query: "dog running"
Document: "puppy jogging"
↓
Concatenate: "[CLS] dog running [SEP] puppy jogging [SEP]"
↓
BERT Encoder (processes both together)
↓
[relevance score]
Pros: ✅ Accurate (query and document interact in attention layers)
Cons: ❌ Slow (must encode every query-document pair at query time)
Two-Stage Pipeline
Stage 1: Bi-Encoder (Fast Retrieval)
Input: User query
Action: Search vector DB with bi-encoder embeddings
Output: Top 100 candidate documents
Time: ~50ms
Stage 2: Cross-Encoder (Accurate Reranking)
Input: Query + 100 candidates
Action: Score each pair with cross-encoder
Output: Reranked top 10 documents
Time: ~200ms (100 documents × 2ms each)
Total time: ~250ms
Quality: +15-25% precision@10 vs bi-encoder alone
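A minimal sketch of this two-stage pipeline using the sentence-transformers library; the model names are common public checkpoints chosen for illustration, not a prescription:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

docs = ["Password Requirements", "Account Security Guide",
        "How to Reset Your Password", "Two-Factor Authentication"]
query = "How to reset my password"

# Stage 1: fast bi-encoder retrieval over precomputed document embeddings
doc_emb = bi_encoder.encode(docs, convert_to_tensor=True, normalize_embeddings=True)
q_emb = bi_encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
hits = util.semantic_search(q_emb, doc_emb, top_k=4)[0]

# Stage 2: rerank the candidates with the cross-encoder
pairs = [(query, docs[h["corpus_id"]]) for h in hits]
scores = cross_encoder.predict(pairs)
for score, (_, doc) in sorted(zip(scores, pairs), reverse=True, key=lambda x: x[0]):
    print(f"{score:.2f}  {doc}")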
Real-World Impact
Example query: "How to reset my password"
Bi-encoder results (before reranking):
1. "Password Requirements" (score: 0.78)
2. "Account Security Guide" (score: 0.76)
3. "How to Reset Your Password" (score: 0.74) ← SHOULD BE #1
4. "Two-Factor Authentication" (score: 0.71)
5. "Login Troubleshooting" (score: 0.70)
Cross-encoder reranking:
1. "How to Reset Your Password" (score: 0.94) ✅
2. "Login Troubleshooting" (score: 0.82)
3. "Password Requirements" (score: 0.79)
4. "Account Security Guide" (score: 0.71)
5. "Two-Factor Authentication" (score: 0.65)
User clicks on first result: 89% vs 34% (before reranking)
Key Papers:
- “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks” - Reimers & Gurevych (2019)
- “The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes” - Reimers & Gurevych (2021)
Concept Summary Table
| Concept Cluster | What You Must Internalize |
|---|---|
| LLM Foundations | Self-attention mechanics, tokenization (BPE), autoregressive generation, why transformers parallelize |
| Embeddings | Contrastive learning, cosine similarity, why semantic spaces work, fine-tuning embeddings |
| RAG System Design | Chunking strategies, retrieval-generation coupling, prompt construction, evaluation metrics |
| Vector Databases | ANN algorithms (HNSW), precision-recall-latency tradeoffs, indexing vs querying, metadata filtering |
| Reranking | Bi-encoder vs cross-encoder architecture, two-stage pipelines, when to rerank vs re-retrieve |
Deep Dive Reading by Concept
Transformer Architecture & LLMs
| Book | Relevant Chapters | What It Teaches |
|---|---|---|
| “Build a Large Language Model (From Scratch)” by Sebastian Raschka | Ch. 2-4 | Implementing attention, transformers, and training from scratch |
| “Deep Learning” by Goodfellow, Bengio, Courville | Ch. 10, 12 | Theoretical foundations of sequence models and optimization |
| “Speech and Language Processing” by Jurafsky & Martin (3rd ed.) | Ch. 7, 9, 10 | Tokenization, language models, transformers |
| “Natural Language Processing with Transformers” by Tunstall, von Werra, Wolf | Ch. 1-4 | Practical transformer implementation with Hugging Face |
Key Papers:
- “Attention Is All You Need” - The transformer paper
- “BERT: Pre-training of Deep Bidirectional Transformers” - Devlin et al.
- “Language Models are Few-Shot Learners” - GPT-3 paper
Embeddings & Vector Spaces
| Book | Relevant Chapters | What It Teaches |
|---|---|---|
| “Speech and Language Processing” by Jurafsky & Martin | Ch. 6 | Word embeddings (Word2Vec, GloVe), distributional semantics |
| “Deep Learning” by Goodfellow, Bengio, Courville | Ch. 15.4 | Representation learning theory |
| “AI Engineering” by Chip Huyen | Ch. 4 | Practical embedding usage and fine-tuning |
| “Natural Language Processing with Transformers” | Ch. 2 | Contextual embeddings (BERT, Sentence-BERT) |
Key Papers:
- “Sentence-BERT” - Efficient sentence embeddings
- “SimCSE” - Simple contrastive learning
RAG Systems
| Book | Relevant Chapters | What It Teaches |
|---|---|---|
| “AI Engineering” by Chip Huyen | Ch. 6-8 | RAG architecture, chunking, evaluation, production patterns |
| “Building LLMs for Production” by Sarkar & Tunstall | Ch. 5-7 | Retrieval systems, context management, optimization |
| “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 3 | Information retrieval fundamentals, indexing |
Key Papers:
- “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” - Original RAG paper
- “Lost in the Middle” - Context position bias
- “Enhancing RAG: A Study of Best Practices” - 2025 comprehensive study
Vector Databases & ANN Algorithms
| Book | Relevant Chapters | What It Teaches |
|---|---|---|
| “Algorithms, Fourth Edition” by Sedgewick & Wayne | Ch. 3 (Searching) | Fundamental search algorithms, BSTs, hash tables |
| “Introduction to Information Retrieval” by Manning, Raghavan, Schütze | Ch. 1, 6, 18 | IR fundamentals, indexing, similarity search |
| “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 3, 12 | Database indexing, distributed systems |
| “AI Engineering” by Chip Huyen | Ch. 7 | Vector database selection, production deployment |
Key Papers:
- “Efficient and robust approximate nearest neighbor search using HNSW” - HNSW algorithm
- “Billion-scale similarity search with GPUs” - FAISS
- “Milvus: A Purpose-Built Vector Data Management System” - Vector DB architecture
Reranking & Two-Stage Retrieval
| Book | Relevant Chapters | What It Teaches |
|---|---|---|
| “Introduction to Information Retrieval” by Manning, Raghavan, Schütze | Ch. 8, 15 | Scoring, ranking, learning to rank |
| “AI Engineering” by Chip Huyen | Ch. 6 | Reranking in production RAG systems |
| “Natural Language Processing with Transformers” | Ch. 3 | Cross-encoders, BERT for ranking |
Key Papers:
- “Sentence-BERT” - Bi-encoders for retrieval
- “The Curse of Dense Low-Dimensional Information Retrieval” - Why reranking helps
- “ColBERT: Efficient and Effective Passage Search” - Late interaction reranking
Core Concept Analysis
1. Generative AI & LLMs
- Transformer architecture (self-attention, multi-head attention)
- Tokenization (BPE, WordPiece, SentencePiece)
- Embeddings and positional encoding
- Forward pass, softmax, and generation strategies (greedy, beam search, sampling)
- Fine-tuning vs prompting vs RAG
2. RAG (Retrieval Augmented Generation)
- Document chunking strategies (fixed-size, semantic, sentence-based)
- Embedding generation (bi-encoders)
- Retrieval pipelines (sparse, dense, hybrid)
- Context injection and prompt construction
- Evaluation metrics (relevance, faithfulness, answer quality)
3. Vector Databases
- Vector similarity metrics (cosine, euclidean, dot product)
- Indexing algorithms (HNSW, IVF, PQ)
- Approximate Nearest Neighbors (ANN) trade-offs
- Metadata filtering and hybrid search
4. Reranking
- Bi-encoders vs Cross-encoders architecture
- Two-stage retrieval pipelines
- Relevance scoring mechanisms
- Performance vs accuracy trade-offs
Project 1: Build a Mini-Transformer from Scratch
- File: GENERATIVE_AI_LLM_RAG_LEARNING_PROJECTS.md
- Programming Language: Python
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 4: Expert
- Knowledge Area: Deep Learning / NLP
- Software or Tool: PyTorch
- Main Book: “Build a Large Language Model (From Scratch)” by Sebastian Raschka
What you’ll build: A small GPT-like transformer (decoder-only) that can generate coherent text after training on a corpus like Shakespeare or Wikipedia excerpts.
Why it teaches LLMs: You cannot understand how ChatGPT “thinks” until you implement self-attention yourself and see how tokens attend to each other. Building this forces you to grapple with the math behind Q, K, V matrices, why positional encoding exists, and how generation actually works.
Core challenges you’ll face:
- Implementing multi-head self-attention without libraries (maps to understanding the attention mechanism)
- Making causal masking work for autoregressive generation (maps to understanding why LLMs can only see “past” tokens)
- Getting training to converge with proper learning rate scheduling (maps to understanding why training LLMs is hard)
- Implementing tokenization from scratch (maps to understanding how text becomes numbers)
Key Concepts:
- Attention Mechanism: “Attention Is All You Need” paper - Vaswani et al.
- Transformer Architecture: “Build a Large Language Model (From Scratch)” Chapter 3 - Sebastian Raschka
- Tokenization: “Let’s build the GPT Tokenizer” - Andrej Karpathy
- Training Dynamics: “AI Engineering” Chapter 4 - Chip Huyen
Difficulty: Intermediate
Time estimate: 2-3 weeks
Prerequisites: Python, PyTorch basics, linear algebra fundamentals
Real world outcome:
- Your terminal will generate coherent Shakespeare-style text: “To be or not to be, that is the question of the…”
- You can input any prompt and watch your model complete it
- A Jupyter notebook showing attention heatmaps visualizing what tokens your model “looks at”
Learning milestones:
- After implementing attention: You’ll understand why transformers can process sequences in parallel and what “attending” actually means
- After training: You’ll viscerally understand the compute/data/quality relationship and why bigger isn’t always better
- After generation: You’ll understand temperature, top-k, and nucleus sampling, and why models sometimes produce garbage (a minimal sampling sketch follows below)
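A minimal sketch of that sampling step: temperature plus top-k sampling over a random, untrained logit vector (a stand-in for a real model’s output):

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50):
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    logits = logits / temperature
    topk_vals, topk_idx = torch.topk(logits, top_k)
    probs = F.softmax(topk_vals, dim=-1)        # renormalize over the top-k only
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice].item()

vocab_size = 1000
logits = torch.randn(vocab_size)                # placeholder for a model's output logits
print(sample_next_token(logits))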
Project 2: Build Your Own Vector Database Engine
- File: GENERATIVE_AI_LLM_RAG_LEARNING_PROJECTS.md
- Programming Language: Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Databases / ML Infrastructure
- Software or Tool: HNSW / Embeddings
- Main Book: “Algorithms, Fourth Edition” by Sedgewick & Wayne
What you’ll build: A vector database from scratch that stores embeddings, indexes them with HNSW (Hierarchical Navigable Small World), and retrieves nearest neighbors with sub-linear complexity.
Why it teaches Vector Databases: Using Pinecone or ChromaDB hides the magic. Building HNSW yourself forces you to understand why approximate nearest neighbor search is necessary (exact search is O(n)), how graph-based indexing works, and the precision-recall tradeoffs every production system makes.
Core challenges you’ll face:
- Implementing HNSW graph construction (maps to understanding navigable small-world graphs)
- Managing the index persistence to disk (maps to understanding why vector DBs need special storage)
- Implementing cosine similarity efficiently with numpy vectorization (maps to understanding similarity metrics)
- Building metadata filtering alongside vector search (maps to understanding hybrid search)
Resources for key challenges:
- “Hierarchical Navigable Small World graphs” paper - Original HNSW paper by Malkov & Yashunin
- “FAISS Tutorial” - Pinecone’s explanation of indexing algorithms
Key Concepts:
- HNSW Algorithm: “Efficient and robust approximate nearest neighbor search using HNSW” - Malkov & Yashunin
- Vector Similarity: “Math for Programmers” Chapter 6 (Vectors) - Paul Orland
- Indexing Structures: “Algorithms, Fourth Edition” Chapter 3 (Searching) - Sedgewick & Wayne
Difficulty: Intermediate-Advanced
Time estimate: 2-3 weeks
Prerequisites: Python, data structures (graphs, heaps), basic understanding of embeddings
Real world outcome:
- A CLI tool where you can run ./myvecdb add "The quick brown fox" and ./myvecdb search "fast animal" --top-k 5
- Benchmarks showing your HNSW is 100x faster than brute-force on 100k vectors
- A visualization of your HNSW graph structure showing how vectors connect across layers
Learning milestones:
- After brute-force implementation: You’ll understand why O(n) doesn’t scale and appreciate the need for ANN
- After HNSW implementation: You’ll understand graph navigation, entry points, and layer traversal
- After benchmarking: You’ll internalize the precision-recall-latency tradeoff that every production system faces
Real World Outcome
By the end of this project, you’ll have a fully functional vector database with tangible, demonstrable capabilities:
CLI Tool Capabilities
# Initialize a new vector database
$ ./myvecdb init --dimensions 384 --metric cosine
✓ Created database at ./vecdb_data/
✓ Index type: HNSW (M=16, ef_construction=200)
# Add vectors from a file or directly
$ ./myvecdb add --file embeddings.jsonl
✓ Inserted 50,000 vectors in 12.3 seconds (4,065 vectors/sec)
✓ Index built with 3 layers
# Interactive text search (generates embedding on the fly)
$ ./myvecdb search "machine learning algorithms" --top-k 10 --model all-MiniLM-L6-v2
Query embedding generated in 45ms
Search completed in 3.2ms
Results (cosine similarity):
1. [0.912] "Deep learning and neural network architectures" (doc_4521)
2. [0.891] "Supervised learning methods for classification" (doc_8834)
3. [0.878] "Introduction to gradient descent optimization" (doc_1203)
4. [0.865] "Machine learning model evaluation metrics" (doc_7621)
5. [0.847] "Ensemble methods: Random Forests and Boosting" (doc_3345)
...
# Search with a raw vector
$ ./myvecdb search --vector "[0.234, -0.521, 0.891, ...]" --top-k 5
# Benchmark performance
$ ./myvecdb benchmark --vectors 100000 --queries 1000
Running benchmark with 100,000 vectors, 1,000 queries...
Brute Force Results:
Average query time: 142.3ms
Queries per second: 7.0
Recall@10: 100.00% (exact)
HNSW Results (ef_search=50):
Average query time: 1.4ms
Queries per second: 714.3
Recall@10: 94.20%
Speed improvement: 101.6x faster
Precision-recall tradeoff: 5.8% recall loss for 100x speedup
# Show index statistics
$ ./myvecdb stats
Database Statistics:
Total vectors: 100,000
Dimensions: 384
Index type: HNSW
Memory usage: 245 MB
Disk usage: 198 MB
HNSW Parameters:
Layers: 5 (0-4)
M (connections per node): 16
M_max (max connections): 32
ef_construction: 200
Total edges: 1,638,492
Layer distribution:
Layer 0: 100,000 nodes (100.00%)
Layer 1: 6,250 nodes (6.25%)
Layer 2: 391 nodes (0.39%)
Layer 3: 24 nodes (0.02%)
Layer 4: 1 node (entry point)
# Visualize the graph structure (generates a graphviz file)
$ ./myvecdb visualize --layer 2 --max-nodes 50 --output graph.dot
✓ Generated visualization with 50 nodes
$ dot -Tpng graph.dot -o hnsw_graph.png
Performance Benchmarks You’ll Generate
Your implementation will produce concrete performance comparisons:
=== PERFORMANCE COMPARISON ===
Dataset: 100K vectors (384 dimensions)
Method | Latency (p50) | Latency (p99) | QPS | Recall@10 | Memory
--------------------|---------------|---------------|--------|-----------|--------
Brute Force | 142.3ms | 189.7ms | 7 | 100.00% | 153 MB
HNSW (ef=10) | 0.8ms | 2.1ms | 1,250 | 87.30% | 245 MB
HNSW (ef=50) | 1.4ms | 3.8ms | 714 | 94.20% | 245 MB
HNSW (ef=100) | 2.7ms | 6.5ms | 370 | 97.80% | 245 MB
HNSW (ef=200) | 5.1ms | 11.2ms | 196 | 99.40% | 245 MB
Key Insights:
- HNSW achieves 100x speedup at 94% recall
- Memory overhead: 60% increase for HNSW index structure
- Sweet spot: ef=50 (good balance of speed and accuracy)
Visualization Outputs
Your tool will generate visual representations of the HNSW graph:
Layer 3 (Top Layer - Entry Point):
[Node 42891] ← Entry point
|
Layer 2 (24 nodes):
[42891] ─── [18273] ─── [91847]
| | |
[7284] ─── [63491] ─── [2847]
Layer 1 (391 nodes):
Dense connections showing navigable small-world structure
Average degree: 14.2 connections per node
Layer 0 (All 100K nodes):
Full graph with M=16 connections per node
Shows nearest neighbors at base resolution
Code Output Example
When you integrate this into a Python application:
from myvecdb import VectorDB
# Initialize
db = VectorDB(dimensions=384, metric='cosine')
# Insert vectors
db.insert("doc_1", embedding_vector_1)
db.insert_batch([
("doc_2", embedding_vector_2),
("doc_3", embedding_vector_3),
...
])
# Search
results = db.search(query_vector, top_k=10)
for doc_id, score in results:
print(f"{doc_id}: {score:.4f}")
# Get stats
stats = db.get_stats()
print(f"Total vectors: {stats['total_vectors']}")
print(f"Average search time: {stats['avg_search_ms']:.2f}ms")
The Core Question You’re Answering
“Why is exact nearest neighbor search too slow for production systems, and how do graph-based indexes like HNSW solve this while maintaining acceptable accuracy?”
The Fundamental Problem
When you have a collection of N vectors and a query vector, finding the k nearest neighbors requires:
Brute Force Approach:
- Compute similarity (cosine/euclidean) between query and EVERY vector
- Sort all N results
- Return top k
Time Complexity: O(N × D) where D is dimensionality
- For 1 million 384-dimensional vectors: ~384 million operations per query
- At 50ms per query, you get 20 QPS (queries per second)
- This doesn’t scale for production systems needing 1000+ QPS
Why This Matters in Real Applications
Consider a real RAG (Retrieval Augmented Generation) system:
- User asks: “What’s the refund policy?”
- System generates query embedding (384 dimensions)
- Must search 10 million document chunks
- Needs results in <100ms for good UX
- Serves 100 concurrent users
Brute force: 10M comparisons × 100 concurrent users = impossible.
HNSW: ~2ms per query × 100 users = easily handled.
The HNSW Solution
HNSW solves this by creating a hierarchical graph structure that enables:
- Logarithmic search complexity: O(log N) instead of O(N)
- Greedy graph navigation: Jump to progressively closer neighbors
- Multi-layer skip lists: Upper layers provide “highways” for fast traversal
Key Insight: Instead of comparing against all N vectors, HNSW compares against log(N) vectors by intelligently navigating a graph structure.
Example with 1 million vectors:
- Brute force: Compare against 1,000,000 vectors
- HNSW: Compare against ~200 vectors (5000x fewer comparisons)
The Tradeoff You’re Managing
HNSW doesn’t find the EXACT nearest neighbors—it finds APPROXIMATE nearest neighbors:
- Recall@10: 95% means 9.5 out of 10 results are correct
- 5% loss in accuracy for 100x speedup
- This tradeoff is acceptable for most production systems
The question becomes: How do you tune HNSW parameters (M, ef_construction, ef_search) to get the best speed/accuracy balance for YOUR use case?
Concepts You Must Understand First
Before implementing HNSW, you need solid foundations in these areas. Each concept builds on the previous one.
1. Vector Similarity Metrics
What it is: Mathematical functions that measure how “close” two vectors are in multi-dimensional space.
Why it matters: The entire vector database is built around efficiently finding similar vectors. If you don’t understand what similarity means mathematically, you can’t implement or debug search algorithms.
Time to master: 2-4 hours
Cosine Similarity
Measures angle between vectors (ignores magnitude):
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
Range: [-1, 1] where 1 = identical direction, -1 = opposite
When to use: Text embeddings (most common), where vector magnitude doesn’t carry semantic meaning.
Book reference: “Mathematics for Machine Learning” by Deisenroth, Faisal, and Ong - Chapter 3 (Analytic Geometry), Section 3.2
Euclidean Distance
Measures straight-line distance in space:
euclidean_distance(A, B) = sqrt(Σ(A_i - B_i)²)
Range: [0, ∞] where 0 = identical
When to use: Image embeddings, spatial data where magnitude matters.
Book reference: “Introduction to Algorithms” by Cormen et al. (CLRS) - Appendix B.5
Dot Product
Raw correlation without normalization:
dot_product(A, B) = Σ(A_i × B_i)
Range: (-∞, ∞)
When to use: When vectors are pre-normalized (cosine similarity becomes dot product).
Book reference: “Linear Algebra Done Right” by Axler - Chapter 6
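A quick NumPy sketch computing all three metrics on toy vectors, and showing that for unit-length vectors cosine similarity and dot product coincide:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
dot = a @ b

a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(cosine, euclidean, dot)
print(np.isclose(cosine, a_unit @ b_unit))   # True: cosine == dot once vectors are normalized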
2. Graph Data Structures
What it is: A collection of nodes (vertices) connected by edges, representing relationships.
Why it matters: HNSW IS a graph. Every concept—neighbors, layers, traversal—depends on understanding graphs.
Time to master: 4-6 hours
Graph Representation
The most common way to represent graphs in code:
# Adjacency list (what HNSW uses)
graph = {
0: [1, 2, 5], # Node 0 connects to nodes 1, 2, 5
1: [0, 3], # Node 1 connects to nodes 0, 3
2: [0, 3, 4],
...
}
# Access neighbors: O(1)
# Space complexity: O(V + E) where V=nodes, E=edges
Book reference: “Algorithms, Fourth Edition” by Sedgewick & Wayne - Chapter 4.1 (Undirected Graphs)
Graph Traversal Algorithms
HNSW uses Greedy Best-First Search—always move to the nearest unvisited neighbor:
def greedy_search(graph, start, goal, distance_fn):
current = start
visited = {start}
while True:
neighbors = graph[current]
closest = None
closest_dist = distance_fn(current, goal)
for neighbor in neighbors:
if neighbor not in visited:
dist = distance_fn(neighbor, goal)
if dist < closest_dist:
closest = neighbor
closest_dist = dist
if closest is None: # No improvement
return current
current = closest
visited.add(current)
Book reference: “Introduction to Algorithms” (CLRS) - Chapter 22.2 (Breadth-First Search) and 22.3 (Depth-First Search)
3. Skip Lists and Hierarchical Structures
What it is: Multi-layer data structures that enable fast search by “skipping” elements.
Why it matters: HNSW extends the skip list concept from 1D (sorted lists) to multi-dimensional metric spaces.
Time to master: 2-3 hours
Classic Skip List
A skip list has multiple layers, where upper layers have fewer elements:
Layer 2: [1] -----------------> [50] --------> [100]
Layer 1: [1] -------> [25] ---> [50] --> [75] -> [100]
Layer 0: [1] -> [10] -> [25] -> [40] -> [50] -> [75] -> [90] -> [100]
Search for 75:
1. Start at Layer 2, move to 50 (can't jump to 100, it's too far)
2. Drop to Layer 1 at 50
3. Move to 75 at Layer 1
4. Found! (O(log N) time instead of O(N))
HNSW is like a skip list for vectors: Instead of sorted numbers, you have vectors in metric space. Instead of “less than”, you use “distance”.
Book reference: “Algorithms, Fourth Edition” by Sedgewick & Wayne - Chapter 3.3 (Balanced Search Trees - conceptually similar)
4. HNSW Algorithm Specifics
What it is: Hierarchical Navigable Small World graphs—a specific algorithm for approximate nearest neighbor search.
Why it matters: This is what you’re implementing. Understanding the algorithm deeply is non-negotiable.
Time to master: 8-12 hours (including reading the paper)
Core HNSW Concepts
Small World Networks:
- Most nodes can be reached from any other in a small number of hops
- Think “six degrees of separation”
- Achieved by having both short-range and long-range connections
Navigability:
- Ability to find short paths using only local information (greedy search)
- No global knowledge needed during search
Hierarchy:
- Multiple layers like a skip list
- Upper layers: Fewer nodes, longer-range connections (highways)
- Lower layers: More nodes, shorter-range connections (local roads)
Key HNSW Parameters
| Parameter | Typical Value | What It Controls | Impact of Increasing |
|---|---|---|---|
| M | 16 | Max connections per node (layer 0) | Better recall, more memory, slower insert |
| M_max | 16 | Max connections (layers > 0) | Usually set equal to M |
| ef_construction | 200 | Search breadth during insert | Better graph quality, slower insert |
| ef_search | 50 | Search breadth during query | Better recall, slower search |
| m_L | 1/ln(M) ≈ 0.36 for M=16 | Layer assignment multiplier | Controls exponential layer distribution |
Book reference: Original HNSW paper - “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs” by Malkov & Yashunin
Reading guide:
- Read Abstract + Introduction (understand the problem)
- Read Section 3 (Algorithm description) carefully
- Skim Section 4 (Experiments - see performance results)
- Reference Section 5 (Discussion) when debugging
5. Time and Space Complexity Analysis
What it is: Mathematical framework for analyzing how algorithms scale.
Why it matters: You need to understand WHY HNSW is faster than brute force, and how to communicate this to technical stakeholders.
Time to master: 3-5 hours (if rusty on Big-O notation)
Complexity Comparison
| Operation | Brute Force | HNSW | Explanation |
|---|---|---|---|
| Search | O(N × D) | O(log N × M × D) | HNSW visits log(N) nodes instead of all N |
| Insert | O(1) | O(log N × M × ef_construction) | Must find neighbors and update graph |
| Delete | O(1) | O(M² × log N) | Must repair graph connections |
| Space | O(N × D) | O(N × D × M) | M times more memory for edges |
Where:
- N = number of vectors
- D = dimensionality
- M = max connections per node
Example calculation (100K vectors, 384 dimensions, M=16):
- Brute force search: 100,000 × 384 = 38.4M operations
- HNSW search: log₂(100,000) × 16 × 384 ≈ 102K operations
- Speedup: ~375x (in practice: ~100x due to constant factors)
Book reference: “Introduction to Algorithms” (CLRS) - Chapter 3 (Growth of Functions)
6. NumPy Vectorization and Optimization
What it is: Writing Python code that leverages SIMD instructions and C libraries for fast array operations.
Why it matters: Naive Python loops are 100-1000x slower than vectorized NumPy. For similarity computation, this is the difference between 1ms and 100ms.
Time to master: 4-6 hours of practice
Performance Comparison
import numpy as np
import time
# Generate test data
query = np.random.rand(384)
vectors = np.random.rand(100000, 384)
# SLOW: Python loops (NEVER do this)
def cosine_slow(query, vectors):
results = []
for vec in vectors:
dot = sum(q * v for q, v in zip(query, vec))
norm_q = sum(q**2 for q in query) ** 0.5
norm_v = sum(v**2 for v in vec) ** 0.5
results.append(dot / (norm_q * norm_v))
return results
# FAST: NumPy vectorization
def cosine_fast(query, vectors):
# Normalize
query_norm = query / np.linalg.norm(query)
vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
# Batch dot product
return vectors_norm @ query_norm
# Benchmark
start = time.time()
_ = cosine_fast(query, vectors)
fast_time = time.time() - start
start = time.time()
_ = cosine_slow(query, vectors[:1000]) # Only 1K vectors or it takes forever
slow_time = (time.time() - start) * 100 # Extrapolate to 100K
print(f"Vectorized: {fast_time*1000:.1f}ms")
print(f"Loops (projected): {slow_time*1000:.1f}ms")
print(f"Speedup: {slow_time/fast_time:.0f}x")
# Output:
# Vectorized: 12.3ms
# Loops (projected): 1847.2ms
# Speedup: 150x
Key principles:
- Never loop over NumPy arrays in Python
- Use matrix multiplication (@) for batch operations
- Pre-normalize vectors if using cosine similarity
- Use the axis parameter for row/column operations
Book reference: “Python for Data Analysis” by Wes McKinney - Chapter 4 (NumPy Basics), Appendix A (Advanced NumPy)
7. Persistence and Serialization
What it is: Saving in-memory data structures to disk and loading them back efficiently.
Why it matters: Vector databases must survive restarts. A 100K vector index might take 5 minutes to build—you can’t rebuild on every startup.
Time to master: 2-3 hours
Serialization Strategies
Option 1: Pickle (Quick & Dirty)
import pickle
# Save
with open('index.pkl', 'wb') as f:
pickle.dump({'vectors': vectors, 'graph': graph}, f)
# Load
with open('index.pkl', 'rb') as f:
data = pickle.load(f)
Pros: One line of code
Cons: Not portable across Python versions, security risk, slow for large data
Option 2: NumPy Binary + JSON
import numpy as np
import json
# Save
np.save('vectors.npy', vectors)
with open('metadata.json', 'w') as f:
json.dump({'graph': graph, 'params': params}, f)
# Load
vectors = np.load('vectors.npy')
with open('metadata.json') as f:
data = json.load(f)
Pros: Fast for arrays, human-readable metadata
Cons: Multiple files to manage
Option 3: HDF5 (Production-Ready)
import h5py
# Save
with h5py.File('index.h5', 'w') as f:
f.create_dataset('vectors', data=vectors)
f.attrs['M'] = 16
f.attrs['ef_construction'] = 200
# Graph stored as separate datasets
# Load
with h5py.File('index.h5', 'r') as f:
vectors = f['vectors'][:]
M = f.attrs['M']
Pros: Single file, compression, partial loading
Cons: External dependency
Book reference: “Designing Data-Intensive Applications” by Kleppmann - Chapter 4 (Encoding and Evolution)
Questions to Guide Your Design
Before writing code, answer these questions. Your answers will determine your implementation strategy.
1. Graph Structure Questions
Q: How many layers should your HNSW graph have?
This isn’t a fixed number—HNSW uses probabilistic assignment.
The algorithm:
import random
import math
def assign_layer(m_L=1 / math.log(16)):
    """
    Randomly assign a top layer to a new node.
    m_L = 1/ln(M); with M=16 this is ≈ 0.36, so each layer up holds ~1/16 of the nodes.
    """
    return math.floor(-math.log(1.0 - random.random()) * m_L)

# Test the distribution (a node assigned top layer L also lives in every layer below L)
layers = [assign_layer() for _ in range(100000)]
for i in range(4):
    pct = sum(1 for l in layers if l >= i) / len(layers) * 100
    print(f"Layer {i}: {pct:.2f}% of nodes")

# Approximate output:
# Layer 0: 100.00% (all nodes)
# Layer 1: 6.25%
# Layer 2: 0.39%
# Layer 3: 0.02%
Why this works: Creates a pyramid structure automatically, regardless of dataset size.
Your decision: Use the probabilistic formula. Don’t hardcode layer count.
Q: What’s the optimal M parameter?
M controls how many neighbors each node connects to. Higher M = better recall but more memory.
Tradeoff analysis:
| M value | Memory per node | Recall@10 | Search speed | Recommendation |
|---|---|---|---|---|
| M=4 | Low (16 edges) | ~85% | Very fast | Embedded devices only |
| M=8 | Medium (32 edges) | ~90% | Fast | Resource-constrained |
| M=16 | Medium-high (64 edges) | ~94% | Balanced | Default choice |
| M=32 | High (128 edges) | ~97% | Slower | High-accuracy needs |
| M=64 | Very high (256 edges) | ~98.5% | Slow | Rarely justified |
Memory calculation:
- 100K vectors × M=16 × 4 bytes/edge = 6.4MB just for edges
- Plus vector storage: 100K × 384 dims × 4 bytes = 147MB
- Total: ~153MB (M=16 adds ~4% overhead)
Your decision: Start with M=16, make it configurable, benchmark on your data.
2. Distance Metric Questions
Q: Should you support multiple similarity metrics?
Option A: Single metric (simpler)
class VectorDB:
def __init__(self, dimensions):
self.dimensions = dimensions
# Hardcode cosine similarity
Pros: Simpler implementation, fewer bugs
Cons: Less flexible
Option B: Multiple metrics (better)
class SimilarityMetric:
def compute(self, a, b): pass
class CosineSimilarity(SimilarityMetric):
def compute(self, a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
class EuclideanDistance(SimilarityMetric):
def compute(self, a, b):
return -np.linalg.norm(a - b) # Negative so higher = better
class VectorDB:
def __init__(self, dimensions, metric='cosine'):
self.metric = {
'cosine': CosineSimilarity(),
'euclidean': EuclideanDistance(),
}[metric]
Pros: Flexible, professional API
Cons: More code to maintain
Your decision: Implement cosine only first, add abstraction layer for future metrics.
Q: Should you pre-normalize vectors?
If using cosine similarity, you can optimize by pre-normalizing:
# Without pre-normalization
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# 3 operations: dot, norm(a), norm(b)
# With pre-normalization (at insert time)
def insert(self, vec_id, vector):
normalized = vector / np.linalg.norm(vector)
self.vectors.append(normalized)
def cosine_similarity_prenorm(a, b):
return np.dot(a, b) # Just dot product!
# 1 operation: 3x faster!
Tradeoff:
- Pro: 2-3x faster search
- Con: Original vectors not preserved (may matter for some applications)
Your decision: Pre-normalize for cosine similarity. Document this behavior clearly.
3. Data Structure Questions
Q: How will you store the graph?
Option 1: Separate layer storage (recommended)
class HNSW:
def __init__(self):
self.layers = [] # List of dicts
# layers[0] = {node_id: [neighbor_ids], ...} # Layer 0
# layers[1] = {node_id: [neighbor_ids], ...} # Layer 1
# ...
def get_neighbors(self, node, layer):
return self.layers[layer].get(node, [])
Pros: Clear separation, easy to visualize
Cons: Multiple lookups for multi-layer nodes
Option 2: Unified node structure
class HNSW:
def __init__(self):
self.nodes = {}
# nodes[node_id] = {
# 'level': 2,
# 'connections': {
# 0: [neighbors at layer 0],
# 1: [neighbors at layer 1],
# 2: [neighbors at layer 2],
# }
# }
def get_neighbors(self, node, layer):
return self.nodes[node]['connections'][layer]
Pros: All node data together
Cons: More complex structure
Your decision: Use Option 1 (separate layers) for clarity. Profile later if needed.
Q: How will you map vector IDs to indices?
You need to support string IDs (like “doc_12345”) but use integer indices internally for efficiency.
class VectorDB:
def __init__(self):
# Bidirectional mapping
self.id_to_idx = {} # "doc_123" -> 0
self.idx_to_id = {} # 0 -> "doc_123"
self.vectors = [] # List of numpy arrays
def insert(self, vec_id, vector):
if vec_id in self.id_to_idx:
raise ValueError(f"ID {vec_id} already exists")
idx = len(self.vectors)
self.id_to_idx[vec_id] = idx
self.idx_to_id[idx] = vec_id
self.vectors.append(vector)
def search(self, query, k=10):
# Internal search returns indices
indices = self._search_internal(query, k)
# Convert to IDs for API
return [(self.idx_to_id[idx], score) for idx, score in indices]
Your decision: Always use this bidirectional mapping pattern.
4. Insert Strategy Questions
Q: How do you handle duplicate vectors?
Three strategies:
A. Reject duplicates
def insert(self, vec_id, vector):
if vec_id in self.id_to_idx:
raise ValueError(f"Duplicate ID: {vec_id}")
Pros: Clear semantics
Cons: Caller must handle updates separately
B. Allow duplicates with different IDs
def insert(self, vec_id, vector):
# Same vector can have multiple IDs
idx = len(self.vectors)
self.vectors.append(vector)
# ...
Pros: Flexible
Cons: Wastes space, confusing search results
C. Update existing
def insert(self, vec_id, vector):
if vec_id in self.id_to_idx:
idx = self.id_to_idx[vec_id]
self.vectors[idx] = vector # Replace
# Rebuild connections for this node
Pros: User-friendly API
Cons: Complex to implement correctly
Your decision: Start with A (reject), add update_vector() method later if needed.
Q: How do you select neighbors during insert?
HNSW uses a heuristic to select “diverse” neighbors, not just closest.
Simple approach (start here):
def select_neighbors_simple(self, candidates, M):
"""Just take M nearest neighbors."""
sorted_cands = sorted(candidates, key=lambda x: x[0])
return [node_id for dist, node_id in sorted_cands[:M]]
HNSW heuristic (better graph quality):
def select_neighbors_heuristic(self, candidates, M):
"""
Select diverse neighbors that don't cluster.
From HNSW paper Algorithm 4.
"""
candidates = sorted(candidates, key=lambda x: x[0])
result = []
for dist_to_q, candidate in candidates:
# Check if candidate is too close to already-selected neighbors
should_add = True
for prev_dist, prev_cand in result:
dist_between = self.distance(
self.vectors[candidate],
self.vectors[prev_cand]
)
# If closer to existing neighbor than to query, skip
if dist_between < dist_to_q:
should_add = False
break
if should_add:
result.append((dist_to_q, candidate))
if len(result) >= M:
break
# Fill remaining slots with closest if needed
if len(result) < M:
for dist, cand in candidates:
if (dist, cand) not in result:
result.append((dist, cand))
if len(result) >= M:
break
return [node_id for _, node_id in result]
Your decision: Implement simple first, add heuristic when you’re benchmarking recall.
5. Search Strategy Questions
Q: What’s your stopping condition during search?
In HNSW, ef_search controls how many candidate nodes you examine:
def search_layer(self, query, entry_points, ef, layer):  # requires import heapq at module scope
"""
ef = exploration factor (how many candidates to examine)
"""
visited = set()
candidates = [] # Min-heap: (distance, node)
results = [] # Max-heap: (-distance, node)
# Initialize
for ep in entry_points:
dist = self.distance(query, self.vectors[ep])
heapq.heappush(candidates, (dist, ep))
heapq.heappush(results, (-dist, ep))
visited.add(ep)
while candidates:
current_dist, current = heapq.heappop(candidates)
# Stopping condition: current is worse than our worst result
if current_dist > -results[0][0]:
break
# Explore neighbors
for neighbor in self.get_neighbors(current, layer):
if neighbor not in visited:
visited.add(neighbor)
dist = self.distance(query, self.vectors[neighbor])
# Add to candidates if better than worst result OR we need more results
if dist < -results[0][0] or len(results) < ef:
heapq.heappush(candidates, (dist, neighbor))
heapq.heappush(results, (-dist, neighbor))
# Keep only top ef results
if len(results) > ef:
heapq.heappop(results)
return sorted([(-dist, node) for dist, node in results])
Your decision: Use this standard HNSW stopping condition.
Q: Should you return distances or similarities?
API design choice that affects user experience:
Option A: Distances (lower = better)
results = db.search(query, k=10)
# [(0.12, "doc_1"), (0.23, "doc_2"), ...]
Intuitive for Euclidean distance, confusing for cosine similarity.
Option B: Similarities (higher = better)
results = db.search(query, k=10)
# [(0.95, "doc_1"), (0.87, "doc_2"), ...]
Intuitive for all metrics, matches user expectations.
Your decision: Return similarities (convert distances if needed). Users expect higher scores = better matches.
6. Persistence Questions
Q: What’s your persistence strategy?
Naive approach (start here):
def save(self, filepath):
"""Save entire index to disk."""
import pickle
with open(filepath, 'wb') as f:
pickle.dump({
'vectors': self.vectors,
'graph': self.graph,
'metadata': {...}
}, f)
Pros: Simple
Cons: Slow for large indexes, all-or-nothing
Better approach (production):
def save(self, filepath):
"""Save with versioning and checksums."""
import json
import hashlib
# Save vectors
vectors_file = f"{filepath}.vectors.npy"
np.save(vectors_file, np.array(self.vectors))
# Compute checksum
with open(vectors_file, 'rb') as f:
vectors_checksum = hashlib.sha256(f.read()).hexdigest()
# Save metadata + graph
metadata = {
'version': 1,
'num_vectors': len(self.vectors),
'dimensions': self.dimensions,
'M': self.M,
'graph': self.graph,
'id_mappings': {
'id_to_idx': self.id_to_idx,
'idx_to_id': self.idx_to_id,
},
'checksums': {
'vectors': vectors_checksum
}
}
with open(f"{filepath}.meta.json", 'w') as f:
json.dump(metadata, f)
def load(self, filepath):
"""Load and validate."""
# Load metadata
with open(f"{filepath}.meta.json") as f:
metadata = json.load(f)
# Load vectors
vectors = np.load(f"{filepath}.vectors.npy")
# Validate checksum
import hashlib
with open(f"{filepath}.vectors.npy", 'rb') as f:
checksum = hashlib.sha256(f.read()).hexdigest()
if checksum != metadata['checksums']['vectors']:
raise ValueError("Checksum mismatch! Index may be corrupted.")
# Restore state
self.vectors = list(vectors)
self.graph = metadata['graph']
self.id_to_idx = metadata['id_mappings']['id_to_idx']
# ...
Your decision: Start with pickle, refactor to versioned format before sharing code.
7. Performance Optimization Questions
Q: Should you use multi-threading for search?
Python’s GIL (Global Interpreter Lock) complicates this:
Reality check:
# This WON'T speed up a single query (GIL prevents true parallelism)
def search_parallel_WRONG(self, query, k=10):
import threading
threads = [threading.Thread(target=self._search_layer, ...) for _ in range(4)]
# All threads blocked by GIL, no speedup!
What DOES work:
# Handle multiple CONCURRENT queries (different users)
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=10)
# Each thread handles a different user's query
# GIL released during I/O and NumPy operations
For single-query speedup, you need:
- Multiprocessing (separate Python processes)
- C extensions (Cython, pybind11)
- Just-in-time compilation (Numba)
Your decision: Single-threaded for now. Add thread pool for concurrent users later.
Q: How will you benchmark performance?
Need systematic methodology to measure speed/accuracy tradeoffs:
import time
import numpy as np

def benchmark(db, test_queries, ground_truth, ef_values=(10, 50, 100)):
"""
Benchmark HNSW with different ef_search values.
Args:
db: Your VectorDB instance
test_queries: List of query vectors
ground_truth: List of true nearest neighbor IDs (from brute force)
ef_values: List of ef_search values to test
Returns:
Results dict with latencies and recalls for each ef value
"""
results = {}
for ef in ef_values:
latencies = []
recalls = []
for query, true_neighbors in zip(test_queries, ground_truth):
# Measure latency
start = time.time()
hnsw_results = db.search(query, k=10, ef_search=ef)
latency = time.time() - start
latencies.append(latency * 1000) # Convert to ms
# Measure recall
hnsw_ids = set([doc_id for doc_id, _ in hnsw_results])
true_ids = set(true_neighbors[:10])
recall = len(hnsw_ids & true_ids) / 10.0
recalls.append(recall)
results[ef] = {
'latency_p50': np.percentile(latencies, 50),
'latency_p99': np.percentile(latencies, 99),
'recall_avg': np.mean(recalls),
'qps': 1000 / np.mean(latencies) # Queries per second
}
# Print comparison table
print(f"{'ef_search':<12} {'p50 (ms)':<12} {'p99 (ms)':<12} {'QPS':<10} {'Recall@10'}")
print("-" * 60)
for ef, stats in sorted(results.items()):
print(f"{ef:<12} {stats['latency_p50']:<12.2f} {stats['latency_p99']:<12.2f} "
f"{stats['qps']:<10.0f} {stats['recall_avg']:.2%}")
return results
Your decision: Build this benchmarking function early. Use it to guide parameter tuning.
Thinking Exercise
Complete this before writing code. It will save you hours of debugging.
Exercise: Build a 2D HNSW Graph by Hand
Goal: Internalize how HNSW works by manually constructing a small graph and tracing searches.
Setup: 10 vectors in 2D space (so you can visualize):
V0: (1, 1) V5: (7, 2)
V1: (2, 2) V6: (8, 1)
V2: (3, 1) V7: (2, 8)
V3: (1, 3) V8: (3, 7)
V4: (4, 4) V9: (8, 8)
Task 1: Assign Layers (5 minutes)
Use probabilistic assignment. For this exercise, use these pre-assigned layers:
Layer 2: V4 (entry point - highest layer)
Layer 1: V1, V4, V9
Layer 0: All vectors (V0-V9)
Question: Why does V4 appear in both Layer 1 and Layer 2?
Answer: Nodes at layer L also exist in all layers below L. Layer 2 nodes are “promoted” from lower layers.
Task 2: Build Connections (15 minutes)
For each layer, connect each node to its M=2 nearest neighbors using Euclidean distance.
Layer 0 connections (all nodes, M=2):
Calculate distances manually. For example, V0’s distances:
- V0 to V1: √((2-1)² + (2-1)²) = √2 ≈ 1.41
- V0 to V2: √((3-1)² + (1-1)²) = 2.00
- V0 to V3: √((1-1)² + (3-1)²) = 2.00
- V0 to V4: √((4-1)² + (4-1)²) = √18 ≈ 4.24
- … (continue for all)
V0’s 2 nearest neighbors: V1 (1.41), V3 (2.00)
Complete this for all nodes:
Your task: Fill in the connections
V0 → [V1, V3]
V1 → [_____, _____]
V2 → [_____, _____]
V3 → [_____, _____]
V4 → [_____, _____]
V5 → [_____, _____]
V6 → [_____, _____]
V7 → [_____, _____]
V8 → [_____, _____]
V9 → [_____, _____]
Expected answers (ties broken arbitrarily; several nodes have equidistant candidates):
V0 → [V1, V3] (V2 also ties at 2.00)
V1 → [V0, V2] (V3 also ties at 1.41)
V2 → [V1, V0]
V3 → [V1, V0]
V4 → [V1, V8] (V2 and V3 also tie at 3.16)
V5 → [V6, V4]
V6 → [V5, V2] (V4 also ties at 5.00)
V7 → [V8, V4]
V8 → [V7, V4]
V9 → [V8, V4]
Layer 1 connections (only V1, V4, V9 exist here, M=2):
V1 → [V4, V9] (only two other nodes at this layer, so connect to both)
V4 → [V1, V9]
V9 → [V4, V1]
Layer 2 connections (only V4 exists):
V4 → [] (no neighbors at top layer - this is the entry point)
Task 3: Trace a Search (10 minutes)
Query: Find nearest neighbor to (7, 7)
Follow HNSW algorithm:
Step 1: Start at entry point (Layer 2)
- Current node: V4 at (4, 4)
- Distance to query (7, 7): √((7-4)² + (7-4)²) = √18 ≈ 4.24
- Neighbors at Layer 2: None
- Best at this layer: V4
- Drop to Layer 1
Step 2: Layer 1
- Current: V4 at (4, 4)
- Neighbors at Layer 1: [V1, V9]
- Calculate distances:
- V1 (2, 2): √((7-2)² + (7-2)²) = √50 ≈ 7.07 (worse than V4)
- V9 (8, 8): √((7-8)² + (7-8)²) = √2 ≈ 1.41 (BETTER!)
- Move to V9
- Check V9’s neighbors at Layer 1: [V4, V1] (both already examined, neither is closer)
- Best at this layer: V9
- Drop to Layer 0
Step 3: Layer 0
- Current: V9 at (8, 8)
- Neighbors at Layer 0: [V8, V4]
- Calculate distances:
- V8 (3, 7): √((7-3)² + (7-7)²) = 4.00 (worse than V9)
- V4 (4, 4): √18 ≈ 4.24 (worse than V9, and already visited)
- No improvement found
- Return V9 as nearest neighbor
Verification: Calculate all distances to (7, 7):
- V0: √((7-1)² + (7-1)²) = √72 ≈ 8.49
- V1: √50 ≈ 7.07
- V2: √((7-3)² + (7-1)²) = √52 ≈ 7.21
- V3: √((7-1)² + (7-3)²) = √52 ≈ 7.21
- V4: √18 ≈ 4.24
- V5: √25 = 5.00
- V6: √37 ≈ 6.08
- V7: √((7-2)² + (7-8)²) = √26 ≈ 5.10
- V8: 4.00
- V9: √2 ≈ 1.41 ← CORRECT!
Key insight: We only examined 4 nodes (V4, V1, V9, V8) out of 10 total. For 1 million vectors, HNSW would examine ~200 nodes instead of 1 million.
Task 4: Search for (2, 3) (Do This Yourself!)
Now trace a search for query (2, 3):
- Start at Layer 2: V4 (4, 4)
- Distance to (2, 3): ?
- Drop to Layer 1
- At Layer 1, explore V4’s neighbors [V1, V9]
- Which is closer to (2, 3)?
- Move to the better node
- At Layer 0, explore that node’s neighbors
- Find the nearest neighbor
Expected answer: V1 at (2, 2) or V3 at (1, 3) (both at distance 1.00; the greedy path above reaches V1)
How many nodes did you examine? Should be 4-6 nodes.
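To double-check your hand-traced answers for Tasks 3 and 4, a short brute-force snippet over the 10 points from the setup (exact search, no graph, so it gives the ground truth):
import numpy as np

points = np.array([[1, 1], [2, 2], [3, 1], [1, 3], [4, 4],
                   [7, 2], [8, 1], [2, 8], [3, 7], [8, 8]], dtype=float)

for query in [np.array([7.0, 7.0]), np.array([2.0, 3.0])]:
    dists = np.linalg.norm(points - query, axis=1)   # Euclidean distance to every point
    order = np.argsort(dists)
    print(f"query={query}: nearest=V{order[0]} (d={dists[order[0]]:.2f}), "
          f"runner-up=V{order[1]} (d={dists[order[1]]:.2f})")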
Task 5: What If M=4? (Analysis Exercise)
Redraw Layer 0 with M=4 (each node connects to 4 nearest neighbors instead of 2).
Questions:
- How does V0’s neighbor list change?
- Will the search path for (7, 7) change?
- How much more memory does this use?
- Will recall improve?
Answers:
- V0 → [V1, V3, V2, V4] (added V2 and V4)
- Possibly—more connections = more paths to explore
- Memory: 4/2 = 2x more edges
- Yes—denser graph = more likely to find true nearest neighbor
Reflection Questions
Before moving to implementation, answer these:
- Why does HNSW use multiple layers instead of one dense graph?
- Answer: Upper layers provide “highways” to jump far distances quickly. Lower layers provide precision.
- What happens if you set M=1 (only 1 connection per node)?
- Answer: Graph becomes a tree or disconnected. Many nodes unreachable. Terrible recall.
- What’s the worst-case scenario for HNSW?
- Answer: Unlucky greedy path leads you away from true nearest neighbor. Must backtrack or accept suboptimal result.
- How does HNSW handle adding new vectors?
- Answer: Assign layer, search for neighbors at each layer, connect bidirectionally. Graph adapts dynamically.
The Interview Questions They’ll Ask
After building this project, you’ll confidently answer these in technical interviews.
Question 1: “Explain HNSW to a software engineer who knows algorithms but not ML”
Your answer:
“HNSW is like a skip list, but for multi-dimensional vectors instead of sorted numbers.
Imagine you have 1 million document embeddings (384-dimensional vectors) and you want to find the 10 most similar to a query. The naive approach is to compare against all 1 million—that’s O(n) and takes ~100 milliseconds.
HNSW builds a hierarchical graph:
- Layer 0 (bottom): Contains all 1 million vectors
- Layer 1: ~6% of vectors (62,500)
- Layer 2: ~0.4% of vectors (4,000)
- Layer 3: Maybe just 1 vector (entry point)
Each vector connects to its M nearest neighbors at each layer (typically M=16).
When you search:
- Start at the top layer (1 entry point)
- Greedily jump to the nearest neighbor
- When you can’t improve, drop down a layer
- Repeat until you reach Layer 0
Upper layers have long-range connections (like highways), lower layers have short-range (like local roads). This gives you O(log n) search time.
It’s approximate because greedy search can get stuck in local optima, but in practice you get 95%+ recall with 100x speedup.”
Follow-up they’ll ask: “How do you handle the approximation?”
Answer: “You tune the ef_search parameter. Higher ef_search examines more candidates at each layer, improving recall at the cost of latency. It’s a dial you turn based on your accuracy requirements.”
Question 2: “What’s the tradeoff between recall and latency?”
Your answer:
“Approximate nearest neighbor search is fundamentally a speed-vs-accuracy tradeoff. HNSW exposes this through the ef_search parameter.
In my implementation with 100K vectors:
| ef_search | Latency (p50) | Recall@10 | Use Case |
|---|---|---|---|
| 10 | 0.8ms | 87% | Chatbot (speed critical) |
| 50 | 1.4ms | 94% | General retrieval (balanced) |
| 100 | 2.7ms | 98% | Legal search (accuracy critical) |
| 200 | 5.1ms | 99.4% | Near-exact (overkill for most) |
The choice depends on your application:
- E-commerce recommendations: ef=10-20 (users don’t notice 87% vs 95% recall)
- Medical literature search: ef=100+ (missing relevant papers is costly)
- Real-time RAG chatbot: ef=30-50 (balance UX and quality)
The key insight: even at 87% recall, you’re getting 8.7 out of 10 correct results. For most applications, that’s acceptable when you’re getting 100x speedup.”
Follow-up: “How do you choose the right value?”
Answer: “Benchmark on your actual queries and ground truth. Measure the recall@k at different ef values, plot the recall-vs-latency curve, and find the ‘knee’—the point where increasing ef gives diminishing returns. For my use case, ef=50 was the sweet spot: 94% recall at 1.4ms.”
Question 3: “When would you use HNSW vs IVF (Inverted File Index)?”
Your answer:
“They solve the same problem (ANN search) with different approaches:
HNSW:
- Strategy: Graph-based navigation
- Pros: High recall, consistent latency, no training phase
- Cons: Higher memory (stores graph edges)
- Scaling: Great up to ~10-50M vectors in RAM
IVF-PQ (Inverted File with Product Quantization):
- Strategy: Clustering + compression
- Pros: Scales to billions, very memory-efficient
- Cons: Requires training (k-means clustering), lower recall, variable latency
- Scaling: Designed for 100M+ vectors
Decision matrix:
| Dataset Size | Memory Budget | Latency SLA | Recommendation |
|---|---|---|---|
| <1M vectors | Normal | <5ms | HNSW (simple, fast) |
| 1-10M vectors | Normal | <10ms | HNSW (well-tested) |
| 10-100M vectors | Tight | <20ms | IVF-PQ (compression needed) |
| 100M+ vectors | Any | Variable OK | IVF-PQ (only option) |
Example:
- RAG system with 2M document chunks: Use HNSW. Fits in RAM (2M × 384 dims × 4 bytes ≈ 3GB), simple deployment.
- Image search with 500M images: Use IVF-PQ. HNSW would need 500GB+ RAM, impractical.
Hybrid approach: Many production systems use HNSW + quantization (compress vectors but keep HNSW graph structure). Best of both worlds.”
Question 4: “How would you debug poor search quality?”
Your answer:
“I’d debug systematically, isolating whether it’s a retrieval problem or data problem.
Step 1: Isolate HNSW vs Embedding Quality
# Compare HNSW results with brute force (exact search)
exact_results = brute_force_knn(query, vectors, k=10)
hnsw_results = hnsw.search(query, k=10, ef_search=100)
recall = len(set(exact_results) & set(hnsw_results)) / 10
print(f'HNSW Recall: {recall:.1%}')
- If recall < 80%: HNSW configuration problem
- If recall > 95% but results still bad: Embedding model problem
Step 2: Tune HNSW Parameters
# Try different ef_search values
for ef in [10, 50, 100, 200]:
results = hnsw.search(query, ef_search=ef)
# Check if results improve
If recall improves significantly with higher ef, your default ef_search is too low.
Step 3: Check Index Quality
# Verify graph connectivity
for layer in range(num_layers):
isolated_nodes = [n for n, neighbors in graph[layer].items()
if len(neighbors) == 0]
print(f'Layer {layer}: {len(isolated_nodes)} isolated nodes')
Isolated nodes indicate index corruption or bugs in insert logic.
Step 4: Visualize Embedding Space
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Project corpus + query to 2D together (t-SNE cannot embed new points after fitting)
points_2d = TSNE(n_components=2).fit_transform(np.vstack([vectors[:1000], query]))
vectors_2d, query_2d = points_2d[:-1], points_2d[-1]
result_idxs = [idx for idx in top_result_indices if idx < 1000]  # retrieved docs within the sample
# Plot corpus (gray), retrieved results (green), query (red X)
plt.scatter(vectors_2d[:, 0], vectors_2d[:, 1], alpha=0.3)
plt.scatter(vectors_2d[result_idxs, 0], vectors_2d[result_idxs, 1], c='green', marker='o')
plt.scatter(query_2d[0], query_2d[1], c='red', marker='x', s=200)
plt.show()
If query and true results are far apart in 2D, your embeddings don’t capture similarity well.
Step 5: Check Distance Distribution
distances = [distance(query, result) for result in top_100_results]
print(f'Distance range: [{min(distances):.3f}, {max(distances):.3f}]')
print(f'Distance std: {np.std(distances):.3f}')
If all distances are similar (e.g., 0.65-0.68), your embeddings might be too uniform (degeneracy problem).
Real incident example: I once debugged poor RAG results and found the issue was mixing normalized and unnormalized vectors. Half the corpus had norm ~1.0, half had norm ~15. Cosine similarity was meaningless. Solution: Renormalize all vectors on load.”
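A quick check for that failure mode, sketched under the assumption that the corpus is held in an (n, d) NumPy matrix named vectors_matrix:
import numpy as np

norms = np.linalg.norm(vectors_matrix, axis=1)        # one L2 norm per stored vector
print(f"norm min/median/max: {norms.min():.2f} / {np.median(norms):.2f} / {norms.max():.2f}")

# A wide spread (e.g. min ~1, max ~15) means normalized and raw vectors are mixed.
# Renormalize everything before computing cosine similarity:
vectors_matrix = vectors_matrix / np.clip(norms, 1e-12, None)[:, None]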
Question 5: “How would you scale to 1 billion vectors?”
Your answer:
“1 billion vectors won’t fit in a single machine’s RAM (1B × 384 dims × 4 bytes ≈ 1.5TB), so you need distributed architecture.
Approach 1: Horizontal Sharding
Partition vectors across N shards:
- Shard 1: Vectors 0-10M
- Shard 2: Vectors 10M-20M
- ...
- Shard 100: Vectors 990M-1B
Query flow:
1. Broadcast query to all shards
2. Each shard returns top-k locally
3. Aggregator merges top-k from each shard
4. Return global top-k
Pros: Simple, linear scaling Cons: Query latency = slowest shard, network overhead, recall degradation (global top-10 might not be in top-10 of any shard)
Recall problem example:
- True global top-10 might have 3 from shard A, 4 from shard B, 3 from shard C
- But if you only get top-10 from each, you might miss some
Solution: Retrieve top-50 from each shard, merge to global top-10 (oversampling).
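A minimal sketch of that scatter-gather step; the shard clients and their search() call are placeholders standing in for whatever RPC layer you actually use:
import heapq

def global_search(shards, query, k=10, oversample=5):
    # Ask every shard for more than k results, then merge to a global top-k
    per_shard_k = k * oversample                               # e.g. top-50 per shard for a global top-10
    candidates = []
    for shard in shards:
        candidates.extend(shard.search(query, k=per_shard_k))  # each returns [(doc_id, score), ...]
    # Highest score wins; heapq.nlargest avoids sorting the full candidate list
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])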
Approach 2: Hierarchical Routing
Two-level index:
1. Cluster 1B vectors into 10K clusters (k-means)
2. Store cluster centroids (fits in RAM)
3. For query:
- Search centroids, find top-N clusters
- Route to shards containing those clusters
- Search only relevant shards
Pros: Reduces network calls (only query relevant shards) Cons: Clustering quality affects accuracy, more complex
Approach 3: Compression + HNSW
Use Product Quantization to compress vectors:
- Original: 384 dims × 4 bytes = 1,536 bytes/vector
- Compressed: 64 bytes/vector (24x compression)
- 1B vectors = 64GB (fits in single large machine!)
Two-stage search:
1. Build HNSW on compressed vectors
2. Search compressed index, retrieve top-100
3. Rerank top-100 with original full-precision vectors
Pros: Single-machine solution, simpler ops Cons: Recall degradation from compression, need to store original vectors separately for reranking
My recommendation:
Start with Approach 3 (compression). If single-machine isn’t enough:
- Use Approach 1 (sharding) with HNSW per shard
- Oversample (retrieve top-50 per shard for global top-10)
- Use consistent hashing for shard assignment
- Add read replicas for high QPS
Real-world reference: Pinecone uses sharding + compression. Weaviate uses sharding + HNSW. This is the proven production pattern.”
Question 6: “Difference between vector databases and traditional databases?”
Your answer:
“The fundamental difference is the query primitive:
Traditional DB: Exact match, filtering, aggregation
- Query: SELECT * FROM users WHERE email = 'alice@example.com'
- Result: Exact matches (0 or 1 result)
- Index: B-tree, hash index
- Complexity: O(log n) exact search
Vector DB: Similarity search
- Query: Find documents similar to [0.23, -0.41, 0.89, ...]
- Result: Ranked list of approximate matches (always k results)
- Index: HNSW, IVF, graph-based
- Complexity: O(log n) approximate search
Key difference: Vector DBs trade exactness for similarity.
Example use cases:
Traditional DB:
- ‘Find user with ID 12345’ → Exact result or None
- ‘Find orders where price > $100’ → All matching orders
- ‘Count users by country’ → Aggregation
Vector DB:
- ‘Find documents similar to: machine learning’ → Top 10 similar docs
- ‘Find images visually similar to this photo’ → Ranked results
- ‘Find code snippets semantically similar to this function’ → Approximate matches
However, modern applications need both:
# Hybrid query (what production RAG systems do)
results = vector_db.search(
vector=query_embedding,
filters={
'metadata.category': 'electronics',
'metadata.price': {'$lt': 1000},
'metadata.in_stock': True
},
top_k=10
)
This combines:
- Vector similarity (semantic search)
- Metadata filtering (traditional DB)
That’s why modern vector databases (Pinecone, Weaviate, Qdrant) support both. Under the hood:
- Apply metadata filters first (traditional index)
- Run vector search on filtered subset (HNSW)
- Return top-k results
Architecture insight: Vector DBs are specialized systems that excel at one thing (similarity search) but delegate other operations (transactions, joins, consistency) to traditional DBs or applications.”
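A toy sketch of that filter-then-search order, assuming vectors is an (n, d) NumPy matrix, metadata is a list of dicts with the keys used in the example above, and the vector stage is brute force (HNSW over the filtered subset in production):
import numpy as np

def hybrid_search(query_vec, vectors, metadata, top_k=10, max_price=1000):
    # 1) Metadata filter first (traditional-DB style predicate)
    allowed = [i for i, meta in enumerate(metadata)
               if meta["category"] == "electronics"
               and meta["price"] < max_price
               and meta["in_stock"]]
    if not allowed:
        return []
    # 2) Vector similarity only on the filtered subset
    subset = vectors[allowed]                                  # (m, d) slice of the full matrix
    q = query_vec / np.linalg.norm(query_vec)
    sims = (subset / np.linalg.norm(subset, axis=1, keepdims=True)) @ q
    order = np.argsort(sims)[::-1][:top_k]
    return [(allowed[i], float(sims[i])) for i in order]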
Hints in Layers
If you get stuck, reveal these hints progressively. Try each challenge first—debugging is how you learn.
Hint Layer 1: Getting Started
Hint: I don't know how to start. What's the first file I should create?
Start with this minimal structure:
# vector_db.py
import numpy as np
class VectorDB:
def __init__(self, dimensions, metric='cosine'):
self.dimensions = dimensions
self.metric = metric
# Storage
self.vectors = [] # List of numpy arrays
self.ids = [] # Corresponding IDs
# Mappings
self.id_to_idx = {} # ID -> index
self.idx_to_id = {} # index -> ID
def insert(self, vec_id, vector):
"""Add a vector to the database."""
if vec_id in self.id_to_idx:
raise ValueError(f"ID {vec_id} already exists")
if len(vector) != self.dimensions:
raise ValueError(f"Vector has {len(vector)} dims, expected {self.dimensions}")
idx = len(self.vectors)
self.vectors.append(np.array(vector))
self.ids.append(vec_id)
self.id_to_idx[vec_id] = idx
self.idx_to_id[idx] = vec_id
def search(self, query, k=10):
"""
Find k nearest neighbors to query.
Start with brute force, implement HNSW later.
"""
if len(self.vectors) == 0:
return []
# Brute force for now
query = np.array(query)
similarities = []
for idx, vec in enumerate(self.vectors):
sim = self._cosine_similarity(query, vec)
similarities.append((idx, sim))
# Sort by similarity (descending)
similarities.sort(key=lambda x: x[1], reverse=True)
# Return top k
results = []
for idx, sim in similarities[:k]:
vec_id = self.idx_to_id[idx]
results.append((vec_id, sim))
return results
def _cosine_similarity(self, a, b):
"""Compute cosine similarity between two vectors."""
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Test it
if __name__ == '__main__':
db = VectorDB(dimensions=3)
# Insert some vectors
db.insert("doc_1", [1.0, 0.0, 0.0])
db.insert("doc_2", [0.9, 0.1, 0.0])
db.insert("doc_3", [0.0, 1.0, 0.0])
# Search
query = [1.0, 0.0, 0.0]
results = db.search(query, k=2)
print("Results:")
for doc_id, score in results:
print(f" {doc_id}: {score:.3f}")
Get this working first. Don’t implement HNSW yet. Brute force is your baseline.
Next step: Vectorize the search (see next hint).
Hint: How do I make brute-force search faster with NumPy?
Replace the Python loop with vectorized operations:
def search(self, query, k=10):
"""Vectorized brute-force search."""
if len(self.vectors) == 0:
return []
query = np.array(query)
# Convert list of vectors to 2D array (n_vectors × dimensions)
vectors_matrix = np.array(self.vectors)
# Batch cosine similarity
similarities = self._cosine_similarity_batch(query, vectors_matrix)
# Get top k indices
top_k_indices = np.argsort(similarities)[::-1][:k]
# Convert to results
results = []
for idx in top_k_indices:
vec_id = self.idx_to_id[idx]
score = similarities[idx]
results.append((vec_id, score))
return results
def _cosine_similarity_batch(self, query, vectors):
"""
Compute cosine similarity between query and all vectors.
Args:
query: (d,) array
vectors: (n, d) array
Returns:
(n,) array of similarities
"""
# Normalize query
query_norm = query / np.linalg.norm(query)
# Normalize all vectors
vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
# Batch dot product (matrix-vector multiplication)
similarities = vectors_norm @ query_norm
return similarities
Benchmark:
import time
# Generate test data
np.random.seed(42)
db = VectorDB(dimensions=384)  # fresh index sized for the 384-dim test vectors
for i in range(10000):
vec = np.random.rand(384)
db.insert(f"doc_{i}", vec)
query = np.random.rand(384)
start = time.time()
results = db.search(query, k=10)
elapsed = time.time() - start
print(f"Search time: {elapsed*1000:.1f}ms for {len(db.vectors)} vectors")
# Should be ~10-30ms for 10K vectors
Optimization: Pre-normalize vectors at insert time if using cosine similarity (see next hint).
Hint: Should I pre-normalize vectors?
Yes! If using cosine similarity, pre-normalizing vectors at insert time makes search 2-3x faster:
class VectorDB:
def __init__(self, dimensions, metric='cosine'):
self.dimensions = dimensions
self.metric = metric
self.vectors = []
self.ids = []
self.id_to_idx = {}
self.idx_to_id = {}
# Pre-normalization flag
self.prenormalized = (metric == 'cosine')
def insert(self, vec_id, vector):
"""Add a vector, pre-normalizing if using cosine similarity."""
if vec_id in self.id_to_idx:
raise ValueError(f"ID {vec_id} already exists")
vector = np.array(vector)
if len(vector) != self.dimensions:
raise ValueError(f"Expected {self.dimensions} dims, got {len(vector)}")
# Pre-normalize for cosine similarity
if self.prenormalized:
norm = np.linalg.norm(vector)
if norm > 0:
vector = vector / norm
idx = len(self.vectors)
self.vectors.append(vector)
self.ids.append(vec_id)
self.id_to_idx[vec_id] = idx
self.idx_to_id[idx] = vec_id
def _cosine_similarity_batch(self, query, vectors):
"""
Fast cosine similarity for pre-normalized vectors.
Cosine similarity = dot product when vectors are normalized.
"""
if self.prenormalized:
# Normalize query
query_norm = query / np.linalg.norm(query)
# Vectors already normalized, just dot product!
return vectors @ query_norm
else:
# General case (if not pre-normalized)
query_norm = query / np.linalg.norm(query)
vectors_norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
return vectors_norm @ query_norm
Speedup explanation:
- Without pre-normalization: 3 operations (dot, norm(query), norm(vectors))
- With pre-normalization: 1 operation (dot only)
- Result: ~3x faster search
Trade-off: Original vectors not preserved. Document this clearly.
Hint Layer 2: HNSW Implementation
Hint: How do I start implementing HNSW? What's the first function?
Start by adding layer assignment and basic graph structure:
import random
import math
class HNSW:
def __init__(self, dimensions, M=16, ef_construction=200, m_L=1.44):
"""
Args:
M: Max connections per node at layer 0
ef_construction: Size of dynamic candidate list during insert
m_L: Normalization factor for level generation (the HNSW paper uses 1/ln(M) ≈ 0.36 for M=16; larger values create more layers)
"""
self.dimensions = dimensions
self.M = M
self.M_max = M # Max connections for layers > 0
self.ef_construction = ef_construction
self.m_L = m_L
# Vectors are stored L2-normalized, so cosine similarity reduces to a dot product
self.prenormalized = True
# Storage
self.vectors = []
self.id_to_idx = {}
self.idx_to_id = {}
# Graph structure: list of dicts (one dict per layer)
# graph[layer][node_idx] = [neighbor_indices]
self.graph = []
# Entry point for search (highest layer node)
self.entry_point = None
def _assign_layer(self):
"""
Randomly assign a layer for a new node.
P(level >= l) = exp(-l / m_L), so higher layers are exponentially less likely
"""
return math.floor(-math.log(random.uniform(0, 1)) * self.m_L)
def insert(self, vec_id, vector):
"""Add a vector to the HNSW index."""
if vec_id in self.id_to_idx:
raise ValueError(f"ID {vec_id} already exists")
vector = np.array(vector)
# Assign index and store vector
idx = len(self.vectors)
self.vectors.append(vector)
self.id_to_idx[vec_id] = idx
self.idx_to_id[idx] = vec_id
# Assign layer for this node
node_layer = self._assign_layer()
# Ensure we have enough layers
while len(self.graph) <= node_layer:
self.graph.append({})
# First node becomes entry point
if self.entry_point is None:
self.entry_point = idx
for layer in range(node_layer + 1):
self.graph[layer][idx] = []
return
# TODO: Implement neighbor search and connection
# (See next hint)
def search(self, query, k=10, ef_search=50):
"""Search for k nearest neighbors."""
# TODO: Implement search
# (See subsequent hints)
pass
Next step: Implement _search_layer (greedy search on a single layer).
Hint: How do I implement greedy search on a single layer?
Use a min-heap for candidates and max-heap for results:
import heapq
class HNSW:
# ... (previous methods)
def _search_layer(self, query, entry_points, num_closest, layer):
"""
Greedy search on a single layer.
Args:
query: Query vector
entry_points: List of starting node indices
num_closest: Number of closest nodes to return (ef parameter)
layer: Which layer to search
Returns:
List of (distance, node_idx) tuples, sorted by distance
"""
visited = set()
candidates = [] # Min-heap: (distance, node_idx)
results = [] # Max-heap: (-distance, node_idx) to keep top-k
# Initialize with entry points
for node_idx in entry_points:
dist = self._distance(query, self.vectors[node_idx])
heapq.heappush(candidates, (dist, node_idx))
heapq.heappush(results, (-dist, node_idx))
visited.add(node_idx)
# Greedy search
while candidates:
current_dist, current_idx = heapq.heappop(candidates)
# Stop if current is farther than our worst result
if current_dist > -results[0][0]:
break
# Explore neighbors
neighbors = self.graph[layer].get(current_idx, [])
for neighbor_idx in neighbors:
if neighbor_idx not in visited:
visited.add(neighbor_idx)
dist = self._distance(query, self.vectors[neighbor_idx])
# Add to candidates if better than worst result OR need more results
if dist < -results[0][0] or len(results) < num_closest:
heapq.heappush(candidates, (dist, neighbor_idx))
heapq.heappush(results, (-dist, neighbor_idx))
# Keep only top num_closest
if len(results) > num_closest:
heapq.heappop(results)
# Convert to positive distances and sort
return sorted([(-dist, idx) for dist, idx in results])
def _distance(self, a, b):
"""
Compute distance between vectors.
For cosine similarity, return negative similarity (so lower = better).
"""
# Assuming pre-normalized vectors
similarity = np.dot(a, b)
return -similarity # Negative so we can use min-heap
Test this independently:
# Create a simple graph manually
hnsw = HNSW(dimensions=3, M=2)
hnsw.vectors = [
np.array([1.0, 0.0, 0.0]), # idx 0
np.array([0.9, 0.1, 0.0]), # idx 1
np.array([0.0, 1.0, 0.0]), # idx 2
]
hnsw.graph = [
{0: [1], 1: [0, 2], 2: [1]} # Layer 0 connections
]
# Test search
query = np.array([1.0, 0.0, 0.0])
results = hnsw._search_layer(query, entry_points=[0], num_closest=2, layer=0)
print(results) # Should return nodes 0 and 1 (most similar)
Hint: How do I connect a new node to its neighbors during insert?
After finding neighbors with _search_layer, you need to:
- Connect new node to neighbors
- Connect neighbors back to new node (bidirectional)
- Prune connections if they exceed M
class HNSW:
# ... (previous methods)
def insert(self, vec_id, vector):
"""Add a vector to the HNSW index."""
# ... (earlier code for ID/vector storage)
vector = np.array(vector)
if self.prenormalized:
vector = vector / np.linalg.norm(vector)
idx = len(self.vectors)
self.vectors.append(vector)
self.id_to_idx[vec_id] = idx
self.idx_to_id[idx] = vec_id
node_layer = self._assign_layer()
# Ensure graph has enough layers
while len(self.graph) <= node_layer:
self.graph.append({})
# First node becomes entry point
if self.entry_point is None:
self.entry_point = idx
for layer in range(node_layer + 1):
self.graph[layer][idx] = []
return
# Find nearest neighbors at each layer
entry_points = [self.entry_point]
# Search from top layer down to node's layer
for layer in range(len(self.graph) - 1, node_layer, -1):
nearest = self._search_layer(vector, entry_points, num_closest=1, layer=layer)
entry_points = [idx for _, idx in nearest]
# Insert into layers from node_layer down to 0
for layer in range(node_layer, -1, -1):
# Find neighbors at this layer
candidates = self._search_layer(
vector,
entry_points,
num_closest=self.ef_construction,
layer=layer
)
# Select M neighbors
M = self.M if layer == 0 else self.M_max
neighbors = self._select_neighbors(candidates, M)
# Add edges
self.graph[layer][idx] = neighbors
# Add reverse edges (bidirectional)
for neighbor_idx in neighbors:
if neighbor_idx not in self.graph[layer]:
self.graph[layer][neighbor_idx] = []
self.graph[layer][neighbor_idx].append(idx)
# Prune if neighbor has too many connections
if len(self.graph[layer][neighbor_idx]) > M:
# Recompute best M neighbors for this node
neighbor_vec = self.vectors[neighbor_idx]
neighbor_candidates = [
(self._distance(neighbor_vec, self.vectors[n]), n)
for n in self.graph[layer][neighbor_idx]
]
new_neighbors = self._select_neighbors(neighbor_candidates, M)
self.graph[layer][neighbor_idx] = new_neighbors
# Update entry points for next layer
entry_points = neighbors
# Update global entry point if new node is at higher layer
if node_layer > self._get_node_layer(self.entry_point):
self.entry_point = idx
def _select_neighbors(self, candidates, M):
"""
Select M best neighbors from candidates.
Start with simple approach: just take M nearest.
"""
sorted_candidates = sorted(candidates, key=lambda x: x[0])
return [idx for _, idx in sorted_candidates[:M]]
def _get_node_layer(self, idx):
"""Find the highest layer a node appears in."""
for layer in range(len(self.graph) - 1, -1, -1):
if idx in self.graph[layer]:
return layer
return 0
Test: Insert 100 vectors and check graph connectivity:
hnsw = HNSW(dimensions=384, M=16)
for i in range(100):
vec = np.random.rand(384)
hnsw.insert(f"doc_{i}", vec)
# Check layer distribution
for layer in range(len(hnsw.graph)):
print(f"Layer {layer}: {len(hnsw.graph[layer])} nodes")
# Check connectivity (no isolated nodes)
for layer in range(len(hnsw.graph)):
isolated = [n for n, neighbors in hnsw.graph[layer].items() if len(neighbors) == 0]
print(f"Layer {layer}: {len(isolated)} isolated nodes (should be 0 or very few)")
Hint: How do I implement the full search (across all layers)?
Search starts at the entry point, traverses down through layers:
class HNSW:
# ... (previous methods)
def search(self, query, k=10, ef_search=50):
"""
Search for k nearest neighbors.
Args:
query: Query vector
k: Number of results to return
ef_search: Size of candidate list during search (higher = better recall)
Returns:
List of (vec_id, similarity) tuples
"""
if self.entry_point is None:
return []
query = np.array(query)
if self.prenormalized:
query = query / np.linalg.norm(query)
# Start at entry point
entry_points = [self.entry_point]
# Search from top layer down to layer 1 (ef=1, just find nearest)
for layer in range(len(self.graph) - 1, 0, -1):
nearest = self._search_layer(query, entry_points, num_closest=1, layer=layer)
entry_points = [idx for _, idx in nearest]
# Search layer 0 with larger ef for better recall
candidates = self._search_layer(query, entry_points, ef_search, layer=0)
# Convert to results (distances to similarities)
results = []
for dist, idx in candidates[:k]:
vec_id = self.idx_to_id[idx]
similarity = -dist # Convert back to positive similarity
results.append((vec_id, similarity))
return results
Test end-to-end:
# Create index with known vectors
hnsw = HNSW(dimensions=3, M=4, ef_construction=10)
test_vectors = {
"A": [1.0, 0.0, 0.0],
"B": [0.9, 0.1, 0.0],
"C": [0.0, 1.0, 0.0],
"D": [0.0, 0.9, 0.1],
}
for vec_id, vec in test_vectors.items():
hnsw.insert(vec_id, vec)
# Query should return A, then B (most similar to [1,0,0])
query = [1.0, 0.0, 0.0]
results = hnsw.search(query, k=2)
print(results) # Should be [('A', ~1.0), ('B', ~0.9)]
Hint Layer 3: Optimization and Debugging
Hint: My search is slow. How do I profile it?
Use Python’s cProfile to find bottlenecks:
import cProfile
import pstats
# Generate test data
hnsw = HNSW(dimensions=384, M=16)
for i in range(10000):
vec = np.random.rand(384)
hnsw.insert(f"doc_{i}", vec)
# Profile search
test_queries = [np.random.rand(384) for _ in range(100)]
profiler = cProfile.Profile()
profiler.enable()
for query in test_queries:
hnsw.search(query, k=10, ef_search=50)
profiler.disable()
# Print stats
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20) # Top 20 functions
Common bottlenecks:
- _distance called too many times → Batch distance computations
- Heap operations slow → Use NumPy for sorting if k is large
- Dictionary lookups in tight loops → Cache frequently accessed values
Optimization example:
# BEFORE (slow)
for neighbor in neighbors:
dist = self._distance(query, self.vectors[neighbor])
# ...
# AFTER (fast)
# Batch distance computation
neighbor_vecs = np.array([self.vectors[n] for n in neighbors])
distances = self._distance_batch(query, neighbor_vecs)
for neighbor, dist in zip(neighbors, distances):
# ...
Hint: My recall is <80%. What's wrong?
Debug checklist:
1. Check ef_search is high enough:
# Try increasing ef_search
for ef in [10, 50, 100, 200]:
results = hnsw.search(query, k=10, ef_search=ef)
hnsw_ids = set(doc_id for doc_id, _ in results)
# Compare with brute force ground truth (assumed to return the true top-10 doc IDs)
exact_ids = set(brute_force_search(query, k=10))
recall = len(hnsw_ids & exact_ids) / 10
print(f"ef={ef}: recall={recall:.1%}")
If recall improves significantly, your default ef_search is too low.
2. Check ef_construction:
# Rebuild index with higher ef_construction
hnsw_low = HNSW(dimensions=384, ef_construction=10)   # Too low
hnsw_high = HNSW(dimensions=384, ef_construction=200) # Better
# Insert same data, compare recall
Higher ef_construction builds better graph quality.
3. Check M parameter:
# Too few connections = poor recall
hnsw_m4 = HNSW(dimensions=384, M=4)   # Low recall
hnsw_m16 = HNSW(dimensions=384, M=16) # Better recall
hnsw_m32 = HNSW(dimensions=384, M=32) # Even better (but more memory)
4. Verify graph connectivity:
# Check for isolated nodes
for layer in range(len(hnsw.graph)):
isolated = [n for n, neighbors in hnsw.graph[layer].items()
if len(neighbors) == 0]
if isolated:
print(f"WARNING: Layer {layer} has {len(isolated)} isolated nodes")
# This indicates a bug in insert logic
5. Implement heuristic neighbor selection (Algorithm 4 from HNSW paper):
def _select_neighbors_heuristic(self, candidates, M):
"""Select diverse neighbors (better than just taking M nearest)."""
candidates = sorted(candidates, key=lambda x: x[0])
result = []
for dist_to_q, candidate in candidates:
# Check distance to already-selected neighbors
should_add = True
for prev_dist, prev_neighbor in result:
dist_between = self._distance(
self.vectors[candidate],
self.vectors[prev_neighbor]
)
if dist_between < dist_to_q:
# Candidate is closer to existing neighbor than to query
# Skip to maintain diversity
should_add = False
break
if should_add:
result.append((dist_to_q, candidate))
if len(result) >= M:
break
# Fill remaining slots if needed
if len(result) < M:
for item in candidates:
if item not in result:
result.append(item)
if len(result) >= M:
break
return [idx for _, idx in result]
6. Compare with brute force systematically:
def measure_recall(hnsw, test_queries, k=10):
"""Measure recall against brute force ground truth."""
recalls = []
for query in test_queries:
# Ground truth (brute force)
exact_results = brute_force_knn(query, hnsw.vectors, k)
exact_ids = set([hnsw.idx_to_id[idx] for idx in exact_results])
# HNSW results
hnsw_results = hnsw.search(query, k=k)
hnsw_ids = set([doc_id for doc_id, _ in hnsw_results])
# Recall = |intersection| / k
recall = len(exact_ids & hnsw_ids) / k
recalls.append(recall)
return np.mean(recalls)
recall = measure_recall(hnsw, test_queries, k=10)
print(f"Average recall@10: {recall:.1%}")
Hint: How do I save/load the index to disk?
Start with a simple pickle-based approach:
import pickle
import numpy as np
class HNSW:
# ... (previous methods)
def save(self, filepath):
"""Save index to disk."""
state = {
'dimensions': self.dimensions,
'M': self.M,
'M_max': self.M_max,
'ef_construction': self.ef_construction,
'm_L': self.m_L,
'vectors': self.vectors,
'id_to_idx': self.id_to_idx,
'idx_to_id': self.idx_to_id,
'graph': self.graph,
'entry_point': self.entry_point,
}
with open(filepath, 'wb') as f:
pickle.dump(state, f)
@classmethod
def load(cls, filepath):
"""Load index from disk."""
with open(filepath, 'rb') as f:
state = pickle.load(f)
# Create instance
hnsw = cls(
dimensions=state['dimensions'],
M=state['M'],
ef_construction=state['ef_construction'],
m_L=state['m_L']
)
# Restore state
hnsw.vectors = state['vectors']
hnsw.id_to_idx = state['id_to_idx']
hnsw.idx_to_id = state['idx_to_id']
hnsw.graph = state['graph']
hnsw.entry_point = state['entry_point']
hnsw.M_max = state['M_max']
return hnsw
# Usage
hnsw.save("my_index.pkl")
hnsw_loaded = HNSW.load("my_index.pkl")
Better approach (production): Use separate files for vectors and metadata:
import json
import hashlib
class HNSW:
def save(self, filepath_prefix):
"""
Save index with separate files for vectors and metadata.
Creates:
{prefix}.vectors.npy - Vector data
{prefix}.meta.json - Graph + metadata
"""
# Save vectors
vectors_file = f"{filepath_prefix}.vectors.npy"
np.save(vectors_file, np.array(self.vectors))
# Compute checksum
with open(vectors_file, 'rb') as f:
checksum = hashlib.sha256(f.read()).hexdigest()
# Save metadata
metadata = {
'version': 1,
'dimensions': self.dimensions,
'M': self.M,
'M_max': self.M_max,
'ef_construction': self.ef_construction,
'm_L': self.m_L,
'num_vectors': len(self.vectors),
'entry_point': self.entry_point,
'id_to_idx': self.id_to_idx,
'idx_to_id': {int(k): v for k, v in self.idx_to_id.items()},  # JSON stores keys as strings; load() converts them back to int
'graph': [
{int(k): v for k, v in layer.items()}
for layer in self.graph
],
'checksum': checksum
}
with open(f"{filepath_prefix}.meta.json", 'w') as f:
json.dump(metadata, f, indent=2)
@classmethod
def load(cls, filepath_prefix):
"""Load index from disk with validation."""
# Load metadata
with open(f"{filepath_prefix}.meta.json") as f:
metadata = json.load(f)
# Validate version
if metadata['version'] != 1:
raise ValueError(f"Unsupported index version: {metadata['version']}")
# Load vectors
vectors_file = f"{filepath_prefix}.vectors.npy"
vectors = np.load(vectors_file)
# Validate checksum
with open(vectors_file, 'rb') as f:
checksum = hashlib.sha256(f.read()).hexdigest()
if checksum != metadata['checksum']:
raise ValueError("Checksum mismatch! Index may be corrupted.")
# Create instance
hnsw = cls(
dimensions=metadata['dimensions'],
M=metadata['M'],
ef_construction=metadata['ef_construction'],
m_L=metadata['m_L']
)
# Restore state
hnsw.vectors = list(vectors)
hnsw.id_to_idx = metadata['id_to_idx']
hnsw.idx_to_id = {int(k): v for k, v in metadata['idx_to_id'].items()}
hnsw.graph = [
{int(k): v for k, v in layer.items()}
for layer in metadata['graph']
]
hnsw.entry_point = metadata['entry_point']
hnsw.M_max = metadata['M_max']
return hnsw
# Usage
hnsw.save("my_index")
# Creates: my_index.vectors.npy, my_index.meta.json
hnsw_loaded = HNSW.load("my_index")
Books That Will Help
Structured learning resources to deepen your understanding:
| Topic | Book | Specific Chapters | Why It Helps | Time Investment |
|---|---|---|---|---|
| Vector Similarity & Linear Algebra | “Mathematics for Machine Learning” by Deisenroth, Faisal & Ong | Ch. 3 (Analytic Geometry), Sections 3.1-3.3 | Understand dot products, norms, angles—the foundation of cosine similarity | 3-4 hours |
| Graph Algorithms | “Algorithms, Fourth Edition” by Sedgewick & Wayne | Ch. 4.1 (Undirected Graphs), Ch. 4.2 (Directed Graphs) | Learn graph representation, DFS, BFS—essential for understanding HNSW traversal | 4-5 hours |
| Algorithm Complexity | “Introduction to Algorithms” (CLRS) | Ch. 3 (Growth of Functions), Ch. 22.1-22.3 (Graph basics and search) | Understand why HNSW is O(log n) and how to analyze performance | 3-4 hours |
| Skip Lists (conceptual foundation) | “Algorithms, Fourth Edition” by Sedgewick & Wayne | Ch. 3.3 (Balanced Search Trees) | Understand hierarchical search structures—skip lists are the 1D analog of HNSW | 2 hours |
| HNSW Algorithm (primary source) | “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs” (Malkov & Yashunin) - Original paper | Full paper (20 pages) | The definitive resource—read AFTER understanding graphs | 3-5 hours |
| NumPy Optimization | “Python for Data Analysis” by Wes McKinney | Ch. 4 (NumPy Basics), Appendix A (Advanced NumPy) | Write fast vectorized code (100x faster than Python loops) | 3-4 hours |
| Information Retrieval Metrics | “Introduction to Information Retrieval” by Manning, Raghavan & Schütze | Ch. 8 (Evaluation in Information Retrieval) | Understand recall, precision, NDCG—how to measure search quality | 2-3 hours |
| Production Systems | “AI Engineering” by Chip Huyen | Ch. 6 (Data Engineering), Ch. 7 (Model Deployment) | Scaling, persistence, monitoring—making this production-ready | 4-5 hours |
| Distributed Systems (for scaling) | “Designing Data-Intensive Applications” by Kleppmann | Ch. 4 (Encoding), Ch. 5 (Replication), Ch. 6 (Partitioning) | Understand sharding, replication—scaling beyond single machine | 6-8 hours |
Reading Path (Priority Order)
Phase 1: Core Prerequisites (10-12 hours)
- Sedgewick Ch. 4.1-4.2 (Graphs) - 30-40 pages
- McKinney Ch. 4 (NumPy) - 40 pages
- Deisenroth Ch. 3.1-3.3 (Vectors and similarity) - 25 pages
- CLRS Ch. 3 (Complexity) - 20-30 pages
Phase 2: HNSW Deep Dive (5-8 hours)
- Sedgewick Ch. 3.3 (Skip lists for intuition) - 15 pages
- HNSW paper (full read, take notes) - 20 pages
- Manning Ch. 8 (Evaluation metrics) - 25 pages
Phase 3: Production & Scaling (10-15 hours)
- Huyen Ch. 6-7 (Production ML) - 60 pages
- Kleppmann Ch. 4-6 (Distributed systems) - 100 pages
Total reading: ~300-350 pages, 25-35 hours
Study Tips
- Don’t read linearly: Jump to specific sections as needed
- Code while reading: Implement concepts immediately
- Skip proofs initially: Focus on intuition, come back to math later
- Use papers as reference: HNSW paper is dense—read section 3 (algorithm) first, skim the rest
- Test understanding: After each chapter, explain it to someone (rubber duck debugging)
Alternative: Video Resources
If you prefer video learning:
- StatQuest: Cosine Similarity - 10 min
- William Fiset: Graph Theory Playlist - 2-3 hours
- Andrej Karpathy: Let’s build GPT - Learn from implementation approach
- Pinecone: HNSW Explanation - Blog series with visuals
Project 3: Build a Complete RAG System (No LangChain)
- File: GENERATIVE_AI_LLM_RAG_LEARNING_PROJECTS.md
- Programming Language: Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Generative AI / Information Retrieval
- Software or Tool: Vector DB / LLM API
- Main Book: “AI Engineering” by Chip Huyen
What you’ll build: A question-answering system over your own documents (PDFs, markdown files) that chunks, embeds, retrieves, and generates answers — without using LangChain or LlamaIndex.
Why it teaches RAG: Frameworks hide everything. Building RAG manually forces you to understand why chunk size matters, how retrieval quality directly impacts generation quality, and where the “lost in the middle” problem comes from.
Core challenges you’ll face:
- Implementing chunking strategies (fixed, semantic, recursive) and seeing how they affect retrieval (maps to understanding document processing; see the sketch after this list)
- Building the retrieval pipeline with proper scoring (maps to understanding information retrieval)
- Constructing prompts that maximize context utilization (maps to understanding prompt engineering for RAG)
- Evaluating RAG quality (relevance, faithfulness, hallucination detection)
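As a taste of the first challenge above, a minimal fixed-size chunker with overlap (character-based for simplicity; semantic and recursive strategies build on the same idea):
def chunk_text(text, chunk_size=800, overlap=200):
    # Fixed-size chunking with overlap, so sentences cut at a boundary
    # still appear intact in the neighbouring chunk
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks

# Usage: chunks = chunk_text(open("policy.md").read())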
Key Concepts:
- Chunking Strategies: “Enhancing RAG: A Study of Best Practices” - January 2025 arXiv paper
- Retrieval Fundamentals: “Designing Data-Intensive Applications” Chapter 3 - Martin Kleppmann
- Prompt Construction: “AI Engineering” Chapter 6 (RAG) - Chip Huyen
- Evaluation: “RAG Assessment (RAGAS)” - Standard metrics documentation
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Python, understanding of embeddings, API access to an LLM (OpenAI, Claude, or local Ollama)
Real world outcome:
- A CLI: ./myrag query "What is the refund policy?" --docs ./company_policies/
- The system returns an answer with citations: “According to [policy.pdf, page 3], refunds are…”
- A dashboard showing retrieval scores, chunk sources, and confidence levels
Learning milestones:
- After implementing chunking: You’ll see how chunk size directly impacts retrieval quality — too small loses context, too large dilutes relevance
- After building retrieval: You’ll understand why “semantic search” isn’t magic and sometimes fails badly
- After end-to-end evaluation: You’ll understand the full chain of quality degradation and where to invest optimization effort
Project 4: Implement a Two-Stage Retrieval System with Reranking
- File: GENERATIVE_AI_LLM_RAG_LEARNING_PROJECTS.md
- Programming Language: Python
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Information Retrieval / ML
- Software or Tool: Cross-Encoders
- Main Book: “Introduction to Information Retrieval” by Manning, Raghavan & Schütze
What you’ll build: A search system that first retrieves 100 candidates using fast bi-encoder search, then re-ranks them with a cross-encoder to surface the most relevant results.
Why it teaches Reranking: Bi-encoders are fast but compress meaning into a single vector — they lose nuance. Cross-encoders are accurate but slow because they process query+document pairs together. Building both yourself reveals why modern search uses a two-stage architecture.
Core challenges you’ll face:
- Implementing/fine-tuning a bi-encoder for fast retrieval (maps to understanding embedding models)
- Implementing/using a cross-encoder for reranking (maps to understanding attention between query and document)
- Measuring precision@k before and after reranking (maps to understanding evaluation metrics)
- Finding the optimal candidate count for reranking (100? 50? 200?)
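A minimal two-stage sketch using the sentence-transformers library; the model names are common defaults rather than requirements, and the tiny corpus stands in for whatever your first-stage index holds:
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")             # stage 1: fast bi-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # stage 2: slow but precise

corpus = ["Refunds are issued within 30 days.", "Shipping takes 5-7 business days."]
corpus_emb = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "What is the refund policy?"
query_emb = bi_encoder.encode(query, convert_to_tensor=True)

# Stage 1: retrieve top candidates by cosine similarity
hits = util.semantic_search(query_emb, corpus_emb, top_k=100)[0]

# Stage 2: rerank candidates with the cross-encoder (scores each query-document pair jointly)
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = reranker.predict(pairs)
reranked = sorted(zip(hits, scores), key=lambda x: x[1], reverse=True)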
Resources for key challenges:
- “Retrieve & Re-Rank” - Sentence Transformers documentation
- “Rerankers and Two-Stage Retrieval” - Pinecone’s excellent visual guide
Key Concepts:
- Bi-Encoders: “Sentence-BERT paper” - Reimers & Gurevych
- Cross-Encoders: “Training and Finetuning Reranker Models” - Hugging Face blog
- Information Retrieval Metrics: “Introduction to Information Retrieval” Chapter 8 (Evaluation) - Manning, Raghavan & Schütze
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Project 2 or 3 completed, familiarity with Hugging Face transformers
Real world outcome:
- A search API that returns results with clear before/after reranking comparison
- Benchmarks showing: “Bi-encoder alone: 72% precision@5, With reranking: 91% precision@5”
- A visualization showing how document rankings change after reranking
Learning milestones:
- After bi-encoder implementation: You’ll understand why embedding search is fast but approximate
- After cross-encoder implementation: You’ll see why cross-encoders are more accurate but can’t scale
- After combining both: You’ll internalize the fundamental speed-accuracy tradeoff in information retrieval
Project 5: Fine-Tune Your Own Embedding Model
- File: GENERATIVE_AI_LLM_RAG_LEARNING_PROJECTS.md
- Programming Language: Python
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 4: Expert
- Knowledge Area: ML Training / NLP
- Software or Tool: Sentence Transformers
- Main Book: “AI Engineering” by Chip Huyen
What you’ll build: Take an existing embedding model (like all-MiniLM-L6-v2) and fine-tune it on your domain-specific data to dramatically improve retrieval quality.
Why it teaches embeddings deeply: General-purpose embeddings work “okay” everywhere but excel nowhere. Fine-tuning forces you to understand contrastive learning, hard negative mining, and how embeddings encode meaning.
Core challenges you’ll face:
- Curating training data (query-positive-negative triplets) (maps to understanding what embeddings learn)
- Implementing contrastive loss (InfoNCE, Multiple Negatives Ranking Loss) (maps to understanding how embeddings are trained)
- Mining hard negatives effectively (maps to understanding why easy negatives don’t improve models)
- Evaluating embedding quality (NDCG, MRR, recall@k)
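A minimal fine-tuning sketch with sentence-transformers and Multiple Negatives Ranking Loss (the classic fit() API; the example triplet, output path, and batch size are assumptions):
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each example: (query, positive passage, hard negative). MNRL also treats
# every other positive in the batch as an additional in-batch negative.
train_examples = [
    InputExample(texts=["statute of limitations personal injury California",
                        "In California, the statute of limitations for personal injury is two years...",
                        "In California, the statute of limitations for medical malpractice is three years..."]),
    # ... thousands more triplets ...
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="./legal-embeddings-v1",
)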
Key Concepts:
- Contrastive Learning: “AI Engineering” Chapter 5 (Fine-tuning) - Chip Huyen
- Loss Functions: “Sentence Transformers Training Overview” - SBERT docs
- Hard Negative Mining: “Training State-of-the-Art Embedding Models” - Hugging Face
Difficulty: Advanced Time estimate: 2 weeks Prerequisites: PyTorch, completed Projects 2-4, understanding of loss functions
Real world outcome:
- Side-by-side comparison: “Base model recall@10: 68%, Fine-tuned recall@10: 89%”
- Your fine-tuned model uploaded to Hugging Face Hub
- A demo where domain-specific queries (legal, medical, your codebase) return dramatically better results
Learning milestones:
- After data curation: You’ll understand that embedding quality is bounded by training data quality
- After training: You’ll viscerally understand the embedding space — similar things cluster together
- After evaluation: You’ll understand when to fine-tune vs when to use better base models
Real World Outcome
After completing this project, you will have:
Concrete Deliverables:
- A fine-tuned embedding model uploaded to Hugging Face Hub (e.g., your-username/legal-embeddings-v1)
- Side-by-side benchmark comparison showing measurable improvements:
  - Base model (all-MiniLM-L6-v2): Recall@10: 68%, NDCG@10: 0.71, MRR: 0.62
  - Fine-tuned model: Recall@10: 89%, NDCG@10: 0.87, MRR: 0.81
- A demo notebook where domain-specific queries (e.g., “breach of contract remedies” for legal, “myocardial infarction symptoms” for medical) return dramatically more relevant results
- Training artifacts: loss curves, validation metrics over epochs, hard negative mining statistics
Tangible Demonstration:
# Query with base model
$ python demo.py --model sentence-transformers/all-MiniLM-L6-v2 --query "statute of limitations personal injury"
Top 3 results: [generic legal terms, irrelevant contract law, wrong jurisdiction]
# Same query with fine-tuned model
$ python demo.py --model your-username/legal-embeddings-v1 --query "statute of limitations personal injury"
Top 3 results: [exact tort law statutes, jurisdiction-specific timelines, case precedents]
Validation Metrics:
- Training dataset: 50,000+ triplets (query, positive, negative)
- Validation set performance tracked every 500 steps
- Final model size: ~80MB (quantized version: ~20MB)
- Inference speed: 2000 queries/second on CPU (comparable to base model)
The Core Question You’re Answering
“How are embeddings actually trained, and why does domain-specific fine-tuning dramatically improve retrieval quality over general-purpose models?”
This breaks down into:
- What is contrastive learning, and why is it the foundation of embedding training?
- Why do “hard negatives” matter more than easy negatives for training effective embeddings?
- How does the loss function (InfoNCE, Multiple Negatives Ranking Loss, Triplet Loss) shape the embedding space?
- When should you fine-tune an existing model vs training from scratch vs using a general model?
By building this project, you’ll move from treating embeddings as magic black boxes to understanding:
- How similarity is learned through positive/negative examples
- Why embeddings from the same family (e.g., BERT-based) can have wildly different quality depending on training data
- The computational and data requirements for effective fine-tuning
- How to evaluate embedding quality beyond simple “semantic similarity”
Concepts You Must Understand First
1. Contrastive Learning
What it is: A training paradigm where the model learns by contrasting similar (positive) examples against dissimilar (negative) examples.
Why it matters for embeddings:
- You want embed(query) to be close to embed(relevant_doc) in vector space
- But also far from embed(irrelevant_doc)
- The model learns to pull positive pairs together and push negative pairs apart
Book references:
- “AI Engineering” by Chip Huyen - Chapter 5 (Fine-tuning section on contrastive objectives)
- “Deep Learning” by Goodfellow et al. - Chapter 20.10 (Metric Learning)
Key Intuition:
Before training (random embeddings):
query: [0.2, 0.5, 0.1]
positive_doc: [0.8, 0.1, 0.3] ← Similarity: 0.51
negative_doc: [0.3, 0.4, 0.2] ← Similarity: 0.95 (higher than the positive!)
After contrastive training:
query: [0.7, 0.2, 0.9]
positive_doc: [0.72, 0.19, 0.91] ← Similarity: ~1.00 (high!)
negative_doc: [0.1, 0.9, 0.05] ← Similarity: 0.28 (low)
2. Loss Functions for Embeddings
Multiple Negatives Ranking Loss (MNR) - Most common for asymmetric retrieval:
L = -log(exp(sim(q, p+) / τ) / Σ exp(sim(q, pi) / τ))
Where:
- q = query embedding
- p+ = positive document
- pi = all documents in batch (including negatives)
- τ = temperature parameter (controls distribution sharpness)
Triplet Loss - Classic formulation:
L = max(0, margin + sim(q, neg) - sim(q, pos))
Forces positive to be at least 'margin' closer than negative
InfoNCE (Contrastive Loss) - Used by SimCLR, CLIP:
Maximize agreement between positive pairs
Minimize agreement with all other pairs in batch
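To make the MNR/InfoNCE idea above concrete, a small NumPy sketch that treats every other document in the batch as an in-batch negative; it assumes row i of doc_embs is the positive for query i:
import numpy as np

def mnr_loss(query_embs, doc_embs, temperature=0.05):
    # query_embs, doc_embs: (B, d) arrays; doc_embs[i] is query i's positive
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sims = (q @ d.T) / temperature                      # (B, B) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)             # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # cross-entropy with the diagonal as labels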
Book references:
- “Dive into Deep Learning” by Zhang et al. - Chapter 14.7 (Similarity and Analogy Tasks)
- Sentence Transformers documentation: Training Overview
3. Hard Negative Mining
What it is: Strategically selecting negatives that are semantically similar to the query but not relevant.
Why it’s critical:
- Easy negatives (random documents) don’t teach the model much
- Hard negatives (similar but wrong) force the model to learn fine-grained distinctions
Example (legal domain):
Query: "breach of contract remedies"
Easy negative (useless for training):
"how to bake chocolate chip cookies" # Model already knows this is irrelevant
Hard negative (forces model to learn):
"breach of fiduciary duty remedies" # Same domain, similar terms, but different legal concept
Book references:
- “AI Engineering” Chapter 6 (RAG section on retrieval quality)
- Research paper: “In-Batch Negatives for Knowledge Distillation”
4. Evaluation Metrics for Embeddings
Recall@k: Of all the relevant documents, what fraction appear in the top k results?
If 10 documents are relevant, and top 10 retrieval includes 7 of them:
Recall@10 = 7/10 = 0.7
NDCG@k (Normalized Discounted Cumulative Gain): Rewards relevant documents appearing earlier in results.
NDCG considers both relevance and ranking position
Perfect ranking = 1.0
Random ranking scores much lower (the exact value depends on how many documents are relevant)
MRR (Mean Reciprocal Rank): The reciprocal of the rank of the first relevant result, averaged over queries.
If first relevant document is at position 3:
RR = 1/3 = 0.33
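A small sketch of Recall@k and MRR given ranked document IDs per query and the set of relevant IDs (NDCG needs graded relevance, so in practice you would reach for a library such as scikit-learn's ndcg_score):
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k=10):
    # relevant_ids is a set of ground-truth doc IDs for one query
    return len(set(ranked_ids[:k]) & relevant_ids) / max(len(relevant_ids), 1)

def mean_reciprocal_rank(all_ranked, all_relevant):
    rr = []
    for ranked_ids, relevant_ids in zip(all_ranked, all_relevant):
        rank = next((i + 1 for i, doc in enumerate(ranked_ids) if doc in relevant_ids), None)
        rr.append(1.0 / rank if rank else 0.0)
    return float(np.mean(rr))

# Example: 10 relevant docs, 7 of them in the top 10 -> recall@10 = 0.7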
Book references:
- “Introduction to Information Retrieval” by Manning et al. - Chapter 8 (Evaluation)
- RAGAS documentation: Evaluation Metrics
5. Bi-Encoder Architecture (What You’re Fine-Tuning)
Structure:
Query Encoder (shared weights with Doc Encoder):
Input: "What is the capital of France?"
↓
Tokenizer: [CLS] what is the capital of france [SEP]
↓
BERT/MiniLM: 12 layers of self-attention
↓
Pooling: Mean pooling of token embeddings
↓
Output: 384-dimensional vector
Document Encoder (same architecture):
Input: "Paris is the capital and largest city of France."
↓
... same process ...
↓
Output: 384-dimensional vector
Similarity: cosine(query_embedding, doc_embedding)
Why bi-encoders for retrieval:
- Encode queries and documents independently
- Pre-compute all document embeddings offline
- Fast search: just compute query embedding at runtime
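The offline/online split above in code, sketched with sentence-transformers (the model name is just the common default, not a requirement):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Offline: embed the whole corpus once and keep the (n, 384) matrix around
corpus = ["Paris is the capital and largest city of France.",
          "The Eiffel Tower was completed in 1889."]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

# Online: embed only the query; one matrix-vector product scores everything
query_emb = model.encode("What is the capital of France?",
                         convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(query_emb, corpus_emb)            # shape (1, n)
print(scores)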
Book references:
- “Build a Large Language Model” by Raschka - Chapter 2 (Understanding Transformers)
- Sentence-BERT paper: “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”
Questions to Guide Your Design
Data Collection & Curation
- How many training triplets do I need?
- Minimum: ~10,000 for noticeable improvement
- Good: 50,000-100,000 for solid domain adaptation
- Excellent: 500,000+ for state-of-the-art domain performance
- Reality check: Quality > quantity. 10k high-quality triplets beat 100k noisy ones.
- Where do I get query-positive-negative triplets?
- Option 1: Use existing labeled data (e.g., Stack Overflow question-answer pairs)
- Option 2: Mine from search logs (queries + clicked/skipped documents)
- Option 3: Synthetic generation with GPT-4 (create queries from documents)
- Option 4: Use BM25/base embeddings to mine hard negatives from corpus
- What makes a “hard negative” vs “easy negative”?
- Hard negative: High lexical overlap but semantically different
- Query: “python list comprehension”
- Hard neg: “python dictionary comprehension” (similar syntax, different data structure)
- Easy negative: Completely unrelated
- Query: “python list comprehension”
- Easy neg: “Java inheritance patterns” (different language entirely)
- Should I create balanced triplets or use in-batch negatives?
- Triplet approach: Explicitly create (query, pos, neg) tuples
- In-batch negatives: Use all other documents in batch as implicit negatives
- More efficient: a batch of N (query, positive) pairs yields N*(N-1) negative pairs for free
- Batch size becomes crucial (larger = more negatives)
Training Configuration
- What base model should I start with?
- For speed: all-MiniLM-L6-v2 (80MB, 384 dimensions, fast)
- For quality: all-mpnet-base-v2 (420MB, 768 dimensions, slower but better)
- For specific domains: Check Hugging Face for pre-existing domain models
- What batch size for contrastive learning?
- Larger batches = more in-batch negatives = better training signal
- But: memory limited
- Recommendation: Start with 16-32, use gradient accumulation to simulate 64-128
- Learning rate scheduling?
- Start too high → model forgets pre-training (catastrophic forgetting)
- Start too low → training takes forever
- Recommendation: Warm-up for 10% of steps, then cosine decay (see the sketch after this list)
- Typical range: 2e-5 to 5e-5 for fine-tuning
- How many epochs?
- Too few: Underfitting (no improvement over base model)
- Too many: Overfitting (model memorizes training queries)
- Recommendation: 1-5 epochs with early stopping based on validation loss
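A dependency-free sketch of the warm-up + cosine-decay recommendation above; it returns a multiplier you would apply to the base learning rate (e.g. 2e-5), and the 10% warm-up fraction matches the guidance in this list:
import math

def lr_multiplier(step, total_steps, warmup_frac=0.1):
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return step / warmup_steps                      # linear warm-up from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay from 1 to 0

# Usage: lr = 2e-5 * lr_multiplier(step, total_steps=5000)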
Evaluation Strategy
- How do I create a validation set?
- Hold out 10-20% of queries (not documents!)
- Ensure validation queries represent real use cases
- Include edge cases (ambiguous queries, rare domains)
- What metrics should I track during training?
- Primary: Validation loss (contrastive loss on held-out set)
- Secondary: Recall@10, NDCG@10, MRR on validation queries
- Sanity check: Training loss should decrease steadily
- How do I know if fine-tuning actually helped?
- Compare against base model on your validation set
- Minimum improvement: +5% Recall@10
- Good improvement: +10-15% Recall@10
- Excellent improvement: +20%+ Recall@10
Thinking Exercise
Before writing code, manually understand what makes a good training example:
Exercise: Create Training Triplets for Legal Domain
Scenario: You’re building a legal research assistant. Create 5 triplets.
Template:
Query: [What would a lawyer search for?]
Positive: [Document that answers the query]
Negative (hard): [Looks relevant but isn't]
Example Solution:
Triplet 1:
Query: "statute of limitations personal injury California"
Positive: "In California, the statute of limitations for personal injury cases is two years from the date of injury under California Code of Civil Procedure § 335.1."
Hard Negative: "In California, the statute of limitations for medical malpractice is three years from the date of injury or one year from discovery under California Code of Civil Procedure § 340.5."
Why it's hard: Same state, same legal concept (statute of limitations), but different cause of action.
Triplet 2:
Query: "breach of contract damages calculation"
Positive: "Contract damages are calculated as expectation damages (benefit of the bargain) plus consequential damages that were reasonably foreseeable at the time of contract formation."
Hard Negative: "Breach of fiduciary duty damages include actual losses caused by the breach plus disgorgement of any profits the fiduciary obtained through the breach."
Why it's hard: Both about breach and damages, but different legal standards apply.
Now create 3 more for your domain:
- Domain 1: Medical (e.g., “symptoms of myocardial infarction”)
- Domain 2: Code documentation (e.g., “how to handle async errors in JavaScript”)
- Domain 3: Your choice
Key insight: Notice how hard negatives share vocabulary with the query but require semantic understanding to distinguish. This is what you’re teaching the embedding model.
The Interview Questions They’ll Ask
Conceptual Understanding
- “Explain contrastive learning and why it’s used for training embeddings.”
- Answer: Contrastive learning trains models to bring similar examples closer together in embedding space while pushing dissimilar examples apart. For embeddings, we want `sim(query, relevant_doc) > sim(query, irrelevant_doc)`. The model learns by contrasting positive pairs (query + relevant doc) against negative pairs (query + irrelevant doc). The loss function penalizes the model when negative similarities are too high or positive similarities are too low (a PyTorch sketch of this loss follows this question list).
- “What’s the difference between triplet loss and Multiple Negatives Ranking Loss?”
- Answer:
- Triplet Loss: Explicit (query, positive, negative) triplets. Forces `sim(q, pos) > sim(q, neg) + margin`. Requires careful negative mining.
- MNR Loss: Uses all other documents in the batch as implicit negatives. More efficient (one query can train against many negatives). Better gradient signal from multiple negatives. Batch size becomes a hyperparameter.
- “Why do hard negatives matter more than easy negatives?”
- Answer: Easy negatives (completely unrelated documents) provide weak training signal—the model already knows they’re dissimilar. Hard negatives (similar-looking but semantically different) force the model to learn fine-grained distinctions. Example: For query “python list comprehension”, a hard negative like “python dictionary comprehension” teaches more than an easy negative like “Java inheritance”.
- “When should you fine-tune embeddings vs using a general-purpose model vs training from scratch?”
- Answer:
- Use general model: Domain has broad vocabulary, retrieval quality is acceptable (>80% Recall@10)
- Fine-tune: Domain has specialized vocabulary or concepts, you have 10k+ labeled examples, need 10-20% improvement
- Train from scratch: Extremely specialized domain (e.g., medical codes), have 500k+ examples, need state-of-the-art performance. Usually not worth it—fine-tuning wins on efficiency.
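A minimal PyTorch sketch of the in-batch contrastive loss described above (the same idea behind MultipleNegativesRankingLoss). The function name is ours, and the scale of 20 mirrors a common default rather than a tuned value:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor,
                              scale: float = 20.0) -> torch.Tensor:
    """query_emb[i] and doc_emb[i] form a positive pair; every other document in the
    batch serves as an implicit negative for query i."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    scores = scale * (q @ d.T)              # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0))   # positives sit on the diagonal
    # Cross-entropy pushes diagonal (positive) scores up and off-diagonal (negative) scores down
    return F.cross_entropy(scores, labels)

# Toy check: a batch of 4 query/positive pairs with 384-dim embeddings
loss = in_batch_contrastive_loss(torch.randn(4, 384), torch.randn(4, 384))
print(loss.item())
```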
Technical Implementation
- “How would you mine hard negatives from a corpus?”
- Answer:
- Step 1: Embed entire corpus with base model
- Step 2: For each query, retrieve top-50 results
- Step 3: Use BM25 or base embeddings to rank
- Step 4: Select results ranked 10-30 as hard negatives (skip the top 10, which are likely truly relevant; ranks 10-30 are close enough to be confusing); see the mining sketch after this question list
- Alternative: Use cross-encoder to re-score and select high-score-but-not-relevant docs
- “What’s the relationship between batch size and training quality for contrastive learning?”
- Answer: Larger batch size = more in-batch negatives = richer training signal. If batch size is 32, each query trains against 31 negatives. Doubling to 64 gives 63 negatives. But memory is limited, so use gradient accumulation: small physical batch (16), accumulate gradients over 4 steps → effective batch size 64 for the gradient update. One caveat: plain gradient accumulation does not increase the number of in-batch negatives per query (those still come from the physical batch); caching-based losses such as CachedMultipleNegativesRankingLoss (GradCache) are needed to scale negatives beyond GPU memory.
- “How do you prevent catastrophic forgetting when fine-tuning?”
- Answer:
- Use low learning rate (2e-5 vs 1e-3 for training from scratch)
- Warm-up learning rate for first 10% of steps
- Fine-tune only top layers initially (freeze bottom 6 layers of BERT)
- Use mixed training data (domain-specific + general examples)
- Monitor performance on general benchmark (e.g., STS-B) to ensure general capabilities aren’t lost
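A sketch of the hard-negative mining recipe above, using the base bi-encoder to rank the corpus and keeping mid-ranked documents. `queries`, `corpus`, and `positives` are placeholders for your own data, and the 10-30 window is the heuristic from the answer:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def mine_hard_negatives(model: SentenceTransformer, queries: list[str], corpus: list[str],
                        positives: list[int], lo: int = 10, hi: int = 30) -> list[list[int]]:
    """For each query, rank the corpus with the base model and keep documents ranked
    lo..hi as hard negatives (skipping the top ranks, which are likely relevant).
    positives[i] is the corpus index of the known positive for queries[i]."""
    q_emb = model.encode(queries, normalize_embeddings=True)
    d_emb = model.encode(corpus, normalize_embeddings=True)
    hard_negatives = []
    for qi, pos_idx in enumerate(positives):
        ranking = np.argsort(-(d_emb @ q_emb[qi]))     # corpus indices, best to worst
        candidates = [int(i) for i in ranking[lo:hi] if i != pos_idx]
        hard_negatives.append(candidates)
    return hard_negatives

# Usage (with your own data):
# model = SentenceTransformer("all-MiniLM-L6-v2")
# hard_negs = mine_hard_negatives(model, queries, corpus, positives)
```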
Evaluation & Metrics
- “Explain Recall@k vs NDCG@k. When does each matter?”
- Answer:
- Recall@k: Binary—did we retrieve the relevant documents in top k? Good for: “Did we find the answer?” Use case: FAQ retrieval (just need the answer somewhere in top 10).
- NDCG@k: Rewards relevant documents appearing earlier. Good for: “Did we rank relevant documents highly?” Use case: Search engine (users only look at top 3 results).
- “Your fine-tuned model shows 90% Recall@10 on validation but 60% on production. What happened?”
- Answer: Overfitting to validation data. Possible causes:
- Validation set too similar to training set (data leakage)
- Validation queries don’t represent real user queries
- Model memorized training queries (too many epochs)
- Production data has distribution shift (new topics, different vocabulary)
- Fix: Create more diverse validation set, use early stopping, add regularization, collect production feedback
- “How would you debug embeddings that seem ‘broken’—everything is equally similar?”
- Answer:
- Check 1: Inspect embedding distribution. If all embeddings cluster in one region → model collapsed
- Check 2: Verify loss is decreasing. Flat loss → learning rate too low or bad data
- Check 3: Check for gradient flow. Frozen layers or vanishing gradients?
- Check 4: Visualize embeddings with t-SNE/UMAP. Should see clusters for different topics
- Check 5: Test the similarity distribution. Scores should span a wide range, e.g. roughly 0.1 (dissimilar) to 0.9 (similar), not cluster around a single value like 0.5 (see the debugging sketch after this list)
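A quick way to run Check 5 in practice: embed a handful of deliberately different texts and look at the spread of pairwise similarities. A collapsed model produces nearly identical scores for every pair. The texts below are toy examples:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def similarity_health_check(model_name: str, texts: list[str]) -> None:
    """Print the spread of pairwise cosine similarities; a collapsed model shows
    nearly constant scores instead of a wide range."""
    model = SentenceTransformer(model_name)
    emb = model.encode(texts, normalize_embeddings=True)
    sims = emb @ emb.T                                     # pairwise cosine similarities
    off_diag = sims[~np.eye(len(texts), dtype=bool)]       # drop the trivial self-similarities
    print(f"min={off_diag.min():.2f}  mean={off_diag.mean():.2f}  "
          f"max={off_diag.max():.2f}  std={off_diag.std():.2f}")

# A healthy model on mixed-topic texts should show a wide spread of scores.
similarity_health_check("all-MiniLM-L6-v2", [
    "python list comprehension",
    "java inheritance patterns",
    "statute of limitations personal injury",
    "gradient descent learning rate",
])
```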
Hints in Layers
Layer 1: Getting Started (If completely stuck)
- Use the Sentence Transformers training examples
- Start with a pre-built dataset like MS MARCO or Natural Questions
- Use `MultipleNegativesRankingLoss`, the easiest loss to implement
- Fine-tune `all-MiniLM-L6-v2` for just 1 epoch to check that the pipeline works
Layer 2: Data Challenges
- Problem: Don’t have labeled query-document pairs?
- Hint: Use GPT-4 to generate synthetic queries from documents: “Generate 3 search queries that would lead to this document” (a sketch follows at the end of this layer)
- Problem: How to mine hard negatives?
- Hint: Use BM25 (keyword search) to retrieve top 50, then treat results 11-30 as hard negatives
- Problem: Dataset is too large to fit in memory?
- Hint: Use the `datasets` library with streaming mode, or create a custom PyTorch `IterableDataset`
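A sketch of the synthetic-query hint from this layer, using the OpenAI Python client. The model name and prompt wording are placeholders; swap in whichever GPT-4-class model you use:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def synthetic_queries(document: str, n: int = 3) -> list[str]:
    """Ask an LLM for n search queries that this document would answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": (f"Generate {n} short search queries that would lead someone "
                        f"to this document, one per line:\n\n{document}"),
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("-*0123456789. ").strip() for line in lines if line.strip()]

# Each (synthetic query, source document) pair becomes a positive training example.
```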
Layer 3: Training Issues
- Problem: Loss not decreasing?
- Hint: Check learning rate (try 2e-5), verify labels are correct (pos should be more similar than neg), increase batch size
- Problem: Model overfitting?
- Hint: Use early stopping, add dropout (0.1), reduce epochs, increase training data diversity
- Problem: Out of memory errors?
- Hint: Reduce batch size to 8, enable gradient accumulation, use mixed precision training (`fp16=True`); see the sketch below
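If you are on sentence-transformers v3+, its Hugging Face-style training arguments expose these memory knobs directly. A sketch with a toy dataset and a made-up output directory:

```python
from datasets import Dataset
from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Toy (anchor, positive) dataset; in practice this holds your mined pairs
train_dataset = Dataset.from_dict({
    "anchor": ["python list comprehension"],
    "positive": ["List comprehensions provide a concise way to build lists in Python."],
})

model = SentenceTransformer("all-MiniLM-L6-v2")

args = SentenceTransformerTrainingArguments(
    output_dir="finetuned-minilm",
    per_device_train_batch_size=8,    # small physical batch to fit in memory
    gradient_accumulation_steps=4,    # effective batch size 32 for the gradient update
    fp16=True,                        # mixed precision: roughly half the memory
    learning_rate=2e-5,
    warmup_ratio=0.1,
    num_train_epochs=1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```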
Layer 4: Evaluation Pitfalls
- Problem: High training accuracy but low validation accuracy?
- Hint: Validation set might have different distribution. Check query types, add more diverse training data
- Problem: Don’t know if improvement is good enough?
- Hint: Compare against BM25 baseline (keyword search) and base model. Aim for +10% Recall@10 over base model
Layer 5: Advanced Optimizations
- Hint: Add a role prefix during training: “query: [query text]” and “passage: [doc text]” to help the model distinguish roles (a tiny illustration follows this list)
- Hint: Use mixed precision training (`fp16=True`) to roughly halve memory usage and speed up training
- Hint: Implement curriculum learning: train on easy negatives first, then progressively harder negatives
- Hint: Distill from a larger cross-encoder into your bi-encoder for quality boost
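The query/passage prefix hint is applied at data-preparation time. A tiny illustration; the “query:” / “passage:” strings follow a common convention used by several embedding model families, and the function name is ours:

```python
def with_role_prefixes(query: str, passage: str) -> tuple[str, str]:
    """Prepend role markers so the model can learn to treat queries and passages
    differently. Use the same prefixes at inference time as during training."""
    return f"query: {query}", f"passage: {passage}"

q, p = with_role_prefixes(
    "python list comprehension",
    "List comprehensions provide a concise way to build lists in Python.",
)
```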
Books That Will Help
Primary Resources
- “AI Engineering” by Chip Huyen (2024)
- Chapter 5: Fine-tuning techniques and contrastive learning
- Chapter 6: RAG systems and embedding evaluation
- Why it’s essential: Practical guidance on production embedding systems
- “Natural Language Processing with Transformers” by Tunstall, von Werra & Wolf (2022)
- Chapter 5: Text classification and embeddings
- Chapter 9: Fine-tuning for production
- Why it’s essential: Deep dive into transformer fine-tuning with Hugging Face
- “Introduction to Information Retrieval” by Manning, Raghavan & Schütze (2008)
- Chapter 6: Scoring, term weighting, and the vector space model
- Chapter 8: Evaluation in information retrieval
- Why it’s essential: Foundational understanding of retrieval metrics (Recall, NDCG, MRR)
Supplementary Resources
- “Deep Learning” by Goodfellow, Bengio & Courville (2016)
- Chapter 20.10: Metric learning and similarity learning
- Why it helps: Theoretical foundation for contrastive learning
- “Dive into Deep Learning” by Zhang, Lipton, Li & Smola (2023)
- Chapter 14.7: Similarity and analogy tasks
- Interactive notebooks: Hands-on embedding training examples
- Why it helps: Practical code examples and visualizations
Online Resources
- Sentence Transformers Documentation
- Training Overview
- Loss Functions
- Why it’s essential: Authoritative reference for embedding fine-tuning
- Hugging Face Blog
- “Training and Finetuning Embedding Models with Sentence Transformers”
- Why it’s essential: Step-by-step tutorial with code
Research Papers (Optional but Illuminating)
- “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks” - Reimers & Gurevych (2019)
- arXiv:1908.10084
- Why it matters: Original paper explaining bi-encoder architecture
- “Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling” - Hofstätter et al. (2021)
- arXiv:2104.06967
- Why it matters: Advanced hard negative mining strategies
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| Mini-Transformer | Intermediate | 2-3 weeks | ⭐⭐⭐⭐⭐ (deepest LLM understanding) | ⭐⭐⭐⭐ (magical when it generates text) |
| Vector DB Engine | Intermediate-Advanced | 2-3 weeks | ⭐⭐⭐⭐⭐ (fundamental to all retrieval) | ⭐⭐⭐ (algorithmic satisfaction) |
| RAG System | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ (practical RAG mastery) | ⭐⭐⭐⭐⭐ (immediately useful) |
| Two-Stage Reranking | Intermediate | 1-2 weeks | ⭐⭐⭐⭐ (search architecture) | ⭐⭐⭐⭐ (measurable improvements) |
| Fine-Tune Embeddings | Advanced | 2 weeks | ⭐⭐⭐⭐⭐ (embedding mastery) | ⭐⭐⭐ (requires patience) |
Recommended Learning Path
Based on your learning goals, here’s the optimal path:
Start with: Project 3 (RAG System)
Why: It gives you the fastest path to a working end-to-end system while touching all four areas. You’ll immediately see how LLMs, embeddings, retrieval, and generation fit together. This creates the mental framework for diving deeper.
Then: Project 2 (Vector DB Engine)
Why: After building RAG, you’ll have questions about why retrieval sometimes fails. Building a vector DB from scratch answers those questions and gives you intuition about similarity search.
Then: Project 4 (Two-Stage Reranking)
Why: You’ll see your RAG system’s limitations — retrieval isn’t always precise. Adding reranking teaches you the two-stage architecture used in production.
Finally: Project 1 (Mini-Transformer)
Why: Now you’ll appreciate why building this matters. After working with LLMs as black boxes, implementing one from scratch is enlightening.
Final Overall Project: Production-Grade AI Research Assistant
After completing the projects above, combine everything into one comprehensive system:
What you’ll build: A fully functional research assistant that can ingest papers/documents, answer questions with citations, maintain conversation context, and continuously improve its retrieval quality — all running locally or on your infrastructure.
Why this is the capstone: This forces you to integrate every component you’ve learned: your understanding of transformers powers your prompt engineering, your vector DB knowledge informs your indexing strategy, your RAG expertise structures your retrieval pipeline, and your reranking skills ensure precision.
Core challenges you’ll face:
- Multi-modal document processing (PDFs with figures, tables, code) (maps to advanced chunking)
- Hierarchical retrieval (section → paragraph → sentence) (maps to multi-stage retrieval architecture)
- Conversation memory with proper context window management (maps to understanding LLM limitations)
- Self-improving retrieval through user feedback signals (maps to production ML systems)
- Streaming responses with progressive retrieval (maps to real-world UX requirements)
Key Concepts:
- Agentic RAG: “2025’s Ultimate Guide to RAG Retrieval” - Mehul Pratap Singh
- Advanced RAG Patterns: “AI Engineering” Chapters 6-8 - Chip Huyen
- System Design: “Designing Data-Intensive Applications” Chapters 1-3 - Martin Kleppmann
- Production ML: “Fundamentals of Software Architecture” - Richards & Ford (for system design patterns)
Difficulty: Advanced
Time estimate: 1-2 months
Prerequisites: All 5 projects above completed
Real world outcome:
- A web UI where you drop PDFs and ask questions across your entire knowledge base
- Answers come with highlighted citations that link back to source documents
- A feedback mechanism: 👍/👎 that triggers fine-tuning of your embedding model
- Metrics dashboard showing retrieval quality, response latency, and user satisfaction
- Deploy it to a cloud instance and use it daily for your own research
Learning milestones:
- After MVP: You’ll have a working assistant that answers questions over your docs
- After adding reranking: You’ll see measurable quality improvements in precision
- After adding feedback loop: You’ll understand how production AI systems improve over time
- After deploying: You’ll understand the full MLOps lifecycle for AI applications
Real World Outcome
By the end of this project, you’ll have a fully functional, production-grade AI research assistant deployed and accessible via web interface. Here’s what that looks like:
The Web Application
Landing Page: A clean, modern interface built with React/Next.js or Streamlit that includes:
- Document upload area supporting drag-and-drop for PDFs, markdown, DOCX, and URLs
- A prominent search bar with auto-complete suggestions based on previously indexed content
- Dashboard showing: number of documents indexed, total chunks, embedding model version, average query latency
Chat Interface:
```
User: “What are the key findings about transformer efficiency in the 2023 papers?”

[System shows “Retrieving… 3 sources found” with a progress indicator]
```
Essential Resources Summary
| Resource | What It Teaches |
|---|---|
| Sebastian Raschka’s “Build an LLM from Scratch” | Transformer implementation, training, fine-tuning |
| Chip Huyen’s “AI Engineering” | Production RAG, evaluation, system design |
| Sentence Transformers docs | Embeddings, reranking, fine-tuning |
| Pinecone Learning Center | Vector DBs, indexing, RAG patterns |
| Andrej Karpathy’s YouTube | Neural networks from scratch, GPT implementation |
Sources
- 2025’s Ultimate Guide to RAG Retrieval - Medium
- Enhancing RAG: A Study of Best Practices - arXiv
- Vector Database Comparison 2025 - Aloa
- Rerankers and Two-Stage Retrieval - Pinecone
- Training Reranker Models - Hugging Face
- LLMs from Scratch - GitHub
- Transformer Explainer - Interactive
- Retrieve & Re-Rank - Sentence Transformers