Project 28: “The Semantic Search Engine” — Knowledge Management

File: KIRO_CLI_LEARNING_PROJECTS.md
Main Programming Language: Python (RAG)
Coolness Level: Level 4: Hardcore Tech Flex
Difficulty: Level 3: Advanced
Knowledge Area: Knowledge Management

What you’ll build: Enable /knowledge and ingest a folder of PDFs for semantic Q&A.

Why it teaches Retrieval: You learn how to answer questions over data that is far larger than the model’s context window.

Success criteria:

  • An answer is grounded in retrieved chunks.

Real World Outcome

You will have a Kiro CLI extension that ingests PDF documents and enables semantic question-answering that goes beyond the context window limit. When you run it, you’ll see:

Ingestion Phase:

$ kiro "/knowledge ingest ~/Documents/research_papers/"

📚 Semantic Search Engine - Knowledge Ingestion
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Processing PDFs...
├─ attention_is_all_you_need.pdf (8 pages) ✓
│  └─ Extracted 47 chunks (avg: 512 tokens/chunk)
├─ bert_pretraining.pdf (16 pages) ✓
│  └─ Extracted 89 chunks (avg: 498 tokens/chunk)
└─ gpt3_language_models.pdf (75 pages) ✓
   └─ Extracted 412 chunks (avg: 505 tokens/chunk)

Generating embeddings... [████████████████████] 548/548 chunks

Building vector index (FAISS)...
├─ Index type: IVF256,Flat
├─ Dimensions: 1536 (text-embedding-3-small)
└─ Total vectors: 548

💾 Saved to: ~/.kiro/knowledge/research_papers.faiss
✓ Knowledge base ready: research_papers (548 chunks, 274k tokens)

Query Phase:

$ kiro "/knowledge query research_papers 'What is the self-attention mechanism in transformers?'"

🔍 Semantic Search Results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Query: "What is the self-attention mechanism in transformers?"

Top 5 Retrieved Chunks (by cosine similarity):

1. attention_is_all_you_need.pdf (page 3, score: 0.94)
   "Self-attention, sometimes called intra-attention, is a mechanism
    relating different positions of a single sequence to compute a
    representation of the sequence. The attention function maps a
    query and set of key-value pairs to an output..."

2. attention_is_all_you_need.pdf (page 4, score: 0.89)
   "Scaled Dot-Product Attention: We compute attention as
    Attention(Q,K,V) = softmax(QK^T / sqrt(d_k))V where Q, K, V
    are the queries, keys, and values matrices..."

3. bert_pretraining.pdf (page 7, score: 0.82)
   "BERT uses bidirectional self-attention, allowing each token to
    attend to all tokens in both directions. This differs from GPT's
    causal (left-to-right) attention masking..."

───────────────────────────────────────────────────────────────

📝 Generated Answer (grounded in retrieved context):

Self-attention is a mechanism that relates different positions within
a single sequence to compute its representation. In the Transformer
architecture, it works by:

1. Computing Query (Q), Key (K), Value (V) matrices from input
2. Calculating attention scores: softmax(QK^T / sqrt(d_k))
3. Using scores to weight the Value vectors

The scaling factor sqrt(d_k) prevents dot products from growing too
large. BERT extends this with bidirectional attention, while GPT uses
causal masking for autoregressive generation.

📚 Sources: attention_is_all_you_need.pdf (p3-4), bert_pretraining.pdf (p7)

Usage in Conversation:

$ kiro "Based on my research papers, explain how to implement a custom attention layer"

[Kiro automatically retrieves relevant chunks from the knowledge base]

I found 3 relevant sections from your research papers knowledge base:
- attention_is_all_you_need.pdf discusses scaled dot-product attention
- bert_pretraining.pdf covers multi-head attention implementation
- efficient_transformers.pdf shows optimization techniques

Here's how to implement a custom attention layer...
[Answer grounded in retrieved context]

You’re seeing exactly what modern RAG (Retrieval-Augmented Generation) systems do - breaking the context window limitation by retrieving only the relevant information on demand!

The Core Question You’re Answering

“How do you give an LLM access to knowledge beyond its context window without fine-tuning?”

Before you write any code, sit with this question. Most developers think context windows solve everything (“just throw it all in!”), but:

  • GPT-4 Turbo: 128k tokens ≈ 96,000 words ≈ 200 pages
  • Your company’s documentation: 10,000 pages
  • Every research paper ever written: billions of pages

Even with 200k token windows, you can’t fit everything. RAG (Retrieval-Augmented Generation) solves this by:

  1. Converting text to semantic vectors (embeddings)
  2. Storing vectors in a searchable index
  3. Retrieving only relevant chunks for each query
  4. Grounding LLM responses in retrieved context

This is how ChatGPT’s “Browse with Bing” works, how GitHub Copilot uses your codebase, and how enterprise AI assistants access internal docs.
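
To make the four steps concrete before any of the detailed hints, here is a toy end-to-end sketch. It assumes the sentence-transformers package for local embeddings and plain numpy in place of a real vector index; the hints later in this project use OpenAI embeddings and FAISS instead.

# Toy sketch of the four RAG steps, assuming sentence-transformers is installed
# (pip install sentence-transformers); model choice and chunks are illustrative
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Step 1: convert text chunks to semantic vectors (embeddings)
chunks = [
    "Self-attention relates positions of a sequence to compute its representation.",
    "BERT is pretrained with masked language modeling and next sentence prediction.",
    "FAISS is a library for efficient similarity search over dense vectors.",
]
vectors = model.encode(chunks, normalize_embeddings=True)  # shape (3, 384), unit length

# Step 2: store the vectors in a searchable index (here: just a numpy matrix)
index = np.array(vectors)

# Step 3: retrieve only the chunk most relevant to the query
query = "What is self-attention?"
query_vec = model.encode([query], normalize_embeddings=True)[0]
scores = index @ query_vec      # cosine similarity, since all vectors are unit length
best = int(np.argmax(scores))

# Step 4: ground the LLM prompt in the retrieved context
prompt = f"Context:\n{chunks[best]}\n\nQuestion: {query}\nAnswer:"
print(scores)
print(prompt)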

Concepts You Must Understand First

Stop and research these before coding:

  1. Vector Embeddings
    • What is an embedding? (numeric representation of semantic meaning)
    • Why does cosine similarity measure semantic relatedness? (a short sketch follows this list)
    • How does text-embedding-3-small differ from text-embedding-ada-002?
    • Book Reference: “Speech and Language Processing” Ch. 6 (Vector Semantics) - Jurafsky & Martin
  2. Chunking Strategies
    • Why chunk documents instead of embedding entire PDFs?
    • What’s the trade-off between chunk size (128 vs 512 vs 2048 tokens)?
    • How does chunk overlap prevent context loss at boundaries?
    • Book Reference: “Introduction to Information Retrieval” Ch. 2 (Indexing) - Manning, Raghavan, Schütze
  3. Vector Databases (FAISS, Pinecone, Weaviate)
    • What is Approximate Nearest Neighbor (ANN) search?
    • Why is exhaustive search O(n) too slow for millions of vectors?
    • How does FAISS’s IVF (Inverted File Index) work?
    • Blog Reference: “FAISS: A Library for Efficient Similarity Search” - Facebook AI Research
  4. Retrieval Algorithms
    • Dense retrieval (embeddings) vs sparse retrieval (BM25/TF-IDF)
    • What is hybrid search? (combining dense + sparse)
    • How does reranking improve top-k results?
    • Paper Reference: “Dense Passage Retrieval for Open-Domain Question Answering” - Karpukhin et al., 2020
  5. PDF Parsing
    • How does PyPDF2/pdfplumber extract text from PDFs?
    • What breaks with scanned PDFs (OCR needed)?
    • How do you handle tables, images, and multi-column layouts?
    • Docs Reference: pdfplumber documentation
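
The sketch promised under concept 1: cosine similarity depends only on direction, not magnitude, which is why it tracks semantic relatedness better than raw Euclidean distance. The vectors below are made up purely for illustration.

# Cosine similarity vs Euclidean (L2) distance on made-up vectors
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.4, 0.1])
b = 10 * a                       # same direction ("same meaning"), 10x the magnitude
c = np.array([0.4, -0.2, 0.9])   # a different direction

print(cosine(a, b))              # 1.0  -> same direction, scale does not matter
print(cosine(a, c))              # ~0.2 -> weakly related direction
print(np.linalg.norm(a - b))     # ~4.1 -> large L2 distance despite identical direction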

Questions to Guide Your Design

Before implementing, think through these:

  1. Chunking Strategy
    • Fixed-size chunks (512 tokens) or semantic chunks (paragraph boundaries)?
    • Should chunks overlap? If so, by how much (50 tokens? 25%)?
    • How will you handle code blocks, tables, and lists (semantic units)?
  2. Embedding Model Selection
    • OpenAI text-embedding-3-small (1536 dims, $0.02/1M tokens)?
    • Sentence-BERT (384 dims, free, runs locally)?
    • How will you handle the latency vs cost trade-off?
  3. Vector Index Design
    • FAISS Flat (exact search, slow for >100k vectors)?
    • FAISS IVF (approximate, 10x faster, 95% recall)?
    • Do you need GPU acceleration (faiss-gpu)?
  4. Retrieval Strategy
    • Top-k retrieval (how many chunks? 3? 5? 10?)?
    • Score threshold (min cosine similarity to include)?
    • How will you format retrieved chunks in the prompt?
  5. Metadata & Filtering
    • Should you store page numbers, document titles, timestamps?
    • Do you need to filter by document type or date range?
    • How will you cite sources in the generated answer?

Thinking Exercise

Trace Retrieval Flow

Before coding, manually trace this RAG pipeline:

Given:

  • Knowledge base: 3 PDFs (Attention Is All You Need, BERT, GPT-3)
  • Query: “How does GPT-3 differ from BERT in pretraining?”

Trace each step:

  1. Query Embedding
    • Input: “How does GPT-3 differ from BERT in pretraining?”
    • Output: 1536-dimensional vector (e.g., [0.023, -0.145, 0.891, …])
    • Question: Why embed the query with the same model as the chunks?
  2. Vector Search (FAISS)
    • Compute cosine similarity between query vector and all 548 chunk vectors
    • Sort by similarity score (1.0 = identical, 0.0 = orthogonal)
    • Return top 5 chunks
    • Question: Why cosine similarity instead of Euclidean distance?
  3. Retrieved Chunks (hypothetical)
    Chunk 1 (gpt3_language_models.pdf, page 12, score: 0.91)
    "GPT-3 uses autoregressive language modeling, predicting the next
     token given all previous tokens. Unlike BERT's masked language
     modeling, GPT-3 is trained left-to-right..."
    
    Chunk 2 (bert_pretraining.pdf, page 3, score: 0.88)
    "BERT is pretrained with two objectives: (1) Masked Language Model
     (MLM) where 15% of tokens are masked, and (2) Next Sentence
     Prediction (NSP)..."
    
    • Question: Why did these chunks score higher than others?
  4. Prompt Construction
    System: You are an AI assistant. Answer based on the context below.
    
    Context:
    [Chunk 1 content]
    [Chunk 2 content]
    ...
    
    User: How does GPT-3 differ from BERT in pretraining?
    
    Answer:
    
    • Question: What if the retrieved chunks don’t answer the question?
  5. Generated Answer
    • LLM reads retrieved context + query
    • Generates grounded answer citing sources
    • Question: How do you detect hallucination (info NOT in retrieved chunks)?

Questions while tracing:

  • What if no chunks have similarity > 0.5? (query outside knowledge base)
  • What if 10 chunks all have similarity > 0.9? (do you use all? truncate?)
  • What if the PDF has OCR or hyphenation errors (e.g., “pretraining” extracted as “pre-training” or “pretrainng”)?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain the difference between RAG (Retrieval-Augmented Generation) and fine-tuning. When would you use each?”

  2. “Your vector search is returning irrelevant chunks for 20% of queries. How would you debug and fix this?”

  3. “You have 1 million PDF pages to index. Embedding them with OpenAI costs $200. How would you reduce this cost?”

  4. “A user asks ‘What’s the latest update?’ but your knowledge base is from 6 months ago. How does your system handle this gracefully?”

  5. “Walk me through the math of cosine similarity. Why is it better than Euclidean distance for text embeddings?”

  6. “You’re getting complaints that answers are slow (10 seconds). Where are the bottlenecks and how do you optimize?”

Hints in Layers

Hint 1: Start with PDF Ingestion. Don’t jump straight to embeddings. First, prove you can extract clean text from a single PDF. Use pdfplumber (better than PyPDF2 for tables). Test with a research paper PDF and verify paragraph boundaries are preserved.
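
A minimal extraction sketch, assuming pdfplumber is installed (the file name is illustrative); keep page numbers so chunks can cite their sources later:

# Extract page-level text with pdfplumber, preserving page numbers for citations
import pdfplumber

pages = []
with pdfplumber.open('attention_is_all_you_need.pdf') as pdf:
    for page_num, page in enumerate(pdf.pages, start=1):
        text = page.extract_text() or ""   # extract_text() returns None for empty/scanned pages
        pages.append({'page': page_num, 'text': text})

print(pages[0]['text'][:500])              # eyeball: are paragraph boundaries preserved?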

Hint 2: Implement Chunking. Split the extracted text into 512-token chunks with 50-token overlap. Use tiktoken (OpenAI’s tokenizer) to count tokens accurately. Store chunks with metadata:

chunk = {
    'text': "Self-attention is a mechanism...",
    'source': 'attention_is_all_you_need.pdf',
    'page': 3,
    'chunk_id': 'doc1_chunk_047',
    'token_count': 498
}
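
A sketch of a fixed-size chunker with overlap, assuming tiktoken is installed (the function name is illustrative):

# Split one page of text into ~512-token chunks with a 50-token overlap
import tiktoken

def chunk_page(text, source, page, chunk_size=512, overlap=50):
    enc = tiktoken.get_encoding('cl100k_base')  # tokenizer used by OpenAI's embedding models
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        window = tokens[start:start + chunk_size]
        chunks.append({
            'text': enc.decode(window),
            'source': source,
            'page': page,
            'chunk_id': f'{source}_p{page}_c{len(chunks):03d}',
            'token_count': len(window),
        })
        start += chunk_size - overlap           # slide forward, keeping 50 tokens of shared context
    return chunks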

Hint 3: Generate Embeddings. Call OpenAI’s embedding API for each chunk. Batch requests (up to 2048 chunks/request) to reduce latency:

import openai  # assumes OPENAI_API_KEY is set in the environment

# One request can carry up to 2048 inputs
response = openai.embeddings.create(
    model="text-embedding-3-small",
    input=[chunk['text'] for chunk in chunks[:2048]]
)
embeddings = [data.embedding for data in response.data]

Each embedding is a 1536-dimensional float array.
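
The call above only embeds the first 2048 chunks; a simple batched loop (a sketch under the same assumptions) covers the whole collection:

# Embed every chunk, up to 2048 inputs per request
BATCH = 2048
embeddings = []
for i in range(0, len(chunks), BATCH):
    batch = chunks[i:i + BATCH]
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=[chunk['text'] for chunk in batch],
    )
    embeddings.extend(data.embedding for data in response.data)

assert len(embeddings) == len(chunks)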

Hint 4: Build FAISS Index. Create a Flat index for exact search (start simple before optimizing):

import faiss
import numpy as np

dimension = 1536
embeddings_matrix = np.array(embeddings).astype('float32')

index = faiss.IndexFlatL2(dimension)  # L2 distance (convert to cosine later)
index.add(embeddings_matrix)  # Add all vectors

faiss.write_index(index, 'knowledge.faiss')  # Save to disk

Hint 5: Query & Retrieve. For a user query, embed it and search the index:

query_embedding = openai.embeddings.create(
    model="text-embedding-3-small",
    input="What is self-attention?"
).data[0].embedding

query_vector = np.array([query_embedding]).astype('float32')
k = 5  # Top 5 results
distances, indices = index.search(query_vector, k)

# Retrieve original chunks
retrieved_chunks = [chunks[i] for i in indices[0]]

Hint 6: Construct RAG Prompt. Format retrieved chunks into a prompt:

context = "\n\n".join([
    f"Source: {chunk['source']} (page {chunk['page']})\n{chunk['text']}"
    for chunk in retrieved_chunks
])

prompt = f"""Answer based on the following context:

{context}

Question: {user_query}

Answer:"""

Hint 7: Debugging Tools. When results are bad, inspect:

  • Chunk quality: Are chunks semantically coherent? (print first 10)
  • Embedding distribution: Are vectors normalized? (check norms)
  • Similarity scores: What are the top-k scores? (should be > 0.6 for good matches)
  • Retrieved text: Does it actually answer the query? (manual review)
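
A quick inspection sketch for those checks, reusing embeddings_matrix, index, query_vector, and chunks from the earlier hints:

# Check embedding norms and eyeball the top-k retrieved chunks
import numpy as np

norms = np.linalg.norm(embeddings_matrix, axis=1)
print("norm range:", norms.min(), norms.max())  # all ~1.0 if vectors are normalized

distances, indices = index.search(query_vector, 5)
for rank, (score, idx) in enumerate(zip(distances[0], indices[0]), start=1):
    # IndexFlatL2 returns squared L2 distances (lower is better);
    # a normalized IndexFlatIP returns cosine similarities (higher is better)
    print(f"#{rank} score={score:.3f} source={chunks[idx]['source']} page={chunks[idx]['page']}")
    print("   ", chunks[idx]['text'][:120], "...")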

Hint 8: Optimization (Once It Works)

  • Switch to FAISS IVF for >10k chunks (10x faster, slight recall loss)
  • Cache embeddings (don’t re-embed the same query)
  • Use sentence-transformers for local embedding (no API costs)
  • Implement hybrid search (dense + BM25 sparse retrieval)
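
A sketch of the first optimization, reusing embeddings_matrix from Hint 4; nlist=256 matches the IVF256,Flat index shown in the example output:

# Approximate nearest-neighbor search: cluster the vectors, probe a few clusters per query
import faiss

dimension = 1536
nlist = 256                               # number of inverted-list clusters
quantizer = faiss.IndexFlatL2(dimension)  # assigns vectors to their nearest cluster
ivf_index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

ivf_index.train(embeddings_matrix)        # IVF indexes must be trained before adding vectors
ivf_index.add(embeddings_matrix)
ivf_index.nprobe = 16                     # clusters probed per query: the speed/recall knob

faiss.write_index(ivf_index, 'knowledge_ivf.faiss')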

Books That Will Help

  • Vector Semantics & Embeddings: “Speech and Language Processing” by Jurafsky & Martin, Ch. 6
  • Information Retrieval Fundamentals: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze, Ch. 1-2
  • Nearest Neighbor Search: “Foundations of Data Science” by Blum, Hopcroft, Kannan, Ch. 2 (High-Dimensional Space)
  • Transformer Attention (context for RAG): “Deep Learning” by Goodfellow, Bengio, Courville, Ch. 10 (Sequence Modeling)
  • PDF Parsing & Text Extraction: “Mining the Web” by Soumen Chakrabarti, Ch. 3 (Crawling & Extraction)

Common Pitfalls & Debugging

Problem 1: “Embeddings return nonsense - unrelated chunks rank highest”

  • Why: You’re using Euclidean distance (L2) instead of cosine similarity. L2 is affected by vector magnitude; cosine only cares about direction.
  • Fix: Use IndexFlatIP (inner product) with normalized vectors, or convert L2 distances to cosine.
  • Quick test: faiss.normalize_L2(embeddings_matrix) before adding to index. Verify with np.linalg.norm(embeddings_matrix[0]) ≈ 1.0.
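
A sketch of that fix, reusing dimension, embeddings_matrix, and query_vector from the hints:

# Normalize vectors so that inner product equals cosine similarity
import faiss

faiss.normalize_L2(embeddings_matrix)        # in-place: every row becomes a unit vector
cosine_index = faiss.IndexFlatIP(dimension)  # inner product over unit vectors == cosine
cosine_index.add(embeddings_matrix)

faiss.normalize_L2(query_vector)             # normalize the query the same way
scores, indices = cosine_index.search(query_vector, 5)  # scores now fall in [-1, 1]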

Problem 2: “PDF extraction is garbled - formulas and tables break”

  • Why: PyPDF2 doesn’t handle complex layouts. Scanned PDFs need OCR.
  • Fix: Use pdfplumber for tables, pytesseract for scanned PDFs, unstructured library for mixed content.
  • Quick test: pdfplumber.open('paper.pdf').pages[0].extract_text() - inspect visually for garbling.

Problem 3: “Query returns 0 results with similarity > 0.5”

  • Why: Query is outside the knowledge base domain, or embedding model mismatch (query embedded with different model than chunks).
  • Fix: Fallback to “no relevant information found” response. Check embedding model consistency.
  • Quick test: Embed a chunk’s text as a query - should return that chunk with similarity ≈ 1.0.
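
A sketch of that round-trip test, reusing chunks and index from the hints:

# A chunk's own text, used as the query, should retrieve itself as the top hit
import numpy as np
import openai

probe = openai.embeddings.create(
    model="text-embedding-3-small",
    input=chunks[0]['text'],
).data[0].embedding

probe_vec = np.array([probe]).astype('float32')
distances, indices = index.search(probe_vec, 1)
print(indices[0][0])    # expect 0: the chunk retrieves itself
print(distances[0][0])  # expect ~0.0 L2 distance (or ~1.0 cosine with an IP index)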

Problem 4: “Indexing 100k chunks takes 30 minutes”

  • Why: Calling OpenAI API for each chunk individually (network latency dominates).
  • Fix: Batch requests (up to 2048 chunks per API call). Use asyncio for parallelism.
  • Quick test: Time 1 chunk vs 100 chunks batched - batching should be 10-50x faster.

Problem 5: “Answers hallucinate facts not in retrieved chunks”

  • Why: LLM ignores context and uses pretrained knowledge. Prompt doesn’t enforce grounding.
  • Fix: Add to prompt: “Answer ONLY using the context above. If the answer isn’t in the context, say ‘I don’t have enough information.’”
  • Quick test: Query something NOT in the knowledge base - LLM should refuse to answer.

Definition of Done

  • PDF ingestion works: Extract text from 3+ PDFs with different layouts (text-heavy, tables, diagrams)
  • Chunking is semantic: Verify chunks split on paragraph boundaries, not mid-sentence
  • Embeddings are generated: 500+ chunks embedded successfully, stored with metadata
  • FAISS index builds: Index file saved to disk, loads correctly on restart
  • Query retrieval works: Top-5 chunks for a test query include expected results
  • Similarity scores make sense: Relevant chunks score > 0.7, irrelevant < 0.5
  • Answers are grounded: Generated responses cite sources (page numbers, document names)
  • Edge cases handled: Empty PDFs, malformed PDFs, queries outside knowledge base domain
  • Performance is acceptable: Query latency < 2 seconds (embedding + search + generation)
  • Code is documented: README explains ingestion, querying, and adding new documents