### Project 12: “The Documentation Librarian” — RAG
| Attribute | Value |
|---|---|
| File | KIRO_CLI_LEARNING_PROJECTS.md |
| Main Programming Language | Python (mcp-server-rag) |
| Coolness Level | Level 3: Genuinely Clever |
| Difficulty | Level 2: Intermediate |
| Knowledge Area | RAG (Retrieval Augmented Generation) |
**What you’ll build:** A custom MCP server that indexes PDFs or internal docs for question answering.
**Why it teaches RAG:** You extend Kiro with private knowledge beyond its training data.
**Success criteria:**
- Kiro answers a question that is only in your private docs.
#### Real World Outcome
You’ll build a custom MCP server that indexes your internal documentation (PDFs, Markdown, Confluence pages) using embeddings, then lets Kiro search and answer questions from this private knowledge base—extending Kiro’s knowledge beyond its training data.
**Example: Indexing Internal Documentation**
```bash
$ python docs-mcp-server.py index ./company-docs/
Indexing documentation...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📄 engineering-handbook.pdf (142 pages) ✓ Embedded
📄 api-reference.md (2,340 lines) ✓ Embedded
📄 deployment-guide.pdf (68 pages) ✓ Embedded
📄 security-policies.md (890 lines) ✓ Embedded
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total chunks: 1,847
Vector DB: ./company-docs/.vectordb/
Ready for queries ✓
```
Configuration (.kiro/settings/mcp.json):
```json
{
  "mcpServers": {
    "docs": {
      "command": "python3",
      "args": ["./docs-mcp-server.py", "serve"],
      "env": {
        "VECTOR_DB_PATH": "./company-docs/.vectordb",
        "EMBEDDING_MODEL": "text-embedding-3-small"
      }
    }
  }
}
```
When you ask Kiro questions from internal docs:
$ kiro chat
You: "What's our process for deploying to production?"
Kiro: [Tool: docs/search_knowledge_base]
Query: "production deployment process"
Top 3 results:
[Tool: docs/retrieve_context]
Based on the Engineering Handbook (Section 7.2) and Deployment Guide (Chapter 3):
**Production Deployment Process:**
1. **Pre-Deployment Checklist** (Eng Handbook p.67)
- [ ] PR approved by 2+ engineers
- [ ] All tests passing in CI
- [ ] Database migration tested in staging
- [ ] Feature flags configured
2. **Deployment Window** (Deployment Guide p.24)
- Scheduled: Tuesday/Thursday 10 AM PST
- Emergency: Requires VP Engineering approval
- Avoid: Friday afternoon, holiday weeks
3. **Rollout Strategy** (Deployment Guide p.31)
- Blue-green deployment with AWS ECS
- 10% traffic → 50% → 100% over 2 hours
- Automatic rollback if error rate >0.1%
4. **Post-Deployment** (Eng Handbook p.71)
- Monitor DataDog for 30 minutes
- Check Sentry for new error spikes
- Notify #production-deploys Slack channel
**Emergency Rollback:**
```bash
./scripts/rollback.sh --environment production --revision previous
```
Sources:
- Engineering Handbook (2024-Q4), Section 7.2
- Deployment Guide v3.1, Chapters 3-4
You: “What are the approved third-party libraries for authentication?”
Kiro: [Tool: docs/search_knowledge_base] Query: “approved authentication libraries security policy”
According to Security Policies (Section 4.5 “Approved Dependencies”):
Authentication Libraries:
✅ Approved:
- `passport` v0.7+ (Node.js) - OAuth, JWT, SAML
- `authlib` v1.2+ (Python) - OAuth 2.0, OpenID Connect
- `spring-security` v6.1+ (Java) - Enterprise SSO
❌ Prohibited:
- Custom-built JWT libraries (use approved only)
- Deprecated: `oauth2-server` (EOL 2023)
- Unvetted: Any library not on the approved list
Approval Process: (Security Policies p.89)
- Submit request to #security-review
- Security team evaluates: License, maintenance, CVE history
- Approval SLA: 5 business days
Current Exception: (Added 2024-12-01)
`next-auth` v5.0+ approved for internal tools only (not customer-facing)
Source: Security Policies v2.3, Section 4.5
**The power:** Kiro accessed information that exists ONLY in your internal documentation, not in its training data. It cited exact page numbers and versions.
---
#### The Core Question You're Answering
> "How can I give an AI access to my company's internal knowledge base without uploading sensitive documents to third-party services?"
Traditional solutions have problems:
- **Shared drives:** Documents exist but aren't searchable conversationally
- **Confluence:** Requires manual searching and reading
- **Uploading to ChatGPT:** Violates data privacy policies
This project solves it: **Build a local RAG system that keeps your data on-premises while enabling AI-powered search**.
---
#### Concepts You Must Understand First
**Stop and research these before coding:**
1. **Embeddings and Vector Similarity**
- What is an embedding (vector representation of text)?
- How do you measure similarity between vectors (cosine similarity, dot product)? See the sketch after this list.
- Why are embeddings better than keyword search for semantic matching?
- *Book Reference:* "Speech and Language Processing" by Jurafsky & Martin - Ch. 6
2. **Vector Databases**
- What's the difference between traditional databases and vector databases?
- How do vector indexes work (HNSW, IVF)?
- When do you use in-memory (FAISS) vs persistent (Chroma, Pinecone)?
- *Web Reference:* [Pinecone - What is a Vector Database?](https://www.pinecone.io/learn/vector-database/)
3. **Retrieval Augmented Generation (RAG)**
- What's the difference between RAG and fine-tuning?
- How do you chunk documents for optimal retrieval (size, overlap)?
- What's the tradeoff between context window size and retrieval accuracy?
- *Web Reference:* [LangChain RAG Documentation](https://python.langchain.com/docs/use_cases/question_answering/)
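To make concept 1 concrete, here is a minimal cosine-similarity sketch using `numpy`. The vectors are toy values; real embeddings come from an embedding model and have hundreds of dimensions.
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for real model output
query_vec    = np.array([0.9, 0.1, 0.0, 0.3])
doc_deploy   = np.array([0.8, 0.2, 0.1, 0.4])  # chunk about deployments
doc_vacation = np.array([0.0, 0.9, 0.8, 0.1])  # chunk about vacation policy

print(cosine_similarity(query_vec, doc_deploy))    # high score -> retrieved
print(cosine_similarity(query_vec, doc_vacation))  # low score  -> skipped
```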
---
#### Questions to Guide Your Design
**Before implementing, think through these:**
1. **Document Processing**
- How do you extract text from PDFs (PyPDF2, pdfplumber, or OCR)?
- Should you split documents by page, paragraph, or semantic chunks?
- How do you preserve metadata (source file, page number, section heading)?
- What happens with images or tables in documents?
2. **Chunking Strategy**
- What chunk size optimizes retrieval (512 tokens, 1000 tokens)?
- Should chunks overlap to avoid splitting important context?
- How do you handle code blocks vs prose (different chunking strategies)?
- Should you create multiple chunk sizes for different query types?
3. **Embedding and Retrieval**
- Which embedding model (OpenAI, Sentence Transformers, local models)?
- How many top-k results to retrieve (3, 5, 10)?
- Should you re-rank results after initial retrieval?
- How do you handle queries that don’t match any documents? (A threshold sketch follows this list.)
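For the “no matching documents” case, one approach is a distance threshold, sketched below against LangChain’s Chroma wrapper. The cutoff value is an assumption you must calibrate, and whether a lower score means a closer match depends on your vector store.
```python
# Sketch: decline to answer when no indexed chunk is close enough to the query.
MAX_DISTANCE = 0.35  # illustrative cutoff; calibrate on your own data

def retrieve_or_decline(vectordb, query: str, k: int = 3):
    # Chroma returns (Document, distance) pairs where lower distance = more similar
    results = vectordb.similarity_search_with_score(query, k=k)
    relevant = [(doc, score) for doc, score in results if score <= MAX_DISTANCE]
    if not relevant:
        return {"status": "no_match", "message": "No indexed document covers this topic."}
    return {"status": "ok", "chunks": relevant}
```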
---
#### Thinking Exercise
### Scenario: Chunking Strategy
Given this internal documentation snippet:
```markdown
# Deployment Guide
## Chapter 3: Production Deploys
### 3.1 Pre-Deployment Checklist
Before deploying to production, verify:
1. All tests pass in CI/CD pipeline
2. Database migrations tested in staging
3. Feature flags configured
### 3.2 Deployment Window
Production deploys occur:
- Scheduled: Tuesday/Thursday 10 AM PST
- Emergency: Requires VP approval
### 3.3 Rollout Strategy
We use blue-green deployment:
1. Deploy to blue environment
2. Route 10% traffic
3. Monitor for 30 minutes
4. Gradually increase to 100%
```
While designing your chunking strategy, compare these options:
**Option 1: By Section (Heading-Based)**
Chunk 1: "Chapter 3: Production Deploys ... 3.1 Pre-Deployment Checklist ... verify: 1. All tests..."
Chunk 2: "3.2 Deployment Window ... Production deploys occur: ..."
Chunk 3: "3.3 Rollout Strategy ... We use blue-green deployment: ..."
**Option 2: Fixed Token Size (500 tokens)**
Chunk 1: "Chapter 3... 3.1 Pre-Deployment... 3.2 Deployment Window... (cut mid-section)"
Chunk 2: "...Window ... Tuesday/Thursday... 3.3 Rollout... blue-green deployment..."
**Option 3: Semantic (Paragraph-Based with Overlap)**
Chunk 1: "Chapter 3... 3.1 Pre-Deployment Checklist... verify: 1. All tests..."
Chunk 2: "3.1 (last paragraph)... 3.2 Deployment Window... Scheduled: Tuesday..."
Chunk 3: "3.2 (last paragraph)... 3.3 Rollout Strategy... blue-green..."
Which is best for this query: “What days can I deploy to production?”
- Option 1 ✅ - Section 3.2 is intact (see the heading-based splitter sketch below)
- Option 2 ❌ - Answer split across chunks
- Option 3 ✅ - Overlap ensures context
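One way to implement Option 1 is a heading-based splitter. A minimal sketch using LangChain’s `MarkdownHeaderTextSplitter` (the import path varies by LangChain version, and the header-to-metadata mapping below is an illustrative choice):
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split on H1/H2/H3 so each "### 3.x" section becomes its own chunk,
# with the heading text preserved as chunk metadata.
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "title"),
    ("##", "chapter"),
    ("###", "section"),
])

with open("deployment-guide.md") as f:
    chunks = splitter.split_text(f.read())

for chunk in chunks:
    print(chunk.metadata, chunk.page_content[:60])
```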
#### The Interview Questions They’ll Ask
- “Explain the difference between RAG and fine-tuning for extending an LLM’s knowledge. When would you use each?”
- “How would you design a document chunking strategy that balances retrieval accuracy with context preservation?”
- “What strategies would you use to prevent RAG systems from generating answers based on outdated documentation?”
- “Describe how you would handle multi-lingual documentation in a RAG system.”
- “How would you implement access control so users can only retrieve documents they’re authorized to see?”
- “What approaches would you use to evaluate RAG system quality (precision, recall, answer quality)?”
#### Hints in Layers
**Hint 1: Document Ingestion Pipeline**
```python
# docs-mcp-server.py
import os
from glob import glob

from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def index_documents(docs_path):
    # Load all PDFs and Markdown files under docs_path
    loaders = [
        PyPDFLoader(f) for f in glob(os.path.join(docs_path, "**/*.pdf"), recursive=True)
    ] + [
        TextLoader(f) for f in glob(os.path.join(docs_path, "**/*.md"), recursive=True)
    ]
    docs = []
    for loader in loaders:
        docs.extend(loader.load())

    # Split into overlapping chunks so answers aren't cut off at chunk boundaries
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    chunks = splitter.split_documents(docs)

    # Create embeddings and persist them in a local vector DB
    embeddings = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(
        chunks, embeddings,
        persist_directory=os.path.join(docs_path, ".vectordb")
    )
    return vectordb
```
**Hint 2: MCP Server Query Tool**
```python
# Schematic tool definition; see the wiring sketch below for one way to
# register it with an actual MCP server framework.
@mcp_server.tool("search_knowledge_base")
def search_docs(query: str, top_k: int = 3):
    """Search internal documentation"""
    vectordb = Chroma(persist_directory=".vectordb", embedding_function=OpenAIEmbeddings())
    results = vectordb.similarity_search_with_score(query, k=top_k)
    return [
        {
            "content": doc.page_content,
            "source": doc.metadata["source"],
            "page": doc.metadata.get("page", "N/A"),
            "score": score
        }
        for doc, score in results
    ]
```
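Hint 2’s decorator is schematic. One way to register the tool for real, assuming the official `mcp` Python SDK and its `FastMCP` helper (verify the exact import path and decorator signature against the SDK version you install):
```python
# Sketch: serve the search tool over stdio, matching the "serve" command in mcp.json.
from mcp.server.fastmcp import FastMCP

mcp_server = FastMCP("docs")

@mcp_server.tool()
def search_knowledge_base(query: str, top_k: int = 3) -> list[dict]:
    """Search internal documentation and return the most relevant chunks."""
    return search_docs(query, top_k)  # reuse the retrieval function from Hint 2

if __name__ == "__main__":
    mcp_server.run()  # stdio transport by default
```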
**Hint 3: Metadata Preservation.** When loading documents, preserve source information:
```python
doc.metadata = {
    "source": filename,
    "page": page_num,
    "section": heading,
    "last_modified": file_mtime
}
```
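That metadata pays off at query time too, since most vector stores can filter on it. A sketch assuming LangChain’s Chroma wrapper and its `filter` argument (the field names mirror the metadata above; the filename is hypothetical):
```python
# Restrict retrieval to a single source document via a metadata filter.
results = vectordb.similarity_search(
    "What is the deployment window?",
    k=3,
    filter={"source": "deployment-guide.pdf"}
)
for doc in results:
    print(doc.metadata["source"], doc.metadata.get("page"))
```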
**Hint 4: Hybrid Search (Keyword + Semantic).** Combine vector similarity with keyword matching for better results:
```python
# Get semantic matches
vector_results = vectordb.similarity_search(query, k=10)

# Get keyword matches (BM25, e.g. via the rank_bm25 package, which expects a tokenized query)
keyword_results = bm25.get_top_n(query.split(), documents, n=10)

# Merge and re-rank (Reciprocal Rank Fusion)
final_results = reciprocal_rank_fusion([vector_results, keyword_results])
```
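`reciprocal_rank_fusion` is not a library call here; a minimal sketch of the standard RRF formula, where each document scores the sum of `1 / (k + rank)` across result lists (60 is the conventional constant):
```python
def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Merge ranked result lists; higher fused score = agreed-upon relevance."""
    scores = {}
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            # LangChain results carry .page_content; BM25 results may be plain strings
            key = getattr(item, "page_content", item)
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```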
#### Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Embeddings | “Speech and Language Processing” by Jurafsky & Martin | Ch. 6 (Vector Semantics) |
| Information Retrieval | “Introduction to Information Retrieval” by Manning et al. | Ch. 6 (Scoring, Term Weighting) |
| RAG Systems | “Building LLM Applications” by Damian Fanton | Ch. 4 (Retrieval) |
| Vector Databases | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 3 (Storage Engines) |
#### Common Pitfalls & Debugging
Problem 1: “Embeddings fail with ‘token limit exceeded’”
- Why: Document chunks are too large (>8,191 tokens for text-embedding-3)
- Fix: Reduce chunk_size to 1000 tokens or use recursive splitting
- Quick test: `len(tiktoken.get_encoding("cl100k_base").encode(chunk))` should be below 8,000
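A quick pre-flight check over all chunks (assumes the `tiktoken` package and the `cl100k_base` encoding used by the text-embedding-3 models, with `chunks` from the ingestion pipeline above):
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Flag any chunk that would blow past the embedding model's token limit
oversized = [c for c in chunks if len(enc.encode(c.page_content)) > 8000]
print(f"{len(oversized)} of {len(chunks)} chunks exceed the limit")
```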
Problem 2: “RAG returns irrelevant results”
- Why: Query and document embeddings use different semantic spaces
- Fix: Use query expansion or rewrite user questions before embedding
- Quick test: Manually inspect top-k results for relevance
Problem 3: “Kiro hallucinates information not in documents”
- Why: LLM fills in gaps when retrieved context is incomplete
- Fix: Add system prompt: “Only answer from provided context. Say ‘I don’t know’ if information isn’t in the documents.”
- Quick test: Ask a question you know isn’t in the docs
Problem 4: “Vector DB queries are slow (>5 seconds)”
- Why: No index optimization (brute-force search)
- Fix: Use HNSW index in FAISS or enable indexing in Chroma
- Quick test: time `vectordb.similarity_search(query, k=5)` in Python, e.g. with `time.perf_counter()`
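A minimal latency check, assuming `vectordb` is the Chroma handle from the hints above:
```python
import time

start = time.perf_counter()
vectordb.similarity_search("production deployment process", k=5)
print(f"query latency: {time.perf_counter() - start:.2f}s")  # should be well under 5 seconds
```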
#### Definition of Done
- Indexed at least 3 internal documents (PDFs or Markdown)
- MCP server exposes a `search_knowledge_base` tool
- Kiro successfully answers a question only present in indexed docs
- Chunks preserve metadata (source file, page number)
- Retrieved results include relevance scores
- System prompt prevents hallucination beyond retrieved context
- Documented chunking strategy and embedding model in README
- Tested with queries that have no matching documents (graceful “I don’t know”)