### Project 12: “The Documentation Librarian” — RAG
| Attribute | Value |
|---|---|
| File | KIRO_CLI_LEARNING_PROJECTS.md |
| Main Programming Language | Python (mcp-server-rag) |
| Coolness Level | Level 3: Genuinely Clever |
| Difficulty | Level 2: Intermediate |
| Knowledge Area | RAG (Retrieval Augmented Generation) |
**What you’ll build:** A custom MCP server that indexes PDFs or internal docs for question answering.
**Why it teaches RAG:** You extend Kiro with private knowledge beyond its training data.
**Success criteria:**
- Kiro answers a question that is only in your private docs.
#### Real World Outcome
You’ll build a custom MCP server that indexes your internal documentation (PDFs, Markdown, Confluence pages) using embeddings, then lets Kiro search and answer questions from this private knowledge base—extending Kiro’s knowledge beyond its training data.
**Example: Indexing Internal Documentation**
```bash
$ python docs-mcp-server.py index ./company-docs/
Indexing documentation...
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📄 engineering-handbook.pdf (142 pages) ✓ Embedded
📄 api-reference.md (2,340 lines) ✓ Embedded
📄 deployment-guide.pdf (68 pages) ✓ Embedded
📄 security-policies.md (890 lines) ✓ Embedded
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total chunks: 1,847
Vector DB: ./company-docs/.vectordb/
Ready for queries ✓
```
Configuration (.kiro/settings/mcp.json):
```json
{
  "mcpServers": {
    "docs": {
      "command": "python3",
      "args": ["./docs-mcp-server.py", "serve"],
      "env": {
        "VECTOR_DB_PATH": "./company-docs/.vectordb",
        "EMBEDDING_MODEL": "text-embedding-3-small"
      }
    }
  }
}
```
When you ask Kiro questions from internal docs:
$ kiro chat
You: "What's our process for deploying to production?"
Kiro: [Tool: docs/search_knowledge_base]
Query: "production deployment process"
Top 3 results:
[Tool: docs/retrieve_context]
Based on the Engineering Handbook (Section 7.2) and Deployment Guide (Chapter 3):
**Production Deployment Process:**
1. **Pre-Deployment Checklist** (Eng Handbook p.67)
- [ ] PR approved by 2+ engineers
- [ ] All tests passing in CI
- [ ] Database migration tested in staging
- [ ] Feature flags configured
2. **Deployment Window** (Deployment Guide p.24)
- Scheduled: Tuesday/Thursday 10 AM PST
- Emergency: Requires VP Engineering approval
- Avoid: Friday afternoon, holiday weeks
3. **Rollout Strategy** (Deployment Guide p.31)
- Blue-green deployment with AWS ECS
- 10% traffic → 50% → 100% over 2 hours
- Automatic rollback if error rate >0.1%
4. **Post-Deployment** (Eng Handbook p.71)
- Monitor DataDog for 30 minutes
- Check Sentry for new error spikes
- Notify #production-deploys Slack channel
**Emergency Rollback:**
```bash
./scripts/rollback.sh --environment production --revision previous
```
Sources:
- Engineering Handbook (2024-Q4), Section 7.2
- Deployment Guide v3.1, Chapters 3-4
You: “What are the approved third-party libraries for authentication?”
Kiro: [Tool: docs/search_knowledge_base] Query: “approved authentication libraries security policy”
According to Security Policies (Section 4.5 “Approved Dependencies”):
Authentication Libraries:
✅ Approved:
- `passport` v0.7+ (Node.js) - OAuth, JWT, SAML
- `authlib` v1.2+ (Python) - OAuth 2.0, OpenID Connect
- `spring-security` v6.1+ (Java) - Enterprise SSO
❌ Prohibited:
- Custom-built JWT libraries (use approved only)
- Deprecated: `oauth2-server` (EOL 2023)
- Unvetted: Any library not on the approved list
Approval Process: (Security Policies p.89)
- Submit request to #security-review
- Security team evaluates: License, maintenance, CVE history
- Approval SLA: 5 business days
Current Exception: (Added 2024-12-01)
`next-auth` v5.0+ approved for internal tools only (not customer-facing)
Source: Security Policies v2.3, Section 4.5
**The power:** Kiro accessed information that exists ONLY in your internal documentation, not in its training data. It cited exact page numbers and versions.
---
#### The Core Question You're Answering
> "How can I give an AI access to my company's internal knowledge base without uploading sensitive documents to third-party services?"
Traditional solutions have problems:
- **Shared drives:** Documents exist but aren't searchable conversationally
- **Confluence:** Requires manual searching and reading
- **Uploading to ChatGPT:** Violates data privacy policies
This project solves it: **Build a local RAG system that keeps your data on-premises while enabling AI-powered search**.
---
#### Concepts You Must Understand First
**Stop and research these before coding:**
1. **Embeddings and Vector Similarity**
- What is an embedding (vector representation of text)?
- How do you measure similarity between vectors (cosine similarity, dot product)? See the sketch after this list.
- Why are embeddings better than keyword search for semantic matching?
- *Book Reference:* "Speech and Language Processing" by Jurafsky & Martin - Ch. 6
2. **Vector Databases**
- What's the difference between traditional databases and vector databases?
- How do vector indexes work (HNSW, IVF)?
- When do you use in-memory (FAISS) vs persistent (Chroma, Pinecone)?
- *Web Reference:* [Pinecone - What is a Vector Database?](https://www.pinecone.io/learn/vector-database/)
3. **Retrieval Augmented Generation (RAG)**
- What's the difference between RAG and fine-tuning?
- How do you chunk documents for optimal retrieval (size, overlap)?
- What's the tradeoff between context window size and retrieval accuracy?
- *Web Reference:* [LangChain RAG Documentation](https://python.langchain.com/docs/use_cases/question_answering/)
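To make concept 1 concrete, here is a minimal cosine-similarity sketch using `numpy`. The vectors are toy values; real embeddings come from an embedding model and have hundreds of dimensions.
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" standing in for real model output
query_vec    = np.array([0.9, 0.1, 0.0, 0.3])
doc_deploy   = np.array([0.8, 0.2, 0.1, 0.4])  # chunk about deployments
doc_vacation = np.array([0.0, 0.9, 0.8, 0.1])  # chunk about vacation policy

print(cosine_similarity(query_vec, doc_deploy))    # high score -> retrieved
print(cosine_similarity(query_vec, doc_vacation))  # low score  -> skipped
```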
---
#### Questions to Guide Your Design
**Before implementing, think through these:**
1. **Document Processing**
- How do you extract text from PDFs (PyPDF2, pdfplumber, or OCR)?
- Should you split documents by page, paragraph, or semantic chunks?
- How do you preserve metadata (source file, page number, section heading)?
- What happens with images or tables in documents?
2. **Chunking Strategy**
- What chunk size optimizes retrieval (512 tokens, 1000 tokens)?
- Should chunks overlap to avoid splitting important context?
- How do you handle code blocks vs prose (different chunking strategies)?
- Should you create multiple chunk sizes for different query types?
3. **Embedding and Retrieval**
- Which embedding model (OpenAI, Sentence Transformers, local models)?
- How many top-k results to retrieve (3, 5, 10)?
- Should you re-rank results after initial retrieval?
- How do you handle queries that don’t match any documents? (A threshold sketch follows this list.)
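For the “no matching documents” case, one approach is a distance threshold, sketched below against LangChain’s Chroma wrapper. The cutoff value is an assumption you must calibrate, and whether a lower score means a closer match depends on your vector store.
```python
# Sketch: decline to answer when no indexed chunk is close enough to the query.
MAX_DISTANCE = 0.35  # illustrative cutoff; calibrate on your own data

def retrieve_or_decline(vectordb, query: str, k: int = 3):
    # Chroma returns (Document, distance) pairs where lower distance = more similar
    results = vectordb.similarity_search_with_score(query, k=k)
    relevant = [(doc, score) for doc, score in results if score <= MAX_DISTANCE]
    if not relevant:
        return {"status": "no_match", "message": "No indexed document covers this topic."}
    return {"status": "ok", "chunks": relevant}
```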
---
#### Thinking Exercise
### Scenario: Chunking Strategy
Given this internal documentation snippet:
```markdown
# Deployment Guide
## Chapter 3: Production Deploys
### 3.1 Pre-Deployment Checklist
Before deploying to production, verify:
1. All tests pass in CI/CD pipeline
2. Database migrations tested in staging
3. Feature flags configured
### 3.2 Deployment Window
Production deploys occur:
- Scheduled: Tuesday/Thursday 10 AM PST
- Emergency: Requires VP approval
### 3.3 Rollout Strategy
We use blue-green deployment:
1. Deploy to blue environment
2. Route 10% traffic
3. Monitor for 30 minutes
4. Gradually increase to 100%
```
While designing your chunking strategy, compare these options:
**Option 1: By Section (Heading-Based)**
Chunk 1: "Chapter 3: Production Deploys ... 3.1 Pre-Deployment Checklist ... verify: 1. All tests..."
Chunk 2: "3.2 Deployment Window ... Production deploys occur: ..."
Chunk 3: "3.3 Rollout Strategy ... We use blue-green deployment: ..."
**Option 2: Fixed Token Size (500 tokens)**
Chunk 1: "Chapter 3... 3.1 Pre-Deployment... 3.2 Deployment Window... (cut mid-section)"
Chunk 2: "...Window ... Tuesday/Thursday... 3.3 Rollout... blue-green deployment..."
**Option 3: Semantic (Paragraph-Based with Overlap)**
Chunk 1: "Chapter 3... 3.1 Pre-Deployment Checklist... verify: 1. All tests..."
Chunk 2: "3.1 (last paragraph)... 3.2 Deployment Window... Scheduled: Tuesday..."
Chunk 3: "3.2 (last paragraph)... 3.3 Rollout Strategy... blue-green..."
Which is best for this query: “What days can I deploy to production?”
- Option 1 ✅ - Section 3.2 is intact (see the heading-based splitter sketch below)
- Option 2 ❌ - Answer split across chunks
- Option 3 ✅ - Overlap ensures context
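One way to implement Option 1 is a heading-based splitter. A minimal sketch using LangChain’s `MarkdownHeaderTextSplitter` (the import path varies by LangChain version, and the header-to-metadata mapping below is an illustrative choice):
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split on H1/H2/H3 so each "### 3.x" section becomes its own chunk,
# with the heading text preserved as chunk metadata.
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=[
    ("#", "title"),
    ("##", "chapter"),
    ("###", "section"),
])

with open("deployment-guide.md") as f:
    chunks = splitter.split_text(f.read())

for chunk in chunks:
    print(chunk.metadata, chunk.page_content[:60])
```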
#### The Interview Questions They’ll Ask
- “Explain the difference between RAG and fine-tuning for extending an LLM’s knowledge. When would you use each?”
- “How would you design a document chunking strategy that balances retrieval accuracy with context preservation?”
- “What strategies would you use to prevent RAG systems from generating answers based on outdated documentation?”
- “Describe how you would handle multi-lingual documentation in a RAG system.”
- “How would you implement access control so users can only retrieve documents they’re authorized to see?”
- “What approaches would you use to evaluate RAG system quality (precision, recall, answer quality)?”
#### Hints in Layers
**Hint 1: Document Ingestion Pipeline**
```python
# docs-mcp-server.py
import os
from glob import glob

from langchain.document_loaders import PyPDFLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def index_documents(docs_path):
    # Load all PDFs and Markdown files under docs_path
    loaders = [
        PyPDFLoader(f) for f in glob(os.path.join(docs_path, "**/*.pdf"), recursive=True)
    ] + [
        TextLoader(f) for f in glob(os.path.join(docs_path, "**/*.md"), recursive=True)
    ]
    docs = []
    for loader in loaders:
        docs.extend(loader.load())

    # Split into overlapping chunks so answers aren't cut off at chunk boundaries
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200
    )
    chunks = splitter.split_documents(docs)

    # Create embeddings and persist them in a local vector DB
    embeddings = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(
        chunks, embeddings,
        persist_directory=os.path.join(docs_path, ".vectordb")
    )
    return vectordb
```
**Hint 2: MCP Server Query Tool**
```python
# Schematic tool definition; see the wiring sketch below for one way to
# register it with an actual MCP server framework.
@mcp_server.tool("search_knowledge_base")
def search_docs(query: str, top_k: int = 3):
    """Search internal documentation"""
    vectordb = Chroma(persist_directory=".vectordb", embedding_function=OpenAIEmbeddings())
    results = vectordb.similarity_search_with_score(query, k=top_k)
    return [
        {
            "content": doc.page_content,
            "source": doc.metadata["source"],
            "page": doc.metadata.get("page", "N/A"),
            "score": score
        }
        for doc, score in results
    ]
```
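Hint 2’s decorator is schematic. One way to register the tool for real, assuming the official `mcp` Python SDK and its `FastMCP` helper (verify the exact import path and decorator signature against the SDK version you install):
```python
# Sketch: serve the search tool over stdio, matching the "serve" command in mcp.json.
from mcp.server.fastmcp import FastMCP

mcp_server = FastMCP("docs")

@mcp_server.tool()
def search_knowledge_base(query: str, top_k: int = 3) -> list[dict]:
    """Search internal documentation and return the most relevant chunks."""
    return search_docs(query, top_k)  # reuse the retrieval function from Hint 2

if __name__ == "__main__":
    mcp_server.run()  # stdio transport by default
```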
**Hint 3: Metadata Preservation.** When loading documents, preserve source information:
```python
doc.metadata = {
    "source": filename,
    "page": page_num,
    "section": heading,
    "last_modified": file_mtime
}
```
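That metadata pays off at query time too, since most vector stores can filter on it. A sketch assuming LangChain’s Chroma wrapper and its `filter` argument (the field names mirror the metadata above; the filename is hypothetical):
```python
# Restrict retrieval to a single source document via a metadata filter.
results = vectordb.similarity_search(
    "What is the deployment window?",
    k=3,
    filter={"source": "deployment-guide.pdf"}
)
for doc in results:
    print(doc.metadata["source"], doc.metadata.get("page"))
```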
**Hint 4: Hybrid Search (Keyword + Semantic).** Combine vector similarity with keyword matching for better results:
```python
# Get semantic matches
vector_results = vectordb.similarity_search(query, k=10)

# Get keyword matches (BM25, e.g. via the rank_bm25 package, which expects a tokenized query)
keyword_results = bm25.get_top_n(query.split(), documents, n=10)

# Merge and re-rank (Reciprocal Rank Fusion)
final_results = reciprocal_rank_fusion([vector_results, keyword_results])
```
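`reciprocal_rank_fusion` is not a library call here; a minimal sketch of the standard RRF formula, where each document scores the sum of `1 / (k + rank)` across result lists (60 is the conventional constant):
```python
def reciprocal_rank_fusion(result_lists, k: int = 60):
    """Merge ranked result lists; higher fused score = agreed-upon relevance."""
    scores = {}
    for results in result_lists:
        for rank, item in enumerate(results, start=1):
            # LangChain results carry .page_content; BM25 results may be plain strings
            key = getattr(item, "page_content", item)
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```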
#### Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Embeddings | “Speech and Language Processing” by Jurafsky & Martin | Ch. 6 (Vector Semantics) |
| Information Retrieval | “Introduction to Information Retrieval” by Manning et al. | Ch. 6 (Scoring, Term Weighting) |
| RAG Systems | “Building LLM Applications” by Damian Fanton | Ch. 4 (Retrieval) |
| Vector Databases | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 3 (Storage Engines) |
#### Common Pitfalls & Debugging
Problem 1: “Embeddings fail with ‘token limit exceeded’”
- Why: Document chunks are too large (>8,191 tokens for text-embedding-3)
- Fix: Reduce chunk_size to 1000 tokens or use recursive splitting
- Quick test: `len(tiktoken.get_encoding("cl100k_base").encode(chunk))` should be below 8,000
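A quick pre-flight check over all chunks (assumes the `tiktoken` package and the `cl100k_base` encoding used by the text-embedding-3 models, with `chunks` from the ingestion pipeline above):
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
# Flag any chunk that would blow past the embedding model's token limit
oversized = [c for c in chunks if len(enc.encode(c.page_content)) > 8000]
print(f"{len(oversized)} of {len(chunks)} chunks exceed the limit")
```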
Problem 2: “RAG returns irrelevant results”
- Why: Query and document embeddings use different semantic spaces
- Fix: Use query expansion or rewrite user questions before embedding
- Quick test: Manually inspect top-k results for relevance
Problem 3: “Kiro hallucinates information not in documents”
- Why: LLM fills in gaps when retrieved context is incomplete
- Fix: Add system prompt: “Only answer from provided context. Say ‘I don’t know’ if information isn’t in the documents.”
- Quick test: Ask a question you know isn’t in the docs
Problem 4: “Vector DB queries are slow (>5 seconds)”
- Why: No index optimization (brute-force search)
- Fix: Use HNSW index in FAISS or enable indexing in Chroma
- Quick test: time `vectordb.similarity_search(query, k=5)` in Python, e.g. with `time.perf_counter()`
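A minimal latency check, assuming `vectordb` is the Chroma handle from the hints above:
```python
import time

start = time.perf_counter()
vectordb.similarity_search("production deployment process", k=5)
print(f"query latency: {time.perf_counter() - start:.2f}s")  # should be well under 5 seconds
```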
#### Definition of Done
- Indexed at least 3 internal documents (PDFs or Markdown)
- MCP server exposes a `search_knowledge_base` tool
- Kiro successfully answers a question only present in indexed docs
- Chunks preserve metadata (source file, page number)
- Retrieved results include relevance scores
- System prompt prevents hallucination beyond retrieved context
- Documented chunking strategy and embedding model in README
- Tested with queries that have no matching documents (graceful “I don’t know”)