Project 2: Document Q&A Bot (RAG)

Build a LangChain-based RAG system that answers questions from your documents with citations.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 8-14 hours |
| Language | Python or JavaScript |
| Prerequisites | Embeddings basics, vector search concepts |
| Key Topics | chunking, retrieval, grounding, citations |

1. Learning Objectives

By completing this project, you will:

  1. Ingest and chunk documents for retrieval.
  2. Store embeddings in a vector store.
  3. Retrieve top-k relevant chunks.
  4. Generate answers with citations.
  5. Evaluate retrieval quality on a test set.

2. Theoretical Foundation

2.1 RAG as Grounding

Retrieval-Augmented Generation (RAG) reduces hallucinations by grounding the model in evidence: relevant chunks are retrieved from your corpus, injected into the prompt, and the model is instructed to answer only from that context. Because every claim must be traceable to a retrieved chunk, the system can attach citations and decline to answer when no supporting evidence is found.
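
A minimal sketch of that flow, assuming a LangChain chat model and retriever that both expose `.invoke()` (true in recent versions); the prompt wording is illustrative:

```python
def answer_with_grounding(question: str, retriever, llm) -> str:
    """Generate an answer using only retrieved chunks as evidence."""
    # 1. Retrieve the top-k chunks for this question.
    docs = retriever.invoke(question)

    # 2. Build a prompt that restricts the model to that evidence.
    context = "\n\n".join(
        f"[{i}] {d.page_content}" for i, d in enumerate(docs, start=1)
    )
    prompt = (
        "Answer the question using ONLY the sources below. "
        "Cite sources as [n]. If the sources are insufficient, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

    # 3. Generate a grounded, cited answer (chat models return a message object).
    return llm.invoke(prompt).content
```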


3. Project Specification

3.1 What You Will Build

A Q&A bot that answers questions about a document corpus and returns citations for each answer.

3.2 Functional Requirements

  1. Ingestion pipeline for PDFs/Markdown.
  2. Chunking with configurable size/overlap.
  3. Vector store for embeddings.
  4. Retriever for top-k context.
  5. Answering with citations.

3.3 Non-Functional Requirements

  • Consistent embeddings with fixed model.
  • Transparent citations in outputs.
  • Fallback when retrieval returns no results.
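
To make the last requirement concrete, one possible fallback sketch (the threshold and message wording are assumptions, not part of the spec):

```python
NO_CONTEXT_MESSAGE = (
    "I could not find relevant passages in the indexed documents, "
    "so I can't answer this question reliably."
)

def retrieve_or_fallback(question: str, retriever, min_docs: int = 1):
    """Return retrieved chunks, or None to signal that the fallback message should be used."""
    docs = retriever.invoke(question)
    if len(docs) < min_docs:
        return None  # caller responds with NO_CONTEXT_MESSAGE instead of guessing
    return docs
```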

4. Solution Architecture

4.1 Components

| Component | Responsibility |
|---|---|
| Loader | Read documents |
| Chunker | Split into chunks |
| Index | Store embeddings |
| Retriever | Fetch relevant chunks |
| Answer Chain | Generate grounded response |

5. Implementation Guide

5.1 Project Structure

LEARN_LANGCHAIN_PROJECTS/P02-document-qa-bot/
├── src/
│   ├── ingest.py
│   ├── chunk.py
│   ├── index.py
│   ├── retrieve.py
│   └── answer.py

5.2 Implementation Phases

Phase 1: Ingestion + chunking (3-5h)

  • Load documents and create chunks.
  • Checkpoint: chunks stable across runs.
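
A minimal sketch for this phase, assuming LangChain's PDF/text loaders and RecursiveCharacterTextSplitter (import paths vary slightly between LangChain versions; PDF loading needs the pypdf package):

```python
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def ingest_and_chunk(path: str, chunk_size: int = 800, chunk_overlap: int = 100):
    """Load one document and split it into overlapping chunks."""
    loader = PyPDFLoader(path) if path.endswith(".pdf") else TextLoader(path)
    docs = loader.load()

    # Deterministic splitter: same input + same settings => same chunks,
    # which is what the "chunks stable across runs" checkpoint requires.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    chunks = splitter.split_documents(docs)

    # Attach stable IDs now so later citations can point back to exact chunks.
    for i, chunk in enumerate(chunks):
        chunk.metadata["chunk_id"] = f"{path}:{i}"
    return chunks
```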

Phase 2: Retrieval (3-5h)

  • Build vector store and retriever.
  • Checkpoint: top-k chunks are relevant.
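
A possible sketch using FAISS and OpenAI embeddings (any LangChain-compatible vector store and embedding model works; this assumes the faiss-cpu and langchain-openai packages). Keeping the embedding model fixed satisfies the consistency requirement in section 3.3:

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

def build_retriever(chunks, k: int = 4):
    """Embed chunks, index them, and return a top-k retriever."""
    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    store = FAISS.from_documents(chunks, embeddings)
    store.save_local("index/")  # persist so ingestion only runs once
    return store.as_retriever(search_kwargs={"k": k})
```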

Phase 3: Answering + eval (2-4h)

  • Add citations to answers.
  • Evaluate on sample queries.
  • Checkpoint: citations point to correct docs.
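
One way to attach citations, reusing the chunk_id metadata added in Phase 1 (the return format is an illustrative choice):

```python
def answer_with_citations(question: str, retriever, llm) -> dict:
    """Return the answer plus the chunk IDs it was grounded on."""
    docs = retriever.invoke(question)
    sources = [d.metadata.get("chunk_id", "unknown") for d in docs]

    context = "\n\n".join(
        f"[{sid}] {d.page_content}" for sid, d in zip(sources, docs)
    )
    prompt = (
        "Answer from the sources below and cite the bracketed IDs you used.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return {"answer": llm.invoke(prompt).content, "sources": sources}
```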

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit | chunking | overlap correctness |
| Integration | retrieval | top-k accuracy |
| Regression | citations | every claim cited |
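
For the integration category, top-k accuracy can be measured against a small hand-labelled test set. A minimal sketch, where the test-set format and example entry are assumptions:

```python
# Each test case pairs a query with the chunk_id that should be retrieved.
TEST_SET = [
    {"query": "What is the refund policy?", "expected": "docs/policy.md:3"},  # hypothetical entry
]

def top_k_accuracy(retriever, test_set) -> float:
    """Fraction of queries whose expected chunk appears in the retrieved top-k."""
    hits = 0
    for case in test_set:
        retrieved_ids = {d.metadata.get("chunk_id") for d in retriever.invoke(case["query"])}
        hits += case["expected"] in retrieved_ids
    return hits / len(test_set)
```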

6.2 Critical Test Cases

  1. Query with known answer returns correct citation.
  2. No-context query returns safe fallback.
  3. Chunking respects size and overlap.
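
A sketch of test case 3 as a pytest unit test, assuming the ingest_and_chunk helper sketched in Phase 1 lives in src/chunk.py:

```python
from src.chunk import ingest_and_chunk  # hypothetical import matching the project layout

def test_chunking_respects_size(tmp_path):
    doc = tmp_path / "sample.md"
    doc.write_text("word " * 2000)  # long synthetic document

    chunks = ingest_and_chunk(str(doc), chunk_size=500, chunk_overlap=50)

    # Every chunk must respect the configured maximum size.
    assert all(len(c.page_content) <= 500 for c in chunks)
    # Splitting a long document should produce more than one chunk.
    assert len(chunks) > 1
```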

7. Common Pitfalls & Debugging

| Pitfall | Symptom | Fix |
|---|---|---|
| Bad chunking | irrelevant context | tune size/overlap |
| Citation drift | wrong sources | attach chunk IDs |
| Latency spikes | slow retrieval | reduce top-k |

8. Extensions & Challenges

Beginner

  • Add a simple web UI.
  • Add query history.

Intermediate

  • Add reranking stage.
  • Add hybrid keyword + vector search.
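
For the hybrid search extension, one option is LangChain's BM25Retriever blended with the vector retriever via EnsembleRetriever (requires the rank_bm25 package; the weights below are illustrative):

```python
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

def build_hybrid_retriever(chunks, vector_retriever, k: int = 4):
    """Blend keyword (BM25) and vector similarity rankings."""
    bm25 = BM25Retriever.from_documents(chunks)
    bm25.k = k
    return EnsembleRetriever(
        retrievers=[bm25, vector_retriever],
        weights=[0.4, 0.6],  # tune on your evaluation set
    )
```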

Advanced

  • Add evaluation dashboard.
  • Add multi-document comparison mode.

9. Real-World Connections

  • Support bots need grounded answers.
  • Internal knowledge bases rely on RAG.

10. Resources

  • LangChain RAG docs
  • Vector store guides
  • “AI Engineering” (retrieval systems)

11. Self-Assessment Checklist

  • I can build an end-to-end RAG pipeline.
  • I can add citations to answers.
  • I can evaluate retrieval quality.

12. Submission / Completion Criteria

Minimum Completion:

  • RAG pipeline with citations

Full Completion:

  • Evaluation on test queries
  • Retrieval fallback handling

Excellence:

  • Reranking or hybrid search
  • Evaluation dashboard

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/LEARN_LANGCHAIN_PROJECTS.md.