Project 4: Production RAG with Citations

Build a citation-grounded RAG assistant with retrieval traces, faithfulness checks, and explicit low-evidence fallback behavior.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 3: Advanced |
| Time Estimate | 20-30 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Go |
| Coolness Level | Level 6 |
| Business Potential | Level 4 |
| Prerequisites | Projects 1-3, API architecture, retrieval basics |
| Key Topics | RAG pipeline, reranking, provenance, faithfulness |

1. Learning Objectives

By completing this project, you will:

  1. Build an end-to-end RAG pipeline with source attribution.
  2. Implement retrieval and reranking with context budgeting.
  3. Add answer gates for low-evidence conditions.
  4. Evaluate citation correctness and faithfulness systematically.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Grounded Generation and Provenance

Fundamentals

RAG is a pipeline that retrieves external evidence and injects it into the model prompt before generation. In production, the key quality target is grounded output: claims should be supported by retrieved sources, and the system should provide verifiable provenance. Fluency without grounding is unreliable memory behavior.

Deep Dive into the concept

A practical RAG system has several stages: query analysis, retrieval, reranking, context assembly, generation, and evaluation. Failures can occur at any stage, so observability is mandatory. Retrieval can return irrelevant chunks; reranking can overfit to lexical overlap; context assembly can bury key evidence under budget pressure; generation can produce unsupported claims even when evidence is available.

Provenance is the glue that keeps the pipeline auditable. Each chunk should carry source ID, version, and location metadata. When assembling context, preserve these IDs. During generation, require citation formatting that points back to those IDs. This does not automatically guarantee truth, but it enables verification.
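As a concrete sketch of carrying provenance through context assembly (the record shape, function names, and budget heuristic are illustrative, not the project's required API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str        # stable across re-ingestion
    source_id: str       # e.g. "handbook-v7"
    source_version: str
    text: str

def assemble_context(chunks: list[Chunk], token_budget: int) -> str:
    """Concatenate chunks into a prompt block, tagging each with its chunk and
    source IDs so generated citations can point back to them. Stops adding
    chunks once a rough token budget is exceeded."""
    parts: list[str] = []
    used = 0
    for c in chunks:
        cost = len(c.text.split())  # crude token estimate; use a real tokenizer in practice
        if used + cost > token_budget:
            break
        parts.append(f"[{c.chunk_id} | {c.source_id}@{c.source_version}]\n{c.text}")
        used += cost
    return "\n\n".join(parts)
```

Because every context block is prefixed with its IDs, a citation template in the generation prompt can refer back to exactly these identifiers.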

Faithfulness evaluation differs from relevance evaluation. A chunk can be relevant but still insufficient to support a specific claim. Faithfulness checks should compare answer assertions against cited chunk content. If confidence is low, the system should respond with a constrained fallback such as “insufficient evidence” and suggest next steps. This fail-informed behavior is better than fabricated confidence.
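A minimal faithfulness gate might look like the sketch below. The lexical-overlap scorer is a deliberately naive stand-in for an NLI model or LLM judge, and the threshold, names, and fallback message are assumptions:

```python
def support_score(claim: str, evidence: str) -> float:
    """Naive lexical-overlap proxy for faithfulness; production systems
    use NLI models or LLM judges instead of raw term overlap."""
    claim_terms = set(claim.lower().split())
    evidence_terms = set(evidence.lower().split())
    if not claim_terms:
        return 0.0
    return len(claim_terms & evidence_terms) / len(claim_terms)

def gate_answer(claims: list[str], cited_text: str, threshold: float = 0.5) -> dict:
    """Return the answer only if every claim clears the support threshold;
    otherwise emit the constrained low-evidence fallback."""
    if all(support_score(c, cited_text) >= threshold for c in claims):
        return {"status": "ANSWER", "claims": claims}
    return {"status": "INSUFFICIENT_EVIDENCE",
            "message": "Insufficient evidence; try rephrasing or adding sources."}
```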

Retrieval strategy strongly affects answer quality. A common pattern is two-stage retrieval: ANN for broad candidate recall, then reranking for precision. Top-k should be bounded by token budget and quality metrics; bigger is not always better because noisy evidence can dilute signal. Query-type routing also helps: some user prompts are chit-chat and do not need expensive retrieval.
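The two-stage pattern can be sketched as follows. The exact vector scan stands in for a real ANN index, the lexical reranker stands in for a cross-encoder, and the `vec`/`text` record shape is an assumption:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_retrieve(query_vec, query_text, index, top_n=50, top_k=5):
    """Stage 1: broad recall by vector similarity (an exact scan standing in
    for an ANN index). Stage 2: rerank the candidates with a more precise
    scorer (lexical overlap standing in for a cross-encoder), then keep a
    bounded top-k so noisy evidence cannot dilute the context."""
    candidates = sorted(index, key=lambda d: cosine(query_vec, d["vec"]),
                        reverse=True)[:top_n]
    qterms = set(query_text.lower().split())
    reranked = sorted(candidates,
                      key=lambda d: len(qterms & set(d["text"].lower().split())),
                      reverse=True)
    return reranked[:top_k]
```

Note the two independent bounds: `top_n` caps reranker cost, `top_k` caps context size.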

Chunking policy matters for grounding. If chunks are too large, citations become vague and contain mixed topics; if too small, chunks lose explanatory context. A balance with semantic boundaries and overlap typically works best. Importantly, chunk IDs should remain stable across re-ingestion to avoid citation drift.
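One way to keep chunk IDs stable is to derive them from the source ID plus the chunk's own content, so unchanged chunks hash to the same ID after re-ingestion. This is a sketch, not a mandated scheme:

```python
import hashlib

def chunk_with_stable_ids(source_id: str, text: str, size: int = 80,
                          overlap: int = 20) -> list[dict]:
    """Split text into overlapping word windows. Each chunk ID is derived from
    the source ID plus the chunk's content, so unchanged chunks keep the same
    ID across re-ingestion and citations do not drift. Requires size > overlap."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        piece = " ".join(words[start:start + size])
        digest = hashlib.sha256(f"{source_id}:{piece}".encode()).hexdigest()[:12]
        chunks.append({"chunk_id": f"{source_id}-{digest}", "text": piece})
        start += size - overlap
    return chunks
```

A content-derived ID does change when the chunk's text changes, which is usually the desired behavior: a citation should not silently point at edited content.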

Operationally, this architecture needs release discipline. Treat index settings, embedding model versions, and prompt templates as versioned artifacts. Run evaluation suites before rollout. Track unsupported-claim rate, citation correctness, retrieval latency, and fallback frequency. If unsupported claims spike after a deployment, rollback should be straightforward.
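A release gate over these metrics can be as simple as the following sketch; the metric names and thresholds are illustrative:

```python
def release_gate(metrics: dict, thresholds: dict) -> dict:
    """Compare evaluation-suite metrics against release thresholds and report
    which checks block rollout. A missing metric counts as a failure."""
    failures = []
    for name, limit in thresholds.items():
        if metrics.get(name, float("inf")) > limit:
            failures.append(name)
    return {"release_ok": not failures, "blocking": failures}

# Example: block the release if the unsupported-claim rate regresses
result = release_gate(
    metrics={"unsupported_claim_rate": 0.08, "p95_latency_s": 1.2},
    thresholds={"unsupported_claim_rate": 0.05, "p95_latency_s": 1.5},
)
```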

Finally, governance matters in enterprise settings. Retrieval must enforce ACLs, tenant boundaries, and document freshness constraints. If sensitive or stale sources are retrieved, citations become liabilities. Grounded generation is therefore both a quality and compliance problem.

How this fits across projects

  • Primary concept for this project.
  • Evaluated deeply in Project 6.

Definitions & key terms

  • Grounding: aligning generated claims to external evidence.
  • Provenance: source identity/version trail.
  • Faithfulness: support of claims by provided evidence.
  • Fallback response: controlled behavior under weak evidence.

Mental model diagram (ASCII)

Query -> ANN Retrieve -> Rerank -> Context Assemble -> Generate -> Evaluate
           |                                         |
           +------------- source IDs ----------------+

How it works (step-by-step)

  1. Classify query and choose retrieval route.
  2. Retrieve top-N candidates and rerank.
  3. Select top-k by relevance and budget.
  4. Assemble prompt with source IDs.
  5. Generate answer using citation template.
  6. Run faithfulness checks and log trace.
  7. Return answer or low-evidence fallback.
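The seven steps above can be sketched as one orchestration function; every stage is injected as a callable so it can be swapped or traced independently (all names and record shapes are illustrative):

```python
def answer_query(query, route, retrieve, rerank, assemble, generate, check, log):
    """Run the seven pipeline steps; each stage is an injected callable."""
    trace = {"query": query}
    if route(query) == "no_retrieval":                 # 1. classify and route
        return {"status": "CHITCHAT"}, trace
    candidates = retrieve(query)                        # 2. broad candidate recall
    selected = rerank(query, candidates)                # 2-3. rerank within budget
    prompt = assemble(query, selected)                  # 4. context with source IDs
    answer = generate(prompt)                           # 5. citation-templated answer
    faithful = check(answer, selected)                  # 6. faithfulness check
    trace.update({"retrieved": [c["chunk_id"] for c in candidates],
                  "selected": [c["chunk_id"] for c in selected],
                  "faithful": faithful})
    log(trace)                                          # 6. store the trace
    if not faithful:                                    # 7. constrained fallback
        return {"status": "INSUFFICIENT_EVIDENCE"}, trace
    return {"status": "ANSWER", "answer": answer}, trace
```

Because the trace is logged before the gate decision, fallback cases remain replayable, which matters for the debugging workflow described later.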

Invariants:

  • Every cited claim maps to source IDs present in context.
  • Retrieval and generation traces are stored.
  • ACL/tenant filters apply before generation.

Failure modes:

  • Unsupported claims with decorative citations.
  • Top-k saturation causing noisy context.
  • Stale source usage due to weak freshness filters.

Minimal concrete example

query: "What is the PTO carry-over policy?"
retrieved_ids: [H-42, H-43, U-2025-10]
answer includes citations: [H-42, U-2025-10]
faithfulness_check: pass

Common misconceptions

  • “If citations exist, answer is trustworthy.”
  • “Higher top-k always helps.”
  • “RAG quality is only a model problem.”

Check-your-understanding questions

  1. Why can citation formatting still produce unfaithful answers?
  2. When should fallback trigger instead of answer generation?
  3. Why must chunk IDs be stable over re-ingest cycles?

Check-your-understanding answers

  1. Citations can be attached without semantic support.
  2. When retrieved evidence is missing or below confidence threshold.
  3. To preserve traceability and avoid broken provenance.

Real-world applications

  • Enterprise policy copilots.
  • Legal and compliance document assistants.
  • Internal developer documentation Q&A.

Where you’ll apply it

  • This project directly.
  • Also used in: Project 6.

References

  • RAG paper: https://arxiv.org/abs/2005.11401
  • RAG survey: https://arxiv.org/abs/2312.10997

Key insights: Useful RAG systems are judged by supported claims, not fluent prose.

Summary: This project operationalizes trustworthy memory with retrieval, provenance, and evaluation.

Homework/Exercises to practice the concept

  1. Define a fallback policy for low-evidence responses.
  2. Create 20 query-answer fixtures with expected citations.

Solutions to the homework/exercises

  1. Use confidence thresholds and explicit uncertainty templates.
  2. Include adversarial queries to test unsupported-claim detection.

3. Project Specification

3.1 What You Will Build

A RAG assistant that:

  • ingests documents,
  • retrieves and reranks evidence,
  • generates citation-grounded answers,
  • evaluates faithfulness and logs traces.

3.2 Functional Requirements

  1. Document ingest with chunking and metadata.
  2. ANN retrieval with bounded top-k.
  3. Optional reranking stage.
  4. Citation-aware prompt assembly.
  5. Faithfulness checker and fallback gate.

3.3 Non-Functional Requirements

  • Performance: p95 end-to-end latency under 1.5s for target workload.
  • Reliability: deterministic outputs for fixture tests.
  • Security: ACL and tenant isolation enforcement.

3.4 Example Usage / Output

$ llm-memory rag ask --question "How many PTO days carry over?"
answer: Up to 5 days carry over into Q1.
citations: handbook-v7#p42, policy-update-2025-10#sec2.1
faithfulness: PASS

3.5 Data Formats / Schemas / Protocols

chunk_record:
- chunk_id
- source_id
- source_version
- text
- metadata{tenant,acl,freshness_tag}
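In Python, the chunk_record schema above might map onto a dataclass like this; the field names mirror the schema, and the example values are invented:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    """Python shape of the chunk_record schema; field names mirror it."""
    chunk_id: str
    source_id: str
    source_version: str
    text: str
    metadata: dict = field(default_factory=dict)  # tenant, acl, freshness_tag

record = ChunkRecord(
    chunk_id="handbook-v7-p42-c1",
    source_id="handbook-v7",
    source_version="7",
    text="Up to 5 PTO days carry over into Q1.",
    metadata={"tenant": "acme", "acl": ["hr", "all-staff"], "freshness_tag": "2025-10"},
)
```

Keeping tenant and ACL data inside the record makes the pre-generation filtering invariant easy to enforce at retrieval time.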

3.6 Edge Cases

  • Query with no matching evidence.
  • Conflicting source documents.
  • Highly ambiguous query intent.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

$ llm-memory rag ingest --corpus data/handbook
$ llm-memory rag ask --question "What is the PTO carry-over policy?"
$ llm-memory rag trace --latest

3.7.2 Golden Path Demo (Deterministic)

$ llm-memory rag ask --fixture fixtures/golden_pto_query.json
[RESULT] faithfulness=PASS citations=2 unsupported_claims=0
exit_code=0

3.7.3 Failure Demo (Deterministic)

$ llm-memory rag ask --fixture fixtures/no_evidence_query.json
[RESULT] status=INSUFFICIENT_EVIDENCE fallback_used=true
exit_code=1

4. Solution Architecture

4.1 High-Level Design

ingest -> embed/index -> retrieve -> rerank -> assemble -> generate -> evaluate -> return

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| Ingest Pipeline | chunk and index docs | stable chunk IDs |
| Retriever | fetch candidate chunks | top-k and filters |
| Reranker | precision refinement | bounded candidate count |
| Answer Gate | faithfulness and fallback | threshold policy |

4.3 Data Structures (No Full Code)

TraceRecord{query_id,retrieved_ids,selected_ids,citations,faithfulness_score}

4.4 Algorithm Overview

  1. Parse query and choose route.
  2. Retrieve + rerank evidence.
  3. Assemble prompt within token budget.
  4. Generate with citations.
  5. Evaluate and gate output.

Complexity:

  • Time: retrieval + rerank + generation.
  • Space: index size + per-query context state.

5. Implementation Guide

5.1 Development Environment Setup

# initialize vector index, load fixture corpus, run golden query
$ llm-memory rag ingest --corpus data/handbook
$ llm-memory rag ask --fixture fixtures/golden_pto_query.json

5.2 Project Structure

p04-production-rag/
  src/
    ingest
    retrieve
    rerank
    assemble
    evaluate
    cli
  data/
  fixtures/
  tests/

5.3 The Core Question You’re Answering

“How do I guarantee traceable, evidence-backed answers under real constraints?”

5.4 Concepts You Must Understand First

  • Retrieval pipeline stages.
  • Faithfulness vs relevance.
  • Provenance metadata design.

5.5 Questions to Guide Your Design

  • What triggers fallback instead of answer?
  • Which metrics block release?

5.6 Thinking Exercise

Take one query with conflicting sources and define deterministic tie-break behavior.
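One possible deterministic policy, assuming version strings sort chronologically (e.g. ISO dates): prefer the newest source_version, then break remaining ties by chunk_id so replays always pick the same winner. This is a sketch of the exercise, not its required answer:

```python
def resolve_conflict(chunks: list[dict]) -> dict:
    """Deterministic tie-break for conflicting sources: newest source_version
    wins; equal versions fall back to the lexicographically largest chunk_id.
    Assumes version strings sort chronologically (e.g. ISO dates)."""
    return max(chunks, key=lambda c: (c["source_version"], c["chunk_id"]))
```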

5.7 The Interview Questions They’ll Ask

  1. How do you design trustworthy RAG?
  2. How do you evaluate faithfulness?
  3. How do you handle conflicting documents?
  4. How do you enforce source ACLs?
  5. What are common RAG failure modes?

5.8 Hints in Layers

  • Hint 1: enforce source IDs from ingest stage.
  • Hint 2: keep top-k bounded.
  • Hint 3: add unsupported-claim detection.
  • Hint 4: store full per-query traces.

5.9 Books That Will Help

| Topic | Book | Chapter |
| --- | --- | --- |
| Architecture trade-offs | Fundamentals of Software Architecture | Trade-off analysis |
| Reliability engineering | Code Complete | Testing and defensive checks |

5.10 Implementation Phases

  • Phase 1: ingest + retrieval + baseline answer.
  • Phase 2: reranking + citation template.
  • Phase 3: faithfulness gate + regression suite.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
| --- | --- | --- | --- |
| Low-evidence behavior | answer anyway / fallback | fallback | trustworthiness |
| Reranker scope | rerank all / top-N | top-N | bounded latency control |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
| --- | --- | --- |
| Unit | component logic | chunk metadata checks |
| Integration | end-to-end pipeline | ingest -> ask -> trace |
| Evaluation | output quality | faithfulness fixtures |

6.2 Critical Test Cases

  1. Supported claim with correct citations.
  2. Unsupported claim triggers fallback.
  3. ACL filtering blocks unauthorized chunks.

6.3 Test Data

Versioned fixtures with known evidence spans and expected citation IDs.
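A fixture check might compare observed citations against the expected set, as in this sketch; the fixture shape and the `ask` callable are assumptions:

```python
def check_fixture(fixture: dict, ask) -> dict:
    """Run one versioned fixture through the pipeline and diff observed
    citations against the expected set, reporting missing and extra IDs."""
    result = ask(fixture["question"])
    expected = set(fixture["expected_citations"])
    observed = set(result["citations"])
    return {
        "missing": sorted(expected - observed),
        "extra": sorted(observed - expected),
        "pass": expected == observed,
    }
```

Reporting missing and extra IDs separately makes regression triage faster than a bare pass/fail flag.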


7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Decorative citations | unsupported claims | claim-to-source checks |
| Over-large top-k | latency/noise spikes | bounded top-k and reranking |
| Stale documents | outdated answers | freshness filters + versioning |

7.2 Debugging Strategies

  • Replay failed queries with trace inspection.
  • Compare selected evidence vs expected evidence fixture.

7.3 Performance Traps

Unbounded reranking and oversized context assembly.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add source snippet previews.
  • Add confidence annotations.

8.2 Intermediate Extensions

  • Add query routing for retrieval/no-retrieval decisions.
  • Add per-domain retrieval profiles.

8.3 Advanced Extensions

  • Add human feedback loop for citation corrections.
  • Add policy-driven writeback from verified answers.

9. Real-World Connections

9.1 Industry Applications

  • Enterprise policy assistants.
  • Regulated-domain Q&A systems.
  • LlamaIndex and LangChain RAG stacks.

9.2 Interview Relevance

Shows strong system design around trust, traceability, and reliability.


10. Resources

10.1 Essential Reading

  • Original RAG paper and recent surveys.

10.2 Video Resources

  • Production RAG architecture talks.

10.3 Tools & Documentation

  • Vector DB documentation.
  • LLM model docs for context and pricing.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain grounding and faithfulness differences.
  • I can explain fallback policies.

11.2 Implementation

  • Every answer is traceable to source IDs.
  • Low-evidence cases are handled safely.

11.3 Growth

  • I can defend RAG design choices with metrics.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • retrieval + citations + trace logs

Full Completion:

  • reranking + faithfulness checks + fallback policy

Excellence (Going Above & Beyond):

  • release-gated evaluation and robust ACL/freshness controls