Project 10: Citation Grounding Gateway
An API gateway that attaches citation integrity scores to responses and blocks unverifiable claims.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 5-10 days (capstone: 3-5 weeks) |
| Main Programming Language | TypeScript |
| Alternative Programming Languages | Python, Go |
| Coolness Level | Level 4: Trust Builder |
| Business Potential | 5. Compliance Enablement |
| Knowledge Area | Grounded Generation |
| Software or Tool | Citation parser + verifier |
| Main Book | Introduction to Information Retrieval (Manning, Raghavan, Schütze) |
| Concept Clusters | Context Engineering and Caching; Prompt Contracts and Output Typing |
1. Learning Objectives
By completing this project, you will:
- Design a reliable artifact: A response gateway that rejects unsupported claims and emits audit traces.
- Decompose LLM responses into atomic claims and verify each claim against a bounded source corpus.
- Build a citation integrity pipeline that detects hallucinated references, fabricated source IDs, and unsupported attributions.
- Compute grounding scores at the claim level and enforce accept/reject thresholds through policy gates.
- Implement structured compliance logging with immutable audit trails suitable for regulated domains.
- Produce a working citation verification system that distinguishes grounded facts from unverifiable assertions with measurable precision and recall.
2. All Theory Needed (Per-Concept Breakdown)
Evidence-Bound Response Design
Fundamentals Evidence-Bound Response Design is the foundational mental model for this project because it mandates that every factual claim in an LLM response must trace back to a specific passage in an approved source document. Without this discipline, language models generate plausible-sounding text that may contain fabricated facts, invented statistics, or hallucinated references. Grounded generation means the system treats the source corpus as the single source of truth and refuses to emit any claim that cannot be linked to a verifiable passage. For this project, you should view each response not as free text but as a structured assembly of individually verifiable claim-citation pairs. The critical shift is from asking “does this sound right?” to asking “can I prove this from the sources?”
Deep Dive into the concept At depth, Evidence-Bound Response Design requires a multi-stage pipeline that decomposes, matches, scores, and gates every response before it reaches the consumer.
The first stage is claim decomposition. A raw LLM response is a monolithic block of text. You must break it into atomic claims, where each claim is a single factual assertion that can be independently verified. For example, the sentence “The 2025 IRS standard mileage rate is 70 cents per mile, up from 67 cents in 2024” contains two atomic claims: the 2025 rate and the 2024 rate. Decomposition can be rule-based (sentence splitting plus named entity recognition) or model-assisted (prompting a second LLM to extract claim units). The key invariant is completeness: every factual assertion in the response must appear as an extracted claim. If a claim is missed, it bypasses verification entirely.
The second stage is citation linking. Each extracted claim needs to be matched against the allowed source corpus. Citation linking patterns include inline reference IDs (where the model emits “[src-3]” markers), span-level attribution (where each claim carries a pointer to the exact paragraph and character range in the source), and source-ID linking (where each claim names the document it draws from). The strongest pattern is span-level attribution because it enables fine-grained entailment checking. Weaker patterns like document-level citation leave room for claims that are “near” a source but not actually supported by it.
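The three linking patterns can be captured as a small discriminated union; the type and field names here are illustrative, not a prescribed format:

```typescript
// The three citation-linking patterns above as a discriminated union.
// Names and fields are illustrative.
type Citation =
  | { kind: "inline_ref"; refId: string }                           // "[src-3]" style markers
  | { kind: "source_id"; sourceId: string }                         // document-level link
  | { kind: "span"; sourceId: string; start: number; end: number }; // span-level attribution

// Span-level attribution is the strongest: it pins the claim to an exact range.
const example: Citation = { kind: "span", sourceId: "src-2", start: 1420, end: 1588 };
```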
The third stage is grounding score computation. The grounding score is the fraction of extracted claims that have valid, verified citations. For example, if a response contains 5 atomic claims and 4 have confirmed source support, the grounding score is 0.80. You must define an acceptance threshold (for example, 0.90 for high-stakes domains like healthcare or finance) and a rejection threshold below which the entire response is blocked. Between those thresholds, you may choose to strip ungrounded claims and return only the verified subset.
The fourth stage is the accept/reject gate. This is the enforcement boundary. Responses above the acceptance threshold pass through with their citation metadata intact. Responses below the rejection threshold are blocked entirely and replaced with a structured error explaining that the question could not be answered from the available sources. Responses in the middle zone require a policy decision: strip-and-pass, retry with different prompting, or escalate to human review.
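The score computation and three-zone gate can be sketched in a few lines; the threshold values and the `GateDecision` shape are illustrative defaults, not prescribed values:

```typescript
// Minimal sketch of the grounding score plus three-zone accept/reject gate.
// Thresholds (0.90 accept, 0.70 reject) are illustrative defaults.
type GateDecision =
  | { action: "ACCEPT" }
  | { action: "REJECT"; reason: "UNGROUNDED_CLAIMS" }
  | { action: "REVIEW" }; // middle zone: strip, retry, or escalate per policy

function groundingScore(verified: number, total: number): number {
  if (total === 0) return 0; // no claims extracted: treat as ungrounded
  return verified / total;
}

function gate(score: number, accept = 0.9, reject = 0.7): GateDecision {
  if (score >= accept) return { action: "ACCEPT" };
  if (score < reject) return { action: "REJECT", reason: "UNGROUNDED_CLAIMS" };
  return { action: "REVIEW" };
}
```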
Finally, Evidence-Bound Response Design shapes how you think about failure. A “wrong” response is not one that sounds bad; it is one where a claim lacks source backing or where a citation points to a passage that does not entail the claim. This makes quality measurement objective and auditable rather than subjective.
How this fits into the project This concept is the primary design driver for Project 10. It determines the response pipeline architecture, the claim extraction strategy, the citation format, and the grounding threshold configuration. Every downstream component (integrity validation, compliance logging) depends on the claim-citation pairs produced by this stage.
Definitions & key terms
- Atomic claim: a single factual assertion that can be independently verified against a source.
- Grounding score: the fraction of atomic claims in a response that have verified citations (range 0.0 to 1.0).
- Span-level attribution: linking a claim to a specific character range or paragraph within a source document.
- Source corpus: the bounded set of documents that the system is allowed to cite.
- Claim decomposition: the process of splitting a monolithic response into individually verifiable claim units.
- Acceptance threshold: the minimum grounding score required for a response to be emitted.
Mental model diagram (ASCII)
LLM Response (raw text)
|
v
+---------------------+
| Claim Decomposer | --> [Claim 1] [Claim 2] [Claim 3] [Claim 4]
+---------------------+
|
v
+---------------------+
| Citation Matcher | --> Claim 1: src-2 p.4 Claim 2: src-1 p.7
| (per claim) | Claim 3: NO MATCH Claim 4: src-2 p.11
+---------------------+
|
v
+---------------------+
| Grounding Scorer | --> score = 3/4 = 0.75
+---------------------+
|
v
+---------------------+
| Accept / Reject | --> 0.75 < 0.90 threshold => REJECT
| Gate | reason: UNGROUNDED_CLAIMS
+---------------------+
|
v
Structured Error OR Verified Response with Citations
How it works (step-by-step, with invariants and failure modes)
- Receive the raw LLM response and the allowed source corpus.
- Run claim decomposition to extract all atomic factual assertions.
- Invariant: every factual sentence must produce at least one claim. If zero claims are extracted from a non-empty response, the decomposer has a bug.
- For each claim, search the source corpus for a supporting passage.
- Failure mode: the model may have cited a source ID that does not exist in the corpus (hallucinated citation).
- For matched claims, verify that the cited passage actually entails the claim (semantic check, not just keyword overlap).
- Failure mode: a passage may mention the same topic but not support the specific assertion.
- Compute the grounding score as verified_claims / total_claims.
- Apply the acceptance threshold policy.
- If score >= threshold: emit verified response with citation metadata.
- If score < threshold: emit structured error with UNGROUNDED_CLAIMS reason code.
- Log the full claim-by-claim verification matrix for audit.
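The steps above can be sketched end-to-end; the `Claim` type and the `matchClaim` callback are hypothetical placeholders for the corpus search and entailment check, which live elsewhere in the pipeline:

```typescript
// End-to-end sketch of the verification loop described in the steps above.
// The Claim type and matchClaim callback are hypothetical placeholders.
interface Claim { id: string; text: string; }
interface Verification { claim: Claim; sourceId?: string; verified: boolean; }

function verifyResponse(
  claims: Claim[],
  matchClaim: (c: Claim) => { sourceId: string } | null,
  threshold: number
): { score: number; decision: "ACCEPT" | "REJECT"; matrix: Verification[] } {
  // Build the claim-by-claim verification matrix.
  const matrix: Verification[] = claims.map((claim) => {
    const match = matchClaim(claim); // corpus search + entailment check
    return { claim, sourceId: match?.sourceId, verified: match !== null };
  });
  const score =
    claims.length === 0 ? 0 : matrix.filter((v) => v.verified).length / claims.length;
  return { score, decision: score >= threshold ? "ACCEPT" : "REJECT", matrix };
}
```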
Minimal concrete example
Response: "Aspirin reduces fever. It was invented in 1897. It cures cancer."
Claim extraction:
C1: "Aspirin reduces fever"
C2: "Aspirin was invented in 1897"
C3: "Aspirin cures cancer"
Citation matching against source: pharma_handbook.pdf
C1 -> pharma_handbook.pdf, section 3.2, "antipyretic properties" -> MATCH
C2 -> pharma_handbook.pdf, section 1.1, "first synthesized 1897" -> MATCH
C3 -> NO SUPPORTING PASSAGE FOUND -> NO MATCH
Grounding score: 2/3 = 0.67
Threshold: 0.90
Decision: REJECT
Reason: UNGROUNDED_CLAIMS (C3 has no source support)
Common misconceptions
- “If the model cites a source, the citation is correct.” Models frequently hallucinate source IDs or cite passages that discuss a related topic but do not support the specific claim.
- “Grounding means the response mentions a source name.” Mentioning a source is not the same as the source entailing the claim. The cited passage must logically support the assertion.
- “A high grounding score means the response is factually correct.” Grounding only means claims are backed by the sources you provided. If the sources themselves contain errors, the grounded response inherits those errors.
- “You can check grounding at the whole-response level.” Whole-response checks are too coarse. A response with 9 correct claims and 1 fabricated claim would pass a naive check. Claim-level decomposition is essential.
Check-your-understanding questions
- Why must you decompose responses into atomic claims rather than checking the response as a whole?
- What is the difference between a hallucinated citation and a misattributed citation?
- How should the system handle a response where the grounding score falls between the reject threshold and the accept threshold?
- Why is span-level attribution stronger than document-level attribution?
Check-your-understanding answers
- Whole-response checks mask individual ungrounded claims. A response could be 90% grounded but contain one dangerous fabrication that only claim-level decomposition would catch.
- A hallucinated citation references a source ID that does not exist in the corpus at all. A misattributed citation references a real source that exists but whose content does not actually support the claim.
- Policy options include: stripping ungrounded claims and returning only verified ones, retrying with a more constrained prompt, or escalating to human review. The choice depends on the domain risk level.
- Document-level attribution only says “this claim comes from document X” but the document may be 50 pages long. Span-level attribution points to the exact paragraph or sentence, enabling precise entailment checking.
Real-world applications
- Healthcare question-answering systems that must cite clinical guidelines for every recommendation.
- Legal research assistants that attribute every case law reference to a specific court ruling.
- Financial compliance bots that ground every regulatory statement to the specific section of the regulation.
- Enterprise knowledge bases that refuse to answer if the internal documentation does not cover the topic.
Where you’ll apply it
- The response pipeline design (claim extraction + citation matching + scoring) forms the core architecture of Project 10.
- The grounding threshold configuration determines the accept/reject behavior of the gateway.
- The claim-citation pair format feeds directly into the compliance logging system (Concept 3).
References
- “Introduction to Information Retrieval” by Manning, Raghavan, Schütze - relevance scoring and evaluation chapters
- NIST AI 600-1 Generative AI Profile - grounding and attribution requirements
- Google DeepMind SAFE (Search-Augmented Factuality Evaluator) framework for claim verification
- Microsoft Groundedness detection in Azure AI Content Safety
Key insights Grounded generation is not about making the model “more accurate”; it is about building a verification pipeline that catches ungrounded claims before they reach the consumer.
Summary Evidence-Bound Response Design transforms LLM responses from opaque text blobs into structured claim-citation assemblies where every factual assertion is individually verified against a bounded source corpus. The pipeline decomposes responses into atomic claims, matches each claim to source passages, computes a grounding score, and enforces accept/reject thresholds. This makes response quality objective, auditable, and enforceable.
Homework/Exercises to practice the concept
- Design a claim decomposition strategy for a medical Q&A response that contains 3 drug dosage claims, 2 contraindication claims, and 1 general health advice claim. Classify which claims must be grounded and which (if any) can pass without citation.
- Define the grounding score thresholds for three domains: casual FAQ bot (low stakes), financial advice (medium stakes), and clinical decision support (high stakes). Justify each threshold.
- Sketch the data structure for a claim-citation pair that includes: claim text, source ID, passage span, entailment confidence, and verification timestamp.
Solutions to the homework/exercises
- For the medical Q&A: all drug dosage claims and contraindication claims must be grounded (these are safety-critical). General health advice (“drink water”) may have a relaxed threshold but should still carry a citation if one exists. The key insight is that not all claims carry equal risk.
- Reasonable thresholds: casual FAQ 0.70 (some ungrounded claims acceptable), financial 0.90 (most claims must be backed), clinical 0.98 (virtually no ungrounded claims tolerated). Each threshold should also define what happens to ungrounded claims: strip, flag, or block.
- The data structure should include: claim_id, claim_text, source_id, passage_start_offset, passage_end_offset, passage_text, entailment_score (0.0-1.0), verified_at (ISO timestamp), and verification_method (semantic_similarity, exact_match, or entailment_model).
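One possible TypeScript shape for this record, with field names taken from the solution above; the types and the sample values are illustrative:

```typescript
// Claim-citation pair record, mirroring the fields listed in the solution.
// All values in the sample are invented for illustration.
type VerificationMethod = "semantic_similarity" | "exact_match" | "entailment_model";

interface ClaimCitationPair {
  claim_id: string;
  claim_text: string;
  source_id: string;
  passage_start_offset: number;
  passage_end_offset: number;
  passage_text: string;
  entailment_score: number;        // 0.0 - 1.0
  verified_at: string;             // ISO 8601 timestamp
  verification_method: VerificationMethod;
}

const sample: ClaimCitationPair = {
  claim_id: "C2",
  claim_text: "Aspirin was invented in 1897",
  source_id: "pharma_handbook",
  passage_start_offset: 312,
  passage_end_offset: 334,
  passage_text: "first synthesized 1897",
  entailment_score: 0.91,
  verified_at: "2025-01-15T10:30:00Z",
  verification_method: "entailment_model",
};
```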
Citation Integrity Validation
Fundamentals Citation Integrity Validation is the verification layer that determines whether a citation actually supports the claim it is attached to. Having a citation is not the same as having a valid citation. Models frequently generate citations that look legitimate but fail under scrutiny: the source ID may not exist in the corpus, the cited passage may discuss a related topic without supporting the specific claim, or the citation may be a near-miss where the passage partially overlaps but does not entail the assertion. This concept forces you to build a verification pipeline that checks every claim-citation pair for genuine evidential support, not just structural presence. Without this layer, a response can appear fully grounded while containing citations that are cosmetic rather than substantive.
Deep Dive into the concept At depth, Citation Integrity Validation is a multi-check verification pipeline that operates on every claim-citation pair produced by the Evidence-Bound Response Design stage.
The first check is source existence verification. When the model emits a citation like “[src-3, section 4.2]”, the system must confirm that source ID “src-3” exists in the allowed corpus and that section 4.2 exists within that source. This catches the most blatant failure mode: hallucinated source IDs. Models are remarkably good at generating plausible-looking reference identifiers that correspond to nothing. A simple lookup against the source index eliminates this class of error entirely.
The second check is passage retrieval. Given a valid source ID and location, the system retrieves the actual passage text. This is where you discover whether the citation points to relevant content or to an unrelated section. Passage retrieval must handle various granularity levels: page numbers, section headers, paragraph indices, or character offsets. The choice of granularity affects both precision and performance.
The third check is entailment verification. This is the most critical and most difficult step. Given the claim text and the retrieved passage, does the passage logically entail the claim? This is not keyword matching. The passage “heart rate may increase with caffeine” does not entail the claim “caffeine causes tachycardia” even though the words overlap. Entailment checking can use embedding similarity (fast but imprecise), natural language inference models (more accurate but slower), or structured comparison logic for numeric claims. The entailment score should be a continuous value (0.0 to 1.0) rather than binary, so you can set different thresholds per domain.
The fourth dimension is citation coverage metrics. For a complete response, you need both precision and recall of attributions. Citation precision measures: of all citations provided, how many are valid? Citation recall measures: of all claims that need citations, how many have them? A response with high precision but low recall has a few valid citations but leaves many claims unattributed. A response with low precision but high recall cites something for every claim but many citations are invalid. You need both metrics to assess citation quality.
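Both metrics can be computed from per-claim verdicts; the `UNCITED` category below is an added assumption for claims that carry no citation at all, alongside the verdict names used later in this section:

```typescript
// Citation precision and recall over per-claim verdicts.
// UNCITED (a claim with no citation at all) is an assumed extra category.
type Verdict = "VALID" | "WEAK" | "INVALID" | "HALLUCINATED" | "UNCITED";

function citationMetrics(verdicts: Verdict[]): { precision: number; recall: number } {
  const cited = verdicts.filter((v) => v !== "UNCITED"); // citations that were provided
  const valid = verdicts.filter((v) => v === "VALID");   // citations that hold up
  return {
    // Of all citations provided, how many are valid?
    precision: cited.length === 0 ? 0 : valid.length / cited.length,
    // Of all claims needing citations, how many have a valid one?
    recall: verdicts.length === 0 ? 0 : valid.length / verdicts.length,
  };
}
```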
The fifth dimension is hallucinated citation detection patterns. Models hallucinate citations in predictable ways: they invent DOIs that follow the format but do not exist, they cite real authors with wrong paper titles, they reference plausible section numbers in documents that have fewer sections, or they generate URLs that return 404. Building a catalog of hallucination patterns lets you add fast pre-checks before the expensive entailment step.
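Two of these pre-checks (unknown source ID, out-of-range section number) reduce to cheap index lookups; the corpus index shape here is a hypothetical example:

```typescript
// Fast pre-checks for two common hallucination patterns: source IDs that are
// not in the corpus, and section numbers outside the document's structure.
// The index shape is a hypothetical example.
interface SourceIndex { [sourceId: string]: { sectionCount: number } }

function preCheck(index: SourceIndex, sourceId: string, section: number): "OK" | "HALLUCINATED" {
  const doc = index[sourceId];
  if (!doc) return "HALLUCINATED"; // source ID not in the allowed corpus
  if (section < 1 || section > doc.sectionCount) return "HALLUCINATED"; // invalid section
  return "OK";
}
```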
Finally, the verification pipeline must handle partial matches gracefully. A passage may support part of a claim but not all of it. For example, a source confirms the year of an event but not the specific dollar amount mentioned in the same claim. The system should decompose compound claims and score each sub-assertion independently when possible.
How this fits into the project This concept powers the core verification engine of Project 10. It sits between the claim decomposer (Concept 1) and the compliance logger (Concept 3). Every claim-citation pair passes through this verification pipeline, and the aggregate results determine whether the response is accepted or rejected.
Definitions & key terms
- Entailment: a logical relationship where passage P entails claim C if C must be true whenever P is true.
- Hallucinated citation: a model-generated reference to a source that does not exist in the allowed corpus.
- Misattributed citation: a reference to a real source whose content does not support the specific claim.
- Citation precision: fraction of provided citations that are verified as valid.
- Citation recall: fraction of claims requiring citations that actually have valid ones.
- Entailment score: continuous value (0.0-1.0) indicating how strongly a passage supports a claim.
Mental model diagram (ASCII)
Claim + Citation Pair
|
v
+---------------------------+
| 1. Source Existence Check | --> source_id in corpus? YES/NO
+---------------------------+
| (if YES)
v
+---------------------------+
| 2. Passage Retrieval | --> fetch passage at cited location
+---------------------------+
|
v
+---------------------------+
| 3. Entailment Check | --> does passage support claim?
| (semantic similarity | entailment_score: 0.0 - 1.0
| or NLI model) |
+---------------------------+
|
v
+---------------------------+
| 4. Verdict | --> VALID (score >= 0.80)
| | WEAK (0.50 <= score < 0.80)
| | INVALID (score < 0.50)
| | HALLUCINATED (source not found)
+---------------------------+
Aggregate across all claims:
Citation Precision = valid_citations / total_citations
Citation Recall = validly_cited_claims / claims_needing_citation
How it works (step-by-step, with invariants and failure modes)
- Receive a list of claim-citation pairs from the claim decomposer.
- For each pair, verify the source ID exists in the allowed corpus index.
- Failure mode: hallucinated source ID. Verdict: HALLUCINATED. No further checks needed.
- Retrieve the passage at the cited location (section, paragraph, page).
- Failure mode: location does not exist (e.g., section 12 in a 10-section document). Verdict: HALLUCINATED.
- Compute entailment score between claim text and retrieved passage.
- Failure mode: passage discusses a related topic but does not entail the claim. Verdict: INVALID or WEAK.
- Invariant: entailment checks must be directional. “Passage entails claim” is different from “claim entails passage.”
- Assign a verdict to each pair: VALID, WEAK, INVALID, or HALLUCINATED.
- Compute aggregate citation precision and citation recall.
- Feed the per-claim verification matrix to the grounding scorer and compliance logger.
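The verdict assignment in the steps above can be sketched as follows; the 0.80 and 0.50 cut-offs match the diagram and are illustrative defaults:

```typescript
// Map a source-existence result and an entailment score to a per-pair verdict.
// Cut-offs (0.80, 0.50) match the diagram above and are illustrative defaults.
type PairVerdict = "VALID" | "WEAK" | "INVALID" | "HALLUCINATED";

function verdictFor(sourceExists: boolean, entailmentScore: number): PairVerdict {
  if (!sourceExists) return "HALLUCINATED"; // no further checks needed
  if (entailmentScore >= 0.8) return "VALID";
  if (entailmentScore >= 0.5) return "WEAK";
  return "INVALID";
}
```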
Minimal concrete example
Verification Matrix:
+-------+--------------------------------+----------+---------+------------------+--------------+
| Claim | Claim Text                     | Source   | Section | Entailment Score | Verdict      |
+-------+--------------------------------+----------+---------+------------------+--------------+
| C1    | "Mileage rate is 70 cents"     | irs_2025 | p.2     | 0.94             | VALID        |
| C2    | "Rate increased from 67 cents" | irs_2025 | p.2     | 0.91             | VALID        |
| C3    | "IRS expects 12M filers"       | irs_2025 | --      | --               | HALLUCINATED |
+-------+--------------------------------+----------+---------+------------------+--------------+
Citation Precision: 2/3 = 0.67 (C3 citation is fabricated)
Citation Recall: 2/3 = 0.67 (C3 has no valid citation)
Common misconceptions
- “If a citation includes a real source name, it must be valid.” The source may be real but the specific section or passage cited may not support the claim, or the model may have invented a section number.
- “Keyword overlap between claim and passage means the citation is valid.” Keywords can overlap without entailment. “Apple stock rose 5%” and “Apple released new stock options” share keywords but have entirely different meanings.
- “Citation verification is too slow for production.” Source existence checks and passage retrieval are fast (index lookups). Only the entailment step is computationally expensive, and it can be optimized with embedding pre-filtering.
- “You only need to check citations when the user asks for them.” Every factual claim in a grounded system needs citation verification, regardless of whether the end user sees the citations. The citations are for system integrity, not just user display.
Check-your-understanding questions
- What is the difference between citation precision and citation recall, and why do you need both?
- How would you verify a numeric claim like “revenue was $4.2B in Q3” against a source passage?
- Why must entailment checking be directional (passage entails claim, not claim entails passage)?
- What is the fastest way to eliminate hallucinated citations before running expensive entailment checks?
Check-your-understanding answers
- Precision measures how many provided citations are actually valid; recall measures how many claims that need citations have them. A system can have high precision (few citations, all valid) but low recall (many claims lack citations). You need both to assess whether the response is genuinely grounded.
- Extract the numeric value and qualifier (revenue, $4.2B, Q3) from the claim, locate the corresponding data in the source passage, and compare the exact values. Numeric claims require exact match or within-tolerance comparison, not just semantic similarity.
- “Passage entails claim” means the passage provides evidence for the claim. “Claim entails passage” would mean the claim provides evidence for the passage, which is the wrong direction. A passage saying “temperatures rose in 2024” entails “it got warmer last year” but not vice versa.
- A source index lookup: check whether the cited source ID and location exist in the corpus before doing any semantic analysis. This eliminates all fabricated references in O(1) per citation.
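The numeric comparison described in the second answer can be sketched as a small helper; extracting the numeric values from the claim and passage text is assumed to happen upstream:

```typescript
// Exact or within-tolerance comparison for numeric claims.
// Parsing values out of claim/passage text is assumed to happen upstream.
function numericMatch(claimValue: number, sourceValue: number, relTolerance = 0): boolean {
  if (relTolerance === 0) return claimValue === sourceValue; // exact match required
  return Math.abs(claimValue - sourceValue) <= Math.abs(sourceValue) * relTolerance;
}
```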
Real-world applications
- Retrieval-Augmented Generation (RAG) pipelines that need to verify that retrieved passages actually support generated answers.
- Legal document review systems that must confirm every case law citation references a real ruling that supports the argument.
- Medical information systems where citing a clinical guideline requires the guideline to actually recommend the stated treatment.
- Academic integrity tools that verify whether cited papers genuinely support the claims attributed to them.
Where you’ll apply it
- The verification engine is the core runtime component of the P10 gateway.
- Entailment scores feed into the grounding score computation (Concept 1).
- The per-claim verification matrix is the primary input to compliance logging (Concept 3).
References
- “Introduction to Information Retrieval” by Manning et al. - evaluation metrics (precision, recall) and relevance scoring
- Stanford Natural Language Inference (SNLI) corpus and entailment model design
- FEVER (Fact Extraction and VERification) shared task methodology
- Google DeepMind SAFE (Search-Augmented Factuality Evaluator) framework
Key insights A citation is only as good as the entailment relationship between the cited passage and the claim it supports; presence of a reference is not evidence of truth.
Summary Citation Integrity Validation builds a verification pipeline that checks every claim-citation pair through source existence, passage retrieval, and entailment scoring. It produces per-claim verdicts (VALID, WEAK, INVALID, HALLUCINATED) and aggregate metrics (citation precision and recall) that feed into the grounding score and compliance log. This layer catches the subtle failures that surface-level citation presence checks miss.
Homework/Exercises to practice the concept
- Design a hallucinated citation detector that catches the five most common hallucination patterns (invented DOIs, wrong section numbers, real author with wrong paper title, plausible URLs that 404, citation to a source not in the corpus). Describe the check for each pattern.
- Build a verification matrix for a 5-claim response where 2 claims are VALID, 1 is WEAK, 1 is INVALID, and 1 is HALLUCINATED. Compute citation precision and recall. Then define what action the gateway should take based on these results.
- Compare embedding cosine similarity versus an NLI (natural language inference) model for entailment checking. List two scenarios where cosine similarity gives a false positive and explain why NLI would catch it.
Solutions to the homework/exercises
- Hallucination detector: (1) DOI format check plus DOI resolution against doi.org, (2) section number bounds check against document structure metadata, (3) author-title cross-reference against source index, (4) URL HEAD request for existence, (5) source ID lookup against the allowed corpus index. Each check runs in order of increasing cost.
- With 2 VALID, 1 WEAK, 1 INVALID, 1 HALLUCINATED: citation precision = 2/5 = 0.40 (or 3/5 = 0.60 if WEAK counts as partial). Citation recall = 2/5 = 0.40 for strictly valid. At these levels, the gateway should REJECT the entire response because both precision and recall are below any reasonable threshold.
- Cosine similarity false positives: (1) “Apple stock price rose” vs “Apple released new stock purchase plan” – high keyword overlap, completely different meanings. (2) “The drug reduces blood pressure” vs “Blood pressure medications have side effects” – topically similar but the passage does not entail the claim. NLI models are trained on directional entailment and would classify these as neutral or contradiction rather than entailment.
Compliance Logging and Traceability
Fundamentals Compliance Logging and Traceability is the audit infrastructure that records every grounding decision so it can be reviewed, queried, and defended after the fact. In regulated domains like finance, healthcare, and legal services, it is not enough for the system to make correct decisions; you must prove that it made correct decisions and explain why. This means every request that flows through the citation grounding gateway must produce a structured, immutable audit record that captures the question asked, the sources consulted, the claims extracted, the verification verdicts, the grounding score, and the final accept/reject decision. Without this layer, you have a system that works but cannot demonstrate that it works, which is insufficient for compliance, incident investigation, or continuous quality improvement.
Deep Dive into the concept At depth, Compliance Logging and Traceability requires designing a structured decision log format, enforcing immutability guarantees, implementing retention policies, and supporting compliance queries.
The first requirement is the structured decision log. Every gateway request produces a log entry that contains: a unique trace_id for correlation, the original question, the source corpus fingerprint (hash of allowed sources), the full claim-citation verification matrix from the integrity validation stage, the aggregate grounding score, the accept/reject decision, and the reason code if rejected. This is not unstructured text logging. Every field has a defined type and purpose because compliance auditors and automated monitors need to query these records programmatically.
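One possible TypeScript shape for this decision log entry; field names mirror the text, and the sample values are invented for illustration:

```typescript
// Structured decision log entry, mirroring the fields described above.
// All sample values are invented for illustration.
interface DecisionLogEntry {
  trace_id: string;
  timestamp: string;                 // ISO 8601
  question: string;
  source_corpus_fingerprint: string; // hash of the allowed sources
  claim_verification_matrix: Array<{
    claim_id: string;
    source_id: string | null;
    verdict: "VALID" | "WEAK" | "INVALID" | "HALLUCINATED";
  }>;
  grounding_score: number;           // 0.0 - 1.0
  decision: "ACCEPT" | "REJECT";
  reason_code: string | null;        // e.g. "UNGROUNDED_CLAIMS" on rejection
}

const entry: DecisionLogEntry = {
  trace_id: "trace-0001",
  timestamp: "2025-01-15T10:30:00Z",
  question: "What is the 2025 IRS standard mileage rate?",
  source_corpus_fingerprint: "sha256:0f3a9c",
  claim_verification_matrix: [
    { claim_id: "C1", source_id: "irs_2025", verdict: "VALID" },
    { claim_id: "C3", source_id: null, verdict: "HALLUCINATED" },
  ],
  grounding_score: 0.5,
  decision: "REJECT",
  reason_code: "UNGROUNDED_CLAIMS",
};
```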
The second requirement is immutability. Audit records must not be modifiable after creation. This is a fundamental requirement in regulated environments because the value of an audit trail depends on trust that it has not been tampered with. Immutability can be achieved through append-only storage, cryptographic chaining (where each record includes a hash of the previous record), or write-once storage backends. The key design decision is how strong the immutability guarantee needs to be: append-only files with OS-level write protection may suffice for internal use, while regulated industries may require cryptographic proof.
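A cryptographic-chaining sketch using Node's built-in crypto module; the record shape and the "GENESIS" sentinel are illustrative choices:

```typescript
import { createHash } from "crypto";

// Hash-chaining sketch: each record's hash covers its payload plus the
// previous record's hash, so any in-place edit breaks the chain.
// The record shape and "GENESIS" sentinel are illustrative.
interface ChainedRecord { payload: string; prevHash: string; hash: string; }

function appendRecord(chain: ChainedRecord[], payload: string): ChainedRecord[] {
  const prevHash = chain.length ? chain[chain.length - 1].hash : "GENESIS";
  const hash = createHash("sha256").update(prevHash + payload).digest("hex");
  return [...chain, { payload, prevHash, hash }];
}

function verifyChain(chain: ChainedRecord[]): boolean {
  return chain.every((rec, i) => {
    const prevHash = i === 0 ? "GENESIS" : chain[i - 1].hash;
    const expected = createHash("sha256").update(prevHash + rec.payload).digest("hex");
    return rec.prevHash === prevHash && rec.hash === expected;
  });
}
```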
The third requirement is retention policies. Not all audit records need to be stored forever. You need a retention policy that specifies how long records are kept, when they are archived, and when they can be deleted. Healthcare regulations (HIPAA) may require 6-year retention. Financial regulations (SOX) may require 7 years. The system must enforce these policies automatically and log retention actions themselves (so you can prove you did not prematurely delete records).
The fourth dimension is compliance query support. Auditors need to ask questions like: “Show me all requests in the last 30 days where grounding score was below 0.80,” or “Find all rejected responses for source corpus X,” or “What was the average grounding score for questions about topic Y last quarter?” Your log format and storage backend must support these queries efficiently. This means structured fields (not free text), indexed columns, and time-range partitioning.
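The first example query ("grounding score below 0.80 in the last 30 days") can be sketched over structured rows; the minimal row shape is an assumption, and a production store would use indexed columns rather than an in-memory filter:

```typescript
// Compliance query sketch over structured log rows: all requests in the
// last 30 days with grounding score below 0.80. Row shape is an assumption.
interface LogRow { timestamp: string; grounding_score: number; decision: string; }

function lowGroundingLast30Days(rows: LogRow[], now: Date): LogRow[] {
  const cutoff = now.getTime() - 30 * 24 * 60 * 60 * 1000;
  return rows.filter(
    (r) => new Date(r.timestamp).getTime() >= cutoff && r.grounding_score < 0.8
  );
}
```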
The fifth dimension is regulatory frameworks. Different domains impose different audit requirements. HIPAA requires traceability of all decisions involving protected health information. SOX requires audit trails for financial reporting systems. GDPR requires the ability to demonstrate lawful processing. The EU AI Act requires documentation of high-risk AI system decisions. Your compliance logging layer must be configurable to meet the specific requirements of the deployment domain.
Finally, compliance logs serve a dual purpose: regulatory defense and continuous improvement. Beyond satisfying auditors, the logs provide the data foundation for monitoring grounding score trends over time, detecting prompt drift, identifying problematic source corpora, and measuring the effectiveness of pipeline changes. This makes the compliance layer not just a cost center but a quality improvement engine.
How this fits into the project This concept governs the audit and observability infrastructure of Project 10. Every verification decision from Concept 1 and Concept 2 flows into the compliance log. The log format and retention policies must be designed before implementation begins because they constrain the data structures used throughout the pipeline.
Definitions & key terms
- Audit trail: a chronological, immutable record of all decisions made by the system.
- Trace ID: a unique identifier that correlates all log entries for a single gateway request across pipeline stages.
- Retention policy: rules governing how long audit records are stored, when they are archived, and when they may be deleted.
- Immutability guarantee: the assurance that a log entry cannot be modified or deleted after creation.
- Decision log: a structured record containing the inputs, verification results, scores, and final outcome of a gateway request.
- Compliance query: a programmatic search across audit records to answer regulatory or quality questions.
Mental model diagram (ASCII)
Gateway Request
|
v
+----------------------------+
| Pipeline Execution |
| (Claim Decompose -> |
| Citation Verify -> |
| Grounding Score -> |
| Accept/Reject) |
+----------------------------+
|
v
+----------------------------+
| Decision Log Builder | --> Structured audit record
| Fields: |
| trace_id |
| timestamp |
| question_hash |
| source_corpus_fingerprint |
| claim_verification_matrix |
| grounding_score |
| decision (ACCEPT/REJECT) |
| reason_code |
+----------------------------+
|
v
+----------------------------+
| Immutable Log Store | --> Append-only, tamper-evident
| (append-only + hash chain) |
+----------------------------+
|
v
+----------------------------+
| Retention Manager | --> Archive at 90d, delete at 7yr
+----------------------------+
|
v
+----------------------------+
| Compliance Query Engine | --> "show all rejects last 30d"
+----------------------------+
How it works (step-by-step, with invariants and failure modes)
- At the end of each gateway request, collect all pipeline outputs: claim list, verification matrix, grounding score, and decision.
- Build a structured decision log entry with all required fields and a unique trace_id.
- Invariant: every field must be populated. Missing fields indicate a pipeline bug that must be caught before persistence.
- Compute a hash of the log entry for tamper detection.
- Append the entry to the immutable log store.
- Failure mode: storage write failure. The system must retry or queue the entry rather than silently dropping it. Audit gaps are compliance violations.
- Apply retention policy: tag the entry with its retention expiry based on the domain configuration.
- Index queryable fields (trace_id, timestamp, grounding_score, decision, source_corpus_id) for compliance queries.
- Failure mode: indexing lag. Queries may miss recent entries. Acceptable for analytics but not for real-time compliance checks.
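The append-only store with hash chaining from the steps above can be sketched in a few lines of TypeScript. This is a minimal in-memory illustration, not a production store: the `HashChainLog` class and its field names are invented for this example, and a real deployment would persist each entry durably (with the retry/queue behavior described above) before acknowledging the request.

```typescript
import { createHash } from "node:crypto";

// Each entry records the hash of the previous entry, so modifying or
// deleting any earlier entry breaks every hash that follows it.
interface LogEntry {
  payload: Record<string, unknown>;
  logHash: string;
  prevLogHash: string;
}

const sha256 = (s: string): string =>
  "sha256:" + createHash("sha256").update(s).digest("hex");

class HashChainLog {
  private entries: LogEntry[] = [];

  append(payload: Record<string, unknown>): LogEntry {
    const prevLogHash =
      this.entries.length > 0
        ? this.entries[this.entries.length - 1].logHash
        : sha256("genesis");
    const logHash = sha256(prevLogHash + JSON.stringify(payload));
    const entry: LogEntry = { payload, logHash, prevLogHash };
    this.entries.push(entry);
    return entry;
  }

  // Recompute the whole chain; returns false if any entry was tampered with.
  verify(): boolean {
    let prev = sha256("genesis");
    for (const e of this.entries) {
      if (e.prevLogHash !== prev) return false;
      if (e.logHash !== sha256(prev + JSON.stringify(e.payload))) return false;
      prev = e.logHash;
    }
    return true;
  }
}
```

Note that `verify()` detects tampering but cannot repair it; the chain is tamper-evident, not tamper-proof, which is the guarantee most audit regimes actually require.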
Minimal concrete example
Decision Log Entry:
{
"trace_id": "trc_p10_008",
"timestamp": "2025-11-15T14:32:01Z",
"question_hash": "sha256:a1b2c3...",
"source_corpus": ["irs_2025_notice.pdf"],
"source_corpus_fingerprint": "sha256:d4e5f6...",
"claims_total": 3,
"claims_verified": 2,
"claims_rejected": 1,
"grounding_score": 0.67,
"decision": "REJECT",
"reason_code": "UNGROUNDED_CLAIMS",
"rejected_claims": [
{
"claim_id": "C3",
"claim_text": "IRS expects 12M filers",
"verdict": "HALLUCINATED",
"detail": "No supporting passage found in source corpus"
}
],
"retention_policy": "FINANCIAL_7YR",
"log_hash": "sha256:f7g8h9...",
"prev_log_hash": "sha256:j0k1l2..."
}
Common misconceptions
- “Logging is just console.log() with a timestamp.” Compliance logging requires structured fields, immutability, retention enforcement, and query support. Unstructured text logs are nearly useless for regulatory audits.
- “We can add audit trails later.” Retrofitting compliance logging is extremely expensive because it requires changing data structures throughout the pipeline. Design it from the start.
- “Immutability means we need a blockchain.” Append-only files with hash chaining provide sufficient immutability for most regulated environments. Blockchain is overkill and adds latency.
- “Retention means keeping everything forever.” Over-retention violates data minimization principles (GDPR) and increases storage costs. Retention policies must balance regulatory minimums with data minimization requirements.
Check-your-understanding questions
- Why must the decision log include the source corpus fingerprint rather than just the source file names?
- What happens if the immutable log store write fails? What are the options?
- How would an auditor use the compliance query engine to investigate a customer complaint about a wrong answer?
- Why does the log entry include both the grounding score and the individual claim verdicts?
Check-your-understanding answers
- File names can refer to different content over time (if a document is updated). The fingerprint (content hash) proves exactly which version of the sources was used at decision time. This is essential for reproducing the decision later.
- Options: (a) synchronous retry with backoff, (b) buffer to a local write-ahead log and flush when storage recovers, (c) fail the entire request (refuse to serve if audit cannot be recorded). Option (c) is required in strictly regulated environments where serving without an audit trail is a compliance violation.
- The auditor would query by trace_id (if available from the response) or by time range and question content. The log entry shows exactly which claims were extracted, which citations were checked, what verdicts were assigned, and why the response was accepted or rejected. This provides a complete causal chain from question to answer.
- The grounding score is useful for aggregate monitoring and threshold enforcement. The individual claim verdicts are necessary for root-cause analysis: when a response is rejected, you need to know which specific claims failed and why.
Real-world applications
- Healthcare AI assistants that must demonstrate HIPAA-compliant decision trails for every clinical recommendation.
- Financial advisory systems where SOX audit requirements mandate traceable decision logs.
- Legal research tools where courts may require evidence that AI-generated summaries are properly sourced.
- EU AI Act compliance for high-risk AI systems that must document their decision-making process.
Where you’ll apply it
- The decision log format is designed in Phase 1 and populated throughout the pipeline.
- Retention policies are configured per deployment domain (healthcare, finance, general).
- The compliance query engine is used for operational monitoring and audit response.
References
- NIST AI 600-1 Generative AI Profile - documentation and traceability requirements
- HIPAA Security Rule - audit control requirements (45 CFR 164.312(b))
- SOX Section 404 - internal controls over financial reporting
- EU AI Act - documentation requirements for high-risk AI systems
- “Designing Data-Intensive Applications” by Martin Kleppmann - log-structured storage and immutability patterns
Key insights A grounding system that cannot prove its decisions are correct is no better than one that makes incorrect decisions, because in regulated environments, unauditable correctness is indistinguishable from undetected failure.
Summary Compliance Logging and Traceability builds the audit infrastructure that records every grounding decision in a structured, immutable, and queryable format. Each gateway request produces a decision log entry with trace ID, source fingerprint, claim verification matrix, grounding score, and decision outcome. The system enforces retention policies, supports compliance queries, and serves as both a regulatory defense mechanism and a quality improvement data source.
Homework/Exercises to practice the concept
- Design a decision log schema with at least 12 fields. For each field, specify the type, whether it is required or optional, and what compliance question it answers.
- Define retention policies for three deployment domains: internal FAQ bot (low regulation), financial advisor (SOX), and clinical decision support (HIPAA). Specify retention duration, archival trigger, and deletion authorization requirements.
- Sketch a compliance query that an auditor would use to find all responses in the last quarter where the grounding score was below 0.80 and the decision was ACCEPT (indicating a potential threshold misconfiguration).
Solutions to the homework/exercises
- The schema should include at minimum: trace_id (string, required, correlates all pipeline stages), timestamp (ISO 8601, required, temporal queries), question_hash (string, required, privacy-safe question identification), source_corpus_fingerprint (string, required, reproducibility), claims_total (integer, required, denominator for grounding score), claims_verified (integer, required, numerator), grounding_score (float, required, threshold enforcement), decision (enum, required, ACCEPT/REJECT), reason_code (string, required if rejected), claim_verification_matrix (array, required, root cause analysis), retention_policy (string, required, lifecycle management), log_hash (string, required, tamper detection).
- Internal FAQ bot: 90-day retention, auto-archive at 30 days, auto-delete at 90 days, no authorization required. Financial advisor: 7-year retention, archive at 1 year to cold storage, deletion requires compliance officer sign-off. Clinical decision support: 6-year minimum retention, archive at 1 year, deletion requires HIPAA privacy officer authorization and documented justification.
- The query: `SELECT * FROM decision_logs WHERE timestamp >= '2025-07-01' AND timestamp < '2025-10-01' AND grounding_score < 0.80 AND decision = 'ACCEPT';` This query surfaces cases where the threshold may be set too low or where the accept gate has a bug.
3. Project Specification
3.1 What You Will Build
An API gateway that blocks ungrounded claims and returns only citation-backed responses.
3.2 Functional Requirements
- Accept a question plus an allowed source set.
- Generate candidate answer with citation placeholders.
- Verify every factual span against supplied evidence.
- Return unified error shape when grounding fails.
3.3 Non-Functional Requirements
- Performance: p95 latency under 1.2s for source set <= 20 documents.
- Reliability: Grounding scores are stable for identical source snapshots.
- Security/Policy: Never cite disallowed or out-of-scope sources.
3.4 Example Usage / Output
$ npm run dev --workspace p10-citation-gateway
[ready] listening on http://localhost:3000
$ curl -s http://localhost:3000/v1/answer \
-H 'content-type: application/json' \
-d '{
"question": "What are the 2025 IRS mileage rates?",
"sources": ["irs_2025_notice.pdf"],
"max_citations": 3
}' | jq
{
"answer": "The 2025 standard mileage rates are ...",
"citations": [
{"source": "irs_2025_notice.pdf", "section": "p.2", "quote_span": "..."}
],
"grounding_score": 0.94,
"trace_id": "trc_p10_008"
}
3.5 Data Formats / Schemas / Protocols
- Request JSON with `question`, `sources`, and constraints.
- Response JSON with `answer`, `citations[]`, `grounding_score`.
- Error JSON with `code`/`message`/`trace_id`/`project`.
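The request, response, and error shapes for section 3.5 can be written down as TypeScript interfaces. These mirror the JSON examples in this document; the exact field set is a design choice, and the `isGatewayError` type guard is a hypothetical helper added for illustration.

```typescript
interface AnswerRequest {
  question: string;
  sources: string[];        // allowlisted source document IDs
  max_citations?: number;
}

interface Citation {
  source: string;           // must be one of the request's sources
  section: string;          // e.g. "p.2"
  quote_span: string;       // exact supporting passage
}

interface AnswerResponse {
  answer: string;
  citations: Citation[];
  grounding_score: number;  // claims_verified / claims_total
  trace_id: string;
}

interface GatewayError {
  error: {
    code: string;           // e.g. "UNGROUNDED_CLAIM"
    message: string;
    trace_id: string;
    project: string;
  };
}

// Hypothetical discriminator: the unified error shape always nests under
// an "error" key, so callers can branch on the response body's shape.
function isGatewayError(body: unknown): body is GatewayError {
  return typeof body === "object" && body !== null && "error" in body;
}
```

Nesting failures under a single `error` key is what makes the "unified error shape" requirement testable: clients need exactly one check, regardless of which pipeline stage rejected the request.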
3.6 Edge Cases
- Claim partially supported across multiple documents.
- Source contains conflicting statements.
- Question cannot be answered from supplied sources.
- Citation pointer format drift between parser versions.
3.7 Real World Outcome
This project is complete when your API can serve valid requests with typed responses and reject invalid/high-risk requests with a unified error shape.
3.7.1 How to Run (Copy/Paste)
$ npm run dev --workspace p10-citation-gateway
3.7.2 Golden Path Demo (Deterministic)
Use fixed fixture payloads and verify the same response shape and decision fields every run.
3.7.3 API Endpoints
| Method | Endpoint | Purpose |
|--------|----------|---------|
| POST | /v1/answer | Execute project-specific core flow |
3.7.4 Success Response Example
$ curl -s http://localhost:3000/v1/answer \
-H 'content-type: application/json' \
-d '{
"question": "What are the 2025 IRS mileage rates?",
"sources": ["irs_2025_notice.pdf"],
"max_citations": 3
}' | jq
{
"answer": "The 2025 standard mileage rates are ...",
"citations": [
{"source": "irs_2025_notice.pdf", "section": "p.2", "quote_span": "..."}
],
"grounding_score": 0.94,
"trace_id": "trc_p10_008"
}
3.7.5 Error Response Example
$ curl -s http://localhost:3000/v1/answer \
-H 'content-type: application/json' \
-d '{
"question": "Who will win the 2028 election?",
"sources": ["irs_2025_notice.pdf"],
"max_citations": 3
}' | jq
{
"error": {
"code": "UNGROUNDED_CLAIM",
"message": "No supporting evidence found in provided sources.",
"trace_id": "trc_p10_009",
"project": "P10"
}
}
4. Solution Architecture
4.1 High-Level Design
User Input / Trigger
|
v
+-------------------------+
| Claim Extractor |
+-------------------------+
|
v
+-------------------------+
| Evidence Matcher |
+-------------------------+
|
v
+-------------------------+
| Gateway Policy |
+-------------------------+
|
v
Artifacts / API / UI / Logs
4.2 Key Components
| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Claim Extractor | Breaks answer into verifiable claim units. | Score claims independently for precise failure reasons. |
| Evidence Matcher | Links claims to source spans. | Require exact source provenance for each claim. |
| Gateway Policy | Blocks ungrounded responses. | Fail closed on unverifiable high-stakes claims. |
4.3 Data Structures (No Full Code)
P10_Request:
- trace_id
- input payload/context
- policy profile
P10_Decision:
- status (ALLOW | DENY | RETRY | ESCALATE | PROMOTE | ROLLBACK)
- reason_code
- artifact pointers
4.4 Algorithm Overview
Key algorithm: Policy-aware decision pipeline
- Normalize input and attach deterministic trace metadata.
- Run contract/schema validation and project-specific core checks.
- Apply policy gates and decide: success, retry, deny, escalate, or rollback.
- Persist artifacts and publish operational metrics.
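The accept/reject step of this pipeline reduces to a small pure function. The sketch below assumes claim verdicts have already been computed upstream; the function and field names are illustrative. Note the fail-closed branch for empty claim sets, which keeps an answer with no extractable claims from passing with a vacuous score.

```typescript
type Decision = "ALLOW" | "DENY";

interface ClaimVerdict {
  claimId: string;
  supported: boolean;  // did evidence matching find a backing span?
}

// Grounding score = supported claims / total claims, gated by a policy threshold.
function decide(
  verdicts: ClaimVerdict[],
  threshold: number
): { decision: Decision; score: number } {
  // Fail closed: no verifiable claims means nothing is grounded.
  if (verdicts.length === 0) return { decision: "DENY", score: 0 };
  const supported = verdicts.filter((v) => v.supported).length;
  const score = supported / verdicts.length;
  return { decision: score >= threshold ? "ALLOW" : "DENY", score };
}
```

Keeping the gate pure (no I/O, no clock) is what makes the determinism requirement in section 3.3 cheap to test: identical verdicts always yield an identical decision.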
Complexity Analysis (conceptual):
- Time: O(n) over fixture/request items in a batch run.
- Space: O(n) for traces and report artifacts.
5. Implementation Guide
5.1 Development Environment Setup
# 1) Install dependencies
# 2) Prepare fixtures under fixtures/
# 3) Run the project command(s) listed in section 3.7
5.2 Project Structure
p10/
├── src/
├── fixtures/
├── policies/
├── out/
└── README.md
5.3 The Core Question You’re Answering
“How do I guarantee that high-stakes answers are source-grounded and auditable?”
This question matters because it forces the project to produce objective evidence instead of relying on subjective prompt impressions.
5.4 Concepts You Must Understand First
- Grounded generation pipelines
- Why does this concept matter for P10?
- Book Reference: “Introduction to Information Retrieval” by Manning et al.
- Citation span verification
- Why does this concept matter for P10?
- Book Reference: Fact-checking and claim-evidence alignment literature
- Safety fallback for unverifiable claims
- Why does this concept matter for P10?
- Book Reference: “Site Reliability Engineering” by Google - error budgeting mindset
5.5 Questions to Guide Your Design
- Boundary and contracts
- What is the smallest safe contract surface for citation grounding gateway?
- Which failure reasons must be explicit and machine-readable?
- Runtime policy
- What is allowed automatically, what needs retry, and what must escalate?
- Which policy checks must happen before any side effect?
- Evidence and observability
- What traces/metrics are required for fast incident triage?
- What specific thresholds trigger rollback or human review?
5.6 Thinking Exercise
Pre-Mortem for Citation Grounding Gateway
Before implementing, write down 10 ways this project can fail in production. Classify each failure into: contract, policy, security, or operations.
Questions to answer:
- Which failures can be prevented before runtime?
- Which failures require runtime detection and escalation?
5.7 The Interview Questions They’ll Ask
- “How do you define ‘grounded’ in a measurable way?”
- “What should happen when evidence is ambiguous?”
- “How do you design citation objects for auditability?”
- “Why can citations still be wrong even when present?”
- “How would you tune for both speed and verification quality?”
5.8 Hints in Layers
Hint 1: Verify claim-by-claim Whole-answer checks are too coarse for debugging.
Hint 2: Use source allowlists Hard-limit what documents can be cited.
Hint 3: Fail closed If evidence is missing, return error not guess.
Hint 4: Record spans Store exact source spans for each cited claim.
5.9 Books That Will Help
| Topic | Book | Chapter |
|-------|------|---------|
| Retrieval basics | “Introduction to Information Retrieval” by Manning et al. | Ranking + evaluation chapters |
| Reliable service behavior | “Site Reliability Engineering” by Google | Error budget mindset |
| Data contract design | “Designing Data-Intensive Applications” by Martin Kleppmann | Schema evolution chapters |
5.10 Implementation Phases
Phase 1: Foundation
- Define contracts, policy profiles, and deterministic fixtures.
- Build the core execution path and baseline artifact output.
- Checkpoint: One golden-path scenario runs end-to-end with trace id and artifact.
Phase 2: Core Functionality
- Add project-specific evaluation/routing/verification logic.
- Add error paths with unified reason codes.
- Checkpoint: Golden-path and one failure-path both behave deterministically.
Phase 3: Operational Hardening
- Add metrics, trend reporting, and release/rollback or escalation gates.
- Document runbook and incident/debug flow.
- Checkpoint: Team member can reproduce output from clean checkout.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Validation order | Late checks vs early checks | Early checks | Fail-fast saves cost and reduces unsafe execution |
| Failure handling | Silent retries vs explicit reason codes | Explicit reason codes | Enables automation and faster debugging |
| Rollout/escalation | Manual-only vs policy-driven | Policy-driven with manual override | Balances speed and safety |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Validate deterministic building blocks | schema checks, policy gates, parser behaviors |
| Integration Tests | Verify end-to-end project path | golden-path command/API flow |
| Edge Case Tests | Ensure robust failure handling | malformed fixture, blocked policy action |
6.2 Critical Test Cases
- Golden path succeeds and emits expected artifact shape.
- High-risk/invalid path returns deterministic error with reason code.
- Replay with same seed/config yields same decision summary.
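The replay test above can be expressed as a small helper that runs the same fixture twice and compares decision summaries. `runGateway` is a stand-in: in the real suite you would pass your actual entry point (with a fixed seed/config), and the summary fields would match your decision-log schema.

```typescript
// Minimal sketch of a determinism check: same fixture, same config,
// same decision summary, or the test fails.
interface Summary {
  decision: string;
  grounding_score: number;
  reason_code?: string;
}

function assertDeterministic(
  run: (fixture: string) => Summary,
  fixture: string
): void {
  const first = run(fixture);
  const second = run(fixture);
  if (JSON.stringify(first) !== JSON.stringify(second)) {
    throw new Error(`non-deterministic decision for fixture ${fixture}`);
  }
}
```

Comparing serialized summaries rather than trace IDs keeps the check focused on decision behavior; trace IDs are expected to differ per run unless you also pin them.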
6.3 Test Data
fixtures/golden_case.*
fixtures/failure_case.*
fixtures/edge_cases/*
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---------|---------|----------|
| “Citations exist but don’t support claim” | Span matching is weak or shallow. | Implement claim-level entailment checks. |
| “Gateway answers from memory” | Prompt allows unsupported prior knowledge. | Force source-only answering mode. |
| “Latency too high” | Verification runs on whole answer monolith. | Split into claim units and parallelize verification. |
7.2 Debugging Strategies
- Re-run deterministic fixtures with fixed seed and compare trace ids.
- Diff latest artifacts against last known-good baseline.
- Isolate whether failure is contract, policy, or runtime dependency related.
7.3 Performance Traps
- Unbounded retries inflate latency and cost.
- Overly broad logging can slow hot paths.
- Missing cache/canonicalization can create avoidable compute churn.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add one new fixture category and expected outcome labels.
- Add one new reason code with deterministic validation.
8.2 Intermediate Extensions
- Add dashboard-ready trend exports.
- Add automated regression diff against previous run artifacts.
8.3 Advanced Extensions
- Integrate with rollout gates or human approval workflows.
- Add chaos-style fault injection and recovery assertions.
9. Real-World Connections
9.1 Industry Applications
- PromptOps platform teams operating AI features under compliance constraints.
- Internal AI governance tooling for release safety and incident response.
9.2 Related Open Source Projects
- LangChain/LangSmith style eval and tracing workflows.
- OpenTelemetry-based observability stacks for decision traces.
9.3 Interview Relevance
- Demonstrates ability to convert probabilistic model behavior into deterministic software guarantees.
- Shows practical production-thinking: contracts, policies, monitoring, and operational controls.
10. Resources
10.1 Essential Reading
- OpenAI/Anthropic/Google provider docs for structured outputs, tool calling, and prompt controls.
- OWASP LLM Top 10 and NIST AI RMF guidance for safety and governance.
10.2 Video Resources
- Talks on LLM eval systems, PromptOps, and AI safety operations.
10.3 Tools & Documentation
- JSON schema validators, policy engines, and tracing infrastructure docs.
10.4 Related Projects in This Series
- Previous projects: build specialized primitives.
- Next projects: integrate these primitives into broader operational systems.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the core risk boundaries and policy gates for this project.
- I can explain the artifact format and why each field exists.
- I can justify the release/escalation criteria.
11.2 Implementation
- Golden-path and failure-path flows both work.
- Deterministic artifacts are produced and reproducible.
- Observability fields are present for debugging and audits.
11.3 Growth
- I can describe one tradeoff I made and why.
- I can explain this project design in an interview setting.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Golden path works with deterministic output artifact.
- At least one failure-path scenario returns unified error shape/reason code.
- Core metrics are emitted and documented.
Full Completion:
- Includes automated tests, trend reporting, and reproducible runbook.
- Includes operational thresholds for promote/rollback or escalate/approve.
Excellence (Above & Beyond):
- Integrates with adjacent projects (registry, rollout, firewall, HITL) cleanly.
- Demonstrates incident drill replay and fast root-cause workflow.