Project 2: Conversation Summarization & Distillation Pipeline
Build a deterministic summarization pipeline that converts raw conversations into structured, auditable memory.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2 |
| Time Estimate | Weekend |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Java |
| Coolness Level | Level 3 |
| Business Potential | Level 3 |
| Prerequisites | JSON handling, prompt templating, basic evaluation |
| Key Topics | Summarization, consolidation, lineage |
1. Learning Objectives
By completing this project, you will:
- Design a structured summary format for long-term memory.
- Build a repeatable distillation pipeline that produces stable outputs.
- Attach lineage links from summary items to raw episodes.
- Detect and correct summary drift over time.
2. All Theory Needed (Per-Concept Breakdown)
Memory Consolidation and Summarization Fidelity
Fundamentals
Consolidation is the process of turning raw episodes into structured memory that is compact, stable, and retrievable. In agent systems, raw logs grow quickly and are expensive to search, so summaries become the default long-term memory. But summaries are lossy: they compress information, remove nuance, and can introduce errors. A good consolidation pipeline must define what gets preserved (facts, preferences, open tasks) and what is discarded (irrelevant chatter). Summaries must also be auditable: each summarized item should link back to the specific episodes that support it.
Deep Dive into the Concept
Summarization in memory systems is not “write a shorter version.” It is information distillation with explicit structure. The first decision is schema: what fields matter for downstream retrieval? A common approach is to split the summary into categories such as facts, preferences, tasks, decisions, and unresolved questions. Each category is designed for a retrieval policy: facts for semantic memory, preferences for user constraints, tasks for planning. Once the schema is fixed, the summarizer becomes a structured extractor rather than a generic summarizer.
Fidelity is the next challenge. Summaries are prone to hallucinations because the model will attempt to complete a coherent narrative even when the source is ambiguous. To protect against this, you need verification passes that check every extracted item against source text. A practical method is to attach evidence spans (line ranges or episode IDs) to each summary item. This also supports lineage: if a summary is wrong, you can trace it back and correct the source.
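The evidence-span idea can be made concrete by storing each summary item with the episode IDs that back it and rejecting IDs that do not resolve to a known episode. A minimal sketch; the `SummaryItem` shape and field names are illustrative, not a fixed API:

```python
from dataclasses import dataclass

@dataclass
class SummaryItem:
    text: str
    evidence_ids: list  # episode IDs that are claimed to support this item

def check_lineage(item: SummaryItem, episodes: dict) -> list:
    """Return the evidence IDs that do not resolve to a known episode."""
    return [eid for eid in item.evidence_ids if eid not in episodes]

# Illustrative episode store: {episode_id: raw text}
episodes = {"EPI-0041": "I'm migrating my Flask app to FastAPI."}
good = SummaryItem("User is migrating Flask to FastAPI", ["EPI-0041"])
bad = SummaryItem("User prefers Django", ["EPI-9999"])
assert check_lineage(good, episodes) == []
assert check_lineage(bad, episodes) == ["EPI-9999"]
```

Items with unresolved lineage should be dropped or flagged before storage, so a bad summary can always be traced back to (or excluded for lacking) its sources.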
Consolidation must also deal with time. Preferences change, tasks are completed, and old facts become stale. A memory system should version summaries and apply decay policies. One approach is to keep a rolling “current summary” and archive older versions for audit. Another is to store summary items with an expiration or “last confirmed” timestamp. This prevents the system from confidently retrieving outdated or incorrect memory.
Summarization fidelity is also affected by compression ratio. If you compress too aggressively, you lose key details; if you compress too lightly, retrieval becomes noisy. A workable target for conversational memory is 5-15% of the original token count, but the right ratio depends on your memory budget and retrieval strategy. To monitor fidelity, you should track a set of validation probes: queries whose expected answers depend on summary content. If probe accuracy drops after a summary update, you know the new summary lost or distorted information.
Finally, summaries should be structured for retrieval. This means chunking by category, attaching metadata (confidence, sensitivity), and assigning ownership (who said it, when). You will use these design choices later when you build hybrid routers and preference stores. Consolidation is not an isolated feature; it is the backbone that makes long-term memory feasible at scale.
From a systems perspective, this concept must be treated as a first-class interface between data and behavior. That means you need explicit invariants (what must always be true), observability (how you know it is true), and failure signatures (how it breaks when it is not). In practice, engineers often skip this and rely on ad-hoc fixes, which creates hidden coupling between the memory subsystem and the rest of the agent stack. A better approach is to model the concept as a pipeline stage with clear inputs, outputs, and preconditions: if inputs violate the contract, the stage should fail fast rather than silently corrupt memory. This is especially important because memory errors are long-lived and compound over time. You should also define operational metrics that reveal drift early. Examples include: the percentage of memory entries that lack required metadata, the ratio of retrieved memories that are later unused by the model, or the fraction of queries that trigger a fallback route because the primary memory store is empty. These metrics are not just for dashboards; they are design constraints that force you to keep the system testable and predictable.
Another critical dimension is lifecycle management. The concept may work well at small scale but degrade as the memory grows. This is where policies and thresholds matter: you need rules for promotion, demotion, merging, or deletion that prevent the memory from becoming a landfill. The policy should be deterministic and versioned. When it changes, you should be able to replay historical inputs and measure the delta in outputs. This is the same discipline used in data engineering for schema changes and backfills, and it applies equally to memory systems. Finally, remember that memory is an interface to user trust. If the memory system is noisy, the agent feels unreliable; if it is overly strict, the agent feels forgetful. The best designs expose these trade-offs explicitly, so you can tune them according to product goals rather than guessing in the dark.
How This Fits Into the Projects
This concept is the core of Project 2 and feeds into Project 5 (reflection) and Project 10 (paging and memory tiers).
Definitions & key terms
- Consolidation: Process of compressing raw memory into structured form.
- Summary drift: When summaries diverge from source over time.
- Lineage: Links from summary items to source episodes.
- Compression ratio: Summary size relative to raw size.
Mental model diagram (ASCII)
Raw Episodes -> Summarizer -> Structured Summary
      |             |                 |
      v             v                 v
   Archive       Lineage          Retrieval
How It Works (Step-by-Step)
- Ingest raw conversation logs with timestamps.
- Extract summary items into fixed schema fields.
- Attach lineage to source episode IDs.
- Validate each item against source text.
- Store versioned summaries and apply decay rules.
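The five steps above can be sketched as one pipeline function with pluggable stages. The stubs below stand in for a real extractor (which would call a model) and a real verifier; all names are illustrative:

```python
def distill(episodes: dict, extract, verify) -> dict:
    """episodes: {episode_id: text}; extract/verify are pluggable stages."""
    items = extract(episodes)                                   # step 2: schema extraction
    for it in items:                                            # step 3: attach lineage,
        it["evidence"] = [e for e in it["evidence"] if e in episodes]  # drop invalid IDs
    verified = [it for it in items if verify(it, episodes)]     # step 4: validate vs source
    return {"version": 1, "items": verified}                    # step 5: versioned output

episodes = {"EPI-0041": "I'm migrating my Flask app to FastAPI."}
extract = lambda eps: [{"text": "User is migrating Flask to FastAPI",
                        "evidence": ["EPI-0041", "EPI-9999"]}]  # one bad ID on purpose
verify = lambda it, eps: any("flask" in eps[e].lower() for e in it["evidence"])
result = distill(episodes, extract, verify)
assert result["items"][0]["evidence"] == ["EPI-0041"]  # invalid ID stripped
```

Keeping each stage behind a function boundary is what makes the pipeline replayable: you can swap in a new extractor and re-run the same episodes to measure the output delta.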
Minimal Concrete Example
summary:
  facts:
    - text: "User is migrating Flask to FastAPI"
      confidence: 0.86
      evidence: [EPI-0041, EPI-0043]
  preferences:
    - text: "Prefers step-by-step explanations"
      evidence: [EPI-0038]
Common Misconceptions
- “Summaries are always safer than raw logs.” (False: they can introduce errors.)
- “One summary format fits all.” (False: schema depends on retrieval use cases.)
Check-Your-Understanding Questions
- Why is lineage critical for summaries?
- What is summary drift?
- How does compression ratio affect retrieval quality?
Check-Your-Understanding Answers
- It enables auditing and correction when summaries are wrong.
- It is the divergence between summarized memory and actual events.
- High compression risks losing details; low compression increases noise.
Real-World Applications
- CRM systems summarizing customer interactions.
- Personal assistants maintaining preference memory.
Where You’ll Apply It
- In this project: §5.4 Concepts You Must Understand First and §6 Testing Strategy.
- Also used in: Project 5, Project 10.
References
- “Generative Agents” (memory stream + reflection) - https://www.egoai.com/research/interactive-simulacra
- “AI Engineering” by Chip Huyen - Ch. 6 (RAG and agent memory)
Key Insights
Summaries are not just shorter text; they are structured memory that must be auditable and versioned.
Summary
Consolidation transforms raw logs into structured memory, but only with strong schema, lineage, and validation can it be trusted.
Homework/Exercises to Practice the Concept
- Design a summary schema with 4 fields and explain why each matters.
- Take a short transcript and create summary items with evidence IDs.
Solutions to the Homework/Exercises
- Example schema: facts, preferences, tasks, decisions.
- Each summary item should list supporting episode IDs.
3. Project Specification
3.1 What You Will Build
A distillation pipeline that:
- Reads raw chat logs
- Generates structured summaries
- Attaches lineage for each summary item
- Runs verification checks against source
- Produces versioned summary outputs
3.2 Functional Requirements
- Structured Output: Summaries must follow fixed schema.
- Lineage Links: Each summary item lists source episodes.
- Verification: Each item is verified against source text.
- Versioning: Summaries are stored with version numbers.
- Decay: Old summaries expire or are archived.
3.3 Non-Functional Requirements
- Performance: Distill 1k messages in under 60 seconds.
- Reliability: Same input yields same summary with fixed settings.
- Usability: Clear CLI output and diff-friendly summary format.
3.4 Example Usage / Output
$ distill run --input logs/session_042.json --template facts,preferences,open_tasks
[OK] summary_id=SUM-0042 version=1
$ distill show SUM-0042
Facts:
- User is migrating Flask to FastAPI
Preferences:
- Prefers step-by-step explanations
Open tasks:
- Choose a vector store
Lineage: 17 episodes
3.5 Data Formats / Schemas / Protocols
Summary JSON shape:
{
  "summary_id": "SUM-0042",
  "version": 1,
  "facts": [
    {"text": "User is migrating Flask to FastAPI", "evidence": ["EPI-0041"]}
  ],
  "preferences": [
    {"text": "Prefers step-by-step explanations", "evidence": ["EPI-0038"]}
  ],
  "open_tasks": [
    {"text": "Choose a vector store", "evidence": ["EPI-0046"]}
  ]
}
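A schema validator for this shape falls out directly from the functional requirements (fixed schema, evidence on every item). A sketch; the error-message wording is my own:

```python
import json

def validate_summary(doc: dict) -> list:
    """Return a list of schema problems; an empty list means well-formed."""
    errors = []
    for key in ("summary_id", "version"):
        if key not in doc:
            errors.append("missing " + key)
    for cat in ("facts", "preferences", "open_tasks"):
        for i, item in enumerate(doc.get(cat, [])):
            if not item.get("text"):
                errors.append(f"{cat}[{i}]: empty text")
            if not item.get("evidence"):
                errors.append(f"{cat}[{i}]: no evidence")
    return errors

doc = json.loads("""{
  "summary_id": "SUM-0042", "version": 1,
  "facts": [{"text": "User is migrating Flask to FastAPI", "evidence": ["EPI-0041"]}]
}""")
assert validate_summary(doc) == []
assert "facts[0]: no evidence" in validate_summary(
    {"summary_id": "S", "version": 1, "facts": [{"text": "x"}]})
```

Running this check on every summarizer output is what turns the “Evidence IDs missing or invalid” edge case into a deterministic failure instead of silent corruption.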
3.6 Edge Cases
- Summarizer returns empty fields
- Evidence IDs missing or invalid
- Conflicting summary items across versions
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
$ distill run --input logs/session_042.json --template facts,preferences,open_tasks
$ distill verify SUM-0042
$ distill show SUM-0042
3.7.2 Golden Path Demo (Deterministic)
$ distill verify SUM-0042
[VERIFY] facts=3 checked, errors=0
[VERIFY] preferences=2 checked, errors=0
[RESULT] PASS
exit_code=0
3.7.3 Failure Demo (Deterministic)
$ distill verify SUM-0099
[VERIFY] facts=2 checked, errors=1
[ERROR] summary item not supported by source
exit_code=3
4. Solution Architecture
4.1 High-Level Design
+--------------+    +-------------+    +--------------+
|  Log Ingest  | -> |  Summarizer | -> |   Verifier   |
+--------------+    +-------------+    +--------------+
       |                  |                   |
       v                  v                   v
    Archive      Structured Output      Version Store
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Ingest | Load raw logs | Stable ordering and timestamps |
| Summarizer | Extract structured items | Fixed schema and templates |
| Verifier | Validate against source | Evidence matching strategy |
| Store | Versioned summaries | Versioning and decay rules |
4.3 Data Structures (No Full Code)
SummaryItem:
text: string
evidence_ids: list
confidence: float
4.4 Algorithm Overview
Key Algorithm: Summary Verification
- For each summary item, collect evidence episodes.
- Check if item text is supported by evidence.
- Flag items without support.
Complexity Analysis:
- Time: O(n * e), where n is the number of summary items and e is the evidence count per item
- Space: O(n) for summary items
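A lightweight way to approximate “supported by evidence” is content-word overlap between the item text and its cited episodes. This is a heuristic sketch, not a substitute for a model-based entailment check; the stopword list and 0.5 threshold are arbitrary choices:

```python
def token_overlap_support(item_text: str, evidence_texts: list,
                          threshold: float = 0.5) -> bool:
    """Flag an item as supported if enough of its content words
    appear somewhere in the cited evidence text."""
    stop = {"the", "a", "an", "is", "to", "of", "and", "user"}
    words = {w for w in item_text.lower().split() if w not in stop}
    if not words:
        return False  # an empty item can never be supported
    evidence = " ".join(evidence_texts).lower()
    hits = sum(1 for w in words if w in evidence)
    return hits / len(words) >= threshold

assert token_overlap_support(
    "User is migrating Flask to FastAPI",
    ["I'm migrating my Flask app to FastAPI."])
assert not token_overlap_support("User prefers Django",
                                 ["Let's talk about FastAPI."])
```

This gives you a cheap first-pass filter; items that fail it are the ones worth sending to a more expensive verification step.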
5. Implementation Guide
5.1 Development Environment Setup
- Prepare a config file with schema templates
- Store logs in a normalized JSON format
5.2 Project Structure
project-root/
├── src/
│ ├── ingest/
│ ├── summarize/
│ ├── verify/
│ └── store/
├── tests/
└── README.md
5.3 The Core Question You’re Answering
“How do I compress long conversations into memory without losing truth?”
5.4 Concepts You Must Understand First
- Summary schema design
- Verification and lineage
5.5 Questions to Guide Your Design
- How will you ensure that every summary item has evidence?
- How will you detect and version corrections?
5.6 Thinking Exercise
Create two summary versions from the same transcript and identify the differences.
5.7 The Interview Questions They’ll Ask
- “What is summary drift and how do you detect it?”
- “Why is lineage important in memory systems?”
- “How do you validate summaries?”
- “What schema would you choose and why?”
- “How does decay reduce noise?”
5.8 Hints in Layers
- Hint 1: Use a fixed template.
- Hint 2: Attach evidence IDs to every item.
- Hint 3: Add a verifier step.
- Hint 4: Version every summary.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Agent systems | “AI Engineering” by Chip Huyen | Ch. 6 |
| Data lineage | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 4 |
5.10 Implementation Phases
Phase 1: Foundation (4-6 hours)
- Build log ingest and schema templates
Phase 2: Core Functionality (6-8 hours)
- Summarize and verify
Phase 3: Polish & Edge Cases (4-6 hours)
- Versioning and decay
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Summary schema | Free-form / Structured | Structured | Enables retrieval and auditing |
| Verification | Manual / Automated | Automated | Scales to large logs |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Verify schema outputs | Missing fields |
| Integration | Summarize + verify | Full pipeline |
| Edge | Conflicting summaries | Version mismatch |
6.2 Critical Test Cases
- Summary item with no evidence must fail.
- Same input must yield same summary.
- Outdated summaries must be archived.
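The first two critical test cases can be written as plain asserts against stub implementations; the function names here are placeholders for your own pipeline code:

```python
def summarize(episodes: dict) -> dict:
    """Deterministic stub: canonical (sorted) ordering guarantees that the
    same input always yields the same output, regardless of dict order."""
    return {"version": 1, "items": sorted("fact: " + t for t in episodes.values())}

def verify_item(item: dict) -> bool:
    """A summary item with no evidence must fail verification."""
    return bool(item.get("evidence"))

episodes = {"EPI-2": "b", "EPI-1": "a"}
# Same input must yield the same summary, even with reversed insertion order.
assert summarize(episodes) == summarize(dict(reversed(list(episodes.items()))))
# An item without evidence must be rejected.
assert not verify_item({"text": "orphan", "evidence": []})
```

The archiving test case is structural rather than functional: assert that after a version bump, the prior version is still readable from the archive store.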
6.3 Test Data
conversation:
- "User prefers step-by-step explanations"
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Vague schema | Unusable summaries | Tighten fields |
| No verification | Hallucinated memory | Add evidence checks |
| No versioning | Drift without trace | Store versions |
7.2 Debugging Strategies
- Compare summaries across versions
- Sample items and verify evidence
7.3 Performance Traps
- Overly complex verification loops
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a confidence field to each item
8.2 Intermediate Extensions
- Add delta summaries between versions
8.3 Advanced Extensions
- Add semantic diffing between summary versions
9. Real-World Connections
9.1 Industry Applications
- Customer support summarization
- Personal assistant preference memory
9.2 Related Open Source Projects
- LangChain Memory
- LlamaIndex Memory
9.3 Interview Relevance
- Summarization fidelity and schema design are common agent system questions.
10. Resources
10.1 Essential Reading
- “AI Engineering” by Chip Huyen - Ch. 6
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4
10.2 Video Resources
- Talks on RAG and memory consolidation
10.3 Tools & Documentation
- SQLite documentation
10.4 Related Projects in This Series
- Project 5 (reflection)
- Project 10 (paging and memory tiers)
11. Self-Assessment Checklist
11.1 Understanding
- I can explain summary drift and lineage.
11.2 Implementation
- Summary output is deterministic.
- Verification checks work.
11.3 Growth
- I can justify my summary schema choices.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Pipeline generates structured summaries
- Verification step runs
Full Completion:
- Versioning and decay implemented
Excellence (Going Above & Beyond):
- Semantic diffing and confidence tracking