Project 2: Conversation Summarization & Distillation Pipeline
Build a deterministic summarization pipeline that converts raw conversations into structured, auditable memory.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2 |
| Time Estimate | Weekend |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Java |
| Coolness Level | Level 3 |
| Business Potential | Level 3 |
| Prerequisites | JSON handling, prompt templating, basic evaluation |
| Key Topics | Summarization, consolidation, lineage |
1. Learning Objectives
By completing this project, you will:
- Design a structured summary format for long-term memory.
- Build a repeatable distillation pipeline that produces stable outputs.
- Attach lineage links from summary items to raw episodes.
- Detect and correct summary drift over time.
2. All Theory Needed (Per-Concept Breakdown)
Memory Consolidation and Summarization Fidelity
Fundamentals
Consolidation is the process of turning raw episodes into structured memory that is compact, stable, and retrievable. In agent systems, raw logs grow quickly and are expensive to search, so summaries become the default long-term memory. But summaries are lossy: they compress information, remove nuance, and can introduce errors. A good consolidation pipeline must define what gets preserved (facts, preferences, open tasks) and what is discarded (irrelevant chatter). Summaries must also be auditable: each summarized item should link back to the specific episodes that support it.
Deep Dive into the Concept
Summarization in memory systems is not “write a shorter version.” It is information distillation with explicit structure. The first decision is schema: what fields matter for downstream retrieval? A common approach is to split the summary into categories such as facts, preferences, tasks, decisions, and unresolved questions. Each category is designed for a retrieval policy: facts for semantic memory, preferences for user constraints, tasks for planning. Once the schema is fixed, the summarizer becomes a structured extractor rather than a generic summarizer.
Fidelity is the next challenge. Summaries are prone to hallucinations because the model will attempt to complete a coherent narrative even when the source is ambiguous. To protect against this, you need verification passes that check every extracted item against source text. A practical method is to attach evidence spans (line ranges or episode IDs) to each summary item. This also supports lineage: if a summary is wrong, you can trace it back and correct the source.
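The evidence-span idea can be made concrete by storing each summary item with the episode IDs that back it and rejecting IDs that do not resolve to a known episode. A minimal sketch; the `SummaryItem` shape and field names are illustrative, not a fixed API:

```python
from dataclasses import dataclass

@dataclass
class SummaryItem:
    text: str
    evidence_ids: list  # episode IDs that are claimed to support this item

def check_lineage(item: SummaryItem, episodes: dict) -> list:
    """Return the evidence IDs that do not resolve to a known episode."""
    return [eid for eid in item.evidence_ids if eid not in episodes]

# Illustrative episode store: {episode_id: raw text}
episodes = {"EPI-0041": "I'm migrating my Flask app to FastAPI."}
good = SummaryItem("User is migrating Flask to FastAPI", ["EPI-0041"])
bad = SummaryItem("User prefers Django", ["EPI-9999"])
assert check_lineage(good, episodes) == []
assert check_lineage(bad, episodes) == ["EPI-9999"]
```

Items with unresolved lineage should be dropped or flagged before storage, so a bad summary can always be traced back to (or excluded for lacking) its sources.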
Consolidation must also deal with time. Preferences change, tasks are completed, and old facts become stale. A memory system should version summaries and apply decay policies. One approach is to keep a rolling “current summary” and archive older versions for audit. Another is to store summary items with an expiration or “last confirmed” timestamp. This prevents the system from confidently retrieving outdated or incorrect memory.
Summarization fidelity is also affected by compression ratio. If you compress too aggressively, you lose key details; if you compress too lightly, retrieval becomes noisy. A workable target for conversational memory is 5-15% of the original token count, but the right ratio depends on your memory budget and retrieval strategy. To monitor fidelity, you should track a set of validation probes: queries whose expected answers depend on summary content. If probe accuracy drops after a summary update, you know the new summary lost or distorted information.
Finally, summaries should be structured for retrieval. This means chunking by category, attaching metadata (confidence, sensitivity), and assigning ownership (who said it, when). You will use these design choices later when you build hybrid routers and preference stores. Consolidation is not an isolated feature; it is the backbone that makes long-term memory feasible at scale.
From a systems perspective, this concept must be treated as a first-class interface between data and behavior. That means you need explicit invariants (what must always be true), observability (how you know it is true), and failure signatures (how it breaks when it is not). In practice, engineers often skip this and rely on ad-hoc fixes, which creates hidden coupling between the memory subsystem and the rest of the agent stack. A better approach is to model the concept as a pipeline stage with clear inputs, outputs, and preconditions: if inputs violate the contract, the stage should fail fast rather than silently corrupt memory. This is especially important because memory errors are long-lived and compound over time. You should also define operational metrics that reveal drift early. Examples include: the percentage of memory entries that lack required metadata, the ratio of retrieved memories that are later unused by the model, or the fraction of queries that trigger a fallback route because the primary memory store is empty. These metrics are not just for dashboards; they are design constraints that force you to keep the system testable and predictable.
Another critical dimension is lifecycle management. The concept may work well at small scale but degrade as the memory grows. This is where policies and thresholds matter: you need rules for promotion, demotion, merging, or deletion that prevent the memory from becoming a landfill. The policy should be deterministic and versioned. When it changes, you should be able to replay historical inputs and measure the delta in outputs. This is the same discipline used in data engineering for schema changes and backfills, and it applies equally to memory systems. Finally, remember that memory is an interface to user trust. If the memory system is noisy, the agent feels unreliable; if it is overly strict, the agent feels forgetful. The best designs expose these trade-offs explicitly, so you can tune them according to product goals rather than guessing in the dark.
How This Fits Into the Projects
This concept is the core of Project 2 and feeds into Project 5 (reflection) and Project 10 (paging and memory tiers).
Definitions & key terms
- Consolidation: Process of compressing raw memory into structured form.
- Summary drift: When summaries diverge from source over time.
- Lineage: Links from summary items to source episodes.
- Compression ratio: Summary size relative to raw size.
Mental model diagram (ASCII)
Raw Episodes -> Summarizer -> Structured Summary
      |             |                 |
      v             v                 v
   Archive       Lineage          Retrieval
How It Works (Step-by-Step)
- Ingest raw conversation logs with timestamps.
- Extract summary items into fixed schema fields.
- Attach lineage to source episode IDs.
- Validate each item against source text.
- Store versioned summaries and apply decay rules.
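The five steps above can be sketched as one pipeline function with pluggable stages. The stubs below stand in for a real extractor (which would call a model) and a real verifier; all names are illustrative:

```python
def distill(episodes: dict, extract, verify) -> dict:
    """episodes: {episode_id: text}; extract/verify are pluggable stages."""
    items = extract(episodes)                                   # step 2: schema extraction
    for it in items:                                            # step 3: attach lineage,
        it["evidence"] = [e for e in it["evidence"] if e in episodes]  # drop invalid IDs
    verified = [it for it in items if verify(it, episodes)]     # step 4: validate vs source
    return {"version": 1, "items": verified}                    # step 5: versioned output

episodes = {"EPI-0041": "I'm migrating my Flask app to FastAPI."}
extract = lambda eps: [{"text": "User is migrating Flask to FastAPI",
                        "evidence": ["EPI-0041", "EPI-9999"]}]  # one bad ID on purpose
verify = lambda it, eps: any("flask" in eps[e].lower() for e in it["evidence"])
result = distill(episodes, extract, verify)
assert result["items"][0]["evidence"] == ["EPI-0041"]  # invalid ID stripped
```

Keeping each stage behind a function boundary is what makes the pipeline replayable: you can swap in a new extractor and re-run the same episodes to measure the output delta.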
Minimal Concrete Example
summary:
  facts:
    - text: "User is migrating Flask to FastAPI"
      confidence: 0.86
      evidence: [EPI-0041, EPI-0043]
  preferences:
    - text: "Prefers step-by-step explanations"
      evidence: [EPI-0038]
Common Misconceptions
- “Summaries are always safer than raw logs.” (False: they can introduce errors.)
- “One summary format fits all.” (False: schema depends on retrieval use cases.)
Check-Your-Understanding Questions
- Why is lineage critical for summaries?
- What is summary drift?
- How does compression ratio affect retrieval quality?
Check-Your-Understanding Answers
- It enables auditing and correction when summaries are wrong.
- It is the divergence between summarized memory and actual events.
- High compression risks losing details; low compression increases noise.
Real-World Applications
- CRM systems summarizing customer interactions.
- Personal assistants maintaining preference memory.
Where You’ll Apply It
- In this project: §5.4 Concepts You Must Understand First and §6 Testing Strategy.
- Also used in: Project 5, Project 10.
References
- “Generative Agents” (memory stream + reflection) - https://www.egoai.com/research/interactive-simulacra
- “AI Engineering” by Chip Huyen - Ch. 6 (RAG and agent memory)
Key Insights
Summaries are not just shorter text; they are structured memory that must be auditable and versioned.
Summary
Consolidation transforms raw logs into structured memory, but only with strong schema, lineage, and validation can it be trusted.
Homework/Exercises to Practice the Concept
- Design a summary schema with 4 fields and explain why each matters.
- Take a short transcript and create summary items with evidence IDs.
Solutions to the Homework/Exercises
- Example schema: facts, preferences, tasks, decisions.
- Each summary item should list supporting episode IDs.
3. Project Specification
3.1 What You Will Build
A distillation pipeline that:
- Reads raw chat logs
- Generates structured summaries
- Attaches lineage for each summary item
- Runs verification checks against source
- Produces versioned summary outputs
3.2 Functional Requirements
- Structured Output: Summaries must follow fixed schema.
- Lineage Links: Each summary item lists source episodes.
- Verification: Each item is verified against source text.
- Versioning: Summaries are stored with version numbers.
- Decay: Old summaries expire or are archived.
3.3 Non-Functional Requirements
- Performance: Distill 1k messages in under 60 seconds.
- Reliability: Same input yields same summary with fixed settings.
- Usability: Clear CLI output and diff-friendly summary format.
3.4 Example Usage / Output
$ distill run --input logs/session_042.json --template facts,preferences,open_tasks
[OK] summary_id=SUM-0042 version=1
$ distill show SUM-0042
Facts:
- User is migrating Flask to FastAPI
Preferences:
- Prefers step-by-step explanations
Open tasks:
- Choose a vector store
Lineage: 17 episodes
3.5 Data Formats / Schemas / Protocols
Summary JSON shape:
{
  "summary_id": "SUM-0042",
  "version": 1,
  "facts": [
    {"text": "User is migrating Flask to FastAPI", "evidence": ["EPI-0041"]}
  ],
  "preferences": [
    {"text": "Prefers step-by-step explanations", "evidence": ["EPI-0038"]}
  ],
  "open_tasks": [
    {"text": "Choose a vector store", "evidence": ["EPI-0046"]}
  ]
}
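A schema validator for this shape falls out directly from the functional requirements (fixed schema, evidence on every item). A sketch; the error-message wording is my own:

```python
import json

def validate_summary(doc: dict) -> list:
    """Return a list of schema problems; an empty list means well-formed."""
    errors = []
    for key in ("summary_id", "version"):
        if key not in doc:
            errors.append("missing " + key)
    for cat in ("facts", "preferences", "open_tasks"):
        for i, item in enumerate(doc.get(cat, [])):
            if not item.get("text"):
                errors.append(f"{cat}[{i}]: empty text")
            if not item.get("evidence"):
                errors.append(f"{cat}[{i}]: no evidence")
    return errors

doc = json.loads("""{
  "summary_id": "SUM-0042", "version": 1,
  "facts": [{"text": "User is migrating Flask to FastAPI", "evidence": ["EPI-0041"]}]
}""")
assert validate_summary(doc) == []
assert "facts[0]: no evidence" in validate_summary(
    {"summary_id": "S", "version": 1, "facts": [{"text": "x"}]})
```

Running this check on every summarizer output is what turns the “Evidence IDs missing or invalid” edge case into a deterministic failure instead of silent corruption.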
3.6 Edge Cases
- Summarizer returns empty fields
- Evidence IDs missing or invalid
- Conflicting summary items across versions
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
$ distill run --input logs/session_042.json --template facts,preferences,open_tasks
$ distill verify SUM-0042
$ distill show SUM-0042
3.7.2 Golden Path Demo (Deterministic)
$ distill verify SUM-0042
[VERIFY] facts=3 checked, errors=0
[VERIFY] preferences=2 checked, errors=0
[RESULT] PASS
exit_code=0
3.7.3 Failure Demo (Deterministic)
$ distill verify SUM-0099
[VERIFY] facts=2 checked, errors=1
[ERROR] summary item not supported by source
exit_code=3
4. Solution Architecture
4.1 High-Level Design
+--------------+    +-------------+    +--------------+
|  Log Ingest  | -> |  Summarizer | -> |   Verifier   |
+--------------+    +-------------+    +--------------+
       |                  |                   |
       v                  v                   v
    Archive      Structured Output      Version Store
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Ingest | Load raw logs | Stable ordering and timestamps |
| Summarizer | Extract structured items | Fixed schema and templates |
| Verifier | Validate against source | Evidence matching strategy |
| Store | Versioned summaries | Versioning and decay rules |
4.3 Data Structures (No Full Code)
SummaryItem:
text: string
evidence_ids: list
confidence: float
4.4 Algorithm Overview
Key Algorithm: Summary Verification
- For each summary item, collect evidence episodes.
- Check if item text is supported by evidence.
- Flag items without support.
Complexity Analysis:
- Time: O(n * e), where n is the number of summary items and e is the evidence count per item
- Space: O(n) for summary items
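A lightweight way to approximate “supported by evidence” is content-word overlap between the item text and its cited episodes. This is a heuristic sketch, not a substitute for a model-based entailment check; the stopword list and 0.5 threshold are arbitrary choices:

```python
def token_overlap_support(item_text: str, evidence_texts: list,
                          threshold: float = 0.5) -> bool:
    """Flag an item as supported if enough of its content words
    appear somewhere in the cited evidence text."""
    stop = {"the", "a", "an", "is", "to", "of", "and", "user"}
    words = {w for w in item_text.lower().split() if w not in stop}
    if not words:
        return False  # an empty item can never be supported
    evidence = " ".join(evidence_texts).lower()
    hits = sum(1 for w in words if w in evidence)
    return hits / len(words) >= threshold

assert token_overlap_support(
    "User is migrating Flask to FastAPI",
    ["I'm migrating my Flask app to FastAPI."])
assert not token_overlap_support("User prefers Django",
                                 ["Let's talk about FastAPI."])
```

This gives you a cheap first-pass filter; items that fail it are the ones worth sending to a more expensive verification step.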
5. Implementation Guide
5.1 Development Environment Setup
- Prepare a config file with schema templates
- Store logs in a normalized JSON format
5.2 Project Structure
project-root/
├── src/
│ ├── ingest/
│ ├── summarize/
│ ├── verify/
│ └── store/
├── tests/
└── README.md
5.3 The Core Question You’re Answering
“How do I compress long conversations into memory without losing truth?”
5.4 Concepts You Must Understand First
- Summary schema design
- Verification and lineage
5.5 Questions to Guide Your Design
- How will you ensure that every summary item has evidence?
- How will you detect and version corrections?
5.6 Thinking Exercise
Create two summary versions from the same transcript and identify the differences.
5.7 The Interview Questions They’ll Ask
- “What is summary drift and how do you detect it?”
- “Why is lineage important in memory systems?”
- “How do you validate summaries?”
- “What schema would you choose and why?”
- “How does decay reduce noise?”
5.8 Hints in Layers
- Hint 1: Use a fixed template.
- Hint 2: Attach evidence IDs to every item.
- Hint 3: Add a verifier step.
- Hint 4: Version every summary.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Agent systems | “AI Engineering” by Chip Huyen | Ch. 6 |
| Data lineage | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 4 |
5.10 Implementation Phases
Phase 1: Foundation (4-6 hours)
- Build log ingest and schema templates
Phase 2: Core Functionality (6-8 hours)
- Summarize and verify
Phase 3: Polish & Edge Cases (4-6 hours)
- Versioning and decay
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Summary schema | Free-form / Structured | Structured | Enables retrieval and auditing |
| Verification | Manual / Automated | Automated | Scales to large logs |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Verify schema outputs | Missing fields |
| Integration | Summarize + verify | Full pipeline |
| Edge | Conflicting summaries | Version mismatch |
6.2 Critical Test Cases
- Summary item with no evidence must fail.
- Same input must yield same summary.
- Outdated summaries must be archived.
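The first two critical test cases can be written as plain asserts against stub implementations; the function names here are placeholders for your own pipeline code:

```python
def summarize(episodes: dict) -> dict:
    """Deterministic stub: canonical (sorted) ordering guarantees that the
    same input always yields the same output, regardless of dict order."""
    return {"version": 1, "items": sorted("fact: " + t for t in episodes.values())}

def verify_item(item: dict) -> bool:
    """A summary item with no evidence must fail verification."""
    return bool(item.get("evidence"))

episodes = {"EPI-2": "b", "EPI-1": "a"}
# Same input must yield the same summary, even with reversed insertion order.
assert summarize(episodes) == summarize(dict(reversed(list(episodes.items()))))
# An item without evidence must be rejected.
assert not verify_item({"text": "orphan", "evidence": []})
```

The archiving test case is structural rather than functional: assert that after a version bump, the prior version is still readable from the archive store.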
6.3 Test Data
conversation:
- "User prefers step-by-step explanations"
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Vague schema | Unusable summaries | Tighten fields |
| No verification | Hallucinated memory | Add evidence checks |
| No versioning | Drift without trace | Store versions |
7.2 Debugging Strategies
- Compare summaries across versions
- Sample items and verify evidence
7.3 Performance Traps
- Overly complex verification loops
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a confidence field to each item
8.2 Intermediate Extensions
- Add delta summaries between versions
8.3 Advanced Extensions
- Add semantic diffing between summary versions
9. Real-World Connections
9.1 Industry Applications
- Customer support summarization
- Personal assistant preference memory
9.2 Related Open Source Projects
- LangChain Memory
- LlamaIndex Memory
9.3 Interview Relevance
- Summarization fidelity and schema design are common agent system questions.
10. Resources
10.1 Essential Reading
- “AI Engineering” by Chip Huyen - Ch. 6
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4
10.2 Video Resources
- Talks on RAG and memory consolidation
10.3 Tools & Documentation
- SQLite documentation
10.4 Related Projects in This Series
- Project 5 (reflection)
- Project 10 (paging and memory tiers)
11. Self-Assessment Checklist
11.1 Understanding
- I can explain summary drift and lineage.
11.2 Implementation
- Summary output is deterministic.
- Verification checks work.
11.3 Growth
- I can justify my summary schema choices.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Pipeline generates structured summaries
- Verification step runs
Full Completion:
- Versioning and decay implemented
Excellence (Going Above & Beyond):
- Semantic diffing and confidence tracking