Project 4: Context Window Manager (What to Include, What to Compress)
Build an intelligent context budgeting system that selects and compresses information to fit expensive, limited context windows
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1-2 weeks |
| Language | Python (Alternatives: TypeScript) |
| Prerequisites | Basic knowledge of RAG, tokenization, Projects 1-2 |
| Key Topics | Context Engineering, Summarization, Token Budgeting, Relevance Ranking |
| Knowledge Area | Context Engineering / Summarization |
| Software/Tool | Tiktoken / Tokenizers |
| Main Book | "Designing Data-Intensive Applications" (retrieval patterns) |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 4. The "Open Core" Infrastructure |
1. Learning Objectives
By completing this project, you will:
- Master Token Budgeting: Learn to treat context windows as a scarce, expensive resource requiring precise allocation
- Implement Document Selection: Build ranking algorithms that prioritize the most relevant information
- Design Summarization Strategies: Compress conversation history while preserving critical information
- Handle Provenance Tracking: Ensure compressed data remains traceable to original sources
- Understand "Lost in the Middle": Mitigate the phenomenon where models ignore mid-context information
- Build Traceability Manifests: Create audit trails showing what was included, excluded, and why
- Optimize for Different Use Cases: Balance precision, recall, and cost across various application types
2. Theoretical Foundation
2.1 Core Concepts
The Context Window as a Budget
Modern LLMs have fixed context windows:
- GPT-4: 8K or 32K tokens (GPT-4 Turbo: 128K)
- Claude 3: 200K tokens
- Llama 3: 8K tokens
The Core Problem: Your application often has more information than fits:
Available Information:
- System prompt: 500 tokens
- Conversation history (20 messages): 3,000 tokens
- Retrieved documents (50 docs): 25,000 tokens
- User query: 50 tokens
TOTAL: 28,550 tokens
Context Window: 8,000 tokens
OVERFLOW: 20,550 tokens (72% must be cut!)
Naive Solutions (All Bad):
- Truncate arbitrarily: Cut the last 20K tokens → lose critical context
- Compress everything: Summarize all docs → lose factual precision
- Give up: Only use the first 8K tokens → ignore most of the evidence
Smart Solution: Treat context as a budget allocation problem:
- Allocate tokens based on importance
- Summarize low-priority content
- Drop irrelevant content entirely
- Track what was included/excluded
The Lost in the Middle Phenomenon
Research Finding (Liu et al., 2023): LLMs have a U-shaped attention curve:
Attention/recall quality
 100% |*                                         *
      | *                                       *
      |  *                                     *
  50% |    *                                 *
      |       *  *  *  *  *  *  *  *  *  *
   0% |___________________________________________
       Beginning            Middle             End
                      Context position
Key Insights:
- Beginning: High attention (system prompt, key instructions)
- Middle: Low attention (often ignored, even if relevant!)
- End: High attention (recent context, user query)
Implication for Context Management:
- Place most important facts at beginning or end
- Avoid burying critical information in the middle
- Reorder documents by relevance before packing
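A minimal sketch of such reordering (an illustration for this guide, not part of the library built below): place the strongest documents at the two ends of the packed list, so the weakest material lands in the low-attention middle.
def reorder_for_attention(docs_by_relevance: list) -> list:
    """Input: documents sorted most-relevant-first.
    Output: best docs at the start and end, weakest in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
# Example: relevance order [d1, d2, d3, d4, d5] -> packed [d1, d3, d5, d4, d2]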
Token Counting Precision
Common Mistake: Using character count or word count as a proxy for tokens.
Reality: Tokenization is complex and model-specific:
# Different tokenization for the same text
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
text = "Hello, world!"
# Approximate (WRONG): character count / 4
len(text) / 4  # 3.25 -- inaccurate
# GPT-4 tokenization (CORRECT):
len(enc.encode(text))  # 4 tokens: ["Hello", ",", " world", "!"]
# Different text with the same character count:
text2 = "你好世界"  # 4 characters ("Hello world" in Chinese)
len(enc.encode(text2))  # different count, different tokens -- language matters
Why Precision Matters:
- Context window limits are strict (exactly 8192 tokens, not ~8000)
- Exceeding limit causes API errors or silent truncation
- Cost is per token, not per character
- Must reserve space for output (response tokens)
Tool: Use tiktoken (Python) or model-specific tokenizers for exact counts.
Relevance Scoring Strategies
Goal: Rank documents/messages by relevance to user query.
Method 1: Keyword Matching (Simple)
def keyword_score(query: str, document: str) -> float:
query_words = set(query.lower().split())
doc_words = set(document.lower().split())
overlap = query_words & doc_words
    return len(overlap) / len(query_words) if query_words else 0.0
# Example:
query = "refund policy 30 days"
doc = "Our refund policy allows returns within 30 days"
score = keyword_score(query, doc) # 1.0 (all four query words match)
Method 2: TF-IDF (Better)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([query] + documents)
# Cosine similarity between query and each doc
from sklearn.metrics.pairwise import cosine_similarity
scores = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])
Method 3: Semantic Embeddings (Best)
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('all-MiniLM-L6-v2')
# Normalize so the dot product below equals cosine similarity
query_embedding = model.encode([query], normalize_embeddings=True)
doc_embeddings = model.encode(documents, normalize_embeddings=True)
scores = np.dot(query_embedding, doc_embeddings.T)[0]
Trade-offs:
| Method | Speed | Accuracy | Setup |
|---|---|---|---|
| Keywords | Fast | Low | None |
| TF-IDF | Medium | Medium | Sklearn |
| Embeddings | Slow (first run) | High | Model download |
Summarization vs. Selection
Two Approaches to Fitting Context:
Selection (Discrete):
10 Documents (5000 tokens total)
→ Rank by relevance
→ Select top 3 (1,500 tokens)
→ Fits in budget (2,000 tokens)
Summarization (Continuous):
10 Documents (5000 tokens total)
→ Summarize each doc: 500 → 50 tokens
→ All 10 docs compressed to 500 tokens
→ Fits in budget (2,000 tokens)
When to Use Each:
| Approach | Best For | Pros | Cons |
|---|---|---|---|
| Selection | Factual QA, citations | Preserves exact text, traceable | May lose relevant info in dropped docs |
| Summarization | Long conversations, broad context | Retains all documents | Loses precision, harder to cite |
| Hybrid | Most use cases | Balance precision and coverage | More complex to implement |
Hybrid Strategy (Recommended):
- Select top K most relevant documents (full text)
- Summarize next K moderately relevant documents
- Drop remaining documents entirely
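A hedged sketch of this three-tier packing (the `summarize` callable is an assumption; it could be an LLM call or a cheap extractive heuristic):
def hybrid_pack(ranked_docs, full_k=3, summarized_k=5, summarize=None):
    """ranked_docs: (doc_id, text) pairs sorted most-relevant-first."""
    packed = []
    for i, (doc_id, text) in enumerate(ranked_docs):
        if i < full_k:
            packed.append((doc_id, text))             # tier 1: full text
        elif i < full_k + summarized_k and summarize:
            packed.append((doc_id, summarize(text)))  # tier 2: compressed
        # tier 3: everything else is dropped
    return packed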
Provenance and Traceability
Problem: After summarization/selection, you must know where facts came from.
Bad (No Provenance):
summary = llm.summarize(docs)
# Later: User asks "Where did you get that info?"
# You: "Uh... somewhere in the docs?" โ
Good (With Provenance):
summary = llm.summarize(docs)
manifest = {
"summary_source_ids": ["doc_1", "doc_3", "doc_7"],
"original_tokens": 5000,
"compressed_tokens": 500,
"compression_ratio": 0.1,
"method": "extractive_summary"
}
# Later: User asks for source
# You: "That came from doc_3, section 2" โ
Why It Matters:
- Regulatory compliance (GDPR, healthcare)
- User trust (can verify claims)
- Debugging (trace incorrect answers)
- Auditing (track what the model "saw")
2.2 Why This Matters
Production Relevance
Real-World Scenarios Requiring Context Management:
- Customer Support (RAG)
- Knowledge base: 10,000 articles
- User query: "How do I return a damaged item?"
- Must select 3-5 most relevant articles from 10K
- Context budget: 8K tokens (system + history + articles + response)
- Long Conversations (Chatbots)
- Conversation: 50+ messages over multiple sessions
- Context window: 8K tokens
- Must compress early messages while retaining key facts
- Document Analysis (Legal, Medical)
- Input: 500-page contract
- Task: Answer specific questions
- Cannot fit entire contract โ must select relevant sections
- Multi-Document QA
- Input: 100 research papers
- Task: Synthesize answer across papers
- Must rank and select most relevant papers
Consequences of Poor Context Management:
| Problem | Impact | Example |
|---|---|---|
| Exceed context limit | API error or silent truncation | Conversation crashes mid-session |
| Include irrelevant docs | Model distraction, wrong answers | Cites shipping policy for billing question |
| Stuff everything | "Lost in the Middle" → ignores key facts | Correct answer is in doc 25/50, model misses it |
| Over-summarize | Loss of precision | "Policy allows refunds" (but within 30 days? for what items?) |
| No provenance | Cannot verify or cite sources | Regulatory violation, user distrust |
Industry Applications
| Company | Use Case | Strategy |
|---|---|---|
| Notion | Long document chat | Hybrid: Select relevant sections + summarize context |
| Anthropic | Constitutional AI | Reranking: Most important rules at start/end |
| OpenAI | ChatGPT long conversations | Sliding window + summary of old messages |
| Perplexity | Multi-source answers | Top-K selection from web search results |
2.3 Common Misconceptions
| Misconception | Reality |
|---|---|
| "Bigger context windows solve everything" | Bigger windows cost more and still suffer from "Lost in the Middle" |
| "Summarization preserves all information" | Summarization is lossy compression; details are always lost |
| "Token count ≈ character count / 4" | Tokenization varies by language, special characters, and model |
| "Models read all context equally" | Attention is U-shaped (strong at the ends, weak in the middle) |
| "Latest model = biggest context" | Claude 3 (200K) vs GPT-4 (32K); cost also scales with context size |
3. Project Specification
3.1 What You Will Build
A context budgeting library that:
- Accepts inputs: User query, retrieved documents, conversation history, token budget
- Ranks documents: By relevance to query
- Allocates tokens: Distributes budget across system, query, history, documents
- Compresses content: Summarizes or truncates as needed
- Maintains provenance: Tracks which sources were included/excluded
- Outputs: Final prompt string + traceability manifest
Core Question This Tool Answers:
"How do I fit a world of information into a tiny, expensive window without the model getting confused?"
3.2 Functional Requirements
FR1: Token Counting
Requirements:
- Support multiple tokenizers (GPT-4, Claude, Llama)
- Provide exact token counts for strings
- Calculate total token usage across all components
Interface:
class TokenCounter:
def __init__(self, model: str = "gpt-4"):
self.model = model
self.tokenizer = self._load_tokenizer(model)
def count(self, text: str) -> int:
"""Return exact token count"""
pass
def count_messages(self, messages: List[dict]) -> int:
"""Count tokens in chat format"""
pass
def fits_in_budget(self, text: str, budget: int) -> bool:
"""Check if text fits in token budget"""
pass
FR2: Document Ranking
Requirements:
- Rank documents by relevance to query
- Support multiple ranking methods (keywords, TF-IDF, embeddings)
- Allow custom scoring functions
Interface:
class DocumentRanker:
def rank(
self,
query: str,
documents: List[Document]
) -> List[tuple[Document, float]]:
"""Return documents sorted by relevance score"""
pass
FR3: Budget Allocation
Requirements:
- Allocate token budget across components
- Reserve space for system prompt, query, response
- Distribute remaining budget to history and documents
Example Budget:
Total Budget: 8000 tokens
Allocations:
- System Prompt: 500 tokens (fixed)
- User Query: 50 tokens (measured)
- Response: 1000 tokens (reserved)
- Subtotal: 1550 tokens
Remaining for Context: 6450 tokens
- Conversation History: 2000 tokens (30%)
- Retrieved Documents: 4450 tokens (70%)
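The same arithmetic as a small sketch (the 30/70 split is illustrative; the example above rounds the resulting numbers):
def allocate_budget(total=8000, system=500, query=50, response=1000,
                    history_share=0.3):
    remaining = total - (system + query + response)  # 6,450 here
    history = int(remaining * history_share)         # ~1,935 (shown rounded above)
    documents = remaining - history                  # the rest goes to documents
    return {"history": history, "documents": documents}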
FR4: Content Compression
Requirements:
- Summarize conversation history
- Truncate or summarize documents
- Preserve critical information
Strategies:
class ContentCompressor:
def compress_history(
self,
messages: List[dict],
budget: int
) -> List[dict]:
"""
Compress conversation history to fit budget.
Strategy:
- Keep system message (always)
- Keep last N messages (recency)
- Summarize middle messages
"""
pass
def compress_documents(
self,
documents: List[Document],
budget: int,
strategy: str = "hybrid"
) -> List[Document]:
"""
Compress documents to fit budget.
Strategies:
- select: Keep top-K docs (drop others)
- summarize: Summarize each doc
- hybrid: Keep top docs, summarize rest
"""
pass
FR5: Traceability Manifest
Requirements:
- Track what was included in final context
- Track what was excluded and why
- Record compression ratios
- Provide source mapping
Manifest Structure:
@dataclass
class ContextManifest:
total_budget: int
used_tokens: int
components: dict # {component: token_count}
documents_included: List[str] # Document IDs
documents_excluded: List[str]
documents_summarized: List[str]
history_compressed: bool
history_original_tokens: int
history_compressed_tokens: int
selection_reasoning: str
timestamp: str
def to_dict(self) -> dict:
"""Export as JSON"""
pass
def print_summary(self):
"""Human-readable summary"""
pass
3.3 Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Precision | <1% error in token counting | API failures occur at exact limits |
| Performance | <500ms for 100 documents | Real-time applications |
| Memory Efficiency | Handle 1000+ documents | Large knowledge bases |
| Extensibility | Custom ranking functions | Different use cases |
| Observability | Detailed manifests | Debugging and auditing |
3.4 Example Usage
Basic Usage:
from context_manager import ContextManager, Document
# Initialize manager
manager = ContextManager(
model="gpt-4",
total_budget=8000,
response_budget=1000 # Reserve for response
)
# Define documents
documents = [
Document(id="doc_1", content="Our refund policy allows..."),
Document(id="doc_2", content="Shipping takes 3-5 days..."),
Document(id="doc_3", content="For technical support..."),
# ... 50 more documents
]
# User query
query = "What is your refund policy?"
# Build context
result = manager.build_context(
query=query,
documents=documents,
conversation_history=[
{"role": "user", "content": "Hi there"},
{"role": "assistant", "content": "Hello! How can I help?"}
],
system_prompt="You are a helpful customer support agent."
)
# Access results
print(result.final_prompt) # Ready to send to LLM
print(result.manifest.used_tokens) # 7,432 / 8,000
print(result.manifest.documents_included) # ["doc_1", "doc_3", "doc_7"]
# Verify budget compliance
assert result.manifest.used_tokens <= 8000
Console Output:
================================================================
              CONTEXT WINDOW MANAGER - Budget Report
================================================================
Total Budget: 8,000 tokens
Used: 7,432 tokens (92.9%), including 1,000 reserved for response

BUDGET BREAKDOWN
----------------------------------------------------------------
Component                  Tokens      % of Budget
----------------------------------------------------------------
System Prompt                 500             6.3%
User Query                     48             0.6%
Conversation History        1,284            16.1%
Retrieved Documents         4,600            57.5%
Response (Reserved)         1,000            12.5%
----------------------------------------------------------------
TOTAL                       7,432            92.9%

DOCUMENT SELECTION
----------------------------------------------------------------
Ranked 50 documents by relevance to query
INCLUDED (full text, 3 documents):
  1. doc_1  (relevance: 0.95, tokens: 1,200)
     "Our refund policy allows returns within 30 days..."
  2. doc_3  (relevance: 0.82, tokens: 980)
     "For refund requests, please contact support@..."
  3. doc_7  (relevance: 0.71, tokens: 870)
     "Refund processing takes 5-7 business days..."
SUMMARIZED (5 documents):
  4. doc_12 (relevance: 0.61, original: 1,500 -> summary: 200 tokens)
  5. doc_18 (relevance: 0.58, original: 2,100 -> summary: 180 tokens)
  ... 3 more
EXCLUDED (42 documents):
  Low relevance to query (score < 0.5)

CONVERSATION HISTORY
----------------------------------------------------------------
Original: 8 messages (3,421 tokens)
Strategy: keep last 4 messages, summarize earlier
Kept (last 4 messages): 1,284 tokens
Summarized (first 4 messages): "User greeted assistant,
  asked about shipping times, received answer."
----------------------------------------------------------------
Ready to send to LLM
Budget compliance: PASS (568 tokens remaining)
Estimated request cost: ~$0.25 (illustrative GPT-4 pricing)
Manifest saved: ./manifests/context_2024-12-27_14-32-01.json
4. Solution Architecture
4.1 High-Level Design
+---------------------------------------------------------------+
|                       Application Code                        |
|    result = manager.build_context(query, docs, history)       |
+-------------------------------+-------------------------------+
                                |
                                v
+---------------------------------------------------------------+
|                        Context Manager                        |
|   +--------------+   +--------------+   +--------------+      |
|   |    Token     |   |   Document   |   |    Budget    |      |
|   |   Counter    |   |    Ranker    |   |  Allocator   |      |
|   +--------------+   +--------------+   +--------------+      |
|                                                               |
|   +--------------+   +--------------+   +--------------+      |
|   |   Content    |   |   Manifest   |   |    Prompt    |      |
|   |  Compressor  |   |  Generator   |   |   Builder    |      |
|   +--------------+   +--------------+   +--------------+      |
+---------------------------------------------------------------+
4.2 Key Components
Component 1: Token Counter
- Uses tiktoken for exact token counts
- Supports multiple models
- Handles message formatting overhead
Component 2: Document Ranker
- Implements multiple ranking strategies
- Supports custom scoring functions
- Caches embeddings for performance
Component 3: Budget Allocator
- Distributes tokens across components
- Reserves space for response
- Adjusts allocation based on priorities
Component 4: Content Compressor
- Summarizes conversation history
- Compresses or truncates documents
- Maintains semantic meaning
Component 5: Manifest Generator
- Tracks all decisions
- Records provenance
- Exports as JSON
4.3 Data Structures
Document
@dataclass
class Document:
id: str
content: str
    metadata: Optional[dict] = None
tokens: Optional[int] = None
embedding: Optional[np.ndarray] = None
def __post_init__(self):
if self.tokens is None:
self.tokens = count_tokens(self.content)
ContextResult
@dataclass
class ContextResult:
final_prompt: str
manifest: ContextManifest
total_tokens: int
def to_messages(self) -> List[dict]:
"""Convert to chat format"""
pass
4.4 Algorithm Overview
Main Algorithm: build_context()
def build_context(
query: str,
documents: List[Document],
conversation_history: List[dict],
system_prompt: str,
total_budget: int = 8000,
response_budget: int = 1000
) -> ContextResult:
"""
Build optimized context within token budget.
Steps:
1. Count fixed components (system, query)
2. Calculate available budget
3. Rank documents by relevance
4. Allocate budget to history and documents
5. Compress content to fit
6. Build final prompt
7. Generate manifest
"""
# Step 1: Count fixed components
system_tokens = count_tokens(system_prompt)
query_tokens = count_tokens(query)
fixed_tokens = system_tokens + query_tokens + response_budget
# Step 2: Available budget for context
available_budget = total_budget - fixed_tokens
# Step 3: Rank documents
ranked_docs = ranker.rank(query, documents)
# Step 4: Allocate budget
history_budget = int(available_budget * 0.3) # 30% for history
docs_budget = available_budget - history_budget # 70% for docs
# Step 5: Compress history
compressed_history = compressor.compress_history(
conversation_history,
history_budget
)
# Step 6: Select/compress documents
selected_docs = compressor.compress_documents(
ranked_docs,
docs_budget,
strategy="hybrid"
)
# Step 7: Build final prompt
final_prompt = build_prompt(
system=system_prompt,
history=compressed_history,
documents=selected_docs,
query=query
)
# Step 8: Generate manifest
manifest = generate_manifest(
budget=total_budget,
used=count_tokens(final_prompt),
included_docs=[d.id for d in selected_docs],
excluded_docs=[d.id for d in documents if d not in selected_docs],
# ... more metadata
)
return ContextResult(
final_prompt=final_prompt,
manifest=manifest,
total_tokens=count_tokens(final_prompt)
)
5. Implementation Guide
5.1 Development Environment Setup
# Create environment
python -m venv venv
source venv/bin/activate
# Install dependencies
pip install tiktoken sentence-transformers numpy scikit-learn
# For development
pip install pytest black mypy
5.2 Project Structure
context-manager/
├── src/
│   ├── __init__.py
│   ├── manager.py            # Main ContextManager class
│   ├── token_counter.py      # Token counting
│   ├── ranker.py             # Document ranking
│   ├── compressor.py         # Content compression
│   ├── manifest.py           # Manifest generation
│   └── utils.py
├── tests/
│   ├── test_token_counter.py
│   ├── test_ranker.py
│   └── test_manager.py
├── examples/
│   ├── basic_usage.py
│   ├── rag_system.py
│   └── long_conversation.py
├── pyproject.toml
└── README.md
5.3 Implementation Phases
Phase 1: Token Counting (Day 1)
Checkpoint 1.1: Implement TokenCounter
# src/token_counter.py
import tiktoken
from typing import List
class TokenCounter:
def __init__(self, model: str = "gpt-4"):
self.model = model
self.encoding = tiktoken.encoding_for_model(model)
def count(self, text: str) -> int:
"""Count tokens in text"""
return len(self.encoding.encode(text))
def count_messages(self, messages: List[dict]) -> int:
"""
Count tokens in chat-formatted messages.
Includes formatting overhead:
- Message role markers
- Message separators
"""
tokens = 0
for message in messages:
            # Formatting overhead per message (approximate; the exact
            # per-message overhead varies by model -- see OpenAI's docs)
            tokens += 4
# Content
tokens += self.count(message.get("content", ""))
# Name field if present
if "name" in message:
tokens += self.count(message["name"])
tokens += -1 # Role is omitted if name present
tokens += 2 # Reply priming
return tokens
def fits_in_budget(self, text: str, budget: int) -> bool:
"""Check if text fits in budget"""
return self.count(text) <= budget
Test the counter:
# tests/test_token_counter.py
import pytest
from src.token_counter import TokenCounter
def test_count_basic():
counter = TokenCounter("gpt-4")
assert counter.count("Hello") == 1
assert counter.count("Hello, world!") == 4
def test_count_messages():
counter = TokenCounter("gpt-4")
messages = [
{"role": "user", "content": "Hello"},
{"role": "assistant", "content": "Hi there!"}
]
tokens = counter.count_messages(messages)
assert tokens > 0
assert tokens < 20 # Should be small
def test_fits_in_budget():
counter = TokenCounter("gpt-4")
assert counter.fits_in_budget("Short text", budget=100)
assert not counter.fits_in_budget("x" * 10000, budget=10)
Phase 2: Document Ranking (Days 2-3)
Checkpoint 2.1: Implement keyword-based ranker
# src/ranker.py
from typing import List, Callable
from dataclasses import dataclass
import numpy as np
@dataclass
class RankedDocument:
document: 'Document'
score: float
class DocumentRanker:
def __init__(self, strategy: str = "keyword"):
self.strategy = strategy
def rank(
self,
query: str,
documents: List['Document']
) -> List[RankedDocument]:
"""Rank documents by relevance to query"""
if self.strategy == "keyword":
return self._keyword_rank(query, documents)
elif self.strategy == "tfidf":
return self._tfidf_rank(query, documents)
elif self.strategy == "embedding":
return self._embedding_rank(query, documents)
else:
raise ValueError(f"Unknown strategy: {self.strategy}")
def _keyword_rank(
self,
query: str,
documents: List['Document']
) -> List[RankedDocument]:
"""Simple keyword matching"""
query_words = set(query.lower().split())
ranked = []
for doc in documents:
doc_words = set(doc.content.lower().split())
overlap = query_words & doc_words
# Score: overlap / query words
score = len(overlap) / len(query_words) if query_words else 0
ranked.append(RankedDocument(document=doc, score=score))
# Sort by score (descending)
ranked.sort(key=lambda x: x.score, reverse=True)
return ranked
def _embedding_rank(
self,
query: str,
documents: List['Document']
) -> List[RankedDocument]:
"""Semantic embedding-based ranking"""
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# Encode query
query_embedding = model.encode([query])[0]
# Encode documents (or use cached embeddings)
doc_contents = [d.content for d in documents]
doc_embeddings = model.encode(doc_contents)
# Calculate cosine similarity
scores = np.dot(doc_embeddings, query_embedding)
# Create ranked list
ranked = [
RankedDocument(document=doc, score=float(score))
for doc, score in zip(documents, scores)
]
ranked.sort(key=lambda x: x.score, reverse=True)
return ranked
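The `tfidf` strategy is dispatched in `rank()` but not shown above. A minimal sketch, consistent with the Method 2 snippet from Section 2.1:
    def _tfidf_rank(
        self,
        query: str,
        documents: List['Document']
    ) -> List[RankedDocument]:
        """TF-IDF cosine-similarity ranking (requires scikit-learn)"""
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity
        matrix = TfidfVectorizer().fit_transform(
            [query] + [d.content for d in documents]
        )
        # Row 0 is the query; rows 1..n are the documents
        scores = cosine_similarity(matrix[0:1], matrix[1:])[0]
        ranked = [
            RankedDocument(document=doc, score=float(score))
            for doc, score in zip(documents, scores)
        ]
        ranked.sort(key=lambda x: x.score, reverse=True)
        return ranked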
Phase 3: Content Compression (Days 4-5)
Checkpoint 3.1: History compression
# src/compressor.py
from typing import List
from .token_counter import TokenCounter
class ContentCompressor:
def __init__(self, token_counter: TokenCounter):
self.counter = token_counter
def compress_history(
self,
messages: List[dict],
budget: int
) -> List[dict]:
"""
Compress conversation history to fit budget.
Strategy:
1. Always keep system message
2. Keep last N messages (recency bias)
3. Summarize older messages if needed
"""
if not messages:
return []
# Separate system message
system_msg = next(
(m for m in messages if m["role"] == "system"),
None
)
user_messages = [
m for m in messages if m["role"] != "system"
]
# Calculate budget for user messages
user_budget = budget
if system_msg:
user_budget -= self.counter.count_messages([system_msg])
# Keep adding messages from end until budget exceeded
kept_messages = []
current_tokens = 0
for message in reversed(user_messages):
msg_tokens = self.counter.count_messages([message])
if current_tokens + msg_tokens <= user_budget:
kept_messages.insert(0, message)
current_tokens += msg_tokens
else:
break
# Build result
result = []
if system_msg:
result.append(system_msg)
# If we couldn't keep all messages, add summary of omitted ones
omitted_count = len(user_messages) - len(kept_messages)
if omitted_count > 0:
summary = {
"role": "system",
"content": f"[Earlier conversation summary: {omitted_count} messages omitted to save space]"
}
result.append(summary)
result.extend(kept_messages)
return result
def compress_documents(
self,
ranked_docs: List['RankedDocument'],
budget: int,
strategy: str = "hybrid"
) -> List['Document']:
"""
Compress documents to fit budget.
Strategies:
- select: Keep top-K docs (drop rest)
- hybrid: Keep top docs, summarize middle, drop low
"""
if strategy == "select":
return self._select_top_k(ranked_docs, budget)
elif strategy == "hybrid":
return self._hybrid_compress(ranked_docs, budget)
else:
raise ValueError(f"Unknown strategy: {strategy}")
def _select_top_k(
self,
ranked_docs: List['RankedDocument'],
budget: int
) -> List['Document']:
"""Select top-K documents that fit in budget"""
selected = []
current_tokens = 0
for ranked_doc in ranked_docs:
doc = ranked_doc.document
doc_tokens = doc.tokens or self.counter.count(doc.content)
if current_tokens + doc_tokens <= budget:
selected.append(doc)
current_tokens += doc_tokens
else:
# Budget exceeded
break
return selected
def _hybrid_compress(
self,
ranked_docs: List['RankedDocument'],
budget: int
) -> List['Document']:
"""
Hybrid strategy:
- Top 30% of budget: Full text of top docs
- Next 40% of budget: Summaries of middle docs
- Remaining: Drop
"""
# Allocate budget
full_text_budget = int(budget * 0.7)
summary_budget = budget - full_text_budget
# Select top docs for full text
full_text_docs = self._select_top_k(
ranked_docs,
full_text_budget
)
# TODO: Implement summarization for remaining docs
# For now, just return full text docs
return full_text_docs
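One way to fill in that TODO is a cheap extractive "summary" that keeps each remaining document's leading sentences until a per-document budget is hit. A sketch under that assumption (a production system might call an LLM summarizer here instead):
    def _truncate_to_budget(self, text: str, budget: int) -> str:
        """Keep leading sentences until the token budget is exhausted"""
        kept, used = [], 0
        for sentence in text.split(". "):
            tokens = self.counter.count(sentence)
            if used + tokens > budget:
                break
            kept.append(sentence)
            used += tokens
        return ". ".join(kept)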
Phase 4: Integration (Days 6-7)
Checkpoint 4.1: Main ContextManager
# src/manager.py
from typing import List, Optional
from dataclasses import dataclass
from .token_counter import TokenCounter
from .ranker import DocumentRanker, RankedDocument
from .compressor import ContentCompressor
from .manifest import ContextManifest, ManifestGenerator
@dataclass
class Document:
id: str
content: str
    metadata: Optional[dict] = None
tokens: Optional[int] = None
@dataclass
class ContextResult:
final_prompt: str
manifest: ContextManifest
total_tokens: int
class ContextManager:
def __init__(
self,
model: str = "gpt-4",
total_budget: int = 8000,
response_budget: int = 1000,
ranking_strategy: str = "keyword"
):
self.total_budget = total_budget
self.response_budget = response_budget
self.counter = TokenCounter(model)
self.ranker = DocumentRanker(strategy=ranking_strategy)
self.compressor = ContentCompressor(self.counter)
self.manifest_generator = ManifestGenerator()
def build_context(
self,
query: str,
documents: List[Document],
        conversation_history: Optional[List[dict]] = None,
system_prompt: str = ""
) -> ContextResult:
"""Build optimized context within budget"""
# Calculate token counts
system_tokens = self.counter.count(system_prompt) if system_prompt else 0
query_tokens = self.counter.count(query)
fixed_tokens = system_tokens + query_tokens + self.response_budget
available_budget = self.total_budget - fixed_tokens
# Allocate budget
history_budget = int(available_budget * 0.3)
docs_budget = available_budget - history_budget
# Rank documents
ranked_docs = self.ranker.rank(query, documents)
# Compress history
compressed_history = []
if conversation_history:
compressed_history = self.compressor.compress_history(
conversation_history,
history_budget
)
# Select/compress documents
selected_docs = self.compressor.compress_documents(
ranked_docs,
docs_budget,
strategy="select"
)
# Build final prompt
final_prompt = self._build_prompt(
system_prompt,
compressed_history,
selected_docs,
query
)
# Generate manifest
manifest = self.manifest_generator.generate(
total_budget=self.total_budget,
used_tokens=self.counter.count(final_prompt),
system_tokens=system_tokens,
query_tokens=query_tokens,
history_tokens=self.counter.count_messages(compressed_history),
docs_tokens=sum(d.tokens or 0 for d in selected_docs),
included_docs=[d.id for d in selected_docs],
excluded_docs=[
rd.document.id for rd in ranked_docs
if rd.document not in selected_docs
]
)
return ContextResult(
final_prompt=final_prompt,
manifest=manifest,
total_tokens=self.counter.count(final_prompt)
)
def _build_prompt(
self,
system_prompt: str,
history: List[dict],
documents: List[Document],
query: str
) -> str:
"""Build final prompt string"""
parts = []
if system_prompt:
parts.append(f"SYSTEM: {system_prompt}\n")
if history:
parts.append("CONVERSATION HISTORY:")
for msg in history:
parts.append(f"{msg['role'].upper()}: {msg['content']}")
parts.append("")
if documents:
parts.append("RELEVANT DOCUMENTS:")
for doc in documents:
parts.append(f"[{doc.id}]")
parts.append(doc.content)
parts.append("")
parts.append(f"USER QUERY: {query}")
return "\n".join(parts)
Checkpoint 4.2: Manifest generator
# src/manifest.py
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import List
@dataclass
class ContextManifest:
total_budget: int
used_tokens: int
timestamp: str
system_tokens: int
query_tokens: int
history_tokens: int
docs_tokens: int
documents_included: List[str]
documents_excluded: List[str]
def to_dict(self) -> dict:
return asdict(self)
def print_summary(self):
"""Print human-readable summary"""
print("โ" + "="*62 + "โ")
print("โ" + " "*10 + "CONTEXT BUDGET REPORT" + " "*31 + "โ")
print("โ" + "="*62 + "โ\n")
print(f"Total Budget: {self.total_budget:,} tokens")
print(f"Used: {self.used_tokens:,} tokens ({self.used_tokens/self.total_budget*100:.1f}%)\n")
print("BREAKDOWN:")
print(f" System Prompt: {self.system_tokens:,} tokens")
print(f" User Query: {self.query_tokens:,} tokens")
print(f" History: {self.history_tokens:,} tokens")
print(f" Documents: {self.docs_tokens:,} tokens\n")
print(f"Documents Included: {len(self.documents_included)}")
print(f"Documents Excluded: {len(self.documents_excluded)}")
class ManifestGenerator:
def generate(
self,
total_budget: int,
used_tokens: int,
system_tokens: int,
query_tokens: int,
history_tokens: int,
docs_tokens: int,
included_docs: List[str],
excluded_docs: List[str]
) -> ContextManifest:
"""Generate context manifest"""
return ContextManifest(
total_budget=total_budget,
used_tokens=used_tokens,
timestamp=datetime.now().isoformat(),
system_tokens=system_tokens,
query_tokens=query_tokens,
history_tokens=history_tokens,
docs_tokens=docs_tokens,
documents_included=included_docs,
documents_excluded=excluded_docs
)
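A small usage sketch for persisting manifests, matching the "Manifest saved: ..." line in the example console output (the directory layout and file naming are assumptions, not a fixed API):
# examples/save_manifest.py
import json
from datetime import datetime
from pathlib import Path
from src.manifest import ContextManifest

def save_manifest(manifest: ContextManifest, directory: str = "./manifests") -> Path:
    """Write the manifest as timestamped JSON and return its path"""
    Path(directory).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    path = Path(directory) / f"context_{stamp}.json"
    path.write_text(json.dumps(manifest.to_dict(), indent=2))
    return path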
6. Testing Strategy
6.1 Critical Test Cases
# tests/test_manager.py
import pytest
from src.manager import ContextManager, Document
def test_basic_budget_compliance():
"""Ensure context never exceeds budget"""
    manager = ContextManager(total_budget=1000, response_budget=200)
docs = [
Document(id=f"doc_{i}", content="x" * 500)
for i in range(10)
]
result = manager.build_context(
query="Test query",
documents=docs
)
assert result.total_tokens <= 1000
def test_document_ranking():
"""Ensure relevant documents are selected"""
manager = ContextManager(ranking_strategy="keyword")
docs = [
Document(id="doc_refund", content="Our refund policy allows..."),
Document(id="doc_shipping", content="Shipping takes 3-5 days..."),
Document(id="doc_returns", content="To return an item...")
]
result = manager.build_context(
query="What is your refund policy?",
documents=docs
)
# Most relevant doc should be included
assert "doc_refund" in result.manifest.documents_included
7. Extensions & Challenges
7.1 Beginner Extensions
Extension 1: Support for Multiple Models
- Add support for Claude, Llama tokenizers
- Handle model-specific formatting
Extension 2: Visualization Dashboard
- Create HTML report showing budget allocation
- Visualize which docs were selected/excluded
7.2 Intermediate Extensions
Extension 3: Adaptive Budgeting
- Dynamically adjust allocation based on query type
- More budget for complex queries
Extension 4: Semantic Chunking
- Split long documents into semantic chunks
- Rank chunks instead of full documents
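A starting point for this extension (hedged: splitting on blank lines is the simplest stand-in for true semantic chunking):
from typing import List
from src.manager import Document
from src.token_counter import TokenCounter

def chunk_document(doc: Document, counter: TokenCounter,
                   max_tokens: int = 300) -> List[str]:
    """Greedily group paragraphs into chunks of at most max_tokens"""
    chunks, current, used = [], [], 0
    for para in doc.content.split("\n\n"):
        tokens = counter.count(para)
        if used + tokens > max_tokens and current:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks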
7.3 Advanced Extensions
Extension 5: Multi-Query Optimization
- Optimize context for batch of related queries
- Reuse context across queries
Extension 6: Cost-Aware Selection
- Factor in API costs when selecting models
- Trade-off between context size and cost
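A starting point for cost-aware selection (the prices below are illustrative assumptions, not current rates; check your provider's pricing page):
PRICES_PER_1K = {"gpt-4": {"input": 0.03, "output": 0.06}}  # assumed rates

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request under the assumed per-1K-token prices"""
    p = PRICES_PER_1K[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1000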
8. Real-World Connections
8.1 Industry Applications
Use Case 1: Notion AI
- Long document chat
- Must select relevant sections from 100+ page docs
Use Case 2: Perplexity AI
- Multi-source search results
- Rank and select top web pages
Use Case 3: Customer Support RAG
- Knowledge base with 10K+ articles
- Select 3-5 most relevant for each query
9. Resources
9.1 Essential Reading
Books
- "Designing Data-Intensive Applications" by Martin Kleppmann - Ch. 3 (Storage and Retrieval)
- "Algorithms" by Sedgewick & Wayne - Ch. 3 (Searching)
Papers
- "Lost in the Middle: How Language Models Use Long Contexts" (Liu et al., 2023) - context position effects
- "Long-Context LLMs" - strategies for handling long context
10. Self-Assessment Checklist
Understanding
- I understand why token counting must be exact
- I can explain โLost in the Middleโ phenomenon
- I know when to use selection vs summarization
- I understand provenance tracking importance
Implementation
- My token counter uses proper tokenizer (tiktoken)
- I implement at least 2 ranking strategies
- I track budget allocation in manifest
- I handle edge cases (empty docs, huge queries)
11. Completion Criteria
Minimum Viable Completion
- TokenCounter with tiktoken integration
- Document ranking (keyword-based minimum)
- Budget allocation logic
- Document selection to fit budget
- Basic manifest generation
- Integration tests with budget compliance
Full Completion
- Multiple ranking strategies (keyword + embeddings)
- History compression
- Hybrid selection/summarization
- Detailed manifests with provenance
- CLI tool for testing
- Performance benchmarks
You now have production-grade context management infrastructure. This is essential for any LLM application dealing with RAG, long conversations, or large document sets.