Project 4: Context Window Manager (What to Include, What to Compress)

Build an intelligent context budgeting system that selects and compresses information to fit expensive, limited context windows

Quick Reference

Attribute Value
Difficulty Advanced
Time Estimate 1-2 weeks
Language Python (Alternatives: TypeScript)
Prerequisites Basic knowledge of RAG, tokenization, Projects 1-2
Key Topics Context Engineering, Summarization, Token Budgeting, Relevance Ranking
Knowledge Area Context Engineering / Summarization
Software/Tool Tiktoken / Tokenizers
Main Book “Designing Data-Intensive Applications” (Retrieval patterns)
Coolness Level Level 3: Genuinely Clever
Business Potential 4. The “Open Core” Infrastructure

1. Learning Objectives

By completing this project, you will:

  1. Master Token Budgeting: Learn to treat context windows as a scarce, expensive resource requiring precise allocation
  2. Implement Document Selection: Build ranking algorithms that prioritize the most relevant information
  3. Design Summarization Strategies: Compress conversation history while preserving critical information
  4. Handle Provenance Tracking: Ensure compressed data remains traceable to original sources
  5. Understand “Lost in the Middle”: Mitigate the phenomenon where models ignore mid-context information
  6. Build Traceability Manifests: Create audit trails showing what was included, excluded, and why
  7. Optimize for Different Use Cases: Balance precision, recall, and cost across various application types

2. Theoretical Foundation

2.1 Core Concepts

The Context Window as a Budget

Modern LLMs have fixed context windows:

  • GPT-4: 8K, 32K, or 128K tokens
  • Claude 3: 200K tokens
  • Llama 3: 8K tokens

The Core Problem: Your application often has more information than fits:

Available Information:
  - System prompt: 500 tokens
  - Conversation history (20 messages): 3,000 tokens
  - Retrieved documents (50 docs): 25,000 tokens
  - User query: 50 tokens
  TOTAL: 28,550 tokens

Context Window: 8,000 tokens
OVERFLOW: 20,550 tokens (72% must be cut!)

Naive Solutions (All Bad):

  1. Truncate arbitrarily: Cut the last 20K tokens → Lose critical context
  2. Compress everything: Summarize all docs → Lose factual precision
  3. Give up: Only use first 8K tokens → Ignore most evidence

Smart Solution: Treat context as a budget allocation problem:

  • Allocate tokens based on importance
  • Summarize low-priority content
  • Drop irrelevant content entirely
  • Track what was included/excluded

The Lost in the Middle Phenomenon

Research Finding (Liu et al., 2023): LLM accuracy follows a U-shaped curve over the position of the relevant information in the context:

Attention/Recall Quality
     ^
100% |█                                         █
     |█                                         █
     |█                                         █
  50%|█                                         █
     |█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█
     |█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█
   0%|___________________________________________|
      Beginning        Middle              End
                  Context Position

Key Insights:

  1. Beginning: High attention (system prompt, key instructions)
  2. Middle: Low attention (often ignored, even if relevant!)
  3. End: High attention (recent context, user query)

Implication for Context Management:

  • Place most important facts at beginning or end
  • Avoid burying critical information in the middle
  • Reorder documents by relevance before packing

Token Counting Precision

Common Mistake: Using character count or word count as a proxy for tokens.

Reality: Tokenization is complex and model-specific:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# Different tokenization for same text
text = "Hello, world!"

# Approximate (WRONG):
approx = len(text) / 4  # 3.25 "tokens" ❌ Inaccurate

# GPT-4 tokenization (CORRECT):
exact = len(enc.encode(text))  # 4 tokens ✓ Accurate
# Tokens: ["Hello", ",", " world", "!"]

# Non-English text breaks the heuristic further:
text2 = "你好世界"  # 4 characters, so len(text2) / 4 predicts 1 token
print(len(enc.encode(text2)))  # The real count is higher. Language matters.

Why Precision Matters:

  • Context window limits are strict (exactly 8192 tokens, not ~8000)
  • Exceeding limit causes API errors or silent truncation
  • Cost is per token, not per character
  • Must reserve space for output (response tokens)

Tool: Use tiktoken (Python) or model-specific tokenizers for exact counts.
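The last point, reserving space for output, is simple arithmetic but easy to forget. A minimal sketch follows; the function names are hypothetical, and the prompt token count is assumed to come from an exact tokenizer such as tiktoken.

```python
def max_prompt_tokens(context_limit: int, response_reserve: int) -> int:
    """Tokens left for the prompt after reserving room for the model's reply."""
    if response_reserve < 0 or response_reserve >= context_limit:
        raise ValueError("response_reserve must be in [0, context_limit)")
    return context_limit - response_reserve


def headroom(prompt_tokens: int, context_limit: int, response_reserve: int) -> int:
    """Unused prompt tokens; a negative value means the prompt is too long."""
    return max_prompt_tokens(context_limit, response_reserve) - prompt_tokens


print(max_prompt_tokens(8192, 1000))  # 7192
print(headroom(7432, 8192, 1000))     # -240: this prompt would overflow
```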

Relevance Scoring Strategies

Goal: Rank documents/messages by relevance to user query.

Method 1: Keyword Matching (Simple)

def keyword_score(query: str, document: str) -> float:
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0  # Avoid division by zero on an empty query
    doc_words = set(document.lower().split())
    overlap = query_words & doc_words
    return len(overlap) / len(query_words)

# Example:
query = "refund policy 30 days"
doc = "Our refund policy allows returns within 30 days"
score = keyword_score(query, doc)  # 1.0 (all 4 query words match)

Method 2: TF-IDF (Better)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Assumes `query` (str) and `documents` (list of str) are already defined
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([query] + documents)

# Cosine similarity between the query (row 0) and each doc (rows 1..n)
scores = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])[0]

Method 3: Semantic Embeddings (Best)

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Normalize to unit length so the dot product below equals cosine similarity
query_embedding = model.encode([query], normalize_embeddings=True)
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# Cosine similarity
scores = np.dot(query_embedding, doc_embeddings.T)[0]

Trade-offs:

Method Speed Accuracy Setup
Keywords Fast Low None
TF-IDF Medium Medium Sklearn
Embeddings Slow (first run) High Model download

Summarization vs. Selection

Two Approaches to Fitting Context:

Selection (Discrete):

10 Documents (5000 tokens total)
↓ Rank by relevance
↓ Select top 3 (1500 tokens)
✓ Fits in budget (2000 tokens)

Summarization (Continuous):

10 Documents (5000 tokens total)
↓ Summarize each doc: 500 → 50 tokens
✓ All 10 docs compressed to 500 tokens
✓ Fits in budget (2000 tokens)

When to Use Each:

Approach Best For Pros Cons
Selection Factual QA, citations Preserves exact text, traceable May lose relevant info in dropped docs
Summarization Long conversations, broad context Retains all documents Loses precision, harder to cite
Hybrid Most use cases Balance precision and coverage More complex to implement

Hybrid Strategy (Recommended):

  1. Select top K most relevant documents (full text)
  2. Summarize next K moderately relevant documents
  3. Drop remaining documents entirely
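The three steps above can be sketched as a simple partition of the ranked list. This is an illustration only: `tier_documents` and its parameters are hypothetical names, and the input is assumed to be document IDs sorted from most to least relevant.

```python
from typing import Dict, List


def tier_documents(
    ranked_ids: List[str], k_full: int, k_summary: int
) -> Dict[str, List[str]]:
    """Split ranked document IDs into full-text, summarize, and drop tiers."""
    if k_full < 0 or k_summary < 0:
        raise ValueError("tier sizes must be non-negative")
    cut = k_full + k_summary
    return {
        "full_text": ranked_ids[:k_full],   # step 1: keep verbatim
        "summarize": ranked_ids[k_full:cut],  # step 2: compress
        "drop": ranked_ids[cut:],           # step 3: exclude entirely
    }


ranked = [f"doc_{i}" for i in range(1, 11)]  # doc_1 is the most relevant
plan = tier_documents(ranked, k_full=3, k_summary=3)
print(plan["full_text"])  # ['doc_1', 'doc_2', 'doc_3']
print(plan["summarize"])  # ['doc_4', 'doc_5', 'doc_6']
print(len(plan["drop"]))  # 4
```

A production version would usually size the tiers by token budget instead of a fixed K, since documents vary widely in length.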

Provenance and Traceability

Problem: After summarization/selection, you must know where facts came from.

Bad (No Provenance):

summary = llm.summarize(docs)
# Later: User asks "Where did you get that info?"
# You: "Uh... somewhere in the docs?" ❌

Good (With Provenance):

summary = llm.summarize(docs)
manifest = {
    "summary_source_ids": ["doc_1", "doc_3", "doc_7"],
    "original_tokens": 5000,
    "compressed_tokens": 500,
    "compression_ratio": 0.1,
    "method": "extractive_summary"
}

# Later: User asks for source
# You: "That came from doc_3, section 2" ✓

Why It Matters:

  • Regulatory compliance (GDPR, healthcare)
  • User trust (can verify claims)
  • Debugging (trace incorrect answers)
  • Auditing (track what model “saw”)
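To make the manifest idea concrete, here is a minimal sketch that summarizes each document separately so every summary sentence maps back to one source ID. The summarizer (`first_sentence`) and the word-count token counter are toy stand-ins chosen for illustration; a real system would plug in an LLM call and an exact tokenizer.

```python
from typing import Callable, Dict


def first_sentence(text: str) -> str:
    """Toy extractive summarizer: keep the text up to the first period."""
    head, _, _ = text.partition(".")
    return head.strip() + "."


def summarize_with_provenance(
    docs: Dict[str, str],
    summarize: Callable[[str], str] = first_sentence,
    # Stand-in counter (whitespace words), NOT a real tokenizer
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> dict:
    """Summarize each doc on its own and record where each piece came from."""
    pieces = {doc_id: summarize(text) for doc_id, text in docs.items()}
    original = sum(count_tokens(text) for text in docs.values())
    compressed = sum(count_tokens(piece) for piece in pieces.values())
    return {
        "summary": " ".join(pieces.values()),
        "summary_source_ids": list(pieces),
        "sentence_sources": pieces,  # doc ID -> the text it contributed
        "original_tokens": original,
        "compressed_tokens": compressed,
        "compression_ratio": compressed / original if original else 0.0,
        "method": "extractive_summary",
    }


result = summarize_with_provenance({
    "doc_1": "Refunds are allowed within 30 days. Items must be unused.",
    "doc_3": "Contact support to start a return. Include your order number.",
})
print(result["summary_source_ids"])  # ['doc_1', 'doc_3']
print(result["compression_ratio"])   # 0.6
```

Summarizing per document costs more calls than summarizing everything at once, but it keeps the source mapping exact.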

2.2 Why This Matters

Production Relevance

Real-World Scenarios Requiring Context Management:

  1. Customer Support (RAG)
    • Knowledge base: 10,000 articles
    • User query: “How do I return a damaged item?”
    • Must select 3-5 most relevant articles from 10K
    • Context budget: 8K tokens (system + history + articles + response)
  2. Long Conversations (Chatbots)
    • Conversation: 50+ messages over multiple sessions
    • Context window: 8K tokens
    • Must compress early messages while retaining key facts
  3. Document Analysis (Legal, Medical)
    • Input: 500-page contract
    • Task: Answer specific questions
    • Cannot fit entire contract → must select relevant sections
  4. Multi-Document QA
    • Input: 100 research papers
    • Task: Synthesize answer across papers
    • Must rank and select most relevant papers

Consequences of Poor Context Management:

Problem Impact Example
Exceed context limit API error or silent truncation Conversation crashes mid-session
Include irrelevant docs Model distraction, wrong answers Cites shipping policy for billing question
Stuff everything “Lost in the Middle” → ignores key facts Correct answer is in doc 25/50, model misses it
Over-summarize Loss of precision “Policy allows refunds” (but within 30 days? for what items?)
No provenance Cannot verify or cite sources Regulatory violation, user distrust

Industry Applications

Company Use Case Strategy
Notion Long document chat Hybrid: Select relevant sections + summarize context
Anthropic Constitutional AI Reranking: Most important rules at start/end
OpenAI ChatGPT long conversations Sliding window + summary of old messages
Perplexity Multi-source answers Top-K selection from web search results

2.3 Common Misconceptions

Misconception Reality
“Bigger context windows solve everything” Bigger windows are more expensive and still have “Lost in the Middle” issues
“Summarization preserves all information” Summarization always loses details; it’s a lossy compression
“Token count ≈ character count / 4” Tokenization varies by language, special chars, model
“Models read all context equally” Attention is U-shaped (strong at ends, weak in middle)
“Latest model = biggest context” Window size varies by vendor and release (Claude 3: 200K; GPT-4: 8K to 128K), and cost scales with size

3. Project Specification

3.1 What You Will Build

A context budgeting library that:

  1. Accepts inputs: User query, retrieved documents, conversation history, token budget
  2. Ranks documents: By relevance to query
  3. Allocates tokens: Distributes budget across system, query, history, documents
  4. Compresses content: Summarizes or truncates as needed
  5. Maintains provenance: Tracks which sources were included/excluded
  6. Outputs: Final prompt string + traceability manifest

Core Question This Tool Answers:

“How do I fit a world of information into a tiny, expensive window without the model getting confused?”

3.2 Functional Requirements

FR1: Token Counting

Requirements:

  • Support multiple tokenizers (GPT-4, Claude, Llama)
  • Provide exact token counts for strings
  • Calculate total token usage across all components

Interface:

class TokenCounter:
    def __init__(self, model: str = "gpt-4"):
        self.model = model
        self.tokenizer = self._load_tokenizer(model)

    def count(self, text: str) -> int:
        """Return exact token count"""
        pass

    def count_messages(self, messages: List[dict]) -> int:
        """Count tokens in chat format"""
        pass

    def fits_in_budget(self, text: str, budget: int) -> bool:
        """Check if text fits in token budget"""
        pass

FR2: Document Ranking

Requirements:

  • Rank documents by relevance to query
  • Support multiple ranking methods (keywords, TF-IDF, embeddings)
  • Allow custom scoring functions

Interface:

class DocumentRanker:
    def rank(
        self,
        query: str,
        documents: List[Document]
    ) -> List[tuple[Document, float]]:
        """Return documents sorted by relevance score"""
        pass

FR3: Budget Allocation

Requirements:

  • Allocate token budget across components
  • Reserve space for system prompt, query, response
  • Distribute remaining budget to history and documents

Example Budget:

Total Budget: 8000 tokens

Allocations:
- System Prompt: 500 tokens (fixed)
- User Query: 50 tokens (measured)
- Response: 1000 tokens (reserved)
- Subtotal: 1550 tokens

Remaining for Context: 6450 tokens
  - Conversation History: 1935 tokens (30%)
  - Retrieved Documents: 4515 tokens (70%)
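The allocation arithmetic can be sketched as a small function. The `Allocation` class and `allocate_budget` name are hypothetical; the history share is passed as an integer percentage so the split stays in exact integer arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    system: int
    query: int
    response: int
    history: int
    documents: int

    @property
    def total(self) -> int:
        return self.system + self.query + self.response + self.history + self.documents


def allocate_budget(
    total_budget: int,
    system_tokens: int,
    query_tokens: int,
    response_reserve: int,
    history_percent: int = 30,
) -> Allocation:
    """Reserve the fixed parts first, then split the rest between history and docs."""
    fixed = system_tokens + query_tokens + response_reserve
    remaining = total_budget - fixed
    if remaining < 0:
        raise ValueError(f"fixed components ({fixed}) exceed budget ({total_budget})")
    history = remaining * history_percent // 100
    documents = remaining - history  # documents absorb any rounding remainder
    return Allocation(system_tokens, query_tokens, response_reserve, history, documents)


alloc = allocate_budget(8000, system_tokens=500, query_tokens=50, response_reserve=1000)
print(alloc.history, alloc.documents, alloc.total)  # 1935 4515 8000
```

Giving the rounding remainder to documents guarantees the parts always sum to the total budget.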

FR4: Content Compression

Requirements:

  • Summarize conversation history
  • Truncate or summarize documents
  • Preserve critical information

Strategies:

class ContentCompressor:
    def compress_history(
        self,
        messages: List[dict],
        budget: int
    ) -> List[dict]:
        """
        Compress conversation history to fit budget.

        Strategy:
        - Keep system message (always)
        - Keep last N messages (recency)
        - Summarize middle messages
        """
        pass

    def compress_documents(
        self,
        documents: List[Document],
        budget: int,
        strategy: str = "hybrid"
    ) -> List[Document]:
        """
        Compress documents to fit budget.

        Strategies:
        - select: Keep top-K docs (drop others)
        - summarize: Summarize each doc
        - hybrid: Keep top docs, summarize rest
        """
        pass

FR5: Traceability Manifest

Requirements:

  • Track what was included in final context
  • Track what was excluded and why
  • Record compression ratios
  • Provide source mapping

Manifest Structure:

@dataclass
class ContextManifest:
    total_budget: int
    used_tokens: int
    components: dict  # {component: token_count}

    documents_included: List[str]  # Document IDs
    documents_excluded: List[str]
    documents_summarized: List[str]

    history_compressed: bool
    history_original_tokens: int
    history_compressed_tokens: int

    selection_reasoning: str
    timestamp: str

    def to_dict(self) -> dict:
        """Export as JSON"""
        pass

    def print_summary(self):
        """Human-readable summary"""
        pass
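As one possible way to fill in the two stubbed methods, the sketch below uses `dataclasses.asdict` for export and returns the summary as a string so it is easy to test. The default values and the `summary` method name are assumptions added for illustration; the field names follow the structure above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class ContextManifest:
    total_budget: int
    used_tokens: int
    components: Dict[str, int] = field(default_factory=dict)
    documents_included: List[str] = field(default_factory=list)
    documents_excluded: List[str] = field(default_factory=list)
    documents_summarized: List[str] = field(default_factory=list)
    history_compressed: bool = False
    history_original_tokens: int = 0
    history_compressed_tokens: int = 0
    selection_reasoning: str = ""
    timestamp: str = ""

    def to_dict(self) -> dict:
        """Export as a plain dict that json.dumps can serialize."""
        return asdict(self)

    def summary(self) -> str:
        """Human-readable one-screen summary."""
        pct = 100 * self.used_tokens / self.total_budget
        return "\n".join([
            f"Used: {self.used_tokens:,} / {self.total_budget:,} tokens ({pct:.1f}%)",
            f"Included: {len(self.documents_included)} docs",
            f"Summarized: {len(self.documents_summarized)} docs",
            f"Excluded: {len(self.documents_excluded)} docs",
        ])


manifest = ContextManifest(
    total_budget=8000,
    used_tokens=7432,
    documents_included=["doc_1", "doc_3"],
)
print(manifest.summary())
print(json.dumps(manifest.to_dict(), indent=2))
```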

3.3 Non-Functional Requirements

Requirement Target Rationale
Precision <1% error in token counting API failures occur at exact limits
Performance <500ms for 100 documents Real-time applications
Memory Efficiency Handle 1000+ documents Large knowledge bases
Extensibility Custom ranking functions Different use cases
Observability Detailed manifests Debugging and auditing

3.4 Example Usage

Basic Usage:

from context_manager import ContextManager, Document

# Initialize manager
manager = ContextManager(
    model="gpt-4",
    total_budget=8000,
    response_budget=1000  # Reserve for response
)

# Define documents
documents = [
    Document(id="doc_1", content="Our refund policy allows..."),
    Document(id="doc_2", content="Shipping takes 3-5 days..."),
    Document(id="doc_3", content="For technical support..."),
    # ... 50 more documents
]

# User query
query = "What is your refund policy?"

# Build context
result = manager.build_context(
    query=query,
    documents=documents,
    conversation_history=[
        {"role": "user", "content": "Hi there"},
        {"role": "assistant", "content": "Hello! How can I help?"}
    ],
    system_prompt="You are a helpful customer support agent."
)

# Access results
print(result.final_prompt)  # Ready to send to LLM
print(result.manifest.used_tokens)  # 7,432 / 8,000
print(result.manifest.documents_included)  # ["doc_1", "doc_3"]

# Verify budget compliance
assert result.manifest.used_tokens <= 8000

Console Output:

╔═════════════════════════════════════════════════════════════╗
║           CONTEXT WINDOW MANAGER - Budget Report            ║
╚═════════════════════════════════════════════════════════════╝

Total Budget: 8,000 tokens
Used: 7,432 tokens (92.9%)
Reserved for Response: 1,000 tokens

BUDGET BREAKDOWN
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Component                 Tokens      % of Budget
───────────────────────────────────────────────────────────────
System Prompt              500         6.3%
User Query                  48         0.6%
Conversation History     1,284        16.1%
Retrieved Documents      5,600        70.0%
Response (Reserved)      1,000         -
───────────────────────────────────────────────────────────────
TOTAL                    7,432        92.9%

DOCUMENT SELECTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Ranked 50 documents by relevance to query

✓ INCLUDED (Full Text - 3 documents):
  1. doc_1 (Relevance: 0.95, Tokens: 1,200)
     "Our refund policy allows returns within 30 days..."

  2. doc_3 (Relevance: 0.82, Tokens: 980)
     "For refund requests, please contact support@..."

  3. doc_7 (Relevance: 0.71, Tokens: 870)
     "Refund processing takes 5-7 business days..."

◐ SUMMARIZED (5 documents):
  4. doc_12 (Relevance: 0.61, Original: 1,500 → Summary: 200 tokens)
  5. doc_18 (Relevance: 0.58, Original: 2,100 → Summary: 180 tokens)
  ... 3 more

✗ EXCLUDED (42 documents):
  Low relevance to query (score < 0.5)

CONVERSATION HISTORY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Original: 8 messages (3,421 tokens)
Strategy: Keep last 4 messages, summarize earlier

✓ Kept (Last 4 messages): 1,284 tokens
◐ Summarized (First 4 messages): "User greeted assistant,
   asked about shipping times, received answer."

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✓ Ready to send to LLM
  Budget compliance: PASS (568 tokens remaining)
  Estimated response cost: $0.0223 (GPT-4 pricing)

Manifest saved: ./manifests/context_2024-12-27_14-32-01.json

4. Solution Architecture

4.1 High-Level Design

┌─────────────────────────────────────────────────────────────┐
│                   Application Code                          │
│  result = manager.build_context(query, docs, history)       │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                  Context Manager                            │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────┐        │
│  │    Token     │  │   Document   │  │   Budget    │        │
│  │   Counter    │  │    Ranker    │  │  Allocator  │        │
│  └──────────────┘  └──────────────┘  └─────────────┘        │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────┐        │
│  │   Content    │  │  Manifest    │  │   Prompt    │        │
│  │  Compressor  │  │  Generator   │  │   Builder   │        │
│  └──────────────┘  └──────────────┘  └─────────────┘        │
└─────────────────────────────────────────────────────────────┘

4.2 Key Components

Component 1: Token Counter

  • Uses tiktoken for exact token counts
  • Supports multiple models
  • Handles message formatting overhead

Component 2: Document Ranker

  • Implements multiple ranking strategies
  • Supports custom scoring functions
  • Caches embeddings for performance
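The embedding cache mentioned above could look roughly like this: vectors are keyed by a hash of the document text, and only unseen texts are sent to the embedding function. `EmbeddingCache` is a hypothetical class, and `fake_embed` stands in for a real model call such as `SentenceTransformer.encode`.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class EmbeddingCache:
    """In-memory cache that embeds each distinct text at most once."""

    def __init__(self, embed_fn: Callable[[Sequence[str]], List[Vector]]):
        self._embed_fn = embed_fn
        self._store: Dict[str, Vector] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_many(self, texts: Sequence[str]) -> List[Vector]:
        keys = [self._key(t) for t in texts]
        # Collect texts we have not embedded yet (dict keeps order, dedupes)
        missing: Dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self._store and key not in missing:
                missing[key] = text
        if missing:
            vectors = self._embed_fn(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self._store[key] = vector
        return [self._store[key] for key in keys]


calls: List[List[str]] = []


def fake_embed(texts: Sequence[str]) -> List[Vector]:
    """Stand-in embedder: records each batch and returns 1-D 'vectors'."""
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


cache = EmbeddingCache(fake_embed)
first = cache.get_many(["a", "bb"])
second = cache.get_many(["bb", "ccc"])  # only "ccc" is embedded this time
print(calls)  # [['a', 'bb'], ['ccc']]
```

Hashing the content instead of the document ID means an edited document is re-embedded automatically.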

Component 3: Budget Allocator

  • Distributes tokens across components
  • Reserves space for response
  • Adjusts allocation based on priorities

Component 4: Content Compressor

  • Summarizes conversation history
  • Compresses or truncates documents
  • Maintains semantic meaning

Component 5: Manifest Generator

  • Tracks all decisions
  • Records provenance
  • Exports as JSON

4.3 Data Structures

Document

@dataclass
class Document:
    id: str
    content: str
    metadata: dict = None
    tokens: Optional[int] = None
    embedding: Optional[np.ndarray] = None

    def __post_init__(self):
        if self.tokens is None:
            self.tokens = count_tokens(self.content)

ContextResult

@dataclass
class ContextResult:
    final_prompt: str
    manifest: ContextManifest
    total_tokens: int

    def to_messages(self) -> List[dict]:
        """Convert to chat format"""
        pass

4.4 Algorithm Overview

Main Algorithm: build_context()

def build_context(
    query: str,
    documents: List[Document],
    conversation_history: List[dict],
    system_prompt: str,
    total_budget: int = 8000,
    response_budget: int = 1000
) -> ContextResult:
    """
    Build optimized context within token budget.

    Steps:
    1. Count fixed components (system, query)
    2. Calculate available budget
    3. Rank documents by relevance
    4. Allocate budget to history and documents
    5. Compress history to fit
    6. Select/compress documents to fit
    7. Build final prompt
    8. Generate manifest
    """

    # Step 1: Count fixed components
    system_tokens = count_tokens(system_prompt)
    query_tokens = count_tokens(query)
    fixed_tokens = system_tokens + query_tokens + response_budget

    # Step 2: Available budget for context
    available_budget = total_budget - fixed_tokens

    # Step 3: Rank documents
    ranked_docs = ranker.rank(query, documents)

    # Step 4: Allocate budget
    history_budget = int(available_budget * 0.3)  # 30% for history
    docs_budget = available_budget - history_budget  # 70% for docs

    # Step 5: Compress history
    compressed_history = compressor.compress_history(
        conversation_history,
        history_budget
    )

    # Step 6: Select/compress documents
    selected_docs = compressor.compress_documents(
        ranked_docs,
        docs_budget,
        strategy="hybrid"
    )

    # Step 7: Build final prompt
    final_prompt = build_prompt(
        system=system_prompt,
        history=compressed_history,
        documents=selected_docs,
        query=query
    )

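    # Step 8: Generate manifest
    included_ids = {d.id for d in selected_docs}
    manifest = generate_manifest(
        budget=total_budget,
        used=count_tokens(final_prompt),
        included_docs=[d.id for d in selected_docs],
        # Compare by ID: a set lookup is fast and does not rely on Document equality
        excluded_docs=[d.id for d in documents if d.id not in included_ids],
        # ... more metadata
    )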

    return ContextResult(
        final_prompt=final_prompt,
        manifest=manifest,
        total_tokens=count_tokens(final_prompt)
    )
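The algorithm calls a `build_prompt` helper that is not defined in this section. One possible shape is sketched below; the section labels, the `(doc_id, content)` tuple input, and the ordering (system first, query last, per the "Lost in the Middle" guidance) are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def build_prompt(
    system: str,
    history: List[Dict[str, str]],
    documents: List[Tuple[str, str]],  # (doc_id, content) pairs, best first
    query: str,
) -> str:
    """Assemble the final prompt with the system text first and the query last."""
    parts = [system.strip()]
    if documents:
        parts.append("Reference documents:")
        # Tag each document with its ID so answers can cite their source
        parts.extend(f"[{doc_id}] {content.strip()}" for doc_id, content in documents)
    if history:
        parts.append("Conversation so far:")
        parts.extend(f"{m['role']}: {m['content']}" for m in history)
    parts.append(f"User question: {query.strip()}")
    return "\n\n".join(parts)


prompt = build_prompt(
    system="You are a helpful customer support agent.",
    history=[{"role": "user", "content": "Hi there"}],
    documents=[("doc_1", "Our refund policy allows returns within 30 days.")],
    query="What is your refund policy?",
)
print(prompt)
```

Keeping the document ID inline in the prompt is what lets the manifest's provenance carry through to the model's citations.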

5. Implementation Guide

5.1 Development Environment Setup

# Create environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install tiktoken sentence-transformers numpy scikit-learn

# For development
pip install pytest black mypy

5.2 Project Structure

context-manager/
├── src/
│   ├── __init__.py
│   ├── manager.py          # Main ContextManager class
│   ├── token_counter.py    # Token counting
│   ├── ranker.py           # Document ranking
│   ├── compressor.py       # Content compression
│   ├── manifest.py         # Manifest generation
│   └── utils.py
├── tests/
│   ├── test_token_counter.py
│   ├── test_ranker.py
│   └── test_manager.py
├── examples/
│   ├── basic_usage.py
│   ├── rag_system.py
│   └── long_conversation.py
├── pyproject.toml
└── README.md

5.3 Implementation Phases

Phase 1: Token Counting (Day 1)

Checkpoint 1.1: Implement TokenCounter

# src/token_counter.py
import tiktoken
from typing import List

class TokenCounter:
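    def __init__(self, model: str = "gpt-4"):
        self.model = model
        try:
            self.encoding = tiktoken.encoding_for_model(model)
        except KeyError:
            # tiktoken only knows OpenAI models; for anything else this
            # fallback yields an estimate, not an exact count
            self.encoding = tiktoken.get_encoding("cl100k_base")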

    def count(self, text: str) -> int:
        """Count tokens in text"""
        return len(self.encoding.encode(text))

    def count_messages(self, messages: List[dict]) -> int:
        """
        Count tokens in chat-formatted messages.

        Includes formatting overhead:
        - Message role markers
        - Message separators
        """
        tokens = 0
        for message in messages:
            # Approximate framing overhead per message (varies by model)
            tokens += 4

            # Role and content are both encoded as part of the message
            tokens += self.count(message.get("role", ""))
            tokens += self.count(message.get("content", ""))

            # Name field if present
            if "name" in message:
                tokens += self.count(message["name"])
                tokens += -1  # Role is omitted if name present

        tokens += 2  # Reply priming
        return tokens

    def fits_in_budget(self, text: str, budget: int) -> bool:
        """Check if text fits in budget"""
        return self.count(text) <= budget

Test the counter:

# tests/test_token_counter.py
import pytest
from src.token_counter import TokenCounter

def test_count_basic():
    counter = TokenCounter("gpt-4")

    assert counter.count("Hello") == 1
    assert counter.count("Hello, world!") == 4

def test_count_messages():
    counter = TokenCounter("gpt-4")

    messages = [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"}
    ]

    tokens = counter.count_messages(messages)
    assert tokens > 0
    assert tokens < 20  # Should be small

def test_fits_in_budget():
    counter = TokenCounter("gpt-4")

    assert counter.fits_in_budget("Short text", budget=100)
    assert not counter.fits_in_budget("x" * 10000, budget=10)

Phase 2: Document Ranking (Days 2-3)

Checkpoint 2.1: Implement keyword-based ranker

# src/ranker.py
from typing import List, Callable
from dataclasses import dataclass
import numpy as np

@dataclass
class RankedDocument:
    document: 'Document'
    score: float

class DocumentRanker:
    def __init__(self, strategy: str = "keyword"):
        self.strategy = strategy

    def rank(
        self,
        query: str,
        documents: List['Document']
    ) -> List[RankedDocument]:
        """Rank documents by relevance to query"""
        if self.strategy == "keyword":
            return self._keyword_rank(query, documents)
        elif self.strategy == "tfidf":
            return self._tfidf_rank(query, documents)
        elif self.strategy == "embedding":
            return self._embedding_rank(query, documents)
        else:
            raise ValueError(f"Unknown strategy: {self.strategy}")

    def _keyword_rank(
        self,
        query: str,
        documents: List['Document']
    ) -> List[RankedDocument]:
        """Simple keyword matching"""
        query_words = set(query.lower().split())

        ranked = []
        for doc in documents:
            doc_words = set(doc.content.lower().split())
            overlap = query_words & doc_words

            # Score: overlap / query words
            score = len(overlap) / len(query_words) if query_words else 0

            ranked.append(RankedDocument(document=doc, score=score))

        # Sort by score (descending)
        ranked.sort(key=lambda x: x.score, reverse=True)

        return ranked

    def _embedding_rank(
        self,
        query: str,
        documents: List['Document']
    ) -> List[RankedDocument]:
        """Semantic embedding-based ranking"""
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer('all-MiniLM-L6-v2')

        # Encode query (unit-normalized, so the dot product is cosine similarity)
        query_embedding = model.encode([query], normalize_embeddings=True)[0]

        # Encode documents (or use cached embeddings)
        doc_contents = [d.content for d in documents]
        doc_embeddings = model.encode(doc_contents, normalize_embeddings=True)

        # Calculate cosine similarity
        scores = np.dot(doc_embeddings, query_embedding)

        # Create ranked list
        ranked = [
            RankedDocument(document=doc, score=float(score))
            for doc, score in zip(documents, scores)
        ]

        ranked.sort(key=lambda x: x.score, reverse=True)

        return ranked
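`rank()` dispatches to a `_tfidf_rank` method that is not shown above. As a rough illustration of what it could compute, here is a dependency-free TF-IDF ranker using only the standard library. It is a stand-in sketch with a simple regex tokenizer and smoothed IDF, not the scikit-learn pipeline from Section 2.1, and it takes a plain `{doc_id: text}` dict instead of `Document` objects.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def _tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query: str, documents: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank documents by cosine similarity between TF-IDF vectors."""
    doc_counts = {doc_id: Counter(_tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(doc_counts)

    # Document frequency: in how many documents does each term appear?
    df: Counter = Counter()
    for counts in doc_counts.values():
        df.update(counts.keys())

    def weigh(counts: Counter) -> Dict[str, float]:
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so unseen terms stay finite
        return {
            term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, tf in counts.items()
        }

    def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(w * b.get(term, 0.0) for term, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = weigh(Counter(_tokenize(query)))
    scored = [(doc_id, cosine(query_vec, weigh(c))) for doc_id, c in doc_counts.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


ranked = tfidf_rank(
    "refund policy 30 days",
    {
        "refund": "Our refund policy allows returns within 30 days",
        "shipping": "Shipping takes 3-5 business days",
        "support": "For technical support contact us",
    },
)
print([doc_id for doc_id, _ in ranked])  # ['refund', 'shipping', 'support']
```

Inside `DocumentRanker`, the same scores would be wrapped in `RankedDocument` objects to match the other strategies.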

Phase 3: Content Compression (Days 4-5)

Checkpoint 3.1: History compression

# src/compressor.py
from typing import List
from .token_counter import TokenCounter

class ContentCompressor:
    def __init__(self, token_counter: TokenCounter):
        self.counter = token_counter

    def compress_history(
        self,
        messages: List[dict],
        budget: int
    ) -> List[dict]:
        """
        Compress conversation history to fit budget.

        Strategy:
        1. Always keep system message
        2. Keep last N messages (recency bias)
        3. Summarize older messages if needed
        """
        if not messages:
            return []

        # Separate system message
        system_msg = next(
            (m for m in messages if m["role"] == "system"),
            None
        )

        user_messages = [
            m for m in messages if m["role"] != "system"
        ]

        # Calculate budget for user messages
        user_budget = budget
        if system_msg:
            user_budget -= self.counter.count_messages([system_msg])

        # Keep adding messages from end until budget exceeded
        kept_messages = []
        current_tokens = 0

        for message in reversed(user_messages):
            msg_tokens = self.counter.count_messages([message])

            if current_tokens + msg_tokens <= user_budget:
                kept_messages.insert(0, message)
                current_tokens += msg_tokens
            else:
                break

        # Build result
        result = []
        if system_msg:
            result.append(system_msg)

        # If we couldn't keep all messages, add summary of omitted ones
        omitted_count = len(user_messages) - len(kept_messages)
        if omitted_count > 0:
            summary = {
                "role": "system",
                "content": f"[Earlier conversation summary: {omitted_count} messages omitted to save space]"
            }
            result.append(summary)

        result.extend(kept_messages)

        return result

    def compress_documents(
        self,
        ranked_docs: List['RankedDocument'],
        budget: int,
        strategy: str = "hybrid"
    ) -> List['Document']:
        """
        Compress documents to fit budget.

        Strategies:
        - select: Keep top-K docs (drop rest)
        - hybrid: Keep top docs, summarize middle, drop low
        """
        if strategy == "select":
            return self._select_top_k(ranked_docs, budget)
        elif strategy == "hybrid":
            return self._hybrid_compress(ranked_docs, budget)
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

    def _select_top_k(
        self,
        ranked_docs: List['RankedDocument'],
        budget: int
    ) -> List['Document']:
        """Select top-K documents that fit in budget"""
        selected = []
        current_tokens = 0

        for ranked_doc in ranked_docs:
            doc = ranked_doc.document
            doc_tokens = doc.tokens or self.counter.count(doc.content)

            if current_tokens + doc_tokens <= budget:
                selected.append(doc)
                current_tokens += doc_tokens
            else:
                # Budget exceeded
                break

        return selected

    def _hybrid_compress(
        self,
        ranked_docs: List['RankedDocument'],
        budget: int
    ) -> List['Document']:
        """
        Hybrid strategy:
        - First 70% of budget: full text of the top-ranked docs
        - Remaining 30% of budget: reserved for summaries of the next docs
        - Everything else: dropped
        """
        # Allocate budget
        full_text_budget = int(budget * 0.7)
        # Reserved for summaries; unused until summarization is implemented
        summary_budget = budget - full_text_budget

        # Select top docs for full text
        full_text_docs = self._select_top_k(
            ranked_docs,
            full_text_budget
        )

        # TODO: Summarize the next-ranked docs into summary_budget.
        # For now, only the full-text docs are returned.

        return full_text_docs
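
The summarization step in the hybrid strategy is left as a TODO above. One minimal way to fill it, assuming plain extractive truncation is acceptable as a first pass, is to keep leading sentences until the summary budget runs out. The names `summarize_to_budget` and `whitespace_count` are hypothetical and not part of the project code; the whitespace counter is only a stand-in for `TokenCounter.count`.

```python
import re
from typing import Callable, List


def whitespace_count(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def summarize_to_budget(
    content: str,
    budget: int,
    count: Callable[[str], int] = whitespace_count,
) -> str:
    """Keep leading sentences of `content` until the next one would exceed `budget`."""
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", content.strip())
    kept: List[str] = []
    used = 0
    for sentence in sentences:
        cost = count(sentence)
        if used + cost > budget:
            break  # stop at the first sentence that does not fit
        kept.append(sentence)
        used += cost
    return " ".join(kept)
```

In the real compressor you would probably pass the project's own counter as `count` and record the source document id alongside each summary so provenance is preserved. An LLM-based abstractive summary could replace this function later without changing the budget logic.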

Phase 4: Integration (Days 6-7)

Checkpoint 4.1: Main ContextManager

# src/manager.py
from typing import List, Optional
from dataclasses import dataclass

from .token_counter import TokenCounter
from .ranker import DocumentRanker, RankedDocument
from .compressor import ContentCompressor
from .manifest import ContextManifest, ManifestGenerator

@dataclass
class Document:
    id: str
    content: str
    metadata: Optional[dict] = None
    tokens: Optional[int] = None

@dataclass
class ContextResult:
    final_prompt: str
    manifest: ContextManifest
    total_tokens: int

class ContextManager:
    def __init__(
        self,
        model: str = "gpt-4",
        total_budget: int = 8000,
        response_budget: int = 1000,
        ranking_strategy: str = "keyword"
    ):
        self.total_budget = total_budget
        self.response_budget = response_budget

        self.counter = TokenCounter(model)
        self.ranker = DocumentRanker(strategy=ranking_strategy)
        self.compressor = ContentCompressor(self.counter)
        self.manifest_generator = ManifestGenerator()

    def build_context(
        self,
        query: str,
        documents: List[Document],
        conversation_history: Optional[List[dict]] = None,
        system_prompt: str = ""
    ) -> ContextResult:
        """Build optimized context within budget"""
        # Calculate token counts
        system_tokens = self.counter.count(system_prompt) if system_prompt else 0
        query_tokens = self.counter.count(query)

        fixed_tokens = system_tokens + query_tokens + self.response_budget
        # Clamp at zero: the fixed costs alone can exceed the total budget
        available_budget = max(0, self.total_budget - fixed_tokens)

        # Allocate budget
        history_budget = int(available_budget * 0.3)
        docs_budget = available_budget - history_budget

        # Rank documents
        ranked_docs = self.ranker.rank(query, documents)

        # Compress history
        compressed_history = []
        if conversation_history:
            compressed_history = self.compressor.compress_history(
                conversation_history,
                history_budget
            )

        # Select/compress documents
        selected_docs = self.compressor.compress_documents(
            ranked_docs,
            docs_budget,
            strategy="select"
        )

        # Build final prompt
        final_prompt = self._build_prompt(
            system_prompt,
            compressed_history,
            selected_docs,
            query
        )

        # Generate manifest
        # Document.tokens may be unset, so fall back to counting the content
        included_ids = {d.id for d in selected_docs}
        manifest = self.manifest_generator.generate(
            total_budget=self.total_budget,
            used_tokens=self.counter.count(final_prompt),
            system_tokens=system_tokens,
            query_tokens=query_tokens,
            history_tokens=self.counter.count_messages(compressed_history),
            docs_tokens=sum(
                d.tokens if d.tokens is not None else self.counter.count(d.content)
                for d in selected_docs
            ),
            included_docs=[d.id for d in selected_docs],
            excluded_docs=[
                rd.document.id for rd in ranked_docs
                if rd.document.id not in included_ids
            ]
        )

        return ContextResult(
            final_prompt=final_prompt,
            manifest=manifest,
            total_tokens=self.counter.count(final_prompt)
        )

    def _build_prompt(
        self,
        system_prompt: str,
        history: List[dict],
        documents: List[Document],
        query: str
    ) -> str:
        """Build final prompt string"""
        parts = []

        if system_prompt:
            parts.append(f"SYSTEM: {system_prompt}\n")

        if history:
            parts.append("CONVERSATION HISTORY:")
            for msg in history:
                parts.append(f"{msg['role'].upper()}: {msg['content']}")
            parts.append("")

        if documents:
            parts.append("RELEVANT DOCUMENTS:")
            for doc in documents:
                parts.append(f"[{doc.id}]")
                parts.append(doc.content)
                parts.append("")

        parts.append(f"USER QUERY: {query}")

        return "\n".join(parts)
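
`_build_prompt` emits documents in rank order, so mid-ranked documents land in the middle of the prompt. One mitigation often associated with the "Lost in the Middle" effect is to place the strongest documents at both edges and the weakest in the center. The sketch below assumes the input list is sorted best-first; `reorder_for_edges` is a hypothetical helper, not part of the project code.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_edges(ranked: List[T]) -> List[T]:
    """Place the best items at the start and end, the weakest in the middle.

    Assumes `ranked` is sorted from most to least relevant.
    """
    front: List[T] = []
    back: List[T] = []
    for index, item in enumerate(ranked):
        # Alternate: ranks 1, 3, 5... go to the front; ranks 2, 4, 6... to the back
        if index % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item ends up last
    return front + back[::-1]
```

For example, `reorder_for_edges(selected_docs)` could be applied just before the documents are passed to `_build_prompt`. Whether this ordering helps depends on the model and task, so it is worth measuring rather than assuming.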

Checkpoint 4.2: Manifest generator

# src/manifest.py
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import List

@dataclass
class ContextManifest:
    total_budget: int
    used_tokens: int
    timestamp: str

    system_tokens: int
    query_tokens: int
    history_tokens: int
    docs_tokens: int

    documents_included: List[str]
    documents_excluded: List[str]

    def to_dict(self) -> dict:
        return asdict(self)

    def print_summary(self):
        """Print human-readable summary"""
        print("╔" + "="*62 + "╗")
        print("║" + " "*10 + "CONTEXT BUDGET REPORT" + " "*31 + "║")
        print("╚" + "="*62 + "╝\n")

        print(f"Total Budget: {self.total_budget:,} tokens")
        print(f"Used: {self.used_tokens:,} tokens ({self.used_tokens/self.total_budget*100:.1f}%)\n")

        print("BREAKDOWN:")
        print(f"  System Prompt:      {self.system_tokens:,} tokens")
        print(f"  User Query:         {self.query_tokens:,} tokens")
        print(f"  History:            {self.history_tokens:,} tokens")
        print(f"  Documents:          {self.docs_tokens:,} tokens\n")

        print(f"Documents Included: {len(self.documents_included)}")
        print(f"Documents Excluded: {len(self.documents_excluded)}")

class ManifestGenerator:
    def generate(
        self,
        total_budget: int,
        used_tokens: int,
        system_tokens: int,
        query_tokens: int,
        history_tokens: int,
        docs_tokens: int,
        included_docs: List[str],
        excluded_docs: List[str]
    ) -> ContextManifest:
        """Generate context manifest"""
        return ContextManifest(
            total_budget=total_budget,
            used_tokens=used_tokens,
            timestamp=datetime.now().isoformat(),
            system_tokens=system_tokens,
            query_tokens=query_tokens,
            history_tokens=history_tokens,
            docs_tokens=docs_tokens,
            documents_included=included_docs,
            documents_excluded=excluded_docs
        )
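
The per-component counts in the manifest will usually not add up to `used_tokens`, because the prompt also contains labels such as "RELEVANT DOCUMENTS:" and document ids. A small sketch of how the dictionary returned by `to_dict()` could be written to an audit log with that gap made explicit follows. The function name `manifest_to_json_line` and the extra key `formatting_overhead_tokens` are assumptions for illustration.

```python
import json
from typing import Any, Dict

# Field names taken from ContextManifest above
BREAKDOWN_KEYS = ("system_tokens", "query_tokens", "history_tokens", "docs_tokens")


def manifest_to_json_line(manifest: Dict[str, Any]) -> str:
    """Serialize a manifest dict to one JSON line, adding the unaccounted tokens."""
    record = dict(manifest)  # copy so the caller's dict is not modified
    accounted = sum(record[key] for key in BREAKDOWN_KEYS)
    # Tokens spent on labels, separators, and anything else not in the breakdown
    record["formatting_overhead_tokens"] = record["used_tokens"] - accounted
    return json.dumps(record, sort_keys=True)
```

Appending one such line per request to a file gives a simple, greppable audit trail. A large or negative overhead value is a hint that one of the component counts is wrong.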

6. Testing Strategy

6.1 Critical Test Cases

# tests/test_manager.py
import pytest
from src.manager import ContextManager, Document

def test_basic_budget_compliance():
    """Ensure context never exceeds budget"""
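    # Reserve less than the total for the response; with the default
    # response_budget of 1000, nothing would be left for documents
    manager = ContextManager(total_budget=1000, response_budget=200)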

    docs = [
        Document(id=f"doc_{i}", content="x" * 500)
        for i in range(10)
    ]

    result = manager.build_context(
        query="Test query",
        documents=docs
    )

    assert result.total_tokens <= 1000

def test_document_ranking():
    """Ensure relevant documents are selected"""
    manager = ContextManager(ranking_strategy="keyword")

    docs = [
        Document(id="doc_refund", content="Our refund policy allows..."),
        Document(id="doc_shipping", content="Shipping takes 3-5 days..."),
        Document(id="doc_returns", content="To return an item...")
    ]

    result = manager.build_context(
        query="What is your refund policy?",
        documents=docs
    )

    # Most relevant doc should be included
    assert "doc_refund" in result.manifest.documents_included

7. Extensions & Challenges

7.1 Beginner Extensions

Extension 1: Support for Multiple Models

  • Add support for Claude and Llama tokenizers
  • Handle model-specific formatting

Extension 2: Visualization Dashboard

  • Create HTML report showing budget allocation
  • Visualize which docs were selected/excluded

7.2 Intermediate Extensions

Extension 3: Adaptive Budgeting

  • Dynamically adjust allocation based on query type
  • More budget for complex queries
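
As a starting point for adaptive budgeting, the fixed 30/70 split between history and documents could be made to depend on the query. The sketch below uses a deliberately crude heuristic: if the query seems to refer back to the conversation, history gets a larger share. The function name, the cue list, and the 50% follow-up share are all assumptions for illustration.

```python
from typing import Dict, Tuple

# Hypothetical cues suggesting the query depends on earlier turns
FOLLOW_UP_CUES: Tuple[str, ...] = ("earlier", "previous", "you said", "as before", "again")


def allocate_budget(
    available: int,
    query: str,
    history_percent: int = 30,
    follow_up_percent: int = 50,
) -> Dict[str, int]:
    """Split `available` tokens between history and documents based on the query."""
    available = max(0, available)  # never hand out a negative budget
    lowered = query.lower()
    is_follow_up = any(cue in lowered for cue in FOLLOW_UP_CUES)
    percent = follow_up_percent if is_follow_up else history_percent
    # Integer arithmetic avoids floating-point rounding surprises
    history = available * percent // 100
    return {"history": history, "documents": available - history}
```

A classifier or the ranker's own scores would likely be a better signal than keyword cues, but the interface (available tokens in, per-component budgets out) can stay the same.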

Extension 4: Semantic Chunking

  • Split long documents into semantic chunks
  • Rank chunks instead of full documents
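
A rough first approximation of semantic chunking is to split on paragraph boundaries and pack whole paragraphs into chunks up to a token limit. This is only a proxy for true semantic segmentation. The sketch counts whitespace-separated words as a stand-in for real token counts, and `chunk_by_paragraph` is a hypothetical name.

```python
import re
from typing import List


def chunk_by_paragraph(text: str, max_tokens: int) -> List[str]:
    """Group consecutive paragraphs into chunks of at most `max_tokens` words.

    A single paragraph longer than `max_tokens` becomes its own oversized chunk.
    """
    # Paragraphs are separated by one or more blank lines
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for paragraph in paragraphs:
        cost = len(paragraph.split())  # stand-in for a real token count
        if current and current_tokens + cost > max_tokens:
            # Close the current chunk before it would overflow
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be wrapped in a `Document` with an id derived from the parent document, so the ranker scores chunks while the manifest can still trace them back to their source.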

7.3 Advanced Extensions

Extension 5: Multi-Query Optimization

  • Optimize context for a batch of related queries
  • Reuse context across queries

Extension 6: Cost-Aware Selection

  • Factor in API costs when selecting models
  • Trade-off between context size and cost
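
One simple form of cost-aware selection is to pick the cheapest model whose context window can hold the prompt plus the reserved response. The sketch below illustrates the idea. The model names and prices used in the usage note are placeholders, not real pricing, and `ModelOption`, `cheapest_model_that_fits`, and `estimate_input_cost` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelOption:
    name: str
    context_window: int             # maximum tokens (prompt + response)
    usd_per_1k_input_tokens: float  # placeholder price, supplied by the caller


def cheapest_model_that_fits(
    prompt_tokens: int,
    response_budget: int,
    options: List[ModelOption],
) -> Optional[ModelOption]:
    """Return the lowest-priced model whose window fits prompt + response, or None."""
    needed = prompt_tokens + response_budget
    candidates = [m for m in options if m.context_window >= needed]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.usd_per_1k_input_tokens)


def estimate_input_cost(prompt_tokens: int, model: ModelOption) -> float:
    """Estimate the input-side cost in USD for a prompt of `prompt_tokens`."""
    return prompt_tokens / 1000 * model.usd_per_1k_input_tokens
```

For example, with `ModelOption("small-model", 8000, 0.5)` and `ModelOption("large-model", 128000, 2.0)`, a 3,000-token prompt with a 1,000-token response budget selects the small model, while a 20,000-token prompt forces the large one. A fuller version would also weigh output pricing and answer quality.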

8. Real-World Connections

8.1 Industry Applications

Use Case 1: Notion AI

  • Long document chat
  • Must select relevant sections from 100+ page docs

Use Case 2: Perplexity AI

  • Multi-source search results
  • Rank and select top web pages

Use Case 3: Customer Support RAG

  • Knowledge base with 10K+ articles
  • Select 3-5 most relevant for each query

9. Resources

9.1 Essential Reading

Books

  • “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 3 (Storage & Retrieval)
  • “Algorithms” by Sedgewick - Ch. 4 (Search patterns)

Papers

  • “Lost in the Middle” (Liu et al., 2023) - Context position effects
  • “Long-Context LLMs” - Strategies for handling long context

10. Self-Assessment Checklist

Understanding

  • I understand why token counting must be exact
  • I can explain the “Lost in the Middle” phenomenon
  • I know when to use selection vs. summarization
  • I understand why provenance tracking matters

Implementation

  • My token counter uses a proper tokenizer (tiktoken)
  • I implement at least 2 ranking strategies
  • I track budget allocation in the manifest
  • I handle edge cases (empty docs, huge queries)

11. Completion Criteria

Minimum Viable Completion

  • TokenCounter with tiktoken integration
  • Document ranking (keyword-based minimum)
  • Budget allocation logic
  • Document selection to fit budget
  • Basic manifest generation
  • Integration tests with budget compliance

Full Completion

  • Multiple ranking strategies (keyword + embeddings)
  • History compression
  • Hybrid selection/summarization
  • Detailed manifests with provenance
  • CLI tool for testing
  • Performance benchmarks

You now have production-grade context management infrastructure. This is essential for any LLM application dealing with RAG, long conversations, or large document sets.