Project 4: Context Window Manager

Deterministic context packets with measured relevance and budget compliance.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate See main guide estimates (typically 3-8 days except capstone)
Main Programming Language Python
Alternative Programming Languages TypeScript, Go
Coolness Level Level 3: Practical Performance Win
Business Potential 4. Platform Feature
Knowledge Area RAG Context Engineering
Software or Tool Retriever + reranker + packer
Main Book Designing Data-Intensive Applications
Concept Clusters Context Engineering and Caching; Prompt Contracts and Output Typing

1. Learning Objectives

By completing this project, you will:

  1. Understand token counting algorithms (BPE, SentencePiece) and why token counts differ across models and tokenizers.
  2. Design a context assembly pipeline that retrieves, reranks, and packs evidence chunks within strict token budgets.
  3. Implement priority-based content selection that balances relevance, freshness, trust, and mandatory coverage requirements.
  4. Build a token ledger that accounts for every token in the context window: system prompt, instructions, retrieved evidence, conversation history, and output reservation.
  5. Produce deterministic, reproducible context packets with explainability fields showing why each chunk was included or excluded.
  6. Measure context quality through coverage scores, relevance metrics, and budget utilization efficiency.

2. All Theory Needed (Per-Concept Breakdown)

Concept A: Tokenization and Token Budget Accounting

Fundamentals Token counting is the foundation of context window management because every LLM enforces a hard limit on the total number of tokens it can process in a single request (input tokens + output tokens). A token is not a word, not a character, and not a byte: it is a model-specific unit produced by a tokenizer algorithm. The same English sentence may tokenize into 10 tokens on GPT-4 and 12 tokens on Claude, because each model uses a different tokenizer with a different vocabulary. If your context assembly pipeline uses an approximate token counter instead of the provider-exact tokenizer, you will either waste budget (packing fewer chunks than you could) or overflow the window (causing truncation or API errors). Token budget accounting treats the context window as a finite resource that must be allocated across competing sections: system prompt, safety preamble, retrieved evidence, conversation history, and reserved space for the model’s output. Getting this accounting wrong is the single most common failure mode in RAG systems.

Deep Dive into the concept Tokenization algorithms determine how text is split into tokens. The dominant algorithm family is Byte Pair Encoding (BPE), used by OpenAI’s GPT-family models (via tiktoken) and many open-source models. BPE works by starting with individual bytes (or characters) as the initial vocabulary, then iteratively merging the most frequent adjacent pair into a new token. This process repeats for a fixed number of merge operations (typically 50,000-100,000 merges), producing a vocabulary where common words like “the” are single tokens, common subwords like “ing” or “tion” are single tokens, and rare words are split into multiple subword tokens. The key insight is that BPE is deterministic given a fixed vocabulary: the same text always produces the same tokens.
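The merge loop can be illustrated with a toy sketch. This is not a production tokenizer: real BPE learns a merge table from a large corpus and then applies it at encode time, while this sketch collapses both steps and operates on a single string.

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Toy BPE: start from single characters, then repeatedly merge
    the most frequent adjacent pair into one new token."""
    tokens = list(text)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            # Greedily replace each (a, b) occurrence with the merged token.
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

For example, `bpe_merges("abab", 1)` merges the most frequent pair `("a", "b")` and yields `["ab", "ab"]`; a second merge yields the single token `["abab"]`. The determinism property from the paragraph above holds: the same input and merge count always produce the same tokens.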

SentencePiece is an alternative tokenization framework (used by models like LLaMA and some Google models) that treats the input as a raw byte stream without assuming pre-tokenized word boundaries. It supports both BPE and Unigram Language Model algorithms. The practical difference is that SentencePiece can handle any language or encoding without whitespace assumptions, making it more robust for multilingual content.

Each model provider publishes (or ships) an exact tokenizer. OpenAI provides tiktoken (Python library), Anthropic provides their tokenizer via the API’s token counting endpoint, and open-source models ship their tokenizer as part of the model artifacts. For production token accounting, you must use the exact tokenizer for the model you are calling. Approximate counters (e.g., “divide character count by 4”) introduce errors of 10-30%, which is catastrophic for budget management.

Token budget allocation divides the context window into named sections, each with a reserved or maximum token count. A typical allocation for a 128K-token model might be:

System prompt and safety preamble: 500-2000 tokens (fixed, non-negotiable). These are the developer instructions that define the model’s behavior, output format, and safety constraints. They must always fit within the window, so they are allocated first.

Conversation history: 1000-8000 tokens (sliding window). For multi-turn conversations, recent messages are kept and older messages are summarized or dropped. A sliding window keeps the N most recent turns; a summarization approach compresses older turns into a summary that uses fewer tokens.

Retrieved evidence (RAG context): the primary variable allocation, typically 2000-50000 tokens depending on the query complexity and model. This is where the context packer operates: selecting, ordering, and fitting evidence chunks into the remaining budget after fixed sections are accounted for.

Output reservation: 500-4000 tokens reserved for the model’s response. The model needs space to generate its answer; if you fill the entire window with input, the model has no room to respond. Output reservation must be subtracted from the total budget before packing evidence.

The token ledger is a data structure that tracks the current allocation:

Token Ledger:
  total_budget:        128000
  system_prompt:        -1200  (fixed)
  safety_preamble:       -350  (fixed)
  conversation_history: -3200  (sliding window, 8 recent turns)
  output_reservation:   -2000  (reserved for response)
  ─────────────────────────────
  available_for_evidence: 121250
  packed_evidence:      -118900 (47 chunks packed)
  remaining_unused:        2350 (1.8% waste)
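One minimal way to represent this ledger in code is sketched below. The section names are illustrative, and `count_tokens` is injected so that production code can pass the model's exact tokenizer (e.g., tiktoken for GPT models); a stand-in counter is used here only for demonstration.

```python
class TokenLedger:
    """Tracks token allocations against a model's context window (sketch)."""

    def __init__(self, total_budget, output_reservation, count_tokens):
        self.total_budget = total_budget
        self.output_reservation = output_reservation
        self.count_tokens = count_tokens  # must be the model-exact tokenizer in production
        self.sections = {}                # section name -> token count

    def allocate(self, section, text):
        """Tokenize `text` with the exact tokenizer and record its cost."""
        self.sections[section] = self.count_tokens(text)

    def available_for_evidence(self):
        """Total window minus all allocated sections and the output reservation."""
        used = sum(self.sections.values()) + self.output_reservation
        return self.total_budget - used

    def validate(self):
        """Hard failure if fixed sections already exceed the window."""
        if self.available_for_evidence() < 0:
            raise ValueError("budget overflow: fixed sections exceed the window")
```

The packer receives `available_for_evidence()` as its budget, which keeps the accounting in one place and makes overflow a loud, early failure instead of a silent truncation at the API.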

Budget overflow happens when the sum of all sections exceeds the model’s context window. This is a hard failure: the API rejects the request, or the provider silently truncates the input, losing information from the end. Budget underflow (significant unused space) is a soft failure: you are paying for context capacity you are not using, and the model has less evidence to reason with.

Context window sizes vary dramatically across models (as of 2025): GPT-4o supports 128K tokens, Claude 3.5 Sonnet supports 200K tokens, Gemini 1.5 Pro supports 2M tokens, and many open-source models support 8K-32K tokens. Your context manager must be parameterized by model, using the correct tokenizer and window size for each target model.

Prefix caching is a provider optimization where the API caches the key-value (KV) representations of token prefixes that repeat across requests. If your system prompt and safety preamble are identical across many requests (which they should be), the provider can cache the KV computations for that prefix, reducing both latency and cost. Cached tokens typically cost 10x less than uncached tokens. To maximize cache hits, your context assembly must ensure that the fixed prefix (system prompt + preamble) is byte-identical across requests. Any variation (even a single whitespace change) invalidates the cache.
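A cheap client-side guard for prefix stability is to fingerprint the assembled fixed prefix and alert when it drifts between requests. This is a sketch of that idea, not a provider API: it only checks byte-identity on your side, which is the precondition for cache hits described above.

```python
import hashlib

def prefix_fingerprint(system_prompt: str, preamble: str) -> str:
    """Hash the cacheable prefix so any drift, even whitespace, is detectable."""
    blob = (system_prompt + "\n" + preamble).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Logging this fingerprint per request makes cache-invalidating changes (a timestamp sneaking into the system prompt, a trailing space added by a template) show up as a changed hash rather than as a silent cost increase.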

How this fits into the projects Token budget accounting is the foundation of Project 4. Every decision in the context packer (which chunks to include, in what order, whether to truncate) depends on accurate token counting and budget allocation. This concept also connects to Project 9 (Prompt Caching Optimizer) where prefix caching becomes the primary focus.

Definitions & key terms

  • Token: The smallest unit of text that a model processes, produced by a tokenizer algorithm. Not a word, not a character.
  • BPE (Byte Pair Encoding): A tokenization algorithm that iteratively merges frequent byte pairs to build a vocabulary. Used by GPT-family models (tiktoken).
  • SentencePiece: A tokenization framework that operates on raw byte streams without word boundary assumptions. Used by LLaMA and some Google models.
  • Context window: The maximum number of tokens (input + output) a model can process in a single request.
  • Token budget: The total context window minus output reservation, divided into named sections (system prompt, history, evidence).
  • Token ledger: A data structure tracking the current token allocation across all sections, showing used, available, and wasted tokens.
  • Prefix caching: Provider optimization that caches KV computations for token prefixes that repeat across requests, reducing cost and latency.
  • Output reservation: Tokens subtracted from the total budget and reserved for the model’s generated response.

Mental model diagram (ASCII)

              CONTEXT WINDOW TOKEN BUDGET
              ============================

  +----------------------------------------------------------+
  |                    MODEL CONTEXT WINDOW                   |
  |                  (e.g., 128,000 tokens)                   |
  |                                                           |
  |  +----------------------------------------------------+  |
  |  | SYSTEM PROMPT + SAFETY PREAMBLE (fixed)             |  |
  |  | ~1,500 tokens                                       |  |
  |  | [CACHEABLE PREFIX - same across all requests]       |  |
  |  +----------------------------------------------------+  |
  |  +----------------------------------------------------+  |
  |  | CONVERSATION HISTORY (sliding window)               |  |
  |  | ~3,200 tokens (8 most recent turns)                 |  |
  |  | Older turns -> summarized or dropped                |  |
  |  +----------------------------------------------------+  |
  |  +----------------------------------------------------+  |
  |  | RETRIEVED EVIDENCE (variable - packer fills this)   |  |
  |  |                                                     |  |
  |  | Chunk 1: [tax_deduction_rules.md, 850 tokens]       |  |
  |  | Chunk 2: [home_office_criteria.md, 720 tokens]      |  |
  |  | Chunk 3: [irs_pub_587.md, 1100 tokens]              |  |
  |  |  ...                                                |  |
  |  | Chunk N: [last_fitting_chunk.md, 340 tokens]        |  |
  |  |                                                     |  |
  |  | Total packed: ~121,000 tokens                       |  |
  |  +----------------------------------------------------+  |
  |  +----------------------------------------------------+  |
  |  | OUTPUT RESERVATION (reserved for model response)    |  |
  |  | ~2,000 tokens                                       |  |
  |  +----------------------------------------------------+  |
  +----------------------------------------------------------+

  TOKENIZATION PIPELINE
  =====================

  Raw Text: "The home office deduction allows..."
       |
       v
  +------------------+
  | TOKENIZER (BPE)  |  "The" -> [464]
  | (model-specific) |  " home" -> [1524]
  |                  |  " office" -> [3906]
  | tiktoken (GPT)   |  " deduction" -> [82546]
  | or               |  " allows" -> [6276]
  | SentencePiece    |  "..." -> [986]
  +------------------+
       |
       v
  Token count: 6 tokens for this phrase
  (actual count, not word count or char/4 estimate)

  BUDGET ALLOCATION FLOW
  ======================

  Total Window: 128,000
       |
       v
  Subtract fixed sections:
    - System prompt:       1,200
    - Safety preamble:       350
    - Output reservation:  2,000
    - Conv. history:       3,200
       |
       v
  Available for evidence: 121,250 tokens
       |
       v
  Packer fills with ranked chunks
  until budget is exhausted or
  all relevant chunks are packed

How it works (step-by-step, with invariants and failure modes)

  1. Initialize the token ledger with the model’s total context window size and the exact tokenizer. Invariant: tokenizer must match the target model exactly. Failure mode: using wrong tokenizer causes token count mismatches of 10-30%.
  2. Allocate fixed sections (system prompt, safety preamble) and compute their exact token counts. Invariant: fixed sections must always fit within the window. Failure mode: system prompt exceeds budget, leaving no room for evidence.
  3. Allocate conversation history using a sliding window of recent turns. Compute exact token count per turn. Invariant: history allocation must not exceed its cap. Failure mode: a single very long turn exceeds the history budget; truncation or summarization is needed.
  4. Reserve output tokens by subtracting from the total budget. Invariant: output reservation must be at least the minimum expected response length. Failure mode: insufficient reservation causes the model to truncate its response mid-answer.
  5. Compute available evidence budget = total - system - preamble - history - output_reservation. Invariant: available budget must be >= 0. Failure mode: fixed sections exceed the total window, leaving negative budget for evidence.
  6. Pass available evidence budget to the context packer (Concept B) for chunk selection and ordering.

Minimal concrete example

Token counting comparison across models:

Input: "What are the tax deduction rules for a home office in 2024?"

tiktoken (GPT-4):  13 tokens
  ["What", " are", " the", " tax", " deduction", " rules",
   " for", " a", " home", " office", " in", " 2024", "?"]

Anthropic tokenizer (Claude):  12 tokens
  ["What", " are", " the", " tax", " deduction", " rules",
   " for", " a", " home", " office", " in 2024", "?"]

Approximate (chars/4):  60 chars / 4 = 15 tokens (wrong!)

Budget ledger for a 4096-token model (small model example):
  total_budget:          4096
  system_prompt:          -640
  safety_preamble:        -120
  output_reservation:     -512
  conversation_history:   -800
  ─────────────────────────────
  available_for_evidence: 2024 tokens

  This means you can fit approximately 2-3 medium chunks
  (600-1000 tokens each).
  Budget discipline is critical on small-window models.

Common misconceptions

  • “Divide character count by 4 to estimate tokens.” This heuristic is wildly inaccurate for non-English text, code, URLs, and special characters. Japanese text may use 2-3x more tokens per character than English. Always use the exact tokenizer.
  • “All models use the same tokenizer.” Each model family has its own tokenizer with a different vocabulary. GPT-4 uses cl100k_base, Claude uses its own tokenizer, LLaMA uses SentencePiece. Token counts for the same text differ across models.
  • “Context window size = maximum input size.” The context window includes both input and output tokens. If the window is 128K and you pack 127K of input, the model can only generate 1K tokens of response.
  • “Bigger context windows make token budgeting unnecessary.” Even with 200K or 2M token windows, budget discipline matters. Irrelevant context dilutes attention, increases cost (per-token pricing), adds latency, and can actually hurt response quality through the “lost in the middle” phenomenon where models struggle to use information in the middle of very long contexts.
  • “Prefix caching works automatically.” Caching only activates when the prefix bytes are identical across requests. Any variation in the system prompt text, even whitespace or timestamp differences, invalidates the cache and you pay full price for every token.

Check-your-understanding questions

  1. Why does using an approximate token counter instead of the exact tokenizer cause problems in a context packing pipeline?
  2. What happens if you do not reserve tokens for the model’s output?
  3. How does prefix caching interact with context assembly, and what must be true about the system prompt for caching to work?
  4. Why does a 128K context window not mean you can pack 128K tokens of evidence?
  5. How would you handle a conversation history that exceeds its allocated token budget?

Check-your-understanding answers

  1. Approximate counters can be off by 10-30%, which means you either waste budget (packing fewer chunks than you could fit) or overflow the window (the API rejects the request or truncates silently). In a system processing thousands of queries, even a 5% error rate causes intermittent failures that are hard to debug because they depend on the specific text content.
  2. The model’s response gets truncated mid-generation. It may stop mid-sentence, miss critical parts of the answer, or fail to produce required structured output fields. This is a silent failure: no error is raised, but the output is incomplete.
  3. Prefix caching caches the KV-pair computations for token prefixes that are identical across requests. For caching to work, the system prompt and any fixed preamble must produce byte-identical token sequences every time. This means no timestamps, no request-specific data, and no randomization in the fixed prefix. The context assembly pipeline must ensure the cacheable prefix is separated from the variable content.
  4. The 128K limit includes both input and output tokens. You must subtract the output reservation (the space the model needs to generate its response), the system prompt, conversation history, and any fixed preamble. The actual space available for packed evidence is typically 80-95% of the total window.
  5. Options include: (a) sliding window that keeps only the N most recent turns, (b) summarization that compresses older turns into a shorter summary, (c) priority-based retention that keeps the most relevant past turns based on topic similarity to the current query, or (d) hybrid approaches that summarize old turns while keeping recent ones verbatim. Each approach trades information loss against token savings, and the choice depends on whether conversation continuity or evidence space is more important for the application.

Real-world applications

  • RAG systems in production (customer support, legal research, medical Q&A) must manage token budgets across system instructions, retrieved documents, and conversation history within strict latency and cost constraints.
  • Multi-model orchestration pipelines that route queries to different models (GPT-4 for complex reasoning, Claude for long-context tasks, smaller models for simple queries) need tokenizer-aware budget management per model.
  • Prompt caching optimization (Anthropic, OpenAI) saves 90% of token costs on the cacheable prefix, but only if the context assembly pipeline produces byte-identical prefixes.
  • Cost monitoring dashboards track token utilization efficiency to identify queries that waste budget by packing irrelevant content or underutilizing available window space.

Where you’ll apply it

  • Phase 1: implement the token ledger and budget allocation for fixed sections.
  • Phase 2: integrate exact tokenizer for the target model and measure packing efficiency.
  • Phase 3: optimize prefix stability for caching and track budget utilization metrics.

References

  • “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 3 (data structures for ordered access patterns)
  • OpenAI tiktoken library documentation and cookbook (https://github.com/openai/tiktoken)
  • Anthropic token counting API documentation
  • “AI Engineering” by Chip Huyen - chapters on serving and cost optimization
  • Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units” (2016) - original BPE for NLP paper
  • Kudo and Richardson, “SentencePiece: A simple and language independent subword tokenizer” (2018)

Key insights Token counting is not estimation; it is accounting, and the ledger must balance exactly or the system fails silently at the worst possible time.

Summary Token budget management requires using the exact model-specific tokenizer, allocating the context window into named sections (system prompt, history, evidence, output reservation), tracking allocations in a token ledger, and optimizing for prefix caching by keeping fixed sections byte-identical. Approximate token counting is a common source of production failures in RAG systems.

Homework/Exercises to practice the concept

  1. Using pseudocode, implement a token ledger that accepts a model name, loads the correct tokenizer, and tracks allocations for 4 named sections. Show how available evidence budget is computed.
  2. Given the following scenario, compute the token budget: model has 32K window, system prompt is 800 tokens, preamble is 200 tokens, conversation history is 4 turns averaging 300 tokens each, and output reservation is 1000 tokens. How many tokens are available for evidence?
  3. Design a prefix caching strategy for a customer support chatbot. Identify which parts of the prompt are cacheable (identical across requests) and which are variable. Estimate the cost savings if 80% of the total tokens are in the cacheable prefix.
  4. Write pseudocode for a sliding window conversation history manager that keeps the most recent N turns within a token budget, summarizing older turns when the budget is exceeded.

Solutions to the homework/exercises

  1. The ledger should have methods: allocate(section_name, text) that tokenizes the text with the exact tokenizer and records the count, available_budget() that returns total_window - sum(allocations) - output_reservation, and validate() that checks all sections fit. The constructor should dispatch to tiktoken for GPT models, the Anthropic tokenizer for Claude, and SentencePiece for LLaMA models.
  2. Total: 32,000. System prompt: -800. Preamble: -200. History: -(4 * 300) = -1,200. Output: -1,000. Available: 32,000 - 800 - 200 - 1,200 - 1,000 = 28,800 tokens for evidence. This fits approximately 30-50 medium chunks (600-1000 tokens each).
  3. Cacheable prefix: system prompt (“You are a helpful customer support agent for…”) + safety preamble (“Never reveal internal policies…”) + output format instructions (“Respond in JSON with fields…”). These are identical for every request. Variable content: conversation history, retrieved knowledge base chunks, current user query. If the cacheable prefix is 80% of tokens and cached tokens cost 10x less, the cost per request drops from X to 0.28X (80% * 0.1 + 20% * 1.0 = 0.28), a 72% savings.
  4. The sliding window manager should: (a) compute token count for each turn, (b) sum from most recent backward until the budget is exceeded, (c) for remaining older turns, generate a summary (using a separate LLM call or extractive method), (d) include the summary as a “compressed history” section that fits within its own token allocation. The invariant is that the total history section never exceeds its allocated budget.
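Solution 4 can be sketched in Python as follows. `summarize` is a placeholder for an LLM call or extractive summarizer, and `count_tokens` stands in for the model-exact tokenizer; note that the returned summary must still fit within its own separate token allocation, as the solution states.

```python
def pack_history(turns, budget, count_tokens, summarize):
    """Keep the most recent turns within `budget`; compress the rest.

    `turns` is ordered oldest-first. Returns (summary_or_None, kept_turns).
    """
    kept, used = [], 0
    for turn in reversed(turns):          # walk from most recent backward
        cost = count_tokens(turn)
        if used + cost > budget:
            break                         # this and all older turns get summarized
        kept.insert(0, turn)
        used += cost
    older = turns[: len(turns) - len(kept)]
    summary = summarize(older) if older else None
    return summary, kept
```

The invariant from the solution holds by construction: the verbatim turns never exceed `budget`, and information loss is confined to the summarized older turns.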

Concept B: Context Assembly Pipeline and Priority-Based Packing

Fundamentals Context assembly is the process of selecting, ordering, and fitting content from multiple sources into a model’s context window to maximize the quality of the model’s response. This is the core of what has come to be called “context engineering” as distinct from “prompt engineering.” While prompt engineering focuses on crafting the instruction text itself (what you tell the model to do), context engineering focuses on what information the model has access to when it processes those instructions. The insight is that the quality of an LLM response depends more on what is in the context window than on how cleverly the instructions are worded. A mediocre prompt with excellent, relevant context produces better results than a brilliant prompt with irrelevant or missing context. The context assembly pipeline is the system that decides what fills the window, and this project builds that system.

Deep Dive into the concept A production context assembly pipeline has four stages: retrieval, reranking, packing, and verification. Each stage makes decisions that affect the final context quality, and each introduces its own failure modes.

Stage 1: Retrieval. Given a user query, the retriever fetches candidate chunks from a document store. This is typically done using vector similarity search (embedding the query and finding the nearest document embeddings), keyword search (BM25), or a hybrid combination. The retriever should return more candidates than the final budget can accommodate (typically 3-10x more), giving the reranker a rich pool to select from. The key decision here is the retrieval depth (how many candidates to fetch). Too few candidates means the reranker has limited choices and may miss the best evidence. Too many candidates waste compute on chunks that will be discarded.

Stage 2: Reranking. The reranker scores each candidate chunk on multiple dimensions and produces a ranked list. Relevance scoring measures how well the chunk’s content matches the query semantics. This can use a cross-encoder model (more accurate but slower), a bi-encoder similarity score (faster but less precise), or an LLM-as-judge approach. Beyond relevance, production reranking must consider additional signals: trust score (is this chunk from a verified, authoritative source?), freshness (is the information current or stale?), coverage (does this chunk cover a topic not yet represented in the selected set?), and deduplication (is this chunk substantially similar to another already-selected chunk?). The reranking formula combines these signals with configurable weights.
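A weighted combination of these signals might look like the sketch below. The field names and weights are illustrative assumptions (they match the example weights used elsewhere in this section) and would be tuned per application; all signals are assumed normalized to [0, 1].

```python
def composite_score(chunk, weights):
    """Combine reranking signals into one score; dedup similarity is a penalty."""
    return (
        weights["relevance"] * chunk["relevance"]
        + weights["trust"] * chunk["trust"]
        + weights["freshness"] * chunk["freshness"]
        + weights["coverage"] * chunk["coverage"]
        - weights["dedup"] * chunk["dedup_penalty"]
    )
```

Keeping the weights in configuration rather than code makes it possible to A/B test reranking behavior without redeploying the pipeline.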

A critical concept in reranking is diversity-aware selection, sometimes called Maximum Marginal Relevance (MMR). Pure relevance ranking tends to select chunks that are all about the same subtopic (the one most directly matching the query), while ignoring other relevant subtopics. MMR penalizes chunks that are too similar to already-selected chunks, promoting a diverse selection that covers the query from multiple angles. For a tax deduction query, pure relevance might select 5 chunks all about the definition of home office deduction, while MMR would select chunks covering definition, eligibility criteria, calculation method, documentation requirements, and common mistakes.
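Greedy MMR selection can be sketched as follows, assuming a `relevance` scoring function and a pairwise `similarity` function (e.g., cosine similarity over chunk embeddings). The `lambda_` parameter balances relevance against diversity; 0.7 here is an arbitrary illustrative default.

```python
def mmr_select(candidates, relevance, similarity, k, lambda_=0.7):
    """Greedy Maximum Marginal Relevance: at each step, pick the candidate
    maximizing lambda * relevance - (1 - lambda) * max-similarity-to-selected."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lambda_ * relevance(c) - (1 - lambda_) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `lambda_ = 1.0` this degenerates to pure relevance ranking; lowering it increasingly penalizes chunks that duplicate subtopics already covered, which is exactly the tax-deduction behavior described above.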

Stage 3: Packing. The packer takes the reranked list and fills the evidence section of the context window, respecting the token budget from the ledger (Concept A). The packing algorithm is conceptually a variant of the knapsack problem: each chunk has a “value” (its reranking score) and a “weight” (its token count), and the goal is to maximize total value within the weight constraint. However, unlike the classic knapsack, context packing has additional constraints: ordering matters (the model pays more attention to content at the beginning and end of the context, a phenomenon called “lost in the middle”), mandatory chunks must be included regardless of score (e.g., regulatory disclaimers or required context), and group constraints may require at least one chunk from each required topic.

The packing algorithm proceeds as follows: first, allocate mandatory chunks (subtract their token counts from the budget). Then, iterate through the reranked list in order, adding chunks that fit within the remaining budget. When a chunk does not fit, skip it and try the next one (since a smaller, lower-ranked chunk might fit in the remaining space). Record the inclusion or exclusion reason for each chunk in the explainability field.
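The packing loop just described can be sketched as below. The `id` and `tokens` fields on each chunk are assumptions about the chunk representation; the inclusion/exclusion log implements the explainability field.

```python
def pack(mandatory, ranked, budget):
    """Greedy first-fit packing with per-chunk inclusion/exclusion reasons."""
    packed, log = [], []
    for chunk in mandatory:
        budget -= chunk["tokens"]
        packed.append(chunk)
        log.append((chunk["id"], "included: mandatory"))
    if budget < 0:
        raise ValueError("mandatory chunks exceed evidence budget")
    for chunk in ranked:                      # ranked best-first by the reranker
        if chunk["tokens"] <= budget:
            packed.append(chunk)
            budget -= chunk["tokens"]
            log.append((chunk["id"], "included: fits budget"))
        else:
            # Skip, but keep trying: a smaller lower-ranked chunk may still fit.
            log.append((chunk["id"], "excluded: budget exceeded"))
    return packed, log, budget
```

Replaying the worked trace from the Stage 3 diagram (2,800-token budget, a 180-token mandatory disclaimer, then ranked chunks of 850, 720, 1100, 680, 340, and 290 tokens) reproduces the same includes, skips, and 30 tokens of remaining budget.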

The “lost in the middle” phenomenon (documented in research by Liu et al., 2023) shows that LLMs struggle to use information placed in the middle of long contexts. They attend most to the beginning and end. This has implications for chunk ordering: the most important evidence should be placed first and last, with less critical chunks in the middle. This is sometimes called the “sandwich” ordering strategy.
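One simple sandwich ordering is sketched below; the exact placement policy varies, and this is just one way to push high-scoring evidence toward the attended edges.

```python
def sandwich_order(chunks_by_score):
    """Place top-ranked chunks at the edges, weakest in the middle.

    Alternates: 1st -> front, 2nd -> back, 3rd -> front, ... so the
    beginning and end of the context hold the highest-scoring evidence.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_score):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For five chunks ranked 1 (best) through 5, this yields the order 1, 3, 5, 4, 2: the two strongest chunks bracket the context and the weakest sits in the middle, where attention is empirically lowest.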

Stage 4: Verification. After packing, the verifier checks that the assembled context meets quality requirements. Coverage verification checks that all required topics are represented (e.g., a legal query must include at least one chunk about jurisdiction-specific rules). Consistency verification checks for factual conflicts between included chunks (e.g., two chunks stating different tax deduction limits). Budget verification confirms that the total token count matches the ledger (no overflow, acceptable waste). Source diversity verification confirms that chunks come from multiple sources rather than a single document (reducing single-source bias).
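A minimal verifier over the packed set might look like the sketch below. The `tokens`, `topic`, and `source` chunk fields and the `required_topics` parameter are assumptions; consistency checking (detecting factual conflicts between chunks) is omitted because it typically needs an LLM or rule engine.

```python
def verify_packet(packed, budget, required_topics):
    """Return a list of verification failures (empty list = pass)."""
    failures = []
    total = sum(c["tokens"] for c in packed)
    if total > budget:
        failures.append(f"budget overflow: {total} > {budget}")
    covered = {c["topic"] for c in packed}
    for topic in sorted(required_topics - covered):
        failures.append(f"missing required topic: {topic}")
    sources = {c["source"] for c in packed}
    if len(sources) < 2:
        failures.append("single-source packet: low source diversity")
    return failures
```

Returning a failure list rather than raising lets the pipeline decide per-check whether a failure is fatal (budget overflow) or merely logged (source diversity).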

The output of the pipeline is a context packet: a structured document containing the assembled context with metadata. Each packed chunk includes its source ID, relevance score, token count, position, and inclusion reason. Each excluded chunk (from the top-k candidates) includes its exclusion reason (budget exceeded, duplicate of chunk X, below relevance threshold, stale timestamp). This explainability metadata is critical for debugging hallucinations and quality issues: if the model gives a wrong answer, you can inspect the context packet to determine whether the correct information was present, correctly positioned, or missing.

How this fits into the projects The context assembly pipeline is the core implementation of Project 4. The retriever, reranker, packer, and verifier are the main components you will build. This concept also connects to Project 5 (Few-Shot Example Curator) which selects examples rather than evidence chunks, and Project 12 (Conversation Memory Compressor) which manages the conversation history section of the context window.

Definitions & key terms

  • Context engineering: The discipline of designing systems that provide the right information, in the right format, at the right time, to fill the model’s context window optimally. Broader than prompt engineering.
  • Retrieval: Fetching candidate chunks from a document store using vector search, keyword search, or hybrid methods.
  • Reranking: Scoring candidate chunks on relevance, trust, freshness, coverage, and deduplication to produce a ranked list.
  • Maximum Marginal Relevance (MMR): A diversity-aware selection algorithm that penalizes chunks similar to already-selected ones, promoting topical diversity.
  • Context packing: Fitting ranked chunks into the evidence section of the context window, respecting token budget constraints.
  • Lost in the middle: The empirical finding that LLMs attend most to the beginning and end of long contexts, underutilizing information in the middle.
  • Context packet: The fully assembled context with metadata, explainability fields, and token accounting for every included and excluded chunk.
  • Coverage score: A metric measuring whether all required topics are represented in the packed context.
  • Explainability field: Per-chunk metadata recording why a chunk was included or excluded from the context.

Mental model diagram (ASCII)

              CONTEXT ASSEMBLY PIPELINE
              ==========================

  User Query: "Can I deduct home office expenses?"
       |
       v
  +----------------------------------------------------------+
  | STAGE 1: RETRIEVAL                                        |
  |                                                           |
  | Vector Search (embeddings)  +  Keyword Search (BM25)      |
  |          |                            |                   |
  |          +----------+  +--------------+                   |
  |                     v  v                                  |
  |              Candidate Pool: 42 chunks                    |
  |              (3-10x more than budget allows)              |
  +----------------------------------------------------------+
       |
       v
  +----------------------------------------------------------+
  | STAGE 2: RERANKING                                        |
  |                                                           |
  | For each chunk, compute composite score:                  |
  |                                                           |
  |   relevance (cross-encoder)     * 0.50                    |
  |   + trust_score (source auth.)  * 0.20                    |
  |   + freshness (days since pub.) * 0.15                    |
  |   + coverage (new topic bonus)  * 0.10                    |
  |   - dedup_penalty (sim to selected) * 0.05               |
  |   ─────────────────────────────────────                   |
  |   = composite_score                                       |
  |                                                           |
  | Apply MMR for diversity:                                  |
  |   Penalize chunks similar to already-selected set         |
  |                                                           |
  | Output: Ranked list of 42 chunks, top-12 selected         |
  +----------------------------------------------------------+
       |
       v
  +----------------------------------------------------------+
  | STAGE 3: PACKING                                          |
  |                                                           |
  |   Token Budget Available: 2800 tokens (from ledger)       |
  |                                                           |
  |   1. Allocate mandatory chunks first:                     |
  |      [regulatory_disclaimer: 180 tokens]                  |
  |      Remaining: 2620 tokens                               |
  |                                                           |
  |   2. Pack ranked chunks in order:                         |
  |      Chunk 1 (850 tok, score 0.94) -> INCLUDE  [1770 rem] |
  |      Chunk 2 (720 tok, score 0.91) -> INCLUDE  [1050 rem] |
  |      Chunk 3 (1100 tok, score 0.88) -> SKIP (too large)  |
  |      Chunk 4 (680 tok, score 0.85) -> INCLUDE  [370 rem]  |
  |      Chunk 5 (340 tok, score 0.82) -> INCLUDE  [30 rem]   |
  |      Chunk 6 (290 tok, score 0.79) -> SKIP (too large)   |
  |                                                           |
  |   3. Apply "sandwich" ordering:                           |
  |      [highest first, middle descending, 2nd-highest last] |
  |                                                           |
  |   Packed: 2770/2800 tokens (98.9% utilization)            |
  +----------------------------------------------------------+
       |
       v
  +----------------------------------------------------------+
  | STAGE 4: VERIFICATION                                     |
  |                                                           |
  | Coverage check: required topics present?                  |
  |   [x] deduction definition                                |
  |   [x] eligibility criteria                                |
  |   [ ] calculation method  <-- WARN: missing topic         |
  |   [x] documentation requirements                          |
  |                                                           |
  | Consistency check: factual conflicts?                     |
  |   No conflicts detected                                   |
  |                                                           |
  | Budget check: 2770/2800 = OK                              |
  | Source diversity: 3 unique sources = OK                    |
  +----------------------------------------------------------+
       |
       v
  Context Packet (JSON output)
  with per-chunk explainability fields

How it works (step-by-step, with invariants and failure modes)

  1. The retriever receives the user query and fetches candidate chunks from the document store. Invariant: candidate pool size must be >= 3x the expected final chunk count. Failure mode: too few candidates means the reranker cannot improve over the initial retrieval order.
  2. The reranker scores each candidate on relevance, trust, freshness, coverage, and deduplication. Invariant: scoring must be deterministic given the same inputs (no random components unless seeded). Failure mode: non-deterministic scoring produces different context packets for the same query, making quality issues unreproducible.
  3. MMR diversity selection iterates through the ranked list, at each step selecting the chunk that maximizes the balance between relevance and diversity from the already-selected set. Invariant: the MMR lambda parameter controls the relevance-diversity tradeoff and must be configurable. Failure mode: lambda too high produces relevant but redundant selection; lambda too low produces diverse but marginally relevant selection.
  4. The packer iterates through the reranked list and adds chunks that fit within the remaining token budget. Invariant: mandatory chunks are always allocated first. Failure mode: a mandatory chunk exceeds the available budget, requiring either budget increase or mandatory chunk reduction.
  5. The packer applies the sandwich ordering strategy for the final context layout. Invariant: the most important evidence is at the beginning and end of the evidence section. Failure mode: critical evidence placed in the middle may be underutilized by the model.
  6. The verifier checks coverage, consistency, budget compliance, and source diversity. Invariant: all checks must pass for the context packet to be marked as valid. Failure mode: missing required topic triggers a coverage warning, which may require re-running retrieval with adjusted parameters.
  7. Each chunk receives an explainability field recording its inclusion reason (score, rank, mandatory) or exclusion reason (budget exceeded, duplicate, stale, below threshold). Invariant: every candidate chunk must have an explainability entry. Failure mode: missing explainability makes quality debugging impossible.
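Steps 2 and 3 above can be sketched compactly. The weights mirror the Stage 2 diagram; the dedup penalty is applied dynamically inside MMR rather than as a static fifth term, and the `Candidate` fields and cosine similarity are illustrative assumptions, not the project's prescribed interfaces.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    chunk_id: str
    relevance: float   # cross-encoder score, 0..1
    trust: float       # source authority, 0..1
    freshness: float   # decayed recency, 0..1
    coverage: float    # new-topic bonus, 0..1
    embedding: list    # vector used for the diversity penalty

WEIGHTS = {"relevance": 0.50, "trust": 0.20, "freshness": 0.15, "coverage": 0.10}

def composite_score(c: Candidate) -> float:
    # Weighted sum of the positive signals (deduplication is handled
    # dynamically by MMR below, not as a static term here).
    return (WEIGHTS["relevance"] * c.relevance
            + WEIGHTS["trust"] * c.trust
            + WEIGHTS["freshness"] * c.freshness
            + WEIGHTS["coverage"] * c.coverage)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(candidates, k, lam=0.7):
    """Greedy MMR: at each step pick the chunk maximizing
    lam * composite_score - (1 - lam) * max similarity to selected."""
    remaining = sorted(candidates, key=lambda c: c.chunk_id)  # stable tie order
    selected = []
    while remaining and len(selected) < k:
        def mmr(c):
            redundancy = max((cosine(c.embedding, s.embedding) for s in selected),
                             default=0.0)
            return lam * composite_score(c) - (1 - lam) * redundancy
        best = max(remaining, key=mmr)   # earliest chunk_id wins exact ties
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lam=0.7 a near-duplicate of an already-selected chunk loses to a less relevant but novel one, which is exactly the redundancy failure mode MMR exists to avoid.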

Minimal concrete example

Context Packet Output (JSON):
{
  "trace_id": "ctx-20240115-0042",
  "query": "Can I deduct home office expenses?",
  "model": "gpt-4o",
  "budget": {
    "total_window": 128000,
    "system_prompt": 1200,
    "output_reservation": 2000,
    "history": 3200,
    "available_evidence": 121600,
    "evidence_budget": 2800,
    "packed_evidence": 2770,
    "utilization": 0.989
  },
  "packed_chunks": [
    {
      "chunk_id": "tax-kb-0142",
      "source": "irs_pub_587.md",
      "position": 1,
      "tokens": 850,
      "relevance_score": 0.94,
      "trust_score": 0.99,
      "included_reason": "highest_composite_score"
    },
    {
      "chunk_id": "tax-kb-0088",
      "source": "home_office_guide.md",
      "position": 2,
      "tokens": 720,
      "relevance_score": 0.91,
      "trust_score": 0.95,
      "included_reason": "second_ranked_by_composite"
    }
  ],
  "excluded_chunks": [
    {
      "chunk_id": "tax-kb-0201",
      "tokens": 1100,
      "relevance_score": 0.88,
      "excluded_reason": "exceeded_remaining_budget (needed 1100, had 1050)"
    }
  ],
  "coverage": {
    "required_topics": ["deduction_definition", "eligibility", "calculation"],
    "covered": ["deduction_definition", "eligibility"],
    "missing": ["calculation"],
    "score": 0.67
  }
}

Deterministic tie-breaking for reproducibility:
  When two chunks have identical composite scores,
  break ties by: chunk_id (lexicographic) -> source_id -> position_in_source
  This ensures identical context packets across runs.
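The tie-breaking cascade above maps directly onto a Python sort key; the dict field names are illustrative:

```python
def rank_key(chunk):
    # Descending composite score; ties broken by chunk_id, then source_id,
    # then position within the source, so identical inputs always produce
    # identical orderings (and therefore identical context packets).
    return (-chunk["composite_score"],
            chunk["chunk_id"],
            chunk["source_id"],
            chunk["position_in_source"])

def deterministic_rank(chunks):
    return sorted(chunks, key=rank_key)
```

Because every component of the key is deterministic, re-running the reranker on the same candidate pool can never reorder tied chunks.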

Common misconceptions

  • “More context is always better.” Research shows that irrelevant context dilutes attention and can cause the model to hallucinate or ignore critical evidence. The “lost in the middle” phenomenon means that information in the center of very long contexts is underutilized. Strategic selection of fewer, highly relevant chunks often outperforms dumping everything into a large context window.
  • “Relevance ranking alone produces the best context.” Relevance-only ranking produces redundant selections: the top 10 most relevant chunks often cover the same subtopic. Diversity-aware selection (MMR) produces context that covers the query from multiple angles, which leads to more comprehensive and accurate responses.
  • “Token budget is the only constraint on context quality.” Coverage requirements (must include certain topics), consistency constraints (no conflicting information), source diversity (not all from one document), and ordering effects (lost in the middle) all affect quality independently of budget utilization.
  • “Context packing is a simple sort-and-truncate operation.” Effective packing involves knapsack-like optimization (smaller chunks may fit where large ones cannot), mandatory chunk allocation, diversity constraints, ordering strategies, and explainability logging. A naive sort-and-truncate leaves significant quality on the table.
  • “Once you build the pipeline, it works for all queries.” Different query types need different retrieval strategies (keyword-heavy queries vs. semantic queries), different budget allocations (simple factual queries need less evidence than complex multi-topic queries), and different coverage requirements. The pipeline must be configurable per query type.

Check-your-understanding questions

  1. Why should the retriever return 3-10x more candidates than the final budget can accommodate?
  2. How does Maximum Marginal Relevance (MMR) improve context quality compared to pure relevance ranking?
  3. What is the “lost in the middle” phenomenon, and how does the sandwich ordering strategy address it?
  4. Why are explainability fields on each chunk critical for debugging hallucination issues?
  5. When two chunks conflict factually, how should the packer handle this?

Check-your-understanding answers

  1. Over-retrieval gives the reranker a larger pool to select from, enabling better diversity, coverage, and quality. If the retriever returns exactly as many chunks as the budget allows, the reranker has no room to improve the selection: it must use whatever the retriever found. Over-retrieval also provides fallback candidates when some chunks are excluded for non-relevance reasons (staleness, untrusted source, deduplication).
  2. Pure relevance ranking selects the top-N most similar chunks to the query, but these chunks often cover the same subtopic because similarity metrics cluster around the query’s primary topic. MMR introduces a diversity penalty: at each selection step, it penalizes chunks that are too similar to already-selected chunks. This produces a selection that covers the query from multiple angles (definition, eligibility, calculation, examples) rather than N slightly different phrasings of the same fact.
  3. Research shows that LLMs attend most strongly to the beginning and end of their context window, with reduced attention to the middle. The sandwich ordering strategy places the highest-value evidence at the beginning, second-highest at the end, and lower-value evidence in the middle. This maximizes the probability that the model uses the most important evidence while still including supporting context.
  4. When a model hallucinates or gives an incorrect answer, you need to determine whether the correct information was (a) present in the context and ignored, (b) present but poorly positioned, or (c) missing entirely. Explainability fields let you check: was the correct chunk retrieved? Was it reranked highly? Was it included or excluded, and why? Without this metadata, debugging requires re-running the entire pipeline, which may produce different results due to non-deterministic components.
  5. Conflicting chunks present a difficult choice. Options: (a) include both and let the model resolve the conflict (risks confusing the model), (b) include only the higher-trust or more recent chunk (risks losing valid alternative information), (c) include both but add an explicit conflict annotation that instructs the model to acknowledge the disagreement. The recommended approach is (c) for transparent handling, with a conflict detection check in the verification stage that flags the issue.

Real-world applications

  • Enterprise RAG systems (customer support, legal research, medical Q&A) use context assembly pipelines to pack the most relevant knowledge base chunks into each model call, directly impacting answer quality and cost.
  • Anthropic and OpenAI have both emphasized context engineering as the successor to prompt engineering, with Anthropic’s documentation describing it as designing dynamic systems that provide the right information at the right time.
  • Multi-turn conversational AI agents use context managers to balance conversation history, retrieved knowledge, and tool outputs within a single context window, dynamically adjusting allocations as conversations evolve.
  • AI coding assistants (Cursor, Copilot) use sophisticated context assembly to select the most relevant code files, documentation, and conversation history to include when generating code suggestions.
  • Search-augmented generation systems (Perplexity, Google AI Overview) use context assembly to select and order retrieved web content for answer synthesis.

Where you’ll apply it

  • Phase 1: implement the retriever and reranker with configurable scoring weights.
  • Phase 2: implement the packer with knapsack optimization, mandatory chunk allocation, and sandwich ordering.
  • Phase 3: implement the verifier (coverage, consistency, budget) and explainability fields.

References

  • “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 2-3 (data models and storage engines relevant to document retrieval)
  • “Introduction to Information Retrieval” by Manning, Raghavan, and Schütze - Ch. 6-8 (vector space models and ranked retrieval)
  • Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023) - empirical study of positional attention patterns
  • Carbonell and Goldstein, “The Use of MMR, Diversity-Based Reranking for Reordering Documents” (1998) - the original MMR paper
  • “AI Engineering” by Chip Huyen - chapters on RAG and context management
  • Anthropic documentation on context engineering (2025)
  • Philipp Schmid, “The New Skill in AI is Not Prompting, It’s Context Engineering” (2025)

Key insights Context engineering is about building the system that decides what fills the window, not just crafting what you say in it; the quality of an LLM response depends more on what it can see than on how you ask.

Summary The context assembly pipeline has four stages: retrieval (fetch candidate chunks), reranking (score on relevance, trust, freshness, diversity), packing (fit chunks within token budget using knapsack-like optimization with mandatory allocations and sandwich ordering), and verification (check coverage, consistency, budget compliance). Each chunk gets explainability metadata for debugging. Context engineering is distinct from prompt engineering: it is the discipline of providing the right information, not just the right instructions.

Homework/Exercises to practice the concept

  1. Design a reranking formula with 5 signals (relevance, trust, freshness, coverage, dedup penalty) and configurable weights. For a tax law query, assign weights and justify each choice.
  2. Given 10 candidate chunks with the following token counts and relevance scores, implement the greedy packing algorithm (in pseudocode) with a budget of 3000 tokens: Chunks: [(800, 0.95), (600, 0.92), (1200, 0.90), (400, 0.88), (900, 0.85), (550, 0.82), (700, 0.79), (300, 0.75), (1100, 0.72), (450, 0.70)]. Show the final packed set, total tokens used, and utilization percentage.
  3. Explain how you would modify the packing algorithm to handle the “lost in the middle” phenomenon. Show the reordering step with the chunks from exercise 2.
  4. Design a coverage requirement specification for a medical Q&A system. What topics must always be present in the context for a drug interaction query?

Solutions to the homework/exercises

  1. For a tax law query: relevance = 0.45 (highest weight because answer accuracy depends on finding the right rules), trust = 0.25 (tax information from the IRS is more reliable than blog posts), freshness = 0.15 (tax laws change annually, stale information is dangerous), coverage = 0.10 (a complete answer needs multiple aspects: definition, eligibility, calculation, documentation), dedup_penalty = 0.05 (avoid redundant chunks). Justification: trust is weighted higher than typical because tax information has legal consequences; freshness is important because tax law changes annually.
  2. Greedy packing with 3000 budget: Chunk 1 (800, 0.95) -> include, remaining 2200. Chunk 2 (600, 0.92) -> include, remaining 1600. Chunk 3 (1200, 0.90) -> include (1200 <= 1600), remaining 400. Chunk 4 (400, 0.88) -> include, remaining 0. Chunks 5-10 -> skip (budget exhausted). Final set: chunks 1, 2, 3, 4. Total: 3000/3000 = 100% utilization. Note: the skip-and-continue rule did not fire here because every chunk fit until the budget was exhausted; with a budget of 2500, chunk 3 (1200 tokens against 1100 remaining) would be skipped and the smaller chunk 4 would fit in its place.
  3. After greedy packing selects chunks [1, 2, 3, 4] with scores [0.95, 0.92, 0.90, 0.88], apply sandwich ordering: highest score first (chunk 1, 0.95), second-highest last (chunk 2, 0.92), then fill the middle in descending order. Final order: [1(0.95), 3(0.90), 4(0.88), 2(0.92)]. This places the two highest-value chunks at the beginning and end where the model pays most attention.
  4. For a drug interaction query, required coverage topics: (a) primary drug mechanism of action, (b) interacting drug mechanism, (c) specific interaction description (synergistic, antagonistic, metabolic), (d) clinical significance (severity, frequency), (e) recommended action (avoid, adjust dose, monitor). All five topics must have at least one chunk present. If any topic is missing, the coverage check should flag a warning and potentially re-run retrieval with a topic-specific boost query.
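The greedy skip-and-continue packer used throughout these exercises can be sketched as below, replayed on the Stage 3 numbers from the pipeline diagram in section 2 (180-token mandatory disclaimer, 2800-token evidence budget); the tuple layout is an illustrative assumption.

```python
def greedy_pack(chunks, budget, mandatory=()):
    """chunks: (chunk_id, tokens, score) tuples sorted by descending score.
    Returns (packed, excluded, remaining_tokens)."""
    packed, excluded = [], []
    remaining = budget
    for cid, tokens, score in mandatory:          # mandatory chunks first
        remaining -= tokens
        packed.append((cid, tokens, score, "mandatory"))
    for cid, tokens, score in chunks:
        if tokens <= remaining:                   # fits: include
            remaining -= tokens
            packed.append((cid, tokens, score, "ranked"))
        else:                                     # too large: skip, keep trying
            excluded.append((cid, tokens, score, "exceeded_remaining_budget"))
    return packed, excluded, remaining

# Replaying the Stage 3 walkthrough from the diagram:
ranked = [("c1", 850, 0.94), ("c2", 720, 0.91), ("c3", 1100, 0.88),
          ("c4", 680, 0.85), ("c5", 340, 0.82), ("c6", 290, 0.79)]
packed, excluded, rem = greedy_pack(ranked, 2800,
                                    mandatory=[("disclaimer", 180, 1.0)])
used = 2800 - rem
print(f"packed {used}/2800 tokens ({used / 2800:.1%})")
# → packed 2770/2800 tokens (98.9%)
```

Note how c3 (1100 tokens) is skipped when only 1050 remain, yet the smaller c4 and c5 still fit afterwards: the knapsack-like behavior a naive truncation would miss.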

3. Project Specification

3.1 What You Will Build

A context-packing pipeline that retrieves, reranks, and packs evidence chunks within strict token budgets, producing deterministic context packets with explainability and coverage metrics.

3.2 Functional Requirements

  1. Retrieve candidate chunks from a document store given a user query.
  2. Rerank chunks by composite score (relevance, trust, freshness, coverage, deduplication) with configurable weights.
  3. Pack chunks under strict token limit with deterministic ordering and mandatory chunk allocation.
  4. Emit explainability fields for every chunk (included_reason or excluded_reason).
  5. Verify coverage (required topics present), consistency (no factual conflicts), and budget compliance.
  6. Produce a context packet JSON with token ledger, packed chunks, excluded chunks, and coverage report.

3.3 Non-Functional Requirements

  • Performance: Average pack operation under 400 ms for 50 candidate chunks.
  • Reliability: Same query + corpus snapshot + seed yields identical context packets across runs.
  • Security/Policy: Blocked or untrusted sources cannot be packed into final context. Trust scores below configurable threshold are excluded.

3.4 Example Usage / Output

$ uv run p04-context pack --query "Can I deduct home office expenses?" --kb fixtures/tax_kb --budget 2800 --out out/p04
[INFO] Tokenizer: tiktoken (cl100k_base) for model gpt-4o
[INFO] Token ledger: system=1200, preamble=350, history=0, output_res=2000, evidence_budget=2800
[INFO] Retrieved candidates: 42 chunks from tax_kb
[INFO] Reranked top-k: 12 chunks (MMR lambda=0.7)
[PACK] Mandatory chunks: 1 (regulatory_disclaimer, 180 tokens)
[PACK] Packed evidence: 5 chunks, 2741/2800 tokens (97.9% utilization)
[PASS] Coverage score: 0.93 (4/4 required topics present, 1 partial)
[PASS] Consistency: no factual conflicts detected
[PASS] Source diversity: 3 unique sources
[INFO] Context packet: out/p04/context_packet.json
[INFO] Explainability report: out/p04/explainability.csv

3.5 Data Formats / Schemas / Protocols

  • Chunk index: source_id, section_id, text, token_count, trust_score, timestamp, topic_tags.
  • Reranking config YAML: scoring weights, MMR lambda, trust threshold, freshness decay.
  • Context packet JSON: trace_id, query, model, budget ledger, packed_chunks (with explainability), excluded_chunks (with reasons), coverage report.
  • Explainability CSV: chunk_id, action (included/excluded), reason, score, tokens, position.
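One plausible Python shape for these records, mirroring the P04_* structures described later in section 4.3; this is a sketch, not a prescribed schema, and the field names are assumptions.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Chunk:
    chunk_id: str
    source_id: str
    text: str
    token_count: int       # computed by the provider-exact tokenizer
    trust_score: float
    timestamp: str         # ISO date, used by the freshness filter
    topic_tags: list

@dataclass
class PackedChunk:
    chunk: Chunk
    position: int
    included_reason: str   # e.g. "highest_composite_score" or "mandatory"

@dataclass
class ContextPacket:
    trace_id: str
    query: str
    model: str
    budget: dict           # token ledger allocations + evidence budget
    packed_chunks: list    # of PackedChunk
    excluded_chunks: list  # of (Chunk, excluded_reason) pairs
    coverage: dict

def to_json(packet: ContextPacket) -> str:
    # asdict recurses through nested dataclasses, so the packet
    # serializes directly into the context_packet.json artifact.
    return json.dumps(asdict(packet), indent=2)
```

Keeping the packet as plain dataclasses makes the JSON output a mechanical serialization step, which helps the determinism requirement in section 3.3.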

3.6 Edge Cases

  • Highly relevant chunk exceeds remaining token budget (must skip and try next smaller chunk).
  • Two chunks conflict factually but both rank highly (consistency check should flag).
  • Query asks outside corpus domain (coverage check reports 0% required topics).
  • Source chunk has stale timestamp beyond freshness policy (excluded by freshness filter).
  • Mandatory chunk exceeds available evidence budget (budget error requiring policy adjustment).
  • All candidates are from a single source (source diversity warning).
  • Conversation history consumes most of the budget, leaving minimal evidence space (triggers history compression).

3.7 Real World Outcome

This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.

3.7.1 How to Run (Copy/Paste)

$ uv run p04-context pack --query "Can I deduct home office expenses?" --kb fixtures/tax_kb --budget 2800 --out out/p04
  • Working directory: project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS
  • Required inputs: project fixtures under fixtures/
  • Output directory: out/p04

3.7.2 Golden Path Demo (Deterministic)

Use the fixed seed already embedded in the command or config profile. You should see stable packed chunk sets, identical token counts, and identical coverage scores between runs. The trace_id and corpus hash are logged in the context packet header.

3.7.3 If CLI: exact terminal transcript

$ uv run p04-context pack --query "Can I deduct home office expenses?" --kb fixtures/tax_kb --budget 2800 --out out/p04
[INFO] Tokenizer: tiktoken (cl100k_base) for model gpt-4o
[INFO] Token ledger: system=1200, preamble=350, history=0, output_res=2000, evidence_budget=2800
[INFO] Retrieved candidates: 42 chunks from tax_kb
[INFO] Reranked top-k: 12 chunks (MMR lambda=0.7)
[PACK] Mandatory chunks: 1 (regulatory_disclaimer, 180 tokens)
[PACK] Packed evidence: 5 chunks, 2741/2800 tokens (97.9% utilization)
[PASS] Coverage score: 0.93 (4/4 required topics present, 1 partial)
[PASS] Consistency: no factual conflicts detected
[PASS] Source diversity: 3 unique sources
[INFO] Context packet: out/p04/context_packet.json
[INFO] Explainability report: out/p04/explainability.csv
$ echo $?
0

Failure demo:

$ uv run p04-context pack --query "Can I deduct home office expenses?" --kb fixtures/tax_kb --budget 300 --out out/p04
[ERROR] Budget too small to fit required system + policy preamble (needs >= 640 tokens)
[HINT] Increase --budget or reduce mandatory preamble sections
$ echo $?
2

Coverage warning demo:

$ uv run p04-context pack --query "quantum computing applications" --kb fixtures/tax_kb --budget 2800 --out out/p04
[WARN] Query appears outside corpus domain (tax_kb)
[WARN] Coverage score: 0.00 (0/4 required topics present)
[INFO] Packed evidence: 3 chunks, 1200/2800 tokens (42.9% utilization)
[HINT] Consider using a different knowledge base for this query
$ echo $?
1

4. Solution Architecture

4.1 High-Level Design

              CONTEXT WINDOW MANAGER ARCHITECTURE
              ====================================

  User Query + Config
        |
        v
  +----------------------------------------------+
  |           TOKEN LEDGER                        |
  | - Load model-specific tokenizer               |
  | - Compute fixed section allocations            |
  | - Calculate available evidence budget          |
  +----------------------------------------------+
        |
        v
  +----------------------------------------------+
  |           RETRIEVER                           |
  | - Vector search + keyword search (hybrid)      |
  | - Fetch 3-10x candidate pool                   |
  | - Tag each chunk with source metadata          |
  +----------------------------------------------+
        |
        v
  +----------------------------------------------+
  |           RERANKER                            |
  | - Composite scoring (5 signals + weights)      |
  | - MMR diversity selection                      |
  | - Trust and freshness filtering                |
  +----------------------------------------------+
        |
        v
  +----------------------------------------------+
  |           PACKER                              |
  | - Mandatory chunk allocation                   |
  | - Greedy knapsack-style packing                |
  | - Sandwich ordering for attention optimization |
  | - Explainability fields per chunk              |
  +----------------------------------------------+
        |
        v
  +----------------------------------------------+
  |           VERIFIER                            |
  | - Coverage check (required topics)             |
  | - Consistency check (factual conflicts)        |
  | - Budget compliance (token ledger balanced)    |
  | - Source diversity check                       |
  +----------------------------------------------+
        |
        v
  Context Packet JSON + Explainability CSV
  -> out/p04/

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Token Ledger | Manages budget allocation using exact model-specific tokenizer. | Use provider-exact tokenizer (tiktoken, Anthropic, SentencePiece). Never approximate. |
| Retriever | Fetches candidate chunks from corpus using hybrid search. | Return 3-10x candidates for rich reranking pool. |
| Reranker | Scores candidates on 5 signals with configurable weights and MMR diversity. | Separate relevance from trust/freshness. Use deterministic scoring with seeded tie-breakers. |
| Packer | Fills evidence budget with ranked chunks using greedy knapsack + sandwich ordering. | Mandatory chunks first. Skip too-large chunks and try smaller ones. Log every decision. |
| Verifier | Validates coverage, consistency, budget, and source diversity. | All checks must pass for valid context packet. Coverage failures emit warnings, budget failures emit errors. |

4.3 Data Structures (No Full Code)

P04_TokenLedger:
- model: "gpt-4o"
- tokenizer: tiktoken(cl100k_base)
- total_window: 128000
- allocations: {system: 1200, preamble: 350, history: 0, output_res: 2000}
- available_evidence: 124450
- packed_evidence: 0 (updated during packing)

P04_Chunk:
- chunk_id: "tax-kb-0142"
- source_id: "irs_pub_587"
- text: <content>
- token_count: 850  (computed by exact tokenizer)
- trust_score: 0.99
- timestamp: "2024-01-15"
- topic_tags: ["deduction_definition", "eligibility"]

P04_ContextPacket:
- trace_id: "ctx-20240115-0042"
- query: <text>
- model: "gpt-4o"
- budget: P04_TokenLedger
- packed_chunks: [P04_Chunk + position + included_reason]
- excluded_chunks: [P04_Chunk + excluded_reason]
- coverage: {required: [...], covered: [...], missing: [...], score: 0.93}
- consistency: {conflicts: [], status: "clean"}

4.4 Algorithm Overview

Key algorithm: Priority-based context packing with budget constraints

  1. Initialize token ledger with model-specific tokenizer and compute available evidence budget.
  2. Retrieve candidate chunks (3-10x over-retrieval) and compute exact token count for each.
  3. Score candidates using composite reranking formula with configurable weights.
  4. Apply MMR diversity selection to produce the final ranked list.
  5. Allocate mandatory chunks first, subtracting from evidence budget.
  6. Greedily pack ranked chunks: if the next chunk fits, include it; if not, skip and try the next smaller one.
  7. Apply sandwich ordering to the packed set (highest-value first and last, others in middle).
  8. Run verification checks (coverage, consistency, budget, diversity).
  9. Emit context packet JSON with full explainability.
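Step 7's sandwich ordering is small enough to show directly; chunks here are (id, score) pairs, an assumed layout.

```python
def sandwich_order(chunks, score=lambda c: c[1]):
    """Reorder packed chunks so the highest-scoring chunk comes first,
    the second-highest comes last, and the rest fill the middle in
    descending score order (mitigating 'lost in the middle')."""
    ranked = sorted(chunks, key=score, reverse=True)
    if len(ranked) < 3:
        return ranked
    first, second, *middle = ranked
    return [first, *middle, second]

print(sandwich_order([("a", 0.95), ("b", 0.92), ("c", 0.88), ("d", 0.85)]))
# → [('a', 0.95), ('c', 0.88), ('d', 0.85), ('b', 0.92)]
```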

Complexity Analysis (conceptual):

  • Time: O(n * k) for composite scoring (n candidates, k scoring signals), O(n * m) for MMR selection of m chunks (assuming the max-similarity term is cached incrementally), O(n) for packing.
  • Space: O(n) for candidate pool and context packet output.

5. Implementation Guide

5.1 Development Environment Setup

# 1) Install dependencies
#    - Python 3.11+ with uv
#    - tiktoken (for OpenAI models) or provider-specific tokenizer
#    - numpy (for vector similarity in reranking)
#    - pyyaml (for reranking config)
#    - jinja2 (optional, for HTML report generation)

# 2) Prepare fixtures
#    - fixtures/tax_kb/ (document chunks as JSONL or individual files)
#    - fixtures/rerank_config.yaml (scoring weights and MMR parameters)
#    - fixtures/coverage_requirements.yaml (required topics per query type)

# 3) Run the project command(s) listed in section 3.7

5.2 Project Structure

p04/
├── src/
│   ├── ledger.py          # Token budget management with exact tokenizers
│   ├── retriever.py       # Candidate chunk fetching (vector + keyword)
│   ├── reranker.py        # Composite scoring + MMR diversity
│   ├── packer.py          # Greedy knapsack + sandwich ordering
│   ├── verifier.py        # Coverage, consistency, budget checks
│   └── schemas.py         # Data structures for chunks, packets, ledger
├── fixtures/
│   ├── tax_kb/            # Document chunks
│   ├── rerank_config.yaml
│   └── coverage_requirements.yaml
├── out/
└── README.md

5.3 The Core Question You’re Answering

“What evidence should enter the prompt, in what order, and at what token cost, and can I prove why each piece was chosen?”

This question matters because it forces you to build a system that makes context assembly decisions explicit, measurable, and reproducible, rather than relying on ad hoc chunk selection that produces inconsistent quality.

5.4 Concepts You Must Understand First

  1. Token counting and BPE tokenization
    • Why does the same text produce different token counts on different models, and what are the consequences for context packing?
    • Book Reference: Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units” (2016); OpenAI tiktoken documentation
  2. Information retrieval and reranking
    • How do vector similarity search and BM25 keyword search complement each other, and why is hybrid retrieval better than either alone?
    • Book Reference: “Introduction to Information Retrieval” by Manning et al. - Ch. 6-8
  3. Context window attention patterns
    • What is the “lost in the middle” phenomenon, and how does chunk ordering affect model response quality?
    • Book Reference: Liu et al., “Lost in the Middle” (2023); “AI Engineering” by Chip Huyen

5.5 Questions to Guide Your Design

  1. Token budget architecture
    • How do you allocate the context window across fixed sections (system, preamble, history) and variable sections (evidence)?
    • What happens when conversation history grows and squeezes the evidence budget?
    • How do you handle multi-model pipelines where different models have different tokenizers?
  2. Retrieval and reranking strategy
    • What is the right over-retrieval ratio for your candidate pool?
    • How do you weight the 5 reranking signals for different query types?
    • How does MMR lambda affect the relevance-diversity tradeoff?
  3. Packing and verification
    • How do you handle chunks too large for the remaining budget?
    • What coverage requirements are mandatory vs. nice-to-have?
    • How do you make the packing algorithm deterministic for reproducibility?

5.6 Thinking Exercise

Context Assembly Design Analysis

Before implementing, work through this detailed scenario:

You have a 4096-token context window (a small model). The system prompt uses 640 tokens, the safety preamble uses 120 tokens, and you need to reserve 512 tokens for output. There is no conversation history.

  1. Calculate the available evidence budget.
  2. Given 8 candidate chunks with these token counts and relevance scores: [(500, 0.95), (800, 0.93), (300, 0.91), (600, 0.89), (400, 0.87), (700, 0.85), (200, 0.82), (450, 0.79)], run the greedy packing algorithm and show which chunks are included.
  3. Apply sandwich ordering to the packed set.
  4. Now imagine one chunk is mandatory (a 200-token regulatory disclaimer). Re-run the packing with the mandatory chunk pre-allocated.
  5. Compute the coverage score assuming 3 required topics: the mandatory chunk covers topic A, chunk 1 covers topic B, and no packed chunk covers topic C.

Questions to answer:

  • What is the utilization percentage in each scenario?
  • How does mandatory chunk allocation change the final packed set?
  • What action should the system take when coverage score is below 1.0?
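
A minimal Python sketch can check your hand calculations for steps 1-2 above; the constants and chunk list mirror the scenario numbers, and the helper name is illustrative:

```python
# Greedy packing sketch for the thinking exercise above.
# Budget: 4096 window - 640 system - 120 preamble - 512 output reservation.
WINDOW, SYSTEM, PREAMBLE, OUTPUT_RES = 4096, 640, 120, 512
evidence_budget = WINDOW - SYSTEM - PREAMBLE - OUTPUT_RES  # 2824

# (token_count, relevance) pairs from the exercise, already relevance-sorted.
candidates = [(500, 0.95), (800, 0.93), (300, 0.91), (600, 0.89),
              (400, 0.87), (700, 0.85), (200, 0.82), (450, 0.79)]

def greedy_pack(chunks, budget):
    """Skip-on-overflow greedy packing: take each chunk in rank order
    if it fits in the remaining budget, otherwise skip it and continue."""
    packed, remaining = [], budget
    for tokens, score in chunks:
        if tokens <= remaining:
            packed.append((tokens, score))
            remaining -= tokens
    return packed, remaining

packed, remaining = greedy_pack(candidates, evidence_budget)
used = evidence_budget - remaining
print(f"packed {len(packed)} chunks, {used}/{evidence_budget} tokens")
# packed 6 chunks, 2800/2824 tokens: the 700-token chunk is skipped,
# but the 200-token chunk after it still fits.
```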

5.7 The Interview Questions They’ll Ask

  1. “How do you trade off recall versus token budget in RAG prompts? When is it better to retrieve fewer, higher-quality chunks?”
  2. “Why must reranking include trust and freshness, not only relevance? Give a concrete example where relevance-only ranking produces dangerous results.”
  3. “How do you make context packing deterministic and reproducible across runs?”
  4. “What would you log to debug a bad context packet that led to a hallucinated answer?”
  5. “How do you decide what to drop when budget is exceeded - and what is the risk of each dropping strategy?”
  6. “Explain the difference between context engineering and prompt engineering. Why has the industry shifted focus?”

5.8 Hints in Layers

Hint 1: Budget the fixed parts first. Before touching retrieval or packing, build the token ledger. Compute exact token counts for the system prompt, safety preamble, and any mandatory chunks using the provider-exact tokenizer. Subtract these from the total window along with the output reservation. The remaining number is your evidence budget. If this number is <= 0, the system prompt is too large for the model’s context window.
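
The fixed-part budgeting might be sketched as follows. `TokenLedger` here is a hypothetical minimal version, and the whitespace counter is only a stand-in for the provider-exact tokenizer (e.g. tiktoken for GPT models):

```python
# Minimal token-ledger sketch (assumed names). The default whitespace
# counter is a STAND-IN; production code must use the provider-exact
# tokenizer (e.g. tiktoken) or budget math will be wrong by 10-30%.
class TokenLedger:
    def __init__(self, window_size, count_tokens=lambda text: len(text.split())):
        self.window_size = window_size
        self.count_tokens = count_tokens
        self.sections = {}

    def allocate(self, name, text=None, tokens=None):
        # Reservations (e.g. output) pass a token count; text sections are counted.
        self.sections[name] = tokens if tokens is not None else self.count_tokens(text)

    def available(self):
        return self.window_size - sum(self.sections.values())

ledger = TokenLedger(window_size=4096)
ledger.allocate("system", text="You are a careful tax assistant. Cite sources.")
ledger.allocate("output_reservation", tokens=512)
budget = ledger.available()
if budget <= 0:
    raise ValueError("system prompt too large for this context window")
```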

Hint 2: Separate retrieval quality from packing quality. Build two independent evaluation metrics: retrieval recall (what percentage of relevant chunks appear in the candidate pool?) and packing utilization (what percentage of the evidence budget is filled with useful content?). Low recall means your retriever needs work; low utilization means your packer is leaving budget on the table. Debugging one problem at a time is far easier than debugging the whole pipeline.

Hint 3: Log dropped evidence with reasons. For every candidate chunk that is NOT included in the final context packet, record why: “budget exceeded” (chunk did not fit), “below relevance threshold”, “duplicate of chunk X”, “untrusted source”, “stale timestamp”, or “not selected by MMR diversity.” This log is the single most valuable debugging artifact. When the model hallucinates, check this log to see if the correct information was excluded and why.

Hint 4: Use stable sort keys for deterministic packing. When two chunks have identical composite scores, the sort order must be deterministic. Use a multi-key tie-breaker: chunk_id (lexicographic) -> source_id -> position_in_source. Without stable tie-breaking, the same query can produce different context packets on different runs, making quality issues unreproducible.
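
The multi-key tie-breaker can be expressed directly as a sort key; field names and values here are illustrative:

```python
# Deterministic ranking sketch: identical composite scores fall back to
# stable, content-based tie-break keys, never insertion order or randomness.
chunks = [
    {"chunk_id": "b2", "source_id": "irs-pub17", "position": 4, "score": 0.91},
    {"chunk_id": "a1", "source_id": "irs-pub17", "position": 2, "score": 0.91},
    {"chunk_id": "c3", "source_id": "state-faq", "position": 1, "score": 0.95},
]

ranked = sorted(
    chunks,
    key=lambda c: (-c["score"], c["chunk_id"], c["source_id"], c["position"]),
)
# c3 first (highest score); the 0.91 tie is broken lexicographically by chunk_id.
print([c["chunk_id"] for c in ranked])  # ['c3', 'a1', 'b2']
```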

Pseudocode for context packing:

FUNCTION pack_context(query, config, kb):
    # Step 1: Budget
    ledger = TokenLedger(model=config.model)
    ledger.allocate("system", config.system_prompt)
    ledger.allocate("preamble", config.safety_preamble)
    ledger.allocate("output_res", config.output_tokens)
    evidence_budget = ledger.available()
    IF evidence_budget <= 0:
        FAIL "fixed sections and output reservation exceed the context window"

    # Step 2: Retrieve
    candidates = retriever.search(query, kb, limit=config.retrieval_depth)
    FOR EACH chunk IN candidates:
        chunk.token_count = ledger.tokenizer.count(chunk.text)

    # Step 3: Rerank
    ranked = reranker.score(candidates, query, config.weights)
    ranked = mmr_select(ranked, mmr_lambda=config.mmr_lambda, k=config.top_k)

    # Step 4: Pack (mandatory chunks first, then greedy skip-on-overflow)
    packed = []
    excluded = []
    remaining = evidence_budget
    FOR EACH mandatory IN config.mandatory_chunks:
        remaining -= mandatory.token_count
        packed.append(mandatory, reason="mandatory")
    IF remaining < 0:
        FAIL "mandatory chunks alone exceed the evidence budget"

    FOR EACH chunk IN ranked:
        IF chunk.token_count <= remaining:
            packed.append(chunk, reason="ranked_selection")
            remaining -= chunk.token_count
        ELSE:
            excluded.append(chunk, reason="budget_exceeded")

    # Step 5: Reorder (sandwich)
    packed = sandwich_order(packed)

    # Step 6: Verify
    coverage = check_coverage(packed, config.required_topics)
    consistency = check_consistency(packed)
    budget_ok = (remaining >= 0)

    RETURN ContextPacket(ledger, packed, excluded, coverage, consistency, budget_ok)
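
The `mmr_select` step above is not spelled out; one possible sketch follows, using token-set Jaccard overlap as a stand-in for embedding cosine similarity (which a production reranker would use instead):

```python
def jaccard(a, b):
    """Stand-in similarity: token-set overlap between two texts.
    Production systems would typically use embedding cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mmr_select(ranked, mmr_lambda, k):
    """Maximal Marginal Relevance: greedily pick the chunk maximizing
    lambda * relevance - (1 - lambda) * max_similarity_to_selected.
    Items are (text, relevance) pairs."""
    selected, pool = [], list(ranked)
    while pool and len(selected) < k:
        best = max(
            pool,
            key=lambda c: mmr_lambda * c[1]
            - (1 - mmr_lambda)
            * max((jaccard(c[0], s[0]) for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected

docs = [("standard deduction amounts for 2023", 0.95),
        ("standard deduction amounts for 2023 filers", 0.94),
        ("itemized deduction rules and limits", 0.80)]
# The near-duplicate 0.94 chunk is penalized; the diverse 0.80 chunk wins.
print([d[1] for d in mmr_select(docs, mmr_lambda=0.5, k=2)])  # [0.95, 0.8]
```

Lowering `mmr_lambda` pushes selection further toward diversity; raising it toward pure relevance.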

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Information retrieval fundamentals | “Introduction to Information Retrieval” by Manning et al. | Ch. 6-8 |
| Data pipeline reliability | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 2-3 |
| Operational tracing | “Site Reliability Engineering” by Google | Ch. 6 |
| AI evaluation and serving | “AI Engineering” by Chip Huyen | RAG and evaluation chapters |

5.10 Implementation Phases

Phase 1: Foundation (Ledger + Retriever)

  • Implement the token ledger with exact tokenizer integration (tiktoken for GPT models).
  • Build a simple retriever that loads chunks from JSONL fixtures and returns a candidate pool.
  • Implement exact token counting for each chunk.
  • Checkpoint: Token ledger correctly computes available evidence budget. Retriever returns candidate pool with accurate token counts.

Phase 2: Core Pipeline (Reranker + Packer)

  • Implement composite reranking with configurable weights and MMR diversity selection.
  • Implement greedy knapsack packing with mandatory chunk allocation and skip-on-overflow logic.
  • Implement sandwich ordering.
  • Add explainability fields to every chunk (included_reason / excluded_reason).
  • Checkpoint: End-to-end run produces deterministic context packet. Same query + seed = identical output.
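
Sandwich ordering from Phase 2 might be implemented as follows, assuming the packed list arrives sorted best-first:

```python
def sandwich_order(packed):
    """Place the strongest chunks at the edges of the context and the
    weakest in the middle, mitigating 'lost in the middle'. Assumes
    `packed` is already sorted best-first."""
    front, back = [], []
    for i, chunk in enumerate(packed):
        # Alternate: even ranks fill from the front, odd ranks from the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranks 1..5 by quality: best (1) leads, runner-up (2) closes, worst (5) is buried.
print(sandwich_order([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
```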

Phase 3: Verification and Reporting

  • Implement coverage verification (required topics check).
  • Implement consistency verification (factual conflict detection - at minimum, flag duplicate source IDs).
  • Add budget compliance and source diversity checks.
  • Produce context packet JSON and explainability CSV.
  • Add exit code logic (0 = pass, 1 = coverage warning, 2 = budget error).
  • Checkpoint: All verification checks run. Coverage warnings and budget errors produce correct exit codes. Team member can reproduce from clean checkout.
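
The exit-code logic from Phase 3 could be as small as this sketch (function and parameter names are illustrative):

```python
# Exit codes per the checkpoint above: 0 = pass, 1 = coverage warning,
# 2 = budget error. Budget violations outrank coverage warnings.
def exit_code(coverage_score, budget_ok):
    if not budget_ok:
        return 2  # hard failure: packet exceeds the token budget
    if coverage_score < 1.0:
        return 1  # soft failure: a required topic is missing
    return 0

# At the end of the CLI entry point: sys.exit(exit_code(coverage, budget_ok))
```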

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Tokenizer | Approximate (chars/4) vs exact (tiktoken) | Exact provider tokenizer | Approximate is off by 10-30%, causing overflow or waste |
| Retrieval strategy | Vector-only vs keyword-only vs hybrid | Hybrid (vector + BM25) | Hybrid captures both semantic and keyword matches |
| Packing algorithm | Sort-and-truncate vs greedy knapsack | Greedy knapsack with skip | Sort-and-truncate wastes budget when a large chunk does not fit but smaller ones would |
| Ordering strategy | Relevance order vs sandwich | Sandwich ordering | Mitigates “lost in the middle” by placing best evidence at edges |
| Reproducibility | Seeded randomness vs deterministic tie-breaking | Deterministic tie-breaking by chunk_id | Avoids any randomness, ensuring identical packets across runs |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Validate tokenizer accuracy, budget computation, scoring formulas | Token count matches tiktoken exactly; budget ledger math is correct |
| Integration Tests | Verify end-to-end pipeline from query to context packet | Golden-path query produces expected packed chunks and coverage score |
| Regression Tests | Detect quality changes across config updates | Changing rerank weights produces expected score changes; no silent degradation |
| Edge Case Tests | Ensure robust handling of boundary conditions | Budget too small, empty corpus, all chunks from single source, conflicting chunks |

6.2 Critical Test Cases

  1. Golden-path query produces context packet matching expected fixture (same chunks, same order, same scores).
  2. Token count for a known text matches the exact tokenizer output (not approximate).
  3. Mandatory chunk is always included even when it has a lower relevance score than other candidates.
  4. Chunk larger than remaining budget is correctly skipped, and the next smaller chunk is packed instead.
  5. Coverage check correctly identifies missing required topics and reports the right coverage score.
  6. Same query + config + seed produces identical context packet across runs (determinism).
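
For test case 6, one practical determinism check is to fingerprint a canonical serialization of the packet and compare hashes across runs; a sketch with hypothetical packet dicts:

```python
import json
import hashlib

def packet_fingerprint(packet):
    """Canonical fingerprint of a context packet: stable key order and
    no whitespace variance, so two runs are identical iff hashes match."""
    canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

run_a = {"chunks": ["a1", "b2"], "coverage": 0.67}
run_b = {"coverage": 0.67, "chunks": ["a1", "b2"]}  # same content, new key order
assert packet_fingerprint(run_a) == packet_fingerprint(run_b)
```

Storing the golden fixture's fingerprint alongside the fixture makes the regression test a one-line comparison.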

6.3 Test Data

fixtures/tax_kb/chunks.jsonl          # 50 document chunks with metadata
fixtures/golden_query.json             # Query + expected context packet
fixtures/edge_cases/tiny_budget.json   # Budget smaller than mandatory chunks
fixtures/edge_cases/empty_corpus.json  # No matching chunks
fixtures/edge_cases/single_source.json # All chunks from one source

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| “Great relevance but too many hallucinations” | Context omitted critical grounding details; high relevance but low coverage. | Add required-coverage rules for key topics. Check coverage score in every run. |
| “Token overruns in production” | API returns 400 error or silently truncates. | Use provider-exact tokenizer for budget accounting. Never approximate. Test with actual API calls. |
| “Outputs vary between runs” | Different chunks packed for identical queries. | Add deterministic tie-break keys (chunk_id, source_id). Remove any randomness from scoring. |
| “Packed lots of context but answers are worse” | “Lost in the middle” effect - critical evidence buried in center. | Implement sandwich ordering. Verify that highest-value chunks are at edges. |
| “Prefix caching not activating” | Token costs higher than expected despite cacheable prefix. | Ensure system prompt bytes are identical across requests. No timestamps, no request IDs in the prefix. |

7.2 Debugging Strategies

  • Compare token counts between your ledger and the API’s usage response to detect tokenizer mismatches.
  • Inspect the explainability CSV to determine why the correct chunk was excluded.
  • Run with a minimal corpus (3-5 chunks) to isolate reranking and packing logic separately.
  • Diff two context packets for the same query to find non-deterministic components.

7.3 Performance Traps

  • Re-tokenizing the same chunk multiple times during scoring and packing. Cache token counts after first computation.
  • Loading the full corpus into memory when only a subset is needed. Use lazy loading with the retriever’s pre-filtering.
  • Computing cross-encoder reranking for all 42 candidates when only top-12 are needed. Pre-filter with a fast bi-encoder before running the expensive cross-encoder.
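
The first trap above can be avoided with a memoized counter; the whitespace count below is only a stand-in for the provider-exact tokenizer:

```python
from functools import lru_cache

# Cache token counts so each chunk is tokenized once, not once per
# scoring and packing pass.
@lru_cache(maxsize=None)
def cached_token_count(text: str) -> int:
    # STAND-IN: replace with the exact tokenizer (e.g. tiktoken) in production.
    return len(text.split())

cached_token_count("the standard deduction for 2023")
cached_token_count("the standard deduction for 2023")  # served from cache
print(cached_token_count.cache_info().hits)  # 1
```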

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a second knowledge base (e.g., medical_kb) and support switching via CLI flag.
  • Add a simple HTML report showing the context packet with color-coded included/excluded chunks.

8.2 Intermediate Extensions

  • Implement adaptive budget allocation that gives more evidence space to complex queries and less to simple ones.
  • Add conversation history management with a sliding window and summarization for multi-turn support.

8.3 Advanced Extensions

  • Implement prefix caching optimization that tracks cache hit rates and suggests system prompt stabilization improvements.
  • Build a multi-model pipeline that selects the optimal model based on query complexity and adjusts tokenizer and budget accordingly.
  • Integrate with Project 12 (Memory Compressor) for conversation history compression within the token ledger.

9. Real-World Connections

9.1 Industry Applications

  • Enterprise RAG platforms (Azure AI Search, Amazon Bedrock, LangChain) all implement context assembly pipelines similar to this project’s architecture.
  • Anthropic has explicitly framed context engineering as the evolution of prompt engineering, emphasizing that what fills the context window matters more than how you phrase the instructions.
  • AI coding assistants dynamically assemble context from relevant code files, documentation, and conversation history within token budgets.

9.2 Tools & Frameworks

  • LangChain / LlamaIndex - RAG orchestration frameworks with retriever-reranker-packer patterns.
  • Cohere Rerank API - production reranking service for RAG pipelines.
  • tiktoken (OpenAI) - exact tokenizer library for GPT models.
  • MTEB (Massive Text Embedding Benchmark) - evaluation framework for retrieval quality.

9.3 Interview Relevance

  • Demonstrates understanding of context engineering, which is rapidly becoming a key competency for AI engineers.
  • Shows ability to build deterministic, reproducible systems around probabilistic AI components.
  • Proves practical knowledge of token economics, prefix caching, and cost optimization.

10. Resources

10.1 Essential Reading

  • Anthropic documentation on context engineering and prompt caching (2025).
  • OpenAI tiktoken library and tokenizer cookbook.
  • Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023).

10.2 Video Resources

  • Talks on RAG optimization from AI Engineer Summit and LangChain conferences.
  • Anthropic and OpenAI blog posts on context engineering best practices.

10.3 Tools & Documentation

  • tiktoken (Python) for GPT token counting.
  • SentencePiece for LLaMA / open-source model tokenization.
  • FAISS / ChromaDB for vector similarity search in the retriever.

10.4 Related Projects

  • Project 1 (Prompt Contract Harness): provides the contract validation foundation for context packet schemas.
  • Project 5 (Few-Shot Example Curator): applies similar selection logic to examples rather than evidence chunks.
  • Project 9 (Prompt Caching Optimizer): extends prefix caching optimization from this project.
  • Project 12 (Conversation Memory Compressor): manages the conversation history section of the token ledger.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why approximate token counting causes production failures and why exact tokenizers are necessary.
  • I can describe the 4-stage context assembly pipeline (retrieve, rerank, pack, verify) and what each stage contributes.
  • I can explain the “lost in the middle” phenomenon and how sandwich ordering mitigates it.
  • I can describe context engineering vs. prompt engineering and why the distinction matters.

11.2 Implementation

  • My token ledger uses the exact provider tokenizer and correctly computes available evidence budget.
  • My reranker scores on 5 signals with configurable weights and applies MMR diversity selection.
  • My packer produces deterministic context packets with stable tie-breaking and sandwich ordering.
  • Every chunk has an explainability field (included_reason or excluded_reason).
  • Coverage and consistency checks run on every context packet.

11.3 Growth

  • I can describe the relevance-diversity tradeoff (MMR lambda) and how I tuned it.
  • I can explain my token budget allocation strategy and how it changes for different query types.
  • I can explain this system design in an interview setting with concrete examples and metrics.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Token ledger with exact tokenizer integration.
  • Retriever + reranker + packer producing deterministic context packets.
  • Explainability fields on every chunk (included/excluded with reasons).
  • Deterministic and reproducible results from clean checkout.

Full Completion:

  • Coverage verification with required topics check.
  • Consistency verification detecting factual conflicts.
  • Budget utilization metrics and source diversity checks.
  • CLI with exit code logic (0 = pass, 1 = warning, 2 = error).
  • Explainability CSV export.

Excellence (Above & Beyond):

  • Sandwich ordering with empirical evaluation of attention impact.
  • Prefix caching optimization tracking cache hit rates.
  • Multi-model support with tokenizer switching.
  • Integration with Projects 5, 9, or 12.