Project 4: Context Window Manager

Deterministic context packets with measured relevance and budget compliance.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate See main guide estimates (typically 3-8 days except capstone)
Main Programming Language Python
Alternative Programming Languages TypeScript, Go
Coolness Level Level 3: Practical Performance Win
Business Potential 4. Platform Feature
Knowledge Area RAG Context Engineering
Software or Tool Retriever + reranker + packer
Main Book Designing Data-Intensive Applications
Concept Clusters Context Engineering and Caching; Prompt Contracts and Output Typing

1. Learning Objectives

By completing this project, you will:

  1. Understand token counting algorithms (BPE, SentencePiece) and why token counts differ across models and tokenizers.
  2. Design a context assembly pipeline that retrieves, reranks, and packs evidence chunks within strict token budgets.
  3. Implement priority-based content selection that balances relevance, freshness, trust, and mandatory coverage requirements.
  4. Build a token ledger that accounts for every token in the context window: system prompt, instructions, retrieved evidence, conversation history, and output reservation.
  5. Produce deterministic, reproducible context packets with explainability fields showing why each chunk was included or excluded.
  6. Measure context quality through coverage scores, relevance metrics, and budget utilization efficiency.

2. All Theory Needed (Per-Concept Breakdown)

Concept A: Tokenization and Token Budget Accounting

Fundamentals Token counting is the foundation of context window management because every LLM enforces a hard limit on the total number of tokens it can process in a single request (input tokens + output tokens). A token is not a word, not a character, and not a byte: it is a model-specific unit produced by a tokenizer algorithm. The same English sentence may tokenize into 10 tokens on GPT-4 and 12 tokens on Claude, because each model uses a different tokenizer with a different vocabulary. If your context assembly pipeline uses an approximate token counter instead of the provider-exact tokenizer, you will either waste budget (packing fewer chunks than you could) or overflow the window (causing truncation or API errors). Token budget accounting treats the context window as a finite resource that must be allocated across competing sections: system prompt, safety preamble, retrieved evidence, conversation history, and reserved space for the model’s output. Getting this accounting wrong is the single most common failure mode in RAG systems.

Deep Dive into the concept Tokenization algorithms determine how text is split into tokens. The dominant algorithm family is Byte Pair Encoding (BPE), used by OpenAI’s GPT-family models (via tiktoken) and many open-source models. BPE works by starting with individual bytes (or characters) as the initial vocabulary, then iteratively merging the most frequent adjacent pair into a new token. This process repeats for a fixed number of merge operations (typically 50,000-100,000 merges), producing a vocabulary where common words like “the” are single tokens, common subwords like “ing” or “tion” are single tokens, and rare words are split into multiple subword tokens. The key insight is that BPE is deterministic given a fixed vocabulary: the same text always produces the same tokens.
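The merge loop can be illustrated with a toy sketch. This is not a production tokenizer: real BPE learns a merge table from a large corpus and then applies it at encode time, while this sketch collapses both steps and operates on a single string.

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Toy BPE: start from single characters, then repeatedly merge
    the most frequent adjacent pair into one new token."""
    tokens = list(text)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            # Greedily replace each (a, b) occurrence with the merged token.
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

For example, `bpe_merges("abab", 1)` merges the most frequent pair `("a", "b")` and yields `["ab", "ab"]`; a second merge yields the single token `["abab"]`. The determinism property from the paragraph above holds: the same input and merge count always produce the same tokens.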

SentencePiece is an alternative tokenization framework (used by models like LLaMA and some Google models) that treats the input as a raw byte stream without assuming pre-tokenized word boundaries. It supports both BPE and Unigram Language Model algorithms. The practical difference is that SentencePiece can handle any language or encoding without whitespace assumptions, making it more robust for multilingual content.

Each model provider publishes (or ships) an exact tokenizer. OpenAI provides tiktoken (Python library), Anthropic provides their tokenizer via the API’s token counting endpoint, and open-source models ship their tokenizer as part of the model artifacts. For production token accounting, you must use the exact tokenizer for the model you are calling. Approximate counters (e.g., “divide character count by 4”) introduce errors of 10-30%, which is catastrophic for budget management.

Token budget allocation divides the context window into named sections, each with a reserved or maximum token count. A typical allocation for a 128K-token model might be:

System prompt and safety preamble: 500-2000 tokens (fixed, non-negotiable). These are the developer instructions that define the model’s behavior, output format, and safety constraints. They must always fit within the window, so they are allocated first.

Conversation history: 1000-8000 tokens (sliding window). For multi-turn conversations, recent messages are kept and older messages are summarized or dropped. A sliding window keeps the N most recent turns; a summarization approach compresses older turns into a summary that uses fewer tokens.

Retrieved evidence (RAG context): the primary variable allocation, typically 2000-50000 tokens depending on the query complexity and model. This is where the context packer operates: selecting, ordering, and fitting evidence chunks into the remaining budget after fixed sections are accounted for.

Output reservation: 500-4000 tokens reserved for the model’s response. The model needs space to generate its answer; if you fill the entire window with input, the model has no room to respond. Output reservation must be subtracted from the total budget before packing evidence.

The token ledger is a data structure that tracks the current allocation:

Token Ledger:
  total_budget:        128000
  system_prompt:        -1200  (fixed)
  safety_preamble:       -350  (fixed)
  conversation_history: -3200  (sliding window, 8 recent turns)
  output_reservation:   -2000  (reserved for response)
  ─────────────────────────────
  available_for_evidence: 121250
  packed_evidence:      -118900 (47 chunks packed)
  remaining_unused:        2350 (1.8% waste)
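One minimal way to represent this ledger in code is sketched below. The section names are illustrative, and `count_tokens` is injected so that production code can pass the model's exact tokenizer (e.g., tiktoken for GPT models); a stand-in counter is used here only for demonstration.

```python
class TokenLedger:
    """Tracks token allocations against a model's context window (sketch)."""

    def __init__(self, total_budget, output_reservation, count_tokens):
        self.total_budget = total_budget
        self.output_reservation = output_reservation
        self.count_tokens = count_tokens  # must be the model-exact tokenizer in production
        self.sections = {}                # section name -> token count

    def allocate(self, section, text):
        """Tokenize `text` with the exact tokenizer and record its cost."""
        self.sections[section] = self.count_tokens(text)

    def available_for_evidence(self):
        """Total window minus all allocated sections and the output reservation."""
        used = sum(self.sections.values()) + self.output_reservation
        return self.total_budget - used

    def validate(self):
        """Hard failure if fixed sections already exceed the window."""
        if self.available_for_evidence() < 0:
            raise ValueError("budget overflow: fixed sections exceed the window")
```

The packer receives `available_for_evidence()` as its budget, which keeps the accounting in one place and makes overflow a loud, early failure instead of a silent truncation at the API.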

Budget overflow happens when the sum of all sections exceeds the model’s context window. This is a hard failure: the API rejects the request, or the provider silently truncates the input, losing information from the end. Budget underflow (significant unused space) is a soft failure: you are paying for context capacity you are not using, and the model has less evidence to reason with.

Context window sizes vary dramatically across models (as of 2025): GPT-4o supports 128K tokens, Claude 3.5 Sonnet supports 200K tokens, Gemini 1.5 Pro supports 2M tokens, and many open-source models support 8K-32K tokens. Your context manager must be parameterized by model, using the correct tokenizer and window size for each target model.

Prefix caching is a provider optimization where the API caches the key-value (KV) representations of token prefixes that repeat across requests. If your system prompt and safety preamble are identical across many requests (which they should be), the provider can cache the KV computations for that prefix, reducing both latency and cost. Cached tokens typically cost 10x less than uncached tokens. To maximize cache hits, your context assembly must ensure that the fixed prefix (system prompt + preamble) is byte-identical across requests. Any variation (even a single whitespace change) invalidates the cache.
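A cheap client-side guard for prefix stability is to fingerprint the assembled fixed prefix and alert when it drifts between requests. This is a sketch of that idea, not a provider API: it only checks byte-identity on your side, which is the precondition for cache hits described above.

```python
import hashlib

def prefix_fingerprint(system_prompt: str, preamble: str) -> str:
    """Hash the cacheable prefix so any drift, even whitespace, is detectable."""
    blob = (system_prompt + "\n" + preamble).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Logging this fingerprint per request makes cache-invalidating changes (a timestamp sneaking into the system prompt, a trailing space added by a template) show up as a changed hash rather than as a silent cost increase.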

How this fits into the projects Token budget accounting is the foundation of Project 4. Every decision in the context packer (which chunks to include, in what order, whether to truncate) depends on accurate token counting and budget allocation. This concept also connects to Project 9 (Prompt Caching Optimizer) where prefix caching becomes the primary focus.

Definitions & key terms

  • Token: The smallest unit of text that a model processes, produced by a tokenizer algorithm. Not a word, not a character.
  • BPE (Byte Pair Encoding): A tokenization algorithm that iteratively merges frequent byte pairs to build a vocabulary. Used by GPT-family models (tiktoken).
  • SentencePiece: A tokenization framework that operates on raw byte streams without word boundary assumptions. Used by LLaMA and some Google models.
  • Context window: The maximum number of tokens (input + output) a model can process in a single request.
  • Token budget: The total context window minus output reservation, divided into named sections (system prompt, history, evidence).
  • Token ledger: A data structure tracking the current token allocation across all sections, showing used, available, and wasted tokens.
  • Prefix caching: Provider optimization that caches KV computations for token prefixes that repeat across requests, reducing cost and latency.
  • Output reservation: Tokens subtracted from the total budget and reserved for the model’s generated response.

Mental model diagram (ASCII)

              CONTEXT WINDOW TOKEN BUDGET
              ============================

  +----------------------------------------------------------+
  |                    MODEL CONTEXT WINDOW                   |
  |                  (e.g., 128,000 tokens)                   |
  |                                                           |
  |  +----------------------------------------------------+  |
  |  | SYSTEM PROMPT + SAFETY PREAMBLE (fixed)             |  |
  |  | ~1,500 tokens                                       |  |
  |  | [CACHEABLE PREFIX - same across all requests]       |  |
  |  +----------------------------------------------------+  |
  |  +----------------------------------------------------+  |
  |  | CONVERSATION HISTORY (sliding window)               |  |
  |  | ~3,200 tokens (8 most recent turns)                 |  |
  |  | Older turns -> summarized or dropped                |  |
  |  +----------------------------------------------------+  |
  |  +----------------------------------------------------+  |
  |  | RETRIEVED EVIDENCE (variable - packer fills this)   |  |
  |  |                                                     |  |
  |  | Chunk 1: [tax_deduction_rules.md, 850 tokens]       |  |
  |  | Chunk 2: [home_office_criteria.md, 720 tokens]      |  |
  |  | Chunk 3: [irs_pub_587.md, 1100 tokens]              |  |
  |  |  ...                                                |  |
  |  | Chunk N: [last_fitting_chunk.md, 340 tokens]        |  |
  |  |                                                     |  |
  |  | Total packed: ~121,000 tokens                       |  |
  |  +----------------------------------------------------+  |
  |  +----------------------------------------------------+  |
  |  | OUTPUT RESERVATION (reserved for model response)    |  |
  |  | ~2,000 tokens                                       |  |
  |  +----------------------------------------------------+  |
  +----------------------------------------------------------+

  TOKENIZATION PIPELINE
  =====================

  Raw Text: "The home office deduction allows..."
       |
       v
  +------------------+
  | TOKENIZER (BPE)  |  "The" -> [464]
  | (model-specific) |  " home" -> [1524]
  |                  |  " office" -> [3906]
  | tiktoken (GPT)   |  " deduction" -> [82546]
  | or               |  " allows" -> [6276]
  | SentencePiece    |  "..." -> [986]
  +------------------+
       |
       v
  Token count: 6 tokens for this phrase
  (actual count, not word count or char/4 estimate)

  BUDGET ALLOCATION FLOW
  ======================

  Total Window: 128,000
       |
       v
  Subtract fixed sections:
    - System prompt:       1,200
    - Safety preamble:       350
    - Output reservation:  2,000
    - Conv. history:       3,200
       |
       v
  Available for evidence: 121,250 tokens
       |
       v
  Packer fills with ranked chunks
  until budget is exhausted or
  all relevant chunks are packed

How it works (step-by-step, with invariants and failure modes)

  1. Initialize the token ledger with the model’s total context window size and the exact tokenizer. Invariant: tokenizer must match the target model exactly. Failure mode: using wrong tokenizer causes token count mismatches of 10-30%.
  2. Allocate fixed sections (system prompt, safety preamble) and compute their exact token counts. Invariant: fixed sections must always fit within the window. Failure mode: system prompt exceeds budget, leaving no room for evidence.
  3. Allocate conversation history using a sliding window of recent turns. Compute exact token count per turn. Invariant: history allocation must not exceed its cap. Failure mode: a single very long turn exceeds the history budget; truncation or summarization is needed.
  4. Reserve output tokens by subtracting from the total budget. Invariant: output reservation must be at least the minimum expected response length. Failure mode: insufficient reservation causes the model to truncate its response mid-answer.
  5. Compute available evidence budget = total - system - preamble - history - output_reservation. Invariant: available budget must be >= 0. Failure mode: fixed sections exceed the total window, leaving negative budget for evidence.
  6. Pass available evidence budget to the context packer (Concept B) for chunk selection and ordering.

Minimal concrete example

Token counting comparison across models:

Input: "What are the tax deduction rules for a home office in 2024?"

tiktoken (GPT-4):  13 tokens
  ["What", " are", " the", " tax", " deduction", " rules",
   " for", " a", " home", " office", " in", " 2024", "?"]

Anthropic tokenizer (Claude):  12 tokens
  ["What", " are", " the", " tax", " deduction", " rules",
   " for", " a", " home", " office", " in 2024", "?"]

Approximate (chars/4):  60 chars / 4 = 15 tokens (wrong!)

Budget ledger for a 4096-token model (small model example):
  total_budget:          4096
  system_prompt:          -640
  safety_preamble:        -120
  output_reservation:     -512
  conversation_history:   -800
  ─────────────────────────────
  available_for_evidence: 2024 tokens

  This means you can fit approximately 2-3 medium chunks
  (600-1000 tokens each).
  Budget discipline is critical on small-window models.

Common misconceptions

  • “Divide character count by 4 to estimate tokens.” This heuristic is wildly inaccurate for non-English text, code, URLs, and special characters. Japanese text may use 2-3x more tokens per character than English. Always use the exact tokenizer.
  • “All models use the same tokenizer.” Each model family has its own tokenizer with a different vocabulary. GPT-4 uses cl100k_base, Claude uses its own tokenizer, LLaMA uses SentencePiece. Token counts for the same text differ across models.
  • “Context window size = maximum input size.” The context window includes both input and output tokens. If the window is 128K and you pack 127K of input, the model can only generate 1K tokens of response.
  • “Bigger context windows make token budgeting unnecessary.” Even with 200K or 2M token windows, budget discipline matters. Irrelevant context dilutes attention, increases cost (per-token pricing), adds latency, and can actually hurt response quality through the “lost in the middle” phenomenon where models struggle to use information in the middle of very long contexts.
  • “Prefix caching works automatically.” Caching only activates when the prefix bytes are identical across requests. Any variation in the system prompt text, even whitespace or timestamp differences, invalidates the cache and you pay full price for every token.

Check-your-understanding questions

  1. Why does using an approximate token counter instead of the exact tokenizer cause problems in a context packing pipeline?
  2. What happens if you do not reserve tokens for the model’s output?
  3. How does prefix caching interact with context assembly, and what must be true about the system prompt for caching to work?
  4. Why does a 128K context window not mean you can pack 128K tokens of evidence?
  5. How would you handle a conversation history that exceeds its allocated token budget?

Check-your-understanding answers

  1. Approximate counters can be off by 10-30%, which means you either waste budget (packing fewer chunks than you could fit) or overflow the window (the API rejects the request or truncates silently). In a system processing thousands of queries, even a 5% error rate causes intermittent failures that are hard to debug because they depend on the specific text content.
  2. The model’s response gets truncated mid-generation. It may stop mid-sentence, miss critical parts of the answer, or fail to produce required structured output fields. This is a silent failure: no error is raised, but the output is incomplete.
  3. Prefix caching caches the KV-pair computations for token prefixes that are identical across requests. For caching to work, the system prompt and any fixed preamble must produce byte-identical token sequences every time. This means no timestamps, no request-specific data, and no randomization in the fixed prefix. The context assembly pipeline must ensure the cacheable prefix is separated from the variable content.
  4. The 128K limit includes both input and output tokens. You must subtract the output reservation (the space the model needs to generate its response), the system prompt, conversation history, and any fixed preamble. The actual space available for packed evidence is typically 80-95% of the total window.
  5. Options include: (a) sliding window that keeps only the N most recent turns, (b) summarization that compresses older turns into a shorter summary, (c) priority-based retention that keeps the most relevant past turns based on topic similarity to the current query, or (d) hybrid approaches that summarize old turns while keeping recent ones verbatim. Each approach trades information loss against token savings, and the choice depends on whether conversation continuity or evidence space is more important for the application.

Real-world applications

  • RAG systems in production (customer support, legal research, medical Q&A) must manage token budgets across system instructions, retrieved documents, and conversation history within strict latency and cost constraints.
  • Multi-model orchestration pipelines that route queries to different models (GPT-4 for complex reasoning, Claude for long-context tasks, smaller models for simple queries) need tokenizer-aware budget management per model.
  • Prompt caching optimization (Anthropic, OpenAI) saves 90% of token costs on the cacheable prefix, but only if the context assembly pipeline produces byte-identical prefixes.
  • Cost monitoring dashboards track token utilization efficiency to identify queries that waste budget by packing irrelevant content or underutilizing available window space.

Where you’ll apply it

  • Phase 1: implement the token ledger and budget allocation for fixed sections.
  • Phase 2: integrate exact tokenizer for the target model and measure packing efficiency.
  • Phase 3: optimize prefix stability for caching and track budget utilization metrics.

References

  • “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 3 (data structures for ordered access patterns)
  • OpenAI tiktoken library documentation and cookbook (https://github.com/openai/tiktoken)
  • Anthropic token counting API documentation
  • “AI Engineering” by Chip Huyen - chapters on serving and cost optimization
  • Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units” (2016) - original BPE for NLP paper
  • Kudo and Richardson, “SentencePiece: A simple and language independent subword tokenizer” (2018)

Key insights Token counting is not estimation; it is accounting, and the ledger must balance exactly or the system fails silently at the worst possible time.

Summary Token budget management requires using the exact model-specific tokenizer, allocating the context window into named sections (system prompt, history, evidence, output reservation), tracking allocations in a token ledger, and optimizing for prefix caching by keeping fixed sections byte-identical. Approximate token counting is a common source of production failures in RAG systems.

Homework/Exercises to practice the concept

  1. Using pseudocode, implement a token ledger that accepts a model name, loads the correct tokenizer, and tracks allocations for 4 named sections. Show how available evidence budget is computed.
  2. Given the following scenario, compute the token budget: model has 32K window, system prompt is 800 tokens, preamble is 200 tokens, conversation history is 4 turns averaging 300 tokens each, and output reservation is 1000 tokens. How many tokens are available for evidence?
  3. Design a prefix caching strategy for a customer support chatbot. Identify which parts of the prompt are cacheable (identical across requests) and which are variable. Estimate the cost savings if 80% of the total tokens are in the cacheable prefix.
  4. Write pseudocode for a sliding window conversation history manager that keeps the most recent N turns within a token budget, summarizing older turns when the budget is exceeded.

Solutions to the homework/exercises

  1. The ledger should have methods: allocate(section_name, text) that tokenizes the text with the exact tokenizer and records the count, available_budget() that returns total_window - sum(allocations) - output_reservation, and validate() that checks all sections fit. The constructor should dispatch to tiktoken for GPT models, the Anthropic tokenizer for Claude, and SentencePiece for LLaMA models.
  2. Total: 32,000. System prompt: -800. Preamble: -200. History: -(4 * 300) = -1,200. Output: -1,000. Available: 32,000 - 800 - 200 - 1,200 - 1,000 = 28,800 tokens for evidence. This fits approximately 30-50 medium chunks (600-1000 tokens each).
  3. Cacheable prefix: system prompt (“You are a helpful customer support agent for…”) + safety preamble (“Never reveal internal policies…”) + output format instructions (“Respond in JSON with fields…”). These are identical for every request. Variable content: conversation history, retrieved knowledge base chunks, current user query. If the cacheable prefix is 80% of tokens and cached tokens cost 10x less, the cost per request drops from X to 0.28X (80% * 0.1 + 20% * 1.0 = 0.28), a 72% savings.
  4. The sliding window manager should: (a) compute token count for each turn, (b) sum from most recent backward until the budget is exceeded, (c) for remaining older turns, generate a summary (using a separate LLM call or extractive method), (d) include the summary as a “compressed history” section that fits within its own token allocation. The invariant is that the total history section never exceeds its allocated budget.
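Solution 4 can be sketched in Python as follows. `summarize` is a placeholder for an LLM call or extractive summarizer, and `count_tokens` stands in for the model-exact tokenizer; note that the returned summary must still fit within its own separate token allocation, as the solution states.

```python
def pack_history(turns, budget, count_tokens, summarize):
    """Keep the most recent turns within `budget`; compress the rest.

    `turns` is ordered oldest-first. Returns (summary_or_None, kept_turns).
    """
    kept, used = [], 0
    for turn in reversed(turns):          # walk from most recent backward
        cost = count_tokens(turn)
        if used + cost > budget:
            break                         # this and all older turns get summarized
        kept.insert(0, turn)
        used += cost
    older = turns[: len(turns) - len(kept)]
    summary = summarize(older) if older else None
    return summary, kept
```

The invariant from the solution holds by construction: the verbatim turns never exceed `budget`, and information loss is confined to the summarized older turns.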

Concept B: Context Assembly Pipeline and Priority-Based Packing

Fundamentals Context assembly is the process of selecting, ordering, and fitting content from multiple sources into a model’s context window to maximize the quality of the model’s response. This is the core of what has come to be called “context engineering” as distinct from “prompt engineering.” While prompt engineering focuses on crafting the instruction text itself (what you tell the model to do), context engineering focuses on what information the model has access to when it processes those instructions. The insight is that the quality of an LLM response depends more on what is in the context window than on how cleverly the instructions are worded. A mediocre prompt with excellent, relevant context produces better results than a brilliant prompt with irrelevant or missing context. The context assembly pipeline is the system that decides what fills the window, and this project builds that system.

Deep Dive into the concept A production context assembly pipeline has four stages: retrieval, reranking, packing, and verification. Each stage makes decisions that affect the final context quality, and each introduces its own failure modes.

Stage 1: Retrieval. Given a user query, the retriever fetches candidate chunks from a document store. This is typically done using vector similarity search (embedding the query and finding the nearest document embeddings), keyword search (BM25), or a hybrid combination. The retriever should return more candidates than the final budget can accommodate (typically 3-10x more), giving the reranker a rich pool to select from. The key decision here is the retrieval depth (how many candidates to fetch). Too few candidates means the reranker has limited choices and may miss the best evidence. Too many candidates waste compute on chunks that will be discarded.

Stage 2: Reranking. The reranker scores each candidate chunk on multiple dimensions and produces a ranked list. Relevance scoring measures how well the chunk’s content matches the query semantics. This can use a cross-encoder model (more accurate but slower), a bi-encoder similarity score (faster but less precise), or an LLM-as-judge approach. Beyond relevance, production reranking must consider additional signals: trust score (is this chunk from a verified, authoritative source?), freshness (is the information current or stale?), coverage (does this chunk cover a topic not yet represented in the selected set?), and deduplication (is this chunk substantially similar to another already-selected chunk?). The reranking formula combines these signals with configurable weights.
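A weighted combination of these signals might look like the sketch below. The field names and weights are illustrative assumptions (they match the example weights used elsewhere in this section) and would be tuned per application; all signals are assumed normalized to [0, 1].

```python
def composite_score(chunk, weights):
    """Combine reranking signals into one score; dedup similarity is a penalty."""
    return (
        weights["relevance"] * chunk["relevance"]
        + weights["trust"] * chunk["trust"]
        + weights["freshness"] * chunk["freshness"]
        + weights["coverage"] * chunk["coverage"]
        - weights["dedup"] * chunk["dedup_penalty"]
    )
```

Keeping the weights in configuration rather than code makes it possible to A/B test reranking behavior without redeploying the pipeline.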

A critical concept in reranking is diversity-aware selection, sometimes called Maximum Marginal Relevance (MMR). Pure relevance ranking tends to select chunks that are all about the same subtopic (the one most directly matching the query), while ignoring other relevant subtopics. MMR penalizes chunks that are too similar to already-selected chunks, promoting a diverse selection that covers the query from multiple angles. For a tax deduction query, pure relevance might select 5 chunks all about the definition of home office deduction, while MMR would select chunks covering definition, eligibility criteria, calculation method, documentation requirements, and common mistakes.
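Greedy MMR selection can be sketched as follows, assuming a `relevance` scoring function and a pairwise `similarity` function (e.g., cosine similarity over chunk embeddings). The `lambda_` parameter balances relevance against diversity; 0.7 here is an arbitrary illustrative default.

```python
def mmr_select(candidates, relevance, similarity, k, lambda_=0.7):
    """Greedy Maximum Marginal Relevance: at each step, pick the candidate
    maximizing lambda * relevance - (1 - lambda) * max-similarity-to-selected."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lambda_ * relevance(c) - (1 - lambda_) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `lambda_ = 1.0` this degenerates to pure relevance ranking; lowering it increasingly penalizes chunks that duplicate subtopics already covered, which is exactly the tax-deduction behavior described above.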

Stage 3: Packing. The packer takes the reranked list and fills the evidence section of the context window, respecting the token budget from the ledger (Concept A). The packing algorithm is conceptually a variant of the knapsack problem: each chunk has a “value” (its reranking score) and a “weight” (its token count), and the goal is to maximize total value within the weight constraint. However, unlike the classic knapsack, context packing has additional constraints: ordering matters (the model pays more attention to content at the beginning and end of the context, a phenomenon called “lost in the middle”), mandatory chunks must be included regardless of score (e.g., regulatory disclaimers or required context), and group constraints may require at least one chunk from each required topic.

The packing algorithm proceeds as follows: first, allocate mandatory chunks (subtract their token counts from the budget). Then, iterate through the reranked list in order, adding chunks that fit within the remaining budget. When a chunk does not fit, skip it and try the next one (since a smaller, lower-ranked chunk might fit in the remaining space). Record the inclusion or exclusion reason for each chunk in the explainability field.
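The packing loop just described can be sketched as below. The `id` and `tokens` fields on each chunk are assumptions about the chunk representation; the inclusion/exclusion log implements the explainability field.

```python
def pack(mandatory, ranked, budget):
    """Greedy first-fit packing with per-chunk inclusion/exclusion reasons."""
    packed, log = [], []
    for chunk in mandatory:
        budget -= chunk["tokens"]
        packed.append(chunk)
        log.append((chunk["id"], "included: mandatory"))
    if budget < 0:
        raise ValueError("mandatory chunks exceed evidence budget")
    for chunk in ranked:                      # ranked best-first by the reranker
        if chunk["tokens"] <= budget:
            packed.append(chunk)
            budget -= chunk["tokens"]
            log.append((chunk["id"], "included: fits budget"))
        else:
            # Skip, but keep trying: a smaller lower-ranked chunk may still fit.
            log.append((chunk["id"], "excluded: budget exceeded"))
    return packed, log, budget
```

Replaying the worked trace from the Stage 3 diagram (2,800-token budget, a 180-token mandatory disclaimer, then ranked chunks of 850, 720, 1100, 680, 340, and 290 tokens) reproduces the same includes, skips, and 30 tokens of remaining budget.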

The “lost in the middle” phenomenon (documented in research by Liu et al., 2023) shows that LLMs struggle to use information placed in the middle of long contexts. They attend most to the beginning and end. This has implications for chunk ordering: the most important evidence should be placed first and last, with less critical chunks in the middle. This is sometimes called the “sandwich” ordering strategy.
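One simple sandwich ordering is sketched below; the exact placement policy varies, and this is just one way to push high-scoring evidence toward the attended edges.

```python
def sandwich_order(chunks_by_score):
    """Place top-ranked chunks at the edges, weakest in the middle.

    Alternates: 1st -> front, 2nd -> back, 3rd -> front, ... so the
    beginning and end of the context hold the highest-scoring evidence.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_score):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For five chunks ranked 1 (best) through 5, this yields the order 1, 3, 5, 4, 2: the two strongest chunks bracket the context and the weakest sits in the middle, where attention is empirically lowest.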

Stage 4: Verification. After packing, the verifier checks that the assembled context meets quality requirements. Coverage verification checks that all required topics are represented (e.g., a legal query must include at least one chunk about jurisdiction-specific rules). Consistency verification checks for factual conflicts between included chunks (e.g., two chunks stating different tax deduction limits). Budget verification confirms that the total token count matches the ledger (no overflow, acceptable waste). Source diversity verification confirms that chunks come from multiple sources rather than a single document (reducing single-source bias).
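A minimal verifier over the packed set might look like the sketch below. The `tokens`, `topic`, and `source` chunk fields and the `required_topics` parameter are assumptions; consistency checking (detecting factual conflicts between chunks) is omitted because it typically needs an LLM or rule engine.

```python
def verify_packet(packed, budget, required_topics):
    """Return a list of verification failures (empty list = pass)."""
    failures = []
    total = sum(c["tokens"] for c in packed)
    if total > budget:
        failures.append(f"budget overflow: {total} > {budget}")
    covered = {c["topic"] for c in packed}
    for topic in sorted(required_topics - covered):
        failures.append(f"missing required topic: {topic}")
    sources = {c["source"] for c in packed}
    if len(sources) < 2:
        failures.append("single-source packet: low source diversity")
    return failures
```

Returning a failure list rather than raising lets the pipeline decide per-check whether a failure is fatal (budget overflow) or merely logged (source diversity).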

The output of the pipeline is a context packet: a structured document containing the assembled context with metadata. Each packed chunk includes its source ID, relevance score, token count, position, and inclusion reason. Each excluded chunk (from the top-k candidates) includes its exclusion reason (budget exceeded, duplicate of chunk X, below relevance threshold, stale timestamp). This explainability metadata is critical for debugging hallucinations and quality issues: if the model gives a wrong answer, you can inspect the context packet to determine whether the correct information was present, correctly positioned, or missing.

How this fits into the projects The context assembly pipeline is the core implementation of Project 4. The retriever, reranker, packer, and verifier are the main components you will build. This concept also connects to Project 5 (Few-Shot Example Curator) which selects examples rather than evidence chunks, and Project 12 (Conversation Memory Compressor) which manages the conversation history section of the context window.

Definitions & key terms

  • Context engineering: The discipline of designing systems that provide the right information, in the right format, at the right time, to fill the model’s context window optimally. Broader than prompt engineering.
  • Retrieval: Fetching candidate chunks from a document store using vector search, keyword search, or hybrid methods.
  • Reranking: Scoring candidate chunks on relevance, trust, freshness, coverage, and deduplication to produce a ranked list.
  • Maximum Marginal Relevance (MMR): A diversity-aware selection algorithm that penalizes chunks similar to already-selected ones, promoting topical diversity.
  • Context packing: Fitting ranked chunks into the evidence section of the context window, respecting token budget constraints.
  • Lost in the middle: The empirical finding that LLMs attend most to the beginning and end of long contexts, underutilizing information in the middle.
  • Context packet: The fully assembled context with metadata, explainability fields, and token accounting for every included and excluded chunk.
  • Coverage score: A metric measuring whether all required topics are represented in the packed context.
  • Explainability field: Per-chunk metadata recording why a chunk was included or excluded from the context.

Mental model diagram (ASCII)

              CONTEXT ASSEMBLY PIPELINE
              ==========================

  User Query: "Can I deduct home office expenses?"
       |
       v
  +----------------------------------------------------------+
  | STAGE 1: RETRIEVAL                                        |
  |                                                           |
  | Vector Search (embeddings)  +  Keyword Search (BM25)      |
  |          |                            |                   |
  |          +----------+  +--------------+                   |
  |                     v  v                                  |
  |              Candidate Pool: 42 chunks                    |
  |              (3-10x more than budget allows)              |
  +----------------------------------------------------------+
       |
       v
  +----------------------------------------------------------+
  | STAGE 2: RERANKING                                        |
  |                                                           |
  | For each chunk, compute composite score:                  |
  |                                                           |
  |   relevance (cross-encoder)     * 0.50                    |
  |   + trust_score (source auth.)  * 0.20                    |
  |   + freshness (days since pub.) * 0.15                    |
  |   + coverage (new topic bonus)  * 0.10                    |
  |   - dedup_penalty (sim to selected) * 0.05               |
  |   ─────────────────────────────────────                   |
  |   = composite_score                                       |
  |                                                           |
  | Apply MMR for diversity:                                  |
  |   Penalize chunks similar to already-selected set         |
  |                                                           |
  | Output: Ranked list of 42 chunks, top-12 selected         |
  +----------------------------------------------------------+
       |
       v
  +----------------------------------------------------------+
  | STAGE 3: PACKING                                          |
  |                                                           |
  |   Token Budget Available: 2800 tokens (from ledger)       |
  |                                                           |
  |   1. Allocate mandatory chunks first:                     |
  |      [regulatory_disclaimer: 180 tokens]                  |
  |      Remaining: 2620 tokens                               |
  |                                                           |
  |   2. Pack ranked chunks in order:                         |
  |      Chunk 1 (850 tok, score 0.94) -> INCLUDE  [1770 rem] |
  |      Chunk 2 (720 tok, score 0.91) -> INCLUDE  [1050 rem] |
  |      Chunk 3 (1100 tok, score 0.88) -> SKIP (too large)  |
  |      Chunk 4 (680 tok, score 0.85) -> INCLUDE  [370 rem]  |
  |      Chunk 5 (340 tok, score 0.82) -> INCLUDE  [30 rem]   |
  |      Chunk 6 (290 tok, score 0.79) -> SKIP (too large)   |
  |                                                           |
  |   3. Apply "sandwich" ordering:                           |
  |      [highest first, middle descending, 2nd-highest last] |
  |                                                           |
  |   Packed: 2770/2800 tokens (98.9% utilization)            |
  +----------------------------------------------------------+
       |
       v
  +----------------------------------------------------------+
  | STAGE 4: VERIFICATION                                     |
  |                                                           |
  | Coverage check: required topics present?                  |
  |   [x] deduction definition                                |
  |   [x] eligibility criteria                                |
  |   [ ] calculation method  <-- WARN: missing topic         |
  |   [x] documentation requirements                          |
  |                                                           |
  | Consistency check: factual conflicts?                     |
  |   No conflicts detected                                   |
  |                                                           |
  | Budget check: 2770/2800 = OK                              |
  | Source diversity: 3 unique sources = OK                    |
  +----------------------------------------------------------+
       |
       v
  Context Packet (JSON output)
  with per-chunk explainability fields

How it works (step-by-step, with invariants and failure modes)

  1. The retriever receives the user query and fetches candidate chunks from the document store. Invariant: candidate pool size must be >= 3x the expected final chunk count. Failure mode: too few candidates means the reranker cannot improve over the initial retrieval order.
  2. The reranker scores each candidate on relevance, trust, freshness, coverage, and deduplication. Invariant: scoring must be deterministic given the same inputs (no random components unless seeded). Failure mode: non-deterministic scoring produces different context packets for the same query, making quality issues unreproducible.
  3. MMR diversity selection iterates through the ranked list, at each step selecting the chunk that maximizes the balance between relevance and diversity from the already-selected set. Invariant: the MMR lambda parameter controls the relevance-diversity tradeoff and must be configurable. Failure mode: lambda too high produces relevant but redundant selection; lambda too low produces diverse but marginally relevant selection.
  4. The packer iterates through the reranked list and adds chunks that fit within the remaining token budget. Invariant: mandatory chunks are always allocated first. Failure mode: a mandatory chunk exceeds the available budget, requiring either budget increase or mandatory chunk reduction.
  5. The packer applies the sandwich ordering strategy for the final context layout. Invariant: the most important evidence is at the beginning and end of the evidence section. Failure mode: critical evidence placed in the middle may be underutilized by the model.
  6. The verifier checks coverage, consistency, budget compliance, and source diversity. Invariant: all checks must pass for the context packet to be marked as valid. Failure mode: missing required topic triggers a coverage warning, which may require re-running retrieval with adjusted parameters.
  7. Each chunk receives an explainability field recording its inclusion reason (score, rank, mandatory) or exclusion reason (budget exceeded, duplicate, stale, below threshold). Invariant: every candidate chunk must have an explainability entry. Failure mode: missing explainability makes quality debugging impossible.
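Steps 2 and 3 above can be sketched compactly. The weights mirror the Stage 2 diagram; the dedup penalty is applied dynamically inside MMR rather than as a static fifth term, and the `Candidate` fields and cosine similarity are illustrative assumptions, not the project's prescribed interfaces.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    chunk_id: str
    relevance: float   # cross-encoder score, 0..1
    trust: float       # source authority, 0..1
    freshness: float   # decayed recency, 0..1
    coverage: float    # new-topic bonus, 0..1
    embedding: list    # vector used for the diversity penalty

WEIGHTS = {"relevance": 0.50, "trust": 0.20, "freshness": 0.15, "coverage": 0.10}

def composite_score(c: Candidate) -> float:
    # Weighted sum of the positive signals (deduplication is handled
    # dynamically by MMR below, not as a static term here).
    return (WEIGHTS["relevance"] * c.relevance
            + WEIGHTS["trust"] * c.trust
            + WEIGHTS["freshness"] * c.freshness
            + WEIGHTS["coverage"] * c.coverage)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(candidates, k, lam=0.7):
    """Greedy MMR: at each step pick the chunk maximizing
    lam * composite_score - (1 - lam) * max similarity to selected."""
    remaining = sorted(candidates, key=lambda c: c.chunk_id)  # stable tie order
    selected = []
    while remaining and len(selected) < k:
        def mmr(c):
            redundancy = max((cosine(c.embedding, s.embedding) for s in selected),
                             default=0.0)
            return lam * composite_score(c) - (1 - lam) * redundancy
        best = max(remaining, key=mmr)   # earliest chunk_id wins exact ties
        selected.append(best)
        remaining.remove(best)
    return selected
```

With lam=0.7 a near-duplicate of an already-selected chunk loses to a less relevant but novel one, which is exactly the redundancy failure mode MMR exists to avoid.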

Minimal concrete example

Context Packet Output (JSON):
{
  "trace_id": "ctx-20240115-0042",
  "query": "Can I deduct home office expenses?",
  "model": "gpt-4o",
  "budget": {
    "total_window": 128000,
    "system_prompt": 1200,
    "output_reservation": 2000,
    "history": 3200,
    "available_evidence": 121600,
    "evidence_budget": 2800,
    "packed_evidence": 2770,
    "utilization": 0.989
  },
  "packed_chunks": [
    {
      "chunk_id": "tax-kb-0142",
      "source": "irs_pub_587.md",
      "position": 1,
      "tokens": 850,
      "relevance_score": 0.94,
      "trust_score": 0.99,
      "included_reason": "highest_composite_score"
    },
    {
      "chunk_id": "tax-kb-0088",
      "source": "home_office_guide.md",
      "position": 2,
      "tokens": 720,
      "relevance_score": 0.91,
      "trust_score": 0.95,
      "included_reason": "second_ranked_by_composite"
    }
  ],
  "excluded_chunks": [
    {
      "chunk_id": "tax-kb-0201",
      "tokens": 1100,
      "relevance_score": 0.88,
      "excluded_reason": "exceeded_remaining_budget (needed 1100, had 1050)"
    }
  ],
  "coverage": {
    "required_topics": ["deduction_definition", "eligibility", "calculation"],
    "covered": ["deduction_definition", "eligibility"],
    "missing": ["calculation"],
    "score": 0.67
  }
}

Deterministic tie-breaking for reproducibility:
  When two chunks have identical composite scores,
  break ties by: chunk_id (lexicographic) -> source_id -> position_in_source
  This ensures identical context packets across runs.
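The tie-breaking cascade above maps directly onto a Python sort key; the dict field names are illustrative:

```python
def rank_key(chunk):
    # Descending composite score; ties broken by chunk_id, then source_id,
    # then position within the source, so identical inputs always produce
    # identical orderings (and therefore identical context packets).
    return (-chunk["composite_score"],
            chunk["chunk_id"],
            chunk["source_id"],
            chunk["position_in_source"])

def deterministic_rank(chunks):
    return sorted(chunks, key=rank_key)
```

Because every component of the key is deterministic, re-running the reranker on the same candidate pool can never reorder tied chunks.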

Common misconceptions

  • “More context is always better.” Research shows that irrelevant context dilutes attention and can cause the model to hallucinate or ignore critical evidence. The “lost in the middle” phenomenon means that information in the center of very long contexts is underutilized. Strategic selection of fewer, highly relevant chunks often outperforms dumping everything into a large context window.
  • “Relevance ranking alone produces the best context.” Relevance-only ranking produces redundant selections: the top 10 most relevant chunks often cover the same subtopic. Diversity-aware selection (MMR) produces context that covers the query from multiple angles, which leads to more comprehensive and accurate responses.
  • “Token budget is the only constraint on context quality.” Coverage requirements (must include certain topics), consistency constraints (no conflicting information), source diversity (not all from one document), and ordering effects (lost in the middle) all affect quality independently of budget utilization.
  • “Context packing is a simple sort-and-truncate operation.” Effective packing involves knapsack-like optimization (smaller chunks may fit where large ones cannot), mandatory chunk allocation, diversity constraints, ordering strategies, and explainability logging. A naive sort-and-truncate leaves significant quality on the table.
  • “Once you build the pipeline, it works for all queries.” Different query types need different retrieval strategies (keyword-heavy queries vs. semantic queries), different budget allocations (simple factual queries need less evidence than complex multi-topic queries), and different coverage requirements. The pipeline must be configurable per query type.

Check-your-understanding questions

  1. Why should the retriever return 3-10x more candidates than the final budget can accommodate?
  2. How does Maximum Marginal Relevance (MMR) improve context quality compared to pure relevance ranking?
  3. What is the “lost in the middle” phenomenon, and how does the sandwich ordering strategy address it?
  4. Why are explainability fields on each chunk critical for debugging hallucination issues?
  5. When two chunks conflict factually, how should the packer handle this?

Check-your-understanding answers

  1. Over-retrieval gives the reranker a larger pool to select from, enabling better diversity, coverage, and quality. If the retriever returns exactly as many chunks as the budget allows, the reranker has no room to improve the selection: it must use whatever the retriever found. Over-retrieval also provides fallback candidates when some chunks are excluded for non-relevance reasons (staleness, untrusted source, deduplication).
  2. Pure relevance ranking selects the top-N most similar chunks to the query, but these chunks often cover the same subtopic because similarity metrics cluster around the query’s primary topic. MMR introduces a diversity penalty: at each selection step, it penalizes chunks that are too similar to already-selected chunks. This produces a selection that covers the query from multiple angles (definition, eligibility, calculation, examples) rather than N slightly different phrasings of the same fact.
  3. Research shows that LLMs attend most strongly to the beginning and end of their context window, with reduced attention to the middle. The sandwich ordering strategy places the highest-value evidence at the beginning, second-highest at the end, and lower-value evidence in the middle. This maximizes the probability that the model uses the most important evidence while still including supporting context.
  4. When a model hallucinates or gives an incorrect answer, you need to determine whether the correct information was (a) present in the context and ignored, (b) present but poorly positioned, or (c) missing entirely. Explainability fields let you check: was the correct chunk retrieved? Was it reranked highly? Was it included or excluded, and why? Without this metadata, debugging requires re-running the entire pipeline, which may produce different results due to non-deterministic components.
  5. Conflicting chunks present a difficult choice. Options: (a) include both and let the model resolve the conflict (risks confusing the model), (b) include only the higher-trust or more recent chunk (risks losing valid alternative information), (c) include both but add an explicit conflict annotation that instructs the model to acknowledge the disagreement. The recommended approach is (c) for transparent handling, with a conflict detection check in the verification stage that flags the issue.

Real-world applications

  • Enterprise RAG systems (customer support, legal research, medical Q&A) use context assembly pipelines to pack the most relevant knowledge base chunks into each model call, directly impacting answer quality and cost.
  • Anthropic and OpenAI have both emphasized context engineering as the successor to prompt engineering, with Anthropic’s documentation describing it as designing dynamic systems that provide the right information at the right time.
  • Multi-turn conversational AI agents use context managers to balance conversation history, retrieved knowledge, and tool outputs within a single context window, dynamically adjusting allocations as conversations evolve.
  • AI coding assistants (Cursor, Copilot) use sophisticated context assembly to select the most relevant code files, documentation, and conversation history to include when generating code suggestions.
  • Search-augmented generation systems (Perplexity, Google AI Overview) use context assembly to select and order retrieved web content for answer synthesis.

Where you’ll apply it

  • Phase 1: implement the retriever and reranker with configurable scoring weights.
  • Phase 2: implement the packer with knapsack optimization, mandatory chunk allocation, and sandwich ordering.
  • Phase 3: implement the verifier (coverage, consistency, budget) and explainability fields.

References

  • “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 2-3 (data models and storage engines relevant to document retrieval)
  • “Introduction to Information Retrieval” by Manning, Raghavan, and Schütze - Ch. 6-8 (vector space models and ranked retrieval)
  • Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023) - empirical study of positional attention patterns
  • Carbonell and Goldstein, “The Use of MMR, Diversity-Based Reranking for Reordering Documents” (1998) - the original MMR paper
  • “AI Engineering” by Chip Huyen - chapters on RAG and context management
  • Anthropic documentation on context engineering (2025)
  • Philipp Schmid, “The New Skill in AI is Not Prompting, It’s Context Engineering” (2025)

Key insights Context engineering is about building the system that decides what fills the window, not just crafting what you say in it; the quality of an LLM response depends more on what it can see than on how you ask.

Summary The context assembly pipeline has four stages: retrieval (fetch candidate chunks), reranking (score on relevance, trust, freshness, diversity), packing (fit chunks within token budget using knapsack-like optimization with mandatory allocations and sandwich ordering), and verification (check coverage, consistency, budget compliance). Each chunk gets explainability metadata for debugging. Context engineering is distinct from prompt engineering: it is the discipline of providing the right information, not just the right instructions.

Homework/Exercises to practice the concept

  1. Design a reranking formula with 5 signals (relevance, trust, freshness, coverage, dedup penalty) and configurable weights. For a tax law query, assign weights and justify each choice.
  2. Given 10 candidate chunks with the following token counts and relevance scores, implement the greedy packing algorithm (in pseudocode) with a budget of 3000 tokens: Chunks: [(800, 0.95), (600, 0.92), (1200, 0.90), (400, 0.88), (900, 0.85), (550, 0.82), (700, 0.79), (300, 0.75), (1100, 0.72), (450, 0.70)]. Show the final packed set, total tokens used, and utilization percentage.
  3. Explain how you would modify the packing algorithm to handle the “lost in the middle” phenomenon. Show the reordering step with the chunks from exercise 2.
  4. Design a coverage requirement specification for a medical Q&A system. What topics must always be present in the context for a drug interaction query?

Solutions to the homework/exercises

  1. For a tax law query: relevance = 0.45 (highest weight because answer accuracy depends on finding the right rules), trust = 0.25 (tax information from the IRS is more reliable than blog posts), freshness = 0.15 (tax laws change annually, stale information is dangerous), coverage = 0.10 (a complete answer needs multiple aspects: definition, eligibility, calculation, documentation), dedup_penalty = 0.05 (avoid redundant chunks). Justification: trust is weighted higher than typical because tax information has legal consequences; freshness is important because tax law changes annually.
  2. Greedy packing with 3000 budget: Chunk 1 (800, 0.95) -> include, remaining 2200. Chunk 2 (600, 0.92) -> include, remaining 1600. Chunk 3 (1200, 0.90) -> include (1200 <= 1600), remaining 400. Chunk 4 (400, 0.88) -> include, remaining 0. Chunks 5-10 -> skip (budget exhausted). Final set: chunks 1, 2, 3, 4. Total: 3000/3000 = 100% utilization. Note: the skip-and-continue rule did not fire here because every chunk fit until the budget was exhausted; with a budget of 2500, chunk 3 (1200 tokens against 1100 remaining) would be skipped and the smaller chunk 4 would fit in its place.
  3. After greedy packing selects chunks [1, 2, 3, 4] with scores [0.95, 0.92, 0.90, 0.88], apply sandwich ordering: highest score first (chunk 1, 0.95), second-highest last (chunk 2, 0.92), then fill the middle in descending order. Final order: [1(0.95), 3(0.90), 4(0.88), 2(0.92)]. This places the two highest-value chunks at the beginning and end where the model pays most attention.
  4. For a drug interaction query, required coverage topics: (a) primary drug mechanism of action, (b) interacting drug mechanism, (c) specific interaction description (synergistic, antagonistic, metabolic), (d) clinical significance (severity, frequency), (e) recommended action (avoid, adjust dose, monitor). All five topics must have at least one chunk present. If any topic is missing, the coverage check should flag a warning and potentially re-run retrieval with a topic-specific boost query.
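The greedy skip-and-continue packer used throughout these exercises can be sketched as below, replayed on the Stage 3 numbers from the pipeline diagram in section 2 (180-token mandatory disclaimer, 2800-token evidence budget); the tuple layout is an illustrative assumption.

```python
def greedy_pack(chunks, budget, mandatory=()):
    """chunks: (chunk_id, tokens, score) tuples sorted by descending score.
    Returns (packed, excluded, remaining_tokens)."""
    packed, excluded = [], []
    remaining = budget
    for cid, tokens, score in mandatory:          # mandatory chunks first
        remaining -= tokens
        packed.append((cid, tokens, score, "mandatory"))
    for cid, tokens, score in chunks:
        if tokens <= remaining:                   # fits: include
            remaining -= tokens
            packed.append((cid, tokens, score, "ranked"))
        else:                                     # too large: skip, keep trying
            excluded.append((cid, tokens, score, "exceeded_remaining_budget"))
    return packed, excluded, remaining

# Replaying the Stage 3 walkthrough from the diagram:
ranked = [("c1", 850, 0.94), ("c2", 720, 0.91), ("c3", 1100, 0.88),
          ("c4", 680, 0.85), ("c5", 340, 0.82), ("c6", 290, 0.79)]
packed, excluded, rem = greedy_pack(ranked, 2800,
                                    mandatory=[("disclaimer", 180, 1.0)])
used = 2800 - rem
print(f"packed {used}/2800 tokens ({used / 2800:.1%})")
# → packed 2770/2800 tokens (98.9%)
```

Note how c3 (1100 tokens) is skipped when only 1050 remain, yet the smaller c4 and c5 still fit afterwards: the knapsack-like behavior a naive truncation would miss.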

3. Project Specification

3.1 What You Will Build

A context-packing pipeline that retrieves, reranks, and packs evidence chunks within strict token budgets, producing deterministic context packets with explainability and coverage metrics.

3.2 Functional Requirements

  1. Retrieve candidate chunks from a document store given a user query.
  2. Rerank chunks by composite score (relevance, trust, freshness, coverage, deduplication) with configurable weights.
  3. Pack chunks under strict token limit with deterministic ordering and mandatory chunk allocation.
  4. Emit explainability fields for every chunk (included_reason or excluded_reason).
  5. Verify coverage (required topics present), consistency (no factual conflicts), and budget compliance.
  6. Produce a context packet JSON with token ledger, packed chunks, excluded chunks, and coverage report.

3.3 Non-Functional Requirements

  • Performance: Average pack operation under 400 ms for 50 candidate chunks.
  • Reliability: Same query + corpus snapshot + seed yields identical context packets across runs.
  • Security/Policy: Blocked or untrusted sources cannot be packed into final context. Trust scores below configurable threshold are excluded.

3.4 Example Usage / Output

$ uv run p04-context pack --query "Can I deduct home office expenses?" --kb fixtures/tax_kb --budget 2800 --out out/p04
[INFO] Tokenizer: tiktoken (cl100k_base) for model gpt-4o
[INFO] Token ledger: system=1200, preamble=350, history=0, output_res=2000, evidence_budget=2800
[INFO] Retrieved candidates: 42 chunks from tax_kb
[INFO] Reranked top-k: 12 chunks (MMR lambda=0.7)
[PACK] Mandatory chunks: 1 (regulatory_disclaimer, 180 tokens)
[PACK] Packed evidence: 5 chunks, 2741/2800 tokens (97.9% utilization)
[PASS] Coverage score: 0.93 (4/4 required topics present, 1 partial)
[PASS] Consistency: no factual conflicts detected
[PASS] Source diversity: 3 unique sources
[INFO] Context packet: out/p04/context_packet.json
[INFO] Explainability report: out/p04/explainability.csv

3.5 Data Formats / Schemas / Protocols

  • Chunk index: source_id, section_id, text, token_count, trust_score, timestamp, topic_tags.
  • Reranking config YAML: scoring weights, MMR lambda, trust threshold, freshness decay.
  • Context packet JSON: trace_id, query, model, budget ledger, packed_chunks (with explainability), excluded_chunks (with reasons), coverage report.
  • Explainability CSV: chunk_id, action (included/excluded), reason, score, tokens, position.
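One plausible Python shape for these records, mirroring the P04_* structures described later in section 4.3; this is a sketch, not a prescribed schema, and the field names are assumptions.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Chunk:
    chunk_id: str
    source_id: str
    text: str
    token_count: int       # computed by the provider-exact tokenizer
    trust_score: float
    timestamp: str         # ISO date, used by the freshness filter
    topic_tags: list

@dataclass
class PackedChunk:
    chunk: Chunk
    position: int
    included_reason: str   # e.g. "highest_composite_score" or "mandatory"

@dataclass
class ContextPacket:
    trace_id: str
    query: str
    model: str
    budget: dict           # token ledger allocations + evidence budget
    packed_chunks: list    # of PackedChunk
    excluded_chunks: list  # of (Chunk, excluded_reason) pairs
    coverage: dict

def to_json(packet: ContextPacket) -> str:
    # asdict recurses through nested dataclasses, so the packet
    # serializes directly into the context_packet.json artifact.
    return json.dumps(asdict(packet), indent=2)
```

Keeping the packet as plain dataclasses makes the JSON output a mechanical serialization step, which helps the determinism requirement in section 3.3.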

3.6 Edge Cases

  • Highly relevant chunk exceeds remaining token budget (must skip and try next smaller chunk).
  • Two chunks conflict factually but both rank highly (consistency check should flag).
  • Query asks outside corpus domain (coverage check reports 0% required topics).
  • Source chunk has stale timestamp beyond freshness policy (excluded by freshness filter).
  • Mandatory chunk exceeds available evidence budget (budget error requiring policy adjustment).
  • All candidates are from a single source (source diversity warning).
  • Conversation history consumes most of the budget, leaving minimal evidence space (triggers history compression).

3.7 Real World Outcome

This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.

3.7.1 How to Run (Copy/Paste)

$ uv run p04-context pack --query "Can I deduct home office expenses?" --kb fixtures/tax_kb --budget 2800 --out out/p04
  • Working directory: project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS
  • Required inputs: project fixtures under fixtures/
  • Output directory: out/p04

3.7.2 Golden Path Demo (Deterministic)

Use the fixed seed already embedded in the command or config profile. You should see stable packed chunk sets, identical token counts, and identical coverage scores between runs. The trace_id and corpus hash are logged in the context packet header.

3.7.3 If CLI: exact terminal transcript

$ uv run p04-context pack --query "Can I deduct home office expenses?" --kb fixtures/tax_kb --budget 2800 --out out/p04
[INFO] Tokenizer: tiktoken (cl100k_base) for model gpt-4o
[INFO] Token ledger: system=1200, preamble=350, history=0, output_res=2000, evidence_budget=2800
[INFO] Retrieved candidates: 42 chunks from tax_kb
[INFO] Reranked top-k: 12 chunks (MMR lambda=0.7)
[PACK] Mandatory chunks: 1 (regulatory_disclaimer, 180 tokens)
[PACK] Packed evidence: 5 chunks, 2741/2800 tokens (97.9% utilization)
[PASS] Coverage score: 0.93 (4/4 required topics present, 1 partial)
[PASS] Consistency: no factual conflicts detected
[PASS] Source diversity: 3 unique sources
[INFO] Context packet: out/p04/context_packet.json
[INFO] Explainability report: out/p04/explainability.csv
$ echo $?
0

Failure demo:

$ uv run p04-context pack --query "Can I deduct home office expenses?" --kb fixtures/tax_kb --budget 300 --out out/p04
[ERROR] Budget too small to fit required system + policy preamble (needs >= 640 tokens)
[HINT] Increase --budget or reduce mandatory preamble sections
$ echo $?
2

Coverage warning demo:

$ uv run p04-context pack --query "quantum computing applications" --kb fixtures/tax_kb --budget 2800 --out out/p04
[WARN] Query appears outside corpus domain (tax_kb)
[WARN] Coverage score: 0.00 (0/4 required topics present)
[INFO] Packed evidence: 3 chunks, 1200/2800 tokens (42.9% utilization)
[HINT] Consider using a different knowledge base for this query
$ echo $?
1

4. Solution Architecture

4.1 High-Level Design

              CONTEXT WINDOW MANAGER ARCHITECTURE
              ====================================

  User Query + Config
        |
        v
  +----------------------------------------------+
  |           TOKEN LEDGER                        |
  | - Load model-specific tokenizer               |
  | - Compute fixed section allocations            |
  | - Calculate available evidence budget          |
  +----------------------------------------------+
        |
        v
  +----------------------------------------------+
  |           RETRIEVER                           |
  | - Vector search + keyword search (hybrid)      |
  | - Fetch 3-10x candidate pool                   |
  | - Tag each chunk with source metadata          |
  +----------------------------------------------+
        |
        v
  +----------------------------------------------+
  |           RERANKER                            |
  | - Composite scoring (5 signals + weights)      |
  | - MMR diversity selection                      |
  | - Trust and freshness filtering                |
  +----------------------------------------------+
        |
        v
  +----------------------------------------------+
  |           PACKER                              |
  | - Mandatory chunk allocation                   |
  | - Greedy knapsack-style packing                |
  | - Sandwich ordering for attention optimization |
  | - Explainability fields per chunk              |
  +----------------------------------------------+
        |
        v
  +----------------------------------------------+
  |           VERIFIER                            |
  | - Coverage check (required topics)             |
  | - Consistency check (factual conflicts)        |
  | - Budget compliance (token ledger balanced)    |
  | - Source diversity check                       |
  +----------------------------------------------+
        |
        v
  Context Packet JSON + Explainability CSV
  -> out/p04/

4.2 Key Components

| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Token Ledger | Manages budget allocation using exact model-specific tokenizer. | Use provider-exact tokenizer (tiktoken, Anthropic, SentencePiece). Never approximate. |
| Retriever | Fetches candidate chunks from corpus using hybrid search. | Return 3-10x candidates for rich reranking pool. |
| Reranker | Scores candidates on 5 signals with configurable weights and MMR diversity. | Separate relevance from trust/freshness. Use deterministic scoring with seeded tie-breakers. |
| Packer | Fills evidence budget with ranked chunks using greedy knapsack + sandwich ordering. | Mandatory chunks first. Skip too-large chunks and try smaller ones. Log every decision. |
| Verifier | Validates coverage, consistency, budget, and source diversity. | All checks must pass for valid context packet. Coverage failures emit warnings, budget failures emit errors. |

4.3 Data Structures (No Full Code)

P04_TokenLedger:
- model: "gpt-4o"
- tokenizer: tiktoken(cl100k_base)
- total_window: 128000
- allocations: {system: 1200, preamble: 350, history: 0, output_res: 2000}
- available_evidence: 124450
- packed_evidence: 0 (updated during packing)

P04_Chunk:
- chunk_id: "tax-kb-0142"
- source_id: "irs_pub_587"
- text: <content>
- token_count: 850  (computed by exact tokenizer)
- trust_score: 0.99
- timestamp: "2024-01-15"
- topic_tags: ["deduction_definition", "eligibility"]

P04_ContextPacket:
- trace_id: "ctx-20240115-0042"
- query: <text>
- model: "gpt-4o"
- budget: P04_TokenLedger
- packed_chunks: [P04_Chunk + position + included_reason]
- excluded_chunks: [P04_Chunk + excluded_reason]
- coverage: {required: [...], covered: [...], missing: [...], score: 0.93}
- consistency: {conflicts: [], status: "clean"}

4.4 Algorithm Overview

Key algorithm: Priority-based context packing with budget constraints

  1. Initialize token ledger with model-specific tokenizer and compute available evidence budget.
  2. Retrieve candidate chunks (3-10x over-retrieval) and compute exact token count for each.
  3. Score candidates using composite reranking formula with configurable weights.
  4. Apply MMR diversity selection to produce the final ranked list.
  5. Allocate mandatory chunks first, subtracting from evidence budget.
  6. Greedily pack ranked chunks: if the next chunk fits, include it; if not, skip and try the next smaller one.
  7. Apply sandwich ordering to the packed set (highest-value first and last, others in middle).
  8. Run verification checks (coverage, consistency, budget, diversity).
  9. Emit context packet JSON with full explainability.
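Step 7's sandwich ordering is small enough to show directly; chunks here are (id, score) pairs, an assumed layout.

```python
def sandwich_order(chunks, score=lambda c: c[1]):
    """Reorder packed chunks so the highest-scoring chunk comes first,
    the second-highest comes last, and the rest fill the middle in
    descending score order (mitigating 'lost in the middle')."""
    ranked = sorted(chunks, key=score, reverse=True)
    if len(ranked) < 3:
        return ranked
    first, second, *middle = ranked
    return [first, *middle, second]

print(sandwich_order([("a", 0.95), ("b", 0.92), ("c", 0.88), ("d", 0.85)]))
# → [('a', 0.95), ('c', 0.88), ('d', 0.85), ('b', 0.92)]
```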

Complexity Analysis (conceptual):

  • Time: O(n * k) for composite scoring (n candidates, k scoring signals), O(n * m) for MMR selection of m chunks (assuming the max-similarity term is cached incrementally), O(n) for packing.
  • Space: O(n) for candidate pool and context packet output.

5. Implementation Guide

5.1 Development Environment Setup

# 1) Install dependencies
#    - Python 3.11+ with uv
#    - tiktoken (for OpenAI models) or provider-specific tokenizer
#    - numpy (for vector similarity in reranking)
#    - pyyaml (for reranking config)
#    - jinja2 (optional, for HTML report generation)

# 2) Prepare fixtures
#    - fixtures/tax_kb/ (document chunks as JSONL or individual files)
#    - fixtures/rerank_config.yaml (scoring weights and MMR parameters)
#    - fixtures/coverage_requirements.yaml (required topics per query type)

# 3) Run the project command(s) listed in section 3.7

5.2 Project Structure

p04/
├── src/
│   ├── ledger.py          # Token budget management with exact tokenizers
│   ├── retriever.py       # Candidate chunk fetching (vector + keyword)
│   ├── reranker.py        # Composite scoring + MMR diversity
│   ├── packer.py          # Greedy knapsack + sandwich ordering
│   ├── verifier.py        # Coverage, consistency, budget checks
│   └── schemas.py         # Data structures for chunks, packets, ledger
├── fixtures/
│   ├── tax_kb/            # Document chunks
│   ├── rerank_config.yaml
│   └── coverage_requirements.yaml
├── out/
└── README.md

5.3 The Core Question You’re Answering

“What evidence should enter the prompt, in what order, and at what token cost, and can I prove why each piece was chosen?”

This question matters because it forces you to build a system that makes context assembly decisions explicit, measurable, and reproducible, rather than relying on ad hoc chunk selection that produces inconsistent quality.

5.4 Concepts You Must Understand First

  1. Token counting and BPE tokenization
    • Why does the same text produce different token counts on different models, and what are the consequences for context packing?
    • Book Reference: Sennrich et al., “Neural Machine Translation of Rare Words with Subword Units” (2016); OpenAI tiktoken documentation
  2. Information retrieval and reranking
    • How do vector similarity search and BM25 keyword search complement each other, and why is hybrid retrieval better than either alone?
    • Book Reference: “Introduction to Information Retrieval” by Manning et al. - Ch. 6-8
  3. Context window attention patterns
    • What is the “lost in the middle” phenomenon, and how does chunk ordering affect model response quality?
    • Book Reference: Liu et al., “Lost in the Middle” (2023); “AI Engineering” by Chip Huyen

5.5 Questions to Guide Your Design

  1. Token budget architecture
    • How do you allocate the context window across fixed sections (system, preamble, history) and variable sections (evidence)?
    • What happens when conversation history grows and squeezes the evidence budget?
    • How do you handle multi-model pipelines where different models have different tokenizers?
  2. Retrieval and reranking strategy
    • What is the right over-retrieval ratio for your candidate pool?
    • How do you weight the 5 reranking signals for different query types?
    • How does MMR lambda affect the relevance-diversity tradeoff?
  3. Packing and verification
    • How do you handle chunks too large for the remaining budget?
    • What coverage requirements are mandatory vs. nice-to-have?
    • How do you make the packing algorithm deterministic for reproducibility?

5.6 Thinking Exercise

Context Assembly Design Analysis

Before implementing, work through this detailed scenario:

You have a 4096-token context window (a small model). The system prompt uses 640 tokens, the safety preamble uses 120 tokens, and you need to reserve 512 tokens for output. There is no conversation history.

  1. Calculate the available evidence budget.
  2. Given 8 candidate chunks with these token counts and relevance scores: [(500, 0.95), (800, 0.93), (300, 0.91), (600, 0.89), (400, 0.87), (700, 0.85), (200, 0.82), (450, 0.79)], run the greedy packing algorithm and show which chunks are included.
  3. Apply sandwich ordering to the packed set.
  4. Now imagine one chunk is mandatory (a 200-token regulatory disclaimer). Re-run the packing with the mandatory chunk pre-allocated.
  5. Compute the coverage score assuming 3 required topics: the mandatory chunk covers topic A, chunk 1 covers topic B, and no packed chunk covers topic C.

Questions to answer:

  • What is the utilization percentage in each scenario?
  • How does mandatory chunk allocation change the final packed set?
  • What action should the system take when coverage score is below 1.0?
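
A minimal Python sketch can check your hand calculations for steps 1-2 above; the constants and chunk list mirror the scenario numbers, and the helper name is illustrative:

```python
# Greedy packing sketch for the thinking exercise above.
# Budget: 4096 window - 640 system - 120 preamble - 512 output reservation.
WINDOW, SYSTEM, PREAMBLE, OUTPUT_RES = 4096, 640, 120, 512
evidence_budget = WINDOW - SYSTEM - PREAMBLE - OUTPUT_RES  # 2824

# (token_count, relevance) pairs from the exercise, already relevance-sorted.
candidates = [(500, 0.95), (800, 0.93), (300, 0.91), (600, 0.89),
              (400, 0.87), (700, 0.85), (200, 0.82), (450, 0.79)]

def greedy_pack(chunks, budget):
    """Skip-on-overflow greedy packing: take each chunk in rank order
    if it fits in the remaining budget, otherwise skip it and continue."""
    packed, remaining = [], budget
    for tokens, score in chunks:
        if tokens <= remaining:
            packed.append((tokens, score))
            remaining -= tokens
    return packed, remaining

packed, remaining = greedy_pack(candidates, evidence_budget)
used = evidence_budget - remaining
print(f"packed {len(packed)} chunks, {used}/{evidence_budget} tokens")
# packed 6 chunks, 2800/2824 tokens: the 700-token chunk is skipped,
# but the 200-token chunk after it still fits.
```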

5.7 The Interview Questions They’ll Ask

  1. “How do you trade off recall versus token budget in RAG prompts? When is it better to retrieve fewer, higher-quality chunks?”
  2. “Why must reranking include trust and freshness, not only relevance? Give a concrete example where relevance-only ranking produces dangerous results.”
  3. “How do you make context packing deterministic and reproducible across runs?”
  4. “What would you log to debug a bad context packet that led to a hallucinated answer?”
  5. “How do you decide what to drop when budget is exceeded - and what is the risk of each dropping strategy?”
  6. “Explain the difference between context engineering and prompt engineering. Why has the industry shifted focus?”

5.8 Hints in Layers

Hint 1: Budget the fixed parts first. Before touching retrieval or packing, build the token ledger. Compute exact token counts for the system prompt, safety preamble, and any mandatory chunks using the provider-exact tokenizer. Subtract these from the total window along with the output reservation. The remaining number is your evidence budget. If this number is <= 0, the system prompt is too large for the model’s context window.
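
The fixed-part budgeting might be sketched as follows. `TokenLedger` here is a hypothetical minimal version, and the whitespace counter is only a stand-in for the provider-exact tokenizer (e.g. tiktoken for GPT models):

```python
# Minimal token-ledger sketch (assumed names). The default whitespace
# counter is a STAND-IN; production code must use the provider-exact
# tokenizer (e.g. tiktoken) or budget math will be wrong by 10-30%.
class TokenLedger:
    def __init__(self, window_size, count_tokens=lambda text: len(text.split())):
        self.window_size = window_size
        self.count_tokens = count_tokens
        self.sections = {}

    def allocate(self, name, text=None, tokens=None):
        # Reservations (e.g. output) pass a token count; text sections are counted.
        self.sections[name] = tokens if tokens is not None else self.count_tokens(text)

    def available(self):
        return self.window_size - sum(self.sections.values())

ledger = TokenLedger(window_size=4096)
ledger.allocate("system", text="You are a careful tax assistant. Cite sources.")
ledger.allocate("output_reservation", tokens=512)
budget = ledger.available()
if budget <= 0:
    raise ValueError("system prompt too large for this context window")
```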

Hint 2: Separate retrieval quality from packing quality. Build two independent evaluation metrics: retrieval recall (what percentage of relevant chunks appear in the candidate pool?) and packing utilization (what percentage of the evidence budget is filled with useful content?). Low recall means your retriever needs work; low utilization means your packer is leaving budget on the table. Debugging one problem at a time is far easier than debugging the whole pipeline.

Hint 3: Log dropped evidence with reasons. For every candidate chunk that is NOT included in the final context packet, record why: “budget exceeded” (chunk did not fit), “below relevance threshold”, “duplicate of chunk X”, “untrusted source”, “stale timestamp”, or “not selected by MMR diversity.” This log is the single most valuable debugging artifact. When the model hallucinates, check this log to see if the correct information was excluded and why.

Hint 4: Use stable sort keys for deterministic packing. When two chunks have identical composite scores, the sort order must be deterministic. Use a multi-key tie-breaker: chunk_id (lexicographic) -> source_id -> position_in_source. Without stable tie-breaking, the same query can produce different context packets on different runs, making quality issues unreproducible.
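
The multi-key tie-breaker can be expressed directly as a sort key; field names and values here are illustrative:

```python
# Deterministic ranking sketch: identical composite scores fall back to
# stable, content-based tie-break keys, never insertion order or randomness.
chunks = [
    {"chunk_id": "b2", "source_id": "irs-pub17", "position": 4, "score": 0.91},
    {"chunk_id": "a1", "source_id": "irs-pub17", "position": 2, "score": 0.91},
    {"chunk_id": "c3", "source_id": "state-faq", "position": 1, "score": 0.95},
]

ranked = sorted(
    chunks,
    key=lambda c: (-c["score"], c["chunk_id"], c["source_id"], c["position"]),
)
# c3 first (highest score); the 0.91 tie is broken lexicographically by chunk_id.
print([c["chunk_id"] for c in ranked])  # ['c3', 'a1', 'b2']
```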

Pseudocode for context packing:

FUNCTION pack_context(query, config, kb):
    # Step 1: Budget
    ledger = TokenLedger(model=config.model)
    ledger.allocate("system", config.system_prompt)
    ledger.allocate("preamble", config.safety_preamble)
    ledger.allocate("output_res", config.output_tokens)
    evidence_budget = ledger.available()
    IF evidence_budget <= 0:
        FAIL "fixed sections and output reservation exceed the context window"

    # Step 2: Retrieve
    candidates = retriever.search(query, kb, limit=config.retrieval_depth)
    FOR EACH chunk IN candidates:
        chunk.token_count = ledger.tokenizer.count(chunk.text)

    # Step 3: Rerank
    ranked = reranker.score(candidates, query, config.weights)
    ranked = mmr_select(ranked, mmr_lambda=config.mmr_lambda, k=config.top_k)

    # Step 4: Pack (mandatory chunks first, then greedy skip-on-overflow)
    packed = []
    excluded = []
    remaining = evidence_budget
    FOR EACH mandatory IN config.mandatory_chunks:
        remaining -= mandatory.token_count
        packed.append(mandatory, reason="mandatory")
    IF remaining < 0:
        FAIL "mandatory chunks alone exceed the evidence budget"

    FOR EACH chunk IN ranked:
        IF chunk.token_count <= remaining:
            packed.append(chunk, reason="ranked_selection")
            remaining -= chunk.token_count
        ELSE:
            excluded.append(chunk, reason="budget_exceeded")

    # Step 5: Reorder (sandwich)
    packed = sandwich_order(packed)

    # Step 6: Verify
    coverage = check_coverage(packed, config.required_topics)
    consistency = check_consistency(packed)
    budget_ok = (remaining >= 0)

    RETURN ContextPacket(ledger, packed, excluded, coverage, consistency, budget_ok)
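
The `mmr_select` step above is not spelled out; one possible sketch follows, using token-set Jaccard overlap as a stand-in for embedding cosine similarity (which a production reranker would use instead):

```python
def jaccard(a, b):
    """Stand-in similarity: token-set overlap between two texts.
    Production systems would typically use embedding cosine similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mmr_select(ranked, mmr_lambda, k):
    """Maximal Marginal Relevance: greedily pick the chunk maximizing
    lambda * relevance - (1 - lambda) * max_similarity_to_selected.
    Items are (text, relevance) pairs."""
    selected, pool = [], list(ranked)
    while pool and len(selected) < k:
        best = max(
            pool,
            key=lambda c: mmr_lambda * c[1]
            - (1 - mmr_lambda)
            * max((jaccard(c[0], s[0]) for s in selected), default=0.0),
        )
        selected.append(best)
        pool.remove(best)
    return selected

docs = [("standard deduction amounts for 2023", 0.95),
        ("standard deduction amounts for 2023 filers", 0.94),
        ("itemized deduction rules and limits", 0.80)]
# The near-duplicate 0.94 chunk is penalized; the diverse 0.80 chunk wins.
print([d[1] for d in mmr_select(docs, mmr_lambda=0.5, k=2)])  # [0.95, 0.8]
```

Lowering `mmr_lambda` pushes selection further toward diversity; raising it toward pure relevance.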

5.9 Books That Will Help

| Topic | Book | Chapter |
|-------|------|---------|
| Information retrieval fundamentals | “Introduction to Information Retrieval” by Manning et al. | Ch. 6-8 |
| Data pipeline reliability | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 2-3 |
| Operational tracing | “Site Reliability Engineering” by Google | Ch. 6 |
| AI evaluation and serving | “AI Engineering” by Chip Huyen | RAG and evaluation chapters |

5.10 Implementation Phases

Phase 1: Foundation (Ledger + Retriever)

  • Implement the token ledger with exact tokenizer integration (tiktoken for GPT models).
  • Build a simple retriever that loads chunks from JSONL fixtures and returns a candidate pool.
  • Implement exact token counting for each chunk.
  • Checkpoint: Token ledger correctly computes available evidence budget. Retriever returns candidate pool with accurate token counts.

Phase 2: Core Pipeline (Reranker + Packer)

  • Implement composite reranking with configurable weights and MMR diversity selection.
  • Implement greedy knapsack packing with mandatory chunk allocation and skip-on-overflow logic.
  • Implement sandwich ordering.
  • Add explainability fields to every chunk (included_reason / excluded_reason).
  • Checkpoint: End-to-end run produces deterministic context packet. Same query + seed = identical output.
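
Sandwich ordering from Phase 2 might be implemented as follows, assuming the packed list arrives sorted best-first:

```python
def sandwich_order(packed):
    """Place the strongest chunks at the edges of the context and the
    weakest in the middle, mitigating 'lost in the middle'. Assumes
    `packed` is already sorted best-first."""
    front, back = [], []
    for i, chunk in enumerate(packed):
        # Alternate: even ranks fill from the front, odd ranks from the back.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranks 1..5 by quality: best (1) leads, runner-up (2) closes, worst (5) is buried.
print(sandwich_order([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
```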

Phase 3: Verification and Reporting

  • Implement coverage verification (required topics check).
  • Implement consistency verification (factual conflict detection - at minimum, flag duplicate source IDs).
  • Add budget compliance and source diversity checks.
  • Produce context packet JSON and explainability CSV.
  • Add exit code logic (0 = pass, 1 = coverage warning, 2 = budget error).
  • Checkpoint: All verification checks run. Coverage warnings and budget errors produce correct exit codes. Team member can reproduce from clean checkout.
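
The exit-code logic from Phase 3 could be as small as this sketch (function and parameter names are illustrative):

```python
# Exit codes per the checkpoint above: 0 = pass, 1 = coverage warning,
# 2 = budget error. Budget violations outrank coverage warnings.
def exit_code(coverage_score, budget_ok):
    if not budget_ok:
        return 2  # hard failure: packet exceeds the token budget
    if coverage_score < 1.0:
        return 1  # soft failure: a required topic is missing
    return 0

# At the end of the CLI entry point: sys.exit(exit_code(coverage, budget_ok))
```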

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Tokenizer | Approximate (chars/4) vs exact (tiktoken) | Exact provider tokenizer | Approximate is off by 10-30%, causing overflow or waste |
| Retrieval strategy | Vector-only vs keyword-only vs hybrid | Hybrid (vector + BM25) | Hybrid captures both semantic and keyword matches |
| Packing algorithm | Sort-and-truncate vs greedy knapsack | Greedy knapsack with skip | Sort-and-truncate wastes budget when a large chunk does not fit but smaller ones would |
| Ordering strategy | Relevance order vs sandwich | Sandwich ordering | Mitigates “lost in the middle” by placing best evidence at edges |
| Reproducibility | Seeded randomness vs deterministic tie-breaking | Deterministic tie-breaking by chunk_id | Avoids any randomness, ensuring identical packets across runs |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Validate tokenizer accuracy, budget computation, scoring formulas | Token count matches tiktoken exactly; budget ledger math is correct |
| Integration Tests | Verify end-to-end pipeline from query to context packet | Golden-path query produces expected packed chunks and coverage score |
| Regression Tests | Detect quality changes across config updates | Changing rerank weights produces expected score changes; no silent degradation |
| Edge Case Tests | Ensure robust handling of boundary conditions | Budget too small, empty corpus, all chunks from single source, conflicting chunks |

6.2 Critical Test Cases

  1. Golden-path query produces context packet matching expected fixture (same chunks, same order, same scores).
  2. Token count for a known text matches the exact tokenizer output (not approximate).
  3. Mandatory chunk is always included even when it has a lower relevance score than other candidates.
  4. Chunk larger than remaining budget is correctly skipped, and the next smaller chunk is packed instead.
  5. Coverage check correctly identifies missing required topics and reports the right coverage score.
  6. Same query + config + seed produces identical context packet across runs (determinism).
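
For test case 6, one practical determinism check is to fingerprint a canonical serialization of the packet and compare hashes across runs; a sketch with hypothetical packet dicts:

```python
import json
import hashlib

def packet_fingerprint(packet):
    """Canonical fingerprint of a context packet: stable key order and
    no whitespace variance, so two runs are identical iff hashes match."""
    canonical = json.dumps(packet, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

run_a = {"chunks": ["a1", "b2"], "coverage": 0.67}
run_b = {"coverage": 0.67, "chunks": ["a1", "b2"]}  # same content, new key order
assert packet_fingerprint(run_a) == packet_fingerprint(run_b)
```

Storing the golden fixture's fingerprint alongside the fixture makes the regression test a one-line comparison.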

6.3 Test Data

fixtures/tax_kb/chunks.jsonl          # 50 document chunks with metadata
fixtures/golden_query.json             # Query + expected context packet
fixtures/edge_cases/tiny_budget.json   # Budget smaller than mandatory chunks
fixtures/edge_cases/empty_corpus.json  # No matching chunks
fixtures/edge_cases/single_source.json # All chunks from one source

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution |
|---------|---------|----------|
| “Great relevance but too many hallucinations” | Context omitted critical grounding details; high relevance but low coverage. | Add required-coverage rules for key topics. Check coverage score in every run. |
| “Token overruns in production” | API returns 400 error or silently truncates. | Use provider-exact tokenizer for budget accounting. Never approximate. Test with actual API calls. |
| “Outputs vary between runs” | Different chunks packed for identical queries. | Add deterministic tie-break keys (chunk_id, source_id). Remove any randomness from scoring. |
| “Packed lots of context but answers are worse” | “Lost in the middle” effect - critical evidence buried in center. | Implement sandwich ordering. Verify that highest-value chunks are at edges. |
| “Prefix caching not activating” | Token costs higher than expected despite cacheable prefix. | Ensure system prompt bytes are identical across requests. No timestamps, no request IDs in the prefix. |

7.2 Debugging Strategies

  • Compare token counts between your ledger and the API’s usage response to detect tokenizer mismatches.
  • Inspect the explainability CSV to determine why the correct chunk was excluded.
  • Run with a minimal corpus (3-5 chunks) to isolate reranking and packing logic separately.
  • Diff two context packets for the same query to find non-deterministic components.

7.3 Performance Traps

  • Re-tokenizing the same chunk multiple times during scoring and packing. Cache token counts after first computation.
  • Loading the full corpus into memory when only a subset is needed. Use lazy loading with the retriever’s pre-filtering.
  • Computing cross-encoder reranking for all 42 candidates when only top-12 are needed. Pre-filter with a fast bi-encoder before running the expensive cross-encoder.
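
The first trap above can be avoided with a memoized counter; the whitespace count below is only a stand-in for the provider-exact tokenizer:

```python
from functools import lru_cache

# Cache token counts so each chunk is tokenized once, not once per
# scoring and packing pass.
@lru_cache(maxsize=None)
def cached_token_count(text: str) -> int:
    # STAND-IN: replace with the exact tokenizer (e.g. tiktoken) in production.
    return len(text.split())

cached_token_count("the standard deduction for 2023")
cached_token_count("the standard deduction for 2023")  # served from cache
print(cached_token_count.cache_info().hits)  # 1
```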

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a second knowledge base (e.g., medical_kb) and support switching via CLI flag.
  • Add a simple HTML report showing the context packet with color-coded included/excluded chunks.

8.2 Intermediate Extensions

  • Implement adaptive budget allocation that gives more evidence space to complex queries and less to simple ones.
  • Add conversation history management with a sliding window and summarization for multi-turn support.

8.3 Advanced Extensions

  • Implement prefix caching optimization that tracks cache hit rates and suggests system prompt stabilization improvements.
  • Build a multi-model pipeline that selects the optimal model based on query complexity and adjusts tokenizer and budget accordingly.
  • Integrate with Project 12 (Memory Compressor) for conversation history compression within the token ledger.

9. Real-World Connections

9.1 Industry Applications

  • Enterprise RAG platforms (Azure AI Search, Amazon Bedrock, LangChain) all implement context assembly pipelines similar to this project’s architecture.
  • Anthropic has explicitly framed context engineering as the evolution of prompt engineering, emphasizing that what fills the context window matters more than how you phrase the instructions.
  • AI coding assistants dynamically assemble context from relevant code files, documentation, and conversation history within token budgets.

9.2 Tools & Frameworks

  • LangChain / LlamaIndex - RAG orchestration frameworks with retriever-reranker-packer patterns.
  • Cohere Rerank API - production reranking service for RAG pipelines.
  • tiktoken (OpenAI) - exact tokenizer library for GPT models.
  • MTEB (Massive Text Embedding Benchmark) - evaluation framework for retrieval quality.

9.3 Interview Relevance

  • Demonstrates understanding of context engineering, which is rapidly becoming a key competency for AI engineers.
  • Shows ability to build deterministic, reproducible systems around probabilistic AI components.
  • Proves practical knowledge of token economics, prefix caching, and cost optimization.

10. Resources

10.1 Essential Reading

  • Anthropic documentation on context engineering and prompt caching (2025).
  • OpenAI tiktoken library and tokenizer cookbook.
  • Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023).

10.2 Video Resources

  • Talks on RAG optimization from AI Engineer Summit and LangChain conferences.
  • Anthropic and OpenAI blog posts on context engineering best practices.

10.3 Tools & Documentation

  • tiktoken (Python) for GPT token counting.
  • SentencePiece for LLaMA / open-source model tokenization.
  • FAISS / ChromaDB for vector similarity search in the retriever.

10.4 Related Projects

  • Project 1 (Prompt Contract Harness): provides the contract validation foundation for context packet schemas.
  • Project 5 (Few-Shot Example Curator): applies similar selection logic to examples rather than evidence chunks.
  • Project 9 (Prompt Caching Optimizer): extends prefix caching optimization from this project.
  • Project 12 (Conversation Memory Compressor): manages the conversation history section of the token ledger.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why approximate token counting causes production failures and why exact tokenizers are necessary.
  • I can describe the 4-stage context assembly pipeline (retrieve, rerank, pack, verify) and what each stage contributes.
  • I can explain the “lost in the middle” phenomenon and how sandwich ordering mitigates it.
  • I can describe context engineering vs. prompt engineering and why the distinction matters.

11.2 Implementation

  • My token ledger uses the exact provider tokenizer and correctly computes available evidence budget.
  • My reranker scores on 5 signals with configurable weights and applies MMR diversity selection.
  • My packer produces deterministic context packets with stable tie-breaking and sandwich ordering.
  • Every chunk has an explainability field (included_reason or excluded_reason).
  • Coverage and consistency checks run on every context packet.

11.3 Growth

  • I can describe the relevance-diversity tradeoff (MMR lambda) and how I tuned it.
  • I can explain my token budget allocation strategy and how it changes for different query types.
  • I can explain this system design in an interview setting with concrete examples and metrics.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Token ledger with exact tokenizer integration.
  • Retriever + reranker + packer producing deterministic context packets.
  • Explainability fields on every chunk (included/excluded with reasons).
  • Deterministic and reproducible results from clean checkout.

Full Completion:

  • Coverage verification with required topics check.
  • Consistency verification detecting factual conflicts.
  • Budget utilization metrics and source diversity checks.
  • CLI with exit code logic (0 = pass, 1 = warning, 2 = error).
  • Explainability CSV export.

Excellence (Above & Beyond):

  • Sandwich ordering with empirical evaluation of attention impact.
  • Prefix caching optimization tracking cache hit rates.
  • Multi-model support with tokenizer switching.
  • Integration with Projects 5, 9, or 12.