Project 1: Token Window Visualizer
Build a deterministic context packing auditor that shows exactly how prompt segments consume token budget and what gets dropped.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | 4-8 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Go |
| Coolness Level | Level 3 |
| Business Potential | Level 2 |
| Prerequisites | Tokenization basics, JSON handling, CLI fundamentals |
| Key Topics | context windows, token budgeting, truncation policy |
1. Learning Objectives
By completing this project, you will:
- Build exact token accounting by prompt segment.
- Enforce hard output reservation before request assembly.
- Implement deterministic overflow handling with trace logs.
- Compare truncation policies and explain trade-offs.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Token Budgeting as a Memory Contract
Fundamentals
Token budgeting is the first memory system in any LLM application. It defines how much of your system prompt, user request, retrieved documents, and conversation history can enter a single model call. If the budget is wrong, every downstream memory pattern fails, because the model simply never sees critical information. A correct implementation is deterministic, tokenizer-aware, and explicit about what is mandatory versus optional.
Deep Dive into the Concept
Most teams discover token budgeting only after errors appear in production. They notice inconsistent answers, missing constraints, or sudden cost spikes. The root cause is usually unmanaged context assembly, not model capability. A model call is a constrained packing problem: fixed capacity, multiple competing inputs, and non-negotiable constraints. You need clear segment categories, for example: policy/system, user intent, retrieval evidence, short-term history, and optional tool traces. Each category gets a quota and a priority.
The first non-negotiable design choice is output reservation. If you fill the full context with input, the model has no room to generate a complete answer. This creates clipped responses or hard errors. The second design choice is deterministic overflow policy. If your overflow behavior depends on arbitrary list order or timing, two identical queries can produce different contexts, making debugging impossible.
A robust strategy uses a strict sequence: count, reserve, rank, pack, compress, fail closed. Counting must use the exact tokenizer for the target model, because tokenization varies across models. Ranking should encode utility, not only recency. For example, system safety constraints are typically mandatory, while older conversational chatter is optional. Packing adds segments in priority order until the budget is full. Compression is then applied to lower-priority segments, often by summary or pruning. If mandatory segments still do not fit, the request should fail with a clear error, not proceed silently.
Token budgets also drive cost and latency. More tokens generally increase processing time and billing. Good systems track budget telemetry: overflow rate, average unused space, compression frequency, and per-segment retention rate. These metrics let you tune chunk sizes and retrieval top-k. Without telemetry, teams often overcorrect by shrinking everything, which can hurt answer quality.
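As a sketch, the telemetry metrics named above could be computed from per-request pack traces like this (the trace field names are assumptions for illustration, not part of the spec):

```python
# Hypothetical per-request pack traces; field names are illustrative only.
traces = [
    {"overflowed": True,  "unused": 0,   "compressed": 2, "kept": 4, "candidates": 8},
    {"overflowed": False, "unused": 900, "compressed": 0, "kept": 6, "candidates": 6},
    {"overflowed": True,  "unused": 40,  "compressed": 1, "kept": 5, "candidates": 9},
]

n = len(traces)
overflow_rate = sum(t["overflowed"] for t in traces) / n
avg_unused = sum(t["unused"] for t in traces) / n
compression_frequency = sum(t["compressed"] > 0 for t in traces) / n
# Per-segment retention: kept segments over all candidate segments.
retention_rate = sum(t["kept"] for t in traces) / sum(t["candidates"] for t in traces)
```

Tracking these four numbers over time is what lets you tune chunk sizes and retrieval top-k with evidence instead of guesswork.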
Finally, budgeting is a policy interface between retrieval and generation. Retrieval may return 20 candidates, but budgeting decides which 4 are worth carrying into the expensive generation stage. This is why token budgeting belongs in architecture reviews and reliability testing, not as an afterthought utility function.
How this fits into the projects
Every later project in this series assembles its model calls through this budgeting layer, so the invariants you establish here carry forward.
Definitions & key terms
- Budget: total allowed input tokens after output reservation.
- Mandatory segment: segment that cannot be dropped.
- Overflow: candidate input exceeds budget.
- Compression policy: method to reduce optional segments.
Mental model diagram (ASCII)
Capacity = 8192
Reserve output = 1024
Usable input = 7168
[system 600][user 240][retrieved 4800][history 1800][tool 500]
total input candidate = 7940 -> overflow 772
apply policy:
1) compress history by 500
2) drop lowest-score retrieval by 272
=> final input 7168
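The diagram's arithmetic can be checked directly (a minimal sketch using the numbers above):

```python
# Budget arithmetic from the diagram above.
capacity = 8192
reserve_output = 1024
usable_input = capacity - reserve_output  # 7168: output space is reserved first

segments = {"system": 600, "user": 240, "retrieved": 4800,
            "history": 1800, "tool": 500}
candidate = sum(segments.values())           # 7940 candidate input tokens
overflow = max(0, candidate - usable_input)  # 772 tokens over budget

# Policy: compress history by 500, then drop lowest-score retrieval by 272.
final_input = candidate - 500 - 272
assert final_input == usable_input  # exactly fills the usable input budget
```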
How it works (step-by-step)
- Tokenize each segment with model tokenizer.
- Reserve output budget first.
- Sort segments by policy priority and relevance.
- Add mandatory segments.
- Add optional segments while budget remains.
- Compress/drop low-priority segments when needed.
- Emit trace report.
Invariants:
- Mandatory segments always retained.
- Final input tokens never exceed budget.
- Every dropped segment has an explicit reason.
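The steps and invariants above can be sketched as one pure function (a minimal sketch: token counts are precomputed, compression is omitted, and the ranking key is an assumption):

```python
def pack(segments, capacity, reserve_output):
    """Deterministically pack segments into (capacity - reserve_output) tokens.

    Each segment is a dict: {"id", "tokens", "priority", "mandatory"}.
    Returns (included_ids, dropped) where dropped maps id -> reason.
    """
    budget = capacity - reserve_output  # reserve output space first
    mandatory = [s for s in segments if s["mandatory"]]
    optional = [s for s in segments if not s["mandatory"]]

    # Fail closed: if mandatory segments do not fit, reject the request.
    used = sum(s["tokens"] for s in mandatory)
    if used > budget:
        raise ValueError(
            f"mandatory segments exceed usable input by {used - budget} tokens")

    included = [s["id"] for s in mandatory]
    dropped = {}
    # Deterministic ranking: priority descending, id ascending as a stable tie-break.
    for s in sorted(optional, key=lambda s: (-s["priority"], s["id"])):
        if used + s["tokens"] <= budget:
            used += s["tokens"]
            included.append(s["id"])
        else:
            dropped[s["id"]] = "over budget"  # every drop gets an explicit reason
    assert used <= budget  # invariant: final input never exceeds budget
    return included, dropped
```

Note how each invariant maps to a line: the budget subtraction, the `ValueError`, the reason string, and the final assertion.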
Failure modes:
- Tokenizer mismatch.
- Non-deterministic ordering.
- Compression destroying key constraints.
Minimal concrete example
usable input budget: 5376 (context 6144 - reserve 768)
segments:
- system: 600 (mandatory)
- user: 220 (mandatory)
- retrieval_top8: 5100 (optional)
- history_recent: 1900 (optional)
policy output:
- keep system and user (820 mandatory tokens)
- summarize history_recent -> 600
- drop the two lowest-score retrieval chunks: retrieval_top8 -> retrieval_top6 (5100 -> 3830)
- final_input = 600 + 220 + 3830 + 600 = 5250 <= 5376
Common misconceptions
- “Words are close enough to tokens.”
- “Oldest-first trimming is always safe.”
- “If request succeeds, context policy is fine.”
Check-your-understanding questions
- Why reserve output before adding input segments?
- What makes truncation deterministic?
- Why should drop decisions be logged?
Check-your-understanding answers
- To guarantee generation space.
- Fixed ranking rules and fixed tokenizer behavior.
- For reproducibility and debugging.
Real-world applications
- Chatbots with strict latency/cost limits.
- RAG assistants with must-keep policy instructions.
Where you’ll apply it
In every subsequent project that packs retrieval results, conversation history, and tool traces into a single model call.
References
- OpenAI models docs: https://platform.openai.com/docs/models
- Anthropic models docs: https://docs.anthropic.com/en/docs/about-claude/models/all-models
Key insights
Token budgeting is a reliability control, not just a utility function.
Summary
You are implementing a deterministic memory boundary that all later projects depend on.
Homework/Exercises to practice the concept
- Simulate three overflow scenarios with different segment priorities.
- Compare oldest-first vs utility-based trimming on the same transcript.
Solutions to the homework/exercises
- Utility-aware trimming should preserve constraints better.
- Oldest-first often drops still-relevant requirements.
3. Project Specification
3.1 What You Will Build
A CLI tool that ingests structured prompt segments and outputs:
- token usage by segment,
- overflow diagnosis,
- deterministic keep/drop/compress decisions,
- final packed context summary.
3.2 Functional Requirements
- Tokenize each segment with a selected model tokenizer.
- Enforce output token reservation.
- Apply at least two truncation strategies.
- Emit a deterministic trace JSON.
- Provide human-readable CLI report.
3.3 Non-Functional Requirements
- Performance: Complete analysis in under 200ms for 200 segments.
- Reliability: Same input must produce identical output.
- Usability: Error messages must identify exact overflow cause.
3.4 Example Usage / Output
$ llm-memory token-audit --input fixtures/session_a.json --context 8192 --reserve-output 1024
[INFO] usable_input=7168
[WARN] overflow=772
[ACTION] compressed history_recent by 500
[ACTION] dropped retrieval_chunk_08 by 272
[RESULT] status=OK final_input=7168
3.5 Data Formats / Schemas / Protocols
segment_schema:
- id: string
- type: system|user|retrieval|history|tool
- content: string
- priority: int
- mandatory: bool
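A single entry conforming to segment_schema might look like this in an input fixture (the field values are illustrative):

```python
import json

# Illustrative fixture entry conforming to segment_schema above.
segment = {
    "id": "retrieval_chunk_01",
    "type": "retrieval",
    "content": "Refund policy: purchases can be returned within 30 days.",
    "priority": 5,
    "mandatory": False,
}
print(json.dumps(segment, indent=2))
```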
3.6 Edge Cases
- Empty segments list.
- Mandatory segments exceed budget.
- Tokenizer unavailable for selected model.
- Non-UTF8 input payload.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
$ llm-memory token-audit --input fixtures/support_chat.json --context 8192 --reserve-output 1024
$ llm-memory token-audit --input fixtures/support_chat.json --context 4096 --reserve-output 768
3.7.2 Golden Path Demo (Deterministic)
$ llm-memory token-audit --input fixtures/golden.json --context 8192 --reserve-output 1024
[RESULT] status=OK final_input=7168 dropped=2 compressed=1
exit_code=0
3.7.3 Failure Demo (Deterministic)
$ llm-memory token-audit --input fixtures/mandatory_overflow.json --context 2048 --reserve-output 512
[ERROR] mandatory segments exceed usable input by 340 tokens
exit_code=2
4. Solution Architecture
4.1 High-Level Design
Input JSON -> Token Counter -> Policy Ranker -> Context Packer -> Trace Reporter
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Token Counter | Compute per-segment tokens | Use exact model tokenizer |
| Policy Ranker | Order optional segments | Utility score + recency |
| Context Packer | Enforce hard budget | Fail closed on mandatory overflow |
| Trace Reporter | Explain decisions | Deterministic, machine-readable |
4.3 Data Structures (No Full Code)
Segment{id,type,tokens,priority,mandatory}
PackResult{included_ids,dropped_ids,compressed_ids,final_tokens}
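In Python, the two records above might be sketched as frozen dataclasses (field names follow section 4.3; the concrete types are assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    id: str
    type: str        # one of: system|user|retrieval|history|tool
    tokens: int      # precomputed token count for this segment
    priority: int
    mandatory: bool

@dataclass(frozen=True)
class PackResult:
    included_ids: tuple   # tuples keep the result hashable and immutable
    dropped_ids: tuple
    compressed_ids: tuple
    final_tokens: int
```

Freezing both records helps determinism: a pack result cannot be mutated after the trace is emitted.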
4.4 Algorithm Overview
- Compute usable input tokens.
- Pack mandatory segments.
- Greedily add optional segments by rank.
- Apply compression when needed.
- Emit final report.
Complexity:
- Time: O(n log n) for sorting segments.
- Space: O(n).
5. Implementation Guide
5.1 Development Environment Setup
# One possible setup (assumes Python 3 and the tiktoken tokenizer library):
$ python -m venv .venv && source .venv/bin/activate
$ pip install tiktoken
$ llm-memory token-audit --input fixtures/session_a.json --context 8192 --reserve-output 1024
5.2 Project Structure
p01-token-window-visualizer/
src/
cli
tokenizer_adapter
policy_engine
reporter
fixtures/
tests/
5.3 The Core Question You’re Answering
“How can I make context assembly deterministic, safe, and observable under hard token limits?”
5.4 Concepts You Must Understand First
- Tokenizer behavior by model.
- Mandatory vs optional segment semantics.
- Deterministic policy ordering.
5.5 Questions to Guide Your Design
- Which segments are legally/safety mandatory?
- Which optional segments provide highest utility per token?
5.6 Thinking Exercise
Manually pack one over-budget request and justify each inclusion/exclusion decision.
5.7 The Interview Questions They’ll Ask
- How do you prevent silent truncation?
- What is your fail-closed rule?
- How do you validate token counts?
- Which metrics indicate budget health?
- How do you explain a dropped segment to another engineer?
5.8 Hints in Layers
- Hint 1: Build exact token accounting first.
- Hint 2: Add output reservation as a hard constraint.
- Hint 3: Add deterministic ranking.
- Hint 4: Emit machine-readable traces.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Constraints and architecture | Fundamentals of Software Architecture | Quality attributes |
| Search heuristics | Algorithms, Fourth Edition | Greedy/search intuition |
5.10 Implementation Phases
- Phase 1: counting + reporting.
- Phase 2: policy engine + overflow handling.
- Phase 3: fixtures + deterministic tests.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Overflow mode | silent trim / explicit error | explicit error for mandatory overflow | safer behavior |
| Policy ordering | recency-only / utility score | utility + recency tie-break | preserves constraints |
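The recommended ordering can be made deterministic with a composite sort key: utility descending, recency as the tie-break, and segment id as the final stable key (field names here are assumptions):

```python
# Deterministic ordering for optional segments. Python's sort is stable, and
# the trailing id key guarantees identical inputs always rank identically.
candidates = [
    {"id": "r2", "utility": 0.8, "turn": 5},
    {"id": "r1", "utility": 0.8, "turn": 7},
    {"id": "h1", "utility": 0.4, "turn": 9},
]
ranked = sorted(candidates, key=lambda s: (-s["utility"], -s["turn"], s["id"]))
print([s["id"] for s in ranked])  # -> ['r1', 'r2', 'h1']: newer r1 wins the utility tie
```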
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | counting and packing correctness | fixed fixtures |
| Integration | full CLI workflow | JSON in, report out |
| Edge Cases | hard failures | mandatory overflow |
6.2 Critical Test Cases
- Exact-match fixture with known token totals.
- Overflow where optional compression succeeds.
- Mandatory overflow that must fail.
6.3 Test Data
Use versioned fixtures with checksum-verified expected reports.
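Checksum verification can be sketched with the standard library's `hashlib` (the report content below is a hypothetical stand-in for a fixture file on disk):

```python
import hashlib
import json

# Hypothetical expected report; in practice this is read from a versioned fixture.
expected_report = json.dumps({"status": "OK", "final_input": 7168}, sort_keys=True)
checksum = hashlib.sha256(expected_report.encode("utf-8")).hexdigest()

def verify(report_text: str, expected_sha256: str) -> bool:
    """Reject a fixture whose expected report was edited without updating its checksum."""
    return hashlib.sha256(report_text.encode("utf-8")).hexdigest() == expected_sha256

assert verify(expected_report, checksum)
```

Serializing with `sort_keys=True` matters: without a canonical key order, two semantically identical reports can hash differently.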
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Tokenizer mismatch | count drift | pin tokenizer/model pair |
| Non-deterministic sorting | flaky outputs | stable sorting keys |
| Missing trace logs | hard debugging | require trace output in CI |
7.2 Debugging Strategies
- Replay the same fixture across versions.
- Diff trace JSON, not only final token count.
7.3 Performance Traps
Repeated tokenization of identical segments without cache.
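One way to avoid this trap is to memoize counts by exact content, e.g. with `functools.lru_cache` (a sketch; the counting heuristic below is a stand-in for a real tokenizer call and must not be used for actual budgeting):

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer call (e.g. via tiktoken). The ~4 chars
    # per token heuristic is illustrative only; real counting must use the
    # target model's tokenizer.
    return max(1, len(text) // 4)

count_tokens("You are a helpful support agent.")  # computed once
count_tokens("You are a helpful support agent.")  # served from the cache
```

This pays off most for segments that recur across requests, such as the system prompt, which would otherwise be re-tokenized on every call.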
8. Extensions & Challenges
8.1 Beginner Extensions
- Add HTML report output.
- Add segment colorization in CLI.
8.2 Intermediate Extensions
- Add compression quality score.
- Add policy simulator for multiple model windows.
8.3 Advanced Extensions
- Add adaptive budgeting from observed answer length.
- Add policy A/B testing mode.
9. Real-World Connections
9.1 Industry Applications
- Prompt orchestration services.
- Customer support context management.
9.2 Related Open Source Projects
- tiktoken: tokenizer tooling.
- promptfoo: evaluation-driven prompt testing.
9.3 Interview Relevance
This project gives concrete stories for context-limit debugging and safe truncation design.
10. Resources
10.1 Essential Reading
- OpenAI model docs (context/token limits).
- Anthropic model docs (context comparison).
10.2 Video Resources
- Long-context engineering talks from major LLM conferences.
10.3 Tools & Documentation
- Tokenizer playgrounds and model docs.
10.4 Related Projects in This Series
- Next: Project 2
11. Self-Assessment Checklist
11.1 Understanding
- I can explain token budgeting invariants.
- I can justify deterministic truncation rules.
11.2 Implementation
- CLI output is deterministic.
- Overflow behavior is explicit and testable.
11.3 Growth
- I can explain trade-offs in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- deterministic token accounting + overflow report
Full Completion:
- two truncation policies + trace JSON + edge-case tests
Excellence (Going Above & Beyond):
- adaptive budgeting and policy comparison dashboard