Project 1: Token Window Visualizer
Build a deterministic context packing auditor that shows exactly how prompt segments consume token budget and what gets dropped.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | 4-8 hours |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Go |
| Coolness Level | Level 3 |
| Business Potential | Level 2 |
| Prerequisites | Tokenization basics, JSON handling, CLI fundamentals |
| Key Topics | context windows, token budgeting, truncation policy |
1. Learning Objectives
By completing this project, you will:
- Build exact token accounting by prompt segment.
- Enforce hard output reservation before request assembly.
- Implement deterministic overflow handling with trace logs.
- Compare truncation policies and explain trade-offs.
2. All Theory Needed (Per-Concept Breakdown)
2.1 Token Budgeting as a Memory Contract
Fundamentals
Token budgeting is the first memory system in any LLM application. It defines how much of your system prompt, user request, retrieved documents, and conversation history can enter a single model call. If the budget is wrong, every downstream memory pattern fails, because the model simply never sees critical information. A correct implementation is deterministic, tokenizer-aware, and explicit about what is mandatory versus optional.
Deep Dive into the Concept
Most teams discover token budgeting only after errors appear in production. They notice inconsistent answers, missing constraints, or sudden cost spikes. The root cause is usually unmanaged context assembly, not model capability. A model call is a constrained packing problem: fixed capacity, multiple competing inputs, and non-negotiable constraints. You need clear segment categories, for example: policy/system, user intent, retrieval evidence, short-term history, and optional tool traces. Each category gets a quota and a priority.
The first non-negotiable design choice is output reservation. If you fill the full context with input, the model has no room to generate a complete answer. This creates clipped responses or hard errors. The second design choice is deterministic overflow policy. If your overflow behavior depends on arbitrary list order or timing, two identical queries can produce different contexts, making debugging impossible.
A robust strategy uses a strict sequence: count, reserve, rank, pack, compress, fail closed. Counting must use the exact tokenizer for the target model, because tokenization varies across models. Ranking should encode utility, not only recency. For example, system safety constraints are typically mandatory, while older conversational chatter is optional. Packing adds segments in priority order until the budget is full. Compression is then applied to lower-priority segments, often by summary or pruning. If mandatory segments still do not fit, the request should fail with a clear error, not proceed silently.
Token budgets also drive cost and latency. More tokens generally increase processing time and billing. Good systems track budget telemetry: overflow rate, average unused space, compression frequency, and per-segment retention rate. These metrics let you tune chunk sizes and retrieval top-k. Without telemetry, teams often overcorrect by shrinking everything, which can hurt answer quality.
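As a sketch, the telemetry metrics named above could be computed from per-request pack traces like this (the trace field names are assumptions for illustration, not part of the spec):

```python
# Hypothetical per-request pack traces; field names are illustrative only.
traces = [
    {"overflowed": True,  "unused": 0,   "compressed": 2, "kept": 4, "candidates": 8},
    {"overflowed": False, "unused": 900, "compressed": 0, "kept": 6, "candidates": 6},
    {"overflowed": True,  "unused": 40,  "compressed": 1, "kept": 5, "candidates": 9},
]

n = len(traces)
overflow_rate = sum(t["overflowed"] for t in traces) / n
avg_unused = sum(t["unused"] for t in traces) / n
compression_frequency = sum(t["compressed"] > 0 for t in traces) / n
# Per-segment retention: kept segments over all candidate segments.
retention_rate = sum(t["kept"] for t in traces) / sum(t["candidates"] for t in traces)
```

Tracking these four numbers over time is what lets you tune chunk sizes and retrieval top-k with evidence instead of guesswork.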
Finally, budgeting is a policy interface between retrieval and generation. Retrieval may return 20 candidates, but budgeting decides which 4 are worth carrying into the expensive generation stage. This is why token budgeting belongs in architecture reviews and reliability testing, not as an afterthought utility function.
How this fits into the projects
Every later project in this series assembles its model calls through this budgeting layer, so the invariants you establish here carry forward.
Definitions & key terms
- Budget: total allowed input tokens after output reservation.
- Mandatory segment: segment that cannot be dropped.
- Overflow: candidate input exceeds budget.
- Compression policy: method to reduce optional segments.
Mental model diagram (ASCII)
Capacity = 8192
Reserve output = 1024
Usable input = 7168
[system 600][user 240][retrieved 4800][history 1800][tool 500]
total input candidate = 7940 -> overflow 772
apply policy:
1) compress history by 500
2) drop lowest-score retrieval by 272
=> final input 7168
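The diagram's arithmetic can be checked directly (a minimal sketch using the numbers above):

```python
# Budget arithmetic from the diagram above.
capacity = 8192
reserve_output = 1024
usable_input = capacity - reserve_output  # 7168: output space is reserved first

segments = {"system": 600, "user": 240, "retrieved": 4800,
            "history": 1800, "tool": 500}
candidate = sum(segments.values())           # 7940 candidate input tokens
overflow = max(0, candidate - usable_input)  # 772 tokens over budget

# Policy: compress history by 500, then drop lowest-score retrieval by 272.
final_input = candidate - 500 - 272
assert final_input == usable_input  # exactly fills the usable input budget
```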
How it works (step-by-step)
- Tokenize each segment with model tokenizer.
- Reserve output budget first.
- Sort segments by policy priority and relevance.
- Add mandatory segments.
- Add optional segments while budget remains.
- Compress/drop low-priority segments when needed.
- Emit trace report.
Invariants:
- Mandatory segments always retained.
- Final input tokens never exceed budget.
- Every dropped segment has an explicit reason.
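The steps and invariants above can be sketched as one pure function (a minimal sketch: token counts are precomputed, compression is omitted, and the ranking key is an assumption):

```python
def pack(segments, capacity, reserve_output):
    """Deterministically pack segments into (capacity - reserve_output) tokens.

    Each segment is a dict: {"id", "tokens", "priority", "mandatory"}.
    Returns (included_ids, dropped) where dropped maps id -> reason.
    """
    budget = capacity - reserve_output  # reserve output space first
    mandatory = [s for s in segments if s["mandatory"]]
    optional = [s for s in segments if not s["mandatory"]]

    # Fail closed: if mandatory segments do not fit, reject the request.
    used = sum(s["tokens"] for s in mandatory)
    if used > budget:
        raise ValueError(
            f"mandatory segments exceed usable input by {used - budget} tokens")

    included = [s["id"] for s in mandatory]
    dropped = {}
    # Deterministic ranking: priority descending, id ascending as a stable tie-break.
    for s in sorted(optional, key=lambda s: (-s["priority"], s["id"])):
        if used + s["tokens"] <= budget:
            used += s["tokens"]
            included.append(s["id"])
        else:
            dropped[s["id"]] = "over budget"  # every drop gets an explicit reason
    assert used <= budget  # invariant: final input never exceeds budget
    return included, dropped
```

Note how each invariant maps to a line: the budget subtraction, the `ValueError`, the reason string, and the final assertion.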
Failure modes:
- Tokenizer mismatch.
- Non-deterministic ordering.
- Compression destroying key constraints.
Minimal concrete example
usable input budget: 5376 (context 6144 - reserve 768)
segments:
- system: 600 (mandatory)
- user: 220 (mandatory)
- retrieval_top8: 5100 (optional)
- history_recent: 1900 (optional)
policy output:
- keep system and user (820 mandatory tokens)
- summarize history_recent -> 600
- drop the two lowest-score retrieval chunks: retrieval_top8 -> retrieval_top6 (5100 -> 3830)
- final_input = 600 + 220 + 3830 + 600 = 5250 <= 5376
Common misconceptions
- “Words are close enough to tokens.”
- “Oldest-first trimming is always safe.”
- “If request succeeds, context policy is fine.”
Check-your-understanding questions
- Why reserve output before adding input segments?
- What makes truncation deterministic?
- Why should drop decisions be logged?
Check-your-understanding answers
- To guarantee generation space.
- Fixed ranking rules and fixed tokenizer behavior.
- For reproducibility and debugging.
Real-world applications
- Chatbots with strict latency/cost limits.
- RAG assistants with must-keep policy instructions.
Where you’ll apply it
In every subsequent project that packs retrieval results, conversation history, and tool traces into a single model call.
References
- OpenAI models docs: https://platform.openai.com/docs/models
- Anthropic models docs: https://docs.anthropic.com/en/docs/about-claude/models/all-models
Key insights
Token budgeting is a reliability control, not just a utility function.
Summary
You are implementing a deterministic memory boundary that all later projects depend on.
Homework/Exercises to practice the concept
- Simulate three overflow scenarios with different segment priorities.
- Compare oldest-first vs utility-based trimming on the same transcript.
Solutions to the homework/exercises
- Utility-aware trimming should preserve constraints better.
- Oldest-first often drops still-relevant requirements.
3. Project Specification
3.1 What You Will Build
A CLI tool that ingests structured prompt segments and outputs:
- token usage by segment,
- overflow diagnosis,
- deterministic keep/drop/compress decisions,
- final packed context summary.
3.2 Functional Requirements
- Tokenize each segment with a selected model tokenizer.
- Enforce output token reservation.
- Apply at least two truncation strategies.
- Emit a deterministic trace JSON.
- Provide human-readable CLI report.
3.3 Non-Functional Requirements
- Performance: Complete analysis in under 200ms for 200 segments.
- Reliability: Same input must produce identical output.
- Usability: Error messages must identify exact overflow cause.
3.4 Example Usage / Output
$ llm-memory token-audit --input fixtures/session_a.json --context 8192 --reserve-output 1024
[INFO] usable_input=7168
[WARN] overflow=772
[ACTION] compressed history_recent by 500
[ACTION] dropped retrieval_chunk_08 by 272
[RESULT] status=OK final_input=7168
3.5 Data Formats / Schemas / Protocols
segment_schema:
- id: string
- type: system|user|retrieval|history|tool
- content: string
- priority: int
- mandatory: bool
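A single entry conforming to segment_schema might look like this in an input fixture (the field values are illustrative):

```python
import json

# Illustrative fixture entry conforming to segment_schema above.
segment = {
    "id": "retrieval_chunk_01",
    "type": "retrieval",
    "content": "Refund policy: purchases can be returned within 30 days.",
    "priority": 5,
    "mandatory": False,
}
print(json.dumps(segment, indent=2))
```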
3.6 Edge Cases
- Empty segments list.
- Mandatory segments exceed budget.
- Tokenizer unavailable for selected model.
- Non-UTF8 input payload.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
$ llm-memory token-audit --input fixtures/support_chat.json --context 8192 --reserve-output 1024
$ llm-memory token-audit --input fixtures/support_chat.json --context 4096 --reserve-output 768
3.7.2 Golden Path Demo (Deterministic)
$ llm-memory token-audit --input fixtures/golden.json --context 8192 --reserve-output 1024
[RESULT] status=OK final_input=7168 dropped=2 compressed=1
exit_code=0
3.7.3 Failure Demo (Deterministic)
$ llm-memory token-audit --input fixtures/mandatory_overflow.json --context 2048 --reserve-output 512
[ERROR] mandatory segments exceed usable input by 340 tokens
exit_code=2
4. Solution Architecture
4.1 High-Level Design
Input JSON -> Token Counter -> Policy Ranker -> Context Packer -> Trace Reporter
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Token Counter | Compute per-segment tokens | Use exact model tokenizer |
| Policy Ranker | Order optional segments | Utility score + recency |
| Context Packer | Enforce hard budget | Fail closed on mandatory overflow |
| Trace Reporter | Explain decisions | Deterministic, machine-readable |
4.3 Data Structures (No Full Code)
Segment{id,type,tokens,priority,mandatory}
PackResult{included_ids,dropped_ids,compressed_ids,final_tokens}
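In Python, the two records above might be sketched as frozen dataclasses (field names follow section 4.3; the concrete types are assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    id: str
    type: str        # one of: system|user|retrieval|history|tool
    tokens: int      # precomputed token count for this segment
    priority: int
    mandatory: bool

@dataclass(frozen=True)
class PackResult:
    included_ids: tuple   # tuples keep the result hashable and immutable
    dropped_ids: tuple
    compressed_ids: tuple
    final_tokens: int
```

Freezing both records helps determinism: a pack result cannot be mutated after the trace is emitted.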
4.4 Algorithm Overview
- Compute usable input tokens.
- Pack mandatory segments.
- Greedily add optional segments by rank.
- Apply compression when needed.
- Emit final report.
Complexity:
- Time: O(n log n) for sorting segments.
- Space: O(n).
5. Implementation Guide
5.1 Development Environment Setup
# One possible setup (assumes Python 3 and the tiktoken tokenizer library):
$ python -m venv .venv && source .venv/bin/activate
$ pip install tiktoken
$ llm-memory token-audit --input fixtures/session_a.json --context 8192 --reserve-output 1024
5.2 Project Structure
p01-token-window-visualizer/
src/
cli
tokenizer_adapter
policy_engine
reporter
fixtures/
tests/
5.3 The Core Question You’re Answering
“How can I make context assembly deterministic, safe, and observable under hard token limits?”
5.4 Concepts You Must Understand First
- Tokenizer behavior by model.
- Mandatory vs optional segment semantics.
- Deterministic policy ordering.
5.5 Questions to Guide Your Design
- Which segments are legally/safety mandatory?
- Which optional segments provide highest utility per token?
5.6 Thinking Exercise
Manually pack one over-budget request and justify each inclusion/exclusion decision.
5.7 The Interview Questions They’ll Ask
- How do you prevent silent truncation?
- What is your fail-closed rule?
- How do you validate token counts?
- Which metrics indicate budget health?
- How do you explain a dropped segment to another engineer?
5.8 Hints in Layers
- Hint 1: Build exact token accounting first.
- Hint 2: Add output reservation as a hard constraint.
- Hint 3: Add deterministic ranking.
- Hint 4: Emit machine-readable traces.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Constraints and architecture | Fundamentals of Software Architecture | Quality attributes |
| Search heuristics | Algorithms, Fourth Edition | Greedy/search intuition |
5.10 Implementation Phases
- Phase 1: counting + reporting.
- Phase 2: policy engine + overflow handling.
- Phase 3: fixtures + deterministic tests.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Overflow mode | silent trim / explicit error | explicit error for mandatory overflow | safer behavior |
| Policy ordering | recency-only / utility score | utility + recency tie-break | preserves constraints |
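The recommended ordering can be made deterministic with a composite sort key: utility descending, recency as the tie-break, and segment id as the final stable key (field names here are assumptions):

```python
# Deterministic ordering for optional segments. Python's sort is stable, and
# the trailing id key guarantees identical inputs always rank identically.
candidates = [
    {"id": "r2", "utility": 0.8, "turn": 5},
    {"id": "r1", "utility": 0.8, "turn": 7},
    {"id": "h1", "utility": 0.4, "turn": 9},
]
ranked = sorted(candidates, key=lambda s: (-s["utility"], -s["turn"], s["id"]))
print([s["id"] for s in ranked])  # -> ['r1', 'r2', 'h1']: newer r1 wins the utility tie
```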
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | counting and packing correctness | fixed fixtures |
| Integration | full CLI workflow | JSON in, report out |
| Edge Cases | hard failures | mandatory overflow |
6.2 Critical Test Cases
- Exact-match fixture with known token totals.
- Overflow where optional compression succeeds.
- Mandatory overflow that must fail.
6.3 Test Data
Use versioned fixtures with checksum-verified expected reports.
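Checksum verification can be sketched with the standard library's `hashlib` (the report content below is a hypothetical stand-in for a fixture file on disk):

```python
import hashlib
import json

# Hypothetical expected report; in practice this is read from a versioned fixture.
expected_report = json.dumps({"status": "OK", "final_input": 7168}, sort_keys=True)
checksum = hashlib.sha256(expected_report.encode("utf-8")).hexdigest()

def verify(report_text: str, expected_sha256: str) -> bool:
    """Reject a fixture whose expected report was edited without updating its checksum."""
    return hashlib.sha256(report_text.encode("utf-8")).hexdigest() == expected_sha256

assert verify(expected_report, checksum)
```

Serializing with `sort_keys=True` matters: without a canonical key order, two semantically identical reports can hash differently.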
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Tokenizer mismatch | count drift | pin tokenizer/model pair |
| Non-deterministic sorting | flaky outputs | stable sorting keys |
| Missing trace logs | hard debugging | require trace output in CI |
7.2 Debugging Strategies
- Replay the same fixture across versions.
- Diff trace JSON, not only final token count.
7.3 Performance Traps
Repeated tokenization of identical segments without cache.
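One way to avoid this trap is to memoize counts by exact content, e.g. with `functools.lru_cache` (a sketch; the counting heuristic below is a stand-in for a real tokenizer call and must not be used for actual budgeting):

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer call (e.g. via tiktoken). The ~4 chars
    # per token heuristic is illustrative only; real counting must use the
    # target model's tokenizer.
    return max(1, len(text) // 4)

count_tokens("You are a helpful support agent.")  # computed once
count_tokens("You are a helpful support agent.")  # served from the cache
```

This pays off most for segments that recur across requests, such as the system prompt, which would otherwise be re-tokenized on every call.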
8. Extensions & Challenges
8.1 Beginner Extensions
- Add HTML report output.
- Add segment colorization in CLI.
8.2 Intermediate Extensions
- Add compression quality score.
- Add policy simulator for multiple model windows.
8.3 Advanced Extensions
- Add adaptive budgeting from observed answer length.
- Add policy A/B testing mode.
9. Real-World Connections
9.1 Industry Applications
- Prompt orchestration services.
- Customer support context management.
9.2 Related Open Source Projects
- tiktoken: tokenizer tooling.
- promptfoo: evaluation-driven prompt testing.
9.3 Interview Relevance
This project gives concrete stories for context-limit debugging and safe truncation design.
10. Resources
10.1 Essential Reading
- OpenAI model docs (context/token limits).
- Anthropic model docs (context comparison).
10.2 Video Resources
- Long-context engineering talks from major LLM conferences.
10.3 Tools & Documentation
- Tokenizer playgrounds and model docs.
10.4 Related Projects in This Series
- Next: Project 2
11. Self-Assessment Checklist
11.1 Understanding
- I can explain token budgeting invariants.
- I can justify deterministic truncation rules.
11.2 Implementation
- CLI output is deterministic.
- Overflow behavior is explicit and testable.
11.3 Growth
- I can explain trade-offs in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- deterministic token accounting + overflow report
Full Completion:
- two truncation policies + trace JSON + edge-case tests
Excellence (Going Above & Beyond):
- adaptive budgeting and policy comparison dashboard