Project 19: Cost-Latency Optimization Router (Performance Engineering)
Build a routing and optimization layer that controls token budgets, context compression, retrieval depth, and model selection under cost/latency constraints.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 20-35 hours |
| Main Programming Language | TypeScript |
| Alternative Programming Languages | Python, Rust |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 4. The “Open Core” Infrastructure |
| Prerequisites | observability basics, caching fundamentals, queue/concurrency understanding |
| Key Topics | token budgets, compression, caching, routing, hybrid local/cloud inference |
1. Learning Objectives
- Model and monitor per-task cost and latency.
- Implement dynamic token budgeting and context compression.
- Build a difficulty-aware model router.
- Add safe caching and parallel tool execution.
- Optimize without crossing quality guardrails.
2. Theoretical Foundation
2.1 Performance as a Multi-Objective Problem
Assistant performance is not one metric. You optimize latency, cost, and quality simultaneously. Improvements in one dimension often degrade another. A router must encode policy trade-offs explicitly and adjust per task class.
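One way to make the trade-off explicit is a single policy score with a hard quality floor, so no cost or latency win can outvote a quality regression. The weights, interfaces, and floor below are illustrative assumptions, not part of the project spec:

```typescript
// Sketch: encode the latency/cost/quality trade-off as one policy score.
// Weights and the quality floor are illustrative, not prescribed.
interface TaskMetrics {
  latencyMs: number; // observed end-to-end latency
  costUsd: number;   // per-task cost
  quality: number;   // 0..1 task success score
}

interface Policy {
  wLatency: number;     // penalty weight per second of latency
  wCost: number;        // penalty weight per dollar
  qualityFloor: number; // hard guardrail: below this, reject outright
}

// Higher is better; returns -Infinity when the quality floor is violated,
// so a quality regression can never be "bought back" by cheaper routing.
function policyScore(m: TaskMetrics, p: Policy): number {
  if (m.quality < p.qualityFloor) return -Infinity;
  return m.quality - p.wLatency * (m.latencyMs / 1000) - p.wCost * m.costUsd;
}
```

A router can then compare the optimized path against the baseline per task class and accept a policy change only when the score improves without tripping the floor.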
2.2 Routing and Compression Risks
Compression can remove crucial details. Cheap model routing can increase error rates. Caches can become stale or leak context. Each optimization requires explicit correctness checks.
3. Project Specification
3.1 What You Will Build
A performance control plane with:
- policy router
- token budget manager
- context compressor
- retrieval tuner
- caching layer
- quality canary checker
3.2 Functional Requirements
- Route requests by difficulty and risk profile.
- Enforce per-task token ceilings.
- Compress context when above budget.
- Run parallel tool calls where dependency graph allows.
- Compare optimized path versus quality baseline.
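The first requirement, routing by difficulty and risk, can start as a simple heuristic. The feature names and thresholds below are assumptions for illustration; a learned classifier can replace the score later:

```typescript
// Sketch of difficulty/risk routing. Features and thresholds are assumed,
// not specified by the project.
type ModelPath = "local_small" | "cloud_reasoning";

interface RequestFeatures {
  promptTokens: number; // size of the incoming context
  toolSteps: number;    // expected tool-call depth
  highRisk: boolean;    // e.g. billing or account changes
}

function route(f: RequestFeatures): ModelPath {
  // High-risk tasks always escalate, regardless of estimated difficulty.
  if (f.highRisk) return "cloud_reasoning";
  // Crude difficulty score: long contexts and deep tool chains go to cloud.
  const difficulty = f.promptTokens / 4000 + f.toolSteps / 3;
  return difficulty >= 1 ? "cloud_reasoning" : "local_small";
}
```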
3.3 Non-Functional Requirements
- Transparency: show routing reasons and policy version.
- Safety: quality floor checks on optimized outputs.
- Efficiency: measurable cost and latency gains.
3.4 Real World Outcome
```
$ routerctl run --workload support_mix_v3 --policy adaptive
[Budget] token_cap=6400
[Routing] local_small=61% cloud_reasoning=39%
[Latency] p95=3.4s (baseline 5.1s)
[Cost] avg=$0.018 (baseline $0.029)
[Quality] success=0.87 (baseline 0.86)
[Decision] ACCEPT
```
4. Solution Architecture
4.1 High-Level Design
```
Request -> Classifier -> Router -> {Local Model | Cloud Model}
               |            |                |
               |            v                v
               |      Budget Engine     Cache/Tools
               |_____________________________|
                              |
                       Quality Canary
```
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Router | choose model path | task complexity heuristics |
| Budget engine | token constraints | hard caps + fallback policy |
| Compressor | shrink context | salience-aware summarization |
| Canary checker | protect quality | baseline comparison thresholds |
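The budget engine's "hard caps + fallback policy" decision can be sketched as a single enforcement step: pass if under the cap, compress if over, and reject rather than silently exceed the cap when compression is not enough. The compressor here is a stand-in for the real salience-aware summarizer:

```typescript
// Sketch of the budget engine's hard-cap + fallback policy.
// The compress callback stands in for salience-aware summarization.
interface BudgetResult {
  tokens: number;
  action: "pass" | "compressed" | "rejected";
}

function enforceBudget(
  contextTokens: number,
  cap: number,
  compress: (tokens: number) => number, // returns post-compression size
): BudgetResult {
  if (contextTokens <= cap) return { tokens: contextTokens, action: "pass" };
  const compressed = compress(contextTokens);
  if (compressed <= cap) return { tokens: compressed, action: "compressed" };
  // Fallback policy: never silently exceed the hard cap.
  return { tokens: compressed, action: "rejected" };
}
```

A rejected request can then escalate to a different retrieval depth or model path instead of shipping a truncated context.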
5. Implementation Guide
5.1 The Core Question You’re Answering
“How do I reduce cost and latency without silently degrading assistant quality?”
5.2 Concepts You Must Understand First
- Latency percentile metrics
- Cost-per-task accounting
- Cache validity and key design
- Routing policy evaluation
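Latency percentile metrics are worth computing yourself at least once. A minimal nearest-rank sketch, enough to reproduce a p95 figure from raw samples:

```typescript
// Sketch: latency percentile from raw samples (nearest-rank method).
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Nearest-rank: smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}
```

Note that p95 is far more sensitive to a handful of slow outliers than the mean, which is why the project reports it.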
5.3 Questions to Guide Your Design
- What quality floor is non-negotiable?
- Which features predict hard tasks?
- When is cache reuse safe?
5.4 Thinking Exercise
Build a routing decision table for extraction tasks, reasoning tasks, and multi-step tool workflows.
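One way to make the exercise concrete is a declarative table keyed by task class, which also keeps routing policy auditable. The classes and defaults below are illustrative starting points, not recommended values:

```typescript
// Illustrative routing decision table; values are assumptions for the exercise.
type TaskClass = "extraction" | "reasoning" | "tool_workflow";

interface RoutePolicy {
  defaultPath: "local_small" | "cloud_reasoning";
  tokenCap: number;         // per-task token ceiling
  escalateOnRetry: boolean; // re-route to cloud after a local failure
}

const routingTable: Record<TaskClass, RoutePolicy> = {
  extraction:    { defaultPath: "local_small",     tokenCap: 2000, escalateOnRetry: true },
  reasoning:     { defaultPath: "cloud_reasoning", tokenCap: 6400, escalateOnRetry: false },
  tool_workflow: { defaultPath: "cloud_reasoning", tokenCap: 6400, escalateOnRetry: false },
};
```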
5.5 The Interview Questions They’ll Ask
- Which signals drive model routing?
- How do you validate context compression quality?
- How do you avoid dangerous cache hits?
- When does parallelism hurt reliability?
- How do you communicate optimization trade-offs to product teams?
5.6 Hints in Layers
Hint 1: add observability spans first.
Hint 2: enforce budget caps before optimization tuning.
Hint 3: use a canary set to validate every routing policy change.
Hint 4: separate cache layers by determinism and personalization.
5.7 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Throughput and scaling | “Designing Data-Intensive Applications” | Ch. 7-8 |
| LLM serving trade-offs | “AI Engineering” | deployment chapters |
| Practical optimization | “Code Complete” | optimization sections |
5.8 Common Pitfalls and Debugging
Problem 1: quality regressions after aggressive routing
- Why: cheap model overused.
- Fix: difficulty thresholds + spot-check escalations.
- Quick test: canary quality remains above set floor.
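The quick test amounts to one check: run the canary set through the optimized path and require its success rate to stay at or above the configured floor. A minimal sketch, with the floor as an assumed parameter:

```typescript
// Sketch of the canary quick test: success rate must meet the quality floor.
function canaryPasses(results: boolean[], floor: number): boolean {
  if (results.length === 0) throw new Error("empty canary set");
  const success = results.filter(Boolean).length / results.length;
  return success >= floor;
}
```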
Problem 2: stale personalized cache
- Why: cache key misses identity/context dimensions.
- Fix: include tenant/user scope and context hash.
- Quick test: cross-user cache contamination test.
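The fix can be sketched as a cache key scoped by tenant, user, and a hash of the context, so personalized entries cannot collide across users. This uses Node's built-in `crypto` module; the key layout is an assumption:

```typescript
// Sketch of the fix: scope cache keys by tenant, user, and a context hash.
import { createHash } from "node:crypto";

function cacheKey(tenant: string, userId: string, context: string): string {
  const contextHash = createHash("sha256").update(context).digest("hex").slice(0, 16);
  return `${tenant}:${userId}:${contextHash}`;
}
```

The cross-user contamination test then reduces to asserting that identical contexts under different users still produce distinct keys.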
5.9 Definition of Done
- Cost and latency improvements are measurable
- Quality protection canaries prevent silent degradation
- Routing policy versions are explicit and auditable
- Hybrid local/cloud fallback path is stable