Project 19: Cost-Latency Optimization Router (Performance Engineering)
Build a routing and optimization layer that controls token budgets, context compression, retrieval depth, and model selection under cost/latency constraints.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 20-35 hours |
| Main Programming Language | TypeScript |
| Alternative Programming Languages | Python, Rust |
| Coolness Level | Level 4: Hardcore Tech Flex |
| Business Potential | 4. The “Open Core” Infrastructure |
| Prerequisites | observability basics, caching fundamentals, queue/concurrency understanding |
| Key Topics | token budgets, compression, caching, routing, hybrid local/cloud inference |
1. Learning Objectives
- Model and monitor per-task cost and latency.
- Implement dynamic token budgeting and context compression.
- Build a difficulty-aware model router.
- Add safe caching and parallel tool execution.
- Optimize without crossing quality guardrails.
2. Theoretical Foundation
2.1 Performance as a Multi-Objective Problem
Assistant performance is not one metric. You optimize latency, cost, and quality simultaneously. Improvements in one dimension often degrade another. A router must encode policy trade-offs explicitly and adjust per task class.
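One way to make the trade-off explicit is a single policy score with a hard quality floor, so no cost or latency win can outvote a quality regression. The weights, interfaces, and floor below are illustrative assumptions, not part of the project spec:

```typescript
// Sketch: encode the latency/cost/quality trade-off as one policy score.
// Weights and the quality floor are illustrative, not prescribed.
interface TaskMetrics {
  latencyMs: number; // observed end-to-end latency
  costUsd: number;   // per-task cost
  quality: number;   // 0..1 task success score
}

interface Policy {
  wLatency: number;     // penalty weight per second of latency
  wCost: number;        // penalty weight per dollar
  qualityFloor: number; // hard guardrail: below this, reject outright
}

// Higher is better; returns -Infinity when the quality floor is violated,
// so a quality regression can never be "bought back" by cheaper routing.
function policyScore(m: TaskMetrics, p: Policy): number {
  if (m.quality < p.qualityFloor) return -Infinity;
  return m.quality - p.wLatency * (m.latencyMs / 1000) - p.wCost * m.costUsd;
}
```

A router can then compare the optimized path against the baseline per task class and accept a policy change only when the score improves without tripping the floor.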
2.2 Routing and Compression Risks
Compression can remove crucial details. Cheap model routing can increase error rates. Caches can become stale or leak context. Each optimization requires explicit correctness checks.
3. Project Specification
3.1 What You Will Build
A performance control plane with:
- policy router
- token budget manager
- context compressor
- retrieval tuner
- caching layer
- quality canary checker
3.2 Functional Requirements
- Route requests by difficulty and risk profile.
- Enforce per-task token ceilings.
- Compress context when above budget.
- Run parallel tool calls where dependency graph allows.
- Compare optimized path versus quality baseline.
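The first requirement, routing by difficulty and risk, can start as a simple heuristic. The feature names and thresholds below are assumptions for illustration; a learned classifier can replace the score later:

```typescript
// Sketch of difficulty/risk routing. Features and thresholds are assumed,
// not specified by the project.
type ModelPath = "local_small" | "cloud_reasoning";

interface RequestFeatures {
  promptTokens: number; // size of the incoming context
  toolSteps: number;    // expected tool-call depth
  highRisk: boolean;    // e.g. billing or account changes
}

function route(f: RequestFeatures): ModelPath {
  // High-risk tasks always escalate, regardless of estimated difficulty.
  if (f.highRisk) return "cloud_reasoning";
  // Crude difficulty score: long contexts and deep tool chains go to cloud.
  const difficulty = f.promptTokens / 4000 + f.toolSteps / 3;
  return difficulty >= 1 ? "cloud_reasoning" : "local_small";
}
```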
3.3 Non-Functional Requirements
- Transparency: show routing reasons and policy version.
- Safety: quality floor checks on optimized outputs.
- Efficiency: measurable cost and latency gains.
3.4 Real World Outcome
```
$ routerctl run --workload support_mix_v3 --policy adaptive
[Budget] token_cap=6400
[Routing] local_small=61% cloud_reasoning=39%
[Latency] p95=3.4s (baseline 5.1s)
[Cost] avg=$0.018 (baseline $0.029)
[Quality] success=0.87 (baseline 0.86)
[Decision] ACCEPT
```
4. Solution Architecture
4.1 High-Level Design
```
Request -> Classifier -> Router -> {Local Model | Cloud Model}
               |            |                |
               |            v                v
               |      Budget Engine     Cache/Tools
               |_____________________________|
                              |
                       Quality Canary
```
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Router | choose model path | task complexity heuristics |
| Budget engine | token constraints | hard caps + fallback policy |
| Compressor | shrink context | salience-aware summarization |
| Canary checker | protect quality | baseline comparison thresholds |
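The budget engine's "hard caps + fallback policy" decision can be sketched as a single enforcement step: pass if under the cap, compress if over, and reject rather than silently exceed the cap when compression is not enough. The compressor here is a stand-in for the real salience-aware summarizer:

```typescript
// Sketch of the budget engine's hard-cap + fallback policy.
// The compress callback stands in for salience-aware summarization.
interface BudgetResult {
  tokens: number;
  action: "pass" | "compressed" | "rejected";
}

function enforceBudget(
  contextTokens: number,
  cap: number,
  compress: (tokens: number) => number, // returns post-compression size
): BudgetResult {
  if (contextTokens <= cap) return { tokens: contextTokens, action: "pass" };
  const compressed = compress(contextTokens);
  if (compressed <= cap) return { tokens: compressed, action: "compressed" };
  // Fallback policy: never silently exceed the hard cap.
  return { tokens: compressed, action: "rejected" };
}
```

A rejected request can then escalate to a different retrieval depth or model path instead of shipping a truncated context.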
5. Implementation Guide
5.1 The Core Question You’re Answering
“How do I reduce cost and latency without silently degrading assistant quality?”
5.2 Concepts You Must Understand First
- Latency percentile metrics
- Cost-per-task accounting
- Cache validity and key design
- Routing policy evaluation
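Latency percentile metrics are worth computing yourself at least once. A minimal nearest-rank sketch, enough to reproduce a p95 figure from raw samples:

```typescript
// Sketch: latency percentile from raw samples (nearest-rank method).
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Nearest-rank: smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}
```

Note that p95 is far more sensitive to a handful of slow outliers than the mean, which is why the project reports it.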
5.3 Questions to Guide Your Design
- What quality floor is non-negotiable?
- Which features predict hard tasks?
- When is cache reuse safe?
5.4 Thinking Exercise
Build a routing decision table for extraction tasks, reasoning tasks, and multi-step tool workflows.
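One way to make the exercise concrete is a declarative table keyed by task class, which also keeps routing policy auditable. The classes and defaults below are illustrative starting points, not recommended values:

```typescript
// Illustrative routing decision table; values are assumptions for the exercise.
type TaskClass = "extraction" | "reasoning" | "tool_workflow";

interface RoutePolicy {
  defaultPath: "local_small" | "cloud_reasoning";
  tokenCap: number;         // per-task token ceiling
  escalateOnRetry: boolean; // re-route to cloud after a local failure
}

const routingTable: Record<TaskClass, RoutePolicy> = {
  extraction:    { defaultPath: "local_small",     tokenCap: 2000, escalateOnRetry: true },
  reasoning:     { defaultPath: "cloud_reasoning", tokenCap: 6400, escalateOnRetry: false },
  tool_workflow: { defaultPath: "cloud_reasoning", tokenCap: 6400, escalateOnRetry: false },
};
```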
5.5 The Interview Questions They’ll Ask
- Which signals drive model routing?
- How do you validate context compression quality?
- How do you avoid dangerous cache hits?
- When does parallelism hurt reliability?
- How do you communicate optimization trade-offs to product teams?
5.6 Hints in Layers
Hint 1: add observability spans first.
Hint 2: enforce budget caps before optimization tuning.
Hint 3: use a canary set to validate every routing policy change.
Hint 4: separate cache layers by determinism and personalization.
5.7 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Throughput and scaling | “Designing Data-Intensive Applications” | Ch. 7-8 |
| LLM serving trade-offs | “AI Engineering” | deployment chapters |
| Practical optimization | “Code Complete” | optimization sections |
5.8 Common Pitfalls and Debugging
Problem 1: quality regressions after aggressive routing
- Why: cheap model overused.
- Fix: difficulty thresholds + spot-check escalations.
- Quick test: canary quality remains above set floor.
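The quick test amounts to one check: run the canary set through the optimized path and require its success rate to stay at or above the configured floor. A minimal sketch, with the floor as an assumed parameter:

```typescript
// Sketch of the canary quick test: success rate must meet the quality floor.
function canaryPasses(results: boolean[], floor: number): boolean {
  if (results.length === 0) throw new Error("empty canary set");
  const success = results.filter(Boolean).length / results.length;
  return success >= floor;
}
```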
Problem 2: stale personalized cache
- Why: cache key misses identity/context dimensions.
- Fix: include tenant/user scope and context hash.
- Quick test: cross-user cache contamination test.
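The fix can be sketched as a cache key scoped by tenant, user, and a hash of the context, so personalized entries cannot collide across users. This uses Node's built-in `crypto` module; the key layout is an assumption:

```typescript
// Sketch of the fix: scope cache keys by tenant, user, and a context hash.
import { createHash } from "node:crypto";

function cacheKey(tenant: string, userId: string, context: string): string {
  const contextHash = createHash("sha256").update(context).digest("hex").slice(0, 16);
  return `${tenant}:${userId}:${contextHash}`;
}
```

The cross-user contamination test then reduces to asserting that identical contexts under different users still produce distinct keys.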
5.9 Definition of Done
- Cost and latency improvements are measurable
- Quality protection canaries prevent silent degradation
- Routing policy versions are explicit and auditable
- Hybrid local/cloud fallback path is stable