Project 2: Model Router Analyzer
Understanding the Auto Router: Master cost-effective AI model selection
Project Metadata
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1 Week (15-20 hours) |
| Primary Language | Python |
| Alternative Languages | TypeScript, Go |
| Prerequisites | Project 1, basic Python, data analysis fundamentals |
| Main Reference | "AI Engineering" by Chip Huyen |
Learning Objectives
By completing this project, you will:
- Understand LLM model tiers - the capabilities, costs, and latency characteristics of Haiku, Sonnet, and Opus
- Analyze the Auto router's decision-making - when it escalates to more powerful models and why
- Develop cost optimization intuition - identifying opportunities to reduce AI spending without sacrificing quality
- Build data analysis skills - parsing logs, computing metrics, and generating visualizations
- Create actionable recommendations - translating data into practical workflow improvements
Deep Theoretical Foundation
The Model Selection Problem
When you interact with an AI, a critical decision happens before any response is generated: which model should handle this request? This decision has profound implications:
The model selection tradeoff space (capability vs. cost):

- Haiku (Fast) - low cost, low capability: syntax questions and simple lookups
- Sonnet (Smart) - medium cost, medium capability: general coding and refactoring
- Opus (Deep) - high cost, high capability: complex reasoning, creative and nuanced work

Optimal selection: match task complexity to model capability. Waste: using Opus for syntax questions. Failure: using Haiku for architecture design.
Model Characteristics Deep Dive
| Model | Cost Multiplier | Latency | Best For | Failure Modes |
|---|---|---|---|---|
| Haiku 4.5 | 0.4x | ~200ms | Syntax, simple queries, fast feedback loops | Misses nuance, shallow reasoning |
| Sonnet 4.5 | 1.3x | ~800ms | General coding, refactoring, debugging | Occasionally overthinks simple tasks |
| Opus 4.5 | 2.2x | ~2000ms | Architecture, complex reasoning, legacy code | Overkill for simple tasks, expensive |
How the Auto Router Works
The Auto router is Kiro's intelligent model selector. It analyzes your prompt and routes it to the appropriate model tier:
1. Prompt analysis - features extracted from the user prompt:
   - Token count
   - Question type (syntax? architecture? debug?)
   - Complexity signals (numbers, conditions, dependencies)
   - Historical context (previous turns in the conversation)
   - File context size (more files = more complexity)
2. Complexity scoring - a running score starts at 0:
   - Contains "syntax", "import", "how to" → score += 0
   - Contains "refactor", "debug", "fix" → score += 5
   - Contains "design", "architect", "strategy" → score += 10
   - token_count > 500 → score += 3
   - file_count > 5 → score += 5
   - has_code_block → score += 2
3. Model selection based on the final score:
   - score < 5 → Haiku (fast, cheap, sufficient)
   - score < 12 → Sonnet (balanced capability)
   - otherwise → Opus (maximum reasoning power)
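To make the flow concrete, here is a minimal Python sketch of the scoring heuristic described above. The keyword lists, weights, and thresholds mirror the outline but are illustrative only, not the router's actual implementation.

```python
# Illustrative approximation of the routing flow above; the real Auto router's
# features, weights, and thresholds are not public.
def score_prompt(prompt: str, token_count: int, file_count: int) -> int:
    text = prompt.lower()
    score = 0
    if any(k in text for k in ("refactor", "debug", "fix")):
        score += 5
    if any(k in text for k in ("design", "architect", "strategy")):
        score += 10
    if token_count > 500:
        score += 3
    if file_count > 5:
        score += 5
    if prompt.count("`") >= 6:  # crude stand-in for "contains a fenced code block"
        score += 2
    return score

def route(prompt: str, token_count: int, file_count: int) -> str:
    score = score_prompt(prompt, token_count, file_count)
    if score < 5:
        return "haiku"
    elif score < 12:
        return "sonnet"
    return "opus"
```

With these made-up thresholds, route("fix this import error", token_count=40, file_count=1) scores 5 and lands on Sonnet.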
The Economics of Model Selection
Understanding the cost implications is crucial:
Cost calculation example - 100 queries over a work session, roughly 1,000 tokens each.

Naive approach (always use Sonnet):
- 100 queries × 1,000 tokens × $0.003/1K = $0.30 per session
- Monthly (20 sessions): $6.00

Optimized approach (smart routing):
- 40 simple queries × 1,000 tokens × $0.0012/1K = $0.048
- 50 medium queries × 1,000 tokens × $0.0039/1K = $0.195
- 10 complex queries × 1,000 tokens × $0.0066/1K = $0.066
- Total: $0.309; with 23% better routing, the actual cost drops to $0.238 per session
- Monthly (20 sessions): $4.76, a savings of $1.24/month (about 21%)

For heavy users (100 sessions/month), the same per-session saving is about $6.20/month, or roughly $74/year.
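The same arithmetic as a quick back-of-the-envelope script. The per-1K prices are the figures from the example above, and the 0.77 factor encodes the example's own "23% better routing" assumption:

```python
# Reproduce the cost example above. Prices are the example's per-1K figures,
# not official pricing; 0.77 encodes its "23% better routing" assumption.
PRICE_PER_1K = {"haiku": 0.0012, "sonnet": 0.0039, "opus": 0.0066}
TOKENS_PER_QUERY = 1_000
SESSIONS_PER_MONTH = 20

naive = 100 * TOKENS_PER_QUERY / 1000 * 0.003          # always Sonnet: $0.30 per session
routed = sum(
    n * TOKENS_PER_QUERY / 1000 * PRICE_PER_1K[model]
    for n, model in [(40, "haiku"), (50, "sonnet"), (10, "opus")]
)                                                       # $0.309 before the adjustment
actual = routed * 0.77                                  # ~$0.238 per session
monthly_savings = (naive - actual) * SESSIONS_PER_MONTH  # ~$1.24
print(f"per session: ${actual:.3f}, monthly savings: ${monthly_savings:.2f}")
```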
Historical Context: From Manual to Intelligent Routing
The evolution of model selection:
| Era | Approach | Overhead |
|---|---|---|
| 2023 | Manual API calls with explicit model selection | Developer decides every call |
| 2024 | Simple routing based on token count | Rule-based, crude |
| 2025 | ML-powered routing (Auto router) | Learns from usage patterns |
| Future | Predictive routing with quality feedback loops | Self-optimizing |
The Auto router represents a significant step: it uses a lightweight classifier trained on millions of queries to predict optimal model selection.
Real-World Analogy: The Restaurant Kitchen
Think of model selection like a restaurant kitchen:
- Haiku = Line cook: Fast, efficient, handles simple dishes
- Sonnet = Sous chef: Skilled, handles complex orders
- Opus = Executive chef: Creative, handles VIP requests
You wouldn't have the executive chef make a salad, and you wouldn't have the line cook design the tasting menu. The maître d' (Auto router) routes orders appropriately.
Complete Project Specification
What You Are Building
A Model Usage Analyzer that:
- Logs Model Selections: Captures which model was used for each query
- Classifies Query Complexity: Categorizes queries by type and difficulty
- Calculates Cost Metrics: Computes actual vs. optimal spending
- Identifies Optimization Opportunities: Finds mismatches between task and model
- Generates Recommendations: Provides actionable advice for better routing
Architecture Overview
The analyzer is a pipeline: data collection → log parser → query classifier → cost calculator → recommendation engine → report generator.

- Data Collection - three sources feed the pipeline: Kiro logs ($TMPDIR), the /usage API (credits), and session data from Project 1
- Log Parser - extracts model selection events, parses prompt content, and captures response metadata
- Query Classifier - buckets queries into syntax, debug, and architecture categories:
  - Keywords "import", "syntax", "how to" → SIMPLE
  - Keywords "debug", "fix", "error" → MEDIUM
  - Keywords "design", "architect", "refactor" → COMPLEX
- Cost Calculator:
  - actual_cost = sum(model_cost[m] * tokens[m])
  - optimal_cost = sum(optimal_model_cost[q] * tokens[q])
  - waste = actual_cost - optimal_cost
- Recommendation Engine - turns patterns into advice:
  - IF simple_query AND used_sonnet → "Consider forcing Haiku for syntax queries"
  - IF complex_query AND used_sonnet AND low_quality → "Force Opus for architecture discussions"
- Report Generator - terminal dashboard (rich/matplotlib), JSON export, Markdown report
Expected Deliverables
```
model-analyzer/
├── analyzer/
│   ├── __init__.py
│   ├── log_parser.py        # Parse Kiro logs
│   ├── classifier.py        # Classify query complexity
│   ├── cost_calculator.py   # Calculate costs and savings
│   ├── recommender.py       # Generate recommendations
│   └── reporter.py          # Generate reports
├── cli.py                   # Command-line interface
├── tests/
│   ├── test_classifier.py
│   ├── test_cost_calculator.py
│   └── sample_logs/
├── requirements.txt
└── README.md
```
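To show how the pieces might plug together, here is a minimal sketch of cli.py. The subcommand name and flags are suggestions, not a required interface:

```python
# cli.py -- illustrative skeleton; the subcommand and flags are suggestions only.
import argparse
from pathlib import Path

def main() -> None:
    parser = argparse.ArgumentParser(
        prog="model-analyzer",
        description="Analyze Kiro model usage logs",
    )
    sub = parser.add_subparsers(dest="command", required=True)

    analyze = sub.add_parser("analyze", help="Parse logs and produce a usage report")
    analyze.add_argument("log_dir", type=Path, help="Directory containing Kiro logs")
    analyze.add_argument("--format", choices=["terminal", "json", "markdown"],
                         default="terminal")

    args = parser.parse_args()
    if args.command == "analyze":
        # Wire up: log_parser -> classifier -> cost_calculator -> recommender -> reporter
        print(f"Would analyze {args.log_dir} and emit a {args.format} report")

if __name__ == "__main__":
    main()
```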
Solution Architecture
Data Model
```python
from dataclasses import dataclass
from enum import Enum
from datetime import datetime
from typing import List, Optional

class Model(Enum):
    HAIKU = "haiku"
    SONNET = "sonnet"
    OPUS = "opus"
    AUTO = "auto"

class ComplexityLevel(Enum):
    SIMPLE = 1   # Syntax, imports, how-to
    MEDIUM = 2   # Debugging, fixing, general coding
    COMPLEX = 3  # Architecture, design, refactoring

@dataclass
class Query:
    id: str
    timestamp: datetime
    prompt: str
    model_used: Model
    model_selected_by: str  # "auto" or "manual"
    tokens_input: int
    tokens_output: int
    latency_ms: int
    complexity: Optional[ComplexityLevel] = None
    optimal_model: Optional[Model] = None

@dataclass
class CostAnalysis:
    actual_cost: float
    optimal_cost: float
    waste: float
    savings_percentage: float
    recommendations: List[str]

@dataclass
class UsageReport:
    period_start: datetime
    period_end: datetime
    total_queries: int
    model_distribution: dict[Model, int]
    cost_analysis: CostAnalysis
    misrouted_queries: List[Query]
```
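A quick sanity check that the data model hangs together; the values below are made up, mirroring the sample data used later in this project:

```python
# Example instance of the data model above; all values are made up.
from datetime import datetime

q = Query(
    id="q-001",
    timestamp=datetime(2025, 12, 22, 10, 0, 0),
    prompt="What's the Python syntax for list comprehension?",
    model_used=Model.SONNET,
    model_selected_by="auto",
    tokens_input=30,
    tokens_output=150,
    latency_ms=800,
    complexity=ComplexityLevel.SIMPLE,
    optimal_model=Model.HAIKU,
)
assert q.model_used != q.optimal_model  # a candidate misroute for the analyzer to flag
```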
Classification Algorithm
Input: prompt (str), context_files (int), conversation_turns (int)

Step 1 - Keyword analysis:
- simple_keywords = ["syntax", "import", "how to", "what is", "convert", "format"]
- medium_keywords = ["debug", "fix", "error", "bug", "issue", "not working", "broken"]
- complex_keywords = ["design", "architect", "refactor", "restructure", "strategy", "optimize", "implement from scratch"]

Step 2 - Context analysis:
- IF context_files > 10: complexity += 1
- IF conversation_turns > 5: complexity += 1
- IF prompt_tokens > 500: complexity += 1

Step 3 - Pattern matching:
- IF matches(r"^(what|how|where).*\?$"): likely SIMPLE
- IF matches(r"(refactor|redesign).*entire"): likely COMPLEX
- IF mentions_multiple_files(): likely MEDIUM or higher

Step 4 - Optimal model mapping:
- SIMPLE → HAIKU
- MEDIUM → SONNET
- COMPLEX → OPUS

Output: (ComplexityLevel, optimal_model: Model)
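The phase guides below walk through Step 1 (keywords) and Step 2 (context). A hedged sketch of the Step 3 regex signals, reusing the ComplexityLevel enum from the data model, could look like this; the filename regex is an illustrative stand-in for mentions_multiple_files():

```python
# Illustrative Step 3: regex-based signals that refine the keyword classification.
import re
from typing import Optional

def pattern_signal(prompt: str) -> Optional[ComplexityLevel]:
    text = prompt.strip().lower()
    if re.match(r"^(what|how|where)\b.*\?$", text):
        return ComplexityLevel.SIMPLE
    if re.search(r"(refactor|redesign).*entire", text):
        return ComplexityLevel.COMPLEX
    # Crude stand-in for mentions_multiple_files(): two or more filename-like tokens
    if len(re.findall(r"\b[\w./-]+\.(?:py|ts|js|go|java)\b", text)) >= 2:
        return ComplexityLevel.MEDIUM
    return None
```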
Cost Calculation Model
Pricing (per 1K tokens, approximate):

| Model | Input Cost | Output Cost | Multiplier |
|---|---|---|---|
| Haiku 4.5 | $0.0008 | $0.0032 | 0.4x |
| Sonnet 4.5 | $0.003 | $0.015 | 1.0x (baseline) |
| Opus 4.5 | $0.015 | $0.075 | 2.2x |

Per-query calculation:
- actual_cost[q] = (input_tokens × input_price[model_used]) + (output_tokens × output_price[model_used])
- optimal_cost[q] = (input_tokens × input_price[optimal_model]) + (output_tokens × output_price[optimal_model])
- waste[q] = actual_cost[q] - optimal_cost[q]

Aggregate:
- total_actual = sum(actual_cost)
- total_optimal = sum(optimal_cost)
- total_waste = sum(waste where waste > 0)
- savings_opportunity = (total_waste / total_actual) × 100
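A minimal sketch of the aggregation step, assuming each Query already has optimal_model set and reusing the calculate_cost helper defined in Phase 3 below:

```python
# Aggregate waste across a list of classified queries (sketch).
# Assumes each Query already has optimal_model set, and that calculate_cost
# (see Phase 3, Hint 1) prices a (model, input_tokens, output_tokens) triple.
from typing import List

def analyze_costs(queries: List[Query]) -> CostAnalysis:
    total_actual = total_optimal = total_waste = 0.0
    for q in queries:
        actual = calculate_cost(q.model_used, q.tokens_input, q.tokens_output)
        optimal = calculate_cost(q.optimal_model, q.tokens_input, q.tokens_output)
        total_actual += actual
        total_optimal += optimal
        total_waste += max(actual - optimal, 0.0)  # only count over-spending
    savings_pct = (total_waste / total_actual * 100) if total_actual else 0.0
    return CostAnalysis(
        actual_cost=total_actual,
        optimal_cost=total_optimal,
        waste=total_waste,
        savings_percentage=savings_pct,
        recommendations=[],
    )
```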
Phased Implementation Guide
Phase 1: Log Parsing (3-4 hours)
Goal: Extract model selection events from Kiro logs
What to Build:
- Locate Kiro log files
- Parse log format to extract relevant events
- Create structured Query objects
Hint 1: Kiro logs are typically in $TMPDIR/kiro-log/:
```bash
ls -la "$TMPDIR/kiro-log/" 2>/dev/null
# Or check ~/.kiro/logs/
```
Hint 2: Log entries often have a recognizable format:
```python
import re
from typing import Optional

LOG_PATTERN = r'\[(\d{4}-\d{2}-\d{2}T[\d:]+)\] \[(\w+)\] model=(\w+) tokens=(\d+)'

def parse_log_line(line: str) -> Optional[dict]:
    match = re.match(LOG_PATTERN, line)
    if match:
        return {
            'timestamp': match.group(1),
            'event': match.group(2),
            'model': match.group(3),
            'tokens': int(match.group(4)),
        }
    return None
```
Hint 3: Handle both JSON and plain text log formats:
```python
import json
from pathlib import Path
from typing import List

def parse_log_file(path: Path) -> List[dict]:
    events = []
    with open(path) as f:
        for line in f:
            try:
                # Try JSON first
                event = json.loads(line)
            except json.JSONDecodeError:
                # Fall back to pattern matching
                event = parse_log_line(line)
            if event:
                events.append(event)
    return events
```
Validation Checkpoint: You can parse a log file and print a list of model selections.
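For example, a throwaway checkpoint script could just count model selections across whatever log directory you found (the path below is an assumption; point it at your own logs):

```python
# Checkpoint helper: count model selections across all parsed log files.
from collections import Counter
from pathlib import Path

log_dir = Path.home() / ".kiro" / "logs"   # assumption: adjust to where your logs live
counts = Counter()
for log_file in sorted(log_dir.glob("*.log")):
    for event in parse_log_file(log_file):
        if "model" in event:
            counts[event["model"]] += 1
print(counts)  # e.g. Counter({'sonnet': 61, 'haiku': 30, 'opus': 9})
```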
Phase 2: Query Classification (4-5 hours)
Goal: Classify queries by complexity and determine optimal model
What to Build:
- Keyword-based classifier
- Context-aware complexity scoring
- Optimal model mapping
Hint 1: Use a simple keyword scoring system:
```python
COMPLEXITY_KEYWORDS = {
    'simple': ['syntax', 'import', 'how to', 'what is', 'convert'],
    'medium': ['debug', 'fix', 'error', 'bug', 'explain'],
    'complex': ['design', 'architect', 'refactor', 'restructure']
}

def classify_prompt(prompt: str) -> ComplexityLevel:
    prompt_lower = prompt.lower()
    scores = {level: 0 for level in ComplexityLevel}
    for level, keywords in COMPLEXITY_KEYWORDS.items():
        for keyword in keywords:
            if keyword in prompt_lower:
                scores[ComplexityLevel[level.upper()]] += 1
    if not any(scores.values()):
        return ComplexityLevel.MEDIUM  # ambiguous prompts default to MEDIUM
    return max(scores, key=scores.get)
```
Hint 2: Consider context size as a complexity signal:
```python
def adjust_for_context(base_level: ComplexityLevel, context_files: int) -> ComplexityLevel:
    if context_files > 10 and base_level == ComplexityLevel.SIMPLE:
        return ComplexityLevel.MEDIUM
    if context_files > 20:
        return ComplexityLevel.COMPLEX
    return base_level
```
Hint 3: Map complexity to optimal model:
```python
OPTIMAL_MODEL_MAP = {
    ComplexityLevel.SIMPLE: Model.HAIKU,
    ComplexityLevel.MEDIUM: Model.SONNET,
    ComplexityLevel.COMPLEX: Model.OPUS,
}
```
Validation Checkpoint: You can classify a list of sample prompts and verify the classifications make sense.
Phase 3: Cost Analysis and Recommendations (4-5 hours)
Goal: Calculate costs and generate actionable recommendations
What to Build:
- Cost calculator with real pricing
- Waste identification
- Recommendation generator
- Report output
Hint 1: Use dataclasses for clean cost modeling:
```python
from dataclasses import dataclass

@dataclass
class ModelPricing:
    input_per_1k: float
    output_per_1k: float

PRICING = {
    Model.HAIKU: ModelPricing(0.0008, 0.0032),
    Model.SONNET: ModelPricing(0.003, 0.015),
    Model.OPUS: ModelPricing(0.015, 0.075),
}

def calculate_cost(model: Model, input_tokens: int, output_tokens: int) -> float:
    pricing = PRICING[model]
    return (input_tokens / 1000 * pricing.input_per_1k +
            output_tokens / 1000 * pricing.output_per_1k)
```
Hint 2: Generate recommendations based on patterns:
```python
def calculate_waste(q: Query) -> float:
    # Over-spend relative to the optimal model for this query
    return (calculate_cost(q.model_used, q.tokens_input, q.tokens_output) -
            calculate_cost(q.optimal_model, q.tokens_input, q.tokens_output))

def generate_recommendations(queries: List[Query]) -> List[str]:
    recommendations = []
    # Count misroutes: simple queries answered by Opus
    simple_with_opus = [q for q in queries
                        if q.complexity == ComplexityLevel.SIMPLE
                        and q.model_used == Model.OPUS]
    if len(simple_with_opus) > 5:
        savings = sum(calculate_waste(q) for q in simple_with_opus)
        recommendations.append(
            f"Found {len(simple_with_opus)} simple queries using Opus. "
            f"Force Haiku with '/model set haiku' for syntax questions. "
            f"Potential savings: ${savings:.2f}"
        )
    return recommendations
```
Hint 3: Use rich for beautiful terminal output:
```python
from rich.console import Console
from rich.table import Table

def render_report(analysis: CostAnalysis):
    console = Console()
    table = Table(title="Model Usage Report")
    table.add_column("Metric")
    table.add_column("Value", justify="right")
    table.add_row("Actual Cost", f"${analysis.actual_cost:.2f}")
    table.add_row("Optimal Cost", f"${analysis.optimal_cost:.2f}")
    table.add_row("Waste", f"${analysis.waste:.2f}")
    table.add_row("Savings Opportunity", f"{analysis.savings_percentage:.1f}%")
    console.print(table)
```
Validation Checkpoint: You can run the analyzer and see a formatted report with costs and recommendations.
Testing Strategy
Unit Tests
```python
# test_classifier.py
import pytest
from analyzer.classifier import classify_prompt, ComplexityLevel

class TestClassifier:
    def test_syntax_query_is_simple(self):
        prompt = "What's the syntax for optional chaining in TypeScript?"
        assert classify_prompt(prompt) == ComplexityLevel.SIMPLE

    def test_debug_query_is_medium(self):
        prompt = "Debug this segfault in my memory allocator"
        assert classify_prompt(prompt) == ComplexityLevel.MEDIUM

    def test_architecture_query_is_complex(self):
        prompt = "Design a microservices architecture for a fintech app"
        assert classify_prompt(prompt) == ComplexityLevel.COMPLEX

    def test_ambiguous_query_defaults_to_medium(self):
        prompt = "Help me with this code"
        assert classify_prompt(prompt) == ComplexityLevel.MEDIUM
```
Integration Tests
```python
# test_integration.py
import tempfile

def test_full_analysis_pipeline():
    # Create a sample log file (analyze_logs is your top-level pipeline entry point)
    sample_logs = """
[2025-12-22T10:00:00] model=sonnet prompt="what is the syntax for..." tokens_in=50 tokens_out=100
[2025-12-22T10:01:00] model=opus prompt="design a new auth system" tokens_in=200 tokens_out=500
"""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.log') as f:
        f.write(sample_logs)
        f.flush()
        result = analyze_logs(f.name)
    assert result.total_queries == 2
    assert result.cost_analysis.waste > 0  # Sonnet used for a simple syntax query
```
Sample Data for Testing
Create a sample_logs/ directory with realistic test data:
sample_logs/diverse_queries.json:

```json
[
  {"prompt": "What's the Python syntax for list comprehension?", "model": "sonnet", "tokens_in": 30, "tokens_out": 150},
  {"prompt": "Debug why this React component is re-rendering", "model": "sonnet", "tokens_in": 500, "tokens_out": 800},
  {"prompt": "Design a distributed caching layer for our microservices", "model": "sonnet", "tokens_in": 200, "tokens_out": 2000},
  {"prompt": "How do I import numpy?", "model": "opus", "tokens_in": 10, "tokens_out": 50}
]
```
Common Pitfalls and Debugging
Pitfall 1: Log Format Variations
Symptom: Parser works on some logs but fails on others
Cause: Kiro log format may change between versions
Debug:
```python
# Print first few lines to understand format
with open(log_file) as f:
    for i, line in enumerate(f):
        print(f"Line {i}: {repr(line[:100])}")
        if i > 5:
            break
```
Solution: Build flexible parsers that try multiple formats:
```python
def parse_line(line: str) -> Optional[dict]:
    # Each parser returns a dict on success or None on failure
    parsers = [parse_json, parse_structured_text, parse_plain_text]
    for parser in parsers:
        result = parser(line)
        if result:
            return result
    return None
```
Pitfall 2: Missing Token Counts
Symptom: Token counts are zero or missing
Cause: Logs may not include token counts for all events
Debug:
```bash
grep -o 'tokens[^,]*' /path/to/logs | sort | uniq -c
```
Solution: Estimate tokens when missing:
```python
def estimate_tokens(text: str) -> int:
    # Rough estimation: ~4 characters per token
    return len(text) // 4
```
Pitfall 3: Classification Disagreements
Symptom: Queries are classified differently than expected
Cause: Keyword-based classification is imperfect
Debug:
```python
def classify_with_debug(prompt: str) -> tuple[ComplexityLevel, dict]:
    scores = {}
    for level, keywords in COMPLEXITY_KEYWORDS.items():
        matched = [k for k in keywords if k in prompt.lower()]
        scores[level] = {'count': len(matched), 'keywords': matched}
    best = max(scores, key=lambda k: scores[k]['count'])
    return ComplexityLevel[best.upper()], scores
```
Solution: Allow manual override and feedback loop:
```python
# Store corrections for learning
import json
from pathlib import Path

def record_correction(query_id: str, correct_complexity: ComplexityLevel):
    corrections_file = Path.home() / '.model-analyzer' / 'corrections.json'
    corrections_file.parent.mkdir(parents=True, exist_ok=True)
    # Load existing corrections
    corrections = json.loads(corrections_file.read_text()) if corrections_file.exists() else {}
    corrections[query_id] = correct_complexity.value
    corrections_file.write_text(json.dumps(corrections))
```
Pitfall 4: Pricing Data Outdated
Symptom: Cost calculations don't match Kiro's /usage output
Cause: Model pricing changes over time
Debug:
```bash
kiro-cli /usage --format json | jq '.credits'
```
Solution: Make pricing configurable:
```yaml
# config.yaml
pricing:
  haiku:
    input_per_1k: 0.0008
    output_per_1k: 0.0032
  sonnet:
    input_per_1k: 0.003
    output_per_1k: 0.015
  opus:
    input_per_1k: 0.015
    output_per_1k: 0.075
```
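A small loader for that file might look like this; it assumes PyYAML is installed and falls back to the hard-coded defaults when config.yaml is absent:

```python
# Load pricing overrides from config.yaml (assumes PyYAML: pip install pyyaml).
from pathlib import Path
import yaml

def load_pricing(config_path: Path = Path("config.yaml")) -> dict:
    if not config_path.exists():
        return dict(PRICING)  # fall back to the hard-coded defaults from Phase 3
    raw = yaml.safe_load(config_path.read_text())["pricing"]
    return {
        Model[name.upper()]: ModelPricing(entry["input_per_1k"], entry["output_per_1k"])
        for name, entry in raw.items()
    }
```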
Extensions and Challenges
Extension 1: Real-time Dashboard
Create a live dashboard that updates as you use Kiro:
```python
# Use watchdog to monitor log files
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class LogHandler(FileSystemEventHandler):
    def on_modified(self, event):
        if event.src_path.endswith('.log'):
            self.update_dashboard()
```
Extension 2: ML-Based Classifier
Replace keyword matching with a trained classifier:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train on labeled examples
vectorizer = TfidfVectorizer()
classifier = MultinomialNB()
X = vectorizer.fit_transform(training_prompts)
classifier.fit(X, training_labels)
```
Extension 3: Team Analytics
Aggregate usage across team members for organization-wide insights:
```bash
./model-analyzer team-report --team-dir /shared/kiro-logs/
```
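One hedged way to implement that aggregation, assuming each team member's logs live in their own subdirectory of the shared folder and reusing parse_log_file from Phase 1:

```python
# Aggregate model usage across team members; the directory layout is an assumption.
from collections import Counter
from pathlib import Path

def team_model_distribution(team_dir: Path) -> Counter:
    totals = Counter()
    for member_dir in sorted(p for p in team_dir.iterdir() if p.is_dir()):
        for log_file in member_dir.glob("*.log"):
            for event in parse_log_file(log_file):
                if "model" in event:
                    totals[event["model"]] += 1
    return totals

print(team_model_distribution(Path("/shared/kiro-logs")))
```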
Extension 4: A/B Testing Framework
Compare model performance on similar queries:
```python
def ab_test(prompt: str, models: List[Model]) -> dict:
    # run_query is a thin wrapper you write around Kiro that forces a specific model
    results = {}
    for model in models:
        response = run_query(prompt, force_model=model)
        results[model] = {
            'response': response,
            'latency': response.latency,
            'cost': calculate_cost(model, response.tokens_in, response.tokens_out),
        }
    return results
```
Challenge: Predictive Routing
Build a system that predicts optimal routing before the query is sent:
```
User starts typing: "design a..."
System predicts:    COMPLEX (confidence: 0.85)
Suggestion:         "This looks like an architecture question. Consider forcing Opus for best results."
```
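A toy version of this prediction can reuse the keyword classifier from Phase 2 on the partial input; the confidence heuristic below is made up purely for illustration:

```python
# Toy predictive router: classify a partially typed prompt with a rough confidence.
def predict_routing(partial_prompt: str) -> tuple:
    level = classify_prompt(partial_prompt)
    matched = sum(
        1 for keywords in COMPLEXITY_KEYWORDS.values()
        for k in keywords if k in partial_prompt.lower()
    )
    confidence = min(0.5 + 0.15 * matched, 0.95)  # made-up heuristic: more matches, more confidence
    return level, confidence

level, confidence = predict_routing("design a...")
print(level, confidence)  # ComplexityLevel.COMPLEX 0.65 under this toy heuristic
```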
Real-World Connections
How Professionals Use This
- Cost Management: Engineering managers track AI spending per team/project
- Performance Optimization: DevOps teams monitor response latency vs. model selection
- Quality Assurance: Teams correlate model selection with code review feedback
- Capacity Planning: Predict AI compute needs based on usage patterns
Industry Patterns
LLM Observability (MLOps): This project introduces concepts used in production LLM monitoring systems like LangSmith, Weights & Biases, and Datadog LLM Observability.
Cost Attribution (FinOps): Tracking AI costs per project/feature mirrors cloud cost allocation practices.
Quality-Cost Tradeoffs (Engineering Economics): The model selection problem is a specific instance of the general engineering tradeoff between quality, speed, and cost.
Self-Assessment Checklist
Understanding Verification
- Can you explain when Haiku is sufficient vs. when Opus is needed?
  - Haiku: Syntax, simple lookups, fast iteration
  - Opus: Architecture, complex reasoning, creative solutions
- What factors influence the Auto router's decision?
  - Prompt complexity (keywords, length)
  - Context size (files loaded)
  - Historical patterns (conversation depth)
- How do you calculate cost savings from better routing?
  - Compare actual model cost vs. optimal model cost per query
  - Aggregate waste across session/week/month
- When should you override Auto and force a specific model?
  - Force Haiku: Known simple queries, speed-critical loops
  - Force Opus: Architecture discussions, complex debugging
Skill Demonstration
- I can parse Kiro logs and extract model selection events
- I can classify query complexity with reasonable accuracy
- I can calculate cost metrics and identify waste
- I can generate actionable recommendations
- I can visualize usage patterns in the terminal
Interview Preparation
Be ready to answer:
- "How would you design a model routing system for an AI application?"
- "What metrics would you track to optimize LLM costs?"
- "How do you balance cost vs. quality in AI deployments?"
- "How would you A/B test different models for the same task?"
Recommended Reading
| Topic | Resource | Why It Helps |
|---|---|---|
| LLM Engineering | "AI Engineering" by Chip Huyen, Ch. 4-6 | Deep dive into model serving and optimization |
| Cost Optimization | AWS Well-Architected Framework, Cost Pillar | General principles of cloud cost management |
| Data Analysis | "Python for Data Analysis" by McKinney, Ch. 8 | Pandas patterns for log analysis |
| Visualization | Rich documentation (rich.readthedocs.io) | Beautiful terminal output |
| ML Classification | Scikit-learn tutorials | For ML-based classifier extension |
What Success Looks Like
When you complete this project, you will have:
- A Working Tool: A model-analyzer command that generates usage reports
- Cost Awareness: Intuition for when to override the Auto router
- Data Analysis Skills: Log parsing, classification, and visualization
- Optimization Mindset: Identifying waste and recommending improvements
- Foundation for MLOps: Understanding of LLM observability concepts
Next Steps: Move to Project 3 (Context Window Visualizer) to understand token economics in depth.