Project 2: Model Router Analyzer
Understanding the Auto Router: Master cost-effective AI model selection
Project Metadata
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | 1 Week (15-20 hours) |
| Primary Language | Python |
| Alternative Languages | TypeScript, Go |
| Prerequisites | Project 1, basic Python, data analysis fundamentals |
| Main Reference | “AI Engineering” by Chip Huyen |
Learning Objectives
By completing this project, you will:
- Understand LLM model tiers - the capabilities, costs, and latency characteristics of Haiku, Sonnet, and Opus
- Analyze the Auto router’s decision-making - when it escalates to more powerful models and why
- Develop cost optimization intuition - identifying opportunities to reduce AI spending without sacrificing quality
- Build data analysis skills - parsing logs, computing metrics, and generating visualizations
- Create actionable recommendations - translating data into practical workflow improvements
Deep Theoretical Foundation
The Model Selection Problem
When you interact with an AI, a critical decision happens before any response is generated: which model should handle this request? This decision has profound implications:
┌─────────────────────────────────────────────────────────────────────┐
│ THE MODEL SELECTION TRADEOFF SPACE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ CAPABILITY │
│ ▲ │
│ │ ┌─────────────┐ │
│ │ │ OPUS │ │
│ │ │ (Deep) │ │
│ │ │ │ │
│ High │ │ • Complex │ │
│ │ │ reasoning│ │
│ │ ┌─────────────┐ │ • Creative │ │
│ │ │ SONNET │ │ • Nuanced │ │
│ │ │ (Smart) │ └─────────────┘ │
│ │ │ │ │
│ Med │ │ • General │ │
│ │ │ coding │ │
│ │ │ • Refactor │ │
│ │ ┌───────┐ └─────────────┘ │
│ │ │ HAIKU │ │
│ │ │(Fast) │ │
│ Low │ │ │ │
│ │ │• Syntax│ │
│ │ │• Simple│ │
│ │ └───────┘ │
│ │ │
│ └─────────────────────────────────────────────────────────────▶
│ Low Medium High COST │
│ │
│ OPTIMAL SELECTION: Match task complexity to model capability │
│ WASTE: Using Opus for syntax questions │
│ FAILURE: Using Haiku for architecture design │
│ │
└─────────────────────────────────────────────────────────────────────┘
Model Characteristics Deep Dive
| Model | Cost Multiplier | Latency | Best For | Failure Modes |
|---|---|---|---|---|
| Haiku 4.5 | 0.4x | ~200ms | Syntax, simple queries, fast feedback loops | Misses nuance, shallow reasoning |
| Sonnet 4.5 | 1.0x (baseline) | ~800ms | General coding, refactoring, debugging | Occasionally overthinks simple tasks |
| Opus 4.5 | 2.2x | ~2000ms | Architecture, complex reasoning, legacy code | Overkill for simple tasks, expensive |
How the Auto Router Works
The Auto router is Kiro’s intelligent model selector. It analyzes your prompt and routes it to the appropriate model tier:
┌─────────────────────────────────────────────────────────────────────┐
│ AUTO ROUTER DECISION FLOW │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ User Prompt: "..." │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ PROMPT ANALYSIS │ │
│ │ │ │
│ │ Features Extracted: │ │
│ │ • Token count │ │
│ │ • Question type (syntax? architecture? debug?) │ │
│ │ • Complexity signals (numbers, conditions, dependencies) │ │
│ │ • Historical context (previous turns in conversation) │ │
│ │ • File context size (more files = more complexity) │ │
│ │ │ │
│ └────────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ COMPLEXITY SCORING │ │
│ │ │ │
│ │ Score = 0 │ │
│ │ │ │
│ │ IF contains "syntax", "import", "how to" → score += 0 │ │
│ │ IF contains "refactor", "debug", "fix" → score += 5 │ │
│ │ IF contains "design", "architect", "strategy" → score += 10│ │
│ │ IF token_count > 500 → score += 3 │ │
│ │ IF file_count > 5 → score += 5 │ │
│ │ IF has_code_block → score += 2 │ │
│ │ │ │
│ └────────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ MODEL SELECTION │ │
│ │ │ │
│ │ IF score < 5: │ │
│ │ ┌──────────┐ │ │
│ │ │ HAIKU │ → Fast, cheap, sufficient │ │
│ │ └──────────┘ │ │
│ │ │ │
│ │ ELIF score < 12: │ │
│ │ ┌──────────┐ │ │
│ │ │ SONNET │ → Balanced capability │ │
│ │ └──────────┘ │ │
│ │ │ │
│ │ ELSE: │ │
│ │ ┌──────────┐ │ │
│ │ │ OPUS │ → Maximum reasoning power │ │
│ │ └──────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
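The decision flow above can be condensed into a toy scorer. This is a sketch of the heuristic the diagram describes, using its keywords, weights, and thresholds; Kiro's actual router is a trained classifier, not this hand-written rule set:

```python
def route(prompt: str, token_count: int, file_count: int, has_code_block: bool) -> str:
    """Illustrative re-implementation of the scoring diagram (not Kiro's real logic)."""
    p = prompt.lower()
    score = 0
    # Keyword signals (simple keywords add nothing, per the diagram)
    if any(k in p for k in ("refactor", "debug", "fix")):
        score += 5
    if any(k in p for k in ("design", "architect", "strategy")):
        score += 10
    # Size and context signals
    if token_count > 500:
        score += 3
    if file_count > 5:
        score += 5
    if has_code_block:
        score += 2
    # Threshold mapping from the diagram
    if score < 5:
        return "haiku"
    if score < 12:
        return "sonnet"
    return "opus"
```

For example, `route("How do I import numpy?", 20, 0, False)` scores 0 and stays on Haiku, while a long architecture prompt with many files escalates to Opus.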
The Economics of Model Selection
Understanding the cost implications is crucial:
┌─────────────────────────────────────────────────────────────────────┐
│ COST CALCULATION EXAMPLE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Scenario: 100 queries over a work session │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ NAIVE APPROACH: Always use Sonnet │ │
│ │ │ │
│ │ 100 queries × 1,000 tokens × $0.003/1K = $0.30 per session │ │
│ │ │ │
│ │ Monthly (20 sessions): $6.00 │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ OPTIMIZED APPROACH: Smart routing │ │
│ │ │ │
│ │ 40 queries (simple) × 1,000 × $0.0012/1K = $0.048 │ │
│ │ 50 queries (medium) × 1,000 × $0.0039/1K = $0.195 │ │
│ │ 10 queries (complex) × 1,000 × $0.0066/1K = $0.066 │ │
│ │ ──────────────────────────────────────────────── │ │
│ │ Total: $0.309 → but with 23% BETTER routing: │ │
│ │ Actual: $0.238 per session │ │
│ │ │ │
│ │ Monthly (20 sessions): $4.76 │ │
│ │ SAVINGS: $1.24/month (21%) │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│  For heavy users (100 sessions/month): ~$74 savings/year!           │
│ │
└─────────────────────────────────────────────────────────────────────┘
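One wrinkle worth noticing: at these blended rates the routed mix alone (~$0.309) costs about the same as always-Sonnet ($0.30), so the example's savings come entirely from the assumed 23% routing-improvement factor. The box's per-session arithmetic can be checked in a few lines (all rates are the example's illustrative numbers, with 1K tokens per query):

```python
# Recompute the box's per-session numbers
naive = 100 * 0.003                            # always Sonnet: $0.30/session
mix = 40 * 0.0012 + 50 * 0.0039 + 10 * 0.0066  # routed mix: $0.309/session
actual = mix * (1 - 0.23)                      # with 23% better routing: ~$0.238
monthly_savings = (naive - actual) * 20        # 20 sessions/month: ~$1.24
```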
Historical Context: From Manual to Intelligent Routing
The evolution of model selection:
| Era | Approach | Overhead |
|---|---|---|
| 2023 | Manual API calls with explicit model selection | Developer decides every call |
| 2024 | Simple routing based on token count | Rule-based, crude |
| 2025 | ML-powered routing (Auto router) | Learns from usage patterns |
| Future | Predictive routing with quality feedback loops | Self-optimizing |
The Auto router represents a significant step: it uses a lightweight classifier trained on millions of queries to predict optimal model selection.
Real-World Analogy: The Restaurant Kitchen
Think of model selection like a restaurant kitchen:
- Haiku = Line cook: Fast, efficient, handles simple dishes
- Sonnet = Sous chef: Skilled, handles complex orders
- Opus = Executive chef: Creative, handles VIP requests
You wouldn’t have the executive chef make a salad, and you wouldn’t have the line cook design the tasting menu. The maître d’ (Auto router) routes orders appropriately.
Complete Project Specification
What You Are Building
A Model Usage Analyzer that:
- Logs Model Selections: Captures which model was used for each query
- Classifies Query Complexity: Categorizes queries by type and difficulty
- Calculates Cost Metrics: Computes actual vs. optimal spending
- Identifies Optimization Opportunities: Finds mismatches between task and model
- Generates Recommendations: Provides actionable advice for better routing
Architecture Overview
┌─────────────────────────────────────────────────────────────────────┐
│ MODEL ANALYZER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Data Collection │ │
│ │ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │ │
│ │ │ Kiro Logs │ │ /usage API │ │ Session Data │ │ │
│ │ │ ($TMPDIR) │ │ (credits) │ │ (Project 1) │ │ │
│ │ └───────┬───────┘ └───────┬───────┘ └───────┬───────┘ │ │
│ │ │ │ │ │ │
│ └──────────┼──────────────────┼──────────────────┼─────────────┘ │
│ │ │ │ │
│ └──────────────────┼──────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Log Parser │ │
│ │ • Extract model selection events │ │
│ │ • Parse prompt content │ │
│ │ • Capture response metadata │ │
│ └────────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Query Classifier │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ │ │
│ │ │ Syntax │ │ Debug │ │Architecture│ │ │
│ │ │ Queries │ │ Queries │ │ Queries │ │ │
│ │ └───────────┘ └───────────┘ └───────────┘ │ │
│ │ │ │
│ │ Keywords: "import", "syntax", "how to" → SIMPLE │ │
│ │ Keywords: "debug", "fix", "error" → MEDIUM │ │
│ │ Keywords: "design", "architect", "refactor" → COMPLEX │ │
│ └────────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Cost Calculator │ │
│ │ │ │
│ │ actual_cost = sum(model_cost[m] * tokens[m]) │ │
│ │ optimal_cost = sum(optimal_model_cost[q] * tokens[q]) │ │
│ │ waste = actual_cost - optimal_cost │ │
│ └────────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Recommendation Engine │ │
│ │ │ │
│ │ IF simple_query AND used_sonnet: │ │
│ │ → "Consider forcing Haiku for syntax queries" │ │
│ │ │ │
│ │ IF complex_query AND used_sonnet AND low_quality: │ │
│ │ → "Force Opus for architecture discussions" │ │
│ │ │ │
│ └────────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Report Generator │ │
│ │ • Terminal dashboard (rich/matplotlib) │ │
│ │ • JSON export │ │
│ │ • Markdown report │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Expected Deliverables
model-analyzer/
├── analyzer/
│ ├── __init__.py
│ ├── log_parser.py # Parse Kiro logs
│ ├── classifier.py # Classify query complexity
│ ├── cost_calculator.py # Calculate costs and savings
│ ├── recommender.py # Generate recommendations
│ └── reporter.py # Generate reports
├── cli.py # Command-line interface
├── tests/
│ ├── test_classifier.py
│ ├── test_cost_calculator.py
│ └── sample_logs/
├── requirements.txt
└── README.md
Solution Architecture
Data Model
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional

class Model(Enum):
    HAIKU = "haiku"
    SONNET = "sonnet"
    OPUS = "opus"
    AUTO = "auto"

class ComplexityLevel(Enum):
    SIMPLE = 1   # Syntax, imports, how-to
    MEDIUM = 2   # Debugging, fixing, general coding
    COMPLEX = 3  # Architecture, design, refactoring

@dataclass
class Query:
    id: str
    timestamp: datetime
    prompt: str
    model_used: Model
    model_selected_by: str  # "auto" or "manual"
    tokens_input: int
    tokens_output: int
    latency_ms: int
    complexity: Optional[ComplexityLevel] = None
    optimal_model: Optional[Model] = None

@dataclass
class CostAnalysis:
    actual_cost: float
    optimal_cost: float
    waste: float
    savings_percentage: float
    recommendations: List[str]

@dataclass
class UsageReport:
    period_start: datetime
    period_end: datetime
    total_queries: int
    model_distribution: dict[Model, int]
    cost_analysis: CostAnalysis
    misrouted_queries: List[Query]
Classification Algorithm
┌─────────────────────────────────────────────────────────────────────┐
│ QUERY CLASSIFICATION ALGORITHM │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ INPUT: prompt (str), context_files (int), conversation_turns (int) │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ STEP 1: KEYWORD ANALYSIS │ │
│ │ │ │
│ │ simple_keywords = ["syntax", "import", "how to", "what is", │ │
│ │ "convert", "format"] │ │
│ │ │ │
│ │ medium_keywords = ["debug", "fix", "error", "bug", "issue", │ │
│ │ "not working", "broken"] │ │
│ │ │ │
│ │ complex_keywords = ["design", "architect", "refactor", │ │
│ │ "restructure", "strategy", "optimize", │ │
│ │ "implement from scratch"] │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ STEP 2: CONTEXT ANALYSIS │ │
│ │ │ │
│ │ IF context_files > 10: complexity += 1 │ │
│ │ IF conversation_turns > 5: complexity += 1 │ │
│ │ IF prompt_tokens > 500: complexity += 1 │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ STEP 3: PATTERN MATCHING │ │
│ │ │ │
│ │ IF matches(r"^(what|how|where).*\?$"): likely SIMPLE │ │
│ │ IF matches(r"(refactor|redesign).*entire"): likely COMPLEX │ │
│ │ IF mentions_multiple_files(): likely MEDIUM or higher │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ STEP 4: OPTIMAL MODEL MAPPING │ │
│ │ │ │
│ │ SIMPLE → HAIKU │ │
│ │ MEDIUM → SONNET │ │
│ │ COMPLEX → OPUS │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ OUTPUT: (ComplexityLevel, optimal_model: Model) │
│ │
└─────────────────────────────────────────────────────────────────────┘
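Step 3's pattern checks can be sketched with `re` (the patterns are copied from the diagram; treat them as starting points to tune against your own prompts):

```python
import re
from typing import Optional

def pattern_hint(prompt: str) -> Optional[str]:
    """Return a complexity hint from the Step 3 patterns, or None if none match."""
    p = prompt.strip().lower()
    # Short interrogative one-liners are usually simple lookups
    if re.match(r"^(what|how|where).*\?$", p):
        return "simple"
    # "refactor/redesign ... entire" signals a large-scope task
    if re.search(r"(refactor|redesign).*entire", p):
        return "complex"
    return None
```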
Cost Calculation Model
┌─────────────────────────────────────────────────────────────────────┐
│ COST CALCULATION MODEL │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ PRICING (per 1K tokens, approximate): │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Model │ Input Cost │ Output Cost │ Multiplier │ │
│ │──────────────┼─────────────┼─────────────┼──────────────────│ │
│ │ Haiku 4.5 │ $0.0008 │ $0.0032 │ 0.4x │ │
│ │ Sonnet 4.5 │ $0.003 │ $0.015 │ 1.0x (baseline) │ │
│ │ Opus 4.5 │ $0.015 │ $0.075 │ 2.2x │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
│ CALCULATION: │
│ │
│ For each query q: │
│ actual_cost[q] = (input_tokens * input_price[model_used]) + │
│ (output_tokens * output_price[model_used]) │
│ │
│ optimal_cost[q] = (input_tokens * input_price[optimal_model]) + │
│ (output_tokens * output_price[optimal_model]) │
│ │
│ waste[q] = actual_cost[q] - optimal_cost[q] │
│ │
│ AGGREGATE: │
│ total_actual = sum(actual_cost) │
│ total_optimal = sum(optimal_cost) │
│ total_waste = sum(waste where waste > 0) │
│ savings_opportunity = (total_waste / total_actual) * 100 │
│ │
└─────────────────────────────────────────────────────────────────────┘
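The whole calculation can be sketched end-to-end with plain dicts. Prices come from the table above; the three queries are fabricated for illustration:

```python
# Per-1K-token prices from the table above (approximate)
PRICE = {
    "haiku": (0.0008, 0.0032),
    "sonnet": (0.003, 0.015),
    "opus": (0.015, 0.075),
}

def cost(model: str, tokens_in: int, tokens_out: int) -> float:
    price_in, price_out = PRICE[model]
    return tokens_in / 1000 * price_in + tokens_out / 1000 * price_out

# (model_used, optimal_model, tokens_in, tokens_out) -- made-up examples
queries = [
    ("sonnet", "haiku", 50, 150),   # simple lookup routed to Sonnet
    ("opus", "opus", 200, 2000),    # complex design query, correctly routed
    ("opus", "haiku", 10, 50),      # syntax question wastefully sent to Opus
]

total_actual = sum(cost(m, ti, to) for m, _, ti, to in queries)
total_optimal = sum(cost(o, ti, to) for _, o, ti, to in queries)
# Only count positive waste, matching the aggregate formula above
total_waste = sum(max(0.0, cost(m, ti, to) - cost(o, ti, to))
                  for m, o, ti, to in queries)
savings_opportunity = total_waste / total_actual * 100
```

Note that the correctly routed Opus query contributes nothing to waste; only the two misroutes do.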
Phased Implementation Guide
Phase 1: Log Parsing (3-4 hours)
Goal: Extract model selection events from Kiro logs
What to Build:
- Locate Kiro log files
- Parse log format to extract relevant events
- Create structured Query objects
Hint 1: Kiro logs are typically in $TMPDIR/kiro-log/:
ls -la "$TMPDIR/kiro-log/" 2>/dev/null
# Or check ~/.kiro/logs/
Hint 2: Log entries often have a recognizable format:
import re
from typing import Optional

LOG_PATTERN = r'\[(\d{4}-\d{2}-\d{2}T[\d:]+)\] \[(\w+)\] model=(\w+) tokens=(\d+)'

def parse_log_line(line: str) -> Optional[dict]:
    match = re.match(LOG_PATTERN, line)
    if match:
        return {
            'timestamp': match.group(1),
            'event': match.group(2),
            'model': match.group(3),
            'tokens': int(match.group(4))
        }
    return None
Hint 3: Handle both JSON and plain text log formats:
import json
from pathlib import Path
from typing import List

def parse_log_file(path: Path) -> List[dict]:
    events = []
    with open(path) as f:
        for line in f:
            try:
                # Try JSON first
                event = json.loads(line)
            except json.JSONDecodeError:
                # Fall back to pattern matching
                event = parse_log_line(line)
            if event:
                events.append(event)
    return events
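To see the fallback chain in action on a couple of fabricated lines (this log format is invented for illustration; real Kiro logs may differ):

```python
import json
import re
from typing import Optional

LOG_PATTERN = r'\[(\d{4}-\d{2}-\d{2}T[\d:]+)\] \[(\w+)\] model=(\w+) tokens=(\d+)'

def parse_any_line(line: str) -> Optional[dict]:
    try:
        return json.loads(line)              # JSON-formatted entry
    except json.JSONDecodeError:
        match = re.match(LOG_PATTERN, line)  # structured-text entry
        if match:
            return {"timestamp": match.group(1), "event": match.group(2),
                    "model": match.group(3), "tokens": int(match.group(4))}
    return None

lines = [
    '{"timestamp": "2025-12-22T10:00:00", "model": "haiku", "tokens": 120}',
    '[2025-12-22T10:01:00] [request] model=sonnet tokens=450',
    'unparseable noise',
]
# Both recognized entries survive; the noise line is dropped
events = [e for e in map(parse_any_line, lines) if e]
```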
Validation Checkpoint: You can parse a log file and print a list of model selections.
Phase 2: Query Classification (4-5 hours)
Goal: Classify queries by complexity and determine optimal model
What to Build:
- Keyword-based classifier
- Context-aware complexity scoring
- Optimal model mapping
Hint 1: Use a simple keyword scoring system:
COMPLEXITY_KEYWORDS = {
    'simple': ['syntax', 'import', 'how to', 'what is', 'convert'],
    'medium': ['debug', 'fix', 'error', 'bug', 'explain'],
    'complex': ['design', 'architect', 'refactor', 'restructure']
}

def classify_prompt(prompt: str) -> ComplexityLevel:
    prompt_lower = prompt.lower()
    scores = {level: 0 for level in ComplexityLevel}
    for level, keywords in COMPLEXITY_KEYWORDS.items():
        for keyword in keywords:
            if keyword in prompt_lower:
                scores[ComplexityLevel[level.upper()]] += 1
    if not any(scores.values()):
        # No keyword matched: default ambiguous prompts to MEDIUM
        return ComplexityLevel.MEDIUM
    return max(scores, key=scores.get)
Hint 2: Consider context size as a complexity signal:
def adjust_for_context(base_level: ComplexityLevel, context_files: int) -> ComplexityLevel:
    # Check the larger threshold first so very large contexts always escalate
    if context_files > 20:
        return ComplexityLevel.COMPLEX
    if context_files > 10 and base_level == ComplexityLevel.SIMPLE:
        return ComplexityLevel.MEDIUM
    return base_level
Hint 3: Map complexity to optimal model:
OPTIMAL_MODEL_MAP = {
    ComplexityLevel.SIMPLE: Model.HAIKU,
    ComplexityLevel.MEDIUM: Model.SONNET,
    ComplexityLevel.COMPLEX: Model.OPUS
}
Validation Checkpoint: You can classify a list of sample prompts and verify the classifications make sense.
Phase 3: Cost Analysis and Recommendations (4-5 hours)
Goal: Calculate costs and generate actionable recommendations
What to Build:
- Cost calculator with real pricing
- Waste identification
- Recommendation generator
- Report output
Hint 1: Use dataclasses for clean cost modeling:
@dataclass
class ModelPricing:
    input_per_1k: float
    output_per_1k: float

PRICING = {
    Model.HAIKU: ModelPricing(0.0008, 0.0032),
    Model.SONNET: ModelPricing(0.003, 0.015),
    Model.OPUS: ModelPricing(0.015, 0.075)
}

def calculate_cost(model: Model, input_tokens: int, output_tokens: int) -> float:
    pricing = PRICING[model]
    return (input_tokens / 1000 * pricing.input_per_1k +
            output_tokens / 1000 * pricing.output_per_1k)
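As a concrete check of the formula, a 500-token-in / 800-token-out query at the example rates above works out as:

```python
# Sonnet at $0.003 in / $0.015 out per 1K tokens (example rates from above)
sonnet_cost = 500 / 1000 * 0.003 + 800 / 1000 * 0.015   # $0.0135
# The same query on Haiku ($0.0008 / $0.0032 per 1K) is ~4.5x cheaper
haiku_cost = 500 / 1000 * 0.0008 + 800 / 1000 * 0.0032  # $0.00296
```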
Hint 2: Generate recommendations based on patterns:
def generate_recommendations(queries: List[Query]) -> List[str]:
    recommendations = []
    # Count misroutes: simple queries that went to the most expensive model
    simple_with_opus = [q for q in queries
                        if q.complexity == ComplexityLevel.SIMPLE
                        and q.model_used == Model.OPUS]
    if len(simple_with_opus) > 5:
        # calculate_waste(q): actual cost minus optimal cost for a query
        savings = sum(calculate_waste(q) for q in simple_with_opus)
        recommendations.append(
            f"Found {len(simple_with_opus)} simple queries using Opus. "
            f"Force Haiku with '/model set haiku' for syntax questions. "
            f"Potential savings: ${savings:.2f}"
        )
    return recommendations
Hint 3: Use rich for beautiful terminal output:
from rich.console import Console
from rich.table import Table

def render_report(analysis: CostAnalysis):
    console = Console()
    table = Table(title="Model Usage Report")
    table.add_column("Metric")
    table.add_column("Value", justify="right")
    table.add_row("Actual Cost", f"${analysis.actual_cost:.2f}")
    table.add_row("Optimal Cost", f"${analysis.optimal_cost:.2f}")
    table.add_row("Waste", f"${analysis.waste:.2f}")
    table.add_row("Savings Opportunity", f"{analysis.savings_percentage:.1f}%")
    console.print(table)
Validation Checkpoint: You can run the analyzer and see a formatted report with costs and recommendations.
Testing Strategy
Unit Tests
# test_classifier.py
import pytest
from analyzer.classifier import classify_prompt, ComplexityLevel

class TestClassifier:
    def test_syntax_query_is_simple(self):
        prompt = "What's the syntax for optional chaining in TypeScript?"
        assert classify_prompt(prompt) == ComplexityLevel.SIMPLE

    def test_debug_query_is_medium(self):
        prompt = "Debug this segfault in my memory allocator"
        assert classify_prompt(prompt) == ComplexityLevel.MEDIUM

    def test_architecture_query_is_complex(self):
        prompt = "Design a microservices architecture for a fintech app"
        assert classify_prompt(prompt) == ComplexityLevel.COMPLEX

    def test_ambiguous_query_defaults_to_medium(self):
        prompt = "Help me with this code"
        assert classify_prompt(prompt) == ComplexityLevel.MEDIUM
Integration Tests
# test_integration.py
import tempfile

def test_full_analysis_pipeline():
    # Create a sample log file
    sample_logs = """
[2025-12-22T10:00:00] model=sonnet prompt="what is the syntax for..." tokens_in=50 tokens_out=100
[2025-12-22T10:01:00] model=opus prompt="design a new auth system" tokens_in=200 tokens_out=500
"""
    with tempfile.NamedTemporaryFile(mode='w', suffix='.log') as f:
        f.write(sample_logs)
        f.flush()
        result = analyze_logs(f.name)
    assert result.total_queries == 2
    assert result.cost_analysis.waste > 0  # Sonnet used for a simple query
Sample Data for Testing
Create a sample_logs/ directory with realistic test data:
// sample_logs/diverse_queries.json
[
{"prompt": "What's the Python syntax for list comprehension?", "model": "sonnet", "tokens_in": 30, "tokens_out": 150},
{"prompt": "Debug why this React component is re-rendering", "model": "sonnet", "tokens_in": 500, "tokens_out": 800},
{"prompt": "Design a distributed caching layer for our microservices", "model": "sonnet", "tokens_in": 200, "tokens_out": 2000},
{"prompt": "How do I import numpy?", "model": "opus", "tokens_in": 10, "tokens_out": 50}
]
Common Pitfalls and Debugging
Pitfall 1: Log Format Variations
Symptom: Parser works on some logs but fails on others
Cause: Kiro log format may change between versions
Debug:
# Print first few lines to understand format
with open(log_file) as f:
for i, line in enumerate(f):
print(f"Line {i}: {repr(line[:100])}")
if i > 5:
break
Solution: Build flexible parsers that try multiple formats:
def parse_line(line: str) -> Optional[dict]:
    parsers = [parse_json, parse_structured_text, parse_plain_text]
    for parser in parsers:
        result = parser(line)
        if result:
            return result
    return None
Pitfall 2: Missing Token Counts
Symptom: Token counts are zero or missing
Cause: Logs may not include token counts for all events
Debug:
grep -o 'tokens[^,]*' /path/to/logs | sort | uniq -c
Solution: Estimate tokens when missing:
def estimate_tokens(text: str) -> int:
    # Rough estimation: ~4 characters per token
    return len(text) // 4
Pitfall 3: Classification Disagreements
Symptom: Queries are classified differently than expected
Cause: Keyword-based classification is imperfect
Debug:
def classify_with_debug(prompt: str) -> tuple[str, dict]:
    # Returns the best-scoring level name plus per-level match details
    scores = {}
    for level, keywords in COMPLEXITY_KEYWORDS.items():
        matched = [k for k in keywords if k in prompt.lower()]
        scores[level] = {'count': len(matched), 'keywords': matched}
    return max(scores, key=lambda k: scores[k]['count']), scores
Solution: Allow manual override and feedback loop:
# Store corrections for learning
def record_correction(query_id: str, correct_complexity: ComplexityLevel):
    corrections_file = Path.home() / '.model-analyzer' / 'corrections.json'
    corrections_file.parent.mkdir(parents=True, exist_ok=True)
    # Load existing corrections, then merge in the new one
    corrections = json.loads(corrections_file.read_text()) if corrections_file.exists() else {}
    corrections[query_id] = correct_complexity.value
    corrections_file.write_text(json.dumps(corrections))
Pitfall 4: Pricing Data Outdated
Symptom: Cost calculations don’t match Kiro’s /usage output
Cause: Model pricing changes over time
Debug:
kiro-cli /usage --format json | jq '.credits'
Solution: Make pricing configurable:
# config.yaml
pricing:
  haiku:
    input_per_1k: 0.0008
    output_per_1k: 0.0032
  sonnet:
    input_per_1k: 0.003
    output_per_1k: 0.015
  opus:
    input_per_1k: 0.015
    output_per_1k: 0.075
Extensions and Challenges
Extension 1: Real-time Dashboard
Create a live dashboard that updates as you use Kiro:
# Use watchdog to monitor log files
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class LogHandler(FileSystemEventHandler):
    def on_modified(self, event):
        if event.src_path.endswith('.log'):
            self.update_dashboard()  # re-parse logs and redraw (your code)

observer = Observer()
observer.schedule(LogHandler(), path=log_dir)  # log_dir: your Kiro log directory
observer.start()
Extension 2: ML-Based Classifier
Replace keyword matching with a trained classifier:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train on labeled examples (training_prompts, training_labels)
vectorizer = TfidfVectorizer()
classifier = MultinomialNB()
X = vectorizer.fit_transform(training_prompts)
classifier.fit(X, training_labels)
Extension 3: Team Analytics
Aggregate usage across team members for organization-wide insights:
./model-analyzer team-report --team-dir /shared/kiro-logs/
Extension 4: A/B Testing Framework
Compare model performance on similar queries:
def ab_test(prompt: str, models: List[Model]) -> dict:
    results = {}
    for model in models:
        response = run_query(prompt, force_model=model)
        results[model] = {
            'response': response,
            'latency': response.latency,
            'cost': calculate_cost(model, response.tokens_in, response.tokens_out)
        }
    return results
Challenge: Predictive Routing
Build a system that predicts optimal routing before the query is sent:
User starts typing: "design a..."
System predicts: COMPLEX (confidence: 0.85)
Suggestion: "This looks like an architecture question. Consider forcing Opus for best results."
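A naive first cut at this predictor might look at keyword prefixes only (the confidence numbers below are fabricated for illustration; a real version would be a trained model fed by the corrections feedback loop):

```python
from typing import Tuple

def predict_complexity(partial_prompt: str) -> Tuple[str, float]:
    """Guess complexity from a partial prompt as the user types."""
    p = partial_prompt.lower().lstrip()
    # Prefix keywords that strongly suggest an architecture-level task
    if any(p.startswith(k) for k in ("design", "architect", "plan")):
        return ("complex", 0.85)
    if any(k in p for k in ("debug", "fix", "error")):
        return ("medium", 0.70)
    return ("simple", 0.50)
```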
Real-World Connections
How Professionals Use This
- Cost Management: Engineering managers track AI spending per team/project
- Performance Optimization: DevOps teams monitor response latency vs. model selection
- Quality Assurance: Teams correlate model selection with code review feedback
- Capacity Planning: Predict AI compute needs based on usage patterns
Industry Patterns
LLM Observability (MLOps): This project introduces concepts used in production LLM monitoring systems like LangSmith, Weights & Biases, and Datadog LLM Observability.
Cost Attribution (FinOps): Tracking AI costs per project/feature mirrors cloud cost allocation practices.
Quality-Cost Tradeoffs (Engineering Economics): The model selection problem is a specific instance of the general engineering tradeoff between quality, speed, and cost.
Self-Assessment Checklist
Understanding Verification
- Can you explain when Haiku is sufficient vs. when Opus is needed?
- Haiku: Syntax, simple lookups, fast iteration
- Opus: Architecture, complex reasoning, creative solutions
- What factors influence the Auto router’s decision?
- Prompt complexity (keywords, length)
- Context size (files loaded)
- Historical patterns (conversation depth)
- How do you calculate cost savings from better routing?
- Compare actual model cost vs. optimal model cost per query
- Aggregate waste across session/week/month
- When should you override Auto and force a specific model?
- Force Haiku: Known simple queries, speed-critical loops
- Force Opus: Architecture discussions, complex debugging
Skill Demonstration
- I can parse Kiro logs and extract model selection events
- I can classify query complexity with reasonable accuracy
- I can calculate cost metrics and identify waste
- I can generate actionable recommendations
- I can visualize usage patterns in the terminal
Interview Preparation
Be ready to answer:
- “How would you design a model routing system for an AI application?”
- “What metrics would you track to optimize LLM costs?”
- “How do you balance cost vs. quality in AI deployments?”
- “How would you A/B test different models for the same task?”
Recommended Reading
| Topic | Resource | Why It Helps |
|---|---|---|
| LLM Engineering | “AI Engineering” by Chip Huyen, Ch. 4-6 | Deep dive into model serving and optimization |
| Cost Optimization | AWS Well-Architected Framework, Cost Pillar | General principles of cloud cost management |
| Data Analysis | “Python for Data Analysis” by McKinney, Ch. 8 | Pandas patterns for log analysis |
| Visualization | Rich documentation (rich.readthedocs.io) | Beautiful terminal output |
| ML Classification | Scikit-learn tutorials | For ML-based classifier extension |
What Success Looks Like
When you complete this project, you will have:
- A Working Tool: A model-analyzer command that generates usage reports
- Cost Awareness: Intuition for when to override the Auto router
- Data Analysis Skills: Log parsing, classification, and visualization
- Optimization Mindset: Identifying waste and recommending improvements
- Foundation for MLOps: Understanding of LLM observability concepts
Next Steps: Move to Project 3 (Context Window Visualizer) to understand token economics in depth.