P04: Multi-Provider Model Router

P04: Multi-Provider Model Router

Build a smart API gateway that dynamically routes prompts to the optimal LLM (GPT-4 for reasoning, Claude for long context, Gemini for vision) based on task analysis, with automatic fallback handling, cost tracking, and a real-time dashboard.


Overview

Attribute Value
Difficulty Intermediate
Time Estimate 1-2 weeks
Language TypeScript (recommended), Python, Go
Prerequisites AI SDK basics (Projects 1-3), Multiple API keys (OpenAI, Anthropic, Google)
Primary Book โ€œDesigning Data-Intensive Applicationsโ€ by Martin Kleppmann

Learning Objectives

By completing this project, you will:

  1. Master provider abstraction - Understand how the AI SDK normalizes different provider APIs into a unified interface
  2. Implement intelligent routing - Build a task classifier that determines the optimal model for each request
  3. Build resilient fallback chains - Create fault-tolerant systems that gracefully degrade when providers fail
  4. Design cost optimization strategies - Route simple tasks to cheaper models while preserving quality for complex ones
  5. Implement production telemetry - Track token usage, latency, costs, and success rates across providers
  6. Understand rate limiting - Handle quota exhaustion and implement backoff strategies
  7. Build real-time observability - Create a dashboard showing routing decisions and system health

Theoretical Foundation

Part 1: Provider Abstraction Pattern

The core insight of the AI SDK is that despite surface differences, all LLM providers do fundamentally the same thing: accept a prompt and return a response. The SDK exploits this commonality.

                           YOUR APPLICATION
                                  |
                                  v
                    +---------------------------+
                    |     AI SDK Unified API    |
                    |                           |
                    |   generateText()          |
                    |   generateObject()        |
                    |   streamText()            |
                    +-------------+-------------+
                                  |
                    +-------------+-------------+
                    |   Provider Adapter Layer  |
                    |                           |
                    | Normalizes:               |
                    | - Authentication          |
                    | - Request format          |
                    | - Response structure      |
                    | - Error types             |
                    | - Token counting          |
                    +--+-------+-------+-------++
                       |       |       |       |
                       v       v       v       v
                  +------+ +------+ +------+ +------+
                  |OpenAI| |Claude| |Gemini| |Cohere|
                  +------+ +------+ +------+ +------+

What the abstraction normalizes:

Aspect OpenAI Format Anthropic Format AI SDK Unified
Model ID gpt-4-turbo claude-3-opus-20240229 openai('gpt-4-turbo') or anthropic('claude-3-opus')
System Message messages[0].role = 'system' Separate system parameter system: 'You are...'
Token Usage usage.total_tokens usage.input_tokens + output_tokens usage.totalTokens
Streaming SSE with data: [DONE] SSE with event: message_stop Unified async iterator

Why this matters:

  • Switch providers with a single line change
  • Test against multiple providers without code changes
  • Implement fallback chains trivially
  • Compare provider performance objectively

Part 2: Model Capabilities Landscape

Different models excel at different tasks. Understanding these strengths is essential for intelligent routing.

             CAPABILITY MATRIX (as of 2024)

                     Reasoning  Context   Vision   Speed   Cost
                         |          |        |        |       |
    GPT-4 Turbo         โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘  โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘  โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘  โ–ˆโ–ˆโ–‘โ–‘โ–‘  โ–ˆโ–ˆโ–ˆโ–ˆโ–‘
    GPT-4o              โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘  โ–ˆโ–ˆโ–ˆโ–‘โ–‘
    Claude 3 Opus       โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘ โ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ
    Claude 3.5 Sonnet   โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘ โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘  โ–ˆโ–ˆโ–ˆโ–‘โ–‘
    Gemini 1.5 Pro      โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘  โ–ˆโ–ˆโ–ˆโ–‘โ–‘
    GPT-3.5 Turbo       โ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘  โ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘โ–‘โ–‘  โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ  โ–ˆโ–‘โ–‘โ–‘โ–‘

    Legend: โ–ˆ = Strong   โ–‘ = Weak/None

Routing heuristics:

  • Complex reasoning (math, logic puzzles) -> Claude Opus, GPT-4
  • Long documents (100k+ tokens) -> Claude, Gemini 1.5 Pro
  • Vision tasks (image analysis) -> GPT-4o, Gemini Pro Vision
  • Simple tasks (classification, formatting) -> GPT-3.5, Claude Haiku
  • Cost-sensitive -> Haiku, GPT-3.5, Gemini Flash

Part 3: Fallback Chain Patterns

Production systems fail. Your router must handle:

  • API rate limits
  • Provider outages
  • Model-specific errors
  • Network timeouts
                    FALLBACK CHAIN ARCHITECTURE

    Incoming Request
           |
           v
    +------------------+
    | Task Classifier  |  "What type of task is this?"
    +--------+---------+
             |
             v
    +------------------+
    | Route to Primary |  Based on task type
    +--------+---------+
             |
             v
    +--------+---------+
    |   Try Primary    |
    |   (Claude Opus)  |
    +--------+---------+
             |
        Success? ------> Return response
             |
             No
             |
             v
    +--------+---------+
    | Try Secondary    |
    | (GPT-4 Turbo)    |
    +--------+---------+
             |
        Success? ------> Return response + log fallback
             |
             No
             |
             v
    +--------+---------+
    | Try Tertiary     |
    | (Gemini Pro)     |
    +--------+---------+
             |
        Success? ------> Return response + log degradation
             |
             No
             |
             v
    +--------+---------+
    | Graceful Failure |
    | Return error +   |
    | retry guidance   |
    +------------------+

Fallback strategies:

Strategy Description Use Case
Ordered Chain Try providers in fixed priority order General purpose
Capability Match Find next provider with same capability Vision, long context
Cost Escalation Start cheap, escalate on failure Cost-sensitive apps
Geographic Route by latency/region Global deployments

Part 4: Cost Optimization

AI costs add up fast. Smart routing saves money.

    COST PER 1M TOKENS (Input/Output)

    +------------------+----------+-----------+
    | Model            | Input    | Output    |
    +------------------+----------+-----------+
    | GPT-4 Turbo      | $10.00   | $30.00    |
    | GPT-4o           | $5.00    | $15.00    |
    | GPT-3.5 Turbo    | $0.50    | $1.50     |
    | Claude 3 Opus    | $15.00   | $75.00    |
    | Claude 3 Sonnet  | $3.00    | $15.00    |
    | Claude 3 Haiku   | $0.25    | $1.25     |
    | Gemini 1.5 Pro   | $3.50    | $10.50    |
    | Gemini 1.5 Flash | $0.35    | $1.05     |
    +------------------+----------+-----------+

    COST OPTIMIZATION FLOW:

    Request arrives
           |
           v
    +--------------+     +--------------+
    | Complexity   | --> | Simple?      | --> Use Haiku/3.5
    | Analysis     |     | (< 50 words  |     (10-60x cheaper)
    +--------------+     | response)    |
                         +--------------+
                                |
                                v No
                         +--------------+
                         | Medium?      | --> Use Sonnet/4o
                         | (analysis,   |     (3-5x cheaper)
                         | summaries)   |
                         +--------------+
                                |
                                v No
                         +--------------+
                         | Complex?     | --> Use Opus/GPT-4
                         | (reasoning,  |     (full capability)
                         | creativity)  |
                         +--------------+

Part 5: Telemetry and Observability

You cannot optimize what you cannot measure. A production router needs:

    TELEMETRY COLLECTION POINTS

    Request Flow
        |
        v
    +---+---+
    | Entry |  Record: timestamp, request_id, prompt_length
    +---+---+
        |
        v
    +---+---+
    |Classify|  Record: inferred_task_type, selected_provider
    +---+---+
        |
        v
    +---+---+
    | Route |  Record: primary_provider, fallback_triggered?
    +---+---+
        |
        v
    +---+---+
    | LLM   |  Record: provider, model, latency_ms
    +---+---+       input_tokens, output_tokens, cost
        |
        v
    +---+---+
    | Exit  |  Record: success/failure, total_latency,
    +---+---+       error_type (if failed)

    METRICS TO TRACK:

    +----------------------------------+-------------------+
    | Metric                           | Why               |
    +----------------------------------+-------------------+
    | requests_per_provider            | Usage distribution|
    | latency_p50, p95, p99            | Performance SLAs  |
    | cost_per_provider                | Budget tracking   |
    | fallback_rate                    | Reliability       |
    | error_rate_per_provider          | Provider health   |
    | tokens_per_request               | Usage patterns    |
    | cost_savings (vs single provider)| ROI justification |
    +----------------------------------+-------------------+

Part 6: Rate Limiting and Quota Management

Every provider imposes limits. Your router must handle them gracefully.

    RATE LIMIT HANDLING

    +-------------------+          +-------------------+
    | Request Arrives   | -------> | Check Rate Limit  |
    +-------------------+          | State             |
                                   +--------+----------+
                                            |
                    +-----------------------+-----------------------+
                    |                       |                       |
                    v                       v                       v
            +-------+-------+       +-------+-------+       +-------+-------+
            | Under Limit   |       | Near Limit    |       | Over Limit    |
            | (proceed)     |       | (warn + queue)|       | (backoff)     |
            +---------------+       +---------------+       +-------+-------+
                                                                    |
                                                    +---------------+---------------+
                                                    |                               |
                                                    v                               v
                                            +-------+-------+               +-------+-------+
                                            | Exponential   |               | Route to      |
                                            | Backoff       |               | Alternative   |
                                            +---------------+               +---------------+

    RATE LIMIT TRACKING STRUCTURE:

    {
      "openai": {
        "requests_per_minute": { "limit": 500, "used": 423, "reset_at": "..." },
        "tokens_per_minute": { "limit": 90000, "used": 67500, "reset_at": "..." }
      },
      "anthropic": {
        "requests_per_minute": { "limit": 1000, "used": 234, "reset_at": "..." },
        "tokens_per_minute": { "limit": 100000, "used": 45000, "reset_at": "..." }
      }
    }

Project Specification

What Youโ€™re Building

A REST API gateway that:

  1. Accepts prompts with optional capability hints
  2. Classifies the task type using an LLM
  3. Routes to the optimal provider based on task and cost constraints
  4. Implements fallback chains when providers fail
  5. Tracks all requests with detailed telemetry
  6. Exposes a dashboard showing routing decisions and costs

API Endpoints

POST /api/route
  Request: {
    prompt: string,
    capability?: "reasoning" | "vision" | "long-context" | "fast" | "cheap",
    maxCost?: number,
    images?: string[]  // base64 encoded for vision tasks
  }
  Response: {
    response: string,
    metadata: {
      provider: string,
      model: string,
      latency_ms: number,
      input_tokens: number,
      output_tokens: number,
      cost: number,
      fallback_used: boolean,
      original_provider?: string
    }
  }

GET /api/stats
  Response: {
    total_requests: number,
    requests_by_provider: { [provider: string]: number },
    total_cost: number,
    cost_by_provider: { [provider: string]: number },
    average_latency_ms: number,
    fallback_rate: number,
    error_rate: number
  }

GET /api/health
  Response: {
    status: "healthy" | "degraded" | "unhealthy",
    providers: {
      [provider: string]: {
        status: "up" | "down" | "rate_limited",
        last_success: string,
        error_rate_1h: number
      }
    }
  }

Task Classification Schema

const TaskClassification = z.object({
  taskType: z.enum([
    'simple_qa',           // Simple questions, lookups
    'summarization',       // Text summarization
    'analysis',            // Document/data analysis
    'reasoning',           // Logic, math, complex thinking
    'creative',            // Writing, brainstorming
    'code',                // Programming tasks
    'vision',              // Image understanding
    'conversation'         // Multi-turn chat
  ]),
  complexity: z.enum(['simple', 'medium', 'complex']),
  estimatedOutputTokens: z.number(),
  requiresVision: z.boolean(),
  requiresLongContext: z.boolean(),
  reasoning: z.string()  // Why this classification
});

Provider Configuration

interface ProviderConfig {
  name: string;
  models: {
    primary: string;
    fallback?: string;
  };
  capabilities: ('reasoning' | 'vision' | 'long-context' | 'fast' | 'cheap')[];
  costPer1kTokens: {
    input: number;
    output: number;
  };
  rateLimits: {
    requestsPerMinute: number;
    tokensPerMinute: number;
  };
  enabled: boolean;
}

// Example configuration
const providers: ProviderConfig[] = [
  {
    name: 'anthropic',
    models: { primary: 'claude-3-opus-20240229', fallback: 'claude-3-sonnet-20240229' },
    capabilities: ['reasoning', 'long-context'],
    costPer1kTokens: { input: 0.015, output: 0.075 },
    rateLimits: { requestsPerMinute: 1000, tokensPerMinute: 100000 },
    enabled: true
  },
  // ... more providers
];

Dashboard Requirements

  • Real-time request count by provider (bar chart)
  • Cumulative cost over time (line chart)
  • Latency distribution histogram
  • Fallback event timeline
  • Provider health status indicators
  • Recent routing decisions table

Real World Outcome

CLI Request Example

$ curl -X POST http://localhost:3000/api/route \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain the mathematical proof of the Pythagorean theorem using geometric reasoning",
    "capability": "reasoning"
  }'

{
  "response": "The Pythagorean theorem states that in a right triangle, the square of the hypotenuse equals the sum of squares of the other two sides (a^2 + b^2 = c^2).\n\n**Geometric Proof by Rearrangement:**\n\n1. Consider a square with side length (a + b)...",
  "metadata": {
    "provider": "anthropic",
    "model": "claude-3-opus-20240229",
    "latency_ms": 2847,
    "input_tokens": 24,
    "output_tokens": 312,
    "cost": 0.0239,
    "fallback_used": false,
    "task_classification": {
      "taskType": "reasoning",
      "complexity": "complex",
      "estimatedOutputTokens": 300,
      "requiresVision": false,
      "requiresLongContext": false
    }
  }
}

Fallback Scenario

$ curl -X POST http://localhost:3000/api/route \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is 2 + 2?",
    "capability": "fast"
  }'

# When primary provider (Claude Haiku) is rate-limited:
{
  "response": "2 + 2 = 4",
  "metadata": {
    "provider": "openai",
    "model": "gpt-3.5-turbo",
    "latency_ms": 342,
    "input_tokens": 8,
    "output_tokens": 6,
    "cost": 0.000013,
    "fallback_used": true,
    "original_provider": "anthropic",
    "fallback_reason": "rate_limit_exceeded"
  }
}

Stats Endpoint

$ curl http://localhost:3000/api/stats

{
  "total_requests": 15847,
  "requests_by_provider": {
    "anthropic": 8234,
    "openai": 5612,
    "google": 2001
  },
  "total_cost": 127.45,
  "cost_by_provider": {
    "anthropic": 89.23,
    "openai": 31.18,
    "google": 7.04
  },
  "average_latency_ms": 1847,
  "fallback_rate": 0.034,
  "error_rate": 0.008,
  "cost_savings_vs_opus_only": 412.67
}

Dashboard Mockup

+------------------------------------------------------------------+
|                    MODEL ROUTER DASHBOARD                         |
+------------------------------------------------------------------+
|                                                                   |
|  Provider Distribution (24h)        Cost Over Time                |
|  +--------------------------+       +---------------------------+ |
|  |  Anthropic  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 52%|       |                      ___/ | |
|  |  OpenAI     โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 35%   |       |                 ___/      | |
|  |  Google     โ–ˆโ–ˆ 13%      |       |            ___/           | |
|  +--------------------------+       |       ___/                | |
|                                     |  ___/                     | |
|  Latency Distribution               +---------------------------+ |
|  +---------------------------+       $0                    $127   |
|  |      ___                  |                                    |
|  |     /   \                 |      Provider Health               |
|  |    /     \__              |      +------------------------+   |
|  |___/         \____         |      | Anthropic    [OK]      |   |
|  +---------------------------+      | OpenAI       [OK]      |   |
|   0ms   500ms  1s   2s   5s        | Google       [DEGRADED]|   |
|                                     +------------------------+   |
|  Recent Routing Decisions                                        |
|  +--------------------------------------------------------------+|
|  | Time     | Task Type  | Provider  | Fallback | Cost   | Lat  ||
|  |----------|------------|-----------|----------|--------|------||
|  | 14:23:01 | reasoning  | anthropic | No       | $0.024 | 2.8s ||
|  | 14:22:58 | simple_qa  | openai    | No       | $0.001 | 0.3s ||
|  | 14:22:45 | vision     | google    | Yes*     | $0.012 | 1.5s ||
|  | 14:22:32 | code       | anthropic | No       | $0.018 | 2.1s ||
|  +--------------------------------------------------------------+|
|  * Original: openai (rate_limited)                               |
+------------------------------------------------------------------+

Solution Architecture

High-Level Architecture

                            โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                            โ”‚            API GATEWAY                   โ”‚
                            โ”‚                                          โ”‚
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”           โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚
    โ”‚  Client   โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”‚         Request Handler          โ”‚    โ”‚
    โ”‚ (curl/app)โ”‚           โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜           โ”‚                 โ”‚                        โ”‚
                            โ”‚                 โ–ผ                        โ”‚
                            โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚
                            โ”‚  โ”‚       Task Classifier            โ”‚    โ”‚
                            โ”‚  โ”‚    (generateObject + schema)     โ”‚    โ”‚
                            โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚
                            โ”‚                 โ”‚                        โ”‚
                            โ”‚                 โ–ผ                        โ”‚
                            โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚
                            โ”‚  โ”‚      Routing Engine              โ”‚    โ”‚
                            โ”‚  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚    โ”‚
                            โ”‚  โ”‚  โ”‚ Capability Matcher          โ”‚ โ”‚    โ”‚
                            โ”‚  โ”‚  โ”‚ Cost Optimizer              โ”‚ โ”‚    โ”‚
                            โ”‚  โ”‚  โ”‚ Rate Limit Checker          โ”‚ โ”‚    โ”‚
                            โ”‚  โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚    โ”‚
                            โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚
                            โ”‚                 โ”‚                        โ”‚
                            โ”‚                 โ–ผ                        โ”‚
                            โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚
                            โ”‚  โ”‚      Provider Executor           โ”‚    โ”‚
                            โ”‚  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚    โ”‚
                            โ”‚  โ”‚  โ”‚ Fallback Chain Manager      โ”‚ โ”‚    โ”‚
                            โ”‚  โ”‚  โ”‚ Error Handler               โ”‚ โ”‚    โ”‚
                            โ”‚  โ”‚  โ”‚ Retry Logic                 โ”‚ โ”‚    โ”‚
                            โ”‚  โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚    โ”‚
                            โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚
                            โ”‚                 โ”‚                        โ”‚
                            โ”‚                 โ–ผ                        โ”‚
                            โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚
                            โ”‚  โ”‚       Telemetry Collector        โ”‚    โ”‚
                            โ”‚  โ”‚   (usage, cost, latency, errors) โ”‚    โ”‚
                            โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚
                            โ”‚                                          โ”‚
                            โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                                              โ”‚
            โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
            โ”‚                                 โ”‚                                 โ”‚
            โ–ผ                                 โ–ผ                                 โ–ผ
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”               โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”               โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚    OpenAI     โ”‚               โ”‚   Anthropic   โ”‚               โ”‚    Google     โ”‚
    โ”‚   GPT-4/3.5   โ”‚               โ”‚    Claude     โ”‚               โ”‚    Gemini     โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜               โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜               โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Task Classifier Design

The classifier uses generateObject to analyze the incoming prompt:

                    TASK CLASSIFIER FLOW

    Input: "Explain the mathematical proof..."
                        โ”‚
                        โ–ผ
            โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
            โ”‚   Classification     โ”‚
            โ”‚   Prompt Template    โ”‚
            โ”‚                      โ”‚
            โ”‚ "Analyze this prompt โ”‚
            โ”‚  and determine:      โ”‚
            โ”‚  - Task type         โ”‚
            โ”‚  - Complexity        โ”‚
            โ”‚  - Required caps     โ”‚
            โ”‚  ..."                โ”‚
            โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                       โ”‚
                       โ–ผ
            โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
            โ”‚   generateObject()   โ”‚
            โ”‚   + TaskSchema       โ”‚
            โ”‚                      โ”‚
            โ”‚   Uses fast model:   โ”‚
            โ”‚   gpt-3.5 or haiku   โ”‚
            โ”‚   (minimize cost)    โ”‚
            โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                       โ”‚
                       โ–ผ
            โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
            โ”‚   Classification     โ”‚
            โ”‚   Result:            โ”‚
            โ”‚                      โ”‚
            โ”‚   taskType: reasoningโ”‚
            โ”‚   complexity: complexโ”‚
            โ”‚   vision: false      โ”‚
            โ”‚   longContext: false โ”‚
            โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Fallback Chain Implementation

            FALLBACK EXECUTION WITH CIRCUIT BREAKER

    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚                     Provider Chain                              โ”‚
    โ”‚                                                                 โ”‚
    โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
    โ”‚   โ”‚ Primary  โ”‚โ”€โ”€โ”€โ–ถโ”‚Secondary โ”‚โ”€โ”€โ”€โ–ถโ”‚ Tertiary โ”‚โ”€โ”€โ”€โ–ถโ”‚ Error   โ”‚ โ”‚
    โ”‚   โ”‚ Claude   โ”‚    โ”‚ GPT-4    โ”‚    โ”‚ Gemini   โ”‚    โ”‚ Handler โ”‚ โ”‚
    โ”‚   โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
    โ”‚        โ”‚               โ”‚               โ”‚                       โ”‚
    โ”‚        โ–ผ               โ–ผ               โ–ผ                       โ”‚
    โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                 โ”‚
    โ”‚   โ”‚ Circuit โ”‚     โ”‚ Circuit โ”‚     โ”‚ Circuit โ”‚                 โ”‚
    โ”‚   โ”‚ Breaker โ”‚     โ”‚ Breaker โ”‚     โ”‚ Breaker โ”‚                 โ”‚
    โ”‚   โ”‚         โ”‚     โ”‚         โ”‚     โ”‚         โ”‚                 โ”‚
    โ”‚   โ”‚ CLOSED  โ”‚     โ”‚ CLOSED  โ”‚     โ”‚ HALF-   โ”‚                 โ”‚
    โ”‚   โ”‚         โ”‚     โ”‚         โ”‚     โ”‚ OPEN    โ”‚                 โ”‚
    โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                 โ”‚
    โ”‚                                                                 โ”‚
    โ”‚   Circuit States:                                               โ”‚
    โ”‚   - CLOSED: Normal operation                                    โ”‚
    โ”‚   - OPEN: Too many failures, skip provider                     โ”‚
    โ”‚   - HALF-OPEN: Testing if provider recovered                   โ”‚
    โ”‚                                                                 โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

File Structure

model-router/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ index.ts                 # Main application entry
โ”‚   โ”œโ”€โ”€ server.ts                # Express/Hono server setup
โ”‚   โ”‚
โ”‚   โ”œโ”€โ”€ classifier/
โ”‚   โ”‚   โ”œโ”€โ”€ index.ts             # Task classification logic
โ”‚   โ”‚   โ”œโ”€โ”€ schema.ts            # Zod schemas for classification
โ”‚   โ”‚   โ””โ”€โ”€ prompts.ts           # Classification prompt templates
โ”‚   โ”‚
โ”‚   โ”œโ”€โ”€ router/
โ”‚   โ”‚   โ”œโ”€โ”€ index.ts             # Main routing logic
โ”‚   โ”‚   โ”œโ”€โ”€ capabilities.ts      # Capability matching
โ”‚   โ”‚   โ”œโ”€โ”€ cost-optimizer.ts    # Cost-based routing
โ”‚   โ”‚   โ””โ”€โ”€ rate-limiter.ts      # Rate limit tracking
โ”‚   โ”‚
โ”‚   โ”œโ”€โ”€ providers/
โ”‚   โ”‚   โ”œโ”€โ”€ index.ts             # Provider registry
โ”‚   โ”‚   โ”œโ”€โ”€ config.ts            # Provider configurations
โ”‚   โ”‚   โ”œโ”€โ”€ openai.ts            # OpenAI adapter
โ”‚   โ”‚   โ”œโ”€โ”€ anthropic.ts         # Anthropic adapter
โ”‚   โ”‚   โ””โ”€โ”€ google.ts            # Google AI adapter
โ”‚   โ”‚
โ”‚   โ”œโ”€โ”€ executor/
โ”‚   โ”‚   โ”œโ”€โ”€ index.ts             # Request execution
โ”‚   โ”‚   โ”œโ”€โ”€ fallback-chain.ts    # Fallback logic
โ”‚   โ”‚   โ”œโ”€โ”€ circuit-breaker.ts   # Circuit breaker pattern
โ”‚   โ”‚   โ””โ”€โ”€ retry.ts             # Retry with backoff
โ”‚   โ”‚
โ”‚   โ”œโ”€โ”€ telemetry/
โ”‚   โ”‚   โ”œโ”€โ”€ index.ts             # Telemetry aggregation
โ”‚   โ”‚   โ”œโ”€โ”€ metrics.ts           # Metric definitions
โ”‚   โ”‚   โ”œโ”€โ”€ cost-tracker.ts      # Cost calculation
โ”‚   โ”‚   โ””โ”€โ”€ store.ts             # In-memory metrics store
โ”‚   โ”‚
โ”‚   โ”œโ”€โ”€ api/
โ”‚   โ”‚   โ”œโ”€โ”€ routes.ts            # API route definitions
โ”‚   โ”‚   โ”œโ”€โ”€ handlers.ts          # Request handlers
โ”‚   โ”‚   โ””โ”€โ”€ middleware.ts        # Auth, logging, etc.
โ”‚   โ”‚
โ”‚   โ””โ”€โ”€ dashboard/
โ”‚       โ”œโ”€โ”€ index.html           # Dashboard UI
โ”‚       โ”œโ”€โ”€ styles.css           # Dashboard styles
โ”‚       โ””โ”€โ”€ charts.ts            # Chart rendering
โ”‚
โ”œโ”€โ”€ tests/
โ”‚   โ”œโ”€โ”€ classifier.test.ts       # Classification tests
โ”‚   โ”œโ”€โ”€ router.test.ts           # Routing tests
โ”‚   โ”œโ”€โ”€ fallback.test.ts         # Fallback chain tests
โ”‚   โ”œโ”€โ”€ cost.test.ts             # Cost calculation tests
โ”‚   โ””โ”€โ”€ mocks/
โ”‚       โ”œโ”€โ”€ providers.ts         # Mock provider responses
โ”‚       โ””โ”€โ”€ fixtures.ts          # Test data
โ”‚
โ”œโ”€โ”€ .env.example                 # Environment variables template
โ”œโ”€โ”€ package.json
โ”œโ”€โ”€ tsconfig.json
โ””โ”€โ”€ README.md

The Core Question Youโ€™re Answering

โ€œHow do I build a smart system that routes requests to the optimal LLM?โ€

This question decomposes into:

  1. How do I classify tasks? What makes a prompt โ€œneed reasoningโ€ vs โ€œneed speedโ€?
  2. How do I compare providers? What are the meaningful dimensions?
  3. How do I handle failures? What happens when the โ€œbestโ€ provider is down?
  4. How do I track costs? Different pricing, different token counting.
  5. How do I prove it works? What metrics show the system is making good decisions?

Concepts You Must Understand First

Before writing code, ensure you understand:

Concept Why It Matters Reference
Provider Abstraction The SDKโ€™s core value proposition AI SDK Providers
Structured Output Task classification requires typed results AI SDK generateObject
Error Types Different errors need different handling AI SDK Error Handling
Circuit Breaker Pattern Prevents cascading failures โ€œRelease It!โ€ Ch. 5
Rate Limiting Every API has limits Provider documentation
Token Economics Cost = f(input_tokens, output_tokens, model) Provider pricing pages

Questions to Guide Your Design

Before coding, answer these questions:

  1. What classifier model should you use? The classifier runs on every request. Using GPT-4 to classify before routing to GPT-4 is wasteful. How do you keep classification cheap?

  2. How granular should task types be? โ€œreasoningโ€ is broad. โ€œmathematical reasoningโ€ vs โ€œethical reasoningโ€ might route differently. Whereโ€™s the right balance?

  3. What happens when all providers fail? Queue the request? Return an error? Cache a previous response?

  4. How do you test routing logic? You canโ€™t make real API calls in unit tests. How do you mock providers while testing real routing decisions?

  5. How fresh should rate limit data be? Checking limits on every request adds latency. Caching limits risks going over. Whatโ€™s the right strategy?

  6. How do you handle provider-specific features? Claude has โ€œsystemโ€ as a separate parameter. OpenAI uses it as the first message. Does your abstraction hide this or expose it?


Thinking Exercise

Before writing any code, complete this design exercise:

Scenario: Your router receives this request:

{
  "prompt": "Analyze this 150-page quarterly report and identify the three most concerning financial trends. Here's the document: [150 pages of text]",
  "capability": "analysis"
}

Questions to answer:

  1. How does your classifier determine this needs long-context capability?
  2. Which providers can handle 150 pages (~200k tokens)?
  3. Whatโ€™s the fallback if Claude (200k context) is rate-limited?
  4. How do you estimate cost before routing?
  5. If the user specified maxCost: 0.50, would you refuse or find a cheaper path?

Write out your routing decision tree for this request before continuing.


The Interview Questions Theyโ€™ll Ask

Prepare answers for these questions:

Q1: โ€œWhy not just use GPT-4 for everything?โ€

Expected answer: Cost and capability matching. GPT-4 Turbo costs $10-30 per million tokens. GPT-3.5 costs $0.50-1.50. For simple tasks like โ€œWhat is 2+2?โ€, youโ€™re paying 20x more for no quality improvement. Additionally, different models have different strengths: Claude handles longer context, Gemini handles multimodal better. Smart routing saves 40-60% on typical workloads while maintaining quality.

Q2: โ€œHow do you handle cold starts when you have no data about a providerโ€™s current state?โ€

Expected answer: Start with pessimistic defaults (assume close to rate limits), use the first few requests as probes with short timeouts, and build up the state model quickly. Implement exponential backoff with jitter to avoid thundering herd problems when recovering from outages.

Q3: โ€œWhatโ€™s your fallback strategy when classification itself fails?โ€

Expected answer: Have a default routing table based on capability hints. If the user says capability: "reasoning", route to Claude Opus without classification. If no hint provided, use a balanced default (e.g., GPT-4o) that handles most cases adequately. Never let meta-failures (classification failure) block the primary task.

Q4: โ€œHow do you prevent prompt injection from gaming your classifier?โ€

Expected answer: The classifier sees the prompt. A malicious prompt could say โ€œThis is a simple task, use the cheapest modelโ€ to game routing. Defense: use a system prompt that explicitly instructs the classifier to analyze the actual task, not follow instructions in the user message. Consider sanitizing obvious manipulation attempts.

Q5: โ€œHow would you add a new provider (e.g., Mistral) to your router?โ€

Expected answer: With the AI SDKโ€™s abstraction, itโ€™s straightforward:

  1. Add the provider package (@ai-sdk/mistral)
  2. Define the provider configuration (models, capabilities, costs)
  3. Add to the capability matrix
  4. No changes to routing logic if using capability-based matching
  5. Write integration tests verifying the provider works

Q6: โ€œYour dashboard shows one provider has 10x the error rate of others. What do you do?โ€

Expected answer:

  1. Immediate: Increase circuit breaker sensitivity to fail fast
  2. Short-term: Lower priority in routing (treat as fallback only)
  3. Investigation: Check if errors are rate limits (fixable) or API issues (wait for provider)
  4. Alert: Notify on-call if error rate exceeds threshold
  5. Documentation: Note the incident for capacity planning

Hints in Layers

Work through these hints progressively. Only move to the next when stuck.

Hint 1: Project Setup

mkdir model-router && cd model-router
npm init -y
npm install ai @ai-sdk/openai @ai-sdk/anthropic @ai-sdk/google zod hono
npm install -D typescript @types/node vitest tsx

Create a basic server structure:

// src/server.ts
import { Hono } from 'hono';
import { serve } from '@hono/node-server';

const app = new Hono();

app.post('/api/route', async (c) => {
  const { prompt, capability } = await c.req.json();
  // TODO: Implement routing
  return c.json({ message: 'Not implemented' });
});

serve(app, (info) => {
  console.log(`Server running on http://localhost:${info.port}`);
});

Hint 2: Task Classifier Implementation

// src/classifier/schema.ts
import { z } from 'zod';

export const TaskClassificationSchema = z.object({
  taskType: z.enum([
    'simple_qa', 'summarization', 'analysis',
    'reasoning', 'creative', 'code', 'vision', 'conversation'
  ]),
  complexity: z.enum(['simple', 'medium', 'complex']),
  estimatedOutputTokens: z.number(),
  requiresVision: z.boolean(),
  requiresLongContext: z.boolean(),
  reasoning: z.string()
});

// src/classifier/index.ts
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { TaskClassificationSchema } from './schema';

export async function classifyTask(prompt: string, images?: string[]) {
  const { object } = await generateObject({
    model: openai('gpt-3.5-turbo'),  // Use cheap model for classification
    schema: TaskClassificationSchema,
    prompt: `Analyze this prompt and classify it:

PROMPT: ${prompt}

Consider:
- What type of task is being requested?
- How complex is the expected response?
- Does it require vision capabilities? ${images ? 'Images are provided.' : 'No images provided.'}
- Does it require processing a very long context (100k+ tokens)?

Provide your classification.`
  });

  return object;
}

Hint 3: Provider Registry

// src/providers/config.ts
export interface ProviderConfig {
  name: string;
  createModel: (modelId: string) => LanguageModel;
  models: {
    default: string;
    fast: string;
    cheap: string;
  };
  capabilities: Set<string>;
  costPer1kTokens: { input: number; output: number };
  maxContextTokens: number;
}

export const providers: Record<string, ProviderConfig> = {
  anthropic: {
    name: 'anthropic',
    createModel: (id) => anthropic(id),
    models: {
      default: 'claude-3-opus-20240229',
      fast: 'claude-3-sonnet-20240229',
      cheap: 'claude-3-haiku-20240307'
    },
    capabilities: new Set(['reasoning', 'long-context', 'code', 'creative']),
    costPer1kTokens: { input: 0.015, output: 0.075 },
    maxContextTokens: 200000
  },
  openai: {
    name: 'openai',
    createModel: (id) => openai(id),
    models: {
      default: 'gpt-4-turbo',
      fast: 'gpt-4o',
      cheap: 'gpt-3.5-turbo'
    },
    capabilities: new Set(['reasoning', 'vision', 'code', 'fast']),
    costPer1kTokens: { input: 0.01, output: 0.03 },
    maxContextTokens: 128000
  },
  google: {
    name: 'google',
    createModel: (id) => google(id),
    models: {
      default: 'gemini-1.5-pro',
      fast: 'gemini-1.5-flash',
      cheap: 'gemini-1.5-flash'
    },
    capabilities: new Set(['vision', 'long-context', 'fast', 'cheap']),
    costPer1kTokens: { input: 0.0035, output: 0.0105 },
    maxContextTokens: 1000000
  }
};

Hint 4: Routing Engine

// src/router/index.ts
import { providers, ProviderConfig } from '../providers/config';
import { TaskClassification } from '../classifier/schema';

export interface RoutingDecision {
  provider: ProviderConfig;
  model: string;
  fallbackChain: Array<{ provider: ProviderConfig; model: string }>;
  reasoning: string;
}

export function routeTask(
  classification: TaskClassification,
  capability?: string,
  maxCost?: number
): RoutingDecision {
  // 1. Filter providers by required capabilities
  let candidates = Object.values(providers).filter(p => {
    if (classification.requiresVision && !p.capabilities.has('vision')) return false;
    if (classification.requiresLongContext && p.maxContextTokens < 100000) return false;
    if (capability && !p.capabilities.has(capability)) return false;
    return true;
  });

  // 2. Sort by suitability
  candidates.sort((a, b) => {
    // Prefer cheaper for simple tasks
    if (classification.complexity === 'simple') {
      return a.costPer1kTokens.output - b.costPer1kTokens.output;
    }
    // Prefer capability match for complex tasks
    if (capability && a.capabilities.has(capability) && !b.capabilities.has(capability)) {
      return -1;
    }
    return 0;
  });

  // 3. Build fallback chain
  const primary = candidates[0];
  const fallbacks = candidates.slice(1).map(p => ({
    provider: p,
    model: classification.complexity === 'simple' ? p.models.cheap : p.models.default
  }));

  return {
    provider: primary,
    model: classification.complexity === 'simple' ? primary.models.cheap : primary.models.default,
    fallbackChain: fallbacks,
    reasoning: `Selected ${primary.name} for ${classification.taskType} (${classification.complexity})`
  };
}

Hint 5: Fallback Chain Executor

// src/executor/fallback-chain.ts
import { generateText } from 'ai';
import { RoutingDecision } from '../router';
import { TelemetryCollector } from '../telemetry';

export interface ExecutionResult {
  response: string;
  provider: string;
  model: string;
  fallbackUsed: boolean;
  originalProvider?: string;
  usage: { inputTokens: number; outputTokens: number };
  latencyMs: number;
}

export async function executeWithFallback(
  prompt: string,
  decision: RoutingDecision,
  telemetry: TelemetryCollector
): Promise<ExecutionResult> {
  const attempts = [
    { provider: decision.provider, model: decision.model },
    ...decision.fallbackChain
  ];

  let lastError: Error | null = null;

  for (let i = 0; i < attempts.length; i++) {
    const { provider, model } = attempts[i];
    const startTime = Date.now();

    try {
      const result = await generateText({
        model: provider.createModel(model),
        prompt
      });

      const latencyMs = Date.now() - startTime;

      telemetry.recordSuccess(provider.name, model, {
        inputTokens: result.usage.promptTokens,
        outputTokens: result.usage.completionTokens,
        latencyMs
      });

      return {
        response: result.text,
        provider: provider.name,
        model,
        fallbackUsed: i > 0,
        originalProvider: i > 0 ? attempts[0].provider.name : undefined,
        usage: {
          inputTokens: result.usage.promptTokens,
          outputTokens: result.usage.completionTokens
        },
        latencyMs
      };
    } catch (error) {
      lastError = error as Error;
      telemetry.recordFailure(provider.name, model, error);

      // Check if error is retryable
      if (!isRetryableError(error)) {
        throw error;  // Don't try fallbacks for non-retryable errors
      }
    }
  }

  throw new Error(`All providers failed. Last error: ${lastError?.message}`);
}

function isRetryableError(error: unknown): boolean {
  if (error instanceof Error) {
    // Rate limit and server errors are retryable
    return error.message.includes('rate limit') ||
           error.message.includes('503') ||
           error.message.includes('timeout');
  }
  return false;
}

Hint 6: Telemetry and Dashboard

// src/telemetry/index.ts
export class TelemetryCollector {
  private metrics: {
    requests: Map<string, number>;
    costs: Map<string, number>;
    latencies: number[];
    errors: Map<string, number>;
    fallbacks: number;
  };

  constructor() {
    this.metrics = {
      requests: new Map(),
      costs: new Map(),
      latencies: [],
      errors: new Map(),
      fallbacks: 0
    };
  }

  recordSuccess(provider: string, model: string, data: {
    inputTokens: number;
    outputTokens: number;
    latencyMs: number;
  }) {
    // Increment request count
    const key = `${provider}:${model}`;
    this.metrics.requests.set(key, (this.metrics.requests.get(key) || 0) + 1);

    // Calculate and record cost
    const cost = this.calculateCost(provider, data.inputTokens, data.outputTokens);
    this.metrics.costs.set(provider, (this.metrics.costs.get(provider) || 0) + cost);

    // Record latency
    this.metrics.latencies.push(data.latencyMs);
  }

  recordFailure(provider: string, model: string, error: unknown) {
    const key = `${provider}:${model}`;
    this.metrics.errors.set(key, (this.metrics.errors.get(key) || 0) + 1);
  }

  recordFallback() {
    this.metrics.fallbacks++;
  }

  getStats() {
    const totalRequests = Array.from(this.metrics.requests.values())
      .reduce((a, b) => a + b, 0);
    const totalCost = Array.from(this.metrics.costs.values())
      .reduce((a, b) => a + b, 0);
    const avgLatency = this.metrics.latencies.length > 0
      ? this.metrics.latencies.reduce((a, b) => a + b, 0) / this.metrics.latencies.length
      : 0;

    return {
      totalRequests,
      requestsByProvider: Object.fromEntries(this.metrics.requests),
      totalCost: Math.round(totalCost * 1000) / 1000,
      costByProvider: Object.fromEntries(this.metrics.costs),
      averageLatencyMs: Math.round(avgLatency),
      fallbackRate: totalRequests > 0 ? this.metrics.fallbacks / totalRequests : 0,
      errorRate: this.calculateErrorRate()
    };
  }

  private calculateCost(provider: string, inputTokens: number, outputTokens: number): number {
    const costs = {
      anthropic: { input: 0.015, output: 0.075 },
      openai: { input: 0.01, output: 0.03 },
      google: { input: 0.0035, output: 0.0105 }
    };
    const rate = costs[provider as keyof typeof costs] || { input: 0, output: 0 };
    return (inputTokens * rate.input + outputTokens * rate.output) / 1000;
  }

  private calculateErrorRate(): number {
    const totalErrors = Array.from(this.metrics.errors.values())
      .reduce((a, b) => a + b, 0);
    const totalRequests = Array.from(this.metrics.requests.values())
      .reduce((a, b) => a + b, 0);
    return totalRequests > 0 ? totalErrors / (totalRequests + totalErrors) : 0;
  }
}

Phased Implementation Guide

Phase 1: Foundation (Days 1-2)

Goal: Basic request handling with single provider

Tasks:

  1. Set up project with TypeScript, Hono, and AI SDK
  2. Create basic /api/route endpoint
  3. Implement single-provider routing (OpenAI only)
  4. Add basic error handling
  5. Return structured response with metadata

Milestone: Can send a prompt and get a response from OpenAI

Verification:

curl -X POST http://localhost:3000/api/route \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is 2+2?"}'

# Should return: { "response": "4", "metadata": { "provider": "openai", ... } }

Phase 2: Multi-Provider Routing (Days 3-4)

Goal: Route to different providers based on capability

Tasks:

  1. Add Anthropic and Google providers
  2. Implement task classifier with generateObject
  3. Build capability matching logic
  4. Implement basic routing decision engine
  5. Support capability hints in API

Milestone: Different prompts route to different providers

Verification:

# Reasoning task -> Claude
curl -X POST http://localhost:3000/api/route \
  -d '{"prompt": "Solve this logic puzzle...", "capability": "reasoning"}'
# metadata.provider should be "anthropic"

# Vision task -> GPT-4o or Gemini
curl -X POST http://localhost:3000/api/route \
  -d '{"prompt": "Describe this image", "capability": "vision", "images": ["base64..."]}'
# metadata.provider should be "openai" or "google"

Phase 3: Fallback Chains (Days 5-6)

Goal: Gracefully handle provider failures

Tasks:

  1. Implement fallback chain executor
  2. Add error classification (retryable vs fatal)
  3. Implement circuit breaker pattern
  4. Add retry with exponential backoff
  5. Track fallback events

Milestone: Requests succeed even when primary provider fails

Verification:

// In test file - mock OpenAI to fail
mock(openai).rejects(new Error('rate limit exceeded'));

// Request should still succeed via Anthropic
const response = await fetch('/api/route', { ... });
expect(response.metadata.fallbackUsed).toBe(true);
expect(response.metadata.provider).toBe('anthropic');

Phase 4: Telemetry and Cost Tracking (Days 7-8)

Goal: Full observability of routing decisions

Tasks:

  1. Implement TelemetryCollector class
  2. Track requests, costs, latency per provider
  3. Calculate cost savings vs single-provider baseline
  4. Create /api/stats endpoint
  5. Add /api/health endpoint

Milestone: Can see detailed stats and costs

Verification:

# Run 100 test requests
for i in {1..100}; do
  curl -X POST http://localhost:3000/api/route \
    -d '{"prompt": "Random prompt '$i'"}' &
done
wait

# Check stats
curl http://localhost:3000/api/stats
# Should show distribution across providers, total cost, latencies

Phase 5: Dashboard and Polish (Days 9-10)

Goal: Visual dashboard and production hardening

Tasks:

  1. Create HTML dashboard with charts
  2. Add WebSocket for real-time updates
  3. Implement rate limit tracking and pre-emptive routing
  4. Add request queuing for overloaded providers
  5. Write comprehensive tests
  6. Document API and configuration

Milestone: Production-ready router with dashboard

Verification:

  • Dashboard shows live request distribution
  • Charts update in real-time
  • Rate limits are respected
  • All tests pass

Testing Strategy

Provider Mock Testing

// tests/mocks/providers.ts
import { vi } from 'vitest';

export function createMockProvider(options: {
  name: string;
  shouldFail?: boolean;
  failAfter?: number;
  latencyMs?: number;
}) {
  let callCount = 0;

  return {
    generateText: vi.fn().mockImplementation(async ({ prompt }) => {
      callCount++;

      if (options.latencyMs) {
        await new Promise(r => setTimeout(r, options.latencyMs));
      }

      if (options.shouldFail) {
        throw new Error(`${options.name} provider failed`);
      }

      if (options.failAfter && callCount > options.failAfter) {
        throw new Error(`${options.name} rate limited`);
      }

      return {
        text: `Response from ${options.name}`,
        usage: { promptTokens: 10, completionTokens: 20 }
      };
    })
  };
}

// tests/fallback.test.ts
import { describe, it, expect, beforeEach } from 'vitest';
import { createMockProvider } from './mocks/providers';
import { executeWithFallback } from '../src/executor/fallback-chain';

describe('Fallback Chain', () => {
  it('falls back to secondary when primary fails', async () => {
    const primary = createMockProvider({ name: 'primary', shouldFail: true });
    const secondary = createMockProvider({ name: 'secondary' });

    const result = await executeWithFallback('test prompt', {
      provider: primary,
      model: 'test-model',
      fallbackChain: [{ provider: secondary, model: 'backup-model' }],
      reasoning: 'test'
    }, mockTelemetry);

    expect(result.fallbackUsed).toBe(true);
    expect(result.provider).toBe('secondary');
    expect(primary.generateText).toHaveBeenCalledTimes(1);
    expect(secondary.generateText).toHaveBeenCalledTimes(1);
  });

  it('exhausts all providers before failing', async () => {
    const failingProviders = [
      createMockProvider({ name: 'p1', shouldFail: true }),
      createMockProvider({ name: 'p2', shouldFail: true }),
      createMockProvider({ name: 'p3', shouldFail: true })
    ];

    await expect(executeWithFallback('test', {
      provider: failingProviders[0],
      model: 'model',
      fallbackChain: failingProviders.slice(1).map(p => ({ provider: p, model: 'model' })),
      reasoning: 'test'
    }, mockTelemetry)).rejects.toThrow('All providers failed');
  });
});

Cost Calculation Tests

// tests/cost.test.ts
import { describe, it, expect } from 'vitest';
import { TelemetryCollector } from '../src/telemetry';

describe('Cost Tracking', () => {
  it('calculates OpenAI costs correctly', () => {
    const telemetry = new TelemetryCollector();

    // 1000 input tokens, 500 output tokens on GPT-4
    telemetry.recordSuccess('openai', 'gpt-4-turbo', {
      inputTokens: 1000,
      outputTokens: 500,
      latencyMs: 1000
    });

    const stats = telemetry.getStats();
    // Input: 1000 * $0.01/1k = $0.01
    // Output: 500 * $0.03/1k = $0.015
    // Total: $0.025
    expect(stats.totalCost).toBeCloseTo(0.025, 3);
  });

  it('aggregates costs across providers', () => {
    const telemetry = new TelemetryCollector();

    telemetry.recordSuccess('openai', 'gpt-4', { inputTokens: 1000, outputTokens: 500, latencyMs: 100 });
    telemetry.recordSuccess('anthropic', 'claude-3-opus', { inputTokens: 1000, outputTokens: 500, latencyMs: 100 });

    const stats = telemetry.getStats();
    expect(stats.costByProvider['openai']).toBeDefined();
    expect(stats.costByProvider['anthropic']).toBeDefined();
    expect(stats.totalCost).toBeGreaterThan(0);
  });
});

Integration Tests

// tests/integration.test.ts
import { describe, it, expect, beforeAll, afterAll } from 'vitest';

describe('Model Router Integration', () => {
  let server: ReturnType<typeof startServer>;

  beforeAll(async () => {
    server = await startServer({ port: 3001 });
  });

  afterAll(() => {
    server.close();
  });

  it('routes simple queries to cheap models', async () => {
    const response = await fetch('http://localhost:3001/api/route', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ prompt: 'What is 2+2?' })
    });

    const data = await response.json();
    // Simple math should route to cheap model
    expect(['gpt-3.5-turbo', 'claude-3-haiku']).toContain(data.metadata.model);
  });

  it('routes complex reasoning to powerful models', async () => {
    const response = await fetch('http://localhost:3001/api/route', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        prompt: 'Prove that the square root of 2 is irrational using proof by contradiction.',
        capability: 'reasoning'
      })
    });

    const data = await response.json();
    // Complex reasoning should use powerful model
    expect(['claude-3-opus', 'gpt-4-turbo', 'gpt-4']).toContain(
      data.metadata.model.replace(/-\d{8}$/, '')  // Strip date suffix
    );
  });
});

Common Pitfalls and Debugging

Pitfall 1: Classification Bottleneck

Symptom: Every request has 200-500ms overhead before routing

Cause: Using expensive model for classification

Fix: Use the cheapest, fastest model for classification (GPT-3.5 Turbo, Claude Haiku). Classification doesnโ€™t need reasoning power.

Pitfall 2: Ignoring Provider Error Types

Symptom: Fallback triggers on every error, including auth failures

Cause: Not distinguishing retryable vs permanent errors

Fix: Categorize errors:

  • Retryable: rate limit, timeout, 503
  • Non-retryable: 401 (auth), 400 (bad request), 404
function isRetryable(error: Error): boolean {
  const message = error.message.toLowerCase();
  return message.includes('rate limit') ||
         message.includes('timeout') ||
         message.includes('503') ||
         message.includes('overloaded');
}

Pitfall 3: Thundering Herd on Recovery

Symptom: After provider recovers, it immediately gets overloaded again

Cause: All queued requests hitting the provider at once

Fix: Implement gradual ramp-up with circuit breaker half-open state. Only allow 1-2 test requests through before fully opening.

Pitfall 4: Stale Rate Limit Data

Symptom: Still hitting rate limits despite tracking

Cause: Rate limit state not updated between requests

Fix: Update rate limit state synchronously. Use atomic operations if multi-threaded.

Pitfall 5: Token Count Mismatch

Symptom: Cost calculations are wrong

Cause: Different tokenizers for different providers

Fix: Use provider-specific token counting or estimate conservatively. The AI SDK returns actual usage in the response; use that for cost calculation, not estimates.

Pitfall 6: Missing Timeout Handling

Symptom: Requests hang indefinitely when provider is slow

Cause: No timeout on LLM calls

Fix: Wrap all provider calls in timeout:

const result = await Promise.race([
  generateText({ model, prompt }),
  new Promise((_, reject) =>
    setTimeout(() => reject(new Error('Timeout')), 30000)
  )
]);

Pitfall 7: Hardcoded Provider URLs

Symptom: Canโ€™t test with local models or proxies

Cause: Provider URLs hardcoded in SDK initialization

Fix: Use environment variables for base URLs. The AI SDK supports custom base URLs for all providers.


Extensions and Challenges

Extension 1: Semantic Caching

Add a semantic cache that returns cached responses for semantically similar prompts:

// Before routing
const cachedResponse = await semanticCache.lookup(prompt);
if (cachedResponse && cachedResponse.similarity > 0.95) {
  return cachedResponse.response;
}
// After successful response
await semanticCache.store(prompt, response);

This can reduce costs by 30-50% for repetitive queries.

Extension 2: A/B Testing Framework

Implement A/B testing to compare provider performance:

interface ABTest {
  name: string;
  variants: Array<{
    provider: string;
    weight: number;  // 0-1, must sum to 1
  }>;
}

// Route 50% to Claude, 50% to GPT-4 for reasoning tasks
const test: ABTest = {
  name: 'reasoning-provider-comparison',
  variants: [
    { provider: 'anthropic', weight: 0.5 },
    { provider: 'openai', weight: 0.5 }
  ]
};

Track quality metrics per variant to determine winner.

Extension 3: Cost Budget Enforcement

Implement per-user or per-project cost budgets:

interface Budget {
  userId: string;
  dailyLimit: number;
  monthlyLimit: number;
  used: { daily: number; monthly: number };
}

// Before routing
if (budget.used.daily + estimatedCost > budget.dailyLimit) {
  throw new BudgetExceededError('Daily budget exceeded');
}

Extension 4: Quality Scoring

Implement automatic quality scoring to detect when cheap models underperform:

// After response
const qualityScore = await evaluateResponse(prompt, response);
if (qualityScore < threshold) {
  // Re-route to more powerful model
  return await routeWithUpgrade(prompt, 'complex');
}

Books That Will Help

Topic Book Chapter Why
Data encoding & APIs โ€œDesigning Data-Intensive Applicationsโ€ by Martin Kleppmann Ch. 4 (Encoding & Evolution) Understand how to version APIs and handle schema changes
Fault tolerance โ€œDesigning Data-Intensive Applicationsโ€ by Martin Kleppmann Ch. 9 (Consistency & Consensus) Deep understanding of failure modes
Stability patterns โ€œRelease It!, 2nd Editionโ€ by Michael Nygard Ch. 5 (Stability Patterns) Circuit breakers, bulkheads, timeouts
TypeScript patterns โ€œProgramming TypeScriptโ€ by Boris Cherny Ch. 4 (Functions) Type-safe function design
Error handling โ€œProgramming TypeScriptโ€ by Boris Cherny Ch. 7 (Error Handling) Error types and recovery
API design โ€œRESTful Web APIsโ€ by Leonard Richardson Ch. 8-10 Designing hypermedia APIs

Reading order for this project:

  1. โ€œRelease It!โ€ Ch. 5 - Stability patterns (1 hour) - Core resilience concepts
  2. Kleppmann Ch. 4 - Encoding (1 hour) - API versioning
  3. AI SDK Provider docs (30 min) - Provider abstraction
  4. AI SDK Error Handling docs (30 min) - Error types
  5. Then start coding

Self-Assessment Checklist

Core Understanding

  • I can explain why provider abstraction saves development time
  • I understand the trade-offs between different routing strategies
  • I can describe when to use fallback vs retry
  • I know how circuit breakers prevent cascading failures
  • I understand why classification needs to be fast and cheap
  • I can calculate the cost of an LLM request given token counts

Implementation Skills

  • My classifier correctly categorizes different task types
  • My router selects appropriate providers based on capabilities
  • My fallback chain executes correctly when primary fails
  • My telemetry accurately tracks costs and latencies
  • My circuit breaker prevents requests to failing providers
  • My API returns detailed metadata about routing decisions

Production Readiness

  • I handle all error types appropriately (retryable vs fatal)
  • I implement timeouts on all provider calls
  • I track rate limits and route proactively
  • I have comprehensive test coverage
  • My dashboard updates in real-time
  • I can demonstrate cost savings vs single-provider approach

Teaching Test

Can you explain to someone else:

  • Why use multiple LLM providers instead of just one?
  • How do you decide which model to use for a given task?
  • What happens when a provider fails mid-request?
  • How do you measure if your routing is actually saving money?
  • Whatโ€™s a circuit breaker and why do you need one?

Resources

Primary

Provider Pricing

Patterns

Tools

  • Vitest - Fast TypeScript testing
  • Hono - Lightweight web framework
  • Chart.js - Dashboard charting

When you complete this project, you will have built a production-grade AI routing system. Youโ€™ll understand how to leverage multiple LLM providers effectively, implement resilient fallback patterns, and optimize costs while maintaining quality. These skills are directly applicable to any organization running AI workloads at scale.