Project 4: Provider Failover and Cost Router

Quick Reference

Attribute     Value
Difficulty    4
Time          2-3 weeks
Main Stack    req_llm + fallback policies + config-driven keys
Alternatives  Hardcoded provider routing
Why Now       Production incidents often stem from provider instability and silent cost spikes

What You Will Build

A policy-driven routing service that uses provider metadata and cost constraints to choose execution paths, with automatic failover and explainable fallback.

Real World Outcome

$ mix run -e "RouterDemo.simulate(:golden_path)"
[info] request_id=fa19c1 policy_route=provider=groq:llama-3.3 provider_score=0.94 budget_usd=0.0035
[info] attempt_1 failed=timeout reason=provider_unreachable
[warn] fallback=anthropic:claude-3.5 selected
[info] final cost=0.0034 latency_ms=980
[info] explainability_event=recorded

When failover triggers, operators can inspect exactly:

  • original route
  • failure cause
  • chosen fallback
  • cost delta versus baseline

The Core Question You Are Answering

“How do we preserve product reliability under provider degradation without violating budget constraints?”

Why This Project Matters

req_llm exposes standardized usage metadata and provider registry information, so failover can be deterministic and auditable rather than ad-hoc.

Big-Picture Architecture

Incoming Request
       |
       v
 +------------------+
 | Route Policy DSL |
 +--------+---------+
          |
          v
 +----------------------------+
 | Candidate Planner          |
 | - budget-aware filtering   |
 | - latency SLA filtering    |
 | - capability compatibility |
 +------------+---------------+
              |
              v
 +----------------------------+
 | Execute Attempt #1         |
 +------------+---------------+
              |
      +-------+-------+
      | success/fail  |
      +-------+-------+
              | fail
              v
 +----------------------------+
 | Fault Classifier           |
 | timeout/5xx/over_budget    |
 +------------+---------------+
              |
              v
 +----------------------------+
 | Fallback Planner           |
 | attempts, cooldown, budget |
 +------------+---------------+
              |
              v
    +---------+---------+
    | Chosen Provider   |
    | + metadata + usage|
    +-------------------+
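
The Candidate Planner stage can be sketched as a pure function that applies the three hard filters and then ranks what is left. This is a minimal sketch, not part of req_llm: the module name, the candidate fields (`est_cost_usd`, `p95_latency_ms`, `caps`, `score`), and the request fields are all hypothetical.

```elixir
defmodule RouterSketch.CandidatePlanner do
  @moduledoc "Hypothetical sketch: filter candidates by hard caps, then rank."

  # Candidates and requests are plain maps; every field name is an assumption.
  def plan(candidates, request) do
    candidates
    |> Enum.filter(&within_budget?(&1, request))
    |> Enum.filter(&within_latency?(&1, request))
    |> Enum.filter(&capable?(&1, request))
    # Highest score first, so the head of the list is attempt #1.
    |> Enum.sort_by(& &1.score, :desc)
  end

  defp within_budget?(c, req), do: c.est_cost_usd <= req.budget_usd
  defp within_latency?(c, req), do: c.p95_latency_ms <= req.max_latency_ms
  defp capable?(c, req), do: Enum.all?(req.required_caps, &(&1 in c.caps))
end
```

Because the planner performs no I/O, every hard cap can be unit-tested without a network call, and the ordered result doubles as the fallback order.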

Key Design Work

1. Cost-Aware Policy

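  • Derive a normalized unit cost (for example, USD per 1,000 tokens) from provider metadata when it is available.
  • Apply guardrails for outlier responses, such as a cap on output tokens, so that one unusually long completion cannot exceed the budget.
  • If cost metadata is absent, route to a fallback provider whose pricing is known and leave a safety margin under the budget.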
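
One way to express the cost check is a pure estimator that returns `:unknown` when pricing metadata is missing, so the caller can move on to a provider with known pricing. This is a sketch under assumptions: the `input_per_million` and `output_per_million` field names and the per-million-token unit are hypothetical and may not match how your metadata is shaped.

```elixir
defmodule RouterSketch.CostPolicy do
  # Assumption: prices are quoted in USD per one million tokens.
  @tokens_per_unit 1_000_000

  # Returns {:ok, usd} when pricing metadata is present, :unknown otherwise.
  def estimate(%{input_per_million: inp, output_per_million: out}, in_tokens, out_tokens)
      when is_number(inp) and is_number(out) do
    {:ok, (inp * in_tokens + out * out_tokens) / @tokens_per_unit}
  end

  def estimate(_pricing, _in_tokens, _out_tokens), do: :unknown

  # Decides whether a candidate may run under the request budget.
  def check(pricing, in_tokens, out_tokens, budget_usd) do
    case estimate(pricing, in_tokens, out_tokens) do
      {:ok, cost} when cost <= budget_usd -> {:allow, cost}
      {:ok, cost} -> {:reject, {:over_budget, cost}}
      :unknown -> {:reject, :no_cost_metadata}
    end
  end
end
```

Passing the configured maximum output tokens as `out_tokens` gives a worst-case estimate, which is one simple guardrail against outlier responses.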

2. Failure Classifier

  • Distinguish transient (timeout/network), hard (invalid key/model), and semantic (bad schema/mode).
  • Only transport-level failures (transient or hard) should trigger fallback; semantic failures should escalate to the caller instead.
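
The taxonomy above might be encoded as pattern-matched clauses. The error shapes here (`:timeout`, `{:network, reason}`, `{:http, status}`, `{:schema, details}`) are hypothetical; map whatever your transport actually returns onto them. Treating 5xx as transient and unknown errors as semantic (no fallback) are assumptions of this sketch.

```elixir
defmodule RouterSketch.FaultClassifier do
  # Transient: likely to succeed elsewhere or later.
  def classify(:timeout), do: {:transient, :timeout}
  def classify({:network, _reason}), do: {:transient, :network}
  def classify({:http, status}) when status in 500..599, do: {:transient, :provider_5xx}

  # Hard: this provider cannot serve the request as configured.
  def classify({:http, status}) when status in [401, 403], do: {:hard, :invalid_credentials}
  def classify({:http, 404}), do: {:hard, :unknown_model}

  # Semantic: the request itself is the problem.
  def classify({:schema, _details}), do: {:semantic, :bad_schema}

  # Conservative default: unknown errors escalate rather than fall back.
  def classify(_other), do: {:semantic, :unclassified}

  def fallback?({:transient, _reason}), do: true
  def fallback?({:hard, _reason}), do: true
  def fallback?({:semantic, _reason}), do: false
end
```

The second element of each tuple can serve directly as the reason code in the decision event.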

3. Key Management

  • Implement key precedence: request override -> in-memory store -> app config -> env -> .env.
  • Rotate keys by provider without restart when possible.
  • Never log secrets; log only source/identity metadata.
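
The precedence chain can be sketched as a walk over an ordered list of sources that returns the first hit together with the source name. This is a hypothetical resolver, not req_llm's key handling: the source atoms and the idea of passing each source in as a `provider => key` map are assumptions made so the logic stays pure and testable.

```elixir
defmodule RouterSketch.KeyResolver do
  # Precedence: request override -> in-memory store -> app config -> env -> .env
  @order [:request, :memory, :app_config, :env, :dotenv]

  # `sources` maps a source name to a %{provider => key} map.
  # Returns {:ok, source, key} for the first source holding a key, else :error.
  def resolve(provider, sources) do
    Enum.find_value(@order, :error, fn source ->
      case sources |> Map.get(source, %{}) |> Map.get(provider) do
        key when is_binary(key) and key != "" -> {:ok, source, key}
        _missing -> nil
      end
    end)
  end

  # Fields that are safe to log: where the key came from, never the key itself.
  def log_fields(provider, source), do: %{provider: provider, key_source: source}
end
```

Because the sources are passed in on every call, swapping the in-memory map rotates a key without a restart, and `log_fields/2` gives the source tracing needed to debug precedence conflicts.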

4. Circuit and Cooldown

  • Block a provider for a short window after repeated transport failures.
  • Readmit it automatically once the window expires, and lengthen the window exponentially if failures continue.
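
A minimal per-provider cooldown tracker might look like the following. The threshold, base window, and cap are illustrative assumptions, and the current time is passed in as an argument (for example from `System.monotonic_time(:millisecond)`) so the behavior can be tested deterministically.

```elixir
defmodule RouterSketch.Cooldown do
  # Illustrative values only; tune them for your traffic.
  @failure_threshold 3
  @base_ms 1_000
  @max_ms 60_000

  # Per-provider state: consecutive failures and the time the block ends.
  def new, do: %{failures: 0, blocked_until: nil}

  def record_failure(state, now_ms) do
    failures = state.failures + 1

    if failures >= @failure_threshold do
      # Double the window for each failure past the threshold, capped at @max_ms.
      exponent = failures - @failure_threshold
      cooldown = min(@base_ms * Integer.pow(2, exponent), @max_ms)
      %{state | failures: failures, blocked_until: now_ms + cooldown}
    else
      %{state | failures: failures}
    end
  end

  # Any success resets the provider to a clean state.
  def record_success(_state), do: new()

  def available?(%{blocked_until: nil}, _now_ms), do: true
  def available?(%{blocked_until: deadline}, now_ms), do: now_ms >= deadline
end
```

Once the window expires the provider becomes available again without operator action; if it fails again, the failure count carries over and the next window is twice as long.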

5. Explainability Trail

  • Persist a single decision event:
    • candidates
    • reason code
    • chosen route
    • observed metrics
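
The decision event could be modeled as a struct with required fields, rendered as one key=value line in the spirit of the sample log output. The struct name, field names, and the `cost_usd` / `latency_ms` metric keys are hypothetical choices for this sketch.

```elixir
defmodule RouterSketch.DecisionEvent do
  # All fields are required, so an incomplete event fails at construction time.
  @enforce_keys [:request_id, :candidates, :reason_code, :chosen_route, :metrics]
  defstruct @enforce_keys

  # Renders the event as a single machine-parseable key=value line.
  def to_log_line(%__MODULE__{} = event) do
    [
      request_id: event.request_id,
      candidates: Enum.join(event.candidates, ","),
      reason_code: event.reason_code,
      chosen_route: event.chosen_route,
      cost_usd: event.metrics.cost_usd,
      latency_ms: event.metrics.latency_ms
    ]
    |> Enum.map_join(" ", fn {key, value} -> "#{key}=#{value}" end)
  end
end

event =
  struct!(RouterSketch.DecisionEvent,
    request_id: "fa19c1",
    candidates: ["groq:llama-3.3", "anthropic:claude-3.5"],
    reason_code: :timeout,
    chosen_route: "anthropic:claude-3.5",
    metrics: %{cost_usd: 0.0034, latency_ms: 980}
  )

IO.puts(RouterSketch.DecisionEvent.to_log_line(event))
```

Emitting exactly one event per request keeps the audit trail easy to query: an operator can reconstruct the whole routing decision from a single record.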

Concepts You Must Understand First

  1. Cost governance
    • Why hard caps and soft caps differ.
  2. Failure semantics
    • Retry-safe vs retry-unsafe failures.
  3. Provider capability metadata
    • Check capability metadata before selecting a fallback.

Questions to Guide Your Design

  1. Policy precedence
    • Which constraints can be hard-failed locally before any network call?
  2. Fallback ethics
    • When is fallback a product violation versus a safety improvement?
  3. Operator trust
    • What evidence should be required before declaring a provider down?

Thinking Exercise

Design one fallback policy for a payment-adjacent workflow:

  • target latency < 1.5s
  • max cost per request 0.004 USD
  • providers: A (cheap, variable), B (stable, expensive), C (fast, low token cap)

What is your ranking and why?

Interview Questions They Will Ask

  1. “How do you avoid a thundering herd during a provider incident?”
  2. “Can a cost router make wrong decisions and still recover gracefully?”
  3. “What is the role of model metadata in routing?”
  4. “How do you prevent credential leakage in logs?”
  5. “How do you test failover deterministically?”

Hints in Layers

Hint 1: Start with guardrail functions. Implement pure policy checks before transport calls.

Hint 2: Add failure taxonomy enums: :timeout, :provider_5xx, :invalid_credentials, :budget_violation.

Hint 3: Keep fallback logs machine-readable. Use structured events, not prose-only logs.

Hint 4: Add chaos drills. Script simulated outages to confirm cooldown + reintroduction behavior.

Common Pitfalls and Debugging

  • Problem: Failover loops forever.
    • Why: missing attempt counter or stale blackout window.
    • Fix: max attempts and per-provider cooldown.
    • Quick test: inject permanent outage and assert bounded attempts.
  • Problem: Costs spike after fallback enabled.
    • Why: fallback ignores budget constraints.
    • Fix: evaluate budget on every retry candidate.
    • Quick test: run load test with budget guard at 0.01%.
  • Problem: Key source ambiguity.
    • Why: environment and in-memory values conflict.
    • Fix: explicit source tracing in decision event.
    • Quick test: change key source and assert selected source in logs.
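
The first two pitfalls can be addressed together by a failover loop that re-checks the budget for every candidate and caps the number of attempts. In this sketch the transport is an injected function, which is an assumption that also makes permanent outages easy to simulate; the module name, candidate fields, and error atoms are hypothetical.

```elixir
defmodule RouterSketch.Failover do
  # Illustrative bound on attempts per request.
  @max_attempts 3

  # `call` is an injected function: provider_id -> {:ok, result} | {:error, kind}.
  def run(candidates, budget_usd, call) do
    candidates
    # The budget applies to every candidate, not only the first attempt.
    |> Enum.filter(&(&1.est_cost_usd <= budget_usd))
    |> Enum.take(@max_attempts)
    |> try_each(call, [])
  end

  defp try_each([], _call, trail), do: {:error, :exhausted, Enum.reverse(trail)}

  defp try_each([candidate | rest], call, trail) do
    case call.(candidate.id) do
      {:ok, result} ->
        {:ok, candidate.id, result, Enum.reverse(trail)}

      # Transport failures move on to the next candidate.
      {:error, kind} when kind in [:timeout, :provider_5xx] ->
        try_each(rest, call, [{candidate.id, kind} | trail])

      # Anything else stops immediately and escalates.
      {:error, kind} ->
        {:error, kind, Enum.reverse([{candidate.id, kind} | trail])}
    end
  end
end
```

The returned trail lists each failed provider with its reason, so a test can inject a permanent outage and assert both the attempt count and the recorded causes.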

Books That Will Help

Topic         Book                Chapter
Resilience    Release It!         Circuit Breakers and Failover
OTP Patterns  Programming Elixir  Supervision and Retry

Definition of Done

  • Route selection obeys hard caps (cost, latency, compatibility)
  • Fallback triggers only for classified failure types
  • Provider incidents produce clear reason codes
  • Key precedence and rotation are tested
  • Cost deltas are visible per request

References

  • https://hexdocs.pm/req_llm/1.5.1/overview.html