Project 4: Provider Failover and Cost Router
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | 4 |
| Time | 2-3 weeks |
| Main Stack | req_llm + fallback policies + config-driven keys |
| Alternatives | Hardcoded provider routing |
| Why Now | Production incidents often stem from provider instability and silent cost spikes |
What You Will Build
A policy-driven routing service that uses provider metadata and cost constraints to choose execution paths, with automatic failover and explainable fallback.
Real World Outcome
$ mix run -e "RouterDemo.simulate(:golden_path)"
[info] request_id=fa19c1 policy_route=provider=groq:llama-3.3 provider_score=0.94 budget_usd=0.0035
[info] attempt_1 failed=timeout reason=provider_unreachable
[warn] fallback=anthropic:claude-3.5 selected
[info] final cost=0.0034 latency_ms=980
[info] explainability_event=recorded
When failover triggers, operators can inspect exactly:
- original route
- failure cause
- chosen fallback
- cost delta versus baseline
The Core Question You Are Answering
“How do we preserve product reliability under provider degradation without violating budget constraints?”
Why This Project Matters
req_llm exposes standardized usage metadata and provider registry information, so failover can be deterministic and auditable rather than ad hoc.
Big-Picture Architecture
Incoming Request
|
v
+------------------+
| Route Policy DSL |
+--------+---------+
|
v
+---------------------------+
| Candidate Planner |
| - budget-aware filtering |
| - latency SLA filtering |
| - capability compatibility |
+------------+--------------+
|
v
+---------------------------+
| Execute Attempt #1 |
+------------+--------------+
|
+-------+-------+
| success/fail |
+-------+-------+
| fail
v
+---------------------------+
| Fault Classifier |
| timeout/5xx/over_budget |
+------------+--------------+
|
v
+---------------------------+
| Fallback Planner |
| attempts, cooldown, budget |
+------------+--------------+
|
v
+---------+---------+
| Chosen Provider   |
| + metadata + usage|
+-------------------+
Key Design Work
1. Cost-Aware Policy
- Define normalized unit cost from metadata (if available).
- Apply guardrails for outlier responses.
- If metadata is absent, switch to a fallback provider with a known safety margin.
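A minimal sketch of the cost guardrail described above, written as a pure function that runs before any network call. The module name, the metadata shape (`input_per_mtok`, `output_per_mtok`, assumed to be USD per one million tokens), and the reason atoms are all illustrative assumptions, not part of req_llm.

```elixir
defmodule CostPolicy do
  @moduledoc "Hypothetical sketch: normalized unit cost plus a hard budget check."

  # Assumed metadata shape: USD price per one million input/output tokens.
  # Returns {:ok, usd}, or :unknown when pricing metadata is missing.
  def estimate_usd(%{input_per_mtok: inp, output_per_mtok: out}, in_tokens, out_tokens)
      when is_number(inp) and is_number(out) do
    {:ok, (in_tokens * inp + out_tokens * out) / 1_000_000}
  end

  def estimate_usd(_metadata, _in_tokens, _out_tokens), do: :unknown

  # Pure guardrail: estimates worst-case cost from the output token limit
  # and decides locally whether the candidate may be attempted.
  def check(metadata, in_tokens, max_out_tokens, budget_usd) do
    case estimate_usd(metadata, in_tokens, max_out_tokens) do
      {:ok, usd} when usd <= budget_usd -> {:allow, usd}
      {:ok, usd} -> {:reject, {:over_budget, usd}}
      :unknown -> {:reject, :no_cost_metadata}
    end
  end
end
```

A `{:reject, :no_cost_metadata}` result is the signal to move to a provider whose pricing is known, rather than guessing at a cost.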
2. Failure Classifier
- Distinguish transient (timeout/network), hard (invalid key/model), and semantic (bad schema/mode).
- Only transient and hard transport-level failures should trigger fallback; semantic failures should escalate instead of being retried on another provider.
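The three failure classes can be sketched as a small pure classifier. The reason atoms below are one possible taxonomy chosen for illustration; real provider errors would need to be mapped onto them first, and none of these names come from req_llm.

```elixir
defmodule FaultClassifier do
  # Illustrative reason atoms, grouped by the three classes described above.
  @transient [:timeout, :network_error, :provider_5xx]
  @hard [:invalid_credentials, :unknown_model]
  @semantic [:bad_schema, :unsupported_mode]

  def classify(reason) when reason in @transient, do: :transient
  def classify(reason) when reason in @hard, do: :hard
  def classify(reason) when reason in @semantic, do: :semantic
  def classify(_other), do: :unknown

  # Transient and hard failures move on to the next candidate;
  # semantic and unrecognized failures escalate to the caller.
  def action(reason) do
    case classify(reason) do
      class when class in [:transient, :hard] -> {:fallback, class}
      class -> {:escalate, class}
    end
  end
end
```

Treating unrecognized reasons as escalations is a deliberately conservative choice: an unclassified error is more likely to be a bug than a provider outage.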
3. Key Management
- Implement key precedence: request override -> in-memory store -> app config -> env -> .env.
- Rotate keys by provider without restart when possible.
- Never log secrets; log only source/identity metadata.
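One way to make the precedence order explicit is a resolver that walks the sources in order and returns the winning source alongside the key. This is a sketch under assumptions: the source names and the `sources` map shape are hypothetical, and in a real service the `:memory` layer would likely be backed by an Agent or ETS table so keys can be rotated without a restart.

```elixir
defmodule KeyResolver do
  # Precedence order from the list above; the source names are illustrative.
  @order [:request, :memory, :app_config, :env, :dotenv]

  # `sources` maps a source name to %{provider => key}. Returns the first
  # non-empty key together with the source that supplied it.
  def resolve(provider, sources) do
    Enum.find_value(@order, {:error, :no_key}, fn source ->
      case get_in(sources, [source, provider]) do
        key when is_binary(key) and key != "" -> {:ok, key, source}
        _ -> nil
      end
    end)
  end

  # Fields that are safe to log: the provider and the key source, never the key.
  def log_fields(provider, {:ok, _key, source}), do: [provider: provider, key_source: source]
  def log_fields(provider, {:error, reason}), do: [provider: provider, key_error: reason]
end
```

Because the source travels with the key, the decision event can record which layer won without ever touching the secret itself.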
4. Circuit and Cooldown
- Keep short blacklists after repeated transport failures.
- Auto-recover with exponential cooldown.
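The cooldown behavior can be sketched as pure state transitions with the clock passed in, which keeps it testable. The threshold, base delay, and cap below are placeholder values chosen for the example, not recommendations from the text.

```elixir
defmodule Cooldown do
  # Illustrative defaults: block after 3 consecutive failures, start at 1 s,
  # double on each further failure, cap the window at 60 s.
  @threshold 3
  @base_ms 1_000
  @max_ms 60_000

  def new, do: %{}

  def record_failure(state, provider, now_ms) do
    failures = (get_in(state, [provider, :failures]) || 0) + 1

    blocked_until =
      if failures >= @threshold do
        # Cap the exponent so the multiplier stays small even after long outages.
        exponent = min(failures - @threshold, 10)
        now_ms + min(@base_ms * Integer.pow(2, exponent), @max_ms)
      else
        0
      end

    Map.put(state, provider, %{failures: failures, blocked_until: blocked_until})
  end

  # A success clears the failure history, so the provider is fully reintroduced.
  def record_success(state, provider), do: Map.delete(state, provider)

  def available?(state, provider, now_ms) do
    case state do
      %{^provider => %{blocked_until: blocked_until}} -> now_ms >= blocked_until
      _ -> true
    end
  end
end
```

Once the window expires the provider becomes eligible again; a further failure extends the window, while a success resets it.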
5. Explainability Trail
- Persist a single decision event:
- candidates
- reason code
- chosen route
- observed metrics
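A possible shape for the single decision event, with a flat `key=value` rendering so it stays machine-readable in logs. The struct fields mirror the four items above plus a request id; the names and the log format are assumptions made for this sketch.

```elixir
defmodule DecisionEvent do
  # Hypothetical field names following the four items listed above.
  @enforce_keys [:request_id, :candidates, :reason_code, :chosen_route, :metrics]
  defstruct @enforce_keys

  def new(request_id, candidates, reason_code, chosen_route, metrics) do
    %__MODULE__{
      request_id: request_id,
      candidates: candidates,
      reason_code: reason_code,
      chosen_route: chosen_route,
      metrics: metrics
    }
  end

  # Render as space-separated key=value pairs; metrics are sorted by key
  # so the output is stable across runs.
  def to_logfmt(%__MODULE__{} = event) do
    fields =
      [
        request_id: event.request_id,
        candidates: Enum.join(event.candidates, ","),
        reason_code: event.reason_code,
        chosen_route: event.chosen_route
      ] ++ Enum.sort(Map.to_list(event.metrics))

    Enum.map_join(fields, " ", fn {key, value} -> "#{key}=#{value}" end)
  end
end
```

Emitting one event per request, rather than one log line per attempt, makes it easier to reconstruct why a route was chosen after the fact.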
Concepts You Must Understand First
- Cost governance
- Why hard caps and soft caps differ.
- Failure semantics
- Retry-safe vs retry-unsafe failures.
- Provider capability metadata
- Use metadata before selecting fallback.
Questions to Guide Your Design
- Policy precedence
- Which constraints can be hard-failed locally before any network call?
- Fallback ethics
- When is fallback a product violation versus a safety improvement?
- Operator trust
- What evidence should be required before declaring a provider down?
Thinking Exercise
Design one fallback policy for a payment-adjacent workflow:
- target latency < 1.5s
- max cost per request 0.004 USD
- providers: A (cheap, variable), B (stable, expensive), C (fast, low token cap)
What is your ranking and why?
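Before ranking, it can help to encode the constraints as a pure filter and see which providers survive. The sketch below treats the latency target as a hard cap and adds a token-cap check for provider C; the candidate fields (`p95_latency_ms`, `est_cost_usd`, `token_cap`) are hypothetical, and the exercise gives no provider figures, so any values fed in are placeholders. The ranking itself is left open.

```elixir
defmodule CandidateFilter do
  # Caps taken from the exercise; treating the latency target as hard is an assumption.
  @max_latency_ms 1_500
  @max_cost_usd 0.004

  # Returns {eligible, rejected}; each rejection carries its reason codes.
  def partition(candidates, needed_tokens) do
    {ok, bad} =
      candidates
      |> Enum.map(fn c -> {c, violations(c, needed_tokens)} end)
      |> Enum.split_with(fn {_c, reasons} -> reasons == [] end)

    {Enum.map(ok, fn {c, _} -> c end),
     Enum.map(bad, fn {c, reasons} -> {c.id, reasons} end)}
  end

  defp violations(c, needed_tokens) do
    checks = [
      {c.p95_latency_ms >= @max_latency_ms, :latency_cap},
      {c.est_cost_usd > @max_cost_usd, :cost_cap},
      {c.token_cap < needed_tokens, :token_cap}
    ]

    # Keep only the reason codes whose check fired.
    for {true, reason} <- checks, do: reason
  end
end
```

Keeping the rejection reasons, instead of silently dropping candidates, feeds directly into the explainability trail.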
Interview Questions They Will Ask
- “How do you avoid a thundering herd during a provider incident?”
- “Can a cost router make wrong decisions and still recover gracefully?”
- “What is the role of model metadata in routing?”
- “How do you prevent credential leakage in logs?”
- “How do you test failover deterministically?”
Hints in Layers
Hint 1: Start with guardrail functions. Implement pure policy checks before transport calls.
Hint 2: Add failure taxonomy enums: :timeout, :provider_5xx, :invalid_credentials, :budget_violation.
Hint 3: Keep fallback logs machine-readable. Use structured events, not prose-only logs.
Hint 4: Add chaos drills. Script simulated outages to confirm cooldown + reintroduction behavior.
Common Pitfalls and Debugging
- Problem: Failover loops forever.
- Why: missing attempt counter or stale blackout window.
- Fix: max attempts and per-provider cooldown.
- Quick test: inject permanent outage and assert bounded attempts.
- Problem: Costs spike after fallback enabled.
- Why: fallback ignores budget constraints.
- Fix: evaluate budget on every retry candidate.
- Quick test: run load test with budget guard at 0.01%.
- Problem: Key source ambiguity.
- Why: environment and in-memory values conflict.
- Fix: explicit source tracing in decision event.
- Quick test: change key source and assert selected source in logs.
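The first two pitfalls can be guarded against in one place: a failover loop that counts attempts and re-checks budget for every candidate. The sketch below injects the provider call as a function so outages can be scripted deterministically; the module name, candidate fields, and return shapes are assumptions made for illustration.

```elixir
defmodule Failover do
  # `call` is an injected function: provider_id -> {:ok, result} | {:error, reason}.
  # Injecting it lets tests simulate outages without touching the network.
  def run(candidates, budget_usd, max_attempts, call) do
    do_run(candidates, budget_usd, max_attempts, call, [])
  end

  defp do_run([], _budget, _left, _call, trail),
    do: {:error, :exhausted, Enum.reverse(trail)}

  defp do_run(_candidates, _budget, 0, _call, trail),
    do: {:error, :max_attempts, Enum.reverse(trail)}

  defp do_run([c | rest], budget, left, call, trail) do
    if c.est_cost_usd > budget do
      # Budget is checked for every candidate, including fallbacks.
      # Skipping a candidate does not consume an attempt.
      do_run(rest, budget, left, call, [{c.id, :over_budget} | trail])
    else
      case call.(c.id) do
        {:ok, result} ->
          {:ok, c.id, result, Enum.reverse(trail)}

        {:error, reason} ->
          do_run(rest, budget, left - 1, call, [{c.id, reason} | trail])
      end
    end
  end
end
```

Passing `fn _id -> {:error, :timeout} end` as `call` simulates a permanent outage: the loop stops at `max_attempts` and returns the trail of attempted and skipped providers, which is the bounded-attempts assertion from the first quick test.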
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Resilience | Release It! | Circuit Breakers and Failover |
| OTP Patterns | Programming Elixir | Supervision and Retry |
Definition of Done
- Route selection obeys hard caps (cost, latency, compatibility)
- Fallback triggers only for classified failure types
- Provider incidents produce clear reason codes
- Key precedence and rotation are tested
- Cost deltas are visible per request
References
- https://hexdocs.pm/req_llm/1.5.1/overview.html