Project 22: Production Engineering Control Tower for Agents
Build a unified control plane for reliability, telemetry, cost, and evaluation gates.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 14-24 hours |
| Language | TypeScript (alt: Python, Go) |
| Prerequisites | Projects 9, 17, 18 |
| Key Topics | fallbacks, retries, trace replay, budget gates, drift detection |
Learning Objectives
- Implement deterministic fallbacks around model/tool failures.
- Capture run-level telemetry that supports replay and root-cause analysis.
- Enforce token and latency budgets at runtime.
- Gate releases with golden/adversarial evaluation thresholds.
The Core Question You’re Answering
“How do you operate non-deterministic agents with deterministic production controls?”
Concepts You Must Understand First
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Circuit breakers | Stops cascading failures | Release It! |
| Structured telemetry | Enables reproducible debugging | OpenTelemetry docs |
| Cost envelopes | Prevents silent margin erosion | API pricing docs |
| Drift-aware eval | Catches regressions before rollout | Agent evaluation references |
Theoretical Foundation
Run Request -> Reliability Guard -> Execution -> Telemetry Store -> Eval Gate -> Promote/Hold
Reliability is not one feature; it is a set of enforceable invariants.
Project Specification
What You’ll Build
A control tower service that:
- Applies timeout/retry/fallback policy
- Emits structured traces and error classes
- Tracks token and cost budgets
- Runs post-run regression checks
Functional Requirements
- Error taxonomy + retry strategy mapping
- Trace correlation IDs across all components
- Runtime budget enforcement
- Replayable artifact pack per run
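The trace-correlation requirement above can be sketched as an immutable run ID stamped on every record a run emits. Names like `RunContext` and `emit` are illustrative, not a fixed API:

```typescript
import { randomUUID } from "node:crypto";

// Every telemetry record carries the same immutable run ID, which is what
// lets traces, logs, and cost events be joined during root-cause analysis.
interface TelemetryRecord {
  runId: string;
  component: string;
  event: string;
  ts: number;
}

class RunContext {
  readonly runId: string = randomUUID(); // immutable for the run's lifetime
  private records: TelemetryRecord[] = [];

  emit(component: string, event: string): TelemetryRecord {
    const rec = { runId: this.runId, component, event, ts: Date.now() };
    this.records.push(rec);
    return rec;
  }

  // Sanity check: every record in this run shares one correlation ID.
  allCorrelated(): boolean {
    return this.records.every((r) => r.runId === this.runId);
  }
}
```

In a real service the run ID would be propagated through headers or span context rather than held in one in-memory object, but the invariant is the same: no record leaves the system without it.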
Non-Functional Requirements
- Low-overhead instrumentation
- PII-safe logging
- Deterministic replay support
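PII-safe logging can be sketched as a redaction pass applied to every record before it leaves the process. The sensitive-field list and the email regex below are illustrative assumptions, not a complete PII policy:

```typescript
// Redact known-sensitive fields by key, and obvious email patterns by value.
// Extend SENSITIVE_KEYS and the patterns to match your actual data model.
const SENSITIVE_KEYS = new Set(["email", "phone", "ssn", "card_number"]);
const EMAIL_RE = /[^\s@]+@[^\s@]+\.[^\s@]+/g;

function redact(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    if (SENSITIVE_KEYS.has(key)) {
      out[key] = "[REDACTED]"; // drop the value entirely for known PII keys
    } else if (typeof value === "string") {
      out[key] = value.replace(EMAIL_RE, "[REDACTED]"); // scrub patterns in free text
    } else {
      out[key] = value;
    }
  }
  return out;
}
```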
Real World Outcome
```
$ npm run p22:tower -- --scenario "billing_dispute_resolution"
[reliability] fallback=true timeout=12s retries=2
[telemetry] spans=41 classified_errors={tool:2,model:1}
[cost] budget=$0.09 actual=$0.07
[eval] golden=94% adversarial=89%
[gate] PASS
```
Architecture Overview
Gateway -> Policy Layer -> Agent Runtime -> Telemetry/Eval Worker -> Release Gate
Implementation Guide
Phase 1: Reliability Envelope
- Define error classes and control policies.
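One way to express the mapping from error class to control policy, so retries are never generic (the classes and retry numbers below are illustrative, not recommendations):

```typescript
type ErrorClass = "timeout" | "rate_limit" | "tool_failure" | "invalid_output";

interface ControlPolicy {
  maxRetries: number;
  backoffMs: number;
  fallback: boolean; // switch to the deterministic fallback once retries exhaust
}

// Each error class gets its own explicit policy. Rate limits are retried
// patiently with no fallback; deterministic-looking failures fall back fast.
const POLICIES: Record<ErrorClass, ControlPolicy> = {
  timeout:        { maxRetries: 2, backoffMs: 500,  fallback: true },
  rate_limit:     { maxRetries: 3, backoffMs: 2000, fallback: false },
  tool_failure:   { maxRetries: 1, backoffMs: 0,    fallback: true },
  invalid_output: { maxRetries: 1, backoffMs: 0,    fallback: true },
};

function policyFor(err: ErrorClass): ControlPolicy {
  return POLICIES[err];
}
```

Keeping the table exhaustive over the taxonomy (which `Record<ErrorClass, ...>` enforces at compile time) is what makes "generic retries" impossible by construction.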
Phase 2: Telemetry + Replay
- Store decision traces and replay inputs.
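A minimal sketch of deterministic replay, assuming every non-deterministic output (model response, tool result) was recorded in call order during the original run:

```typescript
// A replayable artifact pack: the run ID plus every non-deterministic
// output in the order it was consumed.
interface RunArtifact {
  runId: string;
  recordedOutputs: string[]; // in call order
}

class Replayer {
  private cursor = 0;
  constructor(private artifact: RunArtifact) {}

  // During replay, "calls" return recorded values instead of hitting live
  // services, so the run is deterministic. A cursor overrun means the
  // runtime diverged from the recorded trace — itself a useful signal.
  next(): string {
    if (this.cursor >= this.artifact.recordedOutputs.length) {
      throw new Error("replay exhausted: trace/runtime mismatch");
    }
    return this.artifact.recordedOutputs[this.cursor++];
  }
}
```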
Phase 3: Cost + Eval Gates
- Enforce budgets and quality thresholds.
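The budget and eval gates can be combined into a single promotion decision. The thresholds here are assumptions chosen to be consistent with the sample output in Real World Outcome, not prescribed values:

```typescript
interface RunReport {
  costUsd: number;
  goldenScore: number;      // 0..1
  adversarialScore: number; // 0..1
}

// Assumed thresholds for illustration only.
const BUDGET_USD = 0.09;
const GOLDEN_MIN = 0.92;
const ADVERSARIAL_MIN = 0.85;

function gate(report: RunReport): "PASS" | "HOLD" {
  if (report.costUsd > BUDGET_USD) return "HOLD";         // budget breach is a first-class failure
  if (report.goldenScore < GOLDEN_MIN) return "HOLD";     // golden regressions block promotion
  if (report.adversarialScore < ADVERSARIAL_MIN) return "HOLD"; // adversarial gate kept separate
  return "PASS";
}
```

Note the golden and adversarial checks are independent conditions rather than one blended score, so a strong golden result cannot mask an adversarial regression.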
Testing Strategy
- Fault-injection tests
- Replay consistency tests
- Budget breach tests
- Regression gate tests
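A fault-injection test can be sketched by wrapping a deliberately flaky call in the retry envelope and asserting the envelope absorbs exactly the configured number of failures (synchronous here for brevity; a real runtime would be async with backoff):

```typescript
// Retry wrapper: maxRetries re-attempts after the initial call.
function withRetries<T>(fn: () => T, maxRetries: number): T {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastErr = err;
    }
  }
  throw lastErr;
}

// Injected fault: fail the first `failures` calls, then succeed.
function flakyTool(failures: number): () => string {
  let calls = 0;
  return () => {
    if (calls++ < failures) throw new Error("injected tool failure");
    return "ok";
  };
}
```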
Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Generic retries | runaway costs | map retries by error class |
| Incomplete traces | slow incident RCA | enforce required span fields |
| Quality-blind fallback | quiet regressions | score fallback outputs separately |
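The fix for the last pitfall — scoring fallback outputs separately — can be sketched as a summary that never blends primary and fallback scores (names are illustrative):

```typescript
interface ScoredRun {
  usedFallback: boolean;
  score: number; // 0..1 quality score
}

// Report primary and fallback quality side by side, plus the fallback rate,
// so a rising fallback rate or a weak fallback score is visible rather than
// averaged away.
function summarize(runs: ScoredRun[]) {
  const avg = (xs: number[]) =>
    xs.reduce((a, b) => a + b, 0) / Math.max(xs.length, 1);
  return {
    primaryAvg: avg(runs.filter((r) => !r.usedFallback).map((r) => r.score)),
    fallbackAvg: avg(runs.filter((r) => r.usedFallback).map((r) => r.score)),
    fallbackRate: runs.filter((r) => r.usedFallback).length / Math.max(runs.length, 1),
  };
}
```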
Interview Questions They’ll Ask
- What should trigger deterministic fallback?
- Which telemetry fields are required for replay?
- How do you balance cost vs quality in release gating?
- How do you detect drift in live traffic?
Hints in Layers
- Hint 1: Build explicit failure taxonomy first.
- Hint 2: Add immutable run IDs.
- Hint 3: Treat budget violations as first-class failures.
- Hint 4: Keep golden and adversarial gates separate.
Submission / Completion Criteria
Minimum Completion
- Reliability controls and trace replay working
Full Completion
- Budget + eval gate controls active
Excellence
- Automated release hold/rollback based on scorecards