Project 22: Production Engineering Control Tower for Agents
Build a unified control plane for reliability, telemetry, cost, and evaluation gates.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 14-24 hours |
| Language | TypeScript (alt: Python, Go) |
| Prerequisites | Projects 9, 17, 18 |
| Key Topics | fallbacks, retries, trace replay, budget gates, drift detection |
Learning Objectives
- Implement deterministic fallbacks around model/tool failures.
- Capture run-level telemetry that supports replay and root-cause analysis.
- Enforce token and latency budgets at runtime.
- Gate releases with golden/adversarial evaluation thresholds.
The Core Question You’re Answering
“How do you operate non-deterministic agents with deterministic production controls?”
Concepts You Must Understand First
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Circuit breakers | Stops cascading failures | Release It! |
| Structured telemetry | Enables reproducible debugging | OpenTelemetry docs |
| Cost envelopes | Prevents silent margin erosion | API pricing docs |
| Drift-aware eval | Catches regressions before rollout | Agent evaluation references |
Theoretical Foundation
Run Request -> Reliability Guard -> Execution -> Telemetry Store -> Eval Gate -> Promote/Hold
Reliability is not one feature; it is a set of enforceable invariants.
Project Specification
What You’ll Build
A control tower service that:
- Applies timeout/retry/fallback policy
- Emits structured traces and error classes
- Tracks token and cost budgets
- Runs post-run regression checks
Functional Requirements
- Error taxonomy + retry strategy mapping
- Trace correlation IDs across all components
- Runtime budget enforcement
- Replayable artifact pack per run
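The trace-correlation requirement above can be sketched as an immutable run ID stamped on every record a run emits. Names like `RunContext` and `emit` are illustrative, not a fixed API:

```typescript
import { randomUUID } from "node:crypto";

// Every telemetry record carries the same immutable run ID, which is what
// lets traces, logs, and cost events be joined during root-cause analysis.
interface TelemetryRecord {
  runId: string;
  component: string;
  event: string;
  ts: number;
}

class RunContext {
  readonly runId: string = randomUUID(); // immutable for the run's lifetime
  private records: TelemetryRecord[] = [];

  emit(component: string, event: string): TelemetryRecord {
    const rec = { runId: this.runId, component, event, ts: Date.now() };
    this.records.push(rec);
    return rec;
  }

  // Sanity check: every record in this run shares one correlation ID.
  allCorrelated(): boolean {
    return this.records.every((r) => r.runId === this.runId);
  }
}
```

In a real service the run ID would be propagated through headers or span context rather than held in one in-memory object, but the invariant is the same: no record leaves the system without it.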
Non-Functional Requirements
- Low-overhead instrumentation
- PII-safe logging
- Deterministic replay support
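PII-safe logging can be sketched as a redaction pass applied to every record before it leaves the process. The sensitive-field list and the email regex below are illustrative assumptions, not a complete PII policy:

```typescript
// Redact known-sensitive fields by key, and obvious email patterns by value.
// Extend SENSITIVE_KEYS and the patterns to match your actual data model.
const SENSITIVE_KEYS = new Set(["email", "phone", "ssn", "card_number"]);
const EMAIL_RE = /[^\s@]+@[^\s@]+\.[^\s@]+/g;

function redact(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    if (SENSITIVE_KEYS.has(key)) {
      out[key] = "[REDACTED]"; // drop the value entirely for known PII keys
    } else if (typeof value === "string") {
      out[key] = value.replace(EMAIL_RE, "[REDACTED]"); // scrub patterns in free text
    } else {
      out[key] = value;
    }
  }
  return out;
}
```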
Real World Outcome
```
$ npm run p22:tower -- --scenario "billing_dispute_resolution"
[reliability] fallback=true timeout=12s retries=2
[telemetry] spans=41 classified_errors={tool:2,model:1}
[cost] budget=$0.09 actual=$0.07
[eval] golden=94% adversarial=89%
[gate] PASS
```
Architecture Overview
Gateway -> Policy Layer -> Agent Runtime -> Telemetry/Eval Worker -> Release Gate
Implementation Guide
Phase 1: Reliability Envelope
- Define error classes and control policies.
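One way to express the mapping from error class to control policy, so retries are never generic (the classes and retry numbers below are illustrative, not recommendations):

```typescript
type ErrorClass = "timeout" | "rate_limit" | "tool_failure" | "invalid_output";

interface ControlPolicy {
  maxRetries: number;
  backoffMs: number;
  fallback: boolean; // switch to the deterministic fallback once retries exhaust
}

// Each error class gets its own explicit policy. Rate limits are retried
// patiently with no fallback; deterministic-looking failures fall back fast.
const POLICIES: Record<ErrorClass, ControlPolicy> = {
  timeout:        { maxRetries: 2, backoffMs: 500,  fallback: true },
  rate_limit:     { maxRetries: 3, backoffMs: 2000, fallback: false },
  tool_failure:   { maxRetries: 1, backoffMs: 0,    fallback: true },
  invalid_output: { maxRetries: 1, backoffMs: 0,    fallback: true },
};

function policyFor(err: ErrorClass): ControlPolicy {
  return POLICIES[err];
}
```

Keeping the table exhaustive over the taxonomy (which `Record<ErrorClass, ...>` enforces at compile time) is what makes "generic retries" impossible by construction.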
Phase 2: Telemetry + Replay
- Store decision traces and replay inputs.
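A minimal sketch of deterministic replay, assuming every non-deterministic output (model response, tool result) was recorded in call order during the original run:

```typescript
// A replayable artifact pack: the run ID plus every non-deterministic
// output in the order it was consumed.
interface RunArtifact {
  runId: string;
  recordedOutputs: string[]; // in call order
}

class Replayer {
  private cursor = 0;
  constructor(private artifact: RunArtifact) {}

  // During replay, "calls" return recorded values instead of hitting live
  // services, so the run is deterministic. A cursor overrun means the
  // runtime diverged from the recorded trace — itself a useful signal.
  next(): string {
    if (this.cursor >= this.artifact.recordedOutputs.length) {
      throw new Error("replay exhausted: trace/runtime mismatch");
    }
    return this.artifact.recordedOutputs[this.cursor++];
  }
}
```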
Phase 3: Cost + Eval Gates
- Enforce budgets and quality thresholds.
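The budget and eval gates can be combined into a single promotion decision. The thresholds here are assumptions chosen to be consistent with the sample output in Real World Outcome, not prescribed values:

```typescript
interface RunReport {
  costUsd: number;
  goldenScore: number;      // 0..1
  adversarialScore: number; // 0..1
}

// Assumed thresholds for illustration only.
const BUDGET_USD = 0.09;
const GOLDEN_MIN = 0.92;
const ADVERSARIAL_MIN = 0.85;

function gate(report: RunReport): "PASS" | "HOLD" {
  if (report.costUsd > BUDGET_USD) return "HOLD";         // budget breach is a first-class failure
  if (report.goldenScore < GOLDEN_MIN) return "HOLD";     // golden regressions block promotion
  if (report.adversarialScore < ADVERSARIAL_MIN) return "HOLD"; // adversarial gate kept separate
  return "PASS";
}
```

Note the golden and adversarial checks are independent conditions rather than one blended score, so a strong golden result cannot mask an adversarial regression.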
Testing Strategy
- Fault-injection tests
- Replay consistency tests
- Budget breach tests
- Regression gate tests
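A fault-injection test can be sketched by wrapping a deliberately flaky call in the retry envelope and asserting the envelope absorbs exactly the configured number of failures (synchronous here for brevity; a real runtime would be async with backoff):

```typescript
// Retry wrapper: maxRetries re-attempts after the initial call.
function withRetries<T>(fn: () => T, maxRetries: number): T {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastErr = err;
    }
  }
  throw lastErr;
}

// Injected fault: fail the first `failures` calls, then succeed.
function flakyTool(failures: number): () => string {
  let calls = 0;
  return () => {
    if (calls++ < failures) throw new Error("injected tool failure");
    return "ok";
  };
}
```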
Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Generic retries | runaway costs | map retries by error class |
| Incomplete traces | slow incident RCA | enforce required span fields |
| Quality-blind fallback | quiet regressions | score fallback outputs separately |
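The fix for the last pitfall — scoring fallback outputs separately — can be sketched as a summary that never blends primary and fallback scores (names are illustrative):

```typescript
interface ScoredRun {
  usedFallback: boolean;
  score: number; // 0..1 quality score
}

// Report primary and fallback quality side by side, plus the fallback rate,
// so a rising fallback rate or a weak fallback score is visible rather than
// averaged away.
function summarize(runs: ScoredRun[]) {
  const avg = (xs: number[]) =>
    xs.reduce((a, b) => a + b, 0) / Math.max(xs.length, 1);
  return {
    primaryAvg: avg(runs.filter((r) => !r.usedFallback).map((r) => r.score)),
    fallbackAvg: avg(runs.filter((r) => r.usedFallback).map((r) => r.score)),
    fallbackRate: runs.filter((r) => r.usedFallback).length / Math.max(runs.length, 1),
  };
}
```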
Interview Questions They’ll Ask
- What should trigger deterministic fallback?
- Which telemetry fields are required for replay?
- How do you balance cost vs quality in release gating?
- How do you detect drift in live traffic?
Hints in Layers
- Hint 1: Build explicit failure taxonomy first.
- Hint 2: Add immutable run IDs.
- Hint 3: Treat budget violations as first-class failures.
- Hint 4: Keep golden and adversarial gates separate.
Submission / Completion Criteria
Minimum Completion
- Reliability controls and trace replay working
Full Completion
- Budget + eval gate controls active
Excellence
- Automated release hold/rollback based on scorecards