Project 22: Production Engineering Control Tower for Agents

Build a unified control plane for reliability, telemetry, cost, and evaluation gates.


Quick Reference

Attribute       Value
Difficulty      Level 4: Expert
Time Estimate   14-24 hours
Language        TypeScript (alt: Python, Go)
Prerequisites   Projects 9, 17, 18
Key Topics      fallbacks, retries, trace replay, budget gates, drift detection

Learning Objectives

  1. Implement deterministic fallbacks around model/tool failures.
  2. Capture run-level telemetry that supports replay and root-cause analysis.
  3. Enforce token and latency budgets at runtime.
  4. Gate releases with golden/adversarial evaluation thresholds.

The Core Question You’re Answering

“How do you operate non-deterministic agents with deterministic production controls?”


Concepts You Must Understand First

Concept               Why It Matters                       Where to Learn
Circuit breakers      Stops cascading failures             Release It!
Structured telemetry  Enables reproducible debugging       OpenTelemetry docs
Cost envelopes        Prevents silent margin erosion       API pricing docs
Drift-aware eval      Catches regressions before rollout   Agent evaluation references
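
Of the concepts above, the circuit breaker is the one most worth internalizing before you start. A minimal sketch (class and field names here are illustrative, not prescribed by this project): after a run of consecutive failures the breaker opens and fails fast, then allows a single trial call once a cooldown elapses.

```typescript
// Minimal circuit breaker sketch. After `maxFailures` consecutive failures
// the breaker opens and rejects calls immediately; once `resetMs` elapses it
// goes half-open and permits one trial call before deciding to close or re-open.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures: number, private resetMs: number) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.resetMs) {
        throw new Error("circuit open: failing fast");
      }
      this.state = "half-open"; // cooldown elapsed: allow one trial call
    }
    try {
      const result = await fn();
      this.state = "closed"; // success closes the breaker and resets the count
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "half-open" || this.failures >= this.maxFailures) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

The key property for agents: a tripped breaker converts slow, cascading tool failures into fast, classifiable errors that the fallback layer can act on.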

Theoretical Foundation

Run Request -> Reliability Guard -> Execution -> Telemetry Store -> Eval Gate -> Promote/Hold

Reliability is not one feature; it is a set of enforceable invariants.


Project Specification

What You’ll Build

A control tower service that:

  • Applies timeout/retry/fallback policy
  • Emits structured traces and error classes
  • Tracks token and cost budgets
  • Runs post-run regression checks

Functional Requirements

  1. Error taxonomy + retry strategy mapping
  2. Trace correlation IDs across all components
  3. Runtime budget enforcement
  4. Replayable artifact pack per run
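
Requirements 2 and 4 can be sketched together as a single run record that every component writes into. The schema below is one plausible shape, not a prescribed one; field names are assumptions.

```typescript
import { randomUUID } from "node:crypto";

// One artifact pack per run: everything needed to replay the run offline and
// correlate spans, errors, and cost back to a single immutable run ID.
interface RunArtifact {
  runId: string;            // immutable correlation ID shared by all components
  scenario: string;
  inputs: unknown;          // exact prompt/tool inputs (PII-redacted before storage)
  modelParams: { model: string; temperature: number; seed?: number };
  spans: Array<{ spanId: string; parentId?: string; name: string; startMs: number; endMs: number }>;
  errors: Array<{ class: string; retryable: boolean; message: string }>;
  cost: { tokensIn: number; tokensOut: number; usd: number };
}

function newRun(scenario: string, inputs: unknown): RunArtifact {
  return {
    runId: randomUUID(),
    scenario,
    inputs,
    // temperature 0 plus a fixed seed improves (but does not guarantee) determinism
    modelParams: { model: "example-model", temperature: 0, seed: 42 },
    spans: [],
    errors: [],
    cost: { tokensIn: 0, tokensOut: 0, usd: 0 },
  };
}
```

Because the `runId` is minted once and threaded through every span and error, incident RCA becomes a single query rather than a log-grepping exercise.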

Non-Functional Requirements

  • Low-overhead instrumentation
  • PII-safe logging
  • Deterministic replay support
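
Of these, PII-safe logging is the easiest to get wrong. A minimal field-denylist redactor, as one starting point (the field list is an assumption; production systems also need pattern-based scrubbing for PII embedded in free text):

```typescript
// Redact known-sensitive fields before a log record leaves the process.
// A denylist like this catches structured PII only; free-text values still
// need pattern-based scrubbing (emails, card numbers, etc.).
const SENSITIVE_FIELDS = new Set(["email", "phone", "ssn", "card_number"]);

function redact(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    out[key] = SENSITIVE_FIELDS.has(key) ? "[REDACTED]" : value;
  }
  return out;
}
```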

Real World Outcome

$ npm run p22:tower -- --scenario "billing_dispute_resolution"
[reliability] fallback=true timeout=12s retries=2
[telemetry] spans=41 classified_errors={tool:2,model:1}
[cost] budget=$0.09 actual=$0.07
[eval] golden=94% adversarial=89%
[gate] PASS

Architecture Overview

Gateway -> Policy Layer -> Agent Runtime -> Telemetry/Eval Worker -> Release Gate

Implementation Guide

Phase 1: Reliability Envelope

  • Define error classes and control policies.
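
The taxonomy-to-policy mapping at the heart of Phase 1 can be sketched as a lookup table. The error classes and numbers below are one plausible starting set, not a prescribed list; tune them against your own failure data.

```typescript
// Error taxonomy mapped to control policy. Retrying a rate limit makes
// sense; retrying a model refusal usually does not, so it routes straight
// to the deterministic fallback instead.
type ErrorClass = "timeout" | "rate_limit" | "tool_error" | "model_refusal" | "invalid_output";

interface ControlPolicy {
  retries: number;
  backoffMs: number;
  fallback: boolean; // route to deterministic fallback once retries exhaust
}

const POLICY: Record<ErrorClass, ControlPolicy> = {
  timeout:        { retries: 1, backoffMs: 0,    fallback: true },
  rate_limit:     { retries: 3, backoffMs: 2000, fallback: false },
  tool_error:     { retries: 2, backoffMs: 500,  fallback: true },
  model_refusal:  { retries: 0, backoffMs: 0,    fallback: true },  // retrying rarely helps
  invalid_output: { retries: 2, backoffMs: 0,    fallback: true },  // re-ask, then fall back
};

function policyFor(cls: ErrorClass): ControlPolicy {
  return POLICY[cls];
}
```

This table is also your defense against the "generic retries" pitfall below: each class gets its own retry budget instead of one blanket policy.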

Phase 2: Telemetry + Replay

  • Store decision traces and replay inputs.
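
The replay half of Phase 2 reduces to: re-run the agent on the stored inputs and diff its decision trace against the recorded one. `AgentFn` and the `Decision` shape below are illustrative stand-ins for whatever runtime this project builds.

```typescript
// Replay consistency check: a replay "matches" when the fresh run makes the
// same decisions, in the same order, as the stored trace.
type Decision = { step: string; action: string };
type AgentFn = (inputs: unknown) => Decision[];

function replayMatches(
  agent: AgentFn,
  stored: { inputs: unknown; decisions: Decision[] }
): boolean {
  const fresh = agent(stored.inputs);
  if (fresh.length !== stored.decisions.length) return false;
  return fresh.every(
    (d, i) => d.step === stored.decisions[i].step && d.action === stored.decisions[i].action
  );
}
```

A mismatch here is a signal, not necessarily a bug: it may mean the model, a prompt, or a tool changed underneath you, which is exactly what the drift-detection topic is about.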

Phase 3: Cost + Eval Gates

  • Enforce budgets and quality thresholds.
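
Phase 3's gate can be sketched as a pure function over the run's scorecard. The thresholds below mirror the sample output in Real World Outcome (golden >= 90%, adversarial >= 85%, $0.09 budget) but are assumptions you should set per scenario.

```typescript
// Release gate: PASS only if the run is within budget AND clears both eval
// thresholds. Per Hint 3, a budget breach is a first-class failure, and per
// Hint 4, golden and adversarial thresholds are checked separately.
interface GateInput {
  costUsd: number;
  budgetUsd: number;
  goldenScore: number;      // 0..1 on the golden eval set
  adversarialScore: number; // 0..1 on the adversarial eval set
}

function gate(run: GateInput): "PASS" | "HOLD" {
  if (run.costUsd > run.budgetUsd) return "HOLD";   // budget breach, not a warning
  if (run.goldenScore < 0.9) return "HOLD";         // golden regression
  if (run.adversarialScore < 0.85) return "HOLD";   // adversarial regression
  return "PASS";
}
```

Keeping the gate a pure function of the scorecard is what makes the "automated release hold/rollback" excellence criterion tractable: the same function can run in CI and in the live promotion path.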

Testing Strategy

  • Fault-injection tests
  • Replay consistency tests
  • Budget breach tests
  • Regression gate tests
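
A fault-injection test usually needs only two small helpers: a wrapper that fails a scheduled number of times, and the retry loop under test. Both below are test-only sketches with hypothetical names.

```typescript
// Wrap a function so its first `n` calls throw an injected fault, then it
// succeeds. This lets tests assert that the reliability envelope recovers
// exactly when the retry budget covers the injected failures.
function failNTimes<T>(n: number, fn: () => T): () => T {
  let remaining = n;
  return () => {
    if (remaining-- > 0) throw new Error("injected fault");
    return fn();
  };
}

// Simple retry loop: `retries` extra attempts after the first.
function withRetries<T>(retries: number, fn: () => T): T {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try { return fn(); } catch (e) { lastErr = e; }
  }
  throw lastErr;
}
```

The same pattern extends to budget-breach and replay-consistency tests: inject the condition deterministically, then assert the control layer's observable response.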

Common Pitfalls & Debugging

Pitfall                 Symptom            Fix
Generic retries         Runaway costs      Map retries by error class
Incomplete traces       Slow incident RCA  Enforce required span fields
Quality-blind fallback  Quiet regressions  Score fallback outputs separately

Interview Questions They’ll Ask

  1. What should trigger deterministic fallback?
  2. Which telemetry fields are required for replay?
  3. How do you balance cost vs quality in release gating?
  4. How do you detect drift in live traffic?

Hints in Layers

  • Hint 1: Build explicit failure taxonomy first.
  • Hint 2: Add immutable run IDs.
  • Hint 3: Treat budget violations as first-class failures.
  • Hint 4: Keep golden and adversarial gates separate.

Submission / Completion Criteria

Minimum Completion

  • Reliability controls and trace replay working

Full Completion

  • Budget + eval gate controls active

Excellence

  • Automated release hold/rollback based on scorecards