# Project 17: Agent Observability with OpenTelemetry

Build complete trace and metric instrumentation for agent loops, tool calls, policy checks, and memory operations.
## Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 10-18 hours |
| Language | TypeScript (alt: Python, Go) |
| Prerequisites | Projects 6, 9 |
| Key Topics | tracing, semantic conventions, root-cause workflows |
## Learning Objectives
- Instrument agent execution with end-to-end trace IDs.
- Emit GenAI semantic attributes for model/tool spans.
- Build latency, token, and cost dashboards.
- Perform root-cause analysis from a single failed request.
## The Core Question You’re Answering

> “How do you make stochastic agent behavior operationally debuggable?”
## Concepts You Must Understand First
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Span trees | Causality across components | OpenTelemetry fundamentals |
| GenAI semantic attributes | Standardized AI telemetry fields | OTel GenAI conventions |
| Sampling strategy | Controls observability cost | SRE observability practices |
## Theoretical Foundation

```
Request -> Trace Root -> (LLM span, Tool span, Policy span, Memory span) -> Outcome
```

Without span-level causality, agent debugging becomes guesswork.
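The span tree above can be sketched as a plain data structure. This is a hand-rolled recorder for illustration, not the real OpenTelemetry SDK; names like `AgentSpan` and `childSpan` are hypothetical.

```typescript
import { randomUUID } from "node:crypto";

// Every span carries the trace ID of its root plus a link to its parent.
interface AgentSpan {
  traceId: string;
  spanId: string;
  parentId?: string;
  name: string;
  kind: "agent" | "llm" | "tool" | "policy" | "memory";
  children: AgentSpan[];
}

function startTrace(rootName: string): AgentSpan {
  return { traceId: randomUUID(), spanId: randomUUID(), name: rootName, kind: "agent", children: [] };
}

function childSpan(parent: AgentSpan, name: string, kind: AgentSpan["kind"]): AgentSpan {
  const span: AgentSpan = {
    traceId: parent.traceId, // the same trace ID flows to every descendant
    spanId: randomUUID(),
    parentId: parent.spanId, // the parent link is what gives you causality
    name,
    kind,
    children: [],
  };
  parent.children.push(span);
  return span;
}

const root = startTrace("summarize incident retro");
const llm = childSpan(root, "model-call-1", "llm");
childSpan(llm, "search-tool", "tool");
console.log(root.children[0].children[0].traceId === root.traceId); // true
```

The real SDK adds timing, status, and context propagation on top, but the parent/child linkage is the core of why a single trace ID reconstructs the whole run.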
## Project Specification

### What You’ll Build

An observability layer that:
- Instruments all critical boundaries
- Emits OTLP traces + metrics
- Redacts sensitive fields
- Supports trace-driven incident analysis
### Functional Requirements
- Correlation ID propagation
- Span emission for model/tool/policy/memory operations
- Token/cost metric extraction
- Dashboard and alert definitions
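Correlation ID propagation, the first requirement above, can be sketched with Node's `AsyncLocalStorage`. The OpenTelemetry SDK does this for you via its context API; this shows the underlying mechanism. Helper names are illustrative.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

// One store per in-flight request; anything run inside sees the same ID.
const requestContext = new AsyncLocalStorage<{ correlationId: string }>();

function withCorrelationId<T>(fn: () => T): T {
  return requestContext.run({ correlationId: randomUUID() }, fn);
}

function currentCorrelationId(): string {
  // "unknown" signals a span emitted outside any request context — a bug
  // worth alerting on, since it produces a fragmented trace.
  return requestContext.getStore()?.correlationId ?? "unknown";
}

withCorrelationId(() => {
  const a = currentCorrelationId();
  const b = currentCorrelationId(); // same ID, even in nested calls
  console.log(a === b); // true
});
```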
### Non-Functional Requirements
- Low instrumentation overhead
- Privacy-safe logs and traces
- Replay-friendly trace export
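Privacy-safe traces mean redacting before export, not after storage. A minimal sketch, assuming span attributes are flat string maps; the key patterns here are examples to tune for your data.

```typescript
// Keys matching these patterns are scrubbed before the span leaves the process.
const SENSITIVE_KEYS = /(api[_-]?key|token|password|email|ssn)/i;

function redactAttributes(attrs: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [k, v] of Object.entries(attrs)) {
    out[k] = SENSITIVE_KEYS.test(k) ? "[REDACTED]" : v;
  }
  return out;
}

// tool.name survives; user.email is replaced with "[REDACTED]"
console.log(redactAttributes({ "user.email": "a@b.com", "tool.name": "search" }));
```

In a real pipeline this runs inside a span processor or the Collector, so no exporter ever sees the raw value.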
## Real World Outcome

```
$ npm run p17:trace -- --goal "summarize incident retro"
[trace] id=trace_a9f2d
[spans] llm=7 tool=5 memory=3 policy=5
[latency] p50=1.2s p95=4.8s
[cost] est_usd=0.41
[dashboard] updated successfully
```
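The p50/p95 figures in the output above come from raw span durations. A sketch using nearest-rank percentiles, which is what most dashboard backends use:

```typescript
// Nearest-rank percentile: sort, take the ceil(p% * n)-th value.
function percentile(durationsMs: number[], p: number): number {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const durations = [800, 1200, 1500, 4800, 900, 1100];
console.log(percentile(durations, 50), percentile(durations, 95)); // 1100 4800
```

In production you would compute these from histogram metrics rather than raw durations, since exporting every duration does not scale.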
## Architecture Overview

```
Agent Runtime -> Instrumentation SDK -> OTel Collector -> Storage -> Dashboards/Alerts
```
## Implementation Guide

### Phase 1: Trace Foundations

- Request IDs, root spans, and propagation.
### Phase 2: GenAI Attributes

- Add model/tool metadata and token usage.
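A sketch of GenAI span attributes plus a derived cost metric. The attribute keys follow the OTel GenAI semantic conventions, which are still marked experimental, so verify the exact names against the current spec; the per-token prices are placeholders, not real rates.

```typescript
interface Usage { inputTokens: number; outputTokens: number; }

// Attributes attached to each model-call span (keys per OTel GenAI semconv).
function genAiAttributes(model: string, usage: Usage): Record<string, string | number> {
  return {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": model,
    "gen_ai.usage.input_tokens": usage.inputTokens,
    "gen_ai.usage.output_tokens": usage.outputTokens,
  };
}

// Placeholder prices per 1M tokens; load real rates from config.
const PRICE_PER_M = { input: 3.0, output: 15.0 };

function estimateCostUsd(usage: Usage): number {
  return (usage.inputTokens * PRICE_PER_M.input + usage.outputTokens * PRICE_PER_M.output) / 1_000_000;
}

const usage = { inputTokens: 1200, outputTokens: 400 };
console.log(estimateCostUsd(usage).toFixed(4)); // 0.0096
```

Emitting cost as a metric keyed by model name keeps the dashboard query trivial: sum over the trace, group by model.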
### Phase 3: Ops Workflows

- Build runbooks for trace-based incident triage.
## Testing Strategy
- Missing span detection tests
- PII redaction tests
- Synthetic incident replay tests
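The missing-span detection test above can be sketched as a coverage check over exported spans: every boundary this project instruments must appear at least once per run. The kind names mirror this project's span taxonomy.

```typescript
// The four boundaries every complete agent run must have instrumented.
const REQUIRED_KINDS = ["llm", "tool", "policy", "memory"] as const;

function missingKinds(exported: { kind: string }[]): string[] {
  const seen = new Set(exported.map((s) => s.kind));
  return REQUIRED_KINDS.filter((k) => !seen.has(k));
}

// A run that never emitted a policy span fails the check.
const spans = [{ kind: "llm" }, { kind: "tool" }, { kind: "memory" }];
console.log(missingKinds(spans)); // [ 'policy' ]
```

Run this assertion against a synthetic end-to-end request in CI so instrumentation gaps are caught before they show up in a real incident.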
## Common Pitfalls & Debugging

| Pitfall | Symptom | Fix |
|---|---|---|
| Fragmented traces | No end-to-end view of a request | Enforce correlation ID propagation at every boundary |
| Over-verbose labels | Telemetry cost spikes | Drop or bucket high-cardinality attributes |
| Unredacted sensitive fields | Compliance risk | Redact before export |
## Interview Questions They’ll Ask
- What should every agent trace include?
- How do you map business failures to technical spans?
- How do you balance observability depth vs cost?
- How do traces support evaluation and routing?
## Hints in Layers
- Hint 1: Instrument only critical path first.
- Hint 2: Add policy and memory spans explicitly.
- Hint 3: Track tokens and cost per model call.
- Hint 4: Build one-click trace drilldown playbook.
## Submission / Completion Criteria

### Minimum Completion

- End-to-end trace for one full agent run

### Full Completion

- Dashboard with latency/cost/error breakdown

### Excellence

- Alerting + incident triage workflow from trace IDs