Project 10: LLM App Deployment & Monitoring (The “MLOps”)
Deploy one of your assistants and build observability: traces, token/cost metrics, latency breakdowns, and a feedback/eval loop.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 20–35 hours |
| Language | Python (Alternatives: Go, Node.js) |
| Prerequisites | Docker basics, HTTP services, logging/metrics fundamentals |
| Key Topics | tracing, evals, prompt/versioning, PII masking, dashboards, alerting |
1. Learning Objectives
By completing this project, you will:
- Containerize and deploy an LLM-backed service.
- Instrument requests with distributed traces and tool-level spans.
- Track cost, token usage, latency, and failure rates over time.
- Add prompt/version tracking so you can attribute behavior changes.
- Implement a feedback loop and a small eval set to detect regressions.
2. Theoretical Foundation
2.1 Core Concepts
- Golden signals for LLM apps: latency, error rate, throughput, cost, and quality.
- Distributed tracing: a single user request can involve multiple tool calls; traces show where time and failures occur.
- Prompt/versioning: prompts are code; you need version IDs and change history.
- Evaluation (Evals):
  - Deterministic tasks: expected outputs or constraints (schema adherence).
  - Non-deterministic tasks: rubric scoring, consistency checks, and human feedback sampling.
- PII masking: observability pipelines can accidentally become data leaks unless you redact.
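A minimal redaction sketch to make the masking idea concrete, assuming regex-based masking is acceptable for a first pass; the patterns and the redact helper are illustrative, not exhaustive:

import re

# Illustrative patterns; a real deployment needs a broader, tested set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
API_KEY_RE = re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9]{16,}\b")

def redact(text: str) -> str:
    """Mask emails and API-key-like strings before text reaches logs or traces."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = API_KEY_RE.sub("[SECRET]", text)
    return text

print(redact("Contact a.user@example.com with key sk-abcdef1234567890abcd"))
# -> Contact [EMAIL] with key [SECRET]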
2.2 Why This Matters
In production, assistants get expensive, slow, or unreliable. Without observability and evals, you can’t tell if a “prompt tweak” made things better or worse.
2.3 Common Misconceptions
- “Logging is enough.” Logs don’t show causality or per-tool timing; traces do.
- “We can add monitoring later.” You’ll ship blind; retrofitting is painful.
- “Accuracy is subjective.” You can operationalize quality with eval sets and rubrics.
3. Project Specification
3.1 What You Will Build
A deployed service (e.g., “Email Gatekeeper API”) plus:
- A dashboard showing latency, tokens, cost, tool breakdowns, and error rates
- A feedback system (thumbs up/down + notes)
- An eval runner that compares versions and flags regressions
3.2 Functional Requirements
- Service deployment: run the assistant as an HTTP API.
- Tracing: create spans for model calls, retrieval, and each tool.
- Metrics: request count, p50/p95 latency, tokens, cost, errors, retries (see the cost sketch after this list).
- Prompt/version tracking: store version IDs per request.
- PII masking: redact emails, API keys, secrets in logs/traces.
- Evals: run a small suite and produce a report comparing versions.
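For the metrics requirement, token and cost figures should come from the provider’s reported usage rather than client-side estimates (see the pitfalls table in section 7). A minimal sketch, assuming a hypothetical per-million-token price table and a provider response that exposes prompt/completion token counts:

from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; check your provider's current pricing.
PRICES = {
    "small-model": {"input": 0.15, "output": 0.60},
    "large-model": {"input": 2.50, "output": 10.00},
}

@dataclass(frozen=True)
class Usage:
    model: str
    prompt_tokens: int
    completion_tokens: int

def cost_usd(usage: Usage) -> float:
    """Compute request cost from provider-reported token counts."""
    price = PRICES[usage.model]
    return (usage.prompt_tokens * price["input"]
            + usage.completion_tokens * price["output"]) / 1_000_000

print(cost_usd(Usage("small-model", prompt_tokens=1200, completion_tokens=300)))
# 1200*0.15/1e6 + 300*0.60/1e6 = 0.00036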
3.3 Non-Functional Requirements
- Security: secrets stored in an env/secret store; no API keys or auth tokens in logs or traces.
- Reliability: retries with backoff (sketch after this list); graceful degradation on provider outages.
- Cost control: budgets/alerts for runaway token usage.
- Data governance: retention policy for traces and feedback.
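For the reliability requirement, a minimal retry-with-exponential-backoff sketch; the call_provider usage and the ProviderError exception type are placeholders for whatever client you actually use:

import random
import time

class ProviderError(Exception):
    """Placeholder for your provider client's transient error type."""

def call_with_retries(call, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry a flaky call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ProviderError:
            if attempt == max_attempts:
                raise  # give up and let the caller degrade gracefully
            # 0.5s, 1s, 2s, ... plus jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))

# Usage: call_with_retries(lambda: call_provider(prompt))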
3.4 Example Usage / Output
Total Requests: 1240
Avg Latency: 2.1s (p95: 5.4s)
Total Cost: $4.52
Top Tool Latency: web_search (avg 1.2s)
Prompt Version: email_triage_v7
4. Solution Architecture
4.1 High-Level Design
┌──────────────┐   HTTP    ┌───────────────────┐
│    Client    │──────────▶│ Assistant Service │
└──────────────┘           │   (tools + LLM)   │
                           └─────────┬─────────┘
                                     │ emits telemetry
                                     ▼
                      ┌─────────────────────────────┐
                      │   Observability Pipeline    │
                      │   traces + metrics + logs   │
                      └───────┬───────────┬─────────┘
                              ▼           ▼
                        ┌────────────┐  ┌─────────────┐
                        │ Dashboard  │  │ Eval Runner │
                        └────────────┘  └─────────────┘
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Service API | request handling | FastAPI/Flask; async vs sync |
| Telemetry | spans + metrics | OpenTelemetry as baseline |
| Storage | traces and feedback | local DB for learning; swap later |
| Evals | regression detection | deterministic checks + rubric scoring |
| Redaction | prevent leaks | pre-log sanitizer pipeline |
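A possible telemetry.py sketch using the OpenTelemetry SDK from the environment setup in 5.1 (console export here is only to verify spans locally; swap the exporter for your backend later):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

def init_tracing(service_name: str = "assistant-service"):
    """Configure a tracer provider that prints spans to stdout for local debugging."""
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)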
4.3 Data Structures
from dataclasses import dataclass

@dataclass(frozen=True)
class RequestMeta:
    request_id: str
    prompt_version: str
    model: str
    user_id: str | None

@dataclass(frozen=True)
class Feedback:
    request_id: str
    rating: int  # -1/0/1
    notes: str | None
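A possible storage.py sketch for the learning phase, using SQLite and the two dataclasses above so feedback can later be joined to request metadata by request_id (table and column names are illustrative):

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS requests (
    request_id TEXT PRIMARY KEY,
    prompt_version TEXT NOT NULL,
    model TEXT NOT NULL,
    user_id TEXT
);
CREATE TABLE IF NOT EXISTS feedback (
    request_id TEXT NOT NULL REFERENCES requests(request_id),
    rating INTEGER NOT NULL,
    notes TEXT
);
"""

def connect(path: str = "llm_ops.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def save_request(conn: sqlite3.Connection, meta: "RequestMeta") -> None:
    conn.execute(
        "INSERT OR REPLACE INTO requests VALUES (?, ?, ?, ?)",
        (meta.request_id, meta.prompt_version, meta.model, meta.user_id),
    )
    conn.commit()

def save_feedback(conn: sqlite3.Connection, fb: "Feedback") -> None:
    conn.execute(
        "INSERT INTO feedback VALUES (?, ?, ?)",
        (fb.request_id, fb.rating, fb.notes),
    )
    conn.commit()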
4.4 Algorithm Overview
Key Algorithm: instrumented request
- Start trace with request_id and version tags.
- Run assistant pipeline (retrieval/tools/model), creating spans per step.
- Record tokens/cost as metrics and attach to trace attributes.
- Sanitize and persist logs/traces; apply retention policy.
- Collect feedback and join it to request_id for eval datasets.
Complexity Analysis:
- Time: O(pipeline steps) (telemetry overhead should be small)
- Space: O(traces retained + feedback)
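A sketch of the instrumented request, assuming the init_tracing helper from 4.2; run_retrieval, run_tool, and call_model are stand-ins for your assistant’s actual steps, and usage is the provider-reported token count object:

import uuid
from opentelemetry import trace

tracer = trace.get_tracer("assistant-service")

def handle_request(query: str, prompt_version: str = "email_triage_v7") -> str:
    """Run the assistant pipeline with one span per step and version tags on the trace."""
    request_id = str(uuid.uuid4())
    with tracer.start_as_current_span("request") as span:
        span.set_attribute("request.id", request_id)
        span.set_attribute("prompt.version", prompt_version)

        with tracer.start_as_current_span("retrieval"):
            context = run_retrieval(query)                   # stand-in

        with tracer.start_as_current_span("tool.web_search"):
            tool_result = run_tool("web_search", query)      # stand-in

        with tracer.start_as_current_span("model_call") as model_span:
            answer, usage = call_model(query, context, tool_result)  # stand-in
            # Attach provider-reported usage so cost can be derived per trace.
            model_span.set_attribute("tokens.prompt", usage.prompt_tokens)
            model_span.set_attribute("tokens.completion", usage.completion_tokens)

    return answer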
5. Implementation Guide
5.1 Development Environment Setup
pip install fastapi uvicorn opentelemetry-sdk pydantic
5.2 Project Structure
llm-ops/
├── src/
│ ├── api.py
│ ├── assistant_pipeline.py
│ ├── telemetry.py
│ ├── redact.py
│ ├── storage.py
│ └── evals.py
├── docker/
│ └── Dockerfile
└── dashboards/
5.3 Implementation Phases
Phase 1: Deploy the assistant (5–8h)
Goals:
- Make the assistant callable via HTTP and run in Docker.
Tasks:
- Wrap a project (e.g., P03) as a FastAPI endpoint.
- Containerize and run locally.
Checkpoint: curl returns a valid response.
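A minimal api.py sketch for Phase 1; the run_assistant call is a placeholder for whichever earlier project you wrap:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    query: str

class AskResponse(BaseModel):
    answer: str
    request_id: str

@app.post("/ask", response_model=AskResponse)
def ask(req: AskRequest) -> AskResponse:
    # Placeholder: call your assistant pipeline (e.g., the P03 agent) here.
    answer, request_id = run_assistant(req.query)
    return AskResponse(answer=answer, request_id=request_id)

# Run locally:  uvicorn src.api:app --reload
# Checkpoint:   curl -X POST localhost:8000/ask -H 'Content-Type: application/json' -d '{"query": "triage my inbox"}'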
Phase 2: Add telemetry + masking (7–12h)
Goals:
- See where time/cost goes, safely.
Tasks:
- Add request traces and tool spans.
- Add token/cost metrics.
- Add redaction pipeline for logs and feedback.
Checkpoint: One request produces a trace and metrics without leaking content.
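One way to wire the Phase 2 metrics and masking tasks together is a structured usage record that passes through redaction before it is logged; a sketch, assuming the redact helper from section 2.1 and provider-reported token counts:

import json
import logging
import time

logger = logging.getLogger("llm_ops.usage")

def log_usage(request_id: str, prompt_version: str, model: str,
              prompt_tokens: int, completion_tokens: int,
              cost: float, started_at: float, error: str | None = None) -> None:
    """Emit one sanitized, machine-readable record per request for dashboards to aggregate."""
    record = {
        "request_id": request_id,
        "prompt_version": prompt_version,
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost_usd": round(cost, 6),
        "latency_s": round(time.time() - started_at, 3),
        # Redact free-text fields only; IDs and numbers are safe to keep.
        "error": redact(error) if error else None,
    }
    logger.info(json.dumps(record))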
Phase 3: Feedback + evals + alerts (8–15h)
Goals:
- Detect regressions and control cost.
Tasks:
- Store feedback tied to request_id.
- Build an eval runner comparing versions on a fixed dataset.
- Add simple alerts (budget exceeded, p95 latency spike).
Checkpoint: A prompt change can be evaluated and compared automatically.
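A minimal eval-runner sketch for Phase 3, assuming a fixed dataset of (input, check) pairs and a run_assistant_version callable that pins the prompt version; the regression rule (fail if the score drops by more than 5 points) is just an example threshold:

from typing import Callable

# Each case: an input plus a deterministic check on the output.
EVAL_CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("Email from boss: 'Need the report today'", lambda out: "urgent" in out.lower()),
    ("Newsletter: '50% off shoes'",              lambda out: "ignore" in out.lower()),
]

def score_version(version: str, run_assistant_version) -> float:
    """Return the pass rate (0-100) of one prompt version on the fixed eval set."""
    passed = sum(check(run_assistant_version(version, text)) for text, check in EVAL_CASES)
    return 100.0 * passed / len(EVAL_CASES)

def compare(old: str, new: str, run_assistant_version, max_drop: float = 5.0) -> bool:
    old_score = score_version(old, run_assistant_version)
    new_score = score_version(new, run_assistant_version)
    print(f"{old}: {old_score:.1f}  {new}: {new_score:.1f}")
    regression = new_score < old_score - max_drop
    if regression:
        print(f"REGRESSION: {new} dropped more than {max_drop} points vs {old}")
    return not regression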
5.4 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Telemetry | vendor vs OpenTelemetry | OpenTelemetry | portable and standard |
| Evals | manual vs automated | automated baseline + manual sampling | automated checks scale; manual sampling keeps scores honest |
| Storage | local DB vs hosted | local first | learning-focused sprint |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | redaction/versioning | ensure secrets removed, version tags present |
| Integration | API + telemetry | request emits spans and metrics |
| Regression | eval suite | fail if score drops below threshold |
6.2 Critical Test Cases
- No secrets logged: API keys never appear in logs/traces (test sketch after this list).
- Trace completeness: tool spans exist for each tool call.
- Eval regression: version v8 worse than v7 triggers a report/alert.
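A sketch of the “no secrets logged” tests, assuming pytest and the redact helper from section 2.1; the fake key below is not a real credential:

import logging

def test_redaction_removes_secrets():
    fake_key = "sk-" + "a1b2c3d4e5f6a7b8c9d0"
    assert "[SECRET]" in redact(f"auth header used {fake_key}")
    assert fake_key not in redact(f"auth header used {fake_key}")

def test_logs_never_contain_raw_email(caplog):
    # caplog is pytest's built-in fixture for capturing log records.
    with caplog.at_level(logging.INFO):
        logging.getLogger("llm_ops.usage").info(redact("reply to jane.doe@example.com"))
    assert "jane.doe@example.com" not in caplog.text
    assert "[EMAIL]" in caplog.text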
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| Telemetry noise | hard to read traces | consistent span names and tags |
| PII leakage | sensitive content in logs | sanitize at boundaries; test redaction |
| Unattributed changes | “it got worse” mystery | prompt versioning per request |
| Metric lies | costs don’t match | compute from provider usage, not estimates |
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a simple “request viewer” page with trace summaries.
- Add budget caps per user or per day.
8.2 Intermediate Extensions
- Add canary deployments for new prompt versions.
- Add automated test-case generation from feedback.
8.3 Advanced Extensions
- Add active learning: select worst feedback cases for eval set expansion.
- Add privacy-preserving analytics (aggregate only, no raw text).
9. Real-World Connections
9.1 Industry Applications
- Production LLM apps require observability, budgets, and regression detection.
- This is the difference between a demo and a product.
9.3 Interview Relevance
- Observability, evals, prompt versioning, and safe logging.
10. Resources
10.1 Essential Reading
- AI Engineering (Chip Huyen) — production workflows (Ch. 8)
- The LLM Engineering Handbook (Paul Iusztin) — evals and tracing patterns (Ch. 8)
10.3 Tools & Documentation
- OpenTelemetry docs
- Prometheus and dashboarding basics (if you choose Prometheus)
- LangSmith (optional) for LLM app tracing
10.4 Related Projects in This Series
- Previous: Project 7 (codebase concierge) — runs commands; needs monitoring
- Next: Project 11 (voice) — latency-sensitive interface benefits from tracing
11. Self-Assessment Checklist
- I can explain the golden signals for LLM apps.
- I can point to a trace and identify the bottleneck.
- I can compare prompt versions objectively with an eval suite.
- I can prove that logs do not leak secrets or email contents.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Deployed HTTP service for an assistant
- Basic metrics (latency, cost, tokens) and trace spans
- Redaction in logs
Full Completion:
- Prompt/version tracking and feedback collection
- Eval runner that compares versions and flags regressions
Excellence (Going Above & Beyond):
- Canary releases, alerting, and active-learning eval set expansion
This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.