Project 10: LLM App Deployment & Monitoring (The “MLOps”)

Deploy one of your assistants and build observability: traces, token/cost metrics, latency breakdowns, and a feedback/eval loop.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 3: Advanced |
| Time Estimate | 20–35 hours |
| Language | Python (Alternatives: Go, Node.js) |
| Prerequisites | Docker basics, HTTP services, logging/metrics fundamentals |
| Key Topics | tracing, evals, prompt/versioning, PII masking, dashboards, alerting |

1. Learning Objectives

By completing this project, you will:

  1. Containerize and deploy an LLM-backed service.
  2. Instrument requests with distributed traces and tool-level spans.
  3. Track cost, token usage, latency, and failure rates over time.
  4. Add prompt/version tracking so you can attribute behavior changes.
  5. Implement a feedback loop and a small eval set to detect regressions.

2. Theoretical Foundation

2.1 Core Concepts

  • Golden signals for LLM apps: latency, error rate, throughput, cost, and quality.
  • Distributed tracing: a single user request can involve multiple tool calls; traces show where time and failures occur.
  • Prompt/versioning: prompts are code; you need version IDs and change history.
  • Evaluation (Evals):
    • Deterministic tasks: expected outputs or constraints (schema adherence).
    • Non-deterministic tasks: rubric scoring, consistency checks, and human feedback sampling.
  • PII masking: observability pipelines can accidentally become data leaks unless you redact.
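
To make the PII-masking point concrete, here is a minimal sketch of a pre-log sanitizer. The regex patterns and the redact_text name are illustrative assumptions, not a prescribed implementation; a real pipeline needs broader pattern coverage.

import re

# Illustrative patterns only; extend for phone numbers, addresses,
# and your provider's specific key formats.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "API_KEY": re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b"),
}

def redact_text(text: str) -> str:
    """Replace sensitive substrings before anything is logged or traced."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

print(redact_text("Contact alice@example.com, key sk-abcdef1234567890XYZ"))
# -> Contact [REDACTED_EMAIL], key [REDACTED_API_KEY]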

2.2 Why This Matters

In production, assistants get expensive, slow, or unreliable. Without observability and evals, you can’t tell if a “prompt tweak” made things better or worse.

2.3 Common Misconceptions

  • “Logging is enough.” Logs don’t show causality or per-tool timing; traces do.
  • “We can add monitoring later.” You’ll ship blind; retrofitting is painful.
  • “Accuracy is subjective.” You can operationalize quality with eval sets and rubrics.

3. Project Specification

3.1 What You Will Build

A deployed service (e.g., “Email Gatekeeper API”) plus:

  • A dashboard showing latency, tokens, cost, tool breakdowns, and error rates
  • A feedback system (thumbs up/down + notes)
  • An eval runner that compares versions and flags regressions

3.2 Functional Requirements

  1. Service deployment: run the assistant as an HTTP API.
  2. Tracing: create spans for model calls, retrieval, and each tool.
  3. Metrics: request count, p50/p95 latency, tokens, cost, errors, retries.
  4. Prompt/version tracking: store version IDs per request.
  5. PII masking: redact emails, API keys, secrets in logs/traces.
  6. Evals: run a small suite and produce a report comparing versions.

3.3 Non-Functional Requirements

  • Security: secrets stored in env/secret store; no tokens logged.
  • Reliability: retries with backoff; graceful degradation on provider outages (a small sketch follows this list).
  • Cost control: budgets/alerts for runaway token usage.
  • Data governance: retention policy for traces and feedback.
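
As referenced above, here is a minimal retry-with-backoff sketch. The call_with_retries and ProviderError names are illustrative, and the fallback payload is an assumption about how your service chooses to degrade.

import random
import time

class ProviderError(Exception):
    """Stand-in for whatever exception your model provider raises."""

def call_with_retries(call, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry a zero-argument provider call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ProviderError:
            if attempt == max_attempts:
                # Graceful degradation: return a safe fallback instead of crashing.
                return {"status": "degraded", "answer": None}
            sleep_for = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(sleep_for)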

3.4 Example Usage / Output

Total Requests: 1240
Avg Latency: 2.1s (p95: 5.4s)
Total Cost: $4.52
Top Tool Latency: web_search (avg 1.2s)
Prompt Version: email_triage_v7
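
A summary like the one above can be computed from stored request records. The sketch below uses only the standard library; the record fields are assumed for illustration.

import math
import statistics

# Each record is what telemetry/storage captured per request (illustrative fields).
records = [
    {"latency_s": 1.8, "cost_usd": 0.003, "error": False},
    {"latency_s": 2.4, "cost_usd": 0.004, "error": False},
    {"latency_s": 5.9, "cost_usd": 0.002, "error": True},
]

latencies = sorted(r["latency_s"] for r in records)
p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]  # nearest-rank p95
print(f"Total Requests: {len(records)}")
print(f"Avg Latency: {statistics.mean(latencies):.1f}s (p95: {p95:.1f}s)")
print(f"Total Cost: ${sum(r['cost_usd'] for r in records):.2f}")
print(f"Error Rate: {sum(r['error'] for r in records) / len(records):.1%}")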

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐    HTTP     ┌────────────────────┐
│ Client       │────────────▶│ Assistant Service  │
└──────────────┘             │ (tools + LLM)      │
                             └─────────┬──────────┘
                                       │ emits telemetry
                                       ▼
                        ┌──────────────────────────┐
                        │ Observability Pipeline   │
                        │ traces + metrics + logs  │
                        └───────┬──────────┬───────┘
                                ▼          ▼
                          ┌───────────┐  ┌─────────────┐
                          │ Dashboard │  │ Eval Runner │
                          └───────────┘  └─────────────┘

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| Service | API request handling | FastAPI/Flask; async vs sync |
| Telemetry | spans + metrics | OpenTelemetry as baseline |
| Storage | traces and feedback | local DB for learning; swap later |
| Evals | regression detection | deterministic checks + rubric scoring |
| Redaction | prevent leaks | pre-log sanitizer pipeline |

4.3 Data Structures

from dataclasses import dataclass

@dataclass(frozen=True)
class RequestMeta:
    """Per-request metadata attached to every trace, metric, and log record."""
    request_id: str
    prompt_version: str
    model: str
    user_id: str | None

@dataclass(frozen=True)
class Feedback:
    """User feedback joined back to a request by request_id."""
    request_id: str
    rating: int  # -1 = thumbs down, 0 = neutral, 1 = thumbs up
    notes: str | None
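
A minimal sketch of how storage.py could persist these records locally with sqlite3 so feedback can later be joined to requests by request_id. Table and column names are assumptions; RequestMeta and Feedback are the dataclasses above.

import sqlite3

def init_db(path: str = "llmops.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS requests (
            request_id TEXT PRIMARY KEY,
            prompt_version TEXT,
            model TEXT,
            user_id TEXT
        );
        CREATE TABLE IF NOT EXISTS feedback (
            request_id TEXT REFERENCES requests(request_id),
            rating INTEGER,
            notes TEXT
        );
    """)
    return conn

def save_request(conn: sqlite3.Connection, meta: "RequestMeta") -> None:
    conn.execute(
        "INSERT OR REPLACE INTO requests VALUES (?, ?, ?, ?)",
        (meta.request_id, meta.prompt_version, meta.model, meta.user_id),
    )
    conn.commit()

def save_feedback(conn: sqlite3.Connection, fb: "Feedback") -> None:
    conn.execute(
        "INSERT INTO feedback VALUES (?, ?, ?)",
        (fb.request_id, fb.rating, fb.notes),
    )
    conn.commit()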

4.4 Algorithm Overview

Key Algorithm: instrumented request (an OpenTelemetry sketch follows the complexity analysis below)

  1. Start trace with request_id and version tags.
  2. Run assistant pipeline (retrieval/tools/model), creating spans per step.
  3. Record tokens/cost as metrics and attach to trace attributes.
  4. Sanitize and persist logs/traces; apply retention policy.
  5. Collect feedback and join it to request_id for eval datasets.

Complexity Analysis:

  • Time: O(pipeline steps) (telemetry overhead should be small)
  • Space: O(traces retained + feedback)
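
Here is the sketch referenced above: the numbered flow mapped onto OpenTelemetry's Python SDK with a console exporter. The pipeline, token counts, and cost value are stubs/placeholders, and the attribute names are assumptions rather than a standard.

import uuid

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# One-time setup: export spans to the console for local development.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("assistant")

def handle_request(user_text: str, prompt_version: str = "email_triage_v7") -> str:
    request_id = str(uuid.uuid4())
    with tracer.start_as_current_span("assistant_request") as root:
        # Step 1: tag the trace with request and version identifiers.
        root.set_attribute("request.id", request_id)
        root.set_attribute("prompt.version", prompt_version)

        # Step 2: one span per pipeline stage / tool call.
        with tracer.start_as_current_span("retrieval"):
            pass  # stub: retrieve context for user_text

        with tracer.start_as_current_span("model_call") as span:
            answer = "stubbed model answer"  # stub: call the LLM provider
            # Step 3: record token/cost figures as trace attributes (placeholder values).
            span.set_attribute("tokens.prompt", 120)
            span.set_attribute("tokens.completion", 45)
            span.set_attribute("cost.usd", 0.0011)
    return answer

# Steps 4-5 (sanitizing/persisting, joining feedback) belong in redact.py and storage.py.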

5. Implementation Guide

5.1 Development Environment Setup

pip install fastapi uvicorn opentelemetry-sdk pydantic

5.2 Project Structure

llm-ops/
├── src/
│   ├── api.py
│   ├── assistant_pipeline.py
│   ├── telemetry.py
│   ├── redact.py
│   ├── storage.py
│   └── evals.py
├── docker/
│   └── Dockerfile
└── dashboards/

5.3 Implementation Phases

Phase 1: Deploy the assistant (5–8h)

Goals:

  • Make the assistant callable via HTTP and run in Docker.

Tasks:

  1. Wrap a project (e.g., P03) as a FastAPI endpoint (sketched after the checkpoint below).
  2. Containerize and run locally.

Checkpoint: curl returns a valid response.
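
A minimal FastAPI wrapper sketch for task 1. It assumes your assistant exposes something like assistant_pipeline.run(text); that name, the /triage route, and the request/response shapes are illustrative.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TriageRequest(BaseModel):
    text: str

class TriageResponse(BaseModel):
    answer: str

@app.post("/triage", response_model=TriageResponse)
def triage(req: TriageRequest) -> TriageResponse:
    # Replace this stub with a call into your assistant pipeline,
    # e.g. assistant_pipeline.run(req.text).
    return TriageResponse(answer=f"stubbed triage for: {req.text[:40]}")

# Run locally: uvicorn src.api:app --reload
# Check: curl -X POST localhost:8000/triage -H "Content-Type: application/json" -d '{"text": "hello"}'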

Phase 2: Add telemetry + masking (7–12h)

Goals:

  • See where time/cost goes, safely.

Tasks:

  1. Add request traces and tool spans.
  2. Add token/cost metrics (a cost sketch follows the checkpoint).
  3. Add redaction pipeline for logs and feedback.

Checkpoint: One request produces a trace and metrics without leaking content.
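
For task 2, derive cost from provider-reported token usage rather than estimating from text length. A rough sketch follows; the per-1K-token prices are placeholders, so always use your provider's actual rates.

from dataclasses import dataclass

# Hypothetical prices per 1K tokens; look up your provider's real pricing.
PRICE_PER_1K = {"prompt": 0.0005, "completion": 0.0015}

@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int

def request_cost(usage: Usage) -> float:
    """Compute cost from provider-reported token usage, not from estimates."""
    return round(
        usage.prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
        + usage.completion_tokens / 1000 * PRICE_PER_1K["completion"],
        6,
    )

print(request_cost(Usage(prompt_tokens=1200, completion_tokens=300)))  # 0.00105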

Phase 3: Feedback + evals + alerts (8–15h)

Goals:

  • Detect regressions and control cost.

Tasks:

  1. Store feedback tied to request_id.
  2. Build an eval runner comparing versions on a fixed dataset (a runner sketch follows the checkpoint).
  3. Add simple alerts (budget exceeded, p95 latency spike).

Checkpoint: A prompt change can be evaluated and compared automatically.
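
One possible shape for the eval runner from task 2: a fixed set of deterministic cases plus a simple regression threshold. The cases, labels, and threshold are illustrative, and each assistant is assumed to be a callable that maps input text to a label; rubric-scored cases would plug in alongside the deterministic ones.

# Fixed eval cases: input plus a deterministic expected label (illustrative).
EVAL_CASES = [
    {"input": "URGENT: invoice overdue", "expect_label": "important"},
    {"input": "50% off sunglasses today only", "expect_label": "promotional"},
]

def run_evals(assistant, cases) -> float:
    """Return the fraction of cases the assistant labels correctly."""
    passed = sum(1 for c in cases if assistant(c["input"]) == c["expect_label"])
    return passed / len(cases)

def compare_versions(assistant_old, assistant_new, threshold: float = 0.05) -> dict:
    """Flag a regression if the new version scores meaningfully worse."""
    old_score = run_evals(assistant_old, EVAL_CASES)
    new_score = run_evals(assistant_new, EVAL_CASES)
    return {
        "old": old_score,
        "new": new_score,
        "regression": new_score < old_score - threshold,
    }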

5.4 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
| --- | --- | --- | --- |
| Telemetry | vendor vs OpenTelemetry | OpenTelemetry | portable and standard |
| Evals | manual vs automated | automated baseline + manual sampling | scale + ground truth |
| Storage | local DB vs hosted | local first | learning-focused sprint |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
| --- | --- | --- |
| Unit | redaction/versioning | ensure secrets removed, version tags present |
| Integration | API + telemetry | request emits spans and metrics |
| Regression | eval suite | fail if score drops below threshold |

6.2 Critical Test Cases

  1. No secrets logged: API keys never appear in logs/traces (see the pytest sketch after this list).
  2. Trace completeness: tool spans exist for each tool call.
  3. Eval regression: version v8 worse than v7 triggers a report/alert.
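
A sketch of how test case 1 can be automated with pytest, assuming the redact_text helper sketched in section 2.1 lives in src/redact.py; the module path and sample strings are assumptions.

import re

from src.redact import redact_text  # the sanitizer sketched earlier (path assumed)

def test_no_secrets_in_log_output():
    raw = "calling provider with key sk-abcdef1234567890XYZ for bob@example.com"
    sanitized = redact_text(raw)
    assert "sk-abcdef1234567890XYZ" not in sanitized
    assert not re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", sanitized)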

7. Common Pitfalls & Debugging

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Telemetry noise | hard-to-read traces | consistent span names and tags |
| PII leakage | sensitive content in logs | sanitize at boundaries; test redaction |
| Unattributed changes | “it got worse” mystery | prompt versioning per request |
| Metric lies | reported costs don’t match actual spend | compute from provider usage, not estimates |

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a simple “request viewer” page with trace summaries.
  • Add budget caps per user or per day.
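
A rough in-memory sketch of the per-user daily cap idea; a real service would persist the counters and reset them on a schedule, and the budget value here is arbitrary.

from collections import defaultdict
from datetime import date

DAILY_BUDGET_USD = 1.00  # illustrative cap
_spend: dict[tuple[str, date], float] = defaultdict(float)

def charge(user_id: str, cost_usd: float) -> bool:
    """Record spend and return False when today's budget is exhausted."""
    key = (user_id, date.today())
    if _spend[key] + cost_usd > DAILY_BUDGET_USD:
        return False  # caller should refuse the request or degrade gracefully
    _spend[key] += cost_usd
    return True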

8.2 Intermediate Extensions

  • Add canary deployments for new prompt versions.
  • Add automated test-case generation from feedback.

8.3 Advanced Extensions

  • Add active learning: select worst feedback cases for eval set expansion.
  • Add privacy-preserving analytics (aggregate only, no raw text).

9. Real-World Connections

9.1 Industry Applications

  • Production LLM apps require observability, budgets, and regression detection.
  • This is the difference between a demo and a product.

9.2 Interview Relevance

  • Observability, evals, prompt versioning, and safe logging.

10. Resources

10.1 Essential Reading

  • AI Engineering (Chip Huyen) — production workflows (Ch. 8)
  • The LLM Engineering Handbook (Paul Iusztin) — evals and tracing patterns (Ch. 8)

10.2 Tools & Documentation

  • OpenTelemetry docs
  • Prometheus and dashboarding basics (if you choose Prometheus)
  • LangSmith (optional) for LLM app tracing

10.3 Related Projects

  • Previous: Project 7 (codebase concierge) — runs commands; needs monitoring
  • Next: Project 11 (voice) — latency-sensitive interface benefits from tracing

11. Self-Assessment Checklist

  • I can explain the golden signals for LLM apps.
  • I can point to a trace and identify the bottleneck.
  • I can compare prompt versions objectively with an eval suite.
  • I can prove that logs do not leak secrets or email contents.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Deployed HTTP service for an assistant
  • Basic metrics (latency, cost, tokens) and trace spans
  • Redaction in logs

Full Completion:

  • Prompt/version tracking and feedback collection
  • Eval runner that compares versions and flags regressions

Excellence (Going Above & Beyond):

  • Canary releases, alerting, and active-learning eval set expansion

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.