Project 10: LLM App Deployment & Monitoring (The “MLOps”)

Deploy one of your assistants and build observability: traces, token/cost metrics, latency breakdowns, and a feedback/eval loop.

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Level 3: Advanced |
| Time Estimate | 20–35 hours |
| Language | Python (Alternatives: Go, Node.js) |
| Prerequisites | Docker basics, HTTP services, logging/metrics fundamentals |
| Key Topics | tracing, evals, prompt/versioning, PII masking, dashboards, alerting |

1. Learning Objectives

By completing this project, you will:

  1. Containerize and deploy an LLM-backed service.
  2. Instrument requests with distributed traces and tool-level spans.
  3. Track cost, token usage, latency, and failure rates over time.
  4. Add prompt/version tracking so you can attribute behavior changes.
  5. Implement a feedback loop and a small eval set to detect regressions.

2. Theoretical Foundation

2.1 Core Concepts

  • Golden signals for LLM apps: latency, error rate, throughput, cost, and quality.
  • Distributed tracing: a single user request can involve multiple tool calls; traces show where time and failures occur.
  • Prompt/versioning: prompts are code; you need version IDs and change history.
  • Evaluation (Evals):
    • Deterministic tasks: expected outputs or constraints (schema adherence).
    • Non-deterministic tasks: rubric scoring, consistency checks, and human feedback sampling.
  • PII masking: observability pipelines can accidentally become data leaks unless you redact.
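
To make the PII-masking point concrete, here is a minimal sketch of a pre-log sanitizer. The regex patterns and the redact_text name are illustrative assumptions, not a prescribed implementation; a real pipeline needs broader pattern coverage.

import re

# Illustrative patterns only; extend for phone numbers, addresses,
# and your provider's specific key formats.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "API_KEY": re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b"),
}

def redact_text(text: str) -> str:
    """Replace sensitive substrings before anything is logged or traced."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

print(redact_text("Contact alice@example.com, key sk-abcdef1234567890XYZ"))
# -> Contact [REDACTED_EMAIL], key [REDACTED_API_KEY]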

2.2 Why This Matters

In production, assistants get expensive, slow, or unreliable. Without observability and evals, you can’t tell if a “prompt tweak” made things better or worse.

2.3 Common Misconceptions

  • “Logging is enough.” Logs don’t show causality or per-tool timing; traces do.
  • “We can add monitoring later.” You’ll ship blind; retrofitting is painful.
  • “Accuracy is subjective.” You can operationalize quality with eval sets and rubrics.

3. Project Specification

3.1 What You Will Build

A deployed service (e.g., “Email Gatekeeper API”) plus:

  • A dashboard showing latency, tokens, cost, tool breakdowns, and error rates
  • A feedback system (thumbs up/down + notes)
  • An eval runner that compares versions and flags regressions

3.2 Functional Requirements

  1. Service deployment: run the assistant as an HTTP API.
  2. Tracing: create spans for model calls, retrieval, and each tool.
  3. Metrics: request count, p50/p95 latency, tokens, cost, errors, retries.
  4. Prompt/version tracking: store version IDs per request.
  5. PII masking: redact emails, API keys, secrets in logs/traces.
  6. Evals: run a small suite and produce a report comparing versions.

3.3 Non-Functional Requirements

  • Security: secrets stored in env/secret store; no tokens logged.
  • Reliability: retries with backoff; graceful degradation on provider outages (a small sketch follows this list).
  • Cost control: budgets/alerts for runaway token usage.
  • Data governance: retention policy for traces and feedback.
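
As referenced above, here is a minimal retry-with-backoff sketch. The call_with_retries and ProviderError names are illustrative, and the fallback payload is an assumption about how your service chooses to degrade.

import random
import time

class ProviderError(Exception):
    """Stand-in for whatever exception your model provider raises."""

def call_with_retries(call, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry a zero-argument provider call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ProviderError:
            if attempt == max_attempts:
                # Graceful degradation: return a safe fallback instead of crashing.
                return {"status": "degraded", "answer": None}
            sleep_for = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(sleep_for)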

3.4 Example Usage / Output

Total Requests: 1240
Avg Latency: 2.1s (p95: 5.4s)
Total Cost: $4.52
Top Tool Latency: web_search (avg 1.2s)
Prompt Version: email_triage_v7
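
A summary like the one above can be computed from stored request records. The sketch below uses only the standard library; the record fields are assumed for illustration.

import math
import statistics

# Each record is what telemetry/storage captured per request (illustrative fields).
records = [
    {"latency_s": 1.8, "cost_usd": 0.003, "error": False},
    {"latency_s": 2.4, "cost_usd": 0.004, "error": False},
    {"latency_s": 5.9, "cost_usd": 0.002, "error": True},
]

latencies = sorted(r["latency_s"] for r in records)
p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]  # nearest-rank p95
print(f"Total Requests: {len(records)}")
print(f"Avg Latency: {statistics.mean(latencies):.1f}s (p95: {p95:.1f}s)")
print(f"Total Cost: ${sum(r['cost_usd'] for r in records):.2f}")
print(f"Error Rate: {sum(r['error'] for r in records) / len(records):.1%}")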

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐    HTTP     ┌────────────────────┐
│ Client       │────────────▶│ Assistant Service  │
└──────────────┘             │ (tools + LLM)      │
                             └─────────┬──────────┘
                                       │ emits telemetry
                                       ▼
                        ┌──────────────────────────┐
                        │ Observability Pipeline   │
                        │ traces + metrics + logs  │
                        └───────┬──────────┬───────┘
                                ▼          ▼
                          ┌───────────┐  ┌─────────────┐
                          │ Dashboard │  │ Eval Runner │
                          └───────────┘  └─────────────┘

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| Service | API request handling | FastAPI/Flask; async vs sync |
| Telemetry | spans + metrics | OpenTelemetry as baseline |
| Storage | traces and feedback | local DB for learning; swap later |
| Evals | regression detection | deterministic checks + rubric scoring |
| Redaction | prevent leaks | pre-log sanitizer pipeline |

4.3 Data Structures

from dataclasses import dataclass

@dataclass(frozen=True)
class RequestMeta:
    """Per-request metadata attached to every trace, metric, and log record."""
    request_id: str
    prompt_version: str
    model: str
    user_id: str | None

@dataclass(frozen=True)
class Feedback:
    """User feedback joined back to a request by request_id."""
    request_id: str
    rating: int  # -1 = thumbs down, 0 = neutral, 1 = thumbs up
    notes: str | None
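
A minimal sketch of how storage.py could persist these records locally with sqlite3 so feedback can later be joined to requests by request_id. Table and column names are assumptions; RequestMeta and Feedback are the dataclasses above.

import sqlite3

def init_db(path: str = "llmops.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS requests (
            request_id TEXT PRIMARY KEY,
            prompt_version TEXT,
            model TEXT,
            user_id TEXT
        );
        CREATE TABLE IF NOT EXISTS feedback (
            request_id TEXT REFERENCES requests(request_id),
            rating INTEGER,
            notes TEXT
        );
    """)
    return conn

def save_request(conn: sqlite3.Connection, meta: "RequestMeta") -> None:
    conn.execute(
        "INSERT OR REPLACE INTO requests VALUES (?, ?, ?, ?)",
        (meta.request_id, meta.prompt_version, meta.model, meta.user_id),
    )
    conn.commit()

def save_feedback(conn: sqlite3.Connection, fb: "Feedback") -> None:
    conn.execute(
        "INSERT INTO feedback VALUES (?, ?, ?)",
        (fb.request_id, fb.rating, fb.notes),
    )
    conn.commit()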

4.4 Algorithm Overview

Key Algorithm: instrumented request (an OpenTelemetry sketch follows the complexity analysis below)

  1. Start trace with request_id and version tags.
  2. Run assistant pipeline (retrieval/tools/model), creating spans per step.
  3. Record tokens/cost as metrics and attach to trace attributes.
  4. Sanitize and persist logs/traces; apply retention policy.
  5. Collect feedback and join it to request_id for eval datasets.

Complexity Analysis:

  • Time: O(pipeline steps) (telemetry overhead should be small)
  • Space: O(traces retained + feedback)
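
Here is the sketch referenced above: the numbered flow mapped onto OpenTelemetry's Python SDK with a console exporter. The pipeline, token counts, and cost value are stubs/placeholders, and the attribute names are assumptions rather than a standard.

import uuid

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# One-time setup: export spans to the console for local development.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("assistant")

def handle_request(user_text: str, prompt_version: str = "email_triage_v7") -> str:
    request_id = str(uuid.uuid4())
    with tracer.start_as_current_span("assistant_request") as root:
        # Step 1: tag the trace with request and version identifiers.
        root.set_attribute("request.id", request_id)
        root.set_attribute("prompt.version", prompt_version)

        # Step 2: one span per pipeline stage / tool call.
        with tracer.start_as_current_span("retrieval"):
            pass  # stub: retrieve context for user_text

        with tracer.start_as_current_span("model_call") as span:
            answer = "stubbed model answer"  # stub: call the LLM provider
            # Step 3: record token/cost figures as trace attributes (placeholder values).
            span.set_attribute("tokens.prompt", 120)
            span.set_attribute("tokens.completion", 45)
            span.set_attribute("cost.usd", 0.0011)
    return answer

# Steps 4-5 (sanitizing/persisting, joining feedback) belong in redact.py and storage.py.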

5. Implementation Guide

5.1 Development Environment Setup

pip install fastapi uvicorn opentelemetry-sdk pydantic

5.2 Project Structure

llm-ops/
├── src/
│   ├── api.py
│   ├── assistant_pipeline.py
│   ├── telemetry.py
│   ├── redact.py
│   ├── storage.py
│   └── evals.py
├── docker/
│   └── Dockerfile
└── dashboards/

5.3 Implementation Phases

Phase 1: Deploy the assistant (5–8h)

Goals:

  • Make the assistant callable via HTTP and run in Docker.

Tasks:

  1. Wrap a project (e.g., P03) as a FastAPI endpoint (sketched after the checkpoint below).
  2. Containerize and run locally.

Checkpoint: curl returns a valid response.
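
A minimal FastAPI wrapper sketch for task 1. It assumes your assistant exposes something like assistant_pipeline.run(text); that name, the /triage route, and the request/response shapes are illustrative.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TriageRequest(BaseModel):
    text: str

class TriageResponse(BaseModel):
    answer: str

@app.post("/triage", response_model=TriageResponse)
def triage(req: TriageRequest) -> TriageResponse:
    # Replace this stub with a call into your assistant pipeline,
    # e.g. assistant_pipeline.run(req.text).
    return TriageResponse(answer=f"stubbed triage for: {req.text[:40]}")

# Run locally: uvicorn src.api:app --reload
# Check: curl -X POST localhost:8000/triage -H "Content-Type: application/json" -d '{"text": "hello"}'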

Phase 2: Add telemetry + masking (7–12h)

Goals:

  • See where time/cost goes, safely.

Tasks:

  1. Add request traces and tool spans.
  2. Add token/cost metrics (a cost sketch follows the checkpoint).
  3. Add redaction pipeline for logs and feedback.

Checkpoint: One request produces a trace and metrics without leaking content.
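
For task 2, derive cost from provider-reported token usage rather than estimating from text length. A rough sketch follows; the per-1K-token prices are placeholders, so always use your provider's actual rates.

from dataclasses import dataclass

# Hypothetical prices per 1K tokens; look up your provider's real pricing.
PRICE_PER_1K = {"prompt": 0.0005, "completion": 0.0015}

@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int

def request_cost(usage: Usage) -> float:
    """Compute cost from provider-reported token usage, not from estimates."""
    return round(
        usage.prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
        + usage.completion_tokens / 1000 * PRICE_PER_1K["completion"],
        6,
    )

print(request_cost(Usage(prompt_tokens=1200, completion_tokens=300)))  # 0.00105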

Phase 3: Feedback + evals + alerts (8–15h)

Goals:

  • Detect regressions and control cost.

Tasks:

  1. Store feedback tied to request_id.
  2. Build an eval runner comparing versions on a fixed dataset (a runner sketch follows the checkpoint).
  3. Add simple alerts (budget exceeded, p95 latency spike).

Checkpoint: A prompt change can be evaluated and compared automatically.
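
One possible shape for the eval runner from task 2: a fixed set of deterministic cases plus a simple regression threshold. The cases, labels, and threshold are illustrative, and each assistant is assumed to be a callable that maps input text to a label; rubric-scored cases would plug in alongside the deterministic ones.

# Fixed eval cases: input plus a deterministic expected label (illustrative).
EVAL_CASES = [
    {"input": "URGENT: invoice overdue", "expect_label": "important"},
    {"input": "50% off sunglasses today only", "expect_label": "promotional"},
]

def run_evals(assistant, cases) -> float:
    """Return the fraction of cases the assistant labels correctly."""
    passed = sum(1 for c in cases if assistant(c["input"]) == c["expect_label"])
    return passed / len(cases)

def compare_versions(assistant_old, assistant_new, threshold: float = 0.05) -> dict:
    """Flag a regression if the new version scores meaningfully worse."""
    old_score = run_evals(assistant_old, EVAL_CASES)
    new_score = run_evals(assistant_new, EVAL_CASES)
    return {
        "old": old_score,
        "new": new_score,
        "regression": new_score < old_score - threshold,
    }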

5.4 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
| --- | --- | --- | --- |
| Telemetry | vendor vs OpenTelemetry | OpenTelemetry | portable and standard |
| Evals | manual vs automated | automated baseline + manual sampling | scale + ground truth |
| Storage | local DB vs hosted | local first | learning-focused sprint |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
| --- | --- | --- |
| Unit | redaction/versioning | ensure secrets removed, version tags present |
| Integration | API + telemetry | request emits spans and metrics |
| Regression | eval suite | fail if score drops below threshold |

6.2 Critical Test Cases

  1. No secrets logged: API keys never appear in logs/traces (see the pytest sketch after this list).
  2. Trace completeness: tool spans exist for each tool call.
  3. Eval regression: version v8 worse than v7 triggers a report/alert.
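
A sketch of how test case 1 can be automated with pytest, assuming the redact_text helper sketched in section 2.1 lives in src/redact.py; the module path and sample strings are assumptions.

import re

from src.redact import redact_text  # the sanitizer sketched earlier (path assumed)

def test_no_secrets_in_log_output():
    raw = "calling provider with key sk-abcdef1234567890XYZ for bob@example.com"
    sanitized = redact_text(raw)
    assert "sk-abcdef1234567890XYZ" not in sanitized
    assert not re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", sanitized)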

7. Common Pitfalls & Debugging

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Telemetry noise | hard-to-read traces | consistent span names and tags |
| PII leakage | sensitive content in logs | sanitize at boundaries; test redaction |
| Unattributed changes | “it got worse” mystery | prompt versioning per request |
| Metric lies | reported costs don’t match actual spend | compute from provider usage, not estimates |

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a simple “request viewer” page with trace summaries.
  • Add budget caps per user or per day.
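
A rough in-memory sketch of the per-user daily cap idea; a real service would persist the counters and reset them on a schedule, and the budget value here is arbitrary.

from collections import defaultdict
from datetime import date

DAILY_BUDGET_USD = 1.00  # illustrative cap
_spend: dict[tuple[str, date], float] = defaultdict(float)

def charge(user_id: str, cost_usd: float) -> bool:
    """Record spend and return False when today's budget is exhausted."""
    key = (user_id, date.today())
    if _spend[key] + cost_usd > DAILY_BUDGET_USD:
        return False  # caller should refuse the request or degrade gracefully
    _spend[key] += cost_usd
    return True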

8.2 Intermediate Extensions

  • Add canary deployments for new prompt versions.
  • Add automated test-case generation from feedback.

8.3 Advanced Extensions

  • Add active learning: select worst feedback cases for eval set expansion.
  • Add privacy-preserving analytics (aggregate only, no raw text).

9. Real-World Connections

9.1 Industry Applications

  • Production LLM apps require observability, budgets, and regression detection.
  • This is the difference between a demo and a product.

9.2 Interview Relevance

  • Observability, evals, prompt versioning, and safe logging.

10. Resources

10.1 Essential Reading

  • AI Engineering (Chip Huyen) — production workflows (Ch. 8)
  • The LLM Engineering Handbook (Paul Iusztin) — evals and tracing patterns (Ch. 8)

10.2 Tools & Documentation

  • OpenTelemetry docs
  • Prometheus and dashboarding basics (if you choose Prometheus)
  • LangSmith (optional) for LLM app tracing

10.3 Related Projects

  • Previous: Project 7 (codebase concierge) — runs commands; needs monitoring
  • Next: Project 11 (voice) — latency-sensitive interface benefits from tracing

11. Self-Assessment Checklist

  • I can explain the golden signals for LLM apps.
  • I can point to a trace and identify the bottleneck.
  • I can compare prompt versions objectively with an eval suite.
  • I can prove that logs do not leak secrets or email contents.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Deployed HTTP service for an assistant
  • Basic metrics (latency, cost, tokens) and trace spans
  • Redaction in logs

Full Completion:

  • Prompt/version tracking and feedback collection
  • Eval runner that compares versions and flags regressions

Excellence (Going Above & Beyond):

  • Canary releases, alerting, and active-learning eval set expansion

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.