Project 1: LLM Prompt Playground & Analyzer

Build a Streamlit "prompt battle arena" to compare prompts, models, and parameters, with token, cost, and latency analytics and optional LLM-as-a-judge scoring.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | 8–12 hours (weekend) |
| Language | Python (alternatives: TypeScript/Node.js, Go) |
| Prerequisites | Python basics, HTTP/APIs, env vars, basic Git |
| Key Topics | System prompts, sampling (temperature/top-p), token accounting, eval rubrics, streaming, provider abstraction |

1. Learning Objectives

By completing this project, you will:

  1. Build a minimal but real LLM client with model/provider selection.
  2. Understand how prompt structure changes output quality and failure modes.
  3. Measure latency + token usage and compute cost across providers.
  4. Design repeatable qualitative evaluation (rubrics + judge model).
  5. Store experiments and compare results over time.

2. Theoretical Foundation

2.1 Core Concepts

  • Chat roles (system/developer/user): The system message anchors behavior; user messages are tasks. Your UI should make roles explicit so you can observe how changing the "constitution" text affects outcomes.
  • Decoding & sampling:
    • Temperature controls randomness by flattening/sharpening the token distribution.
    • Top-p (nucleus sampling) selects from the smallest set of tokens whose cumulative probability is p.
    • For assistants: lower randomness for extraction/formatting; higher randomness for ideation (see the sketch after this list).
  • Tokenization & context windows: Models operate on tokens, and the context window is the model's working "RAM"; prompt length consumes it. You can't optimize cost or latency if you don't track prompt and completion tokens.
  • Evaluation (Evals):
    • Human evaluation is slow but high-quality.
    • LLM-as-a-judge is fast and scalable but can be biased; you mitigate by using explicit rubrics and stable sampling settings (e.g., temperature 0).
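
A minimal sketch of how these knobs show up in code, assuming the OpenAI Python SDK (other chat-style providers expose equivalent parameters); the model name and prompts are placeholders:

from openai import OpenAI  # assumed SDK; adapt to your provider

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Low randomness for extraction/formatting tasks.
extraction = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "Extract all dates as a JSON list."},
        {"role": "user", "content": "We met on 2024-03-01 and again on 2024-04-12."},
    ],
    temperature=0.0,      # near-deterministic decoding
    max_tokens=200,
)

# Higher randomness for ideation.
ideas = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Brainstorm 5 names for a prompt A/B tool."}],
    temperature=1.0,
    top_p=0.9,            # nucleus sampling: smallest token set with cumulative probability 0.9
    max_tokens=200,
)

print(extraction.choices[0].message.content)
print(ideas.choices[0].message.content)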

2.2 Why This Matters

Personal assistants are only as good as their instruction-following. Before you orchestrate tools or build memory, you need a "lab" where you can reliably answer:

  • "Which prompt produces fewer hallucinations for this task?"
  • "How much does this interaction cost at scale?"
  • "Is a more expensive model worth it for this workflow?"

2.3 Common Misconceptions

  • "Model selection matters more than prompts." In many workflows, prompt quality dominates model choice.
  • "Temperature is a creativity knob only." It's also a reliability knob; non-determinism can break structured outputs.
  • "Token counts don't matter for small apps." Costs compound quickly when you add memory, tools, and multi-step loops.

3. Project Specification

3.1 What You Will Build

A Streamlit web app where you:

  • Define a shared Goal/Task.
  • Enter Prompt A and Prompt B (system prompts).
  • Run both prompts against one or two selected models/providers.
  • Inspect responses + metrics (tokens, cost, latency, throughput).
  • Optionally run a judge that scores outputs with a rubric.
  • Save sessions and export results as JSON.

3.2 Functional Requirements

  1. Provider abstraction: support at least one provider; design so adding another is straightforward.
  2. Side-by-side comparison: show Prompt A vs Prompt B outputs with identical input task and parameters.
  3. Parameter control: temperature, max tokens, optional top-p.
  4. Metrics: prompt tokens, completion tokens, total tokens, latency (first-token if streaming; total time).
  5. Cost calculator: compute cost using configurable pricing (see the pricing sketch after this list).
  6. Session persistence: store each run (prompts, model, params, outputs, metrics).
  7. Export: JSON export of a session (for later analysis).
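
A minimal sketch of requirement 5, assuming pricing is kept as per-million-token rates in a small config dict; the model names and prices below are placeholders, not current rates:

# pricing.py -- illustrative only; keep real rates in config with an update date
PRICING_PER_MILLION = {
    # model: (input USD per 1M tokens, output USD per 1M tokens) -- placeholder numbers
    "example-small": (0.15, 0.60),
    "example-large": (2.50, 10.00),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated USD cost of one call from the configured pricing table."""
    input_rate, output_rate = PRICING_PER_MILLION[model]
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000

# Example: 1,200 prompt tokens + 300 completion tokens on "example-small"
# => (1200 * 0.15 + 300 * 0.60) / 1_000_000 = 0.00036 USD
print(f"${estimate_cost('example-small', 1200, 300):.5f}")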

3.3 Non-Functional Requirements

  • Reliability: handle provider errors, timeouts, and rate limits without crashing the UI.
  • Privacy: avoid logging secrets; clearly mark what is persisted to disk.
  • Repeatability: allow deterministic settings (temperature 0, fixed seed if available).
  • Usability: minimal clicks; comparisons should be visually obvious.

3.4 Example Usage / Output

streamlit run prompt_battle.py

Example rubric snippet (judge):

{
  "accuracy": 8,
  "clarity": 9,
  "actionability": 7,
  "format_adherence": 10,
  "notes": "B follows the table format and cites assumptions."
}
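
One way to elicit that JSON is a rubric-driven judge prompt. The sketch below mirrors the keys in the snippet above; the wording and score range are only a starting point, not a fixed recipe:

JUDGE_SYSTEM = (
    "You are an impartial evaluator. Score the candidate answer against the task on "
    "accuracy, clarity, actionability, and format_adherence (integers 1-10). "
    "Reward correctness and concision, not length. "
    "Respond with a single JSON object containing those four keys plus 'notes'."
)

def build_judge_messages(task: str, answer: str) -> list[dict[str, str]]:
    """Build the chat messages for one judge call (run it at temperature 0)."""
    user = f"Task:\n{task}\n\nCandidate answer:\n{answer}\n\nReturn only JSON."
    return [
        {"role": "system", "content": JUDGE_SYSTEM},
        {"role": "user", "content": user},
    ]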

4. Solution Architecture

4.1 High-Level Design

┌───────────────┐   run/config   ┌─────────────────┐
│  Streamlit UI │───────────────▶│  Battle Runner  │
└───────┬───────┘                └────────┬────────┘
        │                                 │
        │ metrics/results                 │ provider calls
        ▼                                 ▼
┌────────────────┐               ┌─────────────────┐
│ Session Store  │◀─────────────▶│ Provider Client │
└───────┬────────┘               └─────────────────┘
        │
        ▼
┌────────────────┐
│ Pricing Config │
└────────────────┘

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| UI | Collect inputs + display results | Keep "Goal" separate from system prompts |
| Provider client(s) | Call the model API | Normalize responses + usage across providers |
| Battle runner | Execute A/B calls + judge | Deterministic ordering; consistent params |
| Cost/pricing | Compute cost by token type | Pricing as config, not hard-coded |
| Session store | Persist runs | SQLite/JSONL; include schema versioning |

4.3 Data Structures

from dataclasses import dataclass
from typing import Any, Literal

Role = Literal["system", "user", "assistant"]

@dataclass
class ModelInvocation:
    provider: str
    model: str
    temperature: float
    max_tokens: int
    top_p: float | None

@dataclass
class ModelResult:
    text: str
    prompt_tokens: int | None
    completion_tokens: int | None
    latency_ms: int
    raw: dict[str, Any]
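
A sketch of the provider abstraction these dataclasses feed into, assuming an OpenAI-compatible client; the class and method names are suggestions, not a fixed API:

import time
from abc import ABC, abstractmethod

class Provider(ABC):
    """Normalizes provider-specific responses into ModelResult."""

    @abstractmethod
    def chat(self, messages: list[dict[str, str]], config: ModelInvocation) -> ModelResult: ...

class OpenAIProvider(Provider):
    def __init__(self, client):  # e.g. openai.OpenAI(); injected to make testing easier
        self.client = client

    def chat(self, messages, config):
        start = time.perf_counter()
        resp = self.client.chat.completions.create(
            model=config.model,
            messages=messages,
            temperature=config.temperature,
            max_tokens=config.max_tokens,
            **({"top_p": config.top_p} if config.top_p is not None else {}),
        )
        usage = resp.usage  # may be missing for some providers or streaming modes
        return ModelResult(
            text=resp.choices[0].message.content,
            prompt_tokens=getattr(usage, "prompt_tokens", None),
            completion_tokens=getattr(usage, "completion_tokens", None),
            latency_ms=int((time.perf_counter() - start) * 1000),
            raw=resp.model_dump(),
        )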

4.4 Algorithm Overview

Key Algorithm: A/B battle execution (a runner sketch follows the complexity analysis)

  1. Validate config (models, params, keys).
  2. Call model A and model B with identical user input.
  3. Record latency and token usage.
  4. Optionally call judge model with rubric + both outputs.
  5. Persist the session and render metrics.

Complexity Analysis:

  • Time: O(number_of_models + judge) network calls
  • Space: O(output_size + stored_session)
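
A compact sketch of the runner, building on the Provider interface and dataclasses from 4.3; the judge and persistence steps are omitted here, and all names are placeholders:

from dataclasses import dataclass

@dataclass
class BattleOutcome:
    result_a: ModelResult
    result_b: ModelResult

class BattleRunner:
    def __init__(self, provider: Provider):
        self.provider = provider

    def run(self, task: str, system_a: str, system_b: str, config: ModelInvocation) -> BattleOutcome:
        """Run Prompt A and Prompt B against the same task with identical parameters."""
        def call(system_prompt: str) -> ModelResult:
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": task},
            ]
            return self.provider.chat(messages, config)

        # Deterministic ordering: always A first, then B, so stored sessions stay comparable.
        return BattleOutcome(result_a=call(system_a), result_b=call(system_b))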

5. Implementation Guide

5.1 Development Environment Setup

python -m venv .venv
source .venv/bin/activate
pip install streamlit pydantic python-dotenv
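
Keys stay out of the code via a .env file, assuming python-dotenv and an OPENAI_API_KEY-style variable (adjust the name to your provider):

# .env (never commit this file)
# OPENAI_API_KEY=sk-your-key-here

# top of src/app.py
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env into the process environment
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running the app.")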

5.2 Project Structure

prompt-battle/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ app.py
โ”‚   โ”œโ”€โ”€ providers/
โ”‚   โ”‚   โ”œโ”€โ”€ base.py
โ”‚   โ”‚   โ””โ”€โ”€ openai_provider.py
โ”‚   โ”œโ”€โ”€ battle.py
โ”‚   โ”œโ”€โ”€ pricing.py
โ”‚   โ””โ”€โ”€ storage.py
โ”œโ”€โ”€ data/
โ”‚   โ””โ”€โ”€ sessions.sqlite
โ””โ”€โ”€ README.md

5.3 Implementation Phases

Phase 1: Single provider, single prompt (2โ€“3h)

Goals:

  • Make one LLM call and show output.
  • Capture token usage if available.

Tasks:

  1. Implement a provider client with one chat(messages, config) entrypoint.
  2. Render a minimal Streamlit UI that sends a user message and prints response.

Checkpoint: One button press produces one response reliably.
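
A minimal Phase 1 sketch, reusing the ModelInvocation and OpenAIProvider sketches from section 4; widget labels, defaults, and the model name are placeholders:

# src/app.py -- Phase 1: one prompt, one model, one response
import streamlit as st
from openai import OpenAI

provider = OpenAIProvider(OpenAI())  # assumes OPENAI_API_KEY is set

st.title("Prompt Playground")
system_prompt = st.text_area("System prompt", "You are a concise assistant.")
user_message = st.text_area("User message", "Summarize why token counts matter.")
temperature = st.slider("Temperature", 0.0, 2.0, 0.2)

if st.button("Run"):
    config = ModelInvocation(provider="openai", model="gpt-4o-mini",
                             temperature=temperature, max_tokens=500, top_p=None)
    result = provider.chat(
        [{"role": "system", "content": system_prompt},
         {"role": "user", "content": user_message}],
        config,
    )
    st.write(result.text)
    st.caption(f"{result.latency_ms} ms | {result.prompt_tokens} prompt / "
               f"{result.completion_tokens} completion tokens")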

Phase 2: A/B comparison + metrics (3โ€“5h)

Goals:

  • Run Prompt A vs Prompt B and compare side-by-side.
  • Compute cost + latency.

Tasks:

  1. Build BattleRunner that runs both invocations and returns structured results.
  2. Add a metrics panel (tokens, cost, latency, tokens/sec).

Checkpoint: Two outputs plus metric table on every run.
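
A sketch of the side-by-side display and metrics panel; it continues the Phase 1 app, assumes outcome comes from BattleRunner.run(...), and reuses the estimate_cost helper sketched in section 3.2:

col_a, col_b = st.columns(2)
for col, label, result in [(col_a, "Prompt A", outcome.result_a),
                           (col_b, "Prompt B", outcome.result_b)]:
    with col:
        st.subheader(label)
        st.write(result.text)
        total_tokens = (result.prompt_tokens or 0) + (result.completion_tokens or 0)
        cost = estimate_cost(config.model, result.prompt_tokens or 0,
                             result.completion_tokens or 0)  # pricing keys must match your models
        tps = (result.completion_tokens or 0) / max(result.latency_ms / 1000, 1e-6)
        st.metric("Latency (ms)", result.latency_ms)
        st.metric("Total tokens", total_tokens)
        st.metric("Cost (USD)", f"{cost:.5f}")
        st.metric("Tokens/sec", f"{tps:.1f}")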

Phase 3: Persistence + judge + polish (3โ€“4h)

Goals:

  • Save sessions and enable optional judge scoring.

Tasks:

  1. Persist sessions (SQLite or JSONL) with a schema version.
  2. Add judge rubric and JSON output parsing.
  3. Add export/download.

Checkpoint: You can reload older sessions and compare trends.
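
A storage sketch for Phase 3 using the standard-library sqlite3 module; the schema below is one reasonable shape, not a requirement:

# src/storage.py
import json
import sqlite3

SCHEMA_VERSION = 1

def init_db(path: str = "data/sessions.sqlite") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            schema_version INTEGER NOT NULL,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP,
            model TEXT NOT NULL,
            params_json TEXT NOT NULL,   -- temperature, max_tokens, top_p
            prompt_a TEXT NOT NULL,
            prompt_b TEXT NOT NULL,
            task TEXT NOT NULL,
            result_json TEXT NOT NULL    -- outputs, tokens, latency, cost, judge scores
        )
    """)
    return conn

def save_run(conn: sqlite3.Connection, model: str, params: dict,
             prompt_a: str, prompt_b: str, task: str, result: dict) -> None:
    conn.execute(
        "INSERT INTO runs (schema_version, model, params_json, prompt_a, prompt_b, task, result_json) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (SCHEMA_VERSION, model, json.dumps(params), prompt_a, prompt_b, task, json.dumps(result)),
    )
    conn.commit()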

5.4 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Storage | JSONL vs SQLite | SQLite | Simple querying + schema evolution |
| Metrics | Provider-reported usage vs local estimate | Provider-reported usage | Avoids pricing errors |
| Judge | Same model vs separate model | Separate model | Reduces self-bias |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit | Pure logic | Cost calc, schema validation, storage |
| Integration | Provider wiring | Mock HTTP, retries, timeouts |
| Regression | Output structure | Judge JSON parsing, export format |

6.2 Critical Test Cases

  1. Cost calculation: given token counts + pricing, the total matches the expected value.
  2. Provider normalization: a response without usage doesn't crash; metrics show None gracefully.
  3. Judge parsing: the judge returns malformed JSON → recover with a fallback parse or surface a clear error (a pytest sketch follows this list).
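
Hedged pytest sketches for cases 1 and 3, assuming the estimate_cost helper from the earlier pricing sketch; parse_judge_json here is a hypothetical fallback parser, named for illustration:

import json
import pytest

from pricing import estimate_cost  # hypothetical module from the pricing sketch

def parse_judge_json(text: str) -> dict:
    """Fallback parser: try strict JSON first, then the outermost {...} block."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise

def test_cost_calculation():
    # 1,200 input + 300 output tokens at the placeholder rates (0.15 / 0.60 per 1M tokens)
    assert estimate_cost("example-small", 1200, 300) == pytest.approx(0.00036)

def test_judge_parsing_recovers_wrapped_json():
    messy = 'Here are the scores:\n{"accuracy": 8, "clarity": 9}\nHope that helps!'
    assert parse_judge_json(messy)["accuracy"] == 8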

7. Common Pitfalls & Debugging

| Pitfall | Symptom | Solution |
|---|---|---|
| Hidden state between runs | "Same prompt, different output" confusion | Show the full message payload and params per run |
| Misleading comparisons | Model A uses different max tokens | Lock params and display them prominently |
| Token pricing drift | Costs stop matching docs | Keep pricing in config with an update date |
| Judge bias | Judge always prefers verbose outputs | Include brevity/format criteria in the rubric |

Debugging strategies:

  • Log the exact request payload (without secrets) and response metadata; a redaction sketch follows this list.
  • Add a "replay session" button that replays stored payloads.
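
A small sketch of the first strategy: log the exact payload with secret-looking fields masked. The field names in REDACTED_KEYS are placeholders for whatever your provider uses:

import json
import logging

logger = logging.getLogger("prompt_battle")

REDACTED_KEYS = {"api_key", "authorization"}  # extend with your provider's secret field names

def log_request(payload: dict) -> None:
    """Log the request payload with secrets masked so runs can be replayed safely."""
    safe = {k: ("***" if k.lower() in REDACTED_KEYS else v) for k, v in payload.items()}
    logger.info("request: %s", json.dumps(safe, default=str))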

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add prompt library presets (roles: "teacher", "PM", "critic").
  • Add a "diff view" highlighting structural differences between outputs.

8.2 Intermediate Extensions

  • Add multi-turn conversations and show "context growth" per turn.
  • Add batch evaluation (run 20 tasks; compute average rubric score).

8.3 Advanced Extensions

  • Add provider plug-ins + dynamic model discovery.
  • Add automatic prompt mutation (small edits) and search for improvements.

9. Real-World Connections

9.1 Industry Applications

  • Prompt experimentation for support agents, summarizers, and copilots.
  • A/B testing prompts/models before shipping to production.
  • Cost governance for multi-agent systems.

9.2 Related Tools & Products

  • LangSmith: tracing + evals for LLM apps.
  • OpenAI Evals: evaluation harness patterns and datasets.

9.3 Interview Relevance

  • Explain sampling and why deterministic settings matter for tooling.
  • Explain how you'd measure "quality" beyond subjective opinions.

10. Resources

10.1 Essential Reading

  • The LLM Engineering Handbook (Paul Iusztin) – prompt patterns + evals (Ch. 3, 8)
  • AI Engineering (Chip Huyen) – production LLM systems + failures (Ch. 2, 8)

10.2 Tools & Documentation

  • Streamlit docs (state, forms, layout)
  • Provider API docs for chat + usage fields
  • Previous: none (start here)
  • Next: Project 2 (RAG) – turns experimentation into memory-grounded assistants

11. Self-Assessment Checklist

  • I can explain temperature/top-p and when to use each.
  • I can compute per-request cost from token usage.
  • I can design a rubric that discourages "verbose but wrong" answers.
  • I can add a second provider without rewriting the UI.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • A/B comparison UI with at least one provider/model
  • Token + cost + latency displayed
  • Sessions persist to disk and can be exported

Full Completion:

  • Optional judge scoring with a rubric
  • Support for at least two models (or two providers)
  • Robust error handling and clean session replay

Excellence (Going Above & Beyond):

  • Batch eval mode with aggregated metrics
  • Prompt library + search for improved prompts

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.