Project 1: LLM Prompt Playground & Analyzer

Build a Streamlit "prompt battle arena" to compare prompts, models, and parameters, with token, cost, and latency analytics and optional LLM-as-a-judge scoring.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | 8–12 hours (weekend) |
| Language | Python (alternatives: TypeScript/Node.js, Go) |
| Prerequisites | Python basics, HTTP/APIs, env vars, basic Git |
| Key Topics | System prompts, sampling (temperature/top-p), token accounting, eval rubrics, streaming, provider abstraction |

1. Learning Objectives

By completing this project, you will:

  1. Build a minimal but real LLM client with model/provider selection.
  2. Understand how prompt structure changes output quality and failure modes.
  3. Measure latency + token usage and compute cost across providers.
  4. Design repeatable qualitative evaluation (rubrics + judge model).
  5. Store experiments and compare results over time.

2. Theoretical Foundation

2.1 Core Concepts

  • Chat roles (system/developer/user): The system message anchors behavior; user messages are tasks. Your UI should make roles explicit so you can observe how changing the "constitution" text affects outcomes.
  • Decoding & sampling:
    • Temperature controls randomness by flattening/sharpening the token distribution.
    • Top-p (nucleus sampling) selects from the smallest set of tokens whose cumulative probability is p.
    • For assistants: lower randomness for extraction/formatting; higher randomness for ideation (see the sketch after this list).
  • Tokenization & context windows: Models operate on tokens, and the context window is the model's working "RAM"; prompt length consumes it. You can't optimize cost or latency if you don't track prompt and completion tokens.
  • Evaluation (Evals):
    • Human evaluation is slow but high-quality.
    • LLM-as-a-judge is fast and scalable but can be biased; you mitigate by using explicit rubrics and stable sampling settings (e.g., temperature 0).
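
A minimal sketch of how these knobs show up in code, assuming the OpenAI Python SDK (other chat-style providers expose equivalent parameters); the model name and prompts are placeholders:

from openai import OpenAI  # assumed SDK; adapt to your provider

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Low randomness for extraction/formatting tasks.
extraction = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "Extract all dates as a JSON list."},
        {"role": "user", "content": "We met on 2024-03-01 and again on 2024-04-12."},
    ],
    temperature=0.0,      # near-deterministic decoding
    max_tokens=200,
)

# Higher randomness for ideation.
ideas = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Brainstorm 5 names for a prompt A/B tool."}],
    temperature=1.0,
    top_p=0.9,            # nucleus sampling: smallest token set with cumulative probability 0.9
    max_tokens=200,
)

print(extraction.choices[0].message.content)
print(ideas.choices[0].message.content)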

2.2 Why This Matters

Personal assistants are only as good as their instruction-following. Before you orchestrate tools or build memory, you need a "lab" where you can reliably answer:

  • "Which prompt produces fewer hallucinations for this task?"
  • "How much does this interaction cost at scale?"
  • "Is a more expensive model worth it for this workflow?"

2.3 Common Misconceptions

  • "Model selection matters more than prompts." In many workflows, prompt quality dominates model choice.
  • "Temperature is a creativity knob only." It's also a reliability knob; non-determinism can break structured outputs.
  • "Token counts don't matter for small apps." Costs compound quickly when you add memory, tools, and multi-step loops.

3. Project Specification

3.1 What You Will Build

A Streamlit web app where you:

  • Define a shared Goal/Task.
  • Enter Prompt A and Prompt B (system prompts).
  • Run both prompts against one or two selected models/providers.
  • Inspect responses + metrics (tokens, cost, latency, throughput).
  • Optionally run a judge that scores outputs with a rubric.
  • Save sessions and export results as JSON.

3.2 Functional Requirements

  1. Provider abstraction: support at least one provider; design so adding another is straightforward.
  2. Side-by-side comparison: show Prompt A vs Prompt B outputs with identical input task and parameters.
  3. Parameter control: temperature, max tokens, optional top-p.
  4. Metrics: prompt tokens, completion tokens, total tokens, latency (first-token if streaming; total time).
  5. Cost calculator: compute cost using configurable pricing (see the pricing sketch after this list).
  6. Session persistence: store each run (prompts, model, params, outputs, metrics).
  7. Export: JSON export of a session (for later analysis).
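
A minimal sketch of requirement 5, assuming pricing is kept as per-million-token rates in a small config dict; the model names and prices below are placeholders, not current rates:

# pricing.py -- illustrative only; keep real rates in config with an update date
PRICING_PER_MILLION = {
    # model: (input USD per 1M tokens, output USD per 1M tokens) -- placeholder numbers
    "example-small": (0.15, 0.60),
    "example-large": (2.50, 10.00),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated USD cost of one call from the configured pricing table."""
    input_rate, output_rate = PRICING_PER_MILLION[model]
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000

# Example: 1,200 prompt tokens + 300 completion tokens on "example-small"
# => (1200 * 0.15 + 300 * 0.60) / 1_000_000 = 0.00036 USD
print(f"${estimate_cost('example-small', 1200, 300):.5f}")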

3.3 Non-Functional Requirements

  • Reliability: handle provider errors, timeouts, and rate limits without crashing the UI.
  • Privacy: avoid logging secrets; clearly mark what is persisted to disk.
  • Repeatability: allow deterministic settings (temperature 0, fixed seed if available).
  • Usability: minimal clicks; comparisons should be visually obvious.

3.4 Example Usage / Output

streamlit run prompt_battle.py

Example rubric snippet (judge):

{
  "accuracy": 8,
  "clarity": 9,
  "actionability": 7,
  "format_adherence": 10,
  "notes": "B follows the table format and cites assumptions."
}
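
One way to elicit that JSON is a rubric-driven judge prompt. The sketch below mirrors the keys in the snippet above; the wording and score range are only a starting point, not a fixed recipe:

JUDGE_SYSTEM = (
    "You are an impartial evaluator. Score the candidate answer against the task on "
    "accuracy, clarity, actionability, and format_adherence (integers 1-10). "
    "Reward correctness and concision, not length. "
    "Respond with a single JSON object containing those four keys plus 'notes'."
)

def build_judge_messages(task: str, answer: str) -> list[dict[str, str]]:
    """Build the chat messages for one judge call (run it at temperature 0)."""
    user = f"Task:\n{task}\n\nCandidate answer:\n{answer}\n\nReturn only JSON."
    return [
        {"role": "system", "content": JUDGE_SYSTEM},
        {"role": "user", "content": user},
    ]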

4. Solution Architecture

4.1 High-Level Design

┌───────────────┐   run/config   ┌─────────────────┐
│  Streamlit UI │───────────────▶│  Battle Runner  │
└───────┬───────┘                └────────┬────────┘
        │                                 │
        │ metrics/results                 │ provider calls
        ▼                                 ▼
┌────────────────┐               ┌─────────────────┐
│ Session Store  │◀─────────────▶│ Provider Client │
└───────┬────────┘               └─────────────────┘
        │
        ▼
┌────────────────┐
│ Pricing Config │
└────────────────┘

4.2 Key Components

| Component | Responsibility | Key Decisions |
|---|---|---|
| UI | Collect inputs + display results | Keep "Goal" separate from system prompts |
| Provider client(s) | Call the model API | Normalize responses + usage across providers |
| Battle runner | Execute A/B calls + judge | Deterministic ordering; consistent params |
| Cost/pricing | Compute cost by token type | Pricing as config, not hard-coded |
| Session store | Persist runs | SQLite/JSONL; include schema versioning |

4.3 Data Structures

from dataclasses import dataclass
from typing import Any, Literal

Role = Literal["system", "user", "assistant"]

@dataclass
class ModelInvocation:
    provider: str
    model: str
    temperature: float
    max_tokens: int
    top_p: float | None

@dataclass
class ModelResult:
    text: str
    prompt_tokens: int | None
    completion_tokens: int | None
    latency_ms: int
    raw: dict[str, Any]
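
A sketch of the provider abstraction these dataclasses feed into, assuming an OpenAI-compatible client; the class and method names are suggestions, not a fixed API:

import time
from abc import ABC, abstractmethod

class Provider(ABC):
    """Normalizes provider-specific responses into ModelResult."""

    @abstractmethod
    def chat(self, messages: list[dict[str, str]], config: ModelInvocation) -> ModelResult: ...

class OpenAIProvider(Provider):
    def __init__(self, client):  # e.g. openai.OpenAI(); injected to make testing easier
        self.client = client

    def chat(self, messages, config):
        start = time.perf_counter()
        resp = self.client.chat.completions.create(
            model=config.model,
            messages=messages,
            temperature=config.temperature,
            max_tokens=config.max_tokens,
            **({"top_p": config.top_p} if config.top_p is not None else {}),
        )
        usage = resp.usage  # may be missing for some providers or streaming modes
        return ModelResult(
            text=resp.choices[0].message.content,
            prompt_tokens=getattr(usage, "prompt_tokens", None),
            completion_tokens=getattr(usage, "completion_tokens", None),
            latency_ms=int((time.perf_counter() - start) * 1000),
            raw=resp.model_dump(),
        )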

4.4 Algorithm Overview

Key Algorithm: A/B battle execution (a runner sketch follows the complexity analysis)

  1. Validate config (models, params, keys).
  2. Call model A and model B with identical user input.
  3. Record latency and token usage.
  4. Optionally call judge model with rubric + both outputs.
  5. Persist the session and render metrics.

Complexity Analysis:

  • Time: O(number_of_models + judge) network calls
  • Space: O(output_size + stored_session)
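
A compact sketch of the runner, building on the Provider interface and dataclasses from 4.3; the judge and persistence steps are omitted here, and all names are placeholders:

from dataclasses import dataclass

@dataclass
class BattleOutcome:
    result_a: ModelResult
    result_b: ModelResult

class BattleRunner:
    def __init__(self, provider: Provider):
        self.provider = provider

    def run(self, task: str, system_a: str, system_b: str, config: ModelInvocation) -> BattleOutcome:
        """Run Prompt A and Prompt B against the same task with identical parameters."""
        def call(system_prompt: str) -> ModelResult:
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": task},
            ]
            return self.provider.chat(messages, config)

        # Deterministic ordering: always A first, then B, so stored sessions stay comparable.
        return BattleOutcome(result_a=call(system_a), result_b=call(system_b))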

5. Implementation Guide

5.1 Development Environment Setup

python -m venv .venv
source .venv/bin/activate
pip install streamlit pydantic python-dotenv
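
Keys stay out of the code via a .env file, assuming python-dotenv and an OPENAI_API_KEY-style variable (adjust the name to your provider):

# .env (never commit this file)
# OPENAI_API_KEY=sk-your-key-here

# top of src/app.py
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env into the process environment
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running the app.")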

5.2 Project Structure

prompt-battle/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ app.py
โ”‚   โ”œโ”€โ”€ providers/
โ”‚   โ”‚   โ”œโ”€โ”€ base.py
โ”‚   โ”‚   โ””โ”€โ”€ openai_provider.py
โ”‚   โ”œโ”€โ”€ battle.py
โ”‚   โ”œโ”€โ”€ pricing.py
โ”‚   โ””โ”€โ”€ storage.py
โ”œโ”€โ”€ data/
โ”‚   โ””โ”€โ”€ sessions.sqlite
โ””โ”€โ”€ README.md

5.3 Implementation Phases

Phase 1: Single provider, single prompt (2โ€“3h)

Goals:

  • Make one LLM call and show output.
  • Capture token usage if available.

Tasks:

  1. Implement a provider client with one chat(messages, config) entrypoint.
  2. Render a minimal Streamlit UI that sends a user message and prints response.

Checkpoint: One button press produces one response reliably.
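
A minimal Phase 1 sketch, reusing the ModelInvocation and OpenAIProvider sketches from section 4; widget labels, defaults, and the model name are placeholders:

# src/app.py -- Phase 1: one prompt, one model, one response
import streamlit as st
from openai import OpenAI

provider = OpenAIProvider(OpenAI())  # assumes OPENAI_API_KEY is set

st.title("Prompt Playground")
system_prompt = st.text_area("System prompt", "You are a concise assistant.")
user_message = st.text_area("User message", "Summarize why token counts matter.")
temperature = st.slider("Temperature", 0.0, 2.0, 0.2)

if st.button("Run"):
    config = ModelInvocation(provider="openai", model="gpt-4o-mini",
                             temperature=temperature, max_tokens=500, top_p=None)
    result = provider.chat(
        [{"role": "system", "content": system_prompt},
         {"role": "user", "content": user_message}],
        config,
    )
    st.write(result.text)
    st.caption(f"{result.latency_ms} ms | {result.prompt_tokens} prompt / "
               f"{result.completion_tokens} completion tokens")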

Phase 2: A/B comparison + metrics (3โ€“5h)

Goals:

  • Run Prompt A vs Prompt B and compare side-by-side.
  • Compute cost + latency.

Tasks:

  1. Build BattleRunner that runs both invocations and returns structured results.
  2. Add a metrics panel (tokens, cost, latency, tokens/sec).

Checkpoint: Two outputs plus metric table on every run.
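
A sketch of the side-by-side display and metrics panel; it continues the Phase 1 app, assumes outcome comes from BattleRunner.run(...), and reuses the estimate_cost helper sketched in section 3.2:

col_a, col_b = st.columns(2)
for col, label, result in [(col_a, "Prompt A", outcome.result_a),
                           (col_b, "Prompt B", outcome.result_b)]:
    with col:
        st.subheader(label)
        st.write(result.text)
        total_tokens = (result.prompt_tokens or 0) + (result.completion_tokens or 0)
        cost = estimate_cost(config.model, result.prompt_tokens or 0,
                             result.completion_tokens or 0)  # pricing keys must match your models
        tps = (result.completion_tokens or 0) / max(result.latency_ms / 1000, 1e-6)
        st.metric("Latency (ms)", result.latency_ms)
        st.metric("Total tokens", total_tokens)
        st.metric("Cost (USD)", f"{cost:.5f}")
        st.metric("Tokens/sec", f"{tps:.1f}")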

Phase 3: Persistence + judge + polish (3โ€“4h)

Goals:

  • Save sessions and enable optional judge scoring.

Tasks:

  1. Persist sessions (SQLite or JSONL) with a schema version.
  2. Add judge rubric and JSON output parsing.
  3. Add export/download.

Checkpoint: You can reload older sessions and compare trends.
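
A storage sketch for Phase 3 using the standard-library sqlite3 module; the schema below is one reasonable shape, not a requirement:

# src/storage.py
import json
import sqlite3

SCHEMA_VERSION = 1

def init_db(path: str = "data/sessions.sqlite") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            schema_version INTEGER NOT NULL,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP,
            model TEXT NOT NULL,
            params_json TEXT NOT NULL,   -- temperature, max_tokens, top_p
            prompt_a TEXT NOT NULL,
            prompt_b TEXT NOT NULL,
            task TEXT NOT NULL,
            result_json TEXT NOT NULL    -- outputs, tokens, latency, cost, judge scores
        )
    """)
    return conn

def save_run(conn: sqlite3.Connection, model: str, params: dict,
             prompt_a: str, prompt_b: str, task: str, result: dict) -> None:
    conn.execute(
        "INSERT INTO runs (schema_version, model, params_json, prompt_a, prompt_b, task, result_json) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (SCHEMA_VERSION, model, json.dumps(params), prompt_a, prompt_b, task, json.dumps(result)),
    )
    conn.commit()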

5.4 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Storage | JSONL vs SQLite | SQLite | Simple querying + schema evolution |
| Metrics | Provider-reported usage vs local estimate | Provider-reported usage | Avoids pricing errors |
| Judge | Same model vs separate model | Separate model | Reduces self-bias |

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit | Pure logic | Cost calc, schema validation, storage |
| Integration | Provider wiring | Mock HTTP, retries, timeouts |
| Regression | Output structure | Judge JSON parsing, export format |

6.2 Critical Test Cases

  1. Cost calculation: given token counts + pricing, the total matches the expected value.
  2. Provider normalization: a response without usage doesn't crash; metrics show None gracefully.
  3. Judge parsing: the judge returns malformed JSON → recover with a fallback parse or surface a clear error (a pytest sketch follows this list).
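
Hedged pytest sketches for cases 1 and 3, assuming the estimate_cost helper from the earlier pricing sketch; parse_judge_json here is a hypothetical fallback parser, named for illustration:

import json
import pytest

from pricing import estimate_cost  # hypothetical module from the pricing sketch

def parse_judge_json(text: str) -> dict:
    """Fallback parser: try strict JSON first, then the outermost {...} block."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise

def test_cost_calculation():
    # 1,200 input + 300 output tokens at the placeholder rates (0.15 / 0.60 per 1M tokens)
    assert estimate_cost("example-small", 1200, 300) == pytest.approx(0.00036)

def test_judge_parsing_recovers_wrapped_json():
    messy = 'Here are the scores:\n{"accuracy": 8, "clarity": 9}\nHope that helps!'
    assert parse_judge_json(messy)["accuracy"] == 8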

7. Common Pitfalls & Debugging

| Pitfall | Symptom | Solution |
|---|---|---|
| Hidden state between runs | "Same prompt, different output" confusion | Show the full message payload and params per run |
| Misleading comparisons | Model A uses different max tokens | Lock params and display them prominently |
| Token pricing drift | Costs stop matching docs | Keep pricing in config with an update date |
| Judge bias | Judge always prefers verbose outputs | Include brevity/format criteria in the rubric |

Debugging strategies:

  • Log the exact request payload (without secrets) and response metadata; a redaction sketch follows this list.
  • Add a "replay session" button that replays stored payloads.
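
A small sketch of the first strategy: log the exact payload with secret-looking fields masked. The field names in REDACTED_KEYS are placeholders for whatever your provider uses:

import json
import logging

logger = logging.getLogger("prompt_battle")

REDACTED_KEYS = {"api_key", "authorization"}  # extend with your provider's secret field names

def log_request(payload: dict) -> None:
    """Log the request payload with secrets masked so runs can be replayed safely."""
    safe = {k: ("***" if k.lower() in REDACTED_KEYS else v) for k, v in payload.items()}
    logger.info("request: %s", json.dumps(safe, default=str))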

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add prompt library presets (roles: "teacher", "PM", "critic").
  • Add a "diff view" highlighting structural differences between outputs.

8.2 Intermediate Extensions

  • Add multi-turn conversations and show "context growth" per turn.
  • Add batch evaluation (run 20 tasks; compute average rubric score).

8.3 Advanced Extensions

  • Add provider plug-ins + dynamic model discovery.
  • Add automatic prompt mutation (small edits) and search for improvements.

9. Real-World Connections

9.1 Industry Applications

  • Prompt experimentation for support agents, summarizers, and copilots.
  • A/B testing prompts/models before shipping to production.
  • Cost governance for multi-agent systems.

9.2 Related Tools & Products

  • LangSmith: tracing + evals for LLM apps.
  • OpenAI Evals: evaluation harness patterns and datasets.

9.3 Interview Relevance

  • Explain sampling and why deterministic settings matter for tooling.
  • Explain how you'd measure "quality" beyond subjective opinions.

10. Resources

10.1 Essential Reading

  • The LLM Engineering Handbook (Paul Iusztin) – prompt patterns + evals (Ch. 3, 8)
  • AI Engineering (Chip Huyen) – production LLM systems + failures (Ch. 2, 8)

10.2 Tools & Documentation

  • Streamlit docs (state, forms, layout)
  • Provider API docs for chat + usage fields
  • Previous: none (start here)
  • Next: Project 2 (RAG) – turns experimentation into memory-grounded assistants

11. Self-Assessment Checklist

  • I can explain temperature/top-p and when to use each.
  • I can compute per-request cost from token usage.
  • I can design a rubric that discourages "verbose but wrong" answers.
  • I can add a second provider without rewriting the UI.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • A/B comparison UI with at least one provider/model
  • Token + cost + latency displayed
  • Sessions persist to disk and can be exported

Full Completion:

  • Optional judge scoring with a rubric
  • Support for at least two models (or two providers)
  • Robust error handling and clean session replay

Excellence (Going Above & Beyond):

  • Batch eval mode with aggregated metrics
  • Prompt library + search for improved prompts

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.