Project 1: LLM Prompt Playground & Analyzer
Build a Streamlit "prompt battle arena" to compare prompts/models/parameters with token + cost + latency analytics and optional LLM-as-a-judge scoring.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 1: Beginner |
| Time Estimate | 8-12 hours (weekend) |
| Language | Python (Alternatives: TypeScript/Node.js, Go) |
| Prerequisites | Python basics, HTTP/APIs, env vars, basic Git |
| Key Topics | system prompts, sampling (temperature/top-p), token accounting, eval rubrics, streaming, provider abstraction |
1. Learning Objectives
By completing this project, you will:
- Build a minimal but real LLM client with model/provider selection.
- Understand how prompt structure changes output quality and failure modes.
- Measure latency + token usage and compute cost across providers.
- Design repeatable qualitative evaluation (rubrics + judge model).
- Store experiments and compare results over time.
2. Theoretical Foundation
2.1 Core Concepts
- Chat roles (system/developer/user): The system message anchors behavior; user messages are tasks. Your UI should make roles explicit so you can observe how changing the "constitution" text affects outcomes.
- Decoding & sampling (see the sketch after this list):
  - Temperature controls randomness by flattening or sharpening the token distribution.
  - Top-p (nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability reaches p.
  - For assistants: lower randomness for extraction/formatting; higher randomness for ideation.
- Tokenization & context windows: Models operate on tokens; prompt length is "RAM". You can't optimize cost/latency if you don't track prompt + completion tokens.
- Evaluation (Evals):
  - Human evaluation is slow but high-quality.
  - LLM-as-a-judge is fast and scalable but can be biased; you mitigate by using explicit rubrics and stable sampling settings (e.g., temperature 0).
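To make the roles and sampling knobs concrete, here is a minimal, provider-agnostic sketch of two chat requests. The field names follow the common chat-completions convention and the model name is a placeholder; adapt both to whatever your provider expects.

extraction_request = {
    "model": "example-model",  # placeholder model name
    "messages": [
        {"role": "system", "content": "You are a precise data extractor. Reply with JSON only."},
        {"role": "user", "content": "Extract the invoice total from: 'Total due: $41.20'"},
    ],
    "temperature": 0.0,  # low randomness: extraction/formatting should be repeatable
    "top_p": 1.0,        # with temperature 0, top-p has little effect
    "max_tokens": 200,
}

ideation_request = {
    **extraction_request,
    "messages": [
        {"role": "system", "content": "You are a creative brainstorming partner."},
        {"role": "user", "content": "Give me 10 unusual names for a prompt-testing tool."},
    ],
    "temperature": 0.9,  # higher randomness: diverse ideas are the goal
    "top_p": 0.95,       # nucleus sampling: draw only from the top ~95% of probability mass
}

In the playground, these parameters come from UI controls, so the same task can be replayed with different settings and compared.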
2.2 Why This Matters
Personal assistants are only as good as their instruction-following. Before you orchestrate tools or build memory, you need a "lab" where you can reliably answer:
- "Which prompt produces fewer hallucinations for this task?"
- "How much does this interaction cost at scale?"
- "Is a more expensive model worth it for this workflow?"
2.3 Common Misconceptions
- "Model selection matters more than prompts." In many workflows, prompt quality dominates model choice.
- "Temperature is a creativity knob only." It's also a reliability knob; non-determinism can break structured outputs.
- "Token counts don't matter for small apps." Costs compound quickly when you add memory, tools, and multi-step loops.
3. Project Specification
3.1 What You Will Build
A Streamlit web app where you:
- Define a shared Goal/Task.
- Enter Prompt A and Prompt B (system prompts).
- Run both prompts against one or two selected models/providers.
- Inspect responses + metrics (tokens, cost, latency, throughput).
- Optionally run a judge that scores outputs with a rubric.
- Save sessions and export results as JSON.
3.2 Functional Requirements
- Provider abstraction: support at least one provider; design so adding another is straightforward.
- Side-by-side comparison: show Prompt A vs Prompt B outputs with identical input task and parameters.
- Parameter control: temperature, max tokens, optional top-p.
- Metrics: prompt tokens, completion tokens, total tokens, latency (first-token if streaming; total time).
- Cost calculator: compute cost using configurable pricing (see the sketch after this list).
- Session persistence: store each run (prompts, model, params, outputs, metrics).
- Export: JSON export of a session (for later analysis).
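For the cost calculator requirement, a minimal sketch of configurable pricing. The numbers are placeholders, not real prices; keep actual prices in a dated config file so they can be updated without code changes.

from dataclasses import dataclass

@dataclass
class ModelPricing:
    input_per_million: float   # USD per 1M prompt tokens
    output_per_million: float  # USD per 1M completion tokens

PRICING = {  # placeholder values -- load real, dated prices from config
    "example-small-model": ModelPricing(input_per_million=0.50, output_per_million=1.50),
    "example-large-model": ModelPricing(input_per_million=5.00, output_per_million=15.00),
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD for one call, computed from provider-reported token usage."""
    p = PRICING[model]
    return (prompt_tokens * p.input_per_million
            + completion_tokens * p.output_per_million) / 1_000_000

For example, 1,200 prompt tokens and 300 completion tokens on the small placeholder model work out to (1,200 * 0.50 + 300 * 1.50) / 1,000,000 = $0.00105.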
3.3 Non-Functional Requirements
- Reliability: handle provider errors, timeouts, and rate limits without crashing the UI.
- Privacy: avoid logging secrets; clearly mark what is persisted to disk.
- Repeatability: allow deterministic settings (temperature 0, fixed seed if available).
- Usability: minimal clicks; comparisons should be visually obvious.
3.4 Example Usage / Output
streamlit run src/app.py
Example rubric snippet (judge):
{
  "accuracy": 8,
  "clarity": 9,
  "actionability": 7,
  "format_adherence": 10,
  "notes": "B follows the table format and cites assumptions."
}
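One way to obtain a rubric like this from a judge model is to pin the output schema in the judge's system prompt and parse the reply defensively. A sketch (the prompt text and helper name are illustrative; the judge call itself is whatever your provider client exposes):

import json

JUDGE_SYSTEM_PROMPT = """You are a strict evaluator. Score the candidate answer against the task.
Reply with JSON only, using exactly these keys:
{"accuracy": 0-10, "clarity": 0-10, "actionability": 0-10, "format_adherence": 0-10, "notes": "<short string>"}"""

def parse_judge_reply(reply_text: str) -> dict | None:
    """Parse the judge's JSON, tolerating stray prose around it."""
    try:
        return json.loads(reply_text)
    except json.JSONDecodeError:
        # Fallback: grab the first {...} block if the judge added commentary.
        start, end = reply_text.find("{"), reply_text.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(reply_text[start : end + 1])
            except json.JSONDecodeError:
                return None
        return None

Run the judge at temperature 0 so repeated evaluations of the same pair stay comparable.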
4. Solution Architecture
4.1 High-Level Design
┌────────────────┐  run/config   ┌─────────────────┐
│  Streamlit UI  │──────────────▶│  Battle Runner  │
└───────┬────────┘               └────────┬────────┘
        │                                 │
        │ metrics/results                 │ provider calls
        ▼                                 ▼
┌────────────────┐               ┌─────────────────┐
│ Session Store  │──────────────▶│ Provider Client │
└────────────────┘               └────────┬────────┘
                                          │
                                          ▼
                                 ┌─────────────────┐
                                 │ Pricing Config  │
                                 └─────────────────┘
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| UI | Collect inputs + display results | Keep "Goal" separate from system prompts |
| Provider client(s) | Call the model API | Normalize responses + usage across providers |
| Battle runner | Execute A/B calls + judge | Deterministic ordering; consistent params |
| Cost/pricing | Compute cost by token type | Pricing as config, not hard-coded |
| Session store | Persist runs | SQLite/JSONL; include schema versioning |
4.3 Data Structures
from dataclasses import dataclass
from typing import Any, Literal

Role = Literal["system", "user", "assistant"]


@dataclass
class ModelInvocation:
    """Everything needed to reproduce one call: provider, model, and sampling parameters."""
    provider: str
    model: str
    temperature: float
    max_tokens: int
    top_p: float | None


@dataclass
class ModelResult:
    """Normalized output of one call, regardless of provider."""
    text: str
    prompt_tokens: int | None      # None when the provider does not report usage
    completion_tokens: int | None
    latency_ms: int
    raw: dict[str, Any]            # untouched provider response for debugging/replay
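A sketch of the provider abstraction on top of these dataclasses (assuming they are in scope): one chat() entrypoint per provider, plus an offline stub that is handy while building the UI.

import time
from typing import Protocol

Message = dict[str, str]  # {"role": Role, "content": str}

class Provider(Protocol):
    """One chat() entrypoint; each implementation normalizes its own API into ModelResult."""

    def chat(self, messages: list[Message], config: ModelInvocation) -> ModelResult: ...

class EchoProvider:
    """Offline stub: useful for building the UI without spending tokens."""

    def chat(self, messages: list[Message], config: ModelInvocation) -> ModelResult:
        start = time.perf_counter()
        text = f"[echo from {config.model}] " + messages[-1]["content"]
        return ModelResult(
            text=text,
            prompt_tokens=None,       # a stub has no real usage data
            completion_tokens=None,
            latency_ms=int((time.perf_counter() - start) * 1000),
            raw={"provider": config.provider, "stub": True},
        )

Adding a second provider then means writing one more class that satisfies Provider, with no UI changes.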
4.4 Algorithm Overview
Key Algorithm: A/B battle execution
- Validate config (models, params, keys).
- Call model A and model B with identical user input.
- Record latency and token usage.
- Optionally call judge model with rubric + both outputs.
- Persist the session and render metrics.
Complexity Analysis:
- Time: O(number_of_models + judge) network calls
- Space: O(output_size + stored_session)
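A sketch of that flow, assuming the Provider, ModelInvocation, and ModelResult shapes from section 4.3; judge scoring and persistence are left out for brevity.

from dataclasses import asdict

def run_battle(provider: Provider,
               goal: str,
               prompt_a: str,
               prompt_b: str,
               config: ModelInvocation) -> dict:
    """Run Prompt A and Prompt B against the same task with identical parameters."""
    results = {}
    for label, system_prompt in (("A", prompt_a), ("B", prompt_b)):
        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": goal},
        ]
        result = provider.chat(messages, config)  # one network call per prompt
        results[label] = {
            "system_prompt": system_prompt,
            "output": result.text,
            "latency_ms": result.latency_ms,
            "prompt_tokens": result.prompt_tokens,
            "completion_tokens": result.completion_tokens,
        }
    return {"goal": goal, "config": asdict(config), "results": results}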
5. Implementation Guide
5.1 Development Environment Setup
python -m venv .venv
source .venv/bin/activate
pip install streamlit pydantic python-dotenv
pip install openai   # or your chosen provider's SDK (or httpx for raw HTTP calls)
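Keep API keys in a local .env file (excluded from Git) and load them at startup; a minimal sketch using python-dotenv (the variable name is an example for an OpenAI-style key):

import os
from dotenv import load_dotenv  # from the python-dotenv package installed above

load_dotenv()  # reads KEY=value pairs from a local .env file into os.environ

api_key = os.getenv("OPENAI_API_KEY")  # or whichever provider key you use
if not api_key:
    raise RuntimeError("Set OPENAI_API_KEY in .env or your shell before running the app.")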
5.2 Project Structure
prompt-battle/
├── src/
│   ├── app.py
│   ├── providers/
│   │   ├── base.py
│   │   └── openai_provider.py
│   ├── battle.py
│   ├── pricing.py
│   └── storage.py
├── data/
│   └── sessions.sqlite
└── README.md
5.3 Implementation Phases
Phase 1: Single provider, single prompt (2-3h)
Goals:
- Make one LLM call and show output.
- Capture token usage if available.
Tasks:
- Implement a provider client with one chat(messages, config) entrypoint.
- Render a minimal Streamlit UI that sends a user message and prints the response (sketched at the end of this phase).
Checkpoint: One button press produces one response reliably.
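A minimal sketch of this phase as src/app.py (run it with streamlit run src/app.py); call_model is a placeholder to swap for your provider client's chat() call:

import streamlit as st

def call_model(system_prompt: str, user_message: str) -> str:
    """Placeholder -- replace with your provider client's chat() call."""
    return f"(stub) system={system_prompt[:40]!r} user={user_message[:40]!r}"

st.title("Prompt Playground -- Phase 1")

system_prompt = st.text_area("System prompt", "You are a helpful assistant.")
user_message = st.text_area("User message", "Summarize why prompt A/B testing matters.")

if st.button("Run"):
    with st.spinner("Calling model..."):
        reply = call_model(system_prompt, user_message)
    st.subheader("Response")
    st.write(reply)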
Phase 2: A/B comparison + metrics (3-5h)
Goals:
- Run Prompt A vs Prompt B and compare side-by-side.
- Compute cost + latency.
Tasks:
- Build a BattleRunner that runs both invocations and returns structured results.
- Add a metrics panel (tokens, cost, latency, tokens/sec); see the sketch at the end of this phase.
Checkpoint: Two outputs plus metric table on every run.
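A sketch of the metrics panel, assuming each run yields latency, optional token counts, and an optional cost estimate (for example from the estimate_cost helper sketched in section 3.2):

import streamlit as st

def render_metrics(label: str, latency_ms: int,
                   prompt_tokens: int | None, completion_tokens: int | None,
                   cost_usd: float | None) -> None:
    """Show one prompt's metrics side by side; tolerate missing usage data."""
    tokens_per_sec = None
    if completion_tokens and latency_ms:
        tokens_per_sec = completion_tokens / (latency_ms / 1000)

    st.subheader(f"Prompt {label}")
    c1, c2, c3, c4 = st.columns(4)
    c1.metric("Latency (ms)", latency_ms)
    c2.metric("Prompt tokens", prompt_tokens if prompt_tokens is not None else "n/a")
    c3.metric("Completion tokens", completion_tokens if completion_tokens is not None else "n/a")
    c4.metric("Tokens/sec", f"{tokens_per_sec:.1f}" if tokens_per_sec else "n/a")
    if cost_usd is not None:
        st.caption(f"Estimated cost: ${cost_usd:.5f}")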
Phase 3: Persistence + judge + polish (3-4h)
Goals:
- Save sessions and enable optional judge scoring.
Tasks:
- Persist sessions (SQLite or JSONL) with a schema version (see the sketch at the end of this phase).
- Add judge rubric and JSON output parsing.
- Add export/download.
Checkpoint: You can reload older sessions and compare trends.
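A minimal sketch of the persistence layer using the standard-library sqlite3 module: store each run as a JSON payload and record a schema version so the format can evolve.

import json
import sqlite3

SCHEMA_VERSION = 1

def init_db(path: str = "data/sessions.sqlite") -> sqlite3.Connection:
    """Open (or create) the session database; the data/ directory must exist."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " created_at TEXT DEFAULT CURRENT_TIMESTAMP,"
        " payload TEXT NOT NULL)"  # full run (prompts, params, outputs, metrics) as JSON
    )
    conn.execute("INSERT OR REPLACE INTO meta (key, value) VALUES ('schema_version', ?)",
                 (str(SCHEMA_VERSION),))
    conn.commit()
    return conn

def save_session(conn: sqlite3.Connection, run: dict) -> int:
    """Persist one battle run and return its row id."""
    cur = conn.execute("INSERT INTO sessions (payload) VALUES (?)", (json.dumps(run),))
    conn.commit()
    return cur.lastrowid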
5.4 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Storage | JSONL vs SQLite | SQLite | Simple querying + evolution |
| Metrics | provider usage vs estimate | provider usage | Avoid pricing errors |
| Judge | same model vs separate model | separate model | Reduce self-bias |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Pure logic | cost calc, schema validation, storage |
| Integration | Provider wiring | mock HTTP, retries, timeouts |
| Regression | Output structure | judge JSON parsing, export format |
6.2 Critical Test Cases
- Cost calculation: given token counts + pricing, total matches expected.
- Provider normalization: a response without usage doesn't crash; metrics show None gracefully.
- Judge parsing: the judge returns malformed JSON → recover with a fallback parse or show an error.
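These cases as pytest-style tests (save as something like test_core.py and run with pytest), assuming the estimate_cost, parse_judge_reply, and ModelResult sketches from earlier sections stand in for your own code:

def test_cost_calculation_matches_expected():
    # 1,200 prompt tokens at $0.50/M plus 300 completion tokens at $1.50/M
    assert abs(estimate_cost("example-small-model", 1_200, 300) - 0.00105) < 1e-9

def test_judge_parsing_recovers_from_extra_prose():
    reply = 'Sure! Here is the score:\n{"accuracy": 8, "clarity": 9, "actionability": 7, "format_adherence": 10, "notes": "ok"}'
    parsed = parse_judge_reply(reply)
    assert parsed is not None and parsed["accuracy"] == 8

def test_missing_usage_does_not_crash_metrics():
    result = ModelResult(text="hi", prompt_tokens=None, completion_tokens=None,
                         latency_ms=120, raw={})
    assert result.prompt_tokens is None  # the UI should render this as "n/a", not crash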
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| Hidden state between runs | "Same prompt, different output" confusion | Show full message payload and params per run |
| Misleading comparisons | Model A uses different max tokens | Lock params and display them prominently |
| Token pricing drift | Costs stop matching docs | Keep pricing in config with update date |
| Judge bias | Judge always prefers verbose outputs | Include brevity/format criteria in rubric |
Debugging strategies:
- Log the exact request payload (without secrets) and response metadata.
- Add a "replay session" button from stored payloads.
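A small sketch of the "log without secrets" strategy: mask anything that looks like a credential before the payload reaches your logs (the key names are examples, not an exhaustive list).

import copy

SENSITIVE_KEYS = {"api_key", "authorization", "x-api-key"}  # extend for your providers

def redact(payload: dict) -> dict:
    """Return a copy of the request payload with secret-looking fields masked."""
    clean = copy.deepcopy(payload)

    def _walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key.lower() in SENSITIVE_KEYS:
                    node[key] = "***redacted***"
                else:
                    _walk(value)
        elif isinstance(node, list):
            for item in node:
                _walk(item)

    _walk(clean)
    return clean

Call redact() on the exact request payload right before logging it; the unredacted payload never needs to leave the provider client.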
8. Extensions & Challenges
8.1 Beginner Extensions
- Add prompt library presets (roles: "teacher", "PM", "critic").
- Add a "diff view" highlighting structural differences between outputs.
8.2 Intermediate Extensions
- Add multi-turn conversations and show "context growth" per turn.
- Add batch evaluation (run 20 tasks; compute average rubric score).
8.3 Advanced Extensions
- Add provider plug-ins + dynamic model discovery.
- Add automatic prompt mutation (small edits) and search for improvements.
9. Real-World Connections
9.1 Industry Applications
- Prompt experimentation for support agents, summarizers, and copilots.
- A/B testing prompts/models before shipping to production.
- Cost governance for multi-agent systems.
9.2 Related Open Source Projects
- LangSmith: tracing + evals for LLM apps.
- OpenAI Evals: evaluation harness patterns and datasets.
9.3 Interview Relevance
- Explain sampling and why deterministic settings matter for tooling.
- Explain how you'd measure "quality" beyond subjective opinions.
10. Resources
10.1 Essential Reading
- The LLM Engineering Handbook (Paul Iusztin): prompt patterns + evals (Ch. 3, 8)
- AI Engineering (Chip Huyen): production LLM systems + failures (Ch. 2, 8)
10.2 Tools & Documentation
- Streamlit docs (state, forms, layout)
- Provider API docs for chat + usage fields
10.3 Related Projects in This Series
- Previous: none (start here)
- Next: Project 2 (RAG), which turns experimentation into memory-grounded assistants
11. Self-Assessment Checklist
- I can explain temperature/top-p and when to use each.
- I can compute per-request cost from token usage.
- I can design a rubric that discourages "verbose but wrong" answers.
- I can add a second provider without rewriting the UI.
12. Submission / Completion Criteria
Minimum Viable Completion:
- A/B comparison UI with at least one provider/model
- Token + cost + latency displayed
- Sessions persist to disk and can be exported
Full Completion:
- Optional judge scoring with a rubric
- Support for at least two models (or two providers)
- Robust error handling and clean session replay
Excellence (Going Above & Beyond):
- Batch eval mode with aggregated metrics
- Prompt library + search for improved prompts
This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.