Project 1: Prompt Contract Harness
Build a production-grade CLI tool that treats prompts like software artifacts with automated testing
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 3-5 days |
| Language | Python (Alternatives: TypeScript) |
| Prerequisites | Basic Python/TypeScript, LLM API access |
| Key Topics | Testing, Invariants, Regression, SLOs |
| Knowledge Area | PromptOps / Testing |
| Software/Tool | CLI harness + validators + reports |
| Main Book | "Site Reliability Engineering" by Google (Concepts of SLOs) |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | 4. The "Open Core" Infrastructure |
1. Learning Objectives
By completing this project, you will:
- Master Prompt Contracts: Learn to treat prompts as deterministic functions with defined inputs, invariant constraints, and strictly typed outputs
- Build Production Testing Infrastructure: Create automated testing pipelines that detect regression when prompts change
- Understand Evaluation Metrics: Apply SRE principles (SLOs, error budgets) to AI system reliability
- Handle Non-Determinism: Learn when to test at temperature=0 vs higher temperatures for different validation types
- Create Actionable Reports: Generate both human-readable (HTML) and machine-readable (JSON) test reports
- Implement Parametric Evaluation: Run tests across different models, temperatures, and configurations
- Design Robust Invariants: Translate vague business requirements into programmatic assertions
2. Theoretical Foundation
2.1 Core Concepts
Unit Testing for Prompts
In traditional software development, unit tests verify that a function produces expected outputs for given inputs. The same principle applies to prompts: they are functions that transform inputs (context, user query) into outputs (responses).
Traditional Function Testing:
def calculate_tax(amount: float, rate: float) -> float:
return amount * rate
# Test
assert calculate_tax(100, 0.1) == 10.0 # Deterministic
Prompt Testing:
def customer_support_prompt(query: str, context: str) -> dict:
# LLM call
response = llm.complete(f"Context: {context}\nQuery: {query}")
return parse_json(response)
# Test
result = customer_support_prompt("refund order #123", policy_docs)
assert result["has_citation"] == True # Contract check
assert "order_id" in result # Required field
The key difference: prompts are probabilistic, so we must test invariants (properties that should always hold) rather than exact outputs.
Invariants and Contracts
An invariant is a condition that must always be true regardless of the specific execution. A contract is a set of invariants that define what "correct" means.
Types of Invariants:
| Type | Description | Example |
|---|---|---|
| Structural | Output format/schema | "Must be valid JSON" |
| Semantic | Meaning constraints | "Must cite a source document" |
| Safety | Security boundaries | "Must not contain PII" |
| Performance | Resource limits | "Must respond in <500ms" |
Example Contract:
contract:
name: "Customer Support Response"
invariants:
- type: schema
spec: response_schema.json
- type: citation
rule: "Must reference at least one policy document"
- type: length
min: 50
max: 500
- type: tone
rule: "Must be professional and empathetic"
Service Level Objectives (SLOs)
SLOs define the reliability target for your system. For LLM applications:
Example SLOs:
- Accuracy SLO: 99% of responses must pass all invariants
- Latency SLO: 95th percentile response time < 500ms
- Cost SLO: Average cost per request < $0.01
Error Budgets: If your SLO is 99%, you have a 1% error budget. Once exhausted, you halt prompt changes until you fix regressions.
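To make the error-budget idea concrete, here is a minimal sketch (not part of the harness itself) that computes budget consumption from a list of pass/fail results; the 99% target and the halt rule mirror the example above.
```python
# Minimal error-budget check for an accuracy SLO (illustrative sketch).
# `results` holds one boolean per request: True if all invariants passed.
def error_budget_status(results: list[bool], slo: float = 0.99) -> dict:
    total = len(results)
    failures = sum(1 for passed in results if not passed)
    allowed_failures = (1 - slo) * total          # the error budget for this window
    return {
        "failure_rate": failures / total if total else 0.0,
        "budget_remaining": allowed_failures - failures,
        "halt_prompt_changes": failures > allowed_failures,  # budget exhausted
    }

# 200 requests with 3 failures against a 99% SLO (budget = 2) -> halt changes
print(error_budget_status([True] * 197 + [False] * 3))
```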
Deterministic vs Non-Deterministic Testing
Temperature controls randomness in LLM outputs:
- temp=0.0: Deterministic (always picks highest probability token)
- temp=0.7: Balanced creativity and consistency
- temp=1.0+: High creativity, low consistency
Testing Strategy:
Deterministic Tests (temp=0.0):
- JSON schema validation
- Required field presence
- Exact format matches
- Citation checks
Non-Deterministic Tests (temp=0.7, N=10 samples; see the sampling sketch below):
- Tone/style quality (average score)
- Creativity metrics
- Diversity of responses
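The sampling strategy for non-deterministic checks can be expressed as a small helper. This is a sketch under stated assumptions: `generate` and `score` are placeholders you supply (an LLM call at temperature 0.7 and a 0-to-1 quality scorer, for example an LLM-as-a-judge grader).
```python
# Sketch: aggregate N non-deterministic samples instead of asserting one exact output.
import statistics
from typing import Callable

def sampled_quality_check(
    generate: Callable[[], str],    # placeholder: calls the LLM at temperature=0.7
    score: Callable[[str], float],  # placeholder: returns a 0.0-1.0 quality score
    n: int = 10,
    threshold: float = 0.8,
) -> bool:
    """Pass if the mean score over N samples clears the threshold."""
    scores = [score(generate()) for _ in range(n)]
    return statistics.mean(scores) >= threshold
```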
Schema Validation
JSON Schema acts as a "type system" for LLM outputs. It ensures the model's response is structurally valid before your application processes it.
Example Schema:
{
"type": "object",
"properties": {
"answer": {"type": "string", "minLength": 10},
"confidence": {"type": "number", "minimum": 0, "maximum": 1},
"citations": {
"type": "array",
"items": {"type": "string"},
"minItems": 1
}
},
"required": ["answer", "confidence", "citations"],
"additionalProperties": false
}
This prevents crashes from missing fields, wrong types, or hallucinated extra fields.
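As a minimal illustration, the check can be done with the `jsonschema` library (the same library the harness uses later); anything that fails parsing or validation is rejected before business logic runs.
```python
# Structural gate: reject malformed model output before the application uses it.
import json
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string", "minLength": 10},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "citations": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
    "required": ["answer", "confidence", "citations"],
    "additionalProperties": False,
}

raw = '{"answer": "Refunds are accepted within 30 days.", "confidence": 0.9, "citations": ["policy_doc_1"]}'
try:
    validate(instance=json.loads(raw), schema=schema)  # raises on structural problems
    print("structurally valid")
except (json.JSONDecodeError, ValidationError) as err:
    print(f"reject response: {err}")
```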
2.2 Why This Matters
Production Relevance
Real-world LLM applications fail in production due to:
- Prompt Changes: A single word change can break edge cases
- Model Updates: Provider model updates can change behavior
- Context Drift: As your data changes, prompt performance degrades
- Edge Cases: Rare inputs that weren't tested manually
Without automated testing: You discover failures after users complain. With this harness: You catch regressions before deployment.
Real-World Applications
This testing pattern is used by:
- OpenAI Evals: OpenAI's internal framework for model evaluation
- LangSmith: LangChain's testing and observability platform
- Anthropic Workbench: Claude application testing infrastructure
- Microsoft Prompt Flow: Azure's prompt engineering toolchain
Companies like Stripe, Shopify, and Notion use similar harnesses to ensure their AI features maintain quality across updates.
2.3 Historical Context
Evolution of Prompt Engineering
2020-2021: The "Vibes" Era
- Prompts were magic spells
- "Act as a…" and "Take a deep breath" heuristics
- No systematic evaluation
- Manual testing only
2022-2023: The Systematization Era
- OpenAI releases Evals framework
- Chain-of-Thought (CoT) prompting formalized
- Few-shot learning standardized
- Schema-based outputs emerge
2024+: The Engineering Discipline Era
- Prompts as code (version control, testing, CI/CD)
- Automated evaluation pipelines
- Statistical significance testing
- Production observability
This project teaches you the modern, engineering-driven approach.
2.4 Common Misconceptions
| Misconception | Reality |
|---|---|
| "Prompts are too random to test" | You test invariants, not exact outputs |
| "Manual testing is sufficient" | Manual testing doesn't scale to 100+ edge cases |
| "Higher temperature = better responses" | Temperature depends on use case; many tasks need temp=0 |
| "If it works once, it works always" | LLM outputs vary; you need statistical sampling |
| "Testing slows down development" | Testing prevents shipping broken prompts (faster overall) |
3. Project Specification
3.1 What You Will Build
A command-line tool that:
- Loads test suites from YAML/JSON files containing test cases
- Executes prompts against LLM APIs (OpenAI, Anthropic, etc.)
- Validates responses against defined invariants
- Generates reports in HTML and JSON formats
- Detects regression by comparing current run to previous versions
- Provides actionable feedback on which tests failed and why
Core Question This Tool Answers:
"How do I know if my prompt change made things better or worse?"
3.2 Functional Requirements
FR1: Test Suite Loading
- Load test cases from YAML or JSON files
- Support multiple test categories (e.g., Refund, Technical, Policy)
- Parse test case metadata: ID, input, expected invariants
- Validate test file structure before execution
FR2: Prompt Execution
- Support multiple LLM providers (OpenAI, Anthropic)
- Execute prompts with configurable temperature, max tokens, etc.
- Handle API rate limiting with exponential backoff
- Capture response metadata: latency, token count
FR3: Invariant Checking
Implement at least these validator types:
- JSONSchema: Validate against JSON Schema spec
- Contains: Check if response contains specific strings
- NotContains: Ensure response doesn't contain forbidden strings (sketched below)
- Length: Validate character/word count bounds
- Citation: Verify presence of source citations
- Regex: Pattern matching for structured formats
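For a feel of what two of the validator types listed above check, here is a simplified, standalone sketch; the real harness (section 5) wraps such checks in plugin classes that return a ValidationResult, whereas these helpers just return a pass flag and a message.
```python
# Simplified NotContains and Regex checks (illustrative only).
import re
from typing import Tuple

def check_not_contains(response: str, forbidden: str) -> Tuple[bool, str]:
    """NotContains: fail if a forbidden string appears in the response."""
    passed = forbidden not in response
    return passed, "" if passed else f"Response contains forbidden string '{forbidden}'"

def check_regex(response: str, pattern: str) -> Tuple[bool, str]:
    """Regex: require a structured format, e.g. an order ID like '#12345'."""
    passed = re.search(pattern, response) is not None
    return passed, "" if passed else f"Response does not match pattern '{pattern}'"

print(check_regex("Your order #12345 is eligible for a refund.", r"#\d{5}"))
```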
FR4: Report Generation
- Console Output: Rich, colorized terminal output with progress bars
- HTML Report: Interactive web page with drill-down capabilities
- JSON Export: Machine-readable format for CI/CD integration
- Trend Tracking: Compare current run to historical runs
FR5: Regression Detection
- Store test results with version identifiers
- Compare current run against previous baseline
- Flag accuracy drops above configurable threshold (e.g., >5% drop)
- Provide diff view showing which cases broke
3.3 Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Performance | Handle 50+ test cases in <5 minutes | Reasonable for CI/CD pipelines |
| Reliability | Retry failed API calls 3x with backoff | Handle transient network issues |
| Usability | Clear error messages with fix suggestions | Developers must understand failures quickly |
| Extensibility | Plugin architecture for custom validators | Different use cases need custom checks |
| Cost Efficiency | Cache responses during development | Avoid paying for repeated identical calls |
3.4 Example Usage
Running the harness:
$ python harness.py test prompts/support_agent.yaml
╔════════════════════════════════════════════════╗
║    PROMPT HARNESS v1.0 - Test Suite Execution   ║
╚════════════════════════════════════════════════╝
Loading test suite: prompts/support_agent.yaml
Found 16 test cases across 3 categories (Refund, Technical, Policy)
Testing against: gpt-4 (temperature=0.0)
RUNNING SUITE: Customer Support Evals
──────────────────────────────────────────────────
Category: Refund Queries
──────────────────────────────────────────────────
[PASS] Case: simple_refund_query (120ms)
Input: "I want to return my order #12345"
  ✓ Invariant: Valid JSON Schema ............. OK
  ✓ Invariant: Has Citation .................. OK (Cited: policy_doc_3)
  ✓ Invariant: Contains Order ID ............. OK (#12345)
  ✓ Invariant: Polite Tone ................... OK (Confidence: 0.95)
[PASS] Case: refund_outside_window (135ms)
Input: "Can I return something I bought 6 months ago?"
  ✓ Invariant: Valid JSON Schema ............. OK
  ✓ Invariant: Has Citation .................. OK (Cited: policy_doc_1)
  ✓ Invariant: Mentions Time Limit ........... OK
  ✓ Invariant: Suggests Alternative .......... OK
[FAIL] Case: ambiguous_policy_query (98ms)
Input: "What's your return policy?"
  ✓ Invariant: Valid JSON Schema ............. OK
  ✗ Invariant: Has Citation .................. FAIL
  ✓ Invariant: Contains Policy Details ....... OK
Expected: Citation to a policy document
Actual Output: {
"response": "I think you can return it maybe?",
"confidence": 0.3,
"citation": null
}
Failure Reason: Model responded with vague language without
grounding answer in provided policy documents.
Category: Technical Support
──────────────────────────────────────────────────
[PASS] Case: password_reset (105ms)
[PASS] Case: account_locked (118ms)
[PASS] Case: api_integration_help (142ms)
... (showing 3/10 cases for brevity)
──────────────────────────────────────────────────
SUMMARY REPORT
──────────────────────────────────────────────────
Total Cases: 16
Passed: 15
Failed: 1
Success Rate: 93.7%
Total Time: 1.82s
Avg Latency: 113ms
FAILURES BY INVARIANT:
โข Has Citation: 1 failure (ambiguous_policy_query)
──────────────────────────────────────────────────
⚠ REGRESSION DETECTED: Score dropped from 100% (v1.2.0) to 93.7%
See detailed report: ./reports/run_2024-12-27_14-32-01.html
Recommendations:
1. Review prompt instructions for citation requirements
2. Add explicit instruction: "Always cite source documents"
3. Consider adding few-shot examples with citations
Integration with workflow:
# Run before committing a prompt change
$ python harness.py test prompts/support_agent.yaml --compare-to v1.2.0
# Run in CI/CD (fail build if score drops below threshold)
$ python harness.py test prompts/*.yaml --min-score 95 --format junit
# Run parametric sweep (test across multiple models/temperatures)
$ python harness.py test prompts/support_agent.yaml --sweep temperature=0.0,0.3,0.7
# Generate regression report comparing two versions
$ python harness.py diff v1.2.0 v1.3.0 --output regression_report.md
Generated HTML Report Features:
The tool generates a detailed HTML report (reports/run_2024-12-27_14-32-01.html) with:
- Side-by-side comparison: Shows your prompt version vs. the previous version
- Per-case drill-down: Click any failed case to see full input, expected output, actual output, and which specific assertion failed
- Trend graphs: Visual charts showing your accuracy over time across different prompt versions
- Diff highlighting: Color-coded changes showing what you modified in your prompt between runs
- Export options: Download results as JSON for integration with CI/CD pipelines
Example HTML report sections:
- "Accuracy Trend" graph showing 100% → 95% → 93.7% over three runs
- "Token Usage Analysis" showing average tokens per response
- "Latency Distribution" histogram showing response time patterns
- "Failure Clustering" identifying which types of queries break most often
4. Solution Architecture
4.1 High-Level Design
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Test Loader │─────▶│  Test Runner │─────▶│   Reporter   │
└──────────────┘      └──────────────┘      └──────────────┘
        │                     │                     │
        │                     ▼                     │
        │             ┌──────────────┐              │
        │             │  LLM Client  │              │
        │             └──────────────┘              │
        │                     │                     │
        │                     ▼                     │
        │             ┌──────────────┐              │
        │             │  Validators  │              │
        │             └──────────────┘              │
        │                     │                     │
        └───────────▶    Results DB   ◀─────────────┘
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| TestLoader | Parse YAML/JSON test files | Use PyYAML for human-readable format with comments |
| TestRunner | Execute tests with rate limiting | Async execution with asyncio for concurrent API calls |
| LLMClient | Abstract API calls to different providers | Strategy pattern for OpenAI/Anthropic/local models |
| Validators | Check invariants against responses | Plugin architecture - easy to add custom validators |
| Reporter | Generate HTML/JSON reports | Jinja2 templates for HTML, structured JSON for CI/CD |
| ResultsDB | Store historical test runs | SQLite for simplicity, JSON files for portability |
4.3 Data Structures
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from enum import Enum
class InvariantType(Enum):
JSON_SCHEMA = "json_schema"
CONTAINS = "contains"
NOT_CONTAINS = "not_contains"
LENGTH = "length"
CITATION = "citation"
REGEX = "regex"
CUSTOM = "custom"
@dataclass
class Invariant:
"""Defines a single invariant check"""
type: InvariantType
name: str
spec: Dict[str, Any] # Type-specific configuration
@dataclass
class TestCase:
"""Represents a single test case"""
id: str
category: str
input: str
context: Optional[str]
expected_invariants: List[Invariant]
metadata: Dict[str, Any]
@dataclass
class ValidationResult:
"""Result of validating a single invariant"""
invariant_name: str
passed: bool
error_message: Optional[str]
details: Dict[str, Any]
@dataclass
class TestResult:
"""Complete result for one test case"""
test_case: TestCase
response: str
passed: bool
validation_results: List[ValidationResult]
latency_ms: float
tokens_used: int
timestamp: str
@dataclass
class TestSuite:
"""Collection of test cases"""
name: str
version: str
test_cases: List[TestCase]
@dataclass
class TestReport:
"""Aggregated results across all tests"""
suite_name: str
version: str
total_cases: int
passed_cases: int
failed_cases: int
success_rate: float
total_time_s: float
avg_latency_ms: float
results: List[TestResult]
regression_info: Optional[Dict[str, Any]]
4.4 Algorithm Overview
Test Execution Algorithm
def run_test_suite(suite: TestSuite, config: Config) -> TestReport:
"""
Main test execution algorithm
Complexity: O(n) where n = number of test cases
Space: O(n) for storing results
"""
results = []
start_time = time.time()
for test_case in suite.test_cases:
# 1. Build prompt with input and context
prompt = build_prompt(test_case.input, test_case.context)
# 2. Call LLM API (with retry logic)
response, metadata = call_llm_with_retry(
prompt=prompt,
config=config,
max_retries=3
)
# 3. Validate against all invariants
validation_results = []
for invariant in test_case.expected_invariants:
validator = get_validator(invariant.type)
result = validator.validate(response, invariant.spec)
validation_results.append(result)
# 4. Determine if test passed (all invariants must pass)
test_passed = all(r.passed for r in validation_results)
# 5. Record result
results.append(TestResult(
test_case=test_case,
response=response,
passed=test_passed,
validation_results=validation_results,
latency_ms=metadata['latency_ms'],
tokens_used=metadata['tokens']
))
# 6. Generate aggregate report
total_time = time.time() - start_time
return generate_report(results, total_time)
Regression Detection Algorithm
def detect_regression(current_run: TestReport,
baseline_run: TestReport,
threshold: float = 0.05) -> Dict[str, Any]:
"""
Compare two test runs to detect regression
Args:
current_run: Latest test results
baseline_run: Previous baseline to compare against
threshold: Maximum acceptable drop in success rate (e.g., 0.05 = 5%)
Returns:
Dictionary with regression analysis
"""
# Calculate score delta
score_delta = current_run.success_rate - baseline_run.success_rate
# Identify newly broken test cases
current_failures = {r.test_case.id for r in current_run.results if not r.passed}
baseline_failures = {r.test_case.id for r in baseline_run.results if not r.passed}
new_failures = current_failures - baseline_failures
fixed_cases = baseline_failures - current_failures
# Categorize failures by invariant type
failure_breakdown = defaultdict(list)
for result in current_run.results:
if not result.passed:
for validation in result.validation_results:
if not validation.passed:
failure_breakdown[validation.invariant_name].append(result.test_case.id)
return {
"regression_detected": score_delta < -threshold,
"score_delta": score_delta,
"new_failures": list(new_failures),
"fixed_cases": list(fixed_cases),
"failure_breakdown": dict(failure_breakdown)
}
Complexity Analysis:
- Time Complexity: O(n × m) where n = test cases, m = invariants per case
- Space Complexity: O(n) for storing all results
- API Call Parallelization: Can achieve O(n/p) with p parallel workers (see the sketch below)
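The O(n/p) figure assumes bounded concurrency. A sketch of that pattern with an asyncio semaphore is below; `call_llm` is a stand-in for an async client call (a stub coroutine is used here so the example runs on its own).
```python
# Bounded-concurrency sketch: at most `max_concurrency` API calls in flight.
import asyncio

async def run_parallel(prompts, call_llm, max_concurrency: int = 5):
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(prompt: str) -> str:
        async with sem:                 # blocks when the limit is reached
            return await call_llm(prompt)

    return await asyncio.gather(*(worker(p) for p in prompts))

# Stub LLM call so the sketch is self-contained; replace with a real async client.
async def fake_llm(prompt: str) -> str:
    await asyncio.sleep(0.1)
    return f"response to: {prompt}"

print(asyncio.run(run_parallel([f"case {i}" for i in range(10)], fake_llm)))
```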
5. Implementation Guide
5.1 Development Environment Setup
# Create project directory
mkdir prompt-harness
cd prompt-harness
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install openai anthropic pyyaml jsonschema jinja2 click rich pytest
# Create .env file for API keys
cat > .env << EOF
OPENAI_API_KEY="your-openai-key-here"
ANTHROPIC_API_KEY="your-anthropic-key-here"
EOF
# Install python-dotenv for env management
pip install python-dotenv
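One way the harness can pick up those keys at startup (a sketch, assuming python-dotenv and a `.env` file in the project root):
```python
# e.g. near the top of harness/cli.py or harness/llm_client.py
import os
from dotenv import load_dotenv

load_dotenv()  # copies values from .env into the process environment

if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
```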
5.2 Project Structure
prompt-harness/
โโโ harness/
โ โโโ __init__.py
โ โโโ loader.py # Test file parsing
โ โโโ runner.py # Test execution engine
โ โโโ validators.py # Invariant validators
โ โโโ reporter.py # Report generation
โ โโโ llm_client.py # LLM API abstraction
โ โโโ models.py # Data structures
โ โโโ cli.py # CLI interface
โโโ tests/
โ โโโ __init__.py
โ โโโ test_loader.py
โ โโโ test_validators.py
โ โโโ test_runner.py
โโโ prompts/
โ โโโ support_agent.yaml # Example test suite
โ โโโ schemas/
โ โโโ response_schema.json
โโโ templates/
โ โโโ report.html.jinja2 # HTML report template
โโโ results/
โ โโโ .gitkeep # Store test run results
โโโ reports/
โ โโโ .gitkeep # Generated HTML reports
โโโ pyproject.toml # Project metadata
โโโ requirements.txt # Dependencies
โโโ .env # API keys (gitignored)
โโโ .gitignore
โโโ README.md
5.3 Implementation Phases
Phase 1: Foundation (Day 1)
Goals:
- Set up project structure with proper Python packaging
- Implement test file loader
- Create basic data structures
- Write unit tests for loader
Tasks:
1. **Create Project Scaffold**
```bash
# Initialize project
mkdir -p harness tests prompts templates results reports
touch harness/__init__.py tests/__init__.py
```
2. **Implement Data Models** (`harness/models.py`)
```python
# See the data structures in section 4.3 above
# Copy the dataclass definitions into this module
```
3. **Implement YAML Test Loader** (`harness/loader.py`)
```python
import yaml
from pathlib import Path
from typing import List

from .models import TestSuite, TestCase, Invariant, InvariantType


class TestLoader:
    """Loads and parses test suite files"""
def load_from_file(self, filepath: Path) -> TestSuite:
"""Load test suite from YAML file"""
with open(filepath, 'r') as f:
data = yaml.safe_load(f)
return self._parse_suite(data)
def _parse_suite(self, data: dict) -> TestSuite:
"""Parse raw YAML data into TestSuite object"""
test_cases = [
self._parse_test_case(tc)
for tc in data.get('test_cases', [])
]
return TestSuite(
name=data['name'],
version=data['version'],
test_cases=test_cases
)
def _parse_test_case(self, data: dict) -> TestCase:
"""Parse a single test case"""
invariants = [
self._parse_invariant(inv)
for inv in data.get('invariants', [])
]
return TestCase(
id=data['id'],
category=data['category'],
input=data['input'],
context=data.get('context'),
expected_invariants=invariants,
metadata=data.get('metadata', {})
)
def _parse_invariant(self, data: dict) -> Invariant:
"""Parse an invariant definition"""
return Invariant(
type=InvariantType(data['type']),
name=data['name'],
spec=data.get('spec', {})
        )
```
4. **Create Sample Test File** (`prompts/support_agent.yaml`)
```yaml
name: "Customer Support Evals"
version: "1.0.0"

test_cases:
  - id: "simple_refund_query"
    category: "Refund"
    input: "I want to return my order #12345"
    context: |
      Policy: Customers can return items within 30 days.
      Order #12345 was placed 10 days ago.
    invariants:
      - type: "json_schema"
        name: "Valid JSON Schema"
        spec:
          schema_file: "prompts/schemas/response_schema.json"
      - type: "contains"
        name: "Has Citation"
        spec:
          substring: "policy_doc"
      - type: "contains"
        name: "Contains Order ID"
        spec:
          substring: "#12345"
```
5. **Write Unit Tests** (`tests/test_loader.py`)
```python
import pytest
from pathlib import Path

from harness.loader import TestLoader


def test_load_valid_suite():
    loader = TestLoader()
    suite = loader.load_from_file(Path("prompts/support_agent.yaml"))

    assert suite.name == "Customer Support Evals"
    assert len(suite.test_cases) > 0


def test_parse_invariants():
    loader = TestLoader()
    suite = loader.load_from_file(Path("prompts/support_agent.yaml"))

    test_case = suite.test_cases[0]
    assert len(test_case.expected_invariants) == 3
    assert test_case.expected_invariants[0].name == "Valid JSON Schema"
```
Checkpoint: Can load and parse a sample test YAML file, verified by passing unit tests.
Phase 2: Core Execution (Days 2-3)
Goals:
- Implement LLM API client with retry logic
- Build test runner that executes cases
- Create core validators
- Add rate limiting and error handling
Tasks:
1. **Implement LLM Client** (`harness/llm_client.py`)
```python
import time
from dataclasses import dataclass
from typing import Dict, Any, Optional

import openai


@dataclass
class LLMResponse:
    text: str
    latency_ms: float
    tokens_used: int
    model: str


class LLMClient:
    """Abstract interface for LLM API calls"""
def __init__(self, provider: str, model: str, temperature: float = 0.0):
self.provider = provider
self.model = model
self.temperature = temperature
def complete(self, prompt: str, max_retries: int = 3) -> LLMResponse:
"""Call LLM with exponential backoff retry"""
for attempt in range(max_retries):
try:
return self._call_api(prompt)
except Exception as e:
if attempt == max_retries - 1:
raise
wait_time = 2 ** attempt
time.sleep(wait_time)
def _call_api(self, prompt: str) -> LLMResponse:
"""Make actual API call (provider-specific)"""
start = time.time()
        if self.provider == "openai":
            # openai>=1.0 client interface; reads OPENAI_API_KEY from the environment
            client = openai.OpenAI()
            response = client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=self.temperature
            )
            text = response.choices[0].message.content
            tokens = response.usage.total_tokens
else:
raise ValueError(f"Unknown provider: {self.provider}")
latency_ms = (time.time() - start) * 1000
return LLMResponse(
text=text,
latency_ms=latency_ms,
tokens_used=tokens,
model=self.model
        )
```
2. **Implement Validators** (`harness/validators.py`)
```python
import json
import re
from abc import ABC, abstractmethod
from typing import Dict, Any

from jsonschema import validate, ValidationError

from .models import ValidationResult


class Validator(ABC):
    """Base class for all validators"""
@abstractmethod
def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
pass
class JSONSchemaValidator(Validator):
    """Validates response against JSON Schema"""
def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
try:
# Parse response as JSON
data = json.loads(response)
# Load schema from file
with open(spec['schema_file'], 'r') as f:
schema = json.load(f)
# Validate
validate(instance=data, schema=schema)
return ValidationResult(
invariant_name="JSON Schema",
passed=True,
error_message=None,
details={"schema_file": spec['schema_file']}
)
except json.JSONDecodeError as e:
return ValidationResult(
invariant_name="JSON Schema",
passed=False,
error_message=f"Invalid JSON: {str(e)}",
details={}
)
except ValidationError as e:
return ValidationResult(
invariant_name="JSON Schema",
passed=False,
error_message=f"Schema validation failed: {e.message}",
details={"path": list(e.path)}
)
class ContainsValidator(Validator):
    """Validates that response contains a substring"""
def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
substring = spec['substring']
passed = substring in response
return ValidationResult(
invariant_name=f"Contains '{substring}'",
passed=passed,
error_message=None if passed else f"Response does not contain '{substring}'",
details={"substring": substring}
)
class LengthValidator(Validator):
    """Validates response length"""
def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
min_len = spec.get('min', 0)
max_len = spec.get('max', float('inf'))
actual_len = len(response)
passed = min_len <= actual_len <= max_len
return ValidationResult(
invariant_name="Length",
passed=passed,
            error_message=None if passed else f"Length {actual_len} not in range [{min_len}, {max_len}]",
details={"actual": actual_len, "min": min_len, "max": max_len}
)
# Validator factory
VALIDATORS = {
    "json_schema": JSONSchemaValidator(),
    "contains": ContainsValidator(),
    "length": LengthValidator()
}


def get_validator(validator_type: str) -> Validator:
    """Get validator instance by type"""
    return VALIDATORS.get(validator_type)
```
3. **Implement Test Runner** (`harness/runner.py`)
```python
import time
from typing import List
from .models import TestSuite, TestResult, TestReport, ValidationResult
from .llm_client import LLMClient
from .validators import get_validator
class TestRunner:
"""Executes test suites"""
def __init__(self, llm_client: LLMClient):
self.llm_client = llm_client
def run_suite(self, suite: TestSuite) -> TestReport:
"""Execute all tests in suite"""
results = []
start_time = time.time()
for test_case in suite.test_cases:
# Build prompt
prompt = self._build_prompt(test_case)
# Call LLM
llm_response = self.llm_client.complete(prompt)
# Validate against invariants
validation_results = []
for invariant in test_case.expected_invariants:
validator = get_validator(invariant.type.value)
result = validator.validate(llm_response.text, invariant.spec)
validation_results.append(result)
# Determine pass/fail
test_passed = all(v.passed for v in validation_results)
# Store result
results.append(TestResult(
test_case=test_case,
response=llm_response.text,
passed=test_passed,
validation_results=validation_results,
latency_ms=llm_response.latency_ms,
tokens_used=llm_response.tokens_used,
timestamp=time.strftime("%Y-%m-%d %H:%M:%S")
))
total_time = time.time() - start_time
return self._generate_report(suite, results, total_time)
def _build_prompt(self, test_case):
"""Construct prompt from test case"""
parts = []
if test_case.context:
parts.append(f"Context:\n{test_case.context}\n")
parts.append(f"Query: {test_case.input}")
return "\n".join(parts)
def _generate_report(self, suite, results, total_time):
"""Generate aggregate report"""
passed = sum(1 for r in results if r.passed)
failed = len(results) - passed
return TestReport(
suite_name=suite.name,
version=suite.version,
total_cases=len(results),
passed_cases=passed,
failed_cases=failed,
success_rate=passed / len(results) if results else 0.0,
total_time_s=total_time,
avg_latency_ms=sum(r.latency_ms for r in results) / len(results) if results else 0.0,
results=results,
regression_info=None
        )
```
Checkpoint: Can run tests against real LLM API and validate basic invariants.
Phase 3: Reporting & Polish (Days 4-5)
Goals:
- Generate HTML and JSON reports
- Add regression detection
- Polish CLI interface with rich formatting
- Add detailed failure analysis
Tasks:
1. **Implement Reporter** (`harness/reporter.py`)
```python
import json
from pathlib import Path

from jinja2 import Environment, FileSystemLoader

from .models import TestReport


class Reporter:
    """Generates test reports in various formats"""
def __init__(self, template_dir: Path):
self.jinja_env = Environment(loader=FileSystemLoader(template_dir))
def generate_html(self, report: TestReport, output_path: Path):
"""Generate HTML report"""
template = self.jinja_env.get_template('report.html.jinja2')
html = template.render(
report=report,
failure_breakdown=self._get_failure_breakdown(report)
)
with open(output_path, 'w') as f:
f.write(html)
def generate_json(self, report: TestReport, output_path: Path):
"""Generate JSON report for CI/CD"""
data = {
"suite_name": report.suite_name,
"version": report.version,
"summary": {
"total": report.total_cases,
"passed": report.passed_cases,
"failed": report.failed_cases,
"success_rate": report.success_rate
},
"results": [
{
"test_id": r.test_case.id,
"passed": r.passed,
"latency_ms": r.latency_ms,
"failures": [
v.invariant_name
for v in r.validation_results
if not v.passed
]
}
for r in report.results
]
}
with open(output_path, 'w') as f:
json.dump(data, f, indent=2)
def _get_failure_breakdown(self, report: TestReport):
"""Group failures by invariant type"""
breakdown = {}
for result in report.results:
if not result.passed:
for validation in result.validation_results:
if not validation.passed:
if validation.invariant_name not in breakdown:
breakdown[validation.invariant_name] = []
breakdown[validation.invariant_name].append(result.test_case.id)
        return breakdown
```
2. **Create HTML Template** (`templates/report.html.jinja2`)
```html
<!DOCTYPE html>
<html>
<head><title>{{ report.suite_name }}</title></head>
<body>
  <h1>{{ report.suite_name }}</h1>
  <p>Version: {{ report.version }}</p>
  <p>Success Rate: {{ "%.1f"|format(report.success_rate * 100) }}%</p>
  <p>Total Cases: {{ report.total_cases }} | Passed: {{ report.passed_cases }} | Failed: {{ report.failed_cases }}</p>

  <h2>Test Results</h2>
  {% for result in report.results %}
  <div class="case">
    <h3>{{ "PASS" if result.passed else "FAIL" }} - {{ result.test_case.id }}</h3>
    <p>Input: {{ result.test_case.input }}</p>
    <p>Latency: {{ result.latency_ms|round(0) }}ms</p>
    {% for validation in result.validation_results %}
      {% if not validation.passed %}<p class="error">{{ validation.error_message }}</p>{% endif %}
    {% endfor %}
  </div>
  {% endfor %}

  <h2>Failure Breakdown</h2>
  <ul>
    {% for invariant, cases in failure_breakdown.items() %}
    <li>{{ invariant }}: {{ cases|length }} failure(s) - {{ cases|join(", ") }}</li>
    {% endfor %}
  </ul>
</body>
</html>
```
3. **Build CLI Interface** (`harness/cli.py`)
```python
import click
from pathlib import Path
from rich.console import Console
from rich.progress import track
from .loader import TestLoader
from .runner import TestRunner
from .llm_client import LLMClient
from .reporter import Reporter
console = Console()
@click.group()
def cli():
"""Prompt Contract Harness - Test your prompts like code"""
pass
@cli.command()
@click.argument('test_file', type=click.Path(exists=True))
@click.option('--model', default='gpt-4', help='LLM model to use')
@click.option('--temperature', default=0.0, help='Sampling temperature')
@click.option('--provider', default='openai', help='LLM provider')
def test(test_file, model, temperature, provider):
"""Run test suite"""
console.print(f"[bold blue]Loading test suite:[/bold blue] {test_file}")
# Load tests
loader = TestLoader()
suite = loader.load_from_file(Path(test_file))
console.print(f"Found {len(suite.test_cases)} test cases")
# Run tests
client = LLMClient(provider=provider, model=model, temperature=temperature)
runner = TestRunner(client)
console.print("[bold green]Running tests...[/bold green]")
report = runner.run_suite(suite)
# Print summary
console.print(f"\n[bold]Results:[/bold]")
console.print(f"Success Rate: {report.success_rate * 100:.1f}%")
console.print(f"Passed: {report.passed_cases}/{report.total_cases}")
# Generate reports
reporter = Reporter(Path("templates"))
html_path = Path(f"reports/run_{report.version}.html")
reporter.generate_html(report, html_path)
console.print(f"\n[bold]Report saved to:[/bold] {html_path}")
if __name__ == '__main__':
cli()
4. **Add Regression Detection**
```python
# In harness/runner.py, add this method to TestRunner:
def compare_with_baseline(self, current: TestReport, baseline: TestReport) -> dict:
    """Detect regression between runs"""
    score_delta = current.success_rate - baseline.success_rate
current_failures = {r.test_case.id for r in current.results if not r.passed}
baseline_failures = {r.test_case.id for r in baseline.results if not r.passed}
return {
"regression_detected": score_delta < -0.05, # 5% threshold
"score_delta": score_delta,
"new_failures": list(current_failures - baseline_failures),
"fixed_cases": list(baseline_failures - current_failures)
    }
```
Checkpoint: Full working harness with beautiful reports and regression detection.
5.4 Key Implementation Decisions
| Decision | Options Considered | Recommendation | Rationale |
|---|---|---|---|
| Test Format | JSON, YAML, Python code | YAML | Human-readable, supports comments, widely understood |
| Async vs Sync | asyncio, threading, sequential | asyncio | Better rate limit handling, can parallelize independent tests |
| Validation Architecture | Inline checks, Plugin system | Plugin system | Extensibility for custom validators, clean separation |
| Report Format | HTML only, JSON only, both | Both | HTML for humans, JSON for CI/CD integration |
| Results Storage | SQLite, JSON files, CSV | JSON files | Simple, portable, version-controllable (see the sketch after this table) |
| CLI Framework | argparse, click, typer | click | Rich ecosystem, good documentation, decorator-based |
| Progress Display | Print statements, rich library | rich library | Beautiful terminal output, progress bars, colors |
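A minimal sketch of the JSON-file storage decision, showing how a run could be persisted and a baseline loaded for `--compare-to`; the one-file-per-version layout under `results/` is an assumption, not a fixed format.
```python
# Sketch: store each run as results/run_<version>.json (assumed layout).
import json
from pathlib import Path

RESULTS_DIR = Path("results")

def save_run(version: str, summary: dict) -> Path:
    RESULTS_DIR.mkdir(exist_ok=True)
    path = RESULTS_DIR / f"run_{version}.json"
    path.write_text(json.dumps(summary, indent=2))
    return path

def load_baseline(version: str) -> dict:
    return json.loads((RESULTS_DIR / f"run_{version}.json").read_text())

save_run("v1.3.0", {"success_rate": 0.937, "failed_cases": ["ambiguous_policy_query"]})
baseline = load_baseline("v1.3.0")
print(baseline["success_rate"])
```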
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Coverage Target | Examples |
|---|---|---|---|
| Unit Tests | Test individual components in isolation | 90%+ | Validator logic, loader parsing, data models |
| Integration Tests | Test end-to-end flow with mocked LLM | 80%+ | Full test run with mock API, report generation |
| Smoke Tests | Quick sanity checks | Critical paths | Sample test suite passes, CLI loads |
| Regression Tests | Ensure bugs stay fixed | All bug fixes | Previously failing cases now pass |
6.2 Critical Test Cases
Unit Tests
Test: Loader Handles Invalid YAML
def test_loader_invalid_yaml():
loader = TestLoader()
with pytest.raises(yaml.YAMLError):
loader.load_from_file(Path("tests/fixtures/invalid.yaml"))
Test: JSONSchema Validator Detects Type Errors
def test_json_schema_type_error():
validator = JSONSchemaValidator()
response = '{"age": "twenty-five"}' # Should be int
spec = {"schema_file": "tests/fixtures/user_schema.json"}
result = validator.validate(response, spec)
assert not result.passed
assert "type" in result.error_message.lower()
Test: Contains Validator Case Sensitivity
def test_contains_case_sensitive():
validator = ContainsValidator()
response = "The user wants a REFUND"
spec = {"substring": "refund", "case_sensitive": True}
result = validator.validate(response, spec)
assert not result.passed # "REFUND" != "refund"
Integration Tests
Test: Full Run with Mocked LLM
def test_full_run_with_mock(mocker):
# Mock LLM to return deterministic responses
mock_client = mocker.Mock(spec=LLMClient)
mock_client.complete.return_value = LLMResponse(
text='{"answer": "test", "confidence": 0.9}',
latency_ms=100,
tokens_used=50,
model="gpt-4"
)
loader = TestLoader()
suite = loader.load_from_file(Path("tests/fixtures/sample.yaml"))
runner = TestRunner(mock_client)
report = runner.run_suite(suite)
assert report.total_cases > 0
assert report.success_rate >= 0.0
Test: Regression Detection Works
def test_regression_detection():
# Create two reports with different scores
baseline = TestReport(
suite_name="test",
version="1.0",
total_cases=10,
passed_cases=10,
failed_cases=0,
success_rate=1.0,
# ...
)
current = TestReport(
suite_name="test",
version="1.1",
total_cases=10,
passed_cases=8,
failed_cases=2,
success_rate=0.8,
# ...
)
runner = TestRunner(None)
regression = runner.compare_with_baseline(current, baseline)
assert regression["regression_detected"] == True
assert regression["score_delta"] == -0.2
6.3 Test Data
Sample Test Suite (tests/fixtures/sample.yaml)
name: "Sample Test Suite"
version: "1.0.0"
test_cases:
- id: "valid_json_response"
category: "Format"
input: "Return user info as JSON"
invariants:
- type: "json_schema"
name: "Valid JSON"
spec:
schema_file: "tests/fixtures/user_schema.json"
- id: "contains_citation"
category: "Grounding"
input: "What is our refund policy?"
context: "Policy: 30-day returns allowed"
invariants:
- type: "contains"
name: "Mentions Policy"
spec:
substring: "30-day"
JSON Schema (tests/fixtures/user_schema.json)
{
"type": "object",
"properties": {
"answer": {"type": "string"},
"confidence": {"type": "number", "minimum": 0, "maximum": 1}
},
"required": ["answer", "confidence"]
}
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Root Cause | Solution |
|---|---|---|---|
| No Rate Limiting | 429 Too Many Requests errors | Sending requests too fast to API | Implement exponential backoff, use async with semaphore for concurrency control |
| Flaky Tests at High Temp | Inconsistent pass/fail results | Non-deterministic sampling (temp > 0) | Use temp=0 for deterministic tests, or run multiple samples and aggregate |
| Poor Error Messages | "Test failed" with no context | Generic exception handling | Include full context: expected vs actual, test case ID, invariant name |
| Missing Token Tracking | Unexpected API bills | Not logging token usage | Log tokens per request, aggregate in report, set budget alerts |
| Timeout Handling | Tests hang indefinitely | No timeout on API calls | Set request timeout (e.g., 30s), retry with backoff |
| Schema Path Issues | "Schema file not found" | Relative paths break in different contexts | Use absolute paths or paths relative to project root |
| Not Validating Test Files | Cryptic errors during execution | Malformed YAML loaded without validation | Validate test file structure before running tests |
| Hardcoded API Keys | Security vulnerability | Keys in source code | Use environment variables, never commit keys to git (see the client-options sketch below) |
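For the timeout and hardcoded-key rows above, a sketch of safer client construction; `timeout` and `max_retries` are constructor options in the OpenAI Python SDK v1+ (verify against the docs for your installed version).
```python
# Fail fast, retry transient errors, and keep the key out of source control.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],  # never hardcode keys in code
    timeout=30.0,                          # per-request timeout instead of hanging
    max_retries=3,                         # SDK-level retries for transient failures
)
```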
7.2 Debugging Strategies
API Issues
Problem: Getting 401 Unauthorized errors
Debug Steps:
- Enable verbose logging to see full request/response
  import logging
  logging.basicConfig(level=logging.DEBUG)
- Verify API key is loaded correctly
  import os
  print(f"API Key loaded: {os.getenv('OPENAI_API_KEY')[:10]}...")
- Test API key with minimal curl request
  curl https://api.openai.com/v1/models \
    -H "Authorization: Bearer $OPENAI_API_KEY"
Validation Failures
Problem: Can't understand why an invariant failed
Debug Strategy:
- Print the full response before validation
  print(f"Raw response: {response}")
  print(f"Expected: {invariant.spec}")
- Add detailed error messages with context
  return ValidationResult(
      invariant_name="JSON Schema",
      passed=False,
      error_message=f"Schema validation failed at path '{path}': expected {expected_type}, got {actual_type}",
      details={
          "expected": expected,
          "actual": actual,
          "raw_response": response[:200]  # First 200 chars
      }
  )
- Create minimal reproduction test
  def test_minimal_repro():
      validator = JSONSchemaValidator()
      response = '{"age": "twenty-five"}'  # Exact failing response
      result = validator.validate(response, spec)
      print(f"Failed: {result.error_message}")
Performance Issues
Problem: Tests running too slowly
Profile with cProfile:
import cProfile
import pstats
profiler = cProfile.Profile()
profiler.enable()
# Run your test suite
runner.run_suite(suite)
profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10) # Top 10 slowest functions
Common Bottlenecks:
- Sequential API calls → Parallelize with asyncio
- Loading schema repeatedly → Cache schema in memory (see the sketch below)
- Large responses → Stream instead of loading full response
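For the schema-reloading bottleneck, a small sketch using `functools.lru_cache` so each schema file is read and parsed only once per process:
```python
import json
from functools import lru_cache

@lru_cache(maxsize=None)
def load_schema(schema_path: str) -> dict:
    """Read and parse a JSON Schema once; later calls hit the in-memory cache."""
    with open(schema_path, "r") as f:
        return json.load(f)
```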
7.3 Performance Traps
Memory Issues
Problem: Loading all responses into memory for large test suites
Bad:
responses = [llm_client.complete(prompt) for test in tests] # Loads all at once
Good:
for test in tests:
response = llm_client.complete(test.prompt)
result = validate(response)
save_result(result) # Process and save incrementally
del response # Free memory
Cost Optimization
Problem: Racking up API bills during development
Solution: Response Caching
import hashlib
import json
from pathlib import Path
class CachingLLMClient:
def __init__(self, client, cache_dir=Path(".cache")):
self.client = client
self.cache_dir = cache_dir
self.cache_dir.mkdir(exist_ok=True)
def complete(self, prompt: str):
# Generate cache key from prompt
key = hashlib.md5(prompt.encode()).hexdigest()
cache_file = self.cache_dir / f"{key}.json"
# Check cache
if cache_file.exists():
with open(cache_file) as f:
return LLMResponse(**json.load(f))
# Call API
response = self.client.complete(prompt)
# Save to cache
with open(cache_file, 'w') as f:
json.dump(response.__dict__, f)
return response
8. Extensions & Challenges
8.1 Beginner Extensions
Extension 1: CSV Export Format
What: Add ability to export results as CSV for Excel analysis
Why: Non-technical stakeholders may prefer spreadsheets
How: Use Python's csv module to write results row by row
import csv
def export_to_csv(report: TestReport, output_path: Path):
with open(output_path, 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['Test ID', 'Category', 'Passed', 'Latency (ms)', 'Failed Invariants'])
for result in report.results:
failed_inv = ', '.join(v.invariant_name for v in result.validation_results if not v.passed)
writer.writerow([
result.test_case.id,
result.test_case.category,
result.passed,
result.latency_ms,
failed_inv
])
Challenge: Add a --format csv flag to CLI
Extension 2: Email Notifications for Regression
What: Send email alert when regression detected
Why: Get notified immediately without checking dashboard
How: Use smtplib to send email via SMTP server
import smtplib
from email.mime.text import MIMEText
def send_regression_alert(report: TestReport, config: dict):
if not report.regression_info or not report.regression_info['regression_detected']:
return
body = f"""
Regression detected in {report.suite_name}!
Success rate dropped: {report.regression_info['score_delta']:.1%}
New failures: {', '.join(report.regression_info['new_failures'])}
See report: {config['report_url']}
"""
msg = MIMEText(body)
msg['Subject'] = f"Prompt Regression Alert: {report.suite_name}"
msg['From'] = config['from_email']
msg['To'] = config['to_email']
with smtplib.SMTP(config['smtp_server']) as server:
server.send_message(msg)
Challenge: Make email template customizable with Jinja2
Extension 3: Pre-built Validator Library
What: Create validators for common patterns (email, phone, PII)
Why: Don't reinvent the wheel for standard checks
How: Use regex and libraries like phonenumbers, email-validator
import re
from email_validator import validate_email
class EmailValidator(Validator):
"""Detects email addresses in response"""
def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
found_emails = re.findall(email_pattern, response)
should_contain = spec.get('should_contain', False)
passed = (len(found_emails) > 0) == should_contain
return ValidationResult(
invariant_name="Email Detection",
passed=passed,
error_message=f"Found {len(found_emails)} email(s): {found_emails}" if not passed else None,
details={"emails_found": found_emails}
)
Challenge: Create validators for: SSN, credit card numbers, phone numbers, addresses
8.2 Intermediate Extensions
Extension 4: LLM-as-a-Judge Validators
What: Use another LLM to evaluate subjective qualities (tone, helpfulness)
Why: Some invariants can't be captured by regex or schemas
How: Call an LLM with a structured prompt to grade responses
class ToneValidator(Validator):
"""Uses LLM to evaluate tone"""
def __init__(self, judge_client: LLMClient):
self.judge = judge_client
def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
expected_tone = spec['expected_tone'] # e.g., "professional and empathetic"
judge_prompt = f"""
Evaluate if the following response has a {expected_tone} tone.
Response: "{response}"
Return only a JSON object:
{{
"score": <0.0-1.0>,
"reasoning": "<brief explanation>"
}}
"""
judge_response = self.judge.complete(judge_prompt)
result = json.loads(judge_response.text)
threshold = spec.get('min_score', 0.7)
passed = result['score'] >= threshold
return ValidationResult(
invariant_name=f"Tone: {expected_tone}",
passed=passed,
error_message=f"Tone score {result['score']:.2f} below threshold {threshold}. Reasoning: {result['reasoning']}" if not passed else None,
details=result
)
Challenge: Create LLM judges for: factual accuracy, relevance, completeness
Extension 5: Parametric Sweeps
What: Test across multiple models, temperatures, and prompts
Why: Find the optimal configuration for your use case
How: Add nested loops to test all combinations
def parametric_sweep(suite: TestSuite, sweep_config: dict) -> List[TestReport]:
"""Run tests across parameter grid"""
reports = []
for model in sweep_config['models']:
for temp in sweep_config['temperatures']:
for prompt_version in sweep_config['prompt_versions']:
client = LLMClient(model=model, temperature=temp)
runner = TestRunner(client)
# Modify suite with prompt version
versioned_suite = apply_prompt_version(suite, prompt_version)
report = runner.run_suite(versioned_suite)
report.metadata = {
'model': model,
'temperature': temp,
'prompt_version': prompt_version
}
reports.append(report)
return reports
# Usage
sweep_config = {
'models': ['gpt-4', 'gpt-3.5-turbo', 'claude-3-opus'],
'temperatures': [0.0, 0.3, 0.7],
'prompt_versions': ['v1', 'v2', 'v3']
}
results = parametric_sweep(suite, sweep_config)
Challenge: Create heatmap visualization showing model vs temperature performance
Extension 6: Web UI for Results
What: Interactive dashboard to explore test results
Why: Easier than opening HTML files, supports filtering/sorting
How: Use Flask or FastAPI to serve results from a database
from flask import Flask, render_template, jsonify
import sqlite3
app = Flask(__name__)
@app.route('/')
def dashboard():
return render_template('dashboard.html')
@app.route('/api/runs')
def get_runs():
conn = sqlite3.connect('results.db')
runs = conn.execute('SELECT * FROM test_runs ORDER BY timestamp DESC LIMIT 50').fetchall()
return jsonify([dict(run) for run in runs])
@app.route('/api/run/<run_id>')
def get_run_details(run_id):
conn = sqlite3.connect('results.db')
results = conn.execute('SELECT * FROM test_results WHERE run_id = ?', (run_id,)).fetchall()
return jsonify([dict(r) for r in results])
Challenge: Add real-time updates using WebSockets as tests run
8.3 Advanced Extensions
Extension 7: CI/CD Integration
What: Run the harness in GitHub Actions/Jenkins on every commit
Why: Catch regressions before merging to main
How: Create a GitHub Actions workflow
# .github/workflows/prompt-tests.yml
name: Prompt Tests
on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: 3.9
- name: Install dependencies
run: |
pip install -r requirements.txt
- name: Run prompt tests
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: |
python harness.py test prompts/*.yaml --format junit --min-score 95
- name: Upload results
if: always()
uses: actions/upload-artifact@v2
with:
name: test-results
path: reports/
Challenge: Add step to post test results as PR comment
Extension 8: A/B Testing Framework
What: Compare two prompt versions statistically
Why: Know with confidence which prompt performs better
How: Run both prompts on the same test set, compute significance
import numpy as np
from scipy import stats
def ab_test(prompt_a: str, prompt_b: str, test_set: TestSuite) -> dict:
"""Run A/B test between two prompts"""
# Run tests with both prompts
runner_a = TestRunner(LLMClient(prompt=prompt_a))
runner_b = TestRunner(LLMClient(prompt=prompt_b))
results_a = [r.passed for r in runner_a.run_suite(test_set).results]
results_b = [r.passed for r in runner_b.run_suite(test_set).results]
# Statistical significance test (paired t-test)
t_stat, p_value = stats.ttest_rel(results_a, results_b)
# Effect size (Cohen's d)
mean_diff = np.mean(results_a) - np.mean(results_b)
pooled_std = np.sqrt((np.std(results_a)**2 + np.std(results_b)**2) / 2)
cohens_d = mean_diff / pooled_std
return {
'prompt_a_score': np.mean(results_a),
'prompt_b_score': np.mean(results_b),
'p_value': p_value,
'significant': p_value < 0.05,
'effect_size': cohens_d,
'recommendation': 'Use Prompt B' if np.mean(results_b) > np.mean(results_a) and p_value < 0.05 else 'Use Prompt A'
}
Challenge: Add Bayesian A/B testing for continuous monitoring
Extension 9: Statistical Significance Testing
What: Don't just report score changes, report whether they're statistically significant
Why: Avoid false positives from random variance
How: Use hypothesis testing
from typing import List

from scipy import stats

def is_regression_significant(current: List[bool], baseline: List[bool], alpha: float = 0.05) -> dict:
"""
Test if current performance is significantly worse than baseline
Uses McNemar's test for paired binary data
"""
# Build contingency table
both_pass = sum(c and b for c, b in zip(current, baseline))
current_fail_baseline_pass = sum(not c and b for c, b in zip(current, baseline))
current_pass_baseline_fail = sum(c and not b for c, b in zip(current, baseline))
both_fail = sum(not c and not b for c, b in zip(current, baseline))
# McNemar's test
statistic = (current_fail_baseline_pass - current_pass_baseline_fail)**2 / (current_fail_baseline_pass + current_pass_baseline_fail)
p_value = 1 - stats.chi2.cdf(statistic, df=1)
return {
'statistically_significant': p_value < alpha,
'p_value': p_value,
'contingency_table': {
'both_pass': both_pass,
'regression_cases': current_fail_baseline_pass,
'improvement_cases': current_pass_baseline_fail,
'both_fail': both_fail
}
}
Challenge: Implement sequential testing for early stopping when regression is detected
Extension 10: Distributed Test Execution
What: Run tests across multiple machines for large suites
Why: Scale to 1000+ test cases
How: Use Celery or Ray for distributed task execution
from dataclasses import asdict

from celery import Celery
app = Celery('prompt_harness', broker='redis://localhost:6379')
@app.task
def run_single_test(test_case_data: dict, llm_config: dict) -> dict:
"""Execute single test case (runs on worker)"""
test_case = TestCase(**test_case_data)
client = LLMClient(**llm_config)
runner = TestRunner(client)
# Run just this one test
result = runner.run_single_test(test_case)
    return asdict(result)  # dataclass -> plain dict so Celery can serialize it
def run_distributed_suite(suite: TestSuite, llm_config: dict) -> TestReport:
"""Distribute test execution across workers"""
# Submit all tests as async tasks
tasks = [
        run_single_test.delay(asdict(tc), llm_config)
for tc in suite.test_cases
]
# Collect results as they complete
results = [task.get() for task in tasks]
# Aggregate into report
return aggregate_results(results)
Challenge: Add dynamic worker scaling based on queue depth
9. Real-World Connections
9.1 Industry Applications
OpenAI Evals
OpenAI's internal evaluation framework follows the same contract-based testing pattern. They maintain a large repository of evals for different capabilities:
# Example from OpenAI Evals
{
"id": "math.addition",
"dataset": "math_problems.jsonl",
"scorer": "exact_match",
"samples": 1000
}
Key Insight: Even OpenAI needs systematic testing to ensure model updates don't regress on critical tasks.
LangSmith (LangChain)
LangChain's testing platform implements:
- Automatic dataset collection from production traffic
- LLM-as-a-judge evaluation
- Regression tracking across model versions
- A/B testing for prompt variants
Used by: Companies like Zapier, Notion, Robinhood for production LLM monitoring
Anthropic Workbench
Claude's evaluation infrastructure:
- Constitutional AI alignment testing
- Safety evals (jailbreak attempts, bias detection)
- Capability benchmarks (coding, math, reasoning)
Key Insight: Testing isn't just about correctness; it's also about safety and alignment.
Microsoft Prompt Flow
Azure's prompt engineering toolkit includes:
- Visual prompt DAG editor
- Evaluation metrics (relevance, groundedness, coherence)
- Integration with Azure ML for model deployment
Used by: Enterprise customers who need compliance-ready AI systems
9.2 Related Open Source Projects
PromptFoo
https://github.com/promptfoo/promptfoo
What it does:
- LLM evaluation framework focused on red-teaming and security
- Supports multiple providers (OpenAI, Anthropic, local models)
- Automated adversarial testing
Key Feature: Built-in vulnerability detection for prompt injection, jailbreaks, PII leaks
When to use: Security-focused evaluation, especially for customer-facing chatbots
OpenAI Evals
https://github.com/openai/evals
What it does:
- OpenAI's official eval framework
- Large library of pre-built evals for common tasks
- Standardized format for sharing benchmarks
Key Feature: Community-contributed evals for niche domains
When to use: Benchmarking new models, contributing to open eval datasets
Giskard
https://github.com/Giskard-AI/giskard
What it does:
- ML testing library with LLM support
- Automated test generation
- Metamorphic testing (related inputs should produce predictably related outputs)
Key Feature: Automatic discovery of failure modes
When to use: QA testing for ML models, finding edge cases automatically
LangCheck
https://github.com/citadel-ai/langcheck
What it does:
- Simple evaluation metrics for LLM outputs
- Covers factual consistency, toxicity, fluency, etc.
- No API calls required for some metrics
Key Feature: Lightweight, can run offline
When to use: Quick quality checks without additional LLM calls
9.3 Interview Relevance
This project prepares you for common AI engineering interview questions:
Question 1: "How do you test non-deterministic systems?"
Strong Answer: "I test invariants rather than exact outputs. For example, instead of expecting a specific response, I verify that:
- The output is valid JSON matching a schema
- It cites at least one source document
- It doesnโt contain PII
- It responds within 500ms
I use temperature=0 for deterministic tests (format, required fields) and higher temperatures with multiple samples for subjective qualities (tone, creativity), aggregating scores across runs."
Question 2: "How do you prevent regression in prompt engineering?"
Strong Answer: "I treat prompts as code: version controlled, tested, and monitored. Before deploying a prompt change:
- Run it against a golden dataset of edge cases
- Compare results to the baseline version
- Check if success rate dropped below threshold (e.g., 5%)
- If regression detected, either fix the prompt or update test cases if the change was intentional
- Track metrics over time to detect gradual degradation
This is analogous to regression testing in software, but adapted for probabilistic systems."
Question 3: "What's the difference between unit tests and integration tests for LLMs?"
Strong Answer: "Unit tests validate individual components in isolation:
- Validator logic (does the JSON schema validator work correctly?)
- Prompt templating (does input substitution work?)
- Response parsing (can we extract structured data?)
Integration tests validate the full pipeline:
- Prompt → LLM → Validation → Business Logic
- These tests use mocked LLM responses to be fast and deterministic
- Real LLM tests are "E2E tests" and run less frequently due to cost/latency
The key is isolating what you're testing: code behavior vs. model behavior."
Question 4: "How would you evaluate a summarization model?"
Strong Answer: "I'd use a multi-layered approach:
- Deterministic Checks (fast, cheap):
- Length within bounds (50-200 words)
- No PII leakage
- Valid format (bullet points if required)
- Reference-Based Metrics (medium cost):
- ROUGE score against human-written summaries
- Factual consistency (all facts in summary appear in source)
- LLM-as-a-Judge (slower, higher cost):
- Coherence (does it read naturally?)
- Relevance (captures key points?)
- Conciseness (no redundancy?)
- Human Evaluation (highest quality, most expensive):
- Random sample (5%) reviewed by domain experts
- Track inter-rater reliability
Different use cases prioritize different metrics. A legal summary prioritizes factual consistency; a news summary prioritizes coherence.โ
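The deterministic layer is the cheapest to automate; a rough sketch, assuming PII detection is a simple regex pass (a production system would use a dedicated detector):
import re

# Crude illustrative patterns only; not a complete PII detector.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # US SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),    # email address
]

def deterministic_summary_checks(summary: str, min_words: int = 50, max_words: int = 200) -> dict:
    words = len(summary.split())
    return {
        "length_ok": min_words <= words <= max_words,
        "no_pii": not any(p.search(summary) for p in PII_PATTERNS),
        "has_bullets": summary.lstrip().startswith(("-", "*", "•")),
    }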
10. Resources
10.1 Essential Reading
Books
| Title | Author | Relevant Chapters | Why It Matters |
|---|---|---|---|
| "Site Reliability Engineering" | Beyer, Jones, Petoff & Murphy (Google) | Ch. 4 (Service Level Objectives) | Learn to apply SLO/SLA thinking to AI systems |
| "Clean Code" | Robert C. Martin | Ch. 9 (Unit Tests) | Master test design principles |
| "AI Engineering" | Chip Huyen | Ch. 5 (Model Development and Offline Evaluation) | Industry best practices for LLM evaluation |
| "Designing Data-Intensive Applications" | Martin Kleppmann | Ch. 4 (Encoding and Evolution) | Understand schema evolution and compatibility |
| "The Pragmatic Programmer" | Hunt & Thomas | Ch. 7 (Test-Driven Development) | Learn TDD principles applicable to prompts |
| "Release It!" | Michael T. Nygard | Ch. 5 (Stability Patterns) | Error handling and resilience patterns |
| "Software Testing" | Ron Patton | Ch. 7 (Regression Testing) | Deep dive into regression detection |
| "Code Complete" | Steve McConnell | Ch. 8 (Defensive Programming) | Invariants and assertions |
Papers
- "Language Models are Few-Shot Learners" (GPT-3 paper) - Section on evaluation methodology
- "Constitutional AI: Harmlessness from AI Feedback" (Anthropic) - LLM-as-a-judge pattern
- "Evaluating Large Language Models Trained on Code" (Codex paper) - Pass@k metrics
10.2 Video Resources
- "Building Reliable LLM Applications" - Chip Huyen (YouTube)
- Covers evaluation pipelines, monitoring, failure modes
- Link: Search YouTube for "Chip Huyen LLM Applications"
- "OpenAI Evals Deep Dive" - OpenAI Developer Day
- How OpenAI evaluates their models internally
- Link: openai.com/events/developer-day
- "Prompt Engineering Guide" - promptingguide.ai
- Interactive tutorials on prompting techniques
- Free, comprehensive, regularly updated
10.3 Tools & Documentation
Testing Frameworks
- pytest: https://pytest.org
- Python's de facto testing framework
- Rich plugin ecosystem (pytest-mock, pytest-asyncio)
- JSON Schema: https://json-schema.org
- Standard for defining JSON structure
- Validators available in all languages
- Pydantic: https://docs.pydantic.dev
- Data validation using Python type annotations
- Generates JSON schemas automatically
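For example, a Pydantic (v2) model mirroring the support-response contract can emit the JSON schema that the json_schema validator consumes; the field names follow the sample files in the appendix:
from pydantic import BaseModel, Field

class SupportResponse(BaseModel):
    answer: str = Field(min_length=10)
    confidence: float = Field(ge=0, le=1)
    citations: list[str] = Field(min_length=1)

# Pydantic v2: returns a JSON Schema dict you could dump to
# prompts/schemas/support_response.json
schema = SupportResponse.model_json_schema()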
LLM APIs
- OpenAI API: https://platform.openai.com/docs
- Official docs for GPT-4, GPT-3.5
- Parameter reference (temperature, top_p, etc.)
- Anthropic API: https://docs.anthropic.com
- Claude 3 family documentation
- Prompt engineering best practices
CLI Tools
- Click: https://click.palletsprojects.com
- Python CLI framework
- Decorator-based, easy to use
- Rich: https://rich.readthedocs.io
- Beautiful terminal output
- Progress bars, tables, syntax highlighting
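A minimal sketch of how the two combine for the harness CLI; the run command, flag names, and placeholder results are illustrative rather than prescribed by this project:
import click
from rich.console import Console
from rich.table import Table

console = Console()

@click.command()
@click.argument("suite", type=click.Path(exists=True))
@click.option("--model", default="gpt-4o-mini", help="Model to test against (illustrative default).")
def run(suite: str, model: str) -> None:
    """Run a test suite and print a color-coded summary table."""
    # Placeholder results; the real harness would load SUITE and call the LLM.
    results = [("simple_refund_query", True), ("refund_outside_window", False)]
    table = Table(title=f"{suite} on {model}")
    table.add_column("Test")
    table.add_column("Status")
    for test_id, passed in results:
        table.add_row(test_id, "[green]PASS[/green]" if passed else "[red]FAIL[/red]")
    console.print(table)

if __name__ == "__main__":
    run()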
10.4 Related Projects in This Series
Next Project
Project 2: JSON Output Enforcer (builds on validation concepts)
- Implements self-repair loop for malformed JSON
- Teaches reliability patterns for production systems
- Prerequisite: Complete Project 1 first
Related Projects
Project 7: Temperature Sweeper
- Parametric evaluation across temperature settings
- Uses the harness from this project as a foundation
- Adds statistical analysis of variance
Project 5: RAG Quality Evaluator
- Evaluates retrieval-augmented generation systems
- Reuses validator architecture from this project
- Adds retrieval-specific metrics (citation accuracy, groundedness)
10.5 Community Resources
- r/PromptEngineering (Reddit) - Community discussing techniques
- LangChain Discord - Active community for LLM application builders
- EleutherAI Discord - Open-source LLM research community
- Anthropic Forum - Official Anthropic community
11. Self-Assessment Checklist
Understanding
- I can explain what a prompt contract is without looking at notes
- Test yourself: Define it out loud, then write a contract for a new use case
- I understand the difference between testing at temp=0 vs temp=0.7
- Test yourself: When would you use each? Give 3 examples per temperature
- I can design invariants for a new use case
- Test yourself: Pick a use case (e.g., email classifier), define 5 invariants
- I know when to use deterministic vs probabilistic testing
- Test yourself: Categorize these checks: JSON validity, creativity, tone, citation presence
- I can explain error budgets in the context of LLM systems
- Test yourself: Your SLO is 99% accuracy. How many failures are allowed per 1000 requests?
Implementation
- All functional requirements are met
- Test loader works with YAML/JSON
- Test runner executes against real LLM
- At least 3 validator types implemented
- Reports generated in HTML and JSON
- Test harness runs against real LLM API
- Successfully tested with OpenAI or Anthropic
- Handles rate limits gracefully
- Retries on transient failures
- Reports are clear and actionable
- Failed tests clearly show what went wrong
- Recommendations provided for fixing issues
- HTML report is readable by non-technical stakeholders
- Regression detection works correctly
- Can compare two test runs
- Correctly identifies new failures
- Flags score drops above threshold
- Code is production-ready
- Error handling for edge cases
- Logging for debugging
- Tests for core components
- Documentation (README, docstrings)
Growth
- I can identify when to use deterministic vs probabilistic testing
- Application: Design a test strategy for a new LLM feature
- I've documented lessons learned
- What surprised you during implementation?
- What would you do differently next time?
- What patterns emerged that youโll reuse?
- I can explain this project in a job interview
- Practice: Explain in 2 minutes: problem, solution, results, learnings
- I understand how this applies to production systems
- How would you integrate this into a CI/CD pipeline?
- How would you monitor prompt performance in production?
- What metrics would you track over time?
- I can extend this to new domains
- Pick a different domain (e.g., code generation, SQL queries)
- What invariants would you test?
- What new validators would you need?
12. Submission / Completion Criteria
Minimum Viable Completion
To consider this project "complete" at a basic level, you must:
- Can load test suite from YAML file
- Parses all fields correctly (id, input, invariants)
- Handles malformed YAML gracefully with clear error messages
- Can execute tests against LLM API
- Successfully calls OpenAI or Anthropic API
- Captures response and metadata (latency, tokens)
- Implements at least 3 validator types
- JSONSchema, Contains, Length (minimum set)
- Each validator returns a structured ValidationResult (see the sketch after this list)
- Generates basic text report
- Console output shows pass/fail for each test
- Summary statistics (total, passed, failed, success rate)
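A sketch of two of the minimum validators and one reasonable ValidationResult shape; the dataclass fields are an assumption, not a prescribed interface:
from dataclasses import dataclass

@dataclass
class ValidationResult:
    validator: str
    passed: bool
    message: str = ""

def validate_contains(output: str, spec: dict) -> ValidationResult:
    needle = spec["substring"]
    ok = needle in output
    return ValidationResult("contains", ok, "" if ok else f"missing substring {needle!r}")

def validate_length(output: str, spec: dict) -> ValidationResult:
    ok = spec.get("min", 0) <= len(output) <= spec.get("max", float("inf"))
    return ValidationResult("length", ok, f"length={len(output)}")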
Proof of Completion:
- Screenshot of CLI output showing test run
- Sample test YAML file
- Code walkthrough explaining validator architecture
Full Completion
All minimum criteria plus:
- HTML report generation
- Interactive report with drill-down capability
- Clear visualization of failures
- Exportable/shareable
- Regression detection between runs
- Can compare current run to baseline
- Identifies newly broken test cases
- Flags significant score drops
- Rate limiting and error handling
- Exponential backoff for retries (see the sketch after this list)
- Graceful handling of API failures
- Timeout protection
- CLI with rich formatting
- Color-coded output (pass=green, fail=red)
- Progress bars during execution
- Clear help text and usage examples
- Unit tests for core components
- Test coverage >70% for validators, loader, runner
- Integration tests with mocked LLM
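One way to satisfy the backoff requirement, sketched with plain time.sleep and jitter rather than a retry library; the set of retryable exceptions is an assumption and should match your actual API client:
import random
import time

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0, max_delay: float = 30.0):
    """Call fn(); on a retryable error, wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except (TimeoutError, ConnectionError):   # assumed retryable errors
            if attempt == max_retries - 1:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, delay / 2))   # jitter avoids thundering-herd retries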
Proof of Completion:
- Public GitHub repository with code
- README with setup instructions and examples
- HTML report sample
- Passing test suite
Excellence (Going Above & Beyond)
All full completion criteria plus any 2+ of:
- JSON export for CI/CD integration
- JUnit XML format for test runners
- Structured JSON for custom pipelines
- GitHub Actions workflow example
- LLM-as-a-Judge validators
- Subjective quality evaluation (tone, helpfulness)
- Configurable judge prompts
- Aggregation across multiple judge runs
- Parametric sweeps across models/temps
- Test same suite with multiple configurations
- Heatmap visualization of results
- Automated recommendation of best config
- Statistical significance testing (see the McNemar sketch after this list)
- Not just score comparison, but confidence intervals
- McNemar's test for regression significance
- Effect size calculation (Cohen's d)
- Production-ready features
- Response caching to reduce costs
- Distributed execution for large suites
- Web dashboard for result exploration
- Slack/email notifications
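A sketch of the McNemar check using statsmodels (assumed to be installed); it asks whether the tests that flipped between the baseline and candidate runs represent more than chance variation:
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_regression_test(baseline: dict, candidate: dict, alpha: float = 0.05) -> dict:
    """baseline/candidate map test_id -> passed (bool); only shared test ids are compared."""
    ids = baseline.keys() & candidate.keys()
    # 2x2 contingency table: rows = baseline pass/fail, columns = candidate pass/fail
    table = [[0, 0], [0, 0]]
    for test_id in ids:
        table[0 if baseline[test_id] else 1][0 if candidate[test_id] else 1] += 1
    result = mcnemar(table, exact=True)   # exact binomial test on the discordant cells
    return {
        "pass_to_fail": table[0][1],
        "fail_to_pass": table[1][0],
        "p_value": result.pvalue,
        "significant_regression": result.pvalue < alpha and table[0][1] > table[1][0],
    }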
Proof of Completion:
- Blog post explaining your implementation and learnings
- Video demo showing advanced features
- Public deployment (e.g., web UI hosted on Vercel/Railway)
- Contribution to open-source eval framework (PromptFoo, LangSmith)
Appendix: Sample Files
Example Test Suite (prompts/support_agent.yaml)
name: "Customer Support Evals"
version: "1.3.0"
description: "Test suite for customer support chatbot prompt"
prompt_template: |
You are a helpful customer support agent. Use the provided policy documents to answer questions.
Always cite your sources using [doc_id] format.
Policy Documents:
{context}
Customer Query: {input}
Respond in JSON format:
{{
"answer": "your response here",
"confidence": 0.0-1.0,
"citations": ["doc_id_1", "doc_id_2"]
}}
test_cases:
# Refund Queries
- id: "simple_refund_query"
category: "Refund"
input: "I want to return my order #12345"
context: |
[policy_doc_1] Refund Policy: Customers can return items within 30 days of purchase for a full refund.
[policy_doc_2] Order #12345 was placed on 2024-12-15 (12 days ago).
invariants:
- type: "json_schema"
name: "Valid JSON Schema"
spec:
schema_file: "prompts/schemas/support_response.json"
- type: "contains"
name: "Has Citation"
spec:
substring: "policy_doc"
- type: "contains"
name: "Contains Order ID"
spec:
substring: "#12345"
- type: "regex"
name: "Polite Language"
spec:
pattern: "(please|thank you|happy to help)"
flags: "IGNORECASE"
- id: "refund_outside_window"
category: "Refund"
input: "Can I return something I bought 6 months ago?"
context: |
[policy_doc_1] Refund Policy: Customers can return items within 30 days of purchase.
invariants:
- type: "json_schema"
name: "Valid JSON Schema"
spec:
schema_file: "prompts/schemas/support_response.json"
- type: "contains"
name: "Mentions Time Limit"
spec:
substring: "30"
- type: "contains"
name: "Suggests Alternative"
spec:
substring: "warranty" # or exchange, or other options
- id: "ambiguous_policy_query"
category: "Policy"
input: "What's your return policy?"
context: |
[policy_doc_1] Refund Policy: Customers can return items within 30 days.
[policy_doc_2] Items must be in original packaging with tags attached.
invariants:
- type: "json_schema"
name: "Valid JSON Schema"
spec:
schema_file: "prompts/schemas/support_response.json"
- type: "contains"
name: "Has Citation"
spec:
substring: "policy_doc"
- type: "length"
name: "Detailed Response"
spec:
min: 100
max: 500
# Technical Support
- id: "password_reset"
category: "Technical"
input: "I forgot my password, how do I reset it?"
context: |
[help_doc_1] To reset your password:
1. Click "Forgot Password" on login page
2. Enter your email
3. Check your inbox for reset link
invariants:
- type: "json_schema"
name: "Valid JSON Schema"
spec:
schema_file: "prompts/schemas/support_response.json"
- type: "contains"
name: "Step-by-Step"
spec:
substring: "1."
- type: "contains"
name: "Cites Help Doc"
spec:
substring: "help_doc"
JSON Schema (prompts/schemas/support_response.json)
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"answer": {
"type": "string",
"minLength": 10,
"description": "The support agent's response"
},
"confidence": {
"type": "number",
"minimum": 0,
"maximum": 1,
"description": "Confidence score for the answer"
},
"citations": {
"type": "array",
"items": {
"type": "string",
"pattern": "^(policy_doc|help_doc)_\\d+$"
},
"minItems": 1,
"description": "List of cited document IDs"
}
},
"required": ["answer", "confidence", "citations"],
"additionalProperties": false
}
This comprehensive guide was generated from PROMPT_ENGINEERING_PROJECTS.md. For the complete learning path and other projects, see the parent directory.