Project 1: Prompt Contract Harness
A prompt test report with pass/fail results by invariant, trend deltas, and a release recommendation.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | See main guide estimates (typically 3-8 days except capstone) |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript, Go |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | 4. The Open Core Infrastructure |
| Knowledge Area | PromptOps / Testing |
| Software or Tool | CLI harness + validators + reports |
| Main Book | Site Reliability Engineering (Google) |
| Concept Clusters | Prompt Contracts and Output Typing; Evaluation, Rollouts, and Governance |
1. Learning Objectives
By completing this project, you will:
- Translate prompt requirements into explicit contracts with typed fields, semantic invariants, and failure envelopes.
- Implement deterministic checks around probabilistic model outputs using a layered validation pipeline.
- Design fixture suites that stratify test cases by risk label, difficulty, and business impact.
- Measure quality using reproducible eval artifacts with trend deltas across prompt revisions.
- Build release gates that compute promotion recommendations from configurable pass-rate thresholds and critical-failure counts.
- Document failure modes with machine-readable reason codes that drive automated retry, escalation, and rollback decisions.
2. All Theory Needed (Per-Concept Breakdown)
Concept A: Contract and Invariant Design for LLM Outputs
Fundamentals
A prompt contract is a formal specification of what a model output must look like and must contain for the output to be considered correct. It is the equivalent of an API contract or a database schema, but applied to the non-deterministic boundary between your application and a language model. Without a contract, you cannot distinguish between a valid response and one that merely looks plausible. Contracts define three things: the structural shape (required fields, types, nesting), the semantic invariants (business rules that must hold regardless of input), and the failure envelope (how invalid outputs are represented so that downstream systems can route them deterministically). In production LLM systems, the contract is the single source of truth for what “correct” means. Every validator, every metric, every release gate derives its authority from the contract. If the contract is vague or informal, your entire test harness inherits that ambiguity, and you cannot distinguish a regression from normal variance.
Deep Dive into the concept
Contract design for LLM outputs draws on three traditions: API contract testing from service-oriented architecture, property-based testing from functional programming, and schema evolution from data engineering.
From API contract testing, you inherit the idea that producers and consumers agree on an interface independently of implementation. In the LLM context, the “producer” is the model and its prompt, and the “consumer” is whatever downstream system parses the output. Consumer-driven contracts are especially valuable: the downstream team specifies exactly which fields they read, and the contract test verifies only those fields. This prevents over-specification, where your tests break because the model added a helpful-but-unexpected field, and under-specification, where critical fields are missing but no test catches it.
From property-based testing, you inherit the idea of invariants: properties that must hold across all valid inputs, not just the specific examples in your fixture suite. For a customer support classifier, an invariant might be “if the input contains a suicide-risk keyword, the output MUST set escalation_required to true regardless of confidence score.” Invariants encode safety boundaries, regulatory requirements, and business-critical logic. They are more powerful than example-based tests because they generalize: a fixture suite of 120 cases cannot cover every input, but an invariant like “output.confidence must be between 0.0 and 1.0” applies to all future inputs without needing new fixtures.
From data engineering schema evolution, you inherit the idea that contracts change over time and those changes must be managed. Adding an optional field to the output is a backward-compatible change: old consumers ignore it. Removing a required field or changing a field type is a breaking change: old consumers crash. Your harness must detect breaking changes by running the new contract against the old fixture baseline and reporting incompatibilities before deployment.
The failure envelope is the most underappreciated part of contract design. When an output fails validation, the harness must produce a structured failure object that includes: which check failed (schema, semantic, policy), the specific field or invariant that was violated, the severity (critical vs degraded), a machine-readable reason code (SCHEMA_FAIL, SEMANTIC_DRIFT, POLICY_BLOCK, ABSTAIN_OK), and a routing decision (retry, escalate, dead-letter). Without structured failure objects, debugging regressions becomes a manual log-reading exercise. With them, your monitoring dashboard can aggregate failure patterns automatically and your retry logic can make cost-aware decisions (e.g., retry schema failures but escalate policy blocks immediately).
Output typing is the mechanism that makes contracts enforceable at the code level. Each expected output gets a type definition (in Pydantic, Zod, or equivalent) that the validator instantiates from the raw model response. If instantiation fails, the failure is structural. If it succeeds but invariant checks fail, the failure is semantic. This two-phase validation (parse then check) is critical because it prevents a common trap: trying to evaluate semantic properties of a response that did not even parse correctly, which produces confusing error messages that conflate structural and semantic issues.
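The two-phase flow described above can be sketched in Python. This is a minimal stdlib-only illustration (a real harness would use Pydantic or Zod, which enforce field types on instantiation, as noted above); the `SupportTicket` type, `validate` function, and the specific safety keyword are hypothetical names chosen for the example.

```python
import json
from dataclasses import dataclass

# Illustrative contract type; a production harness would use Pydantic/Zod,
# which also enforce field types on instantiation (dataclasses do not).
@dataclass
class SupportTicket:
    category: str
    confidence: float
    escalation_required: bool
    summary: str

CATEGORIES = {"billing", "technical", "account", "safety"}
FIELD_TYPES = {"category": str, "confidence": float,
               "escalation_required": bool, "summary": str}

def validate(raw: str, input_text: str) -> dict:
    # Phase 1: output typing (structural). Any parse failure is SCHEMA_FAIL.
    try:
        data = json.loads(raw)
        for field, ftype in FIELD_TYPES.items():
            if not isinstance(data[field], ftype):
                raise TypeError(f"bad type for {field}")
        ticket = SupportTicket(**{f: data[f] for f in FIELD_TYPES})
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        return {"status": "FAIL", "reason_code": "SCHEMA_FAIL", "detail": str(e)}
    # Phase 2: semantic invariants, evaluated only on successfully parsed outputs.
    if ticket.category not in CATEGORIES or not 0.0 <= ticket.confidence <= 1.0:
        return {"status": "FAIL", "reason_code": "SEMANTIC_DRIFT",
                "violated_field": "category/confidence"}
    if "end it all" in input_text and not ticket.escalation_required:
        return {"status": "FAIL", "reason_code": "POLICY_BLOCK",
                "violated_field": "escalation_required"}
    return {"status": "SUCCESS", "reason_code": None}
```

Note how structural and semantic failures get distinct reason codes: a caller can route a SCHEMA_FAIL to retry while escalating a POLICY_BLOCK immediately.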
A well-designed contract for a support ticket classifier might look like this:
```yaml
SupportTicketContract:
  required_fields:
    - category: enum[billing, technical, account, safety]
    - confidence: float, range [0.0, 1.0]
    - escalation_required: bool
    - summary: string, min_length 10, max_length 500
  invariants:
    - IF input contains safety keywords THEN escalation_required == true
    - IF confidence < 0.3 THEN routing_action == "human_review"
    - category != null (no abstention on classification)
  failure_envelope:
    - reason_code: enum[SCHEMA_FAIL, SEMANTIC_DRIFT, POLICY_BLOCK, ABSTAIN_OK]
    - severity: enum[critical, degraded, info]
    - violated_field: string
    - routing: enum[retry, escalate, dead_letter, pass]
  version: "1.3.0"
  backward_compatible_with: "1.2.x"
```
Versioning contracts is non-negotiable. Every contract change gets a semantic version. The harness runs compatibility checks by validating the new contract definition against the baseline fixture dataset from the previous version. If any previously-passing case now fails, the change is flagged as potentially breaking and requires explicit approval before promotion.
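The compatibility check described above can be sketched as follows. `check_compatibility` and `new_validate` are hypothetical names; the example models a breaking change from the schema-evolution discussion (tightening `confidence` from float to integer).

```python
import json

# Sketch of the compatibility gate: re-run the new contract's validator over
# the fixtures that passed under the previous version, and flag the change as
# breaking if any previously-passing case now fails.
def check_compatibility(new_validate, baseline_passing):
    """baseline_passing: list of (case_id, raw_output) pairs that passed
    under the old contract version."""
    regressions = [case_id for case_id, raw in baseline_passing
                   if new_validate(raw)["status"] != "SUCCESS"]
    return {"compatible": not regressions, "newly_failing": regressions}

# Hypothetical new contract that requires "confidence" to be an integer --
# a breaking change, since existing fixtures carry float values.
def new_validate(raw):
    data = json.loads(raw)
    ok = isinstance(data.get("confidence"), int)
    return {"status": "SUCCESS" if ok else "FAIL"}

report = check_compatibility(new_validate, [("case-1", '{"confidence": 0.85}')])
# report["compatible"] is False: the float fixture no longer passes.
```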
How this fits into the project
This concept is the foundation of Project 1. You will define contracts for your fixture suite, implement the validation pipeline that enforces them, build the failure envelope that routes results, and version your contracts so that changes are tracked across harness runs.
Definitions & key terms
- Prompt contract: A formal, versioned specification of required fields, types, value constraints, and semantic invariants that a model output must satisfy.
- Invariant: A property that must hold true across all valid outputs, regardless of input. Invariants encode safety boundaries and business rules.
- Failure envelope: A structured error object that includes the reason code, severity, violated field, and routing decision for a failed validation.
- Reason code: A machine-readable enumeration value (e.g., SCHEMA_FAIL, POLICY_BLOCK) that enables automated routing of failures.
- Consumer-driven contract: A contract specified by the downstream consumer of model outputs, ensuring tests verify exactly what the consumer needs.
- Schema evolution: The practice of managing contract changes with semantic versioning and backward-compatibility checks.
- Output typing: Translating raw model text into a typed data structure (via Pydantic, Zod, etc.) as the first validation step.
Mental model diagram (ASCII)
+---------------------------+
| CONTRACT DEFINITION |
| (versioned, typed, with |
| invariants + envelope) |
+-------------+-------------+
|
+-------------------+-------------------+
| | |
v v v
+----------------+ +----------------+ +------------------+
| PHASE 1: PARSE | | PHASE 2: CHECK | | PHASE 3: ROUTE |
| Output Typing | | Invariants | | Failure Envelope |
| raw text -> | | semantic rules | | reason_code -> |
| typed object | | safety gates | | retry/escalate/ |
| (Pydantic/Zod) | | business logic | | dead-letter/pass |
+-------+--------+ +-------+--------+ +--------+---------+
| | |
v v v
SCHEMA_FAIL if SEMANTIC_DRIFT if Routing decision
parse fails invariant violated per failure type
| | |
+--------------------+---------------------+
|
v
+---------------------------+
| VALIDATION RESULT |
| status + reason_code + |
| severity + trace_id |
+---------------------------+
How it works (step-by-step, with invariants and failure modes)
- Load the contract definition from a versioned config file. The contract specifies required fields with types, value constraints (enums, ranges, min/max lengths), semantic invariants (conditional rules), and the failure envelope schema. Invariant: the contract file itself must parse without errors before any test case runs. Failure mode: malformed contract config halts the harness with a clear error pointing to the config line.
- Load the fixture suite containing test cases with input prompts, context data, expected output shapes, and risk labels. Invariant: every fixture case must have a unique ID, a risk label, and an expected outcome category. Failure mode: duplicate case IDs or missing required fixture fields cause a suite validation error before any model calls.
- Execute the prompt for each fixture case (or load pre-recorded responses for deterministic replay). Attach a trace ID, timestamp, prompt version, and contract version to each execution.
- Phase 1 - Output typing: Attempt to parse the raw model response into the typed structure defined by the contract. If parsing fails (missing required fields, type mismatches, malformed JSON), emit a SCHEMA_FAIL result with the specific parse error. Do NOT attempt semantic checks on unparseable outputs. Invariant: a SCHEMA_FAIL always includes the specific field that caused the parse error.
- Phase 2 - Invariant checking: For successfully parsed outputs, evaluate every semantic invariant defined in the contract. Check conditional rules (if X then Y), range constraints, cross-field consistency, and policy gates. If any invariant fails, emit a SEMANTIC_DRIFT or POLICY_BLOCK result depending on severity. Invariant: policy-level invariants (safety, compliance) are always severity “critical” and cannot be downgraded.
- Phase 3 - Failure routing: Based on the reason code and severity, determine the routing action: retry (for transient schema failures), escalate (for policy blocks), dead-letter (for unrepairable outputs), or pass (for successful validations). Invariant: critical failures never route to retry; they always escalate or dead-letter.
- Aggregate results across all fixture cases. Compute pass rates by category, trend deltas against the previous baseline, and the release recommendation (PROMOTE, PROMOTE_WITH_CANARY, HOLD, ROLLBACK).
- Persist artifacts: Write per-case traces, the summary report, and the release recommendation to the output directory. Invariant: every artifact includes the contract version, prompt version, and run timestamp for reproducibility.
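The Phase 3 routing step can be sketched as a small lookup with the critical-failure invariant enforced in code. The table values are illustrative policy choices, not a prescribed configuration.

```python
# Sketch of reason-code-driven routing. The (reason_code, severity) table is
# an example policy; a real harness would load it from the contract config.
ROUTING = {
    ("SCHEMA_FAIL", "degraded"): "retry",
    ("SEMANTIC_DRIFT", "degraded"): "dead_letter",
    ("POLICY_BLOCK", "critical"): "escalate",
    ("ABSTAIN_OK", "info"): "pass",
}

def route(reason_code, severity):
    action = ROUTING.get((reason_code, severity), "dead_letter")
    # Invariant from the steps above: critical failures never route to retry;
    # they always escalate or dead-letter.
    if severity == "critical" and action == "retry":
        action = "escalate"
    return action
```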
Minimal concrete example
Fixture case (YAML):
```yaml
- id: "ticket-042"
  risk_label: "high"
  prompt: "Classify this support ticket: 'I want to cancel everything and end it all'"
  context: { customer_tier: "premium" }
  expected:
    category: "safety"
    escalation_required: true
```
Validation trace (JSON):
```json
{
  "trace_id": "run-20260115-042",
  "case_id": "ticket-042",
  "contract_version": "1.3.0",
  "prompt_version": "support-v7",
  "phase1_parse": "OK",
  "phase2_invariants": [
    { "rule": "safety_keyword_escalation", "status": "PASS" },
    { "rule": "confidence_range", "status": "PASS", "value": 0.94 }
  ],
  "phase3_routing": "PASS",
  "final_status": "SUCCESS",
  "reason_code": null
}
```
Release summary:
```json
{
  "total_cases": 120,
  "pass_rate": 0.983,
  "critical_failures": 0,
  "schema_failures": 1,
  "semantic_drifts": 1,
  "policy_blocks": 0,
  "trend_delta": "+0.8% vs baseline",
  "recommendation": "PROMOTE_WITH_CANARY"
}
```
Common misconceptions
- “If the JSON parses, the output is correct.” Structural validity (schema pass) says nothing about semantic correctness. A classifier that returns `{"category": "billing", "confidence": 0.99}` for a safety-critical input passes schema validation but fails the safety invariant. Contract testing requires both phases.
- “More fixture cases always means better coverage.” A suite of 1,000 easy cases provides worse coverage than 100 cases stratified across risk labels, edge cases, and failure modes. Fixture quality matters more than fixture quantity.
- “Retrying always helps.” Retrying a schema failure (missing bracket) often works because the error is random. Retrying a semantic failure (wrong classification) rarely helps because the model consistently misinterprets the input. Your routing logic must distinguish these cases.
- “Contracts are write-once.” Contracts evolve as requirements change, new failure modes are discovered, and downstream consumers add new field dependencies. Without versioning and compatibility checks, contract changes silently break production systems.
- “A single pass rate is sufficient for release decisions.” Aggregate pass rate hides critical failures. A 98% pass rate with 2 policy-block failures on safety cases is worse than a 95% pass rate with all failures in low-risk categories. Release gates must check both aggregate and per-severity metrics.
Check-your-understanding questions
- Why should output typing (parsing into a typed structure) happen before invariant checking, rather than checking everything at once?
- What is the difference between a consumer-driven contract and a producer-driven contract, and why does it matter for prompt testing?
- A contract change adds an optional field “suggested_action” to the output schema. Is this backward-compatible? What about changing the “confidence” field from float to integer?
- Your fixture suite has 120 cases and achieves 99% pass rate. The 1 failure is a POLICY_BLOCK on a safety case. Should the release gate recommend PROMOTE? Why or why not?
- How do reason codes enable automated retry logic that is more cost-effective than blind retries?
Check-your-understanding answers
- If you check invariants on unparseable output, you get confusing errors like “confidence must be in range [0,1]” when the real problem is that the entire JSON was malformed. Separating phases means SCHEMA_FAIL errors point to structural issues and SEMANTIC_DRIFT errors point to logic issues, making debugging faster and routing decisions clearer.
- A consumer-driven contract is specified by the team that reads the model output, ensuring tests verify exactly what they need. A producer-driven contract is specified by the team that writes the prompt. Consumer-driven contracts prevent over-specification (testing fields nobody reads) and under-specification (missing fields that consumers depend on). For prompt testing, consumer-driven contracts are preferred because downstream breakage is the actual risk.
- Adding an optional field is backward-compatible: old consumers ignore it, old fixtures still pass. Changing confidence from float to integer is a BREAKING change: existing fixtures with values like 0.85 will fail parsing, and downstream systems expecting float division will break. The harness should flag this during compatibility checks.
- No. A POLICY_BLOCK on a safety case is a critical failure. The release gate should recommend HOLD or ROLLBACK regardless of aggregate pass rate. The safety invariant exists precisely to prevent promotion when safety-critical cases fail.
- Reason codes let retry logic make targeted decisions: retry SCHEMA_FAIL (likely transient), escalate POLICY_BLOCK (model consistently wrong on this class), dead-letter SEMANTIC_DRIFT with low confidence (unlikely to improve with retries). Blind retries waste tokens on failures that will not improve, while reason-code-driven routing retries only the cases with a realistic chance of improvement.
Real-world applications
- Customer support platforms: Companies like Intercom and Zendesk use contract-like validation to ensure AI-generated responses meet tone requirements, include required disclaimers, and correctly route escalation-worthy tickets before sending them to customers.
- Financial document extraction: Banks and fintech firms validate extracted data (amounts, dates, account numbers) against strict schemas with cross-field invariants (e.g., total must equal sum of line items) before feeding results into accounting systems.
- Healthcare triage systems: Medical AI assistants must satisfy invariants like “never recommend medication dosage without flagging for physician review” and “always escalate if symptom combination matches emergency pattern.” These are encoded as contract invariants that block promotion if any case fails.
- Compliance and legal review: Contract testing ensures that AI-generated legal summaries include mandatory clauses, do not omit risk disclosures, and flag ambiguous language for human review.
- E-commerce search and recommendation: Product classification models must satisfy category taxonomy constraints, price range validity, and availability status invariants before results reach the search index.
Where you’ll apply it
- In every phase of this project: contract definition (Phase 1), validation pipeline (Phase 2), and release gate computation (Phase 3).
- Reused in the capstone project (P18) where multiple contracts from different projects are composed into a unified validation pipeline.
References
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 (Encoding and Evolution) covers schema evolution, backward/forward compatibility, and contract versioning strategies.
- “Site Reliability Engineering” by Google - Ch. 4-6 covers SLOs, error budgets, and monitoring that directly inform release gate design.
- “AI Engineering” by Chip Huyen - Chapters on evaluation and testing for LLM systems.
- OpenAI Structured Outputs documentation: https://developers.openai.com/api/docs/guides/structured-outputs
- Anthropic Claude Structured Outputs: https://platform.claude.com/docs/en/build-with-claude/structured-outputs
- promptfoo project for declarative prompt evaluation: https://github.com/promptfoo/promptfoo
- NIST AI Risk Management Framework (AI RMF) for safety and governance taxonomy.
Key insights
A prompt contract converts the question “does this output look right?” into the testable question “does this output satisfy every structural, semantic, and policy invariant in the versioned contract?” That conversion is what makes quality measurable.
Summary
Contract and invariant design is the foundation of reliable prompt systems. A contract specifies the structural shape (via output typing), semantic rules (via invariants), and failure handling (via the failure envelope with reason codes). Versioning contracts enables safe evolution. Separating parse-phase and check-phase validation produces clear diagnostics. The contract is the single source of truth for every downstream component: validators, metrics, release gates, and retry logic.
Homework/Exercises to practice the concept
- Design a contract for a product review summarizer. The model takes a product review and outputs a structured summary. Define at least 5 required fields with types and constraints, 3 semantic invariants, and the failure envelope. Include at least one cross-field invariant (where the validity of one field depends on another).
- Classify these contract changes as backward-compatible or breaking:
- Adding an optional “tags” array field
- Changing “rating” from integer (1-5) to float (0.0-5.0)
- Removing the “deprecated_field” that no consumer reads
- Renaming “category” to “product_category”
- Adding a new enum value to an existing enum field
- Write a failure routing table. Given reason codes SCHEMA_FAIL, SEMANTIC_DRIFT, POLICY_BLOCK, and ABSTAIN_OK, define the routing action (retry, escalate, dead-letter, pass), max retry count, and escalation target for each. Justify why POLICY_BLOCK should never route to retry.
- Trace a fixture case through the full validation pipeline. Given a raw model response that has valid JSON but violates a safety invariant, write out the step-by-step trace showing what happens at each phase (parse, check, route) including the final failure object.
Solutions to the homework/exercises
- A strong contract includes fields like `summary` (string, 20-200 chars), `sentiment` (enum: positive/negative/mixed), `rating_mentioned` (bool), `key_themes` (array of strings, 1-5 items), and `confidence` (float, 0.0-1.0). Invariants: if `sentiment` is “negative” and `rating_mentioned` is true, the `summary` must reference the issue; `confidence` < 0.3 must route to human review; `key_themes` must not be empty. Cross-field: if `rating_mentioned` is false, any mention of star ratings in `summary` is a SEMANTIC_DRIFT. The failure envelope follows the standard reason code enum with severity mapping.
- Adding optional “tags”: backward-compatible. Changing the “rating” type: BREAKING (existing fixtures with integer values may still parse, but consumers expecting integer semantics will break; the range change is also a semantic change). Removing the unused field: backward-compatible IF it is verified that no consumer reads it (which requires a consumer-driven contract audit). Renaming “category”: BREAKING (all consumers referencing the old field name will fail). Adding an enum value: backward-compatible for producers, potentially breaking for consumers with exhaustive switch statements.
- SCHEMA_FAIL: retry (max 2), because structural errors are often transient (the model forgot a bracket). SEMANTIC_DRIFT: dead-letter if confidence is low; retry once if confidence is high (the model may self-correct with a reminder). POLICY_BLOCK: escalate immediately to human review, NEVER retry, because policy violations indicate the model fundamentally misunderstands the safety constraint, and retrying wastes tokens while leaving the unsafe output in the pipeline longer. ABSTAIN_OK: pass (abstention is a valid response when the model correctly identifies it cannot answer safely).
- Phase 1 (Parse): the raw JSON parses successfully into a typed object; all required fields are present and types match. Result: PARSE_OK. Phase 2 (Check): the invariant “safety_keyword_escalation” evaluates the input, finds a safety keyword present, checks the `escalation_required` field, and finds it set to `false`. The invariant FAILS. Severity: critical. Phase 3 (Route): reason_code = POLICY_BLOCK, severity = critical, routing = ESCALATE. Final failure object: `{"trace_id": "...", "case_id": "...", "status": "FAIL", "reason_code": "POLICY_BLOCK", "severity": "critical", "violated_invariant": "safety_keyword_escalation", "violated_field": "escalation_required", "routing": "ESCALATE"}`.
Concept B: Evaluation Regression Testing and Release Discipline
Fundamentals
Regression testing for LLM outputs means running the same fixture suite against every prompt revision and comparing results to a known-good baseline. Unlike traditional software where tests are deterministic (same input always produces same output), LLM outputs are probabilistic: the same prompt and input can produce different outputs across runs. This fundamental difference means that regression testing for prompts must use statistical methods rather than exact-match comparisons. You cannot assert that the output equals a specific string; instead, you assert that the pass rate for each invariant category stays within an acceptable range compared to the baseline. Release discipline builds on regression testing by defining formal gates that determine whether a prompt revision should be promoted to production, held for review, or rolled back. These gates use configurable thresholds (e.g., “critical failures must be zero,” “overall pass rate must be >= 95%,” “no category may regress by more than 2%”) to convert regression test results into an automated promotion decision.
Deep Dive into the concept
Production LLM systems face a unique regression risk: prompt changes that improve one category of outputs often degrade another. A prompt revision that increases accuracy on billing questions might simultaneously reduce accuracy on technical questions because the new phrasing biases the model toward financial interpretation. This phenomenon, sometimes called “prompt whack-a-mole,” is why per-category regression tracking is essential, not just aggregate pass rate.
The regression testing pipeline has five stages:
Stage 1: Baseline establishment. Run the current production prompt against the full fixture suite and record per-case results, per-category pass rates, and per-invariant pass rates. This baseline is your reference point. Store it as a versioned artifact alongside the prompt version and contract version.
Stage 2: Candidate evaluation. Run the candidate prompt (the one you want to deploy) against the same fixture suite with the same seed (for reproducibility) and the same contract. Record results in the same format as the baseline.
Stage 3: Regression analysis. Compare candidate results to baseline results. Compute: overall pass rate delta, per-category pass rate deltas, per-invariant pass rate deltas, new failures (cases that passed in baseline but fail in candidate), and resolved failures (cases that failed in baseline but pass in candidate). Flag any category where pass rate decreased by more than the configurable regression threshold.
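The Stage 3 comparison can be sketched as a pure function over per-case result maps. This is a minimal illustration; `regression_analysis` is a hypothetical name, and results are modeled as `{case_id: (category, passed)}`.

```python
# Sketch of Stage 3: per-category pass-rate deltas plus new/resolved failures.
def regression_analysis(baseline, candidate, threshold=0.02):
    """baseline/candidate: dicts of {case_id: (category, passed_bool)}."""
    def rates(results):
        totals, passes = {}, {}
        for category, passed in results.values():
            totals[category] = totals.get(category, 0) + 1
            passes[category] = passes.get(category, 0) + int(passed)
        return {c: passes[c] / totals[c] for c in totals}

    base_rates, cand_rates = rates(baseline), rates(candidate)
    deltas = {c: cand_rates[c] - base_rates.get(c, 0.0) for c in cand_rates}
    return {
        "deltas": deltas,
        # Categories that regressed beyond the configurable threshold.
        "regressed": [c for c, d in deltas.items() if d < -threshold],
        # Cases that passed in baseline but fail in candidate.
        "new_failures": [cid for cid in candidate
                         if not candidate[cid][1]
                         and baseline.get(cid, (None, False))[1]],
        # Cases that failed in baseline but pass in candidate.
        "resolved": [cid for cid in candidate
                     if candidate[cid][1]
                     and cid in baseline and not baseline[cid][1]],
    }
```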
Stage 4: Release gate evaluation. Apply the promotion policy to the regression analysis:
- If critical_failures > 0: ROLLBACK (never promote with safety failures)
- If any_category_regressed > threshold: HOLD (investigate before promoting)
- If overall_pass_rate < minimum: HOLD
- If overall_pass_rate >= minimum AND no_regressions AND critical_failures == 0: PROMOTE
- If overall_improved but minor_regressions exist: PROMOTE_WITH_CANARY (deploy to a small percentage of traffic first)
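The Stage 4 policy above can be sketched as a single function that evaluates the rules in priority order; the threshold defaults mirror the example policy and are illustrative, not prescriptive.

```python
# Sketch of the Stage 4 release gate. Rules evaluate in priority order:
# critical failures veto everything, then large regressions, then thresholds.
def release_gate(critical_failures, pass_rate, category_deltas,
                 min_pass_rate=0.95, max_regression=0.02):
    if critical_failures > 0:
        return "ROLLBACK"          # never promote with safety failures
    regressed = [c for c, d in category_deltas.items() if d < -max_regression]
    if regressed or pass_rate < min_pass_rate:
        return "HOLD"              # investigate before promoting
    minor = [c for c, d in category_deltas.items() if -max_regression <= d < 0]
    if minor:
        return "PROMOTE_WITH_CANARY"  # small regressions: canary first
    return "PROMOTE"
```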
Stage 5: Canary monitoring. If the release gate recommends PROMOTE_WITH_CANARY, deploy the candidate prompt to a small traffic slice (typically 5-10%) and monitor live metrics for a burn-in period (typically 1-24 hours). Compare live pass rates to the offline evaluation. If live metrics match or exceed offline predictions, promote to full traffic. If live metrics degrade, roll back automatically.
Trend reporting is the mechanism that makes regression testing useful over time. Each harness run produces a summary artifact. Trend reports aggregate these summaries across the last N runs (or the last N prompt versions) and visualize pass rate trajectories per category and per invariant. Trend reports surface slow degradation that individual run comparisons miss: a 0.5% regression per revision is invisible in any single comparison but represents a 5% degradation over 10 revisions.
The SRE concept of error budgets applies directly to prompt release discipline. Define an error budget for each invariant category (e.g., “safety invariants may fail on at most 0.1% of cases per month”). When the error budget is exhausted, all prompt changes to that category are frozen until the failure rate recovers. This creates a natural feedback loop: teams that ship risky prompt changes burn their error budget and lose the ability to ship more changes until quality recovers.
Fixture suite design is critical for regression testing quality. A fixture suite must be stratified across multiple dimensions: risk level (high/medium/low), input difficulty (simple/complex/adversarial), output category (each classification label), and edge case type (boundary values, ambiguous inputs, multi-label inputs). Without stratification, your pass rate is dominated by easy cases and masks failures on hard cases. A well-stratified suite of 120 cases provides better regression signal than an unstratified suite of 1,000.
Fixture Suite Stratification:
```text
Risk Level:   High (20%)     Medium (40%)   Low (40%)
Difficulty:   Simple (33%)   Complex (33%)  Adversarial (33%)
Categories:   billing (25%)  technical (25%)  account (25%)  safety (25%)
Edge Cases:   At least 10% of suite must be known-hard cases

Minimum viable suite: 120 cases
Each cell in the stratification matrix should have >= 3 cases
```
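A suite-quality check for these stratification targets can be sketched as follows; `check_stratification` is a hypothetical helper, and cases are modeled as dicts with `risk_label` and `category` keys.

```python
from collections import Counter

# Sketch of a fixture-suite quality gate: enforce the minimum suite size and
# the "each risk/category cell has >= 3 cases" rule from the spec above.
def check_stratification(cases, min_size=120, min_per_cell=3):
    issues = []
    if len(cases) < min_size:
        issues.append(f"suite too small: {len(cases)} < {min_size}")
    cells = Counter((c["risk_label"], c["category"]) for c in cases)
    for cell, count in cells.items():
        if count < min_per_cell:
            issues.append(f"cell {cell} has only {count} cases")
    return issues  # empty list means the suite passes the check
```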
How this fits into the project
This concept drives the regression testing, trend reporting, and release gate components of Project 1. You will build baseline comparison logic, per-category regression detection, release gate computation, and trend artifact generation.
Definitions & key terms
- Regression testing (LLM): Running the same fixture suite against prompt revisions and comparing results to a baseline using statistical thresholds rather than exact-match assertions.
- Baseline: The known-good result set from the current production prompt version, used as the reference point for regression comparison.
- Release gate: A policy-driven decision point that evaluates regression test results against configurable thresholds to recommend PROMOTE, HOLD, or ROLLBACK.
- Canary deployment: Deploying a prompt revision to a small percentage of traffic before full rollout, with automated rollback if live metrics degrade.
- Error budget: The maximum allowable failure rate for an invariant category over a time period; when exhausted, changes are frozen.
- Trend delta: The difference in pass rate between the current run and the baseline (or the previous N runs), used to detect slow degradation.
- Fixture stratification: Designing the test suite so that cases are distributed across risk levels, difficulty tiers, and output categories to prevent easy-case bias.
Mental model diagram (ASCII)
PROMPT REVISION LIFECYCLE
+-------------------+
| Author writes |
| new prompt v8 |
+--------+----------+
|
v
+-------------------+ +-------------------+
| Run fixture suite |------->| Load baseline |
| with candidate v8 | | from prod v7 |
+--------+----------+ +--------+----------+
| |
v v
+---------------------------------------------------+
| REGRESSION ANALYSIS |
| |
| Overall pass rate: v8=97.5% v7=96.7% delta=+0.8%|
| Safety category: v8=100% v7=100% delta=0% |
| Billing category: v8=95% v7=97% delta=-2% |
| New failures: 2 cases (billing edge cases) |
| Resolved failures: 4 cases (technical improvements)|
| Critical failures: 0 |
+------------------------+----------------------------+
|
v
+---------------------------------------------------+
| RELEASE GATE EVALUATION |
| |
| Rule 1: critical_failures == 0? YES |
| Rule 2: any_category_regressed > 2%? YES (billing)|
| Rule 3: overall_pass_rate >= 95%? YES |
| |
| Decision: HOLD (billing regression needs review) |
+---------------------------------------------------+
|
v
Author investigates billing regression, fixes prompt,
submits v8.1, re-runs harness...
How it works (step-by-step, with invariants and failure modes)
- Load the baseline artifact from the previous production run. Invariant: the baseline must exist and must reference the same contract version (or a compatible version). Failure mode: missing baseline triggers a first-run mode where no regression comparison is possible, and the gate can only evaluate absolute thresholds.
- Run the candidate evaluation against the full fixture suite. Invariant: the candidate run uses the same fixture suite version as the baseline. Failure mode: fixture suite version mismatch triggers a warning that regression comparisons may be unreliable.
- Compute per-category pass rates for both candidate and baseline. Invariant: every fixture case belongs to exactly one category. Failure mode: uncategorized cases are flagged as a suite quality issue.
- Compute regression deltas by subtracting baseline rates from candidate rates per category. Invariant: negative deltas represent regressions. Failure mode: if the fixture suite changed between runs, delta comparison is invalid and must be reported as such.
- Identify new failures and resolved failures by comparing per-case results. This shows exactly which cases regressed and which improved, enabling targeted investigation.
- Evaluate release gate rules in priority order: critical failures first (absolute veto), then per-category regressions, then aggregate thresholds. Invariant: critical failure check always runs first and cannot be overridden by aggregate metrics.
- Generate the release recommendation (PROMOTE, PROMOTE_WITH_CANARY, HOLD, ROLLBACK) with a human-readable justification that cites the specific rules that triggered the decision.
- Persist the trend artifact that includes this run’s summary alongside the previous N run summaries, enabling trend visualization.
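The priority-ordered gate evaluation above can be sketched as a small function. A minimal sketch: the function name, default thresholds, and justification strings are illustrative, not part of the project's actual configuration schema. Note the inclusive threshold comparison; whether a regression of exactly the threshold triggers HOLD is a deliberate design decision (the exercises in this section explore it).

```python
def evaluate_gates(critical_failures, category_deltas, overall_pass_rate,
                   max_category_regression=0.02, min_overall_pass_rate=0.95):
    """Return (recommendation, justification), evaluating rules in priority order."""
    # Rule 1: critical failures are an absolute veto; no aggregate metric overrides them.
    if critical_failures > 0:
        return "ROLLBACK", f"{critical_failures} critical failure(s)"
    # Rule 2: per-category regressions (inclusive comparison; see lead-in note).
    regressed = sorted(c for c, d in category_deltas.items()
                       if d < 0 and -d >= max_category_regression)
    if regressed:
        return "HOLD", f"category regression at/over threshold: {regressed}"
    # Rule 3: aggregate threshold, checked last.
    if overall_pass_rate < min_overall_pass_rate:
        return "HOLD", f"overall pass rate {overall_pass_rate:.1%} below minimum"
    # Minor, within-threshold regressions still warrant a canary.
    if any(d < 0 for d in category_deltas.values()):
        return "PROMOTE_WITH_CANARY", "healthy overall, minor regressions present"
    return "PROMOTE", "no regressions, all thresholds met"
```

Evaluating Rule 1 first guarantees that a safety regression can never be masked by an improved aggregate pass rate.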
Minimal concrete example
Release gate configuration (YAML):
gates:
  critical_failures_max: 0
  min_overall_pass_rate: 0.95
  max_category_regression: 0.02
  min_fixture_suite_size: 100
  canary_traffic_pct: 0.05
  canary_burn_in_hours: 4
recommendations:
  ROLLBACK: "critical_failures > 0"
  HOLD: "any_category_regressed > max_category_regression"
  PROMOTE_WITH_CANARY: "overall_improved AND minor_regressions"
  PROMOTE: "no_regressions AND pass_rate >= min"
Trend artifact (JSON Lines):
{"run_id": "run-005", "prompt_version": "v5", "pass_rate": 0.943, "critical": 0, "date": "2026-01-10"}
{"run_id": "run-006", "prompt_version": "v6", "pass_rate": 0.951, "critical": 0, "date": "2026-01-12"}
{"run_id": "run-007", "prompt_version": "v7", "pass_rate": 0.967, "critical": 0, "date": "2026-01-15"}
{"run_id": "run-008", "prompt_version": "v8", "pass_rate": 0.975, "critical": 0, "date": "2026-01-18"}
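Slow drift across a trend artifact like the one above can be checked in a few lines. A minimal sketch, assuming the JSON Lines fields shown (the function name and window size are illustrative):

```python
import json

def cumulative_drift(jsonl_text, window=10):
    """Pass-rate change from the oldest to the newest run in the window.
    Catches slow degradation that no single run-over-run delta would flag."""
    runs = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    recent = runs[-window:]
    return recent[-1]["pass_rate"] - recent[0]["pass_rate"]

trend = "\n".join([
    '{"run_id": "run-005", "pass_rate": 0.943}',
    '{"run_id": "run-006", "pass_rate": 0.951}',
    '{"run_id": "run-007", "pass_rate": 0.967}',
    '{"run_id": "run-008", "pass_rate": 0.975}',
])
drift = cumulative_drift(trend)  # positive: the suite improved over the window
```

An alert on cumulative drift (rather than only run-over-run deltas) is what turns the trend artifact into an early-warning signal.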
Common misconceptions
- “A higher overall pass rate always means the revision is better.” Not if the improvement comes from easy cases while hard cases regressed. Per-category analysis is essential.
- “Manual QA is a substitute for automated regression testing.” Manual QA cannot run 120+ cases on every revision, compare to baselines statistically, or catch slow degradation over time. It complements but does not replace automated testing.
- “Canary deployments are optional for prompt changes.” Prompt changes can have subtle, input-dependent effects that offline fixtures do not cover. Canary deployment catches these live-traffic regressions before they affect all users.
- “Once the release gate passes, monitoring can stop.” Post-deployment monitoring must continue because real-world input distributions shift over time (concept drift), and a prompt that passed regression testing last week may degrade as user behavior changes.
- “Flapping release gates mean the thresholds are wrong.” Sometimes flapping indicates genuine instability in the prompt (it performs differently on borderline cases depending on random seed). The fix is to improve the prompt, not to loosen the thresholds.
Check-your-understanding questions
- Why must release gates check critical failures separately from aggregate pass rate?
- How does fixture suite stratification prevent a false sense of quality from high aggregate pass rates?
- What is the purpose of a canary burn-in period, and what metrics should trigger automatic rollback during the burn-in?
- Explain why trend reporting across multiple runs catches problems that single-run regression analysis misses.
- How do error budgets create incentives for teams to prioritize prompt quality?
Check-your-understanding answers
- A 98% aggregate pass rate with 2 critical (safety) failures is much worse than a 95% pass rate with 0 critical failures. Aggregate pass rate averages across all categories, hiding the concentrated impact of a few critical failures. Checking critical failures first ensures that safety-relevant regressions always block promotion regardless of overall metrics.
- Without stratification, easy cases dominate the suite and inflate the pass rate. If 80% of cases are simple and the model gets 100% on those but only 50% on hard cases, the aggregate is 90%, which looks acceptable. Stratification ensures each difficulty level, risk tier, and category has meaningful representation, so regressions on hard cases are visible in the per-category breakdown.
- The canary burn-in period tests the prompt revision against live traffic that may differ from the fixture suite. Metrics that should trigger rollback: critical failure rate exceeding zero, overall pass rate dropping below the offline prediction by more than a configurable margin, latency or cost per request exceeding baseline by more than a threshold, or user-reported quality complaints spiking during the burn-in window.
- Single-run regression analysis compares revision N to revision N-1. If each revision introduces a 0.5% regression in one category, no single comparison triggers the alarm (0.5% is below the 2% threshold). But the trend report shows that category has regressed 5% over 10 revisions, which is a significant degradation. Trend reporting catches slow drift that per-run comparisons miss.
- Error budgets give teams a finite amount of acceptable failures per time period. If a team ships a risky prompt change that consumes their error budget, they cannot ship more changes until the failure rate recovers. This creates a natural incentive to invest in testing and quality before shipping, because burning the budget means losing deployment velocity.
Real-world applications
- Google’s SRE practices apply error budgets to service reliability: when the error budget for a service is exhausted, feature deployments freeze until reliability recovers. The same principle applies to prompt deployments.
- Continuous evaluation platforms like Braintrust, LangSmith, and promptfoo implement regression testing pipelines that compare prompt versions against baseline datasets with configurable thresholds.
- Enterprise AI governance requires audit trails showing that every prompt revision passed a formal release gate before reaching production, with the specific gate rules and evaluation results documented.
- Regulated industries (healthcare, finance, legal) require documented evidence that model behavior was tested against a representative dataset and that safety-critical invariants were verified before deployment.
Where you’ll apply it
- Building the baseline comparison logic and regression analysis engine in Phase 2.
- Implementing the release gate computation in Phase 3.
- Generating trend artifacts and release recommendation reports throughout.
References
- “Site Reliability Engineering” by Google - Ch. 4 (Service Level Objectives) and Ch. 31 (Communication and Collaboration in SRE) for error budget and release gate patterns.
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 (Encoding and Evolution) for versioning and compatibility strategies.
- “AI Engineering” by Chip Huyen - Chapters on evaluation, monitoring, and the evaluation-deployment feedback loop.
- “(Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs” - IEEE/ACM CAIN 2024 proceedings.
- DeepEval framework documentation for LLM regression testing patterns.
- Confident AI blog on LLM testing strategies: https://www.confident-ai.com/blog/llm-testing-in-2024-top-methods-and-strategies
Key insights Regression testing for prompts is not about proving the new version is perfect; it is about proving it is not worse than the current version in any critical dimension, and that any improvements do not come at the cost of regressions in other dimensions.
Summary Evaluation regression testing and release discipline form the operational backbone of prompt contract testing. The pipeline compares candidate prompt results to a known-good baseline across per-category metrics, applies configurable release gates with critical-failure vetoes, and generates trend reports that catch slow degradation. Canary deployments bridge the gap between offline evaluation and live-traffic behavior. Error budgets create organizational incentives for quality.
Homework/Exercises to practice the concept
- Design a release gate configuration for a customer support classifier with four categories (billing, technical, account, safety). Define thresholds for: critical failure max, minimum overall pass rate, maximum per-category regression, and canary traffic percentage. Justify each threshold value.
- Analyze a regression scenario. Given baseline pass rates of billing=97%, technical=94%, account=96%, safety=100% and candidate pass rates of billing=95%, technical=97%, account=96%, safety=100%, determine the release gate recommendation. Show your work for each gate rule.
- Design a fixture stratification plan for a medical triage classifier with 5 urgency levels and 3 risk tiers. How many cases does the minimum viable suite need? Which cells in the stratification matrix need the most cases, and why?
- Build a trend analysis. Given 5 consecutive run summaries where the “technical” category pass rate is: 96%, 95.5%, 95%, 94.5%, 94%, explain what the trend reveals and what action should be taken, even though no single run triggered the 2% regression threshold.
Solutions to the homework/exercises
- For a customer support classifier: critical_failures_max = 0 (safety is non-negotiable), min_overall_pass_rate = 0.94 (allowing some variance on complex cases), max_category_regression = 0.02 (2% per category to allow minor variance while catching meaningful regressions), canary_traffic_pct = 0.05 (5% for 4 hours). Safety category should have a separate, stricter threshold of 0.0% regression tolerance. Justification: safety failures are the highest-liability risk, so they get an absolute veto; other categories allow minor regression to enable iterative improvement.
- Billing regressed 2% (97% -> 95%): this equals the max_category_regression threshold of 2%, so it is a borderline trigger. Technical improved 3%. Account unchanged. Safety unchanged. Critical failures = 0. Overall pass rate improved. If the gate rule is “regressed > 2%” (strict greater-than), billing at exactly 2% does NOT trigger HOLD, and the recommendation is PROMOTE_WITH_CANARY (overall improved with minor regression). If the rule is “>= 2%” (greater-or-equal), billing triggers HOLD. This ambiguity highlights the importance of precise threshold definitions.
- 5 urgency levels times 3 risk tiers = 15 cells. Minimum 3 cases per cell = 45 cases, but high-risk/high-urgency cells need more (at least 10 each) because these are the safety-critical cases where regression has the highest impact. Minimum viable suite: approximately 80-100 cases. The highest-urgency, highest-risk cell should have the most cases because failures there have the greatest patient safety impact.
- The trend shows a consistent 0.5% regression per run in the “technical” category. No single run triggers the 2% threshold, but over 5 runs the category has regressed from 96% to 94% (a 2% total drop). This is exactly the pattern that trend reporting is designed to catch. Action: freeze “technical” category prompt changes, investigate what changed across the 5 revisions, and consider reverting to the version that achieved 96%. Set a trend-based alert that triggers when cumulative regression across N runs exceeds a separate threshold.
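The borderline case in solution 2 is easy to verify numerically. A sketch, with one practical caveat worth encoding: round deltas before comparing, because binary floating point can silently flip a strict comparison at the threshold.

```python
baseline = {"billing": 0.97, "technical": 0.94, "account": 0.96, "safety": 1.00}
candidate = {"billing": 0.95, "technical": 0.97, "account": 0.96, "safety": 1.00}

# Round before comparing: 0.95 - 0.97 is not exactly -0.02 in binary floating
# point, and without rounding a strict ">" would trigger spuriously.
deltas = {c: round(candidate[c] - baseline[c], 6) for c in baseline}
threshold = 0.02

strict_hold = [c for c, d in deltas.items() if -d > threshold]      # exactly-2% passes
inclusive_hold = [c for c, d in deltas.items() if -d >= threshold]  # exactly-2% holds
```

Pinning down which comparison the gate uses, and the rounding policy, belongs in the gate configuration rather than in reviewers' heads.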
3. Project Specification
3.1 What You Will Build
A contract-test harness that validates prompt responses against typed invariants and release thresholds.
3.2 Functional Requirements
- Parse a fixture suite with expected structured outcomes and risk labels.
- Run prompts through deterministic post-validators (schema, policy, semantic checks).
- Emit per-case traces with reason codes and final routing action.
- Generate promotion recommendation based on configurable thresholds.
3.3 Non-Functional Requirements
- Performance: 120-case suite completes in under 4 minutes on a laptop baseline.
- Reliability: Same seed and fixtures always produce the same pass/fail summary.
- Security/Policy: Any policy-critical failure marks the run non-promotable.
3.4 Example Usage / Output
$ uv run p01-harness run --suite fixtures/support_tickets.yaml --seed 42 --out out/p01
[INFO] Loaded suite: support_tickets.yaml (120 cases)
[PASS] schema_valid: 120/120
[PASS] policy_safe: 118/120 (2 correctly abstained)
[PASS] escalation_rules: 17/17
[INFO] Release recommendation: PROMOTE_WITH_CANARY
[INFO] Report written: out/p01/report.json
3.5 Data Formats / Schemas / Protocols
- Input fixture YAML with `prompt`, `context`, `expected_outcome`, and `risk_label` fields.
- Output JSON report with summary, per-case results, and release decision.
- Reason codes enum: `SCHEMA_FAIL`, `SEMANTIC_FAIL`, `POLICY_BLOCK`, `ABSTAIN_OK`.
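The reason codes are what make downstream routing deterministic. One possible enum-to-routing mapping, sketched below; the routing targets are illustrative assumptions, not part of the spec:

```python
from enum import Enum

class ReasonCode(Enum):
    SCHEMA_FAIL = "SCHEMA_FAIL"
    SEMANTIC_FAIL = "SEMANTIC_FAIL"
    POLICY_BLOCK = "POLICY_BLOCK"
    ABSTAIN_OK = "ABSTAIN_OK"

# Illustrative routing table (an assumption): malformed or semantically wrong
# output is retryable, a policy violation escalates and is never retried, and
# a correct abstention counts as a pass.
ROUTING = {
    ReasonCode.SCHEMA_FAIL: "RETRY",
    ReasonCode.SEMANTIC_FAIL: "RETRY",
    ReasonCode.POLICY_BLOCK: "ESCALATE",
    ReasonCode.ABSTAIN_OK: "PASS",
}
```

Because the mapping is a plain table, the retry/escalation policy can be audited and changed without touching validator code.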
3.6 Edge Cases
- Fixture contains contradictory expected outcomes.
- Model returns valid JSON with semantically wrong values.
- Case should abstain but model answers directly.
- Trace write fails due to disk permission.
- Schema version mismatch between fixture expectations and current contract.
- Model returns empty string or partial JSON (truncated by token limit).
- Invariant evaluation throws an unexpected exception (e.g., null field access).
3.7 Real World Outcome
This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.
3.7.1 How to Run (Copy/Paste)
$ uv run p01-harness run --suite fixtures/support_tickets.yaml --seed 42 --out out/p01
- Working directory: `project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS`
- Required inputs: project fixtures under `fixtures/`
- Output directory: `out/p01`
3.7.2 Golden Path Demo (Deterministic)
Use the fixed seed already embedded in the command or config profile. You should see stable pass/fail totals between runs.
3.7.3 If CLI: exact terminal transcript
$ uv run p01-harness run --suite fixtures/support_tickets.yaml --seed 42 --out out/p01
[INFO] Loaded suite: support_tickets.yaml (120 cases)
[INFO] Contract version: 1.3.0 | Prompt version: support-v7
[INFO] Baseline loaded: run-007 (v6, 2026-01-15)
[PASS] schema_valid: 120/120
[PASS] policy_safe: 118/120 (2 correctly abstained)
[PASS] escalation_rules: 17/17
[INFO] Regression analysis:
overall: +0.8% vs baseline (97.5% -> 98.3%)
billing: -0.5% (minor, within threshold)
technical: +2.1% (improvement)
safety: 0.0% (stable at 100%)
[INFO] Release recommendation: PROMOTE_WITH_CANARY
[INFO] Report written: out/p01/report.json
[INFO] Trend artifact written: out/p01/trend.jsonl
$ echo $?
0
Failure demo:
$ uv run p01-harness run --suite fixtures/broken_suite.yaml --seed 42 --out out/p01
[ERROR] Suite load failed: missing required field "expected_outcome" at case #9
[HINT] Validate fixture shape with: uv run p01-harness lint-suite fixtures/broken_suite.yaml
$ echo $?
2
Regression failure demo:
$ uv run p01-harness run --suite fixtures/support_tickets.yaml --seed 42 --out out/p01
[INFO] Loaded suite: support_tickets.yaml (120 cases)
[FAIL] policy_safe: 116/120 (1 POLICY_BLOCK on safety case ticket-042)
[WARN] safety category: -1.0% regression (CRITICAL)
[INFO] Release recommendation: ROLLBACK
[INFO] Blocking reason: critical failure on safety invariant (ticket-042)
$ echo $?
1
4. Solution Architecture
4.1 High-Level Design
+-------------------+
| CLI Interface |
| (args, config, |
| seed, paths) |
+--------+----------+
|
+------------+------------+
| |
v v
+-------------------+ +--------------------+
| Suite Loader | | Baseline Loader |
| (parse fixtures, | | (load previous |
| validate shape, | | run artifacts) |
| stratify cases) | +--------+-----------+
+--------+----------+ |
| |
v |
+-------------------+ |
| Validator Pipeline| |
| Phase 1: Parse | |
| Phase 2: Check | |
| Phase 3: Route | |
+--------+----------+ |
| |
v v
+----------------------------------------+
| Regression Analyzer |
| (compare candidate vs baseline, |
| compute per-category deltas) |
+-------------------+--------------------+
|
v
+----------------------------------------+
| Release Gate Engine |
| (evaluate rules, emit recommendation) |
+-------------------+--------------------+
|
v
+----------------------------------------+
| Artifact Writer |
| (report.json, trend.jsonl, traces/) |
+----------------------------------------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Suite Loader | Validates and normalizes fixture inputs, verifies stratification. | Fail fast on malformed fixtures to avoid noisy eval signals. |
| Validator Pipeline | Runs schema parsing, semantic invariant checks, and policy gates in order. | Use deterministic validators after model generation. Separate parse and check phases. |
| Regression Analyzer | Compares candidate results to baseline per category and per invariant. | Compute deltas at the category level, not just aggregate. |
| Release Gate Engine | Applies configurable rules to regression analysis and emits recommendation. | Gate on critical failures before aggregate pass rate. |
| Artifact Writer | Persists per-case traces, summary report, and trend artifact. | Include contract version, prompt version, and run timestamp in every artifact. |
4.3 Data Structures (No Full Code)
FixtureCase:
- case_id: string (unique)
- risk_label: enum[high, medium, low]
- category: string
- prompt: string
- context: dict
- expected_outcome: dict
ValidationResult:
- trace_id: string
- case_id: string
- contract_version: string
- prompt_version: string
- phase1_status: enum[PARSE_OK, SCHEMA_FAIL]
- phase2_results: list[InvariantResult]
- phase3_routing: enum[PASS, RETRY, ESCALATE, DEAD_LETTER]
- final_status: enum[SUCCESS, FAIL]
- reason_code: enum[null, SCHEMA_FAIL, SEMANTIC_FAIL, POLICY_BLOCK, ABSTAIN_OK]
- severity: enum[null, critical, degraded, info]
ReleaseDecision:
- recommendation: enum[PROMOTE, PROMOTE_WITH_CANARY, HOLD, ROLLBACK]
- justification: string
- overall_pass_rate: float
- baseline_delta: float
- category_deltas: dict
- critical_failures: int
- gate_rules_evaluated: list[GateResult]
4.4 Algorithm Overview
Key algorithm: Layered validation with regression-aware release gating
- Normalize input and attach deterministic trace metadata (trace_id, timestamps, versions).
- Phase 1: Parse raw model output into typed structure using contract definition. Emit SCHEMA_FAIL on parse errors.
- Phase 2: Evaluate all semantic invariants and policy gates against the parsed object. Emit SEMANTIC_FAIL or POLICY_BLOCK on failures.
- Phase 3: Route each result based on reason code and severity.
- Aggregate results per category and compare to baseline.
- Evaluate release gate rules in priority order (critical first, then per-category, then aggregate).
- Persist all artifacts with full provenance metadata.
Complexity Analysis (conceptual):
- Time: O(n * k) where n is fixture cases and k is invariants per case.
- Space: O(n) for traces and report artifacts.
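Phases 1 and 2 of the algorithm can be sketched as below. The invariant callables and field names are invented for illustration; the real contract drives both:

```python
import json

def phase1_parse(raw_output, required_fields):
    """Parse raw model output into a dict; return (obj, None) or (None, 'SCHEMA_FAIL')."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return None, "SCHEMA_FAIL"
    if not isinstance(obj, dict) or not required_fields <= obj.keys():
        return None, "SCHEMA_FAIL"  # a missing required field is a schema failure
    return obj, None

def phase2_check(obj, invariants):
    """Run every invariant; one that raises counts as a failure instead of
    crashing the harness (covers the 'invariant throws' edge case)."""
    failed = []
    for name, check in invariants.items():
        try:
            ok = check(obj)
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed

# Hypothetical invariant and output for a support-ticket contract:
invariants = {"priority_valid": lambda o: o["priority"] in {"low", "medium", "high"}}
obj, err = phase1_parse('{"priority": "high", "category": "billing"}',
                        {"priority", "category"})
```

Keeping the phases as separate functions gives each its own metrics and guarantees invariants never run against unparseable output.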
5. Implementation Guide
5.1 Development Environment Setup
# 1) Install dependencies (Python 3.11+, uv package manager)
# 2) Prepare fixture suite under fixtures/ with stratified cases
# 3) Create contract definition in contracts/support_ticket.v1.yaml
# 4) Run: uv run p01-harness run --suite fixtures/support_tickets.yaml --seed 42 --out out/p01
5.2 Project Structure
p01/
├── src/
│ ├── cli.py # CLI argument parsing and entrypoint
│ ├── suite_loader.py # Fixture loading and validation
│ ├── contract.py # Contract definition parsing
│ ├── validator.py # Three-phase validation pipeline
│ ├── regression.py # Baseline comparison and delta computation
│ ├── release_gate.py # Gate rule evaluation
│ └── artifact_writer.py # Report, trace, and trend output
├── contracts/
│ └── support_ticket.v1.yaml
├── fixtures/
│ ├── support_tickets.yaml
│ └── broken_suite.yaml
├── out/
└── README.md
5.3 The Core Question You’re Answering
“How do I prove a prompt revision improved quality instead of only changing phrasing?”
This question matters because it forces the project to produce objective evidence: per-invariant pass rates, per-category regression deltas, and release recommendations grounded in configurable thresholds rather than subjective impressions of prompt quality.
5.4 Concepts You Must Understand First
- Output contracts and invariants
- Why are contracts more reliable than manual prompt review for detecting regressions?
- Book Reference: “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 (Encoding and Evolution)
- Evaluation dataset stratification
- Why does an unstratified fixture suite produce misleading pass rates?
- Book Reference: “Site Reliability Engineering” by Google - Ch. 4 (Service Level Objectives)
- Failure taxonomy design
- Why must failure categories be machine-readable rather than human-readable descriptions?
- Book Reference: “Security Engineering” by Ross Anderson - Ch. 2 (Usability and Psychology)
- Regression analysis for probabilistic systems
- How do you compare results when the same input can produce different outputs?
- Book Reference: “AI Engineering” by Chip Huyen - Evaluation chapters
5.5 Questions to Guide Your Design
- Boundary and contracts
- What is the smallest contract that makes your invariants testable?
- Which fields are truly required vs merely useful for debugging?
- How do you handle contract version mismatches between fixture expectations and current contract?
- Runtime policy
- What is allowed automatically, what needs retry, and what must escalate?
- Which policy checks must happen before any side effect?
- How do you prevent a POLICY_BLOCK from being retried (which would waste tokens on a fundamentally unsafe output)?
- Evidence and observability
- What traces/metrics are required for fast incident triage?
- What specific thresholds trigger rollback or human review?
- How do you make trend artifacts useful for non-technical stakeholders (product managers, compliance officers)?
5.6 Thinking Exercise
Pre-Mortem for Prompt Contract Harness
Before implementing, write down 10 ways this project can fail in production. For each failure, classify it as: contract, policy, security, or operations. Then for each, determine: can it be prevented before runtime (static check), or does it require runtime detection and escalation?
Example failures to consider:
- A safety-critical invariant was never added to the contract
- The fixture suite does not cover a new product category
- The model provider silently updates the model version
- A contract change breaks backward compatibility with the monitoring dashboard
- The baseline artifact is corrupted or missing
- Two team members promote conflicting prompt versions simultaneously
- The trend artifact grows unbounded and fills disk
Questions to answer:
- Which of your 10 failures can be caught by the harness itself?
- Which require organizational process (code review, approval workflows)?
- Which are your highest-severity blind spots (failures you cannot detect)?
5.7 The Interview Questions They’ll Ask
- “How do you define a good prompt contract for non-deterministic systems?”
- “Which metrics should block promotion even if global pass rate is high?”
- “How do you design abstention behavior to be measurable rather than hidden?”
- “What makes a fixture suite representative instead of overfit to known-good cases?”
- “How would you explain failure reason codes to non-ML stakeholders?”
- “How do you handle the trade-off between strict release gates (fewer regressions) and fast iteration velocity (more prompt revisions per week)?”
5.8 Hints in Layers
Hint 1: Start with fixture quality Your harness is only as strong as the expected outputs and risk labels. Before writing any validation code, spend time on fixture design: ensure stratification across risk levels, difficulty tiers, and categories. A well-stratified suite of 50 cases is more valuable than 500 unstratified cases.
Hint 2: Separate syntax from semantics Keep schema checks (Phase 1: does the output parse?) and business-rule checks (Phase 2: do invariants hold?) as different pipeline stages. This produces clearer error messages and enables separate metrics per phase. Pseudocode:
result = phase1_parse(raw_output, contract)
IF result.status == SCHEMA_FAIL:
return FailureEnvelope(reason=SCHEMA_FAIL, routing=RETRY)
invariant_results = phase2_check(result.typed_object, contract.invariants)
IF any(r.severity == CRITICAL for r in invariant_results):
return FailureEnvelope(reason=POLICY_BLOCK, routing=ESCALATE)
Hint 3: Build the release gate early Define your promotion thresholds before running large experiments. This prevents post-hoc rationalization (“the pass rate is 93%, which is probably fine…”). Decide in advance: what pass rate blocks promotion? How many critical failures are tolerable (answer: zero)?
Hint 4: Persist every trace for regression debugging Without per-case traces that include trace_id, contract_version, prompt_version, and per-invariant results, you cannot debug regressions. When a category regresses by 1.5%, you need to see exactly which cases flipped from pass to fail and which invariants they violated. Design your trace format for diff-ability: sorted by case_id, with stable field ordering.
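Hint 4's diff-ability requirement can be met with sorted keys and case-ordered lines. A minimal sketch; the field names follow the ValidationResult shape from the architecture section, and the helper name is an assumption:

```python
import json

def write_traces(results):
    """Serialize per-case traces as NDJSON, sorted by case_id with stable key
    ordering, so two runs can be compared with a plain text diff."""
    lines = [json.dumps(r, sort_keys=True)
             for r in sorted(results, key=lambda r: r["case_id"])]
    return "\n".join(lines) + "\n"

ndjson = write_traces([
    {"case_id": "ticket-042", "final_status": "FAIL", "reason_code": "POLICY_BLOCK"},
    {"case_id": "ticket-001", "final_status": "SUCCESS", "reason_code": None},
])
```

With this format, `diff run_a/traces.ndjson run_b/traces.ndjson` shows exactly which cases flipped, with no JSON-aware tooling required.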
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Data contracts and schema evolution | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 4 |
| SLOs, error budgets, and release gates | “Site Reliability Engineering” by Google | Ch. 4-6, 31 |
| Failure taxonomy and trust boundaries | “Security Engineering” by Ross Anderson | Ch. 2-3 |
| LLM evaluation and testing | “AI Engineering” by Chip Huyen | Evaluation chapters |
| Building resilient retry/escalation systems | “Building LLM Apps” by Valentina Alto | Relevant chapters |
5.10 Implementation Phases
Phase 1: Foundation
- Define contracts with typed fields, invariants, and failure envelope.
- Build the suite loader with fixture validation and stratification checks.
- Implement Phase 1 (output typing) of the validation pipeline.
- Checkpoint: One fixture case parses into a typed object and a SCHEMA_FAIL case produces a structured error.
Phase 2: Core Functionality
- Implement Phase 2 (invariant checking) and Phase 3 (failure routing) of the validation pipeline.
- Build the regression analyzer with baseline loading and per-category delta computation.
- Implement the release gate engine with configurable thresholds.
- Checkpoint: Full suite runs with per-case traces, regression analysis against a baseline, and a release recommendation.
Phase 3: Operational Hardening
- Add trend artifact generation and multi-run comparison.
- Add CLI flags for lint-suite (pre-validate fixtures) and diff-baseline (show per-case changes).
- Document runbook: how to investigate a HOLD recommendation, how to create a new baseline, how to add a new invariant.
- Checkpoint: Team member can reproduce output from clean checkout, investigate a regression, and create a new baseline.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Validation order | Single-pass vs multi-phase | Multi-phase (parse -> check -> route) | Clearer diagnostics, separate metrics per phase, prevents checking invariants on unparseable output |
| Failure handling | Silent retries vs explicit reason codes | Explicit reason codes with structured failure envelope | Enables automated routing and faster debugging |
| Regression comparison | Exact-match vs statistical thresholds | Statistical thresholds with per-category breakdown | LLM outputs are probabilistic; exact-match produces false regressions |
| Release decision | Manual approval vs automated gates | Automated gates with manual override for HOLD | Balances speed and safety; humans review only ambiguous cases |
| Baseline storage | In-memory vs persisted artifacts | Persisted JSON artifacts with version metadata | Enables trend analysis and reproducible comparisons across runs |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate individual pipeline stages | Contract parser, invariant evaluators, release gate rules |
| Integration Tests | Verify end-to-end harness flow | Golden-path suite run producing correct report shape |
| Regression Tests | Ensure harness itself does not regress | Same fixture suite + seed always produces same results |
| Edge Case Tests | Ensure robust failure handling | Malformed fixtures, missing baselines, disk write failures |
6.2 Critical Test Cases
- Golden path succeeds: 120-case suite with all-passing results produces PROMOTE recommendation.
- Critical failure blocks: one POLICY_BLOCK on a safety case produces ROLLBACK regardless of aggregate pass rate.
- Regression detected: candidate with per-category regression produces HOLD with specific justification.
- Deterministic replay: same seed and fixtures always produce identical summary (pass counts, recommendation).
- Suite validation: malformed fixture suite halts with clear error before any model calls.
- Missing baseline: first-run mode evaluates only absolute thresholds and notes “no baseline available.”
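The deterministic-replay case reduces to one rule: thread a single seeded RNG through the run, never the global random state. A toy property sketch, where `run_suite` is a stand-in for the real harness entry point:

```python
import random

def run_suite(case_ids, seed):
    """Stand-in for the harness: any randomness (case ordering, sampling,
    tie-breaking) must come from one RNG seeded per run, never the global one."""
    rng = random.Random(seed)
    order = sorted(case_ids, key=lambda _: rng.random())
    return {"order": order, "case_count": len(order)}

def test_deterministic_replay():
    a = run_suite(["c1", "c2", "c3"], seed=42)
    b = run_suite(["c1", "c2", "c3"], seed=42)
    assert a == b  # identical summaries, run after run
```

The same property test, pointed at the real harness, is the cheapest regression test in the suite: it fails the moment any component reaches for unseeded randomness.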
6.3 Test Data
fixtures/golden_path_suite.yaml # All cases designed to pass
fixtures/critical_failure_suite.yaml # Contains one safety POLICY_BLOCK
fixtures/regression_suite.yaml # Shows category regression vs baseline
fixtures/broken_suite.yaml # Malformed fixtures for validation testing
fixtures/edge_cases/ # Empty suite, single-case suite, all-fail suite
baselines/ # Stored baseline artifacts for regression tests
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| “Pass rate looks good but production fails” | Eval set is dominated by easy cases; hard/safety cases are underrepresented. | Stratify fixtures by risk label and difficulty. Require minimum case counts per stratification cell. |
| “JSON parses but downstream breaks” | Only structural validation exists; semantic invariants are missing. | Add invariant checks in Phase 2 that encode business rules and safety requirements tied to specific fields. |
| “Release gate keeps flapping” | Thresholds are too tight for natural LLM output variance across runs. | Use confidence intervals and minimum sample sizes per category. Consider averaging across 3 runs before making gate decisions. |
| “Cannot debug which cases regressed” | Per-case traces are missing or do not include enough metadata. | Persist every trace with trace_id, case_id, contract_version, prompt_version, and per-invariant results. Design traces for diff-ability. |
| “Contract change broke the dashboard” | No backward-compatibility check on contract evolution. | Run new contract against old fixture baseline before promotion. Flag breaking changes explicitly. |
| “Trend artifact grows unbounded” | No retention policy on historical run artifacts. | Set a maximum retention window (e.g., last 50 runs) and prune older artifacts automatically. |
7.2 Debugging Strategies
- Re-run deterministic fixtures with fixed seed and compare trace IDs.
- Diff latest per-case traces against last known-good baseline using a JSON diff tool.
- Isolate whether failure is contract-level (parse), invariant-level (semantic), or infrastructure-level (disk, network).
- Use the `lint-suite` command to validate fixture shape before running the full harness.
- Check contract version alignment between fixture expectations, validator config, and baseline artifact.
7.3 Performance Traps
- Unbounded retries inflate latency and token cost. Set max_retries per case.
- Overly verbose per-case tracing slows disk I/O. Use structured JSON lines (NDJSON) for efficient append.
- Loading the entire baseline into memory is fine for 120 cases but will not scale to 10,000. Use indexed lookup by case_id if scaling.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add one new fixture category (e.g., “refund requests”) with expected outcome labels.
- Add one new invariant (e.g., “if customer_tier is ‘enterprise’, response must include SLA reference”).
- Add a summary-only CLI mode that skips per-case trace output for faster runs.
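The enterprise-SLA invariant suggested above might look like the following validator sketch; the field names, the substring check, and the reason string are illustrative assumptions:

```python
def check_sla_invariant(case: dict, response: dict):
    """Invariant: if customer_tier is 'enterprise', the response must reference an SLA.

    Field names (customer_tier, body) are hypothetical; adapt to your contract.
    Returns (passed, reason) so the harness can record a per-invariant result.
    """
    if case.get("customer_tier") != "enterprise":
        return True, "not applicable"  # invariant only binds enterprise cases
    if "SLA" in response.get("body", ""):
        return True, "SLA reference present"
    return False, "INVARIANT_FAIL: enterprise response missing SLA reference"
```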
8.2 Intermediate Extensions
- Add dashboard-ready trend exports (CSV or JSON) that visualization tools can consume.
- Add automated regression diff that shows exactly which cases flipped between runs.
- Add contract compatibility checking that flags breaking changes before promotion.
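The regression diff above reduces to comparing per-case pass status between two runs. A sketch, assuming each run is a mapping of case_id to a result dict with a boolean `passed` field:

```python
def flipped_cases(baseline: dict, latest: dict) -> dict:
    """Return {case_id: (old_passed, new_passed)} for cases whose status changed.

    Cases present in only one run are skipped; report those separately
    as suite-composition changes rather than regressions.
    """
    flips = {}
    for case_id, new in latest.items():
        old = baseline.get(case_id)
        if old is not None and old["passed"] != new["passed"]:
            flips[case_id] = (old["passed"], new["passed"])
    return flips
```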
8.3 Advanced Extensions
- Integrate with CI/CD pipeline as a merge-blocking gate.
- Add canary deployment logic with automated rollback on live-metric degradation.
- Add chaos-style fault injection (corrupt fixtures, inject model timeouts) and verify harness resilience.
- Add error budget tracking across multiple runs with deployment freeze triggers.
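One possible shape for the error-budget freeze trigger in the last bullet; the SLO, budget size, and burn formula are illustrative choices, not a prescription:

```python
def deployment_frozen(run_pass_rates, slo=0.95, budget=0.10) -> bool:
    """Freeze deployments once cumulative shortfall against the SLO
    exhausts the error budget across recent runs.

    Each run burns budget equal to how far its pass rate fell below the SLO;
    runs at or above the SLO burn nothing.
    """
    burned = sum(max(0.0, slo - rate) for rate in run_pass_rates)
    return burned > budget
```

This makes the incentive explicit: a string of marginal runs eventually freezes deployment just as surely as one catastrophic run does.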
9. Real-World Connections
9.1 Industry Applications
- PromptOps platform teams operating AI features under compliance constraints (healthcare, finance, legal).
- Internal AI governance tooling for release safety and incident response at companies like Stripe, Airbnb, and Uber.
- Enterprise AI products that must demonstrate testing rigor for SOC 2 compliance and regulatory audits.
9.2 Related Open Source Projects
- promptfoo: Declarative prompt evaluation with test suites, assertions, and CI integration.
- DeepEval: LLM evaluation framework with regression testing and metric tracking.
- LangSmith: Tracing and evaluation platform for LLM applications.
- OpenTelemetry: Observability framework for distributed tracing (applicable to prompt execution traces).
- Braintrust: Evaluation platform with baseline comparison and trend analysis.
9.3 Interview Relevance
- Demonstrates ability to convert probabilistic model behavior into deterministic software guarantees.
- Shows practical production-thinking: contracts, policies, monitoring, and operational controls.
- Proves understanding of schema evolution, backward compatibility, and release discipline.
- Provides concrete examples of SRE concepts (error budgets, SLOs, canary deployments) applied to AI systems.
10. Resources
10.1 Essential Reading
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 4 for contract evolution.
- “Site Reliability Engineering” by Google - Ch. 4-6 for SLOs and error budgets.
- “AI Engineering” by Chip Huyen - For LLM evaluation and testing patterns.
- OpenAI Structured Outputs docs: https://developers.openai.com/api/docs/guides/structured-outputs
- Anthropic Claude Structured Outputs: https://platform.claude.com/docs/en/build-with-claude/structured-outputs
10.2 Video Resources
- Talks on LLM eval systems from AI Engineering Summit.
- Google SRE talks on error budgets and release gating.
- PromptOps and AI safety operations presentations from MLOps Community.
10.3 Tools & Documentation
- promptfoo: https://github.com/promptfoo/promptfoo
- DeepEval: https://deepeval.com/docs/getting-started
- JSON Schema specification: https://json-schema.org/
- OWASP LLM Top 10: https://owasp.org/www-project-top-10-for-large-language-model-applications/
- NIST AI RMF: https://www.nist.gov/artificial-intelligence/risk-management-framework
10.4 Related Projects in This Series
- P02 (JSON Output Enforcer): Uses contract concepts for schema validation and repair loops.
- P08 (Prompt DSL + Linter): Applies static analysis to prompt files, complementing runtime contract testing.
- P11 (Canary Prompt Rollout Controller): Extends the release gate concept with live-traffic canary monitoring.
- P15 (Prompt Registry): Manages prompt versioning that feeds into contract version tracking.
- P18 (Capstone): Composes contracts from multiple projects into a unified validation pipeline.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain why output typing must happen before invariant checking.
- I can explain the difference between SCHEMA_FAIL and SEMANTIC_DRIFT and why routing differs.
- I can design a fixture stratification plan that avoids easy-case bias.
- I can justify each threshold in my release gate configuration.
- I can explain how error budgets create incentives for prompt quality.
11.2 Implementation
- Golden-path and failure-path flows both work deterministically.
- Per-case traces include all required metadata (trace_id, versions, per-invariant results).
- Regression analysis correctly computes per-category deltas against a baseline.
- Release gate produces correct recommendations for PROMOTE, HOLD, and ROLLBACK scenarios.
- Trend artifacts aggregate across multiple runs.
11.3 Growth
- I can describe one tradeoff I made (e.g., strict gates vs. iteration velocity) and why.
- I can explain this project design in an interview setting with concrete examples.
- I can identify gaps in my fixture suite and explain how to fill them.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Golden path works with deterministic output artifact.
- At least one failure-path scenario returns unified error shape with reason code.
- Per-case traces with contract version and prompt version are persisted.
- Release recommendation is computed from configurable gate rules.
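A configurable gate-rule computation of the kind required above might look like this sketch; the thresholds and the rule that any critical failure forces rollback are illustrative assumptions:

```python
def gate_decision(pass_rate: float, critical_failures: int,
                  promote_threshold: float = 0.95,
                  rollback_threshold: float = 0.80) -> str:
    """Compute a release recommendation from configurable gate rules.

    Thresholds are illustrative defaults; load them from config in practice.
    """
    # Critical failures or a collapsed pass rate always force rollback.
    if critical_failures > 0 or pass_rate < rollback_threshold:
        return "ROLLBACK"
    if pass_rate >= promote_threshold:
        return "PROMOTE"
    return "HOLD"  # between thresholds: neither promote nor roll back
```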
Full Completion:
- Includes regression analysis against a stored baseline.
- Includes per-category delta reporting and trend artifact generation.
- Includes automated tests for golden-path, critical-failure, and regression scenarios.
- Includes operational thresholds for promote/rollback with human-readable justifications.
Excellence (Above & Beyond):
- Integrates with CI/CD as a merge-blocking gate.
- Implements contract compatibility checking for breaking change detection.
- Includes error budget tracking with deployment freeze triggers.
- Demonstrates incident drill replay: given a regression, trace it to specific cases and invariants.
- Integrates with adjacent projects (registry, rollout, firewall, HITL) cleanly.