Project 1: Prompt Contract Harness

Build a production-grade CLI tool that treats prompts like software artifacts with automated testing

Quick Reference

| Attribute | Value |
| --- | --- |
| Difficulty | Intermediate |
| Time Estimate | 3-5 days |
| Language | Python (Alternatives: TypeScript) |
| Prerequisites | Basic Python/TypeScript, LLM API access |
| Key Topics | Testing, Invariants, Regression, SLOs |
| Knowledge Area | PromptOps / Testing |
| Software/Tool | CLI harness + validators + reports |
| Main Book | "Site Reliability Engineering" by Google (Concepts of SLOs) |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | 4. The "Open Core" Infrastructure |

1. Learning Objectives

By completing this project, you will:

  1. Master Prompt Contracts: Learn to treat prompts as deterministic functions with defined inputs, invariant constraints, and strictly typed outputs
  2. Build Production Testing Infrastructure: Create automated testing pipelines that detect regression when prompts change
  3. Understand Evaluation Metrics: Apply SRE principles (SLOs, error budgets) to AI system reliability
  4. Handle Non-Determinism: Learn when to test at temperature=0 vs higher temperatures for different validation types
  5. Create Actionable Reports: Generate both human-readable (HTML) and machine-readable (JSON) test reports
  6. Implement Parametric Evaluation: Run tests across different models, temperatures, and configurations
  7. Design Robust Invariants: Translate vague business requirements into programmatic assertions

2. Theoretical Foundation

2.1 Core Concepts

Unit Testing for Prompts

In traditional software development, unit tests verify that a function produces expected outputs for given inputs. The same principle applies to prompts: they are functions that transform inputs (context, user query) into outputs (responses).

Traditional Function Testing:

def calculate_tax(amount: float, rate: float) -> float:
    return amount * rate

# Test
assert calculate_tax(100, 0.1) == 10.0  # Deterministic

Prompt Testing:

def customer_support_prompt(query: str, context: str) -> dict:
    # LLM call
    response = llm.complete(f"Context: {context}\nQuery: {query}")
    return parse_json(response)

# Test
result = customer_support_prompt("refund order #123", policy_docs)
assert result["has_citation"] == True  # Contract check
assert "order_id" in result  # Required field

The key difference: prompts are probabilistic, so we must test invariants (properties that should always hold) rather than exact outputs.

Invariants and Contracts

An invariant is a condition that must always be true regardless of the specific execution. A contract is a set of invariants that define what "correct" means.

Types of Invariants:

| Type | Description | Example |
| --- | --- | --- |
| Structural | Output format/schema | "Must be valid JSON" |
| Semantic | Meaning constraints | "Must cite a source document" |
| Safety | Security boundaries | "Must not contain PII" |
| Performance | Resource limits | "Must respond in <500ms" |

Example Contract:

contract:
  name: "Customer Support Response"
  invariants:
    - type: schema
      spec: response_schema.json
    - type: citation
      rule: "Must reference at least one policy document"
    - type: length
      min: 50
      max: 500
    - type: tone
      rule: "Must be professional and empathetic"

Service Level Objectives (SLOs)

SLOs define the reliability target for your system. For LLM applications:

Example SLOs:

  • Accuracy SLO: 99% of responses must pass all invariants
  • Latency SLO: 95th percentile response time < 500ms
  • Cost SLO: Average cost per request < $0.01

Error Budgets: If your SLO is 99%, you have a 1% error budget. Once exhausted, you halt prompt changes until you fix regressions.
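
A minimal sketch of how a run's pass/fail counts could be turned into an error-budget status; the constant and helper names here (SLO_TARGET, error_budget_status) are illustrative, not part of the harness spec.

SLO_TARGET = 0.99  # 99% of responses must pass all invariants

def error_budget_status(total_cases: int, failed_cases: int) -> dict:
    """Report how much of the error budget a test run consumed."""
    allowed_failures = (1.0 - SLO_TARGET) * total_cases
    consumed = failed_cases / allowed_failures if allowed_failures else float("inf")
    return {
        "allowed_failures": allowed_failures,
        "actual_failures": failed_cases,
        "budget_consumed": consumed,          # > 1.0 means the budget is exhausted
        "halt_prompt_changes": consumed > 1.0,
    }

# Example: 16 cases with 1 failure against a 99% SLO exhausts the budget.
print(error_budget_status(16, 1))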

Deterministic vs Non-Deterministic Testing

Temperature controls randomness in LLM outputs:

  • temp=0.0: Deterministic (always picks highest probability token)
  • temp=0.7: Balanced creativity and consistency
  • temp=1.0+: High creativity, low consistency

Testing Strategy:

Deterministic Tests (temp=0.0):
- JSON schema validation
- Required field presence
- Exact format matches
- Citation checks

Non-Deterministic Tests (temp=0.7, N=10 samples):
- Tone/style quality (average score)
- Creativity metrics
- Diversity of responses
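
For the non-deterministic case, a minimal sketch of sampling and aggregation; the generate and check callables stand in for the LLM call at temperature 0.7 and a quality grader, neither of which is defined here.

import random
from typing import Callable

def sampled_pass_rate(generate: Callable[[], str],
                      check: Callable[[str], bool],
                      n_samples: int = 10) -> float:
    """Run a non-deterministic prompt N times and return the fraction of passing samples."""
    passes = sum(1 for _ in range(n_samples) if check(generate()))
    return passes / n_samples

# Toy usage with a stubbed generator so the sketch runs standalone; in the harness,
# `generate` would call the LLM and `check` would be a tone or quality grader.
rate = sampled_pass_rate(lambda: random.choice(["Happy to help!", "whatever"]),
                         lambda text: text.endswith("!"))
print(f"Pass rate across samples: {rate:.2f}")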

Schema Validation

JSON Schema acts as a "type system" for LLM outputs. It ensures the model's response is structurally valid before your application processes it.

Example Schema:

{
  "type": "object",
  "properties": {
    "answer": {"type": "string", "minLength": 10},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    "citations": {
      "type": "array",
      "items": {"type": "string"},
      "minItems": 1
    }
  },
  "required": ["answer", "confidence", "citations"],
  "additionalProperties": false
}

This prevents crashes from missing fields, wrong types, or hallucinated extra fields.
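
A minimal sketch of enforcing the schema above with the jsonschema package (the raw response string is made up for illustration):

import json
from jsonschema import validate, ValidationError

RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "answer": {"type": "string", "minLength": 10},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "citations": {"type": "array", "items": {"type": "string"}, "minItems": 1},
    },
    "required": ["answer", "confidence", "citations"],
    "additionalProperties": False,
}

raw = '{"answer": "Refunds are accepted within 30 days.", "confidence": 0.9, "citations": ["policy_doc_1"]}'
try:
    validate(instance=json.loads(raw), schema=RESPONSE_SCHEMA)
    print("Response is structurally valid")
except ValidationError as exc:
    print(f"Reject response: {exc.message}")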

2.2 Why This Matters

Production Relevance

Real-world LLM applications fail in production due to:

  1. Prompt Changes: A single word change can break edge cases
  2. Model Updates: Provider model updates can change behavior
  3. Context Drift: As your data changes, prompt performance degrades
  4. Edge Cases: Rare inputs that weren't tested manually

Without automated testing: You discover failures after users complain. With this harness: You catch regressions before deployment.

Real-World Applications

This testing pattern is used by:

  • OpenAI Evals: OpenAI's internal framework for model evaluation
  • LangSmith: LangChain's testing and observability platform
  • Anthropic Workbench: Claude application testing infrastructure
  • Microsoft Prompt Flow: Azure's prompt engineering toolchain

Companies like Stripe, Shopify, and Notion use similar harnesses to ensure their AI features maintain quality across updates.

2.3 Historical Context

Evolution of Prompt Engineering

2020-2021: The "Vibes" Era

  • Prompts were magic spells
  • "Act as a..." and "Take a deep breath" heuristics
  • No systematic evaluation
  • Manual testing only

2022-2023: The Systematization Era

  • OpenAI releases Evals framework
  • Chain-of-Thought (CoT) prompting formalized
  • Few-shot learning standardized
  • Schema-based outputs emerge

2024+: The Engineering Discipline Era

  • Prompts as code (version control, testing, CI/CD)
  • Automated evaluation pipelines
  • Statistical significance testing
  • Production observability

This project teaches you the modern, engineering-driven approach.

2.4 Common Misconceptions

| Misconception | Reality |
| --- | --- |
| "Prompts are too random to test" | You test invariants, not exact outputs |
| "Manual testing is sufficient" | Manual testing doesn't scale to 100+ edge cases |
| "Higher temperature = better responses" | Temperature depends on use case; many tasks need temp=0 |
| "If it works once, it works always" | LLM outputs vary; you need statistical sampling |
| "Testing slows down development" | Testing prevents shipping broken prompts (faster overall) |

3. Project Specification

3.1 What You Will Build

A command-line tool that:

  1. Loads test suites from YAML/JSON files containing test cases
  2. Executes prompts against LLM APIs (OpenAI, Anthropic, etc.)
  3. Validates responses against defined invariants
  4. Generates reports in HTML and JSON formats
  5. Detects regression by comparing current run to previous versions
  6. Provides actionable feedback on which tests failed and why

Core Question This Tool Answers:

"How do I know if my prompt change made things better or worse?"

3.2 Functional Requirements

FR1: Test Suite Loading

  • Load test cases from YAML or JSON files
  • Support multiple test categories (e.g., Refund, Technical, Policy)
  • Parse test case metadata: ID, input, expected invariants
  • Validate test file structure before execution

FR2: Prompt Execution

  • Support multiple LLM providers (OpenAI, Anthropic)
  • Execute prompts with configurable temperature, max tokens, etc.
  • Handle API rate limiting with exponential backoff
  • Capture response metadata: latency, token count

FR3: Invariant Checking

Implement at least these validator types:

  • JSONSchema: Validate against JSON Schema spec
  • Contains: Check if response contains specific strings
  • NotContains: Ensure response doesn't contain forbidden strings
  • Length: Validate character/word count bounds
  • Citation: Verify presence of source citations
  • Regex: Pattern matching for structured formats

FR4: Report Generation

  • Console Output: Rich, colorized terminal output with progress bars
  • HTML Report: Interactive web page with drill-down capabilities
  • JSON Export: Machine-readable format for CI/CD integration
  • Trend Tracking: Compare current run to historical runs

FR5: Regression Detection

  • Store test results with version identifiers
  • Compare current run against previous baseline
  • Flag accuracy drops above configurable threshold (e.g., >5% drop)
  • Provide diff view showing which cases broke

3.3 Non-Functional Requirements

| Requirement | Target | Rationale |
| --- | --- | --- |
| Performance | Handle 50+ test cases in <5 minutes | Reasonable for CI/CD pipelines |
| Reliability | Retry failed API calls 3x with backoff | Handle transient network issues |
| Usability | Clear error messages with fix suggestions | Developers must understand failures quickly |
| Extensibility | Plugin architecture for custom validators | Different use cases need custom checks |
| Cost Efficiency | Cache responses during development | Avoid paying for repeated identical calls |

3.4 Example Usage

Running the harness:

$ python harness.py test prompts/support_agent.yaml

╔══════════════════════════════════════════════════╗
║  PROMPT HARNESS v1.0 - Test Suite Execution      ║
╚══════════════════════════════════════════════════╝

Loading test suite: prompts/support_agent.yaml
Found 16 test cases across 3 categories (Refund, Technical, Policy)
Testing against: gpt-4 (temperature=0.0)

RUNNING SUITE: Customer Support Evals
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Category: Refund Queries
──────────────────────────────────────────────────

[PASS] Case: simple_refund_query (120ms)
  Input: "I want to return my order #12345"
  ✓ Invariant: Valid JSON Schema ............. OK
  ✓ Invariant: Has Citation .................. OK (Cited: policy_doc_3)
  ✓ Invariant: Contains Order ID ............. OK (#12345)
  ✓ Invariant: Polite Tone ................... OK (Confidence: 0.95)

[PASS] Case: refund_outside_window (135ms)
  Input: "Can I return something I bought 6 months ago?"
  ✓ Invariant: Valid JSON Schema ............. OK
  ✓ Invariant: Has Citation .................. OK (Cited: policy_doc_1)
  ✓ Invariant: Mentions Time Limit ........... OK
  ✓ Invariant: Suggests Alternative .......... OK

[FAIL] Case: ambiguous_policy_query (98ms)
  Input: "What's your return policy?"
  ✓ Invariant: Valid JSON Schema ............. OK
  ✗ Invariant: Has Citation .................. FAIL
  ✓ Invariant: Contains Policy Details ....... OK

  Expected: Citation to a policy document
  Actual Output: {
    "response": "I think you can return it maybe?",
    "confidence": 0.3,
    "citation": null
  }

  Failure Reason: Model responded with vague language without
                  grounding answer in provided policy documents.

Category: Technical Support
──────────────────────────────────────────────────

[PASS] Case: password_reset (105ms)
[PASS] Case: account_locked (118ms)
[PASS] Case: api_integration_help (142ms)
... (showing 3/10 cases for brevity)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
SUMMARY REPORT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Total Cases:     16
Passed:          15
Failed:          1
Success Rate:    93.7%
Total Time:      1.82s
Avg Latency:     113ms

FAILURES BY INVARIANT:
  • Has Citation: 1 failure (ambiguous_policy_query)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠ REGRESSION DETECTED: Score dropped from 100% (v1.2.0) to 93.7%
  See detailed report: ./reports/run_2024-12-27_14-32-01.html

Recommendations:
  1. Review prompt instructions for citation requirements
  2. Add explicit instruction: "Always cite source documents"
  3. Consider adding few-shot examples with citations

Integration with workflow:

# Run before committing a prompt change
$ python harness.py test prompts/support_agent.yaml --compare-to v1.2.0

# Run in CI/CD (fail build if score drops below threshold)
$ python harness.py test prompts/*.yaml --min-score 95 --format junit

# Run parametric sweep (test across multiple models/temperatures)
$ python harness.py test prompts/support_agent.yaml --sweep temperature=0.0,0.3,0.7

# Generate regression report comparing two versions
$ python harness.py diff v1.2.0 v1.3.0 --output regression_report.md

Generated HTML Report Features:

The tool generates a detailed HTML report (reports/run_2024-12-27_14-32-01.html) with:

  • Side-by-side comparison: Shows your prompt version vs. the previous version
  • Per-case drill-down: Click any failed case to see full input, expected output, actual output, and which specific assertion failed
  • Trend graphs: Visual charts showing your accuracy over time across different prompt versions
  • Diff highlighting: Color-coded changes showing what you modified in your prompt between runs
  • Export options: Download results as JSON for integration with CI/CD pipelines

Example HTML report sections:

  • "Accuracy Trend" graph showing 100% → 95% → 93.7% over three runs
  • "Token Usage Analysis" showing average tokens per response
  • "Latency Distribution" histogram showing response time patterns
  • "Failure Clustering" identifying which types of queries break most often

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Test Loader │────▶│ Test Runner  │────▶│   Reporter   │
└──────────────┘     └──────────────┘     └──────────────┘
       │                    │                     │
       │                    ▼                     │
       │             ┌──────────────┐             │
       │             │  LLM Client  │             │
       │             └──────────────┘             │
       │                    │                     │
       │                    ▼                     │
       │             ┌──────────────┐             │
       │             │  Validators  │             │
       │             └──────────────┘             │
       │                                          │
       └─────────────▶ Results DB ◀───────────────┘

4.2 Key Components

| Component | Responsibility | Key Decisions |
| --- | --- | --- |
| TestLoader | Parse YAML/JSON test files | Use PyYAML for human-readable format with comments |
| TestRunner | Execute tests with rate limiting | Async execution with asyncio for concurrent API calls |
| LLMClient | Abstract API calls to different providers | Strategy pattern for OpenAI/Anthropic/local models |
| Validators | Check invariants against responses | Plugin architecture - easy to add custom validators |
| Reporter | Generate HTML/JSON reports | Jinja2 templates for HTML, structured JSON for CI/CD |
| ResultsDB | Store historical test runs | SQLite for simplicity, JSON files for portability |

4.3 Data Structures

from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from enum import Enum

class InvariantType(Enum):
    JSON_SCHEMA = "json_schema"
    CONTAINS = "contains"
    NOT_CONTAINS = "not_contains"
    LENGTH = "length"
    CITATION = "citation"
    REGEX = "regex"
    CUSTOM = "custom"

@dataclass
class Invariant:
    """Defines a single invariant check"""
    type: InvariantType
    name: str
    spec: Dict[str, Any]  # Type-specific configuration

@dataclass
class TestCase:
    """Represents a single test case"""
    id: str
    category: str
    input: str
    context: Optional[str]
    expected_invariants: List[Invariant]
    metadata: Dict[str, Any]

@dataclass
class ValidationResult:
    """Result of validating a single invariant"""
    invariant_name: str
    passed: bool
    error_message: Optional[str]
    details: Dict[str, Any]

@dataclass
class TestResult:
    """Complete result for one test case"""
    test_case: TestCase
    response: str
    passed: bool
    validation_results: List[ValidationResult]
    latency_ms: float
    tokens_used: int
    timestamp: str

@dataclass
class TestSuite:
    """Collection of test cases"""
    name: str
    version: str
    test_cases: List[TestCase]

@dataclass
class TestReport:
    """Aggregated results across all tests"""
    suite_name: str
    version: str
    total_cases: int
    passed_cases: int
    failed_cases: int
    success_rate: float
    total_time_s: float
    avg_latency_ms: float
    results: List[TestResult]
    regression_info: Optional[Dict[str, Any]]

4.4 Algorithm Overview

Test Execution Algorithm

def run_test_suite(suite: TestSuite, config: Config) -> TestReport:
    """
    Main test execution algorithm

    Complexity: O(n) where n = number of test cases
    Space: O(n) for storing results
    """
    results = []
    start_time = time.time()

    for test_case in suite.test_cases:
        # 1. Build prompt with input and context
        prompt = build_prompt(test_case.input, test_case.context)

        # 2. Call LLM API (with retry logic)
        response, metadata = call_llm_with_retry(
            prompt=prompt,
            config=config,
            max_retries=3
        )

        # 3. Validate against all invariants
        validation_results = []
        for invariant in test_case.expected_invariants:
            validator = get_validator(invariant.type)
            result = validator.validate(response, invariant.spec)
            validation_results.append(result)

        # 4. Determine if test passed (all invariants must pass)
        test_passed = all(r.passed for r in validation_results)

        # 5. Record result
        results.append(TestResult(
            test_case=test_case,
            response=response,
            passed=test_passed,
            validation_results=validation_results,
            latency_ms=metadata['latency_ms'],
            tokens_used=metadata['tokens']
        ))

    # 6. Generate aggregate report
    total_time = time.time() - start_time
    return generate_report(results, total_time)

Regression Detection Algorithm

from collections import defaultdict

def detect_regression(current_run: TestReport,
                     baseline_run: TestReport,
                     threshold: float = 0.05) -> Dict[str, Any]:
    """
    Compare two test runs to detect regression

    Args:
        current_run: Latest test results
        baseline_run: Previous baseline to compare against
        threshold: Maximum acceptable drop in success rate (e.g., 0.05 = 5%)

    Returns:
        Dictionary with regression analysis
    """
    # Calculate score delta
    score_delta = current_run.success_rate - baseline_run.success_rate

    # Identify newly broken test cases
    current_failures = {r.test_case.id for r in current_run.results if not r.passed}
    baseline_failures = {r.test_case.id for r in baseline_run.results if not r.passed}
    new_failures = current_failures - baseline_failures
    fixed_cases = baseline_failures - current_failures

    # Categorize failures by invariant type
    failure_breakdown = defaultdict(list)
    for result in current_run.results:
        if not result.passed:
            for validation in result.validation_results:
                if not validation.passed:
                    failure_breakdown[validation.invariant_name].append(result.test_case.id)

    return {
        "regression_detected": score_delta < -threshold,
        "score_delta": score_delta,
        "new_failures": list(new_failures),
        "fixed_cases": list(fixed_cases),
        "failure_breakdown": dict(failure_breakdown)
    }

Complexity Analysis:

  • Time Complexity: O(n × m) where n = test cases, m = invariants per case
  • Space Complexity: O(n) for storing all results
  • API Call Parallelization: Can achieve O(n/p) with p parallel workers (see the sketch below)
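
A minimal sketch of that parallelization with asyncio and a semaphore; the fake_llm coroutine is a stand-in for a real async client, which this overview does not define.

import asyncio

async def run_case(test_case, llm_call, semaphore):
    async with semaphore:                 # cap the number of in-flight API calls
        return await llm_call(test_case)

async def run_all(test_cases, llm_call, max_parallel: int = 5):
    semaphore = asyncio.Semaphore(max_parallel)
    return await asyncio.gather(*(run_case(tc, llm_call, semaphore) for tc in test_cases))

# Toy usage with a stubbed async "LLM" so the sketch runs standalone.
async def fake_llm(tc):
    await asyncio.sleep(0.01)
    return f"response for {tc}"

print(asyncio.run(run_all(["case_1", "case_2", "case_3"], fake_llm)))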

5. Implementation Guide

5.1 Development Environment Setup

# Create project directory
mkdir prompt-harness
cd prompt-harness

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install openai anthropic pyyaml jsonschema jinja2 click rich pytest

# Create .env file for API keys
cat > .env << EOF
OPENAI_API_KEY="your-openai-key-here"
ANTHROPIC_API_KEY="your-anthropic-key-here"
EOF

# Install python-dotenv for env management
pip install python-dotenv

5.2 Project Structure

prompt-harness/
โ”œโ”€โ”€ harness/
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ loader.py           # Test file parsing
โ”‚   โ”œโ”€โ”€ runner.py           # Test execution engine
โ”‚   โ”œโ”€โ”€ validators.py       # Invariant validators
โ”‚   โ”œโ”€โ”€ reporter.py         # Report generation
โ”‚   โ”œโ”€โ”€ llm_client.py       # LLM API abstraction
โ”‚   โ”œโ”€โ”€ models.py           # Data structures
โ”‚   โ””โ”€โ”€ cli.py              # CLI interface
โ”œโ”€โ”€ tests/
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ test_loader.py
โ”‚   โ”œโ”€โ”€ test_validators.py
โ”‚   โ””โ”€โ”€ test_runner.py
โ”œโ”€โ”€ prompts/
โ”‚   โ”œโ”€โ”€ support_agent.yaml  # Example test suite
โ”‚   โ””โ”€โ”€ schemas/
โ”‚       โ””โ”€โ”€ response_schema.json
โ”œโ”€โ”€ templates/
โ”‚   โ””โ”€โ”€ report.html.jinja2  # HTML report template
โ”œโ”€โ”€ results/
โ”‚   โ””โ”€โ”€ .gitkeep            # Store test run results
โ”œโ”€โ”€ reports/
โ”‚   โ””โ”€โ”€ .gitkeep            # Generated HTML reports
โ”œโ”€โ”€ pyproject.toml          # Project metadata
โ”œโ”€โ”€ requirements.txt        # Dependencies
โ”œโ”€โ”€ .env                    # API keys (gitignored)
โ”œโ”€โ”€ .gitignore
โ””โ”€โ”€ README.md

5.3 Implementation Phases

Phase 1: Foundation (Day 1)

Goals:

  • Set up project structure with proper Python packaging
  • Implement test file loader
  • Create basic data structures
  • Write unit tests for loader

Tasks:

  1. Create Project Scaffold
    # Initialize project
    mkdir -p harness tests prompts templates results reports
    touch harness/__init__.py tests/__init__.py
    
  2. Implement Data Models (harness/models.py)
    # See data structures in section 4.3 above
    # Copy the dataclass definitions
    
  3. Implement YAML Test Loader (harness/loader.py)

import yaml
from pathlib import Path
from .models import TestSuite, TestCase, Invariant, InvariantType

class TestLoader:
    """Loads and parses test suite files"""

    def load_from_file(self, filepath: Path) -> TestSuite:
        """Load test suite from YAML file"""
        with open(filepath, 'r') as f:
            data = yaml.safe_load(f)

        return self._parse_suite(data)

    def _parse_suite(self, data: dict) -> TestSuite:
        """Parse raw YAML data into TestSuite object"""
        test_cases = [
            self._parse_test_case(tc)
            for tc in data.get('test_cases', [])
        ]

        return TestSuite(
            name=data['name'],
            version=data['version'],
            test_cases=test_cases
        )

    def _parse_test_case(self, data: dict) -> TestCase:
        """Parse a single test case"""
        invariants = [
            self._parse_invariant(inv)
            for inv in data.get('invariants', [])
        ]

        return TestCase(
            id=data['id'],
            category=data['category'],
            input=data['input'],
            context=data.get('context'),
            expected_invariants=invariants,
            metadata=data.get('metadata', {})
        )

    def _parse_invariant(self, data: dict) -> Invariant:
        """Parse an invariant definition"""
        return Invariant(
            type=InvariantType(data['type']),
            name=data['name'],
            spec=data.get('spec', {})
        )
  4. Create Sample Test File (prompts/support_agent.yaml)

name: "Customer Support Evals"
version: "1.0.0"

test_cases:
  - id: "simple_refund_query"
    category: "Refund"
    input: "I want to return my order #12345"
    context: |
      Policy: Customers can return items within 30 days.
      Order #12345 was placed 10 days ago.
    invariants:
      - type: "json_schema"
        name: "Valid JSON Schema"
        spec:
          schema_file: "prompts/schemas/response_schema.json"
      - type: "contains"
        name: "Has Citation"
        spec:
          substring: "policy_doc"
      - type: "contains"
        name: "Contains Order ID"
        spec:
          substring: "#12345"
  5. Write Unit Tests (tests/test_loader.py)

from pathlib import Path
from harness.loader import TestLoader

def test_load_valid_suite():
    loader = TestLoader()
    suite = loader.load_from_file(Path("prompts/support_agent.yaml"))

    assert suite.name == "Customer Support Evals"
    assert len(suite.test_cases) > 0

def test_parse_invariants():
    loader = TestLoader()
    suite = loader.load_from_file(Path("prompts/support_agent.yaml"))

    test_case = suite.test_cases[0]
    assert len(test_case.expected_invariants) == 3
    assert test_case.expected_invariants[0].name == "Valid JSON Schema"

Checkpoint: Can load and parse a sample test YAML file, verified by passing unit tests.

Phase 2: Core Execution (Days 2-3)

Goals:

  • Implement LLM API client with retry logic
  • Build test runner that executes cases
  • Create core validators
  • Add rate limiting and error handling

Tasks:

  1. Implement LLM Client (harness/llm_client.py)

import time
from dataclasses import dataclass
from openai import OpenAI

@dataclass
class LLMResponse:
    text: str
    latency_ms: float
    tokens_used: int
    model: str

class LLMClient:
    """Abstract interface for LLM API calls"""

    def __init__(self, provider: str, model: str, temperature: float = 0.0):
        self.provider = provider
        self.model = model
        self.temperature = temperature
        self._openai = OpenAI() if provider == "openai" else None

    def complete(self, prompt: str, max_retries: int = 3) -> LLMResponse:
        """Call LLM with exponential backoff retry"""
        for attempt in range(max_retries):
            try:
                return self._call_api(prompt)
            except Exception:
                if attempt == max_retries - 1:
                    raise
                wait_time = 2 ** attempt
                time.sleep(wait_time)

    def _call_api(self, prompt: str) -> LLMResponse:
        """Make actual API call (provider-specific)"""
        start = time.time()

        if self.provider == "openai":
            response = self._openai.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=self.temperature
            )
            text = response.choices[0].message.content
            tokens = response.usage.total_tokens
        else:
            raise ValueError(f"Unknown provider: {self.provider}")

        latency_ms = (time.time() - start) * 1000

        return LLMResponse(
            text=text,
            latency_ms=latency_ms,
            tokens_used=tokens,
            model=self.model
        )
  2. Implement Validators (harness/validators.py)

import json
from abc import ABC, abstractmethod
from typing import Dict, Any
from jsonschema import validate, ValidationError
from .models import ValidationResult

class Validator(ABC):
    """Base class for all validators"""

    @abstractmethod
    def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
        pass

class JSONSchemaValidator(Validator):
    """Validates response against JSON Schema"""

    def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
        try:
            # Parse response as JSON
            data = json.loads(response)

            # Load schema from file
            with open(spec['schema_file'], 'r') as f:
                schema = json.load(f)

            # Validate
            validate(instance=data, schema=schema)

            return ValidationResult(
                invariant_name="JSON Schema",
                passed=True,
                error_message=None,
                details={"schema_file": spec['schema_file']}
            )
        except json.JSONDecodeError as e:
            return ValidationResult(
                invariant_name="JSON Schema",
                passed=False,
                error_message=f"Invalid JSON: {str(e)}",
                details={}
            )
        except ValidationError as e:
            return ValidationResult(
                invariant_name="JSON Schema",
                passed=False,
                error_message=f"Schema validation failed: {e.message}",
                details={"path": list(e.path)}
            )

class ContainsValidator(Validator):
    """Validates that response contains a substring"""

    def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
        substring = spec['substring']
        passed = substring in response

        return ValidationResult(
            invariant_name=f"Contains '{substring}'",
            passed=passed,
            error_message=None if passed else f"Response does not contain '{substring}'",
            details={"substring": substring}
        )

class LengthValidator(Validator):
    """Validates response length"""

    def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
        min_len = spec.get('min', 0)
        max_len = spec.get('max', float('inf'))
        actual_len = len(response)

        passed = min_len <= actual_len <= max_len

        return ValidationResult(
            invariant_name="Length",
            passed=passed,
            error_message=None if passed else f"Length {actual_len} not in range [{min_len}, {max_len}]",
            details={"actual": actual_len, "min": min_len, "max": max_len}
        )

# Validator factory
VALIDATORS = {
    "json_schema": JSONSchemaValidator(),
    "contains": ContainsValidator(),
    "length": LengthValidator()
}

def get_validator(validator_type: str) -> Validator:
    """Get validator instance by type"""
    return VALIDATORS.get(validator_type)


  3. Implement Test Runner (harness/runner.py)
import time
from typing import List
from .models import TestSuite, TestResult, TestReport, ValidationResult
from .llm_client import LLMClient
from .validators import get_validator

class TestRunner:
    """Executes test suites"""

    def __init__(self, llm_client: LLMClient):
        self.llm_client = llm_client

    def run_suite(self, suite: TestSuite) -> TestReport:
        """Execute all tests in suite"""
        results = []
        start_time = time.time()

        for test_case in suite.test_cases:
            # Build prompt
            prompt = self._build_prompt(test_case)

            # Call LLM
            llm_response = self.llm_client.complete(prompt)

            # Validate against invariants
            validation_results = []
            for invariant in test_case.expected_invariants:
                validator = get_validator(invariant.type.value)
                result = validator.validate(llm_response.text, invariant.spec)
                validation_results.append(result)

            # Determine pass/fail
            test_passed = all(v.passed for v in validation_results)

            # Store result
            results.append(TestResult(
                test_case=test_case,
                response=llm_response.text,
                passed=test_passed,
                validation_results=validation_results,
                latency_ms=llm_response.latency_ms,
                tokens_used=llm_response.tokens_used,
                timestamp=time.strftime("%Y-%m-%d %H:%M:%S")
            ))

        total_time = time.time() - start_time

        return self._generate_report(suite, results, total_time)

    def _build_prompt(self, test_case):
        """Construct prompt from test case"""
        parts = []
        if test_case.context:
            parts.append(f"Context:\n{test_case.context}\n")
        parts.append(f"Query: {test_case.input}")
        return "\n".join(parts)

    def _generate_report(self, suite, results, total_time):
        """Generate aggregate report"""
        passed = sum(1 for r in results if r.passed)
        failed = len(results) - passed

        return TestReport(
            suite_name=suite.name,
            version=suite.version,
            total_cases=len(results),
            passed_cases=passed,
            failed_cases=failed,
            success_rate=passed / len(results) if results else 0.0,
            total_time_s=total_time,
            avg_latency_ms=sum(r.latency_ms for r in results) / len(results) if results else 0.0,
            results=results,
            regression_info=None
        )

Checkpoint: Can run tests against real LLM API and validate basic invariants.

Phase 3: Reporting & Polish (Days 4-5)

Goals:

  • Generate HTML and JSON reports
  • Add regression detection
  • Polish CLI interface with rich formatting
  • Add detailed failure analysis

Tasks:

  1. Implement Reporter (harness/reporter.py)

import json
from pathlib import Path
from jinja2 import Environment, FileSystemLoader
from .models import TestReport

class Reporter:
    """Generates test reports in various formats"""

    def __init__(self, template_dir: Path):
        self.jinja_env = Environment(loader=FileSystemLoader(template_dir))

    def generate_html(self, report: TestReport, output_path: Path):
        """Generate HTML report"""
        template = self.jinja_env.get_template('report.html.jinja2')

        html = template.render(
            report=report,
            failure_breakdown=self._get_failure_breakdown(report)
        )

        with open(output_path, 'w') as f:
            f.write(html)

    def generate_json(self, report: TestReport, output_path: Path):
        """Generate JSON report for CI/CD"""
        data = {
            "suite_name": report.suite_name,
            "version": report.version,
            "summary": {
                "total": report.total_cases,
                "passed": report.passed_cases,
                "failed": report.failed_cases,
                "success_rate": report.success_rate
            },
            "results": [
                {
                    "test_id": r.test_case.id,
                    "passed": r.passed,
                    "latency_ms": r.latency_ms,
                    "failures": [
                        v.invariant_name
                        for v in r.validation_results
                        if not v.passed
                    ]
                }
                for r in report.results
            ]
        }

        with open(output_path, 'w') as f:
            json.dump(data, f, indent=2)

    def _get_failure_breakdown(self, report: TestReport):
        """Group failures by invariant type"""
        breakdown = {}
        for result in report.results:
            if not result.passed:
                for validation in result.validation_results:
                    if not validation.passed:
                        if validation.invariant_name not in breakdown:
                            breakdown[validation.invariant_name] = []
                        breakdown[validation.invariant_name].append(result.test_case.id)
        return breakdown
  2. Create HTML Template (templates/report.html.jinja2)

<!DOCTYPE html>
<html>
<head>
  <title>{{ report.suite_name }} - Test Report</title>
</head>
<body>
  <h1>{{ report.suite_name }}</h1>
  <p>Version: {{ report.version }}</p>
  <p>Success Rate: {{ "%.1f"|format(report.success_rate * 100) }}%</p>
  <p>Total Cases: {{ report.total_cases }} | Passed: {{ report.passed_cases }} | Failed: {{ report.failed_cases }}</p>

  <h2>Test Results</h2>
  {% for result in report.results %}
  <div class="result">
    <h3>{{ "PASS" if result.passed else "FAIL" }} - {{ result.test_case.id }}</h3>
    <p>Input: {{ result.test_case.input }}</p>
    <p>Latency: {{ result.latency_ms|round(0) }}ms</p>
    <ul>
      {% for validation in result.validation_results %}
      <li>
        {{ "✓" if validation.passed else "✗" }} {{ validation.invariant_name }}
        {% if not validation.passed %}<em>{{ validation.error_message }}</em>{% endif %}
      </li>
      {% endfor %}
    </ul>
  </div>
  {% endfor %}

  {% if failure_breakdown %}
  <h2>Failure Breakdown</h2>
  <ul>
    {% for invariant, cases in failure_breakdown.items() %}
    <li>{{ invariant }}: {{ cases|length }} failure(s) - {{ cases|join(", ") }}</li>
    {% endfor %}
  </ul>
  {% endif %}
</body>
</html>

  3. Build CLI Interface (harness/cli.py)
import click
from pathlib import Path
from rich.console import Console
from rich.progress import track
from .loader import TestLoader
from .runner import TestRunner
from .llm_client import LLMClient
from .reporter import Reporter

console = Console()

@click.group()
def cli():
    """Prompt Contract Harness - Test your prompts like code"""
    pass

@cli.command()
@click.argument('test_file', type=click.Path(exists=True))
@click.option('--model', default='gpt-4', help='LLM model to use')
@click.option('--temperature', default=0.0, help='Sampling temperature')
@click.option('--provider', default='openai', help='LLM provider')
def test(test_file, model, temperature, provider):
    """Run test suite"""
    console.print(f"[bold blue]Loading test suite:[/bold blue] {test_file}")

    # Load tests
    loader = TestLoader()
    suite = loader.load_from_file(Path(test_file))
    console.print(f"Found {len(suite.test_cases)} test cases")

    # Run tests
    client = LLMClient(provider=provider, model=model, temperature=temperature)
    runner = TestRunner(client)

    console.print("[bold green]Running tests...[/bold green]")
    report = runner.run_suite(suite)

    # Print summary
    console.print(f"\n[bold]Results:[/bold]")
    console.print(f"Success Rate: {report.success_rate * 100:.1f}%")
    console.print(f"Passed: {report.passed_cases}/{report.total_cases}")

    # Generate reports
    reporter = Reporter(Path("templates"))
    html_path = Path(f"reports/run_{report.version}.html")
    reporter.generate_html(report, html_path)
    console.print(f"\n[bold]Report saved to:[/bold] {html_path}")

if __name__ == '__main__':
    cli()
  4. Add Regression Detection (in harness/runner.py, add this method)

def compare_with_baseline(self, current: TestReport, baseline: TestReport) -> dict:
    """Detect regression between runs"""
    score_delta = current.success_rate - baseline.success_rate

    current_failures = {r.test_case.id for r in current.results if not r.passed}
    baseline_failures = {r.test_case.id for r in baseline.results if not r.passed}

    return {
        "regression_detected": score_delta < -0.05,  # 5% threshold
        "score_delta": score_delta,
        "new_failures": list(current_failures - baseline_failures),
        "fixed_cases": list(baseline_failures - current_failures)
    }

Checkpoint: Full working harness with beautiful reports and regression detection.

5.4 Key Implementation Decisions

| Decision | Options Considered | Recommendation | Rationale |
| --- | --- | --- | --- |
| Test Format | JSON, YAML, Python code | YAML | Human-readable, supports comments, widely understood |
| Async vs Sync | asyncio, threading, sequential | asyncio | Better rate limit handling, can parallelize independent tests |
| Validation Architecture | Inline checks, plugin system | Plugin system | Extensibility for custom validators, clean separation |
| Report Format | HTML only, JSON only, both | Both | HTML for humans, JSON for CI/CD integration |
| Results Storage | SQLite, JSON files, CSV | JSON files | Simple, portable, version-controllable |
| CLI Framework | argparse, click, typer | click | Rich ecosystem, good documentation, decorator-based |
| Progress Display | Print statements, rich library | rich library | Beautiful terminal output, progress bars, colors (sketched below) |
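
The progress-display recommendation can be sketched with rich's track helper; the loop body here is only a stand-in for the real LLM call and invariant checks.

import time
from rich.progress import track

case_ids = ["simple_refund_query", "refund_outside_window", "ambiguous_policy_query"]

for case_id in track(case_ids, description="Running test cases..."):
    time.sleep(0.1)   # stand-in for LLM call + validation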

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Coverage Target | Examples |
| --- | --- | --- | --- |
| Unit Tests | Test individual components in isolation | 90%+ | Validator logic, loader parsing, data models |
| Integration Tests | Test end-to-end flow with mocked LLM | 80%+ | Full test run with mock API, report generation |
| Smoke Tests | Quick sanity checks | Critical paths | Sample test suite passes, CLI loads |
| Regression Tests | Ensure bugs stay fixed | All bug fixes | Previously failing cases now pass |

6.2 Critical Test Cases

Unit Tests

Test: Loader Handles Invalid YAML

def test_loader_invalid_yaml():
    loader = TestLoader()
    with pytest.raises(yaml.YAMLError):
        loader.load_from_file(Path("tests/fixtures/invalid.yaml"))

Test: JSONSchema Validator Detects Type Errors

def test_json_schema_type_error():
    validator = JSONSchemaValidator()
    response = '{"age": "twenty-five"}'  # Should be int
    spec = {"schema_file": "tests/fixtures/user_schema.json"}

    result = validator.validate(response, spec)
    assert not result.passed
    assert "type" in result.error_message.lower()

Test: Contains Validator Case Sensitivity

def test_contains_case_sensitive():
    validator = ContainsValidator()
    response = "The user wants a REFUND"
    spec = {"substring": "refund", "case_sensitive": True}

    result = validator.validate(response, spec)
    assert not result.passed  # "REFUND" != "refund"

Integration Tests

Test: Full Run with Mocked LLM

def test_full_run_with_mock(mocker):
    # Mock LLM to return deterministic responses
    mock_client = mocker.Mock(spec=LLMClient)
    mock_client.complete.return_value = LLMResponse(
        text='{"answer": "test", "confidence": 0.9}',
        latency_ms=100,
        tokens_used=50,
        model="gpt-4"
    )

    loader = TestLoader()
    suite = loader.load_from_file(Path("tests/fixtures/sample.yaml"))

    runner = TestRunner(mock_client)
    report = runner.run_suite(suite)

    assert report.total_cases > 0
    assert report.success_rate >= 0.0

Test: Regression Detection Works

def test_regression_detection():
    # Create two reports with different scores
    baseline = TestReport(
        suite_name="test",
        version="1.0",
        total_cases=10,
        passed_cases=10,
        failed_cases=0,
        success_rate=1.0,
        # ...
    )

    current = TestReport(
        suite_name="test",
        version="1.1",
        total_cases=10,
        passed_cases=8,
        failed_cases=2,
        success_rate=0.8,
        # ...
    )

    runner = TestRunner(None)
    regression = runner.compare_with_baseline(current, baseline)

    assert regression["regression_detected"] == True
    assert regression["score_delta"] == -0.2

6.3 Test Data

Sample Test Suite (tests/fixtures/sample.yaml)

name: "Sample Test Suite"
version: "1.0.0"

test_cases:
  - id: "valid_json_response"
    category: "Format"
    input: "Return user info as JSON"
    invariants:
      - type: "json_schema"
        name: "Valid JSON"
        spec:
          schema_file: "tests/fixtures/user_schema.json"

  - id: "contains_citation"
    category: "Grounding"
    input: "What is our refund policy?"
    context: "Policy: 30-day returns allowed"
    invariants:
      - type: "contains"
        name: "Mentions Policy"
        spec:
          substring: "30-day"

JSON Schema (tests/fixtures/user_schema.json)

{
  "type": "object",
  "properties": {
    "answer": {"type": "string"},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
  },
  "required": ["answer", "confidence"]
}

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Root Cause | Solution |
| --- | --- | --- | --- |
| No Rate Limiting | 429 Too Many Requests errors | Sending requests too fast to API | Implement exponential backoff, use async with semaphore for concurrency control |
| Flaky Tests at High Temp | Inconsistent pass/fail results | Non-deterministic sampling (temp > 0) | Use temp=0 for deterministic tests, or run multiple samples and aggregate |
| Poor Error Messages | "Test failed" with no context | Generic exception handling | Include full context: expected vs actual, test case ID, invariant name |
| Missing Token Tracking | Unexpected API bills | Not logging token usage | Log tokens per request, aggregate in report, set budget alerts |
| Timeout Handling | Tests hang indefinitely | No timeout on API calls | Set request timeout (e.g., 30s), retry with backoff |
| Schema Path Issues | "Schema file not found" | Relative paths break in different contexts | Use absolute paths or paths relative to project root |
| Not Validating Test Files | Cryptic errors during execution | Malformed YAML loaded without validation | Validate test file structure before running tests |
| Hardcoded API Keys | Security vulnerability | Keys in source code | Use environment variables, never commit keys to git |

7.2 Debugging Strategies

API Issues

Problem: Getting 401 Unauthorized errors

Debug Steps:

  1. Enable verbose logging to see full request/response
    import logging
    logging.basicConfig(level=logging.DEBUG)
    
  2. Verify API key is loaded correctly
    import os
    print(f"API Key loaded: {os.getenv('OPENAI_API_KEY')[:10]}...")
    
  3. Test API key with minimal curl request
    curl https://api.openai.com/v1/models \
      -H "Authorization: Bearer $OPENAI_API_KEY"
    

Validation Failures

Problem: Can't understand why invariant failed

Debug Strategy:

  1. Print the full response before validation
    print(f"Raw response: {response}")
    print(f"Expected: {invariant.spec}")
    
  2. Add detailed error messages with context
    return ValidationResult(
     invariant_name="JSON Schema",
     passed=False,
     error_message=f"Schema validation failed at path '{path}': expected {expected_type}, got {actual_type}",
     details={
         "expected": expected,
         "actual": actual,
         "raw_response": response[:200]  # First 200 chars
     }
    )
    
  3. Create minimal reproduction test
    def test_minimal_repro():
     validator = JSONSchemaValidator()
     response = '{"age": "twenty-five"}'  # Exact failing response
     result = validator.validate(response, spec)
     print(f"Failed: {result.error_message}")
    

Performance Issues

Problem: Tests running too slowly

Profile with cProfile:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()

# Run your test suite
runner.run_suite(suite)

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)  # Top 10 slowest functions

Common Bottlenecks:

  • Sequential API calls → Parallelize with asyncio
  • Loading schema repeatedly → Cache schema in memory (see the sketch below)
  • Large responses → Stream instead of loading full response
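
A minimal sketch of the schema-caching fix with functools.lru_cache; load_schema is an assumed helper name, not part of the harness code above.

import json
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=None)
def load_schema(schema_file: str) -> dict:
    """Parse a JSON Schema once and reuse it across every test case that references it."""
    return json.loads(Path(schema_file).read_text())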

7.3 Performance Traps

Memory Issues

Problem: Loading all responses into memory for large test suites

Bad:

responses = [llm_client.complete(prompt) for test in tests]  # Loads all at once

Good:

for test in tests:
    response = llm_client.complete(test.prompt)
    result = validate(response)
    save_result(result)  # Process and save incrementally
    del response  # Free memory

Cost Optimization

Problem: Racking up API bills during development

Solution: Response Caching

import hashlib
import json
from pathlib import Path

class CachingLLMClient:
    def __init__(self, client, cache_dir=Path(".cache")):
        self.client = client
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(exist_ok=True)

    def complete(self, prompt: str):
        # Generate cache key from prompt
        key = hashlib.md5(prompt.encode()).hexdigest()
        cache_file = self.cache_dir / f"{key}.json"

        # Check cache
        if cache_file.exists():
            with open(cache_file) as f:
                return LLMResponse(**json.load(f))

        # Call API
        response = self.client.complete(prompt)

        # Save to cache
        with open(cache_file, 'w') as f:
            json.dump(response.__dict__, f)

        return response

8. Extensions & Challenges

8.1 Beginner Extensions

Extension 1: CSV Export Format

What: Add ability to export results as CSV for Excel analysis
Why: Non-technical stakeholders may prefer spreadsheets
How: Use Python's csv module to write results row by row

import csv

def export_to_csv(report: TestReport, output_path: Path):
    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Test ID', 'Category', 'Passed', 'Latency (ms)', 'Failed Invariants'])
        for result in report.results:
            failed_inv = ', '.join(v.invariant_name for v in result.validation_results if not v.passed)
            writer.writerow([
                result.test_case.id,
                result.test_case.category,
                result.passed,
                result.latency_ms,
                failed_inv
            ])

Challenge: Add a --format csv flag to CLI
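
One way the flag might be wired into a click command, sketched under the assumption of a run_suite_for helper that wraps the loader and runner (not defined in the harness above); it reuses the export_to_csv helper shown above.

import click
from pathlib import Path

@click.command()
@click.argument('test_file', type=click.Path(exists=True))
@click.option('--format', 'output_format',
              type=click.Choice(['html', 'json', 'csv']), default='html',
              help='Report format to generate')
def test(test_file, output_format):
    """Run the suite, then export in the requested format."""
    report = run_suite_for(test_file)   # assumed helper: load suite, run it, return a TestReport
    if output_format == 'csv':
        export_to_csv(report, Path('reports/results.csv'))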

Extension 2: Email Notifications for Regression

What: Send email alert when regression detected
Why: Get notified immediately without checking dashboard
How: Use smtplib to send email via SMTP server

import smtplib
from email.mime.text import MIMEText

def send_regression_alert(report: TestReport, config: dict):
    if not report.regression_info or not report.regression_info['regression_detected']:
        return

    body = f"""
    Regression detected in {report.suite_name}!

    Success rate dropped: {report.regression_info['score_delta']:.1%}
    New failures: {', '.join(report.regression_info['new_failures'])}

    See report: {config['report_url']}
    """

    msg = MIMEText(body)
    msg['Subject'] = f"Prompt Regression Alert: {report.suite_name}"
    msg['From'] = config['from_email']
    msg['To'] = config['to_email']

    with smtplib.SMTP(config['smtp_server']) as server:
        server.send_message(msg)

Challenge: Make email template customizable with Jinja2
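
A minimal sketch of a Jinja2-templated alert body; the template text and example values are illustrative only.

from jinja2 import Template

ALERT_TEMPLATE = Template("""\
Regression detected in {{ suite_name }}!
Success rate dropped by {{ "%.1f"|format(-score_delta * 100) }} points.
New failures: {{ new_failures | join(", ") }}
""")

body = ALERT_TEMPLATE.render(
    suite_name="Customer Support Evals",
    score_delta=-0.063,
    new_failures=["ambiguous_policy_query"],
)
print(body)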

Extension 3: Pre-built Validator Library

What: Create validators for common patterns (email, phone, PII)
Why: Don't reinvent the wheel for standard checks
How: Use regex and libraries like phonenumbers, email-validator

import re
from email_validator import validate_email

class EmailValidator(Validator):
    """Detects email addresses in response"""

    def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
        email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
        found_emails = re.findall(email_pattern, response)

        should_contain = spec.get('should_contain', False)
        passed = (len(found_emails) > 0) == should_contain

        return ValidationResult(
            invariant_name="Email Detection",
            passed=passed,
            error_message=f"Found {len(found_emails)} email(s): {found_emails}" if not passed else None,
            details={"emails_found": found_emails}
        )

Challenge: Create validators for: SSN, credit card numbers, phone numbers, addresses
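
As a starting point for the challenge, a hedged sketch of a PII check built on the same Validator base class and ValidationResult from the harness; the regex patterns are deliberately simplified and would need hardening for real use.

import re
from typing import Dict, Any
from .models import ValidationResult
from .validators import Validator

PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",          # simplified US SSN shape
    "us_phone": r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",
}

class PIIValidator(Validator):
    """Fails if the response contains anything that looks like PII."""

    def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
        hits = {name: re.findall(pattern, response)
                for name, pattern in PII_PATTERNS.items()
                if re.search(pattern, response)}
        passed = not hits
        return ValidationResult(
            invariant_name="No PII",
            passed=passed,
            error_message=f"Possible PII found: {sorted(hits)}" if not passed else None,
            details={"matches": hits},
        )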

8.2 Intermediate Extensions

Extension 4: LLM-as-a-Judge Validators

What: Use another LLM to evaluate subjective qualities (tone, helpfulness)
Why: Some invariants can't be captured by regex or schemas
How: Call LLM with structured prompt to grade responses

class ToneValidator(Validator):
    """Uses LLM to evaluate tone"""

    def __init__(self, judge_client: LLMClient):
        self.judge = judge_client

    def validate(self, response: str, spec: Dict[str, Any]) -> ValidationResult:
        expected_tone = spec['expected_tone']  # e.g., "professional and empathetic"

        judge_prompt = f"""
        Evaluate if the following response has a {expected_tone} tone.

        Response: "{response}"

        Return only a JSON object:
        {{
            "score": <0.0-1.0>,
            "reasoning": "<brief explanation>"
        }}
        """

        judge_response = self.judge.complete(judge_prompt)
        result = json.loads(judge_response.text)

        threshold = spec.get('min_score', 0.7)
        passed = result['score'] >= threshold

        return ValidationResult(
            invariant_name=f"Tone: {expected_tone}",
            passed=passed,
            error_message=f"Tone score {result['score']:.2f} below threshold {threshold}. Reasoning: {result['reasoning']}" if not passed else None,
            details=result
        )

Challenge: Create LLM judges for: factual accuracy, relevance, completeness

Extension 5: Parametric Sweeps

What: Test across multiple models, temperatures, and prompts
Why: Find optimal configuration for your use case
How: Add nested loops to test all combinations

def parametric_sweep(suite: TestSuite, sweep_config: dict) -> List[TestReport]:
    """Run tests across parameter grid"""
    reports = []

    for model in sweep_config['models']:
        for temp in sweep_config['temperatures']:
            for prompt_version in sweep_config['prompt_versions']:
                client = LLMClient(model=model, temperature=temp)
                runner = TestRunner(client)

                # Modify suite with prompt version
                versioned_suite = apply_prompt_version(suite, prompt_version)

                report = runner.run_suite(versioned_suite)
                report.metadata = {
                    'model': model,
                    'temperature': temp,
                    'prompt_version': prompt_version
                }
                reports.append(report)

    return reports

# Usage
sweep_config = {
    'models': ['gpt-4', 'gpt-3.5-turbo', 'claude-3-opus'],
    'temperatures': [0.0, 0.3, 0.7],
    'prompt_versions': ['v1', 'v2', 'v3']
}
results = parametric_sweep(suite, sweep_config)

Challenge: Create heatmap visualization showing model vs temperature performance
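
A minimal sketch of the heatmap with matplotlib, assuming each report still carries the metadata dict attached in parametric_sweep above.

import numpy as np
import matplotlib.pyplot as plt

def plot_sweep_heatmap(reports, models, temperatures, output="reports/sweep_heatmap.png"):
    # Fill a models x temperatures grid with success rates from the sweep.
    grid = np.zeros((len(models), len(temperatures)))
    for report in reports:
        i = models.index(report.metadata['model'])
        j = temperatures.index(report.metadata['temperature'])
        grid[i, j] = report.success_rate

    fig, ax = plt.subplots()
    im = ax.imshow(grid, vmin=0.0, vmax=1.0, cmap='viridis')
    ax.set_xticks(range(len(temperatures)))
    ax.set_xticklabels([str(t) for t in temperatures])
    ax.set_yticks(range(len(models)))
    ax.set_yticklabels(models)
    ax.set_xlabel('temperature')
    ax.set_ylabel('model')
    fig.colorbar(im, ax=ax, label='success rate')
    fig.savefig(output)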

Extension 6: Web UI for Results

What: Interactive dashboard to explore test results
Why: Easier than opening HTML files, supports filtering/sorting
How: Use Flask or FastAPI to serve results from database

from flask import Flask, render_template, jsonify
import sqlite3

app = Flask(__name__)

@app.route('/')
def dashboard():
    return render_template('dashboard.html')

@app.route('/api/runs')
def get_runs():
    conn = sqlite3.connect('results.db')
    conn.row_factory = sqlite3.Row   # rows behave like dicts, so dict(run) works below
    runs = conn.execute('SELECT * FROM test_runs ORDER BY timestamp DESC LIMIT 50').fetchall()
    return jsonify([dict(run) for run in runs])

@app.route('/api/run/<run_id>')
def get_run_details(run_id):
    conn = sqlite3.connect('results.db')
    conn.row_factory = sqlite3.Row
    results = conn.execute('SELECT * FROM test_results WHERE run_id = ?', (run_id,)).fetchall()
    return jsonify([dict(r) for r in results])

Challenge: Add real-time updates using WebSockets as tests run

8.3 Advanced Extensions

Extension 7: CI/CD Integration

What: Run harness in GitHub Actions/Jenkins on every commit
Why: Catch regressions before merging to main
How: Create GitHub Action workflow

# .github/workflows/prompt-tests.yml
name: Prompt Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.9

      - name: Install dependencies
        run: |
          pip install -r requirements.txt

      - name: Run prompt tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python harness.py test prompts/*.yaml --format junit --min-score 95

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v2
        with:
          name: test-results
          path: reports/

Challenge: Add step to post test results as PR comment

Extension 8: A/B Testing Framework

What: Compare two prompt versions statistically
Why: Know with confidence which prompt performs better
How: Run both prompts on same test set, compute significance

import numpy as np
from scipy import stats

def ab_test(prompt_a: str, prompt_b: str, test_set: TestSuite) -> dict:
    """Run A/B test between two prompts"""
    # Run tests with both prompts
    runner_a = TestRunner(LLMClient(prompt=prompt_a))
    runner_b = TestRunner(LLMClient(prompt=prompt_b))

    results_a = [r.passed for r in runner_a.run_suite(test_set).results]
    results_b = [r.passed for r in runner_b.run_suite(test_set).results]

    # Statistical significance test (paired t-test)
    t_stat, p_value = stats.ttest_rel(results_a, results_b)

    # Effect size (Cohen's d)
    mean_diff = np.mean(results_a) - np.mean(results_b)
    pooled_std = np.sqrt((np.std(results_a)**2 + np.std(results_b)**2) / 2)
    cohens_d = mean_diff / pooled_std

    return {
        'prompt_a_score': np.mean(results_a),
        'prompt_b_score': np.mean(results_b),
        'p_value': p_value,
        'significant': p_value < 0.05,
        'effect_size': cohens_d,
        'recommendation': 'Use Prompt B' if np.mean(results_b) > np.mean(results_a) and p_value < 0.05 else 'Use Prompt A'
    }

Challenge: Add Bayesian A/B testing for continuous monitoring
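
For the Bayesian variant of the challenge, a minimal sketch using Beta-Bernoulli posteriors over each prompt's pass rate (the uniform Beta(1, 1) prior is an assumption):

import numpy as np

def bayesian_ab(results_a, results_b, samples: int = 100_000) -> dict:
    """Monte Carlo estimate of P(prompt B's pass rate > prompt A's)."""
    rng = np.random.default_rng(0)
    # Posterior is Beta(1 + passes, 1 + failures) under a uniform prior
    post_a = rng.beta(1 + sum(results_a), 1 + len(results_a) - sum(results_a), samples)
    post_b = rng.beta(1 + sum(results_b), 1 + len(results_b) - sum(results_b), samples)
    return {
        'p_b_better': float(np.mean(post_b > post_a)),
        'expected_lift': float(np.mean(post_b - post_a)),
        'b_pass_rate_95ci': [float(x) for x in np.percentile(post_b, [2.5, 97.5])],
    }

Unlike a fixed-sample p-value, p_b_better can simply be recomputed as new results arrive, which is what makes this formulation convenient for continuous monitoring.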

Extension 9: Statistical Significance Testing

  • What: Don't just report score changes, report if they're statistically significant
  • Why: Avoid false positives from random variance
  • How: Use hypothesis testing

from typing import List

from scipy import stats

def is_regression_significant(current: List[bool], baseline: List[bool], alpha: float = 0.05) -> dict:
    """
    Test whether current performance differs significantly from baseline.

    Uses McNemar's test for paired binary data (each index is the same
    test case evaluated in both runs).
    """
    # Build the contingency table over paired outcomes
    both_pass = sum(c and b for c, b in zip(current, baseline))
    current_fail_baseline_pass = sum(not c and b for c, b in zip(current, baseline))
    current_pass_baseline_fail = sum(c and not b for c, b in zip(current, baseline))
    both_fail = sum(not c and not b for c, b in zip(current, baseline))

    # McNemar's statistic depends only on the discordant pairs
    discordant = current_fail_baseline_pass + current_pass_baseline_fail
    if discordant == 0:
        statistic, p_value = 0.0, 1.0  # identical outcomes: no evidence of change
    else:
        statistic = (current_fail_baseline_pass - current_pass_baseline_fail)**2 / discordant
        p_value = 1 - stats.chi2.cdf(statistic, df=1)

    return {
        'statistically_significant': p_value < alpha,
        'p_value': p_value,
        'contingency_table': {
            'both_pass': both_pass,
            'regression_cases': current_fail_baseline_pass,
            'improvement_cases': current_pass_baseline_fail,
            'both_fail': both_fail
        }
    }
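
A quick usage sketch with synthetic paired outcomes (index i is the same test case in both runs; the numbers are made up for illustration):

# 100 paired outcomes: 7 cases that passed at baseline now fail, nothing improved
baseline = [True] * 92 + [False] * 8
current = [True] * 85 + [False] * 15
print(is_regression_significant(current, baseline))
# -> statistically_significant: True (p ~= 0.008), 7 regression cases, 0 improvements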

Challenge: Implement sequential testing for early stopping when regression is detected

Extension 10: Distributed Test Execution

  • What: Run tests across multiple machines for large suites
  • Why: Scale to 1000+ test cases
  • How: Use Celery or Ray for distributed task execution

from celery import Celery

app = Celery('prompt_harness', broker='redis://localhost:6379')

@app.task
def run_single_test(test_case_data: dict, llm_config: dict) -> dict:
    """Execute single test case (runs on worker)"""
    test_case = TestCase(**test_case_data)
    client = LLMClient(**llm_config)
    runner = TestRunner(client)

    # Run just this one test
    result = runner.run_single_test(test_case)
    return result.dict()

def run_distributed_suite(suite: TestSuite, llm_config: dict) -> TestReport:
    """Distribute test execution across workers"""
    # Submit all tests as async tasks
    tasks = [
        run_single_test.delay(tc.dict(), llm_config)
        for tc in suite.test_cases
    ]

    # Collect results as they complete
    results = [task.get() for task in tasks]

    # Aggregate into report
    return aggregate_results(results)
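
Ray is named above as an alternative to Celery. A sketch of the same fan-out with Ray, reusing the harness pieces assumed by the Celery example (TestCase, LLMClient, TestRunner, aggregate_results):

import ray

ray.init()  # starts a local cluster; pass an address to join an existing one

@ray.remote
def run_single_test_ray(test_case_data: dict, llm_config: dict) -> dict:
    """Each worker builds its own client/runner and returns a plain dict."""
    test_case = TestCase(**test_case_data)
    runner = TestRunner(LLMClient(**llm_config))
    return runner.run_single_test(test_case).dict()

def run_distributed_suite_ray(suite, llm_config: dict):
    futures = [run_single_test_ray.remote(tc.dict(), llm_config) for tc in suite.test_cases]
    return aggregate_results(ray.get(futures))  # same aggregation helper as above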

Challenge: Add dynamic worker scaling based on queue depth

9. Real-World Connections

9.1 Industry Applications

OpenAI Evals

OpenAI's internal evaluation framework follows the same contract-based testing pattern. They maintain a large repository of evals for different capabilities:

# Example from OpenAI Evals
{
    "id": "math.addition",
    "dataset": "math_problems.jsonl",
    "scorer": "exact_match",
    "samples": 1000
}

Key Insight: Even OpenAI needs systematic testing to ensure model updates don't regress on critical tasks.

LangSmith (LangChain)

LangChain's testing platform implements:

  • Automatic dataset collection from production traffic
  • LLM-as-a-judge evaluation
  • Regression tracking across model versions
  • A/B testing for prompt variants

Used by: Companies like Zapier, Notion, Robinhood for production LLM monitoring

Anthropic Workbench

Claude's evaluation infrastructure:

  • Constitutional AI alignment testing
  • Safety evals (jailbreak attempts, bias detection)
  • Capability benchmarks (coding, math, reasoning)

Key Insight: Testing isn't just about correctness; it's also about safety and alignment.

Microsoft Prompt Flow

Azure's prompt engineering toolkit includes:

  • Visual prompt DAG editor
  • Evaluation metrics (relevance, groundedness, coherence)
  • Integration with Azure ML for model deployment

Used by: Enterprise customers who need compliance-ready AI systems

9.2 Open-Source Tools

PromptFoo

https://github.com/promptfoo/promptfoo

What it does:

  • LLM evaluation framework focused on red-teaming and security
  • Supports multiple providers (OpenAI, Anthropic, local models)
  • Automated adversarial testing

Key Feature: Built-in vulnerability detection for prompt injection, jailbreaks, PII leaks

When to use: Security-focused evaluation, especially for customer-facing chatbots

OpenAI Evals

https://github.com/openai/evals

What it does:

  • OpenAI's official eval framework
  • Large library of pre-built evals for common tasks
  • Standardized format for sharing benchmarks

Key Feature: Community-contributed evals for niche domains

When to use: Benchmarking new models, contributing to open eval datasets

Giskard

https://github.com/Giskard-AI/giskard

What it does:

  • ML testing library with LLM support
  • Automated test generation
  • Metamorphic testing (known relations between input changes and expected output behavior, e.g., a lightly paraphrased input should not change the verdict)

Key Feature: Automatic discovery of failure modes

When to use: QA testing for ML models, finding edge cases automatically

LangCheck

https://github.com/citadel-ai/langcheck

What it does:

  • Simple evaluation metrics for LLM outputs
  • Covers factual consistency, toxicity, fluency, etc.
  • No API calls required for some metrics

Key Feature: Lightweight, can run offline

When to use: Quick quality checks without additional LLM calls

9.3 Interview Relevance

This project prepares you for common AI engineering interview questions:

Question 1: "How do you test non-deterministic systems?"

Strong Answer: "I test invariants rather than exact outputs. For example, instead of expecting a specific response, I verify that:

  1. The output is valid JSON matching a schema
  2. It cites at least one source document
  3. It doesn't contain PII
  4. It responds within 500ms

I use temperature=0 for deterministic tests (format, required fields) and higher temperatures with multiple samples for subjective qualities (tone, creativity), aggregating scores across runs."

Question 2: "How do you prevent regression in prompt engineering?"

Strong Answer: "I treat prompts as code: version controlled, tested, and monitored. Before deploying a prompt change:

  1. Run it against a golden dataset of edge cases
  2. Compare results to the baseline version
  3. Check whether the success rate dropped by more than a threshold (e.g., 5 percentage points)
  4. If regression detected, either fix the prompt or update test cases if the change was intentional
  5. Track metrics over time to detect gradual degradation

This is analogous to regression testing in software, but adapted for probabilistic systems."

Question 3: "What's the difference between unit tests and integration tests for LLMs?"

Strong Answer: "Unit tests validate individual components in isolation:

  • Validator logic (does the JSON schema validator work correctly?)
  • Prompt templating (does input substitution work?)
  • Response parsing (can we extract structured data?)

Integration tests validate the full pipeline:

  • Prompt → LLM → Validation → Business Logic
  • These tests use mocked LLM responses to be fast and deterministic
  • Real LLM tests are 'E2E tests' and run less frequently due to cost/latency

The key is isolating what you're testing: code behavior vs. model behavior."

Question 4: "How would you evaluate a summarization model?"

Strong Answer: "I'd use a multi-layered approach:

  1. Deterministic Checks (fast, cheap):
    • Length within bounds (50-200 words)
    • No PII leakage
    • Valid format (bullet points if required)
  2. Reference-Based Metrics (medium cost):
    • ROUGE score against human-written summaries
    • Factual consistency (all facts in summary appear in source)
  3. LLM-as-a-Judge (slower, higher cost):
    • Coherence (does it read naturally?)
    • Relevance (captures key points?)
    • Conciseness (no redundancy?)
  4. Human Evaluation (highest quality, most expensive):
    • Random sample (5%) reviewed by domain experts
    • Track inter-rater reliability

Different use cases prioritize different metrics. A legal summary prioritizes factual consistency; a news summary prioritizes coherence."

10. Resources

10.1 Essential Reading

Books

| Title | Author | Relevant Chapters | Why It Matters |
|-------|--------|-------------------|----------------|
| "Site Reliability Engineering" | Google | Ch. 4 (Service Level Objectives) | Learn to apply SLO/SLA thinking to AI systems |
| "Clean Code" | Robert C. Martin | Ch. 9 (Unit Tests) | Master test design principles |
| "AI Engineering" | Chip Huyen | Ch. 5 (Model Development and Offline Evaluation) | Industry best practices for LLM evaluation |
| "Designing Data-Intensive Applications" | Martin Kleppmann | Ch. 4 (Encoding and Evolution) | Understand schema evolution and compatibility |
| "The Pragmatic Programmer" | Hunt & Thomas | Ch. 7 (Test-Driven Development) | Learn TDD principles applicable to prompts |
| "Release It!" | Michael T. Nygard | Ch. 5 (Stability Patterns) | Error handling and resilience patterns |
| "Software Testing" | Ron Patton | Ch. 7 (Regression Testing) | Deep dive into regression detection |
| "Code Complete" | Steve McConnell | Ch. 8 (Defensive Programming) | Invariants and assertions |

Papers

  • "Language Models are Few-Shot Learners" (GPT-3 paper) - Section on evaluation methodology
  • "Constitutional AI: Harmlessness from AI Feedback" (Anthropic) - LLM-as-a-judge pattern
  • "Evaluating Large Language Models Trained on Code" (Codex paper) - Pass@k metrics

10.2 Video Resources

  • "Building Reliable LLM Applications" - Chip Huyen (YouTube)
    • Covers evaluation pipelines, monitoring, failure modes
    • Link: Search YouTube for "Chip Huyen LLM Applications"
  • "OpenAI Evals Deep Dive" - OpenAI Developer Day
    • How OpenAI evaluates their models internally
    • Link: openai.com/events/developer-day
  • "Prompt Engineering Guide" - promptingguide.ai
    • Interactive tutorials on prompting techniques
    • Free, comprehensive, regularly updated

10.3 Tools & Documentation

Testing Frameworks

  • pytest: https://pytest.org
    • Python's de facto testing framework
    • Rich plugin ecosystem (pytest-mock, pytest-asyncio)
  • JSON Schema: https://json-schema.org
    • Standard for defining JSON structure
    • Validators available in all languages
  • Pydantic: https://docs.pydantic.dev
    • Data validation using Python type annotations
    • Generates JSON schemas automatically

LLM APIs

  • OpenAI API: https://platform.openai.com/docs
    • Official docs for GPT-4, GPT-3.5
    • Parameter reference (temperature, top_p, etc.)
  • Anthropic API: https://docs.anthropic.com
    • Claude 3 family documentation
    • Prompt engineering best practices

CLI Tools

  • Click: https://click.palletsprojects.com
    • Python CLI framework
    • Decorator-based, easy to use
  • Rich: https://rich.readthedocs.io
    • Beautiful terminal output
    • Progress bars, tables, syntax highlighting

10.4 Next Projects

Project 2: JSON Output Enforcer (builds on validation concepts)

  • Implements self-repair loop for malformed JSON
  • Teaches reliability patterns for production systems
  • Prerequisite: Complete Project 1 first

Project 7: Temperature Sweeper

  • Parametric evaluation across temperature settings
  • Uses the harness from this project as a foundation
  • Adds statistical analysis of variance

Project 5: RAG Quality Evaluator

  • Evaluates retrieval-augmented generation systems
  • Reuses validator architecture from this project
  • Adds retrieval-specific metrics (citation accuracy, groundedness)

10.5 Community Resources

  • r/PromptEngineering (Reddit) - Community discussing techniques
  • LangChain Discord - Active community for LLM application builders
  • EleutherAI Discord - Open-source LLM research community
  • Anthropic Forum - Official Anthropic community

11. Self-Assessment Checklist

Understanding

  • I can explain what a prompt contract is without looking at notes
    • Test yourself: Define it out loud, then write a contract for a new use case
  • I understand the difference between testing at temp=0 vs temp=0.7
    • Test yourself: When would you use each? Give 3 examples per temperature
  • I can design invariants for a new use case
    • Test yourself: Pick a use case (e.g., email classifier), define 5 invariants
  • I know when to use deterministic vs probabilistic testing
    • Test yourself: Categorize these checks: JSON validity, creativity, tone, citation presence
  • I can explain error budgets in the context of LLM systems
    • Test yourself: Your SLO is 99% accuracy. How many failures are allowed per 1000 requests?

Implementation

  • All functional requirements are met
    • Test loader works with YAML/JSON
    • Test runner executes against real LLM
    • At least 3 validator types implemented
    • Reports generated in HTML and JSON
  • Test harness runs against real LLM API
    • Successfully tested with OpenAI or Anthropic
    • Handles rate limits gracefully
    • Retries on transient failures
  • Reports are clear and actionable
    • Failed tests clearly show what went wrong
    • Recommendations provided for fixing issues
    • HTML report is readable by non-technical stakeholders
  • Regression detection works correctly
    • Can compare two test runs
    • Correctly identifies new failures
    • Flags score drops above threshold
  • Code is production-ready
    • Error handling for edge cases
    • Logging for debugging
    • Tests for core components
    • Documentation (README, docstrings)

Growth

  • I can identify when to use deterministic vs probabilistic testing
    • Application: Design a test strategy for a new LLM feature
  • I've documented lessons learned
    • What surprised you during implementation?
    • What would you do differently next time?
    • What patterns emerged that you'll reuse?
  • I can explain this project in a job interview
    • Practice: Explain in 2 minutes: problem, solution, results, learnings
  • I understand how this applies to production systems
    • How would you integrate this into a CI/CD pipeline?
    • How would you monitor prompt performance in production?
    • What metrics would you track over time?
  • I can extend this to new domains
    • Pick a different domain (e.g., code generation, SQL queries)
    • What invariants would you test?
    • What new validators would you need?

12. Submission / Completion Criteria

Minimum Viable Completion

To consider this project "complete" at a basic level, you must:

  • Can load test suite from YAML file
    • Parses all fields correctly (id, input, invariants)
    • Handles malformed YAML gracefully with clear error messages
  • Can execute tests against LLM API
    • Successfully calls OpenAI or Anthropic API
    • Captures response and metadata (latency, tokens)
  • Implements at least 3 validator types
    • JSONSchema, Contains, Length (minimum set)
    • Each validator returns structured ValidationResult
  • Generates basic text report
    • Console output shows pass/fail for each test
    • Summary statistics (total, passed, failed, success rate)

Proof of Completion:

  • Screenshot of CLI output showing test run
  • Sample test YAML file
  • Code walkthrough explaining validator architecture

Full Completion

All minimum criteria plus:

  • HTML report generation
    • Interactive report with drill-down capability
    • Clear visualization of failures
    • Exportable/shareable
  • Regression detection between runs
    • Can compare current run to baseline
    • Identifies newly broken test cases
    • Flags significant score drops
  • Rate limiting and error handling
    • Exponential backoff for retries
    • Graceful handling of API failures
    • Timeout protection
  • CLI with rich formatting
    • Color-coded output (pass=green, fail=red)
    • Progress bars during execution
    • Clear help text and usage examples
  • Unit tests for core components
    • Test coverage >70% for validators, loader, runner
    • Integration tests with mocked LLM

Proof of Completion:

  • Public GitHub repository with code
  • README with setup instructions and examples
  • HTML report sample
  • Passing test suite

Excellence (Going Above & Beyond)

All full completion criteria plus any 2+ of:

  • JSON export for CI/CD integration
    • JUnit XML format for test runners
    • Structured JSON for custom pipelines
    • GitHub Actions workflow example
  • LLM-as-a-Judge validators
    • Subjective quality evaluation (tone, helpfulness)
    • Configurable judge prompts
    • Aggregation across multiple judge runs
  • Parametric sweeps across models/temps
    • Test same suite with multiple configurations
    • Heatmap visualization of results
    • Automated recommendation of best config
  • Statistical significance testing
    • Not just score comparison, but confidence intervals
    • McNemar's test for regression significance
    • Effect size calculation (Cohen's d)
  • Production-ready features
    • Response caching to reduce costs
    • Distributed execution for large suites
    • Web dashboard for result exploration
    • Slack/email notifications

Proof of Completion:

  • Blog post explaining your implementation and learnings
  • Video demo showing advanced features
  • Public deployment (e.g., web UI hosted on Vercel/Railway)
  • Contribution to an open-source eval framework (e.g., PromptFoo, OpenAI Evals)

Appendix: Sample Files

Example Test Suite (prompts/support_agent.yaml)

name: "Customer Support Evals"
version: "1.3.0"
description: "Test suite for customer support chatbot prompt"

prompt_template: |
  You are a helpful customer support agent. Use the provided policy documents to answer questions.
  Always cite your sources using [doc_id] format.

  Policy Documents:
  {context}

  Customer Query: {input}

  Respond in JSON format:
  {{
    "answer": "your response here",
    "confidence": 0.0-1.0,
    "citations": ["doc_id_1", "doc_id_2"]
  }}

test_cases:
  # Refund Queries
  - id: "simple_refund_query"
    category: "Refund"
    input: "I want to return my order #12345"
    context: |
      [policy_doc_1] Refund Policy: Customers can return items within 30 days of purchase for a full refund.
      [policy_doc_2] Order #12345 was placed on 2024-12-15 (12 days ago).
    invariants:
      - type: "json_schema"
        name: "Valid JSON Schema"
        spec:
          schema_file: "prompts/schemas/support_response.json"
      - type: "contains"
        name: "Has Citation"
        spec:
          substring: "policy_doc"
      - type: "contains"
        name: "Contains Order ID"
        spec:
          substring: "#12345"
      - type: "regex"
        name: "Polite Language"
        spec:
          pattern: "(please|thank you|happy to help)"
          flags: "IGNORECASE"

  - id: "refund_outside_window"
    category: "Refund"
    input: "Can I return something I bought 6 months ago?"
    context: |
      [policy_doc_1] Refund Policy: Customers can return items within 30 days of purchase.
    invariants:
      - type: "json_schema"
        name: "Valid JSON Schema"
        spec:
          schema_file: "prompts/schemas/support_response.json"
      - type: "contains"
        name: "Mentions Time Limit"
        spec:
          substring: "30"
      - type: "contains"
        name: "Suggests Alternative"
        spec:
          substring: "warranty"  # or exchange, or other options

  - id: "ambiguous_policy_query"
    category: "Policy"
    input: "What's your return policy?"
    context: |
      [policy_doc_1] Refund Policy: Customers can return items within 30 days.
      [policy_doc_2] Items must be in original packaging with tags attached.
    invariants:
      - type: "json_schema"
        name: "Valid JSON Schema"
        spec:
          schema_file: "prompts/schemas/support_response.json"
      - type: "contains"
        name: "Has Citation"
        spec:
          substring: "policy_doc"
      - type: "length"
        name: "Detailed Response"
        spec:
          min: 100
          max: 500

  # Technical Support
  - id: "password_reset"
    category: "Technical"
    input: "I forgot my password, how do I reset it?"
    context: |
      [help_doc_1] To reset your password:
      1. Click "Forgot Password" on login page
      2. Enter your email
      3. Check your inbox for reset link
    invariants:
      - type: "json_schema"
        name: "Valid JSON Schema"
        spec:
          schema_file: "prompts/schemas/support_response.json"
      - type: "contains"
        name: "Step-by-Step"
        spec:
          substring: "1."
      - type: "contains"
        name: "Cites Help Doc"
        spec:
          substring: "help_doc"
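
A minimal sketch of loading this suite with PyYAML and rendering the prompt for the first test case; note that the double braces in prompt_template are literal braces escaped for Python's str.format:

import yaml

with open('prompts/support_agent.yaml') as f:
    suite = yaml.safe_load(f)

case = suite['test_cases'][0]
prompt = suite['prompt_template'].format(context=case['context'], input=case['input'])
print(prompt)              # rendered prompt, ready to send to the LLM
print(case['invariants'])  # the contract to check against the response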

JSON Schema (prompts/schemas/support_response.json)

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "answer": {
      "type": "string",
      "minLength": 10,
      "description": "The support agent's response"
    },
    "confidence": {
      "type": "number",
      "minimum": 0,
      "maximum": 1,
      "description": "Confidence score for the answer"
    },
    "citations": {
      "type": "array",
      "items": {
        "type": "string",
        "pattern": "^(policy_doc|help_doc)_\\d+$"
      },
      "minItems": 1,
      "description": "List of cited document IDs"
    }
  },
  "required": ["answer", "confidence", "citations"],
  "additionalProperties": false
}
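
A minimal sketch of checking a response against this schema with the jsonschema package (the sample response is fabricated for illustration):

import json
from jsonschema import ValidationError, validate

with open('prompts/schemas/support_response.json') as f:
    schema = json.load(f)

response = {
    "answer": "You can return order #12345 within 30 days for a full refund [policy_doc_1].",
    "confidence": 0.92,
    "citations": ["policy_doc_1"],
}

try:
    validate(instance=response, schema=schema)
    print("response satisfies the schema invariant")
except ValidationError as e:
    print(f"invariant violated: {e.message}")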

This comprehensive guide was generated from PROMPT_ENGINEERING_PROJECTS.md. For the complete learning path and other projects, see the parent directory.