Project 7: Temperature Sweeper + Confidence Policy
Reliability curve report mapping temperature ranges to failure classes.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | See main guide estimates (typically 3-8 days except capstone) |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript |
| Coolness Level | Level 2: Scientific Tuning |
| Business Potential | 3. Ops Efficiency |
| Knowledge Area | Reliability Engineering |
| Software or Tool | Sampling evaluator + policy engine |
| Main Book | Site Reliability Engineering (Google) |
| Concept Clusters | Evaluation, Rollouts, and Governance; Prompt Contracts and Output Typing |
1. Learning Objectives
By completing this project, you will:
- Understand how temperature, top-p, top-k, and penalty parameters shape the token probability distribution and how each controls a different axis of output randomness.
- Design and execute a systematic parameter sweep that varies one sampling knob at a time while holding others fixed, producing statistically comparable reliability data across configurations.
- Build confidence calibration logic that maps raw model output scores to calibrated probabilities, enabling principled abstention when the model cannot answer reliably.
- Implement an abstention policy engine that decides when the model should refuse to answer, retry with different parameters, or escalate to a human reviewer based on task-class risk profiles.
- Produce a cost-quality tradeoff analysis showing how different temperature bands affect latency, token usage, pass rate, and safety metrics for each task family.
- Generate a production-ready policy artifact (YAML) that freezes approved sampling configurations per task class with deterministic seed strategies for reproducibility.
2. All Theory Needed (Per-Concept Breakdown)
Temperature and Sampling Parameters
Fundamentals Temperature is the single most important parameter controlling LLM output randomness because it directly rescales the logit vector before the softmax function converts logits into a probability distribution over the vocabulary. When temperature is low (approaching 0.0), the softmax sharpens: the highest-logit token gets nearly all the probability mass, producing near-deterministic output. When temperature is high (approaching or exceeding 1.0), the softmax flattens: lower-ranked tokens gain meaningful probability, producing diverse and sometimes surprising output. Understanding this mechanism is essential because every other sampling parameter (top-p, top-k, penalties) operates on the distribution that temperature has already shaped. If you tune top-p without understanding the temperature setting, you are optimizing a second-order effect while the first-order knob is misconfigured.
Deep Dive into the concept The mathematical foundation is straightforward. Given a vocabulary of size V, the model produces a raw logit vector z = [z_1, z_2, …, z_V]. The softmax with temperature T converts these into probabilities:
P(token_i) = exp(z_i / T) / sum_j(exp(z_j / T))
When T = 1.0, this is the standard softmax. When T < 1.0, the division amplifies differences between logits: if z_1 = 5.0 and z_2 = 4.0, at T = 1.0 the ratio P(1)/P(2) = e^1 ~ 2.7, but at T = 0.5 the ratio becomes e^2 ~ 7.4. At T = 0.1, the ratio is e^10 ~ 22,026, making selection essentially greedy. When T > 1.0, differences shrink: at T = 2.0, the same ratio becomes e^0.5 ~ 1.6, giving the second token nearly equal probability.
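These ratios are easy to check numerically. Below is a minimal sketch of the temperature-scaled softmax from the formula above (the helper name is mine, not a library function):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract max to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 4.0]                       # z_1 = 5.0, z_2 = 4.0 from the text
for T in (2.0, 1.0, 0.5, 0.1):
    p = softmax_with_temperature(logits, T)
    print(f"T={T}: P(1)/P(2) = {p[0] / p[1]:,.1f}")
# The ratio equals exp((z_1 - z_2) / T): ~1.6, ~2.7, ~7.4, ~22,026.5
```

Note that the ratio depends only on the logit gap and T, which is why a fixed gap becomes effectively decisive as T shrinks.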
Top-k sampling truncates the distribution after temperature has been applied. It keeps only the k highest-probability tokens and redistributes their probability mass to sum to 1.0. Top-k is a hard cutoff: if k = 50, exactly 50 tokens are candidates regardless of whether the 50th token has 0.001% probability or 5% probability. This makes top-k brittle for distributions that vary in entropy across different contexts. In a factual context where the model is highly confident, k = 50 might include many irrelevant tokens. In a creative context where many tokens are plausible, k = 50 might be too restrictive.
Top-p (nucleus) sampling is more adaptive. Instead of fixing the number of candidates, it includes the smallest set of tokens whose cumulative probability exceeds a threshold p. If p = 0.9, the model considers whatever number of tokens is needed to cover 90% of the probability mass. In a peaked distribution, this might be 3 tokens. In a flat distribution, this might be 500 tokens. Top-p automatically adjusts the candidate set size to match the model’s confidence at each generation step, which is why many practitioners prefer it over top-k.
Frequency penalty and presence penalty control repetition. Frequency penalty subtracts a value proportional to how many times each token has appeared in the generated text so far. If a token has appeared n times and the penalty coefficient is alpha, its logit is reduced by alpha * n. Presence penalty is a binary version: it subtracts a fixed value if the token has appeared at all, regardless of how many times. The formula is: modified_logit = original_logit - (frequency_penalty * count) - (presence_penalty * (1 if count > 0 else 0)). Frequency penalty is useful for preventing repetitive phrasing in long outputs. Presence penalty encourages topic diversity by discouraging any reuse of previous tokens.
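The penalty formula can be exercised directly. A minimal sketch with illustrative token IDs:

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """modified_logit = logit - freq_pen * count - pres_pen * (1 if seen else 0)."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= frequency_penalty * count   # scales with repeats
        adjusted[token_id] -= presence_penalty            # flat, once seen
    return adjusted

# Token 0 has appeared 3 times, token 1 once, token 2 never.
result = apply_penalties([4.0, 3.0, 2.0], [0, 0, 0, 1],
                         frequency_penalty=0.5, presence_penalty=0.2)
print(result)   # token 0: 4.0 - 0.5*3 - 0.2 = 2.3; token 1: 3.0 - 0.5 - 0.2 = 2.3
```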
Provider-specific differences matter for sweep design. OpenAI exposes temperature (0.0-2.0), top_p, frequency_penalty (-2.0 to 2.0), presence_penalty (-2.0 to 2.0), and a seed parameter for near-deterministic output. Anthropic limits temperature to 0.0-1.0, offers top_p and top_k but no frequency or presence penalties, and has no seed parameter. Google Gemini on Vertex AI supports temperature, top_p, top_k, and a seed for reproducibility. When designing a sweep, you must normalize parameter ranges across providers and handle missing parameters gracefully.
The interaction between temperature and top-p is subtle. If temperature is very low, the distribution is already peaked, so top-p has little effect (the top token alone might exceed p = 0.9). If temperature is high, top-p becomes the effective control because the flattened distribution includes many tokens. Best practice from the provider documentation is to tune one or the other, setting the unused parameter to a neutral value (temperature = 1.0 if tuning top-p, or top-p = 1.0 if tuning temperature). Tuning both simultaneously creates a confounded search space where changes in one mask the effects of the other.
Temperature Effect on Probability Distribution (Vocabulary = 5 tokens)
Raw logits: z = [5.0, 4.0, 2.0, 1.0, 0.5]
T = 0.1 (near greedy):
Token A: ████████████████████████████████████████ ~100.0%
Token B: . ~0.0%
Token C: . ~0.0%
Token D: . ~0.0%
Token E: . ~0.0%
T = 0.5 (focused):
Token A: ███████████████████████████████████ ~87.9%
Token B: █████ ~11.9%
Token C: . ~0.2%
Token D: . ~0.0%
Token E: . ~0.0%
T = 1.0 (standard):
Token A: ████████████████████████████ ~69.1%
Token B: ██████████ ~25.4%
Token C: █ ~3.4%
Token D: █ ~1.3%
Token E: . ~0.8%
T = 2.0 (creative):
Token A: ███████████████████ ~48.3%
Token B: ████████████ ~29.3%
Token C: ████ ~10.8%
Token D: ███ ~6.5%
Token E: ██ ~5.1%
Sampling Parameter Pipeline
         Raw Logits from Model
                   |
                   v
        +---------------------+
        |  Temperature Scale  |  z_i / T
        |  (reshapes entire   |
        |   distribution)     |
        +----------+----------+
                   |
                   v
        +---------------------+
        |   Penalty Adjust    |  subtract frequency/presence
        |  (reduces repeats)  |  penalties from logits
        +----------+----------+
                   |
                   v
        +---------------------+
        |       Softmax       |  convert to probabilities
        +----------+----------+
                   |
                   v
       +-----------+-----------+
       |                       |
+------+-------+        +------+-------+
| Top-k Filter |        | Top-p Filter |
| (keep top k  |   OR   | (keep tokens |
|  tokens)     |        |  until sum>p)|
+------+-------+        +------+-------+
       |                       |
       +-----------+-----------+
                   |
                   v
        +---------------------+
        |     Renormalize     |  probabilities sum to 1.0
        +----------+----------+
                   |
                   v
        +---------------------+
        |    Sample Token     |  random draw (or argmax
        |   (with seed if     |  if T ~ 0)
        |    available)       |
        +----------+----------+
                   |
                   v
            Selected Token
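The stages above can be sketched end to end in plain Python. This is illustrative only, not any provider's actual implementation (real servers fuse these steps in GPU kernels and may order penalty and temperature application differently):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None,
                 freq_counts=None, frequency_penalty=0.0, presence_penalty=0.0,
                 seed=None):
    """One sampling step mirroring the pipeline diagram (a sketch)."""
    # 1. Temperature scaling; T ~ 0 is approximated as argmax.
    if temperature < 1e-6:
        return max(range(len(logits)), key=lambda i: logits[i])
    z = [l / temperature for l in logits]
    # 2. Penalty adjustment on the scaled logits.
    for i, n in (freq_counts or {}).items():
        z[i] -= frequency_penalty * n + (presence_penalty if n > 0 else 0.0)
    # 3. Numerically stable softmax (subtract the max logit first).
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-k OR top-p filtering on (index, prob) pairs sorted by prob.
    ranked = sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    elif top_p is not None:
        kept, cum = [], 0.0
        for i, p in ranked:
            kept.append((i, p))
            cum += p
            if cum >= top_p:          # smallest set whose mass covers p
                break
        ranked = kept
    # 5. Renormalize (implicitly, by drawing within the remaining mass)
    # 6. and sample, seeded for reproducibility.
    mass = sum(p for _, p in ranked)
    r = random.Random(seed).random() * mass
    for i, p in ranked:
        r -= p
        if r <= 0:
            return i
    return ranked[-1][0]
```

With `top_k=1` or `temperature=0.0` the function degenerates to greedy selection, which matches the invariant that at least one token always survives filtering.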
How this fits into the project Temperature and sampling parameters are the independent variables of the sweep. All of Project 7 is structured around systematically varying these parameters, measuring their effects on output quality, and compiling the results into a policy artifact. Understanding the math behind these parameters is what separates a principled sweep from random guessing.
Definitions & key terms
- Logit: The raw, unnormalized score the model assigns to each vocabulary token before softmax. Higher logits mean the model considers that token more likely.
- Softmax: The function that converts logits into a proper probability distribution. Temperature modifies the logits before softmax is applied.
- Temperature (T): Scaling factor applied to logits before softmax. T < 1.0 sharpens the distribution (more deterministic). T > 1.0 flattens it (more random).
- Top-k: Hard cutoff that keeps only the k highest-probability tokens after softmax, discarding all others.
- Top-p (nucleus sampling): Adaptive cutoff that keeps the smallest set of tokens whose cumulative probability exceeds threshold p.
- Frequency penalty: Per-token logit reduction proportional to how many times that token has appeared in the generated text.
- Presence penalty: Fixed logit reduction applied to any token that has appeared at least once in the generated text.
- Seed: Integer value that initializes the random number generator for sampling. Same seed + same parameters + same prompt should produce the same output (within provider guarantees).
- Entropy: Measure of randomness in the probability distribution. High temperature increases entropy; low temperature decreases it.
Mental model diagram (ASCII)
Parameter Control Space for a Single Generation Step
  High
  Randomness |   +-------------------------+
  (entropy)  |   |      CREATIVE ZONE      |
             |   +-------------------------+
             |   |     BALANCED ZONE       |
             |   |     (typical prod)      |
             |   +-------------------------+
             |   |   DETERMINISTIC ZONE    |
  Low        |   +-------------------------+
             +----+----+----+----+----+----+----+
            0.0  0.3  0.5  0.7  1.0  1.5  2.0
                     Temperature (T) -->
Orthogonal controls:
- top_p 0.1..1.0 (narrows or widens candidate set)
- top_k 1..100+ (hard cap on candidates)
- freq_penalty (reduces repetition over time)
- presence_penalty (encourages new topics)
- seed (fixes random draw for reproducibility)
How it works (step-by-step, with invariants and failure modes)
- The model produces a logit vector of size V (vocabulary). Invariant: logits are real-valued numbers; there are exactly V of them. Failure mode: model returns empty or truncated logit vector (indicates API or model loading error).
- Temperature scaling divides each logit by T. Invariant: T > 0 (division by zero is undefined; T = 0 is approximated as argmax). Failure mode: T set to exactly 0.0 may cause different behavior across providers (some treat it as argmax, others reject it).
- Penalty adjustments subtract frequency and presence penalties from the modified logits. Invariant: penalties are applied after temperature but before softmax. Failure mode: negative penalties (< 0) actually encourage repetition, which can cause degenerate loops.
- Softmax converts adjusted logits to probabilities. Invariant: all probabilities are in [0, 1] and sum to 1.0. Failure mode: numerical overflow if logits are very large (mitigated by subtracting the max logit before softmax).
- Top-k or top-p filtering removes low-probability tokens. Invariant: at least one token remains after filtering (the highest-probability token always survives). Failure mode: if top_p is extremely small (e.g., 0.001), only the top token survives, silently collapsing sampling to greedy selection even at high temperature.
- Renormalization adjusts remaining probabilities to sum to 1.0. Invariant: the selected tokens’ probabilities form a valid distribution.
- Sampling draws one token from the filtered distribution using the random seed. Invariant: same seed + same inputs = same token (within provider guarantees). Failure mode: provider infrastructure changes (GPU routing, model snapshots) can break reproducibility even with fixed seeds.
Minimal concrete example
Sweep configuration for a customer FAQ task class:
sweep_config:
  task_class: customer_faq
  prompt_template: templates/faq_v2.prompt
  dataset: fixtures/faq_200.jsonl
  seed: 42
  parameters:
    temperature: [0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0]
    top_p: [1.0]              # hold top_p constant
    frequency_penalty: [0.0]  # hold penalty constant
  metrics:
    - pass_rate          # % of outputs matching expected answer
    - abstention_rate    # % of outputs where model declines
    - format_compliance  # % of outputs matching schema
    - mean_latency_ms    # average response time
    - token_cost         # total tokens consumed
  acceptance_criteria:
    min_pass_rate: 0.95
    max_abstention_rate: 0.05
    max_latency_ms: 2000
Expected sweep result (abbreviated):
| Temperature | Pass Rate | Abstain | Format OK | Latency | Tokens |
|-------------|-----------|---------|-----------|---------|--------|
| 0.0 | 94.0% | 0.0% | 100.0% | 1,120ms | 42,800 |
| 0.1 | 95.5% | 0.5% | 99.5% | 1,140ms | 43,200 |
| 0.2 | 96.5% | 2.0% | 99.0% | 1,160ms | 44,100 |
| 0.3 | 95.0% | 3.5% | 98.0% | 1,180ms | 45,600 |
| 0.5 | 89.0% | 1.0% | 95.0% | 1,220ms | 48,200 |
| 0.7 | 82.0% | 0.5% | 88.0% | 1,300ms | 52,400 |
| 1.0 | 68.0% | 0.0% | 72.0% | 1,450ms | 58,900 |
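A minimal sweep runner for a config like this can be sketched as below. `call_model` is a stand-in you would replace with a real provider client; the stub model and fixture data are purely illustrative:

```python
import statistics
import time

def run_sweep(call_model, dataset, temperatures, seed=42):
    """Run one configuration per temperature and collect pass rate and latency.
    A sketch of the sweep loop; real runs would add abstention, format, and
    token-cost metrics per the config above."""
    report = []
    for t in temperatures:
        passes, latencies = 0, []
        for item in dataset:
            start = time.perf_counter()
            answer = call_model(item["prompt"], temperature=t, seed=seed)
            latencies.append((time.perf_counter() - start) * 1000)
            passes += int(answer.strip() == item["expected"])
        report.append({
            "temperature": t,
            "pass_rate": passes / len(dataset),
            "mean_latency_ms": statistics.mean(latencies),
        })
    return report

# Stubbed model for illustration: answers correctly only at low temperature.
def fake_model(prompt, temperature, seed):
    return "Paris" if temperature <= 0.3 else "Lyon"

dataset = [{"prompt": "Capital of France?", "expected": "Paris"}] * 4
for row in run_sweep(fake_model, dataset, [0.0, 0.7]):
    print(row["temperature"], row["pass_rate"])
```

Holding the dataset and seed fixed across configurations is what makes the per-temperature rows of the table comparable.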
Common misconceptions
- “Temperature 0 guarantees deterministic output.” No provider currently guarantees fully deterministic outputs even at T = 0. GPU floating-point non-determinism, load balancing across hardware, and model snapshot updates can introduce variation. The seed parameter helps but does not eliminate this. Design your system to tolerate small deviations.
- “Higher temperature means better creativity.” Higher temperature means more randomness, which sometimes produces creative output but also produces more nonsense, hallucinations, and format violations. Creativity requires the model to have learned good creative patterns; temperature just controls how freely it samples from what it knows.
- “Top-p and temperature do the same thing.” Temperature reshapes the entire distribution before any filtering. Top-p filters the already-shaped distribution. They operate at different stages of the pipeline and have different effects. A low temperature with a high top-p is very different from a high temperature with a low top-p.
- “You should always tune both temperature and top-p together.” Most provider documentation recommends tuning one and setting the other to a neutral value. Tuning both simultaneously creates a confounded optimization space where you cannot attribute improvements to either parameter.
- “Frequency penalty always improves output quality.” Negative or excessive frequency penalties can degrade output by forcing the model to use unnatural synonyms, break formatting patterns, or avoid necessary technical terms that must be repeated.
Check-your-understanding questions
- If the raw logits for two tokens are z_1 = 10.0 and z_2 = 8.0, what happens to the probability ratio P(1)/P(2) as temperature decreases from 1.0 to 0.1?
- Why does top-p adapt better than top-k to varying levels of model confidence across different generation steps?
- A sweep shows that temperature 0.0 gives 94% pass rate but temperature 0.2 gives 96.5%. What could explain the higher pass rate at a slightly non-zero temperature?
- Why is it problematic to sweep temperature and top-p simultaneously without a control dimension?
- How does the absence of a seed parameter in Anthropic’s API affect sweep reproducibility compared to OpenAI?
Check-your-understanding answers
- At T = 1.0, P(1)/P(2) = exp(2) ~ 7.4. At T = 0.5, P(1)/P(2) = exp(4) ~ 54.6. At T = 0.1, P(1)/P(2) = exp(20) ~ 485 million. The ratio grows exponentially as T decreases, making selection increasingly greedy.
- Top-k always keeps exactly k tokens regardless of distribution shape. When the model is confident (peaked distribution), top-k includes many near-zero probability tokens. When uncertain (flat distribution), top-k might exclude plausible tokens. Top-p adjusts the candidate set size dynamically: few tokens when confident, many when uncertain.
- At T = 0.0 the model always picks the single highest-logit token (greedy decoding). Sometimes the second-highest token is the correct answer (the model's "second guess" is better for certain inputs). A small temperature lets the model occasionally select these near-optimal alternatives, which can improve aggregate pass rate. This reflects a well-documented limitation of greedy decoding: the locally highest-probability token does not always lie on the globally best output sequence.
- If you change both temperature and top-p between runs, you cannot determine which parameter caused the observed quality change. This violates the basic principle of controlled experimentation. You need at least one fixed control dimension to attribute effects correctly.
- Without a seed parameter, Anthropic API calls at any non-zero temperature will produce different outputs across runs even with identical inputs. This means sweep results require multiple runs per configuration to compute confidence intervals, increasing cost and time. With OpenAI’s seed parameter, a single run per configuration can produce a reproducible baseline (though still not guaranteed identical due to infrastructure factors).
Real-world applications
- Customer support systems use low temperature (0.0-0.2) for factual FAQ answers to minimize hallucination, but higher temperature (0.5-0.7) for generating empathetic response variations.
- Code generation tools typically use temperature 0.0-0.2 for correctness-critical code and 0.5-0.8 for brainstorming alternative implementations.
- Content generation platforms configure temperature per content type: product descriptions (T=0.3), marketing copy (T=0.7), creative fiction (T=0.9-1.0).
- Evaluation pipelines run sweeps across temperature ranges to find the reliability-creativity boundary for each use case before freezing production settings.
Where you’ll apply it
- Phase 1 of this project: define the sweep configuration with parameter ranges, fixture datasets, and metrics to collect.
- Phase 2: execute the sweep and analyze per-configuration reliability data.
- The sweep results feed into the confidence calibration and policy compilation concepts.
References
- “AI Engineering” by Chip Huyen - Chapters on model evaluation and parameter tuning
- “Site Reliability Engineering” by Google - SLO and error budget chapters for reliability framing
- OpenAI API Reference: Chat Completions parameters (temperature, top_p, frequency_penalty, presence_penalty, seed)
- Anthropic API Reference: Messages parameters (temperature, top_p, top_k)
- vLLM Sampling Parameters documentation for open-weight model configurations
- “Controlling randomness in LLMs: Temperature and Seed” by Dylan Castillo (2025)
Key insights Temperature is not a quality knob; it is an entropy knob. Quality depends on how well the model learned the task; temperature controls how freely it samples from what it learned.
Summary Temperature and sampling parameters form a multi-dimensional control surface that shapes every token the model produces. Temperature rescales logits before softmax, top-k and top-p filter the resulting distribution, and penalties discourage repetition. Provider-specific differences in parameter ranges, penalty support, and seed availability mean that sweep designs must account for the target provider’s API surface. The goal of a sweep is to map this control surface to task-specific reliability metrics, producing evidence-based configurations rather than guesses.
Homework/Exercises to practice the concept
- Calculate the probability distribution for a 4-token vocabulary with logits [6.0, 3.0, 1.0, 0.5] at temperatures T = 0.1, T = 0.5, T = 1.0, and T = 2.0. Show how the entropy changes across these four settings.
- Design a sweep configuration (in pseudocode YAML) that tests 5 temperature values for a “legal clause extraction” task class. Specify which parameters you would hold constant and why.
- Write pseudocode for a function that normalizes temperature ranges across three providers (OpenAI: 0.0-2.0, Anthropic: 0.0-1.0, Google: 0.0-2.0) so that sweep results are comparable.
Solutions to the homework/exercises
- For the probability calculation: at T = 0.1, token A dominates with >99.99% probability (entropy near 0). At T = 0.5, token A has ~99.7% and token B ~0.2% (still very low entropy). At T = 1.0, the distribution spreads slightly: A ~94.3%, B ~4.7%, C ~0.6%, D ~0.4%. At T = 2.0, it flattens further: A ~73.0%, B ~16.3%, C ~6.0%, D ~4.7%. Entropy increases monotonically with temperature.
- The sweep YAML should fix top_p = 1.0, frequency_penalty = 0.0, and presence_penalty = 0.0 while varying temperature across [0.0, 0.1, 0.2, 0.3, 0.5]. Legal extraction is a precision task, so the sweep range should focus on the low-temperature regime. Include a seed for reproducibility and metrics for extraction accuracy, schema compliance, and hallucination rate.
- The normalization pseudocode should map each provider’s range to a canonical [0.0, 1.0] scale. For OpenAI, canonical = raw / 2.0. For Anthropic, canonical = raw (already 0-1). For Google, canonical = raw / 2.0. The sweep engine works in canonical space and converts to provider-native values before API calls. Document that this is a linear approximation and that the actual effect of T = 0.5 on OpenAI versus T = 0.5 on Anthropic may differ due to model architecture differences.
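The normalization described in the third solution, as a runnable sketch (a deliberately linear mapping, with the caveat above that equal canonical temperatures do not guarantee equal behavior across models):

```python
# Provider-native temperature maxima, per the API ranges cited in the text.
PROVIDER_T_MAX = {"openai": 2.0, "anthropic": 1.0, "google": 2.0}

def to_canonical(provider: str, native_t: float) -> float:
    """Map a provider-native temperature onto a canonical [0.0, 1.0] scale."""
    return native_t / PROVIDER_T_MAX[provider]

def to_native(provider: str, canonical_t: float) -> float:
    """Convert a canonical temperature back into the provider's native range."""
    if not 0.0 <= canonical_t <= 1.0:
        raise ValueError("canonical temperature must be in [0.0, 1.0]")
    return canonical_t * PROVIDER_T_MAX[provider]

print(to_native("openai", 0.5), to_native("anthropic", 0.5))   # 1.0 0.5
```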
Confidence Calibration and Abstention Policies
Fundamentals Confidence calibration is the process of transforming a model’s raw output scores into probabilities that accurately reflect the true likelihood of correctness. An LLM might output a response with high apparent confidence (fluent, detailed, assertive) while being factually wrong 30% of the time. Calibration measures and corrects this gap between stated confidence and actual accuracy. Abstention is the decision to not answer when confidence is below a task-specific threshold. Together, calibration and abstention form the reliability layer between raw model output and downstream consumption. Without calibration, you cannot build principled abstention policies. Without abstention, you cannot prevent the model from confidently delivering wrong answers in high-stakes contexts. This concept is critical for Project 7 because the temperature sweep produces varying levels of model certainty, and the policy engine must interpret those levels correctly to assign approved configurations per task class.
Deep Dive into the concept Confidence in LLMs is fundamentally different from confidence in traditional classifiers. A classifier trained on labeled data produces a softmax score over classes that, after calibration, can approximate P(correct | output). An LLM produces a sequence of tokens, where each token has a probability from the generation process. The “confidence” of the full response is not a single number the model outputs; it must be constructed from signals like per-token log-probabilities, self-consistency across multiple samples, or the model’s own verbal expression of certainty.
There are three main approaches to measuring LLM confidence:
First, token-level log-probabilities. Most APIs expose log-probs for generated tokens. You can aggregate these (mean, min, product) to get a sequence-level confidence score. The mean log-probability is the most common aggregation. A response where every token has high probability (mean log-prob close to 0) suggests the model was confident at every step. A response with even one very low-probability token (a “surprise” token) might indicate uncertainty or hallucination. The minimum log-probability in the sequence is a useful signal for detecting moments of model uncertainty.
Second, self-consistency sampling. Generate N responses to the same prompt at moderate temperature (T = 0.5-0.7). If all N responses agree on the key facts, confidence is high. If they diverge, the model is uncertain. This is computationally expensive (N API calls per input) but provides a strong signal that does not depend on the model’s ability to self-assess. The agreement rate across samples is a well-calibrated proxy for correctness on factual tasks.
Third, verbalized confidence. Ask the model to state its confidence level as part of the response (e.g., “Rate your confidence 1-10”). Research from 2025 shows this is the least reliable approach: models are poorly calibrated when verbalizing confidence, often expressing high confidence for wrong answers and rarely adjusting their confidence based on actual difficulty. The AbstentionBench study found that LLMs do not adapt their decision policies in response to changing risk, even when abstention is explicitly incentivized.
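The first two approaches reduce to a few lines each. A sketch (the function names are mine, not an API):

```python
from collections import Counter

def logprob_signals(token_logprobs):
    """Aggregate per-token log-probs into sequence-level confidence signals."""
    return {
        "mean_logprob": sum(token_logprobs) / len(token_logprobs),
        "min_logprob": min(token_logprobs),   # flags the single most "surprising" token
    }

def agreement_rate(answers):
    """Self-consistency: fraction of N samples matching the majority answer."""
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)

print(logprob_signals([-0.05, -0.10, -2.30, -0.01]))           # min flags the -2.30 token
print(agreement_rate(["Paris", "Paris", "Lyon", "Paris", "Paris"]))   # 0.8
```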
Calibration methods fall into two categories: post-hoc and training-based. Post-hoc calibration takes a set of model outputs with known correctness labels and fits a function that maps raw confidence scores to calibrated probabilities. The simplest method is Platt scaling: fit a logistic regression on (raw_score, is_correct) pairs to get P(correct | raw_score). Temperature scaling (confusingly sharing the name with the generation parameter) is another post-hoc method that fits a single scalar to the logits. Isotonic regression is a non-parametric alternative that fits a monotonic function without assuming a functional form. Training-based approaches like LACIE (2024) cast calibration as a preference optimization problem during fine-tuning, producing models with emergent abstention behavior.
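Platt scaling is just a one-feature logistic regression and can be written without dependencies. A sketch fit by batch gradient descent on toy labeled data (in practice any library logistic regression works):

```python
import math

def fit_platt(scores, labels, lr=0.5, epochs=3000):
    """Fit P(correct | s) = sigmoid(a*s + b) by batch gradient descent
    on the log loss (Platt scaling with a single feature)."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s        # dLoss/da for the log loss
            grad_b += (p - y)            # dLoss/db
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

# Toy labeled pairs: raw score = mean log-prob, label = 1 if answer was correct.
scores = [-0.1, -0.2, -0.3, -0.9, -1.2, -1.5]
labels = [1, 1, 1, 0, 0, 0]
calibrate = fit_platt(scores, labels)
print(round(calibrate(-0.1), 2), round(calibrate(-1.4), 2))  # high vs low P(correct)
```

The returned closure is the calibration function: feed it a raw score and it yields a probability you can compare against an abstention threshold.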
Abstention policy design requires defining three components: a confidence threshold below which the model should abstain, a cost model that quantifies the relative cost of wrong answers versus abstentions, and an escalation path for abstained queries. The threshold is task-specific: a medical triage system might require 95% calibrated confidence to answer, while a casual chatbot might accept 60%. The cost model formalizes the tradeoff: if a wrong answer costs 10x more than an abstention (which triggers a human review), the optimal threshold is much higher than if both outcomes have similar costs. The escalation path defines what happens to abstained queries: queue for human review, retry with different parameters, or return a safe default response.
Confidence Calibration Pipeline
+------------------+
| Raw Model Output |
| (response text + |
| per-token logps) |
+--------+---------+
|
v
+--------+---------+
| Signal Extraction |
| - mean log-prob | Approach 1: Token Log-Probs
| - min log-prob |
| - entropy |
+--------+---------+
|
| +-------------------+
| | Multi-Sample | Approach 2: Self-Consistency
+-->| Agreement Rate |
| | (N=5 samples) |
| +-------------------+
|
v
+--------+---------+
| Calibration Model |
| (Platt scaling or |
| isotonic regress) |
+--------+---------+
|
v
+--------+----------+
| Calibrated P(correct) |
+---------+---------+
|
v
+---------+---------+ +-----------------------+
| Abstention Policy |<----| Task-Class Config |
| | | - threshold: 0.85 |
| if P < threshold: | | - cost_wrong: 10x |
| ABSTAIN | | - escalation: human |
| else: | +-----------------------+
| ANSWER |
+---------+---------+
|
+----+----+
| |
ANSWER ABSTAIN
| |
v v
Deliver Escalate / Retry /
Response Safe Default
Calibration Reliability Diagram (ideal vs uncalibrated)
  Actual     |
  Accuracy   |                        /
  (fraction  |                      /
  correct)   |                    /    <-- perfectly calibrated
             |                  /          (diagonal line)
             |                /
             |              /        . .
             |            /     . .        <-- typical uncalibrated LLM
             |          /   . .                (overconfident: curve sits
             |        /  . .                    below the diagonal)
             |      /. .
             +----+----+----+----+----+---->
            0.0  0.2  0.4  0.6  0.8  1.0
                  Predicted Confidence
ECE (Expected Calibration Error) = mean |accuracy_bin - confidence_bin|
A perfectly calibrated model has ECE = 0
Typical LLMs have ECE 0.10-0.25 (overconfident by 10-25%)
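The ECE formula above, in code. This sketch uses the standard equal-width binning, with each bin weighted by its share of predictions (the usual formulation of the mean):

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy(bin) - mean confidence(bin)|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf = 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece, total = 0.0, len(confidences)
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece

# Overconfident toy model: claims 0.9 confidence but is right only half the time.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # ~0.4
```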
How this fits into the project Confidence calibration converts raw sweep data into actionable reliability metrics. The sweep runner (Concept 1) produces per-configuration pass rates and log-probability distributions. The calibration layer transforms these into calibrated confidence scores. The policy engine (Concept 3) uses calibrated scores to set abstention thresholds per task class. Without calibration, the policy would be based on uncalibrated scores that overestimate reliability.
Definitions & key terms
- Calibration: The alignment between predicted confidence and actual accuracy. A well-calibrated model that says “80% confident” should be correct 80% of the time.
- Expected Calibration Error (ECE): The average absolute difference between confidence and accuracy across binned predictions. Lower is better; 0.0 is perfect.
- Platt scaling: Post-hoc calibration that fits a logistic regression to map raw scores to calibrated probabilities.
- Isotonic regression: Non-parametric post-hoc calibration that fits a monotonic step function without assuming a specific functional form.
- Abstention: The decision to not answer a query because the model’s calibrated confidence is below a task-specific threshold.
- Self-consistency: Measuring confidence by sampling multiple responses and computing agreement rate.
- Cost-sensitive threshold: An abstention threshold derived from the relative costs of wrong answers versus abstentions.
- Verbalized confidence: Asking the model to self-report its confidence level (least reliable approach per 2025 research).
- LACIE: A training-time approach that casts calibration as preference optimization, producing emergent abstention behavior.
Mental model diagram (ASCII)
The Confidence-Accuracy Tradeoff Space
  High
  Task       |              +---------+
  Accuracy   |              |  SWEET  |   High accuracy + reasonable
             |              |  SPOT   |   abstention rate
             |              +---------+
             |             /
             |            /
             |           /
             |          /    <-- raising threshold improves
             |         /         accuracy but increases abstention
             |        /
             |       /
             |      /
  Low        |     /
             +--------------------------->
            Low                        High
                    Abstention Rate
Key insight: there is no free lunch.
Higher accuracy requires more abstention.
The "sweet spot" depends on task-class cost model:
- Medical triage: accept 15% abstention for 99% accuracy
- Chatbot: accept 2% abstention for 85% accuracy
How it works (step-by-step, with invariants and failure modes)
- Collect a labeled dataset of (query, model_response, is_correct) triples from the sweep. Invariant: the dataset must be large enough (typically 200+ samples per task class) for calibration to be meaningful. Failure mode: too few samples causes overfitting of the calibration function.
- Extract confidence signals from each response: mean log-probability, min log-probability, and optionally self-consistency agreement rate. Invariant: log-probabilities must come from the same model and configuration used in production. Failure mode: calibrating on one model’s log-probs and deploying with another model produces miscalibrated thresholds.
- Fit a calibration model (Platt scaling or isotonic regression) on the labeled data. Invariant: use held-out validation data to evaluate calibration quality (never calibrate and evaluate on the same data). Failure mode: calibration function overfits to a specific prompt version or dataset distribution.
- Compute Expected Calibration Error (ECE) on the validation set. Invariant: ECE should decrease after calibration. Failure mode: ECE increases, indicating the calibration function is making things worse (usually means insufficient data or distribution mismatch).
- Set task-class-specific abstention thresholds based on the calibrated confidence scores and the cost model. Invariant: threshold must be re-evaluated when the model, prompt, or task distribution changes. Failure mode: stale thresholds after a model update cause either excessive abstention or insufficient safety.
Minimal concrete example
Calibration data from sweep (customer_faq, T=0.2):
| Query ID | Mean Log-Prob | Is Correct | Calibrated P(correct) |
|----------|---------------|------------|-----------------------|
| q_001 | -0.12 | True | 0.94 |
| q_002 | -0.45 | True | 0.82 |
| q_003 | -1.20 | False | 0.41 |
| q_004 | -0.08 | True | 0.96 |
| q_005 | -0.88 | False | 0.58 |
Abstention policy for customer_faq:
threshold: 0.85 # abstain if P(correct) < 0.85
cost_wrong: 5.0 # wrong answer costs 5x abstention
escalation: human_queue # abstained queries go to human review
max_abstention_rate: 0.10 # alert if >10% queries abstain
Applied to sweep results:
q_001: ANSWER (0.94 >= 0.85)
q_002: ABSTAIN (0.82 < 0.85) -> human_queue
q_003: ABSTAIN (0.41 < 0.85) -> human_queue
q_004: ANSWER (0.96 >= 0.85)
q_005: ABSTAIN (0.58 < 0.85) -> human_queue
Abstention rate: 60% -> ALERT: exceeds max_abstention_rate
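A minimal sketch of the policy application above, assuming the calibrated scores and the policy fields shown in the example (`threshold`, `escalation`, `max_abstention_rate`):

```python
def decide(calibrated_p, policy):
    """Apply the abstention policy: answer above threshold, otherwise escalate."""
    if calibrated_p >= policy["threshold"]:
        return "ANSWER"
    return f"ABSTAIN -> {policy['escalation']}"

def apply_policy(scores, policy):
    """Decide every query, then check the abstention rate against the alert limit."""
    decisions = {qid: decide(p, policy) for qid, p in scores.items()}
    rate = sum(d.startswith("ABSTAIN") for d in decisions.values()) / len(decisions)
    alert = rate > policy["max_abstention_rate"]
    return decisions, rate, alert

policy = {"threshold": 0.85, "escalation": "human_queue", "max_abstention_rate": 0.10}
scores = {"q_001": 0.94, "q_002": 0.82, "q_003": 0.41, "q_004": 0.96, "q_005": 0.58}
decisions, rate, alert = apply_policy(scores, policy)
print(decisions["q_002"], rate, alert)
```

Running this on the five example scores reproduces the 60% abstention rate and the resulting alert.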
Common misconceptions
- “LLMs know when they are wrong.” Research consistently shows that LLMs are poorly calibrated at self-assessing correctness. Fluent, detailed responses can be entirely fabricated. External calibration using labeled data is necessary.
- “Confidence is a single number the model outputs.” LLMs do not output a native confidence score. Confidence must be constructed from proxy signals (log-probs, self-consistency, verbalized assessment), each with different reliability characteristics.
- “A fixed abstention threshold works across all task classes.” Different task classes have different accuracy-abstention tradeoffs. A medical triage system needs a much higher threshold than a product recommendation system. Thresholds must be set per task class using task-specific cost models.
- “Abstaining more always makes the system safer.” Excessive abstention degrades user experience and can cause users to work around the system (e.g., rephrasing queries to trick the model into answering), which may be less safe than a confident-but-slightly-wrong answer.
- “Self-consistency is too expensive for production.” Self-consistency can be used selectively: only for queries where the initial confidence score falls near the abstention threshold. This “borderline sampling” strategy limits the extra cost to a small fraction of total queries.
Check-your-understanding questions
- Why is Platt scaling preferred over raw log-probability thresholds for abstention decisions?
- How does the cost model for wrong answers versus abstentions affect the optimal threshold?
- What happens to calibration accuracy when the model is updated but the calibration function is not re-fitted?
- Why might self-consistency sampling give better calibration than mean log-probability for factual question-answering?
- How would you design a “borderline sampling” strategy that uses self-consistency only when needed?
Check-your-understanding answers
- Raw log-probabilities are not calibrated: a mean log-prob of -0.3 might correspond to 70% accuracy for one task class and 90% for another. Platt scaling maps raw scores to calibrated probabilities that are comparable across examples and interpretable as actual correctness rates.
- If wrong answers cost much more than abstentions (high cost_wrong), the optimal threshold is higher (abstain more aggressively). If costs are similar, the threshold is lower (answer more often). The threshold sits where the expected cost of answering equals the expected cost of abstaining: threshold = cost_wrong / (cost_wrong + cost_abstain), which corresponds to an acceptable error rate of cost_abstain / (cost_wrong + cost_abstain).
- The calibration function becomes stale. The relationship between log-probs and correctness may shift with the new model, causing the calibrated probabilities to be inaccurate. This manifests as increased ECE and either too many false confidences (dangerous) or too many false abstentions (wasteful). Always re-calibrate after model updates.
- Self-consistency measures whether the model can reliably reproduce the same answer across multiple samples. For factual QA, if 5 out of 5 samples agree on the same answer, that answer is likely correct regardless of the individual log-probabilities. Mean log-probability can be high even for wrong answers if the model is confidently wrong. Self-consistency captures a different signal: robustness of the answer under sampling variation.
- Route every query through the initial confidence scorer (mean log-prob, cheap). If the calibrated score falls within a “borderline zone” (e.g., threshold +/- 0.10), trigger self-consistency sampling with N = 3-5 additional samples. If the score is clearly above or below the threshold, accept or abstain immediately without extra sampling. This limits extra API calls to the 10-20% of queries that are genuinely ambiguous.
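The borderline-sampling routing just described can be sketched as follows; the sampler stub, the ±0.10 margin, and the 0.8 agreement cutoff are illustrative assumptions:

```python
def route(calibrated_p, threshold=0.85, margin=0.10, sampler=None, n_samples=5):
    """Borderline sampling: the cheap calibrated score decides clear cases;
    self-consistency is triggered only inside the ambiguous zone."""
    if calibrated_p >= threshold + margin:
        return "ANSWER"
    if calibrated_p < threshold - margin:
        return "ABSTAIN"
    # Borderline: draw extra samples and check agreement (sampler is a stub here)
    answers = [sampler() for _ in range(n_samples)]
    agreement = answers.count(max(set(answers), key=answers.count)) / n_samples
    return "ANSWER" if agreement >= 0.8 else "ABSTAIN"

print(route(0.99))                           # clearly above the zone: no extra sampling
print(route(0.50))                           # clearly below the zone: abstain immediately
print(route(0.80, sampler=lambda: "Paris"))  # borderline: self-consistency decides
```

Only the third call pays for extra samples, which is the point of the strategy: the added cost is confined to queries near the threshold.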
Real-world applications
- Medical AI triage systems use calibrated confidence to decide when to escalate to a human doctor, with thresholds set from historical accuracy data and regulatory requirements.
- Financial document analysis pipelines abstain on ambiguous clauses and flag them for human review, using isotonic regression calibrated on attorney-labeled datasets.
- Customer support bots use self-consistency to detect when the model is uncertain about product-specific answers, routing those queries to specialized agents.
- Search engines use calibrated confidence to decide whether to show a direct answer or only search results, based on per-query-type thresholds.
Where you’ll apply it
- Phase 2 of this project: after the sweep produces per-configuration results, build the calibration pipeline that transforms raw metrics into calibrated confidence scores.
- Phase 3: use calibrated scores to set abstention thresholds in the policy artifact.
References
- “Know Your Limits: A Survey of Abstention in Large Language Models” (Wen et al., 2025, TACL) - Comprehensive survey of abstention methods and evaluation
- “AI Engineering” by Chip Huyen - Chapter on evaluation metrics and model quality assessment
- “Pattern Recognition and Machine Learning” by Bishop - Calibration and posterior probability estimation
- “Trustworthy Online Controlled Experiments” by Kohavi et al. - Statistical foundations for A/B testing and experiment design
- AbstentionBench evaluation framework documentation
Key insights Confidence without calibration is just a number. Calibration with labeled data transforms it into a decision-making tool that enables principled abstention.
Summary Confidence calibration bridges the gap between raw model output scores and actual correctness probabilities. Post-hoc methods like Platt scaling and isotonic regression require labeled validation data but produce interpretable, task-specific confidence scores. Abstention policies use these calibrated scores with task-class cost models to decide when the model should answer, abstain, or escalate. The key insight from 2025 research is that LLMs are intrinsically poorly calibrated and do not naturally know when to abstain, making external calibration infrastructure essential for production reliability.
Homework/Exercises to practice the concept
- Given a dataset of 20 (mean_log_prob, is_correct) pairs, sketch a reliability diagram and estimate the ECE. Then describe how Platt scaling would transform the raw scores.
- Design an abstention policy for a “medical symptom checker” task class with cost_wrong = 20x and cost_abstain = 1x. What calibrated confidence threshold minimizes expected cost?
- Compare the advantages and disadvantages of self-consistency versus mean log-probability as confidence signals for a code generation task.
Solutions to the homework/exercises
- The reliability diagram bins predictions by confidence (e.g., 10 bins) and plots actual accuracy per bin. ECE is the weighted average of |accuracy_bin - confidence_bin| across bins. Typical uncalibrated LLMs show the curve below the diagonal (overconfident). Platt scaling fits logistic regression coefficients (a, b) such that P(correct) = sigmoid(a * raw_score + b), pulling the curve toward the diagonal.
- With cost_wrong = 20 and cost_abstain = 1, the break-even acceptable error rate is cost_abstain / (cost_wrong + cost_abstain) = 1/21 ~ 0.048. The calibrated confidence threshold should therefore be set so that P(correct | score >= threshold) >= 1 - 0.048 ~ 0.952. In practice, set the threshold to the calibrated score where accuracy in that bin reaches 95.2%. This will likely be a high threshold (e.g., 0.90-0.95 calibrated), resulting in significant abstention.
- For code generation, self-consistency is stronger because it detects functional equivalence: two code samples that look different but produce the same output on test cases. Mean log-probability might be high for plausible-looking but buggy code. However, self-consistency requires running test cases or comparing outputs, which is more expensive. Mean log-probability is cheaper but less reliable for detecting subtle bugs where the model is confidently wrong.
Sweep Design, Statistical Comparison, and Cost-Quality Tradeoffs
Fundamentals A temperature sweep is a controlled experiment. Like any experiment, it requires careful design to produce valid conclusions. The sweep must vary one parameter at a time while holding others constant, use a fixed evaluation dataset with known-correct answers, run enough samples per configuration to achieve statistical significance, and account for the cost (time, money, tokens) of each configuration. Without this discipline, sweep results are noise: you might conclude that T = 0.3 is optimal when the real signal is dominated by random variation in a small dataset. This concept teaches the experimental methodology that makes sweep results trustworthy and actionable for production policy decisions.
Deep Dive into the concept Sweep design starts with the parameter grid. Define the ranges and step sizes for each parameter you want to test. For temperature, a typical range is [0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0], with finer steps near the expected optimal zone. For top-p, [0.1, 0.3, 0.5, 0.7, 0.9, 1.0] covers the useful range. The key rule is to vary one parameter at a time: fix top_p = 1.0 while sweeping temperature, then fix temperature at the best value and sweep top_p. This is the one-factor-at-a-time (OFAT) approach, simple and interpretable. For more thorough analysis, factorial designs test combinations, but the number of configurations grows multiplicatively (7 temperatures x 6 top_p values = 42 configurations).
Each configuration must be evaluated on the same fixed dataset. The dataset should represent the production distribution: same query types, same difficulty distribution, same edge cases. If the sweep dataset is easier than production, the selected configuration will underperform in deployment. If it is harder, you will over-engineer robustness at the cost of creativity. Dataset design is as important as parameter selection.
Statistical comparison between configurations requires accounting for variance. A configuration with 96% pass rate on 200 samples has a 95% confidence interval of roughly +/- 2.7% (using the normal approximation for proportions). This means a configuration with 95% pass rate is not statistically distinguishable from one with 97% at this sample size. To detect a 2% difference with 95% confidence and 80% power, you need approximately 1,900 samples per configuration. This is why production sweep systems run thousands of evaluations, not hundreds.
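The sample-size arithmetic above can be checked with a small helper using the standard two-proportion normal approximation; the z values for a two-sided alpha = 0.05 and 80% power (1.96 and 0.84) are hardcoded:

```python
import math

def samples_needed(p, delta, z_alpha=1.96, z_beta=0.84):
    """Per-configuration sample size to detect a difference `delta` between
    two proportions near `p` (normal approximation, 95% confidence, 80% power)."""
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2)

print(samples_needed(0.95, 0.02))  # roughly 1,900 samples for a 2% difference
print(samples_needed(0.95, 0.03))  # roughly 830 samples for a 3% difference
```

Note the quadratic dependence on delta: halving the difference you want to detect quadruples the required sample size.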
For pairwise comparison, McNemar’s test is appropriate because the same dataset is evaluated under different configurations. McNemar’s test checks whether the off-diagonal cells (cases where configuration A passes but B fails, and vice versa) differ significantly. This is more powerful than comparing independent proportions because it uses the paired structure of the data.
Cost-quality tradeoff analysis quantifies the relationship between configuration settings and operational cost. Higher temperature typically produces longer outputs (more tokens), increasing both latency and monetary cost. The tradeoff curve plots quality (pass rate) against cost (tokens per query, dollars per 1000 queries, or p95 latency). The optimal configuration is not necessarily the highest-quality one; it is the one that provides acceptable quality at the best cost point. An SLO-based approach sets minimum quality thresholds (e.g., pass rate >= 95%, p95 latency <= 2 seconds) and selects the cheapest configuration that meets all thresholds.
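The SLO-based selection rule can be sketched as below; the metric field names (`pass_rate`, `p95_latency_s`, `total_tokens`) and the numbers are illustrative, not a fixed schema:

```python
def pick_config(summaries, slos):
    """Return the cheapest configuration whose metrics meet every SLO,
    or None if no configuration qualifies."""
    eligible = [
        (m["total_tokens"], name) for name, m in summaries.items()
        if m["pass_rate"] >= slos["min_pass_rate"]
        and m["p95_latency_s"] <= slos["max_p95_latency_s"]
    ]
    return min(eligible)[1] if eligible else None

summaries = {
    "T=0.0": {"pass_rate": 0.94, "p95_latency_s": 1.1, "total_tokens": 40_000},
    "T=0.2": {"pass_rate": 0.965, "p95_latency_s": 1.3, "total_tokens": 44_100},
    "T=0.5": {"pass_rate": 0.85, "p95_latency_s": 1.8, "total_tokens": 48_000},
}
slos = {"min_pass_rate": 0.95, "max_p95_latency_s": 2.0}
print(pick_config(summaries, slos))
```

With these illustrative numbers only T=0.2 clears the pass-rate SLO, so it is selected despite not being the cheapest overall.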
Sweep Methodology: One-Factor-at-a-Time (OFAT)
Phase 1: Sweep Temperature (fix top_p=1.0, penalties=0.0)
+-------+-------+-------+-------+-------+-------+-------+
| T=0.0 | T=0.1 | T=0.2 | T=0.3 | T=0.5 | T=0.7 | T=1.0 |
+-------+-------+-------+-------+-------+-------+-------+
Result: T=0.2 is best for customer_faq task class
|
v
Phase 2: Sweep Top-p (fix T=0.2, penalties=0.0)
+-------+-------+-------+-------+-------+-------+
| p=0.1 | p=0.3 | p=0.5 | p=0.7 | p=0.9 | p=1.0 |
+-------+-------+-------+-------+-------+-------+
Result: p=0.9 matches p=1.0 (no improvement)
|
v
Phase 3: Sweep Penalties (fix T=0.2, top_p=1.0)
+----------+----------+----------+----------+
| freq=0.0 | freq=0.2 | freq=0.5 | freq=1.0 |
+----------+----------+----------+----------+
Result: freq=0.0 is best (FAQ answers need repetition)
|
v
Final Config: T=0.2, top_p=1.0, freq_penalty=0.0, seed=42
Cost-Quality Tradeoff Curve
Pass Rate |
(quality) |
   96.5%  |              * T=0.2
   96%    |- - - - - - - - - - - - - - - - - -  SLO minimum quality
   95%    |   * T=0.0
   94%    |        * T=0.1
   91%    |                    * T=0.3
   85%    |                            * T=0.5
   80%    |                                  * T=0.7
          +----+----+----+--------+--------+-->
            40k  42k  44k       48k      52k
                Tokens per 200 queries (cost)
Optimal: T=0.2 achieves SLO quality (96.5%) at
reasonable cost (44.1k tokens). T=0.0 saves tokens
but misses SLO. T=0.3+ wastes tokens without
quality improvement.
How this fits into the project Sweep design is the experimental backbone of Project 7. Without proper experimental methodology, the sweep produces unreliable results that cannot be trusted for production policy decisions. This concept ensures that the sweep runner generates statistically valid, cost-aware configuration comparisons.
Definitions & key terms
- OFAT (One-Factor-at-a-Time): Experimental design that varies one parameter while holding all others constant. Simple and interpretable but may miss interaction effects.
- Factorial design: Tests all combinations of parameter values. More thorough but exponentially more expensive.
- McNemar’s test: Statistical test for comparing two classifiers evaluated on the same dataset. Uses the paired structure of the data for more statistical power.
- Confidence interval: Range of values within which the true parameter value lies with a specified probability (typically 95%).
- SLO (Service Level Objective): A target value for a reliability or quality metric (e.g., pass rate >= 95%).
- Cost-quality tradeoff: The relationship between operational cost and output quality across configurations. The optimal point depends on the SLO and budget.
- Effect size: The magnitude of the difference between configurations. Small effect sizes require larger sample sizes to detect reliably.
Mental model diagram (ASCII)
Sweep Execution Pipeline
+------------------+ +-------------------+
| Sweep Config | | Fixture Dataset |
| - param grid | | - N queries |
| - metrics list | | - expected answers|
| - seed | | - task class |
+--------+---------+ +--------+----------+
| |
v v
+--------+------------------------+----------+
| Sweep Runner |
| for each config in param_grid: |
| for each query in dataset: |
| call LLM(query, config, seed) |
| score(response, expected_answer) |
| record(config, query_id, metrics) |
+---------------------+-----------------------+
|
v
+---------------------+-----------------------+
| Metric Aggregator |
| per config: |
| pass_rate, abstention_rate, format_ok |
| mean_latency, total_tokens, cost |
| confidence intervals (bootstrap or normal) |
| pairwise comparisons (McNemar) |
+---------------------+-----------------------+
|
v
+---------------------+-----------------------+
| Report Generator |
| - sweep results CSV |
| - cost-quality tradeoff chart data |
| - statistical comparison table |
| - recommended config with justification |
+---------------------------------------------+
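The runner stage of the pipeline above can be sketched as a minimal skeleton with the provider call stubbed out; the fixture field names (`id`, `query`, `expected`) are illustrative assumptions:

```python
import hashlib
import json
import random

def call_llm(query, config, seed):
    """Stub for the real provider call; returns (response, mean_logprob, tokens).
    The seeded RNG makes the stub deterministic, mimicking a fixed-seed API call."""
    rng = random.Random(f"{seed}:{query}:{config['temperature']}")
    return "stub answer", -rng.random(), rng.randint(50, 200)

def config_id(config):
    """Stable identifier: hash of the sorted parameter dict."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:8]

def run_sweep(param_grid, dataset, seed=42):
    """Evaluate every (config, query) pair with a fixed seed and record metrics."""
    results = []
    for temperature in param_grid["temperature"]:
        config = {"temperature": temperature, "top_p": 1.0}
        for row in dataset:
            response, logprob, tokens = call_llm(row["query"], config, seed)
            results.append({
                "config_id": config_id(config),
                "query_id": row["id"],
                "is_correct": response == row["expected"],
                "mean_logprob": logprob,
                "token_count": tokens,
            })
    return results

dataset = [{"id": "q_001", "query": "What is the return policy?", "expected": "stub answer"}]
results = run_sweep({"temperature": [0.0, 0.2]}, dataset, seed=7)
print(len(results))
```

A real implementation would replace `call_llm` with the provider SDK and a proper scorer, but the control flow (config loop, query loop, per-pair record) is the same.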
How it works (step-by-step, with invariants and failure modes)
- Load the sweep configuration and validate parameter ranges against provider limits. Invariant: all parameter values are within the provider’s accepted range. Failure mode: passing temperature 1.5 to Anthropic (max 1.0) causes an API error.
- Load the fixture dataset and verify it has expected answer labels. Invariant: every query in the dataset has a labeled correct answer. Failure mode: missing labels cause evaluation to produce undefined pass rates.
- For each configuration, evaluate every query in the fixture set and record per-query metrics. Invariant: the same seed is used across configurations so that randomness differences are attributable to parameter changes, not sampling luck. Failure mode: forgetting to set the seed causes each run to be non-reproducible.
- Aggregate metrics per configuration: compute pass rate, abstention rate, format compliance, mean latency, and total token cost. Invariant: aggregation uses all queries, not a sample. Failure mode: skipping failed API calls biases the pass rate upward.
- Compute confidence intervals for each metric. Invariant: confidence intervals are computed correctly (e.g., using Wilson score interval for proportions, not the Wald interval which is inaccurate for extreme proportions). Failure mode: using the wrong interval formula produces artificially narrow or wide bounds.
- Run pairwise statistical tests between configurations. Invariant: use a paired test (McNemar) since the same dataset is used for all configurations. Failure mode: using an unpaired test (chi-squared) loses statistical power and may miss real differences.
- Generate the cost-quality tradeoff analysis and recommend the configuration that meets all SLOs at the lowest cost. Invariant: the recommendation includes both the configuration and the evidence supporting it (pass rate, confidence interval, p-value vs alternatives). Failure mode: recommending a configuration based on point estimates without checking whether the difference is statistically significant.
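The interval computation from the aggregation step can use the Wilson score interval, which (as noted above) behaves better than the Wald interval for proportions near 0 or 1:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion at ~95% confidence."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_interval(193, 200)  # 96.5% pass rate on 200 queries
print(round(lo, 3), round(hi, 3))
```

For 193/200 the interval is roughly [0.93, 0.98], which is why two configurations a couple of points apart are often indistinguishable at this sample size.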
Minimal concrete example
Pairwise comparison (McNemar) for T=0.1 vs T=0.2 on 200 queries:
             T=0.2 PASS   T=0.2 FAIL
T=0.1 PASS |    172     |     9     |
T=0.1 FAIL |     11     |     8     |   (cells sum to N=200)
Off-diagonal: b=9, c=11
McNemar chi2 = (|b-c| - 1)^2 / (b+c) = (|9-11| - 1)^2 / 20 = 0.05
p-value = 0.82 -> NOT significant at alpha=0.05
Conclusion: T=0.1 and T=0.2 are statistically indistinguishable
            at this sample size, so the decision comes down to cost.
Cost comparison:
T=0.1: 43,200 tokens total, $0.43 per sweep
T=0.2: 44,100 tokens total, $0.44 per sweep
Savings: $0.01 per sweep (negligible)
Decision: Use T=0.2 (slightly higher point estimate, negligible cost diff)
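The McNemar statistic used in the example is a one-liner with the continuity correction applied:

```python
def mcnemar(b, c):
    """McNemar chi-squared with continuity correction; b and c are the
    off-diagonal disagreement counts (A passes/B fails and vice versa)."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# T=0.1 vs T=0.2 disagreements from the table above
chi2 = mcnemar(9, 11)
print(chi2)           # compare against the 3.84 critical value (1 df, alpha=0.05)
print(chi2 >= 3.84)   # significance check
```

For a proper p-value you would feed the statistic to a chi-squared survival function (e.g., `scipy.stats.chi2.sf(chi2, 1)`), or use an exact binomial test when b + c is small.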
Common misconceptions
- “The configuration with the highest pass rate is always the best.” If the difference is not statistically significant, you are choosing based on noise. Always check confidence intervals and p-values before declaring a winner.
- “200 samples is enough for a reliable sweep.” 200 samples gives a 95% confidence interval of roughly +/- 3% for a proportion around 0.95. This means you cannot reliably distinguish configurations that differ by less than ~5%. For finer discrimination, you need 500-2000 samples per configuration.
- “Sweeps only need to run once.” Model updates, prompt changes, and dataset drift all invalidate previous sweep results. Sweeps should be re-run periodically (e.g., monthly) or triggered by model/prompt version changes.
- “Cost does not matter if quality improves.” In production, every token costs money and adds latency. A configuration that is 1% better but 30% more expensive may not be worth it. The cost-quality tradeoff analysis makes this decision explicit.
Check-your-understanding questions
- Why is McNemar’s test more appropriate than a chi-squared test for comparing two sweep configurations on the same dataset?
- A sweep on 200 samples shows T=0.2 at 96.5% and T=0.3 at 95.0%. Can you confidently say T=0.2 is better?
- How would you modify the sweep design if you wanted to test interaction effects between temperature and top-p?
- What is the minimum sample size needed to detect a 3% difference in pass rate with 95% confidence?
Check-your-understanding answers
- McNemar’s test uses the paired structure: it examines cases where the two configurations disagree (one passes, the other fails). A chi-squared test treats the two samples as independent, ignoring the pairing and losing statistical power. Since both configurations are evaluated on the same queries, the paired test is correct.
- No. The 95% confidence interval for a proportion of 0.965 on 200 samples is approximately [0.935, 0.985]. For 0.950, it is [0.915, 0.975]. These intervals overlap substantially, meaning the difference is not statistically significant. You would need roughly 3,300 samples per configuration to reliably detect a 1.5% difference.
- Use a factorial design: test all combinations of (T values) x (top_p values). For 7 temperature values and 6 top_p values, this requires 42 configurations. Analyze with two-way ANOVA or its non-parametric equivalent to test for main effects and interaction effects.
- Using the formula n = (Z_alpha/2 + Z_beta)^2 * 2 * p * (1-p) / delta^2, with alpha = 0.05, beta = 0.20, p ~ 0.95, delta = 0.03: n ~ (1.96 + 0.84)^2 * 2 * 0.95 * 0.05 / 0.03^2 ~ 7.84 * 0.095 / 0.0009 ~ 828. You need approximately 830 samples per configuration.
Real-world applications
- A/B testing platforms at scale (Netflix, Booking.com) use similar experimental methodology to compare recommendation algorithms, with the same principles of statistical significance, confidence intervals, and cost-aware optimization.
- Pharmaceutical clinical trials use factorial designs and paired statistical tests to compare drug dosages, analogous to comparing LLM parameter configurations.
- Cloud cost optimization teams analyze cost-quality tradeoffs for compute instance types, using SLO-based selection to balance performance and budget.
Where you’ll apply it
- Phase 1: design the sweep grid and fixture dataset.
- Phase 2: execute the sweep and run statistical comparisons.
- Phase 3: generate the cost-quality tradeoff analysis and compile the final policy recommendation.
References
- “Trustworthy Online Controlled Experiments” by Kohavi, Tang, Xu - Chapters on experiment design and statistical testing
- “Site Reliability Engineering” by Google - SLO and error budget chapters
- “AI Engineering” by Chip Huyen - Evaluation methodology and metrics
- McNemar’s test: McNemar, Q. (1947). “Note on the sampling error of the difference between correlated proportions or percentages.” Psychometrika.
Key insights A sweep without statistical rigor is just anecdote collection. The experiment design determines whether your policy decisions are evidence-based or noise-based.
Summary Sweep design applies controlled experiment methodology to LLM parameter optimization. One-factor-at-a-time designs isolate parameter effects. Fixed datasets and seeds ensure reproducibility. Statistical tests (McNemar for paired comparisons) determine whether observed differences are real. Cost-quality tradeoff analysis selects the configuration that meets SLOs at minimal cost. The entire process produces an evidence-backed policy artifact rather than a subjective parameter guess.
Homework/Exercises to practice the concept
- Design a complete sweep plan for a “product description generator” task class. Specify: parameter grid, fixture dataset requirements, metrics, SLOs, and the statistical test you would use for pairwise comparison.
- Given two configurations with pass rates 92% and 89% on 500 shared queries, where configuration A passes but B fails on 35 queries and B passes but A fails on 20 queries, compute the McNemar test statistic and determine if the difference is significant at alpha = 0.05.
- Sketch a cost-quality tradeoff curve for 5 hypothetical configurations and identify the optimal configuration given an SLO of pass_rate >= 90% and a budget constraint of $0.50 per 1000 queries.
Solutions to the homework/exercises
- The sweep plan should include: temperature [0.0, 0.1, 0.2, 0.3, 0.5, 0.7], top_p fixed at 1.0, frequency_penalty fixed at 0.0. Fixture: 500+ product description queries with human-rated quality labels. Metrics: quality score, format compliance, word count, token cost. SLOs: quality >= 4.0/5.0, format compliance >= 98%, cost <= $0.002 per description. Use McNemar’s test for pairwise comparison because all configurations are evaluated on the same dataset.
- McNemar chi2 = (|35 - 20| - 1)^2 / (35 + 20) = 14^2 / 55 = 196/55 = 3.56. The critical value for chi-squared with 1 degree of freedom at alpha = 0.05 is 3.84. Since 3.56 < 3.84, the difference is NOT significant at the 5% level (p ~ 0.059). Despite a 3-point difference in pass rate, this sweep cannot confirm that A is better than B. Recommend increasing the sample size to ~800 queries.
- The sketch should show points plotted with quality on the y-axis and cost on the x-axis. Draw a horizontal line at 90% (SLO) and a vertical line at $0.50 (budget). The optimal configuration falls in the upper-left quadrant (meets the SLO, within budget); if multiple configurations meet both constraints, the cheapest one wins, i.e., the point furthest to the left that still sits above the SLO line.
3. Project Specification
3.1 What You Will Build
A reliability experiment runner that sweeps sampling settings and learns safe confidence policies.
3.2 Functional Requirements
- Run fixed evaluation set across multiple temperature/top_p settings.
- Record pass rate, abstention rate, and volatility per setting.
- Generate confidence bands and recommended production policy.
- Export side-by-side chart-ready CSV for review.
3.3 Non-Functional Requirements
- Performance: Full sweep on 200 cases completes under 6 minutes.
- Reliability: Fixed seeds make per-band comparisons reproducible.
- Security/Policy: Policy blocks unapproved decoding values in production mode.
3.4 Example Usage / Output
$ uv run p07-sweeper run --dataset fixtures/faq_200.jsonl --temperatures 0.0,0.2,0.4,0.7 --seed 7 --out out/p07
[INFO] Task class: customer_faq
[INFO] Evaluated 4 temperature bands x 200 cases
[PASS] Best reliability band: T=0.2 (pass=96.5%, abstain=2.0%)
[PASS] Creativity band accepted for ideation: T=0.7
[INFO] Recommended policy saved: out/p07/confidence_policy.yaml
3.5 Data Formats / Schemas / Protocols
- Input fixture JSONL with expected outcomes by task class.
- Sweep results CSV keyed by temperature, top_p, and case id.
- Policy YAML mapping task classes to approved decoding band.
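A possible loader for the input fixture JSONL; the field names (`id`, `query`, `expected`, `task_class`) are illustrative assumptions, since the exact schema is up to you. The demo writes a temporary two-line fixture and reads it back:

```python
import json
import os
import tempfile

def load_fixture(path):
    """Load a JSONL fixture, failing fast on missing labels (see section 3.6:
    unlabeled rows would make pass rates undefined downstream)."""
    rows = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue
            row = json.loads(line)
            for field in ("id", "query", "expected", "task_class"):
                if field not in row:
                    raise ValueError(f"line {lineno}: missing required field {field!r}")
            rows.append(row)
    return rows

# Demo: write a tiny fixture and load it back
sample = [
    {"id": "q_001", "query": "What is the return window?", "expected": "30 days",
     "task_class": "customer_faq"},
    {"id": "q_002", "query": "Do you ship overseas?", "expected": "yes",
     "task_class": "customer_faq"},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write("\n".join(json.dumps(r) for r in sample))
    path = f.name
rows = load_fixture(path)
os.unlink(path)
print(len(rows))
```

Validating the fixture up front enforces the "every query has a labeled correct answer" invariant from the sweep steps.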
3.6 Edge Cases
- Task classes with tiny sample size causing unstable metrics.
- Decoding setting that improves style but hurts safety.
- Confidence scores not comparable across task families.
- Policy conflicts between experiment and runtime configuration.
3.7 Real World Outcome
This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.
3.7.1 How to Run (Copy/Paste)
$ uv run p07-sweeper run --dataset fixtures/faq_200.jsonl --temperatures 0.0,0.2,0.4,0.7 --seed 7 --out out/p07
- Working directory:
project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS - Required inputs: project fixtures under
fixtures/ - Output directory:
out/p07
3.7.2 Golden Path Demo (Deterministic)
Use the fixed seed already embedded in the command or config profile. You should see stable pass/fail totals between runs.
3.7.3 If CLI: exact terminal transcript
$ uv run p07-sweeper run --dataset fixtures/faq_200.jsonl --temperatures 0.0,0.2,0.4,0.7 --seed 7 --out out/p07
[INFO] Task class: customer_faq
[INFO] Evaluated 4 temperature bands x 200 cases
[PASS] Best reliability band: T=0.2 (pass=96.5%, abstain=2.0%)
[PASS] Creativity band accepted for ideation: T=0.7
[INFO] Recommended policy saved: out/p07/confidence_policy.yaml
$ echo $?
0
Failure demo:
$ uv run p07-sweeper run --dataset fixtures/faq_200.jsonl --temperatures 1.8 --seed 7 --out out/p07
[ERROR] Temperature value 1.8 exceeds policy max (1.2)
[HINT] Use approved sweep range from policies/p07_sampling_bounds.yaml
$ echo $?
2
4. Solution Architecture
4.1 High-Level Design
User Input / Trigger
|
v
+-------------------------+
| Sweep Runner |
| (parameterized eval |
| across configurations) |
+-------------------------+
|
v
+-------------------------+
| Metric Aggregator |
| (pass rate, abstention, |
| cost, confidence CIs) |
+-------------------------+
|
v
+-------------------------+
| Calibration Engine |
| (Platt scaling, ECE, |
| threshold selection) |
+-------------------------+
|
v
+-------------------------+
| Policy Compiler |
| (task-class -> config |
| mapping with evidence) |
+-------------------------+
|
v
Artifacts / CSV / YAML / Logs
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Sweep Runner | Executes eval matrix over decoding settings. | Keep fixture order stable for fair comparison. Use seed for reproducibility. |
| Metric Aggregator | Computes reliability, volatility, and cost metrics per configuration. | Track abstentions separately from errors. Compute confidence intervals. |
| Calibration Engine | Transforms raw scores into calibrated confidence probabilities. | Use Platt scaling for simplicity; isotonic regression for non-linear patterns. |
| Policy Compiler | Converts sweep results into runtime policy artifact. | Freeze approved bands per task class with statistical evidence. |
4.3 Data Structures (No Full Code)
SweepConfig:
- task_class: string
- param_grid: {temperature: [float], top_p: [float], ...}
- dataset_path: string
- seed: int
- metrics: [string]
- acceptance_criteria: {metric: threshold}
SweepResult:
- config_id: string (hash of parameters)
- query_id: string
- response_text: string
- is_correct: bool
- log_probs: [float]
- latency_ms: int
- token_count: int
ConfigSummary:
- config_id: string
- pass_rate: float
- pass_rate_ci: [float, float]
- abstention_rate: float
- mean_latency_ms: float
- total_tokens: int
- calibrated_threshold: float
PolicyArtifact:
- task_class: string
- approved_config: {temperature, top_p, ...}
- evidence: {pass_rate, ci, p_value_vs_alternatives}
- abstention_threshold: float
- escalation_path: string
4.4 Algorithm Overview
Key algorithm: Sweep-Calibrate-Compile pipeline
- Load sweep config and fixture dataset. Validate parameter ranges.
- For each configuration in the parameter grid, evaluate every query and record per-query metrics.
- Aggregate metrics per configuration with confidence intervals.
- Run pairwise statistical comparisons (McNemar) between top configurations.
- Build calibration model on labeled sweep data. Compute ECE.
- Set abstention thresholds per task class using calibrated scores and cost model.
- Compile policy artifact with approved configuration, evidence, and thresholds.
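The final compilation step might assemble the artifact like this; the field names mirror the PolicyArtifact structure in section 4.3, the input numbers are illustrative, and JSON is used in place of YAML only to stay within the standard library:

```python
import json

def compile_policy(task_class, config, summary, threshold, escalation="human_queue"):
    """Assemble the policy artifact as a plain dict: approved config plus the
    statistical evidence that justifies it."""
    return {
        "task_class": task_class,
        "approved_config": config,
        "evidence": {
            "pass_rate": summary["pass_rate"],
            "pass_rate_ci": summary["pass_rate_ci"],
            "p_value_vs_alternatives": summary["p_value"],
        },
        "abstention_threshold": threshold,
        "escalation_path": escalation,
    }

policy = compile_policy(
    "customer_faq",
    {"temperature": 0.2, "top_p": 1.0, "seed": 42},
    {"pass_rate": 0.965, "pass_rate_ci": [0.93, 0.98], "p_value": 0.82},
    threshold=0.85,
)
print(json.dumps(policy, indent=2))
```

Serializing the same dict to YAML (e.g., with PyYAML's `yaml.safe_dump`) would produce the `confidence_policy.yaml` artifact named in section 3.4.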
Complexity Analysis (conceptual):
- Time: O(C * N) where C = number of configurations, N = dataset size. Each (config, query) pair requires one API call.
- Space: O(C * N) for storing all per-query results. Aggregated summaries are O(C).
- Cost: C * N * avg_tokens_per_call * price_per_token. For 7 configs x 200 queries x 200 tokens/query at $0.01/1K tokens = $2.80 per sweep.
5. Implementation Guide
5.1 Development Environment Setup
# 1) Install dependencies (Python 3.11+, uv package manager)
# 2) Prepare fixtures under fixtures/ with labeled query-answer pairs
# 3) Set API keys for target provider(s)
# 4) Run the project command(s) listed in section 3.7
5.2 Project Structure
p07/
├── src/
│ ├── sweep_runner.py # Parameter grid execution
│ ├── metric_aggregator.py # Statistical analysis
│ ├── calibration.py # Platt scaling, ECE
│ ├── policy_compiler.py # YAML policy generation
│ └── cli.py # Command-line interface
├── fixtures/
│ ├── faq_200.jsonl # Customer FAQ evaluation set
│ └── ideation_100.jsonl # Creative task evaluation set
├── policies/
│ └── p07_sampling_bounds.yaml # Allowed parameter ranges
├── out/
└── README.md
5.3 The Core Question You’re Answering
“Where is the reliability-versus-creativity boundary for each task class, and what statistical evidence supports that boundary?”
This question matters because it forces the project to produce objective, evidence-based parameter recommendations instead of relying on subjective impressions or anecdotal testing.
5.4 Concepts You Must Understand First
- Sampling controls and entropy
- How does temperature reshape the token probability distribution, and why does this matter for output reliability?
- Book Reference: OpenAI/Anthropic API documentation + “AI Engineering” by Chip Huyen
- Confidence calibration
- How do you transform raw model scores into calibrated probabilities that reflect true correctness rates?
- Book Reference: “Pattern Recognition and Machine Learning” by Bishop - Calibration sections
- Statistical experiment design
- How do you design a controlled experiment that produces statistically significant comparisons between configurations?
- Book Reference: “Trustworthy Online Controlled Experiments” by Kohavi et al. - Experiment fundamentals
- Policy thresholding and SLOs
- How do you translate reliability data into actionable production policies with abstention and escalation paths?
- Book Reference: “Site Reliability Engineering” by Google - SLO chapters
5.5 Questions to Guide Your Design
- Sweep design
- Which parameters will you sweep? What ranges and step sizes?
- How many samples per configuration are needed for your target statistical power?
- How will you handle provider-specific parameter limitations?
- Calibration pipeline
- What confidence signals will you extract from model outputs?
- Which calibration method (Platt, isotonic) is appropriate for your data?
- How will you evaluate calibration quality (ECE, reliability diagram)?
- Policy compilation
- What SLOs must each task class meet?
- How will you set abstention thresholds using calibrated scores and cost models?
- How will the policy artifact be versioned and deployed?
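The calibration questions above can be prototyped in pure Python. This is a minimal sketch (gradient-descent Platt fit and equal-width-bin ECE), not a production implementation; real code would likely use sklearn's `LogisticRegression` for the Platt step:

```python
import math

def platt_fit(scores, labels, lr=0.1, epochs=2000):
    """Platt scaling: fit p(correct) = sigmoid(a*score + b) by gradient descent."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then take the
    count-weighted mean gap between bin confidence and bin accuracy."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(p for p, _ in bucket) / len(bucket)
        acc = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(conf - acc)
    return ece
```

Fit `platt_fit` on held-out labeled sweep data (e.g. mean log-prob as the score), map scores through the fitted sigmoid, and compare ECE before and after to verify the calibration actually helped.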
5.6 Thinking Exercise
Pre-Mortem for Temperature Sweeper + Confidence Policy
Before implementing, write down 10 ways this project can fail in production. Classify each failure into: sweep design, calibration, policy, or operations.
Questions to answer:
- Which failures stem from insufficient sample sizes or flawed experiment design?
- Which failures stem from stale calibration after a model update?
- Which failures require runtime detection and human escalation?
5.7 The Interview Questions They’ll Ask
- “Why should temperature policies differ by task class?”
- “How do you evaluate decoding reliability scientifically, not just by eyeballing outputs?”
- “What is the difference between model uncertainty and task-class risk, and how do they interact in abstention decisions?”
- “How would you design confidence bands for abstention that minimize expected cost?”
- “When should creativity (higher temperature) be intentionally reduced, and what evidence would you use to justify that decision?”
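The cost-minimizing confidence band asked about above has a closed form under a simple two-cost model (costs here are illustrative assumptions, not part of the project spec): answering at calibrated confidence p has expected cost (1 - p) * cost_wrong, while escalating always costs cost_escalate, so the break-even threshold is p* = 1 - cost_escalate / cost_wrong:

```python
def abstention_threshold(cost_wrong: float, cost_escalate: float) -> float:
    """Answer only when calibrated confidence exceeds the break-even point
    between expected cost of a wrong answer and the cost of escalating."""
    return max(0.0, 1.0 - cost_escalate / cost_wrong)

# If a wrong answer costs 10x a human escalation, abstain below ~0.9 confidence.
print(abstention_threshold(cost_wrong=10.0, cost_escalate=1.0))
```

Richer cost models (per-task-class risk, retry costs) change the arithmetic but not the principle: the threshold should fall out of explicit costs, not intuition.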
5.8 Hints in Layers
Hint 1: Fix your dataset first. Unstable or unrepresentative fixture sets make any sweep conclusion noisy. Invest in a high-quality labeled evaluation set before running any parameter experiments.
Hint 2: Control one variable at a time. Separate temperature sweeps from top-p sweeps from penalty sweeps. Run OFAT first; only use factorial designs if you need to test interaction effects.
Hint 3: Track abstentions explicitly. A high pass rate can hide over-abstention (the model is only answering easy queries). Always report pass rate AND abstention rate together. The "effective pass rate" (passes / total queries, not passes / answered queries) is the true reliability metric.
Hint 4: Write policy as an artifact, not tribal knowledge. The final policy YAML should include the approved configuration, the statistical evidence supporting it, and the conditions under which it should be re-evaluated. Never leave production parameters as undocumented settings in code.
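The metric split from Hint 3 is easy to get wrong, so it is worth encoding once; a small sketch:

```python
def reliability_report(passes: int, abstentions: int, total: int) -> dict:
    """Report both rates so over-abstention cannot hide behind a high pass rate."""
    answered = total - abstentions
    return {
        "pass_rate": passes / answered if answered else 0.0,  # passes / answered
        "abstention_rate": abstentions / total,
        "effective_pass_rate": passes / total,                # passes / ALL queries
    }

# 90 passes out of 100 answered looks great, until you see the 100 abstentions:
print(reliability_report(passes=90, abstentions=100, total=200))
```

Here the naive pass rate is 0.90 while the effective pass rate is 0.45; reporting only the former would badly overstate reliability.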
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Sampling parameter mechanics | "AI Engineering" by Chip Huyen | Model evaluation and tuning chapters |
| Calibration and posterior estimation | "Pattern Recognition and Machine Learning" by Bishop | Calibration-related sections |
| Experiment design and statistical testing | "Trustworthy Online Controlled Experiments" by Kohavi et al. | Experiment fundamentals and analysis |
| SLOs and operational thresholds | "Site Reliability Engineering" by Google | SLO chapters |
| Cost-quality optimization | "Designing Data-Intensive Applications" by Kleppmann | Tradeoff and system design chapters |
5.10 Implementation Phases
Phase 1: Foundation
- Define sweep configuration schema and validate parameter ranges per provider.
- Build the fixture dataset with labeled expected answers for at least one task class.
- Implement the sweep runner that executes one configuration at a time with seed-based reproducibility.
- Checkpoint: One configuration runs end-to-end and produces per-query metrics CSV.
Phase 2: Core Functionality
- Implement metric aggregation with confidence intervals and pairwise statistical tests.
- Build the calibration pipeline (Platt scaling, ECE computation, reliability diagram data).
- Add abstention threshold computation using calibrated scores and cost model.
- Checkpoint: Full sweep across 4+ configurations produces a comparison table with CIs and p-values.
Phase 3: Operational Hardening
- Implement the policy compiler that generates a versioned YAML artifact with evidence.
- Add cost-quality tradeoff analysis and chart-ready CSV export.
- Add validation that blocks unapproved parameter values in production mode.
- Document the runbook for re-running sweeps after model or prompt updates.
- Checkpoint: A team member can reproduce the entire sweep from a clean checkout and get the same policy artifact.
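A compiled artifact matching the PolicyArtifact shape described earlier might look like the following; every value here is illustrative, and the field names beyond the core schema (such as `model_version`) are suggestions:

```yaml
# policies/faq_v1.yaml -- illustrative values only
task_class: customer_faq
approved_config:
  temperature: 0.2
  top_p: 1.0
  seed: 42
evidence:
  pass_rate: 0.94
  ci: [0.90, 0.97]
  p_value_vs_alternatives: 0.003
abstention_threshold: 0.85
escalation_path: human_review_queue
model_version: example-model-2024-01   # re-sweep when this changes
```

Because YAML supports comments, the evidence and re-evaluation conditions live next to the values they justify, which is exactly the "artifact, not tribal knowledge" property the hints call for.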
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Sweep design | OFAT vs factorial | OFAT first, factorial if needed | OFAT is cheaper and interpretable; factorial needed only for interaction effects |
| Calibration method | Platt vs isotonic | Platt for small datasets, isotonic for large | Platt has fewer parameters (less overfitting risk); isotonic is more flexible |
| Confidence signal | Log-probs vs self-consistency | Log-probs primary, self-consistency for borderline | Log-probs are cheap (1 API call); self-consistency is expensive (N calls) but more reliable |
| Statistical test | McNemar vs chi-squared | McNemar | Data is paired (same dataset for all configs); McNemar uses pairing for more power |
| Policy format | JSON vs YAML | YAML | Human-readable, supports comments for evidence documentation |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate statistical calculations | confidence interval computation, McNemar test, Platt scaling fit |
| Integration Tests | Verify end-to-end sweep pipeline | golden-path sweep with mock API produces expected CSV and YAML |
| Edge Case Tests | Ensure robust failure handling | empty dataset, out-of-range parameters, zero-variance metrics |
6.2 Critical Test Cases
- Golden path succeeds: sweep runs across 4 temperatures on 200 fixtures and produces comparison table and policy YAML.
- Parameter validation: temperature outside provider range returns error with hint.
- Statistical edge case: two configurations with identical pass rates produce non-significant p-value.
- Calibration validation: ECE decreases after Platt scaling compared to raw scores.
- Determinism: same seed produces identical results across two runs.
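The "identical pass rates produce a non-significant p-value" case falls out naturally from an exact McNemar test, which can be implemented from scratch with the standard library (a sketch; production code might use statsmodels instead):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact McNemar test on discordant pairs.

    b = queries config A passed and config B failed; c = the reverse.
    Under H0 the discordant pairs split 50/50, so the p-value is a
    two-sided binomial tail at p = 0.5 (capped at 1.0).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

print(mcnemar_exact(15, 3))   # heavily lopsided discordance: small p-value
print(mcnemar_exact(10, 10))  # perfectly balanced: 1.0, no evidence of a difference
```

Note that only discordant pairs enter the test; queries both configurations pass (or both fail) carry no comparative information, which is why McNemar has more power here than an unpaired chi-squared test.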
6.3 Test Data
fixtures/faq_200.jsonl # Primary evaluation set
fixtures/ideation_100.jsonl # Secondary creative task set
fixtures/edge_cases/
tiny_sample.jsonl # 5 queries (too few for reliable stats)
all_correct.jsonl # 100% pass rate edge case
all_wrong.jsonl # 0% pass rate edge case
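One line of `faq_200.jsonl` might look like the following; the field names here are an assumption for illustration, not a required schema:

```json
{"query": "How do I reset my password?", "expected": "Use the Forgot password link on the sign-in page.", "task_class": "customer_faq", "query_type": "account"}
```

Whatever schema you choose, include a stratification field (like `query_type` above) so the "sample fresh production-like fixtures" pitfall below can be addressed with stratified sampling.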
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| "Best temperature in eval fails live" | Eval data does not match production distribution. | Sample fresh production-like fixtures regularly. Use stratified sampling by query type. |
| "Confidence score is misleading" | Confidence is uncalibrated across task classes. | Calibrate thresholds per task family using Platt scaling on held-out labeled data. |
| "Policy drift over time" | Runtime settings changed without re-sweep. | Tie runtime config to policy artifact hash. Alert when artifact age exceeds threshold. |
| "Statistical significance ignored" | Configuration A chosen over B based on 1% difference on 100 samples. | Always compute confidence intervals and run McNemar test before declaring a winner. |
| "Seed gives false reproducibility" | Same seed produces different results after model update. | Track model version in policy artifact. Re-sweep after any model change. |
7.2 Debugging Strategies
- Re-run deterministic fixtures with fixed seed and compare per-query results.
- Diff latest sweep CSV against last known-good baseline to identify which queries changed.
- Isolate whether regression is in model output, metric computation, or calibration by testing each stage independently.
- Check provider changelog for model snapshot updates that might affect reproducibility.
7.3 Performance Traps
- Running factorial sweeps when OFAT suffices wastes API budget exponentially.
- Self-consistency sampling on all queries instead of borderline queries multiplies cost by N.
- Storing full response text for every (config, query) pair instead of just metrics creates storage bloat.
- Not caching API responses when re-running sweeps with the same seed wastes money on duplicate calls.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add one new task class with its own fixture set and expected outcomes.
- Add format compliance checking (does the response match the expected schema?).
8.2 Intermediate Extensions
- Implement self-consistency sampling as an alternative confidence signal for borderline queries.
- Add dashboard-ready visualizations: reliability diagram, cost-quality tradeoff curve, sweep comparison heatmap.
- Implement automated regression detection that compares new sweep results against historical baselines.
8.3 Advanced Extensions
- Build a multi-provider sweep that normalizes parameter ranges across OpenAI, Anthropic, and Google APIs.
- Implement Bayesian optimization to efficiently search the parameter space instead of grid search.
- Integrate with the canary rollout controller (P11) to automatically deploy approved configurations.
9. Real-World Connections
9.1 Industry Applications
- PromptOps platform teams use temperature sweeps to establish per-use-case reliability baselines before production deployment.
- AI governance teams require statistical evidence (sweep reports with confidence intervals) before approving new model configurations.
- Cost optimization teams use tradeoff analysis to reduce LLM spend by finding cheaper parameter configurations that still meet SLOs.
9.2 Related Open Source Projects
- OpenAI Evals framework for structured evaluation of LLM outputs.
- LangSmith evaluation and tracing workflows for parameter comparison.
- PromptFoo for automated prompt and parameter testing.
9.3 Interview Relevance
- Demonstrates ability to apply experimental methodology to LLM parameter optimization rather than relying on guesswork.
- Shows understanding of statistical significance, calibration, and cost-quality tradeoffs that production systems require.
- Proves familiarity with provider-specific parameter differences and reproducibility challenges.
10. Resources
10.1 Essential Reading
- OpenAI API Reference: temperature, top_p, frequency_penalty, presence_penalty, seed parameters
- Anthropic API Reference: temperature, top_p, top_k parameters and limitations
- “Know Your Limits: A Survey of Abstention in Large Language Models” (Wen et al., 2025, TACL)
- “AI Engineering” by Chip Huyen - evaluation and parameter tuning chapters
10.2 Video Resources
- Talks on LLM eval systems, PromptOps, and parameter optimization methodology.
- Conference presentations on confidence calibration and abstention in production systems.
10.3 Tools & Documentation
- vLLM Sampling Parameters documentation for open-weight model configurations.
- PromptFoo for automated prompt and parameter testing.
- SciPy stats module for McNemar’s test and calibration functions.
10.4 Related Projects in This Series
- P08 (Prompt DSL + Linter): sweep results feed into linting rules that enforce approved parameter ranges.
- P09 (Prompt Caching Optimizer): cost analysis from sweeps informs caching decisions.
- P11 (Canary Prompt Rollout Controller): approved configurations from sweeps are deployed via canary rollout.
- P14 (Adversarial Eval Forge): adversarial datasets extend the sweep fixture set for robustness testing.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how temperature reshapes the token probability distribution and why T=0 is not truly deterministic.
- I can explain the difference between top-k and top-p sampling and when each is appropriate.
- I can explain why confidence calibration is necessary and how Platt scaling works.
- I can justify abstention threshold choices using cost models and calibrated confidence scores.
11.2 Implementation
- Golden-path sweep runs end-to-end and produces a comparison table with confidence intervals.
- McNemar’s test correctly identifies statistically significant differences between configurations.
- Calibration pipeline reduces ECE compared to raw scores.
- Policy YAML artifact includes approved configuration, evidence, and re-evaluation criteria.
11.3 Growth
- I can explain the cost-quality tradeoff for my task class and justify the chosen configuration.
- I can describe how to re-run the sweep after a model update and what to look for in the new results.
- I can explain this project’s methodology in an interview setting with statistical rigor.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Sweep runs across 4+ temperature values on a 200-query fixture set with fixed seed.
- Per-configuration metrics include pass rate, abstention rate, and token cost.
- Policy YAML artifact maps the task class to an approved temperature band.
Full Completion:
- Confidence intervals and McNemar’s test results are computed for pairwise comparisons.
- Calibration pipeline (Platt scaling) produces calibrated confidence scores with ECE report.
- Cost-quality tradeoff analysis with chart-ready CSV export.
- Runbook documents re-sweep procedure for model and prompt updates.
Excellence (Above & Beyond):
- Multi-provider sweep with normalized parameter ranges.
- Self-consistency sampling for borderline queries.
- Integration with P11 (canary rollout) for automated policy deployment.
- Bayesian optimization for efficient parameter search beyond grid sweep.