Project 7: Temperature Sweeper + Confidence Policy
Build a statistical analysis tool that measures prompt reliability across temperature ranges and establishes confidence-based policies
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Advanced |
| Time Estimate | 1 week |
| Language | Python (Alternatives: TypeScript) |
| Prerequisites | Project 1 (Harness), basic statistics knowledge |
| Key Topics | Sampling strategies, variance measurement, confidence scoring, reliability curves |
| Knowledge Area | Sampling / Uncertainty |
| Software/Tool | Matplotlib / Plotly (Visualization) |
| Main Book | "Hands-On Machine Learning" by Géron (Ch. 3: Classification metrics) |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 3. The "Service & Support" Model |
1. Learning Objectives
By completing this project, you will:
- Understand Temperature and Sampling: Master how temperature, top-p, and top-k control randomness in LLM outputs
- Measure Output Variance: Quantify how different two JSON outputs are mathematically
- Build Reliability Curves: Plot accuracy vs. temperature to find the stability sweet spot
- Implement Self-Consistency: Use majority voting across multiple samples to improve reliability
- Design Confidence Policies: Create rules for when to trust the model vs. ask for human review
- Analyze Logprobs: Interpret the model's internal probability scores
- Optimize for Latency vs. Accuracy: Understand the trade-off between speed and reliability
- Build Statistical Frameworks: Apply hypothesis testing to prompt engineering decisions
- Visualize Uncertainty: Create production-ready charts showing model confidence over time
- Establish Production Policies: Define SLOs for when model outputs are "good enough"
2. Theoretical Foundation
2.1 Core Concepts
What is Temperature?
Temperature is a sampling parameter that controls the randomness of token selection during text generation.
How LLMs Generate Text:
# At each step, the model outputs a probability distribution
next_token_probs = {
"The": 0.7,
"A": 0.2,
"An": 0.08,
"Their": 0.02
}
# Temperature modifies this distribution
Temperature = 0.0 (Deterministic):
# Always pick highest probability
next_token = "The" # Always the same
Temperature = 1.0 (Default):
# Sample from the distribution
next_token = np.random.choice(["The", "A", "An", "Their"], p=[0.7, 0.2, 0.08, 0.02])
Temperature = 2.0 (High randomness):
# Flattens the distribution, making rare tokens more likely
adjusted_probs = {
"The": 0.45, # Still most likely, but less dominant
"A": 0.3,
"An": 0.15,
"Their": 0.1
}
Mathematical Formula:
P(token_i) = exp(logit_i / T) / Σ_j exp(logit_j / T)
Where:
- T = temperature
- logit_i = model's raw score for token i
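A minimal sketch of this formula in code (the logit values below are illustrative, not from a real model):
import numpy as np

def apply_temperature(logits: dict, temperature: float) -> dict:
    """Turn raw logits into a sampling distribution at the given temperature."""
    tokens = list(logits)
    if temperature == 0.0:
        # Greedy limit: all probability mass on the argmax token
        best = max(logits, key=logits.get)
        return {t: (1.0 if t == best else 0.0) for t in tokens}
    scaled = np.array([logits[t] for t in tokens]) / temperature
    probs = np.exp(scaled - scaled.max())  # subtract the max for numerical stability
    probs /= probs.sum()
    return dict(zip(tokens, probs.round(3).tolist()))

logits = {"The": 2.0, "A": 0.7, "An": -0.2, "Their": -1.6}  # illustrative raw scores
print(apply_temperature(logits, 0.5))  # sharply peaked on "The"
print(apply_temperature(logits, 2.0))  # much flatter distribution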
Visual Effect:
Temperature = 0.0: probability mass concentrated on a single token (peaked)
Temperature = 0.5: a few top tokens dominate (focused)
Temperature = 1.0: the model's learned distribution (balanced)
Temperature = 2.0: probability mass spread across many tokens (flat)
Greedy vs. Nucleus Sampling
Greedy Decoding (temp=0):
- Always pick the most probable token
- Deterministic (same input → same output)
- Safe for structured outputs (JSON)
- Can be boring or repetitive
Top-P (Nucleus Sampling):
- Select from the smallest set of tokens whose cumulative probability ≥ p
- More dynamic than greedy
- Less wild than high temperature
# Top-P = 0.9 example
token_probs = {
"The": 0.7,
"A": 0.2, # Cumulative: 0.9 (stop here)
"An": 0.08, # Not included
"Their": 0.02
}
# Sample from {"The", "A"} only
Top-K Sampling:
- Consider only the K most probable tokens
- Fixed size (unlike top-p which is dynamic)
# Top-K = 2
token_probs = {
"The": 0.7,
"A": 0.2, # Only these 2 considered
"An": 0.08, # Ignored
"Their": 0.02
}
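Both strategies can be expressed as filters over a token distribution. A minimal sketch, reusing the illustrative probabilities above:
def top_p_filter(probs: dict, p: float = 0.9) -> dict:
    """Keep the smallest set of tokens whose cumulative probability reaches p, then renormalize."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: v / total for t, v in kept.items()}

def top_k_filter(probs: dict, k: int = 2) -> dict:
    """Keep only the k most probable tokens, then renormalize."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {t: v / total for t, v in kept.items()}

probs = {"The": 0.7, "A": 0.2, "An": 0.08, "Their": 0.02}
print(top_p_filter(probs, p=0.9))  # samples only from {"The", "A"}
print(top_k_filter(probs, k=2))    # the same two tokens in this example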
Variance Measurement
Variance quantifies how different multiple outputs are from each other.
For Numeric Outputs:
outputs = [0.8, 0.85, 0.75, 0.9, 0.82]
variance = np.var(outputs) # ~0.0025
std_dev = np.std(outputs) # ~0.050
# Low variance = consistent
# High variance = unpredictable
For Categorical Outputs:
# 10 runs at temp=0
categories = ["refund", "refund", "refund", "refund", "refund",
"refund", "refund", "refund", "refund", "refund"]
# Variance = 0 (all same)
# 10 runs at temp=1.5
categories = ["refund", "technical", "refund", "policy", "refund",
"technical", "refund", "refund", "policy", "technical"]
# Variance = High (inconsistent)
For JSON Outputs (Structural Variance):
def json_variance(outputs: List[dict]) -> float:
"""
Measure how different JSON outputs are
Strategy:
1. Extract all key-value pairs
2. Compare across outputs
3. Return ratio of differences
"""
all_keys = set()
for output in outputs:
all_keys.update(flatten_keys(output))
differences = 0
comparisons = 0
for key in all_keys:
values = [get_nested_value(output, key) for output in outputs]
unique_values = len(set(values))
if unique_values > 1:
differences += 1
comparisons += 1
return differences / comparisons if comparisons > 0 else 0
Self-Consistency Pattern
Idea: Run the same prompt N times and take the majority vote.
Example:
# Run 5 times with temp=0.7
outputs = [
{"category": "refund", "priority": "high"},
{"category": "refund", "priority": "high"},
{"category": "refund", "priority": "medium"},
{"category": "technical", "priority": "high"},
{"category": "refund", "priority": "high"}
]
# Majority vote: category="refund" (4/5), priority="high" (4/5)
final_output = {"category": "refund", "priority": "high"}
confidence = 0.8 # 4/5 agreement
When to Use:
- Critical decisions (medical, legal, financial)
- High uncertainty queries
- When accuracy > latency
- Production fallback when single-sample fails validation
Research Foundation:
"Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Wang et al., 2022) showed that self-consistency can improve accuracy from 57% → 78% on reasoning tasks.
Logprobs (Log Probabilities)
Logprobs reveal the model's internal confidence for each token.
What They Show:
{
"token": "refund",
"logprob": -0.2, // log(probability)
"probability": 0.82, // exp(-0.2) โ 0.82
"top_logprobs": [
{"token": "refund", "logprob": -0.2},
{"token": "return", "logprob": -1.6}, // Much less likely
{"token": "cancel", "logprob": -2.3}
]
}
Interpreting Logprobs:
| Logprob | Probability | Interpretation |
|---|---|---|
| -0.05 | ~0.95 | Very confident |
| -0.5 | ~0.60 | Somewhat confident |
| -1.0 | ~0.37 | Uncertain |
| -2.0 | ~0.14 | Guessing |
| -5.0 | ~0.007 | Wild guess |
Using Logprobs for Confidence Scoring:
def calculate_confidence(logprobs: List[float]) -> float:
"""
Average probability across all tokens in the output
High confidence: All tokens had high probability
Low confidence: Model was guessing on many tokens
"""
probabilities = [math.exp(lp) for lp in logprobs]
return sum(probabilities) / len(probabilities)
Production Use Case:
if confidence < 0.6:
# Low confidence - run self-consistency or escalate
outputs = run_n_times(prompt, n=5)
final = majority_vote(outputs)
else:
# High confidence - use single sample
final = output
The Reliability Curve
A reliability curve plots model performance against temperature.
Typical Shape:
(Chart: accuracy holds near 100% between temp 0.0 and 0.3, declines steadily through 0.5-1.0, and drops toward ~60% by temp 1.5.)
Key Observations:
- Plateau (0.0-0.3): Temperature has minimal effect
- Degradation (0.5-1.0): Accuracy starts dropping
- Collapse (1.5+): Model becomes unreliable
Task-Specific Curves:
| Task Type | Optimal Temp | Why |
|---|---|---|
| JSON extraction | 0.0 | Need determinism |
| Creative writing | 0.7-1.0 | Want variety |
| Code generation | 0.2 | Balance correctness & diversity |
| Classification | 0.0 | Need consistency |
| Brainstorming | 1.2+ | Want novelty |
2.2 Why This Matters
Production Relevance
Problem: No one measures temperature impact systematically
# Common anti-pattern
response = llm.complete(prompt, temperature=0.7) # Why 0.7? No idea!
# Result: Inconsistent outputs, unexplained failures
Solution: Empirical measurement
# Run sweeper
results = sweep_temperature(prompt, test_cases, temps=[0, 0.3, 0.7, 1.0])
# Results show:
# temp=0.0 → 98% accuracy
# temp=0.7 → 85% accuracy
# temp=1.0 → 65% accuracy
# Decision: Use temp=0.0 for this task
Real-World Impact:
- Medical diagnosis chatbot: Must use temp=0 (consistency critical)
- Marketing copy generator: Can use temp=0.9 (creativity valued)
- Customer support: temp=0.2 (slight variety, high accuracy)
The Latency Trade-off:
Self-consistency improves accuracy but increases latency:
Single sample (temp=0): 200ms, 95% accurate
Self-consistency (n=5): 1000ms, 99% accurate
Decision depends on:
- User-facing? (latency matters)
- Critical decision? (accuracy matters)
- Cost budget? (5x API calls)
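One way to make that decision explicit is a small policy function. This is a hedged sketch: the thresholds and field names are placeholders to be replaced with numbers from your own sweep, not recommendations.
def choose_sampling_strategy(user_facing: bool, critical_decision: bool, cost_budget_multiplier: float) -> dict:
    """Pick between a fast single sample and slower self-consistency.

    All thresholds here are assumptions; tune them from your sweep results.
    """
    if critical_decision and cost_budget_multiplier >= 5:
        # Accuracy outweighs latency and the budget allows ~5x the API calls
        return {"temperature": 0.7, "n_samples": 5, "strategy": "self_consistency"}
    if user_facing:
        # Latency matters most: deterministic single sample
        return {"temperature": 0.0, "n_samples": 1, "strategy": "single_sample"}
    return {"temperature": 0.0, "n_samples": 3, "strategy": "self_consistency"}

print(choose_sampling_strategy(user_facing=True, critical_decision=False, cost_budget_multiplier=1))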
Industry Applications
1. OpenAI's GPT-4 Evaluations
OpenAI runs temperature sweeps on every model release:
# Simplified version of their approach
for temp in [0.0, 0.3, 0.7, 1.0]:
for benchmark in ["MMLU", "HumanEval", "HellaSwag"]:
score = evaluate(model, benchmark, temperature=temp)
report[benchmark][temp] = score
# Publish results showing optimal temp per task
2. Anthropic's Constitutional AI
Anthropic uses variance measurement to detect when Claude is uncertain:
# High variance on sensitive topics → Trigger safety review
variance = measure_variance(outputs)
if variance > 0.3 and topic in ["medical", "legal", "financial"]:
escalate_to_human()
3. Production Confidence Policies (Stripe)
Stripe's AI-powered fraud detection:
def fraud_policy(transaction, model_confidence):
if model_confidence > 0.95:
return "auto_approve"
elif model_confidence > 0.7:
return "manual_review"
else:
return "auto_reject"
# This policy was tuned using temperature sweeps
2.3 Common Misconceptions
| Misconception | Reality |
|---|---|
| "Higher temp = better outputs" | Depends on task; temp=0 is often optimal |
| "Temp=0 makes the model boring" | For creative tasks yes, for structured outputs no |
| "Self-consistency is always better" | Only if the accuracy gain justifies the ~5x cost increase |
| "Variance = 0 is ideal" | Depends on task; some tasks need diversity |
| "Logprobs are hard to use" | Simple threshold rules work well in practice |
3. Project Specification
3.1 What You Will Build
A temperature analysis framework that:
- Runs parametric sweeps across temperature ranges (0.0, 0.2, 0.4, ..., 2.0)
- Measures variance for each temperature setting
- Calculates accuracy against test cases from Project 1 harness
- Generates reliability curves showing accuracy vs. temperature
- Implements self-consistency with configurable N
- Analyzes logprobs to quantify model confidence
- Recommends optimal settings based on your task requirements
- Defines confidence policies (when to trust single sample vs. self-consistency)
- Visualizes results with interactive charts
- Integrates with CI/CD to catch regressions in reliability
Core Question This Tool Answers:
"How do I know when the model is guessing?"
3.2 Functional Requirements
FR1: Temperature Sweep Execution
- Run same prompt N times per temperature setting
- Support temperature range 0.0 to 2.0 in 0.1 increments
- Parallel execution to reduce total runtime
- Cache results to avoid redundant API calls
FR2: Variance Measurement
Implement variance metrics for:
- Categorical outputs: Mode, entropy, agreement ratio
- Numeric outputs: Standard deviation, coefficient of variation
- JSON outputs: Structural diff count, value divergence
- Text outputs: Edit distance, semantic similarity
FR3: Accuracy Evaluation
- Integrate with Project 1 test harness
- Run full test suite per temperature
- Track success rate, precision, recall
- Compare against baseline (temp=0)
FR4: Self-Consistency Implementation
- Run prompt N times (configurable, default N=5)
- Implement majority voting for categorical outputs
- Implement median/mean for numeric outputs
- Track agreement scores as confidence metric
FR5: Logprob Analysis
- Extract logprobs from API responses
- Calculate per-token and aggregate confidence scores
- Identify low-confidence tokens
- Correlate logprobs with actual accuracy
FR6: Visualization & Reporting
Generate charts:
- Reliability Curve: Accuracy vs. temperature
- Variance Plot: Variance vs. temperature
- Confidence Distribution: Histogram of confidence scores
- Latency vs. Accuracy: Trade-off visualization
- Self-Consistency Impact: N vs. accuracy improvement
FR7: Policy Recommendation
Based on sweep results, recommend:
- Optimal temperature for this prompt
- When to use self-consistency
- Confidence threshold for auto-accept
- Escalation policy for low-confidence outputs
3.3 Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Sweep Time | <10 minutes for 100 test cases | Must be practical for CI/CD integration |
| Statistical Significance | N ≥ 30 samples per temp for variance | Need confidence in measurements |
| Visualization Quality | Publication-ready charts | Communicate findings to stakeholders |
| Cost Efficiency | Smart caching, parallel execution | Temperature sweeps are expensive |
| Reproducibility | Same results on re-run (with seed) | Scientific rigor |
3.4 Example Usage
Running a temperature sweep:
$ python sweeper.py --prompt prompts/support_agent.yaml --temps "0.0,0.3,0.7,1.0,1.5" --runs-per-temp 10
============================================================
  TEMPERATURE SWEEPER - Reliability Analysis
============================================================
Loading prompt: prompts/support_agent.yaml
Loading test cases: 50 cases across 3 categories
[Sweep Configuration]
Temperature range: [0.0, 0.3, 0.7, 1.0, 1.5]
Runs per temperature: 10
Total API calls: 2500 (50 cases ร 5 temps ร 10 runs)
Estimated cost: $12.50
Estimated time: 8 minutes
Proceed? (y/n): y
[Running Sweep]
----------------------------------------------------------
Temperature: 0.0
----------------------------------------------------------
Running test case 1/50 (10 samples)... Done (avg: 120ms)
Running test case 2/50 (10 samples)... Done (avg: 115ms)
...
Running test case 50/50 (10 samples)... Done (avg: 118ms)
Results for temp=0.0:
Success Rate: 98% (49/50 cases passed)
Avg Variance: 0.01 (very low)
Avg Confidence: 0.94
Avg Latency: 117ms
Temperature: 0.3
----------------------------------------------------------
...
Temperature: 0.7
----------------------------------------------------------
Results for temp=0.7:
Success Rate: 92% (46/50 cases passed)
Avg Variance: 0.15 (moderate)
Avg Confidence: 0.78
Avg Latency: 123ms
Temperature: 1.0
----------------------------------------------------------
Results for temp=1.0:
Success Rate: 74% (37/50 cases passed)
Avg Variance: 0.35 (high)
Avg Confidence: 0.62
Avg Latency: 128ms
Temperature: 1.5
----------------------------------------------------------
Results for temp=1.5:
Success Rate: 48% (24/50 cases passed)
Avg Variance: 0.58 (very high)
Avg Confidence: 0.41
Avg Latency: 135ms
======================================================
SWEEP SUMMARY
======================================================
TEMP | SUCCESS RATE | VARIANCE | CONFIDENCE
0.0 | 98% | 0.01 | 0.94
0.3 | 96% | 0.08 | 0.88
0.7 | 92% | 0.15 | 0.78
1.0 | 74% | 0.35 | 0.62
1.5 | 48% | 0.58 | 0.41
======================================================
RECOMMENDATIONS:
1. OPTIMAL TEMPERATURE: 0.0
- Highest accuracy (98%)
- Lowest variance (0.01)
- Fast and consistent
2. SELF-CONSISTENCY POLICY:
- For confidence < 0.7: Use self-consistency (n=5)
- Expected improvement: +12% accuracy on edge cases
3. CONFIDENCE POLICY:
- Auto-accept: confidence ≥ 0.85 (covers 87% of cases)
- Human review: confidence < 0.85 (13% of cases)
- Auto-reject: confidence < 0.5 (2% of cases)
4. PRODUCTION SETTINGS:
temperature: 0.0
fallback_strategy: "self_consistency"
fallback_threshold: 0.7
fallback_n: 5
Saved detailed results to: reports/sweep_2024-12-27_16-45-23.json
Saved charts to: reports/charts/
Self-Consistency Analysis:
$ python sweeper.py --self-consistency --n 3,5,10 --prompt prompts/support_agent.yaml
[Self-Consistency Analysis]
Testing N=3, 5, 10 on 50 test cases
Baseline (single sample, temp=0.7): 92% accuracy
N=3 (majority vote):
Accuracy: 95% (+3 percentage points)
Latency: 3x (360ms avg)
Cost: 3x
N=5 (majority vote):
Accuracy: 97% (+5 percentage points)
Latency: 5x (600ms avg)
Cost: 5x
N=10 (majority vote):
Accuracy: 98% (+6 percentage points)
Latency: 10x (1200ms avg)
Cost: 10x
======================================================
RECOMMENDATION
For your use case:
- If latency-sensitive (user-facing): use temp=0.0 with a single sample (98% accuracy)
- If you need sampled outputs at temp=0.7: add N=5 self-consistency to recover accuracy (92% → 97%)
- Sweet spot: N=5 captures most of the improvement (+5 of the +6 points) at roughly half the cost of N=10
4. Solution Architecture
4.1 High-Level Design
Test Harness (from Project 1)
  - Loads test cases
  - Defines success criteria
        |
        v
Sweep Executor
  - Iterates over temperature settings
  - Runs N samples per test case per temperature
  - Parallelizes API calls
        |
        v
Variance Calculator
  - Measures consistency across samples
  - Computes structural diffs for JSON
  - Calculates semantic similarity for text
        |
        v
Accuracy Evaluator
  - Runs validators from Project 1
  - Aggregates pass/fail across samples
  - Computes success rate per temperature
        |
        v
Self-Consistency Engine
  - Implements majority voting
  - Tracks agreement scores
  - Compares single vs. ensemble performance
        |
        v
Logprob Analyzer
  - Extracts probabilities from the API
  - Calculates confidence scores
  - Correlates with actual accuracy
        |
        v
Report Generator
  - Creates reliability curves
  - Generates recommendation summaries
  - Exports JSON for CI/CD integration
4.2 Key Components
| Component | Responsibility | Implementation Notes |
|---|---|---|
| SweepExecutor | Orchestrate temperature sweep | Use asyncio for parallel API calls |
| VarianceCalculator | Measure output consistency | Different strategies per output type |
| AccuracyEvaluator | Validate outputs | Reuse validators from Project 1 |
| SelfConsistency | Majority voting & aggregation | Handle ties with random selection |
| LogprobAnalyzer | Confidence scoring | Requires API logprob support (e.g., OpenAI chat models; not all providers expose logprobs) |
| ChartGenerator | Create visualizations | Matplotlib for static, Plotly for interactive |
| PolicyRecommender | Suggest optimal settings | Rule-based system with heuristics |
| CacheManager | Avoid redundant API calls | Hash-based lookup (prompt + temp + seed) |
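The CacheManager row above only names the approach; a minimal sketch of hash-based caching, assuming one JSON file per (prompt, temperature, seed, run) tuple. The class and file layout are illustrative, not a fixed interface.
import hashlib
import json
from pathlib import Path
from typing import Any, Optional

class CacheManager:
    """Cache samples keyed by (prompt, temperature, seed, run index) to avoid redundant API calls."""

    def __init__(self, cache_dir: str = ".sweep_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def _key(self, prompt: str, temperature: float, seed: Optional[int], run_index: int) -> str:
        payload = json.dumps(
            {"prompt": prompt, "temperature": temperature, "seed": seed, "run": run_index},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, prompt: str, temperature: float, seed: Optional[int], run_index: int) -> Optional[Any]:
        path = self.cache_dir / f"{self._key(prompt, temperature, seed, run_index)}.json"
        return json.loads(path.read_text()) if path.exists() else None

    def put(self, prompt: str, temperature: float, seed: Optional[int], run_index: int, result: Any) -> None:
        path = self.cache_dir / f"{self._key(prompt, temperature, seed, run_index)}.json"
        path.write_text(json.dumps(result))

The run index is part of the key because repeated runs at the same settings are intentional samples, not duplicates to be deduplicated.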
4.3 Data Structures
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from enum import Enum
@dataclass
class SweepConfig:
"""Configuration for temperature sweep"""
temperatures: List[float]
runs_per_temperature: int
test_cases: List[TestCase]
model: str
max_parallel: int = 10
cache_enabled: bool = True
seed: Optional[int] = None
@dataclass
class SampleResult:
"""Single API call result"""
temperature: float
test_case_id: str
output: Any
passed: bool
latency_ms: float
logprobs: Optional[List[float]]
token_count: int
timestamp: str
@dataclass
class TemperatureResult:
"""Aggregated results for one temperature"""
temperature: float
samples: List[SampleResult]
success_rate: float
avg_variance: float
avg_confidence: float
avg_latency_ms: float
failed_cases: List[str]
@dataclass
class SweepReport:
"""Complete sweep analysis"""
config: SweepConfig
results: List[TemperatureResult]
optimal_temperature: float
recommendations: Dict[str, Any]
charts: Dict[str, str] # Chart name -> file path
total_cost: float
total_time_s: float
@dataclass
class SelfConsistencyResult:
"""Self-consistency analysis for one test case"""
test_case_id: str
n: int
samples: List[Any]
majority_vote: Any
agreement_score: float # 0.0 to 1.0
confidence: float
class VarianceMetric(Enum):
CATEGORICAL_MODE = "categorical_mode"
NUMERIC_STD = "numeric_std"
JSON_STRUCTURAL = "json_structural"
TEXT_EDIT_DISTANCE = "text_edit_distance"
4.4 Algorithm Overview
Temperature Sweep Algorithm
async def run_temperature_sweep(
config: SweepConfig
) -> SweepReport:
"""
Execute temperature sweep across all test cases
Complexity: O(T × C × N) where:
- T = number of temperatures
- C = number of test cases
- N = runs per temperature
Parallelization: Run up to max_parallel API calls concurrently
"""
results = []
for temp in config.temperatures:
print(f"\nTesting temperature: {temp}")
temp_results = []
# For each test case
for test_case in config.test_cases:
# Run N times
samples = []
# Parallelize the N runs
tasks = [
execute_with_temp(test_case, temp, config.model)
for _ in range(config.runs_per_temperature)
]
# Execute in batches to respect rate limits
for batch in chunks(tasks, config.max_parallel):
batch_results = await asyncio.gather(*batch)
samples.extend(batch_results)
# Evaluate samples
passed_count = sum(1 for s in samples if s.passed)
success_rate = passed_count / len(samples)
# Calculate variance
variance = calculate_variance(
[s.output for s in samples],
get_variance_metric(test_case)
)
# Calculate confidence (from logprobs)
if samples[0].logprobs:
avg_confidence = np.mean([
calculate_confidence(s.logprobs) for s in samples
])
else:
avg_confidence = None
temp_results.append(TemperatureResult(
temperature=temp,
samples=samples,
success_rate=success_rate,
avg_variance=variance,
avg_confidence=avg_confidence,
avg_latency_ms=np.mean([s.latency_ms for s in samples]),
failed_cases=[s.test_case_id for s in samples if not s.passed]
))
results.append(aggregate_temp_results(temp_results))
# Generate recommendations
recommendations = generate_recommendations(results, config)
# Create charts
charts = generate_charts(results)
return SweepReport(
config=config,
results=results,
optimal_temperature=recommendations['optimal_temperature'],
recommendations=recommendations,
charts=charts,
total_cost=calculate_total_cost(results),
total_time_s=calculate_total_time(results)
)
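The sketch above assumes a `chunks` helper that batches tasks to respect rate limits; a minimal version:
from typing import Iterator, List, TypeVar

T = TypeVar("T")

def chunks(items: List[T], size: int) -> Iterator[List[T]]:
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]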
Variance Calculation (JSON)
def calculate_json_variance(outputs: List[dict]) -> float:
"""
Measure structural consistency across JSON outputs
Algorithm:
1. Flatten all JSON objects to key-value pairs
2. For each key, check if all outputs have same value
3. Return ratio of differing keys
Returns: 0.0 (identical) to 1.0 (completely different)
"""
if len(outputs) == 0:
return 0.0
if len(outputs) == 1:
return 0.0
# Flatten all outputs
flattened = [flatten_json(output) for output in outputs]
# Get all keys that appear in any output
all_keys = set()
for flat in flattened:
all_keys.update(flat.keys())
# Count differences
different_keys = 0
total_keys = len(all_keys)
for key in all_keys:
values = [flat.get(key) for flat in flattened]
# Check if all values are the same
unique_values = set(v for v in values if v is not None)
if len(unique_values) > 1:
different_keys += 1
return different_keys / total_keys if total_keys > 0 else 0.0
def flatten_json(obj: dict, prefix: str = "") -> dict:
"""
Flatten nested JSON to key-value pairs
Example:
{"user": {"name": "Alice", "age": 30}}
→
{"user.name": "Alice", "user.age": 30}
"""
result = {}
for key, value in obj.items():
full_key = f"{prefix}.{key}" if prefix else key
if isinstance(value, dict):
result.update(flatten_json(value, full_key))
elif isinstance(value, list):
for i, item in enumerate(value):
if isinstance(item, dict):
result.update(flatten_json(item, f"{full_key}[{i}]"))
else:
result[f"{full_key}[{i}]"] = item
else:
result[full_key] = value
return result
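The self-consistency code in the next section calls `unflatten_json`, the inverse of `flatten_json`; a minimal sketch that handles dotted keys (for simplicity it leaves list-index keys such as "items[0]" as literal strings rather than rebuilding lists):
def unflatten_json(flat: dict) -> dict:
    """Rebuild a nested dict from the dotted keys produced by flatten_json."""
    result = {}
    for full_key, value in flat.items():
        parts = full_key.split(".")
        node = result
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return result

# Round trip on a simple nested object
nested = {"user": {"name": "Alice", "age": 30}}
assert unflatten_json(flatten_json(nested)) == nested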
Self-Consistency Algorithm
def apply_self_consistency(
samples: List[Any],
n: int
) -> SelfConsistencyResult:
"""
Majority voting across N samples
For categorical outputs:
- Return mode (most common value)
- Confidence = (count of mode) / n
For JSON outputs:
- Per-field majority voting
- Confidence = average agreement across fields
"""
if len(samples) < n:
raise ValueError(f"Need at least {n} samples, got {len(samples)}")
# Sample n items
selected = random.sample(samples, n)
# Determine output type
if all(isinstance(s, dict) for s in selected):
return self_consistency_json(selected)
elif all(isinstance(s, (int, float)) for s in selected):
return self_consistency_numeric(selected)
else:
return self_consistency_categorical(selected)
def self_consistency_json(samples: List[dict]) -> SelfConsistencyResult:
"""
Per-field majority voting for JSON outputs
"""
# Flatten all samples
flattened = [flatten_json(s) for s in samples]
# Get all keys
all_keys = set()
for flat in flattened:
all_keys.update(flat.keys())
# Majority vote per key
result = {}
agreement_scores = []
for key in all_keys:
values = [flat.get(key) for flat in flattened if key in flat]
# Get mode
counter = Counter(values)
most_common_value, count = counter.most_common(1)[0]
result[key] = most_common_value
agreement_scores.append(count / len(values))
# Unflatten
final_result = unflatten_json(result)
return SelfConsistencyResult(
test_case_id="", # Set by caller
n=len(samples),
samples=samples,
majority_vote=final_result,
agreement_score=np.mean(agreement_scores),
confidence=np.mean(agreement_scores)
)
def self_consistency_categorical(samples: List[str]) -> SelfConsistencyResult:
"""
Simple mode for categorical outputs
"""
counter = Counter(samples)
mode, count = counter.most_common(1)[0]
return SelfConsistencyResult(
test_case_id="", # Set by caller
n=len(samples),
samples=samples,
majority_vote=mode,
agreement_score=count / len(samples),
confidence=count / len(samples)
)
Confidence from Logprobs
def calculate_confidence_from_logprobs(
logprobs: List[float]
) -> float:
"""
Convert logprobs to confidence score
Strategy:
1. Convert log probabilities to probabilities
2. Take geometric mean (avoids being dominated by one low prob)
3. Scale to [0, 1]
Returns: Confidence score where 1.0 = very confident, 0.0 = guessing
"""
if not logprobs:
return None
# Convert to probabilities
probs = [math.exp(lp) for lp in logprobs]
# Geometric mean
geometric_mean = np.exp(np.mean(np.log(probs)))
return geometric_mean
def identify_uncertain_tokens(
tokens: List[str],
logprobs: List[float],
threshold: float = -1.0
) -> List[tuple[str, float]]:
"""
Find tokens where model was uncertain
Returns: List of (token, probability) for uncertain tokens
"""
uncertain = []
for token, lp in zip(tokens, logprobs):
if lp < threshold: # Low log prob = uncertain
prob = math.exp(lp)
uncertain.append((token, prob))
return uncertain
5. Implementation Guide
Phase 1: Basic Sweep (Days 1-2)
Step 1: Set Up Sweep Infrastructure
import asyncio
import time
from typing import List
import openai
async def run_single_sample(
prompt: str,
temperature: float,
model: str = "gpt-4"
) -> dict:
"""Execute one API call"""
start_time = time.time()
response = await openai.ChatCompletion.acreate(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
seed=42 # For reproducibility
)
latency_ms = (time.time() - start_time) * 1000
return {
"output": response.choices[0].message.content,
"latency_ms": latency_ms,
"tokens": response.usage.total_tokens
}
Checkpoint 1.1: Can you run a single sample and measure latency?
Step 2: Implement Basic Sweep
async def run_basic_sweep(
prompt: str,
temperatures: List[float],
runs_per_temp: int = 10
) -> dict:
"""Run sweep across temperatures"""
results = {}
for temp in temperatures:
print(f"Testing temperature: {temp}")
samples = []
for i in range(runs_per_temp):
sample = await run_single_sample(prompt, temp)
samples.append(sample)
print(f" Sample {i+1}/{runs_per_temp}: {sample['latency_ms']:.0f}ms")
results[temp] = samples
return results
Checkpoint 1.2: Run sweep with temps=[0.0, 0.5, 1.0]. Do you see output variation increase?
Phase 2: Variance Measurement (Days 3-4)
Step 3: Implement Variance for JSON
def calculate_variance(outputs: List[str], output_type: str = "json") -> float:
"""Calculate variance based on output type"""
if output_type == "json":
# Parse JSON
try:
parsed = [json.loads(output) for output in outputs]
return calculate_json_variance(parsed)
except json.JSONDecodeError:
# Fallback to text variance
return calculate_text_variance(outputs)
elif output_type == "categorical":
return calculate_categorical_variance(outputs)
elif output_type == "numeric":
values = [float(output) for output in outputs]
return np.std(values) / np.mean(values) # Coefficient of variation
def calculate_categorical_variance(outputs: List[str]) -> float:
"""
Variance for categorical outputs
Returns entropy normalized to [0, 1]
"""
counter = Counter(outputs)
n = len(outputs)
# Calculate entropy
entropy = 0
for count in counter.values():
p = count / n
if p > 0:
entropy -= p * math.log2(p)
# Normalize by max entropy (log2(n))
max_entropy = math.log2(n) if n > 1 else 1
return entropy / max_entropy if max_entropy > 0 else 0
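The dispatcher above falls back to `calculate_text_variance` for unparseable outputs, which is not defined elsewhere in this guide; a minimal sketch using average normalized edit distance (semantic similarity would require an embedding model and is omitted here):
import itertools
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,               # deletion
                               current[j - 1] + 1,            # insertion
                               previous[j - 1] + (ca != cb))) # substitution
        previous = current
    return previous[-1]

def calculate_text_variance(outputs: List[str]) -> float:
    """Average normalized edit distance across all output pairs (0 = identical, 1 = unrelated)."""
    pairs = list(itertools.combinations(outputs, 2))
    if not pairs:
        return 0.0
    distances = [levenshtein(x, y) / max(len(x), len(y), 1) for x, y in pairs]
    return sum(distances) / len(distances)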
Checkpoint 2.1: Test variance calculation. Does temp=0 give variance ≈ 0?
Step 4: Integrate with Project 1 Harness
from project1 import TestHarness, validate_output
def run_sweep_with_validation(
test_suite_file: str,
temperatures: List[float],
runs_per_temp: int = 10
) -> SweepReport:
"""Run sweep with validation"""
harness = TestHarness.load(test_suite_file)
results = {}
for temp in temperatures:
temp_results = []
for test_case in harness.test_cases:
samples = []
for _ in range(runs_per_temp):
# Run prompt
output = run_single_sample(test_case.input, temp)  # assumes a synchronous wrapper (e.g., via asyncio.run) around the async call from Step 1
# Validate
passed = validate_output(output, test_case.invariants)
samples.append({
"output": output,
"passed": passed
})
# Calculate success rate
success_rate = sum(1 for s in samples if s["passed"]) / len(samples)
# Calculate variance
variance = calculate_variance([s["output"] for s in samples])
temp_results.append({
"test_case": test_case.id,
"success_rate": success_rate,
"variance": variance,
"samples": samples
})
results[temp] = temp_results
return results
Checkpoint 2.2: Run with a test suite. Does accuracy decrease as temp increases?
Phase 3: Self-Consistency & Logprobs (Days 5-6)
Step 5: Implement Self-Consistency
def apply_self_consistency(
test_case: TestCase,
temperature: float,
n: int = 5
) -> SelfConsistencyResult:
"""Run N times and take majority vote"""
samples = []
for _ in range(n):
output = run_single_sample(test_case.input, temperature)
samples.append(output)
# For JSON outputs
if all(is_valid_json(s) for s in samples):
parsed = [json.loads(s) for s in samples]
majority = majority_vote_json(parsed)
agreement = calculate_agreement(parsed)
return {
"majority_vote": majority,
"agreement_score": agreement,
"samples": samples
}
def majority_vote_json(samples: List[dict]) -> dict:
"""Per-field majority voting"""
result = {}
# Get all keys
all_keys = set()
for sample in samples:
all_keys.update(flatten_json(sample).keys())
# Vote per key
for key in all_keys:
values = []
for sample in samples:
flat = flatten_json(sample)
if key in flat:
values.append(flat[key])
# Get mode
if values:
counter = Counter(values)
most_common = counter.most_common(1)[0][0]
result[key] = most_common
return unflatten_json(result)
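Step 5 also relies on `is_valid_json` and `calculate_agreement`, which are not shown elsewhere; minimal sketches consistent with the `flatten_json` helper defined earlier:
import json
from collections import Counter

def is_valid_json(text: str) -> bool:
    """True if the string parses as JSON."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False

def calculate_agreement(samples: list) -> float:
    """Average per-field agreement with the majority value across parsed JSON samples."""
    flattened = [flatten_json(s) for s in samples]
    all_keys = set()
    for flat in flattened:
        all_keys.update(flat.keys())
    if not all_keys:
        return 0.0
    scores = []
    for key in all_keys:
        values = [flat[key] for flat in flattened if key in flat]
        _, count = Counter(values).most_common(1)[0]
        scores.append(count / len(values))
    return sum(scores) / len(scores)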
Checkpoint 3.1: Test self-consistency. Does N=5 improve accuracy over N=1?
Step 6: Add Logprob Analysis
async def run_with_logprobs(
prompt: str,
temperature: float
) -> dict:
"""Run with logprob extraction"""
response = await openai.ChatCompletion.acreate(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
logprobs=True,
top_logprobs=3
)
content = response.choices[0].message.content
logprobs = response.choices[0].logprobs.content
# Extract log probabilities
token_logprobs = [token.logprob for token in logprobs]
# Calculate confidence
confidence = calculate_confidence_from_logprobs(token_logprobs)
return {
"output": content,
"confidence": confidence,
"logprobs": token_logprobs
}
Checkpoint 3.2: Check if low logprob correlates with validation failures.
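A minimal sketch for this checkpoint: correlate per-sample confidence with pass/fail results from your validators. The "confidence" and "passed" fields are assumptions matching the sample dicts built earlier in this guide.
import numpy as np

def confidence_accuracy_correlation(samples: list) -> float:
    """Pearson correlation between confidence scores and pass/fail outcomes.

    A clearly positive value suggests logprob-based confidence is a usable signal
    for the policy thresholds in Section 3.
    """
    confidences = np.array([s["confidence"] for s in samples], dtype=float)
    passed = np.array([1.0 if s["passed"] else 0.0 for s in samples])
    if confidences.std() == 0 or passed.std() == 0:
        return 0.0  # correlation is undefined when either series is constant
    return float(np.corrcoef(confidences, passed)[0, 1])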
Phase 4: Visualization & Recommendations (Day 7)
Step 7: Generate Reliability Curve
import matplotlib.pyplot as plt
def plot_reliability_curve(sweep_results: dict):
"""Create accuracy vs. temperature chart"""
temps = sorted(sweep_results.keys())
accuracies = [sweep_results[t]["success_rate"] for t in temps]
variances = [sweep_results[t]["avg_variance"] for t in temps]
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Accuracy curve
ax1.plot(temps, accuracies, marker='o', linewidth=2, markersize=8)
ax1.set_xlabel("Temperature", fontsize=12)
ax1.set_ylabel("Success Rate (%)", fontsize=12)
ax1.set_title("Reliability Curve", fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.set_ylim([0, 105])
# Variance curve
ax2.plot(temps, variances, marker='s', color='orange', linewidth=2, markersize=8)
ax2.set_xlabel("Temperature", fontsize=12)
ax2.set_ylabel("Variance", fontsize=12)
ax2.set_title("Output Variance", fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("reliability_curve.png", dpi=300)
print("Saved: reliability_curve.png")
Checkpoint 4.1: Generate chart. Does it clearly show the temp-accuracy relationship?
Step 8: Build Recommendation Engine
def generate_recommendations(sweep_results: dict) -> dict:
"""Analyze results and recommend settings"""
# Find optimal temperature (highest accuracy)
best_temp = max(sweep_results.keys(),
key=lambda t: sweep_results[t]["success_rate"])
best_accuracy = sweep_results[best_temp]["success_rate"]
best_variance = sweep_results[best_temp]["avg_variance"]
# Determine if self-consistency is worth it
baseline_accuracy = sweep_results[0.0]["success_rate"]
temp_0_7_accuracy = sweep_results.get(0.7, {}).get("success_rate", 0)
# If temp=0 is near-perfect, no need for self-consistency
if baseline_accuracy >= 0.95:
sc_recommendation = "Not needed (temp=0 already achieves 95%+ accuracy)"
else:
# Estimate self-consistency improvement (empirical: ~5-10%)
estimated_improvement = min(0.1, 1.0 - baseline_accuracy)
sc_recommendation = f"Recommended for edge cases (estimated +{estimated_improvement*100:.0f}% accuracy)"
# Confidence policy
avg_confidence = sweep_results[best_temp].get("avg_confidence", 0.8)
if avg_confidence >= 0.85:
confidence_threshold = 0.75
else:
confidence_threshold = 0.65
return {
"optimal_temperature": best_temp,
"optimal_accuracy": best_accuracy,
"optimal_variance": best_variance,
"self_consistency": {
"recommendation": sc_recommendation,
"suggested_n": 5
},
"confidence_policy": {
"auto_accept_threshold": confidence_threshold + 0.1,
"human_review_threshold": confidence_threshold,
"auto_reject_threshold": confidence_threshold - 0.2
}
}
Checkpoint 4.2: Verify recommendations make sense for your test suite.
6. Testing Strategy
6.1 Validation Tests
def test_variance_calculation():
"""Test variance metrics"""
# Test 1: Identical outputs should have variance=0
outputs = [{"category": "refund"}] * 10
variance = calculate_json_variance(outputs)
assert variance == 0.0
# Test 2: Completely different outputs should have varianceโ1
outputs = [{"category": f"cat_{i}"} for i in range(10)]
variance = calculate_json_variance(outputs)
assert variance > 0.9
# Test 3: Partially different (only 1 of 3 keys differs across outputs)
outputs = ([{"category": "refund", "priority": "high", "escalate": False}] * 7 +
[{"category": "technical", "priority": "high", "escalate": False}] * 3)
variance = calculate_json_variance(outputs)
assert 0.2 < variance < 0.4  # 1 differing key out of 3
def test_self_consistency():
"""Test majority voting"""
# Test 1: Clear majority
samples = ["refund"] * 7 + ["technical"] * 3
result = apply_self_consistency(samples, n=10)
assert result.majority_vote == "refund"
assert result.agreement_score == 0.7
# Test 2: Tie (should pick randomly but consistently with seed)
samples = ["refund"] * 5 + ["technical"] * 5
result = apply_self_consistency(samples, n=10)
assert result.agreement_score == 0.5
6.2 Integration Tests
def test_temperature_sweep_integration():
"""Test full sweep pipeline"""
# Create small test suite
test_cases = [
TestCase(id="tc1", input="Refund order #123", ...)
]
# Run sweep
results = run_temperature_sweep(
test_cases=test_cases,
temperatures=[0.0, 0.5, 1.0],
runs_per_temp=5
)
# Validate results structure
assert len(results) == 3 # 3 temperatures
assert results[0.0].success_rate >= 0.9 # temp=0 should be accurate
# Validate that variance increases with temp
assert results[0.0].avg_variance < results[1.0].avg_variance
7. Common Pitfalls & Debugging
7.1 The Hallucination Gradient Problem
Symptom: At temp=1.5, model outputs complete nonsense
# Temp=0.0
{"category": "refund", "priority": "high"}
# Temp=1.5
{"category": "banana", "priority": "purple"}
Diagnosis: Check when hallucinations start
def detect_hallucination_threshold(sweep_results):
"""Find temperature where model starts hallucinating"""
for temp in sorted(sweep_results.keys()):
# Check for nonsensical outputs
samples = sweep_results[temp]["samples"]
hallucination_count = sum(1 for s in samples if is_nonsensical(s))
hallucination_rate = hallucination_count / len(samples)
if hallucination_rate > 0.1: # 10% threshold
print(f"โ ๏ธ Hallucination starts at temp={temp}")
return temp
return None
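`is_nonsensical` is task-specific and not defined above; one hedged sketch is to check fields against the value sets your validators already expect (the field names and allowed values below are placeholders):
import json

ALLOWED_VALUES = {
    "category": {"refund", "technical", "policy"},
    "priority": {"low", "medium", "high"},
}

def is_nonsensical(sample: dict) -> bool:
    """Flag outputs that fail to parse or contain out-of-vocabulary field values."""
    raw = sample.get("output")
    try:
        output = json.loads(raw) if isinstance(raw, str) else raw
    except json.JSONDecodeError:
        return True
    if not isinstance(output, dict):
        return True
    for field, allowed in ALLOWED_VALUES.items():
        if field in output and output[field] not in allowed:
            return True
    return False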
7.2 The Seed Mismatch Problem
Symptom: Variance measurements inconsistent between runs
Cause: Not using seed parameter, or seed not supported
Solution:
# Always set seed for reproducibility
response = openai.ChatCompletion.create(
model="gpt-4",
messages=messages,
temperature=0.7,
seed=42 # Critical for reproducibility
)
# Verify seed is used
if response.system_fingerprint:
print(f"System fingerprint: {response.system_fingerprint}")
else:
print("โ ๏ธ Warning: Seed may not be supported by this model")
7.3 The Logprob Unavailable Problem
Symptom: logprobs=None in response
Cause: Not all models support logprobs
Solution:
# Check model capabilities
MODELS_WITH_LOGPROBS = ["gpt-4", "gpt-3.5-turbo"]  # Anthropic's API does not currently expose logprobs
if model not in MODELS_WITH_LOGPROBS:
print(f"⚠️ Warning: {model} may not support logprobs")
print("Confidence scoring will use fallback method (self-consistency)")
8. Extensions
8.1 Beginner Extensions
Extension 1: Top-P Sweep
Also sweep top-p values:
def run_top_p_sweep(
prompt: str,
top_p_values: List[float] = [0.5, 0.7, 0.9, 0.95, 1.0]
):
"""Compare top-p strategies"""
for top_p in top_p_values:
results = run_samples(prompt, temperature=1.0, top_p=top_p, n=10)
# Analyze variance and accuracy
Extension 2: Cost-Accuracy Pareto Frontier
Plot the trade-off:
def plot_cost_accuracy_tradeoff(sweep_results):
"""
Show Pareto frontier of cost vs. accuracy
Points:
- temp=0, N=1: Low cost, high accuracy
- temp=0.7, N=5: Medium cost, highest accuracy
- temp=1.0, N=1: Low cost, low accuracy
"""
# Calculate cost (proportional to token count)
# Plot accuracy vs. cost
# Highlight Pareto-optimal points
8.2 Intermediate Extensions
Extension 3: Adaptive Self-Consistency
Dynamically adjust N based on initial agreement:
def adaptive_self_consistency(prompt: str, max_n: int = 10):
"""
Start with N=3, increase if disagreement high
"""
n = 3
samples = run_n_samples(prompt, n)
agreement = calculate_agreement(samples)
while agreement < 0.8 and n < max_n:
# Add 2 more samples
samples.extend(run_n_samples(prompt, 2))
n += 2
agreement = calculate_agreement(samples)
return majority_vote(samples), agreement
Extension 4: Logprob-Based Early Stopping
Stop generating if confidence drops:
def generate_with_confidence_monitoring(prompt: str):
"""
Stream tokens, stop if logprob drops too low
"""
# Use streaming API
for token, logprob in stream_with_logprobs(prompt):
if logprob < -2.0: # Very uncertain
# Stop and ask for user guidance
return "UNCERTAIN", token
yield token
8.3 Advanced Extensions
Extension 5: Multi-Dimensional Sweep
Sweep temperature AND top-p simultaneously:
def run_2d_sweep(
temperatures: List[float],
top_p_values: List[float]
):
"""
Create heatmap of accuracy across temp ร top_p
"""
results = np.zeros((len(temperatures), len(top_p_values)))
for i, temp in enumerate(temperatures):
for j, top_p in enumerate(top_p_values):
accuracy = evaluate(temp=temp, top_p=top_p)
results[i, j] = accuracy
# Plot heatmap
plt.imshow(results, cmap='RdYlGn')
plt.xlabel('Top-P')
plt.ylabel('Temperature')
plt.colorbar(label='Accuracy')
Extension 6: Calibration Analysis
Check if the model's confidence matches its actual accuracy:
def analyze_calibration(results):
"""
Plot: Predicted confidence vs. actual accuracy
Well-calibrated: Diagonal line
Overconfident: Points above diagonal
Underconfident: Points below diagonal
"""
confidences = []
accuracies = []
for result in results:
confidences.append(result.confidence)
accuracies.append(result.passed)
# Bin by confidence
bins = np.linspace(0, 1, 11)
# Plot predicted vs. actual
9. Real-World Connections
9.1 Production Case Studies
1. Anthropic's Constitutional AI
Claude uses variance to detect uncertain responses:
# Simplified version of their approach
if variance_across_samples > threshold:
# Model is uncertain - apply additional constraints
response = generate_with_safety_guidelines(prompt)
2. OpenAI's GPT-4 Safety
OpenAI varies temperature based on content category:
safety_temps = {
"general": 0.7,
"medical": 0.0, # Must be deterministic
"creative": 1.0,
"code": 0.2
}
temp = safety_temps.get(category, 0.7)
3. Google's Med-PaLM
Medical Q&A uses ensemble (self-consistency):
# For medical questions, always use N=5
responses = [generate(question, temp=0.5) for _ in range(5)]
final_answer = medical_consensus(responses)
# Only return if agreement > 0.8
if agreement < 0.8:
return "I'm not certain. Please consult a doctor."
10. Resources
10.1 Books
| Topic | Book | Chapter |
|---|---|---|
| Classification Metrics | "Hands-On Machine Learning" by Géron | Ch. 3 (Classification) |
| Probability Theory | "Introduction to Probability" by Blitzstein | Ch. 1-2 (Basics) |
| Statistical Significance | "Naked Statistics" by Wheelan | Ch. 4 (Inference) |
| Visualization | "Storytelling with Data" by Nussbaumer Knaflic | Ch. 3 (Choosing Charts) |
10.2 Papers
- "Self-Consistency Improves Chain of Thought Reasoning in Language Models" (Wang et al., 2022)
- Foundation for majority voting
- https://arxiv.org/abs/2203.11171
- "Language Models are Few-Shot Learners" (Brown et al., 2020)
- Analysis of temperature effects
- https://arxiv.org/abs/2005.14165
10.3 Libraries
pip install matplotlib plotly
pip install scipy scikit-learn
pip install pandas numpy
11. Self-Assessment Checklist
Core Understanding
- I can explain what temperature does
- I understand variance metrics
- I know when to use self-consistency
- I can interpret logprobs
Implementation
- I've run a temperature sweep
- I've measured variance
- I've implemented self-consistency
- I've generated reliability curves
Production Readiness
- I've defined the optimal temperature for my use case
- I have a confidence policy
- I can explain trade-offs to stakeholders
12. Completion Criteria
Minimum Viable
- Temperature sweep working
- Variance calculation implemented
- Basic chart generated
Full Completion
- Self-consistency working
- Logprob analysis
- Recommendation engine
- Integration with Project 1 harness
Excellence
- 2D sweep (temp ร top-p)
- Calibration analysis
- Production deployment
End of Project 7: Temperature Sweeper + Confidence Policy