Project 7: Temperature Sweeper + Confidence Policy
Reliability curve report mapping temperature ranges to failure classes.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | See main guide estimates (typically 3-8 days except capstone) |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript |
| Coolness Level | Level 2: Scientific Tuning |
| Business Potential | 3. Ops Efficiency |
| Knowledge Area | Reliability Engineering |
| Software or Tool | Sampling evaluator + policy engine |
| Main Book | Site Reliability Engineering (Google) |
| Concept Clusters | Evaluation, Rollouts, and Governance; Prompt Contracts and Output Typing |
1. Learning Objectives
By completing this project, you will:
- Understand how temperature, top-p, top-k, and penalty parameters shape the token probability distribution and how each controls a different axis of output randomness.
- Design and execute a systematic parameter sweep that varies one sampling knob at a time while holding others fixed, producing statistically comparable reliability data across configurations.
- Build confidence calibration logic that maps raw model output scores to calibrated probabilities, enabling principled abstention when the model cannot answer reliably.
- Implement an abstention policy engine that decides when the model should refuse to answer, retry with different parameters, or escalate to a human reviewer based on task-class risk profiles.
- Produce a cost-quality tradeoff analysis showing how different temperature bands affect latency, token usage, pass rate, and safety metrics for each task family.
- Generate a production-ready policy artifact (YAML) that freezes approved sampling configurations per task class with deterministic seed strategies for reproducibility.
2. All Theory Needed (Per-Concept Breakdown)
Temperature and Sampling Parameters
Fundamentals Temperature is the single most important parameter controlling LLM output randomness because it directly rescales the logit vector before the softmax function converts logits into a probability distribution over the vocabulary. When temperature is low (approaching 0.0), the softmax sharpens: the highest-logit token gets nearly all the probability mass, producing near-deterministic output. When temperature is high (approaching or exceeding 1.0), the softmax flattens: lower-ranked tokens gain meaningful probability, producing diverse and sometimes surprising output. Understanding this mechanism is essential because every other sampling parameter (top-p, top-k, penalties) operates on the distribution that temperature has already shaped. If you tune top-p without understanding the temperature setting, you are optimizing a second-order effect while the first-order knob is misconfigured.
Deep Dive into the concept The mathematical foundation is straightforward. Given a vocabulary of size V, the model produces a raw logit vector z = [z_1, z_2, …, z_V]. The softmax with temperature T converts these into probabilities:
P(token_i) = exp(z_i / T) / sum_j(exp(z_j / T))
When T = 1.0, this is the standard softmax. When T < 1.0, the division amplifies differences between logits: if z_1 = 5.0 and z_2 = 4.0, at T = 1.0 the ratio P(1)/P(2) = e^1 ~ 2.7, but at T = 0.5 the ratio becomes e^2 ~ 7.4. At T = 0.1, the ratio is e^10 ~ 22,026, making selection essentially greedy. When T > 1.0, differences shrink: at T = 2.0, the same ratio becomes e^0.5 ~ 1.6, giving the second token nearly equal probability.
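These ratios are easy to check numerically. Below is a minimal sketch of the temperature-scaled softmax from the formula above (the helper name is mine, not a library function):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract max to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 4.0]                       # z_1 = 5.0, z_2 = 4.0 from the text
for T in (2.0, 1.0, 0.5, 0.1):
    p = softmax_with_temperature(logits, T)
    print(f"T={T}: P(1)/P(2) = {p[0] / p[1]:,.1f}")
# The ratio equals exp((z_1 - z_2) / T): ~1.6, ~2.7, ~7.4, ~22,026.5
```

Note that the ratio depends only on the logit gap and T, which is why a fixed gap becomes effectively decisive as T shrinks.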
Top-k sampling truncates the distribution after temperature has been applied. It keeps only the k highest-probability tokens and redistributes their probability mass to sum to 1.0. Top-k is a hard cutoff: if k = 50, exactly 50 tokens are candidates regardless of whether the 50th token has 0.001% probability or 5% probability. This makes top-k brittle for distributions that vary in entropy across different contexts. In a factual context where the model is highly confident, k = 50 might include many irrelevant tokens. In a creative context where many tokens are plausible, k = 50 might be too restrictive.
Top-p (nucleus) sampling is more adaptive. Instead of fixing the number of candidates, it includes the smallest set of tokens whose cumulative probability exceeds a threshold p. If p = 0.9, the model considers whatever number of tokens is needed to cover 90% of the probability mass. In a peaked distribution, this might be 3 tokens. In a flat distribution, this might be 500 tokens. Top-p automatically adjusts the candidate set size to match the model’s confidence at each generation step, which is why many practitioners prefer it over top-k.
Frequency penalty and presence penalty control repetition. Frequency penalty subtracts a value proportional to how many times each token has appeared in the generated text so far. If a token has appeared n times and the penalty coefficient is alpha, its logit is reduced by alpha * n. Presence penalty is a binary version: it subtracts a fixed value if the token has appeared at all, regardless of how many times. The formula is: modified_logit = original_logit - (frequency_penalty * count) - (presence_penalty * (1 if count > 0 else 0)). Frequency penalty is useful for preventing repetitive phrasing in long outputs. Presence penalty encourages topic diversity by discouraging any reuse of previous tokens.
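The penalty formula can be exercised directly. A minimal sketch with illustrative token IDs:

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """modified_logit = logit - freq_pen * count - pres_pen * (1 if seen else 0)."""
    counts = Counter(generated_ids)
    adjusted = list(logits)
    for token_id, count in counts.items():
        adjusted[token_id] -= frequency_penalty * count   # scales with repeats
        adjusted[token_id] -= presence_penalty            # flat, once seen
    return adjusted

# Token 0 has appeared 3 times, token 1 once, token 2 never.
result = apply_penalties([4.0, 3.0, 2.0], [0, 0, 0, 1],
                         frequency_penalty=0.5, presence_penalty=0.2)
print(result)   # token 0: 4.0 - 0.5*3 - 0.2 = 2.3; token 1: 3.0 - 0.5 - 0.2 = 2.3
```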
Provider-specific differences matter for sweep design. OpenAI exposes temperature (0.0-2.0), top_p, frequency_penalty (-2.0 to 2.0), presence_penalty (-2.0 to 2.0), and a seed parameter for near-deterministic output. Anthropic limits temperature to 0.0-1.0, offers top_p and top_k but no frequency or presence penalties, and has no seed parameter. Google Gemini on Vertex AI supports temperature, top_p, top_k, and a seed for reproducibility. When designing a sweep, you must normalize parameter ranges across providers and handle missing parameters gracefully.
The interaction between temperature and top-p is subtle. If temperature is very low, the distribution is already peaked, so top-p has little effect (the top token alone might exceed p = 0.9). If temperature is high, top-p becomes the effective control because the flattened distribution includes many tokens. Best practice from the provider documentation is to tune one or the other, setting the unused parameter to a neutral value (temperature = 1.0 if tuning top-p, or top-p = 1.0 if tuning temperature). Tuning both simultaneously creates a confounded search space where changes in one mask the effects of the other.
Temperature Effect on Probability Distribution (Vocabulary = 5 tokens)
Raw logits: z = [5.0, 4.0, 2.0, 1.0, 0.5]
T = 0.1 (near greedy):
Token A: ████████████████████████████████████████ ~100.0%
Token B: . ~0.0%
Token C: . ~0.0%
Token D: . ~0.0%
Token E: . ~0.0%
T = 0.5 (focused):
Token A: ███████████████████████████████████ ~87.9%
Token B: █████ ~11.9%
Token C: . ~0.2%
Token D: . ~0.0%
Token E: . ~0.0%
T = 1.0 (standard):
Token A: ████████████████████████████ ~69.1%
Token B: ██████████ ~25.4%
Token C: █ ~3.4%
Token D: █ ~1.3%
Token E: . ~0.8%
T = 2.0 (creative):
Token A: ███████████████████ ~48.3%
Token B: ████████████ ~29.3%
Token C: ████ ~10.8%
Token D: ███ ~6.5%
Token E: ██ ~5.1%
Sampling Parameter Pipeline
         Raw Logits from Model
                   |
                   v
        +---------------------+
        |  Temperature Scale  |  z_i / T
        |  (reshapes entire   |
        |   distribution)     |
        +----------+----------+
                   |
                   v
        +---------------------+
        |   Penalty Adjust    |  subtract frequency/presence
        |  (reduces repeats)  |  penalties from logits
        +----------+----------+
                   |
                   v
        +---------------------+
        |       Softmax       |  convert to probabilities
        +----------+----------+
                   |
                   v
       +-----------+-----------+
       |                       |
+------+-------+        +------+-------+
| Top-k Filter |        | Top-p Filter |
| (keep top k  |   OR   | (keep tokens |
|  tokens)     |        |  until sum>p)|
+------+-------+        +------+-------+
       |                       |
       +-----------+-----------+
                   |
                   v
        +---------------------+
        |     Renormalize     |  probabilities sum to 1.0
        +----------+----------+
                   |
                   v
        +---------------------+
        |    Sample Token     |  random draw (or argmax
        |   (with seed if     |  if T ~ 0)
        |    available)       |
        +----------+----------+
                   |
                   v
            Selected Token
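The stages above can be sketched end to end in plain Python. This is illustrative only, not any provider's actual implementation (real servers fuse these steps in GPU kernels and may order penalty and temperature application differently):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None,
                 freq_counts=None, frequency_penalty=0.0, presence_penalty=0.0,
                 seed=None):
    """One sampling step mirroring the pipeline diagram (a sketch)."""
    # 1. Temperature scaling; T ~ 0 is approximated as argmax.
    if temperature < 1e-6:
        return max(range(len(logits)), key=lambda i: logits[i])
    z = [l / temperature for l in logits]
    # 2. Penalty adjustment on the scaled logits.
    for i, n in (freq_counts or {}).items():
        z[i] -= frequency_penalty * n + (presence_penalty if n > 0 else 0.0)
    # 3. Numerically stable softmax (subtract the max logit first).
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-k OR top-p filtering on (index, prob) pairs sorted by prob.
    ranked = sorted(enumerate(probs), key=lambda ip: ip[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    elif top_p is not None:
        kept, cum = [], 0.0
        for i, p in ranked:
            kept.append((i, p))
            cum += p
            if cum >= top_p:          # smallest set whose mass covers p
                break
        ranked = kept
    # 5. Renormalize (implicitly, by drawing within the remaining mass)
    # 6. and sample, seeded for reproducibility.
    mass = sum(p for _, p in ranked)
    r = random.Random(seed).random() * mass
    for i, p in ranked:
        r -= p
        if r <= 0:
            return i
    return ranked[-1][0]
```

With `top_k=1` or `temperature=0.0` the function degenerates to greedy selection, which matches the invariant that at least one token always survives filtering.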
How this fits into the project Temperature and sampling parameters are the independent variables of the sweep. All of Project 7 is structured around systematically varying these parameters, measuring their effects on output quality, and compiling the results into a policy artifact. Understanding the math behind these parameters is what separates a principled sweep from random guessing.
Definitions & key terms
- Logit: The raw, unnormalized score the model assigns to each vocabulary token before softmax. Higher logits mean the model considers that token more likely.
- Softmax: The function that converts logits into a proper probability distribution. Temperature modifies the logits before softmax is applied.
- Temperature (T): Scaling factor applied to logits before softmax. T < 1.0 sharpens the distribution (more deterministic). T > 1.0 flattens it (more random).
- Top-k: Hard cutoff that keeps only the k highest-probability tokens after softmax, discarding all others.
- Top-p (nucleus sampling): Adaptive cutoff that keeps the smallest set of tokens whose cumulative probability exceeds threshold p.
- Frequency penalty: Per-token logit reduction proportional to how many times that token has appeared in the generated text.
- Presence penalty: Fixed logit reduction applied to any token that has appeared at least once in the generated text.
- Seed: Integer value that initializes the random number generator for sampling. Same seed + same parameters + same prompt should produce the same output (within provider guarantees).
- Entropy: Measure of randomness in the probability distribution. High temperature increases entropy; low temperature decreases it.
Mental model diagram (ASCII)
Parameter Control Space for a Single Generation Step
  High
  Randomness |   +-------------------------+
  (entropy)  |   |      CREATIVE ZONE      |
             |   +-------------------------+
             |   |     BALANCED ZONE       |
             |   |     (typical prod)      |
             |   +-------------------------+
             |   |   DETERMINISTIC ZONE    |
  Low        |   +-------------------------+
             +----+----+----+----+----+----+----+
            0.0  0.3  0.5  0.7  1.0  1.5  2.0
                     Temperature (T) -->
Orthogonal controls:
- top_p 0.1..1.0 (narrows or widens candidate set)
- top_k 1..100+ (hard cap on candidates)
- freq_penalty (reduces repetition over time)
- presence_penalty (encourages new topics)
- seed (fixes random draw for reproducibility)
How it works (step-by-step, with invariants and failure modes)
- The model produces a logit vector of size V (vocabulary). Invariant: logits are real-valued numbers; there are exactly V of them. Failure mode: model returns empty or truncated logit vector (indicates API or model loading error).
- Temperature scaling divides each logit by T. Invariant: T > 0 (division by zero is undefined; T = 0 is approximated as argmax). Failure mode: T set to exactly 0.0 may cause different behavior across providers (some treat it as argmax, others reject it).
- Penalty adjustments subtract frequency and presence penalties from the modified logits. Invariant: penalties are applied after temperature but before softmax. Failure mode: negative penalties (< 0) actually encourage repetition, which can cause degenerate loops.
- Softmax converts adjusted logits to probabilities. Invariant: all probabilities are in [0, 1] and sum to 1.0. Failure mode: numerical overflow if logits are very large (mitigated by subtracting the max logit before softmax).
- Top-k or top-p filtering removes low-probability tokens. Invariant: at least one token remains after filtering (the highest-probability token always survives). Failure mode: if top_p is extremely small (e.g., 0.001), only the top token survives, silently collapsing sampling to greedy selection even at high temperature.
- Renormalization adjusts remaining probabilities to sum to 1.0. Invariant: the selected tokens’ probabilities form a valid distribution.
- Sampling draws one token from the filtered distribution using the random seed. Invariant: same seed + same inputs = same token (within provider guarantees). Failure mode: provider infrastructure changes (GPU routing, model snapshots) can break reproducibility even with fixed seeds.
Minimal concrete example
Sweep configuration for a customer FAQ task class:
sweep_config:
  task_class: customer_faq
  prompt_template: templates/faq_v2.prompt
  dataset: fixtures/faq_200.jsonl
  seed: 42
  parameters:
    temperature: [0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0]
    top_p: [1.0]              # hold top_p constant
    frequency_penalty: [0.0]  # hold penalty constant
  metrics:
    - pass_rate          # % of outputs matching expected answer
    - abstention_rate    # % of outputs where model declines
    - format_compliance  # % of outputs matching schema
    - mean_latency_ms    # average response time
    - token_cost         # total tokens consumed
  acceptance_criteria:
    min_pass_rate: 0.95
    max_abstention_rate: 0.05
    max_latency_ms: 2000
Expected sweep result (abbreviated):
| Temperature | Pass Rate | Abstain | Format OK | Latency | Tokens |
|-------------|-----------|---------|-----------|---------|--------|
| 0.0 | 94.0% | 0.0% | 100.0% | 1,120ms | 42,800 |
| 0.1 | 95.5% | 0.5% | 99.5% | 1,140ms | 43,200 |
| 0.2 | 96.5% | 2.0% | 99.0% | 1,160ms | 44,100 |
| 0.3 | 95.0% | 3.5% | 98.0% | 1,180ms | 45,600 |
| 0.5 | 89.0% | 1.0% | 95.0% | 1,220ms | 48,200 |
| 0.7 | 82.0% | 0.5% | 88.0% | 1,300ms | 52,400 |
| 1.0 | 68.0% | 0.0% | 72.0% | 1,450ms | 58,900 |
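A minimal sweep runner for a config like this can be sketched as below. `call_model` is a stand-in you would replace with a real provider client; the stub model and fixture data are purely illustrative:

```python
import statistics
import time

def run_sweep(call_model, dataset, temperatures, seed=42):
    """Run one configuration per temperature and collect pass rate and latency.
    A sketch of the sweep loop; real runs would add abstention, format, and
    token-cost metrics per the config above."""
    report = []
    for t in temperatures:
        passes, latencies = 0, []
        for item in dataset:
            start = time.perf_counter()
            answer = call_model(item["prompt"], temperature=t, seed=seed)
            latencies.append((time.perf_counter() - start) * 1000)
            passes += int(answer.strip() == item["expected"])
        report.append({
            "temperature": t,
            "pass_rate": passes / len(dataset),
            "mean_latency_ms": statistics.mean(latencies),
        })
    return report

# Stubbed model for illustration: answers correctly only at low temperature.
def fake_model(prompt, temperature, seed):
    return "Paris" if temperature <= 0.3 else "Lyon"

dataset = [{"prompt": "Capital of France?", "expected": "Paris"}] * 4
for row in run_sweep(fake_model, dataset, [0.0, 0.7]):
    print(row["temperature"], row["pass_rate"])
```

Holding the dataset and seed fixed across configurations is what makes the per-temperature rows of the table comparable.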
Common misconceptions
- “Temperature 0 guarantees deterministic output.” No provider currently guarantees fully deterministic outputs even at T = 0. GPU floating-point non-determinism, load balancing across hardware, and model snapshot updates can introduce variation. The seed parameter helps but does not eliminate this. Design your system to tolerate small deviations.
- “Higher temperature means better creativity.” Higher temperature means more randomness, which sometimes produces creative output but also produces more nonsense, hallucinations, and format violations. Creativity requires the model to have learned good creative patterns; temperature just controls how freely it samples from what it knows.
- “Top-p and temperature do the same thing.” Temperature reshapes the entire distribution before any filtering. Top-p filters the already-shaped distribution. They operate at different stages of the pipeline and have different effects. A low temperature with a high top-p is very different from a high temperature with a low top-p.
- “You should always tune both temperature and top-p together.” Most provider documentation recommends tuning one and setting the other to a neutral value. Tuning both simultaneously creates a confounded optimization space where you cannot attribute improvements to either parameter.
- “Frequency penalty always improves output quality.” Negative or excessive frequency penalties can degrade output by forcing the model to use unnatural synonyms, break formatting patterns, or avoid necessary technical terms that must be repeated.
Check-your-understanding questions
- If the raw logits for two tokens are z_1 = 10.0 and z_2 = 8.0, what happens to the probability ratio P(1)/P(2) as temperature decreases from 1.0 to 0.1?
- Why does top-p adapt better than top-k to varying levels of model confidence across different generation steps?
- A sweep shows that temperature 0.0 gives 94% pass rate but temperature 0.2 gives 96.5%. What could explain the higher pass rate at a slightly non-zero temperature?
- Why is it problematic to sweep temperature and top-p simultaneously without a control dimension?
- How does the absence of a seed parameter in Anthropic’s API affect sweep reproducibility compared to OpenAI?
Check-your-understanding answers
- At T = 1.0, P(1)/P(2) = exp(2) ~ 7.4. At T = 0.5, P(1)/P(2) = exp(4) ~ 54.6. At T = 0.1, P(1)/P(2) = exp(20) ~ 485 million. The ratio grows exponentially as T decreases, making selection increasingly greedy.
- Top-k always keeps exactly k tokens regardless of distribution shape. When the model is confident (peaked distribution), top-k includes many near-zero probability tokens. When uncertain (flat distribution), top-k might exclude plausible tokens. Top-p adjusts the candidate set size dynamically: few tokens when confident, many when uncertain.
- At T = 0.0 the model always picks the single highest-logit token (greedy decoding). Sometimes the second-highest token is the correct answer (the model's "second guess" is better for certain inputs). A small temperature lets the model occasionally select these near-optimal alternatives, which can improve aggregate pass rate. This reflects a well-documented limitation of greedy decoding: the locally highest-probability token does not always lie on the globally best output sequence.
- If you change both temperature and top-p between runs, you cannot determine which parameter caused the observed quality change. This violates the basic principle of controlled experimentation. You need at least one fixed control dimension to attribute effects correctly.
- Without a seed parameter, Anthropic API calls at any non-zero temperature will produce different outputs across runs even with identical inputs. This means sweep results require multiple runs per configuration to compute confidence intervals, increasing cost and time. With OpenAI’s seed parameter, a single run per configuration can produce a reproducible baseline (though still not guaranteed identical due to infrastructure factors).
Real-world applications
- Customer support systems use low temperature (0.0-0.2) for factual FAQ answers to minimize hallucination, but higher temperature (0.5-0.7) for generating empathetic response variations.
- Code generation tools typically use temperature 0.0-0.2 for correctness-critical code and 0.5-0.8 for brainstorming alternative implementations.
- Content generation platforms configure temperature per content type: product descriptions (T=0.3), marketing copy (T=0.7), creative fiction (T=0.9-1.0).
- Evaluation pipelines run sweeps across temperature ranges to find the reliability-creativity boundary for each use case before freezing production settings.
Where you’ll apply it
- Phase 1 of this project: define the sweep configuration with parameter ranges, fixture datasets, and metrics to collect.
- Phase 2: execute the sweep and analyze per-configuration reliability data.
- The sweep results feed into the confidence calibration and policy compilation concepts.
References
- “AI Engineering” by Chip Huyen - Chapters on model evaluation and parameter tuning
- “Site Reliability Engineering” by Google - SLO and error budget chapters for reliability framing
- OpenAI API Reference: Chat Completions parameters (temperature, top_p, frequency_penalty, presence_penalty, seed)
- Anthropic API Reference: Messages parameters (temperature, top_p, top_k)
- vLLM Sampling Parameters documentation for open-weight model configurations
- “Controlling randomness in LLMs: Temperature and Seed” by Dylan Castillo (2025)
Key insights Temperature is not a quality knob; it is an entropy knob. Quality depends on how well the model learned the task; temperature controls how freely it samples from what it learned.
Summary Temperature and sampling parameters form a multi-dimensional control surface that shapes every token the model produces. Temperature rescales logits before softmax, top-k and top-p filter the resulting distribution, and penalties discourage repetition. Provider-specific differences in parameter ranges, penalty support, and seed availability mean that sweep designs must account for the target provider’s API surface. The goal of a sweep is to map this control surface to task-specific reliability metrics, producing evidence-based configurations rather than guesses.
Homework/Exercises to practice the concept
- Calculate the probability distribution for a 4-token vocabulary with logits [6.0, 3.0, 1.0, 0.5] at temperatures T = 0.1, T = 0.5, T = 1.0, and T = 2.0. Show how the entropy changes across these four settings.
- Design a sweep configuration (in pseudocode YAML) that tests 5 temperature values for a “legal clause extraction” task class. Specify which parameters you would hold constant and why.
- Write pseudocode for a function that normalizes temperature ranges across three providers (OpenAI: 0.0-2.0, Anthropic: 0.0-1.0, Google: 0.0-2.0) so that sweep results are comparable.
Solutions to the homework/exercises
- For the probability calculation: at T = 0.1, token A dominates with >99.99% probability (entropy near 0). At T = 0.5, token A has ~99.7% and token B ~0.2% (still very low entropy). At T = 1.0, the distribution spreads slightly: A ~94.3%, B ~4.7%, C ~0.6%, D ~0.4%. At T = 2.0, it flattens further: A ~73.0%, B ~16.3%, C ~6.0%, D ~4.7%. Entropy increases monotonically with temperature.
- The sweep YAML should fix top_p = 1.0, frequency_penalty = 0.0, and presence_penalty = 0.0 while varying temperature across [0.0, 0.1, 0.2, 0.3, 0.5]. Legal extraction is a precision task, so the sweep range should focus on the low-temperature regime. Include a seed for reproducibility and metrics for extraction accuracy, schema compliance, and hallucination rate.
- The normalization pseudocode should map each provider’s range to a canonical [0.0, 1.0] scale. For OpenAI, canonical = raw / 2.0. For Anthropic, canonical = raw (already 0-1). For Google, canonical = raw / 2.0. The sweep engine works in canonical space and converts to provider-native values before API calls. Document that this is a linear approximation and that the actual effect of T = 0.5 on OpenAI versus T = 0.5 on Anthropic may differ due to model architecture differences.
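The normalization described in the third solution, as a runnable sketch (a deliberately linear mapping, with the caveat above that equal canonical temperatures do not guarantee equal behavior across models):

```python
# Provider-native temperature maxima, per the API ranges cited in the text.
PROVIDER_T_MAX = {"openai": 2.0, "anthropic": 1.0, "google": 2.0}

def to_canonical(provider: str, native_t: float) -> float:
    """Map a provider-native temperature onto a canonical [0.0, 1.0] scale."""
    return native_t / PROVIDER_T_MAX[provider]

def to_native(provider: str, canonical_t: float) -> float:
    """Convert a canonical temperature back into the provider's native range."""
    if not 0.0 <= canonical_t <= 1.0:
        raise ValueError("canonical temperature must be in [0.0, 1.0]")
    return canonical_t * PROVIDER_T_MAX[provider]

print(to_native("openai", 0.5), to_native("anthropic", 0.5))   # 1.0 0.5
```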
Confidence Calibration and Abstention Policies
Fundamentals Confidence calibration is the process of transforming a model’s raw output scores into probabilities that accurately reflect the true likelihood of correctness. An LLM might output a response with high apparent confidence (fluent, detailed, assertive) while being factually wrong 30% of the time. Calibration measures and corrects this gap between stated confidence and actual accuracy. Abstention is the decision to not answer when confidence is below a task-specific threshold. Together, calibration and abstention form the reliability layer between raw model output and downstream consumption. Without calibration, you cannot build principled abstention policies. Without abstention, you cannot prevent the model from confidently delivering wrong answers in high-stakes contexts. This concept is critical for Project 7 because the temperature sweep produces varying levels of model certainty, and the policy engine must interpret those levels correctly to assign approved configurations per task class.
Deep Dive into the concept Confidence in LLMs is fundamentally different from confidence in traditional classifiers. A classifier trained on labeled data produces a softmax score over classes that, after calibration, can approximate P(correct | output). An LLM produces a sequence of tokens, where each token has a probability from the generation process. The “confidence” of the full response is not a single number the model outputs; it must be constructed from signals like per-token log-probabilities, self-consistency across multiple samples, or the model’s own verbal expression of certainty.
There are three main approaches to measuring LLM confidence:
First, token-level log-probabilities. Most APIs expose log-probs for generated tokens. You can aggregate these (mean, min, product) to get a sequence-level confidence score. The mean log-probability is the most common aggregation. A response where every token has high probability (mean log-prob close to 0) suggests the model was confident at every step. A response with even one very low-probability token (a “surprise” token) might indicate uncertainty or hallucination. The minimum log-probability in the sequence is a useful signal for detecting moments of model uncertainty.
Second, self-consistency sampling. Generate N responses to the same prompt at moderate temperature (T = 0.5-0.7). If all N responses agree on the key facts, confidence is high. If they diverge, the model is uncertain. This is computationally expensive (N API calls per input) but provides a strong signal that does not depend on the model’s ability to self-assess. The agreement rate across samples is a well-calibrated proxy for correctness on factual tasks.
Third, verbalized confidence. Ask the model to state its confidence level as part of the response (e.g., “Rate your confidence 1-10”). Research from 2025 shows this is the least reliable approach: models are poorly calibrated when verbalizing confidence, often expressing high confidence for wrong answers and rarely adjusting their confidence based on actual difficulty. The AbstentionBench study found that LLMs do not adapt their decision policies in response to changing risk, even when abstention is explicitly incentivized.
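The first two approaches reduce to a few lines each. A sketch (the function names are mine, not an API):

```python
from collections import Counter

def logprob_signals(token_logprobs):
    """Aggregate per-token log-probs into sequence-level confidence signals."""
    return {
        "mean_logprob": sum(token_logprobs) / len(token_logprobs),
        "min_logprob": min(token_logprobs),   # flags the single most "surprising" token
    }

def agreement_rate(answers):
    """Self-consistency: fraction of N samples matching the majority answer."""
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)

print(logprob_signals([-0.05, -0.10, -2.30, -0.01]))           # min flags the -2.30 token
print(agreement_rate(["Paris", "Paris", "Lyon", "Paris", "Paris"]))   # 0.8
```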
Calibration methods fall into two categories: post-hoc and training-based. Post-hoc calibration takes a set of model outputs with known correctness labels and fits a function that maps raw confidence scores to calibrated probabilities. The simplest method is Platt scaling: fit a logistic regression on (raw_score, is_correct) pairs to get P(correct | raw_score). Temperature scaling (confusingly sharing the name with the generation parameter) is another post-hoc method that fits a single scalar to the logits. Isotonic regression is a non-parametric alternative that fits a monotonic function without assuming a functional form. Training-based approaches like LACIE (2024) cast calibration as a preference optimization problem during fine-tuning, producing models with emergent abstention behavior.
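Platt scaling is just a one-feature logistic regression and can be written without dependencies. A sketch fit by batch gradient descent on toy labeled data (in practice any library logistic regression works):

```python
import math

def fit_platt(scores, labels, lr=0.5, epochs=3000):
    """Fit P(correct | s) = sigmoid(a*s + b) by batch gradient descent
    on the log loss (Platt scaling with a single feature)."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s        # dLoss/da for the log loss
            grad_b += (p - y)            # dLoss/db
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

# Toy labeled pairs: raw score = mean log-prob, label = 1 if answer was correct.
scores = [-0.1, -0.2, -0.3, -0.9, -1.2, -1.5]
labels = [1, 1, 1, 0, 0, 0]
calibrate = fit_platt(scores, labels)
print(round(calibrate(-0.1), 2), round(calibrate(-1.4), 2))  # high vs low P(correct)
```

The returned closure is the calibration function: feed it a raw score and it yields a probability you can compare against an abstention threshold.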
Abstention policy design requires defining three components: a confidence threshold below which the model should abstain, a cost model that quantifies the relative cost of wrong answers versus abstentions, and an escalation path for abstained queries. The threshold is task-specific: a medical triage system might require 95% calibrated confidence to answer, while a casual chatbot might accept 60%. The cost model formalizes the tradeoff: if a wrong answer costs 10x more than an abstention (which triggers a human review), the optimal threshold is much higher than if both outcomes have similar costs. The escalation path defines what happens to abstained queries: queue for human review, retry with different parameters, or return a safe default response.
Confidence Calibration Pipeline
+------------------+
| Raw Model Output |
| (response text + |
| per-token logps) |
+--------+---------+
|
v
+--------+---------+
| Signal Extraction |
| - mean log-prob | Approach 1: Token Log-Probs
| - min log-prob |
| - entropy |
+--------+---------+
|
| +-------------------+
| | Multi-Sample | Approach 2: Self-Consistency
+-->| Agreement Rate |
| | (N=5 samples) |
| +-------------------+
|
v
+--------+---------+
| Calibration Model |
| (Platt scaling or |
| isotonic regress) |
+--------+---------+
|
v
+--------+----------+
| Calibrated P(correct) |
+---------+---------+
|
v
+---------+---------+ +-----------------------+
| Abstention Policy |<----| Task-Class Config |
| | | - threshold: 0.85 |
| if P < threshold: | | - cost_wrong: 10x |
| ABSTAIN | | - escalation: human |
| else: | +-----------------------+
| ANSWER |
+---------+---------+
|
+----+----+
| |
ANSWER ABSTAIN
| |
v v
Deliver Escalate / Retry /
Response Safe Default
Calibration Reliability Diagram (ideal vs uncalibrated)
  Actual     |
  Accuracy   |                        /
  (fraction  |                      /
  correct)   |                    /    <-- perfectly calibrated
             |                  /          (diagonal line)
             |                /
             |              /        . .
             |            /     . .        <-- typical uncalibrated LLM
             |          /   . .                (overconfident: curve sits
             |        /  . .                    below the diagonal)
             |      /. .
             +----+----+----+----+----+---->
            0.0  0.2  0.4  0.6  0.8  1.0
                  Predicted Confidence
ECE (Expected Calibration Error) = mean |accuracy_bin - confidence_bin|
A perfectly calibrated model has ECE = 0
Typical LLMs have ECE 0.10-0.25 (overconfident by 10-25%)
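The ECE formula above, in code. This sketch uses the standard equal-width binning, with each bin weighted by its share of predictions (the usual formulation of the mean):

```python
def expected_calibration_error(confidences, corrects, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy(bin) - mean confidence(bin)|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, corrects):
        idx = min(int(conf * n_bins), n_bins - 1)   # conf = 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece, total = 0.0, len(confidences)
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece

# Overconfident toy model: claims 0.9 confidence but is right only half the time.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # ~0.4
```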
How this fits into the project Confidence calibration converts raw sweep data into actionable reliability metrics. The sweep runner (Concept 1) produces per-configuration pass rates and log-probability distributions. The calibration layer transforms these into calibrated confidence scores. The policy engine (Concept 3) uses calibrated scores to set abstention thresholds per task class. Without calibration, the policy would be based on uncalibrated scores that overestimate reliability.
Definitions & key terms
- Calibration: The alignment between predicted confidence and actual accuracy. A well-calibrated model that says “80% confident” should be correct 80% of the time.
- Expected Calibration Error (ECE): The average absolute difference between confidence and accuracy across binned predictions. Lower is better; 0.0 is perfect.
- Platt scaling: Post-hoc calibration that fits a logistic regression to map raw scores to calibrated probabilities.
- Isotonic regression: Non-parametric post-hoc calibration that fits a monotonic step function without assuming a specific functional form.
- Abstention: The decision to not answer a query because the model’s calibrated confidence is below a task-specific threshold.
- Self-consistency: Measuring confidence by sampling multiple responses and computing agreement rate.
- Cost-sensitive threshold: An abstention threshold derived from the relative costs of wrong answers versus abstentions.
- Verbalized confidence: Asking the model to self-report its confidence level (least reliable approach per 2025 research).
- LACIE: A training-time approach that casts calibration as preference optimization, producing emergent abstention behavior.
Mental model diagram (ASCII)
The Confidence-Accuracy Tradeoff Space
  High
  Task       |              +---------+
  Accuracy   |              |  SWEET  |   High accuracy + reasonable
             |              |  SPOT   |   abstention rate
             |              +---------+
             |             /
             |            /
             |           /
             |          /    <-- raising threshold improves
             |         /         accuracy but increases abstention
             |        /
             |       /
             |      /
  Low        |     /
             +--------------------------->
            Low                        High
                    Abstention Rate
Key insight: there is no free lunch.
Higher accuracy requires more abstention.
The "sweet spot" depends on task-class cost model:
- Medical triage: accept 15% abstention for 99% accuracy
- Chatbot: accept 2% abstention for 85% accuracy
How it works (step-by-step, with invariants and failure modes)
- Collect a labeled dataset of (query, model_response, is_correct) triples from the sweep. Invariant: the dataset must be large enough (typically 200+ samples per task class) for calibration to be meaningful. Failure mode: too few samples causes overfitting of the calibration function.
- Extract confidence signals from each response: mean log-probability, min log-probability, and optionally self-consistency agreement rate. Invariant: log-probabilities must come from the same model and configuration used in production. Failure mode: calibrating on one model’s log-probs and deploying with another model produces miscalibrated thresholds.
- Fit a calibration model (Platt scaling or isotonic regression) on the labeled data. Invariant: use held-out validation data to evaluate calibration quality (never calibrate and evaluate on the same data). Failure mode: calibration function overfits to a specific prompt version or dataset distribution.
- Compute Expected Calibration Error (ECE) on the validation set. Invariant: ECE should decrease after calibration. Failure mode: ECE increases, indicating the calibration function is making things worse (usually means insufficient data or distribution mismatch).
- Set task-class-specific abstention thresholds based on the calibrated confidence scores and the cost model. Invariant: threshold must be re-evaluated when the model, prompt, or task distribution changes. Failure mode: stale thresholds after a model update cause either excessive abstention or insufficient safety.
Minimal concrete example
Calibration data from sweep (customer_faq, T=0.2):
| Query ID | Mean Log-Prob | Is Correct | Calibrated P(correct) |
|----------|---------------|------------|-----------------------|
| q_001 | -0.12 | True | 0.94 |
| q_002 | -0.45 | True | 0.82 |
| q_003 | -1.20 | False | 0.41 |
| q_004 | -0.08 | True | 0.96 |
| q_005 | -0.88 | False | 0.58 |
Abstention policy for customer_faq:
threshold: 0.85 # abstain if P(correct) < 0.85
cost_wrong: 5.0 # wrong answer costs 5x abstention
escalation: human_queue # abstained queries go to human review
max_abstention_rate: 0.10 # alert if >10% queries abstain
Applied to sweep results:
q_001: ANSWER (0.94 >= 0.85)
q_002: ABSTAIN (0.82 < 0.85) -> human_queue
q_003: ABSTAIN (0.41 < 0.85) -> human_queue
q_004: ANSWER (0.96 >= 0.85)
q_005: ABSTAIN (0.58 < 0.85) -> human_queue
Abstention rate: 60% -> ALERT: exceeds max_abstention_rate
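A minimal sketch of the policy application above, assuming the calibrated scores and the policy fields shown in the example (`threshold`, `escalation`, `max_abstention_rate`):

```python
def decide(calibrated_p, policy):
    """Apply the abstention policy: answer above threshold, otherwise escalate."""
    if calibrated_p >= policy["threshold"]:
        return "ANSWER"
    return f"ABSTAIN -> {policy['escalation']}"

def apply_policy(scores, policy):
    """Decide every query, then check the abstention rate against the alert limit."""
    decisions = {qid: decide(p, policy) for qid, p in scores.items()}
    rate = sum(d.startswith("ABSTAIN") for d in decisions.values()) / len(decisions)
    alert = rate > policy["max_abstention_rate"]
    return decisions, rate, alert

policy = {"threshold": 0.85, "escalation": "human_queue", "max_abstention_rate": 0.10}
scores = {"q_001": 0.94, "q_002": 0.82, "q_003": 0.41, "q_004": 0.96, "q_005": 0.58}
decisions, rate, alert = apply_policy(scores, policy)
print(decisions["q_002"], rate, alert)
```

Running this on the five example scores reproduces the 60% abstention rate and the resulting alert.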
Common misconceptions
- “LLMs know when they are wrong.” Research consistently shows that LLMs are poorly calibrated at self-assessing correctness. Fluent, detailed responses can be entirely fabricated. External calibration using labeled data is necessary.
- “Confidence is a single number the model outputs.” LLMs do not output a native confidence score. Confidence must be constructed from proxy signals (log-probs, self-consistency, verbalized assessment), each with different reliability characteristics.
- “A fixed abstention threshold works across all task classes.” Different task classes have different accuracy-abstention tradeoffs. A medical triage system needs a much higher threshold than a product recommendation system. Thresholds must be set per task class using task-specific cost models.
- “Abstaining more always makes the system safer.” Excessive abstention degrades user experience and can cause users to work around the system (e.g., rephrasing queries to trick the model into answering), which may be less safe than a confident-but-slightly-wrong answer.
- “Self-consistency is too expensive for production.” Self-consistency can be used selectively: only for queries where the initial confidence score falls near the abstention threshold. This “borderline sampling” strategy limits the extra cost to a small fraction of total queries.
Check-your-understanding questions
- Why is Platt scaling preferred over raw log-probability thresholds for abstention decisions?
- How does the cost model for wrong answers versus abstentions affect the optimal threshold?
- What happens to calibration accuracy when the model is updated but the calibration function is not re-fitted?
- Why might self-consistency sampling give better calibration than mean log-probability for factual question-answering?
- How would you design a “borderline sampling” strategy that uses self-consistency only when needed?
Check-your-understanding answers
- Raw log-probabilities are not calibrated: a mean log-prob of -0.3 might correspond to 70% accuracy for one task class and 90% for another. Platt scaling maps raw scores to calibrated probabilities that are comparable across examples and interpretable as actual correctness rates.
- If wrong answers cost much more than abstentions (high cost_wrong), the optimal threshold is higher (abstain more aggressively). If costs are similar, the threshold is lower (answer more often). The threshold sits where the expected cost of answering equals the expected cost of abstaining: threshold = cost_wrong / (cost_wrong + cost_abstain), which corresponds to an acceptable error rate of cost_abstain / (cost_wrong + cost_abstain).
- The calibration function becomes stale. The relationship between log-probs and correctness may shift with the new model, causing the calibrated probabilities to be inaccurate. This manifests as increased ECE and either too many false confidences (dangerous) or too many false abstentions (wasteful). Always re-calibrate after model updates.
- Self-consistency measures whether the model can reliably reproduce the same answer across multiple samples. For factual QA, if 5 out of 5 samples agree on the same answer, that answer is likely correct regardless of the individual log-probabilities. Mean log-probability can be high even for wrong answers if the model is confidently wrong. Self-consistency captures a different signal: robustness of the answer under sampling variation.
- Route every query through the initial confidence scorer (mean log-prob, cheap). If the calibrated score falls within a “borderline zone” (e.g., threshold +/- 0.10), trigger self-consistency sampling with N = 3-5 additional samples. If the score is clearly above or below the threshold, accept or abstain immediately without extra sampling. This limits extra API calls to the 10-20% of queries that are genuinely ambiguous.
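The borderline-sampling routing just described can be sketched as follows; the sampler stub, the ±0.10 margin, and the 0.8 agreement cutoff are illustrative assumptions:

```python
def route(calibrated_p, threshold=0.85, margin=0.10, sampler=None, n_samples=5):
    """Borderline sampling: the cheap calibrated score decides clear cases;
    self-consistency is triggered only inside the ambiguous zone."""
    if calibrated_p >= threshold + margin:
        return "ANSWER"
    if calibrated_p < threshold - margin:
        return "ABSTAIN"
    # Borderline: draw extra samples and check agreement (sampler is a stub here)
    answers = [sampler() for _ in range(n_samples)]
    agreement = answers.count(max(set(answers), key=answers.count)) / n_samples
    return "ANSWER" if agreement >= 0.8 else "ABSTAIN"

print(route(0.99))                           # clearly above the zone: no extra sampling
print(route(0.50))                           # clearly below the zone: abstain immediately
print(route(0.80, sampler=lambda: "Paris"))  # borderline: self-consistency decides
```

Only the third call pays for extra samples, which is the point of the strategy: the added cost is confined to queries near the threshold.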
Real-world applications
- Medical AI triage systems use calibrated confidence to decide when to escalate to a human doctor, with thresholds set from historical accuracy data and regulatory requirements.
- Financial document analysis pipelines abstain on ambiguous clauses and flag them for human review, using isotonic regression calibrated on attorney-labeled datasets.
- Customer support bots use self-consistency to detect when the model is uncertain about product-specific answers, routing those queries to specialized agents.
- Search engines use calibrated confidence to decide whether to show a direct answer or only search results, based on per-query-type thresholds.
Where you’ll apply it
- Phase 2 of this project: after the sweep produces per-configuration results, build the calibration pipeline that transforms raw metrics into calibrated confidence scores.
- Phase 3: use calibrated scores to set abstention thresholds in the policy artifact.
References
- “Know Your Limits: A Survey of Abstention in Large Language Models” (Wen et al., 2025, TACL) - Comprehensive survey of abstention methods and evaluation
- “AI Engineering” by Chip Huyen - Chapter on evaluation metrics and model quality assessment
- “Pattern Recognition and Machine Learning” by Bishop - Calibration and posterior probability estimation
- “Trustworthy Online Controlled Experiments” by Kohavi et al. - Statistical foundations for A/B testing and experiment design
- AbstentionBench evaluation framework documentation
Key insights Confidence without calibration is just a number. Calibration with labeled data transforms it into a decision-making tool that enables principled abstention.
Summary Confidence calibration bridges the gap between raw model output scores and actual correctness probabilities. Post-hoc methods like Platt scaling and isotonic regression require labeled validation data but produce interpretable, task-specific confidence scores. Abstention policies use these calibrated scores with task-class cost models to decide when the model should answer, abstain, or escalate. The key insight from 2025 research is that LLMs are intrinsically poorly calibrated and do not naturally know when to abstain, making external calibration infrastructure essential for production reliability.
Homework/Exercises to practice the concept
- Given a dataset of 20 (mean_log_prob, is_correct) pairs, sketch a reliability diagram and estimate the ECE. Then describe how Platt scaling would transform the raw scores.
- Design an abstention policy for a “medical symptom checker” task class with cost_wrong = 20x and cost_abstain = 1x. What calibrated confidence threshold minimizes expected cost?
- Compare the advantages and disadvantages of self-consistency versus mean log-probability as confidence signals for a code generation task.
Solutions to the homework/exercises
- The reliability diagram bins predictions by confidence (e.g., 10 bins) and plots actual accuracy per bin. ECE is the weighted average of |accuracy_bin - confidence_bin| across bins. Typical uncalibrated LLMs show the curve below the diagonal (overconfident). Platt scaling fits logistic regression coefficients (a, b) such that P(correct) = sigmoid(a * raw_score + b), pulling the curve toward the diagonal.
- With cost_wrong = 20 and cost_abstain = 1, the break-even acceptable error rate is cost_abstain / (cost_wrong + cost_abstain) = 1/21 ~ 0.048. The calibrated confidence threshold should therefore be set so that P(correct | score >= threshold) >= 1 - 0.048 ~ 0.952. In practice, set the threshold to the calibrated score where accuracy in that bin reaches 95.2%. This will likely be a high threshold (e.g., 0.90-0.95 calibrated), resulting in significant abstention.
- For code generation, self-consistency is stronger because it detects functional equivalence: two code samples that look different but produce the same output on test cases. Mean log-probability might be high for plausible-looking but buggy code. However, self-consistency requires running test cases or comparing outputs, which is more expensive. Mean log-probability is cheaper but less reliable for detecting subtle bugs where the model is confidently wrong.
Sweep Design, Statistical Comparison, and Cost-Quality Tradeoffs
Fundamentals A temperature sweep is a controlled experiment. Like any experiment, it requires careful design to produce valid conclusions. The sweep must vary one parameter at a time while holding others constant, use a fixed evaluation dataset with known-correct answers, run enough samples per configuration to achieve statistical significance, and account for the cost (time, money, tokens) of each configuration. Without this discipline, sweep results are noise: you might conclude that T = 0.3 is optimal when the real signal is dominated by random variation in a small dataset. This concept teaches the experimental methodology that makes sweep results trustworthy and actionable for production policy decisions.
Deep Dive into the concept Sweep design starts with the parameter grid. Define the ranges and step sizes for each parameter you want to test. For temperature, a typical range is [0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0], with finer steps near the expected optimal zone. For top-p, [0.1, 0.3, 0.5, 0.7, 0.9, 1.0] covers the useful range. The key rule is to vary one parameter at a time: fix top_p = 1.0 while sweeping temperature, then fix temperature at the best value and sweep top_p. This is the one-factor-at-a-time (OFAT) approach, simple and interpretable. For more thorough analysis, factorial designs test combinations, but the number of configurations grows multiplicatively (7 temperatures x 6 top_p values = 42 configurations).
Each configuration must be evaluated on the same fixed dataset. The dataset should represent the production distribution: same query types, same difficulty distribution, same edge cases. If the sweep dataset is easier than production, the selected configuration will underperform in deployment. If it is harder, you will over-engineer robustness at the cost of creativity. Dataset design is as important as parameter selection.
Statistical comparison between configurations requires accounting for variance. A configuration with 96% pass rate on 200 samples has a 95% confidence interval of roughly +/- 2.7% (using the normal approximation for proportions). This means a configuration with 95% pass rate is not statistically distinguishable from one with 97% at this sample size. To detect a 2% difference with 95% confidence and 80% power, you need approximately 1,900 samples per configuration. This is why production sweep systems run thousands of evaluations, not hundreds.
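The sample-size arithmetic above can be checked with a small helper using the standard two-proportion normal approximation; the z values for a two-sided alpha = 0.05 and 80% power (1.96 and 0.84) are hardcoded:

```python
import math

def samples_needed(p, delta, z_alpha=1.96, z_beta=0.84):
    """Per-configuration sample size to detect a difference `delta` between
    two proportions near `p` (normal approximation, 95% confidence, 80% power)."""
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / delta ** 2)

print(samples_needed(0.95, 0.02))  # roughly 1,900 samples for a 2% difference
print(samples_needed(0.95, 0.03))  # roughly 830 samples for a 3% difference
```

Note the quadratic dependence on delta: halving the difference you want to detect quadruples the required sample size.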
For pairwise comparison, McNemar’s test is appropriate because the same dataset is evaluated under different configurations. McNemar’s test checks whether the off-diagonal cells (cases where configuration A passes but B fails, and vice versa) differ significantly. This is more powerful than comparing independent proportions because it uses the paired structure of the data.
Cost-quality tradeoff analysis quantifies the relationship between configuration settings and operational cost. Higher temperature typically produces longer outputs (more tokens), increasing both latency and monetary cost. The tradeoff curve plots quality (pass rate) against cost (tokens per query, dollars per 1000 queries, or p95 latency). The optimal configuration is not necessarily the highest-quality one; it is the one that provides acceptable quality at the best cost point. An SLO-based approach sets minimum quality thresholds (e.g., pass rate >= 95%, p95 latency <= 2 seconds) and selects the cheapest configuration that meets all thresholds.
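The SLO-based selection rule can be sketched as below; the metric field names (`pass_rate`, `p95_latency_s`, `total_tokens`) and the numbers are illustrative, not a fixed schema:

```python
def pick_config(summaries, slos):
    """Return the cheapest configuration whose metrics meet every SLO,
    or None if no configuration qualifies."""
    eligible = [
        (m["total_tokens"], name) for name, m in summaries.items()
        if m["pass_rate"] >= slos["min_pass_rate"]
        and m["p95_latency_s"] <= slos["max_p95_latency_s"]
    ]
    return min(eligible)[1] if eligible else None

summaries = {
    "T=0.0": {"pass_rate": 0.94, "p95_latency_s": 1.1, "total_tokens": 40_000},
    "T=0.2": {"pass_rate": 0.965, "p95_latency_s": 1.3, "total_tokens": 44_100},
    "T=0.5": {"pass_rate": 0.85, "p95_latency_s": 1.8, "total_tokens": 48_000},
}
slos = {"min_pass_rate": 0.95, "max_p95_latency_s": 2.0}
print(pick_config(summaries, slos))
```

With these illustrative numbers only T=0.2 clears the pass-rate SLO, so it is selected despite not being the cheapest overall.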
Sweep Methodology: One-Factor-at-a-Time (OFAT)
Phase 1: Sweep Temperature (fix top_p=1.0, penalties=0.0)
+-------+-------+-------+-------+-------+-------+-------+
| T=0.0 | T=0.1 | T=0.2 | T=0.3 | T=0.5 | T=0.7 | T=1.0 |
+-------+-------+-------+-------+-------+-------+-------+
Result: T=0.2 is best for customer_faq task class
|
v
Phase 2: Sweep Top-p (fix T=0.2, penalties=0.0)
+-------+-------+-------+-------+-------+-------+
| p=0.1 | p=0.3 | p=0.5 | p=0.7 | p=0.9 | p=1.0 |
+-------+-------+-------+-------+-------+-------+
Result: p=0.9 matches p=1.0 (no improvement)
|
v
Phase 3: Sweep Penalties (fix T=0.2, top_p=1.0)
+----------+----------+----------+----------+
| freq=0.0 | freq=0.2 | freq=0.5 | freq=1.0 |
+----------+----------+----------+----------+
Result: freq=0.0 is best (FAQ answers need repetition)
|
v
Final Config: T=0.2, top_p=1.0, freq_penalty=0.0, seed=42
Cost-Quality Tradeoff Curve
Pass Rate |
(quality) |
   96.5%  |              * T=0.2
   96%    |- - - - - - - - - - - - - - - - - -  SLO minimum quality
   95%    |   * T=0.0
   94%    |        * T=0.1
   91%    |                    * T=0.3
   85%    |                            * T=0.5
   80%    |                                  * T=0.7
          +----+----+----+--------+--------+-->
            40k  42k  44k       48k      52k
                Tokens per 200 queries (cost)
Optimal: T=0.2 achieves SLO quality (96.5%) at
reasonable cost (44.1k tokens). T=0.0 saves tokens
but misses SLO. T=0.3+ wastes tokens without
quality improvement.
How this fits into the project Sweep design is the experimental backbone of Project 7. Without proper experimental methodology, the sweep produces unreliable results that cannot be trusted for production policy decisions. This concept ensures that the sweep runner generates statistically valid, cost-aware configuration comparisons.
Definitions & key terms
- OFAT (One-Factor-at-a-Time): Experimental design that varies one parameter while holding all others constant. Simple and interpretable but may miss interaction effects.
- Factorial design: Tests all combinations of parameter values. More thorough but exponentially more expensive.
- McNemar’s test: Statistical test for comparing two classifiers evaluated on the same dataset. Uses the paired structure of the data for more statistical power.
- Confidence interval: Range of values within which the true parameter value lies with a specified probability (typically 95%).
- SLO (Service Level Objective): A target value for a reliability or quality metric (e.g., pass rate >= 95%).
- Cost-quality tradeoff: The relationship between operational cost and output quality across configurations. The optimal point depends on the SLO and budget.
- Effect size: The magnitude of the difference between configurations. Small effect sizes require larger sample sizes to detect reliably.
Mental model diagram (ASCII)
Sweep Execution Pipeline
+------------------+ +-------------------+
| Sweep Config | | Fixture Dataset |
| - param grid | | - N queries |
| - metrics list | | - expected answers|
| - seed | | - task class |
+--------+---------+ +--------+----------+
| |
v v
+--------+------------------------+----------+
| Sweep Runner |
| for each config in param_grid: |
| for each query in dataset: |
| call LLM(query, config, seed) |
| score(response, expected_answer) |
| record(config, query_id, metrics) |
+---------------------+-----------------------+
|
v
+---------------------+-----------------------+
| Metric Aggregator |
| per config: |
| pass_rate, abstention_rate, format_ok |
| mean_latency, total_tokens, cost |
| confidence intervals (bootstrap or normal) |
| pairwise comparisons (McNemar) |
+---------------------+-----------------------+
|
v
+---------------------+-----------------------+
| Report Generator |
| - sweep results CSV |
| - cost-quality tradeoff chart data |
| - statistical comparison table |
| - recommended config with justification |
+---------------------------------------------+
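The runner stage of the pipeline above can be sketched as a minimal skeleton with the provider call stubbed out; the fixture field names (`id`, `query`, `expected`) are illustrative assumptions:

```python
import hashlib
import json
import random

def call_llm(query, config, seed):
    """Stub for the real provider call; returns (response, mean_logprob, tokens).
    The seeded RNG makes the stub deterministic, mimicking a fixed-seed API call."""
    rng = random.Random(f"{seed}:{query}:{config['temperature']}")
    return "stub answer", -rng.random(), rng.randint(50, 200)

def config_id(config):
    """Stable identifier: hash of the sorted parameter dict."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:8]

def run_sweep(param_grid, dataset, seed=42):
    """Evaluate every (config, query) pair with a fixed seed and record metrics."""
    results = []
    for temperature in param_grid["temperature"]:
        config = {"temperature": temperature, "top_p": 1.0}
        for row in dataset:
            response, logprob, tokens = call_llm(row["query"], config, seed)
            results.append({
                "config_id": config_id(config),
                "query_id": row["id"],
                "is_correct": response == row["expected"],
                "mean_logprob": logprob,
                "token_count": tokens,
            })
    return results

dataset = [{"id": "q_001", "query": "What is the return policy?", "expected": "stub answer"}]
results = run_sweep({"temperature": [0.0, 0.2]}, dataset, seed=7)
print(len(results))
```

A real implementation would replace `call_llm` with the provider SDK and a proper scorer, but the control flow (config loop, query loop, per-pair record) is the same.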
How it works (step-by-step, with invariants and failure modes)
- Load the sweep configuration and validate parameter ranges against provider limits. Invariant: all parameter values are within the provider’s accepted range. Failure mode: passing temperature 1.5 to Anthropic (max 1.0) causes an API error.
- Load the fixture dataset and verify it has expected answer labels. Invariant: every query in the dataset has a labeled correct answer. Failure mode: missing labels cause evaluation to produce undefined pass rates.
- For each configuration, evaluate every query in the fixture set and record per-query metrics. Invariant: the same seed is used across configurations so that randomness differences are attributable to parameter changes, not sampling luck. Failure mode: forgetting to set the seed causes each run to be non-reproducible.
- Aggregate metrics per configuration: compute pass rate, abstention rate, format compliance, mean latency, and total token cost. Invariant: aggregation uses all queries, not a sample. Failure mode: skipping failed API calls biases the pass rate upward.
- Compute confidence intervals for each metric. Invariant: confidence intervals are computed correctly (e.g., using Wilson score interval for proportions, not the Wald interval which is inaccurate for extreme proportions). Failure mode: using the wrong interval formula produces artificially narrow or wide bounds.
- Run pairwise statistical tests between configurations. Invariant: use a paired test (McNemar) since the same dataset is used for all configurations. Failure mode: using an unpaired test (chi-squared) loses statistical power and may miss real differences.
- Generate the cost-quality tradeoff analysis and recommend the configuration that meets all SLOs at the lowest cost. Invariant: the recommendation includes both the configuration and the evidence supporting it (pass rate, confidence interval, p-value vs alternatives). Failure mode: recommending a configuration based on point estimates without checking whether the difference is statistically significant.
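The interval computation from the aggregation step can use the Wilson score interval, which (as noted above) behaves better than the Wald interval for proportions near 0 or 1:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion at ~95% confidence."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_interval(193, 200)  # 96.5% pass rate on 200 queries
print(round(lo, 3), round(hi, 3))
```

For 193/200 the interval is roughly [0.93, 0.98], which is why two configurations a couple of points apart are often indistinguishable at this sample size.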
Minimal concrete example
Pairwise comparison (McNemar) for T=0.1 vs T=0.2 on 200 queries:
             T=0.2 PASS   T=0.2 FAIL
T=0.1 PASS |    172     |     9     |
T=0.1 FAIL |     11     |     8     |   (cells sum to N=200)
Off-diagonal: b=9, c=11
McNemar chi2 = (|b-c| - 1)^2 / (b+c) = (|9-11| - 1)^2 / 20 = 0.05
p-value = 0.82 -> NOT significant at alpha=0.05
Conclusion: T=0.1 and T=0.2 are statistically indistinguishable
            at this sample size, so the decision comes down to cost.
Cost comparison:
T=0.1: 43,200 tokens total, $0.43 per sweep
T=0.2: 44,100 tokens total, $0.44 per sweep
Savings: $0.01 per sweep (negligible)
Decision: Use T=0.2 (slightly higher point estimate, negligible cost diff)
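The McNemar statistic used in the example is a one-liner with the continuity correction applied:

```python
def mcnemar(b, c):
    """McNemar chi-squared with continuity correction; b and c are the
    off-diagonal disagreement counts (A passes/B fails and vice versa)."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# T=0.1 vs T=0.2 disagreements from the table above
chi2 = mcnemar(9, 11)
print(chi2)           # compare against the 3.84 critical value (1 df, alpha=0.05)
print(chi2 >= 3.84)   # significance check
```

For a proper p-value you would feed the statistic to a chi-squared survival function (e.g., `scipy.stats.chi2.sf(chi2, 1)`), or use an exact binomial test when b + c is small.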
Common misconceptions
- “The configuration with the highest pass rate is always the best.” If the difference is not statistically significant, you are choosing based on noise. Always check confidence intervals and p-values before declaring a winner.
- “200 samples is enough for a reliable sweep.” 200 samples gives a 95% confidence interval of roughly +/- 3% for a proportion around 0.95. This means you cannot reliably distinguish configurations that differ by less than ~5%. For finer discrimination, you need 500-2000 samples per configuration.
- “Sweeps only need to run once.” Model updates, prompt changes, and dataset drift all invalidate previous sweep results. Sweeps should be re-run periodically (e.g., monthly) or triggered by model/prompt version changes.
- “Cost does not matter if quality improves.” In production, every token costs money and adds latency. A configuration that is 1% better but 30% more expensive may not be worth it. The cost-quality tradeoff analysis makes this decision explicit.
Check-your-understanding questions
- Why is McNemar’s test more appropriate than a chi-squared test for comparing two sweep configurations on the same dataset?
- A sweep on 200 samples shows T=0.2 at 96.5% and T=0.3 at 95.0%. Can you confidently say T=0.2 is better?
- How would you modify the sweep design if you wanted to test interaction effects between temperature and top-p?
- What is the minimum sample size needed to detect a 3% difference in pass rate with 95% confidence?
Check-your-understanding answers
- McNemar’s test uses the paired structure: it examines cases where the two configurations disagree (one passes, the other fails). A chi-squared test treats the two samples as independent, ignoring the pairing and losing statistical power. Since both configurations are evaluated on the same queries, the paired test is correct.
- No. The 95% confidence interval for a proportion of 0.965 on 200 samples is approximately [0.935, 0.985]. For 0.950, it is [0.915, 0.975]. These intervals overlap substantially, meaning the difference is not statistically significant. You would need roughly 3,300 samples per configuration to reliably detect a 1.5% difference.
- Use a factorial design: test all combinations of (T values) x (top_p values). For 7 temperature values and 6 top_p values, this requires 42 configurations. Analyze with two-way ANOVA or its non-parametric equivalent to test for main effects and interaction effects.
- Using the formula n = (Z_alpha/2 + Z_beta)^2 * 2 * p * (1-p) / delta^2, with alpha = 0.05, beta = 0.20, p ~ 0.95, delta = 0.03: n ~ (1.96 + 0.84)^2 * 2 * 0.95 * 0.05 / 0.03^2 ~ 7.84 * 0.095 / 0.0009 ~ 828. You need approximately 830 samples per configuration.
Real-world applications
- A/B testing platforms at scale (Netflix, Booking.com) use similar experimental methodology to compare recommendation algorithms, with the same principles of statistical significance, confidence intervals, and cost-aware optimization.
- Pharmaceutical clinical trials use factorial designs and paired statistical tests to compare drug dosages, analogous to comparing LLM parameter configurations.
- Cloud cost optimization teams analyze cost-quality tradeoffs for compute instance types, using SLO-based selection to balance performance and budget.
Where you’ll apply it
- Phase 1: design the sweep grid and fixture dataset.
- Phase 2: execute the sweep and run statistical comparisons.
- Phase 3: generate the cost-quality tradeoff analysis and compile the final policy recommendation.
References
- “Trustworthy Online Controlled Experiments” by Kohavi, Tang, Xu - Chapters on experiment design and statistical testing
- “Site Reliability Engineering” by Google - SLO and error budget chapters
- “AI Engineering” by Chip Huyen - Evaluation methodology and metrics
- McNemar’s test: McNemar, Q. (1947). “Note on the sampling error of the difference between correlated proportions or percentages.” Psychometrika.
Key insights A sweep without statistical rigor is just anecdote collection. The experiment design determines whether your policy decisions are evidence-based or noise-based.
Summary Sweep design applies controlled experiment methodology to LLM parameter optimization. One-factor-at-a-time designs isolate parameter effects. Fixed datasets and seeds ensure reproducibility. Statistical tests (McNemar for paired comparisons) determine whether observed differences are real. Cost-quality tradeoff analysis selects the configuration that meets SLOs at minimal cost. The entire process produces an evidence-backed policy artifact rather than a subjective parameter guess.
Homework/Exercises to practice the concept
- Design a complete sweep plan for a “product description generator” task class. Specify: parameter grid, fixture dataset requirements, metrics, SLOs, and the statistical test you would use for pairwise comparison.
- Given two configurations with pass rates 92% and 89% on 500 shared queries, where configuration A passes but B fails on 35 queries and B passes but A fails on 20 queries, compute the McNemar test statistic and determine if the difference is significant at alpha = 0.05.
- Sketch a cost-quality tradeoff curve for 5 hypothetical configurations and identify the optimal configuration given an SLO of pass_rate >= 90% and a budget constraint of $0.50 per 1000 queries.
Solutions to the homework/exercises
- The sweep plan should include: temperature [0.0, 0.1, 0.2, 0.3, 0.5, 0.7], top_p fixed at 1.0, frequency_penalty fixed at 0.0. Fixture: 500+ product description queries with human-rated quality labels. Metrics: quality score, format compliance, word count, token cost. SLOs: quality >= 4.0/5.0, format compliance >= 98%, cost <= $0.002 per description. Use McNemar’s test for pairwise comparison because all configurations are evaluated on the same dataset.
- McNemar chi2 = (|35 - 20| - 1)^2 / (35 + 20) = 14^2 / 55 = 196/55 = 3.56. The critical value for chi-squared with 1 degree of freedom at alpha = 0.05 is 3.84. Since 3.56 < 3.84, the difference is NOT significant at the 5% level (p ~ 0.059). Despite a 3-point difference in pass rate, this sweep cannot confirm that A is better than B. Recommend increasing the sample size to ~800 queries.
- The sketch should show points plotted with quality on the y-axis and cost on the x-axis. Draw a horizontal line at 90% (SLO) and a vertical line at $0.50 (budget). The optimal configuration falls in the upper-left quadrant (meets the SLO, within budget); if multiple configurations meet both constraints, the cheapest one wins, i.e., the point furthest to the left that still sits above the SLO line.
3. Project Specification
3.1 What You Will Build
A reliability experiment runner that sweeps sampling settings and learns safe confidence policies.
3.2 Functional Requirements
- Run fixed evaluation set across multiple temperature/top_p settings.
- Record pass rate, abstention rate, and volatility per setting.
- Generate confidence bands and recommended production policy.
- Export side-by-side chart-ready CSV for review.
3.3 Non-Functional Requirements
- Performance: Full sweep on 200 cases completes under 6 minutes.
- Reliability: Fixed seeds make per-band comparisons reproducible.
- Security/Policy: Policy blocks unapproved decoding values in production mode.
3.4 Example Usage / Output
$ uv run p07-sweeper run --dataset fixtures/faq_200.jsonl --temperatures 0.0,0.2,0.4,0.7 --seed 7 --out out/p07
[INFO] Task class: customer_faq
[INFO] Evaluated 4 temperature bands x 200 cases
[PASS] Best reliability band: T=0.2 (pass=96.5%, abstain=2.0%)
[PASS] Creativity band accepted for ideation: T=0.7
[INFO] Recommended policy saved: out/p07/confidence_policy.yaml
3.5 Data Formats / Schemas / Protocols
- Input fixture JSONL with expected outcomes by task class.
- Sweep results CSV keyed by temperature, top_p, and case id.
- Policy YAML mapping task classes to approved decoding band.
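A possible loader for the input fixture JSONL; the field names (`id`, `query`, `expected`, `task_class`) are illustrative assumptions, since the exact schema is up to you. The demo writes a temporary two-line fixture and reads it back:

```python
import json
import os
import tempfile

def load_fixture(path):
    """Load a JSONL fixture, failing fast on missing labels (see section 3.6:
    unlabeled rows would make pass rates undefined downstream)."""
    rows = []
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue
            row = json.loads(line)
            for field in ("id", "query", "expected", "task_class"):
                if field not in row:
                    raise ValueError(f"line {lineno}: missing required field {field!r}")
            rows.append(row)
    return rows

# Demo: write a tiny fixture and load it back
sample = [
    {"id": "q_001", "query": "What is the return window?", "expected": "30 days",
     "task_class": "customer_faq"},
    {"id": "q_002", "query": "Do you ship overseas?", "expected": "yes",
     "task_class": "customer_faq"},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write("\n".join(json.dumps(r) for r in sample))
    path = f.name
rows = load_fixture(path)
os.unlink(path)
print(len(rows))
```

Validating the fixture up front enforces the "every query has a labeled correct answer" invariant from the sweep steps.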
3.6 Edge Cases
- Task classes with tiny sample size causing unstable metrics.
- Decoding setting that improves style but hurts safety.
- Confidence scores not comparable across task families.
- Policy conflicts between experiment and runtime configuration.
3.7 Real World Outcome
This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.
3.7.1 How to Run (Copy/Paste)
$ uv run p07-sweeper run --dataset fixtures/faq_200.jsonl --temperatures 0.0,0.2,0.4,0.7 --seed 7 --out out/p07
- Working directory:
project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS - Required inputs: project fixtures under
fixtures/ - Output directory:
out/p07
3.7.2 Golden Path Demo (Deterministic)
Use the fixed seed already embedded in the command or config profile. You should see stable pass/fail totals between runs.
3.7.3 If CLI: exact terminal transcript
$ uv run p07-sweeper run --dataset fixtures/faq_200.jsonl --temperatures 0.0,0.2,0.4,0.7 --seed 7 --out out/p07
[INFO] Task class: customer_faq
[INFO] Evaluated 4 temperature bands x 200 cases
[PASS] Best reliability band: T=0.2 (pass=96.5%, abstain=2.0%)
[PASS] Creativity band accepted for ideation: T=0.7
[INFO] Recommended policy saved: out/p07/confidence_policy.yaml
$ echo $?
0
Failure demo:
$ uv run p07-sweeper run --dataset fixtures/faq_200.jsonl --temperatures 1.8 --seed 7 --out out/p07
[ERROR] Temperature value 1.8 exceeds policy max (1.2)
[HINT] Use approved sweep range from policies/p07_sampling_bounds.yaml
$ echo $?
2
4. Solution Architecture
4.1 High-Level Design
User Input / Trigger
|
v
+-------------------------+
| Sweep Runner |
| (parameterized eval |
| across configurations) |
+-------------------------+
|
v
+-------------------------+
| Metric Aggregator |
| (pass rate, abstention, |
| cost, confidence CIs) |
+-------------------------+
|
v
+-------------------------+
| Calibration Engine |
| (Platt scaling, ECE, |
| threshold selection) |
+-------------------------+
|
v
+-------------------------+
| Policy Compiler |
| (task-class -> config |
| mapping with evidence) |
+-------------------------+
|
v
Artifacts / CSV / YAML / Logs
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Sweep Runner | Executes eval matrix over decoding settings. | Keep fixture order stable for fair comparison. Use seed for reproducibility. |
| Metric Aggregator | Computes reliability, volatility, and cost metrics per configuration. | Track abstentions separately from errors. Compute confidence intervals. |
| Calibration Engine | Transforms raw scores into calibrated confidence probabilities. | Use Platt scaling for simplicity; isotonic regression for non-linear patterns. |
| Policy Compiler | Converts sweep results into runtime policy artifact. | Freeze approved bands per task class with statistical evidence. |
4.3 Data Structures (No Full Code)
SweepConfig:
- task_class: string
- param_grid: {temperature: [float], top_p: [float], ...}
- dataset_path: string
- seed: int
- metrics: [string]
- acceptance_criteria: {metric: threshold}
SweepResult:
- config_id: string (hash of parameters)
- query_id: string
- response_text: string
- is_correct: bool
- log_probs: [float]
- latency_ms: int
- token_count: int
ConfigSummary:
- config_id: string
- pass_rate: float
- pass_rate_ci: [float, float]
- abstention_rate: float
- mean_latency_ms: float
- total_tokens: int
- calibrated_threshold: float
PolicyArtifact:
- task_class: string
- approved_config: {temperature, top_p, ...}
- evidence: {pass_rate, ci, p_value_vs_alternatives}
- abstention_threshold: float
- escalation_path: string
4.4 Algorithm Overview
Key algorithm: Sweep-Calibrate-Compile pipeline
- Load sweep config and fixture dataset. Validate parameter ranges.
- For each configuration in the parameter grid, evaluate every query and record per-query metrics.
- Aggregate metrics per configuration with confidence intervals.
- Run pairwise statistical comparisons (McNemar) between top configurations.
- Build calibration model on labeled sweep data. Compute ECE.
- Set abstention thresholds per task class using calibrated scores and cost model.
- Compile policy artifact with approved configuration, evidence, and thresholds.
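The final compilation step might assemble the artifact like this; the field names mirror the PolicyArtifact structure in section 4.3, the input numbers are illustrative, and JSON is used in place of YAML only to stay within the standard library:

```python
import json

def compile_policy(task_class, config, summary, threshold, escalation="human_queue"):
    """Assemble the policy artifact as a plain dict: approved config plus the
    statistical evidence that justifies it."""
    return {
        "task_class": task_class,
        "approved_config": config,
        "evidence": {
            "pass_rate": summary["pass_rate"],
            "pass_rate_ci": summary["pass_rate_ci"],
            "p_value_vs_alternatives": summary["p_value"],
        },
        "abstention_threshold": threshold,
        "escalation_path": escalation,
    }

policy = compile_policy(
    "customer_faq",
    {"temperature": 0.2, "top_p": 1.0, "seed": 42},
    {"pass_rate": 0.965, "pass_rate_ci": [0.93, 0.98], "p_value": 0.82},
    threshold=0.85,
)
print(json.dumps(policy, indent=2))
```

Serializing the same dict to YAML (e.g., with PyYAML's `yaml.safe_dump`) would produce the `confidence_policy.yaml` artifact named in section 3.4.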
Complexity Analysis (conceptual):
- Time: O(C * N) where C = number of configurations, N = dataset size. Each (config, query) pair requires one API call.
- Space: O(C * N) for storing all per-query results. Aggregated summaries are O(C).
- Cost: C * N * avg_tokens_per_call * price_per_token. For 7 configs x 200 queries x 200 tokens/query at $0.01/1K tokens = $2.80 per sweep.
5. Implementation Guide
5.1 Development Environment Setup
# 1) Install dependencies (Python 3.11+, uv package manager)
# 2) Prepare fixtures under fixtures/ with labeled query-answer pairs
# 3) Set API keys for target provider(s)
# 4) Run the project command(s) listed in section 3.7
5.2 Project Structure
p07/
├── src/
│ ├── sweep_runner.py # Parameter grid execution
│ ├── metric_aggregator.py # Statistical analysis
│ ├── calibration.py # Platt scaling, ECE
│ ├── policy_compiler.py # YAML policy generation
│ └── cli.py # Command-line interface
├── fixtures/
│ ├── faq_200.jsonl # Customer FAQ evaluation set
│ └── ideation_100.jsonl # Creative task evaluation set
├── policies/
│ └── p07_sampling_bounds.yaml # Allowed parameter ranges
├── out/
└── README.md
5.3 The Core Question You’re Answering
“Where is the reliability-versus-creativity boundary for each task class, and what statistical evidence supports that boundary?”
This question matters because it forces the project to produce objective, evidence-based parameter recommendations instead of relying on subjective impressions or anecdotal testing.
5.4 Concepts You Must Understand First
- Sampling controls and entropy
- How does temperature reshape the token probability distribution, and why does this matter for output reliability?
- Book Reference: OpenAI/Anthropic API documentation + “AI Engineering” by Chip Huyen
- Confidence calibration
- How do you transform raw model scores into calibrated probabilities that reflect true correctness rates?
- Book Reference: “Pattern Recognition and Machine Learning” by Bishop - Calibration sections
- Statistical experiment design
- How do you design a controlled experiment that produces statistically significant comparisons between configurations?
- Book Reference: “Trustworthy Online Controlled Experiments” by Kohavi et al. - Experiment fundamentals
- Policy thresholding and SLOs
- How do you translate reliability data into actionable production policies with abstention and escalation paths?
- Book Reference: “Site Reliability Engineering” by Google - SLO chapters
5.5 Questions to Guide Your Design
- Sweep design
- Which parameters will you sweep? What ranges and step sizes?
- How many samples per configuration are needed for your target statistical power?
- How will you handle provider-specific parameter limitations?
- Calibration pipeline
- What confidence signals will you extract from model outputs?
- Which calibration method (Platt, isotonic) is appropriate for your data?
- How will you evaluate calibration quality (ECE, reliability diagram)?
- Policy compilation
- What SLOs must each task class meet?
- How will you set abstention thresholds using calibrated scores and cost models?
- How will the policy artifact be versioned and deployed?
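The calibration questions above can be prototyped in pure Python. This is a minimal sketch (gradient-descent Platt fit and equal-width-bin ECE), not a production implementation; real code would likely use sklearn's `LogisticRegression` for the Platt step:

```python
import math

def platt_fit(scores, labels, lr=0.1, epochs=2000):
    """Platt scaling: fit p(correct) = sigmoid(a*score + b) by gradient descent."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then take the
    count-weighted mean gap between bin confidence and bin accuracy."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(p for p, _ in bucket) / len(bucket)
        acc = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(conf - acc)
    return ece
```

Fit `platt_fit` on held-out labeled sweep data (e.g. mean log-prob as the score), map scores through the fitted sigmoid, and compare ECE before and after to verify the calibration actually helped.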
5.6 Thinking Exercise
Pre-Mortem for Temperature Sweeper + Confidence Policy
Before implementing, write down 10 ways this project can fail in production. Classify each failure into: sweep design, calibration, policy, or operations.
Questions to answer:
- Which failures stem from insufficient sample sizes or flawed experiment design?
- Which failures stem from stale calibration after a model update?
- Which failures require runtime detection and human escalation?
5.7 The Interview Questions They’ll Ask
- “Why should temperature policies differ by task class?”
- “How do you evaluate decoding reliability scientifically, not just by eyeballing outputs?”
- “What is the difference between model uncertainty and task-class risk, and how do they interact in abstention decisions?”
- “How would you design confidence bands for abstention that minimize expected cost?”
- “When should creativity (higher temperature) be intentionally reduced, and what evidence would you use to justify that decision?”
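The cost-minimizing confidence band asked about above has a closed form under a simple two-cost model (costs here are illustrative assumptions, not part of the project spec): answering at calibrated confidence p has expected cost (1 - p) * cost_wrong, while escalating always costs cost_escalate, so the break-even threshold is p* = 1 - cost_escalate / cost_wrong:

```python
def abstention_threshold(cost_wrong: float, cost_escalate: float) -> float:
    """Answer only when calibrated confidence exceeds the break-even point
    between expected cost of a wrong answer and the cost of escalating."""
    return max(0.0, 1.0 - cost_escalate / cost_wrong)

# If a wrong answer costs 10x a human escalation, abstain below ~0.9 confidence.
print(abstention_threshold(cost_wrong=10.0, cost_escalate=1.0))
```

Richer cost models (per-task-class risk, retry costs) change the arithmetic but not the principle: the threshold should fall out of explicit costs, not intuition.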
5.8 Hints in Layers
Hint 1: Fix your dataset first. Unstable or unrepresentative fixture sets make any sweep conclusion noisy. Invest in a high-quality labeled evaluation set before running any parameter experiments.
Hint 2: Control one variable at a time. Separate temperature sweeps from top-p sweeps from penalty sweeps. Run OFAT first; only use factorial designs if you need to test interaction effects.
Hint 3: Track abstentions explicitly. A high pass rate can hide over-abstention (the model is only answering easy queries). Always report pass rate AND abstention rate together. The "effective pass rate" (passes / total queries, not passes / answered queries) is the true reliability metric.
Hint 4: Write policy as an artifact, not tribal knowledge. The final policy YAML should include the approved configuration, the statistical evidence supporting it, and the conditions under which it should be re-evaluated. Never leave production parameters as undocumented settings in code.
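The metric split from Hint 3 is easy to get wrong, so it is worth encoding once; a small sketch:

```python
def reliability_report(passes: int, abstentions: int, total: int) -> dict:
    """Report both rates so over-abstention cannot hide behind a high pass rate."""
    answered = total - abstentions
    return {
        "pass_rate": passes / answered if answered else 0.0,  # passes / answered
        "abstention_rate": abstentions / total,
        "effective_pass_rate": passes / total,                # passes / ALL queries
    }

# 90 passes out of 100 answered looks great, until you see the 100 abstentions:
print(reliability_report(passes=90, abstentions=100, total=200))
```

Here the naive pass rate is 0.90 while the effective pass rate is 0.45; reporting only the former would badly overstate reliability.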
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Sampling parameter mechanics | "AI Engineering" by Chip Huyen | Model evaluation and tuning chapters |
| Calibration and posterior estimation | "Pattern Recognition and Machine Learning" by Bishop | Calibration-related sections |
| Experiment design and statistical testing | "Trustworthy Online Controlled Experiments" by Kohavi et al. | Experiment fundamentals and analysis |
| SLOs and operational thresholds | "Site Reliability Engineering" by Google | SLO chapters |
| Cost-quality optimization | "Designing Data-Intensive Applications" by Kleppmann | Tradeoff and system design chapters |
5.10 Implementation Phases
Phase 1: Foundation
- Define sweep configuration schema and validate parameter ranges per provider.
- Build the fixture dataset with labeled expected answers for at least one task class.
- Implement the sweep runner that executes one configuration at a time with seed-based reproducibility.
- Checkpoint: One configuration runs end-to-end and produces per-query metrics CSV.
Phase 2: Core Functionality
- Implement metric aggregation with confidence intervals and pairwise statistical tests.
- Build the calibration pipeline (Platt scaling, ECE computation, reliability diagram data).
- Add abstention threshold computation using calibrated scores and cost model.
- Checkpoint: Full sweep across 4+ configurations produces a comparison table with CIs and p-values.
Phase 3: Operational Hardening
- Implement the policy compiler that generates a versioned YAML artifact with evidence.
- Add cost-quality tradeoff analysis and chart-ready CSV export.
- Add validation that blocks unapproved parameter values in production mode.
- Document the runbook for re-running sweeps after model or prompt updates.
- Checkpoint: A team member can reproduce the entire sweep from a clean checkout and get the same policy artifact.
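A compiled artifact matching the PolicyArtifact shape described earlier might look like the following; every value here is illustrative, and the field names beyond the core schema (such as `model_version`) are suggestions:

```yaml
# policies/faq_v1.yaml -- illustrative values only
task_class: customer_faq
approved_config:
  temperature: 0.2
  top_p: 1.0
  seed: 42
evidence:
  pass_rate: 0.94
  ci: [0.90, 0.97]
  p_value_vs_alternatives: 0.003
abstention_threshold: 0.85
escalation_path: human_review_queue
model_version: example-model-2024-01   # re-sweep when this changes
```

Because YAML supports comments, the evidence and re-evaluation conditions live next to the values they justify, which is exactly the "artifact, not tribal knowledge" property the hints call for.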
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Sweep design | OFAT vs factorial | OFAT first, factorial if needed | OFAT is cheaper and interpretable; factorial needed only for interaction effects |
| Calibration method | Platt vs isotonic | Platt for small datasets, isotonic for large | Platt has fewer parameters (less overfitting risk); isotonic is more flexible |
| Confidence signal | Log-probs vs self-consistency | Log-probs primary, self-consistency for borderline | Log-probs are cheap (1 API call); self-consistency is expensive (N calls) but more reliable |
| Statistical test | McNemar vs chi-squared | McNemar | Data is paired (same dataset for all configs); McNemar uses pairing for more power |
| Policy format | JSON vs YAML | YAML | Human-readable, supports comments for evidence documentation |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate statistical calculations | confidence interval computation, McNemar test, Platt scaling fit |
| Integration Tests | Verify end-to-end sweep pipeline | golden-path sweep with mock API produces expected CSV and YAML |
| Edge Case Tests | Ensure robust failure handling | empty dataset, out-of-range parameters, zero-variance metrics |
6.2 Critical Test Cases
- Golden path succeeds: sweep runs across 4 temperatures on 200 fixtures and produces comparison table and policy YAML.
- Parameter validation: temperature outside provider range returns error with hint.
- Statistical edge case: two configurations with identical pass rates produce non-significant p-value.
- Calibration validation: ECE decreases after Platt scaling compared to raw scores.
- Determinism: same seed produces identical results across two runs.
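The "identical pass rates produce a non-significant p-value" case falls out naturally from an exact McNemar test, which can be implemented from scratch with the standard library (a sketch; production code might use statsmodels instead):

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact McNemar test on discordant pairs.

    b = queries config A passed and config B failed; c = the reverse.
    Under H0 the discordant pairs split 50/50, so the p-value is a
    two-sided binomial tail at p = 0.5 (capped at 1.0).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

print(mcnemar_exact(15, 3))   # heavily lopsided discordance: small p-value
print(mcnemar_exact(10, 10))  # perfectly balanced: 1.0, no evidence of a difference
```

Note that only discordant pairs enter the test; queries both configurations pass (or both fail) carry no comparative information, which is why McNemar has more power here than an unpaired chi-squared test.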
6.3 Test Data
fixtures/faq_200.jsonl # Primary evaluation set
fixtures/ideation_100.jsonl # Secondary creative task set
fixtures/edge_cases/
tiny_sample.jsonl # 5 queries (too few for reliable stats)
all_correct.jsonl # 100% pass rate edge case
all_wrong.jsonl # 0% pass rate edge case
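One line of `faq_200.jsonl` might look like the following; the field names here are an assumption for illustration, not a required schema:

```json
{"query": "How do I reset my password?", "expected": "Use the Forgot password link on the sign-in page.", "task_class": "customer_faq", "query_type": "account"}
```

Whatever schema you choose, include a stratification field (like `query_type` above) so the "sample fresh production-like fixtures" pitfall below can be addressed with stratified sampling.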
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| "Best temperature in eval fails live" | Eval data does not match production distribution. | Sample fresh production-like fixtures regularly. Use stratified sampling by query type. |
| "Confidence score is misleading" | Confidence is uncalibrated across task classes. | Calibrate thresholds per task family using Platt scaling on held-out labeled data. |
| "Policy drift over time" | Runtime settings changed without re-sweep. | Tie runtime config to policy artifact hash. Alert when artifact age exceeds threshold. |
| "Statistical significance ignored" | Configuration A chosen over B based on 1% difference on 100 samples. | Always compute confidence intervals and run McNemar test before declaring a winner. |
| "Seed gives false reproducibility" | Same seed produces different results after model update. | Track model version in policy artifact. Re-sweep after any model change. |
7.2 Debugging Strategies
- Re-run deterministic fixtures with fixed seed and compare per-query results.
- Diff latest sweep CSV against last known-good baseline to identify which queries changed.
- Isolate whether regression is in model output, metric computation, or calibration by testing each stage independently.
- Check provider changelog for model snapshot updates that might affect reproducibility.
7.3 Performance Traps
- Running factorial sweeps when OFAT suffices wastes API budget exponentially.
- Self-consistency sampling on all queries instead of borderline queries multiplies cost by N.
- Storing full response text for every (config, query) pair instead of just metrics creates storage bloat.
- Not caching API responses when re-running sweeps with the same seed wastes money on duplicate calls.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add one new task class with its own fixture set and expected outcomes.
- Add format compliance checking (does the response match the expected schema?).
8.2 Intermediate Extensions
- Implement self-consistency sampling as an alternative confidence signal for borderline queries.
- Add dashboard-ready visualizations: reliability diagram, cost-quality tradeoff curve, sweep comparison heatmap.
- Implement automated regression detection that compares new sweep results against historical baselines.
8.3 Advanced Extensions
- Build a multi-provider sweep that normalizes parameter ranges across OpenAI, Anthropic, and Google APIs.
- Implement Bayesian optimization to efficiently search the parameter space instead of grid search.
- Integrate with the canary rollout controller (P11) to automatically deploy approved configurations.
9. Real-World Connections
9.1 Industry Applications
- PromptOps platform teams use temperature sweeps to establish per-use-case reliability baselines before production deployment.
- AI governance teams require statistical evidence (sweep reports with confidence intervals) before approving new model configurations.
- Cost optimization teams use tradeoff analysis to reduce LLM spend by finding cheaper parameter configurations that still meet SLOs.
9.2 Related Open Source Projects
- OpenAI Evals framework for structured evaluation of LLM outputs.
- LangSmith evaluation and tracing workflows for parameter comparison.
- PromptFoo for automated prompt and parameter testing.
9.3 Interview Relevance
- Demonstrates ability to apply experimental methodology to LLM parameter optimization rather than relying on guesswork.
- Shows understanding of statistical significance, calibration, and cost-quality tradeoffs that production systems require.
- Proves familiarity with provider-specific parameter differences and reproducibility challenges.
10. Resources
10.1 Essential Reading
- OpenAI API Reference: temperature, top_p, frequency_penalty, presence_penalty, seed parameters
- Anthropic API Reference: temperature, top_p, top_k parameters and limitations
- “Know Your Limits: A Survey of Abstention in Large Language Models” (Wen et al., 2025, TACL)
- “AI Engineering” by Chip Huyen - evaluation and parameter tuning chapters
10.2 Video Resources
- Talks on LLM eval systems, PromptOps, and parameter optimization methodology.
- Conference presentations on confidence calibration and abstention in production systems.
10.3 Tools & Documentation
- vLLM Sampling Parameters documentation for open-weight model configurations.
- PromptFoo for automated prompt and parameter testing.
- SciPy stats module for McNemar’s test and calibration functions.
10.4 Related Projects in This Series
- P08 (Prompt DSL + Linter): sweep results feed into linting rules that enforce approved parameter ranges.
- P09 (Prompt Caching Optimizer): cost analysis from sweeps informs caching decisions.
- P11 (Canary Prompt Rollout Controller): approved configurations from sweeps are deployed via canary rollout.
- P14 (Adversarial Eval Forge): adversarial datasets extend the sweep fixture set for robustness testing.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how temperature reshapes the token probability distribution and why T=0 is not truly deterministic.
- I can explain the difference between top-k and top-p sampling and when each is appropriate.
- I can explain why confidence calibration is necessary and how Platt scaling works.
- I can justify abstention threshold choices using cost models and calibrated confidence scores.
11.2 Implementation
- Golden-path sweep runs end-to-end and produces a comparison table with confidence intervals.
- McNemar’s test correctly identifies statistically significant differences between configurations.
- Calibration pipeline reduces ECE compared to raw scores.
- Policy YAML artifact includes approved configuration, evidence, and re-evaluation criteria.
11.3 Growth
- I can explain the cost-quality tradeoff for my task class and justify the chosen configuration.
- I can describe how to re-run the sweep after a model update and what to look for in the new results.
- I can explain this project’s methodology in an interview setting with statistical rigor.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Sweep runs across 4+ temperature values on a 200-query fixture set with fixed seed.
- Per-configuration metrics include pass rate, abstention rate, and token cost.
- Policy YAML artifact maps the task class to an approved temperature band.
Full Completion:
- Confidence intervals and McNemar’s test results are computed for pairwise comparisons.
- Calibration pipeline (Platt scaling) produces calibrated confidence scores with ECE report.
- Cost-quality tradeoff analysis with chart-ready CSV export.
- Runbook documents re-sweep procedure for model and prompt updates.
Excellence (Above & Beyond):
- Multi-provider sweep with normalized parameter ranges.
- Self-consistency sampling for borderline queries.
- Integration with P11 (canary rollout) for automated policy deployment.
- Bayesian optimization for efficient parameter search beyond grid sweep.