Project 5: Few-Shot Example Curator
Curated few-shot library with measurable lift and drift alerts.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | See main guide estimates (typically 3-8 days except capstone) |
| Main Programming Language | Python |
| Alternative Programming Languages | TypeScript |
| Coolness Level | Level 3: Quietly Powerful |
| Business Potential | 3. Consulting Accelerator |
| Knowledge Area | Prompt Data Engineering |
| Software or Tool | Example bank + selector |
| Main Book | Pattern Recognition and Machine Learning (Bishop) |
| Concept Clusters | Prompt Contracts and Output Typing; Evaluation, Rollouts, and Governance |
1. Learning Objectives
By completing this project, you will:
- Design and maintain a versioned example bank with structured metadata for task class, difficulty, risk tags, and source provenance.
- Implement example selection algorithms that optimize for relevance, diversity, and token budget constraints simultaneously.
- Measure quality lift on holdout fixtures and produce reproducible evaluation artifacts.
- Detect and alert on example drift when demonstrations become stale, biased, or contradictory.
- Build a deterministic selection pipeline where fixed bank version plus fixed seed always produces the same example set.
2. All Theory Needed (Per-Concept Breakdown)
In-Context Learning and Few-Shot Prompting Theory
Fundamentals In-context learning (ICL) is the mechanism by which large language models learn to perform tasks from demonstrations provided directly in the prompt, without any gradient updates or fine-tuning. When you include input-output pairs in a prompt before posing the actual query, the model uses those examples to infer the task pattern and generate an appropriate response. This is few-shot prompting: providing a small number of demonstrations (typically 2-8) that teach the model what you want by showing rather than telling. Few-shot prompting works because transformer models develop an implicit ability during pretraining to recognize patterns in sequential data and apply them to novel inputs. The quality, relevance, and diversity of these demonstrations have a dramatic impact on output quality, often exceeding the impact of instruction tuning for specific tasks.
Deep Dive into the concept The theoretical foundation of in-context learning rests on the observation that transformers can implement implicit learning algorithms during their forward pass. Research from Brown et al. (GPT-3, 2020) demonstrated that scaling model size dramatically improves few-shot performance, and subsequent work has shown that ICL can approximate gradient descent in the attention mechanism itself. Understanding this mechanism is critical for this project because it explains why example selection matters so much: the model is effectively fitting an internal model to your demonstrations during inference.
There are three modes of shot-based prompting. Zero-shot provides only an instruction and relies entirely on the model’s pretraining knowledge. One-shot provides a single demonstration, which is often sufficient for simple formatting tasks but fragile for nuanced reasoning. Few-shot provides multiple demonstrations (typically 3-8), which gives the model enough signal to learn the task pattern reliably. The sweet spot depends on the task complexity, the model size, and the available token budget.
A critical finding from recent research is that few-shot performance is not monotonically increasing with the number of examples. Adding too many examples can hurt performance through a phenomenon called “over-prompting,” where the model becomes confused by excessive or contradictory demonstrations. The marginal utility of each additional example decreases, and eventually the noise from imperfect examples outweighs the signal. This means your selection pipeline must optimize for quality over quantity.
The format of demonstrations matters as much as their content. Input-output pairs are the simplest format: show the input, show the desired output. Chain-of-thought demonstrations go further by including the reasoning steps between input and output, which significantly improves performance on multi-step reasoning tasks. The choice between these formats depends on whether the task requires pattern matching (use simple pairs) or reasoning (use chain-of-thought).
Few-shot prompting interacts with the instruction in complex ways. The instruction sets the task framing, while the examples calibrate the output distribution. When instruction and examples conflict, models tend to follow the examples over the instruction. This means that carefully curated examples can override ambiguous or imprecise instructions, but it also means that bad examples can undermine even well-written instructions. This duality is why a curation pipeline is essential: you cannot rely on instructions alone to compensate for poor demonstrations.
Token budget constraints add a practical dimension. Each demonstration consumes tokens from the context window. A typical few-shot setup with 5 examples of moderate length might consume 500-2000 tokens, which is significant when the context window is 4K-8K tokens and the task also requires substantial input context. Budget-aware selection means choosing examples that maximize information density: high-quality demonstrations that teach the task pattern efficiently, without redundancy.
The relationship between example count, quality, and token budget creates a three-way tradeoff. More examples provide more signal but consume more tokens. Higher quality examples provide more signal per token but may be harder to find. Smaller token budgets force fewer examples but demand that each one counts. The curator’s job is to navigate this tradeoff for each task class and model configuration.
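The budget arithmetic behind this tradeoff can be sketched as a small helper. The window and component sizes below are the illustrative figures used in this section, not fixed constants:

```python
def demo_budget(window, instruction_toks, query_toks, output_reserve, toks_per_demo):
    """Tokens left for demonstrations, and how many demos fit at a fixed size."""
    budget = window - instruction_toks - query_toks - output_reserve
    max_k = budget // toks_per_demo
    return budget, max_k

# Illustrative numbers: an 8K window, moderate instruction and query context.
budget, max_k = demo_budget(8192, 200, 2000, 500, 300)
print(budget, max_k)  # 5492 tokens of demo budget, room for up to 18 demos
```

Even though 18 demonstrations fit, diminishing returns mean a curator would typically stop at 5-8 and spend the remaining budget on richer input context.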
How this fits into the project This concept is the theoretical foundation for Project 5. Every component of the curator pipeline, from the bank cleaner to the selector to the lift evaluator, depends on understanding how in-context learning works, why example quality matters, and what happens when examples are poorly chosen. The selection algorithm directly implements the principles described here.
Definitions & key terms
- In-context learning (ICL): The ability of LLMs to learn task patterns from demonstrations in the prompt without weight updates.
- Few-shot prompting: Providing 2-8 demonstration examples before the actual query to teach the model the desired behavior.
- Over-prompting: The phenomenon where too many demonstrations degrade performance by confusing the model or introducing contradictory signal.
- Chain-of-thought (CoT) demonstrations: Examples that include intermediate reasoning steps between input and output.
- Token budget: The maximum number of tokens available for demonstrations, constrained by context window size minus instruction and input tokens.
- Demonstration: A single input-output pair (or input-reasoning-output triple) included in the prompt as a teaching example.
- Marginal utility: The incremental quality improvement gained by adding one more demonstration to the prompt.
Mental model diagram (ASCII)
FEW-SHOT PROMPTING ARCHITECTURE
===============================
+----------------------------------------------------------+
| PROMPT STRUCTURE |
| |
| +---------------------------------------------------+ |
| | INSTRUCTION BLOCK | |
| | "You are a customer support classifier..." | |
| +---------------------------------------------------+ |
| | |
| +---------------------------------------------------+ |
| | DEMONSTRATION 1 (input-output pair) | |
| | Input: "My order is late" | |
| | Output: { "category": "shipping", "priority": 2 }| |
| +---------------------------------------------------+ |
| +---------------------------------------------------+ |
| | DEMONSTRATION 2 (with chain-of-thought) | |
| | Input: "I want a refund for damaged item" | |
| | Reasoning: "mentions damage -> refund policy..." | |
| | Output: { "category": "refund", "priority": 3 } | |
| +---------------------------------------------------+ |
| +---------------------------------------------------+ |
| | DEMONSTRATION 3 ... | |
| +---------------------------------------------------+ |
| +---------------------------------------------------+ |
| | DEMONSTRATION K (last) | |
| +---------------------------------------------------+ |
| | |
| +---------------------------------------------------+ |
| | ACTUAL QUERY | |
| | Input: "Charged twice for same order" | |
| +---------------------------------------------------+ |
+----------------------------------------------------------+
|
v
+-------------+
| LLM |
| (in-context |
| learning) |
+-------------+
|
v
+----------------------------+
| Output: { "category": |
| "billing", "priority": 3 }|
+----------------------------+
TOKEN BUDGET TRADEOFF:
+===========================================+
| Context Window (e.g. 8192 tokens) |
| |
| [Instruction] [Demo1] [Demo2] ... [DemoK] |
| ~200 tok ~300 ~300 ~300 |
| |
| [Query + Input Context] [Output Reserve] |
| ~2000 tokens ~500 tokens |
| |
| Budget for demos = 8192 - 200 - 2000 - 500|
| = 5492 tokens |
| At ~300 tok/demo: max K = 18 |
| But diminishing returns after K = 5-8 |
+===========================================+
How it works (step-by-step, with invariants and failure modes)
- The instruction block establishes the task frame: what the model should do and what format output should take. Invariant: the instruction is always present and precedes demonstrations. Failure mode: if instructions are omitted, the model must infer the task entirely from examples, which is fragile for complex tasks.
- Demonstrations are inserted in sequence after the instruction. Invariant: each demonstration is a complete input-output pair with consistent formatting. Failure mode: if a demonstration is malformed (missing output, inconsistent format), the model may copy the malformation in its response.
- The model processes the full prompt in a single forward pass. Invariant: the total token count of instruction + demonstrations + query + output reserve does not exceed the context window. Failure mode: exceeding the context window causes truncation, which silently drops demonstrations.
- The model generates a response conditioned on the demonstrations. Invariant: with fixed demonstrations and temperature=0, the response is deterministic. Failure mode: if demonstrations are contradictory, the model output becomes unpredictable.
- The response is evaluated against the expected schema and quality criteria. Invariant: the evaluation uses a holdout set not present in the demonstration bank. Failure mode: evaluating on the same examples used as demonstrations inflates quality metrics.
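The holdout invariant in the last step can be enforced mechanically: split the bank once with a fixed seed so the demonstration pool and the evaluation set never overlap, and so the same bank version plus the same seed always yields the same split. A minimal sketch (the `id` field name follows the bank schema used in this project):

```python
import random

def split_bank(records, holdout_frac=0.2, seed=42):
    """Deterministically split a bank into a demo pool and a holdout set.
    Same records + same seed always produce the same split."""
    ids = sorted(r["id"] for r in records)   # stable order before shuffling
    rng = random.Random(seed)                # isolated, seeded RNG
    rng.shuffle(ids)
    cut = int(len(ids) * holdout_frac)
    holdout = set(ids[:cut])
    demos = [r for r in records if r["id"] not in holdout]
    evals = [r for r in records if r["id"] in holdout]
    return demos, evals

bank = [{"id": f"ex_{i:03d}"} for i in range(10)]
demos, evals = split_bank(bank)
assert not {r["id"] for r in demos} & {r["id"] for r in evals}  # disjoint pools
```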
Minimal concrete example
Example bank record (JSONL format, shown pretty-printed for readability):
{
"id": "ex_refund_042",
"task_class": "support_refund",
"input": "I received a broken laptop screen and need my money back",
"output": {
"category": "refund",
"priority": 3,
"sentiment": "frustrated"
},
"reasoning": "Customer reports physical damage, requests refund explicitly",
"tags": ["electronics", "damage", "explicit_refund"],
"difficulty": "medium",
"source": "production_logs_2025_q3",
"added_date": "2025-09-15",
"quality_score": 0.94
}
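A bank in this schema can be loaded and validated line by line. The sketch below uses a minimal required-field list mirroring the record above (`reasoning` and dates are treated as optional here; the helper name is an assumption):

```python
import json

REQUIRED = {"id", "task_class", "input", "output",
            "tags", "difficulty", "source", "quality_score"}

def load_bank(lines):
    """Parse JSONL bank records, rejecting any record with missing fields."""
    records = []
    for n, line in enumerate(lines, 1):
        rec = json.loads(line)
        missing = REQUIRED - rec.keys()
        if missing:
            raise ValueError(f"line {n}: missing fields {sorted(missing)}")
        records.append(rec)
    return records

good = ('{"id": "ex_refund_042", "task_class": "support_refund", '
        '"input": "broken laptop screen", "output": {"category": "refund"}, '
        '"tags": ["damage"], "difficulty": "medium", '
        '"source": "production_logs_2025_q3", "quality_score": 0.94}')
records = load_bank([good])
```

Failing fast on malformed records keeps malformations out of prompts, where the model would otherwise copy them.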
Selector pseudocode:
FUNCTION select_examples(bank, query, k, budget_tokens):
candidates = filter_by_task_class(bank, query.task_class)
candidates = remove_policy_blocked(candidates)
candidates = deduplicate_near_clones(candidates, threshold=0.92)
scored = score_relevance(candidates, query.input)
selected = diversity_constrained_topk(scored, k, budget_tokens)
RETURN selected
Common misconceptions
- “More examples always means better performance.” Research shows diminishing returns after 5-8 examples, and over-prompting can actively degrade output quality. Quality and diversity matter far more than quantity.
- “Any relevant example is a good example.” Examples with subtle errors, ambiguous outputs, or outdated information teach the model bad patterns. Curation means aggressively filtering for quality, not just relevance.
- “Few-shot prompting is just about formatting.” The model is performing implicit learning during its forward pass. The examples are not just formatting guides; they calibrate the entire output distribution.
- “Instructions and examples are interchangeable.” When they conflict, models follow examples. This means examples have more power than instructions for shaping output behavior, which makes curation critical.
- “The order of examples does not matter.” Research on primacy and recency bias shows that example ordering significantly affects model behavior, with models attending more to examples at the beginning and end of the sequence.
Check-your-understanding questions
- Why does adding a 9th example sometimes decrease performance compared to using only 6?
- When would you choose chain-of-thought demonstrations over simple input-output pairs?
- How does the token budget tradeoff change when moving from a 4K to a 128K context window model?
- Why is it dangerous to evaluate few-shot quality on the same examples used as demonstrations?
- What happens when the instruction says “classify into 3 categories” but the demonstrations show 5 categories?
Check-your-understanding answers
- The 9th example may introduce noise, redundancy, or contradictory signal that confuses the model’s implicit pattern matching. With 6 high-quality, diverse examples, the model has enough signal to perform the task well. Additional examples beyond the marginal utility threshold add token cost without proportional quality improvement, and may actively interfere with the learned pattern.
- Chain-of-thought demonstrations are superior for tasks requiring multi-step reasoning, complex classification with justification, or tasks where the output depends on intermediate analysis. Simple input-output pairs suffice for straightforward pattern matching tasks like formatting, simple classification, or extraction where the mapping is direct.
- With a larger context window, you have more token budget for demonstrations, but diminishing returns still apply. The marginal utility curve flattens regardless of available budget. A 128K window lets you include more complex demonstrations with richer context, but you should still optimize for quality and diversity rather than maximizing count.
- This creates a circular evaluation: you are testing whether the model can reproduce examples it just saw, not whether it learned the underlying pattern. Holdout evaluation tests generalization, which is the actual goal.
- The model will likely follow the examples and use 5 categories, because demonstrations have more influence than instructions on output behavior. This is why curation must ensure alignment between instruction claims and demonstration patterns.
Real-world applications
- Customer support platforms use curated few-shot examples to classify tickets, route escalations, and generate draft responses, with example banks refreshed quarterly from production logs.
- Medical NLP systems use carefully curated clinical demonstrations to extract structured data from clinical notes, where example quality directly impacts patient safety.
- Financial compliance teams use few-shot examples to classify transactions and flag suspicious patterns, with strict auditability requirements on which examples were used for each classification.
- E-commerce search teams use few-shot demonstrations to improve query understanding and product categorization across multilingual catalogs.
Where you’ll apply it
- Phase 1: designing the example bank schema and understanding what metadata each demonstration needs.
- Phase 2: implementing the selection algorithm that chooses demonstrations based on relevance, diversity, and budget.
- Phase 3: measuring lift and detecting when the example bank has drifted.
References
- “Language Models are Few-Shot Learners” by Brown et al. (GPT-3 paper) - foundational ICL research
- “AI Engineering” by Chip Huyen - Chapters on prompting strategies and evaluation
- “Pattern Recognition and Machine Learning” by Bishop - selection theory foundations
- “Rethinking the Role of Demonstrations” by Min et al. - research on what makes demonstrations effective
- Prompt Engineering Guide (promptingguide.ai) - few-shot prompting techniques
- “More Samples or More Prompts?” (ACL 2024) - research on in-context sampling strategies
Key insights Few-shot prompting is not formatting; it is implicit learning, and the quality of what the model learns is entirely determined by the quality of the demonstrations you curate.
Summary In-context learning allows LLMs to learn task patterns from demonstrations in the prompt. Few-shot prompting leverages this by providing 2-8 high-quality examples before the actual query. The effectiveness depends on example quality, diversity, format (simple pairs vs chain-of-thought), and ordering. Over-prompting degrades performance. Token budgets create a three-way tradeoff between example count, quality, and context space. A curation pipeline must optimize across all these dimensions simultaneously.
Homework/Exercises to practice the concept
- Design a JSONL schema for an example bank record that includes at least 10 metadata fields (id, task_class, input, output, reasoning, tags, difficulty, source, added_date, quality_score). Justify each field’s purpose.
- Given a 4096-token context window, 200 tokens for instruction, and 1500 tokens for query context, calculate the maximum number of demonstrations at 250 tokens each. Then explain why you might choose fewer than the maximum.
- Write pseudocode for a function that determines whether adding one more demonstration improves expected quality (marginal utility check).
Solutions to the homework/exercises
- The JSONL schema should justify each field: id for traceability, task_class for filtering, input/output for the demonstration itself, reasoning for chain-of-thought support, tags for diversity constraint checking, difficulty for stratified selection, source for provenance auditing, added_date for freshness filtering, quality_score for ranking. Additional useful fields: token_count (for budget calculation), hash (for deduplication), last_used_date (for staleness detection).
- Available budget = 4096 - 200 - 1500 - 500 (output reserve) = 1896 tokens. At 250 tokens each: max = 7. You might choose 5 instead because: diminishing returns plateau around 5-6 examples for most classification tasks, leaving headroom prevents silent truncation if the query is slightly longer than expected, and fewer examples with higher diversity scores often outperform more examples with redundancy.
- The pseudocode should compute the expected quality score with and without the candidate example (using holdout evaluation), compare the delta, and only add the example if the delta exceeds a minimum threshold (e.g., 0.5% improvement). The function should also check that the token budget is not exceeded.
Example Selection Strategies
Fundamentals Example selection is the algorithmic core of the curator pipeline: given a bank of candidate demonstrations, a task query, and constraints (token budget, diversity requirements, policy filters), choose the subset that maximizes expected quality on the target task. Naive selection (random sampling or top-k by cosine similarity) leaves significant quality on the table. Production-grade selection requires multi-objective optimization across relevance, diversity, coverage, and token efficiency. The selection strategy determines whether few-shot prompting delivers consistent quality improvement or introduces unpredictable variance.
Deep Dive into the concept There are four primary families of selection strategies, each with distinct tradeoffs.
Semantic similarity selection is the most common approach. Embed both the candidate demonstrations and the incoming query using the same embedding model, then select the K demonstrations with the highest cosine similarity to the query. This works well when the task requires demonstrations that are topically similar to the query. The weakness is homogeneity: if the top-K most similar examples are all near-duplicates of each other, the model receives redundant signal. Similarity-based selection also fails when the task requires diverse edge-case coverage rather than topical relevance.
Diversity-constrained selection addresses the homogeneity problem. After computing relevance scores, apply a diversity constraint that penalizes selecting examples that are too similar to already-selected examples. Maximal Marginal Relevance (MMR) is the classic algorithm: at each selection step, choose the example that maximizes a weighted combination of relevance to the query and dissimilarity to already-selected examples. The lambda parameter controls this tradeoff: lambda=1.0 is pure relevance, lambda=0.0 is pure diversity. In practice, lambda=0.5-0.7 works well for most tasks.
Difficulty-stratified selection ensures that the demonstration set covers a range of task difficulties. If all examples are easy cases, the model may not learn to handle complex inputs. If all examples are hard cases, the model may not calibrate well for simple inputs. Stratified selection assigns each candidate a difficulty tier (easy, medium, hard) and ensures the selected set includes examples from each tier proportionally. This is especially important for tasks with heterogeneous input complexity, like customer support where queries range from simple FAQ lookups to complex multi-issue complaints.
Coverage-based selection optimizes for covering the space of possible inputs rather than matching a specific query. This is useful when the example set will be reused across many queries (static example bank) rather than dynamically selected per query. Coverage selection partitions the input space into regions (by topic, entity type, difficulty, or other dimensions) and ensures at least one example represents each region. The k-medoids algorithm is a natural fit: select K examples that minimize the maximum distance from any candidate to its nearest selected example.
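The coverage objective described here (minimize the maximum distance from any candidate to its nearest selected example) is the k-center criterion, which greedy farthest-first traversal approximates well. A sketch with toy 2-D points standing in for embeddings:

```python
def farthest_first(points, k):
    """Greedy coverage selection: start anywhere, then repeatedly pick the
    point farthest from the current selection (k-center approximation)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    selected = [0]                         # deterministic starting point
    while len(selected) < k:
        best_i, best_d = None, -1.0
        for i, p in enumerate(points):
            if i in selected:
                continue
            d = min(dist(p, points[j]) for j in selected)
            if d > best_d:
                best_i, best_d = i, d
        selected.append(best_i)
    return selected

# Two tight clusters plus an outlier: coverage picks one point per region.
pts = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
print(farthest_first(pts, 3))  # [0, 4, 2]
```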
Dynamic vs static selection is a key architectural decision. Static selection precomputes a fixed example set for each task class and uses the same set for every query. This is simpler, deterministic, and cacheable, but may not perform well on queries that are far from the centroid of the task class. Dynamic selection computes a fresh example set for each incoming query, which can improve quality for diverse query distributions but adds latency (embedding computation + selection algorithm) and non-determinism. A hybrid approach uses a static base set plus 1-2 dynamically selected examples tailored to the specific query.
Token-aware selection integrates token budget constraints directly into the selection algorithm. Rather than selecting K examples and hoping they fit, the algorithm maintains a running token count and only considers candidates that fit within the remaining budget. This changes the optimization from “select the best K” to “select the best set that fits within B tokens,” which is a variant of the knapsack problem. Greedy selection with budget checking works well in practice: sort candidates by quality/token ratio, then add them in order until the budget is exhausted.
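The greedy budget-aware heuristic above is only a few lines: rank by quality per token, then add candidates while they fit. The candidate field names are assumptions for illustration:

```python
def greedy_budget_select(candidates, budget_tokens):
    """Knapsack-style greedy: best quality-per-token first, skip what
    does not fit, finish when the list is exhausted."""
    ranked = sorted(candidates,
                    key=lambda c: c["quality"] / c["tokens"],
                    reverse=True)
    selected, used = [], 0
    for c in ranked:
        if used + c["tokens"] <= budget_tokens:
            selected.append(c)
            used += c["tokens"]
    return selected, used

cands = [
    {"id": "a", "quality": 0.90, "tokens": 300},
    {"id": "b", "quality": 0.80, "tokens": 150},
    {"id": "c", "quality": 0.95, "tokens": 600},
    {"id": "d", "quality": 0.50, "tokens": 100},
]
sel, used = greedy_budget_select(cands, 600)
print([c["id"] for c in sel], used)  # ['b', 'd', 'a'] 550
```

Note that the highest-quality example (`c`) is skipped: per token it teaches less than three cheaper examples combined, which is exactly the point of budget-aware selection.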
Quality scoring for candidates determines the ranking input to the selection algorithm. A candidate’s quality score should combine multiple signals: embedding similarity to the query (for dynamic selection), human quality rating (if available), automated quality metrics (output completeness, format compliance), freshness (time since the example was added or validated), and production feedback (whether this example has been associated with good or bad outcomes when used in past prompts). Combining these signals into a single score requires a weighting scheme that can be tuned per task class.
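One way to combine these signals is a weighted sum with each signal normalized to [0, 1]. The weights and the exponential freshness half-life below are illustrative assumptions to tune per task class:

```python
def composite_score(similarity, human_rating, days_since_added,
                    w_sim=0.5, w_rating=0.3, w_fresh=0.2, half_life=90):
    """Weighted blend of normalized signals: cosine similarity (0-1),
    human rating (1-5 scale), and exponential freshness decay."""
    rating_norm = (human_rating - 1) / 4                # map 1-5 onto 0-1
    freshness = 0.5 ** (days_since_added / half_life)   # 1.0 when brand new
    return w_sim * similarity + w_rating * rating_norm + w_fresh * freshness

score = composite_score(similarity=0.82, human_rating=4, days_since_added=90)
```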
Near-duplicate detection is a preprocessing step that must happen before selection. If the bank contains near-duplicates (examples that differ only in trivial wording), the selection algorithm may waste slots on redundant demonstrations. MinHash or SimHash fingerprinting with a similarity threshold (typically 0.85-0.92) identifies clusters of near-duplicates, from which only the highest-quality representative should be retained.
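MinHash and SimHash are fast approximations of set similarity over shingles; for banks of a few thousand records, the exact (non-hashed) version of the same idea is simple enough to show directly. A sketch using character shingles and Jaccard similarity, with the 0.92 threshold from above:

```python
def shingles(text, n=3):
    """Character n-gram set used as the similarity fingerprint."""
    t = text.lower()
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def dedup(examples, threshold=0.92):
    """Keep an example only if it is not a near-clone of one already kept.
    Assumes examples arrive sorted best-quality first, so the retained
    representative of each cluster is the highest-quality one."""
    kept, fps = [], []
    for ex in examples:
        fp = shingles(ex)
        if all(jaccard(fp, k) < threshold for k in fps):
            kept.append(ex)
            fps.append(fp)
    return kept

docs = ["my order is late", "my order is late!", "refund for damaged item"]
print(dedup(docs))  # drops the trivially reworded clone
```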
How this fits into the project This concept is the algorithmic heart of Project 5. The selector component implements one or more of these strategies, and the evaluation framework measures which strategy produces the best lift for each task class.
Definitions & key terms
- Maximal Marginal Relevance (MMR): A selection algorithm that balances relevance to the query with diversity among selected items, controlled by a lambda parameter.
- Coverage: The degree to which the selected example set represents the full space of possible inputs.
- k-medoids: A clustering algorithm that selects K representative items minimizing within-cluster dissimilarity.
- Knapsack constraint: A budget limit (tokens) that restricts which combinations of items can be selected.
- SimHash / MinHash: Locality-sensitive hashing techniques for near-duplicate detection.
- Quality/token ratio: A metric combining example quality with token efficiency for budget-aware selection.
Mental model diagram (ASCII)
EXAMPLE SELECTION PIPELINE
==========================
+------------------+
| Example Bank | +------------------+
| (842 records) | | Incoming Query |
+--------+---------+ +--------+---------+
| |
v v
+------------------+ +------------------+
| PREPROCESS | | EMBED QUERY |
| - Policy filter | | (vector repr.) |
| - Freshness cut | +--------+---------+
| - Dedup (MinHash)| |
+--------+---------+ |
| |
v |
+------------------+ |
| SCORE CANDIDATES |<-------------+
| - Embed examples |
| - Cosine sim |
| - Quality score |
| - Freshness wt |
+--------+---------+
|
v
+-------------------------------+
| SELECT (Strategy Choice) |
| |
| Option A: Top-K by relevance |
| [simple, fast, homogeneous] |
| |
| Option B: MMR (lambda=0.6) |
| [balanced, recommended] |
| |
| Option C: Coverage (k-medoids)|
| [static banks, broad tasks] |
| |
| Option D: Stratified |
| [difficulty-diverse sets] |
+-------------------------------+
|
v
+------------------+
| BUDGET CHECK |
| Tokens used: 1450|
| Budget: 1800 |
| Remaining: 350 |
| -> fits? YES |
+--------+---------+
|
v
+------------------+
| OUTPUT: Selected |
| examples [k=6] |
| + manifest JSON |
+------------------+
How it works (step-by-step, with invariants and failure modes)
- Preprocess the bank: apply policy filters (remove sensitive/blocked examples), freshness cutoffs (remove stale examples), and near-duplicate detection. Invariant: no policy-blocked example survives preprocessing. Failure mode: policy labels are missing or outdated, allowing blocked examples through.
- Embed the incoming query and all candidate examples using the same embedding model. Invariant: the embedding model version is recorded in the selection manifest. Failure mode: embedding model mismatch between index time and query time produces meaningless similarity scores.
- Score each candidate using the composite quality metric (similarity + quality rating + freshness). Invariant: scores are deterministic given fixed inputs and model version. Failure mode: if the embedding service is unavailable, the pipeline must fall back to keyword-based scoring or abort with a clear error.
- Apply the selection strategy (MMR, coverage, stratified, or top-K). Invariant: the selected set size equals K or the maximum achievable given budget constraints. Failure mode: if all candidates are policy-blocked or stale, the selector returns fewer than K examples with a warning.
- Verify token budget compliance. Invariant: total token count of selected examples does not exceed the budget. Failure mode: a candidate’s actual token count differs from the precomputed estimate (due to tokenizer version differences), pushing the total over budget.
- Emit the selection manifest with selected IDs, scores, strategy used, budget consumed, and rationale. Invariant: the manifest is a complete audit record. Failure mode: missing fields make the selection non-reproducible.
Minimal concrete example
MMR Selection pseudocode:
FUNCTION mmr_select(candidates, query_embedding, k, lambda, budget):
selected = []
remaining = candidates
tokens_used = 0
WHILE len(selected) < k AND remaining is not empty:
best_score = -infinity
best_candidate = null
FOR each candidate IN remaining:
IF tokens_used + candidate.token_count > budget:
CONTINUE
relevance = cosine_sim(candidate.embedding, query_embedding)
max_sim_to_selected = MAX(
cosine_sim(candidate.embedding, s.embedding)
FOR s IN selected
) IF selected is not empty ELSE 0
mmr_score = lambda * relevance - (1 - lambda) * max_sim_to_selected
IF mmr_score > best_score:
best_score = mmr_score
best_candidate = candidate
IF best_candidate is null:
BREAK -- no candidate fits budget
selected.append(best_candidate)
remaining.remove(best_candidate)
tokens_used += best_candidate.token_count
RETURN selected, tokens_used
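The pseudocode above translates to Python almost line for line. In this sketch, toy 2-D vectors stand in for real embeddings, and the candidate field names mirror the pseudocode:

```python
import math

def cos(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def mmr_select(candidates, query_emb, k, lam=0.6, budget=10_000):
    """MMR with a token budget: relevance minus redundancy at each step."""
    selected, used = [], 0
    remaining = list(candidates)
    while len(selected) < k and remaining:
        best, best_score = None, -math.inf
        for c in remaining:
            if used + c["tokens"] > budget:
                continue                       # candidate does not fit
            rel = cos(c["emb"], query_emb)
            red = max((cos(c["emb"], s["emb"]) for s in selected), default=0.0)
            score = lam * rel - (1 - lam) * red
            if score > best_score:
                best, best_score = c, score
        if best is None:
            break                              # nothing left fits the budget
        selected.append(best)
        remaining.remove(best)
        used += best["tokens"]
    return selected, used

cands = [
    {"id": "a", "emb": (1.0, 0.0), "tokens": 300},
    {"id": "b", "emb": (1.0, 0.2), "tokens": 300},   # most query-relevant
    {"id": "c", "emb": (0.0, 1.0), "tokens": 300},   # dissimilar to a and b
]
sel, used = mmr_select(cands, query_emb=(1.0, 1.0), k=2)
print([c["id"] for c in sel])  # ['b', 'c']: a is redundant with b
```

With `lam=0.6`, the redundancy penalty keeps the near-duplicate `a` out of the set even though it is more relevant than `c`.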
Common misconceptions
- “Cosine similarity is all you need for selection.” Similarity alone produces homogeneous sets that waste demonstration slots on redundant information. Diversity constraints are essential for robust performance.
- “Static example sets work fine for all queries.” Queries far from the centroid of the task class get poor demonstrations. Dynamic or hybrid selection adapts to query variation.
- “Token budget does not matter with large context windows.” Even with 128K-token windows, demonstrations compete with input context, retrieval results, and output space. Budget discipline remains important.
- “Near-duplicate detection is optional.” Without deduplication, the selection algorithm often picks 3-4 examples that are paraphrases of each other, wasting most of the demonstration budget.
Check-your-understanding questions
- How does the lambda parameter in MMR affect the selected example set as it moves from 0.0 to 1.0?
- When would coverage-based selection (k-medoids) outperform similarity-based selection?
- Why should near-duplicate detection happen before scoring, not after?
- How does dynamic selection add latency compared to static selection, and when is this tradeoff acceptable?
Check-your-understanding answers
- At lambda=0.0, MMR selects purely for diversity (maximum dissimilarity among selected examples, ignoring query relevance). At lambda=1.0, it reduces to standard top-K by relevance. Values between 0.5 and 0.7 balance both objectives. As lambda increases, the set becomes more relevant but more homogeneous.
- Coverage outperforms similarity when the example set will be reused across many diverse queries (static bank), when the task class has high input variability (e.g., support tickets ranging from billing to technical issues), or when the goal is robustness across edge cases rather than peak performance on typical queries.
- Near-duplicates inflate the candidate pool with redundant options. If scoring happens first, the top-K may all be near-duplicates of each other. Deduplicating first ensures the scoring pool contains genuinely distinct candidates.
- Dynamic selection requires embedding the query at request time and running the selection algorithm, adding 50-200ms of latency depending on the embedding model and bank size. This is acceptable for non-real-time tasks (batch processing, async APIs) but may be too slow for interactive chatbots. The hybrid approach (static base + 1-2 dynamic additions) mitigates this.
Real-world applications
- Retrieval-augmented generation (RAG) systems use similar selection strategies to choose which retrieved documents to include in the prompt context.
- Recommendation systems use MMR-style diversity constraints to avoid showing users repetitive suggestions.
- Active learning frameworks use coverage and difficulty-stratified selection to choose the most informative examples for human labeling.
Where you’ll apply it
- Phase 2: implementing the selector component with at least MMR and one other strategy.
- Phase 3: comparing selection strategies on holdout evaluation to determine the best approach per task class.
References
- Carbonell & Goldstein (1998), “The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries” - original MMR paper for diversity-aware retrieval
- “Mining of Massive Datasets” by Leskovec, Rajaraman, Ullman - similarity, clustering, and coverage
- “AI Engineering” by Chip Huyen - retrieval and selection for LLM applications
- “Unified Few-Shot Classification” by Triantafillou et al. - systematic comparison of selection strategies
Key insights The best example selection strategy matches the deployment pattern: use similarity for per-query dynamic selection, coverage for static shared banks, and MMR when you need both relevance and diversity in a single set.
Summary Example selection strategies range from simple top-K similarity to sophisticated multi-objective algorithms like MMR, coverage-based k-medoids, and difficulty-stratified sampling. Production systems need budget-aware selection that integrates token constraints directly into the algorithm, near-duplicate detection to avoid redundancy, and composite quality scores that combine similarity, freshness, and human ratings. The choice of strategy depends on whether selection is per-query (dynamic) or per-task-class (static) and whether the priority is relevance, diversity, or robustness.
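A composite quality score combining the three signals named in the summary might look like the following sketch. The weights and the 180-day decay constant are illustrative defaults meant to be tuned per task class, not fixed values from this project.

```python
import math

def composite_score(similarity, rating, age_days,
                    w_sim=0.5, w_rating=0.25, w_fresh=0.25):
    """Blend cosine similarity (0-1), human rating (1-5), and freshness into one 0-1 score."""
    # Smooth exponential decay: roughly 37% of the freshness weight remains after 180 days.
    freshness = math.exp(-age_days / 180.0)
    return w_sim * similarity + w_rating * (rating / 5.0) + w_fresh * freshness
```

Because each component is normalized to [0, 1] and the weights sum to 1, the composite also stays in [0, 1], which makes scores comparable across task classes.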
Homework/Exercises to practice the concept
- Implement MMR selection in pseudocode with lambda=0.6 and a token budget constraint. Trace through the algorithm with 8 candidates and K=4, showing which candidate is selected at each step and why.
- Design a composite quality score that combines cosine similarity (0-1), human quality rating (1-5), and freshness (days since added). Define the weighting scheme and justify your choices.
- Compare the expected behavior of top-K similarity vs MMR (lambda=0.5) when the top 5 candidates by similarity are all refund-related examples but the task class includes billing, shipping, and refund queries.
Solutions to the homework/exercises
- The MMR trace should show that the first selection is the highest-relevance candidate. Subsequent selections balance relevance against dissimilarity to already-selected examples. If candidates 1 and 2 are near-duplicates, MMR skips candidate 2 in favor of a slightly less relevant but more diverse candidate 3. The token budget check should skip any candidate whose addition would exceed the limit.
- A reasonable weighting: 0.5 * cosine_similarity + 0.25 * (quality_rating / 5.0) + 0.25 * freshness_decay(days). Freshness decay could be exponential: exp(-days/180), which falls to ~37% after six months (half-life ≈ 125 days). Justification: relevance to the query is the primary signal; quality and freshness serve as tiebreakers. The weights should be tunable per task class because some tasks prioritize freshness (rapidly changing products) while others prioritize quality (legal/compliance).
- Top-K similarity would select 5 refund examples, leaving billing and shipping uncovered. MMR would select 2-3 refund examples (the most relevant), then diversity pressure would force selection of billing and shipping examples, providing broader coverage. On a holdout set with mixed query types, MMR would significantly outperform top-K.
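The MMR trace in the first solution can be turned into a minimal runnable sketch. This is one possible greedy implementation, assuming candidates carry a precomputed embedding (`vec`), a token count, and an id (all illustrative field names).

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(query_vec, candidates, k, lam=0.6, budget=1200):
    """Greedy MMR: lam * relevance - (1 - lam) * max similarity to already-selected,
    with a token budget check before each addition."""
    selected, used = [], 0
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            rel = cosine(query_vec, c["vec"])
            redundancy = max((cosine(c["vec"], s["vec"]) for s in selected), default=0.0)
            return lam * rel - (1 - lam) * redundancy
        best = max(pool, key=score)
        pool.remove(best)
        if used + best["tokens"] > budget:
            continue  # skip any candidate whose addition would exceed the limit
        selected.append(best)
        used += best["tokens"]
    return selected
```

As in the trace, the first pick is the highest-relevance candidate (no redundancy term yet); later picks trade relevance against similarity to what is already selected, and over-budget candidates are skipped rather than aborting the selection.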
Example Ordering and Positional Bias
Fundamentals The order in which few-shot demonstrations appear in the prompt significantly affects model behavior due to positional biases inherent in transformer architectures. Research has identified two primary biases: primacy bias (stronger attention to examples at the beginning of the sequence) and recency bias (stronger attention to examples at the end). The “Lost in the Middle” phenomenon shows that information placed in the middle of the context receives the least attention. For few-shot curation, this means the ordering strategy for selected demonstrations is not cosmetic; it directly impacts output quality, and ignoring it leaves measurable quality gains on the table.
Deep Dive into the concept Transformers learn positional attention patterns during pretraining. Because pretraining corpora naturally place the most relevant information at the beginning and end of documents (titles, conclusions, topic sentences), models develop a learned bias to attend more to these positions. For few-shot prompting, this creates a U-shaped attention curve: the first and last demonstrations receive the most attention, while middle demonstrations receive the least.
The practical implications are significant. If you place your highest-quality, most representative demonstration first and your second-best demonstration last, the model will attend to both strongly. If you place your best demonstration in position 3 of 6, it may receive less attention than a mediocre demonstration in position 1.
The strength of primacy and recency varies with context utilization. When the demonstrations occupy a small fraction of the context window (less than 50%), primacy bias is strongest. As the demonstrations fill more of the context window, recency bias becomes dominant. This means the optimal ordering strategy depends on how much of the context window the demonstrations consume.
Recency bias also interacts with label distributions. If the last few demonstrations all have the same label (e.g., all classified as “refund”), the model becomes biased toward that label for the actual query. This is called majority label bias from recency. The fix is to ensure the last 2-3 demonstrations have diverse labels.
Several ordering strategies mitigate these biases. Random ordering averages out positional effects over many queries but does not optimize for any individual query. Relevance-ordered (most relevant last) leverages recency bias to ensure the most useful example gets the strongest attention. Difficulty-ordered (easy first, hard last) provides a curriculum-like progression. Alternating-label ordering prevents majority label bias from any position. The best strategy depends on the task and the model.
How this fits into the project This concept is applied in Phase 2 of the selector component. After selecting the K best examples, the ordering step arranges them to maximize expected quality. The evaluation framework (Phase 3) should measure the impact of different ordering strategies on holdout performance.
Definitions & key terms
- Primacy bias: The tendency for models to attend more strongly to tokens/examples at the beginning of the context.
- Recency bias: The tendency for models to attend more strongly to tokens/examples at the end of the context.
- Lost in the Middle: The phenomenon where information in the middle of the context receives the least model attention.
- Majority label bias: When repeated labels in recent positions bias the model toward that label for the next prediction.
- Context utilization: The fraction of the context window consumed by the prompt content.
Mental model diagram (ASCII)
ATTENTION DISTRIBUTION ACROSS DEMONSTRATION POSITIONS
=====================================================
Attention
Strength
^
| * *
| * * * *
| * * * *
| * * * *
| * * * *
| * * * * * * *
| * * * * * * * * *
| * * * * * * *
+--+---+---+---+---+---+---+---+---+---+---+--> Position
D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 Query
|<--- primacy --->|<- lost in ->|<- recency ->|
| bias zone | the middle | bias zone |
ORDERING STRATEGIES:
====================
Strategy A: Relevance-ordered (most relevant last)
[low-rel] [low-rel] [med-rel] [med-rel] [high-rel] [QUERY]
^-- recency boost
Strategy B: Anchor + diverse
[BEST] [diverse] [diverse] [diverse] [2nd BEST] [QUERY]
^-- primacy ^-- recency
Strategy C: Alternating labels
[refund] [billing] [shipping] [refund] [billing] [QUERY]
-- prevents majority label bias from any position
How it works (step-by-step, with invariants and failure modes)
- After selection, compute ordering scores based on the chosen strategy. Invariant: the set of selected examples is fixed; only the order changes. Failure mode: the ordering step accidentally adds or removes examples.
- For relevance-ordered strategy: sort by ascending relevance so the most relevant example is last (recency position). Invariant: the last example has the highest relevance score. Failure mode: ties in relevance score make the ordering non-deterministic without a tiebreaker.
- For anchor strategy: place the best example first and the second-best last, fill the middle with diverse examples. Invariant: positions 1 and K are the highest-quality examples. Failure mode: if K=2, both positions are “anchor” and there is no middle to fill.
- Check label distribution in the last 2-3 positions. Invariant: no single label dominates the final positions. Failure mode: if the task has an imbalanced label distribution and only 3 examples are selected, it may be impossible to avoid label repetition.
- Record the ordering strategy and final sequence in the selection manifest. Invariant: the manifest records enough information to reproduce the exact ordering.
Minimal concrete example
Selected examples (before ordering):
ex_042: relevance=0.91, label=refund, difficulty=medium
ex_107: relevance=0.88, label=billing, difficulty=easy
ex_215: relevance=0.85, label=shipping, difficulty=hard
ex_331: relevance=0.82, label=refund, difficulty=easy
ex_099: relevance=0.79, label=billing, difficulty=medium
After anchor ordering:
Position 1: ex_042 (highest relevance -> primacy)
Position 2: ex_331 (refund, easy -> fills middle)
Position 3: ex_215 (shipping, hard -> fills middle)
Position 4: ex_099 (billing, medium -> fills middle)
Position 5: ex_107 (2nd highest relevance -> recency)
Label check for last 2 positions: [billing, billing]
-> SWAP: move ex_215 to position 4, ex_099 to position 3
-> Final last 2: [billing, shipping] -- diverse, OK
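The anchor placement plus label-swap repair shown above can be sketched as follows. Field names are illustrative, and for simplicity the middle is filled in plain relevance order rather than by a diversity score.

```python
def anchor_order(selected):
    """Best example first (primacy), second-best last (recency), rest in the middle;
    then repair the last two positions so their labels differ."""
    ranked = sorted(selected, key=lambda e: e["relevance"], reverse=True)
    best, second = ranked[0], ranked[1]
    ordered = [best] + ranked[2:] + [second]
    # Label check: if the final two positions share a label, swap the
    # second-to-last example with a differently-labeled one from the middle.
    if len(ordered) >= 3 and ordered[-1]["label"] == ordered[-2]["label"]:
        for i in range(len(ordered) - 3, 0, -1):  # search middle from the back
            if ordered[i]["label"] != ordered[-1]["label"]:
                ordered[-2], ordered[i] = ordered[i], ordered[-2]
                break
    return ordered
```

Note the K=2 failure mode from the invariants above: with only two examples there is no middle, and the label check is skipped because no repair is possible.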
Common misconceptions
- “Example ordering is cosmetic and does not affect output.” Research consistently shows ordering can change accuracy by 5-15% on classification tasks.
- “Random ordering is safest.” Random ordering averages out bias over many runs but does not optimize any individual query. Deterministic ordering with a clear strategy is both reproducible and optimizable.
- “Recency bias is always strongest.” It depends on context utilization. For prompts that use less than 50% of the context window, primacy bias dominates.
- “You can fix ordering bias with better instructions.” Instructions do not override the attention patterns learned during pretraining. Ordering is a separate dimension that must be managed independently.
Check-your-understanding questions
- Why does the “Lost in the Middle” effect matter more for 8-shot than 3-shot prompting?
- How would you design an experiment to measure ordering effects for a specific task?
- When should you place the most relevant example first vs last?
Check-your-understanding answers
- With 3 examples, there is effectively no “middle” (positions 1, 2, 3 are all near the edges). With 8 examples, positions 3-6 are in the attention trough, which means up to half the demonstrations receive reduced attention. The wasted attention budget grows with the number of examples.
- Select a fixed set of K examples, then evaluate performance with K! permutations (or a representative sample of permutations) on a holdout set. Compare per-permutation accuracy to identify which orderings maximize performance. Control for randomness by averaging over multiple seeds.
- Place the most relevant example last (recency position) when context utilization is high (>50% of window) or when the task is label-sensitive (to leverage recency for the correct label). Place it first (primacy position) when context utilization is low or when the task benefits from establishing the pattern early.
Real-world applications
- Search engines rerank retrieved documents to exploit position effects in user attention (analogous to LLM attention patterns).
- Educational curriculum design places foundational concepts first and summative examples last, mirroring the anchor ordering strategy.
- A/B testing frameworks for prompt engineering systematically test ordering strategies as an optimization dimension.
Where you’ll apply it
- Phase 2: the ordering step after selection in the curator pipeline.
- Phase 3: evaluating ordering strategies on holdout data as part of the quality optimization process.
References
- “Lost in the Middle” by Liu et al. (2023) - foundational research on positional bias in long contexts
- “Attention Sorting Combats Recency Bias in Long Context Language Models” (2023)
- “Positional Biases Shift as Inputs Approach Context Window Limits” (2025)
- “Large Language Models are Zero-Shot Rankers” by Hou et al. - position bias in ranking tasks
Key insights Example ordering is a free optimization lever: it costs zero additional tokens but can improve quality by 5-15% when aligned with the model’s positional attention patterns.
Summary Transformer models exhibit primacy and recency bias in their attention patterns, creating a U-shaped attention curve where the first and last demonstrations receive the most attention. The “Lost in the Middle” effect means middle demonstrations are partially ignored. Ordering strategies (relevance-ordered, anchor, alternating-label) exploit these patterns to maximize the impact of each demonstration. The optimal strategy depends on context utilization, task type, and label distribution.
Homework/Exercises to practice the concept
- Given 6 selected examples with relevance scores and labels, apply the anchor ordering strategy and verify that the last 2 positions have diverse labels.
- Design an evaluation protocol that measures the impact of 3 different ordering strategies on a classification task with a 50-example holdout set.
Solutions to the homework/exercises
- The anchor ordering should place the highest-relevance example at position 1, the second-highest at position 6, and fill positions 2-5 with remaining examples ordered by diversity. After placement, check the labels at positions 5-6: if they match, swap position 5 with the example from positions 2-4 that has a different label.
- Protocol: (1) Fix the selected example set (same 6 examples for all conditions). (2) Apply each ordering strategy to produce 3 different orderings. (3) Run each ordering against all 50 holdout queries. (4) Compute accuracy, F1, and average confidence for each strategy. (5) Run a paired statistical test (e.g., McNemar’s test) to determine if differences are significant. (6) Report mean + 95% confidence interval for each metric per strategy.
Quality Metrics and Drift Detection for Example Banks
Fundamentals An example bank is a living dataset that degrades over time. Products change, customer language evolves, new categories emerge, and previously correct outputs become wrong. Quality metrics measure the current health of the example bank, while drift detection alerts you when the bank has diverged from the actual data distribution in production. Without these mechanisms, a once-effective few-shot pipeline silently degrades until someone notices that output quality has dropped, by which time the damage is already done. Quality metrics and drift detection transform the example bank from a static resource into a monitored, maintainable asset.
Deep Dive into the concept Quality metrics operate at two levels: per-example metrics and bank-level metrics.
Per-example metrics assess individual demonstration quality. Key metrics include: correctness (does the output match current ground truth?), completeness (does the output include all required fields?), format compliance (does the output match the expected schema?), relevance decay (how old is the example relative to the current product/domain state?), and usage feedback (when this example was used in production, did it correlate with good outcomes?). Each metric produces a score that feeds into the composite quality score used by the selection algorithm.
Bank-level metrics assess the health of the entire bank. Coverage measures whether the bank has examples for all task subcategories. Diversity measures the spread of examples across the input space. Redundancy measures how many near-duplicate clusters exist. Staleness measures what fraction of examples are older than a freshness threshold. Label balance measures whether the distribution of output labels in the bank matches the expected production distribution.
Drift detection compares the current example bank against a reference distribution. The reference might be the production query distribution (what queries are users actually sending?), the production label distribution (what categories are actually occurring?), or a golden evaluation set (hand-curated examples that define ground truth). When the bank distribution diverges from the reference beyond a threshold, the system triggers an alert.
Statistical drift detection methods include: population stability index (PSI) for comparing discrete distributions, Kullback-Leibler divergence (KL) for measuring information loss between distributions, embedding centroid drift for detecting when the semantic center of the bank has shifted, and time-windowed quality score regression for detecting gradual degradation trends. PSI is the most practical for production use because it has well-established thresholds (PSI < 0.1 = no drift, 0.1-0.25 = moderate drift, > 0.25 = significant drift) and works directly on categorical distributions.
Drift can be caused by several factors: the underlying domain changed (new product categories, regulatory changes), the production query distribution shifted (seasonal patterns, marketing campaigns), the example bank was not refreshed after domain changes, or the evaluation criteria changed (new quality standards, updated schemas). Each cause requires a different remediation: domain changes require adding new examples and retiring outdated ones; query distribution shifts require rebalancing the bank; evaluation criteria changes require re-scoring all examples against updated criteria.
Automated drift alerts should include: the metric that triggered the alert, the current value vs the threshold, the time window over which drift was measured, the affected task classes, and a recommended action (refresh bank, review stale examples, add coverage for new categories). Alerts should be routed to the team responsible for the example bank, with escalation if unacknowledged within a time window.
How this fits into the project This concept drives Phase 3 of Project 5. The lift evaluator measures per-example and bank-level metrics, while the drift detector compares current bank distributions against reference distributions and triggers alerts.
Definitions & key terms
- Population Stability Index (PSI): A metric comparing two discrete distributions to detect drift. Values below 0.1 indicate stability; above 0.25 indicates significant drift.
- Coverage: The fraction of task subcategories that have at least one representative example in the bank.
- Redundancy: The number of near-duplicate clusters in the bank, measured by SimHash or embedding similarity.
- Staleness: The fraction of examples older than a configurable freshness threshold.
- Lift: The quality improvement on a holdout set when using the curated few-shot examples compared to a baseline (zero-shot or random examples).
- Freshness threshold: The maximum age of an example before it is considered potentially stale.
Mental model diagram (ASCII)
QUALITY METRICS AND DRIFT DETECTION PIPELINE
=============================================
+--------------------------------------------------+
| EXAMPLE BANK (v3.2) |
| 842 examples across 12 task classes |
+--------------------------------------------------+
| | |
v v v
+-----------------+ +-----------------+ +------------------+
| PER-EXAMPLE | | BANK-LEVEL | | DRIFT DETECTION |
| METRICS | | METRICS | | |
| | | | | |
| - correctness | | - coverage: 92% | | Reference: |
| - completeness | | - diversity: 0.8| | prod query dist |
| - format OK | | - redundancy: 39| | (last 30 days) |
| - freshness | | - staleness: 12%| | |
| - usage score | | - label balance | | Current bank: |
+---------+-------+ +--------+--------+ | category dist |
| | +--------+---------+
v v |
+-----------------+ +-----------------+ v
| COMPOSITE | | HEALTH REPORT | +-----------------+
| QUALITY SCORE | | | | PSI per class: |
| per example | | PASS: coverage | | refund: 0.04 |
| (feeds selector)| | WARN: staleness | | billing: 0.08 |
| | | PASS: diversity | | shipping: 0.31 |
+-----------------+ | FAIL: redundancy| | ^^^^^^^^ |
+-----------------+ | ALERT: shipping |
| class drifted |
+-----------------+
|
v
+-----------------+
| DRIFT ALERT |
| metric: PSI |
| value: 0.31 |
| threshold: 0.25 |
| class: shipping |
| action: refresh |
+-----------------+
How it works (step-by-step, with invariants and failure modes)
- Compute per-example metrics for all examples in the bank. Invariant: every example has a complete metrics record. Failure mode: if the evaluation service is unavailable, use cached metrics with a “stale_metrics” flag.
- Aggregate bank-level metrics from per-example data. Invariant: bank-level metrics are computed from the current, complete bank (not a sample). Failure mode: if the bank is very large, use stratified sampling with confidence intervals.
- Load the reference distribution (production query logs, golden set, or previous bank version). Invariant: the reference is timestamped and versioned. Failure mode: if no reference exists (first run), skip drift detection and establish the current distribution as the baseline.
- Compute PSI for each task class by comparing bank label distribution to reference label distribution. Invariant: PSI is computed per-class, not globally, to avoid masking local drift. Failure mode: if a task class has very few examples (n < 10), PSI becomes unreliable; flag as “insufficient data.”
- Evaluate thresholds and emit alerts for classes exceeding the drift threshold. Invariant: alerts include all required fields (metric, value, threshold, class, recommended action). Failure mode: alert fatigue from overly sensitive thresholds; calibrate thresholds using historical data.
- Write the health report and drift assessment to the output directory. Invariant: the report is JSON-formatted and machine-readable for downstream automation.
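The per-class steps above can be sketched end to end. The data shapes, status labels, and inline PSI helper are illustrative assumptions, not the project's fixed schema.

```python
import math

def _psi(observed, reference, eps=1e-6):
    # PSI between two {label: probability} dicts, with a floor on empty bins.
    cats = set(observed) | set(reference)
    return sum((max(observed.get(c, 0.0), eps) - max(reference.get(c, 0.0), eps))
               * math.log(max(observed.get(c, 0.0), eps) / max(reference.get(c, 0.0), eps))
               for c in cats)

def assess_drift(bank_counts, ref_dists, threshold=0.25, min_n=10):
    """Per-class drift check.
    bank_counts: {task_class: {label: count}}; ref_dists: {task_class: {label: prob}}."""
    report = []
    for cls, counts in sorted(bank_counts.items()):
        n = sum(counts.values())
        if n < min_n:
            # PSI is unreliable on tiny samples: flag instead of alerting.
            report.append({"class": cls, "psi": None, "status": "INSUFFICIENT_DATA"})
            continue
        dist = {label: c / n for label, c in counts.items()}
        value = _psi(dist, ref_dists[cls])
        status = "DRIFTED" if value > threshold else ("MODERATE" if value > 0.1 else "STABLE")
        entry = {"class": cls, "psi": round(value, 3), "status": status}
        if status == "DRIFTED":
            # Alert carries metric, value, threshold, and a recommended action.
            entry["alert"] = {"metric": "PSI", "value": round(value, 3),
                             "threshold": threshold, "action": "refresh examples for this class"}
        report.append(entry)
    return report
```

Computing PSI per class (rather than globally) preserves the invariant above: a severely drifted minority class cannot hide behind stable majority classes.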
Minimal concrete example
Health report (JSON):
{
"bank_version": "3.2",
"timestamp": "2025-11-15T10:30:00Z",
"total_examples": 842,
"bank_metrics": {
"coverage": 0.92,
"diversity_score": 0.81,
"redundancy_clusters": 39,
"staleness_fraction": 0.12,
"label_balance_chi2_p": 0.34
},
"drift_assessment": [
{"class": "refund", "psi": 0.04, "status": "STABLE"},
{"class": "billing", "psi": 0.08, "status": "STABLE"},
{"class": "shipping", "psi": 0.31, "status": "DRIFTED",
"alert": "Refresh shipping examples: new carrier categories added"}
],
"recommended_actions": [
"Remove 39 redundant examples (dedup clusters)",
"Add examples for 'express_delivery' subcategory (new)",
"Re-score 101 examples older than 180 days"
]
}
Common misconceptions
- “If the bank was good when created, it stays good.” Example banks degrade as the domain evolves. Without monitoring, you only discover degradation through user complaints.
- “Global drift metrics are sufficient.” A global PSI might be 0.05 (stable) while one task class has PSI 0.35 (severely drifted). Always compute drift per task class.
- “Staleness means the example is wrong.” An old example might still be perfectly correct. Staleness means the example has not been validated against current ground truth and deserves a review, not automatic removal.
- “More examples fix coverage gaps.” If a new task subcategory emerged, existing examples will not cover it regardless of bank size. Coverage monitoring identifies the gap; targeted example creation fills it.
Check-your-understanding questions
- Why should drift detection use PSI per task class rather than a single global metric?
- What is the difference between staleness and incorrectness in an example bank?
- How would you set the freshness threshold for an example bank in a rapidly changing domain vs a stable domain?
- What information must a drift alert contain to be actionable?
Check-your-understanding answers
- Global PSI averages across classes, which can mask significant drift in a minority class. If 90% of examples are in stable classes, a severely drifted 10% class would barely move the global PSI. Per-class computation ensures every class is monitored independently.
- Staleness means the example has not been validated recently and might be wrong. Incorrectness means the example has been validated and is definitely wrong. Staleness is a risk signal that triggers review; incorrectness is a certainty that triggers removal.
- In a rapidly changing domain (e.g., social media trends, breaking news), set freshness threshold to 30-90 days. In a stable domain (e.g., legal terminology, physical constants), set it to 6-12 months. The threshold should be calibrated by measuring how quickly examples become incorrect in practice.
- An actionable drift alert must include: which metric triggered it, the current value vs threshold, the affected task class(es), the time window, and a specific recommended action (refresh, add, review, or remove).
Real-world applications
- Financial institutions monitor data drift in fraud detection models and retrain when distributions shift, using the same PSI-based approach applicable to example banks.
- E-commerce platforms detect category drift in product catalogs and refresh classification training data accordingly.
- Healthcare NLP systems monitor for concept drift as medical terminology and coding standards evolve (ICD code updates).
- Content moderation systems detect drift in violation patterns as new types of harmful content emerge.
Where you’ll apply it
- Phase 3: implementing the health report and drift detection system.
- Operational hardening: setting up scheduled drift checks and automated alerting.
References
- “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 11 on data pipelines and evolution
- “AI Engineering” by Chip Huyen - Chapters on data distribution monitoring
- “Mining of Massive Datasets” by Leskovec et al. - similarity and clustering techniques for coverage analysis
- Population Stability Index documentation in credit risk modeling literature
Key insights An example bank without drift detection is a time bomb: it works today but silently degrades until someone notices the damage, by which time the remediation cost has compounded.
Summary Quality metrics operate at per-example (correctness, completeness, freshness) and bank-level (coverage, diversity, redundancy, staleness) granularity. Drift detection compares the current bank distribution against a reference distribution (production queries, golden set) using statistical methods like PSI. Alerts must be per-task-class, actionable, and routed to the responsible team. Monitoring transforms the example bank from a static artifact into a living, maintainable asset.
Homework/Exercises to practice the concept
- Design a JSON schema for a bank health report that includes per-example metrics summaries, bank-level metrics, and drift assessment per task class.
- Calculate PSI for a task class where the bank has distribution [refund: 40%, billing: 35%, shipping: 25%] and the production reference is [refund: 30%, billing: 30%, shipping: 40%]. Interpret the result.
- Write a pseudocode function that determines whether a specific task class needs example refresh based on three signals: PSI, staleness fraction, and coverage gap count.
Solutions to the homework/exercises
- The JSON schema should mirror the health report example above, with additional fields for: per-class example counts, min/max/mean quality scores per class, a list of flagged examples (stale, low-quality, or policy-blocked), and a metadata section with bank version, evaluation timestamp, and reference distribution version.
- PSI calculation: for each bin, compute (bank_pct - ref_pct) * ln(bank_pct / ref_pct). Refund: (0.40 - 0.30) * ln(0.40/0.30) = 0.10 * 0.288 = 0.0288. Billing: (0.35 - 0.30) * ln(0.35/0.30) = 0.05 * 0.154 = 0.0077. Shipping: (0.25 - 0.40) * ln(0.25/0.40) = -0.15 * (-0.470) = 0.0705. Total PSI = 0.0288 + 0.0077 + 0.0705 = 0.107. This indicates moderate drift (0.1 < PSI < 0.25), primarily driven by the shipping class being underrepresented in the bank.
- The pseudocode should check: IF psi > 0.25 OR staleness_fraction > 0.30 OR coverage_gaps > 0 THEN needs_refresh = TRUE. It should also handle edge cases: if PSI is unreliable (n < 10 examples), fall back to staleness and coverage signals only.
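The decision rule from the third solution, as a small runnable sketch. The thresholds are the ones suggested in the solution; parameter names are illustrative.

```python
def needs_refresh(psi_value, n_examples, staleness_fraction, coverage_gaps,
                  psi_threshold=0.25, staleness_threshold=0.30, min_n=10):
    """True when any drift signal says this task class should be refreshed."""
    if n_examples < min_n:
        # Too few examples for a reliable PSI: fall back to the other two signals.
        return staleness_fraction > staleness_threshold or coverage_gaps > 0
    return (psi_value > psi_threshold
            or staleness_fraction > staleness_threshold
            or coverage_gaps > 0)
```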
3. Project Specification
3.1 What You Will Build
An example-bank curation and selection pipeline that maximizes quality lift while minimizing bias drift.
3.2 Functional Requirements
- Maintain a versioned example bank with metadata (task class, difficulty, risk tags).
- Select top-k demonstrations using relevance + diversity constraints.
- Evaluate selected set against holdout fixtures and log quality lift.
- Detect and alert on drift when examples become stale or biased.
3.3 Non-Functional Requirements
- Performance: Selection + holdout eval completes in under 3 minutes for 1k examples.
- Reliability: Given fixed bank version and seed, selected example ids are deterministic.
- Security/Policy: Sensitive examples are excluded by policy labels before scoring.
3.4 Example Usage / Output
$ uv run p05-curator select --task-class support_refund --bank examples/refund_bank.jsonl --k 6 --out out/p05
[INFO] Loaded example bank: 842 records
[PASS] Deduplicated near-clones: 39 removed
[PASS] Selected set size: 6
[PASS] Diversity score: 0.81
[PASS] Expected lift on holdout: +7.4%
[INFO] Selection manifest: out/p05/selection_manifest.json
3.5 Data Formats / Schemas / Protocols
- Example-bank JSONL with prompt, ideal output, tags, and source metadata.
- Selection manifest JSON with selected ids and rationale.
- Holdout evaluation CSV with per-case deltas versus baseline.
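One plausible shape for a bank record, matching the metadata fields named above; all field names here are illustrative assumptions, not a prescribed schema (shown pretty-printed, though each JSONL record occupies a single line):

```json
{
  "id": "ex_042",
  "task_class": "support_refund",
  "prompt": "Customer asks for a refund on a damaged item...",
  "ideal_output": "{\"category\": \"refund\", \"action\": \"issue_refund\"}",
  "tags": {"difficulty": "medium", "labels": ["refund"], "risk": ["pii_free"]},
  "source": {"origin": "prod_ticket", "added": "2025-03-14", "quality_rating": 4}
}
```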
3.6 Edge Cases
- Bank contains contradictory “ideal” outputs for same intent.
- Highly relevant examples are all near-duplicates (low diversity).
- Selection overfits one region/customer segment.
- Example bank version changes mid-evaluation.
3.7 Real World Outcome
This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.
3.7.1 How to Run (Copy/Paste)
$ uv run p05-curator select --task-class support_refund --bank examples/refund_bank.jsonl --k 6 --out out/p05
- Working directory: project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS
- Required inputs: project fixtures under fixtures/
- Output directory: out/p05
3.7.2 Golden Path Demo (Deterministic)
Use the fixed seed already embedded in the command or config profile. You should see stable pass/fail totals between runs.
3.7.3 If CLI: exact terminal transcript
$ uv run p05-curator select --task-class support_refund --bank examples/refund_bank.jsonl --k 6 --out out/p05
[INFO] Loaded example bank: 842 records
[PASS] Deduplicated near-clones: 39 removed
[PASS] Selected set size: 6
[PASS] Diversity score: 0.81
[PASS] Expected lift on holdout: +7.4%
[INFO] Selection manifest: out/p05/selection_manifest.json
$ echo $?
0
Failure demo:
$ uv run p05-curator select --task-class support_refund --bank examples/refund_bank.jsonl --k 80 --out out/p05
[ERROR] Requested k=80 exceeds policy max for token budget (max=12)
[HINT] Lower --k or increase budget profile in policies/p05_budget.yaml
$ echo $?
2
4. Solution Architecture
4.1 High-Level Design
CURATOR PIPELINE ARCHITECTURE
==============================
Example Bank (JSONL) Config (YAML) Incoming Query
[842 records] [budget, thresholds] [task_class, k]
| | |
v v v
+-----------+ +-----------+ +------------+
| Bank | | Policy | | Query |
| Loader | | Loader | | Parser |
+-----------+ +-----------+ +------------+
| | |
v v v
+-----------------------------------------------------------+
| BANK CLEANER |
| 1. Policy filter (remove blocked) |
| 2. Freshness filter (remove stale) |
| 3. Near-duplicate detection (MinHash) |
| 4. Quality score validation |
+-----------------------------------------------------------+
|
v
+-----------------------------------------------------------+
| SELECTOR |
| 1. Embed query + candidates |
| 2. Score (similarity + quality + freshness) |
| 3. Select via MMR (lambda configurable) |
| 4. Token budget enforcement |
| 5. Ordering (anchor strategy) |
+-----------------------------------------------------------+
|
v
+-----------------------------------------------------------+
| LIFT EVALUATOR |
| 1. Run selected set against holdout fixtures |
| 2. Compare vs zero-shot baseline |
| 3. Compute per-class accuracy delta |
| 4. Compute diversity and coverage scores |
+-----------------------------------------------------------+
|
v
+-----------------------------------------------------------+
| DRIFT DETECTOR |
| 1. Load reference distribution |
| 2. Compute PSI per task class |
| 3. Evaluate thresholds |
| 4. Emit alerts for drifted classes |
+-----------------------------------------------------------+
|
v
+-----------+ +-----------+ +-----------+
| Selection | | Health | | Holdout |
| Manifest | | Report | | Eval CSV |
| (JSON) | | (JSON) | | |
+-----------+ +-----------+ +-----------+
4.2 Key Components
| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Bank Cleaner | Deduplicates and validates example quality. | Remove stale, contradictory, and policy-blocked examples first. |
| Selector | Chooses top-k by relevance/diversity objective. | Optimize for coverage, not just similarity score. Use MMR with configurable lambda. |
| Lift Evaluator | Measures performance delta on holdout set. | Use fixed holdout set to detect real improvements vs noise. |
| Drift Detector | Compares bank distribution to production reference. | Compute PSI per task class, not globally. |
4.3 Data Structures (No Full Code)
ExampleRecord:
- id: string
- task_class: string
- input: string
- output: object
- reasoning: string (optional, for CoT)
- tags: string[]
- difficulty: enum(easy, medium, hard)
- source: string
- added_date: date
- quality_score: float
- token_count: int
- embedding: float[] (precomputed)
SelectionManifest:
- trace_id: string
- bank_version: string
- seed: int
- strategy: enum(mmr, topk, coverage, stratified)
- lambda: float
- budget_tokens: int
- tokens_used: int
- selected_ids: string[]
- ordering_strategy: string
- scores: { id: float }[]
HealthReport:
- bank_version: string
- timestamp: datetime
- bank_metrics: { coverage, diversity, redundancy, staleness, label_balance }
- drift_assessment: { class, psi, status, alert? }[]
- recommended_actions: string[]
4.4 Algorithm Overview
Key algorithm: MMR-based selection with token budget constraint
- Load and clean the example bank (policy filter, freshness filter, dedup).
- Embed the query and all candidates.
- Iteratively select candidates using MMR: at each step, pick the candidate that maximizes lambda * relevance - (1 - lambda) * max_sim_to_selected, subject to the token budget constraint.
- Apply ordering strategy to the selected set.
- Evaluate against holdout fixtures and compute lift.
- Run drift detection against reference distribution.
Complexity Analysis (conceptual):
- Time: O(K * N) for MMR selection where N = bank size, K = selection count. O(N) for deduplication with MinHash. O(N) for drift detection.
- Space: O(N) for embeddings and metadata. O(K) for selection state.
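The selection steps above can be sketched in pure Python; this is a minimal greedy MMR loop with the token budget check, assuming precomputed embeddings passed as plain lists (function and parameter names are illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-12)

def mmr_select(query_emb, cand_embs, token_counts, k, budget_tokens, lam=0.7):
    """Greedy MMR under a token budget: each step picks the candidate
    maximizing lam * relevance - (1 - lam) * max similarity to the
    already-selected set, skipping anything that would bust the budget."""
    rel = [cosine(query_emb, c) for c in cand_embs]
    selected, used = [], 0
    remaining = list(range(len(cand_embs)))
    while remaining and len(selected) < k:
        best_i, best_score = None, -math.inf
        for i in remaining:  # index order doubles as a deterministic tiebreak
            if used + token_counts[i] > budget_tokens:
                continue
            redundancy = max(
                (cosine(cand_embs[i], cand_embs[j]) for j in selected),
                default=0.0,
            )
            score = lam * rel[i] - (1 - lam) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        if best_i is None:
            break  # nothing left fits the remaining budget
        selected.append(best_i)
        used += token_counts[best_i]
        remaining.remove(best_i)
    return selected, used
```

With lam=1.0 the redundancy term vanishes and the loop degenerates to top-K by similarity, which is a useful unit-test invariant.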
5. Implementation Guide
5.1 Development Environment Setup
# 1) Install dependencies (uv, Python 3.11+, embedding library)
# 2) Prepare fixtures under fixtures/
# 3) Create example bank JSONL under examples/
# 4) Run the project command(s) listed in section 3.7
5.2 Project Structure
p05/
├── src/
│ ├── bank_loader.py # Load and validate JSONL bank
│ ├── cleaner.py # Policy filter, dedup, freshness
│ ├── selector.py # MMR, top-K, coverage strategies
│ ├── ordering.py # Positional bias-aware ordering
│ ├── evaluator.py # Holdout lift measurement
│ ├── drift_detector.py # PSI computation and alerting
│ └── cli.py # CLI entry point
├── fixtures/
│ ├── holdout_set.jsonl # Fixed evaluation examples
│ └── golden_case.jsonl # Golden path test data
├── examples/
│ └── refund_bank.jsonl # Example bank
├── policies/
│ └── p05_budget.yaml # Token budget and thresholds
├── out/ # Output artifacts
└── README.md
5.3 The Core Question You’re Answering
“How do I choose few-shot examples that improve behavior without introducing hidden bias?”
This question matters because it forces the project to produce objective evidence (holdout lift metrics, diversity scores, drift alerts) instead of relying on subjective prompt impressions.
5.4 Concepts You Must Understand First
- In-context learning and few-shot prompting
- How does the model learn from demonstrations without gradient updates?
- Book Reference: “AI Engineering” by Chip Huyen - prompting chapters
- Demonstration selection strategies (MMR, coverage, stratified)
- Why does naive top-K similarity produce poor demonstration sets?
- Book Reference: “Mining of Massive Datasets” by Leskovec et al. - similarity + clustering chapters
- Positional bias and ordering effects
- How does the position of an example in the prompt affect model attention?
- Paper Reference: “Lost in the Middle” by Liu et al. (2023)
- Drift detection and quality monitoring
- How do you know when an example bank has gone stale?
- Book Reference: “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 11
5.5 Questions to Guide Your Design
- Selection algorithm
- What tradeoff does the lambda parameter in MMR control?
- Should selection be dynamic (per-query) or static (per-task-class)?
- How do you enforce token budget constraints during selection?
- Quality measurement
- What holdout set design prevents evaluation contamination?
- How do you measure lift relative to a zero-shot baseline?
- What per-class metrics detect hidden bias in the selected set?
- Operational monitoring
- What PSI threshold triggers a refresh alert for a task class?
- How frequently should drift checks run?
- What metadata must the selection manifest contain for full reproducibility?
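For the lift-measurement questions above, one way to frame the computation: run the holdout fixtures once zero-shot and once with the selected demonstrations, then diff per-class accuracy. A sketch, assuming each run is reduced to pass/fail booleans per holdout case (the data shapes are assumptions):

```python
def compute_lift(few_shot_results, zero_shot_results):
    """Per-class accuracy delta between few-shot and zero-shot runs.

    Each results dict maps task_class -> list of booleans (case passed?).
    Returns per-class lift in percentage points plus the overall delta."""
    lift = {}
    for cls in zero_shot_results:
        base = sum(zero_shot_results[cls]) / len(zero_shot_results[cls])
        few = sum(few_shot_results[cls]) / len(few_shot_results[cls])
        lift[cls] = round(100 * (few - base), 2)
    all_base = [x for v in zero_shot_results.values() for x in v]
    all_few = [x for v in few_shot_results.values() for x in v]
    overall = 100 * (sum(all_few) / len(all_few) - sum(all_base) / len(all_base))
    return lift, round(overall, 2)
```

Reporting per-class deltas alongside the overall number is what exposes hidden bias: an overall lift can mask a regression confined to one task class.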
5.6 Thinking Exercise
Pre-Mortem for Few-Shot Example Curator
Before implementing, write down 10 ways this project can fail in production. Classify each failure into: selection, quality, drift, or operations.
Questions to answer:
- Which failures can be prevented by better example bank design?
- Which failures require runtime monitoring and alerting?
- What happens when the bank has zero examples for a new task class?
5.7 The Interview Questions They’ll Ask
- “What makes a good few-shot example set, and how do you measure goodness?”
- “How do you quantify diversity versus relevance in example selection?”
- “How would you detect example-bank drift in production?”
- “Why can too many examples hurt output quality?”
- “How do you prevent bias amplification in demonstration selection?”
- “What is the Lost in the Middle effect, and how does it impact few-shot ordering?”
5.8 Hints in Layers
Hint 1: Tag examples aggressively Task, risk, region, difficulty, and freshness tags make selection tractable. Without metadata, you cannot filter, stratify, or monitor.
Hint 2: Optimize for marginal gain Each added example should bring new coverage. Use MMR to ensure that each selection adds information the previous selections did not.
Hint 3: Use a fixed holdout Do not evaluate lift on the same examples you selected from. The holdout set must be separate, versioned, and never used as candidate demonstrations.
Hint 4: Track bank lineage Record bank version, selection seed, strategy parameters, and embedding model version in every manifest. Without this, selections are not reproducible.
Hint 5: Start with PSI, not custom metrics Population Stability Index has well-established thresholds and works directly on categorical distributions. Start here before building custom drift detectors.
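Following Hint 5, a minimal PSI over categorical task-class counts; the epsilon guards against empty bins, and 0.25 is the commonly cited "significant shift" rule of thumb rather than a project-mandated threshold:

```python
import math

def psi(reference, current, eps=1e-6):
    """Population Stability Index over a categorical distribution.

    reference and current map category -> count. PSI sums
    (p_cur - p_ref) * ln(p_cur / p_ref) across categories; values
    above ~0.25 are conventionally treated as significant drift."""
    cats = set(reference) | set(current)
    ref_total = sum(reference.values()) or 1
    cur_total = sum(current.values()) or 1
    total = 0.0
    for c in cats:
        # Clamp proportions so categories missing on one side stay finite.
        p_ref = max(reference.get(c, 0) / ref_total, eps)
        p_cur = max(current.get(c, 0) / cur_total, eps)
        total += (p_cur - p_ref) * math.log(p_cur / p_ref)
    return total
```

Running this per task class, as the architecture specifies, keeps a drifted niche class from being averaged away by stable high-volume classes.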
5.9 Books That Will Help
| Topic | Book | Chapter |
|-------|------|---------|
| Few-shot learning theory | “AI Engineering” by Chip Huyen | Prompting and evaluation chapters |
| Statistical selection | “Pattern Recognition and Machine Learning” by Bishop | Model selection chapters |
| Large-scale data curation | “Mining of Massive Datasets” by Leskovec et al. | Similarity + clustering chapters |
| Data lifecycle governance | “Designing Data-Intensive Applications” by Martin Kleppmann | Ch. 11 |
5.10 Implementation Phases
Phase 1: Foundation
- Design the example bank JSONL schema with all required metadata fields.
- Build the bank loader and cleaner (policy filter, freshness filter, near-duplicate detection).
- Create fixture data: at least 50 example records spanning 3+ task classes.
- Checkpoint: Bank loads, cleans, and reports dedup statistics correctly.
Phase 2: Core Functionality
- Implement at least two selection strategies (MMR + one other).
- Add token budget enforcement to the selection algorithm.
- Implement the ordering step with at least two strategies.
- Build the selection manifest output.
- Checkpoint: Selection produces deterministic results with fixed seed. Manifest contains all required fields.
Phase 3: Operational Hardening
- Implement holdout evaluation with lift computation.
- Build drift detection using PSI per task class.
- Create the bank health report.
- Add automated alerting for drifted classes.
- Checkpoint: Drift detection correctly identifies intentionally shifted distributions in test data.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Selection strategy | Top-K / MMR / Coverage / Stratified | MMR with configurable lambda | Best balance of relevance and diversity |
| Dynamic vs static selection | Per-query / per-task-class / hybrid | Hybrid (static base + dynamic additions) | Balances quality and latency |
| Drift metric | PSI / KL divergence / custom | PSI per task class | Well-established thresholds, easy to interpret |
| Ordering strategy | Random / relevance-last / anchor | Anchor (best first + last) | Exploits both primacy and recency bias |
| Dedup method | Exact hash / SimHash / embedding similarity | SimHash with threshold 0.90 | Fast, effective for near-duplicates |
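Before committing to a SimHash library for the dedup decision, the near-duplicate rule can be prototyped with plain shingle Jaccard; this O(N²) sketch is a stand-in suitable only for small banks (the record shape is an assumption):

```python
def shingles(text, n=3):
    """Character n-gram shingles of a whitespace-normalized string."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}

def dedup_near_clones(records, threshold=0.90):
    """Drop records whose shingle Jaccard similarity to an already
    kept record meets the threshold. Pairwise comparison is O(N^2);
    a production bank would use SimHash/MinHash to avoid that."""
    kept, kept_shingles = [], []
    for rec in records:
        s = shingles(rec["input"])
        is_dup = any(
            len(s & ks) / len(s | ks) >= threshold for ks in kept_shingles
        )
        if not is_dup:
            kept.append(rec)
            kept_shingles.append(s)
    return kept
```

SimHash or MinHash replaces the pairwise loop with hashing, so the same 0.90 threshold scales to banks far larger than a few hundred records.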
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Validate selection algorithms | MMR with known inputs, budget enforcement, dedup accuracy |
| Integration Tests | Verify end-to-end pipeline | Golden-path CLI flow, manifest completeness |
| Edge Case Tests | Ensure robust failure handling | Empty bank, all-blocked examples, budget too small for 1 example |
| Drift Tests | Validate detection accuracy | Intentionally shifted distributions, per-class PSI thresholds |
6.2 Critical Test Cases
- Golden path succeeds and emits selection manifest with all required fields.
- Same bank version + seed + parameters produces identical selection every run.
- MMR with lambda=1.0 produces same result as top-K (degenerate case).
- Token budget constraint prevents selection of examples that would exceed limit.
- Drift detection fires alert when PSI exceeds threshold for a task class.
- Near-duplicate detection correctly identifies and removes paraphrased examples.
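The determinism cases above mostly reduce to one rule: every ranking needs a total order. A sketch of a tiebreak that keeps float-equal scores stable across runs (`select_topk` is illustrative, not part of the project CLI):

```python
def select_topk(records, k):
    """Deterministic top-K: ties on score break on the example id,
    so float-equal scores never reorder between runs."""
    ranked = sorted(records, key=lambda r: (-r["score"], r["id"]))
    return [r["id"] for r in ranked[:k]]
```

Without the id tiebreak, two examples with identical scores can swap positions depending on input order, which breaks byte-for-byte comparison of selection manifests.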
6.3 Test Data
fixtures/golden_case.jsonl # Happy path with expected outputs
fixtures/failure_case.jsonl # Malformed examples, policy-blocked records
fixtures/drift_test_bank.jsonl # Bank with intentional distribution shift
fixtures/holdout_set.jsonl # Fixed evaluation examples (never used as candidates)
fixtures/edge_cases/
empty_bank.jsonl # Zero records
all_blocked.jsonl # All records policy-blocked
single_class.jsonl # Bank with only one task class
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---------|---------|----------|
| “Model copies style but misses facts” | Examples optimized for tone, not correctness. | Include fact-critical examples with strict correctness labels. |
| “Lift vanished after one week” | Examples drifted out of date. | Add freshness cutoff and scheduled re-curation via drift detector. |
| “Quality improves for one segment only” | Coverage across user segments is imbalanced. | Add stratified selection constraints and per-class lift measurement. |
| “Selection is non-deterministic between runs” | Missing seed or floating-point ordering differences. | Fix seed in config, add tiebreaker to score comparisons. |
| “Budget exceeded silently” | Token counts estimated, not actual. | Use the same tokenizer as the target model for exact counts. |
7.2 Debugging Strategies
- Re-run with fixed seed and compare selection manifests byte-for-byte.
- Visualize selected examples on an embedding scatter plot to check diversity.
- Compare holdout accuracy with and without specific examples to identify harmful demonstrations.
- Diff the bank health report against the previous version to identify what changed.
7.3 Performance Traps
- Embedding all candidates on every selection request: precompute and cache embeddings.
- Computing pairwise similarity for the entire bank: use approximate nearest neighbor search for large banks.
- Running drift detection on every selection request: schedule drift checks hourly or daily, not per-request.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a new task class to the example bank with at least 20 examples and verify coverage metrics.
- Implement a simple top-K selector and compare its holdout lift against MMR.
8.2 Intermediate Extensions
- Add chain-of-thought support: store reasoning fields and measure lift difference between CoT and simple demonstrations.
- Build a drift dashboard that visualizes PSI trends over time per task class.
- Implement dynamic (per-query) selection and compare latency/quality vs static selection.
8.3 Advanced Extensions
- Implement Monte Carlo Shapley estimation to measure the marginal contribution of each example.
- Build an automated bank refresh pipeline that proposes new examples from production logs.
- Integrate with the registry (P15) to version and deploy example bank updates through a CI pipeline.
9. Real-World Connections
9.1 Industry Applications
- PromptOps platform teams curating few-shot libraries for customer support, content moderation, and code generation tasks.
- Healthcare NLP teams maintaining curated clinical demonstration sets with strict auditability and freshness requirements.
- Financial compliance teams using few-shot examples for transaction classification with drift monitoring tied to regulatory reporting.
9.2 Related Open Source Projects
- LangChain/LangSmith example selectors for dynamic few-shot construction.
- OpenAI Evals framework for holdout-based prompt evaluation.
- Evidently AI for data drift monitoring (adaptable to example bank drift detection).
- Sentence-Transformers for embedding computation in selection pipelines.
9.3 Interview Relevance
- Demonstrates understanding of in-context learning mechanics beyond surface-level “just add examples.”
- Shows practical data curation skills: selection algorithms, quality metrics, drift monitoring.
- Illustrates production thinking: reproducibility, audit trails, automated alerting, and operational monitoring.
10. Resources
10.1 Essential Reading
- “Language Models are Few-Shot Learners” by Brown et al. (GPT-3 paper)
- “Lost in the Middle: How Language Models Use Long Contexts” by Liu et al.
- “Rethinking the Role of Demonstrations” by Min et al.
- Anthropic documentation on prompt engineering and few-shot best practices.
- OpenAI Cookbook: few-shot prompting techniques.
10.2 Video Resources
- Talks on in-context learning theory and retrieval-augmented prompting.
- Conference presentations on data drift monitoring in ML systems.
10.3 Tools & Documentation
- Sentence-Transformers documentation for embedding models.
- MinHash/SimHash libraries for near-duplicate detection.
- PSI calculation libraries in Python (e.g., scipy for KL divergence).
- Evidently AI documentation for data drift monitoring.
10.4 Related Projects in This Series
- P01 (Prompt Contract Harness): Defines the contract validation that your curated examples must pass.
- P04 (Context Window Manager): Token budget management directly applies to demonstration budget allocation.
- P07 (Temperature Sweeper): Evaluation methodology for measuring prompt quality on holdout sets.
- P15 (Prompt Registry): Version management for example banks alongside prompt templates.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain how in-context learning works and why example quality matters more than quantity.
- I can describe at least 3 selection strategies (MMR, top-K, coverage) and their tradeoffs.
- I can explain the Lost in the Middle effect and how ordering strategies mitigate it.
- I can define PSI and interpret its values for drift detection.
11.2 Implementation
- Golden-path and failure-path flows both work deterministically.
- Selection manifest contains all fields needed for reproducibility.
- Drift detection correctly identifies intentionally shifted distributions.
- Holdout evaluation produces meaningful lift metrics.
11.3 Growth
- I can explain the tradeoff between dynamic and static selection in an interview.
- I can design a bank maintenance schedule for a production system.
- I can describe how this project integrates with the registry (P15) and evaluator (P07).
12. Submission / Completion Criteria
Minimum Viable Completion:
- Bank cleaner processes JSONL input and outputs cleaned candidates.
- At least one selection strategy (MMR) produces deterministic results with fixed seed.
- Selection manifest is written with all required fields.
- Holdout evaluation computes lift vs zero-shot baseline.
Full Completion:
- Two or more selection strategies implemented and compared.
- Ordering step implements at least one positional bias mitigation strategy.
- Drift detection with PSI per task class produces health report.
- Automated tests cover golden path, failure path, and drift detection.
Excellence (Above & Beyond):
- Dynamic per-query selection with latency benchmarking.
- Monte Carlo Shapley estimation for per-example value attribution.
- Integration with P15 (registry) for versioned bank deployment.
- Drift dashboard with time-series visualization.