Project 5: Few-Shot Example Curator
Build a dynamic example selection system that treats few-shot examples as data, not hardcoded magic
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | 3-5 days |
| Language | Python (Alternatives: TypeScript) |
| Prerequisites | Project 1 (Harness), basic understanding of embeddings |
| Key Topics | Few-shot learning, semantic similarity, example diversity |
| Knowledge Area | Few-shot Prompting / Generalization |
| Software/Tool | ChromaDB / FAISS (Optional), Sentence-Transformers |
| Main Book | “Hands-On Machine Learning” by Géron (Data selection, Ch. 2) |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | 3. The “Service & Support” Model |
1. Learning Objectives
By completing this project, you will:
- Master In-Context Learning (ICL): Understand how LLMs learn from examples in the prompt without retraining
- Implement Semantic Search: Use embedding-based similarity to find contextually relevant examples
- Balance Similarity vs. Diversity: Learn when to pick similar examples vs. when to ensure coverage
- Design Example Architectures: Create versioned, maintainable example pools that scale
- Measure Example Quality: Quantify the impact of example selection on model performance
- Handle Example Bias: Detect and mitigate biases that leak through few-shot examples
- Build Production Selection Logic: Create fast, cost-effective example retrieval systems
2. Theoretical Foundation
2.1 Core Concepts
What is In-Context Learning?
In-Context Learning (ICL) is the ability of LLMs to learn from examples provided in the prompt at inference time, without any weight updates. It’s like giving the model a “mini training set” for each request.
Zero-Shot vs. Few-Shot:
# Zero-Shot: No examples, just instructions
prompt = """
You are a support agent. Answer customer queries.
User: How do I reset my password?
Assistant:
"""
# Few-Shot: Include examples showing desired behavior
prompt = """
You are a support agent. Answer customer queries.
Example 1:
User: I need a refund for order #123
Assistant: I can help you with that refund. Our policy allows returns within 30 days. [Cited: policy_doc_1]
Example 2:
User: My account is locked
Assistant: Let me guide you through unlocking your account. Please check your email for a verification link. [Cited: help_doc_3]
Now answer this:
User: How do I reset my password?
Assistant:
"""
Why Few-Shot Works:
GPT models learn patterns from the examples through attention mechanisms. The examples act as “soft fine-tuning” that:
- Override generic model behavior
- Establish tone, format, and structure
- Demonstrate edge case handling
- Show refusal patterns (what NOT to do)
Research Foundation:
The GPT-3 paper (“Language Models are Few-Shot Learners” by Brown et al., 2020) demonstrated that larger models exhibit emergent few-shot learning capabilities. Performance scales with:
- Model size (larger = better few-shot learning)
- Example quality (relevant > random)
- Example quantity (up to a point, then saturates)
Semantic Similarity and Embeddings
An embedding is a dense vector representation of text that captures semantic meaning. Similar texts have similar vectors.
Vector Space Model:
"cat" → [0.2, 0.8, 0.1, ...]
"dog" → [0.3, 0.7, 0.2, ...] # Close to "cat"
"car" → [0.9, 0.1, 0.3, ...] # Far from "cat"
Cosine Similarity:
Measures the angle between two vectors (ignoring magnitude):
similarity = cos(θ) = (A · B) / (||A|| × ||B||)
Range: -1 (opposite) to +1 (identical)
Why Cosine Similarity?
Better than string matching for semantic search:
# String matching
"I want my money back" != "refund request" # 0% match
# Semantic similarity
embedding("I want my money back") ≈ embedding("refund request") # 0.85 similarity
Generating Embeddings:
Modern approaches:
- OpenAI Embeddings API: text-embedding-3-small (cheap, fast)
- Sentence-Transformers: Open-source models (all-MiniLM-L6-v2)
- Cohere Embeddings: Optimized for semantic search
The Primacy and Recency Effects
LLMs don’t “see” all examples equally. Position matters.
Primacy Effect: Models remember the first examples better
Recency Effect: Models remember the last examples better
Lost in the Middle: Middle examples get ignored
Research: “Lost in the Middle” (Liu et al., 2023) showed that LLMs struggle to use information in the middle of long contexts.
Practical Implications:
# GOOD: Put most important example first and last
examples = [
most_relevant_example, # Primacy
supporting_example_1,
supporting_example_2, # Might be ignored
negative_example # Recency - shows refusal pattern
]
# BAD: Random order
examples = random.sample(all_examples, 3)  # arbitrary order, ignores the query
Example Diversity vs. Similarity
The Overfitting Problem:
If you pick 3 examples that are TOO similar:
# All examples are refund requests
examples = [
"I want a refund for #123",
"Refund my order #456",
"Need refund for #789"
]
# User asks: "My 2FA isn't working"
# Model response: Tries to frame it as a refund issue! (Wrong)
The Confusion Problem:
If you pick 3 examples that are TOO diverse:
examples = [
"Refund order #123", # Topic: refunds
"Reset my password", # Topic: account access
"Why is shipping slow?" # Topic: logistics
]
# User asks: "Cancel my subscription"
# Model response: Confused, gives generic answer
The Sweet Spot:
Balance similarity to the current query with coverage of edge cases:
def select_examples(query, pool, n=3):
# Get top 5 most similar
candidates = get_top_k_similar(query, pool, k=5)
# Ensure diversity
selected = []
for candidate in candidates:
if not too_similar_to_selected(candidate, selected):
selected.append(candidate)
if len(selected) == n - 1:
break
# Always include 1 negative example
selected.append(random.choice(get_refusal_examples(pool)))
return selected
Negative Examples (Refusal Patterns)
Why Include Negative Examples?
Show the model when to say “I don’t know” or “I can’t help with that.”
Without Negative Examples:
User: What's the capital of Mars?
Model: The capital of Mars is Olympus City. (HALLUCINATION)
With Negative Examples:
Example (Negative):
User: My laptop screen is cracked
Assistant: I apologize, but hardware repairs require physical service. I can only assist with software and account issues. [Cited: scope_doc_1]
User: What's the capital of Mars?
Model: I don't have information about that. Mars doesn't have a capital city. I can only help with [actual scope].
Optimal Ratio:
A practical rule of thumb is a 2:1 ratio of positive to negative examples:
- 2 examples showing success patterns
- 1 example showing refusal/edge case
2.2 Why This Matters
Production Relevance
Problem: Static examples cause failures at scale
# Hardcoded approach (fails on 15% of queries)
STATIC_EXAMPLES = [
refund_example_1,
refund_example_2,
refund_example_3
]
# Every query sees the same 3 examples
# Result: Great for refunds, terrible for everything else
Solution: Dynamic selection improves accuracy by 10-25%
# Dynamic approach (fails on only 3% of queries)
def get_examples(user_query):
return select_most_relevant(user_query, example_pool)
# Each query sees tailored examples
# Result: Consistent performance across all categories
Real-World Applications
Companies using dynamic few-shot selection:
- Intercom: Customer support routing (selects examples based on query category)
- Notion AI: Content generation (picks examples matching user’s writing style)
- GitHub Copilot: Code completion (finds similar code patterns from context)
- Jasper AI: Marketing copy (selects examples matching brand voice)
2.3 Common Misconceptions
| Misconception | Reality |
|---|---|
| “More examples = better performance” | Quality > Quantity. 3 perfect examples beat 10 mediocre ones |
| “Random selection is fine” | Random examples perform 20-30% worse than curated selection |
| “Static examples work for everything” | Static works only for narrow, uniform tasks |
| “Embeddings are too expensive” | Pre-compute once, reuse forever (pennies per 1000 examples) |
| “Example order doesn’t matter” | Position affects model attention (primacy/recency effects) |
3. Project Specification
3.1 What You Will Build
A dynamic example selection system that:
- Manages an example pool with metadata (ID, tags, intent, complexity)
- Computes semantic similarity between user query and pool examples
- Selects optimal examples balancing similarity and diversity
- Tracks selection decisions with audit logs for debugging
- Measures impact by comparing static vs. dynamic selection
- Integrates with Project 1 harness for quantitative evaluation
Core Question This Tool Answers:
“How do I give the model ‘intuition’ for this specific task without retraining it?”
3.2 Functional Requirements
FR1: Example Pool Management
- Load examples from JSON/YAML with structured metadata
- Support versioning of example pools (git-tracked)
- Validate example structure (input, output, tags)
- Handle both pre-computed and runtime embedding generation
FR2: Similarity Computation
- Generate embeddings using sentence-transformers or OpenAI API
- Calculate cosine similarity between query and all examples
- Cache embeddings to avoid recomputation
- Support multiple similarity strategies (semantic, keyword, hybrid)
FR3: Example Selection Logic
Implement these selection strategies:
- Top-K Similar: Select K most similar examples
- Diverse Top-K: Select from top candidates ensuring diversity
- Category-Aware: Always include 1 example per category
- Negative Injection: Force inclusion of refusal examples
FR4: Integration & Logging
- CLI interface: curator.py --query "..." --pool examples.json
- Output selected examples with similarity scores
- Log selection decisions (which examples, why, scores)
- Format examples for injection into prompts
FR5: Performance Measurement
- Compare accuracy: static vs. dynamic selection
- Integration with Project 1 harness
- Generate reports showing improvement metrics
- A/B testing support (route traffic to different strategies)
3.3 Non-Functional Requirements
| Requirement | Target | Rationale |
|---|---|---|
| Selection Latency | <100ms for pool of 1000 examples | Must not slow down request handling |
| Embedding Cost | <$0.001 per query | Pre-compute embeddings, cache lookups |
| Pool Size | Support 10-10,000 examples | Small teams to large enterprises |
| Accuracy | 10%+ improvement over static | Justify the complexity |
| Maintainability | Example pool editable in Git | Non-technical team members can curate |
3.4 Example Usage
CLI Usage:
$ python curator.py --query "How do I reset my password?" --pool examples.json
[Curator] Loading example pool...
[Curator] Loaded 50 examples across 5 categories
[Curator] Embedding user query... Done
[Similarity Search]
Calculating cosine similarity with 50 examples...
Top matches:
#12: "Username change request" (similarity: 0.87)
#45: "Account locked - password issues" (similarity: 0.81)
#23: "Security question reset" (similarity: 0.78)
#7: "Email verification failure" (similarity: 0.76)
[Diversity Check]
Ensuring example variety...
✓ Ex #12: Category=Account, Complexity=Simple, Outcome=Success
✓ Ex #45: Category=Account, Complexity=Medium, Outcome=Success
✗ Ex #23: SKIPPED (Too similar to #12 - 0.92 overlap)
→ Replacing with #3: "Hardware repair request" (Negative/Refusal)
[Final Selection]
Selected examples for prompt:
1. Ex #12: Username change (Success pattern)
2. Ex #45: Locked account (Success pattern)
3. Ex #3: Hardware request refusal (Negative pattern)
[Generating Prompt]
Token count: 487 / 2000 budget
Sending to model...
[Response Validation]
Model output: {
"category": "account_security",
"action": "send_password_reset_link",
"confidence": 0.95
}
✓ Valid JSON
✓ Contains required fields
✓ Action is within allowed set
[Performance Report]
Previous runs with static examples: 85% accuracy (17/20 test cases)
Current run with dynamic selection: 98% accuracy (98/100 test cases)
Improvement: +13 percentage points
Saving selection log to runs/2024-12-27_14-32-01.json
Integration with Project 1 Harness:
$ python harness.py test prompts/support_agent_static.yaml
[STATIC EXAMPLES] Score: 85.3% (128/150 cases passed)
$ python harness.py test prompts/support_agent_dynamic.yaml
[DYNAMIC EXAMPLES] Score: 96.7% (145/150 cases passed)
Improvement: +11.4 percentage points
Categories with biggest gains:
- Edge Cases: 65% → 94% (+29%)
- Complex Requests: 78% → 98% (+20%)
- Out-of-Scope: 72% → 95% (+23%)
4. Solution Architecture
4.1 High-Level Design
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Example Pool │────▶│ Similarity │────▶│ Selector │
│ (JSON/YAML) │ │ Calculator │ │ Engine │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│ │ │
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Embedding │ │ Cache │ │ Formatter │
│ Generator │ │ (Embeddings) │ │ (Prompt) │
└─────────────────┘ └──────────────────┘ └─────────────────┘
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| ExamplePool | Load, validate, manage examples | JSON for human-readability, support for schemas |
| EmbeddingGenerator | Create vector representations | Sentence-Transformers for offline, OpenAI for simplicity |
| SimilarityCalculator | Compute query-example similarity | Cosine similarity (standard), support fallback to keyword |
| SelectorEngine | Choose optimal examples | Pluggable strategies (TopK, Diverse, CategoryAware) |
| DiversityFilter | Prevent similar examples | Inter-example similarity threshold (0.9) |
| Formatter | Inject examples into prompt | Templates for consistent formatting |
| AuditLogger | Track selection decisions | JSON logs per query for debugging |
4.3 Data Structures
from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from enum import Enum
import numpy as np
@dataclass
class Example:
"""Single example in the pool"""
id: str
input: str
output: Dict[str, Any]
tags: List[str] # ["account", "simple", "success"]
category: str # "account_access"
complexity: str # "simple", "medium", "complex"
outcome_type: str # "success", "refusal", "escalation"
embedding: Optional[np.ndarray] = None
metadata: Dict[str, Any] = None
@dataclass
class ExamplePool:
"""Collection of examples"""
version: str
examples: List[Example]
embedding_model: str
last_updated: str
@dataclass
class SimilarityResult:
"""Similarity between query and example"""
example: Example
score: float
method: str # "semantic", "keyword", "hybrid"
@dataclass
class SelectionResult:
"""Selected examples with metadata"""
query: str
selected_examples: List[Example]
similarity_scores: List[float]
selection_strategy: str
timestamp: str
metadata: Dict[str, Any]
class SelectionStrategy(Enum):
TOP_K = "top_k"
DIVERSE_TOP_K = "diverse_top_k"
CATEGORY_AWARE = "category_aware"
HYBRID = "hybrid"
4.4 Algorithm Overview
Example Selection Algorithm
def select_examples(
query: str,
pool: ExamplePool,
n: int = 3,
strategy: SelectionStrategy = SelectionStrategy.DIVERSE_TOP_K,
diversity_threshold: float = 0.9
) -> SelectionResult:
"""
Select optimal examples for a given query
Complexity: O(P) where P = pool size
- Embedding generation: O(1) (cached)
- Similarity computation: O(P) (dot product for all examples)
- Diversity filtering: O(n²) where n << P (usually n=3-5)
Args:
query: User's input query
pool: Pool of available examples
n: Number of examples to select
strategy: Selection algorithm to use
diversity_threshold: Max similarity between selected examples
Returns:
SelectionResult with chosen examples and metadata
"""
# Step 1: Generate query embedding
query_embedding = generate_embedding(query)
# Step 2: Compute similarity to all examples
similarities = []
for example in pool.examples:
score = cosine_similarity(query_embedding, example.embedding)
similarities.append(SimilarityResult(example, score, "semantic"))
# Step 3: Sort by similarity
similarities.sort(key=lambda x: x.score, reverse=True)
# Step 4: Apply selection strategy
if strategy == SelectionStrategy.TOP_K:
selected = similarities[:n]
elif strategy == SelectionStrategy.DIVERSE_TOP_K:
# Get top candidates (2x final count)
candidates = similarities[:n * 2]
# Select diverse subset
selected = []
for candidate in candidates:
if len(selected) >= n - 1: # Reserve 1 slot for negative
break
# Check diversity
is_diverse = all(
cosine_similarity(candidate.example.embedding,
sel.example.embedding) < diversity_threshold
for sel in selected
)
if is_diverse:
selected.append(candidate)
# Add negative example
refusal_examples = [s for s in similarities
if "refusal" in s.example.tags]
if refusal_examples:
selected.append(refusal_examples[0])
elif strategy == SelectionStrategy.CATEGORY_AWARE:
# Ensure at least 1 example per category
selected = select_category_diverse(similarities, n)
# Step 5: Order by effectiveness (recency/primacy)
ordered = order_examples(selected)
# Step 6: Build result
return SelectionResult(
query=query,
selected_examples=[s.example for s in ordered],
similarity_scores=[s.score for s in ordered],
selection_strategy=strategy.value,
timestamp=now(),
metadata={"diversity_threshold": diversity_threshold}
)
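Step 5 above calls an order_examples helper that is not defined in this guide; a minimal sketch that applies the primacy/recency guidance from Section 2.1 (most similar positive example first, refusal example last) might look like this:
def order_examples(selected: List[SimilarityResult]) -> List[SimilarityResult]:
    """
    Order examples for primacy/recency effects (sketch).
    Assumes refusal examples carry a "refusal" tag, as in the pool schema.
    """
    positives = [s for s in selected if "refusal" not in s.example.tags]
    negatives = [s for s in selected if "refusal" in s.example.tags]
    # Most similar positive first (primacy), refusal pattern last (recency)
    positives.sort(key=lambda s: s.score, reverse=True)
    return positives + negatives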
Diversity Filtering Algorithm
def ensure_diversity(
candidates: List[SimilarityResult],
n: int,
threshold: float = 0.9
) -> List[SimilarityResult]:
"""
Select diverse examples from candidates
Uses greedy algorithm:
1. Always pick the most similar candidate first
2. For each subsequent pick, ensure it's not too similar to already selected
Complexity: O(n * k) where k = len(candidates), n << k
"""
if len(candidates) <= n:
return candidates
selected = [candidates[0]] # Most similar
for candidate in candidates[1:]:
if len(selected) >= n:
break
# Check similarity to all already-selected examples
is_diverse = True
for selected_example in selected:
sim = cosine_similarity(
candidate.example.embedding,
selected_example.example.embedding
)
if sim >= threshold:
is_diverse = False
break
if is_diverse:
selected.append(candidate)
return selected
Embedding Generation
def generate_embedding(text: str, model_name: str = "all-MiniLM-L6-v2") -> np.ndarray:
"""
Generate semantic embedding for text
Options:
1. Sentence-Transformers (offline, free)
2. OpenAI API (online, costs $0.0001 per 1K tokens)
3. Cohere API (online, optimized for search)
"""
    # Option 1: Sentence-Transformers (RECOMMENDED)
    # Note: in real code, load the model once at startup and reuse it;
    # constructing SentenceTransformer on every call adds significant latency.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(model_name)
embedding = model.encode([text])[0]
return embedding
def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
"""
Compute cosine similarity between two vectors
Formula: cos(θ) = (A · B) / (||A|| × ||B||)
Returns: Float in [-1, 1] where 1 = identical, -1 = opposite
"""
dot_product = np.dot(vec1, vec2)
norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
if norm_product == 0:
return 0.0
return dot_product / norm_product
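A quick worked check of this implementation (the same vector pair appears in the self-assessment checklist):
# Sanity check: vectors at a 45° angle
v1 = np.array([1.0, 0.0])
v2 = np.array([0.7, 0.7])
print(cosine_similarity(v1, v2))  # ≈ 0.707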
4.5 Selection Strategies Comparison
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Top-K | Uniform task (all queries similar) | Fast, simple | Can overfit to similar patterns |
| Diverse Top-K | General-purpose tasks | Balance similarity & coverage | Slightly slower |
| Category-Aware | Multi-category support system | Guaranteed coverage | May sacrifice similarity |
| Hybrid | Complex production systems | Best accuracy | Most complex |
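No reference implementation of the Hybrid strategy appears in this guide; one reasonable interpretation (an assumption, not the only one) is a weighted blend of semantic and keyword scores, reusing the keyword_similarity baseline from Phase 1, Step 2:
def hybrid_similarity(query: str, example: Example, alpha: float = 0.7) -> float:
    """
    Blend semantic and keyword similarity (sketch; alpha is a tunable assumption).
    alpha = 1.0 -> purely semantic, alpha = 0.0 -> purely keyword (Jaccard).
    """
    semantic = cosine_similarity(generate_embedding(query), np.array(example.embedding))
    keyword = keyword_similarity(query, example.input)  # defined in Phase 1, Step 2
    return alpha * semantic + (1 - alpha) * keyword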
5. Implementation Guide
Phase 1: Foundation (Days 1-2)
Step 1: Set Up Example Pool Structure
Create example pool schema:
{
"version": "1.0.0",
"embedding_model": "all-MiniLM-L6-v2",
"last_updated": "2024-12-27",
"examples": [
{
"id": "ex_12",
"input": "I forgot my username and can't log in",
"output": {
"category": "account_access",
"action": "username_recovery",
"steps": ["verify_email", "send_username"]
},
"tags": ["account", "simple", "success"],
"category": "account_access",
"complexity": "simple",
"outcome_type": "success",
"embedding": null,
"metadata": {
"created_by": "john@example.com",
"created_at": "2024-12-01"
}
},
{
"id": "ex_3",
"input": "My laptop screen is cracked",
"output": {
"category": "out_of_scope",
"action": "polite_refusal",
"reason": "Hardware issues require physical repair"
},
"tags": ["hardware", "refusal", "negative"],
"category": "out_of_scope",
"complexity": "simple",
"outcome_type": "refusal",
"embedding": null,
"metadata": {}
}
]
}
Checkpoint 1.1: Can you load this JSON and parse it into Python dataclasses?
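If you get stuck on this checkpoint, a minimal loader might look like the sketch below (it assumes the dataclasses from Section 4.3 and numpy imported as np; validation and error handling are omitted):
import json

def load_example_pool(path: str) -> ExamplePool:
    """Load a JSON pool file into the Section 4.3 dataclasses (minimal sketch)."""
    with open(path) as f:
        data = json.load(f)
    examples = [
        Example(
            id=ex["id"],
            input=ex["input"],
            output=ex["output"],
            tags=ex["tags"],
            category=ex["category"],
            complexity=ex["complexity"],
            outcome_type=ex["outcome_type"],
            embedding=np.array(ex["embedding"]) if ex.get("embedding") else None,
            metadata=ex.get("metadata", {}),
        )
        for ex in data["examples"]
    ]
    return ExamplePool(
        version=data["version"],
        examples=examples,
        embedding_model=data["embedding_model"],
        last_updated=data["last_updated"],
    )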
Step 2: Implement Embedding Generation
Start simple with keyword matching:
def keyword_similarity(query: str, example_input: str) -> float:
"""
Simple keyword-based similarity (baseline)
This is your MVP - no ML required!
"""
query_words = set(query.lower().split())
example_words = set(example_input.lower().split())
intersection = query_words & example_words
union = query_words | example_words
if len(union) == 0:
return 0.0
# Jaccard similarity
return len(intersection) / len(union)
Checkpoint 1.2: Test keyword similarity on 5 example pairs. Does it return higher scores for similar texts?
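A quick manual check for this checkpoint (the pairs are arbitrary; substitute your own examples):
pairs = [
    ("How do I reset my password", "I forgot my password"),
    ("How do I reset my password", "Why is shipping slow"),
]
for query, example in pairs:
    print(f"{keyword_similarity(query, example):.2f}  {query!r} vs {example!r}")
# Expect the related pair to score noticeably higher than the unrelated one.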
Step 3: Add Semantic Embeddings
Install dependencies:
pip install sentence-transformers numpy
Generate embeddings:
from sentence_transformers import SentenceTransformer
import numpy as np
import json
def generate_and_cache_embeddings(pool_file: str):
"""
Pre-compute embeddings for all examples (run once)
"""
with open(pool_file) as f:
pool = json.load(f)
model = SentenceTransformer('all-MiniLM-L6-v2')
for example in pool['examples']:
text = example['input']
embedding = model.encode([text])[0]
example['embedding'] = embedding.tolist() # Convert to list for JSON
# Save with embeddings
with open(pool_file.replace('.json', '_embedded.json'), 'w') as f:
json.dump(pool, f, indent=2)
print(f"Generated embeddings for {len(pool['examples'])} examples")
Checkpoint 1.3: Generate embeddings for your example pool. Verify the JSON file now contains embedding arrays.
Phase 2: Selection Logic (Days 3-4)
Step 4: Implement Top-K Selection
def select_top_k(query: str, pool: ExamplePool, k: int = 3) -> List[Example]:
"""
Select K most similar examples
"""
query_embedding = generate_embedding(query)
similarities = []
for example in pool.examples:
example_embedding = np.array(example.embedding)
score = cosine_similarity(query_embedding, example_embedding)
similarities.append((score, example))
# Sort by similarity descending
similarities.sort(reverse=True, key=lambda x: x[0])
# Return top K
return [ex for score, ex in similarities[:k]]
Checkpoint 2.1: Test with query “How do I reset my password?”. Do the selected examples make sense?
Step 5: Add Diversity Filtering
def select_diverse_top_k(
query: str,
pool: ExamplePool,
k: int = 3,
diversity_threshold: float = 0.9
) -> List[Example]:
"""
Select diverse examples from top candidates
"""
query_embedding = generate_embedding(query)
# Get top candidates (2x final count)
candidates = []
for example in pool.examples:
score = cosine_similarity(query_embedding, np.array(example.embedding))
candidates.append((score, example))
candidates.sort(reverse=True, key=lambda x: x[0])
candidates = candidates[:k * 2]
# Select diverse subset
selected = [candidates[0]] # Always pick most similar
for score, candidate in candidates[1:]:
if len(selected) >= k:
break
# Check if diverse enough
is_diverse = all(
cosine_similarity(
np.array(candidate.embedding),
np.array(sel[1].embedding)
) < diversity_threshold
for sel in selected
)
if is_diverse:
selected.append((score, candidate))
return [ex for score, ex in selected]
Checkpoint 2.2: Compare Top-K vs. Diverse Top-K on 10 queries. Are the diverse selections actually different?
Step 6: Add Negative Example Injection
def select_with_negative(
query: str,
pool: ExamplePool,
k: int = 3
) -> List[Example]:
"""
Always include one refusal/negative example
"""
# Get k-1 positive examples
positive_pool = ExamplePool(
version=pool.version,
examples=[ex for ex in pool.examples if "refusal" not in ex.tags],
embedding_model=pool.embedding_model,
last_updated=pool.last_updated
)
selected = select_diverse_top_k(query, positive_pool, k=k-1)
# Add one negative example
negative_examples = [ex for ex in pool.examples if "refusal" in ex.tags]
if negative_examples:
selected.append(negative_examples[0])
return selected
Checkpoint 2.3: Verify that your selection always includes one refusal example.
Phase 3: Integration & Measurement (Day 5)
Step 7: Build CLI Interface
import argparse
def main():
parser = argparse.ArgumentParser(description="Few-Shot Example Curator")
parser.add_argument("--query", required=True, help="User query")
parser.add_argument("--pool", required=True, help="Path to example pool JSON")
parser.add_argument("--n", type=int, default=3, help="Number of examples")
parser.add_argument("--strategy", default="diverse",
choices=["topk", "diverse", "negative"])
args = parser.parse_args()
# Load pool
pool = load_example_pool(args.pool)
# Select examples
if args.strategy == "topk":
selected = select_top_k(args.query, pool, k=args.n)
elif args.strategy == "diverse":
selected = select_diverse_top_k(args.query, pool, k=args.n)
elif args.strategy == "negative":
selected = select_with_negative(args.query, pool, k=args.n)
# Display results
print(f"\n[Selected {len(selected)} examples for query: '{args.query}']\n")
for i, ex in enumerate(selected, 1):
print(f"{i}. {ex.id}: {ex.input[:60]}...")
print(f" Category: {ex.category}, Complexity: {ex.complexity}")
print()
if __name__ == "__main__":
main()
Checkpoint 3.1: Run the CLI and verify it works with different strategies.
Step 8: Integrate with Project 1 Harness
Create two prompt versions:
# prompts/support_static.yaml
name: "Support Agent - Static Examples"
version: "1.0.0"
prompt_template: |
You are a support agent. Here are some examples:
Example 1: [Hardcoded refund example]
Example 2: [Hardcoded password reset]
Example 3: [Hardcoded account lock]
User: {input}
Assistant:
test_cases:
- id: "case_1"
input: "How do I reset my password?"
# ... invariants
# prompts/support_dynamic.yaml
name: "Support Agent - Dynamic Examples"
version: "1.0.0"
prompt_template: |
You are a support agent. Here are some relevant examples:
{dynamic_examples} # Injected by curator
User: {input}
Assistant:
test_cases:
- id: "case_1"
input: "How do I reset my password?"
# ... same invariants
Modify harness to inject dynamic examples:
def run_test_with_dynamic_examples(test_case, pool):
# Select examples
examples = select_with_negative(test_case.input, pool, k=3)
# Format examples
formatted = format_examples_for_prompt(examples)
# Inject into prompt
prompt = template.replace("{dynamic_examples}", formatted)
prompt = prompt.replace("{input}", test_case.input)
# Run test
return execute_prompt(prompt)
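The snippet above calls format_examples_for_prompt, which is left for you to implement; a minimal sketch (the exact layout is an assumption — mirror whatever format your static prompt uses):
import json

def format_examples_for_prompt(examples: List[Example]) -> str:
    """Render selected examples as numbered few-shot blocks (minimal sketch)."""
    blocks = []
    for i, ex in enumerate(examples, 1):
        blocks.append(
            f"Example {i}:\n"
            f"User: {ex.input}\n"
            f"Assistant: {json.dumps(ex.output)}"
        )
    return "\n\n".join(blocks)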
Checkpoint 3.2: Run both test suites and compare scores. Is dynamic selection better?
Step 9: Generate Comparison Report
def generate_comparison_report(static_results, dynamic_results):
"""
Compare static vs. dynamic example selection
"""
print("\n" + "="*60)
print("STATIC vs. DYNAMIC EXAMPLE SELECTION COMPARISON")
print("="*60 + "\n")
print(f"Static Examples: {static_results['success_rate']:.1f}%")
print(f"Dynamic Examples: {dynamic_results['success_rate']:.1f}%")
improvement = dynamic_results['success_rate'] - static_results['success_rate']
print(f"\nImprovement: {improvement:+.1f} percentage points\n")
# Category breakdown
print("Categories with biggest gains:")
for category in static_results['categories']:
static_score = static_results['categories'][category]
dynamic_score = dynamic_results['categories'][category]
gain = dynamic_score - static_score
if gain > 5: # Only show significant improvements
print(f" - {category}: {static_score:.0f}% → {dynamic_score:.0f}% ({gain:+.0f}%)")
Checkpoint 3.3: Generate and save the comparison report.
6. Testing Strategy
6.1 Unit Tests
def test_cosine_similarity():
"""Test similarity calculation"""
v1 = np.array([1, 0, 0])
v2 = np.array([1, 0, 0])
assert abs(cosine_similarity(v1, v2) - 1.0) < 0.001 # Identical
v3 = np.array([0, 1, 0])
assert abs(cosine_similarity(v1, v3) - 0.0) < 0.001 # Orthogonal
def test_diversity_filter():
"""Test that similar examples are filtered"""
# Create pool with similar examples
pool = create_test_pool([
("refund order", [0.1, 0.9]),
("refund request", [0.15, 0.85]), # Very similar
("password reset", [0.9, 0.1]) # Different
])
selected = select_diverse_top_k("refund", pool, k=2, diversity_threshold=0.9)
# Should select refund + password (not both refunds)
assert len(selected) == 2
assert "password" in [ex.input for ex in selected]
def test_negative_injection():
"""Test that negative examples are always included"""
pool = create_test_pool_with_negatives()
selected = select_with_negative("any query", pool, k=3)
# Check that one example has "refusal" tag
has_negative = any("refusal" in ex.tags for ex in selected)
assert has_negative
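The create_test_pool helper used above is not shown; a minimal sketch builds toy Example objects from (text, embedding) pairs — note that with 2-dimensional toy embeddings you must stub or monkeypatch generate_embedding to the same dimension for the diversity test to run:
def create_test_pool(items) -> ExamplePool:
    """Build a tiny pool from (input_text, embedding_vector) pairs (test helper sketch)."""
    examples = [
        Example(
            id=f"test_{i}",
            input=text,
            output={},
            tags=[],
            category="test",
            complexity="simple",
            outcome_type="success",
            embedding=np.array(vec),
            metadata={},
        )
        for i, (text, vec) in enumerate(items)
    ]
    return ExamplePool(version="test", examples=examples,
                       embedding_model="toy", last_updated="n/a")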
6.2 Integration Tests
def test_end_to_end_selection():
"""Test complete selection pipeline"""
# Load real example pool
pool = load_example_pool("examples/support_pool.json")
# Select examples
selected = select_with_negative(
query="How do I reset my password?",
pool=pool,
k=3
)
# Verify results
assert len(selected) == 3
assert all(ex.embedding is not None for ex in selected)
assert any("refusal" in ex.tags for ex in selected)
# Check that top example is actually similar
assert "password" in selected[0].input.lower() or \
"account" in selected[0].input.lower()
6.3 Performance Benchmarks
def test_selection_performance():
"""Test that selection is fast enough for production"""
import time
pool = load_example_pool("examples/large_pool_1000.json")
start = time.time()
selected = select_with_negative("test query", pool, k=3)
elapsed = time.time() - start
# Should complete in <100ms
assert elapsed < 0.1, f"Selection took {elapsed:.3f}s (too slow!)"
7. Common Pitfalls & Debugging
7.1 The Name Bias Problem
Symptom: Model always uses the same name in outputs
# Example pool has 90% "John" and 10% other names
examples = [
"John wants a refund",
"John reset his password",
"John locked his account",
"Sarah changed email" # Only 1 example
]
# Model output always uses "John"!
Solution: Lint your example pool for bias
def detect_name_bias(pool: ExamplePool) -> List[str]:
"""
Detect overuse of specific names in examples
"""
import re
names = {}
for example in pool.examples:
# Extract capitalized words (potential names)
found_names = re.findall(r'\b[A-Z][a-z]+\b', example.input)
for name in found_names:
names[name] = names.get(name, 0) + 1
# Warn if any name appears in >30% of examples
total = len(pool.examples)
warnings = []
for name, count in names.items():
ratio = count / total
if ratio > 0.3:
warnings.append(f"⚠ Name '{name}' appears in {ratio*100:.0f}% of examples")
return warnings
7.2 The Embedding Cache Miss
Symptom: Selection is slow (>1s per query)
Cause: Regenerating embeddings every time
Solution: Pre-compute and cache
def load_pool_with_cached_embeddings(pool_file: str) -> ExamplePool:
"""
Load pool with pre-computed embeddings
"""
with open(pool_file) as f:
data = json.load(f)
# Check if embeddings exist
if data['examples'][0].get('embedding') is None:
print("No cached embeddings found. Generating...")
generate_and_cache_embeddings(pool_file)
# Reload
with open(pool_file.replace('.json', '_embedded.json')) as f:
data = json.load(f)
    # Convert embedding lists back to numpy arrays and rebuild the dataclasses
    examples = []
    for ex in data['examples']:
        ex['embedding'] = np.array(ex['embedding'])
        examples.append(Example(**ex))
    return ExamplePool(
        version=data['version'],
        examples=examples,
        embedding_model=data['embedding_model'],
        last_updated=data['last_updated']
    )
7.3 The Coverage Gap
Symptom: Certain query types never get good examples
Diagnosis: Check example distribution
def analyze_example_coverage(pool: ExamplePool):
"""
Report coverage by category
"""
from collections import Counter
categories = Counter(ex.category for ex in pool.examples)
print("Example Distribution:")
for category, count in categories.most_common():
percentage = (count / len(pool.examples)) * 100
print(f" {category}: {count} ({percentage:.1f}%)")
# Warn about underrepresented categories
for category, count in categories.items():
if count < 3:
print(f"⚠ Category '{category}' has only {count} examples (need at least 3)")
7.4 The Similarity Collapse
Symptom: All similarity scores are 0.9+ or all are 0.1-
Cause: Using wrong similarity metric or embeddings not normalized
Debug:
def debug_similarity_scores(query: str, pool: ExamplePool):
"""
Print similarity distribution for debugging
"""
query_emb = generate_embedding(query)
scores = []
for ex in pool.examples:
score = cosine_similarity(query_emb, np.array(ex.embedding))
scores.append((score, ex.id, ex.input[:40]))
scores.sort(reverse=True)
print(f"Similarity distribution for query: '{query}'")
print(f"Top 5:")
for score, id, text in scores[:5]:
print(f" {score:.3f} - {id}: {text}...")
print(f"\nBottom 5:")
for score, id, text in scores[-5:]:
print(f" {score:.3f} - {id}: {text}...")
# Statistics
all_scores = [s for s, _, _ in scores]
print(f"\nStats: min={min(all_scores):.3f}, max={max(all_scores):.3f}, "
f"mean={np.mean(all_scores):.3f}, std={np.std(all_scores):.3f}")
8. Extensions
8.1 Beginner Extensions
Extension 1: Keyword Fallback
If semantic similarity fails (offline mode), fall back to keyword matching:
def select_with_fallback(query: str, pool: ExamplePool, k: int = 3):
try:
return select_diverse_top_k(query, pool, k)
except Exception as e:
print(f"Semantic selection failed: {e}. Falling back to keywords.")
return select_by_keywords(query, pool, k)
Extension 2: Example Previews
Show example previews before sending to model:
$ python curator.py --query "refund" --pool examples.json --preview
Selected examples:
1. [ex_12] "I want a refund for order #123"
Output: {"action": "process_refund", "policy": "30_day_window"}
2. [ex_45] "My order was damaged, need refund"
Output: {"action": "process_refund", "reason": "damaged_goods"}
Continue? (y/n):
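One way to wire this into the Step 7 CLI (the --preview flag and the confirmation prompt are assumptions, not part of the spec):
import json
import sys

# In main(), alongside the other arguments:
parser.add_argument("--preview", action="store_true",
                    help="Show selected examples and confirm before calling the model")

# After selection:
if args.preview:
    for i, ex in enumerate(selected, 1):
        print(f"{i}. [{ex.id}] {ex.input!r}")
        print(f"   Output: {json.dumps(ex.output)}")
    if input("Continue? (y/n): ").strip().lower() != "y":
        sys.exit(0)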
8.2 Intermediate Extensions
Extension 3: Multi-Query Caching
Cache selected examples for similar queries:
from functools import lru_cache
@lru_cache(maxsize=1000)
def select_examples_cached(query: str, pool_version: str, k: int):
"""
Cache selections for frequently asked queries
pool_version ensures cache invalidation when pool changes
"""
pool = load_example_pool(f"pools/{pool_version}.json")
return select_with_negative(query, pool, k)
Extension 4: Category Balancing
Ensure examples span multiple categories:
def select_category_balanced(query: str, pool: ExamplePool, k: int = 3):
"""
Select at least one example from each major category
"""
from collections import defaultdict
# Group examples by category
by_category = defaultdict(list)
for ex in pool.examples:
by_category[ex.category].append(ex)
# Get top candidate from each category
candidates = []
for category, examples in by_category.items():
category_pool = ExamplePool(
version=pool.version,
examples=examples,
embedding_model=pool.embedding_model,
last_updated=pool.last_updated
)
top = select_top_k(query, category_pool, k=1)
if top:
candidates.extend(top)
# Select k most similar from candidates
return select_top_k_from_list(query, candidates, k)
8.3 Advanced Extensions
Extension 5: Active Learning for Example Curation
Track which examples are most effective and prioritize them:
def track_example_effectiveness(
selected_examples: List[Example],
test_result: TestResult
):
"""
Track which examples lead to successful outcomes
"""
# Load effectiveness scores
scores = load_effectiveness_scores()
# Update scores based on test outcome
for ex in selected_examples:
if test_result.passed:
scores[ex.id] = scores.get(ex.id, 0.5) + 0.1
else:
scores[ex.id] = scores.get(ex.id, 0.5) - 0.05
# Clamp to [0, 1]
scores[ex.id] = max(0, min(1, scores[ex.id]))
save_effectiveness_scores(scores)
def select_with_effectiveness_boost(query: str, pool: ExamplePool, k: int):
"""
Combine similarity with historical effectiveness
"""
scores = load_effectiveness_scores()
query_emb = generate_embedding(query)
ranked = []
for ex in pool.examples:
similarity = cosine_similarity(query_emb, np.array(ex.embedding))
effectiveness = scores.get(ex.id, 0.5)
# Combine: 70% similarity, 30% effectiveness
combined_score = 0.7 * similarity + 0.3 * effectiveness
ranked.append((combined_score, ex))
ranked.sort(reverse=True, key=lambda x: x[0])
return [ex for _, ex in ranked[:k]]
Extension 6: Maximal Marginal Relevance (MMR)
Advanced diversity algorithm used in production search systems:
def select_with_mmr(
query: str,
pool: ExamplePool,
k: int = 3,
lambda_param: float = 0.7
) -> List[Example]:
"""
Maximal Marginal Relevance selection
Balances relevance to query vs. diversity from already-selected examples
lambda_param: 1.0 = only relevance, 0.0 = only diversity
"""
query_emb = generate_embedding(query)
# Compute all similarities to query
similarities = {}
for ex in pool.examples:
similarities[ex.id] = cosine_similarity(query_emb, np.array(ex.embedding))
selected = []
candidates = list(pool.examples)
# Pick first example (most similar to query)
first = max(candidates, key=lambda ex: similarities[ex.id])
selected.append(first)
candidates.remove(first)
# Iteratively select remaining k-1 examples
for _ in range(k - 1):
if not candidates:
break
mmr_scores = {}
for candidate in candidates:
# Relevance to query
relevance = similarities[candidate.id]
# Max similarity to any selected example
max_similarity = max(
cosine_similarity(
np.array(candidate.embedding),
np.array(sel.embedding)
)
for sel in selected
)
# MMR formula
mmr = lambda_param * relevance - (1 - lambda_param) * max_similarity
mmr_scores[candidate.id] = mmr
# Select candidate with highest MMR
next_ex = max(candidates, key=lambda ex: mmr_scores[ex.id])
selected.append(next_ex)
candidates.remove(next_ex)
return selected
Extension 7: Cross-Encoder Reranking
Use a more powerful model to rerank candidates:
from sentence_transformers import CrossEncoder
def select_with_reranking(query: str, pool: ExamplePool, k: int = 3):
"""
Two-stage selection:
1. Use fast bi-encoder to get top 20 candidates
2. Use slow cross-encoder to rerank to top k
"""
# Stage 1: Fast retrieval
candidates = select_top_k(query, pool, k=20)
# Stage 2: Precise reranking
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
pairs = [(query, ex.input) for ex in candidates]
scores = reranker.predict(pairs)
# Sort by reranker scores
ranked = sorted(zip(scores, candidates), reverse=True, key=lambda x: x[0])
return [ex for _, ex in ranked[:k]]
9. Real-World Connections
9.1 Industry Applications
1. Customer Support (Intercom, Zendesk)
# Real-world example: Support ticket routing
def route_support_ticket(ticket_text: str):
# Select examples similar to this ticket
examples = select_with_negative(ticket_text, support_pool, k=3)
# Use examples to guide categorization
prompt = f"""
Examples of how to categorize tickets:
{format_examples(examples)}
Ticket: {ticket_text}
Category:
"""
category = llm.complete(prompt)
return category
2. Code Generation (GitHub Copilot)
GitHub Copilot uses similar code patterns from your codebase as examples:
# Simplified version of Copilot's approach
def generate_code_with_context(current_code: str, cursor_position: int):
# Find similar code patterns in project
similar_snippets = find_similar_code(current_code, project_codebase)
# Use as few-shot examples
prompt = build_prompt_with_examples(similar_snippets, current_code)
completion = codegen_model.complete(prompt)
return completion
3. Content Generation (Jasper, Copy.ai)
Marketing copy generators use brand-specific examples:
def generate_marketing_copy(brief: str, brand: str):
# Load brand-specific example pool
brand_pool = load_example_pool(f"brands/{brand}/examples.json")
# Select examples matching the brief style
examples = select_with_negative(brief, brand_pool, k=3)
# Generate with brand voice
copy = generate_with_examples(brief, examples)
return copy
9.2 Research Applications
Academic Use Case: Few-Shot Classification
Researchers use dynamic example selection for text classification:
# Research: "SetFit: Efficient Few-Shot Learning Without Prompts"
def few_shot_classify(text: str, labeled_examples: List, categories: List[str]):
# Select most similar labeled examples
selected = select_top_k(text, labeled_examples, k=8)
# Fine-tune small model on selected examples (SetFit approach)
model = train_on_examples(selected)
# Classify new text
category = model.predict(text)
return category
9.3 Production Patterns
Pattern 1: Hybrid Static + Dynamic
def hybrid_selection(query: str, pool: ExamplePool, k: int = 3):
"""
Always include 1 canonical example + k-1 dynamic examples
"""
# Canonical example (hand-picked, always included)
canonical = pool.get_example_by_id("canonical_1")
# Dynamic examples
dynamic = select_diverse_top_k(query, pool, k=k-1)
return [canonical] + dynamic
Pattern 2: Example Warming
def warm_example_cache(common_queries: List[str], pool: ExamplePool):
"""
Pre-compute selections for common queries at deployment time
"""
cache = {}
for query in common_queries:
cache[query] = select_with_negative(query, pool, k=3)
save_to_redis(cache)
10. Resources
10.1 Books
| Topic | Book | Chapter |
|---|---|---|
| Few-Shot Learning Theory | “AI Engineering” by Chip Huyen | Ch. 5 (Prompt Engineering & In-Context Learning) |
| Data Selection Strategies | “Hands-On Machine Learning” by Géron | Ch. 2 (End-to-End ML Project, Stratified Sampling) |
| Semantic Similarity | “Introduction to Information Retrieval” by Manning | Ch. 6 (Scoring, Term Weighting & Vector Space Model) |
| Vector Search | “Introduction to Information Retrieval” by Manning | Ch. 18 (Latent Semantic Indexing) |
| Embeddings Fundamentals | “Speech and Language Processing” by Jurafsky & Martin | Ch. 6 (Vector Semantics) |
| Evaluation Metrics | “Hands-On Machine Learning” by Géron | Ch. 3 (Classification Metrics) |
| Sampling Strategies | “Designing Data-Intensive Applications” by Kleppmann | Ch. 10 (Batch Processing, section on sampling) |
10.2 Papers
- “Language Models are Few-Shot Learners” (Brown et al., 2020)
- The GPT-3 paper introducing few-shot prompting
- https://arxiv.org/abs/2005.14165
- “What Makes Good In-Context Examples for GPT-3?” (Liu et al., 2021)
- Analysis of example selection strategies
- https://arxiv.org/abs/2101.06804
- “Lost in the Middle: How Language Models Use Long Contexts” (Liu et al., 2023)
- Shows position effects in context usage
- https://arxiv.org/abs/2307.03172
- “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks” (Reimers & Gurevych, 2019)
- Foundation for sentence-transformers library
- https://arxiv.org/abs/1908.10084
10.3 Libraries & Tools
# Essential libraries
pip install sentence-transformers # Embeddings
pip install numpy # Vector operations
pip install faiss-cpu # Fast similarity search (optional)
pip install chromadb # Vector database (optional)
# For advanced features
pip install openai # OpenAI embeddings API
pip install cohere # Cohere embeddings
pip install pinecone-client # Production vector DB
10.4 Online Resources
- Prompt Engineering Guide: https://www.promptingguide.ai/
- Section on Few-Shot Prompting
- OpenAI Cookbook: https://cookbook.openai.com/
- Examples of embedding-based search
- Sentence-Transformers Documentation: https://www.sbert.net/
- Model selection guide, performance benchmarks
11. Self-Assessment Checklist
Core Understanding
- I can explain the difference between zero-shot, one-shot, and few-shot prompting
- Test: Explain to someone in 2 minutes without notes
- I understand why few-shot examples improve model performance
- Test: What mechanism allows LLMs to learn from examples at inference time?
- I can calculate cosine similarity between two vectors by hand
- Test: Given vectors [1,0] and [0.7, 0.7], compute similarity
- I understand the primacy and recency effects
- Test: Where should you place your most important example and why?
- I can explain the bias trap with concrete examples
- Test: Give 3 types of bias that can leak through few-shot examples
Implementation Skills
- I can load and parse an example pool from JSON
- Evidence: Show your load_example_pool() function
- I’ve implemented cosine similarity from scratch
- Evidence: Working implementation without using libraries
- I’ve generated embeddings using sentence-transformers
- Evidence: Code that generates and caches embeddings
- I’ve implemented top-K selection
- Evidence: Returns K most similar examples
- I’ve implemented diversity filtering
- Evidence: Selected examples are not too similar to each other
- I’ve integrated with Project 1 harness
- Evidence: Comparison report showing static vs. dynamic performance
Measurement & Analysis
- I’ve measured the improvement from dynamic selection
- Evidence: Report showing percentage point improvement
- I’ve identified which categories benefit most
- Evidence: Category breakdown showing gains
- I’ve debugged similarity scoring issues
- Evidence: Used debug function to analyze score distribution
- I’ve analyzed my example pool for bias
- Evidence: Ran linter to detect name/tone bias
Growth
- I can identify when dynamic examples are worth the complexity
- Application: Give criteria for when to use static vs. dynamic
- I’ve documented lessons learned
- What surprised you during implementation?
- What would you do differently next time?
- I can explain this project in a job interview
- Practice: 2-minute explanation covering problem, solution, results
- I understand production considerations
- How would you reduce selection latency?
- How would you handle example pool updates?
- What metrics would you track?
12. Submission / Completion Criteria
Minimum Viable Completion
To consider this project “complete” at a basic level:
- Can load example pool from JSON
- Parses examples with all required fields
- Handles missing/malformed data gracefully
- Can generate embeddings
- Uses sentence-transformers or equivalent
- Caches embeddings to avoid recomputation
- Implements top-K selection
- Correctly computes cosine similarity
- Returns K most similar examples
- Shows improvement over static examples
- Integrated with Project 1 harness
- Report shows measurable accuracy gain
Proof of Completion:
- Screenshot of CLI showing selected examples
- Comparison report showing improvement
- Code walkthrough of selection algorithm
Full Completion
All minimum criteria plus:
- Diversity filtering implemented
- Selected examples are not too similar to each other
- Configurable similarity threshold
- Negative example injection
- Always includes one refusal pattern
- Handles cases where no negative examples exist
- Detailed logging
- Saves selection decisions with scores
- Audit trail for debugging
- CLI with multiple strategies
- Supports top-k, diverse, negative injection
- Clear help text and examples
- Performance testing
- Selection completes in <100ms for pool of 100 examples
- Handles pools of 1000+ examples
Proof of Completion:
- Public GitHub repository
- README with usage examples
- Passing unit tests
- Performance benchmarks
Excellence (Going Above & Beyond)
All full completion criteria plus any 2+ of:
- MMR or advanced selection algorithm
- Implements Maximal Marginal Relevance
- Demonstrates improvement over greedy diversity
- Active learning / effectiveness tracking
- Tracks which examples lead to success
- Adapts selection based on historical performance
- Cross-encoder reranking
- Two-stage selection (fast retrieval + precise reranking)
- Measurably better than single-stage
- Example pool linting & analysis
- Detects bias (names, tone, length)
- Reports coverage gaps
- Suggests additions
- Production features
- Example caching for common queries
- Support for multiple pools (per-domain)
- API endpoint for example selection service
Proof of Completion:
- Blog post with implementation details
- Video demo
- Contribution to open-source project
- Production deployment
Appendix: Sample Files
Example Pool (examples/support_pool.json)
{
"version": "1.0.0",
"embedding_model": "all-MiniLM-L6-v2",
"last_updated": "2024-12-27",
"examples": [
{
"id": "ex_12",
"input": "I forgot my username and can't log in",
"output": {
"category": "account_access",
"action": "username_recovery",
"steps": ["verify_email", "send_username"],
"citation": "help_doc_5"
},
"tags": ["account", "simple", "success"],
"category": "account_access",
"complexity": "simple",
"outcome_type": "success",
"embedding": null,
"metadata": {
"created_by": "support_team",
"created_at": "2024-12-01",
"usage_count": 45
}
},
{
"id": "ex_45",
"input": "My account is locked due to too many password attempts",
"output": {
"category": "account_security",
"action": "unlock_account",
"steps": ["verify_identity", "send_unlock_link"],
"citation": "security_policy_3"
},
"tags": ["account", "medium", "success"],
"category": "account_security",
"complexity": "medium",
"outcome_type": "success",
"embedding": null,
"metadata": {}
},
{
"id": "ex_3",
"input": "My laptop screen is cracked and I need it fixed",
"output": {
"category": "out_of_scope",
"action": "polite_refusal",
"reason": "Hardware repairs require physical service. Please contact device manufacturer.",
"citation": "scope_policy_1"
},
"tags": ["hardware", "refusal", "negative"],
"category": "out_of_scope",
"complexity": "simple",
"outcome_type": "refusal",
"embedding": null,
"metadata": {}
},
{
"id": "ex_23",
"input": "How do I reset my security questions?",
"output": {
"category": "account_security",
"action": "security_question_reset",
"steps": ["verify_email", "navigate_to_settings", "update_questions"],
"citation": "help_doc_12"
},
"tags": ["account", "simple", "success"],
"category": "account_security",
"complexity": "simple",
"outcome_type": "success",
"embedding": null,
"metadata": {}
}
]
}
Embedding Generation Script (scripts/generate_embeddings.py)
#!/usr/bin/env python3
"""
Generate embeddings for example pool
Usage: python generate_embeddings.py examples/support_pool.json
"""
import json
import sys
from sentence_transformers import SentenceTransformer
import numpy as np
def generate_embeddings(pool_file: str, model_name: str = "all-MiniLM-L6-v2"):
"""Generate and cache embeddings for all examples"""
print(f"Loading example pool: {pool_file}")
with open(pool_file) as f:
pool = json.load(f)
print(f"Loading embedding model: {model_name}")
model = SentenceTransformer(model_name)
print(f"Generating embeddings for {len(pool['examples'])} examples...")
for i, example in enumerate(pool['examples'], 1):
text = example['input']
embedding = model.encode([text])[0]
example['embedding'] = embedding.tolist()
if i % 10 == 0:
print(f" Processed {i}/{len(pool['examples'])}")
# Save with embeddings
output_file = pool_file.replace('.json', '_embedded.json')
with open(output_file, 'w') as f:
json.dump(pool, f, indent=2)
print(f"✓ Saved embeddings to: {output_file}")
if __name__ == "__main__":
if len(sys.argv) < 2:
print("Usage: python generate_embeddings.py <pool_file.json>")
sys.exit(1)
generate_embeddings(sys.argv[1])
End of Project 5: Few-Shot Example Curator