Project 5: Few-Shot Example Curator

Build a dynamic example selection system that treats few-shot examples as data, not hardcoded magic

Quick Reference

Attribute Value
Difficulty Intermediate
Time Estimate 3-5 days
Language Python (Alternatives: TypeScript)
Prerequisites Project 1 (Harness), basic understanding of embeddings
Key Topics Few-shot learning, semantic similarity, example diversity
Knowledge Area Few-shot Prompting / Generalization
Software/Tool ChromaDB / FAISS (Optional), Sentence-Transformers
Main Book “Hands-On Machine Learning” by Géron (Data selection, Ch. 2)
Coolness Level Level 2: Practical but Forgettable
Business Potential 3. The “Service & Support” Model

1. Learning Objectives

By completing this project, you will:

  1. Master In-Context Learning (ICL): Understand how LLMs learn from examples in the prompt without retraining
  2. Implement Semantic Search: Use embedding-based similarity to find contextually relevant examples
  3. Balance Similarity vs. Diversity: Learn when to pick similar examples vs. when to ensure coverage
  4. Design Example Architectures: Create versioned, maintainable example pools that scale
  5. Measure Example Quality: Quantify the impact of example selection on model performance
  6. Handle Example Bias: Detect and mitigate biases that leak through few-shot examples
  7. Build Production Selection Logic: Create fast, cost-effective example retrieval systems

2. Theoretical Foundation

2.1 Core Concepts

What is In-Context Learning?

In-Context Learning (ICL) is the ability of LLMs to learn from examples provided in the prompt at inference time, without any weight updates. It’s like giving the model a “mini training set” for each request.

Zero-Shot vs. Few-Shot:

# Zero-Shot: No examples, just instructions
prompt = """
You are a support agent. Answer customer queries.

User: How do I reset my password?
Assistant:
"""

# Few-Shot: Include examples showing desired behavior
prompt = """
You are a support agent. Answer customer queries.

Example 1:
User: I need a refund for order #123
Assistant: I can help you with that refund. Our policy allows returns within 30 days. [Cited: policy_doc_1]

Example 2:
User: My account is locked
Assistant: Let me guide you through unlocking your account. Please check your email for a verification link. [Cited: help_doc_3]

Now answer this:
User: How do I reset my password?
Assistant:
"""

Why Few-Shot Works:

LLMs learn patterns from the examples through their attention mechanisms. The examples act as “soft fine-tuning” that:

  • Override generic model behavior
  • Establish tone, format, and structure
  • Demonstrate edge case handling
  • Show refusal patterns (what NOT to do)

Research Foundation:

The GPT-3 paper (“Language Models are Few-Shot Learners” by Brown et al., 2020) demonstrated that larger models exhibit emergent few-shot learning capabilities. Performance scales with:

  1. Model size (larger = better few-shot learning)
  2. Example quality (relevant > random)
  3. Example quantity (up to a point, then saturates)

Semantic Similarity and Embeddings

An embedding is a dense vector representation of text that captures semantic meaning. Similar texts have similar vectors.

Vector Space Model:

"cat" → [0.2, 0.8, 0.1, ...]
"dog" → [0.3, 0.7, 0.2, ...]  # Close to "cat"
"car" → [0.9, 0.1, 0.3, ...]  # Far from "cat"

Cosine Similarity:

Measures the angle between two vectors (ignoring magnitude):

similarity = cos(θ) = (A · B) / (||A|| × ||B||)

Range: -1 (opposite) to +1 (identical)
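
Worked example: for A = [1, 0] and B = [0.7, 0.7], A · B = 0.7, ||A|| = 1, and ||B|| = √0.98 ≈ 0.99, so similarity ≈ 0.7 / 0.99 ≈ 0.71 (the vectors are 45° apart).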

Why Cosine Similarity?

Better than string matching for semantic search:

# String matching
"I want my money back" != "refund request"  # 0% match

# Semantic similarity
cosine(embedding("I want my money back"), embedding("refund request"))  # ≈ 0.85 similarity

Generating Embeddings:

Modern approaches:

  1. OpenAI Embeddings API: text-embedding-3-small (cheap, fast; sketched below)
  2. Sentence-Transformers: Open-source models (all-MiniLM-L6-v2)
  3. Cohere Embeddings: Optimized for semantic search
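
A minimal sketch of option 1 (assumes the openai Python SDK v1+, with OPENAI_API_KEY set in the environment); option 2 is shown in Section 4.4:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="I want my money back",
)
vector = response.data[0].embedding  # list of floats (1536 dimensions by default)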

The Primacy and Recency Effects

LLMs don’t “see” all examples equally. Position matters.

Primacy Effect: Models remember the first examples better
Recency Effect: Models remember the last examples better
Lost in the Middle: Middle examples get ignored

Research: “Lost in the Middle” (Liu et al., 2023) showed that LLMs struggle to use information in the middle of long contexts.

Practical Implications:

# GOOD: Put most important example first and last
examples = [
    most_relevant_example,     # Primacy
    supporting_example_1,
    supporting_example_2,      # Might be ignored
    negative_example           # Recency - shows refusal pattern
]

# BAD: Random order (note: random.shuffle() shuffles in place and returns None)
examples = random.sample(all_examples, 3)

Example Diversity vs. Similarity

The Overfitting Problem:

If you pick 3 examples that are TOO similar:

# All examples are refund requests
examples = [
    "I want a refund for #123",
    "Refund my order #456",
    "Need refund for #789"
]

# User asks: "My 2FA isn't working"
# Model response: Tries to frame it as a refund issue! (Wrong)

The Confusion Problem:

If you pick 3 examples that are TOO diverse:

examples = [
    "Refund order #123",       # Topic: refunds
    "Reset my password",        # Topic: account access
    "Why is shipping slow?"     # Topic: logistics
]

# User asks: "Cancel my subscription"
# Model response: Confused, gives generic answer

The Sweet Spot:

Balance similarity to the current query with coverage of edge cases:

def select_examples(query, pool, n=3):
    # Get top 5 most similar
    candidates = get_top_k_similar(query, pool, k=5)

    # Ensure diversity
    selected = []
    for candidate in candidates:
        if not too_similar_to_selected(candidate, selected):
            selected.append(candidate)
            if len(selected) == n - 1:
                break

    # Always include 1 negative example
    selected.append(random.choice(get_refusal_examples(pool)))

    return selected

Negative Examples (Refusal Patterns)

Why Include Negative Examples?

Show the model when to say “I don’t know” or “I can’t help with that.”

Without Negative Examples:

User: What's the capital of Mars?
Model: The capital of Mars is Olympus City. (HALLUCINATION)

With Negative Examples:

Example (Negative):
User: My laptop screen is cracked
Assistant: I apologize, but hardware repairs require physical service. I can only assist with software and account issues. [Cited: scope_doc_1]

User: What's the capital of Mars?
Model: I don't have information about that. Mars doesn't have a capital city. I can only help with [actual scope].

Optimal Ratio:

A commonly cited heuristic is a 2:1 ratio of positive to negative examples:

  • 2 examples showing success patterns
  • 1 example showing refusal/edge case

2.2 Why This Matters

Production Relevance

Problem: Static examples cause failures at scale

# Hardcoded approach (fails on 15% of queries)
STATIC_EXAMPLES = [
    refund_example_1,
    refund_example_2,
    refund_example_3
]

# Every query sees the same 3 examples
# Result: Great for refunds, terrible for everything else

Solution: Dynamic selection can improve accuracy by 10-25% (illustrative figures; measure on your own task)

# Dynamic approach (fails on only 3% of queries)
def get_examples(user_query):
    return select_most_relevant(user_query, example_pool)

# Each query sees tailored examples
# Result: Consistent performance across all categories

Real-World Applications

Companies using dynamic few-shot selection:

  1. Intercom: Customer support routing (selects examples based on query category)
  2. Notion AI: Content generation (picks examples matching user’s writing style)
  3. GitHub Copilot: Code completion (finds similar code patterns from context)
  4. Jasper AI: Marketing copy (selects examples matching brand voice)

2.3 Common Misconceptions

Misconception Reality
“More examples = better performance” Quality > Quantity. 3 perfect examples beat 10 mediocre ones
“Random selection is fine” Random examples can perform 20-30% worse than curated selection
“Static examples work for everything” Static works only for narrow, uniform tasks
“Embeddings are too expensive” Pre-compute once, reuse forever (pennies per 1000 examples)
“Example order doesn’t matter” Position affects model attention (primacy/recency effects)

3. Project Specification

3.1 What You Will Build

A dynamic example selection system that:

  1. Manages an example pool with metadata (ID, tags, intent, complexity)
  2. Computes semantic similarity between user query and pool examples
  3. Selects optimal examples balancing similarity and diversity
  4. Tracks selection decisions with audit logs for debugging
  5. Measures impact by comparing static vs. dynamic selection
  6. Integrates with Project 1 harness for quantitative evaluation

Core Question This Tool Answers:

“How do I give the model ‘intuition’ for this specific task without retraining it?”

3.2 Functional Requirements

FR1: Example Pool Management

  • Load examples from JSON/YAML with structured metadata
  • Support versioning of example pools (git-tracked)
  • Validate example structure (input, output, tags)
  • Handle both pre-computed and runtime embedding generation

FR2: Similarity Computation

  • Generate embeddings using sentence-transformers or OpenAI API
  • Calculate cosine similarity between query and all examples
  • Cache embeddings to avoid recomputation
  • Support multiple similarity strategies (semantic, keyword, hybrid)

FR3: Example Selection Logic

Implement these selection strategies:

  • Top-K Similar: Select K most similar examples
  • Diverse Top-K: Select from top candidates ensuring diversity
  • Category-Aware: Always include 1 example per category
  • Negative Injection: Force inclusion of refusal examples

FR4: Integration & Logging

  • CLI interface: curator.py --query "..." --pool examples.json
  • Output selected examples with similarity scores
  • Log selection decisions (which examples, why, scores)
  • Format examples for injection into prompts

FR5: Performance Measurement

  • Compare accuracy: static vs. dynamic selection
  • Integration with Project 1 harness
  • Generate reports showing improvement metrics
  • A/B testing support (route traffic to different strategies)

3.3 Non-Functional Requirements

Requirement Target Rationale
Selection Latency <100ms for pool of 1000 examples Must not slow down request handling
Embedding Cost <$0.001 per query Pre-compute embeddings, cache lookups
Pool Size Support 10-10,000 examples Small teams to large enterprises
Accuracy 10%+ improvement over static Justify the complexity
Maintainability Example pool editable in Git Non-technical team members can curate

3.4 Example Usage

CLI Usage:

$ python curator.py --query "How do I reset my password?" --pool examples.json

[Curator] Loading example pool...
[Curator] Loaded 50 examples across 5 categories
[Curator] Embedding user query... Done

[Similarity Search]
Calculating cosine similarity with 50 examples...
Top matches:
  #12: "Username change request" (similarity: 0.87)
  #45: "Account locked - password issues" (similarity: 0.81)
  #23: "Security question reset" (similarity: 0.78)
  #7: "Email verification failure" (similarity: 0.76)

[Diversity Check]
Ensuring example variety...
  ✓ Ex #12: Category=Account, Complexity=Simple, Outcome=Success
  ✓ Ex #45: Category=Account, Complexity=Medium, Outcome=Success
  ✗ Ex #23: SKIPPED (Too similar to #12 - 0.92 overlap)
  → Replacing with #3: "Hardware repair request" (Negative/Refusal)

[Final Selection]
Selected examples for prompt:
  1. Ex #12: Username change (Success pattern)
  2. Ex #45: Locked account (Success pattern)
  3. Ex #3: Hardware request refusal (Negative pattern)

[Generating Prompt]
Token count: 487 / 2000 budget
Sending to model...

[Response Validation]
Model output: {
  "category": "account_security",
  "action": "send_password_reset_link",
  "confidence": 0.95
}
✓ Valid JSON
✓ Contains required fields
✓ Action is within allowed set

[Performance Report]
Previous runs with static examples: 85% accuracy (17/20 test cases)
Current run with dynamic selection: 98% accuracy (98/100 test cases)
Improvement: +13 percentage points

Saving selection log to runs/2024-12-27_14-32-01.json

Integration with Project 1 Harness:

$ python harness.py test prompts/support_agent_static.yaml
[STATIC EXAMPLES] Score: 85.3% (128/150 cases passed)

$ python harness.py test prompts/support_agent_dynamic.yaml
[DYNAMIC EXAMPLES] Score: 96.7% (145/150 cases passed)

Improvement: +11.4 percentage points
Categories with biggest gains:
  - Edge Cases: 65% → 94% (+29%)
  - Complex Requests: 78% → 98% (+20%)
  - Out-of-Scope: 72% → 95% (+23%)

4. Solution Architecture

4.1 High-Level Design

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Example Pool   │────▶│  Similarity      │────▶│   Selector      │
│  (JSON/YAML)    │     │  Calculator      │     │   Engine        │
└─────────────────┘     └──────────────────┘     └─────────────────┘
         │                       │                         │
         │                       │                         │
         ▼                       ▼                         ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Embedding      │     │  Cache           │     │   Formatter     │
│  Generator      │     │  (Embeddings)    │     │   (Prompt)      │
└─────────────────┘     └──────────────────┘     └─────────────────┘

4.2 Key Components

Component Responsibility Key Decisions
ExamplePool Load, validate, manage examples JSON for human-readability, support for schemas
EmbeddingGenerator Create vector representations Sentence-Transformers for offline, OpenAI for simplicity
SimilarityCalculator Compute query-example similarity Cosine similarity (standard), support fallback to keyword
SelectorEngine Choose optimal examples Pluggable strategies (TopK, Diverse, CategoryAware)
DiversityFilter Prevent similar examples Inter-example similarity threshold (0.9)
Formatter Inject examples into prompt Templates for consistent formatting
AuditLogger Track selection decisions JSON logs per query for debugging

4.3 Data Structures

from dataclasses import dataclass
from typing import List, Dict, Any, Optional
from enum import Enum
import numpy as np

@dataclass
class Example:
    """Single example in the pool"""
    id: str
    input: str
    output: Dict[str, Any]
    tags: List[str]  # ["account", "simple", "success"]
    category: str    # "account_access"
    complexity: str  # "simple", "medium", "complex"
    outcome_type: str  # "success", "refusal", "escalation"
    embedding: Optional[np.ndarray] = None
    metadata: Optional[Dict[str, Any]] = None

@dataclass
class ExamplePool:
    """Collection of examples"""
    version: str
    examples: List[Example]
    embedding_model: str
    last_updated: str

@dataclass
class SimilarityResult:
    """Similarity between query and example"""
    example: Example
    score: float
    method: str  # "semantic", "keyword", "hybrid"

@dataclass
class SelectionResult:
    """Selected examples with metadata"""
    query: str
    selected_examples: List[Example]
    similarity_scores: List[float]
    selection_strategy: str
    timestamp: str
    metadata: Dict[str, Any]

class SelectionStrategy(Enum):
    TOP_K = "top_k"
    DIVERSE_TOP_K = "diverse_top_k"
    CATEGORY_AWARE = "category_aware"
    HYBRID = "hybrid"

4.4 Algorithm Overview

Example Selection Algorithm

def select_examples(
    query: str,
    pool: ExamplePool,
    n: int = 3,
    strategy: SelectionStrategy = SelectionStrategy.DIVERSE_TOP_K,
    diversity_threshold: float = 0.9
) -> SelectionResult:
    """
    Select optimal examples for a given query

    Complexity: O(P) where P = pool size
    - Embedding generation: O(1) (cached)
    - Similarity computation: O(P) (dot product for all examples)
    - Diversity filtering: O(n²) where n << P (usually n=3-5)

    Args:
        query: User's input query
        pool: Pool of available examples
        n: Number of examples to select
        strategy: Selection algorithm to use
        diversity_threshold: Max similarity between selected examples

    Returns:
        SelectionResult with chosen examples and metadata
    """

    # Step 1: Generate query embedding
    query_embedding = generate_embedding(query)

    # Step 2: Compute similarity to all examples
    similarities = []
    for example in pool.examples:
        score = cosine_similarity(query_embedding, example.embedding)
        similarities.append(SimilarityResult(example, score, "semantic"))

    # Step 3: Sort by similarity
    similarities.sort(key=lambda x: x.score, reverse=True)

    # Step 4: Apply selection strategy
    if strategy == SelectionStrategy.TOP_K:
        selected = similarities[:n]

    elif strategy == SelectionStrategy.DIVERSE_TOP_K:
        # Get top candidates (2x final count)
        candidates = similarities[:n * 2]

        # Select diverse subset
        selected = []
        for candidate in candidates:
            if len(selected) >= n - 1:  # Reserve 1 slot for negative
                break

            # Check diversity
            is_diverse = all(
                cosine_similarity(candidate.example.embedding,
                                 sel.example.embedding) < diversity_threshold
                for sel in selected
            )

            if is_diverse:
                selected.append(candidate)

        # Add negative example
        refusal_examples = [s for s in similarities
                           if "refusal" in s.example.tags]
        if refusal_examples:
            selected.append(refusal_examples[0])

    elif strategy == SelectionStrategy.CATEGORY_AWARE:
        # Ensure at least 1 example per category
        selected = select_category_diverse(similarities, n)

    # Step 5: Order by effectiveness (recency/primacy)
    ordered = order_examples(selected)

    # Step 6: Build result
    return SelectionResult(
        query=query,
        selected_examples=[s.example for s in ordered],
        similarity_scores=[s.score for s in ordered],
        selection_strategy=strategy.value,
        timestamp=now(),
        metadata={"diversity_threshold": diversity_threshold}
    )
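
The order_examples helper in Step 5 is not defined above. A minimal sketch, following the primacy/recency guidance from Section 2.1 (most similar example first, refusal pattern last):

def order_examples(selected: List[SimilarityResult]) -> List[SimilarityResult]:
    """
    Order examples to exploit primacy/recency effects:
    most similar example first, negative/refusal example last.
    """
    negatives = [s for s in selected if "refusal" in s.example.tags]
    positives = [s for s in selected if "refusal" not in s.example.tags]
    positives.sort(key=lambda s: s.score, reverse=True)
    return positives + negatives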

Diversity Filtering Algorithm

def ensure_diversity(
    candidates: List[SimilarityResult],
    n: int,
    threshold: float = 0.9
) -> List[SimilarityResult]:
    """
    Select diverse examples from candidates

    Uses greedy algorithm:
    1. Always pick the most similar candidate first
    2. For each subsequent pick, ensure it's not too similar to already selected

    Complexity: O(n * k) where k = len(candidates), n << k
    """

    if len(candidates) <= n:
        return candidates

    selected = [candidates[0]]  # Most similar

    for candidate in candidates[1:]:
        if len(selected) >= n:
            break

        # Check similarity to all already-selected examples
        is_diverse = True
        for selected_example in selected:
            sim = cosine_similarity(
                candidate.example.embedding,
                selected_example.example.embedding
            )
            if sim >= threshold:
                is_diverse = False
                break

        if is_diverse:
            selected.append(candidate)

    return selected

Embedding Generation

from functools import lru_cache

from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=4)
def _get_model(model_name: str) -> SentenceTransformer:
    """Load each embedding model once; loading takes seconds, encoding takes ms"""
    return SentenceTransformer(model_name)

def generate_embedding(text: str, model_name: str = "all-MiniLM-L6-v2") -> np.ndarray:
    """
    Generate semantic embedding for text

    Options:
    1. Sentence-Transformers (offline, free) -- used here
    2. OpenAI API (online, fractions of a cent per 1K tokens)
    3. Cohere API (online, optimized for search)
    """
    model = _get_model(model_name)  # cached: avoids reloading on every call
    return model.encode([text])[0]

def cosine_similarity(vec1: np.ndarray, vec2: np.ndarray) -> float:
    """
    Compute cosine similarity between two vectors

    Formula: cos(θ) = (A · B) / (||A|| × ||B||)

    Returns: Float in [-1, 1] where 1 = identical, -1 = opposite
    """
    dot_product = np.dot(vec1, vec2)
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)

    if norm_product == 0:
        return 0.0

    return dot_product / norm_product
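
The per-example loop in Step 2 makes P separate Python calls. To stay under the <100ms latency target on pools of 1,000+ examples, the same computation can be done as one NumPy matrix operation; a sketch, assuming the pool embeddings are stacked into a single (P, D) matrix:

def batch_cosine_similarity(query_emb: np.ndarray, pool_matrix: np.ndarray) -> np.ndarray:
    """
    Cosine similarity between one query and every pool embedding at once.

    pool_matrix has shape (P, D): one row per example embedding.
    Returns an array of P similarity scores.
    """
    dots = pool_matrix @ query_emb                                    # shape (P,)
    norms = np.linalg.norm(pool_matrix, axis=1) * np.linalg.norm(query_emb)
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms != 0)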

4.5 Selection Strategies Comparison

Strategy When to Use Pros Cons
Top-K Uniform task (all queries similar) Fast, simple Can overfit to similar patterns
Diverse Top-K General-purpose tasks Balance similarity & coverage Slightly slower
Category-Aware Multi-category support system Guaranteed coverage May sacrifice similarity
Hybrid Complex production systems Best accuracy Most complex

5. Implementation Guide

Phase 1: Foundation (Days 1-2)

Step 1: Set Up Example Pool Structure

Create example pool schema:

{
  "version": "1.0.0",
  "embedding_model": "all-MiniLM-L6-v2",
  "last_updated": "2024-12-27",
  "examples": [
    {
      "id": "ex_12",
      "input": "I forgot my username and can't log in",
      "output": {
        "category": "account_access",
        "action": "username_recovery",
        "steps": ["verify_email", "send_username"]
      },
      "tags": ["account", "simple", "success"],
      "category": "account_access",
      "complexity": "simple",
      "outcome_type": "success",
      "embedding": null,
      "metadata": {
        "created_by": "john@example.com",
        "created_at": "2024-12-01"
      }
    },
    {
      "id": "ex_3",
      "input": "My laptop screen is cracked",
      "output": {
        "category": "out_of_scope",
        "action": "polite_refusal",
        "reason": "Hardware issues require physical repair"
      },
      "tags": ["hardware", "refusal", "negative"],
      "category": "out_of_scope",
      "complexity": "simple",
      "outcome_type": "refusal",
      "embedding": null,
      "metadata": {}
    }
  ]
}

Checkpoint 1.1: Can you load this JSON and parse it into Python dataclasses?

Step 2: Implement a Similarity Baseline

Start simple with keyword matching:

def keyword_similarity(query: str, example_input: str) -> float:
    """
    Simple keyword-based similarity (baseline)

    This is your MVP - no ML required!
    """
    query_words = set(query.lower().split())
    example_words = set(example_input.lower().split())

    intersection = query_words & example_words
    union = query_words | example_words

    if len(union) == 0:
        return 0.0

    # Jaccard similarity
    return len(intersection) / len(union)
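
A quick sanity check for Checkpoint 1.2 (the pairs below are arbitrary; any clearly similar and clearly unrelated pairs will do):

pairs = [
    ("reset my password", "password reset request"),  # expect a high score (0.50)
    ("reset my password", "why is shipping slow"),    # expect 0.00 (no shared words)
]
for query, example in pairs:
    print(f"{keyword_similarity(query, example):.2f}  {query!r} vs {example!r}")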

Checkpoint 1.2: Test keyword similarity on 5 example pairs. Does it return higher scores for similar texts?

Step 3: Add Semantic Embeddings

Install dependencies:

pip install sentence-transformers numpy

Generate embeddings:

from sentence_transformers import SentenceTransformer
import numpy as np
import json

def generate_and_cache_embeddings(pool_file: str):
    """
    Pre-compute embeddings for all examples (run once)
    """
    with open(pool_file) as f:
        pool = json.load(f)

    model = SentenceTransformer('all-MiniLM-L6-v2')

    for example in pool['examples']:
        text = example['input']
        embedding = model.encode([text])[0]
        example['embedding'] = embedding.tolist()  # Convert to list for JSON

    # Save with embeddings
    with open(pool_file.replace('.json', '_embedded.json'), 'w') as f:
        json.dump(pool, f, indent=2)

    print(f"Generated embeddings for {len(pool['examples'])} examples")

Checkpoint 1.3: Generate embeddings for your example pool. Verify the JSON file now contains embedding arrays.

Phase 2: Selection Logic (Days 3-4)

Step 4: Implement Top-K Selection

def select_top_k(query: str, pool: ExamplePool, k: int = 3) -> List[Example]:
    """
    Select K most similar examples
    """
    query_embedding = generate_embedding(query)

    similarities = []
    for example in pool.examples:
        example_embedding = np.array(example.embedding)
        score = cosine_similarity(query_embedding, example_embedding)
        similarities.append((score, example))

    # Sort by similarity descending
    similarities.sort(reverse=True, key=lambda x: x[0])

    # Return top K
    return [ex for score, ex in similarities[:k]]

Checkpoint 2.1: Test with query “How do I reset my password?”. Do the selected examples make sense?

Step 5: Add Diversity Filtering

def select_diverse_top_k(
    query: str,
    pool: ExamplePool,
    k: int = 3,
    diversity_threshold: float = 0.9
) -> List[Example]:
    """
    Select diverse examples from top candidates
    """
    query_embedding = generate_embedding(query)

    # Get top candidates (2x final count)
    candidates = []
    for example in pool.examples:
        score = cosine_similarity(query_embedding, np.array(example.embedding))
        candidates.append((score, example))

    candidates.sort(reverse=True, key=lambda x: x[0])
    candidates = candidates[:k * 2]

    # Select diverse subset
    selected = [candidates[0]]  # Always pick most similar

    for score, candidate in candidates[1:]:
        if len(selected) >= k:
            break

        # Check if diverse enough
        is_diverse = all(
            cosine_similarity(
                np.array(candidate.embedding),
                np.array(sel[1].embedding)
            ) < diversity_threshold
            for sel in selected
        )

        if is_diverse:
            selected.append((score, candidate))

    return [ex for score, ex in selected]

Checkpoint 2.2: Compare Top-K vs. Diverse Top-K on 10 queries. Are the diverse selections actually different?

Step 6: Add Negative Example Injection

def select_with_negative(
    query: str,
    pool: ExamplePool,
    k: int = 3
) -> List[Example]:
    """
    Always include one refusal/negative example
    """
    # Get k-1 positive examples
    positive_pool = ExamplePool(
        version=pool.version,
        examples=[ex for ex in pool.examples if "refusal" not in ex.tags],
        embedding_model=pool.embedding_model,
        last_updated=pool.last_updated
    )

    selected = select_diverse_top_k(query, positive_pool, k=k-1)

    # Add one negative example (first match here; ranking refusals by
    # similarity to the query is a natural refinement)
    negative_examples = [ex for ex in pool.examples if "refusal" in ex.tags]
    if negative_examples:
        selected.append(negative_examples[0])

    return selected

Checkpoint 2.3: Verify that your selection always includes one refusal example.

Phase 3: Integration & Measurement (Day 5)

Step 7: Build CLI Interface

import argparse

def main():
    parser = argparse.ArgumentParser(description="Few-Shot Example Curator")
    parser.add_argument("--query", required=True, help="User query")
    parser.add_argument("--pool", required=True, help="Path to example pool JSON")
    parser.add_argument("--n", type=int, default=3, help="Number of examples")
    parser.add_argument("--strategy", default="diverse",
                       choices=["topk", "diverse", "negative"])

    args = parser.parse_args()

    # Load pool
    pool = load_example_pool(args.pool)

    # Select examples
    if args.strategy == "topk":
        selected = select_top_k(args.query, pool, k=args.n)
    elif args.strategy == "diverse":
        selected = select_diverse_top_k(args.query, pool, k=args.n)
    elif args.strategy == "negative":
        selected = select_with_negative(args.query, pool, k=args.n)

    # Display results
    print(f"\n[Selected {len(selected)} examples for query: '{args.query}']\n")
    for i, ex in enumerate(selected, 1):
        print(f"{i}. {ex.id}: {ex.input[:60]}...")
        print(f"   Category: {ex.category}, Complexity: {ex.complexity}")
        print()

if __name__ == "__main__":
    main()

Checkpoint 3.1: Run the CLI and verify it works with different strategies.

Step 8: Integrate with Project 1 Harness

Create two prompt versions:

# prompts/support_static.yaml
name: "Support Agent - Static Examples"
version: "1.0.0"

prompt_template: |
  You are a support agent. Here are some examples:

  Example 1: [Hardcoded refund example]
  Example 2: [Hardcoded password reset]
  Example 3: [Hardcoded account lock]

  User: {input}
  Assistant:

test_cases:
  - id: "case_1"
    input: "How do I reset my password?"
    # ... invariants

# prompts/support_dynamic.yaml
name: "Support Agent - Dynamic Examples"
version: "1.0.0"

prompt_template: |
  You are a support agent. Here are some relevant examples:

  {dynamic_examples}

  User: {input}
  Assistant:

test_cases:
  - id: "case_1"
    input: "How do I reset my password?"
    # ... same invariants

Modify harness to inject dynamic examples:

def run_test_with_dynamic_examples(test_case, pool):
    # Select examples
    examples = select_with_negative(test_case.input, pool, k=3)

    # Format examples
    formatted = format_examples_for_prompt(examples)

    # Inject into prompt
    prompt = template.replace("{dynamic_examples}", formatted)
    prompt = prompt.replace("{input}", test_case.input)

    # Run test
    return execute_prompt(prompt)
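
format_examples_for_prompt is left to you. A minimal sketch that mirrors the “Example N:” layout from Section 2.1 (serializing outputs as JSON is an assumption; match whatever format your prompt expects):

import json

def format_examples_for_prompt(examples: List[Example]) -> str:
    """Render selected examples in the Example N / User / Assistant layout"""
    blocks = []
    for i, ex in enumerate(examples, 1):
        blocks.append(
            f"Example {i}:\nUser: {ex.input}\nAssistant: {json.dumps(ex.output)}"
        )
    return "\n\n".join(blocks)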

Checkpoint 3.2: Run both test suites and compare scores. Is dynamic selection better?

Step 9: Generate Comparison Report

def generate_comparison_report(static_results, dynamic_results):
    """
    Compare static vs. dynamic example selection
    """
    print("\n" + "="*60)
    print("STATIC vs. DYNAMIC EXAMPLE SELECTION COMPARISON")
    print("="*60 + "\n")

    print(f"Static Examples:  {static_results['success_rate']:.1f}%")
    print(f"Dynamic Examples: {dynamic_results['success_rate']:.1f}%")

    improvement = dynamic_results['success_rate'] - static_results['success_rate']
    print(f"\nImprovement: {improvement:+.1f} percentage points\n")

    # Category breakdown
    print("Categories with biggest gains:")
    for category in static_results['categories']:
        static_score = static_results['categories'][category]
        dynamic_score = dynamic_results['categories'][category]
        gain = dynamic_score - static_score

        if gain > 5:  # Only show significant improvements
            print(f"  - {category}: {static_score:.0f}% → {dynamic_score:.0f}% ({gain:+.0f}%)")

Checkpoint 3.3: Generate and save the comparison report.

6. Testing Strategy

6.1 Unit Tests

def test_cosine_similarity():
    """Test similarity calculation"""
    v1 = np.array([1, 0, 0])
    v2 = np.array([1, 0, 0])
    assert abs(cosine_similarity(v1, v2) - 1.0) < 0.001  # Identical

    v3 = np.array([0, 1, 0])
    assert abs(cosine_similarity(v1, v3) - 0.0) < 0.001  # Orthogonal

def test_diversity_filter():
    """Test that similar examples are filtered"""
    # Create pool with similar examples
    pool = create_test_pool([
        ("refund order", [0.1, 0.9]),
        ("refund request", [0.15, 0.85]),  # Very similar
        ("password reset", [0.9, 0.1])      # Different
    ])

    selected = select_diverse_top_k("refund", pool, k=2, diversity_threshold=0.9)

    # Should select refund + password (not both refunds)
    assert len(selected) == 2
    assert any("password" in ex.input for ex in selected)

def test_negative_injection():
    """Test that negative examples are always included"""
    pool = create_test_pool_with_negatives()

    selected = select_with_negative("any query", pool, k=3)

    # Check that one example has "refusal" tag
    has_negative = any("refusal" in ex.tags for ex in selected)
    assert has_negative

6.2 Integration Tests

def test_end_to_end_selection():
    """Test complete selection pipeline"""
    # Load real example pool
    pool = load_example_pool("examples/support_pool.json")

    # Select examples
    selected = select_with_negative(
        query="How do I reset my password?",
        pool=pool,
        k=3
    )

    # Verify results
    assert len(selected) == 3
    assert all(ex.embedding is not None for ex in selected)
    assert any("refusal" in ex.tags for ex in selected)

    # Check that top example is actually similar
    assert "password" in selected[0].input.lower() or \
           "account" in selected[0].input.lower()

6.3 Performance Benchmarks

def test_selection_performance():
    """Test that selection is fast enough for production"""
    import time

    pool = load_example_pool("examples/large_pool_1000.json")

    start = time.time()
    selected = select_with_negative("test query", pool, k=3)
    elapsed = time.time() - start

    # Should complete in <100ms
    assert elapsed < 0.1, f"Selection took {elapsed:.3f}s (too slow!)"

7. Common Pitfalls & Debugging

7.1 The Name Bias Problem

Symptom: Model always uses the same name in outputs

# Example pool has 90% "John" and 10% other names
examples = [
    "John wants a refund",
    "John reset his password",
    "John locked his account",
    "Sarah changed email"  # Only 1 example
]

# Model output always uses "John"!

Solution: Lint your example pool for bias

def detect_name_bias(pool: ExamplePool) -> List[str]:
    """
    Detect overuse of specific names in examples; returns warning strings
    """
    import re

    names = {}
    for example in pool.examples:
        # Extract capitalized words (potential names)
        found_names = re.findall(r'\b[A-Z][a-z]+\b', example.input)
        for name in found_names:
            names[name] = names.get(name, 0) + 1

    # Warn if any name appears in >30% of examples
    total = len(pool.examples)
    warnings = []
    for name, count in names.items():
        ratio = count / total
        if ratio > 0.3:
            warnings.append(f"⚠ Name '{name}' appears in {ratio*100:.0f}% of examples")

    return warnings

7.2 The Embedding Cache Miss

Symptom: Selection is slow (>1s per query)

Cause: Regenerating embeddings every time

Solution: Pre-compute and cache

def load_pool_with_cached_embeddings(pool_file: str) -> ExamplePool:
    """
    Load pool with pre-computed embeddings
    """
    with open(pool_file) as f:
        data = json.load(f)

    # Check if embeddings exist
    if data['examples'][0].get('embedding') is None:
        print("No cached embeddings found. Generating...")
        generate_and_cache_embeddings(pool_file)
        # Reload
        with open(pool_file.replace('.json', '_embedded.json')) as f:
            data = json.load(f)

    # Convert embedding lists back to numpy arrays, then dicts to dataclasses
    for ex in data['examples']:
        ex['embedding'] = np.array(ex['embedding'])
    data['examples'] = [Example(**ex) for ex in data['examples']]

    return ExamplePool(**data)

7.3 The Coverage Gap

Symptom: Certain query types never get good examples

Diagnosis: Check example distribution

def analyze_example_coverage(pool: ExamplePool):
    """
    Report coverage by category
    """
    from collections import Counter

    categories = Counter(ex.category for ex in pool.examples)

    print("Example Distribution:")
    for category, count in categories.most_common():
        percentage = (count / len(pool.examples)) * 100
        print(f"  {category}: {count} ({percentage:.1f}%)")

    # Warn about underrepresented categories
    for category, count in categories.items():
        if count < 3:
            print(f"⚠ Category '{category}' has only {count} examples (need at least 3)")

7.4 The Similarity Collapse

Symptom: All similarity scores cluster near the top (0.9+) or near the bottom (below 0.1)

Cause: Using wrong similarity metric or embeddings not normalized

Debug:

def debug_similarity_scores(query: str, pool: ExamplePool):
    """
    Print similarity distribution for debugging
    """
    query_emb = generate_embedding(query)

    scores = []
    for ex in pool.examples:
        score = cosine_similarity(query_emb, np.array(ex.embedding))
        scores.append((score, ex.id, ex.input[:40]))

    scores.sort(reverse=True)

    print(f"Similarity distribution for query: '{query}'")
    print(f"Top 5:")
    for score, ex_id, text in scores[:5]:
        print(f"  {score:.3f} - {ex_id}: {text}...")

    print(f"\nBottom 5:")
    for score, ex_id, text in scores[-5:]:
        print(f"  {score:.3f} - {ex_id}: {text}...")

    # Statistics
    all_scores = [s for s, _, _ in scores]
    print(f"\nStats: min={min(all_scores):.3f}, max={max(all_scores):.3f}, "
          f"mean={np.mean(all_scores):.3f}, std={np.std(all_scores):.3f}")

8. Extensions

8.1 Beginner Extensions

Extension 1: Keyword Fallback

If semantic similarity fails (offline mode), fall back to keyword matching:

def select_with_fallback(query: str, pool: ExamplePool, k: int = 3):
    try:
        return select_diverse_top_k(query, pool, k)
    except Exception as e:
        print(f"Semantic selection failed: {e}. Falling back to keywords.")
        return select_by_keywords(query, pool, k)

Extension 2: Example Previews

Show example previews before sending to model:

$ python curator.py --query "refund" --pool examples.json --preview

Selected examples:
1. [ex_12] "I want a refund for order #123"
   Output: {"action": "process_refund", "policy": "30_day_window"}

2. [ex_45] "My order was damaged, need refund"
   Output: {"action": "process_refund", "reason": "damaged_goods"}

Continue? (y/n):

8.2 Intermediate Extensions

Extension 3: Multi-Query Caching

Cache selected examples for similar queries:

from functools import lru_cache

@lru_cache(maxsize=1000)
def select_examples_cached(query: str, pool_version: str, k: int):
    """
    Cache selections for frequently asked queries

    pool_version ensures cache invalidation when pool changes
    """
    pool = load_example_pool(f"pools/{pool_version}.json")
    return select_with_negative(query, pool, k)

Extension 4: Category Balancing

Ensure examples span multiple categories:

def select_category_balanced(query: str, pool: ExamplePool, k: int = 3):
    """
    Select at least one example from each major category
    """
    from collections import defaultdict

    # Group examples by category
    by_category = defaultdict(list)
    for ex in pool.examples:
        by_category[ex.category].append(ex)

    # Get top candidate from each category
    candidates = []
    for category, examples in by_category.items():
        category_pool = ExamplePool(
            version=pool.version,
            examples=examples,
            embedding_model=pool.embedding_model,
            last_updated=pool.last_updated
        )
        top = select_top_k(query, category_pool, k=1)
        if top:
            candidates.extend(top)

    # Select k most similar from candidates
    return select_top_k_from_list(query, candidates, k)
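
select_top_k_from_list applies the same ranking as select_top_k to an explicit candidate list rather than a whole pool. A minimal sketch:

def select_top_k_from_list(query: str, examples: List[Example], k: int) -> List[Example]:
    """Rank a candidate list by similarity to the query and keep the top k"""
    query_emb = generate_embedding(query)
    scored = [
        (cosine_similarity(query_emb, np.array(ex.embedding)), ex)
        for ex in examples
    ]
    scored.sort(reverse=True, key=lambda x: x[0])
    return [ex for _, ex in scored[:k]]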

8.3 Advanced Extensions

Extension 5: Active Learning for Example Curation

Track which examples are most effective and prioritize them:

def track_example_effectiveness(
    selected_examples: List[Example],
    test_result: TestResult
):
    """
    Track which examples lead to successful outcomes
    """
    # Load effectiveness scores
    scores = load_effectiveness_scores()

    # Update scores based on test outcome
    for ex in selected_examples:
        if test_result.passed:
            scores[ex.id] = scores.get(ex.id, 0.5) + 0.1
        else:
            scores[ex.id] = scores.get(ex.id, 0.5) - 0.05

        # Clamp to [0, 1]
        scores[ex.id] = max(0, min(1, scores[ex.id]))

    save_effectiveness_scores(scores)

def select_with_effectiveness_boost(query: str, pool: ExamplePool, k: int):
    """
    Combine similarity with historical effectiveness
    """
    scores = load_effectiveness_scores()
    query_emb = generate_embedding(query)

    ranked = []
    for ex in pool.examples:
        similarity = cosine_similarity(query_emb, np.array(ex.embedding))
        effectiveness = scores.get(ex.id, 0.5)

        # Combine: 70% similarity, 30% effectiveness
        combined_score = 0.7 * similarity + 0.3 * effectiveness
        ranked.append((combined_score, ex))

    ranked.sort(reverse=True, key=lambda x: x[0])
    return [ex for _, ex in ranked[:k]]
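
load_effectiveness_scores and save_effectiveness_scores are unspecified above. A minimal file-backed sketch (the runs/effectiveness_scores.json path is a placeholder):

import json
from pathlib import Path

SCORES_FILE = Path("runs/effectiveness_scores.json")  # placeholder location

def load_effectiveness_scores() -> Dict[str, float]:
    """Load the {example_id: score} map, defaulting to empty on first run"""
    if SCORES_FILE.exists():
        return json.loads(SCORES_FILE.read_text())
    return {}

def save_effectiveness_scores(scores: Dict[str, float]) -> None:
    SCORES_FILE.parent.mkdir(parents=True, exist_ok=True)
    SCORES_FILE.write_text(json.dumps(scores, indent=2))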

Extension 6: Maximal Marginal Relevance (MMR)

Advanced diversity algorithm used in production search systems:

def select_with_mmr(
    query: str,
    pool: ExamplePool,
    k: int = 3,
    lambda_param: float = 0.7
) -> List[Example]:
    """
    Maximal Marginal Relevance selection

    Balances relevance to query vs. diversity from already-selected examples

    lambda_param: 1.0 = only relevance, 0.0 = only diversity
    """
    query_emb = generate_embedding(query)

    # Compute all similarities to query
    similarities = {}
    for ex in pool.examples:
        similarities[ex.id] = cosine_similarity(query_emb, np.array(ex.embedding))

    selected = []
    candidates = list(pool.examples)

    # Pick first example (most similar to query)
    first = max(candidates, key=lambda ex: similarities[ex.id])
    selected.append(first)
    candidates.remove(first)

    # Iteratively select remaining k-1 examples
    for _ in range(k - 1):
        if not candidates:
            break

        mmr_scores = {}
        for candidate in candidates:
            # Relevance to query
            relevance = similarities[candidate.id]

            # Max similarity to any selected example
            max_similarity = max(
                cosine_similarity(
                    np.array(candidate.embedding),
                    np.array(sel.embedding)
                )
                for sel in selected
            )

            # MMR formula
            mmr = lambda_param * relevance - (1 - lambda_param) * max_similarity
            mmr_scores[candidate.id] = mmr

        # Select candidate with highest MMR
        next_ex = max(candidates, key=lambda ex: mmr_scores[ex.id])
        selected.append(next_ex)
        candidates.remove(next_ex)

    return selected

Extension 7: Cross-Encoder Reranking

Use a more powerful model to rerank candidates:

from sentence_transformers import CrossEncoder

def select_with_reranking(query: str, pool: ExamplePool, k: int = 3):
    """
    Two-stage selection:
    1. Use fast bi-encoder to get top 20 candidates
    2. Use slow cross-encoder to rerank to top k
    """
    # Stage 1: Fast retrieval
    candidates = select_top_k(query, pool, k=20)

    # Stage 2: Precise reranking
    reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

    pairs = [(query, ex.input) for ex in candidates]
    scores = reranker.predict(pairs)

    # Sort by reranker scores
    ranked = sorted(zip(scores, candidates), reverse=True, key=lambda x: x[0])

    return [ex for _, ex in ranked[:k]]

9. Real-World Connections

9.1 Industry Applications

1. Customer Support (Intercom, Zendesk)

# Real-world example: Support ticket routing
def route_support_ticket(ticket_text: str):
    # Select examples similar to this ticket
    examples = select_with_negative(ticket_text, support_pool, k=3)

    # Use examples to guide categorization
    prompt = f"""
    Examples of how to categorize tickets:
    {format_examples(examples)}

    Ticket: {ticket_text}
    Category:
    """

    category = llm.complete(prompt)
    return category

2. Code Generation (GitHub Copilot)

GitHub Copilot uses similar code patterns from your codebase as examples:

# Simplified version of Copilot's approach
def generate_code_with_context(current_code: str, cursor_position: int):
    # Find similar code patterns in project
    similar_snippets = find_similar_code(current_code, project_codebase)

    # Use as few-shot examples
    prompt = build_prompt_with_examples(similar_snippets, current_code)

    completion = codegen_model.complete(prompt)
    return completion

3. Content Generation (Jasper, Copy.ai)

Marketing copy generators use brand-specific examples:

def generate_marketing_copy(brief: str, brand: str):
    # Load brand-specific example pool
    brand_pool = load_example_pool(f"brands/{brand}/examples.json")

    # Select examples matching the brief style
    examples = select_with_negative(brief, brand_pool, k=3)

    # Generate with brand voice
    copy = generate_with_examples(brief, examples)
    return copy

9.2 Research Applications

Academic Use Case: Few-Shot Classification

Researchers use dynamic example selection for text classification:

# Research: "SetFit: Efficient Few-Shot Learning Without Prompts"
def few_shot_classify(text: str, labeled_examples: List, categories: List[str]):
    # Select most similar labeled examples
    selected = select_top_k(text, labeled_examples, k=8)

    # Fine-tune small model on selected examples (SetFit approach)
    model = train_on_examples(selected)

    # Classify new text
    category = model.predict(text)
    return category

9.3 Production Patterns

Pattern 1: Hybrid Static + Dynamic

def hybrid_selection(query: str, pool: ExamplePool, k: int = 3):
    """
    Always include 1 canonical example + k-1 dynamic examples
    """
    # Canonical example (hand-picked, always included)
    canonical = pool.get_example_by_id("canonical_1")

    # Dynamic examples
    dynamic = select_diverse_top_k(query, pool, k=k-1)

    return [canonical] + dynamic

Pattern 2: Example Warming

def warm_example_cache(common_queries: List[str], pool: ExamplePool):
    """
    Pre-compute selections for common queries at deployment time
    """
    cache = {}
    for query in common_queries:
        cache[query] = select_with_negative(query, pool, k=3)

    save_to_redis(cache)
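
save_to_redis is left undefined. One minimal sketch using redis-py, storing only example IDs and re-hydrating from the pool at read time (the examples:<query> key naming is an assumption):

import json
import redis  # assumes redis-py and a reachable Redis instance

def save_to_redis(cache: Dict[str, List[Example]]) -> None:
    """Store selected example IDs per query under an examples:<query> key"""
    r = redis.Redis()
    for query, examples in cache.items():
        r.set(f"examples:{query}", json.dumps([ex.id for ex in examples]))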

10. Resources

10.1 Books

Topic Book Chapter
Few-Shot Learning Theory “AI Engineering” by Chip Huyen Ch. 5 (Prompt Engineering & In-Context Learning)
Data Selection Strategies “Hands-On Machine Learning” by Géron Ch. 2 (End-to-End ML Project, Stratified Sampling)
Semantic Similarity “Introduction to Information Retrieval” by Manning Ch. 6 (Scoring, Term Weighting & Vector Space Model)
Vector Search “Introduction to Information Retrieval” by Manning Ch. 18 (Latent Semantic Indexing)
Embeddings Fundamentals “Speech and Language Processing” by Jurafsky & Martin Ch. 6 (Vector Semantics)
Evaluation Metrics “Hands-On Machine Learning” by Géron Ch. 3 (Classification Metrics)
Sampling Strategies “Designing Data-Intensive Applications” by Kleppmann Ch. 10 (Batch Processing, section on sampling)

10.2 Papers

  1. “Language Models are Few-Shot Learners” (Brown et al., 2020)
    • The GPT-3 paper introducing few-shot prompting
    • https://arxiv.org/abs/2005.14165
  2. “What Makes Good In-Context Examples for GPT-3?” (Liu et al., 2021)
    • Analysis of example selection strategies
    • https://arxiv.org/abs/2101.06804
  3. “Lost in the Middle: How Language Models Use Long Contexts” (Liu et al., 2023)
    • Shows position effects in context usage
    • https://arxiv.org/abs/2307.03172
  4. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks” (Reimers & Gurevych, 2019)
    • Foundation for sentence-transformers library
    • https://arxiv.org/abs/1908.10084

10.3 Libraries & Tools

# Essential libraries
pip install sentence-transformers  # Embeddings
pip install numpy                  # Vector operations
pip install faiss-cpu              # Fast similarity search (optional)
pip install chromadb               # Vector database (optional)

# For advanced features
pip install openai                 # OpenAI embeddings API
pip install cohere                 # Cohere embeddings
pip install pinecone-client        # Production vector DB

10.4 Online Resources

  • Prompt Engineering Guide: https://www.promptingguide.ai/
    • Section on Few-Shot Prompting
  • OpenAI Cookbook: https://cookbook.openai.com/
    • Examples of embedding-based search
  • Sentence-Transformers Documentation: https://www.sbert.net/
    • Model selection guide, performance benchmarks

11. Self-Assessment Checklist

Core Understanding

  • I can explain the difference between zero-shot, one-shot, and few-shot prompting
    • Test: Explain to someone in 2 minutes without notes
  • I understand why few-shot examples improve model performance
    • Test: What mechanism allows LLMs to learn from examples at inference time?
  • I can calculate cosine similarity between two vectors by hand
    • Test: Given vectors [1,0] and [0.7, 0.7], compute similarity
  • I understand the primacy and recency effects
    • Test: Where should you place your most important example and why?
  • I can explain the bias trap with concrete examples
    • Test: Give 3 types of bias that can leak through few-shot examples

Implementation Skills

  • I can load and parse an example pool from JSON
    • Evidence: Show your load_example_pool() function
  • I’ve implemented cosine similarity from scratch
    • Evidence: Working implementation without using libraries
  • I’ve generated embeddings using sentence-transformers
    • Evidence: Code that generates and caches embeddings
  • I’ve implemented top-K selection
    • Evidence: Returns K most similar examples
  • I’ve implemented diversity filtering
    • Evidence: Selected examples are not too similar to each other
  • I’ve integrated with Project 1 harness
    • Evidence: Comparison report showing static vs. dynamic performance

Measurement & Analysis

  • I’ve measured the improvement from dynamic selection
    • Evidence: Report showing percentage point improvement
  • I’ve identified which categories benefit most
    • Evidence: Category breakdown showing gains
  • I’ve debugged similarity scoring issues
    • Evidence: Used debug function to analyze score distribution
  • I’ve analyzed my example pool for bias
    • Evidence: Ran linter to detect name/tone bias

Growth

  • I can identify when dynamic examples are worth the complexity
    • Application: Give criteria for when to use static vs. dynamic
  • I’ve documented lessons learned
    • What surprised you during implementation?
    • What would you do differently next time?
  • I can explain this project in a job interview
    • Practice: 2-minute explanation covering problem, solution, results
  • I understand production considerations
    • How would you reduce selection latency?
    • How would you handle example pool updates?
    • What metrics would you track?

12. Submission / Completion Criteria

Minimum Viable Completion

To consider this project “complete” at a basic level:

  • Can load example pool from JSON
    • Parses examples with all required fields
    • Handles missing/malformed data gracefully
  • Can generate embeddings
    • Uses sentence-transformers or equivalent
    • Caches embeddings to avoid recomputation
  • Implements top-K selection
    • Correctly computes cosine similarity
    • Returns K most similar examples
  • Shows improvement over static examples
    • Integrated with Project 1 harness
    • Report shows measurable accuracy gain

Proof of Completion:

  • Screenshot of CLI showing selected examples
  • Comparison report showing improvement
  • Code walkthrough of selection algorithm

Full Completion

All minimum criteria plus:

  • Diversity filtering implemented
    • Selected examples are not too similar to each other
    • Configurable similarity threshold
  • Negative example injection
    • Always includes one refusal pattern
    • Handles cases where no negative examples exist
  • Detailed logging
    • Saves selection decisions with scores
    • Audit trail for debugging
  • CLI with multiple strategies
    • Supports top-k, diverse, negative injection
    • Clear help text and examples
  • Performance testing
    • Selection completes in <100ms for a pool of 1,000 examples
    • Handles pools of 1000+ examples

Proof of Completion:

  • Public GitHub repository
  • README with usage examples
  • Passing unit tests
  • Performance benchmarks

Excellence (Going Above & Beyond)

All full completion criteria plus any 2+ of:

  • MMR or advanced selection algorithm
    • Implements Maximal Marginal Relevance
    • Demonstrates improvement over greedy diversity
  • Active learning / effectiveness tracking
    • Tracks which examples lead to success
    • Adapts selection based on historical performance
  • Cross-encoder reranking
    • Two-stage selection (fast retrieval + precise reranking)
    • Measurably better than single-stage
  • Example pool linting & analysis
    • Detects bias (names, tone, length)
    • Reports coverage gaps
    • Suggests additions
  • Production features
    • Example caching for common queries
    • Support for multiple pools (per-domain)
    • API endpoint for example selection service

Proof of Completion:

  • Blog post with implementation details
  • Video demo
  • Contribution to open-source project
  • Production deployment

Appendix: Sample Files

Example Pool (examples/support_pool.json)

{
  "version": "1.0.0",
  "embedding_model": "all-MiniLM-L6-v2",
  "last_updated": "2024-12-27",
  "examples": [
    {
      "id": "ex_12",
      "input": "I forgot my username and can't log in",
      "output": {
        "category": "account_access",
        "action": "username_recovery",
        "steps": ["verify_email", "send_username"],
        "citation": "help_doc_5"
      },
      "tags": ["account", "simple", "success"],
      "category": "account_access",
      "complexity": "simple",
      "outcome_type": "success",
      "embedding": null,
      "metadata": {
        "created_by": "support_team",
        "created_at": "2024-12-01",
        "usage_count": 45
      }
    },
    {
      "id": "ex_45",
      "input": "My account is locked due to too many password attempts",
      "output": {
        "category": "account_security",
        "action": "unlock_account",
        "steps": ["verify_identity", "send_unlock_link"],
        "citation": "security_policy_3"
      },
      "tags": ["account", "medium", "success"],
      "category": "account_security",
      "complexity": "medium",
      "outcome_type": "success",
      "embedding": null,
      "metadata": {}
    },
    {
      "id": "ex_3",
      "input": "My laptop screen is cracked and I need it fixed",
      "output": {
        "category": "out_of_scope",
        "action": "polite_refusal",
        "reason": "Hardware repairs require physical service. Please contact device manufacturer.",
        "citation": "scope_policy_1"
      },
      "tags": ["hardware", "refusal", "negative"],
      "category": "out_of_scope",
      "complexity": "simple",
      "outcome_type": "refusal",
      "embedding": null,
      "metadata": {}
    },
    {
      "id": "ex_23",
      "input": "How do I reset my security questions?",
      "output": {
        "category": "account_security",
        "action": "security_question_reset",
        "steps": ["verify_email", "navigate_to_settings", "update_questions"],
        "citation": "help_doc_12"
      },
      "tags": ["account", "simple", "success"],
      "category": "account_security",
      "complexity": "simple",
      "outcome_type": "success",
      "embedding": null,
      "metadata": {}
    }
  ]
}

Embedding Generation Script (scripts/generate_embeddings.py)

#!/usr/bin/env python3
"""
Generate embeddings for example pool
Usage: python generate_embeddings.py examples/support_pool.json
"""

import json
import sys
from sentence_transformers import SentenceTransformer

def generate_embeddings(pool_file: str, model_name: str = "all-MiniLM-L6-v2"):
    """Generate and cache embeddings for all examples"""

    print(f"Loading example pool: {pool_file}")
    with open(pool_file) as f:
        pool = json.load(f)

    print(f"Loading embedding model: {model_name}")
    model = SentenceTransformer(model_name)

    print(f"Generating embeddings for {len(pool['examples'])} examples...")
    for i, example in enumerate(pool['examples'], 1):
        text = example['input']
        embedding = model.encode([text])[0]
        example['embedding'] = embedding.tolist()

        if i % 10 == 0:
            print(f"  Processed {i}/{len(pool['examples'])}")

    # Save with embeddings
    output_file = pool_file.replace('.json', '_embedded.json')
    with open(output_file, 'w') as f:
        json.dump(pool, f, indent=2)

    print(f"✓ Saved embeddings to: {output_file}")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python generate_embeddings.py <pool_file.json>")
        sys.exit(1)

    generate_embeddings(sys.argv[1])

End of Project 5: Few-Shot Example Curator