Project 14: Distilling a Coding Agent

Distill a larger coding agent into a smaller, cheaper model while preserving task success rates.

Quick Reference

| Attribute | Value |
|---|---|
| Difficulty | Level 5: Expert |
| Time Estimate | 3-4 weeks |
| Language | Python |
| Prerequisites | Distillation, eval harnesses |
| Key Topics | Agent distillation, evals, regression testing |

1. Learning Objectives

By completing this project, you will:

  1. Capture teacher traces for coding tasks.
  2. Train a student model on agent traces.
  3. Evaluate task success rates.
  4. Measure latency and cost improvements.
  5. Prevent regressions with eval harnesses.

2. Theoretical Foundation

2.1 Distilling Agents

Agent distillation records the teacher agent's traces (prompts, intermediate decisions, tool calls, and outcomes) and uses them as supervised training data for a smaller student model, so the student reproduces the teacher's behavior at a fraction of the inference cost.
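
A minimal sketch of that first conversion step, assuming a simple trace layout with a task prompt plus a list of action/observation steps; the field and function names below are illustrative, not part of any specific framework.

```python
# Sketch: flatten one teacher trace into (prompt, target) pairs for
# supervised fine-tuning (behavior cloning). Field names are assumptions.

def format_prompt(task: str, history: list[dict]) -> str:
    """Render the task plus prior steps into a single prompt string."""
    lines = [f"Task: {task}"]
    for h in history:
        lines.append(f"Action: {h['action']}")
        lines.append(f"Observation: {h['observation']}")
    lines.append("Next action:")
    return "\n".join(lines)

def trace_to_examples(trace: dict) -> list[dict]:
    """Each teacher step becomes one training pair: the student sees the
    task and the history so far, and must predict the teacher's next action."""
    examples, history = [], []
    for step in trace["steps"]:
        examples.append({
            "prompt": format_prompt(trace["task_prompt"], history),
            "target": step["action"],
        })
        history.append(step)
    return examples
```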


3. Project Specification

3.1 What You Will Build

A distillation pipeline that learns from a teacher coding agent and evaluates the student on coding tasks.

3.2 Functional Requirements

  1. Trace collection from the teacher agent (see the trace schema sketch after this list).
  2. Student training on traces.
  3. Evaluation harness for coding tasks.
  4. Regression metrics vs teacher.
  5. Cost/latency report.
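
One way to satisfy requirements 1 and 3 is a small JSONL trace store; the record layout below is an assumption for illustration, not a required schema.

```python
# Sketch of a trace record and an append-only JSONL store for reproducibility.
# Field names are illustrative assumptions.
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class TraceStep:
    action: str       # tool call, shell command, or code edit issued by the teacher
    observation: str  # environment response (test output, file diff, error message)

@dataclass
class Trace:
    task_id: str
    task_prompt: str
    steps: list[TraceStep] = field(default_factory=list)
    outcome: str = "unknown"        # "success" / "failure" per the task's success criteria
    teacher_model: str = ""         # provenance for reproducibility
    created_at: float = field(default_factory=time.time)
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def append_trace(path: str, trace: Trace) -> None:
    """Append one trace as a JSON line so stored runs can be re-read later."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(trace)) + "\n")
```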

3.3 Non-Functional Requirements

  • Deterministic evaluation on fixed tasks.
  • Clear success criteria for tasks.
  • Trace storage for reproducibility.

4. Solution Architecture

4.1 Components

| Component | Responsibility |
|---|---|
| Trace Collector | Record teacher behavior |
| Trainer | Distill student model |
| Evaluator | Run coding task suite |
| Reporter | Compare costs and accuracy |
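
As a rough sketch, the Reporter can reduce per-task eval records to the cost/accuracy comparison below; the record keys (`success`, `cost_usd`, `latency_s`) are assumed names.

```python
# Sketch of the Reporter: aggregate per-task records and compare student vs teacher.

def summarize(records: list[dict]) -> dict:
    """Reduce per-task records to success rate, mean cost, and mean latency."""
    n = len(records)
    return {
        "tasks": n,
        "success_rate": sum(r["success"] for r in records) / n,
        "mean_cost_usd": sum(r["cost_usd"] for r in records) / n,
        "mean_latency_s": sum(r["latency_s"] for r in records) / n,
    }

def compare(teacher_records: list[dict], student_records: list[dict]) -> dict:
    """Report student-vs-teacher deltas for the cost/latency report."""
    t, s = summarize(teacher_records), summarize(student_records)
    return {
        "teacher": t,
        "student": s,
        "success_rate_delta": s["success_rate"] - t["success_rate"],
        "cost_reduction": 1 - s["mean_cost_usd"] / t["mean_cost_usd"],
        "latency_reduction": 1 - s["mean_latency_s"] / t["mean_latency_s"],
    }
```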

5. Implementation Guide

5.1 Project Structure

QUANTIZATION_DISTILLATION_INFERENCE_OPTIMIZATION_MASTERY/P14-distill-agent/
├── src/
│   ├── collect.py
│   ├── train.py
│   ├── eval.py
│   └── report.py

5.2 Implementation Phases

Phase 1: Trace collection (8-12h)

  • Run the teacher agent on each task (see the collect.py sketch below).
  • Checkpoint: traces stored with outcomes.
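
A minimal collect.py sketch, assuming a hypothetical `teacher_agent.run(prompt)` that returns the steps taken and whether the task's checks passed; filtering to successful traces before training is a common optional step.

```python
# collect.py -- run the teacher on each task and persist traces with outcomes.
# `teacher_agent` and the task dict layout are hypothetical, not a real library.
import json

def collect(teacher_agent, tasks: list[dict], out_path: str) -> None:
    with open(out_path, "a", encoding="utf-8") as f:
        for task in tasks:
            result = teacher_agent.run(task["prompt"])  # assumed to return steps + pass/fail
            record = {
                "task_id": task["id"],
                "task_prompt": task["prompt"],
                "steps": result["steps"],
                "outcome": "success" if result["passed"] else "failure",
            }
            f.write(json.dumps(record) + "\n")
```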

Phase 2: Distillation (8-12h)

  • Train the student on the collected traces (see the train.py sketch below).
  • Checkpoint: student reproduces actions.
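
A bare-bones train.py sketch using Hugging Face Transformers and a plain PyTorch loop; the base checkpoint name, batch size, and learning rate are placeholders, and for simplicity the loss covers the whole prompt+action text (a real pipeline would typically mask the prompt tokens).

```python
# train.py -- distill the student by supervised fine-tuning on teacher traces.
# Checkpoint names and hyperparameters are placeholders.
import json

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_texts(trace_path: str) -> list[str]:
    """Flatten stored traces into prompt+action training texts."""
    texts = []
    with open(trace_path, encoding="utf-8") as f:
        for line in f:
            trace = json.loads(line)
            for step in trace["steps"]:
                texts.append(trace["task_prompt"] + "\n" + step["action"])
    return texts

def train(trace_path: str, base_model: str = "student-base") -> None:
    tok = AutoTokenizer.from_pretrained(base_model)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(base_model)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)
    loader = DataLoader(load_texts(trace_path), batch_size=4, shuffle=True)
    model.train()
    for batch in loader:  # batch is a list of strings
        enc = tok(list(batch), return_tensors="pt", padding=True, truncation=True)
        loss = model(**enc, labels=enc["input_ids"]).loss  # standard causal-LM loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    model.save_pretrained("student-distilled")
    tok.save_pretrained("student-distilled")
```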

Phase 3: Evaluation (8-12h)

  • Compare the student against the teacher on the task suite (see the eval.py sketch below).
  • Checkpoint: regression report produced.
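
A sketch of the Phase 3 comparison; `agent.run(...)` is the same assumed interface as in Phase 1, and the task suite should be fixed and versioned so the numbers are deterministic.

```python
# eval.py -- run teacher and student on the same fixed task suite and flag regressions.

def evaluate(agent, tasks: list[dict]) -> dict[str, bool]:
    """Map task_id -> whether the agent's output passed the task's checks (assumed interface)."""
    return {t["id"]: bool(agent.run(t["prompt"])["passed"]) for t in tasks}

def regression_report(teacher: dict[str, bool], student: dict[str, bool]) -> dict:
    """Tasks the teacher solved but the student did not are reported as regressions."""
    regressions = sorted(tid for tid, ok in teacher.items() if ok and not student.get(tid, False))
    return {
        "teacher_success_rate": sum(teacher.values()) / len(teacher),
        "student_success_rate": sum(student.values()) / len(student),
        "regressed_tasks": regressions,
    }
```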

6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples |
|---|---|---|
| Unit | trace format | schema validation |
| Integration | eval harness | task success rates |
| Regression | metrics | compare vs teacher |

6.2 Critical Test Cases

  1. Student achieves the target success-rate threshold (see the pytest sketch after this list).
  2. Cost per task decreases.
  3. Regression alerts on failed tasks.
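
These cases translate directly into pytest guards; the thresholds and the shape of the `report` fixture (built from the Reporter output in Section 4) are illustrative assumptions.

```python
# test_regressions.py -- guard rails for the critical cases above.
# Thresholds and the `report` fixture are examples, not fixed requirements.
SUCCESS_THRESHOLD = 0.85   # minimum acceptable student success rate
COST_RATIO_MAX = 0.5       # student cost per task should be at most half the teacher's

def test_student_success_rate(report):
    assert report["student"]["success_rate"] >= SUCCESS_THRESHOLD

def test_cost_per_task_decreases(report):
    assert report["student"]["mean_cost_usd"] <= COST_RATIO_MAX * report["teacher"]["mean_cost_usd"]

def test_regressions_are_flagged(report):
    # Any task the teacher solved but the student failed must surface for review.
    assert report["regressed_tasks"] == [], f"Regressed tasks: {report['regressed_tasks']}"
```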

7. Common Pitfalls & Debugging

| Pitfall | Symptom | Fix |
|---|---|---|
| Low student accuracy | task failures | improve trace quality |
| Overfitting to traces | poor generalization | add task diversity |
| Missing evals | silent regressions | maintain the task suite |

8. Extensions & Challenges

Beginner

  • Add task difficulty tiers.
  • Add trace filtering.

Intermediate

  • Add active learning for new tasks.
  • Add task-specific adapters.

Advanced

  • Add multi-teacher distillation.
  • Add long-horizon agent tasks.

9. Real-World Connections

  • Coding copilots benefit from distilled models for cost.
  • Enterprise agents need regression-safe distillation.

10. Resources

  • Distillation papers
  • Agent evaluation frameworks

11. Self-Assessment Checklist

  • I can collect and store agent traces.
  • I can train a student model from traces.
  • I can evaluate regressions reliably.

12. Submission / Completion Criteria

Minimum Completion:

  • Trace collection + student training

Full Completion:

  • Evaluation suite + reports

Excellence:

  • Multi-teacher distillation
  • Long-horizon tasks

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/QUANTIZATION_DISTILLATION_INFERENCE_OPTIMIZATION_MASTERY.md.