Project 9: Agent Evaluation Harness

Build an evaluation harness that runs agents against test suites, measures success rates, and tracks regressions.


Quick Reference

Attribute      Value
Difficulty     Level 4: Expert
Time Estimate  16–24 hours
Language       Python or JavaScript
Prerequisites  Projects 2–8, testing foundations
Key Topics     Evals, benchmarks, regression tracking

Learning Objectives

By completing this project, you will:

  1. Define evaluation datasets for agent tasks.
  2. Run deterministic test suites with mock tools.
  3. Compute success metrics and failure categories.
  4. Track regressions across versions.
  5. Generate reports for comparisons.

The Core Question You’re Answering

“How do you know the agent is improving instead of quietly getting worse?”

Without evals, you are flying blind.


Concepts You Must Understand First

Concept               Why It Matters               Where to Learn
Test harness design   Repeatable evaluation        QA/testing guides
Mock tools            Deterministic runs           Unit testing
Regression detection  Prevent quality drift        CI practices
Metrics               Success rate, latency, cost  Observability basics

Theoretical Foundation

Agent Evals as Quality Gates

Agent -> Test Suite -> Metrics -> Regression Check

This is how you detect quality regressions after prompt or model changes.
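
A minimal sketch of this flow in Python, assuming a run_agent(prompt) callable and a per-task expected substring (both are placeholder names, not a prescribed interface):

def quality_gate(tasks, run_agent, baseline_success_rate, tolerance=0.02):
    # Run every task, score it, and flag a regression vs the stored baseline.
    passed = 0
    for task in tasks:
        output = run_agent(task["prompt"])
        if task["expected"] in output:   # naive scoring, for illustration only
            passed += 1
    success_rate = passed / len(tasks)
    return {
        "success_rate": success_rate,
        "baseline": baseline_success_rate,
        "regression": success_rate < baseline_success_rate - tolerance,
    }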


Project Specification

What You’ll Build

A harness that runs the agent on fixed tasks, scores performance, and compares against baselines.

Functional Requirements

  1. Test suite with expected outcomes (a task schema is sketched after this list)
  2. Mock tools for deterministic execution
  3. Metrics (success, latency, cost)
  4. Regression detection vs baseline
  5. JSON/HTML reports
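
One possible task schema, written as a plain dataclass (field names and defaults are assumptions for illustration, not a required format):

from dataclasses import dataclass, field

@dataclass
class EvalTask:
    """A single agent test case with a deterministic expected outcome."""
    task_id: str
    prompt: str                          # input handed to the agent
    expected_substrings: list[str]       # strings the final answer must contain
    allowed_tools: list[str] = field(default_factory=list)
    timeout_s: float = 30.0
    max_cost_usd: float = 0.05

example = EvalTask(
    task_id="cite-001",
    prompt="Summarize the attached report and cite its title.",
    expected_substrings=["Quarterly Report"],
    allowed_tools=["read_file"],
)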

Non-Functional Requirements

  • Repeatable results
  • Clear failure categorization
  • Configurable datasets

Real World Outcome

Example regression report:

{
  "success_rate": 0.86,
  "baseline": 0.91,
  "regression": true,
  "failures": ["missing citation", "timeout"]
}
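
A sketch of how per-task results could be rolled up into a report with this shape (the result record fields are assumptions):

def build_report(results, baseline_success_rate):
    # results: list of dicts like {"passed": bool, "failure_reason": str | None}
    passed = sum(1 for r in results if r["passed"])
    success_rate = round(passed / len(results), 2)
    return {
        "success_rate": success_rate,
        "baseline": baseline_success_rate,
        "regression": success_rate < baseline_success_rate,
        "failures": sorted({r["failure_reason"] for r in results if not r["passed"]}),
    }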

Architecture Overview

┌──────────────┐   tasks   ┌───────────────┐
│ Test Suite   │──────────▶│ Agent Runner  │
└──────┬───────┘           └──────┬────────┘
       │                           ▼
       ▼                    ┌──────────────┐
┌──────────────┐            │ Metrics      │
│ Baseline     │◀──────────▶│ Regression   │
└──────────────┘            └──────────────┘

Implementation Guide

Phase 1: Test Runner (4–6h)

  • Build deterministic mock tools (sketched below)
  • Checkpoint: tasks run repeatably
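
A deterministic mock tool can simply replay canned outputs; this is a minimal sketch, and the call interface is an assumption to adapt to your agent's tool protocol:

class MockTool:
    """Replays canned outputs keyed by input so every run is identical."""
    def __init__(self, name, canned_responses):
        self.name = name
        self.canned = canned_responses   # dict: query -> fixed output
        self.calls = []                  # recorded for later assertions

    def __call__(self, query):
        self.calls.append(query)
        return self.canned.get(query, "TOOL_ERROR: no canned response")

search = MockTool("web_search", {"capital of France": "Paris is the capital of France."})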

Phase 2: Metrics + Reporting (5–8h)

  • Compute success/failure metrics (sketched below)
  • Generate report output
  • Checkpoint: baseline comparison works
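
A sketch of metric aggregation and report output, assuming each run record carries pass/fail, latency, cost, and an optional failure reason (all field names are illustrative):

import json
import statistics

def compute_metrics(run_records):
    failed = [r for r in run_records if not r["passed"]]
    return {
        "success_rate": sum(r["passed"] for r in run_records) / len(run_records),
        "p50_latency_s": statistics.median(r["latency_s"] for r in run_records),
        "total_cost_usd": round(sum(r["cost_usd"] for r in run_records), 4),
        "failure_counts": {
            reason: sum(1 for r in failed if r["failure_reason"] == reason)
            for reason in {r["failure_reason"] for r in failed}
        },
    }

def write_report(metrics, path="eval_report.json"):
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)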

Phase 3: Regression Alerts (5–8h)

  • Detect regressions vs baseline (sketched below)
  • Checkpoint: alerts fire on drops
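
Regression detection can be a simple threshold check against the stored baseline report; the baseline path, tolerance, and non-zero exit are assumptions that happen to suit CI use:

import json
import sys

def check_regression(current_report, baseline_path="baseline.json", tolerance=0.02):
    # Fail loudly when success rate drops more than `tolerance` below the baseline.
    with open(baseline_path) as f:
        baseline = json.load(f)
    drop = baseline["success_rate"] - current_report["success_rate"]
    if drop > tolerance:
        print(f"REGRESSION: success rate fell {drop:.2%} below baseline")
        sys.exit(1)   # non-zero exit blocks the change in CI
    print("No regression detected")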

Common Pitfalls & Debugging

Pitfall           Symptom                Fix
Flaky tests       Inconsistent scores    Mock tool outputs
No baseline       No regression context  Store baseline reports
Unclear failures  Slow debugging         Add failure categories

Interview Questions They’ll Ask

  1. Why is deterministic evaluation critical?
  2. How do you design a good agent test suite?
  3. What metrics indicate regression?

Hints in Layers

  • Hint 1: Start with 5–10 fixed tasks.
  • Hint 2: Mock tools to remove randomness.
  • Hint 3: Compute success rate + latency.
  • Hint 4: Compare against a stored baseline (baseline handling sketched below).
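
For Hint 4, the baseline can simply be a reviewed report stored on disk; a minimal sketch, where the file name is an assumption:

import json
import os

def load_or_init_baseline(report, path="baseline.json"):
    # Use the stored baseline if it exists; otherwise promote this run's report.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    return report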

Learning Milestones

  1. Repeatable: same scores every run.
  2. Measurable: metrics and reports generated.
  3. Safe: regression alerts catch drops.

Submission / Completion Criteria

Minimum Completion

  • Deterministic test runner

Full Completion

  • Baseline comparison + reports

Excellence

  • CI integration
  • Multi-model comparisons

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/AI_AGENTS_PROJECTS.md.