Project 9: Agent Evaluation Harness
Build an evaluation harness that runs agents against test suites, measures success rates, and tracks regressions.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 16–24 hours |
| Language | Python or JavaScript |
| Prerequisites | Projects 2–8, testing foundations |
| Key Topics | evals, benchmarks, regression tracking |
Learning Objectives
By completing this project, you will:
- Define evaluation datasets for agent tasks.
- Run deterministic test suites with mock tools.
- Compute success metrics and failure categories.
- Track regressions across versions.
- Generate reports for comparisons.
The Core Question You’re Answering
“How do you know the agent is improving instead of quietly getting worse?”
Without evals, you are flying blind.
Concepts You Must Understand First
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Test harness design | Repeatable evaluation | QA/testing guides |
| Mock tools | Deterministic runs | Unit testing |
| Regression detection | Prevent quality drift | CI practices |
| Metrics | Success rate, latency, cost | Observability basics |
Theoretical Foundation
Agent Evals as Quality Gates
Agent -> Test Suite -> Metrics -> Regression Check
This is how you detect quality regressions after prompt or model changes.
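A minimal sketch of that gate, assuming you already have a `run_agent` callable and a list of tasks with expected outcomes (both names are placeholders you define in the harness, not library APIs):

```python
# Sketch only: run_agent, tasks, and baseline_success_rate are placeholders
# defined elsewhere in the harness.

def run_quality_gate(run_agent, tasks, baseline_success_rate):
    """Run the agent on every task, score it, and flag a regression."""
    results = [run_agent(task) == task["expected"] for task in tasks]
    success_rate = sum(results) / len(results)
    return {
        "success_rate": success_rate,
        "baseline": baseline_success_rate,
        "regression": success_rate < baseline_success_rate,
    }
```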
Project Specification
What You’ll Build
A harness that runs the agent on fixed tasks, scores performance, and compares against baselines.
Functional Requirements
- Test suite with expected outcomes (see the example task schema below)
- Mock tools for deterministic execution
- Metrics (success rate, latency, cost)
- Regression detection vs baseline
- JSON/HTML reports
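The exact task shape is up to you; one possible schema, with field names that are illustrative rather than any standard:

```python
from dataclasses import dataclass, field

@dataclass
class EvalTask:
    """One fixed task in the evaluation suite."""
    task_id: str
    prompt: str                              # what the agent is asked to do
    expected: str                            # outcome marker to check for in the answer
    tools_allowed: list[str] = field(default_factory=list)
    timeout_s: float = 30.0                  # per-task budget so hangs become failures

# A tiny suite of fixed tasks (contents invented for illustration).
SUITE = [
    EvalTask("t1", "Summarize report.txt and cite the source", expected="report.txt"),
    EvalTask("t2", "Look up the refund window in the policy doc", expected="30 days"),
]
```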
Non-Functional Requirements
- Repeatable results
- Clear failure categorization
- Configurable datasets (e.g., loaded from a file, as sketched below)
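A configurable dataset can be as simple as a JSON file chosen per run; a sketch, assuming a hypothetical `tasks.json` whose entries mirror the task schema above:

```python
import json
from pathlib import Path

def load_suite(path: str) -> list[dict]:
    """Load eval tasks from a JSON file so suites can be swapped without code changes."""
    tasks = json.loads(Path(path).read_text())
    for task in tasks:
        # Fail fast if a task is missing required fields.
        missing = {"task_id", "prompt", "expected"} - task.keys()
        if missing:
            raise ValueError(f"task {task.get('task_id', '?')} is missing {missing}")
    return tasks
```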
Real World Outcome
Example regression report:
```json
{
  "success_rate": 0.86,
  "baseline": 0.91,
  "regression": true,
  "failures": ["missing citation", "timeout"]
}
```
Architecture Overview
```
┌──────────────┐   tasks    ┌───────────────┐
│  Test Suite  │───────────▶│ Agent Runner  │
└──────┬───────┘            └──────┬────────┘
       │                           ▼
       ▼                    ┌──────────────┐
┌──────────────┐            │   Metrics    │
│   Baseline   │◀──────────▶│  Regression  │
└──────────────┘            └──────────────┘
```
Implementation Guide
Phase 1: Test Runner (4–6h)
- Build deterministic mock tools (see the sketch below)
- Checkpoint: tasks run repeatably
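One way to make runs deterministic is to swap real tools for canned responses keyed on the input; a minimal sketch (the tool name and responses are invented):

```python
class MockTool:
    """Returns canned responses so agent runs are repeatable."""

    def __init__(self, name: str, responses: dict[str, str], default: str = "no result"):
        self.name = name
        self.responses = responses    # exact-match lookup: input -> output
        self.default = default
        self.calls: list[str] = []    # record every call for later assertions

    def __call__(self, query: str) -> str:
        self.calls.append(query)
        return self.responses.get(query, self.default)

# Hypothetical example: a "search" tool that always returns the same snippet.
search = MockTool("search", {"refund policy": "Refunds are accepted within 30 days."})
```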
Phase 2: Metrics + Reporting (5–8h)
- Compute success/failure metrics (see the aggregation sketch below)
- Generate report output
- Checkpoint: baseline comparison works
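A sketch of the aggregation step, assuming each task run produces a record with `success`, `latency_s`, `cost_usd`, and an optional `failure` label (all assumed names):

```python
import json
import statistics

def build_report(records: list[dict]) -> dict:
    """Aggregate per-task records into a summary report."""
    failures = [r["failure"] for r in records if not r["success"]]
    return {
        "success_rate": sum(r["success"] for r in records) / len(records),
        "median_latency_s": statistics.median(r["latency_s"] for r in records),
        "total_cost_usd": round(sum(r["cost_usd"] for r in records), 4),
        "failures": failures,
    }

def write_report(report: dict, path: str = "report.json") -> None:
    """Write the JSON report; an HTML version can be templated from the same dict."""
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
```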
Phase 3: Regression Alerts (5–8h)
- Detect regressions vs baseline (see the comparison sketch below)
- Checkpoint: alerts fire on drops
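Regression detection can be a straight comparison against a stored baseline report, with a small tolerance so normal noise does not fire alerts; a sketch (the 0.02 threshold is arbitrary):

```python
import json
import sys

def check_regression(report: dict, baseline_path: str = "baseline.json",
                     tolerance: float = 0.02) -> bool:
    """Return True if success rate dropped more than `tolerance` below the baseline."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    return (baseline["success_rate"] - report["success_rate"]) > tolerance

if __name__ == "__main__":
    # In CI, a nonzero exit code fails the build when a regression is detected.
    with open("report.json") as f:
        current = json.load(f)
    if check_regression(current):
        print("REGRESSION: success rate dropped below baseline")
        sys.exit(1)
```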
Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Flaky tests | inconsistent scores | mock tool outputs |
| No baseline | no regression context | store baseline reports |
| Unclear failures | debugging slow | add failure categories |
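To keep failures debuggable, attach a coarse category when scoring each run; a sketch with illustrative category names:

```python
def categorize_failure(task: dict, output: str, elapsed_s: float) -> str:
    """Assign a coarse failure category for reporting (categories are illustrative)."""
    if elapsed_s > task.get("timeout_s", 30.0):
        return "timeout"
    if not output.strip():
        return "empty_output"
    if task["expected"] not in output:
        return "wrong_answer"
    return "other"
```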
Interview Questions They’ll Ask
- Why is deterministic evaluation critical?
- How do you design a good agent test suite?
- What metrics indicate regression?
Hints in Layers
- Hint 1: Start with 5–10 fixed tasks.
- Hint 2: Mock tools to remove randomness.
- Hint 3: Compute success rate + latency.
- Hint 4: Compare against a stored baseline.
Learning Milestones
- Repeatable: same scores every run.
- Measurable: metrics and reports generated.
- Safe: regression alerts catch drops.
Submission / Completion Criteria
Minimum Completion
- Deterministic test runner
Full Completion
- Baseline comparison + reports
Excellence
- CI integration
- Multi-model comparisons
This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/AI_AGENTS_PROJECTS.md.