Project 9: Evaluation Harness & Red Team
Build an evaluation harness that stress-tests multi-agent workflows with adversarial scenarios.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4 |
| Time Estimate | 20-30 hours |
| Language | Python (Alternatives: TypeScript, Go) |
| Prerequisites | Logging, validation, metrics |
| Key Topics | Testing, adversarial evaluation, metrics |
1. Learning Objectives
By completing this project, you will:
- Design adversarial test scenarios for agent workflows.
- Implement pass/fail criteria and scoring.
- Generate metrics for reliability and cost.
- Produce a report with remediation guidance.
2. Theoretical Foundation
2.1 Core Concepts
- Evaluation Harness: A structured suite of tests for reliability.
- Red Teaming: Adversarial inputs to expose failures.
- Metrics: Quantitative signals of success and cost.
2.2 Why This Matters
Multi-agent systems often fail in subtle ways. A harness makes failure measurable and repeatable.
2.3 Historical Context / Background
Red-team evaluation for agentic systems borrows from software testing and chaos engineering, which probe systems with deliberate, adversarial failure injection.
2.4 Common Misconceptions
- “If it works once, it’s fine.” Reliability requires repeated testing.
- “Adversarial inputs are rare.” Real users behave adversarially.
3. Project Specification
3.1 What You Will Build
A test harness that runs multi-agent workflows against a suite of scenarios and produces a pass/fail report with metrics.
3.2 Functional Requirements
- Scenario Library: Store test cases and expected outcomes.
- Evaluation Runner: Execute tasks across agents.
- Scoring System: Assign pass/fail and confidence.
- Report Generator: Summarize results and failures.
3.3 Non-Functional Requirements
- Repeatability: Results can be compared across runs.
- Coverage: Scenarios include edge cases.
- Clarity: Reports highlight failures and fixes.
3.4 Example Usage / Output
$ run-harness --suite "adversarial"
[Report] pass_rate=72%, avg_cost=0.45
[Failures] 6 scenarios require remediation
3.5 Real World Outcome
You can hand a report to a stakeholder that lists failure cases, their causes, and suggested fixes.
4. Solution Architecture
4.1 High-Level Design
Scenario Library -> Runner -> Evaluator -> Metrics -> Report
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Scenario Library | Store tests | JSON/YAML schema |
| Runner | Execute tasks | Deterministic settings |
| Evaluator | Score outputs | Criteria definitions |
| Report Generator | Summarize results | Human-readable format |
4.3 Data Structures
Pseudo-structures:
STRUCT Scenario:
    id
    input
    expected_output
    failure_mode
STRUCT Result:
    scenario_id
    status
    notes
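The pseudo-structures above could map to Python dataclasses. A minimal sketch (the `"pass"`/`"fail"` status convention and the optional `failure_mode` default are assumptions, not fixed by the spec):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scenario:
    id: str
    input: str
    expected_output: str
    failure_mode: Optional[str] = None  # e.g. "coordination_breakdown"

@dataclass
class Result:
    scenario_id: str
    status: str  # "pass" or "fail" -- an assumed convention
    notes: str = ""
```

Dataclasses keep scenarios serializable and make equality checks between runs trivial.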
4.4 Algorithm Overview
Evaluation Pipeline
- Load scenarios.
- Run workflow.
- Score outputs.
- Aggregate metrics.
Complexity Analysis:
- Time: O(S * W) where S = scenarios, W = workflow steps
- Space: O(S) results
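The pipeline steps above can be sketched in a few lines. Here `run_workflow` stands in for the system under test and `score` for your scoring rule; both are placeholders, not a fixed API:

```python
def evaluate(scenarios, run_workflow, score):
    """Run every scenario through the workflow and aggregate metrics."""
    results = []
    for scenario in scenarios:          # Load scenarios
        output = run_workflow(scenario["input"])           # Run workflow
        passed = score(output, scenario["expected_output"])  # Score outputs
        results.append({"scenario_id": scenario["id"],
                        "status": "pass" if passed else "fail"})
    # Aggregate metrics
    pass_rate = sum(r["status"] == "pass" for r in results) / len(results)
    return {"results": results, "pass_rate": pass_rate}

# Usage with a trivial echo workflow and exact-match scoring:
scenarios = [
    {"id": "s1", "input": "ping", "expected_output": "ping"},
    {"id": "s2", "input": "pong", "expected_output": "PONG"},
]
report = evaluate(scenarios, run_workflow=lambda x: x,
                  score=lambda out, exp: out == exp)
# report["pass_rate"] is 0.5: s1 passes, s2 fails the exact match
```

The loop makes the O(S * W) time bound concrete: one workflow execution per scenario.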
5. Implementation Guide
5.1 Development Environment Setup
Use a fixed runtime configuration to keep test results comparable.
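One way to pin the runtime configuration is to put every behavior-influencing knob in a single constant. The specific keys below are assumptions; use whatever your agent framework actually exposes:

```python
# harness_config.py -- hypothetical module name
RUN_CONFIG = {
    "temperature": 0.0,   # deterministic sampling where the provider supports it
    "seed": 42,           # fixed seed, if the provider honors one
    "max_retries": 2,     # bounded retries keep cost comparable across runs
    "timeout_s": 60,      # uniform timeout so slow runs fail the same way
}
```

Checking this file into version control means a report can always be traced back to the exact configuration that produced it.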
5.2 Project Structure
project-root/
├── scenarios/
├── runner/
├── evaluator/
├── metrics/
└── reports/
5.3 The Core Question You’re Answering
“How do I know my multi-agent system is actually reliable?”
5.4 Concepts You Must Understand First
- Evaluation metrics
  - How do you define pass/fail?
  - Book Reference: “Release It!” - Ch. 4
- Adversarial testing
  - How do you design stress scenarios?
  - Book Reference: “Clean Architecture” - Ch. 11
5.5 Questions to Guide Your Design
- Scenario coverage
  - Which failures are most costly?
- Scoring criteria
  - How will you measure correctness and safety?
5.6 Thinking Exercise
Design an adversarial scenario that tests coordination breakdown and describe expected failure signals.
5.7 The Interview Questions They’ll Ask
- “What is an evaluation harness for agents?”
- “How do you design red-team scenarios?”
- “What metrics matter most?”
- “How do you turn failures into fixes?”
- “How do you compare results across runs?”
5.8 Hints in Layers
Hint 1: Start with known failures. Use scenarios that already caused issues.
Hint 2: Add conflicting requirements. Create tasks with ambiguity.
Hint 3: Define scoring. Use a simple pass/fail plus notes.
Hint 4: Summarize results. Generate a report for stakeholders.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Reliability | “Release It!” | Ch. 4 |
5.10 Implementation Phases
Phase 1: Foundation (6-8 hours)
Goals:
- Build scenario library
- Implement runner
Tasks:
- Define scenario schema
- Load and run scenarios
Checkpoint: Runner executes test suite.
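A minimal way to define and enforce the scenario schema is a required-field check on load. This sketch assumes scenarios are stored as a JSON array with the fields from section 4.3:

```python
import json

REQUIRED_FIELDS = {"id", "input", "expected_output", "failure_mode"}

def load_scenarios(text):
    """Parse a JSON array of scenarios, rejecting malformed entries early."""
    scenarios = json.loads(text)
    for s in scenarios:
        missing = REQUIRED_FIELDS - s.keys()
        if missing:
            raise ValueError(f"scenario {s.get('id', '?')} missing {sorted(missing)}")
    return scenarios

raw = '[{"id": "s1", "input": "task", "expected_output": "done", "failure_mode": null}]'
scenarios = load_scenarios(raw)
```

Failing fast on a missing field keeps schema drift out of the suite; a fuller implementation might use a schema validator instead of a hand-rolled check.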
Phase 2: Core Functionality (6-8 hours)
Goals:
- Add scoring and metrics
- Produce reports
Tasks:
- Implement scoring rules
- Aggregate metrics
Checkpoint: Report shows pass/fail.
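The report format from section 3.4 can be rendered directly from aggregated results. A sketch, assuming each result dict carries a `status` and an optional per-scenario `cost` (both field names are assumptions):

```python
def render_report(results):
    """Summarize results in the harness's stakeholder report format."""
    failures = [r for r in results if r["status"] == "fail"]
    pass_rate = 100 * (len(results) - len(failures)) // len(results)
    avg_cost = sum(r.get("cost", 0.0) for r in results) / len(results)
    lines = [f"[Report] pass_rate={pass_rate}%, avg_cost={avg_cost:.2f}"]
    if failures:
        lines.append(f"[Failures] {len(failures)} scenarios require remediation")
    return "\n".join(lines)

results = [
    {"scenario_id": "s1", "status": "pass", "cost": 0.40},
    {"scenario_id": "s2", "status": "pass", "cost": 0.40},
    {"scenario_id": "s3", "status": "pass", "cost": 0.40},
    {"scenario_id": "s4", "status": "fail", "cost": 0.40},
]
print(render_report(results))  # pass_rate=75%, one failing scenario listed
```

Keeping the renderer separate from scoring means the same results can later feed charts or CI annotations without touching the evaluator.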
Phase 3: Polish & Edge Cases (6-8 hours)
Goals:
- Add adversarial scenarios
- Add remediation notes
Tasks:
- Expand test suite
- Suggest fixes per failure
Checkpoint: Report includes remediation guidance.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Scenario format | Free text vs schema | Schema | Repeatability |
| Scoring | Binary vs weighted | Weighted | More nuance |
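The weighted option recommended above might look like the following: each criterion gets a weight, and a scenario passes when the weighted score clears a threshold. The criteria names, weights, and threshold are illustrative assumptions:

```python
def weighted_score(checks, weights, threshold=0.8):
    """checks: {criterion: bool}; weights: {criterion: float} summing to 1.0.

    Rounding avoids float-comparison surprises at the threshold boundary.
    """
    score = round(sum(weights[c] for c, ok in checks.items() if ok), 6)
    return score, score >= threshold

checks = {"correctness": True, "safety": True, "cost_within_budget": False}
weights = {"correctness": 0.5, "safety": 0.3, "cost_within_budget": 0.2}
score, passed = weighted_score(checks, weights)
# score is 0.8, which clears the 0.8 threshold despite the cost miss
```

Weighting lets a safety failure sink a scenario outright while a minor cost overrun only dents the score.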
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Scoring logic | Known outputs pass |
| Integration Tests | Runner works | All scenarios executed |
| Edge Case Tests | Adversarial input | Failure detected |
6.2 Critical Test Cases
- Scenario with conflicting requirements fails.
- Missing evidence triggers failure.
- System logs include trace IDs.
6.3 Test Data
Scenario: "Conflicting instructions"
Expected: failure + escalation
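The test data above can be encoded as an executable check. This toy evaluator and its `escalated` flag are hypothetical stand-ins for your real scoring logic:

```python
def evaluate_scenario(output):
    """Toy rule: conflicting instructions must fail and escalate to a human."""
    conflicting = "conflict" in output.get("flags", [])
    if conflicting:
        return {"status": "fail", "escalated": True}
    return {"status": "pass", "escalated": False}

# Scenario: "Conflicting instructions" -> expected failure + escalation
result = evaluate_scenario({"flags": ["conflict"]})
assert result["status"] == "fail" and result["escalated"]
```

Encoding expected failures as assertions makes the critical test cases in 6.2 regressions you can run, not just prose.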
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Weak scenarios | High pass rate | Add adversarial cases |
| Unclear scoring | Inconsistent results | Define strict criteria |
| No remediation | Failures ignored | Add fix suggestions |
7.2 Debugging Strategies
- Compare scenario outputs to expected outputs.
- Review trace logs for failure points.
7.3 Performance Traps
- Large scenario suites can be costly; use sampling for iteration.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add scenario tags.
- Add summary charts.
8.2 Intermediate Extensions
- Add drift detection across runs.
- Add per-agent performance stats.
8.3 Advanced Extensions
- Integrate with CI pipelines.
- Add automated regression alerts.
9. Real-World Connections
9.1 Industry Applications
- Quality assurance for agentic workflows
- Safety validation in regulated industries
9.2 Related Open Source Projects
- OpenAI evals (evaluation patterns)
9.3 Interview Relevance
- Evaluation strategies are critical in AI system design interviews.
10. Resources
10.1 Essential Reading
- “Release It!” - reliability testing
10.2 Tools & Documentation
- OpenAI Evals documentation (evaluation patterns)
10.3 Related Projects in This Series
- Previous Project: Human-in-the-Loop Command Center (P08)
- Next Project: Capstone (P10)
11. Self-Assessment Checklist
11.1 Understanding
- I can design adversarial scenarios
11.2 Implementation
- Reports show failures and remediation
11.3 Growth
- I can compare metrics across runs
12. Submission / Completion Criteria
Minimum Viable Completion:
- Test suite runs and reports pass/fail
Full Completion:
- Metrics and remediation guidance included
Excellence (Going Above & Beyond):
- CI integration and regression alerts added
This guide was generated from LEARN_COMPLEX_MULTI_AGENT_SYSTEMS_DEEP_DIVE.md. For the complete learning path, see the README.