Project 8: Multi-Agent Collaboration (The Teamwork)

Build a small “research team” where specialized agents (Researcher, Writer, Critic) collaborate through an orchestrated loop to produce a better final artifact than any single agent.

Quick Reference

Attribute      | Value
Difficulty     | Level 5: Master
Time Estimate  | 35–55 hours
Language       | Python
Prerequisites  | Strong prompt/tool fundamentals, trace/debug habits, eval mindset
Key Topics     | role specialization, inter-agent protocols, shared memory, iterative refinement, conflict resolution

1. Learning Objectives

By completing this project, you will:

  1. Implement multi-agent orchestration with explicit roles and goals.
  2. Design a protocol for agent communication (messages, handoffs, critique format).
  3. Build shared memory so agents can collaborate on the same evidence base.
  4. Add iteration limits and quality thresholds to prevent endless debates.
  5. Evaluate collaboration quality (does the team actually improve output?).

2. Theoretical Foundation

2.1 Core Concepts

  • Division of labor: Specialization reduces cognitive load; one agent gathers evidence, one synthesizes, one critiques.
  • Communication protocols: Without structure, agents produce redundant text. You want structured handoffs: evidence lists, outlines, critique checklists (see the message sketch after this list).
  • Shared memory: If the Writer cannot see what the Researcher found, the system fails. Shared state needs provenance and versioning.
  • Iteration & convergence: Multi-agent loops need stopping criteria: max rounds, a minimum score, or diminishing returns.
  • Failure modes: Groupthink, oscillation, and “critic paralysis” are common; orchestration logic must manage them.
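
As a concrete example of a structured handoff, here is a minimal message-envelope sketch; the field names are illustrative, not part of the project specification.

from dataclasses import dataclass

@dataclass(frozen=True)
class AgentMessage:
    """One handoff in the collaboration protocol (illustrative shape)."""
    sender: str                          # "researcher" | "writer" | "critic"
    recipient: str                       # usually "orchestrator"
    kind: str                            # "evidence" | "draft" | "critique"
    round_no: int                        # collaboration round this message belongs to
    artifact_ids: tuple[str, ...] = ()   # IDs into the shared store, not raw text
    note: str = ""                       # short free-text summary, kept deliberately small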

2.2 Why This Matters

Complex assistant tasks (planning a trip, writing a proposal, designing a system) benefit from multiple perspectives and internal checks. Multi-agent systems are a practical way to add “checks and balances.”

2.3 Common Misconceptions

  • “More agents = better.” Adding agents increases coordination overhead; keep the team small and the roles sharp.
  • “The critic should be harsh.” The critic should be constructive and grounded in explicit criteria.
  • “Agents can share context implicitly.” They can’t; you must implement memory sharing explicitly.

3. Project Specification

3.1 What You Will Build

A CLI tool that accepts a topic and produces a final report (blog post, memo, plan) by orchestrating:

  • Researcher: gathers sources and extracts factual bullets.
  • Writer: produces a draft from research.
  • Critic: reviews against a rubric and requests revisions.

3.2 Functional Requirements

  1. Roles: at least 3 agents with distinct prompts and responsibilities.
  2. Shared memory: research artifacts stored and referenced by ID.
  3. Orchestrator: runs the Research → Draft → Critique → Revise loop with a bounded number of rounds.
  4. Rubric: critic outputs structured evaluation (scores + actionable feedback).
  5. Citations: final output includes a sources section if web research is enabled.

3.3 Non-Functional Requirements

  • Determinism: run the Critic and rubric scoring at low temperature.
  • Observability: store an execution trace of agent messages and decisions.
  • Cost control: cap tokens per agent turn and cap iterations.
  • Quality control: enforce minimum evidence count and citation discipline.

3.4 Example Usage / Output

python multi_agent_research.py --topic "Sustainable urban agriculture solutions"

Output artifacts:

  • report.md (final)
  • trace.jsonl (all agent messages and tool calls)
  • evidence.json (normalized sources + snippets)

4. Solution Architecture

4.1 High-Level Design

┌────────┐      ┌───────────────────┐
│ CLI/UI │─────▶│ Orchestrator      │
└────────┘      │ (rounds + policy) │
                └─────────┬─────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
   ┌────────────┐  ┌────────────┐  ┌────────────┐
   │ Researcher │  │ Writer     │  │ Critic     │
   └──────┬─────┘  └──────┬─────┘  └──────┬─────┘
          │ evidence      │ drafts        │ rubric feedback
          ▼               ▼               ▼
         ┌──────────────────────────────────┐
         │      Shared Memory / Store       │
         └──────────────────────────────────┘

4.2 Key Components

Component     | Responsibility              | Key Decisions
Orchestrator  | control order + stopping    | max rounds; thresholds; timeouts
Agent prompts | define roles                | sharp responsibilities; structured outputs
Shared store  | persist evidence and drafts | version by round; provenance
Rubric/evals  | measure quality             | criteria: accuracy, clarity, completeness

4.3 Data Structures

from dataclasses import dataclass

@dataclass(frozen=True)
class SourceItem:
    id: str
    url: str
    title: str
    snippet: str

@dataclass(frozen=True)
class Critique:
    scores: dict[str, int]  # e.g., {"accuracy": 8, "clarity": 7}
    must_fix: list[str]
    nice_to_have: list[str]
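
The shared store itself can stay small. A minimal sketch that builds on the dataclasses above, assuming an in-memory store keyed by ID with drafts versioned per round (class and method names are illustrative):

from dataclasses import dataclass, field

@dataclass
class SharedStore:
    """Minimal shared memory: evidence keyed by ID, drafts versioned by round."""
    sources: dict[str, SourceItem] = field(default_factory=dict)
    evidence: dict[str, str] = field(default_factory=dict)    # evidence_id -> bullet text
    provenance: dict[str, str] = field(default_factory=dict)  # evidence_id -> source_id
    drafts: dict[int, str] = field(default_factory=dict)      # round -> draft text
    critiques: dict[int, Critique] = field(default_factory=dict)

    def add_evidence(self, evidence_id: str, text: str, source_id: str) -> None:
        if source_id not in self.sources:
            raise KeyError(f"unknown source: {source_id}")    # provenance is mandatory
        self.evidence[evidence_id] = text
        self.provenance[evidence_id] = source_id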

4.4 Algorithm Overview

Key Algorithm: bounded collaboration loop (sketched in code at the end of this subsection)

  1. Researcher gathers N sources and extracts M evidence bullets.
  2. Writer drafts output using only evidence store.
  3. Critic scores against rubric and returns a structured critique.
  4. If score < threshold and rounds remain: Writer revises using critique.
  5. Stop when threshold met or max rounds reached; emit final report.

Complexity Analysis:

  • Time: O(rounds × agent_turns) model/tool calls
  • Space: O(evidence + traces)
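
A minimal sketch of the loop above, assuming the three agents are plain callables and critiques match the Critique dataclass from Section 4.3 (function names and signatures are illustrative, not a prescribed API):

def run_team(topic: str, researcher, writer, critic,
             max_rounds: int = 3, threshold: int = 8) -> str:
    """Bounded Research → Draft → Critique → Revise loop."""
    evidence = researcher(topic)                   # step 1: gather evidence once
    draft = writer(topic, evidence, None)          # step 2: first draft from evidence only
    for _ in range(max_rounds):
        critique = critic(draft, evidence)         # step 3: structured rubric critique
        weakest = min(critique.scores.values())    # the weakest criterion gates the stop
        if weakest >= threshold or not critique.must_fix:
            break                                  # step 5: converged, stop early
        draft = writer(topic, evidence, critique)  # step 4: targeted revision
    return draft                                   # caller writes report.md and the trace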

5. Implementation Guide

5.1 Development Environment Setup

python -m venv .venv
source .venv/bin/activate
pip install pydantic rich

5.2 Project Structure

multi-agent-team/
├── src/
│   ├── cli.py
│   ├── orchestrator.py
│   ├── agents/
│   │   ├── researcher.py
│   │   ├── writer.py
│   │   └── critic.py
│   ├── memory.py
│   └── evals.py
└── data/
    └── runs/

5.3 Implementation Phases

Phase 1: Roles + trace logging (8–12h)

Goals:

  • Run a fixed pipeline with three agents and store traces.

Tasks:

  1. Implement agent wrappers with structured inputs/outputs.
  2. Store every message in trace.jsonl with timestamps.
  3. Produce a report from a fixed evidence set (no web tool yet).

Checkpoint: Given a seed evidence file, output is stable and traceable.
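
One way to implement the trace log, assuming one JSON object per line under data/runs/ (the field names and layout here are an assumption, not a required format):

import json
import time
from pathlib import Path

def log_trace(run_dir: Path, role: str, content: str, round_no: int) -> None:
    """Append one agent message to trace.jsonl with a timestamp."""
    record = {
        "ts": time.time(),     # unix timestamp of the message
        "round": round_no,     # collaboration round the message belongs to
        "role": role,          # researcher / writer / critic / orchestrator
        "content": content,    # message text, or a reference to a stored artifact
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    with (run_dir / "trace.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")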

Phase 2: Shared memory + rubric-driven revision (10–15h)

Goals:

  • Critic drives measurable improvements.

Tasks:

  1. Implement shared store with versions per round.
  2. Define rubric and parse critic output with validation.
  3. Implement revision loop with max rounds and thresholds.

Checkpoint: Round 2 output is demonstrably better on rubric criteria.
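
Parsing the critic's output with validation might look like the sketch below, assuming pydantic v2 (installed in Section 5.1) and a critic that replies with a JSON object; the model and function names are illustrative:

from pydantic import BaseModel, Field, ValidationError, field_validator

class CritiqueModel(BaseModel):
    """Validated critic output; rejects malformed or out-of-range rubric scores."""
    scores: dict[str, int]
    must_fix: list[str] = Field(default_factory=list)
    nice_to_have: list[str] = Field(default_factory=list)

    @field_validator("scores")
    @classmethod
    def scores_in_range(cls, v: dict[str, int]) -> dict[str, int]:
        bad = {name: s for name, s in v.items() if not 0 <= s <= 10}
        if bad:
            raise ValueError(f"rubric scores outside 0-10: {bad}")
        return v

def parse_critique(raw_json: str) -> CritiqueModel | None:
    """Return a validated critique, or None so the orchestrator can re-prompt the critic."""
    try:
        return CritiqueModel.model_validate_json(raw_json)
    except ValidationError:
        return None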

Phase 3: Real research tools + citation discipline (12–28h)

Goals:

  • Use browsing/search tools and maintain provenance.

Tasks:

  1. Integrate a web search/fetch tool (or reuse Project 5 components).
  2. Normalize sources and store evidence bullets with URLs.
  3. Enforce a “no evidence → no claim” rule in the Writer prompt.

Checkpoint: Final report includes citations tied to evidence.
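
The “no evidence → no claim” rule can also be backed by a mechanical check before a draft is accepted. A small sketch, assuming citations appear as [S1]-style markers that match stored source IDs (the marker format is an assumption):

import re

def unbacked_citations(report: str, source_ids: set[str]) -> set[str]:
    """Return citation markers in the report that do not map to a stored source."""
    cited = set(re.findall(r"\[(S\d+)\]", report))  # markers like [S1], [S2]
    return cited - source_ids

The orchestrator can refuse to finalize the report while this set is non-empty, or route the draft back to the Researcher for more evidence.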

5.4 Key Implementation Decisions

Decision | Options                              | Recommendation         | Rationale
Memory   | shared text blob vs structured store | structured store       | provenance + constraints
Critique | freeform vs JSON rubric              | JSON rubric            | stable iteration
Stopping | fixed rounds vs threshold            | threshold + max rounds | prevents infinite loops

6. Testing Strategy

6.1 Test Categories

Category | Purpose                | Examples
Unit     | parsing/validation     | critique JSON parsing, store versioning
Replay   | deterministic behavior | run with cached sources and fixed temps
Quality  | eval harness           | rubric score monotonicity across rounds

6.2 Critical Test Cases

  1. Convergence: the system stops when the quality threshold is met (see the test sketch after this list).
  2. No hallucinated citations: all citations correspond to stored sources.
  3. Critic usefulness: critic output contains actionable, specific fixes.
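
Test case 1 can be exercised with stub agents whose rubric scores rise each round, reusing the run_team and Critique sketches from Sections 4.3 and 4.4 (illustrative, not a prescribed API):

def test_stops_when_threshold_met():
    calls = {"critic": 0, "writer": 0}

    def researcher(topic):                       # fixed evidence, no web access in tests
        return ["fact A", "fact B"]

    def writer(topic, evidence, critique=None):  # returns a new draft each call
        calls["writer"] += 1
        return f"draft v{calls['writer']}"

    def critic(draft, evidence):                 # scores rise 6 -> 8 across rounds
        calls["critic"] += 1
        score = 4 + 2 * calls["critic"]
        fixes = ["tighten intro"] if score < 8 else []
        return Critique(scores={"accuracy": score}, must_fix=fixes, nice_to_have=[])

    final = run_team("test topic", researcher, writer, critic, max_rounds=5, threshold=8)
    assert calls["critic"] == 2                  # stopped at round 2, well before max_rounds
    assert final == "draft v2"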

7. Common Pitfalls & Debugging

Pitfall                  | Symptom              | Solution
Agents repeat themselves | bloated traces       | require structured outputs and concise formats
Critic nitpicks          | endless revisions    | “must fix” vs “nice to have” separation
Missing shared context   | writer invents facts | enforce evidence-only writing policy
Runaway cost             | too many rounds      | hard caps on rounds/tokens

Debugging strategies:

  • Inspect trace and identify where protocol breaks (e.g., unstructured outputs).
  • Add small “contract tests” for agent output schemas.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a “Summarizer” agent to compress evidence.
  • Add a “Fact-checker” agent that verifies claims against sources.

8.2 Intermediate Extensions

  • Add parallel research: multiple research subagents gather evidence concurrently (see the sketch after this list).
  • Add disagreement resolution: critic flags contradictions and asks for more research.
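
A sketch of the research fan-out, assuming each research subagent is an async callable (asyncio-based; all names are illustrative):

import asyncio

async def parallel_research(sub_questions: list[str], research_one) -> list[str]:
    """Run one research subagent per sub-question and merge their evidence bullets."""
    results = await asyncio.gather(*(research_one(q) for q in sub_questions))
    merged: list[str] = []
    for bullets in results:
        merged.extend(bullets)   # dedup and contradiction checks happen downstream
    return merged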

8.3 Advanced Extensions

  • Add task decomposition: orchestrator splits topic into sub-questions automatically.
  • Add score-based model selection per role (cheap researcher, strong critic).

9. Real-World Connections

9.1 Industry Applications

  • Content pipelines (research → draft → editorial review).
  • Multi-agent customer support (triage, resolution, QA).

9.2 Interview Relevance

  • Multi-agent orchestration, protocols, shared memory, and cost controls.

10. Resources

10.1 Essential Reading

  • Multi-Agent Systems with AutoGen (Victor Dibia) – roles and orchestration
  • Building AI Agents (Packt) – agent loops, tool use

10.2 Tools & Documentation

  • CrewAI / AutoGen docs (agents, tasks, tools)
  • LangGraph for explicit state machines
  • Previous: Project 7 (code agent) – traceability and safety rails
  • Next: Project 12 (self-improving) – recursive capability growth with strict sandboxing

11. Self-Assessment Checklist

  • I can explain why the team improves output vs a single agent.
  • I can show a structured communication protocol and trace.
  • I can cap cost and still converge to acceptable quality.
  • I can enforce evidence/citation discipline across agents.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Three agents with distinct roles
  • Orchestrated loop with trace logging
  • Final report output

Full Completion:

  • Shared memory store with provenance
  • Rubric-driven revision loop with thresholds
  • Optional web research with citations

Excellence (Going Above & Beyond):

  • Parallel research subagents + contradiction resolution + eval harness

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.