Project 8: Multi-Agent Collaboration (The Teamwork)

Build a small “research team” where specialized agents (Researcher, Writer, Critic) collaborate through an orchestrated loop to produce a better final artifact than any single agent.

Quick Reference

Attribute      | Value
Difficulty     | Level 5: Master
Time Estimate  | 35–55 hours
Language       | Python
Prerequisites  | Strong prompt/tool fundamentals, trace/debug habits, eval mindset
Key Topics     | role specialization, inter-agent protocols, shared memory, iterative refinement, conflict resolution

1. Learning Objectives

By completing this project, you will:

  1. Implement multi-agent orchestration with explicit roles and goals.
  2. Design a protocol for agent communication (messages, handoffs, critique format).
  3. Build shared memory so agents can collaborate on the same evidence base.
  4. Add iteration limits and quality thresholds to prevent endless debates.
  5. Evaluate collaboration quality (does the team actually improve output?).

2. Theoretical Foundation

2.1 Core Concepts

  • Division of labor: Specialization reduces cognitive load; one agent gathers evidence, one synthesizes, one critiques.
  • Communication protocols: Without structure, agents produce redundant text. You want structured handoffs: evidence lists, outlines, critique checklists (see the message sketch after this list).
  • Shared memory: If the Writer cannot see what the Researcher found, the system fails. Shared state needs provenance and versioning.
  • Iteration & convergence: Multi-agent loops need stopping criteria: max rounds, a minimum score, or diminishing returns.
  • Failure modes: Groupthink, oscillation, and “critic paralysis” are common; orchestration logic must manage them.
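
As a concrete example of a structured handoff, here is a minimal message-envelope sketch; the field names are illustrative, not part of the project specification.

from dataclasses import dataclass

@dataclass(frozen=True)
class AgentMessage:
    """One handoff in the collaboration protocol (illustrative shape)."""
    sender: str                          # "researcher" | "writer" | "critic"
    recipient: str                       # usually "orchestrator"
    kind: str                            # "evidence" | "draft" | "critique"
    round_no: int                        # collaboration round this message belongs to
    artifact_ids: tuple[str, ...] = ()   # IDs into the shared store, not raw text
    note: str = ""                       # short free-text summary, kept deliberately small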

2.2 Why This Matters

Complex assistant tasks (planning a trip, writing a proposal, designing a system) benefit from multiple perspectives and internal checks. Multi-agent systems are a practical way to add “checks and balances.”

2.3 Common Misconceptions

  • “More agents = better.” Adding agents increases coordination overhead; keep the team small and the roles sharp.
  • “The critic should be harsh.” The critic should be constructive and grounded in explicit criteria.
  • “Agents can share context implicitly.” They can’t; you must implement memory sharing explicitly.

3. Project Specification

3.1 What You Will Build

A CLI tool that accepts a topic and produces a final report (blog post, memo, plan) by orchestrating:

  • Researcher: gathers sources and extracts factual bullets.
  • Writer: produces a draft from research.
  • Critic: reviews against a rubric and requests revisions.

3.2 Functional Requirements

  1. Roles: at least 3 agents with distinct prompts and responsibilities.
  2. Shared memory: research artifacts stored and referenced by ID.
  3. Orchestrator: runs the Research → Draft → Critique → Revise loop with a bounded number of rounds.
  4. Rubric: critic outputs structured evaluation (scores + actionable feedback).
  5. Citations: final output includes a sources section if web research is enabled.

3.3 Non-Functional Requirements

  • Determinism: run the Critic and rubric scoring at low temperature.
  • Observability: store an execution trace of agent messages and decisions.
  • Cost control: cap tokens per agent turn and cap iterations.
  • Quality control: enforce minimum evidence count and citation discipline.

3.4 Example Usage / Output

python multi_agent_research.py --topic "Sustainable urban agriculture solutions"

Output artifacts:

  • report.md (final)
  • trace.jsonl (all agent messages and tool calls)
  • evidence.json (normalized sources + snippets)

4. Solution Architecture

4.1 High-Level Design

┌────────┐      ┌───────────────────┐
│ CLI/UI │─────▶│ Orchestrator      │
└────────┘      │ (rounds + policy) │
                └─────────┬─────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
   ┌────────────┐  ┌────────────┐  ┌────────────┐
   │ Researcher │  │ Writer     │  │ Critic     │
   └──────┬─────┘  └──────┬─────┘  └──────┬─────┘
          │ evidence      │ drafts        │ rubric feedback
          ▼               ▼               ▼
         ┌──────────────────────────────────┐
         │      Shared Memory / Store       │
         └──────────────────────────────────┘

4.2 Key Components

Component     | Responsibility              | Key Decisions
Orchestrator  | control order + stopping    | max rounds; thresholds; timeouts
Agent prompts | define roles                | sharp responsibilities; structured outputs
Shared store  | persist evidence and drafts | version by round; provenance
Rubric/evals  | measure quality             | criteria: accuracy, clarity, completeness

4.3 Data Structures

from dataclasses import dataclass

@dataclass(frozen=True)
class SourceItem:
    id: str
    url: str
    title: str
    snippet: str

@dataclass(frozen=True)
class Critique:
    scores: dict[str, int]  # e.g., {"accuracy": 8, "clarity": 7}
    must_fix: list[str]
    nice_to_have: list[str]
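
The shared store itself can stay small. A minimal sketch that builds on the dataclasses above, assuming an in-memory store keyed by ID with drafts versioned per round (class and method names are illustrative):

from dataclasses import dataclass, field

@dataclass
class SharedStore:
    """Minimal shared memory: evidence keyed by ID, drafts versioned by round."""
    sources: dict[str, SourceItem] = field(default_factory=dict)
    evidence: dict[str, str] = field(default_factory=dict)    # evidence_id -> bullet text
    provenance: dict[str, str] = field(default_factory=dict)  # evidence_id -> source_id
    drafts: dict[int, str] = field(default_factory=dict)      # round -> draft text
    critiques: dict[int, Critique] = field(default_factory=dict)

    def add_evidence(self, evidence_id: str, text: str, source_id: str) -> None:
        if source_id not in self.sources:
            raise KeyError(f"unknown source: {source_id}")    # provenance is mandatory
        self.evidence[evidence_id] = text
        self.provenance[evidence_id] = source_id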

4.4 Algorithm Overview

Key Algorithm: bounded collaboration loop (sketched in code at the end of this subsection)

  1. Researcher gathers N sources and extracts M evidence bullets.
  2. Writer drafts output using only evidence store.
  3. Critic scores against rubric and returns a structured critique.
  4. If score < threshold and rounds remain: Writer revises using critique.
  5. Stop when threshold met or max rounds reached; emit final report.

Complexity Analysis:

  • Time: O(rounds × agent_turns) model/tool calls
  • Space: O(evidence + traces)
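
A minimal sketch of the loop above, assuming the three agents are plain callables and critiques match the Critique dataclass from Section 4.3 (function names and signatures are illustrative, not a prescribed API):

def run_team(topic: str, researcher, writer, critic,
             max_rounds: int = 3, threshold: int = 8) -> str:
    """Bounded Research → Draft → Critique → Revise loop."""
    evidence = researcher(topic)                   # step 1: gather evidence once
    draft = writer(topic, evidence, None)          # step 2: first draft from evidence only
    for _ in range(max_rounds):
        critique = critic(draft, evidence)         # step 3: structured rubric critique
        weakest = min(critique.scores.values())    # the weakest criterion gates the stop
        if weakest >= threshold or not critique.must_fix:
            break                                  # step 5: converged, stop early
        draft = writer(topic, evidence, critique)  # step 4: targeted revision
    return draft                                   # caller writes report.md and the trace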

5. Implementation Guide

5.1 Development Environment Setup

python -m venv .venv
source .venv/bin/activate
pip install pydantic rich

5.2 Project Structure

multi-agent-team/
├── src/
│   ├── cli.py
│   ├── orchestrator.py
│   ├── agents/
│   │   ├── researcher.py
│   │   ├── writer.py
│   │   └── critic.py
│   ├── memory.py
│   └── evals.py
└── data/
    └── runs/

5.3 Implementation Phases

Phase 1: Roles + trace logging (8–12h)

Goals:

  • Run a fixed pipeline with three agents and store traces.

Tasks:

  1. Implement agent wrappers with structured inputs/outputs.
  2. Store every message in trace.jsonl with timestamps.
  3. Produce a report from a fixed evidence set (no web tool yet).

Checkpoint: Given a seed evidence file, output is stable and traceable.
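
One way to implement the trace log, assuming one JSON object per line under data/runs/ (the field names and layout here are an assumption, not a required format):

import json
import time
from pathlib import Path

def log_trace(run_dir: Path, role: str, content: str, round_no: int) -> None:
    """Append one agent message to trace.jsonl with a timestamp."""
    record = {
        "ts": time.time(),     # unix timestamp of the message
        "round": round_no,     # collaboration round the message belongs to
        "role": role,          # researcher / writer / critic / orchestrator
        "content": content,    # message text, or a reference to a stored artifact
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    with (run_dir / "trace.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")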

Phase 2: Shared memory + rubric-driven revision (10–15h)

Goals:

  • Critic drives measurable improvements.

Tasks:

  1. Implement shared store with versions per round.
  2. Define rubric and parse critic output with validation.
  3. Implement revision loop with max rounds and thresholds.

Checkpoint: Round 2 output is demonstrably better on rubric criteria.
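
Parsing the critic's output with validation might look like the sketch below, assuming pydantic v2 (installed in Section 5.1) and a critic that replies with a JSON object; the model and function names are illustrative:

from pydantic import BaseModel, Field, ValidationError, field_validator

class CritiqueModel(BaseModel):
    """Validated critic output; rejects malformed or out-of-range rubric scores."""
    scores: dict[str, int]
    must_fix: list[str] = Field(default_factory=list)
    nice_to_have: list[str] = Field(default_factory=list)

    @field_validator("scores")
    @classmethod
    def scores_in_range(cls, v: dict[str, int]) -> dict[str, int]:
        bad = {name: s for name, s in v.items() if not 0 <= s <= 10}
        if bad:
            raise ValueError(f"rubric scores outside 0-10: {bad}")
        return v

def parse_critique(raw_json: str) -> CritiqueModel | None:
    """Return a validated critique, or None so the orchestrator can re-prompt the critic."""
    try:
        return CritiqueModel.model_validate_json(raw_json)
    except ValidationError:
        return None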

Phase 3: Real research tools + citation discipline (12–28h)

Goals:

  • Use browsing/search tools and maintain provenance.

Tasks:

  1. Integrate a web search/fetch tool (or reuse Project 5 components).
  2. Normalize sources and store evidence bullets with URLs.
  3. Enforce a “no evidence → no claim” rule in the Writer prompt.

Checkpoint: Final report includes citations tied to evidence.
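
The “no evidence → no claim” rule can also be backed by a mechanical check before a draft is accepted. A small sketch, assuming citations appear as [S1]-style markers that match stored source IDs (the marker format is an assumption):

import re

def unbacked_citations(report: str, source_ids: set[str]) -> set[str]:
    """Return citation markers in the report that do not map to a stored source."""
    cited = set(re.findall(r"\[(S\d+)\]", report))  # markers like [S1], [S2]
    return cited - source_ids

The orchestrator can refuse to finalize the report while this set is non-empty, or route the draft back to the Researcher for more evidence.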

5.4 Key Implementation Decisions

Decision | Options                              | Recommendation         | Rationale
Memory   | shared text blob vs structured store | structured store       | provenance + constraints
Critique | freeform vs JSON rubric              | JSON rubric            | stable iteration
Stopping | fixed rounds vs threshold            | threshold + max rounds | prevents infinite loops

6. Testing Strategy

6.1 Test Categories

Category | Purpose                | Examples
Unit     | parsing/validation     | critique JSON parsing, store versioning
Replay   | deterministic behavior | run with cached sources and fixed temps
Quality  | eval harness           | rubric score monotonicity across rounds

6.2 Critical Test Cases

  1. Convergence: the system stops when the quality threshold is met (see the test sketch after this list).
  2. No hallucinated citations: all citations correspond to stored sources.
  3. Critic usefulness: critic output contains actionable, specific fixes.
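
Test case 1 can be exercised with stub agents whose rubric scores rise each round, reusing the run_team and Critique sketches from Sections 4.3 and 4.4 (illustrative, not a prescribed API):

def test_stops_when_threshold_met():
    calls = {"critic": 0, "writer": 0}

    def researcher(topic):                       # fixed evidence, no web access in tests
        return ["fact A", "fact B"]

    def writer(topic, evidence, critique=None):  # returns a new draft each call
        calls["writer"] += 1
        return f"draft v{calls['writer']}"

    def critic(draft, evidence):                 # scores rise 6 -> 8 across rounds
        calls["critic"] += 1
        score = 4 + 2 * calls["critic"]
        fixes = ["tighten intro"] if score < 8 else []
        return Critique(scores={"accuracy": score}, must_fix=fixes, nice_to_have=[])

    final = run_team("test topic", researcher, writer, critic, max_rounds=5, threshold=8)
    assert calls["critic"] == 2                  # stopped at round 2, well before max_rounds
    assert final == "draft v2"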

7. Common Pitfalls & Debugging

Pitfall                  | Symptom              | Solution
Agents repeat themselves | bloated traces       | require structured outputs and concise formats
Critic nitpicks          | endless revisions    | “must fix” vs “nice to have” separation
Missing shared context   | writer invents facts | enforce evidence-only writing policy
Runaway cost             | too many rounds      | hard caps on rounds/tokens

Debugging strategies:

  • Inspect trace and identify where protocol breaks (e.g., unstructured outputs).
  • Add small “contract tests” for agent output schemas.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a “Summarizer” agent to compress evidence.
  • Add a “Fact-checker” agent that verifies claims against sources.

8.2 Intermediate Extensions

  • Add parallel research: multiple research subagents gather evidence concurrently (see the sketch after this list).
  • Add disagreement resolution: critic flags contradictions and asks for more research.
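
A sketch of the research fan-out, assuming each research subagent is an async callable (asyncio-based; all names are illustrative):

import asyncio

async def parallel_research(sub_questions: list[str], research_one) -> list[str]:
    """Run one research subagent per sub-question and merge their evidence bullets."""
    results = await asyncio.gather(*(research_one(q) for q in sub_questions))
    merged: list[str] = []
    for bullets in results:
        merged.extend(bullets)   # dedup and contradiction checks happen downstream
    return merged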

8.3 Advanced Extensions

  • Add task decomposition: orchestrator splits topic into sub-questions automatically.
  • Add score-based model selection per role (cheap researcher, strong critic).

9. Real-World Connections

9.1 Industry Applications

  • Content pipelines (research → draft → editorial review).
  • Multi-agent customer support (triage, resolution, QA).

9.2 Interview Relevance

  • Multi-agent orchestration, protocols, shared memory, and cost controls.

10. Resources

10.1 Essential Reading

  • Multi-Agent Systems with AutoGen (Victor Dibia) – roles and orchestration
  • Building AI Agents (Packt) – agent loops, tool use

10.2 Tools & Documentation

  • CrewAI / AutoGen docs (agents, tasks, tools)
  • LangGraph for explicit state machines
  • Previous: Project 7 (code agent) – traceability and safety rails
  • Next: Project 12 (self-improving) – recursive capability growth with strict sandboxing

11. Self-Assessment Checklist

  • I can explain why the team improves output vs a single agent.
  • I can show a structured communication protocol and trace.
  • I can cap cost and still converge to acceptable quality.
  • I can enforce evidence/citation discipline across agents.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Three agents with distinct roles
  • Orchestrated loop with trace logging
  • Final report output

Full Completion:

  • Shared memory store with provenance
  • Rubric-driven revision loop with thresholds
  • Optional web research with citations

Excellence (Going Above & Beyond):

  • Parallel research subagents + contradiction resolution + eval harness

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.