Project 37: Multi-Agent Orchestrator - Parallel Claude Swarm

Build an orchestration system that spawns multiple Claude Code instances in headless mode, assigns them specialized tasks, coordinates their work through shared state, handles failures gracefully, and combines their outputs into coherent results.

Quick Reference

Attribute	Value
Difficulty	Master
Time Estimate	1 month+
Languages	TypeScript (Alternatives: Python, Go)
Prerequisites	All previous projects, understanding of distributed systems concepts
Key Topics	Multi-agent systems, distributed coordination, fault tolerance, consensus
Knowledge Area	Distributed Systems / Agent Orchestration
Software/Tools	Claude Code Headless, Task API
Coolness Level	Level 5: Pure Magic (Super Cool)
Business Potential	4. The “Open Core” Infrastructure

1. Project Overview

This is the pinnacle of Claude Code automation. You will build a system that coordinates multiple Claude Code instances working in parallel - like a swarm of AI workers each with specialized capabilities, all collaborating on complex tasks that no single agent could accomplish efficiently.

What makes this Master level:

Distributed systems thinking is required
Failure handling must be robust
Coordination problems are subtle and complex
Performance must be optimized across multiple agents
State must stay synchronized across processes

2. Real World Outcome

You will have a multi-agent orchestration system capable of complex, parallelized workflows:

Example: Large-Scale Code Migration

$ claude-swarm run ./migration-plan.yaml

+------------------------------------------------------------------+
|                   CLAUDE SWARM ORCHESTRATOR                       |
+------------------------------------------------------------------+

Loading migration plan...
Target: Migrate JavaScript codebase to TypeScript

Spawning agent swarm:
+-------------------------------------------------------------------+
| Agent ID | Role      | Assigned Work                    | Status |
+----------+-----------+----------------------------------+--------+
| agent-1  | Analyzer  | Scanning codebase for types      | INIT   |
| agent-2  | Analyzer  | Identifying external deps        | INIT   |
| agent-3  | Converter | Converting /src/utils (15 files) | INIT   |
| agent-4  | Converter | Converting /src/components (32)  | INIT   |
| agent-5  | Converter | Converting /src/services (8)     | INIT   |
| agent-6  | Validator | Type-checking converted files    | WAIT   |
| agent-7  | Reviewer  | Reviewing conversion quality     | WAIT   |
+----------+-----------+----------------------------------+--------+

Progress:
[================------------] 60% - 33/55 files converted

Agent Status (real-time):
  agent-1: COMPLETE - Found 127 type patterns
  agent-2: COMPLETE - 15 deps need @types packages
  agent-3: WORKING  - 12/15 files done
  agent-4: WORKING  - 20/32 files done
  agent-5: COMPLETE - All services converted
  agent-6: WAITING  - Need more completed files
  agent-7: WORKING  - 5 files reviewed

WARNING agent-4 error on /src/components/DataGrid.jsx:
"Complex HOC pattern needs manual intervention"
-> Added to manual-review queue
-> Agent-4 continuing with remaining files...

...

MIGRATION COMPLETE!

+------------------------------------------------------------------+
|                        FINAL RESULTS                              |
+------------------------------------------------------------------+
| Files auto-converted:     | 52/55                                |
| Files need manual review: | 3                                    |
| Type errors:              | 0                                    |
| Generated artifacts:      | tsconfig.json, types/*.d.ts         |
| Total time:               | 4m 32s (vs ~2h sequential)           |
| Parallel speedup:         | ~26x                                 |
+------------------------------------------------------------------+

Report: ./migration-report.html

Example: Parallel Codebase Analysis

$ claude-swarm analyze --depth=deep ./large-monorepo

Swarm Configuration:
  - 4 Architecture Analyzers
  - 2 Security Auditors
  - 2 Performance Profilers
  - 1 Documentation Reviewer
  - 1 Result Aggregator

[====================] 100% Complete

Analysis Report Generated:
  - Architecture diagram: ./reports/architecture.svg
  - Security findings: 3 high, 7 medium, 12 low
  - Performance hotspots: 5 identified
  - Documentation coverage: 67%
  - Tech debt estimate: 340 story points

3. The Core Question You Are Answering

“How do you coordinate multiple AI agents to accomplish more than one agent could alone?”

This is not just parallelization for speed. It is about specialization - one agent that excels at analysis, another at conversion, another at review. Together they produce higher quality results than one agent trying to do everything.

Sub-questions to consider:

How do you partition work without losing context?
How do agents share discoveries without overwhelming each other?
How do you handle conflicting recommendations from different agents?
What happens when one agent fails mid-task?

4. Concepts You Must Understand First

Stop and research these before coding:

4.1 Agent Specialization

Questions to answer:

How do you make an agent “expert” in a task?
What is the tradeoff between generalist and specialist agents?
How do you define agent boundaries?

Key insight: Specialization comes from output styles and focused system prompts. An “Analyzer” agent gets a different persona, and a different set of allowed tools, than a “Converter” agent.

Reference: “Multi-Agent Systems” by Wooldridge - Chapters 4-5

4.2 Coordination Patterns

Questions to answer:

What is the difference between orchestration and choreography?
How do you handle shared state across agents?
What synchronization primitives do you need?

+-----------------------------------------------------------------------+
|                    ORCHESTRATION vs CHOREOGRAPHY                       |
+-----------------------------------------------------------------------+
|                                                                       |
|  ORCHESTRATION (Central Control):                                     |
|                                                                       |
|       +---------------+                                               |
|       | Orchestrator  |  <-- Single point of coordination             |
|       +---------------+                                               |
|        /      |       \                                               |
|       v       v        v                                              |
|   +------+ +------+ +------+                                          |
|   |Agent1| |Agent2| |Agent3|  <-- Agents receive commands             |
|   +------+ +------+ +------+                                          |
|                                                                       |
|  Pros: Simple control flow, easy to debug                             |
|  Cons: Single point of failure, bottleneck potential                  |
|                                                                       |
+-----------------------------------------------------------------------+
|                                                                       |
|  CHOREOGRAPHY (Decentralized):                                        |
|                                                                       |
|   +------+     +------+                                               |
|   |Agent1| <-> |Agent2|  <-- Agents communicate directly              |
|   +------+     +------+                                               |
|      ^           ^                                                    |
|       \         /                                                     |
|        v       v                                                      |
|       +------+                                                        |
|       |Agent3|                                                        |
|       +------+                                                        |
|                                                                       |
|  Pros: No single point of failure, scalable                           |
|  Cons: Complex to debug, emergent behavior                            |
|                                                                       |
+-----------------------------------------------------------------------+

Reference: “Designing Data-Intensive Applications” by Kleppmann - Chapters 8-9

4.3 Failure Modes in Distributed Systems

Questions to answer:

What happens when one agent fails?
How do you implement retries with backoff?
When should the whole swarm fail vs. continue?

The Eight Fallacies of Distributed Computing (Peter Deutsch):

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology does not change
There is one administrator
Transport cost is zero
The network is homogeneous

Reference: “Release It!” by Michael Nygard - Chapters 4-5
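
For the retry question, it helps to see the delays that a backoff policy actually produces. The sketch below assumes capped exponential backoff; `BackoffPolicy` and `backoffSchedule` are illustrative names, not part of any library.

```typescript
// Hypothetical helper: lists the wait before each retry under capped
// exponential backoff.
interface BackoffPolicy {
  initialMs: number;  // delay before the first retry
  multiplier: number; // growth factor applied after each failed attempt
  maxMs: number;      // upper bound on any single delay
}

function backoffSchedule(policy: BackoffPolicy, retries: number): number[] {
  const delays: number[] = [];
  let delay = policy.initialMs;
  for (let i = 0; i < retries; i++) {
    delays.push(Math.min(delay, policy.maxMs));
    delay *= policy.multiplier;
  }
  return delays;
}

console.log(backoffSchedule({ initialMs: 1000, multiplier: 2, maxMs: 30000 }, 6));
```

With an initial delay of 1000 ms, a multiplier of 2, and a 30000 ms cap, six retries wait 1, 2, 4, 8, 16, and 30 seconds. Adding random jitter to each delay is a common refinement so that agents that failed together do not all retry at the same moment.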

4.4 Consensus and Coordination Algorithms

Simplified concepts to understand:

Leader Election: How do agents decide who coordinates?
Distributed Locks: How do you prevent duplicate work?
Work Stealing: How do idle agents get more work?

Reference: “Designing Data-Intensive Applications” Chapter 9 - Consistency and Consensus
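
Preventing duplicate work does not need a full consensus protocol when all agents share one filesystem. A minimal sketch, assuming a local disk and filename-safe task IDs: exclusive file creation acts as the lock. `claimTask` and the `.lock` naming are hypothetical.

```typescript
import { closeSync, mkdtempSync, openSync, writeSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: an agent claims a task by creating a lock file.
// The "wx" flag makes creation fail if the file already exists, so only
// one claimant can succeed even when several processes race.
function claimTask(lockDir: string, taskId: string, agentName: string): boolean {
  try {
    const fd = openSync(join(lockDir, `${taskId}.lock`), "wx");
    writeSync(fd, agentName); // record the owner for debugging
    closeSync(fd);
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "EEXIST") return false;
    throw err; // unexpected I/O failure: surface it
  }
}

const lockDir = mkdtempSync(join(tmpdir(), "swarm-locks-"));
console.log(claimTask(lockDir, "convert-utils", "agent-3")); // true
console.log(claimTask(lockDir, "convert-utils", "agent-4")); // false
```

A crashed agent leaves its lock file behind, so a real system would also need a way to expire stale claims, for example by comparing the file's age against the agent's timeout.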

5. Questions to Guide Your Design

Before implementing, think through these:

5.1 Work Division

How do you partition work across agents?
- By file? By directory? By task type?
What is the optimal number of agents for a task?
- Too few: Slow
- Too many: Coordination overhead dominates
How do you handle uneven workloads?
- Some files are 10 lines, others are 1000
- Work stealing vs. static assignment
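
One static answer to uneven workloads is to balance by size instead of by file count. The sketch below uses a greedy largest-first heuristic, with line count standing in for cost. `FileTask` and `partitionBySize` are made-up names, and line count is only a rough proxy for how long an agent will take.

```typescript
// Hypothetical type for illustration.
interface FileTask {
  path: string;
  lines: number;
}

// Greedy "largest first" assignment: sort by size, then always hand the
// next file to the agent with the smallest total so far.
function partitionBySize(files: FileTask[], agentCount: number): FileTask[][] {
  if (agentCount < 1) throw new Error("agentCount must be at least 1");
  const buckets: FileTask[][] = Array.from({ length: agentCount }, () => []);
  const totals: number[] = new Array(agentCount).fill(0);
  const sorted = [...files].sort((a, b) => b.lines - a.lines);
  for (const file of sorted) {
    const target = totals.indexOf(Math.min(...totals));
    buckets[target].push(file);
    totals[target] += file.lines;
  }
  return buckets;
}

const plan = partitionBySize(
  [
    { path: "a.js", lines: 1000 },
    { path: "b.js", lines: 10 },
    { path: "c.js", lines: 400 },
    { path: "d.js", lines: 600 },
  ],
  2
);
console.log(plan.map((bucket) => bucket.map((f) => f.path)));
```

Here the 1000-line file is paired with the 10-line file, and the 600-line and 400-line files share the other agent, so the totals are 1010 and 1000 lines. Work stealing remains useful on top of this because the size estimate can be wrong.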

5.2 Communication

How do agents share results?
- Shared filesystem?
- Message queue?
- In-memory (Redis)?
What is the message format?
- Structured JSON?
- Streaming updates?
- Event sourcing?
How do you handle ordering and conflicts?
- Two agents find the same bug
- One agent’s output invalidates another’s
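
The "two agents find the same bug" case is easier to handle when every message carries enough structure to be fingerprinted. A minimal sketch, assuming structured JSON findings; the `Finding` fields and the fingerprint rule (same kind, file, and line) are illustrative choices, not a fixed protocol.

```typescript
// Hypothetical message shape; field names are illustrative.
interface Finding {
  agent: string;
  kind: "bug" | "type-info" | "dependency";
  file: string;
  line: number;
  message: string;
}

type MergedFinding = Finding & { reportedBy: string[] };

// Two findings count as the same issue when kind, file, and line match.
function fingerprint(f: Finding): string {
  return `${f.kind}:${f.file}:${f.line}`;
}

// Keeps the first report of each issue and records every agent that saw it.
function dedupeFindings(findings: Finding[]): MergedFinding[] {
  const byKey = new Map<string, MergedFinding>();
  for (const f of findings) {
    const key = fingerprint(f);
    const existing = byKey.get(key);
    if (existing) {
      if (existing.reportedBy.indexOf(f.agent) === -1) existing.reportedBy.push(f.agent);
    } else {
      byKey.set(key, { ...f, reportedBy: [f.agent] });
    }
  }
  return Array.from(byKey.values());
}

const merged = dedupeFindings([
  { agent: "agent-1", kind: "bug", file: "auth.js", line: 42, message: "unchecked null" },
  { agent: "agent-2", kind: "bug", file: "auth.js", line: 42, message: "possible null deref" },
  { agent: "agent-2", kind: "bug", file: "auth.js", line: 90, message: "missing await" },
]);
console.log(merged.map((m) => `${fingerprint(m)} <- ${m.reportedBy.join(", ")}`));
```

A finding reported by several agents can also be treated as higher confidence, which gives the orchestrator a simple input for conflict resolution.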

5.3 Progress and Observability

How do you track overall progress?
- Not just “tasks complete” but “meaningful progress”
How do you visualize agent status?
- Real-time dashboard?
- CLI output?
What do you log for debugging?
- Too little: Cannot diagnose failures
- Too much: Information overload
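
To make "meaningful progress" concrete, one option is to weight each task by its estimated size rather than counting tasks. The sketch below is one such measure; `TrackedTask` and `weightedProgress` are hypothetical names, and the weight could be line count, file size, or any other cost estimate.

```typescript
// Hypothetical task record for illustration.
interface TrackedTask {
  id: string;
  weight: number; // estimated cost, e.g. lines of code
  done: boolean;
}

// Fraction of total estimated work that is finished, between 0 and 1.
function weightedProgress(tasks: TrackedTask[]): number {
  const total = tasks.reduce((sum, t) => sum + t.weight, 0);
  if (total === 0) return 0;
  const finished = tasks.reduce((sum, t) => sum + (t.done ? t.weight : 0), 0);
  return finished / total;
}

const tasks: TrackedTask[] = [
  { id: "small-1", weight: 10, done: true },
  { id: "small-2", weight: 10, done: true },
  { id: "large-1", weight: 1000, done: false },
];
console.log(`${(weightedProgress(tasks) * 100).toFixed(1)}% of estimated work done`);
```

With two 10-line files done and one 1000-line file pending, a plain task count reports 67% while the weighted figure is about 2%.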

6. Thinking Exercise: Design Agent Topologies

Consider three coordination patterns and when each is appropriate:

Pattern A: Hub and Spoke

                    +--------------+
                    | Orchestrator |
                    +--------------+
                    /      |       \
                   /       |        \
                  v        v         v
            +-------+  +-------+  +-------+
            |Agent 1|  |Agent 2|  |Agent 3|
            +-------+  +-------+  +-------+

Characteristics:

Central control over all agents
Simple to implement and reason about
Orchestrator can become bottleneck
Single point of failure

Use when:

Tasks are independent
Central state coordination is required
Debugging visibility is important

Pattern B: Pipeline

   +-------+     +-------+     +-------+     +--------+
   |Agent 1| --> |Agent 2| --> |Agent 3| --> | Result |
   +-------+     +-------+     +-------+     +--------+
   (Analyze)     (Convert)     (Validate)

Characteristics:

Each agent has a specialized role
Output of one feeds input of next
Natural for staged workflows
Bottleneck at slowest stage

Use when:

Tasks have natural phases
Each phase has different requirements
Order matters

Pattern C: Mesh

   +-------+ <---------> +-------+
   |Agent 1|             |Agent 2|
   +-------+             +-------+
      ^                      ^
      |                      |
      v                      v
   +-------+ <---------> +-------+
   |Agent 3|             |Agent 4|
   +-------+             +-------+

Characteristics:

All agents can communicate
Most flexible but most complex
Emergent coordination
Difficult to debug

Use when:

Tasks require collaboration
No clear hierarchy
Maximum parallelism needed

Design Questions

For each pattern, answer:

How do you implement it with Claude Code headless?
What are the failure modes?
How do you monitor progress?
What coordination primitives are needed?

7. The Interview Questions They Will Ask

Prepare to answer these:

“How would you handle a situation where agents produce conflicting results?”
Think about: Voting, confidence scores, human escalation, domain-specific merge strategies

“What is your strategy for debugging a multi-agent system?”
Think about: Correlation IDs, distributed tracing, replay capabilities, deterministic testing

“How do you prevent agents from duplicating work?”
Think about: Work queues, distributed locks, idempotent operations, task fingerprinting

“What is the tradeoff between agent count and coordination overhead?”
Think about: Amdahl’s Law, communication costs, diminishing returns, optimal parallelism

“How would you implement checkpointing for long-running swarms?”
Think about: State persistence, resume semantics, partial completion, crash recovery
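
For the agent-count question, Amdahl's Law gives a quick upper bound: if a fraction p of the work can run in parallel across n agents, the speedup is 1 / ((1 - p) + p / n). The sketch below just evaluates that formula; the 90% parallel fraction is an assumed value for illustration, and the formula ignores coordination cost, which makes real results worse.

```typescript
// Amdahl's Law: speedup = 1 / ((1 - p) + p / n)
// p = fraction of the work that can be parallelized, n = number of agents.
function amdahlSpeedup(parallelFraction: number, agents: number): number {
  if (parallelFraction < 0 || parallelFraction > 1) {
    throw new RangeError("parallelFraction must be between 0 and 1");
  }
  if (agents < 1) throw new RangeError("agents must be at least 1");
  return 1 / ((1 - parallelFraction) + parallelFraction / agents);
}

for (const n of [1, 2, 5, 10, 50]) {
  console.log(`${n} agents -> ${amdahlSpeedup(0.9, n).toFixed(2)}x`);
}
```

With 90% of the work parallelizable, 5 agents give about 3.6x and 50 agents about 8.5x; no agent count can exceed 10x because the serial 10% remains.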

8. Hints in Layers

Only read when stuck:

Hint 1: Start with Two Agents

Build the simplest possible orchestration: one agent produces, one agent reviews. Get that working first before scaling up.

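// Simplest possible multi-agent
// spawnAgent and waitForResult are placeholders you implement yourself;
// here spawnAgent takes just a name and a prompt.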
async function simplestSwarm() {
  const producer = spawnAgent("producer", "Write code for X");
  const result = await waitForResult(producer);

  const reviewer = spawnAgent("reviewer", `Review this code: ${result}`);
  const review = await waitForResult(reviewer);

  return { code: result, review };
}

Hint 2: Use Session IDs

Each headless Claude session has an ID. Use --resume to have agents continue from where they left off.

# First interaction
claude -p "Start analyzing codebase" --output-format json
# Returns session_id in output

# Continue same session
claude --resume <session_id> -p "Now focus on security issues"
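
On the orchestrator side, the session ID has to be pulled out of the first call's output before it can be passed to `--resume`. A minimal sketch, assuming the JSON output is a single object with a `session_id` string field as the hint describes; `extractSessionId` and `resumeArgs` are hypothetical helpers.

```typescript
// Assumes `--output-format json` prints one JSON object that contains a
// `session_id` string. Returns null when the output does not match.
function extractSessionId(stdout: string): string | null {
  try {
    const parsed: unknown = JSON.parse(stdout);
    if (typeof parsed === "object" && parsed !== null) {
      const id = (parsed as Record<string, unknown>).session_id;
      return typeof id === "string" ? id : null;
    }
  } catch {
    // Output was not valid JSON; fall through to null.
  }
  return null;
}

// Builds the argument list for a follow-up call in the same session.
function resumeArgs(sessionId: string, prompt: string): string[] {
  return ["--resume", sessionId, "-p", prompt, "--output-format", "json"];
}

const sample = '{"session_id": "abc123", "result": "analysis started"}';
const sessionId = extractSessionId(sample);
if (sessionId) console.log(resumeArgs(sessionId, "Now focus on security issues"));
```

Returning null instead of throwing lets the orchestrator decide whether a missing ID means "retry the call" or "start a fresh session".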

Hint 3: File-Based Coordination

The simplest shared state is files. Agents write to specific locations, others read when ready.

/tmp/swarm-work/
  /agent-1/
    output.json       # Agent 1's results
    status.json       # Agent 1's current status
  /agent-2/
    output.json
    status.json
  /shared/
    work-queue.json   # Remaining work
    completed.json    # Finished items
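
With this layout, a reader can open status.json while its agent is still writing it. One common way around that, sketched below under the assumption that all agents share a local filesystem, is to write to a temporary file and rename it into place. `AgentStatus`, `writeStatus`, and `readStatus` are illustrative names.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical status record; the fields are illustrative.
interface AgentStatus {
  state: "working" | "waiting" | "complete" | "failed";
  done: number;
  total: number;
  updatedAt: string;
}

// Writes <root>/<agent>/status.json via a temp file plus rename, so readers
// see either the old file or the new one, never a half-written file.
function writeStatus(root: string, agent: string, status: AgentStatus): void {
  const dir = join(root, agent);
  mkdirSync(dir, { recursive: true });
  const finalPath = join(dir, "status.json");
  const tmpPath = `${finalPath}.${process.pid}.tmp`;
  writeFileSync(tmpPath, JSON.stringify(status, null, 2));
  renameSync(tmpPath, finalPath);
}

// Returns null when the agent has not written a status yet.
function readStatus(root: string, agent: string): AgentStatus | null {
  try {
    return JSON.parse(readFileSync(join(root, agent, "status.json"), "utf-8")) as AgentStatus;
  } catch {
    return null;
  }
}

const root = mkdtempSync(join(tmpdir(), "swarm-work-"));
writeStatus(root, "agent-1", { state: "working", done: 12, total: 15, updatedAt: new Date().toISOString() });
console.log(readStatus(root, "agent-1"));
```

Each agent writes only inside its own directory, which avoids write conflicts entirely; only the files under /shared need locking.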

Hint 4: Output Styles for Specialization

Give each agent a different output style that focuses them on their task.

# styles/analyzer.md
You are a code analyzer. Focus ONLY on:
- Identifying patterns
- Cataloging dependencies
- Finding type information
Output structured JSON, no explanations.

# styles/converter.md
You are a code converter. Focus ONLY on:
- Converting JavaScript to TypeScript
- Adding type annotations
- Preserving functionality
Be conservative - flag uncertain conversions.
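
One way to turn these style files into per-agent command lines is a small role table. The sketch below assumes the style text is passed with `--append-system-prompt` and the tool list with `--allowedTools`; how styles are best wired in may differ between Claude Code versions, so check `claude --help` before relying on it. `RoleProfile` and `buildAgentArgs` are hypothetical.

```typescript
// Hypothetical per-role settings; in practice styleText would be read
// from a file such as styles/analyzer.md.
interface RoleProfile {
  prompt: string;    // the task given to the agent
  styleText: string; // contents of the role's style file
  tools: string[];   // tools this role is allowed to use
}

// Builds the CLI arguments for one headless agent of the given role.
function buildAgentArgs(profile: RoleProfile): string[] {
  return [
    "-p", profile.prompt,
    "--output-format", "json",
    "--append-system-prompt", profile.styleText, // assumption: style goes in as extra system prompt
    "--allowedTools", profile.tools.join(","),
  ];
}

const analyzer: RoleProfile = {
  prompt: "Catalog the dependencies used under ./src",
  styleText: "You are a code analyzer. Output structured JSON, no explanations.",
  tools: ["Read", "Glob", "Grep"],
};
console.log(buildAgentArgs(analyzer));
```

Restricting an analyzer to read-only tools is as much a part of its specialization as the prompt: it cannot drift into editing files even if the prompt is misread.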

9. Books That Will Help

Topic	Book	Chapters	Why It Helps
Distributed Systems	“Designing Data-Intensive Applications” by Kleppmann	Ch. 8-9	Understanding coordination, consistency, consensus
Agent Systems	“Multi-Agent Systems” by Wooldridge	Ch. 4-5	Agent communication, coordination protocols
Resilience	“Release It!” by Nygard	Ch. 4-5	Failure patterns, circuit breakers, bulkheads
Concurrency	“Seven Concurrency Models in Seven Weeks” by Butcher	All	Different coordination paradigms
AI Agents	“Artificial Intelligence: A Modern Approach” by Russell & Norvig	Ch. 2	Agent architectures and design

10. Architecture Deep Dive

10.1 System Architecture

+------------------------------------------------------------------------+
|                    MULTI-AGENT ORCHESTRATION SYSTEM                     |
+------------------------------------------------------------------------+
|                                                                        |
|  +------------------------------------------------------------------+  |
|  |                         ORCHESTRATOR                              |  |
|  |  (TypeScript/Node.js)                                            |  |
|  |                                                                   |  |
|  |  +------------+  +-------------+  +------------+  +------------+  |  |
|  |  | Work Queue |  | State Store |  | Agent Pool |  | Result Agg |  |  |
|  |  +------------+  +-------------+  +------------+  +------------+  |  |
|  +------------------------------------------------------------------+  |
|          |                  |                  |                       |
|          | spawn            | read/write       | results               |
|          v                  v                  v                       |
|  +------------------------------------------------------------------+  |
|  |                       AGENT LAYER                                 |  |
|  |                                                                   |  |
|  |  +----------+    +----------+    +----------+    +----------+     |  |
|  |  | Agent 1  |    | Agent 2  |    | Agent 3  |    | Agent N  |     |  |
|  |  | Headless |    | Headless |    | Headless |    | Headless |     |  |
|  |  | Claude   |    | Claude   |    | Claude   |    | Claude   |     |  |
|  |  +----------+    +----------+    +----------+    +----------+     |  |
|  |       |              |              |              |               |  |
|  +------------------------------------------------------------------+  |
|          |              |              |              |                 |
|          v              v              v              v                 |
|  +------------------------------------------------------------------+  |
|  |                    SHARED STATE LAYER                             |  |
|  |                                                                   |  |
|  |  +---------------+  +---------------+  +-------------------+      |  |
|  |  | File System   |  | Redis/SQLite  |  | Message Queue     |      |  |
|  |  | (Simplest)    |  | (Structured)  |  | (Event-Driven)    |      |  |
|  |  +---------------+  +---------------+  +-------------------+      |  |
|  +------------------------------------------------------------------+  |
|                                                                        |
+------------------------------------------------------------------------+

10.2 Agent Spawning Pattern

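import { spawn, ChildProcess } from 'node:child_process';

interface AgentConfig {
  name: string;
  role: 'analyzer' | 'converter' | 'reviewer' | 'validator';
  outputStyle: string;
  workDir: string;
}

interface AgentHandle {
  name: string;
  process: ChildProcess;
  sessionId: string | null;
  status: 'starting' | 'running' | 'complete' | 'failed';
  startTime: number;
}

// getPromptForRole and getToolsForRole are project-specific helpers you supply.
async function spawnAgent(config: AgentConfig): Promise<AgentHandle> {
  const { name, role, outputStyle, workDir } = config;

  const proc = spawn('claude', [
    '-p', getPromptForRole(role),
    '--output-format', 'stream-json',
    '--verbose', // print mode needs this for stream-json output
    // Assumption: confirm this flag with `claude --help`; output styles may
    // need to be selected through settings in your version instead.
    '--output-style', outputStyle,
    '--allowedTools', getToolsForRole(role).join(',')
  ], {
    cwd: workDir, // run the agent inside its working directory
    env: { ...process.env, CLAUDE_AGENT_NAME: name }
  });

  const handle: AgentHandle = {
    name,
    process: proc,
    sessionId: null,
    status: 'starting',
    startTime: Date.now()
  };

  // stdout chunks can end mid-line, so buffer until a full line arrives
  let buffer = '';
  proc.stdout.on('data', (chunk: Buffer) => {
    buffer += chunk.toString();
    const lines = buffer.split('\n');
    buffer = lines.pop() ?? ''; // keep the trailing partial line for later

    for (const line of lines) {
      if (!line.trim()) continue;
      try {
        const event = JSON.parse(line);
        // The first event that carries a session ID marks the agent as running
        if (event.session_id && !handle.sessionId) {
          handle.sessionId = event.session_id;
          handle.status = 'running';
        }
      } catch {
        // Skip lines that are not valid JSON instead of crashing the orchestrator
      }
    }
  });

  // Reflect process failure and exit in the handle so callers can react
  proc.on('error', () => { handle.status = 'failed'; });
  proc.on('exit', (code) => {
    handle.status = code === 0 ? 'complete' : 'failed';
  });

  return handle;
}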

10.3 Result Aggregation

interface AgentResult {
  agentName: string;
  role: string;
  output: any;
  duration: number;
  errors: string[];
}

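// One detected disagreement between agent outputs
interface Conflict {
  type: string;        // e.g. 'file_conflict'
  file: string;
  agents: string[];    // names of the agents involved
  resolution: string;  // e.g. 'manual_review'
}

interface MergedResult {
  success: boolean;
  outputs: Map<string, any>;
  conflicts: Conflict[];
  summary: string;
}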

async function aggregateResults(
  agents: AgentHandle[]
): Promise<MergedResult> {
  // Wait for all agents to complete
  const results = await Promise.allSettled(
    agents.map(a => waitForCompletion(a))
  );

  const successful: AgentResult[] = [];
  const failed: { agent: string; error: Error }[] = [];

  results.forEach((result, idx) => {
    if (result.status === 'fulfilled') {
      successful.push(result.value);
    } else {
      failed.push({
        agent: agents[idx].name,
        error: result.reason
      });
    }
  });

  // Detect conflicts between agent outputs
  const conflicts = detectConflicts(successful);

  // Merge based on role-specific strategies
  const merged = mergeByRole(successful);

  return {
    success: failed.length === 0,
    outputs: merged,
    conflicts,
    summary: generateSummary(successful, failed, conflicts)
  };
}

function detectConflicts(results: AgentResult[]): Conflict[] {
  const conflicts: Conflict[] = [];

  // Example: Two agents modified the same file
  const fileModifications = new Map<string, AgentResult[]>();

  for (const result of results) {
    const files = result.output.modifiedFiles || [];
    for (const file of files) {
      const existing = fileModifications.get(file) || [];
      existing.push(result);
      fileModifications.set(file, existing);
    }
  }

  for (const [file, agents] of fileModifications) {
    if (agents.length > 1) {
      conflicts.push({
        type: 'file_conflict',
        file,
        agents: agents.map(a => a.agentName),
        resolution: 'manual_review'
      });
    }
  }

  return conflicts;
}

10.4 Failure Handling

interface RetryConfig {
  maxRetries: number;
  backoffMs: number;
  backoffMultiplier: number;
  maxBackoffMs: number;
}

// Promise-based delay used between attempts
const sleep = (ms: number) =>
  new Promise<void>(resolve => setTimeout(resolve, ms));

async function withRetry<T>(
  operation: () => Promise<T>,
  config: RetryConfig,
  context: string
): Promise<T> {
  let lastError: Error | undefined;
  let delay = config.backoffMs;

  for (let attempt = 0; attempt < config.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Normalize non-Error throwables so .message is always available
      lastError = error instanceof Error ? error : new Error(String(error));

      console.error(
        `[${context}] Attempt ${attempt + 1} failed: ${lastError.message}`
      );

      if (attempt < config.maxRetries - 1) {
        console.log(`[${context}] Retrying in ${delay}ms...`);
        await sleep(delay);
        // Grow the delay for the next attempt, capped at maxBackoffMs
        delay = Math.min(delay * config.backoffMultiplier, config.maxBackoffMs);
      }
    }
  }

  throw new Error(
    `[${context}] All ${config.maxRetries} attempts failed. ` +
    `Last error: ${lastError?.message ?? 'none recorded'}`
  );
}

// Circuit breaker for agent spawning
class AgentCircuitBreaker {
  private failures = 0;
  private lastFailure = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private threshold: number = 3,
    private resetTimeMs: number = 30000
  ) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure > this.resetTimeMs) {
        this.state = 'half-open';
      } else {
        throw new Error('Circuit breaker is open - agent spawning disabled');
      }
    }

    try {
      const result = await operation();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();

    if (this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}

11. Implementation Milestones

Milestone 1: Two Agents Coordinate Successfully (Week 1-2)

Goal: Basic orchestration works

Deliverables:

Validation: Run migration on 5-file test project

Milestone 2: Failure is Handled Gracefully (Week 2-3)

Goal: Resilience works

Deliverables:

Agent timeout handling
Retry with exponential backoff
Work redistribution on failure
Partial results on failure
Circuit breaker for spawning

Validation: Kill random agents during run, system recovers
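
For the "work redistribution on failure" deliverable, the core step is moving a dead agent's in-flight items back to the queue while capping how often any item is retried. A minimal sketch; `WorkItem`, `SwarmState`, and `redistribute` are hypothetical, and the state is kept as plain data so it can be persisted for checkpointing.

```typescript
interface WorkItem {
  id: string;
  attempts: number; // how many times this item has already failed
}

interface SwarmState {
  queue: WorkItem[];                    // not yet assigned
  inFlight: Record<string, WorkItem[]>; // agent name -> items it holds
  abandoned: WorkItem[];                // gave up; needs manual review
}

// Returns a new state in which the failed agent's items are either back at
// the front of the queue or abandoned after maxAttempts failures.
function redistribute(state: SwarmState, failedAgent: string, maxAttempts: number): SwarmState {
  const orphaned = state.inFlight[failedAgent] ?? [];
  const retry: WorkItem[] = [];
  const abandoned = [...state.abandoned];
  for (const item of orphaned) {
    const next = { ...item, attempts: item.attempts + 1 };
    if (next.attempts >= maxAttempts) abandoned.push(next);
    else retry.push(next);
  }
  const inFlight = { ...state.inFlight };
  delete inFlight[failedAgent];
  return { queue: [...retry, ...state.queue], inFlight, abandoned };
}

const before: SwarmState = {
  queue: [{ id: "Modal.jsx", attempts: 0 }],
  inFlight: {
    "agent-3": [{ id: "helpers.js", attempts: 0 }],
    "agent-4": [{ id: "Table.jsx", attempts: 0 }, { id: "DataGrid.jsx", attempts: 2 }],
  },
  abandoned: [],
};
console.log(JSON.stringify(redistribute(before, "agent-4", 3), null, 2));
```

Putting retried items at the front of the queue keeps a failure from pushing them behind all remaining work; the attempt cap keeps a file that always crashes its agent from looping forever.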

Milestone 3: N Agents Scale Efficiently (Week 3-4)

Goal: Full swarm capability

Deliverables:

Dynamic agent count based on work
Work stealing for load balancing
Progress reporting dashboard
Performance metrics
Conflict detection and resolution

Validation: 50-file project completes close to 5x faster with 5 agents than with 1
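
For the "dynamic agent count" deliverable, a simple starting rule is to size the swarm from the queue depth and clamp it to configured limits. The sketch below assumes a fixed target of tasks per agent; `chooseAgentCount` and the `tasksPerAgent` parameter are hypothetical, and the right target depends on how long tasks take relative to agent startup cost.

```typescript
// Picks an agent count from the number of pending tasks, clamped to
// the range [minAgents, maxAgents].
function chooseAgentCount(
  pendingTasks: number,
  tasksPerAgent: number,
  minAgents: number,
  maxAgents: number
): number {
  if (tasksPerAgent < 1 || minAgents < 1 || maxAgents < minAgents) {
    throw new RangeError("invalid scaling limits");
  }
  const wanted = Math.ceil(pendingTasks / tasksPerAgent);
  return Math.max(minAgents, Math.min(maxAgents, wanted));
}

console.log(chooseAgentCount(55, 10, 2, 10));  // 6
console.log(chooseAgentCount(3, 10, 2, 10));   // 2
console.log(chooseAgentCount(500, 10, 2, 10)); // 10
```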

12. Testing Strategy

Unit Tests

describe('Agent Spawning', () => {
  it('should spawn agent with correct configuration', async () => {
    const agent = await spawnAgent({
      name: 'test-agent',
      role: 'analyzer',
      outputStyle: 'json',
      workDir: '/tmp/test'
    });

    expect(agent.status).toBe('starting');
    expect(agent.process.pid).toBeDefined();
  });

  it('should parse session ID from output', async () => {
    // Mock Claude output
    const mockOutput = '{"session_id": "abc123", "type": "init"}\n';
    // ...
  });
});

Integration Tests

describe('Multi-Agent Migration', () => {
  it('should complete migration with multiple converters', async () => {
    const result = await runSwarm({
      task: 'migrate',
      target: './test-fixtures/small-project',
      agentCount: 3
    });

    expect(result.success).toBe(true);
    expect(result.filesConverted).toBe(10);
    expect(result.errors).toHaveLength(0);
  });

  it('should handle agent failure gracefully', async () => {
    const result = await runSwarm({
      task: 'migrate',
      target: './test-fixtures/small-project',
      agentCount: 3,
      simulateFailure: { agentIndex: 1, afterMs: 1000 }
    });

    expect(result.success).toBe(true);
    expect(result.recoveries).toBe(1);
  });
});

Chaos Tests

describe('Chaos Engineering', () => {
  it('should survive random agent terminations', async () => {
    const swarm = startSwarm({ agentCount: 5 });

    // Kill random agents every 5 seconds
    const chaos = setInterval(() => {
      const randomAgent = Math.floor(Math.random() * 5);
      swarm.agents[randomAgent]?.process.kill('SIGTERM');
    }, 5000);

    const result = await swarm.complete();
    clearInterval(chaos);

    expect(result.success).toBe(true);
  });
});

13. Configuration Schema

# swarm-config.yaml
swarm:
  name: "code-migration"
  maxAgents: 10
  minAgents: 2

agents:
  analyzer:
    count: 2
    style: "styles/analyzer.md"
    tools: ["Read", "Glob", "Grep"]
    timeout: 300000  # 5 minutes

  converter:
    count: 5
    style: "styles/converter.md"
    tools: ["Read", "Write", "Edit", "Bash"]
    timeout: 600000  # 10 minutes

  validator:
    count: 2
    style: "styles/validator.md"
    tools: ["Bash"]
    timeout: 180000  # 3 minutes

  reviewer:
    count: 1
    style: "styles/reviewer.md"
    tools: ["Read", "Glob"]
    timeout: 300000  # 5 minutes

coordination:
  stateBackend: "sqlite"  # or "redis", "filesystem"
  workDistribution: "pull"  # or "push"
  conflictResolution: "manual"  # or "auto-merge", "last-write-wins"

resilience:
  retries: 3
  backoff:
    initial: 1000
    multiplier: 2
    max: 30000
  circuitBreaker:
    threshold: 5
    resetTime: 60000

observability:
  logLevel: "info"
  metricsPort: 9090
  dashboard: true

14. Example Swarm Definitions

Migration Swarm

# swarms/migration.yaml
name: JavaScript to TypeScript Migration
description: Convert JS files to TS with type inference

phases:
  - name: analysis
    agents:
      - role: dependency-analyzer
        task: "Identify all external dependencies and their @types packages"
      - role: pattern-analyzer
        task: "Find common patterns that indicate types (PropTypes, JSDoc, etc.)"

  - name: conversion
    depends_on: [analysis]
    agents:
      - role: converter
        count: dynamic  # Scale based on file count
        task: "Convert assigned files to TypeScript"

  - name: validation
    depends_on: [conversion]
    agents:
      - role: type-checker
        task: "Run tsc and collect type errors"
      - role: test-runner
        task: "Run existing tests to verify behavior"

  - name: review
    depends_on: [validation]
    agents:
      - role: reviewer
        task: "Review conversions for quality and consistency"
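
The `depends_on` fields above define a dependency graph, and the orchestrator has to turn that graph into an execution order. A minimal sketch, assuming the YAML has already been parsed into objects: phases are grouped into waves, where every phase in a wave can start once the previous wave has finished. `Phase` and `planWaves` are hypothetical names.

```typescript
interface Phase {
  name: string;
  dependsOn: string[];
}

// Groups phases into waves: each wave contains every phase whose
// dependencies were all completed by earlier waves.
function planWaves(phases: Phase[]): string[][] {
  const known = new Set(phases.map((p) => p.name));
  for (const p of phases) {
    for (const dep of p.dependsOn) {
      if (!known.has(dep)) throw new Error(`Unknown dependency "${dep}" in phase "${p.name}"`);
    }
  }

  const done = new Set<string>();
  const waves: string[][] = [];
  let remaining = [...phases];
  while (remaining.length > 0) {
    const ready = remaining.filter((p) => p.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) throw new Error("Dependency cycle detected");
    waves.push(ready.map((p) => p.name));
    ready.forEach((p) => done.add(p.name));
    remaining = remaining.filter((p) => !done.has(p.name));
  }
  return waves;
}

console.log(
  planWaves([
    { name: "analysis", dependsOn: [] },
    { name: "conversion", dependsOn: ["analysis"] },
    { name: "validation", dependsOn: ["conversion"] },
    { name: "review", dependsOn: ["validation"] },
  ])
);
```

For the migration swarm this yields four single-phase waves, since each phase depends on the one before it. Phases that share the same dependencies would land in the same wave and could run in parallel.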

Analysis Swarm

# swarms/codebase-analysis.yaml
name: Deep Codebase Analysis
description: Comprehensive analysis of large codebase

agents:
  - role: architecture-mapper
    count: 2
    task: "Map module dependencies and identify architectural patterns"

  - role: security-auditor
    count: 2
    task: "Find security vulnerabilities and unsafe patterns"

  - role: performance-profiler
    count: 1
    task: "Identify performance bottlenecks and optimization opportunities"

  - role: code-quality-reviewer
    count: 2
    task: "Assess code quality, duplication, and maintainability"

  - role: documentation-checker
    count: 1
    task: "Evaluate documentation coverage and accuracy"

aggregation:
  strategy: merge-by-category
  output: comprehensive-report

15. Observability Dashboard

+------------------------------------------------------------------------+
|                    CLAUDE SWARM DASHBOARD                               |
+------------------------------------------------------------------------+
|                                                                        |
|  TASK: code-migration-2024-01-15                                       |
|  STATUS: RUNNING                                                       |
|  ELAPSED: 00:04:32                                                     |
|                                                                        |
|  +-------------------------------------------------------------------+ |
|  |                     OVERALL PROGRESS                              | |
|  |  [================------------] 60% (33/55 files)                 | |
|  +-------------------------------------------------------------------+ |
|                                                                        |
|  +-------------------------------------------------------------------+ |
|  |                      AGENT STATUS                                 | |
|  +-------------------------------------------------------------------+ |
|  | Agent          | Status    | Progress | CPU  | Mem   | Duration  | |
|  |----------------|-----------|----------|------|-------|-----------|  |
|  | analyzer-1     | COMPLETE  | 100%     | -    | -     | 1:23      | |
|  | analyzer-2     | COMPLETE  | 100%     | -    | -     | 1:45      | |
|  | converter-1    | WORKING   | 80%      | 45%  | 234MB | 3:12      | |
|  | converter-2    | WORKING   | 62%      | 52%  | 198MB | 3:08      | |
|  | converter-3    | COMPLETE  | 100%     | -    | -     | 2:56      | |
|  | validator-1    | WAITING   | 0%       | 2%   | 45MB  | -         | |
|  | reviewer-1     | WORKING   | 15%      | 12%  | 87MB  | 0:45      | |
|  +-------------------------------------------------------------------+ |
|                                                                        |
|  +-------------------------------------------------------------------+ |
|  |                      RECENT EVENTS                                | |
|  +-------------------------------------------------------------------+ |
|  | 04:30:12 | converter-3 | Completed /src/services/auth.ts         | |
|  | 04:30:08 | converter-2 | Converting /src/components/Modal.tsx    | |
|  | 04:29:55 | converter-1 | WARNING: Complex HOC in DataGrid.jsx    | |
|  | 04:29:45 | reviewer-1  | Reviewed /src/utils/helpers.ts - PASS   | |
|  +-------------------------------------------------------------------+ |
|                                                                        |
|  +-------------------------------------------------------------------+ |
|  |                      METRICS                                      | |
|  +-------------------------------------------------------------------+ |
|  | Files/minute:     8.2                                             | |
|  | Errors:           1                                               | |
|  | Warnings:         3                                               | |
|  | Manual review:    1 file                                          | |
|  | Est. remaining:   2:45                                            | |
|  +-------------------------------------------------------------------+ |
|                                                                        |
+------------------------------------------------------------------------+
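
The progress line and remaining-time estimate in a dashboard like this can be computed from three numbers: files done, files total, and elapsed time. A minimal sketch, assuming a constant processing rate; `renderProgressBar` and `estimateRemainingMs` are hypothetical helpers, and the estimate is only as good as that assumption.

```typescript
// Renders a fixed-width text bar such as "[====----] 50% (5/10 files)".
function renderProgressBar(done: number, total: number, width: number): string {
  const ratio = total > 0 ? Math.min(done / total, 1) : 0;
  const filled = Math.floor(ratio * width);
  const bar = "=".repeat(filled) + "-".repeat(width - filled);
  return `[${bar}] ${Math.round(ratio * 100)}% (${done}/${total} files)`;
}

// Extrapolates remaining time from the average rate so far.
// Returns null until at least one file is done.
function estimateRemainingMs(done: number, total: number, elapsedMs: number): number | null {
  if (done <= 0) return null;
  return (elapsedMs / done) * Math.max(total - done, 0);
}

console.log(renderProgressBar(33, 55, 28));
const remaining = estimateRemainingMs(33, 55, 272000);
if (remaining !== null) console.log(`Est. remaining: ${Math.round(remaining / 1000)}s`);
```

Combining this with size-weighted progress gives a steadier estimate when file sizes vary widely.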

16. Common Pitfalls and Solutions

Pitfall 1: Context Window Exhaustion

Problem: Agents run out of context when processing large files.

Solution:

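// Split large files before assigning to agents
import { readFileSync } from 'node:fs';

// A unit of work: a whole file, or one chunk of a large file
type Task =
  | { type: 'complete'; file: string }
  | { type: 'partial'; file: string; chunk: number; totalChunks: number; content: string[] };

// splitByBoundaries is a helper you supply: it should cut the lines at
// function/class boundaries into chunks of at most maxLines lines each.
function splitLargeFiles(files: string[], maxLines: number = 500): Task[] {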
  const tasks: Task[] = [];

  for (const file of files) {
    const lines = readFileSync(file, 'utf-8').split('\n');

    if (lines.length > maxLines) {
      // Split into logical chunks (by function/class boundaries)
      const chunks = splitByBoundaries(lines, maxLines);
      chunks.forEach((chunk, i) => {
        tasks.push({
          type: 'partial',
          file,
          chunk: i,
          totalChunks: chunks.length,
          content: chunk
        });
      });
    } else {
      tasks.push({ type: 'complete', file });
    }
  }

  return tasks;
}

Pitfall 2: Race Conditions in Shared State

Problem: Multiple agents update the same state simultaneously.

Solution:

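// Use atomic operations with file locking
import { readFileSync, writeFileSync } from 'node:fs';
import { lock } from 'proper-lockfile';

async function updateSharedState(
  statePath: string,
  updater: (state: any) => any
): Promise<void> {
  // lock() resolves to a release function; the state file must already exist
  const release = await lock(statePath);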

  try {
    const current = JSON.parse(readFileSync(statePath, 'utf-8'));
    const updated = updater(current);
    writeFileSync(statePath, JSON.stringify(updated, null, 2));
  } finally {
    await release();
  }
}

Pitfall 3: Agent Starvation

Problem: Some agents finish quickly and sit idle while others are overloaded.

Solution:

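// Implement work stealing
// Assumes AgentHandle has been extended with stealWork() and assignWork()
// methods and an 'idle-no-work' status; none of these exist in section 10.2.
async function workStealing(idle: AgentHandle, busy: AgentHandle[]): Promise<void> {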
  for (const agent of busy) {
    const stolen = await agent.stealWork(1);
    if (stolen.length > 0) {
      await idle.assignWork(stolen);
      return;
    }
  }

  // No work to steal - mark agent as truly idle
  idle.status = 'idle-no-work';
}

17. Extension Ideas

Once you complete the basic swarm:

Adaptive Scaling: Automatically adjust agent count based on work queue depth
Agent Memory: Agents remember context from previous runs
Specialization Learning: Agents develop expertise over time
Cross-Project Swarms: Coordinate across multiple repositories
Human-in-the-Loop: Escalation points for human decisions
Visual Swarm Builder: GUI for designing swarm topologies

18. Success Criteria

You have mastered this project when:

Your swarm can coordinate 5+ agents on a real task
Failures are handled gracefully without data loss
Throughput keeps improving as you add agents, even when the gain is sub-linear
You can explain every coordination decision
You have implemented at least two topology patterns
You can debug a multi-agent failure from logs alone
Your swarm has completed a real-world migration or analysis

Source

This project is part of the Claude Code Mastery: 40 Projects learning path.