Multi-agent systems moved from research demos to production in 2025. By 2026, the question is no longer "can a swarm of LLMs solve this" but "which Claude-native stack do I pick, and how do I keep the bill under control." Claude 4.7 ships with native sub-agent support inside Claude Code, a first-class Agent SDK in TypeScript and Python, and clean interop with LangGraph and the OpenAI Agents SDK. You have options. Picking the wrong one costs weeks.
The hard part is not writing a manager that fans out to two workers. That fits in twenty lines. The hard part is everything around it: communication protocols that do not balloon context, retries that do not loop, eval harnesses that catch silent regressions, and cost guardrails that stop a runaway swarm from burning through your monthly budget at 3am. Most "multi-agent" tutorials skip all of that. This one does not.
This guide shows you how to build a multi-agent system with Claude in 2026 — from topology selection through production deployment. You get the four mainstream architecture options compared, a step-by-step build path with real code, a complete 4-agent code review system as a working example, the failure modes that bite in production, and a hiring shortcut at the end if you want the whole thing built for you.
Architecture options
Four mainstream paths exist in 2026. Each has a sweet spot.
Claude Code sub-agents — the lowest-friction option. Inside Claude Code, you spawn sub-agents via the Task tool. Each sub-agent gets its own context window, its own tool budget, and reports back to the parent. Zero infrastructure. Best for IDE-resident dev workflows, code review swarms, and research agents that fan out across a repo. Ships in every Claude Code install. See our Claude Code agent swarm architecture deep dive for the full pattern catalog.
Claude Agent SDK (TypeScript or Python) — Anthropic's first-party SDK for shipping agents outside Claude Code. You define agents as code, register tools, and Anthropic handles the loop: model call, tool dispatch, context compaction, retries. Best for production APIs, scheduled jobs, and anything that needs to run without a human attached. Released GA in late 2025, now the default for new production builds.
LangGraph + Claude — graph-based orchestration where each node is an agent or a tool. You wire edges, conditional routing, and state reducers. More boilerplate than the Agent SDK, but you get explicit control over the execution graph, durable state, and human-in-the-loop checkpoints. Best for complex workflows with branching logic, approval gates, or long-running state machines.
OpenAI Agents SDK + Claude — Anthropic-compatible since the model-agnostic refactor in early 2026. If your team already runs the OpenAI Agents SDK and wants to A/B Claude against GPT, you can swap the model parameter and keep the same handoff and tracing primitives. Best for teams already invested in that stack who want Claude's reasoning without rebuilding orchestration. See our breakdown of the best multi-agent frameworks for the full comparison matrix.
A quick rule of thumb. Building inside Claude Code for dev tasks: sub-agents. Shipping a production agent to users: Agent SDK. Need explicit graph control or human approvals mid-flow: LangGraph. Already on OpenAI Agents SDK: keep it and route to Claude.
Step 1: Pick your topology
Three topologies cover 90% of real systems. Pick before you write code.
Manager / worker (fan-out). One manager agent decomposes the task and dispatches N parallel workers. Workers do not talk to each other. Manager aggregates results and returns. Use when sub-tasks are independent — research a list of companies, review a list of files, summarize a list of documents. Cheap, parallel, simple to debug.
Pipeline (sequential). Agent A's output is Agent B's input, and Agent B's output is Agent C's input. Each stage transforms. Use when work has clear sequential stages: extract, then classify, then write, then proofread. Predictable cost, easy to trace, but no parallelism. If one stage is slow, the whole pipeline waits.
Hierarchical (tree). A top manager dispatches to sub-managers, which dispatch to workers. Use when the problem decomposes recursively — auditing a monorepo where each package gets a sub-manager that fans out across files. Most powerful, most expensive, hardest to debug. Reserve for problems that genuinely need it.
Default to manager/worker. Move to pipeline when stages are inherently sequential. Move to hierarchical only when the problem demands it. Most production systems that "need" hierarchical actually need a flatter manager/worker with better prompts.
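The first two topologies can be sketched as plain async helpers. This is a minimal sketch, not SDK code: `AgentFn`, `fanOut`, and `pipeline` are hypothetical names, and each `AgentFn` stands in for one agent call.

```typescript
// Hypothetical types and helpers for illustration; an AgentFn stands in for one agent call.
type AgentFn = (input: string) => Promise<string>;

// Manager/worker: run the worker on every sub-task in parallel, then aggregate.
async function fanOut(
  subTasks: string[],
  worker: AgentFn,
  aggregate: (results: string[]) => string,
): Promise<string> {
  const results = await Promise.all(subTasks.map((task) => worker(task)));
  return aggregate(results);
}

// Pipeline: each stage consumes the previous stage's output, strictly in order.
async function pipeline(input: string, stages: AgentFn[]): Promise<string> {
  let current = input;
  for (const stage of stages) {
    current = await stage(current);
  }
  return current;
}
```

Under these assumptions, a hierarchical tree is a `fanOut` whose worker is itself a `fanOut`, which is one way to see why it multiplies both cost and debugging surface.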
Step 2: Define agents + tools
Here is a minimal manager plus two workers in the Claude Agent SDK (TypeScript). The manager dispatches research tasks, and the two workers handle web search and code search in parallel.
```typescript
// Import path and API shape as used in this guide; confirm both against the SDK release you install.
import { Agent, tool } from "@anthropic-ai/agent-sdk";
import { z } from "zod";

// Placeholders for your own integrations; they are not part of any SDK.
declare function fetchSearchResults(query: string): Promise<string>;
declare function searchRepo(repo: string, query: string): Promise<string>;

const webSearch = tool({
  name: "web_search",
  description: "Search the public web and return snippets.",
  input: z.object({ query: z.string() }),
  run: async ({ query }) => {
    // call your search provider here
    return await fetchSearchResults(query);
  },
});

const codeSearch = tool({
  name: "code_search",
  description: "Search the internal repo and return file matches.",
  input: z.object({ query: z.string(), repo: z.string() }),
  run: async ({ query, repo }) => {
    return await searchRepo(repo, query);
  },
});

// Workers: one tool each, one narrow job each.
const webWorker = new Agent({
  model: "claude-opus-4-7",
  systemPrompt: "You search the public web. Return concise factual snippets with URLs.",
  tools: [webSearch],
});

const codeWorker = new Agent({
  model: "claude-opus-4-7",
  systemPrompt: "You search internal code. Return file paths, line numbers, and a one-line summary per hit.",
  tools: [codeSearch],
});

// The dispatch tool is the only path from the manager to a worker.
const dispatch = tool({
  name: "dispatch_worker",
  description: "Send a task to a named worker and get its result.",
  input: z.object({
    worker: z.enum(["web", "code"]),
    task: z.string(),
  }),
  run: async ({ worker, task }) => {
    const agent = worker === "web" ? webWorker : codeWorker;
    const result = await agent.run({ input: task });
    return result.finalOutput;
  },
});

const manager = new Agent({
  model: "claude-opus-4-7",
  systemPrompt: `You are a research manager. Decompose the user's question into web and code sub-tasks.
Dispatch each sub-task to the right worker in parallel. Aggregate results into a single answer.`,
  tools: [dispatch],
});

// Entry point
const answer = await manager.run({
  input: "How does our auth service handle token refresh, and what does Auth0's 2026 doc recommend?",
});
```
Two things to notice. First, workers are isolated. Each one has its own context window — they do not see each other's history. The manager is the only place the full picture exists. Second, tools are the wire. The manager does not "call" workers directly. It calls a dispatch_worker tool, which is the only seam between agents. That seam is where you log, retry, and rate-limit.
Step 3: Communication protocol
Two patterns, both valid.
Structured messages (recommended default). Every agent-to-agent handoff is a typed object — task, expected output schema, deadline, parent task ID. The manager sends a WorkerTask and gets back a WorkerResult. Both are validated with Zod or Pydantic. Easy to log, easy to replay, easy to test. Cost is predictable because you control the payload size.
```typescript
type WorkerTask = {
  taskId: string;
  parentTaskId: string;
  worker: "web" | "code";
  instructions: string;
  outputSchema: "snippets" | "file_matches";
  maxTokens: number;
};

type WorkerResult = {
  taskId: string;
  status: "ok" | "error" | "timeout";
  payload: unknown;
  tokensUsed: number;
};
```
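TypeScript types disappear at runtime, so a worker's reply still has to be checked at the boundary. The article suggests Zod or Pydantic for that; the sketch below uses a hand-written guard only to stay dependency-free. `parseWorkerResult` and the error reason strings are hypothetical.

```typescript
// Mirrors the WorkerResult shape above; parseWorkerResult is a hypothetical helper.
type WorkerResult = {
  taskId: string;
  status: "ok" | "error" | "timeout";
  payload: unknown;
  tokensUsed: number;
};

const STATUSES = ["ok", "error", "timeout"];

// Parse a worker's raw JSON reply; return a structured error instead of throwing.
function parseWorkerResult(raw: string, expectedTaskId: string): WorkerResult {
  const fail = (reason: string): WorkerResult => ({
    taskId: expectedTaskId,
    status: "error",
    payload: reason,
    tokensUsed: 0,
  });

  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return fail("invalid_json");
  }
  if (typeof data !== "object" || data === null) return fail("not_an_object");

  const r = data as Record<string, unknown>;
  if (r.taskId !== expectedTaskId) return fail("task_id_mismatch");
  if (typeof r.status !== "string" || !STATUSES.includes(r.status)) return fail("bad_status");
  if (typeof r.tokensUsed !== "number" || !Number.isFinite(r.tokensUsed) || r.tokensUsed < 0) {
    return fail("bad_tokens_used");
  }
  return {
    taskId: expectedTaskId,
    status: r.status as WorkerResult["status"],
    payload: r.payload,
    tokensUsed: r.tokensUsed,
  };
}
```

Returning an error-shaped `WorkerResult` rather than throwing keeps every failure on the same path the manager already handles.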
Shared scratchpad. Every agent reads and writes to a shared markdown buffer or a vector store. Useful when agents need to see each other's intermediate work — for instance, a debate setup where Critic reads Writer's draft. Dangerous because the scratchpad grows monotonically and context costs explode. Use sparingly. Cap the scratchpad size, summarize aggressively, and never let an agent read the raw history beyond N turns.
For most production systems: structured messages. Reach for a scratchpad only when agents genuinely need a shared mutable surface and you have budget for the context overhead.
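If you do need a scratchpad, the caps described above can be sketched as a bounded, append-only buffer. This is an illustration under stated assumptions: `Scratchpad` is a hypothetical class, characters stand in for tokens, `maxTurns` is assumed to be at least 1, and the summarization step is left out.

```typescript
// Hypothetical Scratchpad: a bounded, append-only buffer shared between agents.
type Entry = { author: string; text: string };

class Scratchpad {
  private entries: Entry[] = [];
  private maxTurns: number;
  private maxChars: number;

  constructor(maxTurns: number, maxChars: number) {
    this.maxTurns = maxTurns;
    this.maxChars = maxChars;
  }

  // Append-only write; long entries are truncated so one agent cannot flood the buffer.
  append(author: string, text: string): void {
    this.entries.push({ author, text: text.slice(0, this.maxChars) });
  }

  // Agents read only the last maxTurns entries, trimmed to a total character budget.
  read(): string {
    const recent = this.entries.slice(-this.maxTurns);
    const lines: string[] = [];
    let used = 0;
    // Walk newest-first so the most recent work survives when the budget is tight.
    for (let i = recent.length - 1; i >= 0; i--) {
      const line = `${recent[i].author}: ${recent[i].text}`;
      if (used + line.length > this.maxChars) break;
      lines.unshift(line);
      used += line.length;
    }
    return lines.join("\n");
  }
}
```

Because writes only append and reads only see a window, two agents cannot overwrite each other's entries, and the context handed to any one agent has a fixed upper bound.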
Step 4: Coordination + retries
A swarm without retries is a swarm that fails the first time a tool times out. Three rules.
Idempotent tools. Every tool must be safe to retry. If dispatch_worker fails halfway, calling it again with the same task ID should not double-bill or double-execute. Tag every task with a deterministic ID and check a dedup table before doing real work.
Bounded retries with backoff. Three attempts max per worker call, with exponential backoff between attempts (1s after the first failure, 4s after the second; a fourth attempt, if you allow one, would wait 16s). If a worker fails three times, the manager gets a structured error and decides: skip the sub-task, try a different worker, or abort. Do not retry forever. Do not retry without backoff.
Circuit breakers per worker. If a worker fails 5 times in 60 seconds, trip a breaker and stop dispatching to it for 2 minutes. Without breakers, a downstream outage turns into a runaway loop that drains your budget while producing nothing.
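```typescript
// Assumed to exist elsewhere: the WorkerTask/WorkerResult types from Step 3,
// a circuit breaker instance, and runWorker, which performs the actual agent call.
declare const breaker: {
  isOpen(worker: string): boolean;
  recordSuccess(worker: string): void;
  recordFailure(worker: string): void;
};
declare function runWorker(task: WorkerTask): Promise<WorkerResult>;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
const MAX_ATTEMPTS = 3;

async function dispatchWithRetry(task: WorkerTask): Promise<WorkerResult> {
  if (breaker.isOpen(task.worker)) {
    return { taskId: task.taskId, status: "error", payload: "breaker_open", tokensUsed: 0 };
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      const result = await runWorker(task);
      breaker.recordSuccess(task.worker);
      return result;
    } catch (err) {
      lastError = err;
      breaker.recordFailure(task.worker);
      // Back off 1s, then 4s; there is no wait after the final attempt.
      if (attempt < MAX_ATTEMPTS) await sleep(Math.pow(4, attempt - 1) * 1000);
    }
  }
  // Retries exhausted: hand the manager a structured error instead of throwing.
  return { taskId: task.taskId, status: "error", payload: String(lastError), tokensUsed: 0 };
}
```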
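The retry function above leans on a breaker and on deterministic task IDs without showing either. The sketch below is one possible shape, not an SDK feature: `CircuitBreaker`, `taskIdFor`, and `DedupTable` are hypothetical names, the thresholds default to the 5-failures-in-60-seconds and 2-minute values from the rules, and the dedup table is in-memory where a real system would use a persistent store.

```typescript
// Hypothetical helpers for illustration; none of these names come from an SDK.

// Trips after maxFailures failures inside windowMs, then stays open for cooldownMs.
class CircuitBreaker {
  private failures = new Map<string, number[]>(); // failure timestamps per worker (ms)
  private openedAt = new Map<string, number>();
  private maxFailures: number;
  private windowMs: number;
  private cooldownMs: number;
  private now: () => number;

  constructor(maxFailures = 5, windowMs = 60_000, cooldownMs = 120_000, now: () => number = () => Date.now()) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  isOpen(worker: string): boolean {
    const opened = this.openedAt.get(worker);
    if (opened === undefined) return false;
    if (this.now() - opened >= this.cooldownMs) {
      // Cooldown elapsed: close the breaker and forget old failures.
      this.openedAt.delete(worker);
      this.failures.delete(worker);
      return false;
    }
    return true;
  }

  recordSuccess(worker: string): void {
    this.failures.delete(worker);
  }

  recordFailure(worker: string): void {
    const t = this.now();
    const recent = (this.failures.get(worker) ?? []).filter((ts) => t - ts < this.windowMs);
    recent.push(t);
    this.failures.set(worker, recent);
    if (recent.length >= this.maxFailures) this.openedAt.set(worker, t);
  }
}

// Deterministic task ID: the same worker and instructions always map to the same ID.
// Uses an FNV-1a-style hash over UTF-16 code units; any stable hash would do.
function taskIdFor(worker: string, instructions: string): string {
  const text = `${worker}\n${instructions}`;
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Dedup table: run the work once per task ID and replay the stored result on retries.
class DedupTable<T> {
  private done = new Map<string, T>();

  run(taskId: string, work: () => T): T {
    if (this.done.has(taskId)) return this.done.get(taskId) as T;
    const result = work();
    this.done.set(taskId, result);
    return result;
  }
}
```

Passing a fake clock as the fourth constructor argument lets you test the breaker without waiting out real cooldowns.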
Step 5: Logging + eval
You cannot ship what you cannot measure. Two layers.
Trace every call. Every model call, every tool call, every agent handoff goes to a structured log — task ID, parent task ID, agent name, model, input tokens, output tokens, latency, status. Use OpenTelemetry or LangSmith. The Agent SDK emits traces out of the box; you just point it at your collector.
Golden eval set. Maintain 30-100 canonical inputs with known-good outputs. Run them nightly. Track three metrics: pass rate, mean cost per run, mean latency. When pass rate drops or cost spikes, you catch the regression before users do. Use Claude itself as the judge for open-ended outputs: give it the input, the gold answer, and the actual answer, and ask it to score against a rubric.
This is the difference between a demo and a production system. Demos work once. Production systems work on the thousandth run, on the input you didn't anticipate, after the model provider silently changes their tokenizer. Logging and eval are how you catch that before it becomes a Slack fire.
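The three nightly metrics and the regression check can be sketched in a few lines. `EvalResult`, `summarize`, and `regressions` are hypothetical names, and the default tolerances are placeholders to tune from your own data, not recommendations.

```typescript
// Hypothetical shapes: an EvalResult is one golden-set case after a nightly run.
type EvalResult = { caseId: string; passed: boolean; costUsd: number; latencyMs: number };
type EvalSummary = { passRate: number; meanCostUsd: number; meanLatencyMs: number };

// Collapse one nightly run into the three tracked metrics.
function summarize(results: EvalResult[]): EvalSummary {
  if (results.length === 0) throw new Error("empty eval run");
  const n = results.length;
  const sum = (f: (r: EvalResult) => number) => results.reduce((acc, r) => acc + f(r), 0);
  return {
    passRate: sum((r) => (r.passed ? 1 : 0)) / n,
    meanCostUsd: sum((r) => r.costUsd) / n,
    meanLatencyMs: sum((r) => r.latencyMs) / n,
  };
}

// Flag a regression when pass rate drops or cost/latency grow past a tolerance.
function regressions(
  baseline: EvalSummary,
  current: EvalSummary,
  tol = { passRateDrop: 0.02, costGrowth: 0.25, latencyGrowth: 0.25 },
): string[] {
  const flags: string[] = [];
  if (baseline.passRate - current.passRate > tol.passRateDrop) flags.push("pass_rate");
  if (current.meanCostUsd > baseline.meanCostUsd * (1 + tol.costGrowth)) flags.push("cost");
  if (current.meanLatencyMs > baseline.meanLatencyMs * (1 + tol.latencyGrowth)) flags.push("latency");
  return flags;
}
```

Comparing against a stored baseline rather than a fixed threshold means the check keeps working as the golden set grows.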
Step 6: Cost guardrails
A naive multi-agent system is a money fire. Five guardrails.
Per-run budget. Every top-level invocation gets a max-token budget (input + output across all sub-agents). If a run exceeds the budget, abort and return a partial result. Default to 200k tokens per run for production; tune from data.
Per-worker budget. Each worker dispatch has its own cap. A web search worker should never use more than 20k tokens. If it does, something is wrong — probably runaway tool loops. Cap and surface the violation.
Model tiering. Use Claude Haiku for cheap classification tasks (intent detection, routing, simple extraction). Use Claude Sonnet for general work. Reserve Claude Opus 4.7 for the manager and any worker doing real reasoning. A poorly-tiered swarm costs 3-5x more than a well-tiered one for the same quality.
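Caching. Claude's prompt caching bills cache reads at roughly 10% of the normal input price, so repeated context costs up to 90% less on a cache hit. Cache writes carry a surcharge, so it pays off only for context you actually reuse. Cache your system prompts, your tool definitions, and any large reference docs. The Agent SDK does this automatically when you mark messages as cacheable.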
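Daily kill switch. Set a hard spend limit that the swarm itself cannot override. If the swarm misbehaves at 3am, the kill switch stops the bleeding before you wake up. The Anthropic Console supports spend limits natively; check which scope and period your account exposes, and enforce the daily cap in your own dispatch layer if the Console cannot express it.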
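The per-run budget, per-worker budget, and model tiering guardrails can be sketched as a small in-process tracker. Everything here is an assumption for illustration: `RunBudget` and `tierFor` are hypothetical names, the defaults reuse the 200k and 20k figures above, and the tier labels are categories, not model IDs.

```typescript
// Hypothetical RunBudget: tracks tokens across all sub-agents in one top-level run.
class RunBudget {
  private used = 0;
  private perDispatch = new Map<string, number>();
  private runCap: number;
  private dispatchCap: number;

  constructor(runCap = 200_000, dispatchCap = 20_000) {
    this.runCap = runCap;
    this.dispatchCap = dispatchCap;
  }

  // Record usage after each model call; reports which cap, if any, was breached.
  charge(dispatchId: string, tokens: number): "ok" | "dispatch_cap" | "run_cap" {
    this.used += tokens;
    const total = (this.perDispatch.get(dispatchId) ?? 0) + tokens;
    this.perDispatch.set(dispatchId, total);
    if (this.used > this.runCap) return "run_cap";
    if (total > this.dispatchCap) return "dispatch_cap";
    return "ok";
  }

  remaining(): number {
    return Math.max(0, this.runCap - this.used);
  }
}

// Model tiering: route by task kind. The return values are tier labels only.
type TaskKind = "classify" | "route" | "extract" | "general" | "reason";

function tierFor(kind: TaskKind): "haiku" | "sonnet" | "opus" {
  if (kind === "classify" || kind === "route" || kind === "extract") return "haiku";
  if (kind === "reason") return "opus";
  return "sonnet";
}
```

A natural place to call `charge` is the dispatch tool, since every worker result already passes through it; on `run_cap` the manager aborts and returns a partial result, on `dispatch_cap` it surfaces the violation for that one worker.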
Example: 4-agent code review system
Here is a complete code review swarm you can copy. Four agents: Manager, Security Reviewer, Style Reviewer, Test Coverage Reviewer. Manager fans out, aggregates, returns a single review report.
```typescript
// Import path and API shape as used in this guide; confirm both against the SDK release you install.
import { Agent, tool } from "@anthropic-ai/agent-sdk";
import { z } from "zod";
import * as fs from "node:fs/promises";

// Placeholder for your own grep wrapper; it is not part of any SDK.
declare function execGrep(pattern: string, path?: string): Promise<string>;

// Shared tools
const readFile = tool({
  name: "read_file",
  description: "Read a file from the diff being reviewed.",
  input: z.object({ path: z.string() }),
  run: async ({ path }) => await fs.readFile(path, "utf-8"),
});

const grepRepo = tool({
  name: "grep_repo",
  description: "Search the repo for a pattern.",
  input: z.object({ pattern: z.string(), path: z.string().optional() }),
  run: async ({ pattern, path }) => await execGrep(pattern, path),
});

// Reviewer agents: isolated context per file batch
const securityReviewer = new Agent({
  model: "claude-opus-4-7",
  systemPrompt: `You are a security reviewer. Look for: SQL injection, XSS, secrets in code,
insecure deserialization, missing auth checks, IDOR, SSRF, path traversal.
Return a JSON array of findings: { severity, file, line, issue, fix }. Empty array if clean.`,
  tools: [readFile, grepRepo],
});

const styleReviewer = new Agent({
  model: "claude-sonnet-4-7",
  systemPrompt: `You are a style reviewer. Check naming, function length, file length,
unused imports, dead code, and adherence to the project's existing patterns.
Return JSON findings with severity (low/medium), file, line, issue, fix.`,
  tools: [readFile, grepRepo],
});

const testReviewer = new Agent({
  model: "claude-opus-4-7",
  systemPrompt: `You are a test coverage reviewer. For each changed function,
verify there is a corresponding test. Flag untested critical paths.
Return JSON findings with severity, file, line, missing_test_description.`,
  tools: [readFile, grepRepo],
});

// Dispatch tool: the only seam between manager and reviewers
const dispatchReview = tool({
  name: "dispatch_review",
  description: "Send a file batch to a named reviewer.",
  input: z.object({
    reviewer: z.enum(["security", "style", "tests"]),
    files: z.array(z.string()),
  }),
  run: async ({ reviewer, files }) => {
    const agent =
      reviewer === "security" ? securityReviewer :
      reviewer === "style" ? styleReviewer : testReviewer;
    const result = await agent.run({
      input: `Review these files:\n${files.join("\n")}`,
    });
    return result.finalOutput;
  },
});

// Manager
const reviewManager = new Agent({
  model: "claude-opus-4-7",
  systemPrompt: `You are a code review manager. Given a list of changed files:
1. Dispatch all files to the security reviewer.
2. Dispatch all files to the style reviewer.
3. Dispatch all files to the test reviewer.
Run all three in parallel. Aggregate findings by severity (critical, high, medium, low).
Return a single markdown report.`,
  tools: [dispatchReview],
});

// Entry point: called from your CI pipeline
export async function reviewPR(changedFiles: string[]) {
  const report = await reviewManager.run({
    input: `Review this PR. Changed files:\n${changedFiles.join("\n")}`,
    maxTokens: 200_000,
  });
  return report.finalOutput;
}
```
What this gives you: parallel review across three independent dimensions, isolated context per reviewer (no cross-contamination), structured JSON findings the manager can aggregate, and a single markdown report at the end. Drop it into your CI as a pre-merge check. Typical run cost on a 10-file PR: $0.40-0.80 with caching enabled.
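The example leaves aggregation to the manager's prompt. If you would rather do it deterministically in code, one possible sketch follows. The `Finding` shape is an assumption based on the fields named in the reviewer prompts, and `parseFindings` and `renderReport` are hypothetical helpers.

```typescript
// Hypothetical Finding shape, following the fields named in the reviewer prompts.
type Severity = "critical" | "high" | "medium" | "low";
type Finding = { severity: Severity; file: string; line: number; issue: string; fix?: string };

const ORDER: Severity[] = ["critical", "high", "medium", "low"];

// Parse one reviewer's JSON output; malformed output becomes an empty list plus a warning.
function parseFindings(raw: string): { findings: Finding[]; warning?: string } {
  try {
    const data: unknown = JSON.parse(raw);
    if (!Array.isArray(data)) return { findings: [], warning: "not_an_array" };
    // Keep only entries that carry the required fields with the right types.
    const findings = data.filter(
      (f): f is Finding =>
        typeof f === "object" && f !== null &&
        ORDER.includes(f.severity) && typeof f.file === "string" &&
        typeof f.line === "number" && typeof f.issue === "string",
    );
    return { findings };
  } catch {
    return { findings: [], warning: "invalid_json" };
  }
}

// Group findings by severity, most severe first, and render a markdown report.
function renderReport(findings: Finding[]): string {
  if (findings.length === 0) return "No findings.";
  const sections: string[] = [];
  for (const sev of ORDER) {
    const group = findings.filter((f) => f.severity === sev);
    if (group.length === 0) continue;
    sections.push(`## ${sev} (${group.length})`);
    for (const f of group) sections.push(`- ${f.file}:${f.line} ${f.issue}`);
  }
  return sections.join("\n");
}
```

Aggregating in code trades some flexibility for a report format that cannot drift between runs, which also makes it easier to diff findings across pushes.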
Production extensions you will want: persist findings to a database so the same issue isn't re-flagged on every push, add a fourth reviewer for performance regressions if you ship hot paths, and wire the breaker pattern from Step 4 so a flaky tool does not stall the whole pipeline. For a deeper walkthrough, our team has shipped versions of this exact system for clients — see AI agent development.
Common failure modes
Six failure modes hit nearly every production multi-agent build. Watch for them.
Context bloat. The manager's context grows with every worker result until you hit the model's limit and the whole thing crashes. Fix: workers return summarized payloads, not raw output. Cap each worker's output at 2-5k tokens.
Infinite tool loops. A worker calls a tool, gets a result it doesn't understand, calls the same tool again with the same args. Fix: bounded tool calls per worker (max 10), and detect repeated identical calls.
Hallucinated handoffs. The manager dispatches to a worker that doesn't exist, or asks for an output schema the worker never returns. Fix: enum validation on dispatch targets, schema validation on worker results, fail loud.
Cost runaway. One bug in a retry loop turns a $0.50 run into a $50 run. Fix: per-run token budgets enforced by the SDK, daily spend cap at the API key level, alerts on anomalous cost.
Silent quality regression. The model provider updates the model behind an alias or changes its tokenizer, and your swarm starts producing subtly worse output. Nothing flags it because you have no eval set. Fix: golden evals from day one.
Race conditions on shared state. Two workers write to the same scratchpad and clobber each other. Fix: avoid shared mutable state. Use structured messages with append-only logs.
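The infinite-tool-loop fix above (a call cap plus repeated-call detection) is small enough to sketch. `ToolLoopGuard` is a hypothetical class meant to be created once per worker run; the default of 10 calls comes from the text, and the repeat limit of 2 is an assumption.

```typescript
// Hypothetical ToolLoopGuard: create one instance per worker run.
class ToolLoopGuard {
  private total = 0;
  private seen = new Map<string, number>();
  private maxCalls: number;
  private maxRepeats: number;

  constructor(maxCalls = 10, maxRepeats = 2) {
    this.maxCalls = maxCalls;
    this.maxRepeats = maxRepeats;
  }

  // Call before executing a tool; anything other than "ok" means refuse the call.
  check(tool: string, args: Record<string, unknown>): "ok" | "max_calls" | "repeated_call" {
    this.total += 1;
    if (this.total > this.maxCalls) return "max_calls";
    // Sort top-level keys so {a, b} and {b, a} count as the same call.
    const entries = Object.entries(args).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
    const key = tool + ":" + JSON.stringify(entries);
    const count = (this.seen.get(key) ?? 0) + 1;
    this.seen.set(key, count);
    return count > this.maxRepeats ? "repeated_call" : "ok";
  }
}
```

When the guard refuses a call, returning that reason to the worker as the tool result gives the model a chance to change course instead of failing the whole run.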
Build it yourself, or hire a team that has shipped a dozen
Multi-agent systems are 20% architecture and 80% the boring production work — retries, evals, cost caps, observability, deploy pipelines. The architecture you can get from this guide. The boring 80% is where most teams burn three months and ship something that works in dev and falls over in prod.
If you want to skip that, AY Automate has shipped production Claude multi-agent systems for code review, customer support, sales research, and internal ops automation. We work in Claude Code, ship with the Claude Agent SDK, and hand over running infrastructure with evals, dashboards, and runbooks. Talk to us at /consultation or browse the AI agent development service for what we ship.
FAQ
What is a multi-agent system?
A multi-agent system is software where two or more LLM agents cooperate on a task — typically one manager that decomposes work and dispatches sub-tasks to specialized workers. Each agent has its own context, prompt, and tools. The system is "multi-agent" because the agents are isolated and communicate through structured handoffs, not because there are multiple LLM calls.
How is a multi-agent system different from a single agent with many tools?
A single agent with many tools shares one context window across every step. It works until the context bloats or the agent gets confused juggling 30+ tools. A multi-agent system splits the work across isolated contexts — each worker only sees its own slice — which keeps context lean and lets each agent be deeply specialized. Rule of thumb: under 10 tools and simple flows, single agent. Over 10 tools or genuinely parallel work, multi-agent.
Should I use Claude Code sub-agents or the Claude Agent SDK?
Claude Code sub-agents for dev workflows that run inside an IDE session — code review, refactoring, research across a repo. Claude Agent SDK for production systems that run without a human attached — APIs, scheduled jobs, customer-facing automation. The Agent SDK is what you ship to users. Sub-agents are what you use while building.
How much does a production multi-agent system cost to run?
Highly variable. A 4-agent code review system on a typical 10-file PR runs $0.40-0.80 with caching. A customer-support swarm handling 1,000 tickets per day with three agents per ticket typically lands at $40-120 per day (about $0.04-0.12 per ticket), depending on ticket complexity and model tier. Cost scales linearly with volume. It scales quadratically with agent count when every agent reads every other agent's output, because each added agent both produces context and consumes everyone else's. Isolated contexts and per-run budgets keep growth linear.
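The linear-versus-quadratic point is easier to see with a toy token model. Every number and name here is an assumption for illustration, not a measurement: `isolatedTokens` models workers that see only their own task, and `sharedTokens` models agents that also read every peer's result.

```typescript
// Illustrative token model; all figures are assumptions, not measurements.

// Isolated workers: each of n agents reads only its own task and writes one result.
function isolatedTokens(n: number, taskTokens: number, resultTokens: number): number {
  return n * (taskTokens + resultTokens);
}

// Shared history: each agent additionally reads the results of its n - 1 peers.
function sharedTokens(n: number, taskTokens: number, resultTokens: number): number {
  return n * (taskTokens + resultTokens) + n * (n - 1) * resultTokens;
}

// With 1,000-token tasks and 2,000-token results:
console.log(isolatedTokens(3, 1000, 2000), sharedTokens(3, 1000, 2000)); // 9000 21000
console.log(isolatedTokens(6, 1000, 2000), sharedTokens(6, 1000, 2000)); // 18000 78000
```

In this model, doubling the agent count from 3 to 6 doubles the isolated total but multiplies the shared total by about 3.7.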
Which framework should I use — Agent SDK, LangGraph, or OpenAI Agents SDK?
Agent SDK if you're starting fresh and want the lowest-friction Claude-native path. LangGraph if you need explicit graph control, durable state, or human-in-the-loop approvals. OpenAI Agents SDK if your team already runs it and you want to A/B Claude versus GPT without rebuilding orchestration. See best multi-agent frameworks for the full comparison.
Do I need a vector database for multi-agent systems?
Only if your agents need long-term memory across runs or need to retrieve from a large knowledge base. Most production multi-agent systems do not — they pass structured task payloads, not embeddings. Add a vector DB when you genuinely need semantic retrieval, not by default.
How do I evaluate a multi-agent system?
Maintain a golden set of 30-100 canonical inputs with known-good outputs. Run nightly. Track pass rate, mean cost per run, and mean latency. For open-ended outputs, use Claude as a judge with a clear rubric. When pass rate drops or cost spikes, investigate immediately. Without evals, you ship regressions blind.
Can a multi-agent system replace a human team?
For well-defined, repetitive cognitive work — yes, in 2026, a properly built swarm replaces most of a junior research team or a tier-1 support team. For ambiguous, judgment-heavy, or relationship-driven work, no. The pattern that works: agents handle the 80% repeatable volume, humans handle the 20% that needs judgment, escalation paths route between them. That's the production reality, not "AI replaces everyone."
Walid founded AY Automate to help businesses ship AI workflows that actually move revenue. He leads strategy and oversees every client engagement end-to-end.