Parallel Subagent Orchestration: When Fan-Out Helps, and Exactly Where It Breaks
The question that actually matters when you build an agent system is narrower than “single agent or multi-agent.” It’s: which parts of this workload can run in parallel without two agents making contradictory decisions, and is the extra token cost worth it for this task? Most of the public debate skips straight to architecture diagrams and never answers that. This piece tries to answer it precisely, with the mechanism, the cost arithmetic, the strongest counterargument, and a small orchestrator you can copy.
The disagreement that frames everything
Two well-documented positions sit on opposite sides, and reading both is the fastest way to understand the real tradeoff.
In June 2025, Cognition (the team behind Devin) published “Don’t Build Multi-Agents”. The argument is about context, not parallelism per se. Two principles: share as much context as possible across decisions, and avoid splitting decision-making in ways that conflict. When you fan a task out to subagents that each see only a slice of the context, they make implicit decisions — naming, structure, interpretation of ambiguous instructions — that are individually reasonable and collectively incoherent. The post’s running example is two subagents building different parts of one UI in mismatched styles, then a third agent trying to reconcile them. The reconciliation is harder than the original task.
That same month, Anthropic published “How we built our multi-agent research system”, reporting that an orchestrator-worker setup (Claude Opus 4 as lead, Claude Sonnet 4 as subagents) outperformed a single-agent Opus 4 baseline by 90.2% on their internal research eval. Their workload is open-ended breadth-first research (for example, “find the board members of every IT company in the S&P 500”), where the subtasks genuinely don’t depend on each other.
These look contradictory. They aren’t. They describe different workloads, and the line between them is the read/write distinction.
The read/write split is the load-bearing idea
Cognition’s April 22, 2026 follow-up, “Multi-Agents: What’s Actually Working”, states the synthesis directly. Their words: “most multi-agent setups in the world are limited to ‘readonly’ subagents, like web search subagents and code search subagents,” and the reason write-parallelism fails is that “actions carry implicit decisions … that might conflict with the implicit choices of other parallel agents.” Their conclusion: “multi-agent systems work best today when writes stay single-threaded and the additional agents contribute intelligence rather than actions.”
That is the principle the whole field has converged on, and it generalizes cleanly:
- Reads parallelize. Searching the web, grepping a codebase, reading files, gathering evidence — these don’t mutate shared state. Two subagents reading different things can’t contradict each other; worst case they fetch redundant data. Fan-out here is pure latency win.
- Writes must be serialized. Editing files, committing code, sending messages, making schema decisions — these encode choices. Two agents writing in parallel against shared state produce the incoherence Cognition described. One writer; everyone else advises.
So Anthropic’s research system isn’t a counterexample to Cognition — it’s an instance of the rule. The subagents read (search, browse, gather); the lead agent writes (synthesizes the final report, runs the citation pass). The breadth-first research task happens to be almost entirely reads, which is exactly why it parallelizes so well. A task that’s mostly coordinated writes — a multi-file refactor where every edit has to be consistent with every other — sits at the opposite end and resists fan-out.
This reframes the decision. Don’t ask “should I use multiple agents?” Ask “what fraction of this task is independent reads?” High read fraction → fan out the reads, keep one writer. Low read fraction → a single agent is usually simpler and at least as good.
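One way to make that question concrete is to label each subtask by whether it mutates shared state and compute the read share. The sketch below is illustrative only: the `Subtask` type, the `"read"`/`"write"` labels, and the 0.7 threshold are assumptions made for this example, not something either post prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    name: str
    kind: str  # hypothetical labels: "read" (mutates nothing shared) or "write"


def read_fraction(subtasks: list[Subtask]) -> float:
    """Share of subtasks that are reads; 0.0 for an empty plan."""
    if not subtasks:
        return 0.0
    return sum(t.kind == "read" for t in subtasks) / len(subtasks)


def plan(subtasks: list[Subtask], threshold: float = 0.7) -> dict:
    """Split a plan into reads that may run in parallel and work that stays serial.

    The threshold is an arbitrary illustrative cut-off, not a measured value.
    """
    reads = [t.name for t in subtasks if t.kind == "read"]
    writes = [t.name for t in subtasks if t.kind != "read"]
    fan_out = len(reads) >= 2 and read_fraction(subtasks) >= threshold
    if not fan_out:
        # Low read share: keep everything in one agent, in the original order.
        return {"fan_out": False, "parallel_reads": [], "serial": [t.name for t in subtasks]}
    # High read share: reads fan out, every write goes through the single writer.
    return {"fan_out": True, "parallel_reads": reads, "serial": writes}
```

A research-style plan with three searches and one report comes back with the searches in `parallel_reads` and the report in `serial`; a refactor-style plan with one read and two edits stays entirely serial.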
The strongest objection: you’re paying for compute, not architecture
Here’s where an honest piece has to slow down, because the 90.2% number invites overinterpretation.
Anthropic’s own post is candid about the cost: in their reported setup, agents use roughly 4× the tokens of a chat interaction, and multi-agent systems roughly 15×. They also found that in their browsing eval, token usage alone explained ~80% of the performance variance, with the number of tool calls and the model choice as the other main factors. Read that carefully: their own data says most of the multi-agent win is explained by spending more tokens, not by the architecture being inherently smarter. A multi-agent system that fans out to five subagents is, among other things, a way to spend 5× the inference budget on the problem.
This is not a settled debate, and the disagreement is worth understanding rather than papering over. Tran and Kiela’s April 2026 paper, “Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets”, makes the methodological objection direct: nearly all multi-agent benchmarks compare a single agent against a multi-agent system that quietly uses far more total computation. When they hold the thinking-token budget constant — matching intermediate reasoning tokens, excluding prompts and final answers — across Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5 on multi-hop reasoning, single agents match or beat multi-agent systems. In their reported numbers the two are nose-to-nose: single-agent accuracy spans roughly 0.280–0.427 across budgets, comparable multi-agent variants average 0.280–0.420. (Those are the aggregate ranges the paper reports across model families; treat per-benchmark, per-model breakdowns as something to read off their tables, not summarize.)
Their mechanism is an information-theoretic one and it’s the part worth internalizing. When agent A hands a message to agent B, B works from A’s processed summary of the original context, not the context itself. By the data processing inequality, that hand-off can only preserve or lose information — never add it. Every agent boundary is a lossy compression step. The conclusion: “many reported advantages of multi-agent systems are better explained by unaccounted computation and context effects rather than inherent architectural benefits.”
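The inequality can be written out in one line. The symbols are labels chosen here for illustration, not the paper’s notation: Y is the answer being sought, C is the full original context, and M is the message agent A hands to agent B. Assuming M is computed from C alone, the three form a Markov chain and the data processing inequality gives:

```latex
Y \to C \to M
\quad\Longrightarrow\quad
I(Y; M) \;\le\; I(Y; C)
```

Equality holds only when M is a sufficient statistic of C for Y, that is, when the summary drops nothing relevant to the answer. Any further processing by B can only keep or shrink I(Y; M), which is why the loss compounds across successive hand-offs.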
Two caveats keep this from being circular. First, the 80%-of-variance and 4×/15× figures are Anthropic’s measurements on Anthropic’s workloads — a specific research task with a specific orchestration. They are not universal constants, and citing them as “independent confirmation” of the compute thesis would be leaning on the same source twice; treat them as one well-instrumented data point. Second, Tran and Kiela tested multi-hop reasoning, where the subtasks are sequential and dependent — precisely the regime where lossy hand-offs hurt most. Their finding is strong evidence against multi-agent for dependent reasoning. It says much less about breadth-first independent reads, where there’s no reasoning chain to compress and the win is wall-clock latency, not accuracy.
Net: parallel subagents buy you latency on independent reads and a way to deploy more compute. They do not buy you reasoning that a single agent with the same token budget couldn’t match. If you reach for multi-agent expecting the architecture itself to make the system smarter on a dependent reasoning chain, the equal-budget evidence says you’ll be disappointed.
A decision procedure you can actually apply
Putting the mechanism and the cost together:
1. Estimate the independent-read fraction. What share of the work is gathering information that doesn’t depend on other in-flight work? High → fan-out candidate.
2. Check write coordination. Does completing the task require multiple writes that must be mutually consistent? If yes, those writes go through one agent regardless of how the reads are structured.
3. Price it. Multi-agent only makes sense when the value of finishing faster (or more thoroughly) clears the multiplied token bill. Work through the arithmetic rather than hand-waving; the cost section below gives a template.
4. Check the dependency structure. If the subtasks form a reasoning chain (each step needs the last), the equal-budget evidence favors a single agent. If they’re independent, fan-out’s lossy hand-offs cost you little.
If steps 1–4 don’t clearly favor fan-out, build the single agent. It’s less code, has no synthesis step, and gives you one coherent context to debug.
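The same procedure can be sketched as a function that defaults to the single agent unless every check favors fan-out. Everything here is hypothetical scaffolding: the `Workload` fields, the 0.7 cut-off, and the dollar estimates are inputs you would have to supply yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    read_fraction: float         # share of the work that is independent reads (0..1)
    coordinated_writes: bool     # multiple writes that must agree with each other
    speedup_value_usd: float     # what finishing faster or more thoroughly is worth
    extra_token_cost_usd: float  # fan-out bill minus the single-agent bill
    is_dependent_chain: bool     # each subtask needs the previous one's result


def recommend(w: Workload, min_read_fraction: float = 0.7) -> tuple[bool, str]:
    """Return (fan_out_reads, reason). Writes always stay with one agent."""
    if w.read_fraction < min_read_fraction:
        return False, "too few independent reads"
    if w.is_dependent_chain:
        return False, "subtasks form a dependent chain"
    if w.speedup_value_usd <= w.extra_token_cost_usd:
        return False, "token bill exceeds the value of the speedup"
    reason = "independent reads dominate and the speedup pays for itself"
    if w.coordinated_writes:
        # Fan-out applies to the reads only; the consistent writes stay serial.
        reason += "; route every write through the single writer"
    return True, reason
```

The function only ever answers “fan out the reads” or “single agent”. It never returns an option with parallel writers, which mirrors the rule that writes stay single-threaded.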
Worked cost math (fill in current prices yourself)
Token math is the part that silently rots in every blog post, so treat the dollar figures as a template, not a fact. Pull current per-token prices from your provider’s pricing page at the time you’re reading; as of mid-2026 the relevant Anthropic numbers were Opus-tier around $5 / $25 per million input/output tokens, Sonnet-tier around $3 / $15, the Batch API at a 50% discount, and prompt-cache reads at roughly 0.1× input price (with a 1.25× write premium on the 5-minute cache). Verify before you quote.
The structure that doesn’t rot: a fan-out of N read subagents plus one synthesizer costs, very roughly,
total ≈ orchestrator_tokens
      + N × (subagent_prompt + subagent_output)
      + synthesizer_tokens
versus a single agent doing the reads sequentially:
total ≈ agent_tokens (one context, no per-subagent prompt duplication)
The N× multiplier on the subagent prompt is the hidden cost: every subagent re-pays for its system prompt and task framing. Two levers cut it. Prompt caching: if all N subagents share a system-prompt prefix, cache it once and each subagent reads it at ~0.1× instead of full price — this is the single highest-leverage optimization for fan-out, because the shared prefix is exactly the part that’s identical across subagents. Batch API: if the subagents don’t need to be real-time, the 50% batch discount applies to the whole fan-out. Stacking both — cached shared prefix, batched execution — is what makes a wide fan-out economically sane. Run the arithmetic with live prices before committing; the break-even on N is sensitive to how much of the subagent prompt is cacheable.
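To see how the two levers move the bill, here is a small calculator for the N × (subagent_prompt + subagent_output) term alone. It is a sketch under stated assumptions: the multipliers are the placeholder figures quoted above, the cached case assumes the best outcome (one cache write, then every other subagent hits), and the batch discount is applied to the whole term. The `Prices` type and function name are made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    input_per_mtok: float            # dollars per million input tokens
    output_per_mtok: float           # dollars per million output tokens
    cache_read_mult: float = 0.1     # placeholder from the text; verify before use
    cache_write_mult: float = 1.25   # placeholder from the text; verify before use
    batch_mult: float = 0.5          # placeholder from the text; verify before use


def fanout_subagent_cost(
    n: int,
    prefix_tokens: int,   # shared system prompt, identical across subagents
    task_tokens: int,     # per-subagent instruction, never cacheable
    output_tokens: int,   # per-subagent output
    p: Prices,
    cached: bool = False,
    batched: bool = False,
) -> float:
    """Dollar cost of the N subagent calls; orchestrator and synthesizer excluded."""
    if cached and n > 0:
        # Best case: one cache write, then n - 1 cache reads of the shared prefix.
        prefix_units = prefix_tokens * (p.cache_write_mult + (n - 1) * p.cache_read_mult)
    else:
        prefix_units = n * prefix_tokens
    input_units = prefix_units + n * task_tokens
    cost = (input_units * p.input_per_mtok + n * output_tokens * p.output_per_mtok) / 1_000_000
    return cost * (p.batch_mult if batched else 1.0)
```

With ten subagents, a 4,000-token shared prefix, 500-token instructions, 1,500-token outputs, and the Sonnet-tier placeholder of $3 / $15, this gives about $0.36 uncached, $0.27 with the prefix cached, and $0.13 with caching and batch together. Note that $0.225 of the uncached total is output tokens, which caching does not touch; the cache lever matters most when the shared prefix is large relative to what each subagent writes.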
A minimal orchestrator: parallel reads, single-threaded writes
Here’s the pattern in runnable form — the spawning heuristic and the read/write boundary encoded as code rather than prose. It fans out read-only subagents concurrently, then funnels everything through one writer. Prices and model IDs change; the control structure is the point.
import asyncio

from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder text: substitute your own prompt. Every subagent must send the
# identical prefix for the cache to be shared, and very short prefixes may
# fall below the minimum cacheable length.
SHARED_READ_PROMPT = "You are a read-only research subagent. Gather evidence; do not decide."


# --- the heuristic, as a gate ---------------------------------------
def should_fan_out(subtasks: list[dict]) -> bool:
    """Fan out only when there are at least two subtasks and all are reads.

    A subtask is a read if it mutates no shared state.
    """
    return len(subtasks) >= 2 and all(t["kind"] == "read" for t in subtasks)


def first_text(resp) -> str:
    """Return the first text block of a response, or "" if there is none."""
    return next((b.text for b in resp.content if b.type == "text"), "")


# --- parallel READS (safe to run concurrently) ----------------------
async def read_subagent(task: dict) -> str:
    # low effort + cheaper model: subagents gather, they don't decide
    resp = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=[{  # identical prefix across subagents
            "type": "text",
            "text": SHARED_READ_PROMPT,  # -> cache this, read at ~0.1x
            "cache_control": {"type": "ephemeral"},
        }],
        output_config={"effort": "low"},
        messages=[{"role": "user", "content": task["instruction"]}],
    )
    return first_text(resp)


# --- single-threaded WRITE (one decision-maker) ---------------------
async def writer(findings: list[str], goal: str) -> str:
    resp = await client.messages.create(
        model="claude-opus-4-8",  # the writer gets the strong model
        max_tokens=16000,
        output_config={"effort": "high"},
        messages=[{
            "role": "user",
            "content": f"Goal: {goal}\n\nGathered evidence:\n"
                       + "\n---\n".join(findings),
        }],
    )
    return first_text(resp)


async def orchestrate(goal: str, subtasks: list[dict]) -> str:
    if should_fan_out(subtasks):
        # reads run concurrently; no shared-state conflict possible
        findings = await asyncio.gather(
            *(read_subagent(t) for t in subtasks)
        )
    else:
        # run subtasks one at a time: each call still gets its own context,
        # but there is no concurrency to reason about when debugging
        findings = [await read_subagent(t) for t in subtasks]
    # exactly one agent writes the result
    return await writer(list(findings), goal)
Three things this encodes that matter more than the API specifics. The should_fan_out gate refuses to parallelize anything that isn’t a pure read — that’s the conflict-avoidance rule made mechanical. Subagents run a cheaper model at low effort because their job is gathering, not deciding; the strong model is spent on the single write where coherence is decided. And the shared system prompt carries a cache breakpoint, because the prefix is identical across every subagent — the one place fan-out’s token multiplier is recoverable. (If you’re on a model family without these exact effort/cache parameters, the structure holds; only the call signature changes.)
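One practical wrinkle with the cache breakpoint: Anthropic’s prompt-caching documentation notes that a cache entry only becomes available once the first response has begun, so N requests launched at the same instant may each pay the write price instead of reading the cache. Check that against the current docs before relying on it. A possible workaround, sketched below, is to run one subagent first and fan out the rest afterwards. The helper name `warm_then_fan_out` and the injected `worker` callable are hypothetical; `worker` stands in for whatever coroutine makes the actual API call.

```python
import asyncio
from typing import Awaitable, Callable


async def warm_then_fan_out(
    tasks: list[dict],
    worker: Callable[[dict], Awaitable[str]],
) -> list[str]:
    """Run the first task alone to populate the cache, then the rest concurrently.

    Results are returned in the same order as `tasks`.
    """
    if not tasks:
        return []
    # Waiting for the full first response is stricter than strictly necessary,
    # but it needs no streaming hooks.
    first = await worker(tasks[0])
    rest = await asyncio.gather(*(worker(t) for t in tasks[1:]))
    return [first, *rest]
```

The trade is one subagent’s worth of extra wall-clock latency in exchange for a better chance that the remaining N − 1 calls read the shared prefix from cache. Whether that is worth it depends on how large the prefix is relative to the latency budget.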
What actually breaks in production
The failure modes are predictable and worth naming so you can watch for them:
- Silent write conflicts. A subagent you thought was read-only quietly writes — caches a file, updates a shared store, posts a message. Now you have the conflict you designed around, with no error. Audit every subagent’s side effects; a tool that can write is a write subagent no matter what you call it.
- Synthesis is the hard part, and it’s serial. Fan-out makes gathering fast, then dumps everything on one synthesizer whose context now holds N subagents’ worth of output. The synthesizer’s context window, not the subagents, is usually the real bottleneck — and it doesn’t parallelize.
- Debuggability collapses. A single agent has one transcript. A fan-out has N+2 interleaved ones, and reproducing a bad run means reproducing concurrent timing. Budget for substantially more observability than a single-agent system needs.
- Cost surprises from the prompt multiplier. Without prefix caching, every subagent re-pays for its framing. People model the output tokens and forget the N× on input. Measure actual cache_read_input_tokens in production: if it’s near zero, a silent cache invalidator means you’re paying full price N times.
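The cache check in the last item can be reduced to a single number per run. The sketch below assumes you have already collected each response’s usage as a plain dict; the three keys are the usage fields the Messages API reports, while the function name and the collection step are hypothetical.

```python
def cache_read_share(usages: list[dict]) -> float:
    """Fraction of all input tokens across a run that were served from cache."""
    read = written = uncached = 0
    for u in usages:
        read += u.get("cache_read_input_tokens") or 0
        written += u.get("cache_creation_input_tokens") or 0
        uncached += u.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0
```

A fan-out where every subagent reports `cache_creation_input_tokens` and none reports `cache_read_input_tokens` scores 0.0, which is the “paying full price N times” case. The healthy ceiling depends on how much of each prompt is shared prefix, so compare against your own expected share and not a fixed target.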
The honest summary: parallel subagent orchestration is a latency-and-throughput optimization for the read-heavy, independent-subtask corner of the problem space, paid for in tokens and debuggability. It is not a general intelligence upgrade. Keep writes single-threaded, fan out only genuine reads, cache the shared prefix, and price it against the single-agent baseline with live numbers before you build it.