Chanl
Agent Architecture

Multi-Agent Systems Don't Fail at Reasoning. They Fail at Handoff.

Command objects, memory transfer, and the 8-10 handoff cliff, plus the telemetry that catches drift.

Dean Grover, Co-founder
April 23, 2026
11 min read

When a multi-agent system starts failing in production, the first move is almost always to swap the model. Sonnet to Opus. Try GPT-5.4 Pro. Crank the reasoning budget. That move is almost always wrong.

The MAST study annotated 1,642 execution traces across seven open-source multi-agent frameworks and found that inter-agent misalignment is one of the three largest failure categories, alongside system design issues and task verification gaps. Context loss during handoff is the one failure mode that shows up in every incident report. A separate Google Research evaluation across 180 agent configurations found that independent multi-agent networks amplify errors 17.2x versus single-agent baselines.

The model can solve the task. The handoff is what drops it.

What a Handoff Actually Is

Across every framework shipped in the last year, a handoff is the same primitive wearing different clothes: a tool call that transfers control and some amount of state from one agent to another.

  • OpenAI Agents SDK (the production successor to Swarm): a handoff is a transfer_to_<agent> function. When the active agent calls it, the runner swaps active_agent and keeps the shared conversation history. Handoffs are explicit tool calls, which means they're visible in the trace.
  • LangGraph Command: a node returns Command(goto="agent_b", update={...}, graph=Command.PARENT). The goto routes control, the update mutates state, and the graph scope decides whether the hop stays inside a subgraph or exits to the parent.
  • Anthropic orchestrator-worker: the lead agent dispatches subagents in parallel, each with its own context window. Subagents don't talk to each other. They return condensed summaries (typically 1-2k tokens after 10k+ of exploration) back to the lead.

Different APIs, same three moving parts: a message, a state delta, and a control transfer. Almost every handoff bug lives in one of those three.
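
Those three moving parts can be sketched without any framework at all. This is a minimal, framework-neutral sketch (the `Handoff` dataclass and its field names are illustrative, not any SDK's actual API):

```python
from dataclasses import dataclass

# Illustrative only: a framework-neutral handoff primitive showing the
# three moving parts every framework shares.
@dataclass
class Handoff:
    message: str        # what the outgoing agent tells the receiver
    state_delta: dict   # explicit state update (LangGraph's update={})
    goto: str           # control transfer: which agent runs next

def apply_handoff(shared_state: dict, h: Handoff) -> tuple[dict, str]:
    """Merge the state delta and transfer control, as a runner would."""
    return {**shared_state, **h.state_delta}, h.goto

state, active = apply_handoff(
    {"order_id": "A-1001"},
    Handoff(
        message="Customer wants to return order A-1001.",
        state_delta={"customer_intent": "return"},
        goto="returns_agent",
    ),
)
```

Every handoff bug in the next section is a defect in one of those three fields: the message said too little, the delta dropped a key, or the goto pointed at the wrong agent.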

What Gets Dropped at the Boundary

The polite version in the docs reads "Agent A summarizes, Agent B picks up." The real version has four failure modes that recur across every framework on the list above.

1. Implicit State That Never Made It Into the Message

Agent A retrieved the customer's order ID from a tool call, reasoned about it internally, and handed off with a summary: "Customer wants to return their recent order." Agent B now has to re-retrieve. Sometimes it asks the customer again. Sometimes it guesses. Either way, the handoff failed because the state lived in Agent A's context window and nowhere else.

LangChain's handoff docs call this out directly: "you must explicitly decide what messages pass between agents. Get this wrong and agents receive malformed conversation history or bloated context."
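
A cheap guard against this failure mode is to validate the handoff payload against the receiving agent's required fields before control transfers. A sketch, assuming each receiver declares what it cannot work without (the registry and field names are hypothetical):

```python
# Assumption: each receiving agent declares the fields it cannot work without.
REQUIRED_FIELDS = {
    "returns_agent": {"order_id", "customer_intent"},
}

def validate_handoff(target: str, payload: dict) -> None:
    """Fail loudly at the boundary instead of letting the receiver guess."""
    missing = REQUIRED_FIELDS.get(target, set()) - payload.keys()
    if missing:
        raise ValueError(f"handoff to {target} missing: {sorted(missing)}")

# A prose-only summary like {"summary": "customer wants a return"} would
# raise here, because order_id never made it into the payload.
validate_handoff("returns_agent", {"order_id": "A-1001", "customer_intent": "return"})
```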

2. Compressed Summaries That Drop the Actual Ask

Anthropic's multi-agent research system compresses subagent output from tens of thousands of tokens down to 1-2k. That's a 10-20x compression ratio. On research tasks it works, because the subagent did the exploration and the lead only needs the conclusion. On CX tasks it's catastrophic. The customer's actual request is a 40-token string. If the summary drops it, the receiving agent hallucinates intent.

3. Stale Tool Results

Agent A called get_order_status three turns ago. The result was cached in the message history. Agent B picks up, reasons from the stale result, and answers a question about the current state using two-minute-old data. This is a pure observability problem: nothing in the handoff flags the freshness of the underlying tool output.
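
The fix is mechanical: stamp every tool result with the turn it was produced on, and partition at handoff time. A sketch, assuming turn counters and a 3-turn staleness threshold (both illustrative choices):

```python
# Assumption: results older than 3 turns are treated as stale.
FRESHNESS_LIMIT_TURNS = 3

def record_tool_result(history: list, turn: int, tool: str, result) -> None:
    """Stamp every tool output with the turn it was produced on."""
    history.append({"turn": turn, "tool": tool, "result": result})

def split_by_freshness(history: list, current_turn: int,
                       limit: int = FRESHNESS_LIMIT_TURNS):
    """Partition tool results into fresh and stale at handoff time."""
    fresh = [e for e in history if current_turn - e["turn"] <= limit]
    stale = [e for e in history if current_turn - e["turn"] > limit]
    return fresh, stale

history: list = []
record_tool_result(history, turn=1, tool="get_order_status", result="in_transit")
record_tool_result(history, turn=5, tool="get_order_status", result="delivered")
fresh, stale = split_by_freshness(history, current_turn=6)
# The turn-1 result lands in stale; the receiver should re-fetch or flag it.
```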

4. Role-Confused Instructions

OpenAI's Swarm docs note: "If an Agent calls multiple functions to hand-off to an Agent, only the last handoff function will be used." In practice this means a triage agent that tries to hand off to both a billing agent and a returns agent ends up routing to whichever was emitted last. The customer gets the wrong specialist, and the specialist's system prompt conflicts with what the customer just said.

The 8-10 Handoff Cliff

LangChain's benchmarking of multi-agent architectures and independent field studies converge on a consistent number: sequential handoffs degrade past roughly 8-10 hops. After that, accuracy on end-to-end tasks falls off a cliff.

Three things compound at that depth:

  • Information loss: each handoff drops some fraction of state. Compounded over 10 hops, the original intent is ~50% recoverable.
  • Role drift: each agent subtly re-interprets the task. By hop 8, the system is solving an adjacent problem.
  • Context bloat: conversation history grows linearly while each agent's useful context window shrinks.
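
The information-loss figure is just compounding: if each handoff preserves a fraction r of the state, n hops preserve r^n. A ~7% loss per hop is enough to halve recoverability by hop 10 (the 7% figure is an illustrative assumption, not a measured constant):

```python
def retained(per_hop_retention: float, hops: int) -> float:
    """Fraction of original state surviving n sequential handoffs."""
    return per_hop_retention ** hops

# Assuming ~7% loss per hop: 0.93^10 is roughly 0.48, so about half
# the original intent is gone by the tenth handoff.
after_ten = retained(0.93, 10)
```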

Atlan's analysis of multi-agent scaling puts it bluntly: past a threshold, adding agents doesn't add capability, it adds coordination cost. The centralized-coordination-with-shared-context pattern contains error amplification to 4.4x. Still bad, but four times better than independent agents. The fix isn't more agents. It's fewer handoffs.

The Observability Signals That Catch It

You cannot debug a handoff failure by reading a transcript. The failure is in what isn't in the transcript. These are the telemetry signals that have actually caught issues in production CX agents:

  • Pre/post-handoff state delta. Dump the state object immediately before and after each handoff. Diff them. Fields that silently disappear between agents are your candidate dropouts.
  • Stale-state queries. Count how often the receiving agent re-asks the user for information already present in the prior agent's transcript. High rates mean the handoff summary is compressing too aggressively.
  • Intent preservation score. Run the original user utterance and the final agent's interpretation through a cheap LLM-as-judge. Divergence above a threshold is a handoff regression signal.
  • Handoff-to-resolution ratio. How many handoffs did it take to close the conversation? A creeping number is the earliest leading indicator of degradation, and it usually precedes CSAT drops by days.
  • Tool-result freshness at handoff time. Timestamp every tool output. Flag any reasoning step downstream that uses a tool result older than N turns.
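
The first signal on that list is the cheapest to build. A sketch of the pre/post-handoff diff, assuming the shared state is a flat dict (field names are illustrative):

```python
def handoff_state_diff(pre: dict, post: dict) -> dict:
    """Diff the shared state across a handoff boundary: fields that were
    dropped, added, or changed between the outgoing and incoming agent."""
    return {
        "dropped": sorted(pre.keys() - post.keys()),
        "added": sorted(post.keys() - pre.keys()),
        "changed": sorted(k for k in pre.keys() & post.keys() if pre[k] != post[k]),
    }

pre = {"order_id": "A-1001", "customer_intent": "return"}
post = {"customer_intent": "return", "assigned_agent": "returns"}
diff = handoff_state_diff(pre, post)
# order_id silently disappeared: a candidate dropout worth alerting on.
```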

In a platform like Chanl Monitor, these are dashboard panels, not grep queries. The point is that every signal here is visible at the instrumentation layer. None of them require model upgrades or better prompts.

Handoff Design Rules That Hold Up in Production

Six rules that survive contact with real customer conversations:

1. Hand off explicit state objects, not compressed prose. Use Command-style update={} payloads with named fields (order_id, customer_intent, resolved_items). Prose summaries are lossy by definition.

2. Summarize then hand off, never the reverse. The outgoing agent should emit a structured summary as its last step, and the handoff tool should read from it. Don't let the receiving agent reconstruct context from raw message history.
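
Rules 1 and 2 together look like this in code. A minimal sketch, not LangGraph's actual `Command` type: the outgoing agent fills a typed state object, and the handoff tool reads only from it (field names are illustrative):

```python
from dataclasses import dataclass, asdict

# Field names are illustrative; the point is named fields, not prose.
@dataclass
class HandoffState:
    order_id: str
    customer_intent: str
    resolved_items: list

def handoff(goto: str, state: HandoffState) -> dict:
    """Build a Command-style payload: the handoff tool reads the structured
    summary the outgoing agent just emitted, never raw message history."""
    return {"goto": goto, "update": asdict(state)}

payload = handoff("billing_agent",
                  HandoffState("A-1001", "refund_duplicate_charge", []))
```

Because `HandoffState` is a dataclass, forgetting a field is a construction error at the sending side rather than a silent dropout at the receiving side.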

3. Cap sequential depth at 5. Past that, restructure as supervisor-coordinates-parallel-workers (Anthropic's orchestrator pattern) instead of swarm-style chains. You pay 15-20% more tokens and buy back accuracy.

4. Supervisor for safety-critical, swarm for latency-critical. Beam's orchestration taxonomy has this right: supervisor adds 20-40% token overhead but gives you a single point of policy enforcement. Swarm is faster and cheaper, until the quality cliff.

5. Freeze the tool output schema at the handoff boundary. If Agent B consumes a field that Agent A emits, that field is now part of the contract. Test it like any other API contract. This is exactly what end-to-end Scenarios are for.
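
A contract check at the boundary can be as small as a dict of expected types. A sketch under the same illustrative field names as above:

```python
# Illustrative contract: field name -> expected type. If Agent B consumes
# a field Agent A emits, that field belongs here and gets tested.
HANDOFF_CONTRACT = {
    "order_id": str,
    "customer_intent": str,
    "resolved_items": list,
}

def contract_violations(payload: dict, contract: dict = HANDOFF_CONTRACT) -> list:
    """Empty list means the handoff payload honors the contract."""
    missing = [f"missing: {k}" for k in contract if k not in payload]
    wrong = [
        f"wrong type: {k}"
        for k, t in contract.items()
        if k in payload and not isinstance(payload[k], t)
    ]
    return missing + wrong

ok = contract_violations(
    {"order_id": "A-1001", "customer_intent": "return", "resolved_items": []}
)
```

Run it in CI against recorded handoff payloads, the same way you would pin any other API contract.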

6. Instrument before scaling. The single biggest predictor of whether a multi-agent system survives past the demo is whether the team added handoff telemetry before adding the fourth agent.

The Prompt You Should Be Optimizing

Most of the prompt engineering effort in multi-agent systems goes into the agent-to-user prompt, the system message that defines persona and instructions. That's the visible surface.

The prompts that actually decide whether the system works are the ones agents pass to each other. The one-line summary the triage agent hands to the billing specialist. The state update LangGraph emits with Command. The distilled output the subagent returns to the lead. Those are the prompts your customers experience, even though no customer ever sees them.

Treat them like product. Version them, diff them in code review, write regression tests against them. And when the system starts misbehaving, look there before you reach for a bigger model. The handoff is the prompt now.

Dean Grover, Co-founder. Building the platform for AI agents at Chanl: tools, testing, and observability for customer experience.
