When a multi-agent system starts failing in production, the first move is almost always to swap the model. Sonnet to Opus. Try GPT-5.4 Pro. Crank the reasoning budget. That move is almost always wrong.
The MAST study annotated 1,642 execution traces across seven open-source multi-agent frameworks and found that inter-agent misalignment is one of the three largest failure categories, alongside system design issues and task verification gaps. Context loss during handoff is the one failure mode that shows up in every incident report. A separate Google Research evaluation across 180 agent configurations found that independent multi-agent networks amplify errors 17.2x versus single-agent baselines.
The model can solve the task. The handoff is what drops it.
## What a Handoff Actually Is
Across every framework shipped in the last year, a handoff is the same primitive wearing different clothes: a tool call that transfers control and some amount of state from one agent to another.
- OpenAI Agents SDK (the production successor to Swarm): a handoff is a `transfer_to_<agent>` function. When the active agent calls it, the runner swaps `active_agent` and keeps the shared conversation history. Handoffs are explicit tool calls, which means they're visible in the trace.
- LangGraph Command: a node returns `Command(goto="agent_b", update={...}, graph=Command.PARENT)`. The `goto` routes control, the `update` mutates state, and the `graph` scope decides whether the hop stays inside a subgraph or exits to the parent.
- Anthropic orchestrator-worker: the lead agent dispatches subagents in parallel, each with its own context window. Subagents don't talk to each other. They return condensed summaries (typically 1-2k tokens after 10k+ of exploration) back to the lead.
Different APIs, same three moving parts: a message, a state delta, and a control transfer. Almost every handoff bug lives in one of those three.
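Those three moving parts can be sketched framework-neutrally. The names below (`Handoff`, `apply_handoff`) are illustrative, not any real SDK's API; real runners add routing, validation, and tracing around this core.

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    message: str                                      # what the outgoing agent says happened
    state_delta: dict = field(default_factory=dict)   # explicit state to merge
    target: str = ""                                  # which agent receives control

def apply_handoff(shared_state: dict, active_agent: str, h: Handoff) -> str:
    """Merge the delta and transfer control — the same shape as LangGraph's
    Command(goto=..., update=...) or the Agents SDK's transfer_to_<agent>."""
    shared_state.update(h.state_delta)
    shared_state.setdefault("messages", []).append(
        {"from": active_agent, "content": h.message}
    )
    return h.target  # the new active agent

state = {"order_id": None}
new_agent = apply_handoff(
    state, "triage",
    Handoff(message="Customer wants a return.",
            state_delta={"order_id": "A-1042", "intent": "return"},
            target="returns"),
)
```

Every bug in the next section corresponds to one of these fields being wrong: a lossy `message`, a missing key in `state_delta`, or the wrong `target`.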
## What Gets Dropped at the Boundary
The polite version in the docs reads "Agent A summarizes, Agent B picks up." The real version has four failure modes that recur across every framework on the list above.
### 1. Implicit State That Never Made It Into the Message
Agent A retrieved the customer's order ID from a tool call, reasoned about it internally, and handed off with a summary: "Customer wants to return their recent order." Agent B now has to re-retrieve. Sometimes it asks the customer again. Sometimes it guesses. Either way, the handoff failed because the state lived in Agent A's context window and nowhere else.
LangChain's handoff docs call this out directly: "you must explicitly decide what messages pass between agents. Get this wrong and agents receive malformed conversation history or bloated context."
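This failure mode is checkable before the handoff fires: diff what the outgoing agent observed via tool calls against what the payload actually carries. A minimal sketch, with illustrative field names:

```python
def dropped_fields(tool_observations: dict, handoff_payload: dict) -> set:
    """Fields the outgoing agent retrieved but did not pass across the boundary."""
    return set(tool_observations) - set(handoff_payload)

# What Agent A learned from tool calls during its turn
observed = {"order_id": "A-1042", "order_status": "delivered"}

# What it actually handed off
payload = {"summary": "Customer wants to return their recent order."}

missing = dropped_fields(observed, payload)
# order_id and order_status lived only in Agent A's context window
```

A nonempty `missing` set at handoff time is exactly the "state lived in Agent A's context window and nowhere else" failure, caught before Agent B has to re-ask or guess.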
### 2. Compressed Summaries That Drop the Actual Ask
Anthropic's multi-agent research system compresses subagent output from tens of thousands of tokens down to 1-2k. That's a 10-20x compression ratio. On research tasks it works, because the subagent did the exploration and the lead only needs the conclusion. On CX tasks it's catastrophic. The customer's actual request is a 40-token string. If the summary drops it, the receiving agent hallucinates intent.
### 3. Stale Tool Results
Agent A called get_order_status three turns ago. The result was cached in the message history. Agent B picks up, reasons from the stale result, and answers a question about the current state using two-minute-old data. This is a pure observability problem: nothing in the handoff flags the freshness of the underlying tool output.
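The instrumentation-level fix is to attach a turn counter (or wall-clock timestamp) to every tool result, so staleness is flaggable at the boundary. A sketch, assuming a simple turn-based age cutoff; names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    name: str
    value: object
    turn: int          # turn number when the tool was actually called

def stale_results(results: list, current_turn: int, max_age_turns: int = 2) -> list:
    """Tool outputs older than the cutoff — candidates for re-fetching
    before the receiving agent reasons from them."""
    return [r for r in results if current_turn - r.turn > max_age_turns]

history = [ToolResult("get_order_status", "in_transit", turn=1)]
flagged = stale_results(history, current_turn=4)  # 3 turns old -> stale
```

The receiving agent (or the runner on its behalf) re-invokes anything in `flagged` instead of trusting the cached value.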
### 4. Role-Confused Instructions
OpenAI's Swarm docs note: "If an Agent calls multiple functions to hand-off to an Agent, only the last handoff function will be used." In practice this means a triage agent that tries to hand off to both a billing agent and a returns agent ends up routing to whichever was emitted last. The customer gets the wrong specialist, and the specialist's system prompt conflicts with what the customer just said.
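Because only the last handoff wins, a trace-level check can surface the ambiguity before it misroutes anyone: flag any turn where the agent emitted more than one handoff tool call. A sketch, assuming Swarm-style `transfer_to_*` naming (the prefix is an assumption; match your own tool-naming convention):

```python
def detect_conflicting_handoffs(tool_calls: list) -> list:
    """Return the handoff calls that a last-wins runner silently discards."""
    handoffs = [c for c in tool_calls if c.startswith("transfer_to_")]
    return handoffs[:-1]  # everything except the winner

turn = ["lookup_account", "transfer_to_billing", "transfer_to_returns"]
discarded = detect_conflicting_handoffs(turn)
# transfer_to_billing is silently dropped; only returns gets the customer
```

A nonempty `discarded` list is an alertable event: the triage agent was ambivalent, and the runner resolved the ambiguity arbitrarily.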
## The 8-10 Handoff Cliff
LangChain's benchmarking of multi-agent architectures and independent field studies converge on a consistent number: sequential handoffs degrade past roughly 8-10 hops. After that, accuracy on end-to-end tasks falls off a cliff.
Three things compound at that depth:
| Failure mode | What happens |
|---|---|
| Information loss | Each handoff drops some fraction of state. Compounded over 10 hops, the original intent is ~50% recoverable. |
| Role drift | Each agent subtly re-interprets the task. By hop 8, the system is solving an adjacent problem. |
| Context bloat | Conversation history grows linearly while each agent's useful context window shrinks. |
Atlan's analysis of multi-agent scaling puts it bluntly: past a threshold, adding agents doesn't add capability, it adds coordination cost. The centralized-coordination-with-shared-context pattern contains error amplification to 4.4x. Still bad, but four times better than independent agents. The fix isn't more agents. It's fewer handoffs.
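The information-loss row is just compounding arithmetic. If each hop retains a fraction r of the original state, n sequential hops retain r^n. The 7% per-hop loss below is an illustrative assumption chosen to match the ~50%-at-10-hops figure, not a measured rate:

```python
def retained_after(hops: int, per_hop_retention: float = 0.93) -> float:
    """Fraction of original state surviving n sequential handoffs,
    assuming a constant per-hop retention rate."""
    return per_hop_retention ** hops

ten_hop = retained_after(10)   # ~0.48: roughly half the original intent gone
five_hop = retained_after(5)   # ~0.70: why capping depth at 5 helps
```

The same math explains why the fix is fewer handoffs rather than better agents: improving each hop from 93% to 96% retention still loses a third of the state over 10 hops.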
## The Observability Signals That Catch It
You cannot debug a handoff failure by reading a transcript. The failure is in what isn't in the transcript. These are the telemetry signals that have actually caught issues in production CX agents:
- Pre/post-handoff state delta. Dump the state object immediately before and after each handoff. Diff them. Fields that silently disappear between agents are your candidate dropouts.
- Stale-state queries. Count how often the receiving agent re-asks the user for information already present in the prior agent's transcript. High rates mean the handoff summary is compressing too aggressively.
- Intent preservation score. Run the original user utterance and the final agent's interpretation through a cheap LLM-as-judge. Divergence above a threshold is a handoff regression signal.
- Handoff-to-resolution ratio. How many handoffs did it take to close the conversation? A creeping number is the earliest leading indicator of degradation, and it usually precedes CSAT drops by days.
- Tool-result freshness at handoff time. Timestamp every tool output. Flag any reasoning step downstream that uses a tool result older than N turns.
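The first signal on the list is a one-function diff. A sketch, with plain dicts standing in for whatever state object your framework uses:

```python
def handoff_state_diff(pre: dict, post: dict) -> dict:
    """Diff the state object across a handoff boundary. 'dropped' fields
    are the candidate dropouts to alert on."""
    return {
        "dropped": sorted(set(pre) - set(post)),
        "added":   sorted(set(post) - set(pre)),
        "changed": sorted(k for k in set(pre) & set(post) if pre[k] != post[k]),
    }

pre  = {"order_id": "A-1042", "intent": "return", "items": ["shoes"]}
post = {"intent": "return", "items": ["shoes"], "assignee": "returns_agent"}

diff = handoff_state_diff(pre, post)
# diff["dropped"] == ["order_id"]  -> a field that silently disappeared
```

Emitted as a span attribute on every handoff, this turns "fields that silently disappear" from a forensic exercise into a dashboard panel.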
In a platform like Chanl Monitor, these are dashboard panels, not grep queries. The point is that every signal here is visible at the instrumentation layer. None of them require model upgrades or better prompts.
## Handoff Design Rules That Hold Up in Production
Six rules that survive contact with real customer conversations:
1. Hand off explicit state objects, not compressed prose. Use `Command`-style `update={}` payloads with named fields (`order_id`, `customer_intent`, `resolved_items`). Prose summaries are lossy by definition.
2. Summarize then hand off, never the reverse. The outgoing agent should emit a structured summary as its last step, and the handoff tool should read from it. Don't let the receiving agent reconstruct context from raw message history.
3. Cap sequential depth at 5. Past that, restructure as supervisor-coordinates-parallel-workers (Anthropic's orchestrator pattern) instead of swarm-style chains. You pay 15-20% more tokens and buy back accuracy.
4. Supervisor for safety-critical, swarm for latency-critical. Beam's orchestration taxonomy has this right: supervisor adds 20-40% token overhead but gives you a single point of policy enforcement. Swarm is faster and cheaper, until the quality cliff.
5. Freeze the tool output schema at the handoff boundary. If Agent B consumes a field that Agent A emits, that field is now part of the contract. Test it like any other API contract. This is exactly what end-to-end Scenarios are for.
6. Instrument before scaling. The single biggest predictor of whether a multi-agent system survives past the demo is whether the team added handoff telemetry before adding the fourth agent.
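Rules 1 and 5 together say the handoff payload is a typed contract. A minimal sketch of what that looks like in practice; the field names are illustrative, and the key property is that a missing field fails loudly at the boundary instead of silently downstream:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReturnsHandoff:
    """The contract between the triage agent and the returns agent.
    If the receiving agent consumes a field, it belongs here."""
    order_id: str
    customer_intent: str
    resolved_items: tuple = ()

REQUIRED = {"order_id", "customer_intent"}

def validate_payload(payload: dict) -> ReturnsHandoff:
    """Reject a handoff that violates the contract, at the boundary."""
    missing = REQUIRED - payload.keys()
    if missing:
        raise ValueError(f"handoff contract violated, missing: {sorted(missing)}")
    return ReturnsHandoff(**payload)

h = validate_payload({"order_id": "A-1042", "customer_intent": "return"})
```

Because the contract is a named type, it can be versioned, diffed in code review, and exercised in end-to-end tests like any other API surface.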
## The Prompt You Should Be Optimizing
Most of the prompt engineering effort in multi-agent systems goes into the agent-to-user prompt, the system message that defines persona and instructions. That's the visible surface.
The prompts that actually decide whether the system works are the ones agents pass to each other. The one-line summary the triage agent hands to the billing specialist. The state update LangGraph emits with Command. The distilled output the subagent returns to the lead. Those are the prompts your customers experience, even though no customer ever sees them.
Treat them like product. Version them, diff them in code review, write regression tests against them. And when the system starts misbehaving, look there before you reach for a bigger model. The handoff is the prompt now.
## References

- MAST: Why Do Multi-Agent LLM Systems Fail? (arXiv 2503.13657)
- Anthropic — How we built our multi-agent research system
- Anthropic — Effective context engineering for AI agents
- OpenAI Cookbook — Orchestrating Agents: Routines and Handoffs
- OpenAI Swarm (GitHub)
- LangChain — Command: A new tool for multi-agent architectures
- LangChain — Handoffs documentation
- LangChain — Benchmarking Multi-Agent Architectures
- Towards Data Science — The Multi-Agent Trap
- Towards Data Science — How Agent Handoffs Work in Multi-Agent Systems
- Galileo — Why Multi-Agent Systems Fail
- Atlan — Multi-Agent Scaling: The Context Gap
- XTrace — AI Agent Handoff: Why Context Breaks
- Beam.ai — 6 Multi-Agent Orchestration Patterns for Production
- Anthropic — When to use multi-agent systems (and when not to)
- LangGraph Swarm (GitHub)