Chanl
Agent Architecture

Multi-Agent Systems Don't Fail at Reasoning. They Fail at Handoff.

Command objects, memory transfer, and the 8-10 handoff cliff, plus the telemetry that catches drift.

Dean Grover, Co-founder
April 23, 2026
11 min read

When a multi-agent system starts failing in production, the first move is almost always to swap the model. Sonnet to Opus. Try GPT-5.4 Pro. Crank the reasoning budget. That move is almost always wrong.

The MAST study annotated 1,642 execution traces across seven open-source multi-agent frameworks and found that inter-agent misalignment is one of the three largest failure categories, alongside system design issues and task verification gaps. Context loss during handoff is the one failure mode that shows up in every incident report. A separate Google Research evaluation across 180 agent configurations found that independent multi-agent networks amplify errors 17.2x versus single-agent baselines.

The model can solve the task. The handoff is what drops it.

What a Handoff Actually Is

Across every framework shipped in the last year, a handoff is the same primitive wearing different clothes: a tool call that transfers control and some amount of state from one agent to another.

  • OpenAI Agents SDK (the production successor to Swarm): a handoff is a transfer_to_<agent> function. When the active agent calls it, the runner swaps active_agent and keeps the shared conversation history. Handoffs are explicit tool calls, which means they're visible in the trace.
  • LangGraph Command: a node returns Command(goto="agent_b", update={...}, graph=Command.PARENT). The goto routes control, the update mutates state, and the graph scope decides whether the hop stays inside a subgraph or exits to the parent.
  • Anthropic orchestrator-worker: the lead agent dispatches subagents in parallel, each with its own context window. Subagents don't talk to each other. They return condensed summaries (typically 1-2k tokens after 10k+ of exploration) back to the lead.

Different APIs, same three moving parts: a message, a state delta, and a control transfer. Almost every handoff bug lives in one of those three.
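
Those three moving parts can be sketched without any framework at all. This is a minimal, framework-neutral sketch (the `Handoff` dataclass and its field names are illustrative, not any SDK's actual API):

```python
from dataclasses import dataclass

# Illustrative only: a framework-neutral handoff primitive showing the
# three moving parts every framework shares.
@dataclass
class Handoff:
    message: str        # what the outgoing agent tells the receiver
    state_delta: dict   # explicit state update (LangGraph's update={})
    goto: str           # control transfer: which agent runs next

def apply_handoff(shared_state: dict, h: Handoff) -> tuple[dict, str]:
    """Merge the state delta and transfer control, as a runner would."""
    return {**shared_state, **h.state_delta}, h.goto

state, active = apply_handoff(
    {"order_id": "A-1001"},
    Handoff(
        message="Customer wants to return order A-1001.",
        state_delta={"customer_intent": "return"},
        goto="returns_agent",
    ),
)
```

Every handoff bug in the next section is a defect in one of those three fields: the message said too little, the delta dropped a key, or the goto pointed at the wrong agent.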

What Gets Dropped at the Boundary

The polite version in the docs reads "Agent A summarizes, Agent B picks up." The real version has four failure modes that recur across every framework on the list above.

1. Implicit State That Never Made It Into the Message

Agent A retrieved the customer's order ID from a tool call, reasoned about it internally, and handed off with a summary: "Customer wants to return their recent order." Agent B now has to re-retrieve. Sometimes it asks the customer again. Sometimes it guesses. Either way, the handoff failed because the state lived in Agent A's context window and nowhere else.

LangChain's handoff docs call this out directly: "you must explicitly decide what messages pass between agents. Get this wrong and agents receive malformed conversation history or bloated context."
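
A cheap guard against this failure mode is to validate the handoff payload against the receiving agent's required fields before control transfers. A sketch, assuming each receiver declares what it cannot work without (the registry and field names are hypothetical):

```python
# Assumption: each receiving agent declares the fields it cannot work without.
REQUIRED_FIELDS = {
    "returns_agent": {"order_id", "customer_intent"},
}

def validate_handoff(target: str, payload: dict) -> None:
    """Fail loudly at the boundary instead of letting the receiver guess."""
    missing = REQUIRED_FIELDS.get(target, set()) - payload.keys()
    if missing:
        raise ValueError(f"handoff to {target} missing: {sorted(missing)}")

# A prose-only summary like {"summary": "customer wants a return"} would
# raise here, because order_id never made it into the payload.
validate_handoff("returns_agent", {"order_id": "A-1001", "customer_intent": "return"})
```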

2. Compressed Summaries That Drop the Actual Ask

Anthropic's multi-agent research system compresses subagent output from tens of thousands of tokens down to 1-2k. That's a 10-20x compression ratio. On research tasks it works, because the subagent did the exploration and the lead only needs the conclusion. On CX tasks it's catastrophic. The customer's actual request is a 40-token string. If the summary drops it, the receiving agent hallucinates intent.

3. Stale Tool Results

Agent A called get_order_status three turns ago. The result was cached in the message history. Agent B picks up, reasons from the stale result, and answers a question about the current state using two-minute-old data. This is a pure observability problem: nothing in the handoff flags the freshness of the underlying tool output.
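
The fix is mechanical: stamp every tool result with the turn it was produced on, and partition at handoff time. A sketch, assuming turn counters and a 3-turn staleness threshold (both illustrative choices):

```python
# Assumption: results older than 3 turns are treated as stale.
FRESHNESS_LIMIT_TURNS = 3

def record_tool_result(history: list, turn: int, tool: str, result) -> None:
    """Stamp every tool output with the turn it was produced on."""
    history.append({"turn": turn, "tool": tool, "result": result})

def split_by_freshness(history: list, current_turn: int,
                       limit: int = FRESHNESS_LIMIT_TURNS):
    """Partition tool results into fresh and stale at handoff time."""
    fresh = [e for e in history if current_turn - e["turn"] <= limit]
    stale = [e for e in history if current_turn - e["turn"] > limit]
    return fresh, stale

history: list = []
record_tool_result(history, turn=1, tool="get_order_status", result="in_transit")
record_tool_result(history, turn=5, tool="get_order_status", result="delivered")
fresh, stale = split_by_freshness(history, current_turn=6)
# The turn-1 result lands in stale; the receiver should re-fetch or flag it.
```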

4. Role-Confused Instructions

OpenAI's Swarm docs note: "If an Agent calls multiple functions to hand-off to an Agent, only the last handoff function will be used." In practice this means a triage agent that tries to hand off to both a billing agent and a returns agent ends up routing to whichever was emitted last. The customer gets the wrong specialist, and the specialist's system prompt conflicts with what the customer just said.

The 8-10 Handoff Cliff

LangChain's benchmarking of multi-agent architectures and independent field studies converge on a consistent number: sequential handoffs degrade past roughly 8-10 hops. After that, accuracy on end-to-end tasks falls off a cliff.

Three things compound at that depth:

  • Information loss: each handoff drops some fraction of state. Compounded over 10 hops, the original intent is ~50% recoverable.
  • Role drift: each agent subtly re-interprets the task. By hop 8, the system is solving an adjacent problem.
  • Context bloat: conversation history grows linearly while each agent's useful context window shrinks.
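
The information-loss figure is just compounding: if each handoff preserves a fraction r of the state, n hops preserve r^n. A ~7% loss per hop is enough to halve recoverability by hop 10 (the 7% figure is an illustrative assumption, not a measured constant):

```python
def retained(per_hop_retention: float, hops: int) -> float:
    """Fraction of original state surviving n sequential handoffs."""
    return per_hop_retention ** hops

# Assuming ~7% loss per hop: 0.93^10 is roughly 0.48, so about half
# the original intent is gone by the tenth handoff.
after_ten = retained(0.93, 10)
```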

Atlan's analysis of multi-agent scaling puts it bluntly: past a threshold, adding agents doesn't add capability, it adds coordination cost. The centralized-coordination-with-shared-context pattern contains error amplification to 4.4x. Still bad, but four times better than independent agents. The fix isn't more agents. It's fewer handoffs.

The Observability Signals That Catch It

You cannot debug a handoff failure by reading a transcript. The failure is in what isn't in the transcript. These are the telemetry signals that have actually caught issues in production CX agents:

  • Pre/post-handoff state delta. Dump the state object immediately before and after each handoff. Diff them. Fields that silently disappear between agents are your candidate dropouts.
  • Stale-state queries. Count how often the receiving agent re-asks the user for information already present in the prior agent's transcript. High rates mean the handoff summary is compressing too aggressively.
  • Intent preservation score. Run the original user utterance and the final agent's interpretation through a cheap LLM-as-judge. Divergence above a threshold is a handoff regression signal.
  • Handoff-to-resolution ratio. How many handoffs did it take to close the conversation? A creeping number is the earliest leading indicator of degradation, and it usually precedes CSAT drops by days.
  • Tool-result freshness at handoff time. Timestamp every tool output. Flag any reasoning step downstream that uses a tool result older than N turns.
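
The first signal on that list is the cheapest to build. A sketch of the pre/post-handoff diff, assuming the shared state is a flat dict (field names are illustrative):

```python
def handoff_state_diff(pre: dict, post: dict) -> dict:
    """Diff the shared state across a handoff boundary: fields that were
    dropped, added, or changed between the outgoing and incoming agent."""
    return {
        "dropped": sorted(pre.keys() - post.keys()),
        "added": sorted(post.keys() - pre.keys()),
        "changed": sorted(k for k in pre.keys() & post.keys() if pre[k] != post[k]),
    }

pre = {"order_id": "A-1001", "customer_intent": "return"}
post = {"customer_intent": "return", "assigned_agent": "returns"}
diff = handoff_state_diff(pre, post)
# order_id silently disappeared: a candidate dropout worth alerting on.
```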

In a platform like Chanl Monitor, these are dashboard panels, not grep queries. The point is that every signal here is visible at the instrumentation layer. None of them require model upgrades or better prompts.

Handoff Design Rules That Hold Up in Production

Six rules that survive contact with real customer conversations:

1. Hand off explicit state objects, not compressed prose. Use Command-style update={} payloads with named fields (order_id, customer_intent, resolved_items). Prose summaries are lossy by definition.

2. Summarize then hand off, never the reverse. The outgoing agent should emit a structured summary as its last step, and the handoff tool should read from it. Don't let the receiving agent reconstruct context from raw message history.
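
Rules 1 and 2 together look like this in code. A minimal sketch, not LangGraph's actual `Command` type: the outgoing agent fills a typed state object, and the handoff tool reads only from it (field names are illustrative):

```python
from dataclasses import dataclass, asdict

# Field names are illustrative; the point is named fields, not prose.
@dataclass
class HandoffState:
    order_id: str
    customer_intent: str
    resolved_items: list

def handoff(goto: str, state: HandoffState) -> dict:
    """Build a Command-style payload: the handoff tool reads the structured
    summary the outgoing agent just emitted, never raw message history."""
    return {"goto": goto, "update": asdict(state)}

payload = handoff("billing_agent",
                  HandoffState("A-1001", "refund_duplicate_charge", []))
```

Because `HandoffState` is a dataclass, forgetting a field is a construction error at the sending side rather than a silent dropout at the receiving side.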

3. Cap sequential depth at 5. Past that, restructure as supervisor-coordinates-parallel-workers (Anthropic's orchestrator pattern) instead of swarm-style chains. You pay 15-20% more tokens and buy back accuracy.

4. Supervisor for safety-critical, swarm for latency-critical. Beam's orchestration taxonomy has this right: supervisor adds 20-40% token overhead but gives you a single point of policy enforcement. Swarm is faster and cheaper, until the quality cliff.

5. Freeze the tool output schema at the handoff boundary. If Agent B consumes a field that Agent A emits, that field is now part of the contract. Test it like any other API contract. This is exactly what end-to-end Scenarios are for.
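
A contract check at the boundary can be as small as a dict of expected types. A sketch under the same illustrative field names as above:

```python
# Illustrative contract: field name -> expected type. If Agent B consumes
# a field Agent A emits, that field belongs here and gets tested.
HANDOFF_CONTRACT = {
    "order_id": str,
    "customer_intent": str,
    "resolved_items": list,
}

def contract_violations(payload: dict, contract: dict = HANDOFF_CONTRACT) -> list:
    """Empty list means the handoff payload honors the contract."""
    missing = [f"missing: {k}" for k in contract if k not in payload]
    wrong = [
        f"wrong type: {k}"
        for k, t in contract.items()
        if k in payload and not isinstance(payload[k], t)
    ]
    return missing + wrong

ok = contract_violations(
    {"order_id": "A-1001", "customer_intent": "return", "resolved_items": []}
)
```

Run it in CI against recorded handoff payloads, the same way you would pin any other API contract.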

6. Instrument before scaling. The single biggest predictor of whether a multi-agent system survives past the demo is whether the team added handoff telemetry before adding the fourth agent.

The Prompt You Should Be Optimizing

Most of the prompt engineering effort in multi-agent systems goes into the agent-to-user prompt, the system message that defines persona and instructions. That's the visible surface.

The prompts that actually decide whether the system works are the ones agents pass to each other. The one-line summary the triage agent hands to the billing specialist. The state update LangGraph emits with Command. The distilled output the subagent returns to the lead. Those are the prompts your customers experience, even though no customer ever sees them.

Treat them like product. Version them, diff them in code review, write regression tests against them. And when the system starts misbehaving, look there before you reach for a bigger model. The handoff is the prompt now.

Dean Grover, Co-founder. Building the platform for AI agents at Chanl: tools, testing, and observability for customer experience.
