Agent Architecture

When to Use a Supervisor, When to Let Agents Swarm

Supervisor burns 20-40% more tokens per run. Swarm hits a quality cliff past 8-10 handoffs. Start supervisor, graduate to swarm when latency bites.

Dean Grover, Co-founder
April 23, 2026
13 min read
Figure: two agent topologies side by side, a hub-and-spoke supervisor and a peer-to-peer swarm, with a dotted graduation arrow between them.

Most multi-agent writeups pick a pattern and defend it. That is not what production teams do. Production teams pick supervisor for the first three workflows because it is debuggable, keep it for as long as the token tax is affordable, and peel individual workflows off to a swarm when latency becomes the thing customers complain about. The question is not which pattern is better in the abstract. It is which pattern is better for this workflow, given how deep the call gets and how expensive a routing error is.

Two numbers decide it. Supervisor costs 20-40% more tokens per run because the router LLM runs on every hop. Swarm hits a quality cliff past 8-10 sequential peer handoffs. Tau-bench's pass^8 result for base models sits under 25% for the same structural reason. Both numbers show up across production CX traces and open benchmarks. Neither moves much with model upgrades, because the overhead is structural, not capability-bound.

What follows is the shape of that trade-off for customer experience specifically. Where each pattern breaks, what to measure, and a concrete LangGraph+MCP reference for the graduation path. If you want the bird's-eye view of all six orchestration patterns and when each one shows up, I wrote that in the multi-agent orchestration patterns piece. This one is narrower: supervisor vs swarm, for CX, with real telemetry.

The Pattern Choice Isn't a Matter of Taste

CX workflows reduce to four canonical tasks: triage the request, look up the state of the world, take an action, summarize the outcome. A typical call runs three to five hops across those four tasks. Sometimes it runs twelve. The distribution is long-tailed and the long tail is where pattern choice starts mattering.

Supervisor pattern: one central agent owns routing. Every hop passes through it. The supervisor reads the full history, decides the next specialist to call, receives the result, decides what to do next. It is a hub-and-spoke topology where every leaf talks only to the center.

Swarm pattern: peers hand off directly. Triage decides the request needs a refund, passes control to the refund agent, which passes control to the notification agent, which passes control to the summary agent. No central router. State travels with the message, usually as a serialized object with the conversation so far plus whatever task-specific fields the next peer needs.

The supervisor pays an extra LLM call per hop. That's the 20-40% number. On a 3-hop call it's a rounding error. On a 10-hop call it adds a full model turn of latency and cost. The swarm saves those calls but pays a different tax: every handoff is a fresh context window, and the peer on the receiving end has to reconstruct task state from whatever the previous peer serialized. Past 8-10 handoffs, that reconstruction stops being reliable.
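The tax is easy to sanity-check with a back-of-envelope model. The token counts below are illustrative assumptions, not measurements, chosen so the ratio lands in the 20-40% band from the traces:

```python
# Back-of-envelope model of the supervisor's routing tax.
# router_tokens_per_hop and worker_tokens_per_hop are illustrative
# assumptions, not measured values.

def routing_overhead(hops: int, router_tokens_per_hop: int = 900,
                     worker_tokens_per_hop: int = 2500) -> float:
    """Fraction of extra tokens a supervisor spends vs. peer handoffs."""
    supervisor_total = hops * (router_tokens_per_hop + worker_tokens_per_hop)
    swarm_total = hops * worker_tokens_per_hop
    return supervisor_total / swarm_total - 1.0

print(round(routing_overhead(3), 2))   # 0.36, i.e. ~36% more tokens
```

In this simple model the percentage is constant per hop; what grows with depth is the absolute cost, plus the extra model turn of latency on every hop.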

This is the honest version of the choice. Supervisor is more expensive per hop but cheaper to debug. Swarm is faster per hop but cheaper to scale only up to a depth limit, after which quality collapses fast.

What Supervisor Actually Costs You per Call

Take a representative CX call: the user asks to reschedule a delivery, the agent looks up the order, checks available windows, confirms the change, sends the customer a confirmation. Four tool calls, eight turns if you count the user messages.

In a supervisor topology that looks like:

Supervisor trace for reschedule:

```text
User        -> Supervisor   (parse intent)
Supervisor  -> OrderLookup  (get order 18833)
OrderLookup -> Supervisor   (returns order state)
Supervisor  -> Availability (get windows for 2026-04-25)
Availability-> Supervisor   (returns 3 windows)
Supervisor  -> User         (present options)
User        -> Supervisor   (picks window 2)
Supervisor  -> Scheduler    (commit change)
Scheduler   -> Supervisor   (success)
Supervisor  -> Notifier     (send SMS)
Notifier    -> Supervisor   (delivered)
Supervisor  -> User         (confirm)
```

Five supervisor turns, four worker turns. The supervisor's prompt is longer than any worker's because it carries the full conversation plus the specialist registry plus the routing rules. On Sonnet-class pricing that is roughly $0.03-0.04 extra per call compared to a swarm that did the same work with peer handoffs.

For a team doing 50,000 calls a month, that's $1,500-$2,000 of pure routing overhead. For a team doing 5 million calls a month, that's $150,000-$200,000. The cost scales linearly with volume, which is why at some point you stop tolerating it on the workflows where it isn't earning its keep.
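The scaling really is just multiplication, using the per-call figures above:

```python
# Monthly routing overhead at volume, using the per-call range from the text.
per_call_overhead = (0.03, 0.04)  # extra dollars of supervisor routing per call

for calls_per_month in (50_000, 5_000_000):
    lo = calls_per_month * per_call_overhead[0]
    hi = calls_per_month * per_call_overhead[1]
    print(f"{calls_per_month:>9,} calls/mo -> ${lo:,.0f}-${hi:,.0f}/mo of pure routing")
```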

What you get in return: every routing decision is logged with the full context that produced it. When a call goes wrong, you replay the supervisor's turns and you can literally see what it considered and what it picked. That's worth real money for the first year of production. It's still worth some money every year after, but the ratio shifts.

Where Swarm Actually Breaks

The 8-10 handoff cliff is not a folk theorem. It shows up on tau-bench and tau2-bench, and teams running swarm at real call volume report the same curve. Here's what happens structurally.

Each peer handoff reserializes the task. Peer A finishes, writes a message like "user wants to reschedule order 18833 to Friday afternoon, I've verified availability, please commit the change and notify them." Peer B receives that message, plus the system prompt that defines its role, plus whatever tools it has. It then tries to commit the change.

The problem is that "verified availability" is doing a lot of work in that handoff. What does verified mean? Did A check the full window object, or just that the slot existed? Does B need to re-check before committing? If B re-checks, it's now paying for work A already did. If B doesn't re-check, and A's check was stale by the time B acted, the whole call fails in a way that's hard to trace because the checking logic is spread across two agents with no shared record.

After 2-3 handoffs this is fine. After 5, you start seeing weird inconsistencies. After 8-10, the quality curve falls off a cliff. The swarm is doing exactly what it was told to do, but "what it was told" has been compressed and re-expanded enough times that the fidelity is gone.

You can push the cliff out by using structured state objects instead of free-text handoff messages. LangGraph's StateGraph model does this. Every peer reads and writes a typed dict, so "verified" becomes availability_checked: true, window_id: "slot_3", checked_at: 1714051200. That buys you maybe 2-4 more hops before the same pattern reemerges, because even structured state accumulates ambiguity when multiple peers can mutate the same fields.
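As a sketch, here is what that typed handoff looks like. The field names mirror the example above; the staleness check is a hypothetical policy of mine, not a LangGraph API:

```python
from typing import TypedDict

# Structured handoff state instead of a free-text message.
class HandoffState(TypedDict):
    order_id: str
    availability_checked: bool
    window_id: str
    checked_at: int  # unix seconds, so the next peer can judge staleness

state: HandoffState = {
    "order_id": "18833",
    "availability_checked": True,
    "window_id": "slot_3",
    "checked_at": 1714051200,
}

def is_stale(s: HandoffState, now: int, ttl_s: int = 300) -> bool:
    # The receiving peer decides explicitly whether to re-check,
    # instead of guessing what "verified availability" meant.
    return not s["availability_checked"] or now - s["checked_at"] > ttl_s
```

The point is not the TTL value; it is that the re-check decision becomes a field-level policy instead of an interpretation of prose.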

The related failure is what I've called the passing ships problem: two peers working on the same customer can't see each other. In a pure swarm that's architectural. The only fix is a shared canonical state, and once you have shared canonical state you have most of what a supervisor was giving you anyway, minus the routing.

The CX-Specific Failure Modes

The general rules apply. But CX has four specific failure modes where the pattern choice shows up sharper than the average multi-agent writeup captures.

Triage. The first hop routes the call to the right workflow: billing, refunds, scheduling, technical. Routing errors cost far more than token overhead. A triage-swarm peer that misroutes sends the call down the wrong tree and the mistake may not surface until hop four. A triage-supervisor reconsiders routing on every hop, because it's holding the full history. You want supervisor for triage, always, even if you swarm the rest.

Lookup. Parallel sub-agents are a real latency win if you can fan out. Check the order and the customer profile and the recent interaction history in parallel. Supervisor patterns do this naturally by issuing three concurrent tool calls. Swarm patterns do it by spawning three peers, each of which has to carry the full conversation context. You just doubled your context cost to save a routing call. For parallel lookups, supervisor is strictly cheaper.

Action. This is where swarm starts paying off. Refunds and scheduling are multi-step: verify, stage, commit, notify. If those four steps are a stable pipeline with well-defined handoff fields, a swarm runs them 20-40% faster because the supervisor isn't in the loop re-deciding. But the handoff fields have to actually be stable. If the refund flow sometimes needs to call back to triage because the policy is unclear, you've recreated the need for a supervisor and paid the swarm's tax on top.

Summary. Supervisor patterns waste tokens summarizing themselves. The supervisor has the whole history; generating a summary is just asking the LLM to compress what's already in its context. Swarm summary peers are lighter because they only see the final handoff, but they also see less. Neither pattern wins cleanly here. It depends on whether your summaries need to cite earlier turns. If yes, supervisor. If no, either.

The map that falls out: supervisor for triage and parallel lookup, swarm for stable multi-step actions, either for summary. Most CX systems benefit from a supervisor for the first hop and optionally a swarm for the action hop. That's a hybrid, and hybrids are what production looks like.

Monitoring Signals That Tell You to Switch

"Graduate a workflow to swarm when it needs to" is advice that means nothing without measurement. Three signals are enough.

Handoff depth p95. Track the depth of agent handoffs per workflow, not per call. If a workflow's p95 depth is creeping up over 6, the supervisor is doing a lot of routing that could probably be direct. If p95 is 3, don't touch it.

Supervisor-to-worker token ratio. Sum the supervisor's output tokens divided by the workers' output tokens, per workflow. In a healthy supervisor setup this sits around 0.2-0.3: the router is doing routing, not thinking. When it climbs past 0.4, the supervisor is reasoning about task state instead of delegating, which is the thing the workers should be doing. That's a signal either to beef up the workers or to move the workflow to a swarm where there's no central thinker.

Tool-call context loss rate. Run a sample of failed calls through a scorecard that asks: did the final tool call have access to context that existed earlier in the trace? If yes and it was used, pass. If yes and it was ignored, that's a supervisor problem. The router summarized away something that mattered. If no and the context didn't make it through a handoff, that's a swarm problem. The ratio tells you which pattern is leaking.

Two of three signals on the same workflow is enough to move. If only one trips, investigate first. Usually there's a prompt or a tool-definition bug that's cheaper to fix than a topology change.
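A minimal sketch of computing the first two signals from per-call trace records. The record shape here is hypothetical; your tracing layer will emit something different:

```python
import math

# Hypothetical per-call records emitted by a tracing layer.
calls = [
    {"workflow": "refund", "handoff_depth": 4, "supervisor_tokens": 300, "worker_tokens": 1400},
    {"workflow": "refund", "handoff_depth": 7, "supervisor_tokens": 900, "worker_tokens": 1800},
    {"workflow": "refund", "handoff_depth": 3, "supervisor_tokens": 250, "worker_tokens": 1200},
]

def p95(values: list[int]) -> int:
    # Nearest-rank percentile; fine for dashboard-grade telemetry.
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

depth_p95 = p95([c["handoff_depth"] for c in calls])
token_ratio = (sum(c["supervisor_tokens"] for c in calls)
               / sum(c["worker_tokens"] for c in calls))
print(depth_p95, round(token_ratio, 2))  # 7 0.33
```

With these toy numbers the workflow trips both thresholds: p95 depth over 6 and a token ratio drifting toward 0.4.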

| Signal | Supervisor healthy | Supervisor degrading | Swarm opportunity |
| --- | --- | --- | --- |
| Handoff depth p95 | 3-5 | 6-8 | > 6 with stable flow |
| Supervisor/worker token ratio | 0.2-0.3 | 0.3-0.4 | > 0.4 |
| Tool-call context loss | < 2% | 2-5% | Mostly at supervisor summarization |
| User-perceived p95 latency | Within SLA | 10-20% over | > 20% over |

The last row is the one that matters to customers. Token cost and telemetry are internal; latency is the thing they actually feel.

A Reference Architecture with LangGraph and MCP

Here's what the graduation path looks like in code. The shape is independent of the framework, but the LangGraph primitives make it concrete.

Start with a supervisor:

supervisor.py:

```python
import asyncio
from typing import TypedDict, Literal

from langgraph.graph import StateGraph, START, END
from langgraph.types import Command

# route_with_llm, classify_intent, action, summary, and mcp_session are
# application-specific pieces defined elsewhere; they stand in for your
# routing prompt, intent classifier, remaining nodes, and MCP client session.

class CXState(TypedDict):
    messages: list
    customer_id: str
    intent: str | None
    resolution: str | None

def supervisor(state: CXState) -> Command[Literal["triage", "lookup", "action", "summary", END]]:
    # Route based on current state. The supervisor LLM call happens here.
    next_node = route_with_llm(state)  # returns one of the literal values
    return Command(goto=next_node)

def triage(state: CXState) -> Command[Literal["supervisor"]]:
    intent = classify_intent(state["messages"])
    return Command(goto="supervisor", update={"intent": intent})

async def lookup(state: CXState) -> Command[Literal["supervisor"]]:
    # Fan out tool calls via MCP. The real MCP Python SDK is `session.call_tool(name, args)`;
    # parallelism comes from asyncio, not a magic `call_parallel` method.
    order, profile = await asyncio.gather(
        mcp_session.call_tool("get_order", {"customer_id": state["customer_id"]}),
        mcp_session.call_tool("get_profile", {"customer_id": state["customer_id"]}),
    )
    return Command(goto="supervisor", update={"messages": state["messages"] + [order, profile]})

# ... action, summary similar

graph = StateGraph(CXState)
graph.add_node("supervisor", supervisor)
graph.add_node("triage", triage)
graph.add_node("lookup", lookup)
graph.add_node("action", action)
graph.add_node("summary", summary)
graph.add_edge(START, "supervisor")
app = graph.compile()
```

This is a supervisor in the honest sense: every worker returns to the supervisor, and the supervisor re-decides. Handoff depth here is bounded by how many workers you visit, typically 3-5.

The graduation to swarm replaces the supervisor re-entry with direct peer handoffs, for the subgraph you're graduating:

swarm_subgraph.py:

```python
# When the action workflow stabilizes into verify -> stage -> commit -> notify,
# the supervisor hands the entire subgraph off once, and the peers handle the chain.
# Assumes CXState has been extended with `verified`, `staged`, and `committed`
# fields, and that the verify/stage/commit/notify helpers are defined elsewhere.

def action_verify(state: CXState) -> Command[Literal["action_stage"]]:
    verified = verify_refund(state)
    return Command(goto="action_stage", update={"verified": verified})

def action_stage(state: CXState) -> Command[Literal["action_commit"]]:
    staged = stage_refund(state)
    return Command(goto="action_commit", update={"staged": staged})

def action_commit(state: CXState) -> Command[Literal["action_notify"]]:
    committed = commit_refund(state)
    return Command(goto="action_notify", update={"committed": committed})

def action_notify(state: CXState) -> Command[Literal["supervisor"]]:
    send_notification(state)
    return Command(goto="supervisor", update={"resolution": "refund_complete"})
```

Two things are worth noticing. First, the swarm subgraph still terminates back at the supervisor. You haven't abandoned the supervisor. You've pruned its involvement in the middle of a stable pipeline. Second, the peers share state via MCP-backed tools, not via free-text messages. `verified`, `staged`, and `committed` are typed fields, which is what buys you the extra handoff depth before the cliff.

MCP's role here is understated but important. In a pure swarm, each peer is responsible for its own tool context. If commit_refund needs to know what stage_refund produced, either the state dict carries it (fragile, because every new field means updating every peer) or each peer re-queries (expensive). An MCP server backed by a shared cache gives the peers a single source of truth for tool state, which is the third option: let the tools remember.

The full production shape is a supervisor at the top with swarm subgraphs for workflows that have earned the graduation. Start everything as supervisor. Measure. Peel off the workflows where the telemetry says it's time.

What You Still Need to Build, and Where Chanl Fits

The framework gives you the primitives. What it doesn't give you:

  • Scorecards for handoff quality. Did each handoff carry enough context for the next peer to succeed? You author this as an LLM-judged prompt rubric with a free-form description and a score or boolean output. The rubric grades the whole transcript, so you phrase it to call out specific handoff boundaries ("when the refund agent took over, did it re-verify availability?"). Scorecards run this over production traces and flag the calls where the rubric failed.
  • Regression tests for routing. When the supervisor's prompt changes, does it still route the same calls to the same specialists? This is a replay problem. Scenarios let you lock in persona variants and replay them against a new prompt version, so you can eyeball routing drift by comparing the resulting traces. Scoring the drift automatically is something you wire on top with a scorecard; the platform doesn't diff traces for you out of the box.
  • Trace replay with the pattern diff. You need to compare supervisor-topology traces against swarm-topology traces for the same call, with the same customer state, to know whether the graduation actually paid off. Analytics is the dashboard layer; wiring handoff depth and supervisor-to-worker token ratio in alongside cost and resolution rate takes some custom event plumbing.
  • MCP-backed tool registry. The peers need a canonical source of tool state between handoffs. Whether you stand up your own MCP server or use an off-the-shelf MCP integration, the pattern has to exist. Free-text handoffs between peers will bleed context no matter how well you prompt them.
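To make the scorecard bullet concrete, here is the rough shape of a handoff-quality rubric. The schema and the judge function are illustrative stand-ins, not a platform API; a real scorecard would send the rubric and transcript to an LLM judge:

```python
# Illustrative rubric shape; field names are assumptions, not a platform schema.
handoff_rubric = {
    "name": "refund_handoff_context",
    "description": (
        "When the refund agent took over from triage, did it have and use "
        "the order id, the verified availability window, and the customer's "
        "chosen slot? Fail if any had to be re-derived or was ignored."
    ),
    "output": "boolean",  # pass/fail per transcript
}

def judge(transcript: str, rubric: dict) -> bool:
    # Stand-in for the LLM judge call so the sketch runs: a keyword check.
    # A real judge grades the whole transcript against rubric["description"].
    required = ("order", "window", "slot")
    return all(term in transcript.lower() for term in required)

print(judge("Refund agent received order 18833, window slot_3 as chosen.", handoff_rubric))
```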

The LangGraph + MCP reference above is yours to copy. The eval layer around it is the part that takes a quarter to build from scratch, and it's the part Chanl ships.

Pick Supervisor. Measure. Graduate When the Numbers Say So

The trap in most multi-agent discussions is that they frame the pattern choice as architectural personality. It isn't. It's an operational measurement. You pick supervisor because the 20-40% token tax buys you debuggability that is worth real money for the first year of production. You keep supervisor for as long as your handoff depth stays under 6, your supervisor-to-worker ratio stays under 0.4, and your latency stays within SLA. You graduate specific workflows to swarm when the telemetry says those constraints are breaking, and you graduate only those workflows, not the whole system.

Swarm isn't better. Supervisor isn't better. The workflow decides, and it tells you what it wants if you're measuring the right things.

Dean Grover, Co-founder. Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
