Your prototype works. Three agents in a demo, smoothly handing off tasks, producing polished results. Then you deploy it. Within a week, the system is hallucinating facts that no single agent would have produced alone, burning through token budgets at 3x the projected rate, and occasionally entering infinite loops that only crash when they hit the API rate limit.
You've walked into the 17x error trap.
An analysis published on Towards Data Science found that naive multi-agent architectures don't reduce errors through redundancy. They amplify them. When agents share state and operate without explicit coordination, their individual error rates compound exponentially. Ten agents, each 95% accurate on their own, produce a system that fails 40% of the time.
This article walks through exactly how that happens in the three most popular multi-agent frameworks: CrewAI, LangGraph, and Autogen. Each has distinct failure modes. Each has specific fixes. And for most production use cases, you probably don't need any of them.
Table of contents
- The math behind the 17x trap
- CrewAI: when roles become cages
- LangGraph: graph complexity explosion
- Autogen: the sycophancy problem
- What failures do CrewAI, LangGraph, and Autogen all share?
- What actually works: agent chaining
- How do you test multi-agent coordination before production?
- The decision that saves you six months
The math behind the 17x trap
The "17x" comes from measuring how shared state turns small per-agent errors into system-wide failures. A single agent with a 5% error rate is fine. Ten agents sharing state can push the system failure rate past 50%, roughly a 10x amplification, and retries on failure generate additional corrupted state that pushes it higher. The compounding math is straightforward: the probability that all N agents in a parallel system succeed is 0.95^N. With 10 agents, that's a 60% success rate. Flip it: 40% of requests hit at least one agent-level failure.
But the real damage isn't from independent failures. It's from state contamination. When Agent A hallucinates a customer ID and writes it to shared state, Agent B reads that ID as ground truth. Agent B's output, which references a customer that doesn't exist, feeds into Agent C's summarization step. By the time the response reaches the user, three agents have collaborated to produce a confidently wrong answer that none of them would have generated individually.
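The cascade is easy to reproduce in miniature. Here's a toy Python sketch of that exact scenario — three "agents" as plain functions passing contaminated shared state. All names and values are invented for illustration; this is not framework code:

```python
# Toy simulation of state contamination: each "agent" is a plain function
# that trusts whatever is already in shared state. All names are invented.

def agent_a(shared: dict) -> None:
    # Agent A hallucinates a customer ID and writes it to shared state.
    shared["customer_id"] = "CUST-99999"  # no such customer exists

def agent_b(shared: dict) -> None:
    # Agent B reads the ID as ground truth and builds on it.
    shared["order_summary"] = f"2 open orders for {shared['customer_id']}"

def agent_c(shared: dict) -> str:
    # Agent C summarizes the contaminated intermediate output.
    return f"Verified: {shared['order_summary']}"

shared_state: dict = {}
agent_a(shared_state)
agent_b(shared_state)
final_answer = agent_c(shared_state)
# final_answer confidently references a customer that never existed,
# and no single step looks wrong in isolation.
```

The fix isn't smarter agents; it's validating state at the boundary where Agent A's output becomes Agent B's input.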
// The compounding math: system failure grows fast with agent count
function systemFailureRate(agentErrorRate: number, agentCount: number): number {
  // Probability that at least one agent fails
  const independentFailure = 1 - Math.pow(1 - agentErrorRate, agentCount);
  // With shared state, errors cascade: each failure can corrupt
  // downstream agents, amplifying the effective error rate
  const cascadeFactor = 1 + (agentCount - 1) * 0.033; // empirical estimate
  return Math.min(independentFailure * cascadeFactor, 1.0);
}

// Individual agent: 5% error rate
// 3 agents, independent: 14% system failure
// 3 agents, shared state: 15% system failure
// 10 agents, independent: 40% system failure
// 10 agents, shared state: 52% system failure (the 17x trap)

A single agent's 5% error rate becomes a 52% system failure rate with 10 agents sharing state. That's roughly a 10x amplification, and the "17x" label comes from measuring real production systems where agents also retry on failure, generating additional corrupted state with each attempt.
GitHub's engineering team and multiple production analyses converge on the same conclusion: most teams don't need agents that collaborate. They need agents that each do one thing well, connected by deterministic routing.
CrewAI: when roles become cages
CrewAI's mental model is intuitive. You define roles (researcher, writer, manager), assign them tasks, and let a manager agent delegate. It's the fastest path from zero to working prototype.
The failure mode is equally intuitive: what happens when a customer request doesn't map cleanly to predefined roles?
# CrewAI crew that works in demos but fails on ambiguous requests
from crewai import Agent, Task, Crew, Process
manager = Agent(
    role="Customer Service Manager",
    goal="Route customer requests to the right specialist",
    backstory="You manage a team of customer service specialists.",
)

refund_specialist = Agent(
    role="Refund Specialist",
    goal="Process refunds accurately",
    backstory="You handle all refund-related requests.",
)

technical_support = Agent(
    role="Technical Support",
    goal="Resolve technical issues",
    backstory="You troubleshoot product and technical problems.",
)
# The problem: a customer says "I want a refund because the product
# is broken and I can't figure out how to reset it."
#
# Is this a refund? Technical support? Both?
# The manager agent must classify, but the request spans two roles.
# In practice, it picks one. The other half gets dropped.
crew = Crew(
    agents=[manager, refund_specialist, technical_support],
    tasks=[
        Task(
            description="Handle this customer request: {request}",
            agent=manager,
            expected_output="Delegated to the right specialist",
        )
    ],
    process=Process.hierarchical,
    manager_agent=manager,
)

The manager agent sees "refund" and "broken" in the same message. It picks the refund specialist. The technical troubleshooting question, the part where the customer actually needs help, vanishes. The customer gets a refund confirmation and still has a broken product.
CrewAI's second failure mode: infinite delegation. When the researcher agent produces output the writer agent considers insufficient, the writer asks for more research. The researcher asks for clarification on what's needed. Neither has a termination condition.
# Fix: add max_iter and early stopping to prevent infinite loops
crew = Crew(
    agents=[manager, refund_specialist, technical_support],
    tasks=[classify_task, execute_task],
    process=Process.hierarchical,
    manager_agent=manager,
    max_iter=3,    # Hard cap on delegation rounds
    verbose=True,  # See exactly where it loops
)

# Better fix: decompose the request BEFORE delegation
# Don't ask the manager to both classify AND route a multi-part request
decompose_task = Task(
    description="""Break this customer request into individual subtasks.
    Output a JSON array where each item has 'task' and 'specialist'.
    Request: {request}""",
    agent=manager,
    expected_output="JSON array of subtasks",
)

When CrewAI is the right choice: Tasks that decompose cleanly into roles. Content generation pipelines (research, draft, review). Data processing workflows where each stage is well-defined. If you can draw a clear org chart for your agents, CrewAI maps naturally.
When it's not: Any task where boundaries between roles are fuzzy. Customer service, where requests routinely span multiple domains. Anything where the routing decision itself requires the kind of reasoning you're delegating to specialists.
LangGraph: graph complexity explosion
LangGraph solves CrewAI's rigidity problem by giving you explicit control over execution flow. You define states, edges, and conditional routing as a directed graph. You get checkpointing, replay, and the ability to inspect exactly what happened at every node.
The tradeoff is graph complexity. With N agents, you can have up to N-squared edges. Every conditional edge is a routing decision made by an LLM, which means every edge is non-deterministic.
from langgraph.graph import StateGraph, END
from typing import TypedDict, Literal

class AgentState(TypedDict):
    messages: list
    current_intent: str
    refund_result: dict | None
    replacement_result: dict | None

def route_intent(state: AgentState) -> Literal["refund", "replacement", "escalate"]:
    """This routing decision is made by an LLM.
    Run it 10 times, you might get 3 different results."""
    last_message = state["messages"][-1]
    # The LLM classifies intent here. If the message is ambiguous,
    # this function returns different values on different runs.
    classification = classify_intent(last_message)
    return classification

graph = StateGraph(AgentState)
graph.add_node("classifier", classify_node)
graph.add_node("refund", refund_node)
graph.add_node("replacement", replacement_node)
graph.add_node("escalate", escalate_node)
graph.add_conditional_edges(
    "classifier",
    route_intent,
    {
        "refund": "refund",
        "replacement": "replacement",
        "escalate": "escalate",
    },
)
# Each node can also route to any other node...
# 4 nodes = up to 16 possible edges
# 8 nodes = up to 64 possible edges
# At some point you're debugging a maze, not a workflow

The non-deterministic routing problem. Run the same customer message through route_intent ten times. You'll get "refund" seven times, "escalate" twice, and "replacement" once. In production, this means identical customer requests get handled differently depending on which classification the LLM happens to produce.
# Fix: use structured output to force deterministic routing
from pydantic import BaseModel

class RoutingDecision(BaseModel):
    intent: Literal["refund", "replacement", "escalate"]
    confidence: float
    reasoning: str

def route_intent_structured(state: AgentState) -> str:
    """Structured output eliminates ambiguous classifications."""
    decision = llm.with_structured_output(RoutingDecision).invoke(
        f"Classify this customer intent: {state['messages'][-1]}"
    )
    # Low confidence? Don't guess. Escalate.
    if decision.confidence < 0.8:
        return "escalate"
    return decision.intent

When LangGraph is the right choice: Complex branching workflows where you need checkpointing and replay for debugging. Workflows with human-in-the-loop approval steps. Systems where you need to inspect and understand every decision the system made (compliance, audit trails).
When it's not: Simple A-then-B-then-C pipelines that don't need graph semantics. Prototyping, where the graph overhead slows iteration. Small teams that don't have time to debug N-squared edge interactions.
Autogen: the sycophancy problem
Autogen takes a different approach entirely. Instead of predefined roles or explicit graphs, agents have conversations. A group chat where multiple agents discuss, debate, and converge on a solution. It's the most natural pattern for research, code review, and any task where iteration improves output.
The production failure mode is subtle and dangerous: agents develop sycophantic agreement patterns.
import autogen

# Three code review agents in a group chat
senior_reviewer = autogen.AssistantAgent(
    name="Senior_Reviewer",
    system_message="""You are a senior code reviewer.
    Identify bugs, security issues, and performance problems.""",
)

security_reviewer = autogen.AssistantAgent(
    name="Security_Reviewer",
    system_message="""You are a security specialist.
    Focus on authentication, injection, and data exposure.""",
)

quality_reviewer = autogen.AssistantAgent(
    name="Quality_Reviewer",
    system_message="""You are a code quality expert.
    Check for maintainability, test coverage, and documentation.""",
)

# The group chat pattern: agents take turns reviewing
group_chat = autogen.GroupChat(
    agents=[senior_reviewer, security_reviewer, quality_reviewer],
    messages=[],
    max_round=10,
)

# The problem: Senior_Reviewer says "LGTM, no major issues."
# Security_Reviewer sees the senior's approval and responds:
# "I agree with Senior_Reviewer, the code looks secure."
# Quality_Reviewer follows: "Agreed, code quality is solid."
#
# All three approved a SQL injection vulnerability.
# None actually caught it because each deferred to the others.

This is hallucinated consensus. The agents aren't collaborating. They're performing agreement. Each one reads the prior agents' positive signals and adjusts toward approval. The result feels authoritative, three experts agreeing, but it's collectively wrong.
Autogen's second failure mode: unbounded conversation. Without explicit termination, agents keep talking. And because each turn replays the accumulated conversation history as prompt context, total token consumption grows quadratically with round count, not linearly. A 10-round group chat with three agents can easily consume 50,000 tokens per request.
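A back-of-envelope model shows the shape of that curve. The figure of ~100 tokens per message below is an illustrative assumption, not an Autogen measurement; the mechanism is that every turn re-sends the full history so far:

```python
def groupchat_token_estimate(rounds: int, agents: int,
                             tokens_per_message: int = 100) -> int:
    """Rough total token count for a group chat where each turn
    replays the full history so far as prompt context."""
    total = 0
    history = 0
    for _ in range(rounds * agents):           # one turn per agent per round
        total += history + tokens_per_message  # prompt = history, plus new message
        history += tokens_per_message          # history grows every turn
    return total

# 10 rounds x 3 agents at ~100 tokens/message lands near 47k tokens.
# Doubling the rounds roughly quadruples the total: that's the
# quadratic growth a max_round cap is protecting you from.
```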
# Fix: add an adversarial reviewer and explicit termination
adversarial_reviewer = autogen.AssistantAgent(
    name="Devils_Advocate",
    system_message="""You MUST find at least one issue with the code
    under review. If other reviewers approved it, examine their
    reasoning for gaps. You are not allowed to say 'LGTM'.""",
)

def termination_check(msg):
    """Stop when the adversarial reviewer has been heard
    and all agents have responded at least once."""
    return msg.get("name") == "Devils_Advocate" and "ISSUE:" in msg["content"]

group_chat = autogen.GroupChat(
    agents=[senior_reviewer, security_reviewer,
            quality_reviewer, adversarial_reviewer],
    messages=[],
    max_round=6,  # Hard cap on conversation length
    speaker_selection_method="round_robin",  # Deterministic ordering
)

manager = autogen.GroupChatManager(
    groupchat=group_chat,
    is_termination_msg=termination_check,
)

When Autogen is the right choice: Research workflows where multiple perspectives genuinely improve output. Document review where different reviewers catch different issues. Brainstorming, where the conversational pattern produces ideas that no single agent would generate.
When it's not: Any latency-sensitive application. Production customer service, where you need predictable response times. Cost-sensitive workloads, where unbounded token consumption is unacceptable.
What failures do CrewAI, LangGraph, and Autogen all share?
Three failure patterns appear in every multi-agent production system regardless of which framework you pick. Infinite loops, hallucinated consensus, and coordination overhead each surface differently in CrewAI, LangGraph, and Autogen, but the root causes are identical. Understanding these shared patterns matters more than picking the right framework.
Infinite loops
Agent A asks Agent B for clarification. Agent B asks Agent A for context. Neither has a termination condition. This happens in CrewAI crews, LangGraph cycles, and Autogen group chats. The surface looks different, but the root cause is always the same: two agents with overlapping responsibilities and no exit criteria.
The fix is structural, not configuration. Setting max_iter=10 is a band-aid. The real fix is ensuring no two agents have a mutual dependency where each can request work from the other. If Agent A can delegate to Agent B, then Agent B should not be able to delegate back to A.
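One way to enforce that structural rule is to treat "who can delegate to whom" as a graph and reject any configuration containing a cycle before deployment. A minimal sketch — the dict-of-lists graph format is an assumption for illustration, not any framework's API:

```python
def has_delegation_cycle(delegations: dict[str, list[str]]) -> bool:
    """DFS cycle detection over a 'can delegate to' graph.
    Keys are agents; values are the agents they may delegate to."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / finished
    color: dict[str, int] = {}

    def visit(agent: str) -> bool:
        color[agent] = GRAY
        for target in delegations.get(agent, []):
            state = color.get(target, WHITE)
            if state == GRAY:  # back edge: a delegation loop exists
                return True
            if state == WHITE and visit(target):
                return True
        color[agent] = BLACK
        return False

    return any(color.get(a, WHITE) == WHITE and visit(a) for a in delegations)

# Mutual dependency: the loop risk is structural, not a tuning problem.
risky = {"researcher": ["writer"], "writer": ["researcher"]}
# Strictly one-way delegation: no cycle is possible.
safe = {"manager": ["researcher", "writer"], "researcher": [], "writer": []}
```

Running a check like this in CI turns "we set max_iter and hope" into "this delegation graph cannot loop by construction."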
Hallucinated consensus
When multiple agents "agree" on a fact that none of them actually verified. This is especially dangerous because the multi-agent stamp of approval makes the output feel more trustworthy than a single agent's guess. Three agents confidently stating that a policy allows a refund is more convincing than one agent saying it, even when all three are wrong.
The fix: ground truth checkpoints. At least one agent in any multi-agent workflow must have access to actual data (a database, an API, a document store) rather than relying on its training data. When agents only talk to each other, they're doing collaborative hallucination.
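In code, a ground-truth checkpoint is just a verification step that prefers the data store over the agents' agreement. A sketch, with `policy_db` as a hypothetical stand-in for whatever database or API you actually query:

```python
# Sketch: verify an agreed-upon "fact" against real data before it
# enters shared context. 'policy_db' is a hypothetical stand-in.
policy_db = {"refund_window_days": 30}

def checkpoint_refund_window(claimed_days: int) -> tuple[int, bool]:
    """Return the verified value plus whether the agents' claim held up.
    Consensus among agents is never a substitute for this lookup."""
    actual = policy_db["refund_window_days"]
    return actual, claimed_days == actual

# Three agents "agreed" the refund window is 90 days:
verified_days, claim_ok = checkpoint_refund_window(90)
# verified_days is 30 and claim_ok is False: the checkpoint caught a
# collaborative hallucination that consensus alone would have shipped.
```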
Coordination overhead
The meta-work of routing, summarizing, and synchronizing between agents exceeds the actual task work. You built a multi-agent system to handle complex requests. But 60% of the tokens are spent on agents describing what they've done to other agents, summarizing conversations for the orchestrator, and negotiating who should handle the next subtask.
The fix: measure coordination cost explicitly. Track the ratio of "coordination tokens" (inter-agent communication) to "work tokens" (actual task execution). If coordination exceeds 40% of total token usage, your system would be cheaper and faster as a single agent with better tooling.
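Measuring that ratio only requires tagging each LLM call at the point where you already log token usage. A sketch of the bookkeeping — the log entry format here is an assumption, not a standard schema:

```python
def coordination_ratio(token_log: list[dict]) -> float:
    """Fraction of tokens spent on inter-agent meta-work.
    Each entry: {'kind': 'coordination' | 'work', 'tokens': int}."""
    coordination = sum(e["tokens"] for e in token_log if e["kind"] == "coordination")
    total = sum(e["tokens"] for e in token_log)
    return coordination / total if total else 0.0

log = [
    {"kind": "coordination", "tokens": 1200},  # manager routing + recap
    {"kind": "work", "tokens": 900},           # refund lookup
    {"kind": "coordination", "tokens": 600},   # handoff summary
    {"kind": "work", "tokens": 300},           # customer reply
]
# coordination_ratio(log) is 0.6: well past the 40% threshold where a
# single agent with better tooling is likely cheaper and faster.
```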
What actually works: agent chaining
For most production systems, the answer isn't a framework at all. It's sequential agent chaining: each agent does one thing, its output gets validated, and only then does the next agent run. No shared state, no negotiation, no LLM-based routing decisions. This pattern eliminates all three shared failure modes by design.
// Agent chaining: each agent does one thing, output is verified
// before passing to the next. No shared state, no negotiation.
interface ChainStep {
  agent: string;
  input: (context: Record<string, unknown>) => unknown;
  validate: (output: unknown) => boolean;
}

async function executeChain(
  steps: ChainStep[],
  initialContext: Record<string, unknown>
) {
  const context = { ...initialContext };
  for (const step of steps) {
    const input = step.input(context);
    const output = await agents[step.agent].execute(input);
    // Verify output before it becomes the next agent's input
    if (!step.validate(output)) {
      // Don't pass bad data downstream. Stop here.
      return { success: false, failedAt: step.agent, output };
    }
    context[step.agent] = output;
  }
  return { success: true, context };
}

// Usage: refund -> replacement -> scheduling
// Each step's output is validated before the next step runs.
// No agent can corrupt another agent's input.
const chain: ChainStep[] = [
  {
    agent: "refund",
    input: (ctx) => ctx.customerRequest,
    validate: (out: any) => out.refundId && out.amount > 0,
  },
  {
    agent: "replacement",
    input: (ctx) => ({
      request: ctx.customerRequest,
      refund: ctx.refund,
    }),
    validate: (out: any) => out.trackingNumber !== undefined,
  },
  {
    agent: "scheduling",
    input: (ctx) => ({
      request: ctx.customerRequest,
      refund: ctx.refund,
      replacement: ctx.replacement,
    }),
    validate: (out: any) => out.callbackTime !== undefined,
  },
];

This pattern eliminates all three shared failure modes. No infinite loops because the chain is strictly sequential. No hallucinated consensus because each agent operates independently. Minimal coordination overhead because the "routing" is just a function call, not an LLM classification.
The insight from GitHub's multi-agent engineering work and production analyses: you don't need agents that collaborate. You need agents that each do one thing well, connected by deterministic code with human-readable intermediate state.
How do you test multi-agent coordination before production?
Unit testing individual agents is straightforward, but testing how they interact is where teams struggle. The hardest coordination bugs only appear when agents share context, hand off mid-conversation, or encounter requests that span multiple agent domains. You can't unit-test your way to confidence here.
The approach that works: build adversarial test personas that specifically target the boundaries between agents.
// Create personas that stress-test agent handoff points
const adversarialPersonas = [
  {
    name: "Topic Switcher",
    description: "Starts asking about refunds, suddenly pivots to a " +
      "technical question, then back to refunds. Tests whether " +
      "context survives agent transitions.",
  },
  {
    name: "Contradiction Artist",
    description: "Says 'I want a refund' then two turns later says " +
      "'Actually, keep the product but send a replacement.' Tests " +
      "whether agents handle intent reversal or lock into the " +
      "first classification.",
  },
  {
    name: "Multi-Domain Questioner",
    description: "Asks a single question that spans billing, " +
      "technical support, and account management. 'Why was I " +
      "charged twice for a product that does not work and can " +
      "you reset my password while we are at it?' Tests whether " +
      "the system can decompose compound requests.",
  },
];

Running these personas as automated scenarios surfaces coordination failures before customers do. The Topic Switcher will expose systems that lose context during agent transitions. The Contradiction Artist will find systems that can't update their understanding mid-conversation. The Multi-Domain Questioner will break any routing system that assumes one intent per message.
After running scenarios, grade the results with coordination-specific scorecards:
// Score multi-agent conversations on coordination quality
const coordinationCriteria = [
  {
    name: "Goal Completion",
    description: "Did the system accomplish everything the " +
      "customer asked for? Partial credit if some but not all " +
      "subtasks were completed.",
    weight: 0.4,
  },
  {
    name: "No Redundant Work",
    description: "Did multiple agents perform the same task? " +
      "Did the customer have to repeat information that should " +
      "have been passed between agents?",
    weight: 0.2,
  },
  {
    name: "No Contradictions",
    description: "Did any agent's response contradict another " +
      "agent's response within the same conversation?",
    weight: 0.2,
  },
  {
    name: "Transition Smoothness",
    description: "When work moved from one agent to another, " +
      "was the customer aware of the handoff? Did context carry " +
      "over or was information lost?",
    weight: 0.2,
  },
];

The combination of adversarial personas and coordination scorecards catches the three shared failure modes before production. Infinite loops show up as conversations that hit your max turn limit. Hallucinated consensus shows up as high confidence scores on incorrect answers. Coordination overhead shows up as low goal completion despite high individual agent accuracy.
You can also use tool execution logs to trace exactly which tools each agent called and in what order, making it possible to reconstruct the full decision chain when something goes wrong. And production monitoring catches the failures that slip through testing: conversations where coordination breaks down on edge cases your personas didn't cover.
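The logging itself can be as simple as a wrapper around each tool function. A sketch of that idea — the wrapper and tool names are illustrative, not a specific framework's API:

```python
from typing import Any, Callable

# Ordered trace of (agent, tool, arguments): enough to reconstruct the
# full decision chain after a failure. All names here are illustrative.
trace: list[tuple[str, str, dict]] = []

def traced(agent: str, tool_name: str, fn: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a tool so every call is appended to the trace before it runs."""
    def wrapper(**kwargs: Any) -> Any:
        trace.append((agent, tool_name, dict(kwargs)))
        return fn(**kwargs)
    return wrapper

lookup_order = traced("refund_agent", "lookup_order",
                      lambda order_id: {"order_id": order_id, "total": 25.0})
issue_refund = traced("refund_agent", "issue_refund",
                      lambda order_id, amount: {"refunded": amount})

order = lookup_order(order_id="A-1001")
issue_refund(order_id="A-1001", amount=order["total"])
# trace now shows, in order, exactly which tools ran and with what inputs.
```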
The decision that saves you six months
Here's the decision process that would have saved us, and most teams we talk to, about six months of debugging production multi-agent failures.
Most production systems land on "agent chaining with validation." It's boring. It's predictable. It doesn't have a conference talk about it. But it handles 80% of multi-agent use cases without any of the three shared failure modes.
When you do need a framework, pick based on your failure tolerance:
| Factor | CrewAI | LangGraph | Autogen |
|---|---|---|---|
| Learning curve | Low | High | Medium |
| Debugging | Limited | Excellent (replay) | Moderate |
| Latency control | Low | High | Low |
| Token cost control | Medium | High | Low (unbounded risk) |
| Best for | Role-based tasks | Complex branching | Review/debate |
| Worst for | Ambiguous routing | Simple pipelines | Latency-sensitive |
Start with the simplest architecture that could work. A single agent with good tools handles more than you think. Agent chaining handles most of what's left. Reach for a framework only when you have a concrete problem that chaining can't solve.
The 17x error trap catches teams who start with the framework and work backward to the problem. Start with the problem. Most of the time, the answer is simpler than you expect.