A team I spoke with recently shipped three new MCP servers in a single afternoon: their internal CRM, a knowledge base, and a billing system. The total number of tools available to the agent went from 47 to 105. The eval suite that had been passing at 92 percent the night before crashed to 61 percent the next morning. Nothing in the agent's prompt or model had changed.
This is the cliff everyone in the agent space talks about and almost nobody publishes the curve for. "More tools means worse routing" is the folk wisdom. The number we all want is: at what tool count does GPT-5 or Claude Opus 4.1 start mis-routing, and how steep is the fall?
Below is what the published benchmarks actually tell us, a measurement harness you can run on your own agent in an afternoon, and the scoping pattern that pulled the team above back to 89 percent without touching their agent prompt. Code in TypeScript and Python, real SDK calls, no pseudocode.
What the leaderboards already tell us
The Berkeley Function-Calling Leaderboard (BFCL) is the canonical public eval for tool calling. Version 4 evaluates models across single, parallel, multiple, and multi-turn function calls, with curated distractor functions designed to look plausible. As of early 2026, the top of the leaderboard reads as follows.
| Model | BFCL v4 overall accuracy | Rank |
|---|---|---|
| Claude Opus 4.1 | 70.36% | 2 |
| Claude Sonnet 4 | 70.29% | 3 |
| GPT-5 | 59.22% | 7 |
These numbers are aggregate. They do not stratify by tool count. ToolBench, the OpenBMB benchmark used at ICLR, leans the other way: it pulls 16,464 real-world APIs across 49 categories from RapidAPI and tests retrieval and selection at scale. Recent ToolComp results show average multi-step tool accuracy below 50 percent across leading models. Both benchmarks tell us the same thing in different framings: tool routing is not a solved problem, and the gap between models is a smaller signal than the gap between toolset sizes.
What no public benchmark publishes cleanly is the size-decay axis: hold the model fixed, vary tool count, plot accuracy. That's the experiment teams actually need.
A reproducible measurement harness
The methodology is straightforward. Start with a labeled eval set, where every prompt has a known-correct tool. For each prompt, synthesize distractor tools with similar names, overlapping descriptions, and slightly different parameter shapes. Then run the same prompts at toolset sizes of 1, 5, 10, 25, 50, 75, 100, and 105. Record top-1 routing accuracy.
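For concreteness, here is the shape of one labeled example this article assumes throughout. The prompt and the refund tool are hypothetical; the correct_tool follows the OpenAI function-definition shape (name, description, JSON Schema parameters) that the code below expects.

```python
# One hypothetical eval-set record: a prompt plus the tool it should route to.
eval_example = {
    "prompt": "Refund the last invoice for customer 8841 and email them a receipt.",
    "correct_tool": {
        "name": "issue_refund",
        "description": "Issue a full or partial refund for a specific invoice.",
        "parameters": {
            "type": "object",
            "properties": {
                "invoice_id": {"type": "string"},
                "amount": {"type": "number", "description": "Omit for a full refund."},
            },
            "required": ["invoice_id"],
        },
    },
}
```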
The hardest part is the distractors. Random tools are too easy to ignore. The model needs distractors that share vocabulary with the correct tool; otherwise you are measuring vocabulary recognition, not routing.
```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Given the canonical tool the prompt should route to, generate N
// near-miss distractors. Same domain, similar verbs, overlapping params.
export async function generateDistractors(
  correctTool: { name: string; description: string; parameters: object },
  n: number,
): Promise<Array<typeof correctTool>> {
  const res = await openai.chat.completions.create({
    model: "gpt-5",
    messages: [
      {
        role: "system",
        content:
          "Generate plausible-but-wrong tool definitions that share vocabulary with the canonical tool. Each must be syntactically valid and semantically distinct. Return JSON of the form {\"distractors\": [<tool>, ...]}.",
      },
      {
        role: "user",
        content: `Canonical: ${JSON.stringify(correctTool)}\nReturn ${n} distractors.`,
      },
    ],
    response_format: { type: "json_object" },
  });
  const parsed = JSON.parse(res.choices[0].message.content ?? "{}");
  return parsed.distractors;
}
```

That gets you a distractor pool. You will want to hand-review a sample of these. Generated distractors that drift outside the domain inflate accuracy artificially, and ones that are accidentally correct deflate it.
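Hand review scales poorly past a few hundred distractors, so it helps to automate the cheap part of the screen first. Below is a minimal sketch, not part of the harness itself: it only drops distractors with missing fields or name collisions against the canonical tool, and leaves the real judgment calls to a human.

```python
def screen_distractors(correct_tool: dict, distractors: list[dict]) -> list[dict]:
    """Drop obviously broken distractors before human review."""
    screened = []
    seen_names = {correct_tool["name"]}
    for d in distractors:
        # Must carry the three fields the eval loop expects.
        if not all(k in d for k in ("name", "description", "parameters")):
            continue
        # A name collision with the canonical tool (or another distractor)
        # makes top-1 grading ambiguous, so it gets dropped.
        if d["name"] in seen_names:
            continue
        seen_names.add(d["name"])
        screened.append(d)
    return screened
```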
The eval loop, in Python
With the distractor pool in hand, the loop is mechanical. For each (model, toolset_size) pair, sample distractors to bring the toolset up to size, send the prompt to the model, and check whether the tool name in the response matches the canonical one.
```python
import random
from typing import Any

from anthropic import Anthropic
from openai import OpenAI

# Two model clients. Same eval, two families. Add Gemini via google-genai if you want a third.
clients = {
    "gpt-5": OpenAI(),
    "claude-opus-4-1": Anthropic(),
}

TOOLSET_SIZES = [1, 5, 10, 25, 50, 75, 100, 105]


def run_one(
    model_id: str,
    prompt: str,
    correct_tool: dict[str, Any],
    distractors: list[dict[str, Any]],
    n_tools: int,
) -> bool:
    """Returns True if the model called the correct tool top-1."""
    # Always include the correct tool, fill the rest from distractors, then
    # shuffle so the correct tool gets no positional advantage.
    pool = [correct_tool] + distractors[: n_tools - 1]
    random.shuffle(pool)

    if model_id == "gpt-5":
        res = clients["gpt-5"].chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
            tools=[{"type": "function", "function": t} for t in pool],
            tool_choice="auto",
        )
        calls = res.choices[0].message.tool_calls or []
        return bool(calls) and calls[0].function.name == correct_tool["name"]

    if model_id == "claude-opus-4-1":
        res = clients["claude-opus-4-1"].messages.create(
            model=model_id,
            max_tokens=1024,
            # Anthropic expects input_schema rather than parameters, so the
            # OpenAI-style definitions get converted here.
            tools=[
                {
                    "name": t["name"],
                    "description": t["description"],
                    "input_schema": t["parameters"],
                }
                for t in pool
            ],
            messages=[{"role": "user", "content": prompt}],
        )
        # Claude returns content blocks; the first tool_use block is the call.
        for block in res.content:
            if block.type == "tool_use":
                return block.name == correct_tool["name"]
        return False

    raise ValueError(model_id)


def evaluate(eval_set: list[dict[str, Any]], distractor_pool: list[dict[str, Any]]):
    results: dict[tuple[str, int], list[bool]] = {}
    for model_id in clients:
        for n in TOOLSET_SIZES:
            outcomes = [
                run_one(model_id, ex["prompt"], ex["correct_tool"], distractor_pool, n)
                for ex in eval_set
            ]
            results[(model_id, n)] = outcomes
    return {k: sum(v) / len(v) for k, v in results.items()}
```

Run this with 200 prompts and the loop produces 16 accuracy numbers (2 models, 8 sizes). That's the input to the curve. A full pass at this size hits 3,200 calls; expect cost in the low hundreds of dollars depending on model mix, average prompt length, and how aggressively you cache tool definitions across calls. Run a 20-prompt smoke pass first and extrapolate before committing to a full 200-prompt run.
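Turning those 16 numbers into a curve you can eyeball takes a few lines of pivoting. The sketch below assumes the dictionary returned by evaluate() above; matplotlib is only used for the plot and nothing else depends on it.

```python
import matplotlib.pyplot as plt

def plot_curve(accuracy: dict[tuple[str, int], float]) -> None:
    """Plot top-1 routing accuracy against toolset size, one line per model."""
    models = sorted({model for model, _ in accuracy})
    for model in models:
        sizes = sorted(n for m, n in accuracy if m == model)
        plt.plot(sizes, [accuracy[(model, n)] for n in sizes], marker="o", label=model)
    plt.xlabel("Tools shown to the model")
    plt.ylabel("Top-1 routing accuracy")
    plt.ylim(0, 1)
    plt.legend()
    plt.savefig("routing_curve.png", dpi=150)
```

The week-over-week diff of that plot tends to be more informative than any single run, which is an argument for keeping the PNG alongside your other eval artifacts.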
What the curve actually looks like
Across BFCL stratifications, ToolBench retrieval-then-selection results, and what teams have shared informally, the shape comes out consistent across model families even when the absolute numbers differ.
| Toolset size | Routing accuracy (representative) |
|---|---|
| 1 to 10 tools | 95% to 98% |
| 25 tools | 88% to 93% |
| 50 tools | 78% to 85% |
| 75 tools | 65% to 75% |
| 100+ tools | 55% to 68% |
These ranges are illustrative composites of public BFCL and ToolBench results plus practitioner reports, not a single cited measurement. The shape is what matters: shallow decay through 25, moderate through 50, then a cliff. The cliff position varies by model and by how distractor-heavy the eval is, but the cliff itself is a real and reproducible artifact.
Microsoft Research named this failure mode "tool-space interference" in 2025 and drew a useful distinction: it is not the model being worse at hard tasks. A 50-tool agent can fail prompts that a 10-tool version of the same model handles flawlessly. The capability is there. The routing is what breaks.
Why the cliff happens
Three mechanisms compound. Once you can name them you can target the right fix.
First, name and description overlap. Two tools called get_user_profile and fetch_user_details will compete for the same prompts. The model resolves the ambiguity probabilistically, and as the toolset grows, the probability of any individual tool being chosen correctly drops mechanically. This is the dominant driver below 50 tools.
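To make the overlap concrete, here is a hypothetical pair of the kind that collide in practice, plus a crude audit that flags suspiciously similar name-and-description pairs across a toolset. The similarity measure is difflib from the standard library, and the 0.6 cutoff is a starting point rather than a calibrated threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Two hypothetical tools that will compete for the same prompts.
get_user_profile = {
    "name": "get_user_profile",
    "description": "Fetch the profile for a user by id, including name and email.",
}
fetch_user_details = {
    "name": "fetch_user_details",
    "description": "Retrieve details about a user account, including name and email.",
}

def similar_pairs(tools: list[dict], cutoff: float = 0.6) -> list[tuple[str, str, float]]:
    """Flag tool pairs whose name plus description read nearly the same."""
    flagged = []
    for a, b in combinations(tools, 2):
        score = SequenceMatcher(
            None,
            f"{a['name']} {a['description']}".lower(),
            f"{b['name']} {b['description']}".lower(),
        ).ratio()
        if score >= cutoff:
            flagged.append((a["name"], b["name"], round(score, 2)))
    return flagged
```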
Second, context pressure. EclipseSource and others have flagged that an MCP tool definition runs roughly 250 to 1,400 tokens depending on schema complexity, enums, and field descriptions; the GitHub MCP server alone has been reported at around 55,000 tokens across its 93 tools before any prompt is sent. Connect a few tool-heavy MCP servers and the schema overhead can claim a large fraction of the context window before the conversation starts. When tool definitions eat that much context, every other signal gets diluted. This is the dominant driver past 50 tools.
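You can estimate your own schema overhead before connecting anything. A rough sketch: serialize each tool definition and count tokens with tiktoken. The encoding name here is an assumption, and the exact cost depends on how the provider renders tool schemas into the prompt, so treat the number as a lower-bound estimate rather than what you will be billed.

```python
import json
import tiktoken

def toolset_token_estimate(tools: list[dict], encoding_name: str = "o200k_base") -> int:
    """Rough token cost of showing these tool definitions to the model."""
    enc = tiktoken.get_encoding(encoding_name)
    return sum(len(enc.encode(json.dumps(t, separators=(",", ":")))) for t in tools)
```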
Third, self-conditioning. If the model called a tool earlier in the conversation, it is more likely to keep calling tools. If it called tools in parallel, it is biased toward parallel calls again. As toolset size grows, the prior gets noisier, and self-conditioning amplifies whichever direction noise pushed the first call.
OpenAI's API enforces a hard cap of 128 functions per request, but treating the cap as a target is a mistake. The performance cliff sits well below it.
Recovery: scope the toolset to the turn
The right intervention is rarely "use a smarter model." It is "show the model fewer tools." The trick is doing this without losing capability, which means scoping toolsets by intent rather than removing them globally. A 105-tool agent split into four scoped toolsets of roughly 25 tools each, with the right one selected per turn, typically recovers routing accuracy to within 2 to 3 points of the small-toolset baseline. The agent loses nothing. The model just stops seeing irrelevant options.
This is where agent platforms earn their keep. We use Chanl for our own agents, and the pattern below is roughly what it looks like in code: define scoped toolsets up front, point the agent at one toolset per intent, replay the same eval, and grade routing as a structured rubric.
```typescript
import { ChanlSDK } from "@chanl/sdk";

const sdk = new ChanlSDK({ baseUrl: "https://api.chanl.ai", apiKey: process.env.CHANL_API_KEY! });

// 1. Carve the 105-tool flat agent into four intent-scoped toolsets.
const billing = await sdk.toolsets.create({ name: "billing", tools: BILLING_TOOL_IDS });
const account = await sdk.toolsets.create({ name: "account", tools: ACCOUNT_TOOL_IDS });
const support = await sdk.toolsets.create({ name: "support", tools: SUPPORT_TOOL_IDS });
const orders = await sdk.toolsets.create({ name: "orders", tools: ORDER_TOOL_IDS });

// 2. Replay the same eval against the agent for each scoped toolset, one
// at a time. Per-turn toolset switching is platform behavior; the SDK
// primitive is the toolset itself.
const runs = await sdk.scenarios.runAll({ agentId: AGENT_ID, minScore: 80 });

// 3. Grade routing as a rubric, not a single boolean. evaluate() takes the
// execution id from the run and a scorecardId.
const grades = await Promise.all(
  runs.results.map((r) =>
    sdk.scorecard.evaluate(r.executionId, { scorecardId: ROUTING_SCORECARD }),
  ),
);
```

The same pattern works without a platform, of course. You can hand-roll the scoping with a small intent classifier and two function-calling endpoints. The reason teams pay for the platform is that the scoping logic and the eval replay live in the same place as the agent, so when you change a tool description, the eval re-runs, and the routing scorecard tells you whether you regressed before the change ships. See scenarios for how the replay piece works, and scorecards for the rubric grading.
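A minimal sketch of that hand-rolled version, assuming four scoped tool lists of roughly 25 definitions each (BILLING_TOOLS and the related names are placeholders, not a real module): classify the turn with one cheap call, then make the real call with only that intent's tools in scope.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical scoped toolsets, roughly 25 tools each, keyed by intent.
TOOLSETS: dict[str, list[dict]] = {
    "billing": BILLING_TOOLS,
    "account": ACCOUNT_TOOLS,
    "support": SUPPORT_TOOLS,
    "orders": ORDER_TOOLS,
}

def route_turn(user_message: str):
    # Step 1: cheap intent classification with no tools in scope.
    intent_res = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {
                "role": "system",
                "content": "Classify the message into exactly one of: billing, account, support, orders. Reply with the label only.",
            },
            {"role": "user", "content": user_message},
        ],
    )
    intent = (intent_res.choices[0].message.content or "").strip().lower()
    scoped = TOOLSETS.get(intent, TOOLSETS["support"])  # fall back to a default scope

    # Step 2: the real call sees only the scoped toolset, never all 105 tools.
    return client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": user_message}],
        tools=[{"type": "function", "function": t} for t in scoped],
        tool_choice="auto",
    )
```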
A practical tool budget
A few rules of thumb worth posting in your team's docs.
Per turn, not per agent. Routing accuracy is a function of how many tools the model sees in the request, not how many your agent could theoretically call across all sessions. An agent with 200 tools and a 20-tool active scope will outperform an agent with 60 tools all flat.
Under 25 tools per turn for >90 percent routing. Past 25 you will need to invest in scoping or retrieval. Past 50 the cliff shows up. Past 75 you should assume close-to-random behavior on adversarial prompts and design accordingly.
Test the curve, don't trust the curve. The decay is shape-stable across models but the inflection point is workload-specific. The harness in this article is small enough to run weekly, and it should sit in CI alongside your unit tests. See tools for how scoped toolsets fit into the broader MCP story, and Tool explosion: managing 50 agent tools at scale for the qualitative companion to this post.
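One way to make "sit in CI" concrete is a regression gate around evaluate(): run the smoke-sized pass on every merge and fail the build if accuracy at your scoped toolset size drops below a floor. The harness module name, the 0.90 floor, and the 25-tool size below are illustrative choices, not recommendations from any benchmark.

```python
# test_tool_routing.py -- run with pytest. Assumes the harness above is
# importable as a module exposing evaluate() and a small smoke eval set.
from harness import evaluate, SMOKE_EVAL_SET, DISTRACTOR_POOL

ACCURACY_FLOOR = 0.90  # illustrative threshold, tune to your workload
SCOPED_SIZE = 25       # must be one of TOOLSET_SIZES in the harness

def test_routing_accuracy_at_scoped_size():
    accuracy = evaluate(SMOKE_EVAL_SET, DISTRACTOR_POOL)
    for model_id in ("gpt-5", "claude-opus-4-1"):
        assert accuracy[(model_id, SCOPED_SIZE)] >= ACCURACY_FLOOR, (
            f"{model_id} routing at {SCOPED_SIZE} tools fell below {ACCURACY_FLOOR:.0%}"
        )
```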
There are still missing pieces. Automatic intent-based scoping (sdk.toolsets.scope({ intent, history })) is something we want to ship and don't have yet, so today the routing classifier is yours to write. A built-in routing rubric for sdk.scorecard.evaluate would also save teams from designing one. Both are good product fits and both are coming.
The cliff is real, the fix is per-turn scoping, and the only proof that matters is the curve you measure on your own agent against your own distractor pool.
Test your tool routing before your customers do
Build, monitor, and stress-test agents with scoped toolsets, scenario replays, and routing scorecards in a single workspace.
Try Chanl

References

- Berkeley Function Calling Leaderboard V4 (Gorilla, UC Berkeley, 2025-2026)
- BFCL: From Tool Use to Agentic Evaluation (ICML 2025)
- Tool-Space Interference: An emerging problem for LLM agents (Microsoft Research, 2025)
- MCP and Context Overload: Why More Tools Make Your AI Agent Worse (EclipseSource, 2026)
- Function Calling and Agentic AI in 2025: Latest Benchmarks (Klavis, 2025)
- ToolBench / StableToolBench (OpenBMB, ICLR 2024)
- How many tools/functions can an AI Agent have? (Allen Chan, 2025)
- Why LLM agents break when you give them tools (terzioglub, DEV.to, 2025)
- Code Mode: give agents an entire API in 1,000 tokens (Cloudflare, 2026)