A team I spoke with recently shipped three new MCP servers in a single afternoon: their internal CRM, a knowledge base, and a billing system. The total number of tools available to the agent went from 47 to 105. The eval suite that had been passing at 92 percent the night before crashed to 61 percent the next morning. Nothing in the agent's prompt or model had changed.
This is the cliff everyone in the agent space talks about and almost nobody publishes the curve for. "More tools means worse routing" is the folk wisdom. The number we all want is: at what tool count does GPT-5 or Claude Opus 4.1 start mis-routing, and how steep is the fall?
Below is what the published benchmarks actually tell us, a measurement harness you can run on your own agent in an afternoon, and the scoping pattern that pulled the team above back to 89 percent without touching their agent prompt. Code in TypeScript and Python, real SDK calls, no pseudocode.
What the leaderboards already tell us
The Berkeley Function-Calling Leaderboard (BFCL) is the canonical public eval for tool calling. Version 4 evaluates models across single, parallel, multiple, and multi-turn function calls, with curated distractor functions designed to look plausible. As of early 2026, the top of the leaderboard reads as follows.
| Model | BFCL v4 overall accuracy | Rank |
|---|---|---|
| Claude Opus 4.1 | 70.36% | 2 |
| Claude Sonnet 4 | 70.29% | 3 |
| GPT-5 | 59.22% | 7 |
These numbers are aggregate. They do not stratify by tool count. ToolBench, the OpenBMB benchmark used at ICLR, leans the other way: it pulls 16,464 real-world APIs across 49 categories from RapidAPI and tests retrieval and selection at scale. Recent ToolComp results show average multi-step tool accuracy below 50 percent across leading models. Both benchmarks tell us the same thing in different framings: tool routing is not a solved problem, and the gap between models is a smaller signal than the gap between toolset sizes.
What no public benchmark publishes cleanly is the size-decay axis: hold the model fixed, vary tool count, plot accuracy. That's the experiment teams actually need.
A reproducible measurement harness
The methodology is straightforward. Start with a labeled eval set, where every prompt has a known-correct tool. For each prompt, synthesize distractor tools with similar names, overlapping descriptions, and slightly different parameter shapes. Then run the same prompts at toolset sizes of 1, 5, 10, 25, 50, 75, 100, and 105. Record top-1 routing accuracy.
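For concreteness, here is the shape of one labeled example this article assumes throughout. The prompt and the refund tool are hypothetical; the correct_tool follows the OpenAI function-definition shape (name, description, JSON Schema parameters) that the code below expects.

```python
# One hypothetical eval-set record: a prompt plus the tool it should route to.
eval_example = {
    "prompt": "Refund the last invoice for customer 8841 and email them a receipt.",
    "correct_tool": {
        "name": "issue_refund",
        "description": "Issue a full or partial refund for a specific invoice.",
        "parameters": {
            "type": "object",
            "properties": {
                "invoice_id": {"type": "string"},
                "amount": {"type": "number", "description": "Omit for a full refund."},
            },
            "required": ["invoice_id"],
        },
    },
}
```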
The hardest part is the distractors. Random tools are too easy to ignore. The model needs distractors that share vocabulary with the correct tool; otherwise you are measuring vocabulary recognition, not routing.
```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Given the canonical tool the prompt should route to, generate N
// near-miss distractors. Same domain, similar verbs, overlapping params.
export async function generateDistractors(
  correctTool: { name: string; description: string; parameters: object },
  n: number,
): Promise<Array<typeof correctTool>> {
  const res = await openai.chat.completions.create({
    model: "gpt-5",
    messages: [
      {
        role: "system",
        content:
          "Generate plausible-but-wrong tool definitions that share vocabulary with the canonical tool. Each must be syntactically valid and semantically distinct. Return JSON of the form {\"distractors\": [<tool>, ...]}.",
      },
      {
        role: "user",
        content: `Canonical: ${JSON.stringify(correctTool)}\nReturn ${n} distractors.`,
      },
    ],
    response_format: { type: "json_object" },
  });
  const parsed = JSON.parse(res.choices[0].message.content ?? "{}");
  return parsed.distractors;
}
```

That gets you a distractor pool. You will want to hand-review a sample of these. Generated distractors that drift outside the domain inflate accuracy artificially, and ones that are accidentally correct deflate it.
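Hand review scales poorly past a few hundred distractors, so it helps to automate the cheap part of the screen first. Below is a minimal sketch, not part of the harness itself: it only drops distractors with missing fields or name collisions against the canonical tool, and leaves the real judgment calls to a human.

```python
def screen_distractors(correct_tool: dict, distractors: list[dict]) -> list[dict]:
    """Drop obviously broken distractors before human review."""
    screened = []
    seen_names = {correct_tool["name"]}
    for d in distractors:
        # Must carry the three fields the eval loop expects.
        if not all(k in d for k in ("name", "description", "parameters")):
            continue
        # A name collision with the canonical tool (or another distractor)
        # makes top-1 grading ambiguous, so it gets dropped.
        if d["name"] in seen_names:
            continue
        seen_names.add(d["name"])
        screened.append(d)
    return screened
```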
The eval loop, in Python
With the distractor pool in hand, the loop is mechanical. For each (model, toolset_size) pair, sample distractors to bring the toolset up to size, send the prompt to the model, and check whether the tool name in the response matches the canonical one.
```python
import random
from typing import Any

from anthropic import Anthropic
from openai import OpenAI

# Two model clients. Same eval, two families. Add Gemini via google-genai if you want a third.
clients = {
    "gpt-5": OpenAI(),
    "claude-opus-4-1": Anthropic(),
}

TOOLSET_SIZES = [1, 5, 10, 25, 50, 75, 100, 105]


def run_one(
    model_id: str,
    prompt: str,
    correct_tool: dict[str, Any],
    distractors: list[dict[str, Any]],
    n_tools: int,
) -> bool:
    """Returns True if the model called the correct tool top-1."""
    # Always include the correct tool, fill the rest from distractors, then
    # shuffle so the correct tool gets no positional advantage.
    pool = [correct_tool] + distractors[: n_tools - 1]
    random.shuffle(pool)

    if model_id == "gpt-5":
        res = clients["gpt-5"].chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
            tools=[{"type": "function", "function": t} for t in pool],
            tool_choice="auto",
        )
        calls = res.choices[0].message.tool_calls or []
        return bool(calls) and calls[0].function.name == correct_tool["name"]

    if model_id == "claude-opus-4-1":
        res = clients["claude-opus-4-1"].messages.create(
            model=model_id,
            max_tokens=1024,
            # Anthropic expects input_schema rather than parameters, so the
            # OpenAI-style definitions get converted here.
            tools=[
                {
                    "name": t["name"],
                    "description": t["description"],
                    "input_schema": t["parameters"],
                }
                for t in pool
            ],
            messages=[{"role": "user", "content": prompt}],
        )
        # Claude returns content blocks; the first tool_use block is the call.
        for block in res.content:
            if block.type == "tool_use":
                return block.name == correct_tool["name"]
        return False

    raise ValueError(model_id)


def evaluate(eval_set: list[dict[str, Any]], distractor_pool: list[dict[str, Any]]):
    results: dict[tuple[str, int], list[bool]] = {}
    for model_id in clients:
        for n in TOOLSET_SIZES:
            outcomes = [
                run_one(model_id, ex["prompt"], ex["correct_tool"], distractor_pool, n)
                for ex in eval_set
            ]
            results[(model_id, n)] = outcomes
    return {k: sum(v) / len(v) for k, v in results.items()}
```

Run this with 200 prompts and the loop produces 16 accuracy numbers (2 models, 8 sizes). That's the input to the curve. A full pass at this size hits 3,200 calls; expect cost in the low hundreds of dollars depending on model mix, average prompt length, and how aggressively you cache tool definitions across calls. Run a 20-prompt smoke pass first and extrapolate before committing to a full 200-prompt run.
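Turning those 16 numbers into a curve you can eyeball takes a few lines of pivoting. The sketch below assumes the dictionary returned by evaluate() above; matplotlib is only used for the plot and nothing else depends on it.

```python
import matplotlib.pyplot as plt

def plot_curve(accuracy: dict[tuple[str, int], float]) -> None:
    """Plot top-1 routing accuracy against toolset size, one line per model."""
    models = sorted({model for model, _ in accuracy})
    for model in models:
        sizes = sorted(n for m, n in accuracy if m == model)
        plt.plot(sizes, [accuracy[(model, n)] for n in sizes], marker="o", label=model)
    plt.xlabel("Tools shown to the model")
    plt.ylabel("Top-1 routing accuracy")
    plt.ylim(0, 1)
    plt.legend()
    plt.savefig("routing_curve.png", dpi=150)
```

The week-over-week diff of that plot tends to be more informative than any single run, which is an argument for keeping the PNG alongside your other eval artifacts.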
What the curve actually looks like
Across BFCL stratifications, ToolBench retrieval-then-selection results, and what teams have shared informally, the shape comes out consistent across model families even when the absolute numbers differ.
| Toolset size | Routing accuracy (representative) |
|---|---|
| 1 to 10 tools | 95% to 98% |
| 25 tools | 88% to 93% |
| 50 tools | 78% to 85% |
| 75 tools | 65% to 75% |
| 100+ tools | 55% to 68% |
These ranges are illustrative composites of public BFCL and ToolBench results plus practitioner reports, not a single cited measurement. The shape is what matters: shallow decay through 25, moderate through 50, then a cliff. The cliff position varies by model and by how distractor-heavy the eval is, but the cliff itself is a real and reproducible artifact.
Microsoft Research named this failure mode "tool-space interference" in 2025 and drew a useful distinction: it is not the model being worse at hard tasks. A 50-tool agent can fail prompts that a 10-tool version of the same model handles flawlessly. The capability is there. The routing is what breaks.
Why the cliff happens
Three mechanisms compound. Once you can name them you can target the right fix.
First, name and description overlap. Two tools called get_user_profile and fetch_user_details will compete for the same prompts. The model resolves the ambiguity probabilistically, and as the toolset grows, the probability of any individual tool being chosen correctly drops mechanically. This is the dominant driver below 50 tools.
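To make the overlap concrete, here is a hypothetical pair of the kind that collide in practice, plus a crude audit that flags suspiciously similar name-and-description pairs across a toolset. The similarity measure is difflib from the standard library, and the 0.6 cutoff is a starting point rather than a calibrated threshold.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Two hypothetical tools that will compete for the same prompts.
get_user_profile = {
    "name": "get_user_profile",
    "description": "Fetch the profile for a user by id, including name and email.",
}
fetch_user_details = {
    "name": "fetch_user_details",
    "description": "Retrieve details about a user account, including name and email.",
}

def similar_pairs(tools: list[dict], cutoff: float = 0.6) -> list[tuple[str, str, float]]:
    """Flag tool pairs whose name plus description read nearly the same."""
    flagged = []
    for a, b in combinations(tools, 2):
        score = SequenceMatcher(
            None,
            f"{a['name']} {a['description']}".lower(),
            f"{b['name']} {b['description']}".lower(),
        ).ratio()
        if score >= cutoff:
            flagged.append((a["name"], b["name"], round(score, 2)))
    return flagged
```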
Second, context pressure. EclipseSource and others have flagged that an MCP tool definition runs roughly 250 to 1,400 tokens depending on schema complexity, enums, and field descriptions; the GitHub MCP server alone has been reported at around 55,000 tokens across its 93 tools before any prompt is sent. Connect a few tool-heavy MCP servers and the schema overhead can claim a large fraction of the context window before the conversation starts. When tool definitions eat that much context, every other signal gets diluted. This is the dominant driver past 50 tools.
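You can estimate your own schema overhead before connecting anything. A rough sketch: serialize each tool definition and count tokens with tiktoken. The encoding name here is an assumption, and the exact cost depends on how the provider renders tool schemas into the prompt, so treat the number as a lower-bound estimate rather than what you will be billed.

```python
import json
import tiktoken

def toolset_token_estimate(tools: list[dict], encoding_name: str = "o200k_base") -> int:
    """Rough token cost of showing these tool definitions to the model."""
    enc = tiktoken.get_encoding(encoding_name)
    return sum(len(enc.encode(json.dumps(t, separators=(",", ":")))) for t in tools)
```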
Third, self-conditioning. If the model called a tool earlier in the conversation, it is more likely to keep calling tools. If it called tools in parallel, it is biased toward parallel calls again. As toolset size grows, the prior gets noisier, and self-conditioning amplifies whichever direction noise pushed the first call.
OpenAI's API enforces a hard cap of 128 functions per request, but treating the cap as a target is a mistake. The performance cliff sits well below it.
Recovery: scope the toolset to the turn
The right intervention is rarely "use a smarter model." It is "show the model fewer tools." The trick is doing this without losing capability, which means scoping toolsets by intent rather than removing them globally. A 105-tool agent split into four scoped toolsets of roughly 25 tools each, with the right one selected per turn, typically recovers routing accuracy to within 2 to 3 points of the small-toolset baseline. The agent loses nothing. The model just stops seeing irrelevant options.
This is where agent platforms earn their keep. We use Chanl for our own agents, and the pattern below is roughly what it looks like in code: define scoped toolsets up front, point the agent at one toolset per intent, replay the same eval, and grade routing as a structured rubric.
```typescript
import { ChanlSDK } from "@chanl/sdk";

const sdk = new ChanlSDK({ baseUrl: "https://api.chanl.ai", apiKey: process.env.CHANL_API_KEY! });

// 1. Carve the 105-tool flat agent into four intent-scoped toolsets.
const billing = await sdk.toolsets.create({ name: "billing", tools: BILLING_TOOL_IDS });
const account = await sdk.toolsets.create({ name: "account", tools: ACCOUNT_TOOL_IDS });
const support = await sdk.toolsets.create({ name: "support", tools: SUPPORT_TOOL_IDS });
const orders = await sdk.toolsets.create({ name: "orders", tools: ORDER_TOOL_IDS });

// 2. Replay the same eval against the agent for each scoped toolset, one
// at a time. Per-turn toolset switching is platform behavior; the SDK
// primitive is the toolset itself.
const runs = await sdk.scenarios.runAll({ agentId: AGENT_ID, minScore: 80 });

// 3. Grade routing as a rubric, not a single boolean. evaluate() takes the
// execution id from the run and a scorecardId.
const grades = await Promise.all(
  runs.results.map((r) =>
    sdk.scorecard.evaluate(r.executionId, { scorecardId: ROUTING_SCORECARD }),
  ),
);
```

The same pattern works without a platform, of course. You can hand-roll the scoping with a small intent classifier and two function-calling endpoints. The reason teams pay for the platform is that the scoping logic and the eval replay live in the same place as the agent, so when you change a tool description, the eval re-runs, and the routing scorecard tells you whether you regressed before the change ships. See scenarios for how the replay piece works, and scorecards for the rubric grading.
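A minimal sketch of that hand-rolled version, assuming four scoped tool lists of roughly 25 definitions each (BILLING_TOOLS and the related names are placeholders, not a real module): classify the turn with one cheap call, then make the real call with only that intent's tools in scope.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical scoped toolsets, roughly 25 tools each, keyed by intent.
TOOLSETS: dict[str, list[dict]] = {
    "billing": BILLING_TOOLS,
    "account": ACCOUNT_TOOLS,
    "support": SUPPORT_TOOLS,
    "orders": ORDER_TOOLS,
}

def route_turn(user_message: str):
    # Step 1: cheap intent classification with no tools in scope.
    intent_res = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {
                "role": "system",
                "content": "Classify the message into exactly one of: billing, account, support, orders. Reply with the label only.",
            },
            {"role": "user", "content": user_message},
        ],
    )
    intent = (intent_res.choices[0].message.content or "").strip().lower()
    scoped = TOOLSETS.get(intent, TOOLSETS["support"])  # fall back to a default scope

    # Step 2: the real call sees only the scoped toolset, never all 105 tools.
    return client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": user_message}],
        tools=[{"type": "function", "function": t} for t in scoped],
        tool_choice="auto",
    )
```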
A practical tool budget
A few rules of thumb worth posting in your team's docs.
Per turn, not per agent. Routing accuracy is a function of how many tools the model sees in the request, not how many your agent could theoretically call across all sessions. An agent with 200 tools and a 20-tool active scope will outperform an agent with 60 tools all flat.
Under 25 tools per turn for >90 percent routing. Past 25 you will need to invest in scoping or retrieval. Past 50 the cliff shows up. Past 75 you should assume close-to-random behavior on adversarial prompts and design accordingly.
Test the curve, don't trust the curve. The decay is shape-stable across models but the inflection point is workload-specific. The harness in this article is small enough to run weekly, and it should sit in CI alongside your unit tests. See tools for how scoped toolsets fit into the broader MCP story, and Tool explosion: managing 50 agent tools at scale for the qualitative companion to this post.
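One way to make "sit in CI" concrete is a regression gate around evaluate(): run the smoke-sized pass on every merge and fail the build if accuracy at your scoped toolset size drops below a floor. The harness module name, the 0.90 floor, and the 25-tool size below are illustrative choices, not recommendations from any benchmark.

```python
# test_tool_routing.py -- run with pytest. Assumes the harness above is
# importable as a module exposing evaluate() and a small smoke eval set.
from harness import evaluate, SMOKE_EVAL_SET, DISTRACTOR_POOL

ACCURACY_FLOOR = 0.90  # illustrative threshold, tune to your workload
SCOPED_SIZE = 25       # must be one of TOOLSET_SIZES in the harness

def test_routing_accuracy_at_scoped_size():
    accuracy = evaluate(SMOKE_EVAL_SET, DISTRACTOR_POOL)
    for model_id in ("gpt-5", "claude-opus-4-1"):
        assert accuracy[(model_id, SCOPED_SIZE)] >= ACCURACY_FLOOR, (
            f"{model_id} routing at {SCOPED_SIZE} tools fell below {ACCURACY_FLOOR:.0%}"
        )
```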
There are still missing pieces. Automatic intent-based scoping (sdk.toolsets.scope({ intent, history })) is something we want to ship and don't have yet, so today the routing classifier is yours to write. A built-in routing rubric for sdk.scorecard.evaluate would also save teams from designing one. Both are good product fits and both are coming.
The cliff is real, the fix is per-turn scoping, and the only proof that matters is the curve you measure on your own agent against your own distractor pool.
Test your tool routing before your customers do
Build, monitor, and stress-test agents with scoped toolsets, scenario replays, and routing scorecards in a single workspace.
Try Chanl

References

- Berkeley Function Calling Leaderboard V4 (Gorilla, UC Berkeley, 2025-2026)
- BFCL: From Tool Use to Agentic Evaluation (ICML 2025)
- Tool-Space Interference: An emerging problem for LLM agents (Microsoft Research, 2025)
- MCP and Context Overload: Why More Tools Make Your AI Agent Worse (EclipseSource, 2026)
- Function Calling and Agentic AI in 2025: Latest Benchmarks (Klavis, 2025)
- ToolBench / StableToolBench (OpenBMB, ICLR 2024)
- How many tools/functions can an AI Agent have? (Allen Chan, 2025)
- Why LLM agents break when you give them tools (terzioglub, DEV.to, 2025)
- Code Mode: give agents an entire API in 1,000 tokens (Cloudflare, 2026)