
How MCP Tool Descriptions Break Your Agent

New research shows 97% of MCP tool descriptions have quality issues that hurt agent accuracy. Here's what the smells look like, why they matter, and how to fix them.

Dean Grover, Co-founder
May 13, 2026

Your MCP server is deployed. Your agent connects. Tool calls start firing, most of them correct, some of them baffling. The agent calls getCustomerProfile when it should have called getAccountHistory. It passes an account ID where an email address was expected. It invokes a knowledge base lookup even though the answer was already in context.

You check the server logs. No errors. You check the model output. The reasoning looks plausible. Then you read your tool description and see it: a single sentence written for a developer reading a README, not for a model deciding which of twelve tools to call in 200 milliseconds.

A February 2026 study examined 856 tools across 103 MCP servers and found that 97.1% contained at least one description quality issue. The most common problem wasn't broken schemas or missing fields. It was descriptions that technically exist but give an AI agent the wrong signals at selection time.

Here's what those problems look like, why they matter for accuracy, and how to fix them without adding noise.

Why Tool Descriptions Function as an Agent's Routing Table

An agent with access to tools operates like a packet router. When a user sends a request, the model selects which tool handles it, extracts the right parameters, and decides whether to call at all. It makes that decision without ever seeing your code. All it has is your text.

Tool descriptions are the mechanism agents use to understand tool semantics at runtime. Unlike developers who can hover over an IDE tooltip or read a GitHub README, agents have one shot: the name, description, and parameter schema you registered at configuration time. Get the routing table right and the agent routes correctly. Get it wrong and you get the equivalent of dropped packets: wrong tool calls, missing parameters, or a tool invoked in a context it wasn't designed for.

The MCP specification defines the structure of a tool definition (name, description, inputSchema). It doesn't define what makes a good description. That gap is where most agent failures hide.

The problem compounds with scale. With two tools, a vague description causes occasional confusion. With twenty, it causes systematic misrouting. The agents that fall apart at scale aren't failing because the model is bad. They're failing because the routing table is underspecified.

[Figure: How agents use tool descriptions to make selection decisions. Flowchart nodes: User Request; Agent Reads All Descriptions; Which tool matches?; Triggering conditions match? (Yes/No); Check parameter schema; Skip tool; Parameters clear? (Yes/No); Call tool with arguments; Guess or hallucinate values; Try next tool or respond without tools.]

The Three Smells That Break Agent Selection

The February 2026 study borrowed the term "smells" from software code quality literature: patterns that aren't bugs themselves but reliably signal a deeper problem. Three dominated the dataset: Unstated Limitations (missing boundary conditions), Missing Usage Guidelines (no explicit trigger criteria), and Opaque Parameters (unclear input field intent). Each one degrades agent selection in a different way.

Unstated Limitations (89.8% of tools)

This is the most widespread issue. A description explains what the tool does but says nothing about when it shouldn't be used. The agent has no signal for when to route elsewhere.

Bad: No boundary conditions·json
{
  "name": "getOrderHistory",
  "description": "Returns order history for a customer."
}

From the agent's perspective, this description triggers for any order-related query. There's nothing ruling out pending orders, wholesale accounts, or unauthenticated sessions. When multiple tools partially match a query, the agent has no basis for disambiguation.

Better: Explicit limitations·json
{
  "name": "getOrderHistory",
  "description": "Returns completed and shipped orders for a verified retail customer account. Use only after customer identity is confirmed in this session. Does not include pending, cancelled, or wholesale orders. For wholesale accounts, use getWholesaleOrderHistory instead."
}

The "Does not include" and "For X accounts, use Y instead" phrases do the disambiguation work. Agents are surprisingly responsive to these constraints: they treat them as decision rules for routing, not as human-readable caveats.

Two patterns that work well for unstated limitations:

  • Scope boundaries: "Only for orders placed in the last 90 days."
  • Mutual exclusion pointers: "For subscription products, use getSubscriptionHistory instead."

Missing Usage Guidelines (89.3% of tools)

The second smell. The description explains the output but not the when. The agent understands what the tool produces but has no trigger conditions for calling it.

Bad: Output-focused, no trigger·json
{
  "name": "escalateToHuman",
  "description": "Creates an escalation ticket and notifies the on-call support agent."
}

A frustrated customer is on the line. Does the agent escalate? The description doesn't say when. An agent following implicit heuristics might escalate too aggressively (any negative sentiment), not enough (only on explicit request), or inconsistently (depends on conversation context that happened to be in the window).

Better: Explicit triggering conditions·json
{
  "name": "escalateToHuman",
  "description": "Escalates the conversation to a live support agent. Call when: (1) the customer explicitly requests a human agent, (2) the conversation has had 3 or more unsuccessful resolution attempts, or (3) the issue involves account security, billing disputes over $200, or compliance topics. Do not call for routine queries that other tools can resolve."
}

This description converts a vague capability into a decision rule. Numbered trigger conditions work well because agents parse them as conditionals. They check each one at selection time rather than using fuzzy semantic matching.

The "Do not call for..." line is equally important. Without it, the agent might use escalation as a catch-all for anything difficult, even when a better tool exists.

Opaque Parameters (84.3% of tools)

The third smell. Parameter names and types that don't explain intent. The agent infers what to pass from the field name alone, which leads to hallucinated or mismatched values.

Bad: Opaque parameters·json
{
  "name": "sendFollowUp",
  "inputSchema": {
    "type": "object",
    "properties": {
      "id": { "type": "string" },
      "mode": { "type": "string" },
      "delay": { "type": "integer" }
    }
  }
}

What's id? A customer ID? A conversation ID? A ticket ID? What modes are valid for mode? Is delay in seconds or milliseconds?

Better: Self-documenting parameters·json
{
  "name": "sendFollowUp",
  "inputSchema": {
    "type": "object",
    "properties": {
      "conversationId": {
        "type": "string",
        "description": "The unique conversation identifier from the current session (e.g., 'conv_abc123'). Use the session conversation ID, not the customer ID."
      },
      "channel": {
        "type": "string",
        "enum": ["email", "sms", "push"],
        "description": "Delivery channel for the follow-up. Use 'email' for non-urgent follow-ups, 'sms' for time-sensitive messages like appointment reminders, and 'push' only if the customer has the mobile app installed."
      },
      "delayMinutes": {
        "type": "integer",
        "description": "Minutes to wait before sending. Typical values: 0 (immediate), 30, 60, 1440 (24 hours). Maximum is 10080 (7 days)."
      }
    }
  }
}

Two improvements that carry the most weight: rename ambiguous fields to be self-explanatory (id to conversationId), and add per-parameter description fields with valid examples. The examples in the description are especially useful. Agents treat them as anchors when multiple valid values exist.
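As a rough illustration, the three smells can be approximated with a keyword-based lint pass over a tool definition. This is a heuristic sketch, not the study's detection method: the `Tool` type, the `findSmells` function, and both phrase lists are assumptions made up for this example, and a real audit still needs a human read.

```typescript
// Heuristic smell check for one tool definition. The phrase lists are
// illustrative assumptions; tune them to how your team writes descriptions.
interface ParamSchema {
  type: string;
  description?: string;
  enum?: string[];
}

interface Tool {
  name: string;
  description?: string;
  inputSchema?: { properties?: Record<string, ParamSchema> };
}

type Smell =
  | "unstated-limitations"
  | "missing-usage-guidelines"
  | "opaque-parameters";

// Phrases that usually signal a stated boundary or exclusion.
const LIMITATION_HINTS = [/\bdoes not\b/i, /\bdo not\b/i, /\bonly\b/i, /\binstead\b/i];
// Phrases that usually signal an explicit trigger condition.
const USAGE_HINTS = [/\bcall when\b/i, /\buse when\b/i, /\buse only\b/i, /\bafter\b/i, /\bbefore\b/i];

function findSmells(tool: Tool): Smell[] {
  const text = tool.description ?? "";
  const smells: Smell[] = [];

  if (!LIMITATION_HINTS.some((re) => re.test(text))) smells.push("unstated-limitations");
  if (!USAGE_HINTS.some((re) => re.test(text))) smells.push("missing-usage-guidelines");

  // Any parameter without its own description counts as opaque.
  const params = Object.values(tool.inputSchema?.properties ?? {});
  if (params.some((p) => !p.description)) smells.push("opaque-parameters");

  return smells;
}

const bad: Tool = {
  name: "getOrderHistory",
  description: "Returns order history for a customer.",
};

const better: Tool = {
  name: "getOrderHistory",
  description:
    "Returns completed and shipped orders for a verified retail customer account. " +
    "Use only after customer identity is confirmed in this session. " +
    "Does not include pending, cancelled, or wholesale orders. " +
    "For wholesale accounts, use getWholesaleOrderHistory instead.",
};

console.log(findSmells(bad));    // ["unstated-limitations", "missing-usage-guidelines"]
console.log(findSmells(better)); // []
```

A keyword check like this will miss descriptions that state limits in other words and will pass descriptions that use the right words badly, so it works best as a first filter that decides which tools get a manual review.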


What's the Accuracy-Latency Trade-Off With Richer Descriptions?

Augmented tool descriptions improve task success rates by 5.85 percentage points and per-step execution quality by 15.12%. Both are meaningful gains. The trade-off: average execution steps increase by 67.46%. Understanding that cost matters for how you calibrate descriptions in latency-sensitive deployments.

What this means in practice: when you give an agent richer tool descriptions, it reasons more carefully before calling. It reads the trigger conditions. It validates parameters against the description. It considers whether another tool is a better fit. All of that thinking is good. It also adds latency, sometimes significantly.

For a CX agent handling live customer interactions, this matters. A tool call that takes 3 seconds instead of 2 is noticeable in voice conversations. A response that requires 5 reasoning steps instead of 3 adds up fast across a conversation.

The calibration to aim for: write descriptions at the resolution an agent needs to make binary selection decisions, not at the resolution of a full developer reference doc. Cover what the tool does, when to use it, when not to, and what parameters mean, then stop. Information that doesn't affect the selection decision adds step cost without improving accuracy.

A practical heuristic: if a line in your description wouldn't change which tool an agent picks or how it fills parameters, cut it.

How Should You Structure an Agent-Readable Tool Description?

Here's the four-part structure I've converged on for MCP tools used by CX agents.

Part 1: Purpose sentence. What the tool does, in one line. Output-focused, not behavior-focused. "Returns completed orders for a verified customer" beats "This tool accesses our order management system to retrieve historical transaction data."

Part 2: Triggering conditions. When to call it, with 2-4 explicit conditions. Use numbered lists when conditions are discrete. Favor specific signals over vague ones ("after identity is verified" beats "when you need customer data").

Part 3: Exclusions. When NOT to call it, with 1-3 anti-patterns. Point to the correct tool when possible. This is the most skipped part and the most valuable for disambiguation.

Part 4: Parameter intent. For each parameter, what it means and examples of valid values. Rename ambiguous field names. Use enum for fields with a fixed set of valid values.
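One way to keep the first three parts consistent across a server is to assemble the description string from structured fields rather than writing it freehand. The sketch below assumes a hypothetical `buildDescription` helper and `DescriptionParts` type; neither comes from the MCP specification or any SDK. Part 4 is not covered here because it belongs in the per-parameter `description` fields of `inputSchema`.

```typescript
// Hypothetical helper: composes a tool description from parts 1-3.
interface DescriptionParts {
  purpose: string;      // Part 1: one-line purpose sentence
  triggers: string[];   // Part 2: explicit conditions for calling the tool
  exclusions: string[]; // Part 3: full sentences saying when NOT to call it
}

function buildDescription(parts: DescriptionParts): string {
  // Refuse to build a description that skips triggers or exclusions,
  // since those are the parts most often left out.
  if (parts.triggers.length === 0) {
    throw new Error("at least one triggering condition is required");
  }
  if (parts.exclusions.length === 0) {
    throw new Error("at least one exclusion is required");
  }

  const numbered = parts.triggers
    .map((trigger, i) => `(${i + 1}) ${trigger}`)
    .join(", ");

  return [parts.purpose, `Call when: ${numbered}.`, ...parts.exclusions].join(" ");
}

const escalateDescription = buildDescription({
  purpose: "Escalates the conversation to a live support agent.",
  triggers: [
    "the customer explicitly requests a human agent",
    "the conversation has had 3 or more unsuccessful resolution attempts",
  ],
  exclusions: ["Do not call for routine queries that other tools can resolve."],
});

console.log(escalateDescription);
```

Throwing on an empty trigger or exclusion list is a deliberate choice in this sketch: it turns the two most commonly skipped parts into a build-time failure instead of a review comment.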

Here's a complete example for a customer lookup tool:

Four-part description: lookupCustomerContext·json
{
  "name": "lookupCustomerContext",
  "description": "Retrieves a customer's profile, recent interaction history, and current account status. Call when: customer identity has been confirmed in this session AND you need context to personalize a response or make an account decision. Do not call before identity verification; use verifyIdentity first. Do not call more than once per conversation; results are cached for the session duration.",
  "inputSchema": {
    "type": "object",
    "required": ["customerId"],
    "properties": {
      "customerId": {
        "type": "string",
        "description": "The verified customer identifier returned from verifyIdentity. Format: 'cust_[alphanumeric]'. Do not use the caller's phone number here."
      },
      "includeHistory": {
        "type": "boolean",
        "description": "Set true to include the last 10 interactions. Default false. Set true when the customer references a previous interaction and you need that context."
      }
    }
  }
}

Notice the explicit reference to verifyIdentity in the description. That's the cross-referencing pattern, and it's worth calling out separately.

Agents often fail not because they pick the wrong tool but because they pick tools in the wrong order. Descriptions that reference related tools help the agent reason about dependencies and sequencing, which cuts down on out-of-order tool calls.

text
"Always call verifyIdentity before calling this tool."
"Use searchKnowledgeBase first; only call escalateToHuman if no relevant articles are found."
"Replaces lookupOrderStatus for orders placed after 2025-01-01. Use lookupOrderStatus for older orders."

These lines encode the tool graph that lives in your head as a developer. Every MCP server maintainer knows this graph implicitly: the order in which tools need to be called, the ones that replace others, the ones that depend on earlier outputs. Making it explicit in descriptions turns implicit developer knowledge into agent-accessible routing logic.
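Cross-references only help if the tools they point to exist. A small check can flag descriptions that mention a tool name missing from the registry. This is a sketch under assumptions: `findDanglingReferences` is a hypothetical helper, the camelCase regular expression is a guess at what a tool name looks like, and the `verifyIdentity` description is invented for the example. Parameter names written in camelCase will show up as false positives.

```typescript
interface Tool {
  name: string;
  description: string;
}

// Returns, per tool, the camelCase identifiers mentioned in its description
// that are not registered tool names. Heuristic: anything camelCase is
// treated as a possible tool reference.
function findDanglingReferences(tools: Tool[]): Record<string, string[]> {
  const registered = new Set(tools.map((t) => t.name));
  const camelCase = /\b[a-z]+(?:[A-Z][a-z0-9]*)+\b/g;
  const dangling: Record<string, string[]> = {};

  for (const tool of tools) {
    const mentioned = tool.description.match(camelCase) ?? [];
    const missing = Array.from(new Set(mentioned)).filter(
      (name) => name !== tool.name && !registered.has(name),
    );
    if (missing.length > 0) dangling[tool.name] = missing;
  }
  return dangling;
}

const registry: Tool[] = [
  {
    name: "lookupCustomerContext",
    description: "Do not call before identity verification; use verifyIdentity first.",
  },
  {
    name: "escalateToHuman",
    description:
      "Use searchKnowledgeBase first; only call escalateToHuman if no relevant articles are found.",
  },
  // Invented description, for illustration only.
  { name: "verifyIdentity", description: "Confirms the caller's identity." },
];

console.log(findDanglingReferences(registry));
// { escalateToHuman: ["searchKnowledgeBase"] }
```

Here `searchKnowledgeBase` is flagged because no tool with that name is registered, which is the situation an agent would hit at runtime: a description telling it to call something it cannot see.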

Testing Tool Descriptions Before Shipping

Better descriptions are only useful if you can verify they work. The goal is measurable improvement in tool selection accuracy, and that means building a test matrix before you deploy.

For each tool, write:

  • 10-15 natural-language inputs that should trigger it
  • 5-10 inputs that should not trigger it (including inputs that should trigger a similar tool instead)

Then track three metrics:

  • Selection recall: For inputs that should trigger tool X, what percentage do?
  • Selection specificity: For inputs that shouldn't trigger tool X, what percentage correctly don't?
  • Parameter fill accuracy: For correct tool calls, what percentage fill required parameters correctly?

A tool scoring under 80% on either selection metric is a tool with a description problem. The test matrix makes that visible before it surfaces as a production failure.
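The three metrics above, in the order listed, can be computed from a flat list of test results for one tool. The sketch below assumes a hypothetical `CaseResult` record and `scoreTool` function, with neutral field names that describe what each number counts. How you decide `paramsCorrect` for a given call is left to your own harness.

```typescript
interface CaseResult {
  shouldTrigger: boolean;  // does this input belong to tool X?
  didTrigger: boolean;     // did the agent actually call tool X?
  paramsCorrect?: boolean; // only meaningful when the call was expected and made
}

// Returns NaN when there is nothing to measure, so an empty test set
// cannot be mistaken for a perfect score.
function rate(hits: number, total: number): number {
  return total === 0 ? NaN : hits / total;
}

function scoreTool(results: CaseResult[]) {
  const positives = results.filter((r) => r.shouldTrigger);
  const negatives = results.filter((r) => !r.shouldTrigger);
  const correctCalls = positives.filter((r) => r.didTrigger);

  return {
    // Metric 1: share of should-trigger inputs that did trigger the tool.
    triggeredWhenExpected: rate(correctCalls.length, positives.length),
    // Metric 2: share of should-not-trigger inputs that correctly did not.
    skippedWhenExpected: rate(
      negatives.filter((r) => !r.didTrigger).length,
      negatives.length,
    ),
    // Metric 3: share of correct calls whose parameters were filled correctly.
    paramFillAccuracy: rate(
      correctCalls.filter((r) => r.paramsCorrect === true).length,
      correctCalls.length,
    ),
  };
}

const scores = scoreTool([
  { shouldTrigger: true, didTrigger: true, paramsCorrect: true },
  { shouldTrigger: true, didTrigger: true, paramsCorrect: true },
  { shouldTrigger: true, didTrigger: true, paramsCorrect: false },
  { shouldTrigger: true, didTrigger: false },
  { shouldTrigger: false, didTrigger: false },
  { shouldTrigger: false, didTrigger: true },
]);

console.log(scores);
```

With these six cases the tool is called for three of four expected inputs, wrongly called for one of two unrelated inputs, and fills parameters correctly in two of its three correct calls.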

Here's a lightweight test harness pattern. Define one scenario per intent and run them through chanl.scenarios.run, then assert the tool the agent actually picked matched what you expected.

tool-description-test.ts·typescript
import Chanl from '@chanl/sdk'

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY! })

// Each scenario in your library represents one intent the agent should handle.
// The scenario's expected tool is recorded in its scorecard rubric.
// Positive cases name the tool that must be called; negative cases name the
// tool that must NOT be called, instead of hardcoding it in the check.
type ToolSelectionCase =
  | { scenarioId: string; expectedTool: string }
  | { scenarioId: string; forbiddenTool: string }

const toolSelectionScenarios: ToolSelectionCase[] = [
  { scenarioId: 'sc_lookup_context_profile', expectedTool: 'lookupCustomerContext' },
  { scenarioId: 'sc_lookup_context_history', expectedTool: 'lookupCustomerContext' },
  { scenarioId: 'sc_escalate_explicit_request', expectedTool: 'escalateToHuman' },
  { scenarioId: 'sc_escalate_repeat_caller', expectedTool: 'escalateToHuman' },
  { scenarioId: 'sc_should_not_escalate_password_reset', forbiddenTool: 'escalateToHuman' },
  { scenarioId: 'sc_should_not_escalate_hours_query', forbiddenTool: 'escalateToHuman' },
]

async function runToolDescriptionTests() {
  let passed = 0
  let failed = 0

  for (const testCase of toolSelectionScenarios) {
    const run = await chanl.scenarios.run(testCase.scenarioId)
    const calledTools = run.toolCalls?.map(t => t.name) ?? []

    // Pass when the expected tool was called, or the forbidden tool was not.
    const ok = 'expectedTool' in testCase
      ? calledTools.includes(testCase.expectedTool)
      : !calledTools.includes(testCase.forbiddenTool)
    const expectation = 'expectedTool' in testCase
      ? `expected=${testCase.expectedTool}`
      : `forbidden=${testCase.forbiddenTool}`

    if (ok) passed++
    else failed++

    console.log(`${ok ? 'pass' : 'fail'} ${testCase.scenarioId} ${expectation} got=[${calledTools.join(', ')}]`)
  }

  console.log(`\n${passed}/${passed + failed} tool selection tests passed`)

  // Signal failure to CI so a misroute blocks the deploy.
  if (failed > 0) process.exitCode = 1
}

runToolDescriptionTests().catch(err => {
  console.error(err)
  process.exitCode = 1
})

Running this before every deploy gives you a regression suite for description quality. It's the equivalent of a unit test for your routing table. From there, Chanl's monitoring features let you track tool selection accuracy in production, so when a description starts causing misroutes after a server update, you catch it on the metrics dashboard before users do.

This is closely related to the tool description drift problem, where descriptions that were once accurate stop matching updated tool behavior. Pre-deploy tests catch description quality issues; production monitoring catches drift over time.

Patterns That Work, and Anti-Patterns That Don't

A few patterns from teams that have done this well:

Versioned descriptions alongside versioned tools. When a tool's behavior changes, the description has to change too. Treat description updates as a first-class part of your changelog, not an afterthought. If you're bumping a tool version because behavior changed, the description probably needs updating.
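A cheap guard for this is to compare two snapshots of your tool definitions and flag any tool whose input schema changed while its description stayed the same. The sketch below is one possible check, with a hypothetical `ToolSnapshot` type and `findStaleDescriptions` function. It only sees schema changes, so behavior changes behind an unchanged schema still need a changelog review, and the `JSON.stringify` comparison is sensitive to key order.

```typescript
interface ToolSnapshot {
  name: string;
  description: string;
  inputSchema: unknown;
}

// Returns names of tools whose schema changed between snapshots
// while the description text did not.
function findStaleDescriptions(
  previous: ToolSnapshot[],
  current: ToolSnapshot[],
): string[] {
  const before = new Map(previous.map((t) => [t.name, t] as const));

  return current
    .filter((tool) => {
      const old = before.get(tool.name);
      if (!old) return false; // new tool: nothing to compare against
      const schemaChanged =
        JSON.stringify(old.inputSchema) !== JSON.stringify(tool.inputSchema);
      return schemaChanged && old.description === tool.description;
    })
    .map((tool) => tool.name);
}

// Illustrative snapshots: the delay parameter was renamed, the description was not touched.
const v1: ToolSnapshot[] = [
  {
    name: "sendFollowUp",
    description: "Sends a follow-up message after a delay.",
    inputSchema: { type: "object", properties: { delay: { type: "integer" } } },
  },
];

const v2: ToolSnapshot[] = [
  {
    name: "sendFollowUp",
    description: "Sends a follow-up message after a delay.",
    inputSchema: { type: "object", properties: { delayMinutes: { type: "integer" } } },
  },
];

console.log(findStaleDescriptions(v1, v2)); // ["sendFollowUp"]
```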

Aggregate descriptions for tool families. If you have a cluster of related tools (lookupOrder, lookupOrderItem, lookupOrderShipment), a brief summary at the top of each ("Part of the order lookup family. Use lookupOrder for order-level data, lookupOrderItem for line items, lookupOrderShipment for tracking.") helps agents reason about the whole family rather than each tool in isolation.

Don't use system prompts as a substitute. It's tempting to handle tool selection logic in the system prompt: "When the customer asks about orders, use getOrderHistory." This works for simple cases but doesn't scale. With 20 tools, your system prompt becomes a routing guide that's hard to maintain and doesn't stay in sync with tool changes. Per-tool descriptions scale better.

Don't pad descriptions with motivation. Developers sometimes write descriptions that explain why a tool exists rather than when to call it. "This tool was built to support our new account management flow" is not useful to an agent. Neither is "Use this to improve customer satisfaction." Stick to operational specifics.

For teams managing tools at scale, the tool explosion problem is real, and description quality degrades predictably as tool counts grow. You add tools fast and document descriptions slowly. Building a description review step into your tool registration process is the most effective countermeasure.

What This Means If You Publish an MCP Server

If you publish a public MCP server, whether open-source or as part of a product, description quality is now a visible differentiator. Servers whose tools agents use successfully get integrated and kept. Servers whose tools fail mysteriously get removed or replaced.

The three things to do this week:

1. Audit existing descriptions for the three smells. For each tool, ask: Does it state limitations? Does it specify triggers? Do parameters explain intent? You don't need perfect descriptions everywhere. Start with the tools that get called most frequently.

2. Apply the four-part structure to your top tools. High-call-volume tools have the highest ROI for description improvements. Write a purpose sentence, triggering conditions, exclusions, and parameter intent for the top 20% of your tool surface area.

3. Add a description review step to your tool changelog. Every tool change that affects behavior needs a corresponding description update. This is the discipline that prevents description smells from accumulating and the drift that causes silent failure in production.

The research makes the case clearly: 97.1% of MCP tool descriptions have quality issues, and fixing them produces measurable accuracy gains. Your tools are already built. Your descriptions are probably the last mile standing between them and reliable agent behavior.

Treating tool descriptions as first-class artifacts, part of the Build phase of any agent deployment, is the shift that separates teams who debug production failures from teams who prevent them. The same agent that misrouted calls in your opening week can become a reliably accurate one with nothing more than better text.
