The change went in on a Monday afternoon. A single tweak to one tool description on the company's MCP server: the word "schedule" became "book" because someone in support said customers used "book" more often in tickets. The eval suite ran green. The PR shipped.
Two weeks later, an analyst noticed appointment requests were routing to the right tool about 8 percent less often. No errors. No exceptions. The agent kept choosing tools and the tools kept executing. They were just the wrong tools. The eval suite was written against the old description, so the regression was invisible to the only system that was supposed to catch it.
This is description drift, and the MCP spec has no convention for catching it. The rest of this article is the path from there to a working gate: why descriptions behave like prompts, a perturbation generator that surfaces the failure, a drift detector, and a canary rollout you can wrap around any tool registry.
## Tool descriptions are prompts. They behave like prompts
When an MCP server returns a tool definition, the description string travels straight into the model's context as part of the tool list. The schema, the name, and the description together form what the model sees when it has to pick which tool to call. Names disambiguate when they are obviously different. Descriptions disambiguate when names are not, which is most of the time in any non-trivial registry.
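To make that concrete, here is roughly what a single tools/list entry looks like on the wire. The field names (name, description, inputSchema) come from the MCP spec; the schema body shown is illustrative, borrowed from the worked example later in this article.

```ts
// Roughly the shape of one tools/list entry. All three fields reach the
// model verbatim when it decides which tool to call: the description is
// part of the prompt, not metadata.
const scheduleAppointment = {
  name: "schedule_appointment",
  description:
    "Schedule a new appointment for a customer at a specific date and time. " +
    "Use when the user wants to set up a meeting.",
  inputSchema: {
    type: "object",
    properties: {
      customer: { type: "string" },
      datetime: { type: "string", format: "date-time" },
    },
    required: ["customer", "datetime"],
  },
};
```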
Prompt sensitivity is well documented. Sclar and colleagues showed at ICLR 2024 that semantically equivalent prompt formatting variations produced accuracy swings of up to 76 percentage points across model and task pairs. Errica and colleagues introduced sensitivity and consistency metrics specifically because rephrased-but-equivalent prompts move classifier predictions in ways that standard accuracy hides. Apple's GSM-Symbolic study found similar fragility on reasoning tasks when surface details changed. None of these papers studied MCP specifically, but the lesson generalizes: any text that participates in the model's decision is sensitive to surface form, and a tool description is exactly that kind of text.
The silent part of the failure comes from where the contract sits. Schema-based eval suites check that the tool exists, that it accepts the right parameters, and that it returns the right shape. They do not check whether the model still picks it for the same prompts. Routing accuracy is a property of the description and the surrounding tool list, not of the schema. Edit one, leave the other unchanged, and your tests are looking at the wrong surface.
So the problem is not "we shipped a bad description." It is "we shipped a description and had no way to know what changed." Close that gap and the rest of this article is mechanical: a perturbation generator to surface the failure mode, a detector to flag it, a canary to bound the blast radius.
## Five perturbations that look harmless and aren't
Before you can detect drift you need a way to generate it deliberately. Five perturbation strategies cover most edits a well-meaning engineer or product writer will make to a tool description. They look like cleanups. They behave like prompt rewrites.
```ts
type Tool = { name: string; description: string };
type Perturbation = (t: Tool) => Tool;

// Strip filler phrases and collapse whitespace: the classic "cleanup" edit.
const terser: Perturbation = (t) => ({
  ...t,
  description: t.description
    .replace(/\b(in order to|so that you can|please)\b/gi, "")
    .replace(/\s+/g, " ")
    .trim(),
});

// Swap customer-facing vocabulary for internal product jargon.
const jargonier: Perturbation = (t) => ({
  ...t,
  description: t.description
    .replace(/\bset up an appointment\b/gi, "create a CRM activity")
    .replace(/\bcustomer\b/gi, "account contact"),
});

// Move the leading verb to the end, front-loading context instead.
const reordered: Perturbation = (t) => ({
  ...t,
  description: t.description.replace(
    /^(\w+)(.*)$/,
    (_, verb, rest) => `For matters relating to${rest}, ${verb.toLowerCase()}.`,
  ),
});

// Rename to match a renamed feature, flipping the action's meaning.
const invertedPolarity: Perturbation = (t) => ({
  ...t,
  description: t.description.replace(/\bschedule\b/gi, "reschedule"),
  name: t.name.replace(/^schedule_/, "reschedule_"),
});

// Document a new optional parameter without touching existing callers.
const addedOptionalParam = (t: Tool): Tool => ({
  ...t,
  description: `${t.description} Accepts an optional 'priority' field.`,
});

export const perturbations = {
  terser,
  jargonier,
  reordered,
  invertedPolarity,
  addedOptionalParam,
};
```

Each strategy is defensible in isolation. Terser reads cleaner. Jargonier matches the internal product vocabulary. Reordered front-loads context that other tools also need. Inverted polarity matches a renamed feature. Added optional param adds capability without changing existing callers. The schema survives all five. The routing prior does not.
A worked example makes the cost concrete. Imagine a baseline tool: name: "schedule_appointment", description: "Schedule a new appointment for a customer at a specific date and time. Use when the user wants to set up a meeting.". Run a fixed set of 100 routing prompts through an agent that has this tool plus three lookalikes (reschedule_appointment, cancel_appointment, find_available_slots). Then re-run the same 100 prompts after each perturbation.
| Perturbation | Plausible top-1 routing delta | Why it shifts |
|---|---|---|
| Terser | -1 to -3 points | Cue words like "set up" disappear; closely related tools start tying |
| Jargonier | -3 to -8 points | "Account contact" no longer matches "customer" in user prompts |
| Reordered | -2 to -5 points | Verb is no longer first; lookalike tools with leading verbs gain priority |
| Inverted polarity | -8 to -15 points | "reschedule" now competes with the genuine reschedule tool |
| Added optional param | -1 to +2 points | Mostly inert; occasionally pulls borderline prompts toward this tool |
These ranges are illustrative, not measured on a specific server. They sit well below the 76-point ceiling Sclar 2024 reported for prompt-format perturbations and line up with the qualitative behavior of the Berkeley Function-Calling Leaderboard when descriptions of similar tools are made more or less distinctive. The point is the shape of the table, not the exact numbers. Treat your own measurements as authoritative.
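Producing your own numbers for this table is a short harness. A minimal sketch, assuming a runAgent helper (hypothetical: wrap whatever model client you already use and return the name of the tool it picked) and the perturbations module above at "./perturbations":

```ts
import { perturbations } from "./perturbations";

type Tool = { name: string; description: string };
type RoutingCase = { user: string; expectedTool: string; distractors: Tool[] };

// Hypothetical router: sends one user message plus a tool list to your model
// and returns the name of the tool it chose.
declare function runAgent(user: string, tools: Tool[]): Promise<string>;

async function accuracy(tool: Tool, cases: RoutingCase[]): Promise<number> {
  let correct = 0;
  for (const c of cases) {
    const chosen = await runAgent(c.user, [tool, ...c.distractors]);
    if (chosen === c.expectedTool) correct += 1;
  }
  return correct / cases.length;
}

// Measure the baseline once, then each perturbed variant against the same
// frozen cases, and report the top-1 routing delta in percentage points.
async function sweep(baseline: Tool, cases: RoutingCase[]): Promise<void> {
  const base = await accuracy(baseline, cases);
  for (const [name, perturb] of Object.entries(perturbations)) {
    const delta = (await accuracy(perturb(baseline), cases)) - base;
    console.log(`${name}: ${(delta * 100).toFixed(1)} points`);
  }
}
```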
## Building a drift detector
A drift detector needs to answer two questions every time a tool definition changes: did the meaning move, and did the agent's behavior move. Embedding similarity gives you the first cheaply. A behavioral replay gives you the second more expensively but more honestly.
```python
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()


@dataclass
class DriftReport:
    tool_id: str
    cosine: float
    routing_delta: float
    flagged: bool


def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)


def replay_routing_accuracy(tool_def: dict, prompts: list[dict]) -> float:
    # `run_agent` is your own router: takes a user message and a tool list,
    # returns the name of the tool the model picked. Wrap whatever client
    # you already use (Anthropic, OpenAI, your in-house gateway).
    correct = 0
    for p in prompts:
        chosen = run_agent(p["user"], tools=[tool_def, *p["distractors"]])
        if chosen == p["expected_tool"]:
            correct += 1
    return correct / len(prompts)


def detect_drift(
    old: dict,
    new: dict,
    eval_prompts: list[dict],
    cosine_floor: float = 0.92,
    routing_tolerance: float = 0.02,
) -> DriftReport:
    cos = cosine(embed(old["description"]), embed(new["description"]))
    base = replay_routing_accuracy(old, eval_prompts)
    after = replay_routing_accuracy(new, eval_prompts)
    delta = after - base
    flagged = cos < cosine_floor or delta < -routing_tolerance
    return DriftReport(
        tool_id=new["name"],
        cosine=cos,
        routing_delta=delta,
        flagged=flagged,
    )
```

The two thresholds need calibration on your own data. A cosine floor of 0.92 catches obvious paraphrases but lets idiomatic edits through. A routing tolerance of two percentage points is chosen to sit near the sampling noise of a 100-prompt set; as the eval set grows, that noise shrinks and the same tolerance becomes a stricter test. If your historical variance is wider than two points across reruns, raise the tolerance until it reflects the floor of your own measurement noise.
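Finding that floor empirically is mechanical: replay the unchanged baseline a few times and read the tolerance off the spread. A minimal sketch, where measureOnce is a placeholder for one full replay of the frozen prompt set, and the two-sigma rule is an assumption rather than a standard:

```ts
// Estimate the rerun noise floor on the *unchanged* baseline, then derive a
// tolerance from it. `measureOnce` stands in for one full replay of the
// frozen prompt set (the TypeScript analogue of replay_routing_accuracy).
async function calibrateTolerance(
  measureOnce: () => Promise<number>,
  runs = 5,
): Promise<number> {
  const xs: number[] = [];
  for (let i = 0; i < runs; i++) xs.push(await measureOnce());
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance =
    xs.reduce((a, b) => a + (b - mean) ** 2, 0) / (xs.length - 1);
  // Two standard deviations of rerun noise, floored at the 2-point default.
  return Math.max(0.02, 2 * Math.sqrt(variance));
}
```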
Two design notes about this detector. The behavioral replay needs distractor tools in the same prompt, otherwise you are measuring whether the agent finds the only tool in the room and the answer is always yes. And the eval set has to be frozen. Update the prompts and you have changed the measuring instrument at the same time as the thing being measured.
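For reference, one entry in that frozen set, in the shape the detector's prompts list consumes. The user message and distractor descriptions here are illustrative; the lookalike tool names come from the worked example above.

```ts
// One frozen routing case: a real user message, the tool the model is
// expected to pick, and the lookalikes that must be present to make the
// measurement honest.
const routingCase = {
  user: "Can you book me in with Dr. Patel sometime next Tuesday?",
  expected_tool: "schedule_appointment",
  distractors: [
    { name: "reschedule_appointment", description: "Move an existing appointment to a new date or time." },
    { name: "cancel_appointment", description: "Cancel an existing appointment for a customer." },
    { name: "find_available_slots", description: "List open appointment slots for a provider and date range." },
  ],
};
```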
## Canary rollout for tool descriptions
Detection without a rollback story is just a pager. The same canary pattern teams use for feature flags works for tool descriptions if the registry can route different traffic shares to different versions. Promote through 1 percent, 10 percent, and 100 percent, watching routing accuracy at each step.
The registry layer that backs this rollout is short. Each tool has a list of versions, each version has a traffic percentage, and a request-time picker hashes the agent or session id to a stable bucket so the same caller sees the same version for the duration of a window.
```ts
type ToolVersion = { id: string; description: string; trafficPct: number };
type Registry = Map<string, ToolVersion[]>;

// Deterministic hash of the session id into [0, 1), so the same caller lands
// in the same bucket for the whole rollout window. FNV-1a is one simple
// choice; any stable string hash works.
function hashToUnitInterval(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h / 0x100000000;
}

function pickVersion(versions: ToolVersion[], sessionId: string): ToolVersion {
  const bucket = hashToUnitInterval(sessionId);
  let cum = 0;
  for (const v of versions) {
    cum += v.trafficPct / 100;
    if (bucket <= cum) return v;
  }
  return versions[versions.length - 1];
}

// `measureRouting` is your own replay over the frozen prompt set, returning
// top-1 routing accuracy for the given version.
declare function measureRouting(v: ToolVersion): Promise<number>;

async function promote(registry: Registry, toolName: string, nextPct: number) {
  const versions = registry.get(toolName)!;
  const baseline = await measureRouting(versions[0]);
  const candidate = versions[1];
  candidate.trafficPct = nextPct;
  versions[0].trafficPct = 100 - nextPct;
  const observed = await measureRouting(candidate);
  if (observed - baseline < -0.02) {
    candidate.trafficPct = 0;
    versions[0].trafficPct = 100;
    throw new Error(`auto-rollback: routing dropped ${(baseline - observed).toFixed(3)}`);
  }
}
```

Retiring the old version only after seven days is not cosmetic. Anthropic prompt caching invalidates whenever any byte of the cached content changes, so every description edit forces the cache to repopulate on the next request. Holding the old version in the registry for a week lets you separate cache warmup effects from genuine routing changes before you blame a metric move on the description edit that preceded it.
## Gating description changes with Chanl scenarios
Up to here, every piece is buildable on raw MCP plus an embedding model plus a small registry. The thing that is genuinely tedious to build yourself is the regression gate that runs against representative customer prompts on every PR. That is what Chanl scenarios does, and most of the existing surface is the right shape for description drift even though the feature was originally built for prompt and agent regression testing.
The pattern is straightforward. List the current MCP tools, check what would change in the proposed update, run a fixed scenario set against the new description, and only let the change merge if the pass rate stays inside tolerance.
```ts
import { Chanl } from "@chanl/sdk";

const sdk = new Chanl({ apiKey: process.env.CHANL_API_KEY! });

// `pollExecution` is a small helper around sdk.scenarios.getExecution(id)
// that loops until status is 'completed' or 'failed'. Most teams already
// have one of these in their CI utilities; a sketch follows after this block.
async function gateDescriptionChange(toolId: string, nextDescription: string) {
  const before = await sdk.tools.get(toolId);
  await sdk.tools.update(toolId, { description: nextDescription });

  // Smoke test: execute the updated tool with a representative argument set.
  const probe = await sdk.tools.test(toolId, { date: "2026-05-10", customer: "demo" });

  // Run the regression scenario; poll the execution and read its score.
  const queued = await sdk.scenarios.run("routing-baseline-v3", { agentId: "agent_router" });
  const execution = await pollExecution(queued.data!.executionId);
  const passRate = (execution.overallScore ?? 0) / 100;

  if (passRate < 0.95) {
    await sdk.tools.update(toolId, { description: before.data!.description });
    throw new Error(`rollback: pass rate ${passRate.toFixed(3)} below 0.95`);
  }
  return { passRate, probe };
}
```

The methods being called here are all live in @chanl/sdk today: sdk.tools.get, sdk.tools.update, sdk.tools.test, sdk.scenarios.run, and sdk.scenarios.getExecution (the pollExecution helper wraps it). What is intentionally missing, and what description drift exposes as a product gap, is sdk.tools.createVersion, sdk.tools.deploy({ percentage }), and sdk.tools.diff({ from, to }). Those mirror the prompt versioning surface and are on the roadmap for exactly this use case.
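The pollExecution helper itself is only a few lines. A minimal sketch, assuming the execution object exposes status and overallScore the way the gate above reads them; adjust to whatever your SDK version actually returns:

```ts
// Minimal polling loop around sdk.scenarios.getExecution. The status values
// and response shape here are assumptions that mirror the gate above.
async function pollExecution(executionId: string, intervalMs = 2000) {
  for (;;) {
    const res = await sdk.scenarios.getExecution(executionId);
    const execution = res.data!;
    if (execution.status === "completed" || execution.status === "failed") {
      return execution;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```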
A GitHub Actions snippet wires the gate to the merge button, with no special infrastructure beyond the Chanl API key.
```yaml
name: MCP description gate
on:
  pull_request:
    paths: ["mcp/tools/**"]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx tsx scripts/gate-description-change.ts
        env:
          CHANL_API_KEY: ${{ secrets.CHANL_API_KEY }}
```

This gate is doing exactly what the eval suite was supposed to do at the start of the article. The difference is that scenarios test routing behavior on representative customer prompts, not schema shape. Pair it with the drift detector running on the same PR and you have both signals: the description's meaning moved, and the agent's behavior moved with it.
## What to do Monday
Three concrete things take a few hours each. First, make tool descriptions versioned at your registry layer, even if MCP itself does not, by hashing the full tool definition into an id and storing every prior version. Second, run the drift detector above on every tool update and block merges where cosine drops below 0.92 or routing delta exceeds 2 points on a frozen prompt set. Third, wrap the canary rollout around any registry that already supports multiple versions per tool, and tie auto-rollback to the same routing-accuracy metric your gate uses.
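The first step fits in a few lines with Node's built-in crypto. A sketch, where the sorted-key serialization is the important detail (it keeps key order from changing the version id) and ToolDef is a stand-in for your registry's tool type:

```ts
import { createHash } from "node:crypto";

type ToolDef = { name: string; description: string; inputSchema: unknown };

// JSON with sorted keys, so that key order never changes the hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Version id = content hash of the full definition: any edit to the name,
// description, or schema yields a new id, and every prior id stays stable.
function versionId(tool: ToolDef): string {
  return createHash("sha256").update(stableStringify(tool)).digest("hex").slice(0, 12);
}
```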
Description drift will keep happening. Your stack just needs to know when it does, before customers do.
## Sources

- Model Context Protocol specification: tools/list and tool definition shape
- Anthropic tool use documentation: descriptions are the primary signal for tool selection
- OpenAI function calling guide: clear descriptions drive routing accuracy
- Berkeley Function-Calling Leaderboard (BFCL): empirical accuracy across tool sets
- Sclar et al., Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design (ICLR 2024)
- Errica et al., What did I do wrong? Quantifying LLMs' Sensitivity to Prompt Variations (2024)
- Mirzadeh et al., GSM-Symbolic: prompt sensitivity in reasoning tasks
- OpenAI text-embedding-3-small reference
- LaunchDarkly canary deployment guide with auto-rollback
- Statsig auto-rollback for feature flag experiments
- Anthropic prompt caching: cache invalidation on tool definition changes
- MCP TypeScript SDK on GitHub