The change went in on a Monday afternoon. A single tweak to one tool description on the company's MCP server: the word "schedule" became "book" because someone in support said customers used "book" more often in tickets. The eval suite ran green. The PR shipped.
Two weeks later, an analyst noticed appointment requests were routing to the right tool about 8 percent less often. No errors. No exceptions. The agent kept choosing tools and the tools kept executing. They were just the wrong tools. The eval suite was written against the old description, so the regression was invisible to the only system that was supposed to catch it.
This is description drift, and the MCP spec has no convention for catching it. The rest of this article is the path from there to a working gate: why descriptions behave like prompts, a perturbation generator that surfaces the failure, a drift detector, and a canary rollout you can wrap around any tool registry.
## Tool descriptions are prompts. They behave like prompts
When an MCP server returns a tool definition, the description string travels straight into the model's context as part of the tool list. The schema, the name, and the description together form what the model sees when it has to pick which tool to call. Names disambiguate when they are obviously different. Descriptions disambiguate when names are not, which is most of the time in any non-trivial registry.
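To make that concrete, here is roughly what a single tools/list entry looks like on the wire. The field names (name, description, inputSchema) come from the MCP spec; the schema body shown is illustrative, borrowed from the worked example later in this article.

```ts
// Roughly the shape of one tools/list entry. All three fields reach the
// model verbatim when it decides which tool to call: the description is
// part of the prompt, not metadata.
const scheduleAppointment = {
  name: "schedule_appointment",
  description:
    "Schedule a new appointment for a customer at a specific date and time. " +
    "Use when the user wants to set up a meeting.",
  inputSchema: {
    type: "object",
    properties: {
      customer: { type: "string" },
      datetime: { type: "string", format: "date-time" },
    },
    required: ["customer", "datetime"],
  },
};
```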
Prompt sensitivity is well documented. Sclar and colleagues showed at ICLR 2024 that semantically equivalent prompt formatting variations produced accuracy swings of up to 76 percentage points across model and task pairs. Errica and colleagues introduced sensitivity and consistency metrics specifically because rephrased-but-equivalent prompts move classifier predictions in ways that standard accuracy hides. Apple's GSM-Symbolic study found similar fragility on reasoning tasks when surface details changed. None of these papers studied MCP specifically, but the lesson generalizes: any text that participates in the model's decision is sensitive to surface form, and a tool description is exactly that kind of text.
The silent part of the failure comes from where the contract sits. Schema-based eval suites check that the tool exists, that it accepts the right parameters, and that it returns the right shape. They do not check whether the model still picks it for the same prompts. Routing accuracy is a property of the description and the surrounding tool list, not of the schema. Edit one, leave the other unchanged, and your tests are looking at the wrong surface.
So the problem is not "we shipped a bad description." It is "we shipped a description and had no way to know what changed." Close that gap and the rest of this article is mechanical: a perturbation generator to surface the failure mode, a detector to flag it, a canary to bound the blast radius.
## Five perturbations that look harmless and aren't
Before you can detect drift you need a way to generate it deliberately. Five perturbation strategies cover most edits a well-meaning engineer or product writer will make to a tool description. They look like cleanups. They behave like prompt rewrites.
```ts
type Tool = { name: string; description: string };
type Perturbation = (t: Tool) => Tool;

// Strip filler phrases and collapse whitespace: the classic "cleanup" edit.
const terser: Perturbation = (t) => ({
  ...t,
  description: t.description
    .replace(/\b(in order to|so that you can|please)\b/gi, "")
    .replace(/\s+/g, " ")
    .trim(),
});

// Swap customer-facing vocabulary for internal product jargon.
const jargonier: Perturbation = (t) => ({
  ...t,
  description: t.description
    .replace(/\bset up an appointment\b/gi, "create a CRM activity")
    .replace(/\bcustomer\b/gi, "account contact"),
});

// Move the leading verb to the end, front-loading context instead.
const reordered: Perturbation = (t) => ({
  ...t,
  description: t.description.replace(
    /^(\w+)(.*)$/,
    (_, verb, rest) => `For matters relating to${rest}, ${verb.toLowerCase()}.`,
  ),
});

// Rename to match a renamed feature, flipping the action's meaning.
const invertedPolarity: Perturbation = (t) => ({
  ...t,
  description: t.description.replace(/\bschedule\b/gi, "reschedule"),
  name: t.name.replace(/^schedule_/, "reschedule_"),
});

// Document a new optional parameter without touching existing callers.
const addedOptionalParam = (t: Tool): Tool => ({
  ...t,
  description: `${t.description} Accepts an optional 'priority' field.`,
});

export const perturbations = {
  terser,
  jargonier,
  reordered,
  invertedPolarity,
  addedOptionalParam,
};
```

Each strategy is defensible in isolation. Terser reads cleaner. Jargonier matches the internal product vocabulary. Reordered front-loads context that other tools also need. Inverted polarity matches a renamed feature. Added optional param adds capability without changing existing callers. The schema survives all five. The routing prior does not.
A worked example makes the cost concrete. Imagine a baseline tool: name: "schedule_appointment", description: "Schedule a new appointment for a customer at a specific date and time. Use when the user wants to set up a meeting.". Run a fixed set of 100 routing prompts through an agent that has this tool plus three lookalikes (reschedule_appointment, cancel_appointment, find_available_slots). Then re-run the same 100 prompts after each perturbation.
| Perturbation | Plausible top-1 routing delta | Why it shifts |
|---|---|---|
| Terser | -1 to -3 points | Cue words like "set up" disappear; closely related tools start tying |
| Jargonier | -3 to -8 points | "Account contact" no longer matches "customer" in user prompts |
| Reordered | -2 to -5 points | Verb is no longer first; lookalike tools with leading verbs gain priority |
| Inverted polarity | -8 to -15 points | "reschedule" now competes with the genuine reschedule tool |
| Added optional param | -1 to +2 points | Mostly inert; occasionally pulls borderline prompts toward this tool |
These ranges are illustrative, not measured on a specific server. They sit well below the 76-point ceiling Sclar 2024 reported for prompt-format perturbations and line up with the qualitative behavior of the Berkeley Function-Calling Leaderboard when descriptions of similar tools are made more or less distinctive. The point is the shape of the table, not the exact numbers. Treat your own measurements as authoritative.
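Producing your own numbers for this table is a short harness. A minimal sketch, assuming a runAgent helper (hypothetical: wrap whatever model client you already use and return the name of the tool it picked) and the perturbations module above at "./perturbations":

```ts
import { perturbations } from "./perturbations";

type Tool = { name: string; description: string };
type RoutingCase = { user: string; expectedTool: string; distractors: Tool[] };

// Hypothetical router: sends one user message plus a tool list to your model
// and returns the name of the tool it chose.
declare function runAgent(user: string, tools: Tool[]): Promise<string>;

async function accuracy(tool: Tool, cases: RoutingCase[]): Promise<number> {
  let correct = 0;
  for (const c of cases) {
    const chosen = await runAgent(c.user, [tool, ...c.distractors]);
    if (chosen === c.expectedTool) correct += 1;
  }
  return correct / cases.length;
}

// Measure the baseline once, then each perturbed variant against the same
// frozen cases, and report the top-1 routing delta in percentage points.
async function sweep(baseline: Tool, cases: RoutingCase[]): Promise<void> {
  const base = await accuracy(baseline, cases);
  for (const [name, perturb] of Object.entries(perturbations)) {
    const delta = (await accuracy(perturb(baseline), cases)) - base;
    console.log(`${name}: ${(delta * 100).toFixed(1)} points`);
  }
}
```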
## Building a drift detector
A drift detector needs to answer two questions every time a tool definition changes: did the meaning move, and did the agent's behavior move. Embedding similarity gives you the first cheaply. A behavioral replay gives you the second more expensively but more honestly.
```python
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()


@dataclass
class DriftReport:
    tool_id: str
    cosine: float
    routing_delta: float
    flagged: bool


def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)


def replay_routing_accuracy(tool_def: dict, prompts: list[dict]) -> float:
    # `run_agent` is your own router: takes a user message and a tool list,
    # returns the name of the tool the model picked. Wrap whatever client
    # you already use (Anthropic, OpenAI, your in-house gateway).
    correct = 0
    for p in prompts:
        chosen = run_agent(p["user"], tools=[tool_def, *p["distractors"]])
        if chosen == p["expected_tool"]:
            correct += 1
    return correct / len(prompts)


def detect_drift(
    old: dict,
    new: dict,
    eval_prompts: list[dict],
    cosine_floor: float = 0.92,
    routing_tolerance: float = 0.02,
) -> DriftReport:
    cos = cosine(embed(old["description"]), embed(new["description"]))
    base = replay_routing_accuracy(old, eval_prompts)
    after = replay_routing_accuracy(new, eval_prompts)
    delta = after - base
    flagged = cos < cosine_floor or delta < -routing_tolerance
    return DriftReport(
        tool_id=new["name"],
        cosine=cos,
        routing_delta=delta,
        flagged=flagged,
    )
```

The two thresholds need calibration on your own data. A cosine floor of 0.92 catches obvious paraphrases but lets idiomatic edits through. A routing tolerance of two percentage points is chosen to sit near the sampling noise of a 100-prompt set; as the eval set grows, that noise shrinks and the same tolerance becomes a stricter test. If your historical variance is wider than two points across reruns, raise the tolerance until it reflects the floor of your own measurement noise.
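Finding that floor empirically is mechanical: replay the unchanged baseline a few times and read the tolerance off the spread. A minimal sketch, where measureOnce is a placeholder for one full replay of the frozen prompt set, and the two-sigma rule is an assumption rather than a standard:

```ts
// Estimate the rerun noise floor on the *unchanged* baseline, then derive a
// tolerance from it. `measureOnce` stands in for one full replay of the
// frozen prompt set (the TypeScript analogue of replay_routing_accuracy).
async function calibrateTolerance(
  measureOnce: () => Promise<number>,
  runs = 5,
): Promise<number> {
  const xs: number[] = [];
  for (let i = 0; i < runs; i++) xs.push(await measureOnce());
  const mean = xs.reduce((a, b) => a + b, 0) / xs.length;
  const variance =
    xs.reduce((a, b) => a + (b - mean) ** 2, 0) / (xs.length - 1);
  // Two standard deviations of rerun noise, floored at the 2-point default.
  return Math.max(0.02, 2 * Math.sqrt(variance));
}
```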
Two design notes about this detector. The behavioral replay needs distractor tools in the same prompt, otherwise you are measuring whether the agent finds the only tool in the room and the answer is always yes. And the eval set has to be frozen. Update the prompts and you have changed the measuring instrument at the same time as the thing being measured.
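For reference, one entry in that frozen set, in the shape the detector's prompts list consumes. The user message and distractor descriptions here are illustrative; the lookalike tool names come from the worked example above.

```ts
// One frozen routing case: a real user message, the tool the model is
// expected to pick, and the lookalikes that must be present to make the
// measurement honest.
const routingCase = {
  user: "Can you book me in with Dr. Patel sometime next Tuesday?",
  expected_tool: "schedule_appointment",
  distractors: [
    { name: "reschedule_appointment", description: "Move an existing appointment to a new date or time." },
    { name: "cancel_appointment", description: "Cancel an existing appointment for a customer." },
    { name: "find_available_slots", description: "List open appointment slots for a provider and date range." },
  ],
};
```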
## Canary rollout for tool descriptions
Detection without a rollback story is just a pager. The same canary pattern teams use for feature flags works for tool descriptions if the registry can route different traffic shares to different versions. Promote through 1 percent, 10 percent, and 100 percent, watching routing accuracy at each step.
The registry layer that backs this rollout is short. Each tool has a list of versions, each version has a traffic percentage, and a request-time picker hashes the agent or session id to a stable bucket so the same caller sees the same version for the duration of a window.
```ts
type ToolVersion = { id: string; description: string; trafficPct: number };
type Registry = Map<string, ToolVersion[]>;

// Deterministic hash of the session id into [0, 1), so the same caller lands
// in the same bucket for the whole rollout window. FNV-1a is one simple
// choice; any stable string hash works.
function hashToUnitInterval(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h / 0x100000000;
}

function pickVersion(versions: ToolVersion[], sessionId: string): ToolVersion {
  const bucket = hashToUnitInterval(sessionId);
  let cum = 0;
  for (const v of versions) {
    cum += v.trafficPct / 100;
    if (bucket <= cum) return v;
  }
  return versions[versions.length - 1];
}

// `measureRouting` is your own replay over the frozen prompt set, returning
// top-1 routing accuracy for the given version.
declare function measureRouting(v: ToolVersion): Promise<number>;

async function promote(registry: Registry, toolName: string, nextPct: number) {
  const versions = registry.get(toolName)!;
  const baseline = await measureRouting(versions[0]);
  const candidate = versions[1];
  candidate.trafficPct = nextPct;
  versions[0].trafficPct = 100 - nextPct;
  const observed = await measureRouting(candidate);
  if (observed - baseline < -0.02) {
    candidate.trafficPct = 0;
    versions[0].trafficPct = 100;
    throw new Error(`auto-rollback: routing dropped ${(baseline - observed).toFixed(3)}`);
  }
}
```

Retiring the old version only after seven days is not cosmetic. Anthropic prompt caching invalidates whenever any byte of the cached content changes, so every description edit forces the cache to repopulate on the next request. Holding the old version in the registry for a week lets you separate cache warmup effects from genuine routing changes before you blame a metric move on the description edit that preceded it.
## Gating description changes with Chanl scenarios
Up to here, every piece is buildable on raw MCP plus an embedding model plus a small registry. The thing that is genuinely tedious to build yourself is the regression gate that runs against representative customer prompts on every PR. That is what Chanl scenarios does, and most of the existing surface is the right shape for description drift even though the feature was originally built for prompt and agent regression testing.
The pattern is straightforward. List the current MCP tools, check what would change in the proposed update, run a fixed scenario set against the new description, and only let the change merge if the pass rate stays inside tolerance.
```ts
import { Chanl } from "@chanl/sdk";

const sdk = new Chanl({ apiKey: process.env.CHANL_API_KEY! });

// `pollExecution` is a small helper around sdk.scenarios.getExecution(id)
// that loops until status is 'completed' or 'failed'. Most teams already
// have one of these in their CI utilities; a sketch follows after this block.
async function gateDescriptionChange(toolId: string, nextDescription: string) {
  const before = await sdk.tools.get(toolId);
  await sdk.tools.update(toolId, { description: nextDescription });

  // Smoke test: execute the updated tool with a representative argument set.
  const probe = await sdk.tools.test(toolId, { date: "2026-05-10", customer: "demo" });

  // Run the regression scenario; poll the execution and read its score.
  const queued = await sdk.scenarios.run("routing-baseline-v3", { agentId: "agent_router" });
  const execution = await pollExecution(queued.data!.executionId);
  const passRate = (execution.overallScore ?? 0) / 100;

  if (passRate < 0.95) {
    await sdk.tools.update(toolId, { description: before.data!.description });
    throw new Error(`rollback: pass rate ${passRate.toFixed(3)} below 0.95`);
  }
  return { passRate, probe };
}
```

The methods being called here are all live in @chanl/sdk today: sdk.tools.get, sdk.tools.update, sdk.tools.test, sdk.scenarios.run, and sdk.scenarios.getExecution (the pollExecution helper wraps it). What is intentionally missing, and what description drift exposes as a product gap, is sdk.tools.createVersion, sdk.tools.deploy({ percentage }), and sdk.tools.diff({ from, to }). Those mirror the prompt versioning surface and are on the roadmap for exactly this use case.
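The pollExecution helper itself is only a few lines. A minimal sketch, assuming the execution object exposes status and overallScore the way the gate above reads them; adjust to whatever your SDK version actually returns:

```ts
// Minimal polling loop around sdk.scenarios.getExecution. The status values
// and response shape here are assumptions that mirror the gate above.
async function pollExecution(executionId: string, intervalMs = 2000) {
  for (;;) {
    const res = await sdk.scenarios.getExecution(executionId);
    const execution = res.data!;
    if (execution.status === "completed" || execution.status === "failed") {
      return execution;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```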
A GitHub Actions snippet wires the gate to the merge button, with no special infrastructure beyond the Chanl API key.
```yaml
name: MCP description gate
on:
  pull_request:
    paths: ["mcp/tools/**"]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npx tsx scripts/gate-description-change.ts
        env:
          CHANL_API_KEY: ${{ secrets.CHANL_API_KEY }}
```

This gate is doing exactly what the eval suite was supposed to do at the start of the article. The difference is that scenarios test routing behavior on representative customer prompts, not schema shape. Pair it with the drift detector running on the same PR and you have both signals: the description's meaning moved, and the agent's behavior moved with it.
## What to do Monday
Three concrete things take a few hours each. First, make tool descriptions versioned at your registry layer, even if MCP itself does not, by hashing the full tool definition into an id and storing every prior version. Second, run the drift detector above on every tool update and block merges where cosine drops below 0.92 or routing delta exceeds 2 points on a frozen prompt set. Third, wrap the canary rollout around any registry that already supports multiple versions per tool, and tie auto-rollback to the same routing-accuracy metric your gate uses.
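The first step fits in a few lines with Node's built-in crypto. A sketch, where the sorted-key serialization is the important detail (it keeps key order from changing the version id) and ToolDef is a stand-in for your registry's tool type:

```ts
import { createHash } from "node:crypto";

type ToolDef = { name: string; description: string; inputSchema: unknown };

// JSON with sorted keys, so that key order never changes the hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${stableStringify(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Version id = content hash of the full definition: any edit to the name,
// description, or schema yields a new id, and every prior id stays stable.
function versionId(tool: ToolDef): string {
  return createHash("sha256").update(stableStringify(tool)).digest("hex").slice(0, 12);
}
```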
Description drift will keep happening. Your stack just needs to know when it does, before customers do.
## Sources

- Model Context Protocol specification: tools/list and tool definition shape
- Anthropic tool use documentation: descriptions are the primary signal for tool selection
- OpenAI function calling guide: clear descriptions drive routing accuracy
- Berkeley Function-Calling Leaderboard (BFCL): empirical accuracy across tool sets
- Sclar et al., Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design (ICLR 2024)
- Errica et al., What did I do wrong? Quantifying LLMs' Sensitivity to Prompt Variations (2024)
- Mirzadeh et al., GSM-Symbolic: prompt sensitivity in reasoning tasks
- OpenAI text-embedding-3-small reference
- LaunchDarkly canary deployment guide with auto-rollback
- Statsig auto-rollback for feature flag experiments
- Anthropic prompt caching: cache invalidation on tool definition changes
- MCP TypeScript SDK on GitHub