Your FastMCP server works perfectly on localhost. Tools respond, the MCP Inspector shows green checkmarks, Claude Desktop calls your functions and returns clean results. You deploy it. Within a day, your agent is burning through tokens on tool definitions it doesn't need, auto-approving destructive operations without confirmation, and returning error messages so vague the LLM spins in retry loops.
This isn't hypothetical. These are the seven mistakes that show up repeatedly when FastMCP servers move from development to production. Each one works fine locally because local testing doesn't expose the failure mode. Production does.
We'll walk through each mistake with the broken version, explain why it breaks, and show the fix. If you haven't built an MCP server yet, our MCP tutorial covers the fundamentals.
1. Missing tool annotations
Tool annotations tell MCP clients what a tool does before calling it. Without them, clients have no way to distinguish a read-only lookup from a database wipe. The result: either every call requires manual confirmation (killing automation) or every call gets auto-approved (killing your data).
MCP 2025-03-26 introduced annotations as structured metadata on tool definitions. Five fields describe a tool's behavior:
| Annotation | Type | What It Tells the Client |
|---|---|---|
| readOnlyHint | boolean | This tool only reads data, never modifies anything |
| destructiveHint | boolean | This tool deletes or irreversibly changes state |
| idempotentHint | boolean | Calling this tool twice with the same input produces the same result |
| openWorldHint | boolean | This tool interacts with external systems beyond the server |
| title | string | Human-readable name for display in client UIs |
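On the wire, these annotations ride along in the tool definition a client receives from tools/list. A sketch of the shape (field names follow the MCP spec; the values here are illustrative):

```python
# Illustrative shape of an annotated tool definition as a client sees it
tool_definition = {
    "name": "rollback_production",
    "description": "Rollback production to a previous deployment.",
    "inputSchema": {
        "type": "object",
        "properties": {"deploy_id": {"type": "string"}},
        "required": ["deploy_id"],
    },
    "annotations": {
        "title": "Rollback Production",
        "readOnlyHint": False,
        "destructiveHint": True,
        "idempotentHint": False,
        "openWorldHint": True,
    },
}

# A cautious client can gate approval on these hints
needs_confirmation = tool_definition["annotations"]["destructiveHint"]
print(needs_confirmation)  # True
```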
Here's a FastMCP server without annotations:
```python
from fastmcp import FastMCP

mcp = FastMCP("deployment-tools")

@mcp.tool()
def get_deploy_status(deploy_id: str) -> str:
    """Check deployment status."""
    return fetch_status(deploy_id)

@mcp.tool()
def rollback_production(deploy_id: str) -> str:
    """Rollback production to a previous deployment."""
    return execute_rollback(deploy_id)
```

Both tools look identical to the client. get_deploy_status is a harmless read. rollback_production takes down your live service. The client treats them the same.
The fix takes one additional parameter per tool:
```python
from fastmcp import FastMCP

mcp = FastMCP("deployment-tools")

@mcp.tool(
    annotations={
        "title": "Check Deploy Status",
        "readOnlyHint": True,
        "destructiveHint": False,
        "openWorldHint": True,
    }
)
def get_deploy_status(deploy_id: str) -> str:
    """Check the current status of a deployment by ID."""
    return fetch_status(deploy_id)

@mcp.tool(
    annotations={
        "title": "Rollback Production",
        "readOnlyHint": False,
        "destructiveHint": True,
        "idempotentHint": False,
        "openWorldHint": True,
    }
)
def rollback_production(deploy_id: str) -> str:
    """Rollback production to a previous deployment. This is destructive and cannot be undone."""
    return execute_rollback(deploy_id)
```

Now the client knows get_deploy_status is safe to auto-approve and rollback_production needs human confirmation. Claude Code uses these annotations to decide its approval policy. Without them, it defaults to asking for every call, which defeats the point of agentic tool use.
2. Why shouldn't MCP tools mirror your API?
MCP tools should represent outcomes, not API operations. When you mirror your REST API as MCP tools, you force the LLM to learn your API design, manage state between calls, and chain operations correctly. It fails at all three.
Consider a deployment service. The REST API has endpoints for listing deployments, creating a deployment, monitoring build progress, and promoting to production. A naive MCP server exposes all four:
```python
@mcp.tool()
def list_deployments(app_id: str) -> str:
    """List all deployments for an app."""
    ...

@mcp.tool()
def create_deployment(app_id: str, branch: str) -> str:
    """Create a new deployment from a branch."""
    ...

@mcp.tool()
def get_build_status(deploy_id: str) -> str:
    """Check if a deployment build has completed."""
    ...

@mcp.tool()
def promote_deployment(deploy_id: str) -> str:
    """Promote a deployment to production."""
    ...
```

To deploy to staging, the agent needs to call create_deployment, poll get_build_status in a loop until it succeeds, then call promote_deployment with the right ID. That's three tool calls minimum, with polling logic the LLM must figure out on its own. It usually gets the polling wrong.
The outcome-oriented version wraps the workflow:
```python
@mcp.tool(
    annotations={
        "title": "Deploy to Staging",
        "readOnlyHint": False,
        "destructiveHint": False,
        "openWorldHint": True,
    }
)
def deploy_to_staging(app_id: str, branch: str = "main") -> str:
    """Deploy a branch to staging. Creates the deployment,
    waits for the build, and promotes automatically.
    Returns the deployment URL when complete."""
    deploy = create_deployment_internal(app_id, branch)
    wait_for_build(deploy.id, timeout=300)
    promote(deploy.id, environment="staging")
    return f"Deployed {branch} to staging: {deploy.url}"
```

One tool call. No LLM polling logic. No chaining mistakes. The server handles the workflow complexity, and the agent handles the intent.
This doesn't mean you should never expose granular tools. get_deploy_status is still useful as a standalone read-only tool. The principle is: if a user would describe the action as a single sentence ("deploy this branch to staging"), it should be a single tool.
3. Token-wasteful responses
Every token in a tool response eats into the context window. JSON is the default format for most developers, but it's spectacularly wasteful for tabular data. Each row repeats every field name, and the structural characters (braces, brackets, quotes, colons) add up fast.
Here's a typical response from a tool that lists recent deployments:
```json
[
  {"id": "dep_001", "branch": "main", "status": "active", "created": "2026-04-01T10:00:00Z", "author": "alice"},
  {"id": "dep_002", "branch": "fix/auth", "status": "building", "created": "2026-04-01T09:30:00Z", "author": "bob"},
  {"id": "dep_003", "branch": "feat/search", "status": "failed", "created": "2026-04-01T08:15:00Z", "author": "carol"},
  {"id": "dep_004", "branch": "main", "status": "active", "created": "2026-03-31T16:00:00Z", "author": "alice"},
  {"id": "dep_005", "branch": "fix/perf", "status": "rolled_back", "created": "2026-03-31T14:20:00Z", "author": "dave"}
]
```

That's 5 rows with 5 fields. The field names id, branch, status, created, and author appear 5 times each. For 20 rows, they'd appear 20 times. The JSON version of this data uses roughly 480 tokens. The same data as a markdown table:
| id | branch | status | created | author |
|---|---|---|---|---|
| dep_001 | main | active | 2026-04-01T10:00:00Z | alice |
| dep_002 | fix/auth | building | 2026-04-01T09:30:00Z | bob |
| dep_003 | feat/search | failed | 2026-04-01T08:15:00Z | carol |
| dep_004 | main | active | 2026-03-31T16:00:00Z | alice |
| dep_005 | fix/perf | rolled_back | 2026-03-31T14:20:00Z | dave |

Same information, roughly 240 tokens. That's a 50% reduction. At 20 rows, the savings compound because the header row is fixed overhead while JSON repeats keys linearly.
The fix is straightforward. Detect whether the response is tabular and format accordingly:
```python
import json

def format_response(data: list[dict], format: str = "table") -> str:
    if not data:
        return "No results found."
    if format == "json":
        return json.dumps(data, indent=2)
    # Markdown table format
    headers = list(data[0].keys())
    lines = [" | ".join(headers), " | ".join(["---"] * len(headers))]
    for row in data:
        lines.append(" | ".join(str(row.get(h, "")) for h in headers))
    return "\n".join(lines)
```

Use JSON when the data is nested or irregular. Use markdown tables when it's flat and tabular. The LLM reads both equally well, but tables cost half the tokens.
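You can sanity-check the savings with a rough character-count heuristic (about 4 characters per token for ASCII-heavy text; real counts vary by tokenizer and data shape):

```python
import json

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for ASCII-heavy text
    return len(text) // 4

# 20 rows of deployment data shaped like the example above
rows = [
    {"id": f"dep_{i:03d}", "branch": "main", "status": "active",
     "created": "2026-04-01T10:00:00Z", "author": "alice"}
    for i in range(1, 21)
]

json_text = json.dumps(rows, indent=2)

# Same rows as a markdown table: header once, then one line per row
headers = list(rows[0].keys())
lines = [" | ".join(headers), " | ".join(["---"] * len(headers))]
for row in rows:
    lines.append(" | ".join(str(row[h]) for h in headers))
table_text = "\n".join(lines)

print(estimate_tokens(json_text), estimate_tokens(table_text))
```

The exact numbers depend on the tokenizer, but the table consistently lands at a fraction of the JSON cost, and the gap widens as row count grows.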
4. Why do generic errors cause retry loops?
Generic error messages cause retry loops. When a tool returns "Bad Request" or "Internal Server Error," the LLM has no information about what went wrong. It tries the same call again, maybe with slightly different parameters, gets the same error, and burns through your API budget while accomplishing nothing.
Here's the pattern that causes it:
```python
@mcp.tool()
def create_api_key(name: str, scopes: list[str]) -> str:
    """Create a new API key with specified scopes."""
    try:
        return api.create_key(name, scopes)
    except Exception as e:
        return f"Error: {str(e)}"
```

If the exception message is "400 Bad Request", that's what the LLM sees. It doesn't know if the name was invalid, the scopes were wrong, or the user hit a rate limit. So it guesses. Usually wrong.
Structured errors with actionable context give the LLM what it needs to self-correct:
```python
@mcp.tool()
def create_api_key(name: str, scopes: list[str]) -> str:
    """Create a new API key with specified scopes.
    Valid scopes: read, write, admin, deploy."""
    valid_scopes = {"read", "write", "admin", "deploy"}
    invalid = set(scopes) - valid_scopes
    if invalid:
        return (
            f"Invalid scopes: {', '.join(invalid)}. "
            f"Valid options: {', '.join(sorted(valid_scopes))}. "
            f"Retry with corrected scopes."
        )
    if len(name) < 3 or len(name) > 64:
        return "Key name must be 3-64 characters. Retry with a valid name."
    try:
        key = api.create_key(name, scopes)
        return f"Created API key '{name}' with scopes: {', '.join(scopes)}. Key ID: {key.id}"
    except RateLimitError:
        return "Rate limit exceeded. Wait 60 seconds before retrying."
    except DuplicateKeyError:
        return f"A key named '{name}' already exists. Use a different name."
    except Exception as e:
        return f"Unexpected error creating key: {type(e).__name__}. Contact support if this persists."
```

Three things make this work. First, validation happens before the API call, so common mistakes never hit the network. Second, each error message includes the specific problem and the specific fix. Third, the tool description lists valid values, so the LLM can often get it right on the first call.
The difference in practice: generic errors cause 3 to 5 retry cycles averaging 2,000 tokens each. Structured errors cause 0 to 1 retries. Over hundreds of agent interactions per day, that's the difference between a manageable API bill and a surprising one.
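The arithmetic, using the midpoints of those ranges and a hypothetical volume of 300 agent interactions per day:

```python
# Midpoints of the ranges above: 3-5 retries vs 0-1
generic_retries = 4
structured_retries = 1
tokens_per_retry = 2_000
interactions_per_day = 300  # hypothetical volume

generic_daily = generic_retries * tokens_per_retry * interactions_per_day
structured_daily = structured_retries * tokens_per_retry * interactions_per_day
print(generic_daily - structured_daily)  # 1800000 tokens saved per day
```

At typical per-token pricing, that difference alone can dominate the serving bill for an agent fleet.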
5. Is FastMCP secure for remote deployment?
FastMCP ships with HTTP transport that has no authentication. That's fine for local development where only you can reach the server. It's a critical vulnerability the moment you deploy it anywhere a network request can reach it.
An MCP server with tools that query databases, call APIs, or execute code is, by definition, a remote code execution surface. Without auth, anyone who discovers the URL can call any tool.
The default HTTP setup looks like this:
```python
from fastmcp import FastMCP

mcp = FastMCP("my-tools")

# ... register tools ...

if __name__ == "__main__":
    mcp.run(transport="http", host="0.0.0.0", port=8000)
```

That's listening on all interfaces with zero access control. Here's how to add bearer token authentication using FastMCP's auth hooks:
```python
import hmac
import os

from fastmcp import FastMCP
from fastmcp.server.auth import BearerAuthProvider

class TokenAuth(BearerAuthProvider):
    async def validate_token(self, token: str) -> dict:
        expected = os.environ.get("MCP_AUTH_TOKEN")
        if not expected:
            raise ValueError("MCP_AUTH_TOKEN not configured")
        # compare_digest is constant-time; a plain != comparison
        # leaks information through response timing
        if not hmac.compare_digest(token, expected):
            raise ValueError("Invalid token")
        return {"authenticated": True}

mcp = FastMCP("my-tools", auth=TokenAuth())
```

For production deployments behind a reverse proxy, you can also validate tokens at the proxy layer (nginx, Cloudflare, or your cloud provider's API gateway) and let the MCP server trust the proxy. That's the pattern Chanl's MCP server uses: gateway token validation at the edge, with workspace-scoped tool resolution behind it.
The key point: never expose an MCP server to a network without authentication. If it's reachable, it's callable. If it's callable without auth, it's a vulnerability.
6. No production deployment story
FastMCP is designed for local development. It starts with mcp.run(), runs as a single process, and shuts down when your terminal closes. That's perfect for prototyping. It's not a deployment strategy.
Production MCP servers need health checks, graceful shutdown, structured logging, and a process manager. The gap between "it runs on my machine" and "it runs in production" is where most FastMCP projects stall.
Here's a minimal production wrapper that addresses the basics:
```python
import logging
import signal
import sys

from fastmcp import FastMCP

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("mcp-server")

mcp = FastMCP("production-tools")

@mcp.tool(annotations={"readOnlyHint": True})
def health_check() -> str:
    """Server health check. Returns ok if the server is running."""
    return "ok"

# ... register your tools ...

def handle_shutdown(signum, frame):
    logger.info("Shutdown signal received, cleaning up...")
    # Close database connections, flush buffers, etc.
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_shutdown)
signal.signal(signal.SIGINT, handle_shutdown)

if __name__ == "__main__":
    logger.info("Starting MCP server on port 8000")
    mcp.run(transport="streamable-http", host="0.0.0.0", port=8000)
```

For containerized deployments, use Streamable HTTP transport (not stdio, which requires the client to spawn the server as a child process). Add a health_check tool or a dedicated HTTP health endpoint that your orchestrator can poll. Handle SIGTERM for graceful shutdown in Kubernetes or Docker.
If you're deploying to a serverless platform like Vercel, the pattern changes entirely. You export a request handler instead of running a long-lived process. The transport layer handles connection persistence across invocations.
The point isn't to prescribe a specific deployment architecture. It's that you need one, and FastMCP's default mcp.run() isn't it.
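If you do expose a plain HTTP health endpoint alongside the MCP transport (the /health path here is a hypothetical choice, not something FastMCP provides by default), the liveness probe on the orchestrator side can be a few lines of stdlib Python:

```python
import urllib.request

HEALTH_URL = "http://localhost:8000/health"  # hypothetical endpoint path

def probe(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError, connection refused, and socket timeouts
        return False
```

Kubernetes and Docker healthchecks do the equivalent for you; the point is that something outside the process must be able to tell alive from dead.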
7. Monolithic tool exposure
This is the most expensive mistake by far, and the one most developers don't catch until their token bills arrive. Every tool definition in an MCP server gets serialized into the LLM's context window. Each definition costs 550 to 1,400 tokens depending on schema complexity. A server with 40 tools burns 22,000 to 56,000 tokens on definitions alone, before the agent reads a single message or generates a single response.
Here's the math for a real example. Say you've built a full-featured MCP server for your platform:
| Tool Count | Avg Tokens per Tool | Context Window Cost |
|---|---|---|
| 5 tools | 800 | 4,000 tokens |
| 15 tools | 800 | 12,000 tokens |
| 40 tools | 800 | 32,000 tokens |
| 100 tools | 800 | 80,000 tokens |
With a 128K context window, 40 tools consume 25% of your available context before anything else happens. The agent has less room for conversation history, less room for tool responses, and less room for reasoning. Quality degrades.
But the real problem isn't just token cost. It's decision quality. When an LLM sees 40 tool definitions, it's choosing from 40 options on every turn. The more options, the more likely it picks the wrong tool or hallucinates a tool that doesn't exist. Five focused tools produce better tool selection accuracy than forty broad ones.
The fix is to scope tools to the task. Instead of one server with everything, create purpose-built groups:
```python
# Instead of one monolithic server with 40 tools:
mcp = FastMCP("everything-server")

# Create focused servers or use toolset-scoped resolution:

# Deployment tools (5 tools)
deploy_mcp = FastMCP("deploy-tools")

@deploy_mcp.tool(annotations={"title": "Deploy to Staging", "destructiveHint": False})
def deploy_to_staging(app_id: str, branch: str = "main") -> str: ...

@deploy_mcp.tool(annotations={"title": "Deploy to Production", "destructiveHint": True})
def deploy_to_production(app_id: str, deploy_id: str) -> str: ...

# Monitoring tools (4 tools)
monitor_mcp = FastMCP("monitor-tools")

@monitor_mcp.tool(annotations={"title": "Get Error Rate", "readOnlyHint": True})
def get_error_rate(app_id: str, window: str = "1h") -> str: ...

@monitor_mcp.tool(annotations={"title": "Get Latency Stats", "readOnlyHint": True})
def get_latency_stats(app_id: str, percentile: int = 95) -> str: ...
```

Each agent connects only to the toolset it needs. A deployment agent sees 5 tools. A monitoring agent sees 4 tools. Neither wastes context on definitions irrelevant to its job.
This is the architecture behind Chanl's tool management system. Tools are registered individually, then grouped into toolsets. When an agent connects, the MCP server resolves only the tools in its assigned toolset. An agent with 3 tools in its toolset never sees the other 97 tools on the platform.
Putting it all together
These seven mistakes share a common thread: they're invisible in local development. Annotations don't matter when you're the only user approving calls. Token waste doesn't matter when you're testing with 3 rows of data. Auth doesn't matter on localhost. Monolithic tool sets don't matter when you're testing one tool at a time.
Production exposes all of them simultaneously. Here's a checklist for auditing your FastMCP server before deploying:
- Every tool has annotations (readOnlyHint, destructiveHint, openWorldHint at minimum)
- Tools represent outcomes, not API primitives (one action = one tool)
- Tabular responses use markdown tables, not JSON
- Error messages include the specific problem and the specific fix
- HTTP transport has authentication middleware
- Server has health checks, structured logging, and graceful shutdown
- Agent connects to a scoped toolset (under 10 tools), not a monolithic server
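Parts of that checklist are mechanical enough to automate. A sketch that audits tool metadata modeled as plain dicts (the dict shape is an assumption; adapt the iteration to however your server enumerates its registered tools):

```python
# Minimum annotations every tool should declare (per the checklist above)
REQUIRED_HINTS = {"readOnlyHint", "destructiveHint", "openWorldHint"}
MAX_TOOLS = 10  # scoped-toolset threshold from the checklist

def audit_tools(tools: list[dict]) -> list[str]:
    """Return a list of human-readable problems found in tool metadata."""
    problems = []
    if len(tools) > MAX_TOOLS:
        problems.append(f"{len(tools)} tools exposed; consider scoped toolsets")
    for tool in tools:
        missing = REQUIRED_HINTS - set(tool.get("annotations", {}))
        if missing:
            problems.append(f"{tool['name']}: missing {', '.join(sorted(missing))}")
    return problems

print(audit_tools([
    {"name": "get_deploy_status",
     "annotations": {"readOnlyHint": True, "destructiveHint": False, "openWorldHint": True}},
    {"name": "rollback_production", "annotations": {"readOnlyHint": False}},
]))
# ['rollback_production: missing destructiveHint, openWorldHint']
```

Run something like this in CI so an unannotated tool fails the build instead of shipping.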
The gap between a working MCP server and a production-ready one isn't about the protocol. The protocol is solid. It's about the patterns around it: how you describe tools, how you structure responses, how you handle failure, and how you control access. Get those right and your agents stop burning tokens on problems they shouldn't have.
If you're building tools for AI agents and want to test them with realistic scenarios before they hit production, Chanl handles tool registration, toolset management, and MCP hosting so you can focus on the tools themselves.