You shipped your MCP server. Agents are calling it. Everything looks fine.
Then at 2 AM, a support ticket lands: a customer's refund lookup returned the wrong account balance. You pull the logs. The MCP server returned HTTP 200. The agent's conversation shows it used the result confidently. But somewhere in the chain between the agent's request and the tool's response, something went wrong -- and you have no way to see what.
This is the observability gap that bites teams after their first MCP deployment. The server is running. The API is responding. But you're flying blind on what your agents are actually getting and doing with it.
This guide walks through instrumenting an MCP server for production from scratch: tracing tool calls end to end, detecting loops before they run up your bill, and building the dashboards that actually tell you what's happening.
Why MCP Needs Different Monitoring
Traditional API monitoring answers: did the request succeed, how fast was it, what was the HTTP status? That's enough for a REST API serving human users.
MCP servers don't serve humans. They serve AI agents at runtime -- and agents fail differently. An agent can call the same tool in a loop because its reasoning got stuck. It can receive a perfectly valid HTTP 200 with data that causes it to produce a wrong answer in the next step. It can make subtly malformed tool arguments that pass schema validation but return misleading results.
None of these show up in a standard APM dashboard. You need to see inside the tool calls.
The good news: OpenTelemetry's GenAI semantic conventions (stabilized in early 2026) give you the vocabulary to describe exactly what's happening in an MCP tool call. Once you're emitting the right spans and attributes, any standard observability backend -- Grafana, Datadog, New Relic, Honeycomb -- can query them.
Here's what matters at a glance:
| Signal | Standard API | MCP Server |
|---|---|---|
| Latency | P50, P99 per endpoint | P50, P99 per tool + per agent |
| Errors | HTTP 4xx/5xx | HTTP errors + semantic tool failures |
| Volume | Requests/sec | Calls/session (loop detection) |
| Content | Optional | Critical (what did the agent get?) |
| Cost | N/A | Tokens + downstream API costs |
Step 1: Set Up the OpenTelemetry SDK
Start with a minimal instrumentation setup. You'll add to it, but get the plumbing right first.
```typescript
// mcp-server/instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';

const resource = new Resource({
  [SEMRESATTRS_SERVICE_NAME]: 'mcp-tools-server',
  [SEMRESATTRS_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
  'deployment.environment': process.env.NODE_ENV ?? 'development',
  'mcp.transport': process.env.MCP_TRANSPORT ?? 'sse',
});

const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318/v1/traces',
});

const metricExporter = new OTLPMetricExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318/v1/metrics',
});

export const sdk = new NodeSDK({
  resource,
  traceExporter,
  metricReader: new PeriodicExportingMetricReader({
    exporter: metricExporter,
    exportIntervalMillis: 30_000,
  }),
});

// Call sdk.start() before importing anything else
sdk.start();
```

Initialize this before your MCP server starts:
```typescript
// mcp-server/index.ts (top of file, before other imports)
import './instrumentation';

import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
// ... rest of your server
```

Step 2: Instrument Tool Calls
The core of MCP observability is tracing individual tool calls. Each call should produce a span with the tool name, input arguments, output, duration, and any error.
```typescript
// mcp-server/tracing.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';
import { scrubPII } from './scrubbing';

const tracer = trace.getTracer('mcp-tools', '1.0.0');

interface ToolCallContext {
  toolName: string;
  sessionId: string;
  conversationId: string;
  callId: string;
  agentVersion?: string;
}

export async function traceToolCall<T>(
  ctx: ToolCallContext,
  args: Record<string, unknown>,
  handler: () => Promise<T>
): Promise<T> {
  return tracer.startActiveSpan(
    `mcp.tool.${ctx.toolName}`,
    {
      attributes: {
        // GenAI semantic conventions
        'gen_ai.tool.name': ctx.toolName,
        'gen_ai.tool.call.id': ctx.callId,
        'gen_ai.system': 'mcp',
        // MCP-specific
        'mcp.session.id': ctx.sessionId,
        'mcp.conversation.id': ctx.conversationId,
        'mcp.agent.version': ctx.agentVersion ?? 'unknown',
        // Sanitized input for debugging
        'mcp.tool.input': JSON.stringify(scrubPII(args)),
      },
    },
    async (span) => {
      const startTime = Date.now();
      try {
        const result = await handler();
        const duration = Date.now() - startTime;
        span.setAttributes({
          'mcp.tool.duration_ms': duration,
          'mcp.tool.success': true,
          // Truncated, scrubbed preview -- full outputs are too large for span attributes
          'mcp.tool.output_preview': JSON.stringify(scrubPII(result)).slice(0, 200),
        });
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (error) {
        const duration = Date.now() - startTime;
        const err = error as Error;
        span.setAttributes({
          'mcp.tool.duration_ms': duration,
          'mcp.tool.success': false,
          'mcp.tool.error_type': err.constructor.name,
          'mcp.tool.error_message': err.message,
        });
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: err.message,
        });
        span.recordException(err);
        throw error;
      } finally {
        span.end();
      }
    }
  );
}
```

Use this wrapper around every tool in your MCP server:
```typescript
// Example: account lookup tool
server.tool('get_account_balance', AccountBalanceSchema, async (args, callInfo) => {
  return traceToolCall(
    {
      toolName: 'get_account_balance',
      sessionId: callInfo.sessionId ?? 'unknown',
      conversationId: args._conversationId ?? 'unknown',
      callId: callInfo.callId ?? crypto.randomUUID(),
    },
    args,
    () => accountService.getBalance(args.accountId)
  );
});
```

Step 3: Track Metrics for Alerting
Spans give you detailed debugging data. Metrics give you the aggregates you need for alerting and dashboards.
```typescript
// mcp-server/metrics.ts
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('mcp-tools', '1.0.0');

// Counter: total calls per tool
const toolCallCounter = meter.createCounter('mcp.tool.calls', {
  description: 'Total MCP tool calls',
  unit: '1',
});

// Histogram: call duration distribution
const toolDurationHistogram = meter.createHistogram('mcp.tool.duration', {
  description: 'MCP tool call duration in milliseconds',
  unit: 'ms',
  advice: { explicitBucketBoundaries: [50, 100, 250, 500, 1000, 2500, 5000] },
});

// Gauge: calls per session in sliding window (for loop detection).
// Register a callback with addCallback() that reports counts from your session tracker.
const sessionCallGauge = meter.createObservableGauge('mcp.session.call_rate', {
  description: 'Tool calls per session in last 5 minutes',
  unit: '1',
});

// Counter: errors per tool
const toolErrorCounter = meter.createCounter('mcp.tool.errors', {
  description: 'Total MCP tool call errors',
  unit: '1',
});

interface ToolMetricLabels {
  tool: string;
  session: string;
  error_type?: string;
}

export function recordToolCall(
  labels: ToolMetricLabels,
  durationMs: number,
  success: boolean
): void {
  const dimensions = {
    'tool.name': labels.tool,
    // Note: session ID is a high-cardinality dimension; drop it here if your
    // metrics backend charges per time series
    'mcp.session.id': labels.session,
  };
  toolCallCounter.add(1, dimensions);
  toolDurationHistogram.record(durationMs, dimensions);
  if (!success) {
    toolErrorCounter.add(1, {
      ...dimensions,
      'error.type': labels.error_type ?? 'unknown',
    });
  }
}
```

Step 4: Loop Detection
This is the alerting feature you'll wish you had from day one. Agents can get stuck calling the same tool repeatedly -- usually because a tool returned unexpected data and the agent retried instead of escalating.
```typescript
// mcp-server/loop-detection.ts
interface SessionWindow {
  calls: Map<string, number[]>; // toolName -> timestamps
  lastAlert: number;
}

const sessionWindows = new Map<string, SessionWindow>();
const WINDOW_MS = 5 * 60 * 1000; // 5 minutes
const LOOP_THRESHOLD = 15; // calls to same tool in window

export function trackCallForLoops(
  sessionId: string,
  toolName: string,
  onLoopDetected: (sessionId: string, toolName: string, callCount: number) => void
): void {
  const now = Date.now();
  if (!sessionWindows.has(sessionId)) {
    sessionWindows.set(sessionId, { calls: new Map(), lastAlert: 0 });
  }
  const window = sessionWindows.get(sessionId)!;
  if (!window.calls.has(toolName)) {
    window.calls.set(toolName, []);
  }
  const timestamps = window.calls.get(toolName)!;

  // Add current call and prune old timestamps
  timestamps.push(now);
  const cutoff = now - WINDOW_MS;
  const recentCalls = timestamps.filter(t => t > cutoff);
  window.calls.set(toolName, recentCalls);

  // Alert if threshold exceeded (throttle to once per 5 minutes per session)
  if (recentCalls.length >= LOOP_THRESHOLD && now - window.lastAlert > WINDOW_MS) {
    window.lastAlert = now;
    onLoopDetected(sessionId, toolName, recentCalls.length);
  }
}

// Cleanup stale sessions (run on a timer, e.g. setInterval)
export function cleanupStaleSessions(maxAgeMs = 30 * 60 * 1000): void {
  const cutoff = Date.now() - maxAgeMs;
  for (const [sessionId, window] of sessionWindows.entries()) {
    const allTimestamps = Array.from(window.calls.values()).flat();
    const mostRecent = Math.max(...allTimestamps, 0);
    if (mostRecent < cutoff) {
      sessionWindows.delete(sessionId);
    }
  }
}
```

Wire this into your tool call handler:
```typescript
server.tool('get_account_balance', AccountBalanceSchema, async (args, callInfo) => {
  const sessionId = callInfo.sessionId ?? 'unknown';
  trackCallForLoops(sessionId, 'get_account_balance', async (sid, tool, count) => {
    // Send to your alerting system
    await alert.send({
      severity: 'warning',
      message: `Loop detected: session ${sid} called ${tool} ${count}x in 5 minutes`,
      metadata: { sessionId: sid, toolName: tool, callCount: count },
    });
  });
  return traceToolCall(/* ... */);
});
```

Step 5: PII Scrubbing
You're logging tool inputs and outputs. Those almost certainly contain PII. Do the scrubbing before spans are exported.
```typescript
// mcp-server/scrubbing.ts
const PII_PATTERNS: Array<[RegExp, string]> = [
  [/\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g, '[CARD_NUMBER]'],
  [/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi, '[EMAIL]'],
  [/\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, '[PHONE]'],
  [/\b\d{3}-\d{2}-\d{4}\b/g, '[SSN]'],
];

export function scrubPII(data: unknown): unknown {
  if (typeof data === 'string') {
    return PII_PATTERNS.reduce(
      (str, [pattern, replacement]) => str.replace(pattern, replacement),
      data
    );
  }
  if (Array.isArray(data)) {
    return data.map(scrubPII);
  }
  if (data !== null && typeof data === 'object') {
    return Object.fromEntries(
      Object.entries(data as Record<string, unknown>).map(([k, v]) => [k, scrubPII(v)])
    );
  }
  return data;
}
```

This is minimal but covers the most common PII types. Extend with patterns for your specific domain.
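As one example of a domain extension -- the `ACCT-` account-ID format below is purely hypothetical, substitute your own identifier shapes -- you can layer extra patterns on top of the same reduce-over-patterns approach:

```typescript
// Hypothetical domain-specific patterns; adjust the regexes to your own ID formats.
const EXTRA_PATTERNS: Array<[RegExp, string]> = [
  [/\bACCT-\d{8}\b/g, '[ACCOUNT_ID]'], // assumed internal account-ID shape
  [/\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b/gi, '[UUID]'],
];

// Same shape as scrubPII above, applied to a single string
function scrubExtra(value: string): string {
  return EXTRA_PATTERNS.reduce(
    (str, [pattern, replacement]) => str.replace(pattern, replacement),
    value
  );
}

console.log(scrubExtra('refund for ACCT-12345678 (ticket 550e8400-e29b-41d4-a716-446655440000)'));
// -> 'refund for [ACCOUNT_ID] (ticket [UUID])'
```

Ticket IDs, order numbers, and internal account references leak into tool arguments just as often as emails and card numbers, so it's worth auditing a sample of real traces before settling on the pattern list.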
How the Traces Connect
Here's how a single agent turn flows through the trace system, from the agent's reasoning step down to your MCP tool and back: the agent's turn is the root span, the model inference and the MCP tool call sit beneath it, and any downstream database or API queries your tool makes hang off the tool-call span.
Every span in that chain shares the same trace ID. When something goes wrong -- say the DB query returns stale data -- you can pull the full trace and see exactly what the agent received, what it inferred, and what it said.
This is what OpenTelemetry tracing for AI agents looks like from the MCP server's perspective. The linked article covers the agent side of the same trace.
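The spans end up in one trace because the agent forwards W3C trace context with each request. In practice the OpenTelemetry SDK's `propagation.extract()` handles this for you; the sketch below parses the `traceparent` header by hand purely to show what's inside it:

```typescript
// What a W3C traceparent header carries: version, trace ID, parent span ID, flags.
interface TraceParent {
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

function parseTraceparent(header: string): TraceParent | null {
  // Format: 00-<32 hex chars>-<16 hex chars>-<2 hex flag chars>
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return {
    traceId: m[1],
    parentSpanId: m[2],
    sampled: (parseInt(m[3], 16) & 1) === 1, // low bit = sampled flag
  };
}

console.log(parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'));
// -> { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', parentSpanId: '00f067aa0ba902b7', sampled: true }
```

If your MCP server creates its tool spans without extracting this context from the incoming request, every tool call starts a fresh trace and the agent-to-tool chain breaks apart.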
Step 6: Connect to Chanl Monitoring
If you're running your MCP server through Chanl's MCP runtime, the telemetry above feeds directly into the monitoring dashboard. You get per-tool latency trends, call volume charts, and loop detection alerts out of the box.
If you're running your own MCP infrastructure, here's how to export traces to Chanl alongside your primary observability backend:
```typescript
// instrumentation.ts -- multi-backend export using SimpleSpanProcessor
import { NodeSDK } from '@opentelemetry/sdk-node';
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';

const primaryExporter = new OTLPTraceExporter({
  url: process.env.PRIMARY_OTLP_ENDPOINT ?? 'http://localhost:4318/v1/traces',
});

// Additional export to Chanl for conversation-quality correlation
const chanlExporter = new OTLPTraceExporter({
  url: 'https://telemetry.chanl.ai/v1/traces',
  headers: { 'x-chanl-api-key': process.env.CHANL_API_KEY ?? '' },
});

export const sdk = new NodeSDK({
  resource: new Resource({ 'service.name': 'mcp-tools-server' }),
  // SimpleSpanProcessor exports each span as it ends, which is easy to reason
  // about; swap in BatchSpanProcessor for production throughput
  spanProcessors: [
    new SimpleSpanProcessor(primaryExporter),
    new SimpleSpanProcessor(chanlExporter),
  ],
});

sdk.start();
```

This pattern -- using multiple SpanProcessor instances -- lets you keep your existing Grafana or Datadog setup while also feeding Chanl's analytics layer, which correlates tool performance against conversation quality scores.
What Your First Dashboard Should Show
When your traces start flowing, build this dashboard first. It answers the questions you'll actually ask during an incident.
Health overview (top row):
- Tool call success rate by tool name (last 24h)
- P99 latency by tool (last 1h)
- Active sessions with loop alerts
Volume signals (middle row):
- Calls per tool per hour (trend line)
- Top 10 sessions by call volume (detect outliers)
- Error rate by error type
Debugging helpers (bottom row):
- Recent failed traces (link to trace ID)
- Slowest 10 tool calls (last 1h)
- Authentication failures by client
Most of this is one or two queries in Grafana or Honeycomb once your spans have the right attributes. The hard part is deciding what to look at -- the dashboard above covers 80% of what matters.
Three Things That Will Bite You in the First Week
After shipping MCP observability across a few teams, these are the gotchas that consistently show up in the first week of production.
Baggage propagation breaks at async boundaries. OpenTelemetry baggage (where you store conversation ID, user ID, feature flags) relies on Node.js AsyncLocalStorage. If you're using a job queue, a worker pool, or any pattern that passes work across async contexts without explicit propagation, your downstream spans will lose the parent context. The fix: propagate trace context explicitly when crossing async boundaries.
```typescript
import { context, propagation } from '@opentelemetry/api';

// When pushing to a queue or starting a worker
const carrier: Record<string, string> = {};
propagation.inject(context.active(), carrier);
// Store carrier in the job payload
await queue.push({ ...jobData, _otelContext: carrier });

// When the worker picks up the job:
const parentContext = propagation.extract(context.active(), job._otelContext);
await context.with(parentContext, async () => {
  // All spans created here inherit the parent trace
  await processJob(job);
});
```

SSE transport drops spans on reconnect. If you're using Server-Sent Events as your MCP transport, the agent reconnects if the connection drops. Each reconnect creates a new session ID. Without handling this, you'll see fragmented traces that look like separate conversations when they're actually one session. Track reconnect events explicitly and store the original session ID in a persistent store so you can stitch them together.
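A minimal in-memory sketch of the stitching idea -- the function names and the resume signal are illustrative, and a production version would back the map with Redis or similar so it survives restarts:

```typescript
// Maps a reconnected session ID to the original (root) session ID.
const canonicalSessions = new Map<string, string>();

// Call when a reconnect arrives carrying a resume signal (e.g. a Last-Event-ID
// header or a resume token your transport provides).
function linkSession(newSessionId: string, resumedFromId: string): void {
  // Follow the chain so a second reconnect still maps back to the first session
  const root = canonicalSessions.get(resumedFromId) ?? resumedFromId;
  canonicalSessions.set(newSessionId, root);
}

// Use this instead of the raw transport session ID when tagging spans/metrics.
function canonicalSessionId(sessionId: string): string {
  return canonicalSessions.get(sessionId) ?? sessionId;
}

linkSession('sess-b', 'sess-a'); // first reconnect
linkSession('sess-c', 'sess-b'); // second reconnect
console.log(canonicalSessionId('sess-c'));
// -> 'sess-a'
```

Tagging spans with the canonical ID means the loop-detection window and your per-session dashboards keep counting across reconnects instead of resetting.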
Sampling drops your most important spans. Head-based sampling (deciding at the start of a trace whether to record it) is simple but wrong for AI agents. A conversation that starts normally can become your most important trace if the agent makes a bad tool call in turn 7. Use tail-based sampling: buffer all spans for a trace and make the sampling decision at the end, keeping any trace that contains errors, high latency, or flagged quality scores. Grafana Tempo and Honeycomb both support tail-based sampling. Datadog's adaptive sampling approximates it.
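To make the idea concrete, here's a library-agnostic sketch of the tail-based decision. The types and thresholds are illustrative; in practice the buffering happens in a collector or backend (Tempo, Honeycomb's Refinery), not in your server process:

```typescript
// Tail-based sampling sketch: buffer finished spans per trace, decide only
// once the whole trace is known.
interface FinishedSpan {
  traceId: string;
  durationMs: number;
  isError: boolean;
}

const LATENCY_KEEP_MS = 2_500; // assumed "interesting latency" threshold
const buffers = new Map<string, FinishedSpan[]>();

// Called for every span as it ends -- no keep/drop decision yet
function onSpanEnd(span: FinishedSpan): void {
  const spans = buffers.get(span.traceId) ?? [];
  spans.push(span);
  buffers.set(span.traceId, spans);
}

// Called when the trace's root span ends (or after an idle timeout):
// keep the whole trace if any span errored or was slow, else drop it all.
function decideTrace(traceId: string): FinishedSpan[] {
  const spans = buffers.get(traceId) ?? [];
  buffers.delete(traceId);
  const keep = spans.some(s => s.isError || s.durationMs > LATENCY_KEEP_MS);
  return keep ? spans : []; // export kept traces, discard the rest
}
```

The key property is that the decision sees turn 7's bad tool call even though turns 1 through 6 looked healthy -- exactly what head-based sampling cannot do.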
The tools management guide covers how to use Chanl's tool registry to avoid the SSE reconnect problem entirely when you're using Chanl's MCP runtime.
The Difference Between Running and Watching
An MCP server that's running isn't the same as one you can see. When something goes wrong at 2 AM, the difference between "running" and "watching" is how quickly you find the answer.
The setup in this guide takes about a day to instrument and another day to tune thresholds and build your initial dashboard. After that, you have a production system you can actually debug -- which is a much better place to be than parsing raw logs and guessing what your agent saw.
The MCP Explained guide covers building your first MCP server if you're still getting the basics in place. Once you have a server that works, come back here and make it observable.
Built-in observability for your MCP tools
Chanl's MCP runtime instruments your tool calls automatically -- traces, loop detection, and cost attribution with no extra setup. Connect your existing tools and start monitoring in minutes.