What is a circuit breaker in an AI agent system?

A circuit breaker is a reliability pattern borrowed from electrical engineering. It wraps calls to external systems (tools, APIs, LLM endpoints) and monitors failure rates. When failures exceed a threshold, the circuit trips and subsequent calls fail immediately rather than waiting to time out or retry. After a recovery window, it allows a test request through to check whether the downstream system has recovered.

Why do AI agents enter infinite retry loops?

Agents enter retry loops when a tool call fails and the agent's reasoning leads it to retry rather than escalate or bail out. Without an iteration limit or budget ceiling in the agent runtime, the agent can legitimately decide to retry a failing API hundreds of times overnight. Most agent frameworks don't enforce hard limits by default, so you have to wire them in explicitly.

What should a circuit breaker protect in an AI agent?

You need circuit breakers at three layers: tool calls (any function the agent invokes), LLM API calls (to prevent runaway costs if your orchestration retries on model errors), and external API calls made by your tools. Each layer has different failure signatures and thresholds. Budget-based breakers are a fourth type: they trip when total token spend for a session exceeds a ceiling you set.

How do I set the right thresholds for an agent circuit breaker?

Start conservative: trip after 5 failures in a 60-second window, wait 30 seconds before testing recovery. Monitor false positive rates. If circuits trip during normal traffic spikes, widen the window. Tool calls that hit external APIs warrant tighter thresholds (3 failures) than internal data lookups (10 failures). Budget-based breakers should trip at 2-3x your expected per-session cost.

What is the difference between a circuit breaker and a retry policy?

Retry policies control how many times you attempt a failing call. Circuit breakers control whether you attempt it at all. A retry policy of 3 attempts with exponential backoff is fine for transient failures. A circuit breaker is what stops that retry policy from running 10,000 times overnight when your downstream API is down for hours. You need both: retries for transient blips, circuit breakers for sustained outages.

Can I test circuit breakers before they trip in production?

Yes. Shadow deployment is the standard approach. You run your agent against a subset of real traffic with circuit breakers in open state (failing fast) and measure how the agent handles the fast-fails. This catches the cases where your agent doesn't gracefully handle a tripped circuit. Shadow deployment reduces production incidents by about 40% for teams that use it.

How does schema validation prevent tool call failures?

Schema validation catches malformed tool calls before they hit external APIs. If your agent generates a tool call with a missing required field or an out-of-range value, a validation layer rejects it immediately rather than sending it to your CRM or payment processor. Schema drift (when your tool's expected inputs change without the agent being updated) is the top cause of broken automations. Versioned schema enforcement catches this at the boundary.

What observability do I need for circuit breakers to be useful?

At minimum: alert when a circuit trips, log every open-state rejection with the reason, and dashboard current circuit state across all your agent's tools. Without this, a tripped circuit is invisible from the outside and your agent silently degrades. You want to correlate circuit state with conversation quality metrics so you can answer 'why did agent quality drop between 11 PM and 7 AM?' from a single dashboard.

Circuit Breakers for AI Agents: Stop the 3 AM Meltdown

At 11 PM, a developer's support agent started hitting a payment API that had gone into maintenance mode. The API returned 503. The agent retried. 503 again. The agent retried again. By 7 AM it had made thousands of identical calls and the developer woke up to a $437 bill and zero successful transactions.

Nobody designed this failure. The agent was doing exactly what it was built to do: persist until the task completes. The missing ingredient wasn't smarter reasoning. It was a circuit breaker.

What Circuit Breakers Are and Why Agents Need Them

Circuit breakers limit how much damage an agent can do when an external system fails. The software version of the idea comes from distributed systems: wrap calls to unreliable downstream services in a layer that tracks failure rates. When failures exceed a threshold, the circuit trips. Subsequent calls fail immediately rather than waiting to time out or retry. After a recovery window, the circuit lets a test request through to check if the system is back.

The pattern has been standard in microservices for a decade. Most AI agent frameworks don't include it by default.

Agents are especially vulnerable without circuit breakers because they reason about failures and decide to retry. A well-instructed agent will try to complete its task. In the face of a persistent failure, that means it retries, re-evaluates, and retries again. The model isn't broken. The behavior is exactly what you asked for. But without a hard ceiling on iterations and a mechanism that stops calls to a broken downstream service, that good behavior becomes a runaway process by morning.

Datadog's 2026 State of AI Engineering found that 5% of all LLM call spans reported errors, and 60% of those errors came from rate limits. Rate limit errors are exactly the kind of transient, recoverable failure that causes agents to enter retry spirals. The agent sees "rate limited," interprets it as a temporary condition, and keeps trying.

The Three States

A circuit breaker moves through three states. Understanding them helps you tune the thresholds correctly.

Figure: Circuit breaker state machine for agent tool calls.

Closed is normal operation. Every call goes through. The breaker tracks failures in a rolling window. This is the default state.

Open means the circuit has tripped. Every call fails immediately with a structured error, not a timeout. No waiting, no external requests. The agent gets a fast failure it can handle explicitly.

Half-open is the recovery probe. After a timeout, one request goes through. If it succeeds, the circuit closes. If it fails, it reopens for another recovery period.

The key insight for agent systems: open state should produce a structured error that your agent handles explicitly. If your agent's system prompt doesn't define what to do when a tool returns CIRCUIT_OPEN, the agent will try to work around it in unpredictable ways, often by retrying through a different code path that bypasses your circuit entirely.
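
One way to make that error explicit is a thin wrapper that turns a thrown circuit error into data the agent loop can act on. This is a minimal sketch, not a prescribed design: the `ToolResult` shape, the `runTool` name, and the `retryable` field are hypothetical, and it assumes the breaker signals an open circuit with an error message of the form `CIRCUIT_OPEN:<name>`, as the implementation in this article does.

```typescript
// Hypothetical result shape handed back to the agent loop.
type ToolResult<T> =
  | { ok: true; value: T }
  | {
      ok: false;
      code: 'CIRCUIT_OPEN' | 'TOOL_ERROR';
      circuit?: string;
      retryable: boolean;
      message: string;
    };

const CIRCUIT_OPEN_PREFIX = 'CIRCUIT_OPEN:';

// Run a tool and convert thrown errors into a structured result.
async function runTool<T>(fn: () => Promise<T>): Promise<ToolResult<T>> {
  try {
    return { ok: true, value: await fn() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    if (message.startsWith(CIRCUIT_OPEN_PREFIX)) {
      return {
        ok: false,
        code: 'CIRCUIT_OPEN',
        circuit: message.slice(CIRCUIT_OPEN_PREFIX.length),
        retryable: false, // signal that the agent should not try again
        message: 'This tool is temporarily unavailable. Do not retry; offer a handoff.',
      };
    }
    return { ok: false, code: 'TOOL_ERROR', retryable: false, message };
  }
}
```

The system prompt can then state what to do when a tool result carries `code: 'CIRCUIT_OPEN'`, so the agent has a defined path instead of improvising one.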

Implementing a Circuit Breaker in TypeScript

Here's a straightforward implementation that covers the core states without heavy dependencies:

circuit-breaker.ts·typescript

type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';
 
interface CircuitBreakerConfig {
  failureThreshold: number;  // trips after N failures
  successThreshold: number;  // closes after N successes in half-open
  timeout: number;           // ms to stay open before testing
  windowMs: number;          // rolling window for failure counting
}
 
class CircuitBreaker {
  private state: CircuitState = 'CLOSED';
  private failures: number[] = [];  // timestamps
  private successes = 0;
  private nextAttempt = 0;
 
  constructor(
    private name: string,
    private config: CircuitBreakerConfig
  ) {}
 
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error(`CIRCUIT_OPEN:${this.name}`);
      }
      this.state = 'HALF_OPEN';
    }
 
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }
 
  private onSuccess() {
    if (this.state === 'HALF_OPEN') {
      this.successes++;
      if (this.successes >= this.config.successThreshold) {
        this.reset();
      }
    } else {
      this.failures = [];
    }
  }
 
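  private onFailure() {
    const now = Date.now();
    // Drop failures that have aged out of the rolling window
    this.failures = this.failures.filter(
      (t) => now - t < this.config.windowMs
    );
    this.failures.push(now);

    // A failed probe in half-open reopens the circuit immediately;
    // in closed state the circuit trips once the threshold is reached
    if (
      this.state === 'HALF_OPEN' ||
      this.failures.length >= this.config.failureThreshold
    ) {
      this.state = 'OPEN';
      this.nextAttempt = now + this.config.timeout;
      this.successes = 0;
    }
  }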
 
  private reset() {
    this.state = 'CLOSED';
    this.failures = [];
    this.successes = 0;
  }
 
  getState() {
    return this.state;
  }
}

You'd wrap each tool call:

agent-tools.ts·typescript

const paymentApiBreaker = new CircuitBreaker('payment-api', {
  failureThreshold: 3,
  successThreshold: 2,
  timeout: 30_000,    // 30 seconds open
  windowMs: 60_000,   // 1-minute rolling window
});
 
async function processRefund(orderId: string, amount: number) {
  return paymentApiBreaker.call(async () => {
    const response = await paymentApi.refund({ orderId, amount });
    return response;
  });
}

Once the payment API has failed three times within the one-minute window, the circuit opens and processRefund throws CIRCUIT_OPEN:payment-api immediately for the next 30 seconds. No waiting, no retrying, no wasted tokens on the LLM reasoning about a broken downstream system.

What to Protect at Each Layer

Not everything needs the same configuration. Think in three layers.

Tool calls to external APIs (CRMs, payment processors, booking systems) deserve tight thresholds: 3 failures in 60 seconds, 30-second recovery timeout. These services are most likely to have extended outages, and they're usually the most expensive to keep retrying.

LLM API calls need a different approach. LLM failures are mostly rate limits and transient errors, so a slightly wider window makes sense: 5 failures in 120 seconds, 60-second recovery. You also want a budget-based breaker here, one that trips if total token spend for this session exceeds 3x your expected cost.

budget-breaker.ts·typescript

class BudgetCircuitBreaker {
  private spent = 0;
 
  constructor(private maxTokens: number) {}
 
  check(estimatedTokens: number) {
    if (this.spent + estimatedTokens > this.maxTokens) {
      throw new Error(
        `BUDGET_EXCEEDED: spent=${this.spent}, limit=${this.maxTokens}`
      );
    }
  }
 
  record(tokensUsed: number) {
    this.spent += tokensUsed;
  }
}
 
// Per-session, initialized when the conversation starts
const budget = new BudgetCircuitBreaker(
  EXPECTED_TOKENS_PER_SESSION * 3  // 3x ceiling
);
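
A budget breaker only helps if every model call goes through it. The following is a rough sketch of that wiring, assuming a budget object with the same `check` and `record` methods as above. The `TokenBudget` interface, the `LlmResponse` shape, and the `callWithBudget` helper are hypothetical names for illustration; real SDKs report token usage under their own field names.

```typescript
// Minimal interface matching the check/record methods of a budget breaker.
interface TokenBudget {
  check(estimatedTokens: number): void;
  record(tokensUsed: number): void;
}

// Hypothetical response shape; adapt to whatever your LLM client returns.
interface LlmResponse {
  text: string;
  tokensUsed: number;
}

// Check the budget before spending, record actual usage afterwards.
async function callWithBudget(
  budget: TokenBudget,
  estimatedTokens: number,
  callModel: () => Promise<LlmResponse>
): Promise<LlmResponse> {
  budget.check(estimatedTokens); // throws before any tokens are spent
  const response = await callModel();
  budget.record(response.tokensUsed);
  return response;
}
```

Checking with an estimate and recording the actual count keeps the running total accurate even when the estimate is off.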

Iteration limits are a third type. Not failure-rate-based, but execution-depth-based. Every agent loop needs a hard ceiling on how many steps it can take.

agent-runner.ts·typescript

const MAX_ITERATIONS = 15;
let iterations = 0;
 
while (!done) {
  if (++iterations > MAX_ITERATIONS) {
    throw new Error('ITERATION_LIMIT_EXCEEDED');
  }
  // agent step
}

Most agent frameworks have this as a setting. The default in many is either very high or disabled. Set it explicitly, and handle ITERATION_LIMIT_EXCEEDED in your error path so the agent can tell the customer it needs to transfer rather than silently hanging.
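
Handling the limit in the error path might look something like this. It is a sketch under assumptions: the `AgentStep` type, the `runAgent` function, and the handoff wording are all hypothetical, and a real runner would also log the event and trigger the actual transfer.

```typescript
// Hypothetical step function: resolves to the final reply when the agent is
// done, or null when another step is needed.
type AgentStep = () => Promise<string | null>;

const HANDOFF_MESSAGE =
  "I wasn't able to finish this on my own. Let me transfer you to a specialist.";

async function runAgent(step: AgentStep, maxIterations = 15): Promise<string> {
  try {
    for (let i = 1; ; i++) {
      if (i > maxIterations) throw new Error('ITERATION_LIMIT_EXCEEDED');
      const reply = await step();
      if (reply !== null) return reply;
    }
  } catch (err) {
    if (err instanceof Error && err.message === 'ITERATION_LIMIT_EXCEEDED') {
      return HANDOFF_MESSAGE; // explicit fallback instead of a silent hang
    }
    throw err; // anything else is a real error for the caller
  }
}
```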

Schema Validation: Catch Bad Tool Calls Before They Hit the Wire

Schema drift is the other major class of silent agent failures. Your tool's expected inputs change, the agent keeps generating calls in the old format, and you get errors that look like API failures but are actually bad inputs from your own agent.

Schema drift is the leading cause of broken CX agent automations. The fix is a validation layer before the call reaches your external API.

validated-tool.ts·typescript

import { z } from 'zod';
 
const RefundSchema = z.object({
  orderId: z.string().regex(/^ORD-[0-9]+$/),
  amount: z.number().positive().max(10_000),
  reason: z.enum(['damaged', 'not-received', 'changed-mind', 'other']),
});
 
async function processRefund(rawInput: unknown) {
  // Validate before touching the external API
  const input = RefundSchema.parse(rawInput);
  return paymentApiBreaker.call(() => paymentApi.refund(input));
}

When your agent generates a malformed tool call (missing a field, wrong data type, invalid enum), the schema validator throws a specific, handleable error before any external call is made. You can feed the validation error back to the agent for a single correction attempt rather than letting the malformed call propagate.
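
The single correction attempt could be wired up roughly as follows. This sketch is dependency-free on purpose: `validate` stands in for something like Zod's `safeParse` mapped into the `Checked` shape, while `generate`, `execute`, `Checked`, and `callWithOneCorrection` are hypothetical names, not part of any library.

```typescript
// Result of validating raw tool arguments; `issues` are human-readable.
type Checked<T> = { ok: true; value: T } | { ok: false; issues: string[] };

// generate(feedback) asks the agent for tool arguments. On the second
// attempt, feedback carries the validation issues from the first.
async function callWithOneCorrection<T, R>(
  generate: (feedback: string[]) => Promise<unknown>,
  validate: (raw: unknown) => Checked<T>,
  execute: (input: T) => Promise<R>
): Promise<R> {
  let feedback: string[] = [];
  for (let attempt = 0; attempt < 2; attempt++) {
    const checked = validate(await generate(feedback));
    if (checked.ok) return execute(checked.value);
    feedback = checked.issues; // shown to the agent on the second attempt
  }
  // Two invalid attempts: stop here rather than loop on bad input
  throw new Error(`INVALID_TOOL_INPUT: ${feedback.join('; ')}`);
}
```

Capping the loop at two attempts keeps validation failures from becoming their own retry spiral.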

Versioning your schemas is the other half of this. When your tool's inputs change, publish a new version and notify the agent via its system prompt. The previous schema stays available for a deprecation window so you can migrate gradually rather than breaking everything at once.

For the full picture of how tool management fits into a production agent stack, the Chanl tools features cover versioned tool registration and schema enforcement. The production guardrails guide covers the broader defense-in-depth picture that circuit breakers slot into.

Making Circuit Breakers Observable

A circuit breaker you can't see is almost worse than none at all. When a circuit trips and you don't know it, your agent silently degrades. Customers get "I'm unable to process that right now" while you're looking at green dashboards.

Instrument every state transition:

observable-breaker.ts·typescript

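class ObservableCircuitBreaker extends CircuitBreaker {
  constructor(
    private circuitName: string,
    private observedConfig: CircuitBreakerConfig,
    private handlers: {
      onTrip?: (name: string, failureCount: number) => void;
      onRecover?: (name: string) => void;
      onRejection?: (name: string) => void;
    }
  ) {
    super(circuitName, observedConfig);
  }

  // Detect transitions by comparing state before and after each call
  async call<T>(fn: () => Promise<T>): Promise<T> {
    const before = this.getState();
    try {
      return await super.call(fn);
    } catch (err) {
      if (String((err as Error)?.message).startsWith('CIRCUIT_OPEN:')) {
        this.handlers.onRejection?.(this.circuitName);
      }
      throw err;
    } finally {
      const after = this.getState();
      // The base class keeps its failure list private, so report the threshold
      if (before !== 'OPEN' && after === 'OPEN') {
        this.handlers.onTrip?.(this.circuitName, this.observedConfig.failureThreshold);
      }
      if (before !== 'CLOSED' && after === 'CLOSED') {
        this.handlers.onRecover?.(this.circuitName);
      }
    }
  }
}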
 
const breaker = new ObservableCircuitBreaker(
  'payment-api',
  { failureThreshold: 3, successThreshold: 2, timeout: 30_000, windowMs: 60_000 },
  {
    onTrip: (name, count) =>
      alerts.critical(`Circuit tripped: ${name} after ${count} failures`),
    onRecover: (name) =>
      alerts.info(`Circuit recovered: ${name}`),
    onRejection: (name) =>
      metrics.increment('circuit_rejected', { circuit: name }),
  }
);

The minimum viable observability setup:

Alert (PagerDuty, Slack, whatever) when a circuit trips
Log every open-state rejection with the circuit name and timestamp
Dashboard showing current state for each circuit across all your agent's tools
Historical trip frequency and recovery time so you can spot chronic failures
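
For the last item, trip frequency and recovery time can be derived from a plain log of trip and recover events. The sketch below assumes such a log exists and is sorted by time. The `CircuitEvent` shape and the `summarizeTrips` function are hypothetical; in practice your metrics backend would likely compute this for you.

```typescript
// Hypothetical event record; `at` is a timestamp in epoch milliseconds.
type CircuitEvent = { circuit: string; type: 'trip' | 'recover'; at: number };

type TripSummary = { trips: number; recoveries: number; totalDowntimeMs: number };

// Summarize trips and total open time per circuit from a time-sorted log.
function summarizeTrips(events: CircuitEvent[]): Map<string, TripSummary> {
  const summary = new Map<string, TripSummary>();
  const openSince = new Map<string, number>();

  for (const e of events) {
    const s = summary.get(e.circuit) ?? { trips: 0, recoveries: 0, totalDowntimeMs: 0 };
    if (e.type === 'trip') {
      s.trips++;
      // Keep the earliest trip time if the circuit re-trips before recovering
      if (!openSince.has(e.circuit)) openSince.set(e.circuit, e.at);
    } else {
      const since = openSince.get(e.circuit);
      if (since !== undefined) {
        s.recoveries++;
        s.totalDowntimeMs += e.at - since;
        openSince.delete(e.circuit);
      }
    }
    summary.set(e.circuit, s);
  }
  return summary;
}
```

Dividing `totalDowntimeMs` by `recoveries` gives a mean time to recovery per circuit, which is the number that exposes chronic failures.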

If you're using Chanl's monitoring features, circuit state and trip events feed into the same dashboard as your conversation quality scores. When you're investigating why agent quality dropped overnight, you don't need to cross-reference two systems.

Rate Limit Handling as a Special Case

Rate limits deserve their own treatment because they're the most common failure mode and they have a specific shape: the API is healthy, it just wants you to slow down.

The right pattern for rate limits isn't a standard circuit breaker. It's a rate-aware retry with backoff that respects the Retry-After header when present.

rate-limit-handler.ts·typescript

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function callWithRateLimit<T>(
  fn: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Only rate limits are retried, and only while attempts remain
      if (!isRateLimitError(err) || attempt >= maxRetries) throw err;

      // Prefer the server's Retry-After; otherwise back off exponentially
      const delayMs = extractRetryAfter(err) ?? Math.pow(2, attempt) * 1000;
      await sleep(delayMs);
    }
  }
}

function isRateLimitError(err: unknown): boolean {
  return (err as any)?.status === 429;
}

// Retry-After is either a number of seconds or an HTTP date
function extractRetryAfter(err: unknown): number | null {
  const header = (err as any)?.headers?.['retry-after'];
  if (!header) return null;

  const seconds = Number(header);
  if (Number.isFinite(seconds)) return Math.max(0, seconds * 1000);

  const dateMs = Date.parse(header);
  return Number.isNaN(dateMs) ? null : Math.max(0, dateMs - Date.now());
}

The circuit breaker and rate-limit handler work in tandem: rate-limit retries handle the transient "slow down" signal, while the circuit breaker handles the sustained "this API is down" case. If you're getting rate-limited continuously, the retry handler will eventually exhaust its attempts and throw, which the circuit breaker then counts as a failure.
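
For that counting to work, the retry handler has to run inside the breaker, not around it. A minimal sketch of the composition, with the caveat that the `Breaker` and `Retry` types and the `guard` function are hypothetical stand-ins so the example is self-contained; in practice they would be the circuit breaker and rate-limit handler from this article.

```typescript
// Minimal shapes for the two pieces being composed.
interface Breaker {
  call<T>(fn: () => Promise<T>): Promise<T>;
}
type Retry = <T>(fn: () => Promise<T>) => Promise<T>;

// Retries run INSIDE the breaker, so one exhausted retry sequence
// counts as a single failure against the circuit.
function guard<T>(
  breaker: Breaker,
  retry: Retry,
  fn: () => Promise<T>
): Promise<T> {
  return breaker.call(() => retry(fn));
}
```

Reversing the order (retrying around the breaker) would count every individual attempt as a failure and, once the circuit opened, would have the retry loop hammering a breaker that rejects instantly.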

Testing Your Circuit Breakers Before 3 AM

Circuit breakers need testing before the incident, not during it. The pattern that works is chaos injection in staging: deliberately fail individual tools and verify that the agent handles the fast-fail gracefully without spiraling.

chaos-injection.ts·typescript

// Wrap tool calls with chaos in staging
function withChaos<T>(fn: () => Promise<T>, failRate: number): Promise<T> {
  if (process.env.NODE_ENV !== 'production' && Math.random() < failRate) {
    return Promise.reject(new Error('CHAOS_INJECTED'));
  }
  return fn();
}
 
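// 20% failure rate on the payment API in staging, injected inside the
// breaker so chaos failures count toward the trip threshold
async function processRefundWithChaos(orderId: string, amount: number) {
  return paymentApiBreaker.call(() =>
    withChaos(() => paymentApi.refund({ orderId, amount }), 0.2)
  );
}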

What to verify with chaos testing:

The agent produces a graceful message ("I'm unable to process refunds right now, let me connect you with a specialist") rather than looping or hanging
The circuit trips at the threshold you configured, not before or long after
The circuit recovers when chaos is disabled without manual intervention
Budget and iteration limits engage when they should

The scenarios feature in Chanl lets you run simulated customer conversations against your agent including tool failure scenarios. You can build a scenario where the payment API always returns 503 and verify the agent's response quality, not just that it doesn't crash.

Shadow deployment is the more comprehensive approach: route a small percentage of real traffic to a version of your agent with circuit breakers in open state and measure the fallback behavior with real inputs. This catches edge cases that staged chaos testing misses because real customer queries are messier than synthetic ones.

Default Thresholds to Start With

If you're starting from scratch, these thresholds are conservative and work for most CX agents:

Circuit Type	Failure Threshold	Window	Recovery Timeout
External API tools (payments, CRM)	3 failures	60s	30s
LLM API calls	5 failures	120s	60s
Internal data lookups	10 failures	60s	15s
Budget per session	3x expected cost	N/A	Hard stop
Iteration limit	15 steps	N/A	Hard stop

Tighten thresholds for tools that touch money or customer data. Loosen them for read-only lookups where false positives (circuit trips during a traffic spike) are more costly than a few extra failures.
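
These starting values can live in one typed config object so they are easy to audit and tune. The sketch below only restates the table; the names `BreakerThresholds`, `DEFAULT_THRESHOLDS`, and `HARD_LIMITS` are hypothetical, and `successThreshold` is left out because the table does not specify it.

```typescript
interface BreakerThresholds {
  failureThreshold: number; // failures within the window before tripping
  windowMs: number;         // rolling window length
  timeout: number;          // ms to stay open before probing recovery
}

type CircuitType = 'externalApi' | 'llmApi' | 'internalLookup';

// Starting values from the table, keyed by circuit type.
const DEFAULT_THRESHOLDS: Record<CircuitType, BreakerThresholds> = {
  externalApi: { failureThreshold: 3, windowMs: 60_000, timeout: 30_000 },
  llmApi: { failureThreshold: 5, windowMs: 120_000, timeout: 60_000 },
  internalLookup: { failureThreshold: 10, windowMs: 60_000, timeout: 15_000 },
};

// Hard stops are not failure-rate based, so they sit apart from the breakers.
const HARD_LIMITS = {
  budgetMultiplier: 3, // times the expected per-session cost
  maxIterations: 15,
} as const;
```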

For the observability layer that makes these thresholds useful in practice, see "What to monitor for AI agents". For where reactive monitoring alone falls short, see "Is monitoring your agent actually enough".

The Night Your Agent Breaks

Every production agent hits an outage eventually. A downstream API goes into maintenance. A schema changes without warning. A rate limit gets hit during a traffic spike. The question isn't whether your agent will face these conditions. It's whether you've given it the tools to fail gracefully when it does.

Circuit breakers don't make agents smarter. They make failures explicit, bounded, and recoverable. The developer who woke up to a $437 bill didn't need an agent with better reasoning. They needed a ceiling on what "keep trying" could cost.

Add the circuit breakers. Set the budget limits. Sleep soundly.


Key Takeaway

Testing edge cases before production deployment can reduce customer complaints by 80% and prevent costly emergency fixes post-launch.


Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
