At 11 PM, a developer's support agent started hitting a payment API that had gone into maintenance mode. The API returned 503. The agent retried. 503 again. The agent retried again. By 7 AM it had made thousands of identical calls, and the developer woke up to a $437 bill and zero successful transactions.
Nobody designed this failure. The agent was doing exactly what it was built to do: persist until the task completes. The missing ingredient wasn't smarter reasoning. It was a circuit breaker.
What Circuit Breakers Are and Why Agents Need Them
Circuit breakers protect agents from doing too much damage when an external system fails. The idea comes from distributed systems: wrap calls to unreliable downstream services in a layer that tracks failure rates. When failures exceed a threshold, the circuit trips. Subsequent calls fail immediately rather than waiting to time out or retry. After a recovery window, the circuit lets a test request through to check if the system is back.
The pattern has been standard in microservices for a decade. Most AI agent frameworks don't include it by default.
Agents are especially vulnerable without circuit breakers because they reason about failures and decide to retry. A well-instructed agent will try to complete its task. In the face of a persistent failure, that means it retries, re-evaluates, and retries again. The model isn't broken. The behavior is exactly what you asked for. But without a hard ceiling on iterations and a mechanism that stops calls to a broken downstream service, that good behavior becomes a runaway process by morning.
Datadog's 2026 State of AI Engineering found that 5% of all LLM call spans reported errors, and 60% of those errors came from rate limits. Rate limit errors are exactly the kind of transient, recoverable failure that causes agents to enter retry spirals. The agent sees "rate limited," interprets it as a temporary condition, and keeps trying.
The Three States
A circuit breaker moves through three states. Understanding them helps you tune the thresholds correctly.
Closed is normal operation. Every call goes through. The breaker tracks failures in a rolling window. This is the default state.
Open means the circuit has tripped. Every call fails immediately with a structured error, not a timeout. No waiting, no external requests. The agent gets a fast failure it can handle explicitly.
Half-open is the recovery probe. After a timeout, one request goes through. If it succeeds, the circuit closes. If it fails, it reopens for another recovery period.
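The three states and the moves between them can be written down as a small pure function. This is a sketch for reasoning about the state machine, not a full breaker; the event names (`failure_threshold_reached`, `timeout_elapsed`, `probe_succeeded`, `probe_failed`) are made up for illustration.

```typescript
type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

// Hypothetical event names, chosen only for this sketch
type BreakerEvent =
  | 'failure_threshold_reached'
  | 'timeout_elapsed'
  | 'probe_succeeded'
  | 'probe_failed';

// Returns the next state, or the same state if the event does not apply.
function nextState(state: State, event: BreakerEvent): State {
  if (state === 'CLOSED') {
    return event === 'failure_threshold_reached' ? 'OPEN' : state;
  }
  if (state === 'OPEN') {
    return event === 'timeout_elapsed' ? 'HALF_OPEN' : state;
  }
  // HALF_OPEN: the probe decides which way the circuit goes
  if (event === 'probe_succeeded') return 'CLOSED';
  if (event === 'probe_failed') return 'OPEN';
  return state;
}
```

Note that there is no direct path from OPEN to CLOSED: recovery always passes through a probe.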
The key insight for agent systems: open state should produce a structured error that your agent handles explicitly. If your agent's system prompt doesn't define what to do when a tool returns CIRCUIT_OPEN, the agent will try to work around it in unpredictable ways, often by retrying through a different code path that bypasses your circuit entirely.
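One way to make the open state explicit is to convert thrown errors into a result object before they reach the model. The sketch below assumes errors carry a `CIRCUIT_OPEN:<name>` message; the `ToolFailure` shape and its field names are hypothetical, not part of any framework.

```typescript
// Hypothetical result shape handed back to the agent instead of a raw exception
interface ToolFailure {
  ok: false;
  code: 'CIRCUIT_OPEN' | 'TOOL_ERROR';
  circuit?: string;
  retryable: boolean;
  message: string;
}

// Convert a thrown error into a result the agent can branch on.
function toToolFailure(err: unknown): ToolFailure {
  const text = err instanceof Error ? err.message : String(err);
  const match = /^CIRCUIT_OPEN:(.+)$/.exec(text);
  if (match) {
    return {
      ok: false,
      code: 'CIRCUIT_OPEN',
      circuit: match[1],
      retryable: false, // the breaker already decided retrying is pointless
      message: `The ${match[1]} service is temporarily unavailable. Do not retry; offer a handoff instead.`,
    };
  }
  return { ok: false, code: 'TOOL_ERROR', retryable: false, message: text };
}
```

The system prompt can then name the `CIRCUIT_OPEN` code and state what the agent should do when it sees it.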
Implementing a Circuit Breaker in TypeScript
Here's a straightforward implementation that covers the core states without heavy dependencies:
    type CircuitState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

    interface CircuitBreakerConfig {
      failureThreshold: number; // trips after N failures inside the window
      successThreshold: number; // closes after N successes in half-open
      timeout: number;          // ms to stay open before testing
      windowMs: number;         // rolling window for failure counting
    }

    class CircuitBreaker {
      private state: CircuitState = 'CLOSED';
      private failures: number[] = []; // timestamps of recent failures
      private successes = 0;
      private nextAttempt = 0;

      constructor(
        private name: string,
        private config: CircuitBreakerConfig
      ) {}

      async call<T>(fn: () => Promise<T>): Promise<T> {
        if (this.state === 'OPEN') {
          if (Date.now() < this.nextAttempt) {
            // Fail fast without touching the downstream service
            throw new Error(`CIRCUIT_OPEN:${this.name}`);
          }
          this.state = 'HALF_OPEN';
        }
        try {
          const result = await fn();
          this.onSuccess();
          return result;
        } catch (err) {
          this.onFailure();
          throw err;
        }
      }

      private onSuccess() {
        if (this.state === 'HALF_OPEN') {
          this.successes++;
          if (this.successes >= this.config.successThreshold) {
            this.reset();
          }
        } else {
          // A success while closed clears the failure history
          this.failures = [];
        }
      }

      private onFailure() {
        const now = Date.now();
        this.failures = this.failures.filter(
          (t) => now - t < this.config.windowMs
        );
        this.failures.push(now);
        // A failed probe in half-open reopens at once, whatever the window holds
        if (
          this.state === 'HALF_OPEN' ||
          this.failures.length >= this.config.failureThreshold
        ) {
          this.state = 'OPEN';
          this.nextAttempt = now + this.config.timeout;
          this.successes = 0;
        }
      }

      private reset() {
        this.state = 'CLOSED';
        this.failures = [];
        this.successes = 0;
      }

      getState() {
        return this.state;
      }
    }

You'd wrap each tool call:
    const paymentApiBreaker = new CircuitBreaker('payment-api', {
      failureThreshold: 3,
      successThreshold: 2,
      timeout: 30_000,  // 30 seconds open
      windowMs: 60_000, // 1-minute rolling window
    });

    async function processRefund(orderId: string, amount: number) {
      // paymentApi stands for the application's own payment client
      return paymentApiBreaker.call(() => paymentApi.refund({ orderId, amount }));
    }

Once the payment API has failed three times within a minute, processRefund throws CIRCUIT_OPEN:payment-api immediately for the next 30 seconds. No waiting, no retrying, no wasted tokens on the LLM reasoning about a broken downstream system.
What to Protect at Each Layer
Not everything needs the same configuration. Think in three layers.
Tool calls to external APIs (CRMs, payment processors, booking systems) deserve tight thresholds: 3 failures in 60 seconds, 30-second recovery timeout. These services are most likely to have extended outages, and they're usually the most expensive to keep retrying.
LLM API calls need a different approach. LLM failures are mostly rate limits and transient errors, so a slightly wider window makes sense: 5 failures in 120 seconds, 60-second recovery. You also want a budget-based breaker here, one that trips if total token spend for this session exceeds 3x your expected cost.
    class BudgetCircuitBreaker {
      private spent = 0;

      constructor(private maxTokens: number) {}

      // Call before each LLM request; throws once the ceiling would be crossed
      check(estimatedTokens: number) {
        if (this.spent + estimatedTokens > this.maxTokens) {
          throw new Error(
            `BUDGET_EXCEEDED: spent=${this.spent}, limit=${this.maxTokens}`
          );
        }
      }

      // Call after each LLM response with the actual usage
      record(tokensUsed: number) {
        this.spent += tokensUsed;
      }
    }

    // Per-session, initialized when the conversation starts
    const budget = new BudgetCircuitBreaker(
      EXPECTED_TOKENS_PER_SESSION * 3 // 3x ceiling
    );

Iteration limits are a third type. Not failure-rate-based, but execution-depth-based. Every agent loop needs a hard ceiling on how many steps it can take.
    const MAX_ITERATIONS = 15;
    let iterations = 0;
    let done = false; // set to true by the agent step when the task completes

    while (!done) {
      if (++iterations > MAX_ITERATIONS) {
        throw new Error('ITERATION_LIMIT_EXCEEDED');
      }
      // agent step
    }

Most agent frameworks have this as a setting. The default in many is either very high or disabled. Set it explicitly, and handle ITERATION_LIMIT_EXCEEDED in your error path so the agent can tell the customer it needs to transfer rather than silently hanging.
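Handling the limit in the error path might look like the sketch below. The `step` callback, the `TurnOutcome` shape, and the customer-facing wording are all assumptions for illustration; a real agent step would call the model and its tools.

```typescript
const LIMIT_ERROR = 'ITERATION_LIMIT_EXCEEDED';

// Runs steps until one reports completion or the ceiling is hit.
async function runLoop(
  step: () => Promise<boolean>, // hypothetical: resolves true when the task is done
  maxIterations: number
): Promise<void> {
  let iterations = 0;
  let done = false;
  while (!done) {
    if (++iterations > maxIterations) throw new Error(LIMIT_ERROR);
    done = await step();
  }
}

interface TurnOutcome {
  status: 'completed' | 'handoff';
  customerMessage?: string;
}

// Turns the limit error into an explicit handoff instead of a silent hang.
async function handleTurn(
  step: () => Promise<boolean>,
  maxIterations = 15
): Promise<TurnOutcome> {
  try {
    await runLoop(step, maxIterations);
    return { status: 'completed' };
  } catch (err) {
    if (err instanceof Error && err.message === LIMIT_ERROR) {
      return {
        status: 'handoff',
        customerMessage:
          "I wasn't able to finish this request, so I'm transferring you to a specialist.",
      };
    }
    throw err; // anything else is a real failure and should surface
  }
}
```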
Schema Validation: Catch Bad Tool Calls Before They Hit the Wire
Schema drift is the other major class of silent agent failures. Your tool's expected inputs change, the agent keeps generating calls in the old format, and you get errors that look like API failures but are actually bad inputs from your own agent.
Schema drift is among the most common causes of broken CX agent automations. The fix is a validation layer that runs before the call reaches your external API.
    import { z } from 'zod';

    const RefundSchema = z.object({
      orderId: z.string().regex(/^ORD-[0-9]+$/),
      amount: z.number().positive().max(10_000),
      reason: z.enum(['damaged', 'not-received', 'changed-mind', 'other']),
    });

    async function processRefund(rawInput: unknown) {
      // Validate before touching the external API
      const input = RefundSchema.parse(rawInput);
      return paymentApiBreaker.call(() => paymentApi.refund(input));
    }

When your agent generates a malformed tool call (missing a field, wrong data type, invalid enum), the schema validator throws a specific, handleable error before any external call is made. You can feed the validation error back to the agent for a single correction attempt rather than letting the malformed call propagate.
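A single correction attempt could be wired up roughly as follows. To stay dependency-free, this sketch takes the validator as a plain function (with Zod, it could wrap `RefundSchema.safeParse`); `askAgentToFix` is a hypothetical hook that sends the issues back to the model and returns its revised input.

```typescript
type Validation<T> =
  | { status: 'valid'; value: T }
  | { status: 'invalid'; issues: string[] };

async function callWithOneCorrection<T, R>(
  rawInput: unknown,
  validate: (input: unknown) => Validation<T>,
  askAgentToFix: (issues: string[]) => Promise<unknown>, // hypothetical hook
  execute: (input: T) => Promise<R>
): Promise<R> {
  let checked = validate(rawInput);
  if (checked.status === 'invalid') {
    // One retry only: send the issues back, then validate the corrected input
    const corrected = await askAgentToFix(checked.issues);
    checked = validate(corrected);
  }
  if (checked.status === 'invalid') {
    // Still malformed after one correction: stop before any external call
    throw new Error(`INVALID_TOOL_INPUT: ${checked.issues.join('; ')}`);
  }
  return execute(checked.value);
}
```

Capping the loop at one correction keeps a confused model from turning validation into another retry spiral.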
Versioning your schemas is the other half of this. When your tool's inputs change, publish a new version and notify the agent via its system prompt. The previous schema stays available for a deprecation window so you can migrate gradually rather than breaking everything at once.
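A deprecation window can be modeled as a small registry lookup. Everything here is illustrative: the `SchemaVersion` shape, the field lists, and the retirement date are invented for the sketch, and a real registry would hold full schemas rather than field names.

```typescript
interface SchemaVersion {
  version: number;
  requiredFields: string[];
  deprecatedAfter?: number; // epoch ms; undefined means still current
}

// Hypothetical registry: v1 stays callable until its retirement date
const refundSchemas: SchemaVersion[] = [
  { version: 1, requiredFields: ['orderId', 'amount'], deprecatedAfter: Date.UTC(2026, 5, 30) },
  { version: 2, requiredFields: ['orderId', 'amount', 'reason'] },
];

// Looks up a version and rejects it once its deprecation window has closed.
function resolveSchema(
  versions: SchemaVersion[],
  requested: number,
  now: number
): SchemaVersion {
  const found = versions.find((v) => v.version === requested);
  if (!found) throw new Error(`UNKNOWN_SCHEMA_VERSION:${requested}`);
  if (found.deprecatedAfter !== undefined && now > found.deprecatedAfter) {
    throw new Error(`SCHEMA_VERSION_RETIRED:${requested}`);
  }
  return found;
}
```

Passing `now` in as a parameter keeps the lookup easy to test against dates on either side of the window.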
For the full picture of how tool management fits into a production agent stack, the Chanl tools features cover versioned tool registration and schema enforcement. The production guardrails guide covers the broader defense-in-depth picture that circuit breakers slot into.
Making Circuit Breakers Observable
A circuit breaker you can't see is almost worse than none at all. When a circuit trips and you don't know it, your agent silently degrades. Customers get "I'm unable to process that right now" while you're looking at green dashboards.
Instrument every state transition:
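    class ObservableCircuitBreaker extends CircuitBreaker {
      constructor(
        private label: string,
        private cfg: CircuitBreakerConfig,
        private handlers: {
          onTrip?: (name: string, failureCount: number) => void;
          onRecover?: (name: string) => void;
          onRejection?: (name: string) => void;
        }
      ) {
        super(label, cfg);
      }

      // Compare state before and after each call to detect transitions
      async call<T>(fn: () => Promise<T>): Promise<T> {
        const before = this.getState();
        try {
          const result = await super.call(fn);
          if (before !== 'CLOSED' && this.getState() === 'CLOSED') this.handlers.onRecover?.(this.label);
          return result;
        } catch (err) {
          const rejected = err instanceof Error && err.message.startsWith('CIRCUIT_OPEN:');
          if (rejected) this.handlers.onRejection?.(this.label);
          // The base class keeps its failure list private, so report the configured threshold
          else if (this.getState() === 'OPEN') this.handlers.onTrip?.(this.label, this.cfg.failureThreshold);
          throw err;
        }
      }
    }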
    // alerts and metrics stand for whatever alerting and metrics clients you already use
    const breaker = new ObservableCircuitBreaker(
      'payment-api',
      { failureThreshold: 3, successThreshold: 2, timeout: 30_000, windowMs: 60_000 },
      {
        onTrip: (name, count) =>
          alerts.critical(`Circuit tripped: ${name} after ${count} failures`),
        onRecover: (name) =>
          alerts.info(`Circuit recovered: ${name}`),
        onRejection: (name) =>
          metrics.increment('circuit_rejected', { circuit: name }),
      }
    );

The minimum viable observability setup:
- Alert (PagerDuty, Slack, whatever) when a circuit trips
- Log every open-state rejection with the circuit name and timestamp
- Dashboard showing current state for each circuit across all your agent's tools
- Historical trip frequency and recovery time so you can spot chronic failures
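The last item, trip frequency and recovery time, can be computed from a plain log of transitions. The sketch below assumes you record `trip` and `recover` events with timestamps; the `CircuitEvent` shape is hypothetical.

```typescript
interface CircuitEvent {
  circuit: string;
  type: 'trip' | 'recover';
  at: number; // epoch ms
}

// Summarize trips and mean time-to-recovery for one circuit.
function summarize(events: CircuitEvent[], circuit: string) {
  const sorted = events
    .filter((e) => e.circuit === circuit)
    .sort((a, b) => a.at - b.at);
  let trips = 0;
  let openedAt: number | null = null;
  const recoveries: number[] = [];
  for (const e of sorted) {
    if (e.type === 'trip') {
      // A re-trip after a failed probe counts as the same outage
      if (openedAt === null) {
        trips++;
        openedAt = e.at;
      }
    } else if (openedAt !== null) {
      recoveries.push(e.at - openedAt);
      openedAt = null;
    }
  }
  const meanRecoveryMs = recoveries.length
    ? recoveries.reduce((sum, ms) => sum + ms, 0) / recoveries.length
    : null;
  return { trips, meanRecoveryMs, currentlyOpen: openedAt !== null };
}
```

A circuit with frequent trips and short recoveries usually points at a flaky dependency; rare trips with long recoveries point at real outages.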
If you're using Chanl's monitoring features, circuit state and trip events feed into the same dashboard as your conversation quality scores. When you're investigating why agent quality dropped overnight, you don't need to cross-reference two systems.
Rate Limit Handling as a Special Case
Rate limits deserve their own treatment because they're the most common failure mode and they have a specific shape: the API is healthy, it just wants you to slow down.
The right pattern for rate limits isn't a standard circuit breaker. It's a rate-aware retry with backoff that respects the Retry-After header when present.
    const sleep = (ms: number) =>
      new Promise<void>((resolve) => setTimeout(resolve, ms));

    async function callWithRateLimit<T>(
      fn: () => Promise<T>,
      maxRetries = 3
    ): Promise<T> {
      for (let attempt = 0; ; attempt++) {
        try {
          return await fn();
        } catch (err) {
          // Only rate limits are retried; give up once retries are exhausted
          if (!isRateLimitError(err) || attempt >= maxRetries) throw err;
          // Prefer the server's Retry-After; otherwise back off 1s, 2s, 4s, ...
          const delayMs = extractRetryAfter(err) ?? Math.pow(2, attempt) * 1000;
          await sleep(delayMs);
        }
      }
    }

    function isRateLimitError(err: unknown): boolean {
      return (err as any)?.status === 429;
    }

    function extractRetryAfter(err: unknown): number | null {
      const header = (err as any)?.headers?.['retry-after'];
      const seconds = Number.parseInt(header, 10);
      // Retry-After may also be an HTTP date; in that case fall back to backoff
      return Number.isNaN(seconds) ? null : seconds * 1000;
    }

The circuit breaker and rate-limit handler work in tandem: rate-limit retries handle the transient "slow down" signal, while the circuit breaker handles the sustained "this API is down" case. If you're getting rate-limited continuously, the retry handler will eventually exhaust its attempts and throw. Provided the breaker wraps the retry handler rather than the other way around, the breaker then counts that as a single failure.
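The nesting order matters for how failures are counted. As a minimal sketch, with both layers reduced to a generic `Wrapper` type (an assumption made so the example stands alone), the breaker goes on the outside and the retry handler on the inside:

```typescript
// Any function that runs an async operation under some policy
type Wrapper = <T>(fn: () => Promise<T>) => Promise<T>;

// Breaker outside, retry inside: the breaker sees one failure per
// exhausted retry sequence, not one per individual 429.
function protect(breaker: Wrapper, retry: Wrapper): Wrapper {
  return <T>(fn: () => Promise<T>) => breaker(() => retry(fn));
}
```

With the classes from this article, that would be something like `protect((fn) => paymentApiBreaker.call(fn), (fn) => callWithRateLimit(fn))`. Reversing the order would make the retry handler loop over `CIRCUIT_OPEN` rejections, which it ignores anyway because they are not 429s, and would let every individual 429 count toward the trip threshold.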
Testing Your Circuit Breakers Before 3 AM
Circuit breakers need testing before the incident, not during it. The pattern that works is chaos injection in staging: deliberately fail individual tools and verify that the agent handles the fast-fail gracefully without spiraling.
    // Wrap tool calls with chaos in any non-production environment, such as staging
    function withChaos<T>(fn: () => Promise<T>, failRate: number): Promise<T> {
      if (process.env.NODE_ENV !== 'production' && Math.random() < failRate) {
        return Promise.reject(new Error('CHAOS_INJECTED'));
      }
      return fn();
    }

    // 20% failure rate on payment API in staging
    const stagingPaymentResult = withChaos(
      () => paymentApi.refund({ orderId, amount }),
      0.2
    );

What to verify with chaos testing:
- The agent produces a graceful message ("I'm unable to process refunds right now, let me connect you with a specialist") rather than looping or hanging
- The circuit trips at the threshold you configured, not before or long after
- The circuit recovers when chaos is disabled without manual intervention
- Budget and iteration limits engage when they should
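Random failure rates make the "trips at the threshold, not before" check hard to reproduce. One deterministic alternative is a scripted injector that fails specific calls by position. This is a sketch; `withScriptedFailures` and the `CHAOS_INJECTED:call-N` message format are invented for illustration.

```typescript
// Fails every call whose 1-based position is listed in `failOn`;
// all other calls are passed through to `fn`.
function withScriptedFailures<T>(
  fn: () => Promise<T>,
  failOn: number[]
): () => Promise<T> {
  let calls = 0;
  return () => {
    calls++;
    if (failOn.indexOf(calls) !== -1) {
      return Promise.reject(new Error(`CHAOS_INJECTED:call-${calls}`));
    }
    return fn();
  };
}
```

Failing calls 1 through 3 against a breaker with a threshold of 3 should leave it closed after the second call and open after the third, which gives the test an exact point to assert on.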
The scenarios feature in Chanl lets you run simulated customer conversations against your agent including tool failure scenarios. You can build a scenario where the payment API always returns 503 and verify the agent's response quality, not just that it doesn't crash.
Shadow deployment is the more comprehensive approach: route a small percentage of real traffic to a version of your agent with circuit breakers in open state and measure the fallback behavior with real inputs. This catches edge cases that staged chaos testing misses because real customer queries are messier than synthetic ones.
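Routing a fixed share of traffic is usually done by hashing a stable identifier, so the same session always lands in the same cohort. The sketch below uses an FNV-1a style hash over UTF-16 code units; the function names and the choice of session ID as the key are assumptions.

```typescript
// FNV-1a style 32-bit hash over the string's UTF-16 code units
function fnv1a(text: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0; // force an unsigned 32-bit result
}

// True when this session falls in the shadow cohort.
// `percent` is a whole number from 0 to 100.
function inShadowCohort(sessionId: string, percent: number): boolean {
  return fnv1a(sessionId) % 100 < percent;
}
```

Because the decision depends only on the session ID, a conversation never switches between the normal and shadow agent partway through.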
Default Thresholds to Start With
If you're starting from scratch, these thresholds are conservative and work for most CX agents:
| Circuit Type | Failure Threshold | Window | Recovery Timeout |
|---|---|---|---|
| External API tools (payments, CRM) | 3 failures | 60s | 30s |
| LLM API calls | 5 failures | 120s | 60s |
| Internal data lookups | 10 failures | 60s | 15s |
| Budget per session | 3x expected cost | N/A | Hard stop |
| Iteration limit | 15 steps | N/A | Hard stop |
Tighten thresholds for tools that touch money or customer data. Loosen them for read-only lookups where false positives (circuit trips during a traffic spike) are more costly than a few extra failures.
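Expressed as configuration, the table above might look like this. The key names (`externalApi`, `llmApi`, `internalLookup`) and the `HARD_STOPS` object are hypothetical; the numbers are the ones from the table. The half-open success threshold is not covered by the table, so it is left out here.

```typescript
interface BreakerThresholds {
  failureThreshold: number;
  windowMs: number; // rolling window for failure counting
  timeout: number;  // recovery timeout in ms
}

// Starting points from the table; tune per tool
const DEFAULT_THRESHOLDS: Record<
  'externalApi' | 'llmApi' | 'internalLookup',
  BreakerThresholds
> = {
  externalApi:    { failureThreshold: 3,  windowMs: 60_000,  timeout: 30_000 },
  llmApi:         { failureThreshold: 5,  windowMs: 120_000, timeout: 60_000 },
  internalLookup: { failureThreshold: 10, windowMs: 60_000,  timeout: 15_000 },
};

// Hard stops have no window or recovery: once hit, the session ends or hands off
const HARD_STOPS = {
  budgetMultiplier: 3, // 3x expected session cost
  maxIterations: 15,
};
```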
For the observability layer that makes these thresholds useful in practice, see "what to monitor for AI agents". For where reactive monitoring alone falls short, see "is monitoring your agent actually enough".
The Night Your Agent Breaks
Every production agent hits an outage eventually. A downstream API goes into maintenance. A schema changes without warning. A rate limit gets hit during a traffic spike. The question isn't whether your agent will face these conditions. It's whether you've given it the tools to fail gracefully when it does.
Circuit breakers don't make agents smarter. They make failures explicit, bounded, and recoverable. The developer who woke up to a $437 bill didn't need an agent with better reasoning. They needed a ceiling on what "keep trying" could cost.
Add the circuit breakers. Set the budget limits. Sleep soundly.