Your AI agent looks fine from the outside. Latency is normal. API error rate is zero. Uptime is 100%. Meanwhile, it's confidently telling customers their refund arrives in 3-5 business days when your policy is 7-10, promising callbacks that never get scheduled, and calling the wrong tool in 8% of interactions.
Traditional SRE doesn't catch this. It never had to. For a stateless API returning data from a database, availability and latency tell you almost everything you need to know. For an AI agent, they tell you almost nothing about the failures that matter.
This article is a practical SRE playbook for AI agents: the new SLIs you need, how to set SLOs that are actually useful, and how to use error budgets to control agent autonomy before problems reach customers.
Why Traditional SRE Metrics Miss AI Failures
Classic SRE is built around three signals: availability (is the service responding?), latency (how fast?), and error rate (are requests succeeding?). These are necessary. They're not sufficient for systems that can be wrong while appearing healthy.
Consider the failure modes that don't show up in traditional monitoring:
- The agent returns a 200 with a hallucinated policy that no human agreed to
- The agent calls the correct tool but with a malformed parameter that silently returns empty data
- The agent resolves Tier 1 issues correctly but misroutes 40% of Tier 2 issues to the wrong team
- The agent's task success rate drifts from 94% to 81% over three weeks. Too gradual to trigger any alert
None of these register as errors in APM. They register as customer complaints, churn, and a quiet degradation in metrics that matter.
Datadog's 2026 State of AI Engineering found that most production incidents from AI agents are discovered by users, not by the team that built the agent. That's not a model failure. It's an SRE failure. The absence of the right measurements.
The Five SLIs That Actually Matter
A Service Level Indicator is a measurable quantity that represents the quality of your service from the user's perspective. For AI agents in CX, you need five categories, not just the traditional availability, latency, and error rate.
TaskSuccessRate is the most important. Of all tasks your agent attempted, what fraction did it complete correctly? For a customer service agent, "correct" means the customer's issue was resolved without requiring human escalation. This is expensive to measure in real time, which is why most teams skip it. Don't skip it. Measure it with an LLM judge on a representative sample if you can't cover everything.
ToolCallAccuracy measures whether the agent called the right tools with valid parameters. An agent that calls get_account_balance when it should have called get_credit_limit is wrong even if both calls succeed. Measure this per tool, because accuracy varies by tool complexity. See ai-agent-tools-mcp-openapi-tool-management-guide for how tool description quality directly affects this metric. A poorly written tool description degrades call accuracy before you ever look at a model.
HallucinationRate is the percentage of responses containing claims not grounded in the agent's retrieved context or memory. Measure this with an evaluator that checks each factual claim against the source documents. Track it separately for factual claims (policy details, account data) and procedural claims (steps to resolve an issue). They have different downstream costs and different root causes.
CostPerTask measures your agent's efficiency. Token cost and API spend per successful task completion. An agent hitting 94% task success while burning $0.40 per task when you budgeted $0.15 is operationally broken, even if it looks reliable by quality metrics.
P99 End-to-End Latency is the traditional metric with a CX-specific definition: measure wall-clock time from user input to complete response, including every tool call. Your LLM's TTFT is not the customer's experience. An agent making four tool calls before responding might have excellent TTFT and terrible perceived responsiveness. See voice-ai-pipeline-stt-tts-latency-budget for how this stacks in voice pipelines specifically.
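As a rough sketch, the five SLIs above can be aggregated from logged sessions along these lines. The `SessionRecord` shape and its field names are hypothetical (they are not a Chanl schema), and the sketch assumes each session has already been judged for success and groundedness by whatever evaluator you use.

```typescript
// Hypothetical per-session record; field names are illustrative only.
interface SessionRecord {
  resolvedWithoutEscalation: boolean; // judged task success
  toolCalls: number;                  // total tool calls made
  correctToolCalls: number;           // right tool, valid parameters
  hasUngroundedClaim: boolean;        // flagged by the hallucination evaluator
  costUsd: number;                    // tokens plus external API spend
  latencyMs: number;                  // wall-clock, user input to complete response
}

interface SliSnapshot {
  taskSuccessRate: number;
  toolCallAccuracy: number;
  hallucinationRate: number;
  costPerTaskUsd: number;
  p99LatencyMs: number;
}

// Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function computeSlis(sessions: SessionRecord[]): SliSnapshot {
  const n = sessions.length;
  const successes = sessions.filter((s) => s.resolvedWithoutEscalation).length;
  const toolCalls = sessions.reduce((sum, s) => sum + s.toolCalls, 0);
  const correctCalls = sessions.reduce((sum, s) => sum + s.correctToolCalls, 0);
  const hallucinated = sessions.filter((s) => s.hasUngroundedClaim).length;
  const totalCost = sessions.reduce((sum, s) => sum + s.costUsd, 0);
  return {
    taskSuccessRate: n === 0 ? 0 : successes / n,
    toolCallAccuracy: toolCalls === 0 ? 1 : correctCalls / toolCalls,
    hallucinationRate: n === 0 ? 0 : hallucinated / n,
    // All spend is divided by successful tasks only, so failed runs still count as cost.
    costPerTaskUsd: successes === 0 ? Infinity : totalCost / successes,
    p99LatencyMs: percentile(sessions.map((s) => s.latencyMs), 99),
  };
}
```

Dividing total spend by successful tasks, rather than by all tasks, is a deliberate choice here: it matches the "per successful task completion" definition and makes a cheap but unreliable agent look as expensive as it really is.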
Here's how you'd define these as a scorecard the agent gets evaluated against on every run:
```typescript
// Assumes the Chanl SDK client class is already imported; the package name is not given in this article.
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });

const scorecard = await chanl.scorecards.create({
  name: 'CX Agent SLIs',
  description: 'Five SLIs for cx-support-v2: success, tool accuracy, hallucination, cost, latency.',
  criteria: [
    {
      name: 'TaskSuccessRate',
      rubric: 'Did the agent fully resolve the customer issue without human escalation?',
      weight: 0.35,
      targetScore: 94,
    },
    {
      name: 'ToolCallAccuracy',
      rubric: 'Were the right tools called with valid parameters that returned non-empty results?',
      weight: 0.2,
      targetScore: 92,
    },
    {
      name: 'HallucinationRate',
      rubric: 'Flag responses with claims not grounded in memory, retrieved context, or tool outputs.',
      weight: 0.25,
      targetScore: 96,
    },
    {
      name: 'CostPerTask',
      rubric: 'Sum tokens and external API spend per resolved task. Flag runs above 0.18 USD.',
      weight: 0.1,
      targetScore: 90,
    },
    {
      name: 'P99Latency',
      rubric: 'Wall-clock from user input to complete response, including tool calls. Flag above 4500ms.',
      weight: 0.1,
      targetScore: 90,
    },
  ],
});
```
Setting SLOs: Start From Business Impact
An SLO is a commitment: this agent will maintain X quality level Y% of the time over window Z. Setting them wrong in either direction is costly. Too loose, your SLO catches nothing useful. Too tight, you're constantly in breach and alert fatigue kills your team's ability to respond.
Two principles that work in practice:
Set SLOs from business impact, not technical intuition. A 4% HallucinationRate might sound acceptable until you calculate that at 10,000 conversations per day, that's 400 customers per day receiving incorrect information. Calculate the downstream consequence first (wrong refund window, wrong cancellation terms, wrong escalation routing) and then set the threshold based on how many of those you can afford.
Tighten SLOs as you learn. Your first SLO for a new agent should sit 10-15% looser than your observed baseline (a lower floor for success rates, a higher ceiling for hallucination and cost), so you're within SLO at current performance. You don't know what "good" looks like yet. Run for 30 days, observe the natural distribution, and tighten. Review every 90 days: are your current SLOs catching real problems or just generating noise?
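Both principles reduce to a little arithmetic, sketched below. The function names are illustrative, and `initialCeiling` reads the 10-15% margin as slack relative to the observed baseline for a "lower is better" SLI such as HallucinationRate, which is an assumption about how to apply the guidance.

```typescript
// Expected number of customers per day who receive incorrect information.
function affectedPerDay(conversationsPerDay: number, hallucinationRate: number): number {
  return conversationsPerDay * hallucinationRate;
}

// Work backwards from impact: the highest rate that keeps daily harm
// within what the business has decided it can absorb.
function maxTolerableRate(conversationsPerDay: number, tolerableAffectedPerDay: number): number {
  return tolerableAffectedPerDay / conversationsPerDay;
}

// First SLO ceiling for a "lower is better" SLI, set looser than the baseline.
// marginFraction = 0.15 means 15% of slack relative to the observed value.
function initialCeiling(observedBaseline: number, marginFraction: number): number {
  return observedBaseline * (1 + marginFraction);
}

// The article's example: 4% of 10,000 conversations per day.
const harmed = affectedPerDay(10000, 0.04);

// If the business can absorb 100 wrong answers per day, the ceiling is 1%.
const ceilingFromImpact = maxTolerableRate(10000, 100);

// Starting point for a new agent whose observed hallucination rate is 4%.
const startingSlo = initialCeiling(0.04, 0.15);
```

When the impact-derived ceiling is tighter than the baseline-derived one, as it is in this example, that gap is the size of the improvement the agent needs before the SLO you actually want becomes achievable.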
Different agent types need different targets. A billing agent and a FAQ agent don't have the same risk profile:
| Agent Type | TaskSuccessRate SLO | HallucinationRate SLO | Notes |
|---|---|---|---|
| FAQ / General Support | 90% | 6% | High volume, lower stakes |
| Account Management | 94% | 3% | Financial implications per error |
| Billing / Payments | 97% | 1% | Every error has a direct cost |
| Medical / Compliance | 98% | 0.5% | Regulatory risk, conservative targets |
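One way to make the table enforceable is to keep it as config and check observed SLIs against it. This is a minimal sketch: the `AgentType` keys and the `sloBreaches` helper are hypothetical names, and the numbers are copied from the table above.

```typescript
type AgentType = 'faq' | 'account' | 'billing' | 'medical';

interface SloTarget {
  minTaskSuccessRate: number;   // floor: observed value must be at or above this
  maxHallucinationRate: number; // ceiling: observed value must be at or below this
}

// Targets from the table, expressed as fractions.
const sloByAgentType: Record<AgentType, SloTarget> = {
  faq: { minTaskSuccessRate: 0.9, maxHallucinationRate: 0.06 },
  account: { minTaskSuccessRate: 0.94, maxHallucinationRate: 0.03 },
  billing: { minTaskSuccessRate: 0.97, maxHallucinationRate: 0.01 },
  medical: { minTaskSuccessRate: 0.98, maxHallucinationRate: 0.005 },
};

interface ObservedSlis {
  taskSuccessRate: number;
  hallucinationRate: number;
}

// Returns the names of the SLIs currently outside their SLO for this agent type.
function sloBreaches(agentType: AgentType, observed: ObservedSlis): string[] {
  const target = sloByAgentType[agentType];
  const breaches: string[] = [];
  if (observed.taskSuccessRate < target.minTaskSuccessRate) breaches.push('TaskSuccessRate');
  if (observed.hallucinationRate > target.maxHallucinationRate) breaches.push('HallucinationRate');
  return breaches;
}
```

The same observed numbers can pass for one agent type and breach for another: 96% task success is comfortably inside the FAQ target and outside the billing one.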
Chanl's Scorecards feature lets you define custom evaluation rubrics per agent and track SLI trends over time. The rubric is what makes "task success" meaningful for your specific agent. Not a generic accuracy metric, but the actual criteria that reflect your business definition of a good outcome.
Error Budgets as Autonomy Currency
Error budgets for AI agents serve two purposes at once: they tell you when you're failing, and they control what your agent is allowed to do. Most implementations only use the first. The second is where the real protection lives.
The traditional SRE use of error budgets is simple. The budget is the failure your SLO permits: a 95% SLO leaves 5% of tasks free to fail. If your SLO is 95% and you're at 97%, you have budget to spend on risky deployments and experiments. When the budget runs low, you slow down changes.
For AI agents, extend this to autonomy. An agent with a healthy error budget can take autonomous actions. Send that email, make that callback, update that CRM field. An agent burning through its budget faster than sustainable needs human oversight before taking actions above a certain risk threshold.
This is how you avoid the compound failure: a degrading agent taking autonomous actions at scale, causing downstream damage that's harder to reverse than the original regression.
The autonomy policy itself is a piece of config your runtime owns. Here's the shape worth standardizing on:
```typescript
type AutonomyState = 'healthy' | 'warning' | 'critical' | 'exhausted';

const policy: Record<AutonomyState, {
  trigger: string;
  allowedActions: Array<'autonomous' | 'draft' | 'escalate'>;
  alerts: string[];
  freezeDeployments?: boolean;
  circuitBreak?: boolean;
}> = {
  healthy: {
    trigger: 'burnRate < 1.0',
    allowedActions: ['autonomous', 'draft', 'escalate'],
    alerts: [],
  },
  warning: {
    trigger: 'burnRate >= 1.0',
    allowedActions: ['draft', 'escalate'],
    alerts: ['#agents-oncall'],
  },
  critical: {
    trigger: 'burnRate >= 5.0',
    allowedActions: ['escalate'],
    alerts: ['#agents-oncall', '#incidents'],
    freezeDeployments: true,
  },
  exhausted: {
    trigger: 'budgetRemaining <= 0',
    allowedActions: [],
    alerts: ['#agents-oncall', '#incidents', 'pagerduty'],
    circuitBreak: true,
  },
};
```
Drive the transitions from your SLI scorecard results. Pull per-session results via scorecards.listResults({ scorecardId }), aggregate, compute burn rate, transition the agent between states.
A burn rate of 1.0 means you're consuming your error budget exactly as fast as it replenishes. A burn rate of 5.0 means you'll exhaust your 30-day budget in 6 days. When you see 5x burn, you have 6 days to fix something before your agent should stop making autonomous decisions. That's the operational signal that traditional APM will never surface.
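The burn-rate arithmetic can be sketched as follows. It uses the standard definition (observed failure rate divided by the failure rate the SLO permits); the function names are illustrative, and a 30-day window is assumed to match the example above.

```typescript
// Error budget: the fraction of tasks allowed to fail under the SLO.
function errorBudget(sloTarget: number): number {
  return 1 - sloTarget;
}

// Burn rate: observed failure rate relative to the budgeted failure rate.
// 1.0 spends the budget exactly over the window; above 1.0 exhausts it early.
function burnRate(observedFailureRate: number, sloTarget: number): number {
  return observedFailureRate / errorBudget(sloTarget);
}

// Days until a full budget is gone if the current burn rate holds.
function daysToExhaustion(rate: number, windowDays: number): number {
  return rate <= 0 ? Infinity : windowDays / rate;
}

// Fraction of the window's budget still unspent (negative once overspent).
function budgetRemaining(failedTasks: number, totalTasks: number, sloTarget: number): number {
  const allowedFailures = totalTasks * errorBudget(sloTarget);
  return allowedFailures === 0 ? 0 : 1 - failedTasks / allowedFailures;
}
```

With a 95% SLO, a 25% observed failure rate is a burn rate of 5, and `daysToExhaustion(5, 30)` gives the 6 days quoted above. A 100% SLO leaves no budget at all, so the burn rate is undefined; the sketch does not guard against that case.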
The autonomy policy is the key addition. Instead of just alerting when quality drops, you constrain what the agent is allowed to do in response to quality drops. The degrading agent stops taking autonomous actions. It starts asking for human approval. This prevents one bad model version from silently causing weeks of customer harm.
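A minimal sketch of the enforcement side, assuming the same thresholds as the policy triggers (burn rate 1.0 and 5.0, budget at or below zero). The `resolveState` and `gateAction` helpers are hypothetical, and the types are repeated so the sketch stands alone.

```typescript
type AutonomyState = 'healthy' | 'warning' | 'critical' | 'exhausted';
type Action = 'autonomous' | 'draft' | 'escalate';

// Ordered from most to least autonomous within each state.
const allowedActions: Record<AutonomyState, Action[]> = {
  healthy: ['autonomous', 'draft', 'escalate'],
  warning: ['draft', 'escalate'],
  critical: ['escalate'],
  exhausted: [],
};

// Most severe condition wins, so an exhausted budget overrides any burn rate.
function resolveState(burnRate: number, budgetRemaining: number): AutonomyState {
  if (budgetRemaining <= 0) return 'exhausted';
  if (burnRate >= 5.0) return 'critical';
  if (burnRate >= 1.0) return 'warning';
  return 'healthy';
}

// Returns the action the agent may actually take. A disallowed request is
// downgraded to the most autonomous action still permitted, or blocked
// entirely when the circuit is open.
function gateAction(requested: Action, state: AutonomyState): Action | 'blocked' {
  const allowed = allowedActions[state];
  if (allowed.indexOf(requested) !== -1) return requested;
  return allowed.length > 0 ? allowed[0] : 'blocked';
}
```

Calling `gateAction` at the point where the agent is about to act, rather than at planning time, means a state change takes effect on the very next action instead of the next conversation.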
Building Your Agent SRE Dashboard
Your SRE dashboard needs to surface six things at a glance:
- Current SLI values vs. SLO targets, the red/green health view
- Error budget remaining per SLI, expressed as percentage and days-remaining
- Burn rate trend over the last 24 hours. Is it accelerating?
- Current autonomy mode: autonomous, draft, escalate, or circuit-broken
- Recent alert history. When did the last burn-rate alert fire?
- Top contributing sessions. Which specific conversations are burning the most budget?
The last item is the most actionable and the least common. When HallucinationRate is burning fast, it's not enough to know that it's burning. You want to know which sessions are driving it and what they share. A cluster of hallucinations on refund policy questions tells you something specific to fix. "HallucinationRate is at 6%" without the session breakdown just tells you something is wrong, not what.
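The session breakdown described above can be approximated with a simple group-by. This sketch assumes each session already carries a `topic` label (from a classifier, tags, or manual review); that field and the `topContributors` helper are hypothetical.

```typescript
interface FlaggedSession {
  sessionId: string;
  topic: string;         // assumed label, e.g. 'refund-policy'
  hallucinated: boolean; // flagged by the hallucination evaluator
}

interface TopicBurn {
  topic: string;
  failures: number;
  total: number;
  shareOfFailures: number; // fraction of all flagged sessions in this topic
}

// Ranks topics by how many flagged sessions they contribute.
function topContributors(sessions: FlaggedSession[], limit: number): TopicBurn[] {
  const counts: Record<string, { failures: number; total: number }> = {};
  for (const s of sessions) {
    const entry = counts[s.topic] || (counts[s.topic] = { failures: 0, total: 0 });
    entry.total += 1;
    if (s.hallucinated) entry.failures += 1;
  }
  const totalFailures = sessions.filter((s) => s.hallucinated).length;
  return Object.keys(counts)
    .map((topic) => ({
      topic,
      failures: counts[topic].failures,
      total: counts[topic].total,
      shareOfFailures: totalFailures === 0 ? 0 : counts[topic].failures / totalFailures,
    }))
    .filter((t) => t.failures > 0)
    .sort((a, b) => b.failures - a.failures)
    .slice(0, limit);
}
```

Ranking by share of failures rather than by per-topic failure rate keeps the list focused on what is spending the budget; a rare topic that always fails is better caught by a per-cluster rate check.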
Chanl's Monitoring dashboard surfaces per-session contributions to SLI metrics so you can drill from "HallucinationRate trending up" to "12 sessions this morning, all asking about promotional pricing, are the source." That drill-path is how SRE becomes useful rather than decorative.

Incident Response for AI Agents
Traditional incident response assumes a binary: the service is up or down. AI agent incidents are messier. Your agent is up, fast, and returning 200s. It's just wrong in a particular way, at a particular rate, for a particular input class.
Your runbook needs three failure scenarios:
Gradual degradation. SLI trending down over days, burn rate above 1.0 but not alarming. Response: pull the contributing sessions for the last 48 hours, identify the common thread. Is it a new class of questions the agent hasn't seen? A tool whose response format changed? A prompt that's drifting because of a deployment three days ago? Gradual degradation is the incident type you'd never catch without SLOs. It's the slow leak.
Sudden regression. SLI drops sharply after a deployment, burn rate spikes above 3.0. Response: check the deployment log. If the regression started within 30 minutes of a deployment, roll back. If not, check whether a third-party dependency changed (tool API format update, CRM schema change, model provider version bump). Sudden regressions look like model failures but are often integration failures in disguise.
Systematic failure mode. The agent is correct on average but wrong for a specific input cluster. Your burn rate might look okay: a scenario that represents 10% of volume and fails 15% of the time adds only 1.5 points to the overall failure rate. But inside that scenario, roughly one customer in seven gets a wrong outcome. Response: add a targeted override or circuit breaker for the identified scenario, deploy the fix, and then add the failure case to your Scenarios test suite so it's permanently covered. This is the incident type that keeps compounding if you don't also add the test.
For all three, the first step is identical: pull the contributing sessions, read the transcripts, and understand the pattern. Agent SRE is investigative work, not just metric watching. That's what makes it different from traditional SRE, and what makes it more directly connected to customer outcomes.
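The three scenarios can be encoded as a first-pass triage helper. The burn-rate thresholds (above 3.0, above 1.0) and the 30-minute deployment window come from the runbook text; the `clusterFailureThreshold` parameter and all names are assumptions for illustration, and the output is a starting point for investigation, not a verdict.

```typescript
type IncidentType = 'sudden-regression' | 'gradual-degradation' | 'systematic-failure' | 'none';

interface TriageInput {
  burnRate: number;
  // Minutes between the last deployment and the start of the regression,
  // or null when no recent deployment exists.
  minutesFromDeployToRegression: number | null;
  // Highest failure rate observed in any single input cluster.
  worstClusterFailureRate: number;
}

interface Triage {
  type: IncidentType;
  nextStep: string;
}

function triage(input: TriageInput, clusterFailureThreshold: number): Triage {
  if (input.burnRate > 3.0) {
    const m = input.minutesFromDeployToRegression;
    const nearDeploy = m !== null && m >= 0 && m <= 30;
    return {
      type: 'sudden-regression',
      nextStep: nearDeploy
        ? 'roll back the deployment'
        : 'check third-party dependencies for changes',
    };
  }
  if (input.burnRate > 1.0) {
    return {
      type: 'gradual-degradation',
      nextStep: 'pull contributing sessions from the last 48 hours',
    };
  }
  // Aggregate burn looks fine, but one cluster is failing far more than average.
  if (input.worstClusterFailureRate >= clusterFailureThreshold) {
    return {
      type: 'systematic-failure',
      nextStep: 'add a targeted override and a regression test for the cluster',
    };
  }
  return { type: 'none', nextStep: 'no action' };
}
```

Whatever the classification, the next move is still the one the runbook names: read the transcripts behind the number.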
The Continuous Reliability Improvement Loop
SRE for AI agents isn't a state you reach. It's a loop you run: measure, find the failure pattern, fix it, tighten the SLO, and measure again.
Each cycle should make your SLO tighter. After fixing the failure pattern, your measured baseline improves. After the baseline improves, you tighten the SLO. After tightening, your error budget becomes more sensitive. It catches smaller regressions that would have slipped through before. The loop compounds.
The teams running this consistently are the ones whose agents get measurably better every quarter. The teams skipping it are the ones discovering regressions from customer complaints six weeks after the model update that caused them.
Starting Today: A 30-Day Path
If you have zero agent SRE infrastructure right now, here's the concrete sequence.
Days 1-7: Measure without targets. Define your five SLIs. Don't set SLO targets yet. Just start logging and measure your current baseline. Understand what "normal" looks like for your specific agent before you commit to any number.
Days 8-14: Set initial SLOs. Put targets at your observed baseline minus a small margin, so you're within SLO at current performance. Configure error budget tracking. Get burn-rate alerts working for anything above 2.0x.
Days 15-21: Add the autonomy policy. Start with three states and add the critical tier later if you need it: healthy budget gets autonomous mode, warning burn rate gets draft mode, exhausted budget gets circuit break. This single change will catch the next regression before it reaches customers.
Days 22-30: Close the loop. Review which sessions contributed most to budget burn during the month. Fix the top two failure patterns. Add them to your scenario test suite. Watch your SLIs improve.
One month from now, your agent will be measurably more reliable, and you'll have numbers proving it. The agent that told a customer "3-5 business days" when the policy was 7-10, the one that was invisible to your APM, is now the kind of error your SLIs catch in the same shift it happens.



