A team shipped their first AI support agent and called it a success three weeks later. Their evidence: the agent handled 60% of incoming tickets without a human. Six weeks after that, CSAT scores fell 8 points and their head of support asked why customers seemed more frustrated since the agent launched.
Nobody had defined what "success" meant before the agent went live. Deflection rate is a measure of load reduction, not customer outcome. The agent was deflecting tickets. It wasn't resolving them.
A recent survey found that only 31% of organizations have a measurement framework for their AI agents. The other 69% pick up metrics after the plane is airborne and wonder why the instruments don't make sense. Here's how to build the framework before you ship.
Why Technical Metrics Aren't Enough
Technical metrics tell you the agent is healthy. Business KPIs tell you whether it's actually doing its job.
Latency, error rate, token cost, tool call success rate: these matter as operational health signals. But none of them tells you whether the customer got what they needed. An agent can have sub-300ms response times and a 0.1% error rate while failing to resolve 70% of its interactions. The monitoring dashboard looks clean. The customer experience is quietly deteriorating.
The confusion comes from LLM evaluation culture, which trained teams to think about evaluation as a technical discipline: benchmarks, accuracy scores, BLEU scores. That frame made sense for foundation model research. It doesn't make sense for an agent that's supposed to handle customer returns and needs to be measured on whether customers successfully returned things.
Business KPIs for production agents are the same metrics your head of support or head of success would care about for any support operation. Task completion. Escalation rate. Time to resolution. Cost per successful outcome. CSAT delta. Your AI agent should be accountable to the same numbers as the humans it works alongside.
The Five Core KPIs
Task completion rate, escalation rate, time to resolution, cost per successful outcome, and CSAT delta form the foundation. Every other metric you track exists to explain why these five move.
Task completion rate is the percentage of interactions where the agent fully resolved the customer's request without human intervention. This is your primary KPI. Everything else explains why this number is what it is.
Be precise about what counts as completion before you launch. For a support agent, completion might mean "ticket marked resolved and not reopened within 24 hours." For a booking agent, it might mean "booking confirmed and payment processed." For a Q&A agent, it might mean "user did not ask the same question again within the session." Vague definitions produce inflated numbers that hide real failure.
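To make the support-agent definition concrete, here is a minimal sketch of "resolved and not reopened within 24 hours" as code. The `Ticket` fields and function names are hypothetical, chosen for illustration; your ticketing system will expose different ones.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the 24-hour reopen window from the support-agent definition above.
REOPEN_WINDOW = timedelta(hours=24)


@dataclass
class Ticket:
    resolved_at: Optional[datetime]  # None if the ticket was never marked resolved
    reopened_at: Optional[datetime]  # None if the ticket was never reopened
    human_touched: bool              # True if a human intervened at any point


def is_completed(ticket: Ticket) -> bool:
    """A ticket counts only if the agent resolved it alone and it stayed resolved."""
    if ticket.resolved_at is None or ticket.human_touched:
        return False
    if ticket.reopened_at is None:
        return True
    # A reopen inside the window means the "resolution" did not hold.
    return ticket.reopened_at - ticket.resolved_at > REOPEN_WINDOW


def task_completion_rate(tickets: list[Ticket]) -> float:
    """Share of tickets that meet the completion definition (0.0 for an empty list)."""
    if not tickets:
        return 0.0
    return sum(is_completed(t) for t in tickets) / len(tickets)
```

Writing the definition down this way forces the edge cases into the open: a ticket reopened after 30 hours still counts, while one a human touched never does.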
Escalation rate is the percentage of interactions where the agent hands off to a human. You want this low, but not too low. A rate near zero suggests the agent is taking on cases it shouldn't. A high rate suggests the scope is set too narrow or the agent is hitting cases it wasn't prepared for.
Track escalation reasons, not just the rate. "Agent unable to answer" and "customer requested human" are both escalations but mean very different things. One is a capability gap; the other is a preference that may not indicate a problem at all.
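One way to keep the rate and the reasons together is to compute both from the same records. This is a sketch under assumptions: the `escalation_reason` key and the reason labels are hypothetical, not a schema from any particular platform.

```python
from collections import Counter


def escalation_summary(interactions):
    """Return (escalation_rate, counts_by_reason).

    Each interaction is a dict; 'escalation_reason' is None or missing
    when the agent did not hand off to a human.
    """
    total = len(interactions)
    reasons = Counter(
        i["escalation_reason"] for i in interactions if i.get("escalation_reason")
    )
    escalated = sum(reasons.values())
    rate = escalated / total if total else 0.0
    return rate, dict(reasons)
```

Reading the breakdown next to the rate shows whether a rise comes from capability gaps ("agent unable to answer") or from customer preference ("customer requested human").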
Time to resolution is how long it takes from the start of an interaction to a confirmed resolution. For async channels like email or messaging, this includes wait time between messages. For voice or live chat, it's handle time plus any post-interaction processing. Compare AI agent resolution time to human agent resolution time on equivalent task types. The delta is a clear signal of whether the agent is adding or subtracting value from the customer's perspective.
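A rough sketch of that comparison, assuming you can export resolution times as `(task_type, handler, minutes)` tuples. That record shape is an assumption made for this example, and the median is one reasonable choice of summary rather than the only one.

```python
from collections import defaultdict
from statistics import median


def resolution_time_delta(records):
    """Median AI resolution time minus median human time, per task type.

    records: iterable of (task_type, handler, minutes), handler is "ai" or "human".
    Task types missing either side are skipped, since there is nothing to compare.
    A negative delta means the AI agent resolves that task type faster.
    """
    grouped = defaultdict(lambda: {"ai": [], "human": []})
    for task_type, handler, minutes in records:
        grouped[task_type][handler].append(minutes)
    return {
        task_type: median(times["ai"]) - median(times["human"])
        for task_type, times in grouped.items()
        if times["ai"] and times["human"]
    }
```

Grouping by task type keeps the comparison on equivalent work, so an agent that only handles easy tickets doesn't look faster than it is.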
Cost per successful outcome is total operational cost divided by the number of successful outcomes. This metric ties spend to value, which is the only spending metric that matters.
Here's a concrete comparison:
| Agent | Cost/Call | Completion Rate | Cost/Successful Outcome |
|---|---|---|---|
| Agent A | $0.50 | 40% | $1.25 |
| Agent B | $0.80 | 90% | $0.89 |
Agent B costs 60% more per call. It costs 29% less per successful outcome. If you're optimizing cost per call, you'll choose the wrong agent. Cost per successful outcome is the metric that reflects the actual economics.
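The arithmetic behind the table is short enough to check yourself. This sketch assumes total cost is simply cost per call times the number of calls, so cost per successful outcome reduces to cost per call divided by completion rate.

```python
def cost_per_successful_outcome(cost_per_call: float, completion_rate: float) -> float:
    """Cost per call divided by the fraction of calls that succeed."""
    if completion_rate <= 0:
        raise ValueError("completion_rate must be greater than zero")
    return cost_per_call / completion_rate


agent_a = cost_per_successful_outcome(0.50, 0.40)
agent_b = cost_per_successful_outcome(0.80, 0.90)
savings = 1 - agent_b / agent_a  # fraction saved per successful outcome with Agent B

print(f"Agent A: ${agent_a:.2f} per success")
print(f"Agent B: ${agent_b:.2f} per success ({savings:.0%} less than Agent A)")
```

If your real costs include fixed components such as platform fees or human review time, add them to the numerator before dividing.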
CSAT delta is the change in customer satisfaction for customers served by the AI agent compared to a baseline. Without a delta, a CSAT score is just a number. With a delta, it tells you whether the agent is making the experience better or worse than the alternative.
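As a sketch, CSAT delta can be computed in percentage points from raw survey scores. The convention here is an assumption: scores on a 1-5 scale, with 4 or 5 counted as satisfied. Swap in whatever definition your survey tool already uses.

```python
def csat_percent(scores, satisfied_min=4):
    """Percentage of survey responses at or above the 'satisfied' cutoff."""
    if not scores:
        raise ValueError("need at least one survey response")
    return 100 * sum(s >= satisfied_min for s in scores) / len(scores)


def csat_delta(agent_scores, baseline_scores):
    """AI-agent CSAT minus baseline CSAT, in percentage points."""
    return csat_percent(agent_scores) - csat_percent(baseline_scores)
```

A negative result means customers served by the agent are less satisfied than the baseline group, which is the pattern in the opening scenario.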

Setting Your Baseline Before Launch
You can't measure change without a starting point, and a missing baseline is where most measurement gaps begin.
Before your agent handles its first real interaction, document your current state on each KPI for the same task types the agent will cover. If the agent is handling tier-1 support, measure your human agents on tier-1 support: resolution rate, average handle time, CSAT, cost per resolution.
A holdout group makes this much cleaner. Route 10-20% of equivalent interactions to human agents even after the AI agent launches. This gives you a live comparison rather than a historical one, and it controls for time-based confounders: seasonal patterns in request type, product changes that affect what customers ask about, or changes in customer behavior. If you can't run a holdout group, document pre-launch metrics carefully and be explicit that any post-launch comparisons are time-confounded.
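A holdout split can be as simple as a deterministic hash. The sketch below is one possible approach, not a prescribed one; the 15% fraction and the `route` function name are assumptions for illustration.

```python
import hashlib

# Assumption: 15% holdout, inside the 10-20% range suggested above.
HOLDOUT_FRACTION = 0.15


def route(interaction_id: str, holdout_fraction: float = HOLDOUT_FRACTION) -> str:
    """Deterministically assign an interaction to the human holdout or the AI agent.

    Hashing the ID gives a stable, roughly uniform value in [0, 1), so the same
    ID always lands in the same group without storing any assignment state.
    """
    digest = hashlib.sha256(interaction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "human" if bucket < holdout_fraction else "ai_agent"
```

Hashing a customer ID instead of an interaction ID would keep each customer in one group for every contact, which may matter if repeat contacts are common.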
The production eval gap survey found that teams who don't establish baselines before launch are significantly more likely to end up without a working measurement framework six months later. Without a baseline, your KPI data is a number with no reference point. It can't tell you whether you're winning.
Leading vs. Lagging Indicators
Every business KPI has technical precursors that predict it, usually by days or weeks. Building dashboards that show both layers makes root cause analysis much faster.
Task completion rate is predicted by tool call accuracy, context recall in long conversations, and scope coverage. If your agent starts calling tools incorrectly or losing context after several turns, task completion will drop. You'll see it in the technical metrics before it shows up as a meaningful CSAT change.
Escalation rate is predicted by confidence scoring and the quality of your fallback logic. Agents without a mechanism for "I don't know" will hallucinate (bad for CSAT) or fail silently (bad for task completion). Agents that escalate well need both a calibrated confidence signal and a clear escalation path that doesn't leave the customer confused about what's happening.
CSAT delta is predicted by response quality, tone consistency, and resolution accuracy. An LLM-as-a-judge pipeline is commonly used to score response quality at scale, but it needs to be calibrated to your actual CSAT patterns to be predictive. A judge scoring "helpfulness" on a rubric you designed may not track with how your customers experience helpfulness.
When a business KPI moves unexpectedly, your first question should be "which leading indicator moved first?" If task completion drops, check tool call accuracy before you check model output quality. If CSAT delta worsens, check response quality scores before you review escalation patterns. The leading indicator points you to the right layer of the system to investigate.
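The "which leading indicator moved first?" question can be sketched as a lookup plus a scan over daily values. The mapping mirrors the pairings described above; the metric keys, the 0.05 tolerance, and the data shape are all assumptions for illustration.

```python
# Leading indicators per business KPI, as described in this section.
LEADING_INDICATORS = {
    "task_completion_rate": ["tool_call_accuracy", "context_recall", "scope_coverage"],
    "escalation_rate": ["confidence_calibration", "fallback_quality"],
    "csat_delta": ["response_quality", "tone_consistency", "resolution_accuracy"],
}


def first_movers(kpi, series, baseline, tolerance=0.05):
    """List (indicator, day_index) for indicators that fell below baseline.

    series: {indicator: [daily values, oldest first]}
    baseline: {indicator: expected value}
    An indicator has "moved" on the first day it sits more than `tolerance`
    below its baseline. Results are sorted so the earliest mover comes first.
    """
    moved = []
    for name in LEADING_INDICATORS[kpi]:
        for day, value in enumerate(series.get(name, [])):
            if baseline[name] - value > tolerance:
                moved.append((name, day))
                break
    return sorted(moved, key=lambda pair: pair[1])
```

The first entry in the result is the layer to investigate first; an empty list suggests the cause sits outside the indicators you are tracking.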
Chanl's analytics surfaces both layers in a single view: tool call accuracy, context recall, and response quality scores alongside task completion and escalation patterns. The goal is to connect the technical signal to the business outcome without switching between systems.
Alerting: Define Thresholds Before You Need Them
Metrics without alerts are decoration. You need to know immediately when something breaks, not at the next weekly review.
Set alert thresholds before launch, not in response to an incident. For each KPI, define three levels: warning (investigate), critical (act now), and catastrophic (consider taking the agent offline).
For task completion rate, a reasonable starting framework is: warn at minus 5 percentage points from baseline, critical at minus 10, catastrophic at minus 20. Adjust based on how consequential failures are in your domain. An agent handling healthcare appointment scheduling needs tighter thresholds than one handling FAQ responses.
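Those three levels translate directly into a small classifier. This is a sketch using the starting thresholds above as defaults; the function name and the "ok" label are hypothetical.

```python
def completion_alert_level(baseline_pct, current_pct,
                           warn=5.0, critical=10.0, catastrophic=20.0):
    """Classify a task completion rate by its drop from baseline.

    Inputs are percentages (for example 80 for 80%); thresholds are in
    percentage points. Tighten the defaults for higher-stakes domains.
    """
    drop = baseline_pct - current_pct
    if drop >= catastrophic:
        return "catastrophic"
    if drop >= critical:
        return "critical"
    if drop >= warn:
        return "warning"
    return "ok"
```

For example, with an 80% baseline, a reading of 69% returns "critical" because the drop is 11 points.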
For escalation rate, alert in both directions. A spike upward means the agent is struggling. A sustained drop below your expected floor means the agent might be handling cases it shouldn't, which is a different kind of risk.
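A two-sided check might look like the sketch below. The ceiling, floor, and three-day "sustained" window are assumptions; pick values from your own baseline.

```python
def escalation_alert(daily_rates, ceiling, floor, sustained_days=3):
    """Check the escalation rate in both directions.

    daily_rates: daily escalation rates, oldest first.
    Returns "spike" if the latest day exceeds the ceiling, "sustained_drop"
    if each of the last `sustained_days` days is below the floor, else None.
    """
    if not daily_rates:
        return None
    if daily_rates[-1] > ceiling:
        return "spike"
    recent = daily_rates[-sustained_days:]
    if len(recent) == sustained_days and all(rate < floor for rate in recent):
        return "sustained_drop"
    return None
```

The asymmetry is deliberate: a spike fires on a single day, while a drop has to persist before it alerts, since one quiet day is not evidence the agent is overreaching.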
CSAT data is typically lagged by days (customers fill in surveys after interactions), so daily or weekly thresholds make more sense than real-time alerts. If you're collecting in-conversation satisfaction signals, you can set tighter thresholds.
Chanl's monitoring supports alerts on any tracked metric with configurable thresholds and routing to Slack, PagerDuty, or email. Configure these before launch day. The worst time to set up alerting is during an incident.
Reviewing KPIs: The Closed Loop That Changes Behavior
A weekly review that doesn't result in an agent change is a report, not a feedback loop.
Structure your weekly review around three questions: Are all KPIs within their thresholds? Is there a trend that hasn't crossed a threshold yet but is moving in the wrong direction? What's the next change to make, and how will we know if it worked?
The third question is where most teams stall. They generate KPI reports but don't close the loop to specific agent improvements. If task completion rate is down, you need to know which task types are failing most often, what the agent is doing differently on those tasks, and what change to the prompt, tools, or memory configuration might fix it. That chain of diagnosis has to be part of the review process, not a separate exercise that only happens when there's a crisis.
Connect your KPI reviews to your agent's version history. When you change a prompt, add a tool, or update the knowledge base, you should be able to see whether the relevant KPIs moved in the next review. If they didn't move, either the change wasn't the fix or your KPIs aren't sensitive enough to detect it. Both are useful to know.
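One lightweight way to tie reviews to version history is to store each review's KPIs against the agent version that was live, then diff consecutive entries. The record shape and version labels below are hypothetical.

```python
def kpi_change_by_version(reviews):
    """Diff KPIs between consecutive agent versions.

    reviews: list of (version_label, {kpi_name: value}) in chronological order.
    Returns a list of (previous_version, current_version, {kpi_name: change}).
    Only KPIs present in both reviews are compared.
    """
    changes = []
    for (prev_version, prev), (cur_version, cur) in zip(reviews, reviews[1:]):
        delta = {name: round(cur[name] - prev[name], 4) for name in cur if name in prev}
        changes.append((prev_version, cur_version, delta))
    return changes
```

A diff near zero after a change you expected to help is the signal described above: either the change wasn't the fix, or the KPI isn't sensitive enough to detect it.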
If you're building this measurement practice on an agent that's already in production, "Scorecards vs. vibes" covers how to establish measurement rigor retroactively. "Score drift and shipping with confidence" is a good companion for teams that already have evals but aren't sure how to use them to make shipping decisions.
What Happens When You Don't Define This First
The 8-point CSAT drop from the opening scenario wasn't random. It was predictable. The team was measuring deflection (a proxy metric) instead of resolution (the actual goal). When the agent deflected tickets without resolving them, the customers just came back angrier, reopened the ticket, or left. The deflection metric looked fine the entire time.
This is the cost of not defining success before launch. You end up retroactively arguing about what the metrics mean rather than using them to make decisions. "The deflection rate is high" and "the CSAT dropped" are both true, and they seem contradictory unless you had a clear definition of success that connected them.
Five KPIs. A baseline. Leading indicators with alerts. A weekly review with a closed loop to agent changes. That's the entire framework. It's not complex to describe. The hard part is doing it before the first production interaction, when you're still tempted to treat measurement as something you'll add later once you've proven the concept.
Define success now, so your agent has something to be accountable to and you have something to optimize toward.
Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.