What KPIs should I track for an AI agent in production?

The five core KPIs are: task completion rate (did the agent fully resolve the request?), escalation rate (how often does it hand off to a human?), time to resolution, cost per successful outcome, and CSAT delta compared to a pre-launch baseline. Technical metrics like latency and error rate matter as leading indicators, but the business KPIs are what you're ultimately accountable to.

What is task completion rate for an AI agent?

Task completion rate is the percentage of interactions where the agent fully resolved the customer's request without human intervention. Define 'fully resolved' precisely for your use case before launch: for a support agent it might mean the ticket was resolved and not reopened within 24 hours; for a booking agent it means the booking was confirmed and payment processed. Partial completions count as incomplete.

Why is escalation rate an important AI agent metric?

Escalation rate tells you two things simultaneously: how capable the agent is, and whether your scope boundaries are correctly set. A very low escalation rate might mean the agent is handling cases it shouldn't. A high rate might mean the scope is set too narrow. The right escalation rate is use-case dependent, but tracking it over time surfaces drift in both directions before it affects customers.

What is cost per successful outcome for an AI agent?

Cost per successful outcome is the total operational cost of running the agent divided by the number of interactions where it fully resolved the customer's request. It's more meaningful than cost per call because it accounts for quality. An agent that costs $0.50 per call but resolves only 40% of issues has a cost per successful outcome of $1.25. One that costs $0.80 but resolves 90% costs $0.89 per outcome.

How do I establish a baseline before deploying an AI agent?

Measure your current human agent performance on the same task types before launch: resolution rate, average handle time, cost per resolution, and CSAT score. These become your baseline. Once the AI agent is live, compare its metrics to the baseline on equivalent task types. A holdout group, where 10-20% of interactions still route to humans, gives you a live comparison rather than a historical one.

What is the difference between leading and lagging indicators for agents?

Lagging indicators are your business KPIs: task completion rate, CSAT delta, cost per outcome. They tell you what happened. Leading indicators are technical metrics that predict what the lagging indicators will show: tool call accuracy, context recall rate, model latency. When a business KPI moves, the leading indicator that moved first points to the root cause.

How often should I review AI agent KPIs?

Weekly reviews are the minimum for an active production agent. Set automated alerts for threshold breaches so you're not waiting for a weekly review to see a sudden drop. For agents in sensitive domains like healthcare, finance, or legal, daily reviews are more appropriate in the first 90 days. The cadence matters less than the closed loop: review, diagnose, change something, measure the change.

What is score drift for an AI agent?

Score drift is when an agent's quality metrics change over time even though the code hasn't changed. It happens because the world changes: new products the agent doesn't know about, seasonal shifts in request patterns, or slow changes in the underlying model's behavior from fine-tuning updates. Regular KPI reviews with alerts on trend changes catch drift before customers start noticing.

AI Agent KPIs: What to Measure Before You Ship

A team shipped their first AI support agent and called it a success three weeks later. Their evidence: the agent handled 60% of incoming tickets without a human. Six weeks after that, CSAT scores fell 8 points and their head of support asked why customers seemed more frustrated since the agent launched.

Nobody had defined what "success" meant before the agent went live. Deflection rate is a measure of load reduction, not customer outcome. The agent was deflecting tickets. It wasn't resolving them.

A recent survey found that only 31% of organizations have a measurement framework for their AI agents. The other 69% pick up metrics after the plane is airborne and wonder why the instruments don't make sense. Here's how to build the framework before you ship.

Why Technical Metrics Aren't Enough

Technical metrics tell you the agent is healthy. Business KPIs tell you whether it's actually doing its job.

Latency, error rate, token cost, tool call success rate: these matter as operational health signals. But none of them tells you whether the customer got what they needed. An agent can have sub-300ms response times and a 0.1% error rate while failing to resolve 70% of its interactions. The monitoring dashboard looks clean. The customer experience is quietly deteriorating.

The confusion comes from LLM evaluation culture, which trained teams to think about evaluation as a technical discipline: benchmarks, accuracy scores, BLEU scores. That frame made sense for foundation model research. It doesn't make sense for an agent that's supposed to handle customer returns and needs to be measured on whether customers successfully returned things.

Business KPIs for production agents are the same metrics your head of support or head of success would care about for any support operation. Task completion. Escalation rate. Time to resolution. Cost per successful outcome. CSAT delta. Your AI agent should be accountable to the same numbers as the humans it works alongside.

The Five Core KPIs

Task completion rate, escalation rate, time to resolution, cost per successful outcome, and CSAT delta form the foundation. Every other metric is context for why these five are what they are.

Task completion rate is the percentage of interactions where the agent fully resolved the customer's request without human intervention. This is your primary KPI. Everything else explains why this number is what it is.

Be precise about what counts as completion before you launch. For a support agent, completion might mean "ticket marked resolved and not reopened within 24 hours." For a booking agent, it means "booking confirmed and payment processed." For a Q&A agent, it might mean "user did not ask the same question again within the session." Vague definitions produce inflated numbers that hide real failure.
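As an illustration, the support-agent definition above can be written down as an explicit rule. This is a minimal sketch: the `Interaction` record and its field names are assumptions made for the example, not a real schema, and the 24-hour window is the one from the definition above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical record shape; the field names are assumptions for illustration.
@dataclass
class Interaction:
    resolved_at: Optional[datetime]   # None if the ticket was never marked resolved
    reopened_at: Optional[datetime]   # None if the ticket was never reopened
    escalated_to_human: bool

REOPEN_WINDOW = timedelta(hours=24)

def is_complete(i: Interaction) -> bool:
    """Resolved by the agent alone and not reopened within 24 hours."""
    if i.resolved_at is None or i.escalated_to_human:
        return False
    if i.reopened_at is not None and i.reopened_at - i.resolved_at <= REOPEN_WINDOW:
        return False
    return True

def task_completion_rate(interactions: list[Interaction]) -> float:
    """Share of interactions that meet the completion definition."""
    if not interactions:
        return 0.0
    return sum(is_complete(i) for i in interactions) / len(interactions)
```

Writing the rule as code forces the edge cases into the open: a ticket reopened after 30 hours counts as complete here, while an escalated ticket never does.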

Escalation rate is the percentage of interactions where the agent hands off to a human. You want this low, but not too low. A rate near zero suggests the agent is taking on cases it shouldn't. A high rate suggests the scope is set too narrow or the agent is hitting cases it wasn't prepared for.

Track escalation reasons, not just the rate. "Agent unable to answer" and "customer requested human" are both escalations but mean very different things. One is a capability gap; the other is a preference that may not indicate a problem at all.
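One way to keep the reasons visible is to report the rate together with a breakdown by reason. The sketch below assumes each escalation is logged with a short reason label; the labels used in the usage note are hypothetical.

```python
from collections import Counter

def escalation_breakdown(reasons: list[str], total_interactions: int) -> dict:
    """Escalation rate plus the share of escalations attributed to each reason.

    `reasons` holds one label per escalated interaction (labels are assumed
    to come from your own logging, not from any particular platform).
    """
    counts = Counter(reasons)
    rate = len(reasons) / total_interactions if total_interactions else 0.0
    shares = {reason: n / len(reasons) for reason, n in counts.items()}
    return {"escalation_rate": rate, "share_by_reason": shares}
```

Called with three "agent_unable_to_answer" labels and one "customer_requested_human" label out of 20 interactions, this reports a 20% escalation rate, three quarters of which point at a capability gap rather than a customer preference.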

Time to resolution is how long it takes from the start of an interaction to a confirmed resolution. For async channels like email or messaging, this includes wait time between messages. For voice or live chat, it's handle time plus any post-interaction processing. Compare AI agent resolution time to human agent resolution time on equivalent task types. The delta is a clear signal of whether the agent is adding or subtracting value from the customer's perspective.
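A rough sketch of that comparison, assuming resolution times are already grouped by task type. Using the median rather than the mean is a choice made here so that a few very long conversations don't dominate; the function name and input shape are illustrative.

```python
from statistics import median

def resolution_time_delta(
    ai_minutes: dict[str, list[float]],
    human_minutes: dict[str, list[float]],
) -> dict[str, float]:
    """Median AI time minus median human time, per task type.

    Negative values mean the AI agent resolves that task type faster.
    Only task types with data on both sides are compared.
    """
    deltas = {}
    for task_type in ai_minutes.keys() & human_minutes.keys():
        ai, human = ai_minutes[task_type], human_minutes[task_type]
        if ai and human:
            deltas[task_type] = median(ai) - median(human)
    return deltas
```

Restricting the comparison to shared task types keeps it on equivalent work, which is the condition the comparison depends on.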

Cost per successful outcome is total operational cost divided by the number of successful outcomes. Because it ties spend to value, it is the only spending metric that matters.

Here's a concrete comparison:

Agent	Cost/Call	Completion Rate	Cost/Successful Outcome
Agent A	$0.50	40%	$1.25
Agent B	$0.80	90%	$0.89

Agent B costs 60% more per call. It costs 29% less per successful outcome. If you're optimizing cost per call, you'll choose the wrong agent. Cost per successful outcome is the metric that reflects the actual economics.
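The arithmetic behind the comparison is short enough to check directly. In this sketch the function name is illustrative, and cost per call divided by completion rate stands in for total cost divided by successful outcomes, which is equivalent when every call costs the same.

```python
def cost_per_successful_outcome(cost_per_call: float, completion_rate: float) -> float:
    """Cost per call divided by the fraction of calls fully resolved."""
    if completion_rate <= 0:
        raise ValueError("completion_rate must be greater than zero")
    return cost_per_call / completion_rate

agent_a = cost_per_successful_outcome(0.50, 0.40)
agent_b = cost_per_successful_outcome(0.80, 0.90)

print(f"Agent A: ${agent_a:.2f}")   # Agent A: $1.25
print(f"Agent B: ${agent_b:.2f}")   # Agent B: $0.89
print(f"B vs A per outcome: {(agent_a - agent_b) / agent_a:.0%} cheaper")   # 29% cheaper
```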

CSAT delta is the change in customer satisfaction for customers served by the AI agent compared to a baseline. Without a delta, a CSAT score is just a number. With a delta, it tells you whether the agent is making the experience better or worse than the alternative.
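A minimal sketch of the calculation, assuming survey scores on a common scale for both groups. The minimum sample size of 30 is an arbitrary placeholder chosen for the example, not a recommendation from the article.

```python
from statistics import mean
from typing import Optional

def csat_delta(
    agent_scores: list[float],
    baseline_scores: list[float],
    min_samples: int = 30,   # assumed guard against reading noise as signal
) -> Optional[float]:
    """Mean CSAT for agent-served customers minus the baseline mean.

    Returns None when either group has too few responses to compare.
    """
    if len(agent_scores) < min_samples or len(baseline_scores) < min_samples:
        return None
    return mean(agent_scores) - mean(baseline_scores)
```

A negative result means customers served by the agent are less satisfied than the baseline group.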


Setting Your Baseline Before Launch

You can't measure change without a starting point, and the measurement gap starts here.

Before your agent handles its first real interaction, document your current state on each KPI for the same task types the agent will cover. If the agent is handling tier-1 support, measure your human agents on tier-1 support: resolution rate, average handle time, CSAT, cost per resolution.

A holdout group makes this much cleaner. Route 10-20% of equivalent interactions to human agents even after the AI agent launches. This gives you a live comparison rather than a historical one, and it controls for time-based confounders: seasonal patterns in request type, product changes that affect what customers ask about, or changes in customer behavior. If you can't run a holdout group, document pre-launch metrics carefully and be explicit that any post-launch comparisons are time-confounded.
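One common way to implement a holdout is deterministic hashing of a stable identifier, so the same conversation always lands in the same group. The sketch below is one possible approach, not a description of any particular platform; the function name, the `conversation_id` parameter, and the 15% default are assumptions.

```python
import hashlib

def routes_to_human_holdout(conversation_id: str, holdout_fraction: float = 0.15) -> bool:
    """Return True if this conversation belongs to the human-handled holdout.

    Hashing the ID gives a stable, roughly uniform bucket in [0, 1), so the
    assignment is reproducible without storing any routing state.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < holdout_fraction
```

Because assignment depends only on the ID, the split is independent of time of day, request type, and agent load, which is what lets the holdout control for time-based confounders.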

The production eval gap survey found that teams who don't establish baselines before launch are significantly more likely to end up without a working measurement framework six months later. Without a baseline, your KPI data is a number with no reference point. It can't tell you whether you're winning.

Leading vs. Lagging Indicators

Every business KPI has technical precursors that predict it, usually by days or weeks. Building dashboards that show both layers makes root cause analysis much faster.

Task completion rate is predicted by tool call accuracy, context recall in long conversations, and scope coverage. If your agent starts calling tools incorrectly or losing context after several turns, task completion will drop. You'll see it in the technical metrics before it shows up as a meaningful CSAT change.

Escalation rate is predicted by confidence scoring and the quality of your fallback logic. Agents without a mechanism for "I don't know" will hallucinate (bad for CSAT) or fail silently (bad for task completion). Agents that escalate well need both a calibrated confidence signal and a clear escalation path that doesn't leave the customer confused about what's happening.

CSAT delta is predicted by response quality, tone consistency, and resolution accuracy. An LLM-as-a-judge pipeline is commonly used to score response quality at scale, but it needs to be calibrated to your actual CSAT patterns to be predictive. A judge scoring "helpfulness" on a rubric you designed may not track with how your customers experience helpfulness.
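A simple first check on calibration is to correlate judge scores with the CSAT scores customers actually gave for the same interactions. The sketch below computes a Pearson correlation by hand. Treating a linear correlation as the calibration measure is an assumption made for illustration, and what counts as "high enough" is left to you.

```python
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between paired judge scores and CSAT scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / sqrt(var_x * var_y)
```

A correlation near zero suggests the judge rubric is measuring something your customers don't experience as helpfulness, so its scores would not work as a leading indicator for CSAT.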

Figure: How technical leading indicators connect to business KPIs for production agents.

When a business KPI moves unexpectedly, your first question should be "which leading indicator moved first?" If task completion drops, check tool call accuracy before you check model output quality. If CSAT delta worsens, check response quality scores before you review escalation patterns. The leading indicator points you to the right layer of the system to investigate.
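The "which moved first" question can be answered mechanically if each leading indicator is stored as a daily series with a baseline. A minimal sketch, where the metric names, the daily granularity, and the 0.05 tolerance are all assumptions:

```python
from typing import Optional

def first_breach_day(series: list[float], baseline: float, tolerance: float) -> Optional[int]:
    """Index of the first day the metric deviates from baseline by more than tolerance."""
    for day, value in enumerate(series):
        if abs(value - baseline) > tolerance:
            return day
    return None

def indicators_by_first_move(
    history: dict[str, list[float]],
    baselines: dict[str, float],
    tolerance: float = 0.05,
) -> list[tuple[str, int]]:
    """Leading indicators that moved, ordered by the day they first moved."""
    moves = []
    for name, series in history.items():
        day = first_breach_day(series, baselines[name], tolerance)
        if day is not None:
            moves.append((name, day))
    return sorted(moves, key=lambda move: move[1])
```

The first entry in the returned list is the layer to investigate first; indicators that never left their tolerance band are omitted.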

Chanl's analytics surfaces both layers in a single view: tool call accuracy, context recall, and response quality scores alongside task completion and escalation patterns. The goal is to connect the technical signal to the business outcome without switching between systems.

Alerting: Define Thresholds Before You Need Them

Metrics without alerts are decoration. You need to know immediately when something breaks, not at the next weekly review.

Set alert thresholds before launch, not in response to an incident. For each KPI, define three levels: warning (investigate), critical (act now), and catastrophic (consider taking the agent offline).

For task completion rate, a reasonable starting framework is: warn at 5 percentage points below baseline, critical at 10 points below, catastrophic at 20 points below. Adjust based on how consequential failures are in your domain. An agent handling healthcare appointment scheduling needs tighter thresholds than one handling FAQ responses.
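The three levels translate directly into a small classifier. This sketch uses the starting thresholds given above and takes rates in percent to avoid floating-point surprises at the boundaries; the function name and level labels are illustrative.

```python
def completion_alert_level(current_pct: float, baseline_pct: float) -> str:
    """Classify a task completion rate by its drop from baseline, in percentage points."""
    drop = baseline_pct - current_pct
    if drop >= 20:
        return "catastrophic"   # consider taking the agent offline
    if drop >= 10:
        return "critical"       # act now
    if drop >= 5:
        return "warning"        # investigate
    return "ok"
```

For a domain with more consequential failures, the same structure applies with smaller numbers in place of 5, 10, and 20.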

For escalation rate, alert in both directions. A spike upward means the agent is struggling. A sustained drop below your expected floor means the agent might be handling cases it shouldn't, which is a different kind of risk.
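A two-sided check might look like the following sketch. It fires immediately on an upward spike but requires several consecutive days below the floor before flagging a drop. The three-day window and the return labels are assumptions for the example.

```python
def escalation_alert(
    daily_rates_pct: list[float],
    floor_pct: float,
    ceiling_pct: float,
    sustain_days: int = 3,   # assumed definition of "sustained"
) -> str:
    """Two-sided escalation alert over a series of daily rates, oldest first."""
    if not daily_rates_pct:
        return "no_data"
    if daily_rates_pct[-1] > ceiling_pct:
        return "spike_above_ceiling"
    recent = daily_rates_pct[-sustain_days:]
    if len(recent) == sustain_days and all(rate < floor_pct for rate in recent):
        return "sustained_below_floor"
    return "ok"
```

The asymmetry is deliberate: a spike means customers are being affected now, while a single quiet day below the floor is usually noise.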

CSAT data is typically lagged by days (customers fill in surveys after interactions), so daily or weekly thresholds make more sense than real-time alerts. If you're collecting in-conversation satisfaction signals, you can set tighter thresholds.

Chanl's monitoring supports alerts on any tracked metric with configurable thresholds and routing to Slack, PagerDuty, or email. Configure these before launch day. The worst time to set up alerting is during an incident.

Reviewing KPIs: The Closed Loop That Changes Behavior

A weekly review that doesn't result in an agent change is a report, not a feedback loop.

Structure your weekly review around three questions: Are all KPIs within their thresholds? Is there a trend that hasn't crossed a threshold yet but is moving in the wrong direction? What's the next change to make, and how will we know if it worked?
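The second question, a trend that hasn't crossed a threshold yet, can be screened for with a least-squares slope over the last few weekly values. A minimal sketch for a KPI where lower is worse, such as task completion rate; the slope cutoff of half a point per week is a placeholder, not a recommendation.

```python
def weekly_trend(values: list[float]) -> float:
    """Least-squares slope of the KPI per week (values ordered oldest first)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def drifting_within_threshold(values: list[float], warn_floor: float, max_slope: float = -0.5) -> bool:
    """True if the KPI is still at or above its warning floor but falling steadily."""
    if not values:
        return False
    return values[-1] >= warn_floor and weekly_trend(values) <= max_slope
```

A KPI that has already fallen below the warning floor returns False here because the threshold alert, not the trend check, should be handling it.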

The third question is where most teams stall. They generate KPI reports but don't close the loop to specific agent improvements. If task completion rate is down, you need to know which task types are failing most often, what the agent is doing differently on those tasks, and what change to the prompt, tools, or memory configuration might fix it. That chain of diagnosis has to be part of the review process, not a separate exercise that only happens when there's a crisis.

Connect your KPI reviews to your agent's version history. When you change a prompt, add a tool, or update the knowledge base, you should be able to see whether the relevant KPIs moved in the next review. If they didn't move, either the change wasn't the fix or your KPIs aren't sensitive enough to detect it. Both are useful to know.

If you're building this measurement practice on an agent that's already in production, "Scorecards vs. vibes" covers how to establish measurement rigor retroactively. "Score drift and shipping with confidence" is a good companion for teams that already have evals but aren't sure how to use them to make shipping decisions.

What Happens When You Don't Define This First

The 8-point CSAT drop from the opening scenario wasn't random. It was predictable. The team was measuring deflection (a proxy metric) instead of resolution (the actual goal). When the agent deflected tickets without resolving them, the customers just came back angrier, reopened the ticket, or left. The deflection metric looked fine the entire time.

This is the cost of not defining success before launch. You end up retroactively arguing about what the metrics mean rather than using them to make decisions. "The deflection rate is high" and "the CSAT dropped" are both true, and they seem contradictory unless you had a clear definition of success that connected them.

Five KPIs. A baseline. Leading indicators with alerts. A weekly review with a closed loop to agent changes. That's the entire framework. It's not complex to describe. The hard part is doing it before the first production interaction, when you're still tempted to treat measurement as something you'll add later once you've proven the concept.

Define success now, so your agent has something to be accountable to and you have something to optimize toward.




Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.


