Your agent handled 847 tickets last week. Pass rate: 91%. Average score: 4.1 out of 5.0. The team is happy. You're shipping.
Then your CFO asks one question: "What does each resolution actually cost us?"
You open the spreadsheet. Model costs, tool call fees, retry overhead, the occasional human escalation. You add it up. Your agent costs $2.80 per conversation. Your human agents cost $7.00. On paper, you're winning. But then you look closer: your agent's task resolution rate is 68%. Divide the cost per conversation by the resolution rate ($2.80 / 0.68) and you're at $4.12 per successful outcome. Your human agents resolve 94% of tickets. Their cost per successful outcome is $7.45 ($7.00 / 0.94).
You're not as far ahead as you thought. And you had no idea.
This is the metric gap that almost every team hits in their second month of production. Pass rate and quality scores tell you whether your agent produces good-looking output. They don't tell you whether that output accomplishes anything, or what it costs when it doesn't.
The metric that answers both questions is cost per successful outcome. Here's how to build it.
| What teams measure | What actually predicts ROI |
|---|---|
| Pass rate | Task success rate |
| Average quality score | Quality score paired with cost |
| Total model spend | Cost per resolved conversation |
| Escalation count | Escalation rate plus cost per escalation |
Why Pass Rate Alone Fails You
Pass rate measures whether your agent produced output that clears a quality threshold. An output that fills in all required fields, stays on topic, and avoids hallucinating scores a pass. But a pass doesn't mean the ticket closed. It doesn't mean the customer got what they needed. It doesn't mean your tool calls were efficient.
A pattern shows up over and over: teams deploy an agent with 92% pass rate and call it production-ready. Three weeks later, they check their re-contact rate (the percentage of customers who contacted support again within 48 hours). It's 31%. The agent answers questions coherently but doesn't actually solve problems. Pass rate saw nothing. Cost per successful outcome would have caught it immediately.
The other failure pass rate misses is cost efficiency. An agent that resolves tickets by making six tool calls instead of two looks identical on a quality scorecard. Both produce accurate, helpful responses. But the six-call agent pays 3x the tool-call cost on every resolution, plus the extra tokens needed to process each tool result. At scale, that difference compounds into a business problem that no quality score surfaces.
You need both dimensions: did it work, and what did it cost when it did?
According to a 2026 survey, only 31% of organizations have implemented a measurement framework for agentic AI, while 47% either have no framework or are unsure whether one exists. Most of the teams with frameworks are measuring quality scores. Almost none are measuring cost per successful outcome. That's the gap.
The Two Numbers Every Agent Needs
The simplest version of this framework is a single ratio: total spend divided by resolved conversations. But to act on it, you need to understand both inputs separately.
Cost per conversation is the total marginal cost to run one conversation: model tokens, tool call fees, retry overhead, and any human escalation costs. This sets your efficiency floor: cost per successful outcome can never be lower than cost per conversation. If each conversation costs $2.00 regardless of outcome and you're targeting $4.00 per resolved ticket, your task success rate needs to be at least 50% before you're in range.
Task success rate is the percentage of conversations that actually resolve the customer's issue. This is harder to measure than cost because ground truth is rarely available in real time. But it's the multiplier that turns cost-per-conversation into cost-per-successful-outcome.
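The relationship between the two inputs is a single division. Here's a minimal sketch, assuming every conversation costs the same whether or not it resolves; the function names are illustrative and not part of any SDK.

```typescript
// Sketch: cost per successful outcome from its two inputs.
// Assumes a flat cost per conversation regardless of outcome.
function costPerOutcome(
  costPerConversation: number,
  taskSuccessRate: number, // fraction between 0 and 1
): number {
  if (taskSuccessRate <= 0) return Infinity; // nothing resolved
  return costPerConversation / taskSuccessRate;
}

// Minimum success rate needed to hit a target cost per outcome.
// A result above 1 means the target is unreachable at this cost per conversation.
function requiredSuccessRate(
  costPerConversation: number,
  targetCostPerOutcome: number,
): number {
  return costPerConversation / targetCostPerOutcome;
}

console.log(costPerOutcome(2.8, 0.68).toFixed(2)); // "4.12"
console.log(requiredSuccessRate(2.0, 4.0)); // 0.5
```

The second function is the useful planning tool: plug in your current cost per conversation and the number your CFO wants, and it tells you how much resolution rate you need before cost cutting is even relevant.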
// unit-economics.ts
interface ConversationMetrics {
  callId: string;
  modelCost: number; // prompt and completion tokens, each at its per-token price
  toolCost: number; // API calls, database queries, third-party lookups
  retryCost: number; // failed tool calls that triggered retries
  escalationCost: number; // human review cost, if escalated
  totalCost: number; // model + tool + retry + escalation
  qualityScore: number; // 0-5 from scorecard
  resolved: boolean; // did the customer's issue actually close?
}

// Arithmetic mean; returns 0 for an empty list instead of NaN.
function mean(values: number[]): number {
  return values.length > 0 ? values.reduce((sum, v) => sum + v, 0) / values.length : 0;
}

function computeUnitEconomics(conversations: ConversationMetrics[]) {
  const count = conversations.length;
  const totalSpend = conversations.reduce((sum, c) => sum + c.totalCost, 0);
  const resolved = conversations.filter((c) => c.resolved);
  // Guard the empty case so the rates come back as 0 rather than NaN.
  const taskSuccessRate = count > 0 ? resolved.length / count : 0;
  const costPerConversation = count > 0 ? totalSpend / count : 0;
  // With no resolutions, cost per successful outcome is unbounded.
  const costPerSuccessfulOutcome =
    resolved.length > 0 ? totalSpend / resolved.length : Infinity;
  return {
    totalConversations: count,
    totalSpend,
    taskSuccessRate,
    costPerConversation,
    costPerSuccessfulOutcome,
    averageQualityScore: mean(conversations.map((c) => c.qualityScore)),
  };
}

The costPerSuccessfulOutcome number is the one that connects infrastructure spend to business value. It's also the one that prevents the optimization trap: you can't improve it by cutting costs if doing so reduces resolution rates, and you can't improve it by raising quality if the quality gains cost more than they return.
Breaking Down What Drives Your Cost
Before you can optimize cost per outcome, you need to know which cost component is driving it. Most teams are surprised by the breakdown when they first measure it.
Here's a typical distribution for a customer support agent:
| Cost Component | Typical Share | Primary Driver |
|---|---|---|
| Prompt inference | 40-50% | System prompt size, conversation history |
| Completion inference | 15-20% | Response verbosity, over-explanation |
| Tool calls | 10-20% | Number of tools invoked per conversation |
| Retry overhead | 5-15% | Tool correctness rate |
| Human escalation | 5-20% | Agent's escalation threshold, complex ticket rate |
The prompt cost dominates. System prompts with embedded policy documents, full product catalogs, and detailed instructions can run 3,000-8,000 tokens before a single customer message arrives, and that prompt is re-sent on every model call in the conversation. Depending on model pricing and conversation length, that can mean $0.30-$0.80 of fixed overhead on every conversation, including the simple "what are your hours?" ones.
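To see where your own agent falls, you can estimate the fixed prompt overhead directly. This is a rough sketch, assuming the full system prompt is billed as input on every model call with no prompt caching. The call count and the per-token price below are illustrative placeholders chosen to reproduce the range quoted above, not figures from any provider.

```typescript
// Sketch: fixed system-prompt cost per conversation.
// Assumes the whole system prompt is sent as input on every model call
// and billed at a flat input price (no prompt caching or discounts).
function systemPromptOverheadUsd(
  systemPromptTokens: number,
  modelCallsPerConversation: number, // hypothetical: turns plus tool-result round trips
  inputPricePerMillionTokensUsd: number, // hypothetical: substitute your provider's rate
): number {
  return (
    (systemPromptTokens * modelCallsPerConversation * inputPricePerMillionTokensUsd) /
    1_000_000
  );
}

// Illustrative inputs only: 10 model calls at $10 per million input tokens.
const lowOverhead = systemPromptOverheadUsd(3000, 10, 10);
const highOverhead = systemPromptOverheadUsd(8000, 10, 10);
console.log(lowOverhead.toFixed(2), highOverhead.toFixed(2)); // 0.30 0.80
```

The structure matters more than the numbers: the overhead scales with prompt size times model calls, so trimming the prompt pays off on every turn of every conversation.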
The retry row is the surprising one. At first glance, 5-15% seems small. But retry overhead cascades: a tool call that fails and retries doesn't just double in cost. It delays the response, which increases the chance the customer sends another message while waiting, which extends the conversation, which adds more completion tokens. A 10% retry rate can inflate your total cost by 25% through these second-order effects.
// cost-breakdown.ts
// Assumes an initialized `chanl` client and `daysAgo`/`now` date helpers are in scope.
async function analyzeCostBreakdown(agentId: string, days = 7) {
  const metrics = await chanl.calls.getMetrics({
    agentId,
    dateRange: {
      start: daysAgo(days),
      end: now(),
    },
  });
  const { costBreakdown } = metrics;
  const total = costBreakdown.total;
  // Format one component as a percentage of total spend (0.0% when there is no spend).
  const share = (cost: number) => (total > 0 ? (cost / total) * 100 : 0).toFixed(1) + '%';
  return {
    promptShare: share(costBreakdown.promptTokens),
    completionShare: share(costBreakdown.completionTokens),
    toolShare: share(costBreakdown.toolCalls),
    retryShare: share(costBreakdown.retries),
    escalationShare: share(costBreakdown.humanReview),
    avgPromptTokens: metrics.promptTokens.mean,
    avgToolCallsPerConversation: metrics.toolCalls.mean,
    toolCorrectnessRate: metrics.toolCorrectness,
    retryRate: metrics.retryRate,
  };
}

You want these as shares of total cost, not raw numbers, so you can identify the biggest lever. If prompt cost is 55% of spend and you have a 6,000-token system prompt, that's where to start. If retry overhead is 18%, tool descriptions are the likely culprit.
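Picking the biggest lever can be automated once you have the breakdown in dollars. The sketch below ranks components by share of spend; the component names and amounts are made up for illustration and are not a real SDK response shape.

```typescript
// Sketch: rank cost components by their share of total spend.
// Keys and dollar amounts are illustrative, not a real SDK response.
type CostBreakdown = Record<string, number>; // component name -> dollars

function rankCostLevers(
  breakdown: CostBreakdown,
): { component: string; share: number }[] {
  const components = Object.keys(breakdown);
  const total = components.reduce((sum, key) => sum + breakdown[key], 0);
  if (total <= 0) return []; // nothing spent, nothing to rank
  return components
    .map((component) => ({ component, share: breakdown[component] / total }))
    .sort((a, b) => b.share - a.share); // largest share first
}

const ranked = rankCostLevers({
  promptTokens: 550,
  completionTokens: 150,
  toolCalls: 120,
  retries: 80,
  humanReview: 100,
});
console.log(ranked[0].component, (ranked[0].share * 100).toFixed(1) + "%");
// promptTokens 55.0%
```

Whatever lands at the top of the list is where the next hour of optimization work should go.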
Tool Correctness Is the Hidden Cost Drain
Tool correctness is the rate at which your agent selects the right tool with the right parameters on the first try. It's the single biggest lever between an optimized agent and a wasteful one, and the one most teams don't think to measure.
An agent with 95% tool correctness makes roughly 1.05 tool calls per needed operation. An agent with 80% correctness makes about 1.25. That 19% increase in tool calls cascades through the entire cost structure: more tokens to process the error response, more tokens to recover and retry, sometimes an additional conversation turn when the customer notices the delay.
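Those figures fall out of a simple retry model. As a sketch, assume each attempt succeeds independently with probability equal to the tool correctness rate and the agent keeps retrying until it succeeds; the expected number of calls is then one divided by that rate. Real agents cap retries, so treat this as an approximation.

```typescript
// Sketch: expected tool calls per needed operation.
// Assumes independent attempts and retry-until-success (a geometric model).
function expectedCallsPerOperation(correctness: number): number {
  if (correctness <= 0 || correctness > 1) {
    throw new RangeError("correctness must be greater than 0 and at most 1");
  }
  return 1 / correctness;
}

const callsAt95 = expectedCallsPerOperation(0.95); // about 1.05
const callsAt80 = expectedCallsPerOperation(0.8); // 1.25
const increase = ((callsAt80 / callsAt95 - 1) * 100).toFixed(0) + "%";
console.log(increase); // "19%"
```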
The root cause of poor tool correctness is almost always the tool description. Agents decide which tool to call based on the descriptions you provide. When three tools have similar-sounding descriptions, the agent picks wrong 20-30% of the time. "search_knowledge_base" could mean search product docs, search FAQ, or search policy library. The agent has to guess.
// tool-audit.ts
async function auditToolCorrectness(agentId: string) {
  const calls = await chanl.calls.list({
    agentId,
    startDate: sevenDaysAgo(),
    endDate: now(),
    limit: 500,
  });
  // Group retries by tool name to find the worst offenders
  const retryByTool = new Map<string, { retries: number; total: number }>();
  for (const call of calls.items) {
    for (const toolCall of call.toolCalls ?? []) {
      const entry = retryByTool.get(toolCall.toolName) ?? { retries: 0, total: 0 };
      entry.total += 1;
      if (toolCall.wasRetry) entry.retries += 1;
      retryByTool.set(toolCall.toolName, entry);
    }
  }
  // Sort by retry rate descending
  const sorted = [...retryByTool.entries()]
    .map(([tool, stats]) => ({
      tool,
      retryRate: (stats.retries / stats.total * 100).toFixed(1) + '%',
      retries: stats.retries,
      total: stats.total,
    }))
    .sort((a, b) => b.retries / b.total - a.retries / a.total);
  return sorted;
}

When you find tools with high retry rates, the fix is usually a description rewrite. Be specific about what the tool does and when to use it. Here's the before and after:
Before: search_knowledge_base: searches the knowledge base
After: search_policy_docs: Search the company returns and refunds policy document. Use when the customer asks about refund eligibility, return windows, or policy exceptions. Do NOT use for product availability, order status, or pricing questions.
That specificity can cut retry rates from 20% to 3-5% on the tool in question. It's about 10 minutes of work per tool. For a typical agent with 8-12 tools, an afternoon of description rewrites can reduce your total token spend by 15-25%.
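In code, the rewritten description lives inside the tool definition you hand to the agent. Here's a sketch with a hypothetical `ToolDefinition` shape using JSON-Schema-style parameters; the `query` parameter and the `hasUsageGuidance` check are illustrative additions, so adapt both to whatever format your agent framework expects.

```typescript
// Sketch: a tool definition carrying the rewritten description.
// The ToolDefinition shape is hypothetical, not a specific framework's API.
interface ToolDefinition {
  name: string;
  description: string;
  parameters: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required: string[];
  };
}

const searchPolicyDocs: ToolDefinition = {
  name: "search_policy_docs",
  description: [
    "Search the company returns and refunds policy document.",
    "Use when the customer asks about refund eligibility, return windows, or policy exceptions.",
    "Do NOT use for product availability, order status, or pricing questions.",
  ].join(" "),
  parameters: {
    type: "object",
    properties: {
      // Hypothetical parameter for illustration.
      query: { type: "string", description: "The customer's policy question in plain language." },
    },
    required: ["query"],
  },
};

// Cheap lint: does the description say both when to use the tool and when not to?
function hasUsageGuidance(tool: ToolDefinition): boolean {
  const text = tool.description.toLowerCase();
  return text.indexOf("use when") !== -1 && text.indexOf("do not use") !== -1;
}

console.log(hasUsageGuidance(searchPolicyDocs)); // true
```

A check like this won't tell you whether the guidance is good, but running it across all your tools quickly flags the ones that have none.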
For a deeper look at how tool descriptions affect routing decisions, see MCP Tool Descriptions and Agent Accuracy. A good agent tools dashboard surfaces per-tool call volumes, error rates, and retry counts in the same view so you can spot the worst offenders without writing a script.
Wiring In Task Success Signals
The hardest part of this framework is measuring whether the conversation actually resolved the customer's issue. In most CX contexts, ground truth isn't available in real time. You have to triangulate from multiple signals.
Explicit resolution signals: The customer clicks "resolved," closes the ticket, or gives a positive CSAT score. High precision, low coverage: most conversations don't generate explicit signals.
LLM-as-judge resolution scoring: A separate evaluator reads the conversation and scores whether the issue was resolved. Medium precision, full coverage, adds evaluation cost of $0.02-$0.05 per conversation.
Re-contact detection: A customer who contacts again within 24 hours about the same issue didn't get resolved. High precision for catching failures, but only surfaces failures that generate follow-up contacts.
Combined scoring: Use the LLM judge as the primary signal, override with explicit resolution where available, and flag re-contacts as unresolved regardless of judge score.
// resolution-scoring.ts
async function scoreResolution(callId: string): Promise<boolean> {
  const call = await chanl.calls.get({ callId });
  // Explicit signals take priority. `!= null` also skips `undefined`,
  // so a missing rating is not misread as a low one.
  if (call.customerRating != null) return call.customerRating >= 4;
  if (call.ticketStatus === 'resolved') return true;
  // Re-contact signal: a follow-up within 24 hours marks the call unresolved,
  // so the judge is never consulted for it.
  const recontact = await chanl.calls.findRecontact({
    customerId: call.customerId,
    afterCallId: callId,
    withinHours: 24,
  });
  if (recontact != null) return false;
  // Fall back to LLM-as-judge on the resolution criterion
  const result = await chanl.scorecards.evaluate({
    callId,
    scorecardId: 'resolution-check',
  });
  const resolutionCriterion = result.criteria.find((c) => c.name === 'Task Resolution');
  // A missing criterion counts as unresolved.
  return (resolutionCriterion?.score ?? 0) >= 4;
}

This combination gives you a task success signal for every conversation. It won't be perfectly accurate. But a 5% error rate on your resolution signal adds only a small amount of noise to your cost-per-outcome number, and that noise is far less dangerous than operating without the signal at all.
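You can put a number on that noise. The sketch below computes a worst-case range for cost per outcome, assuming every mislabeled conversation errs in the same direction. The inputs are illustrative (1,000 conversations, $2,800 total spend, 680 measured resolutions), and the function name is hypothetical.

```typescript
// Sketch: worst-case range for cost per outcome under label noise.
// Assumes all mislabels could fall in one direction, which overstates
// the swing you would see if errors partly cancel out.
function costPerOutcomeRange(
  totalSpend: number,
  conversations: number,
  measuredResolved: number,
  labelErrorRate: number, // e.g. 0.05 for a 5% error rate
): { low: number; measured: number; high: number } {
  const slack = labelErrorRate * conversations;
  const maxResolved = Math.min(conversations, measuredResolved + slack);
  const minResolved = Math.max(0, measuredResolved - slack);
  const per = (resolved: number) => (resolved > 0 ? totalSpend / resolved : Infinity);
  return {
    low: per(maxResolved), // more true resolutions means cheaper outcomes
    measured: per(measuredResolved),
    high: per(minResolved), // fewer true resolutions means pricier outcomes
  };
}

const range = costPerOutcomeRange(2800, 1000, 680, 0.05);
console.log(range.low.toFixed(2), range.measured.toFixed(2), range.high.toFixed(2));
// 3.84 4.12 4.44
```

In this illustrative case the worst-case swing is roughly 7-8% in either direction. That's enough to blur small week-over-week changes, but not enough to hide the gap between a $4 agent and a $1.50 one.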
For more on pairing offline and production evaluation signals, see Online vs. Offline Evals: What to Measure Where. A production monitoring dashboard should track re-contact rate alongside quality scores, so you can see the gap between "looking good" and "working well."
Building the Unit Economics Dashboard
With cost measurement and resolution scoring in place, the dashboard is straightforward. You want four numbers that update weekly:
// weekly-economics.ts
async function getWeeklyUnitEconomics(agentId: string) {
  const [thisWeek, lastWeek] = await Promise.all([
    chanl.calls.getMetrics({
      agentId,
      dateRange: { start: sevenDaysAgo(), end: now() },
    }),
    chanl.calls.getMetrics({
      agentId,
      dateRange: { start: fourteenDaysAgo(), end: sevenDaysAgo() },
    }),
  ]);
  const costPerOutcomeNow = thisWeek.totalCost / thisWeek.resolvedCount;
  const costPerOutcomePrev = lastWeek.totalCost / lastWeek.resolvedCount;
  return {
    // Core metrics
    costPerConversation: thisWeek.totalCost / thisWeek.conversationCount,
    taskSuccessRate: thisWeek.resolvedCount / thisWeek.conversationCount,
    costPerSuccessfulOutcome: costPerOutcomeNow,
    toolCorrectness: thisWeek.toolCorrectness,
    // Week-over-week change.
    // costPerOutcome: positive = getting worse. taskSuccessRate: positive = getting better.
    delta: {
      costPerOutcome: costPerOutcomeNow - costPerOutcomePrev,
      taskSuccessRate:
        thisWeek.resolvedCount / thisWeek.conversationCount -
        lastWeek.resolvedCount / lastWeek.conversationCount,
    },
  };
}

The delta fields are the early warning system. Rising costPerOutcome with flat taskSuccessRate means you're paying more without getting more value. Rising taskSuccessRate alongside rising costPerOutcome might mean your success improvements are coming from expensive tool calls you could prune. Falling costPerOutcome with rising taskSuccessRate means your optimizations are working.
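If you want the dashboard to label the week for you, the three readings above translate into a small classifier. This is a sketch: the noise thresholds are hypothetical defaults you'd tune to your own volume, and anything outside the three described cases is deliberately left for a human to review.

```typescript
// Sketch: turn week-over-week deltas into a plain-language reading.
interface WeeklyDelta {
  costPerOutcome: number; // dollars, this week minus last week
  taskSuccessRate: number; // fraction, this week minus last week
}

function interpretDelta(
  delta: WeeklyDelta,
  costNoise = 0.05, // hypothetical: ignore cost moves under 5 cents
  rateNoise = 0.01, // hypothetical: ignore success moves under 1 point
): string {
  const costUp = delta.costPerOutcome > costNoise;
  const costDown = delta.costPerOutcome < -costNoise;
  const rateUp = delta.taskSuccessRate > rateNoise;
  const rateFlat = Math.abs(delta.taskSuccessRate) <= rateNoise;

  if (costUp && rateFlat) return "paying more without more value";
  if (costUp && rateUp) return "success gains may rely on expensive tool calls";
  if (costDown && rateUp) return "optimizations are working";
  return "review manually"; // any combination the text above does not cover
}

console.log(interpretDelta({ costPerOutcome: 0.4, taskSuccessRate: 0.0 }));
// paying more without more value
```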
What Good Looks Like
Concrete benchmarks give you targets. Based on published data and production deployments:
| Metric | Early Stage | Optimized | Best-in-Class |
|---|---|---|---|
| Cost per conversation | $1.50-$3.50 | $0.80-$1.50 | $0.40-$0.80 |
| Task success rate | 55-70% | 75-85% | 88-95% |
| Cost per successful outcome | $2.50-$5.50 | $1.00-$2.00 | $0.50-$1.00 |
| Tool correctness | 75-85% | 90-95% | 97-99% |
| Escalation rate | 25-40% | 10-20% | 5-12% |
Early-stage agents land at $3-5 per resolution because tool descriptions are rough, system prompts carry redundant context, and retry logic hasn't been tuned. Optimized agents reach $1-2 through iterative audits. Best-in-class numbers come from narrow-scope agents with precise tool descriptions and lean system prompts.
These benchmarks are for routine customer support workloads. Complex workflows like multi-step order modifications or billing disputes run higher on all dimensions. The right comparison is always your human baseline, not an abstract number from a blog post.
The human agent comparison: Human support costs $6-$8 per contact for routine inquiries and $15-$30+ for complex technical support. At $1.50 per successful AI outcome, you're delivering a 4-5x cost advantage on routine work. At $4.00 per outcome, the advantage shrinks to 1.5-2x. At $7.00+, which happens when success rates are low, you're paying as much as human handling or more, and getting less reliability.
The business case for AI agents depends on keeping cost per successful outcome well below your human baseline. That means measuring it, not just measuring quality scores and hoping the economics work out.
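The multiples in the human comparison are a single division, which makes them easy to recompute against your own baseline. A minimal sketch using the $6-$8 routine contact range quoted above; the function name is illustrative.

```typescript
// Sketch: how many times cheaper an AI outcome is than the human baseline.
// A result below 1 means the agent costs more than human handling.
function costAdvantage(humanCostPerContact: number, aiCostPerOutcome: number): number {
  return humanCostPerContact / aiCostPerOutcome;
}

for (const aiCost of [1.5, 4.0, 7.0]) {
  const low = costAdvantage(6, aiCost).toFixed(1);
  const high = costAdvantage(8, aiCost).toFixed(1);
  console.log(`$${aiCost.toFixed(2)} per outcome: ${low}x to ${high}x`);
}
// $1.50 per outcome: 4.0x to 5.3x
// $4.00 per outcome: 1.5x to 2.0x
// $7.00 per outcome: 0.9x to 1.1x
```

For a like-for-like comparison, divide the human cost per contact by the human resolution rate first, so both sides are expressed per successful outcome.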
The Improvement Sequence
Once your dashboard is running, the improvement cycle is clear: measure, find the biggest lever, fix it, re-measure.
Roughly in order of impact:
Tool descriptions first. Rewrite vague descriptions into specific ones with explicit "do not use when" guidance. This has the highest ROI: it takes 2-4 hours per agent and reduces retry cost 20-40%.
System prompt audit second. Find content in your system prompt that belongs in a knowledge base. Policy documents, product catalogs, edge-case instructions that cover 0.1% of tickets. Move them to retrieval. Cut 500-2,000 tokens per conversation immediately.
Early resolution detection third. Add a check after each tool call result: has the issue been resolved? If it has, stop elaborating, stop offering additional help that wasn't requested, and end the conversation. Reduces completion tokens 15-25%.
Escalation threshold tuning fourth. If your escalation rate is above 20%, your agent is probably sending humans tickets it could resolve itself. If it's below 5%, it might be holding onto cases that genuinely need a human. Tune the threshold explicitly, with a scorecarded sample of both escalated and non-escalated conversations to validate.
Retry logic last. Add error classification: fail fast on unrecoverable errors (malformed parameters, 400s), retry with backoff on recoverable ones (rate limits, transient 503s). Prevents retry cascades while keeping the agent resilient.
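The retry step can be sketched as a small wrapper. This assumes failed tool calls throw an object with a numeric `status` field, which is a hypothetical convention for illustration; the recoverable set covers only the two cases named above, and the attempt limit and delays are placeholder defaults.

```typescript
// Sketch: retry only recoverable failures, with capped exponential backoff.
const RECOVERABLE_STATUSES = [429, 503]; // rate limits, transient unavailability

function isRecoverable(status: number | undefined): boolean {
  return status !== undefined && RECOVERABLE_STATUSES.indexOf(status) !== -1;
}

// Delay doubles per attempt: 200ms, 400ms, 800ms, ... capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 5000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Hypothetical convention: errors carry a numeric `status` field.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

async function callWithRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const outOfAttempts = attempt + 1 >= maxAttempts;
      // Fail fast on unrecoverable errors (e.g. a 400) or when attempts run out.
      if (!isRecoverable(statusOf(err)) || outOfAttempts) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

Errors without a recognizable status are treated as unrecoverable here, which errs on the side of failing fast rather than burning tokens on retries that can't succeed.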
The goal isn't to minimize cost in isolation. It's to minimize cost per successful outcome, which means you can't cut corners on resolution quality. An agent that resolves 95% of tickets at $0.70 per conversation ($0.74 per successful outcome) is better than one that resolves 60% at $0.50 ($0.83 per successful outcome). The second looks cheaper on the invoice and costs more per unit of value delivered.
When your CFO asks "what does each resolution actually cost us?" next time, you'll have the answer. Not "our pass rate is 91%" but "we're at $1.24 per successful outcome, down from $4.12 six weeks ago, and we're tracking toward $0.90 by Q3." That's a conversation worth having.



