What is cost per successful outcome for AI agents?

Cost per successful outcome measures the total marginal cost of running an AI agent (model tokens, tool calls, retries, and any human escalation) divided by the number of conversations that actually resolved the customer's issue. It pairs the efficiency metric with the quality metric so you can't optimize one at the expense of the other.

Why is pass rate not enough to measure AI agent performance?

Pass rate measures whether your agent produced a valid-looking output. It doesn't measure whether the customer's issue was resolved, whether the resolution cost $0.20 or $2.00, or whether the agent made three unnecessary tool calls along the way. An agent can have a 90% pass rate and still be losing money on every ticket it handles.

What should I include in the cost of an agent interaction?

Include model inference costs (prompt tokens plus completion tokens), tool call costs (API calls, database queries, third-party lookups), retry costs from failed tool calls, and human review costs for escalated tickets. Exclude infrastructure fixed costs; you want the marginal cost per conversation so you can compare across agents and over time.

How do I measure task success in production without ground truth?

Use a combination of signals: LLM-as-judge scorecards for outcome quality, explicit resolution signals when available (customer confirms, closes ticket, gives positive CSAT), and re-contact detection (a customer who contacts again within 24 hours about the same issue did not get resolved). The combination is more reliable than any single signal and gives you coverage across all conversation types.

What's a good benchmark for cost per AI agent resolution?

Well-optimized agents average $0.50 to $1.84 per resolved contact compared to $6 to $8 for human-handled contacts. Early-stage agents often run $3 to $5 per resolution due to inefficient tool use and retry overhead. If your cost per successful outcome exceeds $3 for routine inquiries, tool correctness and system prompt bloat are usually the first places to audit.

What is tool correctness and why does it matter for agent costs?

Tool correctness is the rate at which your agent calls the right tool with the right parameters on the first attempt. Failed tool calls trigger retries, which multiply token costs and add latency. An agent with 80% tool correctness makes roughly 1.25 calls per needed operation; an agent with 97% correctness makes about 1.03. At scale, that difference directly inflates your cost per outcome.

How do I reduce agent token costs without hurting quality?

Start with tool descriptions: better descriptions reduce incorrect tool selections and cut retries. Then audit your system prompt for content that belongs in a knowledge base instead (policy documents, product catalogs). Finally, add early resolution detection so the agent stops after resolving the issue rather than continuing to elaborate. These three changes typically reduce token costs 20 to 40 percent without touching quality scores.

How does cost per successful outcome differ from cost per conversation?

Cost per conversation includes all conversations regardless of whether the issue resolved. Cost per successful outcome only counts conversations where the customer's issue was actually addressed. If your agent resolves 70% of tickets and costs $1.00 per conversation, your cost per successful outcome is $1.43. This distinction matters because optimizing for cost per conversation can push you toward cheaper conversations that don't actually resolve anything.

Cost Per Successful Outcome: The AI Agent Metric Teams Miss

Your agent handled 847 tickets last week. Pass rate: 91%. Average score: 4.1 out of 5.0. The team is happy. You're shipping.

Then your CFO asks one question: "What does each resolution actually cost us?"

You open the spreadsheet. Model costs, tool call fees, retry overhead, the occasional human escalation. You add it up. Your agent costs $2.80 per conversation. Your human agents cost $7.00. On paper, you're winning. But then you look closer: your agent's task resolution rate is 68%. When you divide actual cost by actual successful resolutions, you're at $4.12 per successful outcome. Your human agents resolve 94% of tickets. Their cost per successful outcome is $7.45.

You're not far ahead anymore. And you had no idea.

This is the metric gap that almost every team hits in their second month of production. Pass rate and quality scores tell you whether your agent produces good-looking output. They don't tell you whether that output accomplishes anything, or what it costs when it doesn't.

The metric that answers both questions is cost per successful outcome. Here's how to build it.

What teams measure	What actually predicts ROI
Pass rate	Task success rate
Average quality score	Quality score paired with cost
Total model spend	Cost per resolved conversation
Escalation count	Escalation rate plus cost per escalation

Why pass rate alone fails you

Pass rate measures whether your agent produced output that clears a quality threshold. An output that fills in all required fields, stays on topic, and avoids hallucinating scores a pass. But a pass doesn't mean the ticket closed. It doesn't mean the customer got what they needed. It doesn't mean your tool calls were efficient.

Here's a failure mode I've seen repeatedly: teams deploy an agent with 92% pass rate and call it production-ready. Three weeks later, they check their re-contact rate (the percentage of customers who contacted support again within 48 hours). It's 31%. The agent answers questions coherently but doesn't actually solve problems. Pass rate saw nothing. Cost per successful outcome would have caught it immediately.

The other failure pass rate misses is cost efficiency. An agent that resolves tickets by making six tool calls instead of two looks identical on a quality scorecard. Both produce accurate, helpful responses. But the six-call agent pays for three times as many tool calls, plus the tokens needed to process each extra result. At scale, that difference compounds into a business problem that no quality score surfaces.

You need both dimensions: did it work, and what did it cost when it did?

According to a 2026 survey, only 31% of organizations have implemented a measurement framework for agentic AI, while 47% either have no framework or are unsure whether one exists. Most of the teams with frameworks are measuring quality scores. Almost none are measuring cost per successful outcome. That's the gap.

Figure: How cost and outcome signals combine into a single unit economics metric

The two numbers every agent needs

The simplest version of this framework is a single ratio: total spend divided by resolved conversations. But to act on it, you need to understand both inputs separately.

Cost per conversation is the total marginal cost to run one conversation: model tokens, tool call fees, retry overhead, and any human escalation costs. This sets your floor: cost per successful outcome can never be lower than cost per conversation. If each conversation costs $1.00 regardless of outcome, and you're targeting $2.00 per resolved ticket, your task success rate needs to be at least 50% before you're in range.

Task success rate is the percentage of conversations that actually resolve the customer's issue. This is harder to measure than cost because ground truth is rarely available in real time. But it's the multiplier that turns cost-per-conversation into cost-per-successful-outcome.
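
The relationship between the two inputs can be sketched as a pair of small helpers. The function names are illustrative, not part of any SDK, and the sketch assumes a conversation costs the same whether or not it resolves.

```typescript
// Cost per successful outcome = cost per conversation / task success rate.
function costPerSuccessfulOutcome(costPerConversation: number, taskSuccessRate: number): number {
  if (taskSuccessRate <= 0) return Infinity; // nothing resolved, so no finite unit cost
  return costPerConversation / taskSuccessRate;
}

// Minimum success rate needed to hit a target cost per successful outcome.
// A result above 1 means the target is unreachable at this cost per conversation.
function requiredSuccessRate(costPerConversation: number, targetCostPerOutcome: number): number {
  return costPerConversation / targetCostPerOutcome;
}

const outcomeCost = costPerSuccessfulOutcome(1.0, 0.7); // about 1.43
const neededRate = requiredSuccessRate(1.2, 2.0);       // 0.6, i.e. 60%
```

A required rate above 1 is the useful warning sign: no amount of quality work reaches the target until cost per conversation comes down.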

typescript

// unit-economics.ts
interface ConversationMetrics {
  callId: string;
  modelCost: number;        // prompt and completion tokens, each at its per-token price
  toolCost: number;         // API calls, database queries, third-party lookups
  retryCost: number;        // failed tool calls that triggered retries
  escalationCost: number;   // human review cost, if escalated
  totalCost: number;        // sum of the four cost components above
  qualityScore: number;     // 0-5 from scorecard
  resolved: boolean;        // did the customer's issue actually close?
}

// Arithmetic mean; returns 0 for an empty list.
function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function computeUnitEconomics(conversations: ConversationMetrics[]) {
  const count = conversations.length;
  const totalSpend = conversations.reduce((sum, c) => sum + c.totalCost, 0);
  const resolved = conversations.filter((c) => c.resolved);
  // Guard the ratios so an empty window yields 0 instead of NaN
  const taskSuccessRate = count > 0 ? resolved.length / count : 0;
  const costPerConversation = count > 0 ? totalSpend / count : 0;
  const costPerSuccessfulOutcome =
    resolved.length > 0 ? totalSpend / resolved.length : Infinity;

  return {
    totalConversations: count,
    totalSpend,
    taskSuccessRate,
    costPerConversation,
    costPerSuccessfulOutcome,
    averageQualityScore: mean(conversations.map((c) => c.qualityScore)),
  };
}

The costPerSuccessfulOutcome number is the one that connects infrastructure spend to business value. It's also the one that prevents the optimization trap: you can't improve it by cutting costs if doing so reduces resolution rates, and you can't improve it by improving quality if the quality improvements are too expensive.

Breaking down what drives your cost

Before you can optimize cost per outcome, you need to know which cost component is driving it. Most teams are surprised by the breakdown when they first measure it.

Here's a typical distribution for a customer support agent:

Cost Component	Typical Share	Primary Driver
Prompt inference	40-50%	System prompt size, conversation history
Completion inference	15-20%	Response verbosity, over-explanation
Tool calls	10-20%	Number of tools invoked per conversation
Retry overhead	5-15%	Tool correctness rate
Human escalation	5-20%	Agent's escalation threshold, complex ticket rate

The prompt cost dominates. System prompts with embedded policy documents, full product catalogs, and detailed instructions can run 3,000-8,000 tokens per conversation before a single customer message arrives. That's a $0.30-$0.80 fixed overhead paid on every conversation, including the simple "what are your hours?" ones.
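
How large that overhead is depends on the token price and on how many model calls re-send the prompt within one conversation. A rough sketch of the arithmetic follows. The price of $10 per million input tokens and the 10 model calls per conversation are hypothetical values chosen for illustration, not figures from any provider, and the sketch ignores prompt caching.

```typescript
// Fixed system-prompt overhead per conversation, in dollars.
// Assumes the full system prompt is billed as input on every model call (no caching).
function systemPromptOverhead(
  promptTokens: number,
  modelCallsPerConversation: number,
  pricePerMillionInputTokens: number // hypothetical price; check your provider's rates
): number {
  return (promptTokens * modelCallsPerConversation * pricePerMillionInputTokens) / 1000000;
}

const leanPrompt = systemPromptOverhead(3000, 10, 10);  // 0.30
const heavyPrompt = systemPromptOverhead(8000, 10, 10); // 0.80
```

Under these assumptions the overhead lands in the $0.30-$0.80 range; with a cheaper model, fewer model calls, or cached prompts it shrinks proportionally.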

The retry row is the surprising one. At first glance, 5-15% seems small. But retry overhead cascades: a tool call that fails and retries doesn't just double in cost. It delays the response, which increases the chance the customer sends another message while waiting, which extends the conversation, which adds more completion tokens. A 10% retry rate can inflate your total cost by 25% through these second-order effects.

typescript

// cost-breakdown.ts
// Assumes an initialized `chanl` client and the date helpers `daysAgo` and `now` are in scope.
async function analyzeCostBreakdown(agentId: string, days = 7) {
  const metrics = await chanl.calls.getMetrics({
    agentId,
    dateRange: {
      start: daysAgo(days),
      end: now(),
    },
  });

  const { costBreakdown } = metrics;
  const total = costBreakdown.total;

  // Format one component as a percentage of total spend; avoids "NaN%" when total is 0
  const share = (cost: number) =>
    total > 0 ? ((cost / total) * 100).toFixed(1) + '%' : 'n/a';

  return {
    promptShare: share(costBreakdown.promptTokens),
    completionShare: share(costBreakdown.completionTokens),
    toolShare: share(costBreakdown.toolCalls),
    retryShare: share(costBreakdown.retries),
    escalationShare: share(costBreakdown.humanReview),
    avgPromptTokens: metrics.promptTokens.mean,
    avgToolCallsPerConversation: metrics.toolCalls.mean,
    toolCorrectnessRate: metrics.toolCorrectness,
    retryRate: metrics.retryRate,
  };
}

You want these as shares of total cost, not raw numbers, so you can identify the biggest lever. If prompt cost is 55% of spend and you have a 6,000-token system prompt, that's where to start. If retry overhead is 18%, tool descriptions are the problem.
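
Picking the biggest lever is a matter of ranking components by share. A minimal sketch, assuming you already have dollar totals per component for the period; the `CostLine` shape and the sample figures are made up for illustration.

```typescript
// Hypothetical input shape: one dollar total per cost component for the period.
interface CostLine {
  component: string;
  cost: number;
}

// Returns components with their share of total spend, largest first.
function rankCostDrivers(lines: CostLine[]): { component: string; share: number }[] {
  const total = lines.reduce((sum, line) => sum + line.cost, 0);
  if (total <= 0) return [];
  return lines
    .map((line) => ({ component: line.component, share: line.cost / total }))
    .sort((a, b) => b.share - a.share);
}

const ranked = rankCostDrivers([
  { component: "promptTokens", cost: 550 },
  { component: "completionTokens", cost: 150 },
  { component: "toolCalls", cost: 100 },
  { component: "retries", cost: 180 },
  { component: "humanReview", cost: 20 },
]);
// ranked[0] is promptTokens at a 0.55 share; retries is second at 0.18
```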

Tool correctness: the hidden efficiency drain

Tool correctness is the rate at which your agent selects the right tool with the right parameters on the first try. It's the single biggest lever between an optimized agent and a wasteful one, and the one most teams don't think to measure.

An agent with 95% tool correctness makes roughly 1.05 tool calls per needed operation. An agent with 80% correctness makes about 1.25. That 19% increase in tool calls cascades through the entire cost structure: more tokens to process the error response, more tokens to recover and retry, sometimes an additional conversation turn when the customer notices the delay.
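
The calls-per-operation figures follow from a simple model: if each attempt succeeds independently with probability p and the agent retries until it succeeds, the expected number of attempts is 1/p. A short sketch under that assumption; real agents cap retries, so the model slightly overstates the count at low correctness.

```typescript
// Expected tool calls per needed operation when each attempt succeeds
// independently with probability `correctness` and failures are retried
// until one succeeds (the mean of a geometric distribution).
function expectedCallsPerOperation(correctness: number): number {
  if (correctness <= 0 || correctness > 1) {
    throw new RangeError("correctness must be in (0, 1]");
  }
  return 1 / correctness;
}

const at95 = expectedCallsPerOperation(0.95); // about 1.05
const at80 = expectedCallsPerOperation(0.8);  // 1.25
const extraCalls = at80 / at95 - 1;           // about 0.19, i.e. roughly 19% more calls
```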

The root cause of poor tool correctness is almost always the tool description. Agents decide which tool to call based on the descriptions you provide. When three tools have similar-sounding descriptions, the agent picks wrong 20-30% of the time. "search_knowledge_base" could mean search product docs, search FAQ, or search policy library. The agent has to guess.

typescript

// tool-audit.ts
async function auditToolCorrectness(agentId: string) {
  const calls = await chanl.calls.list({
    agentId,
    startDate: sevenDaysAgo(),
    endDate: now(),
    limit: 500,
  });
 
  // Group retries by tool name to find the worst offenders
  const retryByTool = new Map<string, { retries: number; total: number }>();
 
  for (const call of calls.items) {
    for (const toolCall of call.toolCalls ?? []) {
      const entry = retryByTool.get(toolCall.toolName) ?? { retries: 0, total: 0 };
      entry.total += 1;
      if (toolCall.wasRetry) entry.retries += 1;
      retryByTool.set(toolCall.toolName, entry);
    }
  }
 
  // Sort by retry rate descending
  const sorted = [...retryByTool.entries()]
    .map(([tool, stats]) => ({
      tool,
      retryRate: (stats.retries / stats.total * 100).toFixed(1) + '%',
      retries: stats.retries,
      total: stats.total,
    }))
    .sort((a, b) => b.retries / b.total - a.retries / a.total);
 
  return sorted;
}

When you find tools with high retry rates, the fix is usually a description rewrite. Be specific about what the tool does and when to use it. Here's the before and after:

Before: search_knowledge_base: searches the knowledge base

After: search_policy_docs: Search the company returns and refunds policy document. Use when the customer asks about refund eligibility, return windows, or policy exceptions. Do NOT use for product availability, order status, or pricing questions.

That specificity collapses retry rates from 20% to 3-5% on the tool in question. It's 10 minutes of work per tool. For a typical agent with 8-12 tools, an afternoon of description rewrites can reduce your total token spend by 15-25%.
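
One way to make the rewrite repeatable is a small lint pass over your tool definitions. The sketch below is an assumption-heavy illustration: the `ToolDefinition` shape is hypothetical (real frameworks use different field names), and the three checks are heuristics of this example, not an established rule.

```typescript
// Hypothetical tool definition shape; real frameworks differ in field names.
interface ToolDefinition {
  name: string;
  description: string;
}

// Heuristic lint: flag descriptions that are very short or that never say
// when to use or when to avoid the tool.
function lintToolDescription(tool: ToolDefinition): string[] {
  const issues: string[] = [];
  const text = tool.description.toLowerCase();
  const wordCount = tool.description.trim().split(/\s+/).length;
  if (wordCount < 10) issues.push("description is under 10 words");
  if (text.indexOf("use when") === -1) issues.push('missing "Use when" guidance');
  if (text.indexOf("do not use") === -1) issues.push('missing "Do NOT use" guidance');
  return issues;
}

const before: ToolDefinition = {
  name: "search_knowledge_base",
  description: "searches the knowledge base",
};

const after: ToolDefinition = {
  name: "search_policy_docs",
  description:
    "Search the company returns and refunds policy document. Use when the customer asks about " +
    "refund eligibility, return windows, or policy exceptions. Do NOT use for product availability, " +
    "order status, or pricing questions.",
};

const beforeIssues = lintToolDescription(before); // three issues
const afterIssues = lintToolDescription(after);   // no issues
```

A lint like this won't tell you whether a description is good, only whether it is obviously underspecified, so treat a clean result as a starting point for the retry audit rather than a replacement for it.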

For a deeper look at how tool descriptions affect routing decisions, see MCP Tool Descriptions and Agent Accuracy. The Chanl Tools feature shows per-tool call volumes, error rates, and retry counts in the same view.

Wiring in task success signals

The hardest part of this framework is measuring whether the conversation actually resolved the customer's issue. In most CX contexts, ground truth isn't available in real time. You have to triangulate from multiple signals.

Explicit resolution signals: The customer clicks "resolved," closes the ticket, or gives a positive CSAT score. High precision, low coverage: most conversations don't generate explicit signals.

LLM-as-judge resolution scoring: A separate evaluator reads the conversation and scores whether the issue was resolved. Medium precision, full coverage, adds evaluation cost of $0.02-$0.05 per conversation.

Re-contact detection: A customer who contacts again within 24 hours about the same issue didn't get resolved. High precision for catching failures, but only surfaces failures that generate follow-up contacts.

Combined scoring: Use the LLM judge as the primary signal, override with explicit resolution where available, and flag re-contacts as unresolved regardless of judge score.

typescript

// resolution-scoring.ts
async function scoreResolution(callId: string): Promise<boolean> {
  const call = await chanl.calls.get({ callId });
 
  // Explicit signal takes priority (loose != null also covers a missing field)
  if (call.customerRating != null) return call.customerRating >= 4;
  if (call.ticketStatus === 'resolved') return true;

  // Re-contact signal: this customer was not resolved, override any judge score
  const recontact = await chanl.calls.findRecontact({
    customerId: call.customerId,
    afterCallId: callId,
    withinHours: 24,
  });
  if (recontact != null) return false;
 
  // Fall back to LLM-as-judge on the resolution criterion
  const result = await chanl.scorecards.evaluate({
    callId,
    scorecardId: 'resolution-check',
  });
 
  const resolutionCriterion = result.criteria.find((c) => c.name === 'Task Resolution');
  return (resolutionCriterion?.score ?? 0) >= 4;
}

This combination gives you a task success signal for every conversation. It won't be perfectly accurate. But a 5% error rate on your resolution signal adds only a small amount of noise to your cost-per-outcome number, and that noise is far less dangerous than operating without the signal at all.
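
To see why a small error rate is tolerable, consider a simplified model in which each resolution label is flipped independently with the same probability in either direction. That symmetric-noise assumption is made here purely for illustration; real judge errors are often skewed one way, which would shift the result.

```typescript
// Observed success rate when each true label is flipped with probability `errorRate`.
function observedSuccessRate(trueRate: number, errorRate: number): number {
  return trueRate * (1 - errorRate) + (1 - trueRate) * errorRate;
}

// Relative error in measured cost per successful outcome caused by label noise.
// Cost per conversation cancels out, so only the two rates matter.
function costPerOutcomeDistortion(trueRate: number, errorRate: number): number {
  return trueRate / observedSuccessRate(trueRate, errorRate) - 1;
}

const observed = observedSuccessRate(0.7, 0.05);        // 0.68
const distortion = costPerOutcomeDistortion(0.7, 0.05); // about 0.029, i.e. ~3% too high
```

With a true success rate of 70% and 5% label noise, the measured cost per outcome is off by about 3%, which is small next to the gaps between early-stage and optimized agents.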

For more on pairing offline and production evaluation signals, see Online vs. Offline Evals: What to Measure Where. The Chanl Monitoring feature tracks re-contact rate alongside quality scores in the same dashboard, so you can see the gap between "looking good" and "working well."

Building the unit economics dashboard

With cost measurement and resolution scoring in place, the dashboard is straightforward. You want four numbers that update weekly:

typescript

// weekly-economics.ts
async function getWeeklyUnitEconomics(agentId: string) {
  const [thisWeek, lastWeek] = await Promise.all([
    chanl.calls.getMetrics({
      agentId,
      dateRange: { start: sevenDaysAgo(), end: now() },
    }),
    chanl.calls.getMetrics({
      agentId,
      dateRange: { start: fourteenDaysAgo(), end: sevenDaysAgo() },
    }),
  ]);
 
  const costPerOutcomeNow = thisWeek.totalCost / thisWeek.resolvedCount;
  const costPerOutcomePrev = lastWeek.totalCost / lastWeek.resolvedCount;
 
  return {
    // Core metrics
    costPerConversation: thisWeek.totalCost / thisWeek.conversationCount,
    taskSuccessRate: thisWeek.resolvedCount / thisWeek.conversationCount,
    costPerSuccessfulOutcome: costPerOutcomeNow,
    toolCorrectness: thisWeek.toolCorrectness,
 
    // Week-over-week change (positive costPerOutcome = getting worse;
    // positive taskSuccessRate = getting better)
    delta: {
      costPerOutcome: costPerOutcomeNow - costPerOutcomePrev,
      taskSuccessRate:
        thisWeek.resolvedCount / thisWeek.conversationCount -
        lastWeek.resolvedCount / lastWeek.conversationCount,
    },
  };
}

The delta fields are the early warning system. Rising costPerOutcome with flat taskSuccessRate means you're paying more without getting more value. Rising taskSuccessRate alongside rising costPerOutcome might mean your success improvements are coming from expensive tool calls you could prune. Both improving means your optimizations are working.
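
Those reading rules can be encoded so the dashboard labels each week automatically. A minimal sketch: the tolerance values that decide what counts as "flat" are hypothetical defaults, and the returned strings are placeholders for whatever your alerting uses.

```typescript
interface WeeklyDelta {
  costPerOutcome: number;  // this week minus last week, in dollars
  taskSuccessRate: number; // this week minus last week, as a fraction
}

// Maps a week-over-week delta to one of the patterns described above.
// Changes smaller than the tolerances are treated as flat.
function interpretDelta(delta: WeeklyDelta, costTolerance = 0.05, rateTolerance = 0.01): string {
  const costUp = delta.costPerOutcome > costTolerance;
  const costDown = delta.costPerOutcome < -costTolerance;
  const successUp = delta.taskSuccessRate > rateTolerance;
  const successFlat = Math.abs(delta.taskSuccessRate) <= rateTolerance;

  if (costUp && successFlat) return "paying more without more value";
  if (costUp && successUp) return "success gains may rely on expensive tool calls";
  if (costDown && successUp) return "optimizations are working";
  return "no clear pattern, review manually";
}

const label = interpretDelta({ costPerOutcome: 0.3, taskSuccessRate: 0.0 });
// label is "paying more without more value"
```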

What good looks like

Concrete benchmarks give you targets. Based on published data and production deployments:

Metric	Early Stage	Optimized	Best-in-Class
Cost per conversation	$1.50-$3.50	$0.80-$1.50	$0.40-$0.80
Task success rate	55-70%	75-85%	88-95%
Cost per successful outcome	$2.50-$5.50	$1.00-$2.00	$0.50-$1.00
Tool correctness	75-85%	90-95%	97-99%
Escalation rate	25-40%	10-20%	5-12%

Early-stage agents land at $3-5 per resolution because tool descriptions are rough, system prompts carry redundant context, and retry logic hasn't been tuned. Optimized agents reach $1-2 through iterative audits. Best-in-class numbers come from narrow-scope agents with precise tool descriptions and lean system prompts.

These benchmarks are for routine customer support workloads. Complex workflows like multi-step order modifications or billing disputes run higher on all dimensions. The right comparison is always your human baseline, not an abstract number from a blog post.

The human agent comparison: Human support costs $6-$8 per contact for routine inquiries and $15-$30+ for complex technical support. At $1.50 per successful AI outcome, you're delivering a 4-5x cost advantage on routine work. At $4.00 per outcome, the advantage shrinks to 1.5-2x. At $7.00+, which happens when success rates are low, you're paying more than human handling and getting less reliability.
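
The multiples above are plain division, sketched here so you can plug in your own baseline. The function name is illustrative. Note that it compares AI cost per successful outcome against human cost per contact, as the paragraph does; dividing the human figure by the human resolution rate first would give a stricter like-for-like comparison.

```typescript
// Cost advantage of the AI agent over human handling, as a [low, high] multiple.
function costAdvantage(aiCostPerOutcome: number, humanLow: number, humanHigh: number): [number, number] {
  return [humanLow / aiCostPerOutcome, humanHigh / aiCostPerOutcome];
}

const routine = costAdvantage(1.5, 6, 8); // [4, about 5.3]
const weak = costAdvantage(4, 6, 8);      // [1.5, 2]
```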

The business case for AI agents depends on keeping cost per successful outcome well below your human baseline. That means measuring it, not just measuring quality scores and hoping the economics work out.

The improvement sequence

Once your dashboard is running, the improvement cycle is clear: measure, find the biggest lever, fix it, re-measure.

Roughly in order of impact:

Tool descriptions first. Rewrite vague descriptions into specific ones with explicit "do not use when" guidance. This has the highest ROI: it takes 2-4 hours per agent and reduces retry cost 20-40%.

System prompt audit second. Find content in your system prompt that belongs in a knowledge base. Policy documents, product catalogs, edge-case instructions that cover 0.1% of tickets. Move them to retrieval. Cut 500-2,000 tokens per conversation immediately.

Early resolution detection third. Add a check after each tool call result: has the issue been resolved? Stop elaborating. Stop offering additional help that wasn't requested. End the conversation. Reduces completion tokens 15-25%.

Escalation threshold tuning fourth. If your escalation rate is above 20%, your agent is probably sending humans tickets it could resolve itself. If it's below 5%, it might be holding onto cases that genuinely need a human. Tune the threshold explicitly, and validate it against a scorecarded sample of both escalated and non-escalated conversations.

Retry logic last. Add error classification: fail fast on unrecoverable errors (malformed parameters, 400s), retry with backoff on recoverable ones (rate limits, transient 503s). Prevents retry cascades while keeping the agent resilient.
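
The retry step can be sketched as an error classifier plus a bounded retry loop. This is one possible shape, not a prescribed implementation: the status-code mapping (429 and all 5xx treated as transient) is an assumption of the example, and `ToolResult`, `callToolWithRetry`, and the injected `sleep` function are hypothetical names.

```typescript
type ErrorClass = "recoverable" | "unrecoverable";

// Classify a failed call by HTTP status. Assumption of this sketch:
// 429 and 5xx are transient; every other error status is a caller error.
function classifyStatus(status: number): ErrorClass {
  if (status === 429 || (status >= 500 && status < 600)) return "recoverable";
  return "unrecoverable";
}

// Exponential backoff delay for a zero-based attempt number, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

interface ToolResult {
  status: number;
  body?: unknown;
}

// Calls `invoke` at most `maxAttempts` times, retrying only recoverable failures.
// `sleep` is injected so the sketch carries no runtime dependency.
async function callToolWithRetry(
  invoke: () => Promise<ToolResult>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 3
): Promise<ToolResult> {
  let result = await invoke();
  for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
    if (result.status < 400) return result; // success, stop here
    if (classifyStatus(result.status) === "unrecoverable") return result; // fail fast
    await sleep(backoffDelayMs(attempt));
    result = await invoke();
  }
  return result;
}
```

In Node or a browser, `sleep` can be `(ms) => new Promise((resolve) => setTimeout(resolve, ms))`. Capping attempts matters as much as classification: an uncapped loop on a persistently failing tool is exactly the retry cascade this step is meant to prevent.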

The goal isn't to minimize cost in isolation. It's to minimize cost per successful outcome, which means you can't cut corners on resolution quality. An agent that resolves 95% of tickets at $0.70 per conversation is better than one that resolves 60% at $0.50. The second looks cheaper on the invoice and costs more per unit of value delivered.

When your CFO asks "what does each resolution actually cost us?" next time, you'll have the answer. Not "our pass rate is 91%" but "we're at $1.24 per successful outcome, down from $4.12 six weeks ago, and we're tracking toward $0.90 by Q3." That's a conversation worth having.

Track cost per outcome alongside quality scores

Chanl pairs your scorecard quality data with token costs, tool call rates, and escalation signals to give you a live cost-per-successful-outcome metric for every agent you run.

evaluations agent-metrics unit-economics cost-per-outcome observability typescript

Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
