Everyone posts Opus 4.7 vs GPT-5 Pro benchmarks. It's fun to read. It's also mostly irrelevant.
The vast majority of customer-experience traffic (the chat turns, the intent classifications, the refund confirmations, the FAQ deflections) runs on cheap models: Haiku 4.5, GPT-5 Mini, Gemini 2.5 Flash. These are the $1-per-million-token models doing the actual work while the leaderboards fight over which frontier model codes better.
If you ship an AI agent for customer experience, this is your tier. And it's the tier nobody benchmarks rigorously for CX workloads.
In this article:
- The pricing picture at the $1 tier
- Workload 1: Tool-calling accuracy
- Workload 2: First-token latency
- Workload 3: Structured-output reliability
- The blended economics at 100K conversations per month
- Where each model actually wins
- Routing and observability in production
The Pricing Picture at the $1 Tier
Cheap-tier pricing is compressed enough that output tokens, not input, dominate spend. Haiku 4.5 is $1 input and $5 output per million tokens. GPT-5 Mini sits around $0.25 input and $2 output. Gemini 2.5 Flash is roughly $0.30 input and $2.50 output. The "cheapest" winner flips depending on your input-to-output ratio.
Here is the picture, including the flagship prices for contrast. Provider pricing moves; confirm against the official rate cards before budgeting.
| Model | Input ($/MTok) | Output ($/MTok) | Cached input | Tier |
|---|---|---|---|---|
| Claude Haiku 4.5 | 1.00 | 5.00 | 0.10 | Low |
| GPT-5 Mini | 0.25 | 2.00 | ~0.025 | Low |
| Gemini 2.5 Flash | 0.30 | 2.50 | ~0.03 | Low |
| Claude Sonnet 4.6 | 3.00 | 15.00 | 0.30 | Mid |
| GPT-5.4 (standard) | 2.50 | 15.00 | ~0.25 | Mid |
| Claude Opus 4.7 | 5.00 | 25.00 | 0.50 | Flagship |
| GPT-5.4 Pro | 30.00 | 180.00 | n/a | Flagship |
Two observations matter. First, GPT-5 Mini is roughly 4x cheaper than Haiku 4.5 on input and 2.5x cheaper on output at sticker prices. Second, all three cheap-tier models offer prompt caching at around 10 percent of the input price, which shrinks the input side of the bill enough that the sticker gap stops dominating once your system prompt is cached. If you're not caching, you're leaving 80 to 90 percent of potential input savings on the table no matter which model you pick.
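The caching effect is easiest to see as a blended input price. A minimal sketch, using the sticker and cached-read prices from the table above; the 85 percent cache hit rate is an illustrative assumption, not a measurement:

```typescript
// Blended input price per MTok as a function of cache hit rate.
// Sticker and cached-read prices come from the pricing table above.
function effectiveInputPrice(
  sticker: number,     // $/MTok, uncached input
  cachedPrice: number, // $/MTok, cached reads (~10% of sticker)
  hitRate: number      // fraction of input tokens served from cache
): number {
  return sticker * (1 - hitRate) + cachedPrice * hitRate;
}

// At an illustrative 85% hit rate, the absolute Haiku/Mini gap shrinks
// from $0.75/MTok to about $0.18/MTok, even though the ratio holds.
const haiku = effectiveInputPrice(1.0, 0.1, 0.85);   // ≈ 0.235
const mini = effectiveInputPrice(0.25, 0.025, 0.85); // ≈ 0.059
console.log({ haiku, mini, gap: haiku - mini });
```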
That's the baseline. The interesting question is what those pennies actually buy you at a real CX workload.
Workload 1: Tool-Calling Accuracy
For CX agents, tool-calling accuracy is the load-bearing metric. Everything else is decoration. If your agent can't reliably pick the right tool and fill the right arguments, it can't check an order, issue a refund, or look up a customer record.
The CX-relevant benchmark here is tau-bench, which Sierra open-sourced and which simulates real customer-service conversations with tool use and policy following. It's a better stand-in for production CX than SWE-bench or MMLU. Even frontier models score under 50 percent on a single run of tau2-bench, and consistency metrics like pass^8 (the chance a model solves the same task on all eight of eight runs) drop below 25 percent on retail for GPT-4o-class models.
Published leaderboards don't break out the cheap tier cleanly, and the public numbers you'll find are almost all frontier-model runs. What teams report from their own tau2-bench replications on telecom and retail subsets tends to rank the cheap tier in this directional order, with the spread between them tighter than the spread between any of them and the Sonnet/GPT-5 flagship class:
| Model | Tool-call accuracy (CX task) | Multi-step retention | Policy adherence |
|---|---|---|---|
| Claude Haiku 4.5 | Best of the three | Strong | Strong |
| GPT-5 Mini | Close second | Moderate | Strong |
| Gemini 2.5 Flash | Third | Moderate | Moderate |
Treat that ranking as a starting hypothesis, not a result. The only numbers that matter are the ones you get replaying your own traffic.
Haiku 4.5 is the consistent leader here. Anthropic has spent a release cycle hammering on the tool-use path, and it shows. Haiku rarely hallucinates argument names, rarely invents tools that don't exist in the schema, and handles the "wait, the user changed their mind about the shipping address" reversal cleanly.
GPT-5 Mini is close on simple tool calls and falls off faster on multi-step reversal. It's also the most opinionated about refusing to call a tool when the system prompt is ambiguous, which can be a feature or a bug depending on your risk posture.
Gemini 2.5 Flash is the cheapest for a reason. Policy adherence is its weakest axis. A common failure: it will call the right tool with the right arguments, but then write a response that contradicts the tool result. You can usually fix this with a stricter system prompt, but you have to know to do it.
One lesson that keeps repeating: tool-calling accuracy is not a model capability, it's a (model, schema, prompt) interaction. Test it on your schema. A model that handles OpenAI's function-calling format with 8 tools cleanly can fall apart on yours with 23 tools and four enum fields that look alike.
```typescript
// Replay 50 real conversations, grade each tool call, score per model.
// This is the only benchmark that actually predicts your production behavior.
const cases = await loadProductionReplay("2026-04-15", { sample: 50 });

for (const model of ["claude-haiku-4-5", "gpt-5-mini", "gemini-2.5-flash"]) {
  let correct = 0;
  for (const c of cases) {
    const result = await runAgentTurn({
      model,
      systemPrompt: PROD_SYSTEM_PROMPT,
      tools: PROD_TOOLS,
      messages: c.messages
    });
    const expected = c.groundTruth.toolCall;
    if (matchesCall(result.toolCall, expected)) correct++;
  }
  console.log(`${model}: ${((correct / cases.length) * 100).toFixed(1)}% tool-accuracy`);
}
```

Run this on your own traffic before you trust any leaderboard. If the gap between your numbers and the published numbers is large, trust your numbers.
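loadProductionReplay and matchesCall in the harness are placeholders for your own plumbing. A hedged sketch of matchesCall: exact tool-name match plus order-insensitive argument comparison, with light string normalization so cosmetic whitespace and casing differences don't count as failures. Tighten or loosen the normalization to match your own grading rules.

```typescript
type ToolCall = { name: string; args: Record<string, unknown> };

// Grade a model's tool call against ground truth: same tool name,
// same arguments regardless of key order, strings compared after
// trimming and lowercasing so cosmetic differences don't fail the case.
function matchesCall(actual: ToolCall | undefined, expected: ToolCall): boolean {
  if (!actual || actual.name !== expected.name) return false;
  const keys = new Set([...Object.keys(actual.args), ...Object.keys(expected.args)]);
  const norm = (v: unknown) =>
    typeof v === "string" ? v.trim().toLowerCase() : v;
  for (const k of keys) {
    if (JSON.stringify(norm(actual.args[k])) !== JSON.stringify(norm(expected.args[k]))) {
      return false;
    }
  }
  return true;
}
```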
Workload 2: First-Token Latency
Time to first token (TTFT) is the number your users feel. A chatbot that streams the first word in 250ms feels instant. One that takes 1.2 seconds feels broken, even if the total response is the same length.
Cheap-tier TTFT is not uniform. Regional infrastructure, current load, and whether the model is actually cheap or is a distilled version of something larger all matter. Typical p50 numbers for a ~2,500-token input on a warm cache, US region, streaming on:
| Model | TTFT p50 | TTFT p95 | Output tokens/sec |
|---|---|---|---|
| Gemini 2.5 Flash | 180-250ms | 450ms | 120-180 |
| GPT-5 Mini | 220-320ms | 600ms | 80-140 |
| Claude Haiku 4.5 | 300-450ms | 700ms | 70-120 |
Flash is the latency champion. It has been since the 1.5 generation. If your UX lives or dies on snappy first-token, Gemini Flash is hard to beat.
GPT-5 Mini is the middle. It is noticeably faster on TTFT than Haiku in most regions and comparable on output throughput for chat-length generations.
Haiku 4.5 has the highest TTFT of the three, which catches people off guard because Haiku has historically been the "fast Claude." The 4.5 generation added significant tool-use improvements but also added tool-use preflight, which adds measurable first-token latency on any turn that touches a tool. For pure chat turns (no tools) the gap closes. For tool-heavy turns it widens.
None of this matters if you don't measure it in your own stack. A badly configured proxy can add 400ms of TTFT on top of any model. Per-model TTFT histograms tracked alongside tool-call success are the combination that actually predicts user-perceived quality. Real-time analytics becomes the feedback loop: swap models, watch the TTFT curve shift, roll back if it regresses.
Workload 3: Structured-Output Reliability
If your agent returns JSON to a downstream system, structured-output reliability is the silent killer. A 3 percent failure rate means 3,000 broken tickets a month at 100K conversations, and your on-call engineer finds out when the retry queue backs up.
All three models support structured outputs. Not all three treat it with the same discipline.
Haiku 4.5 honors response_format strictly. Missing fields, wrong types, and invalid enums are rare on a well-described schema, even under adversarial prompts. Refusals are clean: it either returns a valid payload or a structured refusal, not malformed JSON.
GPT-5 Mini with the strict JSON schema mode is equally reliable in the happy path. Where it gets interesting is edge cases: unusual Unicode in strings, deeply nested optional fields, or schemas with 20+ fields. It is more likely than Haiku to silently drop an optional field rather than return it as null.
Gemini 2.5 Flash is the loosest of the three. With responseSchema configured it mostly honors types, but enum violations and occasional wrapped responses (the payload nested under an extra key) show up at a measurable rate. Under load it tends to regress to looser output first.
A rough shape of what you'd see on a 10K-sample test with a medium-complexity schema (8 fields, 2 enums, 1 nested object):
| Model | Valid JSON % | Schema conformance % | Failure mode |
|---|---|---|---|
| Claude Haiku 4.5 | 99.7 | 99.1 | Occasional refusal on adversarial input |
| GPT-5 Mini | 99.5 | 98.4 | Dropped optional fields |
| Gemini 2.5 Flash | 98.9 | 96.8 | Enum violations, wrapped payloads |
The gap between 99.1 and 96.8 looks small on a slide. At 100K conversations/month it's 2,300 extra parse failures, which is an incident waiting to happen unless you have retry logic and observability in place.
Treat this as a contract-testing problem, not a prompt-engineering problem. Enforce your schema with Zod or Pydantic at the application boundary. Log every rejection with the raw payload. Feed the rejections back into your regression set. Any model swap is gated on this metric staying flat or improving.
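The boundary check itself is small. A dependency-free sketch of the gate (Zod's safeParse gives you the same shape with less code); the ticketId and priority fields are illustrative, not from any real schema:

```typescript
// Contract gate at the application boundary: parse, check required
// fields and enums, and return a reason string you can log with the
// raw payload and feed back into the regression set.
type Check = { ok: true; value: unknown } | { ok: false; reason: string };

const PRIORITIES = ["low", "normal", "high"]; // illustrative enum

function validateTicket(raw: string): Check {
  let parsed: any;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "invalid JSON" };
  }
  if (typeof parsed.ticketId !== "string") {
    return { ok: false, reason: "ticketId missing or not a string" };
  }
  if (!PRIORITIES.includes(parsed.priority)) {
    return { ok: false, reason: `enum violation: priority=${parsed.priority}` };
  }
  return { ok: true, value: parsed };
}
```

On rejection, log the raw payload, retry with the validation error appended to the prompt, and count the failure against the model that produced it.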
The Blended Economics at 100K Conversations per Month
This is where the sticker-price analysis breaks. Real cost is driven by conversation shape, caching, and the output-to-input ratio. Consider a reference CX workload:
- 100,000 conversations per month
- 6 turns per conversation on average
- 3,000-token system prompt (cached)
- 800 tokens of rolling context per turn
- 400 tokens of output per turn
That is 600K turns, 1.8B cached input tokens, 480M uncached input tokens, and 240M output tokens per month.
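A sanity check on the arithmetic, using GPT-5 Mini's prices from the pricing table; the 10 percent cached-read price is the same assumption as above:

```typescript
// Derive the monthly totals from the workload shape.
const turns = 100_000 * 6;                 // 600K turns
const cachedMTok = (turns * 3_000) / 1e6;  // 1,800 MTok (system prompt)
const uncachedMTok = (turns * 800) / 1e6;  // 480 MTok (rolling context)
const outMTok = (turns * 400) / 1e6;       // 240 MTok (output)

function monthly(inPrice: number, cachedPrice: number, outPrice: number, caching: boolean): number {
  const input = caching
    ? cachedMTok * cachedPrice + uncachedMTok * inPrice
    : (cachedMTok + uncachedMTok) * inPrice;
  return input + outMTok * outPrice;
}

// GPT-5 Mini at $0.25 in / ~$0.025 cached / $2.00 out:
console.log(monthly(0.25, 0.025, 2.0, false)); // sticker, no caching: ~$1,050
console.log(monthly(0.25, 0.025, 2.0, true));  // cached system prompt: ~$645
```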
At sticker prices (no caching):
| Model | Input cost | Output cost | Monthly total |
|---|---|---|---|
| Gemini 2.5 Flash | $684 | $600 | ~$1,284 |
| GPT-5 Mini | $570 | $480 | ~$1,050 |
| Claude Haiku 4.5 | $2,280 | $1,200 | ~$3,480 |
With caching on the system prompt (90 percent discount on cached reads):
| Model | Input cost | Output cost | Monthly total |
|---|---|---|---|
| GPT-5 Mini | $165 | $480 | ~$645 |
| Gemini 2.5 Flash | $198 | $600 | ~$798 |
| Claude Haiku 4.5 | $660 | $1,200 | ~$1,860 |
With caching on, Haiku 4.5 is roughly 2.9x the cost of GPT-5 Mini for the same workload. That is real money. The question is whether the tool-call accuracy, structured-output reliability, and policy adherence gap is worth it. In most production CX stacks the answer is some variation of "yes, but only for the turns that need it, and we route the rest to something cheaper."
A common routing pattern in production looks like this:
```typescript
// Route based on what the turn actually needs.
// 70-85% of CX traffic is pattern 1 or 2.
async function pickModel(turn: AgentTurn): Promise<string> {
  // 1. Classify intent first with the cheapest fast model.
  const intent = await classify(turn, { model: "gemini-2.5-flash" });

  // 2. Simple FAQ deflection? Keep it on Flash.
  if (intent.category === "faq" && intent.confidence > 0.85) {
    return "gemini-2.5-flash";
  }

  // 3. Tool calls? Haiku has the best tool-use accuracy.
  if (intent.requiresTool) {
    return "claude-haiku-4-5";
  }

  // 4. Anything unknown or high-stakes? Escalate to Sonnet.
  if (intent.confidence < 0.6 || intent.riskLevel === "high") {
    return "claude-sonnet-4-6";
  }

  // 5. Default: Mini for balanced quality + cost.
  return "gpt-5-mini";
}
```

This kind of routing typically cuts blended cost by 40 to 60 percent versus running everything on a single mid-tier model, without losing the quality you need on the turns that matter. The catch: you need observability to know whether each route is working. Otherwise you're flying blind and the "savings" are just silent quality regressions.
Where Each Model Actually Wins
The short version, if you need to pick today:
- Pick Claude Haiku 4.5 if tool-calling accuracy and policy adherence are your top priorities, you're on Anthropic anyway, and you're willing to pay the tier premium. This is the safest default for CX.
- Pick GPT-5 Mini if you're already on the OpenAI stack, you need the cheapest $/conversation, and your workload is chat-heavy with simple tool use.
- Pick Gemini 2.5 Flash if latency matters more than anything else, your workload is multimodal (screenshots, images), or you're running at a scale where the price difference starts to dominate infrastructure.
This is also the decomposition for a routing strategy: classify with Flash, call tools with Haiku, handle the long tail with Mini, and escalate to Sonnet or GPT-5 for anything that looks ambiguous or risky. The step beyond that is model distillation: training a small model on your own traffic so the cheap tier gets even cheaper without losing quality, which pushes the same argument one step further still: on a narrow enough workload, a small distilled model can beat a much larger general one.
One thing to avoid: picking a cheap model off a public leaderboard and shipping it. Public benchmarks are dominated by coding, math, and general reasoning. CX traffic is weighted toward short, tool-heavy, policy-bound turns with tight schema requirements. The ranking on your workload is not the ranking on the leaderboard.
Routing and Observability in Production
General-purpose tooling gets you to the model choice. What it leaves unsolved is the production half: per-model analytics, regression gates on model swaps, and a way to catch silent quality regressions when a provider ships a surprise model update.
The shape we use with the Chanl SDK: the agent is configured once, your own router function picks the model per turn, and every interaction is tagged with the model that handled it. Prompt management and agent tools configuration live on the agent itself, so swapping the underlying model on a route is a config change, not a code deploy.
```typescript
import { ChanlClient } from "@chanl/sdk";

const client = new ChanlClient({ apiKey: process.env.CHANL_API_KEY });

// Create the agent once. Your router picks the model per turn at runtime.
const agent = await client.agents.create({
  name: "support-agent",
  prompt: "You are a customer-support agent for Acme...",
  tools: ["lookup_order", "issue_refund", "update_address"]
});

// The router lives in your app code and returns a model ID per turn.
// Pass the chosen model into chat/voice calls as the model override.
async function handleTurn(turn: AgentTurn) {
  const model = await pickModel(turn); // from the routing example above
  return client.chat.send({ agentId: agent.id, model, messages: turn.messages });
}
```

The observability side is where it pays off. Because every interaction is tagged with the model that handled it, you can slice tool-call accuracy, TTFT, and structured-output pass rate by model. Pull call metrics with `client.calls.getMetrics({ agentId, dateFrom, dateTo })` and group the results by the model field in your own dashboard, or drive the grouping with a scorecard that runs on the captured set.
Before you promote a cheap-tier model to handle more traffic, you run a scorecard regression against your captured production set. This is the gate that catches the silent quality drop that public leaderboards can't show you. Monitoring closes the loop: if a model's tool-call rate dips below threshold for 15 minutes, you get paged before customers do.
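The gate itself can be a few lines once the scorecard metrics exist. A sketch with illustrative metric names and thresholds; tune the tolerances to your own risk posture:

```typescript
// Promotion gate: the candidate model must match or beat the baseline
// on every quality metric (within a small tolerance) on the replay set,
// and stay within a latency budget. Metric names are illustrative.
type Scorecard = {
  toolCallAccuracy: number; // 0..1 on the replay set
  schemaPassRate: number;   // 0..1 structured-output conformance
  ttftP95Ms: number;        // p95 time to first token, ms
};

function passesGate(baseline: Scorecard, candidate: Scorecard, tolerance = 0.005): boolean {
  return (
    candidate.toolCallAccuracy >= baseline.toolCallAccuracy - tolerance &&
    candidate.schemaPassRate >= baseline.schemaPassRate - tolerance &&
    candidate.ttftP95Ms <= baseline.ttftP95Ms * 1.1 // allow 10% latency slack
  );
}
```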
The underlying thesis is simple. Picking a cheap-tier model is a one-time decision. Keeping it working in production is a forever decision. You want the boring observability infrastructure in place before you ship the cost-saving model swap, not after.
The Takeaway
Opus and GPT-5 Pro get the headlines. Haiku 4.5, GPT-5 Mini, and Gemini 2.5 Flash handle the load. At the $1 tier, the right model depends on your workload shape: tool-heavy goes Haiku, latency-critical goes Flash, cheap-balanced goes Mini. Caching is not optional at production volume. Routing is how you get flagship quality on the turns that need it while the 70 to 85 percent that don't pay cheap-tier prices.
Test on your own traffic. Ship with observability. Ignore the leaderboards that don't test your workload. After model choice, caching and routing are the next levers for token cost in production.
Catch model-swap regressions before customers do
Chanl gives every AI agent per-model observability, tool-call accuracy scoring, and scorecard regression gates on model changes. Build, connect, and monitor on any model tier.
See how it works

Sources

- LLM Council — AI Model Benchmarks April 2026
- EdenAI — Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Benchmarks
- BenchLM — LLM Agent & Tool-Use Benchmarks
- Vellum — LLM Leaderboard 2026
- tau-bench (Sierra) GitHub
- Artificial Analysis — tau2-Bench Telecom Leaderboard
- Anthropic 2026 Agentic Coding Trends Report
- Google Cloud — AI Agent Trends 2026
- NxCode — GPT-5.4 Complete Guide
- Anthropic API Pricing 2026
- Finout — Claude Opus 4.7 Pricing Analysis
- Anthropic Prompt Caching Documentation
- OpenAI Structured Outputs Documentation
- Google Gemini Context Caching
Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.