Everyone posts Opus 4.7 vs GPT-5 Pro benchmarks. It's fun to read. It's also mostly irrelevant.
The vast majority of customer-experience traffic (the chat turns, the intent classifications, the refund confirmations, the FAQ deflections) runs on cheap models: Haiku 4.5, GPT-5 Mini, Gemini 2.5 Flash. These are the $1-per-million-token models doing the actual work while the leaderboards fight over which frontier model codes better.
If you ship an AI agent for customer experience, this is your tier. And it's the tier nobody benchmarks rigorously for CX workloads.
In this article:
- The pricing picture at the $1 tier
- Workload 1: Tool-calling accuracy
- Workload 2: First-token latency
- Workload 3: Structured-output reliability
- The blended economics at 100K conversations per month
- Where each model actually wins
- Routing and observability in production
The Pricing Picture at the $1 Tier
Cheap-tier pricing is compressed enough that output tokens, not input, dominate spend. Haiku 4.5 is $1 input and $5 output per million tokens. GPT-5 Mini sits around $0.25 input and $2 output. Gemini 2.5 Flash is roughly $0.30 input and $2.50 output. The "cheapest" winner flips depending on your input-to-output ratio.
Here is the picture, including the flagship prices for contrast. Provider pricing moves; confirm against the official rate cards before budgeting.
| Model | Input ($/MTok) | Output ($/MTok) | Cached input | Tier |
|---|---|---|---|---|
| Claude Haiku 4.5 | 1.00 | 5.00 | 0.10 | Low |
| GPT-5 Mini | 0.25 | 2.00 | ~0.025 | Low |
| Gemini 2.5 Flash | 0.30 | 2.50 | ~0.03 | Low |
| Claude Sonnet 4.6 | 3.00 | 15.00 | 0.30 | Mid |
| GPT-5.4 (standard) | 2.50 | 15.00 | ~0.25 | Mid |
| Claude Opus 4.7 | 5.00 | 25.00 | 0.50 | Flagship |
| GPT-5.4 Pro | 30.00 | 180.00 | n/a | Flagship |
Two observations matter. First, GPT-5 Mini is roughly 4x cheaper than Haiku 4.5 on input and 2.5x cheaper on output at sticker prices. Second, all three cheap-tier models offer prompt caching at around 10 percent of the input price, which shrinks the input side of the bill enough that the sticker gap stops dominating once your system prompt is cached. If you're not caching, you're leaving 80 to 90 percent of potential input savings on the table no matter which model you pick.
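The caching effect is easiest to see as a blended input price. A minimal sketch, using the sticker and cached-read prices from the table above; the 85 percent cache hit rate is an illustrative assumption, not a measurement:

```typescript
// Blended input price per MTok as a function of cache hit rate.
// Sticker and cached-read prices come from the pricing table above.
function effectiveInputPrice(
  sticker: number,     // $/MTok, uncached input
  cachedPrice: number, // $/MTok, cached reads (~10% of sticker)
  hitRate: number      // fraction of input tokens served from cache
): number {
  return sticker * (1 - hitRate) + cachedPrice * hitRate;
}

// At an illustrative 85% hit rate, the absolute Haiku/Mini gap shrinks
// from $0.75/MTok to about $0.18/MTok, even though the ratio holds.
const haiku = effectiveInputPrice(1.0, 0.1, 0.85);   // ≈ 0.235
const mini = effectiveInputPrice(0.25, 0.025, 0.85); // ≈ 0.059
console.log({ haiku, mini, gap: haiku - mini });
```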
That's the baseline. The interesting question is what those pennies actually buy you at a real CX workload.
Workload 1: Tool-Calling Accuracy
For CX agents, tool-calling accuracy is the load-bearing metric. Everything else is decoration. If your agent can't reliably pick the right tool and fill the right arguments, it can't check an order, issue a refund, or look up a customer record.
The CX-relevant benchmark here is tau-bench, which Sierra open-sourced and which simulates real customer-service conversations with tool use and policy following. It's a better stand-in for production CX than SWE-bench or MMLU. Even frontier models score under 50 percent on a single run of tau2-bench, and consistency metrics like pass^8 (the chance a model solves the same task on all eight of eight runs) drop below 25 percent on retail for GPT-4o-class models.
Published leaderboards don't break out the cheap tier cleanly, and the public numbers you'll find are almost all frontier-model runs. What teams report from their own tau2-bench replications on telecom and retail subsets tends to rank the cheap tier in this directional order, with the spread between them tighter than the spread between any of them and the Sonnet/GPT-5 flagship class:
| Model | Tool-call accuracy (CX task) | Multi-step retention | Policy adherence |
|---|---|---|---|
| Claude Haiku 4.5 | Best of the three | Strong | Strong |
| GPT-5 Mini | Close second | Moderate | Strong |
| Gemini 2.5 Flash | Third | Moderate | Moderate |
Treat that ranking as a starting hypothesis, not a result. The only numbers that matter are the ones you get replaying your own traffic.
Haiku 4.5 is the consistent leader here. Anthropic has spent a release cycle hammering on the tool-use path, and it shows. Haiku rarely hallucinates argument names, rarely invents tools that don't exist in the schema, and handles the "wait, the user changed their mind about the shipping address" reversal cleanly.
GPT-5 Mini is close on simple tool calls and falls off faster on multi-step reversal. It's also the most opinionated about refusing to call a tool when the system prompt is ambiguous, which can be a feature or a bug depending on your risk posture.
Gemini 2.5 Flash is the cheapest for a reason. Policy adherence is its weakest axis. A common failure: it will call the right tool with the right arguments, but then write a response that contradicts the tool result. You can usually fix this with a stricter system prompt, but you have to know to do it.
One lesson that keeps repeating: tool-calling accuracy is not a model capability, it's a (model, schema, prompt) interaction. Test it on your schema. A model that handles OpenAI's function-calling format with 8 tools cleanly can fall apart on yours with 23 tools and four enum fields that look alike.
```typescript
// Replay 50 real conversations, grade each tool call, score per model.
// This is the only benchmark that actually predicts your production behavior.
const cases = await loadProductionReplay("2026-04-15", { sample: 50 });

for (const model of ["claude-haiku-4-5", "gpt-5-mini", "gemini-2.5-flash"]) {
  let correct = 0;
  for (const c of cases) {
    const result = await runAgentTurn({
      model,
      systemPrompt: PROD_SYSTEM_PROMPT,
      tools: PROD_TOOLS,
      messages: c.messages
    });
    const expected = c.groundTruth.toolCall;
    if (matchesCall(result.toolCall, expected)) correct++;
  }
  console.log(`${model}: ${((correct / cases.length) * 100).toFixed(1)}% tool-accuracy`);
}
```

Run this on your own traffic before you trust any leaderboard. If the gap between your numbers and the published numbers is large, trust your numbers.
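loadProductionReplay and matchesCall in the harness are placeholders for your own plumbing. A hedged sketch of matchesCall: exact tool-name match plus order-insensitive argument comparison, with light string normalization so cosmetic whitespace and casing differences don't count as failures. Tighten or loosen the normalization to match your own grading rules.

```typescript
type ToolCall = { name: string; args: Record<string, unknown> };

// Grade a model's tool call against ground truth: same tool name,
// same arguments regardless of key order, strings compared after
// trimming and lowercasing so cosmetic differences don't fail the case.
function matchesCall(actual: ToolCall | undefined, expected: ToolCall): boolean {
  if (!actual || actual.name !== expected.name) return false;
  const keys = new Set([...Object.keys(actual.args), ...Object.keys(expected.args)]);
  const norm = (v: unknown) =>
    typeof v === "string" ? v.trim().toLowerCase() : v;
  for (const k of keys) {
    if (JSON.stringify(norm(actual.args[k])) !== JSON.stringify(norm(expected.args[k]))) {
      return false;
    }
  }
  return true;
}
```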
Workload 2: First-Token Latency
Time to first token (TTFT) is the number your users feel. A chatbot that streams the first word in 250ms feels instant. One that takes 1.2 seconds feels broken, even if the total response is the same length.
Cheap-tier TTFT is not uniform. Regional infrastructure, current load, and whether the model is actually cheap or is a distilled version of something larger all matter. Typical p50 numbers for a ~2,500-token input on a warm cache, US region, streaming on:
| Model | TTFT p50 | TTFT p95 | Output tokens/sec |
|---|---|---|---|
| Gemini 2.5 Flash | 180-250ms | 450ms | 120-180 |
| GPT-5 Mini | 220-320ms | 600ms | 80-140 |
| Claude Haiku 4.5 | 300-450ms | 700ms | 70-120 |
Flash is the latency champion. It has been since the 1.5 generation. If your UX lives or dies on snappy first-token, Gemini Flash is hard to beat.
GPT-5 Mini is the middle. It is noticeably faster on TTFT than Haiku in most regions and comparable on output throughput for chat-length generations.
Haiku 4.5 has the highest TTFT of the three, which catches people off guard because Haiku has historically been the "fast Claude." The 4.5 generation added significant tool-use improvements but also added tool-use preflight, which adds measurable first-token latency on any turn that touches a tool. For pure chat turns (no tools) the gap closes. For tool-heavy turns it widens.
None of this matters if you don't measure it in your own stack. A badly configured proxy can add 400ms of TTFT on top of any model. Per-model TTFT histograms tracked alongside tool-call success are the combination that actually predicts user-perceived quality. Real-time analytics becomes the feedback loop: swap models, watch the TTFT curve shift, roll back if it regresses.
Workload 3: Structured-Output Reliability
If your agent returns JSON to a downstream system, structured-output reliability is the silent killer. A 3 percent failure rate means 3,000 broken tickets a month at 100K conversations, and your on-call engineer finds out when the retry queue backs up.
All three models support structured outputs. Not all three treat it with the same discipline.
Haiku 4.5 honors response_format strictly. Missing fields, wrong types, and invalid enums are rare on a well-described schema, even under adversarial prompts. Refusals are clean: it either returns a valid payload or a structured refusal, not malformed JSON.
GPT-5 Mini with the strict JSON schema mode is equally reliable in the happy path. Where it gets interesting is edge cases: unusual Unicode in strings, deeply nested optional fields, or schemas with 20+ fields. It is more likely than Haiku to silently drop an optional field rather than return it as null.
Gemini 2.5 Flash is the loosest of the three. With responseSchema configured it mostly honors types, but enum violations and occasional wrapped responses (the payload nested under an extra key) show up at a measurable rate. Under load it tends to regress to looser output first.
A rough shape of what you'd see on a 10K-sample test with a medium-complexity schema (8 fields, 2 enums, 1 nested object):
| Model | Valid JSON % | Schema conformance % | Failure mode |
|---|---|---|---|
| Claude Haiku 4.5 | 99.7 | 99.1 | Occasional refusal on adversarial input |
| GPT-5 Mini | 99.5 | 98.4 | Dropped optional fields |
| Gemini 2.5 Flash | 98.9 | 96.8 | Enum violations, wrapped payloads |
The gap between 99.1 and 96.8 looks small on a slide. At 100K conversations/month it's 2,300 extra parse failures, which is an incident waiting to happen unless you have retry logic and observability in place.
Treat this as a contract-testing problem, not a prompt-engineering problem. Enforce your schema with Zod or Pydantic at the application boundary. Log every rejection with the raw payload. Feed the rejections back into your regression set. Any model swap is gated on this metric staying flat or improving.
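The boundary check itself is small. A dependency-free sketch of the gate (Zod's safeParse gives you the same shape with less code); the ticketId and priority fields are illustrative, not from any real schema:

```typescript
// Contract gate at the application boundary: parse, check required
// fields and enums, and return a reason string you can log with the
// raw payload and feed back into the regression set.
type Check = { ok: true; value: unknown } | { ok: false; reason: string };

const PRIORITIES = ["low", "normal", "high"]; // illustrative enum

function validateTicket(raw: string): Check {
  let parsed: any;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "invalid JSON" };
  }
  if (typeof parsed.ticketId !== "string") {
    return { ok: false, reason: "ticketId missing or not a string" };
  }
  if (!PRIORITIES.includes(parsed.priority)) {
    return { ok: false, reason: `enum violation: priority=${parsed.priority}` };
  }
  return { ok: true, value: parsed };
}
```

On rejection, log the raw payload, retry with the validation error appended to the prompt, and count the failure against the model that produced it.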
The Blended Economics at 100K Conversations per Month
This is where the sticker-price analysis breaks. Real cost is driven by conversation shape, caching, and the output-to-input ratio. Consider a reference CX workload:
- 100,000 conversations per month
- 6 turns per conversation on average
- 3,000-token system prompt (cached)
- 800 tokens of rolling context per turn
- 400 tokens of output per turn
That is 600K turns, 1.8B cached input tokens, 480M uncached input tokens, and 240M output tokens per month.
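A sanity check on the arithmetic, using GPT-5 Mini's prices from the pricing table; the 10 percent cached-read price is the same assumption as above:

```typescript
// Derive the monthly totals from the workload shape.
const turns = 100_000 * 6;                 // 600K turns
const cachedMTok = (turns * 3_000) / 1e6;  // 1,800 MTok (system prompt)
const uncachedMTok = (turns * 800) / 1e6;  // 480 MTok (rolling context)
const outMTok = (turns * 400) / 1e6;       // 240 MTok (output)

function monthly(inPrice: number, cachedPrice: number, outPrice: number, caching: boolean): number {
  const input = caching
    ? cachedMTok * cachedPrice + uncachedMTok * inPrice
    : (cachedMTok + uncachedMTok) * inPrice;
  return input + outMTok * outPrice;
}

// GPT-5 Mini at $0.25 in / ~$0.025 cached / $2.00 out:
console.log(monthly(0.25, 0.025, 2.0, false)); // sticker, no caching: ~$1,050
console.log(monthly(0.25, 0.025, 2.0, true));  // cached system prompt: ~$645
```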
At sticker prices (no caching):
| Model | Input cost | Output cost | Monthly total |
|---|---|---|---|
| Gemini 2.5 Flash | $684 | $600 | ~$1,284 |
| GPT-5 Mini | $570 | $480 | ~$1,050 |
| Claude Haiku 4.5 | $2,280 | $1,200 | ~$3,480 |
With caching on the system prompt (90 percent discount on cached reads):
| Model | Input cost | Output cost | Monthly total |
|---|---|---|---|
| GPT-5 Mini | $165 | $480 | ~$645 |
| Gemini 2.5 Flash | $198 | $600 | ~$798 |
| Claude Haiku 4.5 | $660 | $1,200 | ~$1,860 |
With caching on, Haiku 4.5 is roughly 2.9x the cost of GPT-5 Mini for the same workload. That is real money. The question is whether the tool-call accuracy, structured-output reliability, and policy adherence gap is worth it. In most production CX stacks the answer is some variation of "yes, but only for the turns that need it, and we route the rest to something cheaper."
A common routing pattern in production looks like this:
```typescript
// Route based on what the turn actually needs.
// 70-85% of CX traffic is pattern 1 or 2.
async function pickModel(turn: AgentTurn): Promise<string> {
  // 1. Classify intent first with the cheapest fast model.
  const intent = await classify(turn, { model: "gemini-2.5-flash" });

  // 2. Simple FAQ deflection? Keep it on Flash.
  if (intent.category === "faq" && intent.confidence > 0.85) {
    return "gemini-2.5-flash";
  }

  // 3. Tool calls? Haiku has the best tool-use accuracy.
  if (intent.requiresTool) {
    return "claude-haiku-4-5";
  }

  // 4. Anything unknown or high-stakes? Escalate to Sonnet.
  if (intent.confidence < 0.6 || intent.riskLevel === "high") {
    return "claude-sonnet-4-6";
  }

  // 5. Default: Mini for balanced quality + cost.
  return "gpt-5-mini";
}
```

This kind of routing typically cuts blended cost by 40 to 60 percent versus running everything on a single mid-tier model, without losing the quality you need on the turns that matter. The catch: you need observability to know whether each route is working. Otherwise you're flying blind and the "savings" are just silent quality regressions.
Where Each Model Actually Wins
The short version, if you need to pick today:
- Pick Claude Haiku 4.5 if tool-calling accuracy and policy adherence are your top priorities, you're on Anthropic anyway, and you're willing to pay the tier premium. This is the safest default for CX.
- Pick GPT-5 Mini if you're already on the OpenAI stack, you need the cheapest $/conversation, and your workload is chat-heavy with simple tool use.
- Pick Gemini 2.5 Flash if latency matters more than anything else, your workload is multimodal (screenshots, images), or you're running at a scale where the price difference starts to dominate infrastructure.
This is also the decomposition for a routing strategy: classify with Flash, call tools with Haiku, handle the long tail with Mini, and escalate to Sonnet or GPT-5 for anything that looks ambiguous or risky. The step beyond that is model distillation: training a small model on your own traffic so the cheap tier gets even cheaper without losing quality, which pushes the same argument one step further still: on a narrow enough workload, a small distilled model can beat a much larger general one.
One thing to avoid: picking a cheap model off a public leaderboard and shipping it. Public benchmarks are dominated by coding, math, and general reasoning. CX traffic is weighted toward short, tool-heavy, policy-bound turns with tight schema requirements. The ranking on your workload is not the ranking on the leaderboard.
Routing and Observability in Production
General-purpose tooling gets you to the model choice. What it leaves unsolved is the production half: per-model analytics, regression gates on model swaps, and a way to catch silent quality regressions when a provider ships a surprise model update.
The shape we use with the Chanl SDK: the agent is configured once, your own router function picks the model per turn, and every interaction is tagged with the model that handled it. Prompt management and agent tools configuration live on the agent itself, so swapping the underlying model on a route is a config change, not a code deploy.
```typescript
import { ChanlClient } from "@chanl/sdk";

const client = new ChanlClient({ apiKey: process.env.CHANL_API_KEY });

// Create the agent once. Your router picks the model per turn at runtime.
const agent = await client.agents.create({
  name: "support-agent",
  prompt: "You are a customer-support agent for Acme...",
  tools: ["lookup_order", "issue_refund", "update_address"]
});

// The router lives in your app code and returns a model ID per turn.
// Pass the chosen model into chat/voice calls as the model override.
async function handleTurn(turn: AgentTurn) {
  const model = await pickModel(turn); // from the routing example above
  return client.chat.send({ agentId: agent.id, model, messages: turn.messages });
}
```

The observability side is where it pays off. Because every interaction is tagged with the model that handled it, you can slice tool-call accuracy, TTFT, and structured-output pass rate by model. Pull call metrics with `client.calls.getMetrics({ agentId, dateFrom, dateTo })` and group the results by the model field in your own dashboard, or drive the grouping with a scorecard that runs on the captured set.
Before you promote a cheap-tier model to handle more traffic, you run a scorecard regression against your captured production set. This is the gate that catches the silent quality drop that public leaderboards can't show you. Monitoring closes the loop: if a model's tool-call rate dips below threshold for 15 minutes, you get paged before customers do.
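The gate itself can be a few lines once the scorecard metrics exist. A sketch with illustrative metric names and thresholds; tune the tolerances to your own risk posture:

```typescript
// Promotion gate: the candidate model must match or beat the baseline
// on every quality metric (within a small tolerance) on the replay set,
// and stay within a latency budget. Metric names are illustrative.
type Scorecard = {
  toolCallAccuracy: number; // 0..1 on the replay set
  schemaPassRate: number;   // 0..1 structured-output conformance
  ttftP95Ms: number;        // p95 time to first token, ms
};

function passesGate(baseline: Scorecard, candidate: Scorecard, tolerance = 0.005): boolean {
  return (
    candidate.toolCallAccuracy >= baseline.toolCallAccuracy - tolerance &&
    candidate.schemaPassRate >= baseline.schemaPassRate - tolerance &&
    candidate.ttftP95Ms <= baseline.ttftP95Ms * 1.1 // allow 10% latency slack
  );
}
```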
The underlying thesis is simple. Picking a cheap-tier model is a one-time decision. Keeping it working in production is a forever decision. You want the boring observability infrastructure in place before you ship the cost-saving model swap, not after.
The Takeaway
Opus and GPT-5 Pro get the headlines. Haiku 4.5, GPT-5 Mini, and Gemini 2.5 Flash handle the load. At the $1 tier, the right model depends on your workload shape: tool-heavy goes Haiku, latency-critical goes Flash, cheap-balanced goes Mini. Caching is not optional at production volume. Routing is how you get flagship quality on the turns that need it while the 70 to 85 percent that don't pay cheap-tier prices.
Test on your own traffic. Ship with observability. Ignore the leaderboards that don't test your workload. After model choice, caching and routing are the next levers for token cost in production.
Catch model-swap regressions before customers do
Chanl gives every AI agent per-model observability, tool-call accuracy scoring, and scorecard regression gates on model changes. Build, connect, and monitor on any model tier.
See how it works

Sources

- LLM Council — AI Model Benchmarks April 2026
- EdenAI — Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Benchmarks
- BenchLM — LLM Agent & Tool-Use Benchmarks
- Vellum — LLM Leaderboard 2026
- tau-bench (Sierra) GitHub
- Artificial Analysis — tau2-Bench Telecom Leaderboard
- Anthropic 2026 Agentic Coding Trends Report
- Google Cloud — AI Agent Trends 2026
- NxCode — GPT-5.4 Complete Guide
- Anthropic API Pricing 2026
- Finout — Claude Opus 4.7 Pricing Analysis
- Anthropic Prompt Caching Documentation
- OpenAI Structured Outputs Documentation
- Google Gemini Context Caching
Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.