
Your CX Agent Doesn't Care Who Won SWE-Bench. Here's Who Actually Wins.

SWE-bench crowns a coding king. Customer experience agents answer to a different benchmark, tau-bench, and the rankings flip. The head-to-head that actually predicts production reliability.

Dean Grover, Co-founder
April 23, 2026
16 min read

Pick any LLM leaderboard published this quarter and GPT-5.4 or GPT-5.3 Codex sits near the top. SWE-bench Verified scores are closing on 85%. Opus 4.7 trails slightly. Gemini 3.1 Pro is a respectable third on most coding tasks. The rankings feel stable and the decision feels easy. Pick the top score, wire it up, ship.

Then the customer experience team deploys that top-of-leaderboard model behind a retail chat agent and the refund flow quietly fails on one request in five. The support queue fills with "the bot told me the return window was 90 days, but the policy says 30." A week later someone points out that the agent is also transferring calls to the wrong department and hallucinating order numbers on pickup requests.

The model is fine. The benchmark was wrong for the job.

SWE-bench measures whether an agent can fix an issue in a GitHub repo. That is a coding benchmark, and it tells you almost nothing about how a model behaves when a customer asks for a refund, a flight change, or a plan upgrade. For that job there is a different benchmark, tau-bench, and the head-to-head looks different.

This is the shootout that actually matters for customer experience agents. Three frontier models, four axes that predict production reliability, and a cost table for a real 10,000-interactions-per-day deployment.

| What this article covers | Why it matters |
| --- | --- |
| Why SWE-bench misleads CX teams | The "top model" on coding is not the top model on tool-calling dialogue |
| tau-bench, tau2-bench, and pass^k | The CX-relevant metrics the press releases skip |
| Tool-calling head-to-head | Retail and airline tool accuracy, dual-control telecom |
| Latency and interruption | Voice and live chat budgets under 800 ms |
| Blended cost at 10K/day | Thinking tokens are the hidden bill |
| Which model wins which CX job | Different winners for retail, voice, multimodal, high-volume |

Why SWE-Bench Is the Wrong Benchmark for Customer Experience

SWE-bench Verified rewards depth of reasoning over a static codebase. Customer experience agents are judged on short, stateful dialogue with tool calls that commit real money. Those are two different skills, and the models that win one do not automatically win the other.

SWE-bench gives an agent a failing test and a repository. It has minutes to read files, form a hypothesis, edit code, and run the suite. Success is binary: does the patch pass. This favors models that can sustain long internal reasoning, hold thousands of lines of code in context, and recover from dead ends without human help. GPT-5.3 Codex is tuned exactly for this and scores 85% on the verified split [1]. GPT-5.4 sits around 84%. Opus 4.6 landed at 80.8% and Sonnet 4.6 at 79.6%, with Opus 4.7 tracking in that same low-80s band [2]. Gemini 3.1 Pro trails at ~75%.

Customer experience is not that. A CX turn is short. The agent has one user message, a handful of tools, and a response budget under a second for voice or under three seconds for chat. There is no repository to explore. There is a policy to respect, a tool call to pick, an argument to fill, and sometimes a clarifying question to ask. The model is not deliberating alone. It is negotiating with a human who will hang up or escalate if the agent stalls.

The tell that something is off is SWE-bench Pro. When the same evaluation format is rerun on harder, human-curated tasks without training-set contamination, the top score collapses from 85%+ on Verified to 23% on Pro [1]. That 60-plus-point collapse is a warning that scores on clean splits have been inflated by models seeing similar tasks during training. The leaderboards look cleaner than the field is.

For CX the right benchmark is one that simulates dialogue, tools, and policy. That benchmark is tau-bench.

What Tau-Bench and Tau2-Bench Actually Measure

tau-bench is a dialogue-and-tool-calling benchmark with three domains: retail, airline, and telecom. tau2-bench adds dual-control telecom, a setting where the customer and the agent both hold state and must reach consistent decisions. Single-run SOTA hovers around 50% in retail, and pass^8 is under 25% for top models [3][4].

The structure is worth understanding because it tells you what the benchmark is selecting for.

  • Retail. The agent has tools like get_order_status, process_return, update_shipping_address, and a policy document. The customer has a goal: return a jacket, change an address, cancel an unshipped order. The agent wins if it calls the right tool with the right arguments inside the policy.
  • Airline. Same shape, harder tools. search_flights, change_reservation, compute_fare_difference. The policy is longer and includes change fees, fare classes, and cancellation windows. Getting the tool right is not enough. The reasoning about whether a change is allowed has to be correct.
  • Telecom (tau2-bench). Dual-control. Both the customer and the agent have partial state. A plan change requires the customer to confirm an action the agent proposes. If either side misreads the state, the turn fails. This is the closest simulation to real contact-center flows where agents and customers co-author the outcome.

The metric that matters most is pass^k. pass^1 is the single-run success rate; pass^k is the probability the agent succeeds on all k independent runs of the same task. Top models on tau-bench retail land around pass^1 of ~50% and pass^8 under 25% [3][4]. That is the number the press releases do not print.

pass^k matters for CX because customers do not get the luxury of a single lucky try. Ten customers with the same refund request should get the same answer. When pass^8 is 25%, only one batch in four of eight identical-request customers all get the right answer; the rest of the time, at least one of the eight is wrong. That is not a model you can deploy behind a brand without human review in the loop.
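For your own scenario suites, pass^k can be estimated from n recorded runs of the same task with a standard combinatorial estimator: the chance that k runs drawn from those n all succeeded. A minimal sketch (function name is ours, not from any benchmark harness):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k from n recorded runs of one task, c of which
    succeeded: the probability that k runs drawn from those n all pass."""
    return comb(c, k) / comb(n, k)

# A task the agent solves 4 times out of 8 looks tolerable at pass^1
# but is hopeless at pass^8:
print(pass_hat_k(8, 4, 1))  # 0.5
print(pass_hat_k(8, 4, 8))  # 0.0  (comb(4, 8) is zero)
```

Averaging this estimate over every task in a suite gives the suite-level pass^k curve, which falls monotonically as k grows.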

Head-to-Head: Tool-Calling on Tau-Bench

Here is the published head-to-head for the three frontier models on tau-bench and tau2-bench as of April 2026. Numbers are approximate. tau-bench variance across runs is high, which is itself the point.

| Model | tau-bench Retail (pass^1) | tau-bench Airline (pass^1) | tau2-bench Telecom | SWE-bench Verified |
| --- | --- | --- | --- | --- |
| Claude Opus 4.7 | ~52% | ~48% | ~46% | low 80s |
| GPT-5.4 | ~49% | ~47% | ~44% | 84% |
| Gemini 3.1 Pro | ~43% | ~41% | ~40% | ~75% |

Sources: [2][4][5]. Numbers are single-run pass^1. pass^8 for all three drops under 25% on retail, consistent with published SOTA [3].

Three things stand out.

First, the tau-bench order inverts the SWE-bench order for the top two models. GPT-5.4 leads on code; Opus 4.7 leads on dialogue-and-tools. The gap between them is inside three points, small enough that the choice gets decided by other factors: latency, cost, and multimodal support.

Second, no model is above 55% on any of the three tau-bench domains. The ceiling itself is low, which is why pass^k reliability matters so much. A 48% single-run score means half the time the model gets it wrong on the first try. What you actually care about is whether the second and third tries converge on the right answer, and the pass^k numbers say they do not.

Third, Gemini 3.1 Pro trails meaningfully on retail and airline, but it is not out of the race. Its advantage shows up elsewhere: multimodal inputs, long-context retrieval, and GPQA Diamond at 94.3% [6]. For a plain-text CX flow it is the weakest of the three. For a flow where customers send photos, it leads.

Reliability Under Repetition: The Dual-Control Failure Mode

The dual-control setting in tau2-bench is where the bottom falls out. When both sides of a conversation hold state, any model that drifts on the second turn cascades into failure on turns three and four. This is the failure mode that hurts CX teams most in production.

Take a telecom plan change as a concrete example. The customer calls to upgrade from a 5 GB plan to a 10 GB plan. The agent has tools to check eligibility, quote the new price, and commit the change. The customer has partial state. They know what they want but they might not remember their current plan or the exact cost. The dialogue has three turns at minimum. Eligibility check. Price confirmation. Commit.

If the model hallucinates the current plan on turn one, the quote on turn two is wrong, and the commit on turn three either fails validation or succeeds on the wrong plan. The single-turn accuracy can be 80% and the three-turn success rate can be 50%, because errors compound.
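The compounding in that example is just multiplication. Under the toy assumption that every turn must be correct and turn-level errors are independent, the arithmetic works out to roughly the 50% the text cites:

```python
def end_to_end_success(per_turn_accuracy: float, turns: int) -> float:
    """Toy model: every turn must be right and errors are independent,
    so end-to-end success is the product of per-turn accuracies."""
    return per_turn_accuracy ** turns

# 80% per-turn accuracy over the three-turn plan change
# (eligibility -> quote -> commit):
print(round(end_to_end_success(0.80, 3), 3))  # 0.512
```

Real dialogues are worse than this model suggests, because an early state error also corrupts the inputs to later turns rather than failing independently.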

Opus 4.7's edge on tau2-bench telecom comes from better state tracking across turns. GPT-5.4 is competitive when extended thinking is on, but that thinking costs latency and tokens. Gemini 3.1 Pro trails both on multi-turn state specifically.

For any CX workflow that has more than one tool call per conversation, which is almost all of them, the dual-control number is the one to stress. If a vendor only publishes pass^1 retail, ask for pass^8 and tau2-bench telecom. If they do not have the numbers, run your own scenarios before you commit.

This is where a dedicated scenario testing layer earns its keep. pass^k is only meaningful if you can replay the same customer goal against your agent hundreds of times with different personas and see where it drifts. Synthetic runs catch the failure modes that do not show up on the first ten manual tests.

Latency and Interruption: What Voice and Live Chat Actually Demand

Voice agents have a hard latency ceiling of about 800 ms from user silence to model first token. Live chat is looser at 2-3 seconds. Reasoning models blow through both budgets when extended thinking is enabled. This is the constraint that most benchmark tables leave out entirely.

In a voice call, if the model takes more than about a second to start speaking, the customer either repeats themselves or interrupts the agent. Interruption handling is solvable. Modern voice stacks handle it cleanly with voice activity detection (VAD) and barge-in. What is not solvable is the perception of a slow agent. A consumer hangs up on a laggy voice bot in under 30 seconds.

The three models diverge sharply on latency under real tool-calling loads.

| Model | Non-thinking first token | Extended-thinking first token | Voice-ready? |
| --- | --- | --- | --- |
| Claude Haiku 4.5 | 200-400 ms | N/A | Yes |
| Claude Sonnet 4.6 | 300-600 ms | 1-3 s | Yes, without thinking |
| Claude Opus 4.7 | 500-900 ms | 2-6 s | Marginal |
| GPT-5.4 | 400-800 ms | 3-10 s+ | Yes, without thinking |
| Gemini 3.1 Pro | 300-700 ms | 2-5 s | Yes, without thinking |

These are rough measured ranges from public testing and vendor docs. Your own numbers will vary with region, caching, and concurrency.

The practical implication is that the best tau-bench model (Opus 4.7) is borderline for voice if you enable extended thinking. Teams running voice agents either route to Sonnet 4.6 for the dialogue turns and call Opus 4.7 only for the hard policy decisions, or they disable thinking and accept the accuracy hit. The planner/worker split is how you get both. The planner thinks slowly, the worker responds fast. We wrote about that pattern in more depth in Agentic RAG and retrieval agents if you want the full architecture.
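A minimal sketch of that routing split. The model names echo the ones above, but the `classify_turn()` heuristic is entirely illustrative; in production the routing signal would come from intent detection and tool context, not keyword matching:

```python
# Hypothetical planner/worker router. Nothing here is a real vendor API.
PLANNER = "opus-4.7"   # slow, accurate: policy and money decisions
WORKER = "sonnet-4.6"  # fast, thinking off: ordinary dialogue turns

def classify_turn(user_message: str, pending_tool_calls: int) -> str:
    """Cheap heuristic: escalate turns that touch money, policy,
    or an in-flight tool call to the planner model."""
    hard_markers = ("refund", "cancel", "fee", "policy", "upgrade")
    text = user_message.lower()
    if pending_tool_calls > 0 or any(m in text for m in hard_markers):
        return PLANNER
    return WORKER

print(classify_turn("what's your return policy?", 0))  # opus-4.7
print(classify_turn("thanks, that's all!", 0))         # sonnet-4.6
```

The design point is that the router itself must be cheaper than the cheapest model it routes to, or the split stops paying for itself.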

For live chat the budget is more forgiving, and extended thinking is usually worth it on the first turn, the one where the agent decides what the customer actually wants. Subsequent turns can run on the smaller sibling model without measurable quality loss in A/B tests.

Blended Cost at 10,000 Interactions per Day

Price-per-token charts look cheap until you multiply by production volume. A 10,000-interaction-per-day CX deployment running entirely on Opus 4.7 with extended thinking can run close to six figures a month, and most of that bill is thinking tokens you never see. Here is the math on a realistic deployment.

Assume a 10,000-session-per-day deployment with a median of six turns per session, 400 input tokens per turn (system prompt cached), and 150 output tokens per turn visible to the user. Reasoning models add roughly 1,500 thinking tokens per turn when extended thinking is on. Anthropic and OpenAI pricing is from April 2026 vendor docs [7][8]. Gemini 3.1 Pro rates are approximated from Google's posted 3.0 Pro card and early 3.1 previews; treat that row as a planning estimate until the final price list ships.

The arithmetic: 10,000 × 6 = 60,000 turns/day. Input tokens/day = 24M. Visible output/day = 9M. Thinking output/day (when enabled) = 90M. Multiply each by the per-million rate for the model and you get the numbers below.
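As a sanity check, the Opus 4.7 row can be reproduced in a few lines from the assumptions above (60,000 turns/day; 400 input, 150 visible output, and 1,500 thinking tokens per turn; $5 per million input and $25 per million output):

```python
def daily_cost(turns, in_tok, vis_tok, think_tok, in_rate, out_rate):
    """Rates are dollars per million tokens; thinking bills as output."""
    m = 1_000_000
    return (
        turns * in_tok / m * in_rate,      # input
        turns * vis_tok / m * out_rate,    # visible output
        turns * think_tok / m * out_rate,  # thinking output
    )

inp, vis, think = daily_cost(60_000, 400, 150, 1_500, 5.00, 25.00)
print(inp, vis, think, inp + vis + think)  # 120.0 225.0 2250.0 2595.0
```

Swapping in each model's rates reproduces the rest of the table; the monthly column is the daily total times 30.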

| Configuration | Input cost/day | Visible output/day | Thinking output/day | Total/day | Total/month |
| --- | --- | --- | --- | --- | --- |
| Opus 4.7, thinking on ($5 / $25) | $120 | $225 | $2,250 | $2,595 | ~$77,850 |
| GPT-5.4 Standard, thinking on ($2.50 / $15) | $60 | $135 | $1,350 | $1,545 | ~$46,350 |
| Gemini 3.1 Pro, thinking on | $72 | $162 | $1,080 | $1,314 | ~$39,420 |
| Sonnet 4.6, thinking off ($3 / $15) | $72 | $135 | $0 | $207 | ~$6,210 |
| Planner/worker (Opus plan 1-in-6, Haiku work 5-in-6) | $40 | $75 | $375 | $490 | ~$14,700 |

Two things jump out of that table.

Thinking tokens are usually the largest line item on any reasoning-heavy deployment. They bill as output tokens, invisible to the user, and consume the output context budget. A 1,500-token thinking trace before a 150-token visible reply means 90% of what you are paying for is never seen. Finout's pricing analysis goes deep on this if you want the source math [8].

The other surprise is how cheap the planner/worker split gets. Routing one in every six turns to a frontier planner (Opus 4.7) and the other five to a cheap worker (Haiku 4.5) cuts monthly spend from ~$78K to ~$15K with near-identical tau-bench accuracy in published A/B tests. This is the default architecture for any high-volume CX deployment and it is the reason "just use Opus 4.7 everywhere" is a bad strategy even when budget is not tight.

The caveat: planner/worker only pays off if you can monitor where quality drops. If the worker model starts missing intents that the planner catches, you will not see it until a week of survey scores come back. Continuous scorecard grading on production conversations is what makes the split safe.

Which Model Wins Which CX Job

There is no single winner. The right model depends on the job, the channel, and the cost ceiling. Here is the decision matrix for the most common CX patterns.

| CX Job | Primary Model | Reasoning |
| --- | --- | --- |
| Retail chat, text only | Opus 4.7 or GPT-5.4 | Both lead tau-bench retail; pick by org's existing API contracts |
| Airline / policy-heavy flows | Opus 4.7 | Strongest on long-policy reasoning and multi-turn state |
| Voice agent (sub-second budget) | Sonnet 4.6 or Haiku 4.5 | Latency wins; route hard turns to Opus planner |
| Multimodal (photos, receipts, IDs) | Gemini 3.1 Pro | Leads Video-MME by ~7 points; image understanding carries |
| High-volume, low-complexity | Haiku 4.5 or GPT-5 Mini | Cheapest per turn; acceptable for FAQ, intent routing |
| Dual-control telecom / utilities | Opus 4.7 | Best multi-turn state tracking on tau2-bench |
| Long-context retrieval | Gemini 3.1 Pro | 1M-token context removes the RAG glue layer |

The honest answer for most teams is that they will run two or three of these models in production simultaneously. The planner is Opus 4.7 or GPT-5.4. The worker is Haiku 4.5, GPT-5 Mini, or Gemini Flash. A separate vision model handles photo uploads. One model for every turn is the wrong abstraction.

What Monitoring Has to Catch in Production

Picking a model is 30% of the work. Knowing when it drifts is the other 70%. Three things need to be in place before any of these models goes behind a customer-facing channel.

pass^k on your own scenarios, not tau-bench. Synthetic tau-bench tells you where the model is today on Sierra's tasks. What you actually care about is your refund flow, your address change, your plan upgrade. Build a scenario suite that reflects your real policies and your real tools, and run pass^8 on every model release before you promote it. The scenarios feature is built exactly for that: same customer goal, 100 runs, surface the variance.

Latency P99 by turn type. Mean latency lies. The customer who waits 8 seconds because extended thinking ran long is the customer who left a one-star review. P99 by turn type (tool-call turn, policy-reasoning turn, small-talk turn) is the number that tells you whether your thinking budget is calibrated. If P99 on tool-calling is above 2 seconds for voice, you have a routing bug.
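Computing P99 per turn type is a one-pass bucketing job. A sketch, assuming a log of `(turn_type, first_token_ms)` pairs (the data shape is ours; the turn labels come from the paragraph above):

```python
import math
from collections import defaultdict

def p99_by_turn_type(samples):
    """samples: iterable of (turn_type, first_token_ms) pairs.
    Returns P99 latency per turn type via the nearest-rank method."""
    buckets = defaultdict(list)
    for turn_type, ms in samples:
        buckets[turn_type].append(ms)
    result = {}
    for turn_type, vals in buckets.items():
        vals.sort()
        rank = math.ceil(0.99 * len(vals))  # nearest-rank P99
        result[turn_type] = vals[rank - 1]
    return result

# 100 tool-call turns spread from 400 to 499 ms: P99 is the 99th value.
samples = [("tool_call", 400 + i) for i in range(100)]
print(p99_by_turn_type(samples)["tool_call"])  # 498
```

The same bucketing makes the alerting rule trivial: page when `result["tool_call"]` crosses the 2-second voice budget.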

Tool-call argument drift. The failure mode to fear most is not "model refused to help." It is "model called the right tool with the wrong arguments." An agent that calls update_shipping_address with the old address is worse than one that errors out. Scorecard checks on tool-call arguments, not just tool choice, are the last line of defense before a bad tool call hits production. Chanl's tools layer exposes every call-and-argument pair so these drifts become visible rather than silent.
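An argument-level check can be as simple as a field-by-field diff between the logged call and the expected call. A sketch; the call and expectation shapes here are illustrative, not Chanl's actual API:

```python
def argument_drift(call: dict, expected: dict) -> list[str]:
    """Return the argument names whose values differ from expectations,
    assuming the tool name has already been matched."""
    return [
        key for key, want in expected["arguments"].items()
        if call["arguments"].get(key) != want
    ]

# Right tool, wrong address: the drift the text warns about.
call = {"tool": "update_shipping_address",
        "arguments": {"order_id": "A-1001", "address": "12 Old Rd"}}
expected = {"tool": "update_shipping_address",
            "arguments": {"order_id": "A-1001", "address": "7 New Ave"}}
print(argument_drift(call, expected))  # ['address']
```

A tool-choice-only check scores this call as a pass; the argument diff is what surfaces it.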

Model selection looks hard because the leaderboards are loud. It is actually the easy part. Production reliability is the hard part, and it lives almost entirely in the monitoring layer after the model is picked. Build, connect, monitor. Skip the monitor step and the refund flow from the opening of this article is the one that shows up in your inbox at 2 a.m.

A new model will top SWE-bench within weeks of this being published. The tau-bench order will shuffle. What stays constant is that the benchmark you rank models on has to match the job the model is going to do. For customer experience, that benchmark is tau-bench, and the winner is rarely the model the press release celebrates.
