
Your CX Agent Doesn't Care Who Won SWE-Bench. Here's Who Actually Wins.

SWE-bench crowns a coding king. Customer experience agents answer to a different benchmark, tau-bench, and the rankings flip. The head-to-head that actually predicts production reliability.

Dean Grover, Co-founder
April 23, 2026
16 min read

Pick any LLM leaderboard published this quarter and GPT-5.4 or GPT-5.3 Codex sits near the top. SWE-bench Verified scores are closing on 85%. Opus 4.7 trails slightly. Gemini 3.1 Pro is a respectable third on most coding tasks. The rankings feel stable and the decision feels easy. Pick the top score, wire it up, ship.

Then the customer experience team deploys that top-of-leaderboard model behind a retail chat agent and the refund flow quietly fails on one request in five. The support queue fills with "the bot told me the return window was 90 days, but the policy says 30." A week later someone points out that the agent is also transferring calls to the wrong department and hallucinating order numbers on pickup requests.

The model is fine. The benchmark was wrong for the job.

SWE-bench measures whether an agent can fix an issue in a GitHub repo. That is a coding benchmark, and it tells you almost nothing about how a model behaves when a customer asks for a refund, a flight change, or a plan upgrade. For that job there is a different benchmark, tau-bench, and the head-to-head looks different.

This is the shootout that actually matters for customer experience agents. Three frontier models, four axes that predict production reliability, and a cost table for a real 10,000-interactions-per-day deployment.

| What this article covers | Why it matters |
| --- | --- |
| Why SWE-bench misleads CX teams | The "top model" on coding is not the top model on tool-calling dialogue |
| tau-bench, tau2-bench, and pass^k | The CX-relevant metrics the press releases skip |
| Tool-calling head-to-head | Retail and airline tool accuracy, dual-control telecom |
| Latency and interruption | Voice and live chat budgets under 800 ms |
| Blended cost at 10K/day | Thinking tokens are the hidden bill |
| Which model wins which CX job | Different winners for retail, voice, multimodal, high-volume |

Why SWE-Bench Is the Wrong Benchmark for Customer Experience

SWE-bench Verified rewards depth of reasoning over a static codebase. Customer experience agents are judged on short, stateful dialogue with tool calls that commit real money. Those are two different skills, and the models that win one do not automatically win the other.

SWE-bench gives an agent a failing test and a repository. It has minutes to read files, form a hypothesis, edit code, and run the suite. Success is binary: does the patch pass. This favors models that can sustain long internal reasoning, hold thousands of lines of code in context, and recover from dead ends without human help. GPT-5.3 Codex is tuned exactly for this and scores 85% on the verified split [1]. GPT-5.4 sits around 84%. Opus 4.6 landed at 80.8% and Sonnet 4.6 at 79.6%, with Opus 4.7 tracking in that same low-80s band [2]. Gemini 3.1 Pro trails at ~75%.

Customer experience is not that. A CX turn is short. The agent has one user message, a handful of tools, and a response budget under a second for voice or under three seconds for chat. There is no repository to explore. There is a policy to respect, a tool call to pick, an argument to fill, and sometimes a clarifying question to ask. The model is not deliberating alone. It is negotiating with a human who will hang up or escalate if the agent stalls.

The tell that something is off is SWE-bench Pro. When the same evaluation format is rerun on harder, human-curated tasks without training-set contamination, the top score collapses from 85%+ on Verified to 23% on Pro [1]. That 60-plus-point collapse is a warning that scores on clean splits have been inflated by models seeing similar tasks during training. The leaderboards look cleaner than the field is.

For CX the right benchmark is one that simulates dialogue, tools, and policy. That benchmark is tau-bench.

What Tau-Bench and Tau2-Bench Actually Measure

tau-bench is a dialogue-and-tool-calling benchmark with three domains: retail, airline, and telecom. tau2-bench adds dual-control telecom, a setting where the customer and the agent both hold state and must reach consistent decisions. Single-run SOTA hovers around 50% in retail, and pass^8 is under 25% for top models [3][4].

The structure is worth understanding because it tells you what the benchmark is selecting for.

  • Retail. The agent has tools like get_order_status, process_return, update_shipping_address, and a policy document. The customer has a goal: return a jacket, change an address, cancel an unshipped order. The agent wins if it calls the right tool with the right arguments inside the policy.
  • Airline. Same shape, harder tools. search_flights, change_reservation, compute_fare_difference. The policy is longer and includes change fees, fare classes, and cancellation windows. Getting the tool right is not enough. The reasoning about whether a change is allowed has to be correct.
  • Telecom (tau2-bench). Dual-control. Both the customer and the agent have partial state. A plan change requires the customer to confirm an action the agent proposes. If either side misreads the state, the turn fails. This is the closest simulation to real contact-center flows where agents and customers co-author the outcome.

The metric that matters most is pass^k. pass^1 is the single-run success rate; pass^k is the probability the agent succeeds on all k independent runs of the same task. Top models on tau-bench retail land around pass^1 of ~50% and pass^8 under 25% [3][4]. That is the number the press releases do not print.

pass^k matters for CX because customers do not get the luxury of a single lucky try. Ten customers with the same refund request should get the same answer. When pass^8 is 25%, only one batch in four of eight identical-request customers all get the right answer; the rest of the time, at least one of the eight is wrong. That is not a model you can deploy behind a brand without human review in the loop.
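For your own scenario suites, pass^k can be estimated from n recorded runs of the same task with a standard combinatorial estimator: the chance that k runs drawn from those n all succeeded. A minimal sketch (function name is ours, not from any benchmark harness):

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k from n recorded runs of one task, c of which
    succeeded: the probability that k runs drawn from those n all pass."""
    return comb(c, k) / comb(n, k)

# A task the agent solves 4 times out of 8 looks tolerable at pass^1
# but is hopeless at pass^8:
print(pass_hat_k(8, 4, 1))  # 0.5
print(pass_hat_k(8, 4, 8))  # 0.0  (comb(4, 8) is zero)
```

Averaging this estimate over every task in a suite gives the suite-level pass^k curve, which falls monotonically as k grows.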

Head-to-Head: Tool-Calling on Tau-Bench

Here is the published head-to-head for the three frontier models on tau-bench and tau2-bench as of April 2026. Numbers are approximate. tau-bench variance across runs is high, which is itself the point.

| Model | tau-bench Retail (pass^1) | tau-bench Airline (pass^1) | tau2-bench Telecom | SWE-bench Verified |
| --- | --- | --- | --- | --- |
| Claude Opus 4.7 | ~52% | ~48% | ~46% | low 80s |
| GPT-5.4 | ~49% | ~47% | ~44% | 84% |
| Gemini 3.1 Pro | ~43% | ~41% | ~40% | ~75% |

Sources: [2][4][5]. Numbers are single-run pass^1. pass^8 for all three drops under 25% on retail, consistent with published SOTA [3].

Three things stand out.

First, the tau-bench order inverts the SWE-bench order for the top two models. GPT-5.4 leads on code; Opus 4.7 leads on dialogue-and-tools. The gap between them is inside three points, small enough that the choice gets decided by other factors: latency, cost, and multimodal support.

Second, no model is above 55% on any of the three tau-bench domains. The ceiling itself is low, which is why pass^k reliability matters so much. A 48% single-run score means half the time the model gets it wrong on the first try. What you actually care about is whether the second and third tries converge on the right answer, and the pass^k numbers say they do not.

Third, Gemini 3.1 Pro trails meaningfully on retail and airline, but it is not out of the race. Its advantage shows up elsewhere: multimodal inputs, long-context retrieval, and GPQA Diamond at 94.3% [6]. For a plain-text CX flow it is the weakest of the three. For a flow where customers send photos, it leads.

Reliability Under Repetition: The Dual-Control Failure Mode

The dual-control setting in tau2-bench is where the bottom falls out. When both sides of a conversation hold state, any model that drifts on the second turn cascades into failure on turns three and four. This is the failure mode that hurts CX teams most in production.

Take a telecom plan change as a concrete example. The customer calls to upgrade from a 5 GB plan to a 10 GB plan. The agent has tools to check eligibility, quote the new price, and commit the change. The customer has partial state. They know what they want but they might not remember their current plan or the exact cost. The dialogue has three turns at minimum. Eligibility check. Price confirmation. Commit.

If the model hallucinates the current plan on turn one, the quote on turn two is wrong, and the commit on turn three either fails validation or succeeds on the wrong plan. The single-turn accuracy can be 80% and the three-turn success rate can be 50%, because errors compound.
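The compounding in that example is just multiplication. Under the toy assumption that every turn must be correct and turn-level errors are independent, the arithmetic works out to roughly the 50% the text cites:

```python
def end_to_end_success(per_turn_accuracy: float, turns: int) -> float:
    """Toy model: every turn must be right and errors are independent,
    so end-to-end success is the product of per-turn accuracies."""
    return per_turn_accuracy ** turns

# 80% per-turn accuracy over the three-turn plan change
# (eligibility -> quote -> commit):
print(round(end_to_end_success(0.80, 3), 3))  # 0.512
```

Real dialogues are worse than this model suggests, because an early state error also corrupts the inputs to later turns rather than failing independently.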

Opus 4.7's edge on tau2-bench telecom comes from better state tracking across turns. GPT-5.4 is competitive when extended thinking is on, but that thinking costs latency and tokens. Gemini 3.1 Pro trails both on multi-turn state specifically.

For any CX workflow that has more than one tool call per conversation, which is almost all of them, the dual-control number is the one to stress. If a vendor only publishes pass^1 retail, ask for pass^8 and tau2-bench telecom. If they do not have the numbers, run your own scenarios before you commit.

This is where a dedicated scenario testing layer earns its keep. pass^k is only meaningful if you can replay the same customer goal against your agent hundreds of times with different personas and see where it drifts. Synthetic runs catch the failure modes that do not show up on the first ten manual tests.

Latency and Interruption: What Voice and Live Chat Actually Demand

Voice agents have a hard latency ceiling of about 800 ms from user silence to model first token. Live chat is looser at 2-3 seconds. Reasoning models blow through both budgets when extended thinking is enabled. This is the constraint that most benchmark tables leave out entirely.

In a voice call, if the model takes more than about a second to start speaking, the customer either repeats themselves or interrupts the agent. Interruption handling is solvable. Modern voice stacks handle it cleanly with voice activity detection (VAD) and barge-in. What is not solvable is the perception of a slow agent. A consumer hangs up on a laggy voice bot in under 30 seconds.

The three models diverge sharply on latency under real tool-calling loads.

| Model | Non-thinking first token | Extended-thinking first token | Voice-ready? |
| --- | --- | --- | --- |
| Claude Haiku 4.5 | 200-400 ms | N/A | Yes |
| Claude Sonnet 4.6 | 300-600 ms | 1-3 s | Yes, without thinking |
| Claude Opus 4.7 | 500-900 ms | 2-6 s | Marginal |
| GPT-5.4 | 400-800 ms | 3-10 s+ | Yes, without thinking |
| Gemini 3.1 Pro | 300-700 ms | 2-5 s | Yes, without thinking |

These are rough measured ranges from public testing and vendor docs. Your own numbers will vary with region, caching, and concurrency.

The practical implication is that the best tau-bench model (Opus 4.7) is borderline for voice if you enable extended thinking. Teams running voice agents either route to Sonnet 4.6 for the dialogue turns and call Opus 4.7 only for the hard policy decisions, or they disable thinking and accept the accuracy hit. The planner/worker split is how you get both. The planner thinks slowly, the worker responds fast. We wrote about that pattern in more depth in Agentic RAG and retrieval agents if you want the full architecture.
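A minimal sketch of that routing split. The model names echo the ones above, but the `classify_turn()` heuristic is entirely illustrative; in production the routing signal would come from intent detection and tool context, not keyword matching:

```python
# Hypothetical planner/worker router. Nothing here is a real vendor API.
PLANNER = "opus-4.7"   # slow, accurate: policy and money decisions
WORKER = "sonnet-4.6"  # fast, thinking off: ordinary dialogue turns

def classify_turn(user_message: str, pending_tool_calls: int) -> str:
    """Cheap heuristic: escalate turns that touch money, policy,
    or an in-flight tool call to the planner model."""
    hard_markers = ("refund", "cancel", "fee", "policy", "upgrade")
    text = user_message.lower()
    if pending_tool_calls > 0 or any(m in text for m in hard_markers):
        return PLANNER
    return WORKER

print(classify_turn("what's your return policy?", 0))  # opus-4.7
print(classify_turn("thanks, that's all!", 0))         # sonnet-4.6
```

The design point is that the router itself must be cheaper than the cheapest model it routes to, or the split stops paying for itself.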

For live chat the budget is more forgiving, and extended thinking is usually worth it on the first turn, the one where the agent decides what the customer actually wants. Subsequent turns can run on the smaller sibling model without measurable quality loss in A/B tests.

Blended Cost at 10,000 Interactions per Day

Price-per-token charts look cheap until you multiply by production volume. A 10,000-interaction-per-day CX deployment running entirely on Opus 4.7 with extended thinking can run close to six figures a month, and most of that bill is thinking tokens you never see. Here is the math on a realistic deployment.

Assume a 10,000-session-per-day deployment with a median of six turns per session, 400 input tokens per turn (system prompt cached), and 150 output tokens per turn visible to the user. Reasoning models add roughly 1,500 thinking tokens per turn when extended thinking is on. Anthropic and OpenAI pricing is from April 2026 vendor docs [7][8]. Gemini 3.1 Pro rates are approximated from Google's posted 3.0 Pro card and early 3.1 previews; treat that row as a planning estimate until the final price list ships.

The arithmetic: 10,000 × 6 = 60,000 turns/day. Input tokens/day = 24M. Visible output/day = 9M. Thinking output/day (when enabled) = 90M. Multiply each by the per-million rate for the model and you get the numbers below.
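As a sanity check, the Opus 4.7 row can be reproduced in a few lines from the assumptions above (60,000 turns/day; 400 input, 150 visible output, and 1,500 thinking tokens per turn; $5 per million input and $25 per million output):

```python
def daily_cost(turns, in_tok, vis_tok, think_tok, in_rate, out_rate):
    """Rates are dollars per million tokens; thinking bills as output."""
    m = 1_000_000
    return (
        turns * in_tok / m * in_rate,      # input
        turns * vis_tok / m * out_rate,    # visible output
        turns * think_tok / m * out_rate,  # thinking output
    )

inp, vis, think = daily_cost(60_000, 400, 150, 1_500, 5.00, 25.00)
print(inp, vis, think, inp + vis + think)  # 120.0 225.0 2250.0 2595.0
```

Swapping in each model's rates reproduces the rest of the table; the monthly column is the daily total times 30.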

| Configuration | Input cost/day | Visible output/day | Thinking output/day | Total/day | Total/month |
| --- | --- | --- | --- | --- | --- |
| Opus 4.7, thinking on ($5 / $25) | $120 | $225 | $2,250 | $2,595 | ~$77,850 |
| GPT-5.4 Standard, thinking on ($2.50 / $15) | $60 | $135 | $1,350 | $1,545 | ~$46,350 |
| Gemini 3.1 Pro, thinking on | $72 | $162 | $1,080 | $1,314 | ~$39,420 |
| Sonnet 4.6, thinking off ($3 / $15) | $72 | $135 | $0 | $207 | ~$6,210 |
| Planner/worker (Opus plan 1-in-6, Haiku work 5-in-6) | $40 | $75 | $375 | $490 | ~$14,700 |

Two things jump out of that table.

Thinking tokens are usually the largest line item on any reasoning-heavy deployment. They bill as output tokens, invisible to the user, and consume the output context budget. A 1,500-token thinking trace before a 150-token visible reply means 90% of what you are paying for is never seen. Finout's pricing analysis goes deep on this if you want the source math [8].

The other surprise is how cheap the planner/worker split gets. Routing one in every six turns to a frontier planner (Opus 4.7) and the other five to a cheap worker (Haiku 4.5) cuts monthly spend from ~$78K to ~$15K with near-identical tau-bench accuracy in published A/B tests. This is the default architecture for any high-volume CX deployment and it is the reason "just use Opus 4.7 everywhere" is a bad strategy even when budget is not tight.

The caveat: planner/worker only pays off if you can monitor where quality drops. If the worker model starts missing intents that the planner catches, you will not see it until a week of survey scores come back. Continuous scorecard grading on production conversations is what makes the split safe.

Which Model Wins Which CX Job

There is no single winner. The right model depends on the job, the channel, and the cost ceiling. Here is the decision matrix for the most common CX patterns.

| CX Job | Primary Model | Reasoning |
| --- | --- | --- |
| Retail chat, text only | Opus 4.7 or GPT-5.4 | Both lead tau-bench retail; pick by org's existing API contracts |
| Airline / policy-heavy flows | Opus 4.7 | Strongest on long-policy reasoning and multi-turn state |
| Voice agent (sub-second budget) | Sonnet 4.6 or Haiku 4.5 | Latency wins; route hard turns to Opus planner |
| Multimodal (photos, receipts, IDs) | Gemini 3.1 Pro | Leads Video-MME by ~7 points; image understanding carries |
| High-volume, low-complexity | Haiku 4.5 or GPT-5 Mini | Cheapest per turn; acceptable for FAQ, intent routing |
| Dual-control telecom / utilities | Opus 4.7 | Best multi-turn state tracking on tau2-bench |
| Long-context retrieval | Gemini 3.1 Pro | 1M-token context removes the RAG glue layer |

The honest answer for most teams is that they will run two or three of these models in production simultaneously. The planner is Opus 4.7 or GPT-5.4. The worker is Haiku 4.5, GPT-5 Mini, or Gemini Flash. A separate vision model handles photo uploads. One model for every turn is the wrong abstraction.

What Monitoring Has to Catch in Production

Picking a model is 30% of the work. Knowing when it drifts is the other 70%. Three things need to be in place before any of these models goes behind a customer-facing channel.

pass^k on your own scenarios, not tau-bench. Synthetic tau-bench tells you where the model is today on Sierra's tasks. What you actually care about is your refund flow, your address change, your plan upgrade. Build a scenario suite that reflects your real policies and your real tools, and run pass^8 on every model release before you promote it. The scenarios feature is built exactly for that: same customer goal, 100 runs, surface the variance.

Latency P99 by turn type. Mean latency lies. The customer who waits 8 seconds because extended thinking ran long is the customer who left a one-star review. P99 by turn type (tool-call turn, policy-reasoning turn, small-talk turn) is the number that tells you whether your thinking budget is calibrated. If P99 on tool-calling is above 2 seconds for voice, you have a routing bug.
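Computing P99 per turn type is a one-pass bucketing job. A sketch, assuming a log of `(turn_type, first_token_ms)` pairs (the data shape is ours; the turn labels come from the paragraph above):

```python
import math
from collections import defaultdict

def p99_by_turn_type(samples):
    """samples: iterable of (turn_type, first_token_ms) pairs.
    Returns P99 latency per turn type via the nearest-rank method."""
    buckets = defaultdict(list)
    for turn_type, ms in samples:
        buckets[turn_type].append(ms)
    result = {}
    for turn_type, vals in buckets.items():
        vals.sort()
        rank = math.ceil(0.99 * len(vals))  # nearest-rank P99
        result[turn_type] = vals[rank - 1]
    return result

# 100 tool-call turns spread from 400 to 499 ms: P99 is the 99th value.
samples = [("tool_call", 400 + i) for i in range(100)]
print(p99_by_turn_type(samples)["tool_call"])  # 498
```

The same bucketing makes the alerting rule trivial: page when `result["tool_call"]` crosses the 2-second voice budget.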

Tool-call argument drift. The failure mode to fear most is not "model refused to help." It is "model called the right tool with the wrong arguments." An agent that calls update_shipping_address with the old address is worse than one that errors out. Scorecard checks on tool-call arguments, not just tool choice, are the last line of defense before a bad tool call hits production. Chanl's tools layer exposes every call-and-argument pair so these drifts become visible rather than silent.
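An argument-level check can be as simple as a field-by-field diff between the logged call and the expected call. A sketch; the call and expectation shapes here are illustrative, not Chanl's actual API:

```python
def argument_drift(call: dict, expected: dict) -> list[str]:
    """Return the argument names whose values differ from expectations,
    assuming the tool name has already been matched."""
    return [
        key for key, want in expected["arguments"].items()
        if call["arguments"].get(key) != want
    ]

# Right tool, wrong address: the drift the text warns about.
call = {"tool": "update_shipping_address",
        "arguments": {"order_id": "A-1001", "address": "12 Old Rd"}}
expected = {"tool": "update_shipping_address",
            "arguments": {"order_id": "A-1001", "address": "7 New Ave"}}
print(argument_drift(call, expected))  # ['address']
```

A tool-choice-only check scores this call as a pass; the argument diff is what surfaces it.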

Model selection looks hard because the leaderboards are loud. It is actually the easy part. Production reliability is the hard part, and it lives almost entirely in the monitoring layer after the model is picked. Build, connect, monitor. Skip the monitor step and the refund flow from the opening of this article is the one that shows up in your inbox at 2 a.m.

A new model will top SWE-bench within weeks of this being published. The tau-bench order will shuffle. What stays constant is that the benchmark you rank models on has to match the job the model is going to do. For customer experience, that benchmark is tau-bench, and the winner is rarely the model the press release celebrates.
