SWE-Bench Verified says GPT-5.4 hits 84% and Claude Opus 4.6 hits 80.8%. SWE-Bench Pro, the harder variant that came out specifically to address how inflated Verified had gotten, says the top score was around 23% when it first dropped. Both of those numbers are about coding agents writing patches for GitHub repos.
You were going to pick your customer support model based on one of them.
This happens constantly. A team building a refund bot or an airline change-fee agent opens a leaderboard, sees "GPT-5.4: 84% agentic coding," and treats that as a signal about customer experience. It is not a signal. It is roughly like picking a short-order cook based on their marathon time. The skills overlap at the level of "uses hands," and that is it.
The benchmarks that actually measure what CX agents do exist. Sierra Research shipped the first one, tau-bench, in mid-2024, and extended it with tau2-bench in 2025. They are both open source. Almost nobody I talk to has run them. This is the piece.
## What SWE-Bench Actually Measures
SWE-Bench gives an agent a real GitHub issue, a snapshot of the repo at the time the issue was filed, and a test suite. The agent wins if it produces a patch that makes the tests pass. SWE-Bench Verified is a 500-issue subset that humans hand-checked for solvability and clean test coverage.
The headline scores climbed fast. gpt-5.3-codex: 85%. GPT-5.4: 84%. Claude Opus 4.6: 80.8%. Claude Sonnet 4.6: 79.6%. Gemini 3.1 Pro: 75%.
Then Scale AI released SWE-Bench Pro. Same task format, issue in and patch out, but with three deliberate changes:
- GPL-licensed and proprietary repos to reduce the chance models had memorized the fixes.
- Underspecified-but-solvable issues kept in, instead of filtered out.
- Docker-based reproducible test harnesses per task.
Top models dropped from 70%+ on Verified to around 23% on Pro at launch. That gap is the story. It does not prove Verified is "wrong." It proves that a swing of nearly 50 points from one respected coding benchmark to another means we were not measuring a stable capability. Pro scores have since climbed into the 50s for the strongest configurations, but the lesson stands: headline numbers on a coding benchmark can move wildly when you change the harness, not the task.
And none of this has anything to do with customer experience.
## What a CX Benchmark Has to Test
Imagine a refund agent talking to a confused customer. The agent has to:
- Read a policy document. This item is eligible for refund within 30 days, this one is final sale.
- Verify identity before doing anything sensitive.
- Call `lookup_order(order_id)` with an id the customer may have mistyped.
- Call `process_refund(order_id, amount, reason)` with the correct amount, not the order total.
- Refuse gracefully when the customer demands something the policy does not allow.
- Recover when the customer says "wait, actually, I meant the shoes, not the jacket" four turns in.
- Not hallucinate a policy clause that would make the company eat a loss.
None of those are on SWE-Bench. SWE-Bench does not have a user. It has an issue description and a test runner. The agent never gets interrupted, never has to enforce a policy, never has to decline a request, never has to handle a misspelled order id.
A CX benchmark has to test all of those things, and it has to test them in multi-turn conversation with a user who behaves like a user.
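Two of those failure modes, the mistyped order id and the refund amount, can be sketched as guards around the tools. `lookup_order` and `process_refund` are the hypothetical tools from the list above; the data, field names, and policy constants here are purely illustrative:

```python
from datetime import date, timedelta

# Toy order store; a real agent would call a backend. All data is illustrative.
ORDERS = {
    "A1001": {
        "items": {"jacket": 120.00, "shoes": 80.00},
        "purchased": date.today() - timedelta(days=10),
        "final_sale": {"jacket"},
    },
}

def lookup_order(order_id: str) -> dict:
    """Return the order, raising on a mistyped id so the agent must re-ask."""
    if order_id not in ORDERS:
        raise KeyError(f"unknown order id {order_id!r}")
    return ORDERS[order_id]

def process_refund(order_id: str, item: str, amount: float, reason: str) -> float:
    """Refund a single item, enforcing the policy checks from the text."""
    order = lookup_order(order_id)
    if item in order["final_sale"]:
        raise ValueError(f"{item} is final sale, not refundable")
    if (date.today() - order["purchased"]).days > 30:
        raise ValueError("outside the 30-day refund window")
    price = order["items"][item]
    if amount > price:  # refund the item price, never the order total
        raise ValueError(f"refund {amount} exceeds item price {price}")
    return amount
```

Every branch in `process_refund` corresponds to a bullet above, which is exactly why a benchmark without a policy document and a user cannot exercise any of them.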
## Tau-Bench: Retail and Airline
Sierra Research built tau-bench around a simple premise. You have a database. The task has a desired final state of that database. The agent talks to a simulated user, calls tools, and wins if the database ends up in the right place.
Two domains in the original release:
- tau-retail. An e-commerce backend. Orders, items, exchanges, refunds. The agent has tools like `get_order`, `return_items`, `modify_pending_order_to_items`, `exchange_items`. A policy document lives in the system prompt.
- tau-airline. Reservations, cancellations, rebookings. Richer eligibility rules: silver members can cancel basic economy within 24 hours for free, other tiers cannot.
The user simulator is the interesting part. It is an LLM with a goal ("I want a refund for the hiking boots I returned last week"), a personality, and a rule: do not volunteer information the agent did not ask for. If the agent needs an order number, the agent has to ask. If the agent asks vaguely, the user answers vaguely. The simulator can also quit if the agent is going in circles.
Ground truth is a hard state check, not an LLM judge. After the conversation, the benchmark diffs the database against the expected state. Did the right refund fire, for the right amount, with the right reason code? Did no side-effect touch the wrong order? Some tasks also require that specific tool calls fire with specific arguments. That's the "action check."
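A minimal sketch of that two-part scoring, assuming a dict-shaped database and a list of recorded tool calls. This mirrors the mechanics described above, not tau-bench's actual internals:

```python
def db_check(final_db: dict, expected_db: dict) -> bool:
    """Hard state check: the conversation passes only if the database
    ends up exactly in the expected final state."""
    return final_db == expected_db

def action_check(made_calls: list, required_calls: list) -> bool:
    """Some tasks also require specific tool calls with specific arguments,
    regardless of how the conversation got there."""
    return all(call in made_calls for call in required_calls)

# Illustrative run: the right refund fired, for the right amount and reason code.
expected = {"refunds": [{"order": "A1001", "amount": 80.0, "reason": "defective"}]}
final    = {"refunds": [{"order": "A1001", "amount": 80.0, "reason": "defective"}]}

required = [("process_refund", {"order_id": "A1001", "amount": 80.0})]
made = [("lookup_order", {"order_id": "A1001"}),
        ("process_refund", {"order_id": "A1001", "amount": 80.0})]

passed = db_check(final, expected) and action_check(made, required)
```

Note there is no LLM judge anywhere in this loop: a wrong amount or a stray side-effect fails the diff no matter how fluent the conversation was.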
Original pass@1 numbers from the Sierra paper are humbling. GPT-4o with plain function calling hit around 61% on retail and around 35% on airline. These are not "agent vs agent" scores with elaborate scaffolding. They are the base behavior of a frontier model doing what most production CX agents actually do.
## Tau2-Bench and Dual-Control
The retail agent is a text interface to a database. The telecom agent is a text interface to a human holding a phone.
That distinction is what tau2-bench telecom captures with "dual-control." In the telecom domain, both the agent and the user simulator have tools. The user can toggle settings on their device, grant permissions, restart their modem, read error messages back. The agent cannot do any of those things directly. The agent has to guide the user through them.
A task might be: the user's home internet is down. The agent has tools to check line status and to send a reset command. The user has tools to reboot the modem, check the power cable, and report back what lights are on. Solving the ticket requires both sides to coordinate, and the scoring captures both task success and whether the agent communicated clearly enough for the user simulator to perform the required steps.
This is the shape of most real phone and chat support. The human is the final actuator. If the agent says "just go into settings and toggle the network thing," the user will not do the right action. If the agent says "open Settings, tap General, tap Network, then toggle Airplane Mode on, wait five seconds, toggle it off," the user will. The benchmark actually measures that.
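Dual-control can be sketched as two disjoint tool sets, where a passing trace has to draw from both. The tool names below are illustrative stand-ins, not tau2-bench's real telecom tools:

```python
# Agent-side tools: backend actions only the company's systems can perform.
AGENT_TOOLS = {"check_line_status", "send_reset_command"}

# User-side tools: physical actions only the person holding the device can perform.
USER_TOOLS = {"reboot_modem", "check_power_cable", "report_lights"}

def requires_coordination(call_log: list[str]) -> bool:
    """True if the trace used tools from both sides, i.e. neither party
    could have resolved the ticket alone."""
    used = set(call_log)
    return bool(used & AGENT_TOOLS) and bool(used & USER_TOOLS)

# The home-internet ticket from the text: the agent checks the line, the user
# reboots the modem and reads the lights back, then the agent sends a reset.
trace = ["check_line_status", "reboot_modem", "report_lights", "send_reset_command"]
```

The user-side calls only happen if the agent's instructions were clear enough for the simulator to follow, which is how the benchmark turns communication quality into a measurable pass/fail.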
Nothing on SWE-Bench resembles this.
## Pass^k vs Pass@1: The Reliability Gap
Here is the most useful thing to take from the tau-bench paper, independent of the benchmark itself.
The standard metric on most agent leaderboards is pass@1: run the task once; did the agent succeed? Or pass@k: at least one of k runs succeeded. Both are optimistic. They tell you about best case.
Customer support is not graded on best case. If a refund agent fires the wrong tool on one in every five conversations, that is not a B grade. That is a compliance incident.
Sierra introduced pass^k (pass-hat-k) for this. It is the probability that all k independent runs of the same task succeed, averaged across tasks. It decays geometrically as p^k.
That decay is brutal. A model that looks like a solid 90% on pass@1 reality-checks to this:
| pass@1 | pass^2 | pass^4 | pass^8 |
|---|---|---|---|
| 90% | 81% | 66% | 43% |
| 80% | 64% | 41% | 17% |
| 70% | 49% | 24% | 6% |
| 60% | 36% | 13% | 2% |
A 70% pass@1 agent only gets eight identical conversations right end-to-end about 6% of the time. That is not pedantic. That is the reality of running the same workflow a million times a month.
Which means when a vendor tells you their agent gets 85% on tau-bench, the first question is always: pass@1 or pass^k, and at what k. The second question is: how many trials per task. The third is whether they are reporting avg@k, the simple average across k runs, which hides variance and looks much higher than pass^k.
The current tau2-telecom snapshot on public leaderboards shows top models in the high 0.9s. That is almost certainly avg@k, not pass^k. Both numbers are legitimate, but they tell very different stories. You want the pessimistic one for production planning.
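The gap between the two metrics falls out of a few lines. For one task with n recorded trials, avg@n is the mean success rate, while the unbiased pass^k estimator, following the construction in the tau-bench paper, is C(c, k) / C(n, k) with c the number of successful trials:

```python
from math import comb

def avg_at_n(results: list[bool]) -> float:
    """Mean success rate across n trials of one task. Optimistic: hides variance."""
    return sum(results) / len(results)

def pass_hat_k(results: list[bool], k: int) -> float:
    """Unbiased estimate of P(all k independent runs succeed), computed as
    C(c, k) / C(n, k) from n recorded trials with c successes."""
    n, c = len(results), sum(results)
    return comb(c, k) / comb(n, k)

# A task that succeeds on 6 of 8 trials reads as 75% by the optimistic metric...
trials = [True, True, True, False, True, True, False, True]
print(avg_at_n(trials))       # 0.75
# ...but its all-4-of-4 reliability is 15/70, roughly 0.21.
print(pass_hat_k(trials, 4))
```

Averaging `pass_hat_k` across every task in the suite gives the leaderboard-level pass^k. A vendor quoting the 0.75-style number and one quoting the 0.21-style number are describing the same agent.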
## Running Your Own Slice
The benchmarks are open source. Here is the shortest path to running one against a model you are thinking about shipping.
```shell
# Clone the repo
git clone https://github.com/sierra-research/tau-bench
cd tau-bench
pip install -e .

# Run retail with your model, 4 trials per task for pass^4
python run.py \
  --agent-strategy tool-calling \
  --env retail \
  --model gpt-5.4 \
  --user-model gpt-5-mini \
  --user-strategy llm \
  --num-trials 4 \
  --max-concurrency 10
```

Start with retail. It is the most general domain, the cheapest to run, and it exercises the same skills your CX agent needs: policy reasoning, tool-argument correctness, graceful refusal. Airline and telecom exercise stricter eligibility rules and dual-control respectively. Get retail stable first.
Four rules for getting signal out of this:
- Report pass^k, not pass@1. k=4 is a reasonable floor. k=8 if your workload is high volume.
- Use a cheap model as the user simulator. The paper uses GPT-4-class for the agent and a smaller model for the user. You are not benchmarking the user simulator; you are benchmarking the agent.
- Add your own tasks. The included tasks are general e-commerce. Copy the format, write 10 to 20 tasks that mirror your real policy edge cases: the ones that cause your current agent to escalate.
- Register your real tools. If your production agent uses a custom `check_subscription_status` tool, add it to the domain's tools module and write tasks that require it. Otherwise you are benchmarking a generic agent against generic tools, which is interesting but not the question you are trying to answer.
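A custom task, in outline, pairs a user goal with the ground-truth tool calls that must fire. The field names below follow the general shape of tau-bench's retail tasks, but the exact schema differs between versions and the ids are made up, so treat this as a sketch of the format rather than the literal format:

```python
# Hypothetical task in roughly tau-bench's shape: a user goal for the
# simulator, plus the ground-truth actions the agent must take.
custom_task = {
    "user_id": "test_user_001",
    "instruction": (
        "You want a refund for the hiking boots in order #W0000001. "
        "You do not remember the order number unless asked for it. "
        "If the agent offers store credit instead, decline and insist "
        "on a refund to the original payment method."
    ),
    "actions": [
        {"name": "get_order", "arguments": {"order_id": "#W0000001"}},
        {"name": "return_items", "arguments": {
            "order_id": "#W0000001",
            "item_ids": ["hiking_boots_01"],
            "payment_method": "original",
        }},
    ],
}
```

The instruction encodes the user simulator's behavior (withhold the order number, resist the store-credit offer), and the actions list is what the action check grades against. Ten or twenty of these, written from your real escalation logs, is the whole ask.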
Budget: a full retail run on GPT-5.4 at k=4 is roughly a few hundred dollars at current API pricing. Telecom is more. Pick the domain that matches your product and pay for signal on that one.
## What Tau-Bench Will Not Tell You
This is the honest part. tau-bench measures the model's ability to follow a simulated policy doc and call simulated tools against a simulated database. It does not measure:
- Your system prompt. The benchmark ships its own.
- Your actual tools. You have to register them.
- Your real data. Your customers are weirder than any user simulator.
- Your failure modes under production load, context length, interruptions, or platform quirks (audio cutting out mid-sentence, messaging delivery delays, etc.).
A model that scores well on tau-retail will not automatically be a good production refund agent. The benchmark tells you the ceiling. Your deployment tells you the floor. That gap is where the work lives, and it is the reason deployment-level simulation matters on top of model-level benchmarking. We have written more about that distinction in how to evaluate AI agents and online vs offline evals. At Chanl that gap is what Scenarios covers: persona-driven simulations against your actual system prompt and tools. But the point is generic. Model benchmarks and deployment tests are different jobs. Do both.
## What to Ask Vendors
When someone pitches you a CX model or agent platform with a SWE-Bench number, three questions cut through the noise.
Question one: what did it score on tau-bench retail, and what k did you use? If they do not know what tau-bench is, they have not benchmarked for CX. If they know but refuse to give you k, they are reporting pass@1. If they report pass^4 or pass^8 honestly, you are talking to someone serious.
Question two: what does the user simulator look like in your tests? A benchmark where the user always provides the right information on the first turn is not a CX benchmark. A real answer involves LLM-driven users that get confused, change their minds, or quit.
Question three: what happens when the policy forbids the user's request? Most CX failures are not "did the agent complete the happy path." They are "did the agent refuse gracefully when it had to." If the vendor's eval does not include refusal cases, their numbers are best-case theater.
The models themselves are actually pretty good at customer support. The problem is the industry keeps measuring them on coding tasks and pretending that generalizes. It does not. pass^4 on tau-retail with your own tools registered will tell you more in an afternoon than any leaderboard has told you in a year.
Benchmark the model, then simulate your deployment.
tau-bench tells you the ceiling. Chanl Scenarios tells you what happens on your system prompt, your tools, and the edge cases your support team already knows about.
See Scenarios.

## References

- τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
- τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
- sierra-research/tau-bench on GitHub
- sierra-research/tau2-bench on GitHub
- Sierra: Benchmarking AI agents for the real world
- Artificial Analysis: τ²-Bench Telecom leaderboard
- llm-stats: tau2-telecom model scores
- SWE-bench leaderboards
- SWE-Bench Pro public leaderboard (Scale AI)
- Alan Product & Tech: Stop Trusting Headline Scores, Start Measuring Trade-offs
- EdenAI: Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Benchmarks
- BenchLM: LLM Agent & Tool-Use Benchmarks
- LM Council: AI Model Benchmarks April 2026
- Vellum: LLM Leaderboard 2026
Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.