Articles tagged “tau-bench”
2 articles

Industry & Strategy · 16 min read
Your CX Agent Doesn't Care Who Won SWE-Bench. Here's Who Actually Wins.
SWE-bench crowns a coding king. Customer experience agents answer to a different benchmark, tau-bench, and the rankings flip. The head-to-head that actually predicts production reliability.

Testing & Evaluation · 11 min read
Stop Using SWE-Bench to Pick Your CX Model
SWE-Bench scores 85% or 23% depending on the harness, and neither measures customer experience. Why tau-bench, tau2-bench, and pass^k matter for CX agents.