Articles tagged “tau-bench”
2 articles

Industry & Strategy·16 min read
Your CX Agent Doesn't Care Who Won SWE-Bench. Here's Who Actually Wins.
SWE-bench crowns a coding king. Customer experience agents answer to a different benchmark, tau-bench, and the rankings flip. The head-to-head that actually predicts production reliability.

Testing & Evaluation·11 min read
Stop Using SWE-Bench to Pick Your CX Model
SWE-bench scores come in at 85% or 23% depending on the harness, and neither number measures customer experience. Why tau-bench, tau2-bench, and pass^k matter for CX agents.