Articles tagged “benchmarks”
5 articles

Your CX Agent Doesn't Care Who Won SWE-Bench. Here's Who Actually Wins.
SWE-bench crowns a coding king. Customer experience agents answer to a different benchmark, tau-bench, and the rankings flip. The head-to-head that actually predicts production reliability.

Everyone Benchmarks Opus. Your Chatbot Runs on Haiku.
Haiku 4.5, GPT-5 Mini, Gemini Flash at the $1/MTok tier that powers CX. Tool-call accuracy, first-token latency, structured-output reliability, blended cost math.

Stop Using SWE-Bench to Pick Your CX Model
SWE-Bench scores 85% or 23% depending on the harness, and neither measures customer experience. Why tau-bench, tau2-bench, and pass^k matter for CX agents.

Your Agent Aced the Benchmark. Production Disagreed.
We scored 92% on GAIA. Production CSAT: 64%. Here's which AI agent benchmarks actually predict deployed performance, why most don't, and what to measure instead.

Performance Benchmarks for AI Agents: What Actually Matters Beyond Word Error Rate
Most enterprises obsess over Word Error Rate while missing the metrics that actually predict success. Here's what to measure instead.
The Signal Briefing
One email a week. How leading CS, revenue, and AI teams are turning conversations into decisions. Benchmarks, playbooks, and what's working in production.