benchmarks

Browse 5 articles tagged with “benchmarks”.

Industry & Strategy·16 min read

Your CX Agent Doesn't Care Who Won SWE-Bench. Here's Who Actually Wins.

SWE-bench crowns a coding king. Customer experience agents answer to a different benchmark, tau-bench, and the rankings flip. The head-to-head that actually predicts production reliability.

Agent Architecture·14 min read

Everyone Benchmarks Opus. Your Chatbot Runs on Haiku.

Haiku 4.5, GPT-5 Mini, Gemini Flash at the $1/MTok tier that powers CX. Tool-call accuracy, first-token latency, structured-output reliability, blended cost math.

Testing & Evaluation·11 min read

Stop Using SWE-Bench to Pick Your CX Model

SWE-Bench scores 85% or 23% depending on the harness, and neither measures customer experience. Why tau-bench, tau2-bench, and pass^k matter for CX agents.

Testing & Evaluation·13 min read

Your Agent Aced the Benchmark. Production Disagreed.

We scored 92% on GAIA. Production CSAT: 64%. Here's which AI agent benchmarks actually predict deployed performance, why most don't, and what to measure instead.

Testing & Evaluation·15 min read

Performance Benchmarks for AI Agents: What Actually Matters Beyond Word Error Rate

Most enterprises obsess over Word Error Rate while missing the metrics that actually predict success. Here's what to measure instead.
