Chanl

Testing & Evaluation

Browse 35 articles in testing & evaluation.

Testing & Evaluation Articles

Testing & Evaluation·15 min read

How to Build a Regression Test Suite for AI Agents

Your CI/CD pipeline catches code regressions. But who catches it when a prompt change breaks your agent's compliance behavior? Here's how to build behavioral regression testing for non-deterministic AI agents.

Testing & Evaluation·16 min read

Synthetic Users: Test Your Agent Against AI Personas

Scripted tests catch only the failures you anticipated. Build AI-powered synthetic users that simulate real customers and break your agent before it ships.

Testing & Evaluation·12 min read

How to Build a Trajectory Eval for Your AI Agent

Outcome evals check the final answer. Trajectory evals check the path: tools called, data touched, steps taken. Here's how to build one for a CX agent.

Testing & Evaluation·18 min read

Cost Per Successful Outcome: The AI Agent Metric Teams Miss

Most teams measure AI agent quality by pass rate. The metric that actually predicts ROI is cost per successful outcome: what each resolution costs paired against whether it actually resolved. Here's how to build it.

Testing & Evaluation·13 min read

Trajectory Eval: Catch Agent Bugs Output Scoring Misses

Final-output scoring misses 20-40% of agent regressions. Trajectory evaluation scores every step an agent takes -- tool calls, reasoning decisions, order of operations -- and catches the bugs that output-only evals can't see.

Testing & Evaluation·11 min read

Your Agent Has Observability. It Doesn't Have Measurement.

89% of AI teams added observability. 52% added evals. But only 31% can say whether their agent is getting better or worse. Here's the difference between watching your agent and actually measuring it.

Testing & Evaluation·13 min read

AI Agent KPIs: What to Measure Before You Ship

Only 31% of teams have a measurement framework for their AI agents. Here's how to define task completion rate, escalation rate, cost per outcome, and CSAT delta before your first production interaction.

Testing & Evaluation·11 min read

GPT-5, Claude 4.5, Gemini Score the Same Calls. Their Kappa Is 0.52

Run the same calls through GPT-5, Claude 4.5, and Gemini and Cohen's kappa lands at 0.52. Here is how to measure judge agreement on your own corpus.

Testing & Evaluation·14 min read

How to Eval Agents When There's No Right Answer

Most eval methods assume you know the correct response. CX agents rarely have one. Here's how to score agent quality with criteria-based rubrics and LLM-as-judge, no labeled ground truth required.

Testing & Evaluation·11 min read

Stop Using SWE-Bench to Pick Your CX Model

SWE-Bench scores 85% or 23% depending on the harness, and neither measures customer experience. Why tau-bench, tau2-bench, and pass^k matter for CX agents.

Testing & Evaluation·12 min read

Every Conversation Is an Experiment You Didn't Run

Your agent already ran the A/B test you're scoping. Here's how to read the results in your logs with propensity matching, synthetic control, and diff-in-diff.

Testing & Evaluation·8 min read

Is AI Better Than Your Humans? Score Both on One Rubric

Most teams can't say whether AI beats humans because they score them differently. One rubric, run on both, sliced by segment, gives you an honest answer.

The Signal Briefing

One email a week. How leading CS, revenue, and AI teams are turning conversations into decisions. Benchmarks, playbooks, and what works in production.

500+ CS and revenue leaders subscribed