Testing & Evaluation Articles
25 articles · Page 1 of 3

Every Conversation Is an Experiment You Didn't Run
Your agent already ran the A/B test you're scoping. Here's how to read the results in your logs with propensity matching, synthetic control, and diff-in-diff.

Is AI Better Than Your Humans? Score Both on One Rubric
Most teams can't say whether AI beats humans because they score them differently. One rubric, run on both, sliced by segment, gives you an honest answer.

Every Failed Call Is a Test Case You Haven't Written Yet
The gap between staging and production for AI agents is measured in surprise. Here's how to close the loop from live failure to regression gate.

How Much Testing Is Enough for Your AI Agent?
Code coverage doesn't apply to AI agents. Here's a framework for thinking about evaluation coverage: how many scenarios you need, what distribution to target, and how to know when you've tested enough.

Your LLM-as-judge may be highly biased
LLM-as-judge has 12 documented biases. Here are 6 evaluation methods production teams actually use instead, with code examples and patterns.

Is monitoring your AI agent actually enough?
Research shows 83% of agent teams track capability metrics but only 30% evaluate real outcomes. Here's how to close the gap with multi-turn scenario testing.

Memory bugs don't crash. They just give wrong answers.
Memory bugs don't crash your agent. They just give subtly wrong answers built on stale context. Here are 5 test patterns to catch them before customers do.

Online vs. Offline Evals: Close the Production Gap
89% of teams have observability but only 37% run online evals. Here's why that gap is where production failures hide, and how to close it with a practical online eval pipeline.

We open-sourced our AI agent testing engine
chanl-eval is an open-source engine for stress-testing AI agents with simulated conversations, adaptive personas, and per-criteria scorecards. MIT licensed.

Agent Drift: Why Your AI Gets Worse the Longer It Runs
AI agents silently degrade over long conversations. Research quantifies three types of drift and shows why point-in-time evals miss them entirely.

NIST Red-Teamed 13 Frontier Models. All of Them Failed.
NIST ran 250K+ attacks against 13 frontier models. None survived. Here's what the results mean for teams shipping AI agents to production today.

Your Agent Aced the Benchmark. Production Disagreed.
We scored 92% on GAIA. Production CSAT: 64%. Here's which AI agent benchmarks actually predict deployed performance, why most don't, and what to measure instead.