evaluation

Browse 13 articles tagged with “evaluation”.

AI-generated illustration for agent development lifecycle adlc

Operations·14 min read

How to Run the Agent Development Lifecycle (ADLC) in Production

Shipping an AI agent is easy. Keeping it reliable after launch is hard. The ADLC walks you through Intent, Build, Evaluate, Deploy, Observe, then back around.

A flowchart showing an agent's step-by-step decision path with one step flagged as diverging from the expected trajectory

Testing & Evaluation·13 min read

Trajectory Eval: Catch Agent Bugs Output Scoring Misses

Final-output scoring misses 20-40% of agent regressions. Trajectory evaluation scores every step an agent takes (tool calls, reasoning decisions, order of operations) and catches the bugs that output-only evals can't see.

A dashboard showing rich telemetry data on one side and a blank trend chart on the other, representing observability without measurement

Testing & Evaluation·11 min read

Your Agent Has Observability. It Doesn't Have Measurement.

89% of AI teams added observability. 52% added evals. But only 31% can say whether their agent is getting better or worse. Here's the difference between watching your agent and actually measuring it.

Dashboard showing AI agent KPI tiles for task completion rate, escalation rate, cost per successful outcome, and CSAT delta

Testing & Evaluation·13 min read

AI Agent KPIs: What to Measure Before You Ship

Only 31% of teams have a measurement framework for their AI agents. Here's how to define task completion rate, escalation rate, cost per outcome, and CSAT delta before your first production interaction.

AI-generated illustration for agent eval no ground truth -- Soul (2020) style, Terra Cotta palette

Testing & Evaluation·14 min read

How to Eval Agents When There's No Right Answer

Most eval methods assume you know the correct response. CX agents rarely have one. Here's how to score agent quality with criteria-based rubrics and LLM-as-judge, no labeled ground truth required.

Watercolor Illustration of Two Scoreboards Side by Side, One for Coding Tasks, One for Customer Conversations, With the Customer Scoreboard Showing Much Lower Numbers

Testing & Evaluation·11 min read

Stop Using SWE-Bench to Pick Your CX Model

SWE-Bench scores 85% or 23% depending on the harness, and neither measures customer experience. Why tau-bench, tau2-bench, and pass^k matter for CX agents.

Warm watercolor illustration of an engineer reviewing A/B test scorecards and conversation analytics at a rooftop workspace during golden hour

Testing & Evaluation·12 min read

Every Conversation Is an Experiment You Didn't Run

Your agent already ran the A/B test you're scoping. Here's how to read the results in your logs with propensity matching, synthetic control, and diff-in-diff.

Watercolor illustration of an observation tower overlooking two parallel worlds, Blade Runner 2049 style in sage and olive tones

Testing & Evaluation·8 min read

Is AI Better Than Your Humans? Score Both on One Rubric

Most teams can't say whether AI beats humans because they score them differently. One rubric, run on both, sliced by segment, gives you an honest answer.

Grid of test scenario cards with pass and fail indicators showing evaluation coverage distribution

Testing & Evaluation·13 min read

How Much Testing Is Enough for Your AI Agent?

Code coverage doesn't apply to AI agents. Here's a framework for thinking about evaluation coverage: how many scenarios you need, what distribution to target, and how to know when you've tested enough.

Control room with green monitoring screens, one cracked display unnoticed in the center, Minority Report style

Testing & Evaluation·14 min read

Is monitoring your AI agent actually enough?

Research shows 83% of agent teams track capability metrics but only 30% evaluate real outcomes. Here's how to close the gap with multi-turn scenario testing.

Open-source AI agent testing engine with conversation simulation and scorecard evaluation

Testing & Evaluation·14 min read

We open-sourced our AI agent testing engine

chanl-eval is an open-source engine for stress-testing AI agents with simulated conversations, adaptive personas, and per-criteria scorecards. MIT licensed.

A filing cabinet with most drawers empty and papers scattered on the floor, watercolor illustration in muted blue tones

Knowledge & Memory·12 min read

Your Agent Completed the Task. It Also Forgot 87% of What It Knew.

Task completion hides a silent failure: agents forget 87% of stored knowledge under complexity. New research reveals why standard evals miss this entirely.

Watercolor illustration of a split dashboard showing human reviewers on one side and automated scoring metrics on the other

Operations·15 min read

74% of Production Agents Still Rely on Human Evaluation

A survey of 306 practitioners reveals most production agents are far simpler than expected. The eval gap isn't a tooling problem. It's a trust problem.
