Here are three numbers from Datadog's 2026 State of AI Engineering report, published this spring:
- 89% of teams have implemented observability for their AI agents
- 52% have adopted some form of evaluation
- 31% have a measurement framework that tells them whether their agent is improving or degrading
Those three numbers describe the same problem in different ways: you can watch your agent closely and still have no idea what it's doing to your business.
Teams that have observability can answer: "What happened during that conversation?" They can pull a trace, see which tools fired, check latency, count tokens. That's genuinely useful when something breaks.
Teams that have measurement can answer: "Is my agent better today than it was six weeks ago?" They can quantify whether the prompt change they shipped last Tuesday actually helped. They can catch regressions before users do. They can tell the product team with confidence whether the new knowledge base chunk improved recall or hurt it.
The 89% built dashboards. The 31% built something to reason from.
Here's how to get from one to the other.
Observability and Measurement Are Not the Same Thing
Observability tells you what happened. Measurement tells you whether that's good.
Think about what a typical observability setup gives you: distributed traces across LLM calls, tool executions, and retrieval steps. Token counts per turn. Latency histograms. Error rates by tool. Session replay for specific conversations. That's all genuinely valuable -- when something goes wrong, you can find it quickly.
But none of it tells you whether your agent is getting better or worse over time. A trace shows you that Tool A was called with argument X and returned result Y. It doesn't tell you whether that was the right tool to call, whether the argument was correct, or whether the customer's underlying problem got solved.
Measurement fills that gap. It requires:
- A definition of success that's specific to your agent's job
- A set of reference scenarios you can run repeatedly
- A scoring system that grades performance across multiple dimensions
- A cadence that produces trend data you can reason from
You can have all four of these without a single distributed trace, and you can have a perfect tracing setup with none of them. Most teams are in the second situation.
Why 69% of Teams Are Flying Blind
Three patterns explain why most teams stop at observability and never build measurement. They're not about capability -- they're about how the work gets structured.
The teams in that 69% aren't careless, which is why the 31% figure surprised researchers. They've invested in observability, and they're running evals in some form. The patterns below are why "some evals" doesn't become "a measurement framework":
Evals as one-off checks, not trend infrastructure. A team writes 30 test conversations before shipping a feature, runs the agent against them, fixes the failures, ships. That's useful! But if they don't run those same 30 conversations again next month with the same scoring, they have a snapshot, not a trend. One data point isn't measurement.
Metrics that don't reflect customer outcomes. Teams commonly track average response latency, token cost, and tool call success rate. These are real metrics. But a 200ms response that confidently gives the wrong answer is worse than a 400ms response that admits uncertainty and escalates. Speed and cost metrics measure the how, not the whether.
No agreed definition of "good." The hardest part of measurement isn't technical -- it's deciding what success means for your specific agent. For a booking agent, is success "appointment confirmed"? "Customer didn't call back"? "NPS > 4"? Without a written definition, every eval produces a different answer depending on who runs it.
The good news is that starting a measurement framework doesn't require rebuilding your observability stack. You can layer measurement on top of what you already have in an afternoon of work.
The Four Dimensions That Actually Matter
A useful measurement framework for a CX agent tracks four things:
Task accuracy. Did the agent do what the user asked? For a booking agent, did it actually confirm a valid appointment? For a returns agent, did it initiate the right return process? This is output correctness at the business level, not the text level.
Tool correctness. Did the agent call the right tools, in the right order, with the right arguments? An agent that books an appointment but calls the wrong calendar API for the user's timezone is wrong even if the conversation sounds right. Observability gives you traces; measurement tells you whether those traces look correct.
Policy compliance. Did the agent stay within allowed behaviors? Never quote prices it isn't authorized to quote. Always escalate unverifiable medical questions. Never collect payment details on an unencrypted channel. Policy compliance is binary and doesn't show up in output quality scores unless you specifically look for it.
Cost per successful outcome. What did each confirmed booking actually cost in tokens, tool calls, and API fees? This isn't about being cheap -- it's about knowing when a prompt change that looks neutral on quality metrics is silently tripling your per-interaction cost.
Most teams track one or two of these. Tracking all four together is what makes the measurement framework useful: you might ship a prompt change that improves task accuracy, hurts policy compliance slightly, and doubles cost. Without all four metrics in the same view, you'd only see the accuracy win.
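Cost per successful outcome is the easiest of the four to compute once each scored conversation records a cost and a success flag. Here is a minimal sketch; the `RunRecord` shape and the function name are illustrative assumptions, not part of any particular framework. Failed conversations still count toward the spend, which is what makes the ratio sensitive to both quality and cost.

```typescript
// Hypothetical record shape: one entry per scored conversation.
interface RunRecord {
  succeeded: boolean; // did the conversation reach the business outcome?
  costCents: number;  // tokens + tool calls + API fees, in cents
}

// Total spend (including failed conversations) divided by the number of
// successful outcomes. Returns null when nothing succeeded, because the
// ratio is undefined in that case.
function costPerSuccessfulOutcome(records: RunRecord[]): number | null {
  const successes = records.filter(r => r.succeeded).length;
  if (successes === 0) return null;
  const totalCents = records.reduce((sum, r) => sum + r.costCents, 0);
  return totalCents / successes;
}

// Illustrative data: same success pattern, but the second run costs 3x per call.
const beforeChange: RunRecord[] = [
  { succeeded: true, costCents: 4 },
  { succeeded: true, costCents: 4 },
  { succeeded: false, costCents: 4 },
  { succeeded: true, costCents: 4 },
];
const afterChange: RunRecord[] = beforeChange.map(r => ({
  ...r,
  costCents: r.costCents * 3,
}));

console.log(costPerSuccessfulOutcome(beforeChange)); // 16 cents / 3 successes
console.log(costPerSuccessfulOutcome(afterChange));  // 48 cents / 3 successes
```

In this made-up example, accuracy is identical across both runs, so a quality-only view would call the change neutral while the cost per confirmed outcome triples.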
Building Your First Measurement Loop
You don't need a sophisticated eval framework to start. Here's a minimal loop that produces real trend data:
```typescript
interface ConversationScore {
  conversationId: string;
  runDate: Date;
  taskAccuracy: number; // 0-1
  toolCorrectness: number; // 0-1
  policyCompliance: boolean; // Pass/fail
  costCents: number;
  notes?: string;
}

// Arithmetic mean of a list of numbers
const average = (xs: number[]) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Reference set: 20-30 conversations with known-good outcomes
// Run this weekly (or on every significant change)
// ReferenceConversation and runAgentOnConversation are assumed to come from your own agent harness
async function runWeeklyMeasurement(referenceConversations: ReferenceConversation[]) {
  const scores: ConversationScore[] = [];
  for (const ref of referenceConversations) {
    const session = await runAgentOnConversation(ref.input);
    const score = await scoreSession(session, ref.expectedOutcome);
    scores.push(score);
  }
  return {
    taskAccuracy: average(scores.map(s => s.taskAccuracy)),
    toolCorrectness: average(scores.map(s => s.toolCorrectness)),
    policyPassRate: scores.filter(s => s.policyCompliance).length / scores.length,
    avgCostCents: average(scores.map(s => s.costCents)),
    sampleSize: scores.length,
    runDate: new Date(),
  };
}
```

The scoring function is where most of the work lives. For task accuracy, you need ground truth -- a known-good outcome for each reference conversation. For tool correctness, you check whether the sequence of tool calls matches a reference sequence (or use LLM-as-judge for flexible matching). For policy compliance, you run the transcript through a policy checker.
```typescript
async function scoreSession(
  session: AgentSession,
  expected: ExpectedOutcome,
): Promise<ConversationScore> {
  // Task accuracy: did the right thing happen?
  const taskAccuracy = await checkOutcome(session.result, expected.outcome);

  // Tool correctness: did the right tools fire in the right order?
  const toolCorrectness = scoreToolSequence(
    session.toolCalls,
    expected.toolSequence,
  );

  // Policy compliance: did any policy violations occur?
  const policyCompliance = await checkPolicyViolations(session.transcript);

  // Cost: sum all token and tool API costs
  const costCents = calculateSessionCost(session.usage);

  return {
    conversationId: session.id,
    runDate: new Date(),
    taskAccuracy,
    toolCorrectness,
    policyCompliance,
    costCents,
  };
}
```

Store results in a simple table and chart the weekly trend. Even a spreadsheet works for the first few months. The point is to have a number you can compare against next week's number.
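The `scoreToolSequence` helper called in the scoring function is never defined in this article. One possible way to fill it in, sketched here under assumptions, is to compare tool names in order using a longest-common-subsequence match. The `ToolCall` shape is hypothetical, and this version ignores arguments entirely, so a right-tool-wrong-argument call would still need a separate check.

```typescript
// Hypothetical minimal shape; a real session object likely carries arguments too.
interface ToolCall {
  name: string;
}

// Scores 0-1: the length of the longest common subsequence of tool names,
// divided by the longer of the two sequences. Extra, missing, and
// out-of-order calls all lower the score.
function scoreToolSequence(actual: ToolCall[], expected: string[]): number {
  if (actual.length === 0 && expected.length === 0) return 1;
  const rows = actual.length + 1;
  const cols = expected.length + 1;
  // lcs[i][j] = LCS length of the first i actual calls and first j expected names
  const lcs: number[][] = Array.from({ length: rows }, () =>
    new Array<number>(cols).fill(0),
  );
  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      lcs[i][j] =
        actual[i - 1].name === expected[j - 1]
          ? lcs[i - 1][j - 1] + 1
          : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
    }
  }
  return lcs[actual.length][expected.length] / Math.max(actual.length, expected.length);
}

// Illustrative tool names for a booking flow.
const expectedSequence = ["check_availability", "book_appointment", "send_confirmation"];

const inOrder: ToolCall[] = expectedSequence.map(name => ({ name }));
const swapped: ToolCall[] = [
  { name: "check_availability" },
  { name: "send_confirmation" },
  { name: "book_appointment" },
];

console.log(scoreToolSequence(inOrder, expectedSequence)); // all three match in order
console.log(scoreToolSequence(swapped, expectedSequence)); // only two of three match in order
```

Normalizing by the longer sequence is a design choice: it penalizes an agent that makes the right calls but pads them with unnecessary ones. If extra calls are harmless for your agent, dividing by `expected.length` instead would be a reasonable alternative.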
Scorecards as the Bridge Between Observability and Measurement
The pattern that makes measurement sustainable at scale is the scorecard: a structured evaluation that grades multiple dimensions per conversation rather than producing a single aggregate score.
A scorecard for a booking agent might look like this:
| Dimension | Weight | Score | Notes |
|---|---|---|---|
| Confirmed valid appointment | 30% | 1.0 | Correct date, time, service |
| Used correct calendar tool | 20% | 0.8 | Called right API, wrong timezone |
| Stayed within pricing policy | 25% | 1.0 | No unauthorized discounts |
| Appropriate escalation behavior | 15% | 1.0 | Correctly escalated complex request |
| Cost within budget | 10% | 0.9 | Slightly over token budget |
| Weighted total | 100% | 0.95 | |

Scorecards make regression detection specific. If your overall score drops from 0.95 to 0.89 after a prompt change, a single number doesn't tell you where to look. A scorecard tells you that tool correctness dropped from 0.8 to 0.5 -- probably because the prompt change affected how the agent selects between calendar APIs.
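The weighted total is the sum of weight times score across the rows, and the per-dimension view is a row-by-row comparison of two scorecards. Here is a small sketch using the weights and scores from the table above; the `ScorecardRow` type and both function names are illustrative assumptions.

```typescript
interface ScorecardRow {
  dimension: string;
  weight: number; // fraction of the total; all weights should sum to 1
  score: number;  // 0-1
}

// Sum of weight * score. Throws if the weights do not add up to 1,
// since the total would otherwise be silently mis-scaled.
function weightedTotal(rows: ScorecardRow[]): number {
  const weightSum = rows.reduce((sum, r) => sum + r.weight, 0);
  if (Math.abs(weightSum - 1) > 1e-9) {
    throw new Error(`Weights sum to ${weightSum}, expected 1`);
  }
  return rows.reduce((sum, r) => sum + r.weight * r.score, 0);
}

// Returns the dimension whose score fell the most between two scorecards,
// or null if nothing fell. Rows are matched by dimension name.
function biggestDrop(before: ScorecardRow[], after: ScorecardRow[]): string | null {
  let worst: string | null = null;
  let worstDelta = 0;
  for (const row of after) {
    const prior = before.find(b => b.dimension === row.dimension);
    if (!prior) continue;
    const delta = row.score - prior.score;
    if (delta < worstDelta) {
      worstDelta = delta;
      worst = row.dimension;
    }
  }
  return worst;
}

const bookingScorecard: ScorecardRow[] = [
  { dimension: "Confirmed valid appointment", weight: 0.3, score: 1.0 },
  { dimension: "Used correct calendar tool", weight: 0.2, score: 0.8 },
  { dimension: "Stayed within pricing policy", weight: 0.25, score: 1.0 },
  { dimension: "Appropriate escalation behavior", weight: 0.15, score: 1.0 },
  { dimension: "Cost within budget", weight: 0.1, score: 0.9 },
];

// Same scorecard after tool correctness falls from 0.8 to 0.5.
const afterPromptChange = bookingScorecard.map(r =>
  r.dimension === "Used correct calendar tool" ? { ...r, score: 0.5 } : r,
);

console.log(weightedTotal(bookingScorecard).toFixed(2));
console.log(weightedTotal(afterPromptChange).toFixed(2));
console.log(biggestDrop(bookingScorecard, afterPromptChange));
```

The total tells you that something moved; `biggestDrop` points at the dimension to investigate first.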

Chanl's scorecard feature runs these evaluations automatically against your live conversations and reference sets, so you don't have to schedule the weekly script manually. The trend view shows week-over-week movement across all dimensions, and alerts when any dimension drops more than a defined threshold. This connects the measurement framework to your analytics view, so you can see whether a scorecard regression correlates with a change in customer satisfaction scores or callback rate. And when a specific dimension regresses, monitoring gives you the conversation-level detail to find which calls drove the drop.
For teams already running LLM-as-judge evals, it's worth reading "LLM as a Judge: Building a Production Eval Pipeline" for the calibration work that makes judge scores reliable. The short version: LLM judges tend to score too generously and to favor longer answers, so calibrate them against human scores on a reference set before trusting them at scale. We also covered specific judge failure modes in "12 Biases That Break Your LLM Judge".
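A first pass at calibration can be as simple as two numbers computed over conversations that both a human and the judge have scored: the mean bias (judge minus human) and the mean absolute error. This sketch assumes both scores are on a 0-1 scale; the type and function names are hypothetical, and a fuller calibration would also look at agreement per dimension.

```typescript
interface ScoredPair {
  conversationId: string;
  humanScore: number; // 0-1, from a human reviewer
  judgeScore: number; // 0-1, from the LLM judge
}

interface CalibrationReport {
  meanBias: number;          // > 0 means the judge scores higher than humans on average
  meanAbsoluteError: number; // average size of the disagreement, ignoring direction
}

function calibrateJudge(pairs: ScoredPair[]): CalibrationReport {
  if (pairs.length === 0) {
    throw new Error("Need at least one human-scored conversation");
  }
  const mean = (xs: number[]) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  const diffs = pairs.map(p => p.judgeScore - p.humanScore);
  return {
    meanBias: mean(diffs),
    meanAbsoluteError: mean(diffs.map(d => Math.abs(d))),
  };
}

// Illustrative scores, not real data.
const calibrationSet: ScoredPair[] = [
  { conversationId: "c1", humanScore: 0.5, judgeScore: 0.75 },
  { conversationId: "c2", humanScore: 1.0, judgeScore: 1.0 },
  { conversationId: "c3", humanScore: 0.5, judgeScore: 0.25 },
  { conversationId: "c4", humanScore: 0.25, judgeScore: 0.75 },
];

console.log(calibrateJudge(calibrationSet));
```

A positive `meanBias` is the generosity problem in numeric form. A small bias with a large `meanAbsoluteError` means the judge is noisy rather than systematically lenient, which calls for a different fix.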
The Weekly Review Ritual
Measurement only helps if you actually look at it. The teams in that 31% share a common practice: a lightweight weekly ritual that takes 20 minutes and produces real decisions.
The format is simple:
- Pull this week's scorecard summary vs. last week's
- Flag any dimension that moved more than 5 points (0.05 on a 0-1 score)
- For each flagged dimension, look at the specific conversations that drove the movement
- Ship a fix, or document the regression as acceptable and record the reason why
That's it. No weekly review meeting. No 40-slide deck. Just: what changed, where, why, and what we're doing about it.
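The flagging step in that list is mechanical enough to script. Here is a minimal sketch, assuming weekly summaries on a 0-1 scale; the `WeeklySummary` type and function name are illustrative. Cost is left out because it is not on the same scale and needs its own threshold.

```typescript
interface WeeklySummary {
  taskAccuracy: number;    // 0-1
  toolCorrectness: number; // 0-1
  policyPassRate: number;  // 0-1
}

type Dimension = keyof WeeklySummary;

interface MovedDimension {
  dimension: Dimension;
  previous: number;
  current: number;
  delta: number; // current - previous; negative means a regression
}

// Returns every dimension that moved by more than `threshold` in either
// direction. The small epsilon keeps floating-point noise from flagging a
// move of exactly the threshold.
function flagMovedDimensions(
  previous: WeeklySummary,
  current: WeeklySummary,
  threshold = 0.05,
): MovedDimension[] {
  const dimensions: Dimension[] = ["taskAccuracy", "toolCorrectness", "policyPassRate"];
  return dimensions
    .map(dimension => ({
      dimension,
      previous: previous[dimension],
      current: current[dimension],
      delta: current[dimension] - previous[dimension],
    }))
    .filter(m => Math.abs(m.delta) > threshold + 1e-9);
}

// Illustrative numbers, not real data.
const lastWeek: WeeklySummary = { taskAccuracy: 0.88, toolCorrectness: 0.8, policyPassRate: 1.0 };
const thisWeek: WeeklySummary = { taskAccuracy: 0.9, toolCorrectness: 0.5, policyPassRate: 1.0 };

for (const moved of flagMovedDimensions(lastWeek, thisWeek)) {
  console.log(`${moved.dimension}: ${moved.previous} -> ${moved.current}`);
}
```

Improvements are flagged as well as regressions, since an unexplained jump upward is also worth a look at the conversations behind it.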
The teams that skip this ritual are the ones who ship a prompt update on Tuesday, notice something feels slightly off by Friday, and can't isolate what changed because they have no baseline to compare against. Good luck finding the issue in your distributed traces.
If you want a benchmark for what a mature measurement practice looks like, "Scorecards vs. Vibes: How to Actually Measure AI Agent Quality" covers the full spectrum from informal vibe-checking to structured eval programs. And "Production Agent Evals: Catch Score Drift, Ship Confidently" goes deep on the drift detection that turns weekly scores into actionable alerts.
From Measurement to Improvement
Here's the payoff that makes the 20-minute weekly ritual worth it: once you have a measurement baseline, every change becomes a real experiment.
Before a prompt update: run your reference set, record the scores. After the update: run the same set, compare. A prompt that improves task accuracy from 0.82 to 0.89 while holding policy compliance at 1.0 is a clear win. A prompt that improves accuracy to 0.91 but drops policy compliance to 0.85 is a decision, not a win -- and now you can make it intentionally.
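The win-versus-decision distinction above can be written down as a rule. This is a sketch under assumptions: the `RunScores` shape, the verdict labels, and the 0.01 tolerance are illustrative choices, not an established standard.

```typescript
interface RunScores {
  taskAccuracy: number;     // 0-1
  policyCompliance: number; // pass rate, 0-1
}

type Verdict = "win" | "tradeoff" | "regression" | "no change";

// Compares two runs of the same reference set. Movements smaller than
// `tolerance` are treated as noise.
function compareRuns(before: RunScores, after: RunScores, tolerance = 0.01): Verdict {
  const deltas = [
    after.taskAccuracy - before.taskAccuracy,
    after.policyCompliance - before.policyCompliance,
  ];
  const improved = deltas.some(d => d > tolerance);
  const worsened = deltas.some(d => d < -tolerance);
  if (improved && worsened) return "tradeoff"; // needs a human decision
  if (improved) return "win";
  if (worsened) return "regression";
  return "no change";
}

// The two scenarios described above.
const baseline: RunScores = { taskAccuracy: 0.82, policyCompliance: 1.0 };
const promptA: RunScores = { taskAccuracy: 0.89, policyCompliance: 1.0 };
const promptB: RunScores = { taskAccuracy: 0.91, policyCompliance: 0.85 };

console.log(compareRuns(baseline, promptA)); // "win"
console.log(compareRuns(baseline, promptB)); // "tradeoff"
```

The function deliberately stops at "tradeoff" rather than picking a side. Whether two points of accuracy are worth fifteen points of policy compliance is a product decision, and the measurement only makes sure someone gets to make it.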
This is what "Build, Connect, Monitor" means at the measurement level. Build a change. Connect it to a reference set that can evaluate it. Monitor the trend line to know whether it worked.
The 31% who have this infrastructure aren't smarter than the 69% who don't. They just spent one afternoon writing 25 reference conversations and building a simple weekly scoring script. The investment pays back on the first prompt change that would have shipped undetected and broken something.
Your agent has observability. Give it measurement, and you'll know whether it's getting better.