Your Agent Has Observability. It Doesn't Have Measurement.

89% of AI teams added observability. 52% added evals. But only 31% can say whether their agent is getting better or worse. Here's the difference between watching your agent and actually measuring it.

Dean Grover, Co-founder
May 5, 2026
11 min read
Figure: A dashboard showing rich telemetry data on one side and a blank trend chart on the other, representing observability without measurement.

Here are three numbers from Datadog's 2026 State of AI Engineering report, published this spring:

  • 89% of teams have implemented observability for their AI agents
  • 52% have adopted some form of evaluation
  • 31% have a measurement framework that tells them whether their agent is improving or degrading

Those three numbers describe the same problem in different ways: you can watch your agent closely and still have no idea what it's doing to your business.

Teams that have observability can answer: "What happened during that conversation?" They can pull a trace, see which tools fired, check latency, count tokens. That's genuinely useful when something breaks.

Teams that have measurement can answer: "Is my agent better today than it was six weeks ago?" They can quantify whether the prompt change they shipped last Tuesday actually helped. They can catch regressions before users do. They can tell the product team with confidence whether the new knowledge base chunk improved recall or hurt it.

The 89% built dashboards. The 31% built something to reason from.

Here's how to get from one to the other.

Observability and Measurement Are Not the Same Thing

Observability tells you what happened. Measurement tells you whether that's good.

Think about what a typical observability setup gives you: distributed traces across LLM calls, tool executions, and retrieval steps. Token counts per turn. Latency histograms. Error rates by tool. Session replay for specific conversations. That's all genuinely valuable -- when something goes wrong, you can find it quickly.

But none of it tells you whether your agent is getting better or worse over time. A trace shows you that Tool A was called with argument X and returned result Y. It doesn't tell you whether that was the right tool to call, whether the argument was correct, or whether the customer's underlying problem got solved.

Measurement fills that gap. It requires:

  1. A definition of success that's specific to your agent's job
  2. A set of reference scenarios you can run repeatedly
  3. A scoring system that grades performance across multiple dimensions
  4. A cadence that produces trend data you can reason from

You can have all four without a single distributed trace, and you can have a perfect tracing setup without any of them. Most teams have the tracing and none of the four.

Why 69% of Teams Are Flying Blind

Three patterns explain why most teams stop at observability and never build measurement. They're not about capability -- they're about how the work gets structured.

The 31% figure surprised researchers, because the teams in the other 69% aren't careless. They've invested in observability, and many of them run evals in some form. Here is what keeps "some evals" from becoming "a measurement framework":

Evals as one-off checks, not trend infrastructure. A team writes 30 test conversations before shipping a feature, runs the agent against them, fixes the failures, ships. That's useful! But if they don't run those same 30 conversations again next month with the same scoring, they have a snapshot, not a trend. One data point isn't measurement.

Metrics that don't reflect customer outcomes. Teams commonly track average response latency, token cost, and tool call success rate. These are real metrics. But a 200ms response that confidently gives the wrong answer is worse than a 400ms response that admits uncertainty and escalates. Speed and cost metrics measure the how, not the whether.

No agreed definition of "good." The hardest part of measurement isn't technical -- it's deciding what success means for your specific agent. For a booking agent, is success "appointment confirmed"? "Customer didn't call back"? "NPS > 4"? Without a written definition, every eval produces a different answer depending on who runs it.

The good news is that starting a measurement framework doesn't require rebuilding your observability stack. You can layer measurement on top of what you already have in an afternoon of work.
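One way to settle the "no agreed definition of good" problem is to write the definition down as data that every eval run reads, so the answer no longer depends on who runs it. The sketch below is only an illustration for a booking agent: the `SuccessDefinition` shape, the field names, and the example values are all hypothetical and would need to be replaced with whatever your team actually agrees on.

```typescript
// Hypothetical shape for a written success definition. Every name and value
// here is illustrative; adapt the fields to your own agent's job.
interface SuccessDefinition {
  agent: string;
  requiredOutcome: string;      // the business result that counts as success
  forbiddenBehaviors: string[]; // policy rules that fail the run outright
  maxCostCents: number;         // budget per conversation
}

interface RunResult {
  outcome: string;
  behaviorsObserved: string[];
  costCents: number;
}

const bookingAgentDefinition: SuccessDefinition = {
  agent: "booking-agent",
  requiredOutcome: "appointment_confirmed",
  forbiddenBehaviors: ["quoted_unauthorized_price"],
  maxCostCents: 12,
};

// Returns the reasons a run failed; an empty list means the run met the definition.
function failuresFor(def: SuccessDefinition, run: RunResult): string[] {
  const failures: string[] = [];
  if (run.outcome !== def.requiredOutcome) {
    failures.push(`outcome was "${run.outcome}", expected "${def.requiredOutcome}"`);
  }
  for (const behavior of run.behaviorsObserved) {
    if (def.forbiddenBehaviors.indexOf(behavior) !== -1) {
      failures.push(`forbidden behavior: ${behavior}`);
    }
  }
  if (run.costCents > def.maxCostCents) {
    failures.push(`cost ${run.costCents}c exceeds budget ${def.maxCostCents}c`);
  }
  return failures;
}
```

Because the definition lives in one object, changing what "good" means becomes a reviewable diff rather than a difference of opinion between whoever ran the eval.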

The Four Dimensions That Actually Matter

A useful measurement framework for a CX agent tracks four things:

Task accuracy. Did the agent do what the user asked? For a booking agent, did it actually confirm a valid appointment? For a returns agent, did it initiate the right return process? This is output correctness at the business level, not the text level.

Tool correctness. Did the agent call the right tools, in the right order, with the right arguments? An agent that books an appointment but calls the wrong calendar API for the user's timezone is wrong even if the conversation sounds right. Observability gives you traces; measurement tells you whether those traces look correct.

Policy compliance. Did the agent stay within allowed behaviors? Never quote prices it isn't authorized to quote. Always escalate unverifiable medical questions. Never collect payment details on an unencrypted channel. Policy compliance is binary and doesn't show up in output quality scores unless you specifically look for it.

Cost per successful outcome. What did each confirmed booking actually cost in tokens, tool calls, and API fees? This isn't about being cheap -- it's about knowing when a prompt change that looks neutral on quality metrics is silently tripling your per-interaction cost.

Figure: Measurement framework: four dimensions, two feedback loops. Diagram nodes: Live Conversations, Scorecard Eval, Task Accuracy, Tool Correctness, Policy Compliance, Cost per Outcome, Weekly Trend Report, Regression Detected?

Most teams track one or two of these. Tracking all four together is what makes the measurement framework useful: you might ship a prompt change that improves task accuracy, hurts policy compliance slightly, and doubles cost. Without all four metrics in the same view, you'd only see the accuracy win.
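Cost per successful outcome is easy to confuse with average cost per conversation, so a small sketch may help. The difference is the denominator: failed conversations still cost money, but they don't count as outcomes. The `ScoredRun` type and the `successThreshold` default below are assumptions for illustration; what counts as "successful" comes from your own definition of success.

```typescript
// Assumed minimal shape: one scored conversation with its total cost.
interface ScoredRun {
  taskAccuracy: number; // 0-1
  costCents: number;
}

// Total spend divided by the number of runs that met the success threshold.
// Returns null when nothing succeeded, because the ratio is undefined then.
function costPerSuccessfulOutcome(
  runs: ScoredRun[],
  successThreshold = 1.0, // assumption: only fully correct runs count as successes
): number | null {
  const totalCost = runs.reduce((sum, r) => sum + r.costCents, 0);
  const successes = runs.filter(r => r.taskAccuracy >= successThreshold).length;
  return successes === 0 ? null : totalCost / successes;
}

// Four conversations at 10 cents each, two of which succeeded.
const exampleRuns: ScoredRun[] = [
  { taskAccuracy: 1, costCents: 10 },
  { taskAccuracy: 1, costCents: 10 },
  { taskAccuracy: 0, costCents: 10 },
  { taskAccuracy: 0, costCents: 10 },
];
```

In this example the average cost per conversation is 10 cents, while the cost per successful outcome is 20 cents. A change that leaves the first number flat can still double the second if it halves the success rate.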

Building Your First Measurement Loop

You don't need a sophisticated eval framework to start. Here's a minimal loop that produces real trend data:

measurement-loop.ts·typescript
interface ConversationScore {
  conversationId: string;
  runDate: Date;
  taskAccuracy: number;       // 0-1
  toolCorrectness: number;    // 0-1
  policyCompliance: boolean;  // Pass/fail
  costCents: number;
  notes?: string;
}
 
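// Reference set: 20-30 conversations with known-good outcomes.
// Run this weekly (or on every significant change).
// ReferenceConversation and runAgentOnConversation are placeholders for your
// own types and agent runner; scoreSession is defined in scoring.ts.

// Arithmetic mean of a list of numbers; returns 0 for an empty list.
function average(values: number[]): number {
  return values.length === 0
    ? 0
    : values.reduce((sum, v) => sum + v, 0) / values.length;
}

async function runWeeklyMeasurement(referenceConversations: ReferenceConversation[]) {
  // Guard against an empty reference set, which would otherwise produce NaN rates.
  if (referenceConversations.length === 0) {
    throw new Error("Reference set is empty: nothing to measure");
  }

  const scores: ConversationScore[] = [];

  // Run and score each reference conversation in turn.
  for (const ref of referenceConversations) {
    const session = await runAgentOnConversation(ref.input);
    const score = await scoreSession(session, ref.expectedOutcome);
    scores.push(score);
  }

  // Aggregate per-conversation scores into one weekly data point.
  return {
    taskAccuracy: average(scores.map(s => s.taskAccuracy)),
    toolCorrectness: average(scores.map(s => s.toolCorrectness)),
    policyPassRate: scores.filter(s => s.policyCompliance).length / scores.length,
    avgCostCents: average(scores.map(s => s.costCents)),
    sampleSize: scores.length,
    runDate: new Date(),
  };
}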

The scoring function is where most of the work lives. For task accuracy, you need ground truth -- a known-good outcome for each reference conversation. For tool correctness, you check whether the sequence of tool calls matches a reference sequence (or use LLM-as-judge for flexible matching). For policy compliance, you run the transcript through a policy checker.

scoring.ts·typescript
async function scoreSession(
  session: AgentSession,
  expected: ExpectedOutcome,
): Promise<ConversationScore> {
  // Task accuracy: did the right thing happen?
  const taskAccuracy = await checkOutcome(session.result, expected.outcome);
 
  // Tool correctness: did the right tools fire in the right order?
  const toolCorrectness = scoreToolSequence(
    session.toolCalls,
    expected.toolSequence,
  );
 
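  // Policy compliance: did any policy violations occur?
  // Assumes checkPolicyViolations resolves to true when the transcript is clean
  // (no violations); invert the result if your checker returns true on a violation.
  const policyCompliance = await checkPolicyViolations(session.transcript);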
 
  // Cost: sum all token and tool API costs
  const costCents = calculateSessionCost(session.usage);
 
  return {
    conversationId: session.id,
    runDate: new Date(),
    taskAccuracy,
    toolCorrectness,
    policyCompliance,
    costCents,
  };
}
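The article doesn't show `scoreToolSequence`, so here is one way it could work, offered as a sketch and not as the reference implementation. It compares tool names in order using a longest-common-subsequence match and divides by the longer of the two sequences, so both missing and extra calls lower the score. The `ToolCall` shape is an assumption, and arguments are deliberately not compared; checking them would need per-tool rules or an LLM judge.

```typescript
// Assumed shape of one recorded tool call; only the name is used for scoring here.
interface ToolCall {
  name: string;
  args?: Record<string, unknown>;
}

// Length of the longest common subsequence of two name lists (order-preserving).
function lcsLength(a: string[], b: string[]): number {
  const dp: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    const row: number[] = [];
    for (let j = 0; j <= b.length; j++) row.push(0);
    dp.push(row);
  }
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = a[i - 1] === b[j - 1]
        ? dp[i - 1][j - 1] + 1
        : Math.max(dp[i - 1][j], dp[i][j - 1]);
    }
  }
  return dp[a.length][b.length];
}

// Score in [0, 1]: 1 means the same tools were called in the same order.
function scoreToolSequence(actual: ToolCall[], expected: string[]): number {
  const actualNames = actual.map(call => call.name);
  const longest = Math.max(actualNames.length, expected.length);
  if (longest === 0) return 1; // no tools expected and none called
  return lcsLength(actualNames, expected) / longest;
}
```

For example, if the reference sequence is `lookup_customer`, `check_availability`, `book_appointment` (hypothetical tool names) and the agent skips the availability check, two of three steps match in order and the score is 2/3. Normalizing by the longer sequence is a design choice: it penalizes an agent that reaches the right outcome through redundant calls, which also shows up as cost.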

Store results in a simple table and chart the weekly trend. Even a spreadsheet works for the first few months. The point is to have a number you can compare against next week's number.

Scorecards as the Bridge Between Observability and Measurement

The pattern that makes measurement sustainable at scale is the scorecard: a structured evaluation that grades multiple dimensions per conversation rather than producing a single aggregate score.

A scorecard for a booking agent might look like this:

| Dimension                       | Weight | Score | Notes                               |
|---------------------------------|--------|-------|-------------------------------------|
| Confirmed valid appointment     | 30%    | 1.0   | Correct date, time, service         |
| Used correct calendar tool      | 20%    | 0.8   | Called right API, wrong timezone    |
| Stayed within pricing policy    | 25%    | 1.0   | No unauthorized discounts           |
| Appropriate escalation behavior | 15%    | 1.0   | Correctly escalated complex request |
| Cost within budget              | 10%    | 0.9   | Slightly over token budget          |
| Weighted total                  | 100%   | 0.95  |                                     |

Scorecards make regression detection specific. If your overall score drops from 0.95 to 0.89 after a prompt change, the single number doesn't tell you where to look. The scorecard tells you that tool correctness dropped from 0.8 to 0.5 -- probably because the prompt change affected how the agent selects between calendar APIs.
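The weighted total is just the sum of weight times score across the rows, and the same arithmetic shows which dimension drove a change. The sketch below uses the weights and scores from the table above; the `ScorecardRow` type and function names are hypothetical and not part of any Chanl API.

```typescript
interface ScorecardRow {
  dimension: string;
  weight: number; // fraction of the total; all weights should sum to 1
  score: number;  // 0-1
}

// Sum of weight * score, after checking that the weights are well formed.
function weightedTotal(rows: ScorecardRow[]): number {
  const weightSum = rows.reduce((sum, r) => sum + r.weight, 0);
  if (Math.abs(weightSum - 1) > 1e-9) {
    throw new Error(`weights sum to ${weightSum}, expected 1`);
  }
  return rows.reduce((sum, r) => sum + r.weight * r.score, 0);
}

// How much each dimension moved the total: weight * (after - before).
// Rows are matched by dimension name (a missing row counts as no change),
// and the result is sorted so the largest drop comes first.
function contributionsToChange(
  before: ScorecardRow[],
  after: ScorecardRow[],
): { dimension: string; delta: number }[] {
  return before
    .map(b => {
      const a = after.find(r => r.dimension === b.dimension);
      return { dimension: b.dimension, delta: a ? b.weight * (a.score - b.score) : 0 };
    })
    .sort((x, y) => x.delta - y.delta);
}

// Rows from the booking-agent scorecard in the table above.
const bookingScorecard: ScorecardRow[] = [
  { dimension: "Confirmed valid appointment", weight: 0.30, score: 1.0 },
  { dimension: "Used correct calendar tool", weight: 0.20, score: 0.8 },
  { dimension: "Stayed within pricing policy", weight: 0.25, score: 1.0 },
  { dimension: "Appropriate escalation behavior", weight: 0.15, score: 1.0 },
  { dimension: "Cost within budget", weight: 0.10, score: 0.9 },
];
```

Calling `contributionsToChange` with last week's rows and this week's rows puts the dimension responsible for the biggest drop at the top of the list, which is the question the single aggregate number can't answer.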

Figure: Quality analyst reviewing scores. Example scorecard with per-dimension scores: Tone & Empathy 94%, Resolution 88%, Response Time 72%, Compliance 85%.

Chanl's scorecard feature runs these evaluations automatically against your live conversations and reference sets, so you don't have to schedule the weekly script manually. The trend view shows week-over-week movement across all dimensions, and alerts when any dimension drops more than a defined threshold. This connects the measurement framework to your analytics view, so you can see whether a scorecard regression correlates with a change in customer satisfaction scores or callback rate. And when a specific dimension regresses, monitoring gives you the conversation-level detail to find which calls drove the drop.

For teams already running LLM-as-judge evals, it's worth reading LLM as a Judge: Building a Production Eval Pipeline for the calibration work that makes judge scores reliable. The short version: LLM judges drift toward positivity bias and verbosity preference -- calibrate them against human scores on a reference set before trusting them at scale. We also covered specific judge failure modes in 12 Biases That Break Your LLM Judge.
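As a rough illustration of what calibrating against human scores can mean in practice, the sketch below compares judge scores with human scores on the same conversations and reports three numbers: the mean bias (positive when the judge is more lenient than the humans), the mean absolute error, and the share of conversations where the two agree within a tolerance. This is a minimal sketch, not the pipeline from the linked article; the type names and the 0.1 default tolerance are assumptions.

```typescript
interface ScorePair {
  human: number; // 0-1, from a human reviewer
  judge: number; // 0-1, from the LLM judge on the same conversation
}

interface CalibrationReport {
  meanBias: number;          // judge minus human; positive means the judge is lenient
  meanAbsoluteError: number; // average size of the disagreement
  agreementRate: number;     // share of pairs within the tolerance
}

function calibrate(pairs: ScorePair[], tolerance = 0.1): CalibrationReport {
  if (pairs.length === 0) {
    throw new Error("need at least one scored pair");
  }
  const diffs = pairs.map(p => p.judge - p.human);
  const n = pairs.length;
  return {
    meanBias: diffs.reduce((sum, d) => sum + d, 0) / n,
    meanAbsoluteError: diffs.reduce((sum, d) => sum + Math.abs(d), 0) / n,
    agreementRate: diffs.filter(d => Math.abs(d) <= tolerance).length / n,
  };
}
```

A consistently positive `meanBias` on the reference set is the positivity bias described above showing up in numbers; it is a signal to adjust the judge prompt or rubric before trusting its scores on live traffic.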

The Weekly Review Ritual

Measurement only helps if you actually look at it. The teams in that 31% share a common practice: a lightweight weekly ritual that takes 20 minutes and produces real decisions.

The format is simple:

  1. Pull this week's scorecard summary vs. last week's
  2. Flag any dimension that moved more than 5 points (0.05 on a 0-1 scale)
  3. For each flagged dimension, look at the specific conversations that drove the movement
  4. Ship a fix, or document the regression as acceptable along with the reason why
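Steps 1 and 2 of the review can be automated with a few lines. The sketch below compares two weekly summaries and returns the dimensions that moved by more than a threshold. It assumes scores on a 0-1 scale, so "5 points" is read as 0.05; that reading, the `WeeklySummary` shape, and the function name are assumptions. Cost is left out because it lives on a different scale and needs its own threshold.

```typescript
interface WeeklySummary {
  taskAccuracy: number;    // 0-1
  toolCorrectness: number; // 0-1
  policyPassRate: number;  // 0-1
}

type Dimension = keyof WeeklySummary;

interface Movement {
  dimension: Dimension;
  previous: number;
  current: number;
  delta: number; // current minus previous; negative means a drop
}

// Returns every dimension whose week-over-week change exceeds the threshold.
function flagMovements(
  previous: WeeklySummary,
  current: WeeklySummary,
  threshold = 0.05, // assumption: "5 points" on a 0-1 scale
): Movement[] {
  const dimensions: Dimension[] = ["taskAccuracy", "toolCorrectness", "policyPassRate"];
  const epsilon = 1e-9; // keeps floating-point noise from flagging an exact 0.05 move
  return dimensions
    .map(dimension => ({
      dimension,
      previous: previous[dimension],
      current: current[dimension],
      delta: current[dimension] - previous[dimension],
    }))
    .filter(m => Math.abs(m.delta) > threshold + epsilon);
}
```

An empty result means nothing needs attention this week. Anything returned is the starting list for step 3: pull the conversations behind that dimension and look at what changed.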

That's it. No weekly review meeting. No 40-slide deck. Just: what changed, where, why, and what we're doing about it.

The teams that skip this ritual are the ones who ship a prompt update on Tuesday, notice something feels slightly off by Friday, and can't isolate what changed because they have no baseline to compare against. Good luck finding the issue in your distributed traces.

If you want a benchmark for what a mature measurement practice looks like, Scorecards vs. Vibes: How to Actually Measure AI Agent Quality covers the full spectrum from informal vibe-checking to structured eval programs. And Production Agent Evals: Catch Score Drift, Ship Confidently goes deep on the drift detection that turns weekly scores into actionable alerts.

From Measurement to Improvement

Here's the payoff that makes the 20-minute weekly ritual worth it: once you have a measurement baseline, every change becomes a real experiment.

Before a prompt update: run your reference set, record the scores. After the update: run the same set, compare. A prompt that improves task accuracy from 0.82 to 0.89 while holding policy compliance at 1.0 is a clear win. A prompt that improves accuracy to 0.91 but drops policy compliance to 0.85 is a decision, not a win -- and now you can make it intentionally.
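The before/after comparison can be made explicit in code. The sketch below labels a change as a win, a tradeoff, a regression, or no change, based on two of the dimensions discussed here. The `Verdict` labels, the `RunScores` shape, and the `minDelta` noise floor of 0.01 are assumptions for illustration.

```typescript
type Verdict = "win" | "tradeoff" | "regression" | "no change";

interface RunScores {
  taskAccuracy: number;   // 0-1
  policyPassRate: number; // 0-1
}

// Compares two runs of the same reference set.
// minDelta is an assumed noise floor: smaller movements are ignored.
function compareRuns(before: RunScores, after: RunScores, minDelta = 0.01): Verdict {
  const deltas = [
    after.taskAccuracy - before.taskAccuracy,
    after.policyPassRate - before.policyPassRate,
  ];
  const improved = deltas.some(d => d >= minDelta);
  const worsened = deltas.some(d => d <= -minDelta);
  if (improved && worsened) return "tradeoff"; // a decision, not a win
  if (improved) return "win";
  if (worsened) return "regression";
  return "no change";
}
```

With the numbers from the paragraph above, going from 0.82 accuracy and 1.0 compliance to 0.89 and 1.0 returns `"win"`, while going to 0.91 and 0.85 returns `"tradeoff"`. The function doesn't make the tradeoff decision for you; it only ensures that someone has to make it deliberately.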

This is what "Build, Connect, Monitor" means at the measurement level. Build a change. Connect it to a reference set that can evaluate it. Monitor the trend line to know whether it worked.

The 31% who have this infrastructure aren't smarter than the 69% who don't. They just spent one afternoon writing 25 reference conversations and building a simple weekly scoring script. The investment pays for itself the first time it catches a prompt change that would otherwise have shipped unnoticed and broken something.

Your agent has observability. Give it measurement, and you'll know whether it's getting better.

About the author: Dean Grover, Co-founder. Building the platform for AI agents at Chanl: tools, testing, and observability for customer experience.


Frequently Asked Questions