What's the difference between observability and measurement for AI agents?

Observability tells you what your agent did: traces, logs, token counts, latency. Measurement tells you whether it's getting better or worse over time: trend lines, regression detection, before-and-after comparisons after a prompt change. You need both, but most teams stop at observability and miss the measurement layer entirely.

Why do so many teams have observability but not measurement?

Observability tooling is easy to install -- most frameworks emit traces out of the box. Measurement requires defining what success looks like for your specific agent, writing eval harnesses to test it repeatedly, and building trend infrastructure to track it over time. That's harder work with less immediate payoff, so teams defer it until they're already on fire.

What should a measurement framework for AI agents track?

Four dimensions matter: task accuracy (did the agent accomplish what the user asked?), tool correctness (did it call the right tools with the right arguments?), policy compliance (did it stay within allowed behaviors?), and cost efficiency (what did each successful outcome cost?). Final output quality alone misses the other three entirely.

How do I start building an agent measurement framework?

Start with a single scorecard for your highest-volume conversation type. Define 5-8 dimensions that matter for that interaction, write 20-30 reference conversations with known-good outcomes, and run your agent against them weekly. A simple pass/fail rate trend across those conversations tells you more than a year of traces.

What's a scorecard for AI agents?

A scorecard is a structured evaluation that grades an agent's performance across multiple dimensions -- accuracy, tone, tool usage, escalation behavior -- rather than producing a single aggregate score. Scorecards make it possible to see that your agent got worse at policy compliance after a prompt change even though overall accuracy held steady.

How often should I run agent evaluations?

Weekly at minimum for stable agents; after every significant change (prompt update, tool addition, model upgrade) for agents in active development. The goal is to build a trend you can reason from, not to catch individual failures. A single eval run gives you a snapshot; weekly runs give you direction.

Can I use an LLM to grade my agent's outputs automatically?

Yes, and it works well at scale, but LLM judges introduce their own biases -- positivity bias, verbosity preference, sensitivity to framing. Use an LLM judge for coverage, but calibrate it against human scores on a reference set. The combination gives you scale with reliability.

Your Agent Has Observability. It Doesn't Have Measurement.

Here are three numbers from Datadog's 2026 State of AI Engineering report, published this spring:

89% of teams have implemented observability for their AI agents
52% have adopted some form of evaluation
31% have a measurement framework that tells them whether their agent is improving or degrading

Those three numbers describe the same problem in different ways: you can watch your agent closely and still have no idea what it's doing to your business.

Teams that have observability can answer: "What happened during that conversation?" They can pull a trace, see which tools fired, check latency, count tokens. That's genuinely useful when something breaks.

Teams that have measurement can answer: "Is my agent better today than it was six weeks ago?" They can quantify whether the prompt change they shipped last Tuesday actually helped. They can catch regressions before users do. They can tell the product team with confidence whether the new knowledge base chunk improved recall or hurt it.

The 89% built dashboards. The 31% built something to reason from.

Here's how to get from one to the other.

Observability and Measurement Are Not the Same Thing

Observability tells you what happened. Measurement tells you whether that's good.

Think about what a typical observability setup gives you: distributed traces across LLM calls, tool executions, and retrieval steps. Token counts per turn. Latency histograms. Error rates by tool. Session replay for specific conversations. That's all genuinely valuable -- when something goes wrong, you can find it quickly.

But none of it tells you whether your agent is getting better or worse over time. A trace shows you that Tool A was called with argument X and returned result Y. It doesn't tell you whether that was the right tool to call, whether the argument was correct, or whether the customer's underlying problem got solved.

Measurement fills that gap. It requires:

1. A definition of success that's specific to your agent's job
2. A set of reference scenarios you can run repeatedly
3. A scoring system that grades performance across multiple dimensions
4. A cadence that produces trend data you can reason from

You can have all of items 1-4 without a single distributed trace. You can have a perfect tracing setup without any of items 1-4. Most teams have the second and none of the first.

Why 69% of Teams Are Flying Blind

Three patterns explain why most teams stop at observability and never build measurement. They're not about capability -- they're about how the work gets structured.

The 31% figure surprised researchers, because the teams in the other 69% aren't careless. They've invested in observability, and many of them run evals in some form. But "some evals" doesn't become "a measurement framework" for the reasons below:

Evals as one-off checks, not trend infrastructure. A team writes 30 test conversations before shipping a feature, runs the agent against them, fixes the failures, ships. That's useful! But if they don't run those same 30 conversations again next month with the same scoring, they have a snapshot, not a trend. One data point isn't measurement.

Metrics that don't reflect customer outcomes. Teams commonly track average response latency, token cost, and tool call success rate. These are real metrics. But a 200ms response that confidently gives the wrong answer is worse than a 400ms response that admits uncertainty and escalates. Speed and cost metrics measure the how, not the whether.

No agreed definition of "good." The hardest part of measurement isn't technical -- it's deciding what success means for your specific agent. For a booking agent, is success "appointment confirmed"? "Customer didn't call back"? "CSAT above 4 out of 5"? Without a written definition, every eval produces a different answer depending on who runs it.

The good news is that starting a measurement framework doesn't require rebuilding your observability stack. You can layer measurement on top of what you already have in an afternoon of work.

The Four Dimensions That Actually Matter

A useful measurement framework for a CX agent tracks four things:

Task accuracy. Did the agent do what the user asked? For a booking agent, did it actually confirm a valid appointment? For a returns agent, did it initiate the right return process? This is output correctness at the business level, not the text level.

Tool correctness. Did the agent call the right tools, in the right order, with the right arguments? An agent that books an appointment but calls the wrong calendar API for the user's timezone is wrong even if the conversation sounds right. Observability gives you traces; measurement tells you whether those traces look correct.

Policy compliance. Did the agent stay within allowed behaviors? Never quote prices it isn't authorized to quote. Always escalate unverifiable medical questions. Never collect payment details on an unencrypted channel. Policy compliance is binary and doesn't show up in output quality scores unless you specifically look for it.

Cost per successful outcome. What did each confirmed booking actually cost in tokens, tool calls, and API fees? This isn't about being cheap -- it's about knowing when a prompt change that looks neutral on quality metrics is silently tripling your per-interaction cost.

Figure: Measurement framework with four dimensions and two feedback loops

Most teams track one or two of these. Tracking all four together is what makes the measurement framework useful: you might ship a prompt change that improves task accuracy, hurts policy compliance slightly, and doubles cost. Without all four metrics in the same view, you'd only see the accuracy win.
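As a rough illustration of the fourth dimension, cost per successful outcome can be computed as total spend divided by the number of sessions that succeeded, so failed sessions still count toward cost. The sketch below assumes a hypothetical `SessionResult` shape; the type and field names are illustrative and not taken from any library.

```typescript
// Hypothetical per-session record; field names are assumptions for illustration.
interface SessionResult {
  succeeded: boolean; // e.g. a booking was actually confirmed
  costCents: number;  // tokens + tool calls + API fees for this session
}

// Total spend divided by successful outcomes. Failed sessions add cost
// but no successes, so they push the number up.
// Returns null when nothing succeeded, since the ratio is undefined.
function costPerSuccessfulOutcome(sessions: SessionResult[]): number | null {
  const totalCost = sessions.reduce((sum, s) => sum + s.costCents, 0);
  const successes = sessions.filter((s) => s.succeeded).length;
  return successes === 0 ? null : totalCost / successes;
}

const exampleSessions: SessionResult[] = [
  { succeeded: true, costCents: 4 },
  { succeeded: true, costCents: 6 },
  { succeeded: false, costCents: 10 },
];

console.log(costPerSuccessfulOutcome(exampleSessions)); // 10 (20 cents over 2 successes)
```

In this example the average cost per session is about 6.7 cents, but each success cost 10 cents. A change that raises the failure rate shows up in the second number even when the first barely moves.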

Building Your First Measurement Loop

You don't need a sophisticated eval framework to start. Here's a minimal loop that produces real trend data:

measurement-loop.ts·typescript

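interface ConversationScore {
  conversationId: string;
  runDate: Date;
  taskAccuracy: number;       // 0-1
  toolCorrectness: number;    // 0-1
  policyCompliance: boolean;  // Pass/fail
  costCents: number;
  notes?: string;
}

// Assumed shape of a reference conversation: an input plus its known-good outcome.
// ExpectedOutcome is the type consumed by scoreSession in scoring.ts.
interface ReferenceConversation {
  input: string;
  expectedOutcome: ExpectedOutcome;
}

// Mean of a list of numbers; returns 0 for an empty list so an empty run
// doesn't produce NaN.
function average(values: number[]): number {
  return values.length === 0
    ? 0
    : values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Reference set: 20-30 conversations with known-good outcomes
// Run this weekly (or on every significant change)
// runAgentOnConversation stands in for your own agent runner (not shown)
async function runWeeklyMeasurement(referenceConversations: ReferenceConversation[]) {
  const scores: ConversationScore[] = [];

  // Run sequentially so each conversation is scored in isolation
  for (const ref of referenceConversations) {
    const session = await runAgentOnConversation(ref.input);
    const score = await scoreSession(session, ref.expectedOutcome);
    scores.push(score);
  }

  return {
    taskAccuracy: average(scores.map(s => s.taskAccuracy)),
    toolCorrectness: average(scores.map(s => s.toolCorrectness)),
    // Share of conversations with no policy violations
    policyPassRate: average(scores.map(s => (s.policyCompliance ? 1 : 0))),
    avgCostCents: average(scores.map(s => s.costCents)),
    sampleSize: scores.length,
    runDate: new Date(),
  };
}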

The scoring function is where most of the work lives. For task accuracy, you need ground truth -- a known-good outcome for each reference conversation. For tool correctness, you check whether the sequence of tool calls matches a reference sequence (or use LLM-as-judge for flexible matching). For policy compliance, you run the transcript through a policy checker.

scoring.ts·typescript

// checkOutcome, scoreToolSequence, checkPolicyViolations, and
// calculateSessionCost are placeholders for your own implementations.
async function scoreSession(
  session: AgentSession,
  expected: ExpectedOutcome,
): Promise<ConversationScore> {
  // Task accuracy: did the right thing happen? (0-1)
  const taskAccuracy = await checkOutcome(session.result, expected.outcome);

  // Tool correctness: did the right tools fire in the right order? (0-1)
  const toolCorrectness = scoreToolSequence(
    session.toolCalls,
    expected.toolSequence,
  );

  // Policy compliance: pass only if no violations were found.
  // Assumes checkPolicyViolations resolves to a list of violations.
  const violations = await checkPolicyViolations(session.transcript);
  const policyCompliance = violations.length === 0;

  // Cost: sum all token and tool API costs
  const costCents = calculateSessionCost(session.usage);

  return {
    conversationId: session.id,
    runDate: new Date(),
    taskAccuracy,
    toolCorrectness,
    policyCompliance,
    costCents,
  };
}
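The snippet above calls `scoreToolSequence` without defining it. One possible way to fill it in, offered as a sketch rather than the article's own implementation, is to take the longest common subsequence (LCS) of tool names and divide by the longer of the two sequences. This rewards calling the right tools in the right order and penalizes both missing and extra calls. The `ToolCall` shape is an assumption, and arguments are deliberately ignored here.

```typescript
// Assumed shape: only the tool name is compared; arguments are ignored.
interface ToolCall {
  name: string;
}

// Order-sensitive overlap between actual and expected tool calls,
// computed as LCS length / max(sequence length). Result is in 0-1.
function scoreToolSequence(actual: ToolCall[], expected: ToolCall[]): number {
  // Nothing expected and nothing called counts as a perfect match.
  if (actual.length === 0 && expected.length === 0) return 1;

  const rows = actual.length + 1;
  const cols = expected.length + 1;
  // lcs[i][j] = LCS length of the first i actual and first j expected calls
  const lcs: number[][] = Array.from({ length: rows }, () =>
    new Array<number>(cols).fill(0),
  );

  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      lcs[i][j] =
        actual[i - 1].name === expected[j - 1].name
          ? lcs[i - 1][j - 1] + 1
          : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
    }
  }

  const longest = Math.max(actual.length, expected.length);
  return lcs[actual.length][expected.length] / longest;
}
```

Checking arguments as well (for example, the timezone passed to a calendar tool) would need a stricter comparison than name equality, or the LLM-as-judge approach mentioned earlier.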

Store results in a simple table and chart the weekly trend. Even a spreadsheet works for the first few months. The point is to have a number you can compare against next week's number.

Scorecards as the Bridge Between Observability and Measurement

The pattern that makes measurement sustainable at scale is the scorecard: a structured evaluation that grades multiple dimensions per conversation rather than producing a single aggregate score.

A scorecard for a booking agent might look like this:

Dimension	Weight	Score	Notes
Confirmed valid appointment	30%	1.0	Correct date, time, service
Used correct calendar tool	20%	0.8	Called right API, wrong timezone
Stayed within pricing policy	25%	1.0	No unauthorized discounts
Appropriate escalation behavior	15%	1.0	Correctly escalated complex request
Cost within budget	10%	0.9	Slightly over token budget
Weighted total	100%	0.95

Scorecards make regression detection specific. If your overall score drops from 0.95 to 0.89 after a prompt change, a single number doesn't tell you where to look. A scorecard tells you that tool correctness dropped from 0.8 to 0.5 -- probably because the prompt change affected how the agent selects between calendar APIs.
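The weighted total in a scorecard like this is the sum of weight times score across dimensions. A minimal sketch of that arithmetic, using the rows from the table above and a hypothetical `ScorecardRow` type, looks like this. Recomputing the total from the rows is also a cheap consistency check on any hand-written scorecard.

```typescript
// Hypothetical row type; weights are fractions that should sum to 1.
interface ScorecardRow {
  dimension: string;
  weight: number; // e.g. 0.30 for 30%
  score: number;  // 0-1
}

// Sum of weight * score. Throws if the weights don't add up to 100%,
// which usually means a dimension was added without rebalancing.
function weightedTotal(rows: ScorecardRow[]): number {
  const weightSum = rows.reduce((sum, r) => sum + r.weight, 0);
  if (Math.abs(weightSum - 1) > 1e-9) {
    throw new Error(`Weights sum to ${weightSum}, expected 1`);
  }
  return rows.reduce((sum, r) => sum + r.weight * r.score, 0);
}

const bookingScorecard: ScorecardRow[] = [
  { dimension: "Confirmed valid appointment", weight: 0.30, score: 1.0 },
  { dimension: "Used correct calendar tool", weight: 0.20, score: 0.8 },
  { dimension: "Stayed within pricing policy", weight: 0.25, score: 1.0 },
  { dimension: "Appropriate escalation behavior", weight: 0.15, score: 1.0 },
  { dimension: "Cost within budget", weight: 0.10, score: 0.9 },
];

console.log(weightedTotal(bookingScorecard).toFixed(2));
```

Keeping the per-dimension rows alongside the total is what lets you trace a drop in the aggregate back to the dimension that caused it.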

Figure: Sample scorecard showing per-dimension scores (Tone & Empathy 94%, Resolution 88%, Response Time 72%, Compliance 85%)

Chanl's scorecard feature runs these evaluations automatically against your live conversations and reference sets, so you don't have to schedule the weekly script manually. The trend view shows week-over-week movement across all dimensions, and alerts when any dimension drops more than a defined threshold. This connects the measurement framework to your analytics view, so you can see whether a scorecard regression correlates with a change in customer satisfaction scores or callback rate. And when a specific dimension regresses, monitoring gives you the conversation-level detail to find which calls drove the drop.

For teams already running LLM-as-judge evals, it's worth reading LLM as a Judge: Building a Production Eval Pipeline for the calibration work that makes judge scores reliable. The short version: LLM judges drift toward positivity bias and verbosity preference -- calibrate them against human scores on a reference set before trusting them at scale. We also covered specific judge failure modes in 12 Biases That Break Your LLM Judge.
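One simple way to start that calibration, sketched here under assumed names, is to score the same reference conversations with both a human reviewer and the judge, then look at the mean signed difference (bias) and the mean absolute error. A positive bias would be consistent with the positivity drift described above. The `ScoredItem` and `CalibrationReport` types are hypothetical.

```typescript
// Hypothetical record pairing a human score with the judge's score.
interface ScoredItem {
  conversationId: string;
  humanScore: number; // 0-1, from a human reviewer
  judgeScore: number; // 0-1, from the LLM judge
}

interface CalibrationReport {
  meanBias: number;          // > 0 means the judge scores higher than humans
  meanAbsoluteError: number; // average size of the disagreement
  sampleSize: number;
}

// Compares judge scores to human scores on the same items.
function calibrateJudge(items: ScoredItem[]): CalibrationReport {
  if (items.length === 0) {
    throw new Error("Need at least one scored item");
  }
  const diffs = items.map((item) => item.judgeScore - item.humanScore);
  const meanBias = diffs.reduce((sum, d) => sum + d, 0) / diffs.length;
  const meanAbsoluteError =
    diffs.reduce((sum, d) => sum + Math.abs(d), 0) / diffs.length;
  return { meanBias, meanAbsoluteError, sampleSize: items.length };
}
```

A bias near zero with a large absolute error means the judge is noisy rather than lenient, which calls for a different fix (clearer rubric, more examples) than a consistent upward skew.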

The Weekly Review Ritual

Measurement only helps if you actually look at it. The teams in that 31% share a common practice: a lightweight weekly ritual that takes 20 minutes and produces real decisions.

The format is simple:

Pull this week's scorecard summary vs. last week's
Flag any dimension that moved more than 5 points (0.05 on a 0-1 scale)
For each flagged dimension, look at the specific conversations that drove the movement
Ship a fix or document the regression as acceptable + reason why
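The first two steps of that format can be sketched as a small comparison function. This is an illustration under stated assumptions: scores are expressed on a 0-100 scale so a threshold of 5 means five points, and the `DimensionScores` and `Movement` types are hypothetical.

```typescript
// Dimension name -> score on a 0-100 scale (assumed for this sketch).
type DimensionScores = Record<string, number>;

interface Movement {
  dimension: string;
  previous: number;
  current: number;
  delta: number; // current - previous; negative means a regression
}

// Returns every dimension whose score moved by more than thresholdPoints
// in either direction between two weekly summaries.
function flagMovements(
  previous: DimensionScores,
  current: DimensionScores,
  thresholdPoints = 5,
): Movement[] {
  const flagged: Movement[] = [];
  for (const [dimension, currentScore] of Object.entries(current)) {
    const previousScore = previous[dimension];
    // A dimension added this week has no baseline to compare against.
    if (previousScore === undefined) continue;
    const delta = currentScore - previousScore;
    if (Math.abs(delta) > thresholdPoints) {
      flagged.push({ dimension, previous: previousScore, current: currentScore, delta });
    }
  }
  return flagged;
}

const lastWeek: DimensionScores = { taskAccuracy: 90, toolCorrectness: 80, policyPassRate: 100 };
const thisWeek: DimensionScores = { taskAccuracy: 88, toolCorrectness: 50, policyPassRate: 100 };

console.log(flagMovements(lastWeek, thisWeek));
```

Here only `toolCorrectness` is flagged (a 30-point drop); the 2-point dip in `taskAccuracy` stays under the threshold. Improvements are flagged too, since an unexplained jump is also worth a look.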

That's it. No weekly review meeting. No 40-slide deck. Just: what changed, where, why, and what we're doing about it.

The teams that skip this ritual are the ones who ship a prompt update on Tuesday, notice something feels slightly off by Friday, and can't isolate what changed because they have no baseline to compare against. Good luck finding the issue in your distributed traces.

If you want a benchmark for what a mature measurement practice looks like, Scorecards vs. Vibes: How to Actually Measure AI Agent Quality covers the full spectrum from informal vibe-checking to structured eval programs. And Production Agent Evals: Catch Score Drift, Ship Confidently goes deep on the drift detection that turns weekly scores into actionable alerts.

From Measurement to Improvement

Here's the payoff that makes the 20-minute weekly ritual worth it: once you have a measurement baseline, every change becomes a real experiment.

Before a prompt update: run your reference set, record the scores. After the update: run the same set, compare. A prompt that improves task accuracy from 0.82 to 0.89 while holding policy compliance at 1.0 is a clear win. A prompt that improves accuracy to 0.91 but drops policy compliance to 0.85 is a decision, not a win -- and now you can make it intentionally.
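The before-and-after comparison can be sketched as a function that separates a clear win from a trade-off that needs a decision. The `RunSummary` type, the `Verdict` labels, and the tolerance value are assumptions made for this example, and only two of the four dimensions are included to keep it short.

```typescript
// Hypothetical summary of one reference-set run.
interface RunSummary {
  taskAccuracy: number;   // 0-1
  policyPassRate: number; // 0-1
}

type Verdict = "clear win" | "trade-off" | "regression" | "no change";

// Compares two runs of the same reference set. Changes smaller than
// `tolerance` are treated as noise.
function compareRuns(
  before: RunSummary,
  after: RunSummary,
  tolerance = 0.005,
): Verdict {
  const deltas = [
    after.taskAccuracy - before.taskAccuracy,
    after.policyPassRate - before.policyPassRate,
  ];
  const improved = deltas.some((d) => d > tolerance);
  const worsened = deltas.some((d) => d < -tolerance);

  if (improved && worsened) return "trade-off"; // needs a deliberate decision
  if (improved) return "clear win";
  if (worsened) return "regression";
  return "no change";
}

const baseline: RunSummary = { taskAccuracy: 0.82, policyPassRate: 1.0 };

console.log(compareRuns(baseline, { taskAccuracy: 0.89, policyPassRate: 1.0 }));  // "clear win"
console.log(compareRuns(baseline, { taskAccuracy: 0.91, policyPassRate: 0.85 })); // "trade-off"
```

With 20-30 reference conversations, a single flipped result moves a pass rate by several points, so the tolerance mostly filters out floating-point noise; run-to-run variance is better handled by repeating the run.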

This is what "Build, Connect, Monitor" means at the measurement level. Build a change. Connect it to a reference set that can evaluate it. Monitor the trend line to know whether it worked.

The 31% who have this infrastructure aren't smarter than the 69% who don't. They just spent one afternoon writing 25 reference conversations and building a simple weekly scoring script. The investment pays back on the first prompt change that would have shipped undetected and broken something.

Your agent has observability. Give it measurement, and you'll know whether it's getting better.



evaluation observability measurement production-ai testing-strategy ai-agents

Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
