There's a pattern that keeps showing up in research about production AI agents. Not the kind of finding teams brag about. The kind that explains why quality issues still blindside engineering teams who thought they had monitoring covered.
A systematic review of 60+ agent evaluation studies found that 83% of teams report capability metrics, but only 30% evaluate human-centred or outcome-level dimensions. Separately, a survey of 306 practitioners found that 74% still rely on human evaluation as their primary assessment method.
Those numbers come from two independent research efforts published in 2025, and they capture something important about where the industry is right now. Observability adoption far outpaces outcome evaluation. Teams have invested heavily in knowing when things go wrong. They haven't invested nearly as much in preventing things from going wrong in the first place.
This isn't a tooling problem. It's a category confusion problem. Monitoring and evaluation look similar from the outside. Both produce dashboards. Both generate scores. Both claim to measure quality. But they answer fundamentally different questions, and conflating them creates a dangerous blind spot.
Why does monitoring create a false sense of security?
Green dashboards don't mean good outcomes. Monitoring measures whether the system ran without errors, not whether the agent actually helped the customer. When every trace completes and every metric stays flat, teams naturally assume the agent is performing well. But "completed without errors" and "accomplished the goal" are entirely different claims.
Consider what a typical observability stack actually measures: latency per request, token usage, error rates, trace completeness, tool call success/failure. These are infrastructure metrics. They tell you whether the system is running. They don't tell you whether the system is working.
A support agent that confidently gives wrong answers completes every trace with a 200 status code. A booking agent that books the wrong date finishes every tool call successfully. An onboarding agent that confuses a new user with jargon produces perfect latency numbers. None of these failures show up in monitoring dashboards until a customer complains.
Here's what that looks like in practice. Most teams have something close to this setup:
```typescript
// Typical observability setup: traces everything, evaluates nothing
const trace = tracer.startSpan('agent.conversation');

// ✅ Captures latency
trace.setAttribute('duration_ms', responseTime);

// ✅ Captures token usage
trace.setAttribute('tokens.input', inputTokens);
trace.setAttribute('tokens.output', outputTokens);

// ✅ Captures tool call results
trace.setAttribute('tool.name', 'lookup_order');
trace.setAttribute('tool.status', 'success');

// ❌ Never asks: did the agent solve the problem?
// ❌ Never asks: did the customer get the right information?
// ❌ Never asks: would a human reviewer approve this interaction?
trace.end();
```

Everything in this setup answers "what happened?" Nothing answers "was the outcome good?"
That gap is where quality regressions live. You ship a prompt update on Tuesday. Traces look clean. Latency is flat. Error rate holds steady. On Thursday, a customer escalation reveals that the new prompt causes the agent to skip the verification step in refund conversations. You've been shipping bad interactions for 48 hours. Your monitoring didn't catch it because the failure isn't a system error. It's a behavior change.
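The verification regression is exactly the kind of failure a behavioral assertion can catch before deploy. Here's a minimal sketch of one, with hypothetical names and a deliberately simple pattern match; it's an illustration of the idea, not a particular testing library's API:

```typescript
// Illustrative sketch (hypothetical names): a behavioral check that
// monitoring can't express. The trace is green either way; this fails.
interface Turn {
  role: 'agent' | 'customer';
  text: string;
}

// Pass only if the agent verified identity before committing to a refund.
function verifiesBeforeRefund(conversation: Turn[]): boolean {
  let verified = false;
  for (const turn of conversation) {
    if (turn.role !== 'agent') continue;
    if (/verify|confirm (the|your) (identity|email|account)/i.test(turn.text)) {
      verified = true;
    }
    if (/refund (is|has been) (issued|processed)/i.test(turn.text) && !verified) {
      return false; // refunded without verification: the Tuesday regression
    }
  }
  return verified;
}

// The regressed prompt skips straight to the refund.
const regressed: Turn[] = [
  { role: 'customer', text: 'I want a refund for my duplicate charge.' },
  { role: 'agent', text: 'Done! Your refund has been issued.' },
];
console.log(verifiesBeforeRefund(regressed)); // false
```

A check like this fails in CI on Tuesday instead of surfacing in the escalation queue on Thursday.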
What's the difference between step-level and outcome-level evaluation?
Step-level evaluation scores each individual action or response in isolation. Outcome-level evaluation scores whether the agent accomplished the overall goal across a full conversation. The industry has mature tooling for the first type but fragmented, inconsistent approaches to the second. Anthropic's "Demystifying Evals" framework draws this distinction clearly.
| Dimension | Step-level evaluation | Outcome-level evaluation |
|---|---|---|
| What it scores | Each individual action or response | Whether the full conversation accomplished the goal |
| Example question | "Did the agent use the right tool?" | "Did the customer's problem get resolved?" |
| Tooling maturity | Well understood (LangSmith, Braintrust, Arize) | Fragmented, inconsistent benchmarks |
| Adoption | Widely adopted (part of the 83% tracking capability metrics) | Rare (part of the 30% evaluating outcomes) |
| Failure detection | Catches broken tool calls, bad formatting | Catches context loss, goal failure, coherence drift |
| Blind spot | Can score every step as passing while the outcome fails | Requires full conversation context to evaluate |
Most teams stop at step-level evaluation because it's easier, it's available, and it produces reassuring numbers. But step-level scores can mask outcome-level failures in ways that are subtle and hard to catch without dedicated testing.
When every step passes but the outcome fails
Multi-turn conversations are where the gap between step-level and outcome-level evaluation becomes most visible. An agent can score well on every individual response and still fail the customer.
Here's a concrete example. A support agent handles a billing inquiry across four turns:
Turn 1:
Customer: "I was charged twice for my March subscription."
Agent: "I'd be happy to look into that for you. Can you
confirm the email on your account?"
Per-turn score: ✅ Appropriate response, asks for verification
Turn 2:
Customer: "Sure, it's maria@example.com"
Agent: "Thanks! I can see your account. It looks like there
were two charges on March 3rd. Let me investigate."
Per-turn score: ✅ Acknowledges info, confirms the issue
Turn 3:
Customer: "Yes, exactly. Can you refund the duplicate?"
Agent: "Of course. To process the refund, can you confirm
the email address on your account?"
Per-turn score: ✅ Polite, follows refund procedure
Turn 4:
Customer: "I just gave you my email..."
Agent: "I apologize for the inconvenience. Could you please
provide your email so I can locate your account?"
Per-turn score: ✅ Apologetic, attempts to help

Every individual response is polite, relevant, and follows the script. A per-turn evaluator scores each one as passing. But the agent forgot the customer's email after they provided it. Twice. The customer satisfaction score for this interaction would be abysmal.
This is the multi-turn failure blind spot. Single-turn metrics can't detect conversational coherence failures because they evaluate each exchange in isolation, without access to the full conversation history. The failure only becomes visible when you score the outcome: did the agent resolve the billing issue without making the customer repeat themselves?
You can't catch this with monitoring. You catch it with scenario testing that simulates realistic multi-turn conversations and scores them against outcome criteria.
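One outcome criterion from that transcript is even checkable mechanically: did the agent ask for information the customer already provided? A minimal sketch, with hypothetical names and an intentionally crude heuristic (an LLM judge would be more robust in practice):

```typescript
// Illustrative outcome-level check (hypothetical names, crude heuristic):
// flag any agent turn that asks about the email after it was already given.
interface Turn {
  role: 'agent' | 'customer';
  text: string;
}

function asksForProvidedEmail(conversation: Turn[]): boolean {
  let emailProvided = false;
  for (const turn of conversation) {
    if (turn.role === 'customer' && /\S+@\S+\.\S+/.test(turn.text)) {
      emailProvided = true;
      continue;
    }
    if (turn.role === 'agent' && emailProvided && /email/i.test(turn.text)) {
      // The agent is asking again for information it already has.
      return true;
    }
  }
  return false;
}

// The billing transcript from above, condensed:
const transcript: Turn[] = [
  { role: 'agent', text: 'Can you confirm the email on your account?' },
  { role: 'customer', text: "Sure, it's maria@example.com" },
  { role: 'agent', text: 'To process the refund, can you confirm the email address?' },
];
console.log(asksForProvidedEmail(transcript)); // true
```

Note that this check needs the whole conversation as input. A per-turn evaluator, handed only turn 3, has no way to know the email was already provided.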
How do cascade failures hide behind green traces?
Cascade failures occur when an intermediate step produces wrong output that downstream steps process correctly, making every individual trace span appear healthy while the customer receives incorrect information. They're the most dangerous category of failure that step-level evaluation misses, because the system looks like it's working perfectly at every point in the pipeline.
Picture an agent that answers product questions using a knowledge base. The retrieval step pulls the wrong document. The summarization step produces a clear, well-written summary of that wrong document. The response is fluent, confident, and completely incorrect.
If you only evaluate the final response for quality, it scores well. The language is clear. The answer is structured. It addresses the question directly. But the customer gets the wrong return policy and tries to return a laptop under clothing terms.
The only way to catch this is to evaluate intermediate outputs, not just the final response: tool output validation, retrieval accuracy checks, and end-to-end outcome scoring that compares the agent's answer against the expected correct answer.
This is where the monitoring-only approach breaks down most dangerously. The trace shows a successful retrieval, a successful summarization, and a successful response. Every span is green. The customer got wrong information. Your quality scorecards need to include criteria for factual accuracy, not just response quality.
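To make the intermediate check concrete, here's a minimal sketch. The shapes and names (`PipelineRun`, `scoreRun`, the document IDs) are hypothetical, not a specific framework's API:

```typescript
// Illustrative sketch (hypothetical names): score the retrieval step
// against ground truth instead of only judging the final answer.
interface PipelineRun {
  retrievedDocId: string;
  finalAnswer: string;
}

function scoreRun(
  run: PipelineRun,
  expectedDocId: string,
): { pass: boolean; reason: string } {
  // A fluent answer built on the wrong document is still a failure.
  if (run.retrievedDocId !== expectedDocId) {
    return { pass: false, reason: 'retrieval pulled the wrong document' };
  }
  if (run.finalAnswer.trim().length === 0) {
    return { pass: false, reason: 'empty answer' };
  }
  return { pass: true, reason: 'correct source document' };
}

// Every span in this run's trace was green; the customer still got
// the clothing return policy for a laptop.
const cascade = scoreRun(
  { retrievedDocId: 'returns-clothing', finalAnswer: 'You have 14 days to return…' },
  'returns-electronics',
);
console.log(cascade.pass, cascade.reason); // → false retrieval pulled the wrong document
```

The point isn't the string comparison; it's that the evaluator has ground truth for an intermediate step, which no amount of final-response scoring can substitute for.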
Happy path testing isn't testing
The third blind spot is adversarial robustness. Most teams that do test their agents test only the happy path: cooperative users, clear questions, standard workflows. Production users are none of these things.
Real users interrupt mid-sentence. They change their mind after the agent has already started processing. They provide contradictory information. They ask questions the agent wasn't designed for. They try to get the agent to do things it shouldn't. Not always maliciously. Often just because they're confused, frustrated, or multitasking.
An agent that handles cooperative users flawlessly can fall apart when facing:
- The impatient user who skips ahead in the flow and demands a resolution before the agent has gathered enough information
- The mind-changer who says "actually, cancel that" after the agent has already submitted the request
- The contradictory user who provides a phone number, then gives a different phone number when asked to confirm, then insists the first one was correct
- The off-script user who asks about something completely unrelated in the middle of a support flow
These aren't edge cases. They're Tuesday. And testing only happy paths leaves your agent exposed to the interactions that generate the most customer complaints.
Building test personas that simulate these behaviors is how you close the adversarial gap. Not one generic "difficult customer" persona, but specific behavioral profiles that probe specific failure modes your agent needs to handle.
How do you close the gap before production?
You don't need to replace your observability stack. You need to add a layer on top of it that asks different questions. Instead of "what happened?" you add "should that have happened?" This means multi-turn scenario tests, outcome-level scorecards, and adversarial persona testing, all running before deployment.
That means three things in practice:
1. Multi-turn scenario tests
Single-turn prompt/response tests are a start, but they miss everything we've discussed: context loss, coherence failures, conversational drift. You need tests that simulate full user journeys.
A good scenario test defines a persona (who is the user, what do they want, how do they behave), a conversation flow (how many turns, what topics), and success criteria (what does a good outcome look like). Then it runs the conversation and scores the result.
Here's what a multi-turn evaluation looks like when you test for the billing support failure from earlier:
```typescript
import { Chanl } from '@chanl/sdk'

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY })

// Create a persona that tests context retention
const { data: persona } = await chanl.personas.create({
  name: 'Frustrated Repeat Customer',
  description: 'Provides information once and expects the agent to remember it',
  traits: ['impatient', 'direct', 'detail-oriented'],
  background: 'Has been a customer for 3 years. Was double-charged and wants a quick resolution.',
})

// Create a scenario that simulates the full billing conversation
const { data: scenario } = await chanl.scenarios.create({
  name: 'Double Charge Resolution',
  agentId: supportAgentId,
  personaId: persona.id,
  type: 'inbound',
  maxTurns: 6,
  variables: {
    customerEmail: 'maria@example.com',
    issueType: 'duplicate-charge',
    chargeDate: '2026-03-03',
  },
})
```

This scenario doesn't just test whether the agent responds. It tests whether the agent can hold context across turns, handle a frustrated user, and actually resolve the problem. The persona's traits drive realistic behavior. The variables provide ground truth for scoring.
2. Outcome-level scorecards
Step-level metrics tell you whether the agent's responses were well-formed. Outcome scorecards tell you whether the conversation accomplished its goal. They score the full interaction, not individual messages.
Building on the billing scenario, you'd want a scorecard that checks for exactly the failures that monitoring misses:
```typescript
// Create a scorecard with outcome-level criteria
const { data: scorecard } = await chanl.scorecards.create({
  name: 'Support Resolution Quality',
  scoringAlgorithm: 'weighted_average',
  passingThreshold: 75,
})

// Did the agent solve the actual problem?
await chanl.scorecards.createCriterion(scorecard.id, {
  name: 'Issue Resolution',
  key: 'resolution',
  description: 'The duplicate charge was acknowledged and a refund was initiated without requiring the customer to repeat information',
  weight: 35,
  type: 'prompt',
})

// Did the agent retain context across turns?
await chanl.scorecards.createCriterion(scorecard.id, {
  name: 'Context Retention',
  key: 'context-retention',
  description: 'The agent remembered previously provided information like email address and charge details without asking again',
  weight: 30,
  type: 'prompt',
})

// Did the agent handle the customer's frustration?
await chanl.scorecards.createCriterion(scorecard.id, {
  name: 'Emotional Awareness',
  key: 'emotional-awareness',
  description: 'The agent recognized the customer frustration about being double-charged and responded with appropriate empathy',
  weight: 20,
  type: 'prompt',
})

// Was escalation handled correctly?
await chanl.scorecards.createCriterion(scorecard.id, {
  name: 'Escalation Judgment',
  key: 'escalation',
  description: 'The agent escalated to a human if it could not resolve the refund, rather than looping or guessing',
  weight: 15,
  type: 'prompt',
})
```

Each criterion maps to a specific failure mode. Context retention catches the "forgot the email" problem. Emotional awareness catches the tone misalignment. Escalation judgment catches the agent that loops forever instead of handing off. These are the things your human reviewers check for. Now they're automated and consistent.
3. Adversarial persona testing
Happy-path tests tell you whether the agent works when everything goes right. Adversarial tests tell you whether the agent degrades gracefully when things get messy.
```typescript
// Create adversarial personas that probe specific weaknesses
const { data: mindChanger } = await chanl.personas.create({
  name: 'The Mind-Changer',
  description: 'Changes their request mid-conversation after the agent has already started processing',
  traits: ['indecisive', 'polite', 'easily confused'],
  background: 'Initially wants to upgrade their plan, then decides to cancel instead, then asks about a completely different product.',
})

const { data: rusher } = await chanl.personas.create({
  name: 'The Rusher',
  description: 'Skips verification steps and demands immediate resolution',
  traits: ['impatient', 'assertive', 'time-pressured'],
  background: 'Between meetings, needs the issue resolved in under 2 minutes, will not tolerate delays or unnecessary questions.',
})

// Run scenarios with each adversarial persona
// (assumes billingScenario and upgradeScenario were created earlier,
// like the billing scenario above)
const results = await chanl.scenarios.runAll({
  scenarioIds: [billingScenario.id, upgradeScenario.id],
  personaIds: [mindChanger.id, rusher.id],
  scorecardId: scorecard.id,
})

// Check results across all combinations
for (const result of results.data) {
  console.log(`${result.scenarioName} x ${result.personaName}: ${result.score}/100`)
  if (result.score < scorecard.passingThreshold) {
    console.log(`  Failed criteria: ${result.failedCriteria.map(c => c.name).join(', ')}`)
  }
}
```

Running your scenarios against multiple personas creates a matrix of test conditions. You find out not just whether the agent handles billing inquiries, but whether it handles billing inquiries from impatient users, confused users, and users who change their mind. The failures that emerge from adversarial testing are almost always different from the ones you'd find with cooperative test personas.
From reactive to preventive
The difference between monitoring-only and monitoring-plus-evaluation isn't just about catching more bugs. It's about when you catch them.
Monitoring is reactive. It catches failures after they've reached customers. You find out about the context retention bug on Thursday when the escalation queue spikes. You find out about the cascade failure when a customer tweets the wrong return policy your agent gave them. You find out about the adversarial weakness when someone discovers they can get your agent to skip verification steps.
Evaluation is preventive. It catches failures before deployment. You run your scenario suite in CI. The context retention test fails. You fix the prompt. You run again. It passes. You ship with confidence that the specific failure mode is covered.
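In CI terms, that gate can be a single function: block the deployment when any scenario run scores below the passing threshold. A sketch with generic result shapes, not a specific SDK or CI system's API:

```typescript
// Illustrative CI gate (generic shapes, not a specific SDK):
// block deployment when any scenario run falls below the threshold.
interface ScenarioResult {
  name: string;
  score: number; // 0-100
}

function deploymentAllowed(
  results: ScenarioResult[],
  passingThreshold: number,
): { allowed: boolean; failures: string[] } {
  const failures = results
    .filter((r) => r.score < passingThreshold)
    .map((r) => r.name);
  return { allowed: failures.length === 0, failures };
}

// The context-retention scenario fails, so Tuesday's prompt change
// never ships; fix, re-run, then deploy.
const gate = deploymentAllowed(
  [
    { name: 'Double Charge Resolution', score: 62 },
    { name: 'Plan Upgrade', score: 88 },
  ],
  75,
)
console.log(gate.allowed, gate.failures)
```

In a real pipeline the failing branch would exit non-zero so the build stops; the shape of the decision is the same.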
The gap between monitoring adoption and outcome evaluation represents teams flying with instruments but no pre-flight checklist. The instruments are valuable. They're just not sufficient.
Closing the gap means treating evaluation as a first-class engineering practice, not an afterthought. It means running scenarios before every deployment, scoring outcomes across full conversations, and building adversarial test suites that probe the failures your monitoring can't see.
The tools exist. The patterns are documented. The gap is a choice. And for teams that monitor without testing, every deployment is a bet that nothing has regressed since the last time someone manually checked.
That's not a bet most teams should be making.
How do you track quality over time?
One more piece closes the loop between evaluation and monitoring. Running scenarios once before a deployment catches regressions at that moment. Tracking scores over time catches slower trends: gradual quality drift, seasonal patterns, and the subtle degradation that happens when upstream APIs change or knowledge bases go stale.
```typescript
// Pull score trends to catch gradual regression
const results = await chanl.scorecards.listResults({
  scorecardId: scorecard.id,
  from: '2026-03-01',
  to: '2026-04-01',
})

// Flag any criteria trending downward
for (const criterion of results.data.criteria) {
  const trend = criterion.weekOverWeek
  if (trend < -5) {
    console.warn(
      `${criterion.name} dropped ${Math.abs(trend)}% this week. ` +
      `Current: ${criterion.currentScore}. Investigate.`
    )
  }
}
```

This turns evaluation from a gate (pass/fail at deploy time) into a signal (continuous quality measurement). Monitoring dashboards show you system health. Score trends show you quality health. Together, they cover both sides of the question: is it running, and is it working?
Teams that track capability metrics and teams that evaluate outcomes aren't opposing camps. They're two halves of a complete quality practice. The goal isn't to choose one over the other. It's to close the gap between them.