There's a pattern that keeps showing up in research about production AI agents. Not the kind of finding teams brag about. The kind that explains why quality issues still blindside engineering teams who thought they had monitoring covered.
A systematic review of 60+ agent evaluation studies found that 83% of teams report capability metrics, but only 30% evaluate human-centred or outcome-level dimensions. Separately, a survey of 306 practitioners found that 74% still rely on human evaluation as their primary assessment method.
Those numbers come from two independent research efforts published in 2025, and they capture something important about where the industry is right now. Observability adoption far outpaces outcome evaluation. Teams have invested heavily in knowing when things go wrong. They haven't invested nearly as much in preventing things from going wrong in the first place.
This isn't a tooling problem. It's a category confusion problem. Monitoring and evaluation look similar from the outside. Both produce dashboards. Both generate scores. Both claim to measure quality. But they answer fundamentally different questions, and conflating them creates a dangerous blind spot.
Why does monitoring create a false sense of security?
Green dashboards don't mean good outcomes. Monitoring measures whether the system ran without errors, not whether the agent actually helped the customer. When every trace completes and every metric stays flat, teams naturally assume the agent is performing well. But "completed without errors" and "accomplished the goal" are entirely different claims.
Consider what a typical observability stack actually measures: latency per request, token usage, error rates, trace completeness, tool call success/failure. These are infrastructure metrics. They tell you whether the system is running. They don't tell you whether the system is working.
A support agent that confidently gives wrong answers completes every trace with a 200 status code. A booking agent that books the wrong date finishes every tool call successfully. An onboarding agent that confuses a new user with jargon produces perfect latency numbers. None of these failures show up in monitoring dashboards until a customer complains.
Here's what that looks like in practice. Most teams have something close to this setup:
```typescript
// Typical observability setup: traces everything, evaluates nothing
const trace = tracer.startSpan('agent.conversation');

// ✅ Captures latency
trace.setAttribute('duration_ms', responseTime);

// ✅ Captures token usage
trace.setAttribute('tokens.input', inputTokens);
trace.setAttribute('tokens.output', outputTokens);

// ✅ Captures tool call results
trace.setAttribute('tool.name', 'lookup_order');
trace.setAttribute('tool.status', 'success');

// ❌ Never asks: did the agent solve the problem?
// ❌ Never asks: did the customer get the right information?
// ❌ Never asks: would a human reviewer approve this interaction?
trace.end();
```

Everything in this setup answers "what happened?" Nothing answers "was the outcome good?"
That gap is where quality regressions live. You ship a prompt update on Tuesday. Traces look clean. Latency is flat. Error rate holds steady. On Thursday, a customer escalation reveals that the new prompt causes the agent to skip the verification step in refund conversations. You've been shipping bad interactions for 48 hours. Your monitoring didn't catch it because the failure isn't a system error. It's a behavior change.
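The verification regression is exactly the kind of failure a behavioral assertion can catch before deploy. Here's a minimal sketch of one, with hypothetical names and a deliberately simple pattern match; it's an illustration of the idea, not a particular testing library's API:

```typescript
// Illustrative sketch (hypothetical names): a behavioral check that
// monitoring can't express. The trace is green either way; this fails.
interface Turn {
  role: 'agent' | 'customer';
  text: string;
}

// Pass only if the agent verified identity before committing to a refund.
function verifiesBeforeRefund(conversation: Turn[]): boolean {
  let verified = false;
  for (const turn of conversation) {
    if (turn.role !== 'agent') continue;
    if (/verify|confirm (the|your) (identity|email|account)/i.test(turn.text)) {
      verified = true;
    }
    if (/refund (is|has been) (issued|processed)/i.test(turn.text) && !verified) {
      return false; // refunded without verification: the Tuesday regression
    }
  }
  return verified;
}

// The regressed prompt skips straight to the refund.
const regressed: Turn[] = [
  { role: 'customer', text: 'I want a refund for my duplicate charge.' },
  { role: 'agent', text: 'Done! Your refund has been issued.' },
];
console.log(verifiesBeforeRefund(regressed)); // false
```

A check like this fails in CI on Tuesday instead of surfacing in the escalation queue on Thursday.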
What's the difference between step-level and outcome-level evaluation?
Step-level evaluation scores each individual action or response in isolation. Outcome-level evaluation scores whether the agent accomplished the overall goal across a full conversation. The industry has mature tooling for the first type but fragmented, inconsistent approaches to the second. Anthropic's "Demystifying Evals" framework draws this distinction clearly.
| Dimension | Step-level evaluation | Outcome-level evaluation |
|---|---|---|
| What it scores | Each individual action or response | Whether the full conversation accomplished the goal |
| Example question | "Did the agent use the right tool?" | "Did the customer's problem get resolved?" |
| Tooling maturity | Well understood (LangSmith, Braintrust, Arize) | Fragmented, inconsistent benchmarks |
| Adoption | Widely adopted (part of the 83% tracking capability metrics) | Rare (part of the 30% evaluating outcomes) |
| Failure detection | Catches broken tool calls, bad formatting | Catches context loss, goal failure, coherence drift |
| Blind spot | Can score every step as passing while the outcome fails | Requires full conversation context to evaluate |
Most teams stop at step-level evaluation because it's easier, it's available, and it produces reassuring numbers. But step-level scores can mask outcome-level failures in ways that are subtle and hard to catch without dedicated testing.
When every step passes but the outcome fails
Multi-turn conversations are where the gap between step-level and outcome-level evaluation becomes most visible. An agent can score well on every individual response and still fail the customer.
Here's a concrete example. A support agent handles a billing inquiry across four turns:
Turn 1:
Customer: "I was charged twice for my March subscription."
Agent: "I'd be happy to look into that for you. Can you
confirm the email on your account?"
Per-turn score: ✅ Appropriate response, asks for verification
Turn 2:
Customer: "Sure, it's maria@example.com"
Agent: "Thanks! I can see your account. It looks like there
were two charges on March 3rd. Let me investigate."
Per-turn score: ✅ Acknowledges info, confirms the issue
Turn 3:
Customer: "Yes, exactly. Can you refund the duplicate?"
Agent: "Of course. To process the refund, can you confirm
the email address on your account?"
Per-turn score: ✅ Polite, follows refund procedure
Turn 4:
Customer: "I just gave you my email..."
Agent: "I apologize for the inconvenience. Could you please
provide your email so I can locate your account?"
Per-turn score: ✅ Apologetic, attempts to help

Every individual response is polite, relevant, and follows the script. A per-turn evaluator scores each one as passing. But the agent forgot the customer's email after they provided it. Twice. The customer satisfaction score for this interaction would be abysmal.
This is the multi-turn failure blind spot. Single-turn metrics can't detect conversational coherence failures because they evaluate each exchange in isolation, without access to the full conversation history. The failure only becomes visible when you score the outcome: did the agent resolve the billing issue without making the customer repeat themselves?
You can't catch this with monitoring. You catch it with scenario testing that simulates realistic multi-turn conversations and scores them against outcome criteria.
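One outcome criterion from that transcript is even checkable mechanically: did the agent ask for information the customer already provided? A minimal sketch, with hypothetical names and an intentionally crude heuristic (an LLM judge would be more robust in practice):

```typescript
// Illustrative outcome-level check (hypothetical names, crude heuristic):
// flag any agent turn that asks about the email after it was already given.
interface Turn {
  role: 'agent' | 'customer';
  text: string;
}

function asksForProvidedEmail(conversation: Turn[]): boolean {
  let emailProvided = false;
  for (const turn of conversation) {
    if (turn.role === 'customer' && /\S+@\S+\.\S+/.test(turn.text)) {
      emailProvided = true;
      continue;
    }
    if (turn.role === 'agent' && emailProvided && /email/i.test(turn.text)) {
      // The agent is asking again for information it already has.
      return true;
    }
  }
  return false;
}

// The billing transcript from above, condensed:
const transcript: Turn[] = [
  { role: 'agent', text: 'Can you confirm the email on your account?' },
  { role: 'customer', text: "Sure, it's maria@example.com" },
  { role: 'agent', text: 'To process the refund, can you confirm the email address?' },
];
console.log(asksForProvidedEmail(transcript)); // true
```

Note that this check needs the whole conversation as input. A per-turn evaluator, handed only turn 3, has no way to know the email was already provided.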
How do cascade failures hide behind green traces?
Cascade failures occur when an intermediate step produces wrong output that downstream steps process correctly, making every individual trace span appear healthy while the customer receives incorrect information. They're the most dangerous category of failure that step-level evaluation misses, because the system looks like it's working perfectly at every point in the pipeline.
Picture an agent that answers product questions using a knowledge base. The retrieval step pulls the wrong document. The summarization step produces a clear, well-written summary of that wrong document. The response is fluent, confident, and completely incorrect.
If you only evaluate the final response for quality, it scores well. The language is clear. The answer is structured. It addresses the question directly. But the customer gets the wrong return policy and tries to return a laptop under clothing terms.
The only way to catch this is to evaluate intermediate outputs, not just the final response: tool output validation, retrieval accuracy checks, and end-to-end outcome scoring that compares the agent's answer against the expected correct answer.
This is where the monitoring-only approach breaks down most dangerously. The trace shows a successful retrieval, a successful summarization, and a successful response. Every span is green. The customer got wrong information. Your quality scorecards need to include criteria for factual accuracy, not just response quality.
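To make the intermediate check concrete, here's a minimal sketch. The shapes and names (`PipelineRun`, `scoreRun`, the document IDs) are hypothetical, not a specific framework's API:

```typescript
// Illustrative sketch (hypothetical names): score the retrieval step
// against ground truth instead of only judging the final answer.
interface PipelineRun {
  retrievedDocId: string;
  finalAnswer: string;
}

function scoreRun(
  run: PipelineRun,
  expectedDocId: string,
): { pass: boolean; reason: string } {
  // A fluent answer built on the wrong document is still a failure.
  if (run.retrievedDocId !== expectedDocId) {
    return { pass: false, reason: 'retrieval pulled the wrong document' };
  }
  if (run.finalAnswer.trim().length === 0) {
    return { pass: false, reason: 'empty answer' };
  }
  return { pass: true, reason: 'correct source document' };
}

// Every span in this run's trace was green; the customer still got
// the clothing return policy for a laptop.
const cascade = scoreRun(
  { retrievedDocId: 'returns-clothing', finalAnswer: 'You have 14 days to return…' },
  'returns-electronics',
);
console.log(cascade.pass, cascade.reason); // → false retrieval pulled the wrong document
```

The point isn't the string comparison; it's that the evaluator has ground truth for an intermediate step, which no amount of final-response scoring can substitute for.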
Happy path testing isn't testing
The third blind spot is adversarial robustness. Most teams that do test their agents test only the happy path: cooperative users, clear questions, standard workflows. Production users are none of these things.
Real users interrupt mid-sentence. They change their mind after the agent has already started processing. They provide contradictory information. They ask questions the agent wasn't designed for. They try to get the agent to do things it shouldn't. Not always maliciously. Often just because they're confused, frustrated, or multitasking.
An agent that handles cooperative users flawlessly can fall apart when facing:
- The impatient user who skips ahead in the flow and demands a resolution before the agent has gathered enough information
- The mind-changer who says "actually, cancel that" after the agent has already submitted the request
- The contradictory user who provides a phone number, then gives a different phone number when asked to confirm, then insists the first one was correct
- The off-script user who asks about something completely unrelated in the middle of a support flow
These aren't edge cases. They're Tuesday. And testing only happy paths leaves your agent exposed to the interactions that generate the most customer complaints.
Building test personas that simulate these behaviors is how you close the adversarial gap. Not one generic "difficult customer" persona, but specific behavioral profiles that probe specific failure modes your agent needs to handle.
How do you close the gap before production?
You don't need to replace your observability stack. You need to add a layer on top of it that asks different questions. Instead of "what happened?" you add "should that have happened?" This means multi-turn scenario tests, outcome-level scorecards, and adversarial persona testing, all running before deployment.
That means three things in practice:
1. Multi-turn scenario tests
Single-turn prompt/response tests are a start, but they miss everything we've discussed: context loss, coherence failures, conversational drift. You need tests that simulate full user journeys.
A good scenario test defines a persona (who is the user, what do they want, how do they behave), a conversation flow (how many turns, what topics), and success criteria (what does a good outcome look like). Then it runs the conversation and scores the result.
Here's what a multi-turn evaluation looks like when you test for the billing support failure from earlier:
```typescript
import { Chanl } from '@chanl/sdk'

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY })

// Create a persona that tests context retention
const { data: persona } = await chanl.personas.create({
  name: 'Frustrated Repeat Customer',
  description: 'Provides information once and expects the agent to remember it',
  traits: ['impatient', 'direct', 'detail-oriented'],
  background: 'Has been a customer for 3 years. Was double-charged and wants a quick resolution.',
})

// Create a scenario that simulates the full billing conversation
const { data: scenario } = await chanl.scenarios.create({
  name: 'Double Charge Resolution',
  agentId: supportAgentId,
  personaId: persona.id,
  type: 'inbound',
  maxTurns: 6,
  variables: {
    customerEmail: 'maria@example.com',
    issueType: 'duplicate-charge',
    chargeDate: '2026-03-03',
  },
})
```

This scenario doesn't just test whether the agent responds. It tests whether the agent can hold context across turns, handle a frustrated user, and actually resolve the problem. The persona's traits drive realistic behavior. The variables provide ground truth for scoring.
2. Outcome-level scorecards
Step-level metrics tell you whether the agent's responses were well-formed. Outcome scorecards tell you whether the conversation accomplished its goal. They score the full interaction, not individual messages.
Building on the billing scenario, you'd want a scorecard that checks for exactly the failures that monitoring misses:
```typescript
// Create a scorecard with outcome-level criteria
const { data: scorecard } = await chanl.scorecards.create({
  name: 'Support Resolution Quality',
  scoringAlgorithm: 'weighted_average',
  passingThreshold: 75,
})

// Did the agent solve the actual problem?
await chanl.scorecards.createCriterion(scorecard.id, {
  name: 'Issue Resolution',
  key: 'resolution',
  description: 'The duplicate charge was acknowledged and a refund was initiated without requiring the customer to repeat information',
  weight: 35,
  type: 'prompt',
})

// Did the agent retain context across turns?
await chanl.scorecards.createCriterion(scorecard.id, {
  name: 'Context Retention',
  key: 'context-retention',
  description: 'The agent remembered previously provided information like email address and charge details without asking again',
  weight: 30,
  type: 'prompt',
})

// Did the agent handle the customer's frustration?
await chanl.scorecards.createCriterion(scorecard.id, {
  name: 'Emotional Awareness',
  key: 'emotional-awareness',
  description: 'The agent recognized the customer frustration about being double-charged and responded with appropriate empathy',
  weight: 20,
  type: 'prompt',
})

// Was escalation handled correctly?
await chanl.scorecards.createCriterion(scorecard.id, {
  name: 'Escalation Judgment',
  key: 'escalation',
  description: 'The agent escalated to a human if it could not resolve the refund, rather than looping or guessing',
  weight: 15,
  type: 'prompt',
})
```

Each criterion maps to a specific failure mode. Context retention catches the "forgot the email" problem. Emotional awareness catches the tone misalignment. Escalation judgment catches the agent that loops forever instead of handing off. These are the things your human reviewers check for. Now they're automated and consistent.
3. Adversarial persona testing
Happy-path tests tell you whether the agent works when everything goes right. Adversarial tests tell you whether the agent degrades gracefully when things get messy.
```typescript
// Create adversarial personas that probe specific weaknesses
const { data: mindChanger } = await chanl.personas.create({
  name: 'The Mind-Changer',
  description: 'Changes their request mid-conversation after the agent has already started processing',
  traits: ['indecisive', 'polite', 'easily confused'],
  background: 'Initially wants to upgrade their plan, then decides to cancel instead, then asks about a completely different product.',
})

const { data: rusher } = await chanl.personas.create({
  name: 'The Rusher',
  description: 'Skips verification steps and demands immediate resolution',
  traits: ['impatient', 'assertive', 'time-pressured'],
  background: 'Between meetings, needs the issue resolved in under 2 minutes, will not tolerate delays or unnecessary questions.',
})

// Run scenarios with each adversarial persona
// (assumes billingScenario and upgradeScenario were created earlier,
// like the billing scenario above)
const results = await chanl.scenarios.runAll({
  scenarioIds: [billingScenario.id, upgradeScenario.id],
  personaIds: [mindChanger.id, rusher.id],
  scorecardId: scorecard.id,
})

// Check results across all combinations
for (const result of results.data) {
  console.log(`${result.scenarioName} x ${result.personaName}: ${result.score}/100`)
  if (result.score < scorecard.passingThreshold) {
    console.log(`  Failed criteria: ${result.failedCriteria.map(c => c.name).join(', ')}`)
  }
}
```

Running your scenarios against multiple personas creates a matrix of test conditions. You find out not just whether the agent handles billing inquiries, but whether it handles billing inquiries from impatient users, confused users, and users who change their mind. The failures that emerge from adversarial testing are almost always different from the ones you'd find with cooperative test personas.
From reactive to preventive
The difference between monitoring-only and monitoring-plus-evaluation isn't just about catching more bugs. It's about when you catch them.
Monitoring is reactive. It catches failures after they've reached customers. You find out about the context retention bug on Thursday when the escalation queue spikes. You find out about the cascade failure when a customer tweets the wrong return policy your agent gave them. You find out about the adversarial weakness when someone discovers they can get your agent to skip verification steps.
Evaluation is preventive. It catches failures before deployment. You run your scenario suite in CI. The context retention test fails. You fix the prompt. You run again. It passes. You ship with confidence that the specific failure mode is covered.
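In CI terms, that gate can be a single function: block the deployment when any scenario run scores below the passing threshold. A sketch with generic result shapes, not a specific SDK or CI system's API:

```typescript
// Illustrative CI gate (generic shapes, not a specific SDK):
// block deployment when any scenario run falls below the threshold.
interface ScenarioResult {
  name: string;
  score: number; // 0-100
}

function deploymentAllowed(
  results: ScenarioResult[],
  passingThreshold: number,
): { allowed: boolean; failures: string[] } {
  const failures = results
    .filter((r) => r.score < passingThreshold)
    .map((r) => r.name);
  return { allowed: failures.length === 0, failures };
}

// The context-retention scenario fails, so Tuesday's prompt change
// never ships; fix, re-run, then deploy.
const gate = deploymentAllowed(
  [
    { name: 'Double Charge Resolution', score: 62 },
    { name: 'Plan Upgrade', score: 88 },
  ],
  75,
)
console.log(gate.allowed, gate.failures)
```

In a real pipeline the failing branch would exit non-zero so the build stops; the shape of the decision is the same.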
The gap between monitoring adoption and outcome evaluation represents teams flying with instruments but no pre-flight checklist. The instruments are valuable. They're just not sufficient.
Closing the gap means treating evaluation as a first-class engineering practice, not an afterthought. It means running scenarios before every deployment, scoring outcomes across full conversations, and building adversarial test suites that probe the failures your monitoring can't see.
The tools exist. The patterns are documented. The gap is a choice. And for teams that monitor without testing, every deployment is a bet that nothing has regressed since the last time someone manually checked.
That's not a bet most teams should be making.
How do you track quality over time?
One more piece closes the loop between evaluation and monitoring. Running scenarios once before a deployment catches regressions at that moment. Tracking scores over time catches slower trends: gradual quality drift, seasonal patterns, and the subtle degradation that happens when upstream APIs change or knowledge bases go stale.
```typescript
// Pull score trends to catch gradual regression
const results = await chanl.scorecards.listResults({
  scorecardId: scorecard.id,
  from: '2026-03-01',
  to: '2026-04-01',
})

// Flag any criteria trending downward
for (const criterion of results.data.criteria) {
  const trend = criterion.weekOverWeek
  if (trend < -5) {
    console.warn(
      `${criterion.name} dropped ${Math.abs(trend)}% this week. ` +
      `Current: ${criterion.currentScore}. Investigate.`
    )
  }
}
```

This turns evaluation from a gate (pass/fail at deploy time) into a signal (continuous quality measurement). Monitoring dashboards show you system health. Score trends show you quality health. Together, they cover both sides of the question: is it running, and is it working?
Teams that track capability metrics and teams that evaluate outcomes aren't opposing camps. They're two halves of a complete quality practice. The goal isn't to choose one over the other. It's to close the gap between them.