Your agent shipped clean last Thursday. Scenarios passed, latency looked fine, the dashboard showed nothing unusual. Then on Tuesday, a CX lead forwarded you a ticket thread with a subject line: "Agent said our return window is 90 days?"
Your return window is 30 days. Changed three weeks ago. Your staging evals don't know that, because your staging evals run on the same test cases you wrote before the policy changed. They passed every day since the change. Meanwhile, your agent kept quoting the old policy to real customers.
This is what the LangChain 2026 State of AI Agents report quantifies: 89% of teams have observability implemented, but only 37% run online evaluations. That 52-point gap isn't a dashboard problem or a logging problem. It's a fundamentally different kind of evaluation that most teams haven't built yet.
| Dimension | Offline Evals | Online Evals |
|---|---|---|
| When it runs | Pre-deploy, in CI/CD | Continuously, on live traffic |
| What it tests | Your test cases | Real customer conversations |
| Input distribution | Curated and anticipated | Messy and surprising |
| Failure detection | Regressions against known scenarios | Drift, policy gaps, novel failures |
| Cost model | Fixed per deploy | Scales with traffic (sampled) |
| Feedback latency | Immediate (blocks deploy) | Hours to days (surfaces patterns) |
| Primary question | "Did this change break something?" | "Is production quality holding?" |
You need both. They answer different questions, and neither replaces the other. This guide walks through why that's true, then builds the online eval pipeline most teams are missing.
## Why offline evals can't catch what production reveals
Offline evals are powerful for regression testing. You write a scenario, define expected behavior, and your CI/CD blocks any change that breaks it. That's the right tool for "did this prompt edit cause the agent to stop escalating billing disputes correctly?"
But offline evals have a structural limit: they can only test against scenarios that already exist in your test set. Production traffic is a different distribution. Real customers:
- Ask questions that span multiple topics in a single message
- Reference context from a previous call they made two weeks ago
- Use language patterns your test persona writer never thought of
- Hit your agent at 2am when your knowledge base update from earlier that day hasn't fully propagated
There's also the drift problem. Agent quality can degrade for reasons that have nothing to do with your code. Your LLM provider silently updates a model checkpoint. A third-party data source your agent uses starts returning stale results. Your knowledge base gets a new document that contradicts an older one. None of these trigger a CI/CD failure. Your offline evals keep passing. Production quietly gets worse.
Research from InsightFinder found 91% of ML systems experience performance degradation without proactive monitoring. For AI agents, that degradation is often invisible until a customer notices it.
The loop only closes when you have both layers. Offline evals catch what you anticipated. Online evals catch what you didn't.
## What an online eval pipeline actually looks like
An online eval pipeline has three moving parts: a sampler that picks conversations from production traffic, a judge that scores them, and a storage layer that makes the scores queryable for trend analysis.
Here's the minimal version in TypeScript using the Chanl SDK:
```typescript
import Chanl from '@chanl/sdk';

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });

// Define your scorecard criteria once
const SCORECARD_ID = 'prod-cx-scorecard-v2';

// Run the sampler — pull recent conversations, pick 5% for evaluation
async function sampleAndEval() {
  const recent = await chanl.calls.list({
    startTime: new Date(Date.now() - 3600_000), // last hour
    limit: 500,
  });

  // Fisher-Yates shuffle for an unbiased sample
  // (sorting by a random comparator produces a biased shuffle)
  const shuffled = [...recent.calls];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const sampleSize = Math.ceil(shuffled.length * 0.05);
  const sample = shuffled.slice(0, sampleSize);

  const results = await Promise.allSettled(
    sample.map(call =>
      chanl.scorecards.evaluate({
        callId: call.id,
        scorecardId: SCORECARD_ID,
        judgeModel: 'claude-haiku-4-5',
        temperature: 0,
      })
    )
  );

  return results
    .filter(r => r.status === 'fulfilled')
    .map(r => (r as PromiseFulfilledResult<any>).value);
}
```

That's the skeleton. What matters is what goes inside SCORECARD_ID: the rubric definition that tells the judge what to look for.
## Building rubrics that travel across unknown conversations
Offline eval rubrics can be tightly coupled to your test scenarios. "Given this exact booking request, the agent should confirm the date, mention the cancellation policy, and ask for the customer's email." Specific anchors against a known input.
Online eval rubrics need to generalize. You don't know the input. Your rubric has to work for a billing dispute, an account recovery request, a shipping status check, and a complaint about a broken product in the same pass.
The practical structure is a two-layer rubric: a shared criterion definition, then generalized anchors that don't assume specific conversation content.
Here's an example of a criterion that works well for production online evals:
```typescript
const scorecardDefinition = {
  name: 'Production CX Scorecard v2',
  description: 'Evaluates live agent conversations for quality across four production dimensions',
  criteria: [
    {
      name: 'factual_accuracy',
      description: 'The agent\'s statements are factually correct given available information',
      weight: 0.35,
      anchors: {
        5: 'All factual claims are accurate and verifiable. No incorrect information given.',
        4: 'Claims are accurate with minor imprecision that doesn\'t mislead the customer.',
        3: 'Mostly accurate but contains one statement that could mislead without being outright wrong.',
        2: 'Contains at least one factually incorrect claim that could affect the customer\'s decision.',
        1: 'Multiple incorrect claims, or one high-impact incorrect claim (pricing, policy, availability).',
      },
    },
    {
      name: 'policy_adherence',
      description: 'The agent follows defined guardrails and business rules',
      weight: 0.30,
      anchors: {
        5: 'No policy violations. Guardrails applied correctly where relevant.',
        4: 'No violations, minor ambiguity in a policy edge case handled reasonably.',
        3: 'One policy guideline not applied, but outcome is still acceptable.',
        2: 'Clear policy guideline not applied, leading to a suboptimal but recoverable outcome.',
        1: 'Policy violated in a way that creates risk, gives wrong information, or requires human correction.',
      },
    },
    {
      name: 'task_resolution',
      description: 'The conversation reached an appropriate outcome for the customer\'s need',
      weight: 0.20,
      anchors: {
        5: 'Customer\'s need fully addressed or appropriately escalated to a human.',
        4: 'Need substantially addressed with minor gaps or unnecessary friction.',
        3: 'Partial resolution — customer need partially met but some aspect left hanging.',
        2: 'Conversation ended without resolution, though the agent didn\'t clearly fail.',
        1: 'Conversation ended with wrong outcome, unresolved issue, or customer demonstrably frustrated.',
      },
    },
    {
      name: 'tone_appropriateness',
      description: 'Response tone matches the context and customer state',
      weight: 0.15,
      anchors: {
        5: 'Tone perfectly calibrated — empathetic when customer is frustrated, efficient when customer wants speed.',
        4: 'Tone appropriate with minor calibration gaps (slightly too formal, slightly too casual).',
        3: 'Tone mismatch that doesn\'t damage the interaction but feels off.',
        2: 'Tone clearly inappropriate for the situation — cold during complaint, overly familiar during sensitive topic.',
        1: 'Tone actively damages the interaction or makes a difficult situation worse.',
      },
    },
  ],
};
```

The weights matter. Production failures cluster around factual_accuracy and policy_adherence. Giving them 65% of the composite score means your overall quality metric actually reflects risk, not just vibes.

## Detecting drift before customers notice it
Individual conversation scores are useful. Trends are more useful. The failure mode you're building against isn't "one bad conversation." It's quality degraded by 0.4 points over two weeks while nobody noticed until the escalation rate doubled.
Drift detection compares your current rolling score against a baseline:
```typescript
// triggerAlert is your own alerting hook (Slack, PagerDuty, etc.)
async function detectDrift() {
  const metrics = await chanl.calls.getMetrics({
    scorecardId: SCORECARD_ID,
    window: '7d',
    breakdown: 'daily',
  });

  const baselineAvg = metrics.baseline.composite; // established in first 14 days
  const currentAvg = metrics.current.composite;
  const drift = baselineAvg - currentAvg;

  if (drift >= 0.5) {
    await triggerAlert({
      severity: 'critical',
      message: `Quality score dropped ${drift.toFixed(2)} points vs. baseline`,
      data: metrics,
    });
  } else if (drift >= 0.3) {
    await triggerAlert({
      severity: 'warning',
      message: `Quality trending down: ${drift.toFixed(2)} points below baseline`,
      data: metrics,
    });
  }

  // Also check per-criterion drift — overall score can mask a criteria-specific failure
  for (const [criterion, score] of Object.entries(metrics.current.criteria)) {
    const baselineCriterion = metrics.baseline.criteria[criterion];
    const criterionDrift = baselineCriterion - (score as number);
    if (criterionDrift >= 0.4) {
      await triggerAlert({
        severity: 'warning',
        message: `${criterion} specifically degraded by ${criterionDrift.toFixed(2)} points`,
        data: { criterion, baseline: baselineCriterion, current: score },
      });
    }
  }

  return { drift, metrics };
}
```

Per-criterion drift detection is the part most teams skip. The overall composite score can look stable while a specific dimension quietly fails. If your policy_adherence score drops from 4.2 to 3.4 while tone and task_resolution hold steady, the composite barely moves. But policy_adherence dropping 0.8 points is a fire drill.
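The arithmetic behind that masking is easy to check. With a 0.30 weight on policy_adherence, an 0.8-point criterion drop moves the composite by only 0.24, below the 0.3 warning threshold, while the 0.4 per-criterion check fires. A quick illustrative sketch (the scores here are made up):

```typescript
// Illustrative only: a large per-criterion drop barely moves the composite.
const weights: Record<string, number> = {
  factual_accuracy: 0.35,
  policy_adherence: 0.30,
  task_resolution: 0.20,
  tone_appropriateness: 0.15,
};

const composite = (s: Record<string, number>): number =>
  Object.entries(weights).reduce((sum, [c, w]) => sum + s[c] * w, 0);

const baseline = {
  factual_accuracy: 4.5,
  policy_adherence: 4.2,
  task_resolution: 4.3,
  tone_appropriateness: 4.4,
};
// Same conversation quality everywhere except policy_adherence: 4.2 -> 3.4
const current = { ...baseline, policy_adherence: 3.4 };

const compositeDrop = composite(baseline) - composite(current);
// compositeDrop is 0.8 * 0.30 = 0.24: under the 0.3 composite warning,
// but the per-criterion check (>= 0.4) catches the 0.8-point drop.
```

This is exactly why the loop above alerts on both levels rather than the composite alone.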
## Turning online scores into offline test cases
Most online eval implementations stall right here. Teams build the pipeline, start collecting scores, build a dashboard, and then... watch the dashboard. The scores don't automatically make the agent better.

Converting low-scoring production conversations into new offline test cases is what closes the loop, and it's the highest-leverage thing you can do with your online eval data. Each captured failure is a real scenario your test suite didn't anticipate. Add it, and that specific failure can never sneak past CI again.
```typescript
async function surfaceFailuresForReview() {
  const lowScoring = await chanl.calls.list({
    scorecardId: SCORECARD_ID,
    scoreBelow: 3.0,
    window: '24h',
    limit: 20,
    orderBy: 'score_asc',
  });

  for (const call of lowScoring.calls) {
    const detail = await chanl.calls.get(call.id, { includeTranscript: true });

    // Tag for human review queue
    await chanl.calls.tag(call.id, {
      tags: ['needs-review', 'online-eval-flag'],
      priority: detail.score < 2.0 ? 'high' : 'normal',
    });
  }

  return lowScoring.calls.length;
}
```

The human review step isn't overhead. It's calibration. Reviewers confirm whether the low score reflects a genuine agent failure, a rubric gap, or a true edge case that should become a new test scenario. Each confirmed failure feeds back into your offline test suite via your testing scenarios.
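From there, promoting a confirmed failure into the offline suite can be as mechanical as writing a fixture file. A sketch under assumptions: the OfflineScenario shape and the ./evals/scenarios/ path are illustrative conventions, not part of the Chanl SDK; adapt both to whatever your offline suite actually consumes.

```typescript
import { writeFile } from 'node:fs/promises';

// Hypothetical fixture shape for a promoted production failure
interface OfflineScenario {
  id: string;
  source: 'online-eval';
  failedCriterion: string;
  transcript: { role: 'customer' | 'agent'; text: string }[];
  expectedBehavior: string; // written by the human reviewer during triage
}

function toFixture(scenario: OfflineScenario): string {
  return JSON.stringify(scenario, null, 2);
}

// One file per scenario keeps additions reviewable in pull requests
async function promoteToTestCase(scenario: OfflineScenario): Promise<void> {
  await writeFile(`./evals/scenarios/${scenario.id}.json`, toFixture(scenario));
}
```

The expectedBehavior field is the important part: it captures what the reviewer decided the agent should have done, which is exactly the assertion your offline judge needs.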
## The combined eval strategy
Offline and online evals aren't in competition. They're scheduled for different moments and answer different questions:
Before deploy: Run your full offline scenario suite in CI/CD. Block merges that regress performance on any critical path. This is your regression gate. It should catch 90%+ of issues introduced by code or prompt changes.
Continuously in production: Sample 5-10% of live conversations for online eval. Run drift detection daily. Alert on threshold violations. Surface low-scorers for review weekly.
Weekly: Pull the 10 lowest-scoring conversations from the past seven days. Review them with a human. Pattern-match: are they clustered around a specific scenario type, tool, or topic? Convert confirmed failures to new test cases.
Before next deploy: Your new test cases from last week's review are now part of the offline suite. The loop closes.
This cadence is what production-ready agent eval actually means. Not "we have a test suite." Not "we have a dashboard." The closed loop where production failures become tomorrow's tests.
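Wiring the continuous pieces of this cadence together doesn't require heavy infrastructure; a plain interval loop is enough to start. A minimal sketch, assuming the sampleAndEval and detectDrift functions from earlier are importable (they're injected as parameters here so the loop stays testable):

```typescript
// Minimal scheduler sketch: hourly sampling, daily drift checks.
// The interval is injectable so the loop can be exercised in tests.
function startEvalLoop(
  sampleAndEval: () => Promise<unknown>,
  detectDrift: () => Promise<unknown>,
  hourMs: number = 3_600_000,
): () => void {
  const samplerTimer = setInterval(
    // Swallow errors so one failed run never kills the loop
    () => void sampleAndEval().catch(console.error),
    hourMs,
  );
  const driftTimer = setInterval(
    () => void detectDrift().catch(console.error),
    24 * hourMs,
  );
  // Return a stop function for clean shutdown
  return () => {
    clearInterval(samplerTimer);
    clearInterval(driftTimer);
  };
}
```

In production you'd more likely reach for a cron job or your job queue of choice, but the shape is the same: two recurring tasks and a clean way to stop them.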
You can read more about building the offline side of this loop in Production Agent Evals: Catch Score Drift, Ship Confidently. That guide covers the scenario regression and drift detection components in depth and pairs well with this one.
## What the 52-point gap actually costs
The 89%/37% stat from LangChain's report is striking, but the cost behind it is more concrete. For a team running a customer support agent at 10,000 conversations per day:
| Eval coverage | What you catch | What you miss |
|---|---|---|
| Offline only | Code regressions, prompt regressions | Policy drift, novel failures, model updates, tool errors |
| Observability only | Latency spikes, error rates, volume anomalies | Quality degradation, semantic failures, subtle policy violations |
| Online evals (5% sample) | Quality trends, policy gaps, systematic failures | One-off failures in the unsampled 95% |
| Combined | Regressions + ongoing quality + policy drift | Low-probability edge cases (manage with higher sampling rate) |
The gap between "observability only" and "online evals" isn't about dashboards. It's about whether your monitoring layer understands the quality of what the agent said, not just the mechanics of how it said it. Latency and error rates measure infrastructure. Scorecards measure customer experience.
If you're in the 89% with observability and haven't built online evals yet, start small. Pick your most production-critical agent. Set up 5% sampling. Write a four-criterion rubric with specific anchors. Run it for two weeks and look at the trend. You'll find something worth fixing. When you do, you'll understand exactly why 89% observability coverage still leaves a 52-point gap.
The agents that perform best in production aren't the ones that passed the most CI tests. They're the ones with teams that kept evaluating them after they shipped.
## Build the closed eval loop for your agents
Chanl's scorecard system handles online evaluation, drift detection, and failure surfacing so your team can close the gap between staging and production.