Your agent shipped clean last Thursday. Scenarios passed, latency looked fine, the dashboard showed nothing unusual. Then on Tuesday, a CX lead forwarded you a ticket thread with a subject line: "Agent said our return window is 90 days?"
Your return window is 30 days. Changed three weeks ago. Your staging evals don't know that, because your staging evals run on the same test cases you wrote before the policy changed. They passed every day since the change. Meanwhile, your agent kept quoting the old policy to real customers.
This is what the LangChain 2026 State of AI Agents report quantifies: 89% of teams have observability implemented, but only 37% run online evaluations. That 52-point gap isn't a dashboard problem or a logging problem. It's a fundamentally different kind of evaluation that most teams haven't built yet.
| Dimension | Offline Evals | Online Evals |
|---|---|---|
| When it runs | Pre-deploy, in CI/CD | Continuously, on live traffic |
| What it tests | Your test cases | Real customer conversations |
| Input distribution | Curated and anticipated | Messy and surprising |
| Failure detection | Regressions against known scenarios | Drift, policy gaps, novel failures |
| Cost model | Fixed per deploy | Scales with traffic (sampled) |
| Feedback latency | Immediate (blocks deploy) | Hours to days (surfaces patterns) |
| Primary question | "Did this change break something?" | "Is production quality holding?" |
You need both. They answer different questions, and neither replaces the other. This guide walks through why that's true, then builds the online eval pipeline most teams are missing.
## Why offline evals can't catch what production reveals
Offline evals are powerful for regression testing. You write a scenario, define expected behavior, and your CI/CD blocks any change that breaks it. That's the right tool for "did this prompt edit cause the agent to stop escalating billing disputes correctly?"
But offline evals have a structural limit: they can only test against scenarios that already exist in your test set. Production traffic is a different distribution. Real customers:
- Ask questions that span multiple topics in a single message
- Reference context from a previous call they made two weeks ago
- Use language patterns your test persona writer never thought of
- Hit your agent at 2am when your knowledge base update from earlier that day hasn't fully propagated
There's also the drift problem. Agent quality can degrade for reasons that have nothing to do with your code. Your LLM provider silently updates a model checkpoint. A third-party data source your agent uses starts returning stale results. Your knowledge base gets a new document that contradicts an older one. None of these trigger a CI/CD failure. Your offline evals keep passing. Production quietly gets worse.
Research from InsightFinder found 91% of ML systems experience performance degradation without proactive monitoring. For AI agents, that degradation is often invisible until a customer notices it.
The loop only closes when you have both layers. Offline evals catch what you anticipated. Online evals catch what you didn't.
## What an online eval pipeline actually looks like
An online eval pipeline has three moving parts: a sampler that picks conversations from production traffic, a judge that scores them, and a storage layer that makes the scores queryable for trend analysis.
Here's the minimal version in TypeScript using the Chanl SDK:
```typescript
import Chanl from '@chanl/sdk';

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });

// Define your scorecard criteria once
const SCORECARD_ID = 'prod-cx-scorecard-v2';

// Run the sampler — pull recent conversations, pick 5% for evaluation
async function sampleAndEval() {
  const recent = await chanl.calls.list({
    startTime: new Date(Date.now() - 3600_000), // last hour
    limit: 500,
  });

  // Fisher-Yates shuffle for an unbiased sample
  // (sorting by a random comparator produces a biased shuffle)
  const shuffled = [...recent.calls];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const sampleSize = Math.ceil(shuffled.length * 0.05);
  const sample = shuffled.slice(0, sampleSize);

  const results = await Promise.allSettled(
    sample.map(call =>
      chanl.scorecards.evaluate({
        callId: call.id,
        scorecardId: SCORECARD_ID,
        judgeModel: 'claude-haiku-4-5',
        temperature: 0,
      })
    )
  );

  return results
    .filter(r => r.status === 'fulfilled')
    .map(r => (r as PromiseFulfilledResult<any>).value);
}
```

That's the skeleton. What matters is what goes inside SCORECARD_ID: the rubric definition that tells the judge what to look for.
## Building rubrics that travel across unknown conversations
Offline eval rubrics can be tightly coupled to your test scenarios. "Given this exact booking request, the agent should confirm the date, mention the cancellation policy, and ask for the customer's email." Specific anchors against a known input.
Online eval rubrics need to generalize. You don't know the input. Your rubric has to work for a billing dispute, an account recovery request, a shipping status check, and a complaint about a broken product in the same pass.
The practical structure is a two-layer rubric: a shared criterion definition, then generalized anchors that don't assume specific conversation content.
Here's an example of a criterion that works well for production online evals:
```typescript
const scorecardDefinition = {
  name: 'Production CX Scorecard v2',
  description: 'Evaluates live agent conversations for quality across four production dimensions',
  criteria: [
    {
      name: 'factual_accuracy',
      description: 'The agent\'s statements are factually correct given available information',
      weight: 0.35,
      anchors: {
        5: 'All factual claims are accurate and verifiable. No incorrect information given.',
        4: 'Claims are accurate with minor imprecision that doesn\'t mislead the customer.',
        3: 'Mostly accurate but contains one statement that could mislead without being outright wrong.',
        2: 'Contains at least one factually incorrect claim that could affect the customer\'s decision.',
        1: 'Multiple incorrect claims, or one high-impact incorrect claim (pricing, policy, availability).',
      },
    },
    {
      name: 'policy_adherence',
      description: 'The agent follows defined guardrails and business rules',
      weight: 0.30,
      anchors: {
        5: 'No policy violations. Guardrails applied correctly where relevant.',
        4: 'No violations, minor ambiguity in a policy edge case handled reasonably.',
        3: 'One policy guideline not applied, but outcome is still acceptable.',
        2: 'Clear policy guideline not applied, leading to a suboptimal but recoverable outcome.',
        1: 'Policy violated in a way that creates risk, gives wrong information, or requires human correction.',
      },
    },
    {
      name: 'task_resolution',
      description: 'The conversation reached an appropriate outcome for the customer\'s need',
      weight: 0.20,
      anchors: {
        5: 'Customer\'s need fully addressed or appropriately escalated to a human.',
        4: 'Need substantially addressed with minor gaps or unnecessary friction.',
        3: 'Partial resolution — customer need partially met but some aspect left hanging.',
        2: 'Conversation ended without resolution, though the agent didn\'t clearly fail.',
        1: 'Conversation ended with wrong outcome, unresolved issue, or customer demonstrably frustrated.',
      },
    },
    {
      name: 'tone_appropriateness',
      description: 'Response tone matches the context and customer state',
      weight: 0.15,
      anchors: {
        5: 'Tone perfectly calibrated — empathetic when customer is frustrated, efficient when customer wants speed.',
        4: 'Tone appropriate with minor calibration gaps (slightly too formal, slightly too casual).',
        3: 'Tone mismatch that doesn\'t damage the interaction but feels off.',
        2: 'Tone clearly inappropriate for the situation — cold during complaint, overly familiar during sensitive topic.',
        1: 'Tone actively damages the interaction or makes a difficult situation worse.',
      },
    },
  ],
};
```

The weights matter. Production failures cluster around factual_accuracy and policy_adherence. Giving them 65% of the composite score means your overall quality metric actually reflects risk, not just vibes.

## Detecting drift before customers notice it
Individual conversation scores are useful. Trends are more useful. The failure mode you're building against isn't "one bad conversation." It's quality degraded by 0.4 points over two weeks while nobody noticed until the escalation rate doubled.
Drift detection compares your current rolling score against a baseline:
```typescript
// triggerAlert is your own alerting hook (Slack, PagerDuty, etc.)
async function detectDrift() {
  const metrics = await chanl.calls.getMetrics({
    scorecardId: SCORECARD_ID,
    window: '7d',
    breakdown: 'daily',
  });

  const baselineAvg = metrics.baseline.composite; // established in first 14 days
  const currentAvg = metrics.current.composite;
  const drift = baselineAvg - currentAvg;

  if (drift >= 0.5) {
    await triggerAlert({
      severity: 'critical',
      message: `Quality score dropped ${drift.toFixed(2)} points vs. baseline`,
      data: metrics,
    });
  } else if (drift >= 0.3) {
    await triggerAlert({
      severity: 'warning',
      message: `Quality trending down: ${drift.toFixed(2)} points below baseline`,
      data: metrics,
    });
  }

  // Also check per-criterion drift — overall score can mask a criteria-specific failure
  for (const [criterion, score] of Object.entries(metrics.current.criteria)) {
    const baselineCriterion = metrics.baseline.criteria[criterion];
    const criterionDrift = baselineCriterion - (score as number);
    if (criterionDrift >= 0.4) {
      await triggerAlert({
        severity: 'warning',
        message: `${criterion} specifically degraded by ${criterionDrift.toFixed(2)} points`,
        data: { criterion, baseline: baselineCriterion, current: score },
      });
    }
  }

  return { drift, metrics };
}
```

Per-criterion drift detection is the part most teams skip. The overall composite score can look stable while a specific dimension quietly fails. If your policy_adherence score drops from 4.2 to 3.4 while tone and task_resolution hold steady, the composite barely moves. But policy_adherence dropping 0.8 points is a fire drill.
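The arithmetic behind that masking is easy to check. With a 0.30 weight on policy_adherence, an 0.8-point criterion drop moves the composite by only 0.24, below the 0.3 warning threshold, while the 0.4 per-criterion check fires. A quick illustrative sketch (the scores here are made up):

```typescript
// Illustrative only: a large per-criterion drop barely moves the composite.
const weights: Record<string, number> = {
  factual_accuracy: 0.35,
  policy_adherence: 0.30,
  task_resolution: 0.20,
  tone_appropriateness: 0.15,
};

const composite = (s: Record<string, number>): number =>
  Object.entries(weights).reduce((sum, [c, w]) => sum + s[c] * w, 0);

const baseline = {
  factual_accuracy: 4.5,
  policy_adherence: 4.2,
  task_resolution: 4.3,
  tone_appropriateness: 4.4,
};
// Same conversation quality everywhere except policy_adherence: 4.2 -> 3.4
const current = { ...baseline, policy_adherence: 3.4 };

const compositeDrop = composite(baseline) - composite(current);
// compositeDrop is 0.8 * 0.30 = 0.24: under the 0.3 composite warning,
// but the per-criterion check (>= 0.4) catches the 0.8-point drop.
```

This is exactly why the loop above alerts on both levels rather than the composite alone.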
## Turning online scores into offline test cases
Most online eval implementations stall right here. Teams build the pipeline, start collecting scores, build a dashboard, and then... watch the dashboard. The scores don't automatically make the agent better.

Converting low-scoring production conversations into new offline test cases is what closes the loop, and it's the highest-leverage thing you can do with your online eval data. Each captured failure is a real scenario your test suite didn't anticipate. Add it, and that specific failure can never sneak past CI again.
```typescript
async function surfaceFailuresForReview() {
  const lowScoring = await chanl.calls.list({
    scorecardId: SCORECARD_ID,
    scoreBelow: 3.0,
    window: '24h',
    limit: 20,
    orderBy: 'score_asc',
  });

  for (const call of lowScoring.calls) {
    const detail = await chanl.calls.get(call.id, { includeTranscript: true });

    // Tag for human review queue
    await chanl.calls.tag(call.id, {
      tags: ['needs-review', 'online-eval-flag'],
      priority: detail.score < 2.0 ? 'high' : 'normal',
    });
  }

  return lowScoring.calls.length;
}
```

The human review step isn't overhead. It's calibration. Reviewers confirm whether the low score reflects a genuine agent failure, a rubric gap, or a true edge case that should become a new test scenario. Each confirmed failure feeds back into your offline test suite via your testing scenarios.
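From there, promoting a confirmed failure into the offline suite can be as mechanical as writing a fixture file. A sketch under assumptions: the OfflineScenario shape and the ./evals/scenarios/ path are illustrative conventions, not part of the Chanl SDK; adapt both to whatever your offline suite actually consumes.

```typescript
import { writeFile } from 'node:fs/promises';

// Hypothetical fixture shape for a promoted production failure
interface OfflineScenario {
  id: string;
  source: 'online-eval';
  failedCriterion: string;
  transcript: { role: 'customer' | 'agent'; text: string }[];
  expectedBehavior: string; // written by the human reviewer during triage
}

function toFixture(scenario: OfflineScenario): string {
  return JSON.stringify(scenario, null, 2);
}

// One file per scenario keeps additions reviewable in pull requests
async function promoteToTestCase(scenario: OfflineScenario): Promise<void> {
  await writeFile(`./evals/scenarios/${scenario.id}.json`, toFixture(scenario));
}
```

The expectedBehavior field is the important part: it captures what the reviewer decided the agent should have done, which is exactly the assertion your offline judge needs.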
## The combined eval strategy
Offline and online evals aren't in competition. They're scheduled for different moments and answer different questions:
Before deploy: Run your full offline scenario suite in CI/CD. Block merges that regress performance on any critical path. This is your regression gate. It should catch 90%+ of issues introduced by code or prompt changes.
Continuously in production: Sample 5-10% of live conversations for online eval. Run drift detection daily. Alert on threshold violations. Surface low-scorers for review weekly.
Weekly: Pull the 10 lowest-scoring conversations from the past seven days. Review them with a human. Pattern-match: are they clustered around a specific scenario type, tool, or topic? Convert confirmed failures to new test cases.
Before next deploy: Your new test cases from last week's review are now part of the offline suite. The loop closes.
This cadence is what production-ready agent eval actually means. Not "we have a test suite." Not "we have a dashboard." The closed loop where production failures become tomorrow's tests.
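Wiring the continuous pieces of this cadence together doesn't require heavy infrastructure; a plain interval loop is enough to start. A minimal sketch, assuming the sampleAndEval and detectDrift functions from earlier are importable (they're injected as parameters here so the loop stays testable):

```typescript
// Minimal scheduler sketch: hourly sampling, daily drift checks.
// The interval is injectable so the loop can be exercised in tests.
function startEvalLoop(
  sampleAndEval: () => Promise<unknown>,
  detectDrift: () => Promise<unknown>,
  hourMs: number = 3_600_000,
): () => void {
  const samplerTimer = setInterval(
    // Swallow errors so one failed run never kills the loop
    () => void sampleAndEval().catch(console.error),
    hourMs,
  );
  const driftTimer = setInterval(
    () => void detectDrift().catch(console.error),
    24 * hourMs,
  );
  // Return a stop function for clean shutdown
  return () => {
    clearInterval(samplerTimer);
    clearInterval(driftTimer);
  };
}
```

In production you'd more likely reach for a cron job or your job queue of choice, but the shape is the same: two recurring tasks and a clean way to stop them.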
You can read more about building the offline side of this loop in Production Agent Evals: Catch Score Drift, Ship Confidently. That guide covers the scenario regression and drift detection components in depth and pairs well with this one.
## What the 52-point gap actually costs
The 89%/37% stat from LangChain's report is striking, but the cost behind it is more concrete. For a team running a customer support agent at 10,000 conversations per day:
| Eval coverage | What you catch | What you miss |
|---|---|---|
| Offline only | Code regressions, prompt regressions | Policy drift, novel failures, model updates, tool errors |
| Observability only | Latency spikes, error rates, volume anomalies | Quality degradation, semantic failures, subtle policy violations |
| Online evals (5% sample) | Quality trends, policy gaps, systematic failures | One-off failures in the unsampled 95% |
| Combined | Regressions + ongoing quality + policy drift | Low-probability edge cases (manage with higher sampling rate) |
The gap between "observability only" and "online evals" isn't about dashboards. It's about whether your monitoring layer understands the quality of what the agent said, not just the mechanics of how it said it. Latency and error rates measure infrastructure. Scorecards measure customer experience.
If you're in the 89% with observability and haven't built online evals yet, start small. Pick your most production-critical agent. Set up 5% sampling. Write a four-criterion rubric with specific anchors. Run it for two weeks and look at the trend. You'll find something worth fixing. When you do, you'll understand exactly why 89% observability coverage still leaves a 52-point gap.
The agents that perform best in production aren't the ones that passed the most CI tests. They're the ones with teams that kept evaluating them after they shipped.
## Build the closed eval loop for your agents
Chanl's scorecard system handles online evaluation, drift detection, and failure surfacing so your team can close the gap between staging and production.