What is criteria-based evaluation for AI agents?

Criteria-based evaluation scores agent responses against a set of defined quality dimensions (accuracy, tone, policy compliance) rather than comparing them to a reference answer. Instead of asking 'does this match the expected output?', you ask 'does this meet the standards we've defined for a good response?' This works for open-ended CX conversations where no single correct answer exists.

How does LLM-as-judge work for evaluating agent responses?

LLM-as-judge uses a separate language model to score your agent's responses against a rubric. You give the judge the conversation context, the agent's response, and a set of structured scoring criteria. The judge returns scores and reasoning for each criterion. This scales automated evaluation to thousands of conversations that would take human reviewers days to process.

Why doesn't traditional evaluation work for CX agents?

Traditional eval compares outputs to reference answers. But for a customer asking 'I'm upset about a charge on my account,' dozens of different responses could all be excellent, just phrased differently. Traditional eval would penalize any response that differs from the reference even if it's genuinely better. CX conversations have no canonical correct response, only criteria for what a good one looks like.

How many judges should I use for LLM-as-judge evaluation?

Most teams find that three judges strike the right balance between reliability and cost. One judge is fast but noisy on borderline cases. Three judges with majority voting catch outliers. Beyond five judges, the marginal reduction in variance rarely justifies the added cost. For high-stakes evaluation like safety or compliance scoring, use five judges and require strong agreement.

What criteria should a CX agent eval rubric include?

A solid CX agent rubric covers five areas: task completion (did the agent resolve or correctly escalate the issue?), factual accuracy (were claims about policies, prices, and features correct?), tone and empathy (was the response appropriate to the customer's emotional state?), policy compliance (did the agent follow required scripts or legal guardrails?), and efficiency (was the response focused or full of unnecessary content?).

What is the difference between offline and online evaluation?

Offline evaluation runs your agent against test sets before deployment, catching regressions before they reach customers. Online evaluation scores live production conversations in near-real time. Both are necessary: offline eval gives you fast feedback during development, while online eval catches drift from real-world distribution shifts that your test set didn't anticipate. According to recent industry data, only 52.4% of agent teams run offline evals regularly.

How do I catch score drift without ground truth?

Track your eval scores over time rather than against a fixed target. When your average 'task completion' score drops from 87% to 79% over two weeks, that's a signal to investigate even if you don't know what the 'correct' score should be. Chanl's scorecard monitoring alerts you to relative drops in any criterion, letting you catch drift from prompt changes, model updates, or distribution shifts in real customer queries.

Can I use open-source models as judges instead of frontier models?

Yes, but with tradeoffs. Smaller judge models are cheaper and faster, but their scoring is less reliable on nuanced criteria like empathy or contextual appropriateness. A good approach is to use a frontier model (Claude, GPT-4, Gemini) as the primary judge for complex criteria and a smaller model for simple binary checks like 'did the agent provide the account number?'. Always validate your judge's agreement with human reviewers before relying on it in production.

How to Eval Agents When There's No Right Answer

A customer writes: "I can't believe I was charged twice. This is completely unacceptable and I've been a customer for years."

What's the correct response?

There isn't one. Your agent could apologize and immediately process a refund. It could apologize, verify the charge, then process the refund. It could express empathy first, pull up the account, confirm the double charge, then explain the refund timeline. All of these could be excellent responses. None of them is "the right answer."

This is the ground truth problem. And it's why most evaluation methods break down for CX agents.

Why Example-Based Eval Fails for CX

Traditional evaluation works by comparing your model's output to a reference answer. If the output matches (or is similar enough), it passes. If it doesn't, it fails.

This works fine for tasks with unambiguous correct answers: math problems, code that either runs or doesn't, factual questions with a single right response. But CX conversations aren't any of those things.

When you try to build a reference-based eval set for a customer service agent, you hit the same wall every time. You write 100 example conversations and "correct" responses. Then your agent produces responses that are clearly good but phrase things differently than your examples, so the eval scores them low. You write more examples to cover the variations. The example set grows to 300, then 500. You're spending more time maintaining examples than improving the agent.

And even then, the eval is wrong. It penalizes your agent for using a different -- but equally valid -- sentence structure. It rewards responses that match your examples even when the examples are mediocre. It's measuring adherence to a template, not quality.

According to recent industry data, 52.4% of agent teams run offline evals regularly. The ones that find eval genuinely useful have almost all moved away from reference matching toward criteria-based scoring. This isn't a new idea, but it's surprisingly underimplemented in CX contexts specifically.

What Criteria-Based Evaluation Actually Is

Criteria-based evaluation scores responses against a set of defined quality dimensions rather than against a reference answer. Instead of asking "does this match what I expected?", you ask "does this meet the standards I've defined for a good response?"

The shift sounds subtle. The practical difference is enormous.

With criteria-based eval, you define things like:

Task completion: Did the agent resolve the issue, or correctly escalate it when it couldn't?
Factual accuracy: Were all claims about policies, prices, and features correct?
Tone: Was the response appropriate to the customer's emotional state?
Policy compliance: Did the agent follow required scripts, avoid prohibited topics, and stay within legal guardrails?
Efficiency: Was the response focused, or padded with unnecessary content?

Now your agent can respond in 10 different ways and all 10 score well, as long as they all complete the task, get the facts right, hit the right tone, comply with policy, and stay focused. You're measuring what makes a response good, not whether it matches a template.

Figure: criteria-based vs. reference-based evaluation flow.

Building a Rubric That Actually Measures Quality

Your rubric is the core of criteria-based eval, and most teams get it wrong the first time. Two common failure modes: criteria too vague to score consistently, and too many criteria to score reliably.

Vague criteria: "Was the response helpful?" is not a scorable criterion. "Helpful" means different things to different judges, and inter-rater agreement will be near random. You need criteria that a judge can evaluate from the conversation alone without needing to infer intent.

Too many criteria: A 20-dimension rubric creates scoring fatigue, inconsistent coverage across conversations, and noisy aggregated scores. The precision you gain from granularity is offset by the noise from trying to score too many things at once.

The sweet spot is 4 to 6 criteria, each defined precisely enough that two different judges applying the same rubric reach the same score on at least 80% of conversations. Here's what works for most CX agents:

rubric.ts·typescript

export const CX_AGENT_RUBRIC = {
  task_completion: {
    description:
      "Did the agent fully address the customer's primary request? " +
      "Score 1 if fully resolved or correctly escalated with context. " +
      "Score 0.5 if partially resolved or escalated without explanation. " +
      "Score 0 if the request was ignored, misunderstood, or closed without resolution.",
    weight: 0.35
  },
  factual_accuracy: {
    description:
      "Were all specific claims about prices, policies, timelines, and features correct? " +
      "Score 1 if all verifiable claims are accurate. " +
      "Score 0.5 if there were minor inaccuracies that didn't affect the resolution. " +
      "Score 0 if a factual error caused a wrong outcome or customer confusion.",
    weight: 0.25
  },
  tone_and_empathy: {
    description:
      "Was the agent's tone appropriate to the customer's emotional state? " +
      "Score 1 if the response acknowledged frustration, matched urgency, and stayed professional. " +
      "Score 0.5 if tone was neutral when empathy was clearly needed. " +
      "Score 0 if the agent was dismissive, defensive, or escalated the customer's frustration.",
    weight: 0.20
  },
  policy_compliance: {
    description:
      "Did the agent follow all required guardrails? " +
      "Score 1 if no violations of required scripts, prohibited topics, or legal constraints. " +
      "Score 0 if any violation of a mandatory policy occurred. " +
      "This criterion fails independently: a policy violation cannot be offset by other high scores.",
    weight: 0.20
  }
};

Notice the explicit scoring anchors (1, 0.5, 0) with described behaviors for each level. Anchored rubrics are the single biggest lever for improving judge consistency. Without them, one judge's "0.7" is another judge's "0.4".
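One way to check the 80% agreement target is to have two judges score the same conversations on one criterion and count how often their anchored scores match. The sketch below is a minimal illustration; `agreementRate` and the sample scores are hypothetical, not part of any library.

```typescript
// Fraction of conversations on which two judges gave the same anchored score.
// Assumes the arrays are aligned: index i refers to the same conversation.
function agreementRate(judgeA: number[], judgeB: number[]): number {
  if (judgeA.length !== judgeB.length || judgeA.length === 0) {
    throw new Error("Score lists must be non-empty and the same length");
  }
  const matches = judgeA.filter((score, i) => score === judgeB[i]).length;
  return matches / judgeA.length;
}

// Hypothetical scores for one criterion across ten conversations
const judgeAScores = [1, 1, 0.5, 0, 1, 0.5, 1, 1, 0, 1];
const judgeBScores = [1, 1, 0.5, 0, 1, 1, 1, 1, 0, 0.5];

const rate = agreementRate(judgeAScores, judgeBScores);
console.log(rate >= 0.8 ? "anchors are consistent enough" : "tighten the anchors");
```

Exact-match agreement only makes sense for anchored, discrete scores. If your judges return continuous values, you would need a tolerance or a correlation measure instead.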

Also notice the weight distribution. Task completion and accuracy together account for 60% of the score. Tone matters, but a response that's warm and inaccurate is worse than one that's curt and correct.
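The rubric says policy compliance "fails independently", which a plain weighted sum can't express on its own. Here is a rough sketch of how the two could be combined: a weighted overall score plus a separate list of gate criteria that failed. The helper `combineScores` and the rule that any gate score below 1 counts as a violation are assumptions for illustration.

```typescript
// Weights copied from the rubric above
const RUBRIC_WEIGHTS: Record<string, number> = {
  task_completion: 0.35,
  factual_accuracy: 0.25,
  tone_and_empathy: 0.2,
  policy_compliance: 0.2,
};

interface CombinedScore {
  overall: number;     // weighted sum in [0, 1]
  hardFails: string[]; // gate criteria that scored below 1
}

// Hypothetical helper: weighted overall score plus an independent hard-fail gate
function combineScores(
  scores: Record<string, number>,
  weights: Record<string, number> = RUBRIC_WEIGHTS,
  gateCriteria: string[] = ["policy_compliance"]
): CombinedScore {
  let overall = 0;
  for (const criterion of Object.keys(weights)) {
    const score = scores[criterion];
    const weight = weights[criterion];
    if (score === undefined || weight === undefined) {
      throw new Error(`Missing score for criterion "${criterion}"`);
    }
    overall += score * weight;
  }
  // Assumption: any gate criterion scoring below 1 counts as a violation
  const hardFails = gateCriteria.filter((c) => (scores[c] ?? 0) < 1);
  return { overall, hardFails };
}

// Task resolved, accurate, and warm, but a mandatory policy was violated
const combined = combineScores({
  task_completion: 1,
  factual_accuracy: 1,
  tone_and_empathy: 1,
  policy_compliance: 0,
});
console.log(combined.overall.toFixed(2), combined.hardFails);
```

In this example the overall score is still 0.80, which is exactly why the violation is reported separately rather than left to the weighted sum.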

Implementing LLM-as-Judge

LLM-as-judge uses a separate language model to apply your rubric at scale, scoring thousands of conversations that human reviewers couldn't reach. You give the judge the conversation, your scoring criteria, and anchors for each score level. The judge returns structured scores and reasoning for every criterion -- fast enough to run on your full production sample, reliable enough to catch meaningful regressions.

According to industry data, 53.3% of teams use LLM-as-judge approaches to scale quality assessment, while 59.8% still use human review for nuanced or high-stakes situations. The pattern that works is using LLM-as-judge for breadth and human review for depth.

Here's a basic LLM-as-judge implementation:

judge.ts·typescript

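import Anthropic from "@anthropic-ai/sdk";
import { CX_AGENT_RUBRIC } from "./rubric";

// Reads ANTHROPIC_API_KEY from the environment by default
const client = new Anthropic();

// One message in the conversation being evaluated
interface ConversationTurn {
  role: string; // e.g. "customer" or "agent"
  content: string;
}

interface JudgeResult {
  criterion: string;
  score: number;
  reasoning: string;
}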
 
async function scoreConversation(
  conversation: ConversationTurn[],
  rubric: typeof CX_AGENT_RUBRIC
): Promise<JudgeResult[]> {
  const conversationText = conversation
    .map((t) => `${t.role.toUpperCase()}: ${t.content}`)
    .join("\n");
 
  const criteriaText = Object.entries(rubric)
    .map(
      ([key, val]) =>
        `CRITERION: ${key}\n` +
        `SCORING GUIDE: ${val.description}\n` +
        `WEIGHT: ${val.weight}`
    )
    .join("\n\n");
 
  const response = await client.messages.create({
    model: "claude-opus-4-7",
    max_tokens: 2048,
    system:
      "You are an expert evaluator for customer service AI agents. " +
      "Score responses strictly based on the provided criteria. " +
      "Return a JSON array of results with criterion, score (0 to 1), and reasoning.",
    messages: [
      {
        role: "user",
        content:
          `CONVERSATION:\n${conversationText}\n\n` +
          `RUBRIC:\n${criteriaText}\n\n` +
          "Score this conversation on each criterion. " +
          "Return a JSON array with {criterion, score, reasoning} for each."
      }
    ]
  });
 
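  // Take the first text block of the reply
  const first = response.content[0];
  const text = first?.type === "text" ? first.text : "";

  // Judges sometimes wrap JSON in prose or a code fence, so extract the array
  const start = text.indexOf("[");
  const end = text.lastIndexOf("]");
  if (start === -1 || end === -1) {
    throw new Error("Judge did not return a JSON array");
  }
  return JSON.parse(text.slice(start, end + 1)) as JudgeResult[];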
}

A few things that matter when you implement this:

Give the judge the full context. The judge needs to see the customer's original message, any prior conversation history, and the agent's response. Scoring a response without knowing what the customer said makes tone_and_empathy impossible to score accurately.

Separate criteria into separate calls for high-stakes evaluations. Asking one judge to score every criterion in a single call can cause criterion bleed -- the judge's score on tone influences its score on task_completion even though they're independent. For high-stakes evaluation (safety, compliance), use one criterion per call and aggregate afterward.

Always ask for reasoning. The reasoning field is the audit trail. When a score looks wrong or a score distribution shifts unexpectedly, reasoning lets you diagnose whether the judge misread the rubric or whether the agent actually regressed.
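To make the one-criterion-per-call idea concrete, here is a minimal sketch that builds a separate judge prompt for each criterion. The function `buildPerCriterionPrompts` is hypothetical; each prompt it returns would be sent in its own judge call, and the single results collected afterward.

```typescript
interface RubricCriterion {
  description: string;
  weight: number;
}

interface CriterionPrompt {
  criterion: string;
  prompt: string;
}

// Build one judge prompt per criterion so each score is produced in isolation
function buildPerCriterionPrompts(
  conversationText: string,
  rubric: Record<string, RubricCriterion>
): CriterionPrompt[] {
  return Object.keys(rubric).map((criterion) => {
    const description = rubric[criterion]?.description ?? "";
    return {
      criterion,
      prompt:
        `CONVERSATION:\n${conversationText}\n\n` +
        `CRITERION: ${criterion}\n` +
        `SCORING GUIDE: ${description}\n\n` +
        "Score this conversation on this criterion only. " +
        "Return a JSON object with {criterion, score, reasoning}.",
    };
  });
}
```

The tradeoff is cost: a four-criterion rubric becomes four calls per judge, so this is usually worth it only for the criteria where bleed would be expensive.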

Figure: example scorecard with an overall rating of "Good" and per-criterion scores: Tone & Empathy 94%, Resolution 88%, Response Time 72%, Compliance 85%.

Aggregating Multiple Judges for Reliability

A single judge call is fast but noisy. On borderline cases -- responses that are genuinely ambiguous -- different judge calls return different scores. This variance compounds when you're tracking trends over time.

The solution is multi-judge aggregation. Run three independent judge calls per conversation and use majority voting (for discrete scores) or averaging (for continuous scores):

multi-judge.ts·typescript

async function scoreWithConsensus(
  conversation: ConversationTurn[],
  rubric: typeof CX_AGENT_RUBRIC,
  numJudges: number = 3
): Promise<Record<string, number>> {
  // Run all judge calls concurrently
  const judgeResults = await Promise.all(
    Array.from({ length: numJudges }, () =>
      scoreConversation(conversation, rubric)
    )
  );
 
  // Aggregate scores per criterion
  const aggregated: Record<string, number> = {};
 
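  for (const criterion of Object.keys(rubric)) {
    // Collect this criterion's score from each judge, skipping judges that
    // omitted it so a missing score is not silently counted as 0
    const scores = judgeResults
      .map((results) => results.find((r) => r.criterion === criterion)?.score)
      .filter((s): s is number => typeof s === "number");

    if (scores.length === 0) {
      throw new Error(`No judge returned a score for "${criterion}"`);
    }

    // Average the scores across judges
    aggregated[criterion] =
      scores.reduce((sum, s) => sum + s, 0) / scores.length;
  }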
 
  return aggregated;
}

Three judges strike the right balance for most CX evals. One judge is fast but catches fewer edge cases. Five judges reduce variance further but rarely justify the added cost for routine evaluation. For compliance or safety scoring where false negatives are costly, use five judges and require agreement from at least four before marking something as passing.
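The consensus function above averages scores. For anchored, discrete scores, a majority vote and a stricter "four of five" gate could look something like the sketch below. Both `majorityVote` and `passesWithQuorum` are hypothetical helpers, and treating a score of 1 as "passing" is an assumption.

```typescript
// Majority vote over discrete anchored scores (e.g. 0, 0.5, 1).
// Returns null when no single score has a strict majority.
function majorityVote(scores: number[]): number | null {
  for (const candidate of scores) {
    const votes = scores.filter((s) => s === candidate).length;
    if (votes > scores.length / 2) return candidate;
  }
  return null;
}

// Strict gate: pass only if at least `required` judges gave a passing score
function passesWithQuorum(
  scores: number[],
  required: number,
  passScore: number = 1
): boolean {
  return scores.filter((s) => s >= passScore).length >= required;
}

// Three judges on a routine criterion: two say 1, one says 0.5
console.log(majorityVote([1, 1, 0.5]));

// Five judges on compliance: require at least four passes
console.log(passesWithQuorum([1, 1, 1, 0, 1], 4));
```

A `null` from `majorityVote` is itself useful: a three-way split is a strong hint that the conversation is borderline and worth routing to a human reviewer.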

Using the Chanl SDK for Scorecards at Scale

Chanl's Scorecards feature handles the judge infrastructure and aggregation, so you define your rubric once and run it against both pre-deployment test sets and live production traffic -- same criteria, both environments, no separate infrastructure to maintain. Here's what that looks like in practice:

chanl-scorecards.ts·typescript

import { Chanl } from "@chanl/sdk";
 
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });
 
// Define your rubric once
const scorecard = await chanl.scorecards.create({
  name: "CX Agent Quality v2",
  criteria: [
    {
      name: "task_completion",
      description: "Did the agent fully address the customer's request?",
      scale: { min: 0, max: 1, anchors: { 0: "ignored", 0.5: "partial", 1: "resolved" } },
      weight: 0.35
    },
    {
      name: "factual_accuracy",
      description: "Were all specific claims about policies and features correct?",
      scale: { min: 0, max: 1, anchors: { 0: "incorrect", 0.5: "minor error", 1: "accurate" } },
      weight: 0.25
    },
    {
      name: "tone_and_empathy",
      description: "Was tone appropriate to the customer's emotional state?",
      scale: { min: 0, max: 1, anchors: { 0: "dismissive", 0.5: "neutral", 1: "empathetic" } },
      weight: 0.20
    },
    {
      name: "policy_compliance",
      description: "Did the agent follow all required guardrails?",
      scale: { min: 0, max: 1, anchors: { 0: "violated", 1: "compliant" } },
      weight: 0.20
    }
  ],
  judgeModel: "claude-opus-4-7",
  numJudges: 3
});
 
// Score a conversation
const result = await chanl.scorecards.evaluate({
  scorecardId: scorecard.id,
  conversationId: "conv_abc123",
  agentId: "cx-agent-v2"
});
 
console.log(result.overallScore); // 0.87
console.log(result.breakdown);    // scores per criterion with reasoning
console.log(result.flags);        // any criteria that failed independently

The flags field is particularly useful for policy compliance: any criterion that fails independently gets flagged regardless of the overall score. An agent that scores 0.92 overall but violated a compliance guardrail should not be treated the same as one that scored 0.92 cleanly.

You can also run scorecards in batch against your test scenarios before deployment, catching regressions before they reach real customers:

chanl-batch-eval.ts·typescript

// Run your scorecard against a scenario set
const batchResult = await chanl.scenarios.run({
  agentId: "cx-agent-v2",
  scenarioSetId: "billing-dispute-scenarios-v3",
  scorecard: scorecard.id,
  compareBaseline: "cx-agent-v1"
});
 
// See exactly where the new version regressed
console.log(batchResult.regressions);
// [{ scenario: "double-charge-frustrated-customer", criterion: "tone_and_empathy", v1: 0.82, v2: 0.61 }]

The post on production eval gaps goes deeper on integrating this into your CI pipeline so regressions block deploys automatically.

Closing the Loop: Offline and Online Evals Together

Criteria-based eval is most powerful when you run it in two places: against test sets before deployment, and against live production traffic continuously.

Offline evals (pre-deployment) catch regressions before they reach customers. Build a test set of 100 to 200 representative conversations from your scenario library, tagged by customer intent and emotional state. Run your scorecard against this set whenever you change the agent's prompt, switch models, or update your tool integrations. If any criterion drops by more than 5 percentage points from baseline, block the deploy.
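As a rough sketch of that deploy gate, the function below compares a candidate's per-criterion scores to a baseline and lists every criterion that dropped by more than 5 points. The name `findRegressions` and the sample numbers are illustrative assumptions; scores are expressed on a 0 to 1 scale, so 5 points is 0.05.

```typescript
interface Regression {
  criterion: string;
  baseline: number;
  candidate: number;
}

// Returns every criterion whose score dropped by more than `maxDrop`
function findRegressions(
  baseline: Record<string, number>,
  candidate: Record<string, number>,
  maxDrop: number = 0.05
): Regression[] {
  const regressions: Regression[] = [];
  for (const criterion of Object.keys(baseline)) {
    const before = baseline[criterion];
    const after = candidate[criterion];
    if (before === undefined || after === undefined) {
      throw new Error(`Candidate run has no score for "${criterion}"`);
    }
    if (before - after > maxDrop) {
      regressions.push({ criterion, baseline: before, candidate: after });
    }
  }
  return regressions;
}

// Hypothetical averages over the same test set for two agent versions
const v1Scores = { task_completion: 0.9, tone_and_empathy: 0.82 };
const v2Scores = { task_completion: 0.88, tone_and_empathy: 0.61 };

const regressions = findRegressions(v1Scores, v2Scores);
console.log(regressions.map((r) => r.criterion));
```

In CI, a non-empty result would fail the build. Here only `tone_and_empathy` is reported, since the 2-point dip in task completion is within tolerance.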

Online evals (production monitoring) catch what your test set doesn't. Real customer traffic has distribution shifts that synthetic or historical test sets can't anticipate. When a new product launches, when there's a PR crisis, when a new customer segment starts using your service -- all of these shift the distribution of conversations your agent sees. Scoring a sample of live traffic (even 5 to 10%) surfaces these shifts before they compound.

Chanl's Monitoring feature runs your scorecard against a configurable sample of production traffic and alerts you when any criterion drops more than a defined threshold from its rolling average. You don't need to define what the "correct" score is. You track relative change, and the alerts fire on meaningful movement.
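The underlying idea is simple enough to sketch without any particular platform. The hypothetical `detectDrop` function below compares the latest score for a criterion to the mean of the preceding window and flags a drop beyond a threshold. It is an illustration of relative-change alerting in general, not a description of how Chanl implements it; the window size and threshold are assumptions.

```typescript
// Flags drift when the latest score falls more than `threshold` below
// the mean of the preceding `window` scores.
function detectDrop(
  history: number[],
  window: number = 7,
  threshold: number = 0.05
): boolean {
  if (history.length < window + 1) return false; // not enough data yet
  const latest = history[history.length - 1];
  const previous = history.slice(-(window + 1), -1);
  const rollingAvg =
    previous.reduce((sum, s) => sum + s, 0) / previous.length;
  return rollingAvg - latest > threshold;
}

// Daily task-completion averages: steady at 0.87, then a drop to 0.79
const dailyScores = [0.87, 0.87, 0.87, 0.87, 0.87, 0.87, 0.87, 0.79];
console.log(detectDrop(dailyScores) ? "investigate" : "stable");
```

No target score appears anywhere in the check. The only inputs are the criterion's own history and how much movement you consider meaningful.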

This is the connection between testing and observability that most teams miss. You build the rubric once, run it pre-deployment for confidence, run it in production for awareness, and both signal into the same Analytics dashboard so you can see trends across both.

What Doesn't Need LLM-as-Judge

Not everything should go through an LLM judge. Some things are cheaper and more reliable to check programmatically:

Structured output compliance: Did the agent return a JSON object with the required fields? Just parse it.

Tool call correctness: Did the agent call process_refund when the customer requested one? Check the tool call log, not the response text. The post on building an eval framework covers tool call testing in depth.

Response length bounds: Is the response between 50 and 400 words? That's a word count, not an LLM call.

Forbidden content: Did the agent mention a competitor by name, quote a price it's not allowed to quote, or use prohibited language? Regex and keyword matching is faster, cheaper, and more reliable than asking a judge.

Use LLM-as-judge for the genuinely fuzzy things: tone, empathy, contextual appropriateness, and nuanced reasoning quality. Use deterministic checks for everything that can be checked deterministically. The combination gives you broad coverage without the cost and latency of routing everything through a judge.
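Here is a minimal sketch of what those deterministic checks might look like in one place. The function name, the shape of the tool call log (a list of tool names), and the example pattern are all assumptions for illustration.

```typescript
interface CheckResult {
  name: string;
  passed: boolean;
}

// Cheap, deterministic checks that run before (or instead of) an LLM judge.
// Pass forbidden patterns without the `g` flag: a global RegExp keeps state
// between test() calls and can give inconsistent results when reused.
function runDeterministicChecks(
  responseText: string,
  toolCalls: string[],
  expectedTool: string,
  forbiddenPatterns: RegExp[]
): CheckResult[] {
  const wordCount = responseText.trim().split(/\s+/).filter(Boolean).length;
  return [
    { name: "tool_call", passed: toolCalls.indexOf(expectedTool) !== -1 },
    { name: "length_bounds", passed: wordCount >= 50 && wordCount <= 400 },
    {
      name: "forbidden_content",
      passed: !forbiddenPatterns.some((p) => p.test(responseText)),
    },
  ];
}

// Example: a refund request where the agent must call process_refund
const checks = runDeterministicChecks(
  "Sorry about the double charge. I've refunded it.",
  ["lookup_account", "process_refund"],
  "process_refund",
  [/competitor\s+inc/i]
);
console.log(checks.filter((c) => !c.passed).map((c) => c.name));
```

The sample response is well under 50 words, so `length_bounds` is the only check that fails. Conversations that fail a deterministic check can often skip the judge entirely, which saves cost on the cases that are already known to be bad.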

The Compounding Value of Consistent Eval

Criteria-based evaluation becomes significantly more valuable after the first 30 to 90 days, once you have a score baseline across criteria, agents, and conversation types. The scores themselves matter less than the trends -- and you can only see trends with consistency.

After a month of scoring production traffic, you know your baseline. After three months, you can see that tone scores drop on Monday mornings (higher volume, more frustrated customers), that task completion scores dropped 8% after last month's prompt update, and that policy compliance scores are higher on billing conversations than on product questions.

These patterns are invisible without consistent eval. And they're the inputs that let you actually improve your agent in a directed way, rather than making prompt changes and hoping things get better.

That's what monitoring your agent actually means. Not watching for errors. Watching for the gradual drift that turns a good agent into a mediocre one before anyone notices.

Your agent doesn't need a correct answer to be evaluated well. It needs clear criteria, consistent judges, and a feedback loop that runs on every conversation.

Add Scorecards to Your Agent in Minutes

Define your quality rubric once. Chanl runs it against your test scenarios pre-deploy and against live production traffic continuously: same criteria, both environments, no extra infrastructure.


Key Takeaway

Testing edge cases before production deployment can reduce customer complaints by 80% and prevent costly emergency fixes post-launch.


Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
