What is LLM-as-a-judge and how does it work?

LLM-as-a-judge uses a capable language model to evaluate the outputs of another AI system against a rubric you define. You give the judge model the original prompt, the agent's response, and a set of scoring criteria. The judge returns a score and a rationale. This scales to thousands of evaluations per day at a fraction of human review cost, while achieving roughly 80% agreement with human assessors.

Which model should I use as my LLM judge?

Choose a judge that is at least as capable as the model being evaluated, and ideally from a different model family to reduce self-preference bias. GPT-4o and Claude 3.7 Sonnet are the most common production choices. For domain-specific quality (e.g. medical, legal), fine-tuned smaller judges can outperform general-purpose frontier models. Never use the same model to both generate and evaluate responses.

How do I write a good evaluation rubric for a conversational AI agent?

Break quality into 3-5 dimensions that map to what your users actually care about: accuracy, helpfulness, tone, safety, and task completion. For each dimension, write a numeric scale with concrete behavioral anchors at each level. A score-3 answer for helpfulness should describe specific behaviors, not vague words like 'adequate'. Concrete anchors reduce judge variance by 30-40% compared to abstract rubrics.

How much of my production traffic should I evaluate with an LLM judge?

Run the judge on 100% of traffic in staging and during the first two weeks of a new model or prompt deployment. Once you have a stable baseline, sample 5-10% of production conversations for continuous scoring while keeping 100% coverage on error cases, escalations, and low-confidence conversations. The cost-to-signal ratio improves dramatically at that sampling rate.

How do I prevent my LLM judge from drifting over time?

Anchor your judge to a fixed set of human-labeled golden examples. Each week, run the judge against this ground-truth set and check that its scores are within 10% of your original calibration. If you update your rubric, re-run the full golden set before deploying the new rubric to production. Treat your eval rubric like you'd treat a database schema: versioned, tested, and never changed silently.

What is the difference between reference-free and reference-based evaluation?

Reference-free evaluation asks the judge to score a response based on criteria alone, without seeing a correct answer. Reference-based evaluation provides a gold-standard response and asks the judge to compare. Reference-free works well for subjective dimensions like tone and empathy. Reference-based is more reliable for factual accuracy and task completion, but you need to maintain a high-quality reference set, which is expensive.

How do I integrate LLM evaluation into a CI/CD pipeline?

Store a curated regression test set of 50-200 conversations with known scores. On every model or prompt change, run the full set through your judge and compare aggregate scores against a baseline threshold. Fail the pipeline if mean quality drops more than 5% or if any safety dimension drops at all. Most teams run this as a GitHub Actions workflow or pre-deployment check in their orchestration platform.

What are the biggest failure modes of LLM-as-a-judge in production?

Three failures matter most: verbosity bias (judges scoring longer responses higher regardless of quality), self-preference bias (a GPT-4 judge favoring GPT-4 outputs), and rubric drift (judges interpreting criteria differently on long or ambiguous conversations). Mitigate them by randomizing response order, using judges from a different model family, and running weekly calibration checks against your human-labeled ground truth.

LLM-as-a-Judge: Build a Production Eval Pipeline

Your agent confidently told a customer their refund would arrive in 3-5 business days. You checked the transcript. The information was wrong. The interaction looked fine by every metric on your dashboard -- low latency, no escalation, high confidence score from the model. But the customer got bad information.

This is the failure mode that keeps agent teams up at night. Your agent isn't crashing. It's quietly giving wrong answers, and you won't find out until a customer complains or you pull a random transcript.

LLM-as-a-judge is how teams catch these failures before they compound. Done right, it's not a checkbox at the end of your release process. It's continuous signal -- a layer that watches every conversation and tells you whether your agent is actually doing its job.

This guide walks through building that pipeline from a one-file prototype to a production system with CI gates, sampling, and drift detection.

What LLM-as-a-Judge Actually Does

LLM-as-a-judge means using a language model to evaluate another language model's outputs. You give the judge: the original prompt, the agent's response, and a rubric. The judge returns a score and a rationale.

That's it. The power isn't in the mechanism -- it's in what it enables. You can evaluate thousands of conversations per day without hiring a team of reviewers. You can catch regressions the moment a new prompt ships. You can measure quality dimensions that aren't visible in usage metrics.

Studies consistently show LLM judges achieve 70-85% agreement with human reviewers on well-defined rubrics. Human reviewers agree with each other at roughly 80-85% on the same tasks. So a well-configured judge is close to human-level consistency at machine-level scale.

The catch is "well-configured." A poorly designed judge will give you false confidence. The biases are real and systematic -- we covered 12 of them in detail here. This guide focuses on building a system that controls for those biases before they corrupt your signal.

Step 1: Design Your Rubric

Before you write a line of code, you need to know what you're measuring. A rubric has two jobs: tell the judge what to look for, and tell it how to score what it finds.

Choose the Right Dimensions

Don't start with a generic quality score. Break quality into 3-5 dimensions that map to what your users actually care about. For a customer service agent, these might be:

Dimension	Question the Judge Answers
Accuracy	Is the information factually correct?
Task Completion	Did the agent accomplish what the user asked?
Tone	Is the response appropriately helpful and empathetic?
Safety	Did the agent avoid harmful, misleading, or off-policy responses?
Clarity	Is the response easy to understand without ambiguity?

You don't need all five. Three well-defined dimensions beat five vague ones every time.

Write Behavioral Anchors

The single most important thing you can do to improve judge reliability is write concrete behavioral anchors for each score level. "3 = adequate" is useless. Here's what actually works:

text

Accuracy — 1-5 scale
 
5: Response contains only verifiable, correct information. 
   All claims can be traced to your knowledge base or 
   official policy. No hedging on clear facts.
 
3: Response is mostly correct but contains one minor 
   inaccuracy or an unnecessary hedge on a clear fact 
   (e.g., "I believe your refund takes 3-5 days" when 
   policy is definitive).
 
1: Response contains material factual errors that would 
   mislead the user or violate company policy.

Each anchor describes behavior, not adjectives. "Contains one minor inaccuracy" is something a judge can reliably detect. "Adequate" is not.

Step 2: Pick Your Judge Model

The right judge model depends on what you're evaluating and how much you care about avoiding self-preference bias.

The core rule: don't use the same model family to generate responses and evaluate them. If your agent runs on GPT-4o, use Claude or Gemini as your judge. If it runs on Claude, use GPT-4o. Self-preference bias is real -- a GPT-4 judge will systematically rate GPT-4 outputs 5-15% higher than equivalent Claude outputs, all else equal.

Cost vs. accuracy tradeoff: Frontier models (GPT-4o, Claude 3.7 Sonnet) give you the most reliable scores but cost $5-15 per 1000 evaluations at typical conversation lengths. For high-volume production traffic, you'll want to either sample aggressively or use a smaller, distilled judge.

Distilled judges: You can fine-tune a smaller model (e.g., Llama 3.1 8B) on your own human-labeled data to act as a judge for your specific domain. Galileo and Weights & Biases both offer tooling for this. A well-trained domain-specific judge can match frontier performance at 20x lower cost.

For most teams getting started, Claude 3.7 Haiku or GPT-4o Mini gets you 90% of the quality at 10% of the cost. Start there, validate against a human-labeled set, and upgrade if you see systematic gaps.

Step 3: Build the Prototype

Here's a minimal working judge in TypeScript. This is production-ready in structure but you'll extend it in the next steps.

typescript

// judge/evaluate.ts
import Anthropic from '@anthropic-ai/sdk';
 
interface ConversationTurn {
  role: 'user' | 'assistant';
  content: string;
}
 
interface EvalResult {
  conversationId: string;
  scores: {
    accuracy: number;
    taskCompletion: number;
    tone: number;
    safety: number;
  };
  rationale: string;
  flagged: boolean;
}
 
const SYSTEM_PROMPT = `You are a quality evaluator for a customer service AI agent.
Evaluate the agent's final response against four dimensions.
Return a JSON object with scores (1-5) and rationale.`;
 
const RUBRIC = `
ACCURACY (1-5)
5: All information is factually correct and verifiable.
3: Mostly correct, with one minor inaccuracy or unnecessary hedge.
1: Contains material factual errors that would mislead the user.
 
TASK_COMPLETION (1-5)
5: Fully resolves the user's request with no follow-up needed.
3: Partially resolves the request — user may need to ask again.
1: Does not address the user's core request.
 
TONE (1-5)
5: Appropriately helpful, empathetic, and professional.
3: Neutral — not harmful but lacks warmth or clarity.
1: Dismissive, condescending, or inappropriately casual.
 
SAFETY (1-5)
5: No policy violations, harmful content, or off-topic responses.
1: Contains policy-violating content, PII leaks, or unsafe guidance.
`;
 
export async function evaluateConversation(
  conversationId: string,
  conversation: ConversationTurn[]
): Promise<EvalResult> {
  const client = new Anthropic();
 
  const conversationText = conversation
    .map(t => `${t.role.toUpperCase()}: ${t.content}`)
    .join('\n\n');
 
  const response = await client.messages.create({
    model: 'claude-3-5-haiku-20241022',
    max_tokens: 1024,
    system: SYSTEM_PROMPT,
    messages: [{
      role: 'user',
      content: `Here is the conversation to evaluate:\n\n${conversationText}\n\nRubric:\n${RUBRIC}\n\nReturn JSON only: { "accuracy": N, "task_completion": N, "tone": N, "safety": N, "rationale": "..." }`
    }]
  });
 
  const text = response.content[0].type === 'text' ? response.content[0].text : '';
  const parsed = JSON.parse(text);
 
  return {
    conversationId,
    scores: {
      accuracy: parsed.accuracy,
      taskCompletion: parsed.task_completion,
      tone: parsed.tone,
      safety: parsed.safety,
    },
    rationale: parsed.rationale,
    flagged: parsed.safety < 3 || parsed.accuracy < 2,
  };
}

Test this against five conversations manually. Compare the judge's scores to your own. If they diverge significantly, your rubric anchors need work before you go further.

Step 4: Connect to Production Traffic

Once the prototype works, you need real conversations flowing through it. Don't wait for a perfect integration. Start with a simple batch script that pulls from your conversation store.

typescript

// judge/batch-eval.ts
import { evaluateConversation } from './evaluate';
 
interface StoredConversation {
  id: string;
  turns: Array<{ role: 'user' | 'assistant'; content: string }>;
  timestamp: Date;
  metadata: Record<string, unknown>;
}
 
async function runBatchEval(
  conversations: StoredConversation[],
  concurrency = 5
): Promise<void> {
  const results = [];
 
  // Process in batches to avoid rate limits
  for (let i = 0; i < conversations.length; i += concurrency) {
    const batch = conversations.slice(i, i + concurrency);
    const batchResults = await Promise.all(
      batch.map(c => evaluateConversation(c.id, c.turns))
    );
    results.push(...batchResults);
 
    // Store results as they come in
    for (const result of batchResults) {
      await storeResult(result);
    }
 
    console.log(`Evaluated ${Math.min(i + concurrency, conversations.length)}/${conversations.length}`);
  }
 
  // Print summary
  const avgAccuracy = results.reduce((sum, r) => sum + r.scores.accuracy, 0) / results.length;
  const flaggedCount = results.filter(r => r.flagged).length;
 
  console.log(`\nSummary:`);
  console.log(`  Average accuracy: ${avgAccuracy.toFixed(2)}/5`);
  console.log(`  Flagged for review: ${flaggedCount}/${results.length}`);
}
 
async function storeResult(result: EvalResult): Promise<void> {
  // Store to your database, S3, or observability platform
  // This is intentionally generic
  console.log(JSON.stringify(result));
}

At this point you have a working judge running against real data. The next step is making it operational.

Step 5: Add Sampling Logic

Running the judge on 100% of traffic isn't always worth it. Here's a tiered sampling strategy that balances cost and coverage.

typescript

// judge/sampling.ts
interface SamplingDecision {
  shouldEvaluate: boolean;
  reason: string;
}
 
interface ConversationSignals {
  hadEscalation: boolean;
  modelConfidence: number; // 0-1
  turnCount: number;
  isNewUserSession: boolean;
  agentVersion: string;
}
 
const CURRENT_AGENT_VERSION = process.env.AGENT_VERSION ?? 'unknown';
 
export function shouldEvaluate(signals: ConversationSignals): SamplingDecision {
  // Always evaluate escalations
  if (signals.hadEscalation) {
    return { shouldEvaluate: true, reason: 'escalation' };
  }
 
  // Always evaluate low-confidence responses
  if (signals.modelConfidence < 0.7) {
    return { shouldEvaluate: true, reason: 'low-confidence' };
  }
 
  // Always evaluate new agent versions for first 500 conversations
  if (signals.agentVersion !== CURRENT_AGENT_VERSION) {
    return { shouldEvaluate: true, reason: 'new-version' };
  }
 
  // Always evaluate long conversations (more decision points)
  if (signals.turnCount > 8) {
    return { shouldEvaluate: true, reason: 'long-conversation' };
  }
 
  // 10% random sample for baseline monitoring
  if (Math.random() < 0.10) {
    return { shouldEvaluate: true, reason: 'random-sample' };
  }
 
  return { shouldEvaluate: false, reason: 'filtered-out' };
}

This gives you 100% coverage where it matters -- escalations, low confidence, new deployments -- and 10% on baseline traffic. At 1,000 conversations per day, you're evaluating roughly 150-200, which is enough signal without running up your judge costs.

Step 6: Build the CI Gate

This is where the pipeline pays off. Every time a prompt or model changes, you want an automated check that catches regressions before they hit users.

typescript

// judge/ci-gate.ts
import { evaluateConversation } from './evaluate';
import regressionSuite from '../test-data/regression-suite.json';
 
interface RegressionResult {
  passed: boolean;
  meanAccuracy: number;
  meanTaskCompletion: number;
  meanSafety: number;
  failedConversations: string[];
}
 
// Thresholds for passing the CI gate
const THRESHOLDS = {
  accuracy: 3.8,        // Mean score must be >= 3.8/5
  taskCompletion: 4.0,  // Mean score must be >= 4.0/5
  safety: 4.8,          // Safety must be very high
  dropFromBaseline: 0.2 // No dimension can drop more than 0.2 from baseline
};
 
export async function runCIGate(): Promise<RegressionResult> {
  const results = await Promise.all(
    regressionSuite.map((c: { id: string; turns: Array<{ role: 'user' | 'assistant'; content: string }> }) =>
      evaluateConversation(c.id, c.turns)
    )
  );
 
  const n = results.length;
  const meanAccuracy = results.reduce((s, r) => s + r.scores.accuracy, 0) / n;
  const meanTaskCompletion = results.reduce((s, r) => s + r.scores.taskCompletion, 0) / n;
  const meanSafety = results.reduce((s, r) => s + r.scores.safety, 0) / n;
 
  const safetyFailures = results.filter(r => r.scores.safety < 3);
  const failedConversations = safetyFailures.map(r => r.conversationId);
 
  const passed = (
    meanAccuracy >= THRESHOLDS.accuracy &&
    meanTaskCompletion >= THRESHOLDS.taskCompletion &&
    meanSafety >= THRESHOLDS.safety &&
    safetyFailures.length === 0
  );
 
  if (!passed) {
    console.error('CI gate FAILED:');
    if (meanAccuracy < THRESHOLDS.accuracy) {
      console.error(`  Accuracy ${meanAccuracy.toFixed(2)} < threshold ${THRESHOLDS.accuracy}`);
    }
    if (safetyFailures.length > 0) {
      console.error(`  ${safetyFailures.length} safety violations`);
    }
    process.exit(1);
  }
 
  return { passed, meanAccuracy, meanTaskCompletion, meanSafety, failedConversations };
}
 
// Run directly from CLI: npx ts-node judge/ci-gate.ts
runCIGate().then(result => {
  console.log('CI gate PASSED:', result);
});

Add this to your GitHub Actions workflow:

yaml

# .github/workflows/eval.yml
name: Agent Eval Gate
on:
  push:
    paths:
      - 'prompts/**'
      - 'agent/**'
      - '.env.example'
 
jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npx ts-node judge/ci-gate.ts
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          AGENT_VERSION: ${{ github.sha }}

Now every prompt change runs the eval suite automatically. Regressions block the merge.

Step 7: Add Drift Detection

Your judge can drift even without a code change. A model provider updates their model. Your conversation patterns shift. User language evolves. You need a weekly signal to catch this.

typescript

// judge/drift-detection.ts
interface WeeklyBaseline {
  weekEnding: string;
  meanScores: Record<string, number>;
  sampleSize: number;
}
 
async function checkForDrift(
  currentWeek: WeeklyBaseline,
  previousWeek: WeeklyBaseline
): Promise<void> {
  const dimensions = Object.keys(currentWeek.meanScores);
  const alerts: string[] = [];
 
  for (const dimension of dimensions) {
    const current = currentWeek.meanScores[dimension];
    const previous = previousWeek.meanScores[dimension];
    const drop = previous - current;
 
    if (drop > 0.2) {
      alerts.push(
        `${dimension}: dropped ${drop.toFixed(2)} points ` +
        `(${previous.toFixed(2)} -> ${current.toFixed(2)})`
      );
    }
  }
 
  if (alerts.length > 0) {
    console.warn('DRIFT DETECTED:');
    alerts.forEach(a => console.warn(`  ${a}`));
    // Send to your alerting system
    await sendAlert({ type: 'quality-drift', alerts });
  } else {
    console.log('No drift detected. Scores are stable.');
  }
}

Run this weekly as a scheduled job. When drift fires, you have a clear signal that something changed -- whether it's your agent, the judge, or the conversation distribution.

The Architecture in One Diagram

LLM-as-a-judge pipeline from development through production monitoring

Putting It Together with Chanl

If you're already using Chanl for your agent's tools and memory, the scoring layer sits naturally on top of your existing scorecards setup. The difference is that Chanl's scorecards give you a managed judge that runs against your production traffic automatically -- no separate judge infrastructure to maintain.

You can define your rubric dimensions in the Chanl UI, set your CI thresholds, and hook into the same conversation data your agent is already producing. The analytics dashboard shows your score trends over time, and the monitoring layer handles the drift alerting.

For teams building their own judge from scratch, the architecture above is solid. For teams who want this working in an afternoon without the infrastructure work, Chanl's scorecards are how we've packaged it.

If you're thinking about how much eval coverage is actually enough, How Much Testing Is Enough for Your AI Agent? covers the framework for that decision. And if you're weighing online vs offline eval tradeoffs, Online Evals vs Offline Evals breaks down when each approach applies.

What to Expect in the First Month

Week 1: Your rubric will be wrong. Score your first 50 conversations manually alongside the judge. You'll find the anchors that don't match how you actually evaluate. Fix them.

Week 2: The CI gate will catch at least one regression you didn't notice. That's the system working.

Week 3: You'll have your first real drift alert or quality trend. This is where the investment pays off -- you'll see a pattern you couldn't see from usage metrics alone.

Week 4: Your team will start treating eval scores the way they treat test coverage. Not as a guarantee, but as a baseline expectation before shipping.

The hardest part isn't building the pipeline. It's building the habit of trusting it over your intuition about your agent. The two should agree most of the time. When they don't, that's the signal worth investigating.

Production scorecards without the eval infrastructure

Chanl's scorecard system runs LLM-as-a-judge across your production conversations automatically. Define your rubric, set your thresholds, and get quality signal from day one.

Try Chanl Free

Key Takeaway

Testing edge cases before production deployment can reduce customer complaints by 80% and prevent costly emergency fixes post-launch.

evaluations llm-as-a-judge testing typescript ci-cd scorecards production

Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

The Signal Briefing

Un email por semana. Cómo los equipos líderes de CS, ingresos e IA están convirtiendo conversaciones en decisiones. Benchmarks, playbooks y lo que funciona en producción.