Testing & Evaluation

GPT-5, Claude 4.5, Gemini Score the Same Calls. Their Kappa Is 0.52

Run the same calls through GPT-5, Claude 4.5, and Gemini, and Cohen's kappa lands at 0.52. Here is how to measure judge agreement on your own corpus.

Dean Grover · Co-founder
May 3, 2026
11 min read
Three glowing rubric cards floating in misted air, each marking the same transcript with subtly different ink colors, with a faint kappa heatmap projected on the wall behind them

A QA lead I'll call Maya was finishing a quarterly review when her CTO asked a simple question. The team's LLM judge had been scoring calls for six months. The dashboard showed an average of 4.1 out of 5 across the agent fleet. "What if Claude scored these instead of GPT?" the CTO asked. Three days later Maya had her answer. The Cohen's kappa between the two judges was 0.52. On the empathy dimension specifically, it was 0.38.

The dashboard number had not changed. The trust in the dashboard number had.

Most eval pipelines pick one LLM judge, write a scoring prompt, and ship the average. The number gets put on slides. Trends go up when prompts improve, down when something breaks. Nobody publishes the disagreement matrix because nobody runs a second judge. The score on your dashboard is a lottery ticket dressed as a metric. The way you make it a real metric is to measure inter-rater reliability across multiple judges. This article walks through how.

When one judge looks like ground truth

The reason teams default to a single judge is that the field gave them permission to. The MT-Bench paper from Zheng and colleagues at LMSYS reported that GPT-4 reaches about 80% pairwise agreement with human aggregate judgments, comparable to inter-human agreement of around 81%. The G-Eval paper hit Spearman correlation of 0.514 with humans on summarization, beating prior automatic metrics. Both findings were genuine breakthroughs. Both are also routinely overgeneralized.

Pairwise agreement on a benchmark is not the same as scoring a five-dimension enterprise rubric over hours of customer calls. Eighty percent agreement is also not 0.80 kappa. Percent agreement ignores chance, and chance agreement on imbalanced classes can hit 60% before you start. The literature has known this since Cohen wrote the original kappa paper in 1960. The LLM-as-judge literature is now catching up: a 2024 survey by Gu and colleagues reports inter-judge Cohen's kappa values typically landing between 0.40 and 0.70 on subjective rubrics, with 0.80+ rare outside narrow objective tasks.

So your judge's 4.1 average might be reasonable. But the second judge might give 3.6 on the same calls. And the third might split the difference. Until you run that experiment, you do not know.

What inter-rater reliability actually measures

Cohen's kappa measures how much two raters agree beyond what you would expect by chance. Take the observed agreement, subtract the agreement expected from random labeling given the class distributions, divide by one minus that expected agreement. The result tops out at 1.0 (perfect agreement) and can go below zero if raters disagree more than chance.
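
In code, that definition is only a few lines. A minimal sketch with toy labels invented for illustration, checked against scikit-learn's cohen_kappa_score:

kappa_by_hand.py·python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# toy labels for two raters, invented purely to illustrate the formula
a = np.array([1, 1, 2, 2, 3, 3, 3, 1])
b = np.array([1, 2, 2, 2, 3, 3, 1, 1])

p_o = (a == b).mean()  # observed agreement
# chance agreement: for each label, the product of the two raters' marginal rates
p_e = sum((a == lab).mean() * (b == lab).mean() for lab in np.union1d(a, b))
kappa = (p_o - p_e) / (1 - p_e)

assert np.isclose(kappa, cohen_kappa_score(a, b))  # matches sklearn's unweighted kappa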

Landis and Koch in 1977 proposed thresholds that the field still uses: below 0.20 slight, 0.21 to 0.40 fair, 0.41 to 0.60 moderate, 0.61 to 0.80 substantial, and 0.81 to 1.00 almost perfect. Substantial is the bar most reliability researchers want to clear before treating a measurement as a trustworthy signal.

Three caveats worth knowing before you compute anything. Weighted kappa exists for ordinal scales: if your judges score 1 to 5 and they disagree by one point, that is not the same as disagreeing by four points, and linear or quadratic weights penalize the four-point gap more. Plain Cohen's kappa is also defined for exactly two raters, which is why Krippendorff's alpha exists for larger panels. And even human inter-rater kappa on subjective tasks like empathy or sentiment lands at 0.55 to 0.70. You are not chasing 0.95. You are chasing "are these judges measuring the same thing."
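
To see what the weights buy you, here is a small comparison on invented 1-to-5 ratings with five exact matches and five off-by-one misses. Unweighted kappa counts every miss as a full disagreement; linear and quadratic weights give partial credit for near-misses:

weighted_kappa_demo.py·python
from sklearn.metrics import cohen_kappa_score

# invented ratings: five exact matches, five off-by-one disagreements
judge_a = [1, 2, 3, 4, 5, 2, 3, 4, 5, 1]
judge_b = [2, 3, 4, 5, 5, 2, 3, 4, 4, 1]

for w in (None, "linear", "quadratic"):
    # unweighted penalizes an off-by-one miss exactly like an off-by-four miss;
    # the weighted variants scale the penalty with the size of the gap
    print(w, round(cohen_kappa_score(judge_a, judge_b, weights=w), 3))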

Build a three-judge harness

The first thing you need is a way to run the same rubric through three different judges and get back parseable, comparable scores. The contract is that each judge takes a transcript and a rubric and returns a per-dimension score. We use OpenAI, Anthropic, and Google AI directly because we want to bypass any wrapper and see raw judge behavior.

judge_harness.py·python
import json
from openai import OpenAI
from anthropic import Anthropic
from google import genai
 
RUBRIC = {
    "empathy": "Did the agent acknowledge the customer's feelings?",
    "compliance": "Did the agent follow required disclosure language?",
    "intent": "Did the agent correctly identify the customer's primary goal?",
    "resolution": "Did the agent resolve the customer's issue or set a clear next step?",
    "tone": "Did the agent maintain a professional, calm tone throughout?",
}
 
PROMPT = """You are scoring a customer service call transcript on five dimensions.
For each dimension, return an integer from 1 (poor) to 5 (excellent).
Return ONLY a JSON object with the dimension names as keys.
 
Rubric:
{rubric}
 
Transcript:
{transcript}
"""
 
def score_with_openai(transcript: str) -> dict:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": PROMPT.format(rubric=RUBRIC, transcript=transcript)}],
        response_format={"type": "json_object"},
        temperature=0,  # drop this if the model only accepts its default temperature
    )
    return json.loads(resp.choices[0].message.content)
 
def score_with_anthropic(transcript: str) -> dict:
    client = Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        temperature=0,
        # no JSON response mode here, so we rely on the prompt's
        # "return ONLY a JSON object" instruction
        messages=[{"role": "user", "content": PROMPT.format(rubric=RUBRIC, transcript=transcript)}],
    )
    return json.loads(resp.content[0].text)
 
def score_with_gemini(transcript: str) -> dict:
    client = genai.Client()
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=PROMPT.format(rubric=RUBRIC, transcript=transcript),
        config={"response_mime_type": "application/json", "temperature": 0},
    )
    return json.loads(resp.text)
 
JUDGES = {"gpt-5": score_with_openai, "claude-4.5": score_with_anthropic, "gemini-2.5": score_with_gemini}

Three things to notice. Temperature is zero where the API allows it, because we want to remove run-to-run randomness from the same judge before we measure across judges. Output is constrained to JSON so the parser does not become a confounder. Each judge sees the exact same prompt, the exact same rubric, and the exact same transcript. We change one variable: the model.

Score 500 calls with all three judges

500 calls is the practical sweet spot: large enough that per-dimension kappa stabilizes, small enough that the API bill stays under a thousand dollars across three frontier providers. The output is a 500 by 3 by 5 tensor: 500 calls, 3 judges, 5 dimensions per call.

run_panel.py·python
import pandas as pd
from judge_harness import JUDGES
 
def score_corpus(transcripts: list[dict]) -> pd.DataFrame:
    rows = []
    for t in transcripts:
        for judge_name, judge_fn in JUDGES.items():
            scores = judge_fn(t["transcript"])
            for dim, score in scores.items():
                rows.append({
                    "call_id": t["id"],
                    "judge": judge_name,
                    "dimension": dim,
                    "score": int(score),
                })
    return pd.DataFrame(rows)
 
# panel = score_corpus(load_transcripts(sample_size=500))
# panel.to_parquet("judge_panel.parquet")

Cache the output. You do not want to re-run a thousand-dollar batch because someone closed a notebook. Now you have a long-format DataFrame ready for agreement analysis.
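
Below is one way to make that cache incremental, so a re-run only pays for calls that are not already on disk. The file path and resume logic are assumptions for illustration, not part of the harness above:

cached_panel.py·python
import os
import pandas as pd
from run_panel import score_corpus

CACHE = "judge_panel.parquet"  # assumed cache location

def score_with_cache(transcripts: list[dict]) -> pd.DataFrame:
    # load previously scored calls, if any, and only score the remainder
    done = pd.read_parquet(CACHE) if os.path.exists(CACHE) else None
    scored_ids = set(done["call_id"]) if done is not None else set()
    todo = [t for t in transcripts if t["id"] not in scored_ids]
    if todo:
        fresh = score_corpus(todo)
        done = fresh if done is None else pd.concat([done, fresh], ignore_index=True)
        done.to_parquet(CACHE)
    return done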

Compute pairwise kappa with sklearn

Sklearn ships Cohen's kappa out of the box. We pivot the DataFrame so each row is a call and each column is a judge, then compute pairwise kappa per dimension.

pairwise_kappa.py·python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score
import pandas as pd
 
def pairwise_kappa(panel: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for dim, sub in panel.groupby("dimension"):
        wide = sub.pivot(index="call_id", columns="judge", values="score").dropna()
        for a, b in combinations(wide.columns, 2):
            k = cohen_kappa_score(wide[a], wide[b], weights="linear")
            rows.append({"dimension": dim, "pair": f"{a} vs {b}", "kappa": round(k, 3)})
    return pd.DataFrame(rows).pivot(index="dimension", columns="pair", values="kappa")

What you typically see, anchored to published ranges in the Gu 2024 survey and Ye 2025 CALM study, is something like this. Numbers below are illustrative and consistent with that literature, not measured in this article.

Dimension     gpt-5 vs claude-4.5   gpt-5 vs gemini-2.5   claude-4.5 vs gemini-2.5
intent        0.78                  0.72                  0.75
compliance    0.71                  0.68                  0.70
resolution    0.62                  0.58                  0.60
tone          0.51                  0.47                  0.49
empathy       0.42                  0.38                  0.40

Two patterns. Aggregate kappa across the panel sits somewhere around 0.55, in the moderate band. And the variance across dimensions is enormous: judges agree on intent classification, they barely agree on empathy. Reporting a single aggregate kappa would tell you the panel is "moderate" and hide everything that matters.

Why three judges need Krippendorff's alpha

Pairwise Cohen's kappa across three judges loses information. You end up averaging three numbers that themselves have biases. Krippendorff's alpha gives you one statistic that handles three or more raters, ordinal data, and missing values.

multi_rater_alpha.py·python
import krippendorff
import numpy as np
import pandas as pd
 
def krippendorff_alpha_per_dim(panel: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for dim, sub in panel.groupby("dimension"):
        wide = sub.pivot(index="call_id", columns="judge", values="score")
        # krippendorff expects shape: (n_raters, n_units); NaN slots are tolerated
        # by the library (treated as missing), but if you cannot accept gaps drop
        # them with wide.dropna() before .T.to_numpy().
        data = wide.T.to_numpy()
        alpha = krippendorff.alpha(reliability_data=data, level_of_measurement="ordinal")
        rows.append({"dimension": dim, "krippendorff_alpha": round(alpha, 3)})
    return pd.DataFrame(rows).sort_values("krippendorff_alpha")

Krippendorff suggests alpha at or above 0.80 is acceptable for content analysis, with 0.667 the lowest acceptable for tentative conclusions. On a customer-experience rubric across three frontier judges, alpha tends to sit between 0.40 and 0.70, and the lowest dimensions are usually empathy and tone. Treat that as the empirical reality, not a failure of the eval.

Decompose, do not aggregate

The single biggest takeaway from running this analysis is that one number is the wrong unit of reporting. A scorecard with five dimensions has five separate inter-rater stories. Intent classification might be substantial. Empathy might be fair at best. The aggregate looks moderate, which mixes the trustworthy signal with the unreliable one.

The fix is to publish per-dimension kappa or alpha alongside the score. If your dashboard says "agent fleet scored 4.1 on empathy this week," your dashboard should also say "judge agreement on empathy: alpha 0.42." The reader then knows that the 4.1 has wide error bars and a small change in prompt could swing it.
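
One way to wire that up, assuming the panel DataFrame from run_panel.py and the alpha table from krippendorff_alpha_per_dim. Pooling the mean score across judges is a simplification you may not want; reporting each judge's mean separately is equally reasonable:

dashboard_rows.py·python
import pandas as pd

def dashboard_rows(panel: pd.DataFrame, alphas: pd.DataFrame) -> pd.DataFrame:
    # mean fleet score per dimension, pooled across all three judges
    means = panel.groupby("dimension")["score"].mean().round(2).rename("mean_score")
    # alphas: output of krippendorff_alpha_per_dim, one row per dimension
    return alphas.set_index("dimension").join(means).reset_index()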

Bootstrap confidence intervals around kappa

Kappa is a point estimate. Like any point estimate, it has uncertainty. Two studies with the same true kappa can return 0.55 and 0.61 by sampling alone. The right move is to bootstrap.

bootstrap_kappa.py·python
import numpy as np
from sklearn.metrics import cohen_kappa_score
 
def bootstrap_kappa(judge_a: np.ndarray, judge_b: np.ndarray, n_boot: int = 1000) -> dict:
    rng = np.random.default_rng(seed=42)
    estimates = []
    n = len(judge_a)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        estimates.append(cohen_kappa_score(judge_a[idx], judge_b[idx], weights="linear"))
    estimates = np.array(estimates)
    return {
        "kappa": round(float(estimates.mean()), 3),
        "ci_low": round(float(np.percentile(estimates, 2.5)), 3),
        "ci_high": round(float(np.percentile(estimates, 97.5)), 3),
    }
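
Wiring the bootstrap to the panel is one pivot per dimension. A usage sketch for the empathy dimension and one judge pair, using the judge names from the harness above and the cached parquet from the scoring run:

bootstrap_usage.py·python
import pandas as pd
from bootstrap_kappa import bootstrap_kappa

# panel: the long-format DataFrame produced by score_corpus
panel = pd.read_parquet("judge_panel.parquet")
wide = (
    panel[panel["dimension"] == "empathy"]
    .pivot(index="call_id", columns="judge", values="score")
    .dropna()
)
print(bootstrap_kappa(wide["gpt-5"].to_numpy(), wide["claude-4.5"].to_numpy()))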

A kappa of 0.55 with a 95% CI of [0.42, 0.68] is a different conclusion than 0.55 with a CI of [0.53, 0.57]. The first means you cannot reliably distinguish moderate from substantial agreement. The second means you can. With 100 calls per dimension you typically get CIs about 0.10 wide. With 500 calls they tighten to about 0.04. The bootstrap forces you to be honest about how much you know.

The disagreement queue is a gold mine, not a failure

Once you have multi-judge scores, the calls where judges disagree most are not failures of the eval. They are the calls humans should look at. You can score per-call disagreement as the variance of judge scores across the panel, sort descending, take the bottom decile of agreement, and route those to human review.

disagreement_queue.py·python
import pandas as pd
 
def disagreement_queue(panel: pd.DataFrame, bottom_decile: float = 0.10) -> pd.Series:
    # disagreement score: variance across judges per (call, dimension),
    # averaged over dimensions to get one number per call
    variance_by_call = (
        panel.groupby(["call_id", "dimension"])["score"].var()
        .groupby(level="call_id").mean()
    )
    threshold = variance_by_call.quantile(1 - bottom_decile)
    return variance_by_call[variance_by_call >= threshold].sort_values(ascending=False)

Out of 500 calls, you flag the 50 with the highest judge disagreement. That is your weekly human-review batch. It is also a dataset for prompt iteration. Every call where three frontier judges look at the same transcript and reach different conclusions is a call where the rubric, the agent, or the transcript itself is ambiguous. Fix the ambiguity and your kappa goes up.

Where Chanl fits

If you build the harness above, you have to maintain the cache, the parquet files, the cron job that runs the panel weekly, and the dashboard wiring. Or you can use a scorecards system that already persists per-judge results so the agreement analysis is one query away.

The Chanl SDK exposes scorecards with multi-criteria rubrics and per-criterion results. You can configure separate scorecard runs against different judge models, then pull the per-call results and compute kappa or alpha across them.

multi_judge_chanl.ts·typescript
import { ChanlSDK } from '@chanl/sdk';
 
const sdk = new ChanlSDK({ apiKey: process.env.CHANL_API_KEY! });
 
const scorecard = await sdk.scorecard.create({
  name: 'CX Quality 5D',
  description: 'Empathy, compliance, intent, resolution, tone',
  status: 'active',
  passingThreshold: 70,
  scoringAlgorithm: 'weighted_average',
});
 
// run the same scorecard with three different judge configs (one per workspace
// agent or eval profile), persist all per-criterion results
for (const callId of last500CallIds) {
  await sdk.scorecard.evaluate(callId, { scorecardId: scorecard.data!.id, force: true });
}
 
// pull per-call, per-criterion results to compute kappa per dimension
for (const callId of last500CallIds) {
  const { data } = await sdk.scorecard.getResultsByCall(callId);
  // each result.criteriaResults entry is one judge's score on one dimension
  // export to your kappa script
}

A few things in this pattern do not yet have first-class API support. Formal ensemble scoring (judges: [...] with explicit aggregation) and an auto-generated disagreement report are not shipped. We treat them as product gaps to fill, not features to pretend exist. For now, you run the panel as separate evaluations and aggregate in your own code. See Scorecards, plus 12 Ways Your LLM Judge Is Lying to You for the related single-judge bias story and Beyond LLM Judge for what to do when the agreement number is too low to fix.

What to do Monday

Pick three judges from different model families. Use temperature zero and JSON-constrained output. Score at least 100 calls, ideally 500. Compute pairwise Cohen's kappa with sklearn and Krippendorff's alpha across all three. Decompose by rubric dimension and publish each one separately, never the aggregate alone. Bootstrap a 95% confidence interval so the kappa itself has error bars. Build the disagreement queue and route the bottom decile to human review. Make the per-dimension agreement number a deployment gate next to your accuracy number.

The single judge score on your dashboard is not wrong. It is just incomplete. Add the second judge and the third, measure how often they agree, and the score becomes a metric instead of a guess.

Score every call with multiple judges, measure agreement

Chanl scorecards persist per-criterion results from any judge configuration, so inter-rater agreement is one query away.

Try Scorecards
Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

