Testing & Evaluation

GPT-5, Claude 4.5, Gemini Score the Same Calls. Their Kappa Is 0.52

Run the same calls through GPT-5, Claude 4.5, and Gemini, and Cohen's kappa lands at 0.52. Here is how to measure judge agreement on your own corpus.

Dean Grover · Co-founder
May 3, 2026
11 min read
Three glowing rubric cards floating in misted air, each marking the same transcript with subtly different ink colors, with a faint kappa heatmap projected on the wall behind them

A QA lead I'll call Maya was finishing a quarterly review when her CTO asked a simple question. The team's LLM judge had been scoring calls for six months. The dashboard showed an average of 4.1 out of 5 across the agent fleet. "What if Claude scored these instead of GPT?" the CTO asked. Three days later Maya had her answer. The Cohen's kappa between the two judges was 0.52. On the empathy dimension specifically, it was 0.38.

The dashboard number had not changed. The trust in the dashboard number had.

Most eval pipelines pick one LLM judge, write a scoring prompt, and ship the average. The number gets put on slides. Trends go up when prompts improve, down when something breaks. Nobody publishes the disagreement matrix because nobody runs a second judge. The score on your dashboard is a lottery ticket dressed as a metric. The way you make it a real metric is to measure inter-rater reliability across multiple judges. This article walks through how.

When one judge looks like ground truth

The reason teams default to a single judge is that the field gave them permission to. The MT-Bench paper from Zheng and colleagues at LMSYS reported that GPT-4 reaches about 80% pairwise agreement with human aggregate judgments, comparable to inter-human agreement of around 81%. The G-Eval paper hit Spearman correlation of 0.514 with humans on summarization, beating prior automatic metrics. Both findings were genuine breakthroughs. Both are also routinely overgeneralized.

Pairwise agreement on a benchmark is not the same as scoring a five-dimension enterprise rubric over hours of customer calls. Eighty percent agreement is also not 0.80 kappa. Percent agreement ignores chance, and chance agreement on imbalanced classes can hit 60% before you start. The literature has known this since Cohen wrote the original kappa paper in 1960. The LLM-as-judge literature is now catching up: a 2024 survey by Gu and colleagues reports inter-judge Cohen's kappa values typically landing between 0.40 and 0.70 on subjective rubrics, with 0.80+ rare outside narrow objective tasks.

So your judge's 4.1 average might be reasonable. But the second judge might give 3.6 on the same calls. And the third might split the difference. Until you run that experiment, you do not know.

What inter-rater reliability actually measures

Cohen's kappa measures how much two raters agree beyond what you would expect by chance. Take the observed agreement, subtract the agreement expected from random labeling given the class distributions, divide by one minus that expected agreement. The result tops out at 1.0 (perfect agreement) and can go below zero if raters disagree more than chance.
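
In code, that definition is only a few lines. A minimal sketch with toy labels invented for illustration, checked against scikit-learn's cohen_kappa_score:

kappa_by_hand.py·python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# toy labels for two raters, invented purely to illustrate the formula
a = np.array([1, 1, 2, 2, 3, 3, 3, 1])
b = np.array([1, 2, 2, 2, 3, 3, 1, 1])

p_o = (a == b).mean()  # observed agreement
# chance agreement: for each label, the product of the two raters' marginal rates
p_e = sum((a == lab).mean() * (b == lab).mean() for lab in np.union1d(a, b))
kappa = (p_o - p_e) / (1 - p_e)

assert np.isclose(kappa, cohen_kappa_score(a, b))  # matches sklearn's unweighted kappa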

Landis and Koch in 1977 proposed thresholds that the field still uses: below 0.20 slight, 0.21 to 0.40 fair, 0.41 to 0.60 moderate, 0.61 to 0.80 substantial, and 0.81 to 1.00 almost perfect. Substantial is the bar most reliability researchers want to clear before treating a measurement as a trustworthy signal.

Three caveats worth knowing before you compute anything. Weighted kappa exists for ordinal scales: if your judges score 1 to 5 and they disagree by one point, that is not the same as disagreeing by four points, and linear or quadratic weights penalize the four-point gap more. Plain Cohen's kappa is also defined for exactly two raters, which is why Krippendorff's alpha exists for larger panels. And even human inter-rater kappa on subjective tasks like empathy or sentiment lands at 0.55 to 0.70. You are not chasing 0.95. You are chasing "are these judges measuring the same thing."
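
To see what the weights buy you, here is a small comparison on invented 1-to-5 ratings with five exact matches and five off-by-one misses. Unweighted kappa counts every miss as a full disagreement; linear and quadratic weights give partial credit for near-misses:

weighted_kappa_demo.py·python
from sklearn.metrics import cohen_kappa_score

# invented ratings: five exact matches, five off-by-one disagreements
judge_a = [1, 2, 3, 4, 5, 2, 3, 4, 5, 1]
judge_b = [2, 3, 4, 5, 5, 2, 3, 4, 4, 1]

for w in (None, "linear", "quadratic"):
    # unweighted penalizes an off-by-one miss exactly like an off-by-four miss;
    # the weighted variants scale the penalty with the size of the gap
    print(w, round(cohen_kappa_score(judge_a, judge_b, weights=w), 3))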

Build a three-judge harness

The first thing you need is a way to run the same rubric through three different judges and get back parseable, comparable scores. The contract is that each judge takes a transcript and a rubric and returns a per-dimension score. We use OpenAI, Anthropic, and Google AI directly because we want to bypass any wrapper and see raw judge behavior.

judge_harness.py·python
import json
from openai import OpenAI
from anthropic import Anthropic
from google import genai
 
RUBRIC = {
    "empathy": "Did the agent acknowledge the customer's feelings?",
    "compliance": "Did the agent follow required disclosure language?",
    "intent": "Did the agent correctly identify the customer's primary goal?",
    "resolution": "Did the agent resolve the customer's issue or set a clear next step?",
    "tone": "Did the agent maintain a professional, calm tone throughout?",
}
 
PROMPT = """You are scoring a customer service call transcript on five dimensions.
For each dimension, return an integer from 1 (poor) to 5 (excellent).
Return ONLY a JSON object with the dimension names as keys.
 
Rubric:
{rubric}
 
Transcript:
{transcript}
"""
 
def score_with_openai(transcript: str) -> dict:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": PROMPT.format(rubric=RUBRIC, transcript=transcript)}],
        response_format={"type": "json_object"},
        temperature=0,  # drop this if the model only accepts its default temperature
    )
    return json.loads(resp.choices[0].message.content)
 
def score_with_anthropic(transcript: str) -> dict:
    client = Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        temperature=0,
        # no JSON response mode here, so we rely on the prompt's
        # "return ONLY a JSON object" instruction
        messages=[{"role": "user", "content": PROMPT.format(rubric=RUBRIC, transcript=transcript)}],
    )
    return json.loads(resp.content[0].text)
 
def score_with_gemini(transcript: str) -> dict:
    client = genai.Client()
    resp = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=PROMPT.format(rubric=RUBRIC, transcript=transcript),
        config={"response_mime_type": "application/json", "temperature": 0},
    )
    return json.loads(resp.text)
 
JUDGES = {"gpt-5": score_with_openai, "claude-4.5": score_with_anthropic, "gemini-2.5": score_with_gemini}

Three things to notice. Temperature is zero where the API allows it, because we want to remove run-to-run randomness from the same judge before we measure across judges. Output is constrained to JSON so the parser does not become a confounder. Each judge sees the exact same prompt, the exact same rubric, and the exact same transcript. We change one variable: the model.

Score 500 calls with all three judges

500 calls is the practical sweet spot: large enough that per-dimension kappa stabilizes, small enough that the API bill stays under a thousand dollars across three frontier providers. The output is a 500 by 3 by 5 tensor: 500 calls, 3 judges, 5 dimensions per call.

run_panel.py·python
import pandas as pd
from judge_harness import JUDGES
 
def score_corpus(transcripts: list[dict]) -> pd.DataFrame:
    rows = []
    for t in transcripts:
        for judge_name, judge_fn in JUDGES.items():
            scores = judge_fn(t["transcript"])
            for dim, score in scores.items():
                rows.append({
                    "call_id": t["id"],
                    "judge": judge_name,
                    "dimension": dim,
                    "score": int(score),
                })
    return pd.DataFrame(rows)
 
# panel = score_corpus(load_transcripts(sample_size=500))
# panel.to_parquet("judge_panel.parquet")

Cache the output. You do not want to re-run a thousand-dollar batch because someone closed a notebook. Now you have a long-format DataFrame ready for agreement analysis.
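
Below is one way to make that cache incremental, so a re-run only pays for calls that are not already on disk. The file path and resume logic are assumptions for illustration, not part of the harness above:

cached_panel.py·python
import os
import pandas as pd
from run_panel import score_corpus

CACHE = "judge_panel.parquet"  # assumed cache location

def score_with_cache(transcripts: list[dict]) -> pd.DataFrame:
    # load previously scored calls, if any, and only score the remainder
    done = pd.read_parquet(CACHE) if os.path.exists(CACHE) else None
    scored_ids = set(done["call_id"]) if done is not None else set()
    todo = [t for t in transcripts if t["id"] not in scored_ids]
    if todo:
        fresh = score_corpus(todo)
        done = fresh if done is None else pd.concat([done, fresh], ignore_index=True)
        done.to_parquet(CACHE)
    return done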

Compute pairwise kappa with sklearn

Sklearn ships Cohen's kappa out of the box. We pivot the DataFrame so each row is a call and each column is a judge, then compute pairwise kappa per dimension.

pairwise_kappa.py·python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score
import pandas as pd
 
def pairwise_kappa(panel: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for dim, sub in panel.groupby("dimension"):
        wide = sub.pivot(index="call_id", columns="judge", values="score").dropna()
        for a, b in combinations(wide.columns, 2):
            k = cohen_kappa_score(wide[a], wide[b], weights="linear")
            rows.append({"dimension": dim, "pair": f"{a} vs {b}", "kappa": round(k, 3)})
    return pd.DataFrame(rows).pivot(index="dimension", columns="pair", values="kappa")

What you typically see, anchored to published ranges in the Gu 2024 survey and Ye 2025 CALM study, is something like this. Numbers below are illustrative and consistent with that literature, not measured in this article.

Dimension     gpt-5 vs claude-4.5   gpt-5 vs gemini-2.5   claude-4.5 vs gemini-2.5
intent        0.78                  0.72                  0.75
compliance    0.71                  0.68                  0.70
resolution    0.62                  0.58                  0.60
tone          0.51                  0.47                  0.49
empathy       0.42                  0.38                  0.40

Two patterns. Aggregate kappa across the panel sits somewhere around 0.55, in the moderate band. And the variance across dimensions is enormous: judges agree on intent classification, they barely agree on empathy. Reporting a single aggregate kappa would tell you the panel is "moderate" and hide everything that matters.

Why three judges need Krippendorff's alpha

Pairwise Cohen's kappa across three judges loses information. You end up averaging three numbers that themselves have biases. Krippendorff's alpha gives you one statistic that handles three or more raters, ordinal data, and missing values.

multi_rater_alpha.py·python
import krippendorff
import numpy as np
import pandas as pd
 
def krippendorff_alpha_per_dim(panel: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for dim, sub in panel.groupby("dimension"):
        wide = sub.pivot(index="call_id", columns="judge", values="score")
        # krippendorff expects shape: (n_raters, n_units); NaN slots are tolerated
        # by the library (treated as missing), but if you cannot accept gaps drop
        # them with wide.dropna() before .T.to_numpy().
        data = wide.T.to_numpy()
        alpha = krippendorff.alpha(reliability_data=data, level_of_measurement="ordinal")
        rows.append({"dimension": dim, "krippendorff_alpha": round(alpha, 3)})
    return pd.DataFrame(rows).sort_values("krippendorff_alpha")

Krippendorff suggests alpha at or above 0.80 is acceptable for content analysis, with 0.667 the lowest acceptable for tentative conclusions. On a customer-experience rubric across three frontier judges, alpha tends to sit between 0.40 and 0.70, and the lowest dimensions are usually empathy and tone. Treat that as the empirical reality, not a failure of the eval.

Decompose, do not aggregate

The single biggest takeaway from running this analysis is that one number is the wrong unit of reporting. A scorecard with five dimensions has five separate inter-rater stories. Intent classification might be substantial. Empathy might be fair at best. The aggregate looks moderate, which mixes the trustworthy signal with the unreliable one.

The fix is to publish per-dimension kappa or alpha alongside the score. If your dashboard says "agent fleet scored 4.1 on empathy this week," your dashboard should also say "judge agreement on empathy: alpha 0.42." The reader then knows that the 4.1 has wide error bars and a small change in prompt could swing it.
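
One way to wire that up, assuming the panel DataFrame from run_panel.py and the alpha table from krippendorff_alpha_per_dim. Pooling the mean score across judges is a simplification you may not want; reporting each judge's mean separately is equally reasonable:

dashboard_rows.py·python
import pandas as pd

def dashboard_rows(panel: pd.DataFrame, alphas: pd.DataFrame) -> pd.DataFrame:
    # mean fleet score per dimension, pooled across all three judges
    means = panel.groupby("dimension")["score"].mean().round(2).rename("mean_score")
    # alphas: output of krippendorff_alpha_per_dim, one row per dimension
    return alphas.set_index("dimension").join(means).reset_index()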

Bootstrap confidence intervals around kappa

Kappa is a point estimate. Like any point estimate, it has uncertainty. Two studies with the same true kappa can return 0.55 and 0.61 by sampling alone. The right move is to bootstrap.

bootstrap_kappa.py·python
import numpy as np
from sklearn.metrics import cohen_kappa_score
 
def bootstrap_kappa(judge_a: np.ndarray, judge_b: np.ndarray, n_boot: int = 1000) -> dict:
    rng = np.random.default_rng(seed=42)
    estimates = []
    n = len(judge_a)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        estimates.append(cohen_kappa_score(judge_a[idx], judge_b[idx], weights="linear"))
    estimates = np.array(estimates)
    return {
        "kappa": round(float(estimates.mean()), 3),
        "ci_low": round(float(np.percentile(estimates, 2.5)), 3),
        "ci_high": round(float(np.percentile(estimates, 97.5)), 3),
    }
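
Wiring the bootstrap to the panel is one pivot per dimension. A usage sketch for the empathy dimension and one judge pair, using the judge names from the harness above and the cached parquet from the scoring run:

bootstrap_usage.py·python
import pandas as pd
from bootstrap_kappa import bootstrap_kappa

# panel: the long-format DataFrame produced by score_corpus
panel = pd.read_parquet("judge_panel.parquet")
wide = (
    panel[panel["dimension"] == "empathy"]
    .pivot(index="call_id", columns="judge", values="score")
    .dropna()
)
print(bootstrap_kappa(wide["gpt-5"].to_numpy(), wide["claude-4.5"].to_numpy()))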

A kappa of 0.55 with a 95% CI of [0.42, 0.68] is a different conclusion than 0.55 with a CI of [0.53, 0.57]. The first means you cannot reliably distinguish moderate from substantial agreement. The second means you can. With 100 calls per dimension you typically get CIs about 0.10 wide. With 500 calls they tighten to about 0.04. The bootstrap forces you to be honest about how much you know.

The disagreement queue is a gold mine, not a failure

Once you have multi-judge scores, the calls where judges disagree most are not failures of the eval. They are the calls humans should look at. You can score per-call disagreement as the variance of judge scores across the panel, sort descending, take the bottom decile of agreement, and route those to human review.

disagreement_queue.py·python
import pandas as pd
 
def disagreement_queue(panel: pd.DataFrame, bottom_decile: float = 0.10) -> pd.Series:
    # disagreement score: variance across judges per (call, dimension),
    # averaged over dimensions to get one number per call
    variance_by_call = (
        panel.groupby(["call_id", "dimension"])["score"].var()
        .groupby(level="call_id").mean()
    )
    threshold = variance_by_call.quantile(1 - bottom_decile)
    return variance_by_call[variance_by_call >= threshold].sort_values(ascending=False)

Out of 500 calls, you flag the 50 with the highest judge disagreement. That is your weekly human-review batch. It is also a dataset for prompt iteration. Every call where three frontier judges look at the same transcript and reach different conclusions is a call where the rubric, the agent, or the transcript itself is ambiguous. Fix the ambiguity and your kappa goes up.

Where Chanl fits

If you build the harness above, you have to maintain the cache, the parquet files, the cron job that runs the panel weekly, and the dashboard wiring. Or you can use a scorecards system that already persists per-judge results so the agreement analysis is one query away.

The Chanl SDK exposes scorecards with multi-criteria rubrics and per-criterion results. You can configure separate scorecard runs against different judge models, then pull the per-call results and compute kappa or alpha across them.

multi_judge_chanl.ts·typescript
import { ChanlSDK } from '@chanl/sdk';
 
const sdk = new ChanlSDK({ apiKey: process.env.CHANL_API_KEY! });
 
const scorecard = await sdk.scorecard.create({
  name: 'CX Quality 5D',
  description: 'Empathy, compliance, intent, resolution, tone',
  status: 'active',
  passingThreshold: 70,
  scoringAlgorithm: 'weighted_average',
});
 
// run the same scorecard with three different judge configs (one per workspace
// agent or eval profile), persist all per-criterion results
for (const callId of last500CallIds) {
  await sdk.scorecard.evaluate(callId, { scorecardId: scorecard.data!.id, force: true });
}
 
// pull per-call, per-criterion results to compute kappa per dimension
for (const callId of last500CallIds) {
  const { data } = await sdk.scorecard.getResultsByCall(callId);
  // each result.criteriaResults entry is one judge's score on one dimension
  // export to your kappa script
}

A few things in this pattern do not yet have first-class API support. Formal ensemble scoring (judges: [...] with explicit aggregation) and an auto-generated disagreement report are not shipped. We treat them as product gaps to fill, not features to pretend exist. For now, you run the panel as separate evaluations and aggregate in your own code. See Scorecards, plus 12 Ways Your LLM Judge Is Lying to You for the related single-judge bias story and Beyond LLM Judge for what to do when the agreement number is too low to fix.

What to do Monday

Pick three judges from different model families. Use temperature zero and JSON-constrained output. Score at least 100 calls, ideally 500. Compute pairwise Cohen's kappa with sklearn and Krippendorff's alpha across all three. Decompose by rubric dimension and publish each one separately, never the aggregate alone. Bootstrap a 95% confidence interval so the kappa itself has error bars. Build the disagreement queue and route the bottom decile to human review. Make the per-dimension agreement number a deployment gate next to your accuracy number.

The single judge score on your dashboard is not wrong. It is just incomplete. Add the second judge and the third, measure how often they agree, and the score becomes a metric instead of a guess.

Score every call with multiple judges, measure agreement

Chanl scorecards persist per-criterion results from any judge configuration, so inter-rater agreement is one query away.

Try Scorecards
Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

