Chanl
Testing & Evaluation

Is AI Better Than Your Humans? Score Both on One Rubric

Most teams can't say whether AI beats humans because they score them differently. One rubric, run on both, sliced by segment, gives you an honest answer.

Dean Grover, Co-founder
April 15, 2026
8 min read

A VP of CX walks into the quarterly review with two dashboards. One shows the bot: latency down, deflection up, cost per resolution falling. The other shows her human team: CSAT holding, AHT improving, QA samples green. The board asks the obvious question: so is the AI actually better than our people?

She doesn't know. Not really. Both numbers go up and to the right. The dashboards don't talk to each other. She squints and decides by gut whether to expand the rollout.

That's not a data problem. It's a parity problem. You can't compare two things by measuring them on different rulers.

The fix is blunt: one scorecard. Run it against every conversation, AI or human. Segment the results. Report the truth.

What does "one scorecard" actually contain?

A working shared scorecard uses four categories that apply to both AI and human conversations, with scenario-specific weights inside. The categories stay constant. The weights shift by call type. Below is the baseline rubric.

Category   | What it measures                                     | Example question
Accuracy   | Did the agent resolve the issue correctly?           | Right policy, right price, right return window cited?
Empathy    | Did the agent acknowledge emotion and match tone?    | Did it de-escalate when the customer got heated?
Resolution | Did the customer leave with what they needed?        | No dead-end transfers, no hangs, no repeat calls?
Compliance | Did the agent follow scripts, disclaimers, policies? | Data handling, regulated claims, PII protection?

A renewal call weights accuracy and compliance heavily, empathy less. A cancellation call inverts that: empathy and resolution carry more. A sales discovery call puts the most weight on empathy and opportunity identification (which, for that scenario, lives inside the resolution category).

Weights-by-scenario is where shared scorecards usually break. Teams argue about the global weighting and never ship. The answer: don't globally weight anything. Score every category, then slice results by scenario. The scenario decides what matters.
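
One way to keep that discipline is to make the rubric itself a small, version-locked artifact and attach the scenario tag to each scored conversation rather than baking weights into the rubric. The sketch below is illustrative only; the `RUBRIC` structure and field names are assumptions, not a prescribed schema.

```python
# Minimal sketch of a shared, version-locked rubric and the record stored
# for every scored conversation, AI or human. Names are illustrative.
RUBRIC = {
    "version": "v2",        # bump on any wording or scale change, then re-score history
    "scale": (1, 5),        # same 1-5 scale for every category, every agent type
    "categories": {
        "accuracy":   "Did the agent resolve the issue correctly?",
        "empathy":    "Did the agent acknowledge emotion and match tone?",
        "resolution": "Did the customer leave with what they needed?",
        "compliance": "Did the agent follow scripts, disclaimers, policies?",
    },
}

def score_record(conversation_id: str, scenario: str, agent_type: str,
                 scores: dict[str, int]) -> dict:
    """Bundle one scored conversation so results can be sliced by scenario later.

    No global weights are applied here; every category gets a score and the
    scenario tag decides which categories matter at read time.
    """
    assert set(scores) == set(RUBRIC["categories"]), "score every category"
    return {
        "conversation_id": conversation_id,
        "scenario": scenario,        # e.g. "renewal", "cancellation", "order_status"
        "agent_type": agent_type,    # "ai" or "human"
        "rubric_version": RUBRIC["version"],
        **scores,
    }
```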

[Illustration: quality analyst reviewing a sample scorecard with category scores for Tone & Empathy, Resolution, Response Time, and Compliance]

How do you trust an LLM judge at scale?

"Score every conversation" sounds fine until you multiply it out. At 50,000 conversations a month, no team of human QA reviewers can keep up. You need an LLM judge. LLM judges are fine. They're also famously unreliable if you don't calibrate them. Three rules keep them honest:

  1. Versioned rubric. The exact prompt, exact categories, exact rating scale gets version-locked. When you change it, you re-score history. Don't mix v1 and v2 scores in the same report.
  2. Human-labeled anchor set. Keep a gold-standard set of 200 to 500 labeled conversations spanning every scenario. Every rubric change gets tested against it before rollout. Track Cohen's kappa between judge and human on each category. Landis & Koch call 0.61 to 0.80 "substantial" and above 0.80 "almost perfect". Published LLM-judge benchmarks like MT-Bench land in roughly the same 0.7 to 0.8 range against humans. Treat 0.75 as the floor. Anything below that on a category means the rubric, not the judge, needs work. A minimal kappa check is sketched after this list.
  3. Quarterly recalibration. Drift is real. Re-sample, re-label, re-test. Treat this like a security patch cadence, not a nice-to-have.
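
Rule 2 is the easiest one to automate. Here's a minimal sketch of the calibration check, assuming the anchor set has already been exported as parallel human and judge labels per category (the loading step and data layout are yours); the kappa math comes from scikit-learn:

```python
# Calibration check: judge vs. human labels on the gold-standard anchor set.
from sklearn.metrics import cohen_kappa_score

KAPPA_FLOOR = 0.75  # below this, rework the rubric wording, not the judge

def calibration_report(anchor_labels: dict[str, tuple[list[int], list[int]]]) -> dict[str, float]:
    """anchor_labels maps category -> (human_scores, judge_scores) on the anchor set."""
    report = {}
    for category, (human, judge) in anchor_labels.items():
        kappa = cohen_kappa_score(human, judge, weights="quadratic")
        report[category] = round(kappa, 3)
        if kappa < KAPPA_FLOOR:
            print(f"{category}: kappa={kappa:.2f} is below the floor; fix the rubric before rollout")
    return report
```

Quadratic weighting treats a one-point disagreement on a 1-to-5 scale as less serious than a three-point one, which is closer to how reviewers actually judge ordinal rubric scores than the unweighted default.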

Get this right and the judge starts producing scores people actually trust. Skip it and you get scores nobody believes, which means the parity question quietly goes back to being a gut call.
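
For concreteness, a version-locked judge call can be as small as the sketch below. The `complete` callable stands in for whatever model client you actually run, and the JSON contract is an assumption made for the sketch, not a fixed API:

```python
import json

# Version-locked judge prompt: exact wording, categories, and scale are part of "v2".
JUDGE_PROMPT_V2 = (
    "You are a QA reviewer. Score the conversation on a 1-5 scale for each "
    "category: accuracy, empathy, resolution, compliance. Return only JSON, "
    'e.g. {"accuracy": 4, "empathy": 3, "resolution": 5, "compliance": 5}.\n\n'
    "Conversation:\n"
)

def judge(transcript: str, complete) -> dict:
    """Score one conversation with the version-locked rubric prompt.

    `complete` is whatever text-completion callable your stack provides
    (hypothetical here). Swapping models or prompts means re-running the
    anchor-set calibration before trusting the new scores.
    """
    raw = complete(JUDGE_PROMPT_V2 + transcript)
    scores = json.loads(raw)
    scores["rubric_version"] = "v2"
    return scores
```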

What do parity reports actually show?

When you run one scorecard against both AI and human conversations and slice by scenario, the results are rarely uniform. A typical enterprise we've seen produces something like this, once a quarter of data has rolled through:

  • AI wins on volume-heavy, low-emotion scenarios. Password resets, order status, basic FAQ. AI scores higher on accuracy and resolution because it's consistent, never distracted, and has full memory of the knowledge base.
  • Humans win on delicate retention, complex billing disputes, and accounts with history. AI often flubs empathy in scenarios that need long-memory cues and genuine flexibility.
  • AI is flatly worse on compliance edges. Not because AI can't follow rules. Because the rules get updated and the AI's prompt doesn't. Humans read the Monday memo. Your bot didn't.

Every one of those findings is actionable. "AI should take renewal questions; humans should handle retention calls; compliance gap equals update the prompt library." That's the real output of a shared scorecard: a segmented routing and improvement plan, not a global winner.
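
Producing that segmented view is mechanical once every scored conversation carries a scenario and agent-type tag. A minimal sketch with pandas, using the illustrative column names from the record sketch above:

```python
import pandas as pd

def parity_report(scores: pd.DataFrame) -> pd.DataFrame:
    """Mean score per category for AI vs. human, by scenario, plus the gap.

    Expects one row per scored conversation with columns:
    scenario, agent_type ("ai" or "human"), accuracy, empathy, resolution, compliance.
    """
    categories = ["accuracy", "empathy", "resolution", "compliance"]
    means = scores.groupby(["scenario", "agent_type"])[categories].mean()
    pivot = means.unstack("agent_type")              # columns become (category, agent_type)
    for cat in categories:
        pivot[(cat, "gap")] = pivot[(cat, "ai")] - pivot[(cat, "human")]
    return pivot.sort_index(axis=1)
```

A positive gap means the AI outscores the humans on that category in that scenario; read down the scenario rows and the routing plan writes itself.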

Routing based on parity, not policy

Once you have scorecard results by scenario, the routing question stops being a battle over control and becomes a math problem. Route each scenario to whichever type of agent scores higher, adjusted for cost and escalation rate. Where humans score better, they get the work. Where AI scores better, and the resolution-rate gap is small, the AI handles it. Where scores are close, split traffic and measure, and size the split honestly. A couple hundred calls per arm per scenario is usually enough to detect a meaningful gap on a 1 to 5 rubric. A handful of calls is noise. If a scenario is low-volume, either wait until you have the data or run a longer hold-out. Don't flip routing on a week of 20 calls.
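
The "couple hundred calls per arm" figure falls out of a standard two-sample power calculation. A back-of-envelope sketch, assuming scores on the 1-to-5 rubric with a standard deviation of roughly one point (swap in your own variance):

```python
from math import ceil
from scipy.stats import norm

def n_per_arm(min_gap: float, sd: float = 1.0, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate conversations per arm to detect a `min_gap`-point difference in mean score.

    Normal-approximation formula for a two-sided, two-sample comparison of means:
    n = 2 * ((z_alpha/2 + z_power) * sd / gap)^2
    """
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
    z_power = norm.ppf(power)           # ~0.84
    return ceil(2 * ((z_alpha + z_power) * sd / min_gap) ** 2)

# Detecting a quarter-point gap with sd ~= 1 needs roughly 250 calls per arm:
# n_per_arm(0.25) -> 252
```

Detecting a half-point gap drops the requirement to roughly 63 calls per arm under the same assumptions; a week of 20 calls can't reliably resolve either.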

This also changes how the AI gets improved. Instead of asking "how do we make the AI better in general?" you ask "for scenario X where we lose to humans by Y points on empathy, what's the specific fix?" Usually it's a prompt gap, a missing tool, or a memory blind spot. All of which are fixable, one at a time, because you now know exactly which scenario to fix.

"Is AI actually better" is a quarterly question

Parity reports aren't a one-time exercise. They're a quarterly cadence. Models change, prompts change, knowledge changes, human teams change. The scorecard is the constant. Run it, slice the results, publish the findings, update the routing.

The teams that do this stop having the "is AI better" debate. It gets answered, by segment, by scenario, by scorecard, and the conversation moves to what do we do about where it's worse. Which is the only version of the question that's actually useful.

So when the board asks our VP next quarter whether AI is better than our people, she doesn't squint. She opens one dashboard. AI wins renewals. Humans win retention. Compliance needs a prompt update by Friday. That's the answer. One scorecard, both teams, segmented, quarterly. Everything else is a guess wearing a number.
