Your team read about the 12 biases hiding in every LLM judge. The verbosity inflation, the position effects, the self-preference loops. You know the scores aren't trustworthy. But you still have agents in production that need evaluation, and "stop using LLM-as-Judge" isn't actionable advice unless you know what replaces it.
That's the gap this article fills. Not "here's what's wrong with LLM judges" (we catalogued all 12 biases in our previous deep dive), but "here are the six evaluation methods that production teams are actually adopting, with code you can run this week."
The shift happening across the industry isn't from one tool to another. It's from a single scoring model to a layered evaluation strategy where different methods handle different dimensions of quality. Here's what that looks like in practice.
Why isn't a better judge model the answer?
The core issue with LLM-as-Judge isn't that LLMs are bad evaluators. It's that teams use a single model to score everything with a single prompt, then treat the output as ground truth. Anthropic's engineering team explored this challenge, noting that evaluating both the transcript (including tool calls and intermediate results) and the final outcome matters for understanding agent failures.
An agent that picks the wrong tool, retrieves the wrong document, then generates a confident-sounding response will score well on a generic "rate this response" rubric. The answer reads fluently. The tone is professional. The format is clean. Every superficial signal says "good response," but the underlying retrieval was wrong and the customer got bad information.
The fix isn't a better judge model. It's matching the right evaluation method to the right dimension of quality. Here's the map:
| Dimension | Wrong approach | Right approach |
|---|---|---|
| Factual accuracy | Generic "rate quality 1-5" | Domain-specific criteria with verifiable anchors |
| Conversation flow | Score each turn independently | Trajectory evaluation across the full arc |
| Edge case handling | Hope your test set covers it | Adversarial simulation with realistic personas |
| Subjective quality | LLM scores everything | Human review on the hard 20%, LLM pre-screens the rest |
| Quality over time | Spot-check when something feels off | Automated regression baselines with alerts |
| Pipeline correctness | End-to-end scoring only | Component-level eval on tool selection, retrieval, generation |
Each of these six methods addresses a specific gap that monolithic LLM judging can't fill. Let's walk through them.
1. Domain-specific scorecards
Domain-specific scorecards replace open-ended "rate quality" prompts with precise, use-case-specific criteria that constrain the evaluator's interpretation and directly reduce scoring bias. Instead of a single question, you score independent dimensions like issue identification, policy adherence, and resolution efficiency, each with concrete anchors at every level.
"Rate the quality of this response on a scale of 1 to 5" leaves the judge free to interpret "quality" however its training data suggests. That interpretation is where all 12 biases enter. The fix: instead of "quality," score "billing issue identification," "refund policy adherence," and "tone calibration for frustrated customers" as independent dimensions.
Here's what a generic eval looks like versus a domain-specific one:
```yaml
# Generic eval (what most teams start with)
criteria:
  - name: "Overall Quality"
    prompt: "Rate the overall quality of this response from 1 to 5."
```

```yaml
# Domain-specific eval (what actually works)
criteria:
  - name: "Issue Identification"
    prompt: >
      Did the agent correctly identify the customer's core issue
      within the first two exchanges?
      1: Misidentified the issue or never identified it.
      3: Identified the general category but missed specifics.
      5: Correctly identified the specific issue and confirmed
      understanding with the customer.
  - name: "Policy Adherence"
    prompt: >
      Did the agent follow the company's refund and escalation
      policies?
      1: Violated a stated policy or made an unauthorized commitment.
      3: Followed policies but missed an applicable edge case.
      5: Correctly applied all relevant policies including exceptions.
  - name: "Resolution Path"
    prompt: >
      Did the agent move the conversation toward resolution
      efficiently?
      1: Went in circles, repeated questions, or dead-ended.
      3: Reached resolution but took unnecessary steps.
      5: Took the most direct path to resolution given
      the constraints.
```

Notice what changed. Each criterion has concrete anchors at every score level. The evaluator doesn't decide what "good" means. The rubric defines it. This constrains interpretation and directly reduces verbosity bias (a long, rambling response that never identifies the issue still scores 1 on "Issue Identification") and leniency bias (the anchors define what a 3 looks like, so different evaluator models converge on similar scores).
Building these scorecards requires domain expertise, but that's the point. Your eval criteria should reflect what your business actually cares about, not what a language model thinks "quality" means.
The weight distribution matters. "Policy Adherence" at 35% reflects a real business priority: a wrong refund commitment costs money. "Tone Calibration" at 15% matters but won't tank the overall score on its own. These weights encode your team's values into the eval, which is exactly what a generic "rate quality" prompt fails to do.
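To make the weighting concrete, here is a minimal sketch of how a weighted composite might be computed. Only the 35% and 15% figures come from the text above; the dimension names and the remaining weights are assumptions for illustration.

```python
# Hypothetical weights: the Policy Adherence (35%) and Tone Calibration
# (15%) figures come from the text; the remaining split is illustrative.
WEIGHTS = {
    "policy_adherence": 0.35,
    "issue_identification": 0.30,
    "resolution_path": 0.20,
    "tone_calibration": 0.15,
}

def weighted_overall(scores: dict[str, float]) -> float:
    """Combine per-dimension 1-5 scores into a single weighted score."""
    return sum(WEIGHTS[dim] * score for dim, score in scores.items())

# Strong on policy, weak on tone: the composite stays high because the
# weights encode that policy matters more.
overall = weighted_overall({
    "policy_adherence": 5,
    "issue_identification": 4,
    "resolution_path": 4,
    "tone_calibration": 2,
})
# 0.35*5 + 0.30*4 + 0.20*4 + 0.15*2 = 4.05
```

Keeping the per-dimension scores alongside the composite is what makes dimension-level drift visible later.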
Teams using structured scorecards report that the biggest win isn't more accurate scores. It's knowing exactly which dimension degraded when quality drops. A 4.2 dropping to 3.8 tells you nothing. "Policy Adherence dropped from 4.5 to 3.1 after the Tuesday prompt change" tells you exactly what broke and where to look.
2. Multi-turn trajectory evaluation
Single-turn evaluation treats each response as an independent event. That works for chatbots that answer one question and move on. It doesn't work for agents that handle multi-step workflows: verifying identity, looking up an account, diagnosing a problem, applying a solution, and confirming resolution.
Trajectory evaluation scores the full conversation arc. A response that looks perfect in isolation might have been produced only because the agent failed to ask a clarifying question three turns earlier and is now guessing. Per-turn scoring would rate the guess highly (it's fluent, confident, well-structured), but trajectory scoring would catch that the agent skipped verification and acted on an assumption.
The key metrics for trajectory evaluation are different from per-turn metrics:
| Trajectory metric | What it catches |
|---|---|
| Goal completion | Did the agent reach the customer's stated objective? |
| Context maintenance | Did the agent remember information from earlier turns? |
| Recovery quality | When the agent misunderstood something, how well did it recover? |
| Path efficiency | How many turns did it take versus the minimum needed? |
| Escalation timing | If the agent couldn't resolve the issue, did it escalate at the right moment? |
Here's a simplified version of how trajectory scoring works:
```python
def evaluate_trajectory(conversation: list[dict]) -> dict:
    scores = {}

    # Goal completion: did the last turn resolve the issue?
    final_turn = conversation[-1]
    customer_goal = extract_goal(conversation[0])
    scores["goal_completion"] = check_goal_met(
        customer_goal, final_turn
    )

    # Context maintenance: did the agent reference
    # earlier information correctly?
    context_refs = 0
    context_errors = 0
    for i, turn in enumerate(conversation[1:], 1):
        if turn["role"] == "assistant":
            refs = find_references_to_earlier_turns(
                turn, conversation[:i]
            )
            for ref in refs:
                context_refs += 1
                if not ref["accurate"]:
                    context_errors += 1
    scores["context_accuracy"] = (
        1 - (context_errors / max(context_refs, 1))
    )

    # Path efficiency: actual turns vs. minimum needed
    agent_turns = len(
        [t for t in conversation if t["role"] == "assistant"]
    )
    min_turns = estimate_minimum_turns(customer_goal)
    scores["path_efficiency"] = min(
        min_turns / max(agent_turns, 1), 1.0
    )

    return scores
```

This is simplified (the helpers `extract_goal`, `check_goal_met`, `find_references_to_earlier_turns`, and `estimate_minimum_turns` are left as stubs you'd implement with your own heuristics or LLM calls), but it illustrates the structural difference. You're not asking "was this turn good?" but "did this sequence of turns accomplish the goal?" Those are fundamentally different questions, and they catch fundamentally different failure modes.
Current tooling for trajectory evaluation is still maturing. Most teams implement it as a post-processing step: score individual turns with their standard scorecard, then run a separate trajectory analysis on the full conversation. The trajectory layer catches failures that per-turn scoring misses entirely.
3. Scenario-based simulation
Unit tests check individual inputs and outputs. Simulation testing checks whether the agent handles realistic situations end to end. The difference matters because real customer conversations are messy, multi-step, and full of implicit context that unit tests don't capture.
Simulation testing works by creating personas with specific attributes (personality, problem, communication style) and running full conversations against your agent. The persona acts like a real customer would: providing partial information, asking follow-up questions, expressing frustration, and sometimes trying to derail the conversation entirely.
The adversarial testing angle is where simulations earn their keep. You don't just test "customer with a billing question." You test "customer who provides wrong account information to see if the agent catches it," "customer who keeps changing the subject," and "customer who asks the agent to do something it shouldn't."
The "Social Engineer" persona is particularly valuable. It will try to get the agent to skip identity verification, share account details without authentication, or approve actions beyond its authorization level. These are exactly the failure modes that scripted tests miss because they require adversarial creativity.
Teams building scenario-based test suites typically start with 20-30 scenarios covering their core use cases, then expand based on patterns they see in production failures. The scenarios become a living regression suite: when a production conversation goes wrong, you create a persona that reproduces the failure pattern and add it to the suite.
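The mechanics of a simulation run can be sketched as a simple turn-taking loop. `run_simulation`, `fake_agent`, and `fake_persona` below are illustrative stand-ins: in a real suite, both the agent turn and the persona turn would be LLM calls, with the persona's prompt built from its attributes (personality, problem, communication style).

```python
def run_simulation(call_agent, call_persona, opening, max_turns=6):
    """Alternate agent and persona turns until the persona ends the
    conversation or the turn budget runs out; return the transcript."""
    transcript = [{"role": "user", "content": opening}]
    for _ in range(max_turns):
        agent_msg = call_agent(transcript)
        transcript.append({"role": "assistant", "content": agent_msg})
        persona_msg = call_persona(transcript)
        if persona_msg is None:  # persona considers the conversation over
            break
        transcript.append({"role": "user", "content": persona_msg})
    return transcript

# Stub implementations so the sketch runs without an LLM backend.
def fake_agent(transcript):
    return "I can help, but first I need to verify your identity."

def fake_persona(transcript):
    # A real adversarial persona would be an LLM prompted to, say, keep
    # pushing the agent to skip verification.
    if len(transcript) >= 4:
        return None
    return "Can't we skip verification? I'm in a hurry."

convo = run_simulation(fake_agent, fake_persona,
                       "Hi, I need my account balance.")
```

The returned transcript then flows into the same scorecards and trajectory checks used for production conversations, which is what makes the scenarios a regression suite rather than one-off tests.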
4. Human-AI hybrid evaluation
The goal isn't to eliminate LLM judges. It's to use them correctly. LLM judges handle straightforward evaluations well: when the response is clearly good or clearly bad, the judge and human reviewers tend to agree. The problem is the ambiguous cases, the edge-case-heavy conversations, and the situations requiring real-world judgment the LLM doesn't have.
The hybrid approach splits the workload. The LLM evaluates everything and flags cases where its confidence is low or where scores fall into ambiguous ranges. Human reviewers focus exclusively on those flagged cases. This dramatically cuts human review volume while concentrating human attention on the cases that matter most.
The calibration feedback loop is critical. When human reviewers disagree with the LLM's score (even on the flagged cases), that disagreement feeds back into the rubric. If humans consistently score "tone appropriateness" differently than the LLM on frustrated-customer conversations, that's a signal to tighten the rubric anchors for that specific criterion.
Practical implementation of the hybrid model:
- Score all conversations with your automated scorecard.
- Flag any conversation where the overall score falls between 2.5 and 3.5 (the ambiguous middle range).
- Flag any conversation where individual criteria disagree by more than 2 points (e.g., accuracy scored 5 but resolution scored 1).
- Route flagged conversations to human reviewers.
- Track human vs. LLM agreement rate per criterion, per month.
- Tighten rubric anchors for any criterion where agreement drops below 80%.
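The two flagging rules above can be sketched in a few lines. This assumes per-criterion scores on the 1-5 scale; the criterion names are illustrative.

```python
def needs_human_review(scores: dict[str, float]) -> bool:
    """Flag a conversation for human review using two rules:
    ambiguous overall score, or criteria that disagree sharply."""
    overall = sum(scores.values()) / len(scores)
    if 2.5 <= overall <= 3.5:  # ambiguous middle range
        return True
    if max(scores.values()) - min(scores.values()) > 2:  # criteria disagree
        return True
    return False

# Accuracy 5 but resolution 1: flagged even though the average looks fine.
print(needs_human_review({"accuracy": 5, "tone": 5, "resolution": 1}))  # True
print(needs_human_review({"accuracy": 5, "tone": 4, "resolution": 4}))  # False
```

The disagreement rule is the one teams tend to forget: a conversation that averages 3.7 can still hide a criterion that scored 1.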
This isn't the cheapest evaluation method, but it's the most accurate for high-stakes use cases. If your agent handles billing disputes, medical information, or financial advice, the 20% of cases that fall in the ambiguous zone are exactly the ones where mistakes are most expensive.
5. Automated regression testing
Most teams think of evaluation as "is this response good?" Regression testing asks a different question: "is this response worse than it was last week?" This reframing matters because it catches drift that absolute scoring misses.
Consider a scenario where your agent's accuracy score has been 4.3 for months. A new model version bumps it to 4.4, but resolution efficiency quietly drops from 4.1 to 3.6. The overall average barely moves. If you're watching a single composite score, you don't notice. If you're tracking each dimension independently with regression alerts, the 0.5-point drop on resolution efficiency triggers a review before the degradation reaches customers.
The minimal regression testing setup needs three things: a baseline conversation set, weekly scoring, and per-dimension alerting.
The 10% threshold is a starting point. Some dimensions are more sensitive than others. "Policy Adherence" dropping 10% could mean your agent is making unauthorized commitments, which is an urgent problem. "Tone Calibration" dropping 10% might mean the prompt got slightly more formal, which is worth investigating but not blocking a deploy.
Set per-dimension alert thresholds based on business impact:
| Dimension | Threshold | Why |
|---|---|---|
| Policy Adherence | 5% drop | Violations have direct financial/legal risk |
| Issue Identification | 10% drop | Customers get frustrated but aren't harmed |
| Tone Calibration | 15% drop | Subjective, normal variance is higher |
| Resolution Efficiency | 10% drop | Affects handle time and customer satisfaction |
The power of regression testing is that it works regardless of whether your absolute scores are well-calibrated. Even if your LLM judge has a 0.3-point verbosity inflation, that inflation is consistent. So a relative drop in score still means something changed. You're measuring change, not absolute quality, which sidesteps most of the calibration problems that plague absolute scoring.
Teams running production agents use monitoring dashboards to track these dimensions in real time. The regression test suite runs on a schedule, but the dashboard catches anomalies between test runs.
6. Component-level evaluation
End-to-end scoring tells you the agent gave a wrong answer. It doesn't tell you why. Was the right tool selected? Was the retrieval query well-formed? Did the retrieved documents contain the right information? Was the response actually grounded in those documents, or did the model hallucinate?
Component-level evaluation scores each stage of the agent pipeline independently. This aligns with Anthropic's recommendation to evaluate the full transcript, including tool calls and intermediate results, not just the final output. When intermediate outputs go unexamined, component-level failures cascade into end-to-end failures that are impossible to diagnose from the final response alone.
Here's what component-level eval looks like for a customer support agent with access to knowledge base search and account lookup tools:
```yaml
# Component-level evaluation criteria
components:
  tool_selection:
    criteria: "Did the agent call the right tool for the query?"
    anchors:
      1: "Called an irrelevant tool or no tool when one was needed"
      3: "Called a related tool but not the optimal one"
      5: "Called the exact right tool with correct parameters"
  retrieval_quality:
    criteria: "Did the tool return information relevant to the query?"
    anchors:
      1: "Retrieved documents are unrelated to the question"
      3: "Retrieved documents are topically related but don't contain the answer"
      5: "Retrieved documents directly answer the question"
  response_grounding:
    criteria: "Is the agent's response supported by the retrieved information?"
    anchors:
      1: "Response contradicts or ignores retrieved information"
      3: "Response uses retrieved information but adds unsupported claims"
      5: "Response is fully grounded in retrieved information with no hallucination"
```

When an agent gives a wrong answer, component-level eval produces a diagnosis, not just a score. "Tool selection: 5, Retrieval: 5, Grounding: 2" means the agent found the right information but hallucinated in the response. "Tool selection: 2, Retrieval: N/A, Grounding: N/A" means the agent never even looked for the answer. These are completely different problems with completely different fixes, but end-to-end scoring would give them the same low score.
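That diagnosis step can be automated as a simple triage over the component scores. The thresholds and messages here are illustrative, not a fixed rule; tune them to your own score distributions.

```python
def diagnose(components: dict) -> str:
    """Map component-level scores (1-5, or None for N/A) to the most
    likely failure layer, checked in pipeline order."""
    if (components.get("tool_selection") or 0) <= 2:
        return "tool selection failed: agent never looked in the right place"
    if (components.get("retrieval_quality") or 0) <= 2:
        return "retrieval failed: right tool, wrong or missing documents"
    if (components.get("response_grounding") or 0) <= 2:
        return "grounding failed: right information, hallucinated response"
    return "no component-level failure detected"

print(diagnose({"tool_selection": 5, "retrieval_quality": 5,
                "response_grounding": 2}))
# grounding failed: right information, hallucinated response
print(diagnose({"tool_selection": 2, "retrieval_quality": None,
                "response_grounding": None}))
# tool selection failed: agent never looked in the right place
```

Checking components in pipeline order matters: a tool-selection failure makes the downstream scores meaningless, so it should win the diagnosis.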
This granularity becomes especially important when you're iterating on agent tools. If you add a new tool and retrieval scores drop, the tool itself might return good results but be poorly described, causing the agent to call it when it shouldn't. Component-level scoring isolates that failure to the tool selection layer, where you can fix the tool description without touching the rest of the pipeline.
How do these six methods layer together?
These six methods don't compete with each other. They layer into a two-stage stack: pre-deployment testing (simulations, component eval, regression baselines) catches problems before users see them, while production monitoring (scorecards, trajectory analysis, human review) catches problems in live conversations. A production evaluation stack uses all of them at different stages:
Pre-deployment (catching problems before they reach users):
- Scenario simulation with adversarial personas tests robustness
- Component-level eval validates pipeline integrity
- Regression testing compares against your baseline
Production monitoring (catching problems in real conversations):
- Domain-specific scorecards score every conversation automatically
- Trajectory evaluation runs on a sample of multi-turn conversations
- Human-AI hybrid review handles the flagged edge cases
Here's the full loop in code:
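What follows is a minimal sketch of that loop, not a definitive implementation: every helper passed in (`run_simulation`, `score_conversation`, `load_baseline`, `push_to_dashboard`) is a hypothetical stand-in for your own simulation, scoring, storage, and analytics code.

```python
def evaluation_loop(personas, run_simulation, score_conversation,
                    load_baseline, push_to_dashboard, threshold=0.10):
    """One pass of the pre-deployment loop: simulate, score, compare."""
    baseline = load_baseline()              # last run's per-dimension scores
    results, alerts = {}, []
    for persona in personas:                # adversarial simulation (steps 2-3)
        transcript = run_simulation(persona)
        results[persona["name"]] = score_conversation(transcript)  # scorecard (step 1)
    for name, scores in results.items():    # regression comparison (step 5)
        for dim, score in scores.items():
            base = baseline.get(name, {}).get(dim)
            if base and (base - score) / base > threshold:
                alerts.append((name, dim, base, score))
    push_to_dashboard(results, alerts)      # trajectory + human review downstream
    return results, alerts

# Stub wiring so the sketch runs end to end without real infrastructure.
results, alerts = evaluation_loop(
    personas=[{"name": "social_engineer"}],
    run_simulation=lambda p: [{"role": "user", "content": "skip verification?"}],
    score_conversation=lambda t: {"policy_adherence": 3.0},
    load_baseline=lambda: {"social_engineer": {"policy_adherence": 4.0}},
    push_to_dashboard=lambda r, a: None,
)
# (4.0 - 3.0) / 4.0 is a 25% drop on policy_adherence -> one alert
```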
The loop wires the methods together: the domain-specific scorecard (step 1), scenario simulation with an adversarial persona (steps 2-3), and automated regression comparison (step 5) run in sequence, and the scorecard results feed into your analytics dashboard, where trajectory analysis and human review happen on the production data.
What gaps remain in agent evaluation tooling?
Three areas still require custom implementation or manual work: trajectory-level scoring across multi-turn conversations, confidence-based routing to human reviewers, and first-class baseline comparison with statistical significance. These gaps are closing, but they shape what you'll build yourself versus what you'll get from existing tools.
Trajectory-level scoring is still manual. Current scorecard tools evaluate individual interactions. Scoring across turns (did the agent maintain context from turn 1 to turn 7?) requires custom implementation. Industry tooling for automated trajectory evaluation is maturing but not standardized yet.
Confidence thresholds for automated human routing don't exist yet in most platforms. The hybrid model described above requires you to build the confidence-check logic yourself. Automatic flagging of "needs human review" conversations based on score distributions would eliminate that custom work.
Baseline comparison requires manual calculation. Pulling two sets of scorecard results and computing deltas per dimension works, but a first-class compareBaseline() method that returns per-dimension deltas and handles statistical significance would make regression testing accessible to teams without data engineering resources.
These are active areas of development. The methods described here work today, even with these gaps. You'll write a bit of glue code for trajectory analysis and baseline comparison, but the underlying evaluation patterns are sound.
Start here
If you're migrating from a monolithic LLM judge to a layered evaluation strategy, here's the order that delivers the most value fastest:
1. Replace generic criteria with domain-specific scorecards. This is the single most impactful change. It takes an afternoon and immediately surfaces dimension-level insights you're currently blind to.
2. Add five adversarial personas to your test suite: a confused caller, a social engineer, a topic switcher, an angry escalator, and a domain expert who knows more than your agent. Run them weekly.
3. Set up regression baselines. Score 20 representative conversations today. Score the same set next week. Set alerts on per-dimension deltas. You now have drift detection.
4. Implement human review for the ambiguous middle. Route conversations with scores between 2.5 and 3.5 to a human reviewer. Track agreement rates. Tighten rubrics where humans and LLMs disagree.
5. Add component-level eval when you're debugging tool selection failures. This is the most work to set up but pays off quickly for agents with 5+ tools, where "wrong answer" has multiple possible root causes.
6. Layer in trajectory evaluation as your test scenarios get more complex. Once your scenarios involve multi-turn workflows, per-turn scoring will start missing failures that trajectory scoring catches.
You don't need all six methods on day one. Start with domain-specific scorecards and adversarial simulation. Add regression baselines within the first month. Layer in the rest as your agent's complexity grows and your team's evaluation maturity increases.
The LLM judge isn't dead. It's just no longer the whole answer. Constrain it with concrete criteria, challenge it with adversarial simulations, verify it with human calibration, and monitor it with regression baselines. That's the evaluation stack that holds up in production.
Build your evaluation stack today
Domain-specific scorecards, adversarial personas, and regression baselines. Set up all three in one session.
Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.