Trajectory Eval: Catch Agent Bugs Output Scoring Misses

Final-output scoring misses 20-40% of agent regressions. Trajectory evaluation scores every step an agent takes -- tool calls, reasoning decisions, order of operations -- and catches the bugs that output-only evals can't see.

Dean Grover, Co-founder
May 8, 2026
[Figure: A flowchart showing an agent's step-by-step decision path with one step flagged as diverging from the expected trajectory]

Your team shipped a prompt revision last Tuesday. Output quality in offline evals went up -- 91% pass rate versus 87% before. It looked like a clean win. You rolled it out Wednesday morning.

By Friday, support tickets were spiking. Not because the agent's answers were wrong -- they were actually clearer. Because the agent was now skipping the account verification step on a specific class of queries, looking up the wrong customer record, and processing changes on the wrong account. The final responses were coherent and confident. They were just acting on the wrong data.

You had evaluated the outputs. You hadn't evaluated the path.

This is the trajectory problem. In any multi-step agent, the quality of the path matters as much as the quality of the destination. And in CX agents where actions have real-world consequences -- refunds processed, policies applied, accounts modified -- a wrong path that produces a right-sounding output is more dangerous than a wrong path that fails visibly.

What trajectory evaluation actually measures

Trajectory evaluation scores the complete sequence of steps an agent takes to complete a task, not just its final response. Every tool call, every reasoning decision, every intermediate action is part of the trajectory. The evaluation asks: did the agent do the right things, in the right order, with the right arguments?

This is distinct from output evaluation in a specific way. Output evaluation answers "did the agent say the right thing?" Trajectory evaluation answers "did the agent do the right thing to get there?" In a single-turn LLM call, those are almost the same question. In a multi-step agent with tool use, external lookups, and stateful decisions, they diverge significantly.

Consider a CX agent handling a refund request. The correct trajectory is roughly:

  1. Verify customer identity
  2. Look up the specific order
  3. Check refund eligibility against policy
  4. Apply the refund if eligible, or explain the reason if not
  5. Confirm the action to the customer

An agent can produce a confident, well-phrased refund confirmation after steps 1, 4, and 5 -- skipping the order lookup and the policy check. The output sounds right. The underlying action was wrong because the agent acted without verifying the specific order or its eligibility.

Output evaluation gives this a passing score. Trajectory evaluation flags it.

The 20-40% regression miss rate

Agent evaluation research from 2026 puts a number on the gap: agents evaluated only on final-output quality pass 20-40% more test cases than they pass under full trajectory evaluation. That's not a rounding error -- it's a systematic blind spot.

The mechanism is what makes this hard to catch with other approaches. When a model update or prompt change causes the agent to use a different path, the new path often produces correct results on the test cases you've written. Your test suite was designed around the expected outputs, not the expected paths. The agent finds an alternative route to the same destination, your tests pass, and you ship.

The failures show up in production on the cases your test suite didn't cover -- the edge cases, the unusual sequences of user inputs, the error recovery paths. This is where a wrong path matters: the alternative route your agent discovered might work on the 80% of inputs in your test set and fail on the 20% of inputs you didn't test.

Trajectory evaluation changes this. Instead of asking "did the final output match?" you ask "did the agent take the expected steps?" A prompt change that shifts the agent's tool usage pattern from [verify, lookup, check, apply] to [verify, check, apply] fails immediately in trajectory evaluation -- even if the outputs on your test cases look fine.
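To make the contrast concrete, here is a minimal sketch in which an output-only check and a trajectory check look at the same run. The `AgentRun` shape, the tool names, and both check functions are hypothetical illustrations, not part of any SDK.

```typescript
// Hypothetical shape of a completed agent run; not a real SDK type.
interface AgentRun {
  toolCalls: string[]; // tool names in the order they were called
  finalMessage: string;
}

// Output-only check: looks at the final message and nothing else.
function outputLooksRight(run: AgentRun): boolean {
  return /refund/i.test(run.finalMessage) && /processed/i.test(run.finalMessage);
}

// Trajectory check: lists the expected tools that never appear in the run.
function missingTools(expectedTools: string[], run: AgentRun): string[] {
  const called = new Set(run.toolCalls);
  return expectedTools.filter((tool) => !called.has(tool));
}

const expectedPath = ["verify", "lookup", "check", "apply"];

// After the prompt change, the agent skips the lookup but still sounds right.
const runAfterPromptChange: AgentRun = {
  toolCalls: ["verify", "check", "apply"],
  finalMessage: "Your refund has been processed.",
};

const outputPasses = outputLooksRight(runAfterPromptChange); // true
const skipped = missingTools(expectedPath, runAfterPromptChange); // ["lookup"]
console.log({ outputPasses, skipped });
```

The same run passes one check and fails the other, which is the gap a test suite built only around expected outputs cannot see.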

Why CX agents are especially exposed

CX agents are uniquely vulnerable to trajectory errors because they run 8-12 dependent steps per interaction -- identity verification, account lookup, policy check, action application, confirmation -- where an error in step 2 corrupts everything that follows. Unlike single-turn LLM calls where trajectory evaluation is almost redundant, multi-step CX agents have enough intermediate decisions that wrong paths regularly produce plausible final outputs.

Standard single-turn LLM applications have short trajectories. The agent gets a prompt, calls one or two tools, and returns a response. Even if trajectory evaluation would find issues, the blast radius of a wrong step is limited, because almost nothing runs downstream of it.

CX agents are different. A customer service agent handling a billing dispute typically runs 8-12 steps: identity verification, account lookup, transaction history pull, dispute policy check, escalation eligibility check, credit application, confirmation, and follow-up scheduling. Each step depends on the output of the previous one.

  1. Verify Identity
  2. Look Up Account
  3. Fetch Transaction History
  4. Check Dispute Policy
  5. Assess Eligibility
  6. Apply Credit or Explain Denial
  7. Confirm and Summarize

Figure: How an error in step 2 corrupts downstream steps in a CX agent trajectory

When step 2 looks up the wrong account -- or skips a disambiguation check when multiple accounts match -- steps 3 through 7 all operate on wrong data. The agent's reasoning in each subsequent step is internally consistent. The dispute policy check is correct given the transaction history it retrieved. The eligibility assessment is correct given the policy. The credit application is correct given the eligibility. All of it is coherent. All of it is wrong.

Output evaluation on step 7 sees a confident, well-reasoned response and scores it highly. Trajectory evaluation catches the wrong account lookup in step 2 and flags the entire run.
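One caveat is worth spelling out. A comparison based purely on tool names would not notice a wrong account, because the agent still calls the same tools. A possible complement is a consistency check: confirm that every downstream step references the account the scenario says the customer owns. The sketch below assumes each step exposes an `accountId` argument; that field name, the tool names, and the account IDs are all hypothetical.

```typescript
// Hypothetical step shape: tool name plus the arguments it was called with.
interface StepWithArgs {
  toolName: string;
  arguments: Record<string, unknown>;
}

// Returns every step whose `accountId` argument differs from the account
// the scenario expects. Steps without an accountId argument are ignored.
function stepsOnWrongAccount(
  steps: StepWithArgs[],
  expectedAccountId: string
): StepWithArgs[] {
  return steps.filter((step) => {
    const accountId = step.arguments["accountId"];
    return accountId !== undefined && accountId !== expectedAccountId;
  });
}

const disputeRun: StepWithArgs[] = [
  { toolName: "verify_identity", arguments: { phone: "555-0100" } },
  { toolName: "fetch_transaction_history", arguments: { accountId: "acct_B" } },
  { toolName: "apply_credit", arguments: { accountId: "acct_B", amount: 40 } },
];

// The scenario's customer owns acct_A, so both downstream steps are flagged.
const wrongAccountSteps = stepsOnWrongAccount(disputeRun, "acct_A");
console.log(wrongAccountSteps.map((s) => s.toolName));
```

This checks agreement between steps rather than exact argument payloads, so it should survive most schema changes.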

The three trajectory metrics

The standard trajectory metrics are trajectory_exact_match, trajectory_precision, and trajectory_recall. Use precision and recall together in production -- exact_match is too brittle for real-world agents that legitimately vary their paths. Precision tells you if the agent is doing unexpected things. Recall tells you if the agent is skipping expected things. Both matter, for different reasons.

  trajectory_exact_match: Did the agent take identical steps?
  trajectory_precision: Of the steps taken, what fraction were expected?
  trajectory_recall: Of the expected steps, what fraction were taken?

Figure: Trajectory evaluation metrics: exact match, precision, and recall, computed by comparing the actual trajectory against the expected trajectory

trajectory_exact_match asks whether the agent's trajectory is identical to the expected trajectory -- same steps, same order, same arguments. This is the strictest measure and too brittle for most production use. Agents legitimately find equivalent paths. A threshold of 1.0 exact match will generate false positives constantly.

trajectory_precision measures, of the steps the agent actually took, what fraction were in the expected step set. High precision means the agent didn't take unexpected steps. Low precision means the agent is doing things you didn't expect -- calling extra tools, making extra queries, taking detours.

trajectory_recall measures, of the steps you expected the agent to take, what fraction the agent actually took. High recall means the agent didn't skip expected steps. Low recall means the agent is skipping steps -- the verification it's supposed to do, the policy check it's supposed to run, the confirmation it's supposed to give.

In production, you want both. An agent with high precision but low recall is skipping critical steps while staying on-path for the steps it does take. An agent with high recall but low precision is doing everything you expected, plus extra things you didn't -- which might be harmless or might be a sign of a prompt drift that adds unnecessary tool calls.
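A small worked example shows how the two metrics move independently. This is a sketch over tool names only; the `precisionRecall` helper and the tool names are illustrative.

```typescript
// Precision and recall over tool names, following the definitions above.
function precisionRecall(expectedTools: string[], actualTools: string[]) {
  const expectedSet = new Set(expectedTools);
  const actualSet = new Set(actualTools);
  const expectedHits = actualTools.filter((t) => expectedSet.has(t)).length;
  const covered = expectedTools.filter((t) => actualSet.has(t)).length;
  return {
    precision: actualTools.length > 0 ? expectedHits / actualTools.length : 0,
    recall: expectedTools.length > 0 ? covered / expectedTools.length : 0,
  };
}

const expectedTools = ["verify_identity", "lookup_order", "check_policy", "apply_refund"];

// Skips the policy check: every step taken was expected, but one is missing.
// precision = 3/3 = 1.00, recall = 3/4 = 0.75
const skipsStep = precisionRecall(expectedTools, [
  "verify_identity",
  "lookup_order",
  "apply_refund",
]);

// Takes every expected step plus one detour.
// precision = 4/5 = 0.80, recall = 4/4 = 1.00
const addsStep = precisionRecall(expectedTools, [
  ...expectedTools,
  "fetch_account_preferences",
]);

console.log(skipsStep, addsStep);
```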

Set thresholds based on criticality. For a payment processing flow, you might require 0.95 recall on the verification and authorization steps, while allowing 0.80 precision on supporting lookups. For a general support flow, 0.85 recall and 0.85 precision might be appropriate.
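One way to encode this is a per-flow threshold table. The flow names and the fallback behavior below are assumptions for illustration; the numbers mirror the examples in the paragraph above and should be tuned for your own flows.

```typescript
interface Thresholds {
  precision: number;
  recall: number;
}

// Illustrative values only; flow names are hypothetical.
const thresholdsByFlow: Record<string, Thresholds> = {
  payment: { precision: 0.8, recall: 0.95 },
  general_support: { precision: 0.85, recall: 0.85 },
};

function passesFlowThresholds(
  flow: string,
  precision: number,
  recall: number
): boolean {
  // Unknown flows fall back to the general support thresholds.
  const t = thresholdsByFlow[flow] ?? { precision: 0.85, recall: 0.85 };
  return precision >= t.precision && recall >= t.recall;
}

// The same scores pass a general support flow but fail a payment flow.
const generalPasses = passesFlowThresholds("general_support", 0.9, 0.9);
const paymentPasses = passesFlowThresholds("payment", 0.9, 0.9);
console.log({ generalPasses, paymentPasses });
```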

Building a trajectory test suite

Here's a concrete implementation. The pattern has three parts: recording expected trajectories from your best-performing runs, replaying test scenarios and capturing actual trajectories, and computing precision and recall against the expected set.

eval/trajectory-recorder.ts·typescript
import { promises as fs } from "node:fs"; // needed for fs.writeFile below
import { Chanl } from "@chanl/sdk";
 
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });
 
interface TrajectoryStep {
  toolName: string;
  arguments: Record<string, unknown>;
  order: number;
}
 
interface RecordedTrajectory {
  scenarioId: string;
  steps: TrajectoryStep[];
  finalOutput: string;
  recordedAt: string;
}
 
async function recordExpectedTrajectory(
  scenarioId: string,
  runId: string
): Promise<RecordedTrajectory> {
  const run = await chanl.calls.get(runId);
  
  const steps: TrajectoryStep[] = run.toolCalls.map((call, i) => ({
    toolName: call.name,
    arguments: call.arguments,
    order: i,
  }));
 
  const trajectory: RecordedTrajectory = {
    scenarioId,
    steps,
    finalOutput: run.finalMessage,
    recordedAt: new Date().toISOString(),
  };
 
  await fs.writeFile(
    `./trajectories/expected/${scenarioId}.json`,
    JSON.stringify(trajectory, null, 2)
  );
 
  return trajectory;
}

Once you have expected trajectories stored, your evaluation runs compare actual against expected:

eval/trajectory-evaluator.ts·typescript
interface TrajectoryEvalResult {
  scenarioId: string;
  exactMatch: boolean;
  precision: number;
  recall: number;
  unexpectedSteps: TrajectoryStep[];
  missingSteps: TrajectoryStep[];
  passed: boolean;
}
 
function evaluateTrajectory(
  expected: RecordedTrajectory,
  actual: TrajectoryStep[],
  thresholds = { precision: 0.85, recall: 0.9 }
): TrajectoryEvalResult {
  const expectedSet = new Set(expected.steps.map((s) => s.toolName));
  const actualSet = new Set(actual.map((s) => s.toolName));
 
  // Precision: what fraction of actual steps were expected
  const correctSteps = actual.filter((s) => expectedSet.has(s.toolName));
  const precision = actual.length > 0 ? correctSteps.length / actual.length : 0;
 
  // Recall: what fraction of expected steps were taken
  const coveredSteps = expected.steps.filter((s) => actualSet.has(s.toolName));
  const recall =
    expected.steps.length > 0
      ? coveredSteps.length / expected.steps.length
      : 0;
 
  const unexpectedSteps = actual.filter((s) => !expectedSet.has(s.toolName));
  const missingSteps = expected.steps.filter((s) => !actualSet.has(s.toolName));
 
  return {
    scenarioId: expected.scenarioId,
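    // Exact match here means the same tool names in the same order.
    // Arguments are deliberately not compared.
    exactMatch:
      actual.length === expected.steps.length &&
      expected.steps.every((s, i) => s.toolName === actual[i].toolName),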
    precision,
    recall,
    unexpectedSteps,
    missingSteps,
    passed:
      precision >= thresholds.precision && recall >= thresholds.recall,
  };
}

And a runner that exercises your scenarios and reports trajectory results alongside output quality:

eval/run-trajectory-evals.ts·typescript
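import { promises as fs } from "node:fs";
// Assumes `chanl`, `evaluateTrajectory`, and the trajectory types from the
// two files above are exported there and imported here.

async function runTrajectoryEvals(scenarioIds: string[]) {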
  const results: TrajectoryEvalResult[] = [];
 
  for (const scenarioId of scenarioIds) {
    const expected = JSON.parse(
      await fs.readFile(`./trajectories/expected/${scenarioId}.json`, "utf8")
    );
 
    // Run the agent against this scenario
    const run = await chanl.scenarios.run(scenarioId);
 
    const actualSteps: TrajectoryStep[] = run.toolCalls.map((call, i) => ({
      toolName: call.name,
      arguments: call.arguments,
      order: i,
    }));
 
    const result = evaluateTrajectory(expected, actualSteps);
    results.push(result);
 
    if (!result.passed) {
      console.error(`TRAJECTORY FAIL: ${scenarioId}`);
      console.error(
        `  Precision: ${result.precision.toFixed(2)} (${result.unexpectedSteps.map((s) => s.toolName).join(", ") || "none"} unexpected)`
      );
      console.error(
        `  Recall: ${result.recall.toFixed(2)} (${result.missingSteps.map((s) => s.toolName).join(", ") || "none"} missing)`
      );
    }
  }
 
  // Guard against an empty scenario list, which would otherwise yield NaN.
  const passRate =
    results.length > 0
      ? results.filter((r) => r.passed).length / results.length
      : 0;
  console.log(
    `Trajectory eval complete: ${(passRate * 100).toFixed(1)}% pass rate`
  );
  return results;
}

How strict should expected trajectories be?

This is the calibration question that every team wrestles with. Too strict and you get constant false positives from legitimate path variations. Too loose and you miss the regressions that matter.

A few heuristics that work in practice.

Anchor on semantics, not argument values. Your trajectory assertions should check that the agent called verify_identity before lookup_account, not that it called verify_identity with exactly { method: "phone", digits: 4 }. Argument-level assertions break whenever argument schemas change and don't catch the class of errors you care about most.
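An ordering assertion of this kind can be sketched in a few lines. The `calledBefore` helper is hypothetical; it looks only at tool names and ignores arguments entirely.

```typescript
// Returns true when `first` is called before the earliest call to `second`.
// If `second` is never called there is nothing to violate, so it returns true.
function calledBefore(
  toolNames: string[],
  first: string,
  second: string
): boolean {
  const secondIndex = toolNames.indexOf(second);
  if (secondIndex === -1) return true;
  const firstIndex = toolNames.indexOf(first);
  return firstIndex !== -1 && firstIndex < secondIndex;
}

const goodOrder = calledBefore(
  ["verify_identity", "lookup_account", "apply_credit"],
  "verify_identity",
  "lookup_account"
); // true

const badOrder = calledBefore(
  ["lookup_account", "verify_identity", "apply_credit"],
  "verify_identity",
  "lookup_account"
); // false: the lookup ran before verification

console.log({ goodOrder, badOrder });
```

Because the check never inspects arguments, it keeps working when an argument schema changes.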

Separate critical from supporting steps. Not every step in your trajectory matters equally. verify_identity in a payment flow is a critical step -- missing it is always a regression. fetch_account_preferences might be a nice-to-have that the agent can skip based on conversation context. Set separate recall thresholds per step class.
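A rough sketch of per-class recall follows. The split into critical and supporting steps, the tool names, and the threshold values are all assumptions for illustration.

```typescript
// Recall restricted to a subset of expected steps, such as the critical ones.
function recallFor(stepSubset: string[], actualTools: string[]): number {
  if (stepSubset.length === 0) return 1; // nothing required, nothing missed
  const called = new Set(actualTools);
  return stepSubset.filter((t) => called.has(t)).length / stepSubset.length;
}

const criticalSteps = ["verify_identity", "authorize_payment"];
const supportingSteps = ["fetch_account_preferences", "fetch_transaction_history"];
const observedTools = [
  "verify_identity",
  "authorize_payment",
  "fetch_transaction_history",
];

const criticalRecall = recallFor(criticalSteps, observedTools); // 2/2 = 1
const supportingRecall = recallFor(supportingSteps, observedTools); // 1/2 = 0.5

// Critical steps must all be present; supporting steps get more slack.
const stepClassPass = criticalRecall >= 1.0 && supportingRecall >= 0.5;
console.log({ criticalRecall, supportingRecall, stepClassPass });
```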

Record from diverse runs, not just happy-path runs. If your expected trajectories come only from clean, simple test cases, you'll miss legitimate trajectory variation the agent uses for edge cases. Record from a diverse sample of real production runs and use the most common pattern as your baseline.
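Picking the most common pattern can be as simple as counting identical tool sequences. The sketch below assumes each recorded run has already been reduced to an ordered list of tool names; `mostCommonTrajectory` is a hypothetical helper.

```typescript
// Picks the tool sequence that occurs most often across recorded runs.
// On a tie, the sequence that reached the top count first wins.
function mostCommonTrajectory(runs: string[][]): string[] {
  const counts: Record<string, number> = {};
  let bestKey = "[]"; // an empty input yields an empty baseline
  let bestCount = 0;
  for (const steps of runs) {
    const key = JSON.stringify(steps);
    const count = (counts[key] ?? 0) + 1;
    counts[key] = count;
    if (count > bestCount) {
      bestKey = key;
      bestCount = count;
    }
  }
  return JSON.parse(bestKey) as string[];
}

const recordedRuns = [
  ["verify", "lookup", "check", "apply"],
  ["verify", "lookup", "check", "escalate"],
  ["verify", "lookup", "check", "apply"],
];

const baselineSteps = mostCommonTrajectory(recordedRuns);
console.log(baselineSteps);
```

The less common sequences are worth keeping too, since they document legitimate variation you may want to allow.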

Update trajectories intentionally, not automatically. When your agent legitimately improves its path -- finding a more efficient tool sequence, skipping an unnecessary verification step that you've removed from policy -- update the expected trajectory with a deliberate review. Don't auto-update trajectories on every passing eval run; that defeats the purpose.

What to do when trajectories diverge

When a trajectory eval fails, there are three possibilities: it's a regression you need to fix, it's a legitimate improvement you should bless by updating the baseline, or it's acceptable variation that means your thresholds are too tight. Knowing which one you're looking at requires examining the specifics.

Missing critical steps (low recall on verification, policy check, confirmation) are almost always regressions. The agent skipping verification is dangerous regardless of whether the final output looks correct. Investigate the prompt change or model update that caused the skip. Don't widen thresholds to absorb it.

Unexpected extra steps (low precision) are more ambiguous. The agent calling an extra tool might be a sign of a confused agent, or it might be the agent being more thorough than your baseline expected. Check whether the extra tool call is harmful (slowing down the interaction, calling unnecessary external APIs) or neutral. If neutral, update your expected trajectory to include it.

Order changes in non-critical sequences are often safe to accept. If the agent fetches account preferences before pulling transaction history in one run and after it in another, the order probably doesn't matter semantically, because neither step depends on the other. Your trajectory assertions can be flexible about ordering for steps that don't have explicit dependencies, while still requiring that critical steps such as identity verification come first.

The framing that helps: trajectory divergences are a prompt for investigation, not an automatic verdict. They surface changes in agent behavior that would otherwise be invisible. What you do with that information depends on whether the change is harmful.
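The triage rules above can be written down as a first-pass classifier. This is only a sketch: the category names and the input shape are hypothetical, and the result is a prompt for a human reviewer, not a verdict.

```typescript
type Triage = "regression" | "review_extra_steps" | "review_order" | "ok";

// First-pass triage following the rules above: a missing critical step
// outranks everything else, then extra steps, then pure order changes.
function triageDivergence(input: {
  missingCritical: string[];
  unexpected: string[];
  orderChanged: boolean;
}): Triage {
  if (input.missingCritical.length > 0) return "regression";
  if (input.unexpected.length > 0) return "review_extra_steps";
  if (input.orderChanged) return "review_order";
  return "ok";
}

const skippedVerification = triageDivergence({
  missingCritical: ["verify_identity"],
  unexpected: ["fetch_account_preferences"],
  orderChanged: false,
}); // "regression"

console.log(skippedVerification);
```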

Connecting trajectory failures to production signals

The real power of trajectory evaluation shows up when you connect it to your production monitoring pipeline. Offline trajectory evals catch regressions before deployment. Production trajectory monitoring catches drift after deployment.

Production trajectory monitoring works by sampling live interactions and comparing them against baseline trajectories. You don't evaluate every call -- that's too expensive. Instead, you sample 5-10% and run trajectory comparisons asynchronously. When the average precision or recall on your sampled runs drops below your threshold, you get an alert.
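A minimal sketch of the sampling and alerting logic follows, assuming each call has a stable string ID and that per-call precision and recall have already been computed. The function names and the hash are illustrative choices, not a description of how any monitoring product works.

```typescript
// Deterministic sampling: hash the call ID into [0, 1) and keep the call when
// the value falls under the sample rate, so a given call is always in or out.
function inSample(callId: string, sampleRate: number): boolean {
  let hash = 0;
  for (let i = 0; i < callId.length; i++) {
    hash = (hash * 31 + callId.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return hash / 0x100000000 < sampleRate;
}

interface SampledScore {
  precision: number;
  recall: number;
}

// Averages the sampled scores and names each metric that is below threshold.
function driftAlerts(scores: SampledScore[], thresholds: SampledScore): string[] {
  if (scores.length === 0) return [];
  const mean = (pick: (s: SampledScore) => number) =>
    scores.reduce((sum, s) => sum + pick(s), 0) / scores.length;
  const alerts: string[] = [];
  if (mean((s) => s.precision) < thresholds.precision) alerts.push("precision");
  if (mean((s) => s.recall) < thresholds.recall) alerts.push("recall");
  return alerts;
}

const sampledScores: SampledScore[] = [
  { precision: 1.0, recall: 0.9 },
  { precision: 1.0, recall: 0.7 },
];

// Mean recall is about 0.8, below the 0.9 threshold, so only recall alerts.
const activeAlerts = driftAlerts(sampledScores, { precision: 0.85, recall: 0.9 });
console.log(activeAlerts);
```

Hashing the ID instead of calling a random number generator keeps the sample reproducible, which helps when you re-run a comparison after changing a baseline.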

This gives you something that pure output quality monitoring doesn't: an early warning signal. Trajectory drift typically shows up 24-48 hours before it translates into detectable quality degradation in production scorecards. By the time your quality scores drop, you're already seeing user complaints. When trajectory drift is the first thing you notice, you still have time to investigate and potentially roll back.

For teams using Chanl's monitoring dashboards, trajectory metrics sit alongside conversation quality scores as a leading indicator. A spike in unexpected tool calls or a drop in recall on identity verification steps is the signal to look at before your CSAT scores move.

The feedback loop works like this: production trajectory anomalies flow into annotation queues for expert review, validated failures become regression tests in your scenario suite via Chanl's scenario testing, and those scenarios run automatically on every model update or prompt change. Failures in trajectory evals gate deployment. Bugs stop reaching users because they became test cases the first time they appeared.

Trajectory evaluation and model updates

Model updates are the highest-risk moment for trajectory regressions. A new model version can improve final-output quality while silently changing how it uses tools -- different sequencing, skipped intermediate steps, new tool combinations. Output evals miss this entirely. Trajectory evaluation catches it immediately, because the model's new path diverges from your recorded baseline even if the outputs pass.

When you swap from one model version to another, or update your fine-tuned weights, output quality on your existing eval set might improve. But the model's tool-use behavior -- which tools it chooses, how it sequences them, what arguments it passes -- can change significantly even when outputs look good.

The classic failure mode: a new model version discovers that it can skip an intermediate tool call by reasoning from context rather than making the API call. On your test set, this works. In production, the reasoning-from-context approach breaks on cases where the agent doesn't have the right context, and the agent produces a confident answer from stale or wrong information.

Trajectory evaluation catches this because the expected trajectory includes the intermediate tool call. The new model's trajectory will show low recall (it skipped the expected step), which flags the behavior for review before deployment.

LLM-as-a-judge pipelines are excellent for scoring output quality, and tracking score drift over time catches slow degradation. Trajectory evaluation is the complementary layer that catches structural changes in how the agent operates -- changes that output scoring won't see until they've already caused production failures.

Putting trajectory evaluation into your release process

Add trajectory eval as a CI gate alongside output quality evals, with separate recall thresholds per interaction type: high (0.95+) for critical flows like payment and identity verification, moderate (0.85) for general support. The gate is the build step that catches path regressions before deployment; the connect step runs the same comparison on sampled production traffic; and the monitor step alerts when live trajectories drift from baseline.

The practical integration point is your CI pipeline. Add trajectory eval as a required gate alongside your output quality evals. For critical interaction types (payment processing, identity verification, escalation decisions), set high recall thresholds. For general support flows, set more forgiving thresholds that still catch major structural regressions.
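The gate itself can be very small. The sketch below assumes your eval runner produces one result per scenario with a `passed` flag; `ciExitCode` and the scenario IDs are hypothetical.

```typescript
interface GateResult {
  scenarioId: string;
  passed: boolean;
}

// Returns the exit code for a CI step: 0 when every scenario passed, 1
// otherwise. An empty result set fails so a misconfigured run can't pass
// silently.
function ciExitCode(results: GateResult[]): number {
  if (results.length === 0) return 1;
  return results.every((r) => r.passed) ? 0 : 1;
}

const gateResults: GateResult[] = [
  { scenarioId: "refund-basic", passed: true },
  { scenarioId: "billing-dispute", passed: false },
];

const exitCode = ciExitCode(gateResults); // 1: one scenario failed
console.log(exitCode);
```

In a Node script you would assign the result to `process.exitCode`, so the CI job fails whenever any trajectory eval fails.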

Your scenario library is the foundation. If you're using Chanl's scenario testing, your scenarios already define expected interaction patterns that you can extract into trajectory expectations. Start with the scenarios you've marked as critical, record their expected trajectories from a known-good agent version, and add trajectory assertions to your eval runs.

The setup cost is front-loaded: recording baseline trajectories takes time, and calibrating thresholds takes a few iterations. Once it's running, trajectory evaluation adds minimal overhead to your existing eval pipeline and catches a class of bugs that was previously invisible until production.

Your agent gets smarter over time. Make sure its path to the right answer is getting better too, not just the answer itself.
