What Is a Trajectory Eval for AI Agents?

A trajectory eval records every step an agent takes to complete a task (each tool call, its arguments, and the result returned), then scores that sequence against an expected pattern. Unlike outcome evals, which only check whether the final answer was correct, trajectory evals check whether the agent used the right tools in the right order, avoided unnecessary operations, and stayed within its authorized scope. A trajectory eval can catch a billing agent that fetched another customer's records even when it returned the right balance.

Why Do Outcome Evals Miss Agent Failures?

Outcome evals only check the final state: did the agent answer correctly? They can't see whether the agent made seven redundant tool calls to get there, accessed data outside its authorization scope, fetched a record belonging to the wrong customer, or took a twelve-step path through a problem that should have taken three. All of these failures are invisible to outcome evals but fully visible in the trajectory. For production CX agents, these hidden failures translate directly into cost, latency, privacy risk, and regulatory exposure.

How Do I Define the Expected Tool Call Sequence for a Scenario?

Start by watching your agent complete the task successfully several times. Note which tools it called and in what order. Define the core sequence as required steps and the rest as optional. You don't need exact argument matching for most tools. Checking that the right tools were called in roughly the right order catches most trajectory failures. For high-risk operations like deletions, writes, and billing, require exact argument validation in addition to sequence scoring.
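As a rough sketch of that split (sequence-only checks for most tools, exact argument checks for the high-risk ones), the TypeScript below compares recorded arguments against expected values only for tools you list. The `ToolCall` shape, the `ExpectedArgs` map, and the tool names are illustrative assumptions, not part of any particular SDK.

```typescript
// Hypothetical shape for one recorded tool call; adapt to your tracing format.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

// Exact expected arguments, listed only for high-risk tools (assumed names).
type ExpectedArgs = Record<string, Record<string, unknown>>;

// Returns the high-risk calls whose arguments differ from what was expected.
// Tools absent from `expected` are not argument-checked at all.
function findArgMismatches(calls: ToolCall[], expected: ExpectedArgs): ToolCall[] {
  return calls.filter((call) => {
    const want = expected[call.name];
    if (want === undefined) return false; // low-risk tool: sequence check only
    const keys = Object.keys(want).concat(Object.keys(call.args));
    // Per-key JSON comparison; adequate for flat arguments, not for nested key order.
    return keys.some((k) => JSON.stringify(want[k]) !== JSON.stringify(call.args[k]));
  });
}

const exampleCalls: ToolCall[] = [
  { name: 'get_billing_history', args: { customerId: 'c_123', months: 6 } },
  { name: 'create_adjustment', args: { customerId: 'c_123', amountCents: 5000 } },
];

// Only create_adjustment is treated as high-risk in this example.
const argMismatches = findArgMismatches(exampleCalls, {
  create_adjustment: { customerId: 'c_123', amountCents: 2500 },
});
```

Here `argMismatches` contains the `create_adjustment` call, because the recorded `amountCents` differs from the expected value, while `get_billing_history` is never argument-checked.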

What Percentage of Agent Failures Are Only Visible in the Trajectory?

LangChain's 2026 State of AI Agents survey found that 89% of teams have observability for their agents, but only 52% have evaluation practices. Most teams that do run evals are checking outputs, not trajectories. The failures that only show up in the path, like redundant calls, scope violations, and efficiency regressions, tend to be invisible until they cause a cost spike, a privacy incident, or a latency complaint. There's no published breakdown of trajectory-only versus outcome-only failures, but any agent calling multiple tools on production data is accumulating these hidden failure modes.

How Do Trajectory Evals Relate to Agent Cost and Latency?

Directly. Every redundant tool call is a wasted API request and a latency hit. An agent that calls your CRM three times in a single interaction where once was enough is costing you three times the expected API budget and adding hundreds of milliseconds to response time. Trajectory evals catch this because they measure whether the agent completed the task with the expected number of steps, not just whether it completed it at all. At scale, a 30% reduction in unnecessary tool calls can meaningfully change your cost-per-conversation.
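To make the arithmetic concrete, here is a back-of-the-envelope calculation for the three-calls-instead-of-one case. Every number in it (traffic volume, per-call price, per-call latency) is an illustrative assumption; substitute your own figures.

```typescript
// All values below are illustrative assumptions, not measurements.
const conversationsPerDay = 10000;
const expectedCallsPerConversation = 1;
const actualCallsPerConversation = 3; // the CRM is hit three times instead of once
const costPerCallUsd = 0.002;
const latencyPerCallMs = 150; // assumes the calls run sequentially

const redundantCallsPerConversation =
  actualCallsPerConversation - expectedCallsPerConversation;

// Extra requests and spend per day caused purely by redundant calls.
const redundantCallsPerDay = redundantCallsPerConversation * conversationsPerDay;
const wastedUsdPerDay = redundantCallsPerDay * costPerCallUsd;

// Extra latency added to each conversation.
const addedLatencyMs = redundantCallsPerConversation * latencyPerCallMs;
```

Under these assumptions the redundant calls add 20,000 requests and roughly $40 per day, plus 300 ms per conversation.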

Can I Run Trajectory Evals on a Live Production Agent?

Yes, and you should. Online trajectory evals work by capturing the full step trace from each real interaction and comparing it to your baseline trajectory patterns. Anomalies like unusual tool sequences, unexpected spikes in tool call count, or operations outside the agent's normal scope trigger alerts. You don't need labeled ground truth for online evals; you're looking for statistical deviation from the agent's own baseline behavior, which you can establish from your first week of production traffic.

What Tools Can I Use to Run Trajectory Evals?

LangChain's LangSmith traces agent trajectories natively if you're already in that stack. Arize Phoenix, Braintrust, and Latitude all support trajectory capture and scoring. If you're building CX agents specifically (voice, chat, or messaging), Chanl's scenario testing captures full step traces and lets you score expected tool sequences as part of your test cases. The key feature to look for is step-level scoring, not just conversation-level scoring.

How to Build a Trajectory Eval for Your AI Agent

A customer calls about a billing dispute. Your agent resolves it correctly: the customer hangs up satisfied, the issue is closed, and your outcome eval marks it a pass.

But if you could watch the full trajectory, here's what you'd see. The agent made seven tool calls to complete a task that needed two. It queried your billing API three times with the same customer ID, getting back identical results each time. Halfway through the conversation it fetched a transaction record for a completely unrelated customer. The record wasn't used, but it was accessed. The actual resolution only happened in step six because the agent spent the first five steps checking data it didn't need.

Outcome: correct. Trajectory: a mess.

Most eval systems score the first thing and never see the second. That gap is where production agent problems live, and a trajectory eval is what closes it. This piece walks through what a trajectory eval is, the three signals that matter for CX agents, and a minimal build you can ship in an afternoon.

Why Outcome Evals Miss So Much

Outcome evals only check the final state. Did the agent return the right answer? They can't see the path: redundant tool calls, unauthorized data access, unnecessary records fetched, or a twelve-step route through a problem that needed three. The final answer can be correct while the journey was expensive, unsafe, or wrong.

Outcome evals are the default because they're easy to define. Did the agent return the right answer, resolve the ticket, book the appointment? You need a ground truth label and a correctness check, and you're measuring something real.

What you're not measuring is everything that happened on the way to that answer.

LangChain's 2026 State of AI Agents survey put a number on this: 89% of teams with agents have observability implemented, but only 52% have any eval practice. Even within that 52%, most teams are evaluating outputs -- the final state -- not trajectories. Observability tells you the agent is running. Outcome evals tell you the agent is answering. Neither one tells you whether the path was clean.

Four categories of production failures only show up in the trajectory:

Redundant tool calls. Your agent calls the same API endpoint three times in one interaction and gets the same data back each time. The final answer is correct, but you've paid three times the API cost and added hundreds of milliseconds to the conversation. At scale, across thousands of daily interactions, this compounds into a real budget problem.

Scope violations. Your agent accesses data it wasn't supposed to touch -- a different customer's record, a billing table it wrote to when it only needed read access, a calendar entry outside the current user's account. The final answer was right, but the agent touched data it shouldn't have. This is a privacy risk and potentially a regulatory issue depending on your industry.

Unnecessary information access. Similar to scope violations but subtler: the agent accessed data within its scope that wasn't needed for the task. If your billing agent is fetching full conversation history to resolve a simple refund, it's pulling data that isn't needed and creating unnecessary exposure.

Path efficiency regressions. Your agent used to complete this type of task in three steps. Now it's taking twelve. The outcome is still correct, but something changed in the agent's behavior -- a prompt update, a new tool added to the set, a model version change -- that made it less direct. Without trajectory tracking, you won't notice until the latency complaints start.

What a Trajectory Eval Actually Is

A trajectory eval is a record and a rubric.

The record is the full step trace from a single agent interaction: every tool call, its inputs, and what was returned. Every reasoning step the agent took before deciding to call a tool. Every response it generated at intermediate steps.

trajectory-shape.ts·typescript

interface AgentTrajectory {
  taskId: string;
  startedAt: string;
  completedAt: string;
  steps: Array<{
    stepIndex: number;
    type: 'reasoning' | 'tool_call' | 'response';
    tool?: string;
    args?: Record<string, unknown>;
    result?: unknown;
    durationMs: number;
    tokensUsed?: number;
  }>;
  outcome: string;
  totalTokens: number;
  totalDurationMs: number;
}

The rubric is what you expect: which tools should be called, in what order, with what constraints. You define the rubric for your known scenario types, then score actual trajectories against it.

Figure: Trajectory eval, comparing actual agent steps against the expected sequence.

The trajectory score isn't just pass/fail. It tells you exactly which steps went wrong, making it actionable in a way that outcome evals aren't.

The Three Trajectory Signals That Matter for CX Agents

For customer experience agents -- voice, chat, messaging -- three signals in the trajectory predict real quality better than any single outcome metric.

Tool call accuracy is whether the agent called the right tools in the right order. For a billing dispute resolution, the expected sequence might be: get_customer_account then get_billing_history then create_adjustment (if needed) then send_confirmation. An agent that runs all four in the right order scores high. An agent that skips get_billing_history and jumps to create_adjustment is improvising in a way that should concern you.
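One simple way to check "right tools in the right order" is an ordered-subsequence match: the required tools must appear in order, while extra calls in between are tolerated. The function below is a minimal sketch of that idea; the name `missingInOrder` is made up for illustration, and the tool names are the ones from the billing example.

```typescript
// Returns the required tools that do not appear, in order, within the actual calls.
// Extra calls between required steps are tolerated (subsequence match).
function missingInOrder(actual: string[], requiredOrder: string[]): string[] {
  let cursor = 0; // index in `actual` where the next search starts
  const missing: string[] = [];
  for (const tool of requiredOrder) {
    const found = actual.indexOf(tool, cursor);
    if (found === -1) {
      missing.push(tool); // absent, or present only before an earlier required step
    } else {
      cursor = found + 1;
    }
  }
  return missing;
}

const requiredOrder = ['get_customer_account', 'get_billing_history', 'send_confirmation'];

// All required tools in order, with the optional adjustment in between.
const cleanRun = missingInOrder(
  ['get_customer_account', 'get_billing_history', 'create_adjustment', 'send_confirmation'],
  requiredOrder,
);

// Skips get_billing_history and jumps straight to create_adjustment.
const skippedRun = missingInOrder(
  ['get_customer_account', 'create_adjustment', 'send_confirmation'],
  requiredOrder,
);
```

`cleanRun` is empty, while `skippedRun` contains `get_billing_history`. A tool called out of order is reported the same way as a missing one, which is usually what you want for a first pass.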

Redundancy rate is the percentage of tool calls in a trajectory that were duplicates or unnecessary. A redundancy rate above 10% is a sign something is wrong -- either the agent's reasoning loop has a bug, the tools are returning unexpected results that confuse the agent, or a recent change to the agent's prompt is causing it to re-verify decisions it should be making once.
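A minimal sketch of a redundancy rate, under a deliberately narrow assumption: a call counts as redundant only if an earlier call in the same trajectory used the same tool with identical arguments. Calls that are unnecessary but not duplicates are out of scope for this sketch. The `RecordedCall` shape is hypothetical.

```typescript
// Hypothetical shape for one recorded tool call.
interface RecordedCall {
  tool: string;
  args: Record<string, unknown>;
}

// Fraction of calls that repeat an earlier call with the same tool and arguments.
// Note: JSON.stringify is sensitive to key order, so normalize args first if needed.
function redundancyRate(calls: RecordedCall[]): number {
  if (calls.length === 0) return 0;
  const seen = new Set<string>();
  let duplicates = 0;
  for (const call of calls) {
    const key = call.tool + ':' + JSON.stringify(call.args);
    if (seen.has(key)) {
      duplicates += 1;
    } else {
      seen.add(key);
    }
  }
  return duplicates / calls.length;
}

// Billing history fetched three times with the same customer ID.
const billingCalls: RecordedCall[] = [
  { tool: 'get_customer_account', args: { customerId: 'c_123' } },
  { tool: 'get_billing_history', args: { customerId: 'c_123' } },
  { tool: 'get_billing_history', args: { customerId: 'c_123' } },
  { tool: 'get_billing_history', args: { customerId: 'c_123' } },
  { tool: 'send_confirmation', args: { customerId: 'c_123' } },
];

const billingRedundancy = redundancyRate(billingCalls);
const exceedsThreshold = billingRedundancy > 0.1;
```

Two of the five calls are repeats, so `billingRedundancy` is 0.4 and the trajectory exceeds the 10% threshold.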

Scope adherence tracks whether every tool call in the trajectory was authorized given the current task and customer context. This is the privacy-and-security signal. If your customer service agent is fetching account records for customers other than the one it's currently serving, that's a scope violation regardless of whether the final outcome was correct.
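Scope adherence can be approximated with a simple argument check. The sketch below assumes every customer-scoped tool takes a `customerId` argument; that convention, the `ScopedCall` shape, and the `get_transaction` tool name are assumptions made for illustration. Real authorization rules are usually richer than a single ID comparison.

```typescript
// Hypothetical shape for one recorded tool call.
interface ScopedCall {
  tool: string;
  args: Record<string, unknown>;
}

// Returns calls that reference a customer other than the one being served.
// Calls with no `customerId` argument are ignored by this check.
function findScopeViolations(calls: ScopedCall[], activeCustomerId: string): ScopedCall[] {
  return calls.filter(
    (call) => 'customerId' in call.args && call.args['customerId'] !== activeCustomerId,
  );
}

const servedCalls: ScopedCall[] = [
  { tool: 'get_customer_account', args: { customerId: 'c_123' } },
  { tool: 'get_transaction', args: { customerId: 'c_999', transactionId: 't_1' } },
  { tool: 'send_confirmation', args: { customerId: 'c_123' } },
];

const scopeViolations = findScopeViolations(servedCalls, 'c_123');
```

`scopeViolations` holds the single `get_transaction` call made against customer `c_999`, regardless of whether its result was ever used.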

These three signals together give you a trajectory health score that tells you far more about your agent's behavior than any outcome metric alone.

How to Build Your First Trajectory Eval

You don't need a full observability platform to start. Here's a minimal approach that works with most agent setups and takes an afternoon to implement.

Step 1: Pick five golden paths. Choose the five most common successful interactions in your agent's current production traffic. These become your baseline trajectories.

Step 2: Record the actual traces. If your agent orchestration layer supports tracing (LangChain, VAPI, Retell, Pipecat all have varying levels of trace capture), turn it on and collect traces for 50-100 real completions of each scenario type.

Step 3: Define your expected sequences. For each scenario, identify which tools are required (must be called), which are optional (may be called depending on the situation), and which are forbidden (should never be called in this context).

scenario-trajectory-definition.ts·typescript

const billingDisputeScenario = {
  name: 'billing-dispute-resolution',
  expectedTrajectory: {
    required: [
      { tool: 'get_customer_account', maxCalls: 1 },
      { tool: 'get_billing_history', maxCalls: 1 },
      { tool: 'send_confirmation', maxCalls: 1 },
    ],
    optional: [
      { tool: 'create_adjustment', maxCalls: 1 },
      { tool: 'escalate_to_human', maxCalls: 1 },
    ],
    forbidden: [
      { tool: 'delete_customer', reason: 'never authorized in service context' },
      { tool: 'get_other_customer_account', reason: 'scope violation' },
    ],
    maxTotalCalls: 6,
  },
};

Step 4: Score your traces. Run your collected traces against these definitions. Any trace that exceeds maxCalls on a required tool, calls a forbidden tool, or exceeds maxTotalCalls should be flagged for review.
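A scoring pass along these lines might look like the sketch below. It assumes each trace has already been reduced to an ordered list of tool names, and it uses hypothetical `ToolRule` and `TrajectoryRubric` types modeled on the scenario definition from Step 3 (with `forbidden` simplified to a list of names). Beyond the three checks named in Step 4, it also flags required tools that were never called and `maxCalls` overruns on optional tools.

```typescript
// Hypothetical rubric types mirroring the scenario definition in Step 3.
interface ToolRule {
  tool: string;
  maxCalls: number;
}

interface TrajectoryRubric {
  required: ToolRule[];
  optional: ToolRule[];
  forbidden: string[];
  maxTotalCalls: number;
}

// Returns one human-readable issue for each rule the trace breaks.
function scoreTrace(toolNames: string[], rubric: TrajectoryRubric): string[] {
  const counts = new Map<string, number>();
  for (const name of toolNames) {
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }

  const issues: string[] = [];
  for (const rule of rubric.required) {
    if (!counts.has(rule.tool)) issues.push(`missing required tool: ${rule.tool}`);
  }
  for (const rule of [...rubric.required, ...rubric.optional]) {
    const n = counts.get(rule.tool) ?? 0;
    if (n > rule.maxCalls) issues.push(`${rule.tool} called ${n} times (max ${rule.maxCalls})`);
  }
  for (const tool of rubric.forbidden) {
    if (counts.has(tool)) issues.push(`forbidden tool called: ${tool}`);
  }
  if (toolNames.length > rubric.maxTotalCalls) {
    issues.push(`total calls ${toolNames.length} exceeds ${rubric.maxTotalCalls}`);
  }
  return issues;
}

const billingRubric: TrajectoryRubric = {
  required: [
    { tool: 'get_customer_account', maxCalls: 1 },
    { tool: 'get_billing_history', maxCalls: 1 },
    { tool: 'send_confirmation', maxCalls: 1 },
  ],
  optional: [
    { tool: 'create_adjustment', maxCalls: 1 },
    { tool: 'escalate_to_human', maxCalls: 1 },
  ],
  forbidden: ['delete_customer', 'get_other_customer_account'],
  maxTotalCalls: 6,
};

// A seven-call trace resembling the billing dispute from the opening.
const openingTrace = [
  'get_customer_account',
  'get_billing_history',
  'get_billing_history',
  'get_billing_history',
  'get_other_customer_account',
  'create_adjustment',
  'send_confirmation',
];

const traceIssues = scoreTrace(openingTrace, billingRubric);
```

For this trace, `traceIssues` has three entries: the repeated `get_billing_history` calls, the forbidden `get_other_customer_account` call, and the total of seven calls against a budget of six.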

Step 5: Automate the check. Chanl's scenario testing captures the full step trace when it runs a scenario, including every tool call with its arguments. You define the scenario once, then score its trajectory in CI before every deployment:

trajectory-test.ts·typescript

import Chanl from '@chanl/sdk';
 
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });
 
// scenarioId points at a scenario you've already authored
// in the Chanl dashboard (or via chanl.scenarios.create)
const { data } = await chanl.scenarios.run('scn_billing_dispute', {
  agentId: process.env.AGENT_ID,
});
 
const steps = data.execution.stepResults ?? [];
const toolCalls = steps.flatMap((s) => s.toolCalls ?? []);
const toolNames = toolCalls.map((c) => c.name);
 
// Required: each tool was called at least once
const required = ['get_customer_account', 'get_billing_history', 'send_confirmation'];
const missing = required.filter((t) => !toolNames.includes(t));
 
// Forbidden: never called
const forbidden = ['delete_customer'];
const violations = toolNames.filter((t) => forbidden.includes(t));
 
// Cap on total calls catches redundancy regressions
const overBudget = toolCalls.length > 6;
 
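const passed = missing.length === 0 && violations.length === 0 && !overBudget;

// Fail the CI job when the trajectory check does not pass
if (!passed) {
  console.error({ missing, violations, overBudget });
  process.exit(1);
}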

If passed is false, the deployment fails. You catch the redundant-call regression before it hits production. The stepResults shape gives you everything else you might want to score on later: argument shape, step duration, intermediate responses.

Offline and Online Trajectory Evals

Offline and online trajectory evals solve different problems. You need both.

Offline trajectory evals run before deployment on a fixed test set. You define expected trajectories for your known scenario types, run the agent against them, and verify the scores are above threshold. Offline evals catch regressions -- when a prompt change or tool update changes the agent's behavior on cases you've already tested.

52% of teams in the LangChain survey run offline evals. That's the right first step, but it's not sufficient. Your test set can't cover every variation of real user input, and it can't anticipate edge cases that emerge only in production.

Online trajectory evals monitor real production traffic. Instead of comparing against a fixed expected sequence, you compare each trajectory against your agent's own statistical baseline: how many tool calls does this agent usually make for this type of task? Which tools does it usually call? What's the typical duration?
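One simple form of "statistical baseline" is a z-score on tool call count per task type. The sketch below is an assumption about how you might implement it, not a description of any product's anomaly detection; the threshold of 3 standard deviations and the function names are illustrative choices.

```typescript
// Arithmetic mean of a non-empty list.
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation of a non-empty list.
function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) * (x - m))));
}

// Flags an observed tool call count that sits far from the baseline for this task type.
function isCallCountAnomalous(
  baselineCounts: number[],
  observed: number,
  zThreshold = 3,
): boolean {
  if (baselineCounts.length === 0) return false; // no baseline yet, nothing to compare
  const m = mean(baselineCounts);
  const sd = stdDev(baselineCounts);
  if (sd === 0) return observed !== m; // baseline never varies: any change is notable
  return Math.abs(observed - m) / sd > zThreshold;
}

// Tool call counts from earlier billing-dispute conversations (made-up numbers).
const baselineCallCounts = [3, 3, 4, 3, 2, 3, 4, 3];

const twelveCallsFlagged = isCallCountAnomalous(baselineCallCounts, 12);
const fourCallsFlagged = isCallCountAnomalous(baselineCallCounts, 4);
```

With this baseline, a twelve-call trajectory is flagged and a four-call trajectory is not. The same pattern applies to duration or token counts; tool-sequence anomalies need a different comparison, such as checking for tools never seen in the baseline.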

Online evals catch things offline evals can't:

Novel user inputs that produce trajectories your test set never anticipated
Gradual drift -- the agent slowly accumulating extra tool calls over weeks as prompts shift
Environmental failures -- a tool returning unexpected results that causes the agent to loop

Roughly a third of teams run online evals, climbing closer to half once they have agents in production. Once you see what production traffic actually looks like, the gaps in offline-only evaluation become obvious fast.

If you're setting up your first eval system from scratch, the companion article on offline vs online evals for production agents covers the practical setup decisions in more depth.

Chanl's agent monitoring captures full trajectory traces from production interactions and surfaces anomalies automatically -- unexpected tool sequences, call count spikes, scope violations -- without requiring you to pre-define expected trajectories for every possible input type.

What Trajectory Evals Catch That Outcome Evals Miss

Trajectory evals catch the production failures that outcome evals will never see: privacy violations, cost regressions, and destructive intermediary steps that leave no trace in the final output.

The clearest way to see this is through real failure patterns:

A billing agent that starts fetching adjacent customer records might be doing so because a recent tool update changed how the customer lookup returns related accounts. Outcome eval: still correct (agent is resolving the right customer's dispute). Trajectory eval: flagged immediately (unauthorized data access on every interaction).

A scheduling agent that suddenly takes twelve steps instead of three after a model version update might be producing correct calendars, but the latency has doubled. Outcome eval: pass. Trajectory eval: efficiency regression flagged by the spike in total call count.

An order management agent that calls a deletion endpoint as an intermediate step (then immediately re-creates the record to apply a correction) produces the right final state, but left a gap in the audit trail. Outcome eval: correct. Trajectory eval: flagged for including a destructive operation that should have been a patch.
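The delete-then-recreate pattern in the last example can be detected directly from the trace. The sketch below is one possible check; the `OrderCall` shape, the tool names `delete_order` and `create_order`, and the `orderId` argument are all hypothetical.

```typescript
// Hypothetical shape for one recorded tool call.
interface OrderCall {
  tool: string;
  args: Record<string, unknown>;
}

// Returns the indices of delete calls that are later followed by a create
// for the same record, identified by the argument named `idKey`.
function findDeleteThenRecreate(
  calls: OrderCall[],
  deleteTool: string,
  createTool: string,
  idKey: string,
): number[] {
  const flagged: number[] = [];
  calls.forEach((call, i) => {
    if (call.tool !== deleteTool) return;
    const id = call.args[idKey];
    const recreated = calls
      .slice(i + 1)
      .some((later) => later.tool === createTool && later.args[idKey] === id);
    if (recreated) flagged.push(i);
  });
  return flagged;
}

const orderCalls: OrderCall[] = [
  { tool: 'get_order', args: { orderId: 'o_1' } },
  { tool: 'delete_order', args: { orderId: 'o_1' } },
  { tool: 'create_order', args: { orderId: 'o_1', quantity: 2 } },
];

const destructiveSteps = findDeleteThenRecreate(orderCalls, 'delete_order', 'create_order', 'orderId');
```

`destructiveSteps` is `[1]`, pointing at the `delete_order` call that should have been a patch.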

These aren't hypothetical. They're the real-world failure modes that emerge when agents start handling volume at scale. The AI agent observability gap is real -- most teams can tell you their agent is running, but not what it's doing on the way to an answer.

Starting Small, Building Up

You don't need to instrument every possible trajectory before you ship. Start with your three highest-volume scenario types and your three highest-risk tool calls, meaning any tool that writes, deletes, or touches sensitive data.

Define expected sequences for those six combinations. Score every production trajectory against them. Fix the first issues you find.

From there, expand coverage as you learn what your agent's actual behavior looks like at scale. The first week of trajectory data will show you failure patterns you wouldn't have predicted from your test set.

The 89%/52% gap from the LangChain survey (more teams observe than eval) closes when teams realize how much their observation data was already telling them. You probably already have the traces. Trajectory evals are what you do with them.

Go back to that billing-dispute call from the opening. Seven tool calls, one fetched record belonging to the wrong customer, an outcome eval that said pass. With a trajectory check in place, that interaction never closes clean. It gets flagged, the redundant calls get traced back to a prompt change, and the scope violation gets caught before it's the fiftieth one this week. Your agent can pass every outcome eval you write while quietly building up a pattern of data access you'd be uncomfortable explaining to a customer. The path matters. Measure it.



evaluation trajectory-evals agent-testing tool-calls agent-quality

Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
