A customer calls about a billing dispute. Your agent resolves it correctly: the customer hangs up satisfied, the issue is closed, and your outcome eval marks it a pass.
But if you could watch the full trajectory, here's what you'd see. The agent made seven tool calls to complete a task that needed two. It queried your billing API three times with the same customer ID, getting back identical results each time. Halfway through the conversation it fetched a transaction record for a completely unrelated customer. The record wasn't used, but it was accessed. The actual resolution only happened in step six because the agent spent the first five steps checking data it didn't need.
Outcome: correct. Trajectory: a mess.
Most eval systems score the first thing and never see the second. That gap is where production agent problems live, and a trajectory eval is what closes it. This piece walks through what a trajectory eval is, the three signals that matter for CX agents, and a minimal build you can ship in an afternoon.
Why Outcome Evals Miss So Much
Outcome evals only check the final state. Did the agent return the right answer? They can't see the path: redundant tool calls, unauthorized data access, unnecessary records fetched, or a twelve-step route through a problem that needed three. The final answer can be correct while the journey was expensive, unsafe, or wrong.
They're the default because they're easy to define. Did the agent resolve the ticket or book the appointment? You need a ground truth label and a correctness check, and you're measuring something real.
What you're not measuring is everything that happened on the way to that answer.
LangChain's 2026 State of AI Agents survey put a number on this: 89% of teams with agents have observability implemented, but only 52% have any eval practice. Even within that 52%, most teams are evaluating outputs -- the final state -- not trajectories. Observability tells you the agent is running. Outcome evals tell you the agent is answering. Neither one tells you whether the path was clean.
Four categories of production failures only show up in the trajectory:
Redundant tool calls. Your agent calls the same API endpoint three times in one interaction and gets the same data back each time. The final answer is correct, but you've paid three times the API cost and added hundreds of milliseconds to the conversation. At scale, across thousands of daily interactions, this compounds into a real budget problem.
Scope violations. Your agent touches data outside its authorization -- a different customer's record, a write to a billing table it only needed to read, a calendar entry outside the current user's account. The final answer was right, but the access wasn't. This is a privacy risk and, depending on your industry, potentially a regulatory issue.
Unnecessary information access. Similar to scope violations but subtler: the agent accessed data within its scope that wasn't needed for the task. If your billing agent is fetching full conversation history to resolve a simple refund, it's pulling data that isn't needed and creating unnecessary exposure.
Path efficiency regressions. Your agent used to complete this type of task in three steps. Now it's taking twelve. The outcome is still correct, but something changed in the agent's behavior -- a prompt update, a new tool added to the set, a model version change -- that made it less direct. Without trajectory tracking, you won't notice until the latency complaints start.
What a Trajectory Eval Actually Is
A trajectory eval is a record and a rubric.
The record is the full step trace from a single agent interaction: every tool call, its inputs, and what was returned. Every reasoning step the agent took before deciding to call a tool. Every response it generated at intermediate steps.
```typescript
interface AgentTrajectory {
  taskId: string;
  startedAt: string;
  completedAt: string;
  steps: Array<{
    stepIndex: number;
    type: 'reasoning' | 'tool_call' | 'response';
    tool?: string;
    args?: Record<string, unknown>;
    result?: unknown;
    durationMs: number;
    tokensUsed?: number;
  }>;
  outcome: string;
  totalTokens: number;
  totalDurationMs: number;
}
```

The rubric is what you expect: which tools should be called, in what order, with what constraints. You define the rubric for your known scenario types, then score actual trajectories against it.
The trajectory score isn't just pass/fail. It tells you exactly which steps went wrong, making it actionable in a way that outcome evals aren't.
The Three Trajectory Signals That Matter for CX Agents
For customer experience agents -- voice, chat, messaging -- three signals in the trajectory predict real quality better than any single outcome metric.
Tool call accuracy is whether the agent called the right tools in the right order. For a billing dispute resolution, the expected sequence might be: get_customer_account then get_billing_history then create_adjustment (if needed) then send_confirmation. An agent that runs all four in the right order scores high. An agent that skips get_billing_history and jumps to create_adjustment is improvising in a way that should concern you.
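One way to turn tool call accuracy into a number is to measure how much of the expected sequence shows up, in order, in the actual one. The sketch below does that with a longest-common-subsequence match. The function name `toolCallAccuracy` and the choice of LCS are assumptions for illustration, not something prescribed by any particular eval framework.

```typescript
// Hypothetical helper: the fraction of the expected tool sequence that
// appears, in order, in the actual sequence (longest common subsequence).
function toolCallAccuracy(expected: string[], actual: string[]): number {
  if (expected.length === 0) return 1;
  // lcs[i][j] = LCS length of expected[0..i) and actual[0..j)
  const lcs: number[][] = Array.from({ length: expected.length + 1 }, () =>
    new Array<number>(actual.length + 1).fill(0),
  );
  for (let i = 1; i <= expected.length; i++) {
    for (let j = 1; j <= actual.length; j++) {
      lcs[i][j] =
        expected[i - 1] === actual[j - 1]
          ? lcs[i - 1][j - 1] + 1
          : Math.max(lcs[i - 1][j], lcs[i][j - 1]);
    }
  }
  return lcs[expected.length][actual.length] / expected.length;
}

const expectedSequence = [
  'get_customer_account',
  'get_billing_history',
  'create_adjustment',
  'send_confirmation',
];

// The agent skipped get_billing_history and jumped straight to the adjustment.
const skipsHistory = ['get_customer_account', 'create_adjustment', 'send_confirmation'];

console.log(toolCallAccuracy(expectedSequence, expectedSequence)); // 1
console.log(toolCallAccuracy(expectedSequence, skipsHistory)); // 0.75
```

Extra or repeated calls don't lower this score. That is deliberate: over-calling is what the redundancy signal measures.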
Redundancy rate is the percentage of tool calls in a trajectory that were duplicates or unnecessary. A redundancy rate above 10% is a sign something is wrong -- either the agent's reasoning loop has a bug, the tools are returning unexpected results that confuse the agent, or a recent change to the agent's prompt is causing it to re-verify decisions it should be making once.
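A minimal sketch of the duplicate half of this metric, assuming a call counts as a duplicate when an earlier call in the same trajectory used the same tool with the same arguments. The `ToolCall` shape and `redundancyRate` name are hypothetical, and `JSON.stringify` is used as a cheap argument fingerprint, which is sensitive to key order. Detecting calls that are unique but unnecessary needs a rubric and isn't covered here.

```typescript
// Hypothetical minimal shape for one tool call in a trace.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

// Fraction of calls that repeat an earlier (tool, args) pair.
function redundancyRate(calls: ToolCall[]): number {
  if (calls.length === 0) return 0;
  const seen = new Set<string>();
  let duplicates = 0;
  for (const call of calls) {
    // Assumption: identical args serialize identically (same key order).
    const key = `${call.tool}:${JSON.stringify(call.args)}`;
    if (seen.has(key)) {
      duplicates++;
    } else {
      seen.add(key);
    }
  }
  return duplicates / calls.length;
}

// Three identical billing lookups in a five-call trajectory.
const repeatedLookups: ToolCall[] = [
  { tool: 'get_customer_account', args: { customerId: 'cus_123' } },
  { tool: 'get_billing_history', args: { customerId: 'cus_123' } },
  { tool: 'get_billing_history', args: { customerId: 'cus_123' } },
  { tool: 'get_billing_history', args: { customerId: 'cus_123' } },
  { tool: 'send_confirmation', args: { customerId: 'cus_123' } },
];

// Two of five calls are repeats, well above a 10% threshold.
console.log(redundancyRate(repeatedLookups) > 0.1); // true
```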
Scope adherence tracks whether every tool call in the trajectory was authorized given the current task and customer context. This is the privacy-and-security signal. If your customer service agent is fetching account records for customers other than the one it's currently serving, that's a scope violation regardless of whether the final outcome was correct.
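Scope adherence can be approximated with two checks per call: is the tool allowed for this task, and does the call target the customer currently being served? The sketch below assumes every customer-scoped tool takes a `customerId` argument. That convention, and the names `findScopeViolations` and `ScopedCall`, are illustrative assumptions rather than part of any specific SDK.

```typescript
// Hypothetical minimal shape for one tool call in a trace.
interface ScopedCall {
  tool: string;
  args: Record<string, unknown>;
}

interface ScopeViolation {
  callIndex: number;
  tool: string;
  reason: string;
}

function findScopeViolations(
  calls: ScopedCall[],
  activeCustomerId: string,
  allowedTools: ReadonlySet<string>,
): ScopeViolation[] {
  const violations: ScopeViolation[] = [];
  calls.forEach((call, callIndex) => {
    if (!allowedTools.has(call.tool)) {
      violations.push({ callIndex, tool: call.tool, reason: 'tool not allowed for this task' });
      return;
    }
    // Assumption: customer-scoped tools identify their target via args.customerId.
    const target = call.args['customerId'];
    if (target !== undefined && target !== activeCustomerId) {
      violations.push({
        callIndex,
        tool: call.tool,
        reason: `accessed customer ${String(target)} while serving ${activeCustomerId}`,
      });
    }
  });
  return violations;
}

const allowed = new Set(['get_customer_account', 'get_transaction', 'send_confirmation']);
const tracedCalls: ScopedCall[] = [
  { tool: 'get_customer_account', args: { customerId: 'cus_123' } },
  { tool: 'get_transaction', args: { customerId: 'cus_999', transactionId: 'txn_1' } },
  { tool: 'send_confirmation', args: { customerId: 'cus_123' } },
];

const scopeViolations = findScopeViolations(tracedCalls, 'cus_123', allowed);
// Share of calls that stayed in scope: 2 of 3 here.
const scopeAdherence = 1 - scopeViolations.length / tracedCalls.length;
console.log(scopeViolations, scopeAdherence);
```

Unlike the other two signals, a single violation here usually warrants a hard failure rather than a threshold.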
These three signals together give you a trajectory health score that tells you far more about your agent's behavior than any outcome metric alone.
How to Build Your First Trajectory Eval
You don't need a full observability platform to start. Here's a minimal approach that works with most agent setups and takes an afternoon to implement.
Step 1: Pick five golden paths. Choose the five most common successful interactions in your agent's current production traffic. These become your baseline trajectories.
Step 2: Record the actual traces. If your agent orchestration layer supports tracing (LangChain, VAPI, Retell, Pipecat all have varying levels of trace capture), turn it on and collect traces for 50-100 real completions of each scenario type.
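If your orchestration layer has no tracing you can switch on, a thin wrapper around your tool functions is usually enough to start collecting steps. The sketch below is a hypothetical `TrajectoryRecorder`, not an API from any of the frameworks named above. It records tool calls in a shape that mirrors the trajectory interface shown earlier, and it assumes tools are async functions taking a single arguments object.

```typescript
type StepType = 'reasoning' | 'tool_call' | 'response';

interface TraceStep {
  stepIndex: number;
  type: StepType;
  tool?: string;
  args?: Record<string, unknown>;
  result?: unknown;
  durationMs: number;
}

// Hypothetical recorder: collects steps in the order they complete.
class TrajectoryRecorder {
  readonly steps: TraceStep[] = [];

  // Append one step; stepIndex is assigned from the current step count.
  record(step: Omit<TraceStep, 'stepIndex'>): TraceStep {
    const recorded: TraceStep = { stepIndex: this.steps.length, ...step };
    this.steps.push(recorded);
    return recorded;
  }

  // Wrap an async tool so every invocation is recorded, including failures.
  wrapTool<A extends Record<string, unknown>, R>(
    name: string,
    fn: (args: A) => Promise<R>,
  ): (args: A) => Promise<R> {
    return async (args: A) => {
      const started = Date.now();
      try {
        const result = await fn(args);
        this.record({ type: 'tool_call', tool: name, args, result, durationMs: Date.now() - started });
        return result;
      } catch (error) {
        this.record({
          type: 'tool_call',
          tool: name,
          args,
          result: { error: String(error) },
          durationMs: Date.now() - started,
        });
        throw error;
      }
    };
  }
}

// Usage: wrap each tool once, then hand the wrapped version to the agent.
const recorder = new TrajectoryRecorder();
const getCustomerAccount = recorder.wrapTool(
  'get_customer_account',
  async (args: { customerId: string }) => ({ id: args.customerId, status: 'active' }),
);
```

After an interaction finishes, `recorder.steps` holds the ordered trace you can persist and score.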
Step 3: Define your expected sequences. For each scenario, identify which tools are required (must be called), which are optional (may be called depending on the situation), and which are forbidden (should never be called in this context).
```typescript
const billingDisputeScenario = {
  name: 'billing-dispute-resolution',
  expectedTrajectory: {
    required: [
      { tool: 'get_customer_account', maxCalls: 1 },
      { tool: 'get_billing_history', maxCalls: 1 },
      { tool: 'send_confirmation', maxCalls: 1 },
    ],
    optional: [
      { tool: 'create_adjustment', maxCalls: 1 },
      { tool: 'escalate_to_human', maxCalls: 1 },
    ],
    forbidden: [
      { tool: 'delete_customer', reason: 'never authorized in service context' },
      { tool: 'get_other_customer_account', reason: 'scope violation' },
    ],
    maxTotalCalls: 6,
  },
};
```

Step 4: Score your traces. Run your collected traces against these definitions. Any trace that exceeds maxCalls on a required tool, calls a forbidden tool, or exceeds maxTotalCalls should be flagged for review.
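The Step 4 check can be sketched as a single function that takes the tool names from one trace and returns a list of human-readable flags. The `scoreTrace` name and the `ExpectedTrajectory` types are assumptions modeled on the scenario definition above. The sketch also flags missing required tools and over-called optional tools, which goes slightly beyond the three conditions listed in Step 4.

```typescript
interface ToolRule {
  tool: string;
  maxCalls: number;
}

interface ExpectedTrajectory {
  required: ToolRule[];
  optional: ToolRule[];
  forbidden: { tool: string; reason: string }[];
  maxTotalCalls: number;
}

// Hypothetical scorer: returns one flag per problem; an empty array means clean.
function scoreTrace(toolNames: string[], expected: ExpectedTrajectory): string[] {
  const flags: string[] = [];
  const counts = new Map<string, number>();
  for (const name of toolNames) {
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }

  for (const rule of expected.required) {
    if (!counts.has(rule.tool)) flags.push(`missing required tool: ${rule.tool}`);
  }
  for (const rule of [...expected.required, ...expected.optional]) {
    const calls = counts.get(rule.tool) ?? 0;
    if (calls > rule.maxCalls) {
      flags.push(`${rule.tool} called ${calls} times (max ${rule.maxCalls})`);
    }
  }
  for (const rule of expected.forbidden) {
    if (counts.has(rule.tool)) flags.push(`forbidden tool ${rule.tool}: ${rule.reason}`);
  }
  if (toolNames.length > expected.maxTotalCalls) {
    flags.push(`${toolNames.length} total calls (max ${expected.maxTotalCalls})`);
  }
  return flags;
}

const billingDisputeRubric: ExpectedTrajectory = {
  required: [
    { tool: 'get_customer_account', maxCalls: 1 },
    { tool: 'get_billing_history', maxCalls: 1 },
    { tool: 'send_confirmation', maxCalls: 1 },
  ],
  optional: [{ tool: 'create_adjustment', maxCalls: 1 }],
  forbidden: [{ tool: 'get_other_customer_account', reason: 'scope violation' }],
  maxTotalCalls: 6,
};

// A trace shaped like the opening example: repeated lookups plus a stray record.
const messyTrace = [
  'get_customer_account',
  'get_billing_history',
  'get_billing_history',
  'get_billing_history',
  'get_other_customer_account',
  'create_adjustment',
  'send_confirmation',
];

// Three flags: the repeated lookup, the forbidden tool, and the total call cap.
console.log(scoreTrace(messyTrace, billingDisputeRubric));
```

Returning flags instead of a boolean keeps the result actionable: a reviewer sees which rule tripped, not just that something did.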
Step 5: Automate the check. Chanl's scenario testing captures the full step trace when it runs a scenario, including every tool call with its arguments. You define the scenario once, then score its trajectory in CI before every deployment:
```typescript
import Chanl from '@chanl/sdk';

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });

// scenarioId points at a scenario you've already authored
// in the Chanl dashboard (or via chanl.scenarios.create)
const { data } = await chanl.scenarios.run('scn_billing_dispute', {
  agentId: process.env.AGENT_ID,
});

const steps = data.execution.stepResults ?? [];
const toolCalls = steps.flatMap((s) => s.toolCalls ?? []);
const toolNames = toolCalls.map((c) => c.name);

// Required: each tool was called at least once
const required = ['get_customer_account', 'get_billing_history', 'send_confirmation'];
const missing = required.filter((t) => !toolNames.includes(t));

// Forbidden: never called
const forbidden = ['delete_customer'];
const violations = toolNames.filter((t) => forbidden.includes(t));

// Cap on total calls catches redundancy regressions
const overBudget = toolCalls.length > 6;

const passed = missing.length === 0 && violations.length === 0 && !overBudget;
```

If passed is false, the deployment fails. You catch the redundant-call regression before it hits production. The stepResults shape gives you everything else you might want to score on later: argument shape, step duration, intermediate responses.
Offline and Online Trajectory Evals
Offline and online trajectory evals solve different problems. You need both.
Offline trajectory evals run before deployment on a fixed test set. You define expected trajectories for your known scenario types, run the agent against them, and verify the scores are above threshold. Offline evals catch regressions -- when a prompt change or tool update changes the agent's behavior on cases you've already tested.
52% of teams in the LangChain survey run offline evals. That's the right first step, but it's not sufficient. Your test set can't cover every variation of real user input, and it can't anticipate edge cases that emerge only in production.
Online trajectory evals monitor real production traffic. Instead of comparing against a fixed expected sequence, you compare each trajectory against your agent's own statistical baseline: how many tool calls does this agent usually make for this type of task? Which tools does it usually call? What's the typical duration?
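As a rough illustration of a statistical baseline, the sketch below summarizes historical tool call counts for one task type as a mean and standard deviation, then flags a new trajectory whose count sits too many standard deviations away. The function names and the default threshold of 3 are assumptions. A production setup would likely track more dimensions, such as which tools were called and how long the interaction took.

```typescript
interface Baseline {
  mean: number;
  stdDev: number;
}

// Summarize historical tool call counts for one task type.
function buildBaseline(callCounts: number[]): Baseline {
  if (callCounts.length === 0) {
    throw new Error('need at least one historical trajectory');
  }
  const mean = callCounts.reduce((sum, n) => sum + n, 0) / callCounts.length;
  const variance =
    callCounts.reduce((sum, n) => sum + (n - mean) ** 2, 0) / callCounts.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Flag a trajectory whose call count is far from the baseline (z-score test).
// zThreshold = 3 is an illustrative default, not a recommendation from the source.
function isAnomalous(callCount: number, baseline: Baseline, zThreshold = 3): boolean {
  if (baseline.stdDev === 0) return callCount !== baseline.mean;
  return Math.abs(callCount - baseline.mean) / baseline.stdDev > zThreshold;
}

// Historical call counts for a task that normally takes three or four calls.
const billingBaseline = buildBaseline([3, 3, 4, 3, 4, 3, 3, 4]);

console.log(isAnomalous(4, billingBaseline)); // false: within normal variation
console.log(isAnomalous(12, billingBaseline)); // true: the twelve-step regression
```

Recomputing the baseline over a rolling window is a reasonable next step, with the caveat that a slowly drifting agent can drag its own baseline along with it, so it helps to also compare against a pinned reference period.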
Online evals catch things offline evals can't:
- Novel user inputs that produce trajectories your test set never anticipated
- Gradual drift -- the agent slowly accumulating extra tool calls over weeks as prompts shift
- Environmental failures -- a tool returning unexpected results that causes the agent to loop
Roughly a third of teams run online evals, climbing closer to half once they have agents in production. Once you see what production traffic actually looks like, the gaps in offline-only evaluation become obvious fast.
If you're setting up your first eval system from scratch, "Offline vs online evals for production agents" covers the practical setup decisions in more depth.
Chanl's agent monitoring captures full trajectory traces from production interactions and surfaces anomalies automatically -- unexpected tool sequences, call count spikes, scope violations -- without requiring you to pre-define expected trajectories for every possible input type.
What Trajectory Evals Catch That Outcome Evals Miss
Trajectory evals catch the production failures that outcome evals will never see: privacy violations, cost regressions, and destructive intermediary steps that leave no trace in the final output.
The clearest way to see this is through real failure patterns:
A billing agent that starts fetching adjacent customer records might be doing so because a recent tool update changed how the customer lookup returns related accounts. Outcome eval: still correct (agent is resolving the right customer's dispute). Trajectory eval: flagged immediately (unauthorized data access on every interaction).
A scheduling agent that suddenly takes twelve steps instead of three after a model version update might be producing correct calendars, but the latency has doubled. Outcome eval: pass. Trajectory eval: efficiency regression flagged by the spike in total call count.
An order management agent that calls a deletion endpoint as an intermediate step (then immediately re-creates the record to apply a correction) produces the right final state, but left a gap in the audit trail. Outcome eval: correct. Trajectory eval: flagged for including a destructive operation that should have been a patch.
These aren't hypothetical. They're the real-world failure modes that emerge when agents start handling volume at scale. The AI agent observability gap is real -- most teams can tell you their agent is running, but not what it's doing on the way to an answer.
Starting Small, Building Up
You don't need to instrument every possible trajectory before you ship. Start with your three highest-volume scenario types and your three highest-risk tool calls, meaning any tool that writes, deletes, or touches sensitive data.
Define expected sequences for those six combinations. Score every production trajectory against them. Fix the first issues you find.
From there, expand coverage as you learn what your agent's actual behavior looks like at scale. The first week of trajectory data will show you failure patterns you wouldn't have predicted from your test set.
The 89%/52% gap from the LangChain survey (more teams observe than evaluate) closes when teams realize how much their observation data was already telling them. You probably already have the traces. Trajectory evals are what you do with them.
Go back to that billing-dispute call from the opening. Seven steps, one fetched record belonging to the wrong customer, an outcome eval that said pass. With a trajectory check in place, that interaction never closes clean. It gets flagged, the redundant calls get traced back to a prompt change, and the scope violation gets caught before it's the fiftieth one this week. Your agent can pass every outcome eval you write while quietly building up a pattern of data access you'd be uncomfortable explaining to a customer. The path matters. Measure it.
Score your agent's trajectory, not just its answers
Chanl scenario testing captures full step traces and scores them against expected tool sequences. Catch redundant calls, scope violations, and efficiency regressions before they compound in production.
Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.



