A customer service team I talked to last month had one line in their agent config that quietly cost them six figures a year. Every inbound ticket, every clarifying question, every "thanks, that worked" reply went through the same Opus-class model. The founder had picked Anthropic at the start of the project because the SDK felt clean. A year later, the bill was a rounding error away from the salary of the support lead.
The cost wasn't Anthropic's fault. The architecture was. They were paying for planning-grade reasoning to classify whether a message contained a refund request.
Brand loyalty is a procurement decision. Model choice, per turn, is an architecture decision. A well-built customer-experience agent uses at least three models: one to plan, one to route, one to summarize. Each is priced and tuned for a different job, and the supervisor that picks between them can run on the cheapest of the three.
This is the pattern. Here's how it works, the rubric for applying it, the cost math on a real workload, and the two failure modes that will bite you if you skip the boring parts.
## What Each Model Is Actually Good At
The planner reasons through hard tickets. The router classifies incoming turns and dispatches. The summarizer condenses long inputs cheaply. For a CX workload in April 2026, the default stack is Opus 4.7 for planning, Haiku 4.5 for routing, and GPT-5.4 Standard for summarization.
| Model | Input / Output ($/Mtok) | Strength | Used For |
|---|---|---|---|
| Claude Opus 4.7 | $5 / $25 | Multi-step reasoning, tool planning | Complex tickets, escalation analysis |
| Claude Haiku 4.5 | $1 / $5 | Sub-200ms classification, short tool calls | Routing, intent detection, simple replies |
| GPT-5.4 Standard | $2.50 / $15 | Long-context summarization, structured output | Transcript summarization, ticket intake |
Opus 4.7 isn't the cheapest model per token, but it's the cheapest per correct plan. Haiku 4.5 is five times cheaper on output and fast enough that you can afford to call it twice. GPT-5.4 Standard sits between them on price and handles long inputs well, which matters for CX because real tickets arrive with attached transcripts and CRM notes.
The standard benchmark comparisons (SWE-bench Verified, MMLU) are misleading here because neither measures tool-rich, multi-turn customer work. The benchmark that matters for this architecture is tau2-bench, which simulates exactly those workflows. Single-run pass rates on tau2-bench telecom still sit below 50% for every frontier model as of April 2026, which is the strongest argument for routing: no single model covers the long tail, so you may as well pay less on the easy majority.
## A Routing Rubric You Can Actually Apply
Route by signal, not by guess. Before the message reaches any reasoning model, a small router inspects five signals and picks the tier. The router itself runs on Haiku, because the classification job is small, and its own latency is part of every conversation.
| Signal | Router action |
|---|---|
| Input ≤ 100 tokens, no tool call needed | Haiku direct reply |
| Input 100-2k tokens, 1 tool call, single intent | Haiku with tools |
| Input 100-2k tokens, ambiguous intent or multi-step | Opus planner |
| Input ≥ 2k tokens (long transcript or doc) | GPT-5.4 Standard summarize, then route the summary |
| Prior turn escalated or customer sentiment negative | Opus planner, always |
This fits in a single function. Here's the core of it. The router returns a tier label; a dispatcher calls the matching SDK.
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

type Tier = "haiku-direct" | "haiku-tools" | "opus-plan" | "gpt-summarize";

export async function classify(turn: {
  text: string;
  priorSentiment?: "positive" | "neutral" | "negative";
  priorEscalated?: boolean;
}): Promise<{ tier: Tier; confidence: number }> {
  // Cheap token estimate: ~4 characters per token.
  const tokens = Math.ceil(turn.text.length / 4);

  // Deterministic rules first, so obvious cases never pay for a router call.
  if (turn.priorEscalated || turn.priorSentiment === "negative") {
    return { tier: "opus-plan", confidence: 0.99 };
  }
  if (tokens >= 2000) {
    return { tier: "gpt-summarize", confidence: 0.95 };
  }

  // Everything else gets classified by Haiku.
  const msg = await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 64,
    system:
      "Classify the user turn. Reply JSON: {tier:'direct'|'tools'|'plan', confidence:0..1}. 'plan' means multi-step reasoning needed.",
    messages: [{ role: "user", content: turn.text }],
  });

  const parsed = JSON.parse((msg.content[0] as { text: string }).text);
  const tier: Tier =
    parsed.tier === "plan"
      ? "opus-plan"
      : parsed.tier === "tools"
        ? "haiku-tools"
        : "haiku-direct";
  return { tier, confidence: parsed.confidence };
}
```

Two things worth noticing. The deterministic rules run first, so you never pay the router on obvious cases. And the router emits a confidence score. That score is the hinge for the first failure mode covered below.
## The Cost Math on a Real Deflection Workload
On 100k tickets/month with 2.5 LLM calls per ticket, routing cuts the bill from $8,750 to roughly $1,500 including prompt caching. Here's the shape of it.
Assume the common CX distribution: 60% of tickets are routine (classification + short reply), 25% need one tool call and a short answer, 10% need multi-step planning, 5% arrive with a long transcript that needs summarization first. Real CX calls carry system prompts, tool schemas, and recent history, so the average call looks closer to 5k input and 400 output tokens. Aggregate volume: 1.25B input tokens and 100M output tokens per month. Router adds one extra 300-token Haiku call per turn.
| Setup | Input tokens / mo | Output tokens / mo | Monthly bill |
|---|---|---|---|
| Opus 4.7 for everything | 1.25B | 100M | $6,250 input + $2,500 output = $8,750 |
| Tiered routing (no caching) | 1.25B | 100M | Haiku-direct 60%: $413. Haiku-tools 25%: $250. Opus planner 10%: $875. GPT summarizer 5%: $344. Router overhead: $113. ~$2,000 |
| Tiered routing + prompt caching on planner | 1.25B | 100M | Planner input drops from $625 to ~$125 at the 10% cache-read rate. ~$1,500 |
The delta is about $7,250/mo, or $87k/yr, on a workload that isn't large by SaaS standards. The savings come from two places: the 60% of routine turns no longer pay Opus rates, and the 10% of planner calls pay the cached-input rate because the system prompt and recent history are static across turns. Anthropic's prompt caching bills cache reads at 10% of normal input, which is a bigger lever than most teams use.
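The two levers in that table are easy to misapply, so here is the baseline and the caching math as a small calculation. The 90% cache-hit fraction below is an assumption for illustration, not a measured number; the prices come from the comparison table above.

```typescript
// Token prices in $ per million tokens, from the model table above.
const OPUS = { input: 5, output: 25 };

// Monthly cost for a given token volume at a given price.
function monthlyCost(
  inputTok: number,
  outputTok: number,
  p: { input: number; output: number },
): number {
  return (inputTok / 1e6) * p.input + (outputTok / 1e6) * p.output;
}

// Prompt caching: cache reads bill at 10% of the normal input rate.
// `cachedFraction` is the share of input tokens served from cache (assumed).
function cachedInputCost(
  inputTok: number,
  pricePerMtok: number,
  cachedFraction: number,
): number {
  const full = ((inputTok * (1 - cachedFraction)) / 1e6) * pricePerMtok;
  const cached = ((inputTok * cachedFraction) / 1e6) * pricePerMtok * 0.1;
  return full + cached;
}

// Baseline: everything through Opus.
const baseline = monthlyCost(1.25e9, 100e6, OPUS); // $8,750

// Planner tier: 10% of input volume, ~90% of it served as cache reads.
const plannerInput = cachedInputCost(0.125e9, OPUS.input, 0.9); // ~$119 instead of $625
```

Run the second function against your own cache-hit telemetry; the break-even moves fast with the hit rate, which is why static system prompts matter so much on the planner tier.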
Batch processing through the Message Batches API is another 50% off for anything non-latency-sensitive, like post-call summarization or overnight quality grading. For a deflection workload these are real line items.
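Wiring overnight summarization through the batch endpoint is mostly request construction. Here is a sketch of the payload builder, assuming the Message Batches request shape of `custom_id` plus ordinary Messages `params`; the `ClosedTicket` type and helper name are illustrative, not from any SDK.

```typescript
interface ClosedTicket {
  id: string;
  transcript: string;
}

// Build one batch request per closed ticket. The custom_id is how you
// match results back to tickets when the batch completes.
function buildSummaryBatch(tickets: ClosedTicket[]) {
  return tickets.map((t) => ({
    custom_id: `summary-${t.id}`,
    params: {
      model: "claude-haiku-4-5-20251001",
      max_tokens: 200,
      system: "Summarize this closed ticket for the QA scorecard. 150 tokens max.",
      messages: [{ role: "user" as const, content: t.transcript }],
    },
  }));
}

// Submitting is then one call, roughly:
// await anthropic.messages.batches.create({ requests: buildSummaryBatch(closedToday) });
```

Because nothing in the batch is latency-sensitive, every token here earns the 50% discount, which stacks with the routing savings above.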
One caveat. The router itself adds latency. Running Haiku as a pre-classifier adds roughly 150-250ms to every turn. For text chat that's invisible. For voice, it's close to the budget ceiling, and you should fold the classifier into the first-pass model instead of adding a hop.
## Failure Mode 1: Misrouting Is the Silent Killer
A misrouted ticket produces a confidently wrong answer at router speed, and the customer never tells you. The router sends a hard question to Haiku because the question is short, Haiku fabricates a plausible policy, the customer accepts the reply, and nothing alerts. The ticket closes. The scorecard doesn't fail because there's no scorecard on routine turns.
You catch this with two signals. First, log every router decision with its confidence. Second, run a quality scorecard on a sampled slice of Haiku-direct replies, and shadow-route the bottom 5% by confidence to Opus for comparison. When the shadow response materially disagrees with the shipped reply, that's a regression signal. Over a month you get a P&L on your router itself.
```typescript
import { classify } from "./router.ts";

export async function handle(turn: Turn) {
  const { tier, confidence } = await classify(turn);
  const reply = await dispatch(tier, turn);

  // Shadow-route the bottom 5% of Haiku decisions to Opus asynchronously.
  if (tier.startsWith("haiku") && confidence < 0.75) {
    queueShadowCheck({ turnId: turn.id, shippedReply: reply });
  }
  return reply;
}
```

This pattern is what makes routing safe to run. Without it, you're flying blind on the 60% of your traffic that never triggers a reasoning model, and misroute errors compound because the router's own training signal (your scorecard) only fires when a reply is bad enough that a customer complains.
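The comparison step inside the shadow check can start crude. Here is one hedged sketch: a lexical-overlap test between the shipped Haiku reply and the Opus shadow reply. The 0.3 threshold and the function names are assumptions to tune against your own scorecard, not a standard, and a graded scorecard should replace the heuristic once you have labels.

```typescript
// Token-set Jaccard overlap between two replies. Low overlap is a cheap
// proxy for "materially disagrees".
function overlap(a: string, b: string): number {
  const tok = (s: string) => new Set(s.toLowerCase().split(/\W+/).filter(Boolean));
  const A = tok(a);
  const B = tok(b);
  const inter = [...A].filter((w) => B.has(w)).length;
  const union = new Set([...A, ...B]).size;
  return union === 0 ? 1 : inter / union;
}

// Flag the turn for human review when the shadow response diverges sharply.
export function materiallyDisagrees(
  shipped: string,
  shadow: string,
  threshold = 0.3,
): boolean {
  return overlap(shipped, shadow) < threshold;
}
```

The point of starting this simple is that the flag rate itself is the metric: if 1% of shadow checks disagree in month one and 6% in month three, your router is drifting regardless of how crude the comparator is.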
Teams usually discover they need this the second time they see a bad transcript that routed to Haiku. For anything user-visible, shadow checks belong in the pipeline from day one. The Scorecards layer is where this naturally lives.
## Failure Mode 2: Context Loss on Handoff
When one model plans and another executes, the planner's reasoning tokens never reach the worker, and the worker makes tone-deaf decisions. Planner models emit a lot of internal reasoning that gets billed as output but isn't shown to subsequent calls unless you explicitly pass it. If Opus thinks "this is the customer's third escalation, open with an apology," and Haiku executes without that context, Haiku opens with "Happy to help, what's your order number?"
The fix is a structured plan artifact. The planner writes its decisions, not its reasoning, to a small JSON object. The worker reads the artifact along with the raw turn. Nothing is implicit.
```typescript
import { z } from "zod";

export const Plan = z.object({
  intent: z.enum(["refund", "status", "escalation", "info", "other"]),
  toneGuidance: z.string().max(120),
  mustMention: z.array(z.string()).max(3),
  forbidden: z.array(z.string()).max(3),
  nextAction: z.enum(["reply", "call_tool", "escalate", "ask_clarifying"]),
});
export type Plan = z.infer<typeof Plan>;

// Opus produces this; Haiku consumes it in the same conversation.
```

The discipline is mechanical. Anything the planner noticed that changes the worker's behavior goes in the artifact. Anything that's just internal reasoning stays in the planner's output and gets discarded. This is also what makes the architecture observable. You can grade the plan separately from the reply, and when a reply goes wrong you can point at the plan.
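One way to make the handoff concrete is to render the artifact straight into the worker's system prompt, so nothing rides on the planner's hidden reasoning. A sketch, with illustrative prompt wording (not a tested production prompt) and a locally defined artifact shape mirroring the schema above:

```typescript
// Minimal shape of the Plan artifact from the schema above.
interface PlanArtifact {
  intent: string;
  toneGuidance: string;
  mustMention: string[];
  forbidden: string[];
  nextAction: string;
}

// Render the planner's decisions into the worker's system prompt.
// Every field is explicit; nothing is implicit between models.
export function workerSystemPrompt(plan: PlanArtifact): string {
  const lines = [
    `Intent: ${plan.intent}. Next action: ${plan.nextAction}.`,
    `Tone: ${plan.toneGuidance}`,
  ];
  if (plan.mustMention.length) lines.push(`Must mention: ${plan.mustMention.join("; ")}`);
  if (plan.forbidden.length) lines.push(`Never mention: ${plan.forbidden.join("; ")}`);
  return lines.join("\n");
}
```

With this in place, the Opus "third escalation, open with an apology" signal survives the handoff as `toneGuidance`, and Haiku opens the right way.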
Production swarm data shows conversation quality degrades past 8-10 sequential handoffs between models. That's the upper bound. For CX you rarely need more than two handoffs per turn (plan, execute), so you're well inside the safe zone, but only if each handoff carries an explicit artifact. Implicit handoffs are where the degradation lives.
Memory handles the cross-turn version of this. The plan artifact is the intra-turn piece. Both matter.
## What This Looks Like in Code
The full loop fits in about 60 lines. A classifier picks a tier, a dispatcher calls the right SDK with the right prompts, a plan artifact bridges any handoff, and every decision is logged for the scorecard. This is the skeleton. You'll add your own tools, your own prompts, and your own policy guardrails.
```typescript
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";
import { classify } from "./router.ts";
import { Plan } from "./plan-artifact.ts";

const anthropic = new Anthropic();
const openai = new OpenAI();

export async function handleTurn(turn: Turn): Promise<string> {
  const { tier } = await classify(turn);

  if (tier === "gpt-summarize") {
    // Condense the long transcript first, then re-route the summary.
    const summary = await openai.chat.completions.create({
      model: "gpt-5.4",
      messages: [
        { role: "system", content: "Summarize for a CX agent. 150 tokens max." },
        { role: "user", content: turn.text },
      ],
      max_tokens: 200,
    });
    return handleTurn({ ...turn, text: summary.choices[0].message.content ?? "" });
  }

  if (tier === "opus-plan") {
    // The planner emits a structured Plan artifact, never prose.
    const plan = await anthropic.messages.create({
      model: "claude-opus-4-7",
      max_tokens: 400,
      system: "Produce a Plan JSON. No prose.",
      messages: [{ role: "user", content: turn.text }],
    });
    const parsed = Plan.parse(JSON.parse((plan.content[0] as { text: string }).text));
    return dispatchToWorker(turn, parsed);
  }

  // haiku-direct or haiku-tools
  return dispatchToHaiku(turn);
}
```

Two SDKs, one router, one artifact schema. The complexity isn't in the dispatch. It's in the observability around it: router confidence, plan grade, shipped-reply scorecard, per-tier cost per resolution. The code is the boring part. The measurement is the work.
If you want the deeper picture on why multiple small models now beat one large one on many agent workloads, the companion piece on small language models and when 3B beats 70B is the best follow-up. The distillation article covers the adjacent case: training your own Haiku-class model for routing and execution once you've got production traffic.
## Building, Connecting, and Monitoring a Routed Agent
Routing gives you 75-90% cost savings. It also gives you four new things to build: the plan artifact store, the router observability, the shadow-check scorecard pipeline, and the per-tier cost dashboard. If you're shipping this yourself, that's a week or two of work before the pattern actually pays.
Chanl's Build/Connect/Monitor layers slot in around the routing code you just saw.
Scorecards runs on both shipped replies and shadow-check replies. When a Haiku-direct reply and its Opus shadow materially disagree, the scorecard diff flags the turn for review. That's your router regression signal, automated.
Memory persists the plan artifact across turns and across handoffs. When the next customer message arrives, the worker model sees the prior plan without you threading it through your own state.
Analytics breaks down cost-per-resolution by tier, so you can see when the router is drifting (more turns escalating to Opus than a month ago is a prompt-regression signal, not just a cost signal).
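If you build that dashboard yourself, the per-tier rollup is a fold over logged calls. A minimal sketch, assuming a log-record shape of your own (the `CallLog` fields here are illustrative):

```typescript
interface CallLog {
  tier: string;     // e.g. "haiku-direct", "opus-plan"
  costUsd: number;  // billed cost of this call
  resolved: boolean; // did the ticket close on this turn
}

// Cost per resolution, by tier: total tier spend / resolved tickets in tier.
function costPerResolution(logs: CallLog[]): Map<string, number> {
  const spend = new Map<string, { cost: number; resolved: number }>();
  for (const l of logs) {
    const s = spend.get(l.tier) ?? { cost: 0, resolved: 0 };
    s.cost += l.costUsd;
    if (l.resolved) s.resolved += 1;
    spend.set(l.tier, s);
  }
  const out = new Map<string, number>();
  for (const [tier, s] of spend) {
    // A tier with spend but no resolutions is its own alarm.
    out.set(tier, s.resolved ? s.cost / s.resolved : Infinity);
  }
  return out;
}
```

Watching this number per tier, rather than total spend, is what surfaces router drift: Opus cost-per-resolution holding steady while Opus share of traffic climbs points at the router, not the planner.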
Prompts versions the planner prompt separately from the router prompt. When you tune the planner, you don't want Haiku classification behavior to shift underneath you.
The point isn't that you can't build these yourself. It's that the routing pattern only holds its cost advantage when the observability stack around it is solid. That's where most of the engineering budget you saved on Opus tokens has to go. Picking the right model per turn is the Build pillar in action.
The team I opened with didn't pick a bad model. They picked one model for every job. Swap that single line in their config for a router and a plan artifact and the $105k line item turns into closer to $18k, with better latency on the easy 60% of tickets and better answers on the hard 10%. The pattern isn't clever. It's just what you do once the monthly bill gets your attention.
## References

- Anthropic API Pricing 2026
- Claude Opus 4.7 Pricing Analysis — Finout
- GPT-5.4 Complete Guide — NxCode
- tau2-Bench Telecom Leaderboard — Artificial Analysis
- tau-bench (Sierra) GitHub
- LLM Leaderboard 2026 — Vellum
- Claude Opus 4.7 vs GPT-5.4 vs Gemini 3.1 — EdenAI
- BenchLM — LLM Agent & Tool-Use Benchmarks
- LLM Council — AI Model Benchmarks April 2026
- Anthropic 2026 Agentic Coding Trends Report
- Google Cloud — AI Agent Trends 2026
- MIT Tech Review — Agent Orchestration
- Beam.ai — Multi-Agent Orchestration Patterns for Production
- Anthropic Prompt Caching Documentation
- Anthropic Message Batches API