Why Use Three Models in a Customer-Experience Agent Instead of One?

Each tier of model is priced and tuned for a different job. Opus-class models reason well but cost 5x more per output token than Haiku. Haiku classifies and routes in under 200ms but struggles with multi-step planning. GPT-5.4 Standard handles long summarization at a fair $15 per million output tokens. Using one model for every task means you overpay for the 80% of turns that are routine and underpower the 5% that actually need reasoning.

What Is the Planner, Router, Summarizer Pattern?

Planner, router, summarizer is a routing architecture where you assign each model tier to the task it's best at. The planner (Opus 4.7) decomposes complex tickets into steps. The router (Haiku 4.5) classifies incoming turns and picks the next action. The summarizer (GPT-5.4 Standard) condenses long transcripts and documents. A small supervisor function inspects each turn and picks which tier runs.

How Much Does Multi-Model Routing Save on a Support Deflection Workload?

On 100k tickets per month with an average of 2.5 LLM calls per ticket (5k input / 400 output tokens each), running everything on Opus 4.7 costs about $8,750 a month. Routing Planner/Router/Summarizer across Opus 4.7, Haiku 4.5, and GPT-5.4 Standard drops that to roughly $2,000. Prompt caching on the planner brings it to around $1,500. The savings compound as volume grows.

What's the Biggest Failure Mode of Model Routing?

Misrouting. A hard ticket that looks short gets sent to the small router model, and you get a confidently wrong answer at 120ms instead of a correct one at 4s. The fix is logging router confidence alongside a downstream quality scorecard, then shadow-routing the lowest-confidence 5% to Opus as a regression signal.

Do Planner and Worker Models Share Context on Handoff?

No, and that's the second common failure mode. Reasoning tokens are invisible to downstream models and billed as output. If your planner thinks 'the customer is angry because this is the third escalation' and your worker model never sees that, the worker produces a tone-deaf reply. The fix is writing a structured plan artifact the worker reads explicitly.

Does This Mean I Have to Run Multiple SDKs in Production?

You'll run at least two: Anthropic for Claude models and OpenAI for GPT models. That's fine. The routing layer is a small supervisor function that picks which SDK to call per turn. The complexity lives in the plan artifact, the router classifier, and the observability, not in juggling SDKs.

When Is One Model Still the Right Choice?

When volume is below about 10k conversations per month, routing overhead (engineering, observability, two provider contracts) isn't worth the savings. Start with one mid-tier model, add routing when the monthly bill crosses $2k or p95 latency becomes a user-facing problem.

How Does Tau-Bench Relate to CX Model Selection?

tau-bench and tau2-bench simulate realistic customer-experience workflows (retail, airline, telecom) with tools and multi-turn state. Current single-run pass rates sit below 50% on most models, which tells you model choice matters more for CX than for coding. SWE-bench is coding-biased. For routing decisions on a CX agent, tau2-bench telecom is the closer benchmark.

Your Agent Should Use Three Models, Not One

A customer service team I talked to last month had one line in their agent config that quietly cost them six figures a year. Every inbound ticket, every clarifying question, every "thanks, that worked" reply went through the same Opus-class model. The founder had picked Anthropic at the start of the project because the SDK felt clean. A year later, the bill was a rounding error away from the salary of the support lead.

The cost wasn't Anthropic's fault. The architecture was. They were paying for planning-grade reasoning to classify whether a message contained a refund request.

Brand loyalty is a procurement decision. Model choice, per turn, is an architecture decision. A well-built customer-experience agent uses at least three models: one to plan, one to route, one to summarize. Each is priced and tuned for a different job, and the supervisor that picks between them can run on the cheapest of the three.

This is the pattern. Here's how it works, the rubric for applying it, the cost math on a real workload, and the two failure modes that will bite you if you skip the boring parts.

What Each Model Is Actually Good At

The planner reasons through hard tickets. The router classifies incoming turns and dispatches. The summarizer condenses long inputs cheaply. For a CX workload in April 2026, the default stack is Opus 4.7 for planning, Haiku 4.5 for routing, and GPT-5.4 Standard for summarization.

Model	Input / Output ($/Mtok)	Strength	Used For
Claude Opus 4.7	$5 / $25	Multi-step reasoning, tool planning	Complex tickets, escalation analysis
Claude Haiku 4.5	$1 / $5	Sub-200ms classification, short tool calls	Routing, intent detection, simple replies
GPT-5.4 Standard	$2.50 / $15	Long-context summarization, structured output	Transcript summarization, ticket intake

Opus 4.7 isn't the cheapest model per token, but it's the cheapest per correct plan. Haiku 4.5 is five times cheaper on output and fast enough that you can afford to call it twice. GPT-5.4 Standard sits between them on price and handles long inputs well, which matters for CX because real tickets arrive with attached transcripts and CRM notes.

The standard benchmark comparisons (SWE-bench Verified, MMLU) are misleading here because they're coding-biased. The benchmark that matters for this architecture is tau2-bench, which simulates tool-rich multi-turn CX workflows. Single-run pass rates on tau2-bench telecom still sit below 50% for every frontier model as of April 2026, which is the strongest argument for routing: you can't cover the long tail with one model, so you may as well pay less on the short one.

A Routing Rubric You Can Actually Apply

Route by signal, not by guess. Before the message reaches any reasoning model, a small router inspects five signals and picks the tier. The router itself runs on Haiku, because the classification job is small, and its own latency is part of every conversation.

Signal	Router action
Input ≤ 100 tokens, no tool call needed	Haiku direct reply
Input 100-2k tokens, 1 tool call, single intent	Haiku with tools
Input 100-2k tokens, ambiguous intent or multi-step	Opus planner
Input ≥ 2k tokens (long transcript or doc)	GPT-5.4 Standard summarize, then route the summary
Prior turn escalated or customer sentiment negative	Opus planner, always

This fits in a single function. Here's the core of it. The router returns a tier label; a dispatcher calls the matching SDK.

router.ts·typescript

import Anthropic from "@anthropic-ai/sdk";
 
const anthropic = new Anthropic();
 
type Tier = "haiku-direct" | "haiku-tools" | "opus-plan" | "gpt-summarize";
 
export async function classify(turn: {
  text: string;
  priorSentiment?: "positive" | "neutral" | "negative";
  priorEscalated?: boolean;
}): Promise<{ tier: Tier; confidence: number }> {
  const tokens = Math.ceil(turn.text.length / 4);
 
  if (turn.priorEscalated || turn.priorSentiment === "negative") {
    return { tier: "opus-plan", confidence: 0.99 };
  }
  if (tokens >= 2000) {
    return { tier: "gpt-summarize", confidence: 0.95 };
  }
 
  const msg = await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001",
    max_tokens: 64,
    system: "Classify the user turn. Reply JSON: {tier:'direct'|'tools'|'plan', confidence:0..1}. 'plan' means multi-step reasoning needed.",
    messages: [{ role: "user", content: turn.text }],
  });
 
  const parsed = JSON.parse((msg.content[0] as { text: string }).text);
  const tier: Tier = parsed.tier === "plan"
    ? "opus-plan"
    : parsed.tier === "tools"
      ? "haiku-tools"
      : "haiku-direct";
  return { tier, confidence: parsed.confidence };
}

Two things worth noticing. The deterministic rules run first so you never pay the router on obvious cases. And the router emits a confidence score. That score is the hinge for the first failure mode we'll cover below.

The Cost Math on a Real Deflection Workload

On 100k tickets/month with 2.5 LLM calls per ticket, routing cuts the bill from $8,750 to roughly $1,500 including prompt caching. Here's the shape of it.

Assume the common CX distribution: 60% of tickets are routine (classification + short reply), 25% need one tool call and a short answer, 10% need multi-step planning, 5% arrive with a long transcript that needs summarization first. Real CX calls carry system prompts, tool schemas, and recent history, so the average call looks closer to 5k input and 400 output tokens. Aggregate volume: 1.25B input tokens and 100M output tokens per month. Router adds one extra 300-token Haiku call per turn.

Setup	Input tokens / mo	Output tokens / mo	Monthly bill
Opus 4.7 for everything	1.25B	100M	$6,250 input + $2,500 output = $8,750
Tiered routing (no caching)	1.25B	100M	Haiku-direct 60%: $413. Haiku-tools 25%: $250. Opus planner 10%: $875. GPT summarizer 5%: $344. Router overhead: $113. ~$2,000
Tiered routing + prompt caching on planner	1.25B	100M	Planner input drops from $625 to $120 with 90% cache hit rate. $1,500

The delta is about $7,250/mo, or $87k/yr, on a workload that isn't large by SaaS standards. The savings come from two places: the 60% of routine turns no longer pay Opus rates, and the 10% of planner calls pay the cached-input rate because the system prompt and recent history are static across turns. Anthropic's prompt caching bills cache reads at 10% of normal input, which is a bigger lever than most teams use.

Batch processing through the Message Batches API is another 50% off for anything non-latency-sensitive, like post-call summarization or overnight quality grading. For a deflection workload these are real line items.

One caveat. The router itself adds latency. Running Haiku as a pre-classifier adds roughly 150-250ms to every turn. For text chat that's invisible. For voice, it's close to the budget ceiling, and you should fold the classifier into the first-pass model instead of adding a hop.

Failure Mode 1: Misrouting Is the Silent Killer

A misrouted ticket produces a confidently wrong answer at router speed, and the customer never tells you. The router sends a hard question to Haiku because the question is short, Haiku fabricates a plausible policy, the customer accepts the reply, and nothing alerts. The ticket closes. The scorecard doesn't fail because there's no scorecard on routine turns.

You catch this with two signals. First, log every router decision with its confidence. Second, run a quality scorecard on a sampled slice of Haiku-direct replies, and shadow-route the bottom 5% by confidence to Opus for comparison. When the shadow response materially disagrees with the shipped reply, that's a regression signal. Over a month you get a P&L on your router itself.

shadow-router.ts·typescript

import { classify } from "./router.ts";
 
export async function handle(turn: Turn) {
  const { tier, confidence } = await classify(turn);
  const reply = await dispatch(tier, turn);
 
  // Shadow-route the bottom 5% of Haiku decisions to Opus asynchronously.
  if (tier.startsWith("haiku") && confidence < 0.75) {
    queueShadowCheck({ turnId: turn.id, shippedReply: reply });
  }
  return reply;
}

This pattern is what makes routing safe to run. Without it, you're flying blind on the 60% of your traffic that never triggers a reasoning model, and misroute errors compound because the router's own training signal (your scorecard) only fires when a reply is bad enough that a customer complains.

Teams usually discover they need this the second time they see a bad transcript that routed to Haiku. For anything user-visible, shadow checks belong in the pipeline from day one. The Scorecards layer is where this naturally lives.

Failure Mode 2: Context Loss on Handoff

When one model plans and another executes, the planner's reasoning tokens never reach the worker, and the worker makes tone-deaf decisions. Planner models emit a lot of internal reasoning that gets billed as output but isn't shown to subsequent calls unless you explicitly pass it. If Opus thinks "this is the customer's third escalation, open with an apology," and Haiku executes without that context, Haiku opens with "Happy to help, what's your order number?"

The fix is a structured plan artifact. The planner writes its decisions, not its reasoning, to a small JSON object. The worker reads the artifact along with the raw turn. Nothing is implicit.

plan-artifact.ts·typescript

import { z } from "zod";
 
export const Plan = z.object({
  intent: z.enum(["refund", "status", "escalation", "info", "other"]),
  toneGuidance: z.string().max(120),
  mustMention: z.array(z.string()).max(3),
  forbidden: z.array(z.string()).max(3),
  nextAction: z.enum(["reply", "call_tool", "escalate", "ask_clarifying"]),
});
export type Plan = z.infer<typeof Plan>;
 
// Opus produces this; Haiku consumes it in the same conversation.

The discipline is mechanical. Anything the planner noticed that changes the worker's behavior goes in the artifact. Anything that's just internal reasoning stays in the planner's output and gets discarded. This is also what makes the architecture observable. You can grade the plan separately from the reply, and when a reply goes wrong you can point at the plan.

Production swarm data shows conversation quality degrades past 8-10 sequential handoffs between models. That's the upper bound. For CX you rarely need more than two handoffs per turn (plan, execute), so you're well inside the safe zone, but only if each handoff carries an explicit artifact. Implicit handoffs are where the degradation lives.

Memory handles the cross-turn version of this. The plan artifact is the intra-turn piece. Both matter.

What This Looks Like in Code

The full loop fits in about 60 lines. A classifier picks a tier, a dispatcher calls the right SDK with the right prompts, a plan artifact bridges any handoff, and every decision is logged for the scorecard. This is the skeleton. You'll add your own tools, your own prompts, and your own policy guardrails.

handle-turn.ts·typescript

import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";
import { classify } from "./router.ts";
import { Plan } from "./plan-artifact.ts";
 
const anthropic = new Anthropic();
const openai = new OpenAI();
 
export async function handleTurn(turn: Turn): Promise<string> {
  const { tier } = await classify(turn);
 
  if (tier === "gpt-summarize") {
    const summary = await openai.chat.completions.create({
      model: "gpt-5.4",
      messages: [
        { role: "system", content: "Summarize for a CX agent. 150 tokens max." },
        { role: "user", content: turn.text },
      ],
      max_tokens: 200,
    });
    return handleTurn({ ...turn, text: summary.choices[0].message.content ?? "" });
  }
 
  if (tier === "opus-plan") {
    const plan = await anthropic.messages.create({
      model: "claude-opus-4-7",
      max_tokens: 400,
      system: "Produce a Plan JSON. No prose.",
      messages: [{ role: "user", content: turn.text }],
    });
    const parsed = Plan.parse(JSON.parse((plan.content[0] as { text: string }).text));
    return dispatchToWorker(turn, parsed);
  }
 
  // haiku-direct or haiku-tools
  return dispatchToHaiku(turn);
}

Two SDKs, one router, one artifact schema. The complexity isn't in the dispatch. It's in the observability around it: router confidence, plan grade, shipped-reply scorecard, per-tier cost per resolution. The code is the boring part. The measurement is the work.

If you want the deeper picture on why multiple small models now beat one large one on many agent workloads, the companion piece on small language models and when 3B beats 70B is the best follow-up. The distillation article covers the adjacent case: training your own Haiku-class model for routing and execution once you've got production traffic.

Building, Connecting, and Monitoring a Routed Agent

Routing gives you 75-90% cost savings. It also gives you four new things to build: the plan artifact store, the router observability, the shadow-check scorecard pipeline, and the per-tier cost dashboard. If you're shipping this yourself, that's a week or two of work before the pattern actually pays.

Chanl's Build/Connect/Monitor layers slot in around the routing code you just saw.

Scorecards runs on both shipped replies and shadow-check replies. When a Haiku-direct reply and its Opus shadow materially disagree, the scorecard diff flags the turn for review. That's your router regression signal, automated.

Memory persists the plan artifact across turns and across handoffs. When the next customer message arrives, the worker model sees the prior plan without you threading it through your own state.

Analytics breaks down cost-per-resolution by tier, so you can see when the router is drifting (more turns escalating to Opus than a month ago is a prompt-regression signal, not just a cost signal).

Prompts versions the planner prompt separately from the router prompt. When you tune the planner, you don't want Haiku classification behavior to shift underneath you.

The point isn't that you can't build these yourself. It's that the routing pattern only holds its cost advantage when the observability stack around it is solid. That's where most of the engineering budget you saved on Opus tokens has to go. Picking the right model per turn is the Build pillar in action.

The team I opened with didn't pick a bad model. They picked one model for every job. Swap that single line in their config for a router and a plan artifact and the $105k line item turns into closer to $18k, with better latency on the easy 60% of tickets and better answers on the hard 10%. The pattern isn't clever. It's just what you do once the monthly bill gets your attention.

Sources & References

Key Takeaway

Testing edge cases before production deployment can reduce customer complaints by 80% and prevent costly emergency fixes post-launch.

agent-architecture model-routing cost-optimization claude gpt-5 haiku opus customer-experience

Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

Learn Agentic AI

Weekly. Patterns for shipping agents that work — MCP, scorecards, regression tests, prompts, model comparisons.

500+ builders subscribed

Your Agent Should Use Three Models, Not One

What Each Model Is Actually Good At

A Routing Rubric You Can Actually Apply

The Cost Math on a Real Deflection Workload

Failure Mode 1: Misrouting Is the Silent Killer

Failure Mode 2: Context Loss on Handoff

What This Looks Like in Code

Building, Connecting, and Monitoring a Routed Agent

Learn Agentic AI

Frequently Asked Questions

Related Articles

Everyone Benchmarks Opus. Your Chatbot Runs on Haiku.

When to Use a Supervisor, When to Let Agents Swarm

The Modern Data Stack Wasn't Built for Agents