We got our first real production bill: $13,247. For one agent.
A customer-service agent on Claude Sonnet. Nothing exotic: 4,000-token system prompt, 8 tools for order lookups and CRM updates, 500 conversations a day. Each conversation averaged 6 turns with full history replay. The math was brutal once we actually did it.
Over three months, we cut that agent to $1,100/month. Same quality. Same tools. Same conversation depth. This is the playbook.
In this article:
- The $13K Bill, Decomposed: where the money actually goes
- Prompt Caching: 90% Input Savings: the single biggest lever
- Model Routing: Right Model, Right Task: 30-50% off the remaining cost
- Batch Processing: Half-Price Background Work: 50% off non-real-time tasks
- Plan-and-Execute: Stop Re-Planning: caching decisions, not just prompts
- Context Management: Smaller Windows: pay for what matters
- The Full Stack: $13K to $1,100: all techniques combined
- Implementation Checklist: week-by-week rollout
The $13K Bill, Decomposed
Most of a production agent's cost hides in content resent on every turn: system prompt, tool schemas, growing conversation history. Here is what our 500-conversation/day agent consumed:
| Component | Tokens per Conversation | Monthly Volume (15K convos) | Notes |
|---|---|---|---|
| System prompt | 4,000 input | 60M input | Resent every turn |
| Conversation history | 3,000 input (avg) | 45M input | Grows each turn |
| Tool schemas (8 tools) | 2,400 input | 36M input | Resent every turn |
| Tool call results | 1,200 input | 18M input | CRM/order data |
| Agent responses | 1,500 output | 22.5M output | The actual replies |
| Total | ~12,100 | 159M input + 22.5M output | |
On Claude Sonnet at $3/MTok input and $15/MTok output:
Input: 159M tokens x $3.00/MTok = $477
Output: 22.5M tokens x $15.00/MTok = $337.50
Wait. That is only $815. Where is the $13K?
Here is the part nobody warns you about: each conversation has multiple turns. Our 6-turn average means the system prompt, tool schemas, and growing history are resent on every single turn. The real math:
// Each turn resends: system prompt + tools + full history
// Turn 1: 4,000 + 2,400 + 0 history = 6,400 input
// Turn 2: 4,000 + 2,400 + 2,500 history = 8,900 input
// Turn 3: 4,000 + 2,400 + 5,000 history = 11,400 input
// Turn 4: 4,000 + 2,400 + 7,500 history = 13,900 input
// Turn 5: 4,000 + 2,400 + 10,000 history = 16,400 input
// Turn 6: 4,000 + 2,400 + 12,500 history = 18,900 input
// Total per conversation: ~75,900 input tokens + ~9,000 output
Now the real bill:
Monthly input: 75,900 x 15,000 convos = 1,138.5M tokens
Monthly output: 9,000 x 15,000 convos = 135M tokens
Input cost: 1,138.5M x $3.00/MTok = $3,415.50
Output cost: 135M x $15.00/MTok = $2,025.00
Total: $5,440.50/month
Add multi-step tool calls (ours averaged 2.4 tool-use rounds per conversation) and another $4,000-$7,800 in LLM calls for tool reasoning. Actual bill: $13,247.
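The blow-up is mechanical, so it is easy to sanity-check. A standalone sketch (not our production code) that reproduces the per-conversation figure and the monthly base cost:

```typescript
// Per-turn input = static content (system prompt + tool schemas)
// plus conversation history, which grows by ~2,500 tokens per turn.
const STATIC_TOKENS = 6_400;   // 4,000 system prompt + 2,400 tool schemas
const HISTORY_GROWTH = 2_500;  // avg tokens added to history each turn

function perConversationInputTokens(turns: number): number {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    total += STATIC_TOKENS + HISTORY_GROWTH * (turn - 1);
  }
  return total;
}

// 6 turns -> 75,900 input tokens, matching the turn-by-turn breakdown above
const perConvo = perConversationInputTokens(6);

// Monthly base at 15,000 conversations, $3/MTok input, $15/MTok output
const monthlyInputCost = (perConvo * 15_000 * 3.0) / 1e6;  // $3,415.50
const monthlyOutputCost = (9_000 * 15_000 * 15.0) / 1e6;   // $2,025.00
```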
The system prompt and tool schemas alone, content that never changes between conversations, accounted for roughly half of per-conversation input tokens (38,400 of ~75,900). That is where we started cutting.
Prompt Caching: 90% Input Savings
Prompt caching eliminates re-processing of static content you resend every request: system prompts, tool schemas, few-shot examples. Subsequent calls read from cache at a fraction of the input price. For our agent, this single lever cut $1,224/month.
| Provider | Cache Write Cost | Cache Read Cost | Savings on Reads | Cache Duration |
|---|---|---|---|---|
| Anthropic | 1.25x input price | 0.1x input price | 90% | 5 minutes |
| Anthropic (extended) | 2x input price | 0.1x input price | 90% | 1 hour |
| OpenAI (GPT-4.1) | 1x (automatic) | 0.25x input price | 75% | Automatic |
| OpenAI (GPT-5) | 1x (automatic) | 0.1x input price | 90% | Automatic |
| Google (Gemini) | 1x (automatic) | 0.25x input price | 75% | Automatic |
For our agent, the static content per turn was 6,400 tokens (4,000 system prompt + 2,400 tool schemas). At 6 turns per conversation, that is 38,400 static tokens per conversation resent and reprocessed.
The math
Before caching (Claude Sonnet):
Static tokens: 38,400/convo x 15,000 convos = 576M tokens/month
Cost: 576M x $3.00/MTok = $1,728/month (just for static content)
After caching (Anthropic 5-minute cache):
// First request per conversation: cache write (1.25x)
Write cost: 6,400 tokens x 15,000 x $3.75/MTok = $360/month
// Remaining 5 turns per conversation: cache read (0.1x)
Read cost: 6,400 x 5 turns x 15,000 x $0.30/MTok = $144/month
Total: $504/month (vs $1,728, saving $1,224/month)
That is a 71% reduction on static content. With Anthropic's 1-hour cache and enough volume to keep the cache warm, reads drop to $0.30/MTok across nearly all requests.
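The same arithmetic as a sketch, with Anthropic's published Sonnet rates and cache multipliers hard-coded:

```typescript
const STATIC_TOKENS = 6_400;     // system prompt + tool schemas, per turn
const CONVOS_PER_MONTH = 15_000;
const INPUT_PRICE = 3.0;         // $/MTok, Claude Sonnet
const CACHE_WRITE_MULT = 1.25;   // Anthropic 5-minute cache: writes cost 1.25x
const CACHE_READ_MULT = 0.1;     // reads cost 0.1x

// Turn 1 of each conversation writes the cache; turns 2-6 read it.
const writeCost =
  (STATIC_TOKENS * CONVOS_PER_MONTH * INPUT_PRICE * CACHE_WRITE_MULT) / 1e6; // $360
const readCost =
  (STATIC_TOKENS * 5 * CONVOS_PER_MONTH * INPUT_PRICE * CACHE_READ_MULT) / 1e6; // $144
const uncachedCost =
  (STATIC_TOKENS * 6 * CONVOS_PER_MONTH * INPUT_PRICE) / 1e6; // $1,728
const monthlySavings = uncachedCost - (writeCost + readCost);  // $1,224
```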
Implementation
With Anthropic, add a single cache_control field:
const response = await anthropic.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
system: [
{
type: "text",
text: systemPrompt, // 4,000 tokens of instructions
// Cache this block: reads are 90% cheaper than re-processing
cache_control: { type: "ephemeral" }
}
],
tools: tools, // 2,400 tokens of schemas (auto-cached with system)
messages: conversationHistory
});
With OpenAI, caching is automatic for prompts over 1,024 tokens. No code changes needed. The discount appears on your bill.
Running savings: $13,247 - $1,224 = $12,023 remaining.
Model Routing: Right Model, Right Task
Sending every request to your flagship model wastes 60-65% of budget on tasks a 30x cheaper model handles equally well. "What are your hours?" and "I need to dispute a charge" are different tasks, but an unrouted agent treats them identically. A classifier ($4.50/month) sorts each request to the cheapest capable model.
The pricing gap
| Model | Input ($/MTok) | Output ($/MTok) | Good For |
|---|---|---|---|
| GPT-4.1 Nano | $0.10 | $0.40 | Classification, FAQ, routing |
| GPT-4o-mini | $0.15 | $0.60 | Simple Q&A, extraction |
| Claude Haiku 3.5 | $0.80 | $4.00 | Standard support, summaries |
| Gemini 2.5 Flash | $0.30 | $2.50 | Mid-complexity reasoning |
| GPT-4.1 | $2.00 | $8.00 | Complex multi-step tasks |
| Claude Sonnet 4 | $3.00 | $15.00 | Nuanced reasoning, writing |
| GPT-4o | $2.50 | $10.00 | General flagship tasks |
GPT-4.1 Nano to Claude Sonnet is a 30x gap on input. Even routing 50% of requests to a cheaper model creates massive savings.
Three-tier routing
// Tier 1: Simple (60-65% of traffic)
// Greetings, FAQs, order status, store hours
// Route to GPT-4.1 Nano ($0.10/$0.40) because these need no reasoning
const SIMPLE_INTENTS = [
"greeting", "hours", "order_status",
"return_policy", "faq"
];
// Tier 2: Standard (25-30% of traffic)
// Account changes, standard complaints, multi-step lookups
// Route to Claude Haiku 3.5 ($0.80/$4.00) for moderate reasoning
const STANDARD_INTENTS = [
"account_update", "complaint", "billing_question",
"product_comparison"
];
// Tier 3: Complex (10-15% of traffic)
// Disputes, escalations, multi-tool reasoning chains
// Route to Claude Sonnet ($3.00/$15.00) for nuanced judgment
const COMPLEX_INTENTS = [
"dispute", "escalation", "multi_issue",
"policy_exception"
];
The classifier runs on GPT-4.1 Nano ($4.50/month for all traffic). The first message gets classified, then the tier sticks for the session, with automatic upgrade if complexity increases.
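A minimal router on top of those intent lists might look like the sketch below. The model IDs are illustrative placeholders; check your provider's current identifiers before shipping anything like this:

```typescript
type Tier = { model: string; inputPerMTok: number; outputPerMTok: number };

const TIERS: Record<"simple" | "standard" | "complex", Tier> = {
  simple:   { model: "gpt-4.1-nano",            inputPerMTok: 0.10, outputPerMTok: 0.40 },
  standard: { model: "claude-3-5-haiku-latest", inputPerMTok: 0.80, outputPerMTok: 4.00 },
  complex:  { model: "claude-sonnet-4-20250514", inputPerMTok: 3.00, outputPerMTok: 15.00 },
};

const SIMPLE_INTENTS = new Set(["greeting", "hours", "order_status", "return_policy", "faq"]);
const STANDARD_INTENTS = new Set(["account_update", "complaint", "billing_question", "product_comparison"]);

// Unknown intents fall through to the complex tier: over-spending on a
// misclassified request is cheaper than failing it on a weak model.
function selectTier(intent: string): Tier {
  if (SIMPLE_INTENTS.has(intent)) return TIERS.simple;
  if (STANDARD_INTENTS.has(intent)) return TIERS.standard;
  return TIERS.complex;
}
```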
The math
Before routing (everything on Claude Sonnet):
All traffic: $5,440/month (base token cost, pre-tool-use)
After routing (60% Nano, 25% Haiku, 15% Sonnet):
// Simplified: using average tokens per conversation tier
Tier 1 (9,000 convos): 45,000 tokens avg x $0.10/$0.40
Input: 405M x $0.10/MTok = $40.50
Output: 81M x $0.40/MTok = $32.40
Tier 2 (3,750 convos): 70,000 tokens avg x $0.80/$4.00
Input: 262.5M x $0.80/MTok = $210.00
Output: 56.25M x $4.00/MTok = $225.00
Tier 3 (2,250 convos): 90,000 tokens avg x $3.00/$15.00
Input: 202.5M x $3.00/MTok = $607.50
Output: 33.75M x $15.00/MTok = $506.25
Classifier: $4.50/month
Total: $1,626.15/month (vs $5,440, saving ~$3,814)
In practice, we saw a 42% reduction in total token spend after implementing routing. Simpler conversations on cheaper models were also shorter, reducing token counts further.
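The tier math above can be checked mechanically. A sketch using the token averages and traffic split as stated:

```typescript
type TierCost = {
  convos: number;
  inputTokens: number;   // avg input per conversation
  outputTokens: number;  // avg output per conversation
  inPrice: number;       // $/MTok
  outPrice: number;      // $/MTok
};

function monthlyCost(t: TierCost): number {
  return (t.convos * t.inputTokens * t.inPrice) / 1e6 +
         (t.convos * t.outputTokens * t.outPrice) / 1e6;
}

const tiers: TierCost[] = [
  { convos: 9_000, inputTokens: 45_000, outputTokens: 9_000,  inPrice: 0.10, outPrice: 0.40 },  // Nano
  { convos: 3_750, inputTokens: 70_000, outputTokens: 15_000, inPrice: 0.80, outPrice: 4.00 },  // Haiku
  { convos: 2_250, inputTokens: 90_000, outputTokens: 15_000, inPrice: 3.00, outPrice: 15.00 }, // Sonnet
];

const CLASSIFIER_COST = 4.5;
const total = tiers.reduce((sum, t) => sum + monthlyCost(t), 0) + CLASSIFIER_COST; // ~$1,626.15
```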
Running savings: $12,023 - $3,814 = $8,209 remaining.
Batch Processing: Half-Price Background Work
Production agents run background workloads (summarization, quality scoring, knowledge refreshes) that do not need real-time responses. Every major provider offers a Batch API with 50% discounts for asynchronous processing. For our agent, background tasks consumed 25% of total spend.
| Background Task | Token Volume | Frequency | Batch-Eligible? |
|---|---|---|---|
| Conversation summarization | ~2,000/convo | Nightly | Yes |
| Quality scoring (LLM-as-judge) | ~3,500/convo | Nightly | Yes |
| Knowledge base refresh | ~50,000/batch | Weekly | Yes |
| Memory extraction & facts | ~1,500/convo | Post-call | Yes |
| Analytics narrative generation | ~4,000/report | Daily | Yes |
| Prompt regression testing | ~8,000/test | Per deploy | Yes |
The math
Before batching:
Background task spend: $3,300/month
After batching (50% discount):
Background task spend: $1,650/month
Savings: $1,650/month
Anthropic's Batch API also stacks with prompt caching. Batch requests on Claude Sonnet cost $1.50/MTok input and $7.50/MTok output, and cache reads inside a batch drop input to $0.15/MTok: a 95% discount on the standard rate when both apply.
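The stacking is multiplicative, which is easy to get wrong in a spreadsheet. A two-line sketch makes the combined rate explicit:

```typescript
const SONNET_INPUT = 3.0;     // $/MTok, standard rate
const BATCH_MULT = 0.5;       // Batch API: 50% off
const CACHE_READ_MULT = 0.1;  // prompt cache reads: 90% off

// A cache read inside a batch pays both discounts.
const cachedBatchInput = SONNET_INPUT * BATCH_MULT * CACHE_READ_MULT; // $0.15/MTok
const effectiveDiscount = 1 - cachedBatchInput / SONNET_INPUT;        // 95%
```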
// Batch quality scoring: runs nightly, no latency requirement
// 50% discount on both input and output tokens vs real-time API
const batch = await anthropic.beta.messages.batches.create({
requests: conversations.map(convo => ({
custom_id: convo.id, // Maps results back to source conversations
params: {
model: "claude-sonnet-4-20250514",
max_tokens: 500,
// Cache the scoring rubric so all items in the batch share it.
// Combines batch discount (50%) with cache discount (90% reads)
// for 75% total savings on input tokens.
system: [{ type: "text", text: scoringPrompt, cache_control: { type: "ephemeral" } }],
messages: [{ role: "user", content: convo.transcript }]
}
}))
});
Running savings: $8,209 - $1,650 = $6,559 remaining.
Plan-and-Execute: Stop Re-Planning
Agents re-plan identical workflows on every call. "Cancel my order" always triggers the same chain: look up order, check cancellation window, process refund. Plan-and-Execute separates planning from execution and caches the result, so the 58% of requests that follow known patterns skip the planning LLM call entirely.
Remember that $13K bill? A steady slice of it, roughly $1,200/month, was the agent re-reasoning through steps it had already solved thousands of times.
async function handleRequest(message: string, context: AgentContext) {
// Step 1: Classify intent on cheapest model (GPT-4.1 Nano, ~$0.0001)
const intent = await classifyIntent(message);
// Step 2: Check plan cache. This is a DB lookup, not an LLM call.
// 58% of requests hit cache, saving 2,000-5,000 planning tokens each.
const cachedPlan = await planCache.find(intent, context.parameters);
if (cachedPlan) {
// Cache hit: skip the planning LLM call entirely.
// Execute pre-validated steps directly, which also reduces errors.
return await executePlan(cachedPlan, context);
}
// Cache miss: generate plan with mid-tier model (Haiku, not Sonnet)
// because plan generation is structured output, not nuanced reasoning
const plan = await generatePlan(intent, message, context);
// Store for future reuse. Key includes intent + parameters
// so "cancel order #123" and "cancel order #456" share the same plan.
await planCache.store(intent, context.parameters, plan);
return await executePlan(plan, context);
}
Why it works
Over 30 days, 58% of requests matched one of 23 plan templates: order status, cancellations, address updates, billing inquiries. Same steps, different parameters. For those 58%, the planning call is eliminated. For the other 42%, Haiku generates the plan instead of Sonnet.
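For plans to be reusable, the cache key has to generalize across parameter values, or "cancel order #123" and "cancel order #456" would never share a plan. One sketch of the normalization (the scheme is our own choice, nothing provider-specific):

```typescript
// Key on intent + the *names* of the parameters, never their values.
// { orderId: "123" } and { orderId: "456" } normalize to the same key,
// so both requests reuse the same cached plan with different inputs.
function planCacheKey(intent: string, parameters: Record<string, unknown>): string {
  const paramShape = Object.keys(parameters).sort().join(",");
  return `${intent}::${paramShape}`;
}
```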
The math
Before Plan-and-Execute:
// Planning cost: ~3,000 tokens per request on Sonnet
Planning: 3,000 tokens x 6 turns x 15,000 convos x $3.00/MTok
= 270M tokens x $3.00/MTok = $810/month
(Plus output tokens for plans: ~$400/month)
Total planning cost: ~$1,210/month
After Plan-and-Execute:
// 58% cache hits: $0 planning cost
// 42% cache misses: plan on Haiku instead of Sonnet
Miss planning: 3,000 x 6 x 6,300 convos x $0.80/MTok = $90.72
Miss output: ~$50/month
Intent classifier: already counted in routing
Total planning cost: ~$141/month (saving ~$1,069)
The savings compound with model routing. Cached plans can execute their individual steps on the cheapest capable model per step.
Running savings: $6,559 - $1,069 = $5,490 remaining.
Context Management: Smaller Windows
By turn 6, you are sending 12,500 tokens of history on every request, and most of it is low-value for the current turn. Three techniques cut history tokens by 40-60%.
1. Sliding window with summary
Keep the last 3 turns verbatim and summarize earlier turns into a compressed context block:
async function buildContext(history: Message[]): Promise<Message[]> {
// Short conversations don't need compression (3 turns = 6 messages)
if (history.length <= 6) return history;
// Summarize older turns into ~200 tokens instead of ~2,500 raw
// This preserves key facts while cutting 90% of history tokens
const oldTurns = history.slice(0, -6);
const summary = await summarize(oldTurns); // Nano model: ~$0.00005 per call
return [
// Inject summary as context so the model retains earlier conversation state
{ role: "system", content: `Previous context: ${summary}` },
...history.slice(-6) // Recent turns stay verbatim for accuracy
];
}
2. Tool schema pruning
Do not send all 8 tool schemas on every turn. After intent classification, send only the 2-3 tools relevant to the current task:
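The selectToolsForIntent helper referenced in this section is little more than a lookup table. A sketch, with an intent-to-tool mapping we chose ourselves and a deliberate fallback for unknown intents:

```typescript
type Tool = { name: string; schema: object };

// Map each intent to the handful of tools its workflows can actually call.
const TOOLS_BY_INTENT: Record<string, string[]> = {
  order_status:     ["orderLookup", "orderModify"],
  billing_question: ["billingCheck", "refundProcess"],
  complaint:        ["orderLookup", "escalate"],
};

function selectToolsForIntent(intent: string, allTools: Tool[]): Tool[] {
  const allowed = TOOLS_BY_INTENT[intent];
  // Unlisted intent: send the full toolset rather than starve the model.
  if (!allowed) return allTools;
  return allTools.filter(t => allowed.includes(t.name));
}
```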
// Before: 2,400 tokens of tool schemas on every turn (8 tools x ~300 tokens each)
const allTools = [orderLookup, orderCancel, orderModify, crmUpdate,
billingCheck, refundProcess, escalate, faqSearch];
// After: ~800 tokens. Intent classification already ran, so we know
// which tools are relevant. No reason to pay for schemas the model won't use.
const relevantTools = selectToolsForIntent(intent, allTools);
// "order_status" -> [orderLookup, orderModify] (2 tools, ~600 tokens)
// "billing_question" -> [billingCheck, refundProcess] (2 tools, ~600 tokens)3. Structured extraction over raw replay
Extract structured data from tool results once instead of replaying raw JSON:
// Before: 800 tokens of raw JSON replayed in history every turn
// { "order": { "id": "ORD-9284", "items": [...], "shipping": {...}, ... } }
// After: 120 tokens. Extract once, reference compactly.
// The model only needs facts, not the full API response structure.
// Order ORD-9284: 2 items, shipped 3/18, arriving 3/21, tracking UPS-1Z999
Back to that $13K bill: context management attacks the same turn-by-turn multiplier that made the original cost surprising.
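The extraction itself can be a tiny formatting step after each tool call. A sketch against a hypothetical order shape (your API's response will differ):

```typescript
type Order = {
  id: string;
  items: { sku: string }[];
  shipping: { shippedDate: string; eta: string; tracking: string };
};

// Collapse an ~800-token API response into a one-line fact the model
// carries through the rest of the conversation.
function summarizeOrder(order: Order): string {
  const { id, items, shipping } = order;
  return `Order ${id}: ${items.length} items, shipped ${shipping.shippedDate}, ` +
         `arriving ${shipping.eta}, tracking ${shipping.tracking}`;
}
```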
Combined context savings
Before context optimization:
Average input at turn 6: 18,900 tokens
After (sliding window + tool pruning + structured extraction):
Average input at turn 6: 9,200 tokens (~51% reduction)
Monthly savings at blended model rate (~$1.20/MTok avg after routing):
Reduction: ~580M fewer tokens/month
Savings: ~$696/month
Running savings: $5,490 - $696 = $4,794 remaining.
The Full Stack: $13K to $1,100
Layering all five techniques reduced our agent from $13,247/month to $1,100/month. A 92% reduction. The techniques compound because each one shrinks the input that the others operate on.
| Technique | Monthly Savings | Cumulative Cost | Reduction |
|---|---|---|---|
| Baseline (unoptimized) | - | $13,247 | - |
| + Prompt caching | -$1,224 | $12,023 | 9% |
| + Model routing | -$3,814 | $8,209 | 38% |
| + Batch processing | -$1,650 | $6,559 | 51% |
| + Plan-and-Execute | -$1,069 | $5,490 | 59% |
| + Context management | -$696 | $4,794 | 64% |
| + All techniques compounding | -$3,694* | ~$1,100 | 92% |
*Routing means cheaper models for caching. Caching means fewer tokens for routing decisions. Context management shrinks payloads everywhere. Plan-and-Execute skips entire LLM calls. The techniques multiply.
Actual production cost after full optimization: $1,100/month for the same 500-conversation/day, 8-tool agent.
Quality held through every change
We tracked these metrics using scorecards and analytics:
- Resolution rate: held at 84% (pre-optimization: 83%)
- Customer satisfaction: 4.2/5.0 (pre: 4.1/5.0. Routing actually improved simple cases)
- Escalation rate: dropped from 16% to 14% (Plan-and-Execute was more consistent)
- Average handle time: 2.1 minutes (pre: 2.3 minutes. Cached plans executed faster)
Quality monitoring is not optional. It is what makes optimization safe. Without real-time analytics on resolution rates and satisfaction scores, you are flying blind.
Implementation Checklist
Start with caching (week 1), then context management, then routing, then batching. Each step builds on a stable foundation before adding model changes.
Week 1: Prompt Caching (highest ROI, lowest risk)
- Enable cache_control on system prompts (Anthropic) or verify automatic caching (OpenAI)
- Cache tool schemas alongside system prompt
- Monitor cache hit rates in your analytics dashboard. Target 85%+
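A hit-rate check can run on the usage block each response already returns. The field names below match Anthropic's Messages API usage object as we understand it; verify against the current docs before relying on them:

```typescript
type Usage = {
  input_tokens: number;                  // uncached input
  cache_creation_input_tokens?: number;  // tokens written to cache
  cache_read_input_tokens?: number;      // tokens served from cache
};

// Fraction of cacheable input served from cache. Target 85%+ once warm.
function cacheHitRate(calls: Usage[]): number {
  let reads = 0, writes = 0;
  for (const u of calls) {
    reads += u.cache_read_input_tokens ?? 0;
    writes += u.cache_creation_input_tokens ?? 0;
  }
  const cacheable = reads + writes;
  return cacheable === 0 ? 0 : reads / cacheable;
}
```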
Week 2: Context Management (no model changes)
- Implement sliding window summarization for conversations over 4 turns
- Prune tool schemas per intent (requires intent classification)
- Switch raw tool results to structured extraction
Week 3: Model Routing (requires testing)
- Build intent classifier on cheapest model (GPT-4.1 Nano or Gemini Flash Lite)
- Define tier boundaries from historical conversation analysis
- Shadow-route for 1 week: route silently, compare quality scores between tiers
- Deploy with automatic upgrade triggers (complexity score threshold)
Week 4: Batch Processing + Plan-and-Execute
- Move summarization, quality scoring, and analytics to Batch API
- Implement plan cache with top 20 intent templates
- Set cache TTL based on workflow change frequency
What Comes Next
Token prices drop every quarter. GPT-4o costs 92% less than GPT-4 did at launch. These optimization techniques keep working as prices fall because they are multiplicative, not additive.
The real shift is architectural. Teams that build tool-equipped agents with routing, caching, and plan reuse from day one never see a $13K bill. They start at $1,100 and scale to 5,000 conversations/day for $8,000 instead of $130K.
Start with caching this week. Route by next week. Your CFO will notice.
Monitor Your Agent's Token Economics
Chanl tracks cost per conversation, token breakdown by component, and quality scores alongside spend. Optimize with data, not guesses.