
Your AI Agent Costs $13K/Month. Here's the Fix.

A production customer-service agent burned $13,247 in one month. Prompt caching, model routing, batch processing, and plan-and-execute architecture cut it to $1,100. Real pricing math for every technique.

Dean Grover, Co-founder
March 20, 2026
16 min read
Watercolor illustration of descending cost bars alongside token streams flowing through an optimization pipeline

We got our first real production bill: $13,247. For one agent.

A customer-service agent on Claude Sonnet. Nothing exotic: 4,000-token system prompt, 8 tools for order lookups and CRM updates, 500 conversations a day. Each conversation averaged 6 turns with full history replay. The math was brutal once we actually did it.

Over three months, we cut that agent to $1,100/month. Same quality. Same tools. Same conversation depth. This is the playbook.



The $13K Bill, Decomposed

Most of a production agent's cost hides in content resent on every turn: system prompt, tool schemas, growing conversation history. Here is what our 500-conversation/day agent consumed:

| Component | Tokens per Conversation | Monthly Volume (15K convos) | Notes |
|---|---|---|---|
| System prompt | 4,000 input | 60M input | Resent every turn |
| Conversation history | 3,000 input (avg) | 45M input | Grows each turn |
| Tool schemas (8 tools) | 2,400 input | 36M input | Resent every turn |
| Tool call results | 1,200 input | 18M input | CRM/order data |
| Agent responses | 1,500 output | 22.5M output | The actual replies |
| Total | ~12,100 | 159M input + 22.5M output | |

On Claude Sonnet at $3/MTok input and $15/MTok output:

text
Input:  159M tokens x $3.00/MTok  = $477
Output: 22.5M tokens x $15.00/MTok = $337.50

Wait. That is only $815. Where is the $13K?

Here is the part nobody warns you about: each conversation has multiple turns. Our 6-turn average means the system prompt, tool schemas, and growing history are resent on every single turn. The real math:

text
// Each turn resends: system prompt + tools + full history
// Turn 1: 4,000 + 2,400 + 0 history     = 6,400 input
// Turn 2: 4,000 + 2,400 + 2,500 history  = 8,900 input
// Turn 3: 4,000 + 2,400 + 5,000 history  = 11,400 input
// Turn 4: 4,000 + 2,400 + 7,500 history  = 13,900 input
// Turn 5: 4,000 + 2,400 + 10,000 history = 16,400 input
// Turn 6: 4,000 + 2,400 + 12,500 history = 18,900 input
// Total per conversation: ~75,900 input tokens + ~9,000 output

Now the real bill:

text
Monthly input:  75,900 x 15,000 convos = 1,138.5M tokens
Monthly output: 9,000 x 15,000 convos  = 135M tokens
 
Input cost:  1,138.5M x $3.00/MTok  = $3,415.50
Output cost: 135M x $15.00/MTok     = $2,025.00
Total: $5,440.50/month

Add multi-step tool calls (ours averaged 2.4 tool-use rounds per conversation) and another $4,000-$7,800 in LLM calls for tool reasoning. Actual bill: $13,247.
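The turn-by-turn replay math above fits in a few lines. Here is a sketch using this article's figures (the token counts and Sonnet prices are the ones quoted in the post, not API constants):

```typescript
// Per-conversation cost model for full-history replay.
// All constants are this article's figures, not provider-defined values.
const SYSTEM_TOKENS = 4_000;
const TOOL_SCHEMA_TOKENS = 2_400;
const HISTORY_GROWTH_PER_TURN = 2_500; // tokens added to history each turn
const OUTPUT_TOKENS_PER_TURN = 1_500;
const INPUT_PRICE = 3.0 / 1_000_000;   // $/token, Claude Sonnet input
const OUTPUT_PRICE = 15.0 / 1_000_000; // $/token, Claude Sonnet output

function conversationTokens(turns: number): { input: number; output: number } {
  let input = 0;
  for (let turn = 0; turn < turns; turn++) {
    // Every turn resends the static prefix plus the full history so far
    input += SYSTEM_TOKENS + TOOL_SCHEMA_TOKENS + turn * HISTORY_GROWTH_PER_TURN;
  }
  return { input, output: turns * OUTPUT_TOKENS_PER_TURN };
}

function monthlyCost(turns: number, conversations: number): number {
  const { input, output } = conversationTokens(turns);
  return conversations * (input * INPUT_PRICE + output * OUTPUT_PRICE);
}
```

Running `conversationTokens(6)` reproduces the 75,900 input / 9,000 output figures above, and `monthlyCost(6, 15000)` gives the $5,440.50 base before tool-use rounds.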

The system prompt and tool schemas, content that never changes between conversations, accounted for roughly half of all input tokens (38,400 of the ~75,900 per conversation). That is where we started cutting.

Prompt Caching: 90% Input Savings

Prompt caching eliminates re-processing of static content you resend every request: system prompts, tool schemas, few-shot examples. Subsequent calls read from cache at a fraction of the input price. For our agent, this single lever cut $1,224/month.

| Provider | Cache Write Cost | Cache Read Cost | Savings on Reads | Cache Duration |
|---|---|---|---|---|
| Anthropic | 1.25x input price | 0.1x input price | 90% | 5 minutes |
| Anthropic (extended) | 2x input price | 0.1x input price | 90% | 1 hour |
| OpenAI (GPT-4.1) | 1x (automatic) | 0.25x input price | 75% | Automatic |
| OpenAI (GPT-5) | 1x (automatic) | 0.1x input price | 90% | Automatic |
| Google (Gemini) | 1x (automatic) | 0.25x input price | 75% | Automatic |

For our agent, the static content per turn was 6,400 tokens (4,000 system prompt + 2,400 tool schemas). At 6 turns per conversation, that is 38,400 static tokens per conversation resent and reprocessed.

The math

Before caching (Claude Sonnet):

text
Static tokens: 38,400/convo x 15,000 convos = 576M tokens/month
Cost: 576M x $3.00/MTok = $1,728/month (just for static content)

After caching (Anthropic 5-minute cache):

text
// First request per conversation: cache write (1.25x)
Write cost: 6,400 tokens x 15,000 x $3.75/MTok = $360/month
 
// Remaining 5 turns per conversation: cache read (0.1x)
Read cost: 6,400 x 5 turns x 15,000 x $0.30/MTok = $144/month
 
Total: $504/month (vs $1,728, saving $1,224/month)

That is a 71% reduction on static content. With Anthropic's 1-hour cache and enough volume to keep the cache warm, reads drop to $0.30/MTok across nearly all requests.

Implementation

With Anthropic, add a single cache_control field:

typescript
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: systemPrompt, // 4,000 tokens of instructions
      // Cache this block: reads are 90% cheaper than re-processing
      cache_control: { type: "ephemeral" }
    }
  ],
  tools: tools, // 2,400 tokens of schemas (auto-cached with system)
  messages: conversationHistory
});

With OpenAI, caching is automatic for prompts over 1,024 tokens. No code changes needed. The discount appears on your bill.

Running savings: $13,247 - $1,224 = $12,023 remaining.

Model Routing: Right Model, Right Task

Sending every request to your flagship model wastes 60-65% of budget on tasks a 30x cheaper model handles equally well. "What are your hours?" and "I need to dispute a charge" are different tasks, but an unrouted agent treats them identically. A classifier ($4.50/month) sorts each request to the cheapest capable model.

The pricing gap

| Model | Input ($/MTok) | Output ($/MTok) | Good For |
|---|---|---|---|
| GPT-4.1 Nano | $0.10 | $0.40 | Classification, FAQ, routing |
| GPT-4o-mini | $0.15 | $0.60 | Simple Q&A, extraction |
| Claude Haiku 3.5 | $0.80 | $4.00 | Standard support, summaries |
| Gemini 2.5 Flash | $0.30 | $2.50 | Mid-complexity reasoning |
| GPT-4.1 | $2.00 | $8.00 | Complex multi-step tasks |
| Claude Sonnet 4 | $3.00 | $15.00 | Nuanced reasoning, writing |
| GPT-4o | $2.50 | $10.00 | General flagship tasks |

GPT-4.1 Nano to Claude Sonnet is a 30x gap on input. Even routing 50% of requests to a cheaper model creates massive savings.

Three-tier routing

typescript
// Tier 1: Simple (60-65% of traffic)
// Greetings, FAQs, order status, store hours
// Route to GPT-4.1 Nano ($0.10/$0.40) because these need no reasoning
const SIMPLE_INTENTS = [
  "greeting", "hours", "order_status",
  "return_policy", "faq"
];
 
// Tier 2: Standard (25-30% of traffic)
// Account changes, standard complaints, multi-step lookups
// Route to Claude Haiku 3.5 ($0.80/$4.00) for moderate reasoning
const STANDARD_INTENTS = [
  "account_update", "complaint", "billing_question",
  "product_comparison"
];
 
// Tier 3: Complex (10-15% of traffic)
// Disputes, escalations, multi-tool reasoning chains
// Route to Claude Sonnet ($3.00/$15.00) for nuanced judgment
const COMPLEX_INTENTS = [
  "dispute", "escalation", "multi_issue",
  "policy_exception"
];

The classifier runs on GPT-4.1 Nano ($4.50/month for all traffic). The first message gets classified, then the tier sticks for the session with automatic upgrade if complexity increases.
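The session-sticky part matters: a conversation that escalates should never drop back to a cheaper model mid-thread. A sketch of that upgrade-only routing, with illustrative model IDs and with the Nano classifier call (`classifyIntent`) assumed to happen upstream:

```typescript
// Tiered, session-sticky routing. Model IDs are illustrative assumptions.
type Tier = "simple" | "standard" | "complex";

const TIER_MODEL: Record<Tier, string> = {
  simple: "gpt-4.1-nano",
  standard: "claude-3-5-haiku-latest",
  complex: "claude-sonnet-4-20250514",
};

const TIER_RANK: Record<Tier, number> = { simple: 0, standard: 1, complex: 2 };

function tierForIntent(intent: string): Tier {
  if (["greeting", "hours", "order_status", "return_policy", "faq"].includes(intent)) {
    return "simple";
  }
  if (["dispute", "escalation", "multi_issue", "policy_exception"].includes(intent)) {
    return "complex";
  }
  return "standard"; // account updates, complaints, billing, comparisons
}

// Tier sticks for the session and only ever upgrades, never downgrades,
// so a conversation that turns complex is not bounced back to Nano.
function routeTurn(session: { tier: Tier }, intent: string): string {
  const turnTier = tierForIntent(intent);
  if (TIER_RANK[turnTier] > TIER_RANK[session.tier]) session.tier = turnTier;
  return TIER_MODEL[session.tier];
}
```

Once a `"dispute"` turn bumps the session to complex, later simple turns like `"hours"` stay on Sonnet for continuity.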

The math

Before routing (everything on Claude Sonnet):

text
All traffic: $5,440/month (base token cost, pre-tool-use)

After routing (60% Nano, 25% Haiku, 15% Sonnet):

text
// Simplified: using average tokens per conversation tier
Tier 1 (9,000 convos): 45,000 tokens avg x $0.10/$0.40
  Input: 405M x $0.10/MTok  = $40.50
  Output: 81M x $0.40/MTok  = $32.40
 
Tier 2 (3,750 convos): 70,000 tokens avg x $0.80/$4.00
  Input: 262.5M x $0.80/MTok = $210.00
  Output: 56.25M x $4.00/MTok = $225.00
 
Tier 3 (2,250 convos): 90,000 tokens avg x $3.00/$15.00
  Input: 202.5M x $3.00/MTok = $607.50
  Output: 33.75M x $15.00/MTok = $506.25
 
Classifier: $4.50/month
Total: $1,626.15/month (vs $5,440, saving ~$3,814)

In practice, we saw a 42% reduction in total token spend after implementing routing. Simpler conversations on cheaper models were also shorter, reducing token counts further.

Running savings: $12,023 - $3,814 = $8,209 remaining.

Batch Processing: Half-Price Background Work

Production agents run background workloads (summarization, quality scoring, knowledge refreshes) that do not need real-time responses. Every major provider offers a Batch API with 50% discounts for asynchronous processing. For our agent, background tasks consumed 25% of total spend.

| Background Task | Token Volume | Frequency | Batch-Eligible? |
|---|---|---|---|
| Conversation summarization | ~2,000/convo | Nightly | Yes |
| Quality scoring (LLM-as-judge) | ~3,500/convo | Nightly | Yes |
| Knowledge base refresh | ~50,000/batch | Weekly | Yes |
| Memory extraction & facts | ~1,500/convo | Post-call | Yes |
| Analytics narrative generation | ~4,000/report | Daily | Yes |
| Prompt regression testing | ~8,000/test | Per deploy | Yes |

The math

Before batching:

text
Background task spend: $3,300/month

After batching (50% discount):

text
Background task spend: $1,650/month
Savings: $1,650/month

Anthropic's Batch API also stacks with prompt caching. A batch request on Claude Sonnet costs $1.50/MTok input and $7.50/MTok output (50% off), and cache reads inside a batch are discounted again on top of that, so repeated static content like a shared scoring rubric ends up at a small fraction of the standard input rate.

typescript
// Batch quality scoring: runs nightly, no latency requirement
// 50% discount on both input and output tokens vs real-time API
const batch = await anthropic.beta.messages.batches.create({
  requests: conversations.map(convo => ({
    custom_id: convo.id, // Maps results back to source conversations
    params: {
      model: "claude-sonnet-4-20250514",
      max_tokens: 500,
      // Cache the scoring rubric so all items in the batch share it.
      // The batch discount (50%) and cache-read discount (90%) multiply,
      // so repeated rubric tokens cost a small fraction of the standard rate.
      system: [{ type: "text", text: scoringPrompt, cache_control: { type: "ephemeral" } }],
      messages: [{ role: "user", content: convo.transcript }]
    }
  }))
});

Running savings: $8,209 - $1,650 = $6,559 remaining.

Plan-and-Execute: Stop Re-Planning

Agents re-plan identical workflows on every call. "Cancel my order" always triggers the same chain: look up order, check cancellation window, process refund. Plan-and-Execute separates planning from execution and caches the result, so the 58% of requests that follow known patterns skip the planning LLM call entirely.

Remember that $13K bill? Over 40% of it was the agent re-reasoning through steps it had already solved thousands of times.

typescript
async function handleRequest(message: string, context: AgentContext) {
  // Step 1: Classify intent on cheapest model (GPT-4.1 Nano, ~$0.0001)
  const intent = await classifyIntent(message);
 
  // Step 2: Check plan cache. This is a DB lookup, not an LLM call.
  // 58% of requests hit cache, saving 2,000-5,000 planning tokens each.
  const cachedPlan = await planCache.find(intent, context.parameters);
 
  if (cachedPlan) {
    // Cache hit: skip the planning LLM call entirely.
    // Execute pre-validated steps directly, which also reduces errors.
    return await executePlan(cachedPlan, context);
  }
 
  // Cache miss: generate plan with mid-tier model (Haiku, not Sonnet)
  // because plan generation is structured output, not nuanced reasoning
  const plan = await generatePlan(intent, message, context);
 
  // Store for future reuse. Key includes intent + parameters
  // so "cancel order #123" and "cancel order #456" share the same plan.
  await planCache.store(intent, context.parameters, plan);
 
  return await executePlan(plan, context);
}

Why it works

Over 30 days, 58% of requests matched one of 23 plan templates: order status, cancellations, address updates, billing inquiries. Same steps, different parameters. For those 58%, the planning call is eliminated. For the other 42%, Haiku generates the plan instead of Sonnet.
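That parameter-independence comes from how the cache is keyed: on intent plus parameter names, never parameter values. A minimal sketch (`PlanStep`, `planKey`, and the `{orderId}` placeholder syntax are illustrative, not from a specific library):

```typescript
// Plan templates store placeholders, so one template serves every order ID.
interface PlanStep {
  tool: string;
  args: Record<string, string>; // values like "{orderId}" are filled at execution time
}

// Key on intent + sorted parameter NAMES so differing values share a template
function planKey(intent: string, params: Record<string, unknown>): string {
  return `${intent}:${Object.keys(params).sort().join(",")}`;
}

const planCache = new Map<string, PlanStep[]>();

planCache.set(planKey("cancel_order", { orderId: "ORD-123" }), [
  { tool: "orderLookup", args: { id: "{orderId}" } },
  { tool: "orderCancel", args: { id: "{orderId}" } },
  { tool: "refundProcess", args: { orderId: "{orderId}" } },
]);
```

A later request for `{ orderId: "ORD-456" }` produces the same key and hits the cached three-step template with zero planning tokens.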

The math

Before Plan-and-Execute:

text
// Planning cost: ~3,000 tokens per request on Sonnet
Planning: 3,000 tokens x 6 turns x 15,000 convos x $3.00/MTok
= 270M tokens x $3.00/MTok = $810/month
(Plus output tokens for plans: ~$400/month)
Total planning cost: ~$1,210/month

After Plan-and-Execute:

text
// 58% cache hits: $0 planning cost
// 42% cache misses: plan on Haiku instead of Sonnet
Miss planning: 3,000 x 6 x 6,300 convos x $0.80/MTok = $90.72
Miss output: ~$50/month
Intent classifier: already counted in routing
Total planning cost: ~$141/month (saving ~$1,069)

The savings compound with model routing. Cached plans can execute their individual steps on the cheapest capable model per step.

Running savings: $6,559 - $1,069 = $5,490 remaining.

Context Management: Smaller Windows

By turn 6, you are sending 12,500 tokens of history on every request, and most of it is low-value for the current turn. Three techniques cut history tokens by 40-60%.

1. Sliding window with summary

Keep the last 3 turns verbatim and summarize earlier turns into a compressed context block:

typescript
async function buildContext(history: Message[]): Promise<Message[]> {
  // Short conversations don't need compression (3 turns = 6 messages)
  if (history.length <= 6) return history;
 
  // Summarize older turns into ~200 tokens instead of ~2,500 raw
  // This preserves key facts while cutting 90% of history tokens
  const oldTurns = history.slice(0, -6);
  const summary = await summarize(oldTurns); // Nano model: ~$0.00005 per call
 
  return [
    // Inject summary as context so the model retains earlier conversation state
    { role: "system", content: `Previous context: ${summary}` },
    ...history.slice(-6) // Recent turns stay verbatim for accuracy
  ];
}

2. Tool schema pruning

Do not send all 8 tool schemas on every turn. After intent classification, send only the 2-3 tools relevant to the current task:

typescript
// Before: 2,400 tokens of tool schemas on every turn (8 tools x ~300 tokens each)
const allTools = [orderLookup, orderCancel, orderModify, crmUpdate,
                  billingCheck, refundProcess, escalate, faqSearch];
 
// After: ~800 tokens. Intent classification already ran, so we know
// which tools are relevant. No reason to pay for schemas the model won't use.
const relevantTools = selectToolsForIntent(intent, allTools);
// "order_status" -> [orderLookup, orderModify] (2 tools, ~600 tokens)
// "billing_question" -> [billingCheck, refundProcess] (2 tools, ~600 tokens)
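A minimal `selectToolsForIntent`, under the assumption that each tool object carries a `name` field matching its schema. The intent-to-tool mapping here is illustrative; derive yours from traffic analysis:

```typescript
// Prune tool schemas to the ones the classified intent can actually use.
interface Tool {
  name: string;
  // schema fields (description, input_schema, ...) elided for the sketch
}

const TOOLS_BY_INTENT: Record<string, string[]> = {
  order_status: ["orderLookup", "orderModify"],
  account_update: ["crmUpdate"],
  billing_question: ["billingCheck", "refundProcess"],
  dispute: ["orderLookup", "billingCheck", "refundProcess", "escalate"],
};

function selectToolsForIntent(intent: string, allTools: Tool[]): Tool[] {
  const names = TOOLS_BY_INTENT[intent];
  // Unknown intent: fall back to the full set rather than break the agent
  if (!names) return allTools;
  return allTools.filter(t => names.includes(t.name));
}
```

The fallback matters: a misclassified or novel intent should pay the full 2,400 schema tokens rather than fail a tool call.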

3. Structured extraction over raw replay

Extract structured data from tool results once instead of replaying raw JSON:

typescript
// Before: 800 tokens of raw JSON replayed in history every turn
// { "order": { "id": "ORD-9284", "items": [...], "shipping": {...}, ... } }
 
// After: 120 tokens. Extract once, reference compactly.
// The model only needs facts, not the full API response structure.
// Order ORD-9284: 2 items, shipped 3/18, arriving 3/21, tracking UPS-1Z999
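The extraction can be a plain function over the typed tool result, run once when the tool returns. A sketch with illustrative field names for the order payload:

```typescript
// Collapse a raw order-lookup result into the compact fact line that
// gets replayed in history. Field names here are illustrative assumptions.
interface OrderResult {
  order: {
    id: string;
    items: unknown[];
    shippedDate: string;
    eta: string;
    tracking: string;
  };
}

function compactOrderFacts(raw: OrderResult): string {
  const o = raw.order;
  // ~120 tokens of facts instead of ~800 tokens of nested JSON
  return `Order ${o.id}: ${o.items.length} items, shipped ${o.shippedDate}, ` +
         `arriving ${o.eta}, tracking ${o.tracking}`;
}
```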

Back to that $13K bill: context management attacks the same turn-by-turn multiplier that made the original cost surprising.

Combined context savings

text
Before context optimization:
  Average input per turn 6: 18,900 tokens
 
After (sliding window + tool pruning + structured extraction):
  Average input per turn 6: 9,200 tokens (~51% reduction)
 
Monthly savings at blended model rate (~$1.20/MTok avg after routing):
  Reduction: ~580M fewer tokens/month
  Savings: ~$696/month

Running savings: $5,490 - $696 = $4,794 remaining.

The Full Stack: $13K to $1,100

Layering all five techniques reduced our agent from $13,247/month to $1,100/month. A 92% reduction. The techniques compound because each one shrinks the input that the others operate on.

| Technique | Monthly Savings | Cumulative Cost | Reduction |
|---|---|---|---|
| Baseline (unoptimized) | - | $13,247 | - |
| + Prompt caching | -$1,224 | $12,023 | 9% |
| + Model routing | -$3,814 | $8,209 | 38% |
| + Batch processing | -$1,650 | $6,559 | 51% |
| + Plan-and-Execute | -$1,069 | $5,490 | 59% |
| + Context management | -$696 | $4,794 | 64% |
| + All techniques compounding | -$3,694* | ~$1,100 | 92% |

*Routing means cheaper models for caching. Caching means fewer tokens for routing decisions. Context management shrinks payloads everywhere. Plan-and-Execute skips entire LLM calls. The techniques multiply.

Actual production cost after full optimization: $1,100/month for the same 500-conversation/day, 8-tool agent.

Quality held through every change

We tracked these metrics using scorecards and analytics:

  • Resolution rate: held at 84% (pre-optimization: 83%)
  • Customer satisfaction: 4.2/5.0 (pre: 4.1/5.0. Routing actually improved simple cases)
  • Escalation rate: dropped from 16% to 14% (Plan-and-Execute was more consistent)
  • Average handle time: 2.1 minutes (pre: 2.3 minutes. Cached plans executed faster)

Quality monitoring is not optional. It is what makes optimization safe. Without real-time analytics on resolution rates and satisfaction scores, you are flying blind.

Implementation Checklist

Start with caching (week 1), then context management, then routing, then batching. Each step builds on a stable foundation before adding model changes.

Week 1: Prompt Caching (highest ROI, lowest risk)

  • Enable cache_control on system prompts (Anthropic) or verify automatic caching (OpenAI)
  • Cache tool schemas alongside system prompt
  • Monitor cache hit rates in your analytics dashboard. Target 85%+

Week 2: Context Management (no model changes)

  • Implement sliding window summarization for conversations over 4 turns
  • Prune tool schemas per intent (requires intent classification)
  • Switch raw tool results to structured extraction

Week 3: Model Routing (requires testing)

  • Build intent classifier on cheapest model (GPT-4.1 Nano or Gemini Flash Lite)
  • Define tier boundaries from historical conversation analysis
  • Shadow-route for 1 week: route silently, compare quality scores between tiers
  • Deploy with automatic upgrade triggers (complexity score threshold)
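The shadow-route step above can be as simple as calling both tiers and logging the candidate. A sketch where `callModel` and the logger are injected stubs rather than a specific SDK:

```typescript
// Shadow routing: the user always gets the flagship answer; the candidate
// tier's answer is logged for offline comparison (e.g. LLM-as-judge scoring).
type ModelCall = (model: string, message: string) => Promise<string>;

async function shadowRoute(
  message: string,
  callModel: ModelCall,
  log: (entry: { message: string; flagship: string; candidate: string }) => void
): Promise<string> {
  const [flagship, candidate] = await Promise.all([
    callModel("claude-sonnet-4-20250514", message), // what the user sees
    callModel("gpt-4.1-nano", message),             // logged, never shown
  ]);
  log({ message, flagship, candidate });
  return flagship;
}
```

Running shadow traffic does double the token spend for that week, which is the price of comparing quality on real conversations before committing to the cheaper tier.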

Week 4: Batch Processing + Plan-and-Execute

  • Move summarization, quality scoring, and analytics to Batch API
  • Implement plan cache with top 20 intent templates
  • Set cache TTL based on workflow change frequency

What Comes Next

Token prices drop every quarter. GPT-4o costs 92% less than GPT-4 did at launch. These optimization techniques keep working as prices fall because they are multiplicative, not additive.

The real shift is architectural. Teams that build tool-equipped agents with routing, caching, and plan reuse from day one never see a $13K bill. They start at $1,100 and scale to 5,000 conversations/day for $8,000 instead of $130K.

Start with caching this week. Route by next week. Your CFO will notice.

Monitor Your Agent's Token Economics

Chanl tracks cost per conversation, token breakdown by component, and quality scores alongside spend. Optimize with data, not guesses.
