We got our first real production bill: $13,247. For one agent.
A customer-service agent on Claude Sonnet. Nothing exotic: 4,000-token system prompt, 8 tools for order lookups and CRM updates, 500 conversations a day. Each conversation averaged 6 turns with full history replay. The math was brutal once we actually did it.
Over three months, we cut that agent to $1,100/month. Same quality. Same tools. Same conversation depth. This is the playbook.
In this article:
- The $13K Bill, Decomposed: where the money actually goes
- Prompt Caching: 90% Input Savings: the single biggest lever
- Model Routing: Right Model, Right Task: 30-50% off the remaining cost
- Batch Processing: Half-Price Background Work: 50% off non-real-time tasks
- Plan-and-Execute: Stop Re-Planning: caching decisions, not just prompts
- Context Management: Smaller Windows: pay for what matters
- The Full Stack: $13K to $1,100: all techniques combined
- Implementation Checklist: week-by-week rollout
The $13K Bill, Decomposed
Most of a production agent's cost hides in content resent on every turn: system prompt, tool schemas, growing conversation history. Here is what our 500-conversation/day agent consumed:
| Component | Tokens per Conversation | Monthly Volume (15K convos) | Notes |
|---|---|---|---|
| System prompt | 4,000 input | 60M input | Resent every turn |
| Conversation history | 3,000 input (avg) | 45M input | Grows each turn |
| Tool schemas (8 tools) | 2,400 input | 36M input | Resent every turn |
| Tool call results | 1,200 input | 18M input | CRM/order data |
| Agent responses | 1,500 output | 22.5M output | The actual replies |
| Total | ~12,100 | 159M input + 22.5M output | |
On Claude Sonnet at $3/MTok input and $15/MTok output:
Input: 159M tokens x $3.00/MTok = $477
Output: 22.5M tokens x $15.00/MTok = $337.50
Wait. That is only $815. Where is the $13K?
Here is the part nobody warns you about: each conversation has multiple turns. Our 6-turn average means the system prompt, tool schemas, and growing history are resent on every single turn. The real math:
// Each turn resends: system prompt + tools + full history
// Turn 1: 4,000 + 2,400 + 0 history = 6,400 input
// Turn 2: 4,000 + 2,400 + 2,500 history = 8,900 input
// Turn 3: 4,000 + 2,400 + 5,000 history = 11,400 input
// Turn 4: 4,000 + 2,400 + 7,500 history = 13,900 input
// Turn 5: 4,000 + 2,400 + 10,000 history = 16,400 input
// Turn 6: 4,000 + 2,400 + 12,500 history = 18,900 input
// Total per conversation: ~75,900 input tokens + ~9,000 output
Now the real bill:
Monthly input: 75,900 x 15,000 convos = 1,138.5M tokens
Monthly output: 9,000 x 15,000 convos = 135M tokens
Input cost: 1,138.5M x $3.00/MTok = $3,415.50
Output cost: 135M x $15.00/MTok = $2,025.00
Total: $5,440.50/month
Add multi-step tool calls (ours averaged 2.4 tool-use rounds per conversation) and another $4,000-$7,800 in LLM calls for tool reasoning. Actual bill: $13,247.
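The blow-up is mechanical, so it is easy to sanity-check. A standalone sketch (not our production code) that reproduces the per-conversation figure and the monthly base cost:

```typescript
// Per-turn input = static content (system prompt + tool schemas)
// plus conversation history, which grows by ~2,500 tokens per turn.
const STATIC_TOKENS = 6_400;   // 4,000 system prompt + 2,400 tool schemas
const HISTORY_GROWTH = 2_500;  // avg tokens added to history each turn

function perConversationInputTokens(turns: number): number {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    total += STATIC_TOKENS + HISTORY_GROWTH * (turn - 1);
  }
  return total;
}

// 6 turns -> 75,900 input tokens, matching the turn-by-turn breakdown above
const perConvo = perConversationInputTokens(6);

// Monthly base at 15,000 conversations, $3/MTok input, $15/MTok output
const monthlyInputCost = (perConvo * 15_000 * 3.0) / 1e6;  // $3,415.50
const monthlyOutputCost = (9_000 * 15_000 * 15.0) / 1e6;   // $2,025.00
```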
The system prompt and tool schemas alone, content that never changes between conversations, accounted for roughly half of per-conversation input tokens (38,400 of ~75,900). That is where we started cutting.
Prompt Caching: 90% Input Savings
Prompt caching eliminates re-processing of static content you resend every request: system prompts, tool schemas, few-shot examples. Subsequent calls read from cache at a fraction of the input price. For our agent, this single lever cut $1,224/month.
| Provider | Cache Write Cost | Cache Read Cost | Savings on Reads | Cache Duration |
|---|---|---|---|---|
| Anthropic | 1.25x input price | 0.1x input price | 90% | 5 minutes |
| Anthropic (extended) | 2x input price | 0.1x input price | 90% | 1 hour |
| OpenAI (GPT-4.1) | 1x (automatic) | 0.25x input price | 75% | Automatic |
| OpenAI (GPT-5) | 1x (automatic) | 0.1x input price | 90% | Automatic |
| Google (Gemini) | 1x (automatic) | 0.25x input price | 75% | Automatic |
For our agent, the static content per turn was 6,400 tokens (4,000 system prompt + 2,400 tool schemas). At 6 turns per conversation, that is 38,400 static tokens per conversation resent and reprocessed.
The math
Before caching (Claude Sonnet):
Static tokens: 38,400/convo x 15,000 convos = 576M tokens/month
Cost: 576M x $3.00/MTok = $1,728/month (just for static content)
After caching (Anthropic 5-minute cache):
// First request per conversation: cache write (1.25x)
Write cost: 6,400 tokens x 15,000 x $3.75/MTok = $360/month
// Remaining 5 turns per conversation: cache read (0.1x)
Read cost: 6,400 x 5 turns x 15,000 x $0.30/MTok = $144/month
Total: $504/month (vs $1,728, saving $1,224/month)
That is a 71% reduction on static content. With Anthropic's 1-hour cache and enough volume to keep the cache warm, reads drop to $0.30/MTok across nearly all requests.
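The same arithmetic as a sketch, with Anthropic's published Sonnet rates and cache multipliers hard-coded:

```typescript
const STATIC_TOKENS = 6_400;     // system prompt + tool schemas, per turn
const CONVOS_PER_MONTH = 15_000;
const INPUT_PRICE = 3.0;         // $/MTok, Claude Sonnet
const CACHE_WRITE_MULT = 1.25;   // Anthropic 5-minute cache: writes cost 1.25x
const CACHE_READ_MULT = 0.1;     // reads cost 0.1x

// Turn 1 of each conversation writes the cache; turns 2-6 read it.
const writeCost =
  (STATIC_TOKENS * CONVOS_PER_MONTH * INPUT_PRICE * CACHE_WRITE_MULT) / 1e6; // $360
const readCost =
  (STATIC_TOKENS * 5 * CONVOS_PER_MONTH * INPUT_PRICE * CACHE_READ_MULT) / 1e6; // $144
const uncachedCost =
  (STATIC_TOKENS * 6 * CONVOS_PER_MONTH * INPUT_PRICE) / 1e6; // $1,728
const monthlySavings = uncachedCost - (writeCost + readCost);  // $1,224
```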
Implementation
With Anthropic, add a single cache_control field:
const response = await anthropic.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
system: [
{
type: "text",
text: systemPrompt, // 4,000 tokens of instructions
// Cache this block: reads are 90% cheaper than re-processing
cache_control: { type: "ephemeral" }
}
],
tools: tools, // 2,400 tokens of schemas (auto-cached with system)
messages: conversationHistory
});
With OpenAI, caching is automatic for prompts over 1,024 tokens. No code changes needed. The discount appears on your bill.
Running savings: $13,247 - $1,224 = $12,023 remaining.
Model Routing: Right Model, Right Task
Sending every request to your flagship model wastes 60-65% of budget on tasks a 30x cheaper model handles equally well. "What are your hours?" and "I need to dispute a charge" are different tasks, but an unrouted agent treats them identically. A classifier ($4.50/month) sorts each request to the cheapest capable model.
The pricing gap
| Model | Input ($/MTok) | Output ($/MTok) | Good For |
|---|---|---|---|
| GPT-4.1 Nano | $0.10 | $0.40 | Classification, FAQ, routing |
| GPT-4o-mini | $0.15 | $0.60 | Simple Q&A, extraction |
| Claude Haiku 3.5 | $0.80 | $4.00 | Standard support, summaries |
| Gemini 2.5 Flash | $0.30 | $2.50 | Mid-complexity reasoning |
| GPT-4.1 | $2.00 | $8.00 | Complex multi-step tasks |
| Claude Sonnet 4 | $3.00 | $15.00 | Nuanced reasoning, writing |
| GPT-4o | $2.50 | $10.00 | General flagship tasks |
GPT-4.1 Nano to Claude Sonnet is a 30x gap on input. Even routing 50% of requests to a cheaper model creates massive savings.
Three-tier routing
// Tier 1: Simple (60-65% of traffic)
// Greetings, FAQs, order status, store hours
// Route to GPT-4.1 Nano ($0.10/$0.40) because these need no reasoning
const SIMPLE_INTENTS = [
"greeting", "hours", "order_status",
"return_policy", "faq"
];
// Tier 2: Standard (25-30% of traffic)
// Account changes, standard complaints, multi-step lookups
// Route to Claude Haiku 3.5 ($0.80/$4.00) for moderate reasoning
const STANDARD_INTENTS = [
"account_update", "complaint", "billing_question",
"product_comparison"
];
// Tier 3: Complex (10-15% of traffic)
// Disputes, escalations, multi-tool reasoning chains
// Route to Claude Sonnet ($3.00/$15.00) for nuanced judgment
const COMPLEX_INTENTS = [
"dispute", "escalation", "multi_issue",
"policy_exception"
];
The classifier runs on GPT-4.1 Nano ($4.50/month for all traffic). The first message gets classified, then the tier sticks for the session, with automatic upgrade if complexity increases.
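A minimal router on top of those intent lists might look like the sketch below. The model IDs are illustrative placeholders; check your provider's current identifiers before shipping anything like this:

```typescript
type Tier = { model: string; inputPerMTok: number; outputPerMTok: number };

const TIERS: Record<"simple" | "standard" | "complex", Tier> = {
  simple:   { model: "gpt-4.1-nano",            inputPerMTok: 0.10, outputPerMTok: 0.40 },
  standard: { model: "claude-3-5-haiku-latest", inputPerMTok: 0.80, outputPerMTok: 4.00 },
  complex:  { model: "claude-sonnet-4-20250514", inputPerMTok: 3.00, outputPerMTok: 15.00 },
};

const SIMPLE_INTENTS = new Set(["greeting", "hours", "order_status", "return_policy", "faq"]);
const STANDARD_INTENTS = new Set(["account_update", "complaint", "billing_question", "product_comparison"]);

// Unknown intents fall through to the complex tier: over-spending on a
// misclassified request is cheaper than failing it on a weak model.
function selectTier(intent: string): Tier {
  if (SIMPLE_INTENTS.has(intent)) return TIERS.simple;
  if (STANDARD_INTENTS.has(intent)) return TIERS.standard;
  return TIERS.complex;
}
```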
The math
Before routing (everything on Claude Sonnet):
All traffic: $5,440/month (base token cost, pre-tool-use)
After routing (60% Nano, 25% Haiku, 15% Sonnet):
// Simplified: using average tokens per conversation tier
Tier 1 (9,000 convos): 45,000 tokens avg x $0.10/$0.40
Input: 405M x $0.10/MTok = $40.50
Output: 81M x $0.40/MTok = $32.40
Tier 2 (3,750 convos): 70,000 tokens avg x $0.80/$4.00
Input: 262.5M x $0.80/MTok = $210.00
Output: 56.25M x $4.00/MTok = $225.00
Tier 3 (2,250 convos): 90,000 tokens avg x $3.00/$15.00
Input: 202.5M x $3.00/MTok = $607.50
Output: 33.75M x $15.00/MTok = $506.25
Classifier: $4.50/month
Total: $1,626.15/month (vs $5,440, saving ~$3,814)
In practice, we saw a 42% reduction in total token spend after implementing routing. Simpler conversations on cheaper models were also shorter, reducing token counts further.
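The tier math above can be checked mechanically. A sketch using the token averages and traffic split as stated:

```typescript
type TierCost = {
  convos: number;
  inputTokens: number;   // avg input per conversation
  outputTokens: number;  // avg output per conversation
  inPrice: number;       // $/MTok
  outPrice: number;      // $/MTok
};

function monthlyCost(t: TierCost): number {
  return (t.convos * t.inputTokens * t.inPrice) / 1e6 +
         (t.convos * t.outputTokens * t.outPrice) / 1e6;
}

const tiers: TierCost[] = [
  { convos: 9_000, inputTokens: 45_000, outputTokens: 9_000,  inPrice: 0.10, outPrice: 0.40 },  // Nano
  { convos: 3_750, inputTokens: 70_000, outputTokens: 15_000, inPrice: 0.80, outPrice: 4.00 },  // Haiku
  { convos: 2_250, inputTokens: 90_000, outputTokens: 15_000, inPrice: 3.00, outPrice: 15.00 }, // Sonnet
];

const CLASSIFIER_COST = 4.5;
const total = tiers.reduce((sum, t) => sum + monthlyCost(t), 0) + CLASSIFIER_COST; // ~$1,626.15
```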
Running savings: $12,023 - $3,814 = $8,209 remaining.
Batch Processing: Half-Price Background Work
Production agents run background workloads (summarization, quality scoring, knowledge refreshes) that do not need real-time responses. Every major provider offers a Batch API with 50% discounts for asynchronous processing. For our agent, background tasks consumed 25% of total spend.
| Background Task | Token Volume | Frequency | Batch-Eligible? |
|---|---|---|---|
| Conversation summarization | ~2,000/convo | Nightly | Yes |
| Quality scoring (LLM-as-judge) | ~3,500/convo | Nightly | Yes |
| Knowledge base refresh | ~50,000/batch | Weekly | Yes |
| Memory extraction & facts | ~1,500/convo | Post-call | Yes |
| Analytics narrative generation | ~4,000/report | Daily | Yes |
| Prompt regression testing | ~8,000/test | Per deploy | Yes |
The math
Before batching:
Background task spend: $3,300/month
After batching (50% discount):
Background task spend: $1,650/month
Savings: $1,650/month
Anthropic's Batch API also stacks with prompt caching. Batch requests on Claude Sonnet cost $1.50/MTok input and $7.50/MTok output, and cache reads inside a batch drop input to $0.15/MTok: a 95% discount on the standard rate when both apply.
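The stacking is multiplicative, which is easy to get wrong in a spreadsheet. A two-line sketch makes the combined rate explicit:

```typescript
const SONNET_INPUT = 3.0;     // $/MTok, standard rate
const BATCH_MULT = 0.5;       // Batch API: 50% off
const CACHE_READ_MULT = 0.1;  // prompt cache reads: 90% off

// A cache read inside a batch pays both discounts.
const cachedBatchInput = SONNET_INPUT * BATCH_MULT * CACHE_READ_MULT; // $0.15/MTok
const effectiveDiscount = 1 - cachedBatchInput / SONNET_INPUT;        // 95%
```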
// Batch quality scoring: runs nightly, no latency requirement
// 50% discount on both input and output tokens vs real-time API
const batch = await anthropic.beta.messages.batches.create({
requests: conversations.map(convo => ({
custom_id: convo.id, // Maps results back to source conversations
params: {
model: "claude-sonnet-4-20250514",
max_tokens: 500,
// Cache the scoring rubric so all items in the batch share it.
// Combines batch discount (50%) with cache discount (90% reads)
// for 75% total savings on input tokens.
system: [{ type: "text", text: scoringPrompt, cache_control: { type: "ephemeral" } }],
messages: [{ role: "user", content: convo.transcript }]
}
}))
});
Running savings: $8,209 - $1,650 = $6,559 remaining.
Plan-and-Execute: Stop Re-Planning
Agents re-plan identical workflows on every call. "Cancel my order" always triggers the same chain: look up order, check cancellation window, process refund. Plan-and-Execute separates planning from execution and caches the result, so the 58% of requests that follow known patterns skip the planning LLM call entirely.
Remember that $13K bill? A steady slice of it, roughly $1,200/month, was the agent re-reasoning through steps it had already solved thousands of times.
async function handleRequest(message: string, context: AgentContext) {
// Step 1: Classify intent on cheapest model (GPT-4.1 Nano, ~$0.0001)
const intent = await classifyIntent(message);
// Step 2: Check plan cache. This is a DB lookup, not an LLM call.
// 58% of requests hit cache, saving 2,000-5,000 planning tokens each.
const cachedPlan = await planCache.find(intent, context.parameters);
if (cachedPlan) {
// Cache hit: skip the planning LLM call entirely.
// Execute pre-validated steps directly, which also reduces errors.
return await executePlan(cachedPlan, context);
}
// Cache miss: generate plan with mid-tier model (Haiku, not Sonnet)
// because plan generation is structured output, not nuanced reasoning
const plan = await generatePlan(intent, message, context);
// Store for future reuse. Key includes intent + parameters
// so "cancel order #123" and "cancel order #456" share the same plan.
await planCache.store(intent, context.parameters, plan);
return await executePlan(plan, context);
}
Why it works
Over 30 days, 58% of requests matched one of 23 plan templates: order status, cancellations, address updates, billing inquiries. Same steps, different parameters. For those 58%, the planning call is eliminated. For the other 42%, Haiku generates the plan instead of Sonnet.
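For plans to be reusable, the cache key has to generalize across parameter values, or "cancel order #123" and "cancel order #456" would never share a plan. One sketch of the normalization (the scheme is our own choice, nothing provider-specific):

```typescript
// Key on intent + the *names* of the parameters, never their values.
// { orderId: "123" } and { orderId: "456" } normalize to the same key,
// so both requests reuse the same cached plan with different inputs.
function planCacheKey(intent: string, parameters: Record<string, unknown>): string {
  const paramShape = Object.keys(parameters).sort().join(",");
  return `${intent}::${paramShape}`;
}
```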
The math
Before Plan-and-Execute:
// Planning cost: ~3,000 tokens per request on Sonnet
Planning: 3,000 tokens x 6 turns x 15,000 convos x $3.00/MTok
= 270M tokens x $3.00/MTok = $810/month
(Plus output tokens for plans: ~$400/month)
Total planning cost: ~$1,210/month
After Plan-and-Execute:
// 58% cache hits: $0 planning cost
// 42% cache misses: plan on Haiku instead of Sonnet
Miss planning: 3,000 x 6 x 6,300 convos x $0.80/MTok = $90.72
Miss output: ~$50/month
Intent classifier: already counted in routing
Total planning cost: ~$141/month (saving ~$1,069)
The savings compound with model routing. Cached plans can execute their individual steps on the cheapest capable model per step.
Running savings: $6,559 - $1,069 = $5,490 remaining.
Context Management: Smaller Windows
By turn 6, you are sending 12,500 tokens of history on every request, and most of it is low-value for the current turn. Three techniques cut history tokens by 40-60%.
1. Sliding window with summary
Keep the last 3 turns verbatim and summarize earlier turns into a compressed context block:
async function buildContext(history: Message[]): Promise<Message[]> {
// Short conversations don't need compression (3 turns = 6 messages)
if (history.length <= 6) return history;
// Summarize older turns into ~200 tokens instead of ~2,500 raw
// This preserves key facts while cutting 90% of history tokens
const oldTurns = history.slice(0, -6);
const summary = await summarize(oldTurns); // Nano model: ~$0.00005 per call
return [
// Inject summary as context so the model retains earlier conversation state
{ role: "system", content: `Previous context: ${summary}` },
...history.slice(-6) // Recent turns stay verbatim for accuracy
];
}
2. Tool schema pruning
Do not send all 8 tool schemas on every turn. After intent classification, send only the 2-3 tools relevant to the current task:
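The selectToolsForIntent helper referenced in this section is little more than a lookup table. A sketch, with an intent-to-tool mapping we chose ourselves and a deliberate fallback for unknown intents:

```typescript
type Tool = { name: string; schema: object };

// Map each intent to the handful of tools its workflows can actually call.
const TOOLS_BY_INTENT: Record<string, string[]> = {
  order_status:     ["orderLookup", "orderModify"],
  billing_question: ["billingCheck", "refundProcess"],
  complaint:        ["orderLookup", "escalate"],
};

function selectToolsForIntent(intent: string, allTools: Tool[]): Tool[] {
  const allowed = TOOLS_BY_INTENT[intent];
  // Unlisted intent: send the full toolset rather than starve the model.
  if (!allowed) return allTools;
  return allTools.filter(t => allowed.includes(t.name));
}
```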
// Before: 2,400 tokens of tool schemas on every turn (8 tools x ~300 tokens each)
const allTools = [orderLookup, orderCancel, orderModify, crmUpdate,
billingCheck, refundProcess, escalate, faqSearch];
// After: ~800 tokens. Intent classification already ran, so we know
// which tools are relevant. No reason to pay for schemas the model won't use.
const relevantTools = selectToolsForIntent(intent, allTools);
// "order_status" -> [orderLookup, orderModify] (2 tools, ~600 tokens)
// "billing_question" -> [billingCheck, refundProcess] (2 tools, ~600 tokens)3. Structured extraction over raw replay
Extract structured data from tool results once instead of replaying raw JSON:
// Before: 800 tokens of raw JSON replayed in history every turn
// { "order": { "id": "ORD-9284", "items": [...], "shipping": {...}, ... } }
// After: 120 tokens. Extract once, reference compactly.
// The model only needs facts, not the full API response structure.
// Order ORD-9284: 2 items, shipped 3/18, arriving 3/21, tracking UPS-1Z999
Back to that $13K bill: context management attacks the same turn-by-turn multiplier that made the original cost surprising.
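The extraction itself can be a tiny formatting step after each tool call. A sketch against a hypothetical order shape (your API's response will differ):

```typescript
type Order = {
  id: string;
  items: { sku: string }[];
  shipping: { shippedDate: string; eta: string; tracking: string };
};

// Collapse an ~800-token API response into a one-line fact the model
// carries through the rest of the conversation.
function summarizeOrder(order: Order): string {
  const { id, items, shipping } = order;
  return `Order ${id}: ${items.length} items, shipped ${shipping.shippedDate}, ` +
         `arriving ${shipping.eta}, tracking ${shipping.tracking}`;
}
```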
Combined context savings
Before context optimization:
Average input at turn 6: 18,900 tokens
After (sliding window + tool pruning + structured extraction):
Average input at turn 6: 9,200 tokens (~51% reduction)
Monthly savings at blended model rate (~$1.20/MTok avg after routing):
Reduction: ~580M fewer tokens/month
Savings: ~$696/month
Running savings: $5,490 - $696 = $4,794 remaining.
The Full Stack: $13K to $1,100
Layering all five techniques reduced our agent from $13,247/month to $1,100/month. A 92% reduction. The techniques compound because each one shrinks the input that the others operate on.
| Technique | Monthly Savings | Cumulative Cost | Reduction |
|---|---|---|---|
| Baseline (unoptimized) | - | $13,247 | - |
| + Prompt caching | -$1,224 | $12,023 | 9% |
| + Model routing | -$3,814 | $8,209 | 38% |
| + Batch processing | -$1,650 | $6,559 | 51% |
| + Plan-and-Execute | -$1,069 | $5,490 | 59% |
| + Context management | -$696 | $4,794 | 64% |
| + All techniques compounding | -$3,694* | ~$1,100 | 92% |
*Routing means cheaper models for caching. Caching means fewer tokens for routing decisions. Context management shrinks payloads everywhere. Plan-and-Execute skips entire LLM calls. The techniques multiply.
Actual production cost after full optimization: $1,100/month for the same 500-conversation/day, 8-tool agent.
Quality held through every change
We tracked these metrics using scorecards and analytics:
- Resolution rate: held at 84% (pre-optimization: 83%)
- Customer satisfaction: 4.2/5.0 (pre: 4.1/5.0. Routing actually improved simple cases)
- Escalation rate: dropped from 16% to 14% (Plan-and-Execute was more consistent)
- Average handle time: 2.1 minutes (pre: 2.3 minutes. Cached plans executed faster)
Quality monitoring is not optional. It is what makes optimization safe. Without real-time analytics on resolution rates and satisfaction scores, you are flying blind.
Implementation Checklist
Start with caching (week 1), then context management, then routing, then batching. Each step builds on a stable foundation before adding model changes.
Week 1: Prompt Caching (highest ROI, lowest risk)
- Enable cache_control on system prompts (Anthropic) or verify automatic caching (OpenAI)
- Cache tool schemas alongside system prompt
- Monitor cache hit rates in your analytics dashboard. Target 85%+
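A hit-rate check can run on the usage block each response already returns. The field names below match Anthropic's Messages API usage object as we understand it; verify against the current docs before relying on them:

```typescript
type Usage = {
  input_tokens: number;                  // uncached input
  cache_creation_input_tokens?: number;  // tokens written to cache
  cache_read_input_tokens?: number;      // tokens served from cache
};

// Fraction of cacheable input served from cache. Target 85%+ once warm.
function cacheHitRate(calls: Usage[]): number {
  let reads = 0, writes = 0;
  for (const u of calls) {
    reads += u.cache_read_input_tokens ?? 0;
    writes += u.cache_creation_input_tokens ?? 0;
  }
  const cacheable = reads + writes;
  return cacheable === 0 ? 0 : reads / cacheable;
}
```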
Week 2: Context Management (no model changes)
- Implement sliding window summarization for conversations over 4 turns
- Prune tool schemas per intent (requires intent classification)
- Switch raw tool results to structured extraction
Week 3: Model Routing (requires testing)
- Build intent classifier on cheapest model (GPT-4.1 Nano or Gemini Flash Lite)
- Define tier boundaries from historical conversation analysis
- Shadow-route for 1 week: route silently, compare quality scores between tiers
- Deploy with automatic upgrade triggers (complexity score threshold)
Week 4: Batch Processing + Plan-and-Execute
- Move summarization, quality scoring, and analytics to Batch API
- Implement plan cache with top 20 intent templates
- Set cache TTL based on workflow change frequency
What Comes Next
Token prices drop every quarter. GPT-4o costs 92% less than GPT-4 did at launch. These optimization techniques keep working as prices fall because they are multiplicative, not additive.
The real shift is architectural. Teams that build tool-equipped agents with routing, caching, and plan reuse from day one never see a $13K bill. They start at $1,100 and scale to 5,000 conversations/day for $8,000 instead of $130K.
Start with caching this week. Route by next week. Your CFO will notice.
Monitor Your Agent's Token Economics
Chanl tracks cost per conversation, token breakdown by component, and quality scores alongside spend. Optimize with data, not guesses.