The voice agent you just shipped has a response latency problem. The LLM is fast. Your STT pipeline is fast. But customers hear a pause before every response, and you can't figure out why.
Open a production trace and look at the tool call timeline. Your agent calls get_customer_profile. Waits 120ms. Then it calls get_order_history. Waits 95ms. Then search_knowledge_base. Waits 180ms. Only then does it start generating the response. That is 395ms of pure waiting before a single word gets spoken.
The LLM isn't the bottleneck. Sequential tool calls are.
The Hidden Cost of the ReAct Loop
The ReAct pattern (reason, act, observe, repeat) is the foundation of most production agents. It works well. It's debuggable, composable, and handles complex multi-step tasks. The catch is that it's serial by design. Each reasoning step completes before the next tool call starts, and each tool call completes before the next reasoning step starts.
For a typical CX support conversation, your agent might run through this sequence for a billing question:
- LLM reasoning: "I need to look up this customer." Call get_customer_profile.
- Wait 120ms for the CRM to respond.
- LLM reasoning: "I need their billing history." Call get_invoice_history.
- Wait 150ms for the billing system.
- LLM reasoning: "Let me check our refund policy." Call search_knowledge_base.
- Wait 180ms for vector retrieval.
- LLM generates response.
Total tool wait: 450ms. On a voice channel where the acceptable response window before customers notice dead air is under 500ms, that leaves almost no headroom for actual LLM inference.
The problem isn't that any single call is slow. 120ms for a CRM lookup is reasonable. The problem is they're sequential when they don't need to be.
The speculative path turns 450ms of serial waiting into roughly the time of the slowest single call: 180ms in this example, before any scheduling overhead.
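The arithmetic behind that comparison can be sketched in a few lines. This is a minimal illustration using the latencies from the billing example; the function names are made up for this sketch, and the overlapped figure is a lower bound that ignores scheduling and prediction overhead.

```typescript
// Tool latencies from the billing example, in milliseconds.
const toolLatenciesMs: Record<string, number> = {
  get_customer_profile: 120,
  get_invoice_history: 150,
  search_knowledge_base: 180,
};

// Serial execution waits for every call in turn, so the waits add up.
function serialWaitMs(latencies: number[]): number {
  return latencies.reduce((total, ms) => total + ms, 0);
}

// Fully overlapped execution waits only for the slowest call.
function overlappedWaitMs(latencies: number[]): number {
  return Math.max(0, ...latencies);
}

const exampleLatencies = Object.values(toolLatenciesMs);
console.log(serialWaitMs(exampleLatencies)); // 450
console.log(overlappedWaitMs(exampleLatencies)); // 180
```

The gap between the two numbers is the most speculation can recover. Calls that genuinely depend on each other stay serial and shrink that gap.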
How CPUs Solved This Problem 40 Years Ago
Modern processors don't execute instructions in strict sequence. They look ahead in the instruction stream, predict which branches are likely, and execute those branches speculatively in parallel with the current instruction. If the prediction is correct, the result is already computed when needed. If it's wrong, the processor discards the speculative work and runs the correct path.
This is branch prediction, and it's why your laptop doesn't crawl through code one instruction at a time.
The same insight applies to AI agents. While your LLM is reasoning about whether to call get_order_history, you can predict (with high accuracy) that it's about to make that call and start executing it now. By the time the LLM confirms the decision, the result is ready.
A March 2026 arXiv paper formalized this as PASTE: Pattern-Aware Speculative Tool Execution. The key results: 48.5% reduction in average task completion time and 1.8x improvement in tool execution throughput. Not by changing the model, improving the prompts, or optimizing individual tool latency. Just by running calls that were going to happen anyway slightly earlier.
The Two Properties That Make Speculation Work
PASTE's central insight is that agent workloads are far more predictable than they appear. Across thousands of production conversations, two properties hold consistently.
Stable application-level control flows. While user queries vary in content, the tool sequences they trigger are remarkably stable. Billing dispute conversations almost always follow the same tool sequence: customer lookup, then billing history, then policy search. Order modification conversations follow a different but equally stable sequence. The content varies. The structure doesn't.
Predictable data dependencies. Many sequential tool calls in production don't actually depend on each other's output. Your agent calls get_customer_profile and then get_order_history, but the order history lookup uses a customerId that came from the user's message directly, not from the profile result. If there's no data dependency, you can start the second call before the first returns.
These two properties together create the opportunity. Stable sequences give you the prediction. Independence gives you the safety to act on it.
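One way to make the independence property concrete is to group planned calls into "waves" that can run at the same time. The sketch below is illustrative only: the PlannedCall shape and planWaves function are hypothetical names, and it assumes you can declare up front which tool outputs each call needs.

```typescript
// Hypothetical description of a call the agent is expected to make.
interface PlannedCall {
  tool: string;
  needsOutputOf: string[]; // tools whose results feed this call's arguments
}

// Group calls into waves: every call in a wave has all of its inputs
// available, so the calls inside one wave can be started together.
function planWaves(calls: PlannedCall[]): string[][] {
  const waves: string[][] = [];
  const done = new Set<string>();
  let remaining = [...calls];
  while (remaining.length > 0) {
    const ready = remaining.filter((call) =>
      call.needsOutputOf.every((dependency) => done.has(dependency))
    );
    if (ready.length === 0) {
      throw new Error("Circular or unresolved dependency between tool calls");
    }
    waves.push(ready.map((call) => call.tool));
    ready.forEach((call) => done.add(call.tool));
    remaining = remaining.filter((call) => !done.has(call.tool));
  }
  return waves;
}

const exampleWaves = planWaves([
  { tool: "get_customer_profile", needsOutputOf: [] },
  { tool: "get_order_history", needsOutputOf: [] },
  { tool: "get_order_status", needsOutputOf: [] },
  { tool: "get_shipment_tracking", needsOutputOf: ["get_order_status"] },
]);
console.log(exampleWaves);
```

Here the first three calls land in one wave because none needs another's output, while get_shipment_tracking has to wait for the order ID. Only calls that share a wave with the current call are candidates for speculation.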
Building a Speculative Executor
The implementation has three pieces: a pattern table that maps current tools to likely next tools, a speculative executor that fires pre-emptive calls, and a cache that returns results when the LLM confirms them.
Start with the pattern table:
interface ToolPattern {
  likelihood: number; // 0-1: frequency of this sequence in production
  dependsOnResult: boolean; // does the next call need this call's output?
}
// Built from production tool call logs; rebuild weekly
const TOOL_PATTERNS: Record<string, Record<string, ToolPattern>> = {
  get_customer_profile: {
    get_invoice_history: { likelihood: 0.81, dependsOnResult: false },
    get_account_status: { likelihood: 0.68, dependsOnResult: false },
    search_knowledge_base: { likelihood: 0.44, dependsOnResult: false },
  },
  get_order_status: {
    get_shipment_tracking: { likelihood: 0.74, dependsOnResult: true }, // needs order ID
    initiate_return: { likelihood: 0.39, dependsOnResult: true },
  },
  search_knowledge_base: {
    get_policy_details: { likelihood: 0.61, dependsOnResult: true },
  },
};
The dependsOnResult: true flag is critical. When the next call needs the current call's output (an order ID, a case number, an account reference) you cannot speculate safely. Skip those pairs.
Now the executor:
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";
export class SpeculativeExecutor {
  private cache: Map<string, Promise<unknown>> = new Map();
  private budget: number;
  constructor(
    private client: Client, // the MCP TypeScript SDK exports its client class as Client
    budget = 3
  ) {
    this.budget = budget;
  }
  async callTool(name: string, args: Record<string, unknown>) {
    const key = this.cacheKey(name, args);
    // Cache hit: speculation was correct, result already in flight or ready
    const cached = this.cache.get(key);
    if (cached) {
      this.cache.delete(key); // free the budget slot even if the call fails
      return await cached;
    }
    // Cache miss: run normally, but start speculating on what comes next
    const resultPromise = this.client.callTool({ name, arguments: args });
    this.speculateNext(name, args);
    return await resultPromise;
  }
  // Drop unused speculations (for example at the end of each turn) so wrong
  // guesses do not keep occupying the budget
  clearSpeculations() {
    this.cache.clear();
  }
  private speculateNext(current: string, args: Record<string, unknown>) {
    const patterns = TOOL_PATTERNS[current];
    if (!patterns) return;
    for (const [nextTool, pattern] of Object.entries(patterns)) {
      if (pattern.dependsOnResult) continue; // dependency: skip
      if (pattern.likelihood < 0.5) continue; // too uncertain: skip
      if (this.cache.size >= this.budget) break; // budget exceeded: stop
      if (!SPECULATION_SAFE.has(nextTool)) continue; // mutations: skip
      const nextArgs = this.inferArgs(nextTool, current, args);
      if (!nextArgs) continue;
      const key = this.cacheKey(nextTool, nextArgs);
      if (!this.cache.has(key)) {
        const speculative = this.client.callTool({ name: nextTool, arguments: nextArgs });
        // A failed guess that is never used must not surface as an unhandled rejection
        speculative.catch(() => {});
        this.cache.set(key, speculative);
      }
    }
  }
  private inferArgs(
    nextTool: string,
    previousTool: string,
    previousArgs: Record<string, unknown>
  ): Record<string, unknown> | null {
    // Pull arguments for the next call from what we already know
    if (nextTool === "get_invoice_history" && previousTool === "get_customer_profile") {
      return previousArgs.customerId ? { customerId: previousArgs.customerId } : null;
    }
    if (nextTool === "get_account_status" && previousTool === "get_customer_profile") {
      return previousArgs.customerId ? { accountId: previousArgs.customerId } : null;
    }
    // Add your domain-specific inference rules here
    return null;
  }
  private cacheKey(tool: string, args: Record<string, unknown>): string {
    // JSON.stringify is sensitive to key order, so build args consistently
    return `${tool}:${JSON.stringify(args)}`;
  }
}
// Only idempotent reads are safe to speculate on
const SPECULATION_SAFE = new Set([
  "get_customer_profile",
  "get_invoice_history",
  "get_account_status",
  "get_order_status",
  "get_shipment_tracking",
  "search_knowledge_base",
  "get_policy_details",
  "get_product_info",
  // Mutations are explicitly excluded:
  // send_email, process_refund, update_account, create_ticket, etc.
]);
When the LLM confirms a call your executor already started, callTool hits the cache and returns the in-flight promise. If the promise is done, you get the result immediately. If it's still running, you wait only for the remainder rather than the full call from scratch. Either way, you've saved time.
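If the LLM calls a different tool than predicted, the speculative result simply goes unused, and its cache entry should be evicted (at the end of the turn or after a short timeout) so it stops counting against the budget. The wrong speculation costs you one API call. The actual call runs normally. No user-facing impact.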
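Expiry has to be implemented somewhere, because a plain Map never forgets an entry on its own. The following is a minimal sketch of one option, a small cache with a time-to-live. The SpeculationCache class, its ttlMs parameter, and the injectable clock are all assumptions made for illustration, not part of any library.

```typescript
interface SpeculativeEntry<T> {
  promise: Promise<T>;
  startedAtMs: number;
}

// Hypothetical cache that forgets speculative calls older than ttlMs.
class SpeculationCache<T> {
  private entries = new Map<string, SpeculativeEntry<T>>();
  private ttlMs: number;
  private now: () => number;

  // The clock is injectable so the eviction logic can be tested without waiting.
  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, promise: Promise<T>): void {
    // Attach a no-op handler so an unused, failed speculation never becomes
    // an unhandled rejection. Callers who take the promise still see the error.
    promise.catch(() => undefined);
    this.entries.set(key, { promise, startedAtMs: this.now() });
  }

  // Return the in-flight or finished call and release its slot.
  take(key: string): Promise<T> | undefined {
    this.evictExpired();
    const entry = this.entries.get(key);
    if (entry) this.entries.delete(key);
    return entry?.promise;
  }

  // Number of live speculations, which is what the budget should count.
  get size(): number {
    this.evictExpired();
    return this.entries.size;
  }

  private evictExpired(): void {
    const cutoff = this.now() - this.ttlMs;
    this.entries.forEach((entry, key) => {
      if (entry.startedAtMs < cutoff) this.entries.delete(key);
    });
  }
}
```

A TTL of a few seconds is usually longer than any single turn, so correct guesses are still found while wrong ones stop consuming budget shortly afterwards. Clearing the cache at the end of each turn is a simpler alternative if your agent loop has a clear turn boundary.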
Building the Pattern Table from Your Production Logs
The pattern table only works as well as it reflects your actual traffic. You build it by analyzing the tool call sequences in your production logs:
interface ToolCallLog {
  sessionId: string;
  sequence: Array<{ tool: string; args: Record<string, unknown> }>;
}
// ToolPattern is the interface defined alongside the pattern table.
// checkDataDependency(tool, nextTool) is not defined in this article: supply
// your own check, and return true (dependent) whenever you are unsure.
export function buildPatternTable(
  logs: ToolCallLog[],
  minLikelihood = 0.4
): Record<string, Record<string, ToolPattern>> {
  const pairCounts: Record<string, Record<string, number>> = {};
  // How many times each tool was followed by another call
  const toolCounts: Record<string, number> = {};
  for (const log of logs) {
    for (let i = 0; i < log.sequence.length - 1; i++) {
      const current = log.sequence[i].tool;
      const next = log.sequence[i + 1].tool;
      pairCounts[current] ??= {};
      pairCounts[current][next] = (pairCounts[current][next] ?? 0) + 1;
      toolCounts[current] = (toolCounts[current] ?? 0) + 1;
    }
  }
  const table: Record<string, Record<string, ToolPattern>> = {};
  for (const [tool, nextTools] of Object.entries(pairCounts)) {
    table[tool] = {};
    for (const [nextTool, count] of Object.entries(nextTools)) {
      // Share of follow-up calls after `tool` that went to `nextTool`
      const likelihood = count / toolCounts[tool];
      if (likelihood >= minLikelihood) {
        table[tool][nextTool] = {
          likelihood,
          dependsOnResult: checkDataDependency(tool, nextTool),
        };
      }
    }
  }
  return table;
}
Pull this from a rolling 30-day window of production logs and rebuild weekly. Tool usage patterns shift as your agent's conversation mix changes, as you update prompts, and as you add or remove tools. Stale patterns produce more wrong speculations and chip away at the latency savings.
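The listing above calls checkDataDependency without defining it. One possible way to derive such a check from logs is sketched below. It assumes your traces also record each tool's result, which the ToolCallLog shape does not include; CallSample, pairDependsOnResult, and buildDependencyChecker are hypothetical names, and the value-matching heuristic is deliberately conservative rather than exact.

```typescript
// Hypothetical log shape: assumes traces capture each tool's result as well.
interface CallSample {
  tool: string;
  args: Record<string, unknown>;
  result: unknown;
}

// Gather every primitive value nested anywhere inside a JSON-like value.
function collectPrimitives(value: unknown, out: Set<string> = new Set<string>()): Set<string> {
  if (value === null || value === undefined) return out;
  if (typeof value === "object") {
    for (const inner of Object.values(value as Record<string, unknown>)) {
      collectPrimitives(inner, out);
    }
  } else {
    out.add(String(value));
  }
  return out;
}

// A pair looks dependent when the next call uses a value that the previous
// call returned and that was not already present in the previous call's args.
function pairDependsOnResult(previous: CallSample, next: CallSample): boolean {
  const alreadyKnown = collectPrimitives(previous.args);
  const produced = collectPrimitives(previous.result);
  return Array.from(collectPrimitives(next.args)).some(
    (value) => !alreadyKnown.has(value) && produced.has(value)
  );
}

function buildDependencyChecker(
  sessions: CallSample[][]
): (tool: string, nextTool: string) => boolean {
  const seen = new Set<string>();
  const dependent = new Set<string>();
  for (const session of sessions) {
    for (let i = 0; i < session.length - 1; i++) {
      const key = `${session[i].tool}->${session[i + 1].tool}`;
      seen.add(key);
      if (pairDependsOnResult(session[i], session[i + 1])) dependent.add(key);
    }
  }
  // Pairs never observed are treated as dependent: skipping a speculation
  // is cheaper than firing a call with the wrong arguments.
  return (tool, nextTool) => {
    const key = `${tool}->${nextTool}`;
    return !seen.has(key) || dependent.has(key);
  };
}
```

The function returned by buildDependencyChecker has the same (tool, nextTool) signature the pattern builder expects. Coincidental value matches can flag an independent pair as dependent, which costs a missed speculation but never an unsafe one, so reviewing flagged pairs by hand is still worthwhile.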
Why CX Tool Calls Are Particularly Predictable
General-purpose agents have diverse workloads. One request is a coding task, the next is a document summary, the next is a data lookup. Tool sequences vary enormously.
CX agents don't have this problem. A support agent handles billing disputes, order questions, account changes, and product inquiries. These are stable categories. Within each category, the tool sequences are highly consistent because the information the agent needs to resolve the issue is the same every time.
A billing dispute almost always needs: customer identity, billing history, current policy, relevant case notes. The sequence might vary slightly, but the same four or five tools appear in essentially every billing conversation.
This is exactly the workload PASTE was designed for. High sequence regularity means high prediction accuracy. High prediction accuracy means your speculation budget gets used efficiently rather than wasted on wrong guesses.
The other factor is CX latency tolerance. For an agent that composes long-form documents, a 500ms response time is invisible. For a voice agent where a customer expects a spoken reply in under a second, those 500ms are the entire response window. Speculation's benefit is proportionally larger when the latency budget is tighter.
Connecting to Your Production Setup
To wire the speculative executor into your live system, you need three inputs: tool call sequences from production traffic, an MCP client for execution, and per-tool latency data to know where speculation pays off.
import { SpeculativeExecutor } from "./speculative-executor";
import { buildPatternTable } from "./pattern-builder";
// loadProductionToolSequences, thirtyDaysAgo, now, mcpClient, and customerId
// are placeholders for your own log loader, date helpers, client, and context
// Pull tool call sequences from your call logs (any analytics source works)
const toolLogs = await loadProductionToolSequences({
  agentId: "cx-support-agent",
  dateRange: { start: thirtyDaysAgo(), end: now() },
  limit: 10_000,
});
// Build the pattern table from real traffic
const patterns = buildPatternTable(toolLogs, 0.45);
// The executor shown earlier reads the module-level TOOL_PATTERNS, so this
// table only takes effect once you use it to replace that constant
// Wrap your MCP client with speculative execution
const executor = new SpeculativeExecutor(mcpClient, 3);
// Use executor.callTool() wherever you currently call tools directly
const profile = await executor.callTool("get_customer_profile", { customerId });
// By this point, get_invoice_history is already in flight if the pattern
// table says it's likely needed
const invoices = await executor.callTool("get_invoice_history", { customerId });
// Cache hit: the call is already in flight, so you pay only the remaining wait.
The raw material for a strong pattern table is tool co-occurrence data: which tools appear together, in what sequence, how often. Most agent platforms expose this either through call logs or session-level traces. If your platform's MCP integration tracks per-call timing, you already have what you need.
Per-tool latency distributions tell you where speculation pays off. Tools with consistent sub-50ms latency don't benefit much from speculation (there isn't much time to save). Tools with 100 to 300ms median latency, such as external CRM lookups, billing API calls, and vector knowledge base searches, are where speculation pays off most.
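A rough filter for that decision might look like the following. It is only a sketch: the 50ms cutoff comes from the paragraph above, while the function names and the shape of the latency samples are assumptions for illustration.

```typescript
// Median of a list of latency samples, in milliseconds.
function median(values: number[]): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Keep only tools whose median latency is high enough for speculation to
// save a noticeable amount of time.
function worthSpeculating(
  latencySamplesMs: Record<string, number[]>,
  minMedianMs = 50
): string[] {
  return Object.entries(latencySamplesMs)
    .filter(([, samples]) => samples.length > 0 && median(samples) >= minMedianMs)
    .map(([tool]) => tool);
}

const speculationCandidates = worthSpeculating({
  get_product_info: [20, 30, 25],
  get_customer_profile: [110, 120, 140],
  search_knowledge_base: [170, 180, 200, 190],
});
console.log(speculationCandidates);
// [ 'get_customer_profile', 'search_knowledge_base' ]
```

Intersecting this list with the read-only allowlist and the pattern table gives the set of tools that are both safe and worthwhile to pre-fetch.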
Before pushing speculation to production, run your test scenarios and compare timing. You'll see which conversation types benefit most and which don't have predictable enough sequences to warrant speculation.
Five Metrics That Tell You Whether It's Working
Speculation adds complexity. These five numbers tell you whether the complexity is paying off.
Speculation hit rate. The percentage of speculative calls the agent actually uses. Below 50% means your pattern table needs work: either the likelihood thresholds are too low or the inferred arguments are wrong.
Speculation waste rate. The percentage of speculative calls that never get used. High waste means you're spending API credits on guesses that miss. Adjust by raising your likelihood threshold or tightening your speculation budget.
P50 and P95 tool wait time. Before and after speculation, measure how long your agent spends blocked on tool results. A lower P50 confirms the typical-case win. A lower P95 confirms that tail latency improved too. Both should drop.
End-to-end turn latency. The time from customer message to agent first token. This is the metric customers feel in a voice conversation. Speculation should move this meaningfully: if your tools are the bottleneck, you should see a 30 to 50% improvement.
Wrong speculation rate by tool pair. Track which specific pairs produce the most wrong predictions. If get_customer_profile -> search_knowledge_base predicts wrong 70% of the time, remove it. Pairs with low hit rates cost more than they save.
Per-call timing at the session level, including whether each call was served from speculation cache or ran fresh, is straightforward to capture if your platform's monitoring already records tool execution events. That gives you hit rate and waste rate alongside standard latency metrics without separate instrumentation.
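As an illustration of how these numbers could be derived from such events, here is a minimal sketch. The SpeculationEvent shape and both function names are assumptions, and the percentile uses the simple nearest-rank method rather than interpolation.

```typescript
// Hypothetical record emitted once per speculative call.
interface SpeculationEvent {
  pair: string; // e.g. "get_customer_profile->get_invoice_history"
  used: boolean; // did the agent later request this exact call?
}

function speculationStats(events: SpeculationEvent[]) {
  const perPair: Record<string, { total: number; misses: number }> = {};
  let hits = 0;
  for (const speculation of events) {
    const counts = (perPair[speculation.pair] ??= { total: 0, misses: 0 });
    counts.total += 1;
    if (speculation.used) hits += 1;
    else counts.misses += 1;
  }
  const missRateByPair: Record<string, number> = {};
  for (const [pair, counts] of Object.entries(perPair)) {
    missRateByPair[pair] = counts.misses / counts.total;
  }
  const total = events.length;
  return {
    hitRate: total === 0 ? 0 : hits / total,
    wasteRate: total === 0 ? 0 : (total - hits) / total,
    missRateByPair,
  };
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are at or below it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) return NaN;
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[Math.min(rank, sorted.length) - 1];
}
```

Running speculationStats over a day of events and percentile over the tool wait times before and after the rollout yields the hit rate, waste rate, per-pair miss rate, and P50/P95 figures listed above.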
The Broader Principle
Speculative tool execution is one instance of a consistent pattern across computing: don't wait for sequential confirmation when you can predict and parallelize.
CPUs speculate on branch outcomes. LLM runtimes speculate on next tokens (speculative decoding). Now AI agents can speculate on next tool calls. The math is the same in all three cases: prediction cost is low, execution cost is high, and sequential waiting is often unnecessary.
Your production logs already contain the evidence of which tool calls are predictable. The pattern table extracts that evidence. The executor acts on it. What's left is building the safety filter (reads only, never writes), setting a conservative budget (start at 2, tune from there), and measuring whether the numbers move in the right direction.
For more on managing tool call latency in voice CX agents, see Voice Agent Platform Architecture: Sub-300ms Responses for the full latency budget breakdown, and Stop Loading All Your MCP Tools at Once for the complementary technique of reducing which tools are loaded at startup. Speculation works best when the tool set is already lean.