What is speculative tool execution in AI agents?

Speculative tool execution is a pattern where your agent pre-executes tool calls it predicts will be needed, before the LLM explicitly decides to make them. Instead of waiting for the current reasoning step to finish, the executor fires likely next calls in parallel and caches the results. When the LLM confirms the call, the result is already waiting. The PASTE paper (March 2026) demonstrated a 48.5% latency reduction using this approach on production agent workloads.

How do you build the pattern table for speculation?

Mine your production tool call logs to find which tools commonly follow which other tools. For each pair A then B, calculate what percentage of A calls are followed by B calls. If it exceeds your threshold (typically 40 to 60%), add the pair to your pattern table with its likelihood score. Rebuild the table weekly since patterns shift as your agent handles different conversation types and as you update your prompts.

What tools are safe to speculate on?

Only idempotent read operations should be speculated. Database reads, API lookups, search queries, and knowledge base retrievals are all safe. Never speculate on mutations: anything that sends a message, processes a payment, updates a record, or triggers a workflow. A wrong speculative read wastes API credits. A wrong speculative write causes real customer-facing damage. Maintain an explicit allowlist of safe tools rather than trying to infer safety from tool names.

What happens when a speculation is wrong?

The pending promise is abandoned and the actual tool call runs normally. Wrong speculations waste compute and API credits but don't cause errors or incorrect behavior. The key is keeping your speculation budget low (2 to 4 calls in flight at once) and your likelihood threshold high enough to maintain a hit rate above 50%. Track your waste rate: if more than half your speculative calls get abandoned, your pattern table needs recalibration.

How much latency does speculative tool execution actually save?

On production workloads, the PASTE research shows 48.5% reduction in average task completion time and 1.8x improvement in tool throughput. Results depend on your specific workload: how many tool calls per turn, which tools carry the highest latency, and how predictable your call sequences are. CX agents with stable conversation flows see the largest gains. A 900ms sequential response can become roughly 460ms with speculation.

Does speculative execution work for voice AI agents?

Yes, and voice is where it matters most. Voice agents need to start speaking within 500ms or users hear an uncomfortable pause. If your agent makes three serial tool calls at 100 to 150ms each, that is 300 to 450ms before it can even start generating a response. Speculation runs those calls in parallel, cutting the wait to the time of the slowest single call rather than the sum of all calls.

How do you handle tool calls with data dependencies?

If a tool genuinely needs the previous tool's output, you cannot speculate safely. Mark these with dependsOnResult: true in your pattern table and the executor will skip them. For partially dependent calls, you can sometimes speculate using conversation context parameters that are already available. For example, a customer ID from the user's message can seed a profile lookup even before an account number lookup returns.

Is speculative tool calling useful for lower-traffic systems?

Yes. The latency benefit applies per-request regardless of traffic volume, since speculation reduces individual response time rather than throughput. Even a staging or development agent benefits during testing. That said, the pattern table quality improves with more production data, so teams with higher traffic will have better prediction accuracy and higher hit rates.

Pre-Execute Tool Calls to Cut Agent Latency 48%

The voice agent you just shipped has a response latency problem. The LLM is fast. Your STT pipeline is fast. But customers hear a pause before every response, and you can't figure out why.

Open a production trace and look at the tool call timeline. Your agent calls get_customer_profile. Waits 120ms. Then it calls get_order_history. Waits 95ms. Then search_knowledge_base. Waits 180ms. Only then does it start generating the response. That is 395ms of pure waiting before a single word gets spoken.

The LLM isn't the bottleneck. Sequential tool calls are.

The Hidden Cost of the ReAct Loop

The ReAct pattern (reason, act, observe, repeat) is the foundation of most production agents. It works well. It's debuggable, composable, and handles complex multi-step tasks. The catch is that it's serial by design. Each reasoning step completes before the next tool call starts, and each tool call completes before the next reasoning step starts.

For a typical CX support conversation, your agent might run through this sequence for a billing question:

LLM reasoning: "I need to look up this customer." Call get_customer_profile.
Wait 120ms for the CRM to respond.
LLM reasoning: "I need their billing history." Call get_invoice_history.
Wait 150ms for the billing system.
LLM reasoning: "Let me check our refund policy." Call search_knowledge_base.
Wait 180ms for vector retrieval.
LLM generates response.

Total tool wait: 450ms. On a voice channel where the acceptable response window before customers notice dead air is under 500ms, that leaves almost no headroom for actual LLM inference.

The problem isn't that any single call is slow. 120ms for a CRM lookup is reasonable. The problem is they're sequential when they don't need to be.
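To make the arithmetic concrete, here is a minimal sketch that compares the two waiting models using the latencies from the billing example above. It assumes the best case: all three calls are independent and start at the same moment, and LLM reasoning time is ignored. The function names are illustrative only.

```typescript
// Latencies from the billing example above (milliseconds).
const toolLatenciesMs: Record<string, number> = {
  get_customer_profile: 120,
  get_invoice_history: 150,
  search_knowledge_base: 180,
};

// Serial execution waits for every call in turn, so the waits add up.
function serialWaitMs(latencies: number[]): number {
  return latencies.reduce((total, ms) => total + ms, 0);
}

// If every call is independent and starts together, the wait is the slowest call.
function parallelWaitMs(latencies: number[]): number {
  return latencies.length === 0 ? 0 : Math.max(...latencies);
}

const latencies = Object.values(toolLatenciesMs);
console.log(serialWaitMs(latencies));   // 450
console.log(parallelWaitMs(latencies)); // 180
```

Real traces land somewhere between the two numbers, because not every call can be predicted or started early.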

Sequential vs speculative tool execution for a billing query with three tool calls

The speculative path turns 450ms of serial waiting into roughly 180ms, the time of the slowest single call.

How CPUs Solved This Problem 40 Years Ago

Modern processors don't execute instructions in strict sequence. They look ahead in the instruction stream, predict which branches are likely, and execute those branches speculatively in parallel with the current instruction. If the prediction is correct, the result is already computed when needed. If it's wrong, the processor discards the speculative work and runs the correct path.

This is branch prediction with speculative execution, and it's why your laptop doesn't crawl through code one instruction at a time.

The same insight applies to AI agents. While your LLM is reasoning about whether to call get_order_history, you can predict (with high accuracy) that it's about to make that call and start executing it now. By the time the LLM confirms the decision, the result is ready.

A March 2026 arXiv paper formalized this as PASTE: Pattern-Aware Speculative Tool Execution. The key results: 48.5% reduction in average task completion time and 1.8x improvement in tool execution throughput. Not by changing the model, improving the prompts, or optimizing individual tool latency. Just by running calls that were going to happen anyway slightly earlier.

The Two Properties That Make Speculation Work

PASTE's central insight is that agent workloads are far more predictable than they appear. Across thousands of production conversations, two properties hold consistently.

Stable application-level control flows. While user queries vary in content, the tool sequences they trigger are remarkably stable. Billing dispute conversations almost always follow the same tool sequence: customer lookup, then billing history, then policy search. Order modification conversations follow a different but equally stable sequence. The content varies. The structure doesn't.

Predictable data dependencies. Many sequential tool calls in production don't actually depend on each other's output. Your agent calls get_customer_profile and then get_order_history, but the order history lookup uses a customerId that came from the user's message directly, not from the profile result. If there's no data dependency, you can start the second call before the first returns.

These two properties together create the opportunity. Stable sequences give you the prediction. Independence gives you the safety to act on it.

Building a Speculative Executor

The implementation has three pieces: a pattern table that maps current tools to likely next tools, a speculative executor that fires pre-emptive calls, and a cache that returns results when the LLM confirms them.

Start with the pattern table:

speculative-executor/pattern-table.ts·typescript

// Exported so the executor and the pattern builder can share these definitions
export interface ToolPattern {
  likelihood: number;       // 0-1: frequency of this sequence in production
  dependsOnResult: boolean; // does the next call need this call's output?
}

// Built from production tool call logs; rebuild weekly
export const TOOL_PATTERNS: Record<string, Record<string, ToolPattern>> = {
  get_customer_profile: {
    get_invoice_history:   { likelihood: 0.81, dependsOnResult: false },
    get_account_status:    { likelihood: 0.68, dependsOnResult: false },
    search_knowledge_base: { likelihood: 0.44, dependsOnResult: false },
  },
  get_order_status: {
    get_shipment_tracking: { likelihood: 0.74, dependsOnResult: true }, // needs order ID
    initiate_return:       { likelihood: 0.39, dependsOnResult: true },
  },
  search_knowledge_base: {
    get_policy_details: { likelihood: 0.61, dependsOnResult: true },
  },
};

The dependsOnResult: true flag is critical. When the next call needs the current call's output (an order ID, a case number, an account reference) you cannot speculate safely. Skip those pairs.

Now the executor:

speculative-executor/index.ts·typescript

// The MCP TypeScript SDK exports its client class as `Client`
import type { Client } from "@modelcontextprotocol/sdk/client/index.js";
// TOOL_PATTERNS is the table from pattern-table.ts (it must be exported there)
import { TOOL_PATTERNS } from "./pattern-table";

// Only idempotent reads are safe to speculate on
const SPECULATION_SAFE = new Set([
  "get_customer_profile",
  "get_invoice_history",
  "get_account_status",
  "get_order_status",
  "get_shipment_tracking",
  "search_knowledge_base",
  "get_policy_details",
  "get_product_info",
  // Mutations are explicitly excluded:
  // send_email, process_refund, update_account, create_ticket, etc.
]);

interface CacheEntry {
  promise: Promise<unknown>;
  startedAt: number; // Date.now() when the speculative call was fired
}

export class SpeculativeExecutor {
  private cache = new Map<string, CacheEntry>();

  constructor(
    private client: Client,
    private budget = 3,
    private ttlMs = 5_000 // assumed default: drop unconfirmed speculations after this long
  ) {}

  async callTool(name: string, args: Record<string, unknown>) {
    // Abandoned speculations must not occupy the budget forever
    this.evictExpired();
    const key = this.cacheKey(name, args);

    // Cache hit: speculation was correct, result already in flight or ready
    const entry = this.cache.get(key);
    if (entry) {
      this.cache.delete(key);
      return await entry.promise;
    }

    // Cache miss: run normally, but start speculating on what comes next
    const resultPromise = this.client.callTool({ name, arguments: args });
    this.speculateNext(name, args);
    return await resultPromise;
  }

  private speculateNext(current: string, args: Record<string, unknown>) {
    const patterns = TOOL_PATTERNS[current];
    if (!patterns) return;

    for (const [nextTool, pattern] of Object.entries(patterns)) {
      if (pattern.dependsOnResult) continue;         // dependency: skip
      if (pattern.likelihood < 0.5) continue;        // too uncertain: skip
      if (this.cache.size >= this.budget) break;     // budget exceeded: stop
      if (!SPECULATION_SAFE.has(nextTool)) continue; // mutations: skip

      const nextArgs = this.inferArgs(nextTool, current, args);
      if (!nextArgs) continue;

      const key = this.cacheKey(nextTool, nextArgs);
      if (this.cache.has(key)) continue;

      const promise = this.client.callTool({ name: nextTool, arguments: nextArgs });
      // A failed speculation that nobody awaits must not surface as an
      // unhandled rejection. The cached promise still rejects if callTool()
      // later awaits it.
      promise.catch(() => {});
      this.cache.set(key, { promise, startedAt: Date.now() });
    }
  }

  private evictExpired() {
    const now = Date.now();
    for (const [key, entry] of this.cache) {
      if (now - entry.startedAt > this.ttlMs) this.cache.delete(key);
    }
  }

  private inferArgs(
    nextTool: string,
    previousTool: string,
    previousArgs: Record<string, unknown>
  ): Record<string, unknown> | null {
    // Pull arguments for the next call from what we already know
    if (nextTool === "get_invoice_history" && previousTool === "get_customer_profile") {
      return previousArgs.customerId ? { customerId: previousArgs.customerId } : null;
    }
    if (nextTool === "get_account_status" && previousTool === "get_customer_profile") {
      return previousArgs.customerId ? { accountId: previousArgs.customerId } : null;
    }
    // Add your domain-specific inference rules here
    return null;
  }

  // Note: JSON.stringify is sensitive to property order, so the confirmed call
  // must list its arguments in the same order as the speculated one to hit
  private cacheKey(tool: string, args: Record<string, unknown>): string {
    return `${tool}:${JSON.stringify(args)}`;
  }
}

When the LLM confirms a call your executor already started, callTool hits the cache and returns the in-flight promise. If the promise has already resolved, you get the result immediately. If it's still running, you wait only for the remainder of the call rather than its full duration. Either way, you've saved time.

If the LLM calls a different tool than predicted, the cached entry is never claimed and should be evicted after a short TTL so it stops counting against the speculation budget. The wrong speculation costs you one API call. The actual call runs normally. No user-facing impact.
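One detail that can quietly lower the hit rate: the cacheKey method above builds keys with JSON.stringify, which is sensitive to property order. If the speculated arguments and the LLM's confirmed arguments list the same fields in a different order, the keys differ and a correct speculation is missed. A minimal sketch of an order-independent key follows; stableStringify and stableCacheKey are illustrative names, not part of any library.

```typescript
// Serialize a value with object keys sorted, so that property order
// does not affect the resulting string.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  // Primitives: defer to JSON.stringify; treat undefined as null.
  const primitive = JSON.stringify(value) as string | undefined;
  return primitive ?? "null";
}

function stableCacheKey(tool: string, args: Record<string, unknown>): string {
  return `${tool}:${stableStringify(args)}`;
}

// Same arguments, different property order, same key.
const a = stableCacheKey("get_invoice_history", { customerId: "c1", limit: 10 });
const b = stableCacheKey("get_invoice_history", { limit: 10, customerId: "c1" });
console.log(a === b); // true
```

Swapping this in for cacheKey is a small change, and it only affects how speculated and confirmed calls are matched.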

Building the Pattern Table from Your Production Logs

The pattern table only works as well as it reflects your actual traffic. You build it by analyzing the tool call sequences in your production logs:

speculative-executor/pattern-builder.ts·typescript

// ToolPattern is the interface from pattern-table.ts (it must be exported there)
import type { ToolPattern } from "./pattern-table";

interface ToolCallLog {
  sessionId: string;
  sequence: Array<{ tool: string; args: Record<string, unknown> }>;
}

export function buildPatternTable(
  logs: ToolCallLog[],
  minLikelihood = 0.4
): Record<string, Record<string, ToolPattern>> {
  const pairCounts: Record<string, Record<string, number>> = {};
  const toolCounts: Record<string, number> = {};

  for (const log of logs) {
    for (let i = 0; i < log.sequence.length; i++) {
      const current = log.sequence[i].tool;
      // Count every call, including the last one in a session, so likelihood
      // means "share of all calls to A that are followed by B"
      toolCounts[current] = (toolCounts[current] ?? 0) + 1;

      const next = log.sequence[i + 1]?.tool;
      if (next === undefined) continue; // last call in the session: no successor

      pairCounts[current] ??= {};
      pairCounts[current][next] = (pairCounts[current][next] ?? 0) + 1;
    }
  }

  const table: Record<string, Record<string, ToolPattern>> = {};

  for (const [tool, nextTools] of Object.entries(pairCounts)) {
    for (const [nextTool, count] of Object.entries(nextTools)) {
      const likelihood = count / toolCounts[tool];
      if (likelihood < minLikelihood) continue;

      // checkDataDependency is not defined in this article: supply your own
      // check for whether nextTool's arguments come from tool's result
      (table[tool] ??= {})[nextTool] = {
        likelihood,
        dependsOnResult: checkDataDependency(tool, nextTool),
      };
    }
  }

  return table;
}

Pull this from a rolling 30-day window of production logs and rebuild weekly. Tool usage patterns shift as your agent's conversation mix changes, as you update prompts, and as you add or remove tools. Stale patterns produce more wrong speculations and chip away at the latency savings.
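The builder above calls checkDataDependency without defining it. One possible way to derive it from the same logs is sketched below, as a heuristic rather than a definitive implementation: a pair is treated as dependent if, in any logged occurrence, the next call used an argument value that was not already present in the previous call's arguments. The helper name makeDependencyChecker and the pair key format are assumptions made for this example.

```typescript
interface LoggedCall {
  tool: string;
  args: Record<string, unknown>;
}

interface ToolCallLog {
  sessionId: string;
  sequence: LoggedCall[];
}

// Hypothetical helper: scans logs once and returns a checker with the
// (tool, nextTool) signature that the pattern builder expects.
function makeDependencyChecker(logs: ToolCallLog[]) {
  const dependentPairs = new Set<string>();

  for (const log of logs) {
    for (let i = 0; i < log.sequence.length - 1; i++) {
      const prev = log.sequence[i];
      const next = log.sequence[i + 1];

      // Values that were already known when the previous call was issued.
      const known = new Set(Object.values(prev.args).map((v) => JSON.stringify(v)));

      // If the next call used any value we did not know yet, assume it came
      // from the previous call's result.
      const usesUnknownValue = Object.values(next.args).some(
        (v) => !known.has(JSON.stringify(v))
      );
      if (usesUnknownValue) dependentPairs.add(`${prev.tool}->${next.tool}`);
    }
  }

  return (tool: string, nextTool: string): boolean =>
    dependentPairs.has(`${tool}->${nextTool}`);
}

// Example with two made-up sessions.
const sampleLogs: ToolCallLog[] = [
  {
    sessionId: "s1",
    sequence: [
      { tool: "get_customer_profile", args: { customerId: "c1" } },
      { tool: "get_invoice_history", args: { customerId: "c1" } },
    ],
  },
  {
    sessionId: "s2",
    sequence: [
      { tool: "get_order_status", args: { customerId: "c2" } },
      { tool: "get_shipment_tracking", args: { orderId: "o9" } },
    ],
  },
];

const checkDependency = makeDependencyChecker(sampleLogs);
console.log(checkDependency("get_customer_profile", "get_invoice_history")); // false
console.log(checkDependency("get_order_status", "get_shipment_tracking"));   // true
```

The heuristic errs on the conservative side. A value taken from the user's message rather than from a tool result is also flagged as a dependency, which costs a missed speculation but never an unsafe one.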


Why CX Tool Calls Are Particularly Predictable

General-purpose agents have diverse workloads. One request is a coding task, the next is a document summary, the next is a data lookup. Tool sequences vary enormously.

CX agents don't have this problem. A support agent handles billing disputes, order questions, account changes, and product inquiries. These are stable categories. Within each category, the tool sequences are highly consistent because the information the agent needs to resolve the issue is the same every time.

A billing dispute almost always needs: customer identity, billing history, current policy, relevant case notes. The sequence might vary slightly, but the same four or five tools appear in essentially every billing conversation.

This is exactly the workload PASTE was designed for. High sequence regularity means high prediction accuracy. High prediction accuracy means your speculation budget gets used efficiently rather than wasted on wrong guesses.
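A quick way to check how regular your own sequences are is to measure, per conversation category, what share of sessions follow the single most common tool sequence. The sketch below is a rough proxy chosen for this example, not a metric defined by PASTE; topSequenceShare is a hypothetical name and the sessions are made-up data.

```typescript
// Share of sessions that follow the most common exact tool sequence (0 to 1).
function topSequenceShare(sessions: string[][]): number {
  if (sessions.length === 0) return 0;

  const counts = new Map<string, number>();
  for (const sequence of sessions) {
    const key = sequence.join(" > ");
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  const mostCommon = Math.max(...Array.from(counts.values()));
  return mostCommon / sessions.length;
}

// Four made-up billing-dispute sessions; three share the same sequence.
const billingSessions = [
  ["get_customer_profile", "get_invoice_history", "search_knowledge_base"],
  ["get_customer_profile", "get_invoice_history", "search_knowledge_base"],
  ["get_customer_profile", "search_knowledge_base", "get_invoice_history"],
  ["get_customer_profile", "get_invoice_history", "search_knowledge_base"],
];

console.log(topSequenceShare(billingSessions)); // 0.75
```

A category with a high share is a good candidate for speculation. A low share suggests the pattern table will produce many wrong guesses for that category.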

The other factor is CX latency tolerance. For an agent that composes long-form documents, an extra 500ms of latency is invisible. For a voice agent, where customers notice dead air after roughly 500ms, the same delay consumes the entire response window. Speculation's benefit is proportionally larger when the latency budget is tighter.

Connecting to Your Production Setup

To wire the speculative executor into your live system, you need three inputs: tool call sequences from production traffic, an MCP client for execution, and per-tool latency data to know where speculation pays off.

speculative-setup.ts·typescript

import { SpeculativeExecutor } from "./speculative-executor";
import { buildPatternTable } from "./speculative-executor/pattern-builder";

// loadProductionToolSequences, thirtyDaysAgo, now, mcpClient, and customerId are
// not defined here: they stand in for your own log loader, date helpers,
// connected MCP client, and request context.

// Pull tool call sequences from your call logs (any analytics source works)
const toolLogs = await loadProductionToolSequences({
  agentId: "cx-support-agent",
  dateRange: { start: thirtyDaysAgo(), end: now() },
  limit: 10_000,
});

// Build the pattern table from real traffic
const patterns = buildPatternTable(toolLogs, 0.45);
// The executor shown earlier reads the static TOOL_PATTERNS table, so persist
// `patterns` to wherever that table is loaded from as part of the weekly rebuild.

// Wrap your MCP client with speculative execution
const executor = new SpeculativeExecutor(mcpClient, 3);

// Use executor.callTool() wherever you currently call tools directly
const profile = await executor.callTool("get_customer_profile", { customerId });
// By this point, get_invoice_history is already in flight if the pattern
// table says it's likely needed
const invoices = await executor.callTool("get_invoice_history", { customerId });
// Cache hit: result was already being fetched. Little or no additional wait.

The raw material for a strong pattern table is tool co-occurrence data: which tools appear together, in what sequence, how often. Most agent platforms expose this either through call logs or session-level traces. If your platform's MCP integration tracks per-call timing, you already have what you need.

Per-tool latency distributions tell you where speculation pays off. Tools with consistent sub-50ms latency don't benefit much from speculation (there isn't much time to save). Tools with 100 to 300ms median latency, such as external CRM lookups, billing API calls, and vector knowledge base searches, are where speculation pays off most.
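As a rough illustration of that filter, the sketch below computes the median latency per tool and keeps only tools at or above a cutoff. The 50ms default comes from the paragraph above; the function names, the feature_flag_lookup tool, and the sample timings are made up for this example.

```typescript
function median(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Keep tools whose median latency is high enough for speculation to matter.
function speculationCandidates(
  latencySamplesMs: Record<string, number[]>,
  minMedianMs = 50
): string[] {
  return Object.entries(latencySamplesMs)
    .filter(([, samples]) => samples.length > 0 && median(samples) >= minMedianMs)
    .map(([tool]) => tool);
}

// Made-up timing samples (milliseconds).
const latencySamples: Record<string, number[]> = {
  get_customer_profile: [110, 120, 130],
  feature_flag_lookup: [20, 30, 25],
  search_knowledge_base: [170, 180, 200, 190],
};

console.log(speculationCandidates(latencySamples));
// [ 'get_customer_profile', 'search_knowledge_base' ]
```

Latency only tells you where the potential saving is. A tool still has to be on the read-only allowlist and appear in a predictable sequence before it is worth speculating on.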

Before pushing speculation to production, run your test scenarios and compare timing. You'll see which conversation types benefit most and which don't have predictable enough sequences to warrant speculation.

Five Metrics That Tell You Whether It's Working

Speculation adds complexity. These five numbers tell you whether the complexity is paying off.

Speculation hit rate. The percentage of speculative calls the agent actually uses. Below 50% means your pattern table needs work: either the likelihood thresholds are too low or the inferred arguments are wrong.

Speculation waste rate. The percentage of speculative calls that never get used, which is the complement of the hit rate. High waste means you're spending API credits on guesses that miss. Adjust by raising your likelihood threshold or tightening your speculation budget.

P50 and P95 tool wait time. Before and after speculation, measure how long your agent spends blocked on tool results. P50 improvement validates typical case wins. P95 improvement validates tail latency improvement. Both should drop.

End-to-end turn latency. The time from customer message to agent first token. This is the metric customers feel in a voice conversation. Speculation should move this meaningfully: if your tools are the bottleneck, you should see a 30 to 50% improvement.

Wrong speculation rate by tool pair. Track which specific pairs produce the most wrong predictions. If get_customer_profile -> search_knowledge_base predicts wrong 70% of the time, remove it. Pairs with low hit rates cost more than they save.

Per-call timing at the session level, including whether each call was served from speculation cache or ran fresh, is straightforward to capture if your platform's monitoring already records tool execution events. That gives you hit rate and waste rate alongside standard latency metrics without separate instrumentation.
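To show how these numbers fall out of recorded events, here is a minimal sketch that derives hit rate, waste rate, and per-pair wrong rate from a list of speculation events, plus a nearest-rank percentile for tool wait times. The SpeculationEvent shape is an assumption about what your monitoring records, not a format from any specific platform.

```typescript
interface SpeculationEvent {
  previousTool: string;
  speculatedTool: string;
  used: boolean; // true if the agent later confirmed this exact call
}

interface PairStats {
  total: number;
  wrong: number;
  wrongRate: number;
}

function summarizeSpeculation(events: SpeculationEvent[]) {
  const usedCount = events.filter((e) => e.used).length;
  const hitRate = events.length === 0 ? 0 : usedCount / events.length;
  const wasteRate = events.length === 0 ? 0 : 1 - hitRate;

  // Wrong speculation rate broken down by tool pair.
  const byPair = new Map<string, PairStats>();
  for (const e of events) {
    const key = `${e.previousTool} -> ${e.speculatedTool}`;
    const stats = byPair.get(key) ?? { total: 0, wrong: 0, wrongRate: 0 };
    stats.total += 1;
    if (!e.used) stats.wrong += 1;
    stats.wrongRate = stats.wrong / stats.total;
    byPair.set(key, stats);
  }

  return { hitRate, wasteRate, byPair };
}

// Nearest-rank percentile, for P50 and P95 tool wait times.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Made-up events: the invoice pair always hits, the knowledge base pair never does.
const events: SpeculationEvent[] = [
  { previousTool: "get_customer_profile", speculatedTool: "get_invoice_history", used: true },
  { previousTool: "get_customer_profile", speculatedTool: "get_invoice_history", used: true },
  { previousTool: "get_customer_profile", speculatedTool: "search_knowledge_base", used: false },
  { previousTool: "get_customer_profile", speculatedTool: "search_knowledge_base", used: false },
];

const summary = summarizeSpeculation(events);
console.log(summary.hitRate, summary.wasteRate); // 0.5 0.5
console.log(percentile([100, 200, 300, 400], 50)); // 200
console.log(percentile([100, 200, 300, 400], 95)); // 400
```

Running the same summary before and after enabling speculation, on the same test scenarios, gives a like-for-like comparison.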

The Broader Principle

Speculative tool execution is one instance of a consistent pattern across computing: don't wait for sequential confirmation when you can predict and parallelize.

CPUs speculate on branch outcomes. LLM runtimes speculate on next tokens (speculative decoding). Now AI agents can speculate on next tool calls. The math is the same in all three cases: prediction cost is low, execution cost is high, and sequential waiting is often unnecessary.

Your production logs already contain the evidence of which tool calls are predictable. The pattern table extracts that evidence. The executor acts on it. What's left is building the safety filter (reads only, never writes), setting a conservative budget (start at 2, tune from there), and measuring whether the numbers move in the right direction.

For more on managing tool call latency in voice CX agents, see Voice Agent Platform Architecture: Sub-300ms Responses for the full latency budget breakdown, and Stop Loading All Your MCP Tools at Once for the complementary technique of reducing which tools are loaded at startup. Speculation works best when the tool set is already lean.



Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
