Pre-Execute Tool Calls to Cut Agent Latency 48%

Sequential tool calls quietly kill your agent's response time. PASTE shows you can pre-execute likely tool calls during LLM thinking time and cut latency 48% without touching your model.

Dean Grover, Co-founder
May 16, 2026
Side-by-side timeline showing sequential tool calls stacking up to 450ms versus parallel speculative execution finishing in 220ms

The voice agent you just shipped has a response latency problem. The LLM is fast. Your STT pipeline is fast. But customers hear a pause before every response, and you can't figure out why.

Open a production trace and look at the tool call timeline. Your agent calls get_customer_profile. Waits 120ms. Then it calls get_order_history. Waits 95ms. Then search_knowledge_base. Waits 180ms. Only then does it start generating the response. That is 395ms of pure waiting before a single word gets spoken.

The LLM isn't the bottleneck. Sequential tool calls are.

The Hidden Cost of the ReAct Loop

The ReAct pattern (reason, act, observe, repeat) is the foundation of most production agents. It works well. It's debuggable, composable, and handles complex multi-step tasks. The catch is that it's serial by design. Each reasoning step completes before the next tool call starts, and each tool call completes before the next reasoning step starts.

For a typical CX support conversation, your agent might run through this sequence for a billing question:

  1. LLM reasoning: "I need to look up this customer." Call get_customer_profile.
  2. Wait 120ms for the CRM to respond.
  3. LLM reasoning: "I need their billing history." Call get_invoice_history.
  4. Wait 150ms for the billing system.
  5. LLM reasoning: "Let me check our refund policy." Call search_knowledge_base.
  6. Wait 180ms for vector retrieval.
  7. LLM generates response.

Total tool wait: 450ms. On a voice channel where the acceptable response window before customers notice dead air is under 500ms, that leaves almost no headroom for actual LLM inference.

The problem isn't that any single call is slow. 120ms for a CRM lookup is reasonable. The problem is they're sequential when they don't need to be.

Sequential path:

  1. Customer query arrives.
  2. LLM reasons: call CRM.
  3. CRM lookup: 120ms wait.
  4. LLM reasons: call billing.
  5. Billing lookup: 150ms wait.
  6. LLM reasons: call knowledge base.
  7. KB search: 180ms wait.
  8. LLM generates response.

Speculative path:

  1. LLM starts reasoning.
  2. Speculatively fire CRM + billing + KB in parallel.
  3. LLM finishes reasoning: confirms CRM call.
  4. CRM result already cached.
  5. Next calls also already cached.
  6. LLM generates response.

Sequential vs speculative tool execution for a billing query with three tool calls

The speculative path turns 450ms of serial waiting into roughly 220ms, close to the 180ms of the slowest single call rather than the sum of all three.
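The arithmetic behind that comparison can be sketched in a few lines. This is a minimal illustration using the example latencies from the billing scenario; the function names are invented for this sketch, and it deliberately ignores LLM reasoning time and scheduling overhead.

```typescript
// Example tool latencies from the billing scenario (milliseconds).
const toolLatencyMs: Record<string, number> = {
  get_customer_profile: 120,
  get_invoice_history: 150,
  search_knowledge_base: 180,
};

// Serial execution: each call waits for the previous one, so the waits add up.
function sequentialWaitMs(latencies: Record<string, number>): number {
  return Object.values(latencies).reduce((total, ms) => total + ms, 0);
}

// Fully parallel execution: all calls start together, so the wait is
// bounded by the slowest call.
function parallelWaitMs(latencies: Record<string, number>): number {
  return Math.max(0, ...Object.values(latencies));
}

console.log(sequentialWaitMs(toolLatencyMs)); // 450
console.log(parallelWaitMs(toolLatencyMs));   // 180
```

The 180ms figure is a lower bound. A real timeline lands somewhat above it, because speculation can only start once there is something to predict from.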

How CPUs Solved This Problem 40 Years Ago

Modern processors don't execute instructions in strict sequence. They look ahead in the instruction stream, predict which way upcoming branches will go, and start executing the predicted path before the branch is resolved. If the prediction is correct, the result is already computed when needed. If it's wrong, the processor discards the speculative work and runs the correct path.

This is speculative execution, guided by branch prediction, and it's one reason your laptop doesn't crawl through code one instruction at a time.

The same insight applies to AI agents. While your LLM is reasoning about whether to call get_order_history, you can predict (with high accuracy) that it's about to make that call and start executing it now. By the time the LLM confirms the decision, the result is ready.

A March 2026 arXiv paper formalized this as PASTE: Pattern-Aware Speculative Tool Execution. The key results: 48.5% reduction in average task completion time and 1.8x improvement in tool execution throughput. Not by changing the model, improving the prompts, or optimizing individual tool latency. Just by running calls that were going to happen anyway slightly earlier.

The Two Properties That Make Speculation Work

PASTE's central insight is that agent workloads are far more predictable than they appear. Across thousands of production conversations, two properties hold consistently.

Stable application-level control flows. While user queries vary in content, the tool sequences they trigger are remarkably stable. Billing dispute conversations almost always follow the same tool sequence: customer lookup, then billing history, then policy search. Order modification conversations follow a different but equally stable sequence. The content varies. The structure doesn't.

Predictable data dependencies. Many sequential tool calls in production don't actually depend on each other's output. Your agent calls get_customer_profile and then get_order_history, but the order history lookup uses a customerId that came from the user's message directly, not from the profile result. If there's no data dependency, you can start the second call before the first returns.

These two properties together create the opportunity. Stable sequences give you the prediction. Independence gives you the safety to act on it.
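To make the independence point concrete, here is a minimal sketch of two calls that share only an input taken from the user's message. The `fakeTool` helper and its return values are hypothetical stand-ins for real tool calls; the point is only that both calls are started before either is awaited.

```typescript
// Records when each simulated call starts and ends, to make ordering visible.
const events: string[] = [];

// Hypothetical stand-in for a real tool call: resolves with `value` after `ms`.
function fakeTool<T>(name: string, ms: number, value: T): Promise<T> {
  events.push(`start:${name}`);
  return new Promise<T>((resolve) =>
    setTimeout(() => {
      events.push(`end:${name}`);
      resolve(value);
    }, ms)
  );
}

// Both calls take customerId straight from the user's message, so neither
// needs the other's output. Start both, then await them together.
async function fetchIndependent(customerId: string) {
  const profilePromise = fakeTool("get_customer_profile", 30, { customerId, tier: "gold" });
  const ordersPromise = fakeTool("get_order_history", 20, [{ orderId: "o_1", customerId }]);
  const [profile, orders] = await Promise.all([profilePromise, ordersPromise]);
  return { profile, orders };
}
```

Calling `fetchIndependent("c_42")` records both `start:` events before any `end:` event, which is the overlap that a serial `await` on each call in turn would not give you.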

Building a Speculative Executor

The implementation has three pieces: a pattern table that maps current tools to likely next tools, a speculative executor that fires pre-emptive calls, and a cache that returns results when the LLM confirms them.

Start with the pattern table:

speculative-executor/pattern-table.ts·typescript
export interface ToolPattern {
  likelihood: number;       // 0-1: frequency of this sequence in production
  dependsOnResult: boolean; // does the next call need this call's output?
}
 
// Built from production tool call logs; rebuild weekly.
// Exported so the executor and the pattern builder can share it.
export const TOOL_PATTERNS: Record<string, Record<string, ToolPattern>> = {
  get_customer_profile: {
    get_invoice_history: { likelihood: 0.81, dependsOnResult: false },
    get_account_status:  { likelihood: 0.68, dependsOnResult: false },
    search_knowledge_base: { likelihood: 0.44, dependsOnResult: false },
  },
  get_order_status: {
    get_shipment_tracking: { likelihood: 0.74, dependsOnResult: true }, // needs order ID
    initiate_return:       { likelihood: 0.39, dependsOnResult: true },
  },
  search_knowledge_base: {
    get_policy_details: { likelihood: 0.61, dependsOnResult: true },
  },
};

The dependsOnResult: true flag is critical. When the next call needs the current call's output (an order ID, a case number, an account reference) you cannot speculate safely. Skip those pairs.
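As a rough sketch of that filtering rule in isolation, the helper below returns the next tools worth pre-executing for a given current tool. `speculationCandidates` and `sampleTable` are names invented for this example, the 0.5 threshold is an assumption, and the sort by likelihood is an addition for readability rather than something the pattern table requires.

```typescript
interface ToolPattern {
  likelihood: number;
  dependsOnResult: boolean;
}

type PatternTable = Record<string, Record<string, ToolPattern>>;

// Returns the next tools that are safe and likely enough to pre-execute
// after `current`, most likely first.
function speculationCandidates(
  table: PatternTable,
  current: string,
  minLikelihood = 0.5
): string[] {
  return Object.entries(table[current] ?? {})
    .filter(([, p]) => !p.dependsOnResult && p.likelihood >= minLikelihood)
    .sort(([, a], [, b]) => b.likelihood - a.likelihood)
    .map(([tool]) => tool);
}

// A subset of the table shown earlier.
const sampleTable: PatternTable = {
  get_customer_profile: {
    get_account_status: { likelihood: 0.68, dependsOnResult: false },
    get_invoice_history: { likelihood: 0.81, dependsOnResult: false },
    search_knowledge_base: { likelihood: 0.44, dependsOnResult: false },
  },
  get_order_status: {
    get_shipment_tracking: { likelihood: 0.74, dependsOnResult: true },
  },
};

console.log(speculationCandidates(sampleTable, "get_customer_profile"));
// ["get_invoice_history", "get_account_status"]
console.log(speculationCandidates(sampleTable, "get_order_status"));
// []
```

Note that `get_shipment_tracking` is dropped despite its 0.74 likelihood: a likely call that needs the previous result still cannot be started early.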

Now the executor:

speculative-executor/index.ts·typescript
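// The MCP TypeScript SDK exports its client class as `Client`
import type { Client as MCPClient } from "@modelcontextprotocol/sdk/client/index.js";
// Assumes pattern-table.ts exports TOOL_PATTERNS
import { TOOL_PATTERNS } from "./pattern-table";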
 
export class SpeculativeExecutor {
  private cache: Map<string, Promise<unknown>> = new Map();
  private budget: number;
 
  constructor(
    private client: MCPClient,
    budget = 3
  ) {
    this.budget = budget;
  }
 
  async callTool(name: string, args: Record<string, unknown>) {
    const key = this.cacheKey(name, args);
 
    // Cache hit: speculation was correct, result already in flight or ready
    if (this.cache.has(key)) {
      const result = await this.cache.get(key)!;
      this.cache.delete(key);
      return result;
    }
 
    // Cache miss: run normally, but start speculating on what comes next
    const resultPromise = this.client.callTool({ name, arguments: args });
    this.speculateNext(name, args);
    return await resultPromise;
  }
 
  private speculateNext(current: string, args: Record<string, unknown>) {
    const patterns = TOOL_PATTERNS[current];
    if (!patterns) return;
 
    for (const [nextTool, pattern] of Object.entries(patterns)) {
      if (pattern.dependsOnResult) continue;     // dependency: skip
      if (pattern.likelihood < 0.5) continue;    // too uncertain: skip
      if (this.cache.size >= this.budget) break;  // budget exceeded: stop
      if (!SPECULATION_SAFE.has(nextTool)) continue; // mutations: skip
 
      const nextArgs = this.inferArgs(nextTool, current, args);
      if (!nextArgs) continue;
 
      const key = this.cacheKey(nextTool, nextArgs);
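      if (!this.cache.has(key)) {
        const pending = this.client.callTool({ name: nextTool, arguments: nextArgs });
        // Mark failures as handled so an unused wrong guess can't crash the process
        pending.catch(() => {});
        this.cache.set(key, pending);
        // Evict unused guesses so they don't hold the budget forever
        // (5s is an arbitrary starting point; tune it for your turn length)
        setTimeout(() => {
          if (this.cache.get(key) === pending) this.cache.delete(key);
        }, 5_000);
      }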
    }
  }
 
  private inferArgs(
    nextTool: string,
    previousTool: string,
    previousArgs: Record<string, unknown>
  ): Record<string, unknown> | null {
    // Pull arguments for the next call from what we already know
    if (nextTool === "get_invoice_history" && previousTool === "get_customer_profile") {
      return previousArgs.customerId ? { customerId: previousArgs.customerId } : null;
    }
    if (nextTool === "get_account_status" && previousTool === "get_customer_profile") {
      return previousArgs.customerId ? { accountId: previousArgs.customerId } : null;
    }
    // Add your domain-specific inference rules here
    return null;
  }
 
  private cacheKey(tool: string, args: Record<string, unknown>): string {
    return `${tool}:${JSON.stringify(args)}`;
  }
}
 
// Only idempotent reads are safe to speculate on
const SPECULATION_SAFE = new Set([
  "get_customer_profile",
  "get_invoice_history",
  "get_account_status",
  "get_order_status",
  "get_shipment_tracking",
  "search_knowledge_base",
  "get_policy_details",
  "get_product_info",
  // Mutations are explicitly excluded:
  // send_email, process_refund, update_account, create_ticket, etc.
]);

When the LLM confirms a call your executor already started, callTool hits the cache and returns the in-flight promise. If the promise has already settled, you get the result immediately. If it's still running, you wait only for the remainder instead of the full call. Either way, you've saved time.

If the LLM calls a different tool than predicted, the cache entry just expires. The wrong speculation costs you one API call. The actual call runs normally. No user-facing impact.

Building the Pattern Table from Your Production Logs

The pattern table only works as well as it reflects your actual traffic. You build it by analyzing the tool call sequences in your production logs:

speculative-executor/pattern-builder.ts·typescript
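import type { ToolPattern } from "./pattern-table"; // assumes ToolPattern is exported there
 
// checkDataDependency(tool, nextTool) is assumed to be defined elsewhere in your codebase.
interface ToolCallLog {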
  sessionId: string;
  sequence: Array<{ tool: string; args: Record<string, unknown> }>;
}
 
export function buildPatternTable(
  logs: ToolCallLog[],
  minLikelihood = 0.4
): Record<string, Record<string, ToolPattern>> {
  const pairCounts: Record<string, Record<string, number>> = {};
  const toolCounts: Record<string, number> = {};
 
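  for (const log of logs) {
    for (let i = 0; i < log.sequence.length; i++) {
      const current = log.sequence[i].tool;
      // Count every occurrence, including the last call in a session, so
      // likelihood reflects how often `next` follows `current` overall
      toolCounts[current] = (toolCounts[current] ?? 0) + 1;
 
      const next = log.sequence[i + 1]?.tool;
      if (!next) continue; // last call in the session: nothing follows it
 
      pairCounts[current] ??= {};
      pairCounts[current][next] = (pairCounts[current][next] ?? 0) + 1;
    }
  }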
 
  const table: Record<string, Record<string, ToolPattern>> = {};
 
  for (const [tool, nextTools] of Object.entries(pairCounts)) {
    table[tool] = {};
    for (const [nextTool, count] of Object.entries(nextTools)) {
      const likelihood = count / toolCounts[tool];
      if (likelihood >= minLikelihood) {
        table[tool][nextTool] = {
          likelihood,
          dependsOnResult: checkDataDependency(tool, nextTool),
        };
      }
    }
  }
 
  return table;
}

Pull this from a rolling 30-day window of production logs and rebuild weekly. Tool usage patterns shift as your agent's conversation mix changes, as you update prompts, and as you add or remove tools. Stale patterns produce more wrong speculations and chip away at the latency savings.
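The builder calls checkDataDependency without defining it. One conservative way to approximate it from the same logs is sketched below. This is a heuristic, not a definitive implementation: `makeDependencyChecker` is a hypothetical helper, and it treats a pair as independent only when every argument value of the next call was already present in the current call's arguments in every logged occurrence. Anything else, including pairs never seen in the logs, is treated as dependent.

```typescript
interface ToolCallLog {
  sessionId: string;
  sequence: Array<{ tool: string; args: Record<string, unknown> }>;
}

// Builds a checker over the logs. Errs on the side of "dependent", which
// means errs on the side of not speculating.
function makeDependencyChecker(logs: ToolCallLog[]) {
  const seen = new Set<string>();
  const dependent = new Set<string>();

  for (const { sequence } of logs) {
    for (let i = 0; i < sequence.length - 1; i++) {
      const current = sequence[i];
      const next = sequence[i + 1];
      const pair = `${current.tool}->${next.tool}`;
      seen.add(pair);

      // Compare argument values structurally via their JSON form.
      const known = new Set(Object.values(current.args).map((v) => JSON.stringify(v)));
      const allKnown = Object.values(next.args).every((v) => known.has(JSON.stringify(v)));
      if (!allKnown) dependent.add(pair);
    }
  }

  return function checkDataDependency(tool: string, nextTool: string): boolean {
    const pair = `${tool}->${nextTool}`;
    return !seen.has(pair) || dependent.has(pair);
  };
}

// Illustrative logs, not real data.
const sampleLogs: ToolCallLog[] = [
  {
    sessionId: "s1",
    sequence: [
      { tool: "get_customer_profile", args: { customerId: "c_1" } },
      { tool: "get_invoice_history", args: { customerId: "c_1" } },
    ],
  },
  {
    sessionId: "s2",
    sequence: [
      { tool: "get_order_status", args: { orderId: "o_9" } },
      { tool: "get_shipment_tracking", args: { trackingId: "t_5" } },
    ],
  },
];

const checkDataDependency = makeDependencyChecker(sampleLogs);
console.log(checkDataDependency("get_customer_profile", "get_invoice_history")); // false
console.log(checkDataDependency("get_order_status", "get_shipment_tracking"));   // true
```

The heuristic will flag some independent pairs as dependent, for example when the next call's argument came from the user's message rather than from the previous call's arguments. That costs missed speculation, not incorrect calls, so a hand-maintained override list is a reasonable complement.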

Why CX Tool Calls Are Particularly Predictable

General-purpose agents have diverse workloads. One request is a coding task, the next is a document summary, the next is a data lookup. Tool sequences vary enormously.

CX agents don't have this problem. A support agent handles billing disputes, order questions, account changes, and product inquiries. These are stable categories. Within each category, the tool sequences are highly consistent because the information the agent needs to resolve the issue is the same every time.

A billing dispute almost always needs: customer identity, billing history, current policy, relevant case notes. The sequence might vary slightly, but the same four or five tools appear in essentially every billing conversation.

This is exactly the workload PASTE was designed for. High sequence regularity means high prediction accuracy. High prediction accuracy means your speculation budget gets used efficiently rather than wasted on wrong guesses.

The other factor is CX latency tolerance. For an agent that composes long-form documents, a 500ms response time is invisible. For a voice agent, where customers notice dead air after about 500ms, those 500ms are the entire response window. Speculation's benefit is proportionally larger when the latency budget is tighter.

Connecting to Your Production Setup

To wire the speculative executor into your live system, you need three inputs: tool call sequences from production traffic, an MCP client for execution, and per-tool latency data to know where speculation pays off.

speculative-setup.ts·typescript
import { SpeculativeExecutor } from "./speculative-executor";
import { buildPatternTable } from "./pattern-builder";
 
// Pull tool call sequences from your call logs (any analytics source works)
const toolLogs = await loadProductionToolSequences({
  agentId: "cx-support-agent",
  dateRange: { start: thirtyDaysAgo(), end: now() },
  limit: 10_000,
});
 
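// Build the pattern table from real traffic.
// Note: SpeculativeExecutor as shown reads the static TOOL_PATTERNS table,
// so write `patterns` to wherever that table is loaded from.
const patterns = buildPatternTable(toolLogs, 0.45);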
 
// Wrap your MCP client with speculative execution
const executor = new SpeculativeExecutor(mcpClient, 3);
 
// Use executor.callTool() wherever you currently call tools directly
const profile = await executor.callTool("get_customer_profile", { customerId });
// By this point, get_invoice_history is already in flight if the pattern
// table says it's likely needed
const invoices = await executor.callTool("get_invoice_history", { customerId });
// Cache hit: the call was already in flight, so only the remaining wait is left.

The raw material for a strong pattern table is tool co-occurrence data: which tools appear together, in what sequence, how often. Most agent platforms expose this either through call logs or session-level traces. If your platform's MCP integration tracks per-call timing, you already have what you need.

Per-tool latency distributions tell you where speculation pays off. Tools with consistent sub-50ms latency don't benefit much from speculation (there isn't much time to save). Tools with 100 to 300ms median latency, such as external CRM lookups, billing API calls, and vector knowledge base searches, are where speculation pays off most.
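A small sketch of that screening step follows. The sample latencies are made up for illustration, `get_feature_flag` is a hypothetical fast tool, and the 100ms cutoff simply mirrors the range mentioned above; treat it as a starting point, not a rule.

```typescript
// Median of a non-empty list of latency samples (milliseconds).
function medianMs(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Keeps tools whose median latency is high enough for speculation to matter.
function worthSpeculating(
  samplesByTool: Record<string, number[]>,
  minMedianMs = 100
): string[] {
  return Object.entries(samplesByTool)
    .filter(([, samples]) => samples.length > 0 && medianMs(samples) >= minMedianMs)
    .map(([tool]) => tool);
}

// Illustrative samples, not measurements.
const latencySamples: Record<string, number[]> = {
  get_feature_flag: [12, 18, 25, 31],
  get_customer_profile: [110, 120, 135],
  search_knowledge_base: [160, 180, 240, 410],
};

console.log(worthSpeculating(latencySamples));
// ["get_customer_profile", "search_knowledge_base"]
```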

Before pushing speculation to production, run your test scenarios and compare timing. You'll see which conversation types benefit most and which don't have predictable enough sequences to warrant speculation.

Five Metrics That Tell You Whether It's Working

Speculation adds complexity. These five numbers tell you whether the complexity is paying off.

Speculation hit rate. The percentage of speculative calls the agent actually uses. Below 50% means your pattern table needs work: either the likelihood thresholds are too low or the inferred arguments are wrong.

Speculation waste rate. The percentage of speculative calls that never get used, which is the complement of the hit rate. High waste means you're spending API credits on guesses that miss. Adjust by raising your likelihood threshold or tightening your speculation budget.

P50 and P95 tool wait time. Before and after speculation, measure how long your agent spends blocked on tool results. A P50 improvement confirms typical-case wins. A P95 improvement confirms the tail is getting better too. Both should drop.

End-to-end turn latency. The time from customer message to agent first token. This is the metric customers feel in a voice conversation. Speculation should move this meaningfully: if your tools are the bottleneck, you should see a 30 to 50% improvement.

Wrong speculation rate by tool pair. Track which specific pairs produce the most wrong predictions. If get_customer_profile -> search_knowledge_base predicts wrong 70% of the time, remove it. Pairs with low hit rates cost more than they save.

Per-call timing at the session level, including whether each call was served from speculation cache or ran fresh, is straightforward to capture if your platform's monitoring already records tool execution events. That gives you hit rate and waste rate alongside standard latency metrics without separate instrumentation.
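As a rough sketch of how these numbers fall out of logged events, assume each speculative call is recorded with the tool pair that triggered it and whether the agent later used it. The `SpeculativeCallEvent` shape and the sample data are assumptions for this example, and the percentile uses the simple nearest-rank method.

```typescript
interface SpeculativeCallEvent {
  pair: string;  // e.g. "get_customer_profile->get_invoice_history"
  used: boolean; // did the agent later request this exact call?
}

// Overall hit rate and waste rate; the two always sum to 1.
function speculationRates(events: SpeculativeCallEvent[]) {
  const hits = events.filter((e) => e.used).length;
  const hitRate = events.length === 0 ? 0 : hits / events.length;
  return { hitRate, wasteRate: events.length === 0 ? 0 : 1 - hitRate };
}

// Hit rate per tool pair, to spot pairs that should be removed.
function hitRateByPair(events: SpeculativeCallEvent[]): Record<string, number> {
  const totals: Record<string, { hits: number; count: number }> = {};
  for (const e of events) {
    totals[e.pair] ??= { hits: 0, count: 0 };
    totals[e.pair].count += 1;
    if (e.used) totals[e.pair].hits += 1;
  }
  const rates: Record<string, number> = {};
  for (const [pair, t] of Object.entries(totals)) rates[pair] = t.hits / t.count;
  return rates;
}

// Nearest-rank percentile of a non-empty sample list (p between 0 and 100).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Illustrative events and wait times, not measurements.
const sampleEvents: SpeculativeCallEvent[] = [
  { pair: "get_customer_profile->get_invoice_history", used: true },
  { pair: "get_customer_profile->get_invoice_history", used: true },
  { pair: "get_customer_profile->get_invoice_history", used: false },
  { pair: "get_customer_profile->get_account_status", used: false },
];
const toolWaitMs = [100, 120, 150, 180, 400];

console.log(speculationRates(sampleEvents)); // { hitRate: 0.5, wasteRate: 0.5 }
console.log(percentile(toolWaitMs, 50));     // 150
console.log(percentile(toolWaitMs, 95));     // 400
```

Running the same percentile over tool wait times captured before and after enabling speculation gives the P50 and P95 comparison directly.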

The Broader Principle

Speculative tool execution is one instance of a consistent pattern across computing: don't wait for sequential confirmation when you can predict and parallelize.

CPUs speculate on branch outcomes. LLM runtimes speculate on next tokens (speculative decoding). Now AI agents can speculate on next tool calls. The math is the same in all three cases: prediction cost is low, execution cost is high, and sequential waiting is often unnecessary.

Your production logs already contain the evidence of which tool calls are predictable. The pattern table extracts that evidence. The executor acts on it. What's left is building the safety filter (reads only, never writes), setting a conservative budget (the examples here use 3; start at 2 if you want to be cautious, and tune from there), and measuring whether the numbers move in the right direction.

For more on managing tool call latency in voice CX agents, see Voice Agent Platform Architecture: Sub-300ms Responses for the full latency budget breakdown, and Stop Loading All Your MCP Tools at Once for the complementary technique of reducing which tools are loaded at startup. Speculation works best when the tool set is already lean.
