What is durable execution for AI agents?

Durable execution guarantees an AI agent's workflow runs to completion despite failures like crashes, timeouts, or network errors. The agent's state is checkpointed at each step, so when it restarts it picks up where it left off rather than starting over. This matters especially for multi-step CX tasks like refunds, bookings, and follow-up sequences that touch multiple external systems.

Why do AI agents need durable execution more than traditional software?

AI agents chain together multiple non-deterministic steps: LLM calls, tool invocations, API requests, and memory lookups. Each step can fail independently, and some steps (like charging a card or sending a confirmation email) must not be repeated on retry. Traditional try/catch error handling doesn't preserve state across process restarts, so without explicit checkpointing you lose everything and start over.

What tools implement durable execution for AI agents?

Temporal is the most mature option, with a Python SDK and an official OpenAI Agents SDK integration released in March 2026. LangGraph offers built-in checkpointing via PostgreSQL or Redis. For simpler cases you can implement a lightweight checkpoint store with a database and idempotent retries, though this approach requires more ongoing maintenance.

How does durable execution help CX-specific agent tasks?

CX tasks are especially failure-prone because they span multiple systems (CRM, billing, calendar, email) and often extend across hours or days. A booking confirmation that requires three vendor API calls, a cancellation save that involves a live database write, or a follow-up reminder triggered 48 hours later all need guarantees that partial completions don't leave customers in an inconsistent state.

Does durable execution slow down my agent?

Checkpointing adds latency on each step, typically 10-50ms per database write. For real-time voice conversations, where every in-conversation tool call sits inside the response-time budget, that overhead is often too high. The pattern works best for background tasks and async workflows decoupled from the live call. A practical split: fast in-call tools handle the conversation, and durable post-call workflows handle the side effects.

What is the difference between a retry and durable execution?

A retry re-runs an operation from the beginning. Durable execution resumes from the last successful checkpoint. Take a 7-step booking workflow that fails at step 4: a retry restarts at step 1, while durable execution resumes at step 4 with the results of steps 1-3 intact. This matters when earlier steps have side effects like database writes or email sends that shouldn't repeat on every attempt.

How do I test that my agent's durable execution actually works?

Inject artificial failures: kill the process after each checkpoint, simulate network timeouts on tool calls, and send duplicate retries to verify idempotency. Run these as scenario tests before deploying, then monitor completion-rate and partial-failure metrics in production to catch edge cases that lab testing missed.

Your CX Agent Crashes Mid-Task. Here's the Fix.

Picture this: a customer calls your support line to cancel a subscription. Your AI agent starts the cancellation-save flow. It checks their account tier, looks up their usage in the last 30 days, calculates a targeted discount based on that data, and begins writing the outcome to your CRM. Somewhere between the discount calculation and the CRM write, your agent process crashes.

The customer hangs up thinking they were transferred. Your CRM has no record. The billing system still has them as active. When the customer calls back, your agent starts the conversation from scratch with no memory of what just happened.

This is the durable execution problem. It's one of the most underestimated production failure modes for CX agents, and it gets worse as your agents take on longer and more consequential tasks. You've invested in conversation quality, RAG pipelines, and tool integrations. But if your agent can't survive infrastructure failures mid-task, none of that reliability work matters when it counts.

What Durable Execution Actually Means

Durable execution means an agent's workflow is guaranteed to run to completion, even when the underlying process crashes, times out, or loses network access. Every step is checkpointed to persistent storage. When the agent restarts, it doesn't re-run from the beginning; it picks up from the last successful step with the same state it had before the failure.

The concept isn't new to distributed systems engineers. Message queues and workflow orchestrators have used it for decades. What's new is the collision between long-running agent workflows and the operational assumptions most teams bring from stateless API development, where a crash means a clean restart and nothing worse.

Think about the tasks your CX agents actually handle:

Process a refund (CRM lookup, billing API, confirmation email, audit log)
Schedule a follow-up call (calendar availability check, SMS send, 48-hour reminder)
Handle a cancellation save (account analysis, offer calculation, retention write, outcome log)
Escalate with context (call history fetch, conversation summary, ticket creation, queue assignment)

Every one of these is a multi-step workflow with side effects at each step. If step 4 fails and you retry from step 1, you'll charge the customer twice through the billing API, send duplicate emails, and create ghost records in your CRM. The agent isn't just slow; it's actively creating problems for your customers and your data.

Where Traditional Error Handling Breaks Down

The standard instinct when a step fails is to wrap it in a try/catch and retry. That works for idempotent operations on a single system. It breaks down for agent workflows in three specific ways.

In-memory state disappears on crash. If your Python or Node process dies, the local variables holding workflow progress are gone. You either restart from the beginning or rely on a state persistence layer you built yourself. Most teams haven't built that layer.

Retries create duplicates for non-idempotent operations. An email send, a billing write, or a CRM update is not safe to retry without precautions. You need idempotency keys, deduplication logic, and careful sequencing for every external call. This is the kind of infrastructure that typically takes weeks to build correctly and months to trust.

Long-running workflows outlive single request lifetimes. A follow-up task that fires 48 hours after a call can't live in a web server's memory. You need a scheduler, a state store, and a mechanism to resume the workflow when the timer fires. Temporal solves all three together; most home-grown solutions solve one of the three and paper over the others.
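The second of these points, duplicates on retry, is the easiest to show in code. The sketch below is a minimal illustration of server-side deduplication on an idempotency key, not any real provider's API: FakeEmailApi, its send method, and the key format are all hypothetical stand-ins.

```python
import uuid


class FakeEmailApi:
    """Hypothetical stand-in for an external API that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.sent: dict[str, str] = {}  # idempotency key -> message id

    def send(self, to: str, body: str, idempotency_key: str) -> str:
        # A repeated key returns the original result instead of sending again.
        if idempotency_key in self.sent:
            return self.sent[idempotency_key]
        message_id = str(uuid.uuid4())
        self.sent[idempotency_key] = message_id
        return message_id


api = FakeEmailApi()
key = "refund-confirmation:cust_123:2024-01-15"  # hypothetical key format

first = api.send("a@example.com", "Your refund is on its way.", key)
retry = api.send("a@example.com", "Your refund is on its way.", key)  # simulated retry
```

Both calls return the same message ID and only one email is recorded. The retry is safe because the key is derived from the task, not generated fresh on each attempt.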

We wrote about a closely related problem in async agent tasks between conversations -- the coordination challenge when agents need to act outside a live session. Durable execution is the infrastructure layer that makes that coordination reliable rather than fire-and-forget.

How Temporal Makes Agents Fault-Tolerant

Temporal is a workflow orchestration platform built around durable execution. Your agent workflow is written as ordinary Python code, and Temporal records the result of every completed activity in a durable event history. If the worker process dies, another worker replays the workflow code against that history -- completed activities return their recorded results instead of running again, and real execution resumes at the first incomplete one.

Here's what a durable cancellation-save flow looks like with Temporal and the Python SDK:

cancellation_save_workflow.py·python

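from dataclasses import dataclass
from datetime import timedelta

from temporalio import workflow

# Assumption: the five activities used below are @activity.defn functions
# defined in a separate module (here a hypothetical activities.py).
with workflow.unsafe.imports_passed_through():
    from activities import (
        calculate_retention_offer,
        fetch_account_data,
        present_offer_and_wait,
        update_crm,
        write_retention_record,
    )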
 
@dataclass
class CancellationInput:
    customer_id: str
    call_id: str
 
@workflow.defn
class CancellationSaveWorkflow:
    @workflow.run
    async def run(self, input: CancellationInput) -> dict:
        # Each activity is independently checkpointed
        account = await workflow.execute_activity(
            fetch_account_data,
            input.customer_id,
            start_to_close_timeout=timedelta(seconds=10),
        )
 
        offer = await workflow.execute_activity(
            calculate_retention_offer,
            account,
            start_to_close_timeout=timedelta(seconds=5),
        )
 
        # This activity waits for the live call to complete
        outcome = await workflow.execute_activity(
            present_offer_and_wait,
            {"offer": offer, "call_id": input.call_id},
            start_to_close_timeout=timedelta(minutes=10),
        )
 
        if outcome["accepted"]:
            await workflow.execute_activity(
                write_retention_record,
                {"customer_id": input.customer_id, "offer": offer},
                start_to_close_timeout=timedelta(seconds=10),
            )
 
        await workflow.execute_activity(
            update_crm,
            {"customer_id": input.customer_id, "outcome": outcome},
            start_to_close_timeout=timedelta(seconds=10),
        )
 
        return outcome

If write_retention_record fails, Temporal retries just that activity under its retry policy. No policy is set in this example, so Temporal's default applies, which retries with exponential backoff. The present_offer_and_wait step (which waited for the live conversation) is never re-executed because Temporal has already recorded its result in the workflow history.

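Notice that the workflow reads like regular Python. There's no special state machine DSL, no Redis keys to juggle, no polling loop. The durability comes from the Temporal worker infrastructure, not from custom application code. The one constraint is that workflow code must be deterministic so it can be replayed; anything non-deterministic, such as API calls, LLM calls, or clock reads, belongs in an activity.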
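To build intuition for what happens on recovery, here is a toy model of history-based replay in plain Python. It is a sketch of the idea only, not Temporal's implementation: ReplayContext, the three step functions, and the in-memory history list are all hypothetical, and a real system would keep the history in persistent storage.

```python
from typing import Any, Callable


class ReplayContext:
    """Toy model of history-based replay; not Temporal's actual implementation."""

    def __init__(self, history: list[tuple[str, Any]]) -> None:
        self.history = history          # durable log of (activity name, result)
        self.position = 0               # how far this run has advanced through the log
        self.executed: list[str] = []   # activities actually run in this process

    def execute(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        if self.position < len(self.history):
            recorded_name, result = self.history[self.position]
            # Replay only works if the code requests the same activities in the same order.
            assert recorded_name == name, "workflow code is not deterministic"
            self.position += 1
            return result               # recorded result; fn is not called
        result = fn(*args)              # first time: really run the activity
        self.history.append((name, result))
        self.position += 1
        self.executed.append(name)
        return result


def fetch_account(customer_id: str) -> dict:
    return {"customer_id": customer_id, "tier": "pro"}


def calculate_offer(account: dict) -> dict:
    return {"discount_pct": 20 if account["tier"] == "pro" else 10}


def write_record(offer: dict, fail: bool) -> str:
    if fail:
        raise ConnectionError("CRM unavailable")
    return "written"


def workflow(ctx: ReplayContext, customer_id: str, fail_write: bool) -> str:
    account = ctx.execute("fetch_account", fetch_account, customer_id)
    offer = ctx.execute("calculate_offer", calculate_offer, account)
    return ctx.execute("write_record", write_record, offer, fail_write)


history: list[tuple[str, Any]] = []  # stands in for persistent storage

try:
    workflow(ReplayContext(history), "cust_123", fail_write=True)  # first run fails at step 3
except ConnectionError:
    pass

resumed = ReplayContext(history)  # a fresh "process" with the same history
status = workflow(resumed, "cust_123", fail_write=False)
```

On the second run the workflow function executes from its first line, but the first two calls are answered from the history and only write_record actually runs. That is also why the workflow code has to make the same calls in the same order on every run.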

In March 2026, Temporal released an official integration with the OpenAI Agents SDK, which means you can run OpenAI-based agent graphs as durable workflows with a few lines of configuration. The same pattern applies to any framework that supports async execution.

LangGraph's Checkpoint Pattern

If you're already on LangGraph, you get durability through its built-in checkpointer. LangGraph serializes node outputs to a configurable backend -- PostgreSQL, Redis, or in-memory for development -- and resumes from the last completed node on failure.

booking_graph.py·python

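from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver
import psycopg
from psycopg.rows import dict_row
 
# Assumes DATABASE_URL, BookingState, and the four node functions are defined elsewhere.
# PostgresSaver expects an autocommit connection that returns rows as dicts.
conn = psycopg.connect(DATABASE_URL, autocommit=True, row_factory=dict_row)
checkpointer = PostgresSaver(conn)
checkpointer.setup()  # creates the checkpoint tables; needed on first use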
 
builder = StateGraph(BookingState)
builder.add_node("check_availability", check_availability)
builder.add_node("hold_slot", hold_slot)
builder.add_node("process_payment", process_payment)
builder.add_node("send_confirmation", send_confirmation)
 
builder.set_entry_point("check_availability")
builder.add_edge("check_availability", "hold_slot")
builder.add_edge("hold_slot", "process_payment")
builder.add_edge("process_payment", "send_confirmation")
builder.add_edge("send_confirmation", END)
 
graph = builder.compile(checkpointer=checkpointer)
 
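# thread_id ties together the checkpointed state for this booking
config = {"configurable": {"thread_id": f"booking_{booking_id}"}}
# The synchronous PostgresSaver pairs with invoke(); inside async code,
# use AsyncPostgresSaver with `await graph.ainvoke(...)` instead.
result = graph.invoke(initial_state, config=config)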

The thread_id is the key. To resume a failed run, invoke the graph again with the same thread_id and None as the input; LangGraph loads the latest checkpoint for that thread and continues from there instead of starting a new run. If process_payment failed on the first attempt, the second attempt skips check_availability and hold_slot entirely -- their results are already in the checkpoint store.

LangGraph's approach works well if your agents are already graph-based and you want to stay within the LangChain framework. Temporal handles more complex coordination patterns, including workflows that span days or weeks and workflows that pause to wait for external signals (like a human approval).

[Figure: CX cancellation-save workflow. Each node is checkpointed before the next step begins.]

The CX-Specific Failure Modes You'll Actually Hit

Most durable execution writing focuses on batch processing and data pipelines. CX agents have a distinct failure profile that makes the standard advice incomplete.

Side-effect amplification. A customer-facing action is never just a database write. Sending a confirmation email, posting a CRM note, and logging a billing event all happen together. If the CRM note fails and you retry all three without idempotency keys, you get two emails, two billing logs, and one CRM note. The fix: pass a stable task_id (deterministically derived from customer_id + action_type + date) to every external API so duplicate calls are safely deduplicated on the server side.

Human wait states. Some CX workflows include a human-in-the-loop step: a supervisor reviews a refund above a threshold, a billing dispute requires a human decision before the next action fires. Temporal handles this with workflow.wait_condition() -- the workflow sleeps with zero compute and resumes when the signal arrives. Most home-grown schedulers can't model this without a polling loop that runs forever.

Post-call async tasks. A follow-up SMS four hours after the call, a satisfaction survey trigger 24 hours later, an escalation reminder if no response in 48 hours -- these are natural parts of a CX workflow that need to be durably scheduled, not fire-and-forget cron jobs that silently fail. We explored the customer experience gaps these create in CX agents that fail between conversations.

Multi-agent handoffs. When your triage agent hands off to a specialist agent, the handoff itself is a state transition that can fail. If the specialist agent is temporarily unavailable, does the triage agent retry, wait, or drop the task? Durable execution gives you a clean model for all three choices. The orchestration patterns for structuring these handoffs are covered in multi-agent orchestration patterns for production.

Building a Lightweight Checkpoint Store Without a Framework

Not every CX team is ready to add Temporal or LangGraph to their stack. If you're building on a simpler agent framework, you can get 80% of the benefit with a lightweight checkpoint store built on your existing database.

The pattern has three parts:

1. Deterministic task IDs. Before starting a workflow, generate a task_id from the inputs: sha256(customer_id + action_type + date). Write it to your database with status: started before you execute anything.

2. Step completion logging. After each step succeeds, write a record: {task_id, step_name, result, completed_at}. Before executing any step, check whether that step already has a completion record.

3. Resume logic on retry. Load the completion records on retry, skip already-completed steps, and resume at the first incomplete one.

checkpoint-runner.ts·typescript

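// One row per completed step. Assumes `db` is your existing database client
// and that (taskId, stepName) is unique in the stepResults collection.
type StepResult = {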
  taskId: string;
  stepName: string;
  result: unknown;
  completedAt: string;
};
 
async function runWithCheckpoint<T>(
  taskId: string,
  stepName: string,
  fn: () => Promise<T>
): Promise<T> {
  const existing = await db.stepResults.findOne({ taskId, stepName });
  if (existing) {
    return existing.result as T; // already completed, return cached result
  }
 
  const result = await fn();
  await db.stepResults.insertOne({
    taskId,
    stepName,
    result,
    completedAt: new Date().toISOString(),
  });
  return result;
}
 
// Usage in a booking workflow
async function runBookingWorkflow(bookingId: string, input: BookingInput) {
  const taskId = `booking_${bookingId}`;
 
  const availability = await runWithCheckpoint(taskId, "check_availability", () =>
    checkCalendarAvailability(input.requestedSlot)
  );
 
  const slot = await runWithCheckpoint(taskId, "hold_slot", () =>
    holdCalendarSlot(availability.bestSlot)
  );
 
  const payment = await runWithCheckpoint(taskId, "process_payment", () =>
    processPayment(input.paymentMethod, input.amount)
  );
 
  await runWithCheckpoint(taskId, "send_confirmation", () =>
    sendConfirmationEmail(input.customerEmail, slot, payment.receiptId)
  );
}

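This is simpler than Temporal but handles the common queue-based retry case. One caveat: if the process dies after fn() succeeds but before its result is written, that step runs again on retry, so every step still needs an idempotency key on its external call. The other gap is long wait states and complex branching -- if you need those patterns, the lightweight approach will eventually become painful and you should migrate to Temporal or LangGraph.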
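The runner above takes its task ID as given. Part 1 of the pattern, deriving that ID deterministically from the inputs, might look like the following in Python. This is a minimal sketch: make_task_id is a hypothetical helper, and the choice to hash a JSON array instead of a plain concatenation is an assumption made here to avoid ambiguous inputs.

```python
import hashlib
import json
from datetime import date


def make_task_id(customer_id: str, action_type: str, day: date) -> str:
    """Derive a stable task ID from the workflow inputs (hypothetical helper)."""
    # Serialize as a JSON array rather than concatenating raw strings, so that
    # ("ab", "c") and ("a", "bc") cannot produce the same hash input.
    payload = json.dumps([customer_id, action_type, day.isoformat()])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = make_task_id("cust_123", "refund", date(2024, 1, 15))
again = make_task_id("cust_123", "refund", date(2024, 1, 15))
other = make_task_id("cust_123", "cancel", date(2024, 1, 15))
```

The same inputs always produce the same ID, so a retried workflow finds its earlier checkpoints. One design consequence: two legitimate requests for the same action by the same customer on the same day would share an ID, so if that can happen in your workflows, include something more specific, such as the call ID, in the hashed inputs.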

Monitoring Recovery in Production

Durable execution handles the "survive failure" part. You still need to know whether it's working once it's live.

Two metrics matter most:

Task completion rate. What percentage of started workflows reach a terminal state, whether success or handled failure? A drop here indicates a new category of unrecoverable failure that your checkpointing doesn't handle.

Step retry rate by step name. Which steps are failing most often? A high retry rate on update_crm points to a CRM API reliability issue. A high rate on process_payment suggests a payments integration problem. Per-step retry rates make failures specific and actionable.
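Both metrics can be computed from data a checkpoint layer already has. The sketch below assumes a hypothetical log with one row per step attempt and a task table with one status per workflow. It also takes "retry rate" to mean failed attempts divided by total attempts, which is one reasonable definition among several.

```python
from collections import Counter

# Hypothetical attempt log: one row per step attempt.
attempts = [
    {"task_id": "t1", "step": "fetch_account", "ok": True},
    {"task_id": "t1", "step": "update_crm", "ok": False},
    {"task_id": "t1", "step": "update_crm", "ok": True},
    {"task_id": "t2", "step": "fetch_account", "ok": True},
    {"task_id": "t2", "step": "update_crm", "ok": False},
]
# Hypothetical task table: current status per started workflow.
tasks = {"t1": "succeeded", "t2": "running", "t3": "failed_handled"}
TERMINAL = {"succeeded", "failed_handled"}


def completion_rate(tasks: dict[str, str]) -> float:
    """Share of started workflows that reached a terminal state."""
    if not tasks:
        return 0.0
    return sum(status in TERMINAL for status in tasks.values()) / len(tasks)


def retry_rate_by_step(attempts: list[dict]) -> dict[str, float]:
    """Failed attempts divided by total attempts, per step name."""
    total = Counter(a["step"] for a in attempts)
    failed = Counter(a["step"] for a in attempts if not a["ok"])
    return {step: failed[step] / total[step] for step in total}


rate = completion_rate(tasks)
retries = retry_rate_by_step(attempts)
```

In this sample data update_crm fails on two of its three attempts while fetch_account never fails, which is the kind of per-step signal that points at a specific integration.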

With Chanl's analytics and monitoring, you can instrument your durable workflows to track both without manually wiring up each activity. Each tool call in a Chanl-connected workflow emits a span, giving you the retry-rate view by step name out of the box.

chanl-workflow-monitor.ts·typescript

import Chanl from "@chanl/sdk";
 
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });
 
// Log each workflow outcome after it completes
await chanl.calls.logWorkflowOutcome({
  callId: input.callId,
  workflowType: "cancellation_save",
  completedSteps: completedSteps,
  outcome: result.outcome,
  retryCount: totalAttempts - completedSteps.length,
});

You can also set up scenario tests that deliberately inject failures at specific steps -- crashing after calculate_retention_offer, timing out write_retention_record -- and verify that your checkpoint layer handles each case correctly before it ever happens in production. A scenario that injects a crash right after step 3 completes and confirms the workflow resumes at step 4 is a concrete regression test for your durability guarantee.
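A crash-injection scenario of that kind can be sketched in a few lines of plain Python. Everything here is a hypothetical test harness and not a Chanl or Temporal API: run_workflow, InjectedCrash, and the dict standing in for a persistent checkpoint table are assumptions made for illustration.

```python
from typing import Callable, Optional


class InjectedCrash(Exception):
    """Raised by the test harness to simulate the process dying."""


def run_workflow(
    steps: list[tuple[str, Callable[[], str]]],
    store: dict,
    crash_after: Optional[str] = None,
) -> list[str]:
    """Run steps in order, skipping any already recorded in `store`.

    Returns the names of the steps actually executed in this run.
    """
    executed = []
    for name, fn in steps:
        if name in store:        # checkpoint hit: skip the completed step
            continue
        store[name] = fn()       # run the step, then record its result
        executed.append(name)
        if name == crash_after:  # simulate a crash right after the checkpoint
            raise InjectedCrash(name)
    return executed


calls: list[str] = []  # records every real side effect


def make_step(name: str) -> Callable[[], str]:
    def step() -> str:
        calls.append(name)
        return f"{name}:done"
    return step


STEP_NAMES = [
    "fetch_account",
    "calculate_retention_offer",
    "present_offer",
    "write_retention_record",
    "update_crm",
]
steps = [(n, make_step(n)) for n in STEP_NAMES]

store: dict = {}  # stands in for a persistent checkpoint table
try:
    run_workflow(steps, store, crash_after="present_offer")  # crash after step 3
except InjectedCrash:
    pass

resumed = run_workflow(steps, store)  # restart with the same store
```

The regression check is that the resumed run executes only steps 4 and 5 and that every step's side effect happened exactly once across both runs. Repeating the scenario with crash_after set to each step name in turn covers every checkpoint boundary.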

Where to Start This Week

The path from "we have no durable execution" to "our workflows survive failures" doesn't require a full platform migration.

Step 1: Identify your long-running CX workflows. Any task that touches more than two external systems, runs longer than five seconds, or has a post-call async component is a candidate. For most CX agents, this is three to five workflows.

Step 2: Add idempotency keys to every external call. Before adding any orchestration framework, make sure every external API call in those workflows accepts and deduplicates an idempotency key. This is the prerequisite for safe retries. Without it, no amount of checkpointing will prevent duplicate side effects.

Step 3: Choose your persistence layer. For simple queue-based workflows: a checkpoint table in your existing database plus the lightweight pattern above. For graph-based agent workflows: LangGraph with a PostgreSQL checkpointer. For production-scale multi-step workflows with wait states and long horizons: Temporal.

Step 4: Test failure scenarios before they happen. Use Chanl's scenario testing to inject artificial failures at each step and verify that your recovery logic produces the right outcome. A partial refund that gets retried must not double-charge. A failed CRM write must not lose the call outcome.

Durable execution isn't glamorous. It doesn't show up in a demo. But it's the infrastructure layer that means your CX agents actually complete what they start, every time, no matter what the infrastructure does underneath. The Build phase of your agent isn't done until you've answered: what happens when it fails on step 4?




Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
