Your CX Agent Crashes Mid-Task. Here's the Fix.

When your CX agent crashes mid-refund or mid-booking, the customer is stuck. Durable execution guarantees long-running agent tasks survive failures. Here's how to build it.

Dean Grover, Co-founder
May 7, 2026
14 min read
[Figure: an AI agent resuming a multi-step workflow from the last checkpoint after a crash]

Picture this: a customer calls your support line to cancel a subscription. Your AI agent starts the cancellation-save flow. It checks their account tier, looks up their usage in the last 30 days, calculates a targeted discount based on that data, and begins writing the outcome to your CRM. Somewhere between the discount calculation and the CRM write, your agent process crashes.

The customer hangs up thinking they were transferred. Your CRM has no record. The billing system still has them as active. When the customer calls back, your agent starts the conversation from scratch with no memory of what just happened.

This is the durable execution problem. It's one of the most underestimated production failure modes for CX agents, and it gets worse as your agents take on longer and more consequential tasks. You've invested in conversation quality, RAG pipelines, and tool integrations. But if your agent can't survive infrastructure failures mid-task, none of that reliability work matters when it counts.

What Durable Execution Actually Means

Durable execution means an agent's workflow is guaranteed to run to completion, even when the underlying process crashes, times out, or loses network access. Every step is checkpointed to persistent storage. When the agent restarts, it doesn't re-run from the beginning; it picks up from the last successful step with the same state it had before the failure.

The concept isn't new to distributed systems engineers. Message queues and workflow orchestrators have used it for decades. What's new is the collision between long-running agent workflows and the operational assumptions most teams bring from stateless API development, where a crash means a clean restart and nothing worse.

Think about the tasks your CX agents actually handle:

  • Process a refund (CRM lookup, billing API, confirmation email, audit log)
  • Schedule a follow-up call (calendar availability check, SMS send, 48-hour reminder)
  • Handle a cancellation save (account analysis, offer calculation, retention write, outcome log)
  • Escalate with context (call history fetch, conversation summary, ticket creation, queue assignment)

Every one of these is a multi-step workflow with side effects at each step. If step 4 fails and you retry from step 1, you'll hit the billing API a second time, send duplicate emails, and create ghost records in your CRM. The agent isn't just slow; it's actively creating problems for your customers and your data.

Where Traditional Error Handling Breaks Down

The standard instinct when a step fails is to wrap it in a try/catch and retry. That works for idempotent operations on a single system. It breaks down for agent workflows in three specific ways.

In-memory state disappears on crash. If your Python or Node process dies, the local variables holding workflow progress are gone. You either restart from the beginning or rely on a state persistence layer you built yourself. Most teams haven't built that layer.

Retries create duplicates for non-idempotent operations. An email send, a billing write, or a CRM update is not safe to retry without precautions. You need idempotency keys, deduplication logic, and careful sequencing for every external call. This is the kind of infrastructure that typically takes weeks to build correctly and months to trust.

Long-running workflows outlive single request lifetimes. A follow-up task that fires 48 hours after a call can't live in a web server's memory. You need a scheduler, a state store, and a mechanism to resume the workflow when the timer fires. A workflow orchestrator such as Temporal (covered below) solves all three together; most home-grown solutions solve one of the three and paper over the others.
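To make the "scheduler plus state store" requirement concrete, here's a minimal sketch of a timer that survives a restart because the due time lives in a database rather than in process memory. The table name `scheduled_steps` and the helpers `open_store`, `schedule_step`, and `claim_due_steps` are all hypothetical names for illustration, and SQLite stands in for whatever database you already run.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) scheduled_steps table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS scheduled_steps (
            task_id   TEXT NOT NULL,
            step_name TEXT NOT NULL,
            due_at    REAL NOT NULL,
            fired     INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (task_id, step_name)
        )
        """
    )
    return conn


def schedule_step(
    conn: sqlite3.Connection, task_id: str, step_name: str, due_at: datetime
) -> None:
    """Persist a future step. INSERT OR IGNORE makes scheduling safe to retry."""
    conn.execute(
        "INSERT OR IGNORE INTO scheduled_steps (task_id, step_name, due_at) "
        "VALUES (?, ?, ?)",
        (task_id, step_name, due_at.timestamp()),  # stored as a Unix timestamp
    )
    conn.commit()


def claim_due_steps(conn: sqlite3.Connection, now: datetime) -> list:
    """Return (task_id, step_name) pairs that are due and mark them as fired."""
    rows = conn.execute(
        "SELECT task_id, step_name FROM scheduled_steps "
        "WHERE fired = 0 AND due_at <= ? ORDER BY due_at",
        (now.timestamp(),),
    ).fetchall()
    for task_id, step_name in rows:
        conn.execute(
            "UPDATE scheduled_steps SET fired = 1 "
            "WHERE task_id = ? AND step_name = ?",
            (task_id, step_name),
        )
    conn.commit()
    return rows


if __name__ == "__main__":
    conn = open_store()
    call_ended = datetime(2026, 5, 7, 12, 0, tzinfo=timezone.utc)
    schedule_step(conn, "call_123", "followup_reminder", call_ended + timedelta(hours=48))
    print(claim_due_steps(conn, call_ended + timedelta(hours=49)))
```

A worker that calls `claim_due_steps` on a short interval can restart at any point without losing the 48-hour reminder. Note that this sketch marks a step as fired before running it, so a crash right after claiming would drop the step; a production version would mark it after the step completes and lean on idempotency keys to absorb the resulting duplicates.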

We wrote about a closely related problem in async agent tasks between conversations -- the coordination challenge when agents need to act outside a live session. Durable execution is the infrastructure layer that makes that coordination reliable rather than fire-and-forget.

How Temporal Makes Agents Fault-Tolerant

Temporal is a workflow orchestration platform built around durable execution. Your agent workflow is written as ordinary Python code, and Temporal records the result of every completed activity in a persistent event history. If the worker process dies, another worker replays the workflow code against that history: already-completed activities return their recorded results instead of running again, and real execution resumes at the first incomplete one.

Here's what a durable cancellation-save flow looks like with Temporal and the Python SDK:

cancellation_save_workflow.py·python
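from dataclasses import dataclass
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# Assumption: the five activities live in a separate `activities` module and are
# each decorated with @activity.defn. Importing them this way keeps the workflow
# sandbox from re-importing their dependencies.
with workflow.unsafe.imports_passed_through():
    from activities import (
        calculate_retention_offer,
        fetch_account_data,
        present_offer_and_wait,
        update_crm,
        write_retention_record,
    )

@dataclass
class CancellationInput:
    customer_id: str
    call_id: str

@workflow.defn
class CancellationSaveWorkflow:
    @workflow.run
    async def run(self, params: CancellationInput) -> dict:
        # Each completed activity's result is recorded in the workflow history
        account = await workflow.execute_activity(
            fetch_account_data,
            params.customer_id,
            start_to_close_timeout=timedelta(seconds=10),
        )

        offer = await workflow.execute_activity(
            calculate_retention_offer,
            account,
            start_to_close_timeout=timedelta(seconds=5),
        )

        # This activity waits for the live call to complete
        outcome = await workflow.execute_activity(
            present_offer_and_wait,
            {"offer": offer, "call_id": params.call_id},
            start_to_close_timeout=timedelta(minutes=10),
        )

        if outcome["accepted"]:
            await workflow.execute_activity(
                write_retention_record,
                {"customer_id": params.customer_id, "offer": offer},
                start_to_close_timeout=timedelta(seconds=10),
                # Illustrative values; without this, Temporal's default retry policy applies
                retry_policy=RetryPolicy(
                    initial_interval=timedelta(seconds=1),
                    backoff_coefficient=2.0,
                    maximum_attempts=5,
                ),
            )

        await workflow.execute_activity(
            update_crm,
            {"customer_id": params.customer_id, "outcome": outcome},
            start_to_close_timeout=timedelta(seconds=10),
        )

        return outcome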

If write_retention_record fails, Temporal retries just that activity according to its retry policy (a default policy with exponential backoff applies when you don't set one). The present_offer_and_wait step (which waited for the live conversation) is never re-executed because Temporal has already recorded its result in the workflow history.

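Notice that the workflow is ordinary async Python. There's no special state machine DSL, no Redis keys to juggle, no polling loop. The durability comes from the Temporal worker infrastructure, not from custom application code. The one constraint is that workflow code must be deterministic so that replay follows the same path; anything non-deterministic (API calls, clocks, randomness) belongs in an activity.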

In March 2026, Temporal released an official integration with the OpenAI Agents SDK, which means you can run OpenAI-based agent graphs as durable workflows with a few lines of configuration. The same pattern applies to any framework that supports async execution.

LangGraph's Checkpoint Pattern

If you're already on LangGraph, you get durability through its built-in checkpointer. LangGraph serializes node outputs to a configurable backend -- PostgreSQL, Redis, or in-memory for development -- and resumes from the last completed node on failure.

booking_graph.py·python
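import os

import psycopg
from psycopg.rows import dict_row
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.postgres import PostgresSaver

# Assumptions: BookingState, the four node functions, booking_id, and
# initial_state are defined elsewhere; the connection string comes from the environment.
DATABASE_URL = os.environ["DATABASE_URL"]

# Manually created connections need autocommit=True and row_factory=dict_row
conn = psycopg.connect(DATABASE_URL, autocommit=True, row_factory=dict_row)
checkpointer = PostgresSaver(conn)
checkpointer.setup()  # creates the checkpoint tables; required on first use

builder = StateGraph(BookingState)
builder.add_node("check_availability", check_availability)
builder.add_node("hold_slot", hold_slot)
builder.add_node("process_payment", process_payment)
builder.add_node("send_confirmation", send_confirmation)

builder.set_entry_point("check_availability")
builder.add_edge("check_availability", "hold_slot")
builder.add_edge("hold_slot", "process_payment")
builder.add_edge("process_payment", "send_confirmation")
builder.add_edge("send_confirmation", END)

graph = builder.compile(checkpointer=checkpointer)

# thread_id ties together the checkpointed state for this booking
config = {"configurable": {"thread_id": f"booking_{booking_id}"}}
# PostgresSaver is synchronous, so call invoke(); ainvoke() needs
# AsyncPostgresSaver from langgraph.checkpoint.postgres.aio instead
result = graph.invoke(initial_state, config=config)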

The thread_id is the key. To resume a failed run, you invoke the graph again with the same thread_id and None as the input, and LangGraph continues from the last saved checkpoint instead of starting over at the entry point. If process_payment failed on the first attempt, the second attempt skips check_availability and hold_slot entirely -- their results are already in the checkpoint store.

LangGraph's approach works well if your agents are already graph-based and you want to stay within the LangChain framework. Temporal handles more complex coordination patterns, including workflows that span days or weeks and workflows that pause to wait for external signals (like a human approval).

[Diagram: CX cancellation-save workflow -- each node is checkpointed before the next step begins. Nodes: Receive cancellation intent; Fetch account data; Calculate retention offer; Present offer during live call; Write retention record; Log outcome to CRM; Update billing status; Send confirmation email; Schedule 30-day follow-up check. Branch labels: accepted, declined.]

The CX-Specific Failure Modes You'll Actually Hit

Most durable execution writing focuses on batch processing and data pipelines. CX agents have a distinct failure profile that makes the standard advice incomplete.

Side-effect amplification. A customer-facing action is never just a database write. Sending a confirmation email, posting a CRM note, and logging a billing event all happen together. If the CRM note fails and you retry all three without idempotency keys, you end up with two emails, two billing logs, and one CRM note. The fix: pass a stable task_id (deterministically derived from customer_id + action_type + date) to every external API so duplicate calls are safely deduplicated on the server side. This only helps if the API actually honors idempotency keys, and a date-based key treats two identical actions on the same day as one, so add a finer-grained component if that can legitimately happen.
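Here's a small sketch of that fix. `make_task_id` follows the customer_id + action_type + date recipe above; `FakeEmailApi` is a hypothetical stand-in for an external service that deduplicates on an idempotency key, not a real client library.

```python
import hashlib
from datetime import date


def make_task_id(customer_id: str, action_type: str, day: date) -> str:
    """Derive a stable key: the same inputs always produce the same task_id."""
    # The "|" separator keeps ("ab", "c") and ("a", "bc") from hashing identically.
    raw = "|".join([customer_id, action_type, day.isoformat()])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class FakeEmailApi:
    """Hypothetical external API that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.sent = []      # recipients that actually received an email
        self._seen = {}     # idempotency key -> receipt of the original call

    def send(self, idempotency_key: str, to: str, body: str) -> str:
        if idempotency_key in self._seen:
            # Duplicate call: return the original receipt, send nothing.
            return self._seen[idempotency_key]
        receipt = f"msg_{len(self.sent) + 1}"
        self.sent.append(to)
        self._seen[idempotency_key] = receipt
        return receipt


if __name__ == "__main__":
    key = make_task_id("cust_42", "refund_confirmation", date(2026, 5, 7))
    api = FakeEmailApi()
    api.send(key, "customer@example.com", "Your refund is on its way.")
    api.send(key, "customer@example.com", "Your refund is on its way.")  # retry
    print(len(api.sent))
```

Because the key is derived from the inputs rather than generated randomly, a retry after a crash produces the same key even though the original process and its memory are gone.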

Human wait states. Some CX workflows include a human-in-the-loop step: a supervisor reviews a refund above a threshold, or a billing dispute requires a human decision before the next action fires. Temporal handles this with a signal handler plus workflow.wait_condition(): the handler records the decision on the workflow, and the workflow waits, consuming no worker compute, until the condition becomes true. Most home-grown schedulers can't model this without a polling loop that runs forever.

Post-call async tasks. A follow-up SMS four hours after the call, a satisfaction survey trigger 24 hours later, an escalation reminder if no response in 48 hours -- these are natural parts of a CX workflow that need to be durably scheduled, not fire-and-forget cron jobs that silently fail. We explored the customer experience gaps these create in CX agents that fail between conversations.

Multi-agent handoffs. When your triage agent hands off to a specialist agent, the handoff itself is a state transition that can fail. If the specialist agent is temporarily unavailable, does the triage agent retry, wait, or drop the task? Durable execution gives you a clean model for all three choices. The orchestration patterns for structuring these handoffs are covered in multi-agent orchestration patterns for production.

Building a Lightweight Checkpoint Store Without a Framework

Not every CX team is ready to add Temporal or LangGraph to their stack. If you're building on a simpler agent framework, you can get 80% of the benefit with a lightweight checkpoint store built on your existing database.

The pattern has three parts:

1. Deterministic task IDs. Before starting a workflow, generate a task_id from the inputs: sha256(customer_id + action_type + date). Write it to your database with status: started before you execute anything.

2. Step completion logging. After each step succeeds, write a record: {task_id, step_name, result, completed_at}. Before executing any step, check whether that step already has a completion record.

3. Resume logic on retry. Load the completion records on retry, skip already-completed steps, and resume at the first incomplete one.

checkpoint-runner.ts·typescript
type StepResult = {
  taskId: string;
  stepName: string;
  result: unknown;
  completedAt: string;
};
 
async function runWithCheckpoint<T>(
  taskId: string,
  stepName: string,
  fn: () => Promise<T>
): Promise<T> {
  const existing = await db.stepResults.findOne({ taskId, stepName });
  if (existing) {
    return existing.result as T; // already completed, return cached result
  }
 
  const result = await fn();
  await db.stepResults.insertOne({
    taskId,
    stepName,
    result,
    completedAt: new Date().toISOString(),
  });
  return result;
}
 
// Usage in a booking workflow
async function runBookingWorkflow(bookingId: string, input: BookingInput) {
  const taskId = `booking_${bookingId}`;
 
  const availability = await runWithCheckpoint(taskId, "check_availability", () =>
    checkCalendarAvailability(input.requestedSlot)
  );
 
  const slot = await runWithCheckpoint(taskId, "hold_slot", () =>
    holdCalendarSlot(availability.bestSlot)
  );
 
  const payment = await runWithCheckpoint(taskId, "process_payment", () =>
    processPayment(input.paymentMethod, input.amount)
  );
 
  await runWithCheckpoint(taskId, "send_confirmation", () =>
    sendConfirmationEmail(input.customerEmail, slot, payment.receiptId)
  );
}

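This is simpler than Temporal but handles the common queue-based retry case. One caveat: if the process crashes after a step's function succeeds but before its completion record is written, that step runs again on retry, so the idempotency keys described earlier are still required. The other gap is long wait states and complex branching -- if you need those patterns, the lightweight approach will eventually become painful and you should migrate to Temporal or LangGraph.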

Monitoring Recovery in Production

Durable execution handles the "survive failure" part. You still need to know whether it's working once it's live.

Two metrics matter most:

Task completion rate. What percentage of started workflows reach a terminal state, whether success or handled failure? A drop here indicates a new category of unrecoverable failure that your checkpointing doesn't handle.

Step retry rate by step name. Which steps are failing most often? A high retry rate on update_crm points to a CRM API reliability issue. A high rate on process_payment suggests a payments integration problem. Per-step retry rates make failures specific and actionable.
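If you're computing these yourself, both metrics fall out of a simple attempt log. This is a rough sketch under assumed names: the `StepAttempt` record and both functions are hypothetical, and "retry" here means any attempt beyond the first for a given task and step.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class StepAttempt:
    """One execution attempt of one step (hypothetical log record)."""
    task_id: str
    step_name: str


def step_retry_rates(attempts: list) -> dict:
    """Per step: share of attempts that were retries rather than first tries."""
    total = Counter(a.step_name for a in attempts)
    # Each distinct (task, step) pair contributes exactly one first attempt.
    first_tries = Counter(
        step for _, step in {(a.task_id, a.step_name) for a in attempts}
    )
    return {step: (total[step] - first_tries[step]) / total[step] for step in total}


def task_completion_rate(started: set, terminal: set) -> float:
    """Share of started tasks that reached a terminal state."""
    if not started:
        return 0.0
    return len(started & terminal) / len(started)


if __name__ == "__main__":
    log = [
        StepAttempt("t1", "update_crm"),
        StepAttempt("t1", "update_crm"),   # retry after a CRM failure
        StepAttempt("t2", "update_crm"),
        StepAttempt("t1", "process_payment"),
    ]
    print(step_retry_rates(log))
    print(task_completion_rate({"t1", "t2", "t3", "t4"}, {"t1", "t2", "t3"}))
```

Grouping by step name is what makes the numbers actionable: a single workflow-level retry count would hide the fact that one integration is responsible for most of the failures.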

With Chanl's analytics and monitoring, you can instrument your durable workflows to track both without manually wiring up each activity. Each tool call in a Chanl-connected workflow emits a span, giving you the retry-rate view by step name out of the box.

chanl-workflow-monitor.ts·typescript
import Chanl from "@chanl/sdk";
 
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });
 
// Log each workflow outcome after it completes
await chanl.calls.logWorkflowOutcome({
  callId: input.callId,
  workflowType: "cancellation_save",
  completedSteps: completedSteps,
  outcome: result.outcome,
  retryCount: totalAttempts - completedSteps.length,
});

You can also set up scenario tests that deliberately inject failures at specific steps -- crashing after calculate_retention_offer, timing out write_retention_record -- and verify that your checkpoint layer handles each case correctly before it ever happens in production. A scenario that injects a crash at step 3 and confirms the workflow resumes at step 4 is a concrete regression test for your durability guarantee.
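A failure-injection test along those lines can be quite small. The sketch below is framework-free and uses an in-memory dict as the checkpoint store; `InjectedCrash`, `run_workflow`, and `crash_then_resume_scenario` are hypothetical names, and the crash is injected just before a step runs, so the resumed attempt is expected to start at that same step.

```python
from typing import Any, Callable, Optional


class InjectedCrash(RuntimeError):
    """Simulates the worker process dying right before a step runs."""


def run_workflow(
    steps: list,
    checkpoints: dict,
    crash_before: Optional[str] = None,
) -> dict:
    """Run (name, fn) steps in order, skipping any that already have a checkpoint."""
    for name, fn in steps:
        if name in checkpoints:
            continue  # completed in an earlier attempt
        if name == crash_before:
            raise InjectedCrash(name)
        checkpoints[name] = fn()
    return checkpoints


def crash_then_resume_scenario() -> list:
    """Crash before write_retention_record, resume, and report every step call."""
    calls = []

    def make_step(name: str) -> Callable[[], Any]:
        def step() -> str:
            calls.append(name)  # record the side effect
            return f"{name}:ok"
        return step

    names = [
        "fetch_account_data",
        "calculate_retention_offer",
        "write_retention_record",
        "update_crm",
    ]
    steps = [(name, make_step(name)) for name in names]
    checkpoints = {}  # survives the "crash", like a database row would

    try:
        run_workflow(steps, checkpoints, crash_before="write_retention_record")
    except InjectedCrash:
        pass  # first attempt died mid-workflow

    run_workflow(steps, checkpoints)  # second attempt resumes
    return calls


if __name__ == "__main__":
    print(crash_then_resume_scenario())
```

The assertion that matters is that every step appears exactly once in the call log: the two steps that finished before the crash are not repeated, and the two that hadn't run yet still complete.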

Where to Start This Week

The path from "we have no durable execution" to "our workflows survive failures" doesn't require a full platform migration.

Step 1: Identify your long-running CX workflows. Any task that touches more than two external systems, runs longer than five seconds, or has a post-call async component is a candidate. For most CX agents, this is three to five workflows.

Step 2: Add idempotency keys to every external call. Before adding any orchestration framework, make sure every external API call in those workflows accepts and deduplicates an idempotency key. This is the prerequisite for safe retries. Without it, no amount of checkpointing will prevent duplicate side effects.

Step 3: Choose your persistence layer. For simple queue-based workflows: a checkpoint table in your existing database plus the lightweight pattern above. For graph-based agent workflows: LangGraph with a PostgreSQL checkpointer. For production-scale multi-step workflows with wait states and long horizons: Temporal.

Step 4: Test failure scenarios before they happen. Use Chanl's scenario testing to inject artificial failures at each step and verify that your recovery logic produces the right outcome. A partial refund that gets retried must not double-charge. A failed CRM write must not lose the call outcome.

Durable execution isn't glamorous. It doesn't show up in a demo. But it's the infrastructure layer that means your CX agents actually complete what they start, every time, no matter what the infrastructure does underneath. The Build phase of your agent isn't done until you've answered: what happens when it fails on step 4?
