
How to Build Agent Interrupt and Approval Checkpoints

How to pause an AI agent before high-stakes actions, persist full state through the approval window, and resume cleanly. Covers interrupt gates, approval queues, checkpointing, and EU AI Act compliance for production CX agents.

Dean Grover, Co-founder
May 24, 2026
16 min read
A Traffic Light Showing Amber Beside a Circuit Board Pattern, Representing a Deliberate Pause in an Automated Workflow

Your agent has done everything right. It verified the customer's identity, confirmed the return window, checked the order history, and calculated the refund. Now it's one function call away from sending $4,800 to a debit card.

Should it go?

If your answer is "it depends," you need the interrupt pattern.

The interrupt pattern is how production CX agents handle the gap between "the agent got it right" and "I'm comfortable letting it decide alone." It's not about distrusting your agent. It's about knowing exactly which actions are safe to delegate completely, and which ones need one more set of eyes. Not forever. Just for a beat.

Here's how to build it properly.

What Actions Are Worth Pausing On?

The point of an interrupt isn't to second-guess every decision. It's to identify the narrow category of actions where autonomous execution creates risk you're not ready to accept, then pause only those.

Four categories earn an interrupt:

Irreversible actions are the clearest case. A refund sent, a subscription cancelled, an account deleted. None of these can be undone with a follow-up API call. The cost of pausing once is lower than the cost of a reversal, if one even exists.

High-value actions earn a threshold. You decide the number: $200, $500, $2,000. Below it, the agent acts. Above it, a supervisor sees the proposed action before it fires.

Ambiguous intent catches the edge cases your training data didn't. When a customer's request contradicts their account state, or when sentiment signals conflict with the words, a pause gives a human a chance to interpret before the agent commits.

Regulated actions depend on your jurisdiction. Article 14 of the EU AI Act requires documented human oversight for high-risk AI systems in financial services, employment, and adjacent areas, with those obligations scheduled to apply from August 2026. If your agents serve EU customers in those areas, this category isn't optional.

The decision about when to pause is the strategy question. We covered it in depth in When and Where Should Humans Intervene in AI Workflows. This article is about how to implement the pause once you know where it belongs.

The Four Components of a Working Interrupt

A working interrupt requires four things in sequence: an interrupt gate that intercepts the flagged call, a checkpoint that preserves agent state, a notification that reaches the right person, and a resume path that reconstructs the workflow correctly.

Drop any one of these and you get a broken pattern. A gate without a checkpoint loses the agent's reasoning. A checkpoint without a notification means the pause never gets human attention. A notification without a resume path leaves the agent stuck indefinitely.

Here's each component in detail.

The Interrupt Gate

The gate is the code that sits in front of specific tool calls. When the agent prepares to execute a flagged action, the gate intercepts it and routes it to a pending state instead of firing immediately.

interrupt-gate.ts·typescript
interface InterruptRule {
  toolName: string;
  condition: (args: Record<string, unknown>) => boolean;
  reason: string;
  approverRole: string;
  timeoutMinutes: number;
}
 
interface InterruptDecision {
  shouldInterrupt: boolean;
  reason?: string;
  approverRole?: string;
  timeoutMinutes?: number;
}
 
class InterruptGate {
  constructor(private rules: InterruptRule[]) {}
 
  evaluate(
    toolName: string,
    args: Record<string, unknown>
  ): InterruptDecision {
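    // Match on tool name and condition together, so one tool can carry
    // several rules (tiered thresholds, for example) and the first rule
    // whose condition fires wins.
    const rule = this.rules.find(
      (r) => r.toolName === toolName && r.condition(args)
    );
    if (!rule) return { shouldInterrupt: false };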
 
    return {
      shouldInterrupt: true,
      reason: rule.reason,
      approverRole: rule.approverRole,
      timeoutMinutes: rule.timeoutMinutes,
    };
  }
}
 
const gate = new InterruptGate([
  {
    toolName: "issue_refund",
    condition: (args) => (args.amount as number) > 500,
    reason: "Refund exceeds $500 threshold. Supervisor review required.",
    approverRole: "supervisor",
    timeoutMinutes: 30,
  },
  {
    toolName: "cancel_subscription",
    condition: () => true,
    reason: "Subscription cancellation is irreversible",
    approverRole: "retention_lead",
    timeoutMinutes: 15,
  },
  {
    toolName: "send_contract",
    condition: (args) => (args.value as number) > 10000,
    reason: "Contract value exceeds $10k. Legal review required.",
    approverRole: "legal",
    timeoutMinutes: 1440,
  },
]);

The gate is deliberately stateless. It doesn't know about conversation history, customer context, or the agent's reasoning. It only answers one question: given this tool name and these arguments, should we interrupt?
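
One thing worth deciding up front is what a condition does with malformed arguments. With a bare cast, a missing or string-typed `amount` compares as false and the call goes through unreviewed. Here's a minimal sketch of a fail-closed condition builder, assuming you'd rather interrupt than skip review on bad input. `exceeds` is a hypothetical helper name, not part of any library.

```typescript
type ToolArgs = Record<string, unknown>;

// Hypothetical helper: builds a rule condition that interrupts when the
// named field is above the threshold, or when it is not a finite number.
function exceeds(field: string, threshold: number) {
  return (args: ToolArgs): boolean => {
    const value = args[field];
    if (typeof value !== "number" || !Number.isFinite(value)) {
      return true; // fail closed: malformed input goes to a human
    }
    return value > threshold;
  };
}

const refundNeedsReview = exceeds("amount", 500);

console.log(refundNeedsReview({ amount: 120 }));   // false
console.log(refundNeedsReview({ amount: 800 }));   // true
console.log(refundNeedsReview({ amount: "800" })); // true
console.log(refundNeedsReview({}));                // true
```

A condition built this way drops into a rule's `condition` field unchanged, since it has the same `(args) => boolean` shape.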

The Checkpoint

This is where most interrupt implementations fail. When you pause the agent, you need to serialize the complete agent state so the resume looks identical to a natural continuation, not a restart.

What goes into the checkpoint:

checkpoint.ts·typescript
interface AgentCheckpoint {
  checkpointId: string;
  conversationId: string;
  createdAt: string;
  expiresAt: string;
 
  // The proposed action
  pendingToolCall: {
    toolName: string;
    args: Record<string, unknown>;
    callId: string;
  };
 
  // Full agent context at the moment of interrupt
  messages: Message[];
  priorToolResults: ToolResult[];
  systemPrompt: string;
  agentVersion: string;
 
  // Interrupt metadata
  interruptReason: string;
  approverRole: string;
  timeoutMinutes: number;
  notificationSentAt?: string;
}

The messages array is the part teams most often shortcut. They store just the pending tool call and plan to reconstruct context from a database on resume. This creates two problems: the reconstructed context may not match the original (customer data can change during the approval window), and the agent restarts its reasoning instead of continuing from the pause point.

Serialize the full conversation state. Yes, it costs more storage. It costs far less than an agent that resumes with a different mental model than the one it had when it paused.
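
To make that concrete, here's a minimal sketch of freezing the conversation at the moment of interrupt. It assumes your messages are plain JSON-serializable objects. `SketchMessage` and `snapshotMessages` are illustrative names, not types from a specific SDK.

```typescript
interface SketchMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Copy by JSON round-trip. The copy has the same shape it will have when
// read back from a JSON column, and it shares no references with `live`.
function snapshotMessages(live: SketchMessage[]): SketchMessage[] {
  return JSON.parse(JSON.stringify(live)) as SketchMessage[];
}

const firstMessage: SketchMessage = {
  role: "user",
  content: "I need a full refund for order ORD-999",
};
const live: SketchMessage[] = [
  firstMessage,
  { role: "assistant", content: "Checking that order now." },
];

const frozen = snapshotMessages(live);

// The live conversation keeps changing while the approval is pending...
live.push({ role: "assistant", content: "I've escalated this for review." });
firstMessage.content = "edited after the pause";

// ...but the checkpoint copy does not.
console.log(frozen.length);                   // 2
console.log(frozen.map((m) => m.content)[0]); // "I need a full refund for order ORD-999"
```

The point of the copy is isolation: whatever happens to the live context during the approval window, the agent resumes from what it actually saw.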

For persistence, use a durable backend:

checkpoint-store.ts·typescript
import { Pool } from "pg";
 
class CheckpointStore {
  constructor(private db: Pool) {}
 
  async save(checkpoint: AgentCheckpoint): Promise<void> {
    await this.db.query(
      `INSERT INTO agent_checkpoints
         (checkpoint_id, conversation_id, payload, status, expires_at)
       VALUES ($1, $2, $3, 'pending', $4)`,
      [
        checkpoint.checkpointId,
        checkpoint.conversationId,
        JSON.stringify(checkpoint),
        checkpoint.expiresAt,
      ]
    );
  }
 
  async get(checkpointId: string): Promise<AgentCheckpoint | null> {
    const result = await this.db.query(
      `SELECT payload FROM agent_checkpoints
       WHERE checkpoint_id = $1
         AND status = 'pending'
         AND expires_at > NOW()`,
      [checkpointId]
    );
    return result.rows[0]?.payload ?? null;
  }
 
  async resolve(
    checkpointId: string,
    decision: "approved" | "rejected",
    modifiedArgs?: Record<string, unknown>
  ): Promise<void> {
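    // Guard on status so only the first decision takes effect. A second
    // approver, or a retried request, gets an error instead of firing the
    // action again.
    const result = await this.db.query(
      `UPDATE agent_checkpoints
       SET status = $2, resolved_at = NOW(), modified_args = $3
       WHERE checkpoint_id = $1
         AND status = 'pending'`,
      [
        checkpointId,
        decision,
        modifiedArgs ? JSON.stringify(modifiedArgs) : null,
      ]
    );
    if (!result.rowCount) {
      throw new Error("Checkpoint not found or already resolved");
    }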
  }
}
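
Whatever backend you choose, one property is worth pinning down in tests: only the first decision on a checkpoint should take effect, so a double-click or two approvers can't fire the same refund twice. Here's a minimal in-memory sketch of that rule. `InMemoryCheckpointStore` and `tryResolve` are illustrative names, and the class is meant as a test stand-in, not a production backend.

```typescript
type CheckpointStatus = "pending" | "approved" | "rejected";

interface StoredCheckpoint {
  status: CheckpointStatus;
  expiresAt: number; // epoch milliseconds
}

// In-memory stand-in for the checkpoint table, for tests only.
class InMemoryCheckpointStore {
  private rows = new Map<string, StoredCheckpoint>();

  save(id: string, expiresAt: number): void {
    this.rows.set(id, { status: "pending", expiresAt });
  }

  // Returns true only for the caller that moves the row out of "pending".
  // Unknown, already-decided, and expired checkpoints all return false.
  tryResolve(
    id: string,
    decision: "approved" | "rejected",
    now: number
  ): boolean {
    const row = this.rows.get(id);
    if (!row || row.status !== "pending" || row.expiresAt <= now) {
      return false;
    }
    row.status = decision;
    return true;
  }
}

const demoStore = new InMemoryCheckpointStore();
const demoNow = 1_000_000;
demoStore.save("cp-1", demoNow + 30 * 60 * 1000);

console.log(demoStore.tryResolve("cp-1", "approved", demoNow)); // true
console.log(demoStore.tryResolve("cp-1", "rejected", demoNow)); // false
```

In SQL, the same guarantee usually comes from putting the status check in the `UPDATE`'s `WHERE` clause and having the caller check the affected row count.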

The Approval Queue

The queue is the interface between the interrupt and the human who needs to act on it. At minimum it needs to show the action the agent wants to take, the arguments it proposes, the conversation context that led here, and three options: approve, modify, reject.

The notification side matters as much as the UI. A pending approval that doesn't reach the right person is dead work:

approval-notifier.ts·typescript
interface ApprovalRequest {
  checkpointId: string;
  approverRole: string;
  toolName: string;
  args: Record<string, unknown>;
  reason: string;
  conversationId: string;
  customerId: string;
  timeoutAt: string;
  reviewUrl: string;
}
 
class ApprovalNotifier {
  async notify(request: ApprovalRequest): Promise<void> {
    switch (request.approverRole) {
      case "supervisor":
        await this.sendSlackAlert(request, "#cx-supervisors");
        break;
      case "legal":
        await this.sendEmailAlert(request, "legal@yourcompany.com");
        break;
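      case "retention_lead":
        await this.sendSlackAlert(request, "#retention-team");
        break;
      default:
        // An unrouted approval would sit in the queue unseen until it expires.
        throw new Error(
          `No notification channel for role "${request.approverRole}"`
        );
    }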
 
    // Audit trail regardless of notification channel
    await this.logToAuditTrail(request);
  }
}
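
Hard-coding roles in a `switch` means every new `approverRole` in a gate rule needs a matching case here. A lookup table with an explicit fallback is one alternative. The sketch below reuses the three routes from the notifier. The fallback channel name is an assumption, as are the `Channel` type and `resolveChannel` function.

```typescript
type Channel =
  | { kind: "slack"; target: string }
  | { kind: "email"; target: string };

// Routes copied from the notifier above.
const ROUTES = new Map<string, Channel>([
  ["supervisor", { kind: "slack", target: "#cx-supervisors" }],
  ["legal", { kind: "email", target: "legal@yourcompany.com" }],
  ["retention_lead", { kind: "slack", target: "#retention-team" }],
]);

// Assumed catch-all channel for roles nobody has configured yet.
const FALLBACK: Channel = { kind: "slack", target: "#cx-approvals-unrouted" };

function resolveChannel(
  approverRole: string
): { channel: Channel; routed: boolean } {
  const channel = ROUTES.get(approverRole);
  return channel
    ? { channel, routed: true }
    : { channel: FALLBACK, routed: false };
}

console.log(resolveChannel("legal").channel.target); // "legal@yourcompany.com"
console.log(resolveChannel("finance").routed);       // false
```

The `routed` flag lets the caller log or alert on unconfigured roles while the request still lands somewhere a human will see it.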

Resume, Modify, Reject

The resume path reconstructs the agent's execution context from the checkpoint and re-injects it at the exact moment of the pause. The agent shouldn't need to re-derive anything it already computed.

agent-runner.ts·typescript
interface ApprovalDecision {
  approved: boolean;
  approverName: string;
  rejectionReason?: string;
  modifiedArgs?: Record<string, unknown>;
}
 
class AgentRunner {
  async resumeFromCheckpoint(
    checkpointId: string,
    decision: ApprovalDecision
  ): Promise<void> {
    const checkpoint = await this.store.get(checkpointId);
    if (!checkpoint) throw new Error("Checkpoint not found or expired");
 
    await this.store.resolve(
      checkpointId,
      decision.approved ? "approved" : "rejected",
      decision.modifiedArgs
    );
 
    if (!decision.approved) {
      // Fall back to a neutral phrase so the model never sees "undefined".
      const reason = decision.rejectionReason ?? "no reason given";
      await this.continueWithResult(checkpoint, {
        toolCallId: checkpoint.pendingToolCall.callId,
        content: `Action rejected by ${decision.approverName}: ${reason}. Please inform the customer and offer alternatives.`,
      });
      return;
    }
 
    const finalArgs =
      decision.modifiedArgs ?? checkpoint.pendingToolCall.args;
 
    const result = await this.tools.call(
      checkpoint.pendingToolCall.toolName,
      finalArgs
    );
 
    await this.continueWithResult(checkpoint, {
      toolCallId: checkpoint.pendingToolCall.callId,
      content: JSON.stringify(result),
    });
  }
 
  private async continueWithResult(
    checkpoint: AgentCheckpoint,
    toolResult: { toolCallId: string; content: string }
  ): Promise<void> {
    const messages = [
      ...checkpoint.messages,
      {
        role: "tool" as const,
        content: toolResult.content,
        tool_call_id: toolResult.toolCallId,
      },
    ];
 
    await this.run({
      messages,
      systemPrompt: checkpoint.systemPrompt,
      priorToolResults: [
        ...checkpoint.priorToolResults,
        toolResult,
      ],
    });
  }
}

Here's the complete flow from tool call to resume:

[Sequence diagram] Participants: Agent, Interrupt Gate, Checkpoint Store, Approval Queue, Approver, Tool Executor. Steps, in order: prepare tool call (issue_refund, $800); evaluate against rules; save full checkpoint; create approval request; notify via Slack/email; approve, modify, or reject; record decision; resume with tool result; execute approved action; return result; continue conversation.
Agent interrupt flow from tool call to human approval and resume
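
The approver has three options, but the decision object only carries an `approved` flag and optional `modifiedArgs`. If you want your records to distinguish a plain approval from a modification, one way is to classify the decision before executing. This is a minimal sketch under that assumption. `classifyDecision` and the `Outcome` type are hypothetical names, and modified arguments replace the proposed ones wholesale, as in the runner above.

```typescript
type Args = Record<string, unknown>;

interface Decision {
  approved: boolean;
  modifiedArgs?: Args;
}

type Outcome =
  | { kind: "rejected" }
  | { kind: "approved"; finalArgs: Args }
  | { kind: "modified"; finalArgs: Args; changedFields: string[] };

function classifyDecision(proposed: Args, decision: Decision): Outcome {
  if (!decision.approved) return { kind: "rejected" };
  if (!decision.modifiedArgs) return { kind: "approved", finalArgs: proposed };

  const finalArgs = decision.modifiedArgs;
  const keys = Object.keys(proposed).concat(Object.keys(finalArgs));
  const changedFields = keys
    .filter((key, index) => keys.indexOf(key) === index) // de-duplicate
    .filter(
      (key) => JSON.stringify(proposed[key]) !== JSON.stringify(finalArgs[key])
    )
    .sort();

  // An approver who resubmits identical arguments has simply approved.
  return changedFields.length === 0
    ? { kind: "approved", finalArgs }
    : { kind: "modified", finalArgs, changedFields };
}

const outcome = classifyDecision(
  { orderId: "ORD-999", amount: 850 },
  { approved: true, modifiedArgs: { orderId: "ORD-999", amount: 400 } }
);
console.log(outcome.kind); // "modified"
```

Recording `changedFields` alongside the proposed and final arguments makes it easy to answer later what the supervisor actually changed.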

Wiring It Into Your Tool Executor

Wire the gate, checkpoint, and notifier into your tool executor so every outbound tool call passes through the interrupt check automatically. The executor evaluates the gate, saves state if triggered, fires the notification, then throws an exception that your outer loop catches to pause the agent cleanly.

Here's how those pieces connect in practice:

agent-executor.ts·typescript
class InterruptException extends Error {
  constructor(public checkpointId: string) {
    super(`Interrupt: awaiting approval for checkpoint ${checkpointId}`);
  }
}
 
class AgentExecutor {
  async executeToolCall(
    toolName: string,
    args: Record<string, unknown>,
    context: AgentContext
  ): Promise<ToolResult> {
    const decision = this.gate.evaluate(toolName, args);
 
    if (!decision.shouldInterrupt) {
      return await this.tools.call(toolName, args);
    }
 
    const checkpoint: AgentCheckpoint = {
      checkpointId: crypto.randomUUID(),
      conversationId: context.conversationId,
      createdAt: new Date().toISOString(),
      expiresAt: new Date(
        Date.now() + decision.timeoutMinutes! * 60 * 1000
      ).toISOString(),
      pendingToolCall: {
        toolName,
        args,
        callId: context.currentToolCallId,
      },
      messages: context.messages,
      priorToolResults: context.priorToolResults,
      systemPrompt: context.systemPrompt,
      agentVersion: context.agentVersion,
      interruptReason: decision.reason!,
      approverRole: decision.approverRole!,
      timeoutMinutes: decision.timeoutMinutes!,
    };
 
    await this.store.save(checkpoint);
 
    await this.notifier.notify({
      checkpointId: checkpoint.checkpointId,
      approverRole: checkpoint.approverRole,
      toolName,
      args,
      reason: checkpoint.interruptReason,
      conversationId: context.conversationId,
      customerId: context.customerId,
      timeoutAt: checkpoint.expiresAt,
      reviewUrl: `${this.baseUrl}/approvals/${checkpoint.checkpointId}`,
    });
 
    throw new InterruptException(checkpoint.checkpointId);
  }
}

The InterruptException is caught by your outer execution loop to pause cleanly and return an appropriate message to the customer. Something like "I've escalated this for a quick review. You'll have confirmation within 15 minutes."
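
Here's a minimal sketch of that outer loop. It assumes one pass of your agent loop can be represented as an async `step` function, which is a placeholder for this example. `pauseOrRethrow` and `runTurn` are illustrative names, and the exception class is redeclared so the snippet stands alone.

```typescript
// Redeclared here so the sketch is self-contained.
class InterruptException extends Error {
  constructor(public readonly checkpointId: string) {
    super(`Interrupt: awaiting approval for checkpoint ${checkpointId}`);
    this.name = "InterruptException";
    // Keeps `instanceof` working if the code is compiled down to ES5.
    Object.setPrototypeOf(this, InterruptException.prototype);
  }
}

interface PausedTurn {
  status: "paused";
  checkpointId: string;
  reply: string;
}

interface CompletedTurn {
  status: "completed";
  reply: string;
}

// Converts an interrupt into a paused turn. Any other error is rethrown,
// so real failures are not mistaken for a pending approval.
function pauseOrRethrow(err: unknown): PausedTurn {
  if (err instanceof InterruptException) {
    return {
      status: "paused",
      checkpointId: err.checkpointId,
      reply: "I've escalated this for a quick review. You'll have confirmation shortly.",
    };
  }
  throw err;
}

// `step` stands in for one pass of the agent loop (model call plus tools).
async function runTurn(
  step: () => Promise<string>
): Promise<PausedTurn | CompletedTurn> {
  try {
    return { status: "completed", reply: await step() };
  } catch (err) {
    return pauseOrRethrow(err);
  }
}
```

Returning a typed `paused` result, not a bare string, lets the caller persist the `checkpointId` with the conversation so the resume path can find it later.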

Testing Your Interrupt Flows Before Launch

The biggest risk with interrupt patterns is the resume path breaking in production. You've likely tested the happy path: agent pauses, human approves, action fires. You also need to test three other paths: rejection, modification, and timeout expiry.

Chanl Scenarios lets you create test cases that deliberately trigger each interrupt condition, then verify the checkpoint is complete and the resume reconstructs correctly:

interrupt-test.ts·typescript
const largeRefundScenario = {
  name: "Refund over $500 triggers supervisor interrupt",
  setup: {
    customer: { id: "test-001", orderHistory: [{ id: "ORD-999", amount: 850 }] },
    agentConfig: { tools: ["issue_refund", "check_order"] },
  },
  script: [{ role: "user", content: "I need a full refund for order ORD-999" }],
  assertions: [
    { type: "tool_call_intercepted", toolName: "issue_refund" },
    { type: "checkpoint_created", checkpointContains: ["messages", "pendingToolCall"] },
    { type: "notification_sent", toRole: "supervisor" },
    { type: "customer_informed", messageContains: "escalated" },
  ],
};
 
const rejectionScenario = {
  // Spread first so the name and assertions below override the base scenario.
  ...largeRefundScenario,
  name: "Agent handles supervisor rejection gracefully",
  approvalDecision: {
    approved: false,
    approverName: "Sarah Chen",
    rejectionReason: "Order outside return window",
  },
  assertions: [
    { type: "agent_resumed" },
    { type: "customer_offered_alternative" },
    { type: "no_refund_issued" },
  ],
};

Test the rejection path explicitly. When a supervisor rejects an $850 refund, does the agent offer a partial refund instead of silently ending the conversation? This is the path most teams forget to verify until a customer calls back angry.
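
Timeout expiry deserves the same attention. Here's a minimal sketch of a sweep you could run on a schedule and exercise in tests. It assumes an expired approval means the action is not executed and the agent is told so, which is a policy choice to make deliberately. `sweepExpired`, the `"expired"` status value, and the message wording are illustrative.

```typescript
interface PendingCheckpoint {
  checkpointId: string;
  status: "pending" | "approved" | "rejected" | "expired";
  expiresAt: string; // ISO 8601
  callId: string;
}

interface ExpiredToolResult {
  toolCallId: string;
  content: string;
}

// Marks overdue pending checkpoints as expired and returns the tool
// results the agent should be resumed with. Decided checkpoints and
// checkpoints still inside their window are left untouched.
function sweepExpired(
  checkpoints: PendingCheckpoint[],
  now: Date
): ExpiredToolResult[] {
  const results: ExpiredToolResult[] = [];
  for (const cp of checkpoints) {
    if (cp.status !== "pending") continue;
    if (new Date(cp.expiresAt).getTime() > now.getTime()) continue;
    cp.status = "expired";
    results.push({
      toolCallId: cp.callId,
      content:
        "No approval decision arrived before the deadline, so the action was not executed. Please inform the customer and offer alternatives.",
    });
  }
  return results;
}

const queue: PendingCheckpoint[] = [
  { checkpointId: "cp-1", status: "pending", expiresAt: "2026-05-24T10:00:00Z", callId: "call-1" },
  { checkpointId: "cp-2", status: "pending", expiresAt: "2026-05-24T12:00:00Z", callId: "call-2" },
];

const expired = sweepExpired(queue, new Date("2026-05-24T11:00:00Z"));
console.log(expired.map((r) => r.toolCallId)); // ["call-1"]
```

Feeding the expiry back as a tool result follows the same shape as the rejection path, so the agent gets a chance to explain the delay instead of leaving the customer waiting.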

Monitoring Interrupt Health in Production

Once interrupt flows are live, three metrics tell you whether they're working:

Interrupt rate by tool shows what percentage of calls to each flagged tool actually trigger an interrupt. A sudden spike means something upstream changed. A sudden drop might mean your gate condition broke silently.

Time to resolution tracks how long between interrupt creation and human decision. If this creeps up, approvers are becoming a bottleneck. Consider automating lower-risk approvals or redistributing the queue.

Resume success rate shows what percentage of approved interrupts result in a successful tool execution. Failures here point to state corruption in the checkpoint or a tool that changed between interrupt and resume.

Whatever you instrument with, alert when interrupt queues grow faster than they resolve. That's the leading indicator of a bottleneck before customers start feeling the delay.
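
As a rough illustration of that alert, the check below flags a queue when more interrupts were created than resolved in each of the last few sampling windows. The window size, the `QueueWindow` shape, and the `backlogIsGrowing` name are assumptions for this sketch. Your metrics system will have its own way to express the same rule.

```typescript
interface QueueWindow {
  created: number;  // interrupts opened in this window
  resolved: number; // interrupts approved or rejected in this window
}

// True when each of the last `consecutive` windows opened more interrupts
// than it resolved. Requiring several windows in a row filters out a
// single busy spike.
function backlogIsGrowing(
  samples: QueueWindow[],
  consecutive: number
): boolean {
  if (consecutive <= 0 || samples.length < consecutive) return false;
  return samples
    .slice(-consecutive)
    .every((window) => window.created > window.resolved);
}

const lastHour: QueueWindow[] = [
  { created: 5, resolved: 5 },
  { created: 6, resolved: 4 },
  { created: 7, resolved: 3 },
  { created: 8, resolved: 2 },
];

console.log(backlogIsGrowing(lastHour, 3)); // true
console.log(backlogIsGrowing(lastHour, 4)); // false
```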

A simple roll-up over your checkpoint table gets you most of the way there:

interrupt-health.sql·sql
SELECT
  payload->'pendingToolCall'->>'toolName' AS tool_name,
  COUNT(*) AS total,
  PERCENTILE_CONT(0.5) WITHIN GROUP (
    ORDER BY EXTRACT(EPOCH FROM (resolved_at - created_at)) / 60
  ) FILTER (WHERE status IN ('approved', 'rejected')) AS median_minutes,
  COUNT(*) FILTER (WHERE status = 'approved')::float
    / NULLIF(COUNT(*) FILTER (WHERE status IN ('approved','rejected')), 0)
    AS approval_rate
FROM agent_checkpoints
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY tool_name;

The query reports resolution time and approval rate per tool. Resume success rate needs one more signal: the outcome of the tool execution after approval, recorded alongside the decision. Target a resume success rate above 0.95 and a median resolution time within your customer-facing SLA, and watch for drift in either direction.

The Audit Trail

Every interrupt, decision, and resume needs a tamper-evident audit record. This is what your post-incident reviews reach for, and what regulators ask for:

audit-log.ts·typescript
type InterruptEventType =
  | "interrupted"
  | "notified"
  | "approved"
  | "rejected"
  | "modified"
  | "resumed"
  | "expired";
 
interface InterruptAuditEvent {
  eventType: InterruptEventType;
  checkpointId: string;
  conversationId: string;
  agentId: string;
  toolName: string;
  proposedArgs: Record<string, unknown>;
  finalArgs?: Record<string, unknown>;
  decidedBy?: string;
  decidedAt?: string;
  reason?: string;
  timestamp: string;
}

Store these in append-only tables. Log every state transition. If an auditor asks "what did your agent propose, who approved it, and what actually executed?" you should be able to answer in under five minutes.
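
Append-only storage prevents casual edits. Making the log tamper-evident takes one more step. A common technique is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below is one way to do it with Node's built-in `crypto` module. The `ChainedEvent` shape and function names are illustrative, and a real deployment would also need to anchor the latest hash somewhere the log's writers can't modify.

```typescript
import { createHash } from "node:crypto";

interface ChainedEvent {
  payload: string;  // serialized audit event
  prevHash: string; // hash of the previous entry, or GENESIS for the first
  hash: string;     // SHA-256 over prevHash followed by payload
}

const GENESIS = "0".repeat(64);

function hashEntry(prevHash: string, payload: string): string {
  return createHash("sha256").update(prevHash).update(payload).digest("hex");
}

// Returns a new log with the event appended and linked to its predecessor.
function appendEvent(log: ChainedEvent[], event: object): ChainedEvent[] {
  const last = log[log.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  const payload = JSON.stringify(event);
  return [...log, { payload, prevHash, hash: hashEntry(prevHash, payload) }];
}

// Recomputes every link. Editing, removing, or reordering an entry breaks
// the chain from that point on.
function verifyChain(log: ChainedEvent[]): boolean {
  let expectedPrev = GENESIS;
  for (const entry of log) {
    if (entry.prevHash !== expectedPrev) return false;
    if (entry.hash !== hashEntry(entry.prevHash, entry.payload)) return false;
    expectedPrev = entry.hash;
  }
  return true;
}

let auditLog: ChainedEvent[] = [];
auditLog = appendEvent(auditLog, { eventType: "interrupted", checkpointId: "cp-1", amount: 850 });
auditLog = appendEvent(auditLog, { eventType: "approved", checkpointId: "cp-1", decidedBy: "Sarah Chen" });

console.log(verifyChain(auditLog)); // true
```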

Start Narrow, Expand With Confidence

Start with one interrupt rule. Pick the highest-stakes irreversible action in your agent's toolkit. Instrument it. Watch the metrics for two weeks. Then add the next rule.

Teams that interrupt everything create approval queues that never get fully resolved, which trains approvers to rubber-stamp them, which defeats the purpose entirely. Start with the actions that genuinely need a second look, prove that the review actually catches problems, then widen from there.

The interrupt pattern doesn't limit what your agent can do. It's what makes it safe to give your agent more authority over time. Giving an agent the ability to ask for help, precisely and sparingly, is what earns it the right to act alone.

If you want to go deeper on how to structure the human-agent handoff when an interrupt escalates to a full transfer, see The Context Package: What Your Agent Should Hand Off to a Human.
