Agent Architecture

Your Agent Is Already a State Machine. Make It Explicit.

Every production AI agent is secretly a state machine. Making it explicit gives you checkpointing, testable paths, and observable state transitions -- without rewriting your agent logic.

Dean Grover, Co-founder
May 6, 2026
14 min read
A graph diagram showing agent state transitions with named nodes and typed edges

Every agent you've shipped is secretly a state machine. Message comes in. Agent decides to call a tool. Tool returns a result. Agent decides what to do next. That's state machine logic: input, state transition, next state.

The question isn't whether your agent has a state machine in it. It's whether that machine is explicit or whether it's hiding in nested conditionals and recursive calls that nobody on your team wants to touch.

Making it explicit changes what you can do with your agent in production.

What an Agent State Machine Actually Is

A state machine models your agent as a graph where nodes do work and edges define what comes next. The agent's full context -- everything it needs to make decisions -- lives in a typed data structure that gets updated at each node boundary.

That's all it is. The complexity of agent orchestration reduces to three things: define your state type, write your nodes, declare your edges.

Nodes can be:

  • An LLM call that generates a response or selects a tool
  • A tool execution that fetches data or takes an action in an external system
  • A router that inspects current state and picks the next step
  • A checkpoint that persists state before a risky, irreversible operation

Edges can be fixed (always continue to this node) or conditional (pick the next node based on what just happened).

[Diagram. Nodes: receive_message, call_llm, route_on_output, execute_tool, request_human_approval, send_response. Edge labels: tool call requested, high-risk action, done, approved, rejected.]
A CX agent state machine: the tool loop with conditional routing and a human escalation path

The transitions between nodes are the interesting part. Each edge is either unconditional ("always go to call_llm after a tool runs") or a routing function that inspects the current state and returns the name of the next node.
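
To make the node and edge vocabulary concrete, here is a minimal hand-rolled sketch with no library involved. The names `SketchState`, `run`, and the stubbed node bodies are assumptions made up for illustration; only the node names come from the diagram above. The stub LLM requests one tool call and then answers, which is enough to exercise both a fixed and a conditional edge.

```typescript
// Illustrative only: a tiny graph runner, not LangGraph's implementation.
type SketchState = {
  messages: string[];
  pendingToolCalls: string[];
};

type NodeName = "call_llm" | "execute_tool" | "send_response";
type Next = NodeName | "END";

// A node does work and returns a partial state update.
type NodeFn = (state: SketchState) => Promise<Partial<SketchState>>;

// An edge is either fixed (a node name) or conditional (a routing function).
type Edge = Next | ((state: SketchState) => Next);

const nodes: Record<NodeName, NodeFn> = {
  // Stub LLM: requests one tool call first, then answers once a tool result exists.
  call_llm: async (s) =>
    s.messages.some((m) => m.startsWith("tool:"))
      ? { messages: [...s.messages, "assistant: your order has shipped"], pendingToolCalls: [] }
      : { pendingToolCalls: ["lookup_order"] },
  // Stub tool execution: records one result per pending call and clears the queue.
  execute_tool: async (s) => ({
    messages: [...s.messages, ...s.pendingToolCalls.map((t) => `tool:${t} ok`)],
    pendingToolCalls: [],
  }),
  send_response: async () => ({}),
};

const edges: Record<NodeName, Edge> = {
  // Conditional edge: inspect state, return the next node name.
  call_llm: (s) => (s.pendingToolCalls.length ? "execute_tool" : "send_response"),
  // Fixed edge: always go back to the LLM after a tool runs.
  execute_tool: "call_llm",
  send_response: "END",
};

async function run(
  start: NodeName,
  initial: SketchState
): Promise<{ state: SketchState; path: NodeName[] }> {
  let state = initial;
  let current: Next = start;
  const path: NodeName[] = [];
  while (current !== "END") {
    path.push(current);
    // Merge the node's partial update into the full state.
    state = { ...state, ...(await nodes[current](state)) };
    const edge: Edge = edges[current];
    current = typeof edge === "function" ? edge(state) : edge;
  }
  return { state, path };
}
```

Calling `run("call_llm", { messages: ["user: where is my order?"], pendingToolCalls: [] })` walks the tool loop once and then exits through `send_response`. The recorded `path` is the kind of trace an explicit machine gives you without extra work.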

The Implicit Loop vs. The Explicit Machine

The basic agent loop works for prototypes but breaks down in production when you need to checkpoint a failure, test a specific decision path in isolation, or add a human approval step. The control flow is buried in the code, the state is invisible, and every new requirement means touching a function nobody fully understands anymore.

Most agent tutorials give you something like this:

implicit-loop.ts·typescript
async function runAgent(messages: Message[]): Promise<string> {
  while (true) {
    const response = await llm.chat(messages);

    // No tool calls means the model has produced its final answer
    if (!response.toolCalls?.length) {
      return response.content;
    }

    // Record the assistant turn before its tool results so the history stays in order
    messages.push({ role: "assistant", content: response.content });

    for (const toolCall of response.toolCalls) {
      const result = await executeTool(toolCall);
      messages.push({ role: "tool", content: result });
    }
  }
}

This works. You can ship it. It's also a state machine -- the state is messages, the transitions are implicit in the while(true) and the if check. But it breaks down in production for predictable reasons:

  • You can't checkpoint this loop and resume from a failure midway through
  • You can't test the "tool succeeds, then escalate" path without running a live LLM
  • When something goes wrong, your trace says "loop ran 9 times" with no explanation of why
  • Adding a human approval step requires threading async state through the loop

The explicit state machine version does the same work:

explicit-state-machine.ts·typescript
import { StateGraph, Annotation } from "@langchain/langgraph";
 
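// Every piece of context the agent needs is typed and explicit.
// Each channel declares a reducer; the non-message channels simply keep the latest value.
const AgentState = Annotation.Root({
  messages: Annotation<Message[]>({ reducer: (a, b) => [...a, ...b], default: () => [] }),
  pendingToolCalls: Annotation<ToolCall[]>({ reducer: (_prev, next) => next, default: () => [] }),
  awaitingHumanApproval: Annotation<boolean>({ reducer: (_prev, next) => next, default: () => false }),
  // Set by the human approval node; null until a reviewer has decided
  humanApproved: Annotation<boolean | null>({ reducer: (_prev, next) => next, default: () => null }),
  finalResponse: Annotation<string | null>({ reducer: (_prev, next) => next, default: () => null }),
});

// Nodes take state in and return a partial state update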
async function callLLM(state: typeof AgentState.State) {
  const response = await llm.chat(state.messages);
  return {
    messages: [{ role: "assistant", content: response.content }],
    pendingToolCalls: response.toolCalls ?? [],
  };
}
 
async function executeTool(state: typeof AgentState.State) {
  const results = await Promise.all(
    state.pendingToolCalls.map((tc) => runTool(tc))
  );
  return {
    messages: results.map((r, i) => ({
      role: "tool",
      toolCallId: state.pendingToolCalls[i].id,
      content: r,
    })),
    pendingToolCalls: [],
  };
}
 
// Routing function: inspects state, returns next node name
function routeAfterLLM(state: typeof AgentState.State): string {
  if (!state.pendingToolCalls.length) return "finish";
  if (isHighRiskAction(state.pendingToolCalls)) return "request_approval";
  return "execute_tool";
}

The routing logic is now a testable pure function. The state is typed and inspectable. Every node is a unit you can test in isolation.
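
The reducer on `messages` is what lets a node return only its new messages instead of the whole history. As a rough sketch of that merge step, written without LangGraph so the mechanics are visible: `applyUpdate` and the simplified `SketchMessage` and `MergeState` types are illustrative assumptions, not library API.

```typescript
// Illustrative merge step: how a partial node update is folded into full state.
type SketchMessage = { role: "user" | "assistant" | "tool"; content: string };

type MergeState = {
  messages: SketchMessage[];
  pendingToolCalls: string[];
  finalResponse: string | null;
};

function applyUpdate(state: MergeState, update: Partial<MergeState>): MergeState {
  return {
    // Append reducer: new messages are added to the existing history.
    messages: update.messages ? [...state.messages, ...update.messages] : state.messages,
    // Last-write-wins: the update replaces the old value when present.
    pendingToolCalls: update.pendingToolCalls ?? state.pendingToolCalls,
    // null is a legitimate value here, so only undefined means "no update".
    finalResponse: update.finalResponse !== undefined ? update.finalResponse : state.finalResponse,
  };
}
```

Because the function returns a new object and never mutates its input, the state before and after each node can be kept side by side, which is what makes diffing and replay cheap later on.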

The Three Patterns CX Agents Need

Most customer experience agents need a combination of three state machine patterns. Once you know them, you can build almost anything by composing them.

Pattern 1: The Tool Loop

The tool loop is the foundation. The agent calls the LLM, checks if the output contains tool calls, executes those tools, and loops back. In state machine terms:

text
call_llm -> route_on_output -> execute_tool -> call_llm (loop)
                            -> finish (exit)

The routing function is the key piece. It's a pure function that takes the current state and returns a string -- the name of the next node. This makes it trivially testable:

routing.test.ts·typescript
test("routes to execute_tool when tool calls are pending", () => {
  const state = { ...baseState, pendingToolCalls: [mockToolCall] };
  expect(routeAfterLLM(state)).toBe("execute_tool");
});
 
test("routes to finish when no tool calls", () => {
  const state = { ...baseState, pendingToolCalls: [] };
  expect(routeAfterLLM(state)).toBe("finish");
});

No LLM. No API call. Just state in, route name out.

Pattern 2: Conditional Routing

Real CX flows branch based on what happened. A return agent might check order eligibility, then route to approve_return or flag_for_review depending on the result. An upgrade agent might route to upsell_pitch or retention_offer based on the customer's tier.

[Diagram. Nodes: verify_order, check_eligibility, auto_approve_return, request_human_review, explain_policy, send_confirmation. Edge labels: eligible, low value; eligible, high value; not eligible.]
Conditional routing in a returns agent: eligibility check determines the downstream path

Each routing function is just a conditional:

returns-routing.ts·typescript
function routeAfterEligibilityCheck(state: ReturnsAgentState): string {
  if (!state.isEligible) return "explain_policy";
  if (state.orderValue > REVIEW_THRESHOLD) return "request_human_review";
  return "auto_approve_return";
}

This conditional was probably in your code before. Pulling it out into a named routing function makes it visible, testable, and auditable.
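
One small refinement worth considering is narrowing the return type from `string` to a union of node names, so the compiler rejects a misspelled route. The sketch below is self-contained and makes two assumptions: the shape of `ReturnsAgentState` and a `REVIEW_THRESHOLD` of 200, neither of which is specified in this post.

```typescript
// Assumed state shape and threshold, for illustration only.
interface ReturnsAgentState {
  isEligible: boolean;
  orderValue: number;
}

const REVIEW_THRESHOLD = 200; // hypothetical value

// Only these node names are valid routes; a typo becomes a compile error.
type ReturnsRoute = "explain_policy" | "request_human_review" | "auto_approve_return";

function routeAfterEligibilityCheck(state: ReturnsAgentState): ReturnsRoute {
  if (!state.isEligible) return "explain_policy";
  if (state.orderValue > REVIEW_THRESHOLD) return "request_human_review";
  return "auto_approve_return";
}

// A table of cases covers every branch, including the threshold boundary.
const routingCases: Array<[ReturnsAgentState, ReturnsRoute]> = [
  [{ isEligible: false, orderValue: 850 }, "explain_policy"],
  [{ isEligible: true, orderValue: 850 }, "request_human_review"],
  [{ isEligible: true, orderValue: 200 }, "auto_approve_return"], // exactly at threshold
  [{ isEligible: true, orderValue: 45 }, "auto_approve_return"],
];

const routingFailures = routingCases.filter(
  ([state, expected]) => routeAfterEligibilityCheck(state) !== expected
);
```

The case table doubles as documentation of the policy: anyone auditing the returns flow can read the four rows instead of tracing the conditional.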

Pattern 3: Human-in-the-Loop

This is the most underbuilt pattern in production CX agents and the most important for managing risk. When the agent reaches a high-stakes decision -- a large refund, a subscription cancellation, a billing change -- you pause execution, notify a human reviewer, and resume only after approval.

Without state machine checkpointing, implementing this means building your own persistence layer and resumption logic. With it, the pattern is a few lines:

human-in-loop.ts·typescript
import { interrupt, Command } from "@langchain/langgraph";
 
// humanApproved must be declared as a field on AgentState for this update to be stored
async function requestHumanApproval(state: typeof AgentState.State) {
  // Pause here. With a checkpointer configured, state is saved automatically.
  // The graph will not continue until it is invoked again with Command({ resume }).
  const decision = interrupt({
    type: "approval_required",
    action: state.pendingToolCalls[0],
    context: state.messages.slice(-3),
  });

  return { humanApproved: decision.approved, awaitingHumanApproval: false };
}
 
// Later, when the reviewer approves via webhook:
await graph.invoke(
  new Command({ resume: { approved: true } }),
  { configurable: { thread_id: conversationId } }
);

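The agent state is saved to a persistent store. The reviewer looks at the pending action and approves or rejects via your UI or API. Execution resumes at the node where it paused, not at the start of the graph. The paused node itself re-runs from its first line, with interrupt returning the reviewer's decision, so keep any side effects after the interrupt call.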

For a voice agent handling a cancellation save desk, this means a supervisor can review the proposed retention offer in real time before it's presented to the customer.
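
The routing code earlier calls `isHighRiskAction` without defining it. One possible shape, sketched under assumptions: the tool names, the `amount` argument, and the 200 threshold are all hypothetical and would come from your own policy.

```typescript
// Hypothetical tool-call shape and policy values, for illustration only.
interface SketchToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

// Tools that always need a human, regardless of arguments (assumed names).
const ALWAYS_REVIEW = new Set(["cancel_subscription", "change_billing"]);

// Refunds above this amount need a human (assumed value).
const REFUND_APPROVAL_THRESHOLD = 200;

function isHighRiskAction(toolCalls: SketchToolCall[]): boolean {
  return toolCalls.some((tc) => {
    if (ALWAYS_REVIEW.has(tc.name)) return true;
    if (tc.name === "issue_refund") {
      const amount = tc.args.amount;
      // Fail closed: a missing or malformed amount is treated as high risk.
      return typeof amount !== "number" || amount > REFUND_APPROVAL_THRESHOLD;
    }
    return false;
  });
}
```

Treating a malformed refund amount as high risk is a deliberate choice here: when the check cannot tell, it sends the action to a reviewer rather than letting it through.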

Checkpointing: Survive Failures Without Losing Work

Checkpointing saves the agent's full state to a persistent store at each node boundary. If the agent crashes, times out, or gets interrupted mid-run, you resume from the last checkpoint rather than starting over.

checkpointing.ts·typescript
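import { StateGraph, START, END } from "@langchain/langgraph";
import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";

const checkpointer = new PostgresSaver(connectionPool);
await checkpointer.setup(); // creates the checkpoint tables on first run

const graph = new StateGraph(AgentState)
  .addNode("call_llm", callLLM)
  .addNode("execute_tool", executeTool)
  .addNode("request_approval", requestHumanApproval)
  .addEdge(START, "call_llm")
  // Map each route name to a node; "finish" ends the run
  .addConditionalEdges("call_llm", routeAfterLLM, {
    execute_tool: "execute_tool",
    request_approval: "request_approval",
    finish: END,
  })
  // Approved actions run the tool; rejected ones end the run
  .addConditionalEdges(
    "request_approval",
    (state) => (state.humanApproved ? "approved" : "rejected"),
    { approved: "execute_tool", rejected: END }
  )
  .addEdge("execute_tool", "call_llm")
  .compile({ checkpointer });

// Every run uses a thread_id. State persists across failures.
const result = await graph.invoke(initialState, {
  configurable: { thread_id: conversationId }
});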

This matters for CX agents specifically because:

  • Tool calls fail. A CRM API might time out at step 4 of an 8-step agent. You want to retry from step 4, not step 1.
  • Human review loops take time. An agent can't hold state in memory for 20 minutes.
  • Long calls get disconnected. The agent needs to reconstruct its context for the callback leg.
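
The first bullet above, retrying from the failed step rather than from the start, can be sketched without any library. Everything here is an illustrative assumption: the in-memory `checkpoints` map stands in for a persistent store, and the second node fails once to simulate a CRM timeout. This shows the idea only, not how LangGraph's checkpointers are implemented.

```typescript
// Illustrative checkpoint-and-resume loop with an in-memory store.
type RunState = { log: string[] };
type Checkpoint = { nextNode: number; state: RunState };
type StepFn = (s: RunState) => Promise<RunState>;

// Stand-in for a persistent store, keyed by thread id.
const checkpoints = new Map<string, Checkpoint>();

let crmAttempts = 0;

const steps: StepFn[] = [
  async (s) => ({ log: [...s.log, "verify_order"] }),
  async (s) => {
    crmAttempts += 1;
    // Simulated flaky dependency: fails on the first attempt only.
    if (crmAttempts === 1) throw new Error("CRM timeout");
    return { log: [...s.log, "lookup_crm"] };
  },
  async (s) => ({ log: [...s.log, "send_confirmation"] }),
];

async function runThread(threadId: string, initial: RunState): Promise<RunState> {
  // Resume from the last saved checkpoint if this thread has one.
  const saved = checkpoints.get(threadId);
  let state = saved?.state ?? initial;
  for (let i = saved?.nextNode ?? 0; i < steps.length; i++) {
    state = await steps[i](state);
    // Save after every successful step; a throw leaves the previous checkpoint intact.
    checkpoints.set(threadId, { nextNode: i + 1, state });
  }
  return state;
}
```

Calling `runThread` twice with the same thread id shows the behavior: the first call throws at the CRM step, and the second picks up at that step with `verify_order` already recorded, so the first step is not repeated.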

You can go deeper on failure recovery patterns in our post on production agent reliability.

What Explicit State Does for Observability

When agent state is a typed data structure updated at named nodes, monitoring becomes straightforward. Every state transition is a data change you can log, diff, and replay without custom instrumentation.

Your observability system knows exactly which node just ran, what the state looked like before it, and what changed afterward. When something goes wrong -- an agent took the wrong branch, a tool returned unexpected data, a routing function hit an unhandled case -- you have a complete audit trail, not just a stack trace from inside a recursive function.

observability-hook.ts·typescript
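// Stream per-node state updates as the graph runs and forward them to telemetry.
// In "updates" mode each chunk is keyed by the node that just ran.
// `telemetry` is your own client, not part of LangGraph.
const stream = await graph.stream(initialState, {
  configurable: { thread_id: conversationId },
  streamMode: "updates",
});

for await (const chunk of stream) {
  for (const [node, update] of Object.entries(chunk)) {
    telemetry.recordStateTransition({
      node,
      threadId: conversationId,
      stateUpdate: update,
      timestamp: Date.now(),
    });
  }
}
// Full before/after snapshots are available from graph.getStateHistory(config)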

This pairs naturally with the analytics you're running on conversation outcomes. The state machine trace tells you exactly what the agent did and which path it took. Conversation analytics tell you what outcome that produced. With both, you can identify not just that something went wrong but which specific path through the graph led there.

You can also build real-time dashboards on top of the state stream -- the monitoring view shows which node each active agent is currently in, how long it's been there, and what the pending state looks like. For a contact center running dozens of simultaneous calls, that's the difference between oversight and blindness.
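
Diffing two state snapshots is the piece of this that is easy to show in isolation. A minimal sketch follows; `diffState` and `FieldChange` are hypothetical helper names, and comparing via `JSON.stringify` is a simplification that works for plain JSON-like state but is sensitive to object key order.

```typescript
// Illustrative field-level diff between two state snapshots.
type Snapshot = Record<string, unknown>;
type FieldChange = { field: string; before: unknown; after: unknown };

function diffState(before: Snapshot, after: Snapshot): FieldChange[] {
  // Consider every field present in either snapshot.
  const fields = Array.from(new Set([...Object.keys(before), ...Object.keys(after)]));
  return fields
    .filter((field) => JSON.stringify(before[field]) !== JSON.stringify(after[field]))
    .map((field) => ({ field, before: before[field], after: after[field] }));
}
```

Logging the output of a helper like this per node gives a trace that reads as "this node changed these fields", which is usually easier to scan than two full state dumps.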

Testing State Machine Agents

Explicit state machines make agent testing tractable because you can test nodes in isolation and specific graph paths without running a live LLM for every test case.

Node tests are pure unit tests. A node takes state in and returns a state update. You mock the state, call the node, check the update:

node.test.ts·typescript
test("executeTool updates messages with tool results", async () => {
  const state = {
    ...baseState,
    pendingToolCalls: [{ id: "tc_1", name: "lookup_order", args: { orderId: "ORD-9871" } }],
  };
 
  const update = await executeTool(state);
 
  expect(update.messages).toHaveLength(1);
  expect(update.messages[0].role).toBe("tool");
  expect(update.pendingToolCalls).toHaveLength(0);
});

Routing tests are even simpler -- they're just conditional function tests:

routing.test.ts·typescript
test("routes to request_human_review for high-value orders", () => {
  const state = { ...baseState, orderValue: 850, isEligible: true };
  expect(routeAfterEligibilityCheck(state)).toBe("request_human_review");
});
 
test("routes to auto_approve for low-value eligible orders", () => {
  const state = { ...baseState, orderValue: 45, isEligible: true };
  expect(routeAfterEligibilityCheck(state)).toBe("auto_approve_return");
});

Path tests verify specific end-to-end flows through the graph. You supply initial state and a mock LLM that returns predetermined outputs, then check the final state:

path.test.ts·typescript
test("escalation path: high-risk tool triggers human approval node", async () => {
  // Stub the llm client that the callLLM node uses, so no live model is called
  jest.spyOn(llm, "chat").mockResolvedValue({
    content: "",
    toolCalls: [{ id: "tc_1", name: "cancel_subscription", args: { userId: "u_123" } }],
  });

  const config = { configurable: { thread_id: "test-1" } };
  await graph.invoke({ messages: [customerMessage] }, config);

  // The run pauses at the interrupt, so inspect the checkpointed state
  const snapshot = await graph.getState(config);
  expect(snapshot.next).toEqual(["request_approval"]);
  expect(snapshot.values.pendingToolCalls[0].name).toBe("cancel_subscription");
});

You can scale this further with Chanl scenarios, which run realistic customer conversation simulations against your agent and check not just the final answer but the path through the state machine -- verifying that your escalation logic fires when it should, your tool loop terminates correctly, and your human-in-the-loop node activates on the right conditions.

Migrating an Existing Agent

You don't need to rewrite your agent to get state machine benefits. You can migrate incrementally:

  1. Identify the state your agent already carries across iterations -- messages, retrieved context, tool results, routing flags. Collect that into a typed object.

  2. Extract each major step into a named function that takes state in and returns a partial state update. This is usually a refactor, not a rewrite.

  3. Pull conditional logic out into routing functions. Every if (toolCalls.length) in your loop becomes a routing function you can test independently.

  4. Add a checkpointer last. Once nodes are named and state is typed, checkpointing is a configuration change, not a code change.

The hardest part is step one -- identifying what your implicit state actually is. If you've got a monolithic agent function, spend an hour mapping what it carries forward from one iteration to the next. That map is your state schema.

You can read about the broader journey from prototype to production -- including when to refactor and when to rebuild -- in our guide on production agent reliability. State machines are one layer of a reliability stack that also includes circuit breakers, rate limiting, and fallback chains.

What You Get

The payoff for making your state machine explicit is straightforward:

  • Debuggability: every step is named and every state change is inspectable
  • Testability: nodes and routing functions are pure enough to unit test
  • Fault tolerance: checkpointing lets you resume mid-run instead of starting over
  • Human oversight: the human-in-the-loop pattern is a first-class citizen, not an afterthought
  • Observability: state transitions are structured data, not log strings

The agent loop you've been writing is already a state machine. Build it explicitly, connect your observability stack to the state transition stream, and monitor every active run in real time. That's the loop that lets you improve agents with confidence rather than hope. Chanl's monitoring tooling is built to consume structured agent traces -- the kind that come naturally from explicit state machines. Build the graph, and your observability comes with it.

Test every path through your agent, not just the happy path

Chanl runs realistic customer scenarios against your agent's state machine and verifies that every branch -- including escalations, retries, and human handoffs -- behaves correctly before it goes live.

Frequently Asked Questions