Every agent you've shipped is secretly a state machine. Message comes in. Agent decides to call a tool. Tool returns a result. Agent decides what to do next. That's state machine logic: input, state transition, next state.
The question isn't whether your agent has a state machine in it. It's whether that machine is explicit or whether it's hiding in nested conditionals and recursive calls that nobody on your team wants to touch.
Making it explicit changes what you can do with your agent in production.
What an Agent State Machine Actually Is
A state machine models your agent as a graph where nodes do work and edges define what comes next. The agent's full context -- everything it needs to make decisions -- lives in a typed data structure that gets updated at each node boundary.
That's all it is. The complexity of agent orchestration reduces to three things: define your state type, write your nodes, declare your edges.
Nodes can be:
- An LLM call that generates a response or selects a tool
- A tool execution that fetches data or takes an action in an external system
- A router that inspects current state and picks the next step
- A checkpoint that persists state before a risky, irreversible operation
Edges can be fixed (always continue to this node) or conditional (pick the next node based on what just happened).
The transitions between nodes are the interesting part. An unconditional edge hard-codes the next step ("always go to call_llm after a tool runs"); a conditional edge is a routing function that inspects the current state and returns the name of the next node.
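To make the node-and-edge idea concrete, here is a minimal, library-free sketch. Every name in it (`State`, `nodes`, `edges`, `run`, the `END` marker) is invented for illustration and is not part of any framework; the counter stands in for real agent work.

```typescript
// Hypothetical minimal state-machine runner; not tied to any library.
type State = { count: number; log: string[] };
type NodeFn = (state: State) => Partial<State>;
// An edge is either a fixed target or a routing function that picks one.
type Edge = string | ((state: State) => string);

const END = "__end__";

const nodes: Record<string, NodeFn> = {
  increment: (s) => ({ count: s.count + 1, log: [...s.log, "increment"] }),
  report: (s) => ({ log: [...s.log, `done at ${s.count}`] }),
};

const edges: Record<string, Edge> = {
  // Conditional edge: loop until the counter reaches 3.
  increment: (s) => (s.count < 3 ? "increment" : "report"),
  // Fixed edge: reporting always ends the run.
  report: END,
};

function run(start: string, initial: State, maxSteps = 20): State {
  let state = initial;
  let current = start;
  for (let step = 0; current !== END; step++) {
    if (step >= maxSteps) throw new Error("step limit exceeded");
    // Merge the node's partial update into the full state.
    state = { ...state, ...nodes[current](state) };
    const edge = edges[current];
    current = typeof edge === "string" ? edge : edge(state);
  }
  return state;
}

const finalState = run("increment", { count: 0, log: [] });
```

The runner itself knows nothing about the work being done. All behavior lives in the two lookup tables, which is what makes each piece replaceable and testable on its own.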
The Implicit Loop vs. The Explicit Machine
The basic agent loop works for prototypes but breaks down in production when you need to checkpoint a failure, test a specific decision path in isolation, or add a human approval step. The control flow is buried in the code, the state is invisible, and every new requirement means touching a function nobody fully understands anymore.
Most agent tutorials give you something like this:
async function runAgent(messages: Message[]): Promise<string> {
  while (true) {
    const response = await llm.chat(messages);
    if (!response.toolCalls?.length) {
      return response.content;
    }
    // Record the assistant turn before its tool results so the history stays in order
    messages.push({ role: "assistant", content: response.content });
    for (const toolCall of response.toolCalls) {
      const result = await executeTool(toolCall);
      messages.push({ role: "tool", content: result });
    }
  }
}
This works. You can ship it. It's also a state machine -- the state is messages, the transitions are implicit in the while(true) and the if check. But it breaks down in production for predictable reasons:
- You can't checkpoint this loop and resume from a failure midway through
- You can't test the "tool succeeds, then escalate" path without running a live LLM
- When something goes wrong, your trace says "loop ran 9 times" with no explanation of why
- Adding a human approval step requires threading async state through the loop
The explicit state machine version does the same work:
import { StateGraph, Annotation } from "@langchain/langgraph";
// Every piece of context the agent needs is typed and explicit
const AgentState = Annotation.Root({
  // Appending reducer: each node returns only its new messages
  messages: Annotation<Message[]>({ reducer: (a, b) => [...a, ...b], default: () => [] }),
  // The remaining channels keep the latest value written to them
  pendingToolCalls: Annotation<ToolCall[]>({ reducer: (_prev, next) => next, default: () => [] }),
  awaitingHumanApproval: Annotation<boolean>({ reducer: (_prev, next) => next, default: () => false }),
  humanApproved: Annotation<boolean | null>({ reducer: (_prev, next) => next, default: () => null }),
  finalResponse: Annotation<string | null>({ reducer: (_prev, next) => next, default: () => null }),
});
// Nodes take the current state in and return a state update
async function callLLM(state: typeof AgentState.State) {
  const response = await llm.chat(state.messages);
  return {
    messages: [{ role: "assistant", content: response.content }],
    pendingToolCalls: response.toolCalls ?? [],
  };
}
async function executeTool(state: typeof AgentState.State) {
  // Run all pending tool calls concurrently; results keep the order of the calls
  const results = await Promise.all(
    state.pendingToolCalls.map((tc) => runTool(tc))
  );
  return {
    messages: results.map((r, i) => ({
      role: "tool",
      toolCallId: state.pendingToolCalls[i].id,
      content: r,
    })),
    pendingToolCalls: [],
  };
}
// Routing function: inspects state, returns next node name
function routeAfterLLM(state: typeof AgentState.State): string {
  if (!state.pendingToolCalls.length) return "finish";
  if (isHighRiskAction(state.pendingToolCalls)) return "request_approval";
  return "execute_tool";
}
The routing logic is now a testable pure function. The state is typed and inspectable. Every node is a unit you can test in isolation.
The Three Patterns CX Agents Need
Most customer experience agents need a combination of three state machine patterns. Once you know them, you can build almost anything by composing them.
Pattern 1: The Tool Loop
The tool loop is the foundation. The agent calls the LLM, checks if the output contains tool calls, executes those tools, and loops back. In state machine terms:
call_llm -> route_on_output -> execute_tool -> call_llm (loop)
                            -> finish (exit)
The routing function is the key piece. It's a pure function that takes the current state and returns a string -- the name of the next node. This makes it trivially testable:
test("routes to execute_tool when tool calls are pending", () => {
  const state = { ...baseState, pendingToolCalls: [mockToolCall] };
  expect(routeAfterLLM(state)).toBe("execute_tool");
});
test("routes to finish when no tool calls", () => {
  const state = { ...baseState, pendingToolCalls: [] };
  expect(routeAfterLLM(state)).toBe("finish");
});
No LLM. No API call. Just state in, route name out.
Pattern 2: Conditional Routing
Real CX flows branch based on what happened. A return agent might check order eligibility, then route to approve_return or flag_for_review depending on the result. An upgrade agent might route to upsell_pitch or retention_offer based on the customer's tier.
Each routing function is just a conditional:
function routeAfterEligibilityCheck(state: ReturnsAgentState): string {
  if (!state.isEligible) return "explain_policy";
  if (state.orderValue > REVIEW_THRESHOLD) return "request_human_review";
  return "auto_approve_return";
}
This conditional was probably in your code before. Pulling it out into a named routing function makes it visible, testable, and auditable.
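One optional refinement is to return a union of route names instead of a bare string, so the compiler rejects a misspelled node name. The sketch below is self-contained; the `ReturnsState` shape and the threshold value of 500 are assumptions made for the example, not values from a real system.

```typescript
// Hypothetical state shape and threshold, for illustration only.
type ReturnsState = { isEligible: boolean; orderValue: number };
const REVIEW_THRESHOLD = 500;

// Listing the targets as a union lets the compiler reject a misspelled node name.
type ReturnRoute = "explain_policy" | "request_human_review" | "auto_approve_return";

function routeAfterEligibilityCheck(state: ReturnsState): ReturnRoute {
  if (!state.isEligible) return "explain_policy";
  if (state.orderValue > REVIEW_THRESHOLD) return "request_human_review";
  return "auto_approve_return";
}

// Table-driven check of every branch, including the boundary value.
const cases: Array<[ReturnsState, ReturnRoute]> = [
  [{ isEligible: false, orderValue: 850 }, "explain_policy"],
  [{ isEligible: true, orderValue: 850 }, "request_human_review"],
  [{ isEligible: true, orderValue: 500 }, "auto_approve_return"],
  [{ isEligible: true, orderValue: 45 }, "auto_approve_return"],
];
const routed = cases.map(([state]) => routeAfterEligibilityCheck(state));
```

The boundary case matters here: an order exactly at the threshold is auto-approved because the comparison is strict. That is the kind of decision a named routing function puts in plain sight.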
Pattern 3: Human-in-the-Loop
This is the most underbuilt pattern in production CX agents and the most important for managing risk. When the agent reaches a high-stakes decision -- a large refund, a subscription cancellation, a billing change -- you pause execution, notify a human reviewer, and resume only after approval.
Without state machine checkpointing, implementing this means building your own persistence layer and resumption logic. With it, the pattern is a few lines:
import { interrupt, Command } from "@langchain/langgraph";
async function requestHumanApproval(state: typeof AgentState.State) {
  // Pause here. With a checkpointer configured, state is saved automatically.
  // The graph will not continue until it is invoked again with a Command carrying a resume value.
  const decision = interrupt({
    type: "approval_required",
    action: state.pendingToolCalls[0],
    context: state.messages.slice(-3),
  });
  // On resume this node re-runs from the top and interrupt() returns the resume value
  return { humanApproved: decision.approved, awaitingHumanApproval: false };
}
// Later, when the reviewer approves via webhook:
await graph.invoke(
  new Command({ resume: { approved: true } }),
  { configurable: { thread_id: conversationId } }
);
The agent state is saved to a persistent store. The reviewer looks at the pending action and approves or rejects via your UI or API. Execution resumes from exactly the node where it paused -- not from the beginning.
For a voice agent handling a cancellation save desk, this means a supervisor can review the proposed retention offer in real time before it's presented to the customer.
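If it helps to see the mechanics without a framework, the pause-and-resume idea can be sketched in a few lines. This is not how LangGraph implements it; `RefundState`, `advance`, `resume`, and the threshold of 100 are all hypothetical names and values chosen for the example.

```typescript
// Library-free sketch of pause-and-resume; all names here are hypothetical.
type RefundState = {
  amount: number;
  approved: boolean | null; // null until a reviewer decides
  outcome: string | null;
};

type RunResult =
  | { status: "paused"; checkpoint: string }
  | { status: "done"; state: RefundState };

const APPROVAL_THRESHOLD = 100;

function advance(state: RefundState): RunResult {
  const needsApproval = state.amount > APPROVAL_THRESHOLD && state.approved === null;
  if (needsApproval) {
    // Serialize the state so the process can exit while the reviewer decides.
    return { status: "paused", checkpoint: JSON.stringify(state) };
  }
  const outcome = state.approved === false ? "refund_rejected" : "refund_issued";
  return { status: "done", state: { ...state, outcome } };
}

function resume(checkpoint: string, approved: boolean): RunResult {
  // Restore the saved state, record the decision, and continue from the pause point.
  const restored = JSON.parse(checkpoint) as RefundState;
  return advance({ ...restored, approved });
}

const first = advance({ amount: 250, approved: null, outcome: null });
const second = first.status === "paused" ? resume(first.checkpoint, true) : first;
```

Because the checkpoint is a plain string, nothing has to stay in memory between the pause and the reviewer's decision. The resume call can happen in a different process, minutes or hours later.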
Checkpointing: Survive Failures Without Losing Work
Checkpointing saves the agent's full state to a persistent store at each node boundary. If the agent crashes, times out, or gets interrupted mid-run, you resume from the last checkpoint rather than starting over.
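import { StateGraph, START, END } from "@langchain/langgraph";
import { PostgresSaver } from "@langchain/langgraph-checkpoint-postgres";
const checkpointer = new PostgresSaver(connectionPool);
await checkpointer.setup(); // creates the checkpoint tables on first run
const graph = new StateGraph(AgentState)
  .addNode("call_llm", callLLM)
  .addNode("execute_tool", executeTool)
  .addNode("request_approval", requestHumanApproval)
  .addEdge(START, "call_llm")
  // Map each route name to a node; "finish" ends the run
  .addConditionalEdges("call_llm", routeAfterLLM, {
    execute_tool: "execute_tool",
    request_approval: "request_approval",
    finish: END,
  })
  // Simplification: proceed to the tool after review; route on humanApproved to handle rejections
  .addEdge("request_approval", "execute_tool")
  .addEdge("execute_tool", "call_llm")
  .compile({ checkpointer });
// Every run uses a thread_id. State persists across failures.
const result = await graph.invoke(initialState, {
  configurable: { thread_id: conversationId }
});
This matters for CX agents specifically because: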
- Tool calls fail. A CRM API might time out at step 4 of an 8-step agent. You want to retry from step 4, not step 1.
- Human review loops take time. An agent process shouldn't have to hold state in memory for the 20 minutes a review might take.
- Long calls get disconnected. The agent needs to reconstruct its context for the callback leg.
You can go deeper on failure recovery patterns in our post on production agent reliability.
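The "retry from step 4, not step 1" behavior can be illustrated with an in-memory stand-in for the checkpoint store. Everything here is hypothetical: the step names, the `Map`-based store, and the simulated one-time CRM timeout exist only to show the resume logic.

```typescript
// In-memory stand-in for a persistent checkpoint store (hypothetical names).
type StepState = { completed: string[] };
const checkpoints = new Map<string, StepState>();

const steps = ["lookup_customer", "fetch_order", "check_policy", "update_crm"];
const executed: string[] = []; // records every step execution, across attempts
let crmShouldFail = true; // simulates a one-time CRM timeout

function runStep(name: string): void {
  executed.push(name);
  if (name === "update_crm" && crmShouldFail) {
    crmShouldFail = false;
    throw new Error("CRM timeout");
  }
}

function runThread(threadId: string): StepState {
  // Start from the last checkpoint if one exists, otherwise from scratch.
  let state = checkpoints.get(threadId) ?? { completed: [] };
  for (const step of steps.slice(state.completed.length)) {
    runStep(step);
    state = { completed: [...state.completed, step] };
    checkpoints.set(threadId, state); // checkpoint at each step boundary
  }
  return state;
}

let failedOnce = false;
try {
  runThread("conv-1");
} catch {
  failedOnce = true;
}
const recovered = runThread("conv-1"); // resumes at update_crm only
```

The second call re-runs only the failed step, because the three earlier steps were already checkpointed under the same thread ID.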
What Explicit State Does for Observability
When agent state is a typed data structure updated at named nodes, monitoring becomes straightforward. Every state transition is a data change you can log, diff, and replay without custom instrumentation.
Your observability system knows exactly which node just ran, what the state looked like before it, and what changed afterward. When something goes wrong -- an agent took the wrong branch, a tool returned unexpected data, a routing function hit an unhandled case -- you have a complete audit trail, not just a stack trace from inside a recursive function.
// Stream one chunk per completed node; in "updates" mode each chunk is keyed by node name
const stream = await graph.stream(initialState, {
  configurable: { thread_id: conversationId },
  streamMode: "updates",
});
for await (const chunk of stream) {
  for (const [node, update] of Object.entries(chunk)) {
    telemetry.recordStateTransition({
      node,
      threadId: conversationId,
      stateUpdate: update,
      timestamp: Date.now(),
    });
  }
}
// For full before/after snapshots, iterate graph.getStateHistory(config) for the thread
This pairs naturally with the analytics you're running on conversation outcomes. The state machine trace tells you exactly what the agent did and which path it took. Conversation analytics tell you what outcome that produced. With both, you can identify not just that something went wrong but which specific path through the graph led there.
You can also build real-time dashboards on top of the state stream -- the monitoring view shows which node each active agent is currently in, how long it's been there, and what the pending state looks like. For a contact center running dozens of simultaneous calls, that's the difference between oversight and blindness.
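Diffing two snapshots is simple once state is plain data. The helper below is a rough sketch, not part of any library: `diffState` is a hypothetical name, and it compares values via `JSON.stringify`, which is fine for small JSON-like state but is sensitive to key order and ignores `undefined` fields.

```typescript
// Hypothetical helper: report which top-level state keys a node changed.
type Snapshot = Record<string, unknown>;
type Change = { key: string; before: unknown; after: unknown };

function diffState(before: Snapshot, after: Snapshot): Change[] {
  // Union of keys from both snapshots, in insertion order.
  const keys = Object.keys({ ...before, ...after });
  const changes: Change[] = [];
  for (const key of keys) {
    // Compare by JSON so arrays and nested objects are compared by value.
    if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) {
      changes.push({ key, before: before[key], after: after[key] });
    }
  }
  return changes;
}

const before = { messages: ["hi"], pendingToolCalls: [], finalResponse: null };
const after = {
  messages: ["hi", "looking that up"],
  pendingToolCalls: ["lookup_order"],
  finalResponse: null,
};
const changes = diffState(before, after);
```

Logging only the changed keys per node keeps traces readable. A reviewer sees that a node touched `messages` and `pendingToolCalls`, not a full dump of the state at every step.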
Testing State Machine Agents
Explicit state machines make agent testing tractable because you can test nodes in isolation and specific graph paths without running a live LLM for every test case.
Node tests are pure unit tests. A node takes state in and returns a state update. You mock the state, call the node, check the update:
test("executeTool updates messages with tool results", async () => {
  const state = {
    ...baseState,
    pendingToolCalls: [{ id: "tc_1", name: "lookup_order", args: { orderId: "ORD-9871" } }],
  };
  const update = await executeTool(state);
  expect(update.messages).toHaveLength(1);
  expect(update.messages[0].role).toBe("tool");
  expect(update.pendingToolCalls).toHaveLength(0);
});
Routing tests are even simpler -- they're just conditional function tests:
test("routes to request_human_review for high-value orders", () => {
  const state = { ...baseState, orderValue: 850, isEligible: true };
  expect(routeAfterEligibilityCheck(state)).toBe("request_human_review");
});
test("routes to auto_approve_return for low-value eligible orders", () => {
  const state = { ...baseState, orderValue: 45, isEligible: true };
  expect(routeAfterEligibilityCheck(state)).toBe("auto_approve_return");
});
Path tests verify specific end-to-end flows through the graph. You supply initial state and a mock LLM that returns predetermined outputs, then check the final state:
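test("escalation path: high-risk tool triggers human approval node", async () => {
  // Stub the LLM client the nodes call so the graph sees a predetermined tool call
  jest.spyOn(llm, "chat").mockResolvedValue({
    content: "",
    toolCalls: [{ id: "tc_1", name: "cancel_subscription", args: { userId: "u_123" } }],
  });
  const config = { configurable: { thread_id: "test-1" } };
  await graph.invoke({ messages: [customerMessage] }, config);
  // The run pauses at the interrupt, so inspect the checkpointed state instead of a final result
  const snapshot = await graph.getState(config);
  expect(snapshot.next).toContain("request_approval");
  expect(snapshot.values.pendingToolCalls[0].name).toBe("cancel_subscription");
});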
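Path tests depend on a model stub that returns predetermined outputs in a fixed order. One possible shape for that stub is sketched below. `scriptedLLM`, `tracePath`, and the response types are hypothetical, and the simplified loop only records which steps would run rather than executing real tools.

```typescript
// Hypothetical response shape and fake; swap in your client's real types.
type ToolCall = { name: string; args: Record<string, unknown> };
type LLMResponse = { content: string; toolCalls?: ToolCall[] };

// Returns a chat function that replays the given responses in order.
function scriptedLLM(script: LLMResponse[]) {
  let turn = 0;
  return async function chat(_messages: unknown[]): Promise<LLMResponse> {
    if (turn >= script.length) throw new Error("script exhausted");
    return script[turn++];
  };
}

const chat = scriptedLLM([
  { content: "", toolCalls: [{ name: "lookup_order", args: { orderId: "ORD-9871" } }] },
  { content: "Your order shipped yesterday." },
]);

// Drive a tool loop with the fake and record the path taken.
async function tracePath(): Promise<string[]> {
  const path: string[] = [];
  for (;;) {
    const response = await chat([]);
    path.push("call_llm");
    if (!response.toolCalls?.length) {
      path.push("finish");
      return path;
    }
    path.push("execute_tool");
  }
}
```

Throwing when the script runs out is deliberate. If the agent makes more model calls than the test expected, the test fails loudly instead of looping.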
You can scale this further with Chanl scenarios, which run realistic customer conversation simulations against your agent and check not just the final answer but the path through the state machine -- verifying that your escalation logic fires when it should, your tool loop terminates correctly, and your human-in-the-loop node activates on the right conditions.
Migrating an Existing Agent
You don't need to rewrite your agent to get state machine benefits. You can migrate incrementally:
1. Identify the state your agent already carries across iterations -- messages, retrieved context, tool results, routing flags. Collect that into a typed object.
2. Extract each major step into a named function that takes state in and returns a partial state update. This is usually a refactor, not a rewrite.
3. Pull conditional logic out into routing functions. Every if (toolCalls.length) in your loop becomes a routing function you can test independently.
4. Add a checkpointer last. Once nodes are named and state is typed, checkpointing is a configuration change, not a code change.
The hardest part is step one -- identifying what your implicit state actually is. If you've got a monolithic agent function, spend an hour mapping what it carries forward from one iteration to the next. That map is your state schema.
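As a rough picture of what the first three steps produce, here is a sketch of a loop after a first migration pass. The loop is still there, but its body is made of named functions and a routing function. All names are hypothetical, and `callModel` is a deterministic stand-in for a real LLM call so the example runs on its own.

```typescript
// Hypothetical shapes; the fake model stands in for a real LLM client.
type Msg = { role: "user" | "assistant" | "tool"; content: string };
// Step 1: the state the old loop carried implicitly, now a typed object.
type LoopState = { messages: Msg[]; pendingTools: string[]; done: boolean };
type Update = Partial<LoopState>;

// Step 2: each stage of the old loop becomes a named function returning an update.
function callModel(state: LoopState): Update {
  const needsLookup = !state.messages.some((m) => m.role === "tool");
  return needsLookup
    ? { pendingTools: ["lookup_order"] }
    : { messages: [...state.messages, { role: "assistant", content: "Order found." }] };
}

function runTools(state: LoopState): Update {
  const results = state.pendingTools.map((t): Msg => ({ role: "tool", content: `${t}: ok` }));
  return { messages: [...state.messages, ...results], pendingTools: [] };
}

// Step 3: the old `if (toolCalls.length)` becomes a routing function.
function route(state: LoopState): "run_tools" | "finish" {
  return state.pendingTools.length > 0 ? "run_tools" : "finish";
}

// The loop survives the first pass of the migration, but its body is now explicit.
function runMigrated(initial: LoopState): LoopState {
  let state = initial;
  while (!state.done) {
    state = { ...state, ...callModel(state) };
    state = route(state) === "run_tools"
      ? { ...state, ...runTools(state) }
      : { ...state, done: true };
  }
  return state;
}

const migrated = runMigrated({
  messages: [{ role: "user", content: "Where is my order?" }],
  pendingTools: [],
  done: false,
});
```

From here, replacing the `while` loop with a graph definition is mostly mechanical, since the nodes and the routing function already exist.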
You can read about the broader journey from prototype to production -- including when to refactor and when to rebuild -- in our guide on production agent reliability. State machines are one layer of a reliability stack that also includes circuit breakers, rate limiting, and fallback chains.
What You Get
The payoff for making your state machine explicit is straightforward:
- Debuggability: every step is named and every state change is inspectable
- Testability: nodes and routing functions are pure enough to unit test
- Fault tolerance: checkpointing lets you resume mid-run instead of starting over
- Human oversight: the human-in-the-loop pattern is a first-class citizen, not an afterthought
- Observability: state transitions are structured data, not log strings
The agent loop you've been writing is already a state machine. Build it explicitly, connect your observability stack to the state transition stream, and monitor every active run in real time. That's the loop that lets you improve agents with confidence rather than hope. Chanl's monitoring tooling is built to consume structured agent traces -- the kind that come naturally from explicit state machines. Build the graph, and your observability comes with it.