A customer calls about a bill she doesn't recognize. She explains the situation clearly: a duplicate charge appeared after she changed her payment method last week. The voicebot asks her to say "billing," "account," or "support." She says "billing." The bot routes her to a menu about paying her bill. She says "dispute." The bot doesn't understand. She says "charge I don't recognize." The bot offers to read her current balance. She hangs up and calls back, this time pressing zero repeatedly until she reaches a human.
That interaction cost the company a support call, a frustrated customer, and whatever trust the automated system was supposed to build. The voicebot did exactly what it was designed to do. The problem is that what it was designed to do has nothing to do with what the customer actually needed.
This is the fundamental problem with decision trees. Not that they're badly built. That they can only handle the conversations their designers imagined in advance.
How Decision Trees Actually Break
The typical explanation for why scripted bots fail is "they can't handle natural language." That's true, but it's too vague to be useful. The failure modes are specific, predictable, and worth understanding individually, because each one points to a different architectural limitation.
The keyword trap
Decision trees route conversations by matching keywords or short phrases to predefined categories. This works when the customer says "billing" and you want to route to billing. It fails the moment someone says something that's semantically correct but lexically unexpected.
"I think you guys charged me twice" should route to billing disputes. But the keywords "charged" and "twice" don't appear in the tree. So the bot either asks the customer to rephrase (which feels insulting) or picks the closest match (which is usually wrong).
The gap between how people actually talk and how designers expect them to talk is enormous. People use slang, incomplete sentences, indirect phrasing, and regional expressions. A tree that handles "cancel my subscription" won't necessarily handle "I need to stop this thing" or "just make it go away." Same intent. Different words. Broken bot.
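The keyword trap is easy to see in code. Here is a minimal sketch of a literal-match router of the kind described above; the keyword table and branch names are illustrative, not any real IVR platform's API. The same intent routed correctly under one phrasing falls through entirely under another:

```python
# Minimal sketch of the keyword trap: a router that matches literal
# keywords fails on semantically equivalent phrasings.
KEYWORD_ROUTES = {
    "billing": "billing_menu",
    "cancel": "cancellation_flow",
    "balance": "balance_readout",
}

def route(utterance: str):
    """Return the first branch whose keyword appears in the utterance."""
    text = utterance.lower()
    for keyword, branch in KEYWORD_ROUTES.items():
        if keyword in text:
            return branch
    return None  # no match: the bot asks the customer to rephrase

print(route("cancel my subscription"))     # matches the "cancel" keyword
print(route("I need to stop this thing"))  # same intent, no keyword, no route
```

No amount of adding keywords closes this gap, because the space of valid phrasings is open-ended while the table is finite.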
The single-intent assumption
Most decision trees are built to handle one request per conversation. The customer states a need, the bot processes it, the call ends. In reality, customers frequently bundle requests: "Can you check my balance and also tell me when my contract ends?" Or they pivot mid-conversation: "Actually, before you do that, I have a question about my last payment."
A scripted bot has no mechanism for handling this. It's in the "check balance" branch. There's no path from that branch to "contract end date." The customer either restarts the conversation or gives up.
The emotional dead end
When a customer is frustrated, confused, or upset, they don't follow the script. They interrupt. They repeat themselves. They express emotion before stating their need: "This is ridiculous. I've been trying to fix this for three days and nobody can help me."
A decision tree sees that sentence and matches nothing. There's no keyword for frustration. There's no branch for "customer is upset before stating intent." The bot asks the customer to choose from a menu, which makes the frustration worse. The call spirals.
The maintenance death spiral
Every time a new product launches, a policy changes, or a new type of customer request appears, someone has to manually add branches to the decision tree. Over months and years, trees become massive, fragile structures where adding one branch can break three others. The maintenance burden grows faster than the tree's capability.
Teams eventually stop updating the tree for edge cases because the risk of breaking existing flows outweighs the benefit. The tree fossilizes. New request types route to the wrong place or don't route at all.
What Agentic AI Actually Changes
The word "agentic" gets overused to the point of meaninglessness in marketing copy. Here's what it actually means at the architecture level, and why it addresses the specific failures above.
An agentic system doesn't follow a predetermined path through a decision tree. Instead, it does four things that trees fundamentally cannot:
It reasons over the full conversation context. Instead of matching keywords in the current utterance, an agentic system considers the entire conversation history, the customer's stated and implied needs, and the current state of any in-progress actions. When a customer says "I think you guys charged me twice," the agent reasons from the full sentence to the intent (billing dispute) without needing a keyword match.
It uses tools to take real action. A decision tree can route a call. An agentic AI can look up the customer's account, find the duplicate charge, check the refund policy, initiate the refund, and confirm it, all within a single conversation. With MCP-connected tools, the agent has direct access to CRM systems, order databases, scheduling APIs, and whatever else it needs to actually resolve the problem rather than just categorize it.
It handles multiple intents without breaking. When a customer asks two things in one sentence, the agent can address both. It doesn't need a branch for every combination of requests. It reasons about each intent, determines the order, and works through them. If the customer pivots mid-conversation, the agent adapts.
It adapts to emotional context. An agentic system can detect frustration and adjust its approach: acknowledging the customer's feelings, moving more quickly, offering to escalate proactively. This isn't sentiment analysis bolted onto a script. It's reasoning about how to pursue the goal (resolving the customer's issue) given the current emotional context.
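The four capabilities above can be sketched in a single loop. This is a hypothetical stand-in, not a real LLM or vendor API: `fake_llm` simulates a model that plans over the full conversation history, and the tool names (`check_balance`, `contract_end_date`) are invented for illustration. The point is structural: every decision sees the whole conversation, and multiple intents in one utterance don't need a dedicated branch.

```python
# Hypothetical sketch of an agentic turn: reason over full history,
# plan tool calls, execute each intent. All names here are illustrative.

def fake_llm(history):
    """Stand-in for a real LLM call: plans from the full conversation."""
    last = history[-1]["content"].lower()
    plan = []
    if "balance" in last:
        plan.append("check_balance")
    if "contract" in last:
        plan.append("contract_end_date")
    return plan

TOOLS = {
    "check_balance": lambda: "$42.10",
    "contract_end_date": lambda: "2026-03-01",
}

def agent_turn(history, user_message):
    history.append({"role": "user", "content": user_message})
    results = {name: TOOLS[name]() for name in fake_llm(history)}
    reply = "; ".join(f"{k}: {v}" for k, v in results.items())
    history.append({"role": "assistant", "content": reply})
    return reply

history = []
print(agent_turn(history, "Check my balance and tell me when my contract ends"))
# both intents handled in one turn, no combination branch required
```

A real system replaces `fake_llm` with a model call and `TOOLS` with governed integrations, but the shape of the loop is the same.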
Scripted vs. Agentic: A Direct Comparison
| Dimension | Scripted Decision Tree | Agentic AI |
|---|---|---|
| How it handles input | Matches keywords to predefined branches | Reasons over full context and conversation history |
| Multi-step problems | One task per branch; can't chain actions | Plans and executes multi-step workflows autonomously |
| Novel requests | Fails or misroutes | Reasons from first principles using available context |
| Emotional customers | No mechanism for emotional awareness | Detects and adapts to frustration, confusion, urgency |
| Tool use | None (can only route) | Calls APIs, databases, CRMs to take real action |
| Multi-intent requests | Handles one intent per conversation | Addresses multiple intents sequentially |
| Maintenance | Manual branch updates for every new scenario | Adapts via prompt and knowledge base updates |
| Memory | Resets each conversation | Can retain context across sessions |
| Failure mode | Misroutes or loops | May take wrong action (requires guardrails) |
| Best suited for | Simple, predictable, high-volume routing | Complex, dynamic, resolution-oriented interactions |
The important thing about this comparison is the last two rows. Agentic AI isn't risk-free. A decision tree that misroutes is annoying but predictable. An agent that takes the wrong action (refunding the wrong amount, updating the wrong account) can cause real damage. This is why testing and guardrails matter more, not less, with agentic systems.
Where Scripts Still Win (and the Hybrid That Makes Sense)
The honest answer is that decision trees aren't always wrong. They're wrong for conversations that require reasoning, adaptation, or action. For a narrow set of use cases, deterministic logic is actually preferable.
Regulatory disclosures. When the law requires you to read specific language to a customer (consent forms, privacy disclosures, terms of service), you want deterministic execution, not creative reasoning. The exact wording matters. An agent might paraphrase. A script won't.
Payment authentication. The sequence of verifying identity, confirming payment details, and processing authorization has strict requirements. A deterministic flow with specific validation steps is more reliable than an agent reasoning about "what to do next."
Simple, high-volume routing. If 80% of your calls are "press 1 for sales, press 2 for support" and customers are fine with that experience, a scripted IVR works. The ROI of an agentic system only appears when the conversation requires more than routing.
The practical architecture for most teams is a hybrid. An agentic AI handles the conversation: understanding the customer, reasoning about intent, managing emotional context. Scripted logic governs the specific gates where determinism matters: authentication steps, disclosure requirements, compliance checkpoints. The agent handles the conversation; the script handles the rules.
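The hybrid split can be expressed as a simple dispatcher. This is a minimal sketch under assumed names: the gate set and the `run_script` / `run_agent` handlers are illustrative placeholders, not a real framework.

```python
# Sketch of the hybrid: deterministic gates own compliance-critical
# steps, the agent owns everything else. Names are illustrative.
DETERMINISTIC_GATES = {"authentication", "payment_authorization", "disclosure"}

def run_script(step: str) -> str:
    return f"script:{step}"   # fixed wording, fixed validation sequence

def run_agent(step: str) -> str:
    return f"agent:{step}"    # free-form reasoning over context

def handle(step: str) -> str:
    """Route compliance gates to scripts, everything else to the agent."""
    return run_script(step) if step in DETERMINISTIC_GATES else run_agent(step)

print(handle("disclosure"))       # deterministic gate: exact wording preserved
print(handle("billing_dispute"))  # agent-owned: reasoning and adaptation
```

The useful property is that the gate set is an explicit, auditable list rather than behavior you hope the agent exhibits.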
The Five Failure Modes You Have to Test For
Replacing a decision tree with an agentic AI introduces new failure modes that decision trees never had. Teams that skip testing for these discover them in production, usually through customer complaints.
Confident wrong answers. An agent that doesn't know something will sometimes generate a fluent, authoritative response that's completely incorrect. Unlike a decision tree (which simply fails to route), an agent can actively mislead. This is the single highest-risk failure mode and the hardest to catch without structured testing.
Tool misuse. An agent with access to a refund tool might initiate a refund when the customer was only asking a question about charges. Tool access creates the possibility of unintended actions. Every tool needs guardrails: confirmation steps, amount limits, reversibility checks.
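A guardrail on a tool like a refund doesn't need to be elaborate to be effective. Here is an illustrative wrapper, assuming a hypothetical refund tool and an invented `$50` auto-refund limit, that enforces two of the checks named above: an explicit confirmation step and an amount cap.

```python
# Illustrative guardrail wrapper for a refund tool: confirmation step
# plus amount limit before any irreversible action. Limit is invented.
MAX_AUTO_REFUND = 50.00

def guarded_refund(amount: float, customer_confirmed: bool) -> str:
    if not customer_confirmed:
        return "blocked: ask the customer to confirm the refund first"
    if amount > MAX_AUTO_REFUND:
        return "escalated: amount exceeds auto-refund limit"
    return f"refunded ${amount:.2f}"

print(guarded_refund(19.99, customer_confirmed=False))  # question, not a request
print(guarded_refund(19.99, customer_confirmed=True))   # within guardrails
print(guarded_refund(120.00, customer_confirmed=True))  # over limit: human review
```

The guardrail lives outside the model, so a misjudged tool call is blocked deterministically rather than depending on the agent's own reasoning.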
Context drift. In long conversations, agents can lose track of the original intent. A customer who started asking about a billing dispute might end up in a conversation about their plan features because the agent followed a tangent. Context management is an active design challenge, not something that happens automatically.
Escalation timing. When should an agent hand off to a human? Too early, and you lose the automation benefit. Too late, and the customer has wasted time with an agent that can't help. The threshold is different for every use case and needs to be tested explicitly.
Hallucinated capabilities. An agent might offer to do things it can't actually do: "I'll escalate this to your account manager" when no such escalation path exists, or "I've scheduled a technician visit" when it has no access to scheduling tools. Testing for this requires deliberately probing the edges of the agent's actual capabilities.
Scenario testing is the mechanism for catching these before they reach customers. The pattern is straightforward: simulate conversations that are designed to trigger each failure mode, then evaluate whether the agent handled them correctly. Not just whether it produced a plausible response, but whether it took the right action.
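A minimal version of that pattern looks like this. The agent under test is a stub and the scenario is invented; the structural point is that each case targets one named failure mode and asserts on the action taken, not on whether the reply sounded plausible.

```python
# Minimal scenario-test sketch: each case probes one failure mode and
# checks the action, not the prose. The agent here is a stub; in
# practice you would call your real agent at this point.
def agent_under_test(utterance: str) -> dict:
    # Stub of a well-behaved agent that refuses capabilities it lacks.
    if "technician" in utterance.lower():
        return {"action": "escalate", "reply": "I can't schedule visits."}
    return {"action": "answer", "reply": "..."}

SCENARIOS = [
    # (failure mode probed, input, expected action)
    ("hallucinated_capability", "Can you send a technician tomorrow?", "escalate"),
    ("confident_wrong_answer", "What's your refund policy?", "answer"),
]

def run_suite():
    return [
        mode
        for mode, utterance, expected in SCENARIOS
        if agent_under_test(utterance)["action"] != expected
    ]

print(run_suite())  # empty list means every scenario passed
```

Growing the suite is then a matter of adding cases for each failure mode listed above, not building new test infrastructure.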
For a deeper treatment of how to structure this testing, see the AI agent evaluation framework.
Making the Transition Without a Catastrophic Cutover
The teams that struggle most with the transition from scripted to agentic aren't the ones with bad agents. They're the ones who try to replace everything at once. A big-bang cutover from a decision tree to an agentic system is the highest-risk approach and almost never necessary.
Phase 1: Pick one use case and prove it
Start with a single high-volume, well-understood conversation type where the decision tree is already failing. Billing disputes, order status inquiries, and appointment scheduling are common starting points because they're frequent enough to generate data quickly and well-defined enough to measure improvement clearly.
Deploy the agentic system alongside the existing tree. Route a small percentage of calls (5-10%) to the agent and compare first-contact resolution, escalation rate, and customer satisfaction against the tree for the same call type.
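The traffic split itself is a few lines. A common approach, sketched here under assumed parameters, is to hash a stable caller identifier so the same customer lands in the same arm on repeat calls, which keeps the comparison clean:

```python
# Sketch of a 10% traffic split: hash the caller ID so assignment is
# deterministic across repeat calls. The 0.10 share is the assumption
# from the rollout plan, not a fixed rule.
import hashlib

def assign_arm(caller_id: str, agent_share: float = 0.10) -> str:
    digest = hashlib.sha256(caller_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "agent" if bucket < agent_share else "tree"

print(assign_arm("caller-555-0132"))  # same caller, same arm, every call
```

Deterministic assignment matters because a customer who bounces between the agent and the tree on repeat calls contaminates both arms' metrics.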
The metric that matters here is first-contact resolution, not containment. A scripted bot that "contains" a call (the customer doesn't hang up) but doesn't resolve the issue is worse than an agent that escalates with full context. Measure whether the customer's problem actually got solved.
Phase 2: Expand based on evidence
Once the first use case is provably better on real metrics, expand to the next highest-impact conversation type. Each expansion should be evidence-based: data from the previous use case informs how to deploy the next one.
This is also where integration work gets real. The first use case might only need access to one backend system (account lookup, order database). The second might require CRM writes, scheduling APIs, or payment processing. Each integration needs its own testing and validation.
Phase 3: Run the hybrid
For most organizations, the endgame isn't "replace every script." It's a hybrid architecture where agentic AI handles conversations that require reasoning and action, while deterministic logic governs the gates that require precision. Authentication steps stay scripted. Compliance disclosures stay scripted. Everything else migrates to the agent.
The agent owns the conversation. The scripts own the rules.
Phase 4: Close the feedback loop
The transition doesn't end at deployment. An agentic system is only as good as the feedback loop around it. Every call generates data about what the agent did well and where it failed. That data needs to flow back into prompt improvements, tool configuration changes, and knowledge base updates.
Teams that close this loop see their agents get measurably better every week. Teams that don't close it see their agents plateau at their day-one performance, which is eventually no better than the decision tree they replaced.
The Infrastructure That Makes Agents Work
There's a common misconception that the transition from scripted to agentic is primarily about choosing the right LLM. The model matters, but the infrastructure around the model determines whether the agent actually works in production.
An agent that can reason about a billing dispute but has no access to the billing system is useless. An agent that can take action but has no monitoring for incorrect actions is dangerous. An agent that handles calls well today but has no mechanism for improving is going to fall behind.
The infrastructure stack for a production agentic system includes:
Tools and integrations so the agent can actually do things, not just talk about them. CRM lookups, order management, scheduling, payment processing. Each tool needs to be connected, tested, and governed with appropriate guardrails.
A knowledge base so the agent has accurate, current information. Product details, policies, procedures, FAQs. This is different from what's in the system prompt. The knowledge base is where domain knowledge lives; the system prompt is where behavior is defined.
Testing infrastructure so you can simulate conversations before they reach customers. Adversarial testing, persona variation, edge case probing. The more capable the agent, the more important it is to test the boundaries of that capability.
Monitoring and scorecards so you know how the agent is performing in production. Not just whether calls resolve, but whether the agent's reasoning was sound, its actions were appropriate, and its escalations were timely.
The model generates the intelligence. The infrastructure makes it useful and safe.
The Real Question
The transition from scripted to agentic isn't a technology upgrade. It's a shift in how you think about customer interactions. Decision trees treat conversations as paths to be followed. Agentic AI treats them as problems to be solved.
The scripted approach worked when customer requests were predictable and limited. It breaks when customers need real resolution, not just routing. And that's most of the time.
The question for CX teams isn't whether to make the transition. It's whether to make it now, deliberately, with proper testing and a phased rollout. Or to make it later, reactively, after the gap between what customers expect and what scripted systems deliver becomes impossible to ignore.
Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.