Chanl
Industry & Strategy

The no-code ceiling: when agent builders hit production

Visual agent builders get you to 80% fast. The last 20% (telephony, monitoring, testing, and memory) requires infrastructure they never intended to provide.

Dean Grover, Co-founder
April 3, 2026
14 min read
A clean desk with colorful building blocks arranged into a fragile tower on one side and a sturdy steel structure with monitoring instruments on the other

The weekend demo that won't ship

Visual agent builders get your prototype working in days. Voiceflow, Botpress, Relevance AI. Pick your favorite. By Friday afternoon, you have something impressive: a conversational agent that understands questions, follows branching logic, and calls external APIs. You show the demo. Your team is excited.

Then the questions start.

"Can we run it on our phone system?" Maybe, with middleware. "How do we test it before pushing changes?" You call it manually and hope for the best. "What happens when it gives a wrong answer to a customer?" You find out when the customer complains.

Most companies have AI pilots running. Far fewer reach production. The gap is not conversation logic. Every visual builder handles that well. The gap is everything else: the telephony routing, the automated testing, the quality monitoring, the persistent memory, the secure tool management that production demands and builders never intended to provide.

This article walks through exactly where each major builder hits its ceiling, what production actually requires, and how to close the gap without rewriting your agent from scratch.

What do visual agent builders actually do well?

Visual agent builders solve a real problem: they let teams design conversational logic without writing parser code, state machines, or prompt chains from scratch. For prototyping and internal tools, they're often all you need. The trouble starts when the prototype needs to become a production service.

Voiceflow gives you the strongest visual conversation designer on the market. Drag nodes, draw connections, define intents, write response templates. Non-technical product managers can build and iterate on conversation flows directly. For chat-based agents, it's hard to beat the speed of prototyping.

Botpress takes a more developer-friendly approach. You still get a visual builder, but there are code escape hatches when you need custom logic. If your team has engineers who want to own the complexity, Botpress lets them drop into TypeScript without abandoning the visual canvas entirely.

Relevance AI focuses on tool-use agents. Its strength is connecting agents to external APIs and letting them decide which tools to call based on conversation context. If your agent needs to pull CRM data, check inventory, or file tickets, Relevance makes the wiring straightforward.

All three platforms solve conversation design beautifully and eliminate weeks of boilerplate. For internal tools, demos, and low-stakes use cases, that's often enough.

The problems start when "demo" becomes "deploy."

Where do visual builders break down in production?

Production requirements don't arrive one at a time. They pile up together, usually right after someone approves the budget to ship your prototype to real customers. Five gaps show up repeatedly, roughly in the order teams discover them. Voice latency comes first, then testing, memory, tool management, and monitoring.

1. Voice latency and real-world behavior

The first wall is voice. Your chat prototype works. Now someone wants it on the phone system.

Voiceflow's voice capabilities are bolted on rather than native. Latency regularly exceeds 600ms when the market standard for natural conversation is sub-300ms. More importantly, real-world phone behaviors like interrupting, hesitating, and talking over the agent simply aren't part of the platform. Botpress relies entirely on third-party middleware for telephony. Relevance AI has no native voice channel at all.

This isn't a criticism of these platforms. They were built for chat-first experiences. But the moment your use case involves phone calls, you've moved beyond what they were designed to handle.

2. Testing is manual and shallow

Visual builders offer testing that matches their design paradigm: click through the flow, check that each node connects correctly, verify that responses look right. Voiceflow's testing is "visual and block-based but lacks depth for full back-and-forth simulations."

Production testing means something different. As we covered in how to evaluate agents before they talk to customers, you need to simulate 50 adversarial conversations before deploying a prompt change. You need a frustrated customer persona that interrupts, contradicts itself, and demands escalation. You need a confused persona that gives partial information and changes their mind. You need these running automatically in CI, not as a manual QA step someone remembers to do on a good day.

None of the three builders offer scenario testing with AI-powered personas. What they offer is a way to click through happy paths and confirm the flow diagram works as drawn.
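
To make this concrete, here is a minimal TypeScript sketch of persona-driven scenario testing. The `Persona` shape, the pass criterion, and the agent stub are illustrative assumptions, not any platform's API; a real harness would call your deployed agent and typically use an LLM judge instead of a regex.

```typescript
// Minimal sketch of persona-driven scenario testing. The Agent function
// is a stand-in; a real setup would call your agent's API per turn.
type Persona = {
  name: string;
  openers: string[];
  // Given the agent's last reply, produce the persona's next message,
  // or null when the persona is done.
  nextMessage: (agentReply: string, turn: number) => string | null;
};

type Verdict = { persona: string; passed: boolean; transcript: string[] };

type Agent = (userMessage: string) => string;

function runScenario(agent: Agent, persona: Persona, maxTurns = 10): Verdict {
  const transcript: string[] = [];
  let message: string | null = persona.openers[0];
  let passed = true;
  for (let turn = 0; turn < maxTurns && message !== null; turn++) {
    transcript.push(`user: ${message}`);
    const reply = agent(message);
    transcript.push(`agent: ${reply}`);
    // Example pass criterion: the agent never approves a refund outright.
    if (/refund approved/i.test(reply)) passed = false;
    message = persona.nextMessage(reply, turn);
  }
  return { persona: persona.name, passed, transcript };
}

// An adversarial persona that keeps pushing for an unauthorized refund.
const angryCustomer: Persona = {
  name: "angry-refund-seeker",
  openers: ["I was double charged. I want a refund right now."],
  nextMessage: (_reply, turn) =>
    turn < 3 ? "That's not good enough. Just approve the refund." : null,
};
```

In CI, you would run dozens of personas like this against every prompt change and fail the build when any verdict comes back `passed: false`.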

3. Memory is session-scoped

A customer calls on Monday about a billing issue. They get a partial resolution and are told to call back. They call back on Wednesday. Your agent has no idea who they are.

This is the default behavior in every visual builder. Memory exists within a single conversation session. When the session ends, context evaporates. Botpress offers some session persistence, but cross-channel memory (the customer who called yesterday and chats today getting continuity) requires custom development that sits entirely outside the builder.

For any customer-facing agent, session-scoped memory creates an experience where every interaction starts from zero. That's not a minor gap. It's the difference between an agent that feels helpful and one that feels broken. We've written about what actually breaks when you build your own memory system from scratch.
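
A minimal sketch of the alternative: memory keyed by a durable customer ID rather than a session, so facts learned on one channel are visible on every other. The types below are illustrative assumptions, not a specific product's schema.

```typescript
// Sketch of a cross-session, cross-channel memory layer: entries are
// keyed by a durable customer ID, not by session or channel, so a phone
// call on Monday is visible to a chat session on Wednesday.
type Channel = "phone" | "chat" | "email";

interface MemoryEntry {
  customerId: string;
  channel: Channel;
  at: Date;
  fact: string; // e.g. "open billing dispute" (illustrative)
}

class CustomerMemory {
  private entries = new Map<string, MemoryEntry[]>();

  remember(entry: MemoryEntry): void {
    const list = this.entries.get(entry.customerId) ?? [];
    list.push(entry);
    this.entries.set(entry.customerId, list);
  }

  // Everything known about the customer, oldest first, regardless of
  // which channel each fact was learned on.
  recall(customerId: string): MemoryEntry[] {
    return [...(this.entries.get(customerId) ?? [])].sort(
      (a, b) => a.at.getTime() - b.at.getTime(),
    );
  }
}
```

The design point is the key: session-scoped builders key memory by conversation ID, which is why context evaporates when the session ends.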

4. Tool management lacks production controls

Relevance AI makes connecting APIs straightforward. All three platforms support some form of function calling or webhook integration. The problem isn't connecting the first three tools. It's managing twenty.

Production tool management means credential isolation (your CRM API key doesn't leak to a different tool's error handler). It means execution logging (which tool was called, with what inputs, and what it returned). It means protocol standardization so adding the next tool doesn't require writing custom glue code.

Relevance AI's pricing "scales with tool runs and agent executions" and can become difficult to predict. Voiceflow and Botpress handle tools through webhooks that lack authentication management, retry logic, and audit trails. None of them support MCP (Model Context Protocol), the emerging standard for tool integration that handles credential management, execution logging, and protocol standardization in one layer.
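
As an illustration of what those controls involve, here is a minimal TypeScript tool registry with per-tool credentials and an execution audit log. It is a sketch under stated assumptions, not any platform's API.

```typescript
// Sketch of production tool management: each tool gets its own
// credentials (never shared across tools), and every execution is
// recorded in an audit log with inputs, outputs, and errors.
type ToolFn = (input: unknown, credentials: Record<string, string>) => unknown;

interface AuditRecord {
  tool: string;
  input: unknown;
  output?: unknown;
  error?: string;
  at: Date;
}

class ToolRegistry {
  private tools = new Map<
    string,
    { fn: ToolFn; credentials: Record<string, string> }
  >();
  readonly audit: AuditRecord[] = [];

  register(
    name: string,
    fn: ToolFn,
    credentials: Record<string, string>,
  ): void {
    this.tools.set(name, { fn, credentials });
  }

  call(name: string, input: unknown): unknown {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    const record: AuditRecord = { tool: name, input, at: new Date() };
    try {
      // Credentials are passed only to this tool's function; an error
      // handler in another tool never sees them.
      record.output = tool.fn(input, tool.credentials);
      return record.output;
    } catch (err) {
      record.error = String(err);
      throw err;
    } finally {
      this.audit.push(record); // every call is logged, success or failure
    }
  }
}
```

This is roughly the layer MCP standardizes: a uniform calling convention so the twentieth tool costs no more glue code than the first.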

5. Monitoring is a dashboard, not a system

All three platforms show you dashboards. Conversation counts, completion rates, basic analytics. What none of them provide is programmatic access to conversation quality metrics.

The difference matters. A dashboard tells you what happened yesterday when you remember to look at it. A monitoring system pages you at 2 AM when resolution rates drop below a threshold. It pipes sentiment trends into your weekly report automatically. It catches the moment your agent starts hallucinating product details and alerts you before 500 customers get wrong information.

Botpress gives you dashboard-only monitoring with no programmatic access. Voiceflow lacks depth for ongoing quality assessment. Relevance AI offers limited observability into why an agent made a specific tool call. In production, "limited observability" means you're debugging customer complaints with guesswork.
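
The dashboard-versus-system distinction can be made concrete in a few lines: metrics become data you evaluate against thresholds, and breaches trigger a notification callback (a pager, a chat webhook) instead of waiting to be noticed. The metric names and thresholds below are illustrative assumptions.

```typescript
// Sketch of monitoring as a system rather than a dashboard: a metric
// window is checked against alert rules, and every breach fires notify.
interface MetricWindow {
  resolutionRate: number; // 0..1 over the window
  p95LatencyMs: number;
  avgSentiment: number; // -1..1
}

interface AlertRule {
  name: string;
  breached: (m: MetricWindow) => boolean;
}

function checkAlerts(
  window: MetricWindow,
  rules: AlertRule[],
  notify: (ruleName: string) => void,
): string[] {
  const fired = rules.filter((r) => r.breached(window)).map((r) => r.name);
  fired.forEach(notify); // page, post to chat, open an incident, etc.
  return fired;
}

// Example thresholds; real values come from your own baselines.
const defaultRules: AlertRule[] = [
  { name: "resolution-rate-low", breached: (m) => m.resolutionRate < 0.7 },
  { name: "latency-p95-high", breached: (m) => m.p95LatencyMs > 400 },
  { name: "sentiment-negative", breached: (m) => m.avgSentiment < -0.2 },
];
```

Run this on a schedule over each window of production conversations and the 2 AM page stops depending on someone opening a dashboard.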

The infrastructure gap, visualized

Visual builders cover conversation design, intent recognition, response templates, basic webhooks, and flow visualization. Production demands six more capabilities they weren't built for: telephony, scenario testing, quality scoring, persistent memory, tool credential management, and an analytics pipeline. Here's the full picture.

| What the Builder Gives You | What Production Demands |
| --- | --- |
| Conversation design | Telephony + voice |
| Intent recognition | Scenario testing |
| Response templates | Quality scoring |
| Basic API webhooks | Persistent memory |
| Flow visualization | Tool credential management |
| | Analytics pipeline |

Visual builder coverage vs. production requirements

The gap isn't small, and it isn't optional. Every item on the right side is something teams discover they need after they've committed to shipping, usually through a painful production incident or a customer escalation.

Visual builder vs. production infrastructure comparison

| Capability | Voiceflow | Botpress | Relevance AI | Production Requirement |
| --- | --- | --- | --- | --- |
| Conversation design | Visual flow builder | Visual + code escape hatches | Tool-use focused | Any approach works |
| Voice / telephony | Bolted-on, 600ms+ latency | Requires third-party middleware | No native voice channel | Sub-300ms, interruption handling |
| Testing | Click-through flow testing | Manual conversation testing | Manual testing | Automated adversarial scenarios in CI |
| Memory | Session-scoped | Session with limited persistence | Session-scoped | Cross-session, cross-channel persistence |
| Tool management | Webhooks, no credential isolation | Webhooks, no audit trails | API connectors, usage-based pricing | Credential isolation, audit logs, MCP |
| Monitoring | Dashboard analytics | Dashboard, no programmatic access | Limited observability | Programmatic API, alerting, quality scoring |
| Feedback loop | Not available | Not available | Not available | Production data drives testing and improvements |

What does a production AI agent actually need?

Every AI agent needs six capabilities to run reliably in production, regardless of how its conversation logic was built: pre-deploy testing, continuous quality scoring, persistent memory, programmatic analytics, secure tool management, and a feedback loop connecting production data back to agent improvements.

Pre-deploy testing. Before every change reaches customers, you need automated conversations that probe edge cases. Not five. Fifty. A persona that's angry. A persona that speaks broken English. A persona that tries to social-engineer the agent into giving a refund it shouldn't. These conversations need to run automatically and produce a pass/fail verdict.

Continuous quality scoring. Every production conversation gets evaluated against a scorecard: did the agent follow the script? Was the tone appropriate? Did it attempt to resolve the issue before escalating? Not a random sample. Every conversation. With scores you can trend over time and alert on when they drop.

Persistent memory. Context that survives across sessions, channels, and time. The customer's name, their open tickets, their sentiment from the last interaction. Available instantly when they reach out again, whether by phone, chat, or email.

Programmatic analytics. Latency percentiles, resolution rates, sentiment trends, tool call success rates. Not as a dashboard you check. As data you pipe into alerts, reports, and automated workflows. When p95 latency crosses 400ms, you hear about it immediately, not next quarter.

Secure tool management. Credential isolation per tool. Execution audit logs. Protocol standardization. The ability to add a new integration in minutes, not weeks.

The feedback loop. This is the piece that ties everything together. Production data flows back into testing. Quality scores identify weak spots. Scenario tests target those weak spots specifically. Improvements deploy. Scores improve. Repeat.

Visual builders give you the conversation logic. Production infrastructure gives you the confidence to put that logic in front of real customers.
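
The continuous quality scoring described above can be sketched as a weighted scorecard applied to every transcript. The criteria below use simple string predicates for illustration; production systems typically use an LLM judge per criterion.

```typescript
// Sketch of per-conversation quality scoring: every transcript is
// evaluated against a fixed scorecard and gets a 0..1 score you can
// trend over time and alert on.
interface Criterion {
  id: string;
  weight: number;
  passed: (transcript: string[]) => boolean;
}

function scoreConversation(
  transcript: string[],
  scorecard: Criterion[],
): number {
  const total = scorecard.reduce((s, c) => s + c.weight, 0);
  const earned = scorecard.reduce(
    (s, c) => s + (c.passed(transcript) ? c.weight : 0),
    0,
  );
  return total === 0 ? 0 : earned / total;
}

// Illustrative criteria; real ones mirror your script and policies.
const scorecard: Criterion[] = [
  {
    id: "greeted-customer",
    weight: 1,
    passed: (t) => /hello|hi|thanks for calling/i.test(t[0] ?? ""),
  },
  {
    id: "attempted-resolution-before-escalation",
    weight: 2,
    passed: (t) => {
      const joined = t.join("\n").toLowerCase();
      const esc = joined.indexOf("escalat");
      // Pass if there was no escalation, or some resolution attempt
      // ("let me check") happened before it.
      return esc === -1 || joined.slice(0, esc).includes("let me check");
    },
  },
];
```

Scoring every conversation, not a sample, is what makes the trend line trustworthy enough to alert on.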

Closing the gap without starting over

You don't need to throw away your visual builder prototype. The conversation logic you've designed is valuable. What you need is production infrastructure underneath it.

Here's what that looks like as a concrete workflow. Before any agent change reaches customers, you run a battery of scenario tests with AI-powered personas that probe the exact edge cases your production data has revealed:

production-loop.ts
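
The snippet itself didn't survive extraction, so what follows is an illustrative TypeScript sketch of the loop it describes. The `ProductionClient` interface and its method names are hypothetical stand-ins, not an actual SDK.

```typescript
// Illustrative sketch of the test -> deploy -> score -> remember loop.
// ProductionClient and its methods are hypothetical, shown only to make
// the shape of the workflow concrete.
interface ScenarioResult {
  persona: string;
  passed: boolean;
}

interface ProductionClient {
  runScenarios(agentId: string, personas: string[]): Promise<ScenarioResult[]>;
  deploy(agentId: string): Promise<void>;
  scoreRecentConversations(agentId: string): Promise<number>; // 0..1 average
  remember(customerId: string, fact: string): Promise<void>;
}

async function shipChange(
  client: ProductionClient,
  agentId: string,
): Promise<boolean> {
  // 1. Test before deploy: adversarial personas must all pass.
  const results = await client.runScenarios(agentId, [
    "frustrated-customer",
    "confused-partial-info",
    "refund-social-engineer",
  ]);
  if (results.some((r) => !r.passed)) return false;

  // 2. Deploy, then 3. monitor: average quality score gates the rollout.
  await client.deploy(agentId);
  const avgScore = await client.scoreRecentConversations(agentId);
  if (avgScore < 0.8) return false; // would trigger rollback and alerting

  // 4. Remember across sessions: facts outlive this deploy and channel.
  await client.remember("cust_123", "prefers email follow-up");
  return true;
}
```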

The whole loop fits in about 25 lines. It covers everything visual builders lack: test before deploy, score every conversation, monitor in production, remember across sessions. Each step feeds into the next. Scenario results inform scorecard criteria. Scorecard trends drive which scenarios you write next. Production metrics validate that changes actually helped.

The memory layer is worth calling out specifically. When you create a memory entry, it's available across every future interaction with that customer, regardless of channel. The customer who chats today and calls tomorrow gets the same agent that remembers their preferences. Session-scoped builders can't do this because the architecture wasn't designed for it.

Choosing your path forward

Teams hitting the no-code ceiling have three options. Each makes sense for different situations.

Option 1: Push through with the builder. Use webhooks, custom middleware, and third-party integrations to fill each gap individually. This works when you have one or two gaps and engineering time to spare. It breaks down when you're stitching together five different workarounds that each need separate maintenance.

Option 2: Rewrite in code. Abandon the visual builder and build the agent from scratch with full control. This gives you everything but costs months of development time and throws away the conversation design work you've already done.

Option 3: Layer production infrastructure underneath. Keep the builder for what it does well (conversation design and prototyping). Add purpose-built infrastructure for what production demands (testing, monitoring, memory, tools). This preserves your existing work while closing the gaps that actually block shipping.

The right choice depends on your timeline, your team, and how many gaps you're facing. If it's one gap, patch it. If it's five, layering infrastructure is usually faster and more maintainable than either pushing through or starting over.

The pattern that repeats

The no-code ceiling isn't unique to AI agents. It's the same pattern that's played out with website builders, mobile app builders, and workflow automation tools. Visual builders accelerate the first 80% dramatically. The last 20% requires different tooling entirely.

What's different with AI agents is the cost of that last 20%. A website that's 80% done still works. It just looks a bit rough. An AI agent that's 80% done gives wrong answers to customers, forgets who they are between calls, and breaks silently with no one noticing until the complaints pile up.

The builders will keep getting better. Voiceflow, Botpress, and Relevance AI ship improvements constantly. Some of these gaps will close over time. But the fundamental architecture of visual-first platforms constrains what's possible in production environments. Session-scoped memory, dashboard-only monitoring, and webhook-based integrations are design decisions, not missing features.

Production AI agents need a production foundation. The conversation logic is the easy part. Everything around it is where teams actually succeed or fail.

Build agents that survive production

Chanl gives your AI agents the infrastructure visual builders don't: scenario testing, quality scoring, persistent memory, and production analytics. Keep your conversation logic. Add the operational backbone.

Start building
Dean Grover, Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

