Voice AI Tests Pass in the Lab. They Fail on the Call.
A voice agent passes QA on Friday. By Tuesday, the support lead has forwarded three calls to engineering with the same note: "the bot just... stops." Nothing in the test suite covered the customer who started the call describing a billing problem, drifted into a question about a return, then asked to update their address. The lab tests were one intent, one turn, one accent, no noise. The first real call was none of those things.
This is the gap that takes most voice agents down. The model is fine. The prompt is fine. The QA process is broken because it was designed to grade the agent in conditions that don't exist on a real phone line.
The fix is not a bigger test set. It is a different one.
The Habits That Catch the Failures
Voice agents fail in a small number of predictable ways, and the teams that ship them at scale all built their test suites around the same handful of habits. Below is the list, written for the engineer running QA on Monday morning who needs to know what to add to the pipeline.
Test the Audio, Not the Transcript
The fastest way to make a voice agent look better than it is: write your evaluation suite against transcripts. Type in the user turn, type in the expected agent turn, score the match.
The transcript is the part the customer never hears. What lands in production is a recording of someone on a Bluetooth headset in a kitchen, with a dishwasher running, a child in the background, and a network that dropped half a second of audio. The speech-to-text (STT) layer absorbs all of that before the language model ever sees a token. If you do not record samples that match your actual call patterns and feed them through the STT layer, you are testing a different system than the one your customers use.
A reasonable starting set: a few dozen recorded calls per primary accent your business handles, with and without background noise, across the codecs your telephony provider negotiates. Replay them through the agent in CI. Score each resulting transcript against a human-verified ground truth, for example with word error rate, and track how recognition quality degrades by accent, noise level, and codec.
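One common way to score a transcript against ground truth is word error rate (WER): the word-level edit distance divided by the number of words in the reference. The sketch below is a minimal, dependency-free version. The sample conditions and transcripts are made up for illustration, and a real pipeline would probably normalize punctuation and numerals before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical replay results: (condition, ground truth, STT output).
samples = [
    ("clean", "i want to update my address", "i want to update my address"),
    ("kitchen_noise", "i want to update my address", "i want to update my dress"),
]

wer_by_condition = {cond: word_error_rate(ref, hyp) for cond, ref, hyp in samples}
print(wer_by_condition)
```

Grouping the score by recording condition rather than averaging across the whole set is what makes the degradation visible.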
Force the Agent Through Multi-Turn
Most internal QA scripts top out at three turns. A real support call is often eight to twelve. The interesting failure modes only appear after turn five, when the agent has to remember what was said earlier, hold a partial answer in working state, and resist the urge to reset the conversation when the customer says something unexpected.
Industry research on conversational systems consistently shows accuracy degrading sharply on long interactions, especially when topic switches happen midway. The honest read on a voice agent's quality is its multi-turn behavior, not its one-shot answer. If your suite does not push past five turns, you do not yet know how the agent will behave on the calls that matter.
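A quick way to find out whether a suite pushes past five turns is to audit the scenarios themselves before running anything. The sketch below assumes each scenario is stored as a list of topic labels, one per user turn. That representation, the `Scenario` class, and the scenario names are hypothetical, not a format any particular tool uses.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    # One topic label per user turn, in call order (hypothetical labels).
    user_turn_topics: list[str]

    @property
    def turns(self) -> int:
        return len(self.user_turn_topics)

    @property
    def topic_switches(self) -> int:
        topics = self.user_turn_topics
        return sum(1 for a, b in zip(topics, topics[1:]) if a != b)


def audit(suite: list[Scenario], min_turns: int = 6) -> dict[str, list[str]]:
    """Group scenario names by whether they go deep enough to be informative."""
    report = {"deep_with_switch": [], "deep_no_switch": [], "too_shallow": []}
    for s in suite:
        if s.turns < min_turns:
            report["too_shallow"].append(s.name)
        elif s.topic_switches > 0:
            report["deep_with_switch"].append(s.name)
        else:
            report["deep_no_switch"].append(s.name)
    return report


suite = [
    Scenario("billing_only", ["billing"] * 3),
    Scenario("billing_return_address",
             ["billing"] * 3 + ["return"] * 3 + ["address"] * 2),
    Scenario("long_billing", ["billing"] * 7),
]
report = audit(suite)
print(report)
```

If most of the suite lands in `too_shallow`, the scores it produces say little about the calls that matter.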
Score the Tool Calls Separately
Function-calling reliability is the part of the stack that drifts the most when you change models, change prompts, or change the underlying provider. It is also the part the customer notices first. An agent that calls the right tool with the wrong arguments will confidently tell the customer their order shipped two weeks ago when it has not shipped at all.
Two scorecards work better than one. One scores the conversational quality: tone, empathy, on-topic, no hallucinated facts. A second scores the structural correctness of every tool call the agent made during the conversation: was the right function picked, were the arguments well-formed, did the agent recover when the tool returned an error or a null. Keep them separate so a regression in one is not masked by the other.
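The structural scorecard can be almost entirely mechanical. The sketch below checks the three questions above for a single tool call. The `ToolCall` and `ExpectedCall` shapes, the `get_order_status` tool, and the `agent_recovered` flag (which a harness would have to set by inspecting the agent's next turn) are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict
    # What the tool returned; None or a dict with an "error" key is a failure.
    result: Optional[dict] = None
    # Set by the harness: did the agent's next turn acknowledge the failure?
    agent_recovered: bool = False


@dataclass
class ExpectedCall:
    name: str
    required_args: dict[str, type]


def score_tool_call(actual: ToolCall, expected: ExpectedCall) -> dict[str, bool]:
    right_function = actual.name == expected.name
    # Arguments are only judged when the right function was picked.
    well_formed = right_function and all(
        key in actual.arguments and isinstance(actual.arguments[key], typ)
        for key, typ in expected.required_args.items()
    )
    failed = actual.result is None or "error" in actual.result
    # A call that succeeded has nothing to recover from, so it passes.
    recovered = (not failed) or actual.agent_recovered
    return {
        "right_function": right_function,
        "well_formed_args": well_formed,
        "recovered_from_failure": recovered,
    }


expected = ExpectedCall("get_order_status", {"order_id": str})
actual = ToolCall(
    name="get_order_status",
    arguments={"order_id": 12345},  # wrong type: int instead of str
    result={"error": "order not found"},
    agent_recovered=False,
)
tool_scores = score_tool_call(actual, expected)
print(tool_scores)
```

Because these checks are boolean and per call, they can be reported next to the conversational score rather than folded into it.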
Build Personas, Not Scripts
The single best return on engineering time in voice testing is moving from scripted turns to persona-based simulation. A persona is not a script. It is a caller archetype with a goal, a style, and a set of behaviors the simulator can improvise around.
A small set, run against any non-trivial agent:
- The impatient caller who interrupts and expects an answer in the first sentence
- The confused first-timer who gives partial information and expects the agent to ask the right follow-up
- The technical expert who tests the agent's knowledge with edge cases
- The emotionally escalated customer who needs de-escalation before they can be helped
- The accidental caller who started a conversation about the wrong product
Each one fails the agent in a different way. The impatient caller exposes pacing problems. The confused first-timer exposes how the agent handles missing information. The technical expert exposes hallucination. The escalated customer exposes when, and to whom, the agent hands off. The accidental caller exposes whether the agent notices the mismatch and redirects instead of pressing on. Running the same goal through five personas is more useful than running fifty scripted variants of one goal.
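In code, a persona can be a small data object that renders into instructions for the simulator, and the scenario set is then the cross product of personas and goals. The sketch below is one possible shape. The `Persona` class, the prompt wording, and the two goals are hypothetical, and the behaviors are paraphrased from the list above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Persona:
    name: str
    style: str
    behaviors: tuple[str, ...]

    def simulator_prompt(self, goal: str) -> str:
        """Render instructions for an LLM playing this caller (illustrative wording)."""
        behavior_lines = "\n".join(f"- {b}" for b in self.behaviors)
        return (
            f"You are a caller. Your goal: {goal}.\n"
            f"Speaking style: {self.style}.\n"
            f"Behaviors to improvise around:\n{behavior_lines}\n"
            "Stay in character and do not follow a fixed script."
        )


PERSONAS = [
    Persona("impatient", "curt and fast",
            ("interrupts the agent", "expects an answer in the first sentence")),
    Persona("confused_first_timer", "hesitant",
            ("gives partial information", "waits for the right follow-up question")),
    Persona("technical_expert", "precise",
            ("probes the agent's knowledge with edge cases",)),
    Persona("escalated", "angry",
            ("needs de-escalation before accepting help",)),
    Persona("accidental", "uncertain",
            ("starts the conversation about the wrong product",)),
]

# Hypothetical goals; in practice these come from real call patterns.
GOALS = ["reschedule an appointment", "check an order's shipping status"]

scenarios = [
    {"id": f"{p.name}/{g}", "prompt": p.simulator_prompt(g)}
    for p, g in product(PERSONAS, GOALS)
]
print(len(scenarios), "scenarios")
print(scenarios[0]["prompt"])
```

Adding a goal adds five scenarios, one per persona, without anyone writing a new script.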
Make Regression a Gate, Not a Hope
The last habit is the one most teams skip. Every prompt change, every model swap, every tool definition update reruns the scored scenario set. The new version has to clear a defined threshold on every scorecard before it can be promoted. No exceptions, no "ship it and watch the dashboard."
A team running this loop properly is doing for voice what unit tests do for a backend. The first time a prompt change drops the de-escalation persona's score by ten points, you catch it before a customer does. The second time, the engineer who made the change already knows what scenarios to look at. The third time, the muscle is built and the team is no longer surprised by regressions.
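The gate itself can be a few lines in the CI job. The sketch below assumes scores on a 0 to 100 scale, keyed by scorecard and then by scenario. The threshold values, the maximum tolerated drop, and all scores shown are hypothetical policy choices, not recommendations.

```python
# Hypothetical policy values; scores are on a 0-100 scale.
THRESHOLDS = {"conversation": 75, "tool_calls": 90}
MAX_DROP = 5  # largest per-scenario drop tolerated against the promoted version


def gate(candidate, baseline, thresholds=THRESHOLDS, max_drop=MAX_DROP):
    """Return the reasons to block promotion; an empty list means pass."""
    failures = []
    for scorecard, minimum in thresholds.items():
        scores = candidate.get(scorecard)
        if not scores:
            # A missing scorecard must fail the gate, not silently pass it.
            failures.append(f"{scorecard}: no scores reported")
            continue
        for scenario, score in scores.items():
            if score < minimum:
                failures.append(
                    f"{scorecard}/{scenario}: {score} is below threshold {minimum}"
                )
            previous = baseline.get(scorecard, {}).get(scenario)
            if previous is not None and previous - score > max_drop:
                failures.append(
                    f"{scorecard}/{scenario}: dropped {previous - score} "
                    f"points from {previous}"
                )
    return failures


baseline = {
    "conversation": {"de_escalation": 88, "impatient": 82},
    "tool_calls": {"de_escalation": 96, "impatient": 97},
}
candidate = {
    "conversation": {"de_escalation": 78, "impatient": 83},
    "tool_calls": {"de_escalation": 96, "impatient": 97},
}

failures = gate(candidate, baseline)
for reason in failures:
    print("BLOCKED:", reason)

# A CI entry point would pass this value to sys.exit().
exit_code = 1 if failures else 0
```

In this example the de-escalation score still clears the absolute threshold, so only the comparison against the baseline catches the ten-point drop. Checking both is a design choice worth considering.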
Where Most Teams Get Stuck
The honest blocker is rarely "we do not know how to test." It is "we do not have a way to run two hundred scenarios cheaply and score the results without a human listening to every one."
The cheap workaround that gets a team going: agent-to-agent simulation. The persona is itself an LLM driven by a prompt, talking to the agent over a simulated voice channel. The simulator records the call, transcribes it, runs the scorecards, and writes a row in a database. Run the full suite nightly. Gate real call testing behind a passing score.
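Stripped of the audio, the simulation loop is two callables taking turns and a row written at the end. The sketch below is text-only and uses stubs on both sides. In a real setup the persona and the agent would each be a model call, and the text would pass through speech synthesis and recognition. The function names, the table layout, and the scenario name are all hypothetical.

```python
import sqlite3
from typing import Callable

Turn = dict[str, str]  # {"speaker": "caller" | "agent", "text": "..."}
Speaker = Callable[[list[Turn]], str]


def simulate_call(persona: Speaker, agent: Speaker, max_turns: int = 12) -> list[Turn]:
    """Alternate caller and agent turns until the caller hangs up or the cap is hit."""
    transcript: list[Turn] = []
    for _ in range(max_turns):
        caller_text = persona(transcript)
        if not caller_text:  # an empty string means the caller hung up
            break
        transcript.append({"speaker": "caller", "text": caller_text})
        transcript.append({"speaker": "agent", "text": agent(transcript)})
    return transcript


def scripted_persona(lines: list[str]) -> Speaker:
    """Stub caller that replays fixed lines; stands in for an LLM-driven persona."""
    def speak(transcript: list[Turn]) -> str:
        spoken = sum(1 for t in transcript if t["speaker"] == "caller")
        return lines[spoken] if spoken < len(lines) else ""
    return speak


def stub_agent(transcript: list[Turn]) -> str:
    """Stub agent; stands in for the voice agent under test."""
    return f"Let me help with: {transcript[-1]['text']}"


transcript = simulate_call(
    scripted_persona(["My bill is wrong.", "Also, can I return an item?"]),
    stub_agent,
)

# Record the run; scorecard columns would be added alongside these.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (scenario TEXT, turns INTEGER)")
db.execute("INSERT INTO runs VALUES (?, ?)", ("billing_then_return", len(transcript)))
row = db.execute("SELECT scenario, turns FROM runs").fetchone()
print(row)
```

Keeping the persona and the agent behind the same callable signature means the stubs can be swapped for real model calls without touching the loop.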
Chanl was built around this loop. The product is a persona library, a scorecard runner, and a regression dashboard wired together. The reason it exists is that every voice team we talked to had built one half of this themselves and was tired of maintaining it.
A Short Note on Realistic Targets
It is fashionable to claim "95% accuracy" on production voice agents. Some teams genuinely reach it on a narrow scope. Most do not, and the ones that publish the number have usually measured something narrow: intent classification on a clean transcript, not end-to-end task completion on a real call.
A more useful target is honest, by-scenario reporting. Appointment scheduling for an existing customer in the customer's primary language, on a stable connection, with full account data available, probably clears 90%. The same task with a frustrated customer calling from a car, with a name the speech recognition keeps mis-hearing, is a different number. Publishing the second one to your own team is what unlocks improvement.
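By-scenario reporting is a grouping operation, not a new metric. The sketch below computes task completion per scenario alongside the blended number. The scenario names and the outcomes are invented purely to show the arithmetic and are not measurements from any real agent.

```python
from collections import defaultdict

# Invented outcomes for illustration only.
calls = (
    [{"scenario": "scheduling/stable_line", "completed": True}] * 9
    + [{"scenario": "scheduling/stable_line", "completed": False}] * 1
    + [{"scenario": "scheduling/car_frustrated", "completed": True}] * 3
    + [{"scenario": "scheduling/car_frustrated", "completed": False}] * 2
)


def completion_by_scenario(calls):
    """Return the share of completed calls for each scenario."""
    totals = defaultdict(lambda: [0, 0])  # scenario -> [completed, total]
    for call in calls:
        bucket = totals[call["scenario"]]
        bucket[0] += int(call["completed"])
        bucket[1] += 1
    return {name: done / total for name, (done, total) in totals.items()}


rates = completion_by_scenario(calls)
overall = sum(c["completed"] for c in calls) / len(calls)

for name, rate in sorted(rates.items()):
    print(f"{name}: {rate:.0%}")
print(f"overall: {overall:.0%}")
```

The blended figure sits between the two scenario rates and describes neither of them, which is the argument for reporting the rows separately.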
What to Do Monday
If your team is shipping a voice agent and the test suite still looks like a spreadsheet of single-turn transcripts, the order of operations is short.
- Pull a week of real call recordings, transcribe them, and turn the top twenty patterns into scenarios.
- Build five caller personas around the patterns you cannot script.
- Write a conversational scorecard and a separate tool-call scorecard. Run both on every scenario.
- Add the simulation run to CI. Block promotion below the threshold.
- Schedule a weekly review of the calls the agent scored worst on, and feed the patterns back into the scenario set.
The agent that ships on this loop is not perfect. It is honest about what it can and cannot do, and it stops surprising the team that ships it.
Three months in, the support lead stops forwarding calls to engineering with the note "the bot just stops." The bot does stop sometimes. It is just that the team already saw it happen in a scenario, scored it, and made a call about whether it was worth fixing. That is the version of voice AI testing that holds up on a Tuesday.
Ship Voice Agents That Hold Up on Real Calls
Chanl runs persona-based simulations, scores conversational quality and tool calls separately, and gates regressions before they reach production. Build the loop your team needs without rebuilding it from scratch.
Engineering Lead
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.