A customer calls on a Tuesday morning and asks your agent to cancel a subscription she never actually started. Your agent, trying to be helpful, confirms the cancellation. You find out three weeks later, from a chargeback.
Nobody wrote a test for that conversation. Nobody could have. Your QA team didn't know that specific combination of account state and phrasing existed until a real customer produced it.
This is the uncomfortable shape of AI quality in production: a meaningful share of the scenarios your agent encounters are ones it was never tested on. Not in staging, not in your scorecard suite, not anywhere. Experience with production voice and chat traffic suggests the untested share often sits in the 20 to 40% range, but the exact number matters less than the pattern. The long tail is real, and it's a well-documented failure mode of any language interface exposed to real users (see Chip Huyen's Designing Machine Learning Systems for the general case). Your users are simply more creative than you are. Every week they phrase a request in a way you didn't anticipate, combine flags your logic didn't foresee, or reference an account state your fixtures didn't cover.
Those untested scenarios are where most AI quality problems are born. And they're exactly where a well-run production failure loop earns its keep.
What does the loop actually look like?
A production failure loop is a five-phase cycle that starts when live traffic surfaces a failure and ends when that failure becomes permanent regression coverage. Each phase needs real plumbing, not good intentions. Here are the five phases in order.
- Detect. Every live conversation runs through a scorecard (automated or LLM-judged). A conversation that scores below threshold, escalates unexpectedly, triggers a compliance flag, or produces a customer complaint is marked for review.
- Regenerate. The flagged conversation is converted into a reproducible scenario: customer turns as scripted inputs, agent state at call-time (memory snapshot, KB version, tools available), and explicit pass/fail criteria derived from what went wrong.
- Suite. The scenario is promoted into the regression suite. Fuzzy-match against existing scenarios to avoid duplication; tag with the failure mode so you can track it by cluster, not just count.
- Gate. Every agent change (prompt edit, new tool, new model) runs the regression suite. Failures block deploy. The fast tier (~100 most-hit scenarios) runs on every PR; the slow tier runs nightly.
- Redeploy. When the fix lands and the suite passes, the new version ships. The scenario stays in the suite forever, now as permanent insurance against the bug coming back.
This is a closed loop with teeth. Every production failure becomes permanent test coverage. Every redeploy is gated on all the failures you've already seen. The agent cannot silently regress into a bug you've already diagnosed, because the scenario for that bug is sitting in the gate on every PR.
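As a concrete sketch of the detect phase, here is a minimal gate that marks a conversation for review when any trigger fires. Every name and the 0.7 threshold are illustrative assumptions, not a real API:

```python
from dataclasses import dataclass, field

SCORE_THRESHOLD = 0.7  # illustrative; tune to your scorecard's scale

@dataclass
class Conversation:
    id: str
    score: float                           # scorecard score in [0, 1]
    escalated: bool = False                # unexpected escalation
    compliance_flags: list = field(default_factory=list)
    complaint: bool = False                # customer complaint attached

def should_flag(c: Conversation) -> bool:
    """Detect: mark the conversation for review if any trigger fires."""
    return (
        c.score < SCORE_THRESHOLD
        or c.escalated
        or bool(c.compliance_flags)
        or c.complaint
    )
```

Everything this check flags becomes input to the regenerate phase; nothing depends on a human remembering to sample the right call.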
Why does scenario regeneration break?
The phase that breaks in practice is #2: regenerating a production failure as a test. It breaks for a simple reason: AI agent behavior depends on state that's hard to reproduce. The customer's memory entries, the KB snapshot at call time, the tools available in that version of the agent, the time of day that affected routing. Re-running "what the customer said" isn't enough. You need to re-run it with the agent state the customer hit.
Three practices make this tractable:
- Snapshot on failure. When a failure is detected, emit a structured event that captures agent state alongside the transcript:
  `memory_snapshot` (the entries relevant to the session), `kb_version_hash`, `tool_registry` (id + version for each tool available), `prompt_version`, and the model identifier. Use the same schema your regression harness reads, so replay is mechanical rather than manual. This is pure engineering work; it isn't hard, it just has to be built before you need it.
- Scripted customer, not live customer. The scenario plays the customer's lines as deterministic scripted turns, not a persona-driven simulation. Personas are useful for stress-testing unknown scenarios; for regenerating a specific failure, you want exact fidelity to what the customer said.
- Explicit pass/fail criteria. Not "handle this well." Something like "agent must confirm policy X within 3 turns" or "must NOT issue a refund without supervisor flag." These criteria are derived from the specific failure; they're what you're guarding against.
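Putting the three practices together, a snapshot event and the scenario regenerated from it might look like the sketch below. The schema and field names are hypothetical, chosen to match the state listed above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureSnapshot:
    """Agent state captured at detection time, alongside the transcript."""
    transcript: tuple          # (speaker, text) turns, in order
    memory_snapshot: dict      # session-relevant memory entries
    kb_version_hash: str       # KB snapshot at call time
    tool_registry: dict        # tool id -> version available to this agent
    prompt_version: str
    model_id: str

@dataclass(frozen=True)
class RegressionScenario:
    """A reproducible scenario promoted into the regression suite."""
    snapshot: FailureSnapshot
    scripted_turns: tuple      # customer lines replayed verbatim
    pass_criteria: tuple       # e.g. "must NOT refund without supervisor flag"
    failure_mode_tag: str      # cluster tag, so failures track by mode
```

The point of sharing one schema between the snapshot emitter and the regression harness is that replay needs no human translation step.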
Done right, a scenario regenerated from a production call runs in seconds, reproduces the original failure mode, and (after the fix) reproduces the success. Done wrong, it's a flaky test that everyone learns to ignore. Flaky regression tests are how you lose the whole loop.
How do you keep the suite fast as it grows?
Split the suite into tiers and retire stale scenarios on a schedule. A regression suite that grows monotonically will eventually crush your CI runtime, developers will stop running it, and it will stop catching bugs. Tiering and retirement keep the fast path fast while preserving coverage.
The math is unforgiving: 5,000 scenarios × 15 seconds each is roughly 21 hours of serial runtime, and even with 10-way parallelism it's over two hours per run. Nobody waits multiple hours for a PR. So people stop running the suite. So the suite stops catching bugs.
The fix is tiering and retirement:
| Tier | What it contains | When it runs | Failure policy |
|---|---|---|---|
| Fast (PR-blocking) | ~100 highest-value scenarios: most-hit failure clusters plus a curated smoke set | Every PR, under 10 minutes | Blocks deploy |
| Slow (nightly) | Everything else | Unattended overnight | Opens tickets, doesn't block deploys |
| Retirement | Scenarios with no failures in 90 days move to slow tier; no failures in 180 days archive (kept for audit) | On schedule | Resurrect archived scenario if the bug returns |
Treat the ~100 / 90-day / 180-day numbers as starting points, not laws. Tune them to your deploy cadence and your failure rediscovery rate. The goal is a suite that stays sized to the bugs you actually face this quarter, not the full history of every bug anyone ever saw.
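The tiering and retirement policy reduces to a small pure function. The constants below mirror the starting-point numbers in the table and are assumptions meant to be tuned, not recommendations:

```python
from datetime import date, timedelta

FAST_TIER_SIZE = 100                 # most-hit scenarios run on every PR
DEMOTE_AFTER = timedelta(days=90)    # quiet for 90 days -> slow tier
ARCHIVE_AFTER = timedelta(days=180)  # quiet for 180 days -> archive (kept for audit)

def tier_for(last_failure: date, today: date, hit_rank: int) -> str:
    """Assign a scenario to 'fast', 'slow', or 'archive'.

    hit_rank is the scenario's position when sorted by production hit
    count (0 = most hit). Archived scenarios are resurrected if the
    bug returns.
    """
    quiet = today - last_failure
    if quiet >= ARCHIVE_AFTER:
        return "archive"
    if quiet >= DEMOTE_AFTER:
        return "slow"
    return "fast" if hit_rank < FAST_TIER_SIZE else "slow"
```

Running this on a schedule is what keeps the fast tier at ~100 scenarios instead of growing monotonically with the suite.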
What this gives you that retrospective QA doesn't
A weekly QA review of sampled calls will find a subset of your real failures. It will miss the long tail: the edge cases that occur once a month across thousands of sessions. And it finds failures without fixing them permanently; a manually reviewed bug gets patched, then two quarters later someone touches the prompt, the same bug returns, and nobody notices until a customer escalates.
The production failure loop makes both problems structurally harder. Detection runs on every conversation, not a 2% sample. The scenario guards against regression forever, not just until the ticket closes. The loop converts tribal knowledge into test coverage, the single most durable form of engineering memory.
Start with the highest-cost failure
You don't need the whole loop on day one. Rank the last 30 days of flagged conversations by business impact (revenue at risk, compliance exposure, refund dollars, escalation cost) and pick the single most expensive failure mode at the top of that list. Maybe it's a compliance gap. Maybe it's failed refund authorizations. Maybe it's over-eager cancellations.
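The ranking step is simple if you track impact fields at all. The field names and the flat compliance penalty below are assumptions for illustration, not a prescribed schema:

```python
def rank_by_impact(failures: list) -> list:
    """Sort flagged failure modes by estimated business impact, highest first."""
    def impact(f: dict) -> float:
        return (
            f.get("revenue_at_risk", 0.0)
            + f.get("refund_dollars", 0.0)
            + f.get("escalation_cost", 0.0)
            # flat penalty for any compliance exposure; assumed weight
            + (10_000.0 if f.get("compliance_exposure") else 0.0)
        )
    return sorted(failures, key=impact, reverse=True)
```

Whatever lands at the top of the list is the failure mode to build the loop around first.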
Build the loop for that one failure: detect it, regenerate a test scenario from one real failing call, put it in a small regression suite, gate deploys on it. Validate that the loop works: a fix ships only when the suite passes; a regression surfaces when something else later breaks it.
Then scale out. The plumbing is the same for the second, third, and hundredth failure mode. The hard part was building the first scenario's worth of state capture and fixture replay. Everything after is copy-paste.
The teams that ship AI agents confidently in 2026 aren't the ones with the most tests written in advance. They're the ones whose tests grow at the same rate their users do, because every failure, caught once, never gets a second chance. That Tuesday-morning cancellation? It goes in the suite. The next time someone changes the prompt, the gate catches it before the customer ever gets billed.
Related reading:
- Your AI Agent Isn't Learning From Production: the flywheel that feeds this loop.
- Human + AI Parity: One Scorecard: the detection layer that flags production failures.
- Sub-300ms Voice AI: latency as a signal the regression suite can guard.
Engineering Lead
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.