A customer calls on a Tuesday morning and asks your agent to cancel a subscription she never actually started. Your agent, trying to be helpful, confirms the cancellation. You find out three weeks later, from a chargeback.
Nobody wrote a test for that conversation. Nobody could have. Your QA team didn't know that specific combination of account state and phrasing existed until a real customer produced it.
This is the uncomfortable shape of AI quality in production: a meaningful share of the scenarios your agent encounters are ones it was never tested on. Not in staging, not in your scorecard suite, not anywhere. Experience with production voice and chat traffic suggests the untested share often sits in the 20 to 40% range, but the exact number matters less than the pattern. The long tail is real, and it's a well-documented failure mode of any language interface exposed to real users (see Chip Huyen's Designing Machine Learning Systems for the general case). Your users are simply more creative than you are. Every week they phrase a request in a way you didn't anticipate, combine flags your logic didn't foresee, or reference an account state your fixtures didn't cover.
Those untested scenarios are where most AI quality problems are born. And they're exactly where a well-run production failure loop earns its keep.
What does the loop actually look like?
A production failure loop is a five-phase cycle that starts when live traffic surfaces a failure and ends when that failure becomes permanent regression coverage. Each phase needs real plumbing, not good intentions. Here are the five phases in order.
- Detect. Every live conversation runs through a scorecard (automated or LLM-judged). A conversation that scores below threshold, escalates unexpectedly, triggers a compliance flag, or produces a customer complaint is marked for review.
- Regenerate. The flagged conversation is converted into a reproducible scenario: customer turns as scripted inputs, agent state at call-time (memory snapshot, KB version, tools available), and explicit pass/fail criteria derived from what went wrong.
- Suite. The scenario is promoted into the regression suite. Fuzzy-match against existing scenarios to avoid duplication; tag with the failure mode so you can track it by cluster, not just count.
- Gate. Every agent change (prompt edit, new tool, new model) runs the regression suite. Failures block deploy. The fast tier (~100 most-hit scenarios) runs on every PR; the slow tier runs nightly.
- Redeploy. When the fix lands and the suite passes, the new version ships. The scenario stays in the suite forever, now as permanent insurance against the bug coming back.
This is a closed loop with teeth. Every production failure becomes permanent test coverage. Every redeploy is gated on all the failures you've already seen. The agent cannot silently regress into a bug you've already diagnosed, because the scenario for that bug is sitting in the gate on every PR.
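As a concrete sketch of the detect phase, here is a minimal gate that marks a conversation for review when any trigger fires. Every name and the 0.7 threshold are illustrative assumptions, not a real API:

```python
from dataclasses import dataclass, field

SCORE_THRESHOLD = 0.7  # illustrative; tune to your scorecard's scale

@dataclass
class Conversation:
    id: str
    score: float                           # scorecard score in [0, 1]
    escalated: bool = False                # unexpected escalation
    compliance_flags: list = field(default_factory=list)
    complaint: bool = False                # customer complaint attached

def should_flag(c: Conversation) -> bool:
    """Detect: mark the conversation for review if any trigger fires."""
    return (
        c.score < SCORE_THRESHOLD
        or c.escalated
        or bool(c.compliance_flags)
        or c.complaint
    )
```

Everything this check flags becomes input to the regenerate phase; nothing depends on a human remembering to sample the right call.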
Why does scenario regeneration break?
The phase that breaks in practice is #2: regenerating a production failure as a test. It breaks for a simple reason: AI agent behavior depends on state that's hard to reproduce. The customer's memory entries, the KB snapshot at call time, the tools available in that version of the agent, the time of day that affected routing. Re-running "what the customer said" isn't enough. You need to re-run it with the agent state the customer hit.
Three practices make this tractable:
- Snapshot on failure. When a failure is detected, emit a structured event that captures agent state alongside the transcript:
  `memory_snapshot` (the entries relevant to the session), `kb_version_hash`, `tool_registry` (id + version for each tool available), `prompt_version`, and the model identifier. Use the same schema your regression harness reads, so replay is mechanical rather than manual. This is pure engineering work; it isn't hard, it just has to be built before you need it.
- Scripted customer, not live customer. The scenario plays the customer's lines as deterministic scripted turns, not a persona-driven simulation. Personas are useful for stress-testing unknown scenarios; for regenerating a specific failure, you want exact fidelity to what the customer said.
- Explicit pass/fail criteria. Not "handle this well." Something like "agent must confirm policy X within 3 turns" or "must NOT issue a refund without supervisor flag." These criteria are derived from the specific failure; they're what you're guarding against.
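Putting the three practices together, a snapshot event and the scenario regenerated from it might look like the sketch below. The schema and field names are hypothetical, chosen to match the state listed above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailureSnapshot:
    """Agent state captured at detection time, alongside the transcript."""
    transcript: tuple          # (speaker, text) turns, in order
    memory_snapshot: dict      # session-relevant memory entries
    kb_version_hash: str       # KB snapshot at call time
    tool_registry: dict        # tool id -> version available to this agent
    prompt_version: str
    model_id: str

@dataclass(frozen=True)
class RegressionScenario:
    """A reproducible scenario promoted into the regression suite."""
    snapshot: FailureSnapshot
    scripted_turns: tuple      # customer lines replayed verbatim
    pass_criteria: tuple       # e.g. "must NOT refund without supervisor flag"
    failure_mode_tag: str      # cluster tag, so failures track by mode
```

The point of sharing one schema between the snapshot emitter and the regression harness is that replay needs no human translation step.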
Done right, a scenario regenerated from a production call runs in seconds, reproduces the original failure mode, and (after the fix) reproduces the success. Done wrong, it's a flaky test that everyone learns to ignore. Flaky regression tests are how you lose the whole loop.
How do you keep the suite fast as it grows?
Split the suite into tiers and retire stale scenarios on a schedule. A regression suite that grows monotonically will eventually crush your CI runtime, developers will stop running it, and it will stop catching bugs. Tiering and retirement keep the fast path fast while preserving coverage.
The math is unforgiving: 5,000 scenarios × 15 seconds each is roughly 21 hours of serial runtime, and even with 10-way parallelism it's over two hours per run. Nobody waits multiple hours for a PR. So people stop running the suite. So the suite stops catching bugs.
The fix is tiering and retirement:
| Tier | What it contains | When it runs | Failure policy |
|---|---|---|---|
| Fast (PR-blocking) | ~100 highest-value scenarios: most-hit failure clusters plus a curated smoke set | Every PR, under 10 minutes | Blocks deploy |
| Slow (nightly) | Everything else | Unattended overnight | Opens tickets, doesn't block deploys |
| Retirement | Scenarios with no failures in 90 days move to slow tier; no failures in 180 days archive (kept for audit) | On schedule | Resurrect archived scenario if the bug returns |
Treat the ~100 / 90-day / 180-day numbers as starting points, not laws. Tune them to your deploy cadence and your failure rediscovery rate. The goal is a suite that stays sized to the bugs you actually face this quarter, not the full history of every bug anyone ever saw.
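The tiering and retirement policy reduces to a small pure function. The constants below mirror the starting-point numbers in the table and are assumptions meant to be tuned, not recommendations:

```python
from datetime import date, timedelta

FAST_TIER_SIZE = 100                 # most-hit scenarios run on every PR
DEMOTE_AFTER = timedelta(days=90)    # quiet for 90 days -> slow tier
ARCHIVE_AFTER = timedelta(days=180)  # quiet for 180 days -> archive (kept for audit)

def tier_for(last_failure: date, today: date, hit_rank: int) -> str:
    """Assign a scenario to 'fast', 'slow', or 'archive'.

    hit_rank is the scenario's position when sorted by production hit
    count (0 = most hit). Archived scenarios are resurrected if the
    bug returns.
    """
    quiet = today - last_failure
    if quiet >= ARCHIVE_AFTER:
        return "archive"
    if quiet >= DEMOTE_AFTER:
        return "slow"
    return "fast" if hit_rank < FAST_TIER_SIZE else "slow"
```

Running this on a schedule is what keeps the fast tier at ~100 scenarios instead of growing monotonically with the suite.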
What this gives you that retrospective QA doesn't
A weekly QA review of sampled calls will find a subset of your real failures. It will miss the long tail: the edge cases that occur once a month across thousands of sessions. And it finds failures without fixing them permanently; a manually reviewed bug gets patched, then two quarters later someone touches the prompt, the same bug returns, and nobody notices until a customer escalates.
The production failure loop makes both problems structurally harder. Detection runs on every conversation, not a 2% sample. The scenario guards against regression forever, not just until the ticket closes. The loop converts tribal knowledge into test coverage, the single most durable form of engineering memory.
Start with the highest-cost failure
You don't need the whole loop on day one. Rank the last 30 days of flagged conversations by business impact (revenue at risk, compliance exposure, refund dollars, escalation cost) and pick the single most expensive failure mode at the top of that list. Maybe it's a compliance gap. Maybe it's failed refund authorizations. Maybe it's over-eager cancellations.
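The ranking step is simple if you track impact fields at all. The field names and the flat compliance penalty below are assumptions for illustration, not a prescribed schema:

```python
def rank_by_impact(failures: list) -> list:
    """Sort flagged failure modes by estimated business impact, highest first."""
    def impact(f: dict) -> float:
        return (
            f.get("revenue_at_risk", 0.0)
            + f.get("refund_dollars", 0.0)
            + f.get("escalation_cost", 0.0)
            # flat penalty for any compliance exposure; assumed weight
            + (10_000.0 if f.get("compliance_exposure") else 0.0)
        )
    return sorted(failures, key=impact, reverse=True)
```

Whatever lands at the top of the list is the failure mode to build the loop around first.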
Build the loop for that one failure: detect it, regenerate a test scenario from one real failing call, put it in a small regression suite, gate deploys on it. Validate that the loop works: a fix ships only when the suite passes; a regression surfaces when something else later breaks it.
Then scale out. The plumbing is the same for the second, third, and hundredth failure mode. The hard part was building the first scenario's worth of state capture and fixture replay. Everything after is copy-paste.
The teams that ship AI agents confidently in 2026 aren't the ones with the most tests written in advance. They're the ones whose tests grow at the same rate their users do, because every failure, caught once, never gets a second chance. That Tuesday-morning cancellation? It goes in the suite. The next time someone changes the prompt, the gate catches it before the customer ever gets billed.
Related reading:
- Your AI Agent Isn't Learning From Production: the flywheel that feeds this loop.
- Human + AI Parity: One Scorecard: the detection layer that flags production failures.
- Sub-300ms Voice AI: latency as a signal the regression suite can guard.
Engineering Lead
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.