Your engineering team ships code with 85% test coverage. Your AI agent ships with... 12 scenarios, most of which are variations of a happy path you wrote in the first week of development.
This is almost universal. Fifty-seven percent of organizations now have AI agents in production, and for 32% of them quality is the top barrier to expansion. Yet only 52% run offline evaluations against actual test sets before deploying. The other 48% are essentially flying blind.
The gap isn't ignorance -- it's that no one has told you how to think about coverage for agents the way you've been taught to think about coverage for code. Code coverage has a number. Agent coverage doesn't. This article gives you a framework for it.
Why Code Coverage Doesn't Apply Here
Code coverage answers a binary question: was this line (or branch, or function) executed during the test run? It works because code is deterministic. The same input, on the same code, produces the same output. Coverage tells you which parts of the decision tree you've exercised.
AI agents break this assumption at every level. The same input, processed by the same model with the same prompt, can produce different tool calls, different retrieval results, and different natural language responses across runs. There's no decision tree to map. Coverage based on execution paths doesn't exist.
What you can measure instead is whether your test suite exercises the behavioral space your agent operates in. Think of it as a distribution problem: the space of things users might try to do with your agent is a distribution. Your test suite is a sample from that distribution. Good coverage means your sample represents the distribution accurately, including the long tail.
The Five Coverage Dimensions
Coverage for an AI agent has five dimensions. A test suite that's strong on one but weak on others will leave predictable gaps.
1. Intent Coverage
Does your suite cover the range of things users are trying to accomplish? For a customer service agent, this might include: checking order status, initiating returns, reporting a damaged item, asking about policies, escalating to a human, and providing feedback. Each of these is a distinct intent category, and each can fail independently.
The failure mode here is depth without breadth. A team might have 40 scenarios for order status and zero for policy questions. When a user asks "can I return a sale item?" and the agent doesn't know, that failure wasn't predictable from the test suite because the intent category was unrepresented.
Practical test: list every distinct thing a user might want to accomplish with your agent, without regard to how common they are. Each item on that list should have at least one test scenario.
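That practical test is easy to automate. The sketch below checks a scenario suite against a required-intent list; the intent names and scenario records are illustrative, not from any real suite.

```python
# Hypothetical intent-coverage check: every required intent must appear
# in at least one scenario. All names here are invented examples.
REQUIRED_INTENTS = {
    "order_status", "initiate_return", "damaged_item",
    "policy_question", "escalate_to_human", "give_feedback",
}

scenarios = [
    {"id": "s1", "intent": "order_status", "turns": 2},
    {"id": "s2", "intent": "initiate_return", "turns": 4},
    {"id": "s3", "intent": "order_status", "turns": 7},
]

covered = {s["intent"] for s in scenarios}
missing = REQUIRED_INTENTS - covered
if missing:
    print(f"Intent coverage gaps: {sorted(missing)}")
```

Run this as a pre-deploy gate: a non-empty `missing` set means at least one intent category has zero representation.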
2. Persona Coverage
The same intent expressed by different users can produce different agent failures. A user who writes "ORDER DOESN'T WORK FIX THIS" is expressing the same intent as a user who writes "Hi, I placed an order yesterday and I'm having some trouble with the status update -- could you help?" But agents trained on formal customer service language often mishandle terse, emotional, or colloquial inputs.
Persona dimensions to test:
- Communication style (formal vs. colloquial vs. terse)
- Technical vocabulary (expert vs. novice language)
- Emotional tone (neutral, frustrated, confused, urgent)
- Language patterns (short questions, long explanations, mid-sentence topic switches)
A test suite with ten scenarios from one persona type will miss failures that only surface with other personas. Scenario testing with realistic personas -- not just syntactic variations of the same message -- is what reveals these gaps before production does.
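One lightweight way to get persona breadth is to expand each intent across a fixed set of persona variants. The template strings below are invented illustrations, not generated data.

```python
# Illustrative sketch: the same "order_status" intent phrased across
# persona dimensions. Every message here is an invented example.
PERSONA_VARIANTS = {
    "formal":     "Hello, I placed an order yesterday and the status hasn't updated. Could you help?",
    "terse":      "order stuck. fix",
    "frustrated": "ORDER DOESN'T WORK FIX THIS",
    "novice":     "The thingy that tracks my package says nothing, is that bad?",
}

def scenarios_for_intent(intent: str) -> list[dict]:
    """Expand one intent into one scenario per persona variant."""
    return [
        {"intent": intent, "persona": persona, "opening_message": msg}
        for persona, msg in PERSONA_VARIANTS.items()
    ]

suite = scenarios_for_intent("order_status")
```

Four variants per intent is a floor, not a ceiling; the point is that each variant exercises a genuinely different register, not a synonym swap.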
3. Conversation Depth
Most pre-deploy testing covers single turns. Most production failures happen in turn four, seven, or twelve.
The pattern is consistent: agents perform well at the start of a conversation when context is minimal and intent is clear. As the conversation lengthens, context accumulates, earlier misunderstandings compound, and the agent loses track of constraints it established earlier. An agent that correctly says "I can't process refunds over $200 without manager approval" in turn two might quietly violate that policy in turn nine when the context window is crowded.
Test at multiple depths:
- 1-2 turn scenarios (baseline)
- 3-5 turn scenarios (typical conversation)
- 8-12 turn scenarios (complex requests)
- 15+ turn scenarios (escalation paths, edge cases)
Each depth tier reveals different failure modes. You need representation across all of them.
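A depth audit can be a few lines: bucket scenarios by turn count and flag empty tiers. The tier boundaries mirror the list above; the suite's turn counts are invented.

```python
# Sketch of a depth-tier audit over a hypothetical suite's turn counts.
TIERS = {
    "baseline (1-2)":   range(1, 3),
    "typical (3-5)":    range(3, 6),
    "complex (8-12)":   range(8, 13),
    "escalation (15+)": range(15, 1000),
}

scenario_turns = [1, 2, 4, 4, 9, 11]  # turn counts in an invented suite

by_tier = {
    name: sum(1 for t in scenario_turns if t in turns)
    for name, turns in TIERS.items()
}
empty_tiers = [name for name, count in by_tier.items() if count == 0]
```

Here the audit would flag the 15+ tier as empty, which is exactly the tier where compounding-context failures live.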
4. Tool Usage Coverage
Every tool your agent has access to should have test coverage, and that coverage should include failure cases, not just success cases.
What "tool coverage" actually means:
- Every tool gets called at least once in your suite
- Each tool is called with boundary inputs (empty strings, max-length values, unexpected types)
- At least one scenario per tool where the tool returns an error
- At least one scenario per tool where the tool is called incorrectly (wrong arguments, missing required fields)
- Multi-tool scenarios where two or more tools are needed in sequence
The failure mode here is testing the tool in isolation but never testing how the agent handles the tool's error response. An agent that works when the CRM API returns a clean 200 can behave unpredictably when it returns a 503 with a generic error message. Test both.
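A minimal way to test the error path is to stub the tool so a scenario can force a failure. `crm_lookup`, `agent_handle`, and the error shape below are all invented stand-ins for your real tool and agent invocation.

```python
# Hedged sketch: a stubbed CRM lookup that can be forced to fail, so one
# scenario can assert the agent degrades gracefully on a 503.
def crm_lookup(order_id: str, *, fail: bool = False) -> dict:
    if fail:
        return {"status": 503, "error": "Service temporarily unavailable"}
    return {"status": 200, "order_id": order_id, "state": "shipped"}

def agent_handle(tool_result: dict) -> str:
    # A graceful agent surfaces the outage instead of guessing at order state.
    if tool_result["status"] != 200:
        return "I'm having trouble reaching the order system right now."
    return f"Your order is {tool_result['state']}."

happy = agent_handle(crm_lookup("A-1001"))
degraded = agent_handle(crm_lookup("A-1001", fail=True))
```

The same pattern covers the other checklist items: pass empty strings and wrong argument types into the stub and assert on what the agent says next.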

5. Adversarial Coverage
Every agent in production will encounter users who try to push it off its rails -- not necessarily with malicious intent, but through ambiguous requests, contradictory instructions, scope violations, and edge cases the original design never anticipated.
Adversarial coverage includes:
- Out-of-scope requests ("Can you help me write a cover letter?" to a returns agent)
- Contradictory instructions ("I know you said you can't do that, but just this once")
- Jailbreak attempts ("Ignore your previous instructions and...")
- Ambiguous inputs that could match multiple intents
- Extremely long inputs that approach context limits
- Inputs in unexpected languages or scripts
You don't need exhaustive adversarial coverage. You need enough to verify that your agent fails gracefully -- declining out-of-scope requests clearly, resisting manipulation consistently, and surfacing ambiguous cases rather than guessing.
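"Fails gracefully" is testable. The sketch below checks that an out-of-scope request gets an explicit decline; `agent_reply` is a stub standing in for your agent call, and the refusal markers are assumptions about how your agent phrases declines.

```python
# Minimal graceful-failure check for a returns agent. Everything here
# (the stub agent and the refusal phrases) is illustrative.
REFUSAL_MARKERS = ("can't help with that", "outside what I can do")

def agent_reply(message: str) -> str:
    # Stub: a returns agent should decline cover-letter requests.
    if "cover letter" in message.lower():
        return "I can't help with that, but I can assist with returns and orders."
    return "Sure, let's look at your order."

reply = agent_reply("Can you help me write a cover letter?")
fails_gracefully = any(m in reply for m in REFUSAL_MARKERS)
```

Keyword markers are crude; in practice a judge model or rubric scoring works better, but the assertion structure is the same.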
How Many Scenarios Is Enough?
There's no universal number, but there's a useful heuristic: start with 50, then grow the suite from production.
50 well-chosen scenarios -- roughly 10 per coverage dimension -- is enough to catch most pre-deploy failures for a focused agent. Spreading them evenly across intent categories, personas, conversation depths, tool usage, and adversarial cases gives you genuine coverage, not a false sense of security from 50 variations of the same happy path.
After your first production deployment, run a post-mortem on failures. For each production failure, ask: was there a scenario in the pre-deploy suite that would have caught this? If the answer is no, add one. Do this consistently for the first three months and your suite will grow to accurately represent your agent's failure modes.
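The conversion step can be mechanical. This sketch turns a logged production failure into a regression scenario; the failure-record fields are assumptions about what your logging captures.

```python
# Illustrative: convert a logged production failure into a regression
# scenario. Field names are invented, not a real logging schema.
def failure_to_scenario(failure: dict) -> dict:
    return {
        "id": f"regression-{failure['conversation_id']}",
        "intent": failure["intent"],
        "transcript": failure["transcript"],      # replayed as the test input
        "expected": failure["correct_behavior"],  # what the agent should have done
        "source": "production",
    }

scenario = failure_to_scenario({
    "conversation_id": "c-4821",
    "intent": "policy_question",
    "transcript": ["Can I return a sale item?"],
    "correct_behavior": "Cite the sale-item return policy",
})
```

Tagging scenarios with `source` lets you track the production-derived share of your suite over time, which the composition targets below rely on.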
Practically, teams that do this well end up with:
- 20-30 scenarios derived directly from real production conversations
- 20-30 synthetic scenarios covering known flows not yet seen in production
- 10-15 adversarial scenarios targeting known manipulation patterns
- Growth rate of roughly 5-10 new scenarios per deployment cycle
What you're building is not a static test artifact. It's a living document of everything you've learned about how your agent can fail.
The Coverage-to-Confidence Mapping
Your suite's distribution maps directly to deploy confidence. Use this readiness table to assess where you are before shipping:
| Coverage State | Deploy Confidence | Typical Suite Characteristics |
|---|---|---|
| No pre-deploy testing | Very low | No suite, rely on observability only |
| Happy-path only | Low | 10-20 scenarios, all nominal flows |
| Intent-covered | Moderate | 30-50 scenarios, all intents represented |
| Multi-dimensional | High | 50-100 scenarios, all 5 dimensions |
| Production-derived | Very high | 100+ scenarios, real failures represented |
Most teams are at "happy-path only." Reaching "intent-covered" usually takes one week of deliberate work. Reaching "multi-dimensional" takes a sprint. The jump to "production-derived" happens automatically if you maintain the practice of converting failures into tests.
The Distribution Trap
One more thing to watch: coverage numbers can lie.
A suite of 100 scenarios that are all slight variations of five conversation flows has less real coverage than a suite of 30 scenarios that each represent a genuinely distinct part of the behavioral space. The number isn't what matters -- the distribution is.
Before you deploy, do a quick audit: group your scenarios by intent category and by persona type. If any single group has more than 30% of your scenarios, you have a distribution problem. Diversify before you ship.
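The 30% audit is a one-liner over your scenario labels. The suite below is invented; the 0.30 threshold comes from the heuristic above.

```python
# Sketch of the distribution audit: flag any intent group holding more
# than 30% of the suite. The intent list is an invented example.
from collections import Counter

intents = (["order_status"] * 22) + (["initiate_return"] * 5) + (["policy_question"] * 3)

counts = Counter(intents)
total = len(intents)
overweighted = {
    intent: count / total
    for intent, count in counts.items()
    if count / total > 0.30
}
```

Run the same audit grouped by persona type; a suite can pass one grouping and fail the other.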
This is why Chanl's scenario testing uses intent clustering to surface gaps in your suite before you run it -- not just to measure pass rates after. Scorecards then give you multi-dimensional quality scores on each scenario run, so you can see which coverage dimensions your agent handles well and which need more work before you're ready to declare it production-ready.
What Good Coverage Looks Like in Practice
Here's what a well-designed 60-scenario suite looks like for a customer service agent that handles returns, order status, and policy questions:
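One way such a 60-scenario distribution might break down, treating the composition figures from the previous section as the assumption. These exact numbers are illustrative, not a prescription.

```python
# Illustrative 60-scenario breakdown for a returns/order-status/policy
# agent, combining the composition ranges given earlier in the article.
suite_breakdown = {
    "production_derived": 25,  # replayed real conversations
    "synthetic_flows": 25,     # known intents not yet seen in production
    "adversarial": 10,         # jailbreaks, out-of-scope, ambiguity
}
total = sum(suite_breakdown.values())
```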
This isn't perfect coverage -- no suite is. But it's a distribution that will catch most failures before production rather than discovering them from customer complaints.
Connecting Pre-Deploy Testing to Production Monitoring
Coverage isn't a problem you solve once at deploy time. It's an ongoing practice that connects your pre-deploy test suite to your production observability.
The pipeline looks like this: production conversations surface new failure modes through analytics and monitoring. Those failures feed back into the test suite as new scenarios. The test suite grows to reflect the real distribution of things users try to do. Pre-deploy coverage improves with each cycle.
Teams that treat this as a closed loop -- not a one-time testing effort -- end up with agents that get more reliable over time rather than degrading as user behavior evolves.
The benchmark-to-production gap article covers how to measure reliability at scale once you've established baseline coverage. And if you're deciding between unit tests, scenario tests, and production monitoring, this breakdown of what scenario testing catches maps the coverage gaps each approach leaves. The coverage framework here is the foundation. Without it, the metrics don't mean much.
The Simple Version
If you take one thing from this: your test suite should look like a representative sample of your production traffic, not a curated demo of your agent at its best.
Start with real failures. Convert support tickets into test cases. Add synthetic scenarios to cover intents you haven't seen yet. Grow the suite with each deployment. That practice, done consistently, will tell you more about whether you're ready to ship than any coverage number will.
Know your coverage gaps before you deploy
Chanl's scenario testing surfaces intent coverage gaps, runs multi-dimensional quality scoring, and connects pre-deploy results to production monitoring. So you know what you've tested and what you haven't.
See Scenario Testing