A fintech team shipped a prompt change last quarter that improved CSAT by 0.4 points. Two days later, their compliance team noticed the agent had stopped including required APR disclosures in refinancing conversations. About 12% of calls. The prompt change hadn't touched the disclosure logic. But some rephrasing around the loan amount section introduced enough ambiguity that the model started treating the disclosure as "redundant context."
The fix was a two-line prompt addition. Finding the regression took four days, three compliance reviews, and a conversation with legal about mandatory reporting obligations.
That's the regression testing gap. Your CI/CD pipeline catches code regressions. Every pull request triggers tests. You'd never ship a deterministic code change that fails them. But for your AI agent, the pipeline usually ends at "deploy the new prompt." Nothing checks that the new prompt hasn't broken the 23 things the old one got right.
The fintech team has had no regression since. Not because they got lucky. Because they built a regression test suite and run it on every deploy. Here's how to build one.
Why Agent Regression Testing Differs From Code Testing
Software regression testing is binary. The function returns the right value or it doesn't. You write an assertion. It passes or fails deterministically.
Agent regression testing isn't binary. The same agent, same prompt, same model, same input can produce different responses on different runs. That's not a bug. That's how language models work. What you want to catch is behavioral regression: the distribution of outcomes shifting in a way that violates your spec.
The disclosure regression above is a good example. The agent didn't break entirely. It still handled most conversations correctly. But a specific behavior, including APR disclosures, became unreliable. The regression wasn't "it always fails now." It was "it fails 12% of the time now, up from 0%."
Testing for that requires a different mental model than code testing. You're not testing for deterministic correctness. You're testing for behavioral stability. Does the new version produce outcomes that are statistically indistinguishable from the old version on the behaviors that matter?
That's also why point-in-time evals, running a test suite once before the first deploy, aren't enough. Agent behavior drifts over time and across changes. Continuous regression testing catches the drift that one-time evals miss.
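One way to make "statistically indistinguishable" concrete is a two-proportion z-test on failure rates. The sketch below is a minimal illustration, not a prescribed method: the type and function names are hypothetical, the 1.645 cutoff assumes a one-sided test at the 5% level, and the normal approximation is weak for small samples or rates very close to zero.

```typescript
// Run counts for one behavior (e.g. "included the APR disclosure").
interface RunCounts {
  failures: number; // runs that violated the behavior
  total: number;    // total runs executed
}

// Two-proportion z-score: positive when the candidate fails more often.
function failureRateZScore(baseline: RunCounts, candidate: RunCounts): number {
  const pBase = baseline.failures / baseline.total;
  const pCand = candidate.failures / candidate.total;
  const pooled =
    (baseline.failures + candidate.failures) / (baseline.total + candidate.total);
  const stdErr = Math.sqrt(
    pooled * (1 - pooled) * (1 / baseline.total + 1 / candidate.total)
  );
  if (stdErr === 0) return 0; // both samples all-pass or all-fail: no evidence either way
  return (pCand - pBase) / stdErr;
}

// One-sided check: has the failure rate gone up beyond sampling noise?
function hasRegressed(
  baseline: RunCounts,
  candidate: RunCounts,
  zCritical: number = 1.645
): boolean {
  return failureRateZScore(baseline, candidate) > zCritical;
}
```

With 0 failures in 200 baseline runs and 24 failures in 200 candidate runs (the 12% from the opening, on a hypothetical sample size), the z-score is about 5.05, well past the cutoff.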
The Three Test Types in an Agent Regression Suite
A mature regression suite has three categories of tests, each catching a different failure class. Golden conversations cover the happy path. Red-line tests cover your hard constraints. The edge case library covers everything production has surprised you with so far.
Golden Conversations
Golden conversations are test cases where you know the correct outcome. A passing run means the agent resolves the issue, calls the right tool correctly, or produces a response that scores above a threshold on a specific scorecard dimension.
Golden conversations aren't about exact output matching. They're about property checking. For a returns agent:
- Does the response mention the order number the customer referenced?
- Does the agent call the get_order_details tool exactly once?
- Does the automated scorer rate the response above 0.80 on task completion?
These are properties you can evaluate consistently even when the exact response wording varies. The golden conversation set is your happy-path coverage.
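The three properties listed for the returns agent can be sketched as plain checks over one agent turn. The `AgentTurn` shape and the `checkReturnsGolden` name are assumptions made for illustration; your runner's response object will look different.

```typescript
// Hypothetical shape of a single agent turn captured by a test runner.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface AgentTurn {
  text: string;
  toolCalls: ToolCall[];
  taskCompletionScore: number; // 0 to 1, from an automated scorer
}

interface PropertyResult {
  property: string;
  passed: boolean;
}

// Check properties of the response, not its exact wording.
function checkReturnsGolden(turn: AgentTurn, orderNumber: string): PropertyResult[] {
  const orderLookups = turn.toolCalls.filter(
    (call) => call.name === 'get_order_details'
  ).length;
  return [
    { property: 'mentions-order-number', passed: turn.text.includes(orderNumber) },
    { property: 'get_order_details-called-once', passed: orderLookups === 1 },
    { property: 'task-completion-above-0.80', passed: turn.taskCompletionScore > 0.8 }
  ];
}
```

Returning one result per property, rather than a single boolean, makes it easier to see which property regressed when a golden conversation fails.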
Red-Line Tests
Red-line tests are scenarios where the agent must exhibit a specific constrained behavior. These test your safety rails, compliance requirements, and explicit prohibitions.
For every constraint in your agent spec, you need at least one red-line test. The test presents a scenario that should trigger the constraint and verifies that the agent handles it correctly. Here's what one looks like for a promotional-item refund constraint.
const promoRefundRedLine = {
  name: "promo-refund-red-line",
  type: "red-line",
  messages: [
    {
      role: "user",
      content:
        "I want a full refund on the item I bought during your spring sale event."
    }
  ],
  assertions: [
    {
      type: "not-contains",
      value: "I'll process a refund",
      description: "Agent must not offer refund on promotional item"
    },
    {
      type: "scorecard-dimension",
      dimension: "compliance",
      threshold: 0.85,
      description: "Compliance score must exceed 0.85 on promo-refund scenario"
    }
  ]
};

Red-line tests are the most important tests in your suite. A golden conversation failure means your agent got worse at something. A red-line failure means your agent crossed a line it was never supposed to cross.
Run red-line tests on every single build, without exception, in every environment. They're your safety net.
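The "at least one red-line test per constraint" rule is easy to check mechanically. Here is a minimal sketch, assuming each test carries a `constraintId` field that links it back to the spec. That field is hypothetical and is not part of the test object shown above.

```typescript
// Hypothetical spec and test shapes; `constraintId` is an assumed linking field.
interface Constraint {
  id: string;
  description: string;
}

interface RedLineTestRef {
  name: string;
  type: 'red-line';
  constraintId: string;
}

// Return every constraint that no red-line test covers.
function findUncoveredConstraints(
  constraints: Constraint[],
  tests: RedLineTestRef[]
): Constraint[] {
  const covered = new Set(tests.map((test) => test.constraintId));
  return constraints.filter((constraint) => !covered.has(constraint.id));
}
```

Running a check like this in CI, and failing when the returned list is non-empty, keeps the suite from silently falling behind the spec.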
Edge Case Library
The edge case library is built from production failures and near-misses. Every time an agent does something unexpected in production, even if it recovers gracefully, you add a test case for it.
This is the most neglected part of most test suites. Teams write golden conversations upfront and red-lines from the spec. But edge cases come from production, and they only get added to the suite if someone's disciplined enough to convert incidents into test cases.
Build the habit: every time a bug reaches production, the incident response includes adding a regression test. After six months, your edge case library is the most valuable part of your test suite because it documents all the ways your agent surprised you, not just the ways you expected it to fail.
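Converting an incident into a test case is mostly bookkeeping, which is why it helps to script it. The sketch below assumes a simple `Incident` record with the user messages that triggered the problem and the phrases the agent should not have said. Both shapes are illustrative, not a real incident-tracker schema.

```typescript
// Hypothetical incident record captured during incident response.
interface Incident {
  id: string;
  summary: string;
  userMessages: string[];       // what the customer said, in order
  prohibitedPhrases: string[];  // what the agent said that it should not have
}

interface EdgeCaseTest {
  name: string;
  type: 'edge-case';
  messages: { role: 'user'; content: string }[];
  assertions: { type: 'not-contains'; value: string; description: string }[];
}

// Turn one production incident into one regression test case.
function incidentToEdgeCase(incident: Incident): EdgeCaseTest {
  return {
    name: `incident-${incident.id}`,
    type: 'edge-case',
    messages: incident.userMessages.map((content) => ({ role: 'user', content })),
    assertions: incident.prohibitedPhrases.map((value) => ({
      type: 'not-contains',
      value,
      description: `Regression guard for incident ${incident.id}: ${incident.summary}`
    }))
  };
}
```

A string-match assertion is only a starting point. Incidents about tone or missing information usually need a scorecard assertion added by hand.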
Handling Non-Determinism With a Statistical Verdict
Here's the practical problem. If you run the same test case on a language model, you get different responses. How do you decide if a test passed?
Three approaches work, and they're not mutually exclusive. Pick per test type.
Majority vote. Run each test case N times (typically 3 to 5) and count the results. If more than half of the runs pass the assertion (at least three out of five), the test passes. This smooths out sampling noise without being expensive. Use majority vote for most golden conversations and edge cases.
All-pass requirement. For red-line tests, require all N runs to pass. If your agent violates a constraint even once, the test fails. This is strict, but constraints should be strict. If a behavior is prohibited, it should be prohibited in every run, not just most of them.
Statistical threshold. For scorecard-dimension assertions, you're checking whether the distribution of scores meets a threshold, not whether a single run meets it. Run 10 test iterations and check that the median score is above your threshold. This gives you a stable signal for behaviors that vary in quality but not in direction.
The runner below routes each test type to the appropriate verdict rule. The test case's type field decides which one applies.

async function runRegressionTest(
  test: RegressionTest,
  agentId: string,
  runs: number = 5
): Promise<TestResult> {
  // Execute the same test case `runs` times in parallel
  const results = await Promise.all(
    Array.from({ length: runs }, () => executeTest(test, agentId))
  );

  if (test.type === 'red-line') {
    // Red-lines require all runs to pass
    const allPass = results.every((r) => r.passed);
    return { passed: allPass, results, verdict: 'all-or-nothing' };
  }

  if (test.type === 'scorecard') {
    // Scorecard tests check median against threshold.
    // Sort numerically: the default sort() compares values as strings.
    const scores = results.map((r) => r.score).sort((a, b) => a - b);
    const mid = Math.floor(scores.length / 2);
    // Average the two middle values when the run count is even
    const median =
      scores.length % 2 === 0 ? (scores[mid - 1] + scores[mid]) / 2 : scores[mid];
    return {
      passed: median >= test.threshold!,
      results,
      verdict: 'median',
      medianScore: median
    };
  }

  // Golden conversations and edge cases use majority vote
  const passCount = results.filter((r) => r.passed).length;
  return {
    passed: passCount > runs / 2,
    results,
    verdict: 'majority',
    passRate: passCount / runs
  };
}

This approach makes your test results stable enough to use as a CI gate without requiring deterministic outputs.
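One thing worth sizing before you rely on the all-pass rule is how likely N runs are to catch an intermittent violation. As a rough sketch, assuming runs are independent and the violation rate p is constant (both simplifications), the chance of seeing at least one failure in N runs is 1 - (1 - p)^N. The function names are illustrative.

```typescript
// Probability that at least one of `runs` independent runs shows the violation.
function detectionProbability(violationRate: number, runs: number): number {
  return 1 - Math.pow(1 - violationRate, runs);
}

// Smallest run count that reaches the target detection probability.
function runsNeeded(violationRate: number, targetDetection: number): number {
  if (violationRate <= 0 || violationRate >= 1) {
    throw new RangeError('violationRate must be strictly between 0 and 1');
  }
  if (targetDetection <= 0 || targetDetection >= 1) {
    throw new RangeError('targetDetection must be strictly between 0 and 1');
  }
  return Math.ceil(Math.log(1 - targetDetection) / Math.log(1 - violationRate));
}
```

Under these assumptions, a violation that appears in 12% of runs is caught by five runs less than half the time (about 47%), and reaching 95% takes 24 runs. That is worth weighing when you choose run counts for red-line tests on rare but serious failures.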
The Two-Phase CI Pipeline
A good agent CI/CD pipeline has two phases, each optimized for a different trade-off. Phase 1 keeps PRs fast. Phase 2 keeps production safe.
Phase 1: Pre-Merge (Fast Suite)
When a developer opens a pull request, the fast suite runs automatically. It should finish in under 5 minutes so it doesn't kill development velocity.
The fast suite includes:
- All red-line tests (non-negotiable, run on every PR)
- Top 20 to 30 golden conversations for your highest-volume flows
- Your 5 to 10 most important edge cases
Run each test case 3 times. Total: 35 to 50 test cases at 3 runs each. At typical API latency with 10 parallel requests, that's 3 to 4 minutes. If the fast suite fails, the PR is blocked. No merge without addressing the failure.
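The timing estimate is simple arithmetic you can rerun as your suite grows. This sketch assumes every run takes about the same time and that runs execute in full parallel batches. The 15 seconds per run used in the note below is an assumed figure for illustration, not a measurement.

```typescript
// Rough wall-clock estimate for a suite, in minutes.
function estimateSuiteMinutes(
  testCases: number,
  runsPerCase: number,
  parallel: number,
  secondsPerRun: number
): number {
  const totalRuns = testCases * runsPerCase;
  const batches = Math.ceil(totalRuns / parallel); // last batch may be partial
  return (batches * secondsPerRun) / 60;
}
```

With 50 cases, 3 runs each, 10 in parallel, and an assumed 15 seconds per run, the estimate is 3.75 minutes. With 35 cases it is 2.75 minutes. If your measured per-run latency is higher, raise parallelism or trim the fast suite to stay under the 5-minute budget.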
Phase 2: Pre-Deploy (Full Suite)
When a change is merged and ready for production, the full suite runs. This takes 10 to 30 minutes and covers your complete behavioral surface.
The full suite includes:
- All red-line tests
- Full golden conversation set (100 to 200 scenarios)
- Complete edge case library
- Persona variants (different customer types running the same scenario)
- Constraint stress tests (users pushing back on refused requests)
Run each case 5 times. This phase is your final gate before production. Taking 20 minutes is fine. You only run it once per deploy cycle.
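Persona variants are a cross product: every base scenario is replayed once per customer type. A minimal sketch of that expansion follows, with hypothetical `Scenario` and `Persona` shapes. How the persona instructions reach your simulated customer depends on your runner.

```typescript
// Hypothetical shapes for base scenarios and customer personas.
interface Scenario {
  id: string;
  openingMessage: string;
}

interface Persona {
  id: string;
  instructions: string; // e.g. "terse, impatient, types in fragments"
}

interface ScenarioVariant {
  id: string;
  scenarioId: string;
  personaId: string;
  openingMessage: string;
  personaInstructions: string;
}

// Expand each scenario into one variant per persona.
function expandWithPersonas(scenarios: Scenario[], personas: Persona[]): ScenarioVariant[] {
  const variants: ScenarioVariant[] = [];
  for (const scenario of scenarios) {
    for (const persona of personas) {
      variants.push({
        id: `${scenario.id}--${persona.id}`,
        scenarioId: scenario.id,
        personaId: persona.id,
        openingMessage: scenario.openingMessage,
        personaInstructions: persona.instructions
      });
    }
  }
  return variants;
}
```

The variant count grows multiplicatively, which is one reason this expansion belongs in the full suite rather than the fast one.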

With Chanl's scenario runner, you can configure both phases from a single API call:
import { Chanl } from '@chanl/sdk';

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });

// Phase 1: Fast suite (run on every PR)
export async function runFastSuite(agentId: string): Promise<GateResult> {
  const results = await chanl.scenarios.run({
    agentId,
    suiteId: 'returns-agent-fast',
    scorecardId: 'returns-agent-scorecard',
    runsPerScenario: 3,
    parallel: 10
  });
  return {
    passed: results.redLines.allPassed && results.golden.passRate > 0.85,
    summary: results.summary,
    failures: results.failures
  };
}

// Phase 2: Full suite (run before production deploy)
export async function runFullSuite(agentId: string): Promise<GateResult> {
  const results = await chanl.scenarios.run({
    agentId,
    suiteId: 'returns-agent-full',
    scorecardId: 'returns-agent-scorecard',
    runsPerScenario: 5,
    parallel: 20
  });
  const scorecardGate = await chanl.scorecards.evaluate({
    results: results.conversations,
    baseline: 'returns-agent-baseline',
    regressionThreshold: 0.05 // fail if any dimension drops by more than 0.05 (5 percentage points)
  });
  return {
    passed: results.redLines.allPassed && scorecardGate.noRegressions,
    regressions: scorecardGate.regressions,
    summary: results.summary
  };
}

Production Regression Monitoring: The Third Layer
Pre-deploy testing catches regressions before they reach users. Production monitoring catches what pre-deploy testing missed, including model drift (when the underlying LLM gets silently updated by the provider), traffic distribution shifts, and edge cases your test suite didn't cover.
Production monitoring works by sampling live conversations, scoring them through your scorecard, and comparing the rolling distribution against your last deploy's baseline. Run this on a cron, alert when any dimension drifts beyond a defined threshold.
import { Chanl } from '@chanl/sdk';

const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });

// Cron job: sample last 24h of conversations, score them, compare to baseline
async function checkProductionDrift(agentId: string, baselineId: string) {
  const recent = await chanl.transcript.list({
    agentId,
    since: '24h',
    sample: 0.05 // 5% of live conversations
  });
  const scored = await chanl.scorecards.evaluate({
    scorecardId: 'returns-agent-scorecard',
    conversations: recent.items,
    baseline: baselineId,
    regressionThreshold: 0.05
  });
  for (const regression of scored.regressions) {
    if (regression.dimension === 'compliance' && regression.median < 0.80) {
      await notify(['#cx-ops', 'compliance@company.com'], regression);
    } else if (regression.delta > 0.05) {
      await notify(['#cx-ops'], regression);
    }
  }
}

The three-layer approach covers the full lifecycle. Pre-merge testing catches obvious breaks. Pre-deploy testing catches subtle regressions. Production monitoring catches what both miss.
For teams running shadow mode deployments alongside regression suites, the two approaches reinforce each other. Shadow mode gives you behavioral comparison against live traffic, while regression suites give you repeatable checks against known cases.
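If you want to see what the baseline comparison in the monitoring step amounts to, it can be sketched locally as a per-dimension median check. This is an illustration under assumed data shapes (scores on a 0 to 1 scale, keyed by dimension name), not a description of how any particular scoring service computes regressions.

```typescript
type ScoresByDimension = Record<string, number[]>;

interface Drift {
  dimension: string;
  baseline: number;
  current: number;
  delta: number; // baseline minus current; positive means quality dropped
}

function median(values: number[]): number {
  if (values.length === 0) throw new RangeError('median of empty list');
  const sorted = [...values].sort((a, b) => a - b); // numeric sort on a copy
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Flag every dimension whose recent median fell more than `threshold` below baseline.
function findDrift(
  baseline: Record<string, number>,
  recent: ScoresByDimension,
  threshold: number = 0.05
): Drift[] {
  const drifts: Drift[] = [];
  for (const dimension of Object.keys(baseline)) {
    const scores = recent[dimension];
    if (!scores || scores.length === 0) continue; // no samples for this dimension
    const current = median(scores);
    const delta = baseline[dimension] - current;
    if (delta > threshold) {
      drifts.push({ dimension, baseline: baseline[dimension], current, delta });
    }
  }
  return drifts;
}
```

Only downward movement is flagged here. Improvements are left for the baseline review rather than raising an alert.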
Token-Efficient Regression Testing
Running 500 scenarios at 5 iterations each means 2,500 agent calls per deploy. At production conversation costs, that adds up. There's a practical limit to how much you can spend on regression testing.
The AgentAssay paper (arXiv, March 2026) describes an approach that cuts this cost significantly. The key insight: most regression assertions are binary checks that don't require a full LLM judge. "Did the agent call the right tool?" is a structured check, not a quality judgment. "Did the response contain a prohibited phrase?" is a string match.
By splitting assertions into binary checks (cheap to evaluate) and quality assessments (require LLM judge), you run binary checks on all 2,500 runs and LLM scoring only on the 15% to 20% that need it. This drops regression suite cost by 70% to 85% compared to full LLM-as-judge evaluation on every run.
type AssertionType =
  | 'contains'
  | 'not-contains'
  | 'tool-called'
  | 'tool-arg'
  | 'scorecard'
  | 'llm-judge';

function routeAssertion(type: AssertionType): 'binary' | 'llm-judge' {
  const binaryTypes: AssertionType[] = [
    'contains',
    'not-contains',
    'tool-called',
    'tool-arg'
  ];
  return binaryTypes.includes(type) ? 'binary' : 'llm-judge';
}

async function evaluateAssertion(
  assertion: { type: AssertionType; [key: string]: unknown },
  response: AgentResponse
): Promise<boolean> {
  if (routeAssertion(assertion.type) === 'binary') {
    return evaluateBinary(assertion, response); // fast, cheap, no API call
  } else {
    return evaluateLlmJudge(assertion, response); // LLM call only when needed
  }
}

Add this routing and your test suite cost drops dramatically while coverage stays constant.
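The routing code calls an `evaluateBinary` helper without showing it. Here is one way it could look, as a sketch: the `AgentResponse` shape and the assertion fields (`value`, `tool`, `arg`, `expected`) are assumptions, and the signature is stricter than the loosely typed one used above.

```typescript
// Assumed response shape: final text plus the tool calls the agent made.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface AgentResponse {
  text: string;
  toolCalls: ToolCall[];
}

// Hypothetical field names for each binary assertion type.
type BinaryAssertion =
  | { type: 'contains' | 'not-contains'; value: string }
  | { type: 'tool-called'; tool: string }
  | { type: 'tool-arg'; tool: string; arg: string; expected: unknown };

// Pure, synchronous checks: no model call, so they cost effectively nothing per run.
function evaluateBinary(assertion: BinaryAssertion, response: AgentResponse): boolean {
  const text = response.text.toLowerCase();
  switch (assertion.type) {
    case 'contains':
      return text.includes(assertion.value.toLowerCase());
    case 'not-contains':
      return !text.includes(assertion.value.toLowerCase());
    case 'tool-called':
      return response.toolCalls.some((call) => call.name === assertion.tool);
    case 'tool-arg':
      // Strict equality, so this only works for primitive argument values
      return response.toolCalls.some(
        (call) =>
          call.name === assertion.tool && call.args[assertion.arg] === assertion.expected
      );
  }
}
```

String matches are cheap but brittle. An agent that paraphrases a prohibited offer will slip past a `not-contains` check, which is why the red-line example pairs it with a scorecard assertion.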
Building Your First Regression Suite
If you don't have a regression suite today, here's the minimum viable path to one.
Week 1: Red-lines only (10-15 scenarios)
Start with your top constraints. For each constraint in your agent spec, write one red-line test. Run them manually on your current agent to establish a passing baseline. Add them to your CI pipeline. If any red-line fails on a PR, the PR is blocked. You now have a safety net for your most critical behaviors.
Week 2: Top 20 golden conversations
Find your 20 most common conversation types from your analytics. Write one golden conversation test case for each. Run them, establish a baseline, add them to the fast suite.
Week 3: Edge case library (your first 10)
Pull your last 10 customer complaints or production incidents. Convert each into a test case. Add them to the full suite.
At the end of three weeks, you have a suite of 40 to 45 scenarios (10 to 15 red-lines, 20 golden conversations, 10 edge cases) that covers your red-lines, happy paths, and known failures. That's enough to catch the kind of regression that would have cost you four days and a legal conversation.
Check Chanl's scenario library for pre-built scenario templates by industry and conversation type. Most teams can start from existing templates and customize rather than writing from scratch.
The Ratchet Principle
Here's the one rule that turns a regression suite from a liability into an asset: never let the pass rate decrease.
Every week, review your scorecard baselines. If any dimension improved, raise the threshold. If it held steady, keep it. Never lower a threshold because a new version underperformed.
That's the ratchet. Quality gates can only move up. Each successful deploy sets a new floor that future deploys have to meet or exceed.
Teams that violate the ratchet end up with regression suites that gradually allow worse and worse behavior. The suite becomes a historical artifact, not a living gate. The ratchet keeps it honest.
Set your first baselines from your first successful deploy. From then on, every deploy has to meet or beat the previous one.
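The ratchet itself is a one-line rule: the new threshold for each dimension is the larger of the current threshold and the latest score. A minimal sketch, assuming thresholds and scores are stored per dimension on the same scale (the `Thresholds` type and `ratchet` name are illustrative):

```typescript
type Thresholds = Record<string, number>;

// Raise each threshold to the observed score if it improved; never lower one.
function ratchet(current: Thresholds, observed: Thresholds): Thresholds {
  const next: Thresholds = { ...current };
  for (const [dimension, score] of Object.entries(observed)) {
    const floor = current[dimension];
    // A dimension seen for the first time starts at its observed score
    next[dimension] = floor === undefined ? score : Math.max(floor, score);
  }
  return next;
}
```

In practice you may want to feed this a median over several runs rather than a single score, so one lucky run doesn't set a floor the agent can't reliably meet.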
The fintech team from the opening now runs 312 scenarios on every PR and 780 on every deploy. Their CI pipeline catches regressions in 4 minutes. Their compliance team hasn't opened a retroactive incident in eight months. That's not luck. It's infrastructure they built once and never had to rebuild.



