What Is Agent Regression Testing?

Agent regression testing verifies that a new version of your agent, after a prompt update, model upgrade, or tool change, still passes the behavioral checks the previous version passed. It's the equivalent of a unit test suite for agent behavior: run it on every change, fail the build when something breaks.

How Do You Write Tests for Non-Deterministic AI Agents?

Instead of asserting an exact output, you assert properties of the output. Does it contain a prohibited phrase? Does it call the right tool? Does an automated scorer rate its accuracy above a threshold? Run each test case 3 to 5 times and use majority vote to determine pass/fail, which smooths out non-determinism without requiring deterministic outputs.

What's the Difference Between a Golden Set and a Red-Line Test?

A golden set contains conversations where you know the right outcome. The agent should resolve the issue, use a specific tool, or respond with certain information. Red-line tests are the opposite: scenarios where the agent must exhibit a constrained behavior, like refusing a prohibited request. Red-lines verify your safety and compliance constraints still hold after every change.

How Do I Build a Fast Regression Suite for Pre-Merge CI?

Start with 35 to 50 high-signal scenarios: all of your red-line tests, golden conversations for your 20 to 30 highest-volume flows, and your 5 to 10 most important edge cases. Run each 3 times. Target a total suite runtime under 5 minutes. This gives you fast feedback on pull requests without blocking developers for 30-plus minutes.

What Should Trigger an Agent Regression Alert in Production?

Monitor three signals: automated scorecard drops more than 5 points from baseline over 500 conversations, a specific constraint dimension drops below 0.75, or escalation rate increases more than 15% week over week. These catch regressions from model drift and traffic distribution shifts, not just code changes.
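
As a rough sketch, the three signals can be folded into one pure function. The WindowStats shape and every field name below are hypothetical rather than part of any SDK, the scorecard is assumed to be on a 0 to 100 scale, and the escalation change is assumed to be a relative week-over-week increase. The thresholds are the ones stated above.

```typescript
// Hypothetical summary of one monitoring window; field names are illustrative only.
interface WindowStats {
  conversations: number; // conversations scored in this window
  baselineScore: number; // baseline scorecard score, assumed 0-100
  currentScore: number; // current scorecard score, assumed 0-100
  constraintScores: Record<string, number>; // per-constraint dimension, 0-1
  escalationRateLastWeek: number; // fraction of conversations escalated
  escalationRateThisWeek: number;
}

function regressionAlerts(stats: WindowStats): string[] {
  const alerts: string[] = [];

  // Signal 1: scorecard drops more than 5 points over at least 500 conversations.
  if (stats.conversations >= 500 && stats.baselineScore - stats.currentScore > 5) {
    alerts.push("scorecard-drop");
  }

  // Signal 2: any constraint dimension falls below 0.75.
  for (const dimension of Object.keys(stats.constraintScores)) {
    if (stats.constraintScores[dimension] < 0.75) {
      alerts.push(`constraint:${dimension}`);
    }
  }

  // Signal 3: escalation rate up more than 15% week over week (relative change, an assumption).
  if (stats.escalationRateLastWeek > 0) {
    const change =
      (stats.escalationRateThisWeek - stats.escalationRateLastWeek) /
      stats.escalationRateLastWeek;
    if (change > 0.15) {
      alerts.push("escalation-rate");
    }
  }

  return alerts;
}
```

An empty array means no signal fired. Anything else can be forwarded to whatever alerting channel the team already uses.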

How Many Test Scenarios Do I Need for a CI/CD Regression Suite?

A minimum viable regression suite has 50 scenarios covering your top conversation types and hard constraints. A mature suite has 200 to 500 scenarios with full coverage across conversation types, constraint categories, edge cases, and persona variants. More is not always better. 200 high-signal scenarios outperform 1,000 randomly sampled ones.

Does Non-Determinism Make Agent Regression Testing Unreliable?

Non-determinism affects point-in-time testing more than trend-based testing. Running each scenario 3 to 5 times and using majority vote handles most variance. The real reliability comes from consistent scoring criteria: an LLM judge grading the same rubric produces consistent score distributions even when individual responses vary. Track distributions, not individual runs.

What Is the AgentAssay Approach to Token-Efficient Regression Testing?

AgentAssay (arXiv, March 2026) introduces a token-efficient approach where behavioral assertions are evaluated by a small, cheap classifier model rather than a full LLM judge. Most regression checks are binary (did this happen or not?) and can be evaluated by a classifier. This reduces regression suite cost by 70% to 85% compared to full LLM-as-judge evaluation on every run.

How to Build a Regression Test Suite for AI Agents

A fintech team shipped a prompt change last quarter that improved CSAT by 0.4 points. Two days later, their compliance team noticed the agent had stopped including required APR disclosures in refinancing conversations. About 12% of calls. The prompt change hadn't touched the disclosure logic. But some rephrasing around the loan amount section introduced enough ambiguity that the model started treating the disclosure as "redundant context."

The fix was a two-line prompt addition. Finding the regression took four days, three compliance reviews, and a conversation with legal about mandatory reporting obligations.

That's the regression testing gap. Your CI/CD pipeline catches code regressions. Every pull request triggers tests. You'd never ship a deterministic code change that fails them. But for your AI agent, the pipeline usually ends at "deploy the new prompt." Nothing checks that the new prompt hasn't broken the 23 things the old one got right.

The fintech team has had no regression since. Not because they got lucky. Because they built a regression test suite and run it on every deploy. Here's how to build one.

Why Agent Regression Testing Differs From Code Testing

Software regression testing is binary. The function returns the right value or it doesn't. You write an assertion. It passes or fails deterministically.

Agent regression testing isn't binary. The same agent, same prompt, same model, same input can produce different responses on different runs. That's not a bug. That's how language models work. What you want to catch is behavioral regression: the distribution of outcomes shifting in a way that violates your spec.

The disclosure regression above is a good example. The agent didn't break entirely. It still handled most conversations correctly. But a specific behavior, including APR disclosures, became unreliable. The regression wasn't "it always fails now." It was "it fails 12% of the time now, up from 0%."

Testing for that requires a different mental model than code testing. You're not testing for deterministic correctness. You're testing for behavioral stability. Does the new version produce outcomes that are statistically indistinguishable from the old version on the behaviors that matter?
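
One way to make "statistically indistinguishable" concrete is a two-proportion z-test on failure rates. This is a standard textbook technique swapped in for illustration, not something the teams described here are known to use. The function names are hypothetical, and the normal approximation behind it is only rough when failure counts are very small.

```typescript
// Two-proportion z-statistic comparing the failure rate of a candidate
// agent version against the baseline version on the same behavior.
function failureRateZ(
  baseFailures: number,
  baseRuns: number,
  candFailures: number,
  candRuns: number
): number {
  const pBase = baseFailures / baseRuns;
  const pCand = candFailures / candRuns;
  // Pooled failure rate under the hypothesis that both versions behave the same.
  const pooled = (baseFailures + candFailures) / (baseRuns + candRuns);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / baseRuns + 1 / candRuns)
  );
  if (standardError === 0) {
    return 0; // both versions never fail (or always fail): no measurable difference
  }
  return (pCand - pBase) / standardError;
}

// Only one direction matters: the candidate failing more often than the baseline.
// 1.96 is the usual two-sided 5% cutoff, used here as a conservative one-sided bar.
function isRegression(z: number, critical: number = 1.96): boolean {
  return z > critical;
}
```

For the disclosure example, 0 failures in 100 baseline runs against 12 failures in 100 candidate runs gives a z-statistic of roughly 3.57, well past the cutoff. The run counts of 100 are assumed for illustration.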

That's also why point-in-time evals, running a test suite once before the first deploy, aren't enough. Agent behavior drifts over time and across changes. Continuous regression testing catches the drift that one-time evals miss.

The Three Test Types in an Agent Regression Suite

A mature regression suite has three categories of tests, each catching a different failure class. Golden conversations cover the happy path. Red-line tests cover your hard constraints. The edge case library covers everything production has surprised you with so far.

Golden Conversations

Golden conversations are test cases where you know the correct outcome. The agent should resolve the issue, call the right tool correctly, or produce a response that scores above a threshold on a specific scorecard dimension.

Golden conversations aren't about exact output matching. They're about property checking. For a returns agent:

Does the response mention the order number the customer referenced?
Does the agent call the get_order_details tool exactly once?
Does the automated scorer rate the response above 0.80 on task completion?

These are properties you can evaluate consistently even when the exact response wording varies. The golden conversation set is your happy-path coverage.
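
The three checks above might be sketched as follows. The AgentRun shape is hypothetical (it is not taken from any SDK), and the scorer is assumed to return a task-completion value between 0 and 1. Only the get_order_details tool name and the 0.80 threshold come from the text.

```typescript
// Hypothetical record of one agent run; the shape is assumed for illustration.
interface AgentRun {
  responseText: string;
  toolCalls: { name: string; args: Record<string, unknown> }[];
  taskCompletionScore: number; // 0-1, produced by an automated scorer
}

interface PropertyResult {
  property: string;
  passed: boolean;
}

// Checks properties of the output rather than its exact wording.
function checkReturnsGolden(run: AgentRun, orderNumber: string): PropertyResult[] {
  const orderLookups = run.toolCalls.filter(
    (call) => call.name === "get_order_details"
  );
  return [
    {
      property: "mentions-order-number",
      passed: run.responseText.indexOf(orderNumber) !== -1,
    },
    {
      property: "calls-get_order_details-exactly-once",
      passed: orderLookups.length === 1,
    },
    {
      property: "task-completion-above-0.80",
      passed: run.taskCompletionScore > 0.8,
    },
  ];
}
```

A single run passes the golden conversation when every property passes. How several runs combine into one verdict is a separate decision.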

Red-Line Tests

Red-line tests are scenarios where the agent must exhibit a specific constrained behavior. These test your safety rails, compliance requirements, and explicit prohibitions.

For every constraint in your agent spec, you need at least one red-line test. The test presents a scenario that should trigger the constraint and verifies that the agent handles it correctly. Here's what one looks like for a promotional-item refund constraint.

red-line-test.ts·typescript

const promoRefundRedLine = {
  name: "promo-refund-red-line",
  type: "red-line",
  messages: [
    {
      role: "user",
      content:
        "I want a full refund on the item I bought during your spring sale event."
    }
  ],
  assertions: [
    {
      type: "not-contains",
      value: "I'll process a refund",
      description: "Agent must not offer refund on promotional item"
    },
    {
      type: "scorecard-dimension",
      dimension: "compliance",
      threshold: 0.85,
      description: "Compliance score must exceed 0.85 on promo-refund scenario"
    }
  ]
};

Red-line tests are the most important tests in your suite. A golden conversation failure means your agent got worse at something. A red-line failure means your agent crossed a line it was never supposed to cross.

Run red-line tests on every single build, without exception, in every environment. They're your safety net.

Edge Case Library

The edge case library is built from production failures and near-misses. Every time an agent does something unexpected in production, even if it recovers gracefully, you add a test case for it.

This is the most neglected part of most test suites. Teams write golden conversations upfront and red-lines from the spec. But edge cases come from production, and they only get added to the suite if someone's disciplined enough to convert incidents into test cases.

Build the habit: every time a bug reaches production, the incident response includes adding a regression test. After six months, your edge case library is the most valuable part of your test suite because it documents all the ways your agent surprised you, not just the ways you expected it to fail.
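
To keep that habit cheap, the conversion from incident to test case can be mostly mechanical. The sketch below assumes a hypothetical Incident record and produces a test object shaped like the red-line example earlier in this article. The "edge-case" type name and all field names are assumptions.

```typescript
// Hypothetical incident record captured during incident response.
interface Incident {
  id: string;
  summary: string;
  userMessages: string[]; // what the customer said, in order
  prohibitedPhrases: string[]; // agent output that must not reappear
}

interface EdgeCaseTest {
  name: string;
  type: "edge-case";
  messages: { role: "user"; content: string }[];
  assertions: { type: "not-contains"; value: string; description: string }[];
}

// Turns one production incident into one regression test case.
function incidentToEdgeCase(incident: Incident): EdgeCaseTest {
  return {
    name: `incident-${incident.id}`,
    type: "edge-case",
    messages: incident.userMessages.map((content) => ({
      role: "user" as const,
      content,
    })),
    assertions: incident.prohibitedPhrases.map((value) => ({
      type: "not-contains" as const,
      value,
      description: `Regression guard for incident ${incident.id}: ${incident.summary}`,
    })),
  };
}
```

Phrase matching is only a starting point. Incidents where the failure was an omission, like the missing disclosure, need a positive assertion written by hand.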

Handling Non-Determinism With a Statistical Verdict

Here's the practical problem. If you run the same test case against a language model several times, you can get a different response each time. How do you decide if a test passed?

Three approaches work, and they're not mutually exclusive. Pick per test type.

Majority vote. Run each test case N times (typically 3 to 5) and count the results. If more than half of the runs pass the assertion (at least three out of five), the test passes. This smooths out sampling noise without being expensive. Use majority vote for most golden conversations and edge cases.

All-pass requirement. For red-line tests, require all N runs to pass. If your agent violates a constraint even once, the test fails. This is strict, but constraints should be strict. If a behavior is prohibited, it should be prohibited in every run, not just most of them.

Statistical threshold. For scorecard-dimension assertions, you're checking whether the distribution of scores meets a threshold, not whether a single run meets it. Run 10 test iterations and check that the median score is above your threshold. This gives you a stable signal for behaviors that vary in quality but not in direction.

The runner below routes each test type to the matching verdict rule. The test case's type field decides which one applies.

regression-runner.ts·typescript

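async function runRegressionTest(
  test: RegressionTest,
  agentId: string,
  runs: number = 5
): Promise<TestResult> {
  // Execute every run in parallel and collect the per-run results
  const results = await Promise.all(
    Array.from({ length: runs }, () => executeTest(test, agentId))
  );

  if (test.type === 'red-line') {
    // Red-lines require all runs to pass
    const allPass = results.every((r) => r.passed);
    return { passed: allPass, results, verdict: 'all-or-nothing' };
  }

  // 'edge-case' is an assumed type name; edge cases share the majority rule
  if (test.type === 'golden' || test.type === 'edge-case') {
    // Golden conversations and edge cases use majority vote
    const passCount = results.filter((r) => r.passed).length;
    return {
      passed: passCount > runs / 2,
      results,
      verdict: 'majority',
      passRate: passCount / runs
    };
  }

  if (test.type === 'scorecard') {
    // Scorecard tests check median against threshold.
    // Sort numerically: the default sort() compares values as strings.
    const scores = results.map((r) => r.score).sort((a, b) => a - b);
    const mid = Math.floor(scores.length / 2);
    // Average the two middle scores when the run count is even
    const median =
      scores.length % 2 === 0 ? (scores[mid - 1] + scores[mid]) / 2 : scores[mid];
    return {
      passed: median >= test.threshold!,
      results,
      verdict: 'median',
      medianScore: median
    };
  }

  // Fail loudly on an unrecognized type instead of resolving to undefined
  throw new Error(`Unknown test type: ${test.type}`);
}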

This approach makes your test results stable enough to use as a CI gate without requiring deterministic outputs.

The Two-Phase CI Pipeline

A good agent CI/CD pipeline has two phases, each optimized for a different trade-off. Phase 1 keeps PRs fast. Phase 2 keeps production safe.

Phase 1: Pre-Merge (Fast Suite)

When a developer opens a pull request, the fast suite runs automatically. It should finish in under 5 minutes so it doesn't kill development velocity.

The fast suite includes:

All red-line tests (non-negotiable, run on every PR)
Top 20 to 30 golden conversations for your highest-volume flows
Your 5 to 10 most important edge cases

Run each test case 3 times. Total: 35 to 50 test cases at 3 runs each. At typical API latency with 10 parallel requests, that's 3 to 4 minutes. If the fast suite fails, the PR is blocked. No merge without addressing the failure.
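
The runtime figure is simple arithmetic, sketched below. The estimate assumes runs execute in waves of fixed size and that every run takes about the same time. The 15 seconds per run used in the note that follows is an assumed value, not a measured one.

```typescript
// Rough wall-clock estimate for a suite: runs execute in waves of `parallel`
// concurrent requests, and each wave takes about `secondsPerRun` seconds.
function estimateSuiteMinutes(
  testCases: number,
  runsPerCase: number,
  parallel: number,
  secondsPerRun: number
): number {
  const totalRuns = testCases * runsPerCase;
  const waves = Math.ceil(totalRuns / parallel);
  return (waves * secondsPerRun) / 60;
}
```

With 50 test cases, 3 runs each, and 10 parallel requests, that is 150 runs in 15 waves. At an assumed 15 seconds per run, estimateSuiteMinutes(50, 3, 10, 15) returns 3.75 minutes. The same assumptions with 35 test cases give 2.75 minutes.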

Phase 2: Pre-Deploy (Full Suite)

When a change is merged and ready for production, the full suite runs. This takes 10 to 30 minutes and covers your complete behavioral surface.

The full suite includes:

All red-line tests
Full golden conversation set (100 to 200 scenarios)
Complete edge case library
Persona variants (different customer types running the same scenario)
Constraint stress tests (users pushing back on refused requests)

Run each case 5 times. This phase is your final gate before production. Taking 20 minutes is fine. You only run it once per deploy cycle.
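
Persona variants can be generated rather than hand-written. The sketch below is one crude way to do it, assuming hypothetical Scenario and Persona shapes: each persona contributes an opening line that is prepended to the scenario's messages. Real persona simulation is usually richer than this.

```typescript
// Hypothetical shapes; neither comes from an existing SDK.
interface Scenario {
  name: string;
  messages: { role: "user"; content: string }[];
}

interface Persona {
  id: string;
  openingLine: string; // sets the customer's tone before the scenario starts
}

// Produces one scenario per persona, leaving the original scenario untouched.
function expandPersonas(scenario: Scenario, personas: Persona[]): Scenario[] {
  return personas.map((persona) => ({
    name: `${scenario.name}--${persona.id}`,
    messages: [
      { role: "user" as const, content: persona.openingLine },
      ...scenario.messages,
    ],
  }));
}
```

Three personas applied to 100 golden scenarios yield 300 variants, so it is worth restricting expansion to the flows where customer type plausibly changes agent behavior.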

Figure: Two-phase regression pipeline. The fast suite gates every PR; the full suite gates every production deploy.

Example deploy gate (pre-deploy quality checks): score 92% (required > 80%), latency 234ms (required < 500ms), error rate 3.1% (required < 2%). Result: deploy blocked.

With Chanl's scenario runner, each phase is driven by one scenarios.run call; the full suite adds a scorecard comparison against a baseline:

regression-pipeline.ts·typescript

import { Chanl } from '@chanl/sdk';
 
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });
 
// Phase 1: Fast suite (run on every PR)
export async function runFastSuite(agentId: string): Promise<GateResult> {
  const results = await chanl.scenarios.run({
    agentId,
    suiteId: 'returns-agent-fast',
    scorecardId: 'returns-agent-scorecard',
    runsPerScenario: 3,
    parallel: 10
  });
 
  return {
    passed: results.redLines.allPassed && results.golden.passRate > 0.85,
    summary: results.summary,
    failures: results.failures
  };
}
 
// Phase 2: Full suite (run before production deploy)
export async function runFullSuite(agentId: string): Promise<GateResult> {
  const results = await chanl.scenarios.run({
    agentId,
    suiteId: 'returns-agent-full',
    scorecardId: 'returns-agent-scorecard',
    runsPerScenario: 5,
    parallel: 20
  });
 
  const scorecardGate = await chanl.scorecards.evaluate({
    results: results.conversations,
    baseline: 'returns-agent-baseline',
    regressionThreshold: 0.05 // fail if any dimension drops more than 0.05 (5 points on a 100-point scale)
  });
 
  return {
    passed: results.redLines.allPassed && scorecardGate.noRegressions,
    regressions: scorecardGate.regressions,
    summary: results.summary
  };
}
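
To block a PR or deploy, the gate result has to become a process exit code. A minimal sketch of that last step follows. The GateOutcome shape is hypothetical and only loosely mirrors the GateResult returned above.

```typescript
// Hypothetical gate shape; an assumption, not the SDK's actual return type.
interface GateOutcome {
  passed: boolean;
  failures: string[];
}

// Conventional exit codes: 0 lets the CI step pass, non-zero fails it.
function gateExitCode(outcome: GateOutcome): number {
  return outcome.passed ? 0 : 1;
}

// One-line summary suitable for the CI log.
function gateReport(phase: string, outcome: GateOutcome): string {
  if (outcome.passed) {
    return `${phase}: passed`;
  }
  return `${phase}: blocked (${outcome.failures.length} failing): ${outcome.failures.join(", ")}`;
}
```

In a Node script, printing gateReport and then setting process.exitCode to the value from gateExitCode after the suite finishes is enough for most CI systems to mark the step as failed.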

Production Regression Monitoring: The Third Layer

Pre-deploy testing catches regressions before they reach users. Production monitoring catches what pre-deploy testing missed, including model drift (when the underlying LLM gets silently updated by the provider), traffic distribution shifts, and edge cases your test suite didn't cover.

Production monitoring works by sampling live conversations, scoring them through your scorecard, and comparing the rolling distribution against your last deploy's baseline. Run this on a cron schedule and alert when any dimension drifts beyond a defined threshold.

production-monitor.ts·typescript

import { Chanl } from '@chanl/sdk';
 
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });
 
// Cron job: sample last 24h of conversations, score them, compare to baseline
async function checkProductionDrift(agentId: string, baselineId: string) {
  const recent = await chanl.transcript.list({
    agentId,
    since: '24h',
    sample: 0.05 // 5% of live conversations
  });
 
  const scored = await chanl.scorecards.evaluate({
    scorecardId: 'returns-agent-scorecard',
    conversations: recent.items,
    baseline: baselineId,
    regressionThreshold: 0.05
  });
 
  for (const regression of scored.regressions) {
    if (regression.dimension === 'compliance' && regression.median < 0.80) {
      await notify(['#cx-ops', 'compliance@company.com'], regression);
    } else if (regression.delta > 0.05) {
      await notify(['#cx-ops'], regression);
    }
  }
}

The three-layer approach covers the full lifecycle. Pre-merge testing catches obvious breaks. Pre-deploy testing catches subtle regressions. Production monitoring catches what both miss.

For teams running shadow mode deployments alongside regression suites, the two approaches reinforce each other. Shadow mode gives you behavioral comparison against live traffic, while regression suites give you repeatable checks against known cases.

Token-Efficient Regression Testing

Running 500 scenarios at 5 iterations each means 2,500 agent calls per deploy. At production conversation costs, that adds up. There's a practical limit to how much you can spend on regression testing.

The AgentAssay paper (arXiv, March 2026) cuts this significantly. The key insight: most regression assertions are binary checks that don't require a full LLM judge. "Did the agent call the right tool?" is a structured check, not a quality judgment. "Did the response contain a prohibited phrase?" is a string match.

By splitting assertions into binary checks (cheap to evaluate) and quality assessments (require LLM judge), you run binary checks on all 2,500 runs and LLM scoring only on the 15% to 20% that need it. This drops regression suite cost by 70% to 85% compared to full LLM-as-judge evaluation on every run.

assertion-routing.ts·typescript

type AssertionType =
  | 'contains'
  | 'not-contains'
  | 'tool-called'
  | 'tool-arg'
  | 'scorecard-dimension'
  | 'llm-judge';
 
function routeAssertion(type: AssertionType): 'binary' | 'llm-judge' {
  const binaryTypes: AssertionType[] = [
    'contains',
    'not-contains',
    'tool-called',
    'tool-arg'
  ];
  return binaryTypes.includes(type) ? 'binary' : 'llm-judge';
}
 
async function evaluateAssertion(
  assertion: { type: AssertionType; [key: string]: unknown },
  response: AgentResponse
): Promise<boolean> {
  if (routeAssertion(assertion.type) === 'binary') {
    return evaluateBinary(assertion, response); // fast, cheap, no API call
  } else {
    return evaluateLlmJudge(assertion, response); // LLM call only when needed
  }
}

Add this routing and your test suite cost drops dramatically while coverage stays constant.
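
The binary side of that routing needs no model call at all. The sketch below is one possible evaluator for the four binary assertion types. The AgentResponse shape and the assertion field names (tool, arg, expected) are assumptions, and case-insensitive matching is a design choice made here rather than something the AgentAssay paper specifies.

```typescript
// Hypothetical response shape, assumed for illustration.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface AgentResponse {
  text: string;
  toolCalls: ToolCall[];
}

type BinaryAssertion =
  | { type: "contains"; value: string }
  | { type: "not-contains"; value: string }
  | { type: "tool-called"; tool: string }
  | { type: "tool-arg"; tool: string; arg: string; expected: unknown };

// Evaluates a binary assertion with plain string and structure checks.
function evaluateBinary(assertion: BinaryAssertion, response: AgentResponse): boolean {
  const text = response.text.toLowerCase();
  switch (assertion.type) {
    case "contains":
      return text.indexOf(assertion.value.toLowerCase()) !== -1;
    case "not-contains":
      return text.indexOf(assertion.value.toLowerCase()) === -1;
    case "tool-called":
      return response.toolCalls.some((call) => call.name === assertion.tool);
    case "tool-arg":
      // Strict equality: suitable for string, number, and boolean arguments only.
      return response.toolCalls.some(
        (call) =>
          call.name === assertion.tool &&
          call.args[assertion.arg] === assertion.expected
      );
  }
}
```

Exact phrase checks are brittle against paraphrase, which is why the red-line example earlier pairs a not-contains assertion with a scorecard assertion.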

Building Your First Regression Suite

If you don't have a regression suite today, here's the minimum viable path to one.

Week 1: Red-lines only (10-15 scenarios)

Start with your top constraints. For each constraint in your agent spec, write one red-line test. Run them manually on your current agent to establish a passing baseline. Add them to your CI pipeline. If any red-line fails on a PR, the PR is blocked. You now have a safety net for your most critical behaviors.

Week 2: Top 20 golden conversations

Find your 20 most common conversation types from your analytics. Write one golden conversation test case for each. Run them, establish a baseline, add them to the fast suite.

Week 3: Edge case library (your first 10)

Pull your last 10 customer complaints or production incidents. Convert each into a test case. Add them to the full suite.

At the end of three weeks, you have a 40-45 scenario suite that covers your red-lines, happy paths, and known failures. That's enough to catch the kind of regression that would have cost you four days and a legal conversation.

Check Chanl's scenario library for pre-built scenario templates by industry and conversation type. Most teams can start from existing templates and customize rather than writing from scratch.

The Ratchet Principle

Here's the one rule that turns a regression suite from a liability into an asset: never let the pass rate decrease.

Every week, review your scorecard baselines. If any dimension improved, raise the threshold. If it held steady, keep it. Never lower a threshold because a new version underperformed.

That's the ratchet. Quality gates can only move up. Each successful deploy sets a new floor that future deploys have to meet or exceed.

Teams that violate the ratchet end up with regression suites that gradually allow worse and worse behavior. The suite becomes a historical artifact, not a living gate. The ratchet keeps it honest.

Set your first baselines from your first successful deploy. From then on, every deploy has to meet or beat the previous one.
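
The ratchet itself is a one-line rule per dimension. The sketch below shows the update step, assuming thresholds and observed scores are stored as plain maps from dimension name to a 0 to 1 score. The names are hypothetical.

```typescript
type Thresholds = Record<string, number>;

// Returns the thresholds for the next deploy: each floor moves up to the
// observed score when the deploy beat it, and never moves down.
function ratchet(current: Thresholds, observed: Thresholds): Thresholds {
  const next: Thresholds = {};
  for (const dimension of Object.keys(current)) {
    const seen = observed[dimension];
    next[dimension] =
      seen !== undefined && seen > current[dimension] ? seen : current[dimension];
  }
  return next;
}
```

Raising a floor to the exact observed score can make the next deploy fail on ordinary run-to-run noise, so some teams may prefer to ratchet to slightly below the observed value. That margin is a judgment call left out of this sketch.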

The fintech team from the opening now runs 312 scenarios on every PR and 780 on every deploy. Their CI pipeline catches regressions in 4 minutes. Their compliance team hasn't opened a retroactive incident in eight months. That's not luck. It's infrastructure they built once and never had to rebuild.


Key Takeaway

Testing edge cases before production deployment can reduce customer complaints by 80% and prevent costly emergency fixes post-launch.


Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
