What is shadow mode for AI agents?

Shadow mode runs a candidate agent version in parallel with your production version on the same live traffic. The candidate receives identical requests but its responses are never shown to users. You compare the two versions on metrics like response quality, tool accuracy, and latency before deciding whether to promote the candidate to production.

How is shadow mode different from A/B testing for AI agents?

In A/B testing, both versions serve real users and you measure downstream impact like CSAT, conversion, and resolution rate. In shadow mode, only the production version serves users -- the shadow version is evaluated entirely internally. No users are exposed to potentially degraded behavior during the evaluation period, making shadow mode the more conservative option.

When should I use shadow mode vs. canary deployment?

Use shadow mode when you're unsure whether a change is safe for users at all. It's the most conservative option. Switch to canary once shadow mode shows the candidate is at least as good as production, and you want to measure real-user impact on a small traffic slice (typically 5-10%) before committing to a full rollout.

What metrics should I monitor during shadow mode?

The most useful metrics are: automated quality score via scorecard, tool call accuracy rate, hallucination rate, p95 latency, and conversation completion rate. You want to see the candidate matching or beating production on all five before promotion. Pay particular attention to cases where production and candidate call different tools on the same input -- those are your highest-value divergences to review manually.

How long should shadow mode run before I promote a candidate?

Run shadow mode for at least 1,000 request comparisons or 5 business days, whichever comes first. This gives statistical coverage across different query types, edge cases, and time-of-day variations. High-volume contact centers may reach 1,000 comparisons in a few hours; lower-volume teams may need a full week. Don't promote based on fewer than 200 comparisons.

What happens when the shadow agent fails or times out?

Shadow failures should never affect the user experience -- the request is already handled by the production agent. Log the shadow failure as a separate metric and investigate patterns. A high shadow timeout rate often means your candidate has a new tool dependency that's misconfigured or slow. Fix it before promotion rather than promoting past it.

Can I use shadow mode for multi-agent systems?

Yes, but shadow at each agent entry point separately, not just the top-level orchestrator. Shadowing only the orchestrator misses regressions in sub-agents. The best setup shadows each agent in the chain independently so you can pinpoint exactly which component is regressing, rather than seeing a degraded end-to-end metric with no clear source.

What are good rollback gate thresholds for a CX agent canary?

Common starting thresholds: error rate above 5% over the last 500 requests, p95 latency above 2 seconds, automated quality score below 0.65, or tool call failure rate above 8%. Define these gates before you start the rollout, not while you're watching numbers move. Adjust based on your baseline after the first two canary cycles.

Shadow Mode: Deploy AI Agent Updates Without Risk

Your team spent three weeks improving the CX agent's refund handling. Better intent detection, a new tool for looking up order history, a tightened system prompt. You deploy on a Friday afternoon. By Saturday morning, CSAT is down 8 points and the first support escalation comes in from a customer who got the wrong refund amount.

You roll back. It takes four hours because the previous Docker image has the wrong tag in your registry. By Monday you've lost a weekend's worth of customer trust, and nobody on the team wants to ship the next improvement.

The problem wasn't code quality. The problem was deployment strategy. You went from dev to 100% of production traffic in one command. There was no safety net between "this looks good in staging" and "every customer in North America is hitting it."

Shadow mode is that safety net.

What Shadow Mode Actually Does

Shadow mode runs your candidate agent version in parallel with production, using the same real traffic, without ever showing the candidate's responses to users. Every incoming request is duplicated: the production agent handles it normally, and the shadow agent gets an identical copy. Only the production agent's response reaches the customer.

You collect both outputs and compare them. Where do they agree? Where do they diverge? Does the candidate call different tools? Take longer to respond? Produce a different answer to the same question?

This lets you catch regressions before users feel them -- not in staging, where the traffic is synthetic and the edge cases are invented, but in production, with real customer queries, real tool state, and real timing. It's the deployment strategy most teams wish they'd used after their first bad rollout.

The Four Deployment Strategies, and When Each Fits

Four strategies cover most agent update scenarios. Each sits at a different point on the risk-vs-feedback tradeoff:

Strategy	User exposure	Risk level	Best for
Direct deploy	100% immediately	High	Hotfixes, tiny config changes
Shadow mode	0% (internal only)	Very low	Any behavior-affecting change
Canary rollout	5-10% of users	Medium	After shadow mode confirms safety
A/B test	50%+ split	Medium	Measuring real-user impact at scale

The progression for a meaningful agent change looks like this: shadow mode first to validate safety, canary second to measure real-user impact at low exposure, then full rollout. You can skip to canary for minor changes with high test coverage. You should almost never go directly to 100%.

The distinction between shadow and A/B matters more than it might seem. In an A/B test, users in the "B" group experience the new behavior directly, so a bad candidate harms real customers. Shadow mode is purely internal -- no user sees the candidate until you decide it's ready. That means you can run shadow mode on a half-finished candidate, pull data, revise the candidate, and run shadow again, all without any customer exposure.

Setting Up Shadow Mode: The Architecture

The core requirement is request duplication with independent response paths. The minimal architecture has four components: a duplicator layer, the production agent, the shadow agent, and a comparison store.

Diagram: in shadow mode, the production response reaches the customer while the shadow response goes to your metrics store.

The duplicator is a thin middleware layer in your API gateway or agent proxy. It awaits the production agent's response, because the user is waiting on it, and fires an identical copy of the request to the shadow agent without awaiting the result.

shadow-proxy.ts·typescript

async function handleWithShadow(request: AgentRequest): Promise<AgentResponse> {
  // Production path: always await -- user waits for this
  const productionResponse = await productionAgent.handle(request);
 
  // Shadow path: fire and forget -- never blocks the user
  shadowAgent
    .handle(request)
    .then((shadowResponse) => {
      logComparison({
        requestId: request.id,
        production: productionResponse,
        shadow: shadowResponse,
        timestamp: new Date().toISOString(),
      });
    })
    .catch((err) => {
      logShadowFailure({ requestId: request.id, error: err.message });
    });
 
  return productionResponse;
}

The shadow path is deliberately fire-and-forget. If the shadow agent times out or crashes, the production response is already on its way to the customer. You log the failure separately, but it never touches the user experience.

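One thing to get right: the shadow agent should run with the same tool permissions, the same memory state, and the same system context as the production agent. If the shadow can't access the same customer history that production can, you're not comparing like for like, and your divergence metrics will be full of noise. The exception is tools with side effects: if the candidate can issue refunds or send emails, stub those calls or run them in dry-run mode so the shadow never acts on a real customer account a second time.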

What to Measure During Shadow Mode

The goal is to answer one question: is the candidate at least as good as production across every dimension that matters to customers? You need measurable metrics to answer it systematically.

Automated quality score. Run both responses through the same scorecard criteria. Look for regressions on factual accuracy, policy compliance, and task completion. With Chanl's scorecards, you can route both production and shadow responses through identical evaluation criteria and get side-by-side scores automatically, rather than building a separate scoring pipeline.

Tool call accuracy. On the same input, did both agents call the same tools with similar parameters? Divergence isn't automatically bad -- the candidate might be smarter -- but it flags specific inputs for manual review. If production calls lookup_order_history and the candidate calls get_customer_profile on the same request, you want a human to verify whether that's an improvement or a regression.
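A rough sketch of how tool call divergence could be classified per request. The `ToolCall` shape, the `make_call` helper, and the three labels are assumptions for illustration, not part of any particular agent framework; adapt them to however your agents record tool calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # sorted (key, value) pairs, so two calls compare by value


def make_call(name: str, **args) -> ToolCall:
    """Build a ToolCall with arguments in a canonical order."""
    return ToolCall(name=name, args=tuple(sorted(args.items())))


def classify_divergence(production: list, shadow: list) -> str:
    """Return 'match', 'args_differ', or 'tools_differ' for one request."""
    if [c.name for c in production] != [c.name for c in shadow]:
        return "tools_differ"  # different tools or a different order
    if production != shadow:
        return "args_differ"  # same tools, different parameters
    return "match"


prod_calls = [make_call("lookup_order_history", customer_id="c-42")]
cand_calls = [make_call("get_customer_profile", customer_id="c-42")]
label = classify_divergence(prod_calls, cand_calls)  # "tools_differ"
```

Rows labelled `tools_differ` are the ones worth sending to a human first; `args_differ` rows are usually lower priority.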

Hallucination rate. Compare factual claims in both responses against your knowledge base. A candidate that sounds more fluent but is less accurate is a regression, even if the quality score looks similar. This is the metric most teams forget to measure during shadow mode.

p95 latency. Did the candidate get slower? A 40% latency increase on the 95th percentile will hurt real-time CX interactions. The median might look fine while the tail is painful -- always check the tail.
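To make the tail-versus-median point concrete, here is a small sketch using a nearest-rank percentile. The function names and the sample latencies are made up for illustration; a metrics backend would normally compute percentiles for you.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def relative_change(prod_ms, shadow_ms, pct):
    """Relative change of shadow vs. production at one percentile (0.40 means 40% slower)."""
    p = percentile(prod_ms, pct)
    return (percentile(shadow_ms, pct) - p) / p


# Hypothetical latencies in milliseconds: identical medians, different tails.
prod_ms = [100] * 90 + [400] * 10
shadow_ms = [100] * 90 + [560] * 10

median_change = relative_change(prod_ms, shadow_ms, 50)  # no change at the median
tail_change = relative_change(prod_ms, shadow_ms, 95)    # 40% slower at p95
```

In this made-up data the median is identical for both agents, so a dashboard that only shows averages or medians would hide the regression entirely.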

Conversation completion rate. For multi-turn scenarios, does the candidate reach a resolution state at the same rate as production? A candidate that scores better on single-turn quality but worse on completing full conversation arcs is a net regression.

Reading the Comparison Dashboard

After a few days of shadow mode, you want a comparison view with these columns: request type, production score, shadow score, delta, and flagged for review.

Aggregate by intent or topic cluster to see whether the candidate is uniformly better, uniformly worse, or mixed. Mixed results are common and genuinely useful: they tell you exactly which parts of the new version are ready and which aren't. "Better on account questions, worse on billing questions" is actionable. "Generally seems good" is not.
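The per-intent aggregation can be sketched in a few lines. The row shape `(intent, production_score, shadow_score)` and the scores below are assumptions for illustration; in practice the rows would come from your comparison store.

```python
from collections import defaultdict
from statistics import mean


def delta_by_intent(rows):
    """Map each intent to the mean of (shadow score - production score)."""
    deltas = defaultdict(list)
    for intent, prod_score, shadow_score in rows:
        deltas[intent].append(shadow_score - prod_score)
    return {intent: mean(values) for intent, values in deltas.items()}


# Hypothetical comparison rows.
rows = [
    ("account", 0.80, 0.90),
    ("account", 0.70, 0.80),
    ("billing", 0.90, 0.80),
]
summary = delta_by_intent(rows)
# A positive delta means the candidate scored higher on that intent.
```

With these rows the candidate comes out ahead on account questions and behind on billing questions, which is the kind of mixed result that tells you where to focus the next revision.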

A useful threshold: if the candidate is within 3% of production on every metric, you're clear to promote. If any metric regresses more than 5%, investigate before moving forward. The flagged rows (where production and candidate diverged significantly on the same input) are your manual review queue -- work through them before making the promotion call.
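One way to encode that threshold as a decision function is sketched below. Several things here are assumptions: the metric names, the direction table, treating improvements of any size as passing, and the `manual_review` label for the band between 3% and 5%, which the rule above leaves open. It also assumes nonzero production baselines.

```python
# True where a higher value is better; False where lower is better (assumed metric names).
HIGHER_IS_BETTER = {
    "quality_score": True,
    "tool_accuracy": True,
    "completion_rate": True,
    "hallucination_rate": False,
    "p95_latency_ms": False,
}


def regression(metric, prod_value, cand_value):
    """Relative regression of the candidate; positive means worse than production."""
    change = (cand_value - prod_value) / prod_value
    return -change if HIGHER_IS_BETTER[metric] else change


def promotion_decision(prod, cand, clear=0.03, block=0.05):
    """Return 'promote', 'manual_review', or 'investigate' from the worst regression."""
    worst = max(regression(m, prod[m], cand[m]) for m in prod)
    if worst > block:
        return "investigate"
    if worst <= clear:
        return "promote"
    return "manual_review"


prod = {"quality_score": 0.80, "p95_latency_ms": 1000}
decision = promotion_decision(prod, {"quality_score": 0.82, "p95_latency_ms": 1080})
# Quality improved, but p95 latency regressed 8%, so the decision is "investigate".
```

The decision keys off the single worst metric on purpose: a candidate that is better on four metrics and clearly worse on one is still a candidate with a regression.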

This comparison work connects to what we covered in catching score drift before it ships. Shadow mode is that drift-detection approach applied prospectively, against a candidate rather than your live version.

Canary Rollout: From Shadow to Full Traffic

Once shadow mode confirms the candidate is safe, you move to canary. Real users start seeing the new version, but only a small fraction.

canary-router.ts·typescript

const CANARY_PERCENTAGE = 5; // start at 5%
 
function routeRequest(request: AgentRequest): "production" | "candidate" {
  // Stable hashing: same user always hits the same version
  const hash = murmur3Hash(request.userId) % 100;
  return hash < CANARY_PERCENTAGE ? "candidate" : "production";
}

The key is consistent routing by user ID, not random per-request routing. If a user starts a conversation with the candidate agent, they should finish it with the candidate. Mixing versions mid-conversation is a fast path to incoherent responses -- production set the context for the first half, candidate tries to continue it without that context.
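If your router is in Python rather than TypeScript, the same idea can be sketched with the standard library. This version swaps in SHA-256 from `hashlib` for the murmur3 hash used above, purely because it ships with Python; the function names are illustrative. Python's built-in `hash()` is not suitable here, because string hashes are randomized per process by default.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percentage: int = 5) -> str:
    """Send the lowest `canary_percentage` buckets to the candidate."""
    return "candidate" if canary_bucket(user_id) < canary_percentage else "production"
```

Because a user's bucket never changes, raising the percentage from 5 to 10 only adds users to the candidate group. Nobody who was already on the candidate gets moved back to production mid-rollout.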

The standard progression:

5% for 24 hours. Watch for hard failures, latency spikes, and error rate jumps.
10% for 48 hours. Begin measuring user-facing outcomes: CSAT signals, escalation rate, resolution time.
25% for 72 hours. Compare cohort metrics between the canary and production groups.
50% then 100% once all gates pass at 25%.

Each stage should have explicit promotion criteria defined before the rollout starts, not after you're watching the numbers move. "Canary p95 latency must be within 15% of production" and "CSAT score on the canary cohort must not be more than 2 points below production" are useful gates. "It looks okay" is not.
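Those two example gates can be written down as a check that runs before each stage advances. This is a sketch under assumptions: the `CohortStats` fields, a 0-100 CSAT scale, and reading "within 15%" as "no more than 15% slower" are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class CohortStats:
    p95_latency_ms: float
    csat: float  # assumed 0-100 scale


def stage_gate_failures(prod, canary, max_latency_increase=0.15, max_csat_drop=2.0):
    """Return the list of failed promotion criteria; an empty list means the stage may advance."""
    failures = []
    if canary.p95_latency_ms > prod.p95_latency_ms * (1 + max_latency_increase):
        failures.append("p95 latency more than 15% above production")
    if prod.csat - canary.csat > max_csat_drop:
        failures.append("CSAT more than 2 points below production")
    return failures


production = CohortStats(p95_latency_ms=1200, csat=86)
healthy = stage_gate_failures(production, CohortStats(p95_latency_ms=1300, csat=85))
unhealthy = stage_gate_failures(production, CohortStats(p95_latency_ms=1500, csat=83))
# healthy is empty; unhealthy lists both failed criteria
```

Returning the failed criteria rather than a bare boolean gives whoever is on call the reason the stage was held, not just the fact that it was.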

Automated Rollback Gates

Manual review at each canary stage works for low-volume deployments. At scale, you need automated rollback so that a regression at 3am gets caught before it affects more users.

A rollback gate monitors a rolling metric window and triggers automatic traffic reversion if any condition is met:

rollback_gate.py·python

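import asyncio

# Each gate fires when its metric crosses the threshold in the "bad" direction.
# quality_score is the one metric here where a LOWER value is worse.
ROLLBACK_GATES = [
    {"metric": "error_rate", "threshold": 0.05, "window_requests": 500, "bad_when": "above"},
    {"metric": "p95_latency_ms", "threshold": 2000, "window_requests": 200, "bad_when": "above"},
    {"metric": "quality_score", "threshold": 0.65, "window_requests": 300, "bad_when": "below"},
    {"metric": "tool_call_failure_rate", "threshold": 0.08, "window_requests": 200, "bad_when": "above"},
]

def gate_breached(gate: dict, value: float) -> bool:
    if gate["bad_when"] == "below":
        return value < gate["threshold"]
    return value > gate["threshold"]

# get_metric_window, trigger_rollback, notify_on_call, and canary_is_active
# are assumed to be provided by your own deployment tooling.
async def check_rollback_gates(deployment_id: str) -> bool:
    """Return True if every gate passes; roll back and return False on the first breach."""
    for gate in ROLLBACK_GATES:
        value = await get_metric_window(
            deployment_id,
            gate["metric"],
            gate["window_requests"]
        )
        if gate_breached(gate, value):
            await trigger_rollback(
                deployment_id,
                reason=f"{gate['metric']} = {value:.3f} is {gate['bad_when']} {gate['threshold']}"
            )
            await notify_on_call(deployment_id, gate)
            return False
    return True

# Run every 5 minutes during canary; stop once a rollback has been triggered
async def canary_monitor_loop(deployment_id: str):
    while canary_is_active(deployment_id):
        if not await check_rollback_gates(deployment_id):
            break
        await asyncio.sleep(300)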

The gate checks run on a schedule and compare the canary cohort's rolling metrics against fixed thresholds. If any gate fires, traffic shifts back to production and an alert goes out. The candidate stays available but stops receiving traffic until the team investigates.
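The rolling metric window behind a gate can be as simple as a bounded queue. The class below is a sketch of one possible in-process implementation for rate-style metrics such as error rate; the name `RollingRate` and its methods are assumptions, and a production setup would more likely read the window from a metrics store.

```python
from collections import deque


class RollingRate:
    """Fraction of the last `window` requests that were flagged (for example, errored)."""

    def __init__(self, window: int):
        self.events = deque(maxlen=window)  # old entries fall off automatically

    def record(self, flagged: bool) -> None:
        self.events.append(flagged)

    def is_full(self) -> bool:
        return len(self.events) == self.events.maxlen

    def value(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0


error_rate = RollingRate(window=500)
for i in range(500):
    error_rate.record(i < 30)  # 30 errors in the last 500 requests

should_roll_back = error_rate.is_full() and error_rate.value() > 0.05
```

Checking `is_full()` before acting is a deliberate choice in this sketch: it stops a gate from firing on the first few requests of a canary, where a single error would look like a very high error rate.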

Chanl's monitoring and analytics API both expose the metrics you need to feed these gates. Quality scores, tool failure rates, and conversation completion rates are available as time-series data you can pull into your rollback gate checker rather than building a separate measurement pipeline.

The Deployment Checklist

Pull the pieces together and you get a repeatable playbook:

text

Pre-deployment:
  Run full scenario test suite on candidate (/features/scenarios)
  Confirm candidate passes quality gates in staging
  Enable shadow mode proxy with request duplication
 
Shadow phase (minimum 1,000 comparisons or 5 days):
  Monitor production vs. candidate scores in comparison dashboard
  Review flagged divergence cases manually (> 5% delta)
  Confirm all metrics within acceptance threshold before advancing
 
Canary phase (5% / 10% / 25% traffic):
  Enable consistent user-ID canary routing
  Activate automated rollback gates
  Monitor CSAT, escalation rate, resolution time on canary cohort
  Compare against production cohort at each stage
 
Promotion:
  All gate conditions met at 25% canary
  Route 50% then 100% of traffic to new version
  Keep shadow mode running for 24h post-promotion as a sanity check

The extra steps add roughly a week to a deployment cycle for significant changes. That's a small price compared to a Friday rollout that tanks your weekend CSAT and takes four hours to reverse.

Why the Deployment Gap Exists

Shadow mode has a reputation for being complex infrastructure work. And the full setup described here -- request duplication, comparison logging, automated gates -- does take real engineering effort to build well.

But the minimal version is not complex. A proxy that duplicates requests and logs both responses to a database table is about 50 lines of code. A query that counts diverging responses by intent is a SQL aggregation with a GROUP BY. You can have a working shadow mode in half a day.
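As a rough illustration of how small the minimal version can be, here is a sketch of the comparison table and the GROUP BY query using Python's built-in `sqlite3` module. The table layout and function names are assumptions, and exact string inequality is a deliberately crude stand-in for divergence: two agent responses rarely match word for word, so a real setup would compare scores or tool calls instead.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the comparison store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS comparisons (
               request_id TEXT PRIMARY KEY,
               intent TEXT NOT NULL,
               production_response TEXT NOT NULL,
               shadow_response TEXT NOT NULL
           )"""
    )
    return conn


def log_comparison(conn, request_id, intent, production, shadow):
    """Store one production/shadow response pair."""
    conn.execute(
        "INSERT INTO comparisons VALUES (?, ?, ?, ?)",
        (request_id, intent, production, shadow),
    )
    conn.commit()


def divergence_by_intent(conn):
    """Return (intent, total, diverged) rows, one per intent."""
    return conn.execute(
        """SELECT intent,
                  COUNT(*) AS total,
                  SUM(production_response <> shadow_response) AS diverged
           FROM comparisons
           GROUP BY intent
           ORDER BY intent"""
    ).fetchall()


store = open_store()
log_comparison(store, "r1", "billing", "Refund is $20", "Refund is $25")
log_comparison(store, "r2", "billing", "Invoice sent", "Invoice sent")
log_comparison(store, "r3", "account", "Password reset", "Password reset")
report = divergence_by_intent(store)
```

With the three sample rows, the report shows one diverging response out of two for billing and none for account.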

The teams that skip it usually frame it as "we'll add it after launch." The problem is that after launch there's always something more urgent. Then the bad rollout happens, and suddenly the shadow mode project gets an emergency slot on the sprint board.

Build it before the first bad deployment. It's easier to justify the infrastructure investment while you're still building than after you've just rolled back at 2am.

The patterns here complement what we covered in silent agent degradation: shadow mode catches regressions before they ship, continuous monitoring catches drift after they ship. Together, they give you the full deployment safety stack: a gate before production and a sensor in production.

This is what Monitor means in practice for your Build, Connect, Monitor infrastructure -- not just dashboards after the fact, but measurement woven into the deployment process itself.

Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
