
Every Conversation Is an Experiment You Didn't Run

Your agent already ran the A/B test you're scoping. Here's how to read the results in your logs with propensity matching, synthetic control, and diff-in-diff.

Lucas Dalamarta, Engineering Lead
April 16, 2026
12 min read
Warm watercolor illustration of an engineer reviewing A/B test scorecards and conversation analytics at a rooftop workspace during golden hour

A team I worked with last month wanted to test one question: does the agent close more tickets when it asks for email at the start of a conversation, or at the end?

They scoped a four-week A/B. Two prompt variants, fifty percent traffic split, statistical power calculation, the works. Ship date was mid-May.

Then someone pulled up the dashboard. The agent had been doing both for eight months. Sometimes early, sometimes late, depending on how the user opened. The LLM was making the call conversation by conversation.

They had already run the experiment. They were about to throw the data away and run it again.

Here's the part nobody says out loud about production AI agents: the six-week A/B ritual is engineering theater for most CX questions. Every conversation is a small experiment, and the default behavior is to discard every one of them and schedule a proper study.

Why do teams throw the data away?

Trusting observational data feels wrong. Engineers pick up that reflex for good reasons. Selection bias is real. Confounders are real. The A/B test is the gold standard for a reason.

The gold standard was designed for a world where you had to scope, deploy, and run an experiment before you could learn anything. Conversations do not work that way. The agent already ran your experiment. The variants already happened. The customers already responded. What you have is a dataset where treatment was not randomly assigned, and that is exactly the setting causal inference methods were built for.

Netflix, Spotify, and Airbnb have teams that do this at scale. Netflix publishes openly on quasi-experiments and observational causal inference for decisions where randomized tests are not feasible. Statsig and Eppo both ship propensity matching and CUPED as first-class features for exactly this reason. GrowthBook went further and wraps DoWhy directly. None of this is exotic anymore. It is just ignored in most AI agent teams, because the first instinct is to reach for the A/B framework.

Three methods cover most of what you actually need. Propensity matching reads two behaviors that are already living side by side in the logs. Synthetic control measures the lift from a change you already shipped. Diff-in-diff handles the case where you rolled a tool out to some conversations and not others, deliberately or accidentally.

Setting up the observational frame

Call the agent's two behaviors A and B. For the email example: A asks early, B asks late. Both already happened in production. For each conversation you have context (channel, persona, intent), the behavior that occurred (A or B), and the outcome (resolved / not resolved, CSAT, escalation).

The naive move is to average outcomes across all A conversations, average across all B, and call the difference the effect. That lies.

Why it lies: whatever made the agent choose A or B is probably correlated with the outcome. Maybe the agent asks early when the intent is clearly transactional, and late when it's ambiguous. The transactional ones would have resolved anyway. You end up attributing intent distribution to prompt behavior and declaring a winner that doesn't exist.
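The bias is easy to reproduce. Here's a toy simulation (synthetic data, names illustrative) where asking early has zero true effect, yet the naive average shows a large lift because transactional intents both resolve more often and get asked early more often. Stratifying on intent, which is a crude form of what matching does, recovers the truth:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Confounder: 1 = clearly transactional intent, 0 = ambiguous.
transactional = rng.binomial(1, 0.5, n)

# The agent asks early far more often on transactional intents.
asked_early = rng.binomial(1, np.where(transactional == 1, 0.8, 0.2))

# True effect of asking early is zero; only intent drives resolution.
resolved = rng.binomial(1, np.where(transactional == 1, 0.9, 0.5))

# Naive comparison: biased by the intent mix.
naive = resolved[asked_early == 1].mean() - resolved[asked_early == 0].mean()

# Stratified comparison: difference within each intent stratum, then average.
strata = [
    resolved[(transactional == s) & (asked_early == 1)].mean()
    - resolved[(transactional == s) & (asked_early == 0)].mean()
    for s in (0, 1)
]
adjusted = float(np.mean(strata))

print(f"naive estimate:    {naive:+.3f}")    # large spurious lift
print(f"adjusted estimate: {adjusted:+.3f}") # near zero, the truth
```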

What you actually need is to compare A and B conversations that were otherwise identical. That is the job of propensity matching.

Propensity matching when both variants are already in the logs

The idea is simple. For every A conversation, find a B conversation that was as similar as possible on everything that might have pushed the agent toward A or B. Match them. Compare the outcomes inside each pair. Average across pairs.

The math that makes this work is the propensity score: given a conversation's covariates, it's the estimated probability that the conversation received treatment A. Match treated and control conversations with similar propensity scores, and inside those matched pairs you have something that behaves like a small randomized experiment on the observed covariates.

Here's a short PyWhy / DoWhy sketch:

propensity_match.py
import pandas as pd
from dowhy import CausalModel
 
# Each row: one conversation.
# treatment = 1 if agent asked email first, else 0
# outcome = 1 if ticket resolved in session
# covariates = features we want to match on
conversations = pd.read_parquet("conversations.parquet")
 
model = CausalModel(
    data=conversations,
    treatment="asked_email_first",
    outcome="resolved",
    common_causes=[
        "intent_transactional",
        "persona_segment",
        "memory_has_prior_session",
        "agent_version",
        "hour_of_day",
        "channel",
    ],
)
 
identified = model.identify_effect()
estimate = model.estimate_effect(
    identified,
    method_name="backdoor.propensity_score_matching",
)
 
print(f"Estimated effect of asking early: {estimate.value:.3f}")

The common_causes list is the whole game. Match on intent, persona, memory state, agent version, hour of day, and channel and you've ruled out the obvious confounders. Forget to include intent when the agent picks A mostly on transactional queries, and your estimate is junk. The code is the easy part. Defending the covariate list in a review is the job.

Once you have the estimate, pressure test it. DoWhy's refute_estimate API adds random noise, swaps treatments, and drops subsets. If the estimate survives those refutations, you have a defensible answer. If it collapses, one of your assumptions is wrong and it's back to the drawing board.

For the email team from the opening scene: when we ran this on their logs, asking early lifted resolution rate by about 3 points, with the effect holding up across two refutation runs. An afternoon of analysis replaced six weeks of A/B scoping. The PM had already blocked a sprint for the experiment. They unblocked it that afternoon.

Illustration: an operations dashboard showing a deploy gate, with score and latency checks passing and the release blocked on error rate.

What if you already shipped the change? Synthetic control.

Matching works when both variants sit side by side in your logs. It doesn't help when a change shipped to all traffic at once. You pushed a new system prompt last Tuesday. You want to know the effect. Before-vs-after averaging is a trap, because your traffic mix shifts constantly. A holiday weekend lands in "before", a partner launch lands in "after", and now you cannot tell prompt impact from seasonality.

Synthetic control fixes this. You build a weighted combination of pre-launch cohorts whose outcome metric tracks the launch cohort's metric week over week during the pre-period. That weighted combination is the counterfactual: what the metric would have looked like if you hadn't shipped. The difference between actual and counterfactual is the effect.

Alberto Abadie invented the method for policy studies (his original 2003 paper looked at the economic impact of terrorism in the Basque Country, of all things). Product teams picked it up because the setting generalizes cleanly: one unit got treated, many units did not, and you want the counterfactual trajectory of the treated unit.

For a prompt change, the "units" are time-bucketed cohorts of conversations, grouped by intent and persona. The treated unit is the post-launch cohort. The donor pool is every prior week's cohort. The method solves for weights that minimize the gap between actual and synthetic on pre-launch outcome data, then applies those weights to the post-launch window.

synthetic_control.py
import numpy as np
from scipy.optimize import minimize
 
# outcomes_pre:  (n_donor_weeks, n_timepoints) pre-period metric for each donor week
# target_pre:    (n_timepoints,) pre-period metric for the launch cohort
# outcomes_post: (n_donor_weeks,) post-launch-window metric for each donor week
# target_post:   scalar, actual post-launch metric for the launch cohort
 
def loss(w, donors, target):
    return np.sum((donors.T @ w - target) ** 2)
 
constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1}]
bounds = [(0, 1)] * outcomes_pre.shape[0]
 
result = minimize(
    loss,
    x0=np.ones(outcomes_pre.shape[0]) / outcomes_pre.shape[0],
    args=(outcomes_pre, target_pre),
    bounds=bounds,
    constraints=constraints,
)
 
synthetic_post = outcomes_post @ result.x
effect = target_post - synthetic_post
print(f"Causal effect of prompt change: {effect:.3f}")

That sketch is the core math, not the whole method. The real workflow adds placebo tests (apply it to untreated units and check you get null effects), permutation-based inference, and cross-validation on pre-period fit. Don't ship a decision off the raw minimize output. PyWhy's SyntheticControl estimator handles the scaffolding cleanly. For small samples and a careful audience, the R Synth package is still the reference implementation, and Microsoft's EconML is worth a look if you want flexible estimators over the same workflow.
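A placebo test is simple to run by hand on top of the scipy sketch above. Treat each untreated donor cohort as if it were the launch cohort, compute its "effect", and collect the distribution. A real launch effect is only credible if it falls outside that spread. A minimal version on synthetic data (shapes and numbers invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize

def synth_effect(donors_pre, target_pre, donors_post, target_post):
    """Fit simplex weights on the pre-period, return the post-period gap."""
    k = donors_pre.shape[0]
    res = minimize(
        lambda w: np.sum((donors_pre.T @ w - target_pre) ** 2),
        x0=np.full(k, 1.0 / k),
        bounds=[(0.0, 1.0)] * k,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
    )
    return float(target_post - donors_post @ res.x)

rng = np.random.default_rng(1)
pre = rng.normal(0.70, 0.02, size=(10, 8))  # 10 untreated weekly cohorts, 8 pre-period points
post = rng.normal(0.70, 0.02, size=10)      # their post-window metric

# Placebo test: pretend each untreated cohort was the launch cohort.
placebo_effects = []
for i in range(pre.shape[0]):
    mask = np.arange(pre.shape[0]) != i
    placebo_effects.append(
        synth_effect(pre[mask], pre[i], post[mask], post[i])
    )

print(f"placebo spread: {min(placebo_effects):+.3f} to {max(placebo_effects):+.3f}")
```

Since no placebo cohort was actually treated, every effect here should hug zero; a launch effect that sits well outside this range is the signal you can defend.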

Synthetic control cannot tell you why a change worked. It tells you whether the metric after launch is meaningfully different from the counterfactual you built. For a product team deciding whether to keep a prompt change, that is usually the question you actually need answered.

Diff-in-diff when a tool only rolled out to some conversations

A third pattern shows up constantly with agent tools. You added a new tool, say lookup_order, for a subset of conversations. Maybe only high-intent ones. Maybe only authenticated users. Maybe just Tier-2 skills. You want to know whether it helped. The tool was not randomly assigned.

Difference-in-differences compares the change in outcome before and after the rollout for conversations that got the tool, against the change in outcome before and after for conversations that did not. The parallel trends assumption is the catch: without the tool, both groups would have moved in parallel. If that holds, the difference of differences is the causal effect.

diff_in_diff.py
import statsmodels.formula.api as smf
 
# One row per conversation with:
#   got_tool: 1 if lookup_order was available in that skill at conversation time
#   post_launch: 1 if after rollout date
#   resolved: outcome
 
did = smf.ols(
    "resolved ~ got_tool + post_launch + got_tool:post_launch + C(intent) + C(hour)",
    data=conversations,
).fit()
 
print(did.summary())
# The coefficient on got_tool:post_launch is the causal effect.

The interaction term is your causal estimate. The fixed effects for intent and hour absorb within-group drift. Before trusting the number, plot the pre-launch outcome trajectory for both groups and look for parallel lines. If they visibly diverge, parallel trends is violated and you need to either reweight or switch to synthetic control.
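One quick numeric version of that eyeball check: fit a slope to each group's pre-launch weekly outcome and compare them. This sketch uses invented rates where both groups genuinely drift in parallel:

```python
import numpy as np

rng = np.random.default_rng(3)
weeks = np.arange(8.0)  # eight pre-launch weeks

# Hypothetical weekly resolution rates; both groups drift upward slowly.
treated = 0.60 + 0.005 * weeks + rng.normal(0, 0.01, 8)
control = 0.50 + 0.005 * weeks + rng.normal(0, 0.01, 8)

# Close slopes support parallel trends; a large gap relative to
# week-to-week noise means the assumption is suspect.
slope_treated = np.polyfit(weeks, treated, 1)[0]
slope_control = np.polyfit(weeks, control, 1)[0]
print(f"treated slope {slope_treated:+.4f}, control slope {slope_control:+.4f}")
```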

Remember the agent readiness work we wrote about in the unit, A/B, and live testing piece? Diff-in-diff is what turns the "live test" phase from a vibe into a number.

How the pieces connect

Three methods, three questions, one pipeline. Before picking code, pick the method that matches the shape of the question. Got two behaviors running side by side? Match them. Shipped a change and want lift? Synthetic control. Rolled a tool out to a subset? Diff-in-diff. Pick wrong and you'll answer the wrong question cleanly.

Propensity matching. Use when: two variants coexist in the logs (A vs B, early vs late, tool A vs tool B). Key assumption: no hidden confounders outside the matched covariates. Breaks when: an unobserved driver routes conversations to one variant.

Synthetic control. Use when: one change shipped at a known date and the pre-period is clean. Key assumption: the donor pool can reconstruct the pre-launch trajectory. Breaks when: the launch window is short or the traffic mix changed with the launch.

Diff-in-diff. Use when: a tool or feature rolled out to a subset and both groups are observable before and after. Key assumption: parallel trends between treated and control groups. Breaks when: pre-launch trajectories visibly diverge.
Observational causal pipeline from conversation logs to decisions

The pipeline is: structured signals in, chosen method based on the question, causal estimate plus refutations out. Every step above the decision depends on having the right fields in your logs.

When does observational analysis break?

Four failure modes kill causal estimates: hidden confounders, small samples, gameable outcomes, and survivorship bias. Each breaks a different method, and each shows up often enough in real agent data that you should assume at least one applies on any given run. Here's how to spot them before they spot you.

Hidden confounders sink propensity matching. Can't observe it, can't match on it. If the agent was upgraded to a new model mid-period and that rollout correlated with anything in your covariates, you have a problem no amount of matching fixes. Rule of thumb: list every plausible reason the agent might have chosen A over B. If any of them isn't in your logs, stop and fix the logging before you run anything.

Small samples ruin synthetic control. You need enough pre-period weeks and enough donor cohorts that the weighted combination actually tracks the target. Launched two weeks ago? Do not run synthetic control yet. Run a simpler method and wait.

Outcome gaming is the worst failure mode, and it's the one that catches the most teams. If "resolved" is determined by the agent itself marking the ticket closed, and the new prompt variant is more aggressive about marking things closed, your outcome metric is contaminated before you even start. The only fix is an outcome measured outside the agent's control: a downstream system, a human grader, a customer survey, a delayed signal. Anything the agent can't reach.

Survivorship bias is the subtle one. Users who had bad experiences may have churned before you got to measure them. Your dataset is over-weighted toward users who survived. Panel data across a longer window helps, but acknowledge the bias in any decision you make on the numbers.

The signals that make this work

None of this is possible without structured conversation data. The fields I argued for in the Signal schema post are the exact fields causal methods need: intent, outcome, memory state, agent version, tool calls, timestamps, channel, persona segment. Miss any one of these and you lose the ability to control for it.
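As a concrete sketch of what "structured" means here, the fields above map onto one record per conversation. The class name and exact fields are illustrative, not the Signal schema itself:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical record shape; the field names mirror the list above.
@dataclass
class ConversationSignal:
    conversation_id: str
    timestamp: datetime
    channel: str                      # e.g. "chat", "email", "voice"
    persona_segment: str
    intent: str
    agent_version: str
    memory_has_prior_session: bool
    tools_invoked: list[str] = field(default_factory=list)
    resolved: Optional[bool] = None   # outcome, ideally set outside the agent
```

Every covariate the matching code controls for, every donor-pool grouping, and every DiD fixed effect comes straight out of fields like these.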

This is also why Signal extraction as a first-class pipeline matters more than most teams assume. Conversations are not natively structured. Turning a transcript into (intent, outcome, memory_used, tools_invoked) is the step that converts raw logs into causal-analysis-ready data. Pair it with Scorecards for outcome definitions that are not gameable by the agent itself, and with Monitoring to catch the distribution shifts that invalidate your causal assumptions between analysis runs.

Scenarios has a role here too, but not the one you might expect. Scenarios are best used to validate a causal finding before you roll back. If propensity matching says variant B is worse, you can replay variant B against a persona library with known-good outcomes and confirm the direction. That is the production workflow we recommend: causal analysis on the live data, scenario validation before acting on it. This connects directly to the retention correlation discussion in the companion piece, where the same signals feed longer-horizon questions.

Stop waiting for A/B tests

Formal A/B testing is not dead. For high-stakes changes with unknown user impact, randomized control is still the right move. Before a major pricing prompt change, before a new escalation policy, before anything where you can't afford to be wrong, scope the A/B.

But most of the questions you want to answer aren't like that. "Is the new persona handling complaints better than the old one?" "Did our tool rollout last month help resolution?" "Does asking for email first actually matter?" Those answers are in your logs right now. Causal methods on observational data give them back in hours, not six weeks.

Remember the email team from the opening scene. They saved 28 business days. The right causal method read the experiment they had already run and handed them a defensible answer before they could have finished writing the A/B spec. That is not an edge case. That is the default state for any team running an agent in production with decent conversation logging.

Your agent is running experiments every minute of every day. Stop throwing them away.

Every conversation is a small experiment. Log it like one.

Chanl turns raw transcripts into structured signals with the fields causal methods need: intent, outcome, memory state, agent version, tool calls. Point your agent at Chanl and start reading the experiments you're already running.
