
Every Conversation Is an Experiment You Didn't Run

Your agent already ran the A/B test you're scoping. Here's how to read the results in your logs with propensity matching, synthetic control, and diff-in-diff.

Lucas Dalamarta, Engineering Lead
April 16, 2026
12 min read
Warm watercolor illustration of an engineer reviewing A/B test scorecards and conversation analytics at a rooftop workspace during golden hour

A team I worked with last month wanted to test one question: does the agent close more tickets when it asks for email at the start of a conversation, or at the end?

They scoped a four-week A/B. Two prompt variants, fifty percent traffic split, statistical power calculation, the works. Ship date was mid-May.

Then someone pulled up the dashboard. The agent had been doing both for eight months. Sometimes early, sometimes late, depending on how the user opened. The LLM was making the call conversation by conversation.

They had already run the experiment. They were about to throw the data away and run it again.

Here's the part nobody says out loud about production AI agents: the six-week A/B ritual is engineering theater for most CX questions. Every conversation is a small experiment, and the default behavior is to discard every one of them and schedule a proper study.

Why do teams throw the data away?

Trusting observational data feels wrong. Engineers pick up that reflex for good reasons. Selection bias is real. Confounders are real. The A/B test is the gold standard for a reason.

The gold standard was designed for a world where you had to scope, deploy, and run an experiment before you could learn anything. Conversations do not work that way. The agent already ran your experiment. The variants already happened. The customers already responded. What you have is a dataset where treatment was not randomly assigned, and that is exactly the setting causal inference methods were built for.

Netflix, Spotify, and Airbnb have teams that do this at scale. Netflix publishes openly on quasi-experiments and observational causal inference for decisions where randomized tests are not feasible. Statsig and Eppo both ship propensity matching and CUPED as first-class features for exactly this reason. GrowthBook went further and wraps DoWhy directly. None of this is exotic anymore. It is just ignored in most AI agent teams, because the first instinct is to reach for the A/B framework.

Three methods cover most of what you actually need. Propensity matching reads two behaviors that are already living side by side in the logs. Synthetic control measures the lift from a change you already shipped. Diff-in-diff handles the case where you rolled a tool out to some conversations and not others, deliberately or accidentally.

Setting up the observational frame

Call the agent's two behaviors A and B. For the email example: A asks early, B asks late. Both already happened in production. For each conversation you have context (channel, persona, intent), the behavior that occurred (A or B), and the outcome (resolved / not resolved, CSAT, escalation).

The naive move is to average outcomes across all A conversations, average across all B, and call the difference the effect. That lies.

Why it lies: whatever made the agent choose A or B is probably correlated with the outcome. Maybe the agent asks early when the intent is clearly transactional, and late when it's ambiguous. The transactional ones would have resolved anyway. You end up attributing intent distribution to prompt behavior and declaring a winner that doesn't exist.
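The bias is easy to reproduce. Here's a toy simulation (synthetic data, names illustrative) where asking early has zero true effect, yet the naive average shows a large lift because transactional intents both resolve more often and get asked early more often. Stratifying on intent, which is a crude form of what matching does, recovers the truth:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Confounder: 1 = clearly transactional intent, 0 = ambiguous.
transactional = rng.binomial(1, 0.5, n)

# The agent asks early far more often on transactional intents.
asked_early = rng.binomial(1, np.where(transactional == 1, 0.8, 0.2))

# True effect of asking early is zero; only intent drives resolution.
resolved = rng.binomial(1, np.where(transactional == 1, 0.9, 0.5))

# Naive comparison: biased by the intent mix.
naive = resolved[asked_early == 1].mean() - resolved[asked_early == 0].mean()

# Stratified comparison: difference within each intent stratum, then average.
strata = [
    resolved[(transactional == s) & (asked_early == 1)].mean()
    - resolved[(transactional == s) & (asked_early == 0)].mean()
    for s in (0, 1)
]
adjusted = float(np.mean(strata))

print(f"naive estimate:    {naive:+.3f}")    # large spurious lift
print(f"adjusted estimate: {adjusted:+.3f}") # near zero, the truth
```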

What you actually need is to compare A and B conversations that were otherwise identical. That is the job of propensity matching.

Propensity matching when both variants are already in the logs

The idea is simple. For every A conversation, find a B conversation that was as similar as possible on everything that might have pushed the agent toward A or B. Match them. Compare the outcomes inside each pair. Average across pairs.

The math that makes this work is the propensity score: given a conversation's covariates, it's the estimated probability that the conversation received treatment A. Match treated and control conversations with similar propensity scores, and inside those matched pairs you have something that behaves like a small randomized experiment on the observed covariates.

Here's a short PyWhy / DoWhy sketch:

propensity_match.py
import pandas as pd
from dowhy import CausalModel
 
# Each row: one conversation.
# treatment = 1 if agent asked email first, else 0
# outcome = 1 if ticket resolved in session
# covariates = features we want to match on
conversations = pd.read_parquet("conversations.parquet")
 
model = CausalModel(
    data=conversations,
    treatment="asked_email_first",
    outcome="resolved",
    common_causes=[
        "intent_transactional",
        "persona_segment",
        "memory_has_prior_session",
        "agent_version",
        "hour_of_day",
        "channel",
    ],
)
 
identified = model.identify_effect()
estimate = model.estimate_effect(
    identified,
    method_name="backdoor.propensity_score_matching",
)
 
print(f"Estimated effect of asking early: {estimate.value:.3f}")

The common_causes list is the whole game. Match on intent, persona, memory state, agent version, hour of day, and channel and you've ruled out the obvious confounders. Forget to include intent when the agent picks A mostly on transactional queries, and your estimate is junk. The code is the easy part. Defending the covariate list in a review is the job.

Once you have the estimate, pressure test it. DoWhy's refute_estimate API adds random noise, swaps treatments, and drops subsets. If the estimate survives those refutations, you have a defensible answer. If it collapses, one of your assumptions is wrong and it's back to the drawing board.

For the email team from the opening scene: when we ran this on their logs, asking early lifted resolution rate by about 3 points, with the effect holding up across two refutation runs. An afternoon of analysis replaced six weeks of A/B scoping. The PM had already blocked a sprint for the experiment. They unblocked it that afternoon.

Illustration: an operations dashboard showing a deploy gate, with score and latency checks passing and the release blocked on error rate.

What if you already shipped the change? Synthetic control.

Matching works when both variants sit side by side in your logs. It doesn't help when a change shipped to all traffic at once. You pushed a new system prompt last Tuesday. You want to know the effect. Before-vs-after averaging is a trap, because your traffic mix shifts constantly. A holiday weekend lands in "before", a partner launch lands in "after", and now you cannot tell prompt impact from seasonality.

Synthetic control fixes this. You build a weighted combination of pre-launch cohorts whose outcome metric tracks the launch cohort's metric week over week during the pre-period. That weighted combination is the counterfactual: what the metric would have looked like if you hadn't shipped. The difference between actual and counterfactual is the effect.

Alberto Abadie invented the method for policy studies (his original 2003 paper looked at the economic impact of terrorism in the Basque Country, of all things). Product teams picked it up because the setting generalizes cleanly: one unit got treated, many units did not, and you want the counterfactual trajectory of the treated unit.

For a prompt change, the "units" are time-bucketed cohorts of conversations, grouped by intent and persona. The treated unit is the post-launch cohort. The donor pool is every prior week's cohort. The method solves for weights that minimize the gap between actual and synthetic on pre-launch outcome data, then applies those weights to the post-launch window.

synthetic_control.py
import numpy as np
from scipy.optimize import minimize
 
# outcomes_pre:  (n_donor_weeks, n_timepoints) pre-period metric for each donor week
# target_pre:    (n_timepoints,) pre-period metric for the launch cohort
# outcomes_post: (n_donor_weeks,) post-launch-window metric for each donor week
# target_post:   scalar, actual post-launch metric for the launch cohort
 
def loss(w, donors, target):
    return np.sum((donors.T @ w - target) ** 2)
 
constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1}]
bounds = [(0, 1)] * outcomes_pre.shape[0]
 
result = minimize(
    loss,
    x0=np.ones(outcomes_pre.shape[0]) / outcomes_pre.shape[0],
    args=(outcomes_pre, target_pre),
    bounds=bounds,
    constraints=constraints,
)
 
synthetic_post = outcomes_post @ result.x
effect = target_post - synthetic_post
print(f"Causal effect of prompt change: {effect:.3f}")

That sketch is the core math, not the whole method. The real workflow adds placebo tests (apply it to untreated units and check you get null effects), permutation-based inference, and cross-validation on pre-period fit. Don't ship a decision off the raw minimize output. PyWhy's SyntheticControl estimator handles the scaffolding cleanly. For small samples and a careful audience, the R Synth package is still the reference implementation, and Microsoft's EconML is worth a look if you want flexible estimators over the same workflow.
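A placebo test is simple to run by hand on top of the scipy sketch above. Treat each untreated donor cohort as if it were the launch cohort, compute its "effect", and collect the distribution. A real launch effect is only credible if it falls outside that spread. A minimal version on synthetic data (shapes and numbers invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize

def synth_effect(donors_pre, target_pre, donors_post, target_post):
    """Fit simplex weights on the pre-period, return the post-period gap."""
    k = donors_pre.shape[0]
    res = minimize(
        lambda w: np.sum((donors_pre.T @ w - target_pre) ** 2),
        x0=np.full(k, 1.0 / k),
        bounds=[(0.0, 1.0)] * k,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
    )
    return float(target_post - donors_post @ res.x)

rng = np.random.default_rng(1)
pre = rng.normal(0.70, 0.02, size=(10, 8))  # 10 untreated weekly cohorts, 8 pre-period points
post = rng.normal(0.70, 0.02, size=10)      # their post-window metric

# Placebo test: pretend each untreated cohort was the launch cohort.
placebo_effects = []
for i in range(pre.shape[0]):
    mask = np.arange(pre.shape[0]) != i
    placebo_effects.append(
        synth_effect(pre[mask], pre[i], post[mask], post[i])
    )

print(f"placebo spread: {min(placebo_effects):+.3f} to {max(placebo_effects):+.3f}")
```

Since no placebo cohort was actually treated, every effect here should hug zero; a launch effect that sits well outside this range is the signal you can defend.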

Synthetic control cannot tell you why a change worked. It tells you whether the metric after launch is meaningfully different from the counterfactual you built. For a product team deciding whether to keep a prompt change, that is usually the question you actually need answered.

Diff-in-diff when a tool only rolled out to some conversations

A third pattern shows up constantly with agent tools. You added a new tool, say lookup_order, for a subset of conversations. Maybe only high-intent ones. Maybe only authenticated users. Maybe just Tier-2 skills. You want to know whether it helped. The tool was not randomly assigned.

Difference-in-differences compares the change in outcome before and after the rollout for conversations that got the tool, against the change in outcome before and after for conversations that did not. The parallel trends assumption is the catch: without the tool, both groups would have moved in parallel. If that holds, the difference of differences is the causal effect.

diff_in_diff.py
import statsmodels.formula.api as smf
 
# One row per conversation with:
#   got_tool: 1 if lookup_order was available in that skill at conversation time
#   post_launch: 1 if after rollout date
#   resolved: outcome
 
did = smf.ols(
    "resolved ~ got_tool + post_launch + got_tool:post_launch + C(intent) + C(hour)",
    data=conversations,
).fit()
 
print(did.summary())
# The coefficient on got_tool:post_launch is the causal effect.

The interaction term is your causal estimate. The fixed effects for intent and hour absorb within-group drift. Before trusting the number, plot the pre-launch outcome trajectory for both groups and look for parallel lines. If they visibly diverge, parallel trends is violated and you need to either reweight or switch to synthetic control.
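One quick numeric version of that eyeball check: fit a slope to each group's pre-launch weekly outcome and compare them. This sketch uses invented rates where both groups genuinely drift in parallel:

```python
import numpy as np

rng = np.random.default_rng(3)
weeks = np.arange(8.0)  # eight pre-launch weeks

# Hypothetical weekly resolution rates; both groups drift upward slowly.
treated = 0.60 + 0.005 * weeks + rng.normal(0, 0.01, 8)
control = 0.50 + 0.005 * weeks + rng.normal(0, 0.01, 8)

# Close slopes support parallel trends; a large gap relative to
# week-to-week noise means the assumption is suspect.
slope_treated = np.polyfit(weeks, treated, 1)[0]
slope_control = np.polyfit(weeks, control, 1)[0]
print(f"treated slope {slope_treated:+.4f}, control slope {slope_control:+.4f}")
```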

Remember the agent readiness work we wrote about in the unit, A/B, and live testing piece? Diff-in-diff is what turns the "live test" phase from a vibe into a number.

How the pieces connect

Three methods, three questions, one pipeline. Before picking code, pick the method that matches the shape of the question. Got two behaviors running side by side? Match them. Shipped a change and want lift? Synthetic control. Rolled a tool out to a subset? Diff-in-diff. Pick wrong and you'll answer the wrong question cleanly.

Propensity matching. Use when: two variants coexist in the logs (A vs B, early vs late, tool A vs tool B). Key assumption: no hidden confounders outside the matched covariates. Breaks when: an unobserved driver routes conversations to one variant.

Synthetic control. Use when: one change shipped at a known date and the pre-period is clean. Key assumption: the donor pool can reconstruct the pre-launch trajectory. Breaks when: the launch window is short or the traffic mix changed with the launch.

Diff-in-diff. Use when: a tool or feature rolled out to a subset and both groups are observable before and after. Key assumption: parallel trends between treated and control groups. Breaks when: pre-launch trajectories visibly diverge.
Observational causal pipeline from conversation logs to decisions

The pipeline is: structured signals in, chosen method based on the question, causal estimate plus refutations out. Every step above the decision depends on having the right fields in your logs.

When does observational analysis break?

Four failure modes kill causal estimates: hidden confounders, small samples, gameable outcomes, and survivorship bias. Each breaks a different method, and each shows up often enough in real agent data that you should assume at least one applies on any given run. Here's how to spot them before they spot you.

Hidden confounders sink propensity matching. Can't observe it, can't match on it. If the agent was upgraded to a new model mid-period and that rollout correlated with anything in your covariates, you have a problem no amount of matching fixes. Rule of thumb: list every plausible reason the agent might have chosen A over B. If any of them isn't in your logs, stop and fix the logging before you run anything.

Small samples ruin synthetic control. You need enough pre-period weeks and enough donor cohorts that the weighted combination actually tracks the target. Launched two weeks ago? Do not run synthetic control yet. Run a simpler method and wait.

Outcome gaming is the worst failure mode, and it's the one that catches the most teams. If "resolved" is determined by the agent itself marking the ticket closed, and the new prompt variant is more aggressive about marking things closed, your outcome metric is contaminated before you even start. The only fix is an outcome measured outside the agent's control: a downstream system, a human grader, a customer survey, a delayed signal. Anything the agent can't reach.

Survivorship bias is the subtle one. Users who had bad experiences may have churned before you got to measure them. Your dataset is over-weighted toward users who survived. Panel data across a longer window helps, but acknowledge the bias in any decision you make on the numbers.

The signals that make this work

None of this is possible without structured conversation data. The fields I argued for in the Signal schema post are the exact fields causal methods need: intent, outcome, memory state, agent version, tool calls, timestamps, channel, persona segment. Miss any one of these and you lose the ability to control for it.
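As a concrete sketch of what "structured" means here, the fields above map onto one record per conversation. The class name and exact fields are illustrative, not the Signal schema itself:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical record shape; the field names mirror the list above.
@dataclass
class ConversationSignal:
    conversation_id: str
    timestamp: datetime
    channel: str                      # e.g. "chat", "email", "voice"
    persona_segment: str
    intent: str
    agent_version: str
    memory_has_prior_session: bool
    tools_invoked: list[str] = field(default_factory=list)
    resolved: Optional[bool] = None   # outcome, ideally set outside the agent
```

Every covariate the matching code controls for, every donor-pool grouping, and every DiD fixed effect comes straight out of fields like these.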

This is also why Signal extraction as a first-class pipeline matters more than most teams assume. Conversations are not natively structured. Turning a transcript into (intent, outcome, memory_used, tools_invoked) is the step that converts raw logs into causal-analysis-ready data. Pair it with Scorecards for outcome definitions that are not gameable by the agent itself, and with Monitoring to catch the distribution shifts that invalidate your causal assumptions between analysis runs.

Scenarios has a role here too, but not the one you might expect. Scenarios are best used to validate a causal finding before you roll back. If propensity matching says variant B is worse, you can replay variant B against a persona library with known-good outcomes and confirm the direction. That is the production workflow we recommend: causal analysis on the live data, scenario validation before acting on it. This connects directly to the retention correlation discussion in the companion piece, where the same signals feed longer-horizon questions.

Stop waiting for A/B tests

Formal A/B testing is not dead. For high-stakes changes with unknown user impact, randomized control is still the right move. Before a major pricing prompt change, before a new escalation policy, before anything where you can't afford to be wrong, scope the A/B.

But most of the questions you want to answer aren't like that. "Is the new persona handling complaints better than the old one?" "Did our tool rollout last month help resolution?" "Does asking for email first actually matter?" Those answers are in your logs right now. Causal methods on observational data give them back in hours, not six weeks.

Remember the email team from the opening scene. They saved 28 business days. The right causal method read the experiment they had already run and handed them a defensible answer before they could have finished writing the A/B spec. That is not an edge case. That is the default state for any team running an agent in production with decent conversation logging.

Your agent is running experiments every minute of every day. Stop throwing them away.

Every conversation is a small experiment. Log it like one.

Chanl turns raw transcripts into structured signals with the fields causal methods need: intent, outcome, memory state, agent version, tool calls. Point your agent at Chanl and start reading the experiments you're already running.
