What is a synthetic user in AI agent testing?

A synthetic user is a language model playing the role of a customer, following a persona definition that includes personality, intent, background, and behavioral quirks. Unlike scripted test inputs, synthetic users generate novel phrasings, go off-script, and behave unpredictably, like real customers do.

Why aren't scripted test inputs enough for AI agents?

Scripted inputs test only the cases you anticipated. Real users rephrase the same intent a hundred ways, start conversations mid-stream, contradict themselves, and surface edge cases you never imagined. Synthetic users generate this diversity automatically, covering the long tail that scripted tests miss.

How do you build an AI persona for testing?

A testing persona has four components: a customer profile (name, role, history with your company), an intent hierarchy (primary goal plus common tangents), behavioral traits (impatient, verbose, technical, confused), and escalation triggers (the conditions that should cause handoff or fallback). You instantiate the persona with a system prompt and let the model run the conversation.

How many synthetic users should I run per test cycle?

Start with 10-20 personas covering your main customer segments and known edge cases. Once the infrastructure is in place, scale to 50-100 per release. The goal is coverage of the behavioral space, not volume. Twenty well-designed personas reveal more than 200 variations of the same archetype.

What should I measure when testing with synthetic users?

Track task completion rate (did the synthetic user achieve their goal?), turn efficiency (how many agent turns did resolution take?), escalation accuracy (did the agent escalate when it should have, and only then?), and tone consistency (does the agent maintain appropriate language across diverse customer phrasings?).

Can synthetic users test multi-turn conversations?

Yes, and that's where they're most valuable. A synthetic user maintains state across turns, so it can ask follow-up questions based on what the agent just said, change its mind mid-conversation, or escalate emotionally if the agent doesn't resolve the issue. Scripted inputs can't simulate this dynamically.

How is synthetic user testing different from red-teaming?

Red-teaming tests adversarial inputs: users trying to break your agent or provoke harmful outputs. Synthetic user testing focuses on realistic customer behavior your agent might encounter in normal production traffic. Both matter, but synthetic users cover the 'it just didn't work for this customer type' failures, not the 'someone tried to jailbreak it' failures.

How does synthetic user testing fit with other evaluation methods?

Synthetic users belong in pre-production and release testing. Unit tests verify individual agent actions. Scenario tests verify specific conversation paths. Synthetic users verify that your agent handles the full behavioral range of your customer base. After shipping, production monitoring and scorecards surface which real conversations need review.

Synthetic Users: Test Your Agent Against AI Personas

The agent worked perfectly for every test case you wrote. Then a real customer came in and said: "I'd like to reschedule my appointment, actually you know what cancel it, no wait can you just move it to next week," and your agent froze.

Not because it couldn't understand any of those individual requests. It could handle each one in isolation. But nobody wrote a test for a customer who changes their mind three times in one sentence, and your agent had never seen anything like it.

That's the gap synthetic users fill. Not more test cases. A different kind of testing entirely.

Why Scripted Tests Miss the Failures That Matter

Scripted test inputs catch the bugs you anticipated. They're good at verifying that "I want to book an appointment at 3pm" routes correctly and "cancel my subscription" triggers the right flow. They're bad at everything else.

When you write test cases, you're encoding your own mental model of how customers talk. The problem is that customers don't consult your mental model before they call. They use their own words, their own grammar, their own logic, and they connect ideas in ways you didn't predict.

Industry observability reports on agent failure modes (see Arthur AI's 2026 Agentic AI Observability Playbook) keep landing on the same shape: most production failures come from inputs not covered by pre-production testing, not from cases your tests explicitly checked and got wrong. You could triple your test count and still miss the same failures, because they're concentrated in the part of input space you don't know to look at.

Synthetic users attack this differently. Instead of writing inputs, you write personas. Instead of running inputs through your agent, you let a language model play the role of the customer and generate conversations dynamically. The persona drives the intent; the model drives the phrasing, timing, and the tangents.

This approach doesn't replace scripted tests. It sits on top of them, covering behavioral diversity that scripted inputs can't reach.

What a Testing Persona Actually Is

A testing persona is a system prompt that turns a language model into a specific kind of customer. Not a prompt that says "act as a customer." Something much more specific than that.

A useful persona has four components:

Customer profile. Background that grounds the persona in reality: who they are, what they've bought, what their history with your company looks like. This affects tone and what the persona considers a reasonable response time or a reasonable answer.

Intent hierarchy. The primary goal they're trying to accomplish, plus the common side goals that real customers mix in. A customer trying to reschedule also wants to know if there are better available times, might want to change the service type, might ask about pricing on their way to the main request.

Behavioral traits. How they communicate. Patient or impatient. Concise or verbose. Technical or non-technical. Frustrated about a previous bad experience or neutral. These traits determine how the persona handles confusion, long silences, and agent responses that miss the point.

Escalation triggers. The specific things that should cause this customer to ask for a human, hang up, or disengage. Knowing these in advance lets you verify whether your agent correctly identifies and handles them.

Here's what this looks like as a TypeScript interface and system prompt builder:

persona-builder.ts·typescript

interface PersonaDefinition {
  id: string;
  name: string;
  profile: string;
  primaryIntent: string;
  sideIntents: string[];
  traits: string[];
  escalationTriggers: string[];
  company: string;
}
 
function buildPersonaPrompt(persona: PersonaDefinition): string {
  return `
You are ${persona.name}, a customer calling ${persona.company}.
 
PROFILE:
${persona.profile}
 
YOUR GOAL TODAY:
Primary: ${persona.primaryIntent}
You may also ask about: ${persona.sideIntents.join(", ")}
 
HOW YOU COMMUNICATE:
- ${persona.traits.join("\n- ")}
 
YOUR LIMITS:
If the agent ${persona.escalationTriggers.join(" or the agent ")}, ask to speak with a human.
If you've asked the same question twice without a useful answer, say you'll call back later.
 
CONVERSATION RULES:
- Don't summarize what you want in one clean sentence. Customers rarely do.
- Occasionally add context that's not directly relevant to your main goal.
- If the agent misunderstands, correct them but stay in character.
- After turn 8, if your goal isn't resolved, show mild frustration.
 
Respond only as the customer. Do not describe what you're doing, just say it.
  `.trim();
}

The "conversation rules" section is the part most teams skip. Without it, language models are too cooperative and too clear. They summarize their intent neatly on the first turn, never go off-track, and accept imperfect answers. Real customers do none of those things. Adding these behavioral constraints is what makes synthetic conversations actually stress-test your agent.

Building a Persona Library for Your Agent

For a typical customer service agent, you'll want personas that cover your main segments, your emotional edge cases, and your communication-style diversity.

Segment archetypes. One or two personas for each major customer segment. If your customers are primarily SMB owners, healthcare admins, and enterprise procurement managers, those groups have wildly different vocabularies, time budgets, and expectations. Your agent needs to work for all of them.

Emotional states. At least one persona that's frustrated from a previous bad experience, one that's confused but not angry, and one that's in a hurry. These stress your agent's tone-matching and de-escalation logic separately.

Communication edge cases. Personas that are verbose and unfocused, personas that give insufficient information up front, personas that ask multiple questions simultaneously. These test whether your agent handles ambiguity rather than defaulting to a generic response.

Language diversity. If your customer base includes non-native speakers or regional dialects, synthetic users in those voice registers will reveal gaps that standard test inputs miss.

Here's a concrete set for a healthcare appointment scheduling agent:

persona-library.ts·typescript

const personas: PersonaDefinition[] = [
  {
    id: "returning-patient-easy",
    name: "Maria Chen",
    profile:
      "Returning patient, seen Dr. Rodriguez twice before, no billing issues. Calls during lunch break and has about 10 minutes.",
    primaryIntent: "Reschedule a follow-up appointment from next Tuesday to sometime this week",
    sideIntents: [
      "Confirm it will be the same doctor",
      "Ask about parking",
    ],
    traits: [
      "Polite and direct",
      "Gives all information when asked",
      "Won't wait on hold more than 2 minutes",
    ],
    escalationTriggers: [
      "can't find any appointments this week",
      "asks her to call a different number",
    ],
    company: "Lakeside Medical Group",
  },
  {
    id: "new-patient-confused",
    name: "James Okafor",
    profile:
      "New patient referred by his PCP. Never called this clinic before. Unsure what kind of appointment he needs.",
    primaryIntent:
      "Schedule an initial consultation, but doesn't know which department",
    sideIntents: [
      "Ask if they accept his insurance",
      "Ask how long the wait for a first appointment is",
    ],
    traits: [
      "Uncertain and gives incomplete information without prompting",
      "Asks clarifying questions when confused",
      "Doesn't know medical terminology for what he needs",
    ],
    escalationTriggers: [
      "asks for information he doesn't have (insurance number, referral code) without explaining where to find it",
    ],
    company: "Lakeside Medical Group",
  },
  {
    id: "frustrated-billing",
    name: "Sandra Kowalski",
    profile:
      "Patient who received an unexpected bill and is calling to dispute it while also trying to schedule her next appointment. Already called once and felt brushed off.",
    primaryIntent:
      "Schedule next appointment AND get an explanation for why her bill was higher than expected",
    sideIntents: [
      "Ask to speak to billing directly if the agent can't help",
    ],
    traits: [
      "Starts neutral but gets impatient if she feels unheard",
      "Interrupts if the agent goes off-topic",
      "Appreciates when someone acknowledges her frustration before moving on",
    ],
    escalationTriggers: [
      // Phrased without a leading "agent": buildPersonaPrompt already prefixes "the agent".
      "tries to schedule without addressing the billing question first",
      "says they can't help with billing at all",
    ],
    company: "Lakeside Medical Group",
  },
];

Twenty personas built like these will reveal more about your agent's weaknesses than 500 variations of "I want to book an appointment." The variety of behavior matters far more than the volume of identical inputs.
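One way to keep a library honest about the coverage categories described earlier (segments, emotional states, communication edge cases, language diversity) is to tag each persona and check for gaps. The sketch below is illustrative only: the `TaggedPersona` shape, the tag values, and the `findCoverageGaps` helper are assumptions, not part of the persona interface shown above.

```typescript
// Hypothetical coverage dimensions, mirroring the four categories in the text.
type Dimension = "segment" | "emotion" | "communication" | "language";

// Hypothetical shape: a persona id plus the coverage tags it satisfies.
interface TaggedPersona {
  id: string;
  tags: Partial<Record<Dimension, string>>;
}

// The values you expect at least one persona to cover in each dimension.
type CoverageTargets = Record<Dimension, string[]>;

// Returns, per dimension, the target values that no persona covers yet.
function findCoverageGaps(
  personas: TaggedPersona[],
  targets: CoverageTargets
): Record<Dimension, string[]> {
  const gaps = {} as Record<Dimension, string[]>;
  for (const dim of Object.keys(targets) as Dimension[]) {
    const covered = new Set(
      personas
        .map((p) => p.tags[dim])
        .filter((v): v is string => v !== undefined)
    );
    gaps[dim] = targets[dim].filter((value) => !covered.has(value));
  }
  return gaps;
}

// Example: two tagged personas checked against a small target list.
const exampleGaps = findCoverageGaps(
  [
    { id: "returning-patient-easy", tags: { segment: "returning", emotion: "neutral" } },
    { id: "frustrated-billing", tags: { segment: "returning", emotion: "frustrated" } },
  ],
  {
    segment: ["returning", "new"],
    emotion: ["neutral", "frustrated", "hurried"],
    communication: ["verbose"],
    language: [],
  }
);
console.log(exampleGaps);
```

An empty array for a dimension means every target value has at least one persona; anything listed is a candidate for the next persona you write.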

Running Synthetic Conversations

Once you have your personas, the test loop is clear: initialize the persona model with the persona prompt, start a conversation with your agent, and alternate turns until the task resolves, a timeout hits, or the persona escalates.

synthetic-conversation-runner.ts·typescript

import Anthropic from "@anthropic-ai/sdk";
 
interface ConversationTurn {
  role: "user" | "agent";
  content: string;
  timestamp: number;
}
 
async function runSyntheticConversation(
  persona: PersonaDefinition,
  agentFn: (history: ConversationTurn[]) => Promise<string>,
  maxTurns = 12
): Promise<{
  transcript: ConversationTurn[];
  resolved: boolean;
  turnCount: number;
  escalationTriggered: boolean;
}> {
  const anthropic = new Anthropic();
  const history: ConversationTurn[] = [];
  let escalationTriggered = false;
  let resolved = false;
 
  // Persona model generates the opening customer message
  const openingResponse = await anthropic.messages.create({
    model: "claude-haiku-4-5-20251001", // Fast model for the synthetic user
    max_tokens: 256,
    system: buildPersonaPrompt(persona),
    messages: [
      {
        role: "user",
        content:
          "Start the conversation. Greet the agent and begin explaining what you need.",
      },
    ],
  });
 
  const customerOpening =
    openingResponse.content[0].type === "text"
      ? openingResponse.content[0].text
      : "";
 
  history.push({
    role: "user",
    content: customerOpening,
    timestamp: Date.now(),
  });
 
  for (let turn = 0; turn < maxTurns; turn++) {
    // Your agent responds to the current history
    const agentResponse = await agentFn(history);
    history.push({ role: "agent", content: agentResponse, timestamp: Date.now() });
 
    // Simple heuristic for task resolution; tune for your agent
    if (
      agentResponse.toLowerCase().includes("is there anything else") ||
      agentResponse.toLowerCase().includes("all set") ||
      agentResponse.toLowerCase().includes("you're confirmed")
    ) {
      resolved = true;
      break;
    }
 
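    // Build the persona's view of the conversation. The persona model speaks as
    // the customer, so customer turns are its own ("assistant") turns and agent
    // turns are incoming ("user") turns. The opening instruction is replayed
    // first so the message list starts with a user turn.
    const personaMessages = [
      {
        role: "user" as const,
        content:
          "Start the conversation. Greet the agent and begin explaining what you need.",
      },
      ...history.map((t) => ({
        role: t.role === "user" ? ("assistant" as const) : ("user" as const),
        content: t.content,
      })),
    ];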
 
    const personaResponse = await anthropic.messages.create({
      model: "claude-haiku-4-5-20251001",
      max_tokens: 256,
      system: buildPersonaPrompt(persona),
      messages: personaMessages,
    });
 
    const customerResponse =
      personaResponse.content[0].type === "text"
        ? personaResponse.content[0].text
        : "";
 
    history.push({
      role: "user",
      content: customerResponse,
      timestamp: Date.now(),
    });
 
    // Detect escalation
    if (
      customerResponse.toLowerCase().includes("human") ||
      customerResponse.toLowerCase().includes("speak to someone") ||
      customerResponse.toLowerCase().includes("call back later")
    ) {
      escalationTriggered = true;
      break;
    }
  }
 
  return {
    transcript: history,
    resolved,
    turnCount: history.filter((t) => t.role === "agent").length,
    escalationTriggered,
  };
}

Use a smaller, faster model for the synthetic user (Haiku or equivalent) and your primary model for the agent under test. The synthetic user doesn't need to be brilliant. It just needs to maintain persona consistency and generate plausible, in-character responses, and a fast, inexpensive model is usually enough for that.

Running five personas against your agent costs roughly the same as a few hundred scripted test cases, but it covers behavioral surface area that scripted tests can't reach.
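To run a whole persona library rather than a single conversation, a small batch helper with a concurrency cap is usually enough. This is a sketch under assumptions: `runSuite`, `SuiteResult`, and the `runOne` callback are hypothetical names, and in practice `runOne` would wrap the conversation runner above for one persona.

```typescript
// Hypothetical result wrapper: either a result or an error message per persona.
interface SuiteResult<T> {
  personaId: string;
  result?: T;
  error?: string;
}

// Runs `runOne` for every persona id, with at most `concurrency` in flight.
// A failure in one conversation is recorded and does not abort the others.
async function runSuite<T>(
  personaIds: string[],
  runOne: (personaId: string) => Promise<T>,
  concurrency = 3
): Promise<SuiteResult<T>[]> {
  const results: SuiteResult<T>[] = new Array(personaIds.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < personaIds.length) {
      // Claiming an index is safe here: no await sits between the check and the increment.
      const index = next++;
      const personaId = personaIds[index];
      try {
        results[index] = { personaId, result: await runOne(personaId) };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        results[index] = { personaId, error: message };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, personaIds.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // Same order as `personaIds`.
}
```

Capping concurrency keeps you under model rate limits, and recording errors per persona means one failed API call doesn't cost you the rest of the run.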

What to Measure Across Synthetic Conversations

Four metrics give you the most useful signal across a synthetic conversation test run.

Task completion rate is your north star: did the persona achieve its primary intent? Measure this per persona archetype so you know which customer segments your agent handles well and which it doesn't. A 95% overall completion rate that hides a 60% completion rate for "confused new patients" is telling you something important.
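A per-persona breakdown is a simple grouping over run results. The sketch below assumes a hypothetical `RunRecord` shape whose `resolved` flag corresponds to the runner's `resolved` field; `completionRateByPersona` is an illustrative name.

```typescript
// Hypothetical per-conversation record.
interface RunRecord {
  personaId: string;
  resolved: boolean;
}

// Groups runs by persona and returns the fraction resolved for each persona.
function completionRateByPersona(runs: RunRecord[]): Map<string, number> {
  const totals = new Map<string, { resolved: number; total: number }>();
  for (const run of runs) {
    const entry = totals.get(run.personaId) ?? { resolved: 0, total: 0 };
    entry.total += 1;
    if (run.resolved) entry.resolved += 1;
    totals.set(run.personaId, entry);
  }

  const rates = new Map<string, number>();
  totals.forEach((entry, personaId) => {
    rates.set(personaId, entry.resolved / entry.total);
  });
  return rates;
}
```

Reporting the map rather than a single average is the point: a weak segment stays visible instead of being averaged away.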

Turn efficiency measures how many agent turns it took to reach resolution. Outliers (conversations that took 3x the median turns) reveal where your agent gets confused or stuck in circles. Compare turn efficiency across persona types to identify which communication styles slow your agent down most.
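Flagging those outliers takes a median and a filter. In this sketch, `TurnRecord` and `findTurnOutliers` are hypothetical names, and the default factor of 3 simply follows the "3x the median" rule of thumb mentioned above.

```typescript
// Hypothetical per-conversation record; turnCount is the number of agent turns.
interface TurnRecord {
  conversationId: string;
  personaId: string;
  turnCount: number;
}

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Returns conversations whose turn count is at least `factor` times the median.
function findTurnOutliers(runs: TurnRecord[], factor = 3): TurnRecord[] {
  if (runs.length === 0) return [];
  const threshold = factor * median(runs.map((r) => r.turnCount));
  return runs.filter((r) => r.turnCount >= threshold);
}
```

Grouping the flagged records by `personaId` then shows which communication styles are responsible for the long conversations.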

Escalation accuracy has two sides: did your agent escalate when the persona hit an escalation trigger, and did it avoid unnecessary escalation when the persona didn't? False escalations (handing off when you could have resolved) have real cost too. Both directions matter.
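Both directions can be counted from two labels per conversation. The sketch below assumes you already have a `shouldEscalate` label (from the persona design or a judge pass) and a `didEscalate` observation; the record shape and the `escalationAccuracy` name are hypothetical.

```typescript
// Hypothetical labels for one conversation.
interface EscalationRecord {
  shouldEscalate: boolean; // The persona hit one of its escalation triggers.
  didEscalate: boolean; // The agent actually handed off.
}

interface EscalationStats {
  missed: number; // Should have escalated but didn't.
  falseEscalations: number; // Escalated when it didn't need to.
  recall: number | null; // Share of needed escalations that happened.
  precision: number | null; // Share of escalations that were needed.
}

function escalationAccuracy(runs: EscalationRecord[]): EscalationStats {
  let correct = 0;
  let missed = 0;
  let falseEscalations = 0;
  for (const run of runs) {
    if (run.shouldEscalate && run.didEscalate) correct++;
    else if (run.shouldEscalate) missed++;
    else if (run.didEscalate) falseEscalations++;
  }
  const needed = correct + missed;
  const performed = correct + falseEscalations;
  return {
    missed,
    falseEscalations,
    // null rather than NaN when there is nothing to divide by.
    recall: needed === 0 ? null : correct / needed,
    precision: performed === 0 ? null : correct / performed,
  };
}
```

Low recall means customers who needed a human were left stuck; low precision means the agent is handing off conversations it could have resolved.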

Tone consistency asks whether your agent maintains the same language register across diverse customer phrasings of the same intent. Run the "returning patient easy" persona and the "new patient confused" persona against the same appointment flow and compare agent responses. Vocabulary drift or formality mismatch signals that your agent is overfitting to specific phrasings rather than the underlying intent.

You can automate most of this with an LLM judge pass over each transcript (checking task completion, escalation accuracy, and tone), combined with turn-count metrics you compute directly from the conversation object. We cover building that LLM judge layer in detail in "LLM-as-a-Judge: Build a Production Eval Pipeline."
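As a rough idea of what a judge pass involves, the sketch below builds a grading prompt from a transcript and defensively parses the judge's reply. Everything here is an assumption for illustration: the `JudgeVerdict` fields, the prompt wording, and the helper names. Sending the prompt to a model is left out; any chat-completion call would do.

```typescript
// Hypothetical verdict shape covering the three judged metrics.
interface JudgeVerdict {
  taskCompleted: boolean;
  escalationCorrect: boolean;
  toneConsistent: boolean;
  rationale: string;
}

interface JudgedTurn {
  role: "user" | "agent";
  content: string;
}

// Renders the transcript and asks for a JSON-only verdict.
function buildJudgePrompt(primaryIntent: string, transcript: JudgedTurn[]): string {
  const rendered = transcript
    .map((t) => `${t.role === "user" ? "CUSTOMER" : "AGENT"}: ${t.content}`)
    .join("\n");
  return [
    "You are grading a customer service conversation.",
    `The customer's primary goal was: ${primaryIntent}`,
    "",
    "TRANSCRIPT:",
    rendered,
    "",
    "Reply with only a JSON object with these keys:",
    '{"taskCompleted": boolean, "escalationCorrect": boolean, "toneConsistent": boolean, "rationale": string}',
  ].join("\n");
}

// Extracts the first {...} span from the reply and validates its fields.
// Returns null if the reply is not a usable verdict.
function parseJudgeVerdict(raw: string): JudgeVerdict | null {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    const parsed = JSON.parse(raw.slice(start, end + 1));
    if (
      typeof parsed.taskCompleted === "boolean" &&
      typeof parsed.escalationCorrect === "boolean" &&
      typeof parsed.toneConsistent === "boolean" &&
      typeof parsed.rationale === "string"
    ) {
      return {
        taskCompleted: parsed.taskCompleted,
        escalationCorrect: parsed.escalationCorrect,
        toneConsistent: parsed.toneConsistent,
        rationale: parsed.rationale,
      };
    }
    return null;
  } catch {
    return null;
  }
}
```

Treating an unparseable reply as `null` rather than a failed conversation keeps judge errors from being counted as agent errors.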

[Figure: Synthetic user testing loop, from persona library through conversation to release decision]

Running Scenarios at Scale With Chanl

Building the synthetic conversation runner yourself gives you control and is the right starting point. As your persona library grows and you want to run tests against every deploy, the infrastructure overhead adds up: managing model calls for both the persona and the agent, storing transcripts, computing metrics, and tracking regressions across versions.

Chanl's Scenarios handles this as managed infrastructure. You define a persona once, attach it to a scenario against a target agent, and scenarios.run() executes the conversation loop, stores the transcript, and computes scorecard metrics automatically.

chanl-scenario-runner.ts·typescript

import { ChanlSDK } from "@chanl/sdk";
 
const chanl = new ChanlSDK({ apiKey: process.env.CHANL_API_KEY! });
 
// Create the persona once. The same four-component structure
// (profile, intent, traits, escalation) drops into backstory + behavior.
const { data: personaResponse } = await chanl.personas.create({
  name: "Sandra Kowalski",
  gender: "female",
  emotion: "frustrated",
  language: "english",
  accent: "american",
  intentClarity: "very clear",
  speechStyle: "normal",
  backgroundNoise: false,
  allowInterruptions: true,
  backstory:
    "Patient with an unexpected bill; calling to schedule next visit AND dispute the charge. Already felt brushed off once.",
  tags: ["healthcare", "billing", "frustrated"],
});
 
// Run a scenario that was authored in the Chanl UI (or via API)
// against the agent under test. Returns an execution you can poll.
const { data: runData } = await chanl.scenarios.run("scenario_frustrated_billing", {
  agentId: "agent_xyz",
  simulationMode: "text",
  parameters: { personaId: personaResponse.persona.id },
});
 
const executionId = runData.executionId || runData.execution.id;
 
// Poll until terminal status
let execution = runData.execution;
while (!["completed", "failed", "timeout", "cancelled"].includes(execution.status)) {
  await new Promise((r) => setTimeout(r, 1500));
  const polled = await chanl.scenarios.getExecution(executionId);
  execution = polled.data.execution ?? polled.data;
}
 
console.log({
  status: execution.status,
  overallScore: execution.overallScore,
  stepResults: execution.stepResults, // Per-turn scorecard breakdown
  duration: execution.duration,
});

The per-turn stepResults breakdown is the key addition over rolling your own. Each agent turn gets scored on tone, task relevance, and accuracy, so instead of just knowing "the task didn't resolve," you see which specific turn caused the conversation to go off-track. Tests tell you whether something failed; scorecards tell you why. You need both as a feedback loop.

For monitoring ongoing quality after release, you can feed real production conversations through the same scorecard pipeline. The analytics dashboard shows how synthetic test performance correlates with production call quality, which tells you whether your persona library is actually representative of your real users.

When to Add a Persona to Your Library

Two situations warrant adding a new persona.

The first is after a production failure. When a real customer interaction fails and you review the transcript, build a synthetic persona that matches that customer's behavior pattern. This converts production failures into regression tests that prevent the same issue from slipping through again. It's the most direct way to close the gap between your test environment and production reality.

The second is before expanding to a new customer segment. If you're launching your agent in a new market, with a new customer type, or on a new channel, build synthetic personas for that segment before go-live. You don't have historical conversations to learn from, but you can interview sales or support about what that customer type is like and translate it into persona specifications.

The goal isn't a massive library. It's a representative one. Twenty well-maintained personas, extended based on what you see in production, are more valuable than two hundred variations of the same archetype.

From Testing to Knowing Your Agent Is Ready

Synthetic users change the question you're asking. Instead of "did the tests pass," you start asking "which customer types does this agent handle well?" That's a product question, not a QA question. It gives you a much richer picture of where your agent is actually ready for production traffic.

This article is additive to "AI Agent Testing: How to Evaluate Agents Before They Talk to Customers": that article covers the evaluation framework broadly, while synthetic users are the input generation mechanism that makes evaluation representative. You need the framework to know what to measure, and you need synthetic users to generate conversations diverse enough to measure it meaningfully.

Teams that get this right don't think of testing as a pre-ship gate they need to clear. They think of it as a continuous signal about which parts of their agent are production-ready and which parts need more work. Synthetic users are what make that signal broad enough to actually trust.

chanl-cli

$ chanl run --all --prompt-id 6612a...

Batch Scenario Results
────────────────────────────────────────────────────
Scenario              Status      Score   Time    Result
Billing dispute       completed   96%     8.2s    PASS
Tech support triage   completed   74%     12.1s   FAIL
Account upgrade       completed   92%     6.7s    PASS
VIP escalation        completed   88%     14.3s   PASS
Confused user         completed   67%     9.8s    FAIL

Summary
Total: 5
Passed: 3
Failed: 2
Average Score: 83%

2 of 5 scenarios failed (60% passed)


Key Takeaway

Testing edge cases before production deployment can reduce customer complaints by 80% and prevent costly emergency fixes post-launch.


Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
