Chanl
Testing & Evaluation


Memory bugs don't crash your agent. They just give subtly wrong answers using stale context. Here are 5 test patterns to catch them before customers do.

Dean Grover, Co-founder
April 3, 2026
16 min read
Person examining a translucent board with connected note cards, verifying links between them

A team shipped their support agent with persistent memory last quarter. The demo was impressive: the agent remembered customer names, referenced past tickets, and personalized every response. Three weeks later, a customer called in furious. The agent had greeted them by their ex-spouse's name, confidently referenced a complaint they never filed, and offered a discount on a product they'd already returned.

No error logs. No crashes. No alerts. The agent had simply retrieved the wrong memories, and nothing in the testing pipeline was designed to catch that.

Here's the uncomfortable truth about agent memory: there's no assertEqual for "did the agent remember the right thing at the right time." You can't mock it. You can't unit test it in the traditional sense. Memory bugs are a new category of failure that most QA processes don't even have vocabulary for yet.

What follows are five test patterns, each designed to catch a specific way memory breaks. They build on each other, starting with the obvious and ending with the subtle. By the last one, you'll have a test suite that runs in CI.

Why are memory bugs different from regular bugs?

Memory bugs are silent failures. A null pointer throws an exception. A broken API returns a 500. But a memory bug returns a 200 with a confident, wrong answer. Your monitoring dashboard stays green while your agent tells Customer A about Customer B's order history.

Research backs this up. The Hindsight memory architecture paper found that baseline agents using full-context approaches achieved just 39% accuracy on the LongMemEval benchmark, even while completing assigned tasks successfully. The agents "succeeded" by every traditional metric while operating on a fraction of the context they should have used.

Here's what makes memory bugs uniquely dangerous compared to the bugs you're used to catching:

| | Regular Bug | Memory Bug |
| --- | --- | --- |
| Error signal | Exception, 500, crash | None. 200 OK. |
| Detection | Logs, alerts, monitoring | Customer complaint |
| Reproducibility | Usually deterministic | Depends on stored state + retrieval |
| Blast radius | One request | Every future conversation with that customer |
| Time to notice | Minutes | Days to weeks |

The ICLR 2026 MemoryAgentBench paper identified four competencies that memory systems need: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Most teams only test the first one, if they test at all.

Let's start with the simplest failure and work our way up.

How do you test if an agent remembers a single fact?

A probe test is the simplest memory test: store a fact, start a new conversation, and ask a question that requires that fact. If the agent can't answer using the stored information, your retrieval pipeline is broken. This is your baseline, the test you run before anything else.

Think of it like a flashcard for your agent. You show it a fact, wait, then quiz it.

typescript
interface MemoryProbeTest {
  setup: { key: string; value: string; customerId: string };
  probe: string; // question to ask
  expected: string; // what the answer should contain
}
 
async function runProbeTest(
  agent: AgentClient,
  memory: MemoryStore,
  test: MemoryProbeTest
): Promise<{ passed: boolean; response: string }> {
  // Step 1: Store the fact
  await memory.create({
    customerId: test.setup.customerId,
    key: test.setup.key,
    value: test.setup.value,
  });
 
  // Step 2: Start a fresh conversation (no prior context)
  const session = await agent.createSession({
    customerId: test.setup.customerId,
  });
 
  // Step 3: Ask the probe question
  const response = await agent.sendMessage(session.id, test.probe);
 
  // Step 4: Check if the stored fact appears in the response
  const passed = response.text
    .toLowerCase()
    .includes(test.expected.toLowerCase());
 
  return { passed, response: response.text };
}
 
// Example: does the agent remember preferred language?
const result = await runProbeTest(agent, memory, {
  setup: {
    key: "preferred_language",
    value: "Spanish",
    customerId: "cust_123",
  },
  probe: "What language should I use when contacting this customer?",
  expected: "Spanish",
});

If this test fails, you have total amnesia. The memory store isn't connected, the retrieval pipeline is broken, or the agent isn't injecting retrieved context into its prompt. Fix this before moving on.

But passing this test doesn't mean much. It only proves your agent can recall a single, isolated fact under perfect conditions. The real world isn't that clean.

What happens when a stored fact changes?

People change their phone numbers. They cancel subscriptions and start new ones. They move. Your memory system stores the original fact, but what happens when you overwrite it?

The nasty part: the agent doesn't say "I don't know." It confidently uses the outdated information. The customer corrected their phone number last week, but your agent keeps calling the old one.

typescript
async function runStalenessTest(
  agent: AgentClient,
  memory: MemoryStore,
  customerId: string
): Promise<{ passed: boolean; usedVersion: "old" | "new" | "unknown" }> {
  // Step 1: Store original fact
  await memory.create({
    customerId,
    key: "contact_preference",
    value: "Email at sarah@oldcompany.com",
  });
 
  // Step 2: Update the fact
  await memory.create({
    customerId,
    key: "contact_preference",
    value: "SMS at +1-555-0199",
  });
 
  // Step 3: Ask the agent in a new session
  const session = await agent.createSession({ customerId });
  const response = await agent.sendMessage(
    session.id,
    "How should I reach out to this customer?"
  );
 
  const text = response.text.toLowerCase();
  const usesOld = text.includes("email") || text.includes("oldcompany");
  const usesNew = text.includes("sms") || text.includes("0199");
 
  if (usesNew && !usesOld) return { passed: true, usedVersion: "new" };
  if (usesOld && !usesNew) return { passed: false, usedVersion: "old" };
  return { passed: false, usedVersion: "unknown" };
}

Three things typically cause staleness: your memory store uses append-only writes (so both versions exist and the retrieval picks the older, more "established" one), your embedding similarity scores the original higher because it has more context, or your memory has a caching layer that hasn't invalidated yet.

If this test fails, check your memory's update semantics. Does create with the same key overwrite or append? Does your retrieval sort by recency or just by similarity score?
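If your store doesn't give you overwrite semantics out of the box, the fix can be sketched with a tiny in-memory store. This is an illustrative sketch, not a real SDK: `create()` with the same `(customerId, key)` replaces the record instead of appending a sibling, and `search()` sorts by recency so the newest fact wins even if duplicates sneak in.

```typescript
// Minimal sketch of overwrite-on-same-key semantics.
// MemoryRecord and InMemoryStore are illustrative, not a real memory SDK.
interface MemoryRecord {
  customerId: string;
  key: string;
  value: string;
  updatedAt: number;
}

class InMemoryStore {
  private records = new Map<string, MemoryRecord>();

  create(rec: Omit<MemoryRecord, "updatedAt">): void {
    // Composite key makes a second write an overwrite, not a sibling entry
    const id = `${rec.customerId}:${rec.key}`;
    this.records.set(id, { ...rec, updatedAt: Date.now() });
  }

  search(customerId: string): MemoryRecord[] {
    // Sort by recency so the newest fact is ranked first
    return [...this.records.values()]
      .filter((r) => r.customerId === customerId)
      .sort((a, b) => b.updatedAt - a.updatedAt);
  }
}

const store = new InMemoryStore();
store.create({ customerId: "cust_123", key: "contact_preference", value: "Email" });
store.create({ customerId: "cust_123", key: "contact_preference", value: "SMS" });
console.log(store.search("cust_123").length); // 1: old value was overwritten
console.log(store.search("cust_123")[0].value); // "SMS"
```

If your production store is append-only by design, the same idea applies at read time: dedupe by key and keep only the most recent version before the context reaches the prompt.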

How do you catch cross-customer memory leaks?

Cross-contamination happens when your agent returns Customer B's data to Customer A. Two customers with similar names, similar purchase histories, maybe even similar support issues. Does your agent keep their memories straight?

This test catches a class of bugs that probe tests miss entirely: workspace isolation failures, overly broad similarity searches that pull in neighbors, and embedding collisions where two customers' data is close in vector space.

typescript
async function runCrossContaminationTest(
  agent: AgentClient,
  memory: MemoryStore
): Promise<{ passed: boolean; details: string }> {
  const customerA = "cust_sarah_miller";
  const customerB = "cust_sarah_mitchell";
 
  // Two Sarahs, different preferences
  await memory.create({
    customerId: customerA,
    key: "product_preference",
    value: "Prefers the Enterprise plan, annual billing",
  });
 
  await memory.create({
    customerId: customerB,
    key: "product_preference",
    value: "Prefers the Starter plan, monthly billing",
  });
 
  // Ask about Customer A
  const sessionA = await agent.createSession({ customerId: customerA });
  const responseA = await agent.sendMessage(
    sessionA.id,
    "What plan does this customer prefer?"
  );
 
  // Ask about Customer B
  const sessionB = await agent.createSession({ customerId: customerB });
  const responseB = await agent.sendMessage(
    sessionB.id,
    "What plan does this customer prefer?"
  );
 
  const aText = responseA.text.toLowerCase();
  const bText = responseB.text.toLowerCase();
 
  const aCorrect =
    aText.includes("enterprise") && !aText.includes("starter");
  const bCorrect =
    bText.includes("starter") && !bText.includes("enterprise");
 
  if (aCorrect && bCorrect) {
    return { passed: true, details: "Both customers correctly isolated" };
  }
 
  return {
    passed: false,
    details: `Customer A correct: ${aCorrect}, Customer B correct: ${bCorrect}`,
  };
}

Cross-contamination usually means your memory retrieval isn't properly scoped by customer ID. The vector search finds "Sarah who prefers a plan" and returns the closest match regardless of whose memory it belongs to. This is a data isolation bug, not a retrieval quality bug. Fix it in your query filters, not your embedding model.
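One way to enforce that boundary is to apply the customer filter before any similarity ranking ever runs. A minimal sketch, assuming a plain in-memory list and a hand-rolled `cosineSimilarity` helper (both illustrative, not a real vector database API):

```typescript
// Sketch: scope by customerId BEFORE ranking by similarity, so another
// customer's memories can never appear in the results.
interface StoredMemory {
  customerId: string;
  key: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function scopedSearch(
  memories: StoredMemory[],
  customerId: string,
  queryEmbedding: number[],
  limit: number
): StoredMemory[] {
  return memories
    .filter((m) => m.customerId === customerId) // hard isolation boundary
    .map((m) => ({ m, score: cosineSimilarity(m.embedding, queryEmbedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((x) => x.m);
}
```

In a real vector store the equivalent is a metadata filter pushed into the query itself, not a post-filter on results: post-filtering can silently return fewer than `limit` items and still leak neighbors if the filter is forgotten on one code path.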

How precise is your memory retrieval?

The first three tests are binary: did the agent remember or not? In production, your customer has 50 stored memories and the retrieval layer returns 10 of them. The question isn't just "did it find something relevant." It's "did it find the RIGHT things, and only the right things?" You measure that with precision@k and recall@k.

This is where you start measuring, not just checking.

typescript
interface RetrievalScore {
  precisionAtK: number; // relevant retrieved / total retrieved
  recallAtK: number; // relevant retrieved / total relevant
  retrieved: string[];
  expectedRelevant: string[];
}
 
async function runRelevancePrecisionTest(
  memory: MemoryStore,
  customerId: string
): Promise<RetrievalScore> {
  // Seed 50 memories (mix of relevant and irrelevant).
  // generateTestMemories is a test helper that returns { key, value } pairs,
  // including the three relevant keys listed below.
  const memories = generateTestMemories(50);
  const relevantKeys = ["billing_issue_jan", "billing_issue_mar", "payment_method"];
 
  for (const mem of memories) {
    await memory.create({ customerId, key: mem.key, value: mem.value });
  }
 
  // Query: "What billing issues has this customer had?"
  const results = await memory.search({
    customerId,
    query: "billing issues and payment history",
    limit: 10,
  });
 
  const retrievedKeys = results.map((r) => r.key);
  const relevantRetrieved = retrievedKeys.filter((k) =>
    relevantKeys.includes(k)
  );
 
  return {
    // Guard against an empty result set, which would divide by zero
    precisionAtK:
      retrievedKeys.length === 0
        ? 0
        : relevantRetrieved.length / retrievedKeys.length,
    recallAtK: relevantRetrieved.length / relevantKeys.length,
    retrieved: retrievedKeys,
    expectedRelevant: relevantKeys,
  };
}

What counts as good? Aim for precision@10 above 0.3 (at least 3 of your top 10 results are relevant) and recall above 0.8 (you found most of what matters). These are practical starting thresholds. For context, the Letta memory benchmarks showed 74% accuracy on LoCoMo using basic file operations, while the Agent Memory Benchmark project reports accuracy ranging from 71% to 94% depending on the dataset and memory architecture.

Low precision means your embeddings aren't discriminating well. Try adding metadata filters (time range, category) before vector search. Low recall means your memories are stored in a format that doesn't match how they're queried. The same fact phrased differently at write time vs read time can tank recall.
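The metadata pre-filter can be sketched like this. The `category` and `createdAt` fields, and the cutoff window, are illustrative assumptions about your schema, not a prescribed format:

```typescript
// Sketch: narrow the candidate set with metadata before vector search runs.
interface CandidateMemory {
  key: string;
  category: string;  // e.g. "billing", "shipping"
  createdAt: number; // epoch milliseconds
}

function preFilter(
  memories: CandidateMemory[],
  category: string,
  maxAgeMs: number,
  now: number = Date.now()
): CandidateMemory[] {
  // Only memories in the right category and inside the time window
  // make it into the (expensive, fuzzy) similarity stage.
  return memories.filter(
    (m) => m.category === category && now - m.createdAt <= maxAgeMs
  );
}
```

Because the vector search now ranks a smaller, already-relevant pool, precision@k rises without touching the embedding model at all.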

[Illustration: a customer-memory panel for a support representative, showing 4 recalled memories for Sarah Chen — Premium tier, last call 2 days ago, prefers email follow-up — plus a session memory ("Discussed upgrading to Business plan. Budget approved at $50k. Follow up next Tuesday.") retrieved at 85% relevance]

Does your agent fabricate memories that don't exist?

Every test so far assumes the answer exists somewhere in memory. This one flips it: ask about something that was never stored. A well-behaved agent says "I don't have that information." A badly-behaved one invents a plausible answer from nothing. It doesn't retrieve the wrong fact. It fabricates one entirely. Negative tests catch this by verifying the agent acknowledges absence rather than hallucinating.

typescript
async function runNegativeTest(
  agent: AgentClient,
  memory: MemoryStore,
  customerId: string
): Promise<{ passed: boolean; hallucinated: boolean; response: string }> {
  // Store some memories, but NOT a birthday
  await memory.create({
    customerId,
    key: "name",
    value: "Alex Thompson",
  });
  await memory.create({
    customerId,
    key: "plan",
    value: "Professional tier",
  });
 
  // Ask about something we never stored
  const session = await agent.createSession({ customerId });
  const response = await agent.sendMessage(
    session.id,
    "When is this customer's birthday? I want to send a gift."
  );
 
  const text = response.text.toLowerCase();
 
  // Check for hallucination signals
  const hallucinated =
    /\b(january|february|march|april|may|june|july|august|september|october|november|december)\b/.test(
      text
    ) || /\d{1,2}\/\d{1,2}/.test(text);
 
  const acknowledged =
    text.includes("don't have") ||
    text.includes("no record") ||
    text.includes("not stored") ||
    text.includes("could ask");
 
  return {
    passed: acknowledged && !hallucinated,
    hallucinated,
    response: response.text,
  };
}

This is different from general LLM hallucination. Your agent has a real memory store. When it gets zero results back, the prompt should say so. If it invents an answer instead, your system prompt probably doesn't handle the empty-retrieval case. Or your retrieval pipeline returns "close enough" results that the LLM interprets as relevant.
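One way to handle the empty-retrieval case is to state it explicitly when assembling the memory section of the prompt, rather than silently omitting it. A minimal sketch — `buildMemoryContext` and its exact wording are assumptions, not part of any SDK:

```typescript
// Sketch: make zero-result retrieval explicit in the prompt context.
function buildMemoryContext(retrieved: string[]): string {
  if (retrieved.length === 0) {
    // Tell the model plainly that nothing was found, and what to do about it
    return [
      "No stored memories matched this question.",
      "If the user asks about information you do not have,",
      "say so explicitly instead of guessing.",
    ].join(" ");
  }
  return (
    "Relevant stored memories:\n" +
    retrieved.map((m) => `- ${m}`).join("\n")
  );
}
```

Pair this with a minimum similarity threshold on retrieval, so "close enough" matches below the threshold are treated as zero results instead of being handed to the model as if they were relevant.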

Run this with at least 5 different "never stored" topics. Birthday, favorite color, last vacation. One might pass by luck. Five consecutive passes means your agent actually handles the absence of memory correctly.

The gap between knowing and doing

Five test patterns. Together they cover the failure modes that the MemoryAgentBench framework identifies as critical: accurate retrieval (Tests 1, 4), test-time learning (Test 2), long-range understanding (Tests 1, 3), and conflict resolution (Tests 2, 5).

But here's the problem. Every test above needs a working agent, a memory store, a conversation runtime, auth, session management, and a way to evaluate whether the agent's natural language response actually used the right context. Building that test harness from scratch is a project in itself.

You know what to test. Getting it into CI is the hard part.

Automating this with Chanl scenarios

Chanl's scenario engine closes that gap. Instead of building test infrastructure, you write scenario definitions that seed memory, run multi-turn conversations, and grade the results with scorecards.

Here's the staleness test from earlier, rewritten as a Chanl scenario. One pipeline handles session management, memory operations, and evaluation:

typescript
import { ChanlSDK } from "@chanl/sdk";
 
const chanl = new ChanlSDK({
  apiKey: process.env.CHANL_API_KEY,
});
 
// Step 1: Seed memory with the "old" fact
await chanl.memory.create({
  entityType: "customer",
  entityId: "cust_123",
  agentId: "agent_support",
  key: "contact_preference",
  content: "Customer prefers email contact at sarah@oldcompany.com",
});
 
// Step 2: Update with the "new" fact (same entity + key = deduplication)
await chanl.memory.create({
  entityType: "customer",
  entityId: "cust_123",
  agentId: "agent_support",
  key: "contact_preference",
  content: "Customer now prefers SMS at +1-555-0199",
});
 
// Step 3: Run a scenario that probes for the updated fact
const { data: execution } = await chanl.scenarios.run(
  "scenario_memory_staleness",
  {
    agentId: "agent_support",
    parameters: {
      customerId: "cust_123",
      probeQuestion: "How should I contact this customer?",
      expectedFact: "SMS",
      unexpectedFact: "email",
    },
  }
);
 
// Step 4: Grade the result with a memory-specific scorecard
// Criteria are pre-configured on the scorecard (via the dashboard or API).
// evaluate() triggers async scoring against the call produced by the scenario.
const callId = execution.execution.callDetails?.callId;
const { data: evalResult } = await chanl.scorecards.evaluate(callId, {
  scorecardId: "scorecard_memory_accuracy",
});
 
// Poll for the completed result
const { data: result } = await chanl.scorecards.getResult(evalResult.resultId);
console.log(`Staleness test: ${result.overallScore >= 80 ? "PASS" : "FAIL"}`);
console.log(`Score: ${result.overallScore}`);

The key difference: you don't parse natural language yourself. The scorecard uses an LLM judge with criteria you define once (e.g., "Agent referenced SMS, not email" as a prompt-type criterion). "Did the agent reference SMS?" is evaluated by the scoring model, not a regex. So when the agent says "text message" instead of "SMS" or rephrases the phone number, the evaluation still works.

You can verify what your agent actually retrieved with chanl.memory.search():

typescript
// After the scenario runs, audit what was retrieved
const { data: retrieved } = await chanl.memory.search({
  entityType: "customer",
  entityId: "cust_123",
  agentId: "agent_support",
  query: "contact preference",
  limit: 5,
});
 
console.log("Retrieved memories:", retrieved.memories.map((m) => m.content));
// Should show the SMS entry, not the email one

Run all five test patterns as scenarios, wire them to a CI trigger, and you have continuous memory QA that catches regressions every deploy. Build the agent, connect it to your channels, monitor what it actually remembers.
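The gating logic in that CI step is simple. Here is a sketch of the pass/fail decision a build might make over scenario scores — the `ScenarioResult` shape and the 80-point threshold are assumptions for illustration, mirroring the score check above:

```typescript
// Sketch: fail the build if any memory scenario scores below threshold.
interface ScenarioResult {
  scenarioId: string;
  overallScore: number; // 0-100, from the scorecard evaluation
}

function ciGate(
  results: ScenarioResult[],
  threshold = 80
): { pass: boolean; failing: string[] } {
  const failing = results
    .filter((r) => r.overallScore < threshold)
    .map((r) => r.scenarioId);
  return { pass: failing.length === 0, failing };
}

const results: ScenarioResult[] = [
  { scenarioId: "scenario_memory_probe", overallScore: 92 },
  { scenarioId: "scenario_memory_staleness", overallScore: 71 },
];
console.log(ciGate(results)); // staleness falls below the gate
```

In CI, a non-empty `failing` list becomes a non-zero exit code, which blocks the deploy the same way a failing unit test would.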

The memory test checklist

Before shipping agent memory to production, run through this. Each pattern catches a different failure mode, and skipping any leaves a gap:

| Test | What It Catches | Minimum Passes |
| --- | --- | --- |
| Probe | Total amnesia, broken retrieval pipeline | 10 different facts |
| Staleness | Outdated information, append-only bugs | 5 update scenarios |
| Cross-contamination | Customer isolation failures, embedding collisions | 3 similar-name pairs |
| Precision | Noisy retrieval, poor embedding quality | precision@10 > 0.3, recall > 0.8 |
| Hallucination | Fabricated memories, missing empty-state handling | 5 "never stored" topics |

Start with probe tests. If those fail, nothing else matters. Then staleness. When both pass consistently, layer in the rest.

These patterns are architecture-agnostic. Whether you've built a memory system from scratch or are using graph-based memory, the same tests apply. What changes is where the bugs hide.

Remember that support agent from the opening? The one that greeted a customer by their ex-spouse's name? A single cross-contamination test would have caught it. A staleness test would have flagged the outdated complaint reference. A hallucination test would have stopped the phantom discount offer.

The agent didn't need to be smarter. It needed five tests that nobody wrote.

Test your agent's memory

Run memory probe tests, staleness checks, and retrieval accuracy benchmarks with Chanl's scenario engine. No test harness to build.

Start testing free
Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

Aprende IA Agéntica

One lesson per week: practical techniques for building, testing, and shipping AI agents. From prompt engineering to production monitoring. Learn by doing.

500+ engineers subscribed
