Chanl
Testing & Evaluation


Memory bugs don't crash your agent. They just give subtly wrong answers using stale context. Here are 5 test patterns to catch them before customers do.

Dean Grover, Co-founder
April 3, 2026
16 min read
Person examining a translucent board with connected note cards, verifying links between them

A team shipped their support agent with persistent memory last quarter. The demo was impressive: the agent remembered customer names, referenced past tickets, and personalized every response. Three weeks later, a customer called in furious. The agent had greeted them by their ex-spouse's name, confidently referenced a complaint they never filed, and offered a discount on a product they'd already returned.

No error logs. No crashes. No alerts. The agent had simply retrieved the wrong memories, and nothing in the testing pipeline was designed to catch that.

Here's the uncomfortable truth about agent memory: there's no assertEqual for "did the agent remember the right thing at the right time." You can't mock it. You can't unit test it in the traditional sense. Memory bugs are a new category of failure that most QA processes don't even have vocabulary for yet.

What follows are five test patterns, each designed to catch a specific way memory breaks. They build on each other, starting with the obvious and ending with the subtle. By the last one, you'll have a test suite that runs in CI.

Why are memory bugs different from regular bugs?

Memory bugs are silent failures. A null pointer throws an exception. A broken API returns a 500. But a memory bug returns a 200 with a confident, wrong answer. Your monitoring dashboard stays green while your agent tells Customer A about Customer B's order history.

Research backs this up. The Hindsight memory architecture paper found that baseline agents using full-context approaches achieved just 39% accuracy on the LongMemEval benchmark, even while completing assigned tasks successfully. The agents "succeeded" by every traditional metric while operating on a fraction of the context they should have used.

Here's what makes memory bugs uniquely dangerous compared to the bugs you're used to catching:

| | Regular Bug | Memory Bug |
| --- | --- | --- |
| Error signal | Exception, 500, crash | None. 200 OK. |
| Detection | Logs, alerts, monitoring | Customer complaint |
| Reproducibility | Usually deterministic | Depends on stored state + retrieval |
| Blast radius | One request | Every future conversation with that customer |
| Time to notice | Minutes | Days to weeks |

The ICLR 2026 MemoryAgentBench paper identified four competencies that memory systems need: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Most teams only test the first one, if they test at all.

Let's start with the simplest failure and work our way up.

How do you test if an agent remembers a single fact?

A probe test is the simplest memory test: store a fact, start a new conversation, and ask a question that requires that fact. If the agent can't answer using the stored information, your retrieval pipeline is broken. This is your baseline, the test you run before anything else.

Think of it like a flashcard for your agent. You show it a fact, wait, then quiz it.

typescript
interface MemoryProbeTest {
  setup: { key: string; value: string; customerId: string };
  probe: string; // question to ask
  expected: string; // what the answer should contain
}
 
async function runProbeTest(
  agent: AgentClient,
  memory: MemoryStore,
  test: MemoryProbeTest
): Promise<{ passed: boolean; response: string }> {
  // Step 1: Store the fact
  await memory.create({
    customerId: test.setup.customerId,
    key: test.setup.key,
    value: test.setup.value,
  });
 
  // Step 2: Start a fresh conversation (no prior context)
  const session = await agent.createSession({
    customerId: test.setup.customerId,
  });
 
  // Step 3: Ask the probe question
  const response = await agent.sendMessage(session.id, test.probe);
 
  // Step 4: Check if the stored fact appears in the response
  const passed = response.text
    .toLowerCase()
    .includes(test.expected.toLowerCase());
 
  return { passed, response: response.text };
}
 
// Example: does the agent remember preferred language?
const result = await runProbeTest(agent, memory, {
  setup: {
    key: "preferred_language",
    value: "Spanish",
    customerId: "cust_123",
  },
  probe: "What language should I use when contacting this customer?",
  expected: "Spanish",
});

If this test fails, you have total amnesia. The memory store isn't connected, the retrieval pipeline is broken, or the agent isn't injecting retrieved context into its prompt. Fix this before moving on.

But passing this test doesn't mean much. It only proves your agent can recall a single, isolated fact under perfect conditions. The real world isn't that clean.

What happens when a stored fact changes?

People change their phone numbers. They cancel subscriptions and start new ones. They move. Your memory system stores the original fact, but what happens when you overwrite it?

The nasty part: the agent doesn't say "I don't know." It confidently uses the outdated information. The customer corrected their phone number last week, but your agent keeps calling the old one.

typescript
async function runStalenessTest(
  agent: AgentClient,
  memory: MemoryStore,
  customerId: string
): Promise<{ passed: boolean; usedVersion: "old" | "new" | "unknown" }> {
  // Step 1: Store original fact
  await memory.create({
    customerId,
    key: "contact_preference",
    value: "Email at sarah@oldcompany.com",
  });
 
  // Step 2: Update the fact
  await memory.create({
    customerId,
    key: "contact_preference",
    value: "SMS at +1-555-0199",
  });
 
  // Step 3: Ask the agent in a new session
  const session = await agent.createSession({ customerId });
  const response = await agent.sendMessage(
    session.id,
    "How should I reach out to this customer?"
  );
 
  const text = response.text.toLowerCase();
  const usesOld = text.includes("email") || text.includes("oldcompany");
  const usesNew = text.includes("sms") || text.includes("0199");
 
  if (usesNew && !usesOld) return { passed: true, usedVersion: "new" };
  if (usesOld && !usesNew) return { passed: false, usedVersion: "old" };
  return { passed: false, usedVersion: "unknown" };
}

Three things typically cause staleness: your memory store uses append-only writes (so both versions exist and the retrieval picks the older, more "established" one), your embedding similarity scores the original higher because it has more context, or your memory has a caching layer that hasn't invalidated yet.

If this test fails, check your memory's update semantics. Does create with the same key overwrite or append? Does your retrieval sort by recency or just by similarity score?
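If your store doesn't give you overwrite semantics out of the box, the fix can be sketched with a tiny in-memory store. This is an illustrative sketch, not a real SDK: `create()` with the same `(customerId, key)` replaces the record instead of appending a sibling, and `search()` sorts by recency so the newest fact wins even if duplicates sneak in.

```typescript
// Minimal sketch of overwrite-on-same-key semantics.
// MemoryRecord and InMemoryStore are illustrative, not a real memory SDK.
interface MemoryRecord {
  customerId: string;
  key: string;
  value: string;
  updatedAt: number;
}

class InMemoryStore {
  private records = new Map<string, MemoryRecord>();

  create(rec: Omit<MemoryRecord, "updatedAt">): void {
    // Composite key makes a second write an overwrite, not a sibling entry
    const id = `${rec.customerId}:${rec.key}`;
    this.records.set(id, { ...rec, updatedAt: Date.now() });
  }

  search(customerId: string): MemoryRecord[] {
    // Sort by recency so the newest fact is ranked first
    return [...this.records.values()]
      .filter((r) => r.customerId === customerId)
      .sort((a, b) => b.updatedAt - a.updatedAt);
  }
}

const store = new InMemoryStore();
store.create({ customerId: "cust_123", key: "contact_preference", value: "Email" });
store.create({ customerId: "cust_123", key: "contact_preference", value: "SMS" });
console.log(store.search("cust_123").length); // 1: old value was overwritten
console.log(store.search("cust_123")[0].value); // "SMS"
```

If your production store is append-only by design, the same idea applies at read time: dedupe by key and keep only the most recent version before the context reaches the prompt.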

How do you catch cross-customer memory leaks?

Cross-contamination happens when your agent returns Customer B's data to Customer A. Two customers with similar names, similar purchase histories, maybe even similar support issues. Does your agent keep their memories straight?

This test catches a class of bugs that probe tests miss entirely: workspace isolation failures, overly broad similarity searches that pull in neighbors, and embedding collisions where two customers' data is close in vector space.

typescript
async function runCrossContaminationTest(
  agent: AgentClient,
  memory: MemoryStore
): Promise<{ passed: boolean; details: string }> {
  const customerA = "cust_sarah_miller";
  const customerB = "cust_sarah_mitchell";
 
  // Two Sarahs, different preferences
  await memory.create({
    customerId: customerA,
    key: "product_preference",
    value: "Prefers the Enterprise plan, annual billing",
  });
 
  await memory.create({
    customerId: customerB,
    key: "product_preference",
    value: "Prefers the Starter plan, monthly billing",
  });
 
  // Ask about Customer A
  const sessionA = await agent.createSession({ customerId: customerA });
  const responseA = await agent.sendMessage(
    sessionA.id,
    "What plan does this customer prefer?"
  );
 
  // Ask about Customer B
  const sessionB = await agent.createSession({ customerId: customerB });
  const responseB = await agent.sendMessage(
    sessionB.id,
    "What plan does this customer prefer?"
  );
 
  const aText = responseA.text.toLowerCase();
  const bText = responseB.text.toLowerCase();
 
  const aCorrect =
    aText.includes("enterprise") && !aText.includes("starter");
  const bCorrect =
    bText.includes("starter") && !bText.includes("enterprise");
 
  if (aCorrect && bCorrect) {
    return { passed: true, details: "Both customers correctly isolated" };
  }
 
  return {
    passed: false,
    details: `Customer A correct: ${aCorrect}, Customer B correct: ${bCorrect}`,
  };
}

Cross-contamination usually means your memory retrieval isn't properly scoped by customer ID. The vector search finds "Sarah who prefers a plan" and returns the closest match regardless of whose memory it belongs to. This is a data isolation bug, not a retrieval quality bug. Fix it in your query filters, not your embedding model.
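One way to enforce that boundary is to apply the customer filter before any similarity ranking ever runs. A minimal sketch, assuming a plain in-memory list and a hand-rolled `cosineSimilarity` helper (both illustrative, not a real vector database API):

```typescript
// Sketch: scope by customerId BEFORE ranking by similarity, so another
// customer's memories can never appear in the results.
interface StoredMemory {
  customerId: string;
  key: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function scopedSearch(
  memories: StoredMemory[],
  customerId: string,
  queryEmbedding: number[],
  limit: number
): StoredMemory[] {
  return memories
    .filter((m) => m.customerId === customerId) // hard isolation boundary
    .map((m) => ({ m, score: cosineSimilarity(m.embedding, queryEmbedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((x) => x.m);
}
```

In a real vector store the equivalent is a metadata filter pushed into the query itself, not a post-filter on results: post-filtering can silently return fewer than `limit` items and still leak neighbors if the filter is forgotten on one code path.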

How precise is your memory retrieval?

The first three tests are binary: did the agent remember or not? In production, your customer has 50 stored memories and the retrieval layer returns 10 of them. The question isn't just "did it find something relevant." It's "did it find the RIGHT things, and only the right things?" You measure that with precision@k and recall@k.

This is where you start measuring, not just checking.

typescript
interface RetrievalScore {
  precisionAtK: number; // relevant retrieved / total retrieved
  recallAtK: number; // relevant retrieved / total relevant
  retrieved: string[];
  expectedRelevant: string[];
}
 
async function runRelevancePrecisionTest(
  memory: MemoryStore,
  customerId: string
): Promise<RetrievalScore> {
  // Seed 50 memories (mix of relevant and irrelevant).
  // generateTestMemories is a test helper that returns { key, value } pairs,
  // including the three relevant keys listed below.
  const memories = generateTestMemories(50);
  const relevantKeys = ["billing_issue_jan", "billing_issue_mar", "payment_method"];
 
  for (const mem of memories) {
    await memory.create({ customerId, key: mem.key, value: mem.value });
  }
 
  // Query: "What billing issues has this customer had?"
  const results = await memory.search({
    customerId,
    query: "billing issues and payment history",
    limit: 10,
  });
 
  const retrievedKeys = results.map((r) => r.key);
  const relevantRetrieved = retrievedKeys.filter((k) =>
    relevantKeys.includes(k)
  );
 
  return {
    // Guard against an empty result set, which would divide by zero
    precisionAtK:
      retrievedKeys.length === 0
        ? 0
        : relevantRetrieved.length / retrievedKeys.length,
    recallAtK: relevantRetrieved.length / relevantKeys.length,
    retrieved: retrievedKeys,
    expectedRelevant: relevantKeys,
  };
}

What counts as good? Aim for precision@10 above 0.3 (at least 3 of your top 10 results are relevant) and recall above 0.8 (you found most of what matters). These are practical starting thresholds. For context, the Letta memory benchmarks showed 74% accuracy on LoCoMo using basic file operations, while the Agent Memory Benchmark project reports accuracy ranging from 71% to 94% depending on the dataset and memory architecture.

Low precision means your embeddings aren't discriminating well. Try adding metadata filters (time range, category) before vector search. Low recall means your memories are stored in a format that doesn't match how they're queried. The same fact phrased differently at write time vs read time can tank recall.
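The metadata pre-filter can be sketched like this. The `category` and `createdAt` fields, and the cutoff window, are illustrative assumptions about your schema, not a prescribed format:

```typescript
// Sketch: narrow the candidate set with metadata before vector search runs.
interface CandidateMemory {
  key: string;
  category: string;  // e.g. "billing", "shipping"
  createdAt: number; // epoch milliseconds
}

function preFilter(
  memories: CandidateMemory[],
  category: string,
  maxAgeMs: number,
  now: number = Date.now()
): CandidateMemory[] {
  // Only memories in the right category and inside the time window
  // make it into the (expensive, fuzzy) similarity stage.
  return memories.filter(
    (m) => m.category === category && now - m.createdAt <= maxAgeMs
  );
}
```

Because the vector search now ranks a smaller, already-relevant pool, precision@k rises without touching the embedding model at all.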

[Illustration: a customer-memory panel for a support representative, showing 4 recalled memories for Sarah Chen — Premium tier, last call 2 days ago, prefers email follow-up — plus a session memory ("Discussed upgrading to Business plan. Budget approved at $50k. Follow up next Tuesday.") retrieved at 85% relevance]

Does your agent fabricate memories that don't exist?

Every test so far assumes the answer exists somewhere in memory. This one flips it: ask about something that was never stored. A well-behaved agent says "I don't have that information." A badly-behaved one invents a plausible answer from nothing. It doesn't retrieve the wrong fact. It fabricates one entirely. Negative tests catch this by verifying the agent acknowledges absence rather than hallucinating.

typescript
async function runNegativeTest(
  agent: AgentClient,
  memory: MemoryStore,
  customerId: string
): Promise<{ passed: boolean; hallucinated: boolean; response: string }> {
  // Store some memories, but NOT a birthday
  await memory.create({
    customerId,
    key: "name",
    value: "Alex Thompson",
  });
  await memory.create({
    customerId,
    key: "plan",
    value: "Professional tier",
  });
 
  // Ask about something we never stored
  const session = await agent.createSession({ customerId });
  const response = await agent.sendMessage(
    session.id,
    "When is this customer's birthday? I want to send a gift."
  );
 
  const text = response.text.toLowerCase();
 
  // Check for hallucination signals
  const hallucinated =
    /\b(january|february|march|april|may|june|july|august|september|october|november|december)\b/.test(
      text
    ) || /\d{1,2}\/\d{1,2}/.test(text);
 
  const acknowledged =
    text.includes("don't have") ||
    text.includes("no record") ||
    text.includes("not stored") ||
    text.includes("could ask");
 
  return {
    passed: acknowledged && !hallucinated,
    hallucinated,
    response: response.text,
  };
}

This is different from general LLM hallucination. Your agent has a real memory store. When it gets zero results back, the prompt should say so. If it invents an answer instead, your system prompt probably doesn't handle the empty-retrieval case. Or your retrieval pipeline returns "close enough" results that the LLM interprets as relevant.
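One way to handle the empty-retrieval case is to state it explicitly when assembling the memory section of the prompt, rather than silently omitting it. A minimal sketch — `buildMemoryContext` and its exact wording are assumptions, not part of any SDK:

```typescript
// Sketch: make zero-result retrieval explicit in the prompt context.
function buildMemoryContext(retrieved: string[]): string {
  if (retrieved.length === 0) {
    // Tell the model plainly that nothing was found, and what to do about it
    return [
      "No stored memories matched this question.",
      "If the user asks about information you do not have,",
      "say so explicitly instead of guessing.",
    ].join(" ");
  }
  return (
    "Relevant stored memories:\n" +
    retrieved.map((m) => `- ${m}`).join("\n")
  );
}
```

Pair this with a minimum similarity threshold on retrieval, so "close enough" matches below the threshold are treated as zero results instead of being handed to the model as if they were relevant.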

Run this with at least 5 different "never stored" topics. Birthday, favorite color, last vacation. One might pass by luck. Five consecutive passes means your agent actually handles the absence of memory correctly.

The gap between knowing and doing

Five test patterns. Together they cover the failure modes that the MemoryAgentBench framework identifies as critical: accurate retrieval (Tests 1, 4), test-time learning (Test 2), long-range understanding (Tests 1, 3), and conflict resolution (Tests 2, 5).

But here's the problem. Every test above needs a working agent, a memory store, a conversation runtime, auth, session management, and a way to evaluate whether the agent's natural language response actually used the right context. Building that test harness from scratch is a project in itself.

You know what to test. Getting it into CI is the hard part.

Automating this with Chanl scenarios

Chanl's scenario engine closes that gap. Instead of building test infrastructure, you write scenario definitions that seed memory, run multi-turn conversations, and grade the results with scorecards.

Here's the staleness test from earlier, rewritten as a Chanl scenario. One pipeline handles session management, memory operations, and evaluation:

typescript
import { ChanlSDK } from "@chanl/sdk";
 
const chanl = new ChanlSDK({
  apiKey: process.env.CHANL_API_KEY,
});
 
// Step 1: Seed memory with the "old" fact
await chanl.memory.create({
  entityType: "customer",
  entityId: "cust_123",
  agentId: "agent_support",
  key: "contact_preference",
  content: "Customer prefers email contact at sarah@oldcompany.com",
});
 
// Step 2: Update with the "new" fact (same entity + key = deduplication)
await chanl.memory.create({
  entityType: "customer",
  entityId: "cust_123",
  agentId: "agent_support",
  key: "contact_preference",
  content: "Customer now prefers SMS at +1-555-0199",
});
 
// Step 3: Run a scenario that probes for the updated fact
const { data: execution } = await chanl.scenarios.run(
  "scenario_memory_staleness",
  {
    agentId: "agent_support",
    parameters: {
      customerId: "cust_123",
      probeQuestion: "How should I contact this customer?",
      expectedFact: "SMS",
      unexpectedFact: "email",
    },
  }
);
 
// Step 4: Grade the result with a memory-specific scorecard
// Criteria are pre-configured on the scorecard (via the dashboard or API).
// evaluate() triggers async scoring against the call produced by the scenario.
const callId = execution.execution.callDetails?.callId;
const { data: evalResult } = await chanl.scorecards.evaluate(callId, {
  scorecardId: "scorecard_memory_accuracy",
});
 
// Poll for the completed result
const { data: result } = await chanl.scorecards.getResult(evalResult.resultId);
console.log(`Staleness test: ${result.overallScore >= 80 ? "PASS" : "FAIL"}`);
console.log(`Score: ${result.overallScore}`);

The key difference: you don't parse natural language yourself. The scorecard uses an LLM judge with criteria you define once (e.g., "Agent referenced SMS, not email" as a prompt-type criterion). "Did the agent reference SMS?" is evaluated by the scoring model, not a regex. So when the agent says "text message" instead of "SMS" or rephrases the phone number, the evaluation still works.

You can verify what your agent actually retrieved with chanl.memory.search():

typescript
// After the scenario runs, audit what was retrieved
const { data: retrieved } = await chanl.memory.search({
  entityType: "customer",
  entityId: "cust_123",
  agentId: "agent_support",
  query: "contact preference",
  limit: 5,
});
 
console.log("Retrieved memories:", retrieved.memories.map((m) => m.content));
// Should show the SMS entry, not the email one

Run all five test patterns as scenarios, wire them to a CI trigger, and you have continuous memory QA that catches regressions every deploy. Build the agent, connect it to your channels, monitor what it actually remembers.
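The gating logic in that CI step is simple. Here is a sketch of the pass/fail decision a build might make over scenario scores — the `ScenarioResult` shape and the 80-point threshold are assumptions for illustration, mirroring the score check above:

```typescript
// Sketch: fail the build if any memory scenario scores below threshold.
interface ScenarioResult {
  scenarioId: string;
  overallScore: number; // 0-100, from the scorecard evaluation
}

function ciGate(
  results: ScenarioResult[],
  threshold = 80
): { pass: boolean; failing: string[] } {
  const failing = results
    .filter((r) => r.overallScore < threshold)
    .map((r) => r.scenarioId);
  return { pass: failing.length === 0, failing };
}

const results: ScenarioResult[] = [
  { scenarioId: "scenario_memory_probe", overallScore: 92 },
  { scenarioId: "scenario_memory_staleness", overallScore: 71 },
];
console.log(ciGate(results)); // staleness falls below the gate
```

In CI, a non-empty `failing` list becomes a non-zero exit code, which blocks the deploy the same way a failing unit test would.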

The memory test checklist

Before shipping agent memory to production, run through this. Each pattern catches a different failure mode, and skipping any leaves a gap:

| Test | What It Catches | Minimum Passes |
| --- | --- | --- |
| Probe | Total amnesia, broken retrieval pipeline | 10 different facts |
| Staleness | Outdated information, append-only bugs | 5 update scenarios |
| Cross-contamination | Customer isolation failures, embedding collisions | 3 similar-name pairs |
| Precision | Noisy retrieval, poor embedding quality | precision@10 > 0.3, recall > 0.8 |
| Hallucination | Fabricated memories, missing empty-state handling | 5 "never stored" topics |

Start with probe tests. If those fail, nothing else matters. Then staleness. When both pass consistently, layer in the rest.

These patterns are architecture-agnostic. Whether you've built a memory system from scratch or are using graph-based memory, the same tests apply. What changes is where the bugs hide.

Remember that support agent from the opening? The one that greeted a customer by their ex-spouse's name? A single cross-contamination test would have caught it. A staleness test would have flagged the outdated complaint reference. A hallucination test would have stopped the phantom discount offer.

The agent didn't need to be smarter. It needed five tests that nobody wrote.

Test your agent's memory

Run memory probe tests, staleness checks, and retrieval accuracy benchmarks with Chanl's scenario engine. No test harness to build.

Start testing free
Dean Grover

Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

Aprende IA Agéntica

One lesson per week: practical techniques for building, testing, and shipping AI agents. From prompt engineering to production monitoring. Learn by doing.

500+ engineers subscribed
