The alert fired at 11:47 AM on a Tuesday. Not a security alert. A latency alert.
A CX agent handling billing inquiries had drifted from its usual 1.8-second response time to 18 seconds. Customers were waiting. The on-call engineer traced it to a change from two days earlier. Someone had bumped the conversation history window from 20 turns to unlimited, to "give the agent more context."
It had more context. It was also slower, more confused, and twice as expensive to run per call.
The fix took ten minutes. Cap the history. Add a retrieval layer. The insight took longer to internalize. The context window isn't a place to store information. It's a place to reason about information you've already selected. That distinction is the difference between an agent that performs in production and one that doesn't.
Your Context Window Is Not a Database
The context window is fast, volatile working memory for a single inference call. It clears when the call ends. Every token you load into it costs money at inference time. And past a certain density, more tokens actively hurt performance. They don't help it.
That last point is the one most teams don't believe until they see it in their own data.
The LoCoMo benchmark, a long-form conversational memory dataset designed to stress-test agent recall across extended interactions, ran a controlled comparison. The full-context baseline stuffed roughly 26,000 tokens per query into the window. It scored 72.9% accuracy with a p95 latency of 17.12 seconds. A two-layer memory architecture that retrieved and loaded roughly 6,956 tokens scored 91.6% accuracy with a p95 latency of 1.44 seconds.
That's 18.7 percentage points of accuracy gain, 4x fewer tokens used, and 91% lower tail latency. Smaller context outperformed bigger context, decisively.
| Approach | Tokens per query | Accuracy | p95 latency |
|---|---|---|---|
| Full-context baseline | ~26,000 | 72.9% | 17.12s |
| Two-layer memory | ~6,956 | 91.6% | 1.44s |
The reason is attention. Transformer models don't process all tokens equally. Attention heads allocate capacity across the full context, and that allocation degrades as the window fills with noise. When the window is cluttered with redundant history, partial tool results, and marginally relevant documents, the signal you actually need gets buried. You're not giving the model more information to work with. You're making it harder to find what matters.
RAM works the same way. You don't load your entire file system into RAM to run an application. You load what the application needs to execute right now. Everything else stays on disk.
What Breaks When You Pack the Context
Packing the context window creates three failure modes that compound on each other.
Attention dilution. The model reads everything you give it, but it doesn't attend to everything equally. Long-range dependencies in transformer attention degrade with distance. A critical customer fact, "account suspended for non-payment", sitting 18,000 tokens back in a wall of conversation history may not get weighted correctly when the customer asks a billing question. It's technically in the window. The model just effectively ignores it.
Latency inflation. Inference time scales super-linearly with context length because attention is quadratic. Adding 10,000 tokens doesn't add a fixed overhead, it compounds what's already there. The 17-second p95 from the LoCoMo full-context baseline isn't a theoretical edge case. It's what customers experience when you tell them to hold while the agent thinks. On a phone line, most callers hang up before 15 seconds of silence.
Cost multiplication. A 128K-token context with full history costs four to eight times more per call than a 6,000-token focused context. Multiply that by tens of thousands of daily conversations and you've built a cost structure that scales against you as volume grows.
A larger context window doesn't solve this. A 1M-token window with poor retrieval still underperforms a 6,000-token window with good retrieval. The fix is structural.
The Two-Layer Architecture
The two-layer fix separates concerns cleanly. The context window handles active reasoning. A persistent storage layer handles long-term recall.
When a session starts, you retrieve the relevant subset of what's in storage. The last few sessions with this customer, their key account facts, the procedural patterns for their likely scenarios. You load that into the context window. During the session, the model reasons from that focused set. When the session ends, you run an extraction pass on what's new and write it back to storage.
The context window never holds the full history. It holds a curated slice of it, assembled fresh for this specific customer and this specific moment.
The storage layer handles persistence, indexing, and retrieval. The context window handles reasoning. Neither tries to do the other's job.
The build for this pattern is covered in the companion piece, build your own AI agent memory system, which walks through the retrieval layer, vector search, and the deduplication patterns that matter at production scale. For teams who'd rather skip the infrastructure, Chanl's memory layer ships this two-layer pattern out of the box.
Four Memory Types Your CX Agent Needs
Most descriptions of agent memory stop at "short-term vs. long-term." That's not enough for production CX. You need four distinct types, and each one answers a different question for the model.
Working memory is the current conversation. The messages exchanged in this session, the tool calls made, the decisions reached so far. It's already in the context window. The main concern here is history compression. When a session runs long, you summarize older turns rather than letting the window fill up. The context engineering deep dive covers compression strategies in detail.
Episodic memory is what happened with this specific customer in past sessions. "Called three weeks ago about a billing error. Escalated to tier-2. Resolution: credit applied." This is the memory type most agents lack, and the absence is exactly what customers notice. Retrieve the last three to five sessions, weighted by recency. Don't retrieve everything back to account creation. Recency matters.
Semantic memory is entity knowledge that doesn't change with each interaction. The customer's plan tier, their account status, their product configuration, their preferred contact channel. Store it as structured key-value facts. Retrieval is a direct lookup by entity ID, not a similarity search. Fast and exact.
Procedural memory is scenario-handling patterns. How your agent navigates a payment dispute, what steps it follows for a data privacy request, when it escalates rather than resolves. This comes from your knowledge base and from successful past resolutions. Retrieval matches the current scenario type to stored patterns.

Customer Memory
4 memories recalled
“Discussed upgrading to Business plan. Budget approved at $50k. Follow up next Tuesday.”
An agent with all four types loaded correctly almost never needs to ask a customer to repeat themselves. Working memory tracks the live exchange. Episodic answers what happened before with this person. Semantic answers who this person is. Procedural answers how to handle this kind of situation.
What to Retrieve, and What to Leave Out
Good retrieval isn't pulling in everything that might be relevant. It's pulling in the minimum that makes the agent effective, while leaving the window comfortable for the live conversation.
Three signals do most of the work.
Recency weighting over the last three to five sessions with this customer, ordered by date. Not all sessions back to account creation. Recent history has much higher signal. A customer who called 18 months ago about a product that no longer exists is noise, not signal.
Semantic similarity search over stored episodic facts, using the customer's current message as the probe. "I'm having trouble with my invoice" should surface facts tagged to billing, payment, and invoicing. Not shipping. Not onboarding. A score threshold around 0.65 keeps retrieval focused.
Entity matching for this customer's structured semantic facts by their account ID. Not vector search, a direct fetch. Their plan tier, their status, their assigned representative. This lookup is cheap and it's always worth including.
Target under 4,000 tokens of retrieved context. That leaves comfortable room for the live conversation and any tool results without crowding the window. Inject those retrieved blocks into the system prompt as labeled sections, not a wall of text. A recent-sessions section, a relevant-facts section, an account-details section. Structure helps the model navigate.
What to Write Back
At session end, extract the delta. New facts the customer revealed, outcomes reached, sentiment signals. Don't store the raw transcript, structured extractions are cheaper to store and far cleaner to retrieve. Write-back is the step most teams skip, and skipping it is exactly why their agents never improve between sessions.
What's worth storing: preferences the customer stated explicitly, complaints they raised, outcomes reached (resolved, escalated, purchased), and semantic facts about their situation that aren't already in your CRM.
What's not worth storing: filler turns like "the customer greeted the agent warmly," redundant facts already captured in entity memory, and anything that'll be stale in 30 days. Temporary discounts, one-time codes, time-bound offers.
The write-back call runs after the session ends, not during it. It's cheap relative to inference cost. And it compounds. Every session improves the agent's starting context for next time.
The Full Session Cycle
Put it together and the cycle is short. Session starts, retrieve from storage, assemble the context window, reason in-call, session ends, extract the delta, write back. Next session starts with everything the agent learned last time.
The context window handles exactly one step in that loop: in-call reasoning. Everything before and after is the memory system's job.
If you're using an orchestration platform like VAPI, Retell, or Bland, the retrieval and write-back steps live outside the conversation loop. Your middleware calls retrieval before starting the call, and calls write-back on the call-end webhook. The agent itself only sees the assembled context. That separation is what makes the pattern portable across orchestration choices.
For teams instrumenting their own loop, the signal worth tracking is retrieval citation rate. Did the model actually use the facts you pulled in, or did it ignore them? If retrieval is high but citation is low, your retrieval is pulling in the wrong things. Quiet failure modes like that are easy to miss when you're only watching accuracy.
The Mental Model That Sticks
Once the RAM analogy clicks, common agent anti-patterns become obvious.
Packing full conversation history into every call is like loading a program's entire file system into RAM. The program doesn't run faster. It thrashes.
Skipping write-back is like running a process with no persistence. The work disappears when the session ends. Every customer starts from zero forever.
Retrieving everything above a very low similarity threshold is like a memory allocator that never frees. You fill the window with low-salience noise until there's no room for what matters.
The fix isn't a larger context window. A 1M-token window with poor retrieval still underperforms a 6,000-token window with disciplined retrieval. The LoCoMo benchmark makes that case as clearly as benchmarks can.
Remember that 11:47 alert? The agent didn't need more context. It needed less of the wrong context, and a memory system around the call that fetched the right context fresh each time. Ten minutes of work to cap history, an afternoon to wire up retrieval, and the 18-second p95 went back to where it belonged.
Think of the context window as RAM. Load what the agent needs for this call. Persist what it learns. Retrieve what it needs next time.
Give your agents cross-session memory
Chanl handles episodic storage, semantic search, entity memory, and write-back extraction, so your CX agents remember what matters and never start a conversation from zero.
Explore Memory- Context Window Behaves Like RAM, Not Storage: Why Most Agent Failures Happen (Mem0 Blog, 2026)
- State of AI Agent Memory 2026: Benchmarks, Architectures and Production Gaps (Mem0 Blog)
- Agent Memory Architectures: 5 Patterns and Trade-offs (Atlan)
- AI Agent Memory Architecture: The Three Layers Production Systems Need (Tacnode)
- Context Window Management for AI Agents: Summarization, Pruning, and Sliding Window Strategies (CallSphere)
- Designing Agentic Memory in 2026 (The Nuanced Perspective)
- AI Agent Architecture: Build Systems That Work in 2026 (Redis)
- Agent Context Windows in 2026: How to Stop Your AI from Forgetting Everything (SparkCo)
Co-founder
Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.
Learn Agentic AI
Weekly. Patterns and recipes for shipping AI agents that actually work — MCP, scorecards, regression tests, prompts, model comparisons. From teams running agents in production.



