Chanl
Knowledge & Memory

How to Build a Tier-1 Chat Agent That Resolves (Not Just Deflects)

Bots claim 40% deflection; re-contact data says half is fake. Build the architecture that cuts tickets: auth-gated KB, calibrated confidence, escalation with context.

Dean Grover · Co-founder
April 30, 2026
13 min read
[Illustration: a late-night developer desk with two monitors. One shows a chat window where the bot says "I don't know this one. Let me get Pat in support." The other shows a dashboard with two columns, raw deflection and resolved deflection, the second visibly smaller.]

A B2B SaaS leadership team reported a 47% deflection rate on their year-old support chat last quarter. Their CFO loved the slide. The number that didn't make the slide: of those deflected sessions, how many of the same customers opened a ticket within seven days about the same thing.

It was 41%.

So 47% of customers "didn't talk to a human," and 41% of those came back within the week and did. The actual deflection, the kind that subtracts a ticket from the queue and keeps it subtracted, was closer to 28% (0.47 × 0.59 ≈ 0.28). Not bad. Not 47%. The CFO slide had been describing a number that meant nothing.

This pattern is everywhere. Intercom's Fin advertises a 67% resolution rate across 40M conversations, but Intercom's own help docs say their thousands of customers average 41%, and the best ones hit around 50%. Lightspeed, a flagship case study, hit 65%. They're the ceiling, not the floor. Meanwhile every B2B SaaS shipping a chat agent reports the headline rate. Three months later support volume is flat or up.

Raw deflection is a vanity number. Real deflection is a metric you can ship against. This piece is the architecture for the second one.

Define Resolved Deflection First

Resolved deflection is the percentage of chat sessions that did not reach a human and where the customer did not re-contact support about the same issue within seven days. It's the only deflection metric that maps to ticket volume going down. Anything else is counting sessions, not problems solved.

You compute it by joining your chat session log against your ticketing system. Pseudocode against a tickets and sessions table:

resolved-deflection.sql·sql
SELECT
  COUNT(*) FILTER (
    WHERE s.handoff = false
    AND NOT EXISTS (
      SELECT 1 FROM tickets t
      WHERE t.customer_id = s.customer_id
        AND t.subject_embedding <-> s.last_query_embedding < 0.25
        AND t.created_at BETWEEN s.ended_at AND s.ended_at + INTERVAL '7 days'
    )
  )::float / COUNT(*) AS resolved_deflection,
  COUNT(*) FILTER (WHERE s.handoff = false)::float / COUNT(*) AS raw_deflection
FROM chat_sessions s
WHERE s.ended_at >= NOW() - INTERVAL '30 days';

The vector-distance check on subject embedding is what makes this honest. Customers don't always re-contact with identical text. "My webhook is still failing" is the same issue as "events not arriving in production." The <-> operator in the query is pgvector's distance operator, so lower means more similar: a distance threshold of 0.25 to 0.35 catches most paraphrases without false-merging unrelated tickets. Tune it on your own data.
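If you want a more principled starting point than eyeballing, hand-label a few hundred (session, ticket) pairs as same-issue or not, then sweep candidate thresholds for the best F1. A minimal sketch, assuming you've exported the pairs with their embeddings; the LabeledPair shape and cosineDistance helper are hypothetical:

tune-distance-threshold.ts·typescript
// Sweep candidate distance thresholds over hand-labeled (session, ticket)
// pairs and report precision/recall/F1 at each. Labels: true = same issue.
type LabeledPair = { sessionEmbedding: number[]; ticketEmbedding: number[]; sameIssue: boolean };

function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function sweepThresholds(pairs: LabeledPair[], thresholds: number[]) {
  return thresholds.map(t => {
    let tp = 0, fp = 0, fn = 0;
    for (const p of pairs) {
      const predictedSame = cosineDistance(p.sessionEmbedding, p.ticketEmbedding) < t;
      if (predictedSame && p.sameIssue) tp++;
      else if (predictedSame && !p.sameIssue) fp++;
      else if (!predictedSame && p.sameIssue) fn++;
    }
    const precision = tp / ((tp + fp) || 1);
    const recall = tp / ((tp + fn) || 1);
    const f1 = (2 * precision * recall) / ((precision + recall) || 1);
    return { threshold: t, precision, recall, f1 };
  });
}

// Sweep the 0.20-0.40 band suggested above:
// sweepThresholds(labeledPairs, [0.20, 0.25, 0.30, 0.35, 0.40]);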

If your raw and resolved numbers are within five points of each other, your bot is good. If they're 15+ points apart, you have a bot that closes sessions by attrition, not resolution.

Now the bot has to actually deflect things. That starts with answering the customer's actual question, not a generic FAQ.

Auth-Gated KB: "Why Did MY Webhook Fail?"

Customers don't ask "how do webhooks work." They ask "why did MY webhook fail at 3:14pm." A generic-docs RAG can answer the first. To answer the second, the agent has to know who the user is, which workspace they're in, and which logs they're allowed to see.

The pattern is session-token pass-through. The customer's session JWT carries their identity. Every tool call extracts the verified claims and scopes its query to that user. The agent never holds a service-level token, which is the same property that lets you survive a SOC 2 Confidentiality review.

Here's a tool definition that fetches webhook delivery logs and refuses to run without an authenticated session:

tools/fetch-webhook-logs.ts·typescript
import { z } from 'zod';
import { verifyJwt } from '../auth/jwt';
import { db } from '../db';          // data layer (path illustrative)
import { audit } from '../audit';    // append-only audit sink (path illustrative)
 
export const fetchWebhookLogs = {
  name: 'fetch_webhook_logs',
  description: "Fetch the calling customer's recent webhook delivery attempts. Always scoped to their workspace.",
  parameters: z.object({
    endpointId: z.string().describe('The webhook endpoint id, as it appears in the dashboard'),
    sinceMinutesAgo: z.number().int().min(1).max(1440).default(60),
  }),
  // The session JWT is forwarded by the chat runtime; the tool refuses to run without it.
  async execute({ endpointId, sinceMinutesAgo }, ctx: { sessionJwt?: string }) {
    if (!ctx.sessionJwt) {
      throw new Error('TOOL_REQUIRES_USER_SESSION');
    }
    const claims = await verifyJwt(ctx.sessionJwt);
    const { workspaceId, userId } = claims;
 
    const rows = await db.webhookDeliveries.find({
      workspaceId,                               // tenant scope, enforced by index
      endpointId,
      attemptedAt: { $gte: new Date(Date.now() - sinceMinutesAgo * 60_000) },
    }).limit(50);
 
    audit.log({ tool: 'fetch_webhook_logs', userId, workspaceId, count: rows.length });
    return rows.map(r => ({ status: r.status, code: r.responseCode, at: r.attemptedAt, error: r.error }));
  },
};

Three things this code makes load-bearing. First, the tool refuses to execute without ctx.sessionJwt. That's how you guarantee the agent can't call it on behalf of a customer who isn't actually in the session. Second, the workspaceId comes from the verified claims, not the LLM's tool arguments. The model cannot ask for someone else's logs by guessing an id. Third, the audit log captures user, workspace, and tool name on every call. SOC 2 auditors want that log; GDPR Article 15 requests are answered from it.

The same pattern applies to KB retrieval. Generic docs can be public. Anything that surfaces customer-specific configuration (feature flags, billing state, integrations they've connected) gets the same workspace-scoped tool wrapper. Auth0 published a useful framing on this: "you've authenticated the user, but have you authorized the agent?" The answer at most companies is no, and that's the gap that lets one customer's data leak into another customer's answer. We've written before about why a generic-docs RAG isn't enough for production agents; the auth-gated tool layer is the missing half.
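Here's what that wrapper can look like, reusing the refuse-without-session shape from fetch_webhook_logs above. The kb.search filter syntax and the public-or-workspace visibility split are assumptions; the load-bearing part is that workspaceId comes from verified claims, never from the model's arguments:

tools/search-account-kb.ts·typescript
import { z } from 'zod';
import { verifyJwt } from '../auth/jwt';
import { kb } from '../kb';          // same hypothetical modules as fetch_webhook_logs
import { audit } from '../audit';

export const searchAccountKb = {
  name: 'search_account_kb',
  description: "Search public docs plus the calling customer's own configuration: feature flags, billing state, connected integrations.",
  parameters: z.object({
    query: z.string().describe('What the customer is asking about'),
  }),
  async execute({ query }: { query: string }, ctx: { sessionJwt?: string }) {
    if (!ctx.sessionJwt) throw new Error('TOOL_REQUIRES_USER_SESSION');
    const { workspaceId, userId } = await verifyJwt(ctx.sessionJwt);

    // Public chunks match everyone; customer-specific chunks carry a
    // workspaceId in metadata and only match the verified claim.
    const hits = await kb.search({
      query,
      k: 8,
      filters: { or: [{ visibility: 'public' }, { workspaceId }] },
    });

    audit.log({ tool: 'search_account_kb', userId, workspaceId, query });
    return hits.map(h => ({ id: h.id, title: h.title, snippet: h.text.slice(0, 400) }));
  },
};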

The Confidence Gate That Catches Hallucinations

Now the agent can fetch the right data. It will still hallucinate. Hallucinations don't only come from missing data. They come from data that's almost relevant. The fix is a confidence gate that runs before the LLM generates the final answer, and an explicit "I don't know" branch that escalates rather than guesses.

Three signals to score together: top-k embedding similarity (does retrieval surface anything close), cross-encoder rerank score (does a cheaper second-pass model agree the result is on-topic), and a contextual-fit check (does the retrieved chunk actually contain what the question is asking for, not just the topic). Recent abstention research (TACL 2025's "Know Your Limits" survey, and arXiv 2510.13750 on activation-based uncertainty) keeps landing on the same conclusion: abstention is more reliable than retraining when wrong answers carry real cost. In customer support they always do. We argued the same point for voice in Smarter Escalation: When Should Voice AI Refuse to Answer; the chat case is a strict subset.

agent/confidence-gate.ts·typescript
const RETRIEVAL_FLOOR = 0.62;     // tune on your eval set
const RERANK_FLOOR    = 0.55;
const FIT_FLOOR       = 0.70;     // contextual-fit threshold
 
async function answerOrEscalate(query: string, ctx: SessionCtx) {
  const candidates = await kb.search({ query, k: 8, filters: { workspaceId: ctx.workspaceId } });
  if (!candidates.length || candidates[0].score < RETRIEVAL_FLOOR) {
    return escalate(ctx, { reason: 'no_relevant_kb', query });
  }
 
  // reranker.score is assumed to return candidates sorted best-first
  const reranked = await reranker.score(query, candidates);
  const top = reranked[0];
  if (top.rerankScore < RERANK_FLOOR) {
    return escalate(ctx, { reason: 'weak_match', query, top: top.id });
  }
 
  const fit = await contextualFit(query, top.text);   // small LLM, returns 0..1
  if (fit < FIT_FLOOR) {
    return escalate(ctx, { reason: 'partial_match', query, top: top.id, fit });
  }
 
  const draft = await llm.complete({ system: SYSTEM_PROMPT, query, context: top.text });
  return { kind: 'answer', text: draft, citations: [top.id] };
}
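The contextualFit call is the only piece above that isn't off-the-shelf. One way to build it, a minimal sketch using the same free-standing llm.complete shape as the gate code; the model name, prompt wording, and score scale are illustrative:

agent/contextual-fit.ts·typescript
// Ask a small, cheap model one narrow question: does this chunk contain what
// the query asks for, not merely the same topic? Returns 0..1.
async function contextualFit(query: string, chunk: string): Promise<number> {
  const raw = await llm.complete({
    model: 'small-fast-model',   // assumption: any cheap model works for this check
    system:
      'Score 0.0-1.0: how directly does PASSAGE answer QUESTION? ' +
      '1.0 = contains the specific answer; 0.5 = right topic, answer missing; 0.0 = unrelated. ' +
      'Reply with JSON only: {"score": <number>}',
    query: `QUESTION: ${query}\n\nPASSAGE: ${chunk}`,
  });
  try {
    const { score } = JSON.parse(raw);
    return Math.max(0, Math.min(1, Number(score)));
  } catch {
    return 0; // unparseable output scores as no-fit, which escalates: the safe default
  }
}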

The escalation calls aren't error paths. They're first-class outputs. One mid-market SaaS team that flipped escalate_to_human to the default for any query where any one of the three gates failed saw resolved deflection jump nine points. Raw deflection dropped four. Customers got handed off faster, and they didn't come back.

That's the trade you want. Lower raw deflection, higher resolved deflection, lower re-contact rate. Most teams optimize the wrong one.

[Flow diagram: customer message + session JWT → auth-gated tools (workspace scoped) → tenant-filtered KB retrieval → three gates in sequence (retrieval score, rerank score, contextual fit), each escalating with its reason on failure → LLM answer with citation. No re-contact within 7 days counts toward resolved deflection; a re-contact subtracts one and flags the transcript for review; every escalation hands the human the full context payload.]
Tier-1 deflection chat agent with confidence-gated escalation

Escalation That Doesn't Lose Context

Escalation is where most deflection programs collapse. The bot says "let me get a human" and then the human asks the customer to repeat their question. That's the second-worst experience in support, just below "the bot answered wrong." If your post-handoff average handle time is longer than direct-to-human AHT, the AI made things worse. We've argued elsewhere that handoff is the new prompt. The artifact you ship to a human is as load-bearing as the system prompt you ship to the model.

A useful escalation payload, modeled on what teams running Pylon and Zendesk integrations have settled on:

escalation/handoff.ts·typescript
type HandoffPayload = {
  sessionId: string;
  customer: { id: string; tier: 'free' | 'pro' | 'enterprise'; accountAgeDays: number };
  query: string;                              // the original unresolved question
  tried: Array<{
    tool: string;
    args: Record<string, unknown>;
    resultSummary: string;                    // 1-2 sentence summary, not raw rows
  }>;
  kbSurfaced: Array<{ id: string; title: string; score: number }>;
  abstainReason: 'no_relevant_kb' | 'weak_match' | 'partial_match' | 'policy_block';
  sentiment: 'neutral' | 'frustrated' | 'urgent';
  suggestedNextStep: string;                  // optional, LLM-generated, human can ignore
};

type AbstainReason = HandoffPayload['abstainReason'];

async function escalate(ctx: SessionCtx, info: { reason: AbstainReason; query: string; top?: string; fit?: number }): Promise<Escalation> {
  const payload: HandoffPayload = {
    sessionId: ctx.sessionId,
    customer: await profiles.get(ctx.userId),
    query: info.query,
    tried: ctx.toolCalls.map(summarizeToolCall),
    kbSurfaced: ctx.kbHits.slice(0, 3),
    abstainReason: info.reason,
    sentiment: ctx.sentimentRolling,
    suggestedNextStep: await draftNextStep(ctx),  // optional, marked clearly as AI-generated
  };
  return tickets.create({ ...payload, channel: 'chat', priority: priorityFor(payload) });
}

The point of tried is that a human reading the ticket sees what the AI already attempted, so they don't repeat it. The point of kbSurfaced is that they see what the bot was looking at, so they can either send the right article or update the wrong one. The point of abstainReason is that you can sort tickets by it and find systemic gaps. If 40% of your escalations are no_relevant_kb for a single product area, that area needs more docs, not a smarter model.
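That sorting exercise is a one-page script. A sketch, assuming escalation tickets carry abstainReason and a productArea tag; tickets.query is a hypothetical stand-in for your ticketing API's search endpoint:

analytics/escalation-gaps.ts·typescript
async function escalationGaps(days = 30) {
  // tickets.query stands in for your ticketing system's search API
  const escalations = await tickets.query({ channel: 'chat', createdAfterDays: days });
  const buckets = new Map<string, number>();
  for (const t of escalations) {
    const key = `${t.abstainReason} / ${t.productArea}`;
    buckets.set(key, (buckets.get(key) ?? 0) + 1);
  }
  // Largest clusters first. A single product area dominating no_relevant_kb
  // means "write docs," not "tune the model."
  return [...buckets.entries()]
    .map(([bucket, count]) => ({ bucket, count, share: count / escalations.length }))
    .sort((a, b) => b.count - a.count);
}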

Measure two metrics on this loop. First, post-handoff time-to-resolution against direct-to-human time-to-resolution. If they're equal, the handoff is working. If post-handoff is longer, your context payload is missing something. Second, percentage of human responses that quote the customer back to themselves ("just to confirm, you're asking about webhooks, right?"). That phrase is the smell of a dropped handoff.
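Both checks run off the same ticket export. A sketch under assumed field names (viaBotHandoff, resolvedAt, firstHumanReply); swap in whatever your ticketing system actually exposes:

analytics/handoff-health.ts·typescript
type Ticket = {
  viaBotHandoff: boolean;
  createdAt: Date;
  resolvedAt: Date | null;
  firstHumanReply: string;
};

function medianHoursToResolve(tickets: Ticket[]): number {
  const hrs = tickets
    .filter(t => t.resolvedAt)
    .map(t => (t.resolvedAt!.getTime() - t.createdAt.getTime()) / 3_600_000)
    .sort((a, b) => a - b);
  return hrs[Math.floor(hrs.length / 2)] ?? 0;
}

function handoffHealth(tickets: Ticket[]) {
  const handoff = tickets.filter(t => t.viaBotHandoff);
  const direct = tickets.filter(t => !t.viaBotHandoff);

  // Smell #2: the human re-asking the question the bot already had.
  const reconfirmRate =
    handoff.filter(t => /just to confirm|you're asking about/i.test(t.firstHumanReply)).length /
    (handoff.length || 1);

  return {
    postHandoffTtrHours: medianHoursToResolve(handoff),  // should be <= directTtrHours
    directTtrHours: medianHoursToResolve(direct),
    reconfirmRate,                                       // rising value = dropped context payload
  };
}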

Compliance Gates the Whole Stack

Auth-gated tools and audit logs aren't a security nice-to-have. They're how you survive SOC 2, GDPR, and CCPA when the auditors ask their hardest questions. The four most asked, with where each one lands in the architecture:

  1. "Show me everything you have on customer X" (GDPR Article 15, CCPA right-to-know). Lands in: the audit log of retrievals, tool calls, and transcripts, joined on userId.
  2. "Delete everything you have on customer X" (GDPR Article 17, CCPA right-to-delete). Lands in: a cascade delete across transcripts, vector-store embeddings, cached tool outputs, and scorecards.
  3. "Who can see what" (SOC 2 Confidentiality TSC). Lands in: tools that refuse without a verified session JWT, with tenant scope enforced server-side.
  4. "Prove the AI didn't read someone else's data" (SOC 2 Processing Integrity, GDPR Article 5). Lands in: the per-tool audit log with userId, workspaceId, query, and timestamp.

The right-to-erasure case is the one most teams miss. When a customer requests deletion, the transcript goes; most ticketing systems have a delete API. The vector embeddings of that transcript do not, unless you wrote the cascade. Pinecone, Weaviate, and pgvector all support deletion by metadata filter; you have to invoke it. Same for any cached tool outputs (webhook logs, billing snapshots) that landed in your conversation context. If your bot can re-answer the question after the deletion request, you didn't delete enough.
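A sketch of that cascade, assuming pgvector for the embeddings (a postgres.js-style sql tag) and the hypothetical db, cache, and audit modules from earlier; your ticketing delete API and cache layer will differ:

compliance/erase-customer.ts·typescript
// Everything the bot could use to re-answer a question about this customer
// has to go, not just the ticket transcript.
async function eraseCustomer(customerId: string, workspaceId: string) {
  // 1. Transcripts and session rows in the primary store.
  await db.chatSessions.deleteMany({ customerId });

  // 2. Embeddings, by metadata filter. pgvector shown; Pinecone and
  //    Weaviate expose equivalent filtered deletes.
  await sql`DELETE FROM kb_chunks WHERE metadata->>'customerId' = ${customerId}`;

  // 3. Cached tool outputs (webhook logs, billing snapshots) that landed in context.
  await cache.deleteByPrefix(`tool:${workspaceId}:${customerId}:`);

  // 4. Scorecards and per-customer memory keyed on the same id.
  await db.scorecards.deleteMany({ customerId });

  // Keep the audit record of the erasure itself: you'll be asked to prove
  // it happened and when.
  audit.log({ event: 'erasure_completed', customerId, workspaceId });
}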

CCPA adds California-specific notice requirements at the start of the chat. A one-liner like "this chat may be reviewed for support quality and fed into our AI agent" plus a link to your privacy policy clears most of it. SOC 2 Type II is increasingly the price of admission for B2B enterprise contracts; the auth-gated tool pattern above is most of what an auditor wants to see for the Confidentiality TSC. If you serve healthcare or financial customers who pass through PHI or PCI data, that's another layer (BAA, scoped access logs) on top, but everything in this piece is the foundation.

One last anti-pattern. Do not auto-train your bot on every human-resolved ticket. Resolved tickets contain workarounds, expired pricing, one-off exceptions, and partial information. Auto-publishing them into the KB drifts the agent's grounded truth fast. The right pattern is suggest-then-curate: a pipeline proposes KB additions from human-resolved tickets, a human reviews and tags each one, and only then does it enter the index. This is the difference between a bot that gets sharper over time and a bot that hallucinates more confidently every quarter.
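The loop is small enough to sketch end to end. Assumptions here: a kbProposals queue table and a kb.addArticle call stand in for your own storage and index APIs:

kb/suggest-then-curate.ts·typescript
type ResolvedTicket = { id: string; question: string; resolutionNotes: string };

// Propose: an LLM drafts a KB article from a human-resolved ticket.
// Nothing enters the index until a named human approves the edited draft.
async function proposeFromResolvedTicket(ticket: ResolvedTicket) {
  const draft = await llm.complete({
    system:
      'Draft a KB article from this resolved ticket. Generalize it: strip customer names, ' +
      'one-off workarounds, expired pricing, and anything account-specific.',
    query: `Question: ${ticket.question}\nResolution: ${ticket.resolutionNotes}`,
  });
  await kbProposals.insert({ ticketId: ticket.id, draft, status: 'pending_review' });
}

async function approveProposal(proposalId: string, reviewerId: string, editedDraft: string) {
  // Only the human-edited text reaches the index, tagged with who approved it.
  await kb.addArticle({ text: editedDraft, approvedBy: reviewerId, sourceProposal: proposalId });
  await kbProposals.update(proposalId, { status: 'published', reviewerId });
}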

How Chanl Compresses This

The architecture above is buildable in any stack. We've been writing the same plumbing (auth-gated retrieval, confidence gates, escalation payloads, scorecard loops) into customer codebases for two years. At some point we stopped retyping it.

Chanl is the result. AI agents that remember each customer, with auth-gated tools, per-customer memory, and calibrated escalation already wired. The same six steps from this article, expressed with @chanl/sdk:

chanl-tier1-agent.ts·typescript
import { Chanl } from '@chanl/sdk';
 
const chanl = new Chanl({ apiKey: process.env.CHANL_API_KEY });
 
// 1. Workspace-scoped KB. Vectors are tenant-isolated by default.
await chanl.knowledge.create({
  source: 'url',
  url: 'https://docs.acme.com',
  workspaceId: customer.workspaceId,
});
 
// 2. Auth-gated tool. Session token forwards from chat runtime to tool execution.
await chanl.tools.create({
  name: 'fetch_webhook_logs',
  description: "Fetch the calling customer's recent webhook deliveries.",
  auth: 'session-pass-through',
  endpoint: 'https://api.acme.com/v1/webhooks/logs',
});
 
// 3. Hybrid search with tenant filter. Confidence floors enforced by the SDK.
const hits = await chanl.knowledge.search({
  query: userMessage,
  mode: 'hybrid',
  filters: { workspaceId: customer.workspaceId, customerTier: customer.tier },
  abstainBelow: 0.62,
});
 
// 4. Per-customer memory. Survives across sessions, stays in the customer's scope.
await chanl.memory.create({
  customerId: customer.id,
  content: 'Prefers terse answers; previously hit webhook signature errors.',
});
 
// 5. Score the session on the four axes that matter.
const scores = await chanl.scorecards.evaluate({
  callId: chatSessionId,
  axes: ['recall_sufficient', 'confidence_calibrated', 'escalation_correct', 'resolved_without_human'],
});
 
// 6. Run escalation-trigger personas before shipping new prompts.
await chanl.scenarios.runAll({ tag: 'escalation-triggers' });

What's compressed here. The session-token pass-through that took 25 lines of JWT-and-context-plumbing earlier is auth: 'session-pass-through'. The three-floor confidence gate is abstainBelow. Per-customer memory replaces the manual customer-context injection that most teams build by hand. The scorecard runs the four axes (recall sufficient, confidence calibrated, escalation correct, resolved without human) that map directly to resolved deflection.

A few honest notes on the snippet above. auth: 'session-pass-through' and abstainBelow on knowledge.search are the ergonomic shape we're moving toward, not all-shipped today; the underlying primitives (per-tool auth scoping, minScore on search) are. chanl.scorecards.evaluate works against callId; the axes array is shorthand for selecting which scorecard criteria to score on. And chanl.knowledge.suggestFromTranscript() for the suggest-then-curate KB loop is on our roadmap, not in @chanl/sdk today: if you want that pattern now, you write the curation pipeline yourself and feed approved articles into chanl.knowledge.create. We're collecting feedback on the API shape; if you have an opinion, we want to hear it.

What to Stop Doing Today

Three things to remove from your existing bot before you ship anything new:

  1. Reporting raw deflection without re-contact. Add the seven-day re-contact join to your dashboard this sprint. If it's a 15-point gap, your "deflection" is mostly attrition.
  2. Auto-training on every resolved ticket. Insert a curation step. A human reads, tags, and approves before it enters the KB.
  3. Generic-docs-only KB. If your agent can't answer "why did MY webhook fail at 3:14pm," it's a marketing brochure, not a support agent. Add at least one auth-gated tool against your real customer data this quarter.

The bots that actually deflect tickets are the ones honest about when they don't know. The metric that shows you they're honest is resolved deflection, and the customers who feel it are the ones who didn't have to ask twice.

AI agents that remember each customer.

Auth-gated knowledge, calibrated confidence, escalation that doesn't lose context. Score each session on resolved deflection, not raw.

See how Chanl scores chat agents
Dean Grover · Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

The Signal Briefing

One email a week. How leading CS, revenue, and AI teams are turning conversations into decisions. Benchmarks, playbooks, and what works in production.

500+ CS and revenue leaders subscribed
