Agent Architecture

Stop Storing Transcripts. Start Modeling Signals.

A JSON blob of transcripts works at 1k calls and collapses at 50k. Design a Signal schema with entity/event split, confidence, provenance, and versioning.

Dean Grover, Co-founder
April 16, 2026
14 min read
Watercolor illustration of an engineer at a desk reviewing wall of screens with charts and signals — schema design in the Arrival-inspired sage & olive palette

Somewhere around conversation number twelve thousand, the Slack message arrives. "Can you pull a list of at-risk customers who mentioned pricing in the last thirty days?" You open the query. Your conversations table has one column that matters, raw, and it holds a JSON blob that looks like this.

conversation.raw·json
{
  "transcript": "Yeah honestly we're looking at alternatives, the pricing...",
  "sentiment": "negative",
  "intent": "churn_risk",
  "summary": "Customer frustrated about pricing"
}

At a thousand conversations this was fine. You wrote a few scripts, showed some charts, looked competent. At twelve thousand, three things are simultaneously true. You cannot answer the Slack question without re-running an LLM over every document. You cannot tell whether "churn_risk" was the right label because you cannot see why the model decided that. And the "sentiment" field has started drifting because three weeks ago someone switched extraction models and nobody wrote it down.

This article is about fixing that, not by adding more columns, but by modeling conversations as data. Not as text with some flags bolted on. Actual data, with entities, events, confidence, provenance, versions, and indexes. You will build a Signal schema step by step with me, in Zod and Mongoose, starting from the mess above. By the end you will have something that survives a year of real use, a new extraction model, a GDPR deletion request, and a VP asking why cohort X trended down.

The blob was never really a schema

Almost every team building conversation analytics starts the same way. Dump the transcript, run an LLM, stuff the output into one document, call it a day. The appeal is obvious: it takes an afternoon, it demos well, and you avoid arguing about data modeling with anyone. The cost shows up on a very specific schedule, usually around the four to six month mark, always through a question you cannot answer without re-processing history.

The original sin is conflating two different things into one document. A conversation is an event that happened at a moment in time. A customer is an entity that persists across many such events. Smashing them together means you cannot ask entity-level questions without scanning every event and reassembling the state yourself. The first real design move is splitting them apart. Here is the minimal version using Zod for the types and Mongoose for storage.

schema/conversation.ts·typescript
import { z } from 'zod';
 
export const Conversation = z.object({
  id: z.string(),
  workspaceId: z.string(),
  customerId: z.string(),
  channel: z.enum(['voice', 'chat', 'email', 'sms']),
  direction: z.enum(['inbound', 'outbound']),
  startedAt: z.string().datetime(),
  endedAt: z.string().datetime(),
  transcript: z.array(z.object({
    speaker: z.enum(['customer', 'agent', 'bot']),
    text: z.string(),
    startedAt: z.string().datetime(),
  })),
});
export type Conversation = z.infer<typeof Conversation>;

A conversation is now self-contained, time-bounded, and joinable to a customer by customerId. The transcript lives as a structured array, not a single string, which matters later when you want to point at exactly which turn made the model say "churn risk." Note what is not here: no sentiment, no intent, no summary. Those are extractions, not source data, and they belong somewhere else with their own lifecycle.

The customer entity is deliberately skeletal. Resist the temptation to denormalize current-state flags (is_vip, last_sentiment) onto it. Those are projections over events, not facts, and you will regret them the first time you need to change how they are computed.

schema/customer.ts·typescript
import { z } from 'zod';
 
export const Customer = z.object({
  id: z.string(),
  workspaceId: z.string(),
  externalId: z.string().nullable(),
  email: z.string().email().nullable(),
  createdAt: z.string().datetime(),
  updatedAt: z.string().datetime(),
});
export type Customer = z.infer<typeof Customer>;

Two documents, one relationship, no extractions yet. You can already answer "show me all conversations for customer X in the last thirty days" with a single indexed query, which the blob schema could not do. But you still cannot answer the pricing question, because the interesting signal, what the conversation was about, has nowhere to live. That is the next piece.
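Keeping current-state flags off the customer document does not mean losing them; they become read-time projections over events. A minimal sketch of that idea, assuming a simplified event shape (SentimentEvent here is illustrative, not part of the schemas above):

```typescript
// Sketch: derive "current sentiment" as a projection over events instead of
// storing it as a mutable flag on the customer document. Event shape is a
// simplified assumption for illustration.
interface SentimentEvent {
  customerId: string;
  sentiment: 'positive' | 'neutral' | 'negative';
  occurredAt: string; // ISO datetime
}

function latestSentiment(events: SentimentEvent[], customerId: string): string | null {
  const mine = events
    .filter(e => e.customerId === customerId)
    // ISO-8601 strings in the same format sort correctly lexicographically
    .sort((a, b) => b.occurredAt.localeCompare(a.occurredAt));
  return mine.length > 0 ? mine[0].sentiment : null;
}
```

When the definition of "current sentiment" changes, you change this function and every reader updates at once; there is no stale denormalized column to backfill.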

Where should extracted fields actually live?

The natural thing to do next is start adding fields to the conversation document. intent, sentiment, topics, summary. Resist this too. Extractions have different properties from source data. They are produced by a model, they have a confidence, they can be redone when the model improves, and they can disagree with each other when you run two models in parallel. Jamming them into the same document as the transcript conflates "what happened" with "what the model currently thinks about what happened."

The cleaner move is a separate extractions document keyed by conversation, with one row per extraction field. Research on LLM structured output reliability shows per-field confidence scores catch errors with much better precision than a single whole-response score, and you cannot store per-field confidence cleanly if everything is one flat object. Start with the shape, not the query patterns.

schema/extraction.ts·typescript
import { z } from 'zod';
 
export const ExtractedField = z.object({
  value: z.unknown(),
  confidence: z.number().min(0).max(1),
  sourceSpan: z.object({
    turnIndex: z.number().int().min(0),
    startChar: z.number().int().min(0),
    endChar: z.number().int().min(0),
  }).nullable(),
  modelVersion: z.string(),
  extractedAt: z.string().datetime(),
});
export type ExtractedField = z.infer<typeof ExtractedField>;

That shape is the whole contract. Every extracted fact carries its value, how sure the model was, where in the transcript it came from, which model version produced it, and when. Without these four fields you cannot debug, cannot audit, and cannot compare two runs. With them, your extractions are first-class data that survive the model drift that is absolutely going to happen.
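Because modelVersion travels with every field, comparing two extraction runs becomes a join, not an archaeology project. A sketch under simplified shapes (RunRecord is illustrative; value is narrowed to a string):

```typescript
// Sketch: compute how often two model runs agree on the same conversations.
// Shapes are simplified assumptions for illustration.
interface RunRecord {
  conversationId: string;
  modelVersion: string;
  value: string;
}

function agreementRate(a: RunRecord[], b: RunRecord[]): number {
  const byId = new Map(b.map(r => [r.conversationId, r.value] as [string, string]));
  const shared = a.filter(r => byId.has(r.conversationId));
  if (shared.length === 0) return 0;
  const agree = shared.filter(r => byId.get(r.conversationId) === r.value).length;
  return agree / shared.length;
}
```

A low agreement rate between an old and a new model version is the cheapest possible signal that a swap needs human review before it reaches dashboards.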

Now compose that into a per-conversation extraction document. Keep field names stable and add new fields as optional. The schemaVersion field at the top is the piece that lets you evolve without breaking old readers.

schema/extraction.ts·typescript
export const ConversationExtraction = z.object({
  conversationId: z.string(),
  workspaceId: z.string(),
  schemaVersion: z.literal(3),
  intent: ExtractedField.optional(),
  sentiment: ExtractedField.optional(),
  mentionedProducts: z.array(ExtractedField).optional(),
  topic: ExtractedField.optional(),
  resolution: ExtractedField.optional(),
});
export type ConversationExtraction = z.infer<typeof ConversationExtraction>;

The pricing question from the opening paragraph is now answerable. You query extractions where mentionedProducts.value contains "pricing", join to conversations, join to customers, and filter on confidence. Not free, but straightforward, and you can do it in a single aggregation without re-running any LLM. The confidence threshold is the hidden lever that matters most in practice, which brings us to how confidence and provenance actually work.
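As a sketch, that aggregation might look like the following, expressed as a pipeline builder so the stages can be inspected without a database. Collection and field names follow the schemas above; the $elemMatch shape, the 0.7 default threshold, and the stage ordering are illustrative assumptions, not a tuned query:

```typescript
// Sketch: build the "customers who mentioned pricing recently" pipeline.
// Thresholds and stage ordering are illustrative, not tuned.
function pricingAtRiskPipeline(
  workspaceId: string,
  since: Date,
  minConfidence = 0.7,
): Record<string, unknown>[] {
  return [
    // 1. High-confidence "pricing" mentions in this workspace
    { $match: {
        workspaceId,
        mentionedProducts: {
          $elemMatch: { value: 'pricing', confidence: { $gte: minConfidence } },
        },
      } },
    // 2. Join each extraction back to its conversation event
    { $lookup: { from: 'conversations', localField: 'conversationId', foreignField: 'id', as: 'conversation' } },
    { $unwind: '$conversation' },
    // 3. Time-bound, then collapse to distinct customers
    { $match: { 'conversation.startedAt': { $gte: since.toISOString() } } },
    { $group: { _id: '$conversation.customerId' } },
  ];
}
```

You would hand this array to `extractions.aggregate(...)`; the point is that the entire question lives in one declarative pipeline rather than in an LLM re-run.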

Confidence you can actually use

A confidence score is only as useful as its calibration. The 2024 NAACL survey of LLM calibration and more recent layer-level studies make a specific warning: supervised fine-tuned models tend to be reasonably calibrated, but RLHF-trained models (PPO, DPO, GRPO) systematically overstate their certainty. If your model says 0.92 and it is actually right 72% of the time at that threshold, filtering by confidence is worse than random. So you need two things. You need the raw confidence from the model, and you need a way to verify that confidence maps to reality.

The raw confidence side is a prompt and schema issue. Ask the model for a per-field score, not a whole-response score, and validate it with Zod so you cannot silently accept malformed output. Most providers now return structured output natively; Zod closes the runtime gap by catching the rare case the model disagrees with its own schema.

extractors/intent.ts·typescript
import { z } from 'zod';
 
const IntentExtraction = z.object({
  intent: z.enum(['cancel', 'renew', 'expand', 'question', 'complaint']),
  confidence: z.number().min(0).max(1),
  evidenceTurnIndex: z.number().int(),
  evidenceStartChar: z.number().int(),
  evidenceEndChar: z.number().int(),
});
 
// `llm` stands in for your provider client; most providers now expose a
// native structured-output call shaped roughly like this.
export async function extractIntent(transcript: string) {
  const raw = await llm.structuredOutput({
    schema: IntentExtraction,
    prompt: INTENT_PROMPT,
    input: transcript,
  });
  return IntentExtraction.parse(raw);
}

That function cannot return anything except a valid extraction or a thrown error. Which is exactly what you want, because the third state ("the model returned something vaguely intent-shaped") is how silent corruption enters your database. Now take its output and store it with the provenance wrapper from the previous section.

The calibration side is a separate evaluation loop (we'll get there), but the storage shape has to support it. Provenance is a discipline: modelVersion should be the full identifier (gpt-4o-2024-08-06, not "gpt"), sourceSpan should point at exact character offsets so you can reproduce the extraction, and you never mutate an extraction in place. A new run writes a new document; old runs stay queryable. This is how you avoid the "who changed what, when, and why" forensic exercise six months in.
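The never-mutate rule means "current" becomes a read-time decision rather than an overwrite. A minimal sketch of picking the latest run, with a simplified ExtractionRun shape assumed for illustration:

```typescript
// Sketch: extractions are append-only; "current" is whichever run is newest.
// Old runs stay in the collection and remain queryable for audits and diffs.
interface ExtractionRun {
  conversationId: string;
  modelVersion: string; // full identifier, e.g. 'gpt-4o-2024-08-06'
  extractedAt: string;  // ISO datetime
}

function latestRun<T extends ExtractionRun>(runs: T[], conversationId: string): T | null {
  const candidates = runs.filter(r => r.conversationId === conversationId);
  if (candidates.length === 0) return null;
  // ISO timestamps compare correctly as strings
  return candidates.reduce((a, b) => (a.extractedAt >= b.extractedAt ? a : b));
}
```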


If you only implement two things from this article, make them per-field confidence and source spans. Everything downstream, dashboards, scorecards, monitoring, audit trails, gets better the moment extractions know where they came from.

When should a field be an enum versus a free-text string?

Use closed vocabulary (enums) for anything you filter, group, or alert on exactly. Use open vocabulary (free-text strings) for anything emergent where the taxonomy is still being discovered. Mix both in the same schema, but label each field explicitly so readers know what they are dealing with. The table below is the mental model.

Field use case | Vocabulary | Why
intent, channel, direction, resolution_status | Closed enum | Filtered and grouped daily; alerts fire on them; dashboards expect stable buckets
topic, mentioned_product, custom_tag, reason_category | Open string | Emergent; the product team keeps inventing new ones; needs clustering before it is safe to enum
sentiment, urgency | Closed enum with "other" | Queryable, but rare categories do show up; the escape hatch prevents forced miscategorization
customer_quote, promised_followup_text | Open string | No taxonomy makes sense; pure free text with full-text search

The trap is deciding everything has to be an enum. You will be tempted because enums feel safe and queryable. They are, until the product team invents a new category every week and your migration becomes a monthly chore. Split fields by how you intend to use them, not by which feels more disciplined.

schema/vocabulary.ts·typescript
import { z } from 'zod';
 
export const Intent = z.enum([
  'cancel', 'renew', 'expand', 'question', 'complaint', 'other',
]);
 
export const Topic = z.object({
  raw: z.string().min(1).max(120),
  normalized: z.string().nullable(),
  clusterId: z.string().nullable(),
});

Intent is a closed set with an other escape hatch. Dashboards group on it, alerts trigger off it, scorecards reference it. Topic stays open: the model returns whatever phrase the customer actually used, then a separate normalization job clusters similar strings, assigns a cluster ID, and optionally writes back a canonical label. This is the pattern from open-vocabulary NER research, adapted for operational data.

The important property is that both types live behind the same ExtractedField wrapper from earlier. Confidence, provenance, and source span apply identically. The difference is in how you query them. Enums go in composite indexes. Open-vocab fields get a text index for search plus a pointer to the cluster.

analytics/cluster-topics.ts·typescript
// `db`, `embedAndCluster`, and `writeBackClusterIds` are placeholders for
// your driver handle, your embedding + clustering step, and the job that
// stamps clusterId back onto each extraction.
async function reclusterTopics(workspaceId: string) {
  const topics = await db.extractions.aggregate([
    { $match: { workspaceId, 'topic.value': { $exists: true } } },
    { $project: { raw: '$topic.value.raw', conversationId: 1 } },
  ]).toArray();
  const clusters = await embedAndCluster(topics.map(t => t.raw));
  await writeBackClusterIds(clusters);
}

Open-vocab fields crystallize into enums when they stabilize. If 87% of topics over six months fall into nine clusters, promote those clusters to enum values on the next schema version and keep the long tail open. This is how taxonomies grow without surprise migrations.
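The promotion decision itself can be mechanical. A sketch, assuming you already have per-cluster conversation counts; the 80% coverage cutoff is an illustrative default, not a recommendation:

```typescript
// Sketch: promote the smallest set of top clusters that together cover a
// target share of all topic mentions; the long tail stays open-vocab.
function clustersToPromote(counts: Map<string, number>, coverage = 0.8): string[] {
  const total = Array.from(counts.values()).reduce((a, b) => a + b, 0);
  if (total === 0) return [];
  const sorted = Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
  const promoted: string[] = [];
  let covered = 0;
  for (const [clusterId, n] of sorted) {
    if (covered / total >= coverage) break;
    promoted.push(clusterId);
    covered += n;
  }
  return promoted;
}
```

Running this on each recluster pass turns "when do we enum this?" from a debate into a report.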

A blunter way to put it: open-vocab fields without a normalization job running behind them are a two-quarter timebomb. Ship the clustering loop the same week you ship the extractor, even if it is a cron that runs weekly and writes to a single clusterId column. Otherwise the product team will file a Looker ticket asking why "billing" and "invoice" are separate rows, and you will have no answer that does not start with "well, about six months ago..."

Schemas drift. Plan the version field now.

Every extraction pipeline you ship will need a breaking change within a year. Someone adds a new intent. Someone realizes sentiment should be multi-dimensional. Someone swaps models and confidences shift. If you have not planned for this, your options are "migrate everything under pressure" or "ignore the drift and hope." Neither is good. The version field is what gives you a third option.

Every document gets a schemaVersion. Readers switch on it. Writers only produce the latest version. Old documents stay readable because the reader knows which shape they are in. Under the hood, the expand-contract pattern is what makes this safe in production.

readers/extraction-reader.ts·typescript
import type { ConversationExtraction } from '../schema/extraction';
 
export function normalize(doc: any): ConversationExtraction {
  if (doc.schemaVersion === 3) return doc;
  if (doc.schemaVersion === 2) {
    return { ...doc, schemaVersion: 3, resolution: undefined };
  }
  if (doc.schemaVersion === 1) {
    const v2 = migrateV1toV2(doc);
    return normalize(v2);
  }
  throw new Error(`Unknown schemaVersion ${doc.schemaVersion}`);
}

That reader is boring on purpose. It handles historical shapes one by one, each migration step pure and testable. The database itself stays mixed-version for as long as you need. You add the new field in a new version, backfill at your own pace, dual-write during transition, and only drop the old field when nothing reads it anymore.
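For concreteness, here is one way a step like migrateV1toV2 might look, assuming a hypothetical v1 that stored sentiment as a flat string. The v1 and v2 shapes below are simplified illustrations, not the full schema from earlier:

```typescript
// Sketch: one pure, testable migration step. The flat-string v1 shape is an
// assumed example; the point is the explicit "we don't know" provenance
// markers on data that predates the ExtractedField wrapper.
interface V1Doc { schemaVersion: 1; conversationId: string; sentiment?: string }
interface V2Doc {
  schemaVersion: 2;
  conversationId: string;
  sentiment?: { value: string; confidence: number; modelVersion: string };
}

function migrateV1toV2(doc: V1Doc): V2Doc {
  return {
    schemaVersion: 2,
    conversationId: doc.conversationId,
    sentiment: doc.sentiment === undefined
      ? undefined
      // confidence 0.5 and 'unknown-pre-v2' are deliberate unknown-markers,
      // so downstream confidence filters treat legacy data as low-trust
      : { value: doc.sentiment, confidence: 0.5, modelVersion: 'unknown-pre-v2' },
  };
}
```

Because the step is a pure function, it gets unit tests like any other code, and normalize() chains it without ever touching the database.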

A practical note on storage. Mongoose schemas should mirror Zod but stay loose about unknown fields, because being strict at the database layer turns migrations into downtime events. Validate with Zod at the edges (writes and reads), keep Mongoose permissive in the middle.

schema/extraction.mongoose.ts·typescript
import { Schema } from 'mongoose';
 
export const ExtractionSchema = new Schema({
  conversationId: { type: String, required: true, index: true },
  workspaceId: { type: String, required: true, index: true },
  schemaVersion: { type: Number, required: true },
  intent: { type: Schema.Types.Mixed },
  sentiment: { type: Schema.Types.Mixed },
  mentionedProducts: { type: [Schema.Types.Mixed] },
  resolution: { type: Schema.Types.Mixed },
}, { strict: false, timestamps: true });

strict: false sounds dangerous and is. The safety comes from Zod validating at the application boundary, not Mongoose validating at the driver boundary. Strict application-layer validation plus permissive storage is how you get migrations that don't require taking the database down. If you want versioned migrations tracked in MongoDB itself (instead of in-app branching like above), Mongock is the Flyway-style option most production teams end up reaching for. Use it for data-shape changes; keep the reader-side normalize() for runtime shape negotiation. They solve different halves of the same problem.
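A sketch of what the expand-phase writer looks like, with an assumed legacy field name (resolutionStatus is hypothetical, purely for illustration) kept alive for old readers until the contract phase drops it:

```typescript
// Sketch: during expand, the writer emits both the new field and a legacy
// alias, so v2 and v3 readers both stay correct. The alias is dropped only
// in the contract phase, once nothing reads it.
interface V3Write { schemaVersion: 3; conversationId: string; resolution?: string }

function dualWrite(doc: V3Write): Record<string, unknown> {
  return {
    ...doc,
    // legacy alias for readers that predate schemaVersion 3
    resolutionStatus: doc.resolution,
  };
}
```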

Which indexes do conversation analytics actually need?

Schemas don't feel broken until someone runs a real query. The question "show me at-risk customers who mentioned pricing in the last thirty days" is three joins and two filters under the hood. If your indexes don't match how CX teams actually ask questions, every dashboard is a sequential scan. Three patterns cover about 80% of real analytics use, and they all fit in a handful of indexes.

db/indexes.ts·typescript
// `conversations` and `extractions` are collection handles from your driver
conversations.createIndex({ workspaceId: 1, startedAt: -1 });
conversations.createIndex({ customerId: 1, startedAt: -1 });
extractions.createIndex({ workspaceId: 1, 'intent.value': 1, 'intent.confidence': -1 });
extractions.createIndex({ conversationId: 1 });
extractions.createIndex({ 'topic.value.clusterId': 1, workspaceId: 1 });

The first index powers "all conversations in my workspace sorted by recency." The second powers "customer timeline." The third powers "show me high-confidence churn-risk calls this week." The fourth is the join key for enriching conversations with their extractions. The fifth is the bridge between open-vocab topics and analytics.

What is deliberately missing: no per-field index on sentiment, no index on mentionedProducts. Those fields get queried rarely enough that a scan inside the filtered result set is fine. Rule of thumb: don't create an index until a real dashboard is slow. Speculative indexes are storage overhead with no query benefit.

If you find yourself defining the same "at-risk customer" query in six places (Looker, an alert, a cohort export, a Hex notebook, two Slack bots), the Signal schema is telling you to put a metrics layer in front of it. Feast for feature-store-style reuse if extractions feed an ML model, dbt's Semantic Layer for BI-style metric definitions that the CX team can share. Either beats ten slightly-different SQL strings drifting in ten different tools.

Per-field evaluation, not per-response

Remember the calibration problem from earlier? The storage shape supports solving it, but only if you actually run the eval. The common mistake is treating extraction quality as one number. "Our extraction pipeline is 88% accurate." That number is a lie by averaging. In practice, an extraction pipeline with 96% intent accuracy and 42% sentiment accuracy is common, and the average hides exactly the failure mode you care about. Treat every extracted field as its own model with its own eval.

evals/eval-intent.ts·typescript
const sample = await db.extractions.aggregate([
  { $match: { workspaceId, schemaVersion: 3, 'intent.value': { $exists: true } } },
  { $sample: { size: 300 } },
  { $lookup: { from: 'conversations', localField: 'conversationId', foreignField: 'id', as: 'c' } },
]).toArray();
const labeled = await requestHumanLabels(sample, 'intent');
const scores = scorePrecisionRecall(labeled);

That loop costs you a few hundred dollars in human labeling and gives you a real calibration curve. Plot predicted confidence against empirical accuracy. If the 0.9 bucket is actually right 95% of the time, great, filter at 0.9 with confidence. If it is right 68% of the time, your model is miscalibrated for this field and you need either a recalibration layer or a new model. This is the same discipline the Cleanlab trust score work recommends, adapted for operational pipelines.
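Computing the curve from the labeled sample is a few lines. A sketch, assuming each labeled row carries the model's reported confidence and whether the human label agreed:

```typescript
// Sketch: bucket labeled samples by predicted confidence and compute
// empirical accuracy per bucket. Input shape is an assumption; `correct`
// comes from the human labels in the eval loop above.
interface LabeledSample {
  confidence: number; // model-reported, 0..1
  correct: boolean;   // human label agreed with the extraction
}

function calibrationByBucket(rows: LabeledSample[], buckets = 10): number[] {
  const n: number[] = new Array(buckets).fill(0);
  const right: number[] = new Array(buckets).fill(0);
  for (const r of rows) {
    const i = Math.min(buckets - 1, Math.floor(r.confidence * buckets));
    n[i] += 1;
    if (r.correct) right[i] += 1;
  }
  // NaN marks buckets with no samples so a plot can skip them
  return n.map((count, i) => (count === 0 ? NaN : right[i] / count));
}
```

If the top bucket's empirical accuracy sits well below its nominal confidence, that field's model is overconfident and the threshold filter needs recalibrating.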

Connect this loop to the scorecard layer. Every scorecard dimension that depends on an extracted field should carry the eval results for that field, so reviewers know which ratings are backed by high-confidence extractions and which are soft. That is the job of Scorecards, and it only works if the underlying Signal schema is set up to support it.

Chanl's take on this

The Signal schema patterns above are the same ones we use inside Chanl, because they are the only shapes that survive production. Conversations stay split from customers, extracted fields carry their own confidence and source span, and schemaVersion is stamped on every document so readers don't guess. Monitoring charts respect those confidence thresholds, so a 0.41 sentiment score doesn't show up in a dashboard pretending to be a fact.

If you are using the Chanl SDK, the event side of this is what client.calls.* and the transcript APIs expose. Extractions surface with the value, confidence, and source_span shape, which is why scorecards and analytics can compute on them without each consumer re-parsing blobs.

using-chanl-sdk.ts·typescript
import { ChanlSDK } from '@chanl/sdk';
 
const chanl = new ChanlSDK({ apiKey: process.env.CHANL_API_KEY! });
const { data } = await chanl.calls.import({
  transcript: [
    { speaker: 'customer', text: 'I want to cancel.', startTime: 0, endTime: 1800 },
  ],
  analysisFields: ['sentiment', 'topics', 'extraction'],
});
// data.interactionId is populated; extractions land as a separate
// record keyed by that id once the pipeline finishes.

The import call kicks off the Signal-shaped extraction pipeline asynchronously; you read the result back via chanl.calls.get(id) once it completes. Building the equivalent in-house is a real project, and the part worth keeping in-house is usually not the plumbing. The plumbing is what you want to be boring.

If this resonates with how you think about agent data, it also connects to the memory work in From Session Context to Long-Term Knowledge and the deeper Build Your Own AI Agent Memory System walkthrough. Memory and extractions share the same discipline: typed fields, provenance, confidence, and migration paths. Treat them as the same problem and the architecture collapses into something coherent.

The floor, not the ceiling

Data modeling did not go away when LLMs arrived. It got harder. The disciplines that make transactional schemas survive a decade (normalize what persists, denormalize what you query, version what changes, index what you search) apply directly to conversation data. The twist is that LLMs are now writers in your system, not just readers, and their outputs need confidence and provenance that a human-entered field would never need.

Treat the Signal schema above as a floor, not an endpoint. The entity/event split, the confidence and source-span wrapper around every extraction, the schemaVersion stamp, the open/closed vocab distinction, the per-field eval loop: those are the parts you will wish you had started with. The payoff shows up six months in, when that Slack question about at-risk customers lands again. The answer is one aggregation with a confidence filter, not a weekend of re-processing twelve thousand transcripts. Remember the opening blob? That is what you just stopped shipping.

Frequently Asked Questions