The dashboard says P95 is 800 milliseconds across speech-to-text, LLM, and text-to-speech. P95 means "the time 95 percent of calls finish under," so the dashboard is showing you the typical case. Every panel green. Then a support ticket lands at 9:14am. "The agent froze for four seconds in the middle of confirming my appointment, and I hung up." You pull the trace. Total round-trip on that turn was 4,180 milliseconds. The dashboard never flagged it because that call lived in the tail, the slowest few percent of calls, and nobody was watching the tail.
This is the math your voice latency dashboard probably gets wrong, and how to instrument, aggregate, and alert on the metric customers actually feel. We'll wrap each stage of every turn in a span (a timed slice of work, the unit of distributed tracing) with OpenTelemetry and Pipecat, aggregate the timings with histograms, plot the joint CDF (the curve answering "what fraction of full turns finished under each time T"), and set the SLO (the latency budget you publicly commit to) at P99.9, meaning the slowest 1 call in 1,000, because that's the only number that maps to a real call.
| What you will build | What you will learn |
|---|---|
| OpenTelemetry span tree per utterance | How to attribute latency to the stage that owns it |
| Histogram aggregator | Why averaging percentiles is wrong and what to do instead |
| Joint CDF dashboard | What end-to-end latency actually looks like |
| Tail attribution helper | Which stage is responsible for your P99.9 today |
| P99.9 alerting recipe | An SLO that maps to what customers feel |
Why per-stage P95 lies
Most teams report P95 per stage and mentally add the numbers up, as if the whole system behaved like the sum of typical cases. It does not. End-to-end latency is the per-turn combination of every stage's latency, and the combined percentile is dominated by the tail. A turn is only as fast as its slowest stage, and the slowest stage is rarely the same one twice in a row.
Back-of-an-envelope: say STT, LLM, and TTS each finish under their stage budget on 95 percent of utterances. Treat the tails as independent (a generous assumption, they're usually correlated through shared resources). The probability that an utterance finishes under budget end-to-end is the product:
```
P(end-to-end fast) = P(STT fast) * P(LLM fast) * P(TTS fast)
                   = 0.95 * 0.95 * 0.95
                   = 0.857
```

So a stack where every panel reads "P95 green" misses the budget on roughly 14 percent of calls. Push to four stages including a transport hop and you're at 81 percent. Going the other direction: to get 99.9 percent of end-to-end calls under budget across three stages, each stage needs to be under budget 99.97 percent of the time. The tail you tolerate per stage has to be much narrower than the tail you tolerate end-to-end.
Independence is also a generous assumption in the wrong direction. Stage tails are usually correlated, because regional GPU pressure or a noisy neighbor hits STT, LLM, and TTS at the same time, so heavy load on one stage coincides with heavy load on the others and the bad minutes stack instead of canceling out. Treat 0.857 as the optimistic ceiling.
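If you want to sanity-check that arithmetic yourself, a few lines of Python reproduce it under the same independence assumption (the function names are just for illustration):

```python
# Back-of-the-envelope check, assuming independent stage tails.
def end_to_end_ok(per_stage_ok):
    """Probability a turn lands under budget when every stage must."""
    p = 1.0
    for stage_p in per_stage_ok:
        p *= stage_p
    return p

def per_stage_target(end_to_end_goal, stages):
    """Per-stage under-budget rate needed to hit an end-to-end goal."""
    return end_to_end_goal ** (1 / stages)

print(f"{end_to_end_ok([0.95, 0.95, 0.95]):.3f}")   # 0.857
print(f"{end_to_end_ok([0.95] * 4):.3f}")           # 0.815
print(f"{per_stage_target(0.999, 3):.5f}")          # 0.99967, i.e. 99.97 percent per stage
```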
The metric that matters: utterance round-trip
Utterance round-trip is the time from the moment the user finishes their last word to the moment the first byte of TTS audio reaches their ear. Everything else is a diagnostic. The dashboard the customer would use, if they had one, is a histogram of utterance round-trips.
Each hand-off in that round trip (end of user speech, final transcript, first LLM token, end of the LLM stream, first byte of TTS audio) is a span boundary you should be recording.
There is a trap here. Measure each stage's median per frame and you fall straight into what Gil Tene named coordinated omission, the bias you get when slow events get under-counted because their downstream stalls aren't recorded. A stage that processes a thousand frames at 5ms each plus one frame at 800ms looks like P95 = 5ms if you measure per-frame and have enough frames. The 800ms stall is the customer's whole experience, and your histogram drowns it.
The fix is to anchor your measurement to the unit the customer perceives. One span per utterance, child spans per stage, every utterance recorded.
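The thousand-frames-plus-one-stall example is easy to reproduce. A synthetic sketch of the per-frame trap (made-up timings, numpy only for the percentile call):

```python
import numpy as np

# Synthetic per-frame timings for one stage: 1,000 frames at 5 ms plus one 800 ms stall.
frame_ms = np.array([5.0] * 1_000 + [800.0])

# Per-frame view: the stall is 1 sample in 1,001, so the percentiles never see it.
print(f"per-frame P95   = {np.percentile(frame_ms, 95):.0f} ms")    # 5 ms
print(f"per-frame P99.9 = {np.percentile(frame_ms, 99.9):.0f} ms")  # still 5 ms

# Per-utterance view: the turn that hit the stall waited at least the full 800 ms,
# which is the number the caller actually experienced.
print(f"worst stall in a turn = {frame_ms.max():.0f} ms")            # 800 ms
```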
Instrument the whole pipeline, not the parts
Pipecat and LiveKit both ship OpenTelemetry integrations that emit spans around their pipeline stages. The OpenTelemetry GenAI semantic conventions are the standard naming scheme for LLM calls; they give you a stable attribute set (gen_ai.system, gen_ai.request.model, gen_ai.response.id) so dashboards from different vendors can read the same trace. Wrap STT and TTS in your own spans using the same naming pattern.
The minimum useful instrumentation around a Pipecat pipeline looks like this. Set up a tracer once at process start, then wrap each utterance and each stage in spans that share a parent.
```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from pipecat.processors.frame_processor import FrameProcessor
from pipecat.frames.frames import (
    UserStoppedSpeakingFrame,
    TranscriptionFrame,
    LLMTextFrame,
    LLMFullResponseEndFrame,
    TTSStartedFrame,
    TTSAudioRawFrame,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter())
)
tracer = trace.get_tracer("voice-pipeline")


class UtteranceTracer(FrameProcessor):
    """Owns one span per utterance and child spans per stage."""

    def __init__(self):
        super().__init__()
        self._utterance_span = None
        self._stage_spans = {}

    async def process_frame(self, frame, direction):
        await super().process_frame(frame, direction)

        # User stops speaking: open the utterance span and the STT span.
        if isinstance(frame, UserStoppedSpeakingFrame):
            self._utterance_span = tracer.start_span("utterance")
            self._stage_spans["stt"] = tracer.start_span(
                "stt.first_byte",
                context=trace.set_span_in_context(self._utterance_span),
            )
        # Final transcript: close STT, open LLM TTFT. TranscriptionFrame is
        # the final variant by convention (interim results arrive as
        # InterimTranscriptionFrame).
        elif isinstance(frame, TranscriptionFrame):
            self._close("stt")
            self._stage_spans["llm_ttft"] = tracer.start_span(
                "llm.first_token",
                context=trace.set_span_in_context(self._utterance_span),
            )
        # First LLM token: close TTFT, open completion.
        elif isinstance(frame, LLMTextFrame) and "llm_ttft" in self._stage_spans:
            self._close("llm_ttft")
            self._stage_spans["llm_complete"] = tracer.start_span(
                "llm.complete",
                context=trace.set_span_in_context(self._utterance_span),
            )
        # LLM stream finished: close the completion span before TTS arrives,
        # so llm.complete reflects the LLM boundary, not the TTS boundary.
        elif isinstance(frame, LLMFullResponseEndFrame):
            self._close("llm_complete")
        # TTS starts synthesizing: open the TTS first-byte span. TTSStartedFrame
        # precedes the first audio frame, so the span captures TTS first-byte
        # latency instead of being zero-length.
        elif (
            isinstance(frame, TTSStartedFrame)
            and "tts" not in self._stage_spans
            and self._utterance_span is not None
        ):
            self._stage_spans["tts"] = tracer.start_span(
                "tts.first_byte",
                context=trace.set_span_in_context(self._utterance_span),
            )
        # First TTS audio byte: close the TTS span and the utterance.
        elif isinstance(frame, TTSAudioRawFrame) and self._utterance_span is not None:
            # Audio is reaching the transport now, end the utterance.
            self._close("tts")
            self._utterance_span.set_attribute("gen_ai.system", "pipecat")
            self._utterance_span.end()
            self._utterance_span = None

        await self.push_frame(frame, direction)

    def _close(self, key):
        span = self._stage_spans.pop(key, None)
        if span is not None:
            span.end()
```

That gives you one span tree per turn. Export to any OpenTelemetry backend (Tempo, Honeycomb, Jaeger, Datadog APM, Vercel observability) and you have the raw material for everything that follows. The tagging matters: every utterance span should also carry call_id, agent_id, model name, and region, because those are the cuts you will want when you find a bad period.
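A sketch of that tagging, written as a small helper you might call right before ending the utterance span in the processor above. Only gen_ai.request.model comes from the GenAI conventions; call.id and agent.id are our own naming, and cloud.region is borrowed from the OpenTelemetry resource conventions:

```python
# Hypothetical helper: stamp the cuts you'll want later onto the utterance span.
# The identifiers come from your own session state, not from Pipecat.
def tag_utterance(span, *, call_id: str, agent_id: str, model: str, region: str):
    span.set_attribute("call.id", call_id)
    span.set_attribute("agent.id", agent_id)
    span.set_attribute("gen_ai.request.model", model)
    span.set_attribute("cloud.region", region)
```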
Aggregate with histograms
Now you have spans. Do not store percentiles. Store histograms.
The reason: percentiles do not add or average like normal numbers. The average of two P95s is not a P95. The maximum of two P95s is not a P95 either. If you store P95 per minute and try to compute P95 per hour, you get a number that means nothing. HdrHistogram and t-digest are two latency-data structures that solve this; they preserve the distribution shape with bounded memory, and both are mergeable, so a per-shard summary can be combined into a global one.
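A synthetic demonstration of the difference, using the Python hdrhistogram package (hdrh), which exposes merging as add(). Two shards record the same minute; one of them is having a bad time:

```python
from hdrh.histogram import HdrHistogram

# Two per-shard histograms for the same minute, values in microseconds.
shard_a = HdrHistogram(1, 60_000_000, 3)
shard_b = HdrHistogram(1, 60_000_000, 3)
for _ in range(1_000):
    shard_a.record_value(200_000)   # healthy shard: 200 ms turns
    shard_b.record_value(900_000)   # overloaded shard: 900 ms turns

# Wrong: averaging the per-shard P95s invents a latency nobody experienced.
avg_of_p95 = (shard_a.get_value_at_percentile(95)
              + shard_b.get_value_at_percentile(95)) / 2

# Right: merge the histograms, then read the percentile off the merged data.
merged = HdrHistogram(1, 60_000_000, 3)
merged.add(shard_a)
merged.add(shard_b)

print(f"average of P95s = {avg_of_p95 / 1000:.0f} ms")                          # ~550 ms
print(f"merged P95      = {merged.get_value_at_percentile(95) / 1000:.0f} ms")  # ~900 ms
```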
A small Python aggregator pulls the OpenTelemetry spans you just emitted, builds a HdrHistogram per stage per minute, and computes the joint CDF for end-to-end utterance latency.
```python
from collections import defaultdict
from datetime import datetime, timezone
import json

from hdrh.histogram import HdrHistogram

# Each span: { utterance_id, name, start_ns, end_ns, attributes }
def load_spans(path):
    with open(path) as f:
        for line in f:
            yield json.loads(line)

# One HdrHistogram per (minute_bucket, stage). Range 1 us to 60 s, 3 significant digits.
HIST_PARAMS = (1, 60_000_000, 3)
per_stage = defaultdict(lambda: HdrHistogram(*HIST_PARAMS))
end_to_end = defaultdict(lambda: HdrHistogram(*HIST_PARAMS))

for span in load_spans("spans.jsonl"):
    minute = (span["start_ns"] // 60_000_000_000) * 60   # minute-aligned epoch seconds
    duration_us = (span["end_ns"] - span["start_ns"]) // 1_000
    if span["name"] == "utterance":
        end_to_end[minute].record_value(duration_us)
    else:
        per_stage[(minute, span["name"])].record_value(duration_us)

# Print the joint CDF on a sample minute.
sample_minute = next(iter(end_to_end))
hist = end_to_end[sample_minute]
print(f"minute={datetime.fromtimestamp(sample_minute, tz=timezone.utc).isoformat()}")
for pct in (50, 90, 95, 99, 99.9):
    print(f"  P{pct:>5} = {hist.get_value_at_percentile(pct) / 1000:.0f} ms")

# How does the joint compare to the naive sum of per-stage P95s?
def stage_p95_at(minute, name):
    return per_stage[(minute, name)].get_value_at_percentile(95) / 1000

stt = stage_p95_at(sample_minute, "stt.first_byte")
llm = stage_p95_at(sample_minute, "llm.first_token")
tts = stage_p95_at(sample_minute, "tts.first_byte")
print(f"naive sum of per-stage P95 = {stt + llm + tts:.0f} ms")
print(f"actual end-to-end P95      = {hist.get_value_at_percentile(95) / 1000:.0f} ms")
print(f"actual end-to-end P99.9    = {hist.get_value_at_percentile(99.9) / 1000:.0f} ms")
```

The interesting lines are at the bottom. On any pipeline running long enough, the naive sum of per-stage P95s is wildly different from the actual end-to-end P95, and both are wildly different from the P99.9 your customers feel. That gap is your tail.
Build the joint CDF
The dashboard you actually want is a cumulative distribution function across every utterance round-trip. Every point on the curve answers a customer-shaped question: what fraction of turns finished in under T milliseconds? Put your SLO line on the same chart and the gap is concrete. Either you fix the tail, or you raise the SLO, or you pretend the dashboard is fine and ship a worse product.

Keep two views side by side. The joint CDF across all calls tells you whether the system is healthy now. A sparkline of P99.9 over time, with anomalies marked, tells you whether the tail is drifting.
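One way to draw that chart from the merged end-to-end histogram built in the aggregator above (hist below is that histogram; matplotlib and the SLO value are just placeholders for whatever your dashboard uses):

```python
import matplotlib.pyplot as plt

SLO_MS = 2_500  # placeholder budget

# For each percentile p, get_value_at_percentile(p) is the latency T such that
# p percent of turns finished under T. Sweeping p traces out the CDF.
pcts = [p / 10 for p in range(1, 1000)]          # 0.1% .. 99.9%
latency_ms = [hist.get_value_at_percentile(p) / 1000 for p in pcts]

plt.plot(latency_ms, [p / 100 for p in pcts])
plt.axvline(SLO_MS, linestyle="--", label=f"SLO = {SLO_MS} ms")
plt.xlabel("utterance round-trip (ms)")
plt.ylabel("fraction of turns completed")
plt.legend()
plt.savefig("joint_cdf.png")
```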
Attribute the tail to a stage
Once you can see the tail, the next question is which stage owns it. The blunt heuristic that works in practice is this: take the slowest 1 percent of utterances by total latency, then for each utterance ask which child span contributed the most. Bin the answers. Whichever stage shows up most often is the one that needs work.
```python
from collections import defaultdict

def attribute_tail(spans_by_utterance, top_n=100):
    """Print which stage owned the tail across the slowest utterances."""
    # spans_by_utterance is built by walking the OTEL span tree above and
    # bucketing child-span durations under each utterance_id. Shape is:
    #   { utt_id: { "utterance": total_us, "stt.first_byte": ..., ... } }
    sorted_utts = sorted(
        spans_by_utterance.items(),
        key=lambda kv: kv[1]["utterance"],
        reverse=True,
    )[:top_n]

    blame = defaultdict(int)
    for utt_id, stages in sorted_utts:
        # Find which stage took the largest absolute share of this turn.
        culprit = max(
            ("stt.first_byte", "llm.first_token", "tts.first_byte"),
            key=lambda s: stages.get(s, 0),
        )
        blame[culprit] += 1

    total = sum(blame.values())
    for stage, count in sorted(blame.items(), key=lambda kv: -kv[1]):
        print(f"  {stage:20s} owns {count}/{total} ({100 * count / total:.0f}%)")
```

In production voice systems the answer is almost always one of two stages. LLM time-to-first-token under cold model load, queue depth, or a long prompt that pushed past a context-cache boundary. Or TTS first-byte under a regional cache miss, voice-clone fetch from a slow region, or codec mismatch forcing transcoding. The next move is operational, not statistical: pin a model warm, replicate the voice clone closer to the user, or shorten the prompt.
Set the SLO at P99.9
P99.9 of utterance round-trip is the right SLO, not P95 of any individual stage. A 4-second freeze on 1 in 1000 turns destroys trust faster than 200 milliseconds of drift on the median. The customer is not running statistics. They are remembering the last call.
A workable alert recipe (a sketch in code follows the list):
- Aggregate utterance latency in a 5-minute rolling window using HdrHistogram per region.
- Page the on-call if the rolling P99.9 exceeds 2.5x your stated median for 3 consecutive windows.
- Auto-attach the tail-attribution output (which stage owned the slowest 1 percent in that window) to the alert.
- Suppress the alert if request volume in the window is below a floor (a histogram with 4 calls in it has no statistical meaning).
Keep the per-stage panels. They are useful for diagnosis, not for paging.
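A minimal sketch of that recipe, assuming the 5-minute window histograms come from the HdrHistogram aggregation above and that page() is whatever hook reaches your on-call:

```python
from collections import deque

SLO_FACTOR = 2.5      # page when rolling P99.9 exceeds 2.5x the stated median
MIN_CALLS = 200       # volume floor: below this the window has no statistical meaning
CONSECUTIVE = 3       # require three bad windows in a row

breaches = deque(maxlen=CONSECUTIVE)

def check_window(window_hist, stated_median_us, page):
    """window_hist: merged HdrHistogram for one 5-minute window in one region."""
    if window_hist.get_total_count() < MIN_CALLS:
        breaches.clear()  # too little data to count for or against
        return
    p999_us = window_hist.get_value_at_percentile(99.9)
    breaches.append(p999_us > SLO_FACTOR * stated_median_us)
    if len(breaches) == CONSECUTIVE and all(breaches):
        # Run attribute_tail() on the same window first so the page names a culprit.
        page(f"P99.9 = {p999_us / 1000:.0f} ms, over {SLO_FACTOR}x median "
             f"for {CONSECUTIVE} consecutive windows")
```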
Wire it into Chanl
Standing up your own histograms and OpenTelemetry pipeline is the right answer if you're running a voice fleet at scale. If you're still building, the Chanl SDK already records per-call timing aggregates and outlier flags, which gets you the same loop without running Tempo and a t-digest service.
Two SDK methods get you the watchdog. sdk.calls.list() returns recent calls in a window, sdk.calls.getMetrics(callId) returns timing aggregates per call. For follow-up after an alert, sdk.calls.getTranscript(callId) gives word-level segment timestamps you can use for stage attribution, and sdk.calls.analyze(callId) runs the configured scorecard.
```typescript
import { ChanlSDK } from "@chanl/sdk";

const sdk = new ChanlSDK({
  apiKey: process.env.CHANL_API_KEY!,
  baseUrl: "https://api.chanl.com",
});

async function rollingP99dot9(windowMinutes = 60): Promise<number> {
  const since = new Date(Date.now() - windowMinutes * 60_000).toISOString();
  const { data } = await sdk.calls.list({
    startDate: since,
    status: "ended",
    limit: 1000,
  });

  const latencies: number[] = [];
  for (const call of data?.calls ?? []) {
    const { data: m } = await sdk.calls.getMetrics(call.id);
    const total = m?.metrics?.responseTime?.average;
    if (typeof total === "number") latencies.push(total);
  }

  latencies.sort((a, b) => a - b);
  const idx = Math.floor(latencies.length * 0.999);
  return latencies[idx] ?? 0;
}

const p99dot9 = await rollingP99dot9();
const SLO_MS = 2_500;
if (p99dot9 > SLO_MS) {
  await fetch(process.env.SLACK_WEBHOOK!, {
    method: "POST",
    body: JSON.stringify({
      text: `Voice P99.9 = ${p99dot9}ms over budget (${SLO_MS}ms). Pulling outliers...`,
    }),
  });
}
```

When the watchdog fires, grab the slowest call IDs from the same window and replay them through your scenario harness so the regression is reproducible. That flagged-outlier list is what you build a synthetic eval suite from over time. The agents that survive the tail are the ones you ship. The Analytics and Monitoring pages handle the cohorting if you want a UI on top of the same data.
The playbook
Six steps that turn a green-everywhere dashboard into one that actually maps to customer experience:
- Anchor every metric to one span per utterance, not per frame. Coordinated omission is real and it loves voice pipelines.
- Instrument STT, LLM, and TTS as child spans of the utterance using the OpenTelemetry GenAI conventions.
- Aggregate with HdrHistogram or t-digest so percentiles are mergeable across hosts and windows.
- Plot the joint CDF of utterance round-trip and put your SLO line on the same chart.
- Attribute the slowest 1 percent of turns to the stage that contributed the most variance.
- Set the SLO at P99.9 of utterance round-trip and alert on the joint, never on per-stage P95.
If you want to go deeper on the budget that sits underneath all of this, the companion piece on voice AI pipeline budgets walks through the per-stage targets, and the sub-300ms architecture article covers the streaming choices that make those budgets achievable. The 16% rule post is the consequence: latency is not a backend metric, it is a satisfaction metric, and the tail is what people remember.
Stop guessing where your voice agent's tail comes from
Chanl records per-call metrics, segments transcripts with timestamps, and flags outliers automatically. Wire it to your SLO in an afternoon.
See Analytics

- Gil Tene. How NOT to Measure Latency. Strange Loop talk on coordinated omission and percentile reporting.
- HdrHistogram. High dynamic range histogram for latency measurement.
- OpenTelemetry. Semantic conventions for Generative AI (gen_ai.* attributes).
- Pipecat. OpenTelemetry tracing for voice pipelines.
- LiveKit. Agent metrics and observability.
- Twilio. Voice Intelligence operator results and conversational analytics.
- Brendan Gregg. Latency heat maps for tail analysis.
- Heinrich Hartmann. Statistics for Engineers: percentile aggregation pitfalls (Circonus).