Chanl
Voice & Conversation

Your voice agent's P95 is lying. The real problem is P99.9

Per-stage P95 hides the tail customers feel. How variance compounds across STT, LLM, and TTS, and how to SLO the joint distribution.

Dean Grover, Co-founder
May 3, 2026
12 min read
[Illustration: layered audio waveform splitting into three colored tracks, with one outlier spike trailing into fog]

The dashboard says P95 is 800 milliseconds across speech-to-text, LLM, and text-to-speech. P95 means "the time 95 percent of calls finish under," so the dashboard is showing you the typical case. Every panel green. Then a support ticket lands at 9:14am. "The agent froze for four seconds in the middle of confirming my appointment, and I hung up." You pull the trace. Total round-trip on that turn was 4,180 milliseconds. The dashboard never flagged it because that call lived in the tail, the slowest few percent of calls, and nobody was watching the tail.

This is the math your voice latency dashboard probably gets wrong, and how to instrument, aggregate, and alert on the metric customers actually feel. We'll wrap each stage of every turn in a span (a timed slice of work, the unit of distributed tracing) with OpenTelemetry and Pipecat, aggregate the timings with histograms, plot the joint CDF (the curve answering "what fraction of full turns finished under each time T"), and set the SLO (the latency budget you publicly commit to) at P99.9, meaning the slowest 1 call in 1,000, because that's the only number that maps to a real call.

What you will build | What you will learn
OpenTelemetry span tree per utterance | How to attribute latency to the stage that owns it
Histogram aggregator | Why averaging percentiles is wrong and what to do instead
Joint CDF dashboard | What end-to-end latency actually looks like
Tail attribution helper | Which stage is responsible for your P99.9 today
P99.9 alerting recipe | An SLO that maps to what customers feel

Why per-stage P95 lies

Most teams report P95 per stage and mentally add the numbers up, as if the whole system behaved like the sum of typical cases. It does not. End-to-end latency is the per-turn sum of every stage's latency, and the combined percentile is dominated by the tail. A slow turn is usually slow because one stage stalled, and the stage that stalls is rarely the same one twice in a row.

Back-of-the-envelope: say STT, LLM, and TTS each finish under their stage budget on 95 percent of utterances. Treat the tails as independent for a moment (a simplification; they're usually correlated through shared resources). The probability that an utterance finishes under budget end-to-end is the product:

cascade-variance.txt·text
P(end-to-end fast) = P(STT fast) * P(LLM fast) * P(TTS fast)
                   = 0.95 * 0.95 * 0.95
                   = 0.857

So a stack where every panel reads "P95 green" misses the budget on roughly 14 percent of calls. Push to four stages including a transport hop and you're at 81 percent. Going the other direction: to get 99.9 percent of end-to-end calls under budget across three stages, each stage needs to be under budget 99.97 percent of the time. The tail you tolerate per stage has to be much narrower than the tail you tolerate end-to-end.
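The arithmetic is worth keeping as an executable sanity check. A minimal sketch in plain Python, no dependencies; the file and function names are mine, not from any library:

compound_tails.py·python

```python
def end_to_end_pass(stage_rates):
    """Probability a turn clears budget when independent stage tails compound."""
    p = 1.0
    for r in stage_rates:
        p *= r
    return p


def per_stage_target(end_to_end_target, n_stages):
    """Pass rate each stage needs so the product hits the end-to-end target."""
    return end_to_end_target ** (1 / n_stages)


print(f"{end_to_end_pass([0.95] * 3):.3f}")  # 0.857: three green P95 panels
print(f"{end_to_end_pass([0.95] * 4):.3f}")  # 0.815: add a transport hop
print(f"{per_stage_target(0.999, 3):.4f}")   # 0.9997 per stage for a 99.9% joint
```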

Independence is also the wrong model in practice. Stage tails are usually correlated, because regional GPU pressure or a noisy neighbor hits STT, LLM, and TTS at the same time. Correlation nudges the headline pass rate back up, but it concentrates the misses into bad minutes where the slow stages stack on the same turn, so the worst turns are far worse than the independent model predicts. The count of bad calls may come in under 14 percent; the badness of each one will not.

The metric that matters: utterance round-trip

Utterance round-trip is the time from the moment the user finishes their last word to the moment the first byte of TTS audio reaches their ear. Everything else is a diagnostic. The dashboard the customer would use, if they had one, is a histogram of utterance round-trips.

Each arrow in the timing model below is a span boundary you should be recording.

  1. Last audio frame from the user (t=0)
  2. Frame ingested by the transport
  3. First partial transcript (stt.first_byte)
  4. Final transcript (stt.final)
  5. First LLM token (llm.first_token)
  6. LLM completion (llm.complete)
  7. First TTS audio byte (tts.first_byte)
  8. Audible playback (audio.queued)
Utterance round-trip span tree across STT, LLM, and TTS

There is a trap here. Measure each stage's median per frame and you fall straight into what Gil Tene named coordinated omission, the bias you get when slow events get under-counted because their downstream stalls aren't recorded. A stage that processes a thousand frames at 5ms each plus one frame at 800ms looks like P95 = 5ms if you measure per-frame and have enough frames. The 800ms stall is the customer's whole experience, and your histogram drowns it.

The fix is to anchor your measurement to the unit the customer perceives. One span per utterance, child spans per stage, every utterance recorded.
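The trap is easy to reproduce with invented numbers: a hundred ten-frame utterances at 5 ms per frame, one of which contains a single 800 ms stall.

frame_trap.py·python

```python
def pctl(xs, q):
    """Nearest-rank percentile on a small in-memory sample."""
    xs = sorted(xs)
    return xs[min(len(xs) - 1, int(len(xs) * q))]


# 100 utterances of 10 frames each; one frame stalls for 800 ms.
utterances = [[5.0] * 10 for _ in range(100)]
utterances[42][3] = 800.0

frame_latencies = [f for u in utterances for f in u]
turn_latencies = [sum(u) for u in utterances]

print(pctl(frame_latencies, 0.95))  # 5.0: per-frame, the stall is invisible
print(pctl(turn_latencies, 0.95))   # 50.0: even per-turn P95 misses it
print(max(turn_latencies))          # 845.0: the turn the customer remembers
```

Anchoring to the utterance at least surfaces the stall in the tail of the turn histogram; the per-frame view never will.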

Instrument the whole pipeline, not the parts

Pipecat and LiveKit both ship OpenTelemetry integrations that emit spans around their pipeline stages. The OpenTelemetry GenAI semantic conventions are the standard naming scheme for LLM calls; they give you a stable attribute set (gen_ai.system, gen_ai.request.model, gen_ai.response.id) so dashboards from different vendors can read the same trace. Wrap STT and TTS in your own spans using the same naming pattern.

The minimum useful instrumentation around a Pipecat pipeline looks like this. Set up a tracer once at process start, then wrap each utterance and each stage in spans that share a parent.

instrument_pipecat.py·python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from pipecat.processors.frame_processor import FrameProcessor
from pipecat.frames.frames import (
    UserStoppedSpeakingFrame,
    TranscriptionFrame,
    LLMTextFrame,
    LLMFullResponseEndFrame,
    TTSAudioRawFrame,
)
 
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter())
)
tracer = trace.get_tracer("voice-pipeline")
 
 
class UtteranceTracer(FrameProcessor):
    """Owns one span per utterance and child spans per stage."""
 
    def __init__(self):
        super().__init__()
        self._utterance_span = None
        self._stage_spans = {}
 
    async def process_frame(self, frame, direction):
        # Pipecat routes system frames through the base class; forward first.
        await super().process_frame(frame, direction)

        # User stops speaking: open the utterance span.
        if isinstance(frame, UserStoppedSpeakingFrame):
            self._utterance_span = tracer.start_span("utterance")
            self._stage_spans["stt"] = tracer.start_span(
                "stt.first_byte",
                context=trace.set_span_in_context(self._utterance_span),
            )
 
        # Final transcript: close STT, open LLM TTFT. TranscriptionFrame is
        # the final variant by convention (interim results arrive as
        # InterimTranscriptionFrame).
        elif isinstance(frame, TranscriptionFrame) and self._utterance_span is not None:
            self._close("stt")
            self._stage_spans["llm_ttft"] = tracer.start_span(
                "llm.first_token",
                context=trace.set_span_in_context(self._utterance_span),
            )
 
        # First LLM token: close TTFT, open completion.
        elif isinstance(frame, LLMTextFrame) and "llm_ttft" in self._stage_spans:
            self._close("llm_ttft")
            self._stage_spans["llm_complete"] = tracer.start_span(
                "llm.complete",
                context=trace.set_span_in_context(self._utterance_span),
            )
 
        # LLM stream finished: close the completion span before TTS arrives,
        # so llm.complete reflects the LLM boundary, not the TTS boundary.
        elif isinstance(frame, LLMFullResponseEndFrame):
            self._close("llm_complete")
 
        # First TTS audio byte: open and immediately close TTS first-byte span.
        elif (
            isinstance(frame, TTSAudioRawFrame)
            and self._utterance_span is not None
            and "tts" not in self._stage_spans
        ):
            self._stage_spans["tts"] = tracer.start_span(
                "tts.first_byte",
                context=trace.set_span_in_context(self._utterance_span),
            )
            # Audio is reaching the transport now, end the utterance.
            self._close("tts")
            self._utterance_span.set_attribute(
                "gen_ai.system", "pipecat"
            )
            self._utterance_span.end()
            self._utterance_span = None
 
        await self.push_frame(frame, direction)
 
    def _close(self, key):
        span = self._stage_spans.pop(key, None)
        if span is not None:
            span.end()

That gives you one span tree per turn. Export to any OpenTelemetry backend (Tempo, Honeycomb, Jaeger, Datadog APM, Vercel observability) and you have the raw material for everything that follows. The tagging matters: every utterance span should also carry call_id, agent_id, model name, and region, because those are the cuts you will want when you find a bad period.

Aggregate with histograms

Now you have spans. Do not store percentiles. Store histograms.

The reason: percentiles do not add or average like normal numbers. The average of two P95s is not a P95, and neither is the maximum of two P95s. If you store P95 per minute and try to compute P95 per hour, you get a number that means nothing. HdrHistogram and t-digest are two data structures built for exactly this problem: they preserve the shape of the distribution with bounded memory, and both are mergeable, so per-shard summaries can be combined into a global one.
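A toy demonstration of why averaging fails, with two hypothetical shards, one an order of magnitude slower than the other:

merge_percentiles.py·python

```python
def pctl(xs, q):
    """Nearest-rank percentile over a sorted copy of the sample."""
    xs = sorted(xs)
    return xs[int((len(xs) - 1) * q)]


fast_shard = list(range(1, 101))              # 1..100 ms
slow_shard = [10 * x for x in range(1, 101)]  # 10..1000 ms

avg_of_p95s = (pctl(fast_shard, 0.95) + pctl(slow_shard, 0.95)) / 2
merged_p95 = pctl(fast_shard + slow_shard, 0.95)

print(avg_of_p95s)  # 522.5, a number describing no distribution that exists
print(merged_p95)   # 900, the P95 of the actual combined traffic
```

Merging the raw distributions, which is what HdrHistogram and t-digest make cheap, is the only aggregation that keeps the percentile honest.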

A small Python aggregator pulls the OpenTelemetry spans you just emitted, builds a HdrHistogram per stage per minute, and computes the joint CDF for end-to-end utterance latency.

aggregate_latencies.py·python
from collections import defaultdict
from datetime import datetime, timezone
from hdrh.histogram import HdrHistogram
import json
 
# Each span: { utterance_id, name, start_ns, end_ns, attributes }
def load_spans(path):
    with open(path) as f:
        for line in f:
            yield json.loads(line)
 
# One HdrHistogram per (minute_bucket, stage). Range 1us to 60s, 3 sig digits.
HIST_PARAMS = (1, 60_000_000, 3)
per_stage = defaultdict(lambda: HdrHistogram(*HIST_PARAMS))
end_to_end = defaultdict(lambda: HdrHistogram(*HIST_PARAMS))
 
for span in load_spans("spans.jsonl"):
    minute = (span["start_ns"] // 60_000_000_000) * 60
    duration_us = (span["end_ns"] - span["start_ns"]) // 1_000
 
    if span["name"] == "utterance":
        end_to_end[minute].record_value(duration_us)
    else:
        per_stage[(minute, span["name"])].record_value(duration_us)
 
# Print the joint CDF on a sample minute.
sample_minute = next(iter(end_to_end))
hist = end_to_end[sample_minute]
print(f"minute={datetime.fromtimestamp(sample_minute, tz=timezone.utc).isoformat()}")
for pct in (50, 90, 95, 99, 99.9):
    print(f"  P{pct:>5} = {hist.get_value_at_percentile(pct) / 1000:.0f} ms")
 
# How does the joint compare to the naive sum of per-stage P95s?
def stage_p95_at(minute, name):
    return per_stage[(minute, name)].get_value_at_percentile(95) / 1000
 
stt = stage_p95_at(sample_minute, "stt.first_byte")
llm = stage_p95_at(sample_minute, "llm.first_token")
tts = stage_p95_at(sample_minute, "tts.first_byte")
print(f"naive sum of per-stage P95 = {stt + llm + tts:.0f} ms")
print(f"actual end-to-end P95     = {hist.get_value_at_percentile(95) / 1000:.0f} ms")
print(f"actual end-to-end P99.9   = {hist.get_value_at_percentile(99.9) / 1000:.0f} ms")

The two interesting lines are at the bottom. On any pipeline running long enough, the naive sum of per-stage P95s is wildly different from the actual end-to-end P95, and both are wildly different from the P99.9 your customers feel. That gap is your tail.

Build the joint CDF

The dashboard you actually want is a cumulative distribution function across every utterance round-trip. Every point on the curve answers a customer-shaped question: what fraction of turns finished in under T milliseconds? Put your SLO line on the same chart and the gap is concrete. Either you fix the tail, or you raise the SLO, or you pretend the dashboard is fine and ship a worse product.
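Computing the curve is the easy part once per-turn round-trips exist; the latencies below are invented for illustration:

joint_cdf.py·python

```python
def cdf_at(turn_ms, t_ms):
    """Fraction of turns that finished in t_ms or less."""
    return sum(1 for x in turn_ms if x <= t_ms) / len(turn_ms)


turns = [300, 450, 520, 610, 700, 820, 950, 1200, 1800, 4100]
for t in (500, 1000, 2000, 4000):
    print(f"P(turn <= {t} ms) = {cdf_at(turns, t):.2f}")
```

The flat stretch between 2,000 and 4,000 ms is the tail: a single turn at 4,100 ms holds the curve below 1.0, and that turn is the support ticket.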


Keep two views side by side. The joint CDF across all calls tells you whether the system is healthy now. A sparkline of P99.9 over time, with anomalies marked, tells you whether the tail is drifting.

Attribute the tail to a stage

Once you can see the tail, the next question is which stage owns it. The blunt heuristic that works in practice is this: take the slowest 1 percent of utterances by total latency, then for each utterance ask which child span contributed the most. Bin the answers. Whichever stage shows up most often is the one that needs work.

attribute_tail.py·python
from collections import defaultdict


def attribute_tail(spans_by_utterance, top_n=100):
    """Print which stage owned the tail across the slowest utterances."""
    # spans_by_utterance is built by walking the OTEL span tree above and
    # bucketing child-span durations under each utterance_id. Shape is:
    # { utt_id: { "utterance": total_us, "stt.first_byte": ..., ... } }
    sorted_utts = sorted(
        spans_by_utterance.items(),
        key=lambda kv: kv[1]["utterance"],
        reverse=True,
    )[:top_n]
 
    blame = defaultdict(int)
    for utt_id, stages in sorted_utts:
        # Find which stage took the largest absolute share of this turn.
        culprit = max(
            ("stt.first_byte", "llm.first_token", "tts.first_byte"),
            key=lambda s: stages.get(s, 0),
        )
        blame[culprit] += 1
 
    total = sum(blame.values())
    for stage, count in sorted(blame.items(), key=lambda kv: -kv[1]):
        print(f"  {stage:20s} owns {count}/{total} ({100 * count / total:.0f}%)")

In production voice systems the answer is almost always one of two stages. LLM time-to-first-token under cold model load, queue depth, or a long prompt that pushed past a context-cache boundary. Or TTS first-byte under a regional cache miss, voice-clone fetch from a slow region, or codec mismatch forcing transcoding. The next move is operational, not statistical: pin a model warm, replicate the voice clone closer to the user, or shorten the prompt.

Set the SLO at P99.9

P99.9 of utterance round-trip is the right SLO, not P95 of any individual stage. A 4-second freeze on 1 in 1000 turns destroys trust faster than 200 milliseconds of drift on the median. The customer is not running statistics. They are remembering the last call.

A workable alert recipe:

  1. Aggregate utterance latency in a 5-minute rolling window using HdrHistogram per region.
  2. Page the on-call if the rolling P99.9 exceeds 2.5x your stated median for 3 consecutive windows.
  3. Auto-attach the tail-attribution output (which stage owned the slowest 1 percent in that window) to the alert.
  4. Suppress the alert if request volume in the window is below a floor (a histogram with 4 calls in it has no statistical meaning).
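Steps 2 and 4 of the recipe compose into a few lines. The 2.5x multiplier, the three-window streak, and the volume floor are the recipe's knobs, not values from any library:

page_on_tail.py·python

```python
def should_page(window_p999_ms, window_volumes, median_ms, min_volume=50):
    """Page only if the last 3 windows all breach 2.5x median on real traffic."""
    threshold = 2.5 * median_ms
    recent = list(zip(window_p999_ms, window_volumes))[-3:]
    if len(recent) < 3:
        return False
    return all(p > threshold and v >= min_volume for p, v in recent)


# Stated median 800 ms -> page threshold 2000 ms.
print(should_page([2100, 2200, 2300], [120, 130, 110], 800))  # True
print(should_page([2100, 900, 2300], [120, 130, 110], 800))   # False: no streak
print(should_page([2100, 2200, 2300], [120, 4, 110], 800))    # False: low volume
```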

Keep the per-stage panels. They are useful for diagnosis, not for paging.

Wire it into Chanl

Standing up your own histograms and OpenTelemetry pipeline is the right answer if you're running a voice fleet at scale. If you're still building, the Chanl SDK already records per-call timing aggregates and outlier flags, which gets you the same loop without running Tempo and a t-digest service.

Two SDK methods get you the watchdog. sdk.calls.list() returns recent calls in a window, sdk.calls.getMetrics(callId) returns timing aggregates per call. For follow-up after an alert, sdk.calls.getTranscript(callId) gives word-level segment timestamps you can use for stage attribution, and sdk.calls.analyze(callId) runs the configured scorecard.

latency-watchdog.ts·typescript
import { ChanlSDK } from "@chanl/sdk";
 
const sdk = new ChanlSDK({
  apiKey: process.env.CHANL_API_KEY!,
  baseUrl: "https://api.chanl.com",
});
 
async function rollingP99dot9(windowMinutes = 60): Promise<number> {
  const since = new Date(Date.now() - windowMinutes * 60_000).toISOString();
  const { data } = await sdk.calls.list({
    startDate: since,
    status: "ended",
    limit: 1000,
  });
 
  const latencies: number[] = [];
  for (const call of data?.calls ?? []) {
    const { data: m } = await sdk.calls.getMetrics(call.id);
    const total = m?.metrics?.responseTime?.average;
    if (typeof total === "number") latencies.push(total);
  }
 
  // Recipe step 4: a tiny sample has no P99.9; 50 here is an arbitrary floor.
  if (latencies.length < 50) return 0;

  latencies.sort((a, b) => a - b);
  const idx = Math.floor(latencies.length * 0.999);
  return latencies[idx] ?? 0;
}
 
const p99dot9 = await rollingP99dot9();
const SLO_MS = 2_500;
 
if (p99dot9 > SLO_MS) {
  await fetch(process.env.SLACK_WEBHOOK!, {
    method: "POST",
    body: JSON.stringify({
      text: `Voice P99.9 = ${p99dot9}ms over budget (${SLO_MS}ms). Pulling outliers...`,
    }),
  });
}

When the watchdog fires, grab the slowest call IDs from the same window and replay them through your scenario harness so the regression is reproducible. That flagged-outlier list is what you build a synthetic eval suite from over time. The agents that survive the tail are the ones you ship. The Analytics and Monitoring pages handle the cohorting if you want a UI on top of the same data.

The playbook

Six steps that turn a green-everywhere dashboard into one that actually maps to customer experience:

  1. Anchor every metric to one span per utterance, not per frame. Coordinated omission is real and it loves voice pipelines.
  2. Instrument STT, LLM, and TTS as child spans of the utterance using the OpenTelemetry GenAI conventions.
  3. Aggregate with HdrHistogram or t-digest so percentiles are mergeable across hosts and windows.
  4. Plot the joint CDF of utterance round-trip and put your SLO line on the same chart.
  5. Attribute the slowest 1 percent of turns to the stage that contributed the most variance.
  6. Set the SLO at P99.9 of utterance round-trip and alert on the joint, never on per-stage P95.

If you want to go deeper on the budget that sits underneath all of this, the companion piece on voice AI pipeline budgets walks through the per-stage targets, and the sub-300ms architecture article covers the streaming choices that make those budgets achievable. The 16% rule post is the consequence: latency is not a backend metric, it is a satisfaction metric, and the tail is what people remember.

Stop guessing where your voice agent's tail comes from

Chanl records per-call metrics, segments transcripts with timestamps, and flags outliers automatically. Wire it to your SLO in an afternoon.

See Analytics
Dean Grover, Co-founder

Building the platform for AI agents at Chanl — tools, testing, and observability for customer experience.

The Signal Briefing

One email per week. How leading CS, revenue, and AI teams turn conversations into decisions. Benchmarks, playbooks, and what works in production.

500+ CS and revenue leaders subscribed

Frequently Asked Questions