Agent Architecture

The Buffering Bug That Quietly Breaks Voice Agent Latency

SSE streams fine locally, then tokens batch into 500ms bursts in production. Here's why, how to fix it, and why pipeline parallelism matters more than model speed.

Lucas Dalamarta, Engineering Lead
March 24, 2026
14 min read

You've built a voice agent. Locally, tokens stream smoothly, latency sits under 300ms, and conversations feel natural. You deploy behind Nginx, and suddenly every response arrives in a half-second burst. The LLM hasn't gotten slower. The TTS provider hasn't changed. Something between your server and the user's browser is silently batching your stream into chunks, and your agent now sounds like it's thinking hard before every sentence.

That "something" is almost always proxy buffering, and it's the most common production failure in voice agent streaming. But fixing it is only the first layer. The real latency gains come from how you wire SSE and WebSockets through your pipeline, and which one you pick determines constraints you'll live with for months.

This article covers what each transport actually does, where each one breaks in production voice scenarios, and why pipeline parallelism saves more milliseconds than switching to a faster model.

Why streaming transports matter for voice agents

Streaming transports cut voice agent latency by 40-60% because they let pipeline stages overlap instead of waiting in sequence. Without streaming, each stage (speech-to-text, LLM generation, text-to-speech) must fully complete before the next one starts, producing 2-6 seconds of dead air. With streaming, the user hears audio while the LLM is still generating.

What does that look like concretely? Here's the latency math without streaming:

| Stage | Naive duration |
| --- | --- |
| Speech-to-text (batch) | 200-400ms |
| LLM generation (wait for full response) | 1-4 seconds |
| Text-to-speech (full response synthesis) | 500ms-1.5s |
| Network round trips | 40-100ms |
| Total | ~2-6 seconds |

And with pipeline parallelism through streaming:

| Stage | Streaming duration |
| --- | --- |
| STT (streaming, start on partial audio) | 80-120ms to first transcript |
| LLM (streaming, first token latency) | 100-150ms |
| TTS (streaming, first audio chunk) | 60-100ms |
| Network (concurrent, not sequential) | 20-50ms overhead |
| Total perceived | 260-420ms |

The stages still take the same total time. Streaming doesn't make the models faster. What it does is overlap them. The user starts hearing audio before the LLM has finished generating the full response, because TTS is synthesizing the first few sentences while the LLM is still working on the rest. That's pipeline parallelism, and that's where the 40-60% latency reduction comes from.

[Diagram] Pipeline parallelism: stages overlap instead of waiting for each other. User audio streams into STT (partial transcript at t=80ms), the LLM emits its first token at t=200ms, TTS produces the first audio chunk at t=280ms, and tokens and audio continue flowing concurrently.

SSE: the right tool for server-to-client streaming

SSE is the default choice for streaming AI-generated tokens to a client. It's a browser-native protocol that opens a persistent HTTP connection, pushes events as they're generated, and handles reconnection automatically. OpenAI, Anthropic, and nearly every major AI API use SSE internally. No protocol upgrade, no bidirectional channel needed. Just HTTP with Content-Type: text/event-stream and a persistent keep-alive.
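On the wire, an SSE response is plain newline-delimited text: each event is an optional id: line plus one or more data: lines, terminated by a blank line, and lines starting with a colon are comments that clients ignore. A hypothetical stream of two token events:

```text
id: 1
data: {"type":"token","content":"Hel"}

id: 2
data: {"type":"token","content":"lo"}

: comment lines like this are ignored by clients (useful as heartbeats)
```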

Here's the catch: SSE only flows in one direction, server to client. That's a feature, not a limitation, for most agent configurations. It works over HTTP/2 (which multiplexes connections), passes through standard reverse proxies without special configuration, and the browser's EventSource API reconnects automatically on drop.

Here's a minimal SSE server that streams from an LLM:

typescript
import express from "express";
 
const app = express();
app.use(express.json());
 
app.post("/api/chat/stream", async (req, res) => {
  const { agentId, messages } = req.body;
 
  // These three headers establish the SSE connection
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  // Critical for Nginx — without this, responses buffer until gzip threshold
  res.setHeader("X-Accel-Buffering", "no");
  res.flushHeaders();
 
  let seq = 0;
 
  try {
    // Stream from your LLM provider
    const stream = await getLLMStream(agentId, messages);
 
    for await (const chunk of stream) {
      if (chunk.type === "token") {
        // SSE format: "id: N\ndata: <payload>\n\n"
        // The id enables reconnection resume via Last-Event-ID
        const ok = res.write(
          `id: ${++seq}\ndata: ${JSON.stringify({
            type: "token",
            content: chunk.content,
            seq,
          })}\n\n`
        );

        // Handle backpressure: write() returns false once the socket
        // buffer is full; pause the stream until it drains
        if (!ok) {
          await new Promise<void>((resolve) => res.once("drain", resolve));
        }
      }
 
      if (chunk.type === "done") {
        res.write(`data: ${JSON.stringify({ type: "done" })}\n\n`);
      }
    }
  } catch (error) {
    res.write(
      `data: ${JSON.stringify({
        type: "error",
        message: error instanceof Error ? error.message : "Unknown error",
      })}\n\n`
    );
  }
 
  res.end();
});

The buffering traps that silently break SSE

Sound familiar? SSE works perfectly on localhost, then tokens arrive in 500ms bursts the moment you deploy behind a proxy. The culprit is almost always response buffering at one of these layers:

nginx
# Nginx SSE configuration — every directive here matters
location /api/chat/stream {
    proxy_pass http://backend:3000;
 
    # Disable buffering — without this, Nginx holds chunks until its buffer fills
    proxy_buffering off;
    proxy_cache off;
 
    # HTTP/1.1 keepalive for persistent connection
    proxy_http_version 1.1;
    proxy_set_header Connection '';
 
    # Extend timeouts for long-running streams and tool calls
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
 
    # Gzip buffers until it has enough data to compress — kills streaming
    gzip off;
}

The common failure modes and their fixes:

| Problem | Cause | Fix |
| --- | --- | --- |
| Tokens arrive in 500ms batches | proxy_buffering on (Nginx default) | proxy_buffering off |
| Smooth locally, batchy in prod | Gzip compression buffering | gzip off on streaming endpoints |
| Stream dies after 100s of silence | Cloudflare idle timeout | Send ": keepalive\n\n" every 30s |
| Tokens burst after tool calls | ALB 60s idle timeout | Increase timeout or send heartbeats |
| Full response arrives at once | CDN response caching | Cache-Control: no-cache header |

Cloudflare specifically terminates connections that go silent for 100 seconds, which matters for agents running long tool calls. Send SSE comment heartbeats during tool execution:

typescript
// Keep Cloudflare alive while a tool call is running
const heartbeat = setInterval(() => {
  if (!res.writableEnded) {
    res.write(": keepalive\n\n");
  }
}, 30_000);
 
try {
  // ... stream tokens, execute tools, etc.
} finally {
  clearInterval(heartbeat);
}

Consuming SSE in the browser

The built-in EventSource only handles GET. For POST (required when you need to send message history or auth headers), use fetch with a streaming body reader:

typescript
function streamChat(
  agentId: string,
  messages: Array<{ role: string; content: string }>,
  onToken: (content: string) => void
): () => void {
  const controller = new AbortController();
  const start = Date.now();
  let ttft: number | null = null;

  // Run the read loop in the background so the cancel function can be
  // returned to the caller immediately, not after the stream finishes
  (async () => {
    const response = await fetch("/api/chat/stream", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${getToken()}`,
      },
      body: JSON.stringify({ agentId, messages }),
      signal: controller.signal,
    });

    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    let buffer = "";

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      buffer += decoder.decode(value, { stream: true });

      // Split on newlines; a data line is safe to parse once its
      // trailing newline has arrived
      const lines = buffer.split("\n");
      buffer = lines.pop()!; // Hold incomplete line in buffer

      for (const line of lines) {
        if (!line.startsWith("data: ")) continue;

        const data = JSON.parse(line.slice(6));
        if (data.type === "done") return; // Matches the server's done event

        if (data.type === "token") {
          if (!ttft) {
            ttft = Date.now() - start;
            console.log(`Time to first token: ${ttft}ms`);
          }
          onToken(data.content);
        }
      }
    }
  })().catch((err) => {
    if ((err as Error).name !== "AbortError") console.error(err);
  });

  // Expose cancel for the UI; calling it aborts the in-flight fetch
  return () => controller.abort();
}

WebSockets: when you need bidirectional real-time

WebSockets are necessary when both sides need to send data simultaneously, which in practice means voice audio streaming and barge-in handling. They open a persistent, full-duplex TCP connection where either side can send messages at any time. The upgrade from HTTP happens once on connect, then all subsequent messages travel over the same persistent socket.

What makes voice agents different from chat? The cases that require WebSockets are specific:

  1. Barge-in / interruption handling. The user starts talking while the agent is still speaking. Your system needs to simultaneously receive that audio, stop TTS playback, cancel the in-flight LLM generation, and re-route to STT, all triggered by a client event that arrives while the server is actively streaming audio back. SSE can't handle this because the client has no channel to send the interruption signal.

  2. Continuous audio streaming. Sending microphone audio in real time requires continuous client-to-server transmission. SSE is server-to-client only.

  3. Multi-turn coordination. Some architectures need the client to send semantic events mid-stream, signaling that a user nodded, confirming a detected intent, or injecting tool results from the client side.

[Diagram] WebSocket voice: the barge-in signal travels in the opposite direction to the audio stream. The client streams {type: "audio"} chunks up while the server streams TTS audio back; when the user starts speaking mid-response, a {type: "barge_in"} event triggers abort() on the LLM and flush() on TTS, and the server replies {type: "listening"} before accepting new audio.

Here's a WebSocket server that handles barge-in:

typescript
import { WebSocketServer, WebSocket } from "ws";
 
const wss = new WebSocketServer({ port: 8080 });
 
wss.on("connection", (ws: WebSocket) => {
  let activeController: AbortController | null = null;
  let isStreaming = false;
 
  ws.on("message", async (raw: Buffer) => {
    const message = JSON.parse(raw.toString());
 
    if (message.type === "barge_in") {
      // User started speaking — immediately cancel ongoing generation
      if (activeController) {
        activeController.abort();
        activeController = null;
        isStreaming = false;
      }
      ws.send(JSON.stringify({ type: "listening" }));
      return;
    }
 
    if (message.type === "audio_chunk") {
      // Route audio to your STT provider (Deepgram, AssemblyAI, etc.)
      await forwardToSTT(message.data);
      return;
    }
 
    if (message.type === "transcript") {
      // Full transcript ready — generate and stream response
      activeController = new AbortController();
      isStreaming = true;
 
      try {
        const stream = await getLLMStream(
          message.agentId,
          message.transcript,
          { signal: activeController.signal }
        );
 
        for await (const chunk of stream) {
          if (!isStreaming) break; // Barge-in may have cleared this flag
 
          if (chunk.type === "token") {
            // Feed tokens to TTS, then stream audio back
            const audioChunk = await synthesize(chunk.content);
            ws.send(
              JSON.stringify({
                type: "audio",
                data: audioChunk,
              })
            );
          }
        }
 
        if (isStreaming) {
          ws.send(JSON.stringify({ type: "done" }));
        }
      } catch (err: any) {
        if (err.name === "AbortError") return; // Expected on barge-in
        ws.send(JSON.stringify({ type: "error", message: err.message }));
      } finally {
        activeController = null;
        isStreaming = false;
      }
    }
  });
 
  ws.on("close", () => {
    activeController?.abort();
  });
});
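The interruption handling above boils down to a small turn-taking state machine. Here is a sketch; the state and event names are assumptions, and your VAD and transcript events will differ:

```typescript
// Turn-taking states for a voice agent; names are illustrative
type TurnState = "idle" | "listening" | "speaking";
type TurnEvent = "vad_speech" | "transcript_final" | "audio_done";

function nextTurnState(state: TurnState, event: TurnEvent): TurnState {
  // Speech detected while the agent is speaking is a barge-in:
  // abort generation and return to listening, whatever the state was
  if (event === "vad_speech") return "listening";
  // A final transcript kicks off generation and playback
  if (event === "transcript_final" && state === "listening") return "speaking";
  // Playback finished cleanly
  if (event === "audio_done" && state === "speaking") return "idle";
  return state;
}
```

Keeping the transitions in one pure function makes the barge-in path easy to unit-test without a live socket.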

The decision framework: SSE or WebSockets?

Pick SSE when data flows one way (server to client); pick WebSockets when both sides need to talk simultaneously. The question isn't which is "better" but which matches your data flow:

| Criterion | SSE | WebSockets |
| --- | --- | --- |
| Direction | Server → client only | Bidirectional |
| Protocol | Standard HTTP (no upgrade) | TCP upgrade to ws:// or wss:// |
| Reconnection | Automatic via EventSource | You implement retry logic |
| Proxy/CDN support | Works everywhere | Needs explicit proxy support |
| Auth | Standard HTTP headers | Query param or first message (no custom headers on upgrade) |
| HTTP/2 multiplexing | Yes, multiple SSE streams over one TCP connection | No, each WebSocket is a separate connection |
| Complexity | Low: standard HTTP semantics | Higher: connection state, heartbeats, reconnection |
| Voice barge-in | Not possible | Native |
| Token streaming | Yes | Yes |
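The auth row deserves a concrete note: browsers cannot attach custom headers to the WebSocket upgrade, so a common pattern is to authenticate with the first frame. A sketch of validating that frame; the {type: "auth", token} shape is an assumption, align it with your own protocol:

```typescript
// Validate a first-message auth frame. The frame shape is an assumption.
function parseAuthFrame(raw: string): { ok: boolean; token?: string } {
  try {
    const msg = JSON.parse(raw);
    if (msg?.type === "auth" && typeof msg.token === "string" && msg.token.length > 0) {
      return { ok: true, token: msg.token };
    }
  } catch {
    // Malformed JSON: treat as unauthenticated
  }
  return { ok: false };
}

// Server side (sketch): close the socket unless a valid auth frame
// arrives within a grace period. verifyToken is your own helper.
//
//   const timer = setTimeout(() => ws.close(4401, "auth timeout"), 5_000);
//   ws.once("message", (raw) => {
//     clearTimeout(timer);
//     const auth = parseAuthFrame(raw.toString());
//     if (!auth.ok || !verifyToken(auth.token)) ws.close(4403, "unauthorized");
//   });
```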

Use SSE when:

  • You're streaming LLM tokens to a chat UI
  • You're pushing notifications, status updates, or analytics events
  • You want to stream agent monitoring events to a dashboard
  • The client sends a request and waits for a streamed response, no events mid-stream

Use WebSockets when:

  • You need barge-in / interruption detection
  • You're streaming raw audio bidirectionally
  • You're building collaborative real-time features where multiple participants send and receive
  • The client needs to send events (not just messages) during an active server stream

For most AI chat products, SSE is the right choice. WebSockets add real complexity: you own reconnection, heartbeat management, and connection state. Don't reach for WebSockets because they feel more "real-time." Reach for them when you genuinely need the client to push events mid-stream. You can validate both paths without production traffic by running synthetic test scenarios against your agent first.
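If you do take on WebSockets, the reconnection logic you now own can be as small as an exponential backoff schedule plus a reopen loop. A sketch; the base delay and cap are assumptions to tune, and production code should add jitter to avoid thundering herds:

```typescript
// Exponential backoff schedule: 250ms, 500ms, 1s, ... capped at 10s
function backoffDelay(attempt: number, baseMs = 250, capMs = 10_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Reconnecting client using the schedule above. Accesses WebSocket via
// globalThis to stay runtime-agnostic in this sketch.
function connectWithRetry(url: string, onMessage: (data: string) => void): void {
  let attempt = 0;
  const open = () => {
    const ws = new (globalThis as any).WebSocket(url);
    ws.onopen = () => { attempt = 0; };            // Reset backoff on success
    ws.onmessage = (ev: { data: string }) => onMessage(ev.data);
    ws.onclose = () => {
      setTimeout(open, backoffDelay(attempt++));   // Retry with growing delay
    };
  };
  open();
}
```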

Building the streaming pipeline correctly

The transport protocol is one decision; wiring the full pipeline to stream end-to-end is the harder problem, and it's what actually determines your latency floor. A common mistake: teams add SSE at the HTTP layer but leave batch processing inside the pipeline stages. Tokens stream beautifully to the browser, but the LLM doesn't start until the full transcript is ready.

Here's what "streaming throughout" means in practice for a voice agent:

typescript
async function processVoiceTurn(
  audioChunks: AsyncIterable<Buffer>,
  agentId: string,
  ws: WebSocket
): Promise<void> {
  // Stage 1: Stream audio to STT — don't wait for full utterance
  const transcriptStream = await stt.streamTranscribe(audioChunks);
 
  // Stage 2: Start LLM as soon as we have enough context — don't wait for full transcript
  const partialTranscripts: string[] = [];
  let llmStream: AsyncIterable<LLMChunk> | null = null;
 
  for await (const transcript of transcriptStream) {
    partialTranscripts.push(transcript.text);
 
    // Fire LLM on end-of-utterance signal, not end-of-transcript
    if (transcript.isFinal && !llmStream) {
      const fullTranscript = partialTranscripts.join(" ");
      llmStream = await getLLMStream(agentId, fullTranscript);
 
      // Stage 3: Start TTS on first LLM token — don't wait for full response
      processLLMToTTS(llmStream, ws).catch(console.error);
    }
  }
}
 
async function processLLMToTTS(
  llmStream: AsyncIterable<LLMChunk>,
  ws: WebSocket
): Promise<void> {
  const ttsStream = tts.createStream();

  // Feed LLM tokens into TTS as they arrive. Run this loop in the
  // background so the audio-forwarding loop below starts immediately,
  // rather than waiting for the LLM stream to finish first.
  (async () => {
    for await (const chunk of llmStream) {
      if (chunk.type === "token") {
        ttsStream.write(chunk.content);
      }

      if (chunk.type === "done") {
        ttsStream.end();
      }
    }
  })().catch(console.error);

  // Forward TTS audio chunks to client as they synthesize
  for await (const audioChunk of ttsStream) {
    ws.send(JSON.stringify({ type: "audio", data: audioChunk.toString("base64") }));
  }
}

The key design choice: processLLMToTTS runs concurrently with the transcript loop, not after it. The for await on transcriptStream and the processLLMToTTS call run in parallel because we await the LLM stream setup and then .catch() the downstream chain. We don't await it inline. This is what creates the pipeline overlap.
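One refinement worth calling out: feeding single tokens straight to TTS produces choppy prosody, so many pipelines buffer tokens into clause or sentence boundaries first. A sketch; the boundary regex is an assumption, tune it for your language and domain:

```typescript
// Buffer streamed LLM tokens into sentence-sized chunks before TTS.
// Flushes on . ! ? followed by whitespace; remainder flushes at stream end.
function createSentenceChunker(onSentence: (sentence: string) => void) {
  let buffer = "";
  return {
    push(token: string) {
      buffer += token;
      let m: RegExpMatchArray | null;
      // Emit every complete sentence currently sitting in the buffer
      while ((m = buffer.match(/^(.*?[.!?])\s+/s))) {
        onSentence(m[1]);
        buffer = buffer.slice(m[0].length);
      }
    },
    flush() {
      // Emit whatever remains when the LLM stream ends
      if (buffer.trim()) onSentence(buffer.trim());
      buffer = "";
    },
  };
}
```

Wiring this between the LLM loop and ttsStream.write() trades a few hundred milliseconds of buffering for noticeably more natural speech.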

What actually controls perceived latency

Five factors dominate voice agent latency in production, and most teams optimize the wrong one first. The dominant factors, ranked by impact:

1. Time to first audio chunk matters more than total generation time. Users experience the gap between finishing their sentence and hearing the first syllable of the response. If that gap is under 400ms, the conversation feels alive. If it's over 800ms, it feels broken, even if the full response arrives 2 seconds later. You can track this per-call with latency analytics.

2. Pipeline parallelism delivers more than model optimization. Switching from GPT-4o to a faster model might save 30ms on first-token latency. Implementing proper streaming throughout the pipeline typically saves 400-800ms total. Optimize the architecture before optimizing the model selection.

3. Cold starts are the enemy of consistent latency. Ever seen an agent that's fast 95% of the time but occasionally takes 2 full seconds? A system that achieves 280ms P50 but has 2,000ms P99 due to cold containers will feel unreliable. Maintain warm capacity, implement predictive scaling, and route user-facing traffic away from cold instances. You can see exactly where this is happening with proper agent monitoring.

4. Network topology matters. A 40ms round trip from your user to your server, before any processing, is 40ms you can't recover elsewhere. Edge deployment (6-8 geographic regions rather than one central data center) directly lowers the floor for every request.

5. TTS provider selection has outsized impact. The difference between a TTS provider with 250ms time-to-first-audio-chunk and one with 60ms is larger than the entire first-token latency budget for a fast LLM. Cartesia Sonic Turbo achieves ~40ms TTFB; ElevenLabs Flash is around 75ms. OpenAI's TTS API runs 120-180ms. That gap matters.
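The first factor above, time to first audio, is also the easiest to instrument. A sketch of a per-turn tracker; the injectable clock keeps it testable, and the names are illustrative:

```typescript
// Measure time-to-first-audio (TTFA) per conversation turn
function createTTFATracker(now: () => number = Date.now) {
  let turnStart = 0;
  let firstAudioMs: number | null = null;
  return {
    startTurn() {
      turnStart = now();
      firstAudioMs = null;
    },
    onAudioChunk() {
      // Only the first chunk of the turn matters for perceived latency
      if (firstAudioMs === null) firstAudioMs = now() - turnStart;
    },
    get ttfa() {
      return firstAudioMs;
    },
  };
}
```

Call startTurn() on end-of-utterance and onAudioChunk() wherever you forward TTS audio, then ship the ttfa value to your metrics pipeline per call.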

Where SSE fits in voice agent monitoring

SSE's best non-obvious use case is operational monitoring: pushing live events from your backend to a dashboard as calls happen, without the overhead of WebSocket infrastructure. This isn't the real-time audio path (which uses WebSockets or WebRTC), but the observability layer sitting alongside it.

When an agent uses a tool, scores poorly on a quality scorecard, or hits an error, you want that signal to surface immediately, not in the next batch report. An SSE stream from your monitoring backend to your dashboard delivers those events cleanly, because the dashboard only needs to receive events, not send them.

typescript
// Monitoring SSE endpoint — pushes agent events as they happen
app.get("/api/monitoring/stream/:workspaceId", (req, res) => {
  const { workspaceId } = req.params;
 
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("X-Accel-Buffering", "no");
  res.flushHeaders();
 
  // Subscribe to events for this workspace
  const unsubscribe = eventBus.subscribe(workspaceId, (event) => {
    res.write(
      `data: ${JSON.stringify({
        type: event.type, // "tool_call", "score_update", "escalation"
        agentId: event.agentId,
        callId: event.callId,
        payload: event.payload,
        ts: event.timestamp,
      })}\n\n`
    );
  });
 
  req.on("close", unsubscribe);
});

This pattern, SSE for monitoring dashboards and WebSockets for real-time audio, is the common production architecture. Each transport does one thing well.

The production checklist

Before shipping a streaming voice agent, verify each layer. Miss any one of these and you'll spend hours debugging latency that isn't in the model:

Transport layer

  • SSE endpoints have proxy_buffering off in Nginx (or equivalent in your proxy)
  • gzip disabled on streaming routes
  • X-Accel-Buffering: no set as response header
  • Timeout configuration reviewed at every hop (proxy, ALB, CDN, Node.js)
  • Heartbeats enabled for connections with long-running tool calls

Pipeline

  • STT streaming enabled, not batch transcription
  • LLM receives partial transcripts (or at least fires on end-of-utterance, not end-of-audio-file)
  • TTS starts synthesizing on first LLM token, not after full response
  • Backpressure handling: drain event on res.write() returning false

Reliability

  • SSE events tagged with sequence numbers for resume-on-reconnect
  • WebSocket reconnection logic implemented with exponential backoff
  • Abort controllers cleaned up on disconnect (to avoid orphaned LLM calls)
  • Warm capacity maintained, no cold starts on P95 user-facing traffic

Observability

  • Time-to-first-token (TTFT) tracked per request
  • P50, P95, P99 latency tracked separately per pipeline stage
  • Error rates on SSE vs WebSocket connections tracked independently
  • Tool call latency tracked within stream events

The choice between SSE and WebSockets resolves cleanly once you're clear about data flow direction. For most AI chat and monitoring use cases, SSE is the right answer. It's simpler, works everywhere, and reconnects automatically. For voice with barge-in, audio streaming, or real-time multi-participant scenarios, WebSockets are necessary.

The harder work is building the pipeline to actually stream end-to-end, not just at the HTTP layer, but between every stage from STT to LLM to TTS. That's where the 400ms savings live. The transport protocol is the last few milliseconds. Pipeline architecture is the first few hundred.
