Agent Architecture

The Buffering Bug That Quietly Breaks Voice Agent Latency

SSE streams fine locally, then tokens batch into 500ms bursts in production. Here's why, how to fix it, and why pipeline parallelism matters more than model speed.

Lucas Dalamarta, Engineering Lead
March 24, 2026
14 min read

You've built a voice agent. Locally, tokens stream smoothly, latency sits under 300ms, and conversations feel natural. You deploy behind Nginx, and suddenly every response arrives in a half-second burst. The LLM hasn't gotten slower. The TTS provider hasn't changed. Something between your server and the user's browser is silently batching your stream into chunks, and your agent now sounds like it's thinking hard before every sentence.

That "something" is almost always proxy buffering, and it's the most common production failure in voice agent streaming. But fixing it is only the first layer. The real latency gains come from how you wire SSE and WebSockets through your pipeline, and which one you pick determines constraints you'll live with for months.

This article covers what each transport actually does, where each one breaks in production voice scenarios, and why pipeline parallelism saves more milliseconds than switching to a faster model.

Why streaming transports matter for voice agents

Streaming transports cut voice agent latency by 40-60% because they let pipeline stages overlap instead of waiting in sequence. Without streaming, each stage (speech-to-text, LLM generation, text-to-speech) must fully complete before the next one starts, producing 2-6 seconds of dead air. With streaming, the user hears audio while the LLM is still generating.

What does that look like concretely? Here's the latency math without streaming:

| Stage | Naive duration |
| --- | --- |
| Speech-to-text (batch) | 200-400ms |
| LLM generation (wait for full response) | 1-4 seconds |
| Text-to-speech (full response synthesis) | 500ms-1.5s |
| Network round trips | 40-100ms |
| Total | ~2-6 seconds |

And with pipeline parallelism through streaming:

| Stage | Streaming duration |
| --- | --- |
| STT (streaming, start on partial audio) | 80-120ms to first transcript |
| LLM (streaming, first token latency) | 100-150ms |
| TTS (streaming, first audio chunk) | 60-100ms |
| Network (concurrent, not sequential) | 20-50ms overhead |
| Total perceived | 260-420ms |

The stages still take the same total time. Streaming doesn't make the models faster. What it does is overlap them. The user starts hearing audio before the LLM has finished generating the full response, because TTS is synthesizing the first few sentences while the LLM is still working on the rest. That's pipeline parallelism, and that's where the 40-60% latency reduction comes from.

[Diagram] Pipeline parallelism: stages overlap instead of waiting for each other. User audio streams into STT (partial transcript at t=80ms), the LLM emits its first token at t=200ms, TTS produces the first audio chunk at t=280ms, and tokens and audio continue flowing concurrently.

SSE: the right tool for server-to-client streaming

SSE is the default choice for streaming AI-generated tokens to a client. It's a browser-native protocol that opens a persistent HTTP connection, pushes events as they're generated, and handles reconnection automatically. OpenAI, Anthropic, and nearly every major AI API use SSE internally. No protocol upgrade, no bidirectional channel needed. Just HTTP with Content-Type: text/event-stream and a persistent keep-alive.
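On the wire, an SSE response is plain newline-delimited text: each event is an optional id: line plus one or more data: lines, terminated by a blank line, and lines starting with a colon are comments that clients ignore. A hypothetical stream of two token events:

```text
id: 1
data: {"type":"token","content":"Hel"}

id: 2
data: {"type":"token","content":"lo"}

: comment lines like this are ignored by clients (useful as heartbeats)
```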

Here's the catch: SSE only flows in one direction, server to client. That's a feature, not a limitation, for most agent configurations. It works over HTTP/2 (which multiplexes connections), passes through standard reverse proxies without special configuration, and the browser's EventSource API reconnects automatically on drop.

Here's a minimal SSE server that streams from an LLM:

typescript
import express from "express";
 
const app = express();
app.use(express.json());
 
app.post("/api/chat/stream", async (req, res) => {
  const { agentId, messages } = req.body;
 
  // These three headers establish the SSE connection
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  // Critical for Nginx — without this, responses buffer until gzip threshold
  res.setHeader("X-Accel-Buffering", "no");
  res.flushHeaders();
 
  let seq = 0;
 
  try {
    // Stream from your LLM provider
    const stream = await getLLMStream(agentId, messages);
 
    for await (const chunk of stream) {
      if (chunk.type === "token") {
        // SSE format: "id: N\ndata: <payload>\n\n"
        // The id enables reconnection resume via Last-Event-ID
        const ok = res.write(
          `id: ${++seq}\ndata: ${JSON.stringify({
            type: "token",
            content: chunk.content,
            seq,
          })}\n\n`
        );

        // Handle backpressure: write() returns false once the socket
        // buffer is full; pause the stream until it drains
        if (!ok) {
          await new Promise<void>((resolve) => res.once("drain", resolve));
        }
      }
 
      if (chunk.type === "done") {
        res.write(`data: ${JSON.stringify({ type: "done" })}\n\n`);
      }
    }
  } catch (error) {
    res.write(
      `data: ${JSON.stringify({
        type: "error",
        message: error instanceof Error ? error.message : "Unknown error",
      })}\n\n`
    );
  }
 
  res.end();
});

The buffering traps that silently break SSE

Sound familiar? SSE works perfectly on localhost, then tokens arrive in 500ms bursts the moment you deploy behind a proxy. The culprit is almost always response buffering at one of these layers:

nginx
# Nginx SSE configuration — every directive here matters
location /api/chat/stream {
    proxy_pass http://backend:3000;
 
    # Disable buffering — without this, Nginx holds chunks until its buffer fills
    proxy_buffering off;
    proxy_cache off;
 
    # HTTP/1.1 keepalive for persistent connection
    proxy_http_version 1.1;
    proxy_set_header Connection '';
 
    # Extend timeouts for long-running streams and tool calls
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
 
    # Gzip buffers until it has enough data to compress — kills streaming
    gzip off;
}

The common failure modes and their fixes:

| Problem | Cause | Fix |
| --- | --- | --- |
| Tokens arrive in 500ms batches | proxy_buffering on (Nginx default) | proxy_buffering off |
| Smooth locally, batchy in prod | Gzip compression buffering | gzip off on streaming endpoints |
| Stream dies after 100s of silence | Cloudflare idle timeout | Send ": keepalive\n\n" every 30s |
| Tokens burst after tool calls | ALB 60s idle timeout | Increase timeout or send heartbeats |
| Full response arrives at once | CDN response caching | Cache-Control: no-cache header |

Cloudflare specifically terminates connections that go silent for 100 seconds, which matters for agents running long tool calls. Send SSE comment heartbeats during tool execution:

typescript
// Keep Cloudflare alive while a tool call is running
const heartbeat = setInterval(() => {
  if (!res.writableEnded) {
    res.write(": keepalive\n\n");
  }
}, 30_000);
 
try {
  // ... stream tokens, execute tools, etc.
} finally {
  clearInterval(heartbeat);
}

Consuming SSE in the browser

The built-in EventSource only handles GET. For POST (required when you need to send message history or auth headers), use fetch with a streaming body reader:

typescript
function streamChat(
  agentId: string,
  messages: Array<{ role: string; content: string }>,
  onToken: (content: string) => void
): () => void {
  const controller = new AbortController();
  const start = Date.now();
  let ttft: number | null = null;

  // Run the read loop in the background so the cancel function can be
  // returned to the caller immediately, not after the stream finishes
  (async () => {
    const response = await fetch("/api/chat/stream", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${getToken()}`,
      },
      body: JSON.stringify({ agentId, messages }),
      signal: controller.signal,
    });

    const reader = response.body!.getReader();
    const decoder = new TextDecoder();
    let buffer = "";

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      buffer += decoder.decode(value, { stream: true });

      // Split on newlines; a data line is safe to parse once its
      // trailing newline has arrived
      const lines = buffer.split("\n");
      buffer = lines.pop()!; // Hold incomplete line in buffer

      for (const line of lines) {
        if (!line.startsWith("data: ")) continue;

        const data = JSON.parse(line.slice(6));
        if (data.type === "done") return; // Matches the server's done event

        if (data.type === "token") {
          if (!ttft) {
            ttft = Date.now() - start;
            console.log(`Time to first token: ${ttft}ms`);
          }
          onToken(data.content);
        }
      }
    }
  })().catch((err) => {
    if ((err as Error).name !== "AbortError") console.error(err);
  });

  // Expose cancel for the UI; calling it aborts the in-flight fetch
  return () => controller.abort();
}

WebSockets: when you need bidirectional real-time

WebSockets are necessary when both sides need to send data simultaneously, which in practice means voice audio streaming and barge-in handling. They open a persistent, full-duplex TCP connection where either side can send messages at any time. The upgrade from HTTP happens once on connect, then all subsequent messages travel over the same persistent socket.

What makes voice agents different from chat? The cases that require WebSockets are specific:

  1. Barge-in / interruption handling. The user starts talking while the agent is still speaking. Your system needs to simultaneously receive that audio, stop TTS playback, cancel the in-flight LLM generation, and re-route to STT, all triggered by a client event that arrives while the server is actively streaming audio back. SSE can't handle this because the client has no channel to send the interruption signal.

  2. Continuous audio streaming. Sending microphone audio in real time requires continuous client-to-server transmission. SSE is server-to-client only.

  3. Multi-turn coordination. Some architectures need the client to send semantic events mid-stream, signaling that a user nodded, confirming a detected intent, or injecting tool results from the client side.

[Diagram] WebSocket voice: the barge-in signal travels in the opposite direction to the audio stream. The client streams {type: "audio"} chunks up while the server streams TTS audio back; when the user starts speaking mid-response, a {type: "barge_in"} event triggers abort() on the LLM and flush() on TTS, and the server replies {type: "listening"} before accepting new audio.

Here's a WebSocket server that handles barge-in:

typescript
import { WebSocketServer, WebSocket } from "ws";
 
const wss = new WebSocketServer({ port: 8080 });
 
wss.on("connection", (ws: WebSocket) => {
  let activeController: AbortController | null = null;
  let isStreaming = false;
 
  ws.on("message", async (raw: Buffer) => {
    const message = JSON.parse(raw.toString());
 
    if (message.type === "barge_in") {
      // User started speaking — immediately cancel ongoing generation
      if (activeController) {
        activeController.abort();
        activeController = null;
        isStreaming = false;
      }
      ws.send(JSON.stringify({ type: "listening" }));
      return;
    }
 
    if (message.type === "audio_chunk") {
      // Route audio to your STT provider (Deepgram, AssemblyAI, etc.)
      await forwardToSTT(message.data);
      return;
    }
 
    if (message.type === "transcript") {
      // Full transcript ready — generate and stream response
      activeController = new AbortController();
      isStreaming = true;
 
      try {
        const stream = await getLLMStream(
          message.agentId,
          message.transcript,
          { signal: activeController.signal }
        );
 
        for await (const chunk of stream) {
          if (!isStreaming) break; // Barge-in may have cleared this flag
 
          if (chunk.type === "token") {
            // Feed tokens to TTS, then stream audio back
            const audioChunk = await synthesize(chunk.content);
            ws.send(
              JSON.stringify({
                type: "audio",
                data: audioChunk,
              })
            );
          }
        }
 
        if (isStreaming) {
          ws.send(JSON.stringify({ type: "done" }));
        }
      } catch (err: any) {
        if (err.name === "AbortError") return; // Expected on barge-in
        ws.send(JSON.stringify({ type: "error", message: err.message }));
      } finally {
        activeController = null;
        isStreaming = false;
      }
    }
  });
 
  ws.on("close", () => {
    activeController?.abort();
  });
});
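The interruption handling above boils down to a small turn-taking state machine. Here is a sketch; the state and event names are assumptions, and your VAD and transcript events will differ:

```typescript
// Turn-taking states for a voice agent; names are illustrative
type TurnState = "idle" | "listening" | "speaking";
type TurnEvent = "vad_speech" | "transcript_final" | "audio_done";

function nextTurnState(state: TurnState, event: TurnEvent): TurnState {
  // Speech detected while the agent is speaking is a barge-in:
  // abort generation and return to listening, whatever the state was
  if (event === "vad_speech") return "listening";
  // A final transcript kicks off generation and playback
  if (event === "transcript_final" && state === "listening") return "speaking";
  // Playback finished cleanly
  if (event === "audio_done" && state === "speaking") return "idle";
  return state;
}
```

Keeping the transitions in one pure function makes the barge-in path easy to unit-test without a live socket.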

The decision framework: SSE or WebSockets?

Pick SSE when data flows one way (server to client); pick WebSockets when both sides need to talk simultaneously. The question isn't which is "better" but which matches your data flow:

| Criterion | SSE | WebSockets |
| --- | --- | --- |
| Direction | Server → client only | Bidirectional |
| Protocol | Standard HTTP (no upgrade) | TCP upgrade to ws:// or wss:// |
| Reconnection | Automatic via EventSource | You implement retry logic |
| Proxy/CDN support | Works everywhere | Needs explicit proxy support |
| Auth | Standard HTTP headers | Query param or first message (no custom headers on upgrade) |
| HTTP/2 multiplexing | Yes, multiple SSE streams over one TCP connection | No, each WebSocket is a separate connection |
| Complexity | Low: standard HTTP semantics | Higher: connection state, heartbeats, reconnection |
| Voice barge-in | Not possible | Native |
| Token streaming | Yes | Yes |
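The auth row deserves a concrete note: browsers cannot attach custom headers to the WebSocket upgrade, so a common pattern is to authenticate with the first frame. A sketch of validating that frame; the {type: "auth", token} shape is an assumption, align it with your own protocol:

```typescript
// Validate a first-message auth frame. The frame shape is an assumption.
function parseAuthFrame(raw: string): { ok: boolean; token?: string } {
  try {
    const msg = JSON.parse(raw);
    if (msg?.type === "auth" && typeof msg.token === "string" && msg.token.length > 0) {
      return { ok: true, token: msg.token };
    }
  } catch {
    // Malformed JSON: treat as unauthenticated
  }
  return { ok: false };
}

// Server side (sketch): close the socket unless a valid auth frame
// arrives within a grace period. verifyToken is your own helper.
//
//   const timer = setTimeout(() => ws.close(4401, "auth timeout"), 5_000);
//   ws.once("message", (raw) => {
//     clearTimeout(timer);
//     const auth = parseAuthFrame(raw.toString());
//     if (!auth.ok || !verifyToken(auth.token)) ws.close(4403, "unauthorized");
//   });
```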

Use SSE when:

  • You're streaming LLM tokens to a chat UI
  • You're pushing notifications, status updates, or analytics events
  • You want to stream agent monitoring events to a dashboard
  • The client sends a request and waits for a streamed response, no events mid-stream

Use WebSockets when:

  • You need barge-in / interruption detection
  • You're streaming raw audio bidirectionally
  • You're building collaborative real-time features where multiple participants send and receive
  • The client needs to send events (not just messages) during an active server stream

For most AI chat products, SSE is the right choice. WebSockets add real complexity: you own reconnection, heartbeat management, and connection state. Don't reach for WebSockets because they feel more "real-time." Reach for them when you genuinely need the client to push events mid-stream. You can validate both paths without production traffic by running synthetic test scenarios against your agent first.
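If you do take on WebSockets, the reconnection logic you now own can be as small as an exponential backoff schedule plus a reopen loop. A sketch; the base delay and cap are assumptions to tune, and production code should add jitter to avoid thundering herds:

```typescript
// Exponential backoff schedule: 250ms, 500ms, 1s, ... capped at 10s
function backoffDelay(attempt: number, baseMs = 250, capMs = 10_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Reconnecting client using the schedule above. Accesses WebSocket via
// globalThis to stay runtime-agnostic in this sketch.
function connectWithRetry(url: string, onMessage: (data: string) => void): void {
  let attempt = 0;
  const open = () => {
    const ws = new (globalThis as any).WebSocket(url);
    ws.onopen = () => { attempt = 0; };            // Reset backoff on success
    ws.onmessage = (ev: { data: string }) => onMessage(ev.data);
    ws.onclose = () => {
      setTimeout(open, backoffDelay(attempt++));   // Retry with growing delay
    };
  };
  open();
}
```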

Building the streaming pipeline correctly

The transport protocol is one decision; wiring the full pipeline to stream end-to-end is the harder problem, and it's what actually determines your latency floor. A common mistake: teams add SSE at the HTTP layer but leave batch processing inside the pipeline stages. Tokens stream beautifully to the browser, but the LLM doesn't start until the full transcript is ready.

Here's what "streaming throughout" means in practice for a voice agent:

typescript
async function processVoiceTurn(
  audioChunks: AsyncIterable<Buffer>,
  agentId: string,
  ws: WebSocket
): Promise<void> {
  // Stage 1: Stream audio to STT — don't wait for full utterance
  const transcriptStream = await stt.streamTranscribe(audioChunks);
 
  // Stage 2: Start LLM as soon as we have enough context — don't wait for full transcript
  const partialTranscripts: string[] = [];
  let llmStream: AsyncIterable<LLMChunk> | null = null;
 
  for await (const transcript of transcriptStream) {
    partialTranscripts.push(transcript.text);
 
    // Fire LLM on end-of-utterance signal, not end-of-transcript
    if (transcript.isFinal && !llmStream) {
      const fullTranscript = partialTranscripts.join(" ");
      llmStream = await getLLMStream(agentId, fullTranscript);
 
      // Stage 3: Start TTS on first LLM token — don't wait for full response
      processLLMToTTS(llmStream, ws).catch(console.error);
    }
  }
}
 
async function processLLMToTTS(
  llmStream: AsyncIterable<LLMChunk>,
  ws: WebSocket
): Promise<void> {
  const ttsStream = tts.createStream();

  // Feed LLM tokens into TTS as they arrive. Run this loop in the
  // background so the audio-forwarding loop below starts immediately,
  // rather than waiting for the LLM stream to finish first.
  (async () => {
    for await (const chunk of llmStream) {
      if (chunk.type === "token") {
        ttsStream.write(chunk.content);
      }

      if (chunk.type === "done") {
        ttsStream.end();
      }
    }
  })().catch(console.error);

  // Forward TTS audio chunks to client as they synthesize
  for await (const audioChunk of ttsStream) {
    ws.send(JSON.stringify({ type: "audio", data: audioChunk.toString("base64") }));
  }
}

The key design choice: processLLMToTTS runs concurrently with the transcript loop, not after it. The for await on transcriptStream and the processLLMToTTS call run in parallel because we await the LLM stream setup and then .catch() the downstream chain. We don't await it inline. This is what creates the pipeline overlap.
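One refinement worth calling out: feeding single tokens straight to TTS produces choppy prosody, so many pipelines buffer tokens into clause or sentence boundaries first. A sketch; the boundary regex is an assumption, tune it for your language and domain:

```typescript
// Buffer streamed LLM tokens into sentence-sized chunks before TTS.
// Flushes on . ! ? followed by whitespace; remainder flushes at stream end.
function createSentenceChunker(onSentence: (sentence: string) => void) {
  let buffer = "";
  return {
    push(token: string) {
      buffer += token;
      let m: RegExpMatchArray | null;
      // Emit every complete sentence currently sitting in the buffer
      while ((m = buffer.match(/^(.*?[.!?])\s+/s))) {
        onSentence(m[1]);
        buffer = buffer.slice(m[0].length);
      }
    },
    flush() {
      // Emit whatever remains when the LLM stream ends
      if (buffer.trim()) onSentence(buffer.trim());
      buffer = "";
    },
  };
}
```

Wiring this between the LLM loop and ttsStream.write() trades a few hundred milliseconds of buffering for noticeably more natural speech.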

What actually controls perceived latency

Five factors dominate voice agent latency in production, and most teams optimize the wrong one first. The dominant factors, ranked by impact:

1. Time to first audio chunk matters more than total generation time. Users experience the gap between finishing their sentence and hearing the first syllable of the response. If that gap is under 400ms, the conversation feels alive. If it's over 800ms, it feels broken, even if the full response arrives 2 seconds later. You can track this per-call with latency analytics.

2. Pipeline parallelism delivers more than model optimization. Switching from GPT-4o to a faster model might save 30ms on first-token latency. Implementing proper streaming throughout the pipeline typically saves 400-800ms total. Optimize the architecture before optimizing the model selection.

3. Cold starts are the enemy of consistent latency. Ever seen an agent that's fast 95% of the time but occasionally takes 2 full seconds? A system that achieves 280ms P50 but has 2,000ms P99 due to cold containers will feel unreliable. Maintain warm capacity, implement predictive scaling, and route user-facing traffic away from cold instances. You can see exactly where this is happening with proper agent monitoring.

4. Network topology matters. A 40ms round trip from your user to your server, before any processing, is 40ms you can't recover elsewhere. Edge deployment (6-8 geographic regions rather than one central data center) directly lowers the floor for every request.

5. TTS provider selection has outsized impact. The difference between a TTS provider with 250ms time-to-first-audio-chunk and one with 60ms is larger than the entire first-token latency budget for a fast LLM. Cartesia Sonic Turbo achieves ~40ms TTFB; ElevenLabs Flash is around 75ms. OpenAI's TTS API runs 120-180ms. That gap matters.
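The first factor above, time to first audio, is also the easiest to instrument. A sketch of a per-turn tracker; the injectable clock keeps it testable, and the names are illustrative:

```typescript
// Measure time-to-first-audio (TTFA) per conversation turn
function createTTFATracker(now: () => number = Date.now) {
  let turnStart = 0;
  let firstAudioMs: number | null = null;
  return {
    startTurn() {
      turnStart = now();
      firstAudioMs = null;
    },
    onAudioChunk() {
      // Only the first chunk of the turn matters for perceived latency
      if (firstAudioMs === null) firstAudioMs = now() - turnStart;
    },
    get ttfa() {
      return firstAudioMs;
    },
  };
}
```

Call startTurn() on end-of-utterance and onAudioChunk() wherever you forward TTS audio, then ship the ttfa value to your metrics pipeline per call.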

Where SSE fits in voice agent monitoring

SSE's best non-obvious use case is operational monitoring: pushing live events from your backend to a dashboard as calls happen, without the overhead of WebSocket infrastructure. This isn't the real-time audio path (which uses WebSockets or WebRTC), but the observability layer sitting alongside it.

When an agent uses a tool, scores poorly on a quality scorecard, or hits an error, you want that signal to surface immediately, not in the next batch report. An SSE stream from your monitoring backend to your dashboard delivers those events cleanly, because the dashboard only needs to receive events, not send them.

typescript
// Monitoring SSE endpoint — pushes agent events as they happen
app.get("/api/monitoring/stream/:workspaceId", (req, res) => {
  const { workspaceId } = req.params;
 
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("X-Accel-Buffering", "no");
  res.flushHeaders();
 
  // Subscribe to events for this workspace
  const unsubscribe = eventBus.subscribe(workspaceId, (event) => {
    res.write(
      `data: ${JSON.stringify({
        type: event.type, // "tool_call", "score_update", "escalation"
        agentId: event.agentId,
        callId: event.callId,
        payload: event.payload,
        ts: event.timestamp,
      })}\n\n`
    );
  });
 
  req.on("close", unsubscribe);
});

This pattern, SSE for monitoring dashboards and WebSockets for real-time audio, is the common production architecture. Each transport does one thing well.

The production checklist

Before shipping a streaming voice agent, verify each layer. Miss any one of these and you'll spend hours debugging latency that isn't in the model:

Transport layer

  • SSE endpoints have proxy_buffering off in Nginx (or equivalent in your proxy)
  • gzip disabled on streaming routes
  • X-Accel-Buffering: no set as response header
  • Timeout configuration reviewed at every hop (proxy, ALB, CDN, Node.js)
  • Heartbeats enabled for connections with long-running tool calls

Pipeline

  • STT streaming enabled, not batch transcription
  • LLM receives partial transcripts (or at least fires on end-of-utterance, not end-of-audio-file)
  • TTS starts synthesizing on first LLM token, not after full response
  • Backpressure handling: drain event on res.write() returning false

Reliability

  • SSE events tagged with sequence numbers for resume-on-reconnect
  • WebSocket reconnection logic implemented with exponential backoff
  • Abort controllers cleaned up on disconnect (to avoid orphaned LLM calls)
  • Warm capacity maintained, no cold starts on P95 user-facing traffic

Observability

  • Time-to-first-token (TTFT) tracked per request
  • P50, P95, P99 latency tracked separately per pipeline stage
  • Error rates on SSE vs WebSocket connections tracked independently
  • Tool call latency tracked within stream events

The choice between SSE and WebSockets resolves cleanly once you're clear about data flow direction. For most AI chat and monitoring use cases, SSE is the right answer. It's simpler, works everywhere, and reconnects automatically. For voice with barge-in, audio streaming, or real-time multi-participant scenarios, WebSockets are necessary.

The harder work is building the pipeline to actually stream end-to-end, not just at the HTTP layer, but between every stage from STT to LLM to TTS. That's where the 400ms savings live. The transport protocol is the last few milliseconds. Pipeline architecture is the first few hundred.
