Streaming and Latency Optimization: TTFT, SSE, KV Cache, and Batching
Design low-latency LLM experiences using streaming, Server-Sent Events, time-to-first-token optimization, KV cache management, speculative decoding, batching, and context reduction.
Start With a Slow Chat Experience
Imagine you built a RAG-powered support chat. The first user types a question:
"Can I pause my annual subscription?"
Then they wait. And wait.
Four seconds later, the response appears all at once. It is correct. But the user has already decided the chat feels sluggish. They close the window and submit a support ticket instead.
Now imagine the same chat streams each token as it is generated. The user sees text appear token by token:
"The... annual... subscriptions... cannot... be... paused..."
The total generation time is still four seconds. But the user starts reading after 300 milliseconds instead of four seconds. The experience feels fast because the output begins immediately.
This is the core insight of LLM latency optimization: users perceive speed based on when output starts, not when it finishes. Streaming converts wall-clock time into perceived responsiveness.
Mental model: Streaming does not make total generation faster. It makes the user experience feel faster because output starts earlier. The goal is to minimize the silent gap between submitting a question and seeing the first character of the answer.
What Streaming and Latency Optimization Means
LLM latency is not a single number. An API call that returns one complete response after 4 seconds is fast in absolute terms but feels slow. A streaming call that starts output after 300ms and finishes at 4 seconds feels fast even though the total time is the same.
The difference is that LLM generation happens in two phases — prefill and decode — with fundamentally different performance characteristics.
The Four Latency Metrics
| Metric | Meaning | Why It Matters | Typical Range |
|---|---|---|---|
| TTFT (Time to First Token) | Time from request submission to first output token | Determines perceived responsiveness | 200ms–5s |
| TPOT (Time Per Output Token) | Milliseconds between consecutive tokens | Determines how fast text streams | 10ms–100ms |
| Tokens per second | 1000 / TPOT (roughly) | Throughput for long generations | 10–100 tok/s |
| Total latency | TTFT + (num_output_tokens × TPOT) | Time until response is complete | 1s–30s |
Important distinction: TTFT and throughput trade off against each other. Prioritizing prefill for new requests lowers TTFT but interrupts decoding for requests that are already streaming. Larger batches raise throughput but add queueing delay, which raises TTFT. Optimize for the metric that matters to your user experience.
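The four metrics in the table can be derived from token arrival timestamps. The following is a minimal sketch, assuming you record a request start time and one timestamp per received token, all in milliseconds; the names `StreamMetrics` and `measure_stream` are illustrative, not part of any library. It requires Python 3.9 or later.

```python
from dataclasses import dataclass


@dataclass
class StreamMetrics:
    ttft_ms: float            # request start -> first token
    tpot_ms: float            # mean gap between consecutive tokens
    tokens_per_second: float  # 1000 / TPOT
    total_ms: float           # request start -> last token


def measure_stream(request_start_ms: float, token_times_ms: list[float]) -> StreamMetrics:
    """Derive the four latency metrics from token arrival timestamps."""
    if not token_times_ms:
        raise ValueError("no tokens were received")
    ttft = token_times_ms[0] - request_start_ms
    total = token_times_ms[-1] - request_start_ms
    gaps = len(token_times_ms) - 1
    # With a single token there are no inter-token gaps to average.
    tpot = (total - ttft) / gaps if gaps else 0.0
    tps = 1000.0 / tpot if tpot > 0 else 0.0
    return StreamMetrics(ttft, tpot, tps, total)


# First token at 300 ms, then one token every 40 ms.
arrivals = [300.0 + 40.0 * i for i in range(5)]
metrics = measure_stream(0.0, arrivals)
print(metrics)
```

Measuring on the client side captures network time as well, which is usually what the user actually experiences.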
Why Not Just Make the Model Faster?
Beginners often ask: "Can I just use a faster model?"
A faster model helps, but it is not the full answer. Latency in an LLM system comes from many layers, and the generation speed is only one of them.
| Latency Source | Contribution | Can a faster model fix it? |
|---|---|---|
| Network round trip | 10–100ms | No |
| Provider queueing | 0ms–10s | No (may make it worse if faster = more popular) |
| Prompt processing (prefill) | 100ms–3s | Partially (depends on context size) |
| Retrieval and reranking | 100ms–1s | No (RAG pipeline is separate) |
| Token generation (decode) | 5ms–100ms per token | Yes |
| Client rendering | 1–50ms | No |
A faster model mainly improves decode speed and, to a lesser extent, prefill. The other layers (network, queueing, retrieval, rendering) require architectural changes.
Step 1: Streaming with Server-Sent Events
Streaming is the foundation of low-latency LLM user experience. Instead of waiting for the full response, the server sends each token as it is generated.
How SSE Works
Server-Sent Events is a standard HTTP protocol for one-way streaming. The client opens an HTTP connection, and the server pushes events as they become available.
event: token
data: {"text":"The"}

event: token
data: {"text":" annual"}

event: token
data: {"text":" subscription"}

event: token
data: {"text":" cannot"}

event: done
data: {}

SSE Event Types for LLM Responses
| Event | Purpose | Payload |
|---|---|---|
| status | Inform client of progress before generation starts | {status: "retrieving"}, {status: "generating"} |
| token | Each generated token | {text: "The"} |
| citation | Source citation when it becomes relevant | {source: "billing-faq.pdf", id: "S1"} |
| error | Something went wrong | {code: "TIMEOUT", message: "..."} |
| done | Generation complete | {totalTokens: 342, latency: 4200} |
| meta | Final metadata without the text | {model: "gpt-4", latency: 4200} |
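To make the wire format concrete, here is a small sketch of a serializer and a parser for events like the ones in the table. The function names `format_sse` and `parse_sse` are illustrative assumptions, and the parser handles only the `event` and `data` fields plus comment lines, not the full SSE specification. It requires Python 3.9 or later for `str.removeprefix`.

```python
import json


def format_sse(event: str, payload: dict) -> str:
    """Serialize one event in SSE wire format: fields, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def parse_sse(stream: str) -> list[tuple[str, dict]]:
    """Parse a complete SSE text stream into (event, payload) pairs."""
    events = []
    for block in stream.split("\n\n"):           # a blank line ends each event
        event, data_lines = "message", []        # "message" is the SSE default type
        for line in block.splitlines():
            if line.startswith(":"):             # comment line, e.g. a keepalive
                continue
            field, _, value = line.partition(":")
            value = value.removeprefix(" ")      # one optional space after the colon
            if field == "event":
                event = value
            elif field == "data":
                data_lines.append(value)
        if data_lines:
            events.append((event, json.loads("\n".join(data_lines))))
    return events


wire = (
    format_sse("status", {"status": "generating"})
    + format_sse("token", {"text": " annual"})
    + ": keepalive\n\n"
    + format_sse("done", {"totalTokens": 1})
)
parsed = parse_sse(wire)
print(parsed)
```

Note that the leading space in `" annual"` survives the round trip because it lives inside the JSON payload, not in the SSE field value.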
SSE vs WebSockets vs Plain HTTP
| Protocol | Direction | Best For | Complexity |
|---|---|---|---|
| SSE | Server → Client only | Token streaming, status updates | Low |
| WebSocket | Bidirectional | Chat with interruption, streaming + user input | Medium |
| Plain HTTP | Request → Response | Non-interactive, background tasks | Lowest |
SSE is usually the simplest choice for chat and assistant responses. WebSockets add complexity (connection management, reconnection, message framing) that is rarely justified when the only server-to-client traffic is token deltas.
SSE Implementation Considerations
- Connection timeout: LLM generation can take 30+ seconds. Ensure your load balancer and reverse proxy do not time out long-lived or briefly idle connections.
- Buffering: Disable response buffering in your web framework and reverse proxy. Buffered SSE defeats the purpose of streaming.
- Error recovery: If the provider connection drops mid-stream, the server should attempt to reconnect or send an error event to the client.
- Cancel propagation: When the client disconnects, propagate the cancellation to the provider to stop generation and avoid paying for unused tokens.
Step 2: Optimizing Time to First Token
TTFT is the most visible latency metric because it is the silent gap the user experiences before anything happens. Optimizing TTFT means reducing the work done before the first token is emitted.
What Happens Before the First Token
Before the first token is emitted, a request passes through network transit, authentication, retrieval, reranking, context assembly, provider queueing, and prefill. Every one of these stages adds to TTFT.
TTFT Optimization Techniques
| Technique | Effect on TTFT | Tradeoff |
|---|---|---|
| Reduce prompt tokens | Less prefill work | Shorter context means less evidence |
| Cache stable prefix | Reuse prompt processing for shared prefix | Requires prompt structure discipline |
| Stream retrieval results separately | User sees progress during retrieval | More complex UI |
| Start retrieval as early as possible | Overlaps retrieval with other pre-generation work | Requires careful timing |
| Use smaller model | Faster prefill and first token | Lower quality |
| Keep retrieval top-k small | Less context assembly overhead | May miss relevant evidence |
| Pre-warm provider connections | Avoid connection setup latency | Increases baseline cost |
| Choose providers with low queueing | Skip or reduce queue wait | May be more expensive per token |
Large context windows are not free: Sending 100K tokens because the model supports it can create high TTFT, high cost, and weaker attention to the important parts. The model may accept 100K tokens, but your user will not accept a 10-second TTFT.
Prefix Caching for TTFT
If every request shares the same system prompt, tool definitions, and policy text, the provider can cache the KV entries computed for that prefix. When a new request arrives with the same prefix, prefill for the cached portion is skipped and only the new suffix has to be processed.
For prompts with a large shared prefix, prefix caching is often the most impactful TTFT optimization. Design your prompts so that the static prefix is as large as possible and comes first, and the dynamic suffix is as small as possible and comes last. A prefix cache can only match from the start of the prompt.
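The prompt-ordering discipline can be sketched as follows. This is an illustration under stated assumptions: `build_prompt` and `shared_prefix_fraction` are hypothetical helpers, and the character-level comparison is only a rough proxy, since real prefix caches operate on tokens and provider-specific granularity.

```python
def build_prompt(system: str, tools: str, policy: str,
                 retrieved: list[str], question: str) -> str:
    """Static parts first (cacheable), per-request parts last."""
    static_prefix = "\n\n".join([system, tools, policy])
    dynamic_suffix = "\n\n".join(retrieved + [f"Question: {question}"])
    return static_prefix + "\n\n" + dynamic_suffix


def shared_prefix_fraction(a: str, b: str) -> float:
    """Fraction of prompt `b` (by characters) that is a common prefix with `a`."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n / len(b) if b else 0.0


p1 = build_prompt("SYS", "TOOLS", "POLICY", ["chunk A"], "Can I pause?")
p2 = build_prompt("SYS", "TOOLS", "POLICY", ["chunk B"], "How to cancel?")
print(f"reusable prefix: {shared_prefix_fraction(p1, p2):.0%}")
```

Anything that varies per request, such as a timestamp or user name placed near the top of the prompt, ends the shared prefix at that point and makes everything after it uncacheable.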
Step 3: KV Cache Management
During the prefill phase, the model processes all input tokens in parallel and stores each token's key and value vectors, for every layer, in the KV cache. During the decode phase, each new token attends to these cached entries rather than recomputing keys and values for all previous tokens.
The KV cache avoids recomputation, but it does not make long contexts free. Every new token must still attend to all previous KV entries, so each decode step gets slower as the context grows. Doubling the output length therefore more than doubles the attention cost, because each subsequent step has more KV entries to read.
KV Cache Memory
The memory required for the KV cache grows with batch size, context length, model size, and precision.
KV cache memory per token =
2 (key + value)
× num_hidden_layers
× num_kv_heads (equal to num_attention_heads unless the model uses grouped-query attention)
× head_dimension
× bytes_per_element
Example for a 70B-class model with full multi-head attention:
= 2 × 80 layers × 64 heads × 128 dim × 2 bytes (FP16)
= 2,621,440 bytes ≈ 2.6 MB per token
For a 4096-token context with batch size 32:
= 4096 × 32 × 2,621,440 bytes
≈ 344 GB
This is why large contexts with high concurrency require multiple GPUs and careful memory management.
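The arithmetic above is easy to script for capacity planning. The sketch below uses hypothetical function names and assumes decimal gigabytes (1 GB = 10^9 bytes). Exact arithmetic gives about 343.6 GB for the example; rounding the per-token figure to 2.6 MB first gives a slightly lower number. The second configuration assumes a model with grouped-query attention that stores 8 KV heads instead of 64, to show how strongly the head count affects the result.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_element: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_gb(context_len: int, batch_size: int, per_token_bytes: int) -> float:
    """Total KV cache in decimal gigabytes (1 GB = 1e9 bytes)."""
    return context_len * batch_size * per_token_bytes / 1e9


# Full multi-head attention: every attention head stores its own K and V.
mha = kv_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)
# Assumed grouped-query variant: only 8 KV heads are stored.
gqa = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

print(f"MHA: {mha} bytes/token, {kv_cache_gb(4096, 32, mha):.1f} GB total")
print(f"GQA: {gqa} bytes/token, {kv_cache_gb(4096, 32, gqa):.1f} GB total")
```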
Serving Implications of KV Cache
| Concern | Impact | Mitigation |
|---|---|---|
| Long prompts | More prefill work and KV memory | Prefix caching, prompt compression |
| Long outputs | More decode steps, growing KV cache | Shorter answers, early stopping |
| Many concurrent streams | More active KV cache competing for GPU memory | Dynamic batching, memory management |
| Larger models | KV cache scales with model dimensions | Use smaller models where possible |
| Batch processing | KV cache scales with batch size × context length | Continuous batching |
PagedAttention
PagedAttention is a technique that manages KV cache in fixed-size blocks (pages) rather than contiguous memory. This removes the need for large contiguous allocations, greatly reduces fragmentation, and allows the GPU to use memory more efficiently, increasing the number of concurrent requests that can be served.
Without PagedAttention, the KV cache for each request must be stored in a contiguous block of GPU memory. This leads to fragmentation — the GPU may have enough total free memory but not enough contiguous space for a new request. PagedAttention solves this by storing KV cache in pages that can be non-contiguous, similar to how virtual memory works in operating systems.
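The block-table idea can be illustrated with a toy allocator. This is a conceptual sketch only, not how any particular serving engine implements it: the class `PagedKVAllocator` is hypothetical, and it tracks block ids rather than real GPU memory.

```python
class PagedKVAllocator:
    """Toy block allocator: each request maps to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.token_counts: dict[str, int] = {}

    def append_token(self, request_id: str) -> bool:
        """Reserve room for one more token; return False if memory is exhausted."""
        count = self.token_counts.get(request_id, 0)
        if count % self.block_size == 0:       # current block is full, or none yet
            if not self.free_blocks:
                return False
            self.block_tables.setdefault(request_id, []).append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1
        return True

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for rid in ["A", "B", "A", "B", "A"]:          # two requests decode interleaved
    alloc.append_token(rid)
print(alloc.block_tables)                      # A's blocks need not be adjacent
```

Because a request only needs any free block, not a contiguous run, memory freed by a finished request is immediately usable by another one. An `append_token` that returns `False` is also a natural hook for the memory-based admission control discussed later.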
Step 4: Batching Strategies
Batching groups multiple requests together to improve GPU utilization. Instead of processing one request at a time, the GPU processes N requests simultaneously, sharing the cost of loading model weights.
Batching Types
| Batch Type | How It Works | Best For | Latency Impact |
|---|---|---|---|
| Static batching | Group requests with similar sizes before inference | Offline jobs, batch processing | High queueing delay |
| Dynamic batching | Accumulate requests for a short window, then batch | Online serving with moderate traffic | Medium queueing delay |
| Continuous batching | Requests join and leave the batch during decoding | LLM serving with variable-length outputs | Low queueing delay |
Continuous Batching (Decoding)
Continuous batching is the standard for modern LLM serving. In traditional batching, all requests in the batch must complete before the next batch starts. In continuous batching, a request leaves the batch when it finishes generating, and a new request can join immediately.
Continuous batching improves GPU utilization by up to 2-3x compared to static batching, especially when output lengths vary significantly between requests.
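A small simulation shows where the gain comes from. This sketch counts decode steps only and ignores prefill, scheduling overhead, and memory limits, so the ratio it prints is illustrative rather than a benchmark. The function names are assumptions.

```python
def static_batch_steps(output_lens: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batch_steps(output_lens: list[int], batch_size: int) -> int:
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = list(output_lens)
    running: list[int] = []          # tokens still to generate per active slot
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1                   # one decode step for every active request
        running = [r - 1 for r in running if r > 1]
    return steps


# Output lengths (in tokens) vary a lot between requests; each is at least 1.
lens = [10, 2, 2, 10, 2, 2]
static = static_batch_steps(lens, batch_size=2)
continuous = continuous_batch_steps(lens, batch_size=2)
print(f"static: {static} steps, continuous: {continuous} steps")
```

In the static case, short requests sit idle in their slot until the longest request in the batch finishes. The more output lengths vary, the larger the gap between the two numbers.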
The Batching Tradeoff
Batching improves throughput (requests per second) but can increase per-request latency. A request that arrives at a busy time may wait in the queue for the current batch to finish before it enters the next batch.
For interactive applications, set a maximum queue time. If a request has waited longer than N milliseconds, process it alone rather than waiting for more requests to join the batch.
Step 5: Speculative Decoding
Speculative decoding reduces generation latency by using a smaller, faster model (the draft model) to propose multiple candidate tokens, then using the larger target model to verify them in a single forward pass.
How Speculative Decoding Saves Time
Normal generation: K sequential decode steps, each requiring one forward pass through the large model.
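Speculative decoding: K cheap sequential steps through the small draft model, then one verification pass through the large model that scores all K proposed tokens in parallel.
If the draft model is accurate enough that most tokens are accepted, speculative decoding produces K tokens in roughly the time of one large-model decode step plus K small-model steps, instead of K large-model steps. At the first rejected token, the target model's own prediction is used instead, so every verification pass yields at least one token.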
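The propose-then-verify loop can be sketched with toy models. This is a simplified greedy variant under stated assumptions: `target` and `draft` are stand-in functions that map a token list to the next token, and acceptance means an exact match with the target's greedy choice. Production systems that sample from a distribution use a probabilistic acceptance rule instead, and they run verification as one batched forward pass rather than a Python loop.

```python
from typing import Callable

NextToken = Callable[[list[int]], int]


def greedy_decode(target: NextToken, prompt: list[int], num_tokens: int) -> list[int]:
    """Baseline: one target-model step per generated token."""
    out: list[int] = []
    for _ in range(num_tokens):
        out.append(target(prompt + out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: list[int],
                       num_tokens: int, k: int = 4) -> tuple[list[int], int]:
    """Greedy speculative decoding. Returns (generated tokens, verification passes)."""
    out: list[int] = []
    passes = 0
    while len(out) < num_tokens:
        # 1. The draft model proposes k tokens autoregressively (k cheap steps).
        proposal: list[int] = []
        for _ in range(k):
            proposal.append(draft(prompt + out + proposal))
        # 2. The target checks all k positions. A real server does this in ONE
        #    batched forward pass; the loop below only emulates that pass.
        passes += 1
        for i in range(k + 1):
            expected = target(prompt + out + proposal[:i])
            if i == k or proposal[i] != expected:
                # Keep the i matching tokens, then the target's own token.
                out.extend(proposal[:i] + [expected])
                break
    return out[:num_tokens], passes


def target(ctx: list[int]) -> int:   # toy target model: count upward modulo 10
    return (ctx[-1] + 1) % 10


def draft(ctx: list[int]) -> int:    # toy draft model: wrong whenever it sees a 2
    return 9 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


tokens, passes = speculative_decode(target, draft, prompt=[0], num_tokens=12, k=4)
print(tokens, passes)
```

The output is identical to plain greedy decoding with the target model. Only the number of large-model passes changes, which is why a mismatched draft model costs speed but not quality.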
Acceptance Rate
The effectiveness of speculative decoding depends on the acceptance rate — the fraction of draft tokens that the target model accepts.
| Acceptance Rate | Speedup | Scenario |
|---|---|---|
| 90%+ | 3–5x | Simple, predictable text (classification, extraction) |
| 70–90% | 2–3x | General chat and Q&A |
| Below 50% | Negligible or negative | Creative, unpredictable text |
Speculative decoding is mostly a serving-layer optimization. It may not be exposed or configurable by every provider. If you are self-hosting, it is one of the most impactful optimizations available.
Practical advice: Speculative decoding works best when the draft model is aligned with the target model. A draft model trained on similar data will propose tokens the target model is likely to accept. A mismatched draft model (e.g., a code-specialized draft with a chat-specialized target) may have low acceptance rates and no speedup.
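A back-of-the-envelope model helps relate acceptance rate to benefit. The sketch below assumes each draft token is accepted independently with the same probability, which real text does not strictly satisfy, and it counts tokens per verification pass rather than wall-clock speedup, since it ignores the cost of running the draft model. The function name is an assumption.

```python
def expected_tokens_per_pass(acceptance_rate: float, k: int) -> float:
    """Expected tokens produced by one verification pass over k draft tokens.

    Assumes independent, identically likely acceptance per draft token.
    The pass always yields at least one token (the target's own prediction).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if acceptance_rate == 1.0:
        return float(k + 1)          # all k accepted, plus one bonus token
    return (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)


for rate in (0.5, 0.7, 0.9):
    print(f"acceptance {rate:.0%}: {expected_tokens_per_pass(rate, k=4):.2f} tokens/pass")
```

The actual speedup is lower than the tokens-per-pass figure because every pass also pays for k draft steps. That overhead is why low acceptance rates can end up slower than plain decoding.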
Step 6: Product UX for Perceived Latency
Latency optimization is not only backend work. UX patterns can dramatically reduce perceived waiting time and even reduce total generation time by encouraging concise answers.
UX Patterns for Latency
| Pattern | Why It Helps | Implementation |
|---|---|---|
| Stream answer text | User sees progress immediately | SSE events from server |
| Show retrieval progress | Makes waiting understandable | status: "searching docs..." event |
| Show intermediate steps | Builds trust for multi-step answers | "Step 1 of 3: checking config..." |
| Render citations as they arrive | Builds trust incrementally | Citation events alongside token events |
| Allow cancel | Saves cost and user time | AbortController on client, cancel propagation to provider |
| Generate concise by default | Lower latency and cost | System prompt: "Answer in 2-3 sentences" |
| Continue button for long answers | User controls generation scope | Show "Continue generating?" after initial response |
| Skeleton loading | Shows where content will appear | Fixed-height container with placeholder |
| Typing indicator | Familiar UI pattern | Animated cursor or dots before first token |
Common Failure Stories
The TTFT Was 8 Seconds Because of Context Bloat
A developer adds every retrieved chunk to the prompt without trimming. The RAG pipeline returns 25 chunks of 500 tokens each. The prompt is 12,500 tokens long. TTFT is 8 seconds because the prefill phase must process all those tokens before generating the first word.
The fix: trim retrieval results aggressively. Set a hard limit on context tokens (e.g., 3000 tokens max). Rerank and select only the most relevant chunks. The model does not need every retrieved document to answer the question.
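A hard context budget can be enforced with a few lines. This is a minimal sketch: `select_chunks` is a hypothetical helper, chunks are assumed to arrive as `(relevance_score, text)` pairs from the reranker, and the default whitespace token counter is only a stand-in for the model's real tokenizer.

```python
def select_chunks(chunks: list[tuple[float, str]], max_tokens: int,
                  count_tokens=lambda text: len(text.split())) -> list[str]:
    """Keep the highest-scoring chunks that fit within the token budget."""
    selected: list[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue   # skip chunks that do not fit; smaller ones may still fit
        selected.append(text)
        used += cost
    return selected


chunks = [(0.9, "a b c"), (0.2, "d e f g"), (0.7, "h i"), (0.5, "j")]
kept = select_chunks(chunks, max_tokens=6)
print(kept)
```

Selecting by score first and budget second keeps the most relevant evidence even when the budget is tight.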
The Streaming Connection Timed Out
The load balancer has a 30-second timeout. A complex question takes 45 seconds to answer. The connection drops at 30 seconds and the client is left with an incomplete response.
The fix: configure load balancers, reverse proxies, and API gateways with timeouts that match the expected generation time. For idle timeouts, send periodic keepalives during silent stretches such as retrieval or a long prefill. An SSE comment line (a line starting with a colon) resets the idle timer without triggering the client's event handlers. Keepalives do not help against a hard maximum request duration, which has to be raised instead.
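One way to send keepalives is to wrap the outgoing event stream so that a comment frame is emitted whenever the upstream has been silent for too long. The sketch below uses only `asyncio` from the standard library; `with_keepalive` and `slow_source` are illustrative names, and the short intervals exist only to keep the demo fast.

```python
import asyncio
from typing import AsyncIterator


async def with_keepalive(events: AsyncIterator[str],
                         interval_s: float) -> AsyncIterator[str]:
    """Yield upstream SSE frames; emit a comment frame while the stream is silent."""
    iterator = events.__aiter__()
    pending = asyncio.ensure_future(iterator.__anext__())
    try:
        while True:
            # asyncio.wait does not cancel the pending read when it times out.
            done, _ = await asyncio.wait({pending}, timeout=interval_s)
            if not done:
                yield ": keepalive\n\n"      # SSE comment: ignored by clients
                continue
            try:
                frame = pending.result()
            except StopAsyncIteration:
                return                       # upstream finished
            yield frame
            pending = asyncio.ensure_future(iterator.__anext__())
    finally:
        pending.cancel()                     # stop reading if the consumer leaves


async def slow_source() -> AsyncIterator[str]:
    """Stand-in for a pipeline with a silent gap (e.g. retrieval or prefill)."""
    yield 'event: status\ndata: {"status": "retrieving"}\n\n'
    await asyncio.sleep(0.2)
    yield 'event: token\ndata: {"text": "The"}\n\n'


async def collect() -> list[str]:
    return [frame async for frame in with_keepalive(slow_source(), interval_s=0.05)]


frames = asyncio.run(collect())
print(frames)
```

In production the interval would be set comfortably below the shortest idle timeout on the path between server and client.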
The Cancel Button Did Nothing
The user clicks cancel after 10 seconds. The frontend stops rendering. But the server continues generating for another 20 seconds, and the provider bills for the full generation. The user has already moved on, but the cost is incurred.
The fix: propagate client-side cancellation to the provider. When the SSE connection closes, abort the provider API call. Implement cancellation tokens that flow from the HTTP request through the gateway to the provider SDK.
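The cancellation flow can be sketched with `asyncio`. Everything here is a simulation under assumptions: `fake_provider_stream` stands in for a provider SDK stream, and cancelling the handler task stands in for what a web framework does when the client disconnects, which varies by framework and should be verified for yours.

```python
import asyncio


async def fake_provider_stream(billed: list[str]):
    """Stand-in for a provider stream; records every token it generates."""
    for i in range(1000):
        await asyncio.sleep(0.01)
        billed.append(f"tok{i}")
        yield f"tok{i}"


async def stream_to_client(send, billed: list[str]) -> None:
    stream = fake_provider_stream(billed)
    try:
        async for token in stream:
            await send(token)
    finally:
        # Runs on normal completion AND on cancellation: release the upstream call.
        await stream.aclose()


async def demo() -> tuple[int, int]:
    billed: list[str] = []
    delivered: list[str] = []

    async def send(token: str) -> None:
        delivered.append(token)

    task = asyncio.create_task(stream_to_client(send, billed))
    await asyncio.sleep(0.05)
    task.cancel()                       # simulated client disconnect
    try:
        await task
    except asyncio.CancelledError:
        pass
    at_cancel = len(billed)
    await asyncio.sleep(0.05)           # give a leaked stream time to keep going
    return at_cancel, len(billed)


at_cancel, after = asyncio.run(demo())
print(f"tokens generated at cancel: {at_cancel}, later: {after}")
```

The two counts match, which means generation stopped when the client left. With a real SDK, the `finally` block is where the provider request would be aborted or its response closed.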
The KV Cache Exploded Under Load
A popular model is serving 100 concurrent users with 8K-token contexts. The GPU runs out of memory. Requests start failing with OOM errors. The serving layer crashes and requires a restart.
The fix: implement memory-based admission control. Track KV cache memory usage per request. Before accepting a new request, verify that enough GPU memory is available. Reject or queue requests when memory is exhausted. Use PagedAttention to reduce fragmentation.
The Batch Waited Too Long for More Requests
A low-traffic chat service batches requests for 500ms before sending them to the GPU. At 2 AM with only one active user, every request waits 500ms in the batch queue before processing starts. The user experiences consistent 500ms extra latency for no benefit.
The fix: implement a maximum queue timeout. If no other request arrives within 100ms, process the single request immediately. Batching should never add noticeable latency for interactive users.
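A bounded batching window might look like the following sketch. `collect_batch` is a hypothetical helper built on `asyncio.Queue`. It waits for the first request without a deadline, then gathers more only until either the batch is full or the window expires.

```python
import asyncio
import time


async def collect_batch(queue: asyncio.Queue, max_batch: int, max_wait_s: float) -> list:
    """Wait for one request, then gather more for at most max_wait_s."""
    batch = [await queue.get()]                 # block until there is work
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break                               # window expired: ship what we have
    return batch


async def demo() -> tuple[list, float, list]:
    queue: asyncio.Queue = asyncio.Queue()
    queue.put_nowait("only request")
    start = time.monotonic()
    single = await collect_batch(queue, max_batch=8, max_wait_s=0.05)
    waited = time.monotonic() - start
    for i in range(5):
        queue.put_nowait(i)
    full = await collect_batch(queue, max_batch=4, max_wait_s=0.05)
    return single, waited, full


single, waited, full = asyncio.run(demo())
print(single, f"{waited * 1000:.0f} ms", full)
```

A lone request pays at most the window length, and a full batch is dispatched without waiting at all. Under low traffic the window can be shortened further or dropped entirely.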
Evaluating Latency
You cannot optimize what you do not measure. Latency evaluation requires instrumenting every stage of the pipeline.
Metrics to Track
Pipeline latency breakdown (ms):
network_in: 45
auth: 2
retrieval: 320
reranking: 45
context_assembly: 12
provider_queue: 180
prefill: 950
decode_first_token: 35
total_generation: 3450
network_out: 8
TTFT = network_in + auth + retrieval + reranking + context_assembly + provider_queue + prefill = 1554ms
Total latency = TTFT + generation + network_out = 5012ms
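The same breakdown can be computed from per-stage timings, which makes it easy to see which stage dominates TTFT. This sketch mirrors the formulas above, so `decode_first_token` is left out because those formulas do not add it separately. The function names are assumptions.

```python
STAGES_MS = {
    "network_in": 45, "auth": 2, "retrieval": 320, "reranking": 45,
    "context_assembly": 12, "provider_queue": 180, "prefill": 950,
    "total_generation": 3450, "network_out": 8,
}

# Stages that complete before the first token is emitted.
PRE_FIRST_TOKEN = ("network_in", "auth", "retrieval", "reranking",
                   "context_assembly", "provider_queue", "prefill")


def ttft_ms(stages: dict[str, int]) -> int:
    return sum(stages[name] for name in PRE_FIRST_TOKEN)


def total_latency_ms(stages: dict[str, int]) -> int:
    return ttft_ms(stages) + stages["total_generation"] + stages["network_out"]


def largest_ttft_stage(stages: dict[str, int]) -> str:
    """Name of the pre-first-token stage that contributes the most time."""
    return max(PRE_FIRST_TOKEN, key=lambda name: stages[name])


print(ttft_ms(STAGES_MS), total_latency_ms(STAGES_MS), largest_ttft_stage(STAGES_MS))
```

In this example prefill accounts for well over half of TTFT, which points at prompt size and prefix caching before anything else.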
What to Monitor
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| P50 TTFT | Typical user experience | > 2s |
| P95 TTFT | Tail experience (slowest 5% of requests) | > 5s |
| P50 tokens per second | Typical generation speed | < 20 tok/s |
| P5 tokens per second | Slowest 5% of generations | < 5 tok/s |
| Provider queue time | Provider congestion | > 2s |
| Streaming error rate | Connection drops | > 1% |
| Cancel rate | Users abandoning | Investigate above 10% |
| Cache hit rate | Prefix cache effectiveness | < 20% |
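Percentile alerts like those in the table can be computed from raw samples. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions, and takes the two TTFT thresholds from the table. The sample values and function names are made up for illustration.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Alert thresholds in milliseconds, keyed by percentile.
TTFT_ALERTS_MS = {50: 2000, 95: 5000}


def ttft_alerts(samples_ms: list[float]) -> list[str]:
    """Names of the TTFT alerts that fire for this window of samples."""
    return [f"p{p}_ttft" for p, limit in TTFT_ALERTS_MS.items()
            if percentile(samples_ms, p) > limit]


# Illustrative samples: most requests are fast, one is very slow.
ttft_samples = [300, 350, 400, 420, 450, 500, 600, 800, 1200, 6000]
print(percentile(ttft_samples, 50), percentile(ttft_samples, 95), ttft_alerts(ttft_samples))
```

The median looks healthy while the tail alert fires, which is exactly the situation an average would hide.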
Per-Feature Latency Budgets
Different features tolerate different latency.
| Feature | TTFT Budget | Total Latency Budget | Priority |
|---|---|---|---|
| Real-time chat | < 500ms | < 5s | Critical |
| Code completion | < 200ms | < 2s | Critical |
| Document summarization | < 3s | < 15s | Medium |
| Batch processing | < 10s | < 60s | Low |
Debugging rule: If a response feels slow, check the latency breakdown. If TTFT is high, look at prompt size and provider queueing. If generation speed is slow, look at model size, batch configuration, and speculative decoding. The breakdown tells you where to invest.
A Complete Streaming Request, End to End
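Here is how a single streaming request flows through the entire optimized pipeline:
- The client opens an SSE connection and submits the question.
- The server sends a status event ("retrieving") while retrieval and reranking run.
- Retrieved chunks are trimmed to the context token budget and appended after the cached static prefix.
- The provider prefills only the uncached suffix and starts decoding, and the server sends a status event ("generating").
- Tokens and citations are forwarded to the client as token and citation events as they arrive.
- If the client disconnects, the server aborts the provider call.
- A done event closes the stream, and the server records TTFT, TPOT, and total latency for the request.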
This flow combines every optimization in this guide: streaming for perceived latency, prefix caching for TTFT, context trimming for prefill speed, cancel propagation for cost control, and metrics for continuous improvement.
What to Remember for Interviews
When explaining streaming and latency optimization, tell the story in order:
- Streaming converts wall-clock time into perceived speed: The user sees the first token quickly even if total generation is the same. SSE is the simplest protocol for web streaming.
- TTFT and generation speed are separate metrics: TTFT depends on prompt size, provider queueing, and prefix caching. Generation speed depends on model size, batching, and speculative decoding. Optimize the metric that matters for your use case.
- TTFT optimization starts with prompt size: Shorter prompts prefill faster. Cache the stable prefix. Trim retrieval results aggressively. Do not pay the prefill cost for tokens the user does not need.
- KV cache is the memory bottleneck: It scales with context length, batch size, and model dimensions. PagedAttention reduces fragmentation. Memory-based admission control prevents OOM failures.
- Continuous batching is the standard for serving: It improves GPU utilization by 2-3x over static batching. But respect latency budgets — set maximum queue timeouts for interactive features.
- Speculative decoding reduces generation latency when draft acceptance is high: The draft model must align with the target model. Simple, predictable tasks benefit most.
- UX patterns reduce perceived latency: Streaming, progress indicators, cancel buttons, and concise defaults all make the system feel faster without changing the backend.
- Cancel propagation saves money: When the user disconnects, stop generation. Every token generated after the user leaves is wasted spend.
Practice: Design a streaming chat API for a RAG assistant that answers billing questions. Your TTFT budget is 500ms, total latency budget is 5s. Include retrieval timing, SSE events (status, token, citation, done), cancellation with provider propagation, timeout handling, and metrics for TTFT, TPOT, and total latency per request.