Gen AI Systems

Streaming and Latency Optimization: TTFT, SSE, KV Cache, and Batching

Design low-latency LLM experiences using streaming, Server-Sent Events, time-to-first-token optimization, KV cache management, speculative decoding, batching, and context reduction.

Tags: streaming, SSE, TTFT, KV cache, speculative decoding

Start With a Slow Chat Experience

Imagine you built a RAG-powered support chat. The first user types a question:

"Can I pause my annual subscription?"

Then they wait. And wait.

Four seconds later, the response appears all at once. It is correct. But the user has already decided the chat feels sluggish. They close the window and submit a support ticket instead.

Now imagine the same chat streams each token as it is generated. The user sees text appear token by token:

"The... annual... subscriptions... cannot... be... paused..."

The total generation time is still four seconds. But the user starts reading after 300 milliseconds instead of four seconds. The experience feels fast because the output begins immediately.

This is the core insight of LLM latency optimization: users perceive speed based on when output starts, not when it finishes. Streaming converts wall-clock time into perceived responsiveness.

Mental model: Streaming does not make total generation faster. It makes the user experience feel faster because output starts earlier. The goal is to minimize the silent gap between submitting a question and seeing the first character of the answer.


What Streaming and Latency Optimization Means

LLM latency is not a single number. An API call that returns one complete response after 4 seconds is fast in absolute terms but feels slow. A streaming call that starts output after 300ms and finishes at 4 seconds feels fast even though the total time is the same.

The difference is that LLM generation happens in two phases — prefill and decode — with fundamentally different performance characteristics.

The Four Latency Metrics

| Metric | Meaning | Why It Matters | Typical Range |
| --- | --- | --- | --- |
| TTFT (Time to First Token) | Time from request submission to first output token | Determines perceived responsiveness | 200ms–5s |
| TPOT (Time Per Output Token) | Milliseconds between consecutive tokens | Determines how fast text streams | 10ms–100ms |
| Tokens per second | 1000 / TPOT (roughly) | Throughput for long generations | 10–100 tok/s |
| Total latency | TTFT + (num_output_tokens × TPOT) | Time until response is complete | 1s–30s |
⚠️

Important distinction: TTFT and generation speed can trade off against each other. A serving setup tuned for low TTFT keeps batches small and starts prefill for new requests right away, which can slow token streaming for requests that are already decoding. A setup tuned for throughput uses larger batches, which adds queueing and raises TTFT. Optimize for the metric that matters to your user experience.
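
The relationships in the metrics table can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and the 300 ms and 37 ms figures are assumptions chosen so the result matches the four-second example from the introduction.

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Approximate streaming throughput from time per output token (ms)."""
    return 1000.0 / tpot_ms


def total_latency_ms(ttft_ms: float, tpot_ms: float, num_output_tokens: int) -> float:
    """Total latency = TTFT + (num_output_tokens x TPOT)."""
    return ttft_ms + num_output_tokens * tpot_ms


# Hypothetical request: 300 ms to first token, 37 ms per token, 100 output tokens.
streaming_total = total_latency_ms(ttft_ms=300, tpot_ms=37, num_output_tokens=100)
streaming_rate = tokens_per_second(37)
```

With these inputs the response completes at 4000 ms, but the user starts reading at 300 ms.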


Why Not Just Make the Model Faster?

Beginners often ask: "Can I just use a faster model?"

A faster model helps, but it is not the full answer. Latency in an LLM system comes from many layers, and the generation speed is only one of them.

| Latency Source | Contribution | Can a faster model fix it? |
| --- | --- | --- |
| Network round trip | 10–100ms | No |
| Provider queueing | 0ms–10s | No (may make it worse if faster = more popular) |
| Prompt processing (prefill) | 100ms–3s | Partially (depends on context size) |
| Retrieval and reranking | 100ms–1s | No (RAG pipeline is separate) |
| Token generation (decode) | 5ms–100ms per token | Yes |
| Client rendering | 1–50ms | No |

A faster model mainly improves decode speed, and prefill only partially. The other layers (network, queueing, retrieval, rendering) require architectural changes.


Step 1: Streaming with Server-Sent Events

Streaming is the foundation of low-latency LLM user experience. Instead of waiting for the full response, the server sends each token as it is generated.

How SSE Works

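Server-Sent Events (SSE) is a web standard for one-way streaming over plain HTTP. The client opens an HTTP connection, the server responds with Content-Type: text/event-stream, and then pushes events as they become available. Each event is a few "field: value" lines terminated by a blank line.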

txt
event: token
data: {"text":"The"}

event: token
data: {"text":" annual"}

event: token
data: {"text":" subscription"}

event: token
data: {"text":" cannot"}

event: done
data: {}
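
The wire format above can be produced with a small helper. This is a sketch, not a full server: `format_sse` and `stream_answer` are hypothetical names, and a real endpoint would write these strings to a response served as text/event-stream.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Encode one Server-Sent Event: an `event:` line, a `data:` line, then a blank line."""
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"


def stream_answer(tokens):
    """Yield one `token` event per generated token, then a final `done` event."""
    for text in tokens:
        yield format_sse("token", {"text": text})
    yield format_sse("done", {})
```

Serializing the payload as single-line JSON keeps each event to one `data:` line, so a token containing a newline cannot break the framing.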

SSE Event Types for LLM Responses

| Event | Purpose | Payload |
| --- | --- | --- |
| status | Inform client of progress before generation starts | {status: "retrieving"}, {status: "generating"} |
| token | Each generated token | {text: "The"} |
| citation | Source citation when it becomes relevant | {source: "billing-faq.pdf", id: "S1"} |
| error | Something went wrong | {code: "TIMEOUT", message: "..."} |
| done | Generation complete | {totalTokens: 342, latency: 4200} |
| meta | Final metadata without the text | {model: "gpt-4", latency: 4200} |
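
On the receiving side, a client splits the stream on blank lines and dispatches on the event name. The parser below is a simplified sketch. It assumes JSON payloads, handles only the `event:` and `data:` fields, skips comment lines, and ignores `id:` and `retry:`. Browser clients get this behavior from the built-in EventSource API instead.

```python
import json


def parse_sse(raw: str):
    """Split a text/event-stream body into (event, data) pairs."""
    events = []
    for block in raw.split("\n\n"):
        event, data_lines = "message", []  # "message" is the default event type
        for line in block.split("\n"):
            if not line or line.startswith(":"):
                continue  # blank line or comment
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data_lines.append(value)
        if data_lines:
            events.append((event, json.loads("\n".join(data_lines))))
    return events


def render_answer(events) -> str:
    """Concatenate the text of all `token` events in arrival order."""
    return "".join(data["text"] for name, data in events if name == "token")
```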

SSE vs WebSockets vs Plain HTTP

| Protocol | Direction | Best For | Complexity |
| --- | --- | --- | --- |
| SSE | Server → Client only | Token streaming, status updates | Low |
| WebSocket | Bidirectional | Chat with interruption, streaming + user input | Medium |
| Plain HTTP | Request → Response | Non-interactive, background tasks | Lowest |

SSE is usually the simplest choice for chat and assistant responses. WebSockets add complexity (connection management, reconnection, message framing) that is rarely justified when the only server-to-client traffic is token deltas.

SSE Implementation Considerations

  • Connection timeout: LLM generation can take 30+ seconds. Ensure your load balancer and reverse proxy do not time out long-lived or idle connections.
  • Buffering: Disable response buffering in your web framework and reverse proxy. Buffered SSE defeats the purpose of streaming.
  • Error recovery: If the provider connection drops mid-stream, the server should attempt to reconnect or send an error event to the client.
  • Cancel propagation: When the client disconnects, propagate the cancellation to the provider to stop generation and avoid paying for unused tokens.

Step 2: Optimizing Time to First Token

TTFT is the most visible latency metric because it is the silent gap the user experiences before anything happens. Optimizing TTFT means reducing the work done before the first token is emitted.

What Happens Before the First Token

Before the first token appears, the request passes through network transit, authentication, retrieval, reranking, context assembly, provider queueing, and prefill. Every stage adds to TTFT, and prefill time grows with prompt length.

TTFT Optimization Techniques

| Technique | Effect on TTFT | Tradeoff |
| --- | --- | --- |
| Reduce prompt tokens | Less prefill work | Shorter context means less evidence |
| Cache stable prefix | Reuse prompt processing for shared prefix | Requires prompt structure discipline |
| Stream retrieval results separately | User sees progress during retrieval | More complex UI |
| Start retrieval as early as possible | Overlap retrieval with other pre-generation work, such as prefill of the static prefix | Requires careful timing |
| Use smaller model | Faster prefill and first token | Lower quality |
| Keep retrieval top-k small | Less context assembly overhead | May miss relevant evidence |
| Pre-warm provider connections | Avoid connection setup latency | Increases baseline cost |
| Choose providers with low queueing | Skip or reduce queue wait | May be more expensive per token |
⚠️

Large context windows are not free: Sending 100K tokens because the model supports it can create high TTFT, high cost, and weaker attention to the important parts. The model may accept 100K tokens, but your user will not accept a 10-second TTFT.

Prefix Caching for TTFT

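If every request shares the same system prompt, tool definitions, and policy text, the provider can keep the KV cache entries computed for that prefix and reuse them. When a new request arrives with the same prefix, prefill for the cached portion is skipped and only the new suffix is processed. The match is on the exact leading tokens, so any change near the start of the prompt invalidates the cached entries for everything after it.

Prefix caching is often the most impactful TTFT optimization when prompts share a long stable prefix. Put static content first and per-request content last, so that the static prefix is as large as possible and the dynamic suffix is as small as possible.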
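
The effect of prompt ordering on cacheability can be illustrated with plain strings. This is a sketch under assumptions: `build_prompt` and `shared_prefix_len` are hypothetical helpers, and characters stand in for tokens, whereas real providers match on tokens and usually in fixed-size chunks.

```python
def build_prompt(system_prompt: str, tool_defs: str, policy: str,
                 retrieved_context: str, user_question: str) -> str:
    """Static content first (cacheable prefix), per-request content last."""
    static_prefix = "\n\n".join([system_prompt, tool_defs, policy])
    dynamic_suffix = "\n\n".join([retrieved_context, user_question])
    return static_prefix + "\n\n" + dynamic_suffix


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n
```

Two requests built with `build_prompt` share the entire static section. If the user question were placed first instead, the shared prefix would end at the first differing character of the question and nothing after it could be reused.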


Step 3: KV Cache Management

During the prefill phase, the model processes all input tokens in parallel and stores each token's key and value vectors, for every layer, in the KV cache. During the decode phase, each new token attends to these cached entries rather than recomputing keys and values for all previous tokens.

The KV cache avoids redundant work, but generation still gets slower as the context grows. Every new token must attend to all previous KV entries, so the attention cost of each decode step grows with sequence length. Doubling the output length therefore more than doubles the total attention cost, because each subsequent step has more KV entries to read.

KV Cache Memory

The memory required for the KV cache grows with batch size, context length, model size, and precision.

txt
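KV cache memory per token =
  2 (key + value)
  × num_hidden_layers
  × num_kv_heads (equal to num_attention_heads under standard multi-head attention)
  × head_dimension
  × bytes_per_element

Example for a 70B model (assuming standard multi-head attention; grouped-query attention keeps fewer KV heads and needs proportionally less memory):
  = 2 × 80 layers × 64 heads × 128 dim × 2 bytes (FP16)
  ≈ 2.6 MB per token

For a 4096-token context with batch size 32:
  = 4096 × 32 × 2.6 MB
  ≈ 341 GB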

This is why large contexts with high concurrency require multiple GPUs and careful memory management.
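
The same arithmetic as a small calculator, useful for trying other model shapes. This is a sketch: the function names are illustrative, and `num_kv_heads` is the number of key/value heads, which equals the number of attention heads under standard multi-head attention and is smaller for models that use grouped-query attention.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """Bytes one token occupies: 2 (key + value) x layers x KV heads x head dim x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_bytes(context_tokens: int, batch_size: int, per_token_bytes: int) -> int:
    """Total KV cache for `batch_size` sequences of `context_tokens` tokens each."""
    return context_tokens * batch_size * per_token_bytes


# The example above: 80 layers, 64 heads, head dimension 128, FP16.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)
total = kv_cache_bytes(context_tokens=4096, batch_size=32, per_token_bytes=per_token)

# Same shape with 8 KV heads (grouped-query attention): one eighth of the memory.
per_token_gqa = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
```

Unrounded, the total comes to about 344 GB. Rounding the per-token figure to 2.6 MB first gives the 341 GB shown above.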

Serving Implications of KV Cache

| Concern | Impact | Mitigation |
| --- | --- | --- |
| Long prompts | More prefill work and KV memory | Prefix caching, prompt compression |
| Long outputs | More decode steps, growing KV cache | Shorter answers, early stopping |
| Many concurrent streams | More active KV cache competing for GPU memory | Dynamic batching, memory management |
| Larger models | KV cache scales with model dimensions | Use smaller models where possible |
| Batch processing | KV cache scales with batch size × context length | Continuous batching |

PagedAttention

PagedAttention is a technique that manages KV cache in fixed-size blocks (pages) rather than contiguous memory. This largely eliminates fragmentation (only the last, partially filled block of each sequence is wasted) and allows the GPU to use memory more efficiently, increasing the number of concurrent requests that can be served.

Without PagedAttention, the KV cache for each request must be stored in a contiguous block of GPU memory. This leads to fragmentation — the GPU may have enough total free memory but not enough contiguous space for a new request. PagedAttention solves this by storing KV cache in pages that can be non-contiguous, similar to how virtual memory works in operating systems.
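
The bookkeeping behind this idea can be sketched as a block table per sequence. This toy allocator only tracks block indices. It is an illustration of the paging concept under assumed names, not how any particular serving engine implements it.

```python
class PagedKVAllocator:
    """Toy block-table allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.lengths: dict[str, int] = {}

    def append_token(self, seq_id: str) -> bool:
        """Reserve space for one more token; returns False if no block is free."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # current blocks are full
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return True

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

When two sequences grow in an interleaved order, each ends up with non-adjacent blocks, and any freed block can be reused by any other sequence regardless of where it sits.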


Step 4: Batching Strategies

Batching groups multiple requests together to improve GPU utilization. Instead of processing one request at a time, the GPU processes N requests simultaneously, sharing the cost of loading model weights.

Batching Types

| Batch Type | How It Works | Best For | Latency Impact |
| --- | --- | --- | --- |
| Static batching | Group requests with similar sizes before inference | Offline jobs, batch processing | High queueing delay |
| Dynamic batching | Accumulate requests for a short window, then batch | Online serving with moderate traffic | Medium queueing delay |
| Continuous batching | Requests join and leave the batch during decoding | LLM serving with variable-length outputs | Low queueing delay |

Continuous Batching (Decoding)

Continuous batching is the standard for modern LLM serving. In traditional batching, all requests in the batch must complete before the next batch starts. In continuous batching, a request leaves the batch when it finishes generating, and a new request can join immediately.

Continuous batching improves GPU utilization by up to 2-3x compared to static batching, especially when output lengths vary significantly between requests.
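
A toy simulation makes the difference concrete. It counts decode steps only, assumes every request needs at least one output token, and ignores prefill and scheduling overhead, so the numbers illustrate the mechanism rather than predict real throughput. The function names are illustrative.

```python
def static_batching_steps(output_lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """A finished request frees its slot; a waiting request takes it on the next step."""
    waiting = list(output_lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.pop(0))
        steps += 1  # one decode step produces one token for every active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps
```

For output lengths [10, 2, 2, 2] and a batch size of 2, static batching needs 12 steps because the second batch waits for the 10-token request, while continuous batching needs 10 because the short requests slot in beside it.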

The Batching Tradeoff

Batching improves throughput (requests per second) but can increase per-request latency. A request that arrives at a busy time may wait in the queue for the current batch to finish before it enters the next batch.

For interactive applications, set a maximum queue time. If a request has waited longer than N milliseconds, process it alone rather than waiting for more requests to join the batch.
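
A maximum queue time can be expressed as a simple dispatch rule: send the batch when it is full or when its oldest request has waited long enough, whichever comes first. The sketch below replays a list of arrival times to show the rule. `form_batches` and its parameters are hypothetical, and a real server would use timers rather than a replay.

```python
def form_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into (dispatch_time_ms, batch) pairs."""
    batches, current = [], []
    for t in arrivals_ms:
        # The deadline passed before this request arrived: flush what we have.
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append((t, current))  # full batch leaves immediately
            current = []
    if current:
        batches.append((current[0] + max_wait_ms, current))
    return batches
```

With `max_wait_ms=100`, a lone request arriving at 500 ms is dispatched at 600 ms at the latest, however quiet the service is.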


Step 5: Speculative Decoding

Speculative decoding reduces generation latency by using a smaller, faster model (the draft model) to propose multiple candidate tokens, then using the larger target model to verify them in a single forward pass.

How Speculative Decoding Saves Time

Normal generation: K sequential decode steps, each requiring one forward pass through the large model.

Speculative decoding: K cheap sequential steps through the small draft model, then one verification pass through the large model that checks all K proposed tokens in parallel.

If the draft model is accurate enough that most tokens are accepted, speculative decoding produces up to K tokens for roughly the cost of one large-model decode step plus the K draft steps, instead of K large-model steps.
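
The propose-then-verify loop can be sketched with toy models. This is the greedy variant, where a draft token is accepted only if it equals the target's own choice. Production systems typically verify with a probabilistic acceptance rule and run the verification as one batched forward pass, which is simulated here with a loop. The two next-token functions are made-up stand-ins.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Return (generated tokens, number of target verification passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2. Target model checks the proposal; counted as a single pass.
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(out + accepted)
            accepted.append(expected)  # the target's token is always kept
            if i == k or proposal[i] != expected:
                break  # first mismatch, or the bonus token after k matches
        out.extend(accepted)
    return out[len(prompt):len(prompt) + num_tokens], target_passes


def toy_target(ctx):
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    """Hypothetical draft model: agrees with the target except after a 5."""
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10
```

The output is identical to what the target model would produce on its own. Starting from the prompt [0], generating 8 tokens takes 3 verification passes instead of 8 sequential target steps.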

Acceptance Rate

The effectiveness of speculative decoding depends on the acceptance rate — the fraction of draft tokens that the target model accepts.

| Acceptance Rate | Speedup | Scenario |
| --- | --- | --- |
| 90%+ | 3–5x | Simple, predictable text (classification, extraction) |
| 70–90% | 2–3x | General chat and Q&A |
| Below 50% | Negligible or negative | Creative, unpredictable text |
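
A rough way to reason about these numbers: if each draft token is accepted independently with probability a, and the draft proposes k tokens per round, the expected number of tokens emitted per target pass is (1 - a^(k+1)) / (1 - a). The sketch below computes that value. The independence assumption is a simplification, and the result ignores the cost of running the draft model, so it is an upper bound on the speedup rather than a prediction.

```python
def expected_tokens_per_target_pass(acceptance_rate: float, k: int) -> float:
    """Expected tokens emitted per verification pass with k draft tokens per round."""
    a = acceptance_rate
    if a >= 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1 - a ** (k + 1)) / (1 - a)
```

With k = 4, an acceptance rate of 0.9 yields about 4.1 tokens per pass, 0.7 yields about 2.8, and 0.5 yields about 1.9.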

Speculative decoding is mostly a serving-layer optimization. It may not be exposed or configurable by every provider. If you are self-hosting, it is one of the most impactful optimizations available.

Practical advice: Speculative decoding works best when the draft model is aligned with the target model. A draft model trained on similar data will propose tokens the target model is likely to accept. A mismatched draft model (e.g., a code-specialized draft with a chat-specialized target) may have low acceptance rates and no speedup.


Step 6: Product UX for Perceived Latency

Latency optimization is not only backend work. UX patterns can dramatically reduce perceived waiting time and even reduce total generation by encouraging concise answers.

UX Patterns for Latency

| Pattern | Why It Helps | Implementation |
| --- | --- | --- |
| Stream answer text | User sees progress immediately | SSE events from server |
| Show retrieval progress | Makes waiting understandable | status: "searching docs..." event |
| Show intermediate steps | Builds trust for multi-step answers | "Step 1 of 3: checking config..." |
| Render citations as they arrive | Builds trust incrementally | Citation events alongside token events |
| Allow cancel | Saves cost and user time | AbortController on client, cancel propagation to provider |
| Generate concise by default | Lower latency and cost | System prompt: "Answer in 2-3 sentences" |
| Continue button for long answers | User controls generation scope | Show "Continue generating?" after initial response |
| Skeleton loading | Shows where content will appear | Fixed-height container with placeholder |
| Typing indicator | Familiar UI pattern | Animated cursor or dots before first token |

Common Failure Stories

The TTFT Was 8 Seconds Because of Context Bloat

A developer adds every retrieved chunk to the prompt without trimming. The RAG pipeline returns 25 chunks of 500 tokens each. The prompt is 12,500 tokens long. TTFT is 8 seconds because the prefill phase must process all those tokens before generating the first word.

The fix: trim retrieval results aggressively. Set a hard limit on context tokens (e.g., 3000 tokens max). Rerank and select only the most relevant chunks. The user does not need to see every retrieved document.
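
A hard context budget can be enforced with a small selection step after reranking. This is a minimal sketch: the chunk fields (`text`, `score`, `tokens`) are assumed names, and the token counts would come from your tokenizer.

```python
def select_chunks(chunks, max_context_tokens=3000):
    """Keep the highest-scoring chunks that fit within the token budget.

    Each chunk is a dict with hypothetical keys "text", "score" (reranker
    relevance, higher is better), and "tokens" (token count).
    """
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + chunk["tokens"] > max_context_tokens:
            continue  # skip chunks that would exceed the budget
        selected.append(chunk)
        used += chunk["tokens"]
    return selected, used
```

Applied to the scenario above, 25 chunks of 500 tokens shrink to the 6 most relevant chunks, cutting the retrieved context from 12,500 tokens to 3,000.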

The Streaming Connection Timed Out

The load balancer enforces a 30-second timeout on the response. The LLM generation takes 45 seconds for a complex question. The connection drops at 30 seconds, and the user gets half an answer.

The fix: configure load balancers, reverse proxies, and API gateways with timeouts that cover the expected generation time. If the limit is an idle timeout, also send periodic keepalives during silent gaps such as retrieval or provider queueing. An SSE comment line (a line starting with a colon, for example ": keepalive") resets the timer and is ignored by clients.
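
One way to send keepalives is an SSE comment line, which starts with a colon and is ignored by clients. The wrapper below is a sketch using asyncio. `with_keepalive` is a hypothetical helper, and the 15-second default is an assumption that should sit below the shortest idle timeout on the path.

```python
import asyncio


async def with_keepalive(events, interval_s: float = 15.0):
    """Relay SSE strings from `events`, emitting a comment line while it is silent."""
    iterator = events.__aiter__()
    pending = asyncio.ensure_future(iterator.__anext__())
    try:
        while True:
            # Wait for the next upstream event, but no longer than interval_s.
            done, _ = await asyncio.wait({pending}, timeout=interval_s)
            if not done:
                yield ": keepalive\n\n"  # upstream silent; reset idle timers
                continue
            try:
                item = pending.result()
            except StopAsyncIteration:
                return
            pending = asyncio.ensure_future(iterator.__anext__())
            yield item
    finally:
        pending.cancel()  # do not leave a dangling read if the client goes away
```

The pending read is kept across timeouts rather than cancelled and restarted, so no upstream event is lost while keepalives are being sent.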

The Cancel Button Did Nothing

The user clicks cancel after 10 seconds. The frontend stops rendering. But the server continues generating for another 20 seconds, and the provider bills for the full generation. The user has already moved on, but the cost is incurred.

The fix: propagate client-side cancellation to the provider. When the SSE connection closes, abort the provider API call. Implement cancellation tokens that flow from the HTTP request through the gateway to the provider SDK.
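
The core of cancel propagation is that the relay closes the upstream stream whenever it stops reading. The sketch below uses an async generator as a stand-in for a provider SDK stream. `relay_tokens` and the `client_disconnected` event are assumptions; how a disconnect is detected depends on your web framework.

```python
import asyncio


async def relay_tokens(provider_stream, client_disconnected: asyncio.Event):
    """Forward tokens until the stream ends or the client disconnects.

    `provider_stream` is any async generator of tokens. Closing it is what
    stops upstream generation in this sketch.
    """
    try:
        async for token in provider_stream:
            if client_disconnected.is_set():
                break  # stop reading as soon as the client is gone
            yield token
    finally:
        # Runs on normal completion, on break, and when this generator is closed.
        await provider_stream.aclose()
```

Putting the close in a `finally` block means the provider stream is released on every exit path, including errors raised while forwarding.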

The KV Cache Exploded Under Load

A popular model is serving 100 concurrent users with 8K-token contexts. The GPU runs out of memory. Requests start failing with OOM errors. The serving layer crashes and requires a restart.

The fix: implement memory-based admission control. Track KV cache memory usage per request. Before accepting a new request, verify that enough GPU memory is available. Reject or queue requests when memory is exhausted. Use PagedAttention to reduce fragmentation.
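
Memory-based admission control can be sketched as a reservation ledger. This version reserves the worst case (prompt plus maximum output) up front, which is conservative. The class and its parameters are illustrative, and a real server would read the budget from the GPU memory left after loading weights.

```python
class KVAdmissionController:
    """Admit a request only if its worst-case KV cache fits in the remaining budget."""

    def __init__(self, budget_bytes: int, bytes_per_token: int):
        self.budget_bytes = budget_bytes
        self.bytes_per_token = bytes_per_token
        self.reserved: dict[str, int] = {}

    def try_admit(self, request_id: str, prompt_tokens: int, max_output_tokens: int) -> bool:
        needed = (prompt_tokens + max_output_tokens) * self.bytes_per_token
        if sum(self.reserved.values()) + needed > self.budget_bytes:
            return False  # caller should queue or reject the request
        self.reserved[request_id] = needed
        return True

    def release(self, request_id: str) -> None:
        """Free the reservation when the request finishes or is cancelled."""
        self.reserved.pop(request_id, None)
```

Rejected requests can wait in a queue and retry after a `release`, which turns an out-of-memory crash into added queueing delay.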

The Batch Waited Too Long for More Requests

A low-traffic chat service batches requests for 500ms before sending them to the GPU. At 2 AM with only one active user, every request waits 500ms in the batch queue before processing starts. The user experiences consistent 500ms extra latency for no benefit.

The fix: implement a maximum queue timeout. If no other request arrives within 100ms, process the single request immediately. Batching should never add noticeable latency for interactive users.


Evaluating Latency

You cannot optimize what you do not measure. Latency evaluation requires instrumenting every stage of the pipeline.

Metrics to Track

txt
Pipeline latency breakdown (ms):
  network_in: 45
  auth: 2
  retrieval: 320
  reranking: 45
  context_assembly: 12
  provider_queue: 180
  prefill: 950
  decode_first_token: 35
  total_generation: 3450
  network_out: 8

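pre_generation = network_in + auth + retrieval + reranking + context_assembly + provider_queue + prefill = 1554ms
TTFT = pre_generation + decode_first_token = 1589ms
Total latency = pre_generation + total_generation + network_out = 5012ms (assuming total_generation includes decode_first_token)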

What to Monitor

| Metric | What It Tells You | Alert Threshold |
| --- | --- | --- |
| P50 TTFT | Typical user experience | > 2s |
| P95 TTFT | Tail experience (slowest 5% of requests) | > 5s |
| P50 tokens per second | Typical generation speed | < 20 tok/s |
| P5 tokens per second | Slowest 5% of generations | < 5 tok/s |
| Provider queue time | Provider congestion | > 2s |
| Streaming error rate | Connection drops | > 1% |
| Cancel rate | Users abandoning | Investigate above 10% |
| Cache hit rate | Prefix cache effectiveness | < 20% |
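
Percentile alerts are straightforward to compute from a window of samples. The sketch below uses the nearest-rank method and the TTFT thresholds from the table. The function names are illustrative, and a production system would normally rely on its metrics backend for this.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def ttft_alerts(ttft_ms_samples, p50_limit_ms=2000, p95_limit_ms=5000):
    """Return the names of the TTFT thresholds that are breached."""
    alerts = []
    if percentile(ttft_ms_samples, 50) > p50_limit_ms:
        alerts.append("P50 TTFT")
    if percentile(ttft_ms_samples, 95) > p95_limit_ms:
        alerts.append("P95 TTFT")
    return alerts
```

A window where 18 of 20 requests start in 400 ms and two take over 6 seconds has a healthy median and still trips the P95 alert, which is why both are tracked.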

Per-Feature Latency Budgets

Different features tolerate different latency.

| Feature | TTFT Budget | Total Latency Budget | Priority |
| --- | --- | --- | --- |
| Real-time chat | < 500ms | < 5s | Critical |
| Code completion | < 200ms | < 2s | Critical |
| Document summarization | < 3s | < 15s | Medium |
| Batch processing | < 10s | < 60s | Low |

Debugging rule: If a response feels slow, check the latency breakdown. If TTFT is high, look at prompt size and provider queueing. If generation speed is slow, look at model size, batch configuration, and speculative decoding. The breakdown tells you where to invest.
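
To get a per-request breakdown, record a timestamp when the request starts and one for every token that arrives. The class below is a minimal sketch with assumed names. Timestamps are passed in explicitly so the logic is easy to test, and in a live service they would come from a monotonic clock such as `time.monotonic()`.

```python
class StreamTimer:
    """Record token arrival times for one request and derive TTFT and TPOT."""

    def __init__(self, request_start_s: float):
        self.request_start_s = request_start_s
        self.token_times_s: list[float] = []

    def on_token(self, now_s: float) -> None:
        self.token_times_s.append(now_s)

    def ttft_ms(self) -> float:
        return (self.token_times_s[0] - self.request_start_s) * 1000

    def tpot_ms(self) -> float:
        """Mean gap between consecutive tokens (needs at least two tokens)."""
        gaps = len(self.token_times_s) - 1
        return (self.token_times_s[-1] - self.token_times_s[0]) / gaps * 1000

    def total_ms(self) -> float:
        return (self.token_times_s[-1] - self.request_start_s) * 1000
```

A high `ttft_ms` with a normal `tpot_ms` points at prompt size or queueing, while the reverse points at decode speed.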


A Complete Streaming Request, End to End

Here is how a single streaming request flows through the entire optimized pipeline:

  1. The client sends the question and the server opens an SSE stream, emitting a status event while retrieval runs.
  2. Retrieval and reranking return candidate chunks, which are trimmed to the context token budget.
  3. The prompt is assembled with the stable prefix first, so the provider can reuse its cached prefix and prefill only the new suffix.
  4. The provider streams tokens, and the server forwards them as token events, with citation events as sources become relevant.
  5. If the client disconnects or cancels, the server aborts the provider call.
  6. The server sends a done event and records TTFT, TPOT, and total latency for the request.

This flow combines every optimization in this guide: streaming for perceived latency, prefix caching for TTFT, context trimming for prefill speed, cancel propagation for cost control, and metrics for continuous improvement.


What to Remember for Interviews

When explaining streaming and latency optimization, tell the story in order:

  1. Streaming converts wall-clock time into perceived speed: The user sees the first token quickly even if total generation is the same. SSE is the simplest protocol for web streaming.
  2. TTFT and generation speed are separate metrics: TTFT depends on prompt size, provider queueing, and prefix caching. Generation speed depends on model size, batching, and speculative decoding. Optimize the metric that matters for your use case.
  3. TTFT optimization starts with prompt size: Shorter prompts prefill faster. Cache the stable prefix. Trim retrieval results aggressively. Do not pay the prefill cost for tokens the user does not need.
  4. KV cache is the memory bottleneck: It scales with context length, batch size, and model dimensions. PagedAttention reduces fragmentation. Memory-based admission control prevents OOM failures.
  5. Continuous batching is the standard for serving: It can improve GPU utilization by up to 2-3x over static batching. But respect latency budgets: set maximum queue timeouts for interactive features.
  6. Speculative decoding reduces generation latency when draft acceptance is high: The draft model must align with the target model. Simple, predictable tasks benefit most.
  7. UX patterns reduce perceived latency: Streaming, progress indicators, cancel buttons, and concise defaults all make the system feel faster without changing the backend.
  8. Cancel propagation saves money: When the user disconnects, stop generation. Every token generated after the user leaves is wasted spend.

Practice: Design a streaming chat API for a RAG assistant that answers billing questions. Your TTFT budget is 500ms, total latency budget is 5s. Include retrieval timing, SSE events (status, token, citation, done), cancellation with provider propagation, timeout handling, and metrics for TTFT, TPOT, and total latency per request.