Streaming and Latency Optimization: TTFT, SSE, KV Cache, and Batching

Design low-latency LLM experiences using streaming, Server-Sent Events, time-to-first-token optimization, KV cache management, speculative decoding, batching, and context reduction.

streaming, SSE, TTFT, KV cache, speculative decoding

Start With a Slow Chat Experience

Imagine you built a RAG-powered support chat. The first user types a question:

"Can I pause my annual subscription?"

Then they wait. And wait.

Four seconds later, the response appears all at once. It is correct. But the user has already decided the chat feels sluggish. They close the window and submit a support ticket instead.

Now imagine the same chat streams each token as it is generated. The user sees text appear token by token:

"The... annual... subscriptions... cannot... be... paused..."

The total generation time is still four seconds. But the user starts reading after 300 milliseconds instead of four seconds. The experience feels fast because the output begins immediately.

This is the core insight of LLM latency optimization: users perceive speed based on when output starts, not when it finishes. Streaming converts wall-clock time into perceived responsiveness.

✅

Mental model: Streaming does not make total generation faster. It makes the user experience feel faster because output starts earlier. The goal is to minimize the silent gap between submitting a question and seeing the first character of the answer.

What Streaming and Latency Optimization Means

LLM latency is not a single number. An API call that returns one complete response after 4 seconds is fast in absolute terms but feels slow. A streaming call that starts output after 300ms and finishes at 4 seconds feels fast even though the total time is the same.

The difference is that LLM generation happens in two phases — prefill and decode — with fundamentally different performance characteristics.

The Four Latency Metrics

Metric	Meaning	Why It Matters	Typical Range
TTFT (Time to First Token)	Time from request submission to first output token	Determines perceived responsiveness	200ms–5s
TPOT (Time Per Output Token)	Milliseconds between consecutive tokens	Determines how fast text streams	10ms–100ms
Tokens per second	1000 / TPOT (roughly)	Throughput for long generations	10–100 tok/s
Total latency	TTFT + (num_output_tokens × TPOT)	Time until response is complete	1s–30s
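The relationships in the table can be checked with a few lines of arithmetic. This is a minimal sketch that assumes TTFT and TPOT are both measured in milliseconds; the function names are illustrative and not part of any library.

```python
def tokens_per_second(tpot_ms: float) -> float:
    """Approximate streaming throughput from time per output token."""
    return 1000.0 / tpot_ms


def total_latency_ms(ttft_ms: float, num_output_tokens: int, tpot_ms: float) -> float:
    """Total latency = TTFT + (num_output_tokens x TPOT)."""
    return ttft_ms + num_output_tokens * tpot_ms


if __name__ == "__main__":
    # Illustrative numbers: 300 ms TTFT, 100 output tokens, 37 ms per token.
    print(round(tokens_per_second(37), 1))   # roughly 27 tokens per second
    print(total_latency_ms(300, 100, 37))    # 4000 ms, the four-second chat example
```

The same four-second response can come from very different TTFT and TPOT combinations, which is why the two are tracked separately.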

⚠️

Important distinction: TTFT and throughput often trade off against each other. A serving stack tuned for low TTFT keeps batches small and schedules prefill quickly, which lowers throughput. A stack tuned for throughput uses larger batches, which adds queueing and raises TTFT. Optimize for the metric that matters to your user experience.

Why Not Just Make the Model Faster?

Beginners often ask: "Can I just use a faster model?"

A faster model helps, but it is not the full answer. Latency in an LLM system comes from many layers, and the generation speed is only one of them.

Latency Source	Contribution	Can a faster model fix it?
Network round trip	10–100ms	No
Provider queueing	0ms–10s	No (may make it worse if faster = more popular)
Prompt processing (prefill)	100ms–3s	Partially (depends on context size)
Retrieval and reranking	100ms–1s	No (RAG pipeline is separate)
Token generation (decode)	5ms–100ms per token	Yes
Client rendering	1–50ms	No

A faster model mainly improves decode speed and, to a lesser extent, prefill. The other layers (network, queueing, retrieval, rendering) require architectural changes.

Step 1: Streaming with Server-Sent Events

Streaming is the foundation of low-latency LLM user experience. Instead of waiting for the full response, the server sends each token as it is generated.

How SSE Works

Server-Sent Events (SSE) is a web standard for one-way streaming over HTTP. The client opens an HTTP connection, the server responds with the `text/event-stream` content type, and then it pushes events as they become available.

txt

event: token
data: {"text":"The"}

event: token
data: {"text":" annual"}

event: token
data: {"text":" subscription"}

event: token
data: {"text":" cannot"}

event: done
data: {}
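The wire format above is simple enough to produce by hand. The sketch below shows one way to serialize and parse such events; `format_sse` and `parse_sse` are hypothetical helper names, and the parser assumes every event has exactly one `event:` line and one single-line `data:` field, which is narrower than the full SSE specification.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Serialize one event in the text/event-stream wire format."""
    payload = json.dumps(data, separators=(",", ":"))
    # Each event is an "event:" line, a "data:" line, and a blank-line terminator.
    return f"event: {event}\ndata: {payload}\n\n"


def parse_sse(stream: str) -> list[tuple[str, dict]]:
    """Parse single-line-data events back into (event, data) pairs."""
    events = []
    for block in stream.strip().split("\n\n"):
        # Split each line at the first ": " into a field name and its value.
        fields = dict(line.split(": ", 1) for line in block.split("\n"))
        events.append((fields["event"], json.loads(fields["data"])))
    return events


if __name__ == "__main__":
    wire = format_sse("token", {"text": "The"}) + format_sse("done", {})
    print(wire, end="")
    print(parse_sse(wire))
```

Keeping the JSON payload on a single line avoids the multi-line `data:` handling that a complete parser would need.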

SSE Event Types for LLM Responses

Event	Purpose	Payload
`status`	Inform client of progress before generation starts	`{status: "retrieving"}, {status: "generating"}`
`token`	Each generated token	`{text: "The"}`
`citation`	Source citation when it becomes relevant	`{source: "billing-faq.pdf", id: "S1"}`
`error`	Something went wrong	`{code: "TIMEOUT", message: "..."}`
`done`	Generation complete	`{totalTokens: 342, latency: 4200}`
`meta`	Final metadata without the text	`{model: "gpt-4", latency: 4200}`

SSE vs WebSockets vs Plain HTTP

Protocol	Direction	Best For	Complexity
SSE	Server → Client only	Token streaming, status updates	Low
WebSocket	Bidirectional	Chat with interruption, streaming + user input	Medium
Plain HTTP	Request → Response	Non-interactive, background tasks	Lowest

SSE is usually the simplest choice for chat and assistant responses. WebSockets add complexity (connection management, reconnection, message framing) that is rarely justified when the only server-to-client traffic is token deltas.

SSE Implementation Considerations

Connection timeout: LLM generation can take 30+ seconds. Ensure your load balancer and reverse proxy do not time out idle connections.
Buffering: Disable response buffering in your web framework and reverse proxy. Buffered SSE defeats the purpose of streaming.
Error recovery: If the provider connection drops mid-stream, the server should attempt to reconnect or send an error event to the client.
Cancel propagation: When the client disconnects, propagate the cancellation to the provider to stop generation and avoid paying for unused tokens.

Step 2: Optimizing Time to First Token

TTFT is the most visible latency metric because it is the silent gap the user experiences before anything happens. Optimizing TTFT means reducing the work done before the first token is emitted.

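What Happens Before the First Token

Before the first token can be emitted, a request passes through network transit, authentication, retrieval, reranking, context assembly, provider queueing, and prefill. Every one of these stages adds to TTFT.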

TTFT Optimization Techniques

Technique	Effect on TTFT	Tradeoff
Reduce prompt tokens	Less prefill work	Shorter context means less evidence
Cache stable prefix	Reuse prompt processing for shared prefix	Requires prompt structure discipline
Stream retrieval results separately	User sees progress during retrieval	More complex UI
Start retrieval early	Overlap retrieval with other pre-generation work, such as prefilling the static prefix	Requires careful timing
Use smaller model	Faster prefill and first token	Lower quality
Keep retrieval top-k small	Less context assembly overhead	May miss relevant evidence
Pre-warm provider connections	Avoid connection setup latency	Increases baseline cost
Choose providers with low queueing	Skip or reduce queue wait	May be more expensive per token

⚠️

Large context windows are not free: Sending 100K tokens because the model supports it can create high TTFT, high cost, and weaker attention to the important parts. The model may accept 100K tokens, but your user will not accept a 10-second TTFT.

Prefix Caching for TTFT

If every request shares the same system prompt, tool definitions, and policy text, the provider can cache the KV entries computed for that prefix. When a new request arrives with the same prefix, prefill for the cached portion is skipped and only the dynamic suffix has to be processed.

Prefix caching is often one of the most impactful TTFT optimizations. Designing your prompts so that the static prefix comes first and is as large as possible, and the dynamic suffix is as small as possible, directly reduces TTFT for every request that hits the cache.
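The prompt-structure discipline described here can be sketched as follows. The prompt text, `build_prompt`, and `shared_prefix_len` are illustrative assumptions. Real prefix caches match on tokens (often in fixed-size blocks), so the character-level comparison below is only a proxy for how much two prompts could share.

```python
SYSTEM_PROMPT = "You are a billing support assistant."
POLICY_TEXT = "Answer only from the provided context."


def build_prompt(retrieved_context: str, question: str) -> str:
    """Put stable text first and per-request text last."""
    static_prefix = f"{SYSTEM_PROMPT}\n{POLICY_TEXT}\n"
    dynamic_suffix = f"Context:\n{retrieved_context}\nQuestion: {question}\n"
    return static_prefix + dynamic_suffix


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


if __name__ == "__main__":
    p1 = build_prompt("Plan A docs", "Can I pause my annual subscription?")
    p2 = build_prompt("Refund docs", "How do refunds work?")
    # Everything up to the first differing character is a cache candidate.
    print(shared_prefix_len(p1, p2), "of", len(p1), "characters shared")
```

If the retrieved context or a timestamp were placed before the system prompt, the shared prefix would shrink to almost nothing and every request would pay the full prefill cost.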

Step 3: KV Cache Management

During the prefill phase, the model processes all input tokens in parallel and stores each token's key and value vectors, for every layer, in the KV cache. During the decode phase, each new token reads from this cache rather than recomputing keys and values for all previous tokens.

The cache avoids recomputation, but generation still gets slower with longer contexts, because every new token must attend to all previous KV entries. Doubling the output length more than doubles the attention cost, since each subsequent step has more KV entries to read.

KV Cache Memory

The memory required for the KV cache grows with batch size, context length, model size, and precision.

txt

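KV cache memory per token =
  2 (key + value)
  × num_hidden_layers
  × num_kv_heads (equal to num_attention_heads unless the model uses grouped-query attention)
  × head_dimension
  × bytes_per_element

Example for a 70B-class model, assuming full multi-head attention:
  = 2 × 80 layers × 64 heads × 128 dim × 2 bytes (FP16)
  = 2,621,440 bytes ≈ 2.6 MB per token

For a 4096-token context with batch size 32:
  = 4096 × 32 × 2,621,440 bytes
  ≈ 344 GB

With grouped-query attention and 8 KV heads, the same shapes need one eighth of this, about 43 GB.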

This is why large contexts with high concurrency require multiple GPUs and careful memory management.
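The sizing arithmetic is easy to script when comparing model shapes. This is a minimal sketch with illustrative function names. It assumes the number of key/value heads is passed explicitly, which equals the number of attention heads for plain multi-head attention and is smaller for grouped-query attention.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_element: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_bytes(context_tokens: int, batch_size: int, per_token_bytes: int) -> int:
    """Total KV cache for a batch where every sequence uses the full context."""
    return context_tokens * batch_size * per_token_bytes


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=64,
                                   head_dim=128, bytes_per_element=2)
    total = kv_cache_bytes(context_tokens=4096, batch_size=32,
                           per_token_bytes=per_token)
    print(per_token)           # 2621440 bytes, about 2.6 MB
    print(total / 1e9)         # about 343.6 GB before rounding the per-token figure
    print(total / 2**30)       # 320.0 GiB
```

Halving `bytes_per_element` (for example with an 8-bit KV cache) or reducing `num_kv_heads` shrinks the total proportionally.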

Serving Implications of KV Cache

Concern	Impact	Mitigation
Long prompts	More prefill work and KV memory	Prefix caching, prompt compression
Long outputs	More decode steps, growing KV cache	Shorter answers, early stopping
Many concurrent streams	More active KV cache competing for GPU memory	Dynamic batching, memory management
Larger models	KV cache scales with model dimensions	Use smaller models where possible
Batch processing	KV cache scales with batch size × context length	Continuous batching

PagedAttention

PagedAttention is a technique that manages KV cache in fixed-size blocks (pages) rather than contiguous memory. This greatly reduces fragmentation and allows the GPU to use memory more efficiently, increasing the number of concurrent requests that can be served.

Without PagedAttention, the KV cache for each request must be stored in a contiguous block of GPU memory. This leads to fragmentation — the GPU may have enough total free memory but not enough contiguous space for a new request. PagedAttention solves this by storing KV cache in pages that can be non-contiguous, similar to how virtual memory works in operating systems.
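The paging idea can be illustrated with a toy block allocator. This is a sketch of the bookkeeping only, not how any particular serving engine implements it; the class name, `block_size`, and the free-list policy are all assumptions made for illustration.

```python
class PagedKVAllocator:
    """Toy block table: each request maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}
        self.token_counts: dict[str, int] = {}

    def append_token(self, request_id: str) -> bool:
        """Reserve space for one more token; False means no block is free."""
        count = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if count == len(table) * self.block_size:  # current blocks are full
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop(0))  # any free block will do
        self.token_counts[request_id] = count + 1
        return True

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


if __name__ == "__main__":
    alloc = PagedKVAllocator(num_blocks=4, block_size=2)
    for _ in range(3):              # interleave two growing requests
        alloc.append_token("A")
        alloc.append_token("B")
    print(alloc.block_tables)       # {'A': [0, 2], 'B': [1, 3]}
```

Request A ends up in blocks 0 and 2, which are not adjacent. A contiguous allocator would have had to reserve A's maximum length up front or move data as it grew.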

Step 4: Batching Strategies

Batching groups multiple requests together to improve GPU utilization. Instead of processing one request at a time, the GPU processes N requests simultaneously, sharing the cost of loading model weights.

Batching Types

Batch Type	How It Works	Best For	Latency Impact
Static batching	Group requests with similar sizes before inference	Offline jobs, batch processing	High queueing delay
Dynamic batching	Accumulate requests for a short window, then batch	Online serving with moderate traffic	Medium queueing delay
Continuous batching	Requests join and leave the batch during decoding	LLM serving with variable-length outputs	Low queueing delay

Continuous Batching (Decoding)

Continuous batching is the standard for modern LLM serving. In traditional batching, all requests in the batch must complete before the next batch starts. In continuous batching, a request leaves the batch when it finishes generating, and a new request can join immediately.

Continuous batching improves GPU utilization by up to 2-3x compared to static batching, especially when output lengths vary significantly between requests.
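A small simulation makes the difference concrete. This sketch counts decode steps only and assumes every step costs the same regardless of how many slots are occupied, which is a simplification; the function names and the sample output lengths are illustrative.

```python
def static_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """A finished request frees its slot for the next waiting request."""
    waiting = list(output_lengths)
    running: list[int] = []   # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Fill free slots before each decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Every running request emits one token; drop the ones that are done.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


if __name__ == "__main__":
    lengths = [10, 2, 2, 2, 10, 2, 2, 2]   # two long answers among short ones
    print(static_batching_steps(lengths, batch_size=4))      # 20
    print(continuous_batching_steps(lengths, batch_size=4))  # 12
```

With uniform output lengths the two schedules take the same number of steps. The gap appears only when short requests would otherwise sit idle waiting for a long one.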

The Batching Tradeoff

Batching improves throughput (requests per second) but can increase per-request latency. A request that arrives at a busy time may wait in the queue for the current batch to finish before it enters the next batch.

For interactive applications, set a maximum queue time. If a request has waited longer than N milliseconds, process it alone rather than waiting for more requests to join the batch.
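The maximum-queue-time rule can be sketched as a batch planner over request arrival times. This is an illustrative model that ignores how long the GPU is busy with each batch; `plan_batches` and its parameters are hypothetical names.

```python
def plan_batches(arrivals_ms: list[int], max_batch_size: int,
                 max_wait_ms: int) -> list[tuple[int, list[int]]]:
    """Return (dispatch_time_ms, arrival_times_in_batch) for sorted arrivals."""
    batches = []
    i = 0
    while i < len(arrivals_ms):
        deadline = arrivals_ms[i] + max_wait_ms   # the oldest request sets the deadline
        batch = [arrivals_ms[i]]
        i += 1
        while (i < len(arrivals_ms) and len(batch) < max_batch_size
               and arrivals_ms[i] <= deadline):
            batch.append(arrivals_ms[i])
            i += 1
        # Dispatch when the batch fills, otherwise when the deadline expires.
        dispatch_at = batch[-1] if len(batch) == max_batch_size else deadline
        batches.append((dispatch_at, batch))
    return batches


if __name__ == "__main__":
    # Three requests arrive close together, one arrives alone much later.
    print(plan_batches([0, 5, 8, 300], max_batch_size=3, max_wait_ms=20))
    # [(8, [0, 5, 8]), (320, [300])]
```

No request waits longer than `max_wait_ms` for batch formation, so the lone request at 300 ms pays at most 20 ms rather than an open-ended wait.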

Step 5: Speculative Decoding

Speculative decoding reduces generation latency by using a smaller, faster model (the draft model) to propose multiple candidate tokens, then using the larger target model to verify them in a single forward pass.

How Speculative Decoding Saves Time

Normal generation: K sequential decode steps, each requiring one forward pass through the large model.

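Speculative decoding: K cheap draft steps through the small model, then one verification pass through the large model that processes all K draft tokens in parallel.

If the draft model is accurate enough that most tokens are accepted, speculative decoding produces up to K tokens for roughly the cost of one large-model decode step plus the cheap draft steps, instead of K large-model steps. Rejected draft tokens are discarded and replaced by the target model's own choice, so with standard acceptance rules the output matches what the target model would have produced alone.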
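The draft-then-verify loop can be sketched for the greedy case with toy "models" that are plain functions over integer tokens. Everything here is an assumption for illustration: the toy models, the function names, and the per-position verification loop, which stands in for what a real target model would compute in one batched forward pass.

```python
def target_next(seq: list[int]) -> int:
    """Toy target model: always predicts (last token + 1) mod 10."""
    return (seq[-1] + 1) % 10


def draft_next(seq: list[int]) -> int:
    """Toy draft model: agrees with the target except after token 2."""
    return 9 if seq[-1] == 2 else (seq[-1] + 1) % 10


def speculative_decode(prompt: list[int], num_tokens: int, k: int):
    """Greedy speculative decoding; returns (tokens, target verification passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        draft: list[int] = []
        for _ in range(k):                       # k cheap sequential draft steps
            draft.append(draft_next(out + draft))
        target_passes += 1                       # one verification pass per round
        accepted = []
        for i in range(k):
            expected = target_next(out + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)        # replace the first wrong token, stop
                break
            accepted.append(draft[i])
        else:
            accepted.append(target_next(out + draft))  # all accepted: bonus token
        out.extend(accepted)
    return out[len(prompt):len(prompt) + num_tokens], target_passes


if __name__ == "__main__":
    tokens, passes = speculative_decode(prompt=[0], num_tokens=8, k=4)
    print(tokens, passes)   # [1, 2, 3, 4, 5, 6, 7, 8] 2
```

Eight tokens come out of two target passes instead of eight, and the sequence is identical to what `target_next` alone would generate, including at the position where the draft guessed wrong.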

Acceptance Rate

The effectiveness of speculative decoding depends on the acceptance rate — the fraction of draft tokens that the target model accepts.

Acceptance Rate	Speedup	Scenario
90%+	3–5x	Simple, predictable text (classification, extraction)
70–90%	2–3x	General chat and Q&A
Below 50%	Negligible or negative	Creative, unpredictable text

Speculative decoding is mostly a serving-layer optimization. It may not be exposed or configurable by every provider. If you are self-hosting, it is one of the most impactful optimizations available.

✅

Practical advice: Speculative decoding works best when the draft model is aligned with the target model. A draft model trained on similar data will propose tokens the target model is likely to accept. A mismatched draft model (e.g., a code-specialized draft with a chat-specialized target) may have low acceptance rates and no speedup.

Step 6: Product UX for Perceived Latency

Latency optimization is not only backend work. UX patterns can dramatically reduce perceived waiting time and even reduce total generation time by encouraging concise answers.

UX Patterns for Latency

Pattern	Why It Helps	Implementation
Stream answer text	User sees progress immediately	SSE events from server
Show retrieval progress	Makes waiting understandable	`status: "searching docs..."` event
Show intermediate steps	Builds trust for multi-step answers	"Step 1 of 3: checking config..."
Render citations as they arrive	Builds trust incrementally	Citation events alongside token events
Allow cancel	Saves cost and user time	AbortController on client, cancel propagation to provider
Generate concise by default	Lower latency and cost	System prompt: "Answer in 2-3 sentences"
Continue button for long answers	User controls generation scope	Show "Continue generating?" after initial response
Skeleton loading	Shows where content will appear	Fixed-height container with placeholder
Typing indicator	Familiar UI pattern	Animated cursor or dots before first token

Common Failure Stories

The TTFT Was 8 Seconds Because of Context Bloat

A developer adds every retrieved chunk to the prompt without trimming. The RAG pipeline returns 25 chunks of 500 tokens each. The prompt is 12,500 tokens long. TTFT is 8 seconds because the prefill phase must process all those tokens before generating the first word.

The fix: trim retrieval results aggressively. Set a hard limit on context tokens (e.g., 3000 tokens max). Rerank and select only the most relevant chunks. The model does not need every retrieved document to answer the question.
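A hard context budget can be enforced with a few lines of selection logic. This sketch assumes each chunk already carries a relevance score from the reranker and a token count from the tokenizer; the tuple layout and the `select_chunks` name are illustrative.

```python
def select_chunks(chunks: list[tuple[float, int, str]],
                  max_context_tokens: int) -> tuple[list[str], int]:
    """Greedily keep the highest-scoring chunks that fit the token budget.

    Each chunk is (relevance_score, token_count, text).
    """
    selected, used = [], 0
    for score, tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if used + tokens > max_context_tokens:
            continue  # skip chunks that would overflow the budget
        selected.append(text)
        used += tokens
    return selected, used


if __name__ == "__main__":
    # 25 retrieved chunks of 500 tokens each, as in the failure story.
    retrieved = [(i / 25, 500, f"chunk-{i}") for i in range(25)]
    kept, used = select_chunks(retrieved, max_context_tokens=3000)
    print(len(kept), used)   # 6 3000
```

The prompt drops from 12,500 context tokens to 3,000, so prefill has roughly a quarter of the work to do before the first token.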

The Streaming Connection Timed Out

The load balancer has a 30-second idle timeout. The LLM generation takes 45 seconds for a complex question. The connection drops at 30 seconds, the client sees an incomplete response, and the user gets half an answer.

The fix: configure load balancers, reverse proxies, and API gateways with timeouts that exceed the expected generation time. For long generations, send periodic keepalive messages (SSE comment lines that start with a colon, or empty token events) to reset idle timers.
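One way to add keepalives is to wrap the upstream event stream so that something is sent whenever it stays silent too long. The sketch below uses an SSE comment line (a line starting with a colon, which clients ignore) as the keepalive; an empty token event would work the same way. `with_keepalive` and `slow_source` are hypothetical names, and a fuller version would also forward upstream errors to the client as an `error` event.

```python
import asyncio

KEEPALIVE = ": keepalive\n\n"  # SSE comment line; clients ignore it


async def with_keepalive(events, interval_s: float):
    """Yield items from `events`, plus a keepalive whenever it is idle too long."""
    queue: asyncio.Queue = asyncio.Queue()
    done = object()

    async def pump():
        try:
            async for item in events:
                queue.put_nowait(item)
        finally:
            queue.put_nowait(done)   # always signal the end of the stream

    task = asyncio.create_task(pump())
    try:
        while True:
            try:
                item = await asyncio.wait_for(queue.get(), timeout=interval_s)
            except asyncio.TimeoutError:
                yield KEEPALIVE      # nothing arrived in time: reset idle timers
                continue
            if item is done:
                return
            yield item
    finally:
        task.cancel()                # stop reading upstream if the consumer leaves


async def slow_source():
    yield 'event: token\ndata: {"text":"The"}\n\n'
    await asyncio.sleep(0.2)         # simulated stall longer than the interval
    yield "event: done\ndata: {}\n\n"


async def collect():
    return [chunk async for chunk in with_keepalive(slow_source(), interval_s=0.05)]


if __name__ == "__main__":
    for chunk in asyncio.run(collect()):
        print(repr(chunk))
```

The keepalive interval should be comfortably shorter than the smallest idle timeout on the path between server and client.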

The Cancel Button Did Nothing

The user clicks cancel after 10 seconds. The frontend stops rendering. But the server continues generating for another 20 seconds, and the provider bills for the full generation. The user has already moved on, but the cost is incurred.

The fix: propagate client-side cancellation to the provider. When the SSE connection closes, abort the provider API call. Implement cancellation tokens that flow from the HTTP request through the gateway to the provider SDK.
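Cancellation propagation can be sketched with an async generator standing in for the provider stream. `fake_provider_stream`, `relay`, and the `is_client_connected` callback are illustrative assumptions. In a real service the `finally` block is where the provider SDK call or HTTP request would be aborted.

```python
import asyncio


async def fake_provider_stream(stats: dict):
    """Stand-in for a provider stream; counts tokens that were generated."""
    try:
        for i in range(1000):
            await asyncio.sleep(0)       # yield control, as a network read would
            stats["generated"] += 1
            yield f"tok{i}"
    finally:
        stats["closed"] = True           # a real client would abort the upstream call here


async def relay(stats: dict, is_client_connected) -> list[str]:
    """Forward tokens until the client goes away, then close the provider stream."""
    sent = []
    stream = fake_provider_stream(stats)
    try:
        async for token in stream:
            if not is_client_connected():
                break                    # stop reading as soon as the client is gone
            sent.append(token)
    finally:
        await stream.aclose()            # runs the generator's cleanup immediately
    return sent


async def demo():
    stats = {"generated": 0, "closed": False}
    checks = {"n": 0}

    def is_client_connected() -> bool:
        checks["n"] += 1
        return checks["n"] <= 5          # simulate a disconnect after five tokens

    sent = await relay(stats, is_client_connected)
    return sent, stats


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Only six of the thousand possible tokens are generated: five that were delivered and one that was in flight when the disconnect was noticed.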

The KV Cache Exploded Under Load

A popular model is serving 100 concurrent users with 8K-token contexts. The GPU runs out of memory. Requests start failing with OOM errors. The serving layer crashes and requires a restart.

The fix: implement memory-based admission control. Track KV cache memory usage per request. Before accepting a new request, verify that enough GPU memory is available. Reject or queue requests when memory is exhausted. Use PagedAttention to reduce fragmentation.
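Memory-based admission control amounts to reserving worst-case KV cache space before a request starts. This is a minimal sketch with a hypothetical `KVAdmissionController` class. It assumes the per-token KV size is known and that each request declares its maximum total tokens (prompt plus output) up front.

```python
class KVAdmissionController:
    """Admit a request only if its worst-case KV cache fits the budget."""

    def __init__(self, kv_budget_bytes: int, bytes_per_token: int):
        self.kv_budget_bytes = kv_budget_bytes
        self.bytes_per_token = bytes_per_token
        self.reserved: dict[str, int] = {}

    def used_bytes(self) -> int:
        return sum(self.reserved.values())

    def try_admit(self, request_id: str, max_total_tokens: int) -> bool:
        """Reserve worst-case KV memory (prompt + max output) or refuse."""
        needed = max_total_tokens * self.bytes_per_token
        if self.used_bytes() + needed > self.kv_budget_bytes:
            return False                 # caller should queue or reject the request
        self.reserved[request_id] = needed
        return True

    def release(self, request_id: str) -> None:
        """Free the reservation when the request finishes or is cancelled."""
        self.reserved.pop(request_id, None)


if __name__ == "__main__":
    # Illustrative numbers: room for exactly ten 8K-token requests.
    ctrl = KVAdmissionController(kv_budget_bytes=10 * 8192 * 1000, bytes_per_token=1000)
    admitted = [ctrl.try_admit(f"req-{i}", max_total_tokens=8192) for i in range(11)]
    print(admitted.count(True), admitted.count(False))   # 10 1
```

Reserving for the worst case is conservative. Pairing it with paged allocation lets a scheduler admit more requests and reclaim blocks as soon as they finish.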

The Batch Waited Too Long for More Requests

A low-traffic chat service batches requests for 500ms before sending them to the GPU. At 2 AM with only one active user, every request waits 500ms in the batch queue before processing starts. The user experiences consistent 500ms extra latency for no benefit.

The fix: implement a maximum queue timeout. If no other request arrives within 100ms, process the single request immediately. Batching should never add noticeable latency for interactive users.

Evaluating Latency

You cannot optimize what you do not measure. Latency evaluation requires instrumenting every stage of the pipeline.

Metrics to Track

txt

Pipeline latency breakdown (ms):
  network_in: 45
  auth: 2
  retrieval: 320
  reranking: 45
  context_assembly: 12
  provider_queue: 180
  prefill: 950
  decode_first_token: 35
  total_generation: 3450
  network_out: 8

TTFT = network_in + auth + retrieval + reranking + context_assembly + provider_queue + prefill = 1554ms
Total latency = TTFT + generation + network_out = 5012ms
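The same breakdown can be computed from per-stage timings collected for each request. This sketch reuses the stage names and numbers from the example above; the function names and the idea of passing timings as a dictionary are illustrative.

```python
PRE_FIRST_TOKEN_STAGES = [
    "network_in", "auth", "retrieval", "reranking",
    "context_assembly", "provider_queue", "prefill",
]


def ttft_ms(stages_ms: dict) -> int:
    """Sum of every stage that runs before the first token is emitted."""
    return sum(stages_ms[name] for name in PRE_FIRST_TOKEN_STAGES)


def total_latency_ms(stages_ms: dict) -> int:
    """TTFT plus generation time plus the outbound network hop."""
    return ttft_ms(stages_ms) + stages_ms["total_generation"] + stages_ms["network_out"]


def slowest_pre_token_stage(stages_ms: dict) -> str:
    """The stage to look at first when TTFT is too high."""
    return max(PRE_FIRST_TOKEN_STAGES, key=lambda name: stages_ms[name])


if __name__ == "__main__":
    sample = {
        "network_in": 45, "auth": 2, "retrieval": 320, "reranking": 45,
        "context_assembly": 12, "provider_queue": 180, "prefill": 950,
        "decode_first_token": 35, "total_generation": 3450, "network_out": 8,
    }
    print(ttft_ms(sample))                  # 1554
    print(total_latency_ms(sample))         # 5012
    print(slowest_pre_token_stage(sample))  # prefill
```

In this sample prefill accounts for more than half of TTFT, which points at prompt size and prefix caching rather than retrieval.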

What to Monitor

Metric	What It Tells You	Alert Threshold
P50 TTFT	Typical user experience	> 2s
P95 TTFT	Tail experience (slowest 5% of requests)	> 5s
P50 tokens per second	Typical generation speed	< 20 tok/s
P5 tokens per second	Slowest generation (bottom 5% of requests)	< 5 tok/s
Provider queue time	Provider congestion	> 2s
Streaming error rate	Connection drops	> 1%
Cancel rate	Users abandoning	Investigate above 10%
Cache hit rate	Prefix cache effectiveness	< 20%
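Percentile alerts like the TTFT thresholds in the table can be computed from raw samples. This sketch uses the nearest-rank percentile definition and the 2 s and 5 s limits from the table; the function names and the sample data are illustrative, and a production system would more likely rely on its metrics backend for percentiles.

```python
def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100    # ceil(p * n / 100) in integer math
    return ordered[max(rank, 1) - 1]


def ttft_alerts(ttft_ms_samples: list[float],
                p50_limit_ms: float = 2000, p95_limit_ms: float = 5000):
    """Return (p50, p95, names of breached thresholds)."""
    p50 = percentile(ttft_ms_samples, 50)
    p95 = percentile(ttft_ms_samples, 95)
    alerts = []
    if p50 > p50_limit_ms:
        alerts.append("P50 TTFT")
    if p95 > p95_limit_ms:
        alerts.append("P95 TTFT")
    return p50, p95, alerts


if __name__ == "__main__":
    # Eighteen fast requests and two very slow ones.
    samples = [400] * 18 + [6000, 7000]
    print(ttft_alerts(samples))   # (400, 6000, ['P95 TTFT'])
```

The median looks healthy while the tail breaches its limit, which is the situation an average would hide.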

Per-Feature Latency Budgets

Different features tolerate different latency.

Feature	TTFT Budget	Total Latency Budget	Priority
Real-time chat	< 500ms	< 5s	Critical
Code completion	< 200ms	< 2s	Critical
Document summarization	< 3s	< 15s	Medium
Batch processing	< 10s	< 60s	Low

✅

Debugging rule: If a response feels slow, check the latency breakdown. If TTFT is high, look at prompt size and provider queueing. If generation speed is slow, look at model size, batch configuration, and speculative decoding. The breakdown tells you where to invest.

A Complete Streaming Request, End to End

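A single streaming request flows through the optimized pipeline in these stages:

Connect: The client opens an SSE connection and the server immediately emits a status event.
Retrieve: The server retrieves, reranks, and trims context to the token budget.
Assemble: The server builds the prompt with the stable prefix first so the provider can reuse its cached prefix.
Stream: The provider streams tokens, and the server forwards each one as a token event, with citation events as sources become relevant.
Cancel: If the client disconnects, the server aborts the provider call.
Finish: The server emits a done event and records TTFT, TPOT, and total latency.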

This flow combines every optimization in this guide: streaming for perceived latency, prefix caching for TTFT, context trimming for prefill speed, cancel propagation for cost control, and metrics for continuous improvement.

What to Remember for Interviews

When explaining streaming and latency optimization, tell the story in order:

Streaming converts wall-clock time into perceived speed: The user sees the first token quickly even if total generation is the same. SSE is the simplest protocol for web streaming.
TTFT and generation speed are separate metrics: TTFT depends on prompt size, provider queueing, and prefix caching. Generation speed depends on model size, batching, and speculative decoding. Optimize the metric that matters for your use case.
TTFT optimization starts with prompt size: Shorter prompts prefill faster. Cache the stable prefix. Trim retrieval results aggressively. Do not pay the prefill cost for tokens the user does not need.
KV cache is the memory bottleneck: It scales with context length, batch size, and model dimensions. PagedAttention reduces fragmentation. Memory-based admission control prevents OOM failures.
Continuous batching is the standard for serving: It can improve GPU utilization by up to 2-3x over static batching. But respect latency budgets: set maximum queue timeouts for interactive features.
Speculative decoding reduces generation latency when draft acceptance is high: The draft model must align with the target model. Simple, predictable tasks benefit most.
UX patterns reduce perceived latency: Streaming, progress indicators, cancel buttons, and concise defaults all make the system feel faster without changing the backend.
Cancel propagation saves money: When the user disconnects, stop generation. Every token generated after the user leaves is wasted spend.

✅

Practice: Design a streaming chat API for a RAG assistant that answers billing questions. Your TTFT budget is 500ms, total latency budget is 5s. Include retrieval timing, SSE events (status, token, citation, done), cancellation with provider propagation, timeout handling, and metrics for TTFT, TPOT, and total latency per request.
