Prompt Caching and Semantic Caching: Lower Latency and Cost
Learn exact prompt caching, prefix caching, semantic caching, TTLs, invalidation, cache safety, and when caching LLM responses is a bad idea.
Start With a Repetitive Problem
Imagine you built a support assistant for a SaaS product. Users ask questions like:
"How do I reset my password?" "What is the refund policy?" "Can I change my plan?"
The assistant answers each one correctly using RAG. But here is the problem: different users ask the same thing in different words, and every single request calls the LLM from scratch.
- "How do I reset my password?" — $0.04 in tokens, 2 seconds of latency
- "I forgot my password, can you help?" — $0.04, 2 seconds
- "Reset password steps" — $0.04, 2 seconds
Multiply by thousands of users, and you are spending money and time on the same answers over and over.
Meanwhile, every request carries the same 2000-token system prompt with instructions, tool definitions, and policy text. That prefix is identical every time, but you pay to process it on every call.
Caching exists to stop paying twice for the same work.
Mental model: Caching turns repeated LLM work into a lookup. The harder the work to produce, the more valuable the cache. But the wrong cache is worse than no cache — it can serve stale, private, or incorrect answers silently.
What Prompt and Semantic Caching Mean
There are three distinct caching mechanisms for LLM systems, and they solve different problems:
| Mechanism | What It Matches | Where It Lives | Who Controls It |
|---|---|---|---|
| Exact-match cache | Identical normalized request | Your infrastructure | You |
| Semantic cache | Similar meaning above a threshold | Your infrastructure | You |
| Prefix cache | Shared prompt prefix tokens | Provider serving layer | Provider |
They are not alternatives. They work together in a hierarchy.
Why Not Just Use a Regular HTTP Cache?
Beginners often ask: "Can I just put a Redis cache in front of the LLM call?"
You can, but LLM caching is not the same as web caching.
A web cache keys on URL. An LLM cache must key on the full input: prompt text, model name, temperature, system instructions, tool definitions, and tenant scope. Two requests with the same user message but different system prompts produce different answers. A regular HTTP cache has no awareness of these dimensions.
A web cache serves identical content to anyone. An LLM cache must respect permissions — a cached answer for one tenant may include data another tenant should not see.
A web cache has clear invalidation (resource changed, TTL expired). An LLM cache must handle prompt version changes, model upgrades, document updates, embedding model changes, and ACL modifications.
Important distinction: LLM caching is not a general cache. It is a domain-specific cache that must understand model parameters, permission boundaries, and versioning. Applying a generic cache strategy will leak data or serve stale answers.
Step 1: Exact Prompt Caching
Exact caching is the simplest and safest form. It hashes the normalized prompt and model parameters into a cache key. If the exact same request arrives again, return the cached response.
What Goes Into the Cache Key
The cache key must include everything that affects the output:
| Input | Why It Belongs in Key |
|---|---|
| Model name and version | Different models produce different outputs |
| System prompt | Changes behavior and instructions |
| User prompt | Main input |
| Temperature and decoding params | Non-deterministic settings affect output |
| Tool and function schemas | Changes valid output structure |
| Prompt version tag | Supports safe rollout and rollback |
| Tenant or permission scope | Prevents data leakage across boundaries |
| Response format instruction | JSON vs markdown vs plain text |
Exact caching is safest for deterministic tasks: extraction, classification, structured data transformation, and repeated internal operations. If the prompt embeds a timestamp or request ID, the cache key is unique on every call and caching cannot help. A user name in the prompt is less extreme, but it still limits hits to repeats from that one user.
Practical advice: Normalize your prompts before hashing. Strip trailing whitespace, unify line endings, sort JSON keys in tool schemas, and canonicalize boolean values. Two semantically identical requests that differ by a trailing space will miss the cache.
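The normalization and key-building steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names (`normalize_prompt`, `exact_cache_key`) and the field names in the payload are assumptions made for the example, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import json


def normalize_prompt(text: str) -> str:
    """Unify line endings and strip trailing whitespace on every line."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip()


def exact_cache_key(*, model: str, system_prompt: str, user_prompt: str,
                    temperature: float, tools: list, prompt_version: str,
                    tenant_id: str, response_format: str) -> str:
    """Hash every input that can change the output into one cache key."""
    payload = {
        "model": model,
        "system": normalize_prompt(system_prompt),
        "user": normalize_prompt(user_prompt),
        "temperature": float(temperature),  # 0 and 0.0 serialize identically
        "tools": tools,
        "prompt_version": prompt_version,
        "tenant": tenant_id,
        "format": response_format,
    }
    # sort_keys applies to nested dicts too, so tool schemas are canonicalized.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With this shape, a request that differs only by trailing whitespace or by the key order inside a tool schema maps to the same key, while a different tenant always maps to a different one.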
Step 2: Semantic Caching
Semantic caching handles the case where different users ask the same question in different words.
"Can I get a refund?" and "What is your refund policy?" should return the same answer. But the exact strings are different, so exact-match caching will miss.
Semantic caching embeds each incoming query and searches for a previously answered query with high embedding similarity.
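A minimal sketch of that lookup, assuming the embeddings have already been computed elsewhere and arrive as plain lists of floats. The `SemanticCache` class and its linear scan are illustrative only; a production system would more likely use a vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.entries = []  # list of (query_vector, cached_answer)

    def put(self, vector, answer):
        self.entries.append((vector, answer))

    def get(self, vector):
        """Return the answer of the most similar stored query, or None."""
        best_score, best_answer = 0.0, None
        for stored_vector, answer in self.entries:
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

The key design point is that a miss is the default: the cache only answers when the best match clears the threshold.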
Threshold Tuning
The similarity threshold determines the tradeoff between hit rate and correctness.
| Threshold | Behavior | Risk |
|---|---|---|
| Too high (e.g., 0.98) | Very few hits, safe | Cache is mostly useless |
| Too low (e.g., 0.70) | Many hits, fast and cheap | Wrong answers on subtle differences |
| Per intent (best) | Different thresholds per task type | More configuration |
For example, a support FAQ about password reset can use a lower threshold because the answer is stable. A question about the user's current subscription plan should use a much higher threshold because small wording changes change the meaning.
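Per-intent thresholds can be as simple as a lookup table with a strict default. The intent names and numbers below are made up for illustration; real values should come from evaluating false positives on your own traffic.

```python
# Hypothetical intents and thresholds; tune these against labeled samples.
INTENT_THRESHOLDS = {
    "password_reset": 0.85,       # stable answer, wording varies a lot
    "refund_policy": 0.88,
    "subscription_status": 0.97,  # small wording changes change the meaning
}
DEFAULT_THRESHOLD = 0.95  # unknown intents fall back to a strict threshold


def is_cache_hit(intent: str, similarity: float) -> bool:
    """Decide whether a similarity score is high enough for this intent."""
    return similarity >= INTENT_THRESHOLDS.get(intent, DEFAULT_THRESHOLD)
```

The same similarity score of 0.90 would count as a hit for a password-reset question and as a miss for a subscription question.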
A further safeguard: include the content hash of the source document in the cache key. If the source document changes, old semantic cache entries stop matching automatically.
Do not semantic-cache sensitive or personalized answers unless the cache key includes the permission and personalization boundary. If two users ask "What is my bill?" with similar phrasing, the semantic cache should not serve one user's answer to the other.
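One way to enforce both safeguards is to filter candidates by scope and source hash before any similarity comparison happens. The sketch below assumes a hypothetical `Entry` record and a `lookup` helper; the scope string format is an arbitrary choice for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Entry:
    vector: list
    answer: str
    scope: str        # e.g. "tenant-a" or "tenant-a/user-42"
    source_hash: str  # content hash of the document the answer came from


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def lookup(entries, vector, scope, current_source_hash, threshold):
    """Search only entries from the same scope and the current document."""
    candidates = [e for e in entries
                  if e.scope == scope and e.source_hash == current_source_hash]
    best = max(candidates, key=lambda e: cosine(vector, e.vector), default=None)
    if best is not None and cosine(vector, best.vector) >= threshold:
        return best.answer
    return None
```

Because entries from other scopes are never candidates, a near-identical question from another user cannot produce a hit.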
Step 3: Prefix Caching (Provider-Side)
Prefix caching is different from exact and semantic caching. It happens inside the LLM provider's serving infrastructure, not in your code.
Many LLM requests share a long prefix: system instructions, tool definitions, few-shot examples, and static policy text. The provider stores the attention key-value (KV) state it computed for that shared prefix. When a new request arrives with the same prefix, the provider reuses that state instead of recomputing it for those tokens.
Prefix caching is often the highest-leverage optimization because it can reduce both latency and cost for every request that shares a prefix. The savings apply whether or not the user's question has been asked before.
Designing Prompts for Prefix Cache Efficiency
- Keep stable instructions at the beginning. System prompt first, then tool definitions, then examples, then the user message last. The prefix cache caches from the start of the prompt.
- Avoid dynamic content in the prefix. Do not put timestamps, request IDs, or user names in the system prompt. They break the prefix match.
- Version prompt prefixes deliberately. When you change the system prompt, the prefix cache invalidates. Group prompt changes into releases rather than making continuous edits.
- Measure prefix cache hit rate. Providers expose this metric. If your hit rate is low, your prefix may contain too much dynamic content or you may have too many prompt variants.
- Align prefixes across requests. If your support chat and your summarization feature use different system prompts, they do not share prefix cache. Consider whether they can share a common prefix.
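The effect of prompt ordering can be illustrated locally by comparing how much of two rendered prompts is identical from the start. This is only a rough stand-in: providers match on tokens rather than characters and apply their own rules, and the prefix text and tool names here are invented for the example.

```python
import os.path

# Hypothetical static prefix: instructions, tool definitions, policy text.
STATIC_PREFIX = (
    "You are a support assistant for a SaaS product.\n"
    "Tools: search_docs(query), get_plan(account_id)\n"
    "Policy: answer only from retrieved documents.\n"
)


def build_prompt(user_message: str) -> str:
    """Static content first, the variable user message last."""
    return STATIC_PREFIX + "User: " + user_message


def build_prompt_with_timestamp(user_message: str, timestamp: str) -> str:
    """Anti-pattern: a dynamic value at the very start of the prompt."""
    return f"Current time: {timestamp}\n" + STATIC_PREFIX + "User: " + user_message


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))
```

Two prompts from `build_prompt` share the whole static prefix, while two prompts from `build_prompt_with_timestamp` diverge as soon as the timestamps differ, so almost nothing after that point can be reused.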
Step 4: Caching in RAG Systems
RAG systems have more cacheable parts than a plain LLM call. Each layer of the pipeline can be cached independently.
| Layer | Cache Key | Safety | Savings |
|---|---|---|---|
| Parsed document | Document ID + content hash | Very safe | Avoids reparsing |
| Embeddings | Chunk hash + embedding model | Very safe | Avoids re-embedding |
| Retrieval result | Query + filters + index version | Safe if permissions match | Avoids vector search |
| Rerank result | Query + candidates + reranker version | Safe if permissions match | Avoids reranking |
| Final answer | Prompt + context IDs + model + user scope | Riskiest | Avoids generation |
Final-answer caching is the riskiest because the cached answer depends on all upstream layers. If the source document changes, the embedding model updates, the index is rebuilt, or the prompt changes, the cached answer becomes stale. Retrieval and embedding caches are usually safer and still provide meaningful savings.
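As an illustration of the safer layers, here is a sketch of an embedding cache keyed on chunk hash plus embedding model, and a retrieval cache key built from query, filters, index version, and scope. The class and function names are assumptions, and `embed_fn` stands in for whatever embedding call your system actually makes.

```python
import hashlib
import json


class EmbeddingCache:
    """Caches embeddings by (embedding model, chunk content hash)."""

    def __init__(self, embed_fn, model_name: str):
        self.embed_fn = embed_fn      # placeholder for a real embedding call
        self.model_name = model_name
        self.store = {}
        self.misses = 0

    def key(self, chunk: str) -> str:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        return f"{self.model_name}:{digest}"

    def get(self, chunk: str):
        k = self.key(chunk)
        if k not in self.store:
            self.misses += 1
            self.store[k] = self.embed_fn(chunk)
        return self.store[k]


def retrieval_cache_key(query: str, filters: dict, index_version: str,
                        scope: str) -> str:
    """Key for cached retrieval results; scope keeps permissions separate."""
    payload = {"query": query, "filters": filters,
               "index_version": index_version, "scope": scope}
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Switching the embedding model changes every embedding key, and rebuilding the index changes every retrieval key, so neither cache can return results from an older pipeline.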
Step 5: TTL and Invalidation
Caching without invalidation is just accumulating stale answers. Different data types need different invalidation strategies.
| Data Type | TTL Strategy | Invalidation Trigger |
|---|---|---|
| Static docs | Long TTL with content-hash invalidation | Document content changes |
| Product policies | Short TTL or versioned invalidation | Policy update published |
| User-specific answers | Very short TTL or no cache | Session or user state changes |
| Compliance answers | Avoid final-answer cache | Regulatory changes must be immediate |
| Analytics summaries | TTL aligned with source refresh | Data pipeline completes |
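The table can be turned into a small TTL policy. The sketch below is an in-memory illustration with made-up TTL values and data-type names; the clock is injectable so expiry can be tested without waiting.

```python
import time

# Assumed TTLs in seconds; 0 means "do not cache this data type at all".
TTL_SECONDS = {
    "static_docs": 7 * 24 * 3600,
    "product_policy": 3600,
    "user_specific": 60,
    "compliance": 0,
}


class TTLCache:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.store = {}  # key -> (value, expires_at)

    def put(self, key, value, data_type: str):
        ttl = TTL_SECONDS.get(data_type, 0)  # unknown types are not cached
        if ttl <= 0:
            return
        self.store[key] = (value, self.clock() + ttl)

    def get(self, key):
        item = self.store.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self.clock() >= expires_at:
            del self.store[key]  # expired: drop the entry and report a miss
            return None
        return value

    def invalidate(self, key):
        """Explicit invalidation, e.g. when a policy update is published."""
        self.store.pop(key, None)
```

TTL handles the cases where no event tells you the data changed; `invalidate` handles the cases where one does.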
Invalidation Events
Version-Aware Caching
The safest invalidation strategy is to include version identifiers in the cache key itself:
    cache_key = hash(
        normalized_prompt +
        model_version +
        prompt_version +
        embedding_model_version +
        index_version +
        tenant_id
    )
When any version changes, the cache key changes automatically, and old entries are never returned. This is simpler than maintaining explicit invalidation lists. The tradeoff is that old cache entries become orphaned and must be evicted by the cache's TTL policy.
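A runnable take on the same idea, under a few assumptions: the `CacheVersions` dataclass and `versioned_key` function are names invented for this sketch, and the fields are serialized as JSON rather than concatenated so that adjacent values cannot run together ambiguously.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class CacheVersions:
    model_version: str
    prompt_version: str
    embedding_model_version: str
    index_version: str


def versioned_key(normalized_prompt: str, versions: CacheVersions,
                  tenant_id: str) -> str:
    """Any version bump produces a new key, so old entries stop matching."""
    payload = {"prompt": normalized_prompt, "tenant": tenant_id,
               **asdict(versions)}
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Example: rebuilding the index yields a different key for the same prompt.
v1 = CacheVersions("model-2024-05", "prompt-v3", "embed-v1", "index-41")
v2 = replace(v1, index_version="index-42")
old_key = versioned_key("what is the refund policy?", v1, "tenant-a")
new_key = versioned_key("what is the refund policy?", v2, "tenant-a")
```

The version strings above are placeholders; what matters is that each one changes whenever the corresponding component changes.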
Common Failure Stories
The Stale Answer After a Policy Change
The refund policy changed on Monday. The semantic cache still returns "You can cancel within 30 days" on Wednesday. The cache key did not include the document version, so the old answer looks valid.
The fix: include source document version and content hash in the cache key. When the document updates, the cache misses automatically.
The Semantic Cache Returned the Wrong Answer
A user asks "What happens if I downgrade my plan?" The semantic cache finds a similar query: "What happens if I upgrade my plan?" — similarity 0.91, above threshold. It returns the upgrade answer to a downgrade question.
The fix: use per-intent thresholds and include intent classification in the cache key. Upgrade and downgrade should be different cache buckets.
The Cross-Tenant Cache Leak
Tenant A asks "What is my contract renewal date?" Tenant B asks "What is my renewal date?" The semantic cache matches and returns Tenant A's renewal date to Tenant B.
The fix: always include tenant scope in the cache key, and restrict the similarity search to the requesting tenant's entries so that another tenant's answer can never match.
The Prefix Cache Stopped Working
A developer adds {{current_timestamp}} to the system prompt for logging. Now every request has a unique prefix. The prefix cache hit rate drops from 80% to 0%. Cost per request doubles.
The fix: keep timestamps and request metadata out of the prompt prefix. Use a separate logging mechanism that does not affect the prompt.
The Cached Answer Contained PII
A user asks "What is my email on file?" The response includes their email address. The exact-match cache stores this response. Another user with a different account asks the exact same question and gets the first user's email.
The fix: never cache personalized responses with an exact-match or semantic cache unless the cache key includes user identity and the cache is scoped to that user.
Evaluating Cache Quality
You cannot optimize caching without measuring its impact on cost, latency, and correctness.
Hit Rate Evaluation
- What is the exact-match cache hit rate per feature?
- What is the semantic cache hit rate per intent?
- What is the prefix cache hit rate (provider-reported)?
- How does hit rate vary by time of day and traffic patterns?
Latency Evaluation
- What is the P50 and P95 latency for cache hits vs cache misses?
- What is the overhead of the semantic cache embedding and search step?
- Is the cache lookup faster than generating the response? (If not, something is wrong.)
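These latency questions can be answered from request logs. The sketch below assumes each log event is a dict with a `cache_hit` flag and a `latency_ms` value, which is an invented schema, and uses a simple nearest-rank percentile.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% at or below."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_report(events):
    """P50 and P95 latency, split into cache hits and cache misses."""
    report = {}
    for label, flag in (("hit", True), ("miss", False)):
        latencies = [e["latency_ms"] for e in events if e["cache_hit"] == flag]
        if latencies:
            report[label] = {
                "p50": percentile(latencies, 50),
                "p95": percentile(latencies, 95),
            }
    return report
```

If the hit P95 approaches the miss P50, the lookup path itself (embedding, vector search, network hops) deserves a closer look.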
Cost Evaluation
- How many tokens were saved per day by exact caching?
- How many by semantic caching?
- What is the dollar savings per feature?
- What is the cost of running the cache infrastructure?
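The cost questions reduce to simple arithmetic: savings from avoided calls minus what the cache itself costs. The function and parameter names below are assumptions for illustration, and the $0.04 per call is the figure from the opening example.

```python
def daily_net_savings(cache_hits: int, cost_per_llm_call: float,
                      total_requests: int, lookup_cost_per_request: float,
                      infra_cost_per_day: float) -> float:
    """Gross savings from avoided LLM calls minus cache overhead."""
    gross = cache_hits * cost_per_llm_call
    # Every request pays for the lookup, whether it hits or misses.
    overhead = total_requests * lookup_cost_per_request + infra_cost_per_day
    return gross - overhead


# Illustrative numbers: 10,000 requests per day, 40% served from cache.
savings = daily_net_savings(
    cache_hits=4000,
    cost_per_llm_call=0.04,
    total_requests=10_000,
    lookup_cost_per_request=0.0001,
    infra_cost_per_day=20.0,
)
```

With these made-up inputs the cache avoids about $160 of LLM spend per day and costs about $21 to run, for a net saving of roughly $139.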
Correctness Evaluation
- What is the cache correctness rate (sampled)?
- How many cached answers were flagged as stale or wrong?
- What is the false positive rate of the semantic cache?
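A sampled correctness check might look like the sketch below. It assumes a reviewer (human or automated) has labeled a sample of served responses; the `served_from` and `judged_correct` fields are hypothetical names for that labeled data.

```python
def semantic_false_positive_rate(samples) -> float:
    """Share of sampled semantic-cache hits that were judged wrong."""
    hits = [s for s in samples if s["served_from"] == "semantic"]
    if not hits:
        return 0.0
    wrong = sum(1 for s in hits if not s["judged_correct"])
    return wrong / len(hits)
```

Tracking this rate per intent shows which thresholds are too loose.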
Debugging rule: If a user complains about an incorrect or outdated answer, check whether it was served from cache. If yes, check what cache key matched and whether the invalidation trigger fired correctly. The cache is always the first suspect in staleness bugs.
A Complete Cached Request, End to End
Here is how a single support query flows through the caching layers:
This flow shows the cache hierarchy in action. Each layer catches a different type of repetition. The system saves cost and latency at every step without sacrificing correctness — as long as the cache keys include the right dimensions and invalidation is handled properly.
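One possible shape for that flow in code is sketched below. Everything here is a simplification under stated assumptions: `LayeredCache` is an invented class, `embed` and `generate` are injected placeholders for real embedding and LLM calls, and `toy_embed` is a keyword-based stand-in that exists only to make the example runnable. Prefix caching does not appear as a lookup because it happens on the provider side during generation.

```python
import hashlib
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def toy_embed(text):
    """Toy stand-in for an embedding model: keyword flags only."""
    return [float("password" in text), float("refund" in text), 0.1]


class LayeredCache:
    def __init__(self, embed, generate, threshold=0.9):
        self.embed, self.generate, self.threshold = embed, generate, threshold
        self.exact = {}     # exact key -> answer
        self.semantic = {}  # scope -> list of (vector, answer)

    def answer(self, query: str, scope: str):
        normalized = " ".join(query.lower().split())
        key = hashlib.sha256(f"{scope}\n{normalized}".encode("utf-8")).hexdigest()
        if key in self.exact:                      # layer 1: exact match
            return self.exact[key], "exact"
        vector = self.embed(normalized)            # layer 2: semantic match
        candidates = self.semantic.get(scope, [])  # same scope only
        best = max(candidates, key=lambda c: cosine(vector, c[0]), default=None)
        if best is not None and cosine(vector, best[0]) >= self.threshold:
            return best[1], "semantic"
        result = self.generate(normalized)         # layer 3: call the LLM
        self.exact[key] = result
        self.semantic.setdefault(scope, []).append((vector, result))
        return result, "generated"
```

A first question is generated, a repeat of the same question hits the exact layer, a rephrasing hits the semantic layer, and the same rephrasing from a different scope is generated fresh.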
When Not to Cache
Caching is not always the right answer. Some situations demand a fresh response every time.
| Situation | Why Not to Cache |
|---|---|
| Medical, legal, or financial decisions | High-stakes freshness and correctness — stale answer could cause real harm |
| Personalized account answers | Privacy and permission risk — cached response may leak to wrong user |
| Rapidly changing data | Stale response risk — by the time the cache serves it, the data has changed |
| Creative generation | Users expect variation — identical responses look broken |
| Tool-dependent output | Side effects and current state matter — tool output changes between calls |
| Safety-sensitive moderation | Policy and context must be current — caching moderation results bypasses review |
| Streaming responses | Streaming creates state on the client — caching a stream is complex and rarely worth it |
| A/B test experiments | Different user groups should see different responses — cache defeats the experiment |
Rule of thumb: If a human would pause before repeating yesterday's answer in a conversation today, do not cache it. Caching is for answers that are timeless, identity-agnostic, and deterministic.
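The table can be encoded as an explicit bypass policy so the decision is made in one place. The trait names below are assumptions chosen to mirror the rows above; how each trait gets detected is left open.

```python
from dataclasses import dataclass


@dataclass
class RequestTraits:
    personalized: bool = False
    high_stakes: bool = False         # medical, legal, financial
    data_changes_rapidly: bool = False
    creative: bool = False
    uses_tools: bool = False
    moderation: bool = False
    in_experiment: bool = False       # part of an A/B test


def should_cache_final_answer(traits: RequestTraits) -> bool:
    """Cache only when none of the bypass conditions apply."""
    return not any((
        traits.personalized,
        traits.high_stakes,
        traits.data_changes_rapidly,
        traits.creative,
        traits.uses_tools,
        traits.moderation,
        traits.in_experiment,
    ))
```

Defaulting to a bypass whenever any trait is set keeps the failure mode cheap: a missed saving rather than a wrong answer.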
What to Remember for Interviews
When explaining LLM caching, tell the story in order:
- Exact caching is safest: Hash the full normalized request and all model parameters. Use this for deterministic, repeated tasks.
- Semantic caching is approximate: Embed queries and match by similarity. Tune thresholds by intent and task risk. Always include permission boundaries in the key.
- Prefix caching rewards stable prompts: Keep static content at the front of the prompt. Avoid dynamic values in the prefix. Measure provider-side cache hit rate.
- Cache layers work together: Exact → Semantic → Prefix → Generation. Each layer catches a different repetition pattern.
- Cache boundaries must include permissions: Tenant ID, user scope, and document access level must be in the cache key. Cross-tenant leaks are the most dangerous cache failure.
- Invalidate by version: Include prompt version, model version, embedding model version, and index version in the cache key. Version changes automatically invalidate old entries.
- When in doubt, do not cache: Personalized, time-sensitive, safety-critical, and creative responses should bypass the cache entirely.
Practice: Design caching for a RAG support bot that serves 10,000 tenants. Decide which layers to cache, how to key them, how to invalidate them, and which answers should never be cached. Then trace a repeat query through the full cache hierarchy and explain what happens at each layer.