Prompt Caching and Semantic Caching: Lower Latency and Cost

Learn exact prompt caching, prefix caching, semantic caching, TTLs, invalidation, cache safety, and when caching LLM responses is a bad idea.

Tags: prompt caching, semantic cache, prefix cache, cost reduction, latency

Start With a Repetitive Problem

Imagine you built a support assistant for a SaaS product. Users ask questions like:

"How do I reset my password?" "What is the refund policy?" "Can I change my plan?"

The assistant answers each one correctly using RAG. But here is the problem: different users ask the same thing in different words, and every single request calls the LLM from scratch.

"How do I reset my password?" — $0.04 in tokens, 2 seconds of latency
"I forgot my password, can you help?" — $0.04, 2 seconds
"Reset password steps" — $0.04, 2 seconds

Multiply by thousands of users, and you are spending money and time on the same answers over and over.

Meanwhile, every request carries the same 2000-token system prompt with instructions, tool definitions, and policy text. That prefix is identical every time, but you pay to process it on every call.

Caching exists to stop paying twice for the same work.

✅

Mental model: Caching turns repeated LLM work into a lookup. The harder the work to produce, the more valuable the cache. But the wrong cache is worse than no cache — it can serve stale, private, or incorrect answers silently.

What Prompt and Semantic Caching Means

There are three distinct caching mechanisms for LLM systems, and they solve different problems:

Mechanism	What It Matches	Where It Lives	Who Controls It
Exact-match cache	Identical normalized request	Your infrastructure	You
Semantic cache	Similar meaning above a threshold	Your infrastructure	You
Prefix cache	Shared prompt prefix tokens	Provider serving layer	Provider

They are not alternatives. They work together in a hierarchy.

Why Not Just Use a Regular HTTP Cache?

Beginners often ask: "Can I just put a Redis cache in front of the LLM call?"

You can, but LLM caching is not the same as web caching.

A web cache keys on URL. An LLM cache must key on the full input: prompt text, model name, temperature, system instructions, tool definitions, and tenant scope. Two requests with the same user message but different system prompts produce different answers. A regular HTTP cache has no awareness of these dimensions.

A web cache serves identical content to anyone. An LLM cache must respect permissions — a cached answer for one tenant may include data another tenant should not see.

A web cache has clear invalidation (resource changed, TTL expired). An LLM cache must handle prompt version changes, model upgrades, document updates, embedding model changes, and ACL modifications.

⚠️

Important distinction: An LLM cache is not a general-purpose cache. It is a domain-specific cache that must understand model parameters, permission boundaries, and versioning. Applying a generic cache strategy unchanged risks leaking data or serving stale answers.

Step 1: Exact Prompt Caching

Exact caching is the simplest and safest form. It hashes the normalized prompt and model parameters into a cache key. If the exact same request arrives again, return the cached response.

What Goes Into the Cache Key

The cache key must include everything that affects the output:

Input	Why It Belongs in Key
Model name and version	Different models produce different outputs
System prompt	Changes behavior and instructions
User prompt	Main input
Temperature and decoding params	Non-deterministic settings affect output
Tool and function schemas	Changes valid output structure
Prompt version tag	Supports safe rollout and rollback
Tenant or permission scope	Prevents data leakage across boundaries
Response format instruction	JSON vs markdown vs plain text

Exact caching is safest for deterministic tasks: extraction, classification, structured data transformation, and repeated internal operations. If the prompt includes a timestamp, user name, or request ID, the cache key will be unique every time and caching will not help.

✅

Practical advice: Normalize your prompts before hashing. Strip trailing whitespace, unify line endings, sort JSON keys in tool schemas, and canonicalize boolean values. Two semantically identical requests that differ by a trailing space will miss the cache.
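
The normalization and key construction described above might look like the following minimal sketch. The function names (`normalize_prompt`, `exact_cache_key`) and the payload layout are illustrative assumptions, not a standard API; adapt the fields to whatever actually affects your model's output.

```python
import hashlib
import json


def normalize_prompt(text: str) -> str:
    """Unify line endings, strip trailing whitespace per line, and trim the ends."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines).strip()


def exact_cache_key(
    model: str,
    system_prompt: str,
    user_prompt: str,
    params: dict,
    tool_schemas: list,
    prompt_version: str,
    tenant_id: str,
    response_format: str,
) -> str:
    """Hash every input that affects the output into one fixed-length key."""
    payload = {
        "model": model,
        "system": normalize_prompt(system_prompt),
        "user": normalize_prompt(user_prompt),
        "params": params,
        "tools": tool_schemas,
        "prompt_version": prompt_version,
        "tenant": tenant_id,
        "format": response_format,
    }
    # sort_keys=True canonicalizes key order, including inside nested tool schemas.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Serializing to canonical JSON before hashing keeps field boundaries unambiguous, so two different requests cannot collapse into the same string by accident.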

Step 2: Semantic Caching

Semantic caching handles the case where different users ask the same question in different words.

"Can I get a refund?" and "What is your refund policy?" should return the same answer. But the exact strings are different, so exact-match caching will miss.

Semantic caching embeds each incoming query and searches for a previously answered query with high embedding similarity.
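
As a rough illustration of that lookup, here is a minimal in-memory sketch. It assumes the caller has already turned each query into an embedding vector with some embedding model (not shown), and it scans entries linearly; a production system would typically use a vector index instead. The class name `SemanticCache` is hypothetical.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Linear-scan semantic cache keyed by query embedding."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def put(self, query_embedding: list[float], answer: str) -> None:
        self.entries.append((query_embedding, answer))

    def get(self, query_embedding: list[float]) -> Optional[str]:
        # Find the most similar stored query, then accept it only above the threshold.
        best_score, best_answer = 0.0, None
        for stored_embedding, answer in self.entries:
            score = cosine_similarity(query_embedding, stored_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```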

Threshold Tuning

The similarity threshold determines the tradeoff between hit rate and correctness.

Threshold	Behavior	Risk
Too high (e.g., 0.98)	Very few hits, safe	Cache is mostly useless
Too low (e.g., 0.70)	Many hits, fast and cheap	Wrong answers on subtle differences
Per intent (best)	Different thresholds per task type	More configuration

For example, a support FAQ about password reset can use a lower threshold because the answer is stable. A question about the user's current subscription plan should use a much higher threshold because small wording changes change the meaning.
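
Per-intent thresholds can be as simple as a lookup table. The sketch below assumes an upstream intent classifier already labels each query; the intent names and threshold values are made up for illustration and would need tuning against labeled query pairs from real traffic.

```python
# Hypothetical thresholds; none of these numbers come from a benchmark.
INTENT_THRESHOLDS = {
    "password_reset": 0.85,     # stable FAQ answer, a looser match is acceptable
    "refund_policy": 0.90,
    "subscription_plan": 0.97,  # small wording changes change the meaning
}
DEFAULT_THRESHOLD = 0.95  # unknown intents get a conservative threshold


def is_cache_hit(intent: str, similarity: float) -> bool:
    """Accept a semantic match only if it clears the threshold for its intent."""
    threshold = INTENT_THRESHOLDS.get(intent, DEFAULT_THRESHOLD)
    return similarity >= threshold
```

Defaulting unknown intents to a strict threshold means a misclassified query is more likely to miss the cache than to receive a wrong answer.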

A useful complement: include the content hash of the source document (not of the answer) in the cache key. When the source document changes, old semantic cache entries stop matching automatically.

⚠️

Do not semantic-cache sensitive or personalized answers unless the cache key includes the permission and personalization boundary. If two users ask "What is my bill?" with similar phrasing, the semantic cache should not serve one user's answer to the other.

Step 3: Prefix Caching (Provider-Side)

Prefix caching is different from exact and semantic caching. It happens inside the LLM provider's serving infrastructure, not in your code.

Many LLM requests share a long prefix: system instructions, tool definitions, few-shot examples, and static policy text. The provider stores the attention key-value (KV) state it computed for that shared prefix. When a new request arrives with the same prefix, the provider reuses that state instead of recomputing it for those tokens.

Prefix caching is often the most broadly useful optimization because it reduces latency and input-processing cost for every request that shares a prefix, whether or not the user's question has been asked before. It does not skip generation: output tokens are still produced for each request.

Designing Prompts for Prefix Cache Efficiency

Keep stable instructions at the beginning. System prompt first, then tool definitions, then examples, then the user message last. Prefix matching starts at the first token, so the reusable portion ends at the first token that differs.
Avoid dynamic content in the prefix. Do not put timestamps, request IDs, or user names in the system prompt. They break the prefix match.
Version prompt prefixes deliberately. When you change the system prompt, existing prefix cache entries stop matching. Group prompt changes into releases rather than making continuous edits.
Measure prefix cache hit rate. Many providers report cached input tokens in the response usage data. If your hit rate is low, your prefix may contain too much dynamic content or you may have too many prompt variants.
Align prefixes across requests. If your support chat and your summarization feature use different system prompts, they do not share prefix cache. Consider whether they can share a common prefix.
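
The ordering rules above can be sketched as a prompt builder that puts all static content first and everything request-specific last. The prompt texts and function names here are hypothetical, and the character-level prefix comparison is only a rough proxy: providers match on tokens, and their exact matching rules vary.

```python
import os

SYSTEM_PROMPT = "You are a support assistant for ExampleApp."   # hypothetical text
TOOL_DEFINITIONS = '[{"name": "lookup_order"}]'                   # hypothetical schema
FEW_SHOT_EXAMPLES = "Q: How do I log in?\nA: Use the sign-in page."

STATIC_PREFIX = "\n\n".join([SYSTEM_PROMPT, TOOL_DEFINITIONS, FEW_SHOT_EXAMPLES])


def build_prompt(user_message: str, current_time: str = "") -> str:
    """Static content first; anything that varies per request goes after it."""
    dynamic_parts = []
    if current_time:
        dynamic_parts.append(f"Current time: {current_time}")
    dynamic_parts.append(user_message)
    return STATIC_PREFIX + "\n\n" + "\n\n".join(dynamic_parts)


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading substring of two prompts."""
    return len(os.path.commonprefix([a, b]))
```

Comparing two built prompts with `shared_prefix_chars` gives a quick local check that a prompt change did not accidentally move dynamic content ahead of the static prefix.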

Step 4: Caching in RAG Systems

RAG systems have more cacheable parts than a plain LLM call. Each layer of the pipeline can be cached independently.

Layer	Cache Key	Safety	Savings
Parsed document	Document ID + content hash	Very safe	Avoids reparsing
Embeddings	Chunk hash + embedding model	Very safe	Avoids re-embedding
Retrieval result	Query + filters + index version	Safe if permissions match	Avoids vector search
Rerank result	Query + candidates + reranker version	Safe if permissions match	Avoids reranking
Final answer	Prompt + context IDs + model + user scope	Riskiest	Avoids generation

Final-answer caching is the riskiest because the cached answer depends on all upstream layers. If the source document changes, the embedding model updates, the index is rebuilt, or the prompt changes, the cached answer becomes stale. Retrieval and embedding caches are usually safer and still provide meaningful savings.
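
As one example of a safer upstream layer, the embedding cache from the table could be sketched like this. The `embed_fn` callable stands in for whatever embedding client you use and is an assumption, as is the class name; the key follows the table: chunk content hash plus embedding model.

```python
import hashlib
from typing import Callable, Optional


class EmbeddingCache:
    """Caches embeddings by (chunk content hash, embedding model name)."""

    def __init__(
        self,
        embed_fn: Callable[[str], list[float]],  # hypothetical embedding function
        model_name: str,
        store: Optional[dict] = None,
    ) -> None:
        self.embed_fn = embed_fn
        self.model_name = model_name
        # A shared store may hold entries for several models; the model name
        # in the key keeps them apart.
        self.store = store if store is not None else {}
        self.misses = 0

    def embed(self, chunk_text: str) -> list[float]:
        chunk_hash = hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()
        key = (chunk_hash, self.model_name)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.embed_fn(chunk_text)
        return self.store[key]
```

Because the key is derived from content rather than from a document ID, an edited chunk is re-embedded automatically while unchanged chunks keep hitting the cache.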

Step 5: TTL and Invalidation

Caching without invalidation is just accumulating stale answers. Different data types need different invalidation strategies.

Data Type	TTL Strategy	Invalidation Trigger
Static docs	Long TTL with content-hash invalidation	Document content changes
Product policies	Short TTL or versioned invalidation	Policy update published
User-specific answers	Very short TTL or no cache	Session or user state changes
Compliance answers	Avoid final-answer cache	Regulatory changes must be immediate
Analytics summaries	TTL aligned with source refresh	Data pipeline completes
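
A per-data-type TTL policy like the one in the table might be wired up as follows. The TTL values are placeholders chosen for illustration, the data-type names are hypothetical, and the clock is injectable so that expiry can be tested without waiting.

```python
import time
from typing import Callable, Optional

# Hypothetical TTLs in seconds per data type; None means "do not cache".
TTL_SECONDS = {
    "static_docs": 7 * 24 * 3600,
    "product_policy": 3600,
    "user_specific": 60,
    "compliance": None,
}


class TTLCache:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        self.store: dict[str, tuple[float, str]] = {}

    def put(self, key: str, value: str, data_type: str) -> bool:
        ttl = TTL_SECONDS.get(data_type)
        if ttl is None:  # unknown or uncacheable data types are never stored
            return False
        self.store[key] = (self.clock() + ttl, value)
        return True

    def get(self, key: str) -> Optional[str]:
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self.store[key]  # lazy eviction on read
            return None
        return value
```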

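Invalidation Events

Typical invalidation events are prompt version changes, model upgrades, source document updates, embedding model changes, index rebuilds, and ACL modifications.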

Version-Aware Caching

The safest invalidation strategy is to include version identifiers in the cache key itself:

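python

import hashlib
import json

def cache_key(normalized_prompt, model_version, prompt_version,
              embedding_model_version, index_version, tenant_id):
    # Serialize the fields as a JSON list so their boundaries stay unambiguous,
    # then hash the result into a fixed-length key.
    fields = [normalized_prompt, model_version, prompt_version,
              embedding_model_version, index_version, tenant_id]
    return hashlib.sha256(json.dumps(fields).encode("utf-8")).hexdigest()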

When any version changes, the cache key changes automatically, and old entries are never returned. This is simpler than maintaining explicit invalidation lists. The tradeoff is that old cache entries become orphaned and must be evicted by the cache's TTL policy.

Common Failure Stories

The Stale Answer After a Policy Change

The refund policy changed on Monday. The semantic cache still returns "You can cancel within 30 days" on Wednesday. The cache key did not include the document version, so the old answer looks valid.

The fix: include source document version and content hash in the cache key. When the document updates, the cache misses automatically.
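
For a semantic cache, one way to apply this fix is to store the source document's hash alongside each cached answer and check it on every hit. This is a minimal sketch under that assumption; `CachedAnswer` and `is_fresh` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class CachedAnswer:
    answer: str
    doc_id: str
    doc_hash: str  # hash of the source document when the answer was generated


def is_fresh(entry: CachedAnswer, current_docs: dict[str, str]) -> bool:
    """A cached answer is usable only if its source document is unchanged."""
    current_text = current_docs.get(entry.doc_id)
    return current_text is not None and content_hash(current_text) == entry.doc_hash
```

A hit that fails `is_fresh` should be treated as a miss and regenerated, which also covers the case where the source document was deleted.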

The Semantic Cache Returned the Wrong Answer

A user asks "What happens if I downgrade my plan?" The semantic cache finds a similar query: "What happens if I upgrade my plan?" — similarity 0.91, above threshold. It returns the upgrade answer to a downgrade question.

The fix: use per-intent thresholds and include intent classification in the cache key. Upgrade and downgrade should be different cache buckets.

The Cross-Tenant Cache Leak

Tenant A asks "What is my contract renewal date?" Tenant B asks "What is my renewal date?" The semantic cache matches and returns Tenant A's renewal date to Tenant B.

The fix: always include tenant scope in the cache key, and apply the tenant filter before the similarity search so that another tenant's entries are never candidates for a match.
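
One way to make this leak structurally impossible is to partition the semantic cache by tenant, so a lookup never even scans another tenant's entries. The sketch below is illustrative: the class name is hypothetical, embeddings are assumed to be computed elsewhere, and a real deployment would more likely use a metadata filter or a per-tenant namespace in a vector store.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TenantScopedSemanticCache:
    """Keeps one partition per tenant so matches never cross tenant boundaries."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.partitions: dict[str, list[tuple[list[float], str]]] = {}

    def put(self, tenant_id: str, embedding: list[float], answer: str) -> None:
        self.partitions.setdefault(tenant_id, []).append((embedding, answer))

    def get(self, tenant_id: str, embedding: list[float]) -> Optional[str]:
        # Only the caller's own partition is searched.
        best_score, best_answer = 0.0, None
        for stored, answer in self.partitions.get(tenant_id, []):
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```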

The Prefix Cache Stopped Working

A developer adds {{current_timestamp}} to the system prompt for logging. Now every request has a unique prefix. The prefix cache hit rate drops from 80% to 0%. Cost per request doubles.

The fix: keep timestamps and request metadata out of the prompt prefix. Use a separate logging mechanism that does not affect the prompt.

The Cached Answer Contained PII

A user asks "What is my email on file?" The response includes their email address. The exact-match cache stores this response. Another user with a different account asks the exact same question and gets the first user's email.

The fix: never cache personalized responses with an exact-match or semantic cache unless the cache key includes user identity and the cache is scoped to that user.

Evaluating Cache Quality

You cannot optimize caching without measuring its impact on cost, latency, and correctness.

Hit Rate Evaluation

What is the exact-match cache hit rate per feature?
What is the semantic cache hit rate per intent?
What is the prefix cache hit rate (provider-reported)?
How does hit rate vary by time of day and traffic patterns?

Latency Evaluation

What is the P50 and P95 latency for cache hits vs cache misses?
What is the overhead of the semantic cache embedding and search step?
Is the cache lookup faster than generating the response? (If not, something is wrong.)

Cost Evaluation

How many tokens were saved per day by exact caching?
How many by semantic caching?
What is the dollar savings per feature?
What is the cost of running the cache infrastructure?
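
The cost questions reduce to simple arithmetic once the inputs are measured. The helper below is a sketch with a hypothetical name; it ignores second-order costs such as embedding calls for semantic lookups, which you may want to fold into the infrastructure figure.

```python
def net_daily_savings(
    requests_per_day: int,
    hit_rate: float,
    cost_per_llm_call: float,
    cache_infra_cost_per_day: float,
) -> float:
    """Avoided LLM spend minus the cost of running the cache."""
    avoided_calls = requests_per_day * hit_rate
    return avoided_calls * cost_per_llm_call - cache_infra_cost_per_day
```

With assumed numbers of 10,000 requests a day, a 40% hit rate, $0.04 per call, and $50 a day of cache infrastructure, the cache avoids 4,000 calls worth $160 and nets roughly $110 a day. At a 1% hit rate on 1,000 requests the same infrastructure loses money.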

Correctness Evaluation

What is the cache correctness rate (sampled)?
How many cached answers were flagged as stale or wrong?
What is the false positive rate of the semantic cache?
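
Several of these questions can be answered from a per-request log. The sketch below assumes a hypothetical `RequestLog` record with a cache-hit flag, a latency, and an optional correctness judgment for sampled hits. It reads "false positive rate" as the share of reviewed hits that were judged wrong, which is one practical interpretation.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestLog:
    cache_hit: bool
    latency_ms: float
    judged_correct: Optional[bool] = None  # set only for sampled, reviewed hits


def hit_rate(logs: list[RequestLog]) -> float:
    return sum(log.cache_hit for log in logs) / len(logs) if logs else 0.0


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def wrong_hit_rate(logs: list[RequestLog]) -> float:
    """Share of reviewed cache hits that were judged incorrect."""
    reviewed = [log for log in logs if log.cache_hit and log.judged_correct is not None]
    if not reviewed:
        return 0.0
    return sum(not log.judged_correct for log in reviewed) / len(reviewed)
```

Computing percentiles separately for hits and misses answers the latency questions directly: pass `[log.latency_ms for log in logs if log.cache_hit]` for one and the complement for the other.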

✅

Debugging rule: If a user complains about an incorrect or outdated answer, check whether it was served from cache. If yes, check what cache key matched and whether the invalidation trigger fired correctly. The cache is always the first suspect in staleness bugs.

A Complete Cached Request, End to End

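Here is how a single support query flows through the caching layers: the exact-match cache is checked first, then the semantic cache, and only on a miss in both does the request reach the model, where the provider's prefix cache can still reuse the shared prompt prefix.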

This flow shows the cache hierarchy in action. Each layer catches a different type of repetition. The system saves cost and latency at every step without sacrificing correctness — as long as the cache keys include the right dimensions and invalidation is handled properly.
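
The same flow can be sketched in code. This is a simplified illustration, not a reference implementation: `embed` and `generate` are hypothetical callables standing in for an embedding model and the LLM call, the normalization is deliberately crude, and versioning and TTLs are left out to keep the control flow visible.

```python
import hashlib
import math
from typing import Callable


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CachedAssistant:
    def __init__(
        self,
        embed: Callable[[str], list[float]],   # hypothetical: text -> vector
        generate: Callable[[str], str],        # hypothetical: the LLM call
        threshold: float = 0.9,
    ) -> None:
        self.embed = embed
        self.generate = generate
        self.threshold = threshold
        self.exact: dict[str, str] = {}
        self.semantic: dict[str, list[tuple[list[float], str]]] = {}

    def _exact_key(self, tenant_id: str, query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(f"{tenant_id}\x1f{normalized}".encode("utf-8")).hexdigest()

    def answer(self, tenant_id: str, query: str) -> tuple[str, str]:
        """Return (answer, source); source is 'exact', 'semantic', or 'generated'."""
        key = self._exact_key(tenant_id, query)
        if key in self.exact:                      # layer 1: exact match
            return self.exact[key], "exact"
        vector = self.embed(query)
        for stored, cached in self.semantic.get(tenant_id, []):  # layer 2: semantic
            if _cosine(vector, stored) >= self.threshold:
                return cached, "semantic"
        # Miss on both layers: call the model. Any provider-side prefix
        # caching happens inside this call, outside our code.
        result = self.generate(query)
        self.exact[key] = result
        self.semantic.setdefault(tenant_id, []).append((vector, result))
        return result, "generated"
```

Both caches are scoped by tenant, so an identical question from a different tenant falls through to generation rather than reusing another tenant's answer.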

When Not to Cache

Caching is not always the right answer. Some situations demand a fresh response every time.

Situation	Why Not to Cache
Medical, legal, or financial decisions	High-stakes freshness and correctness — stale answer could cause real harm
Personalized account answers	Privacy and permission risk — cached response may leak to wrong user
Rapidly changing data	Stale response risk — by the time the cache serves it, the data has changed
Creative generation	Users expect variation — identical responses look broken
Tool-dependent output	Side effects and current state matter — tool output changes between calls
Safety-sensitive moderation	Policy and context must be current — caching moderation results bypasses review
Streaming responses	Streaming creates state on the client — caching a stream is complex and rarely worth it
A/B test experiments	Different user groups should see different responses — cache defeats the experiment

✅

Rule of thumb: If a human would pause before repeating yesterday's answer in a conversation today, do not cache it. Caching is for answers that are timeless, identity-agnostic, and deterministic.
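
A bypass policy along the lines of the table can be expressed as a guard that runs before any cache read or write. The trait names below are hypothetical and would be set by your own request classification; the point is that a single risky trait is enough to skip the cache.

```python
from dataclasses import dataclass


@dataclass
class RequestTraits:
    personalized: bool = False          # answer depends on the user's account
    high_stakes: bool = False           # medical, legal, or financial decisions
    rapidly_changing_data: bool = False
    creative: bool = False
    uses_live_tools: bool = False
    safety_moderation: bool = False
    in_experiment: bool = False         # part of an A/B test


def should_cache(traits: RequestTraits) -> bool:
    """Cache only when none of the risky traits apply."""
    return not any(
        [
            traits.personalized,
            traits.high_stakes,
            traits.rapidly_changing_data,
            traits.creative,
            traits.uses_live_tools,
            traits.safety_moderation,
            traits.in_experiment,
        ]
    )
```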

What to Remember for Interviews

When explaining LLM caching, tell the story in order:

Exact caching is safest: Hash the full normalized request and all model parameters. Use this for deterministic, repeated tasks.
Semantic caching is approximate: Embed queries and match by similarity. Tune thresholds by intent and task risk. Always include permission boundaries in the key.
Prefix caching rewards stable prompts: Keep static content at the front of the prompt. Avoid dynamic values in the prefix. Measure provider-side cache hit rate.
Cache layers work together: Exact → Semantic → Generation, with the provider's prefix cache lowering the cost of generation on a miss. Each layer catches a different repetition pattern.
Cache boundaries must include permissions: Tenant ID, user scope, and document access level must be in the cache key. Cross-tenant leaks are the most dangerous cache failure.
Invalidate by version: Include prompt version, model version, embedding model version, and index version in the cache key. Version changes automatically invalidate old entries.
When in doubt, do not cache: Personalized, time-sensitive, safety-critical, and creative responses should bypass the cache entirely.

✅

Practice: Design caching for a RAG support bot that serves 10,000 tenants. Decide which layers to cache, how to key them, how to invalidate them, and which answers should never be cached. Then trace a repeat query through the full cache hierarchy and explain what happens at each layer.
