LLM Gateway and Routing: Model Selection, Fallbacks, and Cost Control

Design the gateway layer between applications and LLM providers, including model routing, provider fallback, rate limiting, semantic routing, observability, and cost tracking.

LLM gateway, model routing, fallback, cost optimization, rate limiting

Start With a Mess of Direct Integrations

Imagine your product has five features that call LLMs:

A support chat that answers customer tickets
A content summarizer that condenses weekly analytics
A code generator embedded in your internal CLI
A moderation service that flags unsafe user posts
A translation pipeline for multilingual docs

Each team picks a different provider. The chat team sets up OpenAI directly. The summarizer team uses Anthropic. The code generator calls a self-hosted model. The moderation team tries three providers and picks the cheapest. The translation team routes through a friend's side project.

Every integration duplicates the same plumbing: authentication, retries, rate limits, logging, error handling, and cost tracking. When OpenAI goes down, four teams scramble independently. When the monthly invoice arrives, nobody can explain which feature drove the spend. When a new intern writes a prompt that extracts PII and sends it to a model, nobody catches it.

An LLM gateway exists to prevent this mess.

✅

Mental model: The gateway is not just a proxy. It is a policy, routing, observability, and control plane for model usage. It separates what the application wants from how the LLM delivers it.

What an LLM Gateway Means

An LLM gateway sits between your product services and the language model providers. Every LLM call goes through the gateway, not directly to OpenAI, Anthropic, or a self-hosted endpoint.

The gateway decides:

Which model should handle this request
Whether the request is safe and policy-compliant
How fast to respond and what to do if the provider stalls
How much it costs and who to bill it to
What to log for debugging and compliance

Without a gateway, every service makes these decisions independently. With a gateway, they are centralized, measurable, and changeable without touching product code.

Why Not Just Call Providers Directly from Each Service?

Beginners often ask: "Why add another layer? Each team can manage their own integration."

Direct calls from each service work for one model and one team. They break at five models and five teams.

Here is what you lose without a gateway:

Concern	Without Gateway	With Gateway
Provider key rotation	Every service redeploys	One config change
Model upgrade	Each team updates independently	One routing rule
Provider outage	Each team scrambles	Automatic fallback
Cost attribution	Manual spreadsheet	Tagged per request
Prompt safety audit	Impossible to enforce	Centralized policy
Rate limit coordination	Each service hits quota alone	Shared budget pool

⚠️

Important distinction: A gateway is not the same as an API proxy. A proxy forwards requests. A gateway routes, enriches, validates, observes, and enforces policy. The difference matters when you need to explain a compliance audit or debug a bad answer.

The Request Flow Through the Gateway

Every LLM request goes through the same pipeline inside the gateway.

The trick is in what happens at each step. The gateway is not just forwarding bytes. It is making decisions about policy, safety, cost, and reliability while the request is in flight.

How the Application Talks to the Gateway

The application describes its intent. The gateway decides how to fulfill it.

json

{
  "feature": "support-chat",
  "tenantId": "acme",
  "taskType": "rag_answer",
  "latencyClass": "interactive",
  "qualityTier": "high",
  "messages": [
    {"role": "system", "content": "You are a support assistant..."},
    {"role": "user", "content": "Can I pause my annual subscription?"}
  ]
}

The application should not say "use GPT-4." It should say "this is a support question that needs high quality." The gateway maps that intent to the appropriate model, provider, and policy.

This separation is what makes the gateway powerful. You can change the underlying model without changing application code.
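As a rough sketch of that contract, the gateway could parse the intent payload into a typed request and reject callers that try to pin a model. The names `GatewayRequest` and `parse_request`, and the allowed values for `latencyClass` and `qualityTier`, are assumptions for illustration; the example payload above only shows `interactive` and `high`.

```python
import json
from dataclasses import dataclass

# Assumed vocabularies; only "interactive" and "high" appear in the example payload.
LATENCY_CLASSES = {"interactive", "batch"}
QUALITY_TIERS = {"low", "standard", "high"}


@dataclass(frozen=True)
class GatewayRequest:
    feature: str
    tenant_id: str
    task_type: str
    latency_class: str
    quality_tier: str
    messages: tuple


def parse_request(raw: str) -> GatewayRequest:
    """Parse the JSON body and reject requests that name a model directly."""
    body = json.loads(raw)
    if "model" in body:
        # Applications describe intent; the gateway picks the model.
        raise ValueError("callers must not pin a model; send taskType and qualityTier")
    if body["latencyClass"] not in LATENCY_CLASSES:
        raise ValueError(f"unknown latencyClass: {body['latencyClass']}")
    if body["qualityTier"] not in QUALITY_TIERS:
        raise ValueError(f"unknown qualityTier: {body['qualityTier']}")
    return GatewayRequest(
        feature=body["feature"],
        tenant_id=body["tenantId"],
        task_type=body["taskType"],
        latency_class=body["latencyClass"],
        quality_tier=body["qualityTier"],
        messages=tuple((m["role"], m["content"]) for m in body["messages"]),
    )
```

Rejecting a `model` field at the edge is one way to keep the separation honest: if callers can never name a model, swapping models stays a gateway-side change.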

Step 1: Model Routing

Not every task needs the most expensive model. Some requests are trivial. Some need deep reasoning. Some need structured output. Some need speed above all else.

The gateway must decide which model handles each request.

Routing by Task Type

Task	Routing Choice	Why
Simple classification	Small fast model	Cheap, fast, good enough
JSON extraction	Model with strong structured output	Schema adherence matters
Legal or financial reasoning	Higher quality model plus guardrails	Error cost is high
RAG answer with citations	Strong instruction following	Citation accuracy matters
Embedding	Embedding-specific model	General models waste tokens
Bulk offline summarization	Cheaper batch path	Latency is irrelevant
Code generation	Code-tuned model	General models produce worse code
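The table above can be expressed as plain configuration. This is a minimal sketch: the task-type keys and model aliases such as `small-fast` are hypothetical placeholders, not real provider model IDs, and sending unknown task types to the higher-quality route is an assumed default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str                # placeholder alias, not a real provider model ID
    guardrails: bool = False  # run extra validation on the output
    batch: bool = False       # use the cheaper offline path


# Mirrors the routing table; keys and aliases are hypothetical.
ROUTES = {
    "classification": Route("small-fast"),
    "json_extraction": Route("structured-output"),
    "legal_reasoning": Route("high-quality", guardrails=True),
    "rag_answer": Route("instruction-following"),
    "embedding": Route("embedding"),
    "bulk_summarization": Route("cheap-batch", batch=True),
    "code_generation": Route("code-tuned"),
}

# Assumed default: unknown tasks get the cautious, higher-quality route.
DEFAULT_ROUTE = Route("high-quality", guardrails=True)


def select_route(task_type: str) -> Route:
    """Return the configured route for a task type, or the cautious default."""
    return ROUTES.get(task_type, DEFAULT_ROUTE)
```

Keeping the mapping in data rather than in branching code means a model upgrade is a one-line change to the table.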

Semantic Routing

Some decisions cannot be made with a simple if-else on a feature flag. Semantic routing embeds or classifies the request to choose the model, prompt, or tool chain.

For example, a help desk receives queries about billing, account security, technical issues, and sales. Each intent needs a different prompt, different model, and different data sources.

Use semantic routing for intent families — groups of requests that need different processing. Do not use it to hide business logic that should be explicit configuration.
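The help desk example might be sketched as a nearest-intent lookup. To keep the sketch self-contained, word-count vectors stand in for real embeddings; a production gateway would likely call an embedding model instead. The intent names, example phrases, and the `min_score` threshold are all assumptions.

```python
import math
import re
from collections import Counter

# Example phrases per intent family (hypothetical). Word counts stand in
# for the vectors a real embedding model would produce.
INTENT_EXAMPLES = {
    "billing": "invoice charge refund payment subscription plan price",
    "account_security": "password login locked two factor reset hacked",
    "technical": "error crash bug api timeout not working install",
    "sales": "demo pricing quote enterprise upgrade seats contract",
}


def vectorize(text: str) -> Counter:
    """Lowercase the text and count its words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route_intent(query: str, min_score: float = 0.1) -> str:
    """Pick the closest intent family, or 'unknown' if nothing is similar enough."""
    q = vectorize(query)
    scores = {name: cosine(q, vectorize(ex)) for name, ex in INTENT_EXAMPLES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "unknown"
```

Returning `unknown` below the threshold gives the gateway an explicit place to apply a default route instead of guessing.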

Step 2: Provider Fallback

Providers go down. They throttle. They time out. A gateway must handle these failures so that application teams do not each implement their own retry logic.

Fallback improves availability, but it introduces subtle risks. A secondary model may have a different output style, different context limit, different tool support, or different safety behavior. Testing the fallback path is as important as testing the primary path.

Fallback Chain

Fallback Policy Decisions

Failure	Possible Action	Risk
Provider 5xx	Retry once, then fall back	Latency spike on retry
Timeout	Fall back if interactive; retry if batch	Wrong model on fallback
Rate limit	Queue, fall back, or reject based on feature	Queued requests may go stale
Safety refusal	Do not fall back blindly	Policy bypass
Schema validation failure	Retry with repair prompt, then fail closed	May surface bad data
Content filter trigger	Log and reject	Regulatory exposure

⚠️

Do not fall back around safety: If a model refuses or flags unsafe content, sending the same request to another provider can bypass policy unintentionally. Safety refusal should be a hard stop, not a reroute.
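One way to encode this policy is to give transient failures and safety refusals separate exception types, so that only the former can trigger the next provider. The exception names and the `(name, callable)` provider shape below are assumptions for illustration, not any particular SDK's API.

```python
class ProviderError(Exception):
    """Transient provider failure (5xx, timeout); eligible for fallback."""


class SafetyRefusal(Exception):
    """The model or a content filter refused; never eligible for fallback."""


def call_with_fallback(providers, request):
    """Try each (name, callable) pair in order.

    Returns (provider_name, response). A SafetyRefusal propagates immediately;
    ProviderError is raised only if every provider fails.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(request)
        except SafetyRefusal:
            # A refusal is a successful enforcement, so it is not rerouted.
            raise
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the response lets the caller log which link in the chain actually answered.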

Step 3: Rate Limiting and Budgets

LLM calls cost money and consume provider quota. Without rate limits, a single runaway feature or abusive user can exhaust the monthly budget in hours.

Rate limits must work at multiple levels simultaneously:

Limit Type	What It Protects	Example
Requests per minute	Provider quota and service stability	100 RPM per tenant
Tokens per minute	Cost and throughput	500K TPM per feature
Daily budget	Finance controls	$500/day per team
Concurrent streams	Connection capacity	10 concurrent per user
Feature-level quota	Product fairness	50 requests/day for free tier

Budget enforcement should happen before expensive operations. If a request is going to be rate-limited, reject it before running retrieval, pre-processing, or embedding. Otherwise you pay for work that never reaches the model.
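A tokens-per-minute limit can be sketched as a token bucket that is checked before any expensive work starts. The class and function names are hypothetical, and the token estimate is assumed to come from the caller (for example, from a tokenizer run on the prompt).

```python
import time


class TokenBucket:
    """Token bucket sized in LLM tokens per minute, not requests."""

    def __init__(self, tokens_per_minute: int, clock=time.monotonic):
        self.capacity = float(tokens_per_minute)
        self.refill_per_sec = tokens_per_minute / 60.0
        self.available = self.capacity
        self.clock = clock  # injectable so tests can control time
        self.last = clock()

    def try_consume(self, estimated_tokens: int) -> bool:
        """Reserve tokens up front; return False so the caller can reject early."""
        now = self.clock()
        refilled = self.available + (now - self.last) * self.refill_per_sec
        self.available = min(self.capacity, refilled)
        self.last = now
        if estimated_tokens > self.available:
            return False
        self.available -= estimated_tokens
        return True


def admit(bucket: TokenBucket, estimated_tokens: int, run_pipeline):
    """Check the limit before retrieval, pre-processing, or embedding runs."""
    if not bucket.try_consume(estimated_tokens):
        return {"status": 429, "error": "token rate limit exceeded"}
    return {"status": 200, "result": run_pipeline()}
```

Because `run_pipeline` is only invoked after the bucket accepts the request, a rejected request costs nothing beyond the check itself.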

Step 4: Cost Tracking

If you cannot attribute LLM costs, you cannot optimize them. Every request through the gateway should be tagged with enough dimensions to answer: "Where did our LLM budget go this month?"

What to Track

Dimension	Why It Matters
Model	The single biggest cost driver
Feature	Support chat vs report generator have different ROI
Tenant	Enterprise customer usage for billing
User	Abuse detection and fairness enforcement
Prompt version	Cost impact of prompt changes and experiments
Retrieval strategy	RAG context size directly affects token cost
Provider	Different providers have different per-token prices

Cost Formula

Every request has a predictable cost:

txt

request_cost =
  input_tokens * input_price_per_token
  + output_tokens * output_price_per_token
  + retrieval_cost
  + reranking_cost

The first two terms usually dominate. If a RAG request uses 4,000 input tokens and generates 500 output tokens on a model that costs $10 per million input tokens, the input cost alone is 4,000 × $10 / 1,000,000 = $0.04 per request, before output tokens are priced. Spread across millions of requests, small differences in context size become large differences in monthly spend.
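The formula translates directly into a small function. In this sketch, prices are expressed per million tokens; the 3,000-token comparison and the one-million-request volume are hypothetical numbers chosen to illustrate the context-size effect.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    retrieval_cost: float = 0.0,
    reranking_cost: float = 0.0,
) -> float:
    """Cost of one request in dollars, following the formula above."""
    return (
        input_tokens * input_price_per_million / 1_000_000
        + output_tokens * output_price_per_million / 1_000_000
        + retrieval_cost
        + reranking_cost
    )


# Input side of the RAG example: 4000 tokens at $10 per million input tokens.
input_only = request_cost(4000, 0, 10.0, 0.0)

# Hypothetical comparison: trim context to 3000 tokens, over one million requests.
monthly_saving = (input_only - request_cost(3000, 0, 10.0, 0.0)) * 1_000_000
```

Under these assumptions, `input_only` is about $0.04 and trimming 1,000 tokens of context saves roughly $10,000 per million requests.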

Cost dashboards should show both absolute spend and value metrics. A feature that costs $5000 a month but resolves 10,000 support tickets may be more valuable than a feature that costs $200 a month but solves nothing.

Step 5: Gateway State and Data

The gateway sits in the critical path of every request. It collects data that is useful for operations, debugging, compliance, and capacity planning.

Data	Why Keep It	Retention Guidance
Raw prompts and responses	Debug bad answers, audit safety	Redact or sample if sensitive
Token counts	Cost and capacity planning	Keep aggregated
Model routing decisions	Verify routing logic	Keep with request trace ID
Provider errors	Reliability analysis	Keep until resolved
Safety classifications	Compliance and policy audit	Keep according to policy

💡

Privacy matters: Gateway logs can contain sensitive user data. Redaction, retention, access control, and audit logs are part of the gateway design, not an afterthought.
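A minimal sketch of redaction at log time follows. The two regular expressions are deliberately naive stand-ins for a real PII detector, and the `log_record` shape and `sample_raw` flag are assumptions for illustration.

```python
import re

# Naive patterns for illustration only; real PII detection needs far more coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Runs of 13 to 16 digits, optionally separated by spaces or dashes.
CARD = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def redact(text: str) -> str:
    """Replace email addresses and card-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)


def log_record(trace_id: str, feature: str, prompt: str, response: str,
               sample_raw: bool = False) -> dict:
    """Always keep sizes; keep redacted raw text only when this request is sampled."""
    record = {
        "trace_id": trace_id,
        "feature": feature,
        "prompt_chars": len(prompt),
        "response_chars": len(response),
    }
    if sample_raw:
        record["prompt"] = redact(prompt)
        record["response"] = redact(response)
    return record
```

Storing only sizes by default and raw text only for sampled requests follows the "redact or sample if sensitive" guidance in the table above.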

The gateway should also expose health metrics for itself:

Current request rate and concurrency
Provider error rates (per provider and per error type)
Fallback activation rate
Rate limit hit rate
P50, P95, P99 latency per model and per feature
Cost per hour, per feature, per tenant

Common Failure Stories

The Wrong Model Was Selected

A billing question about enterprise contracts is routed to a small, cheap model. The model hallucinates a discount policy because it lacks the reasoning depth to interpret the contract terms. The routing classification was too aggressive about cost savings.

The fix: routing rules must weigh the cost of a wrong answer, not just the cost of the call. High-cost-of-error tasks should always use a higher-quality model or include a verification step.

The Fallback Model Changed the Answer

The primary provider returns a JSON response. When it fails and the gateway falls back, the secondary provider uses a different output structure. The application crashes parsing the response. The gateway logs show no error — the fallback succeeded — but the system produced a broken answer.

The fix: fallback chains must verify schema compatibility. If the fallback model cannot match the expected output format, the gateway should fail closed rather than returning an incompatible response.
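A fail-closed check might look like the sketch below, run on every response regardless of which provider produced it. The `SchemaMismatch` exception and the dictionary of required fields are assumptions; a real gateway might use a JSON Schema validator instead.

```python
import json


class SchemaMismatch(Exception):
    """The response does not match the shape the application expects."""


def validate_response(raw: str, required: dict) -> dict:
    """Parse JSON and check required keys and types; raise rather than pass bad data on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaMismatch(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaMismatch("expected a JSON object")
    for key, expected_type in required.items():
        if key not in data:
            raise SchemaMismatch(f"missing field: {key}")
        if not isinstance(data[key], expected_type):
            raise SchemaMismatch(f"wrong type for field: {key}")
    return data
```

Raising here turns the silent failure in the story into an explicit gateway error that appears in logs and metrics.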

The Budget Exhausted at 3 AM

A test script runs in a tight loop making LLM calls overnight. There is no per-feature rate limit. By morning, the team has burned through the entire monthly budget. The gateway correctly logged every call, but it never stopped any of them.

The fix: rate limits must have hard ceilings per feature, not just soft warnings. The gateway should pre-check budgets before accepting requests, especially for non-production traffic.
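A hard ceiling can be sketched as a per-feature daily budget that is checked before a request is accepted. The `BudgetGuard` name and in-memory storage are assumptions; a real gateway would likely keep these counters in a shared store. Defaulting unknown features to a zero budget is an assumed design choice so that an unregistered test script is blocked rather than billed.

```python
from collections import defaultdict
from datetime import date


class BudgetGuard:
    """Per-feature daily spend ceiling, enforced before the call is made."""

    def __init__(self, daily_limits_usd: dict, default_limit_usd: float = 0.0):
        self.limits = dict(daily_limits_usd)
        self.default = default_limit_usd  # unknown features get this limit
        self.spent = defaultdict(float)   # keyed by (feature, day)

    def allow(self, feature: str, estimated_cost_usd: float, day=None) -> bool:
        """Return True only if the estimated cost still fits under today's ceiling."""
        day = day or date.today()
        limit = self.limits.get(feature, self.default)
        return self.spent[(feature, day)] + estimated_cost_usd <= limit

    def record(self, feature: str, actual_cost_usd: float, day=None) -> None:
        """Add the real cost after the provider call completes."""
        day = day or date.today()
        self.spent[(feature, day)] += actual_cost_usd
```

Because spend is keyed by day, the counter resets naturally at midnight without a separate cleanup job.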

The Safety Filter Was Bypassed Via Fallback

A user sends a policy-violating prompt. The primary model refuses. The gateway treats the refusal as a failure and falls back to a secondary provider. The secondary provider processes the request and returns a harmful response.

The fix: safety refusals must never enter the fallback path. A refusal is not a failure. It is a successful enforcement.

The Latency SLA Was Violated By Retries

A chat feature needs responses within two seconds. The primary provider times out at one second. The gateway retries, waits another second, then falls back, and the response arrives at three seconds. The user has already closed the chat.

The fix: retries must respect latency classes. Interactive requests should fall back immediately. Batch requests can retry.
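One possible shape for this rule is a retry plan keyed by latency class, plus a deadline that caps every attempt. The retry counts, function names, and the `call(provider, timeout=...)` signature are assumptions for illustration.

```python
import time

# Assumed policy: interactive traffic gets no retries before falling back.
RETRIES_BEFORE_FALLBACK = {"interactive": 0, "batch": 2}


def plan_attempts(latency_class: str, providers: list) -> list:
    """Return the ordered provider names to try, primary retries first."""
    retries = RETRIES_BEFORE_FALLBACK[latency_class]
    primary, *fallbacks = providers
    return [primary] * (1 + retries) + fallbacks


def run_with_deadline(attempts, call, deadline_s, per_attempt_s, clock=time.monotonic):
    """Try attempts in order, never giving one more time than the deadline has left."""
    start = clock()
    for provider in attempts:
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            break
        try:
            return call(provider, timeout=min(per_attempt_s, remaining))
        except TimeoutError:
            continue
    raise TimeoutError("deadline exceeded before any provider answered")
```

With a two-second deadline and a one-second per-attempt timeout, an interactive request that times out on the primary still leaves the fallback a full second, instead of spending that second on a retry.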

Evaluating the Gateway

You cannot operate a gateway without monitoring its decisions and outcomes.

Routing Evaluation

Was the correct model selected for each task type?
How often does semantic routing misclassify an intent?
What is the cost impact of routing mistakes?

Fallback Evaluation

How often does each fallback activate?
Does the fallback response differ in quality from the primary?
Are there patterns in which requests trigger fallbacks?

Rate Limit Evaluation

How close is each tenant to their limit during peak hours?
Are rate limits being hit by legitimate traffic or abusive traffic?
What is the false positive rate of rate limiting?

Cost Evaluation

What is the cost per successful request per feature?
Which features have the worst cost-to-value ratio?
Are there opportunities to reduce context size without degrading quality?

✅

Debugging rule: If an LLM response is unexpectedly bad, first check which model the gateway selected and whether a fallback occurred. The generation model matters more than the prompt in many failures.

A Complete Gateway Request, End to End

Here is how a single request flows through the gateway:

This diagram is the blueprint for an LLM gateway. Every box is a decision point. Every decision point is a place to add policy, safety, observability, or cost control.
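The same flow can be sketched as an ordered list of stages, where any stage may reject the request and every decision is recorded in a trace. The stage names, their order, and the `ctx` dictionary layout are assumptions drawn from the steps described earlier, and the stages are reduced to stubs.

```python
def validate(ctx):
    if not ctx["request"].get("messages"):
        raise ValueError("empty messages")


def check_budget(ctx):
    if ctx["spent_usd"] >= ctx["budget_usd"]:
        raise PermissionError("budget exhausted")


def route(ctx):
    # Stub routing rule with hypothetical model aliases.
    is_simple = ctx["request"]["taskType"] == "classification"
    ctx["model"] = "small-fast" if is_simple else "high-quality"


def call_model(ctx):
    ctx["response"] = ctx["provider"](ctx["model"], ctx["request"]["messages"])


# Policy checks run before the model call so rejected requests cost nothing.
PIPELINE = [validate, check_budget, route, call_model]


def handle(ctx: dict) -> dict:
    """Run each stage in order, recording every decision in a trace."""
    ctx["trace"] = []
    for stage in PIPELINE:
        try:
            stage(ctx)
        except (ValueError, PermissionError) as exc:
            ctx["trace"].append((stage.__name__, f"rejected: {exc}"))
            return {"status": "rejected", "trace": ctx["trace"]}
        ctx["trace"].append((stage.__name__, "ok"))
    return {
        "status": "ok",
        "model": ctx["model"],
        "response": ctx["response"],
        "trace": ctx["trace"],
    }
```

The trace is what makes the debugging rule above practical: it shows which model was selected and where a request was stopped.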

What to Remember for Interviews

When explaining an LLM gateway, tell the story in order:

Centralize model access: A gateway gives consistency across teams and features. Without it, every team reimplements authentication, retries, and logging.
Route by task: Use cheaper models for simple tasks and stronger models for hard tasks. The routing decision is the most impactful cost lever.
Fallback carefully: Availability fallback must not bypass safety or break schemas. A refusal is not a failure.
Rate-limit tokens, not just requests: Token volume drives cost and throughput. Hard ceilings prevent surprise bills.
Instrument cost and quality together: Cheap but bad answers are not optimization. Track both spend and outcome.
The gateway is a policy engine, not a proxy: Every request goes through validation, routing, rate limiting, and observability before it reaches a model.

✅

Practice: Design an LLM gateway for three product teams. Include model routing, tenant budgets, streaming, fallback policy, logging, and schema validation. Then trace a request through the full pipeline and explain what happens at each step.
