LLM Gateway and Routing: Model Selection, Fallbacks, and Cost Control
Design the gateway layer between applications and LLM providers, including model routing, provider fallback, rate limiting, semantic routing, observability, and cost tracking.
Start With a Mess of Direct Integrations
Imagine your product has five features that call LLMs:
- A support chat that answers customer tickets
- A content summarizer that condenses weekly analytics
- A code generator embedded in your internal CLI
- A moderation service that flags unsafe user posts
- A translation pipeline for multilingual docs
Each team picks a different provider. The chat team sets up OpenAI directly. The summarizer team uses Anthropic. The code generator calls a self-hosted model. The moderation team tries three providers and picks the cheapest. The translation team routes through a friend's side project.
Every integration duplicates the same plumbing: authentication, retries, rate limits, logging, error handling, and cost tracking. When OpenAI goes down, four teams scramble independently. When the monthly invoice arrives, nobody can explain which feature drove the spend. When a new intern writes a prompt that extracts PII and sends it to a model, nobody catches it.
An LLM gateway exists to prevent this mess.
Mental model: The gateway is not just a proxy. It is a policy, routing, observability, and control plane for model usage. It separates what the application wants from how the LLM delivers it.
What an LLM Gateway Means
An LLM gateway sits between your product services and the language model providers. Every LLM call goes through the gateway, not directly to OpenAI, Anthropic, or a self-hosted endpoint.
The gateway decides:
- Which model should handle this request
- Whether the request is safe and policy-compliant
- How fast to respond and what to do if the provider stalls
- How much it costs and who to bill it to
- What to log for debugging and compliance
Without a gateway, every service makes these decisions independently. With a gateway, they are centralized, measurable, and changeable without touching product code.
Why Not Just Call Providers Directly from Each Service?
Beginners often ask: "Why add another layer? Each team can manage their own integration."
Direct calls from each service work for one model and one team. They break at five models and five teams.
Here is what you lose without a gateway:
| Concern | Without Gateway | With Gateway |
|---|---|---|
| Provider key rotation | Every service redeploys | One config change |
| Model upgrade | Each team updates independently | One routing rule |
| Provider outage | Each team scrambles | Automatic fallback |
| Cost attribution | Manual spreadsheet | Tagged per request |
| Prompt safety audit | Impossible to enforce | Centralized policy |
| Rate limit coordination | Each service hits quota alone | Shared budget pool |
Important distinction: A gateway is not the same as an API proxy. A proxy forwards requests. A gateway routes, enriches, validates, observes, and enforces policy. The difference matters when you need to answer a compliance audit or debug a bad answer.
The Request Flow Through the Gateway
Every LLM request goes through the same pipeline inside the gateway.
The trick is in what happens at each step. The gateway is not just forwarding bytes. It is making decisions about policy, safety, cost, and reliability while the request is in flight.
How the Application Talks to the Gateway
The application describes its intent. The gateway decides how to fulfill it.
{
  "feature": "support-chat",
  "tenantId": "acme",
  "taskType": "rag_answer",
  "latencyClass": "interactive",
  "qualityTier": "high",
  "messages": [
    {"role": "system", "content": "You are a support assistant..."},
    {"role": "user", "content": "Can I pause my annual subscription?"}
  ]
}
The application should not say "use GPT-4." It should say "this is a support question that needs high quality." The gateway maps that intent to the appropriate model, provider, and policy.
This separation is what makes the gateway powerful. You can change the underlying model without changing application code.
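A minimal sketch of how a gateway might validate this intent envelope before doing anything else. The class name `GatewayRequest`, the function `parse_request`, and the allowed value sets are assumptions for illustration, not part of any particular gateway product.

```python
from dataclasses import dataclass

# Allowed values are illustrative assumptions, not a standard.
LATENCY_CLASSES = {"interactive", "batch"}
QUALITY_TIERS = {"low", "standard", "high"}


@dataclass(frozen=True)
class GatewayRequest:
    feature: str
    tenant_id: str
    task_type: str
    latency_class: str
    quality_tier: str
    messages: tuple


def parse_request(payload: dict) -> GatewayRequest:
    """Validate the intent envelope and reject payloads that name a model."""
    if "model" in payload:
        raise ValueError("callers describe intent; the gateway picks the model")
    if payload["latencyClass"] not in LATENCY_CLASSES:
        raise ValueError(f"unknown latencyClass: {payload['latencyClass']}")
    if payload["qualityTier"] not in QUALITY_TIERS:
        raise ValueError(f"unknown qualityTier: {payload['qualityTier']}")
    return GatewayRequest(
        feature=payload["feature"],
        tenant_id=payload["tenantId"],
        task_type=payload["taskType"],
        latency_class=payload["latencyClass"],
        quality_tier=payload["qualityTier"],
        messages=tuple(payload["messages"]),
    )
```

Rejecting a caller-supplied `model` field is one way to keep the intent contract honest; a real gateway might instead allow it behind an explicit override flag.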
Step 1: Model Routing
Not every task needs the most expensive model. Some requests are trivial. Some need deep reasoning. Some need structured output. Some need speed above all else.
The gateway must decide which model handles each request.
Routing by Task Type
| Task | Routing Choice | Why |
|---|---|---|
| Simple classification | Small fast model | Cheap, fast, good enough |
| JSON extraction | Model with strong structured output | Schema adherence matters |
| Legal or financial reasoning | Higher quality model plus guardrails | Error cost is high |
| RAG answer with citations | Strong instruction following | Citation accuracy matters |
| Embedding | Embedding-specific model | Chat models cost more and are not built to emit vectors |
| Bulk offline summarization | Cheaper batch path | Latency is irrelevant |
| Code generation | Code-tuned model | General models produce worse code |
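The table above can be expressed as explicit configuration. The sketch below assumes hypothetical model labels (`small-fast`, `high-quality`, and so on) and a hypothetical rule that unknown task types default to the stronger model.

```python
# Hypothetical model labels; substitute whatever your providers offer.
ROUTES = {
    "classification": "small-fast",
    "json_extraction": "structured-output",
    "legal_reasoning": "high-quality",
    "rag_answer": "instruction-following",
    "embedding": "embedding",
    "bulk_summarization": "batch-cheap",
    "code_generation": "code-tuned",
}

# Assumption: an unknown task is treated as high-risk and gets the strong model.
DEFAULT_MODEL = "high-quality"


def route(task_type: str, quality_tier: str = "standard") -> str:
    """Map a task type and quality tier to a model label."""
    model = ROUTES.get(task_type, DEFAULT_MODEL)
    # Assumption: callers that ask for high quality never get the small model.
    if quality_tier == "high" and model == "small-fast":
        return DEFAULT_MODEL
    return model
```

Keeping the mapping in a plain dictionary means a model upgrade is a one-line config change rather than a code change in every service.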
Semantic Routing
Some decisions cannot be made with a simple if-else on a feature flag. Semantic routing embeds or classifies the request to choose the model, prompt, or tool chain.
For example, a help desk receives queries about billing, account security, technical issues, and sales. Each intent needs a different prompt, different model, and different data sources.
Use semantic routing for intent families — groups of requests that need different processing. Do not use it to hide business logic that should be explicit configuration.
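As a rough illustration of the help desk example, the sketch below classifies a query by similarity to a short description of each intent. It uses bag-of-words vectors as a stand-in for a real embedding model; the intent descriptions, the `classify_intent` name, and the threshold are all assumptions.

```python
import math
from collections import Counter

# Illustrative keyword descriptions for each intent family.
INTENT_EXAMPLES = {
    "billing": "invoice charge refund payment subscription price",
    "account_security": "password login locked two-factor breach reset",
    "technical": "error crash bug timeout install api",
    "sales": "demo pricing plan upgrade quote enterprise",
}


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify_intent(query: str, threshold: float = 0.1) -> str:
    """Return the best-matching intent, or "unknown" below the threshold."""
    q = embed(query)
    scores = {name: cosine(q, embed(ex)) for name, ex in INTENT_EXAMPLES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"
```

The `unknown` result matters as much as the matches: a low-confidence query should go to a safe default path rather than the nearest guess.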
Step 2: Provider Fallback
Providers go down. They throttle. They time out. A gateway must handle failure without every application team implementing its own retry logic.
Fallback improves availability, but it introduces subtle risks. A secondary model may have a different output style, different context limit, different tool support, or different safety behavior. Testing the fallback path is as important as testing the primary path.
Fallback Chain
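A fallback chain tries providers in a fixed order and moves on only for transient failures. The sketch below is one possible shape, assuming two hypothetical exception types and providers passed in as `(name, callable)` pairs.

```python
class ProviderError(Exception):
    """Transient provider failure (5xx, timeout, throttling)."""


class SafetyRefusal(Exception):
    """The model or a filter declined the request on policy grounds."""


def call_with_fallback(providers, request):
    """Try each (name, call) pair in order; return (name, response)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(request)
        except SafetyRefusal:
            # Hard stop: a refusal is never rerouted to another provider.
            raise
        except ProviderError as exc:
            errors.append((name, str(exc)))
    raise ProviderError(f"all providers failed: {errors}")
```

Returning the provider name alongside the response lets the caller log which link in the chain actually answered.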
Fallback Policy Decisions
| Failure | Possible Action | Risk |
|---|---|---|
| Provider 5xx | Retry once, then fall back | Latency spike on retry |
| Timeout | Fall back if interactive; retry if batch | Wrong model on fallback |
| Rate limit | Queue, fall back, or reject based on feature | Queued requests may go stale |
| Safety refusal | Do not fall back blindly | Policy bypass |
| Schema validation failure | Retry with repair prompt, then fail closed | May surface bad data |
| Content filter trigger | Log and reject | Regulatory exposure |
Do not fall back around safety: If a model refuses or flags unsafe content, sending the same request to another provider can bypass policy unintentionally. A safety refusal should be a hard stop, not a reroute.
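The policy table can be written as a small decision function. This is a sketch under assumptions: the `Failure` enum, the action strings, and the choice to resolve rate limits by latency class (rather than by feature) are all illustrative.

```python
from enum import Enum


class Failure(Enum):
    PROVIDER_5XX = "provider_5xx"
    TIMEOUT = "timeout"
    RATE_LIMIT = "rate_limit"
    SAFETY_REFUSAL = "safety_refusal"
    SCHEMA_INVALID = "schema_invalid"
    CONTENT_FILTER = "content_filter"


def next_action(failure: Failure, latency_class: str, attempt: int) -> str:
    """Return the next step; `attempt` counts prior tries, starting at 0."""
    # Safety outcomes are hard stops and never reach the fallback branches.
    if failure in (Failure.SAFETY_REFUSAL, Failure.CONTENT_FILTER):
        return "reject"
    if failure is Failure.PROVIDER_5XX:
        return "retry" if attempt == 0 else "fallback"
    if failure is Failure.SCHEMA_INVALID:
        return "repair_retry" if attempt == 0 else "fail_closed"
    interactive = latency_class == "interactive"
    if failure is Failure.TIMEOUT:
        return "fallback" if interactive else "retry"
    # Remaining case: RATE_LIMIT.
    return "fallback" if interactive else "queue"
```

Checking the safety cases first makes it structurally impossible for a later branch to reroute a refusal.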
Step 3: Rate Limiting and Budgets
LLM calls cost money and consume provider quota. Without rate limits, a single runaway feature or abusive user can exhaust the monthly budget in hours.
Rate limits must work at multiple levels simultaneously:
| Limit Type | What It Protects | Example |
|---|---|---|
| Requests per minute | Provider quota and service stability | 100 RPM per tenant |
| Tokens per minute | Cost and throughput | 500K TPM per feature |
| Daily budget | Finance controls | $500/day per team |
| Concurrent streams | Connection capacity | 10 concurrent per user |
| Feature-level quota | Product fairness | 50 requests/day for free tier |
Budget enforcement should happen before expensive operations. If a request is going to be rate-limited, reject it before running retrieval, pre-processing, or embedding. Otherwise you pay for work that never reaches the model.
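One common way to enforce per-minute limits is a token bucket. The sketch below assumes a hypothetical `admit` function that checks both a requests-per-minute bucket and a tokens-per-minute bucket before any retrieval or embedding work starts; the injectable clock exists only to make the behavior testable.

```python
import time


class TokenBucket:
    """Refills continuously, up to `capacity` units per `period` seconds."""

    def __init__(self, capacity, period=60.0, clock=time.monotonic):
        self.capacity = capacity
        self.rate = capacity / period
        self.clock = clock
        self.available = float(capacity)
        self.updated = clock()

    def try_consume(self, amount):
        now = self.clock()
        refill = (now - self.updated) * self.rate
        self.available = min(self.capacity, self.available + refill)
        self.updated = now
        if amount > self.available:
            return False
        self.available -= amount
        return True


def admit(rpm: TokenBucket, tpm: TokenBucket, estimated_tokens: int) -> bool:
    """Admit a request only if both the request and token limits allow it."""
    if not rpm.try_consume(1):
        return False
    if not tpm.try_consume(estimated_tokens):
        rpm.available += 1  # refund the request slot taken above
        return False
    return True
```

The token estimate has to come from something cheap, such as prompt length, because the point is to reject before paying for any downstream work.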
Step 4: Cost Tracking
If you cannot attribute LLM costs, you cannot optimize them. Every request through the gateway should be tagged with enough dimensions to answer: "Where did our LLM budget go this month?"
What to Track
| Dimension | Why It Matters |
|---|---|
| Model | The single biggest cost driver |
| Feature | Support chat vs report generator have different ROI |
| Tenant | Enterprise customer usage for billing |
| User | Abuse detection and fairness enforcement |
| Prompt version | Cost impact of prompt changes and experiments |
| Retrieval strategy | RAG context size directly affects token cost |
| Provider | Different providers have different per-token prices |
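A sketch of what tagging might look like in practice: each request produces one record carrying the dimensions above, and any dimension can then be used to slice spend. The `UsageRecord` fields and the `spend_by` helper are assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    model: str
    feature: str
    tenant: str
    user: str
    prompt_version: str
    retrieval_strategy: str
    provider: str
    cost_usd: float


def spend_by(records, dimension: str) -> dict:
    """Total cost grouped by one dimension, e.g. "feature" or "tenant"."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, dimension)] += record.cost_usd
    return dict(totals)
```

Calling `spend_by(records, "feature")` answers the "where did the budget go" question per feature; swapping in `"tenant"` or `"prompt_version"` answers it for billing or experiments.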
Cost Formula
Every request has a predictable cost:
request_cost =
    input_tokens * input_price_per_token
  + output_tokens * output_price_per_token
  + retrieval_cost
  + reranking_cost
The first two terms usually dominate. If a RAG request uses 4,000 input tokens and generates 500 output tokens on a model that costs $10 per million input tokens, the input cost alone is $0.04 per request (4,000 × $10 / 1,000,000); output tokens are billed on top at the model's output rate. Spread across millions of requests, small differences in context size become large differences in monthly spend.
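The formula translates directly into code. In the sketch below, prices are expressed per million tokens because that is how providers typically quote them; the $30 per million output price in the usage note is an assumed figure, since the text only gives the input price.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
    retrieval_cost: float = 0.0,
    reranking_cost: float = 0.0,
) -> float:
    """Cost of one request in dollars, following the formula above."""
    return (
        input_tokens * input_price_per_million / 1_000_000
        + output_tokens * output_price_per_million / 1_000_000
        + retrieval_cost
        + reranking_cost
    )
```

With the numbers from the text, `request_cost(4000, 0, 10, 0)` gives the $0.04 input cost. Under the assumed $30 output price, `request_cost(4000, 500, 10, 30)` comes to about $0.055. Trimming the context from 4,000 to 3,000 tokens saves $0.01 per request, which is $10,000 across one million requests.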
Cost dashboards should show both absolute spend and value metrics. A feature that costs $5000 a month but resolves 10,000 support tickets may be more valuable than a feature that costs $200 a month but solves nothing.
Step 5: Gateway State and Data
The gateway sits in the critical path of every request. It collects data that is useful for operations, debugging, compliance, and capacity planning.
| Data | Why Keep It | Retention Guidance |
|---|---|---|
| Raw prompts and responses | Debug bad answers, audit safety | Redact or sample if sensitive |
| Token counts | Cost and capacity planning | Keep aggregated |
| Model routing decisions | Verify routing logic | Keep with request trace ID |
| Provider errors | Reliability analysis | Keep until resolved |
| Safety classifications | Compliance and policy audit | Keep according to policy |
Privacy matters: Gateway logs can contain sensitive user data. Redaction, retention, access control, and audit logs are part of the gateway design, not an afterthought.
The gateway should also expose health metrics for itself:
- Current request rate and concurrency
- Provider error rates (per provider and per error type)
- Fallback activation rate
- Rate limit hit rate
- P50, P95, P99 latency per model and per feature
- Cost per hour, per feature, per tenant
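Two of these metrics are simple to compute from raw request records. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions; the function names are assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile of `samples` for p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def fallback_rate(used_fallback_flags):
    """Fraction of requests answered by a fallback provider."""
    flags = list(used_fallback_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

In production these would normally come from a metrics system rather than in-process lists, but the definitions are the same.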
Common Failure Stories
The Wrong Model Was Selected
A billing question about enterprise contracts is routed to a small, cheap model. The model hallucinates a discount policy because it lacks the reasoning depth to interpret the contract terms. The routing classification was too aggressive about cost savings.
The fix: routing rules must consider the cost of a wrong answer, not just the cost of the call. Tasks where errors are expensive should always use a higher-quality model or include a verification step.
The Fallback Model Changed the Answer
The primary provider returns a JSON response. When it fails and the gateway falls back, the secondary provider uses a different output structure. The application crashes parsing the response. The gateway logs show no error — the fallback succeeded — but the system produced a broken answer.
The fix: fallback chains must verify schema compatibility. If the fallback model cannot match the expected output format, the gateway should fail closed rather than returning an incompatible response.
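A minimal sketch of failing closed: the gateway validates every response, primary or fallback, against the shape the application expects and raises instead of passing through a mismatch. The required fields here are hypothetical, and a real gateway would more likely use a full JSON Schema validator.

```python
import json


class SchemaMismatch(Exception):
    """The model's output does not match the contract the caller expects."""


# Hypothetical contract for a RAG answer.
REQUIRED_FIELDS = {"answer": str, "citations": list}


def validate_response(raw: str) -> dict:
    """Parse and check a model response; raise rather than return bad data."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaMismatch(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaMismatch("expected a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise SchemaMismatch(
                f"field {field!r} missing or not {expected.__name__}"
            )
    return data
```

Because the check raises, a schema failure shows up in gateway error logs instead of surfacing later as a crash in the application's parser.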
The Budget Was Exhausted at 3 AM
A test script runs in a tight loop making LLM calls overnight. There is no per-feature rate limit. By morning, the team has burned through the entire monthly budget. The gateway correctly logged every call, but it never stopped any of them.
The fix: rate limits must have hard ceilings per feature, not just soft warnings. The gateway should pre-check budgets before accepting requests, especially for non-production traffic.
The Safety Filter Was Bypassed Via Fallback
A user sends a policy-violating prompt. The primary model refuses. The gateway treats the refusal as a failure and falls back to a secondary provider. The secondary provider processes the request and returns a harmful response.
The fix: safety refusals must never enter fallback logic. A refusal is not a failure. It is a successful enforcement.
The Latency SLA Was Violated By Retries
A chat feature needs responses within two seconds. The primary provider times out at one second. The gateway retries, waits another second, then falls back, and the response arrives at three seconds. The user has already closed the chat.
The fix: retries must respect latency classes. Interactive requests should fall back immediately. Batch requests can retry.
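One way to make retries respect a latency class is to give every request a deadline and decide the next step from the time remaining. The function name, the action strings, and the rule that batch requests fall back once a full retry no longer fits are assumptions in this sketch.

```python
def after_timeout(latency_class, elapsed_s, deadline_s, attempt_timeout_s):
    """Return (action, timeout_s) for the next attempt after a provider timeout."""
    remaining = deadline_s - elapsed_s
    if remaining <= 0:
        return ("fail", 0.0)
    if latency_class == "interactive":
        # Skip the retry; spend what is left of the budget on the fallback.
        return ("fallback", min(attempt_timeout_s, remaining))
    if remaining >= attempt_timeout_s:
        return ("retry", attempt_timeout_s)
    return ("fallback", remaining)
```

Applied to the story above, `after_timeout("interactive", 1.0, 2.0, 1.0)` returns `("fallback", 1.0)`: the fallback gets the remaining second, so the response can still land inside the two-second target.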
Evaluating the Gateway
You cannot operate a gateway without monitoring its decisions and outcomes.
Routing Evaluation
- Was the correct model selected for each task type?
- How often does semantic routing misclassify an intent?
- What is the cost impact of routing mistakes?
Fallback Evaluation
- How often does each fallback activate?
- Does the fallback response differ in quality from the primary?
- Are there patterns in which requests trigger fallbacks?
Rate Limit Evaluation
- How close is each tenant to their limit during peak hours?
- Are rate limits being hit by legitimate traffic or abusive traffic?
- What is the false positive rate of rate limiting?
Cost Evaluation
- What is the cost per successful request per feature?
- Which features have the worst cost-to-value ratio?
- Are there opportunities to reduce context size without degrading quality?
Debugging rule: If an LLM response is unexpectedly bad, first check which model the gateway selected and whether a fallback occurred. The generation model matters more than the prompt in many failures.
A Complete Gateway Request, End to End
Here is how a single request flows through the gateway:
This diagram is the blueprint for an LLM gateway. Every box is a decision point. Every decision point is a place to add policy, safety, observability, or cost control.
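The same flow can be sketched as a single function. The ordering below (validate, check budget, route, call, log) is one plausible arrangement, and the injected `route`, `call_model`, and `log` callables are placeholders for the real components.

```python
def handle(request, *, budget_remaining_usd, route, call_model, log):
    """Run one request through a simplified gateway pipeline."""
    trace = {"feature": request.get("feature"), "tenant": request.get("tenantId")}

    # 1. Validation: reject malformed requests before spending anything.
    if not request.get("messages"):
        trace["outcome"] = "rejected:invalid"
        log(trace)
        return None

    # 2. Budget check: runs before routing or any model call.
    if budget_remaining_usd <= 0:
        trace["outcome"] = "rejected:budget"
        log(trace)
        return None

    # 3. Routing: map the declared intent to a model.
    model = route(request["taskType"])
    trace["model"] = model

    # 4. Model call (fallback and schema checks would wrap this step).
    response = call_model(model, request["messages"])

    # 5. Observability: record the decision and the outcome.
    trace["outcome"] = "ok"
    log(trace)
    return response
```

Every early return still writes a trace, so rejected requests remain visible in the same logs as successful ones.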
What to Remember for Interviews
When explaining an LLM gateway, tell the story in order:
- Centralize model access: A gateway gives consistency across teams and features. Without it, every team reimplements authentication, retries, and logging.
- Route by task: Use cheaper models for simple tasks and stronger models for hard tasks. The routing decision is the most impactful cost lever.
- Fallback carefully: Availability fallback must not bypass safety or break schemas. A refusal is not a failure.
- Rate-limit tokens, not just requests: Token volume drives cost and throughput. Hard ceilings prevent surprise bills.
- Instrument cost and quality together: Cheap but bad answers are not optimization. Track both spend and outcome.
- The gateway is a policy engine, not a proxy: Every request goes through validation, routing, rate limiting, and observability before it reaches a model.
Practice: Design an LLM gateway for three product teams. Include model routing, tenant budgets, streaming, fallback policy, logging, and schema validation. Then trace a request through the full pipeline and explain what happens at each step.