Gen AI Systems

LLM Gateway and Routing: Model Selection, Fallbacks, and Cost Control

Design the gateway layer between applications and LLM providers, including model routing, provider fallback, rate limiting, semantic routing, observability, and cost tracking.

Tags: LLM gateway, model routing, fallback, cost optimization, rate limiting

Start With a Mess of Direct Integrations

Imagine your product has five features that call LLMs:

  • A support chat that answers customer tickets
  • A content summarizer that condenses weekly analytics
  • A code generator embedded in your internal CLI
  • A moderation service that flags unsafe user posts
  • A translation pipeline for multilingual docs

Each team picks a different provider. The chat team sets up OpenAI directly. The summarizer team uses Anthropic. The code generator calls a self-hosted model. The moderation team tries three providers and picks the cheapest. The translation team routes through a friend's side project.

Every integration duplicates the same plumbing: authentication, retries, rate limits, logging, error handling, and cost tracking. When OpenAI goes down, four teams scramble independently. When the monthly invoice arrives, nobody can explain which feature drove the spend. When a new intern writes a prompt that extracts PII and sends it to a model, nobody catches it.

An LLM gateway exists to prevent this mess.

Mental model: The gateway is not just a proxy. It is a policy, routing, observability, and control plane for model usage. It separates what the application wants from how the LLM delivers it.


What an LLM Gateway Means

An LLM gateway sits between your product services and the language model providers. Every LLM call goes through the gateway, not directly to OpenAI, Anthropic, or a self-hosted endpoint.

The gateway decides:

  • Which model should handle this request
  • Whether the request is safe and policy-compliant
  • How fast to respond and what to do if the provider stalls
  • How much it costs and who to bill it to
  • What to log for debugging and compliance

Without a gateway, every service makes these decisions independently. With a gateway, they are centralized, measurable, and changeable without touching product code.


Why Not Just Call Providers Directly from Each Service?

Beginners often ask: "Why add another layer? Each team can manage their own integration."

Direct calls from each service work for one model and one team. They break at five models and five teams.

Here is what you lose without a gateway:

| Concern | Without Gateway | With Gateway |
|---|---|---|
| Provider key rotation | Every service redeploys | One config change |
| Model upgrade | Each team updates independently | One routing rule |
| Provider outage | Each team scrambles | Automatic fallback |
| Cost attribution | Manual spreadsheet | Tagged per request |
| Prompt safety audit | Impossible to enforce | Centralized policy |
| Rate limit coordination | Each service hits quota alone | Shared budget pool |

Important distinction: A gateway is not the same as an API proxy. A proxy forwards requests. A gateway routes, enriches, validates, observes, and enforces policy. The difference matters when you need to explain a compliance audit or debug a bad answer.


The Request Flow Through the Gateway

Every LLM request goes through the same pipeline inside the gateway.

The trick is in what happens at each step. The gateway is not just forwarding bytes. It is making decisions about policy, safety, cost, and reliability while the request is in flight.

How the Application Talks to the Gateway

The application describes its intent. The gateway decides how to fulfill it.

json
{
  "feature": "support-chat",
  "tenantId": "acme",
  "taskType": "rag_answer",
  "latencyClass": "interactive",
  "qualityTier": "high",
  "messages": [
    {"role": "system", "content": "You are a support assistant..."},
    {"role": "user", "content": "Can I pause my annual subscription?"}
  ]
}

The application should not say "use GPT-4." It should say "this is a support question that needs high quality." The gateway maps that intent to the appropriate model, provider, and policy.

This separation is what makes the gateway powerful. You can change the underlying model without changing application code.
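To make the separation concrete, here is a minimal sketch of how a gateway might accept the intent payload above and resolve it to a model. The `GatewayRequest` type, the `MODEL_BY_INTENT` table, and the model aliases are hypothetical names chosen for illustration, not part of any real gateway API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewayRequest:
    feature: str
    tenant_id: str
    task_type: str
    latency_class: str
    quality_tier: str
    messages: list


# Hypothetical routing config owned by the gateway, not by the application.
# The aliases are placeholders, not real model names.
MODEL_BY_INTENT = {
    ("rag_answer", "high"): "large-model-v1",
    ("rag_answer", "standard"): "small-model-v1",
}


def parse_request(payload: dict) -> GatewayRequest:
    """Read the intent fields and reject requests that try to pin a model."""
    if "model" in payload:
        raise ValueError("describe intent; the gateway picks the model")
    return GatewayRequest(
        feature=payload["feature"],
        tenant_id=payload["tenantId"],
        task_type=payload["taskType"],
        latency_class=payload["latencyClass"],
        quality_tier=payload["qualityTier"],
        messages=payload["messages"],
    )


def resolve_model(request: GatewayRequest) -> str:
    """Map (task type, quality tier) to a model alias from gateway config."""
    return MODEL_BY_INTENT[(request.task_type, request.quality_tier)]
```

Swapping the model behind `("rag_answer", "high")` is then a one-line config change in the gateway, and the support chat code never notices.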


Step 1: Model Routing

Not every task needs the most expensive model. Some requests are trivial. Some need deep reasoning. Some need structured output. Some need speed above all else.

The gateway must decide which model handles each request.

Routing by Task Type

| Task | Routing Choice | Why |
|---|---|---|
| Simple classification | Small fast model | Cheap, fast, good enough |
| JSON extraction | Model with strong structured output | Schema adherence matters |
| Legal or financial reasoning | Higher quality model plus guardrails | Error cost is high |
| RAG answer with citations | Strong instruction following | Citation accuracy matters |
| Embedding | Embedding-specific model | General models waste tokens |
| Bulk offline summarization | Cheaper batch path | Latency is irrelevant |
| Code generation | Code-tuned model | General models produce worse code |
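The table above can be expressed as explicit configuration. This is a rough sketch: the task keys, the `Route` type, and the model aliases are invented for illustration. Sending unknown task types to the stronger, guarded route is an assumption on our part, chosen so that routing mistakes err on the side of quality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str                # placeholder alias, not a real model name
    guardrails: bool = False  # run extra checks on the output
    batch: bool = False       # use the cheaper offline path


ROUTES = {
    "classification": Route("small-fast"),
    "json_extraction": Route("structured-output"),
    "legal_reasoning": Route("high-quality", guardrails=True),
    "rag_answer": Route("instruction-following"),
    "embedding": Route("embedding"),
    "bulk_summarization": Route("cheap-batch", batch=True),
    "code_generation": Route("code-tuned"),
}

# Assumption: unknown task types get the safe, expensive route.
DEFAULT_ROUTE = Route("high-quality", guardrails=True)


def route_for(task_type: str) -> Route:
    """Look up the route for a task type, falling back to the default."""
    return ROUTES.get(task_type, DEFAULT_ROUTE)
```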

Semantic Routing

Some decisions cannot be made with a simple if-else on a feature flag. Semantic routing embeds or classifies the request to choose the model, prompt, or tool chain.

For example, a help desk receives queries about billing, account security, technical issues, and sales. Each intent needs a different prompt, different model, and different data sources.

Use semantic routing for intent families — groups of requests that need different processing. Do not use it to hide business logic that should be explicit configuration.
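Here is a toy sketch of semantic routing for the help desk example. A real gateway would call an embedding model or a trained classifier; this version stands in a bag-of-words vector and cosine similarity so it runs with only the standard library. The intent names, example phrases, and threshold are all assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical example phrases per intent family.
INTENT_EXAMPLES = {
    "billing": "invoice charge refund payment subscription price",
    "account_security": "password login locked two factor reset breach",
    "technical": "error crash bug timeout api install",
    "sales": "demo pricing plan upgrade quote enterprise",
}


def embed(text: str) -> Counter:
    """Stand-in for a real embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_intent(query: str, threshold: float = 0.1) -> str:
    """Return the closest intent, or 'unknown' when nothing is close enough."""
    query_vec = embed(query)
    scored = ((intent, cosine(query_vec, embed(example)))
              for intent, example in INTENT_EXAMPLES.items())
    best_intent, best_score = max(scored, key=lambda pair: pair[1])
    return best_intent if best_score >= threshold else "unknown"
```

The `unknown` branch matters as much as the matches: a query that fits no intent family should go to an explicit default route rather than to whichever intent happened to score highest.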


Step 2: Provider Fallback

Providers go down. They throttle. They time out. A gateway must handle failure without every application team implementing their own retry logic.

Fallback improves availability, but it introduces subtle risks. A secondary model may have a different output style, different context limit, different tool support, or different safety behavior. Testing the fallback path is as important as testing the primary path.

Fallback Chain
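
A fallback chain is an ordered list of providers that the gateway tries in turn. The sketch below assumes each provider is a plain callable and that availability failures surface as a `ProviderError`; both are hypothetical simplifications, and the stub providers exist only to make the example runnable.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error for a provider 5xx, timeout, or throttle."""


Provider = Tuple[str, Callable[[str], str]]


def call_with_fallback(prompt: str, chain: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, answer)."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Record the failure so the trace shows why fallback happened.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real SDK calls.
def flaky_primary(prompt: str) -> str:
    raise ProviderError("503 from upstream")


def stable_secondary(prompt: str) -> str:
    return f"answer to: {prompt}"


served_by, answer = call_with_fallback(
    "Can I pause my plan?",
    [("primary", flaky_primary), ("secondary", stable_secondary)],
)
```

Only `ProviderError` triggers the next provider. Any other exception, including one that represents a safety refusal, propagates to the caller instead of being rerouted.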

Fallback Policy Decisions

| Failure | Possible Action | Risk |
|---|---|---|
| Provider 5xx | Retry once, then fall back | Latency spike on retry |
| Timeout | Fall back if interactive; retry if batch | Wrong model on fallback |
| Rate limit | Queue, fall back, or reject based on feature | Queued requests may go stale |
| Safety refusal | Do not fall back blindly | Policy bypass |
| Schema validation failure | Retry with repair prompt, then fail closed | May surface bad data |
| Content filter trigger | Log and reject | Regulatory exposure |

Do not fall back around safety: If a model refuses or flags unsafe content, sending the same request to another provider can bypass policy unintentionally. Safety refusal should be a hard stop, not a reroute.
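
The policy table can be written as one decision function, which keeps the safety rule visible in a single place. This is a sketch, not a complete policy: the enum names are invented, and choosing between queue and fallback by latency class is our assumption (the table says only "based on feature").

```python
from enum import Enum, auto


class Failure(Enum):
    PROVIDER_5XX = auto()
    TIMEOUT = auto()
    RATE_LIMIT = auto()
    SAFETY_REFUSAL = auto()
    SCHEMA_INVALID = auto()
    CONTENT_FILTER = auto()


class Action(Enum):
    RETRY = auto()
    FALLBACK = auto()
    QUEUE = auto()
    REPAIR = auto()   # retry with a repair prompt
    REJECT = auto()   # fail closed, log, and return an error


def decide(failure: Failure, latency_class: str, attempt: int) -> Action:
    """Pick the next action after a failure. `attempt` counts prior retries."""
    # Safety outcomes are hard stops: never reroute them to another provider.
    if failure in (Failure.SAFETY_REFUSAL, Failure.CONTENT_FILTER):
        return Action.REJECT
    if failure is Failure.PROVIDER_5XX:
        return Action.RETRY if attempt == 0 else Action.FALLBACK
    if failure is Failure.TIMEOUT:
        return Action.FALLBACK if latency_class == "interactive" else Action.RETRY
    if failure is Failure.RATE_LIMIT:
        # Assumption: interactive traffic falls back, batch traffic queues.
        return Action.FALLBACK if latency_class == "interactive" else Action.QUEUE
    # Schema validation failure: one repair attempt, then fail closed.
    return Action.REPAIR if attempt == 0 else Action.REJECT
```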


Step 3: Rate Limiting and Budgets

LLM calls cost money and consume provider quota. Without rate limits, a single runaway feature or abusive user can exhaust the monthly budget in hours.

Rate limits must work at multiple levels simultaneously:

| Limit Type | What It Protects | Example |
|---|---|---|
| Requests per minute | Provider quota and service stability | 100 RPM per tenant |
| Tokens per minute | Cost and throughput | 500K TPM per feature |
| Daily budget | Finance controls | $500/day per team |
| Concurrent streams | Connection capacity | 10 concurrent per user |
| Feature-level quota | Product fairness | 50 requests/day for free tier |

Budget enforcement should happen before expensive operations. If a request is going to be rate-limited, reject it before running retrieval, pre-processing, or embedding. Otherwise you pay for work that never reaches the model.
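
One common way to enforce a tokens-per-minute limit is a token bucket, checked together with the daily budget before any retrieval or embedding work starts. The sketch below is a single-process illustration with an injectable clock; a real gateway would need shared state across instances, and the `admit` helper and its return strings are hypothetical.

```python
import time


class TokenBucket:
    """Refills continuously up to `capacity`; one bucket per tenant or feature."""

    def __init__(self, capacity: float, refill_per_sec: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.now = now
        self.level = float(capacity)
        self.updated = now()

    def try_consume(self, amount: float) -> bool:
        current = self.now()
        elapsed = current - self.updated
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_sec)
        self.updated = current
        if amount > self.level:
            return False
        self.level -= amount
        return True


def admit(estimated_tokens: int, estimated_cost: float, bucket: TokenBucket,
          spent_today: float, daily_budget: float) -> str:
    """Run the cheap checks first, before retrieval or any model call."""
    if spent_today + estimated_cost > daily_budget:
        return "reject: daily budget"
    if not bucket.try_consume(estimated_tokens):
        return "reject: token rate"
    return "admit"
```

Both checks use estimates because the true output token count is unknown until the model responds. Reconciling the estimate against actual usage afterward is left out of this sketch.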


Step 4: Cost Tracking

If you cannot attribute LLM costs, you cannot optimize them. Every request through the gateway should be tagged with enough dimensions to answer: "Where did our LLM budget go this month?"

What to Track

| Dimension | Why It Matters |
|---|---|
| Model | The single biggest cost driver |
| Feature | Support chat vs report generator have different ROI |
| Tenant | Enterprise customer usage for billing |
| User | Abuse detection and fairness enforcement |
| Prompt version | Cost impact of prompt changes and experiments |
| Retrieval strategy | RAG context size directly affects token cost |
| Provider | Different providers have different per-token prices |

Cost Formula

Every request has a predictable cost:

txt
request_cost =
  input_tokens * input_price_per_token
  + output_tokens * output_price_per_token
  + retrieval_cost
  + reranking_cost

The first two terms dominate. If a RAG request uses 4000 input tokens and generates 500 output tokens on a model that costs $10 per million input tokens, the input cost is $0.04 per request. Spread across millions of requests, small differences in context size become large differences in monthly spend.
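
The formula translates directly into a small function. Prices here are quoted per million tokens, the way providers usually list them, rather than per token. The $30 per million output price is an assumed figure for illustration only; the text above gives only the input price.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_million: float,
                 output_price_per_million: float,
                 retrieval_cost: float = 0.0,
                 reranking_cost: float = 0.0) -> float:
    """Cost of one request in dollars, following the formula above."""
    return (input_tokens * input_price_per_million / 1_000_000
            + output_tokens * output_price_per_million / 1_000_000
            + retrieval_cost
            + reranking_cost)


# Input side of the example: 4000 tokens at $10 per million is about $0.04.
input_only = request_cost(4000, 0, 10.0, 0.0)

# Same request with 500 output tokens at an assumed $30 per million.
with_output = request_cost(4000, 500, 10.0, 30.0)
```

At one million such requests a month, the input side alone comes to about $40,000. Trimming the context from 4000 to 3000 tokens would cut that by about $10,000.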

Cost dashboards should show both absolute spend and value metrics. A feature that costs $5000 a month but resolves 10,000 support tickets may be more valuable than a feature that costs $200 a month but solves nothing.


Step 5: Gateway State and Data

The gateway sits in the critical path of every request. It collects data that is useful for operations, debugging, compliance, and capacity planning.

| Data | Why Keep It | Retention Guidance |
|---|---|---|
| Raw prompts and responses | Debug bad answers, audit safety | Redact or sample if sensitive |
| Token counts | Cost and capacity planning | Keep aggregated |
| Model routing decisions | Verify routing logic | Keep with request trace ID |
| Provider errors | Reliability analysis | Keep until resolved |
| Safety classifications | Compliance and policy audit | Keep according to policy |

Privacy matters: Gateway logs can contain sensitive user data. Redaction, retention, access control, and audit logs are part of the gateway design, not an afterthought.
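
As a rough illustration of redaction before logging, the sketch below masks email addresses and long digit runs, then builds a log record keyed by trace ID. The two patterns are deliberately simple examples and are nowhere near complete PII detection. The record fields are hypothetical.

```python
import re

# Illustrative patterns only; real PII detection needs far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
LONG_NUMBER = re.compile(r"\b\d{9,}\b")  # account-like or card-like numbers


def redact(text: str) -> str:
    """Mask emails and long digit runs before the text reaches a log."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)


def build_log_record(trace_id: str, model: str, prompt: str, response: str,
                     input_tokens: int, output_tokens: int) -> dict:
    """Keep routing and token data as-is; redact the free-text fields."""
    return {
        "trace_id": trace_id,
        "model": model,
        "prompt": redact(prompt),
        "response": redact(response),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
```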

The gateway should also expose health metrics for itself:

  • Current request rate and concurrency
  • Provider error rates (per provider and per error type)
  • Fallback activation rate
  • Rate limit hit rate
  • P50, P95, P99 latency per model and per feature
  • Cost per hour, per feature, per tenant

Common Failure Stories

The Wrong Model Was Selected

A billing question about enterprise contracts is routed to a small, cheap model. The model hallucinates a discount policy because it lacks the reasoning depth to interpret the contract terms. The routing classification was too aggressive about cost savings.

The fix: routing rules must consider the cost of a wrong answer, not just the cost of the call. Tasks with a high cost of error should always use a higher-quality model or include a verification step.

The Fallback Model Changed the Answer

The primary provider returns a JSON response. When it fails and the gateway falls back, the secondary provider uses a different output structure. The application crashes parsing the response. The gateway logs show no error — the fallback succeeded — but the system produced a broken answer.

The fix: fallback chains must verify schema compatibility. If the fallback model cannot match the expected output format, the gateway should fail closed rather than returning an incompatible response.
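
A minimal sketch of failing closed on schema mismatch, using only the standard library. The expected fields (`answer`, `citations`) are a hypothetical schema; a production gateway would more likely validate against a full JSON Schema.

```python
import json


class SchemaMismatch(Exception):
    """Raised instead of handing an incompatible response to the application."""


# Hypothetical response contract that the application depends on.
REQUIRED_FIELDS = {"answer": str, "citations": list}


def validate_response(raw: str) -> dict:
    """Parse a model response and check it against the expected fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaMismatch("response is not valid JSON") from exc
    if not isinstance(data, dict):
        raise SchemaMismatch("response is not a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected_type):
            raise SchemaMismatch(
                f"field {field!r} is missing or not {expected_type.__name__}"
            )
    return data
```

Running the same check on primary and fallback responses means a fallback that "succeeds" with the wrong shape shows up in the gateway logs as a schema failure, not as a crash in the application.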

The Budget Exhausted at 3 AM

A test script runs in a tight loop making LLM calls overnight. There is no per-feature rate limit. By morning, the team has burned through the entire monthly budget. The gateway correctly logged every call, but it never stopped any of them.

The fix: rate limits must have hard ceilings per feature, not just soft warnings. The gateway should pre-check budgets before accepting requests, especially for non-production traffic.

The Safety Filter Was Bypassed Via Fallback

A user sends a policy-violating prompt. The primary model refuses. The gateway treats the refusal as a failure and falls back to a secondary provider. The secondary provider processes the request and returns a harmful response.

The fix: safety refusals must bypass fallback logic entirely. A refusal is not a failure. It is a successful enforcement.

The Latency SLA Was Violated By Retries

A chat feature needs responses within two seconds. The primary provider times out at one second. The gateway retries, waits another second, then falls back, and the response arrives at three seconds. The user has already closed the chat.

The fix: retries must respect latency classes. Interactive requests should fall back immediately. Batch requests can retry.
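
One way to make retries respect latency classes is to decide the next step from the time remaining in the request's deadline. The function below is a sketch under that assumption. The parameter names and the default retry cap are illustrative.

```python
from typing import Tuple


def next_step(latency_class: str, elapsed_s: float, deadline_s: float,
              attempt_timeout_s: float, retries_done: int,
              max_retries: int = 2) -> Tuple[str, float]:
    """After a failed attempt, return (action, timeout for the next call)."""
    remaining = deadline_s - elapsed_s
    if remaining <= 0:
        return ("fail", 0.0)
    # Never let the next call run past the overall deadline.
    timeout = min(attempt_timeout_s, remaining)
    if latency_class == "interactive":
        # Interactive traffic cannot afford a second try on a slow provider.
        return ("fallback", timeout)
    if retries_done < max_retries and remaining >= attempt_timeout_s:
        return ("retry", timeout)
    return ("fallback", timeout)
```

In the story above, the chat request has a two-second deadline and the primary fails at one second. `next_step("interactive", 1.0, 2.0, 1.0, 0)` goes straight to fallback with the remaining second as its timeout, instead of spending that second on a retry.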


Evaluating the Gateway

You cannot operate a gateway without monitoring its decisions and outcomes.

Routing Evaluation

  • Was the correct model selected for each task type?
  • How often does semantic routing misclassify an intent?
  • What is the cost impact of routing mistakes?

Fallback Evaluation

  • How often does each fallback activate?
  • Does the fallback response differ in quality from the primary?
  • Are there patterns in which requests trigger fallbacks?

Rate Limit Evaluation

  • How close is each tenant to their limit during peak hours?
  • Are rate limits being hit by legitimate traffic or abusive traffic?
  • What is the false positive rate of rate limiting?

Cost Evaluation

  • What is the cost per successful request per feature?
  • Which features have the worst cost-to-value ratio?
  • Are there opportunities to reduce context size without degrading quality?

Debugging rule: If an LLM response is unexpectedly bad, first check which model the gateway selected and whether a fallback occurred. The generation model matters more than the prompt in many failures.


A Complete Gateway Request, End to End

Here is how a single request flows through the gateway:

This diagram is the blueprint for an LLM gateway. Every box is a decision point. Every decision point is a place to add policy, safety, observability, or cost control.
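
The decision points can also be strung together in code. The sketch below shows one possible ordering (validate, check budget, route, call with fallback, log), drawn from the steps in this article. The function signature, the two-tier routing rule, and the use of `RuntimeError` for provider failures are all simplifying assumptions.

```python
def handle(request: dict, *, budget_left: float, providers: list, log: list):
    """Run one request through the pipeline; return the answer or None."""
    trace = {"feature": request.get("feature"), "steps": []}

    def stop(outcome: str):
        trace["outcome"] = outcome
        log.append(trace)
        return None

    # 1. Validation: reject malformed requests before doing any work.
    if not request.get("messages"):
        return stop("rejected: invalid request")

    # 2. Budget: check before retrieval or any model call.
    if budget_left <= 0:
        return stop("rejected: budget exhausted")

    # 3. Routing: map intent to a model alias (placeholder rule).
    model = "large" if request.get("qualityTier") == "high" else "small"
    trace["model"] = model

    # 4. Provider call with fallback, recording each failure in the trace.
    for name, call in providers:
        try:
            answer = call(model, request["messages"])
        except RuntimeError:
            trace["steps"].append(f"{name} failed")
            continue
        # 5. Observability: log which provider actually served the request.
        trace["provider"] = name
        trace["outcome"] = "ok"
        log.append(trace)
        return answer

    return stop("failed: all providers down")
```

Safety checks, schema validation, and cost recording would slot in as further steps. They are omitted here to keep the sketch short.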


What to Remember for Interviews

When explaining an LLM gateway, tell the story in order:

  1. Centralize model access: A gateway gives consistency across teams and features. Without it, every team reimplements authentication, retries, and logging.
  2. Route by task: Use cheaper models for simple tasks and stronger models for hard tasks. The routing decision is the most impactful cost lever.
  3. Fallback carefully: Availability fallback must not bypass safety or break schemas. A refusal is not a failure.
  4. Rate-limit tokens, not just requests: Token volume drives cost and throughput. Hard ceilings prevent surprise bills.
  5. Instrument cost and quality together: Cheap but bad answers are not optimization. Track both spend and outcome.
  6. The gateway is a policy engine, not a proxy: Every request goes through validation, routing, rate limiting, and observability before it reaches a model.

Practice: Design an LLM gateway for three product teams. Include model routing, tenant budgets, streaming, fallback policy, logging, and schema validation. Then trace a request through the full pipeline and explain what happens at each step.