| name | llm-gateway-routing |
| description | LLM gateway and routing configuration using OpenRouter and LiteLLM. Invoke when: - Setting up multi-model access (OpenRouter, LiteLLM) - Configuring model fallbacks and reliability - Implementing cost-based or latency-based routing - A/B testing different models - Self-hosting an LLM proxy Keywords: openrouter, litellm, llm gateway, model routing, fallback, A/B testing |
LLM Gateway & Routing
Configure multi-model access, fallbacks, cost optimization, and A/B testing.
Why Use a Gateway?
Without a gateway:
- Vendor lock-in (one provider)
- No fallbacks (provider down = app down)
- Hard to A/B test models
- Scattered API keys and configs
With a gateway:
- Single API for 400+ models
- Automatic fallbacks
- Easy model switching
- Unified cost tracking
Quick Decision
| Need | Solution |
|---|---|
| Fastest setup, multi-model | OpenRouter |
| Full control, self-hosted | LiteLLM |
| Observability + routing | Helicone |
| Enterprise, guardrails | Portkey |
OpenRouter (Recommended)
Why OpenRouter
- 400+ models: OpenAI, Anthropic, Google, Meta, Mistral, and more
- Single API: One key for all providers
- Automatic fallbacks: Built-in reliability
- A/B testing: Easy model comparison
- Cost tracking: Unified billing dashboard
- Free credits: $1 to get started
Setup
# 1. Sign up at openrouter.ai
# 2. Get API key from dashboard
# 3. Add to .env:
OPENROUTER_API_KEY=sk-or-v1-...
Basic Usage
// Using fetch
const response = await fetch('https://openrouter.ai/api/v1/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.OPENROUTER_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'anthropic/claude-3-5-sonnet',
    messages: [{ role: 'user', content: 'Hello!' }],
  }),
});
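The response body follows the OpenAI chat-completions shape, so the reply text can be read straight off the first choice:
const data = await response.json();
console.log(data.choices[0].message.content);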
With Vercel AI SDK (Recommended)
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";
const openrouter = createOpenAI({
  baseURL: "https://openrouter.ai/api/v1",
  apiKey: process.env.OPENROUTER_API_KEY,
});

const { text } = await generateText({
  model: openrouter("anthropic/claude-3-5-sonnet"),
  prompt: "Explain quantum computing",
});
Model IDs
// Format: provider/model-name
const models = {
  // Anthropic
  claude35Sonnet: "anthropic/claude-3-5-sonnet",
  claudeHaiku: "anthropic/claude-3-5-haiku",
  // OpenAI
  gpt4o: "openai/gpt-4o",
  gpt4oMini: "openai/gpt-4o-mini",
  // Google
  geminiPro: "google/gemini-pro-1.5",
  geminiFlash: "google/gemini-flash-1.5",
  // Meta
  llama3: "meta-llama/llama-3.1-70b-instruct",
  // Auto (OpenRouter picks best)
  auto: "openrouter/auto",
};
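Any of these IDs can be passed to the openrouter provider created in the AI SDK example above; a minimal sketch using openrouter/auto to let OpenRouter pick the model (the prompt is just an illustration):
import { generateText } from "ai";

// `openrouter` is the provider from the Vercel AI SDK example above
const { text } = await generateText({
  model: openrouter(models.auto), // "openrouter/auto"
  prompt: "Give a one-line summary of what an LLM gateway does.",
});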
Fallback Chains
import { generateText, type CoreMessage } from "ai";

// Define fallback order
const modelChain = [
  "anthropic/claude-3-5-sonnet", // Primary
  "openai/gpt-4o", // Fallback 1
  "google/gemini-pro-1.5", // Fallback 2
];

async function callWithFallback(messages: CoreMessage[]) {
  for (const model of modelChain) {
    try {
      // Reuses the `openrouter` provider from the AI SDK example above
      return await generateText({ model: openrouter(model), messages });
    } catch (error) {
      console.warn(`${model} failed, trying next...`);
    }
  }
  throw new Error("All models failed");
}
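Example call, assuming the AI SDK CoreMessage shape used above; the message content is illustrative:
const result = await callWithFallback([
  { role: "user", content: "Draft a status update for the on-call channel." },
]);
console.log(result.text);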
Cost Routing
// Route based on query complexity
function selectModel(query: string): string {
  const complexity = analyzeComplexity(query);
  if (complexity === "simple") {
    // Simple queries → cheapest model
    return "google/gemini-flash-1.5"; // ~$0.075/1M input tokens
  } else if (complexity === "medium") {
    // Medium → balanced cost/quality
    return "openai/gpt-4o-mini"; // ~$0.15/1M input tokens
  } else {
    // Complex → best quality
    return "anthropic/claude-3-5-sonnet"; // ~$3/1M input tokens
  }
}

function analyzeComplexity(query: string): "simple" | "medium" | "complex" {
  // Simple heuristics; swap in a classifier if you need better routing
  const q = query.toLowerCase();
  if (q.length < 50) return "simple";
  if (q.includes("explain") || q.includes("analyze")) return "complex";
  return "medium";
}
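A sketch of wiring selectModel into the openrouter provider from the AI SDK setup above (the answer helper and its return shape are illustrative):
async function answer(query: string) {
  const model = selectModel(query);
  const { text } = await generateText({
    model: openrouter(model), // provider from the AI SDK setup above
    prompt: query,
  });
  return { model, text };
}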
A/B Testing
// Deterministic assignment: the same user always lands in the same bucket
function getModel(userId: string): string {
  const hash = [...userId].reduce((sum, ch) => sum + ch.charCodeAt(0), 0) % 100;
  if (hash < 50) {
    return "anthropic/claude-3-5-sonnet"; // ~50% of users
  } else {
    return "openai/gpt-4o"; // ~50% of users
  }
}

// Track which model was used
const model = getModel(userId);
const start = Date.now();
const { text, usage } = await generateText({ model: openrouter(model), messages });
await analytics.track("llm_call", {
  model,
  userId,
  latency: Date.now() - start,
  usage, // token counts; convert to cost as shown in Best Practices below
});
LiteLLM (Self-Hosted)
Why LiteLLM
- Self-hosted: Full control over data
- 100+ providers: Same coverage as OpenRouter
- Load balancing: Distribute across providers
- Cost tracking: Built-in spend management
- Caching: Redis or in-memory
- Rate limiting: Per-user limits
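Per-user limits are enforced by issuing scoped virtual keys from the proxy; a hedged sketch against LiteLLM's /key/generate endpoint, assuming the proxy from the setup below runs on localhost:4000 with the master key shown in the config (parameter names may differ across LiteLLM versions, so check the docs):
// Issue a scoped key for one user: capped budget plus request/token rate limits
const res = await fetch("http://localhost:4000/key/generate", {
  method: "POST",
  headers: {
    Authorization: "Bearer sk-master-...", // proxy master_key
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    models: ["claude-sonnet", "gpt-4o"], // model_names from config.yaml
    max_budget: 10, // $10 for this key
    rpm_limit: 60, // requests per minute
    tpm_limit: 100000, // tokens per minute
    duration: "30d",
  }),
});
const { key } = await res.json(); // hand this key to the user's client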
Setup
# Install
pip install 'litellm[proxy]'
# Run proxy
litellm --config config.yaml
# Use as OpenAI-compatible endpoint
export OPENAI_API_BASE=http://localhost:4000
Configuration
# config.yaml
model_list:
  # Claude models
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...

  # OpenAI models
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

  # Load balanced: entries that share a model_name form one group,
  # and requests to "balanced" are distributed across both deployments
  - model_name: balanced
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...
  - model_name: balanced
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...

# General settings
general_settings:
  master_key: sk-master-...
  database_url: postgresql://...

# Routing
router_settings:
  routing_strategy: simple-shuffle # or latency-based-routing
  num_retries: 3
  timeout: 30

# Budget controls
litellm_settings:
  max_budget: 100 # $100/month
  budget_duration: monthly
Fallbacks in LiteLLM
model_list:
  - model_name: primary
    litellm_params:
      model: anthropic/claude-3-5-sonnet-latest
      api_key: sk-ant-...
  - model_name: fallback-1
    litellm_params:
      model: openai/gpt-4o
      api_key: sk-...
  - model_name: fallback-2
    litellm_params:
      model: gemini/gemini-pro
      api_key: ...

router_settings:
  # If "primary" fails, retry on "fallback-1", then "fallback-2"
  fallbacks:
    - primary: ["fallback-1", "fallback-2"]
Usage
// Use like OpenAI SDK
import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "http://localhost:4000",
  apiKey: "sk-master-...",
});

const response = await client.chat.completions.create({
  model: "claude-sonnet", // Maps to configured model
  messages: [{ role: "user", content: "Hello!" }],
});
Routing Strategies
1. Cost-Based Routing
const costTiers = {
  cheap: ["openai/gpt-4o-mini", "google/gemini-flash-1.5"],
  balanced: ["anthropic/claude-3-5-haiku", "openai/gpt-4o"],
  premium: ["anthropic/claude-3-5-sonnet", "openai/o1-preview"],
};

function routeByCost(budget: "cheap" | "balanced" | "premium"): string {
  const models = costTiers[budget];
  return models[Math.floor(Math.random() * models.length)];
}
2. Latency-Based Routing
// Track latency per model
const latencyStats: Record<string, number[]> = {};

function routeByLatency(defaultModel = "openai/gpt-4o-mini"): string {
  const avgLatencies = Object.entries(latencyStats)
    .map(([model, times]) => ({
      model,
      avg: times.reduce((a, b) => a + b, 0) / times.length,
    }))
    .sort((a, b) => a.avg - b.avg);
  // Fall back to a default until we have samples
  return avgLatencies[0]?.model ?? defaultModel;
}

// Update after each call
function recordLatency(model: string, latencyMs: number) {
  if (!latencyStats[model]) latencyStats[model] = [];
  latencyStats[model].push(latencyMs);
  // Keep the last 100 samples
  if (latencyStats[model].length > 100) {
    latencyStats[model].shift();
  }
}
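A usage sketch that feeds real measurements back into the stats; it reuses the generic gateway.chat and Message placeholders from the Best Practices examples below:
// Wrap calls so every request updates the latency stats it routes on
async function timedCall(messages: Message[]) {
  const model = routeByLatency();
  const start = Date.now();
  const response = await gateway.chat({ model, messages });
  recordLatency(model, Date.now() - start);
  return response;
}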
3. Task-Based Routing
const taskModels = {
  coding: "anthropic/claude-3-5-sonnet", // Best for code
  reasoning: "openai/o1-preview", // Best for logic
  creative: "anthropic/claude-3-5-sonnet", // Best for writing
  simple: "openai/gpt-4o-mini", // Cheap and fast
  multimodal: "google/gemini-pro-1.5", // Vision + text
};

function routeByTask(task: keyof typeof taskModels): string {
  return taskModels[task];
}
4. Hybrid Routing
interface RoutingConfig {
  task: string;
  maxCost: number; // max $ per 1M input tokens
  maxLatency: number; // max average latency in ms
}

interface ModelInfo {
  id: string;
  cost: number; // $ per 1M input tokens
  avgLatency: number; // rolling average latency in ms
}

// `models` is your model registry; `getTaskScore` rates a model for a task
function hybridRoute(models: ModelInfo[], config: RoutingConfig): string {
  // Filter by cost
  const affordable = models.filter((m) => m.cost <= config.maxCost);
  // Filter by latency
  const fast = affordable.filter((m) => m.avgLatency <= config.maxLatency);
  // Select best for task
  const ranked = fast
    .map((m) => ({ model: m.id, score: getTaskScore(m.id, config.task) }))
    .sort((a, b) => b.score - a.score);
  if (ranked.length === 0) throw new Error("No model satisfies the constraints");
  return ranked[0].model;
}
Best Practices
1. Always Have Fallbacks
// Bad: Single point of failure
const response = await openai.chat({ model: "gpt-4o", messages });

// Good: Fallback chain
const models = ["gpt-4o", "claude-3-5-sonnet", "gemini-pro"];
for (const model of models) {
  try {
    return await gateway.chat({ model, messages });
  } catch (e) {
    continue; // try the next model
  }
}
throw new Error("All models failed"); // don't fail silently
2. Pin Model Versions
// Bad: Model can change
const model = "gpt-4";
// Good: Pinned version
const model = "openai/gpt-4-0125-preview";
3. Track Costs
// Log every call
async function trackedCall(model: string, messages: Message[]) {
  const start = Date.now();
  const response = await gateway.chat({ model, messages });
  const latency = Date.now() - start;
  await analytics.track("llm_call", {
    model,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    cost: calculateCost(model, response.usage),
    latency,
  });
  return response;
}
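calculateCost above assumes a per-model price table; a minimal sketch with illustrative prices (verify current provider pricing before relying on these numbers):
// Illustrative $ per 1M tokens; real prices change, so load these from config
const PRICES: Record<string, { input: number; output: number }> = {
  "openai/gpt-4o-mini": { input: 0.15, output: 0.6 },
  "anthropic/claude-3-5-sonnet": { input: 3, output: 15 },
};

function calculateCost(
  model: string,
  usage: { prompt_tokens: number; completion_tokens: number },
): number {
  const price = PRICES[model];
  if (!price) return 0; // unknown model: skip cost attribution
  return (
    (usage.prompt_tokens / 1_000_000) * price.input +
    (usage.completion_tokens / 1_000_000) * price.output
  );
}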
4. Set Token Limits
// Prevent runaway costs
const response = await gateway.chat({
  model,
  messages,
  max_tokens: 500, // Limit output length
});
5. Use Caching
# LiteLLM caching (config.yaml)
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: localhost
    port: 6379
    ttl: 3600 # 1 hour
References
- references/openrouter-guide.md - OpenRouter deep dive
- references/litellm-guide.md - LiteLLM self-hosting
- references/routing-strategies.md - Advanced routing patterns
- references/alternatives.md - Helicone, Portkey, etc.
Templates
- templates/openrouter-config.ts - TypeScript OpenRouter setup
- templates/litellm-config.yaml - LiteLLM proxy config
- templates/fallback-chain.ts - Fallback implementation