# AI Gateway Patterns (Jan 2025 - Cost-First Strategy)

## Cost Hierarchy
| Rank | Model | Input/Output (per 1M tokens) | Use Case |
|------|-------|------------------------------|----------|
| 1 | Gemini 3 Flash | $0.10 / $0.40 | TRY FIRST - simple tasks, initial attempts |
| 2 | DeepSeek V3 | $0.14 / $0.28 | Validation, code review |
| 3 | Gemini 3 Pro | $1.25 / $5.00 | Large context (1M tokens) |
| 4 | Claude Sonnet 4.5 | $3.00 / $15.00 | Quality code gen when Flash fails |
| 5 | Claude Opus 4.5 | $15.00 / $75.00 | LAST RESORT - complex tasks only |
**Key insight:** Flash is 30x cheaper than Sonnet on input tokens ($0.10 vs $3.00 per 1M) and roughly as much cheaper on output. Try it first!
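To make that concrete, a back-of-envelope check for a typical call (pricing from the table above; the 2,000/500 token counts are illustrative):

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_per_million: float, output_per_million: float) -> float:
    """Dollar cost of one call at per-million-token pricing."""
    return (input_tokens * input_per_million
            + output_tokens * output_per_million) / 1_000_000

flash = call_cost(2_000, 500, 0.10, 0.40)    # $0.0004
sonnet = call_cost(2_000, 500, 3.00, 15.00)  # $0.0135
print(f"Sonnet costs {sonnet / flash:.0f}x Flash per call")  # ~34x
```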
## Cost-Optimized Routing

```python
TASK_ROUTING = {
    # Simple → cheapest first
    TaskType.SIMPLE_TASK: ["gemini-3-flash", "deepseek-chat"],
    # Validation → DeepSeek (cheap + good at code)
    TaskType.VALIDATION: ["deepseek-chat", "gemini-3-flash"],
    # Code gen → try Flash first, escalate if needed
    TaskType.CODE_GENERATION: ["gemini-3-flash", "deepseek-coder", "claude-sonnet"],
    # Understanding → Flash for simple, Sonnet for complex
    TaskType.WORKFLOW_UNDERSTANDING: ["gemini-3-flash", "claude-sonnet", "claude-opus"],
    # Pipeline gen → quality matters, but try cheaper first
    TaskType.PIPELINE_GENERATION: ["claude-sonnet", "gemini-3-pro"],
}
```
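Each list is an ordered fallback chain: the first alias is attempted first, and a failure escalates to the next entry. For example:

```python
# Cheapest-first: a failed Flash attempt escalates to DeepSeek Coder,
# and only then to Sonnet.
chain = TASK_ROUTING[TaskType.CODE_GENERATION]
assert chain == ["gemini-3-flash", "deepseek-coder", "claude-sonnet"]
```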
## LiteLLM Model IDs

```python
MODELS = {
    # Gemini via Vertex AI (us-central1)
    "gemini-3-flash": "vertex_ai/gemini-3-flash-preview",
    "gemini-3-pro": "vertex_ai/gemini-3-pro-preview",
    # Claude via Vertex AI Model Garden (us-east5!)
    "claude-sonnet": "vertex_ai/claude-sonnet-4@20250514",
    "claude-opus": "vertex_ai/claude-opus-4-5@20250514",
    # DeepSeek API
    "deepseek-chat": "deepseek/deepseek-chat",
    "deepseek-coder": "deepseek/deepseek-coder",
}
```
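A sketch of how these IDs feed LiteLLM's `completion` call (the prompt is illustrative):

```python
import litellm

response = litellm.completion(
    model=MODELS["gemini-3-flash"],  # -> "vertex_ai/gemini-3-flash-preview"
    messages=[{"role": "user", "content": "Summarize this workflow config."}],
)
print(response.choices[0].message.content)
```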
## Regional Configuration

```python
# CRITICAL: different regions for different models!
if "claude" in model_id:
    litellm.vertex_location = "us-east5"
else:
    litellm.vertex_location = "us-central1"
```
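Mutating `litellm.vertex_location` is global state. An alternative sketch, assuming LiteLLM's per-call `vertex_location` parameter (documented for Vertex AI routes), passes the region per request instead:

```python
import litellm

def vertex_location_for(model_id: str) -> str:
    """Claude in Model Garden is served from us-east5; Gemini from us-central1."""
    return "us-east5" if "claude" in model_id else "us-central1"

model_id = "vertex_ai/claude-sonnet-4@20250514"
response = litellm.completion(
    model=model_id,
    messages=[{"role": "user", "content": "ping"}],
    vertex_location=vertex_location_for(model_id),  # no module-level mutation
)
```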
## Implementation

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class TaskType(Enum):
    SIMPLE_TASK = "simple"
    VALIDATION = "validation"
    CODE_GENERATION = "code_gen"
    WORKFLOW_UNDERSTANDING = "understand"
    PIPELINE_GENERATION = "pipeline"


@dataclass
class ModelConfig:
    name: str
    model_id: str
    input_cost_per_million: float
    output_cost_per_million: float
    context_window: int
    max_output_tokens: int = 4096


DEFAULT_MODELS = {
    "gemini-3-flash": ModelConfig(
        name="Gemini 3 Flash",
        model_id="vertex_ai/gemini-3-flash-preview",
        input_cost_per_million=0.10,
        output_cost_per_million=0.40,
        context_window=1_000_000,
        max_output_tokens=8192,
    ),
    "gemini-3-pro": ModelConfig(
        name="Gemini 3 Pro",
        model_id="vertex_ai/gemini-3-pro-preview",
        input_cost_per_million=1.25,
        output_cost_per_million=5.00,
        context_window=1_000_000,
        max_output_tokens=8192,
    ),
    "claude-sonnet": ModelConfig(
        name="Claude Sonnet 4.5",
        model_id="vertex_ai/claude-sonnet-4@20250514",
        input_cost_per_million=3.00,
        output_cost_per_million=15.00,
        context_window=200_000,
        max_output_tokens=8192,
    ),
    "claude-opus": ModelConfig(
        name="Claude Opus 4.5",
        model_id="vertex_ai/claude-opus-4-5@20250514",
        input_cost_per_million=15.00,
        output_cost_per_million=75.00,
        context_window=200_000,
        max_output_tokens=4096,
    ),
    "deepseek-chat": ModelConfig(
        name="DeepSeek V3",
        model_id="deepseek/deepseek-chat",
        input_cost_per_million=0.14,
        output_cost_per_million=0.28,
        context_window=128_000,
    ),
    "deepseek-coder": ModelConfig(
        name="DeepSeek Coder",
        model_id="deepseek/deepseek-coder",
        input_cost_per_million=0.14,
        output_cost_per_million=0.28,
        context_window=128_000,
    ),
}

# Cost-optimized routing - CHEAPEST FIRST!
TASK_ROUTING: Dict[TaskType, List[str]] = {
    TaskType.SIMPLE_TASK: ["gemini-3-flash", "deepseek-chat"],
    TaskType.VALIDATION: ["deepseek-chat", "gemini-3-flash"],
    TaskType.CODE_GENERATION: ["gemini-3-flash", "deepseek-coder", "claude-sonnet"],
    TaskType.WORKFLOW_UNDERSTANDING: ["gemini-3-flash", "claude-sonnet", "claude-opus"],
    TaskType.PIPELINE_GENERATION: ["claude-sonnet", "gemini-3-pro"],
}
```
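The tests in the Testing section exercise an `AIGateway` class that isn't shown here. A minimal sketch of how it might walk a routing chain, assuming LiteLLM's async `acompletion` API (the class name and method signature are taken from the tests, not from a published implementation):

```python
import litellm

class AIGateway:
    """Illustrative cheapest-first fallback loop, not the actual implementation.
    Relies on TaskType, TASK_ROUTING, and DEFAULT_MODELS defined above."""

    async def complete(self, prompt: str, task_type: TaskType) -> str:
        last_error = None
        for alias in TASK_ROUTING[task_type]:
            config = DEFAULT_MODELS[alias]
            # Apply the region rule from "Regional Configuration".
            litellm.vertex_location = (
                "us-east5" if "claude" in config.model_id else "us-central1"
            )
            try:
                response = await litellm.acompletion(
                    model=config.model_id,
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=config.max_output_tokens,
                )
                return response.choices[0].message.content
            except Exception as exc:  # rate limit, region error, etc.
                last_error = exc  # fall through to the next, pricier model
        raise RuntimeError(f"All models failed for {task_type}") from last_error
```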
## Cost Savings Example

| 100 Workflow Jobs | Sonnet-First | Flash-First | Savings |
|-------------------|--------------|-------------|---------|
| Simple (30) | $9.00 | $0.30 | 97% |
| Understanding (30) | $9.00 | $0.30* | 97% |
| Code Gen (30) | $9.00 | $0.30* | 97% |
| Validation (10) | $0.14 | $0.14 | 0% |
| Total | $27.14 | $1.04 | 96% |

*Assumes Flash handles 80% of jobs successfully and escalates the remaining 20% to Sonnet; the asterisked figures exclude the cost of the escalated calls, so realized savings land somewhat lower.
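A quick expected-cost check on that assumption, using the table's per-job figures (Sonnet ≈ $0.30/job, Flash ≈ $0.01/job):

```python
# Expected per-job cost when Flash succeeds with probability p and
# failures escalate to Sonnet.
p_success = 0.80
flash_per_job = 0.30 / 30   # $0.01
sonnet_per_job = 9.00 / 30  # $0.30
expected = flash_per_job + (1 - p_success) * sonnet_per_job
print(f"${expected:.3f}/job vs ${sonnet_per_job:.2f}/job Sonnet-first")  # $0.070 vs $0.30
# Even with escalation priced in, Flash-first is ~77% cheaper per job.
```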
## Environment Variables

```bash
GOOGLE_CLOUD_PROJECT=gen-lang-client-0497834162
VERTEX_AI_LOCATION=us-central1
VERTEX_AI_CLAUDE_LOCATION=us-east5
DEEPSEEK_API_KEY=your-key  # optional
```
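A minimal settings loader matching these variables (the constant names are illustrative):

```python
import os

PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]
GEMINI_LOCATION = os.environ.get("VERTEX_AI_LOCATION", "us-central1")
CLAUDE_LOCATION = os.environ.get("VERTEX_AI_CLAUDE_LOCATION", "us-east5")
DEEPSEEK_API_KEY = os.environ.get("DEEPSEEK_API_KEY")  # optional; skip DeepSeek routes if unset
```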
## Testing

```python
import pytest
from unittest.mock import patch

# AIGateway, TaskType, and mock_response come from the module under test.


@pytest.mark.asyncio
async def test_routes_simple_to_flash():
    """Simple tasks should use the cheapest model (Flash)."""
    with patch("litellm.acompletion") as mock:
        mock.return_value = mock_response("ok")
        gateway = AIGateway()
        await gateway.complete("test", task_type=TaskType.SIMPLE_TASK)
        assert "flash" in mock.call_args.kwargs["model"].lower()


@pytest.mark.asyncio
async def test_escalates_on_failure():
    """Should try the next model in the chain on failure."""
    with patch("litellm.acompletion") as mock:
        mock.side_effect = [Exception("Rate limited"), mock_response("ok")]
        gateway = AIGateway()
        result = await gateway.complete("test", task_type=TaskType.CODE_GENERATION)
        assert mock.call_count == 2  # tried Flash, then DeepSeek Coder
        assert result == "ok"
```
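The `mock_response` helper referenced above isn't shown in this section; one plausible shape, assuming LiteLLM's OpenAI-compatible response object:

```python
from unittest.mock import MagicMock

def mock_response(content: str) -> MagicMock:
    """Illustrative stand-in: shapes a MagicMock like an OpenAI-style
    completion response (response.choices[0].message.content)."""
    response = MagicMock()
    response.choices[0].message.content = content
    return response
```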