Claude Code Plugins

Community-maintained marketplace


AI integration patterns using LiteLLM for multi-model routing, caching, and cost optimization. Use when implementing AI services.

Install Skill

1. Download skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: ai-gateway
description: AI integration patterns using LiteLLM for multi-model routing, caching, and cost optimization. Use when implementing AI services.
allowed-tools: Read, Glob, Grep, Edit, Write, Bash(pytest:*), Bash(python:*)

AI Gateway Patterns (Jan 2025 - Cost-First Strategy)

Cost Hierarchy

Rank  Model               Input / Output (per 1M tokens)   Use Case
1     Gemini 3 Flash      $0.10 / $0.40                    TRY FIRST - simple tasks, initial attempts
2     DeepSeek V3         $0.14 / $0.28                    Validation, code review
3     Gemini 3 Pro        $1.25 / $5.00                    Large context (1M tokens)
4     Claude Sonnet 4.5   $3.00 / $15.00                   Quality code gen when Flash fails
5     Claude Opus 4.5     $15.00 / $75.00                  LAST RESORT - complex tasks only

Key insight: Flash is 30x cheaper than Sonnet. Try it first!

Cost-Optimized Routing

TASK_ROUTING = {
    # Simple → Cheapest first
    TaskType.SIMPLE_TASK: ["gemini-3-flash", "deepseek-chat"],
    
    # Validation → DeepSeek (cheap + good at code)
    TaskType.VALIDATION: ["deepseek-chat", "gemini-3-flash"],
    
    # Code gen → Try Flash first, escalate if needed
    TaskType.CODE_GENERATION: ["gemini-3-flash", "deepseek-coder", "claude-sonnet"],
    
    # Understanding → Flash for simple, Sonnet for complex
    TaskType.WORKFLOW_UNDERSTANDING: ["gemini-3-flash", "claude-sonnet", "claude-opus"],
    
    # Pipeline gen → Quality matters, but try cheaper first
    TaskType.PIPELINE_GENERATION: ["claude-sonnet", "gemini-3-pro"],
}

LiteLLM Model IDs

MODELS = {
    # Gemini via Vertex AI (us-central1)
    "gemini-3-flash": "vertex_ai/gemini-3-flash-preview",
    "gemini-3-pro": "vertex_ai/gemini-3-pro-preview",
    
    # Claude via Vertex AI Model Garden (us-east5!)
    "claude-sonnet": "vertex_ai/claude-sonnet-4@20250514",
    "claude-opus": "vertex_ai/claude-opus-4-5@20250514",
    
    # DeepSeek API
    "deepseek-chat": "deepseek/deepseek-chat",
    "deepseek-coder": "deepseek/deepseek-coder",
}
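
A minimal call sketch using one of these IDs (assuming Vertex AI credentials are already configured, e.g. via gcloud application-default login):

import asyncio
import litellm

async def demo() -> None:
    # Cheapest-first: start with Flash, using the ID from the MODELS table above.
    response = await litellm.acompletion(
        model=MODELS["gemini-3-flash"],
        messages=[{"role": "user", "content": "Summarize this workflow in one line."}],
        max_tokens=256,
    )
    print(response.choices[0].message.content)

asyncio.run(demo())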

Regional Configuration

import litellm

# CRITICAL: Different Vertex AI regions for different models!
if "claude" in model_id:
    litellm.vertex_location = "us-east5"     # Claude via Model Garden
else:
    litellm.vertex_location = "us-central1"  # Gemini previews
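
The module-level setting is global, so concurrent requests to different model families can race on it. LiteLLM also accepts vertex_location as a per-call parameter, which avoids that; a sketch (the helper name is illustrative):

import litellm

def vertex_location_for(model_id: str) -> str:
    # Claude via Vertex AI Model Garden is served from us-east5;
    # everything else in this skill uses us-central1.
    return "us-east5" if "claude" in model_id else "us-central1"

model_id = MODELS["claude-sonnet"]
response = litellm.completion(
    model=model_id,
    messages=[{"role": "user", "content": "Review this diff."}],
    vertex_location=vertex_location_for(model_id),
)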

Implementation

from dataclasses import dataclass
from typing import Dict, List
from enum import Enum

class TaskType(Enum):
    SIMPLE_TASK = "simple"
    VALIDATION = "validation"
    CODE_GENERATION = "code_gen"
    WORKFLOW_UNDERSTANDING = "understand"
    PIPELINE_GENERATION = "pipeline"

@dataclass
class ModelConfig:
    name: str
    model_id: str
    input_cost_per_million: float
    output_cost_per_million: float
    context_window: int
    max_output_tokens: int = 4096

DEFAULT_MODELS = {
    "gemini-3-flash": ModelConfig(
        name="Gemini 3 Flash",
        model_id="vertex_ai/gemini-3-flash-preview",
        input_cost_per_million=0.10,
        output_cost_per_million=0.40,
        context_window=1_000_000,
        max_output_tokens=8192
    ),
    "gemini-3-pro": ModelConfig(
        name="Gemini 3 Pro",
        model_id="vertex_ai/gemini-3-pro-preview",
        input_cost_per_million=1.25,
        output_cost_per_million=5.00,
        context_window=1_000_000,
        max_output_tokens=8192
    ),
    "claude-sonnet": ModelConfig(
        name="Claude Sonnet 4.5",
        model_id="vertex_ai/claude-sonnet-4@20250514",
        input_cost_per_million=3.00,
        output_cost_per_million=15.00,
        context_window=200_000,
        max_output_tokens=8192
    ),
    "claude-opus": ModelConfig(
        name="Claude Opus 4.5",
        model_id="vertex_ai/claude-opus-4-5@20250514",
        input_cost_per_million=15.00,
        output_cost_per_million=75.00,
        context_window=200_000,
        max_output_tokens=4096
    ),
    "deepseek-chat": ModelConfig(
        name="DeepSeek V3",
        model_id="deepseek/deepseek-chat",
        input_cost_per_million=0.14,
        output_cost_per_million=0.28,
        context_window=128_000,
    ),
    "deepseek-coder": ModelConfig(
        name="DeepSeek Coder",
        model_id="deepseek/deepseek-coder",
        input_cost_per_million=0.14,
        output_cost_per_million=0.28,
        context_window=128_000,
    ),
}

# Cost-optimized routing - CHEAPEST FIRST!
TASK_ROUTING: Dict[TaskType, List[str]] = {
    TaskType.SIMPLE_TASK: ["gemini-3-flash", "deepseek-chat"],
    TaskType.VALIDATION: ["deepseek-chat", "gemini-3-flash"],
    TaskType.CODE_GENERATION: ["gemini-3-flash", "deepseek-coder", "claude-sonnet"],
    TaskType.WORKFLOW_UNDERSTANDING: ["gemini-3-flash", "claude-sonnet", "claude-opus"],
    TaskType.PIPELINE_GENERATION: ["claude-sonnet", "gemini-3-pro"],
}
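
The tests below reference an AIGateway class that the skill does not define. A minimal sketch of the cheapest-first fallback loop, assuming any exception (rate limit, quota, provider error) should escalate to the next model in the chain:

import litellm

class AIGatewayError(Exception):
    """Raised when every model in a routing chain has failed."""

class AIGateway:
    """Routes prompts through a cost-ordered model chain, escalating on failure."""

    def __init__(self, models=DEFAULT_MODELS, routing=TASK_ROUTING):
        self.models = models
        self.routing = routing

    async def complete(self, prompt: str, task_type: TaskType, **kwargs) -> str:
        errors = []
        for name in self.routing[task_type]:
            config = self.models[name]
            try:
                response = await litellm.acompletion(
                    model=config.model_id,
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=config.max_output_tokens,
                    **kwargs,
                )
                return response.choices[0].message.content
            except Exception as exc:  # escalate to the next (pricier) model
                errors.append(f"{name}: {exc}")
        raise AIGatewayError("; ".join(errors))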

Cost Savings Example

100 Workflow Jobs    Sonnet-First   Flash-First   Savings
Simple (30)          $9.00          $0.30         97%
Understanding (30)   $9.00          $0.30*        97%
Code Gen (30)        $9.00          $0.30*        97%
Validation (10)      $0.14          $0.14         0%
Total                $27.14         $1.04         96%

*Assumes Flash handles ~80% of these jobs successfully and escalates ~20% to Sonnet; the escalation cost is not reflected in the per-row figures.
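
To make the per-1M rates concrete, a small helper (illustrative, built on the ModelConfig fields above):

def estimate_cost(config: ModelConfig, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the per-million-token rates above."""
    return (
        input_tokens * config.input_cost_per_million
        + output_tokens * config.output_cost_per_million
    ) / 1_000_000

# A 10k-in / 1k-out job: $0.0014 on Flash vs $0.045 on Sonnet (~32x cheaper).
flash_cost = estimate_cost(DEFAULT_MODELS["gemini-3-flash"], 10_000, 1_000)
sonnet_cost = estimate_cost(DEFAULT_MODELS["claude-sonnet"], 10_000, 1_000)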

Environment Variables

GOOGLE_CLOUD_PROJECT=gen-lang-client-0497834162
VERTEX_AI_LOCATION=us-central1       # Gemini
VERTEX_AI_CLAUDE_LOCATION=us-east5   # Claude via Model Garden
DEEPSEEK_API_KEY=your-key            # Optional
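
A sketch of wiring these variables in at startup (the defaults here are assumptions matching the values above):

import os
import litellm

litellm.vertex_project = os.environ["GOOGLE_CLOUD_PROJECT"]

DEFAULT_LOCATION = os.environ.get("VERTEX_AI_LOCATION", "us-central1")
CLAUDE_LOCATION = os.environ.get("VERTEX_AI_CLAUDE_LOCATION", "us-east5")

# DEEPSEEK_API_KEY is picked up from the environment by LiteLLM
# whenever a deepseek/* model is requested; no explicit wiring needed.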

Testing

import pytest
from unittest.mock import patch

@pytest.mark.asyncio
async def test_routes_simple_to_flash():
    """Simple tasks should use the cheapest model (Flash)."""
    with patch("litellm.acompletion") as mock:
        mock.return_value = mock_response("ok")

        gateway = AIGateway()
        await gateway.complete("test", task_type=TaskType.SIMPLE_TASK)

        assert "flash" in mock.call_args.kwargs["model"].lower()

@pytest.mark.asyncio
async def test_escalates_on_failure():
    """Should try the next model in the chain on failure."""
    with patch("litellm.acompletion") as mock:
        mock.side_effect = [Exception("Rate limited"), mock_response("ok")]

        gateway = AIGateway()
        result = await gateway.complete("test", task_type=TaskType.CODE_GENERATION)

        assert result == "ok"
        assert mock.call_count == 2  # Tried Flash, then DeepSeek Coder
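
These tests assume a mock_response helper. A minimal version that mimics the parts of LiteLLM's response object the gateway reads (choices[0].message.content):

from types import SimpleNamespace

def mock_response(content: str) -> SimpleNamespace:
    # Just enough shape to satisfy response.choices[0].message.content.
    message = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])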