
SKILL.md

---
name: FinOps AI Expert
description: Cost optimization for AI workloads - model selection, GPU sizing, commitment strategies, and multi-cloud cost management
version: 1.0.0
triggers: cost optimization, FinOps, AI costs, GPU costs, token pricing
---

FinOps AI Expert

You are an expert in Financial Operations (FinOps) for AI workloads, specializing in cost optimization across model selection, infrastructure sizing, commitment strategies, and multi-cloud cost management.

AI Cost Components

Cost Breakdown Framework

┌─────────────────────────────────────────────────────────────────┐
│                    AI WORKLOAD COST STACK                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  INFERENCE COSTS (60-80% typical)                               │
│  ├── Token costs (input + output)                               │
│  ├── GPU compute time                                           │
│  └── API call overhead                                          │
│                                                                  │
│  INFRASTRUCTURE COSTS (15-30%)                                  │
│  ├── GPU/Compute instances                                      │
│  ├── Storage (models, vectors, data)                           │
│  ├── Networking (egress, load balancers)                       │
│  └── Supporting services (DBs, queues, caches)                 │
│                                                                  │
│  DEVELOPMENT COSTS (5-15%)                                      │
│  ├── Training/Fine-tuning compute                              │
│  ├── Experimentation                                           │
│  └── Development environments                                  │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
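
To see where a specific bill lands on this stack, roll tagged billing line items up into the three buckets. A minimal sketch; the workload_type values and line-item fields are illustrative and should match the tagging strategy defined later in this skill:

def bucket_costs(line_items: list) -> dict:
    """Roll billing line items up into the three cost-stack buckets.

    Assumes each item is a dict with 'workload_type' and 'amount' fields
    (illustrative; adapt to your cost export schema).
    """
    BUCKET_BY_TYPE = {
        "inference": "INFERENCE",
        "embedding": "INFERENCE",
        "compute": "INFRASTRUCTURE",
        "storage": "INFRASTRUCTURE",
        "networking": "INFRASTRUCTURE",
        "training": "DEVELOPMENT",
        "experimentation": "DEVELOPMENT",
    }
    totals = {"INFERENCE": 0.0, "INFRASTRUCTURE": 0.0, "DEVELOPMENT": 0.0}
    for item in line_items:
        bucket = BUCKET_BY_TYPE.get(item.get("workload_type"), "INFRASTRUCTURE")
        totals[bucket] += item["amount"]
    grand_total = sum(totals.values()) or 1.0
    return {b: f"${amt:,.2f} ({amt / grand_total:.0%})" for b, amt in totals.items()}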

LLM Pricing Comparison

API Pricing (Per 1M Tokens)

| Provider | Model | Input | Output | Context |
|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 128K |
| OpenAI | GPT-4 Turbo | $10.00 | $30.00 | 128K |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Anthropic | Claude 3 Haiku | $0.25 | $1.25 | 200K |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 | 1M |
| Google | Gemini 1.5 Flash | $0.075 | $0.30 | 1M |
| AWS Bedrock | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| AWS Bedrock | Llama 3.1 70B | $2.65 | $3.50 | 128K |
| Azure OpenAI | GPT-4o | $5.00 | $15.00 | 128K |
| OCI GenAI | Command R+ (DAC) | Included | Included | - |

Cost Per Query Estimation

class LLMCostCalculator:
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
        "claude-3-haiku": {"input": 0.25, "output": 1.25},
        "llama-3-70b": {"input": 2.65, "output": 3.50},
    }

    def calculate_query_cost(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int
    ) -> float:
        """Calculate cost for a single query in dollars"""
        pricing = self.PRICING[model]
        input_cost = (input_tokens / 1_000_000) * pricing["input"]
        output_cost = (output_tokens / 1_000_000) * pricing["output"]
        return input_cost + output_cost

    def calculate_monthly_cost(
        self,
        model: str,
        queries_per_day: int,
        avg_input_tokens: int,
        avg_output_tokens: int
    ) -> dict:
        """Estimate monthly costs"""
        daily_cost = self.calculate_query_cost(
            model,
            queries_per_day * avg_input_tokens,
            queries_per_day * avg_output_tokens
        )
        monthly_cost = daily_cost * 30

        return {
            "model": model,
            "daily_queries": queries_per_day,
            "daily_cost": f"${daily_cost:.2f}",
            "monthly_cost": f"${monthly_cost:.2f}",
            "annual_cost": f"${monthly_cost * 12:.2f}"
        }

# Example
calc = LLMCostCalculator()

# RAG chatbot: 10K queries/day, 2,000 input tokens, 500 output tokens
calc.calculate_monthly_cost("gpt-4o", 10000, 2000, 500)
# {'monthly_cost': '$3000.00', ...}

calc.calculate_monthly_cost("claude-3-haiku", 10000, 2000, 500)
# {'monthly_cost': '$337.50', ...}  # ~89% savings vs GPT-4o

Model Selection for Cost Optimization

Decision Matrix

┌─────────────────────────────────────────────────────────────────┐
│                MODEL SELECTION BY USE CASE                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  TASK COMPLEXITY     │ RECOMMENDED           │ COST/1K QUERIES  │
│  ────────────────────┼───────────────────────┼─────────────────  │
│  Simple Q&A          │ GPT-4o-mini, Haiku    │ $0.05 - $0.20    │
│  Classification      │ Haiku, Gemini Flash   │ $0.02 - $0.10    │
│  Summarization       │ GPT-4o-mini, Sonnet   │ $0.10 - $0.50    │
│  RAG (retrieval)     │ Sonnet, GPT-4o-mini   │ $0.20 - $1.00    │
│  Code generation     │ Sonnet, GPT-4o        │ $0.50 - $2.00    │
│  Complex reasoning   │ GPT-4o, Claude Opus   │ $1.00 - $5.00    │
│  Agent tasks         │ Sonnet, GPT-4o        │ $2.00 - $10.00   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Model Cascading Pattern

class ModelCascade:
    """Route to cheapest model that can handle the task"""

    def __init__(self):
        # cost: input $ per 1M tokens; capability: rough 0-1 quality score
        self.models = [
            {"name": "claude-3-haiku", "cost": 0.25, "capability": 0.7},
            {"name": "gpt-4o-mini", "cost": 0.15, "capability": 0.75},
            {"name": "claude-3-5-sonnet", "cost": 3.00, "capability": 0.95},
            {"name": "gpt-4o", "cost": 2.50, "capability": 0.98},
        ]

    async def route(self, query: str, complexity_score: float) -> str:
        """Route to appropriate model based on complexity"""
        for model in sorted(self.models, key=lambda x: x["cost"]):
            if model["capability"] >= complexity_score:
                return model["name"]
        return self.models[-1]["name"]  # Fallback to most capable

    async def cascade_with_fallback(self, query: str) -> dict:
        """Try cheap model first, escalate if needed.

        Assumes a call_model(name, query) helper (not shown) that returns
        a response carrying a provider-specific confidence score.
        """
        # Start with cheapest
        response = await self.call_model("claude-3-haiku", query)

        # Escalate to a stronger model when confidence is low
        if response.confidence < 0.8:
            response = await self.call_model("claude-3-5-sonnet", query)

        return response
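
route() takes a complexity_score but leaves its source open. A cheap keyword-and-length heuristic is a reasonable starting point before investing in a learned router; the thresholds and keywords below are illustrative, not tuned:

def estimate_complexity(query: str) -> float:
    """Crude 0-1 complexity score; a learned classifier or a small
    routing model would replace this in production."""
    q = query.lower()
    score = 0.3  # baseline: simple Q&A
    if len(query) > 500:
        score += 0.2  # long context tends to need stronger models
    if any(kw in q for kw in ("code", "implement", "debug", "refactor")):
        score += 0.3  # code generation
    if any(kw in q for kw in ("plan", "step by step", "analyze", "prove")):
        score += 0.2  # multi-step reasoning
    return min(score, 1.0)

# Usage (inside an async context):
# model = await ModelCascade().route(query, estimate_complexity(query))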

GPU Cost Optimization

GPU Pricing Comparison

| Provider | GPU | vCPUs | Memory | Hourly | Monthly |
|---|---|---|---|---|---|
| AWS | A10G | 4 | 24GB | $1.21 | $870 |
| AWS | A100 40GB | 12 | 192GB | $3.67 | $2,640 |
| AWS | H100 | 192 | 2TB | $12.36 | $8,900 |
| Azure | A10 | 6 | 112GB | $1.14 | $820 |
| Azure | A100 80GB | 24 | 220GB | $3.40 | $2,450 |
| GCP | A100 40GB | 12 | 85GB | $3.67 | $2,640 |
| OCI | A10 | 15 | 240GB | $1.00 | $720 |
| Lambda | A100 | 30 | 200GB | $1.29 | $930 |
| RunPod | A100 | - | 80GB | $1.89 | $1,360 |

Right-Sizing GPU Workloads

class GPUSizer:
    """Recommend GPU size based on model and workload"""

    GPU_MEMORY = {
        "A10G": 24,
        "L4": 24,
        "A100-40GB": 40,
        "A100-80GB": 80,
        "H100": 80,
    }

    MODEL_MEMORY = {
        # Model: (FP16 size GB, Quantized GB)
        "llama-3.1-8B": (16, 6),
        "llama-3.1-70B": (140, 42),
        "llama-3.1-405B": (810, 250),
        "mistral-7B": (14, 5),
        "mixtral-8x7B": (96, 32),
    }

    def recommend_gpu(
        self,
        model: str,
        batch_size: int = 1,
        use_quantization: bool = True
    ) -> dict:
        """Recommend GPU configuration"""
        base_mem, quant_mem = self.MODEL_MEMORY.get(model, (10, 4))
        model_mem = quant_mem if use_quantization else base_mem

        # Add overhead for KV cache and batch
        kv_cache_per_batch = 2  # GB per batch slot
        total_mem = model_mem + (kv_cache_per_batch * batch_size) + 2  # 2GB overhead

        # Find suitable GPU
        suitable_gpus = []
        for gpu, mem in self.GPU_MEMORY.items():
            if mem >= total_mem:
                suitable_gpus.append(gpu)

        if not suitable_gpus:
            # Need multi-GPU (ceiling division over 80GB cards)
            return {
                "recommendation": "multi-gpu",
                "min_gpus": (total_mem + 79) // 80,
                "gpu_type": "A100-80GB or H100"
            }

        return {
            "recommendation": suitable_gpus[0],
            "memory_required": f"{total_mem:.1f}GB",
            "batch_size": batch_size,
            "quantization": use_quantization
        }
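
Usage: a quantized 70B model with a small batch fits on a single 80GB card, while the FP16 variant needs tensor parallelism across multiple GPUs:

# Example
sizer = GPUSizer()
sizer.recommend_gpu("llama-3.1-70B", batch_size=4, use_quantization=True)
# {'recommendation': 'A100-80GB', 'memory_required': '52.0GB', ...}

sizer.recommend_gpu("llama-3.1-70B", batch_size=4, use_quantization=False)
# 140GB FP16 weights -> {'recommendation': 'multi-gpu', 'min_gpus': 2, ...}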

Commitment Strategies

Reserved Capacity Comparison

| Provider | Commitment | Discount | Term |
|---|---|---|---|
| Azure PTU | Provisioned Throughput | ~30% | Monthly |
| OCI DAC | Dedicated AI Cluster | Flat rate | Monthly |
| AWS Savings Plans | Compute | 20-30% | 1-3 years |
| GCP CUDs | Committed Use | 20-57% | 1-3 years |

Break-Even Analysis

def commitment_breakeven(
    on_demand_monthly: float,
    committed_monthly: float,
    commitment_term_months: int,
    upfront_cost: float = 0
) -> dict:
    """Calculate break-even point for commitments"""

    monthly_savings = on_demand_monthly - committed_monthly
    total_commitment_cost = (committed_monthly * commitment_term_months) + upfront_cost
    total_on_demand_cost = on_demand_monthly * commitment_term_months

    break_even_months = upfront_cost / monthly_savings if monthly_savings > 0 else float('inf')

    return {
        "monthly_savings": f"${monthly_savings:.2f}",
        "total_savings": f"${total_on_demand_cost - total_commitment_cost:.2f}",
        "break_even_months": round(break_even_months, 1),
        "roi_percentage": f"{((total_on_demand_cost - total_commitment_cost) / total_commitment_cost) * 100:.1f}%"
    }

# Example: Azure PTU commitment
commitment_breakeven(
    on_demand_monthly=5000,  # Pay-as-you-go
    committed_monthly=3500,  # PTU pricing
    commitment_term_months=12,
    upfront_cost=0
)
# {'monthly_savings': '$1500.00', 'total_savings': '$18000.00',
#  'break_even_months': 0.0, 'roi_percentage': '42.9%'}

Cost Monitoring & Alerts

Tagging Strategy

# Required tags for AI workloads
ai_cost_tags:
  mandatory:
    - project: "ai-platform"
    - environment: "prod/staging/dev"
    - cost_center: "engineering"
    - workload_type: "inference/training/embedding"
    - model: "gpt-4o/claude-3/llama-3"

  recommended:
    - team: "ml-platform"
    - owner: "email@company.com"
    - budget_code: "AI-2024-Q1"
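
Tags only pay off when they are enforced. A minimal validation sketch against the mandatory list above, suitable for a CI/CD gate or a nightly sweep (the function name is illustrative):

MANDATORY_TAGS = {"project", "environment", "cost_center", "workload_type", "model"}

def missing_mandatory_tags(resource_tags: dict) -> list:
    """Return mandatory tag keys absent from a resource's tag map."""
    return sorted(MANDATORY_TAGS - set(resource_tags))

# missing_mandatory_tags({"project": "ai-platform", "environment": "prod"})
# -> ['cost_center', 'model', 'workload_type']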

Budget Alerts

# Terraform for AWS Budget Alert
resource "aws_budgets_budget" "ai_monthly" {
  name         = "ai-platform-monthly"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = ["user:project$ai-platform"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["finops@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["finops@company.com", "engineering@company.com"]
  }
}

Cost Dashboard Metrics

FINOPS_METRICS = {
    # Cost metrics
    "cost_per_query": "Total cost / number of queries",
    "cost_per_token": "Total cost / tokens processed",
    "cost_per_user": "Total cost / active users",
    "cost_efficiency": "Output value / total cost",

    # Utilization metrics
    "gpu_utilization": "Active GPU time / provisioned GPU time",
    "api_efficiency": "Successful calls / total calls",
    "cache_hit_rate": "Cached responses / total requests",

    # Optimization metrics
    "model_routing_savings": "Baseline cost - actual cost",
    "commitment_utilization": "Committed capacity used / purchased",
    "spot_savings": "On-demand equivalent - actual spot cost"
}
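
These definitions map directly onto counters that most gateways and billing exports already produce. A sketch of the rollup; the counter names in the raw dict are illustrative:

def compute_finops_metrics(raw: dict) -> dict:
    """Compute headline dashboard metrics from raw usage counters."""
    return {
        "cost_per_query": raw["total_cost"] / max(raw["query_count"], 1),
        "cost_per_token": raw["total_cost"] / max(raw["tokens_processed"], 1),
        "gpu_utilization": raw["active_gpu_hours"] / max(raw["provisioned_gpu_hours"], 1),
        "cache_hit_rate": raw["cached_responses"] / max(raw["total_requests"], 1),
    }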

Cost Optimization Techniques

1. Prompt Engineering for Cost

class CostAwarePrompting:
    """Optimize prompts for cost efficiency"""

    def optimize_prompt(self, prompt: str) -> str:
        """Reduce prompt tokens while maintaining quality"""
        # Remove redundant whitespace
        optimized = ' '.join(prompt.split())

        # Trim common filler phrases that add tokens without meaning
        optimized = optimized.replace("Please provide", "Provide")
        optimized = optimized.replace("I would like you to ", "")
        optimized = optimized.replace("Can you please ", "")

        return optimized

    def get_prompt_signature(self, req: str) -> str:
        """Grouping key for batching; placeholder that keys on the
        normalized prompt prefix (real systems hash the template or
        cluster by embedding)."""
        return ' '.join(req.split())[:64]

    def batch_similar_requests(self, requests: list) -> list:
        """Batch similar requests to reduce overhead"""
        # Group by similar prompts
        batches = {}
        for req in requests:
            key = self.get_prompt_signature(req)
            if key not in batches:
                batches[key] = []
            batches[key].append(req)

        return list(batches.values())
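
Filler-phrase trimming looks trivial per call but compounds with volume. A hedged back-of-envelope using the common ~4 characters-per-token approximation (use a real tokenizer such as tiktoken for actual accounting):

# Rough impact estimate (~4 chars per token heuristic)
optimizer = CostAwarePrompting()
before = "I would like you to Please provide a concise summary of the following text."
after = optimizer.optimize_prompt(before)
tokens_saved = (len(before) - len(after)) // 4  # ~7 tokens per call
# At 10K calls/day, that is roughly 2M fewer input tokens per month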

2. Caching Strategy

import hashlib

class SemanticCache:
    """Cache LLM responses by semantic similarity"""

    def __init__(self, similarity_threshold: float = 0.95):
        self.cache = {}
        self.threshold = similarity_threshold

    def get_cache_key(self, prompt: str) -> str:
        """Generate cache key from prompt"""
        return hashlib.sha256(prompt.encode()).hexdigest()

    async def find_similar(self, prompt: str):
        """Placeholder: a real implementation embeds the prompt and runs a
        vector search over cached entries, returning any hit whose
        similarity exceeds self.threshold. This sketch is exact-match only."""
        return None

    async def get_or_generate(
        self,
        prompt: str,
        generate_fn,
        ttl_seconds: int = 3600  # expiry handling omitted in this sketch
    ):
        """Return cached response or generate new one"""
        cache_key = self.get_cache_key(prompt)

        # Check exact match
        if cache_key in self.cache:
            return self.cache[cache_key]

        # Check semantic similarity
        similar = await self.find_similar(prompt)
        if similar:
            return similar

        # Generate new response
        response = await generate_fn(prompt)
        self.cache[cache_key] = response
        return response

    # Cache hit rates: 30-60% typical for production workloads
    # Cost savings: 30-50% on inference costs

3. Spot/Preemptible Instances

class SpotInstanceStrategy:
    """Manage spot instances for AI workloads"""

    SPOT_SAVINGS = {
        "aws": 0.70,  # 70% savings typical
        "azure": 0.60,
        "gcp": 0.65,
    }

    def recommend_spot_strategy(self, workload_type: str) -> dict:
        """Recommend spot usage based on workload"""
        strategies = {
            "batch_inference": {
                "spot_eligible": True,
                "percentage": 100,
                "reason": "Interruptible, can retry"
            },
            "training": {
                "spot_eligible": True,
                "percentage": 80,
                "reason": "Checkpoint frequently, retry on interrupt"
            },
            "real_time_inference": {
                "spot_eligible": False,
                "percentage": 0,
                "reason": "Latency-sensitive, needs reliability"
            },
            "dev_environment": {
                "spot_eligible": True,
                "percentage": 100,
                "reason": "Non-critical, cost optimization priority"
            }
        }
        return strategies.get(workload_type, {"spot_eligible": False})
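
To turn a recommendation into a dollar figure, blend spot and on-demand rates by the recommended spot share, using the SPOT_SAVINGS table above. A sketch; the helper name and example rate are illustrative:

def blended_hourly_cost(
    strategy: SpotInstanceStrategy,
    provider: str,
    on_demand_hourly: float,
    workload_type: str
) -> float:
    """Blend spot and on-demand hourly rates by the recommended spot share."""
    rec = strategy.recommend_spot_strategy(workload_type)
    spot_share = rec.get("percentage", 0) / 100
    spot_hourly = on_demand_hourly * (1 - strategy.SPOT_SAVINGS[provider])
    return spot_share * spot_hourly + (1 - spot_share) * on_demand_hourly

# Example: training on AWS at $3.67/hr on-demand, 80% spot
# blended_hourly_cost(SpotInstanceStrategy(), "aws", 3.67, "training")
# -> 0.8 * $1.10 + 0.2 * $3.67 = ~$1.61/hr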

Multi-Cloud Cost Arbitrage

Provider Selection by Cost

class MultiCloudCostRouter:
    """Route workloads to cheapest provider"""

    PROVIDER_COSTS = {
        "embedding": {
            "aws_titan": 0.0001,
            "azure_ada": 0.0001,
            "cohere": 0.0001,
            "openai": 0.00013,
        },
        "chat": {
            "aws_claude_haiku": 0.00025,
            "azure_gpt35": 0.0005,
            "openai_gpt4o_mini": 0.00015,
        }
    }

    def get_cheapest_provider(self, task_type: str) -> tuple:
        """Return cheapest provider for task"""
        costs = self.PROVIDER_COSTS.get(task_type, {})
        if not costs:
            return None, None

        cheapest = min(costs.items(), key=lambda x: x[1])
        return cheapest

    def calculate_arbitrage_savings(
        self,
        task_type: str,
        current_provider: str,
        current_cost: float,
        daily_volume: int
    ) -> list:
        """Calculate monthly savings from switching providers for one task type"""
        alternatives = []
        for provider, cost in self.PROVIDER_COSTS.get(task_type, {}).items():
            if provider != current_provider and cost < current_cost:
                monthly_savings = (current_cost - cost) * daily_volume * 30
                alternatives.append({
                    "provider": provider,
                    "cost": cost,
                    "monthly_savings": monthly_savings
                })

        # Sort numerically, then format for display
        alternatives.sort(key=lambda x: x["monthly_savings"], reverse=True)
        for alt in alternatives:
            alt["monthly_savings"] = f"${alt['monthly_savings']:.2f}"
        return alternatives
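
A quick check against the cost table above:

# Example
router = MultiCloudCostRouter()
router.get_cheapest_provider("chat")
# ('openai_gpt4o_mini', 0.00015)

router.calculate_arbitrage_savings("chat", "azure_gpt35", 0.0005, daily_volume=100_000)
# Switching away from GPT-3.5 on Azure at 100K requests/day surfaces
# gpt-4o-mini (~$1,050/mo savings) ahead of Bedrock Haiku (~$750/mo)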

FinOps Maturity Model

┌─────────────────────────────────────────────────────────────────┐
│                  AI FINOPS MATURITY LEVELS                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  LEVEL 1: CRAWL                                                 │
│  ├── Basic cost visibility                                      │
│  ├── Manual cost tracking                                       │
│  └── Simple tagging                                             │
│                                                                  │
│  LEVEL 2: WALK                                                  │
│  ├── Automated cost allocation                                  │
│  ├── Budget alerts                                              │
│  ├── Model selection guidelines                                 │
│  └── Basic optimization (caching, batching)                     │
│                                                                  │
│  LEVEL 3: RUN                                                   │
│  ├── Real-time cost dashboards                                  │
│  ├── Automated cost anomaly detection                           │
│  ├── Commitment management                                      │
│  ├── Multi-cloud cost optimization                              │
│  └── Cost-aware model routing                                   │
│                                                                  │
│  LEVEL 4: FLY                                                   │
│  ├── Predictive cost modeling                                   │
│  ├── Automated scaling based on cost/performance                │
│  ├── Business value attribution                                 │
│  └── Continuous optimization loops                              │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
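
Level 3's "automated cost anomaly detection" can start as simple as a z-score check on daily spend before graduating to seasonality-aware models. A minimal sketch under that assumption:

from statistics import mean, stdev

def detect_cost_anomaly(daily_costs: list, z_threshold: float = 3.0) -> bool:
    """Flag the latest day's cost if it deviates more than z_threshold
    standard deviations from the trailing window."""
    history, today = daily_costs[:-1], daily_costs[-1]
    if len(history) < 7:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

# Example: steady ~$300/day, then a $900 spike
# detect_cost_anomaly([300, 310, 295, 305, 290, 300, 315, 900])  # -> True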

Resources