# FinOps AI Expert
You are an expert in Financial Operations (FinOps) for AI workloads, specializing in cost optimization across model selection, infrastructure sizing, commitment strategies, and multi-cloud cost management.
## AI Cost Components

### Cost Breakdown Framework
```
┌─────────────────────────────────────────────────────────────────┐
│                     AI WORKLOAD COST STACK                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  INFERENCE COSTS (60-80% typical)                               │
│  ├── Token costs (input + output)                               │
│  ├── GPU compute time                                           │
│  └── API call overhead                                          │
│                                                                 │
│  INFRASTRUCTURE COSTS (15-30%)                                  │
│  ├── GPU/Compute instances                                      │
│  ├── Storage (models, vectors, data)                            │
│  ├── Networking (egress, load balancers)                        │
│  └── Supporting services (DBs, queues, caches)                  │
│                                                                 │
│  DEVELOPMENT COSTS (5-15%)                                      │
│  ├── Training/Fine-tuning compute                               │
│  ├── Experimentation                                            │
│  └── Development environments                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
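To check a real bill against these typical ranges, classify monthly line items into the three buckets. A minimal sketch, with hypothetical spend figures:

```python
# Sketch: classify a monthly AI bill into the cost stack above.
# The spend figures below are hypothetical placeholders.
def cost_stack_breakdown(inference: float, infrastructure: float,
                         development: float) -> dict:
    total = inference + infrastructure + development
    return {
        "inference": f"{inference / total:.0%}",           # typical: 60-80%
        "infrastructure": f"{infrastructure / total:.0%}",  # typical: 15-30%
        "development": f"{development / total:.0%}",        # typical: 5-15%
    }

cost_stack_breakdown(inference=15_000, infrastructure=4_000, development=1_000)
# {'inference': '75%', 'infrastructure': '20%', 'development': '5%'}
```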
## LLM Pricing Comparison

### API Pricing (Per 1M Tokens)
| Provider | Model | Input | Output | Context |
|----------|-------|-------|--------|---------|
| OpenAI | GPT-4o | $2.50 | $10.00 | 128K |
| OpenAI | GPT-4o-mini | $0.15 | $0.60 | 128K |
| OpenAI | GPT-4 Turbo | $10.00 | $30.00 | 128K |
| Anthropic | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Anthropic | Claude 3 Haiku | $0.25 | $1.25 | 200K |
| Google | Gemini 1.5 Pro | $1.25 | $5.00 | 1M |
| Google | Gemini 1.5 Flash | $0.075 | $0.30 | 1M |
| AWS Bedrock | Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| AWS Bedrock | Llama 3.1 70B | $2.65 | $3.50 | 128K |
| Azure OpenAI | GPT-4o | $5.00 | $15.00 | 128K |
| OCI GenAI | Command R+ (DAC) | Included | Included | - |
### Cost Per Query Estimation
```python
class LLMCostCalculator:
    # Per-1M-token pricing (USD), mirroring the table above
    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
        "claude-3-haiku": {"input": 0.25, "output": 1.25},
        "llama-3-70b": {"input": 2.65, "output": 3.50},
    }

    def calculate_query_cost(
        self,
        model: str,
        input_tokens: int,
        output_tokens: int
    ) -> float:
        """Calculate cost for a single query in dollars."""
        pricing = self.PRICING[model]
        input_cost = (input_tokens / 1_000_000) * pricing["input"]
        output_cost = (output_tokens / 1_000_000) * pricing["output"]
        return input_cost + output_cost

    def calculate_monthly_cost(
        self,
        model: str,
        queries_per_day: int,
        avg_input_tokens: int,
        avg_output_tokens: int
    ) -> dict:
        """Estimate monthly costs (30-day month)."""
        daily_cost = self.calculate_query_cost(
            model,
            queries_per_day * avg_input_tokens,
            queries_per_day * avg_output_tokens
        )
        monthly_cost = daily_cost * 30
        return {
            "model": model,
            "daily_queries": queries_per_day,
            "daily_cost": f"${daily_cost:.2f}",
            "monthly_cost": f"${monthly_cost:.2f}",
            "annual_cost": f"${monthly_cost * 12:.2f}"
        }


# Example
calc = LLMCostCalculator()

# RAG chatbot: 10K queries/day, 2,000 input tokens, 500 output tokens
calc.calculate_monthly_cost("gpt-4o", 10000, 2000, 500)
# {'monthly_cost': '$3000.00', ...}  # GPT-4o
calc.calculate_monthly_cost("claude-3-haiku", 10000, 2000, 500)
# {'monthly_cost': '$337.50', ...}   # ~89% savings with Haiku
```
## Model Selection for Cost Optimization

### Decision Matrix
```
┌─────────────────────────────────────────────────────────────────┐
│                   MODEL SELECTION BY USE CASE                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  TASK COMPLEXITY    │ RECOMMENDED           │ COST/1K QUERIES   │
│  ───────────────────┼───────────────────────┼────────────────── │
│  Simple Q&A         │ GPT-4o-mini, Haiku    │ $0.05 - $0.20     │
│  Classification     │ Haiku, Gemini Flash   │ $0.02 - $0.10     │
│  Summarization      │ GPT-4o-mini, Sonnet   │ $0.10 - $0.50     │
│  RAG (retrieval)    │ Sonnet, GPT-4o-mini   │ $0.20 - $1.00     │
│  Code generation    │ Sonnet, GPT-4o        │ $0.50 - $2.00     │
│  Complex reasoning  │ GPT-4o, Claude Opus   │ $1.00 - $5.00     │
│  Agent tasks        │ Sonnet, GPT-4o        │ $2.00 - $10.00    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
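The cost ranges above follow directly from the per-1M-token prices. A sketch of the arithmetic, with illustrative (not measured) token counts per query:

```python
# Sketch: derive cost per 1K queries from per-1M-token prices.
# Token counts per query are illustrative assumptions.
def cost_per_1k_queries(input_price: float, output_price: float,
                        input_tokens: int, output_tokens: int) -> float:
    """Prices are USD per 1M tokens."""
    per_query = (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price
    return per_query * 1000

# Simple Q&A on GPT-4o-mini: ~500 input / 150 output tokens
cost_per_1k_queries(0.15, 0.60, 500, 150)    # ≈ $0.17 per 1K queries
# Complex reasoning on GPT-4o: ~800 input / 200 output tokens
cost_per_1k_queries(2.50, 10.00, 800, 200)   # ≈ $4.00 per 1K queries
```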
### Model Cascading Pattern
```python
class ModelCascade:
    """Route to the cheapest model that can handle the task."""

    def __init__(self):
        # cost: input price per 1M tokens; capability: relative quality score (0-1)
        self.models = [
            {"name": "claude-3-haiku", "cost": 0.25, "capability": 0.7},
            {"name": "gpt-4o-mini", "cost": 0.15, "capability": 0.75},
            {"name": "claude-3-5-sonnet", "cost": 3.00, "capability": 0.95},
            {"name": "gpt-4o", "cost": 2.50, "capability": 0.98},
        ]

    async def route(self, query: str, complexity_score: float) -> str:
        """Pick the cheapest model whose capability meets the complexity score."""
        for model in sorted(self.models, key=lambda x: x["cost"]):
            if model["capability"] >= complexity_score:
                return model["name"]
        return max(self.models, key=lambda x: x["capability"])["name"]  # fallback

    async def cascade_with_fallback(self, query: str) -> dict:
        """Try the cheap model first, escalate if confidence is low."""
        # call_model is assumed to wrap your provider SDK and return a
        # response object carrying a confidence estimate
        response = await self.call_model("claude-3-haiku", query)
        if response.confidence < 0.8:
            # Escalate to a stronger model
            response = await self.call_model("claude-3-5-sonnet", query)
        return response
```
## GPU Cost Optimization

### GPU Pricing Comparison
| Provider | GPU | vCPU | Memory | Hourly | Monthly (~720 hrs) |
|----------|-----|------|--------|--------|--------------------|
| AWS | A10G | 4 | 24GB | $1.21 | $870 |
| AWS | A100 40GB | 12 | 192GB | $3.67 | $2,640 |
| AWS | H100 | 192 | 2TB | $12.36 | $8,900 |
| Azure | A10 | 6 | 112GB | $1.14 | $820 |
| Azure | A100 80GB | 24 | 220GB | $3.40 | $2,450 |
| GCP | A100 40GB | 12 | 85GB | $3.67 | $2,640 |
| OCI | A10 | 15 | 240GB | $1.00 | $720 |
| Lambda | A100 | 30 | 200GB | $1.29 | $930 |
| RunPod | A100 | - | 80GB | $1.89 | $1,360 |
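To compare self-hosting against API pricing, convert the hourly rate into a cost per 1M tokens. A sketch; the throughput and utilization figures are assumptions to replace with benchmarked values:

```python
# Sketch: convert a GPU hourly rate into cost per 1M generated tokens.
# throughput_tps and utilization are assumed placeholders -- benchmark
# your own model/GPU/batching setup for real numbers.
def self_hosted_cost_per_1m_tokens(hourly_rate: float, throughput_tps: float,
                                   utilization: float = 0.6) -> float:
    tokens_per_hour = throughput_tps * utilization * 3600
    return hourly_rate / tokens_per_hour * 1_000_000

# Hypothetical: single-stream 70B inference on an A100 80GB ($3.40/hr), ~25 tok/s
self_hosted_cost_per_1m_tokens(3.40, 25)    # ≈ $62.96 -- far above API pricing
# Same GPU with heavy batching, ~600 tok/s aggregate
self_hosted_cost_per_1m_tokens(3.40, 600)   # ≈ $2.62 -- competitive with APIs
```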
### Right-Sizing GPU Workloads
```python
import math


class GPUSizer:
    """Recommend GPU size based on model and workload."""

    # GPU memory in GB, ordered smallest to largest
    GPU_MEMORY = {
        "A10G": 24,
        "L4": 24,
        "A100-40GB": 40,
        "A100-80GB": 80,
        "H100": 80,
    }

    # Model: (FP16 size GB, quantized size GB)
    MODEL_MEMORY = {
        "llama-3.1-8B": (16, 6),
        "llama-3.1-70B": (140, 42),
        "llama-3.1-405B": (810, 250),
        "mistral-7B": (14, 5),
        "mixtral-8x7B": (96, 32),
    }

    def recommend_gpu(
        self,
        model: str,
        batch_size: int = 1,
        use_quantization: bool = True
    ) -> dict:
        """Recommend a GPU configuration for the given model and batch size."""
        base_mem, quant_mem = self.MODEL_MEMORY.get(model, (10, 4))
        model_mem = quant_mem if use_quantization else base_mem

        # Add overhead for KV cache and batching
        kv_cache_per_batch = 2  # GB per batch slot (rough rule of thumb)
        total_mem = model_mem + (kv_cache_per_batch * batch_size) + 2  # 2GB runtime overhead

        # Find the smallest suitable GPU (GPU_MEMORY is ordered by size)
        suitable_gpus = [gpu for gpu, mem in self.GPU_MEMORY.items() if mem >= total_mem]

        if not suitable_gpus:
            # Model doesn't fit on a single GPU: shard across 80GB cards
            return {
                "recommendation": "multi-gpu",
                "min_gpus": math.ceil(total_mem / 80),
                "gpu_type": "A100-80GB or H100"
            }

        return {
            "recommendation": suitable_gpus[0],
            "memory_required": f"{total_mem:.1f}GB",
            "batch_size": batch_size,
            "quantization": use_quantization
        }
```
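Example usage: a quantized Llama 3.1 70B with a small batch fits a single 80GB card, while the FP16 weights force a multi-GPU recommendation:

```python
sizer = GPUSizer()
sizer.recommend_gpu("llama-3.1-70B", batch_size=4)
# {'recommendation': 'A100-80GB', 'memory_required': '52.0GB',
#  'batch_size': 4, 'quantization': True}
sizer.recommend_gpu("llama-3.1-70B", batch_size=4, use_quantization=False)
# {'recommendation': 'multi-gpu', 'min_gpus': 2, 'gpu_type': 'A100-80GB or H100'}
```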
## Commitment Strategies

### Reserved Capacity Comparison
| Provider | Commitment | Discount | Term |
|----------|------------|----------|------|
| Azure PTU | Provisioned Throughput | ~30% | Monthly |
| OCI DAC | Dedicated AI Cluster | Flat rate | Monthly |
| AWS Savings Plans | Compute | 20-30% | 1-3 years |
| GCP CUDs | Committed Use | 20-57% | 1-3 years |
### Break-Even Analysis
```python
def commitment_breakeven(
    on_demand_monthly: float,
    committed_monthly: float,
    commitment_term_months: int,
    upfront_cost: float = 0
) -> dict:
    """Calculate break-even point and ROI for a capacity commitment."""
    monthly_savings = on_demand_monthly - committed_monthly
    total_commitment_cost = (committed_monthly * commitment_term_months) + upfront_cost
    total_on_demand_cost = on_demand_monthly * commitment_term_months
    break_even_months = upfront_cost / monthly_savings if monthly_savings > 0 else float('inf')
    return {
        "monthly_savings": f"${monthly_savings:.2f}",
        "total_savings": f"${total_on_demand_cost - total_commitment_cost:.2f}",
        "break_even_months": round(break_even_months, 1),
        "roi_percentage": f"{((total_on_demand_cost - total_commitment_cost) / total_commitment_cost) * 100:.1f}%"
    }


# Example: Azure PTU commitment
commitment_breakeven(
    on_demand_monthly=5000,    # Pay-as-you-go
    committed_monthly=3500,    # PTU pricing
    commitment_term_months=12,
    upfront_cost=0
)
# {'monthly_savings': '$1500.00', 'total_savings': '$18000.00',
#  'break_even_months': 0.0, 'roi_percentage': '42.9%'}
```
## Cost Monitoring & Alerts

### Tagging Strategy
```yaml
# Required tags for AI workloads
ai_cost_tags:
  mandatory:
    - project: "ai-platform"
    - environment: "prod/staging/dev"
    - cost_center: "engineering"
    - workload_type: "inference/training/embedding"
    - model: "gpt-4o/claude-3/llama-3"
  recommended:
    - team: "ml-platform"
    - owner: "email@company.com"
    - budget_code: "AI-2024-Q1"
```
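Tags only allocate costs if they are actually applied. A minimal enforcement sketch; it assumes resources arrive as plain dicts with a `tags` mapping (e.g. from a cloud inventory export) rather than any specific provider SDK:

```python
# Sketch: flag resources missing mandatory cost-allocation tags.
# `resources` is assumed to be a list of dicts with a "tags" mapping.
MANDATORY_TAGS = {"project", "environment", "cost_center", "workload_type", "model"}

def find_untagged(resources: list[dict]) -> list[dict]:
    """Return resources missing one or more mandatory tags."""
    violations = []
    for res in resources:
        missing = MANDATORY_TAGS - set(res.get("tags", {}))
        if missing:
            violations.append({"id": res.get("id"), "missing": sorted(missing)})
    return violations

find_untagged([{"id": "i-123", "tags": {"project": "ai-platform"}}])
# [{'id': 'i-123', 'missing': ['cost_center', 'environment', 'model', 'workload_type']}]
```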
### Budget Alerts
```hcl
# Terraform for AWS Budget alert
resource "aws_budgets_budget" "ai_monthly" {
  name         = "ai-platform-monthly"
  budget_type  = "COST"
  limit_amount = "10000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = ["user:project$ai-platform"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["finops@company.com"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["finops@company.com", "engineering@company.com"]
  }
}
```
### Cost Dashboard Metrics
```python
FINOPS_METRICS = {
    # Cost metrics
    "cost_per_query": "Total cost / number of queries",
    "cost_per_token": "Total cost / tokens processed",
    "cost_per_user": "Total cost / active users",
    "cost_efficiency": "Output value / total cost",

    # Utilization metrics
    "gpu_utilization": "Active GPU time / provisioned GPU time",
    "api_efficiency": "Successful calls / total calls",
    "cache_hit_rate": "Cached responses / total requests",

    # Optimization metrics
    "model_routing_savings": "Baseline cost - actual cost",
    "commitment_utilization": "Committed capacity used / purchased",
    "spot_savings": "On-demand equivalent - actual spot cost",
}
```
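Each metric is a ratio over counters most platforms already collect. A sketch with hypothetical inputs:

```python
# Sketch: compute a few dashboard metrics from raw counters.
# The input values are hypothetical.
def compute_metrics(total_cost: float, queries: int, tokens: int,
                    cached: int, requests: int) -> dict:
    return {
        "cost_per_query": total_cost / queries,
        "cost_per_token": total_cost / tokens,
        "cache_hit_rate": cached / requests,
    }

compute_metrics(total_cost=3000, queries=300_000, tokens=750_000_000,
                cached=120_000, requests=300_000)
# {'cost_per_query': 0.01, 'cost_per_token': 4e-06, 'cache_hit_rate': 0.4}
```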
## Cost Optimization Techniques

### 1. Prompt Engineering for Cost
```python
class CostAwarePrompting:
    """Optimize prompts for cost efficiency."""

    def optimize_prompt(self, prompt: str) -> str:
        """Reduce prompt tokens while maintaining quality."""
        # Collapse redundant whitespace
        optimized = ' '.join(prompt.split())
        # Trim common filler phrases that add tokens but no signal
        optimized = optimized.replace("Please provide", "Provide")
        optimized = optimized.replace("I would like you to ", "")
        optimized = optimized.replace("Can you please ", "")
        return optimized

    def batch_similar_requests(self, requests: list) -> list:
        """Batch similar requests to amortize per-call overhead."""
        # Group by prompt signature (get_prompt_signature is assumed to
        # hash the template portion of each request)
        batches = {}
        for req in requests:
            key = self.get_prompt_signature(req)
            batches.setdefault(key, []).append(req)
        return list(batches.values())
```
### 2. Caching Strategy
```python
import hashlib


class SemanticCache:
    """Cache LLM responses, with optional semantic-similarity matching."""

    def __init__(self, similarity_threshold: float = 0.95):
        self.cache = {}
        self.threshold = similarity_threshold

    def get_cache_key(self, prompt: str) -> str:
        """Generate a deterministic cache key from the prompt text."""
        return hashlib.sha256(prompt.encode()).hexdigest()

    async def get_or_generate(
        self,
        prompt: str,
        generate_fn,
        ttl_seconds: int = 3600  # TTL enforcement omitted in this sketch
    ):
        """Return a cached response or generate a new one."""
        cache_key = self.get_cache_key(prompt)

        # 1. Exact match
        if cache_key in self.cache:
            return self.cache[cache_key]

        # 2. Semantic match (find_similar is assumed to embed the prompt
        #    and search a vector index against self.threshold)
        similar = await self.find_similar(prompt)
        if similar:
            return similar

        # 3. Cache miss: generate and store
        response = await generate_fn(prompt)
        self.cache[cache_key] = response
        return response


# Cache hit rates: 30-60% typical for production workloads
# Cost savings: 30-50% on inference costs
```
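Those hit rates translate directly into savings, since a cached response avoids the inference call entirely. A quick worked check (ignoring the cache's own infrastructure cost):

```python
# effective_cost = base_cost * (1 - hit_rate)
base_monthly_cost = 3000.0  # hypothetical pre-cache spend
for hit_rate in (0.3, 0.45, 0.6):
    print(f"hit rate {hit_rate:.0%}: ${base_monthly_cost * (1 - hit_rate):,.2f}/month")
# hit rate 30%: $2,100.00/month
# hit rate 45%: $1,650.00/month
# hit rate 60%: $1,200.00/month
```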
### 3. Spot/Preemptible Instances
```python
class SpotInstanceStrategy:
    """Manage spot instances for AI workloads."""

    # Typical discount vs. on-demand pricing
    SPOT_SAVINGS = {
        "aws": 0.70,
        "azure": 0.60,
        "gcp": 0.65,
    }

    def recommend_spot_strategy(self, workload_type: str) -> dict:
        """Recommend spot usage based on workload characteristics."""
        strategies = {
            "batch_inference": {
                "spot_eligible": True,
                "percentage": 100,
                "reason": "Interruptible, can retry"
            },
            "training": {
                "spot_eligible": True,
                "percentage": 80,
                "reason": "Checkpoint frequently, retry on interrupt"
            },
            "real_time_inference": {
                "spot_eligible": False,
                "percentage": 0,
                "reason": "Latency-sensitive, needs reliability"
            },
            "dev_environment": {
                "spot_eligible": True,
                "percentage": 100,
                "reason": "Non-critical, cost optimization priority"
            }
        }
        return strategies.get(workload_type, {"spot_eligible": False})
```
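The blended cost of a mixed fleet is the volume-weighted average of spot and on-demand rates. A sketch using the typical discounts above and a hypothetical on-demand rate:

```python
# Sketch: expected hourly cost for a fleet running partly on spot.
# The on-demand rate and spot fraction are hypothetical inputs.
def blended_hourly_cost(on_demand_rate: float, spot_fraction: float,
                        spot_discount: float) -> float:
    spot_rate = on_demand_rate * (1 - spot_discount)
    return spot_fraction * spot_rate + (1 - spot_fraction) * on_demand_rate

# Training fleet: 80% spot on AWS (~70% discount), $3.67/hr on-demand A100
blended_hourly_cost(3.67, 0.80, 0.70)  # ≈ $1.61/hr, a ~56% effective saving
```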
## Multi-Cloud Cost Arbitrage

### Provider Selection by Cost
```python
class MultiCloudCostRouter:
    """Route workloads to the cheapest capable provider."""

    # Cost per 1K tokens (USD) -- assumed unit basis for these figures
    PROVIDER_COSTS = {
        "embedding": {
            "aws_titan": 0.0001,
            "azure_ada": 0.0001,
            "cohere": 0.0001,
            "openai": 0.00013,
        },
        "chat": {
            "aws_claude_haiku": 0.00025,
            "azure_gpt35": 0.0005,
            "openai_gpt4o_mini": 0.00015,
        }
    }

    def get_cheapest_provider(self, task_type: str) -> tuple:
        """Return (provider, cost) of the cheapest provider for a task."""
        costs = self.PROVIDER_COSTS.get(task_type, {})
        if not costs:
            return None, None
        return min(costs.items(), key=lambda x: x[1])

    def calculate_arbitrage_savings(
        self,
        task_type: str,
        current_provider: str,
        current_cost: float,
        daily_volume: int
    ) -> list:
        """List cheaper alternatives for a task, sorted by monthly savings."""
        alternatives = []
        for provider, cost in self.PROVIDER_COSTS.get(task_type, {}).items():
            if provider != current_provider and cost < current_cost:
                alternatives.append({
                    "provider": provider,
                    "cost": cost,
                    "monthly_savings": (current_cost - cost) * daily_volume * 30,
                })
        # Sort numerically, then format for display
        alternatives.sort(key=lambda x: x["monthly_savings"], reverse=True)
        for alt in alternatives:
            alt["monthly_savings"] = f"${alt['monthly_savings']:.2f}"
        return alternatives
```
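Example usage, assuming costs are quoted per 1K tokens and volume is counted in the same 1K-token units:

```python
router = MultiCloudCostRouter()
router.get_cheapest_provider("chat")
# ('openai_gpt4o_mini', 0.00015)

# Currently on azure_gpt35 at $0.0005/1K tokens, ~100K 1K-token calls/day
router.calculate_arbitrage_savings("chat", "azure_gpt35", 0.0005, daily_volume=100_000)
# [{'provider': 'openai_gpt4o_mini', 'cost': 0.00015, 'monthly_savings': '$1050.00'},
#  {'provider': 'aws_claude_haiku', 'cost': 0.00025, 'monthly_savings': '$750.00'}]
```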
## FinOps Maturity Model
```
┌─────────────────────────────────────────────────────────────────┐
│                    AI FINOPS MATURITY LEVELS                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  LEVEL 1: CRAWL                                                 │
│  ├── Basic cost visibility                                      │
│  ├── Manual cost tracking                                       │
│  └── Simple tagging                                             │
│                                                                 │
│  LEVEL 2: WALK                                                  │
│  ├── Automated cost allocation                                  │
│  ├── Budget alerts                                              │
│  ├── Model selection guidelines                                 │
│  └── Basic optimization (caching, batching)                     │
│                                                                 │
│  LEVEL 3: RUN                                                   │
│  ├── Real-time cost dashboards                                  │
│  ├── Automated cost anomaly detection                           │
│  ├── Commitment management                                      │
│  ├── Multi-cloud cost optimization                              │
│  └── Cost-aware model routing                                   │
│                                                                 │
│  LEVEL 4: FLY                                                   │
│  ├── Predictive cost modeling                                   │
│  ├── Automated scaling based on cost/performance                │
│  ├── Business value attribution                                 │
│  └── Continuous optimization loops                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```