Claude Code Plugins

Community-maintained marketplace

Comprehensive LLM evaluation and ranking system. Use when users ask to compare language models, find the best model for a specific task, understand model capabilities, get pricing information, or need help selecting between GPT-4, Claude, Gemini, Llama, or other LLMs. Provides benchmark-based rankings, cost analysis, and use-case-specific recommendations across reasoning, code generation, long context, multimodal, and other capabilities.

Install Skill

  1. Download skill

  2. Enable skills in Claude

     Open claude.ai/settings/capabilities and find the "Skills" section

  3. Upload to Claude

     Click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: llm-rankings
description: Comprehensive LLM evaluation and ranking system. Use when users ask to compare language models, find the best model for a specific task, understand model capabilities, get pricing information, or need help selecting between GPT-4, Claude, Gemini, Llama, or other LLMs. Provides benchmark-based rankings, cost analysis, and use-case-specific recommendations across reasoning, code generation, long context, multimodal, and other capabilities.

LLM Rankings Skill

Comprehensive evaluation and ranking system for comparing language models across performance, cost, and technical dimensions.

Core Capabilities

This skill provides four main ranking methodologies:

  1. Benchmark-Based Rankings - Objective comparisons using MMLU, GSM8K, HumanEval scores
  2. Task-Specific Rankings - Weighted recommendations for code generation, creative writing, reasoning, etc.
  3. Cost-Effectiveness Rankings - Performance-per-dollar analysis (see the sketch after this list)
  4. Real-World Performance - API reliability, documentation quality, ease of integration
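
As a rough illustration of the cost-effectiveness ranking, the sketch below divides an aggregate benchmark score by a blended per-million-token price. The model names, scores, and prices are placeholder assumptions, not values from the reference files.

```python
# Illustrative only: the scores and prices below are placeholders,
# not figures from references/benchmarks.md or references/pricing.md.
models = {
    "model-a": {"avg_benchmark": 0.88, "usd_per_mtok": 10.0},
    "model-b": {"avg_benchmark": 0.82, "usd_per_mtok": 1.0},
    "model-c": {"avg_benchmark": 0.74, "usd_per_mtok": 0.2},
}

def value_score(stats: dict) -> float:
    """Performance per dollar: aggregate benchmark score divided by blended price."""
    return stats["avg_benchmark"] / stats["usd_per_mtok"]

# Rank models by performance per dollar, best value first.
for name, stats in sorted(models.items(), key=lambda kv: value_score(kv[1]), reverse=True):
    print(f"{name}: {value_score(stats):.2f} benchmark points per $/Mtok")
```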

Standard Workflows

Simple Comparison Request

When the user asks "Which LLM is better for X?":

  1. Load relevant benchmark data from references/benchmarks.md
  2. Filter models matching requirements
  3. Calculate rankings with appropriate weighting (see the sketch after this list)
  4. Present the top 3-5 recommendations with justification
  5. Include pricing information from references/pricing.md
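
A minimal sketch of step 3, assuming a coding-oriented request; the benchmark weights and scores are illustrative placeholders, and in practice the scores would be loaded from references/benchmarks.md.

```python
# Hypothetical weighting for a coding-heavy task; adjust per the user's request.
weights = {"HumanEval": 0.6, "MMLU": 0.3, "GSM8K": 0.1}

# Placeholder scores -- real values should come from references/benchmarks.md.
candidates = {
    "model-a": {"HumanEval": 0.90, "MMLU": 0.86, "GSM8K": 0.95},
    "model-b": {"HumanEval": 0.84, "MMLU": 0.88, "GSM8K": 0.92},
}

def weighted_score(scores: dict) -> float:
    """Combine benchmark scores using the task-specific weights."""
    return sum(weights[bench] * scores[bench] for bench in weights)

ranking = sorted(candidates, key=lambda m: weighted_score(candidates[m]), reverse=True)
print(ranking)  # highest-ranked model first
```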

Detailed Analysis Request

When the user asks for a comprehensive comparison:

  1. Load model specifications from references/model-details.md
  2. Generate side-by-side comparison table
  3. Include benchmark scores across multiple tests
  4. Calculate cost projections for expected usage (see the sketch after this list)
  5. Provide deployment considerations
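
A small sketch of the cost-projection step. The request volume, token counts, and per-million-token prices in the example call are assumptions; real prices should come from references/pricing.md or a web search.

```python
def monthly_cost(requests_per_day: int,
                 input_tokens: int,
                 output_tokens: int,
                 usd_per_mtok_in: float,
                 usd_per_mtok_out: float) -> float:
    """Estimate monthly API spend from expected usage and per-million-token prices."""
    per_request = (
        (input_tokens / 1_000_000) * usd_per_mtok_in
        + (output_tokens / 1_000_000) * usd_per_mtok_out
    )
    return per_request * requests_per_day * 30

# Hypothetical example: 5,000 requests/day, 2,000 input + 500 output tokens each,
# at assumed prices of $3 (input) and $15 (output) per million tokens.
print(f"${monthly_cost(5_000, 2_000, 500, 3.0, 15.0):,.2f} per month")  # $2,025.00
```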

Best Model for Task Query

When the user describes a specific use case:

  1. Parse task requirements (performance needs, budget, technical constraints)
  2. Map requirements to capability dimensions (see the sketch after this list)
  3. Load task-specific rankings from references/use-cases.md
  4. Return top 3 models with detailed reasoning
  5. Include caveats and alternative suggestions
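
One possible sketch of step 2: translating a parsed use case into capability weights that can then drive the task-specific rankings. The dimension names, profiles, and weights are illustrative assumptions rather than a fixed schema.

```python
def capability_weights(use_case: str) -> dict:
    """Map a recognised use case to illustrative capability weights."""
    profiles = {
        "code generation": {"code": 0.5, "reasoning": 0.3, "long_context": 0.1, "cost": 0.1},
        "document qa":     {"long_context": 0.4, "reasoning": 0.3, "cost": 0.2, "code": 0.1},
        "chat assistant":  {"reasoning": 0.4, "cost": 0.4, "long_context": 0.1, "code": 0.1},
    }
    # Fall back to an even split when the use case is not recognised.
    return profiles.get(use_case, {"reasoning": 0.25, "code": 0.25,
                                   "long_context": 0.25, "cost": 0.25})

print(capability_weights("document qa"))
```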

Reference Resources

Load these files as needed to inform recommendations:

  • benchmarks.md - Comprehensive benchmark scores (MMLU, GSM8K, HumanEval, MMMU, etc.)
  • model-details.md - Technical specifications, context windows, API details, capabilities
  • use-cases.md - Task-specific recommendations organised by common use cases
  • pricing.md - Current pricing across all providers, cost optimisation strategies

Output Formats

Quick Recommendation

Present concise recommendations with model name, key strength, pricing snapshot, and one-sentence justification.

Comparison Table

Use markdown tables comparing models across relevant dimensions (performance, context window, pricing, best use case).
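
A small helper sketch showing one way to assemble such a table; the column names and placeholder rows are assumptions, to be filled from the reference files and current pricing.

```python
# Illustrative only: the headers and "..." cells are placeholders, not real data.
headers = ["Model", "MMLU", "Context window", "Price (in/out per Mtok)", "Best use case"]
rows = [
    ["Model A", "...", "...", "...", "..."],
    ["Model B", "...", "...", "...", "..."],
]

def markdown_table(headers: list, rows: list) -> str:
    """Render headers and rows as a markdown comparison table."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

print(markdown_table(headers, rows))
```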

Detailed Analysis

Structure as:

  1. Executive summary (2-3 sentences)
  2. Top recommendations (ranked with justification)
  3. Performance comparison (benchmark scores)
  4. Cost analysis (usage projections)
  5. Implementation considerations
  6. Alternative options

Key Principles

  1. Evidence-Based - Support all rankings with benchmark data or documented performance
  2. Context-Aware - Consider user's specific requirements, budget, technical environment
  3. Transparent - Explain weighting decisions and ranking criteria clearly
  4. Current Information - Use web_search to verify latest releases, pricing changes, benchmark updates
  5. Practical Focus - Prioritise real-world usage factors over pure benchmark scores
  6. Balanced - Present strengths and weaknesses honestly for each model

Important Considerations

  • Benchmark Limitations - Benchmarks don't perfectly reflect real-world performance
  • Task Specificity - A model's ranking varies significantly by use case
  • Pricing Volatility - API pricing changes frequently; verify for important decisions
  • Access Availability - Some models have waitlists or geographic restrictions
  • Trade-offs - Larger context windows often mean slower processing

Usage Notes

  • Always verify current pricing and availability via web search to catch recent changes
  • Consider user's deployment environment (API vs self-hosted)
  • Account for additional costs (vision inputs, fine-tuning, enterprise features)
  • Recommend testing on user's specific use case before committing
  • Highlight when free tiers or trials are available

Model Coverage

Provides comprehensive coverage of:

  • Anthropic: Claude Opus 4.1/4, Sonnet 4.5/4, Haiku 4
  • OpenAI: GPT-4 Turbo, GPT-4o, GPT-4o-mini, o1-preview, o1-mini
  • Google: Gemini 1.5 Pro, Gemini 1.5 Flash
  • Meta: Llama 3.1 (405B, 70B, 8B)
  • Mistral: Large 2, Small
  • DeepSeek: Coder V2
  • Other providers as relevant to user queries