| name | external-model-selection |
| description | Choose optimal external AI models for code analysis, bug investigation, and architectural decisions. Use when consulting multiple LLMs via claudish, comparing model perspectives, or investigating complex Go/LSP/transpiler issues. Provides empirically validated model rankings (91/100 for MiniMax M2, 83/100 for Grok Code Fast) and proven consultation strategies based on real-world testing. |
External Model Selection
Purpose: Select the best external AI models for your specific task based on empirical performance data from production bug investigations.
When Claude invokes this Skill: When you need to consult external models, choose between different LLMs, or want diverse perspectives on architectural decisions, code bugs, or design choices.
Quick Reference: Top Models
๐ฅ Tier 1 - Primary Recommendations (Use First)
1. MiniMax M2 (minimax/minimax-m2)
- Score: 91/100 | Speed: 3 min โกโกโก | Cost: $$
- Best for: Fast root cause analysis, production bugs, when you need simple implementable fixes
- Proven: Found exact bug (column calculation error) in 3 minutes during LSP investigation
- Why it wins: Pinpoint accuracy, avoids overengineering, focuses on simplest solution first
2. Grok Code Fast (x-ai/grok-code-fast-1)
- Score: 83/100 | Speed: 4 min โกโก | Cost: $$
- Best for: Debugging traces, validation strategies, test coverage design
- Proven: Step-by-step execution traces, identified tab/space edge cases
- Why it wins: Excellent debugging methodology, practical validation approach
3. GPT-5.1 Codex (openai/gpt-5.1-codex)
- Score: 80/100 | Speed: 5 min โก | Cost: $$$
- Best for: Architectural redesign, long-term refactoring plans
- Proven: Proposed granular mapping system for future enhancements
- Why it's valuable: Strong architectural vision, excellent for planning major changes
4. Sherlock Think Alpha (openrouter/sherlock-think-alpha) ๐ FREE
- Score: TBD | Speed: ~5 min โก | Cost: FREE ($0!) ๐ฐ
- Context: 1.8M tokens (LARGEST context window available!)
- Best for: Massive codebase analysis, entire project reasoning, long-context planning
- Secret: Big player testing under weird name - don't let the name fool you
- Specialties:
- Full codebase analysis (1.8M tokens = ~500k lines of code!)
- Research synthesis across dozens of files
- Protocol compliance & standards validation
- Entire project architectural analysis
- Why it's valuable: FREE + massive context = ideal for comprehensive analysis
- Use case: When you need to analyze entire codebase or massive context (and it's FREE!)
5. Gemini 3 Pro Preview (google/gemini-3-pro-preview) โญ NEW
- Score: TBD | Speed: ~5 min โก | Cost: $$$
- Context: 1M tokens (11.4B parameter model)
- Best for: Multimodal reasoning, agentic coding, complex architectural analysis, long-context planning
- Strengths: State-of-the-art on LMArena, GPQA Diamond, MathArena, SWE-Bench Verified
- Specialties:
- Autonomous agents & coding assistants
- Research synthesis & planning
- High-context information processing (1M token window!)
- Tool-calling & long-horizon planning
- Multimodal analysis (text, code, images)
- Why it's valuable: Google's flagship frontier model, excels at inferring intent with minimal prompting
- Use case: When you need deep reasoning across massive context (entire codebase analysis)
๐ฅ Tier 2 - Specialized Use Cases
6. Gemini 2.5 Flash (google/gemini-2.5-flash)
- Score: 73/100 | Speed: 6 min โก | Cost: $
- Best for: Ambiguous problems requiring exhaustive hypothesis exploration
- Caution: Can go too deep - best when truly uncertain about root cause
- Value: Low cost, thorough analysis when you need multiple angles
7. GLM-4.6 (z-ai/glm-4.6)
- Score: 70/100 | Speed: 7 min ๐ข | Cost: $$
- Best for: Adding debug infrastructure, algorithm enhancements
- Caution: Tends to overengineer - verify complexity is warranted
- Use case: When you actually need priority systems or extensive logging
โ AVOID - Known Reliability Issues
Qwen3 Coder (qwen/qwen3-coder-30b-a3b-instruct)
- Score: 0/100 | Status: FAILED (timeout after 8+ minutes)
- Issue: Reliability problems, availability issues
- Recommendation: DO NOT use for time-sensitive or production tasks
Consultation Strategies
Strategy 1: Fast Parallel Diagnosis (DEFAULT - 90% of use cases)
Models: minimax/minimax-m2 + x-ai/grok-code-fast-1
# Launch 2 models in parallel (single message, multiple Task calls)
Task 1: golang-architect (PROXY MODE) โ MiniMax M2
Task 2: golang-architect (PROXY MODE) โ Grok Code Fast
Time: ~4 minutes total Success Rate: 95%+ Cost: $$ (moderate)
Use for:
- Bug investigations
- Quick root cause diagnosis
- Production issues
- Most everyday tasks
Benefits:
- Fast diagnosis from MiniMax M2 (simplest solution)
- Validation strategy from Grok Code Fast (debugging trace)
- Redundancy if one model misses something
Strategy 2: Comprehensive Analysis (Critical issues)
Models: minimax/minimax-m2 + openai/gpt-5.1-codex + x-ai/grok-code-fast-1
# Launch 3 models in parallel
Task 1: golang-architect (PROXY MODE) โ MiniMax M2
Task 2: golang-architect (PROXY MODE) โ GPT-5.1 Codex
Task 3: golang-architect (PROXY MODE) โ Grok Code Fast
Time: ~5 minutes total Success Rate: 99%+ Cost: $$$ (high but justified)
Use for:
- Critical production bugs
- Architectural decisions
- High-impact changes
- When you need absolute certainty
Benefits:
- Quick fix (MiniMax M2)
- Long-term architectural plan (GPT-5.1)
- Validation and testing strategy (Grok)
- Triple redundancy
Strategy 3: Deep Exploration (Ambiguous problems)
Models: minimax/minimax-m2 + google/gemini-2.5-flash + x-ai/grok-code-fast-1
# Launch 3 models in parallel
Task 1: golang-architect (PROXY MODE) โ MiniMax M2
Task 2: golang-architect (PROXY MODE) โ Gemini 2.5 Flash
Task 3: golang-architect (PROXY MODE) โ Grok Code Fast
Time: ~6 minutes total Success Rate: 90%+ Cost: $$ (moderate)
Use for:
- Ambiguous bugs with unclear root cause
- Multi-faceted problems
- When initial investigation is inconclusive
- Complex system interactions
Benefits:
- Quick hypothesis (MiniMax M2)
- Exhaustive exploration (Gemini 2.5 Flash)
- Practical validation (Grok)
- Diverse analytical approaches
Strategy 4: Full Codebase Analysis (Massive Context) ๐
Models: openrouter/sherlock-think-alpha + google/gemini-3-pro-preview
# Launch 2 models in parallel
Task 1: golang-architect (PROXY MODE) โ Sherlock Think Alpha
Task 2: golang-architect (PROXY MODE) โ Gemini 3 Pro Preview
Time: ~5 minutes total Success Rate: TBD (new strategy) Cost: $$$ (one free, one paid = moderate overall)
Use for:
- Entire codebase architectural analysis
- Cross-file dependency analysis
- Large refactoring planning (50+ files)
- System-wide pattern detection
- Multi-module projects
Benefits:
- Sherlock: 1.8M token context (FREE!) - can analyze entire codebase
- Gemini 3 Pro: 1M token context + multimodal + SOTA reasoning
- Both have massive context windows for holistic analysis
- One free model reduces cost significantly
Prompt Strategy:
Analyze the entire Dingo codebase focusing on [specific aspect].
Context provided:
- All files in pkg/ (50+ files)
- All tests in tests/ (60+ files)
- Documentation in ai-docs/
- Total: ~200k lines of code
Your task: [specific analysis goal]
Strategy 5: Budget-Conscious (Cost-sensitive) ๐
Models: openrouter/sherlock-think-alpha + x-ai/grok-code-fast-1
# Launch 2 models in parallel
Task 1: golang-architect (PROXY MODE) โ Sherlock Think Alpha (FREE!)
Task 2: golang-architect (PROXY MODE) โ Grok Code Fast
Time: ~5 minutes total Success Rate: 85%+ Cost: $$ (Sherlock is FREE, only pay for Grok!)
Use for:
- Cost-sensitive projects
- Large context needs on a budget
- Non-critical investigations
- Exploratory analysis
- Learning and experimentation
Benefits:
- Sherlock is completely FREE with 1.8M context!
- Massive context window for comprehensive analysis
- Grok provides debugging methodology
- Lowest cost option with high value
Decision Tree: Which Strategy?
START: Need external model consultation
โ
[What type of task?]
โ
โโ Bug Investigation (90% of cases)
โ โ Strategy 1: MiniMax M2 + Grok Code Fast
โ โ Time: 4 min | Cost: $$ | Success: 95%+
โ
โโ Critical Bug / Architectural Decision
โ โ Strategy 2: MiniMax M2 + GPT-5.1 + Grok
โ โ Time: 5 min | Cost: $$$ | Success: 99%+
โ
โโ Ambiguous / Multi-faceted Problem
โ โ Strategy 3: MiniMax M2 + Gemini + Grok
โ โ Time: 6 min | Cost: $$ | Success: 90%+
โ
โโ Cost-Sensitive / Exploratory
โ Strategy 4: Gemini + Grok
โ Time: 6 min | Cost: $ | Success: 85%+
Critical Implementation Details
1. ALWAYS Use 10-Minute Timeout
CRITICAL: External models take 5-10 minutes. Default 2-minute timeout WILL fail.
# When delegating to agents in PROXY MODE:
Task tool โ golang-architect:
**CRITICAL - Timeout Configuration**:
When executing claudish via Bash tool, ALWAYS use:
```bash
Bash(
command='cat prompt.md | claudish --model [model-id] > output.md 2>&1',
timeout=600000, # 10 minutes (REQUIRED!)
description='External consultation via [model-name]'
)
Why: Qwen3 Coder failed due to 2-minute timeout. 10 minutes prevents this.
2. Launch Models in Parallel (Single Message)
CORRECT (6-8x speedup):
# Single message with multiple Task calls
Task 1: golang-architect (PROXY MODE) โ Model A
Task 2: golang-architect (PROXY MODE) โ Model B
Task 3: golang-architect (PROXY MODE) โ Model C
# All execute simultaneously
WRONG (sequential, slow):
# Multiple messages
Message 1: Task โ Model A (wait...)
Message 2: Task โ Model B (wait...)
Message 3: Task โ Model C (wait...)
# Takes 3x longer
3. Agent Return Format (Keep Brief!)
Agents in PROXY MODE MUST return MAX 3 lines:
[Model-name] analysis complete
Root cause: [one-line summary]
Full analysis: [file-path]
DO NOT return full analysis in agent response (causes context bloat).
4. File-Based Communication
Input: Write investigation prompt to file
ai-docs/sessions/[timestamp]/input/investigation-prompt.md
Output: Agents write full analysis to files
ai-docs/sessions/[timestamp]/output/[model-name]-analysis.md
Main chat: Reads ONLY summaries, not full files
Evidence: What Made Top Models Win
Based on LSP Source Mapping Bug Investigation (Session 20251118-223538):
โ Success Patterns
MiniMax M2 (91/100):
- Identified exact bug:
qPoscalculation produces column 15 instead of 27 - Proposed simplest fix: Change
strings.Index()tostrings.LastIndex() - Completed in 3 minutes
- Key insight: "The bug is entirely in source map generation"
Grok Code Fast (83/100):
- Provided step-by-step execution trace
- Identified tab vs spaces edge case
- Proposed validation strategy with debug logging
- Key insight: "Generated_column values don't match actual positions due to prefix length"
GPT-5.1 Codex (80/100):
- Identified architectural limitation (coarse-grained mappings)
- Proposed long-term solution: granular mapping segments
- Excellent testing strategy
- Key insight: "Single coarse-grained mapping per error propagation"
โ Failure Patterns
Gemini 2.5 Flash (73/100):
- Went too deep into fallback logic (not the root cause)
- Explored 10+ hypotheses
- Missed the simple bug (qPos calculation)
- Issue: Too thorough, lost focus on simplest explanation
GLM-4.6 (70/100):
- Focused on MapToOriginal algorithm (which was correct)
- Proposed complex enhancements (priority system, debug logging)
- Overengineered the solution
- Issue: Added complexity when simple data fix was needed
Sherlock Think (65/100):
- Focused on 0-based vs 1-based indexing (secondary issue)
- Proposed normalization (helpful but not main fix)
- Expensive at $$$ for limited value
- Issue: Fixed symptoms, not root cause
Qwen3 Coder (0/100):
- Timed out after 8+ minutes
- No output produced
- Reliability issues
- Issue: Complete failure, avoid entirely
Performance Benchmarks (Empirical Data)
Test: LSP Source Mapping Bug (diagnostic underlining wrong code) Methodology: 8 models tested in parallel on real production bug
| Model | Time | Accuracy | Solution | Cost-Value |
|---|---|---|---|---|
| MiniMax M2 | 3 min | โ Exact | Simple fix | โญโญโญโญโญ |
| Grok Code Fast | 4 min | โ Correct | Good validation | โญโญโญโญ |
| GPT-5.1 Codex | 5 min | โ ๏ธ Partial | Complex design | โญโญโญโญ |
| Gemini 2.5 Flash | 6 min | โ ๏ธ Missed | Overanalyzed | โญโญโญ |
| GLM-4.6 | 7 min | โ Wrong | Overengineered | โญโญ |
| Sherlock Think | 5 min | โ Secondary | Wrong cause | โญโญ |
| Qwen3 Coder | 8+ min | โ Failed | Timeout | โ ๏ธ |
Key Finding: Faster models (3-5 min) delivered better results than slower ones (6-8 min).
Correlation: Speed โ Simplicity (faster models prioritize simple explanations first)
When to Use Each Model
Use MiniMax M2 when:
- โ Need fast, accurate diagnosis (3 minutes)
- โ Want simplest solution (avoid overengineering)
- โ Production bug investigation
- โ Most everyday tasks (90% of use cases)
Use Grok Code Fast when:
- โ Need detailed debugging trace
- โ Want validation strategy
- โ Designing test coverage
- โ Understanding execution flow
Use GPT-5.1 Codex when:
- โ Planning major architectural changes
- โ Need long-term refactoring strategy
- โ Want comprehensive testing approach
- โ High-level design decisions
Use Gemini 2.5 Flash when:
- โ Problem is genuinely ambiguous
- โ Need exhaustive hypothesis exploration
- โ Budget is constrained (low cost)
- โ Multiple potential root causes
Avoid using when:
- โ Problem is simple/obvious (just fix it)
- โ Sonnet 4.5 internal can answer (use internal first)
- โ Already-solved problem (check docs first)
- โ Time-critical (Qwen3 unreliable)
Example: Invoking External Models
Step 1: Create Session
SESSION=$(date +%Y%m%d-%H%M%S)
mkdir -p ai-docs/sessions/$SESSION/{input,output}
Step 2: Write Investigation Prompt
# Write clear, self-contained prompt
echo "Problem: LSP diagnostic underlining wrong code..." > \
ai-docs/sessions/$SESSION/input/investigation-prompt.md
Step 3: Choose Strategy
Based on decision tree:
- Bug investigation โ Strategy 1 (MiniMax M2 + Grok)
Step 4: Launch Agents in Parallel
Single message with 2 Task calls:
Task 1 โ golang-architect (PROXY MODE):
You are operating in PROXY MODE to investigate bug using MiniMax M2.
INPUT FILES:
- ai-docs/sessions/$SESSION/input/investigation-prompt.md
YOUR TASK (PROXY MODE):
1. Read investigation prompt
2. Use claudish to consult minimax/minimax-m2
3. Write full response to output file
**CRITICAL - Timeout**:
Bash(timeout=600000) # 10 minutes!
OUTPUT FILES:
- ai-docs/sessions/$SESSION/output/minimax-m2-analysis.md
RETURN (MAX 3 lines):
MiniMax M2 analysis complete
Root cause: [one-line]
Full analysis: [file-path]
Task 2 โ golang-architect (PROXY MODE):
[Same structure for Grok Code Fast]
Step 5: Consolidate
After receiving both summaries:
- Review 1-line summaries from each model
- Identify consensus vs disagreements
- Optionally read full analyses if needed
- Decide on action based on recommendations
Supporting Files
- BENCHMARKS.md - Detailed performance metrics and test methodology
- STRATEGIES.md - Deep dive into each consultation strategy with examples
Validation & Maintenance
Last Validated: 2025-11-18 (Session 20251118-223538) Next Review: 2025-05 (6 months) Test Task: LSP Source Mapping Bug
Re-validation Schedule:
- Every 3-6 months
- After new models become available
- When model performance changes significantly
Track:
- Model availability/reliability
- Speed improvements
- Accuracy changes
- Cost fluctuations
Summary: Quick Decision Guide
Most common use case (90%): โ Use Strategy 1: MiniMax M2 + Grok Code Fast โ Time: 4 min | Cost: $$ | Success: 95%+
Critical issues: โ Use Strategy 2: MiniMax M2 + GPT-5.1 + Grok โ Time: 5 min | Cost: $$$ | Success: 99%+
Ambiguous problems: โ Use Strategy 3: MiniMax M2 + Gemini + Grok โ Time: 6 min | Cost: $$ | Success: 90%+
Cost-sensitive: โ Use Strategy 4: Gemini + Grok โ Time: 6 min | Cost: $ | Success: 85%+
Remember:
- โฑ๏ธ Always use 10-minute timeout
- ๐ Launch models in parallel (single message)
- ๐ Communication via files (not inline)
- ๐ฏ Brief summaries only (MAX 3 lines)
Full Reports:
- Comprehensive comparison:
ai-docs/sessions/20251118-223538/01-planning/comprehensive-model-comparison.md - Model ranking analysis:
ai-docs/sessions/20251118-223538/01-planning/model-ranking-analysis.md