SKILL.md

name: multi-model-validation
description: Run multiple AI models in parallel for 3-5x speedup with ENFORCED performance statistics tracking. Use when validating with Grok, Gemini, GPT-5, DeepSeek, or Claudish proxy for code review, consensus analysis, or multi-expert validation. NEW in v3.1.0 - SubagentStop hook enforces statistics collection, MANDATORY checklist prevents incomplete reviews, timing instrumentation examples. Includes dynamic model discovery via `claudish --top-models` and `claudish --free`, session-based workspaces, and Pattern 7-8 for tracking model performance. Trigger keywords - "grok", "gemini", "gpt-5", "deepseek", "claudish", "multiple models", "parallel review", "external AI", "consensus", "multi-model", "model performance", "statistics", "free models".
version: 3.1.0
tags: orchestration, claudish, parallel, consensus, multi-model, grok, gemini, external-ai, statistics, performance, free-models, enforcement
keywords: grok, gemini, gpt-5, deepseek, claudish, parallel, consensus, multi-model, external-ai, proxy, openrouter, statistics, performance, quality-score, execution-time, free-models, top-models, enforcement, mandatory, checklist

Multi-Model Validation

Version: 3.1.0
Purpose: Patterns for running multiple AI models in parallel via Claudish proxy with dynamic model discovery, session-based workspaces, and performance statistics
Status: Production Ready

Overview

Multi-model validation is the practice of running multiple AI models (Grok, Gemini, GPT-5, DeepSeek, etc.) in parallel to validate code, designs, or implementations from different perspectives. This achieves:

  • 3-5x speedup via parallel execution (15 minutes → 5 minutes)
  • Consensus-based prioritization (issues flagged by all models are CRITICAL)
  • Diverse perspectives (different models catch different issues)
  • Cost transparency (know before you spend)
  • Free model discovery (NEW v3.0) - find high-quality free models from trusted providers
  • Performance tracking - identify slow/failing models for future exclusion
  • Data-driven recommendations - optimize model shortlist based on historical performance

Key Innovations:

  1. Dynamic Model Discovery (NEW v3.0) - Use claudish --top-models and claudish --free to get current available models with pricing
  2. Session-Based Workspaces (NEW v3.0) - Each validation session gets a unique directory to prevent conflicts
  3. 4-Message Pattern - Ensures true parallel execution by using only Task tool calls in a single message
  4. Pattern 7-8 - Statistics collection and data-driven model recommendations

This skill is extracted from the /review command and generalized for use in any multi-model workflow.


Related Skills

CRITICAL: Tracking Protocol Required

Before using any patterns in this skill, ensure you have completed the pre-launch setup from orchestration:model-tracking-protocol.

Launching models without tracking setup = INCOMPLETE validation.

Cross-References:

  • orchestration:model-tracking-protocol - MANDATORY tracking templates and protocols (NEW in v0.6.0)
    • Pre-launch checklist (8 required items)
    • Tracking table templates
    • Failure documentation format
    • Results presentation template
  • orchestration:quality-gates - Approval gates and severity classification
  • orchestration:todowrite-orchestration - Progress tracking during execution
  • orchestration:error-recovery - Handling failures and retries

Skill Integration:

This skill (multi-model-validation) defines execution patterns (how to run models in parallel). The model-tracking-protocol skill defines tracking infrastructure (how to collect and present results).

Use both together:

skills: orchestration:multi-model-validation, orchestration:model-tracking-protocol

Core Patterns

Pattern 0: Session Setup and Model Discovery (NEW v3.0)

Purpose: Create isolated session workspace and discover available models dynamically.

Why Session-Based Workspaces:

Using a fixed directory like ai-docs/reviews/ causes problems:

  • ❌ Multiple sessions overwrite each other's files
  • ❌ Stale data from previous sessions pollutes results
  • ❌ Hard to track which files belong to which session

Instead, create a unique session directory for each validation:

# Generate unique session ID
SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 4 /dev/urandom | xxd -p)"
SESSION_DIR="/tmp/${SESSION_ID}"

# Create session workspace
mkdir -p "$SESSION_DIR"

# Export for use by agents
export SESSION_ID SESSION_DIR

echo "Session: $SESSION_ID"
echo "Directory: $SESSION_DIR"

# Example output:
# Session: review-20251212-143052-a3f2
# Directory: /tmp/review-20251212-143052-a3f2

Benefits:

  • ✅ Each session is isolated (no cross-contamination)
  • ✅ Easy cleanup (rm -rf $SESSION_DIR when done; see the trap sketch below)
  • ✅ Session ID can be used for tracking in statistics
  • ✅ Parallel sessions don't conflict
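
One possible cleanup approach (a minimal sketch, assuming the whole validation runs in a single shell session) is to register a trap so the workspace is removed even if the run exits early:

# Hedged sketch: auto-remove the session workspace when the shell exits.
# Comment out the trap if you want to inspect review files afterwards.
cleanup_session() {
  [[ -n "$SESSION_DIR" && -d "$SESSION_DIR" ]] && rm -rf "$SESSION_DIR"
}
trap cleanup_session EXIT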

Dynamic Model Discovery:

NEVER hardcode model lists. Models change frequently: new ones appear, old ones are deprecated, and pricing shifts. Instead, use claudish to get the currently available models:

# Get top paid models (best value for money)
claudish --top-models

# Example output:
#   google/gemini-3-pro-preview    Google     $7.00/1M   1048K   🔧 🧠 👁️
#   openai/gpt-5.1-codex           Openai     $5.63/1M   400K    🔧 🧠 👁️
#   x-ai/grok-code-fast-1          X-ai       $0.85/1M   256K    🔧 🧠
#   minimax/minimax-m2             Minimax    $0.64/1M   262K    🔧 🧠
#   z-ai/glm-4.6                   Z-ai       $1.07/1M   202K    🔧 🧠
#   qwen/qwen3-vl-235b-a22b-ins... Qwen       $0.70/1M   262K    🔧    👁️

# Get free models from trusted providers
claudish --free

# Example output:
#   google/gemini-2.0-flash-exp:free  Google     FREE      1049K   ✓ · ✓
#   mistralai/devstral-2512:free      Mistralai  FREE      262K    ✓ · ·
#   qwen/qwen3-coder:free             Qwen       FREE      262K    ✓ · ·
#   qwen/qwen3-235b-a22b:free         Qwen       FREE      131K    ✓ ✓ ·
#   openai/gpt-oss-120b:free          Openai     FREE      131K    ✓ ✓ ·

Recommended Free Models for Code Review:

| Model | Provider | Context | Capabilities | Why Good |
|-------|----------|---------|--------------|----------|
| qwen/qwen3-coder:free | Qwen | 262K | Tools ✓ | Coding-specialized, large context |
| mistralai/devstral-2512:free | Mistral | 262K | Tools ✓ | Dev-focused, excellent for code |
| qwen/qwen3-235b-a22b:free | Qwen | 131K | Tools ✓ Reasoning ✓ | Massive 235B model, reasoning |

Model Selection Flow:

1. Load Historical Performance (if exists)
   → Read ai-docs/llm-performance.json
   → Get avg speed, quality, success rate per model

2. Discover Available Models
   → Run: claudish --top-models (paid)
   → Run: claudish --free (free tier)

3. Merge with Historical Data (sketched below)
   → Add performance metrics to model list
   → Flag: "⚡ Fast", "🎯 High Quality", "⚠️ Slow", "❌ Unreliable"

4. Present to User (AskUserQuestion)
   → Show: Model | Provider | Price | Avg Speed | Quality
   → Suggest internal reviewer (ALWAYS)
   → Highlight top performers
   → Include 1-2 free models for comparison

5. User Selects Models
   → Minimum: 1 internal + 1 external
   → Recommended: 1 internal + 2-3 external
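
A rough sketch of the merge step (assumptions: claudish prints the model ID as the first whitespace-separated column, as in the examples above, and ai-docs/llm-performance.json follows the Pattern 7 schema):

# Hedged sketch: annotate discovered models with historical averages.
annotate_models() {
  local perf_file="ai-docs/llm-performance.json"
  claudish --top-models | while read -r model rest; do
    local key=$(echo "$model" | tr '/:' '-')
    local avg=$(jq -r --arg m "$key" '.models[$m].avgExecutionTime // empty' "$perf_file" 2>/dev/null)
    if [[ -z "$avg" ]]; then
      echo "$model  🆕 New (no history)  $rest"
    else
      echo "$model  avg ${avg}s  $rest"
    fi
  done
}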

Interactive Model Selection (AskUserQuestion with multiSelect):

CRITICAL: Use AskUserQuestion tool with multiSelect: true to let users choose models interactively. This provides a better UX than just showing recommendations.

// Use AskUserQuestion to let user select models
AskUserQuestion({
  questions: [{
    question: "Which external models should validate your code? (Internal Claude reviewer always included)",
    header: "Models",
    multiSelect: true,
    options: [
      // Top paid (from claudish --top-models + historical data)
      {
        label: "x-ai/grok-code-fast-1 ⚡",
        description: "$0.85/1M | Quality: 87% | Avg: 42s | Fast + accurate"
      },
      {
        label: "google/gemini-3-pro-preview",
        description: "$7.00/1M | Quality: 91% | Avg: 55s | High accuracy"
      },
      // Free models (from claudish --free)
      {
        label: "qwen/qwen3-coder:free 🆓",
        description: "FREE | Quality: 82% | 262K context | Coding-specialized"
      },
      {
        label: "mistralai/devstral-2512:free 🆓",
        description: "FREE | 262K context | Dev-focused, new model"
      }
    ]
  }]
})

Remember Selection for Session:

Store the user's model selection in the session directory so it persists throughout the validation:

# After user selects models, save to session
save_session_models() {
  local session_dir="$1"
  shift
  local models=("$@")

  # Always include internal reviewer
  echo "claude-embedded" > "$session_dir/selected-models.txt"

  # Add user-selected models
  for model in "${models[@]}"; do
    echo "$model" >> "$session_dir/selected-models.txt"
  done

  echo "Session models saved to $session_dir/selected-models.txt"
}

# Load session models for subsequent operations
load_session_models() {
  local session_dir="$1"
  cat "$session_dir/selected-models.txt"
}

# Usage:
# After AskUserQuestion returns selected models
save_session_models "$SESSION_DIR" "x-ai/grok-code-fast-1" "qwen/qwen3-coder:free"

# Later in the session, retrieve the selection
MODELS=$(load_session_models "$SESSION_DIR")

Session Model Memory Structure:

$SESSION_DIR/
├── selected-models.txt    # User's model selection (persists for session)
├── code-context.md        # Code being reviewed
├── claude-review.md       # Internal review
├── grok-review.md         # External review (if selected)
├── qwen-coder-review.md   # External review (if selected)
└── consolidated-review.md # Final consolidated review

Why Remember the Selection:

  1. Re-runs: If validation needs to be re-run, use same models
  2. Consistency: All phases of validation use identical model set
  3. Audit trail: Know which models produced which results
  4. Cost tracking: Accurate cost attribution per session

Always Include Internal Reviewer:

BEST PRACTICE: Always run internal Claude reviewer alongside external models.

Why?
✓ FREE (embedded Claude, no API costs)
✓ Fast baseline (usually fastest)
✓ Provides comparison point
✓ Works even if ALL external models fail
✓ Consistent behavior (same model every time)

The internal reviewer should NEVER be optional - it's your safety net.

Pattern 1: The 4-Message Pattern (MANDATORY)

This pattern is CRITICAL for achieving true parallel execution with multiple AI models.

Why This Pattern Exists:

Claude Code executes tools sequentially by default when different tool types are mixed in the same message. To achieve true parallelism, you MUST:

  1. Use ONLY one tool type per message
  2. Ensure all Task calls are in a single message
  3. Separate preparation (Bash) from execution (Task) from presentation

The Pattern:

Message 1: Preparation (Bash Only)
  - Create workspace directories
  - Validate inputs (check if claudish installed)
  - Write context files (code to review, design reference, etc.)
  - NO Task calls
  - NO TodoWrite calls

Message 2: Parallel Execution (Task Only)
  - Launch ALL AI models in SINGLE message
  - ONLY Task tool calls
  - Separate each Task with --- delimiter
  - Each Task is independent (no dependencies)
  - All execute simultaneously

Message 3: Auto-Consolidation (Task Only)
  - Automatically triggered when N ≥ 2 models complete
  - Launch consolidation agent
  - Pass all review file paths
  - Apply consensus analysis

Message 4: Present Results
  - Show user prioritized issues
  - Include consensus levels (unanimous, strong, majority)
  - Link to detailed reports
  - Cost summary (if applicable)

Example: 5-Model Parallel Code Review

Message 1: Preparation (Session Setup + Model Discovery)
  # Create unique session workspace
  Bash: SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 2 /dev/urandom | xxd -p)"
  Bash: SESSION_DIR="/tmp/${SESSION_ID}" && mkdir -p "$SESSION_DIR"
  Bash: git diff > "$SESSION_DIR/code-context.md"

  # Discover available models
  Bash: claudish --top-models  # See paid options
  Bash: claudish --free        # See free options

  # User selects models via AskUserQuestion (see Pattern 0)

Message 2: Parallel Execution (ONLY Task calls - single message)
  Task: senior-code-reviewer
    Prompt: "Review $SESSION_DIR/code-context.md for security issues.
             Write detailed review to $SESSION_DIR/claude-review.md
             Return only brief summary."
  ---
  Task: codex-code-reviewer PROXY_MODE: x-ai/grok-code-fast-1
    Prompt: "Review $SESSION_DIR/code-context.md for security issues.
             Write detailed review to $SESSION_DIR/grok-review.md
             Return only brief summary."
  ---
  Task: codex-code-reviewer PROXY_MODE: qwen/qwen3-coder:free
    Prompt: "Review $SESSION_DIR/code-context.md for security issues.
             Write detailed review to $SESSION_DIR/qwen-coder-review.md
             Return only brief summary."
  ---
  Task: codex-code-reviewer PROXY_MODE: openai/gpt-5.1-codex
    Prompt: "Review $SESSION_DIR/code-context.md for security issues.
             Write detailed review to $SESSION_DIR/gpt5-review.md
             Return only brief summary."
  ---
  Task: codex-code-reviewer PROXY_MODE: mistralai/devstral-2512:free
    Prompt: "Review $SESSION_DIR/code-context.md for security issues.
             Write detailed review to $SESSION_DIR/devstral-review.md
             Return only brief summary."

  All 5 models execute simultaneously (5x parallelism!)

Message 3: Auto-Consolidation
  (Automatically triggered - don't wait for user to request)

  Task: senior-code-reviewer
    Prompt: "Consolidate 5 code reviews from:
             - $SESSION_DIR/claude-review.md
             - $SESSION_DIR/grok-review.md
             - $SESSION_DIR/qwen-coder-review.md
             - $SESSION_DIR/gpt5-review.md
             - $SESSION_DIR/devstral-review.md

             Apply consensus analysis:
             - Issues flagged by ALL 5 → UNANIMOUS (VERY HIGH confidence)
             - Issues flagged by 4 → STRONG (HIGH confidence)
             - Issues flagged by 3 → MAJORITY (MEDIUM confidence)
             - Issues flagged by 1-2 → DIVERGENT (LOW confidence)

             Prioritize by consensus level and severity.
             Write to $SESSION_DIR/consolidated-review.md"

Message 4: Present Results + Update Statistics
  # Track performance for each model (see Pattern 7)
  track_model_performance "claude-embedded" "success" 32 8 95
  track_model_performance "x-ai/grok-code-fast-1" "success" 45 6 87
  track_model_performance "qwen/qwen3-coder:free" "success" 52 5 82
  track_model_performance "openai/gpt-5.1-codex" "success" 68 7 89
  track_model_performance "mistralai/devstral-2512:free" "success" 48 5 84

  # Record session summary
  record_session_stats 5 5 0 68 245 3.6

  "Multi-model code review complete! 5 AI models analyzed your code.
   Session: $SESSION_ID

   Top 5 Issues (Prioritized by Consensus):
   1. [UNANIMOUS] Missing input validation on POST /api/users
   2. [UNANIMOUS] SQL injection risk in search endpoint
   3. [STRONG] Weak password hashing (bcrypt rounds too low)
   4. [MAJORITY] Missing rate limiting on authentication endpoints
   5. [MAJORITY] Insufficient error handling in payment flow

   Model Performance (this session):
   | Model                          | Time | Issues | Quality | Cost   |
   |--------------------------------|------|--------|---------|--------|
   | claude-embedded                | 32s  | 8      | 95%     | FREE   |
   | x-ai/grok-code-fast-1          | 45s  | 6      | 87%     | $0.002 |
   | qwen/qwen3-coder:free          | 52s  | 5      | 82%     | FREE   |
   | openai/gpt-5.1-codex           | 68s  | 7      | 89%     | $0.015 |
   | mistralai/devstral-2512:free   | 48s  | 5      | 84%     | FREE   |

   Parallel Speedup: 3.6x (245s sequential → 68s parallel)

   See $SESSION_DIR/consolidated-review.md for complete analysis.
   Performance logged to ai-docs/llm-performance.json"

Performance Impact:

  • Sequential execution: 5 models × 3 min = 15 minutes
  • Parallel execution: max(model times) ≈ 5 minutes
  • Speedup: 3x with perfect parallelism

Pattern 2: Parallel Execution Architecture

Single Message, Multiple Tasks:

The key to parallel execution is putting ALL Task calls in a single message with the --- delimiter:

✅ CORRECT - Parallel Execution:

Task: agent1
  Prompt: "Task 1 instructions"
---
Task: agent2
  Prompt: "Task 2 instructions"
---
Task: agent3
  Prompt: "Task 3 instructions"

All 3 execute simultaneously.

Anti-Pattern: Sequential Execution

❌ WRONG - Sequential Execution:

Message 1:
  Task: agent1

Message 2:
  Task: agent2

Message 3:
  Task: agent3

Each task waits for previous to complete (3x slower).

Independent Tasks Requirement:

Each Task must be independent (no dependencies):

✅ CORRECT - Independent:
  Task: review code for security
  Task: review code for performance
  Task: review code for style

  All can run simultaneously (same input, different perspectives).

❌ WRONG - Dependent:
  Task: implement feature
  Task: write tests for feature (depends on implementation)
  Task: review implementation (depends on tests)

  Must run sequentially (each needs previous output).

Unique Output Files:

Each Task MUST write to a unique output file within the session directory:

✅ CORRECT - Unique Files in Session Directory:
  Task: reviewer1 → $SESSION_DIR/claude-review.md
  Task: reviewer2 → $SESSION_DIR/grok-review.md
  Task: reviewer3 → $SESSION_DIR/qwen-coder-review.md

❌ WRONG - Shared File:
  Task: reviewer1 → $SESSION_DIR/review.md
  Task: reviewer2 → $SESSION_DIR/review.md (overwrites reviewer1!)
  Task: reviewer3 → $SESSION_DIR/review.md (overwrites reviewer2!)

❌ WRONG - Fixed Directory (not session-based):
  Task: reviewer1 → ai-docs/reviews/claude-review.md  # May conflict with other sessions!

Wait for All Before Consolidation:

Do NOT consolidate until ALL tasks complete:

✅ CORRECT - Wait for All:
  Launch: Task1, Task2, Task3, Task4 (parallel)
  Wait: All 4 complete
  Check: results.filter(r => r.status === 'fulfilled').length
  If >= 2: Proceed with consolidation
  If < 2: Offer retry or abort

❌ WRONG - Premature Consolidation:
  Launch: Task1, Task2, Task3, Task4
  After 30s: Task1, Task2 done
  Consolidate: Only Task1 + Task2 (Task3, Task4 still running!)

Pattern 3: Proxy Mode Implementation

PROXY_MODE Directive:

External AI models are invoked via the PROXY_MODE directive in agent prompts:

Task: codex-code-reviewer PROXY_MODE: x-ai/grok-code-fast-1
  Prompt: "Review code for security issues..."

Agent Behavior:

When an agent sees PROXY_MODE, it:

1. Detects PROXY_MODE directive in incoming prompt
2. Extracts model name (e.g., "x-ai/grok-code-fast-1")
3. Extracts actual task (everything after PROXY_MODE line)
4. Constructs claudish command:
   printf '%s' "AGENT_PROMPT" | claudish --model x-ai/grok-code-fast-1 --stdin --quiet --auto-approve
5. Executes SYNCHRONOUSLY (blocking, waits for full response)
6. Captures full output
7. Writes detailed results to file ($SESSION_DIR/grok-review.md)
8. Returns BRIEF summary only (2-5 sentences)
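
A condensed sketch of that flow, assuming the directive arrives as the first line of the prompt (real agents parse more defensively):

# Hedged sketch of PROXY_MODE handling inside an agent.
# Assumes $INCOMING holds the prompt and its first line is "PROXY_MODE: <model>".
MODEL=$(printf '%s\n' "$INCOMING" | head -1 | sed 's/^PROXY_MODE:[[:space:]]*//')
TASK=$(printf '%s\n' "$INCOMING" | tail -n +2)

# Blocking call: wait for the full external response
RESULT=$(printf '%s' "$TASK" | claudish --model "$MODEL" --stdin --quiet --auto-approve)

# Full output to file, brief summary back to the orchestrator
OUT_FILE="$SESSION_DIR/$(echo "$MODEL" | tr '/:' '-')-review.md"
echo "$RESULT" > "$OUT_FILE"
echo "Review by $MODEL complete. See $OUT_FILE"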

Critical: Blocking Execution

External model calls MUST be synchronous (blocking) so the agent waits for completion:

✅ CORRECT - Blocking (Synchronous):
  RESULT=$(printf '%s' "$PROMPT" | claudish --model grok --stdin --quiet --auto-approve)
  echo "$RESULT" > ai-docs/grok-review.md
  echo "Grok review complete. See ai-docs/grok-review.md"

❌ WRONG - Background (Asynchronous):
  printf '%s' "$PROMPT" | claudish --model grok --stdin --quiet --auto-approve &
  echo "Grok review started..."  # Agent returns immediately, review not done!

Why Blocking Matters:

If agents return before external models complete, the orchestrator will:

  • Think all reviews are done (they're not)
  • Try to consolidate partial results (missing data)
  • Present incomplete results to user (bad experience)

Output Strategy:

Agents write full detailed output to file and return brief summary only:

Full Output ($SESSION_DIR/grok-review.md):
  "# Code Review by Grok

   ## Security Issues

   ### CRITICAL: SQL Injection in User Search
   The search endpoint constructs SQL queries using string concatenation...
   [500 more lines of detailed analysis]"

Brief Summary (returned to orchestrator):
  "Grok review complete. Found 3 CRITICAL, 5 HIGH, 12 MEDIUM issues.
   See $SESSION_DIR/grok-review.md for details."

Why Brief Summaries:

  • Orchestrator doesn't need full 500-line review in context
  • Full review is in file for consolidation agent
  • Keeps orchestrator context clean (context efficiency)

Auto-Approve Flag:

Use --auto-approve flag to prevent interactive prompts:

✅ CORRECT - Auto-Approve:
  claudish --model grok --stdin --quiet --auto-approve

❌ WRONG - Interactive (blocks waiting for user input):
  claudish --model grok --stdin --quiet
  # Waits for user to approve costs... but this is inside an agent!

PROXY_MODE-Enabled Agents Reference

CRITICAL: Only these agents support PROXY_MODE. Using other agents (like general-purpose) will NOT work correctly.

Supported Agents by Plugin

agentdev plugin (3 agents)

| Agent | subagent_type | Best For |
|-------|---------------|----------|
| reviewer | agentdev:reviewer | Implementation quality reviews |
| architect | agentdev:architect | Design plan reviews |
| developer | agentdev:developer | Implementation with external models |

frontend plugin (8 agents)

| Agent | subagent_type | Best For |
|-------|---------------|----------|
| plan-reviewer | frontend:plan-reviewer | Architecture plan validation |
| reviewer | frontend:reviewer | Code reviews |
| architect | frontend:architect | Architecture design |
| designer | frontend:designer | Design reviews |
| developer | frontend:developer | Full-stack implementation |
| ui-developer | frontend:ui-developer | UI implementation reviews |
| css-developer | frontend:css-developer | CSS architecture & styling |
| test-architect | frontend:test-architect | Testing strategy & implementation |

seo plugin (5 agents)

| Agent | subagent_type | Best For |
|-------|---------------|----------|
| editor | seo:editor | SEO content reviews |
| writer | seo:writer | Content generation |
| analyst | seo:analyst | Analysis tasks |
| researcher | seo:researcher | Research & data gathering |
| data-analyst | seo:data-analyst | Data analysis & insights |

Total: 16 PROXY_MODE-enabled agents

How to Check if an Agent Supports PROXY_MODE

Look for <proxy_mode_support> in the agent's definition file:

grep -l "proxy_mode_support" plugins/*/agents/*.md

Common Mistakes

| ❌ WRONG | ✅ CORRECT | Why |
|----------|-----------|-----|
| subagent_type: "general-purpose" | subagent_type: "agentdev:reviewer" | general-purpose has no PROXY_MODE |
| subagent_type: "Explore" | subagent_type: "agentdev:architect" | Explore is for exploration, not reviews |
| Prompt: "Run claudish with model X" | Prompt: "PROXY_MODE: model-x\n\n[task]" | Don't tell the agent to run claudish; use the directive |

Correct Pattern Example

// ✅ CORRECT: Use PROXY_MODE-enabled agent with directive
Task({
  subagent_type: "agentdev:reviewer",
  description: "Grok design review",
  run_in_background: true,
  prompt: `PROXY_MODE: x-ai/grok-code-fast-1

Review the design plan at ai-docs/feature-design.md

Focus on:
1. Completeness
2. Missing considerations
3. Potential issues
4. Implementation risks`
})

// ❌ WRONG: Using general-purpose and instructing to run claudish
Task({
  subagent_type: "general-purpose",
  description: "Grok design review",
  prompt: `Review using Grok via claudish:
  npx claudish --model x-ai/grok-code-fast-1 ...`
})

Pattern 4: Cost Estimation and Transparency

Input/Output Token Separation:

Provide separate estimates for input and output tokens:

Cost Estimation for Multi-Model Review:

Input Tokens (per model):
  - Code context: 500 lines × 1.5 = 750 tokens
  - Review instructions: 200 tokens
  - Total input per model: ~1000 tokens
  - Total input (5 models): 5,000 tokens

Output Tokens (per model):
  - Expected output: 2,000 - 4,000 tokens
  - Total output (5 models): 10,000 - 20,000 tokens

Cost Calculation (example rates):
  - Input: 5,000 tokens × $0.0001/1k = $0.0005
  - Output: 15,000 tokens × $0.0005/1k = $0.0075 (3-5x more expensive)
  - Total: $0.0080 (range: $0.0055 - $0.0105)

User Approval Gate:
  "Multi-model review will cost approximately $0.008 ($0.005 - $0.010).
   Proceed? (Yes/No)"

Input Token Estimation Formula:

Input Tokens = (Code Lines × 1.5) + Instruction Tokens

Why 1.5x multiplier?
  - Code lines: ~1 token per line (average)
  - Context overhead: +50% (imports, comments, whitespace)

Example:
  500 lines of code → 500 × 1.5 = 750 tokens
  + 200 instruction tokens = 950 tokens total input

Output Token Estimation Formula:

Output Tokens = Base Estimate + Complexity Factor

Base Estimates by Task Type:
  - Code review: 2,000 - 4,000 tokens
  - Design validation: 1,000 - 2,000 tokens
  - Architecture planning: 3,000 - 6,000 tokens
  - Bug investigation: 2,000 - 5,000 tokens

Complexity Factors:
  - Simple (< 100 lines code): Use low end of range
  - Medium (100-500 lines): Use mid-range
  - Complex (> 500 lines): Use high end of range

Example:
  400 lines of complex code → 4,000 tokens (high complexity)
  50 lines of simple code → 2,000 tokens (low complexity)
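
Putting both formulas together (a sketch; the per-token rates below are illustrative placeholders, not real pricing):

# Hedged sketch: rough cost range for a multi-model code review.
estimate_review_cost() {
  local code_lines="$1" num_models="$2"
  local input_per_model=$(( code_lines * 3 / 2 + 200 ))  # lines × 1.5 + instruction tokens
  local out_low=2000 out_high=4000                        # code-review base estimate

  awk -v i=$(( input_per_model * num_models )) \
      -v lo=$(( out_low * num_models )) -v hi=$(( out_high * num_models )) 'BEGIN {
    # Placeholder rates: $0.0001/1k input, $0.0005/1k output
    printf "Input: %d tokens, Output: %d-%d tokens\n", i, lo, hi
    printf "Estimated cost: $%.4f - $%.4f\n", (i*0.0001 + lo*0.0005)/1000, (i*0.0001 + hi*0.0005)/1000
  }'
}

# Example: estimate_review_cost 500 5 → roughly the $0.005 - $0.010 range shown above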

Range-Based Estimates:

Always provide a range (min-max), not a single number:

✅ CORRECT - Range:
  "Estimated cost: $0.005 - $0.010 (depends on review depth)"

❌ WRONG - Single Number:
  "Estimated cost: $0.0075"
  (User surprised when actual is $0.0095)

Why Output Costs More:

Output tokens are typically 3-5x more expensive than input tokens:

Example Pricing (OpenRouter):
  - Grok: $0.50 / 1M input, $1.50 / 1M output (3x difference)
  - Gemini Flash: $0.10 / 1M input, $0.40 / 1M output (4x difference)
  - GPT-5 Codex: $1.00 / 1M input, $5.00 / 1M output (5x difference)

Impact:
  If input = 5,000 tokens, output = 15,000 tokens:
    Input cost: $0.0005
    Output cost: $0.0075 (15x higher despite only 3x more tokens)
    Total: $0.0080 (94% is output!)

User Approval Before Execution:

ALWAYS ask for user approval before expensive operations:

Present to user:
  "You selected 5 AI models for code review:
   - Claude Sonnet (embedded, free)
   - Grok Code Fast (external, $0.002)
   - Gemini 2.5 Flash (external, $0.001)
   - GPT-5 Codex (external, $0.004)
   - DeepSeek Coder (external, $0.001)

   Estimated total cost: $0.008 ($0.005 - $0.010)

   Proceed with multi-model review? (Yes/No)"

If user says NO:
  Offer alternatives:
    1. Use only free embedded Claude
    2. Select fewer models
    3. Cancel review

If user says YES:
  Proceed with Message 2 (parallel execution)

Pattern 5: Auto-Consolidation Logic

Automatic Trigger:

Consolidation should happen automatically when N ≥ 2 reviews complete:

✅ CORRECT - Auto-Trigger:

const results = await Promise.allSettled([task1, task2, task3, task4, task5]);
const successful = results.filter(r => r.status === 'fulfilled');

if (successful.length >= 2) {
  // Auto-trigger consolidation (DON'T wait for user to ask)
  const consolidated = await Task({
    subagent_type: "senior-code-reviewer",
    description: "Consolidate reviews",
    prompt: `Consolidate ${successful.length} reviews and apply consensus analysis`
  });

  return formatResults(consolidated);
} else {
  // Too few successful reviews
  notifyUser("Only 1 model succeeded. Retry failures or abort?");
}

❌ WRONG - Wait for User:

const results = await Promise.allSettled([...]);
const successful = results.filter(r => r.status === 'fulfilled');

// Present results to user
notifyUser("3 reviews complete. Would you like me to consolidate them?");
// Waits for user to request consolidation...

Why Auto-Trigger:

  • Better UX (no extra user prompt needed)
  • Faster workflow (no wait for user response)
  • Expected behavior (user assumes consolidation is part of workflow)

Minimum Threshold:

Require at least 2 successful reviews for meaningful consensus:

if (successful.length >= 2) {
  // Proceed with consolidation
} else if (successful.length === 1) {
  // Only 1 review succeeded
  notifyUser("Only 1 model succeeded. No consensus available. See single review or retry?");
} else {
  // All failed
  notifyUser("All models failed. Check logs and retry?");
}

Pass All Review File Paths:

Consolidation agent needs paths to ALL review files within the session directory:

Task: senior-code-reviewer
  Prompt: "Consolidate reviews from these files:
           - $SESSION_DIR/claude-review.md
           - $SESSION_DIR/grok-review.md
           - $SESSION_DIR/qwen-coder-review.md

           Apply consensus analysis and prioritize issues."

Don't Inline Full Reviews:

❌ WRONG - Inline Reviews (context pollution):
  Prompt: "Consolidate these reviews:

           Claude Review:
           [500 lines of review content]

           Grok Review:
           [500 lines of review content]

           Qwen Review:
           [500 lines of review content]"

✅ CORRECT - File Paths in Session Directory:
  Prompt: "Read and consolidate reviews from:
           - $SESSION_DIR/claude-review.md
           - $SESSION_DIR/grok-review.md
           - $SESSION_DIR/qwen-coder-review.md"

Pattern 6: Consensus Analysis

Consensus Levels:

Classify issues by how many models flagged them:

Consensus Levels (for N models):

UNANIMOUS (100% agreement):
  - All N models flagged this issue
  - VERY HIGH confidence
  - MUST FIX priority

STRONG CONSENSUS (67-99% agreement):
  - Most models flagged this issue (⌈2N/3⌉ to N-1)
  - HIGH confidence
  - RECOMMENDED priority

MAJORITY (50-66% agreement):
  - Half or more models flagged this issue (⌈N/2⌉ to ⌈2N/3⌉-1)
  - MEDIUM confidence
  - CONSIDER priority

DIVERGENT (< 50% agreement):
  - Only 1-2 models flagged this issue
  - LOW confidence
  - OPTIONAL priority (may be model-specific perspective)

Example: 5 Models

Issue Flagged By:              Consensus Level:    Priority:
─────────────────────────────────────────────────────────────
All 5 models                   UNANIMOUS (100%)    MUST FIX
4 models                       STRONG (80%)        RECOMMENDED
3 models                       MAJORITY (60%)      CONSIDER
2 models                       DIVERGENT (40%)     OPTIONAL
1 model                        DIVERGENT (20%)     OPTIONAL
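
The same thresholds as a small helper (a sketch assuming integer counts):

# Hedged sketch: map (models flagging issue, total models) to a consensus label.
classify_consensus() {
  local flagged="$1" total="$2"
  local pct=$(( flagged * 100 / total ))
  if   (( pct == 100 )); then echo "UNANIMOUS"
  elif (( pct >= 67 ));  then echo "STRONG"
  elif (( pct >= 50 ));  then echo "MAJORITY"
  else                        echo "DIVERGENT"
  fi
}

# classify_consensus 4 5 → STRONG (80%), classify_consensus 2 5 → DIVERGENT (40%)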

Keyword-Based Matching (v1.0):

Simple consensus analysis using keyword matching:

Algorithm:

1. Extract issues from each review
2. For each unique issue:
   a. Identify keywords (e.g., "SQL injection", "input validation")
   b. Check which other reviews mention same keywords
   c. Count models that flagged this issue
   d. Assign consensus level

Example:

Claude Review: "Missing input validation on POST /api/users"
Grok Review: "Input validation absent in user creation endpoint"
Gemini Review: "No validation for user POST endpoint"

Keywords: ["input validation", "POST", "/api/users", "user"]
Match: All 3 reviews mention these keywords
Consensus: UNANIMOUS (3/3 = 100%)
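
A naive version of this matching in bash (a sketch; it counts review files mentioning one phrase rather than a full keyword set):

# Hedged sketch: count how many reviews mention a keyword (case-insensitive).
count_flagging_models() {
  local keyword="$1"; shift
  local count=0 review
  for review in "$@"; do
    grep -qi "$keyword" "$review" && count=$((count + 1))
  done
  echo "$count"
}

# Usage: count_flagging_models "input validation" "$SESSION_DIR"/*-review.md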

Model Agreement Matrix:

Show which models agree on which issues:

Issue Matrix:

Issue                             Claude  Grok  Gemini  GPT-5  DeepSeek  Consensus
──────────────────────────────────────────────────────────────────────────────────
SQL injection in search              ✓      ✓     ✓       ✓       ✓      UNANIMOUS
Missing input validation             ✓      ✓     ✓       ✓       ✗      STRONG
Weak password hashing                ✓      ✓     ✓       ✗       ✗      MAJORITY
Missing rate limiting                ✓      ✓     ✗       ✗       ✗      DIVERGENT
Insufficient error handling          ✓      ✗     ✗       ✗       ✗      DIVERGENT

Prioritized Issue List:

Sort issues by consensus level, then by severity:

Top 10 Issues (Prioritized):

1. [UNANIMOUS - CRITICAL] SQL injection in search endpoint
   Flagged by: Claude, Grok, Gemini, GPT-5, DeepSeek (5/5)

2. [UNANIMOUS - HIGH] Missing input validation on POST /api/users
   Flagged by: Claude, Grok, Gemini, GPT-5, DeepSeek (5/5)

3. [STRONG - HIGH] Weak password hashing (bcrypt rounds too low)
   Flagged by: Claude, Grok, Gemini, GPT-5 (4/5)

4. [STRONG - MEDIUM] Missing rate limiting on auth endpoints
   Flagged by: Claude, Grok, Gemini, GPT-5 (4/5)

5. [MAJORITY - MEDIUM] Insufficient error handling in payment flow
   Flagged by: Claude, Grok, Gemini (3/5)

... (remaining issues)

Future Enhancement (v1.1+): Semantic Similarity

Instead of keyword matching, use semantic similarity:
  - Embed issue descriptions with sentence-transformers
  - Calculate cosine similarity between embeddings
  - Issues with >0.8 similarity are "same issue"
  - More accurate consensus detection

Pattern 7: Statistics Collection and Analysis

Purpose: Track model performance to help users identify slow or poorly-performing models for future exclusion.

Storage Location: ai-docs/llm-performance.json (persistent across all sessions)

When to Collect Statistics:

  • After each model completes (success, failure, or timeout)
  • During consolidation phase (quality scores)
  • At session end (session summary)

File Structure (ai-docs/llm-performance.json):

{
  "schemaVersion": "2.0.0",
  "lastUpdated": "2025-12-12T10:45:00Z",
  "models": {
    "claude-embedded": {
      "modelId": "claude-embedded",
      "provider": "Anthropic",
      "isFree": true,
      "pricing": "FREE (embedded)",
      "totalRuns": 12,
      "successfulRuns": 12,
      "failedRuns": 0,
      "totalExecutionTime": 420,
      "avgExecutionTime": 35,
      "minExecutionTime": 28,
      "maxExecutionTime": 52,
      "totalIssuesFound": 96,
      "avgQualityScore": 92,
      "totalCost": 0,
      "qualityScores": [95, 90, 88, 94, 91],
      "lastUsed": "2025-12-12T10:35:22Z",
      "trend": "stable",
      "history": [
        {
          "timestamp": "2025-12-12T10:35:22Z",
          "session": "review-20251212-103522-a3f2",
          "status": "success",
          "executionTime": 32,
          "issuesFound": 8,
          "qualityScore": 95,
          "cost": 0
        }
      ]
    },
    "x-ai-grok-code-fast-1": {
      "modelId": "x-ai/grok-code-fast-1",
      "provider": "X-ai",
      "isFree": false,
      "pricing": "$0.85/1M",
      "totalRuns": 10,
      "successfulRuns": 9,
      "failedRuns": 1,
      "totalCost": 0.12,
      "trend": "improving"
    },
    "qwen-qwen3-coder-free": {
      "modelId": "qwen/qwen3-coder:free",
      "provider": "Qwen",
      "isFree": true,
      "pricing": "FREE",
      "totalRuns": 5,
      "successfulRuns": 5,
      "failedRuns": 0,
      "totalCost": 0,
      "trend": "stable"
    }
  },
  "sessions": [
    {
      "sessionId": "review-20251212-103522-a3f2",
      "timestamp": "2025-12-12T10:35:22Z",
      "totalModels": 4,
      "successfulModels": 3,
      "failedModels": 1,
      "parallelTime": 120,
      "sequentialTime": 335,
      "speedup": 2.8,
      "totalCost": 0.018,
      "freeModelsUsed": 2
    }
  ],
  "recommendations": {
    "topPaid": ["x-ai/grok-code-fast-1", "google/gemini-3-pro-preview"],
    "topFree": ["qwen/qwen3-coder:free", "mistralai/devstral-2512:free"],
    "bestValue": ["x-ai/grok-code-fast-1"],
    "avoid": [],
    "lastGenerated": "2025-12-12T10:45:00Z"
  }
}

Key Benefits of Persistent Storage:

  • Track model reliability over time (not just one session)
  • Identify consistently slow models
  • Calculate historical success rates
  • Generate data-driven shortlist recommendations

How to Calculate Quality Score:

Quality = % of model's issues that appear in unanimous or strong consensus

quality_score = (issues_in_unanimous + issues_in_strong) / total_issues * 100

Example:
- Model found 10 issues
- 4 appear in unanimous consensus
- 3 appear in strong consensus
- Quality = (4 + 3) / 10 * 100 = 70%

Higher quality means the model finds issues other models agree with.
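
As a helper (a sketch assuming the per-consensus counts have already been extracted from the consolidated review):

# Hedged sketch: quality score from consensus membership counts.
quality_score() {
  local in_unanimous="$1" in_strong="$2" total="$3"
  echo $(( (in_unanimous + in_strong) * 100 / total ))
}

# Example from above: quality_score 4 3 10 → 70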

How to Calculate Parallel Speedup:

speedup = sum(all_execution_times) / max(execution_time)

Example:
- Claude: 32s
- Grok: 45s
- Gemini: 38s
- GPT-5: 120s

Sequential would take: 32 + 45 + 38 + 120 = 235s
Parallel took: max(32, 45, 38, 120) = 120s
Speedup: 235 / 120 = 1.96x
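
The same arithmetic as a helper (times in whole seconds):

# Hedged sketch: parallel speedup from per-model execution times.
calc_speedup() {
  local sum=0 max=0 t
  for t in "$@"; do
    sum=$((sum + t))
    (( t > max )) && max=$t
  done
  awk -v s="$sum" -v m="$max" 'BEGIN { printf "%.2fx (%ds sequential → %ds parallel)\n", s/m, s, m }'
}

# Example from above: calc_speedup 32 45 38 120 → 1.96x (235s sequential → 120s parallel)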

Performance Statistics Display Format:

## Model Performance Statistics

| Model                     | Time   | Issues | Quality | Status    |
|---------------------------|--------|--------|---------|-----------|
| claude-embedded           | 32s    | 8      | 95%     | ✓         |
| x-ai/grok-code-fast-1     | 45s    | 6      | 85%     | ✓         |
| google/gemini-2.5-flash   | 38s    | 5      | 90%     | ✓         |
| openai/gpt-5.1-codex      | 120s   | 9      | 88%     | ✓ (slow)  |
| deepseek/deepseek-chat    | TIMEOUT| 0      | -       | ✗         |

**Session Summary:**
- Parallel Speedup: 1.96x (235s sequential → 120s parallel)
- Average Time: 59s
- Slowest: gpt-5.1-codex (2.0x avg)

**Recommendations:**
⚠️ gpt-5.1-codex runs 2x slower than average - consider removing
⚠️ deepseek-chat timed out - check API status or remove from shortlist
✓ Top performers: claude-embedded, gemini-2.5-flash (fast + high quality)

Recommendation Logic:

1. Flag SLOW models:
   if (model.executionTime > 2 * avgExecutionTime) {
     flag: "⚠️ Runs 2x+ slower than average"
     suggestion: "Consider removing from shortlist"
   }

2. Flag FAILED/TIMEOUT models:
   if (model.status !== "success") {
     flag: "⚠️ Failed or timed out"
     suggestion: "Check API status or increase timeout"
   }

3. Identify TOP PERFORMERS:
   if (model.qualityScore > 85 && model.executionTime < avgExecutionTime) {
     highlight: "✓ Top performer (fast + high quality)"
   }

4. Suggest SHORTLIST:
   sortedModels = models.sort((a, b) => {
     // Quality/speed ratio: higher quality + lower time = better
     scoreA = a.qualityScore / (a.executionTime / avgExecutionTime)
     scoreB = b.qualityScore / (b.executionTime / avgExecutionTime)
     return scoreB - scoreA
   })
   shortlist = sortedModels.slice(0, 3)

Implementation (writes to ai-docs/llm-performance.json):

# Track model performance after each model completes
# Updates historical aggregates and adds to run history
# Parameters: model_id, status, duration, issues, quality_score, cost, is_free
track_model_performance() {
  local model_id="$1"
  local status="$2"
  local duration="$3"
  local issues="${4:-0}"
  local quality_score="${5:-}"
  local cost="${6:-0}"
  local is_free="${7:-false}"

  local perf_file="ai-docs/llm-performance.json"
  local model_key=$(echo "$model_id" | tr '/:' '-')  # Handle colons in free model names

  # Initialize file if doesn't exist
  [[ -f "$perf_file" ]] || echo '{"schemaVersion":"2.0.0","models":{},"sessions":[],"recommendations":{}}' > "$perf_file"

  jq --arg model "$model_key" \
     --arg model_full "$model_id" \
     --arg status "$status" \
     --argjson duration "$duration" \
     --argjson issues "$issues" \
     --arg quality "${quality_score:-null}" \
     --argjson cost "$cost" \
     --argjson is_free "$is_free" \
     --arg now "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
     --arg session "${SESSION_ID:-unknown}" \
     '
     # Initialize model if not exists
     .models[$model] //= {"modelId":$model_full,"provider":"unknown","isFree":$is_free,
       "totalRuns":0,"successfulRuns":0,"failedRuns":0,
       "totalExecutionTime":0,"avgExecutionTime":0,"minExecutionTime":null,"maxExecutionTime":null,
       "totalIssuesFound":0,"avgQualityScore":null,"qualityScores":[],"totalCost":0,
       "lastUsed":null,"trend":"new","history":[]} |

     # Update aggregates
     .models[$model].totalRuns += 1 |
     .models[$model].successfulRuns += (if $status == "success" then 1 else 0 end) |
     .models[$model].failedRuns += (if $status != "success" then 1 else 0 end) |
     .models[$model].totalExecutionTime += $duration |
     .models[$model].avgExecutionTime = ((.models[$model].totalExecutionTime / .models[$model].totalRuns) | floor) |
     .models[$model].totalIssuesFound += $issues |
     .models[$model].totalCost += $cost |
     .models[$model].isFree = $is_free |
     .models[$model].lastUsed = $now |

     # Update min/max
     .models[$model].minExecutionTime = ([.models[$model].minExecutionTime, $duration] | map(select(. != null)) | min) |
     .models[$model].maxExecutionTime = ([.models[$model].maxExecutionTime, $duration] | max) |

     # Update quality scores and trend (if provided)
     (if $quality != "null" then
       .models[$model].qualityScores += [($quality|tonumber)] |
       .models[$model].avgQualityScore = ((.models[$model].qualityScores|add) / (.models[$model].qualityScores|length) | floor) |
       # Calculate trend (last 3 vs previous 3)
       (if (.models[$model].qualityScores | length) >= 6 then
         ((.models[$model].qualityScores[-3:] | add) / 3) as $recent |
         ((.models[$model].qualityScores[-6:-3] | add) / 3) as $previous |
         .models[$model].trend = (if ($recent - $previous) > 5 then "improving"
           elif ($recent - $previous) < -5 then "degrading"
           else "stable" end)
       else . end)
     else . end) |

     # Add to history (keep last 20)
     .models[$model].history = ([{"timestamp":$now,"session":$session,"status":$status,
       "executionTime":$duration,"issuesFound":$issues,"cost":$cost,
       "qualityScore":(if $quality != "null" then ($quality|tonumber) else null end)}] + .models[$model].history)[:20] |

     .lastUpdated = $now
     ' "$perf_file" > "${perf_file}.tmp" && mv "${perf_file}.tmp" "$perf_file"
}

# Usage examples:
# Paid models
track_model_performance "x-ai/grok-code-fast-1" "success" 45 6 87 0.002 false
track_model_performance "openai/gpt-5.1-codex" "success" 68 7 89 0.015 false

# Free models (cost=0, is_free=true)
track_model_performance "qwen/qwen3-coder:free" "success" 52 5 82 0 true
track_model_performance "mistralai/devstral-2512:free" "success" 48 5 84 0 true

# Embedded Claude (always free)
track_model_performance "claude-embedded" "success" 32 8 95 0 true

# Failed/timeout models
track_model_performance "some-model" "timeout" 120 0 "" 0 false

Record Session Summary:

record_session_stats() {
  local total="$1" success="$2" failed="$3"
  local parallel_time="$4" sequential_time="$5" speedup="$6"
  local total_cost="${7:-0}" free_models_used="${8:-0}"

  local perf_file="ai-docs/llm-performance.json"
  [[ -f "$perf_file" ]] || echo '{"schemaVersion":"2.0.0","models":{},"sessions":[],"recommendations":{}}' > "$perf_file"

  jq --arg session "${SESSION_ID:-unknown}" \
     --arg now "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
     --argjson total "$total" --argjson success "$success" --argjson failed "$failed" \
     --argjson parallel "$parallel_time" --argjson sequential "$sequential_time" --argjson speedup "$speedup" \
     --argjson cost "$total_cost" --argjson free_count "$free_models_used" \
     '.sessions = ([{"sessionId":$session,"timestamp":$now,"totalModels":$total,
       "successfulModels":$success,"failedModels":$failed,"parallelTime":$parallel,
       "sequentialTime":$sequential,"speedup":$speedup,"totalCost":$cost,
       "freeModelsUsed":$free_count}] + .sessions)[:50] | .lastUpdated = $now' \
     "$perf_file" > "${perf_file}.tmp" && mv "${perf_file}.tmp" "$perf_file"
}

# Usage:
# record_session_stats total success failed parallel_time sequential_time speedup total_cost free_count
record_session_stats 5 5 0 68 245 3.6 0.017 2

Get Recommendations from Historical Data:

get_model_recommendations() {
  local perf_file="ai-docs/llm-performance.json"
  [[ -f "$perf_file" ]] || { echo "No performance data yet."; return; }

  jq -r '
    (.models | to_entries | map(select(.value.successfulRuns > 0) | .value.avgExecutionTime) | add / length) as $avg |
    {
      "overallAvgTime": ($avg | floor),
      "slowModels": [.models | to_entries[] | select(.value.avgExecutionTime > ($avg * 2)) | .key],
      "unreliableModels": [.models | to_entries[] | select(.value.totalRuns >= 3 and (.value.failedRuns / .value.totalRuns) > 0.3) | .key],
      "topPaidPerformers": [.models | to_entries | map(select(.value.avgQualityScore != null and .value.avgQualityScore > 80 and .value.isFree == false and .value.avgExecutionTime <= $avg)) | sort_by(-.value.avgQualityScore)[:3] | .[].key],
      "topFreePerformers": [.models | to_entries | map(select(.value.avgQualityScore != null and .value.avgQualityScore > 75 and .value.isFree == true)) | sort_by(-.value.avgQualityScore)[:3] | .[].key],
      "bestValue": [.models | to_entries | map(select(.value.avgQualityScore != null and .value.totalCost > 0)) | sort_by(-(.value.avgQualityScore / (.value.totalCost / .value.totalRuns)))[:2] | .[].key],
      "degradingModels": [.models | to_entries[] | select(.value.trend == "degrading") | .key]
    }
  ' "$perf_file"
}

# Display formatted recommendations
display_recommendations() {
  local perf_file="ai-docs/llm-performance.json"
  [[ -f "$perf_file" ]] || { echo "No performance data yet. Run some validations first!"; return; }

  echo "## Model Recommendations (based on historical data)"
  echo ""

  # Top paid performers
  echo "### 💰 Top Paid Models"
  jq -r '.models | to_entries | map(select(.value.isFree == false and .value.avgQualityScore != null)) | sort_by(-.value.avgQualityScore)[:3] | .[] | "- \(.value.modelId): Quality \(.value.avgQualityScore)%, Avg \(.value.avgExecutionTime)s, Cost $\(.value.totalCost | . * 100 | floor / 100)"' "$perf_file"
  echo ""

  # Top free performers
  echo "### 🆓 Top Free Models"
  jq -r '.models | to_entries | map(select(.value.isFree == true and .value.avgQualityScore != null and .key != "claude-embedded")) | sort_by(-.value.avgQualityScore)[:3] | .[] | "- \(.value.modelId): Quality \(.value.avgQualityScore)%, Avg \(.value.avgExecutionTime)s"' "$perf_file"
  echo ""

  # Models to avoid
  echo "### ⚠️ Consider Avoiding"
  jq -r '
    (.models | to_entries | map(select(.value.successfulRuns > 0) | .value.avgExecutionTime) | add / length) as $avg |
    .models | to_entries[] |
    select(
      (.value.avgExecutionTime > ($avg * 2)) or
      (.value.totalRuns >= 3 and (.value.failedRuns / .value.totalRuns) > 0.3) or
      (.value.trend == "degrading")
    ) |
    "- \(.key): " +
    (if .value.avgExecutionTime > ($avg * 2) then "⏱️ Slow (2x+ avg)" else "" end) +
    (if .value.totalRuns >= 3 and (.value.failedRuns / .value.totalRuns) > 0.3 then " ❌ Unreliable (>\(.value.failedRuns)/\(.value.totalRuns) failures)" else "" end) +
    (if .value.trend == "degrading" then " 📉 Quality degrading" else "" end)
  ' "$perf_file"
}

Pattern 8: Data-Driven Model Selection (NEW v3.0)

Purpose: Use historical performance data to make intelligent model selection recommendations.

The Problem:

Users often select models arbitrarily or based on outdated information:

  • "I'll use GPT-5 because it's famous"
  • "Let me try this new model I heard about"
  • "I'll use the same 5 models every time"

The Solution:

Use accumulated performance data to recommend:

  1. Top performers (highest quality scores)
  2. Best value (quality/cost ratio)
  3. Top free models (high quality, zero cost)
  4. Models to avoid (slow, unreliable, or degrading)

Model Selection Algorithm:

1. Load historical data from ai-docs/llm-performance.json

2. Calculate metrics for each model:
   - Success Rate = successfulRuns / totalRuns × 100
   - Quality Score = avgQualityScore (from consensus analysis)
   - Speed Score = avgExecutionTime relative to overall average
   - Value Score = avgQualityScore / (totalCost / totalRuns)

3. Categorize models:
   TOP PAID: Quality > 80%, Success > 90%, Speed <= avg
   TOP FREE: Quality > 75%, Success > 90%, isFree = true
   BEST VALUE: Highest Quality/Cost ratio among paid models
   AVOID: Speed > 2x avg OR Success < 70% OR trend = "degrading"

4. Present recommendations with context:
   - Show historical metrics
   - Highlight trends (improving/stable/degrading)
   - Flag new models with insufficient data

Interactive Model Selection with Recommendations:

Instead of just displaying recommendations, use AskUserQuestion with multiSelect to let users interactively choose:

// Build options from claudish output + historical data
const paidModels = getTopModelsFromClaudish();  // claudish --top-models
const freeModels = getFreeModelsFromClaudish(); // claudish --free
const history = loadPerformanceHistory();        // ai-docs/llm-performance.json

// Merge and build AskUserQuestion options
AskUserQuestion({
  questions: [{
    question: "Select models for validation (Claude internal always included). Based on 25 sessions across 8 models.",
    header: "Models",
    multiSelect: true,
    options: [
      // Top paid with historical data
      {
        label: "x-ai/grok-code-fast-1 ⚡ (Recommended)",
        description: "$0.85/1M | Quality: 87% | Avg: 42s | Fast + accurate"
      },
      {
        label: "google/gemini-3-pro-preview 🎯",
        description: "$7.00/1M | Quality: 91% | Avg: 55s | High accuracy"
      },
      // Top free models
      {
        label: "qwen/qwen3-coder:free 🆓",
        description: "FREE | Quality: 82% | 262K | Coding-specialized"
      },
      {
        label: "mistralai/devstral-2512:free 🆓",
        description: "FREE | Quality: 84% | 262K | Dev-focused"
      }
      // Note: Models to AVOID are simply not shown in options
      // Note: New models show "(new)" instead of quality score
    ]
  }]
})

Key Principles for Model Selection UI:

  1. Put recommended models first with "(Recommended)" suffix
  2. Include historical metrics in description (Quality %, Avg time)
  3. Mark free models with 🆓 emoji
  4. Don't show models to avoid - just exclude them from options
  5. Mark new models with "(new)" when no historical data
  6. Remember selection - save to $SESSION_DIR/selected-models.txt

After Selection - Save to Session:

# User selected: grok-code-fast-1, qwen3-coder:free
# Save for session persistence
save_session_models "$SESSION_DIR" "${USER_SELECTED_MODELS[@]}"

# Now $SESSION_DIR/selected-models.txt contains:
# claude-embedded
# x-ai/grok-code-fast-1
# qwen/qwen3-coder:free

Warning Display (separate from selection):

If there are models to avoid, show a brief warning before the selection:

⚠️ Models excluded from selection (poor historical performance):
- openai/gpt-5.1-codex: Slow (2.1x avg)
- some-model: 60% success rate

Automatic Shortlist Generation:

# Generate optimal shortlist based on criteria
generate_shortlist() {
  local criteria="${1:-balanced}"  # balanced, quality, budget, free-only
  local perf_file="ai-docs/llm-performance.json"

  case "$criteria" in
    "balanced")
      # 1 internal + 1 fast paid + 1 free
      echo "claude-embedded"
      jq -r '.models | to_entries | map(select(.value.isFree == false and .value.avgQualityScore > 80)) | sort_by(.value.avgExecutionTime)[0].key' "$perf_file"
      jq -r '.models | to_entries | map(select(.value.isFree == true and .key != "claude-embedded" and .value.avgQualityScore > 75)) | sort_by(-.value.avgQualityScore)[0].key' "$perf_file"
      ;;
    "quality")
      # Top 3 by quality regardless of cost
      echo "claude-embedded"
      jq -r '.models | to_entries | map(select(.value.avgQualityScore != null and .key != "claude-embedded")) | sort_by(-.value.avgQualityScore)[:2] | .[].key' "$perf_file"
      ;;
    "budget")
      # Internal + 2 cheapest performers
      echo "claude-embedded"
      jq -r '.models | to_entries | map(select(.value.avgQualityScore > 75 and .value.isFree == true)) | sort_by(-.value.avgQualityScore)[:2] | .[].key' "$perf_file"
      ;;
    "free-only")
      # Only free models
      echo "claude-embedded"
      jq -r '.models | to_entries | map(select(.value.isFree == true and .key != "claude-embedded" and .value.avgQualityScore != null)) | sort_by(-.value.avgQualityScore)[:2] | .[].key' "$perf_file"
      ;;
  esac
}

# Usage:
generate_shortlist "balanced"   # For most use cases
generate_shortlist "quality"    # When accuracy is critical
generate_shortlist "budget"     # When cost matters
generate_shortlist "free-only"  # Zero-cost validation

Integration with Model Discovery:

Workflow:
1. Run `claudish --top-models` → Get current paid models
2. Run `claudish --free` → Get current free models
3. Load ai-docs/llm-performance.json → Get historical performance
4. Merge data:
   - New models (no history): Mark as "🆕 New"
   - Known models: Show performance metrics
   - Deprecated models: Filter out (not in claudish output)
5. Generate recommendations
6. Present to user with AskUserQuestion

Why This Matters:

| Selection Method | Outcome |
|------------------|---------|
| Random/arbitrary | Hit-or-miss, may waste money on slow models |
| Always same models | Miss new better options, stuck with degrading ones |
| Data-driven | Optimal quality/cost/speed balance, continuous improvement |

Over time, the system learns which models work best for YOUR codebase and validation patterns.


Integrating Statistics in Your Plugin

To add LLM performance tracking to your plugin's commands:

Step 1: Reference This Skill

Add to your command's frontmatter:

skills: orchestration:multi-model-validation

Step 2: Track Each Model Execution

After each external model completes:

# Parameters: model_id, status, duration_seconds, issues_found, quality_score
track_model_performance "x-ai/grok-code-fast-1" "success" 45 6 85

Step 3: Record Session Summary

At the end of multi-model execution:

# Parameters: total, successful, failed, parallel_time, sequential_time, speedup
record_session_stats 4 3 1 120 335 2.8

Step 4: Display Statistics

In your finalization phase, show:

  1. This session's model performance table
  2. Historical performance (if ai-docs/llm-performance.json exists)
  3. Recommendations for slow/unreliable models

Example Integration (in command.md)

<phase name="External Review">
  <steps>
    <step>Record start time: PHASE_START=$(date +%s)</step>
    <step>Run external models in parallel (single message, multiple Task calls)</step>
    <step>
      After completion, track each model:
      track_model_performance "{model}" "{status}" "{duration}" "{issues}" "{quality}"
    </step>
    <step>
      Record session:
      record_session_stats $TOTAL $SUCCESS $FAILED $PARALLEL $SEQUENTIAL $SPEEDUP
    </step>
  </steps>
</phase>

<phase name="Finalization">
  <steps>
    <step>
      Display Model Performance Statistics (read from ai-docs/llm-performance.json)
    </step>
    <step>Show recommendations for slow/failing models</step>
  </steps>
</phase>

Plugins Using This Pattern

| Plugin | Command | Usage |
|--------|---------|-------|
| frontend | /review | Full implementation with historical tracking |
| agentdev | /develop | Plan review + quality review tracking |

Integration with Other Skills

multi-model-validation + quality-gates:

Use Case: Cost approval before expensive multi-model review

Step 1: Cost Estimation (multi-model-validation)
  Calculate input/output tokens
  Estimate cost range

Step 2: User Approval Gate (quality-gates)
  Present cost estimate
  Ask user for approval
  If NO: Offer alternatives or abort
  If YES: Proceed with execution

Step 3: Parallel Execution (multi-model-validation)
  Follow 4-Message Pattern
  Launch all models simultaneously

multi-model-validation + error-recovery:

Use Case: Handling external model failures gracefully

Step 1: Parallel Execution (multi-model-validation)
  Launch 5 external models

Step 2: Error Handling (error-recovery)
  Model 1: Success
  Model 2: Timeout after 30s → Skip, continue with others
  Model 3: API 500 error → Retry once, then skip (sketched below)
  Model 4: Success
  Model 5: Success

Step 3: Partial Success Strategy (error-recovery)
  3/5 models succeeded (≥ 2 threshold)
  Proceed with consolidation using 3 reviews
  Notify user: "2 models failed, proceeding with 3 reviews"

Step 4: Consolidation (multi-model-validation)
  Consolidate 3 successful reviews
  Apply consensus analysis
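
The retry-once behavior from Step 2 can be sketched as a wrapper (a sketch, not error-recovery's actual implementation; the 300s timeout is an illustrative value):

# Hedged sketch: run an external model with one retry, then give up.
run_with_retry() {
  local model="$1" prompt_file="$2" out_file="$3"
  local attempt
  for attempt in 1 2; do
    if timeout 300 claudish --model "$model" --stdin --quiet --auto-approve \
         < "$prompt_file" > "$out_file"; then
      return 0
    fi
    echo "Attempt $attempt failed for $model" >&2
  done
  return 1
}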

multi-model-validation + todowrite-orchestration:

Use Case: Real-time progress tracking during parallel execution

Step 1: Initialize TodoWrite (todowrite-orchestration)
  Tasks:
    1. Prepare workspace
    2. Launch Claude review
    3. Launch Grok review
    4. Launch Gemini review
    5. Launch GPT-5 review
    6. Consolidate reviews
    7. Present results

Step 2: Update Progress (todowrite-orchestration)
  Mark tasks complete as models finish:
    - Claude completes → Mark task 2 complete
    - Grok completes → Mark task 3 complete
    - Gemini completes → Mark task 4 complete
    - GPT-5 completes → Mark task 5 complete

Step 3: User Sees Real-Time Progress
  "3/4 external models completed, 1 in progress..."

Best Practices

Do:

  • ✅ Use 4-Message Pattern for true parallel execution
  • ✅ Provide cost estimates BEFORE execution
  • ✅ Ask user approval for costs >$0.01
  • ✅ Auto-trigger consolidation when N ≥ 2 reviews complete
  • ✅ Use blocking (synchronous) claudish execution
  • ✅ Write full output to files, return brief summaries
  • ✅ Prioritize by consensus level (unanimous → strong → majority → divergent; see the sketch below)
  • ✅ Show model agreement matrix
  • ✅ Handle partial success gracefully (some models may fail)
  • ✅ Track execution time per model (NEW v2.0)
  • ✅ Calculate and display quality scores (NEW v2.0)
  • ✅ Show performance statistics table at end of session (NEW v2.0)
  • ✅ Generate recommendations for slow/failing models (NEW v2.0)

Don't:

  • ❌ Mix tool types in Message 2 (breaks parallelism)
  • ❌ Use background claudish execution (returns before completion)
  • ❌ Wait for user to request consolidation (auto-trigger instead)
  • ❌ Consolidate with < 2 successful reviews (no meaningful consensus)
  • ❌ Inline full reviews in consolidation prompt (use file paths)
  • ❌ Return full 500-line reviews to orchestrator (use brief summaries)
  • ❌ Skip cost approval gate for expensive operations
  • ❌ Skip statistics display (users need data to optimize model selection)
  • ❌ Keep slow models in shortlist (flag models 2x+ slower than average)

Performance:

  • Parallel execution: 3-5x faster than sequential
  • Message 2 speedup: 15 min → 5 min with 5 models
  • Context efficiency: Brief summaries save 50-80% context
  • Statistics overhead: <1 second (jq operations are fast)
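The consensus prioritization above can be approximated with a small shell sketch. It assumes each review file marks findings on lines beginning with "ISSUE:" and that exactly four models ran; both are illustrative assumptions, since real consolidation is performed by the consolidation agent:

# Count how many reviews flag each issue title and map counts to consensus levels
grep -h '^ISSUE: ' "$SESSION_DIR"/*-review.md | sort | uniq -c | sort -rn |
while read -r count issue; do
  case "$count" in
    4) level="UNANIMOUS" ;;  # all 4 models agree
    3) level="STRONG" ;;
    2) level="MAJORITY" ;;
    *) level="DIVERGENT" ;;
  esac
  echo "[$level] ${issue#ISSUE: }"
done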

Examples

Example 1: Dynamic Model Discovery + Review

Scenario: User requests "Let's run external models to validate our solution"

Execution:

Message 1: Session Setup + Model Discovery
  # Create unique session
  Bash: SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 4 /dev/urandom | xxd -p)"
  Bash: SESSION_DIR="/tmp/${SESSION_ID}" && mkdir -p "$SESSION_DIR"
  Output: Session: review-20251212-143052-a3f2

  # Discover available models
  Bash: claudish --top-models
  Output:
    google/gemini-3-pro-preview    Google     $7.00/1M   1048K   🔧 🧠 👁️
    openai/gpt-5.1-codex           Openai     $5.63/1M   400K    🔧 🧠 👁️
    x-ai/grok-code-fast-1          X-ai       $0.85/1M   256K    🔧 🧠
    minimax/minimax-m2             Minimax    $0.64/1M   262K    🔧 🧠

  Bash: claudish --free
  Output:
    qwen/qwen3-coder:free          Qwen       FREE       262K    ✓ · ·
    mistralai/devstral-2512:free   Mistralai  FREE       262K    ✓ · ·
    qwen/qwen3-235b-a22b:free      Qwen       FREE       131K    ✓ ✓ ·

  # Load historical performance
  Bash: cat ai-docs/llm-performance.json | jq '.models | keys'
  Output: ["claude-embedded", "x-ai-grok-code-fast-1", "qwen-qwen3-coder-free"]

  # Prepare code context
  Bash: git diff > "$SESSION_DIR/code-context.md"

Message 2: Model Selection (AskUserQuestion with multiSelect)
  # Use AskUserQuestion tool with multiSelect: true
  AskUserQuestion({
    questions: [{
      question: "Which external models should validate your code? (Internal Claude always included)",
      header: "Models",
      multiSelect: true,
      options: [
        { label: "x-ai/grok-code-fast-1 ⚡", description: "$0.85/1M | Quality: 87% | Avg: 42s" },
        { label: "google/gemini-3-pro-preview", description: "$7.00/1M | New model, no history" },
        { label: "qwen/qwen3-coder:free 🆓", description: "FREE | Quality: 82% | Coding-specialized" },
        { label: "mistralai/devstral-2512:free 🆓", description: "FREE | Dev-focused, new model" }
      ]
    }]
  })

  # User selects via interactive UI:
  # ☑ x-ai/grok-code-fast-1
  # ☐ google/gemini-3-pro-preview
  # ☑ qwen/qwen3-coder:free
  # ☑ mistralai/devstral-2512:free

  # Save selection to session for later use
  save_session_models "$SESSION_DIR" "x-ai/grok-code-fast-1" "qwen/qwen3-coder:free" "mistralai/devstral-2512:free"

  # Session now has:
  # $SESSION_DIR/selected-models.txt containing:
  # claude-embedded (always)
  # x-ai/grok-code-fast-1
  # qwen/qwen3-coder:free
  # mistralai/devstral-2512:free

Message 3: Parallel Execution (Task only - single message)
  Task: senior-code-reviewer
    Prompt: "Review $SESSION_DIR/code-context.md.
             Write to $SESSION_DIR/claude-review.md"
  ---
  Task: codex-code-reviewer PROXY_MODE: x-ai/grok-code-fast-1
    Prompt: "Review $SESSION_DIR/code-context.md.
             Write to $SESSION_DIR/grok-review.md"
  ---
  Task: codex-code-reviewer PROXY_MODE: qwen/qwen3-coder:free
    Prompt: "Review $SESSION_DIR/code-context.md.
             Write to $SESSION_DIR/qwen-coder-review.md"
  ---
  Task: codex-code-reviewer PROXY_MODE: mistralai/devstral-2512:free
    Prompt: "Review $SESSION_DIR/code-context.md.
             Write to $SESSION_DIR/devstral-review.md"

  All 4 execute simultaneously!

Message 4: Auto-Consolidation + Statistics Update
  # Consolidate
  Task: senior-code-reviewer
    Prompt: "Consolidate 4 reviews from $SESSION_DIR/*.md"

  # Track performance
  track_model_performance "claude-embedded" "success" 32 8 95 0 true
  track_model_performance "x-ai/grok-code-fast-1" "success" 45 6 87 0.002 false
  track_model_performance "qwen/qwen3-coder:free" "success" 52 5 82 0 true
  track_model_performance "mistralai/devstral-2512:free" "success" 48 5 84 0 true

  record_session_stats 4 4 0 52 177 3.4 0.002 3

Message 5: Present Results
  "Multi-model review complete! Session: review-20251212-143052-a3f2

   Top Issues (Consensus):
   1. [UNANIMOUS] SQL injection in search endpoint
   2. [STRONG] Missing input validation (3/4 models)
   3. [MAJORITY] Weak password hashing (2/4 models)

   Model Performance (this session):
   | Model                        | Time | Issues | Quality | Cost   |
   |------------------------------|------|--------|---------|--------|
   | claude-embedded              | 32s  | 8      | 95%     | FREE   |
   | x-ai/grok-code-fast-1        | 45s  | 6      | 87%     | $0.002 |
   | qwen/qwen3-coder:free        | 52s  | 5      | 82%     | FREE   |
   | mistralai/devstral-2512:free | 48s  | 5      | 84%     | FREE   |

   Session Stats:
   - Parallel Speedup: 3.4x (177s → 52s)
   - Total Cost: $0.002 (3 free models used!)

   Performance logged to ai-docs/llm-performance.json
   See $SESSION_DIR/consolidated-review.md for details."

Result: Dynamic model discovery, user selection, 3 free models, data-driven optimization


Example 2: Partial Success with Error Recovery

Scenario: 4 models selected, 2 fail

Execution:

Message 1: Preparation
  (same as Example 1)

Message 2: Parallel Execution
  Task: senior-code-reviewer (embedded)
  Task: PROXY_MODE grok (external)
  Task: PROXY_MODE gemini (external)
  Task: PROXY_MODE gpt-5-codex (external)

Message 3: Error Recovery (error-recovery skill)
  results = await Promise.allSettled([...]);

  Results:
    - Claude: Success ✓
    - Grok: Timeout after 30s ✗
    - Gemini: API 500 error ✗
    - GPT-5: Success ✓

  successful.length = 2 (Claude + GPT-5)
  2 ≥ 2 ✓ (threshold met, can proceed)

  Notify user:
    "2/4 models succeeded (Grok timeout, Gemini error).
     Proceeding with consolidation using 2 reviews."

Message 4: Auto-Consolidation
  Task: senior-code-reviewer
    Prompt: "Consolidate 2 reviews from:
             - ai-docs/reviews/claude-review.md
             - ai-docs/reviews/gpt5-review.md

             Note: Only 2 models (Grok and Gemini failed)."

Message 5: Present Results
  "Multi-model review complete (2/4 models succeeded).

   Top Issues (2-model consensus):
   1. [UNANIMOUS] SQL injection (both flagged)
   2. [DIVERGENT] Input validation (Claude only)
   3. [DIVERGENT] Rate limiting (GPT-5 only)

   Note: Grok and Gemini failed. Limited consensus data.
   See ai-docs/consolidated-review.md for details."

Result: Graceful degradation, useful results despite failures


Troubleshooting

Problem: Models executing sequentially instead of parallel

Cause: Mixed tool types in Message 2

Solution: Use ONLY Task calls in Message 2

❌ Wrong:
  Message 2:
    TodoWrite({...})
    Task({...})
    Task({...})

✅ Correct:
  Message 1: TodoWrite({...}) (separate message)
  Message 2: Task({...}); Task({...}) (only Task)

Problem: Agent returns before external model completes

Cause: Background claudish execution

Solution: Use synchronous (blocking) execution

❌ Wrong:
  claudish --model grok ... &

✅ Correct:
  RESULT=$(claudish --model grok ...)

Problem: Consolidation never triggers

Cause: Waiting for user to request it

Solution: Auto-trigger when N ≥ 2 reviews complete

❌ Wrong:
  if (results.length >= 2) {
    notifyUser("Ready to consolidate. Proceed?");
    // Waits for user...
  }

✅ Correct:
  if (results.length >= 2) {
    // Auto-trigger, don't wait
    await consolidate();
  }

Problem: Costs higher than estimated

Cause: Underestimated output tokens

Solution: Use range-based estimates, bias toward high end

✅ Better Estimation:
  Output: 3,000 - 5,000 tokens (range, not single number)
  Cost: $0.005 - $0.010 (gives user realistic expectation)
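A sketch of the range computation, with illustrative token counts and the grok rate from the model listing earlier:

# Range-based estimate (numbers illustrative; rate from claudish --top-models)
INPUT_TOKENS=10000
OUT_LOW=3000
OUT_HIGH=5000
RATE=0.85   # $/1M tokens, x-ai/grok-code-fast-1
LOW=$(echo  "scale=4; ($INPUT_TOKENS + $OUT_LOW)  * $RATE / 1000000" | bc)
HIGH=$(echo "scale=4; ($INPUT_TOKENS + $OUT_HIGH) * $RATE / 1000000" | bc)
echo "Estimated cost: \$$LOW - \$$HIGH"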

⚠️ MANDATORY: Statistics Collection Checklist

Statistics are NOT optional: a multi-model validation run is INCOMPLETE without performance tracking.

Why This Matters

Real-world feedback showed that agents often:

  • ❌ Forget to instrument timing
  • ❌ Skip statistics because Task tool doesn't return timing
  • ❌ Get caught up in execution and forget the statistics phase
  • ❌ Present results without performance data

This checklist prevents those failures.

Complete Tracking Protocol

For the complete tracking protocol including:

  • Pre-launch checklist (8 required items)
  • Tracking table templates (simple, detailed, session-based)
  • Failure documentation format
  • Consensus analysis requirements
  • Results presentation template

See: orchestration:model-tracking-protocol

The tracking protocol skill provides copy-paste templates that make compliance easy and hard to forget.

Pre-Flight Checklist (Before Launching Models)

# 1. Record session start time (REQUIRED)
SESSION_START=$(date +%s)
echo "Session started at: $SESSION_START"

# 2. Create timing tracker file in session directory
echo "{}" > "$SESSION_DIR/timing.json"

# 3. Initialize per-model start times array
declare -A MODEL_START_TIMES

Per-Model Timing (During Execution)

CRITICAL: Record start time BEFORE launching each model:

# Before launching each Task
MODEL_START_TIMES["claude-embedded"]=$(date +%s)
MODEL_START_TIMES["x-ai/grok-code-fast-1"]=$(date +%s)
MODEL_START_TIMES["qwen/qwen3-coder:free"]=$(date +%s)

# After each TaskOutput returns, calculate duration
model_completed() {
  local model="$1"
  local status="$2"
  local issues="${3:-0}"
  local quality="${4:-}"
  local cost="${5:-0}"
  local is_free="${6:-true}"

  local end_time=$(date +%s)
  local start_time="${MODEL_START_TIMES[$model]}"
  local duration=$((end_time - start_time))

  echo "Model $model completed in ${duration}s"

  # Track immediately (don't wait until end)
  track_model_performance "$model" "$status" "$duration" "$issues" "$quality" "$cost" "$is_free"
}

# Call when each model completes
model_completed "claude-embedded" "success" 8 95
model_completed "x-ai/grok-code-fast-1" "success" 6 87

Post-Consolidation Checklist (MANDATORY)

Before presenting results to user, you MUST complete ALL of these:

□ 1. Calculate duration for EACH model
      DURATION=$((END_TIME - START_TIME))

□ 2. Call track_model_performance() for EACH model
      track_model_performance "model-id" "status" duration issues quality cost is_free

□ 3. Calculate parallel vs sequential times
      PARALLEL_TIME=$(max of all durations)
      SEQUENTIAL_TIME=$(sum of all durations)
      SPEEDUP=$(echo "scale=1; $SEQUENTIAL_TIME / $PARALLEL_TIME" | bc)

□ 4. Call record_session_stats()
      record_session_stats $TOTAL $SUCCESS $FAILED $PARALLEL_TIME $SEQUENTIAL_TIME $SPEEDUP $COST $FREE_COUNT

□ 5. Verify ai-docs/llm-performance.json was updated
      [ -f "ai-docs/llm-performance.json" ] && echo "✓ Stats saved"

□ 6. Display performance table (see template below)

FAILURE TO COMPLETE ALL 6 STEPS = INCOMPLETE REVIEW

Complete Timing Example

#!/bin/bash
# Full timing instrumentation example

# === PRE-FLIGHT ===
SESSION_START=$(date +%s)
declare -A MODEL_START_TIMES
declare -A MODEL_END_TIMES
declare -A MODEL_DURATIONS

# === LAUNCH PHASE ===
# Record start times BEFORE launching Tasks
MODEL_START_TIMES["claude-embedded"]=$SESSION_START
MODEL_START_TIMES["x-ai/grok-code-fast-1"]=$SESSION_START
MODEL_START_TIMES["qwen/qwen3-coder:free"]=$SESSION_START

# Launch all Tasks in parallel (Message 2)
# ... Task calls here ...

# === COMPLETION PHASE ===
# After TaskOutput returns for each model
record_completion() {
  local model="$1"
  MODEL_END_TIMES["$model"]=$(date +%s)
  MODEL_DURATIONS["$model"]=$((MODEL_END_TIMES["$model"] - MODEL_START_TIMES["$model"]))
}

# Call as each completes
record_completion "claude-embedded"
record_completion "x-ai/grok-code-fast-1"
record_completion "qwen/qwen3-coder:free"

# === STATISTICS PHASE ===
# Calculate totals
PARALLEL_TIME=0
SEQUENTIAL_TIME=0
for model in "${!MODEL_DURATIONS[@]}"; do
  duration="${MODEL_DURATIONS[$model]}"
  SEQUENTIAL_TIME=$((SEQUENTIAL_TIME + duration))
  if [ "$duration" -gt "$PARALLEL_TIME" ]; then
    PARALLEL_TIME=$duration
  fi
done
SPEEDUP=$(echo "scale=1; $SEQUENTIAL_TIME / $PARALLEL_TIME" | bc)

# Track each model
track_model_performance "claude-embedded" "success" "${MODEL_DURATIONS[claude-embedded]}" 8 95 0 true
track_model_performance "x-ai/grok-code-fast-1" "success" "${MODEL_DURATIONS[x-ai/grok-code-fast-1]}" 6 87 0.002 false
track_model_performance "qwen/qwen3-coder:free" "success" "${MODEL_DURATIONS[qwen/qwen3-coder:free]}" 5 82 0 true

# Record session
record_session_stats 3 3 0 $PARALLEL_TIME $SEQUENTIAL_TIME $SPEEDUP 0.002 2

echo "Statistics collection complete!"

Required Output Template

Your final message to the user MUST include this table:

## Model Performance (This Session)

| Model                     | Time  | Issues | Quality | Cost   | Status |
|---------------------------|-------|--------|---------|--------|--------|
| claude-embedded           | 32s   | 8      | 95%     | FREE   | ✅     |
| x-ai/grok-code-fast-1     | 45s   | 6      | 87%     | $0.002 | ✅     |
| qwen/qwen3-coder:free     | 52s   | 5      | 82%     | FREE   | ✅     |

## Session Statistics

- **Parallel Time:** 52s (slowest model)
- **Sequential Time:** 129s (sum of all)
- **Speedup:** 2.5x
- **Total Cost:** $0.002
- **Free Models Used:** 2/3

✓ Performance logged to `ai-docs/llm-performance.json`

Verification Before Presenting

Run this check before your final message:

verify_statistics_complete() {
  local errors=0

  # Check file exists; bail out early since the jq checks below depend on it
  if [ ! -f "ai-docs/llm-performance.json" ]; then
    echo "ERROR: ai-docs/llm-performance.json not found"
    echo "STATISTICS INCOMPLETE - 1 error found"
    return 1
  fi

  # Check session was recorded
  if ! jq -e '.sessions[0]' ai-docs/llm-performance.json >/dev/null 2>&1; then
    echo "ERROR: No session recorded"
    errors=$((errors + 1))
  fi

  # Check models were tracked
  local model_count
  model_count=$(jq '.models | length' ai-docs/llm-performance.json 2>/dev/null)
  if [ "${model_count:-0}" -eq 0 ]; then
    echo "ERROR: No models tracked"
    errors=$((errors + 1))
  fi

  if [ "$errors" -gt 0 ]; then
    echo "STATISTICS INCOMPLETE - $errors errors found"
    return 1
  fi

  echo "✓ Statistics verification passed"
  return 0
}
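Typical usage, right before composing the final message:

# Refuse to present results until verification passes
verify_statistics_complete || echo "Complete the statistics checklist before presenting results"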

Common Mistakes and Fixes

| Mistake                                    | Fix                                          |
|--------------------------------------------|----------------------------------------------|
| "I'll track timing later"                  | Record start time BEFORE launching           |
| "Task tool doesn't return timing"          | Use bash timestamps around Task calls        |
| "Too complex with parallel agents"         | Use associative arrays for per-model times   |
| "Forgot to call track_model_performance"   | Add to checklist, verify file updated        |
| "Presented results without table"          | Use the required output template             |

Summary

Multi-model validation achieves 3-5x speedup and consensus-based prioritization through:

  • Pattern 0: Session Setup (NEW v3.0) - Unique session directories, dynamic model discovery
  • Pattern 1: 4-Message Pattern - True parallel execution
  • Pattern 2: Parallel Architecture - Single message, multiple Task calls
  • Pattern 3: Proxy Mode - Blocking execution via Claudish
  • Pattern 4: Cost Transparency - Estimate before, report after
  • Pattern 5: Auto-Consolidation - Triggered when N ≥ 2 complete
  • Pattern 6: Consensus Analysis - unanimous → strong → majority → divergent
  • Pattern 7: Statistics Collection - Track speed, cost, quality per model
  • Pattern 8: Data-Driven Selection (NEW v3.0) - Intelligent model recommendations

Master this skill and you can validate any implementation with multiple AI perspectives in minutes, while continuously improving your model shortlist based on actual performance data.

Version 3.1.0 Additions:

  • MANDATORY Statistics Collection Checklist - Prevents incomplete reviews
  • SubagentStop Hook - Automatically reminds when statistics weren't collected
  • Pre-Flight Checklist - Record SESSION_START, initialize timing arrays
  • Per-Model Timing Examples - Bash associative arrays for tracking durations
  • Required Output Template - Standardized performance table format
  • Verification Script - verify_statistics_complete() function
  • Common Mistakes Table - Quick reference for debugging

Version 3.0 Additions:

  • Pattern 0: Session Setup and Model Discovery
    • Unique session directories (/tmp/review-{timestamp}-{hash})
    • Dynamic model discovery via claudish --top-models and claudish --free
    • Always include internal reviewer (safety net)
    • Recommended free models: qwen3-coder, devstral-2512, qwen3-235b
  • Pattern 8: Data-Driven Model Selection
    • Historical performance tracking in ai-docs/llm-performance.json
    • Per-model metrics: speed, cost, quality, success rate, trend
    • Automatic shortlist generation (balanced, quality, budget, free-only)
    • Model recommendations with context
  • Enhanced Statistics
    • Cost tracking per model and per session
    • Free vs paid model tracking
    • Trend detection (improving/stable/degrading)
    • Top free performers category

Version 2.0 Additions:

  • Pattern 7: Statistics Collection and Analysis
  • Per-model execution time tracking
  • Quality score calculation (issues in consensus %)
  • Session summary statistics (speedup, avg time, success rate)
  • Recommendations for slow/failing models

Extracted From: