model-tracking-protocol

@MadAppGang/claude-code

MANDATORY tracking protocol for multi-model validation. Creates structured tracking tables BEFORE launching models, tracks progress during execution, and ensures complete results presentation. Use when running 2+ external AI models in parallel. Trigger keywords - "multi-model", "parallel review", "external models", "consensus", "model tracking".

SKILL.md

name: model-tracking-protocol
description: MANDATORY tracking protocol for multi-model validation. Creates structured tracking tables BEFORE launching models, tracks progress during execution, and ensures complete results presentation. Use when running 2+ external AI models in parallel. Trigger keywords - "multi-model", "parallel review", "external models", "consensus", "model tracking".
version: 1.0.0
tags: orchestration, tracking, multi-model, statistics, mandatory
keywords: tracking, mandatory, pre-launch, statistics, consensus, results, failures

Model Tracking Protocol

Version: 1.0.0
Purpose: MANDATORY tracking protocol for multi-model validation to prevent incomplete reviews
Status: Production Ready

Overview

This skill defines the MANDATORY tracking protocol for multi-model validation. It provides templates and procedures that make proper tracking impossible to skip.

The Problem This Solves:

Agents often launch multiple external AI models but fail to:

  • Create structured tracking tables before launch
  • Collect timing and performance data during execution
  • Document failures with error messages
  • Perform consensus analysis comparing model findings
  • Present results in a structured format

The Solution:

This skill provides MANDATORY checklists, templates, and protocols that ensure complete tracking. Missing ANY of these steps = INCOMPLETE review.


Table of Contents

  1. MANDATORY Pre-Launch Checklist
  2. Tracking Table Templates
  3. Per-Model Status Updates
  4. Failure Documentation Protocol
  5. Consensus Analysis Requirements
  6. Results Presentation Template
  7. Common Failures and Prevention
  8. Integration Examples

MANDATORY Pre-Launch Checklist

You MUST complete ALL items before launching ANY external models.

This is NOT optional. If you skip this, your multi-model validation is INCOMPLETE.

Checklist (Copy and Complete)

PRE-LAUNCH VERIFICATION (complete before Task calls):

[ ] 1. SESSION_ID created: ________________________
[ ] 2. SESSION_DIR created: ________________________
[ ] 3. Tracking table written to: $SESSION_DIR/tracking.md
[ ] 4. Start time recorded: SESSION_START=$(date +%s)
[ ] 5. Model list confirmed (comma-separated): ________________________
[ ] 6. Per-model timing arrays initialized
[ ] 7. Code context written to session directory
[ ] 8. Tracking marker created: /tmp/.claude-multi-model-active

If ANY item is unchecked, STOP and complete it before proceeding.

Why Pre-Launch Matters

Without pre-launch setup, you will:

  • Lose timing data (cannot calculate speed accurately)
  • Miss failed model details (no structured place to record)
  • Skip consensus analysis (no model list to compare)
  • Present incomplete results (no tracking table to populate)

Pre-Launch Script Template

CRITICAL CONSENSUS FIX APPLIED: Use file-based detection instead of environment variables.

#!/bin/bash
# Run this BEFORE launching any Task calls

# 1. Create unique session
SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 4 /dev/urandom | xxd -p)"
SESSION_DIR="/tmp/${SESSION_ID}"
mkdir -p "$SESSION_DIR"

# 2. Record start time
SESSION_START=$(date +%s)

# 3. Create tracking table
cat > "$SESSION_DIR/tracking.md" << EOF
# Multi-Model Tracking

## Session Info
- Session ID: ${SESSION_ID}
- Started: $(date -u +%Y-%m-%dT%H:%M:%SZ)
- Models Requested: [FILL]

## Model Status

| Model | Agent ID | Status | Start | End | Duration | Issues | Quality | Notes |
|-------|----------|--------|-------|-----|----------|--------|---------|-------|
| [MODEL 1] | | pending | | | | | | |
| [MODEL 2] | | pending | | | | | | |
| [MODEL 3] | | pending | | | | | | |

## Failures

| Model | Failure Type | Error Message | Retry? |
|-------|--------------|---------------|--------|

## Consensus

| Issue | Model 1 | Model 2 | Model 3 | Agreement |
|-------|---------|---------|---------|-----------|

EOF

# 4. Initialize timing arrays
declare -A MODEL_START_TIMES
declare -A MODEL_END_TIMES
declare -A MODEL_STATUS

# 5. Create tracking marker file (CRITICAL FIX)
# This allows hooks to detect that tracking is active
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

echo "Pre-launch setup complete. Session: $SESSION_ID"
echo "Directory: $SESSION_DIR"
echo "Tracking table: $SESSION_DIR/tracking.md"

Strict Mode (Optional)

For stricter enforcement, set:

export CLAUDE_STRICT_TRACKING=true

When enabled, hooks will BLOCK execution if tracking is not set up, rather than just warning.
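
A hedged sketch of how a hook could honor this flag, assuming the /tmp/.claude-multi-model-active marker file from the pre-launch script above is the signal that tracking is active; the exact blocking mechanism depends on the hook runner:

# Inside a hook script: warn by default, refuse to proceed in strict mode
if [[ ! -f /tmp/.claude-multi-model-active ]]; then
  if [[ "${CLAUDE_STRICT_TRACKING:-false}" == "true" ]]; then
    echo "BLOCKED: multi-model tracking not set up (strict mode)" >&2
    exit 1  # non-zero exit asks the hook runner to stop; exact semantics vary
  else
    echo "WARNING: multi-model tracking not set up" >&2
  fi
fi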


Tracking Table Templates

Template A: Simple Model Tracking (3-5 models)

| Model | Status | Time | Issues | Quality | Cost |
|-------|--------|------|--------|---------|------|
| claude-embedded | pending | - | - | - | FREE |
| x-ai/grok-code-fast-1 | pending | - | - | - | - |
| qwen/qwen3-coder:free | pending | - | - | - | FREE |

Update as each completes:

| Model | Status | Time | Issues | Quality | Cost |
|-------|--------|------|--------|---------|------|
| claude-embedded | success | 32s | 8 | 95% | FREE |
| x-ai/grok-code-fast-1 | success | 45s | 6 | 87% | $0.002 |
| qwen/qwen3-coder:free | timeout | - | - | - | - |

Template B: Detailed Model Tracking (6+ models)

## Model Execution Status

### Summary
- Total Requested: 8
- Completed: 0
- In Progress: 0
- Failed: 0
- Pending: 8

### Detailed Status

| # | Model | Provider | Status | Start | Duration | Issues | Quality | Cost | Error |
|---|-------|----------|--------|-------|----------|--------|---------|------|-------|
| 1 | claude-embedded | Anthropic | pending | - | - | - | - | FREE | - |
| 2 | x-ai/grok-code-fast-1 | X-ai | pending | - | - | - | - | - | - |
| 3 | qwen/qwen3-coder:free | Qwen | pending | - | - | - | - | FREE | - |
| 4 | google/gemini-3-pro | Google | pending | - | - | - | - | - | - |
| 5 | openai/gpt-5.1-codex | OpenAI | pending | - | - | - | - | - | - |
| 6 | mistralai/devstral | Mistral | pending | - | - | - | - | FREE | - |
| 7 | deepseek/deepseek-r1 | DeepSeek | pending | - | - | - | - | - | - |
| 8 | anthropic/claude-sonnet | Anthropic | pending | - | - | - | - | - | - |

Template C: Session-Based Tracking File

Create this file at $SESSION_DIR/tracking.md:

# Multi-Model Validation Tracking
Session: ${SESSION_ID}
Started: ${TIMESTAMP}

## Pre-Launch Verification
- [x] Session directory created: ${SESSION_DIR}
- [x] Tracking table initialized
- [x] Start time recorded: ${SESSION_START}
- [x] Model list: ${MODEL_LIST}

## Model Status

| Model | Status | Start | Duration | Issues | Quality |
|-------|--------|-------|----------|--------|---------|
| claude | pending | - | - | - | - |
| grok | pending | - | - | - | - |
| gemini | pending | - | - | - | - |

## Failures
(populated as failures occur)

## Consensus
(populated after all complete)

Update Protocol

As each model completes, IMMEDIATELY update:

  1. Status: pending -> in_progress -> success/failed/timeout
  2. Duration: Calculate from start time
  3. Issues: Number of issues found
  4. Quality: Percentage if calculable
  5. Error: If failed, brief error message

DO NOT wait until all models finish. Update as each completes.


Per-Model Status Update Protocol

IMMEDIATELY After Each Model Completes

Do NOT wait until all models finish. Update tracking AS EACH COMPLETES.

Update Script

# Call this when each model completes
update_model_status() {
  local model="$1"
  local status="$2"
  local issues="${3:-0}"
  local quality="${4:-}"
  local error="${5:-}"

  local end_time=$(date +%s)
  local start_time="${MODEL_START_TIMES[$model]}"
  local duration=$((end_time - start_time))

  # Update arrays
  MODEL_END_TIMES["$model"]=$end_time
  MODEL_STATUS["$model"]="$status"

  # Log update to session tracking file
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) - Model: $model, Status: $status, Duration: ${duration}s" >> "$SESSION_DIR/execution.log"

  # Update tracking table (append to tracking.md)
  echo "| $model | $status | ${duration}s | $issues | ${quality:-N/A} | ${error:-} |" >> "$SESSION_DIR/tracking.md"

  # Track performance in global statistics
  if [[ "$status" == "success" ]]; then
    track_model_performance "$model" "success" "$duration" "$issues" "$quality"
  else
    track_model_performance "$model" "$status" "$duration" 0 ""
  fi
}

# Usage examples:
update_model_status "claude-embedded" "success" 8 95
update_model_status "x-ai/grok-code-fast-1" "success" 6 87
update_model_status "some-model" "timeout" 0 "" "Exceeded 120s limit"
update_model_status "other-model" "failed" 0 "" "API 500 error"

Status Values

| Status | Meaning | Action |
|--------|---------|--------|
| pending | Not started | Wait |
| in_progress | Currently executing | Monitor |
| success | Completed successfully | Collect results |
| failed | Error during execution | Document error |
| timeout | Exceeded time limit | Note timeout |
| cancelled | User cancelled | Note cancellation |

Real-Time Progress Display

Show user progress as models complete:

Model Status (3/5 complete):
✓ claude-embedded (32s, 8 issues)
✓ x-ai/grok-code-fast-1 (45s, 6 issues)
✓ qwen/qwen3-coder:free (52s, 5 issues)
⏳ openai/gpt-5.1-codex (in progress, 60s elapsed)
⏳ google/gemini-3-pro (in progress, 48s elapsed)
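
A minimal sketch of generating such a display from the tracking arrays set up at pre-launch; the show_progress name is illustrative, and any model without a recorded final status is assumed to still be running:

show_progress() {
  local completed=0 total=0 now status
  now=$(date +%s)
  for model in "${!MODEL_START_TIMES[@]}"; do
    total=$((total + 1))
    status="${MODEL_STATUS[$model]:-in_progress}"
    [[ "$status" != "in_progress" && "$status" != "pending" ]] && completed=$((completed + 1))
  done
  echo "Model Status ($completed/$total complete):"
  for model in "${!MODEL_START_TIMES[@]}"; do
    status="${MODEL_STATUS[$model]:-in_progress}"
    if [[ "$status" == "in_progress" || "$status" == "pending" ]]; then
      echo "⏳ $model (in progress, $((now - MODEL_START_TIMES[$model]))s elapsed)"
    else
      echo "✓ $model ($status)"
    fi
  done
}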

Failure Documentation Protocol

EVERY failed model MUST be documented with:

  1. Model name
  2. Failure type (timeout, API error, parse error, etc.)
  3. Error message (exact or summarized)
  4. Whether retry was attempted

Failure Report Template

## Failed Models Report

### Model: x-ai/grok-code-fast-1
- **Failure Type:** API Error
- **Error Message:** "500 Internal Server Error from OpenRouter"
- **Retry Attempted:** Yes, 1 retry, same error
- **Impact:** Review results based on 3/4 models instead of 4
- **Recommendation:** Check OpenRouter status, retry later

### Model: google/gemini-3-pro
- **Failure Type:** Timeout
- **Error Message:** "Exceeded 120s limit, response incomplete"
- **Retry Attempted:** No, time constraints
- **Impact:** Lost Gemini perspective, consensus based on remaining models
- **Recommendation:** Extend timeout to 180s for this model

Failure Categorization

| Category | Common Causes | Recovery |
|----------|---------------|----------|
| Timeout | Model slow, large input, network latency | Retry with extended timeout |
| API Error | Provider down, rate limit, auth issue | Wait and retry, check API status |
| Parse Error | Malformed response, encoding issue | Retry, simplify prompt |
| Auth Error | Invalid API key, expired token | Check credentials |
| Context Limit | Input too large for model | Reduce context, split task |
| Rate Limit | Too many requests | Wait, implement backoff |
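
If the raw error text is available, a simple pattern match can suggest the category; this is an illustrative sketch only, since real provider error strings vary widely:

categorize_failure() {
  local error_msg="$1"
  case "$error_msg" in
    *imeout*|*"time limit"*)              echo "Timeout" ;;
    *429*|*"rate limit"*|*"Rate limit"*)  echo "Rate Limit" ;;
    *401*|*403*|*[Aa]uth*)                echo "Auth Error" ;;
    *"context length"*|*"too large"*)     echo "Context Limit" ;;
    *"parse"*|*"malformed"*|*JSON*)       echo "Parse Error" ;;
    *)                                    echo "API Error" ;;
  esac
}

# Usage:
categorize_failure "500 Internal Server Error from OpenRouter"   # -> API Error
categorize_failure "429 Too Many Requests"                       # -> Rate Limit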

Failure Summary Table

Always include this in final results:

## Execution Summary

| Metric | Value |
|--------|-------|
| Models Requested | 8 |
| Successful | 5 (62.5%) |
| Failed | 3 (37.5%) |

### Failed Models

| Model | Failure | Recoverable? | Action |
|-------|---------|--------------|--------|
| grok-code-fast-1 | API 500 | Yes - retry later | Check OpenRouter status |
| gemini-3-pro | Timeout | Yes - extend limit | Use 180s timeout |
| deepseek-r1 | Auth Error | No - check key | Verify API key valid |

Writing Failures to Session Directory

# Document failure immediately when it occurs
document_failure() {
  local model="$1"
  local failure_type="$2"
  local error_msg="$3"
  local retry_attempted="${4:-No}"

  cat >> "$SESSION_DIR/failures.md" << EOF

### Model: $model
- **Failure Type:** $failure_type
- **Error Message:** "$error_msg"
- **Retry Attempted:** $retry_attempted
- **Timestamp:** $(date -u +%Y-%m-%dT%H:%M:%SZ)

EOF

  echo "Failure documented: $model ($failure_type)" >&2
}

# Usage:
document_failure "x-ai/grok-code-fast-1" "API Error" "500 Internal Server Error" "Yes, 1 retry"

Consensus Analysis Requirements

After ALL models complete (or max wait time), you MUST perform consensus analysis.

This is NOT optional. Even with 2 successful models, compare their findings.

Minimum Viable Consensus (2 models)

With only 2 models, consensus is simple:

  • AGREE: Both found the same issue
  • DISAGREE: Only one found the issue

| Issue | Model 1 | Model 2 | Consensus |
|-------|---------|---------|-----------|
| SQL injection | Yes | Yes | AGREE |
| Missing validation | Yes | No | Model 1 only |
| Weak hashing | No | Yes | Model 2 only |
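
A sketch of doing this comparison mechanically, assuming each review file lists one issue title per bullet line and lives in the session directory (the file names and extraction pattern are illustrative):

extract_issues() {
  # keep bullet lines only, strip the bullet marker, de-duplicate
  grep -E '^[-*] ' "$1" | sed 's/^[-*] //' | sort -u
}

extract_issues "$SESSION_DIR/model1-review.md" > "$SESSION_DIR/m1-issues.txt"
extract_issues "$SESSION_DIR/model2-review.md" > "$SESSION_DIR/m2-issues.txt"

echo "AGREE (both models):"
comm -12 "$SESSION_DIR/m1-issues.txt" "$SESSION_DIR/m2-issues.txt"
echo "Model 1 only:"
comm -23 "$SESSION_DIR/m1-issues.txt" "$SESSION_DIR/m2-issues.txt"
echo "Model 2 only:"
comm -13 "$SESSION_DIR/m1-issues.txt" "$SESSION_DIR/m2-issues.txt"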

Standard Consensus (3-5 models)

| Issue | Claude | Grok | Gemini | Agreement |
|-------|--------|------|--------|-----------|
| SQL injection | Yes | Yes | Yes | UNANIMOUS (3/3) |
| Missing validation | Yes | Yes | No | STRONG (2/3) |
| Rate limiting | Yes | No | No | DIVERGENT (1/3) |

Extended Consensus (6+ models)

For 6+ models, add summary statistics:

## Consensus Summary

- **Unanimous Issues (100%):** 3 issues
- **Strong Consensus (67%+):** 5 issues
- **Majority (50%+):** 2 issues
- **Divergent (<50%):** 4 issues

## Top 5 by Consensus

1. [6/6] SQL injection in search - FIX IMMEDIATELY
2. [6/6] Missing input validation - FIX IMMEDIATELY
3. [5/6] Weak password hashing - RECOMMENDED
4. [4/6] Missing rate limiting - CONSIDER
5. [3/6] Error handling gaps - INVESTIGATE

Consensus Analysis Script

# Perform consensus analysis on all model findings
analyze_consensus() {
  local session_dir="$1"
  local num_models="$2"

  echo "## Consensus Analysis" > "$session_dir/consensus.md"
  echo "" >> "$session_dir/consensus.md"
  echo "Based on $num_models model reviews:" >> "$session_dir/consensus.md"
  echo "" >> "$session_dir/consensus.md"

  # Read all review files and extract issues
  # (simplified - actual implementation would parse review markdown)
  for review in "$session_dir"/*-review.md; do
    echo "Processing: $review"
    # Extract issues, compare, categorize by agreement level
  done

  # Calculate consensus levels
  echo "### Consensus Levels" >> "$session_dir/consensus.md"
  echo "" >> "$session_dir/consensus.md"
  echo "- UNANIMOUS: All $num_models models agree" >> "$session_dir/consensus.md"
  echo "- STRONG: ≥67% of models agree" >> "$session_dir/consensus.md"
  echo "- MAJORITY: ≥50% of models agree" >> "$session_dir/consensus.md"
  echo "- DIVERGENT: <50% of models agree" >> "$session_dir/consensus.md"
}
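
The comparison loop above is intentionally abstract. A naive stand-in that counts how many review files contain the same issue line could look like this; it assumes identically worded bullet lines, which real model output rarely guarantees, so treat it as a starting point only:

count_agreement() {
  local session_dir="$1" num_models="$2"
  for review in "$session_dir"/*-review.md; do
    grep -E '^[-*] ' "$review" | sed 's/^[-*] //'
  done | sort | uniq -c | sort -rn | while read -r count issue; do
    pct=$((100 * count / num_models))
    if   [ "$count" -eq "$num_models" ]; then level="UNANIMOUS"
    elif [ "$pct" -ge 67 ];              then level="STRONG"
    elif [ "$pct" -ge 50 ];              then level="MAJORITY"
    else                                      level="DIVERGENT"
    fi
    echo "[$count/$num_models] $level - $issue"
  done
}

# Usage (after all reviews are written):
count_agreement "$SESSION_DIR" 3 >> "$SESSION_DIR/consensus.md"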

NO Consensus Analysis = INCOMPLETE Review

If you present results without a consensus comparison, your review is INCOMPLETE.

Minimum Requirements:

  • ✅ Compare findings across ALL successful models
  • ✅ Categorize by agreement level (unanimous, strong, majority, divergent)
  • ✅ Prioritize issues by consensus + severity
  • ✅ Document in $SESSION_DIR/consensus.md

Results Presentation Template

Your final output MUST include ALL of these sections.

Required Output Format

## Multi-Model Review Complete

### Execution Summary

| Metric | Value |
|--------|-------|
| Session ID | review-20251224-143052-a3f2 |
| Session Directory | /tmp/review-20251224-143052-a3f2 |
| Models Requested | 5 |
| Successful | 4 (80%) |
| Failed | 1 (20%) |
| Total Duration | 68s (parallel) |
| Sequential Equivalent | 245s |
| Speedup | 3.6x |

### Model Performance

| Model | Time | Issues | Quality | Status | Cost |
|-------|------|--------|---------|--------|------|
| claude-embedded | 32s | 8 | 95% | Success | FREE |
| x-ai/grok-code-fast-1 | 45s | 6 | 87% | Success | $0.002 |
| qwen/qwen3-coder:free | 52s | 5 | 82% | Success | FREE |
| openai/gpt-5.1-codex | 68s | 7 | 89% | Success | $0.015 |
| mistralai/devstral | - | - | - | Timeout | - |

### Failed Models

| Model | Failure | Error |
|-------|---------|-------|
| mistralai/devstral | Timeout | Exceeded 120s limit |

### Top Issues by Consensus

1. **[UNANIMOUS]** SQL injection in search endpoint
   - Flagged by: claude, grok, qwen, gpt-5 (4/4)
   - Severity: CRITICAL
   - Action: FIX IMMEDIATELY

2. **[UNANIMOUS]** Missing input validation
   - Flagged by: claude, grok, qwen, gpt-5 (4/4)
   - Severity: CRITICAL
   - Action: FIX IMMEDIATELY

3. **[STRONG]** Weak password hashing
   - Flagged by: claude, grok, gpt-5 (3/4)
   - Severity: HIGH
   - Action: RECOMMENDED

### Detailed Reports

- Session directory: /tmp/review-20251224-143052-a3f2
- Consolidated review: /tmp/review-20251224-143052-a3f2/consolidated-review.md
- Individual reviews: /tmp/review-20251224-143052-a3f2/{model}-review.md
- Tracking data: /tmp/review-20251224-143052-a3f2/tracking.md
- Consensus analysis: /tmp/review-20251224-143052-a3f2/consensus.md

### Statistics Saved

- Performance data logged to: ai-docs/llm-performance.json

Missing Section Detection

Before presenting, verify ALL sections are present:

verify_output_complete() {
  local output="$1"

  local required=(
    "Execution Summary"
    "Model Performance"
    "Top Issues"
    "Detailed Reports"
    "Statistics"
  )

  local missing=()
  for section in "${required[@]}"; do
    if ! echo "$output" | grep -q "$section"; then
      missing+=("$section")
    fi
  done

  if [ ${#missing[@]} -gt 0 ]; then
    echo "ERROR: Missing required sections: ${missing[*]}" >&2
    return 1
  fi

  return 0
}

Checklist before presenting results:

  • Execution Summary (models requested/successful/failed)
  • Model Performance table (per-model times and quality)
  • Failed Models section (if any failed)
  • Top Issues by Consensus (prioritized list)
  • Detailed Reports (session directory, file paths)
  • Statistics confirmation (llm-performance.json updated)
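
A usage sketch for the check above; FINAL_RESULTS is a placeholder for however the final output text is assembled:

FINAL_RESULTS=$(cat "$SESSION_DIR/consolidated-review.md")
verify_output_complete "$FINAL_RESULTS" || {
  echo "Missing required sections: complete them before presenting results" >&2
  exit 1
}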

Common Failures and Prevention

Failure 1: No Tracking Table Created

Symptom: Results presented as prose, not structured data

What went wrong:

"I ran 5 models. 3 succeeded and found various issues."
(No table, no structure)

Prevention:

  • Always run pre-launch script FIRST
  • Create $SESSION_DIR/tracking.md before Task calls
  • Populate table as models complete

Detection: SubagentStop hook warns if no tracking found

Failure 2: Timing Not Recorded

Symptom: "Duration: unknown" or missing speed stats

What went wrong:

# Launched models without recording start time
Task: reviewer1
Task: reviewer2
# No SESSION_START, cannot calculate duration!

Prevention:

# ALWAYS do this first
SESSION_START=$(date +%s)
MODEL_START_TIMES["model1"]=$SESSION_START

Detection: Hook checks for timing data in output

Failure 3: Failed Models Not Documented

Symptom: "2 of 8 succeeded" with no failure details

What went wrong:

"Launched 8 models. 2 succeeded."
(No info on why 6 failed)

Prevention:

# Immediately when model fails
document_failure "model-name" "Timeout" "Exceeded 120s" "No"

Detection: Hook checks for failure section when success < total

Failure 4: No Consensus Analysis

Symptom: Individual model results listed without comparison

What went wrong:

"Model 1 found: A, B, C
 Model 2 found: B, D, E"
(No comparison: which issues do they agree on?)

Prevention:

  • After all complete, ALWAYS run consolidation
  • Create consensus table comparing findings
  • Prioritize by agreement level

Detection: Hook checks for consensus keywords

Failure 5: Statistics Not Saved

Symptom: No record in ai-docs/llm-performance.json

What went wrong:

# Forgot to call tracking functions
# No record of this session

Prevention:

# ALWAYS call these
track_model_performance "model" "status" duration issues quality
record_session_stats total success failed parallel sequential speedup

Detection: Hook checks file modification time
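
A sketch of such a check, assuming GNU stat (BSD/macOS would need stat -f %m) and the SESSION_START timestamp recorded at pre-launch:

STATS_FILE="ai-docs/llm-performance.json"
if [[ ! -f "$STATS_FILE" ]] || (( $(stat -c %Y "$STATS_FILE") < SESSION_START )); then
  echo "WARNING: $STATS_FILE was not updated during this session" >&2
fi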

Prevention Checklist

Before presenting results, verify:

[ ] Tracking table exists at $SESSION_DIR/tracking.md
[ ] Tracking table is populated with all model results
[ ] All model times recorded (or "timeout"/"failed" noted)
[ ] All failures documented in $SESSION_DIR/failures.md
[ ] Consensus analysis performed in $SESSION_DIR/consensus.md
[ ] Results match required output format
[ ] Statistics saved to ai-docs/llm-performance.json
[ ] Session directory contains all artifacts

Integration Examples

Example 1: Complete Multi-Model Review Workflow

#!/bin/bash
# Full multi-model review with complete tracking

# ============================================================================
# PHASE 1: PRE-LAUNCH (MANDATORY)
# ============================================================================

# 1. Create unique session
SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 4 /dev/urandom | xxd -p)"
SESSION_DIR="/tmp/${SESSION_ID}"
mkdir -p "$SESSION_DIR"

# 2. Record start time
SESSION_START=$(date +%s)

# 3. Create tracking table
cat > "$SESSION_DIR/tracking.md" << EOF
# Multi-Model Validation Tracking

## Session: $SESSION_ID
Started: $(date -u +%Y-%m-%dT%H:%M:%SZ)

## Model Status
| Model | Status | Duration | Issues | Quality |
|-------|--------|----------|--------|---------|
EOF

# 4. Initialize timing arrays
declare -A MODEL_START_TIMES
declare -A MODEL_END_TIMES

# 5. Create tracking marker
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

# 6. Write code context
git diff > "$SESSION_DIR/code-context.md"

echo "Pre-launch complete. Session: $SESSION_ID"

# ============================================================================
# PHASE 2: MODEL EXECUTION (Parallel Task calls)
# ============================================================================

# Record start times for each model
MODEL_START_TIMES["claude-embedded"]=$(date +%s)
MODEL_START_TIMES["x-ai/grok-code-fast-1"]=$(date +%s)
MODEL_START_TIMES["qwen/qwen3-coder:free"]=$(date +%s)

# Launch all models in single message (parallel execution)
# (These would be actual Task calls in practice)
echo "Launching 3 models in parallel..."

# ============================================================================
# PHASE 3: RESULTS COLLECTION (as each completes)
# ============================================================================

# Update status immediately after each completes
update_model_status() {
  local model="$1" status="$2" issues="${3:-0}" quality="${4:-}"
  local end_time=$(date +%s)
  local duration=$((end_time - MODEL_START_TIMES["$model"]))

  echo "| $model | $status | ${duration}s | $issues | ${quality:-N/A} |" >> "$SESSION_DIR/tracking.md"
  track_model_performance "$model" "$status" "$duration" "$issues" "$quality"
}

# Example completions
update_model_status "claude-embedded" "success" 8 95
update_model_status "x-ai/grok-code-fast-1" "success" 6 87
update_model_status "qwen/qwen3-coder:free" "timeout"

# ============================================================================
# PHASE 4: CONSENSUS ANALYSIS (MANDATORY)
# ============================================================================

# Consolidate and compare findings
echo "Performing consensus analysis..."
# (Would launch consolidation agent here)

# ============================================================================
# PHASE 5: STATISTICS & PRESENTATION
# ============================================================================

# Calculate session stats
PARALLEL_TIME=52  # max of all durations
SEQUENTIAL_TIME=129  # sum of all durations
SPEEDUP=2.5

# Record session
record_session_stats 3 2 1 "$PARALLEL_TIME" "$SEQUENTIAL_TIME" "$SPEEDUP"

# Present results
cat << RESULTS
## Multi-Model Review Complete

Session: $SESSION_ID
Directory: $SESSION_DIR

Models: 3 requested, 2 successful, 1 failed

See tracking table: $SESSION_DIR/tracking.md
See consensus: $SESSION_DIR/consensus.md
Statistics saved to: ai-docs/llm-performance.json
RESULTS

# Cleanup marker
rm -f /tmp/.claude-multi-model-active

Example 2: Minimal 2-Model Comparison

# Simplest viable multi-model validation

# Pre-launch
SESSION_ID="review-$(date +%s)"
SESSION_DIR="/tmp/$SESSION_ID"
mkdir -p "$SESSION_DIR"
SESSION_START=$(date +%s)
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

# Launch
echo "Launching Claude + Grok..."
# Task: claude-embedded
# Task: PROXY_MODE grok

# Track
track_model_performance "claude" "success" 32 8 95
track_model_performance "grok" "success" 45 6 87

# Consensus
echo "Issues both found: SQL injection, missing validation" > "$SESSION_DIR/consensus.md"

# Stats
record_session_stats 2 2 0 45 77 1.7

# Cleanup
rm -f /tmp/.claude-multi-model-active

Example 3: Handling Failures

# Multi-model with failure handling

# Pre-launch (same as Example 1)
# ... setup code ...

# Launch 4 models
# ... Task calls ...

SUCCESS_COUNT=0

# Model 1: Success
update_model_status "claude" "success" 8 95
SUCCESS_COUNT=$((SUCCESS_COUNT + 1))

# Model 2: Success
update_model_status "grok" "success" 6 87
SUCCESS_COUNT=$((SUCCESS_COUNT + 1))

# Model 3: Timeout
update_model_status "gemini" "timeout"
document_failure "gemini" "Timeout" "Exceeded 120s limit" "No"

# Model 4: API Error
update_model_status "gpt5" "failed"
document_failure "gpt5" "API Error" "500 from OpenRouter" "Yes, 1 retry"

# Proceed with 2 successful models
if [ "$SUCCESS_COUNT" -ge 2 ]; then
  echo "Proceeding with $SUCCESS_COUNT successful models"
  # Consensus with partial data
else
  echo "ERROR: Only $SUCCESS_COUNT succeeded, need minimum 2"
fi

Integration with Other Skills

With multi-model-validation

The multi-model-validation skill defines the execution patterns (4-Message Pattern, parallel execution, proxy mode). This skill (model-tracking-protocol) defines the tracking infrastructure.

Use together:

skills: orchestration:multi-model-validation, orchestration:model-tracking-protocol

Workflow:

  1. Read multi-model-validation for execution patterns
  2. Read model-tracking-protocol for tracking setup
  3. Pre-launch (tracking protocol)
  4. Execute (validation patterns)
  5. Track (protocol updates)
  6. Present (protocol templates)

With quality-gates

Use quality gates to ensure tracking is complete before proceeding:

# After tracking setup, verify completeness
if [ ! -f "$SESSION_DIR/tracking.md" ]; then
  echo "QUALITY GATE FAILED: No tracking table"
  exit 1
fi

# Before presenting results, verify all sections present
verify_output_complete "$OUTPUT" || exit 1

With todowrite-orchestration

Track progress through multi-model phases:

TodoWrite:
1. Pre-launch setup (tracking protocol)
2. Launch models (validation patterns)
3. Collect results (tracking updates)
4. Consensus analysis (protocol requirement)
5. Present results (protocol template)

Quick Reference

File-Based Tracking Marker (CONSENSUS FIX)

Create marker after pre-launch setup:

echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

Check if tracking active (in hooks):

if [[ -f /tmp/.claude-multi-model-active ]]; then
  SESSION_DIR=$(cat /tmp/.claude-multi-model-active)
  [[ -f "$SESSION_DIR/tracking.md" ]] && echo "Tracking active"
fi

Remove marker when done:

rm -f /tmp/.claude-multi-model-active

Pre-Launch Commands

SESSION_ID="review-$(date +%Y%m%d-%H%M%S)-$(head -c 4 /dev/urandom | xxd -p)"
SESSION_DIR="/tmp/${SESSION_ID}"
mkdir -p "$SESSION_DIR"
SESSION_START=$(date +%s)
echo "$SESSION_DIR" > /tmp/.claude-multi-model-active

Tracking Commands

update_model_status "model" "status" issues quality
document_failure "model" "type" "error" "retry"
track_model_performance "model" "status" duration issues quality
record_session_stats total success failed parallel sequential speedup
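
track_model_performance and record_session_stats are assumed to be provided by a companion statistics skill and are not defined in this file. A minimal stand-in, shown only so the calls above can be exercised in isolation, might append one JSON record per call; the real schema of ai-docs/llm-performance.json may differ:

track_model_performance() {
  local model="$1" status="$2" duration="${3:-0}" issues="${4:-0}" quality="${5:-}"
  mkdir -p ai-docs
  printf '{"record":"model","model":"%s","status":"%s","duration":%s,"issues":%s,"quality":"%s"}\n' \
    "$model" "$status" "$duration" "$issues" "$quality" >> ai-docs/llm-performance.json
}

record_session_stats() {
  local total="$1" success="$2" failed="$3" parallel="$4" sequential="$5" speedup="$6"
  mkdir -p ai-docs
  printf '{"record":"session","total":%s,"success":%s,"failed":%s,"parallel_s":%s,"sequential_s":%s,"speedup":%s}\n' \
    "$total" "$success" "$failed" "$parallel" "$sequential" "$speedup" >> ai-docs/llm-performance.json
}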

Verification Commands

verify_output_complete "$OUTPUT"
[ -f "$SESSION_DIR/tracking.md" ] && echo "Tracking exists"
[ -f ai-docs/llm-performance.json ] && echo "Statistics saved"

Summary

This skill provides MANDATORY tracking infrastructure for multi-model validation:

  1. Pre-Launch Checklist - 8 items to complete before launching models
  2. Tracking Tables - Templates for 3-5 models and 6+ models
  3. Status Updates - Per-model completion tracking
  4. Failure Documentation - Required format for all failures
  5. Consensus Analysis - Comparing findings across models
  6. Results Template - Required output format
  7. Common Failures - Prevention strategies
  8. Integration Examples - Complete workflows

Key Innovation: File-based tracking marker (/tmp/.claude-multi-model-active) allows hooks to detect active tracking without relying on environment variables.

Use this skill when: Running 2+ external AI models in parallel for validation, review, or consensus analysis.

Missing tracking = INCOMPLETE validation.