Claude Code Plugins

Community-maintained marketplace


gpu-debugger

@gpu-cli/gpu

Debug failed GPU CLI runs. Analyze error messages, diagnose OOM errors, fix sync issues, troubleshoot connectivity, and resolve common problems. Turn cryptic errors into actionable fixes.

Install Skill

1. Download the skill
2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by going through its instructions before using it.

SKILL.md

name: gpu-debugger
description: Debug failed GPU CLI runs. Analyze error messages, diagnose OOM errors, fix sync issues, troubleshoot connectivity, and resolve common problems. Turn cryptic errors into actionable fixes.

GPU Debugger

Turn errors into solutions.

This skill helps debug failed GPU CLI runs: OOM errors, sync failures, connectivity issues, model loading problems, and more.

When to Use This Skill

| Problem | This Skill Helps With |
| --- | --- |
| "CUDA out of memory" | OOM diagnosis and fixes |
| "Connection refused" | Connectivity troubleshooting |
| "Sync failed" | File sync debugging |
| "Pod won't start" | Provisioning issues |
| "Model won't load" | Model loading errors |
| "Command exited with error" | Exit code analysis |
| "My run is hanging" | Stuck process diagnosis |

Debugging Workflow

Error occurs
     │
     ▼
┌─────────────────────────┐
│ 1. Collect information  │
│    - Error message      │
│    - Daemon logs        │
│    - Exit code          │
│    - VRAM usage         │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ 2. Identify error type  │
│    - OOM                │
│    - Network            │
│    - Model              │
│    - Sync               │
│    - Permission         │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ 3. Apply fix            │
│    - Config change      │
│    - Code change        │
│    - Retry              │
└─────────────────────────┘

Information Collection Commands

Check Daemon Logs

# Last 50 log lines
gpu daemon logs --tail 50

# Full logs since last restart
gpu daemon logs

# Follow logs in real-time
gpu daemon logs --follow

Check Pod Status

# Current pod status
gpu status

# Pod details
gpu pods list

Check Job History

# Recent jobs
gpu jobs list

# Specific job details
gpu jobs show <job-id>
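When the CLI output alone isn't enough, VRAM usage (step 1 of the workflow) can also be collected programmatically. A minimal sketch using the standard `nvidia-smi` query interface; it assumes `nvidia-smi` is on the PATH on the machine where it runs (e.g. on the pod):

```python
#!/usr/bin/env python3
"""Collect VRAM usage via nvidia-smi."""
import subprocess

def vram_usage():
    # --query-gpu with CSV output is a stable nvidia-smi interface
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.strip().splitlines():
        idx, name, used, total = [field.strip() for field in line.split(",")]
        print(f"GPU {idx} ({name}): {used} MiB / {total} MiB used")

if __name__ == "__main__":
    vram_usage()
```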

Common Errors and Solutions

1. CUDA Out of Memory (OOM)

Error messages:

CUDA out of memory. Tried to allocate X GiB
RuntimeError: CUDA error: out of memory
torch.cuda.OutOfMemoryError

Diagnosis:

# In your script, check VRAM usage
import torch
print(f"VRAM allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"VRAM reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

Solutions by severity:

| Solution | VRAM Savings | Effort |
| --- | --- | --- |
| Reduce batch size | ~Linear | Easy |
| Enable gradient checkpointing | ~40% | Easy |
| Use FP16/BF16 | ~50% | Easy |
| Use INT8 quantization | ~50% | Medium |
| Use INT4 quantization | ~75% | Medium |
| Enable CPU offloading | Variable | Easy |
| Use larger GPU | Solves it | $$ |

Quick fixes:

# Reduce batch size
BATCH_SIZE = 1  # Start small, increase until OOM

# Enable gradient checkpointing
model.gradient_checkpointing_enable()

# Use FP16
model = model.half()

# Use INT4 quantization
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# CPU offloading (for diffusers)
pipe.enable_model_cpu_offload()

# Clear cache between batches
torch.cuda.empty_cache()

Config fix:

{
  // Upgrade to larger GPU
  "gpu_type": "A100 PCIe 80GB",  // Instead of RTX 4090
  "min_vram": 80
}
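If the OOM only appears intermittently (for example on unusually long inputs), a defensive pattern is to catch `torch.cuda.OutOfMemoryError` (available in recent PyTorch versions) and retry with a smaller batch. A minimal sketch, where `run_batch` is a hypothetical placeholder for your own training or inference step:

```python
import torch

def run_with_oom_backoff(run_batch, batch_size=32, min_batch_size=1):
    """Retry run_batch with a halved batch size after a CUDA OOM."""
    while batch_size >= min_batch_size:
        try:
            return run_batch(batch_size)
        except torch.cuda.OutOfMemoryError:
            # Free cached blocks before retrying with a smaller batch
            torch.cuda.empty_cache()
            batch_size //= 2
            print(f"OOM caught, retrying with batch_size={batch_size}")
    raise RuntimeError("Still OOM at the minimum batch size; apply the other fixes above")
```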

2. Connection Refused / Timeout

Error messages:

Connection refused
Connection timed out
SSH connection failed
Failed to connect to daemon

Diagnosis:

# Check daemon status
gpu daemon status

# Check if daemon is running
ps aux | grep gpud

# Check daemon logs
gpu daemon logs --tail 20

Solutions:

| Cause | Solution |
| --- | --- |
| Daemon not running | `gpu daemon start` |
| Daemon crashed | `gpu daemon restart` |
| Wrong socket | Check `GPU_DAEMON_SOCKET` env var |
| Port conflict | Kill conflicting process |

Restart daemon:

gpu daemon stop
gpu daemon start
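To rule out the "wrong socket" case from the table above, a quick sketch that prints the `GPU_DAEMON_SOCKET` environment variable and whether the path exists. This assumes the daemon listens on a Unix socket at that path, as the variable name suggests:

```python
import os
from pathlib import Path

socket_path = os.environ.get("GPU_DAEMON_SOCKET")
if socket_path is None:
    print("GPU_DAEMON_SOCKET is not set; the CLI will use its default socket")
else:
    exists = Path(socket_path).exists()
    print(f"GPU_DAEMON_SOCKET={socket_path} (exists: {exists})")
    if not exists:
        print("Socket path does not exist; try 'gpu daemon restart' or unset the variable")
```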

3. Pod Won't Start / Provisioning Failed

Error messages:

Failed to create pod
No GPUs available
Insufficient resources
Provisioning timeout

Diagnosis:

# Check available GPUs
gpu machines list

# Check specific GPU availability
gpu machines list --gpu "RTX 4090"

Solutions:

| Cause | Solution |
| --- | --- |
| GPU type unavailable | Try different GPU type |
| Region full | Remove region constraint |
| Price too low | Increase `max_price` |
| Volume region mismatch | Use volume's region |

Config fixes:

{
  // Use min_vram instead of exact GPU
  "gpu_type": null,
  "min_vram": 24,  // Any GPU with 24GB+

  // Or try different GPU
  "gpu_type": "RTX A6000",  // Alternative to RTX 4090

  // Or relax region constraint
  "region": null,  // Any region

  // Or increase price tolerance
  "max_price": 2.0  // Allow up to $2/hr
}
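To find a GPU type that is actually available before editing the config, a small sketch that loops over candidate types using the `gpu machines list --gpu` command shown above. The candidate list is just an example, and the CLI output is printed as-is since its format is CLI-defined:

```python
import subprocess

# Example fallback order; adjust to your needs
candidates = ["RTX 4090", "RTX A6000", "A100 PCIe 80GB"]

for gpu_type in candidates:
    print(f"=== {gpu_type} ===")
    result = subprocess.run(
        ["gpu", "machines", "list", "--gpu", gpu_type],
        capture_output=True, text=True,
    )
    # Print whatever the CLI reports for this GPU type
    print(result.stdout or result.stderr)
```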

4. Sync Errors

Error messages:

rsync error
Sync failed
File not found
Permission denied during sync

Diagnosis:

# Check sync status
gpu sync status

# Check .gitignore
cat .gitignore

# Check outputs config
cat gpu.jsonc | grep outputs

Solutions:

| Cause | Solution |
| --- | --- |
| File too large | Add to `.gitignore` |
| Permission issue | Check file permissions |
| Path not in outputs | Add to `outputs` config |
| Disk full on pod | Increase workspace size |

Config fixes:

{
  // Ensure outputs are configured
  "outputs": ["output/", "results/", "models/"],

  // Exclude large files
  "exclude_outputs": ["*.tmp", "*.log", "checkpoints/"],

  // Increase storage
  "workspace_size_gb": 100
}
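To spot the "file too large" and "disk full" cases before a sync runs, a small sketch that walks the output directories and prints the largest files. The directory names below come from the example `outputs` config above and may differ in your project:

```python
import os

# Directories from the example "outputs" config above
output_dirs = ["output/", "results/", "models/"]

files = []
for root_dir in output_dirs:
    for root, _dirs, names in os.walk(root_dir):
        for name in names:
            path = os.path.join(root, name)
            try:
                files.append((os.path.getsize(path), path))
            except OSError:
                pass  # Broken symlink or permission issue; skip

# Show the 10 largest files that would be synced
for size, path in sorted(files, reverse=True)[:10]:
    print(f"{size / 1e9:8.2f} GB  {path}")
```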

5. Model Loading Errors

Error messages:

Model not found
Could not load model
Safetensors error
HuggingFace rate limit

Diagnosis:

# Check if model is downloading
# Look for download progress in job output

# Check HuggingFace cache on pod
gpu run ls -la ~/.cache/huggingface/hub/

Solutions:

| Cause | Solution |
| --- | --- |
| Model not downloaded | Add to download spec |
| Wrong model path | Fix path in code |
| HF rate limit | Set `HF_TOKEN` |
| Network issue | Retry with timeout |
| Gated model | Accept license on HF |

Config fixes:

{
  // Pre-download models
  "download": [
    { "strategy": "hf", "source": "meta-llama/Llama-3.1-8B-Instruct", "timeout": 7200 }
  ],

  // Set HF token in environment
  "environment": {
    "shell": {
      "steps": [
        { "run": "echo 'HF_TOKEN=your_token' >> ~/.bashrc" }
      ]
    }
  }
}
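For gated models and rate limits, the token can also be read from the environment and passed explicitly at load time. A minimal sketch for recent `transformers` versions (where the `token` argument replaced the older `use_auth_token`); the model ID reuses the example from the download spec above:

```python
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example gated model from the config above
hf_token = os.environ.get("HF_TOKEN")  # set via the environment config or your shell

tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(model_id, token=hf_token)
```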

6. Process Hanging / Stuck

Symptoms:

  • No output for a long time
  • Process doesn't exit
  • gpu status shows job running forever

Diagnosis:

# Check if process is actually running
gpu run ps aux | grep python

# Check for infinite loops in logs
gpu jobs logs <job-id> --tail 100

# Check VRAM (might be swapping)
gpu run nvidia-smi

Solutions:

| Cause | Solution |
| --- | --- |
| Infinite loop | Fix code logic |
| Waiting for input | Make script non-interactive |
| VRAM thrashing | Reduce memory usage |
| Deadlock | Add timeout |
| Network wait | Add timeout to requests |

Code fixes:

# Avoid stalls and CPU OOM during model loading
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    low_cpu_mem_usage=True  # Reduce peak CPU RAM while loading large checkpoints
)

# Add timeout to HTTP requests
import requests
response = requests.get(url, timeout=30)

# Add progress bars to see activity
from tqdm import tqdm
for item in tqdm(items):
    process(item)
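For the deadlock and silent-hang cases, a coarse watchdog can force a crash with a traceback instead of hanging forever. A minimal sketch using `signal.alarm` (Unix only, main thread only, which covers typical pod containers); the timeout value is just an example:

```python
import signal

def _timeout_handler(signum, frame):
    raise TimeoutError("Step exceeded the watchdog timeout")

signal.signal(signal.SIGALRM, _timeout_handler)

def run_with_watchdog(step_fn, timeout_seconds=600):
    """Run step_fn, raising TimeoutError if it takes longer than timeout_seconds."""
    signal.alarm(timeout_seconds)   # Arm the watchdog
    try:
        return step_fn()
    finally:
        signal.alarm(0)             # Disarm so later code isn't killed
```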

7. Exit Code Errors

Common exit codes:

| Code | Meaning | Common Cause |
| --- | --- | --- |
| 0 | Success | - |
| 1 | General error | Script exception |
| 2 | Misuse of command | Bad arguments |
| 126 | Permission denied | Script not executable |
| 127 | Command not found | Missing binary |
| 137 | Killed (OOM) | Out of memory |
| 139 | Segfault | Bad memory access |
| 143 | Terminated | Killed by signal |

Diagnosis:

# Check last job exit code
gpu jobs list --limit 1

# Get full output including error
gpu jobs logs <job-id>

Solutions:

| Exit Code | Solution |
| --- | --- |
| 1 | Check Python traceback, fix exception |
| 126 | `chmod +x script.sh` |
| 127 | Install missing package |
| 137 | Reduce memory usage, bigger GPU |
| 139 | Update PyTorch/CUDA versions |
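The exit-code table can also be applied programmatically when wrapping commands. A small sketch; the mapping mirrors the tables above and the command is just an example:

```python
import subprocess

# Mirrors the exit-code table above
EXIT_CODE_HINTS = {
    0: "Success",
    1: "General error - check the traceback in the job logs",
    2: "Misuse of command - check arguments",
    126: "Permission denied - chmod +x the script",
    127: "Command not found - install the missing binary",
    137: "Killed (OOM) - reduce memory usage or use a bigger GPU",
    139: "Segfault - check PyTorch/CUDA versions",
    143: "Terminated - killed by a signal",
}

result = subprocess.run(["python", "train.py"])  # example command
# subprocess reports signals as negative codes; convert to the shell's 128+N convention
code = result.returncode if result.returncode >= 0 else 128 - result.returncode
print(f"Exit code {code}: {EXIT_CODE_HINTS.get(code, 'Unknown exit code')}")
```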

8. CUDA Version Mismatch

Error messages:

CUDA error: no kernel image is available
CUDA version mismatch
Torch not compiled with CUDA enabled

Diagnosis:

# Check CUDA version on pod
gpu run nvcc --version
gpu run nvidia-smi | head -3

# Check PyTorch CUDA version
gpu run python -c "import torch; print(torch.version.cuda)"

Solutions:

{
  // Use a known-good base image
  "environment": {
    "base_image": "runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04"
  }
}

Or in requirements.txt:

# Install PyTorch with specific CUDA version
--extra-index-url https://download.pytorch.org/whl/cu124
torch==2.4.0
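The "no kernel image is available" error usually means the installed PyTorch wheel was not built for the pod's GPU architecture. A quick sketch that compares the device's compute capability against the architectures compiled into the wheel:

```python
import torch

print(f"PyTorch: {torch.__version__}, built for CUDA {torch.version.cuda}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    device_arch = f"sm_{major}{minor}"
    compiled = torch.cuda.get_arch_list()  # e.g. ['sm_80', 'sm_86', 'sm_90']
    print(f"Device architecture: {device_arch}")
    print(f"Wheel compiled for: {compiled}")
    if device_arch not in compiled:
        print("Possible mismatch: reinstall PyTorch from the matching CUDA index (see above)")
```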

Debug Script Template

Add this to your projects for better error info:

#!/usr/bin/env python3
"""Wrapper script with debugging info."""

import sys
import traceback
import torch

def print_system_info():
    """Print system info for debugging."""
    print("=" * 50)
    print("SYSTEM INFO")
    print("=" * 50)
    print(f"Python: {sys.version}")
    print(f"PyTorch: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"CUDA version: {torch.version.cuda}")
        print(f"GPU: {torch.cuda.get_device_name(0)}")
        print(f"VRAM total: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
        print(f"VRAM allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print("=" * 50)

def main():
    # Your actual code here
    pass

if __name__ == "__main__":
    print_system_info()
    try:
        main()
    except Exception as e:
        print("\n" + "=" * 50)
        print("ERROR OCCURRED")
        print("=" * 50)
        print(f"Error type: {type(e).__name__}")
        print(f"Error message: {str(e)}")
        print("\nFull traceback:")
        traceback.print_exc()

        # Print memory info for OOM debugging
        if torch.cuda.is_available():
            print(f"\nVRAM at error: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

        sys.exit(1)

Quick Reference: Error → Solution

| Error Contains | Likely Cause | Quick Fix |
| --- | --- | --- |
| `CUDA out of memory` | OOM | Reduce batch size, use quantization |
| `Connection refused` | Daemon down | `gpu daemon restart` |
| `No GPUs available` | Supply shortage | Try different GPU type |
| `Model not found` | Not downloaded | Add to download spec |
| `Permission denied` | File permissions | `chmod +x` or check path |
| `Killed` | OOM (exit 137) | Bigger GPU |
| `Timeout` | Network/hanging | Add timeout, check code |
| `CUDA version` | Version mismatch | Use compatible base image |
| `rsync error` | Sync issue | Check `.gitignore`, outputs |
| `rate limit` | HuggingFace limit | Set `HF_TOKEN` |
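The quick-reference table can double as a first-pass classifier for step 2 of the workflow ("identify error type"). A minimal sketch; the substring-to-cause mapping mirrors the table and is easy to extend:

```python
# Substring -> (likely cause, quick fix), mirroring the table above
ERROR_PATTERNS = [
    ("CUDA out of memory", ("OOM", "Reduce batch size, use quantization")),
    ("Connection refused", ("Daemon down", "gpu daemon restart")),
    ("No GPUs available", ("Supply shortage", "Try different GPU type")),
    ("Model not found", ("Not downloaded", "Add to download spec")),
    ("Permission denied", ("File permissions", "chmod +x or check path")),
    ("Killed", ("OOM (exit 137)", "Bigger GPU")),
    ("Timeout", ("Network/hanging", "Add timeout, check code")),
    ("CUDA version", ("Version mismatch", "Use compatible base image")),
    ("rsync error", ("Sync issue", "Check .gitignore, outputs")),
    ("rate limit", ("HuggingFace limit", "Set HF_TOKEN")),
]

def classify(error_text: str):
    """Return (likely cause, quick fix) for the first matching pattern."""
    lowered = error_text.lower()
    for pattern, diagnosis in ERROR_PATTERNS:
        if pattern.lower() in lowered:
            return diagnosis
    return ("Unknown", "Collect daemon logs and job output, then work through the sections above")

if __name__ == "__main__":
    print(classify("torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2 GiB"))
```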

Output Format

When debugging:

## Debug Analysis

### Error Identified

**Type**: [OOM/Network/Model/Sync/etc.]
**Message**: `[exact error message]`

### Root Cause

[Explanation of why this happened]

### Solution

**Option 1** (Recommended): [solution]
```[code/config change]```

**Option 2** (Alternative): [solution]
```[code/config change]```

### Prevention

To avoid this in the future:
1. [Prevention tip 1]
2. [Prevention tip 2]