| name | when-debugging-ml-training-use-ml-training-debugger |
| version | 1.0.0 |
| description | Debug ML training issues, including loss divergence, overfitting, and slow convergence, and optimize training performance |
| category | machine-learning |
| tags | debugging, ml, training, optimization, troubleshooting |
| agents | ml-developer, performance-analyzer, coder |
| difficulty | advanced |
| estimated_duration | 30-60min |
| success_criteria | Issue diagnosed correctly, Root cause identified, Fix applied successfully, Training convergence restored |
| validation_method | training_validation |
| dependencies | claude-flow@alpha, tensorflow/pytorch, tensorboard (for visualization) |
| outputs | Diagnostic report, Fixed model/training code, Performance comparison, Optimization recommendations |
| triggers | Training loss diverging/NaN, Overfitting detected, Slow convergence, Poor validation performance |
| author | ruv |
When NOT to Use This Skill
- Simple data preprocessing without model training
- Statistical analysis that does not require ML models
- Rule-based systems without learning components
- Operations that do not involve model training or inference
Success Criteria
- Model training convergence: Loss decreasing consistently
- Validation accuracy: Meeting or exceeding baseline targets
- Training time: Within expected bounds for dataset size
- GPU utilization: >80% during training
- Model export success: 100% successful saves
- Inference latency: <100ms for real-time applications (see the benchmark sketch after this list)
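To check the latency target above, a minimal benchmarking sketch could be used; it assumes a trained Keras `model` and a single representative input batch, and the helper name `benchmark_latency` is illustrative.

import time
import numpy as np

def benchmark_latency(model, sample, warmup=10, runs=100):
    """Measure single-sample inference latency in milliseconds."""
    for _ in range(warmup):  # warm up kernels and caches first
        model.predict(sample, verbose=0)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        model.predict(sample, verbose=0)
        timings.append((time.perf_counter() - start) * 1000)
    return {'p50_ms': float(np.percentile(timings, 50)),
            'p95_ms': float(np.percentile(timings, 95))}

# Example: latency = benchmark_latency(model, X_val[:1]); check latency['p95_ms'] < 100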
Edge Cases & Error Handling
- GPU Memory Overflow: Reduce batch size, use gradient accumulation, or mixed precision (see the sketch after this list)
- Divergent Training: Implement learning rate scheduling, gradient clipping
- Data Pipeline Failures: Validate data integrity, handle missing/corrupted files
- Version Mismatches: Lock dependency versions, use containerization
- Checkpoint Corruption: Save multiple checkpoints, validate before loading
- Distributed Training Failures: Handle node failures, implement fault tolerance
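As a sketch of the first mitigation above, and assuming TensorFlow 2.x with a model that is rebuilt after the policy change, mixed precision plus a smaller batch size might be enabled like this (batch size and epoch count are illustrative):

import tensorflow as tf

# Enable mixed precision (float16 compute, float32 variables) to roughly halve activation memory
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Wrap the optimizer so small float16 gradients do not underflow
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam(1e-3))
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Halve the batch size; if the effective batch size matters, accumulate gradients
# over multiple steps instead (not shown here)
model.fit(X_train, y_train, batch_size=16, epochs=10, validation_data=(X_val, y_val))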
Guardrails & Safety
- NEVER train on unvalidated or uncleaned data
- ALWAYS validate model outputs before deployment
- ALWAYS implement reproducibility (random seeds, version pinning); see the sketch after this list
- NEVER expose training data in model artifacts or logs
- ALWAYS monitor for bias and fairness issues
- ALWAYS implement model versioning and rollback capabilities
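A minimal reproducibility sketch for the seed-pinning guardrail above, assuming TensorFlow (other frameworks expose equivalent seeding calls; the SEED value is arbitrary):

import os
import random
import numpy as np
import tensorflow as tf

SEED = 42
os.environ['PYTHONHASHSEED'] = str(SEED)  # hash-based ordering
random.seed(SEED)                         # Python RNG
np.random.seed(SEED)                      # NumPy RNG
tf.random.set_seed(SEED)                  # TensorFlow op-level RNG

# Optionally force deterministic kernels (may slow training)
os.environ['TF_DETERMINISTIC_OPS'] = '1'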
Evidence-Based Validation
- Verify hardware availability: Check GPU/TPU status before training (see the sketch after this list)
- Validate data quality: Run data integrity checks and statistics
- Monitor training: Track loss curves, gradients, and metrics
- Test model performance: Evaluate on held-out test set
- Benchmark inference: Measure latency and throughput under load
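For the hardware check in the first item above, one possible TensorFlow sketch is:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if not gpus:
    raise RuntimeError('No GPU detected; aborting rather than silently falling back to CPU')
for gpu in gpus:
    # Allow memory growth so the process does not grab all GPU memory up front
    tf.config.experimental.set_memory_growth(gpu, True)
print(f'Using {len(gpus)} GPU(s): {[g.name for g in gpus]}')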
ML Training Debugger - Diagnose and Fix Training Issues
Overview
Systematic debugging workflow for ML training issues, including loss divergence, overfitting, slow convergence, and gradient problems, plus performance optimization.
When to Use
- Training loss becomes NaN or infinite
- Severe overfitting (train >> val performance)
- Training not converging
- Gradient vanishing/exploding
- Poor validation accuracy
- Training too slow
Phase 1: Diagnose Issue (8 min)
Objective
Identify the specific training problem
Agent: ML-Developer
Step 1.1: Analyze Training Curves
import json
import numpy as np

# Load training history
with open('training_history.json', 'r') as f:
    history = json.load(f)

def check_loss_divergence(losses):
    # Loss increasing over recent epochs
    if len(losses) > 10:
        return np.mean(losses[-5:]) > np.mean(losses[-10:-5])
    return False

def check_overfitting(train_loss, val_loss):
    # Validation loss diverging from training loss
    if len(train_loss) > 10:
        gap = np.mean(val_loss[-5:]) - np.mean(train_loss[-5:])
        return gap > 0.5  # Significant gap
    return False

def check_convergence_rate(losses):
    # Loss barely changing (plateau)
    if len(losses) > 20:
        return abs(losses[-1] - losses[-10]) < 0.01
    return False

def check_gradient_health(history):
    # Gradient norms are assumed to be logged under a 'grad_norm' key when available;
    # skip the check when they were not recorded
    norms = history.get('grad_norm', [])
    return bool(norms) and (np.mean(norms) < 1e-7 or np.mean(norms) > 100)

# Diagnose issues
diagnosis = {
    'loss_divergence': check_loss_divergence(history['loss']),
    'overfitting': check_overfitting(history['loss'], history['val_loss']),
    'slow_convergence': check_convergence_rate(history['loss']),
    'gradient_issues': check_gradient_health(history),
    'nan_values': bool(np.any(np.isnan(history['loss'])))
}

await memory.store('ml-debugger/diagnosis', diagnosis)
Step 1.2: Identify Root Cause
root_causes = []

if diagnosis['loss_divergence']:
    root_causes.append({
        'issue': 'Loss Divergence',
        'likely_cause': 'Learning rate too high',
        'severity': 'HIGH',
        'fix': 'Reduce learning rate by 10x'
    })

if diagnosis['nan_values']:
    root_causes.append({
        'issue': 'NaN Loss',
        'likely_cause': 'Numerical instability',
        'severity': 'CRITICAL',
        'fix': 'Add gradient clipping, reduce LR, check data for extreme values'
    })

if diagnosis['overfitting']:
    root_causes.append({
        'issue': 'Overfitting',
        'likely_cause': 'Model too complex or insufficient regularization',
        'severity': 'MEDIUM',
        'fix': 'Add dropout, L2 regularization, or more training data'
    })

if diagnosis['slow_convergence']:
    root_causes.append({
        'issue': 'Slow Convergence',
        'likely_cause': 'Learning rate too low or poor initialization',
        'severity': 'LOW',
        'fix': 'Increase learning rate, use better initialization'
    })

await memory.store('ml-debugger/root-causes', root_causes)
Step 1.3: Generate Diagnostic Report
report = f"""
# ML Training Diagnostic Report
## Issues Detected
{chr(10).join([f"- **{rc['issue']}** (Severity: {rc['severity']})" for rc in root_causes])}
## Root Cause Analysis
{chr(10).join([f"""
### {rc['issue']}
- **Likely Cause**: {rc['likely_cause']}
- **Recommended Fix**: {rc['fix']}
""" for rc in root_causes])}
## Training History Summary
- Final Train Loss: {history['loss'][-1]:.4f}
- Final Val Loss: {history['val_loss'][-1]:.4f}
- Epochs Completed: {len(history['loss'])}
"""
with open('diagnostic_report.md', 'w') as f:
    f.write(report)
Validation Criteria
- Issues identified
- Root causes determined
- Severity assessed
- Report generated
Phase 2: Analyze Root Cause (10 min)
Objective
Deep dive into the specific problem
Agent: Performance-Analyzer
Step 2.1: Gradient Analysis
import tensorflow as tf

# Monitor gradients during training
def gradient_analysis(model, X_batch, y_batch):
    # loss_fn is assumed to be the model's compiled loss function (e.g. model.loss)
    with tf.GradientTape() as tape:
        predictions = model(X_batch, training=True)
        loss = loss_fn(y_batch, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)

    analysis = {
        'gradient_norms': [tf.norm(g).numpy() for g in gradients if g is not None],
        'has_nan': any(bool(tf.reduce_any(tf.math.is_nan(g))) for g in gradients if g is not None),
        'has_inf': any(bool(tf.reduce_any(tf.math.is_inf(g))) for g in gradients if g is not None)
    }

    # Check for vanishing/exploding gradients
    gradient_norms = np.array(analysis['gradient_norms'])
    analysis['vanishing'] = np.mean(gradient_norms) < 1e-7
    analysis['exploding'] = np.mean(gradient_norms) > 100
    return analysis

grad_analysis = gradient_analysis(model, X_train[:32], y_train[:32])
await memory.store('ml-debugger/gradient-analysis', grad_analysis)
Step 2.2: Data Analysis
# Check for data issues
data_issues = {
'class_imbalance': check_class_balance(y_train),
'outliers': detect_outliers(X_train),
'missing_normalization': check_normalization(X_train),
'label_noise': estimate_label_noise(X_train, y_train, model)
}
def check_class_balance(labels):
unique, counts = np.unique(labels, return_counts=True)
imbalance_ratio = max(counts) / min(counts)
return imbalance_ratio > 10 # Significant imbalance
def check_normalization(data):
mean = np.mean(data, axis=0)
std = np.std(data, axis=0)
# Data should be roughly normalized
return np.mean(np.abs(mean)) > 1 or np.mean(std) > 10
await memory.store('ml-debugger/data-issues', data_issues)
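The helpers detect_outliers and estimate_label_noise used above are not defined in this skill. One possible sketch, assuming a z-score heuristic for outliers and confidently-wrong predictions as a rough label-noise proxy for a classification model (thresholds are illustrative):

def detect_outliers(data, z_threshold=4.0):
    # Flag the dataset if more than 1% of samples have any feature beyond z_threshold std devs
    z_scores = np.abs((data - np.mean(data, axis=0)) / (np.std(data, axis=0) + 1e-8))
    outlier_fraction = np.mean(np.any(z_scores > z_threshold, axis=1))
    return outlier_fraction > 0.01

def estimate_label_noise(X, y, model, sample_size=1000):
    # Rough proxy: fraction of confidently wrong predictions on a training subset
    idx = np.random.choice(len(X), size=min(sample_size, len(X)), replace=False)
    probs = model.predict(X[idx], verbose=0)
    preds = np.argmax(probs, axis=1)
    confident = np.max(probs, axis=1) > 0.9
    return np.mean(confident & (preds != y[idx])) > 0.05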
Step 2.3: Model Architecture Review
# Analyze model complexity
architecture_analysis = {
    'total_params': model.count_params(),
    'trainable_params': sum([tf.size(v).numpy() for v in model.trainable_variables]),
    'depth': len(model.layers),
    'has_batch_norm': any('batch_norm' in layer.name for layer in model.layers),
    'has_dropout': any('dropout' in layer.name for layer in model.layers),
    'activation_functions': [layer.activation.__name__ for layer in model.layers if hasattr(layer, 'activation')]
}

# Check for common issues
architecture_issues = []

if architecture_analysis['total_params'] / len(X_train) > 10:
    architecture_issues.append('Model too complex relative to data size')

if not architecture_analysis['has_batch_norm'] and architecture_analysis['depth'] > 10:
    architecture_issues.append('Deep model without batch normalization')

await memory.store('ml-debugger/architecture-issues', architecture_issues)
Validation Criteria
- Gradients analyzed
- Data issues identified
- Architecture reviewed
- Problems documented
Phase 3: Apply Fix (15 min)
Objective
Implement corrections based on diagnosis
Agent: Coder
Step 3.1: Fix Learning Rate
fixes_applied = []  # human-readable log of applied fixes, used in the Phase 5 report

if 'Loss Divergence' in [rc['issue'] for rc in root_causes]:
    # Reduce learning rate
    old_lr = model.optimizer.learning_rate.numpy()
    new_lr = old_lr / 10
    model.optimizer.learning_rate.assign(new_lr)
    fixes_applied.append(f'Reduced learning rate: {old_lr} -> {new_lr}')
    print(f"✅ Reduced learning rate: {old_lr} → {new_lr}")

if 'Slow Convergence' in [rc['issue'] for rc in root_causes]:
    # Increase learning rate (a warmup schedule sketch follows below)
    old_lr = model.optimizer.learning_rate.numpy()
    new_lr = old_lr * 5
    model.optimizer.learning_rate.assign(new_lr)
    fixes_applied.append(f'Increased learning rate: {old_lr} -> {new_lr}')

    # Add LR scheduler
    lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-7
    )
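ReduceLROnPlateau does not provide the warmup mentioned above; a minimal warmup callback sketch, assuming Keras and a linear ramp over the first few epochs (the class name and warmup_epochs value are illustrative):

class LinearWarmup(tf.keras.callbacks.Callback):
    """Linearly ramp the learning rate from ~0 to target_lr over warmup_epochs."""
    def __init__(self, target_lr, warmup_epochs=3):
        super().__init__()
        self.target_lr = target_lr
        self.warmup_epochs = warmup_epochs

    def on_epoch_begin(self, epoch, logs=None):
        if epoch < self.warmup_epochs:
            lr = self.target_lr * (epoch + 1) / self.warmup_epochs
            self.model.optimizer.learning_rate.assign(lr)

warmup = LinearWarmup(target_lr=new_lr, warmup_epochs=3)
# Pass [warmup, lr_scheduler] as callbacks to model.fit in Phase 4 if warmup is wanted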
Step 3.2: Fix Overfitting
if 'Overfitting' in [rc['issue'] for rc in root_causes]:
    # Add regularization
    from tensorflow.keras import regularizers

    # Rebuild Dense layers with L2 regularization and dropout
    # (note: rebuilt layers start from fresh weights, so the model is retrained from scratch)
    new_layers = []
    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.Dense):
            new_layer = tf.keras.layers.Dense(
                layer.units,
                activation=layer.activation,
                kernel_regularizer=regularizers.l2(0.01),  # Add L2
                name=layer.name + '_reg'
            )
            new_layers.append(new_layer)
            # Add dropout after dense layers
            new_layers.append(tf.keras.layers.Dropout(0.3))
        else:
            new_layers.append(layer)

    # Rebuild model
    fixed_model = tf.keras.Sequential(new_layers)
    fixed_model.compile(
        optimizer=model.optimizer,
        loss=model.loss,
        metrics=model.metrics
    )
    fixes_applied.append('Added L2 regularization and dropout')
    print("✅ Added L2 regularization and dropout")
Step 3.3: Fix Gradient Issues
if grad_analysis['exploding']:
    # Add gradient clipping
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=0.001,
        clipnorm=1.0  # Clip each gradient's norm
    )
    model.compile(
        optimizer=optimizer,
        loss=model.loss,
        metrics=model.metrics
    )
    fixes_applied.append('Added gradient clipping (clipnorm=1.0)')
    print("✅ Added gradient clipping")

if grad_analysis['vanishing']:
    # Use better activation functions:
    # replace saturating sigmoid/tanh activations with ReLU
    for layer in model.layers:
        if hasattr(layer, 'activation'):
            if layer.activation.__name__ in ['sigmoid', 'tanh']:
                layer.activation = tf.keras.activations.relu
                print(f"✅ Changed {layer.name} activation to ReLU")
    # Note: activation changes on built layers only take effect after the model
    # is rebuilt (e.g. via tf.keras.models.clone_model) and recompiled
    fixes_applied.append('Replaced saturating activations with ReLU')
Step 3.4: Fix Data Issues
if data_issues['class_imbalance']:
    # Compute class weights
    from sklearn.utils.class_weight import compute_class_weight
    class_weights = compute_class_weight(
        'balanced',
        classes=np.unique(y_train),
        y=y_train
    )
    class_weight_dict = dict(enumerate(class_weights))
    fixes_applied.append('Applied balanced class weights')
    print(f"✅ Applying class weights: {class_weight_dict}")
else:
    class_weight_dict = None

if data_issues['missing_normalization']:
    # Re-normalize data
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    X_train_fixed = scaler.fit_transform(X_train)
    X_val_fixed = scaler.transform(X_val)
    fixes_applied.append('Re-normalized input features')
    print("✅ Data re-normalized")
else:
    # Keep the original data so Phase 4 can retrain unconditionally
    X_train_fixed, X_val_fixed = X_train, X_val
Validation Criteria
- Fixes applied
- Model recompiled
- Data corrected
- Ready for retraining
Phase 4: Validate Fix (12 min)
Objective
Verify that fixes resolve the issues
Agent: Performance-Analyzer
Step 4.1: Retrain Model
# Retrain with fixes
print("Retraining model with fixes...")

callbacks = [tf.keras.callbacks.EarlyStopping(patience=10)]
if 'Slow Convergence' in [rc['issue'] for rc in root_causes]:
    callbacks.append(lr_scheduler)

history_fixed = model.fit(
    X_train_fixed, y_train,
    validation_data=(X_val_fixed, y_val),
    batch_size=32,
    epochs=50,
    callbacks=callbacks,
    class_weight=class_weight_dict,
    verbose=1
)

# Save fixed training history
with open('training_history_fixed.json', 'w') as f:
    json.dump({
        'loss': history_fixed.history['loss'],
        'val_loss': history_fixed.history['val_loss'],
        'accuracy': history_fixed.history['accuracy'],
        'val_accuracy': history_fixed.history['val_accuracy']
    }, f)
Step 4.2: Compare Before/After
comparison = {
    'before': {
        'final_train_loss': history['loss'][-1],
        'final_val_loss': history['val_loss'][-1],
        'final_val_acc': history['val_accuracy'][-1],
        'converged': len(history['loss']) < 100
    },
    'after': {
        'final_train_loss': history_fixed.history['loss'][-1],
        'final_val_loss': history_fixed.history['val_loss'][-1],
        'final_val_acc': history_fixed.history['val_accuracy'][-1],
        'converged': history_fixed.history['val_loss'][-1] < history_fixed.history['val_loss'][-10]
    },
    'improvement': {
        'val_loss_reduction': (history['val_loss'][-1] - history_fixed.history['val_loss'][-1]) / history['val_loss'][-1] * 100,
        'val_acc_improvement': (history_fixed.history['val_accuracy'][-1] - history['val_accuracy'][-1]) * 100
    }
}

await memory.store('ml-debugger/comparison', comparison)
Step 4.3: Visualize Comparison
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Loss comparison
axes[0,0].plot(history['loss'], label='Before (Train)', alpha=0.7)
axes[0,0].plot(history['val_loss'], label='Before (Val)', alpha=0.7)
axes[0,0].plot(history_fixed.history['loss'], label='After (Train)', linestyle='--')
axes[0,0].plot(history_fixed.history['val_loss'], label='After (Val)', linestyle='--')
axes[0,0].set_title('Loss Comparison')
axes[0,0].legend()
axes[0,0].grid(True)
# Accuracy comparison
axes[0,1].plot(history['accuracy'], label='Before (Train)', alpha=0.7)
axes[0,1].plot(history['val_accuracy'], label='Before (Val)', alpha=0.7)
axes[0,1].plot(history_fixed.history['accuracy'], label='After (Train)', linestyle='--')
axes[0,1].plot(history_fixed.history['val_accuracy'], label='After (Val)', linestyle='--')
axes[0,1].set_title('Accuracy Comparison')
axes[0,1].legend()
axes[0,1].grid(True)
plt.savefig('training_comparison.png')
Validation Criteria
- Retraining successful
- Issues resolved
- Improvement documented
- Comparison visualized
Phase 5: Optimize Performance (5 min)
Objective
Apply additional optimizations
Agent: ML-Developer
Step 5.1: Generate Recommendations
recommendations = []

if comparison['after']['final_val_acc'] < 0.85:
    recommendations.append({
        'type': 'Architecture',
        'suggestion': 'Try deeper model or different architecture (CNN, Transformer)',
        'expected_improvement': '+5-10% accuracy'
    })

if comparison['after']['final_val_loss'] > 0.5:
    recommendations.append({
        'type': 'Data',
        'suggestion': 'Collect more training data or apply data augmentation',
        'expected_improvement': 'Better generalization'
    })

if history_fixed.history['loss'][-1] > 0.1:
    recommendations.append({
        'type': 'Training',
        'suggestion': 'Train longer with learning rate scheduling',
        'expected_improvement': 'Lower training loss'
    })

await memory.store('ml-debugger/recommendations', recommendations)
Step 5.2: Generate Final Report
# Assemble the final report in the same style as the Phase 1 diagnostic report
final_report = f"""
# ML Training Debug Report

## Original Issues
{chr(10).join([f"- {rc['issue']}: {rc['likely_cause']}" for rc in root_causes])}

## Fixes Applied
{chr(10).join([f"- {fix}" for fix in fixes_applied])}

## Results

### Before
- Val Loss: {comparison['before']['final_val_loss']:.4f}
- Val Accuracy: {comparison['before']['final_val_acc'] * 100:.2f}%

### After
- Val Loss: {comparison['after']['final_val_loss']:.4f}
- Val Accuracy: {comparison['after']['final_val_acc'] * 100:.2f}%

### Improvement
- Val Loss Reduction: {comparison['improvement']['val_loss_reduction']:.2f}%
- Val Accuracy Gain: +{comparison['improvement']['val_acc_improvement']:.2f}%

## Recommendations for Further Improvement
{chr(10).join([f"{i + 1}. **{r['type']}**: {r['suggestion']} ({r['expected_improvement']})" for i, r in enumerate(recommendations)])}
"""

with open('optimization_recommendations.md', 'w') as f:
    f.write(final_report)

# Save the corrected model listed in the skill outputs
model.save('fixed_model.h5')
Validation Criteria
- Recommendations generated
- Final report complete
- Model saved
- Ready for production
Success Metrics
- Training converges successfully
- Validation loss improved by >10%
- No NaN or infinite values
- Overfitting reduced
Skill Completion
Outputs:
- diagnostic_report.md: Issue analysis
- fixed_model.h5: Corrected model
- training_comparison.png: Before/after visualization
- optimization_recommendations.md: Next steps
Complete when training issues resolved and model performing well.
Core Principles
1. Systematic Diagnosis Over Random Fixes
Foundation: Apply first principles to identify root causes before attempting solutions.
In practice:
- Analyze training curves quantitatively (loss trends, gradient statistics) before changing hyperparameters
- Diagnose issues methodically: divergence -> overfitting -> convergence -> gradients -> data quality
- Document findings at each diagnostic step to prevent repeated investigations
- Use metrics-driven thresholds (e.g., train-val gap >5% = overfitting) rather than intuition
2. Evidence-Based Validation
Foundation: Every fix must be validated through measurable improvement in training metrics.
In practice:
- Compare before/after metrics with statistical significance (>10% improvement threshold)
- Visualize training curves to verify convergence after fixes
- Run ablation tests: apply one fix at a time to isolate effectiveness
- Maintain checkpoint history to enable rapid rollback if fixes degrade performance (see the checkpoint sketch below)
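One way to keep that checkpoint history, assuming Keras callbacks and a local checkpoints/ directory (paths are illustrative):

import tensorflow as tf

checkpoint_cbs = [
    # Keep one file per epoch so any earlier state can be restored
    tf.keras.callbacks.ModelCheckpoint('checkpoints/epoch_{epoch:03d}.h5'),
    # Also track the single best model by validation loss for quick rollback
    tf.keras.callbacks.ModelCheckpoint('checkpoints/best.h5',
                                       save_best_only=True, monitor='val_loss'),
]
# model.fit(..., callbacks=checkpoint_cbs)
# Roll back if a fix degrades performance:
# model = tf.keras.models.load_model('checkpoints/best.h5')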
3. Reproducibility and Transparency
Foundation: Training experiments must be reproducible and all interventions trackable.
In practice:
- Pin random seeds, library versions, and data splits before debugging
- Log all hyperparameter changes, data transformations, and architecture modifications
- Save diagnostic reports as artifacts alongside model checkpoints
- Use version control for training configurations to track evolution of fixes
Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Random Hyperparameter Tuning | Changing learning rate, batch size, or architecture without understanding the root cause wastes compute and may mask underlying issues (e.g., data quality problems) | Run Phase 1 diagnosis first. Only adjust hyperparameters after identifying specific issues (e.g., loss divergence -> reduce LR, overfitting -> add regularization) |
| Ignoring Data Quality | Focusing only on model/training tweaks while data issues (class imbalance, outliers, missing normalization) persist leads to models that can never converge properly | Always run Phase 2 data analysis. Check class balance, outlier detection, and normalization BEFORE retraining. Bad data = bad model regardless of architecture |
| Single-Metric Optimization | Optimizing only training loss without monitoring validation metrics, overfitting, or gradient health results in models that memorize training data but fail in production | Track multi-dimensional metrics: train/val loss, accuracy, precision/recall, gradient norms, and train-val gap. Use the Phase 4 validation checklist comprehensively (see the monitoring sketch below) |
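As a counterweight to single-metric optimization, a small Keras callback sketch that surfaces the train-val gap every epoch (the class name is illustrative; it assumes accuracy is among the compiled metrics):

class MultiMetricMonitor(tf.keras.callbacks.Callback):
    """Log train-val gaps each epoch so overfitting is visible as it develops."""
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        loss_gap = logs.get('val_loss', 0.0) - logs.get('loss', 0.0)
        acc_gap = logs.get('accuracy', 0.0) - logs.get('val_accuracy', 0.0)
        print(f"epoch {epoch}: train-val loss gap={loss_gap:.4f}, "
              f"train-val accuracy gap={acc_gap:.4f}")

# model.fit(..., callbacks=[MultiMetricMonitor()])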
Conclusion
This ML Training Debugger skill provides a systematic, evidence-based approach to diagnosing and resolving common training issues that plague neural network development. By following the five-phase workflow—diagnosis, root cause analysis, fix application, validation, and optimization—teams can reduce debugging time from days to hours while achieving more reliable convergence.
The skill's strength lies in its structured methodology: rather than randomly tweaking hyperparameters or architectures, it forces practitioners to first understand the problem through quantitative analysis of training curves, gradients, and data quality. This diagnostic-first approach prevents the common pitfall of applying fixes that mask symptoms rather than addressing root causes, ultimately leading to more robust models and faster iteration cycles.
For production ML teams, adopting this systematic debugging workflow ensures that training failures are treated as opportunities for learning rather than frustrations. The comprehensive diagnostic reports, before/after comparisons, and optimization recommendations create a knowledge base that compounds over time, enabling teams to build increasingly sophisticated models with fewer training-related setbacks.