when-developing-ml-models-use-ml-expert

@DNYoussef/context-cascade

Specialized ML model development, training, and deployment workflow


SKILL.md

name: when-developing-ml-models-use-ml-expert
version: 1.0.0
description: Specialized ML model development, training, and deployment workflow
category: machine-learning
tags: ml, training, deployment, model-development, neural-networks
agents: ml-developer, researcher, tester
difficulty: advanced
estimated_duration: 45-90min
success_criteria: Model trained successfully, Validation metrics meet targets, Production deployment ready, Documentation complete
validation_method: performance_metrics
dependencies: claude-flow@alpha, tensorflow/pytorch, flow-nexus (optional for distributed training)
outputs: Trained model file, Training metrics, Evaluation report, Deployment package
triggers: New ML model needed, Model training required, Production deployment
author: ruv

When NOT to Use This Skill

  • Simple data preprocessing without model training
  • Statistical analysis that does not require ML models
  • Rule-based systems without learning components
  • Operations that do not involve model training or inference

Success Criteria

  • Model training convergence: Loss decreasing consistently
  • Validation accuracy: Meeting or exceeding baseline targets
  • Training time: Within expected bounds for dataset size
  • GPU utilization: >80% during training
  • Model export success: 100% successful saves
  • Inference latency: <100ms for real-time applications
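
The real-time latency target can be checked with a small benchmark. This is a minimal sketch, assuming a trained Keras model and the preprocessed test set produced later in this workflow; the function name and run count are illustrative.

import time
import numpy as np

def benchmark_latency(model, X, n_runs=100):
    """Measure single-sample inference latency in milliseconds."""
    sample = X[:1]
    model.predict(sample, verbose=0)  # warm-up call so graph construction is not timed
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        model.predict(sample, verbose=0)
        timings.append((time.perf_counter() - start) * 1000)
    return {'p50_ms': float(np.percentile(timings, 50)),
            'p95_ms': float(np.percentile(timings, 95))}

# Example usage against the <100ms criterion:
# latency = benchmark_latency(best_model, X_test)
# assert latency['p95_ms'] < 100, f"Latency target missed: {latency}"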

Edge Cases & Error Handling

  • GPU Memory Overflow: Reduce batch size, use gradient accumulation, or mixed precision (a minimal mitigation sketch follows this list)
  • Divergent Training: Implement learning rate scheduling, gradient clipping
  • Data Pipeline Failures: Validate data integrity, handle missing/corrupted files
  • Version Mismatches: Lock dependency versions, use containerization
  • Checkpoint Corruption: Save multiple checkpoints, validate before loading
  • Distributed Training Failures: Handle node failures, implement fault tolerance
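
The first two mitigations can be applied in a few lines. This is a minimal sketch, assuming the Keras optimizer and training_config dict defined in Phase 2; the specific values are illustrative, and mixed precision requires a supported GPU.

import tensorflow as tf

# Mixed precision roughly halves activation memory on supported GPUs
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Gradient clipping guards against divergent training
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)

# A smaller batch size trades memory for more update steps
training_config['batch_size'] = 16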

Guardrails & Safety

  • NEVER train on unvalidated or uncleaned data
  • ALWAYS validate model outputs before deployment
  • ALWAYS implement reproducibility (random seeds, version pinning); a seed-setting sketch follows this list
  • NEVER expose training data in model artifacts or logs
  • ALWAYS monitor for bias and fairness issues
  • ALWAYS implement model versioning and rollback capabilities
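
For the reproducibility guardrail, a minimal seed-setting block to run before any data splitting or model construction:

import random
import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Pin dependency versions alongside the model artifacts,
# e.g. `pip freeze > deployment/requirements.lock`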

Evidence-Based Validation

  • Verify hardware availability: Check GPU/TPU status before training (scripted, together with the data checks, in the sketch after this list)
  • Validate data quality: Run data integrity checks and statistics
  • Monitor training: Track loss curves, gradients, and metrics
  • Test model performance: Evaluate on held-out test set
  • Benchmark inference: Measure latency and throughput under load
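
The hardware and data checks can be scripted. A minimal sketch, assuming the pandas DataFrame loaded in Phase 1:

import tensorflow as tf

# Hardware: confirm a GPU is visible before committing to a long run
gpus = tf.config.list_physical_devices('GPU')
print(f"GPUs available: {len(gpus)}")

# Data integrity: basic sanity checks on the Phase 1 DataFrame
assert not data.empty, "Dataset is empty"
assert data.isnull().sum().sum() == 0, "Unhandled missing values remain"
assert 'target' in data.columns, "Expected 'target' column is missing"
print(data['target'].value_counts(normalize=True))  # class balance at a glance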

ML Expert - Machine Learning Model Development

Overview

Specialized workflow for ML model development, training, and deployment. Supports various architectures (CNNs, RNNs, Transformers) with distributed training capabilities.

When to Use

  • Developing new ML models
  • Training neural networks
  • Model optimization
  • Production deployment
  • Transfer learning
  • Fine-tuning existing models
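
For the transfer-learning and fine-tuning cases, the Phase 2 architecture step can swap in a pretrained backbone instead of training from scratch. A minimal Keras sketch, assuming image inputs and the num_classes value defined in Phase 2:

import tensorflow as tf

# Frozen pretrained backbone plus a small task-specific head
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights='imagenet')
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])
# For fine-tuning, unfreeze the top layers of `base` and retrain with a lower learning rate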

Phase 1: Data Preparation (10 min)

Objective

Clean, preprocess, and prepare training data

Agent: ML-Developer

Step 1.1: Load and Analyze Data

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load data
data = pd.read_csv('dataset.csv')

# Analyze
analysis = {
    'shape': data.shape,
    'columns': data.columns.tolist(),
    'dtypes': data.dtypes.to_dict(),
    'missing': data.isnull().sum().to_dict(),
    'stats': data.describe().to_dict()
}

# Store analysis in shared agent memory (claude-flow runtime API, available in the agent execution context)
await memory.store('ml-expert/data-analysis', analysis)

Step 1.2: Data Cleaning

# Handle missing values in numeric columns (categorical NaNs, if any, should be imputed separately)
data = data.fillna(data.mean(numeric_only=True))

# Remove duplicates
data = data.drop_duplicates()

# Handle outliers
from scipy import stats
z_scores = np.abs(stats.zscore(data.select_dtypes(include=[np.number])))
data = data[(z_scores < 3).all(axis=1)]

# Encode categorical variables
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in data.select_dtypes(include=['object']).columns:
    data[col] = le.fit_transform(data[col])

Step 1.3: Split Data

# Split into train/val/test
X = data.drop('target', axis=1)
y = data['target']

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Normalize
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Save preprocessed data
np.save('data/X_train.npy', X_train)
np.save('data/X_val.npy', X_val)
np.save('data/X_test.npy', X_test)

Validation Criteria

  • Data loaded successfully
  • Missing values handled
  • Train/val/test split created
  • Normalization applied

Phase 2: Model Selection (10 min)

Objective

Choose and configure model architecture

Agent: Researcher

Step 2.1: Analyze Task Type

# Make the task type explicit so the architecture logic below can branch on it
task_type = 'classification'  # or 'regression' / 'clustering'

task_analysis = {
    'type': task_type,
    'complexity': 'low|medium|high',
    'dataSize': len(data),
    'features': X.shape[1],
    'classes': len(np.unique(y)) if task_type == 'classification' else None
}

# Recommend architecture
if task_analysis['type'] == 'classification' and task_analysis['features'] > 100:
    recommended_architecture = 'deep_neural_network'
elif task_analysis['type'] == 'regression' and task_analysis['dataSize'] < 10000:
    recommended_architecture = 'random_forest'
# ... more logic

Step 2.2: Define Architecture

import tensorflow as tf

def create_model(architecture, input_shape, num_classes):
    if architecture == 'dnn':
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation='relu', input_shape=input_shape),
            tf.keras.layers.Dropout(0.3),
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dropout(0.3),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(num_classes, activation='softmax')
        ])
    elif architecture == 'cnn':
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, (3,3), activation='relu', input_shape=input_shape),
            tf.keras.layers.MaxPooling2D((2,2)),
            tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
            tf.keras.layers.MaxPooling2D((2,2)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(num_classes, activation='softmax')
        ])
    # ... more architectures

    return model

model = create_model('dnn', input_shape=(X_train.shape[1],), num_classes=len(np.unique(y)))
model.summary()

Step 2.3: Configure Training

training_config = {
    'optimizer': tf.keras.optimizers.Adam(learning_rate=0.001),
    'loss': 'sparse_categorical_crossentropy',
    'metrics': ['accuracy'],
    'batch_size': 32,
    'epochs': 100,
    'callbacks': [
        tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True),
        tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5),
        tf.keras.callbacks.ModelCheckpoint('best_model.h5', save_best_only=True)
    ]
}

model.compile(
    optimizer=training_config['optimizer'],
    loss=training_config['loss'],
    metrics=training_config['metrics']
)

Validation Criteria

  • Task analyzed
  • Architecture selected
  • Model configured
  • Training parameters set

Phase 3: Train Model (20 min)

Objective

Execute training with monitoring

Agent: ML-Developer

Step 3.1: Start Training

# Train model (track wall-clock time for the evaluation report in Phase 4)
import time
start_time = time.time()

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=training_config['batch_size'],
    epochs=training_config['epochs'],
    callbacks=training_config['callbacks'],
    verbose=1
)

training_time = time.time() - start_time

# Save training history
import json
with open('training_history.json', 'w') as f:
    json.dump({
        'loss': history.history['loss'],
        'val_loss': history.history['val_loss'],
        'accuracy': history.history['accuracy'],
        'val_accuracy': history.history['val_accuracy']
    }, f)

Step 3.2: Monitor Training

import matplotlib.pyplot as plt

# Plot training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Loss
ax1.plot(history.history['loss'], label='Train Loss')
ax1.plot(history.history['val_loss'], label='Val Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.legend()
ax1.grid(True)

# Accuracy
ax2.plot(history.history['accuracy'], label='Train Accuracy')
ax2.plot(history.history['val_accuracy'], label='Val Accuracy')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.legend()
ax2.grid(True)

plt.savefig('training_curves.png')

Step 3.3: Distributed Training (Optional)

# Using Flow-Nexus for distributed training
from flow_nexus import DistributedTrainer

trainer = DistributedTrainer({
    'cluster_id': 'ml-training-cluster',
    'num_nodes': 4,
    'strategy': 'data_parallel'
})

# Train across multiple nodes
trainer.fit(
    model=model,
    train_data=(X_train, y_train),
    val_data=(X_val, y_val),
    config=training_config
)
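
If flow-nexus is unavailable, single-node multi-GPU data parallelism is a TensorFlow-native alternative. A minimal sketch, assuming the create_model function and training_config from Phase 2:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
print(f"Replicas in sync: {strategy.num_replicas_in_sync}")

with strategy.scope():
    model = create_model('dnn', input_shape=(X_train.shape[1],),
                         num_classes=len(np.unique(y)))
    # Create the optimizer inside the strategy scope
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss=training_config['loss'],
                  metrics=training_config['metrics'])

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=training_config['batch_size'] * strategy.num_replicas_in_sync,
    epochs=training_config['epochs'],
    callbacks=training_config['callbacks']
)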

Validation Criteria

  • Training completed
  • No NaN losses
  • Validation metrics improving
  • Model checkpoints saved

Phase 4: Validate Performance (10 min)

Objective

Evaluate model on test set

Agent: Tester

Step 4.1: Evaluate on Test Set

# Load best model
best_model = tf.keras.models.load_model('best_model.h5')

# Evaluate
test_loss, test_accuracy = best_model.evaluate(X_test, y_test, verbose=0)

# Predictions
y_pred = best_model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)

# Detailed metrics
from sklearn.metrics import classification_report, confusion_matrix

metrics = {
    'test_loss': float(test_loss),
    'test_accuracy': float(test_accuracy),
    'classification_report': classification_report(y_test, y_pred_classes, output_dict=True),
    'confusion_matrix': confusion_matrix(y_test, y_pred_classes).tolist()
}

# Persist metrics to shared agent memory (claude-flow runtime API)
await memory.store('ml-expert/metrics', metrics)

Step 4.2: Generate Evaluation Report

report = f"""
# Model Evaluation Report

## Performance Metrics
- Test Loss: {test_loss:.4f}
- Test Accuracy: {test_accuracy:.4f}

## Classification Report
{classification_report(y_test, y_pred_classes)}

## Model Summary
- Architecture: {recommended_architecture}
- Parameters: {model.count_params()}
- Training Time: {training_time} seconds

## Training History
- Best Val Loss: {min(history.history['val_loss']):.4f}
- Best Val Accuracy: {max(history.history['val_accuracy']):.4f}
- Epochs Trained: {len(history.history['loss'])}
"""

with open('evaluation_report.md', 'w') as f:
    f.write(report)

Validation Criteria

  • Test accuracy > target threshold
  • No overfitting detected (a simple train/validation gap check is sketched below)
  • Metrics documented
  • Report generated
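
One simple overfitting check compares the final training and validation accuracies from the Phase 3 history. A minimal sketch; the 0.05 gap threshold is an illustrative assumption, not a universal rule.

train_acc = history.history['accuracy'][-1]
val_acc = history.history['val_accuracy'][-1]
gap = train_acc - val_acc

if gap > 0.05:
    print(f"Possible overfitting: train={train_acc:.3f}, val={val_acc:.3f}")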

Phase 5: Deploy to Production (15 min)

Objective

Package model for deployment

Agent: ML-Developer

Step 5.1: Export Model

# Save in multiple formats
model.save('model.h5')  # Keras format
model.save('model_savedmodel')  # TensorFlow SavedModel
# Convert to TFLite for mobile
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

# Save preprocessing pipeline
import joblib
joblib.dump(scaler, 'scaler.pkl')

Step 5.2: Create Deployment Package

import shutil
import os

# Create deployment directory
os.makedirs('deployment', exist_ok=True)

# Copy necessary files
shutil.copy('model.h5', 'deployment/')
shutil.copy('scaler.pkl', 'deployment/')
shutil.copy('evaluation_report.md', 'deployment/')

# Create inference script
inference_script = '''
import tensorflow as tf
import joblib
import numpy as np

class ModelInference:
    def __init__(self, model_path, scaler_path):
        self.model = tf.keras.models.load_model(model_path)
        self.scaler = joblib.load(scaler_path)

    def predict(self, input_data):
        # Preprocess
        scaled = self.scaler.transform(input_data)
        # Predict
        predictions = self.model.predict(scaled)
        return np.argmax(predictions, axis=1)

# Usage
inference = ModelInference('model.h5', 'scaler.pkl')
result = inference.predict(new_data)
'''

with open('deployment/inference.py', 'w') as f:
    f.write(inference_script)
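
A quick smoke test of the packaged script helps satisfy the "inference script tested" criterion below. A minimal sketch; the dummy input only verifies shapes and wiring, not predictive quality.

import sys
import numpy as np

sys.path.append('deployment')
from inference import ModelInference

smoke = ModelInference('deployment/model.h5', 'deployment/scaler.pkl')
dummy = np.random.rand(5, X_train.shape[1])  # illustrative rows with the correct feature width
preds = smoke.predict(dummy)
assert preds.shape == (5,), "Unexpected prediction shape"
print("Inference smoke test passed:", preds)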

Step 5.3: Generate Documentation

# Model Deployment Guide

## Files
- `model.h5`: Trained Keras model
- `scaler.pkl`: Preprocessing scaler
- `inference.py`: Inference script

## Usage
\`\`\`python
from inference import ModelInference

model = ModelInference('model.h5', 'scaler.pkl')
predictions = model.predict(new_data)
\`\`\`

## Performance
- Latency: < 50ms per prediction
- Accuracy: ${test_accuracy}
- Model Size: ${model_size}MB

## Requirements
- tensorflow>=2.0
- scikit-learn
- numpy

Validation Criteria

  • Model exported successfully
  • Deployment package created
  • Inference script tested
  • Documentation complete

Success Metrics

  • Test accuracy meets target
  • Training converged
  • No overfitting
  • Production-ready deployment

Skill Completion

Outputs:

  1. model.h5: Trained model file
  2. evaluation_report.md: Performance metrics
  3. deployment/: Production package
  4. training_history.json: Training logs

The skill is complete when the model is deployed and validated.

Core Principles

1. Data-Centric Development

Foundation: Model performance is fundamentally limited by data quality, not just architecture sophistication.

In practice:

  • Invest significant effort in Phase 1 data preparation: cleaning, outlier removal, proper splitting, normalization
  • Analyze data statistics (class distribution, feature correlations, missing values) BEFORE selecting architecture
  • Treat data quality as a continuous concern: monitor for distribution drift post-deployment (a minimal drift check is sketched after this list)
  • Document data preprocessing pipelines with version control to ensure reproducibility
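
Distribution drift can be monitored with a simple per-feature comparison between training statistics and incoming production data. A minimal sketch; the production_batch array and the 0.1 threshold are illustrative assumptions.

import numpy as np

train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0) + 1e-8

def drift_report(production_batch, threshold=0.1):
    """Flag features whose mean has shifted by more than `threshold` training standard deviations."""
    shift = np.abs(production_batch.mean(axis=0) - train_mean) / train_std
    return {i: float(s) for i, s in enumerate(shift) if s > threshold}

# drifted = drift_report(production_batch)
# if drifted: alert the owning team and re-examine the data pipeline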

2. Iterative Experimentation with Strong Baselines

Foundation: Establish simple, well-tuned baselines before exploring complex architectures.

In practice:

  • Start with the simplest viable architecture (e.g., a shallow neural network or random forest) as the baseline in Phase 2 (see the sketch after this list)
  • Document baseline performance as reference point for all subsequent experiments
  • Only increase model complexity (deeper networks, attention mechanisms) when baseline performance plateaus
  • Track all architectural decisions and their impact on validation metrics to build institutional knowledge
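
A minimal baseline pair, assuming the Phase 1 splits: a majority-class dummy and a linear model provide the reference points against which the Phase 2 network must justify its complexity.

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

baselines = {
    'majority_class': DummyClassifier(strategy='most_frequent'),
    'logistic_regression': LogisticRegression(max_iter=1000),
}
for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_val, clf.predict(X_val))
    print(f"{name}: validation accuracy = {acc:.3f}")  # record as the reference for later experiments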

3. Production-Ready from the Start

Foundation: Design for deployment, monitoring, and maintenance from Phase 1, not as afterthoughts.

In practice:

  • Phase 5 deployment considerations inform Phase 2 architecture selection (model size, inference latency constraints)
  • Include inference speed and memory footprint in Phase 4 validation criteria alongside accuracy
  • Build deployment pipelines (Docker, serving APIs, monitoring) in parallel with model development, not sequentially
  • Design model APIs with versioning, A/B testing, and rollback capabilities from initial deployment

Anti-Patterns

Architecture-First Thinking

  • Problem: Selecting trendy architectures (Transformers, GANs) before understanding task requirements and data characteristics leads to overcomplicated solutions that underperform simpler baselines.
  • Solution: Follow the Phase 2 task analysis systematically. Match architecture to problem type (CNNs for images, RNNs for sequences, DNNs for tabular data) based on data structure, not hype. Start simple and add complexity only when justified by metrics.

Ignoring Validation Metrics

  • Problem: Obsessing over training loss while ignoring validation accuracy, overfitting, or inference latency results in models that perform well on training data but fail in production.
  • Solution: Phase 4 validation MUST be comprehensive: test accuracy, precision/recall, confusion matrices, AND deployment constraints (latency, memory). Never deploy without a held-out test set evaluation.

Manual Deployment Workflows

  • Problem: Treating deployment as a one-time manual process rather than an automated pipeline creates bottlenecks and inconsistencies and prevents rapid iteration or rollback.
  • Solution: Build Phase 5 deployment automation from the start: Dockerfiles, inference APIs, CI/CD pipelines, monitoring dashboards. Use model versioning and automated testing to enable safe continuous deployment.

Conclusion

This ML Expert skill encapsulates the complete lifecycle of professional machine learning development, from data preparation through production deployment. By enforcing a structured five-phase workflow—data preparation, architecture selection, training, validation, and deployment—it ensures that ML projects avoid common pitfalls like premature complexity, inadequate validation, or deployment afterthoughts that plague ad-hoc development approaches.

The skill's emphasis on data quality, baseline establishment, and production-readiness from the start distinguishes it from purely algorithmic ML workflows. Rather than treating model development as an isolated research exercise, it integrates deployment constraints, monitoring requirements, and operational considerations into every phase, ensuring that models not only achieve high validation accuracy but also meet real-world performance, latency, and maintainability requirements.

For organizations building production ML systems, this systematic approach reduces the time from experimentation to deployment while increasing model reliability. The comprehensive documentation, performance benchmarking, and automated deployment pipelines created during the workflow become organizational assets, enabling teams to scale ML capabilities efficiently and maintain high-quality standards across multiple projects.