| name | ml |
| description | Machine Learning development workflow with experiment tracking, hyperparameter optimization, and MLOps integration |
| tier | gold |
| version | 2.0.0 |
| category | specialized-development |
| tags | machine-learning, mlops, experiment-tracking, hyperparameter-tuning, model-registry |
| agents | ml-developer, data-scientist, mlops-engineer |
| tools | experiment-tracker, hyperparameter-tuner, model-registry, ml-ops-pipeline |
| dependencies | python-specialist, testing-quality, functionality-audit |
| prerequisites | Python 3.8+, ML frameworks (TensorFlow/PyTorch/scikit-learn), Docker (for MLOps), Git LFS (for model versioning) |
| author | ruv |
ML Development Skill
When to Use This Skill
- Model Training: Training neural networks or ML models
- Hyperparameter Tuning: Optimizing model performance
- Model Debugging: Diagnosing training issues (overfitting, vanishing gradients)
- Data Pipeline: Building training/validation data pipelines
- Experiment Tracking: Managing ML experiments and metrics
- Model Deployment: Serving models in production
When NOT to Use This Skill
- Data Analysis: Exploratory data analysis or statistics (use data scientist)
- Data Engineering: Large-scale ETL or data warehouse (use data engineer)
- Research: Novel algorithm development (use research specialist)
- Simple Rules: Heuristic-based logic without ML
Success Criteria
- Model achieves target accuracy/F1/RMSE on validation set
- Training/validation curves show healthy convergence
- No overfitting (train/val gap <5%)
- Inference latency meets production requirements
- Model size within deployment constraints
- Experiment tracked with metrics and artifacts (MLflow, Weights & Biases)
- Reproducible results (fixed random seeds, versioned data)
Edge Cases to Handle
- Class Imbalance: Unequal class distribution requiring resampling or class weighting (see the sketch after this list)
- Data Leakage: Information from validation/test leaking into training
- Catastrophic Forgetting: Model forgetting old tasks when learning new ones
- Adversarial Examples: Model vulnerable to adversarial attacks
- Distribution Shift: Training data differs from production data
- Hardware Constraints: GPU memory limits that force smaller batches, gradient checkpointing, or mixed-precision training
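For the class-imbalance edge case above, a minimal sketch of computing class weights with scikit-learn; the data here is illustrative, not part of the skill's scripts:

```python
# Sketch: weighting classes inversely to their frequency (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 950 + [1] * 50)        # 95/5 imbalance
X_train = np.random.randn(len(y_train), 8)       # placeholder features

weights = compute_class_weight("balanced", classes=np.unique(y_train), y=y_train)
class_weight = dict(zip(np.unique(y_train), weights))

# Pass the weights to any estimator that supports them.
clf = LogisticRegression(class_weight=class_weight, max_iter=1000)
clf.fit(X_train, y_train)
```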
Guardrails
- NEVER evaluate on training data
- ALWAYS use separate train/validation/test splits
- NEVER touch test set until final evaluation
- ALWAYS version datasets and models
- NEVER deploy without monitoring for data drift
- ALWAYS document model assumptions and limitations
- NEVER train on biased or unrepresentative data
Evidence-Based Validation
- Confusion matrix reviewed for class-wise performance
- Learning curves plotted (loss vs epochs)
- Validation metrics tracked across experiments
- Model profiled for inference time (TensorBoard, PyTorch Profiler)
- Ablation studies conducted for architecture choices
- Cross-validation performed for robust evaluation
- Statistical significance tested (t-test, bootstrap)
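A minimal sketch of the statistical-significance check above, comparing two models with a paired bootstrap over per-example correctness; the arrays are illustrative stand-ins for real predictions:

```python
# Sketch: paired bootstrap test for the difference in accuracy between two models.
import numpy as np

rng = np.random.default_rng(42)

# Per-example correctness (1 = correct) on the same validation set -- illustrative.
model_a = rng.integers(0, 2, size=500)
model_b = rng.integers(0, 2, size=500)

observed_diff = model_a.mean() - model_b.mean()
diffs = []
for _ in range(10_000):
    idx = rng.integers(0, len(model_a), size=len(model_a))  # resample with replacement
    diffs.append(model_a[idx].mean() - model_b[idx].mean())

ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
print(f"accuracy diff = {observed_diff:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
# If the interval excludes zero, the improvement is unlikely to be noise.
```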
Comprehensive machine learning development workflow with enterprise-grade experiment tracking, automated hyperparameter optimization, model registry management, and production MLOps pipelines.
Overview
This Gold-tier skill provides a complete ML development lifecycle with:
- Experiment Tracking: MLflow/W&B integration for reproducible experiments
- Hyperparameter Optimization: Optuna/Ray Tune for automated tuning
- Model Registry: Centralized model versioning and deployment
- MLOps Pipeline: Production-ready model serving and monitoring
Quick Start
# Initialize ML project
npx claude-flow sparc run ml "Create ML project for image classification"
# Track experiment
python resources/scripts/experiment-tracker.py --config experiment-config.yaml
# Optimize hyperparameters
node resources/scripts/hyperparameter-tuner.js --space hyperparameter-space.json
# Deploy model
bash resources/scripts/model-registry.sh deploy production latest
Workflow Phases
1. Experiment Design
- Define hypothesis and metrics
- Configure experiment tracking
- Set up data pipelines
- Validate data quality
2. Model Development
- Implement model architecture
- Configure training pipeline
- Set up validation strategy
- Enable experiment logging
3. Hyperparameter Optimization
- Define search space
- Select optimization algorithm
- Run distributed trials
- Analyze results
4. Model Evaluation
- Comprehensive metrics analysis
- Cross-validation
- Error analysis
- Model interpretability
5. Model Deployment
- Register model in registry
- Create deployment pipeline
- Set up monitoring
- Enable A/B testing
Resources
Scripts
- experiment-tracker.py: MLflow/W&B experiment tracking with auto-logging
- hyperparameter-tuner.js: Distributed hyperparameter optimization
- model-registry.sh: Model versioning and deployment automation
- ml-ops.py: End-to-end MLOps pipeline orchestration
Templates
- experiment-config.yaml: Experiment configuration template
- hyperparameter-space.json: Hyperparameter search space definition
- model-card.md: Model documentation template
Examples
1. Experiment Tracking
150-line example showing MLflow integration with auto-logging, artifact tracking, and metric visualization.
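A much shorter sketch of that pattern, assuming MLflow and scikit-learn are installed and a local file-based tracking store is acceptable; the experiment and run names are illustrative:

```python
# Sketch: MLflow experiment tracking with autologging (local file store).
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

mlflow.set_experiment("image-classification-baseline")  # experiment name is illustrative
mlflow.sklearn.autolog()                                 # logs params, metrics, and the model

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    mlflow.log_metric("val_accuracy", model.score(X_val, y_val))
```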
2. Hyperparameter Optimization
250-line example demonstrating Optuna-based distributed hyperparameter tuning with pruning and parallel trials.
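A condensed sketch of the Optuna approach with median pruning (not the full 250-line example); the training loop and search space are illustrative stand-ins for real training code:

```python
# Sketch: Optuna study with median pruning; the loop is a stand-in for real training.
import numpy as np
import optuna
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

def objective(trial):
    alpha = trial.suggest_float("alpha", 1e-6, 1e-2, log=True)
    model = SGDClassifier(loss="log_loss", alpha=alpha, random_state=42)
    classes = np.unique(y_train)
    acc = 0.0
    for epoch in range(20):
        model.partial_fit(X_train, y_train, classes=classes)
        acc = model.score(X_val, y_val)
        trial.report(acc, step=epoch)          # intermediate value for the pruner
        if trial.should_prune():               # stop unpromising trials early
            raise optuna.TrialPruned()
    return acc

study = optuna.create_study(direction="maximize",
                            pruner=optuna.pruners.MedianPruner(n_warmup_steps=5))
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```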
3. MLOps Pipeline
300-line example implementing complete MLOps workflow with model registry, CI/CD, and monitoring.
Best Practices
Reproducibility
- Track all experiment parameters
- Version control data and code
- Use deterministic random seeds
- Document environment dependencies
Experiment Organization
- Use hierarchical experiment structure
- Tag experiments meaningfully
- Archive failed experiments
- Maintain experiment runbooks
Model Management
- Semantic versioning for models
- Comprehensive model cards
- Automated model testing
- Deployment staging (dev/staging/prod)
Performance Optimization
- Distributed training for large models
- Mixed precision training (sketched after this list)
- Efficient data loading
- Model compression techniques
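For the mixed-precision item above, a minimal PyTorch sketch using automatic mixed precision; the model and data are placeholders, and the code falls back to full precision on CPU:

```python
# Sketch: mixed-precision training step with torch AMP (placeholder model/data).
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

inputs = torch.randn(64, 128, device=device)
targets = torch.randint(0, 10, (64,), device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, enabled=(device == "cuda")):
    loss = criterion(model(inputs), targets)    # forward pass runs in float16 where safe
scaler.scale(loss).backward()                    # scale loss to avoid gradient underflow
scaler.step(optimizer)
scaler.update()
```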
Monitoring & Observability
- Real-time metric tracking
- Data drift detection (see the PSI sketch after this list)
- Model performance degradation alerts
- Resource utilization monitoring
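A minimal sketch of the data-drift check using the Population Stability Index (PSI) on a single feature; the data and the 0.2 alert threshold are illustrative assumptions, though 0.2 is a commonly used cutoff:

```python
# Sketch: Population Stability Index (PSI) for one numeric feature (illustrative data).
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Compare the production (actual) distribution against the training (expected) one."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf                    # catch out-of-range values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac, a_frac = np.clip(e_frac, 1e-6, None), np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)
prod_feature = rng.normal(0.4, 1.2, 10_000)                  # shifted distribution

score = psi(train_feature, prod_feature)
print(f"PSI = {score:.3f}")   # > 0.2 is a common threshold for raising a drift alert
```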
Integration Points
- Data: AgentDB for vector search, PostgreSQL for metadata
- Compute: Flow Nexus sandboxes for distributed training
- CI/CD: Automated model testing and deployment
- Memory: Store experiment insights in Memory MCP
Advanced Features
- AutoML: Automated architecture search and feature engineering
- Distributed Training: Multi-GPU and multi-node training
- Model Compression: Quantization, pruning, distillation
- Federated Learning: Privacy-preserving distributed training
- Continuous Training: Automated retraining on new data
Troubleshooting
Common Issues
- Out of Memory: Reduce batch size, enable gradient checkpointing, or accumulate gradients (see the sketch below)
- Slow Training: Use mixed precision, optimize data pipeline
- Poor Convergence: Adjust learning rate, check data quality
- Deployment Failures: Validate model compatibility, test inference
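For the out-of-memory case, one common mitigation alongside gradient checkpointing is gradient accumulation, sketched below with a placeholder PyTorch loop:

```python
# Sketch: gradient accumulation to emulate a large batch with less GPU memory.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

accum_steps = 4                                   # effective batch = 4 x micro-batch
micro_batches = [(torch.randn(16, 128), torch.randint(0, 10, (16,))) for _ in range(8)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(micro_batches, start=1):
    loss = criterion(model(inputs), targets) / accum_steps   # average over the window
    loss.backward()                                           # gradients accumulate in .grad
    if step % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```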
Debug Mode
# Enable verbose logging
export ML_DEBUG=1
python resources/scripts/experiment-tracker.py --debug
Performance Metrics
- Experiment Setup: 2-5 minutes
- Hyperparameter Optimization: 30 minutes to 6 hours (depending on search space size)
- Model Deployment: 5-10 minutes
- Full MLOps Pipeline: 1-2 hours
Support
For issues or questions:
- Check examples directory for reference implementations
- Review test files for usage patterns
- Consult MLflow/Optuna documentation
- Use the functionality-audit skill for validation
Core Principles
1. Reproducibility First
Reproducibility is the foundation of scientific machine learning. Every experiment must be independently verifiable by other researchers.
In practice:
- Version control all training code, data, and model artifacts with Git LFS
- Pin exact dependency versions (e.g., TensorFlow 2.13.0, not >=2.0)
- Set deterministic random seeds across all libraries (Python, NumPy, TensorFlow/PyTorch); see the sketch after this list
- Document hardware specifications and environment configuration
- Track hyperparameters, data splits, and preprocessing steps in experiment tracker
- Store data checksums to verify dataset integrity
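A minimal sketch of the seed-setting step across the common libraries (PyTorch shown; TensorFlow has the analogous tf.random.set_seed):

```python
# Sketch: deterministic seeding across Python, NumPy, and PyTorch.
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    os.environ["PYTHONHASHSEED"] = str(seed)      # hash-based ordering
    random.seed(seed)                              # Python stdlib RNG
    np.random.seed(seed)                           # NumPy RNG
    torch.manual_seed(seed)                        # CPU and CUDA RNGs
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True      # trade speed for determinism
    torch.backends.cudnn.benchmark = False

set_seed(42)
```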
2. Evidence-Based Validation
Claims about model performance require rigorous statistical evidence, not cherry-picked metrics.
In practice:
- Use separate train/validation/test splits with no data leakage
- Perform k-fold cross-validation for robust performance estimates
- Report confidence intervals using bootstrapping or statistical tests
- Analyze confusion matrices for per-class performance
- Conduct ablation studies to validate architectural choices
- Test statistical significance of improvements (paired t-test, p < 0.05)
- Never evaluate on training data or touch test set until final evaluation
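A minimal sketch of the cross-validation and confidence-interval points above, using stratified k-fold and a normal-approximation interval over fold scores; the dataset is illustrative:

```python
# Sketch: stratified 5-fold cross-validation with a simple confidence interval over folds.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

mean, sem = scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))
print(f"F1 = {mean:.3f} +/- {1.96 * sem:.3f} (95% CI over folds)")
```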
3. Production Readiness from Day One
Models must be deployable, not just accurate. Production constraints shape development decisions.
In practice:
- Profile inference latency and memory usage during development (see the sketch after this list)
- Design for deployment constraints (model size, hardware availability)
- Implement monitoring for data drift and performance degradation
- Document model assumptions, limitations, and failure modes
- Set up automated retraining pipelines for continuous learning
- Plan rollback strategies before deployment
- Test on production-like data distributions
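A rough latency-profiling sketch for the first item in the list, timing batched inference with warm-up iterations; the model and input shape are placeholders:

```python
# Sketch: rough inference-latency profiling with warm-up and percentile reporting.
import time

import numpy as np
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()
batch = torch.randn(32, 128)

with torch.no_grad():
    for _ in range(10):                       # warm-up to exclude one-time costs
        model(batch)

    timings = []
    for _ in range(200):
        start = time.perf_counter()
        model(batch)
        timings.append((time.perf_counter() - start) * 1000)

p50, p95 = np.percentile(timings, [50, 95])
print(f"latency p50 = {p50:.2f} ms, p95 = {p95:.2f} ms per batch of 32")
```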
Anti-Patterns
| Anti-Pattern | Problem | Solution |
|---|---|---|
| Training on test data | Inflates accuracy metrics, model fails in production due to overfitting to test distribution | Use strict train/val/test splits. Never touch test set until final evaluation. Use cross-validation on training data only. |
| Ignoring class imbalance | Model achieves high accuracy by predicting majority class, fails on minority classes | Apply class weights, resampling (SMOTE), or stratified sampling. Evaluate with F1-score, precision-recall curves, not just accuracy. |
| Deploying without monitoring | Model degrades silently as production data drifts from training distribution | Implement data drift detection (KL divergence, PSI). Monitor prediction confidence distributions. Set up automated alerts for performance degradation. |
| Hyperparameter tuning on test set | Leaks test information into model selection, overestimates generalization | Tune hyperparameters only on validation set. Reserve test set for final evaluation after all decisions are made. |
| No experiment tracking | Cannot reproduce results or compare experiments, wastes time re-running experiments | Use MLflow/W&B from day one. Track all hyperparameters, metrics, and artifacts. Tag experiments with meaningful names. |
| Premature optimization | Spending weeks optimizing model before validating basic approach works | Start with simple baseline (logistic regression, small network). Validate approach works. Then optimize. |
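For the premature-optimization row, a minimal sketch of validating against a trivial baseline before investing in tuning; the dataset and models are illustrative:

```python
# Sketch: compare a majority-class baseline against a simple model before optimizing further.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
simple = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline F1:", f1_score(y_val, baseline.predict(X_val)))
print("logistic regression F1:", f1_score(y_val, simple.predict(X_val)))
# Only move to heavier models and tuning once the simple approach clearly beats the baseline.
```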
Conclusion
Machine learning development requires a rigorous engineering mindset that balances scientific accuracy with production pragmatism. The ML Development Skill provides a comprehensive workflow that treats reproducibility, evidence-based validation, and production readiness as first-class requirements, not afterthoughts.
By integrating experiment tracking (MLflow/W&B), automated hyperparameter optimization (Optuna), and production MLOps pipelines from the start, this skill ensures that models are not only accurate but also deployable, maintainable, and scientifically rigorous. The emphasis on statistical validation, data quality, and monitoring prevents common pitfalls like overfitting, data leakage, and silent production failures.
Whether you are training a simple classifier or building a complex multi-model pipeline, this skill enforces best practices that scale from research prototypes to production systems. The three core principles - reproducibility first, evidence-based validation, and production readiness - guide every phase of the ML lifecycle, from initial data exploration to continuous model improvement in production.