| name | ml-pipeline |
| description | Machine learning pipeline development with scikit-learn. Use this skill for training models, cross-validation, hyperparameter tuning, and model evaluation. |
# Machine Learning Pipeline
A structured approach to building and evaluating ML models.
## When to Use This Skill
- User asks to "train a model" or "build a classifier/regressor"
- Need to evaluate model performance
- Hyperparameter tuning tasks
- Model comparison and selection
## ML Pipeline Workflow

### 1. Data Preparation
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load and prepare data
df = pd.read_csv("data.csv")

# Separate features and target
X = df.drop(columns=['target'])
y = df['target']

# Encode categorical variables (keep one fitted encoder per column so the
# mappings can be reused on new data later)
encoders = {}
for col in X.select_dtypes(include='object').columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))
    encoders[col] = le

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features (fit on the training set only to avoid leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
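Label-encoding every categorical column treats its values as ordered integers, which tree models tolerate but linear models and SVMs generally do not. A minimal alternative sketch using `ColumnTransformer` with one-hot encoding and simple imputation (the column groups are inferred from dtypes and are an assumption about the dataset, not part of the original example):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Split columns by dtype; adjust to the actual dataset.
categorical_cols = X.select_dtypes(include='object').columns
numeric_cols = X.select_dtypes(include='number').columns

preprocess = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), numeric_cols),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_cols),
])
```

This preprocessor can be placed in front of any estimator in a `Pipeline`, so the same encoding and scaling are applied consistently during cross-validation and at prediction time.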
### 2. Model Training with Cross-Validation
```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Define models to compare
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(kernel='rbf', random_state=42),
}

# Cross-validation comparison
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
    results[name] = {
        'mean': scores.mean(),
        'std': scores.std(),
    }
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
```
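Note that `X_train_scaled` was scaled once using statistics from the whole training set, so each cross-validation fold sees a scaler partly fit on its own validation rows. A minimal sketch of the same comparison with scaling moved inside a `Pipeline`, so the scaler is refit per fold:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Each fold fits its own scaler on that fold's training portion only.
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy')
    print(f"{name} (pipeline): {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
```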
### 3. Hyperparameter Tuning
```python
from sklearn.model_selection import GridSearchCV

# Example: Random Forest tuning
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train_scaled, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")
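```

The grid above already has 36 combinations times 5 folds. When the search space grows larger, a randomized search that samples a fixed number of candidates is usually a better trade-off. A sketch with `RandomizedSearchCV`; the distributions below are illustrative, not tuned values:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': randint(100, 500),
    'max_depth': [5, 10, 15, None],
    'min_samples_split': randint(2, 11),
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,          # number of sampled parameter settings
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
)
random_search.fit(X_train_scaled, y_train)
print(f"Best params: {random_search.best_params_}")
```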
### 4. Model Evaluation
```python
from sklearn.metrics import (
    classification_report, confusion_matrix,
    accuracy_score, roc_auc_score, roc_curve
)
import matplotlib.pyplot as plt
import seaborn as sns

# Get best model
best_model = grid_search.best_estimator_

# Predictions
y_pred = best_model.predict(X_test_scaled)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig('confusion_matrix.png', dpi=150)
```
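`roc_auc_score` and `roc_curve` are imported above but not used. A minimal sketch for a binary target, assuming the tuned model exposes `predict_proba` (true for the random forest here; `SVC` would need `probability=True`):

```python
# Probability of the positive class for each test row
y_proba = best_model.predict_proba(X_test_scaled)[:, 1]

auc = roc_auc_score(y_test, y_proba)
print(f"ROC-AUC: {auc:.4f}")

# ROC curve: false positive rate vs. true positive rate across thresholds
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC Curve')
plt.legend()
plt.savefig('roc_curve.png', dpi=150)
```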
### 5. Model Saving
```python
import joblib

# Save model and scaler
joblib.dump(best_model, 'model.joblib')
joblib.dump(scaler, 'scaler.joblib')
print("Model saved to model.joblib")
```
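A short sketch of loading the saved artifacts for inference. The file names match the `dump` calls above; `new_data.csv` is a placeholder, and the new data is assumed to contain the same feature columns, encoded the same way as the training data:

```python
import joblib
import pandas as pd

model = joblib.load('model.joblib')
scaler = joblib.load('scaler.joblib')

# Placeholder input: same columns and categorical encoding as training data.
X_new = pd.read_csv('new_data.csv')
predictions = model.predict(scaler.transform(X_new))
```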
## Key Metrics to Report
- Classification: Accuracy, Precision, Recall, F1, ROC-AUC
- Regression: MSE, RMSE, MAE, R² (see the computation sketch below)
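The regression metrics are not computed anywhere in the classification workflow above; a minimal sketch of how to report them, using small placeholder arrays in place of a fitted regressor's test-set targets and predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Placeholder arrays standing in for y_test and a regressor's predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_hat = np.array([2.8, 5.4, 2.9, 6.5])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)
print(f"MSE: {mse:.4f}  RMSE: {rmse:.4f}  MAE: {mae:.4f}  R2: {r2:.4f}")
```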
## Best Practices
- Always use cross-validation for model selection
- Scale features for distance-based algorithms
- Handle class imbalance if present (see the class-weight sketch after this list)
- Report confidence intervals, not just point estimates
- Save models for reproducibility
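
One common way to handle class imbalance without resampling is class weighting; a minimal sketch, assuming the same classifiers used earlier (`class_weight='balanced'` reweights classes inversely to their frequency in the training labels):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Same models as in step 2, but with balanced class weights.
balanced_models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, class_weight='balanced'),
    'Random Forest': RandomForestClassifier(
        n_estimators=100, random_state=42, class_weight='balanced'
    ),
}
```

With imbalanced classes, also consider scoring on `'f1'` or `'roc_auc'` rather than `'accuracy'` in `cross_val_score` and `GridSearchCV`, since accuracy can look high while the minority class is ignored.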