Claude Code Plugins

Community-maintained marketplace

scikit-learn

@xiechy/climate-ai

ML toolkit covering classification, regression, clustering, PCA, preprocessing, pipelines, grid search, cross-validation, RandomForest, and SVM for general machine learning workflows.

Install Skill

1. Download skill
2. Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please review the skill's instructions to verify it before use.

SKILL.md

name: scikit-learn
description: ML toolkit covering classification, regression, clustering, PCA, preprocessing, pipelines, grid search, cross-validation, RandomForest, and SVM for general machine learning workflows.

Scikit-learn: Machine Learning in Python

Overview

Scikit-learn is Python's premier machine learning library, offering simple and efficient tools for predictive data analysis. Apply this skill for classification, regression, clustering, dimensionality reduction, model selection, preprocessing, and hyperparameter optimization.

When to Use This Skill

This skill should be used when:

  • Building classification models (spam detection, image recognition, medical diagnosis)
  • Creating regression models (price prediction, forecasting, trend analysis)
  • Performing clustering analysis (customer segmentation, pattern discovery)
  • Reducing dimensionality (PCA, t-SNE for visualization)
  • Preprocessing data (scaling, encoding, imputation)
  • Evaluating model performance (cross-validation, metrics)
  • Tuning hyperparameters (grid search, random search)
  • Creating machine learning pipelines
  • Detecting anomalies or outliers
  • Implementing ensemble methods

Core Machine Learning Workflow

Standard ML Pipeline

Follow this general workflow for supervised learning tasks:

  1. Data Preparation

    • Load and explore data
    • Split into train/test sets
    • Handle missing values
    • Encode categorical features
    • Scale/normalize features
  2. Model Selection

    • Start with baseline model
    • Try more complex models
    • Use domain knowledge to guide selection
  3. Model Training

    • Fit model on training data
    • Use pipelines to prevent data leakage
    • Apply cross-validation
  4. Model Evaluation

    • Evaluate on test set
    • Use appropriate metrics
    • Analyze errors
  5. Model Optimization

    • Tune hyperparameters
    • Feature engineering
    • Ensemble methods (see the sketch after this list)
  6. Deployment

    • Save model using joblib
    • Create prediction pipeline
    • Monitor performance
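
Where step 5 mentions ensemble methods, here is a minimal sketch using VotingClassifier (the estimator mix is illustrative; assumes X_train and y_train from a prior split):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Soft voting averages predicted class probabilities across models
ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
        ('nb', GaussianNB())
    ],
    voting='soft'
)
ensemble.fit(X_train, y_train)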

Classification Quick Start

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Create pipeline (prevents data leakage)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Split data (use stratify for imbalanced classes)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))

# Cross-validation for robust evaluation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

Regression Quick Start

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

# Create pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
pipeline.fit(X_train, y_train)

# Evaluate
y_pred = pipeline.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)  # requires scikit-learn >= 1.4
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

Algorithm Selection Guide

Classification Algorithms

Start with baseline: LogisticRegression

  • Fast, interpretable, works well for linearly separable data
  • Good for high-dimensional data (text classification)

General-purpose: RandomForestClassifier

  • Handles non-linear relationships
  • Robust to outliers
  • Provides feature importance
  • Good default choice

Best performance: HistGradientBoostingClassifier

  • State-of-the-art for tabular data
  • Fast on large datasets (>10K samples)
  • Often wins Kaggle competitions

Special cases:

  • Small datasets (<1K): SVC with RBF kernel
  • Very large datasets (>100K): SGDClassifier or LinearSVC
  • Interpretability critical: LogisticRegression or DecisionTreeClassifier
  • Probabilistic predictions: GaussianNB or calibrated models
  • Text classification: LogisticRegression with TfidfVectorizer
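
A minimal sketch of the HistGradientBoostingClassifier option (hyperparameters shown are illustrative; assumes X_train/y_train and X_test/y_test from an earlier split):

from sklearn.ensemble import HistGradientBoostingClassifier

# Bins features for speed and handles NaN values natively;
# tree-based, so no feature scaling is needed
clf = HistGradientBoostingClassifier(
    max_iter=200,         # number of boosting iterations
    learning_rate=0.1,
    early_stopping=True,  # holds out part of the training data
    random_state=42
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))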

Regression Algorithms

Start with baseline: LinearRegression or Ridge

  • Fast, interpretable
  • Works well when relationships are linear

General-purpose: RandomForestRegressor

  • Handles non-linear relationships
  • Robust to outliers
  • Good default choice

Best performance: HistGradientBoostingRegressor

  • State-of-the-art for tabular data
  • Fast on large datasets

Special cases:

  • Regularization needed: Ridge (L2) or Lasso (L1 + feature selection)
  • Very large datasets: SGDRegressor
  • Outliers present: HuberRegressor or RANSACRegressor
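
For the outlier-robust case, a brief sketch of both estimators (assumes X_train/y_train already defined):

from sklearn.linear_model import HuberRegressor, RANSACRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Huber loss is quadratic for small residuals, linear for large ones,
# so extreme targets cannot dominate the fit
huber = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', HuberRegressor(epsilon=1.35))  # smaller epsilon = more robust
])
huber.fit(X_train, y_train)

# RANSAC repeatedly fits on random subsets and keeps the best inlier consensus
ransac = RANSACRegressor(random_state=42)
ransac.fit(X_train, y_train)
print(f"Inliers used: {ransac.inlier_mask_.sum()}")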

Clustering Algorithms

Known number of clusters: KMeans

  • Fast and scalable
  • Assumes spherical clusters

Unknown number of clusters: DBSCAN or HDBSCAN

  • Handles arbitrary shapes
  • Automatic outlier detection

Hierarchical relationships: AgglomerativeClustering

  • Creates hierarchy of clusters
  • Good for visualization (dendrograms)

Soft clustering (probabilities): GaussianMixture

  • Provides cluster probabilities
  • Handles elliptical clusters
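
A minimal DBSCAN sketch (eps and min_samples are illustrative and should be tuned per dataset):

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Scale first: DBSCAN is distance-based
X_scaled = StandardScaler().fit_transform(X)

# eps = neighborhood radius; min_samples = density threshold
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# Points labeled -1 are noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters: {n_clusters}, noise points: {(labels == -1).sum()}")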

Dimensionality Reduction

Preprocessing/feature extraction: PCA

  • Fast and efficient
  • Linear transformation
  • ALWAYS standardize first

Visualization only: t-SNE or UMAP

  • Preserves local structure
  • Non-linear
  • DO NOT use for preprocessing

Sparse data (text): TruncatedSVD

  • Works with sparse matrices
  • Latent Semantic Analysis

Non-negative data: NMF

  • Interpretable components
  • Topic modeling
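
For text, a short Latent Semantic Analysis sketch (documents is an assumed list of raw strings; n_components is a tuning choice):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

# TruncatedSVD operates directly on the sparse TF-IDF matrix,
# unlike PCA, which would require densifying it
lsa = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, stop_words='english')),
    ('svd', TruncatedSVD(n_components=100, random_state=42))
])
X_lsa = lsa.fit_transform(documents)
print(lsa.named_steps['svd'].explained_variance_ratio_.sum())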

Working with Different Data Types

Numeric Features

Continuous features:

  1. Check distribution
  2. Handle outliers (remove, clip, or use RobustScaler)
  3. Scale using StandardScaler (most algorithms) or MinMaxScaler (neural networks)

Count data:

  1. Consider log transformation or sqrt
  2. Scale after transformation

Skewed data:

  1. Use PowerTransformer (Yeo-Johnson or Box-Cox)
  2. Or QuantileTransformer for stronger normalization
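
A short sketch of both transformers (Box-Cox requires strictly positive inputs, hence the Yeo-Johnson default here):

from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Yeo-Johnson handles zero and negative values; Box-Cox does not
pt = PowerTransformer(method='yeo-johnson')  # standardizes output by default
X_power = pt.fit_transform(X)

# QuantileTransformer maps values to a normal distribution by rank --
# stronger normalization, but it distorts distances more
qt = QuantileTransformer(output_distribution='normal', random_state=42)
X_quantile = qt.fit_transform(X)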

Categorical Features

Low cardinality (<10 categories):

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first', sparse_output=True)

High cardinality (>10 categories):

from sklearn.preprocessing import TargetEncoder
encoder = TargetEncoder()
# Uses target statistics, prevents leakage with cross-fitting

Ordinal relationships:

from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

Text Data

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('classifier', MultinomialNB())
])

text_pipeline.fit(X_train_text, y_train)

Mixed Data Types

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Define feature types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['country', 'occupation']

# Separate preprocessing pipelines
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=True))
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Complete pipeline
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)

Model Evaluation

Classification Metrics

Balanced datasets: Use accuracy or F1-score

Imbalanced datasets: Use balanced_accuracy, F1-weighted, or ROC-AUC

from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

balanced_acc = balanced_accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')

# ROC-AUC requires probabilities, not hard class predictions
y_proba = model.predict_proba(X_test)
auc = roc_auc_score(y_true, y_proba, multi_class='ovr')  # binary: pass y_proba[:, 1]

Cost-sensitive: Define custom scorer or adjust decision threshold
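
As a sketch, costs can be encoded with make_scorer or by moving the probability cutoff (the 0.3 threshold is illustrative):

from sklearn.metrics import make_scorer, fbeta_score
from sklearn.model_selection import cross_val_score

# F2 weights recall twice as heavily as precision --
# appropriate when false negatives are costlier than false positives
f2_scorer = make_scorer(fbeta_score, beta=2)
scores = cross_val_score(model, X_train, y_train, cv=5, scoring=f2_scorer)

# Or lower the decision threshold to catch more positives
y_proba = model.predict_proba(X_test)[:, 1]
y_pred_custom = (y_proba >= 0.3).astype(int)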

Comprehensive report:

from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

Regression Metrics

Standard use: RMSE and R²

from sklearn.metrics import root_mean_squared_error, r2_score

# On scikit-learn < 1.4, use mean_squared_error(y_true, y_pred) ** 0.5
rmse = root_mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

Outliers present: Use MAE (robust to outliers)

from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)

Percentage errors matter: Use MAPE

from sklearn.metrics import mean_absolute_percentage_error
mape = mean_absolute_percentage_error(y_true, y_pred)

Cross-Validation

Standard approach (5-10 folds):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Score: {scores.mean():.3f} (+/- {scores.std():.3f})")

Imbalanced classes (use stratification):

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

Time series (respect temporal order):

from sklearn.model_selection import TimeSeriesSplit

cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv)

Multiple metrics:

from sklearn.model_selection import cross_validate

scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
results = cross_validate(model, X, y, cv=5, scoring=scoring)

for metric in scoring:
    scores = results[f'test_{metric}']
    print(f"{metric}: {scores.mean():.3f}")

Hyperparameter Tuning

Grid Search (Exhaustive)

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)

Random Search (Faster)

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_distributions = {
    'n_estimators': randint(100, 1000),
    'max_depth': randint(5, 50),
    'min_samples_split': randint(2, 20),
    'max_features': uniform(0.1, 0.9)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,  # Number of combinations to try
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)

Pipeline Hyperparameter Tuning

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# Use double underscore for nested parameters
param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__kernel': ['rbf', 'linear'],
    'svm__gamma': ['scale', 'auto', 0.001, 0.01]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

Feature Engineering and Selection

Feature Importance

# Tree-based models have built-in feature importance
# (feature_names below is the list of column names for X_train)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

# Permutation importance (works for any model)
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_test, y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': result.importances_mean,
    'std': result.importances_std
}).sort_values('importance', ascending=False)

Feature Selection Methods

Univariate selection:

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)
selected_features = selector.get_support(indices=True)

Recursive Feature Elimination:

from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

selector = RFECV(
    RandomForestClassifier(n_estimators=100),
    step=1,
    cv=5,
    n_jobs=-1
)
X_selected = selector.fit_transform(X, y)
print(f"Optimal features: {selector.n_features_}")

Model-based selection:

from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100),
    threshold='median'  # or '0.5*mean', or specific value
)
X_selected = selector.fit_transform(X, y)

Polynomial Features

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
])

pipeline.fit(X_train, y_train)

Common Patterns and Best Practices

Always Use Pipelines

Pipelines prevent data leakage and ensure proper workflow:

Correct:

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

Wrong (data leakage):

scaler = StandardScaler().fit(X)  # Fit on all data!
X_train, X_test = train_test_split(scaler.transform(X))

Stratify for Imbalanced Classes

# Always use stratify for classification with imbalanced classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Scale When Necessary

Scale for: SVM, Neural Networks, KNN, Linear Models with regularization, PCA, Gradient Descent

Don't scale for: Tree-based models (Random Forest, Gradient Boosting), Naive Bayes
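
A compact sketch of the rule in pipeline form (the scaler is then fit only on training folds):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Distance-based model: scaling matters
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])

# Tree-based model: split points are scale-invariant, no scaler needed
rf = RandomForestClassifier(n_estimators=100, random_state=42)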

Handle Missing Values

from sklearn.impute import SimpleImputer

# Numeric: use median (robust to outliers)
imputer = SimpleImputer(strategy='median')

# Categorical: use constant value or most_frequent
imputer = SimpleImputer(strategy='constant', fill_value='missing')

Use Appropriate Metrics

  • Balanced classification: accuracy, F1
  • Imbalanced classification: balanced_accuracy, F1-weighted, ROC-AUC
  • Regression with outliers: MAE instead of RMSE
  • Cost-sensitive: custom scorer

Set Random States

# For reproducibility
model = RandomForestClassifier(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42
)

Use Parallel Processing

# Use all CPU cores
model = RandomForestClassifier(n_jobs=-1)
grid_search = GridSearchCV(model, param_grid, n_jobs=-1)

Unsupervised Learning

Clustering Workflow

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Always scale for clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Elbow method to find optimal k
inertias = []
silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    inertias.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X_scaled, labels))

# Plot and choose k
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(K_range, inertias, 'bo-')
ax1.set_xlabel('k')
ax1.set_ylabel('Inertia')
ax2.plot(K_range, silhouette_scores, 'ro-')
ax2.set_xlabel('k')
ax2.set_ylabel('Silhouette Score')
plt.show()

# Fit final model
optimal_k = 5  # Based on elbow/silhouette
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
labels = kmeans.fit_predict(X_scaled)

Dimensionality Reduction

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# ALWAYS scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Specify variance to retain
pca = PCA(n_components=0.95)  # Keep 95% of variance
X_pca = pca.fit_transform(X_scaled)

print(f"Original features: {X.shape[1]}")
print(f"Reduced features: {pca.n_components_}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.3f}")

# Visualize explained variance
import numpy as np
import matplotlib.pyplot as plt
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.show()

Visualization with t-SNE

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA

# Reduce to 50 dimensions with PCA first (faster)
pca = PCA(n_components=min(50, X.shape[1]))
X_pca = pca.fit_transform(X_scaled)

# Apply t-SNE (only for visualization!)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca)

# Visualize
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis', alpha=0.6)
plt.colorbar()
plt.title('t-SNE Visualization')
plt.show()

Saving and Loading Models

import joblib

# Save model or pipeline
joblib.dump(model, 'model.pkl')
joblib.dump(pipeline, 'pipeline.pkl')

# Load
loaded_model = joblib.load('model.pkl')
loaded_pipeline = joblib.load('pipeline.pkl')

# Use loaded model
predictions = loaded_model.predict(X_new)

Reference Documentation

This skill includes comprehensive reference files:

  • references/supervised_learning.md: Detailed coverage of all classification and regression algorithms, parameters, use cases, and selection guidelines
  • references/preprocessing.md: Complete guide to data preprocessing including scaling, encoding, imputation, transformations, and best practices
  • references/model_evaluation.md: In-depth coverage of cross-validation strategies, metrics, hyperparameter tuning, and validation techniques
  • references/unsupervised_learning.md: Comprehensive guide to clustering, dimensionality reduction, anomaly detection, and evaluation methods
  • references/pipelines_and_composition.md: Complete guide to Pipeline, ColumnTransformer, FeatureUnion, custom transformers, and composition patterns
  • references/quick_reference.md: Quick lookup guide with code snippets, common patterns, and decision trees for algorithm selection

Read these files when:

  • Need detailed parameter explanations for specific algorithms
  • Comparing multiple algorithms for a task
  • Understanding evaluation metrics in depth
  • Building complex preprocessing workflows
  • Troubleshooting common issues

Example search patterns:

# To find information about specific algorithms
grep -r "GradientBoosting" references/

# To find preprocessing techniques
grep -r "OneHotEncoder" references/preprocessing.md

# To find evaluation metrics
grep -r "f1_score" references/model_evaluation.md

Common Pitfalls to Avoid

  1. Data leakage: Always use pipelines, fit only on training data
  2. Not scaling: Scale for distance-based algorithms (SVM, KNN, Neural Networks)
  3. Wrong metrics: Use appropriate metrics for imbalanced data
  4. Not using cross-validation: Single train-test split can be misleading
  5. Forgetting stratification: Stratify for imbalanced classification
  6. Using t-SNE for preprocessing: t-SNE is for visualization only!
  7. Not setting random_state: Results won't be reproducible
  8. Ignoring class imbalance: Use stratification, appropriate metrics, or resampling
  9. PCA without scaling: Components will be dominated by high-variance features
  10. Testing on training data: Always evaluate on held-out test set