Supervised Learning Skill
Build, tune, and evaluate classification and regression models.
Quick Start
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Evaluate
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1_weighted')
print(f"CV F1: {cv_scores.mean():.4f} (+/- {cv_scores.std()*2:.4f})")
print(f"Test Accuracy: {model.score(X_test, y_test):.4f}")
Key Topics
1. Classification Algorithms
| Algorithm |
Best For |
Complexity |
| Logistic Regression |
Baseline, interpretable |
O(n*d) |
| Random Forest |
Tabular, general |
O(ndtrees) |
| XGBoost |
Competitions, accuracy |
O(ndtrees) |
| SVM |
High-dim, small data |
O(n²) |
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
classifiers = {
'lr': LogisticRegression(max_iter=1000, class_weight='balanced'),
'rf': RandomForestClassifier(n_estimators=100, class_weight='balanced'),
'xgb': XGBClassifier(n_estimators=100, eval_metric='logloss')
}
2. Regression Algorithms
| Algorithm |
Best For |
Key Param |
| Ridge |
Multicollinearity |
alpha |
| Lasso |
Feature selection |
alpha |
| Random Forest |
Non-linear |
n_estimators |
| XGBoost |
Best accuracy |
learning_rate |
3. Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_dist = {
'n_estimators': randint(50, 300),
'max_depth': randint(3, 15),
'min_samples_split': randint(2, 20),
'min_samples_leaf': randint(1, 10)
}
search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_dist,
n_iter=50,
cv=5,
scoring='f1_weighted',
n_jobs=-1,
random_state=42
)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")
4. Handling Class Imbalance
| Technique |
Implementation |
| Class Weights |
class_weight='balanced' |
| SMOTE |
imblearn.over_sampling.SMOTE() |
| Threshold Tuning |
Adjust prediction threshold |
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
pipeline = Pipeline([
('smote', SMOTE(random_state=42)),
('classifier', RandomForestClassifier())
])
5. Model Comparison
from sklearn.model_selection import cross_validate
import pandas as pd
def compare_models(models, X, y, cv=5):
results = []
for name, model in models.items():
cv_results = cross_validate(
model, X, y, cv=cv,
scoring=['accuracy', 'f1_weighted', 'roc_auc_ovr_weighted'],
return_train_score=True
)
results.append({
'model': name,
'train_acc': cv_results['train_accuracy'].mean(),
'test_acc': cv_results['test_accuracy'].mean(),
'test_f1': cv_results['test_f1_weighted'].mean(),
'test_auc': cv_results['test_roc_auc_ovr_weighted'].mean()
})
return pd.DataFrame(results).round(4)
Best Practices
DO
- Start with a simple baseline
- Use stratified splits for classification
- Log all hyperparameters
- Check for overfitting (train vs test gap)
- Use early stopping for boosting
DON'T
- Don't tune on test set
- Don't ignore class imbalance
- Don't skip feature importance analysis
- Don't use accuracy for imbalanced data
Exercises
Exercise 1: Model Selection
# TODO: Compare 3 different classifiers using cross-validation
# Report F1 score for each
Exercise 2: Hyperparameter Tuning
# TODO: Use RandomizedSearchCV to tune XGBoost
# Find optimal n_estimators, max_depth, learning_rate
Unit Test Template
import pytest
from sklearn.datasets import make_classification
def test_classifier_trains():
"""Test classifier can fit and predict."""
X, y = make_classification(n_samples=100, random_state=42)
model = get_classifier()
model.fit(X[:80], y[:80])
predictions = model.predict(X[80:])
assert len(predictions) == 20
assert set(predictions).issubset({0, 1})
def test_handles_imbalance():
"""Test model handles imbalanced classes."""
X, y = make_classification(n_samples=100, weights=[0.9, 0.1])
model = get_balanced_classifier()
model.fit(X, y)
predictions = model.predict(X)
# Should predict both classes
assert len(set(predictions)) == 2
Troubleshooting
| Problem |
Cause |
Solution |
| Overfitting |
Model too complex |
Reduce depth, add regularization |
| Underfitting |
Model too simple |
Increase complexity |
| Class imbalance |
Skewed data |
Use SMOTE or class weights |
| Slow training |
Large data |
Use LightGBM, reduce estimators |
Related Resources
- Agent:
02-supervised-learning
- Previous:
ml-fundamentals
- Next:
clustering
Version: 1.4.0 | Status: Production Ready