Claude Code Plugins

Community-maintained marketplace

SKILL.md

---
name: data-analyst
description: Standards for rigorous data analysis using OSEMN methodology. Focuses on statistical validation, model reliability, and AI-readability.
---

Data Analyst Standards (OSEMN)

Purpose

To transform raw data into actionable business insights using a rigorous, hypothesis-driven approach.

Core Principles

"No Hiding" AI Readability: All statistical outputs (`describe`, `corr`, p-values) and graphs must remain in the notebook output. This ensures tools like NotebookLM can understand the analysis in context.

Quality Standards (Tier 1 Best Practices)

1. Data Integrity (Obtain & Scrub)

"Garbage In, Garbage Out"

  • Data Source Verification: Check file extension, size, and metadata.
  • Data Quality Check:
    • Missing Values: Identify mechanism (MCAR, MAR, MNAR) before imputing.
    • Logical Failures: Check for impossible values (e.g., age < 0, future dates), Data Leakage, and Overfitting indicators.
    • Data Types: Ensure numeric columns are not stored as strings (e.g., "1,000" parsed as text), dates are parsed as dates, etc.
  • File Handling Standards:
    • Excel (.xlsx): Preserve existing formatting/formulas. Use openpyxl for editing, pandas for reading. Zero Hardcoding of calculated values.
    • CSV (.csv): Detect delimiter automatically (csv.Sniffer). Handle encoding errors (utf-8 vs cp949) explicitly.
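The CSV standards above can be sketched with the stdlib `csv` module. `read_csv_robust` is a hypothetical helper name, and the utf-8-then-cp949 fallback order is an assumption:

```python
import csv

def read_csv_robust(path, encodings=("utf-8", "cp949")):
    """Hypothetical helper: sniff the delimiter, falling back through encodings."""
    for enc in encodings:
        try:
            with open(path, encoding=enc, newline="") as f:
                sample = f.read(4096)
                # csv.Sniffer guesses the dialect (delimiter, quoting) from a sample
                dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
                f.seek(0)
                return list(csv.reader(f, dialect)), dialect.delimiter, enc
        except (UnicodeDecodeError, csv.Error):
            continue  # try the next candidate encoding
    raise ValueError(f"could not parse {path} with encodings {encodings}")
```

With pandas, `pd.read_csv(path, sep=None, engine="python")` performs similar delimiter sniffing, though the explicit loop above makes the encoding fallback visible.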

2. Hypothesis-Driven EDA (Explore)

"Ask, Don't Just Plot"

  • Univariate Analysis: Distribution of each key variable (Histogram/Boxplot). Check for Skewness/Kurtosis.
  • Bivariate Analysis: Correlation matrix, Scatter plots for relationships.
  • Statistical Validation:
    • Normality Test: Shapiro–Wilk or Kolmogorov–Smirnov test.
    • Significance: t-test/ANOVA for group differences.
  • Insight Logging: Record the implication of every finding immediately.
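A minimal sketch of the statistical-validation steps above using SciPy; the two groups here are synthetic stand-ins for, e.g., control and treatment metrics:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=200)  # synthetic "control" metric
group_b = rng.normal(loc=52, scale=5, size=200)  # synthetic "treatment" metric

# 1) Normality check (Shapiro-Wilk): p > 0.05 means no evidence against normality
w_stat, p_norm = stats.shapiro(group_a)

# 2) Group difference (Welch's t-test avoids the equal-variance assumption)
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Shapiro p={p_norm:.3f}, t={t_stat:.2f}, p={p_val:.4f}")
```

Per the Insight Logging rule, the cell output (both p-values) stays in the notebook together with a one-line interpretation.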

3. Rigorous Modeling (Model)

"Trust but Verify"

  • Baseline First: Compare complex models against a Dummy/Logistic Baseline.
  • Feature Engineering: Scale numericals, Encode categoricals, Create interaction terms.
  • Cross-Validation: Use Stratified K-Fold to prevent overfitting.
  • Metric Selection: Optimize for business KPI (not just Accuracy).
  • Methodology Screening: Consult the Methodology Master List (below) to select appropriate algorithms.
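The baseline-first and cross-validation rules can be sketched with scikit-learn on a synthetic imbalanced dataset (RandomForest stands in for "the complex model"):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 80/20 imbalanced binary classification problem
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline first: a majority-class dummy sets the floor to beat
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"),
                           X, y, cv=cv, scoring="f1")
model = cross_val_score(RandomForestClassifier(random_state=0),
                        X, y, cv=cv, scoring="f1")

print(f"baseline F1={baseline.mean():.2f}, model F1={model.mean():.2f}")
```

Note the metric is F1 rather than accuracy, matching the "optimize for business KPI" rule on imbalanced classes.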

4. Interpretation (Interpret)

"Why did it predict that?"

  • Feature Importance: SHAP values or Permutation Importance.
  • Error Analysis: Manually inspect the "Top 10 Worst Errors".
  • Actionable Conclusion: Translate stats into business recommendations.
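Permutation importance (one of the two techniques named above) via scikit-learn; the dataset is synthetic, with `shuffle=False` placing the two informative features in the first two columns so the result is checkable:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the informative features as columns 0 and 1
X, y = make_classification(n_samples=400, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the score drop
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print("features ranked by importance:", ranking.tolist())
```

Unlike impurity-based feature importances, this is computed on held-out data, so it is less biased toward high-cardinality features.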

Checklist (Quality Gate)

Before finalizing:

  • Reproducibility: Can the notebook run from top to bottom without error?
  • Storytelling: Does the notebook flow like a narrative?
  • Visuals: Are all graphs labeled (Title, Axis, Legend)?

Appendix: Methodology Master List (Reference)

Scan these tables to select the most appropriate methodology for your data and goal.

1. Preprocessing & Data Cleaning

| Methodology | Usage / Purpose | Data Constraints |
| --- | --- | --- |
| Simple Imputation | Missing Value Imputation (Simple Replacement) | Mean/Median (Numeric), Mode (Categorical) |
| KNN Imputation | Missing Value Imputation (Similarity-based) | Mainly Numeric, useful when correlations exist |
| Iterative Imputation | Missing Value Imputation (Model-based) | High variable correlation, assumes MAR |
| One-Hot Encoding | Categorical to Numeric | Nominal data, Low Cardinality |
| Label Encoding | Categorical to Numeric | Ordinal data |
| Target Encoding | Categorical to Numeric | High Cardinality features, Risk of Overfitting |
| Standard Scaler | Scaling (Standardization) | Sensitive to outliers, assumes Gaussian distribution |
| MinMax Scaler | Scaling (Normalization) | Bounded data, distribution agnostic |
| Robust Scaler | Scaling (Robust to Outliers) | Data with many outliers (Uses Median/IQR) |
| SMOTE | Oversampling (Imbalanced Data) | Synthesize minority class samples (Training set ONLY) |
| PCA | Dimensionality Reduction, Multicollinearity Removal | Continuous variables, assumes linear relationships |

2. Machine Learning Models

| Methodology | Type | Usage / Purpose | Constraints / Notes |
| --- | --- | --- | --- |
| Linear Regression | Regression | Baseline for regression | Linear relationship assumption |
| Logistic Regression | Classification | Baseline for classification | Linear separation assumption, large sparse data OK |
| SVM / SVR | Class/Reg | High accuracy in high-dimensional spaces | Computationally expensive (O(n^3)), Scale-sensitive |
| K-Nearest Neighbors | Class/Reg | Instance-based learning, Simple | Scale-sensitive, Small data |
| Random Forest | Ensemble | Robust Classification/Regression | Handles Mixed types, Robust to outliers/missing values |
| XGBoost / LightGBM | Ensemble | High Performance | Large datasets, handles missing values internally |
| CatBoost | Ensemble | Best for Categorical Features | Handles categories automatically, Slower training |
| Isolation Forest | Anomaly Detection | Outlier/Anomaly Detection | High-dimensional data, efficient |
| K-Means | Clustering | Partitioning into K clusters | Spherical Clusters, Sensitive to outliers, Scale-sensitive |
| DBSCAN | Clustering | Density-based clustering, Detects Outliers | Arbitrary shapes, Scale-sensitive, finding epsilon is hard |
| Hierarchical Clustering | Clustering | Dendrogram visualization | Computationally expensive for large data |

3. Deep Learning Models

| Methodology | Usage / Purpose | Data Constraints |
| --- | --- | --- |
| CNN | Image/Pattern Recognition | Grid-like data (Images, etc.) |
| RNN / LSTM | Sequence/Time-Series Prediction | Sequential data |
| Transformer | NLP, Complex Pattern Matching | Long sequences, Large-scale data |

4. Validation & Optimization

| Methodology | Type | Usage / Purpose | Notes |
| --- | --- | --- | --- |
| Stratified K-Fold | Validation | Cross Validation (Generalization) | Essential for Imbalanced Class distribution |
| K-Fold CV | Validation | Cross Validation | Sufficient data, Balanced classes |
| Time Series Split | Validation | Cross Validation (Temporal) | No future data leakage (essential for time series) |
| Grid Search | Tuning | Hyperparameter Optimization | Small search space (Exhaustive) |
| Bayesian Optimization | Tuning | Hyperparameter Optimization | Large search space, High evaluation cost |
| Optuna | Tuning | Hyperparameter Optimization framework | Efficient, Define-by-run, Pruning capabilities |
| L1 (Lasso) | Regularization | Sparse Model, Feature Selection | When a sparse solution is needed |
| L2 (Ridge) | Regularization | Prevent Overfitting, Weight Decay | When high multicollinearity exists |
| ElasticNet | Regularization | Combination of L1 and L2 | When both feature selection and regularization needed |
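A quick check of the Time Series Split row: with scikit-learn's `TimeSeriesSplit`, every training index precedes every test index, so no future data leaks into training:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # 10 time-ordered observations

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    # every training index precedes every test index: no future leakage
    assert train_idx.max() < test_idx.min()
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

Contrast with plain K-Fold, which would happily train on observations that occur after the test window.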

5. Interpretation

| Methodology | Usage / Purpose | Notes |
| --- | --- | --- |
| SHAP | Explain Model Predictions | Model-agnostic in general; the fast, exact TreeExplainer targets tree-based models |

Appendix: Evaluation Metrics Guide

Select metrics based on your problem type and business goal.

Classification Metrics

| Metric | Focus | When to use |
| --- | --- | --- |
| Accuracy | Overall Correctness | Balanced datasets only. Misleading for imbalanced data. |
| Precision | False Positive Reduction | When FP is costly (e.g., Spam Filter). |
| Recall | False Negative Reduction | When FN is critical (e.g., Cancer Diagnosis, Fraud). |
| F1 Score | Balance | When you need a balance between Precision and Recall. |
| ROC-AUC | Ranking Quality | When you need robust performance across thresholds. |
| Log Loss | Probability Confidence | When the predicted probability value itself matters. |
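Why accuracy misleads on imbalanced data (the table's first row), in a few lines:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 95 negatives, 5 positives; the model lazily predicts "negative" for everything
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)                  # 0.95 -- looks great
rec = recall_score(y_true, y_pred, zero_division=0)   # 0.0  -- misses every positive
f1 = f1_score(y_true, y_pred, zero_division=0)        # 0.0

print(f"accuracy={acc:.2f}, recall={rec:.2f}, f1={f1:.2f}")
```

A 95% accurate model that catches zero fraud cases is why the checklist says to optimize for the business KPI, not accuracy.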

Regression Metrics

| Metric | Focus | When to use |
| --- | --- | --- |
| MSE | Large Error Penalty | When outliers/large errors should be heavily penalized. |
| RMSE | Interpretability | When you need error in the same unit as the target. |
| MAE | Robustness | When you want to be robust against outliers. |
| R2 Score | Explainability | To see how much variance is explained by the model. |
| MAPE | Business Interpretability | Error in Percentage (%). Easy for stakeholders. |
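RMSE vs. MAE on a toy example with one large error, illustrating the outlier-sensitivity difference the table describes:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 11.0, 10.0, 50.0])  # last target is an outlier
y_pred = np.array([10.0, 12.0, 11.0, 10.0, 20.0])  # model misses it by 30

mae = mean_absolute_error(y_true, y_pred)         # 6.0: one error of 30, averaged over 5
rmse = mean_squared_error(y_true, y_pred) ** 0.5  # ~13.4: squaring lets the big miss dominate

print(f"MAE={mae:.1f}, RMSE={rmse:.1f}")
```

Four perfect predictions and one bad one: MAE stays modest, while RMSE more than doubles it, which is exactly the "large error penalty" trade-off.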

Clustering Metrics (Unsupervised)

| Metric | Focus | When to use |
| --- | --- | --- |
| Silhouette Score | Cluster Separation | To measure how similar an object is to its own cluster compared to other clusters. |
| Davies-Bouldin | Cluster Compactness | Lower is better. Good for comparing clustering algorithms. |
| Elbow Method | Optimal K | To find the "elbow" (point of diminishing returns) in the K-Means inertia curve when choosing K. |
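A silhouette-score sketch on two synthetic, well-separated blobs; scores near 1 indicate tight, well-separated clusters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated 2D blobs centered at (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)
print(f"silhouette={score:.2f}")
```

Rerunning this for a range of K values and plotting the scores (alongside the elbow curve of inertia) is a common way to pick K in practice.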