---
name: data-analyst
description: Standards for rigorous data analysis using OSEMN methodology. Focuses on statistical validation, model reliability, and AI-readability.
---
# Data Analyst Standards (OSEMN)

## Purpose
To transform raw data into actionable business insights using a rigorous, hypothesis-driven approach.
## Core Principles

> *"No Hiding"*

- AI Readability: All statistical outputs (`describe`, `corr`, p-values) and graphs must remain in the notebook output. This ensures tools like NotebookLM can contextually understand the analysis.
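A minimal illustration of this rule, assuming an already-loaded DataFrame: leave results as visible cell output (or print them) instead of assigning them away.

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 35, 41, 29],
                   "income": [41000, 58000, 72000, 49000]})

# Keep the evidence in the output: print (or leave as the last expression
# of a cell) instead of assigning the result to a variable and moving on.
print(df.describe())
print(df.corr(numeric_only=True))
```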
## Quality Standards (Tier 1 Best Practices)

### 1. Data Integrity (Obtain & Scrub)

> *"Garbage In, Garbage Out"*
- Data Source Verification: Check the file extension, size, and metadata before loading.
- Data Quality Check:
  - Missing Values: Identify the missingness mechanism (MCAR, MAR, MNAR) before imputing.
  - Logical Failures: Check for impossible values (e.g., age < 0, future dates), and for data leakage and overfitting indicators.
  - Data Types: Ensure numeric columns are not stored as strings, etc.
- File Handling Standards (see the sketch after this list):
  - Excel (.xlsx): Preserve existing formatting/formulas. Use `openpyxl` for editing, `pandas` for reading. Zero hardcoding of calculated values.
  - CSV (.csv): Detect the delimiter automatically (`csv.Sniffer`). Handle encoding errors (utf-8 vs cp949) explicitly.
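A minimal sketch of these checks, assuming a CSV at the hypothetical path `data/sales.csv` with hypothetical `age` and `order_date` columns:

```python
import csv
import pandas as pd

PATH = "data/sales.csv"  # hypothetical path for illustration

# Detect the delimiter instead of assuming a comma.
with open(PATH, newline="", encoding="utf-8", errors="replace") as f:
    dialect = csv.Sniffer().sniff(f.read(4096))

# Handle encoding explicitly (e.g., utf-8 vs cp949).
try:
    df = pd.read_csv(PATH, sep=dialect.delimiter, encoding="utf-8")
except UnicodeDecodeError:
    df = pd.read_csv(PATH, sep=dialect.delimiter, encoding="cp949")

# Missing values: quantify per column before deciding how to impute.
print(df.isna().mean().sort_values(ascending=False))

# Logical failures: impossible values should fail loudly, not silently.
assert not (df["age"] < 0).any(), "Found negative ages"
assert not (pd.to_datetime(df["order_date"]) > pd.Timestamp.now()).any(), "Found future dates"

# Data types: numeric columns must not arrive as strings.
print(df.dtypes)
```

The code only surfaces the missingness pattern; deciding whether it is MCAR, MAR, or MNAR still requires domain reasoning.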
### 2. Hypothesis-Driven EDA (Explore)

> *"Ask, Don't Just Plot"*
- Univariate Analysis: Distribution of each key variable (Histogram/Boxplot). Check for Skewness/Kurtosis.
- Bivariate Analysis: Correlation matrix, Scatter plots for relationships.
- Statistical Validation (a minimal sketch follows this list):
  - Normality Test: Shapiro-Wilk or Kolmogorov-Smirnov test.
  - Significance: t-test/ANOVA for group differences.
- Insight Logging: Record the implication of every finding immediately.
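A minimal sketch of the statistical-validation step, with synthetic samples standing in for two real groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=10, size=200)  # stand-ins for real groups
group_b = rng.normal(loc=53, scale=10, size=200)

# Normality first: it decides which significance test is appropriate.
_, p_norm = stats.shapiro(group_a)
print(f"Shapiro-Wilk p-value: {p_norm:.4f}")

if p_norm > 0.05:
    # Roughly normal: Welch's t-test for the group difference.
    _, p = stats.ttest_ind(group_a, group_b, equal_var=False)
else:
    # Non-normal: fall back to a rank-based test.
    _, p = stats.mannwhitneyu(group_a, group_b)

print(f"Group difference p-value: {p:.4f}")
# Insight log (recorded immediately): e.g. "Group B runs ~3 units higher
# (p < 0.05); investigate segment-level drivers."
```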
### 3. Rigorous Modeling (Model)

> *"Trust, but Verify"*
- Baseline First: Compare complex models against a Dummy or logistic baseline (see the sketch after this list).
- Feature Engineering: Scale numeric features, encode categoricals, and create interaction terms.
- Cross-Validation: Use Stratified K-Fold to catch overfitting that a single train/test split can hide.
- Metric Selection: Optimize for business KPI (not just Accuracy).
- Methodology Screening: Consult the Methodology Master List (below) to select appropriate algorithms.
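A minimal baseline-first sketch under these rules, using scikit-learn and a synthetic imbalanced dataset in place of real features:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline first: the complex model must beat this to justify its cost.
models = [("baseline", DummyClassifier(strategy="stratified", random_state=42)),
          ("random_forest", RandomForestClassifier(random_state=42))]

for name, model in models:
    # F1 instead of raw accuracy: the classes are imbalanced.
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```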
### 4. Interpretation (Interpret)

> *"Why did it predict that?"*
- Feature Importance: SHAP values or Permutation Importance.
- Error Analysis: Manually inspect the "Top 10 Worst Errors".
- Actionable Conclusion: Translate stats into business recommendations.
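A minimal sketch of this step using permutation importance (model-agnostic, no extra dependency); SHAP is used analogously via the `shap` package. The data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Feature importance: measured on held-out data, not the training set.
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)
print(pd.Series(imp.importances_mean).sort_values(ascending=False))

# Error analysis: the "Top 10 Worst Errors" by predicted probability.
proba = model.predict_proba(X_te)[:, 1]
worst10 = np.argsort(np.abs(y_te - proba))[-10:]
print(pd.DataFrame(X_te[worst10]).assign(true=y_te[worst10], proba=proba[worst10]))
```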
## Checklist (Quality Gate)

Before finalizing:
- Reproducibility: Can the notebook run from top to bottom without error?
- Storytelling: Does the notebook flow like a narrative?
- Visuals: Are all graphs labeled (Title, Axis, Legend)?
## Appendix: Methodology Master List (Reference)
Scan these tables to select the most appropriate methodology for your data and goal.
### 1. Preprocessing & Data Cleaning
| Methodology | Usage / Purpose | Constraints / Notes |
|---|---|---|
| Simple Imputation | Missing Value Imputation (Simple Replacement) | Mean/Median (Numeric), Mode (Categorical) |
| KNN Imputation | Missing Value Imputation (Similarity-based) | Mainly Numeric, useful when correlations exist |
| Iterative Imputation | Missing Value Imputation (Model-based) | High variable correlation, assumes MAR |
| One-Hot Encoding | Categorical to Numeric | Nominal data, Low Cardinality |
| Label Encoding | Categorical to Numeric | Ordinal data |
| Target Encoding | Categorical to Numeric | High Cardinality features, Risk of Overfitting |
| Standard Scaler | Scaling (Standardization) | Sensitive to outliers, assumes Gaussian distribution |
| MinMax Scaler | Scaling (Normalization) | Bounded data, distribution agnostic |
| Robust Scaler | Scaling (Robust to Outliers) | Data with many outliers (Uses Median/IQR) |
| SMOTE | Oversampling (Imbalanced Data) | Synthesize minority class samples (Training set ONLY) |
| PCA | Dimensionality Reduction, Multicollinearity Removal | Continuous variables, assumes linear relationships |
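A minimal sketch combining several rows above (median/mode imputation, one-hot encoding, standard scaling) into one leak-free scikit-learn pipeline; the column names are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]   # hypothetical columns
categorical_cols = ["region"]

preprocess = ColumnTransformer([
    # Numeric: median imputation (robust to outliers), then standardization.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Categorical: mode imputation, then one-hot (low cardinality assumed).
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

df = pd.DataFrame({"age": [23, None, 41],
                   "income": [41000.0, 58000.0, None],
                   "region": ["N", "S", np.nan]})
print(preprocess.fit_transform(df))
```

Fitting the whole transformer inside a pipeline (rather than transforming before the split) is what keeps imputation and scaling statistics from leaking out of the training folds.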
### 2. Machine Learning Models
| Methodology | Type | Usage / Purpose | Constraints / Notes |
|---|---|---|---|
| Linear Regression | Regression | Baseline for regression | Linear relationship assumption |
| Logistic Regression | Classification | Baseline for classification | Linear separation assumption, large sparse data OK |
| SVM / SVR | Class/Reg | High accuracy in high dimensional spaces | Computationally expensive (O(n^3)), Scale-sensitive |
| K-Nearest Neighbors | Class/Reg | Instance-based learning, Simple | Scale-sensitive, Small data |
| Random Forest | Ensemble | Robust Classification/Regression | Handles Mixed types, Robust to outliers/missing values |
| XGBoost / LightGBM | Ensemble | High Performance | Large datasets, handles missing values internally |
| CatBoost | Ensemble | Best for Categorical Features | Handles categories automatically, Slower training |
| Isolation Forest | Anomaly Detection | Outlier/Anomaly Detection | High dimensional data, efficiency |
| K-Means | Clustering | Partitioning into K clusters | Spherical Clusters, Sensitive to outliers, Scale-sensitive |
| DBSCAN | Clustering | Density-based clustering, Detects Outliers | Arbitrary shapes, Scale-sensitive, finding epsilon is hard |
| Hierarchical | Clustering | Dendrogram visualization | Computationally expensive for large data |
### 3. Deep Learning Models
| Methodology | Usage / Purpose | Data Constraints |
|---|---|---|
| CNN | Image/Pattern Recognition | Grid-like data (Images, etc.) |
| RNN / LSTM | Sequence/Time-Series Prediction | Sequential data |
| Transformer | NLP, Complex Pattern Matching | Long sequences, Large-scale data |
### 4. Validation & Optimization
| Methodology | Type | Usage / Purpose | Notes |
|---|---|---|---|
| Stratified K-Fold | Validation | Cross Validation (Generalization) | Essential for Imbalanced Class distribution |
| K-Fold CV | Validation | Cross Validation | Sufficient data, Balanced classes |
| Time Series Split | Validation | Cross Validation (Temporal) | No future data leakage (essential for time-series) |
| Grid Search | Tuning | Hyperparameter Optimization | Small search space (Exhaustive) |
| Bayesian Optimization | Tuning | Hyperparameter Optimization | Large search space, High evaluation cost |
| Optuna | Tuning | Next-gen Hyperparameter Optimization | Efficient, Define-by-run, Pruning capabilities |
| L1 (Lasso) | Regularization | Sparse Model, Feature Selection | When sparse solution is needed |
| L2 (Ridge) | Regularization | Prevent Overfitting, Weight Decay | When high multicollinearity exists |
| ElasticNet | Regularization | Combination of L1 and L2 | When both feature selection and regularization needed |
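A minimal Optuna sketch tuning two Random Forest hyperparameters under cross-validation; the search space and trial count are illustrative only:

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=42)

def objective(trial):
    # Define-by-run: the search space is declared inside the objective.
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 300),
        max_depth=trial.suggest_int("max_depth", 2, 16),
        random_state=42,
    )
    return cross_val_score(model, X, y, cv=3, scoring="f1").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```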
### 5. Interpretation

| Methodology | Usage / Purpose | Notes |
|---|---|---|
| SHAP | Explain Model Predictions | Model-agnostic; TreeExplainer is especially fast for tree-based models |
| Permutation Importance | Global Feature Importance | Model-agnostic; compute on a held-out set |
## Appendix: Evaluation Metrics Guide
Select metrics based on your problem type and business goal.
### Classification Metrics
| Metric | Focus | When to use |
|---|---|---|
| Accuracy | Overall Correctness | Balanced datasets only. Misleading for imbalanced data. |
| Precision | False Positive Reduction | When FP is costly (e.g., Spam Filter). |
| Recall | False Negative Reduction | When FN is critical (e.g., Cancer Diagnosis, Fraud). |
| F1 Score | Balance | When you need a balance between Precision and Recall. |
| ROC-AUC | Ranking Quality | When you need robust performance across thresholds. |
| Log Loss | Probability Confidence | When the predicted probability value itself matters. |
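A minimal sketch computing the classification metrics above on toy predictions with scikit-learn:

```python
from sklearn.metrics import (accuracy_score, f1_score, log_loss,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]                      # toy labels
y_prob = [0.1, 0.2, 0.4, 0.3, 0.9, 0.6, 0.4, 0.2, 0.8, 0.1]  # predicted P(y=1)
y_pred = [int(p >= 0.5) for p in y_prob]

print("accuracy :", accuracy_score(y_true, y_pred))   # misleading if imbalanced
print("precision:", precision_score(y_true, y_pred))  # FP-sensitive
print("recall   :", recall_score(y_true, y_pred))     # FN-sensitive
print("f1       :", f1_score(y_true, y_pred))
print("roc_auc  :", roc_auc_score(y_true, y_prob))    # threshold-free ranking
print("log_loss :", log_loss(y_true, y_prob))         # probability quality
```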
### Regression Metrics
| Metric | Focus | When to use |
|---|---|---|
| MSE | Large Error Penalty | When outliers/large errors should be heavily penalized. |
| RMSE | Interpretability | When you need error in the same unit as the target. |
| MAE | Robustness | When you want to be robust against outliers. |
| R2 Score | Explainability | To see how much variance is explained by the model. |
| MAPE | Business Interpretability | Error in Percentage (%). Easy for stakeholders. |
### Clustering Metrics (Unsupervised)
| Metric | Focus | When to use |
|---|---|---|
| Silhouette Score | Cluster Separation | To measure how similar an object is to its own cluster compared to other clusters. |
| Davies-Bouldin | Cluster Compactness | Lower is better. Good for comparing clustering algorithms. |
| Elbow Method | Optimal K | To find the point of diminishing returns in inertia (the "elbow") when choosing K for K-Means. |
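A minimal sketch scanning candidate K values with the metrics above on synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Higher silhouette and lower Davies-Bouldin both favor the true K.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}, "
          f"davies_bouldin={davies_bouldin_score(X, labels):.3f}")
```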