SKILL.md

---
name: data-science
description: Statistical analysis strategies and data exploration techniques
---

Data Science for Scientific Discovery

When to Use This Skill

  • When choosing appropriate statistical tests
  • When exploring data structure
  • When validating assumptions
  • When interpreting statistical results

Data Exploration

Initial Data Assessment

Always start with:

import pandas as pd
import numpy as np

# data: your DataFrame (e.g. data = pd.read_csv("measurements.csv"))
# Basic info
print(f"Shape: {data.shape}")
print(f"Columns: {data.columns.tolist()}")
print(f"Data types:\n{data.dtypes}")

# Missing values
print(f"Missing values:\n{data.isnull().sum()}")

# Summary statistics
print(data.describe())

# Check for duplicates
print(f"Duplicate rows: {data.duplicated().sum()}")

Distribution Checks

Before running statistical tests, check distributions:

import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Visual check
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram
axes[0, 0].hist(data["variable"], bins=30)
axes[0, 0].set_title("Histogram")

# Q-Q plot for normality
stats.probplot(data["variable"], dist="norm", plot=axes[0, 1])
axes[0, 1].set_title("Q-Q Plot")

# Box plot (check for outliers)
axes[1, 0].boxplot(data["variable"])
axes[1, 0].set_title("Box Plot")

# Violin plot by group
if "group" in data.columns:
    sns.violinplot(data=data, x="group", y="variable", ax=axes[1, 1])
    axes[1, 1].set_title("Distribution by Group")

plt.tight_layout()
plt.savefig("distribution_check.png")

Statistical normality tests:

# Shapiro-Wilk test (reliable up to a few thousand observations)
stat, p = stats.shapiro(data["variable"])
print(f"Shapiro-Wilk: p={p:.4f}")

# Kolmogorov-Smirnov test against a normal fitted to the data;
# kstest(x, 'norm') with no args compares against a *standard* normal
# (strictly, estimating the parameters calls for the Lilliefors variant)
x = data["variable"]
stat, p = stats.kstest(x, 'norm', args=(x.mean(), x.std()))
print(f"K-S test: p={p:.4f}")

# Interpretation: p < 0.05 → reject normality

Choosing Statistical Tests

Decision Tree

What type of comparison?
│
├─> Two groups, continuous outcome
│   ├─> Normally distributed → Independent t-test
│   ├─> Non-normal → Mann-Whitney U test
│   └─> Paired samples → Paired t-test or Wilcoxon
│
├─> Multiple groups (3+), continuous outcome
│   ├─> Normally distributed → One-way ANOVA
│   ├─> Non-normal → Kruskal-Wallis test
│   └─> Multiple factors → Two-way ANOVA or mixed models
│
├─> Categorical outcome
│   ├─> 2x2 table → Chi-square or Fisher's exact
│   └─> Larger table → Chi-square test
│
└─> Association between continuous variables
    ├─> Linear relationship → Pearson correlation
    ├─> Monotonic relationship → Spearman correlation
    └─> Prediction → Linear regression
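
As a rough executable form of the two-group branch above, the checks can be wired into one helper (`choose_two_group_test` is a hypothetical name, and the Shapiro-Wilk gate is a heuristic, not a substitute for inspecting the data):

```python
from scipy import stats

def choose_two_group_test(x, y, paired=False, alpha=0.05):
    """Follow the two-group branch of the decision tree above."""
    # Heuristic normality gate: both groups must pass Shapiro-Wilk
    normal = (stats.shapiro(x).pvalue > alpha
              and stats.shapiro(y).pvalue > alpha)
    if paired:
        return stats.ttest_rel(x, y) if normal else stats.wilcoxon(x, y)
    if normal:
        return stats.ttest_ind(x, y)
    return stats.mannwhitneyu(x, y, alternative="two-sided")
```

Whichever test fires, the returned object carries `.statistic` and `.pvalue`, so the caller can report results uniformly.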

Implementation Examples

T-test (two groups, normal data):

from scipy.stats import ttest_ind

group1 = data[data["group"] == "A"]["variable"]
group2 = data[data["group"] == "B"]["variable"]

t_stat, p_value = ttest_ind(group1, group2)  # pass equal_var=False for Welch's t-test if variances differ
print(f"t-test: t={t_stat:.3f}, p={p_value:.4f}")

# Effect size (Cohen's d); this simple pooled SD assumes similar group sizes
mean_diff = group1.mean() - group2.mean()
pooled_std = np.sqrt((group1.std()**2 + group2.std()**2) / 2)
cohens_d = mean_diff / pooled_std
print(f"Cohen's d: {cohens_d:.3f}")

ANOVA (multiple groups, normal data):

from scipy.stats import f_oneway

groups = [data[data["group"] == g]["variable"] for g in data["group"].unique()]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA: F={f_stat:.3f}, p={p_value:.4f}")

# Effect size (eta-squared) from the F statistic
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)
eta_sq = (f_stat * df_between) / (f_stat * df_between + df_within)
print(f"Eta-squared: {eta_sq:.3f}")

# If significant, do post-hoc tests
if p_value < 0.05:
    from scipy.stats import tukey_hsd
    res = tukey_hsd(*groups)
    print(res)

Mann-Whitney U (two groups, non-normal):

from scipy.stats import mannwhitneyu

u_stat, p_value = mannwhitneyu(group1, group2, alternative='two-sided')
print(f"Mann-Whitney U: U={u_stat:.3f}, p={p_value:.4f}")

Kruskal-Wallis (multiple groups, non-normal):

from scipy.stats import kruskal

h_stat, p_value = kruskal(*groups)
print(f"Kruskal-Wallis: H={h_stat:.3f}, p={p_value:.4f}")

Correlation:

from scipy.stats import pearsonr, spearmanr

# Pearson (linear relationship)
r, p = pearsonr(data["var1"], data["var2"])
print(f"Pearson r={r:.3f}, p={p:.4f}")

# Spearman (monotonic relationship)
rho, p = spearmanr(data["var1"], data["var2"])
print(f"Spearman ρ={rho:.3f}, p={p:.4f}")

Multiple Testing Correction

When: Testing multiple hypotheses (e.g., 100 metabolites)

Problem: at a 5% false-positive rate per test, 100 tests of true null hypotheses produce about 5 false positives by chance alone

Solutions:

from statsmodels.stats.multitest import multipletests

# Run many tests
p_values = [test_function(var) for var in variables]

# Bonferroni correction (conservative)
rejected, p_corrected, _, _ = multipletests(p_values, method='bonferroni')

# False Discovery Rate (less conservative, recommended)
rejected, p_corrected, _, _ = multipletests(p_values, method='fdr_bh')

print(f"Significant after FDR correction: {sum(rejected)}")

When to use:

  • Bonferroni: Small number of tests (<20), need strict control
  • FDR (Benjamini-Hochberg): Large number of tests (>20), exploratory analysis

Dimensionality Reduction

Principal Component Analysis (PCA)

When: High-dimensional data, want to visualize structure

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize features (important!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(data[feature_columns])

# Fit PCA
pca = PCA(n_components=2)
pc_scores = pca.fit_transform(X_scaled)

# Variance explained
print(f"Variance explained: {pca.explained_variance_ratio_}")

# Plot
plt.figure(figsize=(8, 6))
for group in data["group"].unique():
    mask = data["group"] == group
    plt.scatter(pc_scores[mask, 0], pc_scores[mask, 1], label=group, alpha=0.6)
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%})")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%})")
plt.legend()
plt.title("PCA")
plt.savefig("pca.png")

Interpretation:

  • Separation by group → systematic differences
  • Outliers → potential batch effects or errors
  • PC loadings → which features drive separation
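
The loadings in the last bullet can be read off the fitted PCA directly; a minimal sketch on toy data (in the workflow above you would reuse the fitted `pca` and your real `feature_columns`):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for the real data
rng = np.random.default_rng(42)
feature_columns = ["f1", "f2", "f3", "f4"]
X = pd.DataFrame(rng.normal(size=(50, 4)), columns=feature_columns)

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Rows = features, columns = PCs; large absolute values mark the
# features that drive each component (sign gives direction)
loadings = pd.DataFrame(pca.components_.T,
                        index=feature_columns, columns=["PC1", "PC2"])
print(loadings.reindex(loadings["PC1"].abs().sort_values(ascending=False).index))
```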

t-SNE (for visualization only)

When: PCA doesn't show clear separation

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
tsne_scores = tsne.fit_transform(X_scaled)

# Plot as for PCA above, using tsne_scores in place of pc_scores

Warning: t-SNE is non-linear and stochastic. Don't overinterpret distances.

Clustering

K-means Clustering

When: Grouping samples by similarity (unsupervised)

from sklearn.cluster import KMeans

# Determine optimal k using elbow method
inertias = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(range(2, 11), inertias, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.savefig("elbow_plot.png")

# Fit with chosen k
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(X_scaled)
data["cluster"] = clusters

Hierarchical Clustering

When: Want dendrogram or nested clusters

from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Compute linkage
Z = linkage(X_scaled, method='ward')

# Plot dendrogram
plt.figure(figsize=(10, 6))
dendrogram(Z, labels=data["sample_id"].values, leaf_font_size=8)
plt.title("Hierarchical Clustering")
plt.xlabel("Sample")
plt.ylabel("Distance")
plt.savefig("dendrogram.png")
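
To turn the dendrogram into concrete cluster assignments, cut the tree with `scipy.cluster.hierarchy.fcluster`; a minimal sketch on toy blobs (in the workflow above you would reuse `Z`):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated toy blobs standing in for real samples
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(5, 0.3, (10, 2))])

Z = linkage(X, method="ward")

# Cut the tree into a fixed number of clusters (labels start at 1)
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

`criterion="distance"` with a height `t` cuts at a fixed dendrogram distance instead of a fixed cluster count.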

Feature Selection

Univariate Feature Selection

When: Reduce features for modeling

from sklearn.feature_selection import SelectKBest, f_classif

# Select top k features
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Get feature names
feature_scores = pd.DataFrame({
    "feature": feature_columns,
    "score": selector.scores_,
    "p_value": selector.pvalues_
}).sort_values("score", ascending=False)

print(feature_scores.head(10))

Recursive Feature Elimination

When: Want model-based feature selection

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # raise max_iter to avoid convergence warnings
rfe = RFE(estimator=model, n_features_to_select=10)
rfe.fit(X_scaled, y)

selected_features = [f for f, selected in zip(feature_columns, rfe.support_) if selected]
print(f"Selected features: {selected_features}")

Power Analysis

When: Planning analysis or interpreting negative results

from statsmodels.stats.power import ttest_power

# For t-test
effect_size = 0.5  # Cohen's d
alpha = 0.05
n_per_group = 20

power = ttest_power(effect_size, n_per_group, alpha)
print(f"Power: {power:.3f}")

# If power < 0.8, you might miss true effects
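
The same statsmodels module can answer the planning question in reverse, solving for the per-group sample size needed to hit a target power:

```python
from statsmodels.stats.power import TTestIndPower

# Per-group n needed to detect Cohen's d = 0.5 with 80% power
n_needed = TTestIndPower().solve_power(effect_size=0.5,
                                       power=0.8, alpha=0.05)
print(f"n per group: {n_needed:.1f}")
```

For d = 0.5 this comes out to roughly 64 per group, a useful reality check before committing to an experiment.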

Confounders and Covariates

Check for Confounders

Always check:

  • Age, sex, batch, technical factors

# Check if confounder is associated with both exposure and outcome
from scipy.stats import chi2_contingency, ttest_ind

# Confounder vs exposure
table1 = pd.crosstab(data["sex"], data["group"])
chi2, p1, _, _ = chi2_contingency(table1)
print(f"Sex vs Group: p={p1:.4f}")

# Confounder vs outcome
t, p2 = ttest_ind(data[data["sex"]=="M"]["outcome"],
                  data[data["sex"]=="F"]["outcome"])
print(f"Sex vs Outcome: p={p2:.4f}")

# If both p < 0.05, sex is a confounder

Adjust for Covariates

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# ANCOVA (adjusting for continuous covariate)
model = ols('outcome ~ C(group) + age + sex', data=data).fit()
print(anova_lm(model))

Common Mistakes

Not checking assumptions

  • Always verify normality, homogeneity of variance
  • Use diagnostic plots

P-hacking

  • Don't run test after test until one comes out significant
  • Pre-specify analysis plan

Ignoring multiple testing

  • Correct p-values when testing many hypotheses
  • Use FDR for large-scale screens

Overinterpreting p-values

  • p=0.051 is not "failed", p=0.049 is not "proof"
  • Report effect sizes and confidence intervals

Using wrong test

  • Non-normal data with t-test → use non-parametric
  • Paired data with independent test → use paired test

Treating correlation as causation

  • Always consider confounders
  • Use mechanistic reasoning

Reporting Statistical Results

Good reporting includes:

  1. Test used: "Two-sample t-test"
  2. Test statistic: "t = 2.45"
  3. P-value: "p = 0.028"
  4. Effect size: "Cohen's d = 0.82 (large effect)"
  5. Sample size: "n₁ = 15, n₂ = 15"
  6. Confidence interval: "95% CI: [0.5, 3.2]"

Example:

Group A (n=15, mean±SD: 12.3±2.1) had significantly higher levels
than Group B (n=15, 9.1±1.8; t(28)=2.45, p=0.028, d=0.82, 95% CI: [0.5, 3.2]).
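
All of the reported quantities can be computed in one pass; a minimal sketch on simulated groups (the numbers will not reproduce the example above exactly):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(12.3, 2.1, 15)
b = rng.normal(9.1, 1.8, 15)

t, p = stats.ttest_ind(a, b)

# Cohen's d with the pooled SD
pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1)
                  + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
d = (a.mean() - b.mean()) / pooled

# 95% CI for the mean difference (equal-variance t interval)
diff = a.mean() - b.mean()
se = pooled * np.sqrt(1 / len(a) + 1 / len(b))
ci = stats.t.interval(0.95, df=len(a) + len(b) - 2, loc=diff, scale=se)

print(f"t({len(a) + len(b) - 2}) = {t:.2f}, p = {p:.4f}, d = {d:.2f}, "
      f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
```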

Key Principle

Statistics are tools for inference, not magic.

Understand what each test assumes and when it's appropriate. Always combine statistical significance with domain knowledge and effect sizes.