Claude Code Plugins

Community-maintained marketplace


ai-data-analyst

@GrupoUS/gpus

Perform comprehensive data analysis, statistical modeling, and data visualization by writing and executing self-contained Python scripts. Use when you need to analyze datasets, perform statistical tests, create visualizations, or build predictive models with reproducible, code-based workflows.

Install Skill

1. Download skill.

2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section.

3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file.

Note: Please review the skill's instructions and verify them before using it.

SKILL.md

name: ai-data-analyst
description: Perform comprehensive data analysis, statistical modeling, and data visualization by writing and executing self-contained Python scripts. Use when you need to analyze datasets, perform statistical tests, create visualizations, or build predictive models with reproducible, code-based workflows.

Skill: AI data analyst

Purpose

Perform comprehensive data analysis, statistical modeling, and data visualization by writing and executing self-contained Python scripts. Generate publication-quality charts, statistical reports, and actionable insights from data files or databases.

When to use this skill

  • You need to analyze datasets to understand patterns, trends, or relationships.
  • You want to perform statistical tests or build predictive models.
  • You need data visualizations (charts, graphs, dashboards) to communicate findings.
  • You're doing exploratory data analysis (EDA) to understand data structure and quality.
  • You need to clean, transform, or merge datasets for analysis.
  • You want reproducible analysis with documented methodology and code.

Key capabilities

Unlike point-solution data analysis tools, this skill offers:

  • Full Python ecosystem: Access to pandas, numpy, scikit-learn, statsmodels, matplotlib, seaborn, plotly, and more.
  • Runs locally: Your data stays on your machine; no uploads to third-party services.
  • Reproducible: All analysis is code-based and version controllable.
  • Customizable: Extend with any Python library or custom analysis logic.
  • Publication-quality output: Generate professional charts and reports.
  • Statistical rigor: Access to comprehensive statistical and ML libraries.

Inputs

  • Data sources: CSV files, Excel files, JSON, Parquet, or database connections.
  • Analysis goals: Questions to answer or hypotheses to test.
  • Variables of interest: Specific columns, metrics, or dimensions to focus on.
  • Output preferences: Chart types, report format, statistical tests needed.
  • Context: Business domain, data dictionary, or known data quality issues.

Out of scope

  • Real-time streaming data analysis (use appropriate streaming tools).
  • Extremely large datasets requiring distributed computing (use Spark/Dask instead).
  • Production ML model deployment (use ML ops tools and infrastructure).
  • Live dashboarding (use BI tools like Tableau/Looker for operational dashboards).

Conventions and best practices

Python environment

  • Use virtual environments to isolate dependencies.
  • Install only necessary packages for the specific analysis.
  • Document all dependencies in requirements.txt or environment.yml.

Code structure

  • Write self-contained scripts that can be re-run by others.
  • Use clear variable names and add comments for complex logic.
  • Separate concerns: data loading, cleaning, analysis, visualization.
  • Save intermediate results to files when analysis is multi-stage.
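
A minimal sketch of this structure, with data loading, cleaning, analysis, and visualization kept in separate functions (the file name, column names, and the analysis itself are placeholders, not part of the skill):

from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

def load_data(path):
    """Load the raw data without modifying the source file."""
    return pd.read_csv(path)

def clean_data(df):
    """Drop duplicates and rows missing the analysis column."""
    return df.drop_duplicates().dropna(subset=['metric'])

def analyze(df):
    """Compute per-group summary statistics."""
    return df.groupby('group')['metric'].agg(['mean', 'std', 'count'])

def visualize(summary, out_path):
    """Plot group means and save the chart."""
    ax = summary['mean'].plot(kind='bar', title='Mean metric by group')
    ax.set_ylabel('metric')
    plt.tight_layout()
    plt.savefig(out_path, dpi=300)

if __name__ == '__main__':
    Path('outputs').mkdir(exist_ok=True)
    raw = load_data('data.csv')
    cleaned = clean_data(raw)
    summary = analyze(cleaned)
    summary.to_csv('outputs/summary.csv')  # save intermediate result for later stages
    visualize(summary, 'outputs/mean_by_group.png')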

Data handling

  • Never modify source data files – work on copies or in-memory dataframes.
  • Document data transformations clearly in code comments.
  • Handle missing values explicitly and document approach.
  • Validate data quality before analysis (check for nulls, outliers, duplicates).
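
A small validation pass along these lines can precede any analysis (the column name and the IQR rule are illustrative choices):

import pandas as pd

df = pd.read_csv('data.csv')  # work on an in-memory DataFrame; the source file stays untouched

# Missing values and duplicates
print(df.isnull().sum())
print(f"Duplicate rows: {df.duplicated().sum()}")

# Simple IQR-based outlier check for one numeric column
q1, q3 = df['metric'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['metric'] < q1 - 1.5 * iqr) | (df['metric'] > q3 + 1.5 * iqr)]
print(f"Potential outliers in 'metric': {len(outliers)}")

# Handle missing values explicitly and document the choice (median imputation here)
df['metric'] = df['metric'].fillna(df['metric'].median())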

Visualization best practices

  • Choose appropriate chart types for the data and question.
  • Use clear labels, titles, and legends on all charts.
  • Apply appropriate color schemes (colorblind-friendly when possible).
  • Include sample sizes and confidence intervals where relevant.
  • Save visualizations in high-resolution formats (PNG 300 DPI, SVG for vector graphics).
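
Applying these practices in matplotlib/seaborn might look like the following sketch; it assumes a DataFrame `df` with 'group' and 'metric' columns, and the `errorbar` argument requires seaborn 0.12 or later:

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style='whitegrid', palette='colorblind')  # colorblind-friendly palette

fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(data=df, x='group', y='metric', errorbar=('ci', 95), ax=ax)  # 95% CI error bars
ax.set_title(f"Mean metric by group (n={len(df)})")  # include sample size in the title
ax.set_xlabel('Group')
ax.set_ylabel('Metric (units)')
plt.tight_layout()
plt.savefig('metric_by_group.png', dpi=300)  # high-resolution raster
plt.savefig('metric_by_group.svg')           # vector version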

Statistical analysis

  • State assumptions for statistical tests clearly.
  • Check assumptions before applying tests (normality, homoscedasticity, etc.).
  • Report effect sizes, not just p-values.
  • Use appropriate corrections for multiple comparisons (see the sketch after this list).
  • Explain practical significance in addition to statistical significance.
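
For the multiple-comparisons point, statsmodels provides standard corrections; a short sketch with illustrative p-values:

from statsmodels.stats.multitest import multipletests

p_values = [0.012, 0.034, 0.21, 0.0004]  # illustrative p-values from several tests

# Benjamini-Hochberg FDR correction; method='bonferroni' is a stricter alternative
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')

for p_raw, p_adj, rej in zip(p_values, p_adjusted, reject):
    print(f"raw p={p_raw:.4f}  adjusted p={p_adj:.4f}  significant: {rej}")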

Required behavior

  1. Understand the question: Clarify what insights or decisions the analysis should support.
  2. Explore the data: Check structure, types, missing values, distributions, outliers.
  3. Clean and prepare: Handle missing data, outliers, and transformations appropriately.
  4. Analyze systematically: Apply appropriate statistical methods or ML techniques.
  5. Visualize effectively: Create clear, informative charts that answer the question.
  6. Generate insights: Translate statistical findings into actionable business insights.
  7. Document thoroughly: Explain methodology, assumptions, limitations, and conclusions.
  8. Make reproducible: Ensure others can re-run the analysis and get the same results.

Required artifacts

  • Analysis script(s): Well-documented Python code performing the analysis.
  • Visualizations: Charts saved as high-quality image files (PNG/SVG).
  • Analysis report: Markdown or text document summarizing:
    • Research question and methodology
    • Data description and quality assessment
    • Key findings with supporting statistics
    • Visualizations with interpretations
    • Limitations and caveats
    • Recommendations or next steps
  • Requirements file: requirements.txt with all dependencies.
  • Sample data (if appropriate and non-sensitive): Small sample for reproducibility.

Implementation checklist

1. Data exploration and preparation

  • Load data and inspect structure (shape, columns, types)
  • Check for missing values, duplicates, outliers
  • Generate summary statistics (mean, median, std, min, max)
  • Visualize distributions of key variables
  • Document data quality issues found

2. Data cleaning and transformation

  • Handle missing values (impute, drop, or flag)
  • Address outliers if needed (cap, transform, or document)
  • Create derived variables if needed
  • Normalize or scale variables for modeling
  • Split data if doing train/test analysis (see the preparation sketch below)
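
A minimal preparation sketch with scikit-learn; median imputation, standard scaling, and an 80/20 split are illustrative defaults, and the transformers are fit on the training split only to avoid leakage:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('data.csv')
X = df.drop(columns=['target'])
y = df['target']

# Hold out a test set before fitting any transformers
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Impute missing values and scale features; fit on the training data only
preprocess = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
X_train_prep = preprocess.fit_transform(X_train)
X_test_prep = preprocess.transform(X_test)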

3. Analysis execution

  • Choose appropriate analytical methods
  • Check statistical assumptions
  • Execute analysis with proper parameters
  • Calculate confidence intervals and effect sizes (see the sketch below)
  • Perform sensitivity analyses if appropriate
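
For the confidence-interval step, a t-based interval for a sample mean is often sufficient; a sketch using scipy that assumes the DataFrame `df` and the 'metric' column from earlier steps:

from scipy import stats

values = df['metric'].dropna()

mean = values.mean()
# 95% confidence interval for the mean, using the t distribution and the standard error
ci_low, ci_high = stats.t.interval(0.95, len(values) - 1, loc=mean, scale=stats.sem(values))
print(f"Mean = {mean:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")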

4. Visualization

  • Create exploratory visualizations
  • Generate publication-quality final charts
  • Ensure all charts have clear labels and titles
  • Use appropriate color schemes and styling
  • Save in high-resolution formats

5. Reporting

  • Write clear summary of methods used
  • Present key findings with supporting evidence
  • Explain practical significance of results
  • Document limitations and assumptions
  • Provide actionable recommendations

6. Reproducibility

  • Test that script runs from clean environment
  • Document all dependencies
  • Add comments explaining non-obvious code
  • Include instructions for running analysis

Verification

Run the following to verify the analysis:

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install dependencies
pip install -r requirements.txt

# Run analysis script
python analysis.py

# Check outputs generated
ls -lh outputs/

The skill is complete when:

  • Analysis script runs without errors from clean environment.
  • All required visualizations are generated in high quality.
  • Report clearly explains methodology, findings, and limitations.
  • Results are interpretable and actionable.
  • Code is well-documented and reproducible.

Common analysis patterns

Exploratory Data Analysis (EDA)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load and inspect data
df = pd.read_csv('data.csv')
print(df.info())
print(df.describe())

# Check for missing values
print(df.isnull().sum())

# Visualize distributions
df.hist(figsize=(12, 10), bins=30)
plt.tight_layout()
plt.savefig('distributions.png', dpi=300)

# Check correlations between numeric columns (new figure so the heatmap is not drawn onto the histogram grid)
plt.figure(figsize=(10, 8))
corr = df.select_dtypes('number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.savefig('correlations.png', dpi=300)

Time series analysis

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Load time series data
df = pd.read_csv('timeseries.csv', parse_dates=['date'])
df.set_index('date', inplace=True)

# Decompose time series
decomposition = seasonal_decompose(df['value'], model='additive', period=30)
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.savefig('decomposition.png', dpi=300)

# Calculate rolling statistics
df['rolling_mean'] = df['value'].rolling(window=7).mean()
df['rolling_std'] = df['value'].rolling(window=7).std()

# Plot with trends
plt.figure(figsize=(12, 6))
plt.plot(df['value'], label='Original')
plt.plot(df['rolling_mean'], label='7-day Moving Avg', linewidth=2)
plt.fill_between(df.index,
                 df['rolling_mean'] - df['rolling_std'],
                 df['rolling_mean'] + df['rolling_std'],
                 alpha=0.3)
plt.title('Original series with 7-day rolling mean and ±1 std band')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.savefig('trends.png', dpi=300)

Statistical hypothesis testing

from scipy import stats
import numpy as np

# Compare two groups
group_a = df[df['group'] == 'A']['metric']
group_b = df[df['group'] == 'B']['metric']

# Check normality
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)

# Choose appropriate test
if p_norm_a > 0.05 and p_norm_b > 0.05:
    # Parametric test (t-test)
    statistic, p_value = stats.ttest_ind(group_a, group_b)
    test_used = "Independent t-test"
else:
    # Non-parametric test (Mann-Whitney U)
    statistic, p_value = stats.mannwhitneyu(group_a, group_b)
    test_used = "Mann-Whitney U test"

# Calculate effect size (Cohen's d; simple pooled SD, assumes roughly equal group sizes)
pooled_std = np.sqrt((group_a.std()**2 + group_b.std()**2) / 2)
cohens_d = (group_a.mean() - group_b.mean()) / pooled_std

print(f"Test used: {test_used}")
print(f"Test statistic: {statistic:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Effect size (Cohen's d): {cohens_d:.4f}")

Predictive modeling

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Prepare data
X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.4f}")
print(f"R² Score: {r2:.4f}")

# Feature importance
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(importance['feature'][:10], importance['importance'][:10])
plt.gca().invert_yaxis()  # show the most important feature at the top
plt.xlabel('Feature Importance')
plt.title('Top 10 Most Important Features')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300)

Recommended Python libraries

Data manipulation

  • pandas: Data manipulation and analysis
  • numpy: Numerical computing
  • polars: High-performance DataFrame library (alternative to pandas)

Visualization

  • matplotlib: Foundational plotting library
  • seaborn: Statistical visualizations
  • plotly: Interactive charts
  • altair: Declarative statistical visualization

Statistical analysis

  • scipy.stats: Statistical functions and tests
  • statsmodels: Statistical modeling
  • pingouin: Statistical tests with clear output

Machine learning

  • scikit-learn: ML algorithms and tools
  • xgboost: Gradient boosting
  • lightgbm: Fast gradient boosting

Time series

  • statsmodels.tsa: Time series analysis
  • prophet: Forecasting tool
  • pmdarima: Auto ARIMA

Specialized

  • networkx: Network analysis
  • geopandas: Geospatial data analysis
  • textblob / spacy: Natural language processing

Safety and escalation

  • Data privacy: Never analyze or share data containing PII without proper authorization.
  • Statistical validity: If sample sizes are too small for reliable inference, call this out explicitly.
  • Causal claims: Avoid implying causation from correlational analysis; be explicit about limitations.
  • Model limitations: Document when models may not generalize or when predictions should not be trusted.
  • Data quality: If data quality issues could materially affect conclusions, flag this prominently.

Integration with other skills

This skill can be combined with:

  • Internal data querying: To fetch data from warehouses or databases for analysis.
  • Web app builder: To create interactive dashboards displaying analysis results.
  • Internal tools: To build analysis tools for non-technical stakeholders.