Claude Code Plugins

Community-maintained marketplace


Comprehensive CSV data analysis using Python pandas with statistics, visualizations, and insights

Install Skill

1. Download skill

2. Enable skills in Claude

Open claude.ai/settings/capabilities and find the "Skills" section

3. Upload to Claude

Click "Upload skill" and select the downloaded ZIP file

Note: Please review the skill's instructions and verify what it does before using it.

SKILL.md

name: csv-analyzer
description: Comprehensive CSV data analysis using Python pandas with statistics, visualizations, and insights
allowed-tools: Read, Bash, Write

CSV Data Analyzer

Analyze CSV files using Python pandas to provide statistics, detect patterns, and generate insights.

When to Use

  • Analyzing tabular data in CSV format
  • Generating statistical summaries
  • Detecting data quality issues
  • Creating data visualizations
  • Exploring datasets

Prerequisites

  • Python 3.8+ installed
  • pandas, numpy libraries available
  • Optional: matplotlib, seaborn for visualizations
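A quick way to confirm the environment before starting (a sketch; the package names match the prerequisites above, and pip is assumed to be available for anything missing):

python3 -c "import pandas, numpy; print(pandas.__version__, numpy.__version__)"
# If either import fails, install the libraries first, e.g.:
# pip install pandas numpy matplotlib seaborn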

Workflow

1. Start Python REPL

python3 -i
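If driving an interactive REPL is impractical in the execution environment, the same workflow can instead be written to a script (the file name below is hypothetical) and run non-interactively:

# Alternative: save the analysis steps to a script and run them in one pass
python3 analyze_csv.py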

2. Load Required Libraries

import pandas as pd
import numpy as np
import sys

# Optional for visualizations
try:
    import matplotlib.pyplot as plt
    import seaborn as sns
    HAS_PLOT = True
except ImportError:
    HAS_PLOT = False
    print("Visualization libraries not available")

3. Load CSV Data

# Load the CSV file
csv_path = "/absolute/path/to/file.csv"
df = pd.read_csv(csv_path)

# Display basic info
print(f"Loaded {len(df)} rows and {len(df.columns)} columns")
print(f"\nColumns: {', '.join(df.columns)}")

4. Initial Exploration

# First few rows
print("\n=== First 5 Rows ===")
print(df.head())

# Data types and non-null counts
print("\n=== Data Info ===")
df.info()  # info() prints its summary directly and returns None

# Basic statistics
print("\n=== Statistical Summary ===")
print(df.describe())

# Missing values
print("\n=== Missing Values ===")
missing = df.isnull().sum()
if missing.any():
    print(missing[missing > 0])
else:
    print("No missing values found")

5. Data Quality Checks

# Duplicate rows
duplicates = df.duplicated().sum()
print(f"\n=== Duplicates ===")
print(f"Found {duplicates} duplicate rows")

# Unique values per column
print("\n=== Unique Values ===")
for col in df.columns:
    unique_count = df[col].nunique()
    print(f"{col}: {unique_count} unique values")

# Data type validation
print("\n=== Data Types ===")
for col in df.columns:
    dtype = df[col].dtype
    sample = df[col].dropna().iloc[0] if len(df[col].dropna()) > 0 else "N/A"
    print(f"{col}: {dtype} (example: {sample})")

6. Statistical Analysis

# Numerical columns analysis
numeric_cols = df.select_dtypes(include=[np.number]).columns
if len(numeric_cols) > 0:
    print("\n=== Numerical Analysis ===")
    for col in numeric_cols:
        print(f"\n{col}:")
        print(f"  Mean: {df[col].mean():.2f}")
        print(f"  Median: {df[col].median():.2f}")
        print(f"  Std Dev: {df[col].std():.2f}")
        print(f"  Min: {df[col].min():.2f}")
        print(f"  Max: {df[col].max():.2f}")

# Categorical columns analysis
categorical_cols = df.select_dtypes(include=['object']).columns
if len(categorical_cols) > 0:
    print("\n=== Categorical Analysis ===")
    for col in categorical_cols:
        print(f"\n{col}:")
        value_counts = df[col].value_counts().head(5)
        print(value_counts)

7. Correlation Analysis (for numerical data)

if len(numeric_cols) > 1:
    print("\n=== Correlation Matrix ===")
    correlation = df[numeric_cols].corr()
    print(correlation)

    # Find strong correlations
    print("\n=== Strong Correlations (>0.7 or <-0.7) ===")
    for i in range(len(correlation.columns)):
        for j in range(i+1, len(correlation.columns)):
            corr_value = correlation.iloc[i, j]
            if abs(corr_value) > 0.7:
                col1 = correlation.columns[i]
                col2 = correlation.columns[j]
                print(f"{col1} <-> {col2}: {corr_value:.2f}")

8. Group-By Analysis (if applicable)

# Example: group by a categorical column and aggregate a numerical one
if len(categorical_cols) > 0 and len(numeric_cols) > 0:
    group_col = categorical_cols[0]
    agg_col = numeric_cols[0]

    print(f"\n=== Grouped Analysis: {group_col} by {agg_col} ===")
    grouped = df.groupby(group_col)[agg_col].agg(['count', 'mean', 'sum'])
    print(grouped)
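To rank the groups, sort the aggregated result by one of its columns, for example:

# Rank groups by total value, largest first
print(grouped.sort_values('sum', ascending=False).head(10))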

9. Generate Report

Create a summary report with findings:

# Build report
report = f"""
# CSV Analysis Report: {csv_path.split('/')[-1]}

## Dataset Overview
- Rows: {len(df):,}
- Columns: {len(df.columns)}
- Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB

## Data Quality
- Missing Values: {df.isnull().sum().sum()}
- Duplicate Rows: {duplicates}

## Numerical Columns ({len(numeric_cols)})
{', '.join(numeric_cols) if len(numeric_cols) > 0 else 'None'}

## Categorical Columns ({len(categorical_cols)})
{', '.join(categorical_cols) if len(categorical_cols) > 0 else 'None'}

## Key Findings
[Add insights based on analysis]
"""

print(report)
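Because the skill also allows the Write tool, the report can be persisted for later reference (the file name is illustrative):

# Save the report alongside the other analysis outputs
with open('analysis_report.md', 'w') as f:
    f.write(report)
print("Saved analysis_report.md")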

10. Optional: Create Visualizations

if HAS_PLOT and len(numeric_cols) > 0:
    # Histograms for numerical columns
    df[numeric_cols].hist(figsize=(12, 8))
    plt.tight_layout()
    plt.savefig('histograms.png')
    print("Saved histograms.png")

    # Correlation heatmap
    if len(numeric_cols) > 1:
        plt.figure(figsize=(10, 8))
        sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm')
        plt.title('Correlation Heatmap')
        plt.tight_layout()
        plt.savefig('correlation_heatmap.png')
        print("Saved correlation_heatmap.png")

Common Patterns

Find Outliers

# Using IQR method
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    if len(outliers) > 0:
        print(f"\n{col}: {len(outliers)} outliers detected")

Filter and Transform

# Filter rows
filtered = df[df['column_name'] > 100]

# Create new calculated column
df['new_column'] = df['col1'] + df['col2']

# Convert data types
df['date_column'] = pd.to_datetime(df['date_column'])

Export Results

# Save filtered data
filtered.to_csv('filtered_results.csv', index=False)

# Save summary statistics
summary = df.describe()
summary.to_csv('summary_stats.csv')

Best Practices

DO:

  • Always check for missing values
  • Validate data types
  • Look for duplicates
  • Check data ranges (min/max)
  • Document assumptions

DON'T:

  • Skip data quality checks
  • Ignore missing data
  • Assume data types are correct
  • Forget to save important findings
  • Modify original data without backup (see the sketch after this list)
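A minimal way to honor the last point is to keep an untouched copy before any transformation (a sketch; the variable name is illustrative):

# Keep a pristine copy so transformations never touch the loaded data
df_raw = df.copy()
# ...transform df freely; df_raw stays unchanged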

Example Analysis Session

# Load data
df = pd.read_csv('/data/sales.csv')

# Quick overview
print(f"Shape: {df.shape}")
print(df.head())

# Check for issues
print(f"Missing: {df.isnull().sum().sum()}")
print(f"Duplicates: {df.duplicated().sum()}")

# Analyze sales by region
by_region = df.groupby('region')['sales'].agg([
    ('total_sales', 'sum'),
    ('avg_sales', 'mean'),
    ('num_transactions', 'count')
])
print(by_region)

# Find top performers
top_10 = df.nlargest(10, 'sales')[['product', 'sales', 'region']]
print(top_10)

Troubleshooting

Large Files:

# Read in chunks (filter or aggregate each chunk if the full file will not fit in memory)
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
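Memory can also be reduced at read time by loading only the needed columns and using narrower dtypes (the column names and dtypes below are placeholders):

# Load a subset of columns with smaller numeric types
df = pd.read_csv(
    'large_file.csv',
    usecols=['col1', 'col2'],
    dtype={'col1': 'float32'},
)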

Encoding Issues:

df = pd.read_csv('file.csv', encoding='utf-8')
# or
df = pd.read_csv('file.csv', encoding='latin-1')

Date Parsing:

df = pd.read_csv('file.csv', parse_dates=['date_column'])
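If some rows contain malformed dates, a more forgiving approach is to parse after loading and coerce failures to NaT (the column name follows the example above):

# Coerce unparseable values to NaT instead of raising an error
df['date_column'] = pd.to_datetime(df['date_column'], errors='coerce')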

Cleanup

# Exit Python REPL when done
exit()

Related Skills

  • data-visualization: Create advanced visualizations
  • statistical-modeling: Build predictive models
  • data-cleaning: Advanced data cleaning workflows