| name | data-analysis |
| description | Analyze datasets using Python with pandas, numpy, and visualization libraries. Generates statistical summaries, identifies patterns, creates charts, and provides insights. Use when analyzing CSV/Excel files, exploring data, creating visualizations, or when users mention data analysis, statistics, charts, or datasets. |
| license | MIT |
| compatibility | Requires Python 3.8+ with pandas, numpy, matplotlib, and seaborn |
Data Analysis Skill
When to Use This Skill
Use this skill when:
- Analyzing datasets (CSV, Excel, JSON)
- Performing statistical analysis
- Creating data visualizations
- Identifying trends and patterns
- Data cleaning and preprocessing
- Users mention "analyze data", "statistics", "charts", "trends", or "insights"
Analysis Process
1. Data Loading & Initial Exploration
Load the data:
import pandas as pd
import numpy as np
# CSV files
df = pd.read_csv('data.csv')
# Excel files
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# JSON files
df = pd.read_json('data.json')
# From database
import sqlalchemy
engine = sqlalchemy.create_engine('postgresql://user:pass@localhost/db')
df = pd.read_sql('SELECT * FROM table', engine)
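If the file type is not known in advance, a small helper can choose the reader from the file extension. A minimal sketch (the load_dataset helper below is illustrative, not part of the skill):
from pathlib import Path
import pandas as pd

def load_dataset(path):
    """Load a dataset by file extension (CSV, Excel, or JSON)."""
    suffix = Path(path).suffix.lower()
    if suffix == '.csv':
        return pd.read_csv(path)
    if suffix in ('.xlsx', '.xls'):
        return pd.read_excel(path)
    if suffix == '.json':
        return pd.read_json(path)
    raise ValueError(f"Unsupported file type: {suffix}")

df = load_dataset('data.csv')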
Initial exploration:
# Basic information
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
# First look at data
print("\nFirst 5 rows:")
print(df.head())
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())
# Basic statistics
print("\nDescriptive statistics:")
print(df.describe())
2. Data Cleaning
Handle missing values:
# Check missing data patterns
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print("Missing data percentage:")
print(missing_pct[missing_pct > 0])
# Drop columns with too many missing values
df = df.drop(columns=missing_pct[missing_pct > 50].index)
# Fill missing values
df['numeric_column'] = df['numeric_column'].fillna(df['numeric_column'].median())
df['categorical_column'] = df['categorical_column'].fillna(df['categorical_column'].mode()[0])
# Or drop rows with missing values
df = df.dropna()
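When many columns have gaps, a per-dtype helper keeps imputation repeatable. A minimal sketch (fill_missing is a hypothetical helper: numeric columns take the median, everything else takes the mode):
def fill_missing(df):
    """Fill numeric columns with the median and other columns with the mode."""
    df = df.copy()
    for col in df.columns[df.isnull().any()]:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    return df

df = fill_missing(df)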
Handle duplicates:
# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
# Remove duplicates
df = df.drop_duplicates()
# Drop duplicates on specific columns, keeping the first occurrence
df = df.drop_duplicates(subset=['id'], keep='first')
Data type conversions:
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
# Convert to numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce')
# Convert to category (saves memory)
df['category'] = df['category'].astype('category')
3. Statistical Analysis
See references/STATISTICS.md for detailed formulas.
Descriptive statistics:
# Central tendency
mean = df['column'].mean()
median = df['column'].median()
mode = df['column'].mode()[0]
# Dispersion
std = df['column'].std()
variance = df['column'].var()
range_val = df['column'].max() - df['column'].min()
iqr = df['column'].quantile(0.75) - df['column'].quantile(0.25)
# Distribution
skewness = df['column'].skew()
kurtosis = df['column'].kurtosis()
print(f"""
Statistics for 'column':
Mean: {mean:.2f}
Median: {median:.2f}
Std Dev: {std:.2f}
Range: {range_val:.2f}
IQR: {iqr:.2f}
Skewness: {skewness:.2f}
Kurtosis: {kurtosis:.2f}
""")
Correlation analysis:
# Correlation matrix (numeric columns only)
numeric_columns = df.select_dtypes(include='number').columns
correlation = df[numeric_columns].corr()
print(correlation)
# Find strong correlations
strong_corr = correlation[(correlation > 0.7) | (correlation < -0.7)]
strong_corr = strong_corr[strong_corr != 1.0].stack()
print("\nStrong correlations:")
print(strong_corr)
Group analysis:
# Group by categorical variable
grouped = df.groupby('category').agg({
    'sales': ['sum', 'mean', 'count'],
    'profit': ['sum', 'mean'],
    'quantity': 'sum'
})
print(grouped)
# Multiple grouping
df.groupby(['region', 'category'])['sales'].sum().unstack()
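The same region-by-category summary can also be expressed with pivot_table, which adds row and column totals via margins (column names are the placeholders used above):
# Pivot table equivalent of groupby + unstack, with grand totals
pivot = pd.pivot_table(df, values='sales', index='region', columns='category',
                       aggfunc='sum', margins=True)
print(pivot)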
4. Data Visualization
Distribution plots:
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
# Histogram
plt.figure()
df['column'].hist(bins=30, edgecolor='black')
plt.title('Distribution of Column')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.savefig('histogram.png', dpi=300, bbox_inches='tight')
plt.close()
# Box plot
plt.figure()
df.boxplot(column='value', by='category')
plt.title('Value by Category')
plt.suptitle('') # Remove default title
plt.savefig('boxplot.png', dpi=300, bbox_inches='tight')
plt.close()
# Violin plot
plt.figure()
sns.violinplot(data=df, x='category', y='value')
plt.title('Value Distribution by Category')
plt.savefig('violin.png', dpi=300, bbox_inches='tight')
plt.close()
Relationship plots:
# Scatter plot
plt.figure()
plt.scatter(df['x'], df['y'], alpha=0.5)
plt.xlabel('X Variable')
plt.ylabel('Y Variable')
plt.title('X vs Y')
plt.savefig('scatter.png', dpi=300, bbox_inches='tight')
plt.close()
# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.savefig('correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.close()
# Pair plot for multiple variables
sns.pairplot(df[['var1', 'var2', 'var3', 'category']], hue='category')
plt.savefig('pairplot.png', dpi=300, bbox_inches='tight')
plt.close()
Time series plots:
# Line plot
plt.figure()
df.set_index('date')['value'].plot()
plt.title('Value Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.savefig('timeseries.png', dpi=300, bbox_inches='tight')
plt.close()
# Multiple time series
df.pivot(index='date', columns='category', values='value').plot()
plt.title('Values by Category Over Time')
plt.legend(title='Category')
plt.savefig('timeseries_multi.png', dpi=300, bbox_inches='tight')
plt.close()
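If observations arrive at irregular times, resampling to a regular frequency before plotting usually gives a cleaner line. A short sketch using the same placeholder 'date' and 'value' columns:
# Aggregate to monthly totals, then plot the regular series
monthly = df.set_index('date')['value'].resample('M').sum()
plt.figure()
monthly.plot()
plt.title('Monthly Total Value')
plt.savefig('timeseries_monthly.png', dpi=300, bbox_inches='tight')
plt.close()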
Categorical plots:
# Bar plot
category_counts = df['category'].value_counts()
plt.figure()
category_counts.plot(kind='bar')
plt.title('Count by Category')
plt.xlabel('Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.savefig('barplot.png', dpi=300, bbox_inches='tight')
plt.close()
# Stacked bar plot
df.groupby(['region', 'category'])['sales'].sum().unstack().plot(kind='bar', stacked=True)
plt.title('Sales by Region and Category')
plt.savefig('stacked_bar.png', dpi=300, bbox_inches='tight')
plt.close()
5. Advanced Analysis
Trend detection:
from scipy import stats
# Linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(df['x'], df['y'])
print(f"Trend: slope={slope:.4f}, R²={r_value**2:.4f}, p={p_value:.4f}")
# Moving average
df['ma_7'] = df['value'].rolling(window=7).mean()
df['ma_30'] = df['value'].rolling(window=30).mean()
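Note that stats.linregress expects numeric inputs, so when the trend is over time, convert the date column to a numeric axis first. A minimal sketch (assumes 'date' is already a datetime column and 'value' is numeric):
from scipy import stats
# Days elapsed since the first observation serve as the numeric x-axis
x = (df['date'] - df['date'].min()).dt.days
slope, intercept, r_value, p_value, std_err = stats.linregress(x, df['value'])
print(f"Trend per day: {slope:.4f} (R²={r_value**2:.4f}, p={p_value:.4f})")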
Outlier detection:
# Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['column']))
outliers = df[z_scores > 3]
print(f"Outliers detected: {len(outliers)}")
# IQR method
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column'] < Q1 - 1.5*IQR) | (df['column'] > Q3 + 1.5*IQR)]
print(f"Outliers by IQR: {len(outliers)}")
Statistical tests:
from scipy import stats
# T-test (compare two groups)
group1 = df[df['category'] == 'A']['value']
group2 = df[df['category'] == 'B']['value']
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"T-test: t={t_stat:.4f}, p={p_value:.4f}")
# ANOVA (compare multiple groups)
groups = [df[df['category'] == cat]['value'] for cat in df['category'].unique()]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F={f_stat:.4f}, p={p_value:.4f}")
# Chi-square test (categorical variables)
contingency_table = pd.crosstab(df['category1'], df['category2'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"Chi-square: χ²={chi2:.4f}, p={p_value:.4f}")
6. Generate Report
Use the analysis script:
python scripts/analyze.py --file data.csv --output report.html
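The exact contents of scripts/analyze.py depend on the skill installation; as a rough, illustrative sketch it could look something like this (only the --file and --output flags mirror the command above, everything else is an assumption):
# scripts/analyze.py (illustrative sketch only)
import argparse
import pandas as pd

def main():
    parser = argparse.ArgumentParser(description='Generate a basic analysis report')
    parser.add_argument('--file', required=True, help='Input CSV file')
    parser.add_argument('--output', default='report.html', help='Output HTML report')
    args = parser.parse_args()

    df = pd.read_csv(args.file)
    sections = [
        f"<h1>Analysis of {args.file}</h1>",
        f"<p>{len(df):,} rows, {len(df.columns)} columns</p>",
        df.describe().to_html(),
        df.isnull().sum().to_frame('missing').to_html(),
    ]
    with open(args.output, 'w') as f:
        f.write('\n'.join(sections))

if __name__ == '__main__':
    main()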
Create summary:
summary = f"""
# Data Analysis Report
## Dataset Overview
- Rows: {len(df):,}
- Columns: {len(df.columns)}
- Date range: {df['date'].min()} to {df['date'].max()}
## Key Findings
### 1. [Finding Title]
[Description of finding]
### 2. [Finding Title]
[Description of finding]
## Statistical Summary
{df.describe().to_markdown()}
## Recommendations
1. [Recommendation based on analysis]
2. [Recommendation based on analysis]
"""
with open('analysis_report.md', 'w') as f:
    f.write(summary)
Best Practices
Memory Optimization
# Read large files in chunks
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    processed_chunk = process(chunk)  # apply your own processing function
    chunks.append(processed_chunk)
df = pd.concat(chunks)
# Optimize data types
df['int_col'] = df['int_col'].astype('int32') # Instead of int64
df['float_col'] = df['float_col'].astype('float32') # Instead of float64
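Memory can also be limited at read time by loading only the needed columns and declaring compact dtypes up front (column names here are placeholders):
# Load a subset of columns with explicit, memory-friendly dtypes
df = pd.read_csv(
    'large_file.csv',
    usecols=['id', 'date', 'category', 'sales'],
    dtype={'id': 'int32', 'category': 'category', 'sales': 'float32'},
    parse_dates=['date'],
)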
Performance Tips
# Use vectorized operations instead of loops
# Bad
result = []
for value in df['column']:
    result.append(value * 2)
# Good
result = df['column'] * 2
# Use .query() for filtering
df_filtered = df.query('age > 30 and city == "NYC"')
# Use .loc for setting values
df.loc[df['age'] > 30, 'category'] = 'senior'
Reproducibility
# Set random seed
np.random.seed(42)
# Save processed data
df.to_csv('processed_data.csv', index=False)
df.to_parquet('processed_data.parquet') # Better for large datasets
# Export analysis
import pickle
with open('analysis_results.pkl', 'wb') as f:
    pickle.dump({'summary_stats': summary_stats, 'model': model}, f)  # your computed objects
Common Analysis Types
Sales Analysis
# Total sales by period
sales_by_month = df.groupby(df['date'].dt.to_period('M'))['sales'].sum()
# Top products
top_products = df.groupby('product')['sales'].sum().sort_values(ascending=False).head(10)
# Growth rate
df['growth_rate'] = df['sales'].pct_change() * 100
Customer Analysis
# Customer segmentation
df['segment'] = pd.cut(df['total_purchases'],
                       bins=[0, 100, 500, float('inf')],
                       labels=['Low', 'Medium', 'High'])
# Retention analysis
cohort = df.groupby(['cohort_month', 'purchase_month']).size()
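A fuller retention view pivots cohort counts into a matrix and normalizes each cohort by its size. This is a sketch that assumes each row is a purchase with 'customer_id' and 'date' columns:
# Each customer's cohort is the month of their first purchase
df['cohort_month'] = df.groupby('customer_id')['date'].transform('min').dt.to_period('M')
df['purchase_month'] = df['date'].dt.to_period('M')
# Unique active customers per cohort and purchase month
cohort_counts = (df.groupby(['cohort_month', 'purchase_month'])['customer_id']
                   .nunique()
                   .unstack())
cohort_sizes = df.groupby('cohort_month')['customer_id'].nunique()
retention = cohort_counts.divide(cohort_sizes, axis=0)
print(retention.round(2))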
Performance Analysis
# Year-over-year comparison
df['year'] = df['date'].dt.year
yoy = df.groupby('year')['metric'].sum()
yoy_growth = yoy.pct_change() * 100
Error Handling
import sys

try:
    df = pd.read_csv('data.csv')
except FileNotFoundError:
    print("Error: File not found")
    sys.exit(1)
except pd.errors.EmptyDataError:
    print("Error: File is empty")
    sys.exit(1)
except Exception as e:
    print(f"Error loading data: {e}")
    sys.exit(1)
# Validate data
assert not df.empty, "DataFrame is empty"
assert 'required_column' in df.columns, "Missing required column"
assert df['date'].dtype == 'datetime64[ns]', "Date column not in datetime format"
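Asserts work for quick checks, but they are skipped under python -O and give terse messages; a small validation helper with explicit errors is a more robust pattern (a sketch with placeholder column names):
def validate_dataframe(df, required_columns, datetime_columns=()):
    """Raise ValueError with a clear message if basic expectations are not met."""
    if df.empty:
        raise ValueError("DataFrame is empty")
    missing = [c for c in required_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    for col in datetime_columns:
        if not pd.api.types.is_datetime64_any_dtype(df[col]):
            raise ValueError(f"Column '{col}' is not datetime")

validate_dataframe(df, ['required_column', 'date'], datetime_columns=['date'])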
Output Guidelines
Always provide:
- Summary: High-level findings in plain language
- Statistics: Key numbers and metrics
- Visualizations: Charts that support findings
- Insights: Actionable conclusions
- Recommendations: Next steps based on analysis