| name | csv-analyzer |
| description | Comprehensive CSV data analysis using Python pandas with statistics, visualizations, and insights |
| allowed-tools | Read, Bash, Write |
CSV Data Analyzer
Analyze CSV files using Python pandas to provide statistics, detect patterns, and generate insights.
When to Use
- Analyzing tabular data in CSV format
- Generating statistical summaries
- Detecting data quality issues
- Creating data visualizations
- Exploring datasets
Prerequisites
- Python 3.8+ installed
- pandas, numpy libraries available
- Optional: matplotlib, seaborn for visualizations
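To confirm the environment before starting, a quick check along these lines can help (a minimal sketch; it only assumes the standard PyPI package names):
import sys

# Confirm the interpreter version and required libraries up front
assert sys.version_info >= (3, 8), f"Python 3.8+ required, found {sys.version}"

for pkg in ("pandas", "numpy"):
    try:
        __import__(pkg)
    except ImportError:
        print(f"Missing required package: {pkg} (install with: pip install {pkg})")

# Optional plotting libraries; missing ones only disable the visualization step
for pkg in ("matplotlib", "seaborn"):
    try:
        __import__(pkg)
    except ImportError:
        print(f"Optional package not available: {pkg}")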
Workflow
1. Start Python REPL
python3 -i
2. Load Required Libraries
import pandas as pd
import numpy as np
import sys
# Optional for visualizations
try:
    import matplotlib.pyplot as plt
    import seaborn as sns
    HAS_PLOT = True
except ImportError:
    HAS_PLOT = False
    print("Visualization libraries not available")
3. Load CSV Data
# Load the CSV file
csv_path = "/absolute/path/to/file.csv"
df = pd.read_csv(csv_path)
# Display basic info
print(f"Loaded {len(df)} rows and {len(df.columns)} columns")
print(f"\nColumns: {', '.join(df.columns)}")
4. Initial Exploration
# First few rows
print("\n=== First 5 Rows ===")
print(df.head())
# Data types and non-null counts
print("\n=== Data Info ===")
df.info()  # info() prints directly and returns None, so no print() wrapper is needed
# Basic statistics
print("\n=== Statistical Summary ===")
print(df.describe())
# Missing values
print("\n=== Missing Values ===")
missing = df.isnull().sum()
if missing.any():
    print(missing[missing > 0])
else:
    print("No missing values found")
5. Data Quality Checks
# Duplicate rows
duplicates = df.duplicated().sum()
print(f"\n=== Duplicates ===")
print(f"Found {duplicates} duplicate rows")
# Unique values per column
print("\n=== Unique Values ===")
for col in df.columns:
    unique_count = df[col].nunique()
    print(f"{col}: {unique_count} unique values")
# Data type validation
print("\n=== Data Types ===")
for col in df.columns:
    dtype = df[col].dtype
    sample = df[col].dropna().iloc[0] if len(df[col].dropna()) > 0 else "N/A"
    print(f"{col}: {dtype} (example: {sample})")
6. Statistical Analysis
# Numerical columns analysis
numeric_cols = df.select_dtypes(include=[np.number]).columns
if len(numeric_cols) > 0:
    print("\n=== Numerical Analysis ===")
    for col in numeric_cols:
        print(f"\n{col}:")
        print(f"  Mean: {df[col].mean():.2f}")
        print(f"  Median: {df[col].median():.2f}")
        print(f"  Std Dev: {df[col].std():.2f}")
        print(f"  Min: {df[col].min():.2f}")
        print(f"  Max: {df[col].max():.2f}")
# Categorical columns analysis
categorical_cols = df.select_dtypes(include=['object']).columns
if len(categorical_cols) > 0:
    print("\n=== Categorical Analysis ===")
    for col in categorical_cols:
        print(f"\n{col}:")
        value_counts = df[col].value_counts().head(5)
        print(value_counts)
7. Correlation Analysis (for numerical data)
if len(numeric_cols) > 1:
    print("\n=== Correlation Matrix ===")
    correlation = df[numeric_cols].corr()
    print(correlation)

    # Find strong correlations
    print("\n=== Strong Correlations (>0.7 or <-0.7) ===")
    for i in range(len(correlation.columns)):
        for j in range(i + 1, len(correlation.columns)):
            corr_value = correlation.iloc[i, j]
            if abs(corr_value) > 0.7:
                col1 = correlation.columns[i]
                col2 = correlation.columns[j]
                print(f"{col1} <-> {col2}: {corr_value:.2f}")
8. Group-By Analysis (if applicable)
# Example: Group by categorical column and aggregate numerical
if len(categorical_cols) > 0 and len(numeric_cols) > 0:
    group_col = categorical_cols[0]
    agg_col = numeric_cols[0]
    print(f"\n=== Grouped Analysis: {agg_col} by {group_col} ===")
    grouped = df.groupby(group_col)[agg_col].agg(['count', 'mean', 'sum'])
    print(grouped)
9. Generate Report
Create a summary report with findings:
# Build report
report = f"""
# CSV Analysis Report: {csv_path.split('/')[-1]}
## Dataset Overview
- Rows: {len(df):,}
- Columns: {len(df.columns)}
- Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB
## Data Quality
- Missing Values: {df.isnull().sum().sum()}
- Duplicate Rows: {duplicates}
## Numerical Columns ({len(numeric_cols)})
{', '.join(numeric_cols) if len(numeric_cols) > 0 else 'None'}
## Categorical Columns ({len(categorical_cols)})
{', '.join(categorical_cols) if len(categorical_cols) > 0 else 'None'}
## Key Findings
[Add insights based on analysis]
"""
print(report)
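If the report should be kept rather than only printed, it can be written out with plain file I/O; the filename below is just an example:
# Write the report to disk (example filename, adjust as needed)
report_path = "analysis_report.md"
with open(report_path, "w", encoding="utf-8") as f:
    f.write(report)
print(f"Saved {report_path}")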
10. Optional: Create Visualizations
if HAS_PLOT and len(numeric_cols) > 0:
    # Histograms for numerical columns
    df[numeric_cols].hist(figsize=(12, 8))
    plt.tight_layout()
    plt.savefig('histograms.png')
    print("Saved histograms.png")

    # Correlation heatmap
    if len(numeric_cols) > 1:
        plt.figure(figsize=(10, 8))
        sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm')
        plt.title('Correlation Heatmap')
        plt.tight_layout()
        plt.savefig('correlation_heatmap.png')
        print("Saved correlation_heatmap.png")
Common Patterns
Find Outliers
# Using IQR method
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    if len(outliers) > 0:
        print(f"\n{col}: {len(outliers)} outliers detected")
Filter and Transform
# Filter rows
filtered = df[df['column_name'] > 100]
# Create new calculated column
df['new_column'] = df['col1'] + df['col2']
# Convert data types
df['date_column'] = pd.to_datetime(df['date_column'])
Export Results
# Save filtered data
filtered.to_csv('filtered_results.csv', index=False)
# Save summary statistics
summary = df.describe()
summary.to_csv('summary_stats.csv')
Best Practices
✅ DO (see the consolidated check sketch below):
- Always check for missing values
- Validate data types
- Look for duplicates
- Check data ranges (min/max)
- Document assumptions
❌ DON'T:
- Skip data quality checks
- Ignore missing data
- Assume data types are correct
- Forget to save important findings
- Modify original data without backup
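A minimal sketch that bundles the DO checks into one pass, assuming df is already loaded (output format is illustrative):
import pandas as pd

def quality_checks(df):
    # Missing values
    missing = df.isnull().sum()
    print("Missing values per column:")
    print(missing[missing > 0] if missing.any() else "none")

    # Duplicate rows
    print(f"Duplicate rows: {df.duplicated().sum()}")

    # Data types, value ranges, and cardinality
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            print(f"{col}: {df[col].dtype}, range [{df[col].min()}, {df[col].max()}]")
        else:
            print(f"{col}: {df[col].dtype}, {df[col].nunique()} unique values")

# Work on a copy so the original data stays untouched
quality_checks(df.copy())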
Example Analysis Session
# Load data
df = pd.read_csv('/data/sales.csv')
# Quick overview
print(f"Shape: {df.shape}")
print(df.head())
# Check for issues
print(f"Missing: {df.isnull().sum().sum()}")
print(f"Duplicates: {df.duplicated().sum()}")
# Analyze sales by region
by_region = df.groupby('region')['sales'].agg([
    ('total_sales', 'sum'),
    ('avg_sales', 'mean'),
    ('num_transactions', 'count')
])
print(by_region)
# Find top performers
top_10 = df.nlargest(10, 'sales')[['product', 'sales', 'region']]
print(top_10)
Troubleshooting
Large Files:
# Read in chunks
chunk_size = 10000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
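If even chunked loading is too heavy, reading only the columns you need with compact dtypes can cut memory substantially (column names and dtypes below are placeholders):
# Load a subset of columns with explicit narrow dtypes (placeholder names)
df = pd.read_csv(
    'large_file.csv',
    usecols=['id', 'category', 'value'],
    dtype={'id': 'int32', 'category': 'category', 'value': 'float32'},
)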
Encoding Issues:
# utf-8 is the default; if it raises a UnicodeDecodeError, try a legacy encoding
df = pd.read_csv('file.csv', encoding='utf-8')
# or
df = pd.read_csv('file.csv', encoding='latin-1')
Date Parsing:
df = pd.read_csv('file.csv', parse_dates=['date_column'])
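If parse_dates guesses the format wrong or trips on bad values, parsing after load with an explicit format is more predictable (the format string is an example):
# Parse after loading; invalid values become NaT instead of raising (example format)
df['date_column'] = pd.to_datetime(df['date_column'], format='%Y-%m-%d', errors='coerce')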
Cleanup
# Exit Python REPL when done
exit()
Related Skills
- data-visualization: Create advanced visualizations
- statistical-modeling: Build predictive models
- data-cleaning: Advanced data cleaning workflows