| name | data-analysis |
| description | Comprehensive data analysis skill for exploratory data analysis (EDA), data cleaning, visualization, and statistical analysis. Use when asked to perform EDA, analyze datasets, clean/prepare data, create visualizations, or conduct statistical analysis. Triggers include phrases like "perform an initial EDA", "prepare this data for analysis", "analyze this dataset", "clean this data", "explore this data", or "create a visualization". Outputs are formatted for TAG brand guidelines with educational explanations throughout. |
Data Analysis Skill
This skill guides comprehensive data analysis workflows with an educational approach—explaining the WHY behind each step to build Python data analysis skills.
Core Principles
- Always explain reasoning: Every code block includes comments explaining PURPOSE, not just mechanics
- Teach transferable patterns: Flag which techniques are universal vs. dataset-specific
- Brand-consistent outputs: All visualizations and tables follow TAG brand guidelines
- Progressive complexity: Start simple, add sophistication as needed
TAG Brand Guidelines for Outputs
Color Palette
TAG_COLORS = {
'navy': '#002855', # Primary dark blue - use for headers, titles
'aspen_blue': '#008AD8', # Primary blue - use for main chart elements
'mid_blue': '#B9DEFA', # Light blue - use for backgrounds, secondary elements
'indigo': '#004BB9', # Accent - use for highlights, emphasis
'white': '#FFFFFF' # Background
}
# For matplotlib/plotly charts with multiple series:
TAG_PALETTE = ['#008AD8', '#002855', '#004BB9', '#B9DEFA', '#5BA4CF', '#003D7A']
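If matplotlib is the plotting backend, the palette can be registered once as the default color cycle so multi-series charts pick up TAG colors automatically. A minimal sketch (the cycler module ships with matplotlib):
import matplotlib.pyplot as plt
from cycler import cycler
# WHY: Registering the palette as the default color cycle means every plot
# uses TAG colors without passing color= on each call
plt.rcParams['axes.prop_cycle'] = cycler(color=TAG_PALETTE)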
Table Formatting Standards
- Headers: Navy (#002855) background, white text, bold
- Decimals: Round to 2 places (0.00)
- Percentages: Format as "XX.XX%" (not 0.XX)
- Large numbers: Use comma separators (1,234,567)
- Export as images for slide-ready output
Visualization Standards
- Title: 14pt bold, Navy (#002855)
- Axis labels: 11pt, dark gray
- Always include clear axis labels and titles
- Use TAG_PALETTE for consistent branding
- Save figures at 150 DPI minimum for presentations
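These standards can also be applied globally so individual chart templates stay short. A minimal sketch using standard matplotlib rcParams; the dark-gray hex value is an illustrative choice:
import matplotlib.pyplot as plt
# WHY: Global defaults enforce the visualization standards once instead of
# repeating them in every figure
plt.rcParams.update({
    'axes.titlesize': 14,
    'axes.titleweight': 'bold',
    'axes.titlecolor': '#002855',   # Navy titles
    'axes.labelsize': 11,
    'axes.labelcolor': '#444444',   # Dark gray axis labels
    'savefig.dpi': 150,             # Minimum DPI for presentations
})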
EDA Workflow
Follow this six-step framework (based on industry best practices from Shopify Engineering):
Step 1: Initial Data Understanding
Purpose: Establish what you're working with before diving into analysis.
# WHY: Understanding shape helps anticipate computational limits
# and identifies if you need to sample for exploration
print(f"Dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
# WHY: Data types reveal immediate issues—object types that should be
# numeric, dates stored as strings, etc.
print(df.dtypes)
# WHY: First/last rows show actual data format, catch encoding issues
df.head()
Key questions to answer:
- What is the unique identifier for each row?
- What does each row represent? (One employee? One day? One transaction?)
- What is the date range if time-series data?
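A quick sketch to answer these questions in code; id_col and date_col are placeholders to replace with the dataset's actual columns:
# WHY: If the candidate ID is not unique, each row represents something other
# than assumed, so verify before aggregating or joining
id_col = 'id'        # placeholder
print(f"Unique IDs: {df[id_col].nunique():,} of {len(df):,} rows")
# WHY: The date range frames what period any conclusions apply to
date_col = 'date'    # placeholder
print(f"Date range: {df[date_col].min()} to {df[date_col].max()}")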
Step 2: Missing Data Analysis
Purpose: Missing data affects every downstream analysis. Understand it first.
# WHY: Ranking by missing % lets you prioritize which columns need attention
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
'Missing Count': missing,
'Missing %': missing_pct
}).sort_values('Missing %', ascending=False)
# WHY: Visualizing helps stakeholders understand data quality at a glance
missing_df[missing_df['Missing Count'] > 0]['Missing %'].plot(
    kind='barh', color='#008AD8', title='Missing Data by Column')
For each column with missing data, document:
- WHY is it missing? (Not collected? Not applicable? System error?)
- Is missingness itself informative?
- Imputation strategy: mean/median (numeric), mode (categorical), or flag as unknown
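A minimal sketch of the three strategies; numeric_col, categorical_col, and flagged_col are placeholder column names:
# WHY: Median is robust to skew; mode preserves the most common category;
# an explicit 'Unknown' flag keeps missingness visible instead of hiding it
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].median())
df['categorical_col'] = df['categorical_col'].fillna(df['categorical_col'].mode()[0])
df['flagged_col'] = df['flagged_col'].fillna('Unknown')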
Step 3: Feature Categorization & Description
Purpose: Different data types require different analysis techniques.
Categorize each column as:
- Continuous: Can take any value within a range (salary, conversion rate)
- Discrete: Countable integers (headcount, sessions)
- Categorical: Finite set of values (department, status)
- DateTime: Temporal data needing special handling
# WHY: Separating by type ensures we apply appropriate statistics
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
# WHY: describe() gives quick distribution summary, but interpretation
# differs by column type
df[numeric_cols].describe()
df[categorical_cols].describe()
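For categorical columns, value_counts is often more readable than describe(); a short sketch:
# WHY: Relative frequencies show whether one category dominates or the
# column is fragmented into many rare values
for col in categorical_cols:
    print(f"\n{col}:")
    print(df[col].value_counts(normalize=True).head(10).round(4))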
Step 4: Distribution Analysis (Shape of Data)
Purpose: Understanding distributions informs statistical method choices.
# WHY: Skewed data means median is more representative than mean
# WHY: Multimodal distributions suggest distinct subgroups in data
print(df[numeric_cols].skew().round(2).sort_values())
df[numeric_cols].hist(bins=30, figsize=(12, 8), color='#008AD8')
Check for:
- Skewness (asymmetry)
- Multimodality (multiple peaks suggesting subgroups)
- Bounds (natural limits like 0-100%)
- Outliers (see Step 5)
For time-series data, also examine:
- Trends (upward/downward over time)
- Seasonality (repeating patterns)
- Stationarity (stable mean/variance)
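For these time-series checks, a decomposition separates the components so each can be inspected on its own. A minimal sketch, assuming statsmodels is available (it is not in the Required Packages list) and that date_column and value_column are placeholders for a regularly spaced series:
from statsmodels.tsa.seasonal import seasonal_decompose
# WHY: Decomposition splits the series into trend, seasonal, and residual parts
# period=12 assumes monthly data with yearly seasonality; adjust to the data
decomposition = seasonal_decompose(df.set_index('date_column')['value_column'], period=12)
decomposition.plot()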
Step 5: Correlation & Relationships
Purpose: Identify which variables move together—critical for modeling.
# WHY: Correlation matrix reveals multicollinearity (problematic for regression)
# and potential predictive relationships
correlation_matrix = df[numeric_cols].corr()
# WHY: Heatmaps make patterns visually obvious across many variables
# (see the correlation heatmap template under Visualization Templates below)
For categorical variables, use chi-square tests or contingency tables.
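A minimal sketch of a chi-square independence test, assuming scipy is available (not in the Required Packages list) and using placeholder column names cat_col_1 and cat_col_2:
from scipy.stats import chi2_contingency
# WHY: The test checks whether two categorical variables are independent;
# a small p-value suggests they are related
contingency = pd.crosstab(df['cat_col_1'], df['cat_col_2'])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")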
Step 6: Outlier Detection
Purpose: Outliers can skew statistics and need investigation.
# WHY: IQR method is robust to the outliers it's trying to detect
# (unlike z-score which uses mean/std affected by outliers)
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)]
For each outlier, determine:
- Data error? → Fix or remove
- Natural extreme? → Keep but note
- Different population? → Consider segmentation
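To see where investigation effort should go first, the same IQR rule can be applied across all numeric columns; a short sketch:
# WHY: Counting outliers per column prioritizes which variables to investigate
for col in numeric_cols:
    Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    IQR = Q3 - Q1
    n_outliers = ((df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)).sum()
    if n_outliers > 0:
        print(f"{col}: {n_outliers:,} outliers ({n_outliers / len(df):.2%})")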
Data Cleaning Workflow
Standard Cleaning Steps
# 1. COLUMN NAMES: Standardize for consistency
# WHY: Consistent naming prevents errors and makes code more readable
df.columns = df.columns.str.lower().str.replace(' ', '_')
# 2. DATA TYPES: Convert to appropriate types
# WHY: Correct types enable proper operations (math on numbers, sorting dates)
df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')
# 3. DUPLICATES: Identify and handle
# WHY: Duplicates inflate counts and skew statistics
duplicates = df.duplicated(subset=['unique_id_columns'], keep=False)
print(f"Found {duplicates.sum()} duplicate rows")
# 4. STRING STANDARDIZATION: Clean text fields
# WHY: "John Smith", "john smith", "JOHN SMITH" should match
df['name'] = df['name'].str.strip().str.title()
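These steps can be wrapped into one reusable function so every dataset goes through the same pipeline. A minimal sketch; the function name and the decision to drop exact duplicates are illustrative, not a fixed standard:
def clean_dataframe(df, date_cols=None, numeric_cols=None):
    """Apply the standard cleaning steps in one pass."""
    df = df.copy()  # WHY: Avoid mutating the caller's DataFrame
    df.columns = df.columns.str.lower().str.replace(' ', '_')
    for col in (date_cols or []):
        df[col] = pd.to_datetime(df[col])
    for col in (numeric_cols or []):
        df[col] = pd.to_numeric(df[col], errors='coerce')
    df = df.drop_duplicates()  # WHY: Exact duplicates inflate counts
    return df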
Visualization Templates
For Distributions (Histograms)
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df[column], bins=30, color='#008AD8', edgecolor='white', alpha=0.8)
ax.set_title(f'Distribution of {column}', fontsize=14, fontweight='bold', color='#002855')
ax.set_xlabel(column, fontsize=11)
ax.set_ylabel('Frequency', fontsize=11)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig('distribution.png', dpi=150, bbox_inches='tight')
For Comparisons (Bar Charts)
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(categories, values, color='#008AD8', edgecolor='#002855')
ax.bar_label(bars, padding=3, fontsize=10)  # WHY: Value labels make bars readable without gridlines
ax.set_title('Title Here', fontsize=14, fontweight='bold', color='#002855')
ax.set_xlabel('Category', fontsize=11)
ax.set_ylabel('Value', fontsize=11)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
For Trends (Line Charts)
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(dates, values, color='#008AD8', linewidth=2, marker='o', markersize=4)
ax.set_title('Trend Over Time', fontsize=14, fontweight='bold', color='#002855')
ax.set_xlabel('Date', fontsize=11)
ax.set_ylabel('Value', fontsize=11)
ax.fill_between(dates, values, alpha=0.1, color='#008AD8')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
For Relationships (Scatter/Correlation)
import seaborn as sns
# Correlation heatmap
fig, ax = plt.subplots(figsize=(10, 8))
# WHY: A smooth diverging colormap built from TAG colors keeps negative and
# positive correlations visually distinct while staying on brand
from matplotlib.colors import LinearSegmentedColormap
tag_cmap = LinearSegmentedColormap.from_list(
    'tag_diverging', ['#B9DEFA', '#FFFFFF', '#008AD8'])
sns.heatmap(correlation_matrix, annot=True, fmt='.2f',
            cmap=tag_cmap, center=0,
            ax=ax, square=True)
ax.set_title('Correlation Matrix', fontsize=14, fontweight='bold', color='#002855')
plt.tight_layout()
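Optionally, the redundant upper triangle can be masked so each variable pair appears only once; with numpy imported as np (see Required Packages):
# WHY: The matrix is symmetric, so masking the upper triangle halves the
# visual noise without losing information
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
# Then pass mask=mask to the sns.heatmap() call above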
Slide-Ready Table Export
import dataframe_image as dfi
def style_for_slides(df):
"""
WHY: Consistent styling makes tables immediately usable in presentations
without manual reformatting.
"""
return (df.style
.set_properties(**{
'text-align': 'center',
'font-size': '11pt'
})
.set_table_styles([
{'selector': 'th', 'props': [
('background-color', '#002855'),
('color', 'white'),
('font-weight', 'bold'),
('text-align', 'center'),
('padding', '8px')
]},
{'selector': 'td', 'props': [
('padding', '6px'),
('border', '1px solid #B9DEFA')
]}
])
.format({
# Format specific column types
'percent_col': '{:.2%}',
'decimal_col': '{:.2f}',
'int_col': '{:,.0f}'
})
)
# Export as image
styled = style_for_slides(summary_df)
dfi.export(styled, 'summary_table.png', dpi=150)
Required Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# For slide-ready table exports:
# pip install dataframe_image
# For interactive visualizations:
import plotly.express as px
import plotly.graph_objects as go
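# If the chi-square or seasonal decomposition sketches above are used, two
# additional packages are assumed:
# pip install scipy
# pip install statsmodels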
References
For detailed EDA checklists and advanced techniques, see:
references/eda-checklist.md - Comprehensive EDA checklist with code snippets