| name | data-analysis |
| description | Comprehensive data analysis skill for exploratory data analysis (EDA), data cleaning, visualization, and statistical analysis. Use when asked to perform EDA, analyze datasets, clean/prepare data, create visualizations, or conduct statistical analysis. Triggers include phrases like "perform an initial EDA", "prepare this data for analysis", "analyze this dataset", "clean this data", "explore this data", or "create a visualization". Outputs are formatted for TAG brand guidelines with educational explanations throughout. |
Data Analysis Skill
This skill guides comprehensive data analysis workflows with an educational approach—explaining the WHY behind each step to build Python data analysis skills.
Core Principles
- Always explain reasoning: Every code block includes comments explaining PURPOSE, not just mechanics
- Teach transferable patterns: Flag which techniques are universal vs. dataset-specific
- Brand-consistent outputs: All visualizations and tables follow TAG brand guidelines
- Progressive complexity: Start simple, add sophistication as needed
TAG Brand Guidelines for Outputs
Color Palette
TAG_COLORS = {
'navy': '#002855', # Primary dark blue - use for headers, titles
'aspen_blue': '#008AD8', # Primary blue - use for main chart elements
'mid_blue': '#B9DEFA', # Light blue - use for backgrounds, secondary elements
'indigo': '#004BB9', # Accent - use for highlights, emphasis
'white': '#FFFFFF' # Background
}
# For matplotlib/plotly charts with multiple series:
TAG_PALETTE = ['#008AD8', '#002855', '#004BB9', '#B9DEFA', '#5BA4CF', '#003D7A']
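If matplotlib is the plotting backend, the palette can be registered once as the default color cycle so multi-series charts pick up TAG colors automatically. A minimal sketch (the cycler module ships with matplotlib):
import matplotlib.pyplot as plt
from cycler import cycler
# WHY: Registering the palette as the default color cycle means every plot
# uses TAG colors without passing color= on each call
plt.rcParams['axes.prop_cycle'] = cycler(color=TAG_PALETTE)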
Table Formatting Standards
- Headers: Navy (#002855) background, white text, bold
- Decimals: Round to 2 places (0.00)
- Percentages: Format as "XX.XX%" (not 0.XX)
- Large numbers: Use comma separators (1,234,567)
- Export as images for slide-ready output
Visualization Standards
- Title: 14pt bold, Navy (#002855)
- Axis labels: 11pt, dark gray
- Always include clear axis labels and titles
- Use TAG_PALETTE for consistent branding
- Save figures at 150 DPI minimum for presentations
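These standards can also be applied globally so individual chart templates stay short. A minimal sketch using standard matplotlib rcParams; the dark-gray hex value is an illustrative choice:
import matplotlib.pyplot as plt
# WHY: Global defaults enforce the visualization standards once instead of
# repeating them in every figure
plt.rcParams.update({
    'axes.titlesize': 14,
    'axes.titleweight': 'bold',
    'axes.titlecolor': '#002855',   # Navy titles
    'axes.labelsize': 11,
    'axes.labelcolor': '#444444',   # Dark gray axis labels
    'savefig.dpi': 150,             # Minimum DPI for presentations
})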
EDA Workflow
Follow this six-step framework (based on industry best practices from Shopify Engineering):
Step 1: Initial Data Understanding
Purpose: Establish what you're working with before diving into analysis.
# WHY: Understanding shape helps anticipate computational limits
# and identifies if you need to sample for exploration
print(f"Dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
# WHY: Data types reveal immediate issues—object types that should be
# numeric, dates stored as strings, etc.
print(df.dtypes)
# WHY: First/last rows show actual data format, catch encoding issues
df.head()
Key questions to answer:
- What is the unique identifier for each row?
- What does each row represent? (One employee? One day? One transaction?)
- What is the date range if time-series data?
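A quick sketch to answer these questions in code; id_col and date_col are placeholders to replace with the dataset's actual columns:
# WHY: If the candidate ID is not unique, each row represents something other
# than assumed, so verify before aggregating or joining
id_col = 'id'        # placeholder
print(f"Unique IDs: {df[id_col].nunique():,} of {len(df):,} rows")
# WHY: The date range frames what period any conclusions apply to
date_col = 'date'    # placeholder
print(f"Date range: {df[date_col].min()} to {df[date_col].max()}")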
Step 2: Missing Data Analysis
Purpose: Missing data affects every downstream analysis. Understand it first.
# WHY: Ranking by missing % lets you prioritize which columns need attention
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
'Missing Count': missing,
'Missing %': missing_pct
}).sort_values('Missing %', ascending=False)
# WHY: Visualizing helps stakeholders understand data quality at a glance
missing_df[missing_df['Missing Count'] > 0]['Missing %'].plot(
    kind='barh', color='#008AD8', title='Missing Data by Column')
For each column with missing data, document:
- WHY is it missing? (Not collected? Not applicable? System error?)
- Is missingness itself informative?
- Imputation strategy: mean/median (numeric), mode (categorical), or flag as unknown
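A minimal sketch of the three strategies; numeric_col, categorical_col, and flagged_col are placeholder column names:
# WHY: Median is robust to skew; mode preserves the most common category;
# an explicit 'Unknown' flag keeps missingness visible instead of hiding it
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].median())
df['categorical_col'] = df['categorical_col'].fillna(df['categorical_col'].mode()[0])
df['flagged_col'] = df['flagged_col'].fillna('Unknown')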
Step 3: Feature Categorization & Description
Purpose: Different data types require different analysis techniques.
Categorize each column as:
- Continuous: Can take any value within a range (salary, conversion rate)
- Discrete: Countable integers (headcount, sessions)
- Categorical: Finite set of values (department, status)
- DateTime: Temporal data needing special handling
# WHY: Separating by type ensures we apply appropriate statistics
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
# WHY: describe() gives quick distribution summary, but interpretation
# differs by column type
df[numeric_cols].describe()
df[categorical_cols].describe()
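For categorical columns, value_counts is often more readable than describe(); a short sketch:
# WHY: Relative frequencies show whether one category dominates or the
# column is fragmented into many rare values
for col in categorical_cols:
    print(f"\n{col}:")
    print(df[col].value_counts(normalize=True).head(10).round(4))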
Step 4: Distribution Analysis (Shape of Data)
Purpose: Understanding distributions informs statistical method choices.
# WHY: Skewed data means median is more representative than mean
# WHY: Multimodal distributions suggest distinct subgroups in data
print(df[numeric_cols].skew().round(2).sort_values())
df[numeric_cols].hist(bins=30, figsize=(12, 8), color='#008AD8')
Check for:
- Skewness (asymmetry)
- Multimodality (multiple peaks suggesting subgroups)
- Bounds (natural limits like 0-100%)
- Outliers (see Step 5)
For time-series data, also examine:
- Trends (upward/downward over time)
- Seasonality (repeating patterns)
- Stationarity (stable mean/variance)
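For these time-series checks, a decomposition separates the components so each can be inspected on its own. A minimal sketch, assuming statsmodels is available (it is not in the Required Packages list) and that date_column and value_column are placeholders for a regularly spaced series:
from statsmodels.tsa.seasonal import seasonal_decompose
# WHY: Decomposition splits the series into trend, seasonal, and residual parts
# period=12 assumes monthly data with yearly seasonality; adjust to the data
decomposition = seasonal_decompose(df.set_index('date_column')['value_column'], period=12)
decomposition.plot()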
Step 5: Correlation & Relationships
Purpose: Identify which variables move together—critical for modeling.
# WHY: Correlation matrix reveals multicollinearity (problematic for regression)
# and potential predictive relationships
correlation_matrix = df[numeric_cols].corr()
# WHY: Heatmaps make patterns visually obvious across many variables
# (see the correlation heatmap template under Visualization Templates below)
For categorical variables, use chi-square tests or contingency tables.
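A minimal sketch of a chi-square independence test, assuming scipy is available (not in the Required Packages list) and using placeholder column names cat_col_1 and cat_col_2:
from scipy.stats import chi2_contingency
# WHY: The test checks whether two categorical variables are independent;
# a small p-value suggests they are related
contingency = pd.crosstab(df['cat_col_1'], df['cat_col_2'])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")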
Step 6: Outlier Detection
Purpose: Outliers can skew statistics and need investigation.
# WHY: IQR method is robust to the outliers it's trying to detect
# (unlike z-score which uses mean/std affected by outliers)
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)]
For each outlier, determine:
- Data error? → Fix or remove
- Natural extreme? → Keep but note
- Different population? → Consider segmentation
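To see where investigation effort should go first, the same IQR rule can be applied across all numeric columns; a short sketch:
# WHY: Counting outliers per column prioritizes which variables to investigate
for col in numeric_cols:
    Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    IQR = Q3 - Q1
    n_outliers = ((df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)).sum()
    if n_outliers > 0:
        print(f"{col}: {n_outliers:,} outliers ({n_outliers / len(df):.2%})")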
Data Cleaning Workflow
Standard Cleaning Steps
# 1. COLUMN NAMES: Standardize for consistency
# WHY: Consistent naming prevents errors and makes code more readable
df.columns = df.columns.str.lower().str.replace(' ', '_')
# 2. DATA TYPES: Convert to appropriate types
# WHY: Correct types enable proper operations (math on numbers, sorting dates)
df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')
# 3. DUPLICATES: Identify and handle
# WHY: Duplicates inflate counts and skew statistics
duplicates = df.duplicated(subset=['unique_id_columns'], keep=False)
print(f"Found {duplicates.sum()} duplicate rows")
# 4. STRING STANDARDIZATION: Clean text fields
# WHY: "John Smith", "john smith", "JOHN SMITH" should match
df['name'] = df['name'].str.strip().str.title()
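These steps can be wrapped into one reusable function so every dataset goes through the same pipeline. A minimal sketch; the function name and the decision to drop exact duplicates are illustrative, not a fixed standard:
def clean_dataframe(df, date_cols=None, numeric_cols=None):
    """Apply the standard cleaning steps in one pass."""
    df = df.copy()  # WHY: Avoid mutating the caller's DataFrame
    df.columns = df.columns.str.lower().str.replace(' ', '_')
    for col in (date_cols or []):
        df[col] = pd.to_datetime(df[col])
    for col in (numeric_cols or []):
        df[col] = pd.to_numeric(df[col], errors='coerce')
    df = df.drop_duplicates()  # WHY: Exact duplicates inflate counts
    return df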
Visualization Templates
For Distributions (Histograms)
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df[column], bins=30, color='#008AD8', edgecolor='white', alpha=0.8)
ax.set_title(f'Distribution of {column}', fontsize=14, fontweight='bold', color='#002855')
ax.set_xlabel(column, fontsize=11)
ax.set_ylabel('Frequency', fontsize=11)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.savefig('distribution.png', dpi=150, bbox_inches='tight')
For Comparisons (Bar Charts)
fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(categories, values, color='#008AD8', edgecolor='#002855')
ax.bar_label(bars, padding=3, fontsize=10)  # WHY: Value labels make bars readable without gridlines
ax.set_title('Title Here', fontsize=14, fontweight='bold', color='#002855')
ax.set_xlabel('Category', fontsize=11)
ax.set_ylabel('Value', fontsize=11)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
For Trends (Line Charts)
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(dates, values, color='#008AD8', linewidth=2, marker='o', markersize=4)
ax.set_title('Trend Over Time', fontsize=14, fontweight='bold', color='#002855')
ax.set_xlabel('Date', fontsize=11)
ax.set_ylabel('Value', fontsize=11)
ax.fill_between(dates, values, alpha=0.1, color='#008AD8')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
For Relationships (Scatter/Correlation)
import seaborn as sns
# Correlation heatmap
fig, ax = plt.subplots(figsize=(10, 8))
# WHY: A smooth diverging colormap built from TAG colors keeps negative and
# positive correlations visually distinct while staying on brand
from matplotlib.colors import LinearSegmentedColormap
tag_cmap = LinearSegmentedColormap.from_list(
    'tag_diverging', ['#B9DEFA', '#FFFFFF', '#008AD8'])
sns.heatmap(correlation_matrix, annot=True, fmt='.2f',
            cmap=tag_cmap, center=0,
            ax=ax, square=True)
ax.set_title('Correlation Matrix', fontsize=14, fontweight='bold', color='#002855')
plt.tight_layout()
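Optionally, the redundant upper triangle can be masked so each variable pair appears only once; with numpy imported as np (see Required Packages):
# WHY: The matrix is symmetric, so masking the upper triangle halves the
# visual noise without losing information
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
# Then pass mask=mask to the sns.heatmap() call above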
Slide-Ready Table Export
import dataframe_image as dfi
def style_for_slides(df):
"""
WHY: Consistent styling makes tables immediately usable in presentations
without manual reformatting.
"""
return (df.style
.set_properties(**{
'text-align': 'center',
'font-size': '11pt'
})
.set_table_styles([
{'selector': 'th', 'props': [
('background-color', '#002855'),
('color', 'white'),
('font-weight', 'bold'),
('text-align', 'center'),
('padding', '8px')
]},
{'selector': 'td', 'props': [
('padding', '6px'),
('border', '1px solid #B9DEFA')
]}
])
.format({
# Format specific column types
'percent_col': '{:.2%}',
'decimal_col': '{:.2f}',
'int_col': '{:,.0f}'
})
)
# Export as image
styled = style_for_slides(summary_df)
dfi.export(styled, 'summary_table.png', dpi=150)
Required Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# For slide-ready table exports:
# pip install dataframe_image
# For interactive visualizations:
import plotly.express as px
import plotly.graph_objects as go
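# If the chi-square or seasonal decomposition sketches above are used, two
# additional packages are assumed:
# pip install scipy
# pip install statsmodels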
References
For detailed EDA checklists and advanced techniques, see:
references/eda-checklist.md - Comprehensive EDA checklist with code snippets