---
name: Data Cleaner
description: Use this skill when the user needs to analyze, clean, or prepare datasets. Helps with listing columns, detecting data types (text, categorical, ordinal, numeric), identifying data quality issues, and cleaning values that don't fit expected patterns. Invoke when users mention data cleaning, data quality, column analysis, type detection, or preparing datasets.
allowed-tools: Read, Bash, Grep, Glob
---
# Data Cleaning Skill
This skill helps analyze and clean datasets by detecting data types, identifying quality issues, and suggesting or applying corrections.
## Core Capabilities
- Column Analysis: List all columns with basic statistics and sample values
- Type Detection: Automatically detect whether columns are (see the heuristic sketch after this list):
  - Numeric (integer, float)
  - Categorical (limited unique values)
  - Ordinal (ordered categories)
  - Text (free-form text)
  - DateTime
  - Boolean
- Data Quality Reports: Comprehensive quality analysis with severity levels and completeness scores
- Value Mapping Generation: Auto-generate standardization functions for categorical data
- Value Cleaning: Fix common issues like extra whitespace, inconsistent casing, invalid values
- Validation Reports: Compare before/after cleaning to verify transformations
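The detection rules themselves live in `scripts/detect_types.py`, documented below. As a rough, non-authoritative sketch of the kind of heuristics involved (pandas-based, with illustrative thresholds rather than the script's actual logic):

```python
import pandas as pd

def rough_type_guess(series: pd.Series, cat_threshold: int = 20) -> str:
    """Illustrative heuristic only; detect_types.py may use different rules."""
    s = series.dropna()
    if s.empty:
        return "unknown"
    # Boolean: all values drawn from a small true/false vocabulary
    if set(s.astype(str).str.strip().str.lower().unique()) <= {"true", "false", "yes", "no", "0", "1"}:
        return "boolean"
    # Numeric: every value parses as a number
    if pd.to_numeric(s, errors="coerce").notna().all():
        return "numeric"
    # DateTime: every value parses as a date
    if pd.to_datetime(s, errors="coerce").notna().all():
        return "datetime"
    # Categorical: few distinct values; ordinal would additionally need a known ordering
    if s.nunique() <= cat_threshold:
        return "categorical"
    return "text"
```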
## Instructions
When the user requests data cleaning assistance:
- Identify the dataset: Ask for the file path if not provided
- Generate quality report: Use `scripts/data_quality_report.py` for comprehensive quality analysis
- Analyze columns: Use `scripts/analyze_columns.py` to get an overview of all columns
- Detect types: Use `scripts/detect_types.py` to determine the data type of each column
- Generate value mappings: Use `scripts/value_mapping_generator.py` for categorical columns needing standardization
- Present findings: Show the user:
  - Data quality grade and issues
  - Column names and detected types
  - Suggested value mappings
  - Sample problematic values
- Suggest fixes: Recommend cleaning strategies based on issues found
- Apply cleaning: If the user approves, use `scripts/clean_values.py` to fix issues
- Validate results: Use `scripts/validation_report.py` to compare before/after and confirm changes
## Using the Python Scripts
All scripts should be run using the conda Python environment:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/<script_name>.py [arguments]
### analyze_columns.py
Analyzes all columns in a dataset and provides summary statistics.
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/analyze_columns.py <file_path> [--format csv|excel|json]
Output: JSON with column names, types, null counts, unique counts, and sample values
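The exact output schema isn't specified here; as an illustration, a per-column record of roughly this shape would match the description above (field names and values are hypothetical):

```json
{
  "customer_name": {
    "dtype": "object",
    "null_count": 12,
    "unique_count": 4987,
    "sample_values": ["Alice", " bob ", "CAROL"]
  }
}
```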
### detect_types.py
Detects the semantic type of each column (numeric, categorical, ordinal, text, datetime, boolean).
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/detect_types.py <file_path> [--format csv|excel|json]
Output: JSON mapping columns to detected types with confidence scores
### clean_values.py
Cleans specific columns based on detected issues.
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/clean_values.py <file_path> <output_path> [--operations json_string]
Operations JSON format:
```json
{
  "column_name": {
    "operation": "trim|lowercase|uppercase|remove_special|fill_missing|convert_type",
    "params": {}
  }
}
```
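Because `--operations` takes the JSON as a single string, it is easiest to build it programmatically. A minimal sketch of one invocation from Python (the column names, file names, and chosen operations are illustrative, not prescribed by the script):

```python
import json
import subprocess

# Hypothetical payload: trim whitespace in one column, lowercase another
operations = {
    "customer_name": {"operation": "trim", "params": {}},
    "status": {"operation": "lowercase", "params": {}},
}

subprocess.run(
    [
        r"C:\Users\brook\anaconda3\Scripts\conda.exe", "run", "-n", "base",
        "python", "scripts/clean_values.py",
        "data.csv", "data_cleaned.csv",
        "--operations", json.dumps(operations),
    ],
    check=True,
)
```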
### data_quality_report.py
Generates a comprehensive data quality report with severity levels and completeness scores.
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/data_quality_report.py <file_path> [--format csv|excel|json] [--output report.json]
Output: JSON report with:
- Overall quality grade (A-F)
- Per-column completeness scores
- Missing values analysis
- Formatting issues
- Outliers detection
- Data type consistency checks
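The report's actual grading scheme is defined inside the script; as a sketch of how a completeness score and A-F grade could be derived (the cutoffs below are assumptions, not the script's real thresholds):

```python
import pandas as pd

def completeness_score(series: pd.Series) -> float:
    """Fraction of non-null values in a column, from 0.0 to 1.0."""
    return 1.0 - series.isna().mean()

def letter_grade(score: float) -> str:
    """Illustrative A-F mapping; data_quality_report.py may use other cutoffs."""
    for grade, cutoff in [("A", 0.95), ("B", 0.90), ("C", 0.80), ("D", 0.70)]:
        if score >= cutoff:
            return grade
    return "F"
```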
### value_mapping_generator.py
Auto-generates standardization mappings and Python functions for categorical columns.
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/value_mapping_generator.py <file_path> [--column COLUMN] [--threshold 20] [--output-functions functions.py]
Output: JSON with:
- Suggested value mappings
- Groups of similar values
- Auto-generated Python standardization functions
- Before/after value counts
Options:
- `--column`: Analyze a specific column only
- `--threshold`: Maximum unique values to consider categorical (default: 20)
- `--output-functions`: Write the generated Python functions to a file
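One common way to derive such mappings is to group raw values under a normalized key and map every variant to the most frequent spelling in its group. A minimal sketch of that idea (an assumption about the approach, not the script's actual algorithm):

```python
from collections import Counter, defaultdict

def suggest_mapping(values: list[str]) -> dict[str, str]:
    """Group values that differ only in case/whitespace; map each variant
    to the most common raw spelling in its group."""
    groups: defaultdict[str, Counter] = defaultdict(Counter)
    for v in values:
        groups[v.strip().lower()].update([v])
    mapping = {}
    for counter in groups.values():
        canonical = counter.most_common(1)[0][0]
        for variant in counter:
            if variant != canonical:
                mapping[variant] = canonical
    return mapping

# e.g. suggest_mapping(["Yes", "yes ", "YES", "No"]) -> {"yes ": "Yes", "YES": "Yes"}
```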
### validation_report.py
Compares original and cleaned datasets to validate transformations.
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/validation_report.py <original_file> <cleaned_file> [--format csv|excel|json] [--output validation.json]
Output: JSON report with:
- Transformation examples for each column
- Data loss analysis
- Before/after distribution comparisons
- Validation status (pass/review_needed)
- Recommendations
## Workflow Examples

### Basic Workflow
- User: "I need to clean my customer data"
- Get file path from user
- Run
data_quality_report.pyto assess overall quality - Run
analyze_columns.pyto see all columns - Run
detect_types.pyto determine types - Present findings and ask user which columns to clean
- Run
clean_values.pywith appropriate operations - Run
validation_report.pyto verify changes - Confirm cleaning completed and show summary
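If this sequence needs to be scripted rather than run interactively, a hedged sketch of driving the same steps from Python (file names are illustrative; the conda path comes from the usage examples above):

```python
import subprocess

CONDA = r"C:\Users\brook\anaconda3\Scripts\conda.exe"

def run_script(script: str, *args: str) -> None:
    """Run one of the skill's scripts in the conda base environment."""
    subprocess.run(
        [CONDA, "run", "-n", "base", "python", f"scripts/{script}", *args],
        check=True,
    )

run_script("data_quality_report.py", "customers.csv")
run_script("analyze_columns.py", "customers.csv")
run_script("detect_types.py", "customers.csv")
# ... review findings with the user, then clean and validate:
run_script("clean_values.py", "customers.csv", "customers_cleaned.csv")
run_script("validation_report.py", "customers.csv", "customers_cleaned.csv")
```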
### Advanced Workflow (with auto-generated functions)
- User: "Generate cleaning functions for my survey data"
- Run
data_quality_report.pyfor quality overview - Run
value_mapping_generator.pyfor categorical columns - Show user the generated standardization functions
- User can copy functions into their own cleaning script
- Apply cleaning using the generated functions
- Validate with
validation_report.py
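Once copied into a script, applying a generated function is a one-liner per column with pandas. The function name `standardize_status` below is hypothetical, standing in for whatever value_mapping_generator.py emits:

```python
import pandas as pd

# Hypothetical generated function, copied from functions.py
def standardize_status(value: str) -> str:
    mapping = {"ACTIVE": "active", "Active ": "active", "in-active": "inactive"}
    return mapping.get(value, value)

df = pd.read_csv("survey.csv")
df["status"] = df["status"].map(standardize_status, na_action="ignore")
df.to_csv("survey_cleaned.csv", index=False)
```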
## Best Practices
- Always show sample values before suggesting changes
- Explain why certain types were detected
- Ask for confirmation before modifying data
- Create backups or save to new files when cleaning
- Handle both CSV and Excel files
- Provide clear summaries of changes made
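For the "handle both CSV and Excel files" practice, a small loader sketch that dispatches on the file extension (this assumes `openpyxl` is available in the conda environment for Excel files):

```python
from pathlib import Path
import pandas as pd

def load_dataset(path: str) -> pd.DataFrame:
    """Dispatch on file extension so callers don't care about the format."""
    suffix = Path(path).suffix.lower()
    if suffix in {".xlsx", ".xls"}:
        return pd.read_excel(path)  # requires openpyxl (or xlrd for .xls)
    if suffix == ".json":
        return pd.read_json(path)
    return pd.read_csv(path)
```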