---
name: Data Cleaner
description: Use this skill when the user needs to analyze, clean, or prepare datasets. Helps with listing columns, detecting data types (text, categorical, ordinal, numeric), identifying data quality issues, and cleaning values that don't fit expected patterns. Invoke when users mention data cleaning, data quality, column analysis, type detection, or preparing datasets.
allowed-tools: Read, Bash, Grep, Glob
---
# Data Cleaning Skill
This skill helps analyze and clean datasets by detecting data types, identifying quality issues, and suggesting or applying corrections.
## Core Capabilities
- Column Analysis: List all columns with basic statistics and sample values
- Type Detection: Automatically detect whether columns are (see the heuristic sketch after this list):
  - Numeric (integer, float)
  - Categorical (limited unique values)
  - Ordinal (ordered categories)
  - Text (free-form text)
  - DateTime
  - Boolean
- Data Quality Reports: Comprehensive quality analysis with severity levels and completeness scores
- Value Mapping Generation: Auto-generate standardization functions for categorical data
- Value Cleaning: Fix common issues like extra whitespace, inconsistent casing, invalid values
- Validation Reports: Compare before/after cleaning to verify transformations
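The detection rules themselves live in `scripts/detect_types.py`, documented below. As a rough, non-authoritative sketch of the kind of heuristics involved (pandas-based, with illustrative thresholds rather than the script's actual logic):

```python
import pandas as pd

def rough_type_guess(series: pd.Series, cat_threshold: int = 20) -> str:
    """Illustrative heuristic only; detect_types.py may use different rules."""
    s = series.dropna()
    if s.empty:
        return "unknown"
    # Boolean: all values drawn from a small true/false vocabulary
    if set(s.astype(str).str.strip().str.lower().unique()) <= {"true", "false", "yes", "no", "0", "1"}:
        return "boolean"
    # Numeric: every value parses as a number
    if pd.to_numeric(s, errors="coerce").notna().all():
        return "numeric"
    # DateTime: every value parses as a date
    if pd.to_datetime(s, errors="coerce").notna().all():
        return "datetime"
    # Categorical: few distinct values; ordinal would additionally need a known ordering
    if s.nunique() <= cat_threshold:
        return "categorical"
    return "text"
```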
## Instructions
When the user requests data cleaning assistance:
- Identify the dataset: Ask for the file path if not provided
- Generate quality report: Use `scripts/data_quality_report.py` for comprehensive quality analysis
- Analyze columns: Use `scripts/analyze_columns.py` to get an overview of all columns
- Detect types: Use `scripts/detect_types.py` to determine the data type of each column
- Generate value mappings: Use `scripts/value_mapping_generator.py` for categorical columns needing standardization
- Present findings: Show the user:
  - Data quality grade and issues
  - Column names and detected types
  - Suggested value mappings
  - Sample problematic values
- Suggest fixes: Recommend cleaning strategies based on issues found
- Apply cleaning: If the user approves, use `scripts/clean_values.py` to fix issues
- Validate results: Use `scripts/validation_report.py` to compare before/after and confirm changes
## Using the Python Scripts
All scripts should be run using the conda Python environment:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/<script_name>.py [arguments]
### analyze_columns.py
Analyzes all columns in a dataset and provides summary statistics.
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/analyze_columns.py <file_path> [--format csv|excel|json]
Output: JSON with column names, types, null counts, unique counts, and sample values
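The exact output schema isn't specified here; as an illustration, a per-column record of roughly this shape would match the description above (field names and values are hypothetical):

```json
{
  "customer_name": {
    "dtype": "object",
    "null_count": 12,
    "unique_count": 4987,
    "sample_values": ["Alice", " bob ", "CAROL"]
  }
}
```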
### detect_types.py
Detects the semantic type of each column (numeric, categorical, ordinal, text, datetime, boolean).
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/detect_types.py <file_path> [--format csv|excel|json]
Output: JSON mapping columns to detected types with confidence scores
### clean_values.py
Cleans specific columns based on detected issues.
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/clean_values.py <file_path> <output_path> [--operations json_string]
Operations JSON format:
```json
{
  "column_name": {
    "operation": "trim|lowercase|uppercase|remove_special|fill_missing|convert_type",
    "params": {}
  }
}
```
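Because `--operations` takes the JSON as a single string, it is easiest to build it programmatically. A minimal sketch of one invocation from Python (the column names, file names, and chosen operations are illustrative, not prescribed by the script):

```python
import json
import subprocess

# Hypothetical payload: trim whitespace in one column, lowercase another
operations = {
    "customer_name": {"operation": "trim", "params": {}},
    "status": {"operation": "lowercase", "params": {}},
}

subprocess.run(
    [
        r"C:\Users\brook\anaconda3\Scripts\conda.exe", "run", "-n", "base",
        "python", "scripts/clean_values.py",
        "data.csv", "data_cleaned.csv",
        "--operations", json.dumps(operations),
    ],
    check=True,
)
```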
### data_quality_report.py
Generates a comprehensive data quality report with severity levels and completeness scores.
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/data_quality_report.py <file_path> [--format csv|excel|json] [--output report.json]
Output: JSON report with:
- Overall quality grade (A-F)
- Per-column completeness scores
- Missing values analysis
- Formatting issues
- Outliers detection
- Data type consistency checks
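The report's actual grading scheme is defined inside the script; as a sketch of how a completeness score and A-F grade could be derived (the cutoffs below are assumptions, not the script's real thresholds):

```python
import pandas as pd

def completeness_score(series: pd.Series) -> float:
    """Fraction of non-null values in a column, from 0.0 to 1.0."""
    return 1.0 - series.isna().mean()

def letter_grade(score: float) -> str:
    """Illustrative A-F mapping; data_quality_report.py may use other cutoffs."""
    for grade, cutoff in [("A", 0.95), ("B", 0.90), ("C", 0.80), ("D", 0.70)]:
        if score >= cutoff:
            return grade
    return "F"
```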
### value_mapping_generator.py
Auto-generates standardization mappings and Python functions for categorical columns.
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/value_mapping_generator.py <file_path> [--column COLUMN] [--threshold 20] [--output-functions functions.py]
Output: JSON with:
- Suggested value mappings
- Groups of similar values
- Auto-generated Python standardization functions
- Before/after value counts
Options:
- `--column`: Analyze a specific column only
- `--threshold`: Maximum unique values to consider categorical (default: 20)
- `--output-functions`: Write the generated Python functions to a file
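One common way to derive such mappings is to group raw values under a normalized key and map every variant to the most frequent spelling in its group. A minimal sketch of that idea (an assumption about the approach, not the script's actual algorithm):

```python
from collections import Counter, defaultdict

def suggest_mapping(values: list[str]) -> dict[str, str]:
    """Group values that differ only in case/whitespace; map each variant
    to the most common raw spelling in its group."""
    groups: defaultdict[str, Counter] = defaultdict(Counter)
    for v in values:
        groups[v.strip().lower()].update([v])
    mapping = {}
    for counter in groups.values():
        canonical = counter.most_common(1)[0][0]
        for variant in counter:
            if variant != canonical:
                mapping[variant] = canonical
    return mapping

# e.g. suggest_mapping(["Yes", "yes ", "YES", "No"]) -> {"yes ": "Yes", "YES": "Yes"}
```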
### validation_report.py
Compares original and cleaned datasets to validate transformations.
Usage:
"C:\Users\brook\anaconda3\Scripts\conda.exe" run -n base python scripts/validation_report.py <original_file> <cleaned_file> [--format csv|excel|json] [--output validation.json]
Output: JSON report with:
- Transformation examples for each column
- Data loss analysis
- Before/after distribution comparisons
- Validation status (pass/review_needed)
- Recommendations
## Workflow Examples

### Basic Workflow
- User: "I need to clean my customer data"
- Get file path from user
- Run
data_quality_report.pyto assess overall quality - Run
analyze_columns.pyto see all columns - Run
detect_types.pyto determine types - Present findings and ask user which columns to clean
- Run
clean_values.pywith appropriate operations - Run
validation_report.pyto verify changes - Confirm cleaning completed and show summary
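If this sequence needs to be scripted rather than run interactively, a hedged sketch of driving the same steps from Python (file names are illustrative; the conda path comes from the usage examples above):

```python
import subprocess

CONDA = r"C:\Users\brook\anaconda3\Scripts\conda.exe"

def run_script(script: str, *args: str) -> None:
    """Run one of the skill's scripts in the conda base environment."""
    subprocess.run(
        [CONDA, "run", "-n", "base", "python", f"scripts/{script}", *args],
        check=True,
    )

run_script("data_quality_report.py", "customers.csv")
run_script("analyze_columns.py", "customers.csv")
run_script("detect_types.py", "customers.csv")
# ... review findings with the user, then clean and validate:
run_script("clean_values.py", "customers.csv", "customers_cleaned.csv")
run_script("validation_report.py", "customers.csv", "customers_cleaned.csv")
```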
### Advanced Workflow (with auto-generated functions)
- User: "Generate cleaning functions for my survey data"
- Run
data_quality_report.pyfor quality overview - Run
value_mapping_generator.pyfor categorical columns - Show user the generated standardization functions
- User can copy functions into their own cleaning script
- Apply cleaning using the generated functions
- Validate with
validation_report.py
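Once copied into a script, applying a generated function is a one-liner per column with pandas. The function name `standardize_status` below is hypothetical, standing in for whatever value_mapping_generator.py emits:

```python
import pandas as pd

# Hypothetical generated function, copied from functions.py
def standardize_status(value: str) -> str:
    mapping = {"ACTIVE": "active", "Active ": "active", "in-active": "inactive"}
    return mapping.get(value, value)

df = pd.read_csv("survey.csv")
df["status"] = df["status"].map(standardize_status, na_action="ignore")
df.to_csv("survey_cleaned.csv", index=False)
```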
## Best Practices
- Always show sample values before suggesting changes
- Explain why certain types were detected
- Ask for confirmation before modifying data
- Create backups or save to new files when cleaning
- Handle both CSV and Excel files
- Provide clear summaries of changes made
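For the "handle both CSV and Excel files" practice, a small loader sketch that dispatches on the file extension (this assumes `openpyxl` is available in the conda environment for Excel files):

```python
from pathlib import Path
import pandas as pd

def load_dataset(path: str) -> pd.DataFrame:
    """Dispatch on file extension so callers don't care about the format."""
    suffix = Path(path).suffix.lower()
    if suffix in {".xlsx", ".xls"}:
        return pd.read_excel(path)  # requires openpyxl (or xlrd for .xls)
    if suffix == ".json":
        return pd.read_json(path)
    return pd.read_csv(path)
```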