---
name: data-cleaning-standards
description: Clean and standardize vehicle insurance CSV/Excel data. Use when handling missing values, fixing data formats, removing duplicates, or standardizing fields. Mentions "clean data", "handle nulls", "standardize", "duplicates", or "normalize".
allowed-tools: Read, Edit, Grep, Glob
---
# Data Cleaning Standards
Clean and standardize vehicle insurance data following established business rules.
## When to Activate
Use this skill when the user:
- Says "clean the data" or "standardize data"
- Mentions "missing values", "null handling", or "fill missing"
- Asks "remove duplicates" or "deduplicate"
- Wants to "normalize dates" or "standardize formats"
- Mentions data preparation or preprocessing
## Quick Start Workflow

```
Step 1: Handle Missing Values
        ↓
Step 2: Remove Duplicates
        ↓
Step 3: Standardize Formats
        ↓
Step 4: Handle Outliers
```
## Step 1: Handle Missing Values
### 1.1 Missing Value Strategy by Field Type
| Field Type | Strategy | Fill Value | Why |
|---|---|---|---|
| 三级机构 | Lookup from mapping | From mapping table | Authoritative source |
| 团队简称 | Keep null | '' (empty string) | Optional field |
| 签单/批改保费 | Delete row | N/A | Critical metric |
| 手续费含税 | Fill zero | 0 | Legitimate zero commission |
| 是否续保 | Keep null | '' | Display as "Unknown" |
| String fields | Fill empty | '' | Avoid None errors |
| Numeric fields | Delete or 0 | Depends on field | Case by case |
### 1.2 Implementation
```python
def handle_missing_values(df, staff_mapping):
    """Apply missing value strategy"""
    # 1. 三级机构 - lookup from mapping
    if '三级机构' in df.columns:
        for idx in df[df['三级机构'].isnull()].index:
            staff = df.at[idx, '业务员']
            mapped_info = lookup_staff_info(staff, staff_mapping)
            if mapped_info:
                df.at[idx, '三级机构'] = mapped_info['三级机构']

    # 2. 手续费 - fill zero
    if '手续费含税' in df.columns:
        df['手续费含税'] = df['手续费含税'].fillna(0)

    # 3. 签单保费 - delete missing rows
    before = len(df)
    df = df[df['签单/批改保费'].notnull()]
    after = len(df)
    if before > after:
        print(f"Deleted {before - after} rows with missing premium")

    # 4. String fields - fill empty
    string_cols = df.select_dtypes(include=['object']).columns
    df[string_cols] = df[string_cols].fillna('')

    return df
```
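The helper `lookup_staff_info()` used above is defined elsewhere in the project (see the `staff-mapping-management` skill). A minimal sketch, assuming `staff_mapping` is a dict keyed by salesperson name — the structure is an assumption, not the project's actual schema:

```python
import pandas as pd

def lookup_staff_info(staff, staff_mapping):
    """Hypothetical sketch: return the mapped record for a salesperson,
    or None when the name is missing or absent from the mapping table.
    Assumes staff_mapping is {业务员: {'三级机构': ..., ...}}."""
    if staff is None or pd.isnull(staff):
        return None
    return staff_mapping.get(str(staff).strip())
```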
## Step 2: Remove Duplicates
### 2.1 Deduplication Rules

- **Composite Key**: 保单号 + 投保确认时间
- **Keep Strategy**: `keep='last'` (most recent record)
- **Why**: Same policy may have multiple updates (批改)
### 2.2 Implementation
```python
def remove_duplicates(df):
    """Remove duplicate records"""
    # Ensure date is datetime
    df['投保确认时间'] = pd.to_datetime(df['投保确认时间'], errors='coerce')

    # Use duplicated() to avoid type issues
    before = len(df)
    dup_mask = df.duplicated(subset=['保单号', '投保确认时间'], keep='last')
    df = df[~dup_mask]
    after = len(df)
    if before > after:
        print(f"Removed {before - after} duplicates")

    return df
```
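Before dropping anything, it can help to inspect the duplicate groups and confirm the composite key behaves as expected; a quick sketch using `keep=False`, which marks every member of each duplicate group rather than only the rows that would be dropped:

```python
# Inspect duplicate groups before removal (keep=False marks all members)
dups = df[df.duplicated(subset=['保单号', '投保确认时间'], keep=False)]
if len(dups) > 0:
    print(f"{dups['保单号'].nunique()} policies have duplicate records")
    print(dups.sort_values(['保单号', '投保确认时间']).head(10))
```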
## Step 3: Standardize Formats
### 3.1 Date Standardization

**Target Format**: `datetime64[ns]`
```python
def standardize_dates(df):
    """Convert all date fields to datetime"""
    date_cols = ['刷新时间', '投保确认时间', '保险起期']
    for col in date_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')
            invalid = df[col].isnull().sum()
            if invalid > 0:
                print(f"⚠️ {col}: {invalid} invalid dates converted to NaT")
    return df
```
### 3.2 Numeric Standardization

**Target Format**: `float64`
```python
def standardize_numerics(df):
    """Convert numeric fields to float"""
    numeric_cols = ['签单/批改保费', '签单数量', '手续费', '手续费含税', '增值税']
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    return df
```
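Note that `errors='coerce'` silently turns any unparseable string into NaN, including values with thousands separators such as `1,234.56`, which are common in Excel exports. If your source may contain them, stripping the separators first avoids losing legitimate values; a sketch under that assumption:

```python
def standardize_numerics_safe(df):
    """Variant sketch: strip thousands separators before conversion so
    values like '1,234.56' are not coerced to NaN."""
    numeric_cols = ['签单/批改保费', '签单数量', '手续费', '手续费含税', '增值税']
    for col in numeric_cols:
        if col in df.columns:
            cleaned = df[col].astype(str).str.replace(',', '', regex=False)
            df[col] = pd.to_numeric(cleaned, errors='coerce')
    return df
```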
### 3.3 String Standardization

**Rules**:
- Strip leading/trailing whitespace
- Map variations (Y/N → 是/否)
```python
def standardize_strings(df):
    """Clean string fields"""
    # Strip whitespace
    string_cols = df.select_dtypes(include=['object']).columns
    for col in string_cols:
        df[col] = df[col].astype(str).str.strip()

    # Standardize yes/no fields
    yes_no_cols = ['是否续保', '是否新能源', '是否过户车', '是否异地车']
    for col in yes_no_cols:
        if col in df.columns:
            df[col] = df[col].map({
                'Y': '是', 'N': '否',
                'y': '是', 'n': '否',
                '1': '是', '0': '否'
            }).fillna(df[col])
    return df
```
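One caveat: `astype(str)` turns true NaN values into the literal string `'nan'`. Step 1 fills string fields with `''` beforehand, which avoids this, but if you run `standardize_strings()` on its own, a quick spot-check is cheap:

```python
# Spot-check for literal 'nan' strings left behind when standardize_strings
# runs before handle_missing_values has filled the NaN values
string_cols = df.select_dtypes(include=['object']).columns
leaked = {c: int((df[c] == 'nan').sum()) for c in string_cols if (df[c] == 'nan').any()}
if leaked:
    print(f"⚠️ Literal 'nan' strings found - run handle_missing_values first: {leaked}")
```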
## Step 4: Handle Outliers
### 4.1 Outlier Detection Rules

**Important**: NEVER delete negative premiums (legitimate business data)
| Outlier Type | Detection Rule | Action |
|---|---|---|
| Negative premium | < 0 | ✅ KEEP (refunds/adjustments) |
| Extreme premium | < -1M or > 100K | ⚠️ Flag only |
| Zero commission | == 0 | ✅ KEEP (normal) |
| Negative amount | 保额 < 0 | ⚠️ Flag only |
### 4.2 Implementation
```python
def detect_outliers(df):
    """Detect outliers but DO NOT delete"""
    outliers = {}

    # 1. Extreme premium (flag only)
    if '签单/批改保费' in df.columns:
        extreme = df[
            (df['签单/批改保费'] < -1000000) |
            (df['签单/批改保费'] > 100000)
        ]
        if len(extreme) > 0:
            outliers['extreme_premium'] = {
                'count': len(extreme),
                'samples': extreme['保单号'].head(5).tolist()
            }

    # 2. Negative amounts (flag only)
    if '签单/批改保额' in df.columns:
        negative_amt = df[df['签单/批改保额'] < 0]
        if len(negative_amt) > 0:
            outliers['negative_amount'] = {
                'count': len(negative_amt),
                'samples': negative_amt['保单号'].head(5).tolist()
            }

    # Report outliers without deleting
    if outliers:
        print("⚠️ Detected outliers (kept in data):")
        for key, info in outliers.items():
            print(f" - {key}: {info['count']} records")

    return df, outliers
```
**Critical Rule**: Never filter out negative premiums:

```python
# ❌ WRONG - DO NOT DO THIS
df = df[df['签单/批改保费'] > 0]

# ✅ CORRECT - Keep all values
total_premium = df['签单/批改保费'].sum()  # May be negative
```
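If a total looks surprisingly low, splitting it into positive and negative components makes refunds and adjustments visible without dropping them; a minimal sketch:

```python
# Report positive and negative premium components separately
premium = df['签单/批改保费']
print(f"Gross premium: {premium[premium > 0].sum():,.2f}")
print(f"Refunds/adjustments: {premium[premium < 0].sum():,.2f}")
print(f"Net total: {premium.sum():,.2f}")
```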
## Complete Cleaning Pipeline

### All-in-One Function
```python
def clean_data(df, staff_mapping):
    """Complete cleaning pipeline"""
    print("Starting data cleaning pipeline...")
    initial_count = len(df)

    # Step 1: Handle missing values
    df = handle_missing_values(df, staff_mapping)
    print("✓ Step 1: Handled missing values")

    # Step 2: Remove duplicates
    df = remove_duplicates(df)
    print("✓ Step 2: Removed duplicates")

    # Step 3: Standardize formats
    df = standardize_dates(df)
    df = standardize_numerics(df)
    df = standardize_strings(df)
    print("✓ Step 3: Standardized formats")

    # Step 4: Detect outliers (no deletion)
    df, outliers = detect_outliers(df)
    print("✓ Step 4: Detected outliers")

    final_count = len(df)
    print(f"\nCleaning complete: {initial_count} → {final_count} records")
    return df, outliers
```
## Common Use Cases
### Case 1: "Clean my CSV file"

```python
import pandas as pd
import json

# Load data
df = pd.read_csv('data.csv', encoding='utf-8-sig')
with open('staff_mapping.json', encoding='utf-8') as f:
    mapping = json.load(f)

# Run full pipeline
df_clean, outliers = clean_data(df, mapping)

# Save cleaned data
df_clean.to_csv('data_cleaned.csv', index=False, encoding='utf-8-sig')
```
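The same pipeline works for Excel sources; a sketch assuming an .xlsx export (`pd.read_excel` needs the openpyxl package for .xlsx files; the file name is illustrative):

```python
# Excel variant of Case 1
df = pd.read_excel('data.xlsx')  # requires openpyxl for .xlsx files
df_clean, outliers = clean_data(df, mapping)
df_clean.to_excel('data_cleaned.xlsx', index=False)
```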
### Case 2: "Handle missing institution fields"

```python
# Focus on Step 1 - missing value handling
missing_before = df['三级机构'].isnull().sum()
df = handle_missing_values(df, staff_mapping)

# Check results
missing_after = df['三级机构'].isnull().sum()
print(f"Fixed {missing_before - missing_after} missing institutions")
```
### Case 3: "Remove duplicate policies"

```python
# Focus on Step 2 - deduplication
df_unique = remove_duplicates(df)
print(f"Removed {len(df) - len(df_unique)} duplicates")
```
## Cleaning Checklist

**Before cleaning:**
- Backup original data
- Load staff mapping file
- Check file encoding (use `utf-8-sig` for Excel exports)

**During cleaning:**
- Handle missing values (Step 1)
- Remove duplicates (Step 2)
- Standardize formats (Step 3)
- Detect outliers (Step 4)

**After cleaning:**
- Verify record count change is reasonable (see the sketch after this list)
- Check critical fields are complete
- Review outlier report
- Save cleaned data with new filename
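A small helper covering the "after cleaning" checks; a sketch only — the 5% drop threshold and the list of critical fields are illustrative assumptions, not project rules:

```python
def verify_cleaning(df_original, df_clean, max_drop_ratio=0.05):
    """Sanity-check the cleaned frame: row loss within bounds and
    critical fields complete. Threshold is an illustrative assumption."""
    dropped = len(df_original) - len(df_clean)
    ratio = dropped / max(len(df_original), 1)
    print(f"Dropped {dropped} rows ({ratio:.1%})")
    if ratio > max_drop_ratio:
        print("⚠️ Unusually large row loss - review before saving")
    for col in ['保单号', '签单/批改保费']:  # assumed critical fields
        if col in df_clean.columns:
            missing = df_clean[col].isnull().sum()
            if missing > 0:
                print(f"⚠️ {col}: {missing} missing values remain")
```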
## Troubleshooting
"Many records deleted"
Check: Are you accidentally deleting negative premiums?
# Check negative premium count
negative_count = (df['签单/批改保费'] < 0).sum()
print(f"Negative premiums: {negative_count} (should be kept)")
"Date conversion creates many NaT"
Solution: Check date format
# Inspect date column
print(df['投保确认时间'].head())
# Try different format
df['投保确认时间'] = pd.to_datetime(
df['投保确认时间'],
format='%Y/%m/%d', # Adjust format
errors='coerce'
)
"Duplicates not removed"
Check: Ensure date column is datetime type
print(df['投保确认时间'].dtype) # Should be datetime64[ns]
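Another common cause is invisible whitespace in 保单号: two rows with the same policy number compare unequal if one copy carries a trailing space. Stripping the key column before deduplicating again is a cheap fix:

```python
# Strip whitespace from the composite-key column, then deduplicate again
df['保单号'] = df['保单号'].astype(str).str.strip()
df = remove_duplicates(df)
```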
## Related Files

**Data processor**: `backend/data_processor.py`
- See `_clean_data()` method (lines 132-156)
- See `merge_with_existing()` method (lines 158-192)

**Field definitions**: `docs/FIELD_MAPPING.md`

**Related Skills**:
- `field-validation` - Run this BEFORE cleaning
- `staff-mapping-management` - Update mapping table
---

**Skill Version**: v1.0 | **Created**: 2025-11-09 | **File Size**: ~310 lines | **Focuses On**: Data cleaning only (not validation or mapping)