| name | dataset-comparer |
| description | Compare two datasets to find differences: added rows, removed rows, and changed values. Use for data validation, ETL verification, or tracking changes. |
Dataset Comparer
Compare two CSV/Excel datasets to identify differences, additions, deletions, and value changes.
Features
- Row Comparison: Find added, removed, and matching rows
- Value Changes: Detect changed values in matching rows
- Column Comparison: Identify schema differences
- Statistics: Summary counts of added, removed, changed, and unchanged rows
- Diff Reports: HTML, CSV, and JSON output
- Flexible Matching: Compare by key columns or row position
Quick Start
from dataset_comparer import DatasetComparer
comparer = DatasetComparer()
comparer.load("old_data.csv", "new_data.csv")
# Compare by key column
diff = comparer.compare(key_columns=["id"])
print(f"Added rows: {diff['added_count']}")
print(f"Removed rows: {diff['removed_count']}")
print(f"Changed rows: {diff['changed_count']}")
# Generate report
comparer.generate_report("diff_report.html")
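For a quick overview without building a full report, the summary() and to_dataframe() methods listed in the API reference below can be used after compare(); exactly what they print or return is up to the implementation, so treat this as a sketch:
# Text overview of the comparison (exact wording depends on the implementation)
print(comparer.summary())
# All differences as a single DataFrame for further analysis
diff_df = comparer.to_dataframe()
print(diff_df.head())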
CLI Usage
# Basic comparison
python dataset_comparer.py --old old.csv --new new.csv
# Compare by key column
python dataset_comparer.py --old old.csv --new new.csv --key id
# Multiple key columns
python dataset_comparer.py --old old.csv --new new.csv --key id,date
# Generate HTML report
python dataset_comparer.py --old old.csv --new new.csv --key id --report diff.html
# Export differences to CSV
python dataset_comparer.py --old old.csv --new new.csv --key id --output diff.csv
# JSON output
python dataset_comparer.py --old old.csv --new new.csv --key id --json
# Ignore specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --ignore updated_at,modified_date
# Compare only specific columns
python dataset_comparer.py --old old.csv --new new.csv --key id --columns name,email,status
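The CLI can also be scripted for scheduled checks. A minimal sketch using the flags documented above; it assumes --output writes one row per difference with a header row (the CLI's exit-code behavior is not specified here, so the script inspects the output file instead):
import subprocess
import pandas as pd

# Run the comparison and export differences to CSV (flags as documented above).
subprocess.run(
    ["python", "dataset_comparer.py",
     "--old", "old.csv", "--new", "new.csv",
     "--key", "id", "--output", "diff.csv"],
    check=True,
)

# Assumption: diff.csv always has a header and one row per detected difference.
diff = pd.read_csv("diff.csv")
if not diff.empty:
    print(f"{len(diff)} differences found - review diff.csv")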
API Reference
DatasetComparer Class
class DatasetComparer:
def __init__(self)
# Data loading
def load(self, old_path: str, new_path: str) -> 'DatasetComparer'
def load_dataframes(self, old_df: pd.DataFrame,
new_df: pd.DataFrame) -> 'DatasetComparer'
# Comparison
def compare(self, key_columns: list = None,
ignore_columns: list = None,
compare_columns: list = None) -> dict
# Detailed results
def get_added_rows(self) -> pd.DataFrame
def get_removed_rows(self) -> pd.DataFrame
def get_changed_rows(self) -> pd.DataFrame
def get_unchanged_rows(self) -> pd.DataFrame
# Schema comparison
def compare_schema(self) -> dict
# Reports
def generate_report(self, output: str, format: str = "html") -> str
def to_dataframe(self) -> pd.DataFrame
def summary(self) -> str
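load_dataframes skips file I/O and compares DataFrames that are already in memory, which is useful when the data comes from a database query or an earlier pipeline step. A short sketch, assuming compare() behaves the same as for files loaded with load():
import pandas as pd
from dataset_comparer import DatasetComparer

# Example data built in memory instead of loaded from CSV files.
old_df = pd.DataFrame({"id": [1, 2, 3], "status": ["new", "open", "open"]})
new_df = pd.DataFrame({"id": [2, 3, 4], "status": ["open", "closed", "new"]})

comparer = DatasetComparer()
comparer.load_dataframes(old_df, new_df)
diff = comparer.compare(key_columns=["id"])
print(diff["summary"])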
Comparison Methods
Key-Based Comparison
Compare rows by matching key columns (like primary keys):
diff = comparer.compare(key_columns=["customer_id"])
# Multiple keys for composite matching
diff = comparer.compare(key_columns=["order_id", "product_id"])
Position-Based Comparison
Compare rows by their position (row number):
diff = comparer.compare() # No keys = positional comparison
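Without keys, rows are matched purely by position, so the result is only meaningful when both files keep the same row order. The equivalent idea in plain pandas (an illustration, not the library's internals), assuming both files share the same columns:
import pandas as pd

old_df = pd.read_csv("old_data.csv")
new_df = pd.read_csv("new_data.csv")

# Rows present in both files by position, compared cell by cell
# (DataFrame.compare requires identical column names).
common = min(len(old_df), len(new_df))
changed_cells = old_df.iloc[:common].compare(new_df.iloc[:common])

# Extra trailing rows count as added (new file longer) or removed (old file longer).
added = new_df.iloc[common:]
removed = old_df.iloc[common:]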
Output Format
Comparison Result
{
"summary": {
"old_rows": 1000,
"new_rows": 1050,
"added_count": 75,
"removed_count": 25,
"changed_count": 50,
"unchanged_count": 900,
"total_differences": 150
},
"schema_changes": {
"added_columns": ["new_field"],
"removed_columns": ["old_field"],
"type_changes": [
{"column": "amount", "old_type": "int64", "new_type": "float64"}
]
},
"key_columns": ["id"],
"compared_columns": ["name", "email", "status"],
"ignored_columns": ["updated_at"]
}
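The summary block is convenient for gating automated checks. For example, assuming the structure shown above:
import sys
from dataset_comparer import DatasetComparer

comparer = DatasetComparer()
comparer.load("old.csv", "new.csv")
diff = comparer.compare(key_columns=["id"])
summary = diff["summary"]

# Fail an automated check when anything differs between the two datasets.
if summary["total_differences"] > 0:
    print(f"{summary['added_count']} added, {summary['removed_count']} removed, "
          f"{summary['changed_count']} changed")
    sys.exit(1)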
Changed Row Details
changes = comparer.get_changed_rows()
# Returns DataFrame with columns:
# _key: Key value(s) for the row
# _column: Column that changed
# _old_value: Original value
# _new_value: New value
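Because the result comes back as one row per changed cell, standard pandas filtering applies. Continuing from the call above, and assuming the column layout documented here:
# Only changes to the email column
email_changes = changes[changes["_column"] == "email"]

# Count changes per column to spot which fields drift the most
print(changes["_column"].value_counts())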
Schema Comparison
Compare column structure:
schema = comparer.compare_schema()
# Returns:
{
"old_columns": ["id", "name", "old_field"],
"new_columns": ["id", "name", "new_field"],
"common_columns": ["id", "name"],
"added_columns": ["new_field"],
"removed_columns": ["old_field"],
"type_changes": [
{"column": "price", "old_type": "int64", "new_type": "float64"}
],
"old_row_count": 1000,
"new_row_count": 1050
}
Filtering Options
Ignore Columns
Skip certain columns during comparison:
diff = comparer.compare(
key_columns=["id"],
ignore_columns=["updated_at", "modified_by", "timestamp"]
)
Compare Specific Columns
Only compare selected columns:
diff = comparer.compare(
key_columns=["id"],
compare_columns=["name", "email", "status"] # Only these columns
)
Report Formats
HTML Report
comparer.generate_report("diff_report.html", format="html")
Features:
- Summary statistics
- Interactive tables
- Color-coded changes (green=added, red=removed, yellow=changed)
- Schema comparison section
CSV Export
comparer.generate_report("diff_report.csv", format="csv")
Includes all differences in tabular format.
JSON Output
comparer.generate_report("diff_report.json", format="json")
Complete diff data in JSON format.
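The JSON report can be consumed by other tools. A small sketch, assuming the report mirrors the comparison-result structure shown earlier:
import json

with open("diff_report.json") as f:
    report = json.load(f)

# Assumption: the report contains the same summary block as compare().
print(report["summary"]["total_differences"])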
Example Workflows
Data Migration Validation
comparer = DatasetComparer()
comparer.load("source_data.csv", "migrated_data.csv")
diff = comparer.compare(key_columns=["id"])
if diff["summary"]["total_differences"] == 0:
print("Migration successful - no differences!")
else:
print(f"Found {diff['summary']['total_differences']} differences")
comparer.generate_report("migration_issues.html")
ETL Pipeline Verification
comparer = DatasetComparer()
comparer.load("yesterday.csv", "today.csv")
diff = comparer.compare(
key_columns=["transaction_id"],
ignore_columns=["processing_timestamp"]
)
# Check for unexpected changes
changed = comparer.get_changed_rows()
if len(changed) > 0:
    print("Warning: Historical records changed!")
    print(changed)
Incremental Update Detection
comparer = DatasetComparer()
comparer.load("last_sync.csv", "current.csv")
diff = comparer.compare(key_columns=["customer_id"])
# Get new records for processing
new_records = comparer.get_added_rows()
print(f"New records to process: {len(new_records)}")
# Get deleted records
deleted = comparer.get_removed_rows()
print(f"Records to deactivate: {len(deleted)}")
Schema Change Detection
comparer = DatasetComparer()
comparer.load("v1_export.csv", "v2_export.csv")
schema = comparer.compare_schema()
if schema["added_columns"]:
print(f"New columns: {schema['added_columns']}")
if schema["removed_columns"]:
print(f"Removed columns: {schema['removed_columns']}")
if schema["type_changes"]:
for change in schema["type_changes"]:
print(f"Type change: {change['column']} "
f"({change['old_type']} -> {change['new_type']})")
Large Dataset Tips
For very large datasets:
# Load both files (chunked pre-filtering for files that don't fit in memory is sketched below)
comparer = DatasetComparer()
comparer.load("large_old.csv", "large_new.csv")
# Use key column for efficient matching
diff = comparer.compare(key_columns=["id"])
# Export only differences (not full data)
comparer.generate_report("diff_only.csv", format="csv")
Dependencies
- pandas>=2.0.0
- numpy>=1.24.0
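Both can be installed with pip:
pip install "pandas>=2.0.0" "numpy>=1.24.0"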