Data Quality Planning
When to Use This Skill
Use this skill when:
- Data Quality Planning tasks - Working on define data quality rules, profiling strategies, validation frameworks, and quality metrics
- Planning or design - Need guidance on Data Quality Planning approaches
- Best practices - Want to follow established patterns and standards
Overview
Data quality planning establishes rules, processes, and metrics to ensure data is fit for its intended purpose. Quality dimensions help categorize and measure data health.
Data Quality Dimensions
Six Core Dimensions
| Dimension |
Definition |
Example Metric |
| Accuracy |
Data correctly represents reality |
% records matching source of truth |
| Completeness |
All required data is present |
% non-null required fields |
| Consistency |
Data agrees across systems |
% matching cross-system values |
| Timeliness |
Data is available when needed |
Avg latency from source to target |
| Uniqueness |
No duplicate records |
% unique on key columns |
| Validity |
Data conforms to rules |
% records passing validation |
Quality Dimension Matrix
# Quality Assessment: Customer Domain
| Dimension | Weight | Target | Current | Gap |
|-----------|--------|--------|---------|-----|
| Accuracy | 25% | 99% | 97% | -2% |
| Completeness | 25% | 98% | 95% | -3% |
| Consistency | 20% | 99% | 94% | -5% |
| Timeliness | 10% | 99% | 99% | 0% |
| Uniqueness | 15% | 100% | 98% | -2% |
| Validity | 5% | 99% | 96% | -3% |
| **Overall** | 100% | 98.6% | 96.2% | -2.4% |
Data Profiling
Profiling Types
| Type |
Purpose |
Output |
| Column |
Understand data distribution |
Min, Max, Null %, Distinct count |
| Cross-column |
Find relationships |
Correlations, functional dependencies |
| Cross-table |
Validate referential integrity |
Orphan records, FK violations |
| Cross-system |
Compare across sources |
Discrepancies, sync issues |
Profiling Template
# Column Profile: customers.email
## Statistics
| Metric | Value |
|--------|-------|
| Total Rows | 1,250,000 |
| Distinct Values | 1,180,000 |
| Null Count | 12,500 (1%) |
| Empty String | 2,340 (0.2%) |
| Min Length | 5 |
| Max Length | 254 |
| Avg Length | 24.3 |
## Pattern Analysis
| Pattern | Count | Example |
|---------|-------|---------|
| \w+@\w+\.\w+ | 1,185,000 | user@domain.com |
| NULL | 12,500 | NULL |
| Invalid format | 50,160 | user@domain, @domain.com |
## Value Distribution
| Domain | Count | % |
|--------|-------|---|
| gmail.com | 312,000 | 26% |
| outlook.com | 187,500 | 15% |
| company.com | 125,000 | 10% |
| Other | 612,500 | 49% |
## Issues Found
- 4% invalid email format
- 1% null values (required field)
- 5.6% duplicate emails
Validation Rules
Rule Categories
| Category |
Description |
Example |
| Format |
Pattern matching |
Email regex, phone format |
| Range |
Value boundaries |
Age 0-120, price > 0 |
| Referential |
FK constraints |
Order.customer_id exists |
| Business |
Domain logic |
Discount <= 50% |
| Cross-field |
Field relationships |
End date >= Start date |
| Aggregate |
Group-level |
Daily sales > $1000 |
Rule Definition Template
# Validation Rule: VR-CUST-001
## Metadata
- Name: Valid Email Format
- Domain: Customer
- Severity: Error
- Owner: Customer Data Steward
## Rule Definition
- Field: email
- Type: Format
- Pattern: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
## Threshold
- Error: < 99% compliance
- Warning: < 99.5% compliance
- Target: 99.9% compliance
## Remediation
1. Flag invalid emails for review
2. Attempt auto-correction (lowercase, trim)
3. Queue for customer contact if unresolvable
SQL Validation Examples
-- Completeness check
SELECT
'email' AS column_name,
COUNT(*) AS total_rows,
COUNT(email) AS non_null_rows,
CAST(COUNT(email) * 100.0 / COUNT(*) AS DECIMAL(5,2)) AS completeness_pct
FROM customers;
-- Uniqueness check
SELECT
'customer_id' AS column_name,
COUNT(*) AS total_rows,
COUNT(DISTINCT customer_id) AS unique_values,
CAST(COUNT(DISTINCT customer_id) * 100.0 / COUNT(*) AS DECIMAL(5,2)) AS uniqueness_pct
FROM customers;
-- Format validation
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN email LIKE '%_@_%.__%' THEN 1 ELSE 0 END) AS valid_format,
SUM(CASE WHEN email NOT LIKE '%_@_%.__%' THEN 1 ELSE 0 END) AS invalid_format
FROM customers;
-- Referential integrity
SELECT
o.order_id,
o.customer_id
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
-- Business rule
SELECT *
FROM orders
WHERE discount_percentage > 50
OR total_amount < 0
OR order_date > GETDATE();
C# Validation Framework
public interface IValidationRule<T>
{
string RuleName { get; }
ValidationSeverity Severity { get; }
ValidationResult Validate(T entity);
}
public class EmailFormatRule : IValidationRule<Customer>
{
private static readonly Regex EmailPattern = new(
@"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
RegexOptions.Compiled);
public string RuleName => "VR-CUST-001";
public ValidationSeverity Severity => ValidationSeverity.Error;
public ValidationResult Validate(Customer entity)
{
if (string.IsNullOrEmpty(entity.Email))
{
return ValidationResult.Fail(RuleName, "Email is required");
}
if (!EmailPattern.IsMatch(entity.Email))
{
return ValidationResult.Fail(RuleName,
$"Invalid email format: {entity.Email}");
}
return ValidationResult.Pass(RuleName);
}
}
public class DataQualityValidator<T>
{
private readonly IEnumerable<IValidationRule<T>> _rules;
private readonly ILogger _logger;
public async Task<QualityReport> ValidateAsync(
IEnumerable<T> records,
CancellationToken ct)
{
var report = new QualityReport
{
StartTime = DateTime.UtcNow,
TotalRecords = 0
};
var ruleResults = _rules.ToDictionary(
r => r.RuleName,
r => new RuleResult { RuleName = r.RuleName });
foreach (var record in records)
{
report.TotalRecords++;
foreach (var rule in _rules)
{
var result = rule.Validate(record);
var ruleResult = ruleResults[rule.RuleName];
if (result.IsValid)
{
ruleResult.PassCount++;
}
else
{
ruleResult.FailCount++;
ruleResult.Failures.Add(result);
}
}
}
report.RuleResults = ruleResults.Values.ToList();
report.EndTime = DateTime.UtcNow;
report.OverallScore = CalculateOverallScore(ruleResults.Values);
return report;
}
}
Quality Metrics Dashboard
KPIs Template
# Data Quality Dashboard
## Overall Score
| Domain | Score | Trend | Status |
|--------|-------|-------|--------|
| Customer | 96.2% | ↑ +0.5% | 🟡 Warning |
| Product | 98.5% | ↑ +0.1% | 🟢 Healthy |
| Order | 99.1% | → 0% | 🟢 Healthy |
| Inventory | 94.8% | ↓ -1.2% | 🔴 Critical |
## Rule Compliance
| Rule ID | Rule Name | Target | Actual | Status |
|---------|-----------|--------|--------|--------|
| VR-CUST-001 | Valid Email | 99% | 95.4% | 🔴 |
| VR-CUST-002 | Unique SSN | 100% | 99.9% | 🟢 |
| VR-ORD-001 | Valid Total | 100% | 100% | 🟢 |
| VR-INV-001 | Positive Qty | 100% | 97.2% | 🔴 |
## Trend (Last 30 Days)
| Week | Customer | Product | Order | Inventory |
|------|----------|---------|-------|-----------|
| W1 | 95.7% | 98.4% | 99.1% | 96.0% |
| W2 | 95.9% | 98.5% | 99.1% | 95.5% |
| W3 | 96.0% | 98.5% | 99.1% | 95.0% |
| W4 | 96.2% | 98.5% | 99.1% | 94.8% |
Data Cleansing Patterns
Cleansing Operations
| Operation |
Description |
Example |
| Standardization |
Consistent format |
"USA" → "United States" |
| Deduplication |
Remove duplicates |
Merge customer records |
| Enrichment |
Add missing data |
Append geocodes |
| Correction |
Fix known errors |
Typo correction |
| Imputation |
Fill missing values |
Mean/median substitution |
Cleansing Pipeline
public class CustomerCleansingPipeline
{
public Customer Cleanse(Customer raw)
{
var cleansed = new Customer
{
CustomerId = raw.CustomerId,
// Standardization
Email = raw.Email?.Trim().ToLowerInvariant(),
Phone = StandardizePhone(raw.Phone),
// Formatting
FirstName = ToTitleCase(raw.FirstName?.Trim()),
LastName = ToTitleCase(raw.LastName?.Trim()),
// Enrichment
State = LookupStateFromZip(raw.PostalCode) ?? raw.State,
// Validation
PostalCode = ValidatePostalCode(raw.PostalCode)
? raw.PostalCode
: null // Mark for review
};
return cleansed;
}
private string StandardizePhone(string phone)
{
if (string.IsNullOrEmpty(phone)) return null;
var digits = new string(phone.Where(char.IsDigit).ToArray());
return digits.Length switch
{
10 => $"+1-{digits[..3]}-{digits[3..6]}-{digits[6..]}",
11 when digits[0] == '1' => $"+{digits[0]}-{digits[1..4]}-{digits[4..7]}-{digits[7..]}",
_ => phone // Return original if non-standard
};
}
}
Quality Monitoring
Alerting Rules
# Quality Alert Configuration
## Critical Alerts (Immediate)
| Condition | Action |
|-----------|--------|
| Overall score < 90% | Page on-call, Slack #data-critical |
| Rule failure > 10% | Email data steward, create incident |
| New duplicate rate > 1% | Slack #data-ops |
## Warning Alerts (Daily Digest)
| Condition | Action |
|-----------|--------|
| Score decrease > 2% | Include in daily report |
| Approaching threshold | Email data steward |
| Trend declining 3+ days | Escalate to owner |
## Informational (Weekly Report)
| Condition | Action |
|-----------|--------|
| All scores stable | Include in weekly summary |
| Improvements noted | Celebrate in report |
Validation Checklist
Integration Points
Inputs from:
data-governance skill → Quality standards
conceptual-modeling skill → Business rules
er-modeling skill → Constraints and relationships
Outputs to:
migration-planning skill → Validation steps
- ETL/ELT pipelines → Quality gates
- Data catalog → Quality metadata
- Dashboards → Quality metrics