
validation-pipeline

@jennifer-mckinney/my-skills

Build multi-stage validation pipelines for AI systems requiring 99.9%+ accuracy (FDA/EMA regulatory requirements). Includes data quality checks, citation verification, hallucination detection, confidence scoring, and multi-stage validation workflows for healthcare, legal, and financial AI applications.

Install Skill

  1. Download the skill
  2. Enable skills in Claude: open claude.ai/settings/capabilities and find the "Skills" section
  3. Upload to Claude: click "Upload skill" and select the downloaded ZIP file

Note: Please verify the skill by reviewing its instructions before using it.

SKILL.md

name: validation-pipeline
description: Build multi-stage validation pipelines for AI systems requiring 99.9%+ accuracy (FDA/EMA regulatory requirements). Includes data quality checks, citation verification, hallucination detection, confidence scoring, and multi-stage validation workflows for healthcare, legal, and financial AI applications.

Validation Pipeline Skill

Expert assistance for building, implementing, and maintaining validation pipelines for high-accuracy AI systems that require regulatory compliance and zero tolerance for errors.

What This Skill Provides

Core Tools

  • validate_data_quality.py - Pre-generation data quality validation
  • validate_citations.py - Source attribution and citation verification
  • calculate_confidence_scores.py - Multi-dimensional confidence scoring
  • detect_hallucinations.py - Post-generation hallucination detection
  • run_validation_pipeline.py - Orchestrate complete multi-stage validation

Reference Documentation

  • patterns.md - Multi-stage validation pipeline patterns
  • best_practices.md - Confidence scoring, threshold tuning, error analysis
  • advanced_topics.md - Hallucination detection, consistency reasoning, ensemble validation
  • troubleshooting.md - False positives/negatives, edge cases, debugging
  • validation_frameworks_comparison.md - Custom pipelines vs Guardrails vs NeMo Guardrails
  • regulatory_requirements.md - FDA/EMA compliance, audit trails, documentation

Templates

  • validation_pipeline.py - Complete multi-stage validation pipeline
  • confidence_scorer.py - Configurable confidence scoring system
  • citation_checker.py - Source attribution verification
  • consistency_checker.py - Cross-validation and consistency checks
  • validation_tests.py - Comprehensive test suite for validation systems

When to Use This Skill

Perfect For:

  • Building AI systems requiring high accuracy and reliability
  • Implementing zero-hallucination guardrails
  • Building citation-backed AI responses
  • Creating multi-stage validation workflows
  • Implementing confidence scoring for AI outputs
  • Building audit trails for AI decisions
  • Detecting and preventing hallucinations
  • Systems requiring rigorous quality control

Not For:

  • Simple chatbots without accuracy requirements
  • Prototype/demo systems without production requirements
  • Systems where occasional errors are acceptable
  • Content generation for entertainment
  • Low-stakes applications without quality requirements

Quick Start Workflows

Workflow 1: Build Complete Validation Pipeline

For building high-accuracy AI systems with rigorous quality control:

# Step 1: Initialize validation pipeline
cp templates/validation_pipeline.py validation_pipeline.py
# Edit configuration: thresholds, weights, validators

# Step 2: Configure confidence scoring
cp templates/confidence_scorer.py confidence_scorer.py
# Set scoring dimensions: citation_quality, semantic_match, source_authority

# Step 3: Set up citation verification
cp templates/citation_checker.py citation_checker.py
# Configure source validation rules

# Step 4: Implement consistency checking
cp templates/consistency_checker.py consistency_checker.py
# Define consistency rules and cross-validation logic

# Step 5: Test the pipeline
cp templates/validation_tests.py test_validation_pipeline.py
pytest test_validation_pipeline.py -v --cov

# Step 6: Run validation on sample data
python scripts/run_validation_pipeline.py \
  --input sample_responses.json \
  --config validation_config.yaml \
  --output validation_results.json

Expected Results:

  • Pre-generation: Data quality score (0.0-1.0)
  • Generation: Real-time confidence scoring (0.0-1.0)
  • Post-generation: Hallucination check (pass/fail + confidence)
  • Final: Pass/fail decision with detailed audit trail
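
How the stages fit together in code: a minimal orchestration sketch, not the template's actual implementation. The validator callables, result dict keys, and StageResult fields are illustrative assumptions.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class StageResult:
    stage: str
    score: float        # 0.0-1.0
    passed: bool
    details: dict = field(default_factory=dict)

def run_pipeline(response, sources, config, stages):
    """Run (name, validator) pairs in order; short-circuit on the first failure."""
    audit = []
    for name, validate in stages:
        result = validate(response, sources, config)  # {"score": float, "passed": bool, ...}
        audit.append(StageResult(name, result["score"], result["passed"], result))
        if not result["passed"]:
            break
    return {
        "passed": len(audit) == len(stages) and all(r.passed for r in audit),
        "audit_trail": [vars(r) for r in audit],  # detailed, per-stage audit trail
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Toy usage (real validators live in the templates listed above):
def toy_data_quality(response, sources, config):
    score = 1.0 if sources else 0.0
    return {"score": score, "passed": score >= config["min_confidence"]}

print(run_pipeline({"text": "..."}, [{"id": "s1"}], {"min_confidence": 0.95},
                   [("data_quality", toy_data_quality)]))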

Workflow 2: Implement Citation Verification

For systems requiring 100% source attribution:

# Step 1: Validate citations in AI response
python scripts/validate_citations.py \
  --response response.json \
  --sources sources.json \
  --min-confidence 0.95

# Step 2: Check for missing citations
python scripts/validate_citations.py \
  --response response.json \
  --sources sources.json \
  --check-coverage \
  --fail-on-uncited

# Step 3: Verify source accessibility
python scripts/validate_citations.py \
  --response response.json \
  --sources sources.json \
  --verify-sources \
  --timeout 30

# Step 4: Generate audit report
python scripts/validate_citations.py \
  --response response.json \
  --sources sources.json \
  --audit-report audit.json

Validation Checks:

  • Every claim has citation
  • Citations link to actual sources
  • Source content matches claim
  • Sources are accessible/retrievable
  • Citation format is consistent
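
A coverage check reduces to set membership once claims and citations are structured. A minimal sketch, assuming a response format in which each claim carries a list of citation IDs (the actual script's input schema may differ):

def check_citation_coverage(claims, sources):
    """Flag claims with no citation, and citations pointing at unknown sources."""
    source_ids = {source["id"] for source in sources}
    uncited = [claim["text"] for claim in claims if not claim.get("citations")]
    broken = [cid for claim in claims for cid in claim.get("citations", [])
              if cid not in source_ids]
    coverage = 1.0 - len(uncited) / len(claims) if claims else 1.0
    return {"coverage": coverage, "uncited": uncited, "broken": broken,
            "passed": coverage == 1.0 and not broken}

claims = [{"text": "Drug X halved relapse risk in the trial.", "citations": ["s1"]},
          {"text": "It was approved by the EMA in 2021.", "citations": []}]
print(check_citation_coverage(claims, [{"id": "s1"}]))
# coverage 0.5 with one uncited claim -> fails the 100% requirement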

Workflow 3: Detect and Prevent Hallucinations

For zero-hallucination requirements:

# Step 1: Run hallucination detection
python scripts/detect_hallucinations.py \
  --response response.json \
  --sources sources.json \
  --threshold 0.99

# Step 2: Check consistency with sources
python scripts/detect_hallucinations.py \
  --response response.json \
  --sources sources.json \
  --check-consistency \
  --semantic-similarity

# Step 3: Identify unsupported claims
python scripts/detect_hallucinations.py \
  --response response.json \
  --sources sources.json \
  --flag-unsupported \
  --output flagged_claims.json

# Step 4: Generate detailed analysis
python scripts/detect_hallucinations.py \
  --response response.json \
  --sources sources.json \
  --detailed-report \
  --output analysis.json

Detection Methods:

  • Semantic similarity to source content (>0.95 required)
  • Factual consistency checking
  • Entity verification against sources
  • Numerical/date accuracy verification
  • Citation coverage analysis
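
The semantic-similarity check can be sketched with off-the-shelf sentence embeddings. This is one possible implementation, assuming the sentence-transformers package; the bundled script may work differently, and raw cosine thresholds like 0.95 must be re-tuned for whichever embedding model you use (see Pitfall 2):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def flag_unsupported_claims(claims, source_passages, threshold=0.95):
    """Flag any claim whose best cosine similarity against all source
    passages falls below the threshold."""
    claim_emb = model.encode(claims, convert_to_tensor=True)
    source_emb = model.encode(source_passages, convert_to_tensor=True)
    sims = util.cos_sim(claim_emb, source_emb)  # shape: (claims, passages)
    return [{"claim": claim, "best_similarity": float(sims[i].max())}
            for i, claim in enumerate(claims)
            if float(sims[i].max()) < threshold]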

Workflow 4: Calculate Multi-Dimensional Confidence

For quantitative accuracy measurement:

# Step 1: Calculate base confidence score
python scripts/calculate_confidence_scores.py \
  --response response.json \
  --sources sources.json

# Step 2: Apply custom weights
python scripts/calculate_confidence_scores.py \
  --response response.json \
  --sources sources.json \
  --weights config/weights.yaml

# Step 3: Multi-dimensional scoring
python scripts/calculate_confidence_scores.py \
  --response response.json \
  --sources sources.json \
  --dimensions citation,semantic,authority,freshness

# Step 4: Threshold-based filtering
python scripts/calculate_confidence_scores.py \
  --response response.json \
  --sources sources.json \
  --min-threshold 0.95 \
  --fail-below-threshold

Confidence Dimensions:

  • Citation Quality: 0.0-1.0 (completeness, accuracy)
  • Semantic Match: 0.0-1.0 (similarity to sources)
  • Source Authority: 0.0-1.0 (source trustworthiness)
  • Temporal Freshness: 0.0-1.0 (recency of sources)
  • Consistency Score: 0.0-1.0 (cross-source agreement)
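
The dimensions combine into a single score as a weighted average. A minimal sketch, using the weights from the configuration example at the end of this document:

def combined_confidence(scores, weights):
    """Weighted average across scoring dimensions; weights must sum to 1.0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6, "weights must sum to 1.0"
    return sum(weights[dim] * scores[dim] for dim in weights)

weights = {"citation_quality": 0.35, "semantic_match": 0.30,
           "source_authority": 0.20, "temporal_freshness": 0.15}
scores = {"citation_quality": 0.98, "semantic_match": 0.97,
          "source_authority": 0.90, "temporal_freshness": 0.85}
print(round(combined_confidence(scores, weights), 4))  # 0.9415 -> below a 0.95 gate

Note how two mediocre dimensions drag an otherwise strong response below a 0.95 threshold; for regulated systems, that conservatism is the point.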

Workflow 5: Validate Data Quality (Pre-Generation)

For ensuring input quality before generation:

# Step 1: Check data completeness
python scripts/validate_data_quality.py \
  --data input_data.json \
  --check-completeness

# Step 2: Validate data structure
python scripts/validate_data_quality.py \
  --data input_data.json \
  --schema schema.json

# Step 3: Check for data quality issues
python scripts/validate_data_quality.py \
  --data input_data.json \
  --check-duplicates \
  --check-consistency \
  --check-freshness

# Step 4: Generate quality report
python scripts/validate_data_quality.py \
  --data input_data.json \
  --full-check \
  --output quality_report.json

Quality Checks:

  • Schema validation
  • Completeness (missing fields)
  • Duplicate detection
  • Data consistency
  • Temporal validity
  • Format validation
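
Schema validation maps directly onto the jsonschema package. A minimal sketch of the --schema step, assuming JSON input (the script presumably layers the other checks on top):

import json
from jsonschema import Draft7Validator

def schema_errors(data_path, schema_path):
    """Return human-readable schema violations; an empty list means valid."""
    with open(data_path) as f:
        data = json.load(f)
    with open(schema_path) as f:
        schema = json.load(f)
    return ["/".join(map(str, err.absolute_path)) + ": " + err.message
            for err in Draft7Validator(schema).iter_errors(data)]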

Decision Trees

Which Validation Stage Do I Need?

What are you validating?
├─ Input data before generation → Pre-generation validation
│  └─ Use: validate_data_quality.py
├─ AI response accuracy → Post-generation validation
│  ├─ Check citations → validate_citations.py
│  ├─ Detect hallucinations → detect_hallucinations.py
│  └─ Calculate confidence → calculate_confidence_scores.py
└─ Complete pipeline → Multi-stage validation
   └─ Use: run_validation_pipeline.py

Which Confidence Scoring Method?

What level of accuracy is required?
├─ 95-99% accuracy → Simple confidence scoring
│  └─ Citation quality + semantic match
├─ 99-99.9% accuracy → Multi-dimensional scoring
│  └─ Citation + semantic + authority + consistency
└─ 99.9%+ accuracy (regulatory) → Ensemble scoring
   └─ Multiple models + human-in-the-loop + audit trails

How Should I Handle Low Confidence?

Response confidence below threshold?
├─ Confidence 0.9-0.95 → Flag for review
│  └─ Human verification queue
├─ Confidence 0.7-0.9 → Regenerate with better sources
│  └─ Retry with improved prompts
├─ Confidence 0.5-0.7 → Reject and log
│  └─ Return error to user
└─ Confidence <0.5 → Block and alert
   └─ System alert + immediate investigation
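
As straight-line code, this tree is a simple threshold ladder (action names are illustrative):

def route_by_confidence(confidence):
    """Map a confidence score to an action, mirroring the tree above."""
    if confidence >= 0.95:
        return "accept"
    if confidence >= 0.90:
        return "human_review"    # human verification queue
    if confidence >= 0.70:
        return "regenerate"      # retry with improved prompts/sources
    if confidence >= 0.50:
        return "reject_and_log"  # return error to user
    return "block_and_alert"     # system alert + immediate investigation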

Which Hallucination Detection Method?

Type of content?
├─ Factual claims → Semantic similarity + entity verification
├─ Numerical data → Exact match verification + range checks
├─ Citations/references → Source availability + content match
└─ Dates/timelines → Temporal consistency + source verification

Quality Checklist

Essentials (Required for Production)

  • Pre-generation validation implemented

    • Data quality checks
    • Schema validation
    • Completeness verification
  • Generation monitoring configured

    • Real-time confidence scoring
    • Citation tracking
    • Token usage monitoring
  • Post-generation validation implemented

    • Citation verification (100% coverage)
    • Hallucination detection
    • Consistency checking
    • Confidence scoring (>0.95 threshold)
  • Audit trail configured

    • All validation decisions logged
    • Source attribution tracked
    • Confidence scores recorded
    • Timestamps for all operations
  • Error handling implemented

    • Fallback strategies defined
    • User-facing error messages
    • Alert system configured
    • Monitoring dashboards

Best Practices (Recommended)

  • Multi-dimensional confidence scoring

    • Citation quality (0.0-1.0)
    • Semantic match (0.0-1.0)
    • Source authority (0.0-1.0)
    • Temporal freshness (0.0-1.0)
  • Ensemble validation

    • Multiple validation methods
    • Cross-validation of results
    • Disagreement handling
  • Threshold tuning

    • A/B testing of thresholds
    • False positive/negative analysis
    • ROC curve optimization
  • Testing infrastructure

    • Unit tests for each validator
    • Integration tests for pipeline
    • Edge case test suite
    • Performance benchmarks

Advanced (Nice to Have)

  • Active learning

    • Feedback loop from failures
    • Threshold auto-tuning
    • Model fine-tuning on edge cases
  • Advanced hallucination detection

    • Semantic entailment checking
    • Logical consistency reasoning
    • Multi-hop fact verification
  • Performance optimization

    • Caching of validation results
    • Parallel validation execution
    • Incremental validation
  • Regulatory compliance

    • FDA 21 CFR Part 11 compliance
    • GDPR data handling
    • HIPAA audit trails

Common Pitfalls & Solutions

Pitfall 1: Confidence Scores Too Optimistic

Problem: System reports high confidence (>0.95) but produces errors

Solution:

  1. Add penalty for missing citations
  2. Increase semantic similarity threshold (0.95 → 0.98)
  3. Add source authority weighting
  4. Implement cross-validation
  5. Tune weights based on false positive analysis
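
Step 1 can be as simple as a linear deduction. A sketch, where the per-claim penalty is an assumption to tune against your false-positive analysis, not a prescribed constant:

def penalized_confidence(base_score, n_claims, n_cited, penalty_per_claim=0.10):
    """Deduct a fixed penalty for every uncited claim, floored at 0.0."""
    missing = n_claims - n_cited
    return max(0.0, base_score - penalty_per_claim * missing)

print(penalized_confidence(0.97, n_claims=10, n_cited=8))  # 0.77 -> now fails the gate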

Pitfall 2: Too Many False Positives

Problem: Valid responses flagged as hallucinations

Solution:

  1. Lower the hallucination-detection threshold so borderline matches are not flagged
  2. Use semantic similarity instead of exact match
  3. Allow paraphrasing with high similarity (>0.9)
  4. Implement domain-specific rules
  5. Use human feedback to refine detection

Pitfall 3: Validation Too Slow

Problem: Validation takes longer than generation

Solution:

  1. Parallelize validation steps
  2. Cache source content
  3. Use lightweight embedding models
  4. Implement early stopping for high confidence
  5. Optimize citation checking with indexing
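
Step 1 maps directly onto the standard library. A minimal sketch using threads, which suit I/O-bound checks like source fetching (validator names are illustrative; use ProcessPoolExecutor for CPU-bound embedding work):

from concurrent.futures import ThreadPoolExecutor

def validate_parallel(response, sources, validators):
    """Run independent validators concurrently and collect results by name."""
    with ThreadPoolExecutor(max_workers=len(validators)) as pool:
        futures = {name: pool.submit(fn, response, sources)
                   for name, fn in validators.items()}
        return {name: future.result() for name, future in futures.items()}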

Pitfall 4: Citation Coverage Incomplete

Problem: Claims without source attribution

Solution:

  1. Implement strict citation requirements in prompts
  2. Use structured output format with inline citations
  3. Run citation coverage checker
  4. Reject responses with <100% coverage
  5. Fine-tune model on citation examples

Pitfall 5: Inconsistent Confidence Scores

Problem: Same response gets different scores

Solution:

  1. Use deterministic embedding models
  2. Fix random seeds for reproducibility
  3. Normalize scores consistently
  4. Cache validation results
  5. Version validation configuration

Pro Tips

  1. Start with strict thresholds, then tune down: Begin with 0.99 confidence requirement, analyze failures, then adjust. Better than starting permissive and missing issues.

  2. Log everything for debugging: Every validation decision should be logged with: input, output, confidence scores, sources used, timestamp. Critical for debugging and compliance.

  3. Use ensemble validation for critical systems: Run multiple validation methods and require agreement. For 99.9%+ accuracy, use 3+ independent validators.

  4. Implement human-in-the-loop for edge cases: For responses with confidence 0.9-0.95, queue for human review. Build feedback loop to improve thresholds.

  5. Test validation pipeline independently: Validate your validators! Use test suite with known good/bad examples. Track precision/recall metrics.

  6. Cache validation results: Don't re-validate identical responses. Use content hashing to cache validation results and improve performance (see the sketch after this list).

  7. Version your validation configuration: Track threshold changes, weight updates, and rule modifications. Essential for debugging production issues.
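
Tips 6 and 7 combine well: key the cache on a content hash plus the validation-config version, so a threshold change automatically invalidates stale results. A minimal in-memory sketch (swap the dict for Redis or disk in production):

import hashlib
import json

_cache = {}  # in-memory for illustration only

def cache_key(response, config_version):
    """Hash the response content together with the config version."""
    payload = json.dumps(response, sort_keys=True) + config_version
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_validate(response, config_version, validate_fn):
    key = cache_key(response, config_version)
    if key not in _cache:
        _cache[key] = validate_fn(response)  # validate only on a cache miss
    return _cache[key]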

Example Use Cases

Healthcare AI (FDA Compliance)

  • Requirement: 99.9%+ accuracy for clinical decision support
  • Pipeline: Pre-gen (data quality) → Generation (confidence scoring) → Post-gen (citation + hallucination) → Human review (confidence 0.9-0.95)
  • Audit: Full audit trail with source attribution, confidence scores, timestamps

Legal AI (Citation-Critical)

  • Requirement: 100% citation coverage for case law references
  • Pipeline: Pre-gen (source availability) → Generation (inline citations) → Post-gen (citation verification + consistency)
  • Audit: Case law verification, jurisdiction checking, temporal validity

Financial AI (Regulatory Compliance)

  • Requirement: Zero hallucinations for financial advice
  • Pipeline: Pre-gen (data freshness) → Generation (fact verification) → Post-gen (numerical accuracy + source verification)
  • Audit: Source timestamps, calculation verification, compliance logging

Integration with Development Workflow

Development Phase

  1. Build validation pipeline alongside AI system
  2. Use templates to implement validators
  3. Test with known good/bad examples
  4. Tune thresholds using validation set

Testing Phase

  1. Run validation tests (test_validation_pipeline.py)
  2. Analyze false positives/negatives
  3. Benchmark performance (latency, accuracy)
  4. Stress test with edge cases

Production Phase

  1. Enable real-time validation
  2. Monitor confidence score distributions
  3. Track validation failures
  4. Implement alerting for anomalies

Maintenance Phase

  1. Analyze production failures
  2. Tune thresholds based on data
  3. Update validation rules
  4. Retrain detection models

Metrics to Track

Accuracy Metrics

  • Precision: True positives / (True positives + False positives)
  • Recall: True positives / (True positives + False negatives)
  • F1 Score: Harmonic mean of precision and recall
  • Accuracy: (TP + TN) / Total

Confidence Metrics

  • Mean confidence score: Average across all responses
  • Confidence distribution: Histogram of scores
  • Threshold effectiveness: Pass rate at various thresholds
  • Calibration: Confidence vs actual accuracy
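
Calibration is worth making concrete: bucket responses by reported confidence and compare each bucket's mean confidence with its observed accuracy. A minimal sketch:

def calibration_table(confidences, correct, bins=10):
    """Per-bucket (range, count, mean confidence, observed accuracy).
    A well-calibrated validator has the last two columns nearly equal."""
    rows = []
    for b in range(bins):
        lo, hi = b / bins, (b + 1) / bins
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == bins - 1 and c == 1.0)]
        if idx:
            rows.append((lo, hi, len(idx),
                         sum(confidences[i] for i in idx) / len(idx),
                         sum(1 for i in idx if correct[i]) / len(idx)))
    return rows

A bucket reporting 0.97 mean confidence but only 90% observed accuracy is exactly the over-optimism described in Pitfall 1.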

Performance Metrics

  • Validation latency: Time to validate response
  • Throughput: Validations per second
  • Cache hit rate: Cached validation reuse
  • Resource usage: CPU/memory/GPU

Business Metrics

  • False positive rate: Valid responses rejected
  • False negative rate: Invalid responses accepted
  • Human review rate: Responses needing review
  • Compliance rate: Meeting regulatory requirements

Quick Reference

Essential Commands

# Run complete validation pipeline
python scripts/run_validation_pipeline.py --input data.json --config config.yaml

# Validate citations only
python scripts/validate_citations.py --response response.json --sources sources.json

# Detect hallucinations
python scripts/detect_hallucinations.py --response response.json --sources sources.json --threshold 0.99

# Calculate confidence scores
python scripts/calculate_confidence_scores.py --response response.json --sources sources.json --min-threshold 0.95

# Check data quality
python scripts/validate_data_quality.py --data input.json --schema schema.json

# Run validation tests
pytest test_validation_pipeline.py -v --cov=validation_pipeline --cov-report=html

# Benchmark validation performance
python scripts/run_validation_pipeline.py --input data.json --benchmark --iterations 100

Configuration Example (validation_config.yaml)

validation:
  # Confidence thresholds
  min_confidence: 0.95
  reject_threshold: 0.70
  review_threshold: 0.90

  # Weights for confidence scoring
  weights:
    citation_quality: 0.35
    semantic_match: 0.30
    source_authority: 0.20
    temporal_freshness: 0.15

  # Hallucination detection
  hallucination:
    semantic_threshold: 0.95
    require_citations: true
    max_unsupported_claims: 0

  # Citation validation
  citations:
    require_100_percent_coverage: true
    verify_source_accessibility: true
    check_citation_format: true

  # Audit trail
  audit:
    log_all_validations: true
    include_source_content: true
    track_confidence_scores: true
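
Loading and sanity-checking this file is straightforward with PyYAML; a minimal sketch:

import yaml  # PyYAML

with open("validation_config.yaml") as f:
    cfg = yaml.safe_load(f)["validation"]

# Fail fast on a misconfigured scorer instead of silently skewing scores.
assert abs(sum(cfg["weights"].values()) - 1.0) < 1e-6, "weights must sum to 1.0"
assert cfg["reject_threshold"] <= cfg["review_threshold"] <= cfg["min_confidence"]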

Pro Tip: For FDA/EMA compliance, implement the complete multi-stage validation pipeline with audit trails from day one. Retrofitting compliance is 10x harder than building it in from the start.