| name | gate-validation |
| description | Validate Deep Research SOP quality gates (Gates 1-3), ensuring rigorous GO/NO-GO decisions based on comprehensive requirement checklists with statistical validation. Use at the end of each research phase (Foundations→Development→Production) to determine whether the criteria for phase transition are met. Coordinates the evaluator agent with ethics-agent, data-steward, and archivist for systematic validation across data quality, baseline replication, novel methods, holistic evaluation, reproducibility, and production readiness. Implements multiple comparison correction (Bonferroni), effect size calculation (Cohen's d), and statistical power analysis (1-β ≥ 0.8). |
| pipeline | Quality Gates (Gates 1, 2, 3) |
| quality_gate | All three gates (1, 2, 3) |
| phase | Phase transitions (1→2→3) |
| agents | evaluator (lead), ethics-agent, data-steward, archivist |
| time_estimate | 2-4 hours per gate |
| keywords | quality-gate, go-no-go-decision, gate-validation, evaluator-agent, phase-transition, statistical-rigor, multiple-comparison, bonferroni-correction, cohens-d, power-analysis, reproducibility-validation, ethics-review |
| version | 1.0.0 |
| category | quality |
| tags | quality, testing, validation |
| author | ruv |
When to Use This Skill
Use this skill when:
- Code quality issues are detected (violations, smells, anti-patterns)
- Audit requirements mandate systematic review (compliance, release gates)
- Review needs arise (pre-merge, production hardening, refactoring preparation)
- Quality metrics indicate degradation (test coverage drop, complexity increase)
- Theater detection is needed (mock data, stubs, incomplete implementations)
When NOT to Use This Skill
Do NOT use this skill for:
- Simple formatting fixes (use linter/prettier directly)
- Non-code files (documentation, configuration without logic)
- Trivial changes (typo fixes, comment updates)
- Generated code (build artifacts, vendor dependencies)
- Third-party libraries (focus on application code)
Success Criteria
This skill succeeds when:
- Violations Detected: All quality issues found with ZERO false negatives
- False Positive Rate: <5% (95%+ findings are genuine issues)
- Actionable Feedback: Every finding includes file path, line number, and fix guidance
- Root Cause Identified: Issues traced to underlying causes, not just symptoms
- Fix Verification: Proposed fixes validated against codebase constraints
Edge Cases and Limitations
Handle these edge cases carefully:
- Empty Files: May trigger false positives - verify intent (stub vs intentional)
- Generated Code: Skip or flag as low priority (auto-generated files)
- Third-Party Libraries: Exclude from analysis (vendor/, node_modules/)
- Domain-Specific Patterns: What looks like violation may be intentional (DSLs)
- Legacy Code: Balance ideal standards with pragmatic technical debt management
Quality Analysis Guardrails
CRITICAL RULES - ALWAYS FOLLOW:
- NEVER approve code without evidence: Require actual execution, not assumptions
- ALWAYS provide line numbers: Every finding MUST include file:line reference
- VALIDATE findings against multiple perspectives: Cross-check with complementary tools
- DISTINGUISH symptoms from root causes: Report underlying issues, not just manifestations
- AVOID false confidence: Flag uncertain findings as "needs manual review"
- PRESERVE context: Show surrounding code (5 lines before/after minimum)
- TRACK false positives: Learn from mistakes to improve detection accuracy
Evidence-Based Validation
Use multiple validation perspectives:
- Static Analysis: Code structure, patterns, metrics (connascence, complexity)
- Dynamic Analysis: Execution behavior, test results, runtime characteristics
- Historical Analysis: Git history, past bug patterns, change frequency
- Peer Review: Cross-validation with other quality skills (functionality-audit, theater-detection)
- Domain Expertise: Leverage .claude/expertise/{domain}.yaml if available
Validation Threshold: Findings require 2+ confirming signals before flagging as violations.
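A minimal sketch of that threshold, assuming findings are tagged with the perspectives that confirmed them (the Finding shape and signal names are illustrative, not a fixed schema):
from dataclasses import dataclass, field

@dataclass
class Finding:
    file: str
    line: int
    message: str
    signals: set = field(default_factory=set)  # e.g., {"static", "dynamic", "historical"}

def confirmed(finding: Finding, threshold: int = 2) -> bool:
    # Flag as a violation only when 2+ independent perspectives agree
    return len(finding.signals) >= threshold

finding = Finding("src/train.py", 42, "unused random seed", {"static", "historical"})
assert confirmed(finding)  # two confirming signals → report as violation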
Integration with Quality Pipeline
This skill integrates with:
- Pre-Phase: Load domain expertise (.claude/expertise/{domain}.yaml)
- Parallel Skills: functionality-audit, theater-detection-audit, style-audit
- Post-Phase: Store findings in Memory MCP with WHO/WHEN/PROJECT/WHY tags
- Feedback Loop: Learnings feed dogfooding-system for continuous improvement
Gate Validation
Systematic validation of Deep Research SOP quality gates ensuring rigorous GO/NO-GO decisions at each phase transition.
Overview
Purpose: Validate quality gate requirements and make GO/NO-GO decisions
When to Use:
- End of Phase 1 (Foundations) → Gate 1 validation
- End of Phase 2 (Development) → Gate 2 validation
- End of Phase 3 (Production) → Gate 3 validation
- Before phase transitions in research workflow
- When evaluator agent assessment needed
Quality Gates:
- Gate 1: Data & Methods Validation (baseline replication, data quality, ethics)
- Gate 2: Model & Evaluation Validation (novel methods, holistic evaluation, safety)
- Gate 3: Production & Artifacts Validation (reproducibility, deployment, archival)
Prerequisites:
- Phase work completed (1, 2, or 3)
- All pipeline deliverables available
- Agent coordination active (evaluator, ethics-agent, data-steward, archivist)
Outputs:
- Quality gate decision (APPROVED / CONDITIONAL / REJECTED)
- Comprehensive checklist with pass/fail status
- Gap analysis and remediation plan
- Phase transition recommendation
- Memory MCP storage of gate status
Time Estimate: 2-4 hours per gate
Agents Used: evaluator (lead), ethics-agent, data-steward, archivist
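The gate decision record stored in Memory MCP might look like the following sketch (key and tag names are illustrative, not a fixed schema):
gate_decision = {
    "gate": 1,
    "decision": "APPROVED",  # APPROVED | CONDITIONAL | REJECTED
    "checklist": {
        "literature_review": "PASS",
        "datasheet": "PASS",
        "bias_audit": "PASS",
        "ethics_review": "PASS",
        "baseline_replication": "PASS",
        "reproducibility": "PASS",
    },
    "gaps": [],  # populated for CONDITIONAL / REJECTED outcomes
    "who": "evaluator",
    "when": "2025-01-15T14:30:00Z",
    "project": "deep-research-sop",
    "why": "gate-1-validation",
}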
Quick Start
Gate 1 Validation (End of Phase 1)
# Validate Gate 1 requirements
claude-code invoke-skill gate-validation --gate 1
# Or use command directly
npx claude-flow@alpha sparc run evaluator \
"/validate-gate-1 --pipeline F --verbose"
Gate 2 Validation (End of Phase 2)
# Validate Gate 2 requirements
claude-code invoke-skill gate-validation --gate 2
# Or use command directly
npx claude-flow@alpha sparc run evaluator \
"/validate-gate-2 --pipeline E --include-ethics"
Gate 3 Validation (End of Phase 3)
# Validate Gate 3 requirements
claude-code invoke-skill gate-validation --gate 3
# Or use command directly
npx claude-flow@alpha sparc run evaluator \
"/validate-gate-3 --pipeline G --include-deployment"
Detailed Instructions
Quality Gate 1: Data & Methods Validation
Objective: Validate data quality, baseline replication, and ethics before method development
Requirements Checklist:
1.1 Literature Review (Pipeline A)
# Check literature review completeness
npx claude-flow@alpha memory retrieve --key "sop/phase1/literature-review"
Requirements:
- ≥50 papers reviewed and cataloged
- SOTA performance benchmarks identified
- Research gap analysis documented
- Hypotheses formulated with testable predictions
Validation (a sketch; assumes the Pipeline A catalog is exported as a CSV with a status column, and the path is illustrative):
import pandas as pd

literature_database = pd.read_csv("phase1-foundations/literature-catalog.csv")  # assumed export location
papers_reviewed = len(literature_database.query("status == 'reviewed'"))
assert papers_reviewed >= 50, f"Only {papers_reviewed}/50 papers reviewed"
1.2 Datasheet (Pipeline B)
# Check datasheet completion
npx claude-flow@alpha sparc run data-steward \
"Validate datasheet completeness" \
--datasheet phase1-foundations/datasheet.md
Requirements:
- Datasheet complete (Form F-P1) with ≥90% sections filled
- Motivation, composition, collection process documented
- Preprocessing and labeling procedures described
- Recommended uses and limitations identified
- Distribution and maintenance plan established
Validation:
from datasheet_parser import parse_datasheet  # project-local helper assumed by the SOP

datasheet = parse_datasheet("phase1-foundations/datasheet.md")
completion = datasheet.calculate_completion()
assert completion >= 0.90, f"Datasheet only {completion*100:.1f}% complete"
1.3 Bias Audit (Pipeline B)
# Check bias audit results
npx claude-flow@alpha sparc run data-steward \
"Review bias audit report" \
--audit-report phase1-foundations/bias-audit.json
Requirements:
- Bias audit complete across demographic groups
- Label distribution analyzed
- Representation metrics calculated
- Mitigation strategies documented (if bias detected)
- Acceptable bias levels (<10% demographic parity gap)
Validation:
import json

with open("phase1-foundations/bias-audit.json") as f:
    audit = json.load(f)

max_demographic_gap = max(audit["demographic_parity_differences"].values())
assert max_demographic_gap < 0.10, f"Demographic parity gap {max_demographic_gap*100:.1f}% > 10%"
1.4 Ethics Review (Pipeline B, F)
# Validate ethics review
npx claude-flow@alpha sparc run ethics-agent \
"/assess-risks --component dataset --gate 1 --validation-mode"
Requirements:
- Ethics review form (F-F1) complete
- Risk assessment across 6 domains completed
- IRB approval obtained (if human subjects data)
- Dual-use risks assessed and mitigated
- Ethics review status: APPROVED
Validation:
ethics_status=$(npx claude-flow@alpha memory retrieve --key "sop/gate-1/ethics-status")
if [ "$ethics_status" != "APPROVED" ]; then
  echo "ERROR: Ethics review not approved: $ethics_status"
  exit 1
fi
1.5 Baseline Replication (Pipeline D)
# Validate baseline replication
python scripts/validate_baseline_replication.py \
--baseline-paper "Attention is All You Need" \
--replicated-checkpoint phase1-foundations/baseline/checkpoint.pth \
--tolerance 0.01
Requirements:
- Baseline implementation complete (100% test coverage)
- Results within ±1% of published baseline
- No statistically significant difference from published results (paired t-test, p≥0.05)
- Reproducibility tested (3/3 successful runs)
- Reproducibility package created (Docker + README)
Validation:
from scipy import stats

baseline_accuracy = 0.850    # published
replicated_accuracy = 0.852  # our implementation
tolerance = 0.01

delta = abs(replicated_accuracy - baseline_accuracy)
assert delta <= tolerance, f"Baseline replication Δ={delta*100:.2f}% > ±1%"

# Statistical test: per-run metrics from the 3 reproducibility runs (values assumed loaded from experiment logs)
baseline_runs = [0.849, 0.851, 0.850]
replicated_runs = [0.852, 0.853, 0.851]

# We want NO significant difference from the published baseline
t_stat, p_value = stats.ttest_rel(baseline_runs, replicated_runs)
assert p_value >= 0.05, f"Significant difference detected (p={p_value:.4f})"
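Note that asserting p ≥ 0.05 only fails to detect a difference; it does not positively demonstrate equivalence. Where stronger evidence is wanted, a two one-sided tests (TOST) equivalence check is a minimal alternative, sketched here with statsmodels' ttost_paired (the ±1% bounds mirror the tolerance above; per-run values are illustrative):
import numpy as np
from statsmodels.stats.weightstats import ttost_paired

baseline_runs = np.array([0.849, 0.851, 0.850])
replicated_runs = np.array([0.852, 0.853, 0.851])

# TOST: reject "difference lies outside ±1%" to conclude equivalence
p_value, _, _ = ttost_paired(replicated_runs, baseline_runs, low=-0.01, upp=0.01)
if p_value < 0.05:
    print(f"Equivalent within ±1% (TOST p={p_value:.4f})")
else:
    print(f"Equivalence NOT demonstrated (TOST p={p_value:.4f})")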
1.6 Gate 1 Decision
The evaluator agent synthesizes all requirements into a decision:
npx claude-flow@alpha sparc run evaluator \
"/validate-gate-1 --pipeline F --comprehensive"
Decision Criteria:
APPROVED (all critical requirements met):
- ✅ Literature review ≥50 papers
- ✅ Datasheet ≥90% complete
- ✅ Bias audit complete, gaps <10%
- ✅ Ethics review APPROVED
- ✅ Baseline replication ±1% tolerance
- ✅ Reproducibility 3/3 runs successful
- Action: Proceed to Phase 2 (Method Development)
CONDITIONAL (minor gaps, mitigation plan in place):
- ✅ Most requirements met
- ⚠️ Datasheet 85-89% complete (minor sections missing)
- ⚠️ Bias gaps 10-15% (mitigation plan documented)
- Action: Proceed to Phase 2 with restrictions, complete missing items in parallel
REJECTED (critical issues):
- ❌ Baseline replication >±1% tolerance
- ❌ Ethics review flagged critical risks
- ❌ Reproducibility tests failed
- ❌ Datasheet <85% complete
- Action: Return to Phase 1, address critical issues
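A minimal sketch of how the evaluator could map these criteria onto a decision (the checks dict is an assumed summary of the validations above; key names are illustrative):
def gate1_decision(checks: dict) -> str:
    critical_failures = [
        checks["baseline_delta"] > 0.01,         # replication outside ±1%
        checks["ethics_status"] != "APPROVED",
        checks["repro_runs_passed"] < 3,
        checks["datasheet_completion"] < 0.85,
        checks["max_bias_gap"] >= 0.15,
    ]
    if any(critical_failures):
        return "REJECTED"

    minor_gaps = [
        checks["datasheet_completion"] < 0.90,   # 85-89% complete
        checks["max_bias_gap"] >= 0.10,          # 10-15% gap with mitigation plan
    ]
    if any(minor_gaps):
        return "CONDITIONAL"
    return "APPROVED"
Order matters in this sketch: critical failures are tested first, so the CONDITIONAL branch only ever sees the 85-89% and 10-15% ranges.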
Quality Gate 2: Model & Evaluation Validation
Objective: Validate novel method performance, holistic evaluation, and safety before production
Requirements Checklist:
2.1 Novel Method Performance (Pipeline D)
# Validate novel method vs. baseline
python scripts/validate_method_performance.py \
--novel phase2-development/novel-method/checkpoint.pth \
--baseline phase1-foundations/baseline/checkpoint.pth
Requirements:
- Novel method implemented with complete codebase
- Performance meets or exceeds baseline
- Statistical significance confirmed (p<0.05)
- Improvement documented (e.g., +2.5% accuracy)
- Code quality: 100% test coverage, linting passed
Validation:
from scipy import stats

novel_accuracy = 0.875
baseline_accuracy = 0.850
improvement = novel_accuracy - baseline_accuracy
assert improvement >= 0, f"Novel method underperforms baseline by {abs(improvement)*100:.1f}%"

# Per-seed scores (values assumed loaded from experiment logs)
novel_runs = [0.874, 0.876, 0.875]
baseline_runs = [0.849, 0.851, 0.850]

t_stat, p_value = stats.ttest_ind(novel_runs, baseline_runs)
assert p_value < 0.05, f"Improvement not statistically significant (p={p_value:.4f})"
2.2 Ablation Studies (Pipeline D)
# Validate ablation studies
python scripts/validate_ablations.py \
--ablation-results phase2-development/ablations/
Requirements:
- Minimum 5 components ablated independently
- 3 runs per ablation configuration
- Statistical analysis complete (t-tests with Bonferroni correction)
- Component importance ranking documented
- Critical components identified (p<0.05)
Validation:
from scipy import stats
from statsmodels.stats.power import ttest_power
import numpy as np

ablation_results = load_ablation_results("phase2-development/ablations/")  # project-local loader assumed

num_components = len(ablation_results["components"])
assert num_components >= 5, f"Only {num_components}/5 components ablated"

for component in ablation_results["components"]:
    assert len(component["runs"]) >= 3, f"{component['name']}: only {len(component['runs'])}/3 runs"

# Statistical validation with multiple comparison correction
baseline_scores = np.array(ablation_results["baseline_runs"])
p_values = []

for component in ablation_results["components"]:
    ablated_scores = np.array(component["runs"])

    # Paired t-test
    t_stat, p_value = stats.ttest_rel(baseline_scores, ablated_scores)
    p_values.append(p_value)

    # Effect size (Cohen's d, pooled sample standard deviation)
    mean_diff = np.mean(baseline_scores - ablated_scores)
    pooled_std = np.sqrt((np.std(baseline_scores, ddof=1)**2 + np.std(ablated_scores, ddof=1)**2) / 2)
    cohens_d = mean_diff / pooled_std

    # Statistical power analysis (target: 1-β ≥ 0.8)
    power = ttest_power(cohens_d, nobs=len(baseline_scores), alpha=0.05, alternative='two-sided')

    component["t_statistic"] = t_stat
    component["p_value"] = p_value
    component["cohens_d"] = cohens_d
    component["statistical_power"] = power

    # Validate sufficient power
    if power < 0.8:
        print(f"⚠️ WARNING: {component['name']} has low statistical power ({power:.2f} < 0.80)")

# Multiple comparison correction (Bonferroni)
bonferroni_alpha = 0.05 / num_components
print(f"Bonferroni-corrected α: {bonferroni_alpha:.4f}")

significant_components = []
for component, p_val in zip(ablation_results["components"], p_values):
    if p_val < bonferroni_alpha:
        significant_components.append(component["name"])
        print(f"✅ {component['name']}: Significant after Bonferroni correction (p={p_val:.4f})")
    else:
        print(f"❌ {component['name']}: Not significant after correction (p={p_val:.4f})")

assert len(significant_components) > 0, "No components significant after multiple comparison correction"
2.3 Holistic Evaluation (Pipeline E)
# Validate holistic evaluation
python scripts/validate_holistic_evaluation.py \
--evaluation-report phase2-development/holistic-evaluation/report.md
Requirements:
- Evaluation complete across ≥6 dimensions
- Accuracy: Test accuracy documented
- Fairness: Demographic parity <10%, equalized odds <10%
- Robustness: Adversarial robustness tested, OOD detection >0.70 AUROC
- Efficiency: Latency, throughput, memory within targets
- Interpretability: SHAP/attention analysis complete
- Safety: Harmful output rate <0.05%, privacy preserved
Validation:
eval_report = parse_evaluation_report("phase2-development/holistic-evaluation/report.md")  # project-local parser assumed

# Check dimensions
assert len(eval_report["dimensions"]) >= 6, "Holistic evaluation incomplete"

# Check fairness
assert eval_report["fairness"]["demographic_parity"] < 0.10, "Demographic parity gap >10%"
assert eval_report["fairness"]["equalized_odds_tpr"] < 0.10, "TPR gap >10%"

# Check safety
assert eval_report["safety"]["harmful_output_rate"] < 0.0005, "Harmful output rate >0.05%"
2.4 Ethics & Safety Review (Pipeline F)
# Validate Gate 2 ethics review
npx claude-flow@alpha sparc run ethics-agent \
"/assess-risks --component model --gate 2 --validation-mode"
Requirements:
- Ethics review complete for Gate 2
- Model safety evaluation passed
- Fairness audit passed
- Privacy audit passed (membership inference AUC ≤0.55)
- Adversarial prompt testing passed (>90% rejection)
- Dual-use risk assessment complete
- Ethics status: APPROVED
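A sketch of threshold checks for the quantitative items above, assuming the ethics pipeline emits a JSON report (the path and key names are illustrative):
import json

with open("phase2-development/ethics-review/report.json") as f:
    ethics = json.load(f)

assert ethics["status"] == "APPROVED", f"Ethics status: {ethics['status']}"
assert ethics["membership_inference_auc"] <= 0.55, "Privacy audit failed (membership inference AUC > 0.55)"
assert ethics["adversarial_prompt_rejection_rate"] > 0.90, "Adversarial prompt rejection rate ≤ 90%"
assert ethics["dual_use_assessment_complete"], "Dual-use risk assessment missing"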
2.5 Method Card (Pipeline G)
# Validate method card
npx claude-flow@alpha sparc run archivist \
"Validate method card completeness" \
--method-card phase2-development/method-card.md
Requirements:
- Method card ≥90% complete
- 9 sections filled (Mitchell et al. 2019 template)
- Performance metrics documented
- Training details complete
- Ethical considerations addressed
- Limitations documented
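A sketch of the completeness check, mirroring the Gate 1 datasheet validation (the parse_method_card helper and its attributes are hypothetical):
from method_card_parser import parse_method_card  # hypothetical project-local helper

card = parse_method_card("phase2-development/method-card.md")
assert card.completion() >= 0.90, f"Method card only {card.completion()*100:.1f}% complete"

# All 9 sections of the Mitchell et al. (2019) model card template must be present
missing = [s for s in card.required_sections if s not in card.sections]
assert not missing, f"Missing sections: {missing}"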
2.6 Gate 2 Decision
The evaluator agent synthesizes all requirements:
npx claude-flow@alpha sparc run evaluator \
"/validate-gate-2 --pipeline E --include-ethics --comprehensive"
Decision Criteria:
APPROVED:
- ✅ Novel method > baseline (statistically significant)
- ✅ Ablation studies ≥5 components
- ✅ Holistic evaluation 6+ dimensions
- ✅ Ethics review APPROVED
- ✅ Method card ≥90% complete
- ✅ Reproducibility 3/3 runs
- Action: Proceed to Phase 3 (Production)
CONDITIONAL:
- ✅ Most requirements met
- ⚠️ Fairness gaps 10-15% (mitigation plan required)
- ⚠️ Adversarial robustness <30% (acceptable for non-adversarial deployment)
- Action: Proceed with deployment restrictions
REJECTED:
- ❌ Performance regression vs. baseline
- ❌ Ethics review flagged critical safety risks
- ❌ Fairness gaps >15%
- Action: Return to Phase 2
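To make the three-way outcome concrete, here is a hedged sketch of the synthesis step: each requirement is a pass/fail check, and a failing check is tolerable only when it explicitly allows a conditional pass with a mitigation plan. Check names and the data structure are illustrative, not the evaluator agent's real interface.

```python
# Illustrative GO/NO-GO synthesis for Gate 2 (names and rules are assumptions).
from dataclasses import dataclass

@dataclass
class GateCheck:
    name: str
    passed: bool
    conditional_ok: bool = False  # tolerable with a documented mitigation plan

def gate2_decision(checks: list[GateCheck]) -> str:
    if all(c.passed for c in checks):
        return "APPROVED"
    if all(c.passed or c.conditional_ok for c in checks):
        return "CONDITIONAL"  # proceed only with documented deployment restrictions
    return "REJECTED"

checks = [
    GateCheck("novel_method_beats_baseline", passed=True),
    GateCheck("ablations_ge_5_components", passed=True),
    GateCheck("holistic_eval_6_dimensions", passed=True),
    GateCheck("ethics_review_approved", passed=True),
    GateCheck("fairness_gap_lt_10pct", passed=False, conditional_ok=True),
]
print(gate2_decision(checks))  # -> CONDITIONAL
```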
Quality Gate 3: Production & Artifacts Validation
Objective: Validate reproducibility, deployment readiness, and artifact archival before production
Requirements Checklist:
3.1 Model Card (Pipeline G)
# Validate model card
npx claude-flow@alpha sparc run archivist \
"Validate model card for Gate 3" \
--model-card phase3-production/model-card.md
Requirements:
- Model card ≥90% complete (Form F-G2)
- All 9 sections filled (Mitchell et al. 2019)
- Performance metrics linked
- Datasheets and risk assessments linked
- Ethical considerations complete
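One way the "linked" requirements could be machine-checked is to verify that every relative link in the card resolves to an existing file. The link-extraction regex and paths below are assumptions for the sketch.

```python
# Hypothetical link check for the model card's referenced artifacts.
import re
from pathlib import Path

def broken_links(card_path: str) -> list[str]:
    card = Path(card_path)
    text = card.read_text(encoding="utf-8")
    targets = re.findall(r"\[[^\]]*\]\(([^)#]+)\)", text)  # markdown link targets
    return [t for t in targets
            if "://" not in t                       # skip external URLs
            and not (card.parent / t).exists()]     # relative path must exist

missing = broken_links("phase3-production/model-card.md")
print("All links resolve" if not missing else f"Broken links: {missing}")
```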
3.2 Reproducibility Package (Pipeline G)
# Test reproducibility package
python scripts/test_reproducibility.py \
--package phase3-production/reproducibility-package/ \
--runs 3
Requirements:
- Reproducibility package complete (code + data + environment)
- Docker environment builds successfully
- README with ≤5 steps to reproduce
- All hyperparameters and seeds documented
- 3/3 reproduction runs successful
- Results within ±1% of reported values
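A minimal sketch of the ±1% tolerance check follows, assuming a single scalar headline metric; the metric name and figures are placeholders.

```python
# Illustrative tolerance check: every reproduction run within +/-1% of the
# reported value (metric and numbers are placeholders).
def within_tolerance(reported: float, runs: list[float], tol_pct: float = 1.0) -> bool:
    return all(abs(r - reported) / abs(reported) * 100.0 <= tol_pct for r in runs)

reported_bleu = 27.3                    # value claimed in the paper
reproduced = [27.25, 27.41, 27.18]      # three independent reproduction runs
print(within_tolerance(reported_bleu, reproduced))  # True -> 3/3 runs pass
```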
3.3 DOI Assignment (Pipeline G)
# Validate DOI assignment
npx claude-flow@alpha sparc run archivist \
"Check DOI assignment status"
Requirements:
- Dataset DOI assigned (Zenodo)
- Model DOI assigned (Zenodo)
- Code DOI assigned (GitHub + Zenodo)
- Citation formats generated
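Resolvability can be spot-checked with a HEAD request against the public doi.org resolver; the DOIs in this sketch are placeholders, not real assignments.

```python
# Hypothetical DOI resolution check via the doi.org resolver.
import urllib.request

DOIS = {
    "dataset": "10.5281/zenodo.0000000",  # placeholder
    "model":   "10.5281/zenodo.0000001",  # placeholder
    "code":    "10.5281/zenodo.0000002",  # placeholder
}

def resolves(doi: str) -> bool:
    req = urllib.request.Request(f"https://doi.org/{doi}", method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status < 400  # redirect followed to a reachable landing page
    except Exception:
        return False

for kind, doi in DOIS.items():
    print(f"{kind}: {doi} -> {'OK' if resolves(doi) else 'UNRESOLVED'}")
```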
3.4 Public Registry Publishing (Pipeline G)
# Validate public registry publishing
python scripts/validate_registry_publishing.py \
--registries "huggingface,zenodo,mlflow,github"
Requirements:
- Code public on GitHub (release created)
- Model published to HuggingFace Hub
- Artifacts published to Zenodo
- Registry URLs documented
- All artifacts accessible publicly
3.5 Deployment Readiness (Pipeline H)
# Validate deployment readiness
python scripts/validate_deployment_readiness.py \
--model phase3-production/final-checkpoint.pth \
--environment production
Requirements:
- Deployment plan validated
- Infrastructure requirements documented
- Monitoring plan established
- Incident response plan created
- Performance benchmarks (production environment) complete
- Rollback strategy documented
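As one concrete piece of the benchmark requirement, here is a minimal sketch measuring the latency percentiles a production SLO typically gates on; the stand-in workload and request count are placeholders for calls to the deployed endpoint.

```python
# Illustrative p50/p95/p99 latency benchmark (workload is a placeholder).
import statistics
import time

def benchmark(predict, n_requests: int = 200) -> dict:
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict()                                   # one inference call
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    q = statistics.quantiles(latencies_ms, n=100)   # 99 percentile cut points
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

# Replace the lambda with a request to the production endpoint.
print(benchmark(lambda: sum(i * i for i in range(10_000))))
```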
3.6 Publication Artifacts (Pipeline I)
# Validate publication readiness
npx claude-flow@alpha sparc run researcher \
"Validate publication artifacts" \
--venue NeurIPS
Requirements:
- Research paper draft complete
- Reproducibility checklist complete (NeurIPS/ICML)
- Supplementary materials prepared
- Artifact submission ready (ACM badges)
- Code release finalized
3.7 Gate 3 Decision
The evaluator agent synthesizes all requirements:
npx claude-flow@alpha sparc run evaluator \
"/validate-gate-3 --pipeline G --include-deployment --comprehensive"
Decision Criteria:
APPROVED:
- ✅ Model card ≥90% complete
- ✅ Reproducibility 3/3 runs successful
- ✅ DOIs assigned (dataset, model, code)
- ✅ Code public (GitHub release)
- ✅ Ethics review APPROVED (deployment)
- ✅ Deployment plan validated
- ✅ Publication artifacts ready
- Action: DEPLOY to production, submit publication
CONDITIONAL:
- ✅ Most requirements met
- ⚠️ Minor documentation gaps (model card 85-89%)
- Action: Deploy with documentation fixes in progress
REJECTED:
- ❌ Reproducibility tests failed
- ❌ Critical ethics violations
- ❌ DOIs not assigned
- Action: Return to Phase 3
Integration with Deep Research SOP
Quality Gate System
The three quality gates ensure systematic progression through the research lifecycle:
Gate 1 (End of Phase 1):
INPUTS: Literature review, datasheet, bias audit, ethics review, baseline replication
OUTPUT: GO/NO-GO for method development
AGENTS: evaluator, data-steward, ethics-agent
Gate 2 (End of Phase 2):
INPUTS: Novel method, ablations, holistic evaluation, ethics review, method card
OUTPUT: GO/NO-GO for production deployment
AGENTS: evaluator, ethics-agent
Gate 3 (End of Phase 3):
INPUTS: Model card, reproducibility package, DOIs, deployment plan, publication artifacts
OUTPUT: GO/NO-GO for deployment and publication
AGENTS: evaluator, archivist
Troubleshooting
Issue: Gate 1 rejected due to baseline replication failure
Solution:
# Debug baseline replication
python scripts/debug_baseline_replication.py \
--baseline-paper "Attention is All You Need" \
--verbose
# Check for implementation bugs
python tests/integration/test_gradient_flow.py
python tests/integration/test_numerical_stability.py
# Re-run with stricter determinism
python train_baseline.py --deterministic --seed 42
Issue: Gate 2 rejected due to fairness gaps
Solution:
# Apply fairness mitigation
python scripts/fairness_calibration.py \
--model phase2-development/novel-method/checkpoint.pth \
--method "equalized_odds_postprocessing"
# Re-evaluate fairness
python scripts/fairness_eval.py --model calibrated_model.pth
Issue: Gate 3 rejected due to reproducibility failure
Solution:
# Audit reproducibility package
python scripts/audit_reproducibility.py \
--package phase3-production/reproducibility-package/ \
--strict --debug
# Fix non-deterministic behavior
python scripts/test_determinism.py --runs 10 --strict
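When test_determinism surfaces run-to-run variance, the fix usually involves the standard PyTorch determinism setup sketched below; exact flags depend on the torch and CUDA versions pinned in the reproducibility package.

```python
# Hedged sketch of a full-determinism setup for PyTorch training runs.
import os
import random
import numpy as np
import torch

def set_determinism(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                             # seeds CPU and CUDA RNGs
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"   # required by some CUDA ops
    torch.use_deterministic_algorithms(True)            # error on nondeterministic ops
    torch.backends.cudnn.benchmark = False              # disable autotuned kernels

set_determinism(42)
```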
Related Skills and Commands
Related Skills
- baseline-replication - Required for Gate 1
- method-development - Required for Gate 2
- holistic-evaluation - Required for Gate 2
- reproducibility-audit - Required for Gate 3
- deep-research-orchestrator - Uses this skill for all 3 gates
Related Commands
- /validate-gate-1 - Gate 1 validation (evaluator)
- /validate-gate-2 - Gate 2 validation (evaluator)
- /validate-gate-3 - Gate 3 validation (evaluator)
- /assess-risks - Ethics assessment (ethics-agent)
- /init-datasheet - Datasheet creation (data-steward)
- /init-model-card - Model card creation (archivist)
References
Quality Gate Methodologies
- Stage-Gate Process (Cooper, 1990): Phase-gate systems for innovation
- ISO 9001: Quality management systems
- FDA Software Validation Guidance: Pre-market validation requirements
Academic Standards
- NeurIPS Reproducibility Checklist
- ACM Artifact Evaluation Badging
- IEEE Software Quality Assurance Standards
- Model Cards for Model Reporting (Mitchell et al., 2019)
Core Principles
1. GO/NO-GO Decisions Require Quantitative Evidence
Quality gate validation is not a rubber-stamp approval process; it is a rigorous GO/NO-GO decision framework backed by quantitative evidence. Every gate requirement has explicit pass/fail criteria with numerical thresholds (e.g., baseline replication within 1% tolerance, demographic parity gap < 10%, test coverage >= 90%). Subjective assessments ("looks good enough") are rejected. This principle ensures that phase transitions are defensible under academic peer review, regulatory audit, and production incident retrospectives.
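To ground the thresholds in the statistics the SOP calls for, here is a sketch of the effect-size (Cohen's d) and power computations, assuming numpy and statsmodels are available; the accuracy figures and seed counts are illustrative.

```python
# Illustrative effect-size and power check (figures are placeholders).
import numpy as np
from statsmodels.stats.power import TTestIndPower

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)  # pooled-SD denominator

rng = np.random.default_rng(0)
novel = rng.normal(0.82, 0.02, size=10)       # accuracy over 10 seeds (placeholder)
baseline = rng.normal(0.79, 0.02, size=10)

d = cohens_d(novel, baseline)
power = TTestIndPower().power(effect_size=d, nobs1=10, ratio=1.0, alpha=0.05)
print(f"Cohen's d = {d:.2f}, power (1-beta) = {power:.2f}  # gate requires >= 0.8")
```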
2. Multiple Comparison Correction Prevents Statistical Illusions
Testing multiple hypotheses without correction rapidly inflates the family-wise false positive rate. Gate validation implements Bonferroni correction to control the family-wise error rate, ensuring that approval decisions maintain statistical rigor across dozens of simultaneous checks. A Gate 2 validation testing 30 requirements at p<0.05 would have a ~78% chance of at least one false positive (assuming independent tests); Bonferroni caps this at 5% by testing each requirement at α/30 ≈ 0.0017. This principle prevents the approval of models that pass gates by statistical luck rather than genuine quality.
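The figures above follow directly from the family-wise error formula; a minimal computation:

```python
# Family-wise error rate for m independent tests at per-test alpha.
alpha, m = 0.05, 30
fwer_uncorrected = 1 - (1 - alpha) ** m   # P(at least one false positive) ~ 78.5%
alpha_bonferroni = alpha / m              # Bonferroni per-test threshold ~ 0.0017
print(f"Uncorrected FWER: {fwer_uncorrected:.1%}")
print(f"Bonferroni per-test alpha: {alpha_bonferroni:.4f}")
```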
3. Phase Transitions Require Cross-Agent Consensus
No single agent has a complete view of production readiness. Gate validation coordinates evaluator, ethics-agent, data-steward, and archivist for multi-perspective assessment. Gate 1 cannot approve without data-steward datasheet validation AND ethics-agent risk assessment. Gate 2 requires evaluator holistic evaluation AND ethics-agent safety review. This principle prevents blind spots where technical excellence masks ethical risks or where comprehensive documentation hides functionality gaps.
Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Checkbox Compliance Without Verification | Marking gate requirements as "complete" without executing validation scripts leads to false approvals. A baseline replication marked complete without running tolerance checks may have a 15% error instead of the required 1%, leading to reproducibility failures and rejected publications. | Execute validation scripts for EVERY gate requirement. Baseline replication: run experiments/scripts/validate_baseline.py with tolerance checks. Datasheet: parse the form and calculate the completion percentage. Store validation evidence in Memory MCP with WHO/WHEN/PROJECT/WHY tags. Require automated pass/fail, not manual judgment. |
| Skipping Gates Under Time Pressure | Proceeding to method development before Gate 1 approval ("we'll validate the baseline later") wastes effort building on flawed foundations. Method development on unvalidated datasets or misunderstood baselines produces unpublishable results; this is common in industry, where shipping pressure overrides rigor. | Enforce strict gate blocking: Phase 2 skills refuse to execute until Gate 1 APPROVED status is confirmed in Memory MCP. Use npx claude-flow@alpha memory retrieve --key "sop/gate-1/status" before method-development. Coordinate with the evaluator agent to validate gate completion. No exceptions for deadlines: failed gates require remediation, not bypass. |
| Ignoring CONDITIONAL Approvals | Treating CONDITIONAL gate status as equivalent to APPROVED ignores the required mitigation plans. Gate 2 CONDITIONAL with "adversarial robustness mitigation required" means deployment ONLY in non-adversarial environments until mitigation is complete; ignoring conditions leads to production failures. | Document all conditions in the gate validation report with explicit deployment restrictions. Store conditions in Memory MCP: memory_store("sop/gate-2/conditions", "adversarial mitigation required before high-risk deployment"). Coordinate with the deployment-readiness skill to enforce restrictions. Track mitigation completion before upgrading CONDITIONAL to APPROVED. |
Conclusion
Gate validation transforms research phase transitions from arbitrary milestones into rigorous GO/NO-GO decision points backed by quantitative evidence and multi-agent consensus. By implementing statistical rigor (Bonferroni correction, effect size calculation, power analysis), this skill ensures that models advancing through Deep Research SOP phases meet production-ready standards at each gate. The three-gate system prevents wasted effort: Gate 1 catches data quality issues and baseline replication failures before method development, Gate 2 catches evaluation gaps before production preparation, and Gate 3 catches reproducibility failures before archival and deployment.
The value of gate validation extends beyond risk mitigation; it also provides clear feedback loops for continuous improvement. A Gate 1 REJECTED decision with specific datasheet gaps (85% complete, missing distribution plan) provides actionable remediation guidance rather than vague "needs improvement" feedback. This precision enables rapid iteration: fix the specific gaps, re-validate, proceed. The integration with Memory MCP ensures that gate decisions persist across sessions, preventing teams from re-litigating completed validations or bypassing failed gates.
For academic ML research, gate validation is the difference between publishable and rejected work. NeurIPS, ICML, and ACM conferences increasingly require reproducibility artifacts, ethics statements, and comprehensive evaluation - requirements directly mapped to Gates 1, 2, and 3. Models passing all three gates have systematically validated data quality, baseline replication, novel methods, holistic evaluation, reproducibility, and production readiness. This rigorous validation framework ensures that research outputs meet or exceed venue requirements, reducing desk rejection rates and increasing acceptance probability.