| name | fyp-statistical-validator |
| description | Automate statistical validation, hypothesis testing, and confidence interval calculations required by TAR UMT thesis standards and top ML research conferences. Use when you need to (1) calculate 95% confidence intervals for model performance metrics, (2) perform hypothesis testing to compare models (McNemar's test, paired t-test, Bonferroni correction), (3) generate APA-formatted results tables for thesis chapters, (4) create reproducibility statements for experimental setup, (5) validate statistical significance of model improvements, or (6) format results according to academic publishing standards. Essential for FYP Chapter 4 (experimental setup documentation) and Chapter 5 (results with statistical validation). |
Statistical Validator for TAR UMT FYP
Purpose: Automate all statistical validation requirements for TAR UMT Data Science thesis, ensuring rigorous experimental validation and APA-compliant results reporting.
Priority: ⭐⭐⭐⭐⭐ CRITICAL - Every experiment MUST have statistical validation
Quick Start
The 3 Core Functions You Need
1. Calculate Confidence Intervals
from scripts.confidence_intervals import multi_seed_ci
# For 5 training runs with different seeds
seeds_results = [0.921, 0.925, 0.920, 0.928, 0.919]
mean, lower, upper, std = multi_seed_ci(seeds_results)
# Output: approximately 92.3% ± 0.4% (95% CI: 91.8% - 92.7%) for these five seeds
2. Test Statistical Significance
from scripts.hypothesis_testing import paired_ttest
# Compare CrossViT vs ResNet-50 (5 seeds each)
crossvit_seeds = [0.921, 0.925, 0.920, 0.928, 0.919]
resnet_seeds = [0.887, 0.892, 0.884, 0.895, 0.882]
t, p, sig, interp = paired_ttest(crossvit_seeds, resnet_seeds)
# Output: "CrossViT significantly outperforms ResNet-50 (p<0.05)"
3. Generate Results Table
from scripts.table_formatter import format_results_table
models = ["CrossViT", "ResNet-50", "DenseNet-121"]
metrics = {
"Accuracy": [(0.923, 0.911, 0.935), (0.887, 0.872, 0.902), ...]
}
table = format_results_table(models, metrics, "Results", table_number=1)
# Output: APA-formatted table ready for thesis
When to Use This Skill
✅ Always Use When:
- Reporting any model performance metric → Need 95% CI
- Comparing models → Need hypothesis testing
- Writing Chapter 5 results → Need formatted tables
- Writing Chapter 4 setup → Need reproducibility statement
- Claiming "significant improvement" → Need p-value < 0.05
- Multiple model comparisons → Need Bonferroni correction
❌ Don't Use When:
- Just exploring data (Chapter 2 literature review)
- Explaining methodology theory (covered by tar-umt-fyp-rds)
- Writing code for training (covered by crossvit-covid19-fyp)
Core Capabilities
1. Confidence Interval Calculation
Three methods available:
| Method | When to Use | Example |
|---|---|---|
| Normal Approximation | Single run, quick estimate | normal_approximation_ci(0.923, 2117) |
| Bootstrap | Robust estimate, any distribution | bootstrap_ci(y_true, y_pred) |
| Multi-Seed | Multiple runs (RECOMMENDED) | multi_seed_ci([0.92, 0.93, ...]) |
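For intuition, the normal-approximation interval is the textbook Wald interval for a proportion, and the bootstrap interval resamples the test predictions with replacement. A minimal sketch of both ideas (illustrative only; the bundled implementations in scripts/confidence_intervals.py may differ in details):

```python
import numpy as np

def wald_ci(acc, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for accuracy measured on n test samples."""
    margin = z * np.sqrt(acc * (1 - acc) / n)
    return acc, acc - margin, acc + margin

def bootstrap_accuracy_ci(y_true, y_pred, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CI for accuracy: resample the test set with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    idx = rng.integers(0, len(y_true), size=(n_boot, len(y_true)))
    accs = (y_true[idx] == y_pred[idx]).mean(axis=1)
    return accs.mean(), np.percentile(accs, 2.5), np.percentile(accs, 97.5)
```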
TAR UMT Requirement: ALWAYS report confidence intervals with metrics
- Format: "92.3% (95% CI: 91.1% - 93.5%)"
- Never report a single number without CI
- Use multi-seed method with 5+ runs for best practice
Usage Pattern:
from scripts.confidence_intervals import multi_seed_ci, format_ci_result
# Calculate CI
mean, lower, upper, std = multi_seed_ci(accuracy_seeds)
# Format for thesis
result_text = format_ci_result(mean, lower, upper, std, "Accuracy")
# Output: "Accuracy: 92.3% ± 1.2% (95% CI: 90.8% - 93.8%)"
2. Hypothesis Testing Suite
Three tests provided:
A. McNemar's Test
Use for: Single run comparison, same test set
from scripts.hypothesis_testing import mcnemar_test
stat, p, interp = mcnemar_test(y_true, y_pred_model1, y_pred_model2)
# Tests: Are prediction patterns significantly different?
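McNemar's test only looks at the samples the two models classify differently. As a rough illustration of how such a test can be computed with statsmodels (an assumption about the library; the bundled script may implement the test differently):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_sketch(y_true, y_pred_a, y_pred_b):
    """Exact McNemar's test built from the 2x2 correct/incorrect contingency table."""
    a_ok = np.asarray(y_pred_a) == np.asarray(y_true)
    b_ok = np.asarray(y_pred_b) == np.asarray(y_true)
    table = [[int(np.sum(a_ok & b_ok)),  int(np.sum(a_ok & ~b_ok))],
             [int(np.sum(~a_ok & b_ok)), int(np.sum(~a_ok & ~b_ok))]]
    result = mcnemar(table, exact=True)   # binomial test on the discordant pairs
    return result.statistic, result.pvalue
```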
B. Paired t-test
Use for: Multiple runs with matched seeds
from scripts.hypothesis_testing import paired_ttest
t, p, sig, interp = paired_ttest(scores_model1, scores_model2)
# Tests: Is mean difference across runs significant?
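The paired t-test matches the two models' scores seed by seed, so both models must be evaluated with the same seed list. A minimal sketch using scipy (assumed; the bundled function also returns an interpretation string):

```python
from scipy import stats

def paired_ttest_sketch(scores_a, scores_b, alpha=0.05):
    """Two-sided paired t-test on per-seed scores from matched runs."""
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
    return t_stat, p_value, p_value < alpha

# paired_ttest_sketch([0.921, 0.925, 0.920, 0.928, 0.919],
#                     [0.887, 0.892, 0.884, 0.895, 0.882])
```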
C. Bonferroni Correction
Use for: multiple comparisons - REQUIRED when comparing CrossViT against all 5 baselines
from scripts.hypothesis_testing import bonferroni_correction
p_values = [0.0001, 0.0023, 0.0089, 0.0234, 0.0456]
sig_flags, adj_alpha, interp = bonferroni_correction(p_values)
# Adjusts α: 0.05 / 5 comparisons = 0.01
TAR UMT Requirement: Must use Bonferroni when comparing against multiple baselines to control family-wise error rate.
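The correction itself is simple: divide α by the number of comparisons and flag the p-values that still pass. A minimal sketch (the bundled function additionally returns an interpretation string):

```python
def bonferroni_sketch(p_values, alpha=0.05):
    """Flag which p-values survive the Bonferroni-adjusted significance threshold."""
    adj_alpha = alpha / len(p_values)                 # e.g. 0.05 / 5 = 0.01
    flags = [p < adj_alpha for p in p_values]
    return flags, adj_alpha

# bonferroni_sketch([0.0001, 0.0023, 0.0089, 0.0234, 0.0456])
# -> ([True, True, True, False, False], 0.01)
```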
3. APA Table Formatting
Generate publication-ready tables:
from scripts.table_formatter import format_results_table
# Build metrics dictionary
metrics = {
"Accuracy": [
(0.923, 0.911, 0.935), # CrossViT: (mean, lower_ci, upper_ci)
(0.887, 0.872, 0.902), # ResNet-50
(0.892, 0.878, 0.906) # DenseNet-121
],
"F1-Score": [(0.91, 0.89, 0.93), (0.87, 0.85, 0.89), (0.88, 0.86, 0.90)],
"AUC-ROC": [(0.94, 0.92, 0.96), (0.90, 0.88, 0.92), (0.91, 0.89, 0.93)]
}
# Generate table
table = format_results_table(
models=["CrossViT", "ResNet-50", "DenseNet-121"],
metrics=metrics,
caption="Classification Performance Comparison",
table_number=1
)
# Output: APA-formatted table with proper headers, footnotes, CI notation
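The exact layout is handled by scripts/table_formatter.py; as a hypothetical illustration of the "mean (95% CI)" cell notation such a table typically uses:

```python
def format_metric_cell(mean, lower, upper):
    """Render one metric as 'XX.X (XX.X-XX.X)' in percent: mean with its 95% CI."""
    return f"{mean * 100:.1f} ({lower * 100:.1f}-{upper * 100:.1f})"

# format_metric_cell(0.923, 0.911, 0.935) -> "92.3 (91.1-93.5)"
```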
Also available:
- format_hypothesis_table() → Statistical test results
- format_confusion_matrix_caption() → Figure captions
- format_roc_curve_caption() → ROC curve captions
- generate_latex_table() → LaTeX format (if needed)
- format_for_word() → TSV for Word import
4. Reproducibility Documentation
Generate Chapter 4 experimental setup:
from scripts.reproducibility_generator import generate_reproducibility_statement
statement = generate_reproducibility_statement(
random_seeds=[42, 123, 456, 789, 101112],
n_runs=5,
dataset_info={"name": "COVID-19 Radiography Database", ...},
model_config={"name": "CrossViT-Tiny", ...},
training_config={"epochs": 50, "batch_size": 16, ...},
hardware_info={"gpu": "RTX 4060", "vram": "8GB", ...}
)
# Copy entire output into Chapter 4 section 4.5
What it generates:
- Random seeds and multiple runs explanation
- Dataset configuration details
- Model architecture specifications
- Training hyperparameters
- Hardware specifications
- Statistical methodology explanation
- Software dependencies
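The statement documents the setup; the seeds themselves still have to be fixed inside the training code. A minimal seed-fixing sketch for a PyTorch setup (an assumption about the training stack; adapt it to the actual experiment scripts):

```python
import random
import numpy as np
import torch

def set_seed(seed: int):
    """Fix all relevant RNGs so a training run is reproducible for a given seed."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True   # trade some speed for determinism
    torch.backends.cudnn.benchmark = False

# one run per documented seed, e.g.:
# for seed in [42, 123, 456, 789, 101112]:
#     set_seed(seed)
#     ...  # launch your training entry point here
```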
Also available:
- generate_statistical_methods_section() → Theory for Chapter 3
- generate_results_reporting_template() → Formatted result paragraphs
TAR UMT Thesis Integration
Chapter 3: Research Methodology
Section 3.6: Statistical Validation Methods
Use: generate_statistical_methods_section()
Provides theory, formulas, and citations for CI estimation and hypothesis testing.
Chapter 4: Research Design
Section 4.5: Experimental Setup and Reproducibility
Use: generate_reproducibility_statement(...)
Documents all experimental details for reproducibility.
Chapter 5: Results and Evaluation
Section 5.1: Model Performance
- Use format_results_table() for the performance comparison
- All metrics MUST have 95% CI
Section 5.2: Statistical Significance
- Use paired_ttest() and bonferroni_correction()
- Use format_hypothesis_table() for the test results
- MUST reject H₀ (support H₁) at p<0.05
Section 5.3: Detailed Analysis
- Use generate_results_reporting_template() for formatted paragraphs
- Format: "CrossViT achieved X% (95% CI: Y%-Z%), significantly outperforming baseline (p<0.05)"
Section 5.4: Figures
- Use format_confusion_matrix_caption() for Figure 1
- Use format_roc_curve_caption() for Figure 2
Complete Workflow Example
Step 1: After Training (Collect Results)
# Save results from each seed
crossvit_acc = [0.921, 0.925, 0.920, 0.928, 0.919]
crossvit_f1 = [0.91, 0.92, 0.90, 0.93, 0.89]
crossvit_auc = [0.94, 0.95, 0.93, 0.96, 0.92]
# Same for all 5 baselines
resnet_acc = [0.887, 0.892, 0.884, 0.895, 0.882]
# ... etc
Step 2: Calculate Confidence Intervals
from scripts.confidence_intervals import multi_seed_ci
# For each model and metric
crossvit_acc_ci = multi_seed_ci(crossvit_acc)[:3] # (mean, lower, upper)
crossvit_f1_ci = multi_seed_ci(crossvit_f1)[:3]
crossvit_auc_ci = multi_seed_ci(crossvit_auc)[:3]
# Repeat for all baselines
resnet_acc_ci = multi_seed_ci(resnet_acc)[:3]
# ... etc
Step 3: Hypothesis Testing
from scripts.hypothesis_testing import paired_ttest, bonferroni_correction
# Compare CrossViT against each baseline
# baselines maps each baseline name to its per-seed scores, e.g.
# baselines = {"ResNet-50": resnet_acc, "DenseNet-121": densenet_acc, ...}
p_values = []
for baseline_name, baseline_seeds in baselines.items():
t, p, sig, interp = paired_ttest(crossvit_acc, baseline_seeds)
p_values.append(p)
print(f"CrossViT vs {baseline_name}: {interp}")
# Apply Bonferroni correction
sig_flags, adj_alpha, interp = bonferroni_correction(p_values)
print(f"After correction: {interp}")
Step 4: Generate Tables
from scripts.table_formatter import format_results_table
models = ["CrossViT", "ResNet-50", "DenseNet-121", "EfficientNet-B0", "ViT-B/32"]
metrics = {
"Accuracy": [crossvit_acc_ci, resnet_acc_ci, densenet_acc_ci, ...],
"F1-Score": [crossvit_f1_ci, resnet_f1_ci, densenet_f1_ci, ...],
"AUC-ROC": [crossvit_auc_ci, resnet_auc_ci, densenet_auc_ci, ...]
}
table1 = format_results_table(models, metrics, "Model Performance Comparison", 1)
Step 5: Generate Reproducibility Statement
from scripts.reproducibility_generator import generate_reproducibility_statement
statement = generate_reproducibility_statement(
random_seeds=[42, 123, 456, 789, 101112],
n_runs=5,
dataset_info={...},
model_config={...},
training_config={...},
hardware_info={...}
)
Step 6: Copy to Thesis
- Chapter 4 ← Reproducibility statement
- Chapter 5.1 ← Results table
- Chapter 5.2 ← Hypothesis testing results
- Chapter 5.3 ← Formatted analysis paragraphs
Reference Documentation
For detailed API documentation, see:
- references/api_reference.md - Complete function reference with examples
- references/thesis_integration.md - Chapter-by-chapter integration guide
Quick Reference Table
| Task | Function | Script |
|---|---|---|
| Calculate CI (5 seeds) | multi_seed_ci() | confidence_intervals.py |
| Calculate CI (single run) | normal_approximation_ci() | confidence_intervals.py |
| Calculate CI (robust) | bootstrap_ci() | confidence_intervals.py |
| Compare 2 models (single run) | mcnemar_test() | hypothesis_testing.py |
| Compare 2 models (multi-seed) | paired_ttest() | hypothesis_testing.py |
| Multiple comparisons | bonferroni_correction() | hypothesis_testing.py |
| Results table | format_results_table() | table_formatter.py |
| Hypothesis table | format_hypothesis_table() | table_formatter.py |
| Figure caption | format_confusion_matrix_caption() | table_formatter.py |
| Reproducibility statement | generate_reproducibility_statement() | reproducibility_generator.py |
| Statistical methods section | generate_statistical_methods_section() | reproducibility_generator.py |
| Results template | generate_results_reporting_template() | reproducibility_generator.py |
Critical Reminders
TAR UMT Requirements (MUST DO):
- ✅ Every metric MUST have 95% CI - Never report single numbers
- ✅ Minimum 5 seeds required - Use different random seeds for each run
- ✅ Hypothesis testing REQUIRED - Must reject H₀ (support H₁) at p<0.05
- ✅ Bonferroni correction - Apply when comparing against 5 baselines
- ✅ APA table format - Use provided formatters for consistency
- ✅ Reproducibility statement - Include in Chapter 4
- ✅ Significance markers - Use *, **, and *** for p<.05, p<.01, and p<.001 respectively
Best Practices:
- Always use paired t-test for multi-seed comparisons
- Report both mean±std AND 95% CI
- Include all individual seed results in appendix
- State null hypothesis (H₀) explicitly before testing
- Use adj_alpha from Bonferroni, not original α=0.05
Common Mistakes to Avoid:
- ❌ Reporting accuracy without a confidence interval
- ❌ Claiming "better" without a statistical test
- ❌ Using a single run for the final comparison
- ❌ Forgetting the Bonferroni correction for multiple tests
- ❌ Not stating hypotheses in Chapter 1
- ❌ Missing the reproducibility statement in Chapter 4
Integration with Other Skills
This skill works with:
- crossvit-covid19-fyp → Technical specs for experiments
- fyp-jupyter → Running experiments that generate results to validate
- fyp-chapter-bridge → Converting validated results into thesis text
- tar-umt-fyp-rds → Understanding thesis structure requirements
Typical workflow:
- Use fyp-jupyter to run experiments → Get results
- Use fyp-statistical-validator (this skill) → Validate results statistically
- Use fyp-chapter-bridge → Convert into thesis chapters
- Use tar-umt-academic-writing → Add APA citations
Testing the Skill
Run demo scripts to verify everything works:
# Test confidence intervals
python scripts/confidence_intervals.py
# Test hypothesis testing
python scripts/hypothesis_testing.py
# Test table formatting
python scripts/table_formatter.py
# Test reproducibility generator
python scripts/reproducibility_generator.py
Each script includes if __name__ == "__main__" blocks with usage examples.
Required Citations
When using these methods in your thesis, cite:
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895-1923. https://doi.org/10.1162/089976698300017197
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1-30.
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman and Hall/CRC. https://doi.org/10.1201/9780429246593
Success Criteria
You know this skill is working when:
- ✅ All metrics in Chapter 5 have 95% CI
- ✅ Hypothesis testing rejects H₀ (supports H₁) at p<0.05
- ✅ Tables formatted in APA style
- ✅ Reproducibility statement included in Chapter 4
- ✅ Bonferroni correction applied for the 5 baseline comparisons
- ✅ Results ready to copy-paste into the Word document
Summary
This skill automates all statistical validation for TAR UMT FYP.
Three core functions:
- multi_seed_ci() → Calculate confidence intervals
- paired_ttest() → Test statistical significance
- format_results_table() → Generate APA tables
Use in every thesis chapter that reports experimental results (Chapters 4-5).
Saves 20-30 hours of manual calculation and formatting during thesis writing.
For detailed documentation, see references/api_reference.md and references/thesis_integration.md