---
name: numerical-validation
description: Verify mathematical correctness and numerical accuracy after code changes
tags: testing, numerical, validation, mathematical, scientific
version: 1
---
# Numerical Validation for Scientific Code

## Overview

Verify that changes to mathematical/algorithmic code maintain numerical accuracy and mathematical properties.

**Core principle:** Capture baseline, make change, compare numerically, verify invariants, provide full analysis.

**Announce at start:** "I'm using the numerical-validation skill to verify mathematical correctness."
## When to Use This Skill

**MUST use when modifying:**

- `src/non_local_detector/core.py` (HMM algorithms)
- `src/non_local_detector/likelihoods/` (likelihood models)
- `src/non_local_detector/continuous_state_transitions.py`
- `src/non_local_detector/discrete_state_transitions.py`
- `src/non_local_detector/initial_conditions.py`
- Any code involving JAX transformations or numerical computations

**Also use when:**

- Refactoring mathematical code (tolerance: 1e-14)
- Optimizing algorithms (tolerance: 1e-10)
- Changing convergence criteria or tolerances
- Updating numerical dependencies
## Process Checklist
Copy to TodoWrite:
Numerical Validation Progress:
- [ ] Capture baseline outputs before change
- [ ] Make the code change
- [ ] Capture new outputs after change
- [ ] Compare numerical differences
- [ ] Verify mathematical invariants
- [ ] Run property-based tests
- [ ] Run golden regression tests
- [ ] Generate full analysis report
- [ ] Present analysis and request approval (if differences found)
## Detailed Steps

### Step 1: Capture Baseline Outputs

Before making any changes:

```bash
# Run tests and capture output
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    src/non_local_detector/tests/test_golden_regression.py \
    -v > /tmp/baseline_output.txt 2>&1

# Run property tests
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    -m property -v > /tmp/baseline_property.txt 2>&1

# Run snapshot tests
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    -m snapshot -v > /tmp/baseline_snapshot.txt 2>&1
```

**Save output:** keep the baseline files for comparison.
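The pytest logs confirm pass/fail, but text logs alone can hide magnitudes. Optionally, save key output arrays to disk as well so Step 4 can compare them numerically. A minimal sketch, where the `posterior` array and file path are illustrative placeholders, not the package's actual API:

```python
import numpy as np

# Hypothetical stand-in: replace with the actual model output you want
# to baseline (e.g., a posterior probability array from a fitted model).
posterior = np.random.default_rng(0).dirichlet(np.ones(3), size=100)

# Save named arrays so the post-change comparison can load them by key.
np.savez("/tmp/baseline_arrays.npz", posterior=posterior)
```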
### Step 2: Make Code Change
Implement your modification to the mathematical/algorithmic code.
### Step 3: Capture New Outputs

After making changes:

```bash
# Run the same tests
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    src/non_local_detector/tests/test_golden_regression.py \
    -v > /tmp/new_output.txt 2>&1

/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    -m property -v > /tmp/new_property.txt 2>&1

/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    -m snapshot -v > /tmp/new_snapshot.txt 2>&1
```
### Step 4: Compare Numerical Differences

**Difference tolerances:**

- Refactoring (no behavior change): max 1e-14 (floating-point noise only)
- Intentional algorithm changes: max 1e-10 (must be justified)
- Any larger difference: requires investigation and explanation

**Compare outputs:**

```bash
# Check if outputs differ
diff /tmp/baseline_output.txt /tmp/new_output.txt
```

For each difference, ask:

- Is it expected based on the change?
- What is its magnitude? (< 1e-14 is floating-point noise; < 1e-10 is acceptable for algorithm changes)
- Does it affect scientific conclusions?
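`diff` shows that the logs changed, but not by how much. If array baselines were saved in Step 1, the magnitudes can be checked directly. A minimal sketch, assuming the illustrative `/tmp/baseline_arrays.npz` from Step 1 plus a matching `/tmp/new_arrays.npz` captured in Step 3:

```python
import numpy as np

baseline = np.load("/tmp/baseline_arrays.npz")
new = np.load("/tmp/new_arrays.npz")

for key in baseline.files:
    max_diff = np.max(np.abs(baseline[key] - new[key]))
    print(f"{key}: max abs difference = {max_diff:.3e}")
    # Classify against the tolerances above.
    if max_diff < 1e-14:
        print("  -> floating-point noise (within refactoring tolerance)")
    elif max_diff < 1e-10:
        print("  -> acceptable for intentional algorithm changes (justify)")
    else:
        print("  -> requires investigation and explanation")
```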
### Step 5: Verify Mathematical Invariants

Critical invariants that must ALWAYS hold:

**Probability distributions sum to 1.0:**

```python
assert np.allclose(probabilities.sum(axis=-1), 1.0, atol=1e-10)
```

**Transition matrices are stochastic:**

```python
assert np.allclose(transition_matrix.sum(axis=-1), 1.0, atol=1e-10)
assert np.all(transition_matrix >= 0)
assert np.all(transition_matrix <= 1)
```

**Log-probabilities are finite:**

```python
assert np.all(np.isfinite(log_probs))
```

**Covariance matrices are positive semi-definite:**

```python
eigenvalues = np.linalg.eigvalsh(covariance)
assert np.all(eigenvalues >= -1e-10)
```

**Likelihoods are non-negative:**

```python
assert np.all(likelihood >= 0)
```

Verify these with tests or spot checks after changes.
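For spot checks, the assertions above can be bundled into one helper. A minimal sketch; the function and argument names are illustrative, not part of the package API:

```python
import numpy as np

def check_invariants(probabilities, transition_matrix, log_probs,
                     covariance, likelihood):
    """Spot-check the critical invariants; raises AssertionError on violation."""
    # Probability distributions sum to 1.0
    assert np.allclose(probabilities.sum(axis=-1), 1.0, atol=1e-10)
    # Transition matrices are stochastic
    assert np.allclose(transition_matrix.sum(axis=-1), 1.0, atol=1e-10)
    assert np.all((transition_matrix >= 0) & (transition_matrix <= 1))
    # Log-probabilities are finite
    assert np.all(np.isfinite(log_probs))
    # Covariance matrices are positive semi-definite
    # (allow tiny negative eigenvalues from floating-point error)
    assert np.all(np.linalg.eigvalsh(covariance) >= -1e-10)
    # Likelihoods are non-negative
    assert np.all(likelihood >= 0)
```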
### Step 6: Run Property-Based Tests

```bash
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest -m property -v
```

**Expected:** all property tests pass.

Property tests verify:

- Invariants hold across many random inputs (`hypothesis` library)
- Edge cases are handled correctly
- Mathematical properties are maintained

**If failures:** investigate why the property was violated; it is most likely a bug in your change.
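For orientation, a property test in this style might look like the following. A minimal sketch using `hypothesis`: the normalization under test is a made-up example, and the `property` marker is assumed to match the `-m property` selection used above:

```python
import numpy as np
import pytest
from hypothesis import given
from hypothesis import strategies as st
from hypothesis.extra.numpy import arrays

@pytest.mark.property
@given(arrays(np.float64, shape=(5,),
              elements=st.floats(min_value=1e-6, max_value=1e6)))
def test_normalized_weights_sum_to_one(weights):
    # Hypothetical normalization step; the invariant is what matters.
    probs = weights / weights.sum()
    assert np.allclose(probs.sum(), 1.0, atol=1e-10)
```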
### Step 7: Run Golden Regression Tests

```bash
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest \
    src/non_local_detector/tests/test_golden_regression.py -v
```

Golden regression tests:

- Use real scientific data
- Compare against validated reference outputs
- Catch subtle numerical changes that affect scientific results

**Expected for refactoring:** exact match (or < 1e-14 difference).
**Expected for algorithm changes:** document and justify any differences.
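The shape of such a test is a comparison against a stored, validated reference at tight tolerance. A minimal sketch, not the actual contents of `test_golden_regression.py`; the file paths are illustrative:

```python
import numpy as np

def test_posterior_matches_golden_reference():
    # Hypothetical paths: a validated reference output and the
    # corresponding output produced by the current code.
    reference = np.load("/tmp/golden_posterior.npy")
    current = np.load("/tmp/current_posterior.npy")
    # Refactoring tolerance: only floating-point noise is allowed.
    np.testing.assert_allclose(current, reference, rtol=0.0, atol=1e-14)
```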
### Step 8: Generate Full Analysis Report

Create a comprehensive report with four sections:

**1. Diff - What Changed:**

```
Snapshot changes:
- test_model_output: posterior probabilities differ by max 2.3e-11
- test_transition_matrix: no changes

Test output changes:
- Golden regression: 3 values differ by < 1e-10
```

**2. Explanation - Why It Changed:**

```
Changed optimizer tolerance from 1e-6 to 1e-8, resulting in:
- More precise convergence
- Slight differences in final parameter estimates
- Differences are within acceptable scientific tolerance
```

**3. Validation - Invariants Still Hold:**

```
Verified:
✓ All probabilities sum to 1.0 (max deviation: 3.4e-15)
✓ Transition matrices stochastic (max row sum deviation: 1.2e-14)
✓ No NaN or Inf values in any outputs
✓ All property tests pass (42/42)
✓ Covariance matrices positive semi-definite
```
**4. Test Case - Demonstrate Correctness:**

```python
# Before change:
old_result = [0.342156, 0.657844]  # Posterior at time 10

# After change:
new_result = [0.342156023, 0.657843977]  # Posterior at time 10

# Difference: 2.3e-8 (> 1e-10, so it requires explicit justification)
# Scientific interpretation: no change to conclusions;
# both results indicate a strong preference for state 2.
```
### Step 9: Present Analysis and Request Approval

**If differences are within tolerance (< 1e-14 for refactoring):**

- Present analysis for information
- Proceed with change
- No approval needed

**If differences are 1e-14 to 1e-10:**

- Present full analysis
- Explain why differences are acceptable
- Request approval: "These differences are expected and scientifically acceptable. Approve?"
- Wait for user response

**If differences are > 1e-10:**

- Present full analysis
- Explain significance of differences
- Provide scientific justification
- Request explicit approval
- If rejected: investigate further or revert change
## Approval Process

For snapshot updates with numerical changes:

1. Generate the full analysis (all four sections above)
2. Present it to the user
3. Ask: "These changes are [expected/acceptable/significant]. Approve snapshot update?"
4. If approved, the user will set the approval flag; then run:

```bash
/Users/edeno/miniconda3/envs/non_local_detector/bin/pytest --snapshot-update
```
## Integration with Other Skills
- Use with scientific-tdd: After implementing new feature, validate numerics
- Use with safe-refactoring: Verify no numerical changes during refactoring
- Use with jax: After JAX optimizations, verify numerical equivalence
## Tolerance Guidelines
| Change Type | Max Acceptable Difference | Approval Required |
|---|---|---|
| Pure refactoring | 1e-14 | No |
| Code optimization | 1e-10 | Yes (informational) |
| Algorithm modification | 1e-10 | Yes (justification) |
| Any change | > 1e-10 | Yes (strong justification) |
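If it helps to apply the table programmatically, it can be encoded as a small helper. A minimal sketch; the function and its strings are illustrative, not part of any existing tooling:

```python
def approval_requirement(change_type: str, max_abs_diff: float) -> str:
    """Map a change type and observed max difference to the table's approval rule."""
    if max_abs_diff > 1e-10:
        return "Yes (strong justification)"
    if change_type == "pure refactoring":
        return "No" if max_abs_diff <= 1e-14 else "Yes (investigate: exceeds 1e-14)"
    if change_type == "code optimization":
        return "Yes (informational)"
    if change_type == "algorithm modification":
        return "Yes (justification)"
    return "Yes (unknown change type: treat conservatively)"
```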
## Example Workflow

**Task:** Refactor HMM filtering to use `scan` instead of a for loop.
1. Capture baseline:
- Run golden regression: All pass
- Run property tests: 42 pass
- Save outputs to /tmp/baseline_*
2. Make change:
- Replace for loop with jax.lax.scan
- Maintain identical logic
3. Capture new outputs:
- Run same tests: All pass
- Save outputs to /tmp/new_*
4. Compare:
- Max difference: 4.2e-15 (floating-point noise)
- Within refactoring tolerance
5. Verify invariants:
✓ Probabilities sum to 1.0
✓ No NaN/Inf
✓ Property tests pass
6. Report:
"Refactoring complete. Max numerical difference: 4.2e-15 (floating-point noise).
All invariants verified. No approval needed."
## Red Flags

**Don't:**
- Skip baseline capture
- Ignore numerical differences > 1e-14
- Assume "small" differences don't matter
- Update snapshots without analysis
- Skip property or golden regression tests
- Proceed with NaN/Inf in outputs
**Do:**
- Always capture baseline before changes
- Investigate all unexpected differences
- Verify mathematical invariants explicitly
- Provide full analysis for any differences
- Get approval before snapshot updates
- Document tolerance justifications