scientific-python-testing (@uw-ssec/rse-agents)

SKILL.md

name: scientific-python-testing
description: Write robust, maintainable tests for scientific Python packages using pytest best practices following Scientific Python community guidelines

Scientific Python Testing with pytest

A comprehensive guide to writing effective tests for scientific Python packages using pytest, following the Scientific Python Community guidelines and testing tutorial. This skill focuses on modern testing patterns, fixtures, parametrization, and best practices specific to scientific computing.

Quick Reference Card

Common Testing Tasks - Quick Decisions:

# 1. Basic test → Use simple assert
def test_function():
    assert result == expected

# 2. Floating-point comparison → Use approx
from pytest import approx
assert result == approx(expected, rel=1e-6)

# 3. Testing exceptions → Use pytest.raises
with pytest.raises(ValueError, match="must be positive"):
    function(-1)

# 4. Multiple inputs → Use parametrize
@pytest.mark.parametrize("input,expected", [(1,1), (2,4), (3,9)])
def test_square(input, expected):
    assert input**2 == expected

# 5. Reusable setup → Use fixture
@pytest.fixture
def sample_data():
    return np.array([1, 2, 3, 4, 5])

# 6. NumPy arrays → Use approx or numpy.testing
assert np.mean(data) == approx(3.0)

Decision Tree:

  • Need multiple test cases with same logic? → Parametrize
  • Need reusable test data/setup? → Fixture
  • Testing floating-point results? → pytest.approx
  • Testing exceptions/warnings? → pytest.raises / pytest.warns
  • Complex numerical arrays? → numpy.testing.assert_allclose
  • Organizing by speed? → Markers and separate directories

When to Use This Skill

  • Writing tests for scientific Python packages and libraries
  • Testing numerical algorithms and scientific computations
  • Setting up test infrastructure for research software
  • Implementing continuous integration for scientific code
  • Testing data analysis pipelines and workflows
  • Validating scientific simulations and models
  • Ensuring reproducibility and correctness of research code
  • Testing code that uses NumPy, SciPy, Pandas, and other scientific libraries

Core Concepts

1. Why pytest for Scientific Python

pytest is the de facto standard for testing Python packages because it:

  • Simple syntax: Just use Python's assert statement
  • Detailed reporting: Clear, informative failure messages
  • Powerful features: Fixtures, parametrization, marks, plugins
  • Scientific ecosystem: Native support for NumPy arrays, approximate comparisons
  • Community standard: Used by NumPy, SciPy, Pandas, scikit-learn, and more

2. Test Structure and Organization

Standard test directory layout:

my-package/
├── src/
│   └── my_package/
│       ├── __init__.py
│       ├── analysis.py
│       └── utils.py
├── tests/
│   ├── conftest.py
│   ├── test_analysis.py
│   └── test_utils.py
└── pyproject.toml

Key principles:

  • Tests directory separate from source code (alongside src/)
  • Test files named test_*.py (pytest discovery)
  • Test functions named test_* (pytest discovery)
  • No __init__.py in tests directory (avoid importability issues)
  • Test against installed package, not local source

3. pytest Configuration

Configure pytest in pyproject.toml (recommended for modern packages):

[tool.pytest.ini_options]
minversion = "7.0"
addopts = [
    "-ra",              # Show summary of all test outcomes
    "--showlocals",     # Show local variables in tracebacks
    "--strict-markers", # Error on undefined markers
    "--strict-config",  # Error on config issues
]
xfail_strict = true     # xfail tests must fail
filterwarnings = [
    "error",            # Treat warnings as errors
]
log_cli_level = "info"  # Log level for test output
testpaths = [
    "tests",            # Limit pytest to tests directory
]

Testing Principles

Following the Scientific Python testing recommendations, effective testing provides multiple benefits and should follow key principles:

Advantages of Testing

  • Trustworthy code: Well-tested code behaves as expected and can be relied upon
  • Living documentation: Tests communicate intent and expected behavior, validated with each run
  • Preventing failure: Tests protect against implementation errors and unexpected dependency changes
  • Confidence when making changes: Thorough test suites enable adding features, fixing bugs, and refactoring with confidence

Fundamental Principles

1. Any test case is better than none

When in doubt, write the test that makes sense at the time:

  • Test critical behaviors, features, and logic
  • Write clear, expressive, well-documented tests
  • Tests are documentation of developer intentions
  • Good tests make it clear what they are testing and how

Don't get bogged down in taxonomy when learning—focus on writing tests that work.

2. As long as that test is correct

It's surprisingly easy to write tests that pass when they should fail:

  • Check that your test fails when it should: Deliberately break the code and verify the test fails
  • Keep it simple: Excessive mocks and fixtures make it difficult to know what's being tested
  • Test one thing at a time: A single test should test a single behavior
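
For example, here is a minimal sketch (reusing the hypothetical compute_mean from later examples) of a test that can never fail because the assert keyword is missing; deliberately breaking compute_mean and re-running is the quickest way to catch this kind of mistake:

# Broken: the comparison is evaluated and discarded, so this test always passes
def test_mean_missing_assert():
    data = [1.0, 2.0, 3.0]
    compute_mean(data) == 2.0  # no assert, so nothing is checked

# Fixed: the assert actually verifies the behavior (and fails if compute_mean is broken)
def test_mean_checks_result():
    data = [1.0, 2.0, 3.0]
    assert compute_mean(data) == 2.0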

3. Start with Public Interface Tests

Begin by testing from the perspective of a user:

  • Test code as users will interact with it
  • Keep tests simple and readable for documentation purposes
  • Focus on supported use cases
  • Avoid testing private attributes
  • Minimize use of mocks/patches
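
As a sketch (the LinearModel class and its attributes here are hypothetical), prefer asserting on results a user relies on rather than on internal state:

from pytest import approx

def test_fit_recovers_slope():
    model = LinearModel()
    model.fit(x=[0.0, 1.0, 2.0], y=[0.0, 2.0, 4.0])

    # Good: public, documented behavior
    assert model.slope == approx(2.0)

    # Avoid: private implementation detail that may change freely
    # assert model._design_matrix.shape == (3, 2)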

4. Organize Tests into Suites

Divide tests by type and execution time for efficiency:

  • Unit tests: Fast, isolated tests of individual components
  • Integration tests: Tests of component interactions and dependencies
  • End-to-end tests: Complete workflow testing

Benefits:

  • Run relevant tests quickly and frequently
  • "Fail fast" by running fast suites first
  • Easier to read and reason about
  • Avoid false positives from expected external failures

Outside-In Testing Approach

The recommended approach is outside-in, starting from the user's perspective:

  1. Public Interface Tests: Test from user perspective, focusing on behavior and features
  2. Integration Tests: Test that components work together and with dependencies
  3. Unit Tests: Test individual units in isolation, optimized for speed

This approach ensures you're building the right thing before optimizing implementation details.

Quick Start

Minimal Test Example

# tests/test_basic.py

def test_simple_math():
    """Test basic arithmetic."""
    assert 4 == 2**2

def test_string_operations():
    """Test string methods."""
    result = "hello world".upper()
    assert result == "HELLO WORLD"
    assert "HELLO" in result

Scientific Test Example

# tests/test_scientific.py
import numpy as np
from pytest import approx

from my_package.analysis import compute_mean, fit_linear

def test_compute_mean():
    """Test mean calculation."""
    data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    result = compute_mean(data)
    assert result == approx(3.0)

def test_fit_linear():
    """Test linear regression."""
    x = np.array([0, 1, 2, 3, 4])
    y = np.array([0, 2, 4, 6, 8])
    slope, intercept = fit_linear(x, y)
    
    assert slope == approx(2.0)
    assert intercept == approx(0.0)

Testing Best Practices

Pattern 1: Writing Simple, Focused Tests

Bad - Multiple assertions testing different things:

def test_everything():
    data = load_data("input.csv")
    assert len(data) > 0
    processed = process_data(data)
    assert processed.mean() > 0
    result = analyze(processed)
    assert result.success

Good - Separate tests for each behavior:

def test_load_data_returns_nonempty():
    """Data loading should return at least one row."""
    data = load_data("input.csv")
    assert len(data) > 0

def test_process_data_positive_mean():
    """Processed data should have positive mean."""
    data = load_data("input.csv")
    processed = process_data(data)
    assert processed.mean() > 0

def test_analyze_succeeds():
    """Analysis should complete successfully."""
    data = load_data("input.csv")
    processed = process_data(data)
    result = analyze(processed)
    assert result.success

Arrange-Act-Assert pattern:

def test_computation():
    # Arrange - Set up test data
    data = np.array([1, 2, 3, 4, 5])
    expected = 3.0
    
    # Act - Execute the function
    result = compute_mean(data)
    
    # Assert - Check the result
    assert result == approx(expected)

Pattern 2: Testing for Failures

Always test that your code raises appropriate exceptions:

import pytest

def test_zero_division_raises():
    """Division by zero should raise ZeroDivisionError."""
    with pytest.raises(ZeroDivisionError):
        result = 1 / 0

def test_invalid_input_raises():
    """Invalid input should raise ValueError."""
    with pytest.raises(ValueError, match="must be positive"):
        result = compute_sqrt(-1)

def test_deprecation_warning():
    """Deprecated function should warn."""
    with pytest.warns(DeprecationWarning):
        result = old_function()

def test_deprecated_call():
    """Check for deprecated API usage."""
    with pytest.deprecated_call():
        result = legacy_api()
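
When the exception's message or attributes matter beyond a simple match=, pytest.raises can also capture an ExceptionInfo object for inspection (reusing the hypothetical compute_sqrt above):

def test_invalid_input_error_details():
    """Inspect the raised exception object directly."""
    with pytest.raises(ValueError) as excinfo:
        compute_sqrt(-1)
    assert "must be positive" in str(excinfo.value)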

Pattern 3: Approximate Comparisons

Scientific computing often involves floating-point arithmetic that cannot be tested for exact equality:

For scalars:

from pytest import approx

def test_approximate_scalar():
    """Test with approximate comparison."""
    result = 1 / 3
    assert result == approx(0.33333333333, rel=1e-10)
    
    # Default relative tolerance is 1e-6
    assert 0.3 + 0.3 == approx(0.6)

def test_approximate_with_absolute_tolerance():
    """Test with absolute tolerance."""
    result = compute_small_value()
    assert result == approx(0.0, abs=1e-10)

For NumPy arrays (preferred over numpy.testing):

import numpy as np
from pytest import approx

def test_array_approximate():
    """Test NumPy arrays with approx."""
    result = np.array([0.1, 0.2, 0.3]) * 3
    expected = np.array([0.3, 0.6, 0.9])
    assert result == approx(expected)

def test_array_with_nan():
    """Handle NaN values in arrays."""
    result = np.array([1.0, np.nan, 3.0])
    expected = np.array([1.0, np.nan, 3.0])
    assert result == approx(expected, nan_ok=True)

When to use numpy.testing:

import numpy as np
from numpy.testing import assert_allclose, assert_array_equal

def test_exact_integer_array():
    """Use numpy.testing for exact integer comparisons."""
    result = np.array([1, 2, 3])
    expected = np.array([1, 2, 3])
    assert_array_equal(result, expected)

def test_complex_array_tolerances():
    """Use numpy.testing for complex tolerance requirements."""
    result = compute_result()
    expected = load_reference()
    assert_allclose(result, expected, rtol=1e-7, atol=1e-10)

Pattern 4: Using Fixtures

Fixtures provide reusable test setup and teardown:

Basic fixtures:

import pytest
import numpy as np

@pytest.fixture
def sample_data():
    """Provide sample data for tests."""
    return np.array([1.0, 2.0, 3.0, 4.0, 5.0])

@pytest.fixture
def empty_array():
    """Provide empty array for edge case tests."""
    return np.array([])

def test_mean_with_fixture(sample_data):
    """Test using fixture."""
    result = np.mean(sample_data)
    assert result == approx(3.0)

def test_empty_array(empty_array):
    """Test edge case with empty array."""
    with pytest.warns(RuntimeWarning):
        result = np.mean(empty_array)
        assert np.isnan(result)

Fixtures with setup and teardown:

import pytest
import tempfile
from pathlib import Path

@pytest.fixture
def temp_datafile():
    """Create temporary data file for tests."""
    # Setup
    tmpfile = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')
    tmpfile.write("1.0\n2.0\n3.0\n")
    tmpfile.close()
    
    # Provide to test
    yield Path(tmpfile.name)
    
    # Teardown
    Path(tmpfile.name).unlink()

def test_load_data(temp_datafile):
    """Test data loading from file."""
    data = np.loadtxt(temp_datafile)
    assert len(data) == 3
    assert data[0] == approx(1.0)

Fixture scopes:

@pytest.fixture(scope="function")  # Default, run for each test
def data_per_test():
    return create_data()

@pytest.fixture(scope="class")  # Run once per test class
def data_per_class():
    return create_data()

@pytest.fixture(scope="module")  # Run once per module
def data_per_module():
    return load_large_dataset()

@pytest.fixture(scope="session")  # Run once per test session
def database_connection():
    conn = create_connection()
    yield conn
    conn.close()

Auto-use fixtures:

@pytest.fixture(autouse=True)
def reset_random_seed():
    """Reset random seed before each test for reproducibility."""
    np.random.seed(42)

Pattern 5: Parametrized Tests

Test the same function with multiple inputs:

Basic parametrization:

import pytest

@pytest.mark.parametrize("input_val,expected", [
    (0, 0),
    (1, 1),
    (2, 4),
    (3, 9),
    (-2, 4),
])
def test_square(input_val, expected):
    """Test squaring with multiple inputs."""
    assert input_val**2 == expected

@pytest.mark.parametrize("angle", [0, np.pi/6, np.pi/4, np.pi/3, np.pi/2])
def test_sine_range(angle):
    """Test sine function returns values in [0, 1] for first quadrant."""
    result = np.sin(angle)
    assert 0 <= result <= 1

Multiple parameters:

@pytest.mark.parametrize("n_air,n_water", [
    (1.0, 1.33),
    (1.0, 1.5),
    (1.5, 1.0),
])
def test_refraction(n_air, n_water):
    """Test Snell's law with different refractive indices."""
    angle_in = np.pi / 4
    angle_out = snell(angle_in, n_air, n_water)
    assert angle_out >= 0
    assert angle_out <= np.pi / 2
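
Stacking parametrize decorators runs every combination of the parameters (the Cartesian product), which is handy for sweeping grids of inputs; stable_solver below is a hypothetical placeholder:

@pytest.mark.parametrize("dtype", [np.float32, np.float64])
@pytest.mark.parametrize("size", [10, 100, 1000])
def test_solver_grid(dtype, size):
    """Runs 2 x 3 = 6 cases, one per (dtype, size) combination."""
    data = np.ones(size, dtype=dtype)
    result = stable_solver(data)
    assert result.shape == data.shape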

Parametrized fixtures:

@pytest.fixture(params=[1, 2, 3], ids=["one", "two", "three"])
def dimension(request):
    """Parametrized fixture for different dimensions."""
    return request.param

def test_array_creation(dimension):
    """Test array creation in different dimensions."""
    shape = tuple([10] * dimension)
    arr = np.zeros(shape)
    assert arr.ndim == dimension
    assert arr.shape == shape

Combining parametrization with custom IDs:

@pytest.mark.parametrize(
    "data,expected",
    [
        (np.array([1, 2, 3]), 2.0),
        (np.array([1, 1, 1]), 1.0),
        (np.array([0, 10]), 5.0),
    ],
    ids=["sequential", "constant", "extremes"]
)
def test_mean_with_ids(data, expected):
    """Test mean with descriptive test IDs."""
    assert np.mean(data) == approx(expected)
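
Individual cases can also carry their own marks via pytest.param, for example to expect a failure for one input while the rest must pass (again using the hypothetical compute_sqrt from earlier):

@pytest.mark.parametrize("value,expected", [
    (4.0, 2.0),
    (9.0, 3.0),
    pytest.param(-1.0, None, marks=pytest.mark.xfail(raises=ValueError)),
])
def test_sqrt_cases(value, expected):
    """Negative input is expected to raise; other cases must pass."""
    assert compute_sqrt(value) == approx(expected)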

Pattern 6: Test Organization with Markers

Use markers to organize and selectively run tests:

Basic markers:

import pytest

@pytest.mark.slow
def test_expensive_computation():
    """Test that takes a long time."""
    result = run_simulation(n_iterations=1000000)
    assert result.converged

@pytest.mark.requires_gpu
def test_gpu_acceleration():
    """Test that requires GPU hardware."""
    result = compute_on_gpu(large_array)
    assert result.success

@pytest.mark.integration
def test_full_pipeline():
    """Integration test for complete workflow."""
    data = load_data()
    processed = preprocess(data)
    result = analyze(processed)
    output = save_results(result)
    assert output.exists()

Running specific markers:

pytest -m slow              # Run only slow tests
pytest -m "not slow"        # Skip slow tests
pytest -m "slow or gpu"     # Run slow OR gpu tests
pytest -m "slow and integration"  # Run slow AND integration tests

Skip and xfail markers:

@pytest.mark.skip(reason="Feature not implemented yet")
def test_future_feature():
    """Test for feature under development."""
    result = future_function()
    assert result.success

import sys

@pytest.mark.skipif(sys.version_info < (3, 10), reason="Requires Python 3.10+")
def test_pattern_matching():
    """Test using Python 3.10+ features."""
    value = 0
    match value:
        case 0:
            result = "zero"
        case _:
            result = "other"
    assert result == "zero"

@pytest.mark.xfail(reason="Known bug in upstream library")
def test_known_failure():
    """Test that currently fails due to known issue."""
    result = buggy_function()
    assert result == expected

@pytest.mark.xfail(strict=True)
def test_must_fail():
    """Strict xfail: the run reports an error if this test unexpectedly passes."""
    result = unimplemented_function()  # raises NotImplementedError, so the test fails as expected
    assert result is not None

Custom markers in pyproject.toml:

[tool.pytest.ini_options]
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
    "requires_gpu: marks tests that need GPU hardware",
    "integration: marks tests as integration tests",
    "unit: marks tests as unit tests",
]

Pattern 6b: Organizing Test Suites by Directory

Following Scientific Python recommendations, organize tests into separate directories by type and execution time:

tests/
├── unit/               # Fast, isolated unit tests
│   ├── conftest.py
│   ├── test_analysis.py
│   └── test_utils.py
├── integration/        # Integration tests
│   ├── conftest.py
│   └── test_pipeline.py
├── e2e/               # End-to-end tests
│   └── test_workflows.py
└── conftest.py        # Shared fixtures

Run specific test suites:

# Run only unit tests (fast)
pytest tests/unit/

# Run integration tests after unit tests pass
pytest tests/integration/

# Run all tests
pytest

Auto-mark all tests in a directory using conftest.py:

# tests/unit/conftest.py
import pytest

def pytest_collection_modifyitems(session, config, items):
    """Automatically mark all tests in this directory as unit tests."""
    for item in items:
        item.add_marker(pytest.mark.unit)

Benefits of organized test suites:

  • Run fast tests first ("fail fast" principle)
  • Developers can run relevant tests quickly
  • Clear separation of test types
  • Avoid false positives from slow/flaky tests
  • Better CI/CD optimization

Example test runner strategy:

# Run fast unit tests first, stop on failure
pytest tests/unit/ -x || exit 1

# If unit tests pass, run integration tests
pytest tests/integration/ -x || exit 1

# Finally run slow end-to-end tests
pytest tests/e2e/

Pattern 7: Mocking and Monkeypatching

Mock expensive operations or external dependencies:

Basic monkeypatching:

import platform

def test_platform_specific_behavior(monkeypatch):
    """Test behavior on different platforms."""
    # Mock platform.system() to return "Linux"
    monkeypatch.setattr(platform, "system", lambda: "Linux")
    result = get_platform_specific_path()
    assert result == "/usr/local/data"
    
    # Change mock to return "Windows"
    monkeypatch.setattr(platform, "system", lambda: "Windows")
    result = get_platform_specific_path()
    assert result == r"C:\Users\data"

Mocking with pytest-mock:

import pytest
from unittest.mock import Mock

def test_expensive_computation(mocker):
    """Mock expensive computation."""
    # Mock the expensive function
    mock_compute = mocker.patch("my_package.analysis.expensive_compute")
    mock_compute.return_value = 42
    
    result = run_analysis()
    
    # Verify the mock was called
    mock_compute.assert_called_once()
    assert result == 42

def test_matplotlib_plotting(mocker):
    """Test plotting without creating actual plots."""
    mock_plt = mocker.patch("matplotlib.pyplot")
    
    create_plot(data)
    
    # Verify plot was created
    mock_plt.figure.assert_called_once()
    mock_plt.plot.assert_called_once()
    mock_plt.savefig.assert_called_once_with("output.png")

Fixture for repeated mocking:

import matplotlib.pyplot as plt

@pytest.fixture
def mock_matplotlib(mocker):
    """Mock matplotlib for testing plots."""
    fig = mocker.Mock(spec=plt.Figure)
    ax = mocker.Mock(spec=plt.Axes)
    line2d = mocker.Mock(name="plot", spec=plt.Line2D)
    ax.plot.return_value = (line2d,)
    
    mpl = mocker.patch("matplotlib.pyplot", autospec=True)
    mocker.patch("matplotlib.pyplot.subplots", return_value=(fig, ax))
    
    return {"fig": fig, "ax": ax, "mpl": mpl}

def test_my_plot(mock_matplotlib):
    """Test plotting function."""
    ax = mock_matplotlib["ax"]
    my_plotting_function(ax=ax)
    
    ax.plot.assert_called_once()
    ax.set_xlabel.assert_called_once()

Pattern 8: Testing Against Installed Version

Always test the installed package, not local source:

Why this matters:

my-package/
├── src/
│   └── my_package/
│       ├── __init__.py
│       ├── data/            # Data files
│       │   └── reference.csv
│       └── analysis.py
└── tests/
    └── test_analysis.py

Use src/ layout + editable install:

# Install in editable mode
pip install -e .

# Run tests against installed version
pytest

Benefits:

  • Tests ensure package installs correctly
  • Catches missing files (like data files)
  • Tests work in CI/CD environments
  • Validates package structure and imports

In tests, import from package:

# Good - imports installed package
from my_package.analysis import compute_mean

# Bad - imports the local module directly, bypassing the installed package
# from analysis import compute_mean
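
One concrete payoff of testing the installed package is catching data files that never made it into the build. A sketch using importlib.resources (assuming data/reference.csv from the layout above is meant to ship with the package):

from importlib.resources import files

def test_reference_data_is_packaged():
    """Fails if data/reference.csv was not included in the installed package."""
    ref = files("my_package").joinpath("data/reference.csv")
    assert ref.is_file()
    assert len(ref.read_text()) > 0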

Pattern 8b: Import Best Practices in Tests

Following Scientific Python unit testing guidelines, proper import patterns make tests more maintainable:

Keep imports local to file under test:

# Good - Import from the file being tested
from my_package.analysis import MyClass, compute_mean

def test_compute_mean():
    """Test imports from module under test."""
    data = MyClass()
    result = compute_mean(data)
    assert result > 0

Why this matters:

  • When internal code is reorganized, tests only break if the module they import from changes
  • Tests depend only on the symbols exposed by the module under test
  • Reduces coupling between tests and internal code organization

Import specific symbols, not entire modules:

# Good - Specific imports, easy to mock
from numpy import mean as np_mean, ndarray as NpArray

def my_function(data: NpArray) -> float:
    return np_mean(data)

# Good - Easy to patch in tests
def test_my_function(mocker):
    mock_mean = mocker.patch("my_package.analysis.np_mean")
    # ...

# Less ideal - Harder to mock effectively
import numpy as np

def my_function(data: np.ndarray) -> float:
    return np.mean(data)

# Less ideal - Complex patching required
def test_my_function(mocker):
    # Must patch through the aliased namespace
    mock_mean = mocker.patch("my_package.analysis.np.mean")
    # ...

Consider meaningful aliases:

# Make imports meaningful to your domain
from numpy import sum as numeric_sum
from scipy.stats import ttest_ind as statistical_test

# Easy to understand and replace
result = numeric_sum(values)
p_value = statistical_test(group1, group2)

This approach makes it easier to:

  • Replace implementations without changing tests
  • Mock dependencies effectively
  • Understand code purpose from import names

Running pytest

Basic Usage

# Run all tests
pytest

# Run specific file
pytest tests/test_analysis.py

# Run specific test
pytest tests/test_analysis.py::test_mean

# Run tests matching pattern
pytest -k "mean or median"

# Verbose output
pytest -v

# Show local variables in failures
pytest -l  # or --showlocals

# Stop at first failure
pytest -x

# Show stdout/stderr
pytest -s

Debugging Tests

# Drop into debugger on failure
pytest --pdb

# Drop into debugger at start of each test
pytest --trace

# Run last failed tests
pytest --lf

# Run failed tests first, then rest
pytest --ff

# Show which tests would be run (dry run)
pytest --collect-only

Coverage

# Install pytest-cov
pip install pytest-cov

# Run with coverage
pytest --cov=my_package

# With coverage report
pytest --cov=my_package --cov-report=html

# With missing lines
pytest --cov=my_package --cov-report=term-missing

# Fail if coverage below threshold
pytest --cov=my_package --cov-fail-under=90

Configure in pyproject.toml:

[tool.pytest.ini_options]
addopts = [
    "--cov=my_package",
    "--cov-report=term-missing",
    "--cov-report=html",
]

[tool.coverage.run]
source = ["src"]
omit = [
    "*/tests/*",
    "*/__init__.py",
]

[tool.coverage.report]
exclude_lines = [
    "pragma: no cover",
    "def __repr__",
    "raise AssertionError",
    "raise NotImplementedError",
    "if __name__ == .__main__.:",
    "if TYPE_CHECKING:",
    "@abstractmethod",
]

Scientific Python Testing Patterns

Pattern 9: Testing Numerical Algorithms

import numpy as np
from pytest import approx

def test_numerical_stability():
    """Test algorithm is numerically stable."""
    data = np.array([1e16, 1.0, -1e16])  # naive left-to-right summation would lose the 1.0
    result = stable_sum(data)
    assert result == approx(1.0)

def test_convergence():
    """Test iterative algorithm converges."""
    x0 = np.array([1.0, 1.0, 1.0])
    result = iterative_solver(x0, tol=1e-8, max_iter=1000)
    
    assert result.converged
    assert result.iterations < 1000
    assert result.residual < 1e-8

def test_against_analytical_solution():
    """Test against known analytical result."""
    x = np.linspace(0, 1, 100)
    numerical = compute_integral(lambda t: t**2, x)
    analytical = x**3 / 3
    assert numerical == approx(analytical, rel=1e-6)

def test_conservation_law():
    """Test that physical conservation law holds."""
    initial_energy = compute_energy(system)
    system.evolve(dt=0.01, steps=1000)
    final_energy = compute_energy(system)
    
    # Energy should be conserved (within numerical error)
    assert final_energy == approx(initial_energy, rel=1e-10)

Pattern 10: Testing with Different NumPy dtypes

@pytest.mark.parametrize("dtype", [
    np.float32,
    np.float64,
    np.complex64,
    np.complex128,
])
def test_computation_dtypes(dtype):
    """Test function works with different dtypes."""
    data = np.array([1, 2, 3, 4, 5], dtype=dtype)
    result = compute_transform(data)
    
    assert result.dtype == dtype
    assert result.shape == data.shape

@pytest.mark.parametrize("dtype", [np.int32, np.int64, np.float32, np.float64])
def test_integer_and_float_types(dtype):
    """Test handling of integer and float types."""
    arr = np.array([1, 2, 3], dtype=dtype)
    result = safe_divide(arr, 2)
    
    # Result should be floating point
    assert result.dtype in [np.float32, np.float64]

Pattern 11: Testing Random/Stochastic Code

def test_random_with_seed():
    """Test random code with fixed seed for reproducibility."""
    np.random.seed(42)
    result1 = generate_random_samples(n=100)
    
    np.random.seed(42)
    result2 = generate_random_samples(n=100)
    
    # Should get identical results with same seed
    assert np.array_equal(result1, result2)

def test_statistical_properties():
    """Test statistical properties of random output."""
    np.random.seed(123)
    samples = generate_normal_samples(n=100000, mean=0, std=1)
    
    # Test mean and std are close to expected (not exact due to randomness)
    assert np.mean(samples) == approx(0, abs=0.01)
    assert np.std(samples) == approx(1, abs=0.01)

@pytest.mark.parametrize("seed", [42, 123, 456])
def test_reproducibility_with_seeds(seed):
    """Test reproducibility with different seeds."""
    np.random.seed(seed)
    result = stochastic_algorithm()
    
    # Should complete successfully regardless of seed
    assert result.success
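
The examples above use the legacy global np.random.seed. Newer NumPy code typically uses the Generator API, where a seeded generator is created explicitly and passed in, avoiding global state; the rng parameter on generate_random_samples is a hypothetical design:

def test_random_with_generator():
    """Same seed, same Generator stream, identical results."""
    rng1 = np.random.default_rng(42)
    rng2 = np.random.default_rng(42)

    result1 = generate_random_samples(n=100, rng=rng1)
    result2 = generate_random_samples(n=100, rng=rng2)

    assert np.array_equal(result1, result2)

@pytest.fixture
def rng():
    """Seeded Generator fixture for reproducible stochastic tests."""
    return np.random.default_rng(12345)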

Pattern 12: Testing Data Pipelines

def test_pipeline_end_to_end(tmp_path):
    """Test complete data pipeline."""
    # Arrange - Create input data
    input_file = tmp_path / "input.csv"
    input_file.write_text("x,y\n1,2\n3,4\n5,6\n")
    
    output_file = tmp_path / "output.csv"
    
    # Act - Run pipeline
    result = run_pipeline(input_file, output_file)
    
    # Assert - Check results
    assert result.success
    assert output_file.exists()
    
    output_data = np.loadtxt(output_file, delimiter=",", skiprows=1)
    assert len(output_data) == 3

def test_pipeline_stages_independently():
    """Test each pipeline stage separately."""
    # Test stage 1
    raw_data = load_data("input.csv")
    assert len(raw_data) > 0
    
    # Test stage 2
    cleaned = clean_data(raw_data)
    assert not np.any(np.isnan(cleaned))
    
    # Test stage 3
    transformed = transform_data(cleaned)
    assert transformed.shape == cleaned.shape
    
    # Test stage 4
    result = analyze_data(transformed)
    assert result.metrics["r2"] > 0.9

Pattern 13: Property-Based Testing with Hypothesis

For complex scientific code, consider property-based testing:

from hypothesis import given, strategies as st
from hypothesis.extra.numpy import arrays
import numpy as np

@given(arrays(np.float64, shape=st.integers(1, 100),
              elements=st.floats(-1e9, 1e9)))
def test_mean_is_bounded(arr):
    """Mean should lie between min and max for finite, bounded inputs."""
    mean = np.mean(arr)
    assert np.min(arr) <= mean <= np.max(arr)

@given(
    x=arrays(np.float64, shape=10, elements=st.floats(-100, 100)),
    y=arrays(np.float64, shape=10, elements=st.floats(-100, 100))
)
def test_linear_fit_properties(x, y):
    """Test properties of linear regression."""
    if not (np.any(np.isnan(x)) or np.any(np.isnan(y))):
        slope, intercept = fit_linear(x, y)
        
        # Predictions should be finite
        predictions = slope * x + intercept
        assert np.all(np.isfinite(predictions))

Test Configuration Examples

Complete pyproject.toml Testing Section

[tool.pytest.ini_options]
minversion = "7.0"
addopts = [
    "-ra",                          # Show summary of all test outcomes
    "--showlocals",                 # Show local variables in tracebacks
    "--strict-markers",             # Error on undefined markers
    "--strict-config",              # Error on config issues
    "--cov=my_package",             # Coverage for package
    "--cov-report=term-missing",    # Show missing lines
    "--cov-report=html",            # HTML coverage report
]
xfail_strict = true                 # xfail tests must fail
filterwarnings = [
    "error",                        # Treat warnings as errors
    "ignore::DeprecationWarning:pkg_resources",  # Ignore specific warning
    "ignore::PendingDeprecationWarning",
]
log_cli_level = "info"              # Log level for test output
testpaths = [
    "tests",                        # Test directory
]
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
    "integration: marks tests as integration tests",
    "requires_gpu: marks tests that need GPU hardware",
]

[tool.coverage.run]
source = ["src"]
omit = [
    "*/tests/*",
    "*/__init__.py",
    "*/conftest.py",
]
branch = true                       # Measure branch coverage

[tool.coverage.report]
exclude_lines = [
    "pragma: no cover",
    "def __repr__",
    "raise AssertionError",
    "raise NotImplementedError",
    "if __name__ == .__main__.:",
    "if TYPE_CHECKING:",
    "@abstractmethod",
]
precision = 2
show_missing = true
skip_covered = false

conftest.py for Shared Fixtures

# tests/conftest.py
import pytest
import numpy as np
from pathlib import Path

@pytest.fixture(scope="session")
def test_data_dir():
    """Provide path to test data directory."""
    return Path(__file__).parent / "data"

@pytest.fixture
def sample_array():
    """Provide sample NumPy array."""
    np.random.seed(42)
    return np.random.randn(100)

@pytest.fixture
def temp_output_dir(tmp_path):
    """Provide temporary directory for test outputs."""
    output_dir = tmp_path / "output"
    output_dir.mkdir()
    return output_dir

@pytest.fixture(autouse=True)
def reset_random_state():
    """Reset random state before each test."""
    np.random.seed(42)

@pytest.fixture(scope="session")
def large_dataset():
    """Load large dataset once per test session."""
    return load_reference_data()

# Platform-specific fixtures
@pytest.fixture(params=["Linux", "Darwin", "Windows"])
def platform_name(request, monkeypatch):
    """Parametrize tests across platforms."""
    monkeypatch.setattr("platform.system", lambda: request.param)
    return request.param
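
Fixtures defined in conftest.py are discovered automatically; test modules just name them as arguments, with no import required. A small sketch of a test file consuming the fixtures above:

# tests/test_shared_fixtures.py
import numpy as np

def test_sample_array_shape(sample_array):
    """sample_array comes from tests/conftest.py, no import needed."""
    assert sample_array.shape == (100,)
    assert np.all(np.isfinite(sample_array))

def test_results_written_to_temp_dir(sample_array, temp_output_dir):
    """temp_output_dir builds on pytest's built-in tmp_path fixture."""
    out = temp_output_dir / "result.npy"
    np.save(out, sample_array)
    assert out.exists()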

Common Testing Pitfalls and Solutions

Pitfall 1: Testing Implementation Instead of Behavior

Bad:

def test_uses_numpy_mean():
    """Test that function uses np.mean."""  # Testing implementation!
    # This is fragile - breaks if implementation changes
    pass

Good:

def test_computes_correct_average():
    """Test that function returns correct average."""
    data = np.array([1, 2, 3, 4, 5])
    result = compute_average(data)
    assert result == approx(3.0)

Pitfall 2: Non-Deterministic Tests

Bad:

def test_random_sampling():
    samples = generate_samples()  # Uses random seed from system time!
    assert samples[0] > 0  # Might fail randomly

Good:

def test_random_sampling():
    np.random.seed(42)  # Fixed seed
    samples = generate_samples()
    assert samples[0] == approx(0.4967, rel=1e-4)

Pitfall 3: Exact Floating-Point Comparisons

Bad:

def test_computation():
    result = 0.1 + 0.2
    assert result == 0.3  # Fails due to floating-point error!

Good:

def test_computation():
    result = 0.1 + 0.2
    assert result == approx(0.3)

Pitfall 4: Testing Too Much in One Test

Bad:

def test_entire_analysis():
    # Load data
    data = load_data()
    assert data is not None
    
    # Process
    processed = process(data)
    assert len(processed) > 0
    
    # Analyze
    result = analyze(processed)
    assert result.score > 0.8
    
    # Save
    save_results(result, "output.txt")
    assert Path("output.txt").exists()

Good:

def test_load_data_succeeds():
    data = load_data()
    assert data is not None

def test_process_returns_nonempty():
    data = load_data()
    processed = process(data)
    assert len(processed) > 0

def test_analyze_gives_good_score():
    data = load_data()
    processed = process(data)
    result = analyze(processed)
    assert result.score > 0.8

def test_save_results_creates_file(tmp_path):
    output_file = tmp_path / "output.txt"
    result = create_mock_result()
    save_results(result, output_file)
    assert output_file.exists()

Testing Checklist

  • Tests are in tests/ directory separate from source
  • Test files named test_*.py
  • Test functions named test_*
  • Tests run against installed package (use src/ layout)
  • pytest configured in pyproject.toml
  • Using pytest.approx for floating-point comparisons
  • Tests check exceptions with pytest.raises
  • Tests check warnings with pytest.warns
  • Parametrized tests for multiple inputs
  • Fixtures for reusable setup
  • Markers used for test organization
  • Random tests use fixed seeds
  • Tests are independent (can run in any order)
  • Each test focuses on one behavior
  • Coverage > 80% (preferably > 90%)
  • All tests pass before committing
  • Slow tests marked with @pytest.mark.slow
  • Integration tests marked appropriately
  • CI configured to run tests automatically

Continuous Integration

GitHub Actions Example

# .github/workflows/tests.yml
name: Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e ".[dev]"

      - name: Run tests
        run: |
          pytest --cov=my_package --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v4
        with:
          files: ./coverage.xml

Resources

Summary

Testing scientific Python code with pytest, following Scientific Python community principles, provides:

  1. Confidence: Know your code works correctly
  2. Reproducibility: Ensure consistent behavior across environments
  3. Documentation: Tests show how code should be used and communicate developer intent
  4. Refactoring safety: Change code without breaking functionality
  5. Regression prevention: Catch bugs before they reach users
  6. Scientific rigor: Validate numerical accuracy and physical correctness

Key testing principles:

  • Start with public interface tests from the user's perspective
  • Organize tests into suites (unit, integration, e2e) by type and speed
  • Follow outside-in approach: public interface → integration → unit tests
  • Keep tests simple, focused, and independent
  • Test behavior rather than implementation
  • Use pytest's powerful features (fixtures, parametrization, markers) effectively
  • Always verify tests fail when they should to avoid false confidence

Remember: Any test is better than none, but well-organized tests following these principles create trustworthy, maintainable scientific software that the community can rely on.