name: numpy-statistics
description: Standard and NaN-robust statistical functions for data analysis, histograms, and correlation matrices. Triggers: statistics, mean, nanmean, histogram, corrcoef, percentile, std.
Overview
NumPy provides a suite of statistical functions for summarizing data. Key capabilities include calculating central tendencies, dispersion, and relationships between variables, with specific handling for missing values (NaNs).
When to Use
- Summarizing experimental data (mean, median, standard deviation).
- Visualizing data distributions via histogram counts and binning.
- Identifying relationships between multiple variables using correlation matrices.
- Analyzing datasets with missing values, where standard aggregations would return NaN.
Decision Tree
- Does your data contain NaN?
  - Yes: Use nan-prefixed functions (e.g., np.nanmean).
  - No: Use standard functions (e.g., np.mean).
- Creating a histogram?
  - Need normalized area? Set density=True.
  - Equal-width bins? Provide an integer for bins. Custom bin widths? Provide an array of bin edges.
- Checking correlation?
  - Use np.corrcoef. Note: NumPy already clips the result to [-1, 1] to limit floating-point drift, so manual clipping is rarely needed.
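The NaN branch of the decision tree can be sketched as a small dispatch function. This is only an illustration: `summarize_mean` is a hypothetical helper name, not part of NumPy, and the sample arrays are made up.

```python
import numpy as np

def summarize_mean(arr):
    """Hypothetical helper: use np.nanmean only when NaNs are present."""
    arr = np.asarray(arr, dtype=float)
    if np.isnan(arr).any():
        return np.nanmean(arr)  # skips NaN entries
    return np.mean(arr)         # plain mean for clean data

clean = np.array([1.0, 2.0, 3.0])
gappy = np.array([1.0, np.nan, 3.0])

clean_mean = summarize_mean(clean)  # mean of 1, 2, 3
gappy_mean = summarize_mean(gappy)  # mean of the two valid values, 1 and 3
plain_mean = np.mean(gappy)         # NaN, because np.mean propagates NaN
```

In practice, calling np.nanmean unconditionally gives the same answer on clean data; the explicit check mainly documents the intent.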
Workflows
Robust Mean Calculation
- Identify an array with potential missing values (NaNs).
- Calculate the mean using np.nanmean(arr).
- Optionally use np.nanstd(arr) to find the standard deviation of the valid subset.
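A minimal sketch of this workflow on a 2D array, assuming made-up readings where np.nan marks a missing sample. The `axis` and `ddof` arguments are standard NumPy parameters; the variable names are illustrative.

```python
import numpy as np

# Hypothetical readings: two rows of three samples, with gaps.
arr = np.array([[1.0, 2.0, np.nan],
                [4.0, np.nan, 6.0]])

overall_mean = np.nanmean(arr)         # mean of the valid values 1, 2, 4, 6
row_means = np.nanmean(arr, axis=1)    # per-row means, ignoring NaNs
overall_std = np.nanstd(arr)           # population std (ddof=0) of valid values
sample_std = np.nanstd(arr, ddof=1)    # sample std of the same values
```

Note that np.nanstd defaults to ddof=0, as np.std does, so pass ddof=1 when a sample standard deviation is wanted.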
Custom Histogram Creation
- Define a set of non-uniform bin edges, e.g. [0, 5, 10, 50, 100].
- Pass the data and edges to np.histogram(data, bins=edges).
- Retrieve the counts and the validated edges for plotting.
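The steps above might look like the following, using the bin edges from the workflow and a made-up data array. Each bin is half-open except the last, which includes its right edge; values outside the outermost edges are not counted.

```python
import numpy as np

data = np.array([1, 3, 7, 12, 48, 50, 99, 100])  # illustrative values
edges = [0, 5, 10, 50, 100]                      # non-uniform bin edges

counts, returned_edges = np.histogram(data, bins=edges)

# counts has one entry per bin, so len(counts) == len(edges) - 1.
# Bin membership here: [0, 5) -> 1, 3; [5, 10) -> 7;
# [10, 50) -> 12, 48; [50, 100] -> 50, 99, 100.
```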
Inter-Variable Correlation Analysis
- Stack multiple data variables into a 2D array (rows as variables).
- Execute np.corrcoef(data).
- Inspect the off-diagonal elements for Pearson correlation strengths.
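As a rough illustration of this workflow, the sketch below stacks three made-up variables as rows and reads two off-diagonal entries. The names `x`, `y`, `z` and their values are assumptions chosen so the correlations are easy to verify by hand.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                         # exact linear function of x
z = np.array([5.0, 3.0, 4.0, 1.0, 2.0])   # tends to fall as x rises

data = np.vstack([x, y, z])   # shape (3, 5): rows are variables
corr = np.corrcoef(data)      # shape (3, 3) Pearson correlation matrix

r_xy = corr[0, 1]   # close to 1 for a perfect positive linear relation
r_xz = corr[0, 2]   # negative, since z decreases overall as x increases
```

If the variables are stored as columns instead, passing rowvar=False to np.corrcoef avoids the transpose.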
Non-Obvious Insights
- NaN Sensitivity: Standard statistical functions return NaN if even one element is NaN; the nan-prefixed versions are essential for real-world messy data.
- Histogram Density: The density=True flag ensures the integral over the histogram is 1, not that the sum of the returned values is 1 (unless bin widths are 1).
- Precision Clipping: Correlation coefficients can occasionally drift outside [-1, 1] due to floating-point rounding; np.corrcoef mitigates this by clipping its result to [-1, 1].
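The histogram density point can be checked numerically. The sketch below uses made-up data and unequal bin widths, and compares the plain sum of the density values with the area obtained by weighting each value by its bin width.

```python
import numpy as np

data = np.array([0.5, 1.5, 1.5, 2.5, 6.0, 9.0])  # illustrative values
edges = np.array([0.0, 2.0, 4.0, 10.0])          # widths 2, 2, 6

density, _ = np.histogram(data, bins=edges, density=True)
widths = np.diff(edges)

area = np.sum(density * widths)  # integral over the histogram
total = density.sum()            # plain sum of the density values
```

Here `area` comes out as 1 (up to rounding) while `total` does not, because each density value is the bin count divided by both the sample size and the bin width.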
Evidence
- "nanmean... Compute the arithmetic mean along the specified axis, ignoring NaNs." Source
- "Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function." Source
Scripts
- scripts/numpy-statistics_tool.py: Computes robust statistics and custom histograms.
- scripts/numpy-statistics_tool.js: Basic mean calculator.
Dependencies
- numpy (Python)