There are instances where certain functions, such as `stratified_transect_statistic(...)`, use `numpy.nansum(...)` and similar functions that replace NaN values with 0. In some cases this is necessary to complete the calculation, but in other cases it is not, and it can create confusion about where NaN values should be expected to occur. For instance, from `stratified_transect_statistic(...)`:
```python
# Resample without replacement
sel_inds = np.random.choice( np.arange( start , end ) ,
                             num_transects_to_sample[ j ] ,
                             replace=False )
### Define stratified weights
stratified_weights = distance[ sel_inds ] / np.mean( distance[ sel_inds ] )
### Weighted value (e.g. biomass)
value_distance_density = value[ sel_inds ] / distance[ sel_inds ]
### Compute mean and variance
rho_j[ j ] = np.nansum( value[ sel_inds ] * stratified_weights ) / np.nansum( stratified_weights )
var_j[ j ] = np.nansum( ( stratified_weights ** 2 * ( value_distance_density - rho_j[ j ] ) ** 2 ) ) / sample_dof[ j ]
```

Here it is noted in PR #176 that NaN values should only be relevant based on the `distance` input, which in this case is calculated via `calculate_transect_distance(...)`. Since that function does not insert NaN values to match the size of the original data, a NaN value should not be expected here. If one does appear, it should trigger an error state rather than letting the remaining `transect_analysis(...)` or `stratified_summary(...)` operations plod along as if nothing is wrong. Although `numpy.nansum` is slower than the standard sum, the size of the data probably means this isn't a significant bottleneck. Nonetheless, it is probably overly conservative and could introduce cases where NaN values are silently replaced with unwanted 0 values that aren't accounted for and could influence downstream calculations (e.g. calculating the mean).
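As a rough sketch of the alternative behavior described above, the snippet below validates the inputs up front and then uses plain `np.sum`, so any stray NaN either raises immediately or propagates visibly instead of being silently zeroed. This is an illustrative mock-up with toy data, not the package implementation; the up-front check and the `sel_inds.size - 1` degrees-of-freedom stand-in (in place of `sample_dof[ j ]`) are assumptions.

```python
import numpy as np

# Toy stand-ins for one stratum's transect values and distances; the names
# mirror the snippet above, but this is an illustrative sketch rather than
# the actual body of stratified_transect_statistic(...).
value = np.array([10.0, 12.0, 8.0, 11.0])
distance = np.array([2.0, 2.5, 1.5, 2.0])

# Proposed error state: fail loudly if `distance` (the output of
# calculate_transect_distance(...)) or `value` contains NaN, rather than
# letting np.nansum quietly treat it as 0 downstream.
if np.isnan(distance).any() or np.isnan(value).any():
    raise ValueError("NaN detected in transect inputs; aborting stratified calculation.")

# Resample without replacement, as in the original snippet
rng = np.random.default_rng(seed=1)
sel_inds = rng.choice(np.arange(value.size), 3, replace=False)

stratified_weights = distance[sel_inds] / np.mean(distance[sel_inds])
value_distance_density = value[sel_inds] / distance[sel_inds]

# With the validation above, plain np.sum is safe: a NaN that slipped through
# would yield a NaN result (np.sum([1.0, np.nan]) -> nan) instead of being
# masked the way np.nansum masks it (np.nansum([1.0, np.nan]) -> 1.0).
rho_j = np.sum(value[sel_inds] * stratified_weights) / np.sum(stratified_weights)
var_j = np.sum(stratified_weights ** 2 * (value_distance_density - rho_j) ** 2) / (sel_inds.size - 1)
```

Whether the check belongs at the start of `stratified_transect_statistic(...)` or inside `calculate_transect_distance(...)` itself is an open design question, but either placement would surface the problem at its source rather than in a downstream mean.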