Add population-level calculations and stratified statistics #176

brandynlucca · 2024-02-08T00:49:55Z

Several functions were added to calculate the age- and sex-stratified areal number/biomass densities, summed abundance, and summed biomass estimates. This includes the following changes:

initialization_config.yml: added a field for replicates specific to the stratified resampling procedure.
spatial.py
- calculate_bounds: calculates the latitude/longitude boundary box around each transect line
- calculate_transect_distance: calculates the length (distance) and area of each transect line, both of which are needed for the stratified statistic calculation
statistics.py
- confidence_interval: calculates the 95% confidence interval (for a Normal distribution)
- stratified_transect_statistic: produces estimates and confidence intervals (around the mean) for the following statistics/estimators: mean, variance, coefficient of variation (CV)
survey.py
- nasc_to_biomass_conversion: converts NASC and biological data into areal biomass/number density, summed biomass, and summed abundance stratified by age, sex, stratum, and transect
  - Nested within transect_analysis (pre-existing Survey method)
- stratified_summary: wrapper function for stratified_transect_statistic and calculate_transect_distance that adds mean and confidence interval estimates for the stratified mean, variance, and coefficient of variation (CV) produced by the resampling/bootstrapping procedure
  - New Survey method
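
As a rough illustration of the Normal-distribution interval described for `confidence_interval`, here is a minimal sketch. The function name, its signature, the use of the sample standard deviation of the replicates as the spread, and the replicate values are all assumptions made for illustration. They are not taken from `statistics.py`.

```python
from statistics import NormalDist, fmean, stdev


def normal_confidence_interval(values, level=0.95):
    """Return (lower, upper) bounds of a Normal-approximation interval."""
    # Two-sided critical value, roughly 1.96 for a 95% interval
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    center = fmean(values)
    # Assumption: the spread is the sample standard deviation of the replicates
    spread = stdev(values)
    return center - z * spread, center + z * spread


# Hypothetical resampled CV replicates (illustrative numbers only)
replicate_cv = [0.128, 0.135, 0.131, 0.140, 0.126, 0.133]
lower, upper = normal_confidence_interval(replicate_cv)
```

The interval is symmetric around the mean of the replicates by construction, which only makes sense if the replicate distribution is approximately Normal.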

Current use

from EchoPro.survey import Survey

# Load configuration settings and all data
obj = Survey( init_config_path='./config_files/initialization_config.yml' , survey_year_config_path='./config_files/survey_year_2019_config.yml' )

# Calculate base transect-based results necessary for later calculations
obj.transect_analysis()

# Calculate transect-based stratified results
obj.stratified_summary()

# Report the total adult biomass
print(f"Total adult biomass: {1e-6*obj.biology[ 'population' ][ 'biomass' ][ 'biomass_df' ].pipe( lambda df: df.loc[ ( df[ 'sex' ] == 'total' ) ] ).B_adult.sum():.3f} kmt")

Total adult biomass: 1655.225 kmt

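# Report the adult biomass CV; `ci_bounds` is assumed to already hold the lower/upper 95% CI bounds of the CV
print(f"Adult biomass CV: {obj.biology[ 'population' ][ 'stratified_results' ][ 'biomass' ][ 'CV' ][ 'estimate' ]*1e2:.2f}% [{ci_bounds[0]*1e2:.2f}%, {ci_bounds[1]*1e2:.2f}%; 95 pct. CI]" )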

Adult biomass CV: 13.32% [12.40%, 14.23%; 95 pct. CI]

Incorporated the EPSG datum into initialization_config that is used for defining the projection and other spatial features for georeferenced NASC measurements.

A new function (`stretch`) has been added to `operations.py` to reduce the amount of cluttered and repetitive code contained within the `nasc_to_biomass_conversion` function. I expect this function to be re-used elsewhere as well. The `stretch` function leverages the `pandas.wide_to_long` 'gather/melt' function, which re-indexes the data by consolidating the separate data columns (e.g., `rho_a` for `male`, `female`, `unsexed`, and `total`) into a single index column (e.g., `sex`) and a single data column (e.g., `rho_a`). This provides a more intuitive way of filtering out specific groups/contrasts in downstream functions and methods.
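
To make the reshaping concrete, here is a minimal sketch of the wide-to-long step using `pandas.wide_to_long` directly. It is not the `stretch` implementation: the column names (`rho_a_male`, `rho_a_female`, `rho_a_total`), the `transect_num` key, and the values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical wide table: one `rho_a` column per sex category
wide_df = pd.DataFrame(
    {
        "transect_num": [1, 2],
        "rho_a_male": [0.4, 1.0],
        "rho_a_female": [0.6, 2.0],
        "rho_a_total": [1.0, 3.0],
    }
)

# Consolidate the per-sex columns into a `sex` index and a single `rho_a` column
long_df = pd.wide_to_long(
    wide_df,
    stubnames="rho_a",
    i="transect_num",
    j="sex",
    sep="_",
    suffix=r"\w+",  # the suffixes are words (male, female, total), not digits
).reset_index()

# Filtering a single group becomes a simple row selection
total_only = long_df.loc[long_df["sex"] == "total"]
```

With the data in long form, selecting one contrast is a single boolean mask on `sex` instead of a lookup across several similarly named columns.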

The previous commit/push missed the doc string defined for the `stretch` function.

An additional utility function, `group_merge`, has been added to reduce repetition in cases where multiple dataframes are merged in the same step/pipeline/chain. This doesn't change the output of the existing code, but it is expected to be used in later calculations/steps to enable more consistent formatting and to ensure that grouped merges are performed the same way every time. This is particularly important so that the 'how' and 'on' arguments are applied consistently and are less vulnerable to typos.
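
The general pattern of merging several dataframes with one shared set of `on` and `how` arguments can be sketched as below. The helper name `merge_all`, its signature, and the example tables are hypothetical. This is not the actual `group_merge` code, only an illustration of the `functools.reduce` approach it relies on.

```python
from functools import reduce

import pandas as pd


def merge_all(dataframes, on, how="outer"):
    """Merge a sequence of dataframes left to right with identical arguments."""
    return reduce(
        lambda left, right: left.merge(right, on=on, how=how),
        dataframes,
    )


# Hypothetical per-stratum tables
lengths = pd.DataFrame({"stratum_num": [1, 2], "mean_length": [20.0, 25.0]})
weights = pd.DataFrame({"stratum_num": [1, 2], "mean_weight": [0.1, 0.2]})
counts = pd.DataFrame({"stratum_num": [1, 2, 3], "count": [10, 12, 5]})

merged = merge_all([lengths, weights, counts], on="stratum_num", how="outer")
```

Because `on` and `how` are written once, every merge in the chain uses the same key and join type.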

The `load_configuration` function was previously included as a static method within `Survey`; however, this isn't necessary since `load_configuration` never uses `self` as an argument. Consequently, it has been moved to `EchoPro.utils.data_file_validation`.

brandynlucca · 2024-02-08T20:34:37Z

Commit 1cdcf4d was also pushed to see how the CI testing handled the change. Changing load_configuration from a class method to a standalone function results in an expected test failure. This makes sense given that L22-25 within test_data_loader.py still expect load_configuration to be a Survey class method.

I am just noting this as a reminder to myself to amend the test_data_loader.py accordingly and have created Issue #177 to track both this failed test and future bugs/errors for the integration tests as a whole.

Various changes were made to enable the INPFC strata from the `INPFC` sheet to be validated (alongside `stratification1`), read, and incorporated into the `Survey` object. This replaces the previous hard-coded `pandas.DataFrame` that was generated in the `stratified_summary` method. In `survey_year_2019_config.yml`, this is represented by `sheetname: [ INPFC , stratification1 ]` associated with the `geo_strata` configuration setting. The data validation and reading functions can now handle multiple `.xlsx` sheet names from the same file.
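
One simple way to support both a single sheet name and a list of sheet names is to normalize the setting to a list before validating or reading. The sketch below assumes that approach. The helper name `normalize_sheetnames` and the trimmed-down configuration dictionary are hypothetical and not taken from the `EchoPro` code.

```python
def normalize_sheetnames(sheetname):
    """Return a list of sheet names whether the setting is a string or a list."""
    if isinstance(sheetname, str):
        return [sheetname]
    return list(sheetname)


# Hypothetical, trimmed-down view of the `geo_strata` setting (other keys omitted)
geo_strata_config = {"sheetname": ["INPFC", "stratification1"]}

sheets_to_read = normalize_sheetnames(geo_strata_config["sheetname"])
single_sheet = normalize_sheetnames("stratification1")

for sheet in sheets_to_read:
    # Each sheet would be validated and read in turn at this point
    print(f"Processing sheet: {sheet}")
```

Normalizing up front keeps the downstream loop identical for configurations that list one sheet or several.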

This commit amends the test to reflect the change in the location of `load_configuration` within `EchoPro`, as mentioned in Issue #177. When run locally, the test passes. This commit also pushes changes to the included test-related files that worked from this branch.

…_biomass_plus_jolly_hampton

The line `from functools import reduce` was missing from `operations.py` to enable the `reduce(...)` function used within `group_merge(...)`.

EchoPro/computation/spatial.py

EchoPro/tests/test_data_loader.py

EchoPro/survey.py

leewujung

@brandynlucca : Thanks for the PR, nice work!

My inline comments are small. Below are some high-level organization comments, regarding the function nasc_to_biomass_conversion.

Some observations

L862-L937, the code does 2 things:
1. prepare dataframes_to_add, which is used to calculate nasc_fraction_total_df
2. prepare sex_indexed_weight_proportions, which is used to calculate age_biomass_df
L939-L950:
- calculate nasc_fraction_total_df
L939-L1022:
- everything is derived from nasc_to_areal_number_density_df (and other stored data within the Survey object)
L1032-1038:
- age_biomass_df is calculated
At the end the key dataframes are saved

Suggestions

Since this function is very long, I suggest factoring out the section (1. and 2. above) into two separate functions, one that produces nasc_fraction_total_df and another that produces sex_indexed_weight_proportions. These two functions would basically be helper functions that prepare data.

This way the main flow of your calculation will be much clearer, since people who look into the code do not have to keep track of the large number of dfs from the beginning: they are merged/grouped nicely into columns with good names in nasc_fraction_total_df and sex_indexed_weight_proportions.

The separation will also help with writing tests, so that if something fails, the chunk to isolate to figure out potential bugs is smaller.

Co-authored-by: Wu-Jung Lee <[email protected]>

Renamed `calculate_bounds` to `calculate_start_end_coordinates` to reflect that the function is not drawing a true geospatial boundary box/rectangle around the transect coordinates.

Renamed the dataframe column containing strata numbers within `self.biology[ 'weight' ][ 'weight_strata_df' ]` from `stratum` to `stratum_num`.

Code within the `nasc_to_biomass_conversion(...)` function was refactored to create `index_sex_weight_proportions(...)` and `index_transect_age_sex_proportions(...)`. These functions yield the following variables: `sex_indexed_weight_proportions` and `nasc_fraction_total_df`.

Added preliminary doc strings to `index_sex_weight_proportions` and `index_transect_age_sex_proportions`. Small edits were also made to the corresponding `nasc_to_biomass_conversion(...)` code and imported modules in `biology.py`.

Missing imports from `EchoPro.computation.biology` were added to `survey.py`.

brandynlucca · 2024-02-10T01:46:01Z

@leewujung I think I have addressed everything you've mentioned so far. I am now in the process of stepping backward to enable the age-1 inclusion/exclusion flag. I can include that in a separate PR while also constructing more basic testing functions.

leewujung · 2024-02-10T04:01:35Z

@brandynlucca : Could you fix the failing test? It seems to be a missing import. Thanks.

leewujung · 2024-02-10T04:03:52Z

> I am now in the process of stepping backward to enable the age-1 inclusion/exclusion flag. I can include that in a separate PR while also constructing more basic testing functions.

I would suggest just having the tests of what you have in a single PR.

Let's do the age-1 inclusion/exclusion later, after kriging and reports/figures, so we have the entire process completed and verified first.

Amended the doc string associated with `calculate_start_end_coordinates`

EchoPro/computation/statistics.py

leewujung

@brandynlucca : Thanks for the changes. I realized that I did not review the function stratified_transect_statistic, so I just read through it now and only had a few minor suggestions and questions about NaN values.

The main higher-level question I have is on enabling stratified_transect_statistic to use both INPFC and KS strata. It seems that the only place needing change (i.e., adding an if-else case) would be the resampling part, which needs to choose from a list from that "stratification1" sheet (I forgot what you called it in the dataframes) rather than from this np.arange array.
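
A rough sketch of the kind of if-else being suggested follows. The function name `candidate_strata`, its parameters, and the example labels are hypothetical. The actual resampling code in `stratified_transect_statistic` is not reproduced here.

```python
def candidate_strata(strata_labels=None, n_strata=None):
    """Return the stratum labels that a resampling step could draw from."""
    if strata_labels is not None:
        # Labels read from a stratification sheet: keep unique values, sorted
        return sorted(set(strata_labels))
    # Otherwise fall back to consecutive integer strata 1..n_strata
    return list(range(1, n_strata + 1))


# Hypothetical labels, e.g. a stratum column read from a sheet
from_sheet = candidate_strata(strata_labels=[1, 1, 2, 3, 3, 5])

# Hypothetical count-based case mirroring a plain integer range
from_range = candidate_strata(n_strata=4)
```

Returning the labels from one place means the resampling loop itself would not need to know which stratification is in use.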

If there is not already an issue tracking this, we can add one and add this as one of the TODO items.

Otherwise I think this PR is ready to be merged once the small changes are made. :)

Also, parking some thoughts here regarding documentation: I think we can include the equations implemented here into the docs with references, so that we could stop the confusion or ambiguity once and for all. Let's discuss this more once the code is more settled.

…ps://github.com/uw-echospace/EchoPro into brandynlucca-nasc_to_biomass_plus_jolly_hampton

Brandyn Lucca added 2 commits February 2, 2024 13:43

Added geospatial transformation to config

a207806

Population metrics and stratified stats

6ad39ce

brandynlucca requested a review from leewujung February 8, 2024 00:50

Brandyn Lucca added 4 commits February 8, 2024 11:07

Added docstring to stretch function

1a9e729

brandynlucca mentioned this pull request Feb 8, 2024

Failed integration tests for various Echopro functions and methods #177

Brandyn Lucca and others added 4 commits February 8, 2024 13:43

Amend test_data_loader

67348b5

Merge branch 'brandynlucca-WIP-refactoring' into brandynlucca-nasc_to…

bd3cccc

…_biomass_plus_jolly_hampton

Missing import for group_merge

14b94f1
