@brandynlucca
Collaborator

Several functions were added to calculate the age- and sex-stratified areal number/biomass densities, summed abundance, and summed biomass estimates. This includes the following changes:

  • initialization_config.yml: added a field for replicates specific to the stratified resampling procedure.
  • spatial.py
    • calculate_bounds: calculates the latitude/longitude boundary box around each transect line
    • calculate_transect_distance: calculates the length (distance) and area of each transect line, which are necessary for the stratified statistic calculation
  • statistics.py
    • confidence_interval: calculates the 95% confidence interval (for a Normal distribution)
    • stratified_transect_statistic: produces estimates and confidence intervals (around the mean) for the following statistics/estimators: mean, variance, coefficient of variation (CV)
  • survey.py
    • nasc_to_biomass_conversion: converts NASC and biological data into areal biomass/number density, summed biomass, and summed abundance stratified by age, sex, stratum, and transect
      • Nested within transect_analysis (pre-existing Survey method)
    • stratified_summary: wrapper function for stratified_transect_statistic and calculate_transect_distance that adds mean and confidence interval estimates for the stratified mean, variance, and coefficient of variation (CV) produced by the resampling/bootstrapping procedure
      • New Survey method
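
To illustrate the resampling idea behind `confidence_interval` and `stratified_transect_statistic`, here is a minimal sketch. It is not the EchoPro implementation: the function names, the 75% resampling fraction, the stratum weights, and the toy data are all assumptions made for illustration, and the estimator is a generic textbook stratified mean/variance. Each replicate resamples transects within every stratum, computes a stratified CV, and the replicate CVs are then summarized with a Normal-approximation 95% interval.

```python
import numpy as np


def normal_ci(values, z=1.959964):
    """Hypothetical helper: mean +/- z * sd of the replicate values (Normal approximation)."""
    values = np.asarray(values, dtype=float)
    center, spread = values.mean(), values.std(ddof=1)
    return center - z * spread, center + z * spread


def stratified_cv_replicates(strata, weights, fraction=0.75, replicates=100, seed=0):
    """Resample transects without replacement within each stratum; return one CV per replicate."""
    rng = np.random.default_rng(seed)
    cvs = np.empty(replicates)
    for i in range(replicates):
        mean_total, var_total = 0.0, 0.0
        for name, values in strata.items():
            values = np.asarray(values, dtype=float)
            # Keep at least two transects so the sample variance is defined
            n = max(2, int(round(fraction * values.size)))
            sample = rng.choice(values, size=n, replace=False)
            w = weights[name]
            mean_total += w * sample.mean()
            var_total += w**2 * sample.var(ddof=1) / n
        cvs[i] = np.sqrt(var_total) / mean_total
    return cvs


# Toy per-transect values for two made-up strata (illustrative only)
strata = {"A": [10.0, 12.0, 9.0, 11.0], "B": [20.0, 25.0, 22.0, 19.0, 24.0]}
weights = {"A": 0.4, "B": 0.6}

cvs = stratified_cv_replicates(strata, weights)
cv_mean = cvs.mean()
cv_ci = normal_ci(cvs)
```

The same pattern (replicate statistic, then mean and interval across replicates) would apply to the mean and variance estimators as well.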

Current use

from EchoPro.survey import Survey

# Load configuration settings and all data
obj = Survey(
    init_config_path='./config_files/initialization_config.yml',
    survey_year_config_path='./config_files/survey_year_2019_config.yml',
)

# Calculate base transect-based results necessary for later calculations
obj.transect_analysis()

# Calculate transect-based stratified results
obj.stratified_summary()

# Report the total adult biomass (kmt)
biomass_df = obj.biology['population']['biomass']['biomass_df']
total_adult_biomass = biomass_df.loc[biomass_df['sex'] == 'total', 'B_adult'].sum()
print(f"Total adult biomass: {1e-6 * total_adult_biomass:.3f} kmt")
Total adult biomass: 1655.225 kmt
# `ci_bounds` holds the lower and upper 95% CI bounds of the CV; its definition is not shown in this snippet
print(f"Adult biomass CV: {obj.biology['population']['stratified_results']['biomass']['CV']['estimate']*1e2:.2f}% [{ci_bounds[0]*1e2:.2f}%, {ci_bounds[1]*1e2:.2f}%; 95 pct. CI]")
Adult biomass CV: 13.32% [12.40%, 14.23%; 95 pct. CI]

Brandyn Lucca added 2 commits February 2, 2024 13:43
Incorporated the EPSG datum into initialization_config that is used for
defining the projection and other spatial features for georeferenced
NASC measurements.
Brandyn Lucca added 4 commits February 8, 2024 11:07
A new function (`stretch`) has been added to `operations.py`
to reduce the amount of cluttered and repetitive code contained
within the `nasc_to_biomass_conversion` function. I expect this
function to be re-used elsewhere, as well. The `stretch` function
leverages the built-in `pandas.wide_to_long` 'gather/melt' method
that ultimately re-indexes the data by consolidating the separate
data columns (e.g. `rho_a` for `male`, `female`, `unsexed`, and `total`)
into a single index (e.g. `sex`) and data (e.g. `rho_a`) column. This can
help provide a more intuitive way of filtering out specific groups/contrasts
in downstream functions and methods.
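
For reference, the reshaping described above can be sketched with `pandas.wide_to_long`. This is only an illustration of the technique, not the actual `stretch` function: the column names (`transect_num`, `rho_a_male`, ...) and the toy values are assumptions.

```python
import pandas as pd

# Hypothetical wide-format data: one `rho_a` column per sex
wide_df = pd.DataFrame(
    {
        "transect_num": [1, 2],
        "rho_a_male": [1.0, 2.0],
        "rho_a_female": [3.0, 4.0],
        "rho_a_total": [4.0, 6.0],
    }
)

# Consolidate the per-sex columns into a `sex` index and a single `rho_a` data column
long_df = pd.wide_to_long(
    wide_df,
    stubnames="rho_a",
    i="transect_num",
    j="sex",
    sep="_",
    suffix=r"\w+",  # the suffix is a word (male/female/total), not the default digits
).reset_index()

# Filtering a specific group becomes a simple row selection
total_only = long_df[long_df["sex"] == "total"]
```
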
The previous commit/push missed the doc string defined for the
`stretch` function.
An additional utility function `group_merge` has been
added to reduce the amount of repetition in cases where
multiple dataframes are being merged in the same step/pipeline/chain.
This doesn't change the previous output/result of the code, but it is
expected to be used in later calculations/steps, where it will enable
more consistent formatting and ensure that the grouped merges are
performed in the same way every time. This is particularly
important so that the 'how' and 'on' arguments are applied consistently
and are less vulnerable to errant typos.
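
A minimal sketch of this kind of chained merge, using `functools.reduce`, is shown here. The helper name `merge_all`, its signature, and the toy dataframes are hypothetical and do not reflect the actual `group_merge` interface.

```python
from functools import reduce

import pandas as pd


def merge_all(dataframes, on, how="outer"):
    """Hypothetical helper: merge a list of dataframes using the same `on`/`how` every time."""
    return reduce(lambda left, right: left.merge(right, on=on, how=how), dataframes)


# Toy dataframes sharing a `stratum_num` key (illustrative only)
df_a = pd.DataFrame({"stratum_num": [1, 2], "count": [10, 20]})
df_b = pd.DataFrame({"stratum_num": [1, 2], "weight": [0.5, 1.5]})
df_c = pd.DataFrame({"stratum_num": [1, 2], "length": [30.0, 40.0]})

merged = merge_all([df_a, df_b, df_c], on="stratum_num", how="inner")
```
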
The `load_configuration` function was previously included as a
static method within `Survey`; however, this isn't necessary since
`load_configuration` never uses `self` as an argument. Consequently,
it has been moved to `EchoPro.utils.data_file_validation`.
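
The shape of this change can be sketched as follows. This is a simplified stand-in, not the EchoPro code: it reads JSON rather than YAML to stay within the standard library, skips validation entirely, and the configuration keys are made up.

```python
import json
import tempfile
from pathlib import Path


def load_configuration(init_config_path, survey_year_config_path):
    """Module-level loader: it needs no instance state, so it does not have to be a method."""
    config = {}
    for path in (init_config_path, survey_year_config_path):
        config.update(json.loads(Path(path).read_text()))
    return config


class Survey:
    def __init__(self, init_config_path, survey_year_config_path):
        # The class calls the standalone function instead of owning it
        self.config = load_configuration(init_config_path, survey_year_config_path)


# Hypothetical configuration files written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    init_path = Path(tmp) / "init.json"
    year_path = Path(tmp) / "year.json"
    init_path.write_text(json.dumps({"replicates": 10}))
    year_path.write_text(json.dumps({"survey_year": 2019}))
    survey = Survey(init_path, year_path)
```

Any test that looks up `Survey.load_configuration` would fail after such a move, since the attribute no longer exists on the class.
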
@brandynlucca
Collaborator Author

Commit 1cdcf4d was also pushed to see how the CI testing handled the change. Changing load_configuration from a class method to a standalone function results in an expected test failure. This makes sense given that L22-25 within test_data_loader.py still expect load_configuration to be a Survey class method.

I am just noting this as a reminder to myself to amend the test_data_loader.py accordingly and have created Issue #177 to track both this failed test and future bugs/errors for the integration tests as a whole.

Brandyn Lucca and others added 4 commits February 8, 2024 13:43
Various changes were made to enable the INPFC strata
from the `INPFC` sheet to be validated (alongside `stratification1`), read,
and incorporated into the `Survey` object. This replaces the previous
hard-coded `pandas.DataFrame` that was generated in the `stratified_summary`
method. In `survey_year_2019_config.yml`, this is represented by
`sheetname: [ INPFC , stratification1 ]` associated with the
`geo_strata` configuration setting. So now the data validating and reading
functions can handle multiple `.xlsx` sheetnames from the same file.
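
As a rough sketch of reading several sheets from one workbook, `pandas.read_excel` accepts a list for `sheet_name` and returns a dict of dataframes keyed by sheet name. The helper names below are hypothetical and the EchoPro validation step is not reproduced.

```python
import pandas as pd


def normalize_sheetnames(sheetname):
    """Accept a single sheet name or a list of names and always return a list."""
    return [sheetname] if isinstance(sheetname, str) else list(sheetname)


def read_sheets(path, sheetname):
    """Read one or more sheets from the same workbook into a dict keyed by sheet name."""
    names = normalize_sheetnames(sheetname)
    # With a list for `sheet_name`, pandas returns {sheet name: DataFrame}
    return pd.read_excel(path, sheet_name=names)


# Both configuration styles resolve to a list of sheet names
single = normalize_sheetnames("stratification1")
multiple = normalize_sheetnames(["INPFC", "stratification1"])
```
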
This commit addresses Issue #177, which tracks the changed location of
`load_configuration` within `EchoPro`. When run locally,
the test passes. This commit also pushes changes to the included
test-related files that worked from this branch.
The line `from functools import reduce` was missing from
`operations.py`; it is required by the `reduce(...)` call
within `group_merge(...)`.
Member

@leewujung leewujung left a comment

@brandynlucca : Thanks for the PR, nice work!

My inline comments are small. Below are some high-level organization comments, regarding the function nasc_to_biomass_conversion.

Some observations

  1. L862-L937, the code does 2 things:
    • prepare dataframes_to_add, which is used to calculate nasc_fraction_total_df
    • prepare sex_indexed_weight_proportions, which is used to calculate age_biomass_df
  2. L939-L950:
    • calculate nasc_fraction_total_df
  3. L939-L1022:
    • everything is derived from nasc_to_areal_number_density_df (and other stored data within the Survey object)
  4. L1032-L1038:
    • age_biomass_df is calculated
  5. At the end the key dataframes are saved

Suggestions

Since this function is very long, I suggest factoring out the sections (1. and 2. above) into two separate functions: one that produces nasc_fraction_total_df and another that produces sex_indexed_weight_proportions. These two functions would basically be helper functions that prepare data.

This way the main flow of your calculation will be much clearer, since people who look into the code do not have to keep track of the large number of dfs from the beginning; they are merged/grouped nicely into columns with good names in nasc_fraction_total_df and sex_indexed_weight_proportions.

The separation will also help with writing tests, so that if something fails, the chunk to isolate to figure out potential bugs is smaller.

Brandyn Lucca added 2 commits February 9, 2024 14:56
Renamed `calculate_bounds` to `calculate_start_end_coordinates` to reflect
that the function is not drawing a true geospatial boundary box/rectangle
around the transect coordinates.
Renamed the dataframe column containing strata numbers within
 `self.biology[ 'weight' ][ 'weight_strata_df' ]` from `stratum` to
 `stratum_num`.
Brandyn Lucca added 3 commits February 9, 2024 16:27
Code within the `nasc_to_biomass_conversion(...)` function was
refactored to create `index_sex_weight_proportions(...)` and
`index_transect_age_sex_proportions(...)`. These functions
yield the following variables, respectively: `sex_indexed_weight_proportions` and
`nasc_fraction_total_df`.
Added preliminary doc strings to `index_sex_weight_proportions` and
`index_transect_age_sex_proportions`. Small edits were also made to the
corresponding `nasc_to_biomass_conversion(...)` code and imported
modules in `biology.py`.
Missing modules located in `EchoPro.computation.biology` were
appropriately added into `survey.py`.
@brandynlucca
Collaborator Author

@leewujung I think I have addressed everything you've mentioned so far. I am now in the process of stepping backward to enable the age-1 inclusion/exclusion flag. I can include that in a separate PR while also constructing more basic testing functions.

@leewujung
Member

@brandynlucca : Could you fix the failing test? It seems to be a missing import. Thanks.

@leewujung
Member

> I am now in the process of stepping backward to enable the age-1 inclusion/exclusion flag. I can include that in a separate PR while also constructing more basic testing functions.

I would suggest just having the tests of what you have in a single PR.

Let's do the age-1 inclusion/exclusion later, after kriging and reports/figures, so we have the entire process completed and verified first.

Brandyn Lucca and others added 2 commits February 10, 2024 12:19
Amended the doc string associated with `calculate_start_end_coordinates`
Member

@leewujung leewujung left a comment

@brandynlucca : Thanks for the changes. I realized that I did not review the function stratified_transect_statistic, so I just read through it now and only had a few minor suggestions and questions about NaN values.

The main higher-level question I have is on enabling stratified_transect_statistic to use both INPFC and KS strata. It seems that the only place needing change (i.e., adding an if-else case) would be the resampling part, which needs to choose from a list from that "stratification1" sheet (I forgot what you called it in the dataframes) rather than from this np.arange array.
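
A rough sketch of what such an if-else case could look like is given here. Everything in it is an assumption for discussion purposes: the function name, the argument names, and the stratum numbers are made up, and it does not reflect the current code.

```python
import numpy as np


def candidate_strata(stratum_type, n_inpfc=None, ks_strata=None):
    """Hypothetical selector for the strata that the resampling step draws from."""
    if stratum_type == "INPFC":
        # Contiguous stratum numbers generated on the fly
        return np.arange(1, n_inpfc + 1)
    if stratum_type == "KS":
        # Stratum numbers taken from a list (e.g. read from a stratification sheet)
        return np.unique(np.asarray(ks_strata))
    raise ValueError(f"Unknown stratum type: {stratum_type!r}")


inpfc = candidate_strata("INPFC", n_inpfc=6)
ks = candidate_strata("KS", ks_strata=[3, 1, 1, 7, 3])
```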

If there is not already an issue tracking this, we can add one and add this as one of the TODO items.

Otherwise I think this PR is ready to be merged once the small changes are made. :)

Also, parking some thoughts here regarding documentation: I think we can include the equations implemented here into the docs with references, so that we could stop the confusion or ambiguity once and for all. Let's discuss this more once the code is more settled.

@brandynlucca brandynlucca merged commit c442d17 into brandynlucca-WIP-refactoring Feb 13, 2024
@brandynlucca brandynlucca deleted the brandynlucca-nasc_to_biomass_plus_jolly_hampton branch February 14, 2024 00:51