Add population-level calculations and stratified statistics #176
Incorporated the EPSG datum into initialization_config that is used for defining the projection and other spatial features for georeferenced NASC measurements.
A new function (`stretch`) has been added to `operations.py` to reduce the amount of cluttered and repetitive code contained within the `nasc_to_biomass_conversion` function. I expect this function to be re-used elsewhere as well. The `stretch` function leverages the `pandas.wide_to_long` 'gather/melt' function, which re-indexes the data by consolidating the separate data columns (e.g., `rho_a` for `male`, `female`, `unsexed`, and `total`) into a single index column (e.g., `sex`) and a single data column (e.g., `rho_a`). This provides a more intuitive way of filtering out specific groups/contrasts in downstream functions and methods.
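The reshaping described above can be illustrated with a minimal sketch. This is not the `stretch` implementation from `operations.py`; the function name `stretch_sketch`, the `transect_num` identifier column, and the `rho_a_<sex>` column naming are assumptions made only for this example.

```python
import pandas as pd

# Hypothetical wide-format input: one `rho_a` column per sex group.
# The column naming scheme (`rho_a_<sex>`) is assumed for illustration.
wide_df = pd.DataFrame(
    {
        "transect_num": [1, 2],
        "rho_a_male": [1.0, 2.0],
        "rho_a_female": [3.0, 4.0],
        "rho_a_unsexed": [0.5, 0.25],
        "rho_a_total": [4.5, 6.25],
    }
)


def stretch_sketch(df, stubname="rho_a", id_col="transect_num", group_col="sex"):
    """Gather `<stubname>_<group>` columns into one group column and one data column."""
    long_df = pd.wide_to_long(
        df,
        stubnames=stubname,  # shared prefix of the wide columns
        i=id_col,            # column(s) that uniquely identify each row
        j=group_col,         # name of the new column holding the suffixes
        sep="_",
        suffix=r"\w+",       # suffixes are words (male, female, ...), not digits
    )
    return long_df.reset_index()


long_df = stretch_sketch(wide_df)

# Filtering a single group is now a one-line boolean mask.
female_only = long_df[long_df["sex"] == "female"]
```

With the data in long format, a contrast such as "female only" no longer requires knowing which of several similarly named columns to select.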
The previous commit/push missed the doc string defined for the `stretch` function.
An additional utility function `group_merge` has been added to reduce the amount of repetition in cases where multiple dataframes are being merged in the same step/pipeline/chain. This doesn't change the previous output/result of the code, but it is expected to be used in later calculations/steps to enable more consistent formatting and to ensure that the grouped merges are performed in the same way every time. This is particularly important so that the 'how' and 'on' arguments are applied consistently and are less vulnerable to errant typos.
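As a rough illustration of the idea, chained merges can be folded with `functools.reduce` so that `on` and `how` are stated once. This is a sketch under assumptions, not the actual `group_merge` signature; the name `group_merge_sketch` and the example dataframes are hypothetical.

```python
from functools import reduce

import pandas as pd


def group_merge_sketch(base, others, on, how="outer"):
    """Merge every dataframe in `others` onto `base` with identical `on`/`how`."""
    return reduce(
        lambda left, right: left.merge(right, on=on, how=how),
        others,
        base,
    )


# Hypothetical per-stratum tables sharing the same key column.
abundance = pd.DataFrame({"stratum_num": [1, 2], "abundance": [10.0, 20.0]})
biomass = pd.DataFrame({"stratum_num": [1, 2], "biomass": [5.0, 8.0]})
density = pd.DataFrame({"stratum_num": [1, 2], "rho_a": [0.1, 0.2]})

merged = group_merge_sketch(abundance, [biomass, density], on=["stratum_num"], how="left")
```

Because the merge arguments are passed a single time, a typo in `how` or `on` can no longer affect only one merge in the chain.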
The `load_configuration` function was previously included as a static method within `Survey`; however, this isn't necessary since `load_configuration` never uses `self` as an argument. Consequently, it has been moved to `EchoPro.utils.data_file_validation`.
Commit 1cdcf4d was also pushed to see how the CI testing handled the change. The change in I am just noting this as a reminder to myself to amend the
Various changes were made to enable the INPFC strata from the `INPFC` sheet to be validated (alongside `stratification1`), read, and incorporated into the `Survey` object. This replaces the previous hard-coded `pandas.DataFrame` that was generated in the `stratified_summary` method. In `survry_year_2019_config.yml`, this is represented by `sheetname: [ INPFC , stratification1 ]` associated with the `geo_strata` configuration setting. The data validation and reading functions can now handle multiple `.xlsx` sheet names from the same file.
As mentioned in Issue #177, this changes the location of `load_configuration` within `EchoPro`. When run locally, the test passes. This commit also pushes changes to the included test-related files that worked from this branch.
…_biomass_plus_jolly_hampton
The line `from functools import reduce` was missing from `operations.py`; it is required for the `reduce(...)` call used within `group_merge(...)`.
leewujung
left a comment
@brandynlucca : Thanks for the PR, nice work!
My inline comments are small. Below are some high-level organization comments, regarding the function nasc_to_biomass_conversion.
Some observations
- L862-L937, the code does 2 things:
  - prepare `dataframes_to_add`, which is used to calculate `nasc_fraction_total_df`
  - prepare `sex_indexed_weight_proportions`, which is used to calculate `age_biomass_df`
- L939-L950:
  - calculate `nasc_fraction_total_df`
- L939-L1022:
  - everything is derived from `nasc_to_areal_number_density_df` (and other stored data within the Survey object)
- L1032-1038:
  - `age_biomass_df` is calculated
- At the end the key dataframes are saved
Suggestions
Since this function is very long, I suggest factoring out the sections (1. and 2. above) into two separate functions, one that produces `nasc_fraction_total_df` and another that produces `sex_indexed_weight_proportions`. These two functions would basically be helper functions that prepare data.
This way the main flow of your calculation will be much clearer: people who look into the code will not have to keep track of the large number of dfs from the beginning, since they are merged/grouped nicely into columns with good names in `nasc_fraction_total_df` and `sex_indexed_weight_proportions`.
The separation will also help with writing tests, so that if something fails, the chunk to isolate when tracking down potential bugs is smaller.
Co-authored-by: Wu-Jung Lee <[email protected]>
Renamed `calculate_bounds` to `calculate_start_end_coordinates` to reflect that the function is not drawing a true geospatial boundary box/rectangle around the transect coordinates.
Renamed the dataframe column with strata numbers within `self.biology[ 'weight' ][ 'weight_strata_df' ]` from `stratum` to `stratum_num`.
Code within the `nasc_to_biomass_conversion(...)` function was refactored to create `index_sex_weight_proportions(...)` and `index_transect_age_sex_proportions(...)`. These functions yield the following variables, respectively: `sex_indexed_weight_proportions` and `nasc_fraction_total_df`.
Added preliminary doc strings to `index_sex_weight_proportions` and `index_transect_age_sex_proportions`. Small edits were also made to the corresponding `nasc_to_biomass_conversion(...)` code and imported modules in `biology.py`.
Missing modules located in `EchoPro.computation.biology` were appropriately added into `survey.py`.
@leewujung I think I have addressed everything you've mentioned so far. I am now in the process of stepping backward to enable the age-1 inclusion/exclusion flag. I can include that in a separate PR while also constructing more basic testing functions.
@brandynlucca : Could you fix the breaking test? It seems to be a missing import. Thanks.
I would suggest just having the tests of what you have in a single PR. Let's do the age-1 inclusion/exclusion later, after kriging and reports/figures, so we have the entire process completed and verified first.
Amended the doc string associated with `calculate_start_end_coordinates`
leewujung
left a comment
@brandynlucca : Thanks for the changes. I realized that I did not review the function stratified_transect_statistic, so I just read through it now and only had a few minor suggestions and questions about NaN values.
The main higher-level question I have is on enabling `stratified_transect_statistic` to use both INPFC and KS strata. It seems that the only place needing change (i.e., adding an if-else case) would be the resampling part, which needs to choose from a list from that "stratification1" sheet (I forgot what you called it in the dataframes) rather than from this `np.arange` array.
If there is not already an issue tracking this, we can add one and add this as one of the TODO items.
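To make the suggested if-else concrete, here is a minimal sketch of how the set of strata to resample could be chosen. Everything in it is hypothetical: the helper name `strata_to_resample`, its arguments, and the assumption that strata are otherwise numbered consecutively from 1 are for illustration only and do not reflect the current `stratified_transect_statistic` code.

```python
import numpy as np


def strata_to_resample(stratum_values=None, n_strata=None):
    """Return the stratum labels to loop over during resampling.

    stratum_values : stratum labels read from a stratification sheet
        (hypothetical input), used when provided.
    n_strata : number of consecutively numbered strata (1..n_strata),
        used as the fallback.
    """
    if stratum_values is not None:
        # Take the labels that actually occur, not an assumed range.
        return np.unique(np.asarray(stratum_values))
    if n_strata is None:
        raise ValueError("Provide either stratum_values or n_strata.")
    return np.arange(1, n_strata + 1)


consecutive = strata_to_resample(n_strata=3)
from_sheet = strata_to_resample(stratum_values=[7, 7, 2, 5])
```

The rest of the resampling loop could then iterate over the returned labels regardless of which stratification is in use.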
Otherwise I think this PR is ready to be merged once the small changes are made. :)
Also, parking some thoughts here regarding documentation: I think we can include the equations implemented here into the docs with references, so that we could stop the confusion or ambiguity once and for all. Let's discuss this more once the code is more settled.
…ps://github.com/uw-echospace/EchoPro into brandynlucca-nasc_to_biomass_plus_jolly_hampton
Several functions were added to calculate the age- and sex-stratified areal number/biomass densities, summed abundance, and summed biomass estimates. This includes the following changes:
- `initialization_config.yml`: added a field for `replicates` specific to the stratified resampling procedure.
- `spatial.py`
  - `calculate_bounds`: calculates the latitude/longitude boundary box around each transect line
  - `calculate_transect_distance`: calculates the length distance and area of each transect line necessary for the stratified statistic calculation
- `statistics.py`
  - `confidence_interval`: calculates the 95% confidence interval (for a Normal distribution)
  - `stratified_transect_statistic`: produces estimates and confidence intervals (around the mean) for the following statistics/estimators: mean, variance, coefficient of variation (CV)
- `survey.py`
  - `nasc_to_biomass_conversion`: converts NASC and biological data into areal biomass/number density, summed biomass, and summed abundance stratified by age, sex, stratum, and transect
    - `transect_analysis` (pre-existing `Survey` method)
  - `stratified_summary`: wrapper function for `stratified_transect_statistic` and `calculate_transect_distance` that adds mean and confidence interval estimates for the stratified mean, variance, and coefficient of variation (CV) produced by the resampling/bootstrapping procedure
    - `Survey` method

Current use