Incorporate kriging methods and documentation #204

brandynlucca · 2024-03-07T18:23:31Z

This (draft) PR includes changes to the primary kriging methods and calculations up to estimating areal biomass densities and subsequent total biomass calculations as well as sex- and age-specific estimates. There are high-priority discrepancies in the final kriged coefficient of variation (CV) and areal biomass density estimates that require further investigation and changes (#202). The complete implementation of the stratified algorithm (i.e. Jolly Hampton, 1990) for the kriged results are forthcoming.

…into test-branch

…est-branch

config_files/survey_year_2019_config.yml

echopop/tests/conftest.py

echopop/test_data/config_files/config_init.yml

echopop/tests/test_strata_mean_sigma_bs.py

…into test-branch

echopop/survey.py

leewujung · 2024-03-08T21:24:58Z

echopop/survey.py

+        kriged_results_output = (
+            kriged_results
+            .merge( self.biology[ 'weight' ][ 'proportions' ][ 'sex_age_weight_proportions_df' ] , 
+                    on = [ 'stratum_num' , 'sex' ] )
+            .assign( biomass_sex_age = lambda x: x.weight_sex_proportion_all * x.biomass_total )
+            .loc[ : , [ 'centroid_latitude' , 'centroid_longitude' , 'stratum_num' , 
+                        'cell_area_nmi2' , 'sex' , 'age' , 'biomass_unaged' , 'biomass_aged' ,
+                        'biomass_total' , 'biomass_adult_cell_CV' , 'biomass_sex_age' ] ]
+        )


Something like:

df_tmp = ( kriged_results .merge( self.biology[ 'weight' ][ 'proportions' ][ 'sex_age_weight_proportions_df' ] , on = [ 'stratum_num' , 'sex' ] ) ) kriged_results["biomass_sex_age"] = df_tmp["weight_sex_proportion_all"] * df_tmp["biomass_total"] kriged_results_output = kriged_results[ [ 'centroid_latitude' , 'centroid_longitude' , 'stratum_num' , 'cell_area_nmi2' , 'sex' , 'age' , 'biomass_unaged' , 'biomass_aged' , 'biomass_total' , 'biomass_adult_cell_CV' , 'biomass_sex_age' ] ]

leewujung · 2024-03-08T21:25:44Z

echopop/survey.py

+        self.statistics[ 'kriging' ].update(
+            {
+                'apportioned_kriged_biomass_df': kriged_results_output ,
+            }
+        )


I was curious about if there's any performance difference between .update or direct assignment. It looks like that's not the case, just that with .update one could assign many keys at a time.

leewujung

Hey @brandynlucca : Thanks for the draft PR!

I made some inline comments on: 1) vectoring operations rather than using lambda, and 2) breaking large single lines into separate blocks for readability.

I'd also suggest moving the blocks within the very long apportion_kriged_biomass method into multiple functions that sit outside of Survey. This will 1) improve code readability, and 2) enable you to write tests for each explicitly -- that's why even though some are just a few lines it may be good to break them out. This may also provide the opportunity to clean up the columns in the merged dataframes that are not going to be used later in the downstream steps.

Function names below are just placeholders, feel free to change them:

`calc_weight_strata`:

echopop/echopop/survey.py

Lines 1291 to 1313 in 5c36ebe

    
           ### Calculate station weights -- stratified by strata 
        
           weight_strata_station = ( 
        
               pd.concat( [ 
        
                   haul_catch_filtered  
        
                   .groupby( 'stratum_num' ) 
        
                   .apply( lambda df: pd.Series( { 'stratum_weight': df.haul_weight.sum( ) } ) ) 
        
                   .reset_index( ) 
        
                   .assign( station = 1 ) ,  
        
                   specimen_df_copy 
        
                   .groupby( [ 'stratum_num' ] ) 
        
                   .apply( lambda df: pd.Series( { 'stratum_weight': df.weight.sum( ) } ) ) 
        
                   .reset_index() 
        
                   .assign( station = 2 ) 
        
               ] ) 
        
           ) 
        
           ### Calculate stratum weights 
        
           weight_strata = ( 
        
               weight_strata_station 
        
               .groupby( [ 'stratum_num' ] ) 
        
               .apply( lambda x: pd.Series( { 'weight_stratum_total': x.stratum_weight.sum( ) } ) )             
        
               .reset_index() 
        
           )

`calc_length_age_sex_stratum_weight`:

echopop/echopop/survey.py

Lines 1315 to 1333 in 5c36ebe

    
           ### Calculate summed length-age-sex weight bins across strata 
        
           length_age_sex_stratum_weight = ( 
        
               specimen_df_copy 
        
               .dropna( how = 'any' ) 
        
               .loc[ lambda x: x.sex != 'unsexed' ] 
        
               .pipe( lambda df: pd.concat( [ df ,  
        
                                              df.copy( ).assign( sex = 'all' ) ] ) ) 
        
               .bin_variable( age_intervals , 'age' ) 
        
               .bin_variable( length_intervals , 'length' )       
        
               .groupby( [ 'stratum_num' , 'age_bin' , 'length_bin' , 'sex' ] ) 
        
               .apply( lambda df: pd.Series( { 'summed_weight_all': df.weight.sum( ) , 
        
                                               'summed_weight_adult': df.loc[ df.age > 1 ].weight.sum( ) } ) )  
        
               .reset_index( ) 
        
               .replace( np.nan , 0 ) 
        
               .assign( total_weight_sex_all = lambda df: df.groupby( [ 'stratum_num' , 'sex' ] )[ 'summed_weight_all' ].transform( sum ) , 
        
                        total_weight_sex_adult = lambda df: df.groupby( [ 'stratum_num' , 'sex' ] )[ 'summed_weight_adult' ].transform( sum ) , 
        
                        proportion_weight_all = lambda df: df.summed_weight_all / df.total_weight_sex_all , 
        
                        proportion_weight_adult = lambda df: df.summed_weight_adult / df.total_weight_sex_adult )             
        
           )

`calc_dist_weight_sum`:

echopop/echopop/survey.py

Lines 1335 to 1344 in 5c36ebe

    
           ### Normalize the age-length-sex indexed proportions 
        
           dist_weight_sum = ( 
        
               length_age_sex_stratum_weight 
        
               .merge( weight_strata , on = [ 'stratum_num' ] ) 
        
               .groupby( [ 'stratum_num' , 'sex' ] ) 
        
               .apply( lambda df: pd.Series( {  
        
                   'proportion_normalized': ( df.proportion_weight_all * ( df.summed_weight_all / df.weight_stratum_total ).sum( ) ).sum( ) 
        
               } ) ) 
        
               .reset_index( ) 
        
           )

`calc_age_proportions`:

echopop/echopop/survey.py

Lines 1346 to 1356 in 5c36ebe

    
           ### Calculate aged / unaged proportions 
        
           aged_proportions = ( 
        
               length_age_sex_stratum_weight 
        
               .loc[ lambda df: df.sex != 'all' ] 
        
               .groupby( [ 'stratum_num' ] ) 
        
               .apply( lambda df: pd.Series( { 'weight_aged_total': df.summed_weight_all.sum( ) } ) )  
        
               .reset_index( ) 
        
               .merge( weight_strata , on = [ 'stratum_num' ] )  
        
               .assign( proportion_aged_total = lambda x: x.weight_aged_total / x.weight_stratum_total , 
        
                        proportion_unaged_total = lambda x: np.round( 1.0 - x.proportion_aged_total , decimals = 10 ) ) 
        
           )

`calc_haul_sex_weights_normalized`:

echopop/echopop/survey.py

Lines 1358 to 1408 in 5c36ebe

    
           ### Calculate interpolated weights based on length bins for each sex per haul 
        
           # Pull length-weight fitted values 
        
           length_weight_fit = ( 
        
               self.statistics[ 'length_weight' ][ 'length_weight_df' ] 
        
               .assign( length_bin_value  = lambda x: x[ 'length_bin' ].apply( lambda y: y.mid ) ) 
        
           ) 
        
           # Sum haul weights per sex per stratum 
        
           haul_weights = ( 
        
               length_df_copy 
        
               .bin_variable( length_intervals , 'length' ) 
        
               .loc[ lambda x: x.group == 'sexed' ] 
        
               .pipe( lambda df: pd.concat( [ df , df.assign( sex = 'all' ) ] ) )   
        
               .merge( length_weight_fit , on = [ 'sex' , 'length_bin' ] ) 
        
               .groupby( [ 'stratum_num' , 'haul_num' , 'sex' ] ) 
        
               .apply( lambda df: pd.Series( { 'weight_interp': ( np.interp( df.length,  
        
                                                                             length_weight_fit.loc[ lambda x: x.sex.isin( df.sex ) ][ 'length_bin_value' ] ,  
        
                                                                             length_weight_fit.loc[ lambda x: x.sex.isin( df.sex ) ][ 'weight_modeled' ] ) * 
        
                                                                   df.length_count ).sum( ) } ) ) 
        
               .groupby( [ 'stratum_num' , 'sex' ] ) 
        
               .apply( lambda df: pd.Series( { 'summed_haul_weights': df.weight_interp.sum( ) } ) ) 
        
               .reset_index( ) 
        
           ) 
        
           ### Normalize haul weights (Station 1) 
        
           haul_sex_weights_normalized = ( 
        
               haul_weights 
        
               .merge( weight_strata_station.loc[ lambda x: x.station == 1 ] ,  
        
                       on = 'stratum_num' ) 
        
               .assign( weight_normalized_station_1 = lambda df: ( 
        
                   df 
        
                   .groupby( [ 'stratum_num' ] ) 
        
                   .apply( lambda strata: strata[ 'stratum_weight' ] * strata[ 'summed_haul_weights' ] / 
        
                                          ( df[ ( df.stratum_num == strata.name ) & ( df.sex == 'male' ) ][ 'summed_haul_weights' ].iloc[ 0 ] + 
        
                                            df[ ( df.stratum_num == strata.name ) & ( df.sex == 'female' ) ][ 'summed_haul_weights' ].iloc[ 0 ] ) ) 
        
                   .reset_index( drop = True ) ) ) 
        
               .merge( weight_strata , on = [ 'stratum_num' ] ) 
        
               .assign( proportion_normalized_station_1 = lambda df: ( 
        
                   df.weight_normalized_station_1 / df.weight_stratum_total 
        
               ) ) 
        
               .assign( sex_proportion = lambda df: ( 
        
                   df 
        
                   .groupby( [ 'stratum_num' ] ) 
        
                   .apply( lambda strata: strata[ 'proportion_normalized_station_1' ] / 
        
                                          ( df[ ( df.stratum_num == strata.name ) & ( df.sex == 'male' ) ][ 'proportion_normalized_station_1' ].iloc[ 0 ] + 
        
                                            df[ ( df.stratum_num == strata.name ) & ( df.sex == 'female' ) ][ 'proportion_normalized_station_1' ].iloc[ 0 ] ) ) 
        
                   .reset_index( drop = True ) ) ) 
        
               .merge( aged_proportions.loc[ : , [ 'stratum_num' , 'proportion_unaged_total' ] ] ,  
        
                       on = [ 'stratum_num' ] )         
        
               .assign( proportion_unaged_sex = lambda x: x.sex_proportion * x.proportion_unaged_total ) 
        
           )

`calc_grid_CV`:

echopop/echopop/survey.py

Lines 1437 to 1465 in 5c36ebe

    
           ### Calculate grid CV  
        
           biomass_density_adult = ( 
        
               self.biology[ 'population' ][ 'areal_density' ][ 'biomass_density_df' ] 
        
               .loc[ lambda df: df.sex == 'all' ] 
        
               .drop_duplicates( subset = [ 'transect_num' , 'latitude' , 'longitude' , 'stratum_num' ] ) 
        
           ) 
        
           ### --------------------------------------------------------------- 
        
           ### !!! TODO:  
        
           ### While `C0` produces a similar standard deviation as the original 
        
           ### code, `Bn` does not. This discrepancy results in an order of  
        
           ### magnitude change in the resulting `biomass_adult_cell_CV` estimate. 
        
           ### The source of this error stems from `kriged_results.B_a_adult_mean` 
        
           ### whereby the right-tail of the `B_a_adult_mean` distribution is  
        
           ### substantially shorter than from values (`biomass_density_adult_mean`) 
        
           ### produced in the original code. This therefore relates to Issue #202. 
        
           ### --------------------------------------------------------------- 
        
           C0 = np.std( biomass_density_adult.B_a , ddof = 1 ) ** 2 
        
           Bn = np.nansum( kriged_results.B_a_adult_mean * kriged_results.cell_area_nmi2 ) * 1e-9 
        
           kriged_results = ( 
        
               kriged_results 
        
               .assign( biomass_adult_cell_CV = lambda df: ( 
        
                   self.statistics[ 'kriging' ][ 'model_config' ][ 'A0' ] * 
        
                   np.sqrt( kriged_results.B_a_adult_prediction_variance * C0 ) * 
        
                   1e-9 / 
        
                   Bn * 
        
                   np.sqrt( len( kriged_results.B_a_adult_prediction_variance ) ) 
        
               ) ) 
        
           )

leewujung · 2024-03-08T23:48:08Z

I tracked the above part about factoring blocks into functions in #203. We can address them there.

@brandynlucca : There are a few small comments (like paths) that can probably be addressed easily within this PR. Otherwise, feel free to merge this. Great that we have the kriging and downstream incorporated!

leewujung · 2024-03-09T17:09:56Z

I'll merge this now so that #206 that is built on top of this can be cleaned up and merged too. I will create another issue to track the vectorizing comments.

Brandyn Lucca and others added 8 commits March 6, 2024 11:38

Edit to setup.cfg maintainer name

04dbc49

Merge branch 'test-branch' of https://github.com/brandynlucca/EchoPro …

d93334f

…into test-branch

Merge branch 'main' of https://github.com/uw-echospace/EchoPro into t…

d74dee8

…est-branch

Change setup.cfg software name

ba657c2

Additional kriging and unit test function updates

0646652

Updates to kriging methods and sex-age apportionin

2701083

Add documentation for newer functions

44e9149

Kriging methods documentation

8fa107c

brandynlucca closed this Mar 7, 2024

brandynlucca self-assigned this Mar 7, 2024

brandynlucca reopened this Mar 7, 2024

Merge branch 'main' into test-branch

5b9fdc2

brandynlucca marked this pull request as ready for review March 7, 2024 18:32

brandynlucca marked this pull request as draft March 7, 2024 18:32

brandynlucca changed the title ~~Test-branch~~ Incorporate kriging methods and documentation Mar 7, 2024

leewujung reviewed Mar 7, 2024

View reviewed changes

config_files/survey_year_2019_config.yml Outdated Show resolved Hide resolved

leewujung reviewed Mar 7, 2024

View reviewed changes

echopop/tests/conftest.py Outdated Show resolved Hide resolved

leewujung reviewed Mar 7, 2024

View reviewed changes

echopop/test_data/config_files/config_init.yml Outdated Show resolved Hide resolved

leewujung reviewed Mar 7, 2024

View reviewed changes

echopop/tests/test_strata_mean_sigma_bs.py Show resolved Hide resolved

brandynlucca added 2 commits March 7, 2024 14:34

Add equations to acoustics/semivariogram RTD

95cc04f

Merge branch 'test-branch' of https://github.com/brandynlucca/EchoPro …

5c36ebe

…into test-branch