
Conversation

Collaborator

@ayushdg ayushdg commented Oct 27, 2021

This PR is an initial pass based on the approach discussed in #241 (comment).

The PR does the following:

  • Creates a ConfigContainer class responsible for storing the configuration dictionary and providing helper methods to set/update and retrieve configurations (a rough sketch follows this list).
    • This class has a set of defaults to use when no configuration options are provided. One downside of this approach is that it would need to be updated if an upstream library changes its default behavior; I'm considering whether it should be left empty instead.
  • The ConfigContainer object belongs to a schema, so all queries against a given schema share the same configuration set.
    • Consider making the configs changeable per context.sql run in addition to having a schema-wide config.
  • Adds usage of the split_out and split_every config options during groupby aggregations in the aggregate plan.
  • Adds tests for config
  • Adds tests for groupby w/ configs
  • Adds docs explaining the different config options
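A rough sketch of the general shape (method names and default values here are illustrative placeholders, not the final API):

class ConfigContainer:
    """Stores configuration options and provides helpers to set and retrieve them.

    Minimal sketch only; the real class may use different names and defaults.
    """

    # Defaults used when no configuration options are provided.
    # These would need updating if an upstream library changes its defaults.
    DEFAULT_CONFIG = {
        "dask.groupby.aggregate.split_out": 1,
        "dask.groupby.aggregate.split_every": None,
    }

    def __init__(self, config_options=None):
        self.config_dict = dict(self.DEFAULT_CONFIG)
        if config_options:
            self.set_config(config_options)

    def set_config(self, config_options):
        """Update one or more options; accepts a single dict or a list of dicts."""
        if isinstance(config_options, dict):
            config_options = [config_options]
        for options in config_options:
            self.config_dict.update(options)

    def get_groupby_aggregate_configs(self):
        """Return only the options relevant to groupby aggregations."""
        return {
            key: val
            for key, val in self.config_dict.items()
            if key.startswith("dask.groupby.aggregate")
        }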

Collaborator

@rajagurunath rajagurunath left a comment


Everything LGTM !!

config_options = [config_options]
self.config_dict.update(config_options)

def get_groupby_aggregate_configs(self):
Collaborator


Just a doubt here: are we planning to use different getter functions for each type of configuration?

For example, let's say in the future we need to add configurations for persist=True|False or for some join-based hints as mentioned here #280 (comment). Are we planning to add individual getter functions for each of those?

Collaborator Author

@ayushdg ayushdg Oct 27, 2021


That's a great point. I've been thinking about this and have a couple of ideas:

  1. One approach is to have getter functions like these throughout the class. It might get large very quickly (depending on the number of config options), but it would be a nice central location for retrieving relevant config options.
  2. The other option is to have a generic prefix-based getter: it takes a key prefix, something like dask.groupby.aggregate, and returns a dictionary with all key/value pairs matching that prefix. That would keep this class much cleaner.

I don't have strong opinions on either but am leaning towards option 2. Open to opinions or suggestions from others as well.
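To make option 2 concrete, a generic prefix-based getter could look roughly like this (sketch only; name and signature not final):

def get_config_by_prefix(self, config_prefix):
    """Return all config key/value pairs whose key starts with config_prefix.

    For example, get_config_by_prefix("dask.groupby.aggregate") would return
    the split_out/split_every options and nothing else.
    """
    return {
        key: val
        for key, val in self.config_dict.items()
        if key.startswith(config_prefix)
    }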

Collaborator


That's a great idea 👍. I'm also in favour of option 2; happy to hear other people's suggestions.

@charlesbluca
Collaborator

rerun tests

Collaborator Author

@ayushdg ayushdg left a comment


While exploring some of the work I did on ConfigContainer, I came across Dask's config module: https://docs.dask.org/en/stable/configuration.html#configuration. I wonder if that would be a better long-term alternative to writing our own methods that do similar things.
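For reference, Dask's config module already handles dotted keys, defaults, and temporary overrides (the key name below is just an example, not an existing dask-sql option):

import dask

# Set an option globally
dask.config.set({"sql.groupby.split_out": 4})

# Read it back, with a fallback default
split_out = dask.config.get("sql.groupby.split_out", default=1)

# Override it only within a block
with dask.config.set({"sql.groupby.split_out": 8}):
    ...  # code in this block sees split_out == 8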

I was also thinking about how we could support a config provided at runtime with context.sql. That would require another ConfigContainer attribute associated with the Context class rather than the schema; during query execution we would create a new config dictionary, prioritizing options provided in the context.sql call but falling back to the schema-level config for anything not passed with the SQL call. Does that seem reasonable?
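The per-query override could then boil down to a simple dict merge in which the context.sql options win over the schema-level ones; a sketch (names here are assumptions):

from typing import Optional


def resolve_configs(schema_config: dict, sql_config: Optional[dict] = None) -> dict:
    """Merge the schema-wide config with per-query overrides from context.sql.

    Keys passed to context.sql take priority; anything not passed falls back
    to the schema-level ConfigContainer.
    """
    merged = dict(schema_config)
    merged.update(sql_config or {})
    return merged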

Comment on lines +311 to +313
key: val
for key, val in self.config_dict.items()
if key.startswith(config_prefix)
Collaborator Author


This may not be the most efficient way to do this, but should be okay for small dictionaries.

Comment on lines +239 to +242
for config_key in list(groupby_agg_options.keys()):
groupby_agg_options[
config_key.rpartition(".")[2]
] = groupby_agg_options.pop(config_key)
Collaborator Author


I was initially planning on having this logic in ConfigContainer, returning the param name from the config string (i.e. dask.groupby.aggregate.split_out -> split_out), but things were getting a bit messy because multiple config options down the line could share the same param name, which would cause conflicts; e.g. dask.dataframe.drop_duplicates.split_out would also return split_out (small example below).
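A small example of the collision (the second key is a hypothetical future option):

keys = [
    "dask.groupby.aggregate.split_out",
    "dask.dataframe.drop_duplicates.split_out",  # hypothetical future option
]

# rpartition(".")[2] keeps only the text after the last dot,
# so both keys collapse to the same parameter name.
params = {key.rpartition(".")[2] for key in keys}
assert params == {"split_out"}  # the two options can no longer be told apart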

Collaborator


Thinking about this more, would it make sense to consolidate this functionality in get_config_by_prefix? We would filter the config options by prefix and then return those options with the prefix stripped. I can leave suggestions for what I imagine this would look like:

Suggested change
for config_key in list(groupby_agg_options.keys()):
groupby_agg_options[
config_key.rpartition(".")[2]
] = groupby_agg_options.pop(config_key)
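Something along these lines, for instance (the strip_prefix flag and its behavior are assumptions, not necessarily the exact suggestion):

def get_config_by_prefix(self, config_prefix, strip_prefix=False):
    """Return config options whose keys start with config_prefix.

    If strip_prefix is True, the prefix and its trailing dot are removed from
    the returned keys, e.g. "dask.groupby.aggregate.split_out" -> "split_out".
    """
    matched = {
        key: val
        for key, val in self.config_dict.items()
        if key.startswith(config_prefix)
    }
    if strip_prefix:
        matched = {
            key[len(config_prefix):].lstrip("."): val
            for key, val in matched.items()
        }
    return matched

The call site in aggregate.py would then reduce to a single get_config_by_prefix("dask.groupby.aggregate", strip_prefix=True) call.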

Comment on lines +234 to +235
groupby_agg_options = context.schema[
context.schema_name
Collaborator Author

@ayushdg ayushdg Nov 5, 2021


I was giving this a bit more thought and it seems like context.schema_name always returns the default schema, while the context.fqn method can help return the specific schema being used in that query. I'm not completely sure whether the schema for a groupby is associated with the groupby call itself or with the specific aggregations, based on the code here:

schema_name, aggregation_name = context.fqn(
expr.getAggregation().getNameAsId()
)

Perhaps @rajagurunath or someone more familiar with this part of the codebase can provide insight here.

Collaborator Author

ayushdg commented Nov 5, 2021

@charlesbluca This should be good for an initial round of reviews before we finalize the design and start adding test cases.

Collaborator

@charlesbluca charlesbluca left a comment


Is it possible to add a test to verify that groupby options are being successfully passed through to _perform_aggregation?
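One way to do this without mocking _perform_aggregation directly is to check the observable effect of split_out on the result; a rough sketch (the config-setting call is an assumption about this PR's API and may need adjusting):

import dask.dataframe as dd
import pandas as pd

from dask_sql import Context


def test_groupby_split_out_is_applied():
    c = Context()
    df = pd.DataFrame({"a": [1, 1, 2, 2], "b": range(4)})
    c.create_table("df", dd.from_pandas(df, npartitions=2))

    # Hypothetical setter; adjust to the actual ConfigContainer API.
    c.schema[c.schema_name].config.set_config(
        {"dask.groupby.aggregate.split_out": 4}
    )

    result = c.sql("SELECT a, SUM(b) AS s FROM df GROUP BY a")
    # If split_out reaches the dask groupby aggregation, the aggregated
    # frame should come back with that many partitions.
    assert result.npartitions == 4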

# Returns {
# "dask.groupby.aggregate.split_out":1,
# "dask.groupby.aggregate.split_every":1,
# "dask.sort.persistpersist": True
Collaborator


Suggested change
# "dask.sort.persistpersist": True
# "dask.sort.persist": True


codecov-commenter commented Nov 18, 2021

Codecov Report

Merging #286 (e4c8613) into main (e48d9c1) will decrease coverage by 0.19%.
The diff coverage is 94.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #286      +/-   ##
==========================================
- Coverage   95.99%   95.80%   -0.20%     
==========================================
  Files          64       65       +1     
  Lines        2797     2834      +37     
  Branches      421      426       +5     
==========================================
+ Hits         2685     2715      +30     
- Misses         71       74       +3     
- Partials       41       45       +4     
Impacted Files Coverage Δ
dask_sql/datacontainer.py 93.57% <90.90%> (-0.75%) ⬇️
dask_sql/context.py 99.13% <100.00%> (+0.03%) ⬆️
dask_sql/physical/rel/logical/aggregate.py 95.91% <100.00%> (+0.08%) ⬆️
dask_sql/physical/utils/groupby.py 100.00% <100.00%> (ø)
dask_sql/physical/utils/sort.py 83.33% <0.00%> (-7.06%) ⬇️
dask_sql/server/responses.py 97.87% <0.00%> (-2.13%) ⬇️
dask_sql/physical/rel/logical/join.py 98.30% <0.00%> (ø)
dask_sql/physical/rel/custom/__init__.py 100.00% <0.00%> (ø)
dask_sql/physical/rel/custom/distributeby.py 86.36% <0.00%> (ø)
... and 1 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e48d9c1...e4c8613.

@ayushdg ayushdg marked this pull request as ready for review November 18, 2021 15:09
Collaborator Author

ayushdg commented Nov 18, 2021

Should be ready for another round of reviews. I haven't implemented the option to make the configs configurable per context.sql run; I'll explore that in a follow-up PR, in addition to considering the use of Dask's provided config module.

@ayushdg ayushdg changed the title [WIP] Add groupby split_out config options to dask-sql Add groupby split_out config options to dask-sql Nov 18, 2021
Collaborator

@charlesbluca charlesbluca left a comment


Thanks for the work here @ayushdg 😄

  • I made some suggestions in an attempt to consolidate the logic for stripping the prefix for config options in get_config_by_prefix
  • Some minor formatting on a docstring codeblock
  • Would you mind adding some basic ConfigContainer tests to test_datacontainer.py? Just verifying that all the getter/setter methods work as expected (something like the sketch below).
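For instance, something like this (assuming ConfigContainer lives in dask_sql.datacontainer and that the setter is named set_config; adjust to the final API):

from dask_sql.datacontainer import ConfigContainer


def test_config_container_prefix_getter():
    config = ConfigContainer()

    # Assumed setter name; replace with whatever the final method is called.
    config.set_config({"dask.groupby.aggregate.split_out": 2})

    groupby_configs = config.get_config_by_prefix("dask.groupby.aggregate")
    assert groupby_configs["dask.groupby.aggregate.split_out"] == 2
    assert all(key.startswith("dask.groupby.aggregate") for key in groupby_configs)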


@ayushdg ayushdg deleted the fea-schema-split_out-config branch December 12, 2022 13:40