-
Notifications
You must be signed in to change notification settings - Fork 72
Add groupby split_out config options to dask-sql #286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add groupby split_out config options to dask-sql #286
Conversation
rajagurunath
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything LGTM !!
dask_sql/datacontainer.py
Outdated
| config_options = [config_options] | ||
| self.config_dict.update(config_options) | ||
|
|
||
| def get_groupby_aggregate_configs(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a doubt here, Are we planning to use different getter functions for each type of configuration?
For example: let's say in the future we may need to add configurations for persist=True|False or for some join based hints as mentioned here #280 (comment) are we planning to add individual getter functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great point. I've been thinking about this and have a couple of ideas:
- One approach is to have getter functions like these throughout the class. It might get large very quickly (based on the config options) but would be a nice central location to get relevant config options.
- The other option is to just have a generic prefix based getter, so it takes the prefix of the key string something like
dask.groupby.aggregateand returns a dictionary with all key/val pairs matching that key prefix. It'll make this class much cleaner.
I don't have strong opinions on either but am leaning towards option 2. Open to opinions or suggestions from others as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great idea 👍 , I am also in the favour of option2, Happy to know other people's Suggestions.
|
rerun tests |
…ate to use the new api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While exploring some of the work I did with ConfigContainer I came across Dask's config module: https://docs.dask.org/en/stable/configuration.html#configuration. I wonder if that's a better alternative to writing our own methods for doing similar things and would be better in the long run.
I was also thinking about how we could have a config provided at runtime with context.sql and that would require having another ConfigContainer attribute associated with the Context class rather than the schema and during query execution we create a new config_dictionary prioritizing from the one provided in context.sql but also picking from the schema for those configs that aren't passed with the sql call. Does that seem reasonable?
| key: val | ||
| for key, val in self.config_dict.items() | ||
| if key.startswith(config_prefix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be the most efficient way to do this, but should be okay for small dictionaries.
| for config_key in list(groupby_agg_options.keys()): | ||
| groupby_agg_options[ | ||
| config_key.rpartition(".")[2] | ||
| ] = groupby_agg_options.pop(config_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was initially planning on having this logic in ConfigContainer which returns the param name from the config string: i.e dask.groupby.aggregate.split_out -> split_out but the reason things were getting a bit messy is because multiple config options down the line could have the param name which would cause conflicts, eg: dask.dataframe.drop_duplicates.split_out would also return split_out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this more, would it make sense to consolidate this functionality in get_config_by_prefix? We would filter the config options by prefix and then return those options with the prefix stripped. I can leave suggestions for what I imagine this would look like:
| for config_key in list(groupby_agg_options.keys()): | |
| groupby_agg_options[ | |
| config_key.rpartition(".")[2] | |
| ] = groupby_agg_options.pop(config_key) |
| groupby_agg_options = context.schema[ | ||
| context.schema_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was giving this a bit more thought and it seems like context.schema_name always returns the default schema, and the context.fqn method can help return the specific schema being used in that query. I'm not completely sure if the schema for groupby is associated with the groupby_call itself or with specific aggregations similar based on the code here:
dask-sql/dask_sql/physical/rel/logical/aggregate.py
Lines 285 to 287 in 5b8f8a9
| schema_name, aggregation_name = context.fqn( | |
| expr.getAggregation().getNameAsId() | |
| ) |
Perhaps @rajagurunath or someone more familiar with this part of the codebase can provide insight here.
|
@charlesbluca This should be good for an initial round of reviews before we finalize the design and start adding test cases. |
charlesbluca
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to add a test to verify that groupby options are being successfully passed through to _perform_aggregation?
dask_sql/datacontainer.py
Outdated
| # Returns { | ||
| # "dask.groupby.aggregate.split_out":1, | ||
| # "dask.groupby.aggregate.split_every":1, | ||
| # "dask.sort.persistpersist": True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # "dask.sort.persistpersist": True | |
| # "dask.sort.persist": True |
Codecov Report
@@ Coverage Diff @@
## main #286 +/- ##
==========================================
- Coverage 95.99% 95.80% -0.20%
==========================================
Files 64 65 +1
Lines 2797 2834 +37
Branches 421 426 +5
==========================================
+ Hits 2685 2715 +30
- Misses 71 74 +3
- Partials 41 45 +4
Continue to review full report at Codecov.
|
|
Should be ready for another round of reviews. I haven't implemented to option to have the configs configurable per |
charlesbluca
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work here @ayushdg 😄
- I made some suggestions in an attempt to consolidate the logic for stripping the prefix for config options in
get_config_by_prefix - Some minor formatting on a docstring codeblock
- Would you mind adding some basic
ConfigContainertests totest_datacontainer.py? Just verifying that all the getter/setter methods work as expected
| for config_key in list(groupby_agg_options.keys()): | ||
| groupby_agg_options[ | ||
| config_key.rpartition(".")[2] | ||
| ] = groupby_agg_options.pop(config_key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this more, would it make sense to consolidate this functionality in get_config_by_prefix? We would filter the config options by prefix and then return those options with the prefix stripped. I can leave suggestions for what I imagine this would look like:
| for config_key in list(groupby_agg_options.keys()): | |
| groupby_agg_options[ | |
| config_key.rpartition(".")[2] | |
| ] = groupby_agg_options.pop(config_key) |
This PR is an initial pass based on the approach discussed in #241 (comment).
The PR does the following:
ConfigContainerclass responsible for storing the configuration dictionary as well as providing helper methods to set/update and retrieve configurations.ConfigContainerobject belongs to a schema. This means that each schema uses the same configuration set.context.sqlrun in addition to having a schema wide config.split_outandsplit_everyconfig options during groupby aggregations in theaggregateplan.