Add groupby split_out config options to dask-sql #286

ayushdg · 2021-10-27T17:12:26Z

This PR is an initial pass based on the approach discussed in #241 (comment).

The PR does the following:

- Create a ConfigContainer class responsible for storing the configuration dictionary and providing helper methods to set, update, and retrieve configurations.
  - This class has a set of defaults to use when no configuration options are provided. One drawback of this approach is that the defaults would need to be updated whenever an upstream library changes its default behavior. I'm considering whether they should be left empty instead.
- The ConfigContainer object belongs to a schema, so everything in a given schema shares the same set of configurations.
  - Consider making the configs changeable per context.sql run in addition to having a schema-wide config.
- Add usage of the split_out and split_every config options during groupby aggregations in the aggregate plan.
  - In practice this doesn't work right now due to an edge case in how dask-sql calls groupby and how Dask hashes output partitions when splitting out.
  - Raise an issue for the above: "Groupby + split_out fails when key columns have the same name" (dask/dask#8307)
  - Unblocked by "Use unique names for null/non-null groupby columns" (#289)
- Add tests for config
- Add tests for groupby w/ configs
- Add docs explaining the different config options
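
For readers who want a concrete picture of the container described above, here is a minimal sketch. It is not the code from this PR: the method names set_config and get_config and the default values are assumptions made for illustration. Only the class name and the two option keys come from the discussion.

```python
from typing import Any, Dict, Optional


class ConfigContainer:
    """Hypothetical sketch of a per-schema configuration store."""

    # Assumed defaults for illustration only; the PR's real defaults may differ.
    DEFAULTS: Dict[str, Any] = {
        "dask.groupby.aggregate.split_out": 1,
        "dask.groupby.aggregate.split_every": None,
    }

    def __init__(self, options: Optional[Dict[str, Any]] = None) -> None:
        # Copy the defaults so instances never share mutable state.
        self.config_dict: Dict[str, Any] = dict(self.DEFAULTS)
        if options:
            self.set_config(options)

    def set_config(self, options: Dict[str, Any]) -> None:
        """Add new options or overwrite existing ones."""
        self.config_dict.update(options)

    def get_config(self, key: str, default: Any = None) -> Any:
        """Return a single option, or `default` when it is unset."""
        return self.config_dict.get(key, default)


container = ConfigContainer()
container.set_config({"dask.groupby.aggregate.split_out": 4})
print(container.get_config("dask.groupby.aggregate.split_out"))
```

Keeping the defaults in one class attribute makes the trade-off mentioned above visible: if an upstream default changes, this dictionary is the single place that would need updating.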

rajagurunath

Everything LGTM !!

rajagurunath · 2021-10-27T18:14:59Z

dask_sql/datacontainer.py

+            config_options = [config_options]
+        self.config_dict.update(config_options)
+
+    def get_groupby_aggregate_configs(self):


Just a question here: are we planning to use a different getter function for each type of configuration?

For example, if in the future we need to add configurations for persist=True|False, or for join-based hints as mentioned in #280 (comment), are we planning to add an individual getter function for each?

That's a great point. I've been thinking about this and have a couple of ideas:

One approach is to have getter functions like these throughout the class. It might get large very quickly (based on the config options) but would be a nice central location to get relevant config options.

The other option is to have a single generic prefix-based getter: it takes a key prefix such as dask.groupby.aggregate and returns a dictionary of all key/value pairs whose keys match that prefix. This would make the class much cleaner.

I don't have strong opinions on either but am leaning towards option 2. Open to opinions or suggestions from others as well.

That's a great idea 👍. I am also in favour of option 2. Happy to hear other people's suggestions.
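
Option 2 from the exchange above could look roughly like the following. This is a sketch under assumptions: it is written as a free function over a plain dictionary rather than as a ConfigContainer method, and the option values are made up for the example.

```python
from typing import Any, Dict


def get_config_by_prefix(config: Dict[str, Any], prefix: str) -> Dict[str, Any]:
    """Return every option whose key starts with `prefix` (keys left intact)."""
    return {key: val for key, val in config.items() if key.startswith(prefix)}


# Example values are assumptions, not defaults taken from dask-sql.
config = {
    "dask.groupby.aggregate.split_out": 2,
    "dask.groupby.aggregate.split_every": 4,
    "dask.sort.persist": True,
}

groupby_options = get_config_by_prefix(config, "dask.groupby.aggregate")
print(groupby_options)
```

A single getter like this covers future option groups (sort, join hints) without adding a new method per group, which is the main argument made for option 2.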

charlesbluca · 2021-11-04T16:29:09Z

rerun tests

ayushdg

While exploring some of the work I did with ConfigContainer I came across Dask's config module: https://docs.dask.org/en/stable/configuration.html#configuration. I wonder if that's a better alternative to writing our own methods for doing similar things and would be better in the long run.

I was also thinking about how we could accept a config at runtime with context.sql. That would require another ConfigContainer attribute associated with the Context class rather than the schema. During query execution we would create a new config dictionary that prioritizes the options provided in context.sql and falls back to the schema for any configs that aren't passed with the sql call. Does that seem reasonable?
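
The precedence rule proposed here (per-query options first, schema-wide options as fallback) can be sketched with a plain dictionary merge. The function name resolve_config and the example values are hypothetical; nothing like this was implemented in the PR, which deferred per-query configs to a follow-up.

```python
from typing import Any, Dict, Optional


def resolve_config(
    schema_config: Dict[str, Any],
    query_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build the effective config for one query.

    Later dictionaries win in the merge, so per-query options override
    schema-wide ones, and the schema fills in anything the query omits.
    """
    return {**schema_config, **(query_config or {})}


schema_config = {
    "dask.groupby.aggregate.split_out": 1,
    "dask.groupby.aggregate.split_every": 8,
}
query_config = {"dask.groupby.aggregate.split_out": 4}

effective = resolve_config(schema_config, query_config)
print(effective)
```

Because the merge builds a new dictionary, the schema-level config is left untouched and the override only lasts for the one query.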

ayushdg · 2021-11-05T00:51:11Z

dask_sql/datacontainer.py

+                key: val
+                for key, val in self.config_dict.items()
+                if key.startswith(config_prefix)


This may not be the most efficient way to do this, but should be okay for small dictionaries.

ayushdg · 2021-11-05T00:53:20Z

dask_sql/physical/rel/logical/aggregate.py

+        for config_key in list(groupby_agg_options.keys()):
+            groupby_agg_options[
+                config_key.rpartition(".")[2]
+            ] = groupby_agg_options.pop(config_key)


I was initially planning to have this logic in ConfigContainer, returning the param name from the config string, i.e. dask.groupby.aggregate.split_out -> split_out. Things were getting a bit messy, though, because multiple config options down the line could share a param name, which would cause conflicts; e.g. dask.dataframe.drop_duplicates.split_out would also return split_out.
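
The conflict described above is easy to reproduce with the two keys named in the comment. The values are made up for the example.

```python
# Two options from different groups that share the parameter name "split_out".
config = {
    "dask.groupby.aggregate.split_out": 2,
    "dask.dataframe.drop_duplicates.split_out": 8,
}

# Keep only the text after the last "." of each key, as rpartition(".")[2] does.
params = {key.rpartition(".")[2]: val for key, val in config.items()}

# Both keys collapse to "split_out", so one value silently overwrites the other.
print(params)
```

Stripping the name globally loses the information about which operation an option belongs to, which is why the stripping in this PR happens only after the options have been filtered down to one prefix.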

ayushdg · 2021-11-05T01:01:22Z

dask_sql/physical/rel/logical/aggregate.py

+        groupby_agg_options = context.schema[
+            context.schema_name


I was giving this a bit more thought, and it seems like context.schema_name always returns the default schema, while the context.fqn method can return the specific schema being used in that query. I'm not completely sure whether the schema for a groupby is associated with the groupby call itself or with the specific aggregations, based on the code here:

dask-sql/dask_sql/physical/rel/logical/aggregate.py, lines 285 to 287 in 5b8f8a9:

    schema_name, aggregation_name = context.fqn(
        expr.getAggregation().getNameAsId()
    )

Perhaps @rajagurunath or someone more familiar with this part of the codebase can provide insight here.

ayushdg · 2021-11-05T01:26:21Z

@charlesbluca This should be good for an initial round of reviews before we finalize the design and start adding test cases.

charlesbluca

Is it possible to add a test to verify that groupby options are being successfully passed through to _perform_aggregation?
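
One way such a test could be structured is with unittest.mock, patching the aggregation method and asserting on the keyword arguments it receives. The sketch below uses a stand-in class, not the real dask-sql plugin: AggregatePlugin, its convert signature, and the string "df" standing in for a dataframe are all assumptions made to keep the example self-contained.

```python
from unittest import mock


class AggregatePlugin:
    """Stand-in for the aggregate plugin; not the real dask-sql class."""

    def convert(self, df, config):
        # Select the groupby options and pass them on as keyword arguments.
        prefix = "dask.groupby.aggregate."
        options = {
            key[len(prefix):]: val
            for key, val in config.items()
            if key.startswith(prefix)
        }
        return self._perform_aggregation(df, **options)

    def _perform_aggregation(self, df, split_out=1, split_every=None):
        return df


def test_groupby_options_are_forwarded():
    plugin = AggregatePlugin()
    config = {"dask.groupby.aggregate.split_out": 2, "dask.sort.persist": True}
    # Replace the method with a mock so the call arguments can be inspected.
    with mock.patch.object(plugin, "_perform_aggregation") as patched:
        plugin.convert("df", config)
    patched.assert_called_once_with("df", split_out=2)


test_groupby_options_are_forwarded()
```

The assertion checks both that the groupby option arrives under its short name and that the unrelated sort option is not forwarded.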

dask_sql/context.py

dask_sql/datacontainer.py

charlesbluca · 2021-11-08T19:06:04Z

dask_sql/datacontainer.py

+            # Returns {
+            #   "dask.groupby.aggregate.split_out":1,
+            #   "dask.groupby.aggregate.split_every":1,
+            #   "dask.sort.persistpersist": True


Suggested change

-            #   "dask.sort.persistpersist": True
+            #   "dask.sort.persist": True

codecov-commenter · 2021-11-18T14:48:44Z

Codecov Report

Merging #286 (e4c8613) into main (e48d9c1) will decrease coverage by 0.19%.
The diff coverage is 94.11%.

@@            Coverage Diff             @@
##             main     #286      +/-   ##
==========================================
- Coverage   95.99%   95.80%   -0.20%     
==========================================
  Files          64       65       +1     
  Lines        2797     2834      +37     
  Branches      421      426       +5     
==========================================
+ Hits         2685     2715      +30     
- Misses         71       74       +3     
- Partials       41       45       +4

Impacted Files	Coverage Δ
dask_sql/datacontainer.py	`93.57% <90.90%> (-0.75%)`	⬇️
dask_sql/context.py	`99.13% <100.00%> (+0.03%)`	⬆️
dask_sql/physical/rel/logical/aggregate.py	`95.91% <100.00%> (+0.08%)`	⬆️
dask_sql/physical/utils/groupby.py	`100.00% <100.00%> (ø)`
dask_sql/physical/utils/sort.py	`83.33% <0.00%> (-7.06%)`	⬇️
dask_sql/server/responses.py	`97.87% <0.00%> (-2.13%)`	⬇️
dask_sql/physical/rel/logical/join.py	`98.30% <0.00%> (ø)`
dask_sql/physical/rel/custom/__init__.py	`100.00% <0.00%> (ø)`
dask_sql/physical/rel/custom/distributeby.py	`86.36% <0.00%> (ø)`
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ayushdg · 2021-11-18T15:12:14Z

This should be ready for another round of reviews. I haven't implemented the option to make the configs configurable per context.sql run; I will explore that in a follow-up PR, along with considering Dask's provided config module.

charlesbluca

Thanks for the work here @ayushdg 😄

- I made some suggestions in an attempt to consolidate the logic for stripping the prefix for config options in get_config_by_prefix
- Some minor formatting on a docstring code block
- Would you mind adding some basic ConfigContainer tests to test_datacontainer.py? Just verifying that all the getter/setter methods work as expected

charlesbluca · 2021-11-18T16:24:47Z

dask_sql/physical/rel/logical/aggregate.py

+        for config_key in list(groupby_agg_options.keys()):
+            groupby_agg_options[
+                config_key.rpartition(".")[2]
+            ] = groupby_agg_options.pop(config_key)


Thinking about this more, would it make sense to consolidate this functionality in get_config_by_prefix? We would filter the config options by prefix and then return those options with the prefix stripped. I can leave suggestions for what I imagine this would look like:

Suggested change

        for config_key in list(groupby_agg_options.keys()):
            groupby_agg_options[
                config_key.rpartition(".")[2]
            ] = groupby_agg_options.pop(config_key)
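
The consolidation proposed in this comment, filtering by prefix and returning the options with the prefix stripped, might look like the sketch below. It is an illustration written as a free function, not necessarily the code that was merged; the trailing-dot normalization and the example values are assumptions.

```python
from typing import Any, Dict


def get_config_by_prefix(config: Dict[str, Any], prefix: str) -> Dict[str, Any]:
    """Filter options by prefix and key the result by the remainder."""
    # Normalize so "a.b" and "a.b." behave the same and "a.bc" is not matched.
    prefix = prefix.rstrip(".") + "."
    return {
        key[len(prefix):]: val
        for key, val in config.items()
        if key.startswith(prefix)
    }


config = {
    "dask.groupby.aggregate.split_out": 2,
    "dask.groupby.aggregate.split_every": 4,
    "dask.dataframe.drop_duplicates.split_out": 8,
}

groupby_kwargs = get_config_by_prefix(config, "dask.groupby.aggregate")
print(groupby_kwargs)
```

Since the stripping is scoped to one prefix, the two split_out options no longer collide, and the result can be passed directly as keyword arguments to the aggregation call.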

dask_sql/physical/rel/logical/aggregate.py

dask_sql/datacontainer.py

brandon-b-miller and others added 4 commits October 25, 2021 13:35

update docs, error if no return type passed

a79c6d9

tests, bugs

7dd5bae

Add a configContainer class for adding config options as well as a se…

9da285c

…tter method from Context

Update aggregate plans to use dask groupby configs where applicable

f8b4eb5

rajagurunath reviewed Oct 27, 2021

ayushdg and others added 2 commits October 27, 2021 14:37

Merge branch '283' into fea-schema-split_out-config

bd15fb3

Use unique names for null/non-null groupby columns

b755bb2

charlesbluca mentioned this pull request Nov 2, 2021

Use unique names for null/non-null groupby columns #289

Merged

Merge branch '289' into fea-schema-split_out-config

696d47c

charlesbluca mentioned this pull request Nov 2, 2021

[QST] To what extent should Dask-SQL's output match PostgreSQL? #290

Open

Update config getter method, add drop_config method and update aggreg…

45edb63

…ate to use the new api

ayushdg commented Nov 5, 2021

charlesbluca reviewed Nov 8, 2021

randerzander mentioned this pull request Nov 13, 2021

[ENH] Support user control over case sensitivity #315

Closed

ayushdg mentioned this pull request Nov 15, 2021

Ignore case for queries in the parser configuration #316

Merged

ayushdg added 2 commits November 18, 2021 06:37

Add groupby config_option pytest

666eb84

update docstring

e4c8613

ayushdg marked this pull request as ready for review November 18, 2021 15:09

Fix typo

cdb7d62

ayushdg changed the title ~~[WIP] Add groupby split_out config options to dask-sql~~ Add groupby split_out config options to dask-sql Nov 18, 2021

charlesbluca requested changes Nov 18, 2021

charlesbluca reviewed Nov 18, 2021

Fix docstring code block formatting

57b199c

charlesbluca approved these changes Nov 22, 2021

charlesbluca merged commit 2ca914a into dask-contrib:main Nov 22, 2021

charlesbluca mentioned this pull request Nov 22, 2021

Revert "Remove null-splitting from _perform_aggregation" #325

Merged

ayushdg mentioned this pull request Feb 4, 2022

[Review] Refactor ConfigContainer to use dask config #392

Merged

ayushdg deleted the fea-schema-split_out-config branch December 12, 2022 13:40
