Conversation

@charlesbluca
Collaborator

dask-sql currently uses a workaround, `get_groupby_with_nulls_cols`, to split a dataframe into null/non-null rows before performing a groupby. This seems unnecessary: we can use `groupby(dropna=False)` to keep nulls in the groupby result, and then later move them to the front of the dataframe (depending on how strictly we want to follow PostgreSQL's ordering). For comparison, here are the task graphs for a basic SQL operation:

SELECT
     user_id, SUM(b) AS "S"
FROM user_table_1
GROUP BY user_id
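The `dropna=False` approach can be sketched in pandas terms (Dask's groupby mirrors these semantics); the dataframe below is a made-up stand-in for `user_table_1`:

```python
import pandas as pd

# Made-up stand-in for user_table_1
df = pd.DataFrame(
    {"user_id": [1, 1, None, 2, None], "b": [10, 20, 30, 40, 50]}
)

# dropna=False keeps rows with a null key together as their own group,
# so the dataframe never needs to be split into null/non-null parts
result = (
    df.groupby("user_id", dropna=False)["b"]
    .sum()
    .reset_index()
    .rename(columns={"b": "S"})
)
```

The null group comes out as a single row alongside the non-null groups, which is what lets the pre-split (and its extra graph layers) be dropped.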

Before this PR:

HighLevelGraph with 20 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7fbc344ae590>
 0. from_pandas-b5125674539d8b5d1f754260e384e1fa
 1. assign-e29cf71863faee69539b6f78b8dbc8f3
 2. getitem-aa5bcd5e86836ebad1df2a90117e16f6
 3. fillna-94de4188d246caf9e1b4bf4f1debce03
 4. isnull-76ff5f3bc798aa3da45b6b16dae758fc
 5. inv-8c38b3dd8368fbb685b718f17180f00b
 6. aggregate-agg-838eda3f7eb4dbc264011a4f8e14a10b
 7. rename-c2e63c54b4d4fff51825029f94563613
 8. reset_index-3a34f3d8a1856ef12b299845da6afa8c
 9. rename-79dfdb294050405020a476146c8b5845
 10. getitem-be78c3b9d2c681e09cccdfdf783dbd1f
 11. eq-77e1f642c7c2a8e643f4a22b61af4e11
 12. inv-07fa50e82ef9e1e810819771a8e1ce2d
 13. getitem-1cc4911cff51b0537d276b6a9de6ebf7
 14. where-7cc4db419a8a2172e17accd935b851f4
 15. assign-8ac7e56123d06d1204acb6609e9541b6
 16. getitem-0ab1e73633488c2092c98ca87c31333c
 17. getitem-2a55368cca98a96e0d82198cd6bd7802
 18. assign-837d2902d726cbdd2f356860ab530f04
 19. getitem-f7c62a09776201bb03dae99feccd8138

And after:

HighLevelGraph with 16 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f9c4455d730>
 0. from_pandas-b5125674539d8b5d1f754260e384e1fa
 1. assign-d126cb139c7dfef7033f0e1f1559a8aa
 2. aggregate-agg-5b194136cacfc9fef1c06119ce6a453f
 3. rename-529a9e13861a74ca909376c58eb77c82
 4. reset_index-fcfd12137febeac9b4cfed91590b47c5
 5. rename-1b4bcbd7d2f03cb785833918fd7e148a
 6. getitem-583c0b466bd64af2ba8edf50c31ec41b
 7. eq-c3e82a692e1f30a2482a526c57bfdf74
 8. inv-dd614b43ef9503c3ee6b6c1d0afabe65
 9. getitem-c598fe29b226039022009f4aca4598be
 10. where-491a45d0103973393d8e07a9b82481e9
 11. assign-fb69ab150f02d3d829815441b5e38bc1
 12. getitem-fc12148fe3ee2696902590cbf32e2620
 13. getitem-643b1fd05b327c34403caedd348cd66b
 14. assign-35d81a9f9ef223a5cc776b41dd7dc231
 15. getitem-801bb5652b721f072ea3d18ab9de6db9

This currently excludes null position handling; adding that back in will probably reintroduce some layers.
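One possible way to handle null positioning after the groupby (a sketch, not necessarily what this PR ends up doing) is to sort the aggregated result with `na_position="first"`; the dataframe here is a hypothetical aggregated output:

```python
import pandas as pd

# Hypothetical aggregated output where the null group sorted last
agg = pd.DataFrame({"user_id": [1.0, 2.0, None], "S": [30, 40, 80]})

# na_position="first" emulates PostgreSQL-style NULLS FIRST ordering
nulls_first = agg.sort_values("user_id", na_position="first").reset_index(
    drop=True
)
```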

@charlesbluca
Collaborator Author

It looks like we will still need the groupby workaround (i.e. the slow path) for situations where an aggregation needs to be handled by a custom `AggregationOnPandas` (and maybe all `dd.Aggregation` subclasses in general), as `dropna=False` doesn't work as expected in these cases.

I haven't explored groupby enough to know all the cases where we need a custom aggregation; the only one I know of is when we attempt to aggregate a column with `object` dtype. For situations like this, a reasonable solution proposed by @VibhuJawa would be to establish a standard `convert_dtypes` API, and then have that get called on input dataframes so that they use optimal column dtypes as much as possible.
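A rough illustration of the `convert_dtypes` idea, using pandas' existing `convert_dtypes` rather than the proposed standard API (the data is made up):

```python
import pandas as pd

# An object-dtype column with missing values, which would otherwise
# force the slow-path custom aggregation
df = pd.DataFrame({"user_id": pd.Series([1, 2, None], dtype="object")})

# convert_dtypes infers a nullable extension dtype (e.g. Int64) for the
# column, letting it take the fast dropna=False groupby path
converted = df.convert_dtypes()
```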

```diff
 for group_column in group_columns:
-    # the ~ makes NaN come first
-    is_null_column = ~(group_column.isnull())
+    is_null_column = group_column.isnull()
```
@charlesbluca
Collaborator Author

Noting that I removed the inverse operation here so that the slow-path groupby agg is consistent with dask/dask-cudf's; this should also speed things up a little.

@codecov-commenter

codecov-commenter commented Nov 9, 2021

Codecov Report

Merging #273 (05530d6) into main (2194a75) will decrease coverage by 0.06%.
The diff coverage is 86.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #273      +/-   ##
==========================================
- Coverage   95.75%   95.69%   -0.07%     
==========================================
  Files          65       65              
  Lines        2802     2809       +7     
  Branches      418      421       +3     
==========================================
+ Hits         2683     2688       +5     
- Misses         77       78       +1     
- Partials       42       43       +1     
| Impacted Files | Coverage Δ |
| --- | --- |
| dask_sql/physical/rel/logical/aggregate.py | 94.70% <85.71%> (-1.14%) ⬇️ |
| dask_sql/physical/utils/groupby.py | 100.00% <100.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2194a75...05530d6.
