Conversation

@charlesbluca
Collaborator

dask-sql currently uses a workaround, `get_groupby_with_nulls_cols`, to split a dataframe into null/non-null rows before performing a groupby. This seems unnecessary: we can use `groupby(dropna=False)` to keep nulls in the groupby result, and then later move them to the front of the dataframe (depending on how strictly we want to follow PostgreSQL's ordering). For comparison, here are the task graphs for a basic SQL operation:

SELECT
     user_id, SUM(b) AS "S"
FROM user_table_1
GROUP BY user_id
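The `dropna=False` approach can be sketched in pandas terms (Dask's groupby mirrors these semantics); the dataframe below is a made-up stand-in for `user_table_1`:

```python
import pandas as pd

# Made-up stand-in for user_table_1
df = pd.DataFrame(
    {"user_id": [1, 1, None, 2, None], "b": [10, 20, 30, 40, 50]}
)

# dropna=False keeps rows with a null key together as their own group,
# so the dataframe never needs to be split into null/non-null parts
result = (
    df.groupby("user_id", dropna=False)["b"]
    .sum()
    .reset_index()
    .rename(columns={"b": "S"})
)
```

The null group comes out as a single row alongside the non-null groups, which is what lets the pre-split (and its extra graph layers) be dropped.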

Before this PR:

HighLevelGraph with 20 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7fbc344ae590>
 0. from_pandas-b5125674539d8b5d1f754260e384e1fa
 1. assign-e29cf71863faee69539b6f78b8dbc8f3
 2. getitem-aa5bcd5e86836ebad1df2a90117e16f6
 3. fillna-94de4188d246caf9e1b4bf4f1debce03
 4. isnull-76ff5f3bc798aa3da45b6b16dae758fc
 5. inv-8c38b3dd8368fbb685b718f17180f00b
 6. aggregate-agg-838eda3f7eb4dbc264011a4f8e14a10b
 7. rename-c2e63c54b4d4fff51825029f94563613
 8. reset_index-3a34f3d8a1856ef12b299845da6afa8c
 9. rename-79dfdb294050405020a476146c8b5845
 10. getitem-be78c3b9d2c681e09cccdfdf783dbd1f
 11. eq-77e1f642c7c2a8e643f4a22b61af4e11
 12. inv-07fa50e82ef9e1e810819771a8e1ce2d
 13. getitem-1cc4911cff51b0537d276b6a9de6ebf7
 14. where-7cc4db419a8a2172e17accd935b851f4
 15. assign-8ac7e56123d06d1204acb6609e9541b6
 16. getitem-0ab1e73633488c2092c98ca87c31333c
 17. getitem-2a55368cca98a96e0d82198cd6bd7802
 18. assign-837d2902d726cbdd2f356860ab530f04
 19. getitem-f7c62a09776201bb03dae99feccd8138

And after:

HighLevelGraph with 16 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f9c4455d730>
 0. from_pandas-b5125674539d8b5d1f754260e384e1fa
 1. assign-d126cb139c7dfef7033f0e1f1559a8aa
 2. aggregate-agg-5b194136cacfc9fef1c06119ce6a453f
 3. rename-529a9e13861a74ca909376c58eb77c82
 4. reset_index-fcfd12137febeac9b4cfed91590b47c5
 5. rename-1b4bcbd7d2f03cb785833918fd7e148a
 6. getitem-583c0b466bd64af2ba8edf50c31ec41b
 7. eq-c3e82a692e1f30a2482a526c57bfdf74
 8. inv-dd614b43ef9503c3ee6b6c1d0afabe65
 9. getitem-c598fe29b226039022009f4aca4598be
 10. where-491a45d0103973393d8e07a9b82481e9
 11. assign-fb69ab150f02d3d829815441b5e38bc1
 12. getitem-fc12148fe3ee2696902590cbf32e2620
 13. getitem-643b1fd05b327c34403caedd348cd66b
 14. assign-35d81a9f9ef223a5cc776b41dd7dc231
 15. getitem-801bb5652b721f072ea3d18ab9de6db9

This currently excludes null position handling; adding that back in will probably reintroduce some layers.
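One possible way to handle null positioning after the groupby (a sketch, not necessarily what this PR ends up doing) is to sort the aggregated result with `na_position="first"`; the dataframe here is a hypothetical aggregated output:

```python
import pandas as pd

# Hypothetical aggregated output where the null group sorted last
agg = pd.DataFrame({"user_id": [1.0, 2.0, None], "S": [30, 40, 80]})

# na_position="first" emulates PostgreSQL-style NULLS FIRST ordering
nulls_first = agg.sort_values("user_id", na_position="first").reset_index(
    drop=True
)
```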

@charlesbluca
Collaborator Author

It looks like we will still need the groupby workaround (i.e. the slow path) for situations where an aggregation needs to be handled by a custom `AggregationOnPandas` (and maybe all `dd.Aggregation` subclasses in general), as `dropna=False` doesn't work as expected in these cases.

I haven't explored groupby enough to know all the cases where we need a custom aggregation; the only one I know of is when we attempt to aggregate a column with `object` dtype. For situations like this, a reasonable solution proposed by @VibhuJawa would be to establish a standard `convert_dtypes` API, and then have that get called on input dataframes so that they use optimal column dtypes as much as possible.
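A rough illustration of the `convert_dtypes` idea, using pandas' existing `convert_dtypes` rather than the proposed standard API (the data is made up):

```python
import pandas as pd

# An object-dtype column with missing values, which would otherwise
# force the slow-path custom aggregation
df = pd.DataFrame({"user_id": pd.Series([1, 2, None], dtype="object")})

# convert_dtypes infers a nullable extension dtype (e.g. Int64) for the
# column, letting it take the fast dropna=False groupby path
converted = df.convert_dtypes()
```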

```diff
 for group_column in group_columns:
-    # the ~ makes NaN come first
-    is_null_column = ~(group_column.isnull())
+    is_null_column = group_column.isnull()
```
@charlesbluca
Collaborator Author

Noting that I removed the inverse operation here so that the slow-path groupby agg is consistent with dask/dask-cudf's; this should also speed things up a little.

@codecov-commenter

codecov-commenter commented Nov 9, 2021

Codecov Report

Merging #273 (05530d6) into main (2194a75) will decrease coverage by 0.06%.
The diff coverage is 86.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #273      +/-   ##
==========================================
- Coverage   95.75%   95.69%   -0.07%     
==========================================
  Files          65       65              
  Lines        2802     2809       +7     
  Branches      418      421       +3     
==========================================
+ Hits         2683     2688       +5     
- Misses         77       78       +1     
- Partials       42       43       +1     
| Impacted Files | Coverage Δ |
| --- | --- |
| dask_sql/physical/rel/logical/aggregate.py | 94.70% <85.71%> (-1.14%) ⬇️ |
| dask_sql/physical/utils/groupby.py | 100.00% <100.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2194a75...05530d6.
