-
Notifications
You must be signed in to change notification settings - Fork 72
Remove null-splitting from _perform_aggregation
#273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove null-splitting from _perform_aggregation
#273
Conversation
|
It looks like we will still need the groupby workaround (i.e. slow path) for situations where an aggregation needs to be handled by a custom I haven't played around with groupby to know all the cases where we need a custom aggregation, so the only one I know of is when we attempt to aggregate a column with |
| for group_column in group_columns: | ||
| # the ~ makes NaN come first | ||
| is_null_column = ~(group_column.isnull()) | ||
| is_null_column = group_column.isnull() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting that I removed the inverse operation here so that the slow path groupby agg is consistent with dask/dask-cudf's; this should also speed things up a little
Codecov Report
@@ Coverage Diff @@
## main #273 +/- ##
==========================================
- Coverage 95.75% 95.69% -0.07%
==========================================
Files 65 65
Lines 2802 2809 +7
Branches 418 421 +3
==========================================
+ Hits 2683 2688 +5
- Misses 77 78 +1
- Partials 42 43 +1
Continue to review full report at Codecov.
|
This reverts commit 5ae2f79.
This reverts commit 5ae2f79.
dask-sql currently uses a workaround
get_groupby_with_nulls_colsto split a dataframe by nulls/non-nulls before performing a groupby. This seems unnecessary, as we can usegroupby(dropna=False)to keep nulls in the groupby result, and then later move them to the front of the dataframe (depending on how strictly we want to follow PostgreSQL). For comparison, here are the task graphs for a basic SQL operation:Before this PR:
And after:
This is currently excluding null position handling - adding that in will probably bring back some layers