Add "collect" aggregation support to dask-cudf #15593
rapids-bot[bot] merged 13 commits into rapidsai:branch-24.06
Conversation
wence- left a comment
Some minor requests for clarification in the doc aspect.
if as_index is not True:
    raise NotImplementedError(
        f"`as_index` is not supported by dask-expr. Please disable "
        "query planning, or reset the index after aggregating.\n"
        f"{_LEGACY_WORKAROUND}"
    )
nit: This is a somewhat confusing error message. The only way to get past it with query planning enabled is to say `as_index=True`, but the error message seems to say "`as_index=True` is not handled by dask-expr".
Do you mean:
`dask-expr` only supports `as_index=True`. For `as_index=False`, either disable query planning or reset the index with `reset_index` after aggregating.
WDYT?
It sounds like dask-expr doesn't actually have support for the as_index keyword arg in general and always follows the behavior of as_index=True, so perhaps we should consider:
- checking if the kwarg is provided at all, emitting a FutureWarning if so
- if the kwarg isn't as_index=True, raising the error description suggested above
The upstream dask.dataframe API has always raised an error when as_index is used, but dask-cudf has used a distinct groupby API until now.
I agree that the real problem is that as_index is not supported at all by dask-expr. Therefore @charlesbluca's suggestion is probably the most "correct". With that said, I'm feeling a bit hesitant to add more noise for something that technically "works fine" :/
if "as_index" in kwargs:
    warnings.warn(
        "The `as_index` argument is no longer supported in "
        "dask-cudf when query-planning is enabled.",
        FutureWarning,
    )

if kwargs.pop("as_index", True) is not True:
    raise NotImplementedError(
        f"`as_index=False` is not supported. Please disable "
        "query planning, or reset the index after aggregating.\n"
        f"{_LEGACY_WORKAROUND}"
    )
This essentially does what @charlesbluca suggested - The message is still slightly confusing, but so is the behavior I guess.
Maybe something to the effect of "`as_index` is not supported with query-planning enabled; this will behave consistently with `as_index=True`" for both messages, with an additional blurb for the error showing the non-true value passed by the user?
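One way to read this suggestion is as a single shared message used by both the warning and the error. The sketch below is a hypothetical, standalone helper: the names `check_as_index` and `_AS_INDEX_MESSAGE` are illustrative and do not come from the PR, and the wording is not the message that was merged. It assumes only what this thread describes, namely that results always look like `as_index=True` when query planning is enabled.

```python
import warnings

# Hypothetical shared wording (an assumption for illustration only).
_AS_INDEX_MESSAGE = (
    "`as_index` is not supported with query planning enabled; "
    "results always behave as if `as_index=True`."
)


def check_as_index(kwargs):
    """Warn if `as_index` is passed at all; raise if it is not True.

    Removes `as_index` from `kwargs` and returns the remaining kwargs.
    """
    if "as_index" not in kwargs:
        return kwargs
    # Any explicit use of the keyword gets the warning.
    warnings.warn(_AS_INDEX_MESSAGE, FutureWarning, stacklevel=2)
    value = kwargs.pop("as_index")
    if value is not True:
        # Echo the offending value back to the user.
        raise NotImplementedError(
            f"{_AS_INDEX_MESSAGE} Got as_index={value!r}. "
            "Reset the index after aggregating instead."
        )
    return kwargs
```

With this shape, `check_as_index({"as_index": True})` warns and continues, while `check_as_index({"as_index": False})` warns and then raises with the passed value included in the message.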
wence- left a comment
Thanks for the changes, looks good. I agree the message is still a bit confusing but I do not obviously have a better suggestion :(
I really appreciate the reviews @wence- @charlesbluca! I just cleaned up the messaging a bit. Definitely not perfect, but hopefully clear enough for most users.
/merge
Currently blocked by dask/dask#11064
Description
This PR (along with its upstream dependency) enables "collect" aggregations in dask-cudf when query-planning is enabled. It also adds a clearer error message for `as_index` usage (which is not supported in dask-dataframe, but was supported in legacy dask-cudf).
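For readers unfamiliar with the aggregation, "collect" gathers the values of each group into a list, comparable to aggregating with `list` in pandas. The plain-Python sketch below only illustrates that semantics on made-up data; `collect_by_key` is a hypothetical helper and does not reflect how dask-cudf implements the aggregation.

```python
from collections import defaultdict


def collect_by_key(rows, key, value):
    """Group `rows` (a list of dicts) by `key` and collect each group's
    `value` entries into a list, preserving input order within a group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return dict(groups)


# Made-up example data: column "a" is the group key, "b" holds the values.
rows = [
    {"a": 1, "b": 10},
    {"a": 2, "b": 20},
    {"a": 1, "b": 30},
]
collected = collect_by_key(rows, "a", "b")
print(collected)  # {1: [10, 30], 2: [20]}
```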