Skip to content

Add "collect" aggregation support to dask-cudf#15593

Merged
rapids-bot[bot] merged 13 commits intorapidsai:branch-24.06from
rjzamora:groupby-updates
May 1, 2024
Merged

Add "collect" aggregation support to dask-cudf#15593
rapids-bot[bot] merged 13 commits intorapidsai:branch-24.06from
rjzamora:groupby-updates

Conversation

@rjzamora
Copy link
Copy Markdown
Member

@rjzamora rjzamora commented Apr 24, 2024

Currently blocked by dask/dask#11064

Description

This PR (along with it's upstream dependency) enables "collect" aggregations in dask-cudf when query-planning is enabled. It also adds an clearer error message for as_index usage (which is not supported in dask-dataframe, but was supported in legacy dask-cudf)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added bug Something isn't working 2 - In Progress Currently a work in progress dask Dask issue non-breaking Non-breaking change labels Apr 24, 2024
@rjzamora rjzamora self-assigned this Apr 24, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Apr 24, 2024
@rjzamora rjzamora marked this pull request as ready for review April 26, 2024 14:34
@rjzamora rjzamora requested a review from a team as a code owner April 26, 2024 14:34
@rjzamora rjzamora added 4 - Needs Review Waiting for reviewer to review or respond and removed 2 - In Progress Currently a work in progress labels Apr 26, 2024
Copy link
Copy Markdown
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor requests for clarification in the doc aspect.

Comment on lines +84 to +89
if as_index is not True:
raise NotImplementedError(
f"`as_index` is not supported by dask-expr. Please disable "
"query planning, or reset the index after aggregating.\n"
f"{_LEGACY_WORKAROUND}"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This is a somewhat confusing error message. The only way to get past it with query planning enabled is to say as_index=True, but the error message seems to say "as_index=True` is not handled by dask-expr.

Do you mean:

dask-expr only supports as_index=True. For as_index=False either disable query planning or reset the index with reset_index after aggregating.

WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like dask-expr doesn't actually have support for the as_index keyword arg in general and always follows the behavior of as_index=False, so perhaps we should consider:

  • checking if the kwarg is provided at all, emitting a FutureWarning if so
  • if the kwarg isn't as_index=True, raise the error description suggested above

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The upstream dask.dataframe API has always raised an error when as_index is used, but dask-cudf has used a distinct groupby API until now.

I agree that the real problem is that as_index is not supported at all by dask-expr. Therefore @charlesbluca's suggestion is probably the most "correct". With that said, I'm feeling a bit hesitant to add more noise for something that technically "works fine" :/

Comment on lines +84 to +96
if "as_index" in kwargs:
warnings.warn(
"The `as_index` argument is no longer supported in "
"dask-cudf when query-planning is enabled.",
FutureWarning,
)

if kwargs.pop("as_index", True) is not True:
raise NotImplementedError(
f"`as_index=False` is not supported. Please disable "
"query planning, or reset the index after aggregating.\n"
f"{_LEGACY_WORKAROUND}"
)
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This essentially does what @charlesbluca suggested - The message is still slightly confusing, but so is the behavior I guess.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something to the effect of "as_index is not supported with query-planning enabled this will behave consistently with as_index=True" for both messages, with an additional blurb for the error showing the non-true value passed by the user?

Copy link
Copy Markdown
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, looks good. I agree the message is still a bit confusing but I do not obviously have a better suggestion :(

@rjzamora
Copy link
Copy Markdown
Member Author

rjzamora commented May 1, 2024

I really appreciate the reviews @wence- @charlesbluca ! I just cleanup up the messaging a bit - Definitely not perfect, but hopefully clear enough for most users.

@wence-
Copy link
Copy Markdown
Contributor

wence- commented May 1, 2024

/merge

@rapids-bot rapids-bot bot merged commit 7458a6e into rapidsai:branch-24.06 May 1, 2024
@rjzamora rjzamora deleted the groupby-updates branch May 2, 2024 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

4 - Needs Review Waiting for reviewer to review or respond bug Something isn't working dask Dask issue non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants