Use Dask's sort_values for first column sorting in apply_sort
#255
Thinking about this more, I don't actually think we need to sort the first column of the dataframe - just accomplishing Wondering in that case if it makes sense to try and move this logic upstream - maybe with a kwarg for cc @rjzamora if you have any thoughts on this
I agree that much of the dask_cudf sort_values logic should live in upstream dask.dataframe. However, I will need to think a bit more about the Somewhat-related thoughts: One thing I like about the dask_cudf logic is that
Codecov Report
@@ Coverage Diff @@
## main #255 +/- ##
==========================================
- Coverage 95.89% 95.89% -0.01%
==========================================
Files 64 65 +1
Lines 2730 2800 +70
Branches 408 418 +10
==========================================
+ Hits 2618 2685 +67
- Misses 72 73 +1
- Partials 40 42 +2
Looks like the original fix I had in mind for dask/dask#8255 is actually needed here if we intend to use upstream Dask sorting functions ( In any case, I'm pretty sure using
  ascending=sort_ascending,
  na_position="first" if sort_null_first[0] else "last",
- )
+ ).persist()
Noting that I added this persist call after sorting to match up with the other sort paths, which persist after sorting.
  ):
  try:
- return df.sort_values(sort_columns, ignore_index=True)
+ return df.sort_values(sort_columns, ignore_index=True).persist()
Same as above
Now that Dask's sort_values has support for ascending/descending and null positioning, we can use it to do what _sort_first_column was once doing in a (hopefully) more performant way. There are still some failures here that are resolved with dask/dask#8225, so this should be safe to merge by the next Dask release.
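For anyone reviewing the expected ordering, here is a minimal plain-Python sketch of the semantics that the ascending and na_position arguments describe for a single sort column. It does not use Dask and is not how Dask or _sort_first_column implements sorting; the function name sort_by_first_column and the nulls_first flag are hypothetical, chosen only for illustration.

```python
from typing import Any, List, Tuple

Row = Tuple[Any, ...]


def sort_by_first_column(
    rows: List[Row], ascending: bool = True, nulls_first: bool = False
) -> List[Row]:
    """Sort rows by their first element, placing None values as requested.

    Hypothetical helper: it mirrors the ordering implied by
    ascending=... and na_position="first"/"last", nothing more.
    """
    # Split nulls out so they never take part in value comparisons.
    nulls = [row for row in rows if row[0] is None]
    values = [row for row in rows if row[0] is not None]

    # Order the non-null rows; null placement is independent of direction.
    values.sort(key=lambda row: row[0], reverse=not ascending)

    return nulls + values if nulls_first else values + nulls


rows = [(3, "x"), (None, "n"), (1, "z"), (2, "y")]

# Descending with nulls first, i.e. ascending=False, na_position="first".
print(sort_by_first_column(rows, ascending=False, nulls_first=True))

# Defaults: ascending with nulls last.
print(sort_by_first_column(rows))
```

The point of the sketch is that null placement is decided separately from the sort direction, which is why the diff passes na_position alongside ascending rather than deriving one from the other.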