Use Dask's `sort_values` for first column sorting in `apply_sort` #255

charlesbluca · 2021-10-13T14:50:06Z

Now that Dask's sort_values has support for ascending/descending and null positioning, we can use it to do what _sort_first_column was once doing in a (hopefully) more performant way. There are still some failures here that are resolved with dask/dask#8225, so this should be safe to merge by the next Dask release.

dask_sql/physical/utils/sort.py

charlesbluca · 2021-10-14T21:34:13Z

Thinking about this more, I don't actually think we need to sort the first column of the dataframe - just accomplishing rearrange_by_divisions for the first column should be sufficient so that the map_partitions call sorts the dataframe properly.
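To make the reasoning above concrete, here is a minimal pure-Python sketch (not the dask-sql implementation). It assumes rows are plain tuples and that the divisions are simple boundary values on the first column; `rearrange_by_first_column` and `sort_partitions` are hypothetical stand-ins for rearrange_by_divisions and the map_partitions call. Since rows with equal first-column values always land in the same partition, sorting each partition on all columns should be enough to give a globally sorted result.

```python
from bisect import bisect_right


def rearrange_by_first_column(rows, divisions):
    """Route each row to a partition using only its first column.

    `divisions` holds the boundaries between partitions, so
    len(divisions) + 1 partitions are produced.
    """
    partitions = [[] for _ in range(len(divisions) + 1)]
    for row in rows:
        partitions[bisect_right(divisions, row[0])].append(row)
    return partitions


def sort_partitions(partitions):
    # Stand-in for the map_partitions step: every partition is sorted
    # independently on *all* columns (tuples compare lexicographically).
    return [sorted(part) for part in partitions]


rows = [(3, "b"), (1, "z"), (2, "a"), (3, "a"), (1, "a"), (2, "c")]
partitions = sort_partitions(rearrange_by_first_column(rows, divisions=[2, 3]))

# Concatenating the partitions in order gives the full multi-column sort,
# even though the first column was never sorted globally.
result = [row for part in partitions for row in part]
```

Note that the rows are only bucketed by the first column here; the ordering inside each bucket comes entirely from the per-partition sort.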

Wondering in that case if it makes sense to try and move this logic upstream, maybe with a kwarg for sort_values like sort_function that defaults to M.sort_values but can be replaced with custom functions like sort_partition_func to do sorting with enhanced functionality. One issue that I foresee is that this wouldn't work with multi-column divisions, as I don't think those would respect multiple ascending/na_position values. But maybe we could restrict to single-column divisions if a custom sorting function is passed?
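As an illustration of the "enhanced functionality" such a custom per-partition function might provide, here is a hedged pure-Python sketch of a sort that takes a separate direction and null position for each column. The name `sort_partition`, its signature, and the use of `None` for nulls are assumptions made for this example; this is not the actual sort_partition_func from dask-sql.

```python
def sort_partition(rows, ascending, nulls_first):
    """Sort tuples by every column with per-column direction and null placement.

    `None` stands in for a null value. Python's sort is stable, so sorting
    by the last key first and the first key last gives a lexicographic order.
    """
    out = list(rows)
    for col in reversed(range(len(ascending))):
        # Split off nulls so they can be placed independently of direction.
        nulls = [r for r in out if r[col] is None]
        values = [r for r in out if r[col] is not None]
        values.sort(key=lambda r: r[col], reverse=not ascending[col])
        out = nulls + values if nulls_first[col] else values + nulls
    return out


rows = [(1, None), (2, "x"), (1, "b"), (None, "a"), (1, "a")]

# First column ascending with nulls last; second column descending with
# nulls first.
sorted_rows = sort_partition(
    rows, ascending=[True, False], nulls_first=[False, True]
)
```

A function of this shape only needs to see one partition at a time, which is why it could in principle be swapped in for the default per-partition sort once the rows have been routed by a single-column division.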

cc @rjzamora if you have any thoughts on this

rjzamora · 2021-10-14T23:59:52Z

> cc @rjzamora if you have any thoughts on this

I agree that much of the dask_cudf sort_values logic should live in upstream dask.dataframe. However, I will need to think a bit more about the sort_function idea to provide any useful feedback.

Somewhat-related thoughts: One thing I like about the dask_cudf logic is that sort_values uses rearrange_by_column rather than rearrange_by_divisions. This makes the multi-column sorting code path a bit simpler, because we can use multiple columns to define the new (temporary) "_partitions" column, but then the rest of the logic is the same as a single-column sort. I also like that dask_cudf's set_index is directly based on sort_values (which is not the case in dask.dataframe).
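The "_partitions" idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the dask_cudf code: rows are dicts, divisions are tuples over the sort columns, and `assign_partitions` and `shuffle_and_sort` are hypothetical helper names.

```python
from bisect import bisect_right


def assign_partitions(rows, by, divisions):
    """Attach a temporary "_partitions" value computed from several columns."""
    tagged = []
    for row in rows:
        key = tuple(row[c] for c in by)
        tagged.append({**row, "_partitions": bisect_right(divisions, key)})
    return tagged


def shuffle_and_sort(rows, by, divisions):
    # Route rows by the temporary column; from here on the logic does not
    # care how many columns were used to compute it.
    buckets = [[] for _ in range(len(divisions) + 1)]
    for row in assign_partitions(rows, by, divisions):
        part = row.pop("_partitions")  # drop the temporary column
        buckets[part].append(row)
    return [sorted(b, key=lambda r: tuple(r[c] for c in by)) for b in buckets]


rows = [
    {"a": 2, "b": 1}, {"a": 1, "b": 9}, {"a": 2, "b": 7},
    {"a": 1, "b": 3}, {"a": 2, "b": 4}, {"a": 3, "b": 0},
]
# Tuple-valued divisions can split rows that share the same "a".
parts = shuffle_and_sort(rows, by=["a", "b"], divisions=[(2, 1), (2, 7)])
flat = [(r["a"], r["b"]) for p in parts for r in p]
```

Because the partition id is derived from the full key tuple, rows with the same value of "a" can be spread across partitions, which is something a single-column division cannot do.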

codecov-commenter · 2021-11-01T16:21:38Z

Codecov Report

Merging #255 (eb37f15) into main (ebdf4d5) will decrease coverage by 0.00%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #255      +/-   ##
==========================================
- Coverage   95.89%   95.89%   -0.01%     
==========================================
  Files          64       65       +1     
  Lines        2730     2800      +70     
  Branches      408      418      +10     
==========================================
+ Hits         2618     2685      +67     
- Misses         72       73       +1     
- Partials       40       42       +2

| Impacted Files | Coverage Δ | |
|---|---|---|
| dask_sql/physical/utils/sort.py | `83.33% <100.00%> (+0.57%)` | ⬆️ |
| dask_sql/datacontainer.py | `94.31% <0.00%> (-5.69%)` | ⬇️ |
| dask_sql/cmd.py | `100.00% <0.00%> (ø)` | |
| dask_sql/physical/rel/convert.py | `87.50% <0.00%> (ø)` | |
| dask_sql/physical/utils/groupby.py | `100.00% <0.00%> (ø)` | |
| dask_sql/physical/rel/logical/window.py | `98.81% <0.00%> (ø)` | |
| dask_sql/physical/rel/custom/__init__.py | `100.00% <0.00%> (ø)` | |
| dask_sql/physical/rel/custom/distributeby.py | `86.36% <0.00%> (ø)` | |
| dask_sql/context.py | `99.09% <0.00%> (+0.01%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ebdf4d5...eb37f15.

charlesbluca · 2021-11-04T14:24:46Z

Looks like the original fix I had in mind for dask/dask#8255 is actually needed here if we intend to use upstream Dask sorting functions (rearrange_by_divisions, set_partitions_pre) directly on a dask-cudf dataframe. However, I'm more interested in the sort_function approach described above - will explore that.

In any case, I'm pretty sure using sort_values instead of _sort_first_column for the initial sort is still beneficial performance-wise, so we can merge this now and make a follow-up depending on what direction we decide to move in.

charlesbluca · 2021-11-04T14:46:16Z

dask_sql/physical/utils/sort.py

            ascending=sort_ascending,
            na_position="first" if sort_null_first[0] else "last",
-        )
+        ).persist()


Noting that I added this persist call after sorting to match up with the other sort paths, which persist after sorting.

charlesbluca · 2021-11-04T14:46:23Z

dask_sql/physical/utils/sort.py

    ):
        try:
-            return df.sort_values(sort_columns, ignore_index=True)
+            return df.sort_values(sort_columns, ignore_index=True).persist()


Same as above

Use dask's sort_values for first column sort

3ecd710

charlesbluca changed the title ~~Use dask's sort_values for first column sorting in apply_sort~~ Use Dask's sort_values for first column sorting in apply_sort Oct 13, 2021

Trigger CI

b4df399

Make sure to persist after sorting in the GPU case

eb37f15

charlesbluca added 2 commits November 4, 2021 07:44

Merge remote-tracking branch 'upstream/main' into dask-sort-values

cc97d38

Add another missing persist call

e4850ab

charlesbluca marked this pull request as ready for review November 4, 2021 14:46

charlesbluca merged commit 5b8f8a9 into dask-contrib:main Nov 4, 2021

This was referenced Nov 8, 2021

[QST] How should we restrict Dask/Distributed dependencies for development/release? #302

Open

Bump dask pinning to 2021.10.0 #303

Merged

charlesbluca deleted the dask-sort-values branch January 19, 2022 21:23
