# Conflicts:
#   sparsity/test/test_dask_sparse_frame.py
Codecov Report
@@ Coverage Diff @@
## master #37 +/- ##
==========================================
+ Coverage 83.07% 85.64% +2.57%
==========================================
Files 6 7 +1
Lines 957 1094 +137
==========================================
+ Hits 795 937 +142
+ Misses 162 157 -5
def sort_index(df, npartitions=None, shuffle='tasks',
               drop=True, upsample=1.0, divisions=None,
               partition_size=128e6, **kwargs):
I see this is copied from dask almost 1:1, so I'll assume it's good.
def rearrange_by_index(df, npartitions=None, max_branch=None,
                       shuffle='task'):
    if shuffle == 'tasks':
The default is 'task', but the only supported method is 'tasks'.
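To make the mismatch concrete, here is a minimal sketch of one possible fix. It assumes the intent is for the default to take the supported path and for any other value to fail loudly; the rest of the real function body is not shown in the diff, so `rearrange_by_index_sketch` and its return value are hypothetical stand-ins.

```python
def rearrange_by_index_sketch(df, npartitions=None, max_branch=None,
                              shuffle='tasks'):  # default now matches the check below
    # Hypothetical stand-in: the real function would build the
    # task-based shuffle graph at this point.
    if shuffle == 'tasks':
        return ('tasks', df, npartitions, max_branch)
    # Assumption: unsupported methods should raise instead of
    # silently falling through.
    raise NotImplementedError("Unknown shuffle method %r" % shuffle)
```

With the default changed to 'tasks', a plain call takes the supported branch, while the old default 'task' would now raise.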
for stage in range(1, stages + 1):
    group = dict((('shuffle-group-' + token, stage, inp),
                  (shuffle_index, ('shuffle-join-' + token, stage - 1, inp),
This is one of the few differences from the dask version: you changed the shuffle_group function to shuffle_index. Maybe the key should be changed as well?
Hmm, I'll leave this for now, as I think the main thing this function does is group the partitions into new groups.
res2.sort_index(inplace=True)

pdt.assert_frame_equal(res, correct)
pdt.assert_frame_equal(res1, correct)
Hmm, I think you're right :) It wasn't from this PR, but I had actually missed it.
dsf = dsf.sort_index(npartitions='auto', partition_size=80000)

assert dsf.known_divisions
assert dsf.npartitions == 16
I don't know why this should be the desired result. The partition_size argument is not documented (I couldn't find documentation even in the dask repo). Could you please describe this test a bit and add at least a partition_size description to sort_index's docstring?
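For context, one plausible reading (an assumption, not confirmed anywhere in this PR) is that partition_size is a target number of bytes per output partition, and that npartitions='auto' derives the partition count from the total data size. The sketch below illustrates that interpretation only; `auto_npartitions` and the byte counts in the usage note are hypothetical, and the cap at the current partition count is likewise assumed.

```python
import math


def auto_npartitions(total_bytes, partition_size, current_npartitions):
    """Pick a partition count so each partition holds ~partition_size bytes.

    Assumption: the result is never below 1 and never above the
    number of partitions the frame already has.
    """
    wanted = max(math.ceil(total_bytes / partition_size), 1)
    return min(wanted, current_npartitions)
```

Under this reading, a frame of 1,280,000 bytes with partition_size=80000 and plenty of input partitions would yield 16 partitions, since 1,280,000 / 80,000 = 16. Whether the test data actually has that size would need to be stated in the test or the docstring.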
parts = [dsf.get_partition(i).compute().todense()
         for i in range(dsf.npartitions)]
res = pd.concat(parts, axis=0)
pdt.assert_frame_equal(res, correct)
No newline at end of file
Since we're already making changes... Could you please add a newline here? :P
# Conflicts:
#   sparsity/dask/core.py
#   sparsity/test/test_dask_sparse_frame.py
This PR implements sort_index, which sorts the distributed dataframe by its index and sets the divisions.
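As a rough illustration of what "sort and set divisions" means, the following is a simplified, in-memory sketch and not the PR's implementation. It models partitions as plain lists of (index, value) rows, assumes unique index values, and uses the hypothetical helper name `sort_partitions`. Divisions are taken to be the first index of each output partition plus the last index of the final partition.

```python
def sort_partitions(partitions, npartitions):
    """Globally sort (index, value) rows and split them into partitions.

    Returns the new partitions and their divisions. Assumes index
    values are unique, so no value straddles a partition boundary.
    """
    rows = sorted((row for part in partitions for row in part),
                  key=lambda row: row[0])
    if not rows:
        return [], ()
    size = -(-len(rows) // npartitions)  # ceiling division
    new_parts = [rows[i:i + size] for i in range(0, len(rows), size)]
    # Lower bound of every partition, then the upper bound of the last one.
    divisions = tuple(p[0][0] for p in new_parts) + (new_parts[-1][-1][0],)
    return new_parts, divisions


parts = [[(3, 'c'), (1, 'a')], [(4, 'd'), (2, 'b')]]
new_parts, divisions = sort_partitions(parts, npartitions=2)
```

Once the divisions are known, a lookup by index value can go straight to the partition whose range contains it, which is what the `known_divisions` assertion in the test checks for.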