Sort index by kayibal · Pull Request #37 · datarevenue-berlin/sparsity

kayibal · 2018-04-01T23:16:33Z

This implements sort_index, which sorts the distributed dataframe by its index and sets its divisions.
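
To illustrate what "sets its divisions" means, here is a minimal sketch in plain Python; it is not the code from this PR. It assumes the dask convention that divisions list the first index value of every partition followed by the last index value of the final partition. The helper name `sort_and_partition` and the equal-length chunking are hypothetical.

```python
def sort_and_partition(rows, npartitions):
    """Sort (index, value) rows by index and split them into partitions.

    Hypothetical helper for illustration only; assumes a non-empty list
    of rows with unique index values.
    """
    ordered = sorted(rows, key=lambda row: row[0])
    # Ceiling division so that every row lands in some partition.
    size = -(-len(ordered) // npartitions)
    parts = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    # Divisions: first index of each partition, then the last index overall.
    divisions = [part[0][0] for part in parts] + [parts[-1][-1][0]]
    return parts, divisions


rows = [(7, 'g'), (2, 'b'), (9, 'i'), (4, 'd'), (1, 'a'), (6, 'f')]
parts, divisions = sort_and_partition(rows, npartitions=3)
print(divisions)  # [1, 4, 7, 9]
```

With known divisions, a lookup by index value can go straight to the one partition whose range contains it instead of scanning all of them.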

codecov · 2018-04-01T23:17:58Z

Codecov Report

Merging #37 into master will increase coverage by 2.57%.
The diff coverage is 94.4%.

@@            Coverage Diff             @@
##           master      #37      +/-   ##
==========================================
+ Coverage   83.07%   85.64%   +2.57%     
==========================================
  Files           6        7       +1     
  Lines         957     1094     +137     
==========================================
+ Hits          795      937     +142     
+ Misses        162      157       -5

Impacted Files	Coverage Δ
sparsity/sparse_frame.py	`87.87% <100%> (+0.47%)`	⬆️
sparsity/dask/core.py	`82.56% <90.47%> (+3.58%)`	⬆️
sparsity/dask/shuffle.py	`95% <95%> (ø)`
sparsity/dask/io.py	`65.51% <0%> (-1.73%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 87f9928...0675bfa.

michcio1234

Cool, but I have a few small remarks.

michcio1234 · 2018-04-18T08:13:39Z

sparsity/dask/shuffle.py

+
+def sort_index(df, npartitions=None, shuffle='tasks',
+               drop=True, upsample=1.0, divisions=None,
+               partition_size=128e6, **kwargs):


I see this is copied from dask almost 1:1, so I'll assume it's good.

michcio1234 · 2018-04-18T08:27:04Z

sparsity/dask/shuffle.py

+
+def rearrange_by_index(df, npartitions=None, max_branch=None,
+                       shuffle='task'):
+    if shuffle == 'tasks':


The default is 'task', but the only supported method is 'tasks'.
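
The mismatch is easy to reproduce in isolation. The sketch below uses hypothetical stand-ins, `rearrange_stub` and `rearrange_fixed`, and assumes that any unsupported value raises `NotImplementedError`; the excerpt above does not show what the real function does in that branch.

```python
def rearrange_stub(shuffle='task'):
    """Stand-in mirroring the quirk: the default is 'task', the check is 'tasks'."""
    if shuffle == 'tasks':
        return 'shuffled'
    # Assumption: every other value is rejected.
    raise NotImplementedError("Unsupported shuffle method: %r" % shuffle)


def rearrange_fixed(shuffle='tasks'):
    """Same check, with the default aligned to the supported value."""
    if shuffle == 'tasks':
        return 'shuffled'
    raise NotImplementedError("Unsupported shuffle method: %r" % shuffle)


try:
    rearrange_stub()
    default_works = True
except NotImplementedError:
    default_works = False

print(default_works)      # False
print(rearrange_fixed())  # shuffled
```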

michcio1234 · 2018-04-18T08:35:38Z

sparsity/dask/shuffle.py

+
+    for stage in range(1, stages + 1):
+        group = dict((('shuffle-group-' + token, stage, inp),
+                      (shuffle_index, ('shuffle-join-' + token, stage - 1, inp),


This is one of the few differences from the dask version. You changed the shuffle_group function to shuffle_index. Maybe the key should also be changed?

Hmm, I'll leave this for now, as I think the main thing this function does is group the partitions into new groups.
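
A minimal sketch of why the key name and the function name can diverge: in a dask-style task graph, a key is only a label, and the first element of the task tuple is the callable that actually runs. The stub `shuffle_index_stub`, the token value, and the tiny `run` evaluator are all hypothetical and only stand in for the real function and scheduler.

```python
def shuffle_index_stub(rows):
    """Hypothetical stand-in: group (index, value) rows by index parity."""
    groups = {}
    for idx, value in rows:
        groups.setdefault(idx % 2, []).append((idx, value))
    return groups


token = 'abc123'  # hypothetical token
graph = {
    # Plain data stored under a key.
    ('shuffle-join-' + token, 0, 0): [(3, 'c'), (0, 'a'), (2, 'b')],
    # A task: (callable, argument); the argument is another key.
    ('shuffle-group-' + token, 1, 0): (
        shuffle_index_stub, ('shuffle-join-' + token, 0, 0)),
}


def run(graph, key):
    """Tiny recursive evaluator: resolve key arguments, then call the task."""
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, args = task[0], task[1:]
        resolved = [run(graph, a) if isinstance(a, tuple) and a in graph else a
                    for a in args]
        return func(*resolved)
    return task


result = run(graph, ('shuffle-group-' + token, 1, 0))
print(result)  # {1: [(3, 'c')], 0: [(0, 'a'), (2, 'b')]}
```

Renaming the key would change nothing about the computation, which is consistent with leaving it as it is.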

michcio1234 · 2018-04-18T08:48:19Z

sparsity/test/test_dask_sparse_frame.py

+        res2.sort_index(inplace=True)

-    pdt.assert_frame_equal(res, correct)
+        pdt.assert_frame_equal(res1, correct)


Missing assert for res2?

Hmm, I think you're right :) It was not from this PR, but I had actually missed it.
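
One way to make this kind of omission harder to repeat is to loop over all results so none can be skipped. This is only a sketch: plain sorted lists stand in for the frames, the helper `check_all` is hypothetical, and the real test would compare frames with pdt.assert_frame_equal instead of `==`.

```python
def check_all(results, correct):
    """Assert that every named result equals the expected value."""
    for name, res in results.items():
        assert res == correct, "%s differs from the expected result" % name
    return len(results)


correct = [('a', 3), ('b', 5)]
res1 = sorted([('b', 5), ('a', 3)])  # stand-in for the first result
res2 = sorted([('a', 3), ('b', 5)])  # stand-in for the second result
checked = check_all({'res1': res1, 'res2': res2}, correct)
print(checked)  # 2
```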

michcio1234 · 2018-04-18T09:23:24Z

sparsity/test/test_dask_sparse_frame.py

+    dsf = dsf.sort_index(npartitions='auto', partition_size=80000)
+
+    assert dsf.known_divisions
+    assert dsf.npartitions == 16


I don't see why this should be the desired result. The partition_size argument is not documented (I couldn't find documentation even in the dask repo). Could you please describe this test a bit and maybe add at least a partition_size description to sort_index's docstring?
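
For context on this question, here is a rough sketch of how an 'auto' partition count could follow from partition_size. It assumes the rule that dask's set_index appears to apply: total in-memory size divided by partition_size, rounded up, at least 1 and at most the current number of partitions. Whether this PR matches that rule exactly is not confirmed by the excerpt. The helper name `auto_npartitions` and the byte sizes are hypothetical, chosen only so that the result is 16.

```python
import math


def auto_npartitions(sizes, partition_size, current_npartitions):
    """Estimate the partition count from per-partition sizes in bytes.

    Assumed rule: ceil(total / partition_size), clamped to the range
    [1, current_npartitions].
    """
    total = sum(sizes)
    wanted = max(int(math.ceil(total / partition_size)), 1)
    return min(wanted, current_npartitions)


# Hypothetical input: 20 partitions of 62,500 bytes each (1,250,000 in total).
sizes = [62500] * 20
npartitions = auto_npartitions(sizes, partition_size=80000,
                               current_npartitions=20)
print(npartitions)  # 16
```

Under this assumption the expected count in a test depends on the measured size of the test data, so a docstring entry for partition_size would also make the assertion easier to follow.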

michcio1234 · 2018-04-18T09:24:32Z

sparsity/test/test_dask_sparse_frame.py

+    parts = [dsf.get_partition(i).compute().todense()
+             for i in range(dsf.npartitions)]
+    res = pd.concat(parts, axis=0)
+    pdt.assert_frame_equal(res, correct)


Since we're already making changes... Could you please add a newline here? :P

kayibal added 7 commits March 7, 2018 13:49

Implement distributed groupby sum and apply_concat_apply function for SparseFrame

58da78f

Merge branch 'master' into distributed-groupby-sum

51ae0a2

# Conflicts: # sparsity/test/test_dask_sparse_frame.py

update comments

3f3be49

add test for different index datatypes

63d4dc8

implement sort_index

df110c9

implement __len__

49cb182

add test for complex sort_index with automatic repartitioning

c64ec58

kayibal requested review from michcio1234 and vitords April 1, 2018 23:17

kayibal added 4 commits April 1, 2018 17:30

add test for get_partition

d2ce4f0

better tests for repartition

7ab1acb

improve groupby_sum tests

9ffd926

also test groupby shortcut

85f9379

michcio1234 requested changes Apr 18, 2018

kayibal added 4 commits April 19, 2018 17:30

Cosmetics and docs

3c55e69

fix missing assert in groupby test

5cb9027

Merge branch 'master' into sort_index

dc1d3e9

# Conflicts: # sparsity/dask/core.py # sparsity/test/test_dask_sparse_frame.py

2 newlines after imports

0675bfa

kayibal merged commit 4d85502 into master Apr 19, 2018

kayibal deleted the sort_index branch April 19, 2018 15:45

kayibal merged 15 commits into master from sort_index
