Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Aug 12, 2014

  1. skip partitionBy() when numOfPartition is 1
  2. use bisect_left (O(lg(N))) instread of loop (O(N)) in
    rangePartitioner

1. skip partitionBy() when numOfPartition is 1
2. use bisect_left (O(lg(N))) instread of loop (O(N)) in
rangePartitioner
@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA tests have started for PR 1898. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18349/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 12, 2014

QA results for PR 1898:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18349/consoleFull

@davies davies changed the title [SPARK-705] [PySpark] improve performance of sortByKey() [SPARK-2983] [PySpark] improve performance of sortByKey() Aug 12, 2014
@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA tests have started for PR 1898. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18401/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 13, 2014

QA results for PR 1898:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18401/consoleFull

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we have the flatMap(lambda x: x) before? Just want to make sure we're not removing something useful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I guess it's due to the yield -> return above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I have no idea why it's done in this way. I think it's not necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, yeah. It seems unnecessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be two unintended side effects of this change. This code used to work in pyspark:

sc.parallelize([5,3,4,2,1]).map(lambda x: (x,x)).sortByKey().take(1)

Now it failswith the error:

File "<...>/spark/python/pyspark/rdd.py", line 1023, in takeUpToNumLeft
    yield next(iterator)
TypeError: list object is not an iterator

Changing mapFunc and sort back to generators rather than regular functions fixes that problem.

After making that change, there is a second side effect due to the removal of flatMap where the above code returns the following unexpected result due to the default partitioning scheme:

[[(1, 1), (2, 2)]]

Removing sortByKey, e.g.:

sc.parallelize([5,3,4,2,1]).map(lambda x: (x,x)).take(1)

returns the expected result [(5, 5)]. Restoring the call to flatMap resolves this as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out.. sounds like we should look into this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@str-janus thanks, this will be fixed in #2045

@mateiz
Copy link
Contributor

mateiz commented Aug 13, 2014

Looks good to me; I'll merge it.

@mateiz
Copy link
Contributor

mateiz commented Aug 13, 2014

Merged into 1.1.

asfgit pushed a commit that referenced this pull request Aug 13, 2014
1. skip partitionBy() when numOfPartition is 1
2. use bisect_left (O(lg(N))) instread of loop (O(N)) in
rangePartitioner

Author: Davies Liu <[email protected]>

Closes #1898 from davies/sort and squashes the following commits:

0a9608b [Davies Liu] Merge branch 'master' into sort
1cf9565 [Davies Liu] improve performance of sortByKey()

(cherry picked from commit 434bea1)
Signed-off-by: Matei Zaharia <[email protected]>
@asfgit asfgit closed this in 434bea1 Aug 13, 2014
@davies davies deleted the sort branch August 15, 2014 17:27
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
1. skip partitionBy() when numOfPartition is 1
2. use bisect_left (O(lg(N))) instread of loop (O(N)) in
rangePartitioner

Author: Davies Liu <[email protected]>

Closes apache#1898 from davies/sort and squashes the following commits:

0a9608b [Davies Liu] Merge branch 'master' into sort
1cf9565 [Davies Liu] improve performance of sortByKey()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants