[SPARK-26549][PySpark] Fix for python worker reuse take no effect for parallelize lazy iterable range #23470
Conversation
Test build #100802 has finished for PR 23470 at commit
I haven't taken a look yet, but where's the difference between Python 2 and 3? Can you also explain why?
Thanks Wenchen, Liang-Chi and Hyukjin for your comments. The JIRA description has more details and the code I added before: https://issues.apache.org/jira/browse/SPARK-26549. Sorry for the mess, the bug is only for
Looks reasonable to me, cc @ueshin @BryanCutler
Test build #100888 has finished for PR 23470 at commit
Looks fine to me too.
LGTM, too.
Does this mean that the user could also map a function that doesn't consume the iterator and inadvertently cause the worker not to be reused? If so, should the fix be in
re: #23470 (comment) Yeah, I think so. I took a look at fixing the root cause, but it's going to be quite invasive from what I can see. Maybe there's another way I missed. So the current fix is like a band-aid fix, but I think it's good enough.
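The problem being discussed can be illustrated with a plain-Python analogue (this is a hypothetical sketch, not Spark's actual serializer code; `END_OF_STREAM`, `make_stream`, and `read_ints` are stand-ins invented for this demo): when the task function never touches its input iterator, the end-of-stream marker that follows the data is never read from the underlying stream, which is what breaks the reuse handshake in the real worker.

```python
import io
import struct

END_OF_STREAM = -1  # hypothetical sentinel, standing in for Spark's special lengths

def make_stream(values):
    """Pack ints as big-endian 4-byte words, followed by an end-of-stream sentinel."""
    buf = io.BytesIO()
    for v in values:
        buf.write(struct.pack(">i", v))
    buf.write(struct.pack(">i", END_OF_STREAM))
    buf.seek(0)
    return buf

def read_ints(stream):
    """Lazily yield ints until the end-of-stream sentinel is consumed."""
    while True:
        (v,) = struct.unpack(">i", stream.read(4))
        if v == END_OF_STREAM:
            return
        yield v

def task_ignoring_input(iterator):
    # Like the specialized xrange/range path: produce output without
    # ever touching the passed-in iterator.
    return [0, 1, 2]

stream = make_stream([10, 20, 30])
result = task_ignoring_input(read_ints(stream))
unread = len(stream.getvalue()) - stream.tell()
print(unread)  # 16: three ints plus the sentinel are still sitting unread in the stream
```

Because the generator returned by `read_ints` is never advanced, the stream position stays at zero and the sentinel is never consumed; the same shape of problem occurs for any user function that returns without draining its iterator.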
LGTM too, considering it's a quick band-aid fix.
Also, let's fix the PR description and title from xrange to lazy iterable range? Range in Python 3 is already a lazy iterable.
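The laziness mentioned above is easy to see in plain Python 3 (a small standalone demo, not Spark code):

```python
# Python 3's range is a lazy sequence, much like Python 2's xrange:
# values are computed on demand instead of being materialized.
big = range(10**12)    # instant; no trillion-element list is built
print(len(big))        # length is computed arithmetically
print(big[10**11])     # O(1) indexing, no iteration needed
print(500 in big)      # O(1) membership test for int arguments
print(list(range(3)))  # iterating still yields the values: [0, 1, 2]
```

This is why the specialized parallelize path could generate partition data from the range itself without ever touching the iterator handed to it.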
This is fine as a band-aid fix for use in
@HyukjinKwon Thanks for your comments and advice, all addressed.
HyukjinKwon
left a comment
LGTM
Test build #100950 has finished for PR 23470 at commit
Merged to master.
Thanks to all reviewers.
What changes were proposed in this pull request?
During the follow-up work (#23435) on the PySpark worker-reuse scenario, we found that worker reuse takes no effect for `sc.parallelize(xrange(...))`. This happens because the specialized `rdd.parallelize` logic for `xrange` (introduced in #3264) generates data from a lazy iterable range and never needs the passed-in iterator. Leaving that iterator unconsumed breaks the end-of-stream check in the Python worker, so worker reuse takes no effect. See the SPARK-26549 (https://issues.apache.org/jira/browse/SPARK-26549) description for more details. We fix this by forcing consumption of the passed-in iterator.
How was this patch tested?
New UT in test_worker.py.
Closes #23470 from xuanyuanking/SPARK-26549.
Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
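The shape of the fix ("force using the passed-in iterator") can be sketched roughly as follows. This is a minimal illustration, not the actual Spark patch: `split_index` and the per-split range arithmetic here are hypothetical stand-ins for the real partitioning logic in the specialized parallelize path.

```python
def f(split_index, iterator):
    # The input iterator is empty for this specialized path, but it must
    # still be drained so the end-of-stream marker is read from the
    # worker's input stream and the reuse handshake can complete.
    for _ in iterator:
        pass
    # Hypothetical per-partition range arithmetic for illustration only.
    start = split_index * 10
    stop = (split_index + 1) * 10
    return range(start, stop)

print(list(f(0, iter([]))))  # [0, 1, 2, ..., 9]
```

Before the fix, the function simply returned the lazy range, so the `for` loop (and with it the end-of-stream read) never happened.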