-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-27612][PYTHON] Use Python's default protocol instead of highest protocol #24519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @BryanCutler and @viirya This PR could not identify the root cause yet (but only my rough hypothesis). I would like to get this in first and then do the investigation separately. One concern about using default protocol is that it's not tested in cloudpickle. I opened a PR to add a env to control that and test it (cloudpipe/cloudpickle#265). |
| basestring = unicode = str | ||
| xrange = range | ||
| pickle_protocol = pickle.HIGHEST_PROTOCOL | ||
| pickle_protocol = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use pickle.DEFAULT_PROTOCOL too but let me stick with constant since seems protocol 4 has this bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks nice and solid. BTW, do you think we can have a pointer for the upstream bug issue against pickle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the bug issue related to the root cause somewhere? (I think) it's more like an issue within Pryolite library .. I am not 100% sure yet. I will update that when I have it. I am looking into this to identify the cause.
|
Test build #105096 has finished for PR 24519 at commit
|
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that this can be a temporary fix for the correctness issue. We can investigate it more for root cause after this.
|
The test failed? Can't default protocol fix this? |
|
Oh, #24519 (comment) was intended to be failed (1de0478). I wanted to make sure it's failed in Jenkins too. |
|
The last commit is still under testing~ |
|
Oh, don't see that. Thanks. |
|
Test build #105097 has finished for PR 24519 at commit
|
|
I can almost confirm that this is a bug in Pyrolite.
|
|
Not sure if there's quick fix because notable opcodes that appeared from both protocol 4 and current reproducer is only |
|
Merged to master. Thanks for review, @dongjoon-hyun and @viirya |
…ling ## What changes were proposed in this pull request? In SPARK-27612, one correctness issue was reported. When protocol 4 is used to pickle Python objects, we found that unpickled objects were wrong. A temporary fix was proposed by not using highest protocol. It was found that Opcodes.MEMOIZE was appeared in the opcodes in protocol 4. It is suspect to this issue. A deeper dive found that Opcodes.MEMOIZE stores objects into internal map of Unpickler object. We use single Unpickler object to unpickle serialized Python bytes. Stored objects intervenes next round of unpickling, if the map is not cleared. We has two options: 1. Continues to reuse Unpickler, but calls its close after each unpickling. 2. Not to reuse Unpickler and create new Unpickler object in each unpickling. This patch takes option 1. ## How was this patch tested? Passing the test added in SPARK-27612 (#24519). Closes #24521 from viirya/SPARK-27629. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
|
Hi, @HyukjinKwon . |
…ling In SPARK-27612, one correctness issue was reported. When protocol 4 is used to pickle Python objects, we found that unpickled objects were wrong. A temporary fix was proposed by not using highest protocol. It was found that Opcodes.MEMOIZE was appeared in the opcodes in protocol 4. It is suspect to this issue. A deeper dive found that Opcodes.MEMOIZE stores objects into internal map of Unpickler object. We use single Unpickler object to unpickle serialized Python bytes. Stored objects intervenes next round of unpickling, if the map is not cleared. We has two options: 1. Continues to reuse Unpickler, but calls its close after each unpickling. 2. Not to reuse Unpickler and create new Unpickler object in each unpickling. This patch takes option 1. Passing the test added in SPARK-27612 (apache#24519). Closes apache#24521 from viirya/SPARK-27629. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…ling In SPARK-27612, one correctness issue was reported. When protocol 4 is used to pickle Python objects, we found that unpickled objects were wrong. A temporary fix was proposed by not using highest protocol. It was found that Opcodes.MEMOIZE was appeared in the opcodes in protocol 4. It is suspect to this issue. A deeper dive found that Opcodes.MEMOIZE stores objects into internal map of Unpickler object. We use single Unpickler object to unpickle serialized Python bytes. Stored objects intervenes next round of unpickling, if the map is not cleared. We has two options: 1. Continues to reuse Unpickler, but calls its close after each unpickling. 2. Not to reuse Unpickler and create new Unpickler object in each unpickling. This patch takes option 1. Passing the test added in SPARK-27612 (apache#24519). Closes apache#24521 from viirya/SPARK-27629. Authored-by: Liang-Chi Hsieh <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
This PR partially reverts #20691
After we changed the Python protocol to highest ones, seems like it introduced a correctness bug. This potentially affects all Python related code paths.
I suspect a bug related to Pryolite (maybe opcodes
MEMOIZE,FRAMEand/or ourRowPickler). I would like to stick to default protocol for now and investigate the issue separately.I will separately investigate later to bring highest protocol back.
Update:
I can almost confirm that this is a bug in Pyrolite.
Python lists are chunked via
BatchedSerializeratparallelize(increateDataFrame), for instance, as below (each line is each batch):Here is opcodes for each batch (separated by
===..) with protocol 4:Details
Those batches become binary in an RDD at
parallelize- so far results look finerdd._to_java_object_rdd->SerDeUtil.pythonToJava->Unpickler.loadsHere, Pryolite
Unpickler.loadsconverts each Python object batch into Java objectsThe converted Java objects look a bit odd from Pryolite as below (see the last three lists):
Last three nested lists are ignored via
EvaluatePython.makeFromJavaand becomesnull->None.So the Jenkins test was failed at [SPARK-27612][PYTHON] Use Python's default protocol instead of highest protocol #24519 (comment) as below:
How was this patch tested?
Unittest was added.
./run-tests --python-executables=python3.7 --testname "pyspark.sql.tests.test_serde SerdeTests.test_int_array_serialization"