Skip to content
Closed
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion python/pyspark/broadcast.py
Original file line number Diff line number Diff line change
Expand Up @@ -110,7 +110,7 @@ def __init__(self, sc=None, value=None, pickle_registry=None, path=None,

def dump(self, value, f):
try:
pickle.dump(value, f, 2)
pickle.dump(value, f, pickle.HIGHEST_PROTOCOL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind I ask about the context? why we always use protocol 2 previously?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change related to upgrading cloudpickle?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yea. this PR was previously setting the protocol to highest one to support 4gb+ pickle alone in the regular pickle (not including cloudpickle).

So I suggested to target upgrade Cloudpickle because upper Cloudpickle has that change to use highest protocol even though upgrading Cloudpickle is slightly orthogonal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it should be great if we know the context about why it was set 2 previously. I suspect there's no particular reason but should be good to double check and leave the reason if it's able to find.

The highest pickle protocol will be 2 in Python 2 and 4 in Python 3.4+. So, we're changing it from 2 to 4 when Python 3.4+.

One possibility is that it was set to 2 for the worry about writing and reading even in different Python versions but I don't think that's not guranteed in PySpark. Maybe we should explicitly note this somewhere as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It happened here: 6cf5076#diff-bb67501acde415576c589b478e16c60aR82
Since then it never changed.
I agree that there was no particular reason for that since pickle.HIGHEST_PROTOCOL in Python 2 versions is 2 for ages, not 3 or 4. Using pickle.HIGHEST_PROTOCOL consistently should be safe for that reason.

except pickle.PickleError:
raise
except Exception as e:
Expand Down
Loading