Skip to content

Conversation

@mccheah
Copy link

@mccheah mccheah commented Mar 27, 2019

Implements the shuffle writer API by writing shuffle files to local disk and using the index block resolver to commit data and write index files.

The logic in BypassMergeSortShuffleWriter has been refactored to use the base implementation of the plugin instead.

APIs have been slightly renamed to clarify semantics after considering nuances in how these are to be implemented by other developers.

Follow-up commits are to come for SortShuffleWriter and UnsafeShuffleWriter.

Ported from bloomberg#6, credits to @ifilonenko.

ifilonenko and others added 2 commits March 26, 2019 18:37
…#6)

Implement default version of the API that would be used across all shuffle writers, writes to local disk.

Shuffle Writes [2/6] [3/6] [5/6]
@mccheah mccheah force-pushed the shuffle-writer-default-impl branch from 654cfeb to cb261e1 Compare March 27, 2019 01:46
@mccheah
Copy link
Author

mccheah commented Mar 27, 2019

@vanzin @squito for review.

mccheah added 3 commits March 27, 2019 12:06
- We always have to make a partition writer regardless if input spill files exist or not.
- Initialize executor components in SortShuffleManager
- Ensure we're always using the initialized executor components
Copy link

@vanzin vanzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't see anything that caught my attention, but I kinda skimmed the tests.

@yifeih
Copy link

yifeih commented Mar 29, 2019

Took a look through the changes since bloomberg#6, lgtm!

@yifeih yifeih changed the title Local shuffle implementation of the shuffle writer API [SPARK-25299] Local shuffle implementation of the shuffle writer API Apr 3, 2019
Copy link
Author

@mccheah mccheah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I'm going to merge this now - we've had some time vetting this back and forth on the Bloomberg fork. Furthermore having access to ShuffleExecutorComponents will allow the reader plugin to be a more complete patch in #523, and it also gives the locations API something to hook into in #517. We discussed in other forums that making the plugin implementation handle Spark's compression and encryption is impossible because of the way BypassMergeSortShuffleWriter is fundamentally built.

I'm going to merge this as is. Incremental improvements can be proposed here and can be added as follow-up patches moving forward. @felixcheung @vanzin @squito for SA.

Thanks @ifilonenko for working on this!

Allows more efficient writing of chunks.
@mccheah
Copy link
Author

mccheah commented Apr 3, 2019

On second thought I forgot about 9a2f4f2 =). @yifeih can you review and then approve, then bulldozer can merge?

@bulldozer-bot bulldozer-bot bot merged commit bc40da2 into spark-25299 Apr 3, 2019
@bulldozer-bot bulldozer-bot bot deleted the shuffle-writer-default-impl branch April 3, 2019 23:34
mccheah added a commit that referenced this pull request Jun 27, 2019
…524)

Implements the shuffle writer API by writing shuffle files to local disk and using the index block resolver to commit data and write index files.

The logic in `BypassMergeSortShuffleWriter` has been refactored to use the base implementation of the plugin instead.

APIs have been slightly renamed to clarify semantics after considering nuances in how these are to be implemented by other developers.

Follow-up commits are to come for `SortShuffleWriter` and `UnsafeShuffleWriter`.

Ported from bloomberg#6, credits to @ifilonenko.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants