-
Notifications
You must be signed in to change notification settings - Fork 313
Add source-layer shuffle to iceberg-source for correct and scalable C… #6682
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
lawofcycles
wants to merge
20
commits into
opensearch-project:main
Choose a base branch
from
lawofcycles:feature/iceberg-source-shuffle
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 9 commits
Commits
Show all changes
20 commits
Select commit
Hold shift + click to select a range
8e66ba7
Add source-layer shuffle to iceberg-source for correct and scalable C…
lawofcycles ce5de14
skip updating lastsnapshotId when shuffle failed
lawofcycles 77d1a5f
spotless apply
lawofcycles 22bf746
Add S3 certificate path support and ssl_insecure_disable_verification…
lawofcycles 0310bee
Fix coalesce to collect index from all nodes and extract ShuffleNodeC…
lawofcycles 7482b55
Add remote shuffle file cleanup via HTTP DELETE endpoint on all nodes
lawofcycles 7c04330
Fix shuffle write completion key race condition by creating GlobalSta…
lawofcycles 107ac2d
Fix shuffle Avro serialization to use Iceberg DataWriter/PlannedDataR…
lawofcycles 84858c6
Use common HTTP server infrastructure for shuffle
lawofcycles 1d1ede2
Replace fully qualified ByteBuffer with import in ShuffleHttpService
lawofcycles 18dc31b
Replace magic numbers with named constants in shuffle writer and reader
lawofcycles bb2b4ca
Add path traversal protection to LocalDiskShuffleStorage
lawofcycles 390fcfb
Add input validation to shuffle HTTP endpoints with tests
lawofcycles 73e6118
Improve error handling in shuffle HTTP endpoints
lawofcycles 963fef6
Support authentication plugin for shuffle HTTP server
lawofcycles 66c83b2
Make shuffle storage path configurable and fix flaky test
lawofcycles 291c8ed
Add retry limit to DynamoDB write loops in ChangelogWorker
lawofcycles 67ad064
Use toString-based hash for deterministic shuffle partitioning across…
lawofcycles 171b0b2
Isolate shuffle storage per node and preserve base directory on cleanup
lawofcycles b2c486c
fix import
lawofcycles File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
46 changes: 46 additions & 0 deletions
46
...earch/dataprepper/plugins/source/iceberg/coordination/partition/ShuffleReadPartition.java
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| /* | ||
| * Copyright OpenSearch Contributors | ||
| * SPDX-License-Identifier: Apache-2.0 | ||
| * | ||
| * The OpenSearch Contributors require contributions made to | ||
| * this file be licensed under the Apache-2.0 license or a | ||
| * compatible open source license. | ||
| * | ||
| */ | ||
|
|
||
| package org.opensearch.dataprepper.plugins.source.iceberg.coordination.partition; | ||
|
|
||
| import org.opensearch.dataprepper.model.source.coordinator.SourcePartitionStoreItem; | ||
| import org.opensearch.dataprepper.model.source.coordinator.enhanced.EnhancedSourcePartition; | ||
| import org.opensearch.dataprepper.plugins.source.iceberg.coordination.state.ShuffleReadProgressState; | ||
|
|
||
| import java.util.Optional; | ||
|
|
||
| public class ShuffleReadPartition extends EnhancedSourcePartition<ShuffleReadProgressState> { | ||
|
|
||
| public static final String PARTITION_TYPE = "SHUFFLE_READ"; | ||
|
|
||
| private final String partitionKey; | ||
| private final ShuffleReadProgressState state; | ||
|
|
||
| public ShuffleReadPartition(final String partitionKey, final ShuffleReadProgressState state) { | ||
| this.partitionKey = partitionKey; | ||
| this.state = state; | ||
| } | ||
|
|
||
| public ShuffleReadPartition(final SourcePartitionStoreItem sourcePartitionStoreItem) { | ||
| setSourcePartitionStoreItem(sourcePartitionStoreItem); | ||
| this.partitionKey = sourcePartitionStoreItem.getSourcePartitionKey(); | ||
| this.state = convertStringToPartitionProgressState(ShuffleReadProgressState.class, | ||
| sourcePartitionStoreItem.getPartitionProgressState()); | ||
| } | ||
|
|
||
| @Override | ||
| public String getPartitionType() { return PARTITION_TYPE; } | ||
|
|
||
| @Override | ||
| public String getPartitionKey() { return partitionKey; } | ||
|
|
||
| @Override | ||
| public Optional<ShuffleReadProgressState> getProgressState() { return Optional.of(state); } | ||
| } |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make this configurable. Also I wonder if we should have this nested in the
data-prepperdirectory to start. I think in our Docker deployment and perhaps for other installations via tar.gz, the user that Data Prepper runs under might not have access to this directory. We hit this once with service map, though I don't recall all the details.Did you test this using the Docker container built by Data Prepper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the shuffle storage path configurable via
storage_pathundershuffle. The default resolution order is: explicitstorage_pathsetting, then${data-prepper.dir}/data/shuffle(following the same pattern as the GeoIP processor), then${java.io.tmpdir}/data-prepper-shuffleas a fallback for test environments wheredata-prepper.diris not set.I have not tested with the official Docker image built by Data Prepper's release process. My multi-node verification used locally built Docker images. With
data-prepper.dir/data/shuffleas the default, the path is under Data Prepper's home directory where the running user has write access.