[HUDI-5327] Fix spark stages when using row writer #7374
Closed
I don't think this could be a reason for performance problems. Can you please elaborate what you're trying to achieve here?
cc @boneanxs
@alexeykudinkin
Currently, `collect` is used internally in bulk insert for `Dataset<Row>` when executing clustering. Because each clustering group is executed through `CompletableFuture.supplyAsync`, and the blocking `collect` ties up a thread of the default pool, the number of Spark jobs that can run simultaneously is limited to the number of CPU cores of the driver, which may cause a performance bottleneck.
In addition, `performClusteringWithRecordsRDD` does not have this problem, because it does not use `collect` internally, so I just want to keep their behavior consistent. See https://issues.apache.org/jira/browse/HUDI-5327, where I described the case I encountered.
cc @boneanxs
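Here is a minimal sketch of the bottleneck I mean (standalone, not Hudi's actual code; `clusteringGroups` and `runSparkJobFor` are hypothetical stand-ins). `CompletableFuture.supplyAsync` defaults to `ForkJoinPool.commonPool()`, whose parallelism is roughly the driver's CPU core count, so each blocking `collect` pins one pool thread for the whole duration of its Spark job:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class DriverBottleneckSketch {
  public static void main(String[] args) {
    // The default executor is ForkJoinPool.commonPool(), whose parallelism is
    // roughly availableProcessors() - 1 on the driver.
    System.out.println("common pool parallelism ~ "
        + (Runtime.getRuntime().availableProcessors() - 1));

    List<Integer> clusteringGroups = List.of(1, 2, 3, 4, 5, 6, 7, 8);

    // Each task blocks its pool thread for the full duration of the Spark job
    // (e.g. a collect()), so at most 'parallelism' jobs can run at once,
    // no matter how many executors the cluster has.
    List<CompletableFuture<String>> futures = clusteringGroups.stream()
        .map(group -> CompletableFuture.supplyAsync(() -> runSparkJobFor(group)))
        .collect(Collectors.toList());

    futures.forEach(CompletableFuture::join);
  }

  // Hypothetical stand-in for a blocking Spark action such as collectAsList().
  private static String runSparkJobFor(int group) {
    try {
      Thread.sleep(1000); // simulates the blocking collect
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return "group-" + group;
  }
}
```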
Hey @Zouxxyy, thanks for raising this issue! It's so nice to see you're trying this feature!
The reason to collect the data here is that `HoodieData<WriteStatus>` will be used multiple times after `performClustering`. I recall there is an `isEmpty` check that could take a lot of time (`validateWriteResult`), so here we directly convert to a list of `WriteStatus`, which reduces that time. For the second issue, I noticed this and raised a PR to fix it: #7343. Will that address your problem? Feel free to review it!
I think `performClusteringWithRecordsRDD` also has the same issue: for example, when using `RDDSpatialCurveSortPartitioner` to optimize the data layout, it will call `RDD.isEmpty`, which launches a new job.
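A standalone sketch of the difference (not the actual `validateWriteResult` code; plain strings stand in for `WriteStatus`): every `isEmpty()` on an RDD schedules a Spark job and may recompute the lineage, while a collected list can be checked for free afterwards:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;
import java.util.List;

public class IsEmptySketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[*]", "is-empty-sketch");
    JavaRDD<String> writeStatuses = jsc.parallelize(Arrays.asList("s1", "s2", "s3"));

    // Each isEmpty() call is an action: it schedules a (small) Spark job.
    boolean empty1 = writeStatuses.isEmpty(); // job #1
    boolean empty2 = writeStatuses.isEmpty(); // job #2, lineage recomputed

    // Collecting once moves the data to the driver; later checks cost nothing,
    // at the price of holding all WriteStatus objects in driver memory.
    List<String> collected = writeStatuses.collect(); // one job
    boolean empty3 = collected.isEmpty(); // no job
    boolean empty4 = collected.isEmpty(); // no job

    System.out.println(empty1 + " " + empty2 + " " + empty3 + " " + empty4);
    jsc.stop();
  }
}
```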
@boneanxs
Regarding the `isEmpty` check that could take a lot of time: I provided a PR to fix it, #7373, so maybe we don't need #7343.
Yes, we can fix this by directly using `getStat`, but what if `updateIndex` computes `writeStatusList` multiple times? If we directly dereference `RDD<WriteStatus>` to a list of `WriteStatus` at one feasible point (as `performClusteringWithRecordsAsRow` already does), we no longer need to worry about such issues.
As for the thread-pool parallelism causing the performance issue, I think `performClusteringWithRecordsRDD` has the same problem. Since we might call `partitioner.repartitionRecords`, a new job can also be launched inside the Future thread, for example at https://github.com/apache/hudi/blob/ea48a85efcf8e331d0cc105d426e830b8bfe5b37/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSpatialCurveSortPartitioner.java#L66 (checking whether the RDD is empty), or in the `sortBy` call inside `RDDCustomColumnsSortPartitioner` (`sortBy` uses `RangePartitioner`, which needs to sample the RDD first to decide the ranges, and that sampling also launches a job inside the Future).
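A standalone sketch of the `sortBy` point (illustrative, not `RDDCustomColumnsSortPartitioner` itself): `sortBy` builds a `RangePartitioner`, which samples the input to compute range boundaries, so a job is submitted eagerly at the transformation, before any action on the result:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class SortBySketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[*]", "sort-by-sketch");
    JavaRDD<Integer> records = jsc.parallelize(Arrays.asList(5, 3, 8, 1, 9, 2), 3);

    // sortBy uses a RangePartitioner under the hood. The partitioner samples
    // the input RDD to pick range boundaries, which submits a Spark job right
    // here, before any action is ever called on 'sorted'. If this line runs
    // inside a CompletableFuture thread, that thread has already launched a job.
    JavaRDD<Integer> sorted = records.sortBy(x -> x, true, 3);

    // The "real" job only runs at this action; the sampling job above was extra.
    System.out.println(sorted.collect());
    jsc.stop();
  }
}
```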
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@boneanxs
Regarding #7343, you are probably right; I overlooked that other operations may also generate a job. However, I'm wondering whether it's necessary to specifically set a parameter for this.
@boneanxs
For the RDD reuse problem, I think we should use `persist` (fixed in #7373) instead of using `collect` and creating a new RDD.
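Roughly what I mean, as a minimal sketch (simplified, not the actual #7373 change; plain strings stand in for `WriteStatus`): mark the RDD as persisted before its first action, so downstream reuses hit the cache instead of re-running the write pipeline:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import java.util.Arrays;

public class PersistSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[*]", "persist-sketch");
    JavaRDD<String> writeStatuses = jsc.parallelize(Arrays.asList("s1", "s2"));

    // Persist before the first action so subsequent uses read cached blocks
    // instead of recomputing the whole lineage.
    writeStatuses.persist(StorageLevel.MEMORY_AND_DISK());

    writeStatuses.isEmpty();   // first action: computes and caches
    writeStatuses.count();     // served from cache (if the blocks survived)

    writeStatuses.unpersist(); // release the cached blocks when done
    jsc.stop();
  }
}
```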
I'd very much appreciate it if you could review the PR and share your thoughts. Could you please explain more in that PR? :)
@Zouxxyy in this case we should actually not rely on `persist` as a way to avoid double execution, since persisting is essentially just a caching mechanism (re-using cached blocks on executors) and it can't be relied upon: it could fail at any point (for example, if one of the executors fails), forcing the whole RDD to be recomputed.
@alexeykudinkin, OK. `WriteStatus` has a large attribute, `writtenRecords`, so this is fine as long as `collect` does not cause an OOM on the driver.