-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-5012][HUDI-4921] Batch clean delete files retry #6890
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1f9a112 to
a6b6c78
Compare
|
hi @nsivabalan . thanks for this. once this get ready, we will check the performances improvement on real life data and give a feedback |
|
sure. sounds good. |
This patch makes use of batch call to get fileGroup to delete during cleaning instead of 1 call per partition. This limit the number of call to the view and should fix the trouble with metadata table in context of lot of partitions. Fixes issue apache#6373 Co-authored-by: sivabalan <[email protected]>
codope
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Please add UT and a couple of a minor comments.
| */ | ||
| Stream<HoodieFileGroup> getAllFileGroups(String partitionPath); | ||
|
|
||
| Stream<Pair<String, List<HoodieFileGroup>>> getAllFileGroups(List<String> partitionPaths); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do away with this api, doesn't add much value. Instead, we can directly use the getAllFileGroups(String partitionPath) api at the call site.
| } | ||
|
|
||
| @Override | ||
| public final Stream<Pair<String, List<HoodieFileGroup>>> getAllFileGroups(List<String> partitionPaths) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a separate api in the interface, let's extract the logic to a separate methos in CleanPlanner.
04b9377 to
1d57df4
Compare
|
while writing tests for this, realized that the lastCompletedCommit could have some bugs. So, gonna take time to fix that and will update the patch. |
|
@nsivabalan let's chat more on this. As was discussed in the original issue, the root-cause here isn't in the clean planning but in the fact that every request will be re-parsing MT. While refactoring makes sense, it's not addressing the real issue here and the ROI on doing it right now isn't as high IMO as some other items we're planning on fixing. Happy to chat more to align on this one. |
|
@nsivabalan @alexeykudinkin we tested the approach (mapPartition VS map) on a large table and sadly it does not speed up things. We still merge log file for each partition for each lookup. Sorry for that. That being said, we improved a bit the cleaning speed with MDT by tuning:
|
Change Logs
This patch has 2 fixes:
This is a re-attempt of [HUDI-4792] Batch clean files to delete #6580
This makes use of batch call to get fileGroup to delete during cleaning instead of 1 call per partition.
This limit the number of call to the view and should fix the perf hit in context of lot of partitions.
#fixes [SUPPORT] Incremental cleaning never used during insert #6373
Recently we added last completed commit timestamp to clean plan. But we missed to take into consideration multi-writers. So, have fixed the last completed commit to represent the last completed commit before any inflights in timeline.
Impact
Will improve clean planning latency for tables with large partitions.
Risk level: medium
Users reported latencies of 20 mins just for clean planning phase. This is expected to cut down the latency by a lot.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
N/A
Contributor's checklist