
Conversation

@nsivabalan (Contributor) commented Oct 8, 2022

Change Logs

This patch has 2 fixes:

  1. This is a re-attempt of [HUDI-4792] Batch clean files to delete #6580.
     It uses a single batched call to fetch the file groups to delete during cleaning, instead of one call per partition. This limits the number of calls to the file system view and should fix the performance hit for tables with a large number of partitions (see the sketch after this list).
     Fixes [SUPPORT] Incremental cleaning never used during insert #6373

  2. We recently added the last completed commit timestamp to the clean plan, but missed taking multi-writers into consideration. This is now fixed so that the stored timestamp represents the last completed commit before any inflight instants in the timeline.
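To make the idea concrete, here is a minimal, self-contained sketch of the batching change. The FileSystemView, HoodieFileGroup and Pair types below are simplified stand-ins, and the method names only mirror the shape of the change; they are not the exact Hudi signatures touched by this PR.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal sketch of the batching idea only; all types and names are illustrative stand-ins.
class BatchCleanLookupSketch {

  interface FileSystemView {
    // Per-partition lookup: one call (and potentially one metadata-table read) per partition.
    Stream<HoodieFileGroup> getAllFileGroups(String partitionPath);

    // Batched lookup: a single call covering all partitions to be cleaned.
    Stream<Pair<String, List<HoodieFileGroup>>> getAllFileGroups(List<String> partitionPaths);
  }

  static class HoodieFileGroup { }

  static class Pair<K, V> {
    private final K key;
    private final V value;
    Pair(K key, V value) { this.key = key; this.value = value; }
    K getKey() { return key; }
    V getValue() { return value; }
  }

  // Before: clean planning issues one view call per partition.
  static Map<String, List<HoodieFileGroup>> perPartitionLookup(FileSystemView view, List<String> partitions) {
    Map<String, List<HoodieFileGroup>> fileGroupsPerPartition = new HashMap<>();
    for (String partition : partitions) {
      fileGroupsPerPartition.put(partition, view.getAllFileGroups(partition).collect(Collectors.toList()));
    }
    return fileGroupsPerPartition;
  }

  // After: a single batched call, so the view is consulted once for all partitions.
  static Map<String, List<HoodieFileGroup>> batchedLookup(FileSystemView view, List<String> partitions) {
    return view.getAllFileGroups(partitions)
        .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
  }
}
```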

Impact

This will improve clean-planning latency for tables with a large number of partitions.

Risk level: medium

Users have reported latencies of around 20 minutes for the clean-planning phase alone; this change is expected to reduce that latency substantially.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan changed the title from "[WIP] Batch clean delete files retry" to "[HUDI-4878] Batch clean delete files retry" on Oct 11, 2022
@nsivabalan changed the title from "[HUDI-4878] Batch clean delete files retry" to "[HUDI-5012] Batch clean delete files retry" on Oct 11, 2022
@nsivabalan marked this pull request as ready for review on October 11, 2022 19:35
@nsivabalan added the priority:critical (Production degraded; pipelines stalled) label on Oct 11, 2022
@nsivabalan changed the title from "[HUDI-5012] Batch clean delete files retry" to "[HUDI-5012][HUDI-4921] Batch clean delete files retry" on Oct 11, 2022
@parisni (Contributor) commented Oct 14, 2022

Hi @nsivabalan, thanks for this. Once this is ready, we will check the performance improvement on real-life data and give feedback.

@nsivabalan (Contributor, Author)

Sure, sounds good.

parisni and others added 2 commits October 22, 2022 11:15
This patch uses a batched call to get the file groups to delete during cleaning, instead of one call per partition.
This limits the number of calls to the view and should fix the trouble with the metadata table for tables with a large number of partitions.
Fixes issue apache#6373

Co-authored-by: sivabalan <[email protected]>
@codope (Member) left a comment


Looks good. Please add a UT; I have a couple of minor comments.

*/
Stream<HoodieFileGroup> getAllFileGroups(String partitionPath);

Stream<Pair<String, List<HoodieFileGroup>>> getAllFileGroups(List<String> partitionPaths);

We can do away with this API; it doesn't add much value. Instead, we can directly use the getAllFileGroups(String partitionPath) API at the call site.

}

@Override
public final Stream<Pair<String, List<HoodieFileGroup>>> getAllFileGroups(List<String> partitionPaths) {

Instead of a separate API in the interface, let's extract the logic to a separate method in CleanPlanner.
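A rough sketch of what this suggestion could look like, assuming a hypothetical helper inside CleanPlanner that reuses the existing single-partition API; the view type and method names below are illustrative, not the actual Hudi code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of the suggestion above: keep the single-partition view API and do the grouping
// inside a hypothetical CleanPlanner helper instead of adding a new interface method.
class CleanPlannerSketch {

  static class HoodieFileGroup { }

  interface SliceView {
    // Existing per-partition API; no new List-based overload on the interface.
    Stream<HoodieFileGroup> getAllFileGroups(String partitionPath);
  }

  private final SliceView fileSystemView;

  CleanPlannerSketch(SliceView fileSystemView) {
    this.fileSystemView = fileSystemView;
  }

  // Hypothetical helper at the call site: build the partition -> file groups map here.
  Map<String, List<HoodieFileGroup>> getFileGroupsPerPartition(List<String> partitionPaths) {
    return partitionPaths.stream()
        .collect(Collectors.toMap(
            partition -> partition,
            partition -> fileSystemView.getAllFileGroups(partition).collect(Collectors.toList())));
  }
}
```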

@hudi-bot (Collaborator)

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan (Contributor, Author)

While writing tests for this, I realized that the lastCompletedCommit handling could have some bugs. It's going to take some time to fix that, and I will update the patch.

@codope added the status:in-progress (Work in progress) label on Nov 2, 2022
@alexeykudinkin (Contributor)

@nsivabalan let's chat more on this.

As was discussed in the original issue, the root cause here isn't in the clean planning but in the fact that every request re-parses the MT (metadata table). While the refactoring makes sense, it's not addressing the real issue here, and the ROI of doing it right now isn't as high IMO as some other items we're planning to fix. Happy to chat more to align on this one.

@parisni (Contributor) commented Nov 15, 2022

@nsivabalan @alexeykudinkin we tested the approach (mapPartition vs. map) on a large table and sadly it does not speed things up. We still merge log files for each partition on every lookup. Sorry for that.

That being said, we improved the cleaning speed with the MDT a bit by tuning the following (see the sketch after this list):

  • hoodie.metadata.compact.max.delta.commits: setting it to 1 before cleaning decreased the time by 50%
  • hoodie.metadata.enable.full.scan.log.files: keep the default (true); otherwise the time increased by 300%
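For illustration only, here is one way these two options could be passed through the Spark datasource writer; the table name, base path and the surrounding options are placeholders and are not part of this PR.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class MetadataTableTuningExample {
  // Placeholder write path showing where the two MDT tuning options from the comment above fit in.
  public static void writeWithMdtTuning(Dataset<Row> df, String basePath) {
    df.write()
        .format("hudi")
        .option("hoodie.table.name", "example_table")
        .option("hoodie.metadata.enable", "true")
        // Compact the metadata table more aggressively before cleaning (reported ~50% faster cleaning).
        .option("hoodie.metadata.compact.max.delta.commits", "1")
        // Keep the default full scan of log files; disabling it was reported ~3x slower.
        .option("hoodie.metadata.enable.full.scan.log.files", "true")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```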

@alexeykudinkin (Contributor) commented Nov 15, 2022

@parisni this issue (of re-parsing MT continuously) would be addressed by #6815

@nsivabalan added the release-0.12.2 (Patches targetted for 0.12.2) label on Dec 6, 2022
@codope removed the release-0.12.2 (Patches targetted for 0.12.2) label on Dec 7, 2022
@nsivabalan closed this on Feb 8, 2023