
Conversation

@nsivabalan (Contributor) commented Oct 8, 2022

Change Logs

This patch has 2 fixes:

  1. This is a re-attempt of [HUDI-4792] Batch clean files to delete #6580.
     It uses a single batched call to fetch the file groups to delete during cleaning, instead of one call per partition. This limits the number of calls to the file system view and should fix the performance hit for tables with a large number of partitions (see the sketch after this list).
     Fixes [SUPPORT] Incremental cleaning never used during insert #6373

  2. We recently added the last completed commit timestamp to the clean plan, but missed taking multi-writers into consideration. This is now fixed so that the stored timestamp represents the last completed commit before any inflight instants in the timeline.
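To make the idea concrete, here is a minimal, self-contained sketch of the batching change. The FileSystemView, HoodieFileGroup and Pair types below are simplified stand-ins, and the method names only mirror the shape of the change; they are not the exact Hudi signatures touched by this PR.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Minimal sketch of the batching idea only; all types and names are illustrative stand-ins.
class BatchCleanLookupSketch {

  interface FileSystemView {
    // Per-partition lookup: one call (and potentially one metadata-table read) per partition.
    Stream<HoodieFileGroup> getAllFileGroups(String partitionPath);

    // Batched lookup: a single call covering all partitions to be cleaned.
    Stream<Pair<String, List<HoodieFileGroup>>> getAllFileGroups(List<String> partitionPaths);
  }

  static class HoodieFileGroup { }

  static class Pair<K, V> {
    private final K key;
    private final V value;
    Pair(K key, V value) { this.key = key; this.value = value; }
    K getKey() { return key; }
    V getValue() { return value; }
  }

  // Before: clean planning issues one view call per partition.
  static Map<String, List<HoodieFileGroup>> perPartitionLookup(FileSystemView view, List<String> partitions) {
    Map<String, List<HoodieFileGroup>> fileGroupsPerPartition = new HashMap<>();
    for (String partition : partitions) {
      fileGroupsPerPartition.put(partition, view.getAllFileGroups(partition).collect(Collectors.toList()));
    }
    return fileGroupsPerPartition;
  }

  // After: a single batched call, so the view is consulted once for all partitions.
  static Map<String, List<HoodieFileGroup>> batchedLookup(FileSystemView view, List<String> partitions) {
    return view.getAllFileGroups(partitions)
        .collect(Collectors.toMap(Pair::getKey, Pair::getValue));
  }
}
```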

Impact

This will improve clean-planning latency for tables with a large number of partitions.

Risk level: medium

Users have reported latencies of around 20 minutes for the clean-planning phase alone; this change is expected to reduce that latency substantially.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan changed the title from "[WIP] Batch clean delete files retry" to "[HUDI-4878] Batch clean delete files retry" on Oct 11, 2022
@nsivabalan changed the title from "[HUDI-4878] Batch clean delete files retry" to "[HUDI-5012] Batch clean delete files retry" on Oct 11, 2022
@nsivabalan marked this pull request as ready for review on October 11, 2022 19:35
@nsivabalan added the priority:critical (Production degraded; pipelines stalled) label on Oct 11, 2022
@nsivabalan changed the title from "[HUDI-5012] Batch clean delete files retry" to "[HUDI-5012][HUDI-4921] Batch clean delete files retry" on Oct 11, 2022
@parisni (Contributor) commented Oct 14, 2022

Hi @nsivabalan, thanks for this. Once this is ready, we will check the performance improvement on real-life data and give feedback.

@nsivabalan (Contributor, Author)

Sure, sounds good.

parisni and others added 2 commits October 22, 2022 11:15
This patch uses a batched call to get the file groups to delete during cleaning, instead of one call per partition.
This limits the number of calls to the view and should fix the trouble with the metadata table for tables with a large number of partitions.
Fixes issue apache#6373

Co-authored-by: sivabalan <[email protected]>
@codope (Member) left a comment


Looks good. Please add a UT; I have a couple of minor comments.

*/
Stream<HoodieFileGroup> getAllFileGroups(String partitionPath);

Stream<Pair<String, List<HoodieFileGroup>>> getAllFileGroups(List<String> partitionPaths);

We can do away with this API; it doesn't add much value. Instead, we can directly use the getAllFileGroups(String partitionPath) API at the call site.

}

@Override
public final Stream<Pair<String, List<HoodieFileGroup>>> getAllFileGroups(List<String> partitionPaths) {

Instead of a separate API in the interface, let's extract the logic to a separate method in CleanPlanner.
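A rough sketch of what this suggestion could look like, assuming a hypothetical helper inside CleanPlanner that reuses the existing single-partition API; the view type and method names below are illustrative, not the actual Hudi code.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch of the suggestion above: keep the single-partition view API and do the grouping
// inside a hypothetical CleanPlanner helper instead of adding a new interface method.
class CleanPlannerSketch {

  static class HoodieFileGroup { }

  interface SliceView {
    // Existing per-partition API; no new List-based overload on the interface.
    Stream<HoodieFileGroup> getAllFileGroups(String partitionPath);
  }

  private final SliceView fileSystemView;

  CleanPlannerSketch(SliceView fileSystemView) {
    this.fileSystemView = fileSystemView;
  }

  // Hypothetical helper at the call site: build the partition -> file groups map here.
  Map<String, List<HoodieFileGroup>> getFileGroupsPerPartition(List<String> partitionPaths) {
    return partitionPaths.stream()
        .collect(Collectors.toMap(
            partition -> partition,
            partition -> fileSystemView.getAllFileGroups(partition).collect(Collectors.toList())));
  }
}
```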

@hudi-bot (Collaborator)

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan (Contributor, Author)

While writing tests for this, I realized that the lastCompletedCommit handling could have some bugs. It's going to take some time to fix that, and I will update the patch.

@codope added the status:in-progress (Work in progress) label on Nov 2, 2022
@alexeykudinkin (Contributor)

@nsivabalan let's chat more on this.

As was discussed in the original issue, the root cause here isn't in the clean planning but in the fact that every request re-parses the MT (metadata table). While the refactoring makes sense, it's not addressing the real issue here, and the ROI of doing it right now isn't as high IMO as some other items we're planning to fix. Happy to chat more to align on this one.

@parisni (Contributor) commented Nov 15, 2022

@nsivabalan @alexeykudinkin we tested the approach (mapPartition vs. map) on a large table and sadly it does not speed things up. We still merge log files for each partition on every lookup. Sorry for that.

That being said, we improved the cleaning speed with the MDT a bit by tuning the following (see the sketch after this list):

  • hoodie.metadata.compact.max.delta.commits: setting it to 1 before cleaning decreased the time by 50%
  • hoodie.metadata.enable.full.scan.log.files: keep the default (true); otherwise the time increased by 300%
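For illustration only, here is one way these two options could be passed through the Spark datasource writer; the table name, base path and the surrounding options are placeholders and are not part of this PR.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class MetadataTableTuningExample {
  // Placeholder write path showing where the two MDT tuning options from the comment above fit in.
  public static void writeWithMdtTuning(Dataset<Row> df, String basePath) {
    df.write()
        .format("hudi")
        .option("hoodie.table.name", "example_table")
        .option("hoodie.metadata.enable", "true")
        // Compact the metadata table more aggressively before cleaning (reported ~50% faster cleaning).
        .option("hoodie.metadata.compact.max.delta.commits", "1")
        // Keep the default full scan of log files; disabling it was reported ~3x slower.
        .option("hoodie.metadata.enable.full.scan.log.files", "true")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```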

@alexeykudinkin (Contributor) commented Nov 15, 2022

@parisni this issue (of re-parsing MT continuously) would be addressed by #6815

@nsivabalan added the release-0.12.2 (Patches targetted for 0.12.2) label on Dec 6, 2022
@codope removed the release-0.12.2 (Patches targetted for 0.12.2) label on Dec 7, 2022
@nsivabalan closed this on Feb 8, 2023