fix: fix record size estimation to reflect previous behavior #14039
Conversation
HoodieInstant instant = instants.next();
try {
  HoodieCommitMetadata commitMetadata = commitTimeline.readCommitMetadata(instant);
  final HoodieAtomicLongAccumulator totalBytesWritten = HoodieAtomicLongAccumulator.create();
Not sure why we need an accumulator here; we are processing all of this on the driver from what I can gauge.
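For reference, a minimal sketch of what this comment seems to be suggesting: since the commit metadata is iterated sequentially on the driver, a plain local counter would work just as well as an accumulator. The names below mirror the diff; the loop itself is illustrative, not the PR's code.

```java
// Illustrative only: plain driver-side summation instead of HoodieAtomicLongAccumulator,
// assuming the commit metadata is processed in a simple loop on the driver.
long totalBytesWritten = 0L;
long totalRecordsWritten = 0L;
for (HoodieWriteStat writeStat : commitMetadata.getWriteStats()) {
  totalBytesWritten += writeStat.getTotalWriteBytes();
  totalRecordsWritten += writeStat.getNumWrites();
}
long avgRecordSize = totalRecordsWritten > 0 ? totalBytesWritten / totalRecordsWritten : 0L;
```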
public long averageBytesPerRecord(HoodieTimeline commitTimeline, CommitMetadataSerDe commitMetadataSerDe) {
  int maxCommits = hoodieWriteConfig.getRecordSizeEstimatorMaxCommits();
  final AverageRecordSizeStats averageRecordSizeStats = new AverageRecordSizeStats(hoodieWriteConfig);
  final long commitSizeThreshold = (long) (hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit());
Isn't this the file slice threshold or single data file threshold?
Looks like it was a bug earlier, and we should fix it now.
#10763 — it seems it has always been the case that the threshold applies to the entire commit. Additionally, the config description is:
public static final ConfigProperty<String> RECORD_SIZE_ESTIMATION_THRESHOLD = ConfigProperty
.key("hoodie.record.size.estimation.threshold")
.defaultValue("1.0")
.markAdvanced()
.withDocumentation("We use the previous commits' metadata to calculate the estimated record size and use it "
+ " to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, "
+ " Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten "
+ " larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold)");
and the git blame is from 2021
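For clarity, here is a small sketch of the commit-level check that the config documentation describes. The wiring is assumed; the getter names come from the diff above and the fetch methods from HoodieCommitMetadata.

```java
// The threshold is computed once per estimation pass, not per data file.
long commitSizeThreshold =
    (long) (writeConfig.getRecordSizeEstimationThreshold() * writeConfig.getParquetSmallFileLimit());

// A commit only contributes to the estimate when the bytes written by the ENTIRE
// commit exceed the threshold, matching the documented behavior quoted above.
if (commitMetadata.fetchTotalBytesWritten() > commitSizeThreshold) {
  long avgRecordSize = commitMetadata.fetchTotalBytesWritten()
      / Math.max(1L, commitMetadata.fetchTotalRecordsWritten());
}
```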
I see.
      totalRecordsWritten.add(hoodieWriteStat.getNumWrites());
    });
  } else {
    totalBytesWritten.add(commitMetadata.fetchTotalBytesWritten() - (commitMetadata.fetchTotalFiles() * metadataSizeEstimate));
If we go with a per-file size threshold, then here we also need to loop over every writeStat.
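A hedged sketch of that per-writeStat variant follows; the per-file threshold and the per-file metadata subtraction are hypothetical, not what this PR implements.

```java
// Hypothetical per-file variant: subtract the metadata estimate and apply the
// threshold for each write stat individually instead of once per commit.
for (HoodieWriteStat writeStat : commitMetadata.getWriteStats()) {
  long fileBytes = writeStat.getTotalWriteBytes() - metadataSizeEstimate;
  if (fileBytes > perFileSizeThreshold) {   // hypothetical per-file threshold
    totalBytesWritten.add(fileBytes);
    totalRecordsWritten.add(writeStat.getNumWrites());
  }
}
```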
Describe the issue this Pull Request addresses
#13995
Summary and Changelog
Restore the previous record size estimation behavior while still honoring the newly added config for the metadata size to be subtracted. A rough sketch of the restored estimate is shown below.
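This sketch is based on the snippet quoted in the review above; the metadata size value is taken from the new config and its exact name is assumed.

```java
// Sketch: subtract an estimated metadata footprint per file from the commit's total
// bytes, then divide by the total record count to get the average record size.
long adjustedBytes = commitMetadata.fetchTotalBytesWritten()
    - commitMetadata.fetchTotalFiles() * metadataSizeEstimate;
long avgRecordSize = adjustedBytes / Math.max(1L, commitMetadata.fetchTotalRecordsWritten());
```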
Impact
more accurate record size estimate
Risk Level
low
Documentation Update
N/A
Contributor's checklist