
Conversation

@jonvex (Contributor) commented Oct 2, 2025

Describe the issue this Pull Request addresses

#13995

Summary and Changelog

Revert the record size estimation to the previous behavior while still honoring the newly added config that subtracts an estimated per-file metadata size from the total bytes written.
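
For context, a minimal sketch of the intended estimate once the per-file metadata size is subtracted (the helper below is illustrative only, not the exact Hudi method):

  // Illustrative sketch: subtract the estimated per-file metadata footprint before
  // dividing by the record count, so the estimate reflects the record payload size.
  static long estimateAverageBytesPerRecord(long totalBytesWritten, long totalFiles,
                                            long totalRecordsWritten, long metadataSizeEstimate) {
    long payloadBytes = totalBytesWritten - (totalFiles * metadataSizeEstimate);
    return totalRecordsWritten > 0 ? payloadBytes / totalRecordsWritten : 0L;
  }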

Impact

More accurate record size estimates.

Risk Level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions bot added the size:M (PR with lines of changes in (100, 300]) label on Oct 2, 2025
@hudi-bot (Collaborator) commented Oct 2, 2025

CI report:

  HoodieInstant instant = instants.next();
  try {
    HoodieCommitMetadata commitMetadata = commitTimeline.readCommitMetadata(instant);
    final HoodieAtomicLongAccumulator totalBytesWritten = HoodieAtomicLongAccumulator.create();
Contributor

Not sure why we need an accumulator here; from what I can gauge, we are processing all of this in the driver.
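
To illustrate the point: if everything runs on the driver, a plain local long is enough, and an accumulator only matters when the sum is built across executors (a sketch only, assuming the standard getWriteStats()/getTotalWriteBytes() accessors):

  // Sketch: driver-side summation with a plain long instead of an accumulator.
  long totalBytesWritten = 0L;
  for (HoodieWriteStat writeStat : commitMetadata.getWriteStats()) {
    totalBytesWritten += writeStat.getTotalWriteBytes();
  }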

  public long averageBytesPerRecord(HoodieTimeline commitTimeline, CommitMetadataSerDe commitMetadataSerDe) {
    int maxCommits = hoodieWriteConfig.getRecordSizeEstimatorMaxCommits();
    final AverageRecordSizeStats averageRecordSizeStats = new AverageRecordSizeStats(hoodieWriteConfig);
    final long commitSizeThreshold = (long) (hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit());
Contributor

Isn't this a file slice threshold or a single data file threshold?

It looks like it was a bug earlier, and we should fix it now.

Contributor Author

Per #10763, it seems it has always been the case that this threshold applies to the entire commit. Additionally, the config description is:

  public static final ConfigProperty<String> RECORD_SIZE_ESTIMATION_THRESHOLD = ConfigProperty
      .key("hoodie.record.size.estimation.threshold")
      .defaultValue("1.0")
      .markAdvanced()
      .withDocumentation("We use the previous commits' metadata to calculate the estimated record size and use it "
          + " to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, "
          + " Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten "
          + " larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold)");

and the git blame dates back to 2021.
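
For reference, the reverse-order search described in that documentation amounts to roughly the following (a simplified sketch; the variable names and surrounding method are assumptions, not the exact implementation):

  // Walk commits newest-to-oldest and use the first commit whose total bytes written
  // exceed PARQUET_SMALL_FILE_LIMIT_BYTES * RECORD_SIZE_ESTIMATION_THRESHOLD.
  long commitSizeThreshold = (long) (recordSizeEstimationThreshold * parquetSmallFileLimit);
  for (HoodieInstant instant : latestCommitsFirst) {  // hypothetical newest-first iteration
    HoodieCommitMetadata commitMetadata = commitTimeline.readCommitMetadata(instant);
    long totalBytesWritten = commitMetadata.fetchTotalBytesWritten();
    if (totalBytesWritten > commitSizeThreshold) {
      long totalRecords = commitMetadata.fetchTotalRecordsWritten();
      return totalRecords > 0 ? totalBytesWritten / totalRecords : fallbackEstimate;
    }
  }
  return fallbackEstimate;  // fallbackEstimate is a hypothetical default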

Contributor

I see.

      totalRecordsWritten.add(hoodieWriteStat.getNumWrites());
    });
  } else {
    totalBytesWritten.add(commitMetadata.fetchTotalBytesWritten() - (commitMetadata.fetchTotalFiles() * metadataSizeEstimate));
Contributor

If we go with a per-file size threshold, then here too we need to loop over every writeStat.
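
A rough sketch of the per-writeStat variant being suggested (hypothetical; the threshold name and the loop are not part of this PR):

  // Hypothetical per-file variant: apply the size threshold and the metadata-size
  // subtraction to each writeStat instead of once per commit.
  for (HoodieWriteStat writeStat : commitMetadata.getWriteStats()) {
    long fileBytes = writeStat.getTotalWriteBytes() - metadataSizeEstimate;
    if (fileBytes > fileSizeThreshold) {  // fileSizeThreshold is hypothetical
      totalBytesWritten.add(fileBytes);
      totalRecordsWritten.add(writeStat.getNumWrites());
    }
  }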

@nsivabalan merged commit 0148b0a into apache:master on Oct 3, 2025 (134 of 137 checks passed).