fix: fix record size estimation to reflect previous behavior #14039
Conversation
HoodieInstant instant = instants.next();
try {
  HoodieCommitMetadata commitMetadata = commitTimeline.readCommitMetadata(instant);
  final HoodieAtomicLongAccumulator totalBytesWritten = HoodieAtomicLongAccumulator.create();
Not sure why we need an accumulator here; we are processing all of this on the driver from what I can gauge.
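For reference, a minimal sketch of what this comment seems to be suggesting: since the commit metadata is iterated sequentially on the driver, a plain local counter would work just as well as an accumulator. The names below mirror the diff; the loop itself is illustrative, not the PR's code.

```java
// Illustrative only: plain driver-side summation instead of HoodieAtomicLongAccumulator,
// assuming the commit metadata is processed in a simple loop on the driver.
long totalBytesWritten = 0L;
long totalRecordsWritten = 0L;
for (HoodieWriteStat writeStat : commitMetadata.getWriteStats()) {
  totalBytesWritten += writeStat.getTotalWriteBytes();
  totalRecordsWritten += writeStat.getNumWrites();
}
long avgRecordSize = totalRecordsWritten > 0 ? totalBytesWritten / totalRecordsWritten : 0L;
```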
public long averageBytesPerRecord(HoodieTimeline commitTimeline, CommitMetadataSerDe commitMetadataSerDe) {
  int maxCommits = hoodieWriteConfig.getRecordSizeEstimatorMaxCommits();
  final AverageRecordSizeStats averageRecordSizeStats = new AverageRecordSizeStats(hoodieWriteConfig);
  final long commitSizeThreshold = (long) (hoodieWriteConfig.getRecordSizeEstimationThreshold() * hoodieWriteConfig.getParquetSmallFileLimit());
Isn't this the file slice threshold or single data file threshold?
Looks like it was a bug earlier, and we should fix it now.
#10763 — it seems it has always been the case that the threshold applies to the entire commit. Additionally, the config description is:
public static final ConfigProperty<String> RECORD_SIZE_ESTIMATION_THRESHOLD = ConfigProperty
.key("hoodie.record.size.estimation.threshold")
.defaultValue("1.0")
.markAdvanced()
.withDocumentation("We use the previous commits' metadata to calculate the estimated record size and use it "
+ " to bin pack records into partitions. If the previous commit is too small to make an accurate estimation, "
+ " Hudi will search commits in the reverse order, until we find a commit that has totalBytesWritten "
+ " larger than (PARQUET_SMALL_FILE_LIMIT_BYTES * this_threshold)");
and the git blame is from 2021
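For clarity, here is a small sketch of the commit-level check that the config documentation describes. The wiring is assumed; the getter names come from the diff above and the fetch methods from HoodieCommitMetadata.

```java
// The threshold is computed once per estimation pass, not per data file.
long commitSizeThreshold =
    (long) (writeConfig.getRecordSizeEstimationThreshold() * writeConfig.getParquetSmallFileLimit());

// A commit only contributes to the estimate when the bytes written by the ENTIRE
// commit exceed the threshold, matching the documented behavior quoted above.
if (commitMetadata.fetchTotalBytesWritten() > commitSizeThreshold) {
  long avgRecordSize = commitMetadata.fetchTotalBytesWritten()
      / Math.max(1L, commitMetadata.fetchTotalRecordsWritten());
}
```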
I see.
      totalRecordsWritten.add(hoodieWriteStat.getNumWrites());
    });
  } else {
    totalBytesWritten.add(commitMetadata.fetchTotalBytesWritten() - (commitMetadata.fetchTotalFiles() * metadataSizeEstimate));
If we go with a per-file size threshold, then here we also need to loop over every writeStat.
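A hedged sketch of that per-writeStat variant follows; the per-file threshold and the per-file metadata subtraction are hypothetical, not what this PR implements.

```java
// Hypothetical per-file variant: subtract the metadata estimate and apply the
// threshold for each write stat individually instead of once per commit.
for (HoodieWriteStat writeStat : commitMetadata.getWriteStats()) {
  long fileBytes = writeStat.getTotalWriteBytes() - metadataSizeEstimate;
  if (fileBytes > perFileSizeThreshold) {   // hypothetical per-file threshold
    totalBytesWritten.add(fileBytes);
    totalRecordsWritten.add(writeStat.getNumWrites());
  }
}
```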
Describe the issue this Pull Request addresses
#13995
Summary and Changelog
Restore the previous record size estimation behavior while still honoring the newly added config for the metadata size to be subtracted. A rough sketch of the restored estimate is shown below.
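This sketch is based on the snippet quoted in the review above; the metadata size value is taken from the new config and its exact name is assumed.

```java
// Sketch: subtract an estimated metadata footprint per file from the commit's total
// bytes, then divide by the total record count to get the average record size.
long adjustedBytes = commitMetadata.fetchTotalBytesWritten()
    - commitMetadata.fetchTotalFiles() * metadataSizeEstimate;
long avgRecordSize = adjustedBytes / Math.max(1L, commitMetadata.fetchTotalRecordsWritten());
```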
Impact
more accurate record size estimate
Risk Level
low
Documentation Update
N/A
Contributor's checklist