Improved seekable format ingestion speed for small frame size #3544
As reported by @P-E-Meunier in #2662 (comment), seekable format ingestion speed can be particularly slow when the selected `FRAME_SIZE` is very small, especially in combination with the recent `row_hash` compression mode.
The specific scenario mentioned was `pijul`, using frame sizes of 256 bytes and level 10.

This is improved in this PR by providing approximate parameter adaptation to the compression state.
Tested locally on an M1 laptop, ingestion of `enwik8` using `pijul` parameters went from 35 sec. (before this PR) to 2.5 sec. (with this PR).
For the specific corner case of a file full of zeroes, this is even more pronounced, going from 45 sec. to 0.5 sec.
The benefits remain perceptible for other small frame sizes, such as 4 KB, where the `enwik8` ingestion test improves from 3.6 sec. to 1.8 sec., on top of a small compression ratio gain.

These benefits are unrelated to (and come on top of) other improvement efforts currently being made by @yoniko for the `row_hash` compression method specifically.
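
To put the timings above in perspective, here is a small back-of-the-envelope sketch in C. It is not part of the PR: the helper names `frame_count` and `speedup` are hypothetical, and the only outside fact it relies on is that `enwik8` is 100,000,000 bytes long. It computes how many frames the seekable format has to emit for a given frame size, and the speedup ratios implied by the reported timings.

```c
#include <stdint.h>

/* Hypothetical helper (not a zstd API): number of frames needed to store
 * total_bytes when each frame holds at most frame_size bytes (rounds up). */
static uint64_t frame_count(uint64_t total_bytes, uint64_t frame_size)
{
    return (total_bytes + frame_size - 1) / frame_size;
}

/* Hypothetical helper: ratio between the time before and after the PR. */
static double speedup(double before_sec, double after_sec)
{
    return before_sec / after_sec;
}
```

With these helpers, `frame_count(100000000, 256)` gives 390,625 frames for `enwik8` at the `pijul` frame size, against 24,415 frames at 4 KB, so any fixed cost paid per frame is paid about 16 times more often in the 256-byte case. The reported timings correspond to `speedup(35.0, 2.5)`, i.e. 14x for `enwik8` with `pijul` parameters, 90x for the all-zeroes file, and 2x at 4 KB frames.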
The `seekable_compression` test program has also been updated to allow compression level setting, in order to produce these performance results.