improve BytesRefHash.sort performance by rearranging ids #15772
Merged
dweiss merged 6 commits into apache:main on Feb 28, 2026
Conversation
Rearranging the order of ids during compaction improves data access locality and reduces cache misses. This improves sort performance by about 20% in million-term tests.
dweiss approved these changes on Feb 26, 2026
dweiss (Contributor) left a comment:
This is brutally simple. I don't even remember why the compaction was so convoluted since we can't have any gaps and indeed the id sequence must be a simple range 0...size.
Contributor
Could you add a changes entry? Shall we apply this to 10x as well (so it should appear under the 10.5 section)?
Added a new entry to CHANGES.txt to document performance improvement in BytesRefHash.sort.
Contributor
Author
done
please take a look, thanks
dweiss approved these changes on Feb 28, 2026
dweiss added a commit that referenced this pull request on Feb 28, 2026
* improve BytesRefHash.sort performance

  Adjusting the data order in ids during compaction, which can improve data access continuity and reduce cache-misses. finally enhance sort performance by 20% in million-term tests

* Update changes.

  Added a new entry to CHANGES.txt to document performance improvement in BytesRefHash.sort.

* add comment in compact
* Fix comment typo in compact method
* update comment

---------

Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>
Description
In https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L179, after repeated calls to `add` and the final `compact`, the ids stored in `ids` end up in effectively random order. As a result, subsequent accesses to data in `BytesRefBlockPool` also become more random.

This negatively impacts operations such as `computeCommonPrefixLengthAndBuildHistogram` that need to access `BytesRefBlockPool`. The cache misses can be observed with the `async-profiler` tool.

Therefore, the ids can be rearranged in `compact`: `ids` stores the order in which items were added to `BytesRefBlockPool`, and that order is monotonically increasing. This improves data access locality and reduces cache misses.
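
To make the idea concrete, here is a minimal, self-contained sketch of the two compaction strategies. It is not the Lucene code: the class and method names (`CompactSketch`, `compactKeepingSlotOrder`, `compactSequential`) are hypothetical, and it assumes only what the discussion above states, namely that empty hash slots hold `-1` and that the live ids form the gap-free range 0..count-1.

```java
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical and this is not BytesRefHash itself.
class CompactSketch {

  // Slides non-empty slots to the front, keeping them in hash-table slot order.
  // The resulting id order follows the hash function, so it looks random.
  static int[] compactKeepingSlotOrder(int[] ids) {
    int[] out = ids.clone();
    int upto = 0;
    for (int i = 0; i < out.length; i++) {
      if (out[i] != -1) {
        out[upto++] = out[i]; // upto <= i, so no unread slot is overwritten
      }
    }
    Arrays.fill(out, upto, out.length, -1);
    return out;
  }

  // Assumes the live ids are exactly 0..count-1 with no gaps, so the compacted
  // prefix can simply be rewritten in insertion order.
  static int[] compactSequential(int tableSize, int count) {
    int[] out = new int[tableSize];
    for (int i = 0; i < count; i++) {
      out[i] = i;
    }
    Arrays.fill(out, count, tableSize, -1);
    return out;
  }

  public static void main(String[] args) {
    // A toy hash table with 8 slots holding 4 ids; -1 marks an empty slot.
    int[] table = {-1, 3, -1, 0, 2, -1, 1, -1};
    System.out.println(Arrays.toString(compactKeepingSlotOrder(table)));
    // prints [3, 0, 2, 1, -1, -1, -1, -1]
    System.out.println(Arrays.toString(compactSequential(table.length, 4)));
    // prints [0, 1, 2, 3, -1, -1, -1, -1]
  }
}
```

Both variants produce the same set of ids in the compacted prefix; only the order differs, and the sequential order matches the order in which the terms were appended to the pool.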
I wrote a very simple piece of code:
https://github.com/tyronecai/LuceneBytesRefSortBench/blob/master/src/main/java/com/demo/BytesRefHashSortBech.java
This code reads a series of terms from a log file, adds them to `BytesRefHash`, sorts them, and measures the latency.

In a sorting test with millions of terms, the performance improvement was approximately 20%.
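
The access-pattern effect itself can be explored without Lucene. The sketch below is a rough illustration, not the linked benchmark: `AccessOrderSketch` and its methods are hypothetical names, the "pool" is a plain `byte[]` with fixed-length terms, and the timings it prints depend heavily on the machine and JIT warm-up, so a proper harness such as JMH would be needed for trustworthy numbers.

```java
import java.util.Random;

// Illustrative sketch only; a flat byte[] stands in for the term pool.
class AccessOrderSketch {

  // Sums the first byte of every term, visiting terms in the given id order.
  static long sumFirstBytes(byte[] pool, int[] offsets, int[] order) {
    long sum = 0;
    for (int id : order) {
      sum += pool[offsets[id]];
    }
    return sum;
  }

  // Ids in insertion order: 0, 1, ..., n-1.
  static int[] sequentialOrder(int n) {
    int[] order = new int[n];
    for (int i = 0; i < n; i++) {
      order[i] = i;
    }
    return order;
  }

  // The same ids in a Fisher-Yates shuffled order, mimicking hash-slot order.
  static int[] shuffledOrder(int n, long seed) {
    int[] order = sequentialOrder(n);
    Random random = new Random(seed);
    for (int i = n - 1; i > 0; i--) {
      int j = random.nextInt(i + 1);
      int tmp = order[i];
      order[i] = order[j];
      order[j] = tmp;
    }
    return order;
  }

  public static void main(String[] args) {
    int n = 1_000_000;  // number of terms (assumed for illustration)
    int termLength = 16; // fixed term length (assumed for illustration)
    byte[] pool = new byte[n * termLength];
    new Random(42).nextBytes(pool);
    int[] offsets = new int[n];
    for (int i = 0; i < n; i++) {
      offsets[i] = i * termLength;
    }

    int[] sequential = sequentialOrder(n);
    int[] shuffled = shuffledOrder(n, 7L);

    long t0 = System.nanoTime();
    long a = sumFirstBytes(pool, offsets, sequential);
    long t1 = System.nanoTime();
    long b = sumFirstBytes(pool, offsets, shuffled);
    long t2 = System.nanoTime();

    // Both passes touch the same bytes, so the sums must match.
    System.out.println("sums equal: " + (a == b));
    System.out.println("sequential ns: " + (t1 - t0));
    System.out.println("shuffled ns:   " + (t2 - t1));
  }
}
```

Both passes read exactly the same bytes; only the visiting order changes, which is the variable the PR targets.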
Test Env
Test on 200MB log file
Test on some big log file
@mikemccand @dweiss Can you take a look here?