improve BytesRefHash.sort performance by rearranging ids by tyronecai · Pull Request #15772 · apache/lucene

tyronecai · 2026-02-26T11:37:13Z

Description

In BytesRefHash (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L179), after repeated calls to add and the final compact, the ids stored in ids end up effectively randomly ordered. Subsequent accesses to the data in BytesRefBlockPool are therefore scattered as well.

This negatively impacts operations like computeCommonPrefixLengthAndBuildHistogram that require access to BytesRefBlockPool. Cache misses can be observed using the async-profiler tool.

Therefore, the ids can be rearranged in compact. Each id records the order in which its term was added to BytesRefBlockPool, so the ids increase monotonically with position in the pool, and compact can lay them out in increasing order.

This improves data access locality and reduces cache misses.
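The idea can be illustrated with a minimal, self-contained sketch. This is not the actual Lucene code: the class name CompactSketch, both method names, and the use of -1 as the empty-slot marker are assumptions made for illustration. The first method keeps the live ids in hash-slot order; the second relies on the live ids being exactly 0..count-1 and writes them in increasing order.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a hash table's ids array; not Lucene's implementation. */
final class CompactSketch {

  /**
   * Slides the non-empty slots to the front while preserving slot order.
   * The resulting id order depends on the hash values and so looks random.
   * Returns the number of live entries.
   */
  static int compactPreservingSlotOrder(int[] ids) {
    int upto = 0;
    for (int i = 0; i < ids.length; i++) {
      if (ids[i] != -1) {
        ids[upto++] = ids[i];
      }
    }
    Arrays.fill(ids, upto, ids.length, -1);
    return upto;
  }

  /**
   * Assumes the live ids are exactly 0..count-1 (no gaps), so the compacted
   * prefix can be written directly in increasing order.
   */
  static int compactSequential(int[] ids, int count) {
    for (int i = 0; i < count; i++) {
      ids[i] = i;
    }
    Arrays.fill(ids, count, ids.length, -1);
    return count;
  }
}
```

Both variants leave the same set of ids in the first count slots; only their order differs, which is what changes the access pattern of whatever later walks the pool in that order.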

I wrote a very simple piece of code:

https://github.com/tyronecai/LuceneBytesRefSortBench/blob/master/src/main/java/com/demo/BytesRefHashSortBech.java

This code reads a series of terms from a log file, adds them to BytesRefHash, sorts them, and measures the sort latency.

In sorting tests with millions of terms, the improvement was roughly 18% to 25%.
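To look at the access-pattern effect in isolation, the following sketch uses only the JDK. It is not the linked benchmark and not Lucene code: LocalitySketch and all of its members are hypothetical names, the flat byte array only stands in for BytesRefBlockPool, and the histogram is a much simplified analogue of what computeCommonPrefixLengthAndBuildHistogram does. It assumes every term is non-empty.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

/** Hypothetical flat byte pool with per-term start offsets; illustration only. */
final class LocalitySketch {
  final byte[] pool;   // all term bytes, stored back to back in insertion order
  final int[] starts;  // starts[id] = offset of term id in pool; increases with id

  /** Assumes every term is non-empty. */
  LocalitySketch(String[] terms) {
    byte[][] encoded = new byte[terms.length][];
    starts = new int[terms.length];
    int total = 0;
    for (int id = 0; id < terms.length; id++) {
      encoded[id] = terms[id].getBytes(StandardCharsets.UTF_8);
      starts[id] = total;
      total += encoded[id].length;
    }
    pool = new byte[total];
    for (int id = 0; id < terms.length; id++) {
      System.arraycopy(encoded[id], 0, pool, starts[id], encoded[id].length);
    }
  }

  /** Counts terms by first byte, touching the pool in the order given by ids. */
  int[] firstByteHistogram(int[] ids) {
    int[] histogram = new int[256];
    for (int id : ids) {
      histogram[pool[starts[id]] & 0xFF]++;
    }
    return histogram;
  }

  /** Ids in insertion order: the pool is read front to back. */
  static int[] sequentialIds(int n) {
    int[] ids = new int[n];
    for (int i = 0; i < n; i++) {
      ids[i] = i;
    }
    return ids;
  }

  /** Ids in a seeded random order (Fisher-Yates): the pool is read at scattered offsets. */
  static int[] shuffledIds(int n, long seed) {
    int[] ids = sequentialIds(n);
    Random random = new Random(seed);
    for (int i = n - 1; i > 0; i--) {
      int j = random.nextInt(i + 1);
      int tmp = ids[i];
      ids[i] = ids[j];
      ids[j] = tmp;
    }
    return ids;
  }
}
```

The histogram is identical for both id orders; only the memory access pattern differs. Timing firstByteHistogram with System.nanoTime() for sequentialIds versus shuffledIds over a few million terms is one way to observe the difference, though the size of the gap will depend on the hardware and the data.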

Test Env

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          48 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   8
  On-line CPU(s) list:    0-7
Vendor ID:                AuthenticAMD
  Model name:             AMD Ryzen 7 9700X 8-Core Processor

$ java -version
openjdk version "21.0.9" 2025-10-21 LTS
OpenJDK Runtime Environment (build 21.0.9+15-LTS)
OpenJDK 64-Bit Server VM (build 21.0.9+15-LTS, mixed mode, sharing)

Test on a 200 MB log file

java -jar bench-1.0-SNAPSHOT-jar-with-dependencies.jar /data/source/lucene-10.3.2/xxx.log 100 false
sort 2097152 unique terms in 118.03 ms

java -jar bench-1.0-SNAPSHOT-jar-with-dependencies.jar /data/source/lucene-10.3.2/xxx.log 100 true
sort 2097152 unique terms in 96.23 ms

Improvement: (118.03 - 96.23) / 118.03 ≈ 0.18

Test on a larger log file

java -jar bench-1.0-SNAPSHOT-jar-with-dependencies.jar /data/source/lucene-10.3.2/xxx.txt 100 false
sort 33554432 unique terms in 4543.19 ms

java -jar bench-1.0-SNAPSHOT-jar-with-dependencies.jar /data/source/lucene-10.3.2/xxx.txt 100 true
sort 33554432 unique terms in 3385.94 ms

Improvement: (4543.19 - 3385.94) / 4543.19 ≈ 0.255

@mikemccand @dweiss Can you take a look here?

dweiss

This is brutally simple. I don't even remember why the compaction was so convoluted since we can't have any gaps and indeed the id sequence must be a simple range 0...size.

dweiss · 2026-02-26T20:33:42Z

Could you add a changes entry? Shall we apply this to 10x as well (so it should appear under 10.5 section).

tyronecai · 2026-02-27T00:56:43Z

Could you add a changes entry? Shall we apply this to 10x as well (so it should appear under 10.5 section).

Done:

* modified the review title
* added a comment in compact()
* added a changes entry

Please take a look, thanks.

* improve BytesRefHash.sort performance

  Adjust the order of the entries in ids during compaction to improve data access locality and reduce cache misses. This improves sort performance by about 20% in tests with millions of terms.

* Update changes.

  Added a new entry to CHANGES.txt to document the performance improvement in BytesRefHash.sort.

* add comment in compact
* Fix comment typo in compact method
* update comment

---------

Co-authored-by: Dawid Weiss <dawid.weiss@carrotsearch.com>

improve BytesRefHash.sort performance

c8d3b74

Adjust the order of the entries in ids during compaction to improve data access locality and reduce cache misses. This improves sort performance by about 20% in tests with millions of terms.

github-actions bot added the module:core/other label Feb 26, 2026

dweiss approved these changes Feb 26, 2026

dweiss added this to the 10.5.0 milestone Feb 26, 2026

tyronecai changed the title ~~improve BytesRefHash.sort performance by 20%~~ improve BytesRefHash.sort performance by rearranging ids Feb 27, 2026

tyronecai added 3 commits February 27, 2026 08:45

Update changes.

8b1ee31

Added a new entry to CHANGES.txt to document performance improvement in BytesRefHash.sort.

add comment in compact

f639564

Fix comment typo in compact method

6e55692

update comment

602135c

tyronecai mentioned this pull request Feb 27, 2026

Improve BytesRefHash.sort performance by retrieve byte directly from the pool. #15775

Closed

merge with main.

8dab310

dweiss approved these changes Feb 28, 2026

dweiss self-assigned this Feb 28, 2026

dweiss merged commit d48cf56 into apache:main Feb 28, 2026
13 checks passed

tyronecai mentioned this pull request Mar 4, 2026

Improve BytesRefHash.add performance by optimize rehash operation #15779

Open

dweiss merged 6 commits into apache:main from tyronecai:patch-1
