
Improve BytesRefHash.add performance by optimize rehash operation#15779

Open
tyronecai wants to merge 5 commits into apache:main from tyronecai:patch-4

Conversation

@tyronecai
Contributor

@tyronecai tyronecai commented Feb 28, 2026

Description

Rehashing performance is improved by creating a temporary int array in rehash to sequentially compute the hashcodes of all terms at once.

This improves memory locality, enhancing the performance of pool.hash computation, which in turn improves rehash performance, ultimately significantly improving add performance.

The impact is that a temporary int array consumes additional memory (4 bytes * count), potentially an extra 40MB for 10 million data points, which should be acceptable.
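The two-pass idea can be sketched independently of Lucene's internals. This is a minimal illustration, not the actual BytesRefHash code; the names and the use of Arrays.hashCode as a stand-in for pool.hash are assumptions:

```java
import java.util.Arrays;

public class PrecomputeRehashSketch {
  // Rebuild an open-addressed table of the given power-of-two size.
  static int[] rehash(byte[][] keys, int count, int newSize) {
    // Pass 1: sequential reads only -> good cache locality.
    int[] hashes = new int[count];           // transient 4 * count bytes
    for (int i = 0; i < count; i++) {
      hashes[i] = Arrays.hashCode(keys[i]);  // stand-in for pool.hash()
    }
    // Pass 2: random-access writes into the new table.
    int mask = newSize - 1;
    int[] ids = new int[newSize];
    Arrays.fill(ids, -1);
    for (int i = 0; i < count; i++) {
      int pos = hashes[i] & mask;
      while (ids[pos] != -1) {               // linear probe on collision
        pos = (pos + 1) & mask;
      }
      ids[pos] = i;
    }
    return ids;
  }
}
```

Splitting the work this way keeps each pass friendly to one access pattern (sequential reads, then random writes) instead of interleaving both.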

I also tested keeping a resident int[] hashcodes array in the BytesRefHash class, maintaining it during add and reusing it for the results of findHash. For large datasets this can slightly improve performance further (5-6%), but the memory overhead and maintenance cost seem uneconomical.

Thanks to OpenAI's Codex, I am able to validate my implementation very quickly, which is fantastic.

test code

private static void insert(List<BytesRef> testData, int round) {
    for (int r = 0; r < round; r++) {
      BytesRefHash hash = new BytesRefHash();
      int uniqueCount = 0;
      long start = System.nanoTime();
      for (BytesRef ref : testData) {
        int pos = hash.add(ref);
        if (pos >= 0) {
          uniqueCount += 1;
        }
      }
      long insertTimeNs = System.nanoTime() - start;
      System.out.printf(
          "Inserted %d terms in %.2f ms, unique term %d\n",
          testData.size(), insertTimeNs / 1_000_000.0, uniqueCount);
      System.out.printf(
          "rehashTimes %d, rehashTimeMs %d, calcHashTimeMs %d\n",
          hash.rehashTimes, hash.rehashTimeMs, hash.calcHashTimeMs);
    }
  }

BytesRefHash.java

  public long rehashTimes = 0L;
  public long rehashTimeMs = 0L;
  public long calcHashTimeMs = 0L;

  public int add(BytesRef bytes) {
    assert bytesStart != null : "Bytesstart is null - not initialized";
    final int hashcode = doHash(bytes.bytes, bytes.offset, bytes.length);
    // final position
    final int hashPos = findHash(bytes, hashcode);
    int e = ids[hashPos];

    if (e == -1) {
      // new entry
      if (count >= bytesStart.length) {
        bytesStart = bytesStartArray.grow();
        assert count < bytesStart.length + 1 : "count: " + count + " len: " + bytesStart.length;
      }
      bytesStart[count] = pool.addBytesRef(bytes);
      e = count++;
      assert ids[hashPos] == -1;
      ids[hashPos] = e | (hashcode & highMask);

      if (count == hashHalfSize) {
        rehashTimes += 1;                               // <-- 
        long start = System.nanoTime();         // <-- 
        rehash(2 * hashSize, true);
        rehashTimeMs += (System.nanoTime() - start) / 1_000_000;         // <-- 
      }
      return e;
    }
    return -((e & hashMask) + 1);
  }

test with large amounts (12585302 unique terms)

original (with large amounts of data, rehash took more than half the time of the entire insert operation)

round#32
Inserted 12585302 terms in 2270.57 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1503, calcHashTimeMs 0 
round#33
Inserted 12585302 terms in 2255.78 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1484, calcHashTimeMs 0
round#34
Inserted 12585302 terms in 2267.66 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1499, calcHashTimeMs 0
round#35
Inserted 12585302 terms in 2253.70 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 1481, calcHashTimeMs 0

with precompute

round#92
Inserted 12585302 terms in 1090.96 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 338, calcHashTimeMs 136
round#93
Inserted 12585302 terms in 1099.85 ms, unique term 12585302   <-- (2255.78 - 1099.85) / 2255.78 = 0.512  !!!
rehashTimes 21, rehashTimeMs 342, calcHashTimeMs 135
round#94
Inserted 12585302 terms in 1096.58 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 338, calcHashTimeMs 135
round#95
Inserted 12585302 terms in 1109.92 ms, unique term 12585302
rehashTimes 21, rehashTimeMs 342, calcHashTimeMs 135

test with medium amounts (915436 unique terms)

original (with medium amounts of data, the precompute effect is not as obvious)

round#41
Inserted 915436 terms in 54.25 ms, unique term 915436
rehashTimes 17, rehashTimeMs 25, calcHashTimeMs 0
round#42
Inserted 915436 terms in 54.95 ms, unique term 915436
rehashTimes 17, rehashTimeMs 25, calcHashTimeMs 0
round#43
Inserted 915436 terms in 54.56 ms, unique term 915436
rehashTimes 17, rehashTimeMs 25, calcHashTimeMs 0
round#44
Inserted 915436 terms in 56.28 ms, unique term 915436
rehashTimes 17, rehashTimeMs 25, calcHashTimeMs 0

with precompute

round#92
Inserted 915436 terms in 41.65 ms, unique term 915436
rehashTimes 17, rehashTimeMs 15, calcHashTimeMs 4
round#93
Inserted 915436 terms in 42.51 ms, unique term 915436    <-- (54.95 - 42.51) / 54.95 = 0.226
rehashTimes 17, rehashTimeMs 15, calcHashTimeMs 4
round#94
Inserted 915436 terms in 43.06 ms, unique term 915436
rehashTimes 17, rehashTimeMs 15, calcHashTimeMs 4

@dweiss @mikemccand please take a look and give some advice

Precompute all term hash values sequentially at once
which is much faster than out-of-order computation
@tyronecai tyronecai changed the title Optimize rehash in BytesRefHash, boost add performance Improve BytesRefHash.add performance by optimize rehash operation Feb 28, 2026
@github-actions github-actions bot added this to the 10.5.0 milestone Feb 28, 2026
Contributor

@dweiss dweiss left a comment


This doesn't seem harmful and is a known technique for leveraging cache locality. Seems ok with me.

@tyronecai
Contributor Author

This doesn't seem harmful and is a known technique for leveraging cache locality. Seems ok with me.

Hi, is there anything else I need to do for this review, or do I just need to wait for someone else to review it?

@mikemccand
Member

Thanks to OpenAI's Codex, I am able to validate my implementation very quickly, which is fantastic.

+1 -- these genai tools are amazing. Claude (Opus 4.6) helped me add ascii-art (well, Unicode) sparkle histograms to visualize smelly vectors in luceneutil's knnPerfTest.py (vector search benchmark).

Wow, ~51% net topline (time to insert N keys) speedup by pre-computing all hashes into up-front int[] during rehash!?

You are doing precisely the same amount of CPU work (1X hash computation for each BytesRef in the BytesRefHash), just doing it up front (this change) vs doing it interleaved along with the insert into the new larger hash table? I.e. we are not somehow saving further hash() calls done when stepping through collisions on insert.

I guess reading the key from random packed byte[] location is cache-painful. Similarly, writing into the bigger hash table is also cache-painful. But if you try to do both at once -> cache thrashing (the two fight with each other, greatly reducing cache hit %).

If you run with perf stat -ddd it'll probably show exactly this from its counters?

I asked genai to look at the PR and explain the speedup. Claude Opus 4.6 did well. So did Grok (Expert). Gemini (Thinking) was (surprisingly) not great -- it confusingly thought this PR introduced power-of-two hash table sizes (that is pre-existing), and that this PR switched from modulo math to bitmasks (also pre-existing).

Anyways, I love this change. How portable is it? If you use Lucene's aws-jmh benchmark infra across various CPU flavors, do the gains hold up? I expect on the nightly benchmarking box (beast3) this would be a big win -- it has four "chiplets" and inter-chiplet latency is much higher than within-chiplet, so I have to tell the OS which NUMA nodes to use, etc. Do you have the benchy source you ran -- I'll test on beast3.

But I'm worried about the surge in transient RAM. It's especially bad timing because we are already surging to 3X the current hash table in transience (1X current one, 2X being rehashed into) -- surge on surge. Could we change it to do that pre-computation in chunks?

Also: since BytesRefHash now uses the otherwise 0 leading bits of each int in int[] ids array to hold some of the hash code bits, couldn't we often avoid recomputing the full hash entirely? We know the lower bits of the hash (position in the current ids array), we know more bits from the recent opto, don't we have enough bits? We need just one more bit for the initial hash slot ... on linear probe it just increments...
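The opto being referenced can be illustrated with a small bit-packing sketch. This is illustrative, not Lucene's exact code; the method names are assumptions, but the packing expression mirrors the `ids[hashPos] = e | (hashcode & highMask)` line in add above:

```java
public class IdsPackingSketch {
  // Pack an entry id together with the high hash bits ("fingerprint")
  // into one int, as the ids[] array does.
  static int pack(int id, int hashcode, int hashMask) {
    int highMask = ~hashMask;            // e.g. hashMask=0xF -> highMask=0xFFFFFFF0
    return id | (hashcode & highMask);   // id must fit in the low bits
  }

  static int unpackId(int packed, int hashMask) {
    return packed & hashMask;            // recover the entry id
  }

  static int fingerprint(int packed, int hashMask) {
    return packed & ~hashMask;           // recover the stored high hash bits
  }
}
```

Since count stays below hashHalfSize, the id always fits under hashMask, so the two fields never collide.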

@mikemccand
Member

mikemccand commented Mar 3, 2026

Do you have the benchy source you ran -- I'll test on beast3.

Woops -- I see you already posted the code fragment for the benchy in your op -- I'll try to test on beast3 in between nightly benchy runs.

And thank you for posting benchy source up front -- it's great to share exactly what/how you ran along with any results.

@tyronecai
Contributor Author

Do you have the benchy source you ran -- I'll test on beast3.

Woops -- I see you already posted the code fragment for the benchy in your op -- I'll try to test on beast3 in between nightly benchy runs.

And thank you for posting benchy source up front -- it's great to share exactly what/how you ran along with any results.

I suddenly realized that rehashing is essentially a process of reconstructing ids from existing terms.

Therefore, I no longer need the previous ids; I can directly read terms sequentially from byteStarts + pool and place them in the appropriate positions within the new IDs.

Am I right?

This way, I only need 2X the memory, instead of 3X or the 4X after adding int hashcodes[]

The code is roughly as follows; I still need to confirm and test it further.

private void rehash(final int newSize, boolean hashOnData) {
    final int newMask = newSize - 1;
    final int newHighMask = ~newMask;
    bytesUsed.addAndGet(Integer.BYTES * (long) (newSize - ids.length));

    ids = new int[newSize];
    Arrays.fill(ids, -1);

    // rebuild ids from terms in pool
    for (int id = 0; id < count; id++) {
      final int hashcode;
      int code;
      if (hashOnData) {
        hashcode = code = pool.hash(bytesStart[id]);
      } else {
        code = bytesStart[id];
        hashcode = 0;
      }

      int hashPos = code & newMask;
      assert hashPos >= 0;

      // Conflict; use linear probe to find an open slot
      // (see LUCENE-5604):
      while (ids[hashPos] != -1) {
        code++;
        hashPos = code & newMask;
      }

      ids[hashPos] = id | (hashcode & newHighMask);
    }

    hashMask = newMask;
    highMask = newHighMask;
    hashSize = newSize;
    hashHalfSize = newSize / 2;
  }

@mikemccand
Member

Oooh I see, I think that should work? You discard the old hash map (ids[]) immediately (replace with 2X larger one), so surge is just 2X not 3X the prior array.

You should also be able to iterate via the byte[] blocks? They are just concatenated byte[] with prefix 1 or 2 byte vInt. Then the access would be entirely sequential --> CPU, caches, RAM happier? Maybe even single pass is OK even mixed with random-writes into new hash?

But, I still think a zero-hash impl should work too! (Using the opto that stuffs some of the hash bits into ids I linked above).

@mikemccand
Member

I tested the PR (pre-computed hashes) on a Raptor Lake i9-13900K, 192 GB RAM, Arch Linux.

I don't know what all the perf stats mean, but I see CPUs_utilized changed from 1.4 -> 1.7:

Before:

38092046 terms loaded
done shuffling
Inserted 38092046 terms in 12691.45 ms, unique term 38092046
Inserted 38092046 terms in 12688.31 ms, unique term 38092046
Inserted 38092046 terms in 12607.45 ms, unique term 38092046
Inserted 38092046 terms in 12537.87 ms, unique term 38092046

 Performance counter stats for '/usr/lib/jvm/java-25-openjdk/bin/java -cp .:lucene/core/build/classes/java/main25:lucene/core/build/classes/java/main BHT /lucenedata/enwiki/allterms-20110115.txt':

              8560      context-switches                 #    108.1 cs/sec  cs_per_second
               287      cpu-migrations                   #      3.6 migrations/sec  migrations_per_second
             33899      page-faults                      #    428.0 faults/sec  page_faults_per_second
          79211.49 msec task-clock                       #      1.4 CPUs  CPUs_utilized
        2273016325      cpu_core/L1-dcache-load-misses/  #      nan %  l1d_miss_rate            (29.09%)
        1541526215      cpu_core/LLC-loads/              #     73.2 %  llc_miss_rate            (13.84%)
        1405307756      cpu_core/branch-misses/          #      2.8 %  branch_miss_rate         (20.77%)
       49549979427      cpu_core/branches/               #    625.5 M/sec  branch_frequency     (27.68%)
      441832760975      cpu_core/cpu-cycles/             #      5.6 GHz  cycles_frequency       (34.59%)
      283012369585      cpu_core/instructions/           #      0.6 instructions  insn_per_cycle  (41.48%)
       83593449949      cpu_core/dTLB-loads/             #      0.1 %  dtlb_miss_rate           (48.33%)
          88342760      cpu_atom/L1-icache-load-misses/  #      0.7 %  l1i_miss_rate            (17.36%)
         128880274      cpu_atom/LLC-loads/              #      0.2 %  llc_miss_rate            (11.44%)
          85602625      cpu_atom/branch-misses/          #      1.0 %  branch_miss_rate         (8.54%)
        5128977191      cpu_atom/branches/               #     64.8 M/sec  branch_frequency     (13.66%)
       61038440913      cpu_atom/cpu-cycles/             #      0.8 GHz  cycles_frequency       (18.21%)
       34060988052      cpu_atom/instructions/           #      0.6 instructions  insn_per_cycle  (22.70%)
       11407392575      cpu_atom/dTLB-loads/             #      0.0 %  dtlb_miss_rate           (27.22%)
             TopdownL1 (cpu_core)                        #      8.5 %  tma_bad_speculation
                                                         #     12.2 %  tma_frontend_bound       (58.25%)
                                                         #     33.2 %  tma_backend_bound
                                                         #     46.0 %  tma_retiring             (58.25%)
             TopdownL1 (cpu_atom)                        #     81.9 %  tma_backend_bound        (27.06%)
                                                         #      4.2 %  tma_frontend_bound       (19.02%)
                                                         #     -6.3 %  tma_bad_speculation
                                                         #     20.3 %  tma_retiring             (17.47%)

      55.165335221 seconds time elapsed

      76.435898000 seconds user
       2.268276000 seconds sys

After:

8092046 terms loaded
done shuffling
Inserted 38092046 terms in 7715.29 ms, unique term 38092046
Inserted 38092046 terms in 7696.81 ms, unique term 38092046
Inserted 38092046 terms in 7704.62 ms, unique term 38092046
Inserted 38092046 terms in 7586.43 ms, unique term 38092046

 Performance counter stats for '/usr/lib/jvm/java-25-openjdk/bin/java -cp /l/trunk:lucene/core/build/classes/java/main25:lucene/core/build/classes/java/main BHT /lucenedata/enwiki/allterms-20110115.txt':

              8710      context-switches                 #    147.3 cs/sec  cs_per_second
               334      cpu-migrations                   #      5.6 migrations/sec  migrations_per_second
             34616      page-faults                      #    585.4 faults/sec  page_faults_per_second
          59128.21 msec task-clock                       #      1.7 CPUs  CPUs_utilized
        1561004563      cpu_core/L1-dcache-load-misses/  #      nan %  l1d_miss_rate            (27.12%)
         984009712      cpu_core/LLC-loads/              #     73.1 %  llc_miss_rate            (14.29%)
        1341577909      cpu_core/branch-misses/          #      2.8 %  branch_miss_rate         (21.76%)
       47205893532      cpu_core/branches/               #    798.4 M/sec  branch_frequency     (28.98%)
      299702122270      cpu_core/cpu-cycles/             #      5.1 GHz  cycles_frequency       (36.20%)
      274395763972      cpu_core/instructions/           #      0.9 instructions  insn_per_cycle  (43.41%)
       85722776251      cpu_core/dTLB-loads/             #      0.1 %  dtlb_miss_rate           (47.47%)
          61455978      cpu_atom/L1-icache-load-misses/  #      0.4 %  l1i_miss_rate            (11.69%)
         165263411      cpu_atom/LLC-loads/              #      0.6 %  llc_miss_rate            (8.66%)
         104379304      cpu_atom/branch-misses/          #      0.9 %  branch_miss_rate         (6.72%)
       12291571297      cpu_atom/branches/               #    207.9 M/sec  branch_frequency     (6.66%)
      123652422399      cpu_atom/cpu-cycles/             #      2.1 GHz  cycles_frequency       (8.85%)
       77079071643      cpu_atom/instructions/           #      0.6 instructions  insn_per_cycle  (11.01%)
       27617125715      cpu_atom/dTLB-loads/             #      0.0 %  dtlb_miss_rate           (11.78%)
             TopdownL1 (cpu_core)                        #      8.5 %  tma_bad_speculation
                                                         #     11.4 %  tma_frontend_bound       (54.24%)
                                                         #     36.5 %  tma_backend_bound
                                                         #     43.6 %  tma_retiring             (54.24%)
             TopdownL1 (cpu_atom)                        #     80.3 %  tma_backend_bound        (11.68%)
                                                         #      2.6 %  tma_frontend_bound       (11.73%)
                                                         #      4.5 %  tma_bad_speculation
                                                         #     12.5 %  tma_retiring             (11.75%)

      35.300626738 seconds time elapsed

      56.434061000 seconds user
       2.299231000 seconds sys

This is on latest Lucene main branch (#182ee9c4cc3bc52ace12e699248b750377a3aa2f) using your benchy (I just added code to load terms from a file one per line). I tested on an export of terms from Wikipedia en:

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefHash;

// /usr/lib/jvm/java-25-openjdk/bin/javac -cp lucene/core/build/classes/java/main25:lucene/core/build/classes/java/main BHT.java; perf stat -dd /usr/lib/jvm/java-25-openjdk/bin/java -cp .:lucene/core/build/classes/java/main25:lucene/core/build/classes/java/main BHT /lucenedata/enwiki/allterms-20110115.txt

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BHT {
  public static void main(String[] args) throws IOException {
    BytesRef[] terms = loadTerms(Paths.get(args[0]));
    for (int iter=0;iter<1;iter++) {
      insert(terms, 4);
    }
  }

  private static BytesRef[] loadTerms(Path path) throws IOException {

    final List<BytesRef> terms = new ArrayList<>();
    try (java.util.stream.Stream<String> lines = Files.lines(path)) {
      // Process each line as it is read                                                                                                                                                                                         
      lines.forEach(line -> {
          terms.add(new BytesRef(line.trim()));
        });
    }
    System.out.println(terms.size() + " terms loaded");
    Collections.shuffle(terms);
    System.out.println("done shuffling");
    return terms.toArray(new BytesRef[0]);
  }

  private static void insert(BytesRef[] testData, int round) {
    for (int r = 0; r < round; r++) {
      BytesRefHash hash = new BytesRefHash();
      int uniqueCount = 0;
      long start = System.nanoTime();
      for (BytesRef ref : testData) {
        int pos = hash.add(ref);
        if (pos >= 0) {
          uniqueCount += 1;
        }
      }
      long insertTimeNs = System.nanoTime() - start;
      System.out.printf(
          "Inserted %d terms in %.2f ms, unique term %d\n",
          testData.length, insertTimeNs / 1_000_000.0, uniqueCount);
      /*
      System.out.printf(
          "rehashTimes %d, rehashTimeMs %d, calcHashTimeMs %d\n",
          hash.rehashTimes, hash.rehashTimeMs, hash.calcHashTimeMs);
      */
    }
  }
}

@mikemccand
Member

I asked Claude Opus 4.6 to explain the results: https://claude.ai/share/79c14d62-39e1-4a83-b19e-430c563ac9a9

@tyronecai
Contributor Author

tyronecai commented Mar 4, 2026

But, I still think a zero-hash impl should work too! (Using the opto that stuffs some of the hash bits into ids I linked above).

Oooh I see, I think that should work? You discard the old hash map (ids[]) immediately (replace with 2X larger one), so surge is just 2X not 3X the prior array.

——— yes, I build the new ids from all terms pointed to by bytesStart[]

You should also be able to iterate via the byte[] blocks? They are just concatenated byte[] with prefix 1 or 2 byte vInt. Then the access would be entirely sequential --> CPU, caches, RAM happier? Maybe even single pass is OK even mixed with random-writes into new hash?

—— Yes, but this is actually the same as the access below.

for (int id = 0; id < count; id++) {
        hashcode = code = pool.hash(bytesStart[id]);

But, I still think a zero-hash impl should work too! (Using the opto that stuffs some of the hash bits into ids I linked above).

We know the lower bits of the hash (position in the current ids array), we know more bits from [the recent opto](https://github.com/apache/lucene/commit/2f66c8f6668622c0c82b720d47ccd43b57e32edb), don't we have enough bits? We need just one more bit for the initial hash slot ... on linear probe it just increments...

—— No, we don't have enough bits in the current code, because

position in the current ids array != the lower bits of the hashcode

According to (2f66c8f)

    hashSize = capacity;
    hashHalfSize = hashSize >> 1;
    hashMask = hashSize - 1;
    highMask = ~hashMask;

When capacity is 16, hashSize is 16, hashHalfSize is 8,
hashMask is 15 (0xf),
highMask is -16 (0xfffffff0)

ids[hashPos] = e | (hashcode & highMask);
The id contains the original id and the high 28 bits of the hashcode.

int hashPos = code & hashMask;
while (e != -1 ....) {
  code++;
  hashPos = code & hashMask;
  e = ids[hashPos];
}

When we look up the ids, we use the lower 4 bits of the hashcode, and linear probe for an unused location e.

We can completely change the logic of ids and store the lower N bits of the hashcode (currently the higher N bits) in id. This way, we don't need to recalculate the hashcode during rehashing.

I analyzed the benefits and potential problems using Codex.

The current design uses the low k bits for bucket location and stores the high bits for fingerprint, with the two pieces of information being largely independent.

If we change it to "store the low bits for fingerprint," the first k bits overlap with the bucket location, essentially wasting k bits of information.

This degradation occurs rapidly as hashSize increases. It will cause findHash to fall into pool.equals(...) more frequently.

• The current scheme has approximately 32-k effective fingerprint bits.

• The low-bit scheme effectively adds approximately 32-2k new information in collisions within the same bucket.

When k>=16 (hashSize>=65536), there are almost no effective fingerprints, and findHash will more frequently fall into pool.equals(...).
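The effective-fingerprint-bits argument above can be stated as a tiny calculation. This is a hypothetical model of the analysis, not code from the PR:

```java
public class FingerprintBitsDemo {
  // With table size 2^k, the bucket index already pins down the low k
  // hash bits for free.

  // Current scheme: store the high 32-k bits -> all are new information.
  static int highSchemeBits(int k) {
    return 32 - k;
  }

  // Low-bit scheme: the first k stored bits duplicate the bucket index,
  // leaving only 32 - 2k genuinely new fingerprint bits.
  static int lowSchemeBits(int k) {
    return Math.max(0, 32 - 2 * k);
  }
}
```

At k = 16 (hashSize = 65536) the low-bit scheme drops to zero effective fingerprint bits, matching the observation that findHash would then fall through to pool.equals(...) far more often.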

@tyronecai
Contributor Author

I asked Claude Opus 4.6 to explain the results: https://claude.ai/share/79c14d62-39e1-4a83-b19e-430c563ac9a9

CPUs_utilized changed from 1.4 -> 1.7

This seems to be the benefit of reduced cache misses.

@tyronecai
Contributor Author

Since all the term information is already stored in bytesStart + pool,
the ids array is simply a rearrangement of the entries in bytesStart based on their hash codes.

Therefore, the old ids array can be discarded entirely during rehashing and subsequent compaction.

So, in #15772, I modified the compaction process to discard the ids information.

Following this idea, I readjusted the rehash mechanism, improving performance and reducing memory consumption. Please review it again @mikemccand @dweiss

@mikemccand
I retested it on all the hardware I could find.
I extracted terms from some application logs (split on newlines and spaces), resulting in 2,282,163 unique terms. Test devices included:
My own Apple M1 Pro laptop, (259.36 ms VS 465.09 ms)
My AMD Ryzen 7 9700X desktop with similar results, (166.57 ms VS 335.15 ms)
A server equipped with an Arm Kunpeng 960 CPU, (731.54 ms VS 1310.54 ms)
A server equipped with an older Intel CPU (Intel(R) Xeon(R) Silver 4110), (980.57 ms VS 1611.37 ms)

The results were similar: hash.add completed 2,282,163 terms in about half the time it took before optimizing the rehash code.

List<BytesRef> testData = loadUniqueTermsFromFile(filename);
for (int i = 0; i < round; i++) {
  insert(testData);
}

private static void insert(List<BytesRef> testData) {
    BytesRefHash hash = new BytesRefHash();
    long start = System.nanoTime();
    int uniqueCount = 0;
    for (BytesRef ref : testData) {
      int pos = hash.add(ref);
      if (pos >= 0) {
        uniqueCount += 1;
      }
    }

    long insertTimeNs = System.nanoTime() - start;
    System.out.printf(
        "Inserted %d strings in %.2f ms, uniqueCount %d, %n",
        testData.size(), insertTimeNs / 1_000_000.0, uniqueCount);
  }

@mikemccand
Member

On the nightly benchy box (beast3, Ryzen Threadripper 3990X), before:

38092046 terms loaded
done shuffling
Inserted 38092046 terms in 20841.91 ms, unique term 38092046
Inserted 38092046 terms in 21000.09 ms, unique term 38092046
Inserted 38092046 terms in 21635.49 ms, unique term 38092046
Inserted 38092046 terms in 20560.37 ms, unique term 38092046

 Performance counter stats for '/usr/lib/jvm/java-25-openjdk/bin/java -cp .:lucene/core/build/classes/java/main25:lucene/core/build/classes/java/main BHT /lucenedata/enwiki/\
allterms-20110115.txt':

                 0      context-switches:u               #      0.0 cs/sec  cs_per_second
                 0      cpu-migrations:u                 #      0.0 migrations/sec  migrations_per_second
           554,477      page-faults:u                    #   3849.9 faults/sec  page_faults_per_second
        144,022.41 msec task-clock:u                     #      1.6 CPUs  CPUs_utilized
     4,099,191,298      L1-dcache-load-misses:u          #      3.5 %  l1d_miss_rate            (20.04%)
        17,137,351      L1-icache-load-misses:u          #      0.2 %  l1i_miss_rate            (20.02%)
     1,287,102,957      branch-misses:u                  #      3.0 %  branch_miss_rate         (20.00%)
    42,486,918,231      branches:u                       #    295.0 M/sec  branch_frequency     (20.00%)
   556,536,771,158      cpu-cycles:u                     #      3.9 GHz  cycles_frequency       (30.03%)
   246,271,809,438      instructions:u                   #      0.4 instructions  insn_per_cycle  (30.05%)
    18,978,848,529      stalled-cycles-frontend:u        #     0.03 frontend_cycles_idle        (20.04%)
     1,060,428,248      dTLB-loads:u                     #     27.1 %  dtlb_miss_rate           (20.08%)
           245,693      iTLB-loads:u                     #    132.8 %  itlb_miss_rate           (20.06%)

      90.699653506 seconds time elapsed

     132.112774000 seconds user
      12.021535000 seconds sys

After:

38092046 terms loaded
done shuffling
Inserted 38092046 terms in 11263.41 ms, unique term 38092046
Inserted 38092046 terms in 12925.52 ms, unique term 38092046
Inserted 38092046 terms in 12718.04 ms, unique term 38092046
Inserted 38092046 terms in 12635.16 ms, unique term 38092046

 Performance counter stats for '/usr/lib/jvm/java-25-openjdk/bin/java -cp .:lucene/core/build/classes/java/main25:lucene/core/build/classes/java/main BHT /lucenedata/enwiki/\
allterms-20110115.txt':

                 0      context-switches:u               #      0.0 cs/sec  cs_per_second
                 0      cpu-migrations:u                 #      0.0 migrations/sec  migrations_per_second
            41,869      page-faults:u                    #    365.2 faults/sec  page_faults_per_second
        114,640.36 msec task-clock:u                     #      2.1 CPUs  CPUs_utilized
     3,491,553,643      L1-dcache-load-misses:u          #      2.7 %  l1d_miss_rate            (20.06%)
        15,855,892      L1-icache-load-misses:u          #      0.2 %  l1i_miss_rate            (20.08%)
     1,271,522,632      branch-misses:u                  #      2.6 %  branch_miss_rate         (20.09%)
    48,708,599,021      branches:u                       #    424.9 M/sec  branch_frequency     (20.08%)
   430,787,025,878      cpu-cycles:u                     #      3.8 GHz  cycles_frequency       (30.10%)
   285,622,442,684      instructions:u                   #      0.7 instructions  insn_per_cycle  (30.06%)
    18,208,623,146      stalled-cycles-frontend:u        #     0.04 frontend_cycles_idle        (20.05%)
       621,501,564      dTLB-loads:u                     #      3.9 %  dtlb_miss_rate           (20.04%)
           272,769      iTLB-loads:u                     #     53.7 %  itlb_miss_rate           (20.03%)

      55.867613269 seconds time elapsed

     102.551131000 seconds user
      12.034597000 seconds sys

Nice! Note the amazing drop in dtlb_miss_rate, which I think is a cache the CPU keeps close for mapping virtual -> physical addresses. So the better locality pays off.

@mikemccand
Member

If we change it to "store the low bits for fingerprint," the first k bits overlap with the bucket location, essentially wasting k bits of information.

Wait -- we would not duplicate the hash bits in this approach? Bucket location is lower k bits, then store the next m lower bits (not overlapping with the k bits) in the high unused bits of ids (fingerprint)? Then we do not lose any hash bits (still 32-k bits used for fingerprint) and I think we can avoid recomputing hash of keys during rehash.

Really, during rehash, we just need one more bit (the lowest bit of the fingerprint) of each hash. It tells us whether the bucket location in the new table is the same spot (0 bit) in the bottom half of the new table, or the same spot in the "top half" (spot + hashTableSize/2).
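The one-extra-bit property of power-of-two doubling can be checked directly. This is a generic demonstration of the arithmetic, not Lucene code, and it describes the initial bucket only (entries displaced by linear probing sit elsewhere, as noted above):

```java
public class DoublingDemo {
  // When the table grows from oldSize to 2*oldSize (both powers of two),
  // each hash's new bucket is either its old bucket, or old bucket +
  // oldSize -- decided by exactly one newly-relevant hash bit.
  static int newBucket(int hash, int oldSize) {
    int oldPos = hash & (oldSize - 1);
    int extraBit = hash & oldSize;       // the single new bit (0 or oldSize)
    return oldPos + extraBit;            // equals hash & (2*oldSize - 1)
  }
}
```

The two masks select disjoint bits, so the sum is exactly the bucket a full recomputation of `hash & (2*oldSize - 1)` would give.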

@mikemccand
Member

Yes, but this is actually same as the access below.

Well, you're putting 2X pressure on the cache lines, right? (Both sequential.) 1) is stepping through the bytesStart array, 2) is reading from a different spot in virtual address space (the current 32 KB byte[] block holding the byte[] key for each entry).

Whereas if you only stepped through the pages directly, that's a single sequential read stream. But it adds a 1 or 2 byte vInt decode, yet that should be trivial for the CPU (almost always 1 byte; keys are < 128 bytes long the vast majority of the time).

I'm not sure which would be better! This mechanical / physics sympathy is hard for me to model/predict...

So these 1 or 2 sequential read streams then also wrestle with the random-access writes we do into the new hash ids array.
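For reference, this is the kind of variable-length length prefix being discussed. The encoding below is a generic vInt sketch (7 data bits per byte, continuation flag in the high bit), not necessarily Lucene's exact layout; it shows why lengths under 128 cost only a single extra byte per key on a sequential scan.

```java
public class VIntDemo {
  // Write value as a vInt: low 7 bits per byte, high bit set on all but the last byte.
  static int writeVInt(byte[] buf, int pos, int value) {
    while ((value & ~0x7F) != 0) {
      buf[pos++] = (byte) ((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    buf[pos++] = (byte) value;
    return pos;
  }

  // Decode a vInt; returns {value, nextPos}.
  static int[] readVInt(byte[] buf, int pos) {
    int value = 0, shift = 0;
    byte b;
    do {
      b = buf[pos++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return new int[] {value, pos};
  }

  public static void main(String[] args) {
    byte[] buf = new byte[8];
    for (int len : new int[] {0, 5, 127, 128, 16383}) {
      int end = writeVInt(buf, 0, len);
      int[] r = readVInt(buf, 0);
      if (r[0] != len || r[1] != end) {
        throw new AssertionError("round-trip failed for " + len);
      }
      System.out.println(len + " -> " + end + " byte(s)");
    }
  }
}
```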

@tyronecai
Contributor Author

If we change it to "store the low bits for fingerprint," the first k bits overlap with the bucket location, essentially wasting k bits of information.

Wait -- we would not duplicate the hash bits in this approach? Bucket location is lower k bits, then store the next m lower bits (not overlapping with the k bits) in the high unused bits of ids (fingerprint)? Then we do not lose any hash bits (still 32-k bits used for fingerprint) and I think we can avoid recomputing hash of keys during rehash.

Really, during rehash, we just need one more bit (the lowest bit of the fingerprint) of each hash. It tells us whether bucket location in the new table is the same spot (0 bit) in bottom half of the new table, or the same spot in the "top half" (spot + hashTableSize/2).

Let me understand what you're saying.

Whereas if you only stepped through the pages directly that's a single sequential read stream. But, it's an added 1 or 2 byte vInt decode, yet, that if should be trivial for CPU (almost always 1 byte, keys < 128 length vast majority of time).

Directly iterating through the data in the pool within BytesRefHash doesn't feel quite right, even though it does reduce one access to bytesStart.

I still need to test the effect of this change.

@mikemccand
Member

Sorry for all the ideas -- we can also pinch & ship what you already created -- it looks like an amazing win as is -- PNP! ("progress not perfection").

I'm curious if this is needle moving for indexing overall? We can wait and see what nightly benchy thinks, after we merge this.

I had another realization: we say the ids array stores the fingerprint (high bits) of the key's hash code. Since these ids are actually ordinals (assigned as 0, 1, 2, 3, ... as we see each unique BytesRef being added), we have many free 0 high bits to use for fingerprint.

But I think it's not actually a fingerprint? I think it is the entire hash code (when added to the position in the ids array, the lower hash bits)? We shouldn't ever need to hash() on something already in the table (just the incoming key being added/get'd)? There must always be enough free 0 high bits in the ids, since they are compact ordinals, and we grow the hash table (which stores the lower k bits of hash, then k+1 bits on rehash, ...) well before the id value gets to the hash table size?

@tyronecai
Contributor Author

Also: since BytesRefHash now uses the otherwise 0 leading bits of each int in int[] ids array to hold some of the hash code bits, couldn't we often avoid recomputing the full hash entirely? We know the lower bits of the hash (position in the current ids array), we know more bits from the recent opto, don't we have enough bits? We need just one more bit for the initial hash slot ... on linear probe it just increments...

:)

Let me take a look.

@tyronecai
Contributor Author

tyronecai commented Mar 4, 2026

Sorry for all the ideas -- we can also pinch & ship what you already created -- it looks like an amazing win as is -- PNP! ("progress not perfection").

I'm curious if this is needle moving for indexing overall? We can wait and see what nightly benchy thinks, after we merge this.

I had another realization: we say the ids array stores the fingerprint (high bits) of the key's hash code. Since these ids are actually ordinals (assigned as 0, 1, 2, 3, ... as we see each unique BytesRef being added), we have many free 0 high bits to use for fingerprint.

But I think it's not actually a fingerprint? I think it is the entire hash code (when added to the position in the ids array, the lower hash bits)? We shouldn't ever need to hash() on something already in the table (just the incoming key being added/get'd)? There must always be enough free 0 high bits in the ids, since they are compact ordinals, and we grow the hash table (which stores the lower k bits of hash, then k+1 bits on rehash, ...) well before the id value gets to the hash table size?

I encountered several issues while trying to write the code:

  1. BytesRefBlockPool lacks an API for iteration.

  2. BytesRefBlockPool doesn't provide enough information to tell how many terms are in each byte[] block. The count can only be derived by walking the vInt length prefixes, but this can be inaccurate: if there is unused space at the end of a byte[] block, the data will be all 0s, which decodes as multiple length-0 terms that may not actually exist.

  3. As I mentioned before, we don't actually have the low-order bits of the hashcode: the old slot i is often the position after probing, not the original hash & oldMask, so the "probe displacement" information is lost, making it impossible to reliably deduce that 1 bit (or even more low-order bits). Am I understanding correctly?
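Point 3 can be demonstrated with a toy table (hash values and table size here are invented for illustration): once linear probing displaces an entry, its final slot no longer equals hash & mask, so the slot alone cannot recover the low hash bits.

```java
import java.util.Arrays;

public class ProbeDisplacementDemo {
  public static void main(String[] args) {
    int mask = 7; // 8-slot toy table
    int[] ids = new int[8];
    Arrays.fill(ids, -1); // -1 marks an empty slot
    int[] hashes = {3, 3, 3}; // three keys all hashing to bucket 3
    for (int id = 0; id < hashes.length; id++) {
      int slot = hashes[id] & mask;
      while (ids[slot] != -1) {
        slot = (slot + 1) & mask; // linear probe on collision
      }
      ids[slot] = id;
      System.out.println(
          "id " + id + ": home bucket = " + (hashes[id] & mask) + ", final slot = " + slot);
    }
  }
}
```

The second and third entries land in slots 4 and 5 even though their hash & mask is 3, so reading slot 4 back gives the wrong low bits for that key's hash.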

I also tried creating a hashcodes[] member variable, similar to bytesStart[] in the BytesRefHash class, to store the hashcodes of all added terms. This avoids recalculating hash values during rehashing and can be used for comparison during findHash. However, benchmarking showed that the benefit was not significant compared to the current changes, and it also incurs additional memory overhead.

@tyronecai tyronecai requested a review from dweiss March 4, 2026 14:21
@mikemccand
Member

Ahhh you're right, the linear probing breaks that idea (Claude Opus 4.6 agrees), sigh. OK +1 to ship what you already did -- this is awesome progress! Thanks @tyronecai .

@tyronecai
Contributor Author

@dweiss

Could you please review my new changes again? I've reduced memory usage.

Contributor

@dweiss dweiss left a comment


LGTM.

@tyronecai
Contributor Author

Ahhh you're right, the linear probing breaks that idea (Claude Opus 4.6 agrees), sigh. OK +1 to ship what you already did -- this is awesome progress! Thanks @tyronecai .

Hi, @mikemccand,

Does this review still need your approval before the change can be merged?
