Improve BytesRefHash.add performance by optimizing the rehash operation (#15779)
tyronecai wants to merge 5 commits into apache:main from
Conversation
Precompute all term hash values sequentially at once which is much faster than out-of-order computation
dweiss
left a comment
This doesn't seem harmful and is a known technique for leveraging cache locality. Seems ok with me.
Hi, is there anything else I need to do for this review, or do I just need to wait for someone else to review it?
+1 -- these genai tools are amazing. Claude (Opus 4.6) helped me add ascii-art (well, Unicode) sparkle histograms to visualize smelly vectors in

Wow, ~51% net topline (time to insert N keys) speedup by pre-computing all hashes into up-front

You are doing precisely the same amount of CPU work (1X hash computation for each

I guess reading the key from random packed

If you run with

I asked genai to look at the PR and explain the speedup. Claude Opus 4.6 did well. So did Grok (Expert). Gemini (Thinking) was (surprisingly) not great -- it confusingly thought this PR introduced power-of-two hash table sizes (that is pre-existing), and that this PR switched from modulo math to bitmasks (also pre-existing).

Anyways, I love this change. How portable is it? If you use Lucene's

But I'm worried about the surge in transient RAM. It's especially bad timing because we are already surging to 3X the current hash table in transience (1X the current one, 2X being rehashed into) -- surge on surge. Could we change it to do that pre-computation in chunks?

Also: since
Woops -- I see you already posted the code fragment for the benchy in your OP -- I'll try to test on

And thank you for posting the benchy source up front -- it's great to share exactly what/how you ran along with any results.
I suddenly realized that rehashing is essentially a process of reconstructing

Therefore, I no longer need the previous ids; I can directly read terms sequentially from bytesStart + pool and place them in the appropriate positions within the new ids. Am I right? This way, I only need 2X the memory, instead of 3X, or the 4X after adding

The code is roughly as follows; I still need to confirm and test it further.
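The code fragment referenced here did not survive extraction. Below is a minimal standalone sketch of the idea being described: rehash by iterating term ids sequentially via `bytesStart` and the `pool`, rather than walking the old hash table. This is not Lucene's actual `BytesRefHash` — the class, the 1-byte length prefix, and the hash function are simplified stand-ins for illustration.

```java
import java.util.Arrays;

// Simplified stand-in for Lucene's BytesRefHash layout: each term's bytes
// live in a flat pool, bytesStart[id] points at its length-prefixed entry,
// and ids[] maps hash slots to term ids (-1 = empty).
public class SequentialRehashSketch {
  byte[] pool = new byte[16];    // length-prefixed term bytes
  int poolUsed = 0;
  int[] bytesStart = new int[4]; // term id -> offset into pool
  int count = 0;                 // number of terms added so far
  int[] ids;                     // hash slot -> term id, -1 = empty
  int hashMask;

  public SequentialRehashSketch() {
    ids = new int[8];
    Arrays.fill(ids, -1);
    hashMask = ids.length - 1;
  }

  private static int hash(byte[] bytes, int off, int len) {
    int h = 0;
    for (int i = off; i < off + len; i++) h = 31 * h + bytes[i];
    return h;
  }

  /** Rehash by walking terms sequentially via bytesStart, ignoring old ids[]. */
  private void rehash() {
    int newSize = ids.length * 2;
    int[] newIds = new int[newSize];
    Arrays.fill(newIds, -1);
    int newMask = newSize - 1;
    // Sequential scan over all terms: reads the pool mostly in order, which
    // is the cache-locality point made in this PR discussion. Only the old
    // and new ids arrays are alive at once (~2X transient memory).
    for (int id = 0; id < count; id++) {
      int start = bytesStart[id];
      int len = pool[start] & 0xFF; // 1-byte length prefix (sketch only)
      int slot = hash(pool, start + 1, len) & newMask;
      while (newIds[slot] != -1) slot = (slot + 1) & newMask; // linear probe
      newIds[slot] = id;
    }
    ids = newIds;
    hashMask = newMask;
  }

  /** Adds a term, returning its id (new or existing). */
  public int add(byte[] term) {
    int slot = hash(term, 0, term.length) & hashMask;
    while (ids[slot] != -1) {
      int start = bytesStart[ids[slot]];
      int len = pool[start] & 0xFF;
      if (len == term.length && regionEquals(start + 1, term)) return ids[slot];
      slot = (slot + 1) & hashMask;
    }
    if (poolUsed + 1 + term.length > pool.length)
      pool = Arrays.copyOf(pool, Math.max(pool.length * 2, poolUsed + 1 + term.length));
    if (count == bytesStart.length) bytesStart = Arrays.copyOf(bytesStart, count * 2);
    bytesStart[count] = poolUsed;
    pool[poolUsed] = (byte) term.length;
    System.arraycopy(term, 0, pool, poolUsed + 1, term.length);
    poolUsed += 1 + term.length;
    ids[slot] = count++;
    if (count * 2 > ids.length) rehash(); // keep load factor <= 0.5
    return count - 1;
  }

  private boolean regionEquals(int start, byte[] term) {
    for (int i = 0; i < term.length; i++)
      if (pool[start + i] != term[i]) return false;
    return true;
  }
}
```

Note the trade-off raised later in this thread: this reads two sequential streams (`bytesStart` plus the pool) instead of one, but avoids any per-term hash storage.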
Oooh I see, I think that should work? You discard the old hash map (

You should also be able to iterate via the

But, I still think a zero-hash impl should work too! (Using the opto that stuffs some of the hash bits into
I tested the PR (pre-computed hashes) on a Raptor Lake i9-13900K, 192 GB RAM, Arch Linux. I don't know what all the perf stats mean, but I see CPUs_utilized change from 1.4 -> 1.7:

Before:

After:

This is on latest Lucene
I asked Claude Opus 4.6 to explain the results: https://claude.ai/share/79c14d62-39e1-4a83-b19e-430c563ac9a9
— Yes, I build the new ids from all terms pointed to by bytesStart[].

— Yes, but this is actually the same as the access below.

— No, we don't have enough bits in the current code. Because, according to (2f66c8f), when
when we look up the ids, we use the lower 4 bits of the hashcode, and linear probe for an unused location. We can completely change the logic of

I analyzed the benefits and potential problems using Codex. The current design uses the low k bits for the bucket location and stores the high bits as the fingerprint, so the two pieces of information are largely independent. If we change it to "store the low bits as the fingerprint," the first k bits overlap with the bucket location, essentially wasting k bits of information. This degradation gets worse rapidly as hashSize increases, and it will cause findHash to fall through to pool.equals(...) more frequently.

- The current scheme has approximately 32-k effective fingerprint bits.
- The low-bit scheme effectively contributes only approximately 32-2k bits of new information in collisions within the same bucket. When k >= 16 (hashSize >= 65536), there are almost no effective fingerprint bits left, and findHash will fall through to pool.equals(...) more frequently.
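The overlap argument above can be checked directly: keys that probe from the same home bucket in a power-of-two table share their low k bits by construction, so a fingerprint taken from the low bits carries no new information there, while one taken from the high bits still discriminates. A tiny illustration (the two hash values are hypothetical, chosen only to collide in the low 4 bits):

```java
// Demonstrates why a low-bit fingerprint is redundant with the bucket index:
// two hashes in the same bucket agree on the low k bits by definition, but
// their high 32-k bits (the current fingerprint scheme) can still differ.
public class FingerprintBits {
  public static void main(String[] args) {
    int k = 4;                 // hashSize = 16, bucket = hash & 0xF
    int mask = (1 << k) - 1;
    int h1 = 0x12345673;       // hypothetical colliding hash codes:
    int h2 = 0xABCDEF03;       // both end in 0x3 -> same home bucket
    if ((h1 & mask) != (h2 & mask)) throw new AssertionError("same bucket expected");
    // Low-bit fingerprint (k bits): identical -> must fall through to pool.equals().
    System.out.println("low fp equal:  " + ((h1 & mask) == (h2 & mask)));
    // High-bit fingerprint (32-k bits): differs -> lookup can skip pool.equals().
    System.out.println("high fp equal: " + ((h1 >>> k) == (h2 >>> k)));
  }
}
```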
The CPUs_utilized change from 1.4 -> 1.7 seems to be the benefit of reduced cache misses.
Since all the term information is already stored in

Therefore, it's completely useless during rehashing and subsequent compaction. So, in #15772, I modified the compaction process to discard the ids information. Following this idea, I readjusted the

@mikemccand The results were similar: hash.add completed 2,282,163 terms in half the time it took before optimizing the rehash code.
On nightly benchy box (

After:

Nice! Note the amazing drop in
Wait -- we would not duplicate the hash bits in this approach? Bucket location is the lower k bits, then store the next m lower bits (not overlapping with the k bits) in the high unused bits of ids (fingerprint)? Then we do not lose any hash bits (still 32-k bits used for fingerprint) and I think we can avoid recomputing the hash of keys during rehash.

Really, during rehash, we just need one more bit (the lowest bit of the fingerprint) of each hash. It tells us whether the bucket location in the new table is the same spot (0 bit) in the bottom half of the new table, or the same spot in the "top half" (spot + hashTableSize/2).
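The "one more bit" observation can be verified exhaustively for random hashes: when a power-of-two table doubles from oldSize to 2*oldSize, each key's new home bucket is either its old bucket (hash bit k is 0) or oldBucket + oldSize (bit k is 1). A quick self-contained check:

```java
import java.util.Random;

// Verifies that doubling a power-of-two hash table moves each key to either
// its old bucket or oldBucket + oldSize, decided by exactly one hash bit --
// so rehash placement needs one extra stored bit, not a hash recomputation.
public class DoublingBucketBit {
  public static void main(String[] args) {
    int oldSize = 1 << 16;
    Random rnd = new Random(42);
    for (int i = 0; i < 1_000_000; i++) {
      int hash = rnd.nextInt();
      int oldBucket = hash & (oldSize - 1);
      int newBucket = hash & (2 * oldSize - 1);
      int expected = ((hash & oldSize) == 0) ? oldBucket : oldBucket + oldSize;
      if (newBucket != expected) throw new AssertionError("mismatch at hash " + hash);
    }
    System.out.println("all hashes placed correctly by a single extra bit");
  }
}
```

As noted two comments later in the thread, linear probing complicates this: a key may be stored away from its home bucket, so the trick alone does not recover exact placement in Lucene's probing scheme.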
Well, you're putting 2X pressure on the cache lines, right? (Both sequential.) 1) Is stepping through

Whereas if you only stepped through the pages directly, that's a single sequential read stream. But, it's an added 1 or 2 byte

I'm not sure which would be better! This mechanical / physics sympathy is hard for me to model/predict... These 1 or 2 sequential read streams then also wrestle with the random-access writes we do into the new hash
Let me understand what you're saying.
Directly iterating through the data in the pool within BytesRefHash doesn't feel quite right, even though it does save one access to bytesStart. I still need to test the effect of this change.
Sorry for all the ideas -- we can also pinch & ship what you already created -- it looks like an amazing win as is -- PNP! ("progress not perfection"). I'm curious whether this is needle-moving for indexing overall? We can wait and see what nightly benchy thinks after we merge this.

I had another realization: we say the

But I think it's not actually a fingerprint? I think it is the entire hash code (when added to the position in
:) let me see see |
I encountered several issues while trying to write the code:
I also tried creating a
Ahhh you're right, the linear probing breaks that idea (Claude Opus 4.6 agrees), sigh. OK +1 to ship what you already did -- this is awesome progress! Thanks @tyronecai . |
Could you please review my new changes again? I've reduced memory usage. |
Hi @mikemccand, does this change still need your approval before it can be merged?
Description
Rehashing performance is improved by creating a temporary int array in `rehash` to sequentially compute the hashcodes of all terms at once. This improves memory locality, enhancing the performance of the `pool.hash` computation, which in turn improves `rehash` performance, ultimately significantly improving `add` performance.

The impact is that the temporary int array consumes additional memory (4 bytes * count), potentially an extra 40MB for 10 million data points, which should be acceptable.
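The two-pass rehash described above can be sketched as follows. This is a hypothetical simplification, not the PR's actual patch: `bytesStart`, `count`, and `hashOf` are stand-ins for the real `BytesRefHash` internals, with `hashOf` abstracting away reading each term's bytes from the pool.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

// Sketch of rehash with precomputed hashcodes: sequential pass 1 computes
// all hashes (good locality), random-access pass 2 scatters ids into the
// new power-of-two table.
public class RehashPrecompute {
  static int[] rehashWithPrecompute(int[] bytesStart, int count, int newSize,
                                    IntUnaryOperator hashOf) {
    // Pass 1: sequential, cache-friendly hash computation into a temporary
    // int[] -- this is the extra 4 * count bytes of transient memory.
    int[] hashcodes = new int[count];
    for (int id = 0; id < count; id++) {
      hashcodes[id] = hashOf.applyAsInt(bytesStart[id]);
    }
    // Pass 2: random-access scatter into the new table (-1 = empty slot).
    int mask = newSize - 1;
    int[] newIds = new int[newSize];
    Arrays.fill(newIds, -1);
    for (int id = 0; id < count; id++) {
      int slot = hashcodes[id] & mask;
      while (newIds[slot] != -1) slot = (slot + 1) & mask; // linear probe
      newIds[slot] = id;
    }
    return newIds;
  }
}
```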
I also tested keeping `int[] hashcodes` resident in the `BytesRefHash` class, maintaining it during `add`, and using it for the results of `findHash`. For large datasets, this can slightly improve performance (5-6%), but the memory overhead and maintenance cost seem uneconomical.

Thanks to OpenAI's Codex, I am able to validate my implementation very quickly, which is fantastic.
test code
BytesRefHash.java
test with large amounts (12585302 unique terms)
original (with large amounts of data, `rehash` took more than half the time of the entire insert operation)

with precompute
test with medium amounts (915436 unique terms)
original (with medium amounts of data, the precompute effect is not that obvious)
with precompute
@dweiss @mikemccand please take a look and give some advice