
Fix GroupVIntBenchmark and some tests to actually use DataInput#readGroupVInt; deprecate/remove obsolete long[] code#15104

Merged
uschindler merged 4 commits into apache:main from jpountz:fix/GroupVIntBenchmark on Aug 22, 2025

Conversation

@jpountz
Copy link
Contributor

@jpountz jpountz commented Aug 21, 2025

When group-varint was first introduced, it worked on long[], because this was the representation that our postings format used for doc IDs (to be able to do pseudo SIMD when storing two doc IDs in a single long). When postings later moved to int[], group-varint also moved to int[], only preserving the slower implementation for long[] (not letting DataInput sub-classes optimize it).

However, GroupVIntBenchmark was not updated, so every time that it benchmarks group-varint, it actually runs the naive implementation that decodes into a long[]. This PR fixes this benchmark to use the optimized impls that decode into an int[] instead.
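As background, group-varint packs four integers after a single selector byte. The sketch below is a minimal, self-contained illustration of decoding one group into an int[]; the layout (2-bit length fields in the selector, little-endian value bytes) and the name `decodeGroup` are assumptions for illustration, not Lucene's actual wire format or API.

```java
public class GroupVIntSketch {
  /**
   * Decodes one group of 4 ints starting at {@code pos} and returns the new
   * position. Assumed layout: 1 selector byte followed by 1-4 little-endian
   * bytes per value; the selector's 2-bit field i holds (numBytes - 1) for
   * value i.
   */
  public static int decodeGroup(byte[] in, int pos, int[] out, int outOff) {
    int selector = in[pos++] & 0xFF;
    for (int i = 0; i < 4; i++) {
      // extract the 2-bit length field for value i
      int numBytes = ((selector >>> (i * 2)) & 0x3) + 1;
      int v = 0;
      for (int b = 0; b < numBytes; b++) {
        v |= (in[pos++] & 0xFF) << (8 * b); // little-endian accumulation
      }
      out[outOff + i] = v;
    }
    return pos; // position after the group
  }
}
```

Decoding directly into an int[] avoids the widening that the obsolete long[] path performed, and keeping the entry point on the input class lets subclasses substitute optimized implementations.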

@github-actions
Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@jpountz
Copy link
Contributor Author

jpountz commented Aug 21, 2025

cc @rmuir who made me find this bug by observing that naive decoding and optimized decoding performed the same

@uschindler I'm not sure why but I'm getting much worse performance since #15089 (I get ~10.4 ops/us before this PR):

Benchmark                                                          (size)   Mode  Cnt  Score   Error   Units
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt              64  thrpt    5  0.624 ± 0.010  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline      64  thrpt    5  6.833 ± 0.080  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                   64  thrpt    5  8.976 ± 0.270  ops/us

@uschindler
Copy link
Contributor

> cc @rmuir who made me find this bug by observing that naive decoding and optimized decoding performed the same
>
> @uschindler I'm not sure why but I'm getting much worse performance since #15089 (I get ~10.4 ops/us before this PR):
>
> Benchmark                                                          (size)   Mode  Cnt  Score   Error   Units
> GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt              64  thrpt    5  0.624 ± 0.010  ops/us
> GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline      64  thrpt    5  6.833 ± 0.080  ops/us
> GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                   64  thrpt    5  8.976 ± 0.270  ops/us

This cannot be correct, because the new code actually has fewer indirections. And Robert ran the same benchmark (as did I); the results were not worse and were actually more stable.

The code is also simpler (no lambda anymore), so what you are seeing now is strange.

@uschindler
Copy link
Contributor

The question is: I have seen the dead code, too. But GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt should have used the mmap code.

To me it looks like it does not test MMapDirectory at all?

@uschindler
Copy link
Contributor

So this is not testing mmapdir's implementation at all!

public void benchMMapDirectoryInputs_readGroupVInt(Blackhole bh) throws IOException {
  byteBufferGVIntIn.seek(0);
  GroupVIntUtil.readGroupVInts(byteBufferGVIntIn, values, size);
  bh.consume(values);
}

So we effectively have no benchmark, and this shows why the performance trap with the lambda was never visible in any benchmark!?

@uschindler
Copy link
Contributor

Can you please also fix the benchmark to not talk about ByteBuffers?

@uschindler
Copy link
Contributor

uschindler commented Aug 21, 2025

I think the whole benchmark code should be rewritten, and we should also make sure that the variable naming actually shows what is measured.

How do your changes align with what we saw 3 days ago? The MMap one was very noisy and was no longer noisy after the change (see the #15089 issue). How can this be explained? Something is very fishy here.

I'd like to see a benchmark with a much higher warmup time (the current one is also too short for VarHandles to inline correctly).

@uschindler
Copy link
Contributor

Can you also show correct values for NIOFSDirectory? NIOFSDirectory has the same optimization from #15089; I don't see numbers before/after!

@uschindler
Copy link
Contributor

P.S.: Can we also remove the dead code in GroupVIntUtil? The long[] one.... with VarHandles/lambda.

@uschindler
Copy link
Contributor

uschindler commented Aug 21, 2025

P.S.: Can we also remove the dead code in GroupVIntUtil? The long[] one... with VarHandles/lambda. I was about to do this, but I did not trust the IDE at the time (it was late evening).

@jpountz
Copy link
Contributor Author

jpountz commented Aug 21, 2025

Sure, though I won't have time to look into it before next week.

@uschindler
Copy link
Contributor

OK, thanks! Let's also wait for Mike McCandless' benchmark and see how this fixes the actual issue (a slowdown caused by GC pressure from the lambda). This may work well in this small benchmark, but with call-site pollution it suddenly behaves much worse (due to the lambda).

The biggest problem with this benchmark is that it does not do a long enough warmup (so the optimizations required for MMap to be efficient won't apply), and it has no call-site pollution, because all benchmarks run in a separate JVM (that's what I get from the annotations). So it runs the mmap case too isolated, and the optimizer can therefore remove the lambda completely.

There may be a small slowdown in the highly optimized case we see here, because there's no cast of the referent. But this should go away with longer runtimes.

I would only trust this benchmark if we made it use a more realistic workload.
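The longer warmup suggested above could look like the following JMH configuration sketch. The values and the class name are illustrative assumptions; GroupVIntBenchmark's actual annotations differ.

```java
// Hypothetical JMH setup with a longer warmup, as suggested above.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS) // give C2/VarHandles time to inline
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(1) // note: each benchmark still runs in its own JVM, so call sites stay unpolluted
@State(Scope.Benchmark)
public class GroupVIntBenchmarkSketch {
  // benchmark methods elided
}
```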

@easyice
Copy link
Contributor

easyice commented Aug 22, 2025

> So this is not testing mmapdir's implementation at all!
>
> public void benchMMapDirectoryInputs_readGroupVInt(Blackhole bh) throws IOException {
>   byteBufferGVIntIn.seek(0);
>   GroupVIntUtil.readGroupVInts(byteBufferGVIntIn, values, size);
>   bh.consume(values);
> }
>
> So we effectively have no benchmark, and this shows why the performance trap with the lambda was never visible in any benchmark!?

In my understanding, this PR should fix this issue as well.

In the main branch, benchMMapDirectoryInputs_readGroupVInt invokes
readGroupVInt(DataInput, long[], int), which bypasses in.readGroupVInt and therefore does not exercise the mmap code path.

This PR updates benchMMapDirectoryInputs_readGroupVInt to call
readGroupVInts(DataInput, int[], int) instead, ensuring that the mmap code is covered.
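The delegation described here can be illustrated with a self-contained toy (all class and method names below are illustrative, not Lucene's actual API): an int[] entry point that calls the input's own readGroupVInt exercises an optimized override in a subclass, whereas a helper that decodes on its own would silently bypass it.

```java
abstract class ToyDataInput {
  /** Naive fallback, analogous to a default readGroupVInt implementation. */
  int readGroupVInt(int[] dst, int off) {
    dst[off] = readNaively();
    return off + 1;
  }

  abstract int readNaively();
}

class ToyMMapInput extends ToyDataInput {
  boolean optimizedPathUsed = false;

  @Override
  int readGroupVInt(int[] dst, int off) {
    optimizedPathUsed = true; // the "fast path" a real mmap input would vectorize
    dst[off] = readNaively();
    return off + 1;
  }

  @Override
  int readNaively() {
    return 42;
  }
}

class ToyGroupVIntUtil {
  /** Analogous to readGroupVInts(DataInput, int[], int): delegates per group. */
  static void readGroupVInts(ToyDataInput in, int[] dst, int limit) {
    for (int off = 0; off < limit; ) {
      off = in.readGroupVInt(dst, off);
    }
  }
}
```

Calling ToyGroupVIntUtil.readGroupVInts on a ToyMMapInput flips optimizedPathUsed to true; a utility that decoded bytes itself would leave it false, which is the situation the old benchmark was in.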

@uschindler
Copy link
Contributor

> > So this is not testing mmapdir's implementation at all!
> >
> > public void benchMMapDirectoryInputs_readGroupVInt(Blackhole bh) throws IOException {
> >   byteBufferGVIntIn.seek(0);
> >   GroupVIntUtil.readGroupVInts(byteBufferGVIntIn, values, size);
> >   bh.consume(values);
> > }
> >
> > So we effectively have no benchmark, and this shows why the performance trap with the lambda was never visible in any benchmark!?
>
> In my understanding, this PR should fix this issue as well.
>
> In the main branch, benchMMapDirectoryInputs_readGroupVInt invokes readGroupVInt(DataInput, long[], int), which bypasses in.readGroupVInt and therefore does not exercise the mmap code path.
>
> This PR updates benchMMapDirectoryInputs_readGroupVInt to call readGroupVInts(DataInput, int[], int) instead, ensuring that the mmap code is covered.

Hi,
it was a bit late yesterday. The problem I mainly have is with the variable naming. It talks about ByteBuffers in many places, but it is actually about MMap.

  • Can we fix this in this PR, so it is clear what each benchmark is doing? I am good at reading and understanding code if at least variable and field names are meaningful, but the current state of that benchmark is unreadable. I therefore want a rewrite with fewer methods and variable names that make sense (e.g., "mmapIndexInput" instead of "byteBufferGVIntIn").
  • We should also remove the "dead" long[] code in GroupVIntUtil (the one that was targeted at DataInput subclasses for optimization).
  • We should check the setup of the benchmark; to me it looks like the warmup times and runtimes are much too small, so the results are very noisy. @jpountz found out that since my change in #15089 (Refactor GroupVIntUtil functional interface lambda which does not inline correctly in MemorySegmentIndexInput) the direct benchmark got slower, but in Mike's production benchmark it showed a significant speedup for queries.
  • We should have benchmarks for three impls: NIOFSDirectory (uses my new code with long->int downcast), ByteBuffersDirectory (uses my new code, same as NIOFSDirectory), MMapDirectory (uses my new code).

Once we have the benchmark fixed, let's compare the results pre/post #15089 again!

Thanks, Uwe

@easyice
Copy link
Contributor

easyice commented Aug 22, 2025

I am also seeing a performance regression related to #15089.

PR without #15089

GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt                64  thrpt    5  7.187 ± 0.588  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline        64  thrpt    5  4.021 ± 0.181  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                     64  thrpt    5  4.858 ± 0.079  ops/us

PR

GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt                64  thrpt    5  0.366 ± 0.033  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline        64  thrpt    5  3.998 ± 0.255  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                     64  thrpt    5  4.890 ± 0.428  ops/us

Then I tried increasing the warmup time from 3 to 10, but it didn’t help.

PR (warmup time 10)

GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt                64  thrpt    5  0.331 ± 0.045  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline        64  thrpt    5  4.006 ± 0.113  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                     64  thrpt    5  4.912 ± 0.263  ops/us

@uschindler
Copy link
Contributor

uschindler commented Aug 22, 2025

Can you also show the NIOFSDir and ByteBuffersDir benchmarks? They have the same PR changes.

@uschindler
Copy link
Contributor

The strange thing is that the production benchmarks got much faster... Which queries should be mainly impacted by this change?

@uschindler
Copy link
Contributor

See benchmarks from last night with the PR applied: #15079 (comment)

@easyice
Copy link
Contributor

easyice commented Aug 22, 2025

Can you also show the NIOFSDir and ByteBuffersDir benchmarks? They have same PR changes.

Hi Uwe, here is the full benchmark output:

PR

Benchmark                                                            (size)   Mode  Cnt  Score   Error   Units
GroupVIntBenchmark.benchByteArrayDataInput_readGroupVInt                 64  thrpt    5  4.016 ± 0.101  ops/us
GroupVIntBenchmark.benchByteArrayDataInput_readVInt                      64  thrpt    5  4.908 ± 0.177  ops/us
GroupVIntBenchmark.benchByteBuffersIndexInput_readGroupVInt              64  thrpt    5  0.319 ± 0.016  ops/us
GroupVIntBenchmark.benchByteBuffersIndexInput_readGroupVIntBaseline      64  thrpt    5  1.155 ± 0.154  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt                64  thrpt    5  0.331 ± 0.045  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline        64  thrpt    5  4.006 ± 0.113  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                     64  thrpt    5  4.912 ± 0.263  ops/us
GroupVIntBenchmark.benchNIOFSDirectoryInputs_readGroupVInt               64  thrpt    5  0.327 ± 0.025  ops/us
GroupVIntBenchmark.benchNIOFSDirectoryInputs_readGroupVIntBaseline       64  thrpt    5  3.974 ± 0.090  ops/us
GroupVIntBenchmark.bench_writeGroupVInt                                  64  thrpt    5  2.053 ± 0.154  ops/us
PosGroupVIntBenchmark.benchmark_addPositions                            N/A  thrpt    5  0.746 ± 0.113  ops/us

PR without #15089

Benchmark                                                            (size)   Mode  Cnt  Score   Error   Units
GroupVIntBenchmark.benchByteArrayDataInput_readGroupVInt                 64  thrpt    5  3.966 ± 0.119  ops/us
GroupVIntBenchmark.benchByteArrayDataInput_readVInt                      64  thrpt    5  4.911 ± 0.582  ops/us
GroupVIntBenchmark.benchByteBuffersIndexInput_readGroupVInt              64  thrpt    5  3.422 ± 0.068  ops/us
GroupVIntBenchmark.benchByteBuffersIndexInput_readGroupVIntBaseline      64  thrpt    5  1.160 ± 0.084  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt                64  thrpt    5  7.187 ± 0.588  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline        64  thrpt    5  4.021 ± 0.181  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                     64  thrpt    5  4.858 ± 0.079  ops/us
GroupVIntBenchmark.benchNIOFSDirectoryInputs_readGroupVInt               64  thrpt    5  5.681 ± 0.367  ops/us
GroupVIntBenchmark.benchNIOFSDirectoryInputs_readGroupVIntBaseline       64  thrpt    5  3.597 ± 0.993  ops/us
GroupVIntBenchmark.bench_writeGroupVInt                                  64  thrpt    5  2.057 ± 0.162  ops/us
PosGroupVIntBenchmark.benchmark_addPositions                            N/A  thrpt    5  0.738 ± 0.103  ops/us

@uschindler
Copy link
Contributor

OK, thanks, so all three have the same slowdown in this benchmark.

Now let's just understand why the production benchmarks on Mike's server showed a significant improvement for postings-related boolean queries.

I will do some tests locally and modify the VarHandle stuff more; maybe I can make it better without a lambda (which causes GC havoc outside of microbenchmarks).

Maybe we should revert #15089 for now. I don't have much time today to work on that. But basically we need to get rid of the captured lambda.
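The "captured lambda" concern can be seen in a small self-contained demo (illustrative only, unrelated to Lucene's code): a lambda that captures a local variable typically allocates a fresh object per call on HotSpot, while a non-capturing lambda can be reused from a per-call-site cache, which is why a capturing lambda on a hot decode path can create GC pressure.

```java
import java.util.function.IntSupplier;

public class LambdaCaptureDemo {
  /** Captures {@code x}: on HotSpot this typically allocates a new IntSupplier per call. */
  static IntSupplier capturing(int x) {
    return () -> x;
  }

  /** Captures nothing: HotSpot typically reuses a single cached instance per call site. */
  static IntSupplier nonCapturing() {
    return () -> 42;
  }
}
```

Note that instance reuse (or not) is an implementation detail of the JVM, not guaranteed by the language specification; only the returned values are guaranteed.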

@jpountz
Copy link
Contributor Author

jpountz commented Aug 22, 2025

The speedup in nightly benchmarks is due to #15039 with very high likelihood. Much fewer virtual calls when computing scores and slightly better vectorization.

@uschindler
Copy link
Contributor

> The speedup in nightly benchmarks is due to #15039 with very high likelihood. Much fewer virtual calls when computing scores and slightly better vectorization.

Looks like this is correct. I needed to dig a bit to figure out whether the PR was already part of the old run from Aug 18th on Mike's server, but it looks like it was the "first" commit after the last successful run on Aug 18th:

90be960...32e97a6

So I think this might be the reason. It is very bad that we had so many performance-critical commits in such a short time, and failing benchmarks at the same time!

@uschindler
Copy link
Contributor

So let me revert #15089 and start over again.

I have some ideas.

@uschindler
Copy link
Contributor

Can we merge the current state of this PR so we have a common ground for the benchmark?

Let me fix the variable naming!.... working....


@uschindler uschindler added the skip-changelog Apply to PRs that don't need a changelog entry, stopping the automated changelog check. label Aug 22, 2025

@jpountz
Copy link
Contributor Author

jpountz commented Aug 22, 2025

I'm on a phone so it's not super easy to merge and watch builds, but feel free to merge yourself. I don't think there's any controversy in this change.

@uschindler
Copy link
Contributor

I am currently also deprecating and removing all code in the class that's unused...

I need to have some food first, but all code passes validation...

@uschindler
Copy link
Contributor

give me two hours....

@jpountz
Copy link
Contributor Author

jpountz commented Aug 22, 2025

Heavy committing!

@uschindler
Copy link
Contributor

Hi, I pushed changes to also fix the test framework, which was still using the old long[] code for testing multi-segment index inputs and correct close behavior. Just like the benchmark, those tests were no longer testing the new implementation.

I added @Deprecated to all GroupVIntUtil methods that use long[] and moved them to the end of the class. The methods used for the optimization were completely removed.

One thing I did not fix: I added @Deprecated also to the DataOutput method. We should remove it completely in main (see the new issue by @jpountz). In 10.x we can't remove it yet, because there may be code that has implemented it in custom DataOutputs. Actually the method should not be in DataOutput at all; the whole thing should be handled completely by the util class (that's my personal opinion).

What's strange: backward-codecs is still able to write, so the writing code was not removed; otherwise the writeGroupVInt with longs could have been removed completely.

@jpountz: Can you have a quick look? Otherwise I will merge this in a moment (and backport) to proceed with refactoring and giving the VarHandles/lambda a second try.

@uschindler uschindler changed the title Fix GroupVIntBenchmark to actually use DataInput#readGroupVInt. Fix GroupVIntBenchmark and some tests to actually use DataInput#readGroupVInt; deprecate/remove obsolete long[] code Aug 22, 2025
@uschindler uschindler merged commit 2c56505 into apache:main Aug 22, 2025
8 checks passed
asf-gitbox-commits pushed a commit that referenced this pull request Aug 22, 2025
…roupVInt; deprecate/remove obsolete long[] code (#15104)

* Fix GroupVIntBenchmark to actually use DataInput#readGroupVInt.

When group-varint was first introduced, it worked on long[], because this was
the representation that our postings format used for doc IDs (to be able to do
pseudo SIMD when storing two doc IDs in a single long). When postings later
moved to int[], group-varint also moved to int[], only preserving the slower
implementation for long[] (not letting `DataInput` sub-classes optimize it).

However, `GroupVIntBenchmark` was not updated, so every time that it benchmarks
group-varint, it actually runs the naive implementation that decodes into a
long[]. This PR fixes this benchmark to use the optimized impls that decode
into an int[] instead.

* cleanup field names

* Remove dead code and add deprecated annotation to all code and tests only used by backwards codecs

---------

Co-authored-by: Uwe Schindler <[email protected]>
@uschindler uschindler self-assigned this Aug 22, 2025
@uschindler uschindler added this to the 10.3.0 milestone Aug 22, 2025
@uschindler
Copy link
Contributor

Thanks to all. Now on to exploring the options! Thanks @rmuir and @jpountz for finding that issue. I also fixed the same problem as in the benchmark in our test framework; it also tested the wrong method!


Labels

module:core/store module:test-framework skip-changelog Apply to PRs that don't need a changelog entry, stopping the automated changelog check.
