
Fix GroupVIntBenchmark and some tests to actually use DataInput#readGroupVInt; deprecate/remove obsolete long[] code#15104

Merged
uschindler merged 4 commits into apache:main from jpountz:fix/GroupVIntBenchmark on Aug 22, 2025

Conversation

@jpountz
Copy link
Contributor

@jpountz jpountz commented Aug 21, 2025

When group-varint was first introduced, it worked on long[], because this was the representation that our postings format used for doc IDs (to be able to do pseudo SIMD when storing two doc IDs in a single long). When postings later moved to int[], group-varint also moved to int[], only preserving the slower implementation for long[] (not letting DataInput sub-classes optimize it).

However, GroupVIntBenchmark was not updated, so every time that it benchmarks group-varint, it actually runs the naive implementation that decodes into a long[]. This PR fixes this benchmark to use the optimized impls that decode into an int[] instead.
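As background, group-varint packs four integers after a single selector byte. The sketch below is a minimal, self-contained illustration of decoding one group into an int[]; the layout (2-bit length fields in the selector, little-endian value bytes) and the name `decodeGroup` are assumptions for illustration, not Lucene's actual wire format or API.

```java
public class GroupVIntSketch {
  /**
   * Decodes one group of 4 ints starting at {@code pos} and returns the new
   * position. Assumed layout: 1 selector byte followed by 1-4 little-endian
   * bytes per value; the selector's 2-bit field i holds (numBytes - 1) for
   * value i.
   */
  public static int decodeGroup(byte[] in, int pos, int[] out, int outOff) {
    int selector = in[pos++] & 0xFF;
    for (int i = 0; i < 4; i++) {
      // extract the 2-bit length field for value i
      int numBytes = ((selector >>> (i * 2)) & 0x3) + 1;
      int v = 0;
      for (int b = 0; b < numBytes; b++) {
        v |= (in[pos++] & 0xFF) << (8 * b); // little-endian accumulation
      }
      out[outOff + i] = v;
    }
    return pos; // position after the group
  }
}
```

Decoding directly into an int[] avoids the widening that the obsolete long[] path performed, and keeping the entry point on the input class lets subclasses substitute optimized implementations.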

@github-actions
Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@jpountz
Copy link
Contributor Author

jpountz commented Aug 21, 2025

cc @rmuir who made me find this bug by observing that naive decoding and optimized decoding performed the same

@uschindler I'm not sure why but I'm getting much worse performance since #15089 (I get ~10.4 ops/us before this PR):

Benchmark                                                          (size)   Mode  Cnt  Score   Error   Units
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt              64  thrpt    5  0.624 ± 0.010  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline      64  thrpt    5  6.833 ± 0.080  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                   64  thrpt    5  8.976 ± 0.270  ops/us

@uschindler
Copy link
Contributor

> cc @rmuir who made me find this bug by observing that naive decoding and optimized decoding performed the same
>
> @uschindler I'm not sure why but I'm getting much worse performance since #15089 (I get ~10.4 ops/us before this PR):
>
> Benchmark                                                          (size)   Mode  Cnt  Score   Error   Units
> GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt              64  thrpt    5  0.624 ± 0.010  ops/us
> GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline      64  thrpt    5  6.833 ± 0.080  ops/us
> GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                   64  thrpt    5  8.976 ± 0.270  ops/us

This cannot be correct, because the new code actually has fewer indirections. And Robert ran the same benchmark (as did I); the results were not worse and were actually more stable.

The code is also simpler (no lambda anymore), so what you are seeing now is strange.

@uschindler
Copy link
Contributor

The question is: I have seen the dead code, too. But GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt should have used the mmap code.

To me it looks like it does not test MMapDirectory at all?

@uschindler
Copy link
Contributor

So this is not testing mmapdir's implementation at all!

public void benchMMapDirectoryInputs_readGroupVInt(Blackhole bh) throws IOException {
  byteBufferGVIntIn.seek(0);
  GroupVIntUtil.readGroupVInts(byteBufferGVIntIn, values, size);
  bh.consume(values);
}

So we effectively have no benchmark, and this shows why the performance trap with the lambda was never visible in any benchmark!?

@uschindler
Copy link
Contributor

Can you please also fix the benchmark to not talk about ByteBuffers?

@uschindler
Copy link
Contributor

uschindler commented Aug 21, 2025

I think the whole benchmark code should be rewritten, and we should also make sure that the variable naming actually shows what is measured.

How do your changes align with what we saw 3 days ago? The MMap one was very noisy and was no longer noisy after the change (see the #15089 issue). How can this be explained? Something is very fishy here.

I'd like to see a benchmark with a much higher warmup time (the current one is also too short for VarHandles to inline correctly).

@uschindler
Copy link
Contributor

Can you also show correct values for NIOFSDirectory? NIOFSDirectory has the same optimization from #15089; I don't see numbers before/after!

@uschindler
Copy link
Contributor

P.S.: Can we also remove the dead code in GroupVIntUtil? The long[] one.... with VarHandles/lambda.

@uschindler
Copy link
Contributor

uschindler commented Aug 21, 2025

P.S.: Can we also remove the dead code in GroupVIntUtil? The long[] one... with VarHandles/lambda. I was about to do this, but I did not trust the IDE at the time (it was late evening).

@jpountz
Copy link
Contributor Author

jpountz commented Aug 21, 2025

Sure, though I won't have time to look into it before next week.

@uschindler
Copy link
Contributor

OK, thanks! Let's also wait for Mike McCandless' benchmark and see how this fixes the actual issue (a slowdown caused by GC pressure from the lambda). This may work well in this small benchmark, but with call-site pollution it suddenly behaves much worse (due to the lambda).

The biggest problem with this benchmark is that it does not do a long enough warmup (so the optimizations required for MMap to be efficient won't apply), and it has no call-site pollution, because all benchmarks run in a separate JVM (that's what I get from the annotations). So it runs the mmap case too isolated, and the optimizer can therefore remove the lambda completely.

There may be a small slowdown in the highly optimized case we see here, because there's no cast of the referent. But this should go away with longer runtimes.

I would only trust this benchmark if we made it use a more realistic workload.
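The longer warmup suggested above could look like the following JMH configuration sketch. The values and the class name are illustrative assumptions; GroupVIntBenchmark's actual annotations differ.

```java
// Hypothetical JMH setup with a longer warmup, as suggested above.
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS) // give C2/VarHandles time to inline
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(1) // note: each benchmark still runs in its own JVM, so call sites stay unpolluted
@State(Scope.Benchmark)
public class GroupVIntBenchmarkSketch {
  // benchmark methods elided
}
```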

@easyice
Copy link
Contributor

easyice commented Aug 22, 2025

> So this is not testing mmapdir's implementation at all!
>
> public void benchMMapDirectoryInputs_readGroupVInt(Blackhole bh) throws IOException {
>   byteBufferGVIntIn.seek(0);
>   GroupVIntUtil.readGroupVInts(byteBufferGVIntIn, values, size);
>   bh.consume(values);
> }
>
> So we effectively have no benchmark, and this shows why the performance trap with the lambda was never visible in any benchmark!?

In my understanding, this PR should fix this issue as well.

In the main branch, benchMMapDirectoryInputs_readGroupVInt invokes
readGroupVInt(DataInput, long[], int), which bypasses in.readGroupVInt and therefore does not exercise the mmap code path.

This PR updates benchMMapDirectoryInputs_readGroupVInt to call
readGroupVInts(DataInput, int[], int) instead, ensuring that the mmap code is covered.
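The delegation described here can be illustrated with a self-contained toy (all class and method names below are illustrative, not Lucene's actual API): an int[] entry point that calls the input's own readGroupVInt exercises an optimized override in a subclass, whereas a helper that decodes on its own would silently bypass it.

```java
abstract class ToyDataInput {
  /** Naive fallback, analogous to a default readGroupVInt implementation. */
  int readGroupVInt(int[] dst, int off) {
    dst[off] = readNaively();
    return off + 1;
  }

  abstract int readNaively();
}

class ToyMMapInput extends ToyDataInput {
  boolean optimizedPathUsed = false;

  @Override
  int readGroupVInt(int[] dst, int off) {
    optimizedPathUsed = true; // the "fast path" a real mmap input would vectorize
    dst[off] = readNaively();
    return off + 1;
  }

  @Override
  int readNaively() {
    return 42;
  }
}

class ToyGroupVIntUtil {
  /** Analogous to readGroupVInts(DataInput, int[], int): delegates per group. */
  static void readGroupVInts(ToyDataInput in, int[] dst, int limit) {
    for (int off = 0; off < limit; ) {
      off = in.readGroupVInt(dst, off);
    }
  }
}
```

Calling ToyGroupVIntUtil.readGroupVInts on a ToyMMapInput flips optimizedPathUsed to true; a utility that decoded bytes itself would leave it false, which is the situation the old benchmark was in.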

@uschindler
Copy link
Contributor

> > So this is not testing mmapdir's implementation at all!
> >
> > public void benchMMapDirectoryInputs_readGroupVInt(Blackhole bh) throws IOException {
> >   byteBufferGVIntIn.seek(0);
> >   GroupVIntUtil.readGroupVInts(byteBufferGVIntIn, values, size);
> >   bh.consume(values);
> > }
> >
> > So we effectively have no benchmark, and this shows why the performance trap with the lambda was never visible in any benchmark!?
>
> In my understanding, this PR should fix this issue as well.
>
> In the main branch, benchMMapDirectoryInputs_readGroupVInt invokes readGroupVInt(DataInput, long[], int), which bypasses in.readGroupVInt and therefore does not exercise the mmap code path.
>
> This PR updates benchMMapDirectoryInputs_readGroupVInt to call readGroupVInts(DataInput, int[], int) instead, ensuring that the mmap code is covered.

Hi,
it was a bit late yesterday. The problem I mainly have is with the variable naming. It talks about ByteBuffers in many places, but it is actually about MMap.

  • Can we fix this in this PR, so it is clear what each benchmark is doing? I am good at reading and understanding code if at least variable and field names are meaningful, but the current state of that benchmark is unreadable. I therefore want a rewrite with fewer methods and variable names that make sense (e.g., "mmapIndexInput" instead of "byteBufferGVIntIn").
  • We should also remove the "dead" long[] code in GroupVIntUtil (the one that was targeted at DataInput subclasses for optimization).
  • We should check the setup of the benchmark; to me it looks like the warmup times and runtimes are much too small, so the results are very noisy. @jpountz found out that since my change in #15089 (Refactor GroupVIntUtil functional interface lambda which does not inline correctly in MemorySegmentIndexInput) the direct benchmark got slower, but in Mike's production benchmark it showed a significant speedup for queries.
  • We should have benchmarks for three impls: NIOFSDirectory (uses my new code with long->int downcast), ByteBuffersDirectory (uses my new code, same as NIOFSDirectory), MMapDirectory (uses my new code).

Once we have the benchmark fixed, let's compare the results pre/post #15089 again!

Thanks, Uwe

@easyice
Copy link
Contributor

easyice commented Aug 22, 2025

I am also seeing a performance regression related to #15089.

PR without #15089

GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt                64  thrpt    5  7.187 ± 0.588  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline        64  thrpt    5  4.021 ± 0.181  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                     64  thrpt    5  4.858 ± 0.079  ops/us

PR

GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt                64  thrpt    5  0.366 ± 0.033  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline        64  thrpt    5  3.998 ± 0.255  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                     64  thrpt    5  4.890 ± 0.428  ops/us

Then I tried increasing the warmup time from 3 to 10, but it didn’t help.

PR (warmup time 10)

GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt                64  thrpt    5  0.331 ± 0.045  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline        64  thrpt    5  4.006 ± 0.113  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                     64  thrpt    5  4.912 ± 0.263  ops/us

@uschindler
Copy link
Contributor

uschindler commented Aug 22, 2025

Can you also show the NIOFSDir and ByteBuffersDir benchmarks? They have the same PR changes.

@uschindler
Copy link
Contributor

The strange thing is that the production benchmarks got much faster... Which queries should be mainly impacted by this change?

@uschindler
Copy link
Contributor

See benchmarks from last night with the PR applied: #15079 (comment)

@easyice
Copy link
Contributor

easyice commented Aug 22, 2025

Can you also show the NIOFSDir and ByteBuffersDir benchmarks? They have same PR changes.

Hi Uwe, here is the full benchmark output:

PR

Benchmark                                                            (size)   Mode  Cnt  Score   Error   Units
GroupVIntBenchmark.benchByteArrayDataInput_readGroupVInt                 64  thrpt    5  4.016 ± 0.101  ops/us
GroupVIntBenchmark.benchByteArrayDataInput_readVInt                      64  thrpt    5  4.908 ± 0.177  ops/us
GroupVIntBenchmark.benchByteBuffersIndexInput_readGroupVInt              64  thrpt    5  0.319 ± 0.016  ops/us
GroupVIntBenchmark.benchByteBuffersIndexInput_readGroupVIntBaseline      64  thrpt    5  1.155 ± 0.154  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt                64  thrpt    5  0.331 ± 0.045  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline        64  thrpt    5  4.006 ± 0.113  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                     64  thrpt    5  4.912 ± 0.263  ops/us
GroupVIntBenchmark.benchNIOFSDirectoryInputs_readGroupVInt               64  thrpt    5  0.327 ± 0.025  ops/us
GroupVIntBenchmark.benchNIOFSDirectoryInputs_readGroupVIntBaseline       64  thrpt    5  3.974 ± 0.090  ops/us
GroupVIntBenchmark.bench_writeGroupVInt                                  64  thrpt    5  2.053 ± 0.154  ops/us
PosGroupVIntBenchmark.benchmark_addPositions                            N/A  thrpt    5  0.746 ± 0.113  ops/us

PR without #15089

Benchmark                                                            (size)   Mode  Cnt  Score   Error   Units
GroupVIntBenchmark.benchByteArrayDataInput_readGroupVInt                 64  thrpt    5  3.966 ± 0.119  ops/us
GroupVIntBenchmark.benchByteArrayDataInput_readVInt                      64  thrpt    5  4.911 ± 0.582  ops/us
GroupVIntBenchmark.benchByteBuffersIndexInput_readGroupVInt              64  thrpt    5  3.422 ± 0.068  ops/us
GroupVIntBenchmark.benchByteBuffersIndexInput_readGroupVIntBaseline      64  thrpt    5  1.160 ± 0.084  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVInt                64  thrpt    5  7.187 ± 0.588  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readGroupVIntBaseline        64  thrpt    5  4.021 ± 0.181  ops/us
GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt                     64  thrpt    5  4.858 ± 0.079  ops/us
GroupVIntBenchmark.benchNIOFSDirectoryInputs_readGroupVInt               64  thrpt    5  5.681 ± 0.367  ops/us
GroupVIntBenchmark.benchNIOFSDirectoryInputs_readGroupVIntBaseline       64  thrpt    5  3.597 ± 0.993  ops/us
GroupVIntBenchmark.bench_writeGroupVInt                                  64  thrpt    5  2.057 ± 0.162  ops/us
PosGroupVIntBenchmark.benchmark_addPositions                            N/A  thrpt    5  0.738 ± 0.103  ops/us

@uschindler
Copy link
Contributor

OK, thanks, so all three have the same slowdown in this benchmark.

Now let's just understand why the production benchmarks on Mike's server showed a significant improvement for postings-related boolean queries.

I will do some tests locally and modify the VarHandle stuff more; maybe I can make it better without a lambda (which causes GC havoc outside of microbenchmarks).

Maybe we should revert #15089 for now. I don't have much time today to work on that. But basically we need to get rid of the captured lambda.
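The "captured lambda" concern can be seen in a small self-contained demo (illustrative only, unrelated to Lucene's code): a lambda that captures a local variable typically allocates a fresh object per call on HotSpot, while a non-capturing lambda can be reused from a per-call-site cache, which is why a capturing lambda on a hot decode path can create GC pressure.

```java
import java.util.function.IntSupplier;

public class LambdaCaptureDemo {
  /** Captures {@code x}: on HotSpot this typically allocates a new IntSupplier per call. */
  static IntSupplier capturing(int x) {
    return () -> x;
  }

  /** Captures nothing: HotSpot typically reuses a single cached instance per call site. */
  static IntSupplier nonCapturing() {
    return () -> 42;
  }
}
```

Note that instance reuse (or not) is an implementation detail of the JVM, not guaranteed by the language specification; only the returned values are guaranteed.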

@jpountz
Copy link
Contributor Author

jpountz commented Aug 22, 2025

The speedup in nightly benchmarks is due to #15039 with very high likelihood. Much fewer virtual calls when computing scores and slightly better vectorization.

@uschindler
Copy link
Contributor

> The speedup in nightly benchmarks is due to #15039 with very high likelihood. Much fewer virtual calls when computing scores and slightly better vectorization.

Looks like this is correct. I needed to dig a bit to figure out whether the PR was already part of the old run from Aug 18th on Mike's server, but it looks like it was the "first" commit after the last successful run on Aug 18th:

90be960...32e97a6

So I think this might be the reason. It is very bad that we had so many performance-critical commits in such a short time, and failing benchmarks at the same time!

@uschindler
Copy link
Contributor

So let me revert #15089 and start over again.

I have some ideas.

@uschindler
Copy link
Contributor

Can we merge the current state of this PR so we have a common ground for the benchmark?

Let me fix the variable naming!.... working....


@uschindler uschindler added the skip-changelog Apply to PRs that don't need a changelog entry, stopping the automated changelog check. label Aug 22, 2025

@jpountz
Copy link
Contributor Author

jpountz commented Aug 22, 2025

I'm on a phone so it's not super easy to merge and watch builds, but feel free to merge yourself. I don't think there's any controversy in this change.

@uschindler
Copy link
Contributor

I am currently also deprecating and removing all code in the class that's unused...

I need to have some food first, but all code passes validation...

@uschindler
Copy link
Contributor

give me two hours....

@jpountz
Copy link
Contributor Author

jpountz commented Aug 22, 2025

Heavy committing!

@uschindler
Copy link
Contributor

Hi, I pushed changes to also fix the test framework, which was still using the old long[] code for testing multi-segment index inputs and correct close behavior. Just like the benchmark, those tests were no longer testing the new implementation.

I added @Deprecated to all GroupVIntUtil methods that use long[] and moved them to the end of the class. The methods used for the optimization were completely removed.

One thing I did not fix: I added @Deprecated also to the DataOutput method. We should remove it completely in main (see the new issue by @jpountz). In 10.x we can't remove it yet, because there may be code that has implemented it in custom DataOutputs. Actually the method should not be in DataOutput at all; the whole thing should be handled completely by the util class (that's my personal opinion).

What's strange: backward-codecs is still able to write, so the writing code was not removed; otherwise the writeGroupVInt with longs could have been removed completely.

@jpountz: Can you have a quick look? Otherwise I will merge this in a moment (and backport) to proceed with refactoring and giving the VarHandles/lambda a second try.

@uschindler uschindler changed the title Fix GroupVIntBenchmark to actually use DataInput#readGroupVInt. Fix GroupVIntBenchmark and some tests to actually use DataInput#readGroupVInt; deprecate/remove obsolete long[] code Aug 22, 2025
@uschindler uschindler merged commit 2c56505 into apache:main Aug 22, 2025
8 checks passed
asf-gitbox-commits pushed a commit that referenced this pull request Aug 22, 2025
…roupVInt; deprecate/remove obsolete long[] code (#15104)

* Fix GroupVIntBenchmark to actually use DataInput#readGroupVInt.

When group-varint was first introduced, it worked on long[], because this was
the representation that our postings format used for doc IDs (to be able to do
pseudo SIMD when storing two doc IDs in a single long). When postings later
moved to int[], group-varint also moved to int[], only preserving the slower
implementation for long[] (not letting `DataInput` sub-classes optimize it).

However, `GroupVIntBenchmark` was not updated, so every time that it benchmarks
group-varint, it actually runs the naive implementation that decodes into a
long[]. This PR fixes this benchmark to use the optimized impls that decode
into an int[] instead.

* cleanup field names

* Remove dead code and add deprecated annotation to all code and tests only used by backwards codecs

---------

Co-authored-by: Uwe Schindler <[email protected]>
@uschindler uschindler self-assigned this Aug 22, 2025
@uschindler uschindler added this to the 10.3.0 milestone Aug 22, 2025
@uschindler
Copy link
Contributor

Thanks to all. Now on to exploring the options! Thanks @rmuir and @jpountz for finding that issue. I also fixed the same problem as in the benchmark in our test framework; it also tested the wrong method!


Labels

module:core/store module:test-framework skip-changelog Apply to PRs that don't need a changelog entry, stopping the automated changelog check.
