Fix GroupVIntBenchmark and some tests to actually use DataInput#readGroupVInt; deprecate/remove obsolete long[] code #15104
Conversation
When group-varint was first introduced, it worked on long[], because this was the representation that our postings format used for doc IDs (to be able to do pseudo SIMD when storing two doc IDs in a single long). When postings later moved to int[], group-varint also moved to int[], only preserving the slower implementation for long[] (not letting `DataInput` sub-classes optimize it). However, `GroupVIntBenchmark` was not updated, so every time that it benchmarks group-varint, it actually runs the naive implementation that decodes into a long[]. This PR fixes this benchmark to use the optimized impls that decode into an int[] instead.
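As background for readers, here is a minimal sketch (not Lucene's actual postings code; class and method names are hypothetical) of the pseudo-SIMD idea described above: two 32-bit doc IDs packed into a single long, so that one 64-bit addition advances both lanes at once.

```java
// Hypothetical illustration of packing two doc IDs into one long ("pseudo SIMD").
// This is not Lucene's implementation; it only shows the representation trick.
public class PackedDocIds {

  /** Pack two 32-bit doc IDs into one 64-bit long. */
  public static long pack(int hi, int lo) {
    return ((long) hi << 32) | (lo & 0xFFFFFFFFL);
  }

  public static int high(long packed) {
    return (int) (packed >>> 32);
  }

  public static int low(long packed) {
    return (int) packed;
  }

  public static void main(String[] args) {
    long packed = pack(7, 42);
    // One 64-bit add advances both packed deltas at once, as long as
    // neither lane overflows into the other.
    long advanced = packed + pack(1, 3);
    System.out.println(high(advanced)); // 8
    System.out.println(low(advanced));  // 45
  }
}
```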
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
|
cc @rmuir, who made me find this bug by observing that naive decoding and optimized decoding performed the same. @uschindler I'm not sure why, but I'm getting much worse performance since #15089 (I get ~10.4 ops/us before this PR): |
This cannot be correct, because the new code actually has fewer indirections. And Robert ran the same benchmark (as did I); the results were not worse and were actually more stable. The code is also simpler (no lambda anymore), so it is strange what you see now. |
|
The question is: I have seen the dead code, too. But `GroupVIntBenchmark.benchMMapDirectoryInputs_readVInt` should have used the mmap code. To me it looks like it does not test MMapDirectory at all? |
|
So this is not testing mmapdir's implementation at all! So actually we have no benchmark, and this shows why the performance trap with the lambda was never visible in any benchmark!? |
|
Can you please also fix the benchmark to not talk about ByteBuffers? |
|
I think the whole benchmark code should be rewritten, and we should also make sure that the variable naming actually shows what is measured. How do your changes now align with what we saw 3 days ago? The MMap one was very noisy and was not noisy anymore after the change (see the #15089 issue). How can this be explained? Something is very fishy here. I'd like to see a benchmark with a much higher warmup time (it is also too short for VarHandles to inline correctly). |
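For illustration, a longer warmup could be requested directly via JMH annotations. The values below are assumptions for the sake of the sketch, not the benchmark's actual settings, and the class name is hypothetical:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Sketch only: illustrative warmup/measurement settings that give VarHandle
// call sites more time to fully inline before measurement starts.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 10, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 7, time = 5, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Benchmark)
public class LongWarmupBenchmarkSketch {
  @Benchmark
  public int readGroupVInts() {
    return 0; // placeholder body; the real benchmark would decode into an int[]
  }
}
```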
|
Can you also show correct values for NIOFSDirectory? NIOFSDirectory has the same optimization in #15089; I don't see numbers before/after! |
|
P.S.: Can we also remove the dead code in GroupVIntUtil? The long[] one.... with VarHandles/lambda. I was about to do this, but I did not trust the IDE at that time (was late evening). |
|
Sure, though I won't have time to look into it before next week. |
|
OK, thanks! Let's wait for Mike McCandless's benchmark too, and let's see how this fixes the actual issue (slowdown) caused by GC pressure from the lambda. This may work well in this small benchmark, but with call-site pollution it suddenly behaves much worse (due to the lambda). The biggest problem with this benchmark is: it does not do a long enough warmup (therefore the optimizations required for MMap to be efficient won't apply), and it does not have the call-site pollution, because all benchmarks run in a separate JVM (that's what I get from the annotations). So it runs the mmap case too isolated, and therefore the optimizer can remove the lambda completely. There may be a small slowdown in the highly optimized case we see here, because there's no cast of the referent. But this should go away with longer runtimes. I would only trust that benchmark if we made it use a more realistic workload. |
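As an aside for readers, a minimal sketch (not Lucene's code; names are hypothetical) of why a *capturing* lambda can cause GC pressure on hot paths: each evaluation of a lambda expression that captures a local typically allocates a fresh object on HotSpot, while a non-capturing lambda can be reused.

```java
import java.util.function.IntSupplier;

// Illustrative sketch of capturing vs. non-capturing lambdas.
public class CapturingLambda {

  // Captures 'base': on HotSpot this typically allocates a new supplier
  // object every time this method runs - GC pressure on a decode hot path.
  public static IntSupplier capturing(int base) {
    return () -> base + 1;
  }

  // Captures nothing, so the runtime may hand back the same instance.
  public static IntSupplier nonCapturing() {
    return () -> 1;
  }

  public static void main(String[] args) {
    System.out.println(capturing(41).getAsInt()); // 42
    // Whether two evaluations share an instance is unspecified by the JLS;
    // HotSpot typically reuses non-capturing lambdas but not capturing ones.
    System.out.println(nonCapturing() == nonCapturing());
    System.out.println(capturing(1) == capturing(1));
  }
}
```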
In my understanding, this PR should fix this issue as well. In the main branch, This PR updates |
Hi,
Once we have the bench fixed, let's compare the results pre/post #15089 again! Thanks, Uwe |
|
I am also seeing a performance regression related to the #15089 PR without #15089 PR Then I tried increasing the warmup time from 3 to 10, but it didn’t help. PR (warmup time 10) |
|
Can you also show the NIOFSDir and ByteBuffersDir benchmarks? They have the same changes from the PR. |
|
The strange thing is that the production benchmarks got much faster.... Which queries should be mainly impacted by this change? |
|
See benchmarks from last night with the PR applied: #15079 (comment) |
Hi, Uwe, here is the full benchmark output: PR PR without #15089 |
|
OK, thanks, so all three have the same slowdown in this benchmark. Let's only understand why the production benchmarks on Mike's server showed a significant improvement for postings-related boolean queries? I will do some tests locally and modify the VarHandle stuff more; maybe I can make it better without a lambda (which causes GC havoc outside of microbenchmarks). Maybe we should revert #15089 for now. I don't have much time today to work on that. But basically we need to get rid of the captured lambda. |
|
The speedup in nightly benchmarks is due to #15039 with very high likelihood. Much fewer virtual calls when computing scores and slightly better vectorization. |
Looks like this is correct. I needed to dig a bit to figure out whether the PR was already part of the old run of Aug 18th on Mike's server, but it looks like it was the "first" commit after the last successful run on Aug 18th: So I think this might be the reason. It is very bad that we had so many performance-critical commits in that short time, with failing benchmarks at the same time! |
|
So let me revert #15089 and start over again. I have some ideas. |
|
Can we merge the current state of this PR so we have a common ground for the benchmark? Let me fix the variable naming!.... working.... |
|
…o fix/GroupVIntBenchmark
|
I'm on a phone so it's not super easy to merge and watch builds, but feel free to merge yourself. I don't think there's any controversy in this change. |
|
I am currently also deprecating and removing all code in the class that's unused... I need to have some food first, but all code passes validation.... |
|
give me two hours.... |
|
Heavy committing! |
…only used by backwards codecs
|
Hi, I pushed changes to also fix the test framework still using the old I added One thing I did not fix: I added What's strange: Backwards codecs is still able to write, so the writing code was not removed; otherwise the writeGroupVInt with longs could have been removed completely. @jpountz: Can you have a quick look? I will otherwise merge this in a moment (and backport) to proceed with refactoring and allowing to give the VarHandles/lambda a second try. |
…roupVInt; deprecate/remove obsolete long[] code (#15104) * Fix GroupVIntBenchmark to actually use DataInput#readGroupVInt. When group-varint was first introduced, it worked on long[], because this was the representation that our postings format used for doc IDs (to be able to do pseudo SIMD when storing two doc IDs in a single long). When postings later moved to int[], group-varint also moved to int[], only preserving the slower implementation for long[] (not letting `DataInput` sub-classes optimize it). However, `GroupVIntBenchmark` was not updated, so every time that it benchmarks group-varint, it actually runs the naive implementation that decodes into a long[]. This PR fixes this benchmark to use the optimized impls that decode into an int[] instead. * cleanup field names * Remove dead code and add deprecated annotation to all code and tests only used by backwards codecs --------- Co-authored-by: Uwe Schindler <[email protected]>