Optimize AbstractKnnVectorQuery#createBitSet with intoBitset#14674
Optimize AbstractKnnVectorQuery#createBitSet with intoBitset#14674gf2121 merged 2 commits intoapache:mainfrom
Conversation
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR. |
| } | ||
| } | ||
| }; | ||
| return BitSet.of(filterIterator, maxDoc); |
There was a problem hiding this comment.
Instead of implementing intoBitSet on the iterator, we could do BitSet bitSet = BitSet.of(iterator, maxDoc); and then clear deleted docs with liveDocs.applyMask(bitSet, 0);? Ie. there wouldn't be a FilterDocIdSetIterator anymore?
There was a problem hiding this comment.
Thanks for feedback!
This is what i had in mind initially. The problem here is applyMask is requiring a FixedBitSet while BitSet#of may return a SparseFixedBitSet. So i have to fork the Bits#applyMask code for the sparse case, which could look like:
BitSet bitSet = BitSet.of(iterator, maxDoc);
if (liveDocs != null) {
if (liveDocs instanceof FixedBitSet fixedBitSet) {
liveDocs.applyMask(fixedBitSet, 0);
} else {
for (int i = bitSet.nextSetBit(0);
i != DocIdSetIterator.NO_MORE_DOCS;
i = i + 1 >= bitSet.length() ? DocIdSetIterator.NO_MORE_DOCS : bitSet.nextSetBit(i + 1)) {
if (bitSet.get(i) == false) {
bitSet.clear(i);
}
}
}
}
return bitSet;
Or another potential way i'm thinking is making Bits#applyMask accept a BitSet. Default impl works fine but specialized impls need an instanceof FixedBitset check.
What do you think?
There was a problem hiding this comment.
Oh, this is annoying... It's not great either but I guess I'd like to do the SparseFixedBitSet/FixedBitSet decision up-front, something like:
int threshold = maxDoc >> 7; // like BitSet#of
if (it.cost() >= threshold) {
// Create a FixedBitSet in an optimized way
} else {
// existing code
}|
It looks like a similar change can be made to |
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR. |
… optimize_createBitSet
769b1b3 to
85b80b7
Compare
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR. |
* main: (32 commits) update os.makedirs with pathlib mkdir (apache#14710) Optimize AbstractKnnVectorQuery#createBitSet with intoBitset (apache#14674) Implement #docIDRunEnd() on PostingsEnum. (apache#14693) Speed up TermQuery (apache#14709) Refactor main top-n bulk scorers to evaluate hits in a more term-at-a-time fashion. (apache#14701) Fix WindowsFS test failure seen on Policeman Jenkins (apache#14706) Use a temporary repository location to download certain ecj versions ("drops") (apache#14703) Add assumption to ignore occasional test failures due to disconnected graphs (apache#14696) Return MatchNoDocsQuery when IndexOrDocValuesQuery::rewrite does not match (apache#14700) Minor access modifier adjustment to a couple of lucene90 backward compat types (apache#14695) Speed up exhaustive evaluation. (apache#14679) Specify and test that IOContext is immutable (apache#14686) deps(java): bump org.gradle.toolchains.foojay-resolver-convention (apache#14691) deps(java): bump org.eclipse.jgit:org.eclipse.jgit (apache#14692) Clean up how the test framework creates asserting scorables. (apache#14452) Make competitive iterators more robust. (apache#14532) Remove DISIDocIdStream. (apache#14550) Implement AssertingPostingsEnum#intoBitSet. (apache#14675) Fix patience knn queries to work with seeded knn queries (apache#14688) Added toString() method to BytesRefBuilder (apache#14676) ...
| new FilteredDocIdSetIterator(iterator) { | ||
| @Override | ||
| protected boolean match(int doc) { | ||
| return liveDocs == null || liveDocs.get(doc); |
There was a problem hiding this comment.
if liveDocs == null we could return Bitset.of(iterator, maxDoc), no, and avoid the internal logic in the filtered iterator, which probably goes away with branch prediction 🤷
This tries to implement
intoBitsetto speed up the building of BitSet.