Add Lucene improvements for HNSW merging #129046

carlosdelest wants to merge 32 commits into elastic:main from feature/dense-vector-hnsw-reduce-heap
Conversation
Hi @carlosdelest, I've created a changelog YAML for you.
…ne99 vector writers
…reduce-heap' into feature/dense-vector-hnsw-reduce-heap
 * additional candidates is predicated on the original candidate's filtered percentage.
 * </ul>
 */
public class FilteredHnswGraphSearcher extends HnswGraphSearcher {
I don't think we need this? Only used on read, never during graph building.
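For intuition only: the javadoc above ties the number of additional candidates to the original candidate's filtered percentage. Below is a hypothetical sketch of such a rule; the class name, method, and formula are illustrative, not Lucene's actual implementation.

```java
// Hypothetical sketch, NOT Lucene's actual code: if only `passing` of `total`
// neighbors of a candidate pass the filter, explore proportionally more
// candidates (capped at maxExplore) so a filtered search can still gather
// enough results.
class FilteredExpansionSketch {
    static int expandedCandidates(int base, int passing, int total, int maxExplore) {
        if (passing == 0) {
            return maxExplore; // nothing passes the filter: explore as widely as allowed
        }
        int scaled = (int) Math.ceil((double) base * total / passing);
        return Math.min(scaled, maxExplore);
    }
}
```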
 * thread-safe. The search method optionally takes a set of "accepted nodes", which can be used to
 * exclude deleted documents.
 */
public abstract class HnswGraph {
if these are public in Lucene, can we just rely on them or are they not exposed in the module?
if (numMergeWorkers == 1 && mergeExec != null) {
    throw new IllegalArgumentException("No executor service is needed as we'll use single thread to merge");
}
this.numMergeWorkers = numMergeWorkers;
if (mergeExec != null) {
    this.mergeExec = new TaskExecutor(mergeExec);
} else {
    this.mergeExec = null;
}
we don't need any of this merge worker stuff, we don't use it.
if (parallelMergeTaskExecutor != null && numParallelMergeWorkers > 1) {
    return new ConcurrentHnswMerger(fieldInfo, scorerSupplier, M, beamWidth, parallelMergeTaskExecutor, numParallelMergeWorkers);
}
return new IncrementalHnswGraphMerger(fieldInfo, scorerSupplier, M, beamWidth);
we only need IncrementalHnswGraphMerger; we never use the task exec stuff, nor do we do concurrent merges.
import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;

/** This merger merges graph in a concurrent manner, by using {@link HnswConcurrentMergeBuilder} */
public class ConcurrentHnswMerger extends IncrementalHnswGraphMerger {
 *
 * @lucene.experimental
 */
final class SeededHnswGraphSearcher extends AbstractHnswGraphSearcher {
 *
 * @lucene.internal
 */
class HashContainers {
I wonder if we can just extract what we want and directly place it into the fixed sized array things?
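To illustrate the idea in the comment above, here is a minimal sketch (names are illustrative, not from the PR) of extracting a hash container's final contents into parallel fixed-size primitive arrays, sorted by key and queried with binary search:

```java
import java.util.Arrays;

// Sketch: once the hash map's contents are final, copy keys/values into two
// parallel fixed-size int arrays sorted by key, trading the hash container's
// per-entry overhead for tight primitive arrays and binary-search lookups.
class FixedArrayMapSketch {
    private final int[] keys;
    private final int[] values;

    FixedArrayMapSketch(int[] unsortedKeys, int[] unsortedValues) {
        // Sort an index permutation by key, then materialize both arrays in key order.
        Integer[] order = new Integer[unsortedKeys.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Integer.compare(unsortedKeys[a], unsortedKeys[b]));
        keys = new int[order.length];
        values = new int[order.length];
        for (int i = 0; i < order.length; i++) {
            keys[i] = unsortedKeys[order[i]];
            values[i] = unsortedValues[order[i]];
        }
    }

    int getOrDefault(int key, int defaultValue) {
        int idx = Arrays.binarySearch(keys, key);
        return idx >= 0 ? values[idx] : defaultValue;
    }
}
```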
public class ArrayUtil {

public static float[] growInRange(float[] array, int minLength, int maxLength) {
I think ES has access to growInRange; I am not sure if any of the changes for growInRange are important?
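For context, a minimal sketch of growInRange semantics: grow the array to hold at least minLength elements, but never beyond maxLength. The doubling policy here is a simplification; real implementations like Lucene's ArrayUtil use an oversizing heuristic to amortize repeated growth.

```java
import java.util.Arrays;

class GrowInRangeSketch {
    // Returns an array of length >= minLength and <= maxLength, reusing the
    // input when it is already big enough. Simple doubling stands in for the
    // real oversize heuristic.
    static float[] growInRange(float[] array, int minLength, int maxLength) {
        if (minLength > maxLength) {
            throw new IllegalArgumentException("minLength must be <= maxLength");
        }
        if (array.length >= minLength) {
            return array; // already big enough, no copy
        }
        int newLength = Math.min(Math.max(minLength, array.length * 2), maxLength);
        return Arrays.copyOf(array, newLength);
    }
}
```

The maxLength bound is the interesting part for HNSW: neighbor arrays have a known upper limit (the max connections per node), so growth can stop exactly there instead of overshooting.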
 * #insertWithOverflow(int, float)} and {@link #add(int, float)}, and provides MIN and MAX heap
 * subclasses.
 */
public class NeighborQueue {
Not sure we need to copy this? I thought this was available outside of Lucene since it's a public class. Maybe Lucene doesn't export the module?
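For readers unfamiliar with the class being copied: a NeighborQueue-style structure keeps the best-scoring neighbors seen so far, with a MIN-heap so insertWithOverflow can evict the current worst in O(log n). This sketch uses java.util.PriorityQueue for clarity; Lucene's actual NeighborQueue packs node id and score into a single long inside a LongHeap.

```java
import java.util.PriorityQueue;

class NeighborQueueSketch {
    private record Neighbor(int node, float score) {}

    // MIN-heap on score: the root is always the worst retained neighbor.
    private final PriorityQueue<Neighbor> heap =
        new PriorityQueue<>((a, b) -> Float.compare(a.score(), b.score()));
    private final int maxSize;

    NeighborQueueSketch(int maxSize) { this.maxSize = maxSize; }

    // Adds the neighbor only if the queue has room or it beats the current worst.
    boolean insertWithOverflow(int node, float score) {
        if (heap.size() < maxSize) {
            heap.add(new Neighbor(node, score));
            return true;
        }
        if (score > heap.peek().score()) {
            heap.poll(); // evict worst
            heap.add(new Neighbor(node, score));
            return true;
        }
        return false;
    }

    int size() { return heap.size(); }
    int topNode() { return heap.peek().node(); }
    float topScore() { return heap.peek().score(); }
}
```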
private HnswUtil() {}

// Finds orphaned components on the graph level.
static List<Component> components(HnswGraph hnsw, int level, FixedBitSet notFullyConnected, int maxConn) throws IOException {
I don't think any of this is used now, maybe we can remove it.
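For reference, what components computes is essentially connected components of one graph level, so that orphaned (unreachable) components can be detected and reconnected. A self-contained sketch of that core idea, with adjacency given as a plain int[][] instead of Lucene's HnswGraph:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

class ComponentsSketch {
    // BFS from each unvisited node; every BFS tree is one connected component.
    // A graph with more than one component has orphans that HNSW repair logic
    // would need to reconnect.
    static List<List<Integer>> components(int[][] neighbors) {
        boolean[] seen = new boolean[neighbors.length];
        List<List<Integer>> result = new ArrayList<>();
        for (int start = 0; start < neighbors.length; start++) {
            if (seen[start]) continue;
            List<Integer> component = new ArrayList<>();
            ArrayDeque<Integer> queue = new ArrayDeque<>();
            queue.add(start);
            seen[start] = true;
            while (!queue.isEmpty()) {
                int node = queue.poll();
                component.add(node);
                for (int n : neighbors[node]) {
                    if (!seen[n]) {
                        seen[n] = true;
                        queue.add(n);
                    }
                }
            }
            result.add(component);
        }
        return result;
    }
}
```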
@Override
public long ramBytesUsed() {
    return BASE_RAM_BYTES_USED + nodes.ramBytesUsed() + scores.ramBytesUsed();
We should update the way Lucene does this (and consequently this) if we merge this. I think this will have a real performance impact: the way it's done now in the builder, every call iterates every array (e.g. every node) and does a calculation. That makes no sense IMO.
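A minimal sketch of the accounting change suggested above: instead of re-walking every per-node array on each ramBytesUsed() call (O(numNodes)), maintain a running total updated at allocation time (O(1) per query). The byte constant and hook names are illustrative, not Lucene's actual RamUsageEstimator values.

```java
class RamAccountingSketch {
    private static final long BASE_RAM_BYTES_USED = 64; // illustrative constant
    private long ramBytesUsed = BASE_RAM_BYTES_USED;

    // Called whenever a node's neighbor arrays are (re)allocated.
    void onAllocation(long bytesAdded) {
        ramBytesUsed += bytesAdded;
    }

    // Called when storage is released or replaced by a smaller allocation.
    void onRelease(long bytesFreed) {
        ramBytesUsed -= bytesFreed;
    }

    // O(1): just report the maintained counter instead of iterating all nodes.
    long ramBytesUsed() {
        return ramBytesUsed;
    }
}
```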
|
@benwtrent @ChrisHegarty I've removed the non-needed classes and tried to prune the changes down to the minimum, including the removal of a new codec name. Do you think there's anything else we can do to reduce the size of this change? It's still a 3K LOC change. Is this something we should add to ES 9.1/8.19?
|
Closing as this is too big a change - let's wait for a Lucene release that contains the code |
apache/lucene#14527 improved the heap usage for HNSW merging. As this change hasn't made it into any Lucene release yet, we'd like to incorporate it into Elasticsearch.

We have copied the Lucene changes into Elasticsearch code and created a new Elasticsearch codec that uses the copy of the Lucene Lucene99HnswVectorsWriter, renamed to ES910HnswReducedHeapVectorsWriter.

Changes done:

- Renamed Elasticsearch900Lucene101Codec to Elasticsearch910Lucene102Codec, to signal the codec is for ES 9.1.0 and Lucene 10.2.
- Added ES910HnswVectorsFormat, which provides the entry point for the writer vector changes. It replaces Lucene99HnswVectorsFormat usages, as it is compatible with it (same file formats).
- Added ES910HnswVectorsWriter, which uses the classes copied over from Lucene. The HNSW improvements are on the merge side, so we're only focused on the writer aspect of the format.
- Added new packages: org.elasticsearch.index.codec.vectors.es910.hnsw, org.elasticsearch.index.codec.vectors.es910.internal.hppc, org.elasticsearch.index.codec.vectors.es910.util.

We can remove these changes when Lucene 10.3 is released and merged into Elasticsearch.