
Feature/collaborative hnsw search#15676

Draft
krickert wants to merge 34 commits into apache:main from ai-pipestream:feature/collaborative-hnsw-search

Conversation

@krickert commented Feb 7, 2026

[HNSW] Collaborative Search via Dynamic Threshold Feedback

Status: Experimental - This change validates the pruning mechanism at the single-graph level, provides the collector infrastructure for multi-segment coordination within an index, and includes simulated multi-shard tests (multiple separate HNSW graphs and separate Directory instances combined via MultiReader) demonstrating cross-shard pruning gains. Full distributed multi-shard testing will happen in a follow-up PR for OpenSearch.

Summary

Enable HNSW graph search to accept externally-updated similarity thresholds during traversal. This allows multiple concurrent search processes (threads, shards, or nodes) to share a global minimum-score bar, pruning each other's search frontiers in real time.

Problem Statement

In current distributed KNN implementations, each shard searches its local HNSW graph in isolation. A shard will continue exploring candidates even when other shards have already found globally superior matches. This redundant traversal wastes CPU and IO, and the cost scales with K and the number of shards.

Proposed Changes

  1. Dynamic Threshold Re-fetching (HnswGraphSearcher.java): Re-read minCompetitiveSimilarity() from the collector on every iteration of the HNSW search loop, rather than only at initialization. If the value has increased (due to an external update), minAcceptedSimilarity is raised and the search frontier is pruned accordingly.
  2. CollaborativeKnnCollector: A KnnCollector.Decorator that wraps a standard TopKnnCollector and an AtomicInteger (storing float bits via Float.floatToRawIntBits). Its minCompetitiveSimilarity() returns max(local, global), allowing external signals to raise the pruning bar. Updates use a lock-free CAS loop.
  3. CollaborativeKnnCollectorManager: Creates per-segment CollaborativeKnnCollector instances that share a single AtomicInteger, enabling threshold propagation across leaf segments within a node.
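The collector mechanics in (2) and (3) can be sketched roughly as follows. This is a minimal reconstruction from the description above, not the PR's actual code; the class name `SharedBar` is mine, and only the two methods named in the text are modeled:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the shared-bar mechanics (reconstructed, not the PR's code).
// The global bar is a float stored as raw int bits in a shared AtomicInteger.
class SharedBar {
  private final AtomicInteger globalMinSimBits =
      new AtomicInteger(Float.floatToRawIntBits(Float.NEGATIVE_INFINITY));

  // Hot-path read: max(local, global), so an external update can only raise the bar.
  float minCompetitiveSimilarity(float localMin) {
    float global = Float.intBitsToFloat(globalMinSimBits.get()); // volatile read
    return Math.max(localMin, global);
  }

  // Lock-free CAS loop: monotonic, the bar only goes up.
  void updateGlobalMinSimilarity(float candidate) {
    int newBits = Float.floatToRawIntBits(candidate);
    while (true) {
      int cur = globalMinSimBits.get();
      if (candidate <= Float.intBitsToFloat(cur)) {
        return; // not an improvement; drop it
      }
      if (globalMinSimBits.compareAndSet(cur, newBits)) {
        return;
      }
    }
  }
}
```

Per-segment collectors created by the manager would all hold a reference to the same shared atomic, which is how the threshold propagates across leaves.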

Test Results and Methodology

Unit tests measure the number of graph nodes visited under two conditions:

  1. Standard search: A normal TopKnnCollector with no external threshold.
  2. Collaborative search: A CollaborativeKnnCollector where the global bar is set using a score derived from the standard search's results (simulating a "discovered" top-K score).
| Scenario | Standard Visited | Collaborative Visited | Reduction |
|---|---|---|---|
| Basic (K=10, 2-dim, 20K docs) | 135 | 10 | 92.6% |
| High-Dimension (K=100, 128-dim, 10K docs) | ~6,600 | ~400 | ~94% |
| High-K (K=1000, 16-dim, 30K docs) | ~11,500 | ~350 | ~97% |

Results vary across runs because Lucene's test framework randomizes graph construction parameters (maxConn, beamWidth). A subsequent run with smaller random values produced:

| Scenario | Standard Visited | Collaborative Visited | Reduction |
|---|---|---|---|
| Basic (K=10, 2-dim, 20K docs) | 59 | 43 | ~27% |
| High-Dimension (K=100, 128-dim, 10K docs) | 2,620 | 55 | ~98% |
| High-K (K=1000, 16-dim, 30K docs) | 8,762 | 1,756 | ~80% |

The pruning is consistently effective across random seeds, with the strongest gains in high-dimension and high-K scenarios where graph traversal is most expensive. The basic scenario is more sensitive to graph topology - smaller graphs with fewer connections have less room to prune.

Important Caveats

These numbers represent upper-bound savings under idealized conditions:

  • The global bar is set to a strong score before the collaborative search starts, simulating a scenario where another shard has already completed its search in full. In a real distributed system, the bar ramps up incrementally.
  • There is no network round-trip latency in the test.
  • The tests run against a single graph on a single node.

The fundamental mechanism - raising the threshold mid-traversal to skip provably non-competitive subgraphs - is not test-specific. Real-world savings should still be significant, particularly for high-K queries (K >= 100) and dense embedding spaces.

Thread Safety

The implementation adds no locks or synchronization to the HNSW search hot path. Visibility of the shared threshold is guaranteed by the volatile read semantics of AtomicInteger.get(), which the collector calls on every loop iteration. Updates from external threads become visible on the next iteration without explicit memory fencing.

The CollaborativeKnnCollector.updateGlobalMinSimilarity() method uses a standard CAS loop to ensure monotonic updates (the bar can only go up, never down).
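To illustrate the effect on the loop itself, here is a self-contained toy model. It is a deliberate simplification, not the real HnswGraphSearcher: candidates are bare scores in a max-heap, and the "collector" is just the shared atomic described above.

```java
import java.util.Collections;
import java.util.PriorityQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the modified search loop. Re-reading the shared bar each
// iteration lets an externally raised threshold prune the frontier mid-search.
class ToySearchLoop {
  static int search(PriorityQueue<Float> candidates, AtomicInteger sharedBarBits) {
    int visited = 0;
    float minAccepted = Float.intBitsToFloat(sharedBarBits.get());
    while (!candidates.isEmpty()) {
      // Volatile read on every iteration; no locks on the hot path.
      float live = Float.intBitsToFloat(sharedBarBits.get());
      if (live > minAccepted) {
        minAccepted = live; // external update raised the bar
      }
      if (candidates.peek() < minAccepted) {
        break; // best remaining candidate is provably non-competitive
      }
      candidates.poll();
      visited++; // "visit" the node
    }
    return visited;
  }
}
```

With the bar preset to 0.75 and candidates {0.9, 0.8, 0.7, 0.6}, only the first two are visited; with the bar at negative infinity, the loop degenerates to the standard full traversal.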

Design Note: Threshold Propagation Is External

The Lucene-layer change is intentionally passive: it reads the global bar but does not write it. Raising the bar based on incoming results from other shards is the responsibility of the orchestration layer. This keeps the Lucene change minimal and avoids coupling graph traversal logic to any specific coordination protocol.

Use Case: Streaming Coordinator

This change is a prerequisite for high-performance distributed KNN search. In a streaming model, the coordinator can broadcast the current "Global Kth Score" back to all shards. Shards running this modified searcher will instantly prune their frontier, terminating their local search as soon as it is mathematically impossible to improve the global result set.

Multi-Index Performance Results

The single-graph tests above prove the mechanism works. The following tests measure what happens when collaborative pruning is applied across multiple separate HNSW graphs - the scenario that maps directly to cross-shard KNN search in OpenSearch.

Test: Multi-Index High-K (low-level HNSW graphs)

5 separate HNSW graphs, 5000 vectors each (25K total), dim=32, K=500. Standard search queries each graph independently and merges. Collaborative search pre-sets the pruning bar to the median score of the merged top-500, then searches all 5 graphs with a shared AtomicInteger.

| Run | Standard Total Visited | Collaborative Total Visited | Reduction |
|---|---|---|---|
| 1 | 20,132 | 5,393 | 73.2% |
| 2 | 20,263 | 4,417 | 78.2% |

Test: Multi-Index End-to-End (IndexSearcher + MultiReader)

5 separate Directory instances, 2000 vectors each (10K total), dim=32, K=100. Combined via MultiReader and searched through IndexSearcher - the same code path OpenSearch uses. Visited counts captured via mergeLeafResults override.

| Run | Standard Visited | Collaborative Visited | Reduction |
|---|---|---|---|
| 1 | 4,610 | 135 | 97.1% |

What this means for high-K cross-shard search

The cost of KNN search scales with K and the number of shards. Without collaborative pruning, each shard does full work independently - a K=2000 query across 20 shards means 20 full HNSW traversals with no shared knowledge. With collaborative pruning, the bar rises as soon as any shard finds good results, and every other shard prunes accordingly. The effect compounds: more shards means the bar rises faster, which means more pruning per shard.

The numbers above are from K=100 and K=500 across 5 graphs. At K=2000 across 20 shards, the pruning surface is larger and the ratio of useful-to-wasted traversal work is worse in the standard case - which means collaborative pruning has even more room to cut. Queries that are currently too expensive to run (high K, many shards, high-dimensional embeddings) become feasible.

The shared global minimum similarity is a 32-bit float stored via
Float.floatToRawIntBits. Using AtomicLong required unsafe narrowing
casts ((int) globalMinSimBits.get()) on every read, which would
silently truncate if the upper 32 bits were ever non-zero.

AtomicInteger is the natural fit: it matches the 32-bit width of a
float's bit representation, eliminates all narrowing casts in both
the hot-path read (minCompetitiveSimilarity) and the CAS update loop
(updateGlobalMinSimilarity), and retains identical volatile/CAS
memory ordering guarantees.
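The round trip this relies on can be checked in two lines (illustrative only; the helper class is mine):

```java
// A float's bit pattern is exactly 32 bits, so an AtomicInteger carries it
// losslessly with no narrowing casts in either direction.
class FloatBitsDemo {
  static int encode(float f) {
    return Float.floatToRawIntBits(f);
  }

  static float decode(int bits) {
    return Float.intBitsToFloat(bits);
  }
}
```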

Changed in:
- CollaborativeKnnCollector: field type, constructors, minCompetitiveSimilarity(), updateGlobalMinSimilarity()
- CollaborativeKnnCollectorManager: field type, constructor
- TestCollaborativeHnswSearch: all AtomicLong instantiations
Introduces a test to verify collaborative pruning across multiple index segments, ensuring shared thresholds affect HNSW traversal correctly.
Add two tests simulating cross-shard KNN search with collaborative pruning:

- testMultiIndexHighKPerformance: 5 separate HNSW graphs (5000 vectors each),
  K=500, measures 73-78% reduction in visited nodes vs standard search.

- testMultiIndexCollaborativeEndToEnd: 5 separate Directory instances combined
  via MultiReader through IndexSearcher, K=100, measures 97% reduction using
  TrackingKnnQuery and TrackingCollaborativeKnnQuery to capture per-leaf
  visited counts through mergeLeafResults.
The two multi-index performance tests (testMultiIndexHighKPerformance,
testMultiIndexCollaborativeEndToEnd) take ~10-20s each. Tag them @nightly
so they are skipped during normal test runs and only execute with
-Dtests.nightly=true.
These two single-graph tests account for ~88% of the suite time (~15s
of ~17s) due to large graph construction (30K vectors at K=1000, 10K
vectors at 128 dimensions). Moving them to @nightly brings the default
suite from ~17s down to ~2s while keeping the basic collaborative
pruning test and multi-segment test in every run.

All four collaborative tests still run with -Dtests.nightly=true.
@benwtrent (Member) left a comment:

I think the idea is an interesting one.

Does your benchmark compare recall? Improvements here are only useful if the recall/visited (latency) curve is actually improved.

I suggest using Lucene Util: https://github.com/mikemccand/luceneutil/

Since for a single Lucene index, the main impact is for reducing comparisons across segments, please include that information.

If we are sharing scores, please take a look at how this is done elsewhere in Lucene with LongAccumulator and the min-competitive logic that takes doc-id tie-breaking into account.
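For reference, a sketch in the spirit of Lucene's LongAccumulator-based score sharing. This is simplified: Lucene's real min-competitive machinery also packs a doc id/docBase into the low bits for tie-breaking, which is omitted here.

```java
import java.util.concurrent.LongAccumulator;

// Simplified sketch of LongAccumulator-based score sharing. For non-negative
// floats, floatToIntBits is order-preserving, so placing the score bits in the
// upper 32 bits makes plain Long::max agree with "max score". Lucene's real
// accumulator also packs a doc id into the low bits for tie-breaking.
class GlobalScoreBar {
  private final LongAccumulator acc = new LongAccumulator(Long::max, Long.MIN_VALUE);

  void accumulate(float score) {
    acc.accumulate(((long) Float.floatToIntBits(score)) << 32);
  }

  float get() {
    long v = acc.get();
    return v == Long.MIN_VALUE
        ? Float.NEGATIVE_INFINITY
        : Float.intBitsToFloat((int) (v >>> 32));
  }
}
```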

Fix testMultiIndexHighKPerformance using constant docBase=1000000 which
prevented pruning from ever activating. Use i*vectorsPerGraph to simulate
real multi-segment layout.

Add brute-force recall measurement to all single-graph pruning tests and
a new testMultiSegmentCombinedRecall that builds multiple HNSW graphs,
searches with both standard and collaborative collectors, merges results,
and compares against exact top-k.

Update HnswGraphSearcher comment to reference LongAccumulator instead of
bi-directional stream. Add Javadoc to minCompetitiveSimilarity documenting
the segment-0 tie-breaking design tradeoff.
@krickert (Author) commented Feb 7, 2026

Still working through the details and creating a test that produces a matrix of values to show the speedup. You pointed out exactly the direction I needed, thanks. I'll have more updates and comment when I'm done.

@navneet1v (Contributor) commented:

Hi @krickert
This is an interesting idea, and I think it extends the current implementation of min-competitive score from segments to shards (aka different Lucene indexes).

+1 on the idea.

…ment sweep test

CollaborativeKnnCollector.collect() now shares the k-th best score (floor)
instead of every collected doc's score, maintaining 0.995 recall while still
enabling cross-segment pruning. The real-world test sweeps 4/8/16/32 segments
using 73k 1024-dim embeddings and is double-gated behind @monster and the
tests.embeddings.dir system property.
@krickert (Author)

I've been digging into the recall issues from the distributed simulations (4, 8, and 16 shards, 1.47M 1024-dim vectors). Rerunning on clean, deduped data and instrumenting per-shard behavior has uncovered a few hard-to-diagnose problems:

  1. Entry Point Protection - a lag threshold: A high global bar arriving early in a search can prune a shard's HNSW entry point before it ever reaches its local high-similarity cluster. Current workaround: a constant-node guard (first 100 nodes) to let every shard get into its query neighborhood before global pruning kicks in. It works, but it's a blunt instrument and took away the speed improvement I had hoped for.

  2. Tie-Break Paralysis: The original docBase safety logic was too restrictive in multi-shard environments, effectively disabling pruning for shards with lower IDs. I've shifted to prioritizing pruning leverage with a safety slack (0.01f) for floating-point jitter, though I don't think this holds up.

  3. Coherence Contention: High-frequency volatile reads of the global bar in the HNSW hot loop were creating memory-bus contention on the multi-core system I'm coding on. I changed HnswGraphSearcher to help, but that's far too invasive; I'll continue to avoid changing core classes.

  4. Recall Recovery: With some fixes, the K=100 recall now matches baseline (0.806 vs 0.796) on deduped data, and K=10 has recovered from 0.31 to 0.66. Better, but still... bad.

The parameters that scale with K (especially for K >= 1000) aren't as straightforward as I had thought. Still working through it and open to ideas if anyone sees a cleaner approach.

In the meantime, I will attempt a more realistic approach and create a per-index HTTP/2 service that serves up Lucene, to see if real-network collaborative pruning can work.

More to come...

@benwtrent (Member)

FYI, I suspect any collaborative search across shards will have an impact on recall with the same parameters (unless finely tuned). The key thing is the visited/recall curve. Can we get the same recall with fewer visited?

I suspect real-world Lucene indices (just like Lucene segments) to be a random sample of the entire corpus. Relevant vectors should be expected to be evenly distributed between all indices. This is the assumption that Lucene makes with segments and its "optimistic search" pattern. This same assumption will be required by this idea.

Anything that searches one shard and then simply sets competitiveness without taking this into account will be useless.

@krickert (Author) commented Feb 10, 2026

Collaborative HNSW Shard Pruning: Visited/Recall Analysis

Yup, @benwtrent. The visited/recall curve is the only honest way to judge this.

Here is what I am seeing on 1.47M deduped 1024-dim vectors across 8 independent shards (evenly distributed):

| K | Mode | Merged Recall | Nodes Visited |
|---|---|---|---|
| 10 | Baseline | 0.738 | 22,788 |
| 10 | Collaborative | 0.664 | 16,168 |
| 100 | Baseline | 0.796 | 50,785 |
| 100 | Collaborative | 0.806 | 77,716 |

At K=10, collaborative pruning "wins" on performance but fails on recall - it's essentially cutting off the bridge paths required to reach local clusters. At K=100, we recover the recall, but the safety mechanisms (delayed bar application and slack buffers) actually cause us to over-explore, doing more work than the baseline.

The difficulty is that unlike Lucene segments, independent shard indices don't share a consistent global HNSW topology. The assumption that shards are "random samples" is theoretically sound, but in practice, sharing the raw kth-best score from the first-finishing shard is too aggressive. It doesn't account for the variance in when a shard actually "finds" its target neighborhood during traversal.

I'm prototyping a distributed gRPC setup to test this with realistic network isolation (to eliminate the memory-bus contention we see in a single JVM). However, the core question remains: should the collaborative bar be a raw score, or a heuristic that factors in the shard count and expected local density? My ultimate goal is to make very large result sets ($K \ge 1000$) fast by allowing irrelevant shards to "bail out" early, but I'm seeing that I'll need a better way to derive that global threshold.

@krickert (Author)

Distributed Validation Progress: Collaborative HNSW Pruning

So testing is now transitioning from single-JVM simulations to a physical 8-node cluster (2.5GbE, 1TB NVMe, 16GB RAM per node) to isolate variables like CPU contention and IO over-caching. I've discovered that single-host simulations masked a "Coherence Tax" of distributed HNSW search. So testing on real hardware over a LAN is the only way to measure the trade-off between coordination overhead and compute savings.

Initial tests are showing slightly slower times when the distributed environment is simulated on a local system - but that setup doesn't add the latency we would see in the real world. To finally confirm whether this implementation is right, it's easier to just make a distributed search PoC that runs on an HTTP/2 stream.

100% of this work will be available in the lucene-test-data repo, which can recreate everything up to the shard distribution.


The Theory Revisited

So the theory is sound - but I've not been able to successfully demonstrate the savings, either because the tests were flawed or because the idea simply carries too much overhead to hold up.

In a standard sharded HNSW search, every shard performs a full graph traversal to find its local top-K, unaware that its candidates may be far below the global similarity threshold. This results in significant redundant compute, especially for high-K queries ($K \ge 1000$).

We want to eliminate these redundant cycles.

Collaborative Pruning introduces a global minimum similarity bar synchronized across nodes via HTTP/2 streams. By injecting this bar into the searchLevel loops of the HNSW graph searcher, we can trigger early termination of graph traversals that have zero probability of entering the final global result set.


Benchmarking Dimensions & Stats

To prove this, here's a chart of data I'll be collecting:

| Dimension | Values |
|---|---|
| Result Set Size (K) | 3, 50, 100, 200, 500, 1,000, 5,000 |
| Shard Distribution | 1, 2, 4, 8, 16 shards |
| Data Density | Diverse corpus (Wikipedia, Literature, Technical Papers) embedded via BGE-M3 |

Baselines

  • Recall: True KNN (brute-force exact search) per query, used to measure recall@K for both standard and collaborative modes.
  • Performance Baseline: Standard Lucene HNSW search (no collaborative pruning) across the same K x shard matrix, providing the control for all latency and node-visit deltas.

Key Metrics

| Metric | Purpose |
|---|---|
| Recall@K vs. True KNN | Ensuring collaborative pruning does not degrade result quality relative to exact search |
| HNSW Node Visit Delta | Reduction in total graph nodes explored per query, the "pure" algorithmic win |
| Tail Latency (P99) | Validating if collaborative pruning mitigates "heavy" query outliers |
| Standard Deviation of Work | How effectively the global bar balances load in "skewed" data scenarios |
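The Recall@K metric is just set overlap against the exact (brute-force) top-K. A hypothetical helper, names mine rather than from the benchmark code:

```java
import java.util.HashSet;
import java.util.Set;

// Recall@K: fraction of the exact top-K doc ids that also appear in the
// approximate (HNSW) top-K. 1.0 means the ANN result matches exact search.
class RecallAtK {
  static double recall(int[] approxTopK, int[] exactTopK) {
    Set<Integer> truth = new HashSet<>();
    for (int id : exactTopK) truth.add(id);
    int hits = 0;
    for (int id : approxTopK) {
      if (truth.contains(id)) hits++;
    }
    return (double) hits / exactTopK.length;
  }
}
```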

I'll try to infer the collaborative overhead, but comparing to a standard baseline and measuring latency should be good enough to demonstrate it.


Execution Plan

  1. Staging: Python-based embedding generation using BGE-M3, followed by a deduplication step to ensure a clean corpus.
  2. Indexing: A 16-worker process generates 16 base shards, which are then merged into 8, 4, and 2-shard indices for comparative testing.
  3. Service: A gRPC-based symmetric peer architecture utilizing ScaleCube for gossip-style node discovery.
  4. Simulation: Nodes will be warmed (OS page cache) to ensure we are measuring the algorithm's efficiency rather than raw disk IO performance.

System Architecture

graph TD
    subgraph "Offline Preparation"
        A[Raw Text Data] -->|BGE-M3 Python| B[.vec Embeddings]
        B -->|Indexer Utility| C[16 Shards]
        C -->|Merge| D[8 / 4 / 2 / 1 Shard Indices]
    end

    subgraph "Distributed Cluster: 8 Nodes"
        E[Search Request] --> Node0((Node 0: Coordinator))

        subgraph "Symmetric Peer Node"
            direction TB
            N((Peer Node)) --> Local[Lucene HNSW Shard]
            Local <--> Collab[CollaborativeKnnCollector]
            Collab <--> Gossip{{ScaleCube / gRPC Stream}}
        end

        Node0 -->|gRPC Search| Node1(Node 1)
        Node0 -->|gRPC Search| Node2(Node 2)
        Node0 -->|gRPC Search| NodeN(...)

        Node1 <-->|Bi-Di Threshold Sync| Node0
        Node2 <-->|Bi-Di Threshold Sync| Node1
        NodeN <-->|Bi-Di Threshold Sync| Node0
    end

    subgraph "Merging"
        Node0 -->|Fan-in Results| Merge[Heap Merge & Sort]
        Merge --> Final[Top K Results]
    end

…ibuted safety

This change hardens the collaborative ANN pruning mechanism to ensure
high recall in distributed environments while maintaining significant
pruning leverage during traversal.

Key Refinements:
- Implement "Lagging Threshold" (warm-up) in CollaborativeKnnCollector:
  The global pruning bar is now ignored until the local queue is full
  and a minimum number of nodes (2*k) have been visited. This prevents
  the "Entry Point Trap" where high global bars from other shards could
  cause premature termination at local bridge nodes.
- Introduce "Safety Slack Buffer": Applied a 0.05f slack to the global
  threshold to allow HNSW traversal through similarity "valleys"
  required to reach high-scoring clusters in independent graphs.
- Update HnswGraphSearcher threshold logic: Switched to Math.nextUp()
  for dynamic similarity updates to match standard Lucene behavior and
  relaxed bulk-pruning checks to '>=' to correctly handle score ties.
- Refactor Javadocs: Updated documentation to be protocol-neutral,
  focusing on general distributed search requirements and global
  tie-breaking priority via docId mapping.

Integration & Cleanup:
- Integrated collaborative search support into luceneutil (KnnGraphTester
  and knnPerfTest.py) to enable standardized performance benchmarking.
- Removed experimental nightly/monster tests from core to reduce cruft.
- Fixed luceneutil SUMMARY output to include collaborative status.
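The guards in this commit can be condensed into one decision function. This is a hypothetical reconstruction: the names are mine, and only the constants quoted in the commit text (2*k warm-up, 0.05f slack) are taken from it.

```java
// Sketch of the guards described in the commit (names hypothetical):
// the global bar is ignored until warm-up completes, and a slack is
// subtracted so traversal can cross similarity "valleys".
class GuardedBar {
  static final float SLACK = 0.05f;

  static float effectiveBar(
      float globalBar, float localBar, int visited, int k, boolean queueFull) {
    // Lagging threshold: no global pruning before the local queue is full
    // and at least 2*k nodes have been visited.
    if (!queueFull || visited < 2 * k) {
      return localBar;
    }
    return Math.max(localBar, globalBar - SLACK);
  }
}
```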
…ibuted safety

This change hardens the collaborative ANN pruning mechanism to ensure
high recall in distributed environments while maintaining significant
pruning leverage during traversal.

Key Refinements:
- Implement 'Lagging Threshold' (warm-up) in CollaborativeKnnCollector:
  The global pruning bar is now ignored until a minimum number of
  nodes (100) have been visited. This prevents the 'Entry Point Trap'
  where high global bars from other shards could cause premature
  termination at local bridge nodes.
- Introduce Safety Slack Buffer: Applied a 0.01f slack to the global
  threshold to allow HNSW traversal through similarity 'valleys'
  required to reach high-scoring clusters in independent graphs.
- Implement Smart Accumulation: Global bar updates are now debounced by
  0.001f improvement to reduce atomic contention across threads.
- Update HnswGraphSearcher threshold logic: Switched to minimal
  Math.nextUp() similarity updates to match standard Lucene behavior.
- Support docIdMapper: Added IntUnaryOperator support to ensure
  globally consistent tie-breaking across shards.
…lMax)

- Revert to Global Floor vs Local Max pruning
- earlyTerminated(): stop when localMax < globalFloor (after 100 visits)
- minCompetitiveSimilarity(): local bar only (pathfinding unchanged)
- collect(): track localMaxScore, push floor with lastSharedScore guard
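The "Global Floor vs Local Max" rule above can be sketched as a single predicate (a hypothetical reconstruction; names and the 100-visit constant come from the commit text):

```java
// A shard bails out once the best score it has seen locally cannot reach
// the globally shared k-th-best floor, but only after a warm-up period so
// the entry point is not pruned before reaching its local neighborhood.
class FloorVsMax {
  static final int WARMUP_VISITS = 100;

  static boolean earlyTerminated(float localMaxScore, float globalFloor, int visited) {
    return visited >= WARMUP_VISITS && localMaxScore < globalFloor;
  }
}
```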
@krickert (Author)

@benwtrent @navneet1v - quick status update on a distributed KNN prototype.

I implemented a gRPC/HTTP2 streaming coordinator + shard model for collaborative HNSW search (outside OpenSearch REST for now), and ran initial benchmarks on 8 Raspberry Pi 5 nodes (NVMe) plus local runs.

Early results

| Metric | Observation |
|---|---|
| Recall | Matched the standard Lucene baseline in tested runs |
| Node visits | Reduced by ~50% on typical queries |
| End-to-end latency | Improved by ~40–50% in the same scenarios |

In a heterogeneous setup, adding one higher-performance node improved global pruning and produced larger gains (up to ~65% vs the same cluster without that node in current tests).

Repo

ai-pipestream/distributed-search - grpc streaming service PoC

Next steps

  • Preparing a fuller write-up with automated/reproducible runs
  • Current coverage includes K up to 5000
  • Next evaluation target: larger-data (100+GB) with large K values

Note: this depends on HTTP/2 streaming - it will not be fast over HTTP/1.

@benwtrent (Member)

@krickert

Your final numbers don't indicate recall. Please, we need to see what the Pareto frontier (how recall changes with increasing efSearch) looks like for the following scenarios as they reflect optimal, baseline, and candidate:

  • All vectors within a single shard (this is baseline optimal)
  • Vectors independently searched between shards (baseline multi-index/shard)
  • Vectors searched with your collaborative search (candidate multi-index/shard).

Last I saw from your benchmarks, at k:100 collaborative was much worse.

In most real world data, each index/shard will have a random subset of the entire dataset, your tests should reflect this as well.

The Pareto frontier should likely be "recall vs. total vectors compared". The latter two benchmarks should be done against the exact same graphs/indices, as reindexing isn't necessary to test with the collaborative searcher.

I don't think benchmarking between machines is necessary. If the collaborative searching isn't useful when all shards are on the same machine (thus sharing information overhead is at its lowest), I doubt it will be helpful at all once overheads increase.

@krickert (Author)

> @krickert
>
> Your final numbers don't indicate recall. Please, we need to see what the Pareto frontier (how recall changes with increasing efSearch) looks like for the following scenarios as they reflect optimal, baseline, and candidate:
>
>   • All vectors within a single shard (this is baseline optimal)
>   • Vectors independently searched between shards (baseline multi-index/shard)
>   • Vectors searched with your collaborative search (candidate multi-index/shard).
>
> Last I saw from your benchmarks, at k:100 collaborative was much worse.
>
> In most real world data, each index/shard will have a random subset of the entire dataset, your tests should reflect this as well.
>
> The Pareto frontier should likely be "recall vs. total vectors compared". And that for the latter two benchmarks, they are done against the exact same graphs/indices as reindexing isn't necessary to test with the collaborative searcher.
>
> I don't think benchmarking between machines is necessary. If the collaborative searching isn't useful when all shards are on the same machine (thus sharing information overhead is at its lowest), I doubt it will be helpful at all once overheads increase.

No problem.

So the benchmark comparison

@krickert (Author) commented Feb 19, 2026

NOTE: removed due to flawed test recordings - see results below instead.

This post had flawed results; please see the discussion below.

@benwtrent (Member)

I don't understand your graphs. I would expect the following results:

  • Single shard to have the lowest vector ops and the lowest recall, but the best curve
  • Multi-shard independent to have much higher recall and roughly shard-count times more vector ops than the single shard (thereabouts)
  • Multi-shard collaborative to have lower vector ops and lower recall, moving closer to single shard.

Your graphs show there is zero benefit for collaborative and that splitting data across shards significantly reduces recall!

@krickert (Author)

You’re right, I mixed objectives. I’ll focus on recall next, specifically recall vs efSearch across three scenarios:

  • single-shard baseline
  • multi-shard independent
  • multi-shard collaborative on the same shard graphs

I’ll treat work/latency as secondary and keep them out of the main conclusion for now.

Next I’ll test whether recall can be improved by adding shard-aware index-time context instead of relying on search-time coordination alone. I’ll prototype a lightweight global routing layer and cross-shard neighborhood metadata so shard traversal starts with better global priors.

I think the core issue is that each shard currently builds and searches its own local ANN neighborhood frontier. A single shard can look strong, but once we merge across many shard-local frontiers, recall drops much harder than I expected. It's honestly more severe than I thought, and that's exactly why I think index-time global awareness can help. I'm reading through some relevant papers, and I'll test out a few more scenarios.

Thanks for being patient, by the way. I really want to push hard for making high-K search the norm.

@krickert krickert marked this pull request as draft February 20, 2026 04:34
@krickert (Author)

Reporting numbers were wrong; I believe they're right now, showing high recall across the board. Changed to a draft - I'll do more testing and post here tomorrow.

@krickert (Author)

On this low-latency, 16-shard setup, collaborative and independent sit on top of each other: recall is the same and total lookups are the same, so the plot shows negligible impact of collaborative search here. The takeaway is that on a fast machine with modest index size, we don’t see a recall or work tradeoff yet.

[chart: recall and total lookups, collaborative vs. independent]

This chart shows that pruning does occur (lookups_saved > 0) and that the gains are larger when shards do more work (e.g. higher ef or larger index). So the mechanism works; the small effect in Chart 1 is because total work per query is already small in this environment.

[chart: lookups_saved across ef and index-size settings]

The Pareto plot shows the same story: collaborative and independent trace the same recall–latency and recall-work curves on this setup, so low latency (and current index size) doesn’t hurt recall or performance - we just don’t see a visible gain yet. The expectation is that a much larger index and/or higher-latency setting would show a clearer separation and larger compute savings from collaborative search.

[chart: Pareto recall–latency and recall–work curves]

Next steps (suggestions welcome, highly encouraged)

  1. Vastly increase index size (e.g. 10–20×) so the system is actually stressed; we expect much larger collaborative gains, consistent with the savings seen on higher-latency (2.5 Gbit) setups.

  2. Test relevant vs. non‑relevant (but sane) queries on that larger index to reveal any impact of data voids or query distribution.

boolean shouldExploreMinSim = true;
while (candidates.size() > 0 && results.earlyTerminated() == false) {
// Update the threshold dynamically from the collector to allow external pruning.
float liveMinSimilarity = results.minCompetitiveSimilarity();
A contributor commented:

what does this do when we are not using external pruning?

@krickert (Author) replied:

When we're not using external pruning, results.minCompetitiveSimilarity() still returns the minimum competitive similarity of the current top‑K results, but only from this shard’s collector.

So it's the same kind of threshold (used to prune the graph: skip nodes that can’t beat the worst of the current top‑K), but it's not updated from other shards.

The loop and the pruning logic are unchanged; only the source of the threshold is internal (this shard) instead of external (collaborative). In other words: same API, same pruning, no cross-shard updates when external pruning is off.

@benwtrent (Member)

> Vastly increase index size (e.g. 10–20×) so the system is actually stressed; we expect much larger collaborative gains, consistent with the savings seen on higher-latency (2.5 Gbit) setups.

I don't see how this will show anything different? What is the size of your data set now?

I would assume about 1M per shard should be plenty to give any indication that this would prove useful.

However, I am not sure a naive sharing like this will actually work without other orchestration (e.g. routing certain clusters of vectors to shards), which Lucene just won't do because Lucene is the shard.

I do think there is something to searching multiple graphs in parallel (e.g. optimistic searching like Lucene does with segments). But this would require much more communication, orchestration, and work than simply sharing the min_competitive score.

and/or higher-latency setting would show a clearer separation

How would that show any improvement? If sharing information doesn't help when the latency of communication is near zero, how would it improve when the latency of communication increases significantly? That just means sharing (the key point of this algorithm) is now more expensive.

@krickert
Author

First, these are all great questions, and they get to the heart of why I should increase the collection size: to demonstrate the gains I saw with my home-lab distributed setup.

tl;dr - I suspect pruning can only help when there is enough latency between shards and searches take over 10ms. My home lab, built from cheap machines, was a good candidate to show this, and it did show significant improvement.

I can rerun the same tests I showed you in the slow environment, because the testing harness that runs on localhost is a real streaming distributed-search harness. But I also need to show this on a localhost multi-shard setup, where latency is low but computation is heavy.

I don't see how this will show anything different? What is the size of your data set now?

Small. Too small. From the shard dirs:

  • 1 shard (full index, ~73K docs): ~290 MB on disk
  • 16 shards (all 16 together): ~289 MB total, i.e. ~18 MB per shard

So the total index size is in the hundreds of MB (~0.3 GB) - nowhere near a 10M- or 100M-doc corpus. That's for 73K vectors × 1024 dims (float32) plus the HNSW graph.

My machine has 128 GB of RAM and the drive reads at 20 GB/s, so the entire index is certainly in the OS disk cache, and the drive alone is faster than the Raspberry Pi's memory. That's why I need to add latency to the setup - the two environments test two different extremes. The larger machine doesn't break a sweat with these tests (entire runs finish in the low-ms range). We want to challenge the machine with at least 250ms queries, like I did with the Raspberry Pi.

We're well below 1M per shard. The point of going 10–20x larger isn't that a bigger index by itself proves anything; it's that with more work per shard per query (more graph to traverse), there's more for pruning to actually cut.

The problem is that the entire corpus sits in disk cache, so the work we do is nearly instant. That's why you're not seeing it kick in. Collaboration doesn't seem to affect overall search speed because there's a dedicated HTTP/2 streaming connection that is always on during the search; its notification path takes less than 1ms. But with collaboration turned on, I can show a 50% boost in speed and a 50% savings in CPU in some situations - which is why I tested this on a Raspberry Pi: it forced latency and "simulated" a larger corpus.

But when I brought the index size up and ran the searches concurrently, I saw it outperform the traditional search - because the traditional search waits for every shard's full result set when it shouldn't have to. That's where this shines.

I can measure the overhead to some degree. I logged the events in the coordination layer, and you can see the trimming live in the logs. But don't take my word for it right now - if I just give it a large corpus, I'm convinced you'll see it for yourself.

With 73K docs and a fast machine, each shard finishes so quickly that by the time a useful min is shared, the others are often done - so we don't see a difference. We'd expect ~1M per shard to be enough to see whether sharing the min_competitive score helps under this setup; we simply haven't run at that scale yet. I was thinking of using court records as the data set - 11M court documents that, once chunked, will certainly exceed 100M chunks.

How would [higher-latency setting] show any improvement? If sharing doesn’t help when latency is near zero, how would it improve when latency increases? That just means sharing is now more expensive.

You're right that higher latency doesn't make sharing better - it makes it more expensive. The point wasn't that adding latency produces improvement; it was that running all shards on localhost is an unrealistic test for distributed search, with latency lower than you'd see in most setups.

In my tested lab environment (2.5 Gbit, more latency), we did see larger savings (lookups_saved) AND a large reduction in latency - but just enough to make me realize I need more testing (much like a zero-latency connection is unrealistic, so too is assuming the world will run Lucene entirely on Raspberry Pis).

So I have two setups:

  • one setup where the benefit was visible but too slow to be realistic - though great for simulating a high-latency environment
  • this local one where it wasn't visible (but showed no regression), because the machine is too fast

The interpretation I'm suggesting is that in that other setup there was more work per query (and/or slower shards), so pruning had something to cut; here, work per query is so small that pruning barely shows up. So "higher-latency setting" was shorthand for "the environment where we already saw the gain," not a claim that increasing latency causes the gain.

So I'm trying to reproduce that kind of "more work per query" locally (e.g. with a much larger index) to see if the same benefit appears.

3. Naive sharding / orchestration

We’re testing naive sharding (no cluster-based routing; Lucene is the shard) on purpose: the question is whether only sharing the min_competitive score across shards helps in that setting.

We're not claiming Lucene will do routing or orchestration - it never should. But to enable collaborative search, the threshold needs to be exposed. And to test it, a distributed search harness had to be built.

We agree that more orchestration (e.g. routing clusters to shards, or more optimistic/segment-like search) would mean more work and more communication than just sharing the min score - that's something I'll also measure, and it will ultimately be up to the orchestration writers. gRPC works great for this; I was able to code it in a few hours. HTTP/2 or HTTP/3 direct, a simple UDP packet, and more could easily be used too. REST had too much overhead - I tried it first.

The tests suggested this overhead was minimal. I previously added a rate limiter so a shard could only share within a time threshold, to prevent flooding, but it made the code ugly and was a premature optimization - so far I don't see coordination being an issue even on the fast machine.

If you look at the code, there's an initial wait before we use the shared value to terminate, but not before we share. In detail: we wait for 100 node visits before using the shared min to terminate, but we never wait before sharing our own min.

Using the shared min (early termination)
In CollaborativeKnnCollector.earlyTerminated():

```java
if (visitedCount() < GLOBAL_BAR_MIN_VISITS) return false;
```

GLOBAL_BAR_MIN_VISITS is 100. So we do not allow early termination based on the global floor until this shard has done at least 100 node visits. Until then we keep searching regardless of what others have shared.

Sharing our min
In collect(), we call minScoreAcc.accumulate(...) whenever we collect a new result and the local floor improves (above lastSharedScore + 0.0001f). There is no minimum visit count or delay before the first share; we share as soon as we have a floor to share.
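Put together, the two rules above can be sketched as follows (illustrative only: GLOBAL_BAR_MIN_VISITS and the 0.0001f delta come from the description above, while the class and method names are invented for this sketch, not the actual collector source):

```java
// Sketch of the warm-up and share-delta gating described above.
class CollaborativeGate {
  static final int GLOBAL_BAR_MIN_VISITS = 100; // warm-up before trusting the global floor
  static final float SHARE_DELTA = 0.0001f;     // minimum improvement before re-sharing

  private int visitedCount = 0;
  private float lastSharedScore = Float.NEGATIVE_INFINITY;

  void visit() {
    visitedCount++;
  }

  /** Early termination via the global floor is allowed only after warm-up. */
  boolean mayUseGlobalFloor() {
    return visitedCount >= GLOBAL_BAR_MIN_VISITS;
  }

  /** Sharing has no warm-up: share as soon as the local floor improves enough. */
  boolean shouldShare(float localFloor) {
    if (localFloor > lastSharedScore + SHARE_DELTA) {
      lastSharedScore = localFloor;
      return true;
    }
    return false;
  }
}
```

The asymmetry is deliberate: a shard's own min is always safe to publish, while acting on a foreign min too early risks the entry-point trap discussed below.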

@krickert
Author

krickert commented Mar 5, 2026

Still working through this - just a quick update..

@vigyasharma
Contributor

Thank you @krickert for working through this. Your attention to testing and the work you've done on profiling different distributed scenario setups is admirable. Thank you for making it all open source! <3

This is an interesting idea and I'm curious to see how the profiling fares. Going over the issue thread, it seems that one persistent problem is that we don't know when we are in an optimal graph neighborhood to start applying the externally imposed threshold. Applying it too early gives bad results (obviously, because we haven't reached a good neighborhood yet). And relying on a static number of hops or iterations seems wasteful.

I'm curious about ideas for this specific sub-problem. Would it help to look at the variance in similarity scores for all nodes in the candidates neighbor array during graph search? Perhaps variance would be high initially (in bad neighborhood) but slowly fall as we reach better graph areas? It's not super reliable though because we could be in a region where all neighbors are similarly bad.. so maybe we want to also factor in the similarity scores we're seeing?

Another heuristic could be to apply the external threshold once we have at least a few candidates that survive graph traversal iterations. These nodes are potentially good results and applying the external threshold after collecting some good results means we are only short-circuiting the long tail of collected results.

@krickert
Author

krickert commented Mar 6, 2026

Thanks for the suggestions, @vigyasharma.

You're right; the current 100-visit warm-up is a static safeguard against the "entry point trap" at local bridge nodes. My next round of tests will retain the 100-visit warm-up to establish a baseline, then introduce the variance-based trigger. This lets me isolate the recall-safety of the core logic before adding complexity with additional variables.

The current test results show that collaborative search produces results identical to a standard distributed search - achieving recall parity with the current Lucene HNSW implementation - while ensuring it doesn't regress on high-performance local hardware.

I've seen significant success testing this on resource-constrained clusters (Raspberry Pis), where the pruning yielded a ~50% reduction in CPU cycles and latency without any recall loss. On high-end localhost setups with small shards, the gains are understandably masked by the raw speed of the traversal, but the recall floor remains solid.

Regarding the heuristics:

  1. I agree that a static visit count is a blunt instrument. I’m currently preparing a 250GB index benchmark (court law cases) which will provide much more realistic graph depth than my initial tests.
  2. Once that larger-scale data is ready, I plan to use it to test your suggestion of a variance-based trigger. This would allow the pruning to be topology-aware rather than relying on a static visit counter.
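For reference, one possible shape for that variance-based trigger (purely a sketch, not committed code: it uses Welford's online variance, and the class name and cutoff value are assumptions for illustration):

```java
// Hypothetical trigger: honor the external threshold only once the running
// variance of observed similarity scores has dropped below a cutoff,
// suggesting the search has settled into a stable graph neighborhood.
class VarianceTrigger {
  private long n = 0;
  private double mean = 0.0;
  private double m2 = 0.0; // sum of squared deviations (Welford's algorithm)
  private final double cutoff;

  VarianceTrigger(double cutoff) {
    this.cutoff = cutoff;
  }

  /** Record one observed similarity score during graph traversal. */
  void observe(double score) {
    n++;
    double d = score - mean;
    mean += d / n;
    m2 += d * (score - mean);
  }

  /** Sample variance of scores seen so far; "infinite" until two samples. */
  double variance() {
    return n < 2 ? Double.MAX_VALUE : m2 / (n - 1);
  }

  /** Apply the external threshold only once scores have stabilized. */
  boolean ready() {
    return variance() < cutoff;
  }
}
```

As noted above, this alone is not fully reliable (a uniformly bad region also has low variance), so the absolute score level would likely need to be factored in as well.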

I’ll share those Pareto frontier results once the large-scale runs are complete. It should make the benefits clear even on high-performance hardware.
