Conversation
The shared global minimum similarity is a 32-bit float stored via Float.floatToRawIntBits. Using AtomicLong required unsafe narrowing casts ((int) globalMinSimBits.get()) on every read, which would silently truncate if the upper 32 bits were ever non-zero. AtomicInteger is the natural fit: it matches the 32-bit width of a float's bit representation, eliminates all narrowing casts in both the hot-path read (minCompetitiveSimilarity) and the CAS update loop (updateGlobalMinSimilarity), and retains identical volatile/CAS memory ordering guarantees.

Changed in:
- CollaborativeKnnCollector: field type, constructors, minCompetitiveSimilarity(), updateGlobalMinSimilarity()
- CollaborativeKnnCollectorManager: field type, constructor
- TestCollaborativeHnswSearch: all AtomicLong instantiations
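The pattern this commit describes can be sketched standalone. This is an illustration, not the PR's actual code: the method names mirror the commit message, but the class is self-contained.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal sketch: a monotonic float threshold stored as raw bits in an AtomicInteger. */
class FloatBitsThreshold {
  // Shared across collectors; starts at the bit pattern of -Infinity so any real score raises it.
  private final AtomicInteger globalMinSimBits =
      new AtomicInteger(Float.floatToRawIntBits(Float.NEGATIVE_INFINITY));

  /** Hot-path read: no narrowing cast needed, unlike the AtomicLong version. */
  float minCompetitiveSimilarity() {
    return Float.intBitsToFloat(globalMinSimBits.get());
  }

  /** CAS loop: the bar only ever rises (monotonic); concurrent writers retry. */
  void updateGlobalMinSimilarity(float candidate) {
    int cur = globalMinSimBits.get();
    while (candidate > Float.intBitsToFloat(cur)) {
      if (globalMinSimBits.compareAndSet(cur, Float.floatToRawIntBits(candidate))) {
        return; // we raised the bar
      }
      cur = globalMinSimBits.get(); // lost the race; re-read and re-check
    }
  }
}
```

Because the comparison is done on the decoded floats rather than the raw bits, the negative-infinity sentinel needs no special casing.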
Introduces a test to verify collaborative pruning across multiple index segments, ensuring shared thresholds affect HNSW traversal correctly.
Add two tests simulating cross-shard KNN search with collaborative pruning: - testMultiIndexHighKPerformance: 5 separate HNSW graphs (5000 vectors each), K=500, measures 73-78% reduction in visited nodes vs standard search. - testMultiIndexCollaborativeEndToEnd: 5 separate Directory instances combined via MultiReader through IndexSearcher, K=100, measures 97% reduction using TrackingKnnQuery and TrackingCollaborativeKnnQuery to capture per-leaf visited counts through mergeLeafResults.
The two multi-index performance tests (testMultiIndexHighKPerformance, testMultiIndexCollaborativeEndToEnd) take ~10-20s each. Tag them @nightly so they are skipped during normal test runs and only execute with -Dtests.nightly=true.
These two single-graph tests account for ~88% of the suite time (~15s of ~17s) due to large graph construction (30K vectors at K=1000, 10K vectors at 128 dimensions). Moving them to @nightly brings the default suite from ~17s down to ~2s while keeping the basic collaborative pruning test and multi-segment test in every run. All four collaborative tests still run with -Dtests.nightly=true.
benwtrent
left a comment
I think the idea is an interesting one.
Does your benchmark compare recall? Improvements here are only useful if the recall/visited (latency) curve is actually improved.
I suggest using Lucene Util: https://github.com/mikemccand/luceneutil/
Since for a single Lucene index, the main impact is for reducing comparisons across segments, please include that information.
If we are sharing scores, please take a look at how this is done else where in Lucene with LongAccumulator and min-competitive logic that takes into account the doc-id tie breaking.
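The suggestion above points at the `java.util.concurrent.atomic.LongAccumulator` pattern Lucene's top-docs collectors use for a shared max score with doc-id tie-breaking. A sketch of that idea, with an illustrative encoding (not necessarily Lucene's exact one):

```java
import java.util.concurrent.atomic.LongAccumulator;

/**
 * Sketch of a shared min-competitive bar with doc-id tie-breaking, in the spirit
 * of Lucene's accumulator-based approach. Encoding is illustrative only.
 */
class MaxScoreBar {
  // Long::max over an encoded (score, docId) pair: higher score wins,
  // and on score ties the lower global doc id wins.
  private final LongAccumulator acc = new LongAccumulator(Long::max, Long.MIN_VALUE);

  /**
   * Non-negative float scores keep their ordering under floatToIntBits, so the score
   * can sit in the upper 32 bits; the inverted doc id sits in the lower 32 bits,
   * so that among equal scores the smaller doc id produces the larger long.
   */
  static long encode(float score, int globalDocId) {
    return (((long) Float.floatToIntBits(score)) << 32) | (~globalDocId & 0xFFFFFFFFL);
  }

  void accumulate(float score, int globalDocId) {
    acc.accumulate(encode(score, globalDocId));
  }

  float minCompetitiveScore() {
    long v = acc.get();
    return v == Long.MIN_VALUE ? Float.NEGATIVE_INFINITY : Float.intBitsToFloat((int) (v >>> 32));
  }
}
```

The accumulator is lock-free and commutative, so many segments or shards can publish concurrently without a CAS retry loop in user code.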
Fix testMultiIndexHighKPerformance, which used a constant docBase=1000000 that prevented pruning from ever activating; use i*vectorsPerGraph to simulate a real multi-segment layout. Add brute-force recall measurement to all single-graph pruning tests and a new testMultiSegmentCombinedRecall that builds multiple HNSW graphs, searches with both standard and collaborative collectors, merges results, and compares against the exact top-k. Update the HnswGraphSearcher comment to reference LongAccumulator instead of a bi-directional stream. Add Javadoc to minCompetitiveSimilarity documenting the segment-0 tie-breaking design tradeoff.
Still working through the details and creating a test that produces a matrix of values to show the speedup. You pointed out exactly the direction I needed, thanks. I'll have more updates and comment when I'm done.
Hi @krickert +1 on the idea.
…ment sweep test

CollaborativeKnnCollector.collect() now shares the k-th best score (floor) instead of every collected doc's score, maintaining 0.995 recall while still enabling cross-segment pruning. The real-world test sweeps 4/8/16/32 segments using 73k 1024-dim embeddings and is double-gated behind @monster and the tests.embeddings.dir system property.
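The "share the floor" change described in this commit can be sketched standalone. The class below is illustrative (a plain PriorityQueue stands in for Lucene's neighbor queue); only the behavior matches the commit: nothing is published until the local top-K queue is full, and then only the k-th best score is shared.

```java
import java.util.PriorityQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Standalone sketch: publish only the k-th best (queue-min) score, and only once full. */
class FloorSharingCollector {
  private final int k;
  private final PriorityQueue<Float> topK = new PriorityQueue<>(); // min-heap of the best k scores
  private final AtomicInteger globalMinSimBits;

  FloorSharingCollector(int k, AtomicInteger sharedBits) {
    this.k = k;
    this.globalMinSimBits = sharedBits;
  }

  void collect(float score) {
    topK.offer(score);
    if (topK.size() > k) topK.poll(); // drop the worst
    if (topK.size() == k) {
      // Queue is full: the heap minimum is our k-th best — the only score worth sharing.
      float floor = topK.peek();
      int cur;
      while (floor > Float.intBitsToFloat(cur = globalMinSimBits.get())) {
        if (globalMinSimBits.compareAndSet(cur, Float.floatToRawIntBits(floor))) break;
      }
    }
  }
}
```

Sharing the floor rather than every score is what preserves recall: other segments only prune candidates that provably cannot enter the merged top-K.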
I've been digging into the recall issues from the distributed simulations (4, 8, and 16 shards, 1.47M 1024-dim vectors). Rerunning on clean, deduped data and instrumenting per-shard behavior has uncovered a few difficult to figure out problems:
The parameters that scale with K (especially for K >= 1000) aren't as straightforward as I had thought. Still working through it and open to ideas if anyone sees a cleaner approach. In the meantime, I will attempt a more realistic approach and create a per-index HTTP2 service that serves up Lucene, to see if real-network collaborative pruning can work. More to come...
FYI, I suspect any collaborative search across shards will have an impact on recall with the same parameters (unless finely tuned). The key thing is the visited/recall curve. Can we get the same recall with fewer visited? I suspect real-world Lucene indices (just like Lucene segments) are a random sample of the entire corpus. Relevant vectors should be expected to be evenly distributed between all indices. This is the assumption that Lucene makes with segments and its "optimistic search" pattern. This same assumption will be required by this idea. Anything that searches one shard and then only sets competitiveness without taking this into account will be useless.
Collaborative HNSW Shard Pruning: Visited/Recall Analysis

Yup, @benwtrent. The visited/recall curve is the only honest way to judge this. Here is what I am seeing on 1.47M deduped 1024-dim vectors across 8 independent shards (evenly distributed):
At K=10, collaborative pruning "wins" on performance but fails on recall - it's essentially cutting off the bridge paths required to reach local clusters. At K=100, we recover the recall, but the safety mechanisms (delayed bar application and slack buffers) actually cause us to over-explore, doing more work than the baseline.

The difficulty is that unlike Lucene segments, independent shard indices don't share a consistent global HNSW topology. The assumption that shards are "random samples" is theoretically sound, but in practice, sharing the raw kth-best score from the first-finishing shard is too aggressive. It doesn't account for the variance in when a shard actually "finds" its target neighborhood during traversal.

I'm prototyping a distributed gRPC setup to test this with realistic network isolation (to eliminate the memory-bus contention we see in a single JVM). However, the core question remains: should the collaborative bar be a raw score, or a heuristic that factors in the shard count and expected local density? My ultimate goal is to make very large result sets (
Distributed Validation Progress: Collaborative HNSW Pruning

So testing is now transitioning from single-JVM simulations to a physical 8-node cluster (2.5GbE, 1TB NVMe, 16GB RAM per node) to isolate variables like CPU contention and IO over-caching. I've discovered that single-host simulations masked a "Coherence Tax" of distributed HNSW search, so testing on real hardware over a LAN is the only way to measure the trade-off between coordination overhead and compute savings. Initial tests are showing slightly slower times in a distributed environment when simulated on a local system - this isn't adding the correct latency we would see IRL. To finally confirm whether this implementation is right, it's easier to just make a distributed search PoC that runs on an HTTP/2 stream. 100% of this work will be available in the lucene-test-data repo, which can recreate everything up until the shard distribution.

The Theory Revisited

So the theory is sound - but I've not been able to successfully demonstrate the savings, either because the tests were flawed or because the idea is just too much overhead to hold water. In a standard sharded HNSW search, every shard performs a full graph traversal to find its local top-K, unaware that its candidates may be far below the global similarity threshold. This results in significant redundant compute, especially for high-K queries (

We want to eliminate these redundant cycles. Collaborative Pruning introduces a global minimum similarity bar synchronized across nodes via HTTP/2 streams. By injecting this bar into the

Benchmarking Dimensions & Stats

To prove this, here's a chart of data I'll be collecting:
Baselines
Key Metrics
I'll try to infer the collaborative overhead, but comparing to a standard baseline and measuring latency should be good enough to demonstrate.

Execution Plan
System Architecture

```mermaid
graph TD
    subgraph "Offline Preparation"
        A[Raw Text Data] -->|BGE-M3 Python| B[.vec Embeddings]
        B -->|Indexer Utility| C[16 Shards]
        C -->|Merge| D[8 / 4 / 2 / 1 Shard Indices]
    end
    subgraph "Distributed Cluster: 8 Nodes"
        E[Search Request] --> Node0((Node 0: Coordinator))
        subgraph "Symmetric Peer Node"
            direction TB
            N((Peer Node)) --> Local[Lucene HNSW Shard]
            Local <--> Collab[CollaborativeKnnCollector]
            Collab <--> Gossip{{ScaleCube / gRPC Stream}}
        end
        Node0 -->|gRPC Search| Node1(Node 1)
        Node0 -->|gRPC Search| Node2(Node 2)
        Node0 -->|gRPC Search| NodeN(...)
        Node1 <-->|Bi-Di Threshold Sync| Node0
        Node2 <-->|Bi-Di Threshold Sync| Node1
        NodeN <-->|Bi-Di Threshold Sync| Node0
    end
    subgraph "Merging"
        Node0 -->|Fan-in Results| Merge[Heap Merge & Sort]
        Merge --> Final[Top K Results]
    end
```
…ibuted safety

This change hardens the collaborative ANN pruning mechanism to ensure high recall in distributed environments while maintaining significant traversal savings.

Key Refinements:
- Implement "Lagging Threshold" (warm-up) in CollaborativeKnnCollector: The global pruning bar is now ignored until the local queue is full and a minimum number of nodes (2*k) have been visited. This prevents the "Entry Point Trap" where high global bars from other shards could cause premature termination at local bridge nodes.
- Introduce "Safety Slack Buffer": Applied a 0.05f slack to the global threshold to allow HNSW traversal through similarity "valleys" required to reach high-scoring clusters in independent graphs.
- Update HnswGraphSearcher threshold logic: Switched to Math.nextUp() for dynamic similarity updates to match standard Lucene behavior, and relaxed bulk-pruning checks to '>=' to correctly handle score ties.
- Refactor Javadocs: Updated documentation to be protocol-neutral, focusing on general distributed search requirements and global tie-breaking priority via docId mapping.

Integration & Cleanup:
- Integrated collaborative search support into luceneutil (KnnGraphTester and knnPerfTest.py) to enable standardized performance benchmarking.
- Removed experimental nightly/monster tests from core to reduce cruft.
- Fixed luceneutil SUMMARY output to include collaborative status.
…ibuted safety

This change hardens the collaborative ANN pruning mechanism to ensure high recall in distributed environments while maintaining significant traversal savings.

Key Refinements:
- Implement 'Lagging Threshold' (warm-up) in CollaborativeKnnCollector: The global pruning bar is now ignored until a minimum number of nodes (100) have been visited. This prevents the 'Entry Point Trap' where high global bars from other shards could cause premature termination at local bridge nodes.
- Introduce Safety Slack Buffer: Applied a 0.01f slack to the global threshold to allow HNSW traversal through similarity 'valleys' required to reach high-scoring clusters in independent graphs.
- Implement Smart Accumulation: Global bar updates are now debounced by a 0.001f improvement threshold to reduce atomic contention across threads.
- Update HnswGraphSearcher threshold logic: Switched to minimal Math.nextUp() similarity updates to match standard Lucene behavior.
- Support docIdMapper: Added IntUnaryOperator support to ensure globally consistent tie-breaking across shards.
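The warm-up and slack guards from this commit can be sketched in isolation. The constants mirror the commit message (100 visits, 0.01f slack), but the class itself is illustrative, not the PR's code:

```java
/** Sketch of the "Lagging Threshold" warm-up and Safety Slack Buffer guards. */
class LaggingThreshold {
  static final int GLOBAL_BAR_MIN_VISITS = 100; // warm-up before the global bar applies
  static final float SLACK = 0.01f;             // lets traversal cross similarity "valleys"

  private int visited;
  private volatile float globalBar = Float.NEGATIVE_INFINITY;

  void visit() { visited++; }

  void setGlobalBar(float bar) { globalBar = bar; }

  /** Effective pruning bar: the global bar is ignored during warm-up, slackened afterwards. */
  float effectiveBar(float localBar) {
    if (visited < GLOBAL_BAR_MIN_VISITS) return localBar; // entry-point trap guard
    return Math.max(localBar, globalBar - SLACK);
  }
}
```

During warm-up the traversal behaves exactly like a standard local search; only after enough of the local neighborhood has been seen does the external bar start to prune.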
… and Neighborhood Affinity gating
…nd fix Lucene 11 API compatibility
…lMax)

- Revert to Global Floor vs Local Max pruning
- earlyTerminated(): stop when localMax < globalFloor (after 100 visits)
- minCompetitiveSimilarity(): local bar only (pathfinding unchanged)
- collect(): track localMaxScore, push floor with lastSharedScore guard
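The termination rule in this commit can be sketched standalone. Names and the 100-visit constant follow the commit message; the class is illustrative:

```java
/** Sketch of Global Floor vs Local Max termination: a shard stops once even its
 *  best local candidate cannot beat the k-th best score seen across all shards. */
class GlobalFloorTermination {
  static final int MIN_VISITS = 100; // warm-up before early termination may trigger

  int visitedCount;
  float localMaxScore = Float.NEGATIVE_INFINITY;        // best score collected locally
  volatile float globalFloor = Float.NEGATIVE_INFINITY; // shared k-th best across shards

  void collect(float score) {
    visitedCount++;
    localMaxScore = Math.max(localMaxScore, score);
  }

  /** Stop when, after warm-up, the whole local frontier is provably non-competitive. */
  boolean earlyTerminated() {
    return visitedCount >= MIN_VISITS && localMaxScore < globalFloor;
  }
}
```

Note that `minCompetitiveSimilarity()` staying local (per the commit) means the global floor only stops the search outright; it never reroutes pathfinding.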
@benwtrent @navneet1v - quick status update on a distributed KNN prototype. I implemented a gRPC/HTTP2 streaming coordinator + shard model for collaborative HNSW search (outside OpenSearch REST for now), and ran initial benchmarks on 8 Raspberry Pi 5 nodes (NVMe) plus local runs.

Early results
In a heterogeneous setup, adding one higher-performance node improved global pruning and produced larger gains (up to ~65% vs the same cluster without that node in current tests).

Repo

ai-pipestream/distributed-search - gRPC streaming service PoC

Next steps
Note: this needed an HTTP/2 boost - it will not be fast if it's done over HTTP/1.
Your final numbers don't indicate recall. Please, we need to see what the Pareto frontier (how recall changes with increasing efSearch) looks like for the following scenarios as they reflect optimal, baseline, and candidate:
Last I saw from your benchmarks, at k:100 collaborative was much worse. In most real-world data, each index/shard will have a random subset of the entire dataset; your tests should reflect this as well. The Pareto frontier should likely be "recall vs. total vectors compared". And the latter two benchmarks should be done against the exact same graphs/indices, as reindexing isn't necessary to test with the collaborative searcher. I don't think benchmarking between machines is necessary. If the collaborative searching isn't useful when all shards are on the same machine (thus the overhead of sharing information is at its lowest), I doubt it will be helpful at all once overheads increase.
No problem. So the benchmark comparison
NOTE: removed due to flawed test recordings - see results below instead. This post had flawed results - please see the discussion below for better points.
I don't understand your graphs. I would expect the following results:
Your graphs show there is zero benefit for collaborative and that splitting data across shards significantly reduces recall!
You’re right, I mixed objectives. I’ll focus on recall next, specifically recall vs
I'll treat work/latency as secondary and keep them out of the main conclusion for now. Next I'll test whether recall can be improved by adding shard-aware index-time context instead of relying on search-time coordination alone. I'll prototype a lightweight global routing layer and cross-shard neighborhood metadata so shard traversal starts with better global priors.

I think the core issue is that each shard currently builds and searches its own local ANN neighborhood frontier. A single shard can look strong, but once we merge across many shard-local frontiers, recall drops much harder than I expected. It's honestly more severe than I thought, and that's exactly why I think index-time global awareness can help. I'm looking through some papers for a round, but I'll test out a few more scenarios. Thanks for being patient, by the way. I really want to push hard for getting high-K search to be the norm.
Reporting numbers were wrong; I believe they're right now. Reporting high recall across the board now. Changed to a draft - I'll do more testing and post here tomorrow.
```java
boolean shouldExploreMinSim = true;
while (candidates.size() > 0 && results.earlyTerminated() == false) {
  // Update the threshold dynamically from the collector to allow external pruning.
  float liveMinSimilarity = results.minCompetitiveSimilarity();
```
what does this do when we are not using external pruning?
When we're not using external pruning, results.minCompetitiveSimilarity() still returns the minimum competitive similarity of the current top‑K results, but only from this shard’s collector.
So it's the same kind of threshold (used to prune the graph: skip nodes that can’t beat the worst of the current top‑K), but it's not updated from other shards.
The loop and the pruning logic are unchanged; only the source of the threshold is internal (this shard) instead of external (collaborative). In other words: same API, same pruning, no cross-shard updates when external pruning is off.
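That point can be illustrated in a stripped-down form. The interfaces below are simplified stand-ins, not Lucene's actual `KnnCollector` API: the collaborative wrapper changes only the *source* of the threshold, returning the max of the local bar and the shared one.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified stand-in for the collector's threshold API. */
interface SimpleCollector {
  float minCompetitiveSimilarity();
}

/** Local-only threshold: the worst score in this shard's current top-K. */
class LocalCollector implements SimpleCollector {
  float kthBest = 0.3f; // illustrative value
  public float minCompetitiveSimilarity() { return kthBest; }
}

/** Same API, same pruning loop: just max(local, global) instead of local alone. */
class CollaborativeCollector implements SimpleCollector {
  final SimpleCollector delegate;
  final AtomicInteger globalBits; // shared float bits, written externally

  CollaborativeCollector(SimpleCollector delegate, AtomicInteger globalBits) {
    this.delegate = delegate;
    this.globalBits = globalBits;
  }

  public float minCompetitiveSimilarity() {
    return Math.max(delegate.minCompetitiveSimilarity(),
                    Float.intBitsToFloat(globalBits.get()));
  }
}
```

With no external writer, the wrapper degenerates to exactly the delegate's behavior, which is why turning collaboration off changes nothing in the traversal.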
I don't see how this will show anything different. What is the size of your data set now? I would assume about 1M per shard should be plenty to give any indication that this would prove useful. However, I am not sure a naive sharing like this will actually work without other orchestration (e.g. routing certain clusters of vectors to shards, which Lucene just won't do, because Lucene is the shard). I do think there is something to searching multiple graphs in parallel (e.g. optimistic searching like Lucene does with segments). But this would have much more communication, orchestration, and work than simply sharing the min_competitive score.
How would that show any improvement? If sharing information doesn't help when the latency of communication is near zero, how would it improve when the latency of communication increases significantly? That just means sharing (the key point of this algorithm) is now more expensive.
First, all great questions - you're getting to the heart of why I should increase the collection size to demonstrate the gains I saw with my home-lab distributed setup. tl;dr: I suspect that pruning can only help if there is enough latency between shards and searches go over 10ms. My home lab, because it uses cheap machines, was a good candidate to show this, and it did show significant improvement. I can rerun the same tests I showed you in the slow environment, because the testing harness that runs on localhost is a real streaming distributed search harness. But I also need to show this on a localhost multi-shard setup, where latency is low but calculations are high.
Small. Too small. From the shard dirs:
So total index size is in the hundreds of MB (~0.3 GB), not 10M or 100M. That's for 73K vectors × 1024 dims (float32) plus the HNSW graph. My machine has 128GB of RAM and the drive operates at 20GB/s, so the entire index is certainly in OS disk cache, and the drive is faster than the Raspberry Pi's memory. That's why I need to add latency to the setup - the two environments test two different extremes. The larger machine does no sweating with these tests (the timing of the entire test is in the low-ms range). We want to challenge the machine with at least 250ms queries like I did with the Raspberry Pi.

We're well below 1M per shard. The idea of going 10-20x larger isn't that a bigger index by itself proves anything; it's that with more work per shard per query (more graph to traverse), there's more for pruning to actually cut. The problem is that with the entire corpus in disk cache, the work we do is nearly instant. That's why you're not seeing it kick in. Collaboration doesn't seem to affect the overall search speed because there's a dedicated HTTP2 streaming connection that is always on during the search. Its notification system is less than 1ms. But with collaboration turned on, I can show you a 50% boost in speed and a 50% savings in CPU in some situations - which is why I tested this on a Raspberry Pi; it forced latency and "simulated" a larger corpus. And when I bring the index size up and run the searches concurrently, I saw it outperform the traditional search - because otherwise you wait for the entire result set to yield when you shouldn't have to. That's where it shines.

I can measure the overhead to some degree. I logged the events in the coordination layer and you can see the trimming live in the logs. But don't believe me right now - if I just give it a large corpus, I'm convinced you'll see it for yourself. With 73K and a fast machine, each shard finishes so quickly that by the time a useful
You're right that higher latency doesn't make sharing better - it makes it more expensive. The point wasn't that adding latency produces improvement; it was that running all shards on localhost is an unrealistic test for a distributed search, because it has low latency you wouldn't see in most setups. In my tested lab environment (2.5 Gbit, more latency), we did see larger savings (lookups_saved) AND a large reduction in latency - but just enough to make me realize I need more testing (much like how a zero-latency connection is unrealistic, so too is assuming the world will power Lucene on all-Raspberry-Pi machines). So I have two setups:
The interpretation I'm suggesting is that in that other setup there was more work per query (and/or slower shards), so pruning had something to cut; here, work per query is so small that pruning barely shows up. So "higher-latency setting" was shorthand for "the environment where we already saw the gain," not a claim that increasing latency causes the gain. So I'm trying to reproduce that kind of "more work per query" locally (e.g. with a much larger index) to see if the same benefit appears. 3. Naive sharding / orchestration We’re testing naive sharding (no cluster-based routing; Lucene is the shard) on purpose: the question is whether only sharing the We’re not claiming Lucene will do routing or orchestration - it never should. But to have a collaborate search, exposing is needed. And to test, a distributed search harness was necessary to create. We agree that more orchestration (e.g. routing clusters to shards, or more optimistic/segment-like search) would be more work and more communication than just sharing the min score - that is something I'll also measure. That'll be up to the orchestration writers to do too. gRPC works great for this though - I was able to code it in a few hours. But HTTP2/3 direct, a simple UDP packet, and more can easily be used too. REST would be too much overhead - I tried it at first. The tests suggested that this overhead was minimal. I also added a ticker before to only allow sharing from a shard within a threshold to prevent flooding, but that made the code ugly and was a premature optimization - so far I don't see the coordination being an issue even in the fast machine. If you look at the code too, there's an initial wait before we use the shared value to terminate, but not before we share. I detail it below but we do an initial wait of 100 visits before we use the shared min to terminate. We do not wait before we share our min. Using the shared min (early termination) if (visitedCount() < GLOBAL_BAR_MIN_VISITS) return false;
Sharing our min:
Still working through this - just a quick update.
Thank you @krickert for working through this. Your attention to testing and the work you've done on profiling different distributed scenario setups is admirable. Thank you for making it all open source! <3 This is an interesting idea and I'm curious to see how the profiling fares out.

Going over the issue thread, it seems that one persistent problem is that we don't know when we are in an optimal graph neighborhood to start applying the externally imposed threshold. Applying it too early gives bad results (obviously, because we haven't reached the good neighborhood yet). And relying on a static number of hops or iterations seems wasteful. I'm curious about ideas for this specific sub-problem.

Would it help to look at the variance in similarity scores for all nodes in the candidates neighbor array during graph search? Perhaps variance would be high initially (in a bad neighborhood) but slowly fall as we reach better graph areas? It's not super reliable though, because we could be in a region where all neighbors are similarly bad - so maybe we want to also factor in the similarity scores we're seeing?

Another heuristic could be to apply the external threshold once we have at least a few candidates that survive graph traversal iterations. These nodes are potentially good results, and applying the external threshold after collecting some good results means we are only short-circuiting the long tail of collected results.
Thanks for the suggestions, @vigyasharma. You’re right; I used the current 100-visit warm-up as a static safeguard to prevent the "entry point trap" at local bridge nodes. My next round of tests will retain the 100-visit warm-up to establish a baseline, then I'll introduce the variance. This approach allows me to isolate the recall-safety of the core logic before adding complexity with additional variables. The current test results show that collaborative search produces results identical to a standard distributed search - achieving recall parity with the current Lucene HNSW implementation - while ensuring it doesn't regress on high-performance local hardware. I've seen significant success testing this on resource-constrained clusters (Raspberry Pis), where the pruning yielded a ~50% reduction in CPU cycles and latency without any recall loss. On high-end localhost setups with small shards, the gains are understandably masked by the raw speed of the traversal, but the recall floor remains solid. Regarding the heuristics:
I'll share those Pareto frontier results once the large-scale runs are complete. It should make the benefits clear even on high-performance hardware.
[HNSW] Collaborative Search via Dynamic Threshold Feedback
Summary
Enable HNSW graph search to accept externally-updated similarity thresholds during traversal. This allows multiple concurrent search processes (threads, shards, or nodes) to share a global minimum-score bar, pruning each other's search frontiers in real time.
Problem Statement
In current distributed KNN implementations, each shard searches its local HNSW graph in isolation. A shard will continue exploring candidates even when other shards have already found globally superior matches. This redundant traversal wastes CPU and IO, and the cost scales with K and the number of shards.
Proposed Changes
- `HnswGraphSearcher.java`: Re-read `minCompetitiveSimilarity()` from the collector on every iteration of the HNSW search loop, rather than only at initialization. If the value has increased (due to an external update), `minAcceptedSimilarity` is raised and the search frontier is pruned accordingly.
- `CollaborativeKnnCollector`: A `KnnCollector.Decorator` that wraps a standard `TopKnnCollector` and an `AtomicInteger` (storing float bits via `Float.floatToRawIntBits`). Its `minCompetitiveSimilarity()` returns `max(local, global)`, allowing external signals to raise the pruning bar. Updates use a lock-free CAS loop.
- `CollaborativeKnnCollectorManager`: Creates per-segment `CollaborativeKnnCollector` instances that share a single `AtomicInteger`, enabling threshold propagation across leaf segments within a node.

Test Results and Methodology
Unit tests measure the number of graph nodes visited under two conditions:
1. Baseline: `TopKnnCollector` with no external threshold.
2. Collaborative: `CollaborativeKnnCollector` where the global bar is set using a score derived from the standard search's results (simulating a "discovered" top-K score).

Results vary across runs because Lucene's test framework randomizes graph construction parameters (`maxConn`, `beamWidth`). A subsequent run with smaller random values produced:
The pruning is consistently effective across random seeds, with the strongest gains in high-dimension and high-K scenarios where graph traversal is most expensive. The basic scenario is more sensitive to graph topology - smaller graphs with fewer connections have less room to prune.
Important Caveats
These numbers represent upper-bound savings under idealized conditions:
The fundamental mechanism - raising the threshold mid-traversal to skip provably non-competitive subgraphs - is not test-specific. Real-world savings should still be significant, particularly for high-K queries (K >= 100) and dense embedding spaces.
Thread Safety
The implementation adds no locks or synchronization to the HNSW search hot path. Visibility of the shared threshold is guaranteed by the volatile read semantics of
`AtomicInteger.get()`, which the collector calls on every loop iteration. Updates from external threads become visible on the next iteration without explicit memory fencing.

The
`CollaborativeKnnCollector.updateGlobalMinSimilarity()` method uses a standard CAS loop to ensure monotonic updates (the bar can only go up, never down).
The Lucene-layer change is intentionally passive: it reads the global bar but does not write it. Raising the bar based on incoming results from other shards is the responsibility of the orchestration layer. This keeps the Lucene change minimal and avoids coupling graph traversal logic to any specific coordination protocol.
Use Case: Streaming Coordinator
This change is a prerequisite for high-performance distributed KNN search. In a streaming model, the coordinator can broadcast the current "Global Kth Score" back to all shards. Shards running this modified searcher will instantly prune their frontier, terminating their local search as soon as it is mathematically impossible to improve the global result set.
Multi-Index Performance Results
The single-graph tests above prove the mechanism works. The following tests measure what happens when collaborative pruning is applied across multiple separate HNSW graphs - the scenario that maps directly to cross-shard KNN search in OpenSearch.
Test: Multi-Index High-K (low-level HNSW graphs)
5 separate HNSW graphs, 5000 vectors each (25K total), dim=32, K=500. Standard search queries each graph independently and merges. Collaborative search pre-sets the pruning bar to the median score of the merged top-500, then searches all 5 graphs with a shared
`AtomicInteger`.

Test: Multi-Index End-to-End (IndexSearcher + MultiReader)
5 separate Directory instances, 2000 vectors each (10K total), dim=32, K=100. Combined via
`MultiReader` and searched through `IndexSearcher` - the same code path OpenSearch uses. Visited counts captured via a `mergeLeafResults` override.

What this means for high-K cross-shard search
The cost of KNN search scales with K and the number of shards. Without collaborative pruning, each shard does full work independently - a K=2000 query across 20 shards means 20 full HNSW traversals with no shared knowledge. With collaborative pruning, the bar rises as soon as any shard finds good results, and every other shard prunes accordingly. The effect compounds: more shards means the bar rises faster, which means more pruning per shard.
The numbers above are from K=100 and K=500 across 5 graphs. At K=2000 across 20 shards, the pruning surface is larger and the ratio of useful-to-wasted traversal work is worse in the standard case - which means collaborative pruning has even more room to cut. Queries that are currently too expensive to run (high K, many shards, high-dimensional embeddings) become feasible.