
Why are we not doing exact search in findBestEntryPoint on topmost layer? #15778

@shubhamvishu

Description


It's very likely that this has already been discussed or explored, but I couldn't find a relevant GitHub issue, so I'm opening this to at least satisfy my curiosity and hopefully get some insights, or have the idea struck down :P.

for (int level = graph.numLevels() - 1; level >= 1; level--) {
  foundBetter = true;
  visited.set(currentEp);
  // Keep searching the given level until we stop finding a better candidate entry point
  while (foundBetter) {
    foundBetter = false;
    graphSeek(graph, level, currentEp);
    int friendOrd;
    int numNodes = 0;
    while ((friendOrd = graphNextNeighbor(graph)) != NO_MORE_DOCS) {
      assert friendOrd < size : "friendOrd=" + friendOrd + "; size=" + size;
      if (visited.getAndSet(friendOrd)) {
        continue;
      }
      if (collector.earlyTerminated()) {
        return new int[] {UNK_EP};
      }
      bulkNodes[numNodes++] = friendOrd;
    }
    float maxScore =
        numNodes > 0
            ? scorer.bulkScore(bulkNodes, bulkScores, numNodes)
            : Float.NEGATIVE_INFINITY;
    collector.incVisitedCount(numNodes);
    if (maxScore > currentScore) {
      for (int i = 0; i < numNodes; i++) {
        float score = bulkScores[i];
        if (score > currentScore) {
          currentScore = score;
          currentEp = bulkNodes[i];
          foundBetter = true;
        }
      }
    }
  }
}
I was looking at findBestEntryPoint: we greedily try to find the best entry point and stop as soon as none of the neighbours is better than the current node. Intuitively, the number of nodes on the topmost layer is typically quite small in an absolute sense (e.g., ~10-20 for a graph of 5M docs, and the same holds for even higher doc counts). Is it possible that, in some rare cases (though I don't have data to back this up), we might miss the node that's closest to the query? This could result in landing somewhere farther away in the vector space on the bottommost layer, leading to more computation. Could this be avoided by simply iterating over all the nodes on the topmost layer? Or, going a step further, we could perform an exhaustive search on any layer containing at most X% of expectedVisitedNodes (where X is configurable, e.g., <=0.1% or 1%, or some heuristic so that at minimum only the top level qualifies; a higher X means trying harder to search the next layers exhaustively too).
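To make the proposal concrete, scanning the topmost layer exhaustively reduces the descent step on that layer to a plain argmax over its handful of nodes. A minimal standalone sketch of that idea, not Lucene's actual API (`bestEntryPoint` and the score function here are illustrative names):

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical sketch: instead of the greedy walk on the topmost layer,
// score every node on that (small) layer and take the argmax as the
// entry point for the next layer down.
final class ExhaustiveTopLayer {
  static int bestEntryPoint(int[] topLayerNodes, IntToDoubleFunction scorer) {
    int best = topLayerNodes[0];
    double bestScore = scorer.applyAsDouble(best);
    for (int i = 1; i < topLayerNodes.length; i++) {
      double s = scorer.applyAsDouble(topLayerNodes[i]);
      if (s > bestScore) { // higher score = closer to the query
        bestScore = s;
        best = topLayerNodes[i];
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Toy example: node 7 scores highest, so it becomes the entry point.
    int[] nodes = {3, 7, 11, 42};
    int ep = bestEntryPoint(nodes, n -> n == 7 ? 0.9 : 0.1);
    System.out.println(ep); // prints 7
  }
}
```

Since the top layer has only tens of nodes, this costs a few extra similarity computations at most, while guaranteeing the true best entry point on that layer.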

For M=64, n=5,000,000:

Level   Probability   Expected nodes
0       100%          5,000,000
1       1/64          78,125
2       1/4,096       1,221
3       1/262,144     ~19
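The table above follows from the geometric level assignment: with parameter M, a node appears on level L with probability roughly 1/M^L, so the expected population of level L is n / M^L. A quick standalone check (class and method names are illustrative only):

```java
// Reproduce the expected per-level node counts from the table:
// expected nodes on level L ~= n / M^L.
final class LevelSizes {
  static long expectedNodes(long n, int m, int level) {
    return Math.round(n / Math.pow(m, level));
  }

  public static void main(String[] args) {
    long n = 5_000_000L;
    int m = 64;
    for (int level = 0; level <= 3; level++) {
      System.out.println("level " + level + ": " + expectedNodes(n, m, level));
    }
    // level 0: 5000000, level 1: 78125, level 2: 1221, level 3: 19
  }
}
```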

For n=5,000,000 and K=100: expectedVisitedNodes(K, n) = ln(5,000,000) * 100 ≈ 15.42 * 100 ≈ 1542 nodes, which is 1542 / 5,000,000 ≈ 0.03% of the graph.
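The arithmetic behind that estimate, as a standalone check (`visitedFraction` is an illustrative name, not a Lucene method):

```java
// Expected visited nodes ~= ln(n) * K, expressed as a fraction of the graph.
final class VisitedFraction {
  static double visitedFraction(long n, int k) {
    return Math.log(n) * k / n;
  }

  public static void main(String[] args) {
    double f = visitedFraction(5_000_000L, 100);
    System.out.printf("%.4f%%%n", f * 100); // ~0.03%
  }
}
```

Any layer whose expected population falls below that fraction of n (here, only level 3's ~19 nodes) would be a cheap candidate for exhaustive scanning.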

I understand that we don't want to lose the main benefit of HNSW search by making everything exhaustive or introducing too many knobs. The idea is to apply this approach only where the cost is minimal, while hopefully improving recall and latency by starting the bottom-layer search closer to the query. I'm also wondering whether the top layer(s) are always well connected (i.e., never a disconnected graph) and whether the entry point chosen on the topmost layer is already accurate in practice; if so, this might not translate into actual benefits even though it seems intuitive to me now. I'd be happy to be proven wrong, or to run experiments if it sounds reasonable.

Benefit/Why do this: In some cases (perhaps adversarial ones), we could reduce the number of nodes visited, i.e., save latency, where failing to choose the best candidate on the topmost layer becomes unnecessarily costly. It could also improve recall by bringing the search closer to the query's vicinity. The idea was sparked mainly because we aren't performing an exact search on the top layer, where the absolute number of nodes is very small, and maybe there are edge cases in certain vector datasets where this leaves some potential benefits on the table. I'd love to hear more thoughts on this.
