Description
Looking for any pointers/advice/best practices for my use case:
Large Annoy tree (100 GB+), high-frequency lookups, best or near-best accuracy required (wherever diminishing returns start to show, I guess; right now that seems to be around search_k = 30_000 for 10M items, each with 3500 components).
Essentially, I need to sequentially look up every item in the tree, non-stop, as fast as possible. At my desired search_k value, the performance hit is starting to hurt.
Side question: if I were to build another Annoy index with as many trees (n_trees) as I can fit in memory/disk, would this significantly reduce the search_k I need to get similar results? Edit: answer: possibly yes; at least a bit.
NOTE: currently, multiprocessing across two-thirds of my cores appears to be fastest, which I suspect is due to I/O wait times...
Tertiary question: is a shared-memory approach, with one tree in memory and many processes accessing it, achievable or useful?
Quaternary question: is there a fastest metric? Edit: answer: yes. In my case, hamming turned out to be the fastest and, counterintuitively, the most accurate by a keyword-based metric. (Although if I normalize my continuous vectors beforehand (which should break hamming), the build time itself increases drastically.)