Description
Looking for any pointers/advice/best practices for my use case:
Large Annoy tree (100 GB+), high-frequency lookups, best or near-best accuracy required (wherever diminishing returns start to show, I guess; right now that seems to be around search_k = 30_000 for 10M items, each with 3500 components).
Essentially, I need to sequentially look up every item in the tree, non-stop, as fast as possible. At my desired search_k value, the performance hit is starting to hurt.
Side question: if I were to build another Annoy index with as many trees (n_trees) as I can fit in memory/disk, would this significantly reduce the search_k I need to get similar results? Edit: answer: possibly yes; at least a bit.
NOTE: currently, multiprocessing across two-thirds of my cores appears to be fastest, which I suspect is due to I/O wait times...
Tertiary question: is a shared-memory approach, with one tree in memory and many processes accessing it, achievable or useful?
Quaternary question: is there a fastest metric? Edit: answer: yes. In my case, hamming turned out to be the fastest and, counterintuitively, the most accurate by a keyword-based metric. (Although if I normalize my continuous vectors beforehand (which should break hamming), the build time itself increases drastically.)