
Making it go fast for high volume queries #668

@FlimFlamm


Looking for any pointers/advice/best practices for my use case:

Large Annoy index (100 GB+), high-frequency lookups, best or near-best accuracy required (wherever diminishing returns set in, which right now seems to be around search_k = 30_000 for 10M items, each with 3500 components).

Essentially I need to look up every item in the tree sequentially, non-stop, as fast as possible. At my desired search_k value, the performance hit is starting to hurt.

Side question: if I were to build another Annoy index with as many trees as I can fit in memory/disk, would that significantly reduce the search_k needed to get similar results? edit: answer: possibly yes; at least a bit.

NOTE: currently, multiprocessing across two-thirds of my cores appears to be fastest, which I suspect is due to I/O wait times...

Tertiary question: is a shared-memory approach, with one index in memory and many processes accessing it, achievable or useful?

Quaternary question: is there a fastest metric? edit: answer: yes. In my case Hamming turned out to be fastest and, counterintuitively, the most accurate by a keyword-based metric. (Although if I normalize my continuous vectors beforehand (which should break Hamming), the build time itself seems to increase drastically.)
