[BUG] sample_rows + balanced k-means leads to imbalanced clusters on BIGANN 1B #1461

@julianmi

Description

Multiple routines call raft::matrix::sample_rows() followed by a balanced cuvs::cluster::kmeans::fit(), including all_neighbors::get_centroids_on_data_subsample(), ivf_pq::build(), scann::build(), and ACE (introduced in #1404). Testing that PR with BIGANN 1B and a 1% (10M) sample shows highly imbalanced partitions:

Primary vectors     - Total: 1000000000, Avg: 1000000.0, Min: 160947, Max: 18829503
Augmented vectors   - Total: 1000000000, Avg: 1000000.0, Min: 153915, Max: 13578909
Total per partition - Total: 2000000000, Avg: 2000000.0, Min: 323707, Max: 32408412

Because the largest partition is many times the average size, this can lead to out-of-memory (OOM) failures in partitioned approaches.
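To put the OOM risk in rough numbers, here is a minimal sketch that estimates the raw data size of the average and the largest partition from the "Total per partition" row above. It assumes 128 bytes per vector (BIGANN vectors are 128-dimensional uint8) and ignores any graph or index overhead, so real memory use would be higher. The helper name `partition_bytes` is hypothetical and not part of cuVS.

```python
# Assumption: BIGANN vectors are 128-dimensional uint8, i.e. 128 bytes each.
BYTES_PER_VECTOR = 128


def partition_bytes(n_vectors, bytes_per_vector=BYTES_PER_VECTOR):
    """Raw dataset bytes for a partition holding n_vectors (no index overhead)."""
    return n_vectors * bytes_per_vector


# "Total per partition" row reported for sample_rows + balanced k-means.
avg_total = 2_000_000
max_total = 32_408_412

avg_gib = partition_bytes(avg_total) / 2**30
max_gib = partition_bytes(max_total) / 2**30
ratio = max_total / avg_total

print(f"average partition: {avg_gib:.2f} GiB")
print(f"largest partition: {max_gib:.2f} GiB ({ratio:.1f}x the average)")
```

A memory budget sized for the average partition would be exceeded by roughly this ratio on the largest one.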

Uniform sampling (see cagra::ace_get_partition_labels introduced in #1404) shows much better balancing:

Primary vectors     - Total: 1000000000, Avg: 1000000.0, Min: 519219, Max: 3040985
Augmented vectors   - Total: 1000000000, Avg: 1000000.0, Min: 265749, Max: 2634495
Total per partition - Total: 2000000000, Avg: 2000000.0, Min: 784968, Max: 5378950
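The two sets of numbers are easier to compare as ratios. The sketch below computes max/avg and max/min for each row, using only the values reported above. The dictionary layout and the `imbalance` helper are illustrative and not part of cuVS or RAFT.

```python
# (avg, min, max) partition sizes copied from the statistics reported above.
stats = {
    "sample_rows + balanced k-means": {
        "primary": (1_000_000.0, 160_947, 18_829_503),
        "augmented": (1_000_000.0, 153_915, 13_578_909),
        "total": (2_000_000.0, 323_707, 32_408_412),
    },
    "uniform sampling": {
        "primary": (1_000_000.0, 519_219, 3_040_985),
        "augmented": (1_000_000.0, 265_749, 2_634_495),
        "total": (2_000_000.0, 784_968, 5_378_950),
    },
}


def imbalance(avg, lo, hi):
    """Return (max/avg, max/min); both equal 1.0 for perfectly balanced partitions."""
    return hi / avg, hi / lo


results = {}
for method, rows in stats.items():
    print(method)
    for name, (avg, lo, hi) in rows.items():
        max_over_avg, max_over_min = imbalance(avg, lo, hi)
        results[(method, name)] = (max_over_avg, max_over_min)
        print(f"  {name:<10} max/avg = {max_over_avg:6.2f}   max/min = {max_over_min:7.2f}")
```

By the max/avg measure, the largest primary partition drops from about 19x the average with sample_rows to about 3x with uniform sampling.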

Metadata

Labels: bug (Something isn't working)
Status: Todo