Init checks for Dask KMeans #7391

Merged
rapids-bot[bot] merged 4 commits into rapidsai:main from
viclafargue:init-checks-dask-kmeans
Oct 31, 2025

Conversation

@viclafargue
Contributor

Closes #7389

@viclafargue viclafargue requested a review from a team as a code owner October 27, 2025 11:19
@viclafargue viclafargue requested a review from dantegd October 27, 2025 11:19
@github-actions github-actions bot added the "Cython / Python" (Cython or Python issue) label Oct 27, 2025
Comment thread python/cuml/cuml/cluster/kmeans.pyx Outdated
Member


While you're here: I'm not 100% sure whether this check should be skipped in multi-GPU execution. When refactoring, I excluded it in multi-GPU since we weren't running it there before.

Is the multi-gpu implementation robust to a single node having fewer rows than the requested n_clusters? It doesn't seem to error when invoked in that setup, but I'm also not sure if it provides good results.

Contributor Author

@viclafargue viclafargue Oct 27, 2025


I think the check was missing in the multi-GPU implementation. I don't know for sure either, but I suspect this is a rare case that probably won't yield good results, especially for the scalable/parallel k-means++ initialization. Better safe than sorry: we should probably alert the user in this case.
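The check being discussed can be sketched as a small validation helper that errors out before fitting when there are fewer rows than requested clusters. This is a hypothetical illustration of the kind of guard the PR adds; the helper name and message are not cuml's actual code:

```python
def check_kmeans_init(n_samples: int, n_clusters: int) -> None:
    """Raise early if there are fewer rows than requested clusters.

    Hypothetical sketch of the init check discussed above: with
    n_samples < n_clusters, k-means cannot place distinct centroids,
    so it is better to alert the user than to silently proceed.
    """
    if n_samples < n_clusters:
        raise ValueError(
            f"n_samples={n_samples} should be >= n_clusters={n_clusters}."
        )
```

In a Dask setting the relevant count is the total number of rows across all workers, so such a check would typically run on the aggregated shape rather than per-partition.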

Member

@divyegala divyegala left a comment


Maybe also add a check that oversampling_factor > 0?
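The suggested `oversampling_factor` guard could look like the following. This is an illustrative sketch only, not cuml's actual validation code; the k-means|| initialization samples roughly `oversampling_factor * n_clusters` candidate centroids per round, so a non-positive value is meaningless:

```python
def check_oversampling_factor(oversampling_factor: float) -> None:
    # Hypothetical validation sketch: k-means|| oversampling must be
    # strictly positive for any candidate centroids to be drawn.
    if oversampling_factor <= 0:
        raise ValueError(
            f"oversampling_factor must be > 0, got {oversampling_factor}"
        )
```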

@viclafargue viclafargue force-pushed the init-checks-dask-kmeans branch from aaa89f5 to be8c66b October 31, 2025 13:33
@viclafargue viclafargue added the "bug" (Something isn't working) and "non-breaking" (Non-breaking change) labels Oct 31, 2025
@viclafargue
Contributor Author

/merge

@rapids-bot rapids-bot Bot merged commit 550aba7 into rapidsai:main Oct 31, 2025
103 checks passed
@jcrist jcrist mentioned this pull request Oct 31, 2025
rapids-bot Bot pushed a commit that referenced this pull request Oct 31, 2025
During a recent refactor we removed the `KMeansMG` class, viewing it as internal. It turns out this class was used by a few external projects.

Since we still need to support external users accessing the non-dask multi-gpu implementation, we'll want a public way to do so that isn't the private `_fit` method. Additionally, since we want to special case the `MG` case a little more, making it a separate class (even if as a thin shim) makes sense.

This PR:

- Brings back the `KMeansMG` class
- Adds a check that `random_state` is non-None in the `KMeansMG` case, ensuring external users also set `random_state` properly
- Removes mutation of kwargs in the dask `KMeans` case (as suggested [here](#7417 (comment)))
- Simplifies and moves the multi-gpu `kmeans++`/`oversampling_factor` check (as suggested [here](#7391 (comment)))

Fixes #7387.
Fixes #7389.

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #7420
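The `random_state` requirement described in the follow-up commit above can be sketched as a thin shim class. This is a hypothetical skeleton, not cuml's real `KMeansMG` (which wraps the non-dask multi-GPU implementation); only the non-None check is the point here:

```python
class KMeansMG:
    """Illustrative shim for a multi-GPU KMeans entry point.

    Hypothetical sketch: requires an explicit random_state so that
    every worker initializes centroids identically, as the commit
    message above describes for external users of the MG path.
    """

    def __init__(self, n_clusters=8, random_state=None, **kwargs):
        if random_state is None:
            raise ValueError(
                "KMeansMG requires an explicit random_state so all "
                "workers produce the same centroid initialization."
            )
        self.n_clusters = n_clusters
        self.random_state = random_state
        self.kwargs = kwargs  # remaining KMeans parameters, passed through
```

Making the shim a separate class (rather than a flag on `KMeans`) keeps the special-cased MG behavior, such as this stricter validation, out of the single-GPU code path.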
vardhan30016 pushed a commit to vardhan30016/cuml that referenced this pull request Nov 7, 2025

Labels

bug: Something isn't working
Cython / Python: Cython or Python issue
non-breaking: Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Multi Node KMeans result doesn't match Single Node

4 participants