Init checks for Dask KMeans#7391
Conversation
There was a problem hiding this comment.
While you're here - I'm not 100% sure if this check should be skipped in multi-gpu execution. When refactoring I excluded it in multi-gpu since we weren't running it there before.
Is the multi-gpu implementation robust to a single node having fewer rows than the requested n_clusters? It doesn't seem to error when invoked in that setup, but I'm also not sure if it provides good results.
There was a problem hiding this comment.
I think the check was missing in the multi-GPU implementation. I do not know for sure either, but I guess that this is a rare case that should probably not yield very good results especially for scalable/parallel kmeans++ initialization. Better safe than sorry, we should probably alert the user in this case.
divyegala
left a comment
There was a problem hiding this comment.
Maybe also add a check that oversampling_factor > 0?
aaa89f5 to
be8c66b
Compare
|
/merge |
During a recent refactor we removed the `KMeansMG` class, viewing it as internal. It turns out this class was used by a few external projects. Since we still need to support external users accessing the non-dask multi-gpu implementation, we'll want a public way to do so that isn't the private `_fit` method. Additionally, since we want to special case the `MG` case a little more, making it a separate class (even if as a thin shim) makes sense. This PR: - Brings back the `KMeansMG` class - Adds a check that `random_state` is non-None in the `KMeansMG` case, ensuring external users also set `random_state` properly - Removes mutation of kwargs in the dask `KMeans` case (as suggested [here](#7417 (comment))) - Simplifies and moves the multi-gpu `kmeans++`/`oversampling_factor` check (as suggested [here](#7391 (comment))) Fixes #7387. Fixes #7389. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #7420
Closes rapidsai#7389 Authors: - Victor Lafargue (https://github.com/viclafargue) Approvers: - Divye Gala (https://github.com/divyegala) URL: rapidsai#7391
During a recent refactor we removed the `KMeansMG` class, viewing it as internal. It turns out this class was used by a few external projects. Since we still need to support external users accessing the non-dask multi-gpu implementation, we'll want a public way to do so that isn't the private `_fit` method. Additionally, since we want to special case the `MG` case a little more, making it a separate class (even if as a thin shim) makes sense. This PR: - Brings back the `KMeansMG` class - Adds a check that `random_state` is non-None in the `KMeansMG` case, ensuring external users also set `random_state` properly - Removes mutation of kwargs in the dask `KMeans` case (as suggested [here](rapidsai#7417 (comment))) - Simplifies and moves the multi-gpu `kmeans++`/`oversampling_factor` check (as suggested [here](rapidsai#7391 (comment))) Fixes rapidsai#7387. Fixes rapidsai#7389. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: rapidsai#7420
Closes #7389