Cleanup KMeans python layer#7196
Conversation
Ope, 2 kinds of failures I'll need to fix:
Force-pushed from c37d2ca to ba3ea5c
|
Ok, the latest commit fully ripped out `KMeansMG`, which was a thin wrapper around `KMeans`.
It might not be correct though, unless cuvs does something magical behind the scenes with a multi-GPU implementation and computes the inertia properly. The Python layer in cuml fits one …
Did a quick chat offline, this is not correct (though it's disconcerting that our tests pass). Only …
Force-pushed from ba3ea5c to b2ee24b
Force-pushed from da343c1 to c3172e9
The scope of this PR has increased since some changes were required to support the dask API (this also now fixes the dask bug #7037). I've updated the description at the top. I wouldn't be surprised if the sklearn tests fail (I cannot reproduce these failures locally, see comment above), but beyond that I believe this should be good for another round of review.
This is a major cleanup of the Python layer of `cuml.KMeans`. Highlights:

- `__init__` is now simple, matching sklearn conventions and restrictions. No parameter validation or processing happens in `__init__`, which helps ensure we're compatible with common sklearn APIs like `clone`.
- Vastly simplified internal state. No more private attributes needed.
- Improved data validation of inputs in several methods, letting us fix some sklearn compatibility bugs with `cuml.accel`.
- Removed faulty memory management of the `KMeansParams` struct. We now allocate it on the stack only, removing the need to call `calloc`/`free`. Before, these allocations would leak if an error occurred before `free` was called.
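To illustrate the sklearn `__init__` convention the highlights describe, here is a minimal sketch in plain Python. All names here (`Estimator`, the simplified `clone`) are hypothetical stand-ins, not cuml's actual code: the point is that when `__init__` only stores its arguments unmodified, `clone`-style utilities can round-trip an estimator through `get_params()` and reconstruction without surprises.

```python
class Estimator:
    """Minimal sklearn-style estimator: __init__ only stores parameters.

    Hypothetical sketch -- not cuml's actual KMeans implementation.
    """

    def __init__(self, n_clusters=8, max_iter=300):
        # No validation or coercion here: clone() round-trips these
        # values through get_params() and __init__, and expects to get
        # back exactly what was passed in.
        self.n_clusters = n_clusters
        self.max_iter = max_iter

    def get_params(self):
        return {"n_clusters": self.n_clusters, "max_iter": self.max_iter}


def clone(est):
    # Simplified version of what sklearn.base.clone does: rebuild a
    # fresh, unfitted estimator purely from its constructor parameters.
    return type(est)(**est.get_params())


est = Estimator(n_clusters=3)
copy = clone(est)
assert copy.n_clusters == 3 and copy.max_iter == 300
```

If `__init__` instead validated or transformed its arguments (e.g. coerced `n_clusters` to a different type), the reconstructed copy could differ from the original, which is exactly the class of bug the cleanup avoids.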
I _think_ we can fully remove `KMeansMG`. As is, KMeansMG is a thin wrapper around `KMeans` itself, with just the `fit` method reimplemented. Looking at the implementation though, all it does is call `cuvs::cluster::kmeans::fit` (with much less input validation than it should) followed by `cuvs::cluster::kmeans::predict` instead of a single call to `cuvs::cluster::kmeans::fit_predict` (like `KMeans` does). Reading through the cuvs docs, I don't see a strong reason why we can't just use `fit_predict` everywhere. Ripping out `KMeansMG` does lead all tests to pass.
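The equivalence the comment leans on (a fused `fit_predict` being nothing more than `fit` followed by `predict` on the same data) can be sketched with a toy 1-D nearest-center model. This is purely illustrative of the API shape, not cuvs's or cuml's implementation; `ToyKMeans` and its fixed centers are invented for the example.

```python
class ToyKMeans:
    """Toy 1-D nearest-center model (illustrative only).

    Shows why a separate fit_predict entry point is redundant when it
    is defined as fit followed by predict.
    """

    def __init__(self, centers):
        self.centers = list(centers)

    def fit(self, X):
        # A real k-means would update self.centers here; the toy keeps
        # them fixed and returns self, matching sklearn's convention.
        return self

    def predict(self, X):
        # Assign each point to the index of its nearest center.
        return [
            min(range(len(self.centers)), key=lambda i: abs(x - self.centers[i]))
            for x in X
        ]

    def fit_predict(self, X):
        # The fused entry point is just fit then predict on the same X.
        return self.fit(X).predict(X)


model = ToyKMeans(centers=[0.0, 10.0])
X = [1.0, 9.0, 2.0]
assert model.fit_predict(X) == model.fit(X).predict(X)
```

The caveat from the discussion still applies: this equivalence only holds if `fit` and the fused path really do the same work, which is what had to be confirmed for the multi-GPU case.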
This lets us avoid needing to define a `KMeansMG`, since `cuvs::cluster::kmeans::fit` auto-dispatches to a multi-GPU implementation.
Since `ML::cluster::kmeans::fit` will autodispatch, there's no need for `kmeans_mg` anymore.
We no longer use this internally. All occurrences now call `fit` and then `predict`.
Force-pushed from 29dd6ae to 6f1ec9c
jcrist left a comment:
I'm a bit lost as to why the changes in the xfail list are happening in CI but not locally (3 pass in CI but fail on my machine, 1 fails in CI but passes locally). I've marked them all as flaky for now since I'm a bit stuck as to what's happening. Will open up a followup issue to debug more later, but I don't believe those changes should block getting this in.
    tests:
    - "sklearn.cluster.tests.test_k_means::test_dense_sparse[42-KMeans-X_csr0]"
    - "sklearn.cluster.tests.test_k_means::test_dense_sparse[42-KMeans-X_csr1]"
    - "sklearn.cluster.tests.test_k_means::test_weighted_vs_repeated[42]"
These are now passing in CI, but always fail for me locally. Punting for now and marking them as `strict: false`.
    condition: scikit-learn>=1.7
    strict: false
    tests:
    - "sklearn.mixture.tests.test_gaussian_mixture::test_gaussian_mixture_precisions_init_diag[float64]"
This now always fails in CI, but passes locally. The only bit of this test that hits a `cuml.accel` codepath is the bit generating the initial labels used to estimate the covariance. Big 🤷 as to what changed here; running things locally I see no difference in the output of KMeans. Punting for now and marking as flaky.
Running this locally with …
/merge
This is a major cleanup of the Python layer of `cuml.KMeans`. Highlights:

- `__init__` is now simple, matching sklearn conventions and restrictions. No parameter validation or processing happens in `__init__`, which helps ensure we're compatible with common sklearn APIs like `clone`.
- Vastly simplified internal state. No more private attributes needed.
- Improved data validation of inputs in several methods, letting us fix some sklearn compatibility bugs with `cuml.accel`.
- Removed faulty memory management of the `KMeansParams` struct. We now allocate it on the stack only, removing the need to call `calloc`/`free`. Before, these allocations would leak if an error occurred before `free` was called.
- Fixed a bug in `cuml.dask.cluster.KMeans` where `inertia_` wasn't being computed properly.
- Removed `KMeansMG` entirely; we can now use `cuml.cluster.KMeans` in all contexts. Since `KMeansMG` was an internal implementation detail, I've removed the class without a deprecation period.

Additionally, I've ripped out a few bits of our C++ API that we're now no longer using. It's my understanding that the libcuml C++ API is mostly an implementation detail for the Python API, and changes like this can be made without worry. I've done these removals as separate commits so they're easy to revert if needed.

- Removed the `kmeans_mg` functions entirely (`ML::cluster::kmeans::fit` handles this automatically).
- Removed `ML::cluster::kmeans::fit_predict`. We now call `fit` and then `predict` in all cases to support the auto-dispatching.

Fixes #7037.
Fixes #7187.