
Cleanup KMeans python layer#7196

Merged
rapids-bot[bot] merged 10 commits into rapidsai:branch-25.10 from jcrist:cleanup-kmeans
Sep 16, 2025

Conversation

@jcrist
Member

@jcrist jcrist commented Sep 8, 2025

This is a major cleanup of the Python layer of cuml.KMeans.

Highlights:

  • __init__ is now simple, matching sklearn conventions and restrictions. No parameter validation or processing happens in __init__, which helps ensure we're compatible with common sklearn APIs like clone.
  • Vastly simplified internal state. No more private attributes needed.
  • Improved data validation of inputs in several methods, letting us fix some sklearn compatibility bugs with cuml.accel.
  • Removed faulty memory management of the KMeansParams struct. We now allocate it on the stack only, removing the need to call calloc/free. Previously these allocations would leak if an error occurred before free was called.
  • Fixed a bug in cuml.dask.cluster.KMeans where inertia_ wasn't being computed properly.
  • Removed the need for KMeansMG entirely; we can now use cuml.cluster.KMeans in all contexts. Since KMeansMG was an internal implementation detail, I've removed the class without a deprecation period.
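The first highlight can be illustrated with a minimal sketch. This is a hypothetical toy estimator (not the actual cuml code, and `clone_estimator` stands in for `sklearn.base.clone`): `clone` reconstructs an estimator from `get_params()`, so `__init__` must store parameters verbatim and defer all validation to `fit`.

```python
class ToyKMeans:
    """Hypothetical estimator following the sklearn __init__ convention."""

    def __init__(self, n_clusters=8, init="k-means++", max_iter=300):
        # Store parameters verbatim: no validation or coercion here,
        # so clone(est) round-trips exactly.
        self.n_clusters = n_clusters
        self.init = init
        self.max_iter = max_iter

    def get_params(self, deep=True):
        return {
            "n_clusters": self.n_clusters,
            "init": self.init,
            "max_iter": self.max_iter,
        }

    def fit(self, X):
        # Validation belongs here, not in __init__.
        if self.n_clusters < 1:
            raise ValueError("n_clusters must be >= 1")
        return self


def clone_estimator(est):
    """Minimal stand-in for sklearn.base.clone."""
    return type(est)(**est.get_params())


original = ToyKMeans(n_clusters=3)
copy = clone_estimator(original)
assert copy.get_params() == original.get_params()
```

If `__init__` coerced or rejected parameters, `clone` could produce an estimator with different params than the original, which is one class of sklearn-compatibility bug this cleanup avoids.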

Additionally, I've ripped out a few bits of our C++ API that we're no longer using. It's my understanding that the libcuml C++ API is mostly an implementation detail for the Python API, and changes like this can be made without worry. I've done these removals as separate commits so they're easy to revert if needed.

  • Removes the kmeans_mg functions entirely (ML::cluster::kmeans::fit handles this automatically).
  • Removes ML::cluster::kmeans::fit_predict. We now call fit and then predict in all cases to support the auto-dispatching.
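Why dropping fit_predict is safe can be sketched with a toy numpy version of Lloyd's algorithm (illustrative only; `toy_fit`/`toy_predict` are hypothetical stand-ins, not the cuvs API): for k-means, calling fit and then predict on the same data yields the labels fit_predict would have produced.

```python
import numpy as np


def toy_predict(X, centers):
    """Nearest-centroid assignment; stands in for kmeans predict."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def toy_fit(X, init_centers, n_iter=10):
    """Minimal Lloyd's algorithm; stands in for kmeans fit."""
    centers = init_centers.astype(float).copy()
    for _ in range(n_iter):
        labels = toy_predict(X, centers)
        for k in range(len(centers)):
            pts = X[labels == k]
            if len(pts):
                centers[k] = pts.mean(axis=0)
    return centers


# Two well-separated 1-D clusters with a fixed initialization.
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
init = np.array([[0.0], [10.0]])

centers = toy_fit(X, init)
labels = toy_predict(X, centers)  # fit, then predict

# A fused fit_predict would just combine the two steps; same labels here.
assert labels.tolist() == [0, 0, 0, 1, 1, 1]
```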

Fixes #7037.
Fixes #7187.

@jcrist jcrist self-assigned this Sep 8, 2025
@jcrist jcrist requested a review from a team as a code owner September 8, 2025 20:43
@jcrist jcrist added the Cython / Python Cython or Python issue label Sep 8, 2025
@jcrist jcrist requested a review from divyegala September 8, 2025 20:43
@jcrist jcrist added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 8, 2025
@jcrist
Member Author

jcrist commented Sep 8, 2025

Ope, 2 kinds of failures I'll need to fix:

  • Dask version needs to be updated as well. Can do.
  • There are 2 xpassing tests in the sklearn test suite, and one new failure. These results are consistent across all 3 CI runs, but I cannot reproduce any of them locally. The xpassing tests could be flaky (they were failing before just due to misalignment of the labels: all 3 clusters are found perfectly, we just use different labels than sklearn). The new failure is peculiar; I don't see how these changes could lead to it (and I cannot reproduce it locally).

@jcrist jcrist requested a review from a team as a code owner September 8, 2025 23:05
@github-actions github-actions Bot added the CMake label Sep 8, 2025
@jcrist
Member Author

jcrist commented Sep 8, 2025

Ok, dask implementation should be fixed, but I'd like a followup review (cc @divyegala) just to ensure this isn't invalid.

The latest commit fully ripped out KMeansMG in favor of just using cuml.KMeans everywhere. The *MG classes aren't public, so I don't view this as a breaking change, and I think the fix is as correct as the previous implementation.

KMeansMG was a thin wrapper around KMeans itself, with just the fit method reimplemented. Looking at the implementation though, all it does is call cuvs::cluster::kmeans::fit (with much less input validation than it should) followed by cuvs::cluster::kmeans::predict instead of a single call to cuvs::cluster::kmeans::fit_predict (like
KMeans does). Reading through the cuvs docs, I don't see a strong reason why we can't just use fit_predict everywhere, and swapping to using KMeans everywhere does lead all tests to pass.

@divyegala
Member

@jcrist nice update, this looks valid to me. Could you quickly check if this issue is fixed with your updates?

@jcrist
Member Author

jcrist commented Sep 8, 2025

With this fix the inertia_ attribute is available and non-zero on the fitted dask estimator (so in that view, yes, it fixes the issue).

It might not be correct though, unless cuvs does something magical behind the scenes with a multi-gpu implementation and computing the inertia properly. The python layer in cuml fits one KMeans estimator per dask partition, concatenates all the labels together, and then exposes only the first fit estimator's attributes. Unless some magic is happening somewhere, I don't see how this is algorithmically correct for the fit, much less attributes like inertia_ that would need a global view of all the data rather than something partition-local. I haven't looked much into the C++ layer of any of the multi-gpu things though, so I may be/am likely missing something 🤷.
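The concern above can be made concrete with a small numpy sketch (illustrative only; the helper and variable names are hypothetical, and this is not the cuml implementation): if each partition's estimator computes inertia against only its own data, exposing the first estimator's inertia_ undercounts the global value.

```python
import numpy as np


def inertia(X, centers):
    """Sum of squared distances from each point to its nearest center."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()


# Shared centers, with the dataset split into two dask-style partitions.
centers = np.array([[0.0], [10.0]])
part1 = np.array([[0.5], [9.5]])
part2 = np.array([[1.0], [11.0]])

global_inertia = inertia(np.vstack([part1, part2]), centers)
first_partition_only = inertia(part1, centers)

# Reporting only the first partition's inertia undercounts the global
# value whenever later partitions are nonempty.
assert first_partition_only < global_inertia
```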

@jcrist
Member Author

jcrist commented Sep 8, 2025

Did a quick chat offline, this is not correct (though it's disconcerting that our tests pass). Only cuvs::cluster::kmeans::fit auto-dispatches to multi-gpu, cuvs::cluster::kmeans::fit_predict uses the single GPU path. I'll revisit this tomorrow.

@jcrist jcrist added the DO NOT MERGE Hold off on merging; see PR for details label Sep 8, 2025
@jcrist jcrist requested a review from a team as a code owner September 10, 2025 21:42
@jcrist jcrist requested a review from vyasr September 10, 2025 21:42
@jcrist jcrist removed the DO NOT MERGE Hold off on merging; see PR for details label Sep 10, 2025
@jcrist jcrist force-pushed the cleanup-kmeans branch 2 times, most recently from da343c1 to c3172e9 Compare September 10, 2025 21:44
@jcrist
Member Author

jcrist commented Sep 10, 2025

The scope of this PR has increased since some changes were required to support the dask API (this also now does fix the dask bug #7037). I've updated the description at the top.

I wouldn't be surprised if the sklearn tests fail (I cannot reproduce these failures locally, see comment above), but beyond that I believe this should be good for another round of review.

Member Author

@jcrist jcrist left a comment


I'm a bit lost as to why the changes in the xfail list are happening in CI but not locally (3 pass in CI but fail on my machine, 1 fails in CI but passes locally). I've marked them all as flaky for now since I'm a bit stuck as to what's happening. Will open up a followup issue to debug more later, but I don't believe those changes should block getting this in.

tests:
- "sklearn.cluster.tests.test_k_means::test_dense_sparse[42-KMeans-X_csr0]"
- "sklearn.cluster.tests.test_k_means::test_dense_sparse[42-KMeans-X_csr1]"
- "sklearn.cluster.tests.test_k_means::test_weighted_vs_repeated[42]"
Member Author


These are now passing in CI, but always fail for me locally. Punting for now and marking them as strict: false.

condition: scikit-learn>=1.7
strict: false
tests:
- "sklearn.mixture.tests.test_gaussian_mixture::test_gaussian_mixture_precisions_init_diag[float64]"
Member Author


This now always fails in CI, but passes locally. The only bit of this test that hits a cuml.accel codepath is the part generating the initial labels used to estimate the covariance. Big 🤷 as to what changed here; running things locally I see no difference in the output of KMeans. Punting for now and marking as flaky.

@csadorf
Contributor

csadorf commented Sep 15, 2025

Running this locally with ./python/cuml/cuml_accel_tests/upstream/scikit-learn/run-tests.sh -k test_k_means, I get the following errors (similar to CI I believe):

FAILED python/cuml/cuml_accel_tests/upstream/scikit-learn/cluster/tests/test_k_means.py::test_all_init[KMeans-random-dense] - AssertionError: assert 0.0 > 0.0
FAILED python/cuml/cuml_accel_tests/upstream/scikit-learn/cluster/tests/test_k_means.py::test_all_init[KMeans-k-means++-dense] - assert 0.0 > 0.0
FAILED python/cuml/cuml_accel_tests/upstream/scikit-learn/cluster/tests/test_k_means.py::test_all_init[KMeans-ndarray-dense] - assert 0.0 > 0.0
FAILED python/cuml/cuml_accel_tests/upstream/scikit-learn/cluster/tests/test_k_means.py::test_kmeans_copyx - assert 0.0 > 0.0
FAILED python/cuml/cuml_accel_tests/upstream/scikit-learn/cluster/tests/test_k_means.py::test_weighted_vs_repeated[42]

@jcrist
Member Author

jcrist commented Sep 15, 2025

test_all_init and test_kmeans_copyx should be due to a bug I fixed in 6902f1b (fetch the latest version, I suspect you're testing on older code). test_weighted_vs_repeated is one that passes in CI and fails for me locally.

Contributor

@viclafargue viclafargue left a comment


Thanks! LGTM

@jcrist
Member Author

jcrist commented Sep 16, 2025

/merge

@rapids-bot rapids-bot Bot merged commit 53c9237 into rapidsai:branch-25.10 Sep 16, 2025
106 checks passed
@jcrist jcrist deleted the cleanup-kmeans branch September 16, 2025 14:41

Labels

CMake (CUDA/C++) · Cython / Python (Cython or Python issue) · improvement (Improvement / enhancement to an existing function) · non-breaking (Non-breaking change)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] cuML KMeans fails to clone when init is an array of custom centers (mis-handled as preset) [BUG] cuml.dask.cluster kmeans returns 0 inertia

5 participants