
Support non-numeric class labels everywhere#7480

Merged
gforsyth merged 9 commits into rapidsai:release/25.12 from jcrist:standardize-classifier-outputs
Nov 18, 2025

Conversation

@jcrist
Member

@jcrist jcrist commented Nov 12, 2025

This PR:

  • Standardizes our class label preprocessing (validation, conversion to monotonic numeric indices and classes), and uses it in (almost) all our classifiers
  • Standardizes our class label post-processing (conversion of predicted monotonic numeric indices back to class labels), and uses it in (almost) all our classifiers
  • Adds support for non-numeric class labels to (almost) all our classifiers
  • Fixes > 30 xfailed sklearn tests

This is a slight breaking change, in that the classes_ attribute for RandomForestClassifier/SVC/LinearSVC/KNeighborsClassifier is now always a numpy array (or a list of numpy arrays). We already made a similar change to LogisticRegression several releases ago for the same reason; this just applies that change everywhere else. Note that for most of these models the classes_ attribute was entirely undocumented.

With this change all classifiers (excluding MBSGDClassifier) now support non-numeric class labels, while before only LogisticRegression did.
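The new behavior can be illustrated with plain numpy (a sketch of the semantics, not cuML's actual code): even for string labels, classes_ ends up a sorted numpy array, and predictions map back through it.

```python
import numpy as np

# String labels; after fitting, classes_ is a sorted numpy array.
y = np.array(["cat", "dog", "cat", "bird"])
classes_, y_encoded = np.unique(y, return_inverse=True)

print(classes_)   # ['bird' 'cat' 'dog'] -- always a numpy array
print(y_encoded)  # [1 2 1 0] -- monotonic numeric indices used internally

# Predicted indices decode back to the original (string) labels:
print(classes_[np.array([0, 2])])  # ['bird' 'dog']
```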

Fixes #6267
Fixes #4169
Fixes #5684

@jcrist jcrist self-assigned this Nov 12, 2025
@jcrist jcrist requested a review from a team as a code owner November 12, 2025 23:18
@jcrist jcrist requested a review from csadorf November 12, 2025 23:18
@jcrist jcrist added the improvement, breaking, algo: linear-model, cuml-accel, algo: svm, and algo: random-forest labels Nov 12, 2025
@github-actions github-actions Bot added the Cython / Python label Nov 12, 2025
@jcrist
Member Author

jcrist commented Nov 12, 2025

This is still WIP, just pushing it up for now. It at least still needs a few new tests and probably a bunch of xfail-list updates.

@jcrist jcrist changed the title from Support non-numeric class labels everywhere to [WIP] Support non-numeric class labels everywhere Nov 12, 2025
@jcrist jcrist force-pushed the standardize-classifier-outputs branch 2 times, most recently from 110a778 to 2c801fb Compare November 14, 2025 18:01
@jcrist jcrist force-pushed the standardize-classifier-outputs branch from 2c801fb to 88bf1dd Compare November 15, 2025 21:59
Extracts our label pre/post processing routines from
`LogisticRegression` into two common utilities to be used in other
classifiers.
These files should really be rewritten (and in the case of
`test_input_estimators.py`, perhaps just deleted). In most cases the
test isn't really testing anything, and all failures here were from
passing regression data to classifiers (which now errors appropriately).
@jcrist jcrist force-pushed the standardize-classifier-outputs branch from 88bf1dd to 3fcb14f Compare November 17, 2025 02:48
@jcrist jcrist changed the title from [WIP] Support non-numeric class labels everywhere to Support non-numeric class labels everywhere Nov 17, 2025
Member Author

@jcrist jcrist left a comment


This is ready for review.

)


def preprocess_labels(
Member Author


The logic in preprocess_labels and decode_labels is mostly the same logic we already had in LogisticRegression, just extracted and generalized so it can apply to all our classifiers.
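A minimal sketch of what such shared helpers could look like. The names preprocess_labels and decode_labels come from the PR, but the bodies below are illustrative numpy-only stand-ins, not cuML's actual implementation:

```python
import numpy as np

def preprocess_labels(y):
    """Validate labels and encode them as monotonic numeric indices."""
    y = np.asarray(y).ravel()
    if y.size == 0:
        raise ValueError("y must be non-empty")
    # np.unique gives the sorted unique labels plus, for each sample,
    # the index of its label in that array (the numeric encoding).
    classes, y_encoded = np.unique(y, return_inverse=True)
    if len(classes) < 2:
        raise ValueError("y must contain at least two classes")
    return classes, y_encoded.astype(np.int32)

def decode_labels(indices, classes):
    """Map predicted numeric indices back to the original class labels."""
    return classes.take(indices)

classes, enc = preprocess_labels(["b", "a", "b", "c"])
print(classes, enc)                          # ['a' 'b' 'c'] [1 0 1 2]
print(decode_labels(np.array([2, 0]), classes))  # ['c' 'a']
```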

)
return preds
inds = fil.predict(X, threshold=threshold).to_output("cupy")
with cuml.internals.exit_internal_api():
Member Author


This is gross, but necessary so that with cuml.using_output_type(...) actually works on the predict method. This was broken before in LogisticRegression but is now fixed (and tested in the generic test). I hope this won't be a long-lived hack, given the upcoming refactors we're considering for output type reflection.

if index is not None:
if convert_to_mem_type is MemoryType.host and isinstance(
if not isinstance(index, (pd.Index, cudf.Index)):
index = None
Member Author


Without this, input_to_cuml_array([1, 2, 3]) would pick up an index attribute, since list.index exists (but is a method).
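The gotcha is easy to reproduce in plain Python; FakeIndex below is a hypothetical stand-in for the real index types (pd.Index / cudf.Index):

```python
class FakeIndex:
    """Stand-in for the real index types (pd.Index / cudf.Index)."""

def extract_index(obj):
    # Naive attribute sniffing: every list "has" an index attribute,
    # because list.index is a bound method.
    index = getattr(obj, "index", None)
    # The fix: only keep genuine index objects, drop everything else.
    if not isinstance(index, FakeIndex):
        index = None
    return index

print(extract_index([1, 2, 3]))  # None, despite list.index existing

class Frame:
    index = FakeIndex()

print(type(extract_index(Frame())).__name__)  # FakeIndex
```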

)

if check_cols:
if check_cols is not False:
Member Author


This and the equivalent change below are necessary so check_cols=0/check_rows=0 still applies the check.
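The falsy-zero pitfall can be shown in isolation; check_cols_of below is an illustrative stand-in for the validation in input_to_cuml_array, not the actual code:

```python
def check_cols_of(n_cols, check_cols=False):
    """Validate column count only when a check is requested."""
    # Buggy version: `if check_cols:` silently skips the check when
    # check_cols == 0, because 0 is falsy. Comparing against False
    # explicitly keeps check_cols=0 meaningful.
    if check_cols is not False:
        if n_cols != check_cols:
            raise ValueError(f"expected {check_cols} columns, got {n_cols}")

check_cols_of(0, check_cols=0)  # ok: zero columns expected and found
check_cols_of(5)                # ok: no check requested
try:
    check_cols_of(3, check_cols=0)  # now correctly raises
except ValueError as e:
    print(e)
```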

X, y = make_regression(
n_samples=nrows, n_features=ncols, n_informative=ninfo, random_state=0
)
def make_dataset(dtype, nrows, ncols, ninfo, is_classifier):
Member Author


Before the tests in this file were always run with regression data, which now errors nicely for classifiers (as it does in sklearn). The updates here just ensure we're testing classifiers with classification inputs rather than regression inputs.
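In spirit, the fixed helper picks the target type based on the estimator kind. A hedged numpy-only sketch (make_dataset here is a hypothetical stand-in, not the test file's exact code):

```python
import numpy as np

def make_dataset(nrows, ncols, is_classifier, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((nrows, ncols))
    if is_classifier:
        # Discrete class labels -- valid input for classifiers.
        y = rng.integers(0, 2, size=nrows)
    else:
        # Continuous targets -- feeding these to a classifier now
        # errors nicely, as in sklearn.
        y = X @ rng.standard_normal(ncols)
    return X, y

X, y = make_dataset(20, 4, is_classifier=True)
print(sorted(set(y.tolist())))  # subset of [0, 1]
```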

X_test = X[~train_selection]
y_test = y[~train_selection]

if datatype == "dataframe":
Member Author


There are two types of changes in this file:

  • Using "cudf" instead of "dataframe" for clarity on the input type
  • Switching y to be a Series instead of a DataFrame. We now match sklearn's behavior of warning when passed a y of shape (n_samples, 1), informing the user to use a 1D input instead. The tests here were using a DataFrame for y, which would trigger that warning; I updated them to use a Series for y so we weren't warning in our own tests.
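The behavior being matched resembles sklearn's column-vector-y warning; a sketch of such a check (illustrative, not cuML's exact code):

```python
import warnings
import numpy as np

def check_target(y):
    """Warn and ravel when y arrives as a (n_samples, 1) column vector."""
    y = np.asarray(y)
    if y.ndim == 2 and y.shape[1] == 1:
        # Mirrors sklearn's behavior for a 2D single-column target.
        warnings.warn(
            "A column-vector y was passed when a 1d array was expected; "
            "please change the shape of y to (n_samples,).",
            stacklevel=2,
        )
        y = y.ravel()
    return y

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y = check_target(np.ones((5, 1)))
print(y.shape, len(caught))  # (5,) 1
```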

@pytest.mark.parametrize("algo", [cuLog])
# ignoring warning about change of solver
@pytest.mark.filterwarnings("ignore::UserWarning:cuml[.*]")
def test_linear_models_set_params(algo):
Member Author


This test was no longer necessary with #7433 (since set_params is now trivially Base.set_params, instead of a complicated custom wrapper).
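For context, a trivial set_params of the kind #7433 moved to might look like this (a sketch, not cuML's actual Base):

```python
class Base:
    """Minimal base class with a generic, attribute-based set_params."""

    def set_params(self, **params):
        for name, value in params.items():
            # Reject unknown parameters instead of silently setting them.
            if not hasattr(self, name):
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self

class Model(Base):
    def __init__(self, alpha=1.0):
        self.alpha = alpha

m = Model().set_params(alpha=0.5)
print(m.alpha)  # 0.5
```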

"MBSGDClassifier",
"RandomForestClassifier",
"KNeighborsClassifier",
"LogisticRegression",
Member Author


LogisticRegression is a classifier, not a regressor.

if is_classifier(model):
X_train, y_train, X_test = make_classification_dataset(
datatype, nrows, ncols, n_info, 2
)
Member Author


Same as test_input_estimators, before we were testing all estimators with regression data, which now errors nicely (matching sklearn behavior) for classifiers. Tests updated to test classifiers with classification data.

out = cudf.Series(classes).take(y_encoded).reset_index(drop=True)

# Coerce result to requested output_type
if isinstance(out, CumlArray):
Member Author


This is messy, but is basically a generalization of what we already added to LogisticRegression months ago. I hope we can simplify this a bunch when we cleanup our output type handling.

@jcrist jcrist changed the base branch from main to release/25.12 November 17, 2025 17:11
Contributor

@viclafargue viclafargue left a comment


Thanks! LGTM. I could not uncover any issues in the preprocess_labels and decode_labels functions, but another pair of eyes would be welcome.

Comment thread python/cuml/cuml/linear_model/logistic_regression.py
Comment thread python/cuml/cuml/linear_model/logistic_regression.py
Comment thread python/cuml/cuml/neighbors/kneighbors_classifier.pyx
Comment thread python/cuml/cuml/svm/svc.py
Comment thread python/cuml/tests/test_base.py
@jcrist
Member Author

jcrist commented Nov 18, 2025

/merge

@gforsyth gforsyth merged commit e39cbd3 into rapidsai:release/25.12 Nov 18, 2025
214 of 216 checks passed

Labels

algo: linear-model, algo: neighbors, algo: random-forest, algo: svm, breaking, cuml-accel, Cython / Python, improvement

Projects

None yet

4 participants