
Support non-numeric class labels everywhere#7480

Merged
gforsyth merged 9 commits into rapidsai:release/25.12 from jcrist:standardize-classifier-outputs
Nov 18, 2025

Conversation

@jcrist
Member

@jcrist jcrist commented Nov 12, 2025

This PR:

  • Standardizes our class label preprocessing (validation, conversion to monotonic numeric indices and classes), and uses it in (almost) all our classifiers
  • Standardizes our class label post-processing (conversion of predicted monotonic numeric indices back to class labels), and uses it in (almost) all our classifiers
  • Adds support for non-numeric class labels to (almost) all our classifiers
  • Fixes > 30 xfailed sklearn tests

This is a slight breaking change, in that the classes_ attribute for RandomForestClassifier/SVC/LinearSVC/KNeighborsClassifier is now always a numpy array (or a list of numpy arrays). We already made a similar change to LogisticRegression several releases ago for the same reason; this just applies that change everywhere else. Note that for most of these models the classes_ attribute was entirely undocumented.

With this change all classifiers (excluding MBSGDClassifier) now support non-numeric class labels, while before only LogisticRegression did.
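The new behavior can be illustrated with plain numpy (a sketch of the semantics, not cuML's actual code): even for string labels, classes_ ends up a sorted numpy array, and predictions map back through it.

```python
import numpy as np

# String labels; after fitting, classes_ is a sorted numpy array.
y = np.array(["cat", "dog", "cat", "bird"])
classes_, y_encoded = np.unique(y, return_inverse=True)

print(classes_)   # ['bird' 'cat' 'dog'] -- always a numpy array
print(y_encoded)  # [1 2 1 0] -- monotonic numeric indices used internally

# Predicted indices decode back to the original (string) labels:
print(classes_[np.array([0, 2])])  # ['bird' 'dog']
```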

Fixes #6267
Fixes #4169
Fixes #5684

@jcrist jcrist self-assigned this Nov 12, 2025
@jcrist jcrist requested a review from a team as a code owner November 12, 2025 23:18
@jcrist jcrist requested a review from csadorf November 12, 2025 23:18
@jcrist jcrist added the improvement, breaking, algo: linear-model, cuml-accel, algo: svm, and algo: random-forest labels Nov 12, 2025
@github-actions github-actions Bot added the Cython / Python label Nov 12, 2025
@jcrist
Member Author

jcrist commented Nov 12, 2025

This is still WIP, just pushing it up for now. It at least still needs a few new tests and probably a bunch of xfail-list updates.

@jcrist jcrist changed the title from Support non-numeric class labels everywhere to [WIP] Support non-numeric class labels everywhere Nov 12, 2025
@jcrist jcrist force-pushed the standardize-classifier-outputs branch 2 times, most recently from 110a778 to 2c801fb Compare November 14, 2025 18:01
@jcrist jcrist force-pushed the standardize-classifier-outputs branch from 2c801fb to 88bf1dd Compare November 15, 2025 21:59
Extracts our label pre/post processing routines from
`LogisticRegression` into two common utilities to be used in other
classifiers.
These files should really be rewritten (and in the case of
`test_input_estimators.py`, perhaps just deleted). In most cases the
test isn't really testing anything, and all failures here were from
passing regression data to classifiers (which now errors appropriately).
@jcrist jcrist force-pushed the standardize-classifier-outputs branch from 88bf1dd to 3fcb14f Compare November 17, 2025 02:48
@jcrist jcrist changed the title from [WIP] Support non-numeric class labels everywhere to Support non-numeric class labels everywhere Nov 17, 2025
Member Author

@jcrist jcrist left a comment


This is ready for review.

)


def preprocess_labels(
Member Author


The logic in preprocess_labels and decode_labels is mostly the same logic we already had in LogisticRegression, just extracted and generalized so it can apply to all our classifiers.
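A minimal sketch of what such shared helpers could look like. The names preprocess_labels and decode_labels come from the PR, but the bodies below are illustrative numpy-only stand-ins, not cuML's actual implementation:

```python
import numpy as np

def preprocess_labels(y):
    """Validate labels and encode them as monotonic numeric indices."""
    y = np.asarray(y).ravel()
    if y.size == 0:
        raise ValueError("y must be non-empty")
    # np.unique gives the sorted unique labels plus, for each sample,
    # the index of its label in that array (the numeric encoding).
    classes, y_encoded = np.unique(y, return_inverse=True)
    if len(classes) < 2:
        raise ValueError("y must contain at least two classes")
    return classes, y_encoded.astype(np.int32)

def decode_labels(indices, classes):
    """Map predicted numeric indices back to the original class labels."""
    return classes.take(indices)

classes, enc = preprocess_labels(["b", "a", "b", "c"])
print(classes, enc)                          # ['a' 'b' 'c'] [1 0 1 2]
print(decode_labels(np.array([2, 0]), classes))  # ['c' 'a']
```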

)
return preds
inds = fil.predict(X, threshold=threshold).to_output("cupy")
with cuml.internals.exit_internal_api():
Member Author


This is gross, but necessary so that with cuml.using_output_type(...) actually works on the predict method. This was broken before in LogisticRegression but is now fixed (and tested in the generic test). I hope this won't be a long-lived hack, given the upcoming refactors we're considering for output type reflection.

if index is not None:
if convert_to_mem_type is MemoryType.host and isinstance(
if not isinstance(index, (pd.Index, cudf.Index)):
index = None
Member Author


Without this, input_to_cuml_array([1, 2, 3]) would pick up an index attribute, since list.index exists (but is a method).
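The gotcha is easy to reproduce in plain Python; FakeIndex below is a hypothetical stand-in for the real index types (pd.Index / cudf.Index):

```python
class FakeIndex:
    """Stand-in for the real index types (pd.Index / cudf.Index)."""

def extract_index(obj):
    # Naive attribute sniffing: every list "has" an index attribute,
    # because list.index is a bound method.
    index = getattr(obj, "index", None)
    # The fix: only keep genuine index objects, drop everything else.
    if not isinstance(index, FakeIndex):
        index = None
    return index

print(extract_index([1, 2, 3]))  # None, despite list.index existing

class Frame:
    index = FakeIndex()

print(type(extract_index(Frame())).__name__)  # FakeIndex
```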

)

if check_cols:
if check_cols is not False:
Member Author


This and the equivalent change below are necessary so check_cols=0/check_rows=0 still applies the check.
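The falsy-zero pitfall can be shown in isolation; check_cols_of below is an illustrative stand-in for the validation in input_to_cuml_array, not the actual code:

```python
def check_cols_of(n_cols, check_cols=False):
    """Validate column count only when a check is requested."""
    # Buggy version: `if check_cols:` silently skips the check when
    # check_cols == 0, because 0 is falsy. Comparing against False
    # explicitly keeps check_cols=0 meaningful.
    if check_cols is not False:
        if n_cols != check_cols:
            raise ValueError(f"expected {check_cols} columns, got {n_cols}")

check_cols_of(0, check_cols=0)  # ok: zero columns expected and found
check_cols_of(5)                # ok: no check requested
try:
    check_cols_of(3, check_cols=0)  # now correctly raises
except ValueError as e:
    print(e)
```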

X, y = make_regression(
n_samples=nrows, n_features=ncols, n_informative=ninfo, random_state=0
)
def make_dataset(dtype, nrows, ncols, ninfo, is_classifier):
Member Author


Before the tests in this file were always run with regression data, which now errors nicely for classifiers (as it does in sklearn). The updates here just ensure we're testing classifiers with classification inputs rather than regression inputs.
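In spirit, the fixed helper picks the target type based on the estimator kind. A hedged numpy-only sketch (make_dataset here is a hypothetical stand-in, not the test file's exact code):

```python
import numpy as np

def make_dataset(nrows, ncols, is_classifier, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((nrows, ncols))
    if is_classifier:
        # Discrete class labels -- valid input for classifiers.
        y = rng.integers(0, 2, size=nrows)
    else:
        # Continuous targets -- feeding these to a classifier now
        # errors nicely, as in sklearn.
        y = X @ rng.standard_normal(ncols)
    return X, y

X, y = make_dataset(20, 4, is_classifier=True)
print(sorted(set(y.tolist())))  # subset of [0, 1]
```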

X_test = X[~train_selection]
y_test = y[~train_selection]

if datatype == "dataframe":
Member Author


There are two types of changes in this file:

  • Using "cudf" instead of "dataframe" for clarity on the input type
  • Switching y to be a Series instead of a DataFrame. We now match sklearn's behavior of warning when passed a y of shape (n_samples, 1), informing the user to use a 1D input instead. The tests here were using a DataFrame for y, which would trigger that warning; I updated them to use a Series for y so we weren't warning in our own tests.
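The behavior being matched resembles sklearn's column-vector-y warning; a sketch of such a check (illustrative, not cuML's exact code):

```python
import warnings
import numpy as np

def check_target(y):
    """Warn and ravel when y arrives as a (n_samples, 1) column vector."""
    y = np.asarray(y)
    if y.ndim == 2 and y.shape[1] == 1:
        # Mirrors sklearn's behavior for a 2D single-column target.
        warnings.warn(
            "A column-vector y was passed when a 1d array was expected; "
            "please change the shape of y to (n_samples,).",
            stacklevel=2,
        )
        y = y.ravel()
    return y

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y = check_target(np.ones((5, 1)))
print(y.shape, len(caught))  # (5,) 1
```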

@pytest.mark.parametrize("algo", [cuLog])
# ignoring warning about change of solver
@pytest.mark.filterwarnings("ignore::UserWarning:cuml[.*]")
def test_linear_models_set_params(algo):
Member Author


This test was no longer necessary with #7433 (since set_params is now trivially Base.set_params, instead of a complicated custom wrapper).
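For context, a trivial set_params of the kind #7433 moved to might look like this (a sketch, not cuML's actual Base):

```python
class Base:
    """Minimal base class with a generic, attribute-based set_params."""

    def set_params(self, **params):
        for name, value in params.items():
            # Reject unknown parameters instead of silently setting them.
            if not hasattr(self, name):
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self

class Model(Base):
    def __init__(self, alpha=1.0):
        self.alpha = alpha

m = Model().set_params(alpha=0.5)
print(m.alpha)  # 0.5
```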

"MBSGDClassifier",
"RandomForestClassifier",
"KNeighborsClassifier",
"LogisticRegression",
Member Author


LogisticRegression is a classifier, not a regressor.

if is_classifier(model):
X_train, y_train, X_test = make_classification_dataset(
datatype, nrows, ncols, n_info, 2
)
Member Author


Same as test_input_estimators, before we were testing all estimators with regression data, which now errors nicely (matching sklearn behavior) for classifiers. Tests updated to test classifiers with classification data.

out = cudf.Series(classes).take(y_encoded).reset_index(drop=True)

# Coerce result to requested output_type
if isinstance(out, CumlArray):
Member Author


This is messy, but is basically a generalization of what we already added to LogisticRegression months ago. I hope we can simplify this a bunch when we cleanup our output type handling.

@jcrist jcrist changed the base branch from main to release/25.12 November 17, 2025 17:11
Contributor

@viclafargue viclafargue left a comment


Thanks! LGTM. I could not uncover any issues in the preprocess_labels and decode_labels functions, but another pair of eyes would be welcome.

Comment thread python/cuml/cuml/linear_model/logistic_regression.py
Comment thread python/cuml/cuml/linear_model/logistic_regression.py
Comment thread python/cuml/cuml/neighbors/kneighbors_classifier.pyx
Comment thread python/cuml/cuml/svm/svc.py
Comment thread python/cuml/tests/test_base.py
@jcrist
Member Author

jcrist commented Nov 18, 2025

/merge

@gforsyth gforsyth merged commit e39cbd3 into rapidsai:release/25.12 Nov 18, 2025
214 of 216 checks passed

Labels

algo: linear-model, algo: neighbors, algo: random-forest, algo: svm, breaking, cuml-accel, Cython / Python, improvement

Projects

None yet

4 participants