Support non-numeric class labels everywhere #7480

gforsyth merged 9 commits into rapidsai:release/25.12 from
Conversation
This is still WIP, just pushing it up for now. It at least still needs a few new tests and probably a bunch of xfail-list updates.
Extracts our label pre/post processing routines from `LogisticRegression` into two common utilities to be used in other classifiers.
These files should really be rewritten (and in the case of `test_input_estimators.py`, perhaps just deleted). In most cases the test isn't really testing anything, and all failures here were from passing regression data to classifiers (which now errors appropriately).
jcrist left a comment
This is ready for review.
def preprocess_labels(
The logic in preprocess_labels and decode_labels is mostly the same logic we already had in LogisticRegression, just extracted and generalized so it can apply to all our classifiers.
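As a rough illustration (not cuML's actual implementation), the shared encode/decode pattern this comment describes can be sketched with numpy alone. The function names mirror the ones in the diff, but the bodies here are simplified assumptions:

```python
import numpy as np

def preprocess_labels(y):
    # Simplified sketch: map arbitrary (possibly non-numeric) class labels
    # to contiguous integer codes, remembering the sorted unique classes.
    classes, y_encoded = np.unique(y, return_inverse=True)
    return y_encoded, classes

def decode_labels(y_encoded, classes):
    # Inverse operation: map integer codes back to the original labels.
    return classes.take(y_encoded)
```

A round trip preserves the original label values: `decode_labels(*preprocess_labels(y))` returns `y`'s elements in their original order, with the dtype taken from `classes`.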
inds = fil.predict(X, threshold=threshold).to_output("cupy")
with cuml.internals.exit_internal_api():
This is gross, but necessary so that `with cuml.using_output_type(...)` actually works on the predict method. This was broken before in `LogisticRegression`, but is now fixed (and tested in the generic test). Hopefully this won't be a long-lived hack, given the type-reflection refactors we're considering.
if not isinstance(index, (pd.Index, cudf.Index)):
    index = None
Without this, `input_to_cuml_array([1, 2, 3])` would pick up an `index` attribute, since `list.index` exists (but is a method, not an index).
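To see why the guard matters: `getattr([1, 2, 3], "index")` succeeds because `list.index` is a bound method, so a bare `hasattr` check is not enough. A minimal sketch of the fixed extraction logic (hypothetical helper name, pandas-only for brevity):

```python
import pandas as pd

def extract_index(obj):
    # A plain hasattr(obj, "index") is True for lists, since list.index
    # is a method. Only a real pandas Index should be treated as an index.
    index = getattr(obj, "index", None)
    if not isinstance(index, pd.Index):
        index = None
    return index
```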
-if check_cols:
+if check_cols is not False:
This and the equivalent change below are necessary so `check_cols=0` / `check_rows=0` still applies the check.
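The subtlety is that `0` is falsy in Python, so `if check_cols:` silently skips the check when a caller asks for exactly zero columns. A minimal standalone sketch (hypothetical function, not the cuML code):

```python
def validate_cols(n_cols, check_cols=False):
    # `if check_cols:` would treat check_cols=0 like "no check requested".
    # Comparing against the False sentinel keeps 0 as a real constraint.
    if check_cols is not False:
        if n_cols != check_cols:
            raise ValueError(f"expected {check_cols} columns, got {n_cols}")
```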
X, y = make_regression(
    n_samples=nrows, n_features=ncols, n_informative=ninfo, random_state=0
)

def make_dataset(dtype, nrows, ncols, ninfo, is_classifier):
Before, the tests in this file were always run with regression data, which now errors nicely for classifiers (as it does in sklearn). The updates here just ensure we're testing classifiers with classification inputs rather than regression inputs.
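A sketch of what such a fixture might look like (hypothetical body; the real test presumably uses sklearn's `make_classification` / `make_regression` generators, as the diff context above suggests):

```python
import numpy as np

def make_dataset(dtype, nrows, ncols, ninfo, is_classifier, seed=0):
    # Hypothetical stand-in: classifiers get discrete labels, regressors
    # get continuous targets. ninfo (informative features) is unused in
    # this sketch; sklearn's generators would honor it via n_informative.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((nrows, ncols)).astype(dtype)
    if is_classifier:
        y = rng.integers(0, 2, size=nrows)
    else:
        y = rng.standard_normal(nrows).astype(dtype)
    return X, y
```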
X_test = X[~train_selection]
y_test = y[~train_selection]

if datatype == "dataframe":
There are two types of changes in this file:

- Using `"cudf"` instead of `"dataframe"` for clarity on the input type
- Switching `y` to be a `Series` instead of a `DataFrame`. We now match sklearn's behavior of warning when passed a `y` of shape `(n_samples, 1)`, informing the user to use a 1D input instead. The tests here were using a `DataFrame` for `y`, which would warn; I updated those tests to use a `Series` for `y` so we weren't warning in our own tests.
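The column-vector warning described here can be sketched as follows (a simplified stand-in for the sklearn-style check, not cuML's actual code):

```python
import warnings
import numpy as np

def check_target(y):
    # Warn on a (n_samples, 1) column vector and flatten it, mirroring
    # sklearn's DataConversionWarning behavior for 2D targets.
    y = np.asarray(y)
    if y.ndim == 2 and y.shape[1] == 1:
        warnings.warn(
            "A column-vector y was passed when a 1d array was expected; "
            "pass a 1d array instead, e.g. y.ravel()."
        )
        y = y.ravel()
    return y
```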
@pytest.mark.parametrize("algo", [cuLog])
# ignoring warning about change of solver
@pytest.mark.filterwarnings("ignore::UserWarning:cuml[.*]")
def test_linear_models_set_params(algo):
This test was no longer necessary after #7433 (since `set_params` is now trivially `Base.set_params`, instead of a complicated custom wrapper).
"MBSGDClassifier",
"RandomForestClassifier",
"KNeighborsClassifier",
"LogisticRegression",
`LogisticRegression` is a classifier, not a regressor.
if is_classifier(model):
    X_train, y_train, X_test = make_classification_dataset(
        datatype, nrows, ncols, n_info, 2
    )
Same as `test_input_estimators`: before, we were testing all estimators with regression data, which now errors nicely (matching sklearn behavior) for classifiers. Tests updated to test classifiers with classification data.
out = cudf.Series(classes).take(y_encoded).reset_index(drop=True)

# Coerce result to requested output_type
if isinstance(out, CumlArray):
This is messy, but is basically a generalization of what we already added to `LogisticRegression` months ago. I hope we can simplify this a bunch when we clean up our output type handling.
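In spirit, the coercion step is a dispatch from an internal array to whatever container the user requested. A much-simplified sketch with plain numpy/pandas standing in for cuML's array types (the names and branches here are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def coerce_output(out, output_type):
    # Illustrative dispatcher: convert the internal result to the
    # user-requested container type. cuML's real logic also handles
    # cupy, cudf, and CumlArray inputs, which are omitted here.
    arr = np.asarray(out)
    if output_type == "numpy":
        return arr
    if output_type == "series":
        return pd.Series(arr)
    if output_type == "dataframe":
        return pd.DataFrame(arr)
    raise ValueError(f"unsupported output_type: {output_type!r}")
```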
viclafargue
left a comment
Thanks! LGTM. I could not uncover any issues in the `preprocess_labels` and `decode_labels` functions, but another pair of eyes would be welcome.
/merge
This PR:
This is a slight breaking change, in that the `classes_` attribute for `RandomForestClassifier`/`SVC`/`LinearSVC`/`KNeighborsClassifier` is now always a numpy array (or a list of numpy arrays). We already made a similar change several releases ago for `LogisticRegression` for the same reason; this just applies that change everywhere else. Note that for most of these models the `classes_` attribute was fully undocumented.

With this change all classifiers (excluding `MBSGDClassifier`) now support non-numeric class labels, while before only `LogisticRegression` did.

Fixes #6267
Fixes #4169
Fixes #5684