TargetEncoder in cuml.accel (#7476)
```diff
-class TargetEncoder:
+class TargetEncoder(InteropMixin):
```
For everything to work properly you'll also need to move `TargetEncoder` to be a subclass of `Base`. Looking at the current implementation, I think this would entail:

- Adding `Base` to the base class list (it should go first, before the mixin).
- Updating the definition of `_get_param_names` to also include `super()._get_param_names()` (please also move this definition to the top, as we've done on other estimators).
- Ripping out the custom infra in the class like `_get_output_type`/`get_params`/... Basically everything that's not there to implement `fit`/`fit_transform`/`transform` should be moved to use the `Base` infra.
- Adding `CumlArray` return type annotations to `transform`/`fit_transform` to enable method type reflection.
- Possibly using a `CumlArrayDescriptor` to reflect fitted attributes, though from looking at the list in the sklearn docs I don't think that's necessary.
- Ensuring we have adequate test coverage for this estimator so we're not unexpectedly breaking things. Since this wasn't a `Base` subclass and wasn't doing type reflection the way we do elsewhere, I wouldn't be surprised if after this we see differences in behavior, but if we're moving towards our expected standard I'd view those "breaking changes" as bugfixes, since this estimator doesn't follow our conventions.
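To illustrate the cooperative `_get_param_names` pattern from the checklist, here is a minimal standalone sketch. The `Base` and `InteropMixin` classes below are stand-in stubs with hypothetical parameter names, not cuML's actual implementations:

```python
# Stand-in stubs illustrating the pattern; in cuML these would be
# cuml.internals.base.Base and the interop mixin.
class Base:
    @classmethod
    def _get_param_names(cls):
        return ["handle", "verbose", "output_type"]

class InteropMixin:
    pass

class TargetEncoder(Base, InteropMixin):  # Base first, before the mixin
    @classmethod
    def _get_param_names(cls):
        # Extend the parent's list rather than replacing it
        return super()._get_param_names() + [
            "n_folds", "smooth", "seed", "split_method", "stat",
        ]

print(TargetEncoder._get_param_names())
# → ['handle', 'verbose', 'output_type', 'n_folds', 'smooth', 'seed', 'split_method', 'stat']
```

With this shape, `Base`-provided machinery like `get_params` can discover the estimator's parameters without any custom infra in the subclass.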
Overall it looks like there's a bunch of cleanup work to do in this estimator, making this ready for cuml-accel is not necessarily a light lift.
@aamijar What do you need to move this forward?
Hi @csadorf, it should be a matter of addressing each of the upstream test failures now. We need to decide which ones to xfail and which ones to fix. I'm going to mark this as ready for review so CI can run and we can see whether any failures were missed when testing locally. I'm continuing to update the PR description with the test failures I'm currently seeing, after addressing them one by one.
csadorf left a comment:
It looks like there is still a lot of work we need to put into this implementation to actually accelerate the TargetEncoder with cuml.accel. cuML's TargetEncoder API differs in various ways, making it harder to accelerate with ZCC.
```diff
@@ -150,7 +150,7 @@ def test_targetencoder_pandas():
     answer = np.array([0.75, 0.5, 1.0, 0.75])
     assert array_equal(test_encoded, answer)
     print(type(test_encoded))
```
Unrelated fix: we should remove this `print` call.
…x attrs conversion, handle multi-feature with warnings
@csadorf @jcrist Pushed a commit that adds the things we need for the conversion and fixes a few things to get us closer; could really use your eyes on the changes in 8a651b1.

For the single-feature case it should work correctly with no approximation needed; this is the common use case. (sklearn encodes each feature independently, while cuML encodes feature combinations; see the example below.)

The current implementation in the commit uses an approximation for multi-feature conversion. Should we instead raise an error?

Arguments for approximation (current approach in the commit):

Arguments for raising an error:

I lean towards the error, but wanted to discuss what you think is the better approach?
If you wanna check the difference, here is some code and output:

```python
import numpy as np
import pandas as pd

# Sample data (need at least 5 rows for sklearn's default 5-fold CV)
X = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [1, 1],
              [2, 0],
              [2, 0],
              [2, 1],
              [2, 1]])
y = np.array([10, 10, 20, 20, 30, 30, 40, 40])

# === sklearn: encodes each feature independently ===
from sklearn.preprocessing import TargetEncoder as SklearnTE

sk_enc = SklearnTE(smooth=0, target_type='continuous')
sk_enc.fit(X, y)
print("sklearn encodings (per-feature):")
print(f"  Feature 0: {dict(zip(sk_enc.categories_[0], sk_enc.encodings_[0]))}")
print(f"  Feature 1: {dict(zip(sk_enc.categories_[1], sk_enc.encodings_[1]))}")

# Transform a single test point to see the output shape
X_test = np.array([[1, 0], [2, 1]])
sk_result = sk_enc.transform(X_test)
print(f"\nsklearn transform [1,0] and [2,1] (2 columns each):\n{sk_result}")

# === cuML: encodes feature combinations ===
from cuml.preprocessing import TargetEncoder as CumlTE
import cudf

X_df = cudf.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                       'B': [0, 0, 1, 1, 0, 0, 1, 1]})
y_series = cudf.Series([10, 10, 20, 20, 30, 30, 40, 40])
cu_enc = CumlTE(smooth=0)
cu_enc.fit(X_df, y_series)
print("\ncuML encode_all (combinations):")
print(cu_enc.encode_all.to_pandas())

X_test_cu = cudf.DataFrame({'A': [1, 2], 'B': [0, 1]})
cu_result = cu_enc.transform(X_test_cu)
print(f"\ncuML transform [1,0] and [2,1] (1 column):\n{cu_result}")
```

Output:

```
(rapids2512a1126) ➜  ~ python terepro.py
sklearn encodings (per-feature):
  Feature 0: {np.int64(1): np.float64(15.0), np.int64(2): np.float64(35.0)}
  Feature 1: {np.int64(0): np.float64(20.0), np.int64(1): np.float64(30.0)}

sklearn transform [1,0] and [2,1] (2 columns each):
[[15. 20.]
 [35. 30.]]

cuML encode_all (combinations):
   A  B  __TARGET___x  __TARGET___y  __TARGET_ENCODE__
0  1  1            40             2               20.0
1  2  1            80             2               40.0
2  1  0            20             2               10.0
3  2  0            60             2               30.0

cuML transform [1,0] and [2,1] (1 column):
[10. 40.]
```
csadorf left a comment:
Overall good improvement, but I have some comments.
…mes_out for set_output compatibility
@csadorf adding some notes here for documentation besides the xfail list.

The wrapper automatically falls back to sklearn when:

Xfailed tests:

All xfailed tests represent intentional implementation differences rather than bugs.
```python
        except Exception:
            self.target_type_ = "continuous"
```
We should narrow this down in a follow-up.
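One way that narrowing could look (a sketch only; `_infer_target_type` is a hypothetical stand-in for whatever inference the real code performs, not an actual cuML helper):

```python
# Sketch of narrowing the bare `except Exception`. `_infer_target_type`
# is a hypothetical stand-in for the actual target-type inference logic.
def _infer_target_type(y):
    raise ValueError("could not infer target type")

def fit_target_type(y):
    try:
        return _infer_target_type(y)
    except (ValueError, TypeError):
        # Only fall back for expected inference failures, so that
        # unrelated bugs (e.g. AttributeError) still surface.
        return "continuous"

print(fit_target_type([1.5, 2.5]))  # → continuous
```

Catching only the expected exception types keeps the fallback behavior while letting genuine bugs propagate.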
```python
        if random_state is None:
            seed = 42
        elif isinstance(random_state, int):
            seed = random_state
        else:
            # For RandomState objects, use a default seed
            # (the accel wrapper will fall back to CPU anyway)
            seed = 42
```
Is it really necessary for us to hard-code the random seed here?
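One alternative would be to derive a seed from the `RandomState` object rather than hard-coding one. A sketch with a hypothetical helper name, not necessarily the behavior we want here:

```python
import numpy as np

def seed_from_random_state(random_state, default=42):
    # Hypothetical helper: reuse ints directly, draw a seed from a
    # RandomState object, and fall back to a default only for None
    # or unrecognized inputs.
    if random_state is None:
        return default
    if isinstance(random_state, (int, np.integer)):
        return int(random_state)
    if isinstance(random_state, np.random.RandomState):
        return int(random_state.randint(0, 2**31 - 1))
    return default

print(seed_from_random_state(np.random.RandomState(0)))
```

This keeps runs seeded by the same `RandomState` reproducible instead of silently collapsing them all onto one fixed seed.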
```python
            "with sklearn. Use .ravel() if you need 1D output.",
            FutureWarning,
            stacklevel=4,
        )
```
We should create a follow-up issue to change this behavior in 26.04.
/merge
…7741) Adds documentation for the cuml.accel'eration of TargetEncoder. Follow-up to #7476.

Authors:
- Simon Adorf (https://github.com/csadorf)

Approvers:
- Dante Gama Dessavre (https://github.com/dantegd)

URL: #7741
Resolves rapidsai#7154

This PR adds support for TargetEncoder in cuml.accel. This feature was originally requested by the kaggle team.

TargetEncoder is a preprocessing step to convert categorical features like `"cat", "dog"` into numerical values. It uses the mean of each category's target values to obtain a numerical value.

There are API differences between cuML's and sklearn's implementations of TargetEncoder, and these differences must be handled when translating between CPU and GPU models.

| **cuML TargetEncoder param** | **sklearn TargetEncoder param** | **Transformation / Notes** |
|-------------------------------|------------------------------------|-----------------------------|
| `n_folds` | `cv` | Direct mapping |
| `seed` | `random_state` | If `random_state` is `None`, defaults to `42` |
| `smooth` | `smooth` | If `smooth == "auto"`, set to `1.0`; else `float(model.smooth)` |
| `split_method` | `shuffle` | `"random"` if `shuffle=True`, otherwise `"continuous"` |
| `output_type` | *(no sklearn equivalent)* | Always `"auto"` |
| `stat` | *(no sklearn equivalent)* | Always `"mean"` |
| *(no cuml equivalent)* | `categories` | Always `"auto"` |
| *(no cuml equivalent)* | `target_type` | Always `"continuous"` |

## Testing upstream

```bash
./run-tests.sh -k "targetencoder"
```

Current failures

```bash
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_n_features_in_after_fitting] - AssertionError: `TargetEncoder.fit()` does not set the `n_features_in_` attribute. You might want to use `sklearn.utils.validation.validate_data` instead of `check_array` in `TargetEncoder.fit()` which takes care of setting the attribute.
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_complex_data] - NotImplementedError: complex128 not supported
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_dtype_object] - cudf.errors.MixedTypeError: Cannot convert a floating of object type
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_estimators_empty_data_messages] - AssertionError: The estimator TargetEncoder does not raise a ValueError when an empty data is used to train. Perhaps use check_array in train.
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_estimators_pickle] - AttributeError: DataFrame object has no attribute to_output
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_estimators_pickle(readonly_memmap=True)] - AttributeError: DataFrame object has no attribute to_output
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_transformer_data_not_an_array] - AttributeError: '_NotAnArray' object has no attribute 'shape'
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_transformer_general] - AssertionError: The transformer TargetEncoder does not raise an error when the number of features in transform is different from the number of features in fit.
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_transformer_general(readonly_memmap=True)] - AssertionError: The transformer TargetEncoder does not raise an error when the number of features in transform is different from the number of features in fit.
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_methods_sample_order_invariance] - AssertionError:
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_methods_subset_invariance] - AssertionError:
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_n_features_in] - AssertionError
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_fit1d] - AssertionError: Did not raise: [<class 'ValueError'>]
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_fit2d_predict1d] - KeyError: '__FEA__'
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_requires_y_none] - TypeError: Input of type <class 'NoneType'> is not cudf.Series, or pandas.Series or numpy.ndarray or cupy.ndarray
FAILED tests/test_common.py::test_pandas_column_name_consistency[TargetEncoder()] - ValueError: Estimator does not have a feature_names_in_ attribute after fitting with a dataframe
FAILED tests/test_docstring_parameters.py::test_fit_docstring_attributes[TargetEncoder-TargetEncoder] - AssertionError: assert False
========== 17 failed, 43 passed, 1 skipped, 44347 deselected, 32 warnings in 15.23s ==========
```

## Testing local

```bash
pytest test_sklearn_import_export.py -k "target_encoder"
```

Current failures

```bash
AttributeError: 'TargetEncoder' object has no attribute 'categories_'. Did you mean: 'categories'?
============================== short test summary info ==============================
FAILED test_sklearn_import_export.py::test_target_encoder - AttributeError: 'TargetEncoder' object has no attribute 'categories_'. Did you m...
```

Authors:
- Anupam (https://github.com/aamijar)
- Dante Gama Dessavre (https://github.com/dantegd)
- Simon Adorf (https://github.com/csadorf)

Approvers:
- Simon Adorf (https://github.com/csadorf)
- James Lamb (https://github.com/jameslamb)

URL: rapidsai#7476
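The parameter mapping in the table above can be sketched as a plain translation function. This is an illustrative sketch with a hypothetical function name and dict shape, not the actual interop hook used in the PR:

```python
# Sketch: translate sklearn TargetEncoder params to cuML params per the
# mapping table above. Function name and return shape are hypothetical.
def cuml_params_from_sklearn(params):
    smooth = params.get("smooth", "auto")
    random_state = params.get("random_state")
    return {
        "n_folds": params.get("cv", 5),                      # cv -> n_folds
        "seed": random_state if isinstance(random_state, int) else 42,
        "smooth": 1.0 if smooth == "auto" else float(smooth),
        "split_method": "random" if params.get("shuffle", True) else "continuous",
        "output_type": "auto",                               # no sklearn equivalent
        "stat": "mean",                                      # no sklearn equivalent
    }

print(cuml_params_from_sklearn({"cv": 3, "smooth": "auto", "shuffle": False}))
# → {'n_folds': 3, 'seed': 42, 'smooth': 1.0, 'split_method': 'continuous',
#    'output_type': 'auto', 'stat': 'mean'}
```

sklearn-only params (`categories`, `target_type`) have no cuML counterpart and are pinned to `"auto"` and `"continuous"` respectively, as the table notes.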