TargetEncoder in cuml.accel #7476

Merged
rapids-bot[bot] merged 30 commits into rapidsai:release/26.02 from aamijar:TargetEncoder-cuml.accel
Jan 29, 2026

Conversation

@aamijar
Member

@aamijar aamijar commented Nov 11, 2025

Resolves #7154

This PR adds support for TargetEncoder in cuml.accel. This feature was originally requested by the kaggle team.

TargetEncoder is a preprocessing step that converts categorical features like "cat", "dog" into numerical values. It uses the mean of each category's target values to obtain a numerical value.
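As a toy illustration of the idea (not the PR's code), mean target encoding replaces each category with the mean target value observed for it:

```python
import numpy as np

# Toy illustration of mean target encoding; data is made up.
X = np.array(["cat", "dog", "cat", "dog"])
y = np.array([1.0, 0.0, 3.0, 2.0])

# Mean target per category: cat -> (1 + 3) / 2, dog -> (0 + 2) / 2
means = {str(c): float(y[X == c].mean()) for c in np.unique(X)}
encoded = np.array([means[v] for v in X])
print(means)    # {'cat': 2.0, 'dog': 1.0}
print(encoded)  # [2. 1. 2. 1.]
```

The real estimators additionally use cross-validation folds so that a sample's own target value does not leak into its encoding during training.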

There are API differences between cuML's and sklearn's implementations of TargetEncoder, and these differences must be handled when translating between CPU and GPU models.

| cuML TargetEncoder param | sklearn TargetEncoder param | Transformation / Notes |
|---|---|---|
| `n_folds` | `cv` | Direct mapping |
| `seed` | `random_state` | If `random_state` is `None`, defaults to `42` |
| `smooth` | `smooth` | If `smooth == "auto"`, set to `1.0`; else `float(model.smooth)` |
| `split_method` | `shuffle` | `"random"` if `shuffle=True`, otherwise `"continuous"` |
| `output_type` | *(no sklearn equivalent)* | Always `"auto"` |
| `stat` | *(no sklearn equivalent)* | Always `"mean"` |
| *(no cuML equivalent)* | `categories` | Always `"auto"` |
| *(no cuML equivalent)* | `target_type` | Always `"continuous"` |
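A minimal sketch of the table above as code (the helper name is illustrative, not the actual wrapper; `model` stands for a fitted sklearn `TargetEncoder`):

```python
from types import SimpleNamespace

# Hypothetical helper mirroring the parameter-mapping table; the real
# accel wrapper is structured differently, this only encodes the
# documented translations.
def cuml_params_from_sklearn(model):
    return {
        "n_folds": model.cv,
        "seed": 42 if model.random_state is None else model.random_state,
        "smooth": 1.0 if model.smooth == "auto" else float(model.smooth),
        "split_method": "random" if model.shuffle else "continuous",
        "output_type": "auto",
        "stat": "mean",
    }

# Example with sklearn's defaults (cv=5, smooth="auto", shuffle=True)
sk_defaults = SimpleNamespace(cv=5, random_state=None, smooth="auto", shuffle=True)
print(cuml_params_from_sklearn(sk_defaults))
```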

Testing upstream

```bash
./run-tests.sh -k "targetencoder"
```

Current failures

```bash
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_n_features_in_after_fitting] - AssertionError: `TargetEncoder.fit()` does not set the `n_features_in_` attribute. You might want to use `sklearn.utils.validation.validate_data` instead of `check_array` in `TargetEncoder.fit()` which takes care of setting the attribute.
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_complex_data] - NotImplementedError: complex128 not supported
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_dtype_object] - cudf.errors.MixedTypeError: Cannot convert a floating of object type
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_estimators_empty_data_messages] - AssertionError: The estimator TargetEncoder does not raise a ValueError when an empty data is used to train. Perhaps use check_array in train.
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_estimators_pickle] - AttributeError: DataFrame object has no attribute to_output
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_estimators_pickle(readonly_memmap=True)] - AttributeError: DataFrame object has no attribute to_output
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_transformer_data_not_an_array] - AttributeError: '_NotAnArray' object has no attribute 'shape'
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_transformer_general] - AssertionError: The transformer TargetEncoder does not raise an error when the number of features in transform is different from the number of features in fit.
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_transformer_general(readonly_memmap=True)] - AssertionError: The transformer TargetEncoder does not raise an error when the number of features in transform is different from the number of features in fit.
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_methods_sample_order_invariance] - AssertionError:
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_methods_subset_invariance] - AssertionError:
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_n_features_in] - AssertionError
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_fit1d] - AssertionError: Did not raise: [<class 'ValueError'>]
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_fit2d_predict1d] - KeyError: '__FEA__'
FAILED tests/test_common.py::test_estimators[TargetEncoder()-check_requires_y_none] - TypeError: Input of type <class 'NoneType'> is not cudf.Series, or pandas.Seriesor numpy.ndarrayor cupy.ndarray
FAILED tests/test_common.py::test_pandas_column_name_consistency[TargetEncoder()] - ValueError: Estimator does not have a feature_names_in_ attribute after fitting with a dataframe
FAILED tests/test_docstring_parameters.py::test_fit_docstring_attributes[TargetEncoder-TargetEncoder] - AssertionError: assert False
================================================================================================================================= 17 failed, 43 passed, 1 skipped, 44347 deselected, 32 warnings in 15.23s =================================================================================================================================
```

Testing local

```bash
pytest test_sklearn_import_export.py -k "target_encoder"
```

Current failures

```bash
AttributeError: 'TargetEncoder' object has no attribute 'categories_'. Did you mean: 'categories'?
============================== short test summary info ==============================
FAILED test_sklearn_import_export.py::test_target_encoder - AttributeError: 'TargetEncoder' object has no attribute 'categories_'. Did you m...
```

@copy-pr-bot

copy-pr-bot Bot commented Nov 11, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Cython / Python Cython or Python issue label Nov 11, 2025
@aamijar aamijar added cuml-accel Issues related to cuml.accel non-breaking Non-breaking change feature request New feature or request labels Nov 11, 2025


class TargetEncoder:
class TargetEncoder(InteropMixin):
Member

For everything to work properly you'll also need to move TargetEncoder to be a subclass of Base. Looking at the current implementation, I think this would entail:

  • Adding Base to the base class list (it should go first, before the mixin)
  • Updating the definition of _get_param_names to also include super()._get_param_names() (please also move this definition to the top, as we've done on other estimators).
  • Ripping out the custom infra in the class like _get_output_type/get_params/.... Basically everything that's not there to implement fit/fit_transform/transform should be moved to use the Base infra.
  • Adding CumlArray return type annotations from transform/fit_transform to enable method type reflection
  • Possibly using a CumlArrayDescriptor to reflect fitted attributes, though from looking at the list in the sklearn docs I don't think that's necessary.
  • Ensuring we have adequate test coverage for this estimator so we're not unexpectedly breaking things. Since this wasn't a Base subclass and wasn't doing type reflection the way we do elsewhere I wouldn't be surprised if after this we see differences in behavior, but if we're moving towards our expected standard I'd view those "breaking changes" as more bugfixes since this estimator doesn't follow our conventions.

Overall it looks like there's a bunch of cleanup work to do in this estimator, making this ready for cuml-accel is not necessarily a light lift.
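A rough, self-contained skeleton of the shape suggested above. `Base` and `InteropMixin` are stand-in stubs here (the real classes live in cuML's internals), and the parameter list is assumed, so treat this as a sketch of the inheritance order and `_get_param_names` convention only:

```python
class Base:
    """Stand-in stub for cuML's Base class."""
    @classmethod
    def _get_param_names(cls):
        return ["handle", "verbose", "output_type"]

class InteropMixin:
    """Stand-in stub for the accel interop mixin."""

# Base goes first in the base class list, before the mixin.
class TargetEncoder(Base, InteropMixin):
    @classmethod
    def _get_param_names(cls):
        # Include the parent's params, as done on other estimators.
        return super()._get_param_names() + [
            "n_folds", "smooth", "seed", "split_method", "stat",
        ]
```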

Member Author

@aamijar aamijar Jan 17, 2026


Hi @jcrist, I've added the Base infra to the estimator that you mentioned in dc5ea29

Comment thread python/cuml/cuml/preprocessing/TargetEncoder.py
@csadorf
Contributor

csadorf commented Jan 15, 2026

@aamijar What do you need to move this forward?

@aamijar aamijar changed the base branch from main to release/26.02 January 16, 2026 20:40
@aamijar
Member Author

aamijar commented Jan 17, 2026

Hi @csadorf, it should be a matter of addressing each of the upstream test failures now. We need to decide which ones to xfail and which ones to fix. Going to mark this as ready to review so CI can run as well to see if there are any failures that have been missed when testing locally.

I'm continuing to update the PR description with the test failures I am currently seeing after addressing them one by one.

@aamijar aamijar marked this pull request as ready for review January 17, 2026 02:22
@aamijar aamijar requested a review from a team as a code owner January 17, 2026 02:22
@aamijar aamijar requested a review from divyegala January 17, 2026 02:22
Contributor

@csadorf csadorf left a comment


It looks like there is still a lot of work we need to put into this implementation to actually accelerate the TargetEncoder with cuml.accel. cuML's TargetEncoder API differs in various ways, making it harder to accelerate with ZCC.

Comment thread python/cuml/cuml/preprocessing/TargetEncoder.py Outdated
Comment thread python/cuml/tests/test_target_encoder.py
@@ -150,7 +150,7 @@ def test_targetencoder_pandas():
answer = np.array([0.75, 0.5, 1.0, 0.75])
assert array_equal(test_encoded, answer)
print(type(test_encoded))
Contributor


unrelated fix: we should remove this print call

Comment thread python/cuml/cuml/preprocessing/TargetEncoder.py Outdated
Comment thread python/cuml/cuml/preprocessing/TargetEncoder.py Outdated
Comment thread python/cuml/cuml/preprocessing/TargetEncoder.py Outdated
@csadorf csadorf assigned csadorf and unassigned aamijar Jan 20, 2026
…x attrs conversion, handle multi-feature with warnings
@dantegd
Member

dantegd commented Jan 21, 2026

@csadorf @jcrist pushed a commit that adds the things we need for the conversion and fixes a few things to get us closer, could really use your eyes on the changes in 8a651b1

For the single-feature case, it should work correctly with no approximation needed - this is the common use case (especially with ColumnTransformer). But there is one important issue, continuing from the analysis @jcrist did in #5280 (comment): for multi-feature TargetEncoder, cuML and sklearn have fundamental semantic differences, i.e. fundamentally different approaches.

| | sklearn | cuML |
|---|---|---|
| Encoding strategy | Each feature encoded independently | Features encoded as combinations |
| Output shape | `(n_samples, n_features)` | `(n_samples,)` |

sklearn:

  • Computes encodings_[i] for each feature i separately
  • Each encodings_[i] maps categories_[i] to the target means for that feature alone

cuML:

  • Groups by ALL feature columns together (x_cols)
  • encode_all DataFrame maps category combinations → single target encoding
  • Returns one encoded column, not one per feature

The current implementation in the commit uses an approximation for multi-feature conversion:

  • from_sklearn(): Creates cartesian product of categories and averages per-feature encodings (with 100k combination limit)
  • as_sklearn(): Averages encode_all values across all combinations containing each category using .mean()
  • Both emit UserWarning explaining the approximation
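The as_sklearn() averaging step can be sketched like this (pandas stands in for the cudf encode_all frame, and the column names follow cuML's combination output; this is an illustration of the approximation, not the actual diff):

```python
import pandas as pd

# Sketch of the as_sklearn() approximation: recover per-feature encodings
# by averaging the combination-level encoding over every combination that
# contains a given category.
encode_all = pd.DataFrame({
    "A": [1, 2, 1, 2],
    "B": [1, 1, 0, 0],
    "__TARGET_ENCODE__": [20.0, 40.0, 10.0, 30.0],
})
per_feature = {
    col: encode_all.groupby(col)["__TARGET_ENCODE__"].mean().to_dict()
    for col in ["A", "B"]
}
print(per_feature)
```

For category A=1 this averages the encodings of the (1, 1) and (1, 0) combinations, which is lossy whenever the features interact.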

Should we instead raise UnsupportedOnGPU for multi-feature cases?

Arguments for approximation (current approach in the commit):

  • Allows some functionality rather than none
  • Warning makes the limitation clear
  • Single-feature (the common case) works correctly

Arguments for raising error:

  • Cleaner semantics - no silent approximation
  • Avoids potential user confusion from differing results
  • Multi-feature conversion is fundamentally lossy anyway

I lean towards raising the error, but wanted to discuss which you think is the better approach.

@dantegd
Member

dantegd commented Jan 21, 2026

If you wanna check the difference, here is some code and output:

```python
import numpy as np
import pandas as pd

# Sample data (need at least 5 rows for sklearn's default 5-fold CV)
X = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [1, 1],
              [2, 0],
              [2, 0],
              [2, 1],
              [2, 1]])
y = np.array([10, 10, 20, 20, 30, 30, 40, 40])

# === sklearn: encodes each feature independently ===
from sklearn.preprocessing import TargetEncoder as SklearnTE

sk_enc = SklearnTE(smooth=0, target_type='continuous')
sk_enc.fit(X, y)

print("sklearn encodings (per-feature):")
print(f"  Feature 0: {dict(zip(sk_enc.categories_[0], sk_enc.encodings_[0]))}")
print(f"  Feature 1: {dict(zip(sk_enc.categories_[1], sk_enc.encodings_[1]))}")

# Transform a single test point to see the output shape
X_test = np.array([[1, 0], [2, 1]])
sk_result = sk_enc.transform(X_test)
print(f"\nsklearn transform [1,0] and [2,1] (2 columns each):\n{sk_result}")

# === cuML: encodes feature combinations ===
from cuml.preprocessing import TargetEncoder as CumlTE
import cudf

X_df = cudf.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2], 
                       'B': [0, 0, 1, 1, 0, 0, 1, 1]})
y_series = cudf.Series([10, 10, 20, 20, 30, 30, 40, 40])

cu_enc = CumlTE(smooth=0)
cu_enc.fit(X_df, y_series)

print("\ncuML encode_all (combinations):")
print(cu_enc.encode_all.to_pandas())

X_test_cu = cudf.DataFrame({'A': [1, 2], 'B': [0, 1]})
cu_result = cu_enc.transform(X_test_cu)
print(f"\ncuML transform [1,0] and [2,1] (1 column):\n{cu_result}")
```

Output:

```
(rapids2512a1126) ➜  ~ python terepro.py
sklearn encodings (per-feature):
  Feature 0: {np.int64(1): np.float64(15.0), np.int64(2): np.float64(35.0)}
  Feature 1: {np.int64(0): np.float64(20.0), np.int64(1): np.float64(30.0)}

sklearn transform [1,0] and [2,1] (2 columns each):
[[15. 20.]
 [35. 30.]]

cuML encode_all (combinations):
   A  B  __TARGET___x  __TARGET___y  __TARGET_ENCODE__
0  1  1            40             2               20.0
1  2  1            80             2               40.0
2  1  0            20             2               10.0
3  2  0            60             2               30.0

cuML transform [1,0] and [2,1] (1 column):
[10. 40.]
```

Contributor

@csadorf csadorf left a comment


Overall good improvement, but I have some comments.

Comment thread python/cuml/cuml/preprocessing/TargetEncoder.py Outdated
Comment thread python/cuml/cuml/preprocessing/TargetEncoder.py Outdated
Comment thread python/cuml/cuml/preprocessing/TargetEncoder.py Outdated
Comment thread python/cuml/cuml/preprocessing/TargetEncoder.py Outdated
Comment thread python/cuml/cuml/preprocessing/TargetEncoder.py
@csadorf csadorf assigned dantegd and unassigned csadorf Jan 27, 2026
@dantegd
Member

dantegd commented Jan 27, 2026

@csadorf adding some notes here for documentation besides the xfail list:

The wrapper automatically falls back to sklearn when:

  • Multiclass targets (3+ classes) - sklearn uses internal one-hot encoding that cuML doesn't implement
  • Custom categories parameter - when users specify explicit category lists rather than "auto", cuML falls back since it only supports automatic category detection
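A hedged sketch of these two fallback conditions (the helper name and signature are made up for illustration; the actual dispatch logic in the accel wrapper is structured differently):

```python
# Illustrative check mirroring the fallback conditions listed above.
def supported_on_gpu(categories, n_classes):
    if n_classes is not None and n_classes >= 3:
        return False  # multiclass targets fall back to sklearn
    if categories != "auto":
        return False  # explicit category lists fall back to sklearn
    return True

print(supported_on_gpu("auto", 2))           # True
print(supported_on_gpu("auto", 3))           # False
print(supported_on_gpu([["a", "b"]], None))  # False
```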

Xfailed Tests:

  • Cross-validation differences (test_multiple_features_quick, check_methods_sample_order_invariance, check_methods_subset_invariance) - cuML uses random fold assignment while sklearn uses sequential assignment. Both approaches are statistically valid, but produce different per-sample encodings during training. The global encodings used for test data are identical.
  • Input validation (check_fit1d) - cuML intentionally accepts 1D arrays as single-feature input, treating them as a convenience. sklearn requires 2D input.
  • Error messages (check_estimators_empty_data_messages, check_fit2d_predict1d, test_errors[y1-...]) - cuML raises the same errors for invalid inputs but with different message text.
  • List inputs (test_errors[y0-...], test_feature_names_out_set_output) - cuML doesn't accept Python lists directly; users should pass numpy arrays or DataFrames.
  • Pickle serialization (check_estimators_pickle) - cudf DataFrame serialization differs from sklearn's internal structures.
  • Internal data structures (check_transformer_data_not_an_array) - cuML uses cudf DataFrames internally rather than numpy arrays.
  • Docstring format (test_fit_docstring_attributes) - Minor documentation format differences.
  • Warnings (test_use_regression_target) - cuML doesn't emit sklearn's specific warning about continuous targets.
  • Complex dtype (check_complex_data) - cuML doesn't support complex128 inputs.

All xfailed tests represent intentional implementation differences rather than bugs.
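The fold-assignment difference behind the first xfail group can be illustrated as follows (assumed semantics: sequential contiguous folds versus a random permutation; both schemes partition the samples into equally sized folds):

```python
import numpy as np

# Sequential assignment: contiguous blocks of samples share a fold.
n_samples, n_folds = 8, 4
sequential = np.arange(n_samples) * n_folds // n_samples  # [0 0 1 1 2 2 3 3]

# Random assignment: same fold sizes, shuffled sample-to-fold mapping.
rng = np.random.default_rng(0)
random_assignment = rng.permutation(sequential)

# Fold sizes match, but per-sample assignments generally differ, so the
# per-sample training encodings differ between the two schemes while the
# global encodings used at transform time remain identical.
print(np.bincount(sequential), np.bincount(random_assignment))
```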

@dantegd dantegd requested a review from a team as a code owner January 28, 2026 21:52
@dantegd dantegd requested a review from gforsyth January 28, 2026 21:52
Comment on lines +718 to +719
except Exception:
self.target_type_ = "continuous"
Contributor

We should narrow this down in a follow-up.

Comment on lines +924 to +931
if random_state is None:
seed = 42
elif isinstance(random_state, int):
seed = random_state
else:
# For RandomState objects, use a default seed
# (the accel wrapper will fall back to CPU anyway)
seed = 42
Contributor

Is it really necessary for us to hard-code the random seed here?

"with sklearn. Use .ravel() if you need 1D output.",
FutureWarning,
stacklevel=4,
)
Contributor

We should create a follow-up issue to change this behavior in 26.04.

@csadorf
Contributor

csadorf commented Jan 29, 2026

/merge

@rapids-bot rapids-bot Bot merged commit 61d09dc into rapidsai:release/26.02 Jan 29, 2026
110 checks passed
rapids-bot Bot pushed a commit that referenced this pull request Jan 29, 2026
…7741)

Adds documentation for the cuml.accel'eration of TargetEncoder.

Follow-up to #7476 .

Authors:
  - Simon Adorf (https://github.com/csadorf)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #7741
dantegd added a commit to dantegd/cuml that referenced this pull request Feb 17, 2026
Authors:
  - Anupam (https://github.com/aamijar)
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Simon Adorf (https://github.com/csadorf)

Approvers:
  - Simon Adorf (https://github.com/csadorf)
  - James Lamb (https://github.com/jameslamb)

URL: rapidsai#7476
dantegd added a commit to dantegd/cuml that referenced this pull request Feb 17, 2026

Labels

cuml-accel Issues related to cuml.accel Cython / Python Cython or Python issue feature request New feature or request non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support for TargetEncoder in cuml.accel

6 participants