Add cuml.accel support for StandardScaler #7766
rapids-bot[bot] merged 30 commits into rapidsai:main
Conversation
📝 Walkthrough: Adds cuml.accel support for StandardScaler.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
I am working on addressing the test failures, but this is ready for initial review regardless.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@python/cuml/cuml/accel/_wrappers/sklearn/preprocessing.py`:
- Around line 19-21: The _gpu_fit and _gpu_fit_transform functions currently use
a **kwargs-only** signature so positional sample_weight calls raise TypeError
before your UnsupportedOnGPU check; change both signatures to include an
explicit sample_weight=None parameter (i.e., def _gpu_fit(self, X, y=None,
sample_weight=None, **kwargs) and def _gpu_fit_transform(self, X, y=None,
sample_weight=None, **kwargs)) so callers can pass sample_weight positionally or
by keyword and the existing sample_weight handling (raising UnsupportedOnGPU)
still runs.
- Around line 22-41: The current input validation in the proxy methods _gpu_fit
and _gpu_fit_transform misses pandas/cuDF DataFrames (they use .dtypes, not
.dtype) and CuPy sparse matrices (cupyx.scipy.sparse), allowing unsupported
types to reach GPU code; update validation by either calling the existing helper
_check_unsupported_inputs (as TargetEncoder does) or add checks that (a) detect
DataFrame-like inputs by checking for .dtypes and iterate/inspect .dtypes to
reject complex/object dtypes, and (b) detect CuPy sparse matrices (e.g., via
importing cupyx.scipy.sparse or checking for cupyx sparse-specific attributes)
in addition to scipy.sparse.issparse, and raise UnsupportedOnGPU with the same
messages; apply this change to both _gpu_fit and _gpu_fit_transform (or
consolidate into a new helper used by both).
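
A minimal sketch addressing both comments above: an explicit `sample_weight` parameter plus a consolidated input validator. The helper name, stand-in exception class, and dispatch bodies are illustrative, not the actual proxy code:

```python
import numpy as np
import scipy.sparse as sp_sparse


class UnsupportedOnGPU(Exception):
    """Stand-in for cuml.accel's exception, for illustration only."""


def _check_unsupported_inputs(X):
    # Hypothetical consolidated validator covering both comments above.
    if hasattr(X, "dtypes"):
        # pandas/cuDF DataFrames expose .dtypes (plural), not .dtype.
        for dtype in X.dtypes:
            if dtype == object or np.issubdtype(dtype, np.complexfloating):
                raise UnsupportedOnGPU(f"dtype {dtype} is not supported on GPU")
        return
    try:
        # CuPy sparse matrices live in cupyx.scipy.sparse and are not
        # recognized by scipy.sparse.issparse.
        import cupyx.scipy.sparse as cupy_sparse
    except ImportError:
        cupy_sparse = None
    if sp_sparse.issparse(X) or (cupy_sparse and cupy_sparse.issparse(X)):
        pass  # sparse dtype/format validation goes here (see later rounds)


class _StandardScalerProxySketch:
    # Explicit sample_weight parameter: a positional call such as
    # fit(X, None, weights) now reaches the UnsupportedOnGPU check instead
    # of raising TypeError at a **kwargs-only boundary.
    def _gpu_fit(self, X, y=None, sample_weight=None, **kwargs):
        if sample_weight is not None:
            raise UnsupportedOnGPU("sample_weight is not supported on GPU")
        _check_unsupported_inputs(X)
        # ... dispatch to the GPU estimator ...

    def _gpu_fit_transform(self, X, y=None, sample_weight=None, **kwargs):
        if sample_weight is not None:
            raise UnsupportedOnGPU("sample_weight is not supported on GPU")
        _check_unsupported_inputs(X)
        # ... dispatch to the GPU estimator ...
```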
In `@python/cuml/cuml/thirdparty_adapters/adapters.py`:
- Around line 230-241: The early np.asarray conversion can raise on GPU-backed
arrays; update the guard in the cuml_accel_enabled() branch around the variable
array to explicitly exclude CuPy arrays, cuDF/cudf.Series and pandas.Series, and
any object exposing a __cuda_array_interface__ (in addition to existing
cudf.DataFrame checks and gpu_sparse checks) so np.asarray is only called for
true CPU list-like inputs; leave GPU types to be handled by
input_to_cupy_array() further down. Ensure you reference the same symbols
(cuml_accel_enabled(), array, np.asarray, input_to_cupy_array()) so the change
prevents passing GPU-backed objects to np.asarray while preserving list/tuple
conversion behavior.
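
A sketch of the stricter guard described above; the predicate name is hypothetical, while `cuml_accel_enabled()`, `np.asarray`, and `input_to_cupy_array()` are the symbols referenced in the comment:

```python
def _is_cpu_list_like(array):
    # Hypothetical predicate: only plain CPU list-like inputs should be
    # eagerly converted with np.asarray; GPU-backed objects are left for
    # input_to_cupy_array() further down in the adapter.
    if hasattr(array, "__cuda_array_interface__"):
        return False  # CuPy arrays and other GPU buffers
    root_module = (type(array).__module__ or "").split(".")[0]
    if root_module in ("cudf", "cupy", "pandas"):
        return False  # cuDF objects, CuPy arrays, pandas.Series
    return isinstance(array, (list, tuple))


# Illustrative use inside the cuml_accel_enabled() branch:
#     if cuml_accel_enabled() and _is_cpu_list_like(array):
#         array = np.asarray(array)
```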
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/source/cuml-accel/limitations.rst`:
- Around line 408-414: The documentation about StandardScaler's GPU fallback is
inaccurate; update the docs to match the implementation by specifying that
sparse integer dtype fallback applies only to int64 (not all integer dtypes) and
that the sparse-format validation (CSR/CSC) check is performed only for CuPy
sparse inputs; keep the other bullet points unchanged (partial_fit,
sample_weight, object/float16/complex dtypes). Reference StandardScaler,
partial_fit, sample_weight, and the sparse/int64/CuPy sparse behavior in the
text so readers know these exact conditions trigger CPU fallback.
In `@python/cuml/cuml/accel/_wrappers/sklearn/preprocessing.py`:
- Around line 36-57: The sparse-dtype validation only rejects int64 for SciPy
sparse and omits dtype checks for CuPy sparse, so integer dtypes besides int64
can slip through; update the checks in the block guarding sp_sparse.issparse(X)
and the cupy_sparse.issparse(X) branch to reject any integer dtype (use
np.issubdtype(X.dtype, np.integer) and allow bool) and raise UnsupportedOnGPU
with a message similar to the existing one (mentioning StandardScaler GPU
support and allowed float/complex/bool dtypes); ensure you update both branches
where UnsupportedOnGPU is raised (references: sp_sparse.issparse,
cupy_sparse.issparse, UnsupportedOnGPU, StandardScaler) so all integer sparse
matrices are rejected consistently for SciPy and CuPy backends.
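
A sketch of the symmetric dtype check the comment asks for, with a stand-in exception class; any integer dtype is rejected for both backends while bool, float, and complex pass:

```python
import numpy as np
import scipy.sparse as sp_sparse

try:
    import cupyx.scipy.sparse as cupy_sparse
except ImportError:
    cupy_sparse = None


class UnsupportedOnGPU(Exception):
    """Stand-in for cuml.accel's exception, for illustration only."""


def _check_sparse_dtype(X):
    # Reject any integer dtype symmetrically for SciPy and CuPy sparse
    # backends; bool, float, and complex dtypes pass through.
    is_sparse = sp_sparse.issparse(X) or (
        cupy_sparse is not None and cupy_sparse.issparse(X)
    )
    if is_sparse and np.issubdtype(X.dtype, np.integer):
        raise UnsupportedOnGPU(
            "StandardScaler GPU support for sparse inputs requires "
            f"float, complex, or bool dtypes; got {X.dtype}"
        )
```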
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml`:
- Around line 1393-1479: The test failures show cuml.accel's StandardScaler
gives incorrect results for near-constant features; update the cuml.accel
StandardScaler implementation (the StandardScaler class and its
fit/transform/fit_transform code paths) to either replicate sklearn's
numerical-stability logic (add epsilon/variance thresholding when computing
scale, matching sklearn behavior) or detect near-zero variance columns during
fit and force a CPU fallback/dispatch to the sklearn implementation (trigger the
same non-accelerated code path used for other unsupported cases). Ensure the
check uses the same threshold semantics as sklearn (compare computed variances
against epsilon or use sklearn's scale_ computation logic) and implement the
fallback decision in the same place where other GPU limitations are detected so
tests like test_standard_scaler_near_constant_features are handled consistently.
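
A sketch of the variance-thresholding option, modelled on sklearn's `_is_constant_feature` / `_handle_zeros_in_scale` logic; the exact bound is an assumption based on sklearn's published heuristic:

```python
import numpy as np


def _stable_scale(var, mean, n_samples):
    # Features whose variance is indistinguishable from floating-point
    # rounding error are treated as constant and given scale 1.0, so the
    # transform does not divide by a near-zero sqrt(var).
    eps = np.finfo(np.float64).eps
    constant_mask = var <= n_samples * eps * var + (n_samples * mean * eps) ** 2
    scale = np.sqrt(var)
    scale[constant_mask | (scale == 0.0)] = 1.0
    return scale
```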
🧹 Nitpick comments (1)
python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml (1)
25-28: Use the feature-name-specific marker for this xfail. This block is explicitly about feature name preservation, but it uses the generic `cuml_accel_bugs` marker. Consider aligning with the dedicated feature-name markers added below for easier triage.

Suggested marker alignment:

```diff
- marker: cuml_accel_bugs
+ marker: cuml_accel_bugs_feature_names
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml`:
- Around line 1475-1478: Update the xfail reason and project documentation to
explicitly note that the StandardScaler proxy (and related cuml.accel proxies)
do not correctly support parallel dispatch with n_jobs>1; locate the YAML entry
that lists marker: parallel_config and test
"sklearn.utils.tests.test_parallel::test_dispatch_config_parallel[2]" and change
the reason string to mention the CPU-fallback/parallelism limitation, and add a
short note in the user-facing docs alongside the existing CPU-fallback cases
(partial_fit, sample_weight, complex dtypes) describing the behaviour and
recommended workaround (disable cuml.accel or set n_jobs=1).
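
A usage sketch of the recommended workaround; the pipeline is illustrative, and the point is that `n_jobs=1` keeps execution in the process where the accelerated proxies are installed:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression())
# n_jobs=1 keeps all work in the parent process where the cuml.accel
# proxies are installed; n_jobs>1 dispatches to joblib workers, where
# proxy dispatch is not reliable (the limitation noted above).
scores = cross_val_score(pipe, X, y, cv=3, n_jobs=1)
```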
🧹 Nitpick comments (1)
python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml (1)
1393-1479: Inconsistent marker naming convention. The new markers `gpu_limitations`, `numerical_stability`, `pandas_na`, and `parallel_config` (Lines 1394, 1398, 1471, 1476) break from the `cuml_accel_*` prefix convention used by every other marker in this file. Consider renaming them for consistency (e.g., `cuml_accel_gpu_limitations`, `cuml_accel_numerical_stability`, `cuml_accel_pandas_na`, `cuml_accel_parallel_config`).
betatim left a comment:
This looks good to me. Some questions/speculation as comments
/merge
## Summary

Adds GPU acceleration support for `sklearn.preprocessing.StandardScaler` via `cuml.accel`. Closes rapidsai#7765

## Changes

- Implemented `InteropMixin` in StandardScaler for CPU/GPU model conversion
- Added `StandardScaler` proxy wrapper with automatic GPU/CPU fallback
- Updated documentation (FAQ, limitations)
- Added basic test coverage

## GPU Fallback (CPU used for)

- `partial_fit()` - incremental learning not supported
- `sample_weight` parameter - weighted statistics not supported
- Complex/object dtypes - not supported on GPU

Authors:
- Simon Adorf (https://github.com/csadorf)

Approvers:
- Tim Head (https://github.com/betatim)

URL: rapidsai#7766
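
For reference, a minimal usage sketch of the accelerated path, assuming cuml.accel's documented `install()` entry point; the fallback cases listed above run transparently on CPU:

```python
import cuml.accel
cuml.accel.install()  # enable the accelerator before importing sklearn

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(1000, 5))
scaler = StandardScaler().fit(X)   # accelerated on GPU when supported
X_scaled = scaler.transform(X)

scaler.partial_fit(X)              # one of the CPU-fallback cases above
```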