
Add cuml.accel support for StandardScaler #7766

Merged
rapids-bot[bot] merged 30 commits into rapidsai:main from
csadorf:add-cuml.accel-support-for-standardscaler
Feb 10, 2026

Conversation

@csadorf
Contributor

@csadorf csadorf commented Feb 4, 2026

Summary

Adds GPU acceleration support for sklearn.preprocessing.StandardScaler via cuml.accel.

Closes #7765

Changes

  • Implemented InteropMixin in StandardScaler for CPU/GPU model conversion
  • Added StandardScaler proxy wrapper with automatic GPU/CPU fallback
  • Updated documentation (FAQ, limitations)
  • Added basic test coverage

GPU Fallback (CPU used for)

  • partial_fit() - incremental learning not supported
  • sample_weight parameter - weighted statistics not supported
  • Complex/object dtypes - not supported on GPU
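
The fallback conditions above can be sketched as a small validation helper. This is an illustration only: `UnsupportedOnGPU` stands in for whatever signal cuml.accel uses internally to trigger CPU dispatch, and the exact checks are an assumption, not the merged implementation.

```python
import numpy as np


class UnsupportedOnGPU(Exception):
    """Stand-in for the signal cuml.accel uses to fall back to CPU."""


def check_gpu_supported(X, sample_weight=None):
    """Raise UnsupportedOnGPU for inputs the GPU path cannot handle.

    Sketch only: mirrors the fallback list above (sample_weight and
    complex/object dtypes), not the actual cuml.accel validation code.
    """
    if sample_weight is not None:
        raise UnsupportedOnGPU("weighted statistics are not supported on GPU")
    dtype = getattr(X, "dtype", None)
    if dtype is not None:
        dtype = np.dtype(dtype)
        if dtype == object or np.issubdtype(dtype, np.complexfloating):
            raise UnsupportedOnGPU(f"dtype {dtype} is not supported on GPU")
```

A proxy method would run a check like this first and, on `UnsupportedOnGPU`, dispatch to the stock sklearn implementation.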

@csadorf csadorf requested a review from a team as a code owner February 4, 2026 16:51
@csadorf csadorf requested a review from betatim February 4, 2026 16:51
@csadorf csadorf added the feature request (New feature or request) and non-breaking (Non-breaking change) labels Feb 4, 2026
@github-actions github-actions Bot added the Cython / Python (Cython or Python issue) label Feb 4, 2026
@coderabbitai

coderabbitai Bot commented Feb 4, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds cuml.accel support for sklearn.preprocessing.StandardScaler via an interop shim and GPU wrapper, updates docs and tests (including many xfail entries), adjusts runtime adapters and estimator-proxy behavior, and documents CPU-fallback conditions and input validation.

Changes

  • Documentation — docs/source/cuml-accel/faq.rst, docs/source/cuml-accel/limitations.rst: Listed StandardScaler as accelerated and documented CPU-fallback cases (e.g., partial_fit, sample_weight, unsupported dtypes, certain sparse matrix conditions).
  • Third-party sklearn shim — python/cuml/cuml/_thirdparty/sklearn/preprocessing/_data.py: Added InteropMixin to StandardScaler; introduced _cpu_class_path, _params_from_cpu, _params_to_cpu, _attrs_from_cpu, and _attrs_to_cpu to convert params and fitted attributes between CPU (sklearn) and GPU (cuML); header year bump.
  • Acceleration wrappers — python/cuml/cuml/accel/_wrappers/sklearn/preprocessing.py: Added a GPU-aware StandardScaler wrapper, input validation for unsupported cases, and _gpu_fit/_gpu_fit_transform plus a _gpu_partial_fit stub (partial_fit falls back to CPU); exported StandardScaler in __all__.
  • Adapters / runtime checks — python/cuml/cuml/thirdparty_adapters/adapters.py: When accel is enabled, check_array eagerly converts common list-like inputs to NumPy more often; improved 2D/dimensionality validation error messages; SPDX year update.
  • Tests & xfails — python/cuml/cuml_accel_tests/test_basic_estimators.py, python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml: Added test_standard_scaler() and greatly expanded the upstream xfail list with many StandardScaler-related xfails (feature names/metadata, numerical stability, pandas NA, GPU limitations, pipelines, dtype/format combinations).
  • Estimator proxy — python/cuml/cuml/accel/estimator_proxy.py: Imported OneToOneFeatureMixin and extended GPU get_feature_names_out handling to support it alongside the existing mixins.
  • Minor wording — python/cuml/cuml/_thirdparty/sklearn/utils/skl_dependencies.py: Small wording tweak in the _check_n_features error message.
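
As a rough illustration of the _attrs_from_cpu pattern the shim adds (hypothetical sketch: the real code lives in _data.py and uses cuML's array-conversion utilities; here np.asarray merely stands in for the host-to-device copy, and the attribute list may differ):

```python
import numpy as np

# Fitted attributes StandardScaler exposes after fit(); the real shim's
# attribute list may differ from this illustrative one.
FITTED_ATTRS = ("mean_", "var_", "scale_", "n_samples_seen_")


def attrs_from_cpu(cpu_model):
    """Collect fitted sklearn attributes for transfer to a GPU estimator.

    np.asarray is a placeholder for the device transfer (e.g. cupy.asarray)
    that a real _attrs_from_cpu would perform.
    """
    return {
        name: np.asarray(getattr(cpu_model, name))
        for name in FITTED_ATTRS
        if getattr(cpu_model, name, None) is not None
    }
```

The matching _attrs_to_cpu direction would do the inverse copy back to host arrays so a fitted GPU model can be handed to sklearn code unchanged.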

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Suggested labels

improvement

Suggested reviewers

  • jcrist
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 55.56%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check — The title accurately summarizes the main objective of the PR: adding GPU acceleration support for StandardScaler via cuml.accel.
  • Description check — The description clearly relates to the changeset and covers the main changes: the InteropMixin implementation, the StandardScaler proxy wrapper, documentation updates, and test coverage.
  • Linked Issues check — The PR addresses the linked issue #7765 by implementing InteropMixin for CPU/GPU conversion, adding the StandardScaler proxy wrapper with GPU/CPU fallback, and documenting limitations.
  • Out of Scope Changes check — All changes are within scope: StandardScaler acceleration support, interop integration, documentation, tests, and necessary infrastructure updates (check_array enhancements, xfail-list updates, estimator proxy updates).


@csadorf
Contributor Author

csadorf commented Feb 4, 2026

I am working on addressing the test failures, but this is ready for initial review regardless.

@csadorf csadorf changed the title from "Add cuml.accel support for standardscaler" to "Add cuml.accel support for StandardScaler" Feb 4, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@python/cuml/cuml/accel/_wrappers/sklearn/preprocessing.py`:
- Around line 19-21: The _gpu_fit and _gpu_fit_transform functions currently use
a **kwargs-only** signature so positional sample_weight calls raise TypeError
before your UnsupportedOnGPU check; change both signatures to include an
explicit sample_weight=None parameter (i.e., def _gpu_fit(self, X, y=None,
sample_weight=None, **kwargs) and def _gpu_fit_transform(self, X, y=None,
sample_weight=None, **kwargs)) so callers can pass sample_weight positionally or
by keyword and the existing sample_weight handling (raising UnsupportedOnGPU)
still runs.
- Around line 22-41: The current input validation in the proxy methods _gpu_fit
and _gpu_fit_transform misses pandas/cuDF DataFrames (they use .dtypes, not
.dtype) and CuPy sparse matrices (cupyx.scipy.sparse), allowing unsupported
types to reach GPU code; update validation by either calling the existing helper
_check_unsupported_inputs (as TargetEncoder does) or add checks that (a) detect
DataFrame-like inputs by checking for .dtypes and iterate/inspect .dtypes to
reject complex/object dtypes, and (b) detect CuPy sparse matrices (e.g., via
importing cupyx.scipy.sparse or checking for cupyx sparse-specific attributes)
in addition to scipy.sparse.issparse, and raise UnsupportedOnGPU with the same
messages; apply this change to both _gpu_fit and _gpu_fit_transform (or
consolidate into a new helper used by both).

In `@python/cuml/cuml/thirdparty_adapters/adapters.py`:
- Around line 230-241: The early np.asarray conversion can raise on GPU-backed
arrays; update the guard in the cuml_accel_enabled() branch around the variable
array to explicitly exclude CuPy arrays, cuDF/cudf.Series and pandas.Series, and
any object exposing a __cuda_array_interface__ (in addition to existing
cudf.DataFrame checks and gpu_sparse checks) so np.asarray is only called for
true CPU list-like inputs; leave GPU types to be handled by
input_to_cupy_array() further down. Ensure you reference the same symbols
(cuml_accel_enabled(), array, np.asarray, input_to_cupy_array()) so the change
prevents passing GPU-backed objects to np.asarray while preserving list/tuple
conversion behavior.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/source/cuml-accel/limitations.rst`:
- Around line 408-414: The documentation about StandardScaler's GPU fallback is
inaccurate; update the docs to match the implementation by specifying that
sparse integer dtype fallback applies only to int64 (not all integer dtypes) and
that the sparse-format validation (CSR/CSC) check is performed only for CuPy
sparse inputs; keep the other bullet points unchanged (partial_fit,
sample_weight, object/float16/complex dtypes). Reference StandardScaler,
partial_fit, sample_weight, and the sparse/int64/CuPy sparse behavior in the
text so readers know these exact conditions trigger CPU fallback.

In `@python/cuml/cuml/accel/_wrappers/sklearn/preprocessing.py`:
- Around line 36-57: The sparse-dtype validation only rejects int64 for SciPy
sparse and omits dtype checks for CuPy sparse, so integer dtypes besides int64
can slip through; update the checks in the block guarding sp_sparse.issparse(X)
and the cupy_sparse.issparse(X) branch to reject any integer dtype (use
np.issubdtype(X.dtype, np.integer) and allow bool) and raise UnsupportedOnGPU
with a message similar to the existing one (mentioning StandardScaler GPU
support and allowed float/complex/bool dtypes); ensure you update both branches
where UnsupportedOnGPU is raised (references: sp_sparse.issparse,
cupy_sparse.issparse, UnsupportedOnGPU, StandardScaler) so all integer sparse
matrices are rejected consistently for SciPy and CuPy backends.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml`:
- Around line 1393-1479: The test failures show cuml.accel's StandardScaler
gives incorrect results for near-constant features; update the cuml.accel
StandardScaler implementation (the StandardScaler class and its
fit/transform/fit_transform code paths) to either replicate sklearn's
numerical-stability logic (add epsilon/variance thresholding when computing
scale, matching sklearn behavior) or detect near-zero variance columns during
fit and force a CPU fallback/dispatch to the sklearn implementation (trigger the
same non-accelerated code path used for other unsupported cases). Ensure the
check uses the same threshold semantics as sklearn (compare computed variances
against epsilon or use sklearn's scale_ computation logic) and implement the
fallback decision in the same place where other GPU limitations are detected so
tests like test_standard_scaler_near_constant_features are handled consistently.
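
A sketch of the variance-thresholding option (this approximates sklearn's near-constant-feature handling; the exact threshold formula here is an assumption, not sklearn's verbatim `_is_constant_feature` logic, and `compute_scale` is a hypothetical helper name):

```python
import numpy as np


def compute_scale(X):
    """Mean/scale computation with near-constant-feature protection.

    Columns whose variance is within rounding error of zero (relative to
    their magnitude) get scale 1.0, so transform() maps them to ~0 instead
    of amplifying floating-point noise. The threshold approximates, but
    does not exactly replicate, sklearn's behavior.
    """
    X = np.asarray(X, dtype=np.float64)
    mean = X.mean(axis=0)
    var = X.var(axis=0)
    eps = np.finfo(np.float64).eps
    # treat a feature as constant if its variance is within rounding error
    constant = var <= X.shape[0] * eps * np.maximum(np.abs(mean), 1.0) ** 2
    scale = np.sqrt(var)
    scale[constant] = 1.0
    return mean, scale
```

The alternative mentioned above, detecting near-zero variance during fit and dispatching to the CPU path, would reuse the same UnsupportedOnGPU-style fallback as the other limitations.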
🧹 Nitpick comments (1)
python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml (1)

25-28: Use the feature-name-specific marker for this xfail.

This block is explicitly about feature name preservation, but it uses the generic cuml_accel_bugs marker. Consider aligning with the dedicated feature-name markers added below for easier triage.

Suggested marker alignment
-  marker: cuml_accel_bugs
+  marker: cuml_accel_bugs_feature_names


@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml`:
- Around line 1475-1478: Update the xfail reason and project documentation to
explicitly note that the StandardScaler proxy (and related cuml.accel proxies)
do not correctly support parallel dispatch with n_jobs>1; locate the YAML entry
that lists marker: parallel_config and test
"sklearn.utils.tests.test_parallel::test_dispatch_config_parallel[2]" and change
the reason string to mention the CPU-fallback/parallelism limitation, and add a
short note in the user-facing docs alongside the existing CPU-fallback cases
(partial_fit, sample_weight, complex dtypes) describing the behaviour and
recommended workaround (disable cuml.accel or set n_jobs=1).
🧹 Nitpick comments (1)
python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml (1)

1393-1479: Inconsistent marker naming convention.

The new markers gpu_limitations, numerical_stability, pandas_na, and parallel_config (Lines 1394, 1398, 1471, 1476) break from the cuml_accel_* prefix convention used by every other marker in this file. Consider renaming them for consistency (e.g., cuml_accel_gpu_limitations, cuml_accel_numerical_stability, cuml_accel_pandas_na, cuml_accel_parallel_config).

Member

@betatim betatim left a comment


This looks good to me. Some questions/speculation as comments

@betatim
Member

betatim commented Feb 10, 2026

/merge

@rapids-bot rapids-bot Bot merged commit e350c8c into rapidsai:main Feb 10, 2026
168 of 170 checks passed
@csadorf csadorf deleted the add-cuml.accel-support-for-standardscaler branch February 10, 2026 14:42
dantegd added a commit to dantegd/cuml that referenced this pull request Feb 17, 2026

Authors:
  - Simon Adorf (https://github.com/csadorf)

Approvers:
  - Tim Head (https://github.com/betatim)

URL: rapidsai#7766

Labels

Cython / Python Cython or Python issue feature request New feature or request non-breaking Non-breaking change


Development

Successfully merging this pull request may close these issues.

Support for StandardScaler in cuml.accel

3 participants