Add cuml.accel support for StandardScaler #7766
rapids-bot[bot] merged 30 commits into rapidsai:main
Conversation
📝 Walkthrough: Adds cuml.accel support for StandardScaler.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
I am working on addressing the test failures, but this is ready for initial review regardless.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@python/cuml/cuml/accel/_wrappers/sklearn/preprocessing.py`:
- Around line 19-21: The _gpu_fit and _gpu_fit_transform functions currently use
a **kwargs-only** signature so positional sample_weight calls raise TypeError
before your UnsupportedOnGPU check; change both signatures to include an
explicit sample_weight=None parameter (i.e., def _gpu_fit(self, X, y=None,
sample_weight=None, **kwargs) and def _gpu_fit_transform(self, X, y=None,
sample_weight=None, **kwargs)) so callers can pass sample_weight positionally or
by keyword and the existing sample_weight handling (raising UnsupportedOnGPU)
still runs.
- Around line 22-41: The current input validation in the proxy methods _gpu_fit
and _gpu_fit_transform misses pandas/cuDF DataFrames (they use .dtypes, not
.dtype) and CuPy sparse matrices (cupyx.scipy.sparse), allowing unsupported
types to reach GPU code; update validation by either calling the existing helper
_check_unsupported_inputs (as TargetEncoder does) or add checks that (a) detect
DataFrame-like inputs by checking for .dtypes and iterate/inspect .dtypes to
reject complex/object dtypes, and (b) detect CuPy sparse matrices (e.g., via
importing cupyx.scipy.sparse or checking for cupyx sparse-specific attributes)
in addition to scipy.sparse.issparse, and raise UnsupportedOnGPU with the same
messages; apply this change to both _gpu_fit and _gpu_fit_transform (or
consolidate into a new helper used by both).
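
A minimal sketch addressing both comments above: an explicit `sample_weight` parameter plus a consolidated input validator. The helper name, stand-in exception class, and dispatch bodies are illustrative, not the actual proxy code:

```python
import numpy as np
import scipy.sparse as sp_sparse


class UnsupportedOnGPU(Exception):
    """Stand-in for cuml.accel's exception, for illustration only."""


def _check_unsupported_inputs(X):
    # Hypothetical consolidated validator covering both comments above.
    if hasattr(X, "dtypes"):
        # pandas/cuDF DataFrames expose .dtypes (plural), not .dtype.
        for dtype in X.dtypes:
            if dtype == object or np.issubdtype(dtype, np.complexfloating):
                raise UnsupportedOnGPU(f"dtype {dtype} is not supported on GPU")
        return
    try:
        # CuPy sparse matrices live in cupyx.scipy.sparse and are not
        # recognized by scipy.sparse.issparse.
        import cupyx.scipy.sparse as cupy_sparse
    except ImportError:
        cupy_sparse = None
    if sp_sparse.issparse(X) or (cupy_sparse and cupy_sparse.issparse(X)):
        pass  # sparse dtype/format validation goes here (see later rounds)


class _StandardScalerProxySketch:
    # Explicit sample_weight parameter: a positional call such as
    # fit(X, None, weights) now reaches the UnsupportedOnGPU check instead
    # of raising TypeError at a **kwargs-only boundary.
    def _gpu_fit(self, X, y=None, sample_weight=None, **kwargs):
        if sample_weight is not None:
            raise UnsupportedOnGPU("sample_weight is not supported on GPU")
        _check_unsupported_inputs(X)
        # ... dispatch to the GPU estimator ...

    def _gpu_fit_transform(self, X, y=None, sample_weight=None, **kwargs):
        if sample_weight is not None:
            raise UnsupportedOnGPU("sample_weight is not supported on GPU")
        _check_unsupported_inputs(X)
        # ... dispatch to the GPU estimator ...
```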
In `@python/cuml/cuml/thirdparty_adapters/adapters.py`:
- Around line 230-241: The early np.asarray conversion can raise on GPU-backed
arrays; update the guard in the cuml_accel_enabled() branch around the variable
array to explicitly exclude CuPy arrays, cuDF/cudf.Series and pandas.Series, and
any object exposing a __cuda_array_interface__ (in addition to existing
cudf.DataFrame checks and gpu_sparse checks) so np.asarray is only called for
true CPU list-like inputs; leave GPU types to be handled by
input_to_cupy_array() further down. Ensure you reference the same symbols
(cuml_accel_enabled(), array, np.asarray, input_to_cupy_array()) so the change
prevents passing GPU-backed objects to np.asarray while preserving list/tuple
conversion behavior.
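
A sketch of the stricter guard described above; the predicate name is hypothetical, while `cuml_accel_enabled()`, `np.asarray`, and `input_to_cupy_array()` are the symbols referenced in the comment:

```python
def _is_cpu_list_like(array):
    # Hypothetical predicate: only plain CPU list-like inputs should be
    # eagerly converted with np.asarray; GPU-backed objects are left for
    # input_to_cupy_array() further down in the adapter.
    if hasattr(array, "__cuda_array_interface__"):
        return False  # CuPy arrays and other GPU buffers
    root_module = (type(array).__module__ or "").split(".")[0]
    if root_module in ("cudf", "cupy", "pandas"):
        return False  # cuDF objects, CuPy arrays, pandas.Series
    return isinstance(array, (list, tuple))


# Illustrative use inside the cuml_accel_enabled() branch:
#     if cuml_accel_enabled() and _is_cpu_list_like(array):
#         array = np.asarray(array)
```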
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@docs/source/cuml-accel/limitations.rst`:
- Around line 408-414: The documentation about StandardScaler's GPU fallback is
inaccurate; update the docs to match the implementation by specifying that
sparse integer dtype fallback applies only to int64 (not all integer dtypes) and
that the sparse-format validation (CSR/CSC) check is performed only for CuPy
sparse inputs; keep the other bullet points unchanged (partial_fit,
sample_weight, object/float16/complex dtypes). Reference StandardScaler,
partial_fit, sample_weight, and the sparse/int64/CuPy sparse behavior in the
text so readers know these exact conditions trigger CPU fallback.
In `@python/cuml/cuml/accel/_wrappers/sklearn/preprocessing.py`:
- Around line 36-57: The sparse-dtype validation only rejects int64 for SciPy
sparse and omits dtype checks for CuPy sparse, so integer dtypes besides int64
can slip through; update the checks in the block guarding sp_sparse.issparse(X)
and the cupy_sparse.issparse(X) branch to reject any integer dtype (use
np.issubdtype(X.dtype, np.integer) and allow bool) and raise UnsupportedOnGPU
with a message similar to the existing one (mentioning StandardScaler GPU
support and allowed float/complex/bool dtypes); ensure you update both branches
where UnsupportedOnGPU is raised (references: sp_sparse.issparse,
cupy_sparse.issparse, UnsupportedOnGPU, StandardScaler) so all integer sparse
matrices are rejected consistently for SciPy and CuPy backends.
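
A sketch of the symmetric dtype check the comment asks for, with a stand-in exception class; any integer dtype is rejected for both backends while bool, float, and complex pass:

```python
import numpy as np
import scipy.sparse as sp_sparse

try:
    import cupyx.scipy.sparse as cupy_sparse
except ImportError:
    cupy_sparse = None


class UnsupportedOnGPU(Exception):
    """Stand-in for cuml.accel's exception, for illustration only."""


def _check_sparse_dtype(X):
    # Reject any integer dtype symmetrically for SciPy and CuPy sparse
    # backends; bool, float, and complex dtypes pass through.
    is_sparse = sp_sparse.issparse(X) or (
        cupy_sparse is not None and cupy_sparse.issparse(X)
    )
    if is_sparse and np.issubdtype(X.dtype, np.integer):
        raise UnsupportedOnGPU(
            "StandardScaler GPU support for sparse inputs requires "
            f"float, complex, or bool dtypes; got {X.dtype}"
        )
```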
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml`:
- Around line 1393-1479: The test failures show cuml.accel's StandardScaler
gives incorrect results for near-constant features; update the cuml.accel
StandardScaler implementation (the StandardScaler class and its
fit/transform/fit_transform code paths) to either replicate sklearn's
numerical-stability logic (add epsilon/variance thresholding when computing
scale, matching sklearn behavior) or detect near-zero variance columns during
fit and force a CPU fallback/dispatch to the sklearn implementation (trigger the
same non-accelerated code path used for other unsupported cases). Ensure the
check uses the same threshold semantics as sklearn (compare computed variances
against epsilon or use sklearn's scale_ computation logic) and implement the
fallback decision in the same place where other GPU limitations are detected so
tests like test_standard_scaler_near_constant_features are handled consistently.
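
A sketch of the variance-thresholding option, modelled on sklearn's `_is_constant_feature` / `_handle_zeros_in_scale` logic; the exact bound is an assumption based on sklearn's published heuristic:

```python
import numpy as np


def _stable_scale(var, mean, n_samples):
    # Features whose variance is indistinguishable from floating-point
    # rounding error are treated as constant and given scale 1.0, so the
    # transform does not divide by a near-zero sqrt(var).
    eps = np.finfo(np.float64).eps
    constant_mask = var <= n_samples * eps * var + (n_samples * mean * eps) ** 2
    scale = np.sqrt(var)
    scale[constant_mask | (scale == 0.0)] = 1.0
    return scale
```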
🧹 Nitpick comments (1)
python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml (1)
25-28: Use the feature-name-specific marker for this xfail. This block is explicitly about feature name preservation, but it uses the generic `cuml_accel_bugs` marker. Consider aligning with the dedicated feature-name markers added below for easier triage.

Suggested marker alignment:

```diff
- marker: cuml_accel_bugs
+ marker: cuml_accel_bugs_feature_names
```
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml`:
- Around line 1475-1478: Update the xfail reason and project documentation to
explicitly note that the StandardScaler proxy (and related cuml.accel proxies)
do not correctly support parallel dispatch with n_jobs>1; locate the YAML entry
that lists marker: parallel_config and test
"sklearn.utils.tests.test_parallel::test_dispatch_config_parallel[2]" and change
the reason string to mention the CPU-fallback/parallelism limitation, and add a
short note in the user-facing docs alongside the existing CPU-fallback cases
(partial_fit, sample_weight, complex dtypes) describing the behaviour and
recommended workaround (disable cuml.accel or set n_jobs=1).
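
A usage sketch of the recommended workaround; the pipeline is illustrative, and the point is that `n_jobs=1` keeps execution in the process where the accelerated proxies are installed:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
pipe = make_pipeline(StandardScaler(), LogisticRegression())
# n_jobs=1 keeps all work in the parent process where the cuml.accel
# proxies are installed; n_jobs>1 dispatches to joblib workers, where
# proxy dispatch is not reliable (the limitation noted above).
scores = cross_val_score(pipe, X, y, cv=3, n_jobs=1)
```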
🧹 Nitpick comments (1)
python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml (1)
1393-1479: Inconsistent marker naming convention. The new markers `gpu_limitations`, `numerical_stability`, `pandas_na`, and `parallel_config` (Lines 1394, 1398, 1471, 1476) break from the `cuml_accel_*` prefix convention used by every other marker in this file. Consider renaming them for consistency (e.g., `cuml_accel_gpu_limitations`, `cuml_accel_numerical_stability`, `cuml_accel_pandas_na`, `cuml_accel_parallel_config`).
betatim left a comment:
This looks good to me. Some questions/speculation as comments
/merge
## Summary

Adds GPU acceleration support for `sklearn.preprocessing.StandardScaler` via `cuml.accel`. Closes rapidsai#7765

## Changes

- Implemented `InteropMixin` in StandardScaler for CPU/GPU model conversion
- Added `StandardScaler` proxy wrapper with automatic GPU/CPU fallback
- Updated documentation (FAQ, limitations)
- Added basic test coverage

## GPU Fallback (CPU used for)

- `partial_fit()` - incremental learning not supported
- `sample_weight` parameter - weighted statistics not supported
- Complex/object dtypes - not supported on GPU

Authors:
- Simon Adorf (https://github.com/csadorf)

Approvers:
- Tim Head (https://github.com/betatim)

URL: rapidsai#7766
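
For reference, a minimal usage sketch of the accelerated path, assuming cuml.accel's documented `install()` entry point; the fallback cases listed above run transparently on CPU:

```python
import cuml.accel
cuml.accel.install()  # enable the accelerator before importing sklearn

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(1000, 5))
scaler = StandardScaler().fit(X)   # accelerated on GPU when supported
X_scaled = scaler.transform(X)

scaler.partial_fit(X)              # one of the CPU-fallback cases above
```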