
New Estimator Proxy architecture#6613

Merged
rapids-bot[bot] merged 21 commits into rapidsai:branch-25.06 from jcrist:new-estimator-proxy
May 9, 2025

Conversation

@jcrist
Member

@jcrist jcrist commented May 1, 2025

This PR adds a new ProxyBase base class for defining proxy estimators by composing one CPU and one GPU estimator, instead of inheriting from the GPU estimator. This lets us better match sklearn's expected interface, since the exposed class can look much more like the proxied sklearn class than one subclassed from the cuml estimator.

This is accomplished by defining a new InteropMixin mixin class to handle converting hyperparameters/attributes to/from the proxied CPU estimator. The mixin defines the public as_sklearn/from_sklearn methods, and the private methods are also used by ProxyBase.

To port an estimator from the old ProxyMixin to the new ProxyBase:

  • Switch the cuml estimator to subclass from Base instead of UniversalBase
  • Add InteropMixin to the cuml estimator, and define the required methods. Remove any old methods and decorators used by UniversalBase.
  • Switch the class in cuml.accel._wrappers.* to subclass from ProxyBase. Any special casing for certain methods should be moved to this class from the cuml proper estimator class
  • Add tests
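As a rough illustration of the composition pattern these steps describe, here is a minimal, self-contained sketch. All class and method names below (CPUModel, GPUModel, params_to_cpu) are hypothetical stand-ins for illustration only; the real InteropMixin and ProxyBase APIs live in cuml and are not reproduced here.

```python
# Sketch of composition-based proxying: the proxy wraps a GPU estimator
# rather than subclassing it, and converts to a plain CPU estimator on
# demand. Names are illustrative, not the actual cuml API.

class CPUModel:
    """Stand-in for the sklearn estimator being proxied."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha

class GPUModel:
    """Stand-in for the cuml estimator."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha

class ProxyModel:
    """Wraps a GPU estimator (composition instead of inheritance), so the
    public surface can mirror the sklearn class exactly."""
    _cpu_class = CPUModel
    _gpu_class = GPUModel

    def __init__(self, alpha=1.0):
        self._gpu = self._gpu_class(alpha=alpha)

    def params_to_cpu(self):
        # Explicit per-parameter mapping, as in the PR's review snippets.
        return {"alpha": self._gpu.alpha}

    def as_sklearn(self):
        # Build a plain CPU estimator from the translated hyperparameters.
        return self._cpu_class(**self.params_to_cpu())

cpu = ProxyModel(alpha=0.5).as_sklearn()
print(type(cpu).__name__, cpu.alpha)  # CPUModel 0.5
```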

One notable caveat of this approach is that the old device selection support is removed from ported estimators. Ported estimators that previously would run on CPU under using_device_type no longer will; they will run only on GPU. This is a breaking change. That said, I don't believe we have a large user base doing this, and there's a clear transition path to using cuml.accel instead of using_device_type.

In this PR I've handled porting all of the currently handled linear model estimators. There are still a few remaining todos:

  • Add warning for old methods that supported device selection if device isn't GPU
  • Fix pickling (should be easy to do, just haven't done it yet)
  • Expand tests for the generic ProxyBase functionality
  • Unxfail any newly passing sklearn test suite tests

I don't intend to port any more estimators in this PR. Converting the other proxies to the new system can be done incrementally in follow up PRs.

Part of #6502.

@copy-pr-bot

copy-pr-bot Bot commented May 1, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Cython / Python Cython or Python issue label May 1, 2025
@jcrist jcrist added breaking Breaking change improvement Improvement / enhancement to an existing function labels May 1, 2025
@jcrist
Member Author

jcrist commented May 1, 2025

/ok to test

@csadorf csadorf linked an issue May 5, 2025 that may be closed by this pull request
@jcrist
Member Author

jcrist commented May 6, 2025

/ok to test

@jcrist jcrist force-pushed the new-estimator-proxy branch from 8da0f7b to 926d044 on May 7, 2025 04:40
@jcrist
Member Author

jcrist commented May 7, 2025

/ok to test

Comment thread python/cuml/cuml/accel/estimator_proxy.py
Comment thread python/cuml/cuml/internals/interop.py
Comment thread python/cuml/cuml/internals/interop.py
Comment thread python/cuml/cuml/internals/interop.py
Comment thread python/cuml/cuml/internals/interop.py
Comment thread python/cuml/cuml/internals/interop.py
@jcrist jcrist force-pushed the new-estimator-proxy branch from 926d044 to 8475f08 on May 7, 2025 19:33
@jcrist jcrist marked this pull request as ready for review May 7, 2025 19:34
@jcrist jcrist requested a review from a team as a code owner May 7, 2025 19:34
@jcrist jcrist requested review from dantegd and viclafargue May 7, 2025 19:34
@jcrist jcrist changed the title [WIP] New Estimator Proxy architecture New Estimator Proxy architecture May 7, 2025
@jcrist
Member Author

jcrist commented May 7, 2025

Ok, this is ready for review. I haven't updated the sklearn xfail list yet (was waiting on #6629 to be merged first, will hack away at this now that it's in), but everything else should be relatively good to go.

@jcrist jcrist force-pushed the new-estimator-proxy branch from 8475f08 to b452112 on May 7, 2025 22:25
Member

@dantegd dantegd left a comment

Overall really like the new design!

Comment thread python/cuml/cuml/accel/_wrappers/sklearn/linear_model.py
Comment thread python/cuml/cuml/accel/_wrappers/sklearn/linear_model.py
Comment thread python/cuml/cuml/accel/estimator_proxy.py
Comment thread python/cuml/cuml/accel/estimator_proxy.py
Comment thread python/cuml/cuml/linear_model/elastic_net.pyx Outdated
Comment on lines +188 to +196
return {
"alpha": model.alpha,
"l1_ratio": model.l1_ratio,
"fit_intercept": model.fit_intercept,
"tol": tol,
"max_iter": model.max_iter,
"selection": model.selection,
}
Member

The method is significantly more verbose. That's great for flexibility, but I was wondering if it might be worth avoiding having to repeat all the parameters here, perhaps by using get_param_names minus the exceptions or something like that?

Contributor

I much prefer the explicit verbosity for clarity over any kind of logic that we might get wrong at this point. I think that reducing verbosity and explicitness would be better done in a follow-up.

# TODO: do we want to be changing the tolerance?
tol = 0.1 * self.tol

return {
Member

Same as the comment above. This seems error prone; do we have tests to check that these lists are correct?

Contributor

I would expect sklearn test regressions to signal mistakes here.

Comment thread python/cuml/cuml/linear_model/ridge.pyx Outdated
Comment thread python/cuml/cuml/tests/test_sklearn_import_export.py
@csadorf csadorf self-requested a review May 8, 2025 17:26
@csadorf
Contributor

csadorf commented May 8, 2025

Just noting that I'd like to get a chance for review prior to merge.

Contributor

@csadorf csadorf left a comment

Just a couple of comments, but this looks great!

Comment thread python/cuml/cuml/accel/estimator_proxy.py

jcrist added 2 commits May 8, 2025 19:49
This adds a new mixin class for defining methods for converting cuml
estimators to/from their CPU-based counterparts.

This replaces the functionality defined currently in `UniversalBase`,
decoupling the conversion methods (``as_sklearn``, ``from_sklearn``)
from the current device selection handling.

The new estimator proxy base class will then make use of the methods
defined in this mixin class.
This adds a new base class for defining proxy estimators, making use of
the new `InteropMixin`. With this base class, proxy estimators will be
true sklearn estimators, wrapping both a sklearn estimator and a cuml
estimator (composition instead of inheritance). This lets us achieve
better compatibility with sklearn, since we can fully replicate the
required interface without requiring massive changes to existing cuml
estimators.

The migration plan for porting `cuml.accel` estimators is then:

- Move a cuml estimator back to subclassing from `Base` instead of
  `UniversalBase`.
- Add in the `InteropMixin` and define the required methods
- Define a new subclass of `ProxyBase` in the proper wrappers file. Add
  in any required special-cased method definitions as needed.
jcrist added 18 commits May 8, 2025 19:49
We no longer support selecting CPU execution with `using_device_type`
for methods on classes ported to `InteropMixin`. From our understanding,
very very few users made use of these, so we're opting for a breaking
change in removing this functionality. Users are instead encouraged to
make use of the upstream libraries (e.g. sklearn) directly, and use the
`from_sklearn`/`as_sklearn` methods to coerce to/from cuML as needed.

Since old code using `using_device_type` won't error (but will silently
always use GPU), we add a warning if the device type is explicitly set
to CPU execution on any method previously decorated with
`enable_device_selection`.
Pickling is now implemented in a way where loading the pickle:

- Works in environments where `cuml` is not installed
- Works if `cuml.accel` is not enabled

In both cases, the CPU estimator is returned instead.

Also expands tests for the `ProxyBase` estimator class. Still more to do
here, but we're getting closer to full coverage.
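A hedged sketch of the pickling behaviour described above: the proxy pickles down to its plain CPU estimator, so loading the pickle needs neither cuml nor cuml.accel. All names here are illustrative stand-ins, not the actual implementation.

```python
# Sketch: __reduce__ serializes only the CPU estimator, so unpickling
# works in an environment without the GPU library installed. Names are
# hypothetical, not the real cuml code.
import pickle

class CPUEstimator:
    """Stand-in for the sklearn estimator."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha

def _rehydrate(cpu):
    # In the real design this could re-wrap the CPU estimator in a proxy
    # when cuml.accel is active; here it simply returns it as-is.
    return cpu

class ProxyEstimator:
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def _to_cpu(self):
        return CPUEstimator(alpha=self.alpha)

    def __reduce__(self):
        # Pickle as (callable, args): only the CPU estimator is stored.
        return (_rehydrate, (self._to_cpu(),))

restored = pickle.loads(pickle.dumps(ProxyEstimator(alpha=0.25)))
print(type(restored).__name__, restored.alpha)  # CPUEstimator 0.25
```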
This:
- Makes the separation between `ProxyBase` and the legacy `ProxyMixin`
  clearer.
- Makes it possible to import cuml without sklearn installed. Once
  sklearn is a required dependency (and the `ProxyMixin` is removed),
  `is_proxy` can be moved back into `estimator_proxy.py` where it makes
  more sense to be located.
- Support `__sklearn_is_fitted__` since the default `check_is_fitted`
  doesn't work for `ProxyBase`.
- Oops, `LinearRegression` _can_ be multi-target. Fix the `InteropMixin`
  implementations as needed.
Previously we only did the sane things a user would normally do, but the
sklearn test suite wants to muck with internals. We now proxy everything
to help pass more of the sklearn test suite.
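The "proxy everything" idea can be sketched with Python's attribute hooks: unknown reads and all writes forward to the wrapped estimator, so external mutation of internals (as the sklearn test suite does) stays consistent. This is a minimal illustration, not the actual code.

```python
# Sketch of full attribute forwarding between a proxy and its wrapped
# estimator. Names are illustrative only.
class Wrapped:
    """Stand-in for the wrapped estimator."""
    def __init__(self):
        self.coef_ = [1.0, 2.0]

class Proxy:
    def __init__(self, wrapped):
        # Bypass our own __setattr__ while storing the wrapped object.
        object.__setattr__(self, "_wrapped", wrapped)

    def __getattr__(self, name):
        # Called only when normal lookup fails; forward to the wrapped object.
        return getattr(self._wrapped, name)

    def __setattr__(self, name, value):
        # Forward all writes so state lives in exactly one place.
        setattr(self._wrapped, name, value)

p = Proxy(Wrapped())
p.coef_ = [3.0]          # write lands on the wrapped estimator
print(p.coef_)           # [3.0]
```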
Currently cuml does this wrong when it does it at all. For now we avoid
forwarding this fit attribute.
- Validate hyperparameters before fit using sklearn's `_validate_params`
- Fixup repr to use the native `BaseEstimator` repr, which plays nicely
  with sklearn's pprint features in meta-estimators.
Sklearn generates the parameterized tests in `test_common` dynamically.
Before when our proxy estimators weren't subclasses of `BaseEstimator`
they were fully ignored by these tests. Now that they're true
`BaseEstimator` subclasses, there are suddenly a lot more tests for
them. We pass _most_ of the checks, the ones that we don't have been
added to the xfail list.

There are a few other newly failing tests. I went through them
individually - in most cases the issue was that before we had a bug, and
that bug just happened to make the test not fail. A good example of this
is the `ElasticNet` tests - there's a few that run `ElasticNet` with two
different sets of hyperparameters and compare the results. Before we
were silently dropping the relevant hyperparameters, so both estimators
were identical and would of course pass. Now they're not, and we don't
have identical `coef_` since one is running on GPU and one on CPU. But
the scores of both estimators are comparable.

We also remove several xfailed tests. All in all, I think this PR is a
net positive, and afaict doesn't introduce new bugs per the sklearn test
suite.
@jcrist jcrist force-pushed the new-estimator-proxy branch from b452112 to 063c939 on May 9, 2025 02:49
Contributor

@viclafargue viclafargue left a comment

Thanks @jcrist! LGTM, just a few questions/comments

Comment thread python/cuml/cuml/accel/estimator_proxy.py
Comment thread python/cuml/cuml/accel/estimator_proxy.py
Comment thread python/cuml/cuml/accel/estimator_proxy.py
Comment thread python/cuml/cuml/accel/estimator_proxy.py
Comment thread python/cuml/cuml/accel/_wrappers/sklearn/linear_model.py
@jcrist
Member Author

jcrist commented May 9, 2025

/merge

@rapids-bot rapids-bot Bot merged commit 23d2fb5 into rapidsai:branch-25.06 May 9, 2025
88 checks passed
@jcrist jcrist deleted the new-estimator-proxy branch May 9, 2025 21:36
Ofek-Haim pushed a commit to Ofek-Haim/cuml that referenced this pull request May 13, 2025

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Simon Adorf (https://github.com/csadorf)
  - Victor Lafargue (https://github.com/viclafargue)

URL: rapidsai#6613

Labels: breaking (Breaking change), Cython / Python (Cython or Python issue), improvement (Improvement / enhancement to an existing function)

Successfully merging this pull request may close these issues: Implement new proxy estimator class using composition

6 participants