Skip to content

Make r2_score compatible with Scikit-Learn#6337

Merged
rapids-bot[bot] merged 4 commits intorapidsai:branch-25.04from
jcrist:r2-score-compat
Feb 21, 2025
Merged

Make r2_score compatible with Scikit-Learn#6337
rapids-bot[bot] merged 4 commits intorapidsai:branch-25.04from
jcrist:r2-score-compat

Conversation

@jcrist
Copy link
Copy Markdown
Member

@jcrist jcrist commented Feb 19, 2025

This makes cuml.metrics.r2_score compatible with the same function in sklearn.

In particular, we now support the sample_weight and multioutput options.

  • Moved the implementation to use cupy instead of calling functions in libcuml. The other regression metrics already used cupy, so there's precedence here.
  • Since the implementation no longer uses cython, we can move the file to .py and delete a bunch of cython code. Woot.
  • All the regression metrics share some input setup and processing; the code paths here were updated to share code, there should be no change in functionality in the other metrics.
  • The testing for the regression metrics was updated to improve coverage of options and ensure compatibility with sklearn.

This also adds a sample_weight keyword to RegressorMixin.score, to improve scikit-learn compatibility for our regressor models. A corresponding test was added using LinearRegression.

@jcrist jcrist requested review from a team as code owners February 19, 2025 20:07
@jcrist jcrist requested review from cjnolet and vyasr February 19, 2025 20:07
@github-actions github-actions Bot added Cython / Python Cython or Python issue CMake labels Feb 19, 2025
@jcrist jcrist added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cuml-cpu and removed Cython / Python Cython or Python issue CMake labels Feb 19, 2025
Comment thread python/cuml/cuml/metrics/regression.py
Comment thread python/cuml/cuml/tests/test_metrics.py
Comment thread python/cuml/cuml/tests/test_linear_model.py
Comment thread python/cuml/cuml/ensemble/randomforestregressor.pyx
Comment thread python/cuml/cuml/metrics/regression.py
Copy link
Copy Markdown
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this is ripping out the calls to the c++ layer but as I understand it that’s not the ideal route to the 0cc path. Just want to flag this before it’s merged.

I feel this is moving the wrong direction. There are definitely some functions that still call cupy but in general we have been moving the other direction (we implement in cupy for convenience and will move things to c++ later / as needed).

Comment thread python/cuml/cuml/metrics/regression.pxd
Comment thread python/cuml/cuml/metrics/regression.py
@csadorf
Copy link
Copy Markdown
Contributor

csadorf commented Feb 20, 2025

Not sure why this is ripping out the calls to the c++ layer but as I understand it that’s not the ideal route to the 0cc path. Just want to flag this before it’s merged.

I feel this is moving the wrong direction. There are definitely some functions that still call cupy but in general we have been moving the other direction (we implement in cupy for convenience and will move things to c++ later / as needed).

@cjnolet I fully agree that we should evaluate this further in broader context, but I would recommend that we do so after the patch release. Right now, I think the priority should be to reach parity and address our current failures.

Would you say that this is important enough that we need to address this immediately or can we reevaluate post-GTC?

@cjnolet
Copy link
Copy Markdown
Member

cjnolet commented Feb 20, 2025

@csadorf we should be calling c++ primitives when they exist and not removing them from the Python layer. The changes in this PR are a step backwards as they are and I'd like that to be fixed before they are merged.

@jcrist
Copy link
Copy Markdown
Member Author

jcrist commented Feb 20, 2025

we should be calling c++ primitives when they exist and not removing them from the Python layer.

Why though? For this function, performance is likely uncritical, and the current c++ implementation is missing several features that we need in cuml. Having two code paths (one that calls the C++ function when possible, falling back to cupy otherwise) adds extra complication, especially when we're already using cupy for all the other metrics in this file. From a maintenance perspective I'd rather keep the code simple and isolated in cuml, rather than requiring we call a C++ function split across 2 repos just because it exists.

@csadorf
Copy link
Copy Markdown
Contributor

csadorf commented Feb 20, 2025

@csadorf we should be calling c++ primitives when they exist and not removing them from the Python layer. The changes in this PR are a step backwards as they are and I'd like that to be fixed before they are merged.

The motivation for this change is to overcome a current limitation of the 0cc layer. I would be happy to classify this as tech debt if need be.

Taking advantage of existing primitives instead of replicating them is a win in my book as long as we maintain performance. In this particular case, we could easily achieve that by calling the c++ primitives whenever possible and only drop that if our benchmarks show no speed-up across all scales.

As it stands, it seems like this change is too controversial as is, so I’d say we just document the limitation for the patch release and revisit this at a later point.

Copy link
Copy Markdown
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked w/ Dante about this offline. I see the vision here. Approving.

@github-actions github-actions Bot added Cython / Python Cython or Python issue CMake labels Feb 20, 2025
@dantegd
Copy link
Copy Markdown
Member

dantegd commented Feb 21, 2025

/merge

This updates `r2_score` to match the interface provided by
`sklearn.metrics.r2_score`. In particular, we now support the
`sample_weight` and `multioutput` options.

- Moved the implementation to use `cupy` instead of calling functions in
  `libcuml`. The other regression metrics already used `cupy`, so
  there's precedence here.

- Since the implementation no longer uses cython, we can move the file
  to `.py` and delete a bunch of cython code. Woot.

- All the regression metrics share some input setup and processing; the
  code paths here were updated to share code, there should be no change
  in functionality in the other metrics.

- The testing for the regression metrics was updated to improve coverage
  of options and ensure compatibility with sklearn.
All scikit-learn regressors support a `sample_weight` keyword in their
`score` method. This adds support for that (improving sklearn
compatibility) as well as a test.
@rapids-bot rapids-bot Bot merged commit 23a08a3 into rapidsai:branch-25.04 Feb 21, 2025
@jcrist jcrist deleted the r2-score-compat branch February 21, 2025 22:33
dantegd added a commit to dantegd/cuml that referenced this pull request Feb 23, 2025
dantegd added a commit to dantegd/cuml that referenced this pull request Feb 26, 2025
raydouglass pushed a commit that referenced this pull request Feb 28, 2025
PRs being backported: 

- [x] #6234
- [x] #6306
- [x] #6320
- [x] #6319
- [x] #6327
- [x] #6333
- [x] #6142 
- [x] #6223
- [x] #6235
- [x] #6317 
- [x] #6331
- [x] #6326
- [x] #6332
- [x] #6347
- [x] #6348
- [x] #6337
- [x] #6355
- [x] #6354
- [x] #6322
- [x] #6353
- [x] #6359
- [x] #6364
- [x] #6363
- [x] [FIL BATCH_TREE_REORG fix for SM90, 100 and
120](a3e419a)

---------

Co-authored-by: William Hicks <whicks@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CMake Cython / Python Cython or Python issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants