
Accelerate linear model predict on C-ordered inputs#7329

Merged
rapids-bot[bot] merged 1 commit into rapidsai:branch-25.12 from jcrist:accelerate-linear-predict
Oct 13, 2025

Conversation

@jcrist
Member

@jcrist jcrist commented Oct 10, 2025

This started out as a cleanup PR, but moved to a performance improvement after some benchmarking.

`LinearRegression`, `ElasticNet`, `Lasso`, and `Ridge` all share the same `predict` method. This calculates `X.dot(coef.T) + intercept`.

Previously we used a function from `libcuml` to compute the single-target case, and `cupy` to handle the multi-target case.

After some benchmarking, I no longer think using `libcuml` here is worth it at all. It's simpler to always take the `cupy` path, and `cupy` already dispatches to cuBLAS appropriately to handle disparate layouts (C vs F).

For F-ordered inputs we see roughly the same performance as before.

For C-ordered inputs, we see anything from mild speedups (150 µs now vs 200 µs before) on small data, up to a 10x speedup on larger data (0.75 ms now vs 8.4 ms before). Presumably this is due to avoiding the unnecessary copies previously made to force a uniform F order.
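The unified path amounts to a single matrix product. A minimal sketch of the arithmetic (numpy stands in for cupy here so the example runs anywhere; `predict_linear` is a hypothetical name, not the actual cuml function, and cupy's `@` operator dispatches to cuBLAS the same way numpy dispatches to its BLAS):

```python
import numpy as np

def predict_linear(X, coef, intercept):
    """Compute X.dot(coef.T) + intercept.

    For a single target, coef has shape (n_features,); for k targets it
    has shape (k, n_features). The same expression covers both cases, and
    the underlying GEMM handles C- and F-ordered X without an explicit
    copy to a uniform layout.
    """
    return X @ coef.T + intercept

rng = np.random.default_rng(42)
X_c = rng.standard_normal((4, 3))    # C-ordered input
X_f = np.asfortranarray(X_c)         # same values, F-ordered
coef = rng.standard_normal(3)        # single-target coefficients

pred_c = predict_linear(X_c, coef, 0.5)
pred_f = predict_linear(X_f, coef, 0.5)
assert np.allclose(pred_c, pred_f)   # memory layout does not change the result
```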

@jcrist jcrist self-assigned this Oct 10, 2025
@jcrist jcrist added the Cython / Python Cython or Python issue label Oct 10, 2025
@jcrist jcrist requested review from a team as code owners October 10, 2025 17:59
@jcrist jcrist added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 10, 2025
@jcrist jcrist requested a review from dantegd October 10, 2025 17:59
@github-actions github-actions Bot added the CMake label Oct 10, 2025
@jcrist
Member Author

jcrist commented Oct 10, 2025

Since the `coef_` and `intercept_` order, shape, and dtype are identical across all these models (regardless of how they were generated), I arbitrarily selected `LinearRegression` for the benchmarking. The results are the same for all models, though.

bench.py
from itertools import product
from time import perf_counter

import cupy as cp
from cuml import LinearRegression
from cuml.datasets import make_regression


N_FEATURES = [100, 1000]
N_SAMPLES = [1000, 10_000, 100_000]
DTYPES = ["float32", "float64"]
ORDERS = ["C", "F"]
N_RUNS = 5


for order, dtype in product(ORDERS, DTYPES):
    print(f"{order = }, {dtype = }")

    for n_features, n_samples in product(N_FEATURES, N_SAMPLES):
        X, y = make_regression(
            n_samples,
            n_features,
            dtype="float" if dtype == "float32" else "double",
            random_state=42,
        )
        model = LinearRegression().fit(X, y)

        X = cp.asarray(X, order=order)
        # Warmup
        for _ in range(N_RUNS):
            model.predict(X)

        start = perf_counter()
        for _ in range(N_RUNS):
            model.predict(X)
        duration = (perf_counter() - start) / N_RUNS
        print(f"- {X.shape}: {duration * 1e3:.3f} ms")
**Before this PR**
order = 'C', dtype = 'float32'
- (1000, 100): 0.197 ms
- (10000, 100): 0.390 ms
- (100000, 100): 1.498 ms
- (1000, 1000): 0.408 ms
- (10000, 1000): 1.490 ms
- (100000, 1000): 8.422 ms
order = 'C', dtype = 'float64'
- (1000, 100): 0.188 ms
- (10000, 100): 0.378 ms
- (100000, 100): 1.870 ms
- (1000, 1000): 0.383 ms
- (10000, 1000): 1.865 ms
- (100000, 1000): 11.153 ms
order = 'F', dtype = 'float32'
- (1000, 100): 0.130 ms
- (10000, 100): 0.128 ms
- (100000, 100): 0.184 ms
- (1000, 1000): 0.131 ms
- (10000, 1000): 0.188 ms
- (100000, 1000): 0.788 ms
order = 'F', dtype = 'float64'
- (1000, 100): 0.135 ms
- (10000, 100): 10.009 ms
- (100000, 100): 0.258 ms
- (1000, 1000): 0.147 ms
- (10000, 1000): 0.314 ms
- (100000, 1000): 1.355 ms
**After this PR**
order = 'C', dtype = 'float32'
- (1000, 100): 0.151 ms
- (10000, 100): 0.151 ms
- (100000, 100): 0.171 ms
- (1000, 1000): 0.222 ms
- (10000, 1000): 0.170 ms
- (100000, 1000): 0.740 ms
order = 'C', dtype = 'float64'
- (1000, 100): 0.148 ms
- (10000, 100): 0.151 ms
- (100000, 100): 0.307 ms
- (1000, 1000): 0.140 ms
- (10000, 1000): 0.235 ms
- (100000, 1000): 1.474 ms
order = 'F', dtype = 'float32'
- (1000, 100): 0.138 ms
- (10000, 100): 0.152 ms
- (100000, 100): 0.168 ms
- (1000, 1000): 0.146 ms
- (10000, 1000): 0.175 ms
- (100000, 1000): 0.728 ms
order = 'F', dtype = 'float64'
- (1000, 100): 0.163 ms
- (10000, 100): 0.149 ms
- (100000, 100): 0.240 ms
- (1000, 1000): 0.139 ms
- (10000, 1000): 0.222 ms
- (100000, 1000): 1.420 ms
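The benchmark above exercises the single-target case; the multi-target case goes through the same `X @ coef.T + intercept` expression, with the intercept broadcasting across targets. A small illustration of the shapes involved (numpy stands in for cupy so this runs without a GPU):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_targets = 5, 3, 2
X = rng.standard_normal((n_samples, n_features))
coef = rng.standard_normal((n_targets, n_features))  # one row per target
intercept = rng.standard_normal(n_targets)           # one offset per target

# (5, 3) @ (3, 2) -> (5, 2); intercept of shape (2,) broadcasts over rows
pred = X @ coef.T + intercept
assert pred.shape == (n_samples, n_targets)
```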

@jcrist jcrist mentioned this pull request Oct 10, 2025
@csadorf
Contributor

csadorf commented Oct 13, 2025

I've run slightly more extensive benchmarks with larger data and don't observe any significant regressions on F-ordered data, while seeing a major speed-up on C-ordered data. I think this is a good improvement.

@jcrist
Member Author

jcrist commented Oct 13, 2025

/merge

@rapids-bot rapids-bot Bot merged commit 32230d8 into rapidsai:branch-25.12 Oct 13, 2025
105 checks passed
@jcrist jcrist deleted the accelerate-linear-predict branch October 13, 2025 14:52
rapids-bot Bot pushed a commit that referenced this pull request Oct 13, 2025
- Release GIL
- Simple `__init__` following sklearn conventions
- Only warn on single input if solver set explicitly, otherwise if `auto` and only 1 column default to `svd` automatically without warning.
- General readability cleanups

On top of #7329 (relies on some changes there). Part of #7317.

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Simon Adorf (https://github.com/csadorf)
  - Victor Lafargue (https://github.com/viclafargue)

URL: #7330