
Accelerate linear model predict on C-ordered inputs#7329

Merged
rapids-bot[bot] merged 1 commit into rapidsai:branch-25.12 from jcrist:accelerate-linear-predict
Oct 13, 2025

Conversation

@jcrist
Member

@jcrist jcrist commented Oct 10, 2025

This started out as a cleanup PR, but moved to a performance improvement after some benchmarking.

`LinearRegression`, `ElasticNet`, `Lasso`, and `Ridge` all share the same `predict` method. This calculates `X.dot(coef.T) + intercept`.

Previously we used a function from `libcuml` to compute the single-target case, and `cupy` to handle the multi-target case.

After some benchmarking, I no longer think using `libcuml` here is worth it at all. It's simpler to always take the `cupy` path, and `cupy` already dispatches to cuBLAS appropriately to handle disparate layouts (C vs F).

For F-ordered inputs we see roughly the same performance as before.

For C-ordered inputs, we see anything from mild speedups (150 µs now vs 200 µs before) on small data, up to a 10x speedup on larger data (0.75 ms now vs 8.4 ms before). Presumably this is due to avoiding the unnecessary copies previously made to force a uniform F order.
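The unified path amounts to a single matrix product. A minimal sketch of the arithmetic (numpy stands in for cupy here so the example runs anywhere; `predict_linear` is a hypothetical name, not the actual cuml function, and cupy's `@` operator dispatches to cuBLAS the same way numpy dispatches to its BLAS):

```python
import numpy as np

def predict_linear(X, coef, intercept):
    """Compute X.dot(coef.T) + intercept.

    For a single target, coef has shape (n_features,); for k targets it
    has shape (k, n_features). The same expression covers both cases, and
    the underlying GEMM handles C- and F-ordered X without an explicit
    copy to a uniform layout.
    """
    return X @ coef.T + intercept

rng = np.random.default_rng(42)
X_c = rng.standard_normal((4, 3))    # C-ordered input
X_f = np.asfortranarray(X_c)         # same values, F-ordered
coef = rng.standard_normal(3)        # single-target coefficients

pred_c = predict_linear(X_c, coef, 0.5)
pred_f = predict_linear(X_f, coef, 0.5)
assert np.allclose(pred_c, pred_f)   # memory layout does not change the result
```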

@jcrist jcrist self-assigned this Oct 10, 2025
@jcrist jcrist added the Cython / Python Cython or Python issue label Oct 10, 2025
@jcrist jcrist requested review from a team as code owners October 10, 2025 17:59
@jcrist jcrist added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 10, 2025
@jcrist jcrist requested a review from dantegd October 10, 2025 17:59
@github-actions github-actions Bot added the CMake label Oct 10, 2025
@jcrist
Member Author

jcrist commented Oct 10, 2025

Since the `coef_` and `intercept_` order, shape, and dtype are identical across all these models (regardless of how they were generated), I arbitrarily selected `LinearRegression` for the benchmarking. The results are the same for all models, though.

bench.py
from itertools import product
from time import perf_counter

import cupy as cp
from cuml import LinearRegression
from cuml.datasets import make_regression


N_FEATURES = [100, 1000]
N_SAMPLES = [1000, 10_000, 100_000]
DTYPES = ["float32", "float64"]
ORDERS = ["C", "F"]
N_RUNS = 5


for order, dtype in product(ORDERS, DTYPES):
    print(f"{order = }, {dtype = }")

    for n_features, n_samples in product(N_FEATURES, N_SAMPLES):
        X, y = make_regression(
            n_samples,
            n_features,
            dtype="float" if dtype == "float32" else "double",
            random_state=42,
        )
        model = LinearRegression().fit(X, y)

        X = cp.asarray(X, order=order)
        # Warmup
        for _ in range(N_RUNS):
            model.predict(X)

        start = perf_counter()
        for _ in range(N_RUNS):
            model.predict(X)
        duration = (perf_counter() - start) / N_RUNS
        print(f"- {X.shape}: {duration * 1e3:.3f} ms")
**Before this PR**
order = 'C', dtype = 'float32'
- (1000, 100): 0.197 ms
- (10000, 100): 0.390 ms
- (100000, 100): 1.498 ms
- (1000, 1000): 0.408 ms
- (10000, 1000): 1.490 ms
- (100000, 1000): 8.422 ms
order = 'C', dtype = 'float64'
- (1000, 100): 0.188 ms
- (10000, 100): 0.378 ms
- (100000, 100): 1.870 ms
- (1000, 1000): 0.383 ms
- (10000, 1000): 1.865 ms
- (100000, 1000): 11.153 ms
order = 'F', dtype = 'float32'
- (1000, 100): 0.130 ms
- (10000, 100): 0.128 ms
- (100000, 100): 0.184 ms
- (1000, 1000): 0.131 ms
- (10000, 1000): 0.188 ms
- (100000, 1000): 0.788 ms
order = 'F', dtype = 'float64'
- (1000, 100): 0.135 ms
- (10000, 100): 10.009 ms
- (100000, 100): 0.258 ms
- (1000, 1000): 0.147 ms
- (10000, 1000): 0.314 ms
- (100000, 1000): 1.355 ms
**After this PR**
order = 'C', dtype = 'float32'
- (1000, 100): 0.151 ms
- (10000, 100): 0.151 ms
- (100000, 100): 0.171 ms
- (1000, 1000): 0.222 ms
- (10000, 1000): 0.170 ms
- (100000, 1000): 0.740 ms
order = 'C', dtype = 'float64'
- (1000, 100): 0.148 ms
- (10000, 100): 0.151 ms
- (100000, 100): 0.307 ms
- (1000, 1000): 0.140 ms
- (10000, 1000): 0.235 ms
- (100000, 1000): 1.474 ms
order = 'F', dtype = 'float32'
- (1000, 100): 0.138 ms
- (10000, 100): 0.152 ms
- (100000, 100): 0.168 ms
- (1000, 1000): 0.146 ms
- (10000, 1000): 0.175 ms
- (100000, 1000): 0.728 ms
order = 'F', dtype = 'float64'
- (1000, 100): 0.163 ms
- (10000, 100): 0.149 ms
- (100000, 100): 0.240 ms
- (1000, 1000): 0.139 ms
- (10000, 1000): 0.222 ms
- (100000, 1000): 1.420 ms
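The benchmark above exercises the single-target case; the multi-target case goes through the same `X @ coef.T + intercept` expression, with the intercept broadcasting across targets. A small illustration of the shapes involved (numpy stands in for cupy so this runs without a GPU):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_targets = 5, 3, 2
X = rng.standard_normal((n_samples, n_features))
coef = rng.standard_normal((n_targets, n_features))  # one row per target
intercept = rng.standard_normal(n_targets)           # one offset per target

# (5, 3) @ (3, 2) -> (5, 2); intercept of shape (2,) broadcasts over rows
pred = X @ coef.T + intercept
assert pred.shape == (n_samples, n_targets)
```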

@jcrist jcrist mentioned this pull request Oct 10, 2025
@csadorf
Contributor

csadorf commented Oct 13, 2025

I've run slightly more extensive benchmarks with larger data and don't observe any significant regressions on F-ordered data, while seeing a major speed-up on C-ordered data. I think this is a good improvement.

@jcrist
Member Author

jcrist commented Oct 13, 2025

/merge

@rapids-bot rapids-bot Bot merged commit 32230d8 into rapidsai:branch-25.12 Oct 13, 2025
105 checks passed
@jcrist jcrist deleted the accelerate-linear-predict branch October 13, 2025 14:52
rapids-bot Bot pushed a commit that referenced this pull request Oct 13, 2025
- Release GIL
- Simple `__init__` following sklearn conventions
- Only warn on single input if solver set explicitly, otherwise if `auto` and only 1 column default to `svd` automatically without warning.
- General readability cleanups

On top of #7329 (relies on some changes there). Part of #7317.

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Simon Adorf (https://github.com/csadorf)
  - Victor Lafargue (https://github.com/viclafargue)

URL: #7330