Speedup dask LogisticRegression tests #6607

Merged
rapids-bot[bot] merged 7 commits into rapidsai:branch-25.06 from jcrist:speedup-dask-lr-tests
Apr 30, 2025
Conversation

@jcrist
Member

@jcrist commented Apr 29, 2025

This PR applies a few changes (see the commit messages for details) to speed up the dask LogisticRegression tests. Most of the changes fall into one of a few categories:

  • Removing useless parametrization (either unnecessary for testing the specific feature targeted by the test, or ignored entirely, merely doubling the number of tests run)
  • Reducing the scale tested by a bit
  • Coupling certain parameter combinations to reduce the number of tests without reducing coverage
  • Using a faster solver for the CPU versions

Altogether this reduces the time taken from 28 minutes to 7 minutes on my machine, a 4x speedup.

For what I assume are historical reasons, most of the dask test suite doesn't run in PRs since it's gated behind `quality_param`/`stress_param` annotations. This file is one of the exceptions, and thus takes ~1/2 the time used for a single PR test run. Rather than add those annotations here (I'm mostly against them and hope we can remove them, as discussed in #6580), I've opted to make the tests here more targeted and faster without skipping certain tests in PRs.
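
The "coupling certain parameter combinations" idea can be sketched with pytest. This is a hypothetical illustration, not the PR's actual code: the dtype and class-count values, the ids, and the test name `test_lr_fit` are all made up.

```python
import pytest

# A full cross-product of 2 dtypes x 3 class counts would give 6 test
# cases. Coupling each dtype to one representative class count cuts the
# matrix to 2 cases while still covering every value on both axes.
full_grid = [(dt, nc) for dt in ("float32", "float64") for nc in (2, 6, 10)]

coupled = [
    pytest.param("float32", 2, id="float32-binary"),
    pytest.param("float64", 6, id="float64-multiclass"),
]

@pytest.mark.parametrize("datatype, n_classes", coupled)
def test_lr_fit(datatype, n_classes):
    # Placeholder body; the real tests fit a dask LogisticRegression.
    assert n_classes >= 2
```

The explicit `id=` strings also make the (now fewer) cases easier to read in the test output.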

jcrist added 7 commits April 29, 2025 12:43
No need to parametrize this test based on scale; using a fixed scale
should be sufficient.
Using a faster solver for the CPU model results in a 3x speedup.
Using a slightly smaller number of classes results in some measurable
speedups.
No need to parametrize these tests based on `n_classes`; the behavior
being tested isn't dependent on the number of classes.
More directly exercises the behavior under test here.
The behavior being tested in these tests doesn't vary based on
`fit_intercept`.
This parametrization was unused; it was just doubling the number of
tests run without any variance.
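
The "fixed scale" commit above can be illustrated with a minimal numpy sketch. This is a hypothetical example; `FIXED_SCALE`, the data shape, and the labeling rule are assumptions, not the PR's actual values.

```python
import numpy as np

# Instead of parametrizing a test over several scales, generate the data
# at one fixed, representative scale: the behavior being asserted on
# doesn't depend on it, so one value suffices.
FIXED_SCALE = 1.0  # assumed value

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5)) * FIXED_SCALE
y = (X[:, 0] > 0).astype(int)  # simple separable labels for a smoke test
```
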
@jcrist requested a review from a team as a code owner April 29, 2025 20:31
@jcrist requested review from betatim and csadorf April 29, 2025 20:31
@github-actions added the Cython / Python label Apr 29, 2025
@jcrist self-assigned this Apr 29, 2025
@jcrist added the tests, Dask / cuml.dask, improvement, and non-breaking labels Apr 29, 2025
Comment thread python/cuml/cuml/tests/dask/test_dask_logistic_regression.py
Member

@betatim left a comment

Looks reasonable. Something I haven't thought about is whether there are weird numerical effects that we would only see with one of the two dtypes, which would suggest that we should have (at least) one test for each dtype. But then none of the tests seem to check for "weird numerical" things.

@jcrist
Member Author

jcrist commented Apr 30, 2025

Note that we do still have plenty of tests for each dtype, we're just not running them on every test. Tests for specific behavior check that specific behavior (with minimal unnecessary parametrization). Then there are general tests for each penalty that have a broader parametrization.

@jcrist
Member Author

jcrist commented Apr 30, 2025

/merge

@rapids-bot merged commit 83f9bcd into rapidsai:branch-25.06 Apr 30, 2025
78 checks passed
@jcrist deleted the speedup-dask-lr-tests branch April 30, 2025 15:14
Ofek-Haim pushed a commit to Ofek-Haim/cuml that referenced this pull request May 13, 2025

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Tim Head (https://github.com/betatim)

URL: rapidsai#6607
