Documentation and Testing Infrastructure Updates #6580
Conversation
It should be noted that while the linear models test module is now more consistently using …
> We use [pytest](https://docs.pytest.org/en/latest/) for writing and running tests. To see existing examples, refer to any of the `test_*.py` files in the folder `cuml/tests`.
> ### Test Organization
>
> - Keep all tests for a single estimator in one file, with exceptions for:
A long-standing pet peeve of mine: `test_linear_model` actually contains tests for multiple estimators. Perhaps we could split it as part of this PR?
Yes, I think splitting them up in this PR makes sense.
Revising my previous statement: to retain clarity on the changes here, we should do that in a follow-up.
> ### Test Parameter Levels
>
> You can mark test parameters for different scales with `unit_param`, `quality_param`, and `stress_param`.
This might not be a bad time to rethink the names of these three, or at least define them better. What is the difference between unit and quality? Perhaps we should use this opportunity to distinguish tests that we only want to run in nightly runs?
Not only are they somewhat poorly defined, I'm not fully convinced that we need them at all.
I think I agree with Simon. At least I am a bit puzzled about what this is for. Tests are for testing correctness and benchmarks are for benchmarking. And pytest is not a great tool for writing benchmarks; something like asv is probably more useful for that.
I'd recommend that we leave this as-is for now so that we can make progress with this PR. I'll propose to remove those scale qualifiers unless we can formulate convincing reasons to keep them.
+1 on removing these fully. Tests should be checking code behavior, and unless logical behavior changes with scale (as noted above and in the linked issue), what we're doing here is really performance testing, which would be better handled by other tools and specific performance tests. I'm very against hiding performance tests in with other tests.

Agreed this can be done later, but maybe we want to add a note here discouraging adding more of these (or remove documenting them here, or refer to them as legacy code somehow)?
I'll just remove them for this test module.
> ```python
> unit_param(2)  # For number of components
> ```
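For context, a minimal sketch of how these scale markers are conventionally combined with `pytest.mark.parametrize`; the import path and the concrete values here are assumptions for illustration, not taken from this PR:

```python
import pytest

# Assumed import path for cuml's test helpers.
from cuml.testing.utils import unit_param, quality_param, stress_param

# Each helper wraps its value in a pytest.param carrying a scale-specific
# mark, so a test run can select only unit-, quality-, or stress-scale inputs.
@pytest.mark.parametrize(
    "nrows", [unit_param(500), quality_param(5000), stress_param(500000)]
)
def test_fit_predict(nrows):
    ...  # build a dataset with nrows rows and exercise the estimator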
> 2. **Quality Tests** (`quality_param`): Medium values for thorough testing
Referring to my last comment, "medium" is not well defined here, nor is the difference between basic and thorough.
See my comment above; maybe we can take a step back and think about the motivation for this, which might help us determine whether we need this at all and, if yes, how exactly we want to define it.
> ```python
> @given(train_dtype=dataset_dtypes(), test_dtype=dataset_dtypes())
> @example(train_dtype=np.float32, test_dtype=np.float32)
> @example(train_dtype=np.float32, test_dtype=np.float64)
> @example(train_dtype=np.float64, test_dtype=np.float32)
> @example(train_dtype=np.float64, test_dtype=np.float64)
> ```
As a naive hypothesis user (aka not a lot of experience) I read this and thought: "Why do we still need the `given`? Don't the examples cover all the possible combinations?"

Is this just a different way of writing what the `pytest.mark.parametrize` was doing? Why change?
Yes, in this particular case they might actually cover all combinations. However, we might refine/expand our `dataset_dtypes()` definition to include more types in the future (and at least until recently they included dtypes with varying endianness). I'm OK with leaving this as-is even if it is currently a bit redundant.
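To illustrate the question above, a hypothetical `pytest.mark.parametrize` version of the same snippet would pin exactly these four combinations forever, whereas the `@given` strategy also covers whatever `dataset_dtypes()` may include later (test name here is illustrative):

```python
import numpy as np
import pytest

# Hypothetical equivalent of the four @example decorators above: the
# combinations are fixed, and new dtypes added to dataset_dtypes()
# would not be picked up automatically.
@pytest.mark.parametrize("train_dtype", [np.float32, np.float64])
@pytest.mark.parametrize("test_dtype", [np.float32, np.float64])
def test_dtype_combinations(train_dtype, test_dtype):
    ...
```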
> - Performance testing/benchmarking
> - Generic estimator checks (e.g., `test_base.py`)
> - Use small, focused datasets for correctness testing
> - Only parametrize scale when it triggers alternate code paths
```diff
- - Only parametrize scale when it triggers alternate code paths
+ - Only parametrize dataset size when it triggers alternate code paths
```
I find this easier to understand if "scale" does refer to dataset size; if not, I'm lost as to what "scale" means here.
> - Must include at least one `@example` for deterministic testing
> - Preferred for dataset generation
>
> ```python
> @example(dataset=small_regression_dataset(np.float32))
> ```
Why have the explicit example? Is it to combine "sample some random values for me" and "check this explicit value because I as a human think it is important" in one test?

In my head I assume that if you run the "normal" tests you get good coverage and check all the things that should be checked. The hypothesis tests are "bonus" tests that we run to find weird edge cases or combinations that we didn't think of, or to sample combinations that are too numerous to try exhaustively. But I have no idea if others think of hypothesis like this or not.
We do not run hypothesis tests with strategies during PR runs, only explicit examples. By requiring explicit examples we ensure that we always run tests on a deterministic input set and thus detect issues with the test implementation early.
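A sketch of that convention, assuming `regression_datasets` and `small_regression_dataset` are helpers from cuml's testing strategies (the import path and test name are assumptions):

```python
import numpy as np
from hypothesis import example, given

# Assumed import path for cuml's dataset strategies/helpers.
from cuml.testing.strategies import regression_datasets, small_regression_dataset

# The @example input runs deterministically on every PR; the @given
# strategy only explores additional generated datasets in nightly runs.
@given(dataset=regression_datasets())
@example(dataset=small_regression_dataset(np.float32))
def test_linear_regression_fit(dataset):
    ...
```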
I'm not sure if exhaustively testing all combinations of hyper-parameters is worth it, compared to sampling (enough) combinations. This is based on the idea that as the human author of tests I can reason that some parameter combinations are more useful to test than others (e.g. there is usually no need to repeat all parameter combinations with logging on and off; often a handful of combinations are enough to see that logging works and outputs sensible things). At least I'd hope that we can get enough coverage by writing explicit tests of "sensible" hyper-parameter combinations based on knowing which parameters interact and/or reusing things that have already been tested elsewhere (e.g. when using …).

One thing that I'd find helpful regarding the organisation of the tests is if we mirrored the directory structure of …
Agreed, the exhaustive testing of all parameter combinations might be overkill, and at least right now this PR increases the total test runtime. On the other hand, it's a bit difficult to decide whether we want to explore certain parameters exhaustively or stochastically. To keep things simple, I decided to recommend and implement the former everywhere.
Yes, we can improve test organization, but that should be a follow-up issue.
jcrist left a comment
I'd agree with Tim that we probably don't want to test every combination. Some estimators have many parameters, and doing a full Cartesian product will result in a large number of possibly redundant tests.
I do recognize I (and probably we) are at a bit of a disadvantage making changes to tests for code we did not write. When writing code it's easy to see what parameter combos are relevant for checking edge cases and any parameter interactions that might occur. I can make some inferences from documentation and a cursory reading of the code, but without deeper study of an estimator's code I can't be certain there isn't some interaction that has been missed.
That said, perhaps that's fine. We should trust our tests to be useful and sufficient for checking for failure modes when making changes. Right now we can't trust them because we're not convinced the tests are well written and provide adequate coverage.
Perhaps the best way forward is to make a best attempt at cleaning things up in a way that doesn't result in a slower-running test suite. This would require some code and docs reading to determine what cases are meaningful, but would likely result in a test suite at least as good as the one we currently have. Will we possibly miss meaningful cases? Sure. But it also would likely result in adding meaningful cases, and lay a better groundwork that we can iterate on later.
My worry is that if we decide to do a full parameter sweep in tests "just to be safe" that we'll end up with a test suite that is significantly slower than it already is, leading to a slower dev cycle (harder to run tests to check locally, slower CI, ...), without providing a known and meaningful increase in coverage. And given we may be pulled off onto other more important things, saying we'll reduce the test parametrization later may result in it never actually happening.
Based on the feedback on parameterization so far, I propose that we adopt language stating that we generally want to test all combinations stochastically (i.e., generally use hypothesis to explore the full parameter space) and only use parameterization whenever we explicitly want to test certain combinations exhaustively. If there are problematic edge cases, then we should discover them eventually through the nightly tests and can then proactively add them to our examples or opt for exhaustive parameterization.
This PR applies a few changes (see the commit messages for details) to speed up the dask `LogisticRegression` tests. Most of the changes fall into one of a few categories:

- Removing useless parametrization (either unnecessary for testing the specific feature targeted by the test, or actually ignored and just doubling the number of tests run)
- Reducing the scale tested by a bit
- Coupling certain parameter combinations to reduce the number of tests without reducing coverage
- Using a faster solver for the CPU versions

All together this reduces the time taken from 28 minutes to 7 minutes on my machine, a 4x speedup. For (I assume) historical reasons, most of the dask test suite doesn't run in PRs since it's gated behind `quality_param`/`stress_param` annotations. This file is one of the exceptions, and thus takes ~1/2 the time used for a single PR test run. Rather than add those annotations here (I'm mostly against them and hope we can remove them, as discussed in #6580), I've opted to make the tests here more targeted and faster without skipping certain tests in PRs.

Authors:

- Jim Crist-Harif (https://github.com/jcrist)

Approvers:

- Tim Head (https://github.com/betatim)

URL: #6607
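To illustrate the "coupling" point above, a hedged sketch (estimator, parameter names, and value pairs are hypothetical, not taken from the PR): enumerating only the meaningful pairs in a single parametrize avoids the full cross-product.

```python
import pytest

# One parametrize over tuples runs 3 coupled cases instead of the
# 2x2 = 4 cases that two stacked parametrize decorators would produce.
@pytest.mark.parametrize(
    "solver,penalty",
    [
        ("qn", "l1"),
        ("qn", "l2"),
        ("lbfgs", "l2"),  # l1 with lbfgs omitted: assumed unsupported
    ],
)
def test_logistic_regression(solver, penalty):
    ...
```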
Added sections on test organization, accuracy testing, memory usage considerations, and best practices for writing tests. Included detailed recommendations for using fixtures, parametrization, and hypothesis for test input generation. This update aims to improve the clarity and effectiveness of testing strategies within the codebase.
Included a section detailing how to run tests from the `python/cuml/` directory, covering common pytest commands and options.
Introduced a new section outlining three levels of test parameters: unit, quality, and stress tests.
The test should be removed with 25.08, not 24.08.
Updated the function names for checking dataset compatibility with scikit-learn and cuML from `sklearn_compatible_dataset` and `cuml_compatible_dataset` to `is_sklearn_compatible_dataset` and `is_cuml_compatible_dataset`, respectively. Adjusted all references in the testing module to reflect these changes, enhancing code readability and consistency.
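For illustration only, a hypothetical use of one renamed predicate inside a hypothesis test; the import paths and the signature shown are assumptions:

```python
from hypothesis import assume, given

# Assumed import paths and signature, shown only to illustrate the rename.
from cuml.testing.strategies import regression_datasets
from cuml.testing.utils import is_sklearn_compatible_dataset

@given(dataset=regression_datasets())
def test_matches_sklearn(dataset):
    # Skip generated inputs that scikit-learn cannot handle.
    assume(is_sklearn_compatible_dataset(dataset))
    ...
```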
Updated the `floating_dtypes` strategy to use a new implementation that generates only little-endian float32 and float64 dtypes supported by cuML. Adjusted all references in the testing module to ensure consistency and clarity in dtype handling across various dataset strategies.
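A minimal sketch of what such a strategy could look like; the actual implementation in this PR may differ:

```python
import numpy as np
from hypothesis import strategies as st

def floating_dtypes():
    # Only little-endian float32/float64, the dtypes cuml estimators accept.
    return st.sampled_from([np.dtype("<f4"), np.dtype("<f8")])
```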
And replace with hypothesis strategies where appropriate.
Runtime Analysis

Test Results
Analysis
Addressing the sklearn test failure in #6610.
/merge
This PR enhances developer documentation and testing infrastructure with a focus on linear model tests.

## Documentation Changes

- Added comprehensive testing best practices guide covering:
  - Test organization principles
  - Accuracy testing methodology
  - Memory usage considerations
  - Effective use of fixtures and parametrization
  - Guidelines for hypothesis-based testing
- Added step-by-step instructions for running tests from python/cuml/
- Documented common pytest commands and options
- Added clear explanation of test parameter levels (unit/quality/stress)

## Infrastructure Improvements

- Improved test parameterization consistency in test_linear_model.py
- Added a new cuml-specific floating dtypes hypothesis strategy to make it easier to parametrize tests on dtypes that should be supported by cuml's estimators
- Renamed dataset compatibility functions for clarity:
  - `sklearn_compatible_dataset` → `is_sklearn_compatible_dataset`
  - `cuml_compatible_dataset` → `is_cuml_compatible_dataset`

## Follow-ups

- The replacement of cuml-sklearn result comparisons with hard-coded values in test_linear_model.py is a more significant change that warrants its own dedicated PR.
- Consider removal of scale parameterization (`unit_param`, `quality_param`, `stress_param`)
- Split the linear model test module into one module per estimator

## Related issues

- Contributes to rapidsai#6469
- Closes rapidsai#6592

Authors:

- Simon Adorf (https://github.com/csadorf)

Approvers:

- Jim Crist-Harif (https://github.com/jcrist)

URL: rapidsai#6580