
Documentation and Testing Infrastructure Updates #6580

Merged
rapids-bot[bot] merged 18 commits into rapidsai:branch-25.06 from csadorf:tests/improve-test-docs-and-linear-model-tests
May 7, 2025
Conversation

@csadorf
Contributor

@csadorf csadorf commented Apr 23, 2025

This PR enhances developer documentation and testing infrastructure with a focus on linear model tests.

Documentation Changes

  • Added comprehensive testing best practices guide covering:
    • Test organization principles
    • Accuracy testing methodology
    • Memory usage considerations
    • Effective use of fixtures and parametrization
    • Guidelines for hypothesis-based testing
  • Added step-by-step instructions for running tests from python/cuml/
  • Documented common pytest commands and options
  • Added clear explanation of test parameter levels (unit/quality/stress)

Infrastructure Improvements

  • Improved test parameterization consistency in test_linear_model.py
  • Added a new cuml-specific floating dtypes hypothesis strategy to make it easier to parametrize tests on dtypes that should be supported by cuml's estimators
  • Renamed dataset compatibility functions for clarity:
    • `sklearn_compatible_dataset` → `is_sklearn_compatible_dataset`
    • `cuml_compatible_dataset` → `is_cuml_compatible_dataset`
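The renamed functions follow the common `is_*` convention for boolean predicates. As a rough illustration of what such compatibility checks might look like (the real implementations live in cuml's test utilities and check more than this; the specific conditions below are assumptions, not cuml's code):

```python
import numpy as np

def is_sklearn_compatible_dataset(X, y):
    # Illustrative check: scikit-learn generally expects 2D features
    # with one target value per sample.
    return X.ndim == 2 and X.shape[0] == y.shape[0]

def is_cuml_compatible_dataset(X, y):
    # Illustrative check: cuML estimators generally expect
    # little-endian float32/float64 data.
    return (
        is_sklearn_compatible_dataset(X, y)
        and X.dtype in (np.float32, np.float64)
        and X.dtype.byteorder in ("=", "<", "|")
    )

X = np.zeros((4, 2), dtype=np.float32)
y = np.zeros(4)
print(is_cuml_compatible_dataset(X, y))                  # True
print(is_cuml_compatible_dataset(X.astype(np.int32), y)) # False
```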

Follow-ups

  • The replacement of cuml-sklearn result comparisons with hard-coded values in test_linear_model.py is a more significant change that warrants its own dedicated PR.
  • Consider removal of scale parameterization (unit_param, quality_param, stress_param)
  • Split linear module test module for each estimator
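For readers unfamiliar with the scale helpers flagged for removal above, a dependency-free sketch of the idea follows. The real cuml helpers wrap values in `pytest.param` with custom marks selected via a pytest CLI option; this stand-in just filters values by the active scale and is an assumption, not cuml's implementation:

```python
TEST_SCALE = "unit"  # in cuml this would come from a pytest CLI option

def _scaled(value, level):
    # Keep the value only when its level matches the active test scale;
    # None means "not run at this scale".
    return value if level == TEST_SCALE else None

def unit_param(value):
    return _scaled(value, "unit")

def quality_param(value):
    return _scaled(value, "quality")

def stress_param(value):
    return _scaled(value, "stress")

# A parametrization list collapses to the values active at this scale:
n_rows = [
    p
    for p in (unit_param(100), quality_param(5_000), stress_param(500_000))
    if p is not None
]
print(n_rows)  # [100]
```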

Related issues

@csadorf csadorf requested a review from a team as a code owner April 23, 2025 20:30
@csadorf csadorf requested review from betatim and divyegala April 23, 2025 20:30
@github-actions github-actions Bot added the Cython / Python label Apr 23, 2025
@csadorf csadorf added the tests, improvement, and non-breaking labels Apr 23, 2025
@csadorf csadorf self-assigned this Apr 23, 2025
@csadorf
Contributor Author

csadorf commented Apr 23, 2025

It should be noted that while the linear models test module is now more consistently using pytest.mark.parametrize for most estimator hyperparameters and hypothesis primarily for dataset inputs, this leads to an increase in the total number of tests and thus total runtime since it guarantees that we run all combinations of hyperparameters, whereas previously we would run most combinations stochastically.

@csadorf csadorf requested review from dantegd and jcrist April 23, 2025 20:38
@csadorf
Contributor Author

csadorf commented Apr 23, 2025

Requesting explicit review from @dantegd and @jcrist who previously expressed interest in this topic.

We use [pytest](https://docs.pytest.org/en/latest/) for writing and running tests. To see existing examples, refer to any of the `test_*.py` files in the folder `cuml/tests`.

### Test Organization
- Keep all tests for a single estimator in one file, with exceptions for:
Member

A long standing pet peeve of mine, test_linear_model actually contains tests for multiple estimators, perhaps we could split it as part of this PR?

Contributor Author

Yes, I think splitting them up in this PR makes sense.

Contributor Author

Revising my previous statement: to retain clarity on the changes here, we should do that in a follow-up.


### Test Parameter Levels

You can mark test parameters for different scales with (`unit_param`, `quality_param`, and `stress_param`).
Member

This might not be a bad time to rethink the names of these three, or at least define them better. What is the difference between unit and quality? Perhaps we should use this opportunity to distinguish between tests that we only want to run in nightly runs?

Contributor Author

Not only are they somewhat poorly defined, I'm not fully convinced that we need them at all.

Member

I think I agree with Simon. At least I am a bit puzzled what this is for/about. Tests are for testing correctness and benchmarks are for benchmarking. And pytest is not a great tool for writing benchmarks; something like asv is probably more useful for that.

Contributor Author

I'd recommend that we leave this as-is for now so that we can make progress with this PR. I'll propose to remove those scale qualifiers unless we can formulate convincing reasons to keep them.

Member

+1 on removing these fully. Tests should be checking code behavior, and unless logical behavior changes with scale (as noted above and in the linked issue), then what we're doing here is more performance testing which would be better handled by other tools & specific performance tests. I'm very against hiding performance tests in with other tests.

Agree this can be done later, but maybe we want to add a note here dissuading adding more of these (or remove documenting them here/refer to them as legacy code somehow)?

Contributor Author

I'll just remove them for this test module.

```python
unit_param(2)  # For number of components
```

2. **Quality Tests** (`quality_param`): Medium values for thorough testing
Member

Referring to my last comment, "medium" is not well defined here, nor is the difference between basic and thorough.

Contributor Author

See my comment above: maybe we can take a step back and think about the motivation for this; that might help us determine whether we need this at all, and if yes, how exactly we want to define it.

Comment on lines +820 to +876
```python
@given(train_dtype=dataset_dtypes(), test_dtype=dataset_dtypes())
@example(train_dtype=np.float32, test_dtype=np.float32)
@example(train_dtype=np.float32, test_dtype=np.float64)
@example(train_dtype=np.float64, test_dtype=np.float32)
@example(train_dtype=np.float64, test_dtype=np.float64)
```
Member

As a naive hypothesis user (aka not a lot of experience) I read this and thought "why do we still need the given? don't the examples cover all the possible combinations?"

Is this just a different way of writing what the pytest.mark.parametrize was doing? Why change?

Contributor Author

Yes, in this particular case they might actually cover all combinations. However, we might refine/expand our `dataset_dtypes()` definition to include more types in the future (and at least until recently it included dtypes with varying endianness). I'm ok with leaving this as is even if it is currently a bit redundant.

Comment thread on wiki/python/DEVELOPER_GUIDE.md (outdated)
- Performance testing/benchmarking
- Generic estimator checks (e.g., `test_base.py`)
- Use small, focused datasets for correctness testing
- Only parametrize scale when it triggers alternate code paths
Member

Suggested change:
```diff
- - Only parametrize scale when it triggers alternate code paths
+ - Only parametrize dataset size when it triggers alternate code paths
```

I find this easier to understand, if "scale" did refer to dataset size, if not that I'm lost as to what "scale" means here

Comment thread on wiki/python/DEVELOPER_GUIDE.md (outdated)
- Must include at least one `@example` for deterministic testing
- Preferred for dataset generation
```python
@example(dataset=small_regression_dataset(np.float32))
```
Member

Why have the explicit example? Is it to combine "sample some random values for me" and "check this explicit value because I as a human think it is important" in one test?

In my head I assume that if you run the "normal" tests you get good coverage and check all the things that should be checked. The hypothesis tests are "bonus" tests that we run to find weird edge cases or combinations that we didn't think of or sampling combinations that are too numerous to exhaustively try. But I have no idea if others think of hypothesis like this or not.

Contributor Author

We do not run hypothesis tests with strategies during PR runs, only explicit examples. By requiring explicit examples we ensure that we always run tests on a deterministic input set and thus detect issues with the test implementation early.

@betatim
Member

betatim commented Apr 29, 2025

I'm not sure if exhaustively testing all combinations of hyper-parameters is worth it, compared to sampling (enough) combinations. This is based on the idea that as human author of tests I can reason that some parameter combinations are more useful to test than others (e.g. there is usually no need to repeat all parameter combinations with logging on and off - often a handful of combinations are enough to see that logging works and outputs sensible things).

At least I'd hope that we can get enough coverage by writing explicit tests of "sensible" hyper-parameter combinations based on knowing which parameters interact and/or reusing things that have already been tested elsewhere (e.g. when using LabelEncoder in an estimator we can assume that LabelEncoder (for all its parameter values) is correct or that the LabelEncoder tests will find the bug in it).


One thing that I'd find helpful regarding the organisation of the tests is if we mirrored the directory structure of python/cuml/cuml - so the tests for cuml.datasets would be in tests/datasets/ and cuml.metrics tests would be in tests/metrics etc. What do people think of that? (Not sure we want to do this in this PR but seems like a good venue to discuss it)

@csadorf
Contributor Author

csadorf commented Apr 29, 2025

> I'm not sure if exhaustively testing all combinations of hyper-parameters is worth it, compared to sampling (enough) combinations. This is based on the idea that as human author of tests I can reason that some parameter combinations are more useful to test than others (e.g. there is usually no need to repeat all parameter combinations with logging on and off - often a handful of combinations are enough to see that logging works and outputs sensible things).
>
> At least I'd hope that we can get enough coverage by writing explicit tests of "sensible" hyper-parameter combinations based on knowing which parameters interact and/or reusing things that have already been tested elsewhere (e.g. when using LabelEncoder in an estimator we can assume that LabelEncoder (for all its parameter values) is correct or that the LabelEncoder tests will find the bug in it).

Agreed, exhaustive testing of all parameter combinations might be overkill, and at least right now this PR increases the total test runtime. On the other hand, it's a bit difficult to decide whether we want to explore certain parameters exhaustively or stochastically. To keep things simple, I decided to recommend and implement the former everywhere.

> One thing that I'd find helpful regarding the organisation of the tests is if we mirrored the directory structure of python/cuml/cuml - so the tests for cuml.datasets would be in tests/datasets/ and cuml.metrics tests would be in tests/metrics etc. What do people think of that? (Not sure we want to do this in this PR but seems like a good venue to discuss it)

Yes, we can improve test organization, but that should be a follow-up issue.

Member

@jcrist jcrist left a comment

I'd agree with Tim that we probably don't want to test every combination. Some estimators have many parameters, and doing a full cartesian product will result in a large number of possibly redundant tests.

I do recognize I (and probably we) are at a bit of a disadvantage making changes to tests for code we did not write. When writing code it's easy to see what parameter combos are relevant for checking edge cases and any parameter interactions that might occur. I can make some inferences from documentation and a cursory reading of the code, but without deeper study of an estimator's code I can't be certain there isn't some interaction that has been missed.

That said, perhaps that's fine. We should trust our tests to be useful and sufficient for checking for failure modes when making changes. Right now we can't trust them because we're not convinced the tests are well written and provide adequate coverage.

Perhaps the best method forward is to make a best attempt at cleaning things up in a way that doesn't result in a slower running test suite. This would require some code and docs reading to determine what cases are meaningful, but would likely result in at least as high-quality a test suite as we currently have. Will we possibly miss meaningful cases? Sure. But it also would likely result in adding meaningful cases, and lay a better groundwork that we can iterate on later.

My worry is that if we decide to do a full parameter sweep in tests "just to be safe" that we'll end up with a test suite that is significantly slower than it already is, leading to a slower dev cycle (harder to run tests to check locally, slower CI, ...), without providing a known and meaningful increase in coverage. And given we may be pulled off onto other more important things, saying we'll reduce the test parametrization later may result in it never actually happening.



@csadorf
Contributor Author

csadorf commented Apr 29, 2025

Based on the feedback on parameterization so far, I propose that we adopt language that states that we want to generally test all combinations stochastically (i.e., generally use hypothesis to explore the full parameter space) and only use parameterization whenever we explicitly want to test certain combinations exhaustively.

If there are problematic edge cases then we should discover them eventually through the nightly tests and can then pro-actively add them to our examples or opt for exhaustive parameterization.
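To make the trade-off concrete, here is a small illustration (a hypothetical hyperparameter grid, not cuml's actual parameters) of how quickly a full cartesian product grows compared to a fixed sampling budget:

```python
import itertools
import random

# Hypothetical hyperparameter grid for a linear-model test.
grid = {
    "fit_intercept": [True, False],
    "solver": ["eig", "svd", "cd"],
    "alpha": [0.0, 0.1, 1.0, 10.0],
    "dtype": ["float32", "float64"],
}

# Exhaustive: what pytest.mark.parametrize over every axis would run.
exhaustive = list(itertools.product(*grid.values()))
print(len(exhaustive))  # 2 * 3 * 4 * 2 = 48 test cases

# Stochastic: a fixed budget of combinations, roughly how a hypothesis
# strategy would sample the space (seeded here for determinism).
rng = random.Random(42)
sampled = rng.sample(exhaustive, k=10)
print(len(sampled))  # 10 test cases regardless of grid size
```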

rapids-bot Bot pushed a commit that referenced this pull request Apr 30, 2025
This PR applies a few changes (see the commit messages for details) to speedup the dask `LogisticRegression` tests. Most of the changes fall into one of a few categories:

- Removing useless parametrization (either unnecessary for testing the specific feature targeted by the test, or actually ignored and was just doubling the number of tests run)
- Reducing the scale tested by a bit
- Coupling certain parameter combinations to reduce the number of tests without reducing coverage
- Using a faster solver for the CPU versions

All together this reduces the time taken from 28 minutes to 7 minutes on my machine, a 4x speedup.

For what I assume are historical reasons, most of the dask test suite doesn't run in PRs since it's gated behind `quality_param`/`stress_param` annotations. This file is one of the exceptions, and thus takes ~1/2 the time used for a single PR test run. Rather than add those annotations here (I'm mostly against them and hope we can remove them, as discussed in #6580), I've opted to make the tests here more targeted and faster without skipping certain tests in PRs.

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Tim Head (https://github.com/betatim)

URL: #6607
csadorf added 8 commits April 30, 2025 11:46
Added sections on test organization, accuracy testing, memory usage considerations, and best practices for writing tests. Included detailed recommendations for using fixtures, parametrization, and hypothesis for test input generation. This update aims to improve the clarity and effectiveness of testing strategies within the codebase.
Included a section detailing how to run tests from the python/cuml/ directory, including common pytest commands and options.
Introduced a new section outlining three levels of test parameters: unit, quality, and stress tests.
The test should be removed with 25.08, not 24.08.
Updated the function names for checking dataset compatibility with
scikit-learn and cuML from `sklearn_compatible_dataset` and
`cuml_compatible_dataset` to `is_sklearn_compatible_dataset` and
`is_cuml_compatible_dataset`, respectively. Adjusted all references in
the testing module to reflect these changes, enhancing code readability
and consistency.
Updated the `floating_dtypes` strategy to use a new implementation that
generates only little-endian float32 and float64 dtypes supported by
cuML. Adjusted all references in the testing module to ensure
consistency and clarity in dtype handling across various dataset
strategies.
@csadorf csadorf force-pushed the tests/improve-test-docs-and-linear-model-tests branch from 20c3c38 to ab5a012 Compare April 30, 2025 17:39
@csadorf csadorf force-pushed the tests/improve-test-docs-and-linear-model-tests branch from de79ab3 to 8f54cfb Compare April 30, 2025 21:52
@csadorf
Contributor Author

csadorf commented Apr 30, 2025

Runtime Analysis

Test Results

| Parametrization | Hypothesis | Time (seconds) |
| --- | --- | --- |
| hypothesis (a82178c) | enabled | 46.651 |
| hypothesis (a82178c) | disabled | 14.154 |
| pytest (147a49e) | enabled | 129.392 |
| pytest (147a49e) | disabled | 26.589 |
  • The "hypothesis" parametrization approach uses Hypothesis for both hyperparameter testing and dataset generation, providing stochastic test coverage through random parameter combinations
  • The "pytest" approach uses pytest.mark.parametrize for hyperparameter testing while still using Hypothesis for dataset generation, resulting in more deterministic testing, but also vastly more test cases

Analysis

  1. Hypothesis Impact:

    • When Hypothesis is enabled, tests run significantly slower (46.651s vs 14.154s)
    • This is expected as Hypothesis performs more thorough testing by generating random parameter combinations
  2. Implementation Comparison:

    • The current implementation (a82178c) shows better performance than the pytest-based approach (147a49e) in both configurations
    • With Hypothesis disabled: 14.154s vs 26.589s (47% faster)
    • With Hypothesis enabled: 46.651s vs 129.392s (64% faster)
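The quoted percentages follow directly from the table values:

```python
def pct_faster(new_time, old_time):
    # Relative runtime reduction of the faster approach, in percent.
    return round((1 - new_time / old_time) * 100)

print(pct_faster(14.154, 26.589))   # 47 (hypothesis vs pytest, disabled)
print(pct_faster(46.651, 129.392))  # 64 (hypothesis vs pytest, enabled)
```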

@csadorf
Contributor Author

csadorf commented Apr 30, 2025

@dantegd @jcrist @betatim This is ready for another round of review. I'd highly prefer that any additional non-critical changes go into a follow-up PR.

@csadorf
Contributor Author

csadorf commented May 1, 2025

Addressing the sklearn test failure in #6610 .

@csadorf csadorf requested review from betatim, dantegd and jcrist May 5, 2025 15:55
Member

@jcrist jcrist left a comment

:shipit:

@csadorf
Contributor Author

csadorf commented May 7, 2025

/merge

@rapids-bot rapids-bot Bot merged commit 1ad9af9 into rapidsai:branch-25.06 May 7, 2025
78 checks passed
@csadorf csadorf deleted the tests/improve-test-docs-and-linear-model-tests branch May 7, 2025 18:57
Ofek-Haim pushed a commit to Ofek-Haim/cuml that referenced this pull request May 13, 2025
Ofek-Haim pushed a commit to Ofek-Haim/cuml that referenced this pull request May 13, 2025
Development

Successfully merging this pull request may close these issues.

Improve Documentation and Linear Model Testing