Documentation and Testing Infrastructure Updates #6580
Conversation
It should be noted that while the linear models test module is now more consistently using …
> We use [pytest](https://docs.pytest.org/en/latest/) for writing and running tests. To see existing examples, refer to any of the `test_*.py` files in the folder `cuml/tests`.
> ### Test Organization
>
> - Keep all tests for a single estimator in one file, with exceptions for:
A long-standing pet peeve of mine: `test_linear_model` actually contains tests for multiple estimators. Perhaps we could split it as part of this PR?
Yes, I think splitting them up in this PR makes sense.
Revising my previous statement: to retain clarity on the changes here, we should do that in a follow-up.
> ### Test Parameter Levels
>
> You can mark test parameters for different scales with `unit_param`, `quality_param`, and `stress_param`.
This might not be a bad time to rethink the names of these three, or at least define them better. What is the difference between unit and quality? Perhaps we should use this opportunity to distinguish tests that we only want to run in nightly runs?
Not only are they somewhat poorly defined, I'm not fully convinced that we need them at all.
I think I agree with Simon. At least I am a bit puzzled about what this is for. Tests are for testing correctness and benchmarks are for benchmarking. And pytest is not a great tool for writing benchmarks; something like asv is probably more useful for that.
I'd recommend that we leave this as-is for now so that we can make progress with this PR. I'll propose to remove those scale qualifiers unless we can formulate convincing reasons to keep them.
+1 on removing these fully. Tests should be checking code behavior, and unless logical behavior changes with scale (as noted above and in the linked issue), what we're doing here is really performance testing, which would be better handled by other tools and specific performance tests. I'm very against hiding performance tests in with other tests.

Agreed this can be done later, but maybe we want to add a note here discouraging adding more of these (or remove documenting them here, or refer to them as legacy code somehow)?
I'll just remove them for this test module.
> ```python
> unit_param(2)  # For number of components
> ```
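For context, a minimal sketch of how these scale markers are conventionally combined with `pytest.mark.parametrize`; the import path and the concrete values here are assumptions for illustration, not taken from this PR:

```python
import pytest

# Assumed import path for cuml's test helpers.
from cuml.testing.utils import unit_param, quality_param, stress_param

# Each helper wraps its value in a pytest.param carrying a scale-specific
# mark, so a test run can select only unit-, quality-, or stress-scale inputs.
@pytest.mark.parametrize(
    "nrows", [unit_param(500), quality_param(5000), stress_param(500000)]
)
def test_fit_predict(nrows):
    ...  # build a dataset with nrows rows and exercise the estimator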
> 2. **Quality Tests** (`quality_param`): Medium values for thorough testing
Referring to my last comment, "medium" is not well defined here, nor is the difference between basic and thorough.
See my comment above; maybe we can take a step back and think about the motivation for this, which might help us determine whether we need this at all and, if yes, how exactly we want to define it.
> ```python
> @given(train_dtype=dataset_dtypes(), test_dtype=dataset_dtypes())
> @example(train_dtype=np.float32, test_dtype=np.float32)
> @example(train_dtype=np.float32, test_dtype=np.float64)
> @example(train_dtype=np.float64, test_dtype=np.float32)
> @example(train_dtype=np.float64, test_dtype=np.float64)
> ```
As a naive hypothesis user (aka not a lot of experience) I read this and thought: "Why do we still need the `given`? Don't the examples cover all the possible combinations?"

Is this just a different way of writing what the `pytest.mark.parametrize` was doing? Why change?
Yes, in this particular case they might actually cover all combinations. However, we might refine/expand our `dataset_dtypes()` definition to include more types in the future (and at least until recently they included dtypes with varying endianness). I'm OK with leaving this as-is even if it is currently a bit redundant.
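To illustrate the question above, a hypothetical `pytest.mark.parametrize` version of the same snippet would pin exactly these four combinations forever, whereas the `@given` strategy also covers whatever `dataset_dtypes()` may include later (test name here is illustrative):

```python
import numpy as np
import pytest

# Hypothetical equivalent of the four @example decorators above: the
# combinations are fixed, and new dtypes added to dataset_dtypes()
# would not be picked up automatically.
@pytest.mark.parametrize("train_dtype", [np.float32, np.float64])
@pytest.mark.parametrize("test_dtype", [np.float32, np.float64])
def test_dtype_combinations(train_dtype, test_dtype):
    ...
```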
> - Performance testing/benchmarking
> - Generic estimator checks (e.g., `test_base.py`)
> - Use small, focused datasets for correctness testing
> - Only parametrize scale when it triggers alternate code paths
```diff
- - Only parametrize scale when it triggers alternate code paths
+ - Only parametrize dataset size when it triggers alternate code paths
```
I find this easier to understand if "scale" does refer to dataset size; if not, I'm lost as to what "scale" means here.
> - Must include at least one `@example` for deterministic testing
> - Preferred for dataset generation
>
> ```python
> @example(dataset=small_regression_dataset(np.float32))
> ```
Why have the explicit example? Is it to combine "sample some random values for me" and "check this explicit value because I as a human think it is important" in one test?

In my head I assume that if you run the "normal" tests you get good coverage and check all the things that should be checked. The hypothesis tests are "bonus" tests that we run to find weird edge cases or combinations that we didn't think of, or to sample combinations that are too numerous to try exhaustively. But I have no idea if others think of hypothesis like this or not.
We do not run hypothesis tests with strategies during PR runs, only explicit examples. By requiring explicit examples we ensure that we always run tests on a deterministic input set and thus detect issues with the test implementation early.
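A sketch of that convention, assuming `regression_datasets` and `small_regression_dataset` are helpers from cuml's testing strategies (the import path and test name are assumptions):

```python
import numpy as np
from hypothesis import example, given

# Assumed import path for cuml's dataset strategies/helpers.
from cuml.testing.strategies import regression_datasets, small_regression_dataset

# The @example input runs deterministically on every PR; the @given
# strategy only explores additional generated datasets in nightly runs.
@given(dataset=regression_datasets())
@example(dataset=small_regression_dataset(np.float32))
def test_linear_regression_fit(dataset):
    ...
```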
I'm not sure if exhaustively testing all combinations of hyper-parameters is worth it, compared to sampling (enough) combinations. This is based on the idea that as the human author of tests I can reason that some parameter combinations are more useful to test than others (e.g. there is usually no need to repeat all parameter combinations with logging on and off; often a handful of combinations are enough to see that logging works and outputs sensible things). At least I'd hope that we can get enough coverage by writing explicit tests of "sensible" hyper-parameter combinations based on knowing which parameters interact and/or reusing things that have already been tested elsewhere (e.g. when using …).

One thing that I'd find helpful regarding the organisation of the tests is if we mirrored the directory structure of …
Agreed, the exhaustive testing of all parameter combinations might be overkill, and at least right now this PR increases the total test runtime. On the other hand, it's a bit difficult to decide whether we want to explore certain parameters exhaustively or stochastically. To keep things simple, I decided to recommend and implement the former everywhere.
Yes, we can improve test organization, but that should be a follow-up issue.
jcrist left a comment
I'd agree with Tim that we probably don't want to test every combination. Some estimators have many parameters, and doing a full Cartesian product will result in a large number of possibly redundant tests.
I do recognize I (and probably we) are at a bit of a disadvantage making changes to tests for code we did not write. When writing code it's easy to see what parameter combos are relevant for checking edge cases and any parameter interactions that might occur. I can make some inferences from documentation and a cursory reading of the code, but without deeper study of an estimator's code I can't be certain there isn't some interaction that has been missed.
That said, perhaps that's fine. We should trust our tests to be useful and sufficient for checking for failure modes when making changes. Right now we can't trust them because we're not convinced the tests are well written and provide adequate coverage.
Perhaps the best way forward is to make a best attempt at cleaning things up in a way that doesn't result in a slower-running test suite. This would require some code and docs reading to determine what cases are meaningful, but would likely result in a test suite at least as good as the one we currently have. Will we possibly miss meaningful cases? Sure. But it also would likely result in adding meaningful cases, and lay a better groundwork that we can iterate on later.
My worry is that if we decide to do a full parameter sweep in tests "just to be safe" that we'll end up with a test suite that is significantly slower than it already is, leading to a slower dev cycle (harder to run tests to check locally, slower CI, ...), without providing a known and meaningful increase in coverage. And given we may be pulled off onto other more important things, saying we'll reduce the test parametrization later may result in it never actually happening.
Based on the feedback on parameterization so far, I propose that we adopt language stating that we generally want to test all combinations stochastically (i.e., generally use hypothesis to explore the full parameter space) and only use parameterization whenever we explicitly want to test certain combinations exhaustively. If there are problematic edge cases, then we should discover them eventually through the nightly tests and can then proactively add them to our examples or opt for exhaustive parameterization.
This PR applies a few changes (see the commit messages for details) to speed up the dask `LogisticRegression` tests. Most of the changes fall into one of a few categories:

- Removing useless parametrization (either unnecessary for testing the specific feature targeted by the test, or actually ignored and just doubling the number of tests run)
- Reducing the scale tested by a bit
- Coupling certain parameter combinations to reduce the number of tests without reducing coverage
- Using a faster solver for the CPU versions

All together this reduces the time taken from 28 minutes to 7 minutes on my machine, a 4x speedup. For (I assume) historical reasons, most of the dask test suite doesn't run in PRs since it's gated behind `quality_param`/`stress_param` annotations. This file is one of the exceptions, and thus takes ~1/2 the time used for a single PR test run. Rather than add those annotations here (I'm mostly against them and hope we can remove them, as discussed in #6580), I've opted to make the tests here more targeted and faster without skipping certain tests in PRs.

Authors:

- Jim Crist-Harif (https://github.com/jcrist)

Approvers:

- Tim Head (https://github.com/betatim)

URL: #6607
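To illustrate the "coupling" point above, a hedged sketch (estimator, parameter names, and value pairs are hypothetical, not taken from the PR): enumerating only the meaningful pairs in a single parametrize avoids the full cross-product.

```python
import pytest

# One parametrize over tuples runs 3 coupled cases instead of the
# 2x2 = 4 cases that two stacked parametrize decorators would produce.
@pytest.mark.parametrize(
    "solver,penalty",
    [
        ("qn", "l1"),
        ("qn", "l2"),
        ("lbfgs", "l2"),  # l1 with lbfgs omitted: assumed unsupported
    ],
)
def test_logistic_regression(solver, penalty):
    ...
```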
Added sections on test organization, accuracy testing, memory usage considerations, and best practices for writing tests. Included detailed recommendations for using fixtures, parametrization, and hypothesis for test input generation. This update aims to improve the clarity and effectiveness of testing strategies within the codebase.
Included a section detailing how to run tests from the `python/cuml/` directory, covering common pytest commands and options.
Introduced a new section outlining three levels of test parameters: unit, quality, and stress tests.
The test should be removed with 25.08, not 24.08.
Updated the function names for checking dataset compatibility with scikit-learn and cuML from `sklearn_compatible_dataset` and `cuml_compatible_dataset` to `is_sklearn_compatible_dataset` and `is_cuml_compatible_dataset`, respectively. Adjusted all references in the testing module to reflect these changes, enhancing code readability and consistency.
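For illustration only, a hypothetical use of one renamed predicate inside a hypothesis test; the import paths and the signature shown are assumptions:

```python
from hypothesis import assume, given

# Assumed import paths and signature, shown only to illustrate the rename.
from cuml.testing.strategies import regression_datasets
from cuml.testing.utils import is_sklearn_compatible_dataset

@given(dataset=regression_datasets())
def test_matches_sklearn(dataset):
    # Skip generated inputs that scikit-learn cannot handle.
    assume(is_sklearn_compatible_dataset(dataset))
    ...
```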
Updated the `floating_dtypes` strategy to use a new implementation that generates only little-endian float32 and float64 dtypes supported by cuML. Adjusted all references in the testing module to ensure consistency and clarity in dtype handling across various dataset strategies.
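A minimal sketch of what such a strategy could look like; the actual implementation in this PR may differ:

```python
import numpy as np
from hypothesis import strategies as st

def floating_dtypes():
    # Only little-endian float32/float64, the dtypes cuml estimators accept.
    return st.sampled_from([np.dtype("<f4"), np.dtype("<f8")])
```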
And replace with hypothesis strategies where appropriate.
Runtime Analysis

Test Results
Analysis
Addressing the sklearn test failure in #6610.
/merge
This PR enhances developer documentation and testing infrastructure with a focus on linear model tests.

## Documentation Changes

- Added comprehensive testing best practices guide covering:
  - Test organization principles
  - Accuracy testing methodology
  - Memory usage considerations
  - Effective use of fixtures and parametrization
  - Guidelines for hypothesis-based testing
- Added step-by-step instructions for running tests from python/cuml/
- Documented common pytest commands and options
- Added clear explanation of test parameter levels (unit/quality/stress)

## Infrastructure Improvements

- Improved test parameterization consistency in test_linear_model.py
- Added a new cuml-specific floating dtypes hypothesis strategy to make it easier to parametrize tests on dtypes that should be supported by cuml's estimators
- Renamed dataset compatibility functions for clarity:
  - `sklearn_compatible_dataset` → `is_sklearn_compatible_dataset`
  - `cuml_compatible_dataset` → `is_cuml_compatible_dataset`

## Follow-ups

- The replacement of cuml-sklearn result comparisons with hard-coded values in test_linear_model.py is a more significant change that warrants its own dedicated PR.
- Consider removal of scale parameterization (`unit_param`, `quality_param`, `stress_param`)
- Split the linear model test module into one module per estimator

## Related issues

- Contributes to rapidsai#6469
- Closes rapidsai#6592

Authors:

- Simon Adorf (https://github.com/csadorf)

Approvers:

- Jim Crist-Harif (https://github.com/jcrist)

URL: rapidsai#6580