
[Tracker] Test Infrastructure Improvements #6469

@csadorf


Context and Motivation

The cuML test infrastructure has evolved organically as the library has grown, leading to several challenges including test flakiness, coverage gaps, and inconsistent fixture usage. A systematic overhaul of the test infrastructure will help address these issues by improving reliability, clarifying test organization, optimizing execution time, and reducing maintenance burden - ultimately providing a better development experience and more robust codebase.

The plan outlined here will likely need to be implemented incrementally. This issue serves to track the overarching aims and progress.

Overall Aims

  1. Reduce test flakiness
  2. Increase test coverage
  3. Improve test maintainability and organization
  4. Enhance test performance and execution time

General Approach

  1. Better delineate between functional tests that check the plumbing and correctness tests that check whether our algorithmic implementation is correct.
  2. For correctness checks, thresholds should ideally be rooted in an analytical expectation based on the conditioning of the problem; realistically, be more aggressive about tightening thresholds to levels that would clearly indicate a regression, rather than setting them barely below the usually observed value.
  3. Make correctness checks in particular as deterministic as possible, while keeping in mind that full determinism is often unattainable due to the nature of parallel computation on GPU devices.
  4. Use per-test retries where appropriate, i.e., for correctness tests where a single failure among many successes does not immediately indicate a regression.
  5. Use automated retries for all components that rely on external resources, e.g., dataset downloads.
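As a sketch of point 5, automated retries for external resources can be factored into a small shared helper. `call_with_retries` and its parameters are illustrative names, not existing cuML utilities:

```python
import time


def call_with_retries(func, attempts=3, delay=0.0,
                      exceptions=(OSError, TimeoutError)):
    """Call `func`, retrying on transient errors such as network failures.

    Hypothetical helper; the defaults for `attempts` and `delay` are
    illustrative, not tuned values.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of retries: surface the real failure
            time.sleep(delay * attempt)  # back off before the next try


# Usage sketch: wrap a dataset download so a transient failure does not
# fail the whole test session (download_dataset is hypothetical).
# data = call_with_retries(lambda: download_dataset("covtype"), attempts=3)
```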

Specific Approach

1. Test Organization and Structure

Test Categories

  • Consider all tests as functional by default (API, input validation, basic behavior)
  • Explicitly mark or identify:
    • Correctness tests (algorithmic implementation verification)
    • Integration tests (end-to-end workflows)
  • Keep performance tests separate in dedicated benchmarking suite

Organization Alternatives

  1. Test Type Identification Options
    a. Via pytest markers:

    @pytest.mark.correctness
    @pytest.mark.integration
    • Pros: Flexible filtering, no file reorganization needed
    • Cons: Requires discipline in marker usage

    b. Via filename conventions:

    test_kmeans.py           # functional tests (default)
    test_kmeans_correct.py   # correctness tests
    test_kmeans_integ.py     # integration tests
    
    • Pros: Clear visual separation
    • Cons: Could lead to code duplication

    c. Via test name conventions:

    def test_kmeans_fit()          # functional test
    def test_correct_kmeans_conv() # correctness test
    def test_integ_kmeans_pipe()   # integration test
    • Pros: Easy to implement
    • Cons: Less structured than other options
  2. Directory-Based Structure

    tests/
      functional/    # default location
      correctness/
      integration/
    
    • Pros: Clear separation of concerns
    • Cons: May require significant reorganization
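If option (a) is chosen, the markers should be registered so that filtering works without unknown-marker warnings. A minimal conftest.py sketch (the marker descriptions are illustrative):

```python
# conftest.py -- register the proposed markers so that selections like
# `pytest -m correctness` or `pytest -m "not integration"` work without
# unknown-marker warnings.


def pytest_configure(config):
    config.addinivalue_line(
        "markers", "correctness: verifies the algorithmic implementation"
    )
    config.addinivalue_line(
        "markers", "integration: exercises end-to-end workflows"
    )
```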

Recommendations for Review

  1. Evaluate current test organization pain points
  2. Consider implementing pytest markers as a non-invasive first step
  3. Review potential benefits of filename conventions
  4. Assess need for directory restructuring based on maintenance experience

2. Test Quality Improvements

  • For functional tests:

    • Ensure comprehensive input validation (use hypothesis strategies where possible)
    • Test edge cases and error conditions
    • Verify API contract compliance
    • Avoid thresholded checks or use very conservative thresholds
  • For correctness checks:

    • Root thresholds in analytical expectations where possible
    • Set aggressive thresholds that clearly indicate regressions
    • Document the rationale behind threshold choices and expected variance
    • Implement tests as deterministically as possible
    • Use automatic retries where appropriate*

*) If we opt to use a "correctness" marker, retry logic could be implemented as part of the marker.
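One way the footnote above could be realized, assuming the pytest-rerunfailures plugin is installed (the rerun count is illustrative):

```python
# conftest.py -- attach automatic reruns only to tests carrying the
# "correctness" marker. Assumes pytest-rerunfailures is installed so
# that pytest.mark.flaky(reruns=...) is honored at run time.
import pytest


def pytest_collection_modifyitems(items):
    for item in items:
        if item.get_closest_marker("correctness") is not None:
            # Retry correctness tests up to twice before reporting failure.
            item.add_marker(pytest.mark.flaky(reruns=2))
```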

3. Test Infrastructure Enhancements

  • More consistent use of common fixtures (deduplicate existing ones)
  • More consistent use of hypothesis strategies for input validation
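As a sketch of what a shared hypothesis strategy could look like (assuming the hypothesis package is available; names and bounds are illustrative):

```python
from hypothesis import given, settings, strategies as st

# A reusable strategy for input-validation tests: non-empty lists of
# finite 32-bit floats. Defining it once and importing it into test
# modules avoids per-file duplication of ad-hoc input generators.
finite_vectors = st.lists(
    st.floats(allow_nan=False, allow_infinity=False, width=32),
    min_size=1,
    max_size=32,
)


@given(xs=finite_vectors)
@settings(max_examples=50, deadline=None)
def test_doubling_preserves_length(xs):
    # Placeholder property standing in for a real estimator check.
    assert len([x * 2 for x in xs]) == len(xs)
```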

Related issues

Metadata

Labels: Tech Debt (Issues related to debt), improvement (Improvement / enhancement to an existing function), tests (Unit testing for project)
