
[Tracker] Test Infrastructure Improvements #6469

@csadorf


Context and Motivation

The cuML test infrastructure has evolved organically as the library has grown, leading to several challenges including test flakiness, coverage gaps, and inconsistent fixture usage. A systematic overhaul of the test infrastructure will help address these issues by improving reliability, clarifying test organization, optimizing execution time, and reducing maintenance burden - ultimately providing a better development experience and more robust codebase.

The plan outlined here will likely need to be implemented incrementally. This issue serves to track the overarching aims and progress.

Overall Aims

  1. Reduce test flakiness
  2. Increase test coverage
  3. Improve test maintainability and organization
  4. Enhance test performance and execution time

General Approach

  1. Better delineate between functional tests that check the plumbing and correctness tests that check whether our algorithmic implementation is correct.
  2. For correctness checks, thresholds should ideally be rooted in an analytical expectation based on the conditioning of the problem; realistically, be more aggressive about tightening thresholds to levels that would clearly indicate a regression, rather than setting them barely below the usually observed value.
  3. Make correctness checks in particular as deterministic as possible, while keeping in mind that full determinism is often unattainable due to the nature of parallel computation on GPU devices.
  4. Use per-test retries where appropriate, i.e., for correctness tests where a single failure among many successes does not immediately indicate a regression.
  5. Use automated retries for all components that rely on external resources, e.g., dataset downloads.
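As a sketch of point 5, automated retries for external resources can be factored into a small shared helper. `call_with_retries` and its parameters are illustrative names, not existing cuML utilities:

```python
import time


def call_with_retries(func, attempts=3, delay=0.0,
                      exceptions=(OSError, TimeoutError)):
    """Call `func`, retrying on transient errors such as network failures.

    Hypothetical helper; the defaults for `attempts` and `delay` are
    illustrative, not tuned values.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of retries: surface the real failure
            time.sleep(delay * attempt)  # back off before the next try


# Usage sketch: wrap a dataset download so a transient failure does not
# fail the whole test session (download_dataset is hypothetical).
# data = call_with_retries(lambda: download_dataset("covtype"), attempts=3)
```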

Specific Approach

1. Test Organization and Structure

Test Categories

  • Consider all tests as functional by default (API, input validation, basic behavior)
  • Explicitly mark or identify:
    • Correctness tests (algorithmic implementation verification)
    • Integration tests (end-to-end workflows)
  • Keep performance tests separate in dedicated benchmarking suite

Organization Alternatives

  1. Test Type Identification Options
    a. Via pytest markers:

    @pytest.mark.correctness
    @pytest.mark.integration
    • Pros: Flexible filtering, no file reorganization needed
    • Cons: Requires discipline in marker usage

    b. Via filename conventions:

    test_kmeans.py           # functional tests (default)
    test_kmeans_correct.py   # correctness tests
    test_kmeans_integ.py     # integration tests
    
    • Pros: Clear visual separation
    • Cons: Could lead to code duplication

    c. Via test name conventions:

    def test_kmeans_fit()          # functional test
    def test_correct_kmeans_conv() # correctness test
    def test_integ_kmeans_pipe()   # integration test
    • Pros: Easy to implement
    • Cons: Less structured than other options
  2. Directory-Based Structure

    tests/
      functional/    # default location
      correctness/
      integration/
    
    • Pros: Clear separation of concerns
    • Cons: May require significant reorganization
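If option (a) is chosen, the markers should be registered so that filtering works without unknown-marker warnings. A minimal conftest.py sketch (the marker descriptions are illustrative):

```python
# conftest.py -- register the proposed markers so that selections like
# `pytest -m correctness` or `pytest -m "not integration"` work without
# unknown-marker warnings.


def pytest_configure(config):
    config.addinivalue_line(
        "markers", "correctness: verifies the algorithmic implementation"
    )
    config.addinivalue_line(
        "markers", "integration: exercises end-to-end workflows"
    )
```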

Recommendations for Review

  1. Evaluate current test organization pain points
  2. Consider implementing pytest markers as a non-invasive first step
  3. Review potential benefits of filename conventions
  4. Assess need for directory restructuring based on maintenance experience

2. Test Quality Improvements

  • For functional tests:

    • Ensure comprehensive input validation (use hypothesis strategies where possible)
    • Test edge cases and error conditions
    • Verify API contract compliance
    • Avoid thresholded checks or use very conservative thresholds
  • For correctness checks:

    • Root thresholds in analytical expectations where possible
    • Set aggressive thresholds that clearly indicate regressions
    • Document the rationale behind threshold choices and expected variance
    • Implement tests as deterministically as possible
    • Use automatic retries where appropriate*

*) If we opt to use a "correctness" marker, retry logic could be implemented as part of the marker.
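One way the footnote above could be realized, assuming the pytest-rerunfailures plugin is installed (the rerun count is illustrative):

```python
# conftest.py -- attach automatic reruns only to tests carrying the
# "correctness" marker. Assumes pytest-rerunfailures is installed so
# that pytest.mark.flaky(reruns=...) is honored at run time.
import pytest


def pytest_collection_modifyitems(items):
    for item in items:
        if item.get_closest_marker("correctness") is not None:
            # Retry correctness tests up to twice before reporting failure.
            item.add_marker(pytest.mark.flaky(reruns=2))
```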

3. Test Infrastructure Enhancements

  • More consistent use of common fixtures (deduplicate existing ones)
  • More consistent use of hypothesis strategies for input validation
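As a sketch of what a shared hypothesis strategy could look like (assuming the hypothesis package is available; names and bounds are illustrative):

```python
from hypothesis import given, settings, strategies as st

# A reusable strategy for input-validation tests: non-empty lists of
# finite 32-bit floats. Defining it once and importing it into test
# modules avoids per-file duplication of ad-hoc input generators.
finite_vectors = st.lists(
    st.floats(allow_nan=False, allow_infinity=False, width=32),
    min_size=1,
    max_size=32,
)


@given(xs=finite_vectors)
@settings(max_examples=50, deadline=None)
def test_doubling_preserves_length(xs):
    # Placeholder property standing in for a real estimator check.
    assert len([x * 2 for x in xs]) == len(xs)
```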

Related issues

Metadata

Labels: Tech Debt (Issues related to debt), improvement (Improvement / enhancement to an existing function), tests (Unit testing for project)
