Improve numerical stability handling for near-constant features in StandardScaler #7770

@coderabbitai


Summary

The cuml.accel StandardScaler currently produces incorrect results when scaling near-constant features (features with very low variance). This affects numerical correctness once StandardScaler is auto-accelerated.

Problem

Test failures in test_standard_scaler_near_constant_features show that the GPU implementation does not properly handle features with variance close to zero (tested with values such as 1e-10, 1, and 1e10 at various sample sizes).

Requested Solution

Either:

  • (a) Match sklearn's stability handling: Implement epsilon/variance thresholding when computing scale, matching sklearn's behavior for near-zero variance columns
  • (b) Detect and force CPU fallback: Identify near-constant features during fit and dispatch to sklearn's CPU implementation (similar to other unsupported cases like partial_fit or sample_weight)
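To make the two options concrete, here is a minimal NumPy sketch. It assumes a threshold modeled on the near-constant check that sklearn applies in its private preprocessing helpers; the exact bound should be verified against the sklearn source. The names `near_constant_mask`, `safe_scale`, and `needs_cpu_fallback` are hypothetical and are not part of cuml or sklearn.

```python
import numpy as np


def near_constant_mask(var, mean, n_samples):
    """Flag features whose variance is indistinguishable from rounding error.

    Assumption: this bound is modeled on sklearn's private near-constant
    check; verify it against the sklearn source before relying on it.
    """
    eps = np.finfo(np.float64).eps
    upper_bound = n_samples * eps * var + (n_samples * mean * eps) ** 2
    return var <= upper_bound


def safe_scale(X):
    """Option (a): compute mean and scale, using 1.0 for near-constant columns."""
    X = np.asarray(X, dtype=np.float64)
    n_samples = X.shape[0]
    mean = X.mean(axis=0)
    var = X.var(axis=0)
    scale = np.sqrt(var)
    # Dividing by a tiny standard deviation would amplify rounding noise,
    # so near-constant columns are left unscaled.
    scale[near_constant_mask(var, mean, n_samples)] = 1.0
    return mean, scale


def needs_cpu_fallback(X):
    """Option (b): hypothetical predicate a dispatcher could consult during fit."""
    X = np.asarray(X, dtype=np.float64)
    mask = near_constant_mask(X.var(axis=0), X.mean(axis=0), X.shape[0])
    return bool(mask.any())
```

With option (a) the result stays on the GPU path, but the threshold has to track sklearn's behavior exactly. Option (b) trades some performance for guaranteed parity, since sklearn itself computes the result whenever the predicate fires.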

Context

This issue was identified during PR #7766 which adds cuml.accel support for StandardScaler.

Acceptance Criteria

  • Tests in test_standard_scaler_near_constant_features pass without xfails
  • Numerical results match sklearn for all near-constant feature scenarios
  • Solution is consistent with other GPU limitation handling in cuml.accel

Requested by: @csadorf
