
Validate that X is 2 dimensional#7889

Merged
rapids-bot[bot] merged 4 commits into rapidsai:release/26.04 from jcrist:more-validation
Mar 13, 2026

jcrist (Member) commented Mar 13, 2026

sklearn requires that X is 2-dimensional, and errors nicely otherwise.

cuml doesn't uniformly require that X is 2-dimensional. Some estimators error on non-2D X, but they usually do so by accident rather than as part of the input validation code. This leads to a number of xfailed tests in our sklearn compatibility test suite (both in the cuml.accel upstream tests and in test_sklearn_compatibility.py).

This PR:

  • Adds a deprecation warning to all estimators when X is not 2-dimensional. In 26.06 we'll remove the deprecation warning and error instead.
  • If cuml.accel is enabled, an error matching the sklearn error message is raised instead of the warning. This lets us un-xfail a bunch of tests right now.
  • Updates our test suite to not trigger the warning, ensuring we always pass 2-dimensional X in tests. This was most common for TargetEncoder; only a few other locations needed it. Since it was so common for TargetEncoder, I added a deprecation test there as well to check that everything still works on 1D inputs.
  • Updates reflect to support reset="type", which sets the input type on fit-like functions without also setting n_features_in_/feature_names_in_. This was needed for the 2 "transformers" that sklearn (and cuml) support that operate on y instead of X (LabelEncoder and LabelBinarizer). Since these estimators operate on y alone, they shouldn't support n_features_in_/feature_names_in_, and they shouldn't validate that the array input is 2-dimensional, since y can be 1D. This is a stop-gap solution as we refactor our validation functions. In the long run we might remove reset entirely from reflect and instead move setting the input type to the validation functions (with the reflect decorator remaining only for coercing outputs).
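
A rough sketch of what the three reset modes could look like. This is not cuml's reflect implementation: `reflect_sketch`, the `_input_type` attribute, and both toy estimators are hypothetical names, used only to illustrate that reset="type" records the input type while leaving n_features_in_ unset.

```python
import functools


def reflect_sketch(reset=False):
    """Hypothetical stand-in for a reflect-style decorator.

    reset=True   -> record the input type and n_features_in_
    reset="type" -> record the input type only
    reset=False  -> record nothing
    """
    if reset not in (True, False, "type"):
        raise ValueError(f"unsupported reset={reset!r}")

    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, X, *args, **kwargs):
            if reset:
                # Both True and "type" remember what kind of input was given.
                self._input_type = type(X).__name__
            if reset is True:
                # Only a full reset touches the feature bookkeeping.
                # Assumes X is a non-empty list of equally sized rows.
                self.n_features_in_ = len(X[0])
            return method(self, X, *args, **kwargs)

        return wrapper

    return decorator


class ToyLabelEncoder:
    """Operates on y alone, so it only needs the input type."""

    @reflect_sketch(reset="type")
    def fit(self, y):
        self.classes_ = sorted(set(y))
        return self


class ToyScaler:
    """Operates on X, so it tracks the number of features as well."""

    @reflect_sketch(reset=True)
    def fit(self, X):
        return self
```

With this sketch, fitting `ToyLabelEncoder` on a 1D list of labels sets `_input_type` but never creates `n_features_in_`, while `ToyScaler` sets both.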

As per our deprecation policy, I've marked this PR as "breaking" since it adds a new deprecation warning around non-2-dimensional X. All previously working code should continue to work; users providing 1D X should just see a warning.
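
For illustration, a minimal sketch of the warn-or-error behavior described above. This is not the code in cuml/internals/validation.py: `check_x_ndim`, its `strict` flag (standing in for cuml.accel being enabled), and the message text are all assumptions made for the example.

```python
import warnings


def check_x_ndim(shape, *, strict=False):
    """Hypothetical helper: check that an array shape is 2-dimensional.

    ``strict`` is an assumption standing in for "cuml.accel is enabled".
    """
    ndim = len(shape)
    if ndim == 2:
        return
    # Illustrative message only, not copied from cuml or sklearn.
    msg = f"Expected 2D array, got {ndim}D array instead"
    if strict:
        raise ValueError(msg)
    # stacklevel=2 attributes the warning to the caller of this helper.
    warnings.warn(msg, FutureWarning, stacklevel=2)
```

Under these assumptions, `check_x_ndim((100, 1))` passes silently, `check_x_ndim((100,))` emits a FutureWarning, and `check_x_ndim((100,), strict=True)` raises a ValueError. Callers with a single feature can avoid the warning by passing a column-shaped X, for example `X.reshape(-1, 1)` on a NumPy or CuPy array.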

jcrist added 3 commits March 13, 2026 12:15
cuml has not been strict about requiring `X` be a 2-dimensional input.
In the future we want to be strict about requiring 2D X, to better align
with sklearn conventions and expectations.

Here we add a deprecation warning if X isn't 2D, but continue to accept
it everywhere we already do. If `cuml.accel` is enabled, we instead
error, with the same error that sklearn would raise. In the next release
we'll instead error, improving our compatibility.
jcrist self-assigned this Mar 13, 2026
jcrist requested a review from a team as a code owner March 13, 2026 17:25
jcrist requested a review from dantegd March 13, 2026 17:25
github-actions bot added the "Cython / Python" label Mar 13, 2026
jcrist added the "improvement", "breaking", "cuml-accel", and "sklearn-api-compat" labels and removed the "Cython / Python" label Mar 13, 2026
jcrist requested a review from csadorf March 13, 2026 17:26
csadorf (Contributor) left a comment

LGTM!

Comment thread python/cuml/cuml/internals/outputs.py Outdated
github-actions bot added the "Cython / Python" label Mar 13, 2026

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Enforced stricter 2D array input validation across estimators; 1D inputs now trigger deprecation warnings.
    • Enhanced error messages for invalid input shapes and edge cases.
  • Documentation

    • Updated TargetEncoder examples to demonstrate 2D DataFrame inputs.
  • Tests

    • Added validation tests for features infrastructure behavior.
    • Updated test inputs to comply with 2D array requirements.

Walkthrough

This PR enforces 2D input validation across cuML's input handling pipeline, introduces a new "type" reset mode in the reflect decorator for fine-grained feature tracking control, updates preprocessing examples and tests to use 2D inputs, and re-enables previously skipped tests by removing xfail entries.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Feature Validation & Reset Semantics: python/cuml/cuml/internals/outputs.py, python/cuml/cuml/internals/validation.py | Updated reflect() to support reset="type" (in addition to bool), with conditional logic that skips certain feature checks for "type" mode. Strengthened _get_n_features() with explicit 2D validation, detailed error messages, and FutureWarning deprecation for non-2D inputs when acceleration is disabled. |
| Preprocessor & Decorator Updates: python/cuml/cuml/preprocessing/label.py, python/cuml/cuml/preprocessing/TargetEncoder.py | Changed LabelBinarizer.fit decorator from reset=True to reset="type". Updated TargetEncoder docstring example to demonstrate 2D DataFrame input instead of 1D Series. |
| Core Algorithm Tests: python/cuml/tests/test_coordinate_descent.py, python/cuml/tests/test_dbscan.py, python/cuml/tests/test_tsne.py, python/cuml/tests/test_validation.py | Reshaped test inputs to 2D arrays; updated validation tests to assert 3D input raises ValueError and added parametrized test for 1D input deprecation warnings. |
| Feature Attribution Tests: python/cuml/tests/test_label_binarizer.py, python/cuml/tests/test_label_encoder.py | Added new tests verifying LabelBinarizer and LabelEncoder do not attach n_features_in_ attribute, confirming feature infrastructure exclusion. |
| Integration & Encoder Tests: python/cuml/tests/test_target_encoder.py | Refactored TargetEncoder tests to use 2D DataFrame inputs for X and 1D Series for y; added deprecation warning expectations and updated output assertions for new input pathways. |
| Test Configuration: python/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml | Removed multiple xfail entries to re-enable tests; updated TSNE validation marker from cuml_accel_tsne_validation_on_init to cuml_accel_tsne_validations. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • #7877: Modifies cuml.internals.validation with overlapping feature/2D input validation logic; this PR's reflect() wrapper now conditions check_features() calls affecting both changes.

Suggested labels

Cython / Python

Suggested reviewers

  • dantegd
  • viclafargue
  • csadorf
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.93% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Validate that X is 2 dimensional' directly summarizes the main change: enforcing validation that input array X must be 2-dimensional across all estimators. |
| Description check | ✅ Passed | The description thoroughly explains the rationale, implementation details, and changes made to enforce 2D X validation, including deprecation warnings, test updates, and the reflect function modifications. |


coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/cuml/internals/validation.py`:
- Around line 107-112: The deprecation warning emitted in
cuml/internals/validation.py uses warnings.warn without a stacklevel, so it
points to this internal function instead of the user's callsite; update the
warnings.warn call in this module to pass an appropriate stacklevel (e.g.,
stacklevel=2 or 3) so the warning points to the user's code that passed non-2D
input (i.e., modify the warnings.warn(...) invocation in this validation code to
include stacklevel=<n>).

In `@python/cuml/tests/test_target_encoder.py`:
- Around line 36-38: Fix the typo in the test comment: change "tarnsform" to
"transform" above the pytest.warns block that asserts FutureWarning for
encoder.transform(df.category); update the comment text so it correctly reads
"Warns in transform".

In `@python/cuml/tests/test_tsne.py`:
- Around line 231-233: The current test_components_exception uses np.array([[]])
which is empty and can raise for zero features instead of testing
TSNE(n_components=3).fit; update the test to pass a minimally valid 2D X with at
least one sample and fewer features than n_components (e.g., shape (1,1) or
(2,1)) so the ValueError originates from the n_components check in TSNE.fit, and
(optionally) assert the raised exception message mentions n_components to ensure
the correct error path in TSNE.fit is exercised.
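
As a small illustration of the stacklevel point in the first inline comment: the sketch below uses hypothetical function names (`_validate`, `fit`, `user_code`), not cuml's actual call chain, and the right stacklevel in cuml depends on how many internal frames sit between the user's call and the `warnings.warn` call.

```python
import warnings


def _validate(ndim):
    # Internal helper (hypothetical). stacklevel=3 skips this frame and
    # fit()'s frame, so the warning is attributed to fit()'s caller.
    if ndim != 2:
        warnings.warn(
            "X should be 2-dimensional", FutureWarning, stacklevel=3
        )


def fit(ndim):
    # Public entry point (hypothetical) that delegates to the helper.
    _validate(ndim)


def user_code():
    fit(1)  # with stacklevel=3 the warning is reported at this line
```

With the default stacklevel=1 the warning would instead be reported at the `warnings.warn` call inside `_validate`, which is rarely useful to the person who passed the 1D input.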

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5febae2b-ff2c-488a-ab2c-b760fb4a38b6

📥 Commits

Reviewing files that changed from the base of the PR and between abc17be and 10ce094.

📒 Files selected for processing (12)
  • python/cuml/cuml/internals/outputs.py
  • python/cuml/cuml/internals/validation.py
  • python/cuml/cuml/preprocessing/TargetEncoder.py
  • python/cuml/cuml/preprocessing/label.py
  • python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
  • python/cuml/tests/test_coordinate_descent.py
  • python/cuml/tests/test_dbscan.py
  • python/cuml/tests/test_label_binarizer.py
  • python/cuml/tests/test_label_encoder.py
  • python/cuml/tests/test_target_encoder.py
  • python/cuml/tests/test_tsne.py
  • python/cuml/tests/test_validation.py

Comment thread python/cuml/cuml/internals/validation.py
Comment thread python/cuml/tests/test_target_encoder.py
Comment thread python/cuml/tests/test_tsne.py
jcrist (Member, Author) commented Mar 13, 2026

/merge

rapids-bot bot merged commit 3cc7c9c into rapidsai:release/26.04 Mar 13, 2026 (165 of 170 checks passed)
jcrist deleted the more-validation branch March 13, 2026 19:59

Labels

  • breaking: Breaking change
  • cuml-accel: Issues related to cuml.accel
  • Cython / Python: Cython or Python issue
  • improvement: Improvement / enhancement to an existing function
  • sklearn-api-compat: Issues around cuml matching sklearn API conventions/standards

Projects

None yet

3 participants