Validate that X is 2 dimensional by jcrist · Pull Request #7889 · rapidsai/cuml

jcrist · 2026-03-13T17:25:00Z

sklearn requires that X is 2-dimensional, and errors nicely otherwise.

cuml doesn't uniformly require that X is 2-dimensional. Some estimators error on non-2D X, but they usually do so accidentally, not as part of the input validation code. This leads to a bunch of xfailed tests in our sklearn compatibility test suite (both in the cuml.accel upstream tests and in test_sklearn_compatibility.py).

This PR:

- Adds a deprecation warning to all estimators when X is not 2-dimensional. In 26.06 we'll remove the deprecation warning and raise an error instead.
- If cuml.accel is enabled, an error matching the sklearn error message is raised instead of the warning. This lets us un-xfail a bunch of tests right now.
- Updates our test suite so it doesn't trigger the warning, ensuring tests always pass a 2-dimensional X. This was most common for TargetEncoder; only a few other locations needed it. Since it was so common for TargetEncoder, I added a deprecation test there as well to check that everything still works on 1D inputs.
- Updates reflect to support reset="type", which sets the input type on fit-like functions without setting n_features_in_/feature_names_in_. This was needed for the 2 "transformers" that sklearn (and cuml) support that are meant to operate on y instead of X (LabelEncoder and LabelBinarizer). These estimators operate on y alone and shouldn't support n_features_in_/feature_names_in_. They also shouldn't validate that the array input is 2-dimensional, since y can be 1D. This is a stop-gap solution while we refactor our validation functions; in the long run we might remove reset entirely from reflect and instead move setting the input type to the validation functions (with the reflect decorator remaining only for coercing outputs).
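To make the reset="type" idea concrete, here is a minimal sketch of a decorator with three reset modes. It is not cuml's actual `reflect` implementation: `reflect_sketch`, `_input_type`, `ToyScaler`, and `ToyLabelEncoder` are hypothetical names invented for illustration, and the feature count is taken naively from the first row.

```python
import functools


def reflect_sketch(reset=False):
    """Hypothetical stand-in for a reflect-style decorator (not cuml's API).

    reset=True   -> record the input type and n_features_in_
    reset="type" -> record the input type only
    reset=False  -> record nothing
    """
    if reset not in (True, False, "type"):
        raise ValueError(f"reset must be a bool or 'type', got {reset!r}")

    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, X, *args, **kwargs):
            if reset:
                # Both True and "type" remember the input container type.
                self._input_type = type(X).__name__
            if reset is True:
                # Only the full reset records feature metadata.
                self.n_features_in_ = len(X[0])
            return method(self, X, *args, **kwargs)

        return wrapper

    return decorator


class ToyScaler:
    @reflect_sketch(reset=True)
    def fit(self, X):
        return self


class ToyLabelEncoder:
    # Operates on y, so it tracks the input type but no feature metadata.
    @reflect_sketch(reset="type")
    def fit(self, y):
        self.classes_ = sorted(set(y))
        return self
```

With this sketch, `ToyScaler().fit([[1.0, 2.0], [3.0, 4.0]])` ends up with `n_features_in_` set, while `ToyLabelEncoder().fit(["b", "a", "b"])` has no such attribute, which mirrors the behavior described for LabelEncoder and LabelBinarizer.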

As per our deprecation policy, I've marked this PR as "breaking" since it adds a new deprecation warning around non-2-dimensional X. All prior working code should continue to work; users providing 1D X should just see a warning.

cuml has not been strict about requiring `X` be a 2-dimensional input. In the future we want to be strict about requiring 2D X, to better align with sklearn conventions and expectations. Here we add a deprecation warning if X isn't 2D, but continue to accept it everywhere we already do. If `cuml.accel` is enabled, we instead error, with the same error that sklearn would raise. In the next release we'll instead error, improving our compatibility.
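The warn-or-error behavior described above can be sketched roughly as follows. This is a simplified illustration, not the code in `cuml/internals/validation.py`: `check_2d_sketch` and its `accel_enabled` flag are hypothetical, the message text is illustrative rather than the exact sklearn wording, and the sketch treats every non-2D input the same way.

```python
import warnings

import numpy as np


def check_2d_sketch(X, accel_enabled=False):
    """Hypothetical helper: warn on non-2D X, or raise if accel is enabled."""
    X = np.asarray(X)
    if X.ndim == 2:
        return X
    # Assumption: illustrative message, not the exact cuml/sklearn text.
    msg = f"Expected 2D array, got {X.ndim}D array instead."
    if accel_enabled:
        # Under the accelerator, fail the way sklearn would.
        raise ValueError(msg)
    warnings.warn(
        msg + " Non-2D X is deprecated and will raise an error in a future release.",
        FutureWarning,
        stacklevel=2,
    )
    # During the deprecation period the input is still accepted unchanged.
    return X


x1d = np.array([1.0, 2.0, 3.0])
# Reshaping to a single-feature column avoids the warning entirely.
x2d = check_2d_sketch(x1d.reshape(-1, 1))
```

For callers currently passing 1D data, reshaping with `reshape(-1, 1)` (or selecting a one-column DataFrame rather than a Series) is the usual way to supply a single feature as 2D input.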

csadorf

LGTM!

coderabbitai · 2026-03-13T18:27:13Z

📝 Walkthrough

Summary by CodeRabbit

Bug Fixes
- Enforced stricter 2D array input validation across estimators; 1D inputs now trigger deprecation warnings.
- Enhanced error messages for invalid input shapes and edge cases.
Documentation
- Updated TargetEncoder examples to demonstrate 2D DataFrame inputs.
Tests
- Added validation tests for features infrastructure behavior.
- Updated test inputs to comply with 2D array requirements.

Walkthrough

This PR enforces 2D input validation across cuML's input handling pipeline, introduces a new "type" reset mode in the reflect decorator for fine-grained feature tracking control, updates preprocessing examples and tests to use 2D inputs, and re-enables previously skipped tests by removing xfail entries.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Feature Validation & Reset Semantics: `python/cuml/cuml/internals/outputs.py`, `python/cuml/cuml/internals/validation.py` | Updated `reflect()` to support `reset="type"` (in addition to bool), with conditional logic that skips certain feature checks for `"type"` mode. Strengthened `_get_n_features()` with explicit 2D validation, detailed error messages, and FutureWarning deprecation for non-2D inputs when acceleration is disabled. |
| Preprocessor & Decorator Updates: `python/cuml/cuml/preprocessing/label.py`, `python/cuml/cuml/preprocessing/TargetEncoder.py` | Changed `LabelBinarizer.fit` decorator from `reset=True` to `reset="type"`. Updated `TargetEncoder` docstring example to demonstrate 2D DataFrame input instead of 1D Series. |
| Core Algorithm Tests: `python/cuml/tests/test_coordinate_descent.py`, `python/cuml/tests/test_dbscan.py`, `python/cuml/tests/test_tsne.py`, `python/cuml/tests/test_validation.py` | Reshaped test inputs to 2D arrays; updated validation tests to assert 3D input raises ValueError and added parametrized test for 1D input deprecation warnings. |
| Feature Attribution Tests: `python/cuml/tests/test_label_binarizer.py`, `python/cuml/tests/test_label_encoder.py` | Added new tests verifying `LabelBinarizer` and `LabelEncoder` do not attach `n_features_in_` attribute, confirming feature infrastructure exclusion. |
| Integration & Encoder Tests: `python/cuml/tests/test_target_encoder.py` | Refactored `TargetEncoder` tests to use 2D DataFrame inputs for X and 1D Series for y; added deprecation warning expectations and updated output assertions for new input pathways. |
| Test Configuration: `python/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml` | Removed multiple xfail entries to re-enable tests; updated TSNE validation marker from `cuml_accel_tsne_validation_on_init` to `cuml_accel_tsne_validations`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

#7877: Modifies cuml.internals.validation with overlapping feature/2D input validation logic; this PR's reflect() wrapper now conditions check_features() calls affecting both changes.

Suggested labels

Cython / Python

Suggested reviewers

dantegd
viclafargue
csadorf

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.93% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Validate that X is 2 dimensional' directly summarizes the main change: enforcing validation that input array X must be 2-dimensional across all estimators. |
| Description check | ✅ Passed | The description thoroughly explains the rationale, implementation details, and changes made to enforce 2D X validation, including deprecation warnings, test updates, and the reflect function modifications. |

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/cuml/internals/validation.py`:
- Around line 107-112: The deprecation warning emitted in
cuml/internals/validation.py uses warnings.warn without a stacklevel, so it
points to this internal function instead of the user's callsite; update the
warnings.warn call in this module to pass an appropriate stacklevel (e.g.,
stacklevel=2 or 3) so the warning points to the user's code that passed non-2D
input (i.e., modify the warnings.warn(...) invocation in this validation code to
include stacklevel=<n>).

In `@python/cuml/tests/test_target_encoder.py`:
- Around line 36-38: Fix the typo in the test comment: change "tarnsform" to
"transform" above the pytest.warns block that asserts FutureWarning for
encoder.transform(df.category); update the comment text so it correctly reads
"Warns in transform".

In `@python/cuml/tests/test_tsne.py`:
- Around line 231-233: The current test_components_exception uses np.array([[]])
which is empty and can raise for zero features instead of testing
TSNE(n_components=3).fit; update the test to pass a minimally valid 2D X with at
least one sample and fewer features than n_components (e.g., shape (1,1) or
(2,1)) so the ValueError originates from the n_components check in TSNE.fit, and
(optionally) assert the raised exception message mentions n_components to ensure
the correct error path in TSNE.fit is exercised.
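The `stacklevel` point in the first inline comment can be illustrated with a small standalone sketch. The functions `_validate`, `fit`, and `user_code` are hypothetical stand-ins, not cuml code; the right `stacklevel` value in practice depends on how many internal frames sit between the warning call and the user's code.

```python
import warnings


def _validate(ndim):
    # With the default stacklevel=1 the warning would be attributed to this
    # internal line. stacklevel=3 skips _validate and fit, so the warning is
    # attributed to whoever called fit.
    if ndim != 2:
        warnings.warn("X should be 2-dimensional", FutureWarning, stacklevel=3)


def fit(ndim):
    _validate(ndim)


def user_code():
    fit(1)  # the warning is reported against this line
```

Capturing the warning with `warnings.catch_warnings(record=True)` and inspecting its `lineno` is one way to confirm that it points at the caller rather than at the validation helper.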

📥 Commits

Reviewing files that changed from the base of the PR and between abc17be and 10ce094.

📒 Files selected for processing (12)

python/cuml/cuml/internals/outputs.py
python/cuml/cuml/internals/validation.py
python/cuml/cuml/preprocessing/TargetEncoder.py
python/cuml/cuml/preprocessing/label.py
python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
python/cuml/tests/test_coordinate_descent.py
python/cuml/tests/test_dbscan.py
python/cuml/tests/test_label_binarizer.py
python/cuml/tests/test_label_encoder.py
python/cuml/tests/test_target_encoder.py
python/cuml/tests/test_tsne.py
python/cuml/tests/test_validation.py

jcrist · 2026-03-13T19:58:49Z

/merge

jcrist added 3 commits March 13, 2026 12:15

Don't set/check features on label preprocessors

bc685f9

Update xfail list

f542a75

jcrist self-assigned this Mar 13, 2026

jcrist requested a review from a team as a code owner March 13, 2026 17:25

jcrist requested a review from dantegd March 13, 2026 17:25

github-actions Bot added the Cython / Python label Mar 13, 2026

jcrist added the improvement, breaking, cuml-accel, and sklearn-api-compat labels and removed the Cython / Python label Mar 13, 2026

jcrist requested a review from csadorf March 13, 2026 17:26

jcrist mentioned this pull request Mar 13, 2026

Remove deprecation warning in TargetEncoder for 26.04 #7890

Closed

csadorf approved these changes Mar 13, 2026

Comment thread python/cuml/cuml/internals/outputs.py Outdated

Fixups

10ce094

github-actions Bot added the Cython / Python label Mar 13, 2026

coderabbitai Bot reviewed Mar 13, 2026

Comment thread python/cuml/cuml/internals/validation.py

Comment thread python/cuml/tests/test_target_encoder.py

Comment thread python/cuml/tests/test_tsne.py

rapids-bot Bot merged commit 3cc7c9c into rapidsai:release/26.04 Mar 13, 2026
165 of 170 checks passed

jcrist deleted the more-validation branch March 13, 2026 19:59

coderabbitai Bot mentioned this pull request Mar 13, 2026

Forward-merge release/26.04 into main #7894

Merged

coderabbitai Bot mentioned this pull request Mar 24, 2026

Remove deprecations from 26.04 #7928

Merged

This was referenced Apr 22, 2026

New input validation utilities #7973

Merged

Use new input validation in cuml.linear_models/cuml.solvers #7978

Merged

Apply new validation to cuml.cluster #7984

Merged

This was referenced Apr 29, 2026

Apply new validation to cuml.svm #8029

Merged

Cleanup and apply new validation to cuml._thirdparty and cuml.preprocessing #8052

Open

Validate that X is 2 dimensional #7889
rapids-bot[bot] merged 4 commits into rapidsai:release/26.04 from jcrist:more-validation
