
Add check_features#7874

Merged
rapids-bot[bot] merged 8 commits into rapidsai:main from jcrist:check-features
Mar 11, 2026

Conversation

@jcrist
Member

@jcrist jcrist commented Mar 10, 2026

This PR resolves many xfailures in test_sklearn_compatibility and the cuml.accel upstream sklearn test suite. In particular, it targets the check_n_features_in_after_fitting check (although several other checks were fixed while implementing).

This is a precursor to adding support for feature_names_in_ - once this PR lands, the next step is to add storing/checking that attribute within check_features as well. In the long run I expect most estimators will move to calling check_features from within a larger check_inputs ingestion routine (similar to sklearn's validate_data), but we're taking incremental steps to get there.

Summary:

  • Add check_features and corresponding test to cuml.internals.validation
  • Call check_features(self, X, reset=True) from within reflect; drop all calls to the private _set_n_features_in method.
  • Call check_features(self, X) in all inference methods that expect an X aligned with the original fit features. Note that this should not be called in an inverse_transform method, since the input there is not aligned with the original X.
  • Update xfailed tests accordingly.
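The reset=True/reset=False split described above can be sketched in a few lines. This is a minimal illustration, not cuml's actual code: the names (check_features, _get_n_features, n_features_in_) follow the PR description, but the real implementation in cuml.internals.validation handles many more input types.

```python
def _get_n_features(X):
    """Best-effort count of the number of features (columns) in X."""
    shape = getattr(X, "shape", None)
    if shape is not None and len(shape) == 2:
        return shape[1]
    raise TypeError(
        f"Unable to determine the number of features in a {type(X).__name__}"
    )


def check_features(estimator, X, reset=False):
    """Record (reset=True) or validate (reset=False) the feature count of X."""
    n_features = _get_n_features(X)
    if reset:
        # fit path: record the feature count for later validation
        estimator.n_features_in_ = n_features
        return
    # inference path: compare against the count recorded at fit time
    expected = getattr(estimator, "n_features_in_", None)
    if expected is not None and n_features != expected:
        raise ValueError(
            f"X has {n_features} features, but {type(estimator).__name__} "
            f"is expecting {expected} features as input."
        )
```

With this shape, fit-like methods call check_features(self, X, reset=True) and inference methods call check_features(self, X), matching the bullet points above.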

In addition, the following tangential issues were found and fixed while writing this PR:

  • Make predict_proba/predict_log_proba available on SVC/LinearSVC only if probability=True. This is needed for the sklearn generic tests to work properly - if a method is available it will be called, so we want hasattr(model, "predict_proba") to return True only if probability=True. This matches upstream sklearn behavior. The change is non-breaking: any user who was calling these methods before will still have them work fine; it only changes the type of error raised when they weren't supported. This fixed several xfails in our test suite.
  • Add the requires_positive_X tag to ComplementNB/CategoricalNB. This was needed so the sklearn generic tests would generate positive data for these estimators; otherwise invalid data would be generated, masking the issues we actually have. This fixed several xfails in our test suite.
  • Fix the partial_fit methods in cuml.naive_bayes and cuml.IncrementalPCA to better distinguish between first and subsequent fits. This fixed several xfails in our test suite.
  • Simplify implementation of OrdinalEncoder/OneHotEncoder. These estimators don't follow the patterns we expect for Base subclasses yet, and were implementing n_features_in_ themselves (among other things). This is non-breaking, but will look larger on the diff.
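The hasattr gating in the first bullet can be illustrated with a small stand-in for sklearn's available_if decorator. The real code uses sklearn.utils.metaestimators.available_if; this self-contained descriptor (and the toy Model class) only exists to show the mechanism: raising AttributeError from the descriptor makes hasattr() return False.

```python
import functools


class _AvailableIf:
    """Descriptor that hides a method unless a check on the instance passes."""

    def __init__(self, check):
        self.check = check

    def __call__(self, fn):
        # Used as a decorator: capture the wrapped method.
        self.fn = fn
        return self

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        if not self.check(obj):
            # AttributeError here is what makes hasattr(obj, name) be False.
            raise AttributeError(
                f"{type(obj).__name__!r} object has no attribute {self.name!r}"
            )
        return functools.partial(self.fn, obj)


def available_if(check):
    return _AvailableIf(check)


class Model:
    """Toy classifier standing in for SVC/LinearSVC."""

    def __init__(self, probability=False):
        self.probability = probability

    @available_if(lambda est: est.probability)
    def predict_proba(self, X):
        return "probabilities"
```

With probability=False, hasattr(model, "predict_proba") is False, so the sklearn generic tests skip the method instead of calling it.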

Most other changes are pretty small, mechanical additions to add check_features calls in the correct spots. Hopefully a straightforward review.

Working towards #6650, part of #7428.

jcrist added 6 commits March 10, 2026 12:47
Adds calls to `check_features` everywhere they're needed.

Removes now-passing xfails in `test_sklearn_compatibility.py`.

While hacking through this, I also found a few unrelated bugs that
got lumped in with this commit:

- Adds `requires_positive_X` tag to `ComplementNB`/`CategoricalNB`. This
  tag affects sklearn's generic tests and led to some xfails now
  passing. Was necessary to get the tests to run without errors.
- Fixes some state management around `partial_fit` for naive bayes and
  incremental PCA. These are tested by the sklearn generic tests.
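The first-vs-subsequent-call distinction in partial_fit mentioned above follows a common pattern: detect the first call by the absence of fitted state, initialize on that call, and validate the feature count on every later call. The sketch below is illustrative only (a toy running-mean estimator, not cuml's naive bayes or incremental PCA code).

```python
import numpy as np


class RunningMean:
    """Toy estimator tracking a per-feature mean across partial_fit batches."""

    def partial_fit(self, X):
        X = np.asarray(X, dtype=np.float64)
        first_call = not hasattr(self, "n_samples_seen_")
        if first_call:
            # First batch: initialize state from this batch's shape.
            self.n_features_in_ = X.shape[1]
            self.mean_ = np.zeros(X.shape[1])
            self.n_samples_seen_ = 0
        elif X.shape[1] != self.n_features_in_:
            raise ValueError(
                f"X has {X.shape[1]} features, expected {self.n_features_in_}"
            )
        # Incremental mean update over all samples seen so far.
        total = self.n_samples_seen_ + X.shape[0]
        self.mean_ = (self.mean_ * self.n_samples_seen_ + X.sum(axis=0)) / total
        self.n_samples_seen_ = total
        return self
```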
@jcrist jcrist self-assigned this Mar 10, 2026
@jcrist jcrist requested a review from a team as a code owner March 10, 2026 22:23
@jcrist jcrist requested a review from betatim March 10, 2026 22:23
@jcrist jcrist added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cuml-accel Issues related to cuml.accel sklearn-api-compat Issues around cuml matching sklearn API conventions/standards labels Mar 10, 2026
@github-actions github-actions Bot added the Cython / Python Cython or Python issue label Mar 10, 2026
@coderabbitai

coderabbitai Bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between bd8a3b9 and 2f0e73f.

📒 Files selected for processing (1)
  • python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
💤 Files with no reviewable changes (1)
  • python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • LinearSVC now exposes predict_proba and predict_log_proba when probability=True.
  • Bug Fixes

    • Centralized feature validation added across many estimators and transformers for consistent errors on feature-mismatch.
    • Naive Bayes fit/partial-fit behavior and metadata handling improved.
    • Encoders and preprocessing now manage feature metadata more robustly.
  • Tests

    • Tests and expected failures updated to reflect stricter validation and behavior changes.

Walkthrough

Introduces a centralized check_features(estimator, X, reset=False) and helper _get_n_features in cuml.internals.validation; removes Base._set_n_features_in; replaces direct feature-count setting with check_features calls across many estimators/encoders, updates Naive Bayes context usage, adjusts tests and xfail expectations.

Changes

  • Core validation & base outputs (python/cuml/cuml/internals/validation.py, python/cuml/cuml/internals/base.py, python/cuml/cuml/internals/outputs.py): Adds check_features and _get_n_features; removes Base._set_n_features_in; replaces direct feature-size updates with centralized validation and exports check_features.
  • Preprocessing & encoders (python/cuml/cuml/preprocessing/encoders.py, python/cuml/cuml/preprocessing/onehotencoder_mg.py, python/cuml/cuml/preprocessing/ordinalencoder_mg.py, python/cuml/cuml/preprocessing/TargetEncoder.py): Replaces CheckFeaturesMixIn usage with centralized check_features; adds _set_features on BaseEncoder; removes setting/exposing n_features_in_ in TargetEncoder.
  • Decomposition / PCA / TSVD (python/cuml/cuml/decomposition/base_mg.py, python/cuml/cuml/decomposition/incremental_pca.py, python/cuml/cuml/decomposition/pca.pyx, python/cuml/cuml/decomposition/tsvd.pyx): Use check_features(..., reset=True) for initial-fit feature setup; add check_features in transform paths; incremental PCA adjusts first-call initialization.
  • Clustering, kmeans, projections (python/cuml/cuml/cluster/hdbscan/hdbscan.pyx, python/cuml/cuml/cluster/kmeans.pyx, python/cuml/cuml/random_projection/random_projection.py): Insert check_features validations in membership_vector, approximate_predict, KMeans predict/transform, and RandomProjection.transform.
  • Ensembles & neighbors (python/cuml/cuml/ensemble/randomforestclassifier.py, python/cuml/cuml/ensemble/randomforestregressor.py, python/cuml/cuml/neighbors/nearest_neighbors.pyx, python/cuml/cuml/neighbors/kernel_density.py): Add check_features in predict/predict_proba/kneighbors/score_samples and simplify some sparse-data guards.
  • SVM, linear models, kernel ridge, SVR (python/cuml/cuml/svm/linear_svc.py, python/cuml/cuml/svm/svc.py, python/cuml/cuml/svm/svr.py, python/cuml/cuml/linear_model/base.py, python/cuml/cuml/linear_model/base_mg.pyx, python/cuml/cuml/kernel_ridge/kernel_ridge.py): Add check_features validation in predict/decision_function/predict_proba paths; add @available_if gating for probability methods and refine NotFittedError messaging.
  • Naive Bayes & related refactors (python/cuml/cuml/naive_bayes/naive_bayes.py): Swaps reflect for run_in_internal_context, inserts check_features calls and first-call branching in fit/partial_fit/predict flows, and adds _more_static_tags() on several NB classes.
  • Other estimators/utilities (python/cuml/cuml/covariance/ledoit_wolf.py, python/cuml/cuml/manifold/umap/umap.pyx, python/cuml/cuml/experimental/linear_model/lars.pyx): Add check_features to score/transform/predict paths; lars removes explicit _set_n_features_in/_set_output_type calls.
  • Tests & sklearn-compat adjustments (python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml, python/cuml/tests/test_validation.py, python/cuml/tests/test_linear_svm.py, python/cuml/tests/*): Adds tests for _get_n_features/check_features, removes many xfail entries for n_features_in checks, and updates SVC/LinearSVC probability-related tests and expectations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested reviewers

  • betatim
  • dantegd
  • csadorf
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 56.67%, which is below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅: The title 'Add check_features' is concise and clearly describes the main change: introducing a new validation helper function.
  • Description check ✅: The description is comprehensive and directly related to the changeset, explaining the motivation, implementation details, and related fixes.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cuml/cuml/svm/linear_svc.py (1)

320-375: ⚠️ Potential issue | 🟠 Major

cuML's LinearSVC introduces API divergence from sklearn by adding unsupported parameters and methods.

The probability parameter and predict_proba/predict_log_proba methods do not exist in sklearn's LinearSVC. Sklearn recommends wrapping LinearSVC with CalibratedClassifierCV to obtain probability estimates, or using SVC(kernel="linear", probability=True) instead.

Per coding guidelines, function and parameter names must match scikit-learn without justification. Either remove the probability parameter and probability methods entirely, or document this as an intentional API deviation with clear justification for why cuML diverges here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/svm/linear_svc.py` around lines 320 - 375, The PR adds
sklearn-incompatible API changes to LinearSVC by introducing the probability
parameter and methods predict_proba and predict_log_proba; either remove these
members from the LinearSVC class (remove the probability attribute, the
`@available_if` wrappers, and the predict_proba/predict_log_proba methods) to
match scikit-learn, or if you intend to keep them, add a clear, explicit
justification in the class docstring and module-level docs describing the
intentional divergence and rationale (mention recommended alternatives like
CalibratedClassifierCV and SVC(kernel="linear", probability=True)); update any
generate_docstring/reflect annotations and tests referencing
predict_proba/predict_log_proba accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/cuml/internals/validation.py`:
- Around line 57-65: The _get_n_features branch that treats list/tuple inputs
assumes nested rows are only lists/tuples and returns 1 otherwise, which
miscounts cases like [np.array(...), ...] and also swallows errors with a
blanket except; update the logic in _get_n_features to: (1) explicitly handle
empty X by raising a clear ValueError, (2) inspect the first row using a robust
"array-like" check (e.g., isinstance(row, (list, tuple, np.ndarray)) or
hasattr(row, "__len__")) to determine n_features correctly for array-like rows
such as np.ndarray, (3) avoid a bare except—only catch specific exceptions if
needed and re-raise or convert them to a meaningful error about inconsistent row
formats, and (4) ensure functions that rely on n_features_in_ (the callers)
receive the correct integer or a clear error when ambiguous; locate this logic
in the _get_n_features function to make the changes.

In `@python/cuml/cuml/manifold/umap/umap.pyx`:
- Around line 1415-1419: Run the dimensionality check (the len(X.shape) != 2
guard that raises "Reshape your data: X should be two dimensional") before
calling check_features(self, X) inside the predict/transform path so 1D inputs
hit the sklearn-style reshape error first; then remove the later manual n_cols
!= self.n_features_in_ block because check_features already enforces
feature-count consistency; ensure references are to check_is_fitted, the
len(X.shape) 2D check, check_features(self, X), and the n_cols !=
self.n_features_in_ manual check so the correct spots are updated.

In `@python/cuml/cuml/naive_bayes/naive_bayes.py`:
- Around line 858-863: On the partial_fit path, when first_call is True and the
estimator is the discrete branch, require that the caller passes a non-None
classes parameter instead of silently inferring classes from the first batch; in
practice modify partial_fit (the method that calls self._set_output_type(X) and
check_features) to check if first_call and classes is None and raise ValueError
(matching scikit-learn behavior), and when classes is provided use it to
initialize self.classes_ rather than inferring from y in that first batch.

In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 777-779: kneighbors currently uses an old width check via
X.shape[1] after calling check_features, causing divergent error paths and
failures for array-like inputs; update kneighbors to call check_is_fitted()
first, rely on check_features(self, X) to validate/normalize the input (remove
any subsequent direct uses of X.shape[1] or similar legacy width checks), then
branch to _kneighbors_dense/_kneighbors_sparse using the validated/normalized X
returned or ensured by check_features, and ensure the returned neighbors
distances/indices preserve the input type (cuDF/pandas/NumPy) consistent with
check_features behavior.

In `@python/cuml/cuml/preprocessing/encoders.py`:
- Around line 42-47: Update OneHotEncoder.transform to use the existing
_check_features helper so transform performs the same feature validation as
fit/fit_transform: call check_is_fitted(self), then invoke
self._check_features(X, reset=False) (which calls check_features and
sets/validates feature_names_in_), and ensure transform validates input types
(pandas/cuDF/ndarray) and shapes against the fitted
n_features_in_/feature_names_in_. Preserve the fitted column order when
selecting columns (do not iterate X.columns directly), raise on missing or extra
columns rather than silently dropping/reordering, and make sure the returned
encoded block type matches the input type (DataFrame for DataFrame inputs,
ndarray for ndarray inputs).
- Around line 62-63: The bug is that calling self._check_features(reset=True)
unconditionally inside _fit() causes the second validation of explicit
categories (via _check_input_fit/_check_input with is_categories=True) to
overwrite fitted metadata; change the flow so that _check_features(reset=True)
is only called when validating the main training X (not when validating explicit
categories), i.e., detect is_categories before resetting and skip the reset for
category-validation paths; ensure _fit() still clears previous model state and
initializes learned attributes (n_features_in_, feature_names_in_, etc.) before
any validation and use the existing _check_input/_check_input_fit calls to
validate inputs without mutating fitted metadata for categories.

In `@python/cuml/tests/test_sklearn_compatibility.py`:
- Around line 289-290: Remove the new xfail entries for "check_fit2d_predict1d"
(and the similar entries referenced at lines 359-360) from the xfail mapping so
we don't codify 1D-input regressions; instead fix the estimators that fail
validation by updating their predict/transform implementations (e.g.,
SVC.predict and any custom transformer/estimator transform methods) to perform
the same input-shape checks and error raising as scikit-learn for 1D arrays
(match function/parameter names and edge-case behavior), then run the
sklearn_compatibility tests to confirm check_fit2d_predict1d passes rather than
being marked xfail.



📥 Commits

Reviewing files that changed from the base of the PR and between 5f2909d and f925fa9.

📒 Files selected for processing (35)
  • python/cuml/cuml/cluster/hdbscan/hdbscan.pyx
  • python/cuml/cuml/cluster/kmeans.pyx
  • python/cuml/cuml/covariance/ledoit_wolf.py
  • python/cuml/cuml/decomposition/base_mg.py
  • python/cuml/cuml/decomposition/incremental_pca.py
  • python/cuml/cuml/decomposition/pca.pyx
  • python/cuml/cuml/decomposition/tsvd.pyx
  • python/cuml/cuml/ensemble/randomforestclassifier.py
  • python/cuml/cuml/ensemble/randomforestregressor.py
  • python/cuml/cuml/experimental/linear_model/lars.pyx
  • python/cuml/cuml/internals/base.py
  • python/cuml/cuml/internals/outputs.py
  • python/cuml/cuml/internals/validation.py
  • python/cuml/cuml/kernel_ridge/kernel_ridge.py
  • python/cuml/cuml/linear_model/base.py
  • python/cuml/cuml/linear_model/base_mg.pyx
  • python/cuml/cuml/manifold/umap/umap.pyx
  • python/cuml/cuml/naive_bayes/naive_bayes.py
  • python/cuml/cuml/neighbors/kernel_density.py
  • python/cuml/cuml/neighbors/nearest_neighbors.pyx
  • python/cuml/cuml/preprocessing/TargetEncoder.py
  • python/cuml/cuml/preprocessing/encoders.py
  • python/cuml/cuml/preprocessing/onehotencoder_mg.py
  • python/cuml/cuml/preprocessing/ordinalencoder_mg.py
  • python/cuml/cuml/random_projection/random_projection.py
  • python/cuml/cuml/svm/linear_svc.py
  • python/cuml/cuml/svm/svc.py
  • python/cuml/cuml/svm/svr.py
  • python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
  • python/cuml/tests/test_base.py
  • python/cuml/tests/test_incremental_pca.py
  • python/cuml/tests/test_linear_svm.py
  • python/cuml/tests/test_sklearn_compatibility.py
  • python/cuml/tests/test_svm.py
  • python/cuml/tests/test_validation.py
💤 Files with no reviewable changes (4)
  • python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
  • python/cuml/cuml/experimental/linear_model/lars.pyx
  • python/cuml/cuml/internals/base.py
  • python/cuml/tests/test_base.py

Comment thread python/cuml/cuml/internals/validation.py Outdated
Comment thread python/cuml/cuml/manifold/umap/umap.pyx
Comment thread python/cuml/cuml/naive_bayes/naive_bayes.py
Comment thread python/cuml/cuml/neighbors/nearest_neighbors.pyx
Comment thread python/cuml/cuml/preprocessing/encoders.py Outdated
Comment thread python/cuml/cuml/preprocessing/encoders.py Outdated
Comment thread python/cuml/tests/test_sklearn_compatibility.py
Contributor

@viclafargue viclafargue left a comment


Thanks! LGTM, even though a second pair of eyes would be welcome.

Comment thread python/cuml/cuml/experimental/linear_model/lars.pyx
Comment thread python/cuml/cuml/experimental/linear_model/lars.pyx

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
python/cuml/cuml/preprocessing/encoders.py (1)

335-348: ⚠️ Potential issue | 🟠 Major

Validate DataFrame schema, not just width.

check_features(self, X) only checks n_features_in_. For DataFrame inputs with the same width but a different column order, this path still encodes in X.columns order, so the output block layout changes after fit. Reindex against self._features and raise on missing/extra columns before the transform loop.

Proposed fix
 def transform(self, X):
     """Transform X using one-hot encoding."""
     check_is_fitted(self)
     check_features(self, X)

     X = self._check_input(X)
+    if hasattr(X, "columns"):
+        missing = [f for f in self._features if f not in X.columns]
+        extra = [c for c in X.columns if c not in self._features]
+        if missing or extra:
+            raise ValueError(
+                f"X columns must match the fitted feature set. Missing={missing}, extra={extra}"
+            )
+        X = X[self._features]

@@
-            for feature in X.columns:
+            for feature in self._features:
                 encoder = self._encoders[feature]
In scikit-learn 1.5.0, does `OneHotEncoder.transform` preserve the fitted feature order for DataFrame inputs and reject missing or extra columns even when the column count matches?

As per coding guidelines, predict/transform methods must call check_is_fitted(), validate input dimensions against fitted dimensions, handle input type correctly (cuDF, pandas, NumPy), and ensure output type is consistent with input type.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/preprocessing/encoders.py` around lines 335 - 348, The
transform method only checks n_features_in_ via check_features(self, X) and then
iterates X.columns, which allows a DataFrame with the same width but reordered,
missing, or extra columns to silently produce a different encoding layout;
modify transform to first validate the DataFrame schema against the fitted
feature list (self._features) by checking for missing or unexpected columns and
raising an error if any are found, then reindex X to self._features (preserving
fitted column order) before calling self._check_input and iterating encoders in
self._encoders; ensure error messages reference the offending column names and
maintain input-type handling (cuDF/pandas/NumPy) and existing check_is_fitted()
usage.
python/cuml/cuml/internals/validation.py (1)

56-71: ⚠️ Potential issue | 🟠 Major

Don't infer a feature count from empty Python sequences.

[] currently returns 0 here, so any fit-like path that calls check_features(..., reset=True) can record a bogus n_features_in_ before the real validation layer rejects the input. The broad except Exception keeps the same “guess a width” behavior for other malformed row objects. This helper is now on the hot path for many estimators, so it should fail fast on empty/ambiguous sequence inputs instead of synthesizing a feature count.

Proposed fix
 def _get_n_features(X):
     if isinstance(X, (list, tuple)):
         if len(X) == 0:
-            return 0
+            raise TypeError(
+                "Unable to determine the number of features from an empty input"
+            )
         row = X[0]
@@
         if not isinstance(row, (str, bytes, dict)):
             try:
                 return len(row)
-            except Exception:
+            except TypeError:
                 pass
         return 1
In scikit-learn 1.5.0 source, what does `sklearn.utils.validation._num_features([])` do for an empty Python sequence, and does it infer `0` features or raise?

As per coding guidelines, behavior for edge cases (empty arrays, single sample) must match scikit-learn.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/internals/validation.py` around lines 56 - 71, The helper
_get_n_features currently returns 0 for empty Python sequences which causes
downstream code (e.g., check_features and n_features_in_) to record a bogus
feature count; change _get_n_features so that if X is a list or tuple and len(X)
== 0 it raises a ValueError (with a clear message like "Cannot infer number of
features from an empty sequence") instead of returning 0, leaving the existing
try/except behavior for non-empty rows intact so malformed row objects still
fall through to later validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/cuml/preprocessing/encoders.py`:
- Around line 42-47: The _set_features helper only sets self.feature_names_in_
when X has a columns attribute, so if you refit from a DataFrame to a plain
array the old feature names linger; update _set_features (used by encoders' fit
paths) so that when X has no "columns" attribute it deletes/clears
self.feature_names_in_ (e.g., delattr or pop) before calling
check_features(self, X, reset=True), otherwise ensure feature_names_in_ is
removed on refit to reflect the new input without column names.



📥 Commits

Reviewing files that changed from the base of the PR and between f925fa9 and bd8a3b9.

📒 Files selected for processing (7)
  • python/cuml/cuml/internals/validation.py
  • python/cuml/cuml/neighbors/nearest_neighbors.pyx
  • python/cuml/cuml/preprocessing/encoders.py
  • python/cuml/cuml/preprocessing/onehotencoder_mg.py
  • python/cuml/cuml/preprocessing/ordinalencoder_mg.py
  • python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
  • python/cuml/tests/test_validation.py
💤 Files with no reviewable changes (1)
  • python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • python/cuml/cuml/preprocessing/ordinalencoder_mg.py

Comment thread python/cuml/cuml/preprocessing/encoders.py
@jcrist
Member Author

jcrist commented Mar 11, 2026

/merge

@rapids-bot rapids-bot Bot merged commit 415760f into rapidsai:main Mar 11, 2026
243 of 249 checks passed
Contributor

@csadorf csadorf left a comment


LGTM!

Comment thread python/cuml/cuml/naive_bayes/naive_bayes.py


4 participants