
Add cuml.internals.validation, check_is_fitted checks #7868

Merged
rapids-bot[bot] merged 21 commits into rapidsai:main from jcrist:check-is-fitted
Mar 10, 2026

Conversation

@jcrist
Member

@jcrist jcrist commented Mar 6, 2026

This is spun out of a far-too-large branch I have containing validation improvements. Trying to split things up into smaller chunks, and this is the first one.

This:

  • Adds a new cuml.internals.validation module for common validation functions (similar to sklearn.utils.validation).
  • Moves check_random_seed to that module, and adds a test for it in test_validation.py.
  • Imports sklearn.utils.validation.check_is_fitted into cuml.internals.validation for easy access.
  • Deprecates cuml.common.exceptions.NotFittedError in favor of the upstream sklearn.exceptions.NotFittedError. Since we depend on sklearn, we should just reuse their exception. For now this module re-exports sklearn's NotFittedError (with a deprecation warning).
  • Fixes bugs in NearestNeighbors/KNeighborsRegressor/KNeighborsClassifier/OneHotEncoder that led to unfit estimators appearing to be fit.
  • Deletes a copy-pasted version of check_is_fitted from our thirdparty_adapters in favor of just relying on the actual upstream check_is_fitted.
  • Applies check_is_fitted to all our cleaned up estimators. We're probably still missing it in some places, but anything we're testing for sklearn compat now has it in all the proper places.
  • Fixes the sklearn compat tests to actually test HDBSCAN/GaussianRandomProjection/SparseRandomProjection and adds a sanity check to prevent issues like this from happening again.
  • Deletes a bunch of xfails for tests that now pass.

As per our policy, I've marked this as breaking for the deprecation of cuml.common.exceptions. Nothing else is breaking.

Part of #7061

@jcrist jcrist self-assigned this Mar 6, 2026
@jcrist jcrist requested a review from a team as a code owner March 6, 2026 21:44
@jcrist jcrist requested a review from dantegd March 6, 2026 21:44
@jcrist jcrist added the improvement (Improvement / enhancement to an existing function), breaking (Breaking change), non-breaking (Non-breaking change), and sklearn-api-compat (Issues around cuml matching sklearn API conventions/standards) labels Mar 6, 2026
@github-actions github-actions Bot added the Cython / Python (Cython or Python issue) label Mar 6, 2026
@jcrist jcrist removed the non-breaking (Non-breaking change) label Mar 6, 2026
@coderabbitai

coderabbitai Bot commented Mar 6, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Centralized fitted-state validation and unified random-seed handling across many estimators.
    • Nearest-neighbors now exposes effective metric information to better align with scikit-learn.
  • Bug Fixes

    • More consistent precondition checks to prevent operations on unfitted models.
  • Deprecations

    • NotFittedError now delegates to scikit-learn with a deprecation warning.
  • Tests

    • New and updated tests for seed handling, fitted-state validation, and neighbor-metric compatibility.

Walkthrough

Adds a centralized validation module cuml.internals.validation (new check_random_seed, re-exports check_is_fitted), removes local check_random_seed/check_is_fitted implementations, replaces the concrete NotFittedError with a lazy sklearn alias, updates many imports, and inserts check_is_fitted() precondition checks across multiple estimators and tests.
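A helper like check_random_seed normalizes the usual random_state forms (None, an int, a numpy RandomState) into a plain integer seed. The sketch below shows that general shape only; the exact behavior of cuml.internals.validation.check_random_seed may differ:

```python
# Illustrative sketch of a check_random_seed-style helper; names and
# exact semantics are assumptions, not cuml's actual implementation.
import numbers

import numpy as np


def check_random_seed(seed):
    """Coerce ``seed`` (None, int, or RandomState) into an int seed."""
    if seed is None:
        # Draw a fresh seed from a new generator.
        seed = np.random.RandomState()
    if isinstance(seed, np.random.RandomState):
        return int(seed.randint(0, np.iinfo(np.uint32).max))
    if isinstance(seed, numbers.Integral):
        return int(seed)
    raise TypeError(f"{seed!r} cannot be used to seed a random generator")
```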

Changes

Cohort / File(s) Summary
Validation core & utils
python/cuml/cuml/internals/validation.py, python/cuml/cuml/internals/utils.py, python/cuml/cuml/common/exceptions.py
Adds internals.validation with check_random_seed and re-exports check_is_fitted; removes check_random_seed from internals/utils.py; replaces concrete NotFittedError with a lazy __getattr__ that returns sklearn.exceptions.NotFittedError and emits a deprecation warning.
Third‑party sklearn preprocessing
python/cuml/cuml/_thirdparty/sklearn/preprocessing/_column_transformer.py, .../_data.py, .../_discretization.py, .../_imputation.py, .../utils/validation.py
Moves check_is_fitted imports to cuml.internals.validation and removes the local check_is_fitted helper from the third‑party utils module.
Global estimator fitted‑state checks
multiple files under python/cuml/cuml/ (cluster, decomposition, linear_model, svm, naive_bayes, kernel_ridge, ensemble, covariance, random_projection, neighbors, preprocessing, manifold, etc.)
Adds check_is_fitted(self) entry‑point checks (predict/transform/score/sample/etc.) and updates imports to use cuml.internals.validation; removes local NotFittedError/_fitted bookkeeping where replaced by centralized validation.
Nearest neighbors refactor
python/cuml/cuml/neighbors/nearest_neighbors.pyx
Introduces effective_metric_ and effective_metric_params_, removes selected_algorithm_, adds _effective_p helper, normalizes metric handling during fit, and replaces hasattr-based fitted checks with check_is_fitted.
Preprocessing & encoders
python/cuml/cuml/preprocessing/TargetEncoder.py, .../encoders.py, .../LabelEncoder.py, .../_vectorizers.py, .../_tfidf.py
Removes internal _fitted flags and _check_is_fitted helpers, introduces check_is_fitted usage, swaps some NotFittedError imports to sklearn.exceptions, and adds a __sklearn_is_fitted__ shim for encoders.
Seed import / random utilities update
python/cuml/cuml/datasets/classification.py, python/cuml/cuml/manifold/*, python/cuml/cuml/model_selection/_split.py, python/cuml/cuml/dask/..., python/cuml/cuml/cluster/..., python/cuml/cuml/random_projection/...
Replaces imports of check_random_seed from internals.utils with cuml.internals.validation and adapts call sites.
Tests, test infra, and xfail list
python/cuml/tests/*, python/cuml/tests/test_sklearn_compatibility.py, python/cuml/tests/test_validation.py, python/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
Updates tests to import NotFittedError/check_is_fitted from sklearn or new validation module; adds tests for check_random_seed; refactors compatibility test parameterization (ESTIMATORS/XFAILS); updates xfail entries.
Misc Cython modules
python/cuml/cuml/cluster/*.pyx, python/cuml/cuml/decomposition/*.pyx, python/cuml/cuml/manifold/*.pyx, python/cuml/cuml/ensemble/*.pyx
Adds imports and calls to centralized check_is_fitted and moves seed validation imports where applicable; removes local NotFittedError usage in these modules.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • dantegd
  • csadorf
  • hcho3
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage (⚠️ Warning): Docstring coverage is 70.59%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Title check (✅ Passed): The title clearly and concisely summarizes the main change: adding a new validation module and check_is_fitted checks across the codebase.
Description check (✅ Passed): The description is comprehensive and directly related to the changeset, detailing the new validation module, deprecations, bug fixes, and test improvements introduced by this PR.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
python/cuml/cuml/decomposition/incremental_pca.py (1)

395-439: ⚠️ Potential issue | 🟡 Minor

Pass convert_dtype through to PCA.transform().

This method documents convert_dtype, but both branches ignore it and always call super().transform(...) with the parent's default. Callers that opt into coercion still get the old behavior, which can surface as avoidable dtype errors during inference.

🛠️ Suggested fix
             for batch in _gen_batches(
                 n_samples,
                 self.batch_size_,
                 min_batch_size=self.n_components or 0,
             ):
-                output.append(super().transform(X[batch]))
+                output.append(
+                    super().transform(
+                        X[batch], convert_dtype=convert_dtype
+                    )
+                )
             output, _, _, _ = input_to_cuml_array(cp.vstack(output), order="K")
 
             return output
         else:
-            return super().transform(X)
+            return super().transform(X, convert_dtype=convert_dtype)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/decomposition/incremental_pca.py` around lines 395 - 439,
The transform method documents convert_dtype but never forwards it to the parent
PCA.transform; update both calls to super().transform so they pass convert_dtype
through (e.g., in the sparse branch call super().transform(X[batch],
convert_dtype=convert_dtype) for each minibatch before stacking, and in the
dense branch call super().transform(X, convert_dtype=convert_dtype)); ensure the
rest of the sparse-path logic (cp.vstack + input_to_cuml_array) remains
unchanged so dtype conversion behavior matches the parent's transform.
python/cuml/cuml/random_projection/random_projection.py (1)

82-116: ⚠️ Potential issue | 🟠 Major

Persist and validate n_features_in_ before projecting.

transform() now rejects unfitted estimators, but it still never checks that X.shape[1] matches the fitted feature count. On both dense and sparse inputs, a width mismatch falls through to a backend matmul failure instead of a deterministic ValueError. Please store n_features_in_ in fit() and validate it in transform() before X @ components.T.

🛠️ Suggested fix
     @generate_docstring()
     @reflect(reset=True)
     def fit(self, X, y=None, *, convert_dtype=True):
         """Generate a random projection matrix."""
         n_samples, n_features = X.shape
+        self.n_features_in_ = n_features
 
         # Prefer float32, unless `convert_dtype=False` and the input is float64
         if convert_dtype:
             dtype = np.float32
@@
     @generate_docstring()
     @reflect
     def transform(self, X, *, convert_dtype=True) -> CumlArray:
         """Project the data by taking the matrix product with the random matrix."""
         check_is_fitted(self)
 
         # Coerce X to a cupy array or cupyx sparse matrix
         index = None
         if sp.issparse(X):
             X = cp_sp.csr_matrix(X)
-        elif not cp_sp.issparse(X):
+        if cp_sp.issparse(X):
+            if X.shape[1] != self.n_features_in_:
+                raise ValueError(
+                    f"X has {X.shape[1]} features, but this estimator was fitted "
+                    f"with {self.n_features_in_} features."
+                )
+        else:
             X_m = input_to_cuml_array(
                 X,
                 convert_to_dtype=(np.float32 if convert_dtype else None),
                 check_dtype=[np.float32, np.float64],
+                check_cols=self.n_features_in_,
                 order="K",
             ).array
             index = X_m.index
             X = X_m.to_output("cupy")
As per coding guidelines, "predict/transform methods must call check_is_fitted(), validate input dimensions against fitted dimensions, handle input type correctly (cuDF, pandas, NumPy), and ensure output type is consistent with input type".

Also applies to: 120-154

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/random_projection/random_projection.py` around lines 82 -
116, In fit(), persist the number of input features by setting
self.n_features_in_ = n_features (from X.shape[1]) after computing n_features
and before generating self.components_, and ensure dtype selection remains
unchanged; in transform(), first call check_is_fitted(self, "components_") then
validate the incoming X has matching feature width by extracting its column
count (handle NumPy arrays, sparse matrices and dataframe-like inputs) and raise
a clear ValueError if X.shape[1] != self.n_features_in_; only after that perform
the projection using self.components_ to avoid backend matmul errors and ensure
the output type follows the input-type handling convention used elsewhere.
python/cuml/cuml/linear_model/base.py (1)

62-80: ⚠️ Potential issue | 🟠 Major

Validate sparse input width before the matmul.

The dense branch enforces check_cols=self.n_features_in_, but the sparse branch skips that and relies on X @ coef.T to fail later. That produces a backend shape error instead of the same deterministic ValueError the dense path returns. Please add an explicit column-count check for sparse inputs too.

As per coding guidelines, "predict/transform methods must call check_is_fitted(), validate input dimensions against fitted dimensions, handle input type correctly (cuDF, pandas, NumPy), and ensure output type is consistent with input type".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/linear_model/base.py` around lines 62 - 80, The sparse
branch in decision_function skips validating input width so add an explicit
column-count check after wrapping X into SparseCumlArray (the branch where
is_sparse(X) is True and X is converted with convert_to_dtype=self.coef_.dtype)
to compare X.shape[1] (or the SparseCumlArray's column count) against
self.n_features_in_ and raise the same ValueError as the dense path when they
differ; ensure this validation happens before assigning out_index or converting
X to cupy so the method consistently validates input dimensions like the dense
branch (function: decision_function; symbols: is_sparse, SparseCumlArray,
self.n_features_in_, self.coef_.dtype, out_index).
python/cuml/cuml/naive_bayes/naive_bayes.py (1)

195-216: ⚠️ Potential issue | 🟠 Major

Reject wrong feature counts before computing likelihoods.

check_is_fitted(self) fixes the unfitted case, but this shared inference path still never checks X.shape[1] against the fitted model. For GaussianNB, _check_X() is a no-op, so wrong-width inputs fall through to _joint_log_likelihood() and fail with broadcasting errors instead of a clear ValueError. Reuse self.n_features_ here before calling _joint_log_likelihood().

🛠️ Suggested fix
         X = self._check_X(X)
+        if X.shape[1] != self.n_features_:
+            raise ValueError(
+                f"X has {X.shape[1]} features, but this estimator was fitted "
+                f"with {self.n_features_} features."
+            )
         jll = self._joint_log_likelihood(X)
As per coding guidelines, "predict/transform methods must call check_is_fitted(), validate input dimensions against fitted dimensions, handle input type correctly (cuDF, pandas, NumPy), and ensure output type is consistent with input type".

Also applies to: 237-258

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/naive_bayes/naive_bayes.py` around lines 195 - 216, The
predict method currently validates fitted state but does not check that the
number of features in X matches the fitted model; before calling
self._joint_log_likelihood() (and likewise in the sibling block around lines
237-258, e.g., predict_proba), validate X.shape[1] against self.n_features_ (or
handle None/indexed sparse cases) and raise a clear ValueError if they differ;
place this check after converting X (after X = self._check_X(X) or immediately
after extracting X.array for dense/sparse paths) so mismatched widths fail with
a descriptive error instead of a broadcasting exception.
python/cuml/cuml/neighbors/nearest_neighbors.pyx (1)

648-687: ⚠️ Potential issue | 🟠 Major

Canonicalize minkowski metric before algorithm selection and effective_metric_ assignment.

The current code validates and selects algorithms against the raw constructor metric without canonicalizing minkowski aliases. This causes two problems:

  1. Algorithm selection regression: metric='minkowski' with p=2 should be canonicalized to 'euclidean' before the RBC check (line 652), but currently isn't. Since 'euclidean' is in VALID_METRICS["rbc"] but 'minkowski' is not, RBC is incorrectly rejected.

  2. Effective metric semantics violation: sklearn canonicalizes minkowski into specific metrics (p=1 → 'manhattan', p=2 → 'euclidean', p=inf → 'chebyshev') and stores the canonicalized name in effective_metric_. The current code sets effective_metric_ = self.metric, preserving the raw name. Query methods (lines 615, 619, 823, 858, 895, 1248, 1276) use effective_metric_, so downstream code receives the wrong metric name (e.g., 'minkowski' instead of 'euclidean').

🩹 Suggested change
+        effective_metric = self.metric
+        effective_metric_params = dict(self.metric_params or {})
+        if effective_metric in {"minkowski", "lp"}:
+            if self.p == 1:
+                effective_metric = "manhattan"
+            elif self.p == 2:
+                effective_metric = "euclidean"
+            elif np.isinf(self.p):
+                effective_metric = "chebyshev"
+            else:
+                effective_metric_params["p"] = self.p
+
         if self.algorithm == "auto":
             if (
                 self.n_features_in_ in (2, 3)
                 and not sparse
-                and self.metric in cuml.neighbors.VALID_METRICS["rbc"]
+                and effective_metric in cuml.neighbors.VALID_METRICS["rbc"]
                 and X.shape[0]**0.5 >= self.n_neighbors
             ):
                 self._fit_method = "rbc"
@@
-        if self.metric not in valid_metrics[self._fit_method]:
+        if effective_metric not in valid_metrics[self._fit_method]:
             raise ValueError(
-                f"Metric {self.metric} is not supported. See "
+                f"Metric {effective_metric} is not supported. See "
                 f"`cuml.neighbors.VALID_METRICS{'_SPARSE' * sparse}[{self._fit_method!r}]`"
                 f"for a list of valid options."
             )
@@
-                self.metric,
+                effective_metric,
                 self._fit_method,
                 params=self.algo_params,
                 p=self.p,
             )
         elif self._fit_method == "rbc":
-            self._index = RBCIndex.build(self._fit_X, self.metric)
+            self._index = RBCIndex.build(self._fit_X, effective_metric)
 
-        self.effective_metric_ = self.metric
-        self.effective_metric_params_ = self.metric_params
+        self.effective_metric_ = effective_metric
+        self.effective_metric_params_ = effective_metric_params
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx` around lines 648 - 687,
Canonicalize minkowski-style metrics before algorithm selection and before
assigning effective_metric_: detect when self.metric is 'minkowski' (or an
alias) and map it using self.p to the sklearn-style canonical name ('manhattan'
if p==1, 'euclidean' if p==2, 'chebyshev' if p==np.inf, otherwise leave as
'minkowski' with p), then use that canonical_metric for membership checks
against cuml.neighbors.VALID_METRICS (so the RBC check uses 'euclidean' for
p==2) and set self.effective_metric_ = canonical_metric (and keep
self.effective_metric_params_ = self.metric_params); update references around
algorithm selection (the block using self.algorithm, self._fit_method,
VALID_METRICS, and the assignments to self._index and effective_metric_).
python/cuml/cuml/decomposition/pca.pyx (1)

613-618: ⚠️ Potential issue | 🟠 Major

Specify components_ as the fitted attribute check in inverse_transform and transform.

fit() sets n_samples_ and n_components_ before calling _fit_dense() or _fit_sparse(). If validation raises an error (e.g., n_components exceeds available columns), bare check_is_fitted(self) will still pass because sklearn's default checks for any trailing-underscore attribute. Code then fails with AttributeError when accessing components_, mean_, and singular_values_, which are only set after successful decomposition. Use check_is_fitted(self, attributes=["components_"]) to ensure the fit actually succeeded.

🩹 Suggested change
-        check_is_fitted(self)
+        check_is_fitted(self, attributes=["components_"])

Applies to both inverse_transform (lines 613–618) and transform (lines 704–708).
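The failure mode described here is easy to reproduce with sklearn directly: the bare check passes as soon as any trailing-underscore attribute exists, while attributes=["components_"] requires that specific attribute. The toy estimator below stands in for a fit() that failed midway; it is not the cuml PCA code:

```python
# Toy reproduction of the partial-fit pitfall; PartialFitEstimator is
# hypothetical and simulates fit() failing after partial state is set.
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class PartialFitEstimator(BaseEstimator):
    def fit(self, X=None, y=None):
        # Bookkeeping attribute is set first...
        self.n_samples_ = 10
        # ...then the decomposition fails before components_ exists.
        raise RuntimeError("decomposition failed")


est = PartialFitEstimator()
try:
    est.fit()
except RuntimeError:
    pass

check_is_fitted(est)  # passes: n_samples_ looks "fitted"
try:
    check_is_fitted(est, attributes=["components_"])
except NotFittedError:
    print("partial fit detected")
```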

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/decomposition/pca.pyx` around lines 613 - 618, Replace the
bare check_is_fitted(self) calls in inverse_transform and transform with a check
that explicitly requires the components_ attribute (e.g., check_is_fitted(self,
attributes=["components_"])) so that a partially-run fit that set other
trailing-underscore names (like n_samples_ or n_components_) but failed before
computing components_, mean_, and singular_values_ will be caught; update both
inverse_transform and transform to call check_is_fitted(self,
attributes=["components_"]) before accessing components_, mean_, or
singular_values_.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/cuml/common/exceptions.py`:
- Around line 14-19: The FutureWarning emitted in cuml.common.exceptions via the
warnings.warn call for the deprecated NotFittedError currently lacks a
stacklevel, causing the trace to point to this module instead of the caller;
update the warnings.warn invocation in this module (the deprecation block that
mentions `cuml.common.exceptions.NotFittedError` and
`sklearn.exceptions.NotFittedError`) to pass stacklevel=2 so the warning points
at the user's import site rather than the module's __getattr__.

In `@python/cuml/cuml/feature_extraction/_vectorizers.py`:
- Line 12: The inverse_transform and get_feature_names methods currently access
self.vocabulary_ directly and will raise AttributeError when the vectorizer is
unfitted; update these methods (and any other places that access fitted
attributes) to use the shared fitted-state check by calling
sklearn.utils.validation.check_is_fitted (or the project's shared helper) at the
start of CountVectorizer.inverse_transform and
CountVectorizer.get_feature_names, mirroring the safeguard used in
CountVectorizer.transform, so they raise sklearn.exceptions.NotFittedError
consistently when the estimator is not fitted.

In `@python/cuml/cuml/internals/validation.py`:
- Around line 46-49: The TypeError message in validation.py uses
"{random_state!r}" but the string is missing the f-string prefix so the variable
isn't interpolated; update the raise in the validation function that checks the
random_state parameter to use an f-string (add the leading f before the string
literal containing "{random_state!r}") so the actual random_state value appears
in the error message.

In `@python/cuml/cuml/neighbors/kernel_density.py`:
- Around line 354-355: Replace the bare check_is_fitted(self) calls in
score_samples() and sample() with check_is_fitted(self, attributes="_X") so the
methods detect incomplete fit state when _X is missing (fit() may set bandwidth_
before _X); update the call in the score_samples() function and the call in the
sample() function to explicitly require the _X attribute to be present before
proceeding.

In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 750-751: The current check_is_fitted(self) calls allow partial-fit
failures to slip through because fit() sets early attributes but may fail before
initializing effective_metric_ and _index; update the check_is_fitted checks in
the nearest_neighbors implementation to require effective_metric_ (i.e., call
check_is_fitted(self, "effective_metric_")) at the three sites mentioned (near
the beginning of query methods and other call sites around the previous lines
751, 987, and 1248) so that a failed index build raises a fitting error instead
of causing later attribute access failures.

In `@python/cuml/cuml/preprocessing/encoders.py`:
- Around line 356-360: Add explicit validation in transform and
inverse_transform to verify input feature layout and encoded width: in transform
(method transform and helper _check_input usage) check that X provides the same
feature names/order as self._features (or at least contains all expected
features) and raise a clear error on missing/reordered columns; in
inverse_transform verify the incoming encoded array's width matches the fitted
encoder's expected total encoded width (derived from self._features and
per-feature cardinalities) before slicing feature blocks and raise if
mismatched. Update both transform and inverse_transform to perform these checks
up front and use self._features and the fitted metadata (cardinalities/offsets)
to compute expected sizes so cuDF/pandas/ndarray differences are validated
consistently.

In `@python/cuml/cuml/preprocessing/TargetEncoder.py`:
- Line 353: Add a boolean success flag (e.g. _fit_succeeded or
_is_fitted_successfully) that is set to False at the start of fit()/before
_fit_transform() begins and only set to True at the very end of fit() after all
work completes; implement __sklearn_is_fitted__(self) to return
that flag so check_is_fitted(self) uses it instead of checking
trailing-underscore names; also ensure _attrs_from_cpu() sets the flag to True
when deserialization restores a fully fitted state. Update references to
check_is_fitted/fit/_fit_transform/_attrs_from_cpu to use this new flag and do
not change other behavior.

In `@python/cuml/cuml/svm/svc.py`:
- Around line 521-523: The check_is_fitted calls currently use the bare
heuristic (check_is_fitted(self)) which can return true if fit() partially set
trailing-underscore attributes before _fit() failed; change these to anchor on
the explicit fit completion flag by calling check_is_fitted(self,
attributes=["fit_status_"]) wherever the bare check is used (e.g., in SVC
methods predict, predict_proba, decision_function and the occurrences around the
existing calls noted at the other two locations), so that downstream access to
support_vectors_, _probA, _probB only proceeds after _fit() has successfully set
fit_status_.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 69c6442c-c3f2-4e32-b23d-204d11f61a1c

📥 Commits

Reviewing files that changed from the base of the PR and between d28fd9d and 9733dd3.

📒 Files selected for processing (48)
  • python/cuml/cuml/_thirdparty/sklearn/preprocessing/_column_transformer.py
  • python/cuml/cuml/_thirdparty/sklearn/preprocessing/_data.py
  • python/cuml/cuml/_thirdparty/sklearn/preprocessing/_discretization.py
  • python/cuml/cuml/_thirdparty/sklearn/preprocessing/_imputation.py
  • python/cuml/cuml/_thirdparty/sklearn/utils/validation.py
  • python/cuml/cuml/cluster/hdbscan/hdbscan.pyx
  • python/cuml/cuml/cluster/kmeans.pyx
  • python/cuml/cuml/cluster/spectral_clustering.pyx
  • python/cuml/cuml/common/exceptions.py
  • python/cuml/cuml/covariance/ledoit_wolf.py
  • python/cuml/cuml/dask/cluster/kmeans.py
  • python/cuml/cuml/dask/preprocessing/LabelEncoder.py
  • python/cuml/cuml/datasets/classification.py
  • python/cuml/cuml/decomposition/incremental_pca.py
  • python/cuml/cuml/decomposition/pca.pyx
  • python/cuml/cuml/decomposition/tsvd.pyx
  • python/cuml/cuml/ensemble/randomforest_common.pyx
  • python/cuml/cuml/feature_extraction/_tfidf.py
  • python/cuml/cuml/feature_extraction/_vectorizers.py
  • python/cuml/cuml/internals/utils.py
  • python/cuml/cuml/internals/validation.py
  • python/cuml/cuml/kernel_ridge/kernel_ridge.py
  • python/cuml/cuml/linear_model/base.py
  • python/cuml/cuml/linear_model/logistic_regression.py
  • python/cuml/cuml/manifold/spectral_embedding.pyx
  • python/cuml/cuml/manifold/t_sne.pyx
  • python/cuml/cuml/manifold/umap/umap.pyx
  • python/cuml/cuml/model_selection/_split.py
  • python/cuml/cuml/naive_bayes/naive_bayes.py
  • python/cuml/cuml/neighbors/kernel_density.py
  • python/cuml/cuml/neighbors/nearest_neighbors.pyx
  • python/cuml/cuml/preprocessing/LabelEncoder.py
  • python/cuml/cuml/preprocessing/TargetEncoder.py
  • python/cuml/cuml/preprocessing/encoders.py
  • python/cuml/cuml/random_projection/random_projection.py
  • python/cuml/cuml/svm/linear_svc.py
  • python/cuml/cuml/svm/svc.py
  • python/cuml/cuml/svm/svr.py
  • python/cuml/tests/dask/test_dask_label_encoder.py
  • python/cuml/tests/explainer/test_gpu_treeshap.py
  • python/cuml/tests/test_incremental_pca.py
  • python/cuml/tests/test_kernel_density.py
  • python/cuml/tests/test_label_encoder.py
  • python/cuml/tests/test_linear_svm.py
  • python/cuml/tests/test_pca.py
  • python/cuml/tests/test_sklearn_compatibility.py
  • python/cuml/tests/test_svm.py
  • python/cuml/tests/test_validation.py
💤 Files with no reviewable changes (1)
  • python/cuml/cuml/internals/utils.py

Comment thread python/cuml/cuml/common/exceptions.py
Comment thread python/cuml/cuml/feature_extraction/_vectorizers.py
Comment thread python/cuml/cuml/internals/validation.py
Comment thread python/cuml/cuml/neighbors/kernel_density.py
Comment thread python/cuml/cuml/neighbors/nearest_neighbors.pyx
Comment thread python/cuml/cuml/preprocessing/encoders.py
Comment thread python/cuml/cuml/preprocessing/TargetEncoder.py
Comment thread python/cuml/cuml/svm/svc.py
@jcrist force-pushed the check-is-fitted branch from 1e17d71 to 674f836 on March 9, 2026 21:36
@jcrist (Member, Author) left a comment:

I had to make a few additional changes to get effective_metric_/effective_metric_params_ working for NearestNeighbors (and friends). This means that:

  • We no longer ignore the metric_params kwarg
  • We derive effective_metric_/effective_metric_params_ the same as sklearn does
  • In cases where the minkowski metric selected is equal to a simpler metric, we run the simpler kernel instead (e.g. metric="minkowski", p=2 -> metric="euclidean").

This change landed in this PR because these were previously defined as properties, and shuttling them back and forth from sklearn would silently mutate the metric parameter itself rather than the derived effective_metric_. We could have patched around things to preserve the exact prior behavior, but that behavior was incorrect, and it was simpler to just fix it properly here.

)

cu_model = cuml.NearestNeighbors(n_neighbors=10).fit(X)
cu_model = cuml.NearestNeighbors(metric="minkowski", n_neighbors=10).fit(X)
@jcrist (Member, Author) left a comment:

Our NearestNeighbors classes default to metric="euclidean", while sklearn's defaults to metric="minkowski" (where the default p=2 makes it effectively euclidean). Since this PR fixed the effective_metric_/effective_metric_params_ implementation to not mutate the parameters, I switched this test to hardcode the same metric for sklearn and cuml to avoid parameter differences.

],
)
@pytest.mark.parametrize("use_params", [False, True])
def test_effective_metric_and_params(metric, p, effective_metric, use_params):
@jcrist (Member, Author) left a comment:

This new test checks:

  • p can be configured from p or metric_params (with the latter taking precedence)
  • combinations of metric and p simplify properly to derive effective_metric_ and effective_metric_params_
  • All of this matches the sklearn implementation

- "sklearn.neighbors.tests.test_neighbors::test_precomputed_cross_validation"
- "sklearn.neighbors.tests.test_neighbors::test_query_equidistant_kth_nn[ball_tree]"
- "sklearn.neighbors.tests.test_neighbors::test_query_equidistant_kth_nn[brute]"
- "sklearn.neighbors.tests.test_neighbors::test_query_equidistant_kth_nn[kd_tree]"
@jcrist (Member, Author) left a comment:

This test was passing before, but only accidentally. sklearn wants the first points found chosen in the case of a tie. When running with the actual minkowski metric this just happened to be true for the data provided, but wasn't generally true. Now that we simplify this case to run the euclidean kernel instead, this test now happens to fail.

@coderabbitai (Bot) left a comment:

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cuml/cuml/neighbors/nearest_neighbors.pyx (1)

651-704: ⚠️ Potential issue | 🔴 Critical

A failed refit can leave stale index state behind.

fit() updates _fit_X, n_samples_fit_, n_features_in_, effective_metric_, and effective_metric_params_ before validation/index construction finishes. If a second fit() on an already-fitted estimator raises after those assignments, the old _index / _fit_method survive and later queries can run against mismatched state. Please clear learned state up front, or compute into locals and publish the new state only after the full fit succeeds.

Based on learnings, fit() must clear previous model state, initialize all learned attributes, and state from previous fit() calls must not affect new training.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx` around lines 651 - 704, The
fit() method can leave stale state (_index, _fit_method, _fit_X, n_samples_fit_,
n_features_in_, effective_metric_, effective_metric_params_) if it fails
mid-way; fix by either clearing previous learned state at the start (set
self._index = None, self._fit_method = None, self._fit_X = None, n_samples_fit_
= 0, n_features_in_ = 0, effective_metric_ = None, effective_metric_params_ =
{}) or (preferably) compute all derived values in local variables (e.g.,
local_effective_metric, local_effective_metric_params, local_fit_method,
local_index) perform validations and call ApproxIndex.build(...) into a
local_index, and only after everything succeeds assign those locals back to the
instance (self._fit_X, self.n_samples_fit_, self.n_features_in_,
self.effective_metric_, self.effective_metric_params_, self._fit_method,
self._index); ensure any early exceptions do not leave previous
_index/_fit_method in place.
♻️ Duplicate comments (1)
python/cuml/cuml/preprocessing/TargetEncoder.py (1)

353-353: ⚠️ Potential issue | 🟠 Major

check_is_fitted() can still false-positive after a failed fit().

_fit_transform() assigns trailing-underscore attrs like n_features_in_, feature_names_in_, categories_, and target_type_ before the rest of fit completes. With this new check_is_fitted(self) call, an exception later in fit() still leaves transform() looking fitted and able to run against missing/stale encode_all or train_encode. Please add __sklearn_is_fitted__() backed by a success-only flag that is cleared at fit start and set only on success; _attrs_from_cpu() should restore that flag too so from_sklearn() instances remain usable.

Based on learnings, model state management must ensure fit/predict/transform maintain consistent state, fit() resets all learned attributes, and state from previous fit() calls does not affect new training.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/preprocessing/TargetEncoder.py` at line 353,
check_is_fitted(self) can return a false-positive when fit() fails because
_fit_transform()/fit() set trailing-underscore attrs (n_features_in_,
feature_names_in_, categories_, target_type_) before fit completes; add a
success-only fitted flag exposed via __sklearn_is_fitted__() that is cleared at
the start of fit() and only set to True at the end of a successful fit (and set
by _attrs_from_cpu() when restoring state in from_sklearn()); update fit(),
_fit_transform(), transform(), and _attrs_from_cpu() to use this flag so
transform() and check_is_fitted() rely on the new __sklearn_is_fitted__() result
(ensuring encode_all and train_encode are present and consistent) and ensure any
exception during fit leaves the flag False and learned attributes either cleared
or considered invalid.
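The success-only flag pattern suggested here can be sketched as follows (a hypothetical minimal encoder, not the real TargetEncoder; `encode_all_` stands in for the learned state). sklearn's check_is_fitted() consults the `__sklearn_is_fitted__` hook when present, so the flag fully controls the fitted-state answer:

```python
class FittedFlagSketch:
    def __sklearn_is_fitted__(self):
        # check_is_fitted() uses this hook instead of scanning for
        # trailing-underscore attributes.
        return getattr(self, "_fitted", False)

    def fit(self, X):
        self._fitted = False  # cleared at fit start
        self.n_features_in_ = len(X[0])  # set early, before fit can fail
        if any(len(row) != self.n_features_in_ for row in X):
            raise ValueError("ragged input")
        values = sorted({v for row in X for v in row})
        self.encode_all_ = {v: i for i, v in enumerate(values)}
        self._fitted = True  # set only after a fully successful fit
        return self
```

A from_sklearn()-style restore path would set the same flag, so deserialized instances remain usable.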
🧹 Nitpick comments (2)
python/cuml/cuml/preprocessing/LabelEncoder.py (1)

134-134: Consider removing unused _fitted attribute.

The _fitted attribute is initialized but never read or updated elsewhere in the class. The fitted state is determined by __sklearn_is_fitted__() which checks self.classes_ is not None.

🧹 Suggested cleanup
     ) -> None:
         super().__init__(verbose=verbose, output_type=output_type)
         self.classes_ = None
         self.dtype = None
-        self._fitted: bool = False
         self.handle_unknown = handle_unknown
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/preprocessing/LabelEncoder.py` at line 134, Remove the
unused _fitted attribute from the LabelEncoder class: delete the self._fitted:
bool = False initialization in __init__ and any related comments or type hints
referencing _fitted, since fitted state is already determined by
__sklearn_is_fitted__ (which checks self.classes_). Ensure there are no other
references to _fitted in methods (e.g., fit/transform/inverse_transform) and run
tests/type checks to confirm no residual references remain.
python/cuml/tests/test_nearest_neighbors.py (1)

431-433: Consider using _ for unused variable.

The static analysis correctly identifies that y is never used. Consider using _ as a placeholder to indicate the variable is intentionally ignored.

♻️ Suggested fix
-    X, y = sklearn.datasets.make_blobs(
+    X, _ = sklearn.datasets.make_blobs(
         n_samples=nrows, n_features=n_feats, random_state=0
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/tests/test_nearest_neighbors.py` around lines 431 - 433, The test
assigns two outputs from sklearn.datasets.make_blobs but never uses y; update
the tuple unpacking to use a throwaway variable (replace y with _) to make
intent clear and satisfy static analysis: locate the line where X, y =
sklearn.datasets.make_blobs(...) in test_nearest_neighbors.py and change it so
only X is used and the second value is assigned to _.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/cuml/dask/preprocessing/LabelEncoder.py`:
- Line 8: Replace the hand-rolled fitted-state checks in LabelEncoder.transform
and LabelEncoder.inverse_transform with the shared helper by calling
cuml.internals.validation.check_is_fitted(self) at the start of each method
(remove the manual inspection of self._fitted and the custom NotFittedError
usage), and delete the unused sklearn.exceptions.NotFittedError import; keep the
subsequent dimension and input-type validation logic but rely on check_is_fitted
to raise the standardized error so behavior matches other estimators.

In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 657-666: The code pops "p" from self.effective_metric_params_
unconditionally which loses the parameter for the "lp" alias; change the logic
in nearest_neighbors.pyx so you read p via get (p =
self.effective_metric_params_.get("p", self.p)) and only pop/remove "p" when
self.effective_metric_ == "minkowski" (so the subsequent branch that maps p to
"manhattan"/"euclidean"/"chebyshev" can pop), leaving metric_params["p"] intact
for the "lp" alias; update any uses of self.effective_metric_params_ accordingly
(refer to self.effective_metric_params_, self.effective_metric_, and the p
handling block).

In `@python/cuml/cuml/preprocessing/encoders.py`:
- Around line 260-263: The class currently only sets _fitted True on success, so
if a later fit() fails mid-way stale state remains; modify fit/_fit to reset
_fitted to False and clear learned attributes (e.g. _encoders, drop_idx_) at the
start of fitting or, better, build new encoder state in local variables and only
assign to instance attributes and set _fitted = True after successful
completion; ensure __sklearn_is_fitted__ continues to rely on _fitted so failed
fits do not leave the object appearing fitted.
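The `p` handling suggested above for nearest_neighbors.pyx amounts to reading via `get` and popping only for `"minkowski"`, roughly (hypothetical helper; the real code mutates `self.effective_metric_params_` inline):

```python
def resolve_p(effective_metric, effective_metric_params, default_p=2):
    # Read p without removing it, so the "lp" alias keeps its entry.
    p = effective_metric_params.get("p", default_p)
    if effective_metric == "minkowski":
        # Only the minkowski branch (which may simplify to manhattan/
        # euclidean/chebyshev) consumes the key.
        effective_metric_params.pop("p", None)
    return p
```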


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 56a1da5c-74cd-4b93-872f-fd94ef5d5876

📥 Commits

Reviewing files that changed from the base of the PR and between 1e17d71 and 674f836.

📒 Files selected for processing (51)
  • python/cuml/cuml/_thirdparty/sklearn/preprocessing/_column_transformer.py
  • python/cuml/cuml/_thirdparty/sklearn/preprocessing/_data.py
  • python/cuml/cuml/_thirdparty/sklearn/preprocessing/_discretization.py
  • python/cuml/cuml/_thirdparty/sklearn/preprocessing/_imputation.py
  • python/cuml/cuml/_thirdparty/sklearn/utils/validation.py
  • python/cuml/cuml/cluster/hdbscan/hdbscan.pyx
  • python/cuml/cuml/cluster/kmeans.pyx
  • python/cuml/cuml/cluster/spectral_clustering.pyx
  • python/cuml/cuml/common/exceptions.py
  • python/cuml/cuml/covariance/ledoit_wolf.py
  • python/cuml/cuml/dask/cluster/kmeans.py
  • python/cuml/cuml/dask/preprocessing/LabelEncoder.py
  • python/cuml/cuml/datasets/classification.py
  • python/cuml/cuml/decomposition/incremental_pca.py
  • python/cuml/cuml/decomposition/pca.pyx
  • python/cuml/cuml/decomposition/tsvd.pyx
  • python/cuml/cuml/ensemble/randomforest_common.pyx
  • python/cuml/cuml/feature_extraction/_tfidf.py
  • python/cuml/cuml/feature_extraction/_vectorizers.py
  • python/cuml/cuml/internals/utils.py
  • python/cuml/cuml/internals/validation.py
  • python/cuml/cuml/kernel_ridge/kernel_ridge.py
  • python/cuml/cuml/linear_model/base.py
  • python/cuml/cuml/linear_model/logistic_regression.py
  • python/cuml/cuml/manifold/spectral_embedding.pyx
  • python/cuml/cuml/manifold/t_sne.pyx
  • python/cuml/cuml/manifold/umap/umap.pyx
  • python/cuml/cuml/model_selection/_split.py
  • python/cuml/cuml/naive_bayes/naive_bayes.py
  • python/cuml/cuml/neighbors/kernel_density.py
  • python/cuml/cuml/neighbors/nearest_neighbors.pyx
  • python/cuml/cuml/preprocessing/LabelEncoder.py
  • python/cuml/cuml/preprocessing/TargetEncoder.py
  • python/cuml/cuml/preprocessing/encoders.py
  • python/cuml/cuml/random_projection/random_projection.py
  • python/cuml/cuml/svm/linear_svc.py
  • python/cuml/cuml/svm/svc.py
  • python/cuml/cuml/svm/svr.py
  • python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
  • python/cuml/tests/dask/test_dask_label_encoder.py
  • python/cuml/tests/explainer/test_gpu_treeshap.py
  • python/cuml/tests/test_incremental_pca.py
  • python/cuml/tests/test_kernel_density.py
  • python/cuml/tests/test_label_encoder.py
  • python/cuml/tests/test_linear_svm.py
  • python/cuml/tests/test_nearest_neighbors.py
  • python/cuml/tests/test_pca.py
  • python/cuml/tests/test_sklearn_compatibility.py
  • python/cuml/tests/test_sklearn_import_export.py
  • python/cuml/tests/test_svm.py
  • python/cuml/tests/test_validation.py
💤 Files with no reviewable changes (1)
  • python/cuml/cuml/internals/utils.py
✅ Files skipped from review due to trivial changes (1)
  • python/cuml/cuml/manifold/t_sne.pyx
🚧 Files skipped from review as they are similar to previous changes (21)
  • python/cuml/cuml/decomposition/incremental_pca.py
  • python/cuml/tests/test_incremental_pca.py
  • python/cuml/cuml/manifold/spectral_embedding.pyx
  • python/cuml/cuml/_thirdparty/sklearn/preprocessing/_imputation.py
  • python/cuml/cuml/covariance/ledoit_wolf.py
  • python/cuml/tests/test_svm.py
  • python/cuml/cuml/ensemble/randomforest_common.pyx
  • python/cuml/cuml/cluster/spectral_clustering.pyx
  • python/cuml/cuml/svm/svr.py
  • python/cuml/cuml/decomposition/tsvd.pyx
  • python/cuml/cuml/svm/svc.py
  • python/cuml/cuml/cluster/hdbscan/hdbscan.pyx
  • python/cuml/tests/test_pca.py
  • python/cuml/cuml/neighbors/kernel_density.py
  • python/cuml/cuml/linear_model/logistic_regression.py
  • python/cuml/cuml/_thirdparty/sklearn/preprocessing/_data.py
  • python/cuml/tests/test_label_encoder.py
  • python/cuml/cuml/manifold/umap/umap.pyx
  • python/cuml/cuml/datasets/classification.py
  • python/cuml/tests/dask/test_dask_label_encoder.py
  • python/cuml/tests/test_linear_svm.py

Comment thread python/cuml/cuml/dask/preprocessing/LabelEncoder.py
Comment thread python/cuml/cuml/neighbors/nearest_neighbors.pyx
Comment thread python/cuml/cuml/preprocessing/encoders.py
@jcrist force-pushed the check-is-fitted branch from 674f836 to 3b0cf29 on March 9, 2026 21:55
@coderabbitai (Bot) left a comment:

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 600-603: When unpickling older NearestNeighbors objects,
__setstate__ must backfill missing fitted attributes so _effective_p and the
query/index rebuild paths don't raise AttributeError; update
NearestNeighbors.__setstate__ to check for and set defaults for
effective_metric_ and effective_metric_params_ (e.g., derive effective_metric_
from metric and effective_metric_params_ from metric_params or {}), and for
index-based models (rbc, ivf*, any attributes used by the index rebuild logic)
ensure the minimal set of attributes the rebuild/query code expects are present
(populate with sensible defaults or compute them) so _effective_p (which reads
effective_metric_params_.get("p", self.p)) and downstream code work for old
pickles.

In `@python/cuml/tests/test_nearest_neighbors.py`:
- Around line 431-433: The test currently binds an unused variable `y` from
sklearn.datasets.make_blobs; change the call so only `X` is assigned (e.g.,
replace "X, y = sklearn.datasets.make_blobs(...)" with a single-assignment that
extracts just the features, such as assigning X =
sklearn.datasets.make_blobs(..., return_X_y=True)[0]) so the unused `y` binding
is removed while keeping the same data generation for `X`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9022be10-4ff2-4d8f-9da7-bdb7f7745816

📥 Commits

Reviewing files that changed from the base of the PR and between 674f836 and 3b0cf29.

📒 Files selected for processing (4)
  • python/cuml/cuml/neighbors/nearest_neighbors.pyx
  • python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
  • python/cuml/tests/test_nearest_neighbors.py
  • python/cuml/tests/test_sklearn_import_export.py

Comment thread python/cuml/cuml/neighbors/nearest_neighbors.pyx
Comment thread python/cuml/tests/test_nearest_neighbors.py
@jcrist jcrist added the cuml-accel Issues related to cuml.accel label Mar 9, 2026

jcrist commented Mar 9, 2026

I wouldn't be surprised if there's a failure in the upstream cuml-accel tests somewhere, but any remaining failures should be quick fixes. This is ready for review.

@viclafargue (Contributor) left a comment:

Thanks, LGTM for the most part! I guess using sklearn's NotFittedError allows stronger conformance and more checks passing in test_sklearn_compatibility.py. However, isn't it a concern that something as critical as check_is_fitted would now depend on the sklearn dependency (that can span over multiple versions)? Would it not introduce a higher risk of breaking things when updating the scikit-learn dependency?

Comment thread python/cuml/cuml/dask/preprocessing/LabelEncoder.py
Comment thread python/cuml/cuml/feature_extraction/_tfidf.py
Comment thread python/cuml/cuml/feature_extraction/_vectorizers.py
Comment thread python/cuml/cuml/naive_bayes/naive_bayes.py
Comment thread python/cuml/cuml/preprocessing/encoders.py
Comment thread python/cuml/cuml/_thirdparty/sklearn/preprocessing/_data.py

jcrist commented Mar 10, 2026

/merge

@rapids-bot rapids-bot Bot merged commit 5f2909d into rapidsai:main Mar 10, 2026
185 of 187 checks passed
@jcrist jcrist deleted the check-is-fitted branch March 10, 2026 14:20

Labels

  • breaking — Breaking change
  • cuml-accel — Issues related to cuml.accel
  • Cython / Python — Cython or Python issue
  • improvement — Improvement / enhancement to an existing function
  • sklearn-api-compat — Issues around cuml matching sklearn API conventions/standards
