Add cuml.internals.validation, check_is_fitted checks #7868

rapids-bot[bot] merged 21 commits into rapidsai:main
📝 Walkthrough

Adds a centralized validation module.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
python/cuml/cuml/decomposition/incremental_pca.py (1)

395-439: ⚠️ Potential issue | 🟡 Minor

Pass `convert_dtype` through to `PCA.transform()`. This method documents `convert_dtype`, but both branches ignore it and always call `super().transform(...)` with the parent's default. Callers that opt into coercion still get the old behavior, which can surface as avoidable dtype errors during inference.

🛠️ Suggested fix

```diff
     for batch in _gen_batches(
         n_samples,
         self.batch_size_,
         min_batch_size=self.n_components or 0,
     ):
-        output.append(super().transform(X[batch]))
+        output.append(
+            super().transform(
+                X[batch], convert_dtype=convert_dtype
+            )
+        )
     output, _, _, _ = input_to_cuml_array(cp.vstack(output), order="K")
     return output
 else:
-    return super().transform(X)
+    return super().transform(X, convert_dtype=convert_dtype)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/decomposition/incremental_pca.py` around lines 395-439: the transform method documents convert_dtype but never forwards it to the parent PCA.transform; update both calls to super().transform so they pass convert_dtype through (in the sparse branch call super().transform(X[batch], convert_dtype=convert_dtype) for each minibatch before stacking, and in the dense branch call super().transform(X, convert_dtype=convert_dtype)); ensure the rest of the sparse-path logic (cp.vstack + input_to_cuml_array) remains unchanged so dtype conversion behavior matches the parent's transform.

python/cuml/cuml/random_projection/random_projection.py (1)
82-116: ⚠️ Potential issue | 🟠 Major

Persist and validate `n_features_in_` before projecting.

`transform()` now rejects unfitted estimators, but it still never checks that `X.shape[1]` matches the fitted feature count. On both dense and sparse inputs, a width mismatch falls through to a backend matmul failure instead of a deterministic `ValueError`. Please store `n_features_in_` in `fit()` and validate it in `transform()` before `X @ components.T`.

As per coding guidelines, "predict/transform methods must call check_is_fitted(), validate input dimensions against fitted dimensions, handle input type correctly (cuDF, pandas, NumPy), and ensure output type is consistent with input type".

🛠️ Suggested fix

```diff
 @generate_docstring()
 @reflect(reset=True)
 def fit(self, X, y=None, *, convert_dtype=True):
     """Generate a random projection matrix."""
     n_samples, n_features = X.shape
+    self.n_features_in_ = n_features
     # Prefer float32, unless `convert_dtype=False` and the input is float64
     if convert_dtype:
         dtype = np.float32
@@
 @generate_docstring()
 @reflect
 def transform(self, X, *, convert_dtype=True) -> CumlArray:
     """Project the data by taking the matrix product with the random matrix."""
     check_is_fitted(self)
     # Coerce X to a cupy array or cupyx sparse matrix
     index = None
     if sp.issparse(X):
         X = cp_sp.csr_matrix(X)
-    elif not cp_sp.issparse(X):
+    if cp_sp.issparse(X):
+        if X.shape[1] != self.n_features_in_:
+            raise ValueError(
+                f"X has {X.shape[1]} features, but this estimator was fitted "
+                f"with {self.n_features_in_} features."
+            )
+    else:
         X_m = input_to_cuml_array(
             X,
             convert_to_dtype=(np.float32 if convert_dtype else None),
             check_dtype=[np.float32, np.float64],
+            check_cols=self.n_features_in_,
             order="K",
         ).array
         index = X_m.index
         X = X_m.to_output("cupy")
```

Also applies to: 120-154
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/random_projection/random_projection.py` around lines 82-116: in fit(), persist the number of input features by setting self.n_features_in_ = n_features (from X.shape[1]) after computing n_features and before generating self.components_, and ensure dtype selection remains unchanged; in transform(), first call check_is_fitted(self, "components_") then validate the incoming X has matching feature width by extracting its column count (handle NumPy arrays, sparse matrices and dataframe-like inputs) and raise a clear ValueError if X.shape[1] != self.n_features_in_; only after that perform the projection using self.components_ to avoid backend matmul errors and ensure the output type follows the input-type handling convention used elsewhere.

python/cuml/cuml/linear_model/base.py (1)
62-80: ⚠️ Potential issue | 🟠 Major

Validate sparse input width before the matmul.

The dense branch enforces `check_cols=self.n_features_in_`, but the sparse branch skips that and relies on `X @ coef.T` to fail later. That produces a backend shape error instead of the same deterministic `ValueError` the dense path returns. Please add an explicit column-count check for sparse inputs too.

As per coding guidelines, "predict/transform methods must call check_is_fitted(), validate input dimensions against fitted dimensions, handle input type correctly (cuDF, pandas, NumPy), and ensure output type is consistent with input type".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/linear_model/base.py` around lines 62-80: the sparse branch in decision_function skips validating input width, so add an explicit column-count check after wrapping X into SparseCumlArray (the branch where is_sparse(X) is True and X is converted with convert_to_dtype=self.coef_.dtype) to compare X.shape[1] (or the SparseCumlArray's column count) against self.n_features_in_ and raise the same ValueError as the dense path when they differ; ensure this validation happens before assigning out_index or converting X to cupy so the method consistently validates input dimensions like the dense branch (function: decision_function; symbols: is_sparse, SparseCumlArray, self.n_features_in_, self.coef_.dtype, out_index).

python/cuml/cuml/naive_bayes/naive_bayes.py (1)
195-216: ⚠️ Potential issue | 🟠 Major

Reject wrong feature counts before computing likelihoods.

`check_is_fitted(self)` fixes the unfitted case, but this shared inference path still never checks `X.shape[1]` against the fitted model. For `GaussianNB`, `_check_X()` is a no-op, so wrong-width inputs fall through to `_joint_log_likelihood()` and fail with broadcasting errors instead of a clear `ValueError`. Reuse `self.n_features_` here before calling `_joint_log_likelihood()`.

As per coding guidelines, "predict/transform methods must call check_is_fitted(), validate input dimensions against fitted dimensions, handle input type correctly (cuDF, pandas, NumPy), and ensure output type is consistent with input type".

🛠️ Suggested fix

```diff
 X = self._check_X(X)
+if X.shape[1] != self.n_features_:
+    raise ValueError(
+        f"X has {X.shape[1]} features, but this estimator was fitted "
+        f"with {self.n_features_} features."
+    )
 jll = self._joint_log_likelihood(X)
```

Also applies to: 237-258
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/naive_bayes/naive_bayes.py` around lines 195-216: the predict method currently validates fitted state but does not check that the number of features in X matches the fitted model; before calling self._joint_log_likelihood() (and likewise in the sibling block around lines 237-258, e.g., predict_proba), validate X.shape[1] against self.n_features_ (or handle None/indexed sparse cases) and raise a clear ValueError if they differ; place this check after converting X (after X = self._check_X(X) or immediately after extracting X.array for dense/sparse paths) so mismatched widths fail with a descriptive error instead of a broadcasting exception.

python/cuml/cuml/neighbors/nearest_neighbors.pyx (1)
648-687: ⚠️ Potential issue | 🟠 Major

Canonicalize the minkowski metric before algorithm selection and the `effective_metric_` assignment.

The current code validates and selects algorithms against the raw constructor `metric` without canonicalizing minkowski aliases. This causes two problems:

1. Algorithm selection regression: `metric='minkowski'` with `p=2` should be canonicalized to `'euclidean'` before the RBC check (line 652), but currently isn't. Since `'euclidean'` is in `VALID_METRICS["rbc"]` but `'minkowski'` is not, RBC is incorrectly rejected.
2. Effective metric semantics violation: sklearn canonicalizes minkowski into specific metrics (`p=1` → `'manhattan'`, `p=2` → `'euclidean'`, `p=inf` → `'chebyshev'`) and stores the canonicalized name in `effective_metric_`. The current code sets `effective_metric_ = self.metric`, preserving the raw name. Query methods (lines 615, 619, 823, 858, 895, 1248, 1276) use `effective_metric_`, so downstream code receives the wrong metric name (e.g., `'minkowski'` instead of `'euclidean'`).

🩹 Suggested change

```diff
+        effective_metric = self.metric
+        effective_metric_params = dict(self.metric_params or {})
+        if effective_metric in {"minkowski", "lp"}:
+            if self.p == 1:
+                effective_metric = "manhattan"
+            elif self.p == 2:
+                effective_metric = "euclidean"
+            elif np.isinf(self.p):
+                effective_metric = "chebyshev"
+            else:
+                effective_metric_params["p"] = self.p
+
         if self.algorithm == "auto":
             if (
                 self.n_features_in_ in (2, 3)
                 and not sparse
-                and self.metric in cuml.neighbors.VALID_METRICS["rbc"]
+                and effective_metric in cuml.neighbors.VALID_METRICS["rbc"]
                 and X.shape[0]**0.5 >= self.n_neighbors
             ):
                 self._fit_method = "rbc"
@@
-        if self.metric not in valid_metrics[self._fit_method]:
+        if effective_metric not in valid_metrics[self._fit_method]:
             raise ValueError(
-                f"Metric {self.metric} is not supported. See "
+                f"Metric {effective_metric} is not supported. See "
                 f"`cuml.neighbors.VALID_METRICS{'_SPARSE' * sparse}[{self._fit_method!r}]`"
                 f"for a list of valid options."
             )
@@
-                self.metric,
+                effective_metric,
                 self._fit_method,
                 params=self.algo_params,
                 p=self.p,
             )
         elif self._fit_method == "rbc":
-            self._index = RBCIndex.build(self._fit_X, self.metric)
+            self._index = RBCIndex.build(self._fit_X, effective_metric)

-        self.effective_metric_ = self.metric
-        self.effective_metric_params_ = self.metric_params
+        self.effective_metric_ = effective_metric
+        self.effective_metric_params_ = effective_metric_params
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx` around lines 648 - 687, Canonicalize minkowski-style metrics before algorithm selection and before assigning effective_metric_: detect when self.metric is 'minkowski' (or an alias) and map it using self.p to the sklearn-style canonical name ('manhattan' if p==1, 'euclidean' if p==2, 'chebyshev' if p==np.inf, otherwise leave as 'minkowski' with p), then use that canonical_metric for membership checks against cuml.neighbors.VALID_METRICS (so the RBC check uses 'euclidean' for p==2) and set self.effective_metric_ = canonical_metric (and keep self.effective_metric_params_ = self.metric_params); update references around algorithm selection (the block using self.algorithm, self._fit_method, VALID_METRICS, and the assignments to self._index and effective_metric_).python/cuml/cuml/decomposition/pca.pyx (1)
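The canonicalization the comment asks for can be sketched as a small standalone helper. This is an illustrative sketch, not the actual cuML code: the function name `canonicalize_metric` is hypothetical, and in the real fix this logic is inlined in `NearestNeighbors.fit()`.

```python
import numpy as np


def canonicalize_metric(metric, p, metric_params=None):
    """Map minkowski aliases to their simpler equivalents, sklearn-style.

    Hypothetical standalone helper; mirrors the simplification rules
    p=1 -> manhattan, p=2 -> euclidean, p=inf -> chebyshev.
    """
    params = dict(metric_params or {})
    # metric_params takes precedence over the p kwarg, matching sklearn.
    p = params.pop("p", p)
    if metric in {"minkowski", "lp"}:
        if p == 1:
            return "manhattan", params
        if p == 2:
            return "euclidean", params
        if np.isinf(p):
            return "chebyshev", params
        # A genuine minkowski metric keeps p as a derived parameter.
        params["p"] = p
    return metric, params
```

With this shape, both the RBC eligibility check and `effective_metric_` would consult the canonical name rather than the raw constructor value.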
python/cuml/cuml/decomposition/pca.pyx (1)

613-618: ⚠️ Potential issue | 🟠 Major

Specify `components_` as the fitted attribute check in `inverse_transform` and `transform`.

`fit()` sets `n_samples_` and `n_components_` before calling `_fit_dense()` or `_fit_sparse()`. If validation raises an error (e.g., `n_components` exceeds available columns), a bare `check_is_fitted(self)` will still pass because sklearn's default checks for any trailing-underscore attribute. Code then fails with AttributeError when accessing `components_`, `mean_`, and `singular_values_`, which are only set after successful decomposition. Use `check_is_fitted(self, attributes=["components_"])` to ensure the fit actually succeeded.

🩹 Suggested change

```diff
-check_is_fitted(self)
+check_is_fitted(self, attributes=["components_"])
```

Applies to both `inverse_transform` (lines 613–618) and `transform` (lines 704–708).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/decomposition/pca.pyx` around lines 613 - 618, Replace the bare check_is_fitted(self) calls in inverse_transform and transform with a check that explicitly requires the components_ attribute (e.g., check_is_fitted(self, attributes=["components_"])) so that a partially-run fit that set other trailing-underscore names (like n_samples_ or n_components_) but failed before computing components_, mean_, and singular_values_ will be caught; update both inverse_transform and transform to call check_is_fitted(self, attributes=["components_"]) before accessing components_, mean_, or singular_values_.
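The failure mode behind this comment is easy to reproduce with sklearn's validation helper. A minimal sketch, assuming scikit-learn is installed; the `Partial` estimator is hypothetical and stands in for a fit that raises after setting early attributes:

```python
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class Partial(BaseEstimator):
    def fit(self, X):
        self.n_samples_ = len(X)  # set early, before the decomposition
        raise RuntimeError("decomposition failed")


est = Partial()
try:
    est.fit([[1.0], [2.0]])
except RuntimeError:
    pass

# Bare check passes: any trailing-underscore attribute counts as "fitted".
check_is_fitted(est)

# Anchoring on components_ catches the partial fit.
try:
    check_is_fitted(est, attributes=["components_"])
except NotFittedError:
    print("not fitted")
```

This is why the review asks for `attributes=["components_"]`: it ties "fitted" to an attribute that only exists after the decomposition actually succeeded.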
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/cuml/common/exceptions.py`:
- Around line 14-19: The FutureWarning emitted in cuml.common.exceptions via the
warnings.warn call for the deprecated NotFittedError currently lacks a
stacklevel, causing the trace to point to this module instead of the caller;
update the warnings.warn invocation in this module (the deprecation block that
mentions `cuml.common.exceptions.NotFittedError` and
`sklearn.exceptions.NotFittedError`) to pass stacklevel=2 so the warning points
at the user's import site rather than the module's __getattr__.
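The effect of `stacklevel` can be shown with a tiny self-contained sketch; `deprecated_alias` is a hypothetical stand-in for the module `__getattr__` deprecation path:

```python
import warnings


def deprecated_alias():
    # stacklevel=2 attributes the warning to the caller's line,
    # not to this function's body.
    warnings.warn(
        "NotFittedError moved; import it from sklearn.exceptions instead.",
        FutureWarning,
        stacklevel=2,
    )
    return "NotFittedError"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    deprecated_alias()

# The recorded warning is a FutureWarning pointing at the call site.
assert caught[0].category is FutureWarning
```

Without `stacklevel=2`, the warning's filename/line would point inside the deprecation shim itself, which is exactly the problem the review flags.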
In `@python/cuml/cuml/feature_extraction/_vectorizers.py`:
- Line 12: The inverse_transform and get_feature_names methods currently access
self.vocabulary_ directly and will raise AttributeError when the vectorizer is
unfitted; update these methods (and any other places that access fitted
attributes) to use the shared fitted-state check by calling
sklearn.utils.validation.check_is_fitted (or the project's shared helper) at the
start of CountVectorizer.inverse_transform and
CountVectorizer.get_feature_names, mirroring the safeguard used in
CountVectorizer.transform, so they raise sklearn.exceptions.NotFittedError
consistently when the estimator is not fitted.
In `@python/cuml/cuml/internals/validation.py`:
- Around line 46-49: The TypeError message in validation.py uses
"{random_state!r}" but the string is missing the f-string prefix so the variable
isn't interpolated; update the raise in the validation function that checks the
random_state parameter to use an f-string (add the leading f before the string
literal containing "{random_state!r}") so the actual random_state value appears
in the error message.
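The bug here is the classic missing `f` prefix; a quick before/after sketch (the message text is illustrative, not the exact cuML string):

```python
random_state = "not-a-seed"

# Missing prefix: the placeholder is emitted literally.
broken = "random_state must be an int, got {random_state!r}"
assert "{random_state!r}" in broken

# With the f prefix, the value's repr is interpolated.
fixed = f"random_state must be an int, got {random_state!r}"
assert fixed.endswith("got 'not-a-seed'")
```

Linters such as ruff flag this pattern (F-string placeholders in a plain string), which is a cheap way to catch the mistake before review.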
In `@python/cuml/cuml/neighbors/kernel_density.py`:
- Around line 354-355: Replace the bare check_is_fitted(self) calls in
score_samples() and sample() with check_is_fitted(self, attributes="_X") so the
methods detect incomplete fit state when _X is missing (fit() may set bandwidth_
before _X); update the call in the score_samples() function and the call in the
sample() function to explicitly require the _X attribute to be present before
proceeding.
In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 750-751: The current check_is_fitted(self) calls allow partial-fit
failures to slip through because fit() sets early attributes but may fail before
initializing effective_metric_ and _index; update the check_is_fitted checks in
the nearest_neighbors implementation to require effective_metric_ (i.e., call
check_is_fitted(self, "effective_metric_")) at the three sites mentioned (near
the beginning of query methods and other call sites around the previous lines
751, 987, and 1248) so that a failed index build raises a fitting error instead
of causing later attribute access failures.
In `@python/cuml/cuml/preprocessing/encoders.py`:
- Around line 356-360: Add explicit validation in transform and
inverse_transform to verify input feature layout and encoded width: in transform
(method transform and helper _check_input usage) check that X provides the same
feature names/order as self._features (or at least contains all expected
features) and raise a clear error on missing/reordered columns; in
inverse_transform verify the incoming encoded array's width matches the fitted
encoder's expected total encoded width (derived from self._features and
per-feature cardinalities) before slicing feature blocks and raise if
mismatched. Update both transform and inverse_transform to perform these checks
up front and use self._features and the fitted metadata (cardinalities/offsets)
to compute expected sizes so cuDF/pandas/ndarray differences are validated
consistently.
In `@python/cuml/cuml/preprocessing/TargetEncoder.py`:
- Line 353: Add a boolean success flag (e.g. _fit_succeeded or
_is_fitted_successfully) that is set to False at the start of fit()/before
_fit_transform() begins and only set to True at the very end of fit() after all
work completes; implement __sklearn_is_fitted__(self, attributes=None) to return
that flag so check_is_fitted(self) uses it instead of checking
trailing-underscore names; also ensure _attrs_from_cpu() sets the flag to True
when deserialization restores a fully fitted state. Update references to
check_is_fitted/fit/_fit_transform/_attrs_from_cpu to use this new flag and do
not change other behavior.
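The success-flag pattern this prompt describes can be sketched generically, assuming scikit-learn is installed. The class and flag names below are illustrative, not the actual `TargetEncoder` code:

```python
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class Flagged(BaseEstimator):
    def fit(self, X, y=None):
        self._fit_succeeded = False      # cleared before any work
        self.n_features_in_ = len(X[0])  # early attr, set before heavy work
        # ... expensive encoding work that may raise goes here ...
        self._fit_succeeded = True       # only reached on full success
        return self

    def __sklearn_is_fitted__(self):
        # check_is_fitted(self) consults this hook instead of scanning
        # for trailing-underscore attributes.
        return getattr(self, "_fit_succeeded", False)


est = Flagged()
try:
    check_is_fitted(est)
except NotFittedError:
    print("unfitted")

est.fit([[1, 2]], [0])
check_is_fitted(est)  # passes only after a fully successful fit
```

Deserialization paths (like `_attrs_from_cpu()` in the review) would set the same flag when restoring a fully fitted estimator.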
In `@python/cuml/cuml/svm/svc.py`:
- Around line 521-523: The check_is_fitted calls currently use the bare
heuristic (check_is_fitted(self)) which can return true if fit() partially set
trailing-underscore attributes before _fit() failed; change these to anchor on
the explicit fit completion flag by calling check_is_fitted(self,
attributes=["fit_status_"]) wherever the bare check is used (e.g., in SVC
methods predict, predict_proba, decision_function and the occurrences around the
existing calls noted at the other two locations), so that downstream access to
support_vectors_, _probA, _probB only proceeds after _fit() has successfully set
fit_status_.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 69c6442c-c3f2-4e32-b23d-204d11f61a1c
📒 Files selected for processing (48)
- python/cuml/cuml/_thirdparty/sklearn/preprocessing/_column_transformer.py
- python/cuml/cuml/_thirdparty/sklearn/preprocessing/_data.py
- python/cuml/cuml/_thirdparty/sklearn/preprocessing/_discretization.py
- python/cuml/cuml/_thirdparty/sklearn/preprocessing/_imputation.py
- python/cuml/cuml/_thirdparty/sklearn/utils/validation.py
- python/cuml/cuml/cluster/hdbscan/hdbscan.pyx
- python/cuml/cuml/cluster/kmeans.pyx
- python/cuml/cuml/cluster/spectral_clustering.pyx
- python/cuml/cuml/common/exceptions.py
- python/cuml/cuml/covariance/ledoit_wolf.py
- python/cuml/cuml/dask/cluster/kmeans.py
- python/cuml/cuml/dask/preprocessing/LabelEncoder.py
- python/cuml/cuml/datasets/classification.py
- python/cuml/cuml/decomposition/incremental_pca.py
- python/cuml/cuml/decomposition/pca.pyx
- python/cuml/cuml/decomposition/tsvd.pyx
- python/cuml/cuml/ensemble/randomforest_common.pyx
- python/cuml/cuml/feature_extraction/_tfidf.py
- python/cuml/cuml/feature_extraction/_vectorizers.py
- python/cuml/cuml/internals/utils.py
- python/cuml/cuml/internals/validation.py
- python/cuml/cuml/kernel_ridge/kernel_ridge.py
- python/cuml/cuml/linear_model/base.py
- python/cuml/cuml/linear_model/logistic_regression.py
- python/cuml/cuml/manifold/spectral_embedding.pyx
- python/cuml/cuml/manifold/t_sne.pyx
- python/cuml/cuml/manifold/umap/umap.pyx
- python/cuml/cuml/model_selection/_split.py
- python/cuml/cuml/naive_bayes/naive_bayes.py
- python/cuml/cuml/neighbors/kernel_density.py
- python/cuml/cuml/neighbors/nearest_neighbors.pyx
- python/cuml/cuml/preprocessing/LabelEncoder.py
- python/cuml/cuml/preprocessing/TargetEncoder.py
- python/cuml/cuml/preprocessing/encoders.py
- python/cuml/cuml/random_projection/random_projection.py
- python/cuml/cuml/svm/linear_svc.py
- python/cuml/cuml/svm/svc.py
- python/cuml/cuml/svm/svr.py
- python/cuml/tests/dask/test_dask_label_encoder.py
- python/cuml/tests/explainer/test_gpu_treeshap.py
- python/cuml/tests/test_incremental_pca.py
- python/cuml/tests/test_kernel_density.py
- python/cuml/tests/test_label_encoder.py
- python/cuml/tests/test_linear_svm.py
- python/cuml/tests/test_pca.py
- python/cuml/tests/test_sklearn_compatibility.py
- python/cuml/tests/test_svm.py
- python/cuml/tests/test_validation.py
💤 Files with no reviewable changes (1)
- python/cuml/cuml/internals/utils.py
Also moves `check_random_seed` to there and adds a small unit test for it.
jcrist left a comment:
I had to make a few additional changes to get effective_metric_/effective_metric_params_ working for NearestNeighbors (and friends). This means that:
- We no longer ignore the `metric_params` kwarg
- We derive `effective_metric_`/`effective_metric_params_` the same as sklearn does
- In cases where the minkowski metric selected is equal to a simpler metric, we run the simpler kernel instead (e.g. `metric="minkowski", p=2` -> `metric="euclidean"`)
This change happened in this PR since before these were defined as properties, and shuttling them back/forth from sklearn would secretly mutate the metric parameter itself, rather than the derived effective_metric_. We could have patched around things to have the exact same behavior as before, but that behavior was incorrect and it was simpler to just fix this here properly.
```diff
-cu_model = cuml.NearestNeighbors(n_neighbors=10).fit(X)
+cu_model = cuml.NearestNeighbors(metric="minkowski", n_neighbors=10).fit(X)
```
Our NearestNeighbors classes default to metric="euclidean", while sklearn's defaults to metric="minkowski" (where the default p=2 means it's effectively euclidean). Since this PR fixed the effective_metric_/effective_metric_params_ implementation to not mutate the parameters, I switched this test to hardcode the same metric for sklearn and cuml to avoid parameter differences.
```python
@pytest.mark.parametrize("use_params", [False, True])
def test_effective_metric_and_params(metric, p, effective_metric, use_params):
```
This new test checks:

- `p` can be configured from `p` or `metric_params` (with the latter taking precedence)
- combinations of `metric` and `p` simplify properly to derive `effective_metric_` and `effective_metric_params_`
- all of this matches the sklearn implementation
```yaml
- "sklearn.neighbors.tests.test_neighbors::test_precomputed_cross_validation"
- "sklearn.neighbors.tests.test_neighbors::test_query_equidistant_kth_nn[ball_tree]"
- "sklearn.neighbors.tests.test_neighbors::test_query_equidistant_kth_nn[brute]"
- "sklearn.neighbors.tests.test_neighbors::test_query_equidistant_kth_nn[kd_tree]"
```
This test was passing before, but only accidentally. sklearn wants the first points found chosen in the case of a tie. When running with the actual minkowski metric this just happened to be true for the data provided, but wasn't generally true. Now that we simplify this case to run the euclidean kernel instead, this test now happens to fail.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
python/cuml/cuml/neighbors/nearest_neighbors.pyx (1)
651-704: ⚠️ Potential issue | 🔴 Critical

A failed refit can leave stale index state behind.

`fit()` updates `_fit_X`, `n_samples_fit_`, `n_features_in_`, `effective_metric_`, and `effective_metric_params_` before validation/index construction finishes. If a second `fit()` on an already-fitted estimator raises after those assignments, the old `_index`/`_fit_method` survive and later queries can run against mismatched state. Please clear learned state up front, or compute into locals and publish the new state only after the full fit succeeds.

Based on learnings, fit() must clear previous model state, initialize all learned attributes, and state from previous fit() calls must not affect new training.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx` around lines 651 - 704, The fit() method can leave stale state (_index, _fit_method, _fit_X, n_samples_fit_, n_features_in_, effective_metric_, effective_metric_params_) if it fails mid-way; fix by either clearing previous learned state at the start (set self._index = None, self._fit_method = None, self._fit_X = None, n_samples_fit_ = 0, n_features_in_ = 0, effective_metric_ = None, effective_metric_params_ = {}) or (preferably) compute all derived values in local variables (e.g., local_effective_metric, local_effective_metric_params, local_fit_method, local_index) perform validations and call ApproxIndex.build(...) into a local_index, and only after everything succeeds assign those locals back to the instance (self._fit_X, self.n_samples_fit_, self.n_features_in_, self.effective_metric_, self.effective_metric_params_, self._fit_method, self._index); ensure any early exceptions do not leave previous _index/_fit_method in place.
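The "compute into locals, publish last" variant recommended above can be sketched generically. The class below is a hypothetical illustration of the pattern, not the actual Cython `NearestNeighbors` code:

```python
class LocalThenPublish:
    """Sketch: derived state is built in locals and assigned to the
    instance only after every step has succeeded, so a failed refit
    leaves the previous fit intact."""

    def fit(self, X, metric="euclidean"):
        fit_X = [list(row) for row in X]          # validate/copy input first
        if metric not in {"euclidean", "manhattan"}:
            raise ValueError(f"unsupported metric {metric!r}")
        index = sorted(range(len(fit_X)))         # stand-in for index build
        # Publish: nothing below this point may raise.
        self._fit_X = fit_X
        self._index = index
        self.n_samples_fit_ = len(fit_X)
        self.effective_metric_ = metric
        return self


model = LocalThenPublish().fit([[0.0], [1.0]])
try:
    model.fit([[2.0]], metric="bogus")  # refit fails before publishing
except ValueError:
    pass

# The previous fit's state survives the failed refit.
assert model.effective_metric_ == "euclidean"
assert model.n_samples_fit_ == 2
```

The alternative (clearing learned state up front) trades this "old model survives" behavior for a guaranteed-unfitted estimator after any failure; either way, queries never see a half-updated index.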
♻️ Duplicate comments (1)
python/cuml/cuml/preprocessing/TargetEncoder.py (1)
353-353: ⚠️ Potential issue | 🟠 Major

`check_is_fitted()` can still false-positive after a failed `fit()`.

`_fit_transform()` assigns trailing-underscore attrs like `n_features_in_`, `feature_names_in_`, `categories_`, and `target_type_` before the rest of fit completes. With this new `check_is_fitted(self)` call, an exception later in `fit()` still leaves `transform()` looking fitted and able to run against missing/stale `encode_all` or `train_encode`. Please add `__sklearn_is_fitted__()` backed by a success-only flag that is cleared at fit start and set only on success; `_attrs_from_cpu()` should restore that flag too so `from_sklearn()` instances remain usable.

Based on learnings, model state management must ensure fit/predict/transform maintain consistent state, fit() resets all learned attributes, and state from previous fit() calls does not affect new training.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/preprocessing/TargetEncoder.py` at line 353, check_is_fitted(self) can return a false-positive when fit() fails because _fit_transform()/fit() set trailing-underscore attrs (n_features_in_, feature_names_in_, categories_, target_type_) before fit completes; add a success-only fitted flag exposed via __sklearn_is_fitted__() that is cleared at the start of fit() and only set to True at the end of a successful fit (and set by _attrs_from_cpu() when restoring state in from_sklearn()); update fit(), _fit_transform(), transform(), and _attrs_from_cpu() to use this flag so transform() and check_is_fitted() rely on the new __sklearn_is_fitted__() result (ensuring encode_all and train_encode are present and consistent) and ensure any exception during fit leaves the flag False and learned attributes either cleared or considered invalid.
🧹 Nitpick comments (2)
python/cuml/cuml/preprocessing/LabelEncoder.py (1)
134-134: Consider removing unused_fittedattribute.The
_fittedattribute is initialized but never read or updated elsewhere in the class. The fitted state is determined by__sklearn_is_fitted__()which checksself.classes_ is not None.🧹 Suggested cleanup
```diff
     ) -> None:
         super().__init__(verbose=verbose, output_type=output_type)
         self.classes_ = None
         self.dtype = None
-        self._fitted: bool = False
         self.handle_unknown = handle_unknown
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/cuml/preprocessing/LabelEncoder.py` at line 134, Remove the unused _fitted attribute from the LabelEncoder class: delete the self._fitted: bool = False initialization in __init__ and any related comments or type hints referencing _fitted, since fitted state is already determined by __sklearn_is_fitted__ (which checks self.classes_). Ensure there are no other references to _fitted in methods (e.g., fit/transform/inverse_transform) and run tests/type checks to confirm no residual references remain.python/cuml/tests/test_nearest_neighbors.py (1)
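Deriving fitted state from the learned attribute itself, as the nitpick suggests, removes the need for a separate flag entirely. A minimal sketch (the class and `fit` body are hypothetical, only the `classes_`-based check mirrors the comment):

```python
class MiniLabelEncoder:
    """Sketch: fitted state is derived from the learned attribute,
    so no separate _fitted flag needs to be kept in sync."""

    def __init__(self):
        self.classes_ = None

    def __sklearn_is_fitted__(self):
        # Fitted exactly when the learned attribute has been populated.
        return self.classes_ is not None

    def fit(self, y):
        self.classes_ = sorted(set(y))
        return self


enc = MiniLabelEncoder()
assert not enc.__sklearn_is_fitted__()
enc.fit(["b", "a", "b"])
assert enc.__sklearn_is_fitted__()
assert enc.classes_ == ["a", "b"]
```

The trade-off is that any code path that sets `classes_` partway through a failed fit would make the object look fitted, which is why the success-only flag is preferred for multi-attribute estimators.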
431-433: Consider using `_` for unused variable.
The static analysis correctly identifies that `y` is never used. Consider using `_` as a placeholder to indicate the variable is intentionally ignored.
♻️ Suggested fix
```diff
-    X, y = sklearn.datasets.make_blobs(
+    X, _ = sklearn.datasets.make_blobs(
         n_samples=nrows, n_features=n_feats, random_state=0
     )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/tests/test_nearest_neighbors.py` around lines 431 - 433, The test assigns two outputs from sklearn.datasets.make_blobs but never uses y; update the tuple unpacking to use a throwaway variable (replace y with _) to make intent clear and satisfy static analysis: locate the line where X, y = sklearn.datasets.make_blobs(...) in test_nearest_neighbors.py and change it so only X is used and the second value is assigned to _.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/cuml/dask/preprocessing/LabelEncoder.py`:
- Line 8: Replace the hand-rolled fitted-state checks in LabelEncoder.transform
and LabelEncoder.inverse_transform with the shared helper by calling
cuml.internals.validation.check_is_fitted(self) at the start of each method
(remove the manual inspection of self._fitted and the custom NotFittedError
usage), and delete the unused sklearn.exceptions.NotFittedError import; keep the
subsequent dimension and input-type validation logic but rely on check_is_fitted
to raise the standardized error so behavior matches other estimators.
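The replacement this prompt asks for amounts to one shared call at the top of each method. A self-contained sketch with a tiny stand-in for `check_is_fitted` (the real code would import it from `cuml.internals.validation`; the `Encoder` class here is hypothetical):

```python
class NotFittedError(ValueError, AttributeError):
    """Stand-in mirroring sklearn.exceptions.NotFittedError's base classes."""


def check_is_fitted(estimator):
    """Tiny stand-in for sklearn.utils.validation.check_is_fitted: honor the
    __sklearn_is_fitted__ hook if defined, else look for learned
    (trailing-underscore) attributes."""
    if hasattr(estimator, "__sklearn_is_fitted__"):
        fitted = estimator.__sklearn_is_fitted__()
    else:
        fitted = any(
            name.endswith("_") and not name.startswith("__")
            for name in vars(estimator)
        )
    if not fitted:
        raise NotFittedError(
            f"This {type(estimator).__name__} instance is not fitted yet."
        )


class Encoder:
    def fit(self, y):
        self.classes_ = sorted(set(y))
        return self

    def transform(self, y):
        check_is_fitted(self)  # one shared call replaces manual flag checks
        return [self.classes_.index(v) for v in y]


enc = Encoder()
raised = False
try:
    enc.transform(["a"])
except NotFittedError:
    raised = True
assert raised
assert enc.fit(["b", "a"]).transform(["a", "b"]) == [0, 1]
```

Because the error type subclasses both `ValueError` and `AttributeError`, existing `except` clauses for either keep working, which is what makes the standardized error a drop-in replacement.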
In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 657-666: The code pops "p" from self.effective_metric_params_
unconditionally which loses the parameter for the "lp" alias; change the logic
in nearest_neighbors.pyx so you read p via get (p =
self.effective_metric_params_.get("p", self.p)) and only pop/remove "p" when
self.effective_metric_ == "minkowski" (so the subsequent branch that maps p to
"manhattan"/"euclidean"/"chebyshev" can pop), leaving metric_params["p"] intact
for the "lp" alias; update any uses of self.effective_metric_params_ accordingly
(refer to self.effective_metric_params_, self.effective_metric_, and the p
handling block).
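The suggested `p` handling can be sketched as a pure function: read `p` via `get` so it survives, and only consume the `"p"` entry when the metric is `"minkowski"`, leaving the `"lp"` alias's parameter intact. The function name and return shape are illustrative, not the actual `nearest_neighbors.pyx` code:

```python
def resolve_metric(metric, p, metric_params):
    """Resolve a metric name plus its effective params.

    Reads p without destroying it, and only pops "p" for "minkowski",
    where it is folded into a named metric when possible.
    """
    params = dict(metric_params or {})
    p = params.get("p", p)  # read via get: works for both aliases
    if metric == "minkowski":
        params.pop("p", None)  # only minkowski maps p to a named metric
        if p == 1:
            return "manhattan", params
        if p == 2:
            return "euclidean", params
        if p == float("inf"):
            return "chebyshev", params
        params["p"] = p  # general minkowski still needs p
        return "minkowski", params
    return metric, params  # "lp" etc.: params (including "p") left intact


assert resolve_metric("minkowski", 2, {}) == ("euclidean", {})
assert resolve_metric("minkowski", 3, {"p": 1}) == ("manhattan", {})
assert resolve_metric("lp", 2, {"p": 3}) == ("lp", {"p": 3})
```

The last assertion is the bug the comment targets: an unconditional `pop("p")` would have returned `("lp", {})` and silently dropped the parameter.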
In `@python/cuml/cuml/preprocessing/encoders.py`:
- Around line 260-263: The class currently only sets _fitted True on success, so
if a later fit() fails mid-way stale state remains; modify fit/_fit to reset
_fitted to False and clear learned attributes (e.g. _encoders, drop_idx_) at the
start of fitting or, better, build new encoder state in local variables and only
assign to instance attributes and set _fitted = True after successful
completion; ensure __sklearn_is_fitted__ continues to rely on _fitted so failed
fits do not leave the object appearing fitted.
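The "build in locals, assign on success" pattern recommended here can be shown with a toy encoder; everything below (class name, attributes, the failure mode) is an illustrative assumption rather than the real `encoders.py` code:

```python
class Encoder:
    """Sketch of the locals-then-assign fit pattern: compute everything in
    local variables and only mutate the instance once nothing can fail."""

    def __init__(self):
        self._fitted = False

    def __sklearn_is_fitted__(self):
        return self._fitted

    def fit(self, columns):
        self._fitted = False  # a re-fit immediately invalidates prior state
        encoders = {}  # built locally; self stays untouched on failure
        for name, values in columns.items():
            if not values:
                raise ValueError(f"column {name!r} is empty")
            encoders[name] = sorted(set(values))
        # success: commit atomically
        self._encoders = encoders
        self._fitted = True
        return self


enc = Encoder().fit({"a": [1, 2]})
assert enc.__sklearn_is_fitted__()
try:
    enc.fit({"a": [1], "b": []})  # second column fails mid-way
except ValueError:
    pass
assert not enc.__sklearn_is_fitted__()  # stale state no longer reported as fit
```

Note the stale `_encoders` may still exist after the failed re-fit, but clearing the flag first means `check_is_fitted` will refuse to use it.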
---
Outside diff comments:
In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 651-704: The fit() method can leave stale state (_index,
_fit_method, _fit_X, n_samples_fit_, n_features_in_, effective_metric_,
effective_metric_params_) if it fails mid-way; fix by either clearing previous
learned state at the start (set self._index = None, self._fit_method = None,
self._fit_X = None, n_samples_fit_ = 0, n_features_in_ = 0, effective_metric_ =
None, effective_metric_params_ = {}) or (preferably) compute all derived values
in local variables (e.g., local_effective_metric, local_effective_metric_params,
local_fit_method, local_index) perform validations and call
ApproxIndex.build(...) into a local_index, and only after everything succeeds
assign those locals back to the instance (self._fit_X, self.n_samples_fit_,
self.n_features_in_, self.effective_metric_, self.effective_metric_params_,
self._fit_method, self._index); ensure any early exceptions do not leave
previous _index/_fit_method in place.
---
Duplicate comments:
In `@python/cuml/cuml/preprocessing/TargetEncoder.py`:
- Line 353: check_is_fitted(self) can return a false-positive when fit() fails
because _fit_transform()/fit() set trailing-underscore attrs (n_features_in_,
feature_names_in_, categories_, target_type_) before fit completes; add a
success-only fitted flag exposed via __sklearn_is_fitted__() that is cleared at
the start of fit() and only set to True at the end of a successful fit (and set
by _attrs_from_cpu() when restoring state in from_sklearn()); update fit(),
_fit_transform(), transform(), and _attrs_from_cpu() to use this flag so
transform() and check_is_fitted() rely on the new __sklearn_is_fitted__() result
(ensuring encode_all and train_encode are present and consistent) and ensure any
exception during fit leaves the flag False and learned attributes either cleared
or considered invalid.
---
Nitpick comments:
In `@python/cuml/cuml/preprocessing/LabelEncoder.py`:
- Line 134: Remove the unused _fitted attribute from the LabelEncoder class:
delete the self._fitted: bool = False initialization in __init__ and any related
comments or type hints referencing _fitted, since fitted state is already
determined by __sklearn_is_fitted__ (which checks self.classes_). Ensure there
are no other references to _fitted in methods (e.g.,
fit/transform/inverse_transform) and run tests/type checks to confirm no
residual references remain.
In `@python/cuml/tests/test_nearest_neighbors.py`:
- Around line 431-433: The test assigns two outputs from
sklearn.datasets.make_blobs but never uses y; update the tuple unpacking to use
a throwaway variable (replace y with _) to make intent clear and satisfy static
analysis: locate the line where X, y = sklearn.datasets.make_blobs(...) in
test_nearest_neighbors.py and change it so only X is used and the second value
is assigned to _.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 56a1da5c-74cd-4b93-872f-fd94ef5d5876
📒 Files selected for processing (51)
- python/cuml/cuml/_thirdparty/sklearn/preprocessing/_column_transformer.py
- python/cuml/cuml/_thirdparty/sklearn/preprocessing/_data.py
- python/cuml/cuml/_thirdparty/sklearn/preprocessing/_discretization.py
- python/cuml/cuml/_thirdparty/sklearn/preprocessing/_imputation.py
- python/cuml/cuml/_thirdparty/sklearn/utils/validation.py
- python/cuml/cuml/cluster/hdbscan/hdbscan.pyx
- python/cuml/cuml/cluster/kmeans.pyx
- python/cuml/cuml/cluster/spectral_clustering.pyx
- python/cuml/cuml/common/exceptions.py
- python/cuml/cuml/covariance/ledoit_wolf.py
- python/cuml/cuml/dask/cluster/kmeans.py
- python/cuml/cuml/dask/preprocessing/LabelEncoder.py
- python/cuml/cuml/datasets/classification.py
- python/cuml/cuml/decomposition/incremental_pca.py
- python/cuml/cuml/decomposition/pca.pyx
- python/cuml/cuml/decomposition/tsvd.pyx
- python/cuml/cuml/ensemble/randomforest_common.pyx
- python/cuml/cuml/feature_extraction/_tfidf.py
- python/cuml/cuml/feature_extraction/_vectorizers.py
- python/cuml/cuml/internals/utils.py
- python/cuml/cuml/internals/validation.py
- python/cuml/cuml/kernel_ridge/kernel_ridge.py
- python/cuml/cuml/linear_model/base.py
- python/cuml/cuml/linear_model/logistic_regression.py
- python/cuml/cuml/manifold/spectral_embedding.pyx
- python/cuml/cuml/manifold/t_sne.pyx
- python/cuml/cuml/manifold/umap/umap.pyx
- python/cuml/cuml/model_selection/_split.py
- python/cuml/cuml/naive_bayes/naive_bayes.py
- python/cuml/cuml/neighbors/kernel_density.py
- python/cuml/cuml/neighbors/nearest_neighbors.pyx
- python/cuml/cuml/preprocessing/LabelEncoder.py
- python/cuml/cuml/preprocessing/TargetEncoder.py
- python/cuml/cuml/preprocessing/encoders.py
- python/cuml/cuml/random_projection/random_projection.py
- python/cuml/cuml/svm/linear_svc.py
- python/cuml/cuml/svm/svc.py
- python/cuml/cuml/svm/svr.py
- python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
- python/cuml/tests/dask/test_dask_label_encoder.py
- python/cuml/tests/explainer/test_gpu_treeshap.py
- python/cuml/tests/test_incremental_pca.py
- python/cuml/tests/test_kernel_density.py
- python/cuml/tests/test_label_encoder.py
- python/cuml/tests/test_linear_svm.py
- python/cuml/tests/test_nearest_neighbors.py
- python/cuml/tests/test_pca.py
- python/cuml/tests/test_sklearn_compatibility.py
- python/cuml/tests/test_sklearn_import_export.py
- python/cuml/tests/test_svm.py
- python/cuml/tests/test_validation.py
💤 Files with no reviewable changes (1)
- python/cuml/cuml/internals/utils.py
✅ Files skipped from review due to trivial changes (1)
- python/cuml/cuml/manifold/t_sne.pyx
🚧 Files skipped from review as they are similar to previous changes (21)
- python/cuml/cuml/decomposition/incremental_pca.py
- python/cuml/tests/test_incremental_pca.py
- python/cuml/cuml/manifold/spectral_embedding.pyx
- python/cuml/cuml/_thirdparty/sklearn/preprocessing/_imputation.py
- python/cuml/cuml/covariance/ledoit_wolf.py
- python/cuml/tests/test_svm.py
- python/cuml/cuml/ensemble/randomforest_common.pyx
- python/cuml/cuml/cluster/spectral_clustering.pyx
- python/cuml/cuml/svm/svr.py
- python/cuml/cuml/decomposition/tsvd.pyx
- python/cuml/cuml/svm/svc.py
- python/cuml/cuml/cluster/hdbscan/hdbscan.pyx
- python/cuml/tests/test_pca.py
- python/cuml/cuml/neighbors/kernel_density.py
- python/cuml/cuml/linear_model/logistic_regression.py
- python/cuml/cuml/_thirdparty/sklearn/preprocessing/_data.py
- python/cuml/tests/test_label_encoder.py
- python/cuml/cuml/manifold/umap/umap.pyx
- python/cuml/cuml/datasets/classification.py
- python/cuml/tests/dask/test_dask_label_encoder.py
- python/cuml/tests/test_linear_svm.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/cuml/neighbors/nearest_neighbors.pyx`:
- Around line 600-603: When unpickling older NearestNeighbors objects,
__setstate__ must backfill missing fitted attributes so _effective_p and the
query/index rebuild paths don't raise AttributeError; update
NearestNeighbors.__setstate__ to check for and set defaults for
effective_metric_ and effective_metric_params_ (e.g., derive effective_metric_
from metric and effective_metric_params_ from metric_params or {}), and for
index-based models (rbc, ivf*, any attributes used by the index rebuild logic)
ensure the minimal set of attributes the rebuild/query code expects are present
(populate with sensible defaults or compute them) so _effective_p (which reads
effective_metric_params_.get("p", self.p)) and downstream code work for old
pickles.
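The `__setstate__` backfill this prompt describes is a standard pickle-compatibility pattern: load whatever the old payload contains, then derive any attributes newer code expects. The attribute names follow the review comment; the class body and defaults are assumptions for illustration:

```python
class Model:
    """Sketch of backfilling attributes in __setstate__ so pickles from
    versions predating effective_metric_ still load and query cleanly."""

    def __init__(self, metric="minkowski", p=2, metric_params=None):
        self.metric = metric
        self.p = p
        self.metric_params = metric_params
        self.effective_metric_ = metric
        self.effective_metric_params_ = dict(metric_params or {})

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Old pickles lack the effective_* attributes: derive them here.
        if "effective_metric_" not in self.__dict__:
            self.effective_metric_ = self.metric
        if "effective_metric_params_" not in self.__dict__:
            self.effective_metric_params_ = dict(self.metric_params or {})

    def _effective_p(self):
        # This is the read that would raise AttributeError on old pickles
        # without the backfill above.
        return self.effective_metric_params_.get("p", self.p)


# Simulate an old pickle payload that predates the effective_* attributes.
old_state = {"metric": "minkowski", "p": 3, "metric_params": None}
m = Model.__new__(Model)  # bypass __init__, as unpickling does
m.__setstate__(old_state)
assert m.effective_metric_ == "minkowski"
assert m._effective_p() == 3
```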
In `@python/cuml/tests/test_nearest_neighbors.py`:
- Around line 431-433: The test currently binds an unused variable `y` from
sklearn.datasets.make_blobs; change the call so only `X` is assigned (e.g.,
replace "X, y = sklearn.datasets.make_blobs(...)" with a single-assignment that
extracts just the features, such as assigning X =
sklearn.datasets.make_blobs(..., return_X_y=True)[0]) so the unused `y` binding
is removed while keeping the same data generation for `X`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9022be10-4ff2-4d8f-9da7-bdb7f7745816
📒 Files selected for processing (4)
- python/cuml/cuml/neighbors/nearest_neighbors.pyx
- python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml
- python/cuml/tests/test_nearest_neighbors.py
- python/cuml/tests/test_sklearn_import_export.py
I wouldn't be surprised if there's a failure in the upstream cuml-accel tests somewhere, but any remaining failures should be quick fixes. This is ready for review. |
viclafargue
left a comment
Thanks, LGTM for the most part! I guess using sklearn's NotFittedError allows stronger conformance and more checks passing in test_sklearn_compatibility.py. However, isn't it a concern that something as critical as check_is_fitted would now depend on the sklearn dependency (that can span over multiple versions)? Would it not introduce a higher risk of breaking things when updating the scikit-learn dependency?
/merge
This is spun out of a far-too-large branch I have containing validation improvements. Trying to split things up into smaller chunks, and this is the first one.
This:
- Adds a `cuml.internals.validation` module for common validation functions (similar to `sklearn.utils.validation`).
- Moves `check_random_seed` to that module, and adds a test for it in `test_validation.py`.
- Re-exports `sklearn.utils.validation.check_is_fitted` into `cuml.internals.validation` for easy access.
- Deprecates `cuml.common.exceptions.NotFittedError` in favor of the upstream `sklearn.exceptions.NotFittedError`. Since we depend on sklearn, we should just reuse their exception. For now this module re-exports sklearn's `NotFittedError` (with a deprecation warning).
- Fixes bugs in `NearestNeighbors`/`KNeighborsRegressor`/`KNeighborsClassifier`/`OneHotEncoder` that led to unfit estimators appearing to be fit.
- Removes `check_is_fitted` from our `thirdparty_adapters` in favor of just relying on the actual upstream `check_is_fitted`.
- Adds `check_is_fitted` to all our cleaned up estimators. We're probably still missing it in some places, but anything we're testing for sklearn compat now has it in all the proper places.
- Fixes bugs in `HDBSCAN`/`GaussianRandomProjection`/`SparseRandomProjection` and adds a sanity check to prevent issues like this from happening again.

As per our policy, I've marked this as breaking for the deprecation of `cuml.common.exceptions`. Nothing else is breaking.

Part of #7061