
Forward-merge release/26.04 into main #7914

Merged
gforsyth merged 8 commits into rapidsai:main from gforsyth:main-merge-release/26.04
Mar 20, 2026

Conversation

@gforsyth
Contributor

@gforsyth gforsyth commented Mar 18, 2026

Remember to /merge nosquash this one

xref #7905

Closes rapidsai#7827

The API reference section is now structured a bit more nicely. On `main` it is just one big page, which makes it hard to tell what is what and where anything lives.

🖼️  Rendered preview https://downloads.rapids.ai/ci/cuml/pull-request/7798/55e9d3b/docs/cuml/html/api/ (may be outdated, since the URL pins part of a specific commit hash)

With this PR the API reference uses subsections for the various modules while keeping a list of "all the things" on the first API page (which makes it easy to Ctrl-F for a class you know exists and want to reach quickly):

<img width="1265" height="955" alt="Screenshot 2026-02-13 at 11 56 31" src="https://github.com/user-attachments/assets/6d9d9288-0292-4e49-a7a4-d2ed4d9d42f1" />


You can navigate into a subsection and see the classes and functions available in it:

<img width="1263" height="990" alt="Screenshot 2026-02-13 at 11 57 49" src="https://github.com/user-attachments/assets/2432b3e2-e156-4b7c-891d-82edcb452fb3" />


And you can visit a particular class/function to read everything there is to know about that class:

<img width="1262" height="987" alt="Screenshot 2026-02-13 at 11 58 39" src="https://github.com/user-attachments/assets/a8e27fe7-1a19-41fe-9eaf-a2fb71ad66f4" />

---

There is an issue about restructuring the documentation, and we've discussed it a few times in the past. This morning I was frustrated enough while navigating the API reference section that I did this. I think it is an improvement on the status quo. It does surface that we could be more consistent with how we format our docstrings (e.g., without a line break after the first sentence, the entire docstring ends up in the overview table; and some metrics are shown directly on the "cuml.metrics" page while others are only referenced). That is something we can tackle in a new issue, I think.

What do people think? I'm happy to do more polishing and think about HTTP redirects we need and such, but it would be good to know if people agree that this is going in the right direction.
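As a hypothetical illustration of the docstring-formatting point above (the function names and wording are invented): autosummary-style tools generally treat everything up to the first blank line as the short summary, so a blank line after the first sentence keeps the overview table concise.

```python
def good_summary(x):
    """Compute the widget score for ``x``.

    The blank line above ends the short summary, so only the first
    sentence appears in an autosummary overview table.
    """
    return x * 2


def bad_summary(x):
    """Compute the widget score for ``x``. Without a blank line after
    the first sentence, this whole paragraph is treated as the summary
    and spills into the overview table.
    """
    return x * 2
```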

Authors:
  - Tim Head (https://github.com/betatim)
  - Simon Adorf (https://github.com/csadorf)

Approvers:
  - Gil Forsyth (https://github.com/gforsyth)
  - Simon Adorf (https://github.com/csadorf)
  - Jim Crist-Harif (https://github.com/jcrist)

URL: rapidsai#7798
@gforsyth gforsyth requested review from a team as code owners March 18, 2026 21:41
@github-actions github-actions Bot added the Cython / Python Cython or Python issue label Mar 18, 2026
@gforsyth gforsyth added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 18, 2026
@coderabbitai

coderabbitai Bot commented Mar 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5bf32df9-3833-400a-9739-2413430cc288

📥 Commits

Reviewing files that changed from the base of the PR and between 5907ea7 and 791590a.

📒 Files selected for processing (2)
  • .github/workflows/build.yaml
  • .github/workflows/pr.yaml

📝 Walkthrough

Summary by CodeRabbit

  • Documentation

    • Restructured API reference with a new API index, many module-specific pages, and enabled autosummary generation for richer, linked API docs.
  • Dependencies

    • Updated CUDA toolkit requirements to targeted 13.x handling with nvjitlink managed separately.
  • Chores

    • Updated copyright years to 2026.
    • Adjusted XGBoost training params to use CPU/GPU-compatible settings.
    • Added Python 3.13 to development container configs.
  • Refactor

    • Reorganized preprocessing encoder exports and improved distributed label handling for Dask KMeans.

Walkthrough

Restructures and splits the monolithic API docs into per-submodule RSTs with an autosummary template and enabled generation; updates Sphinx config and extensions; adjusts .gitignore; refactors encoder import paths and related tests; modifies Dask KMeans distributed label/inertia handling; updates CI/devcontainer and CUDA dependency pins.

Changes

Cohort / File(s) Summary
API index & per-module RSTs
docs/source/api.rst (deleted), docs/source/api/index.rst, docs/source/api/cuml.*.rst
Removed monolithic API page; added api/index.rst and many per-submodule API RSTs using automodule + autosummary to generate generated/ stubs.
Autosummary template & Sphinx config
docs/source/_templates/autosummary/base.rst, docs/source/conf.py
Added autosummary base template, enabled autosummary_generate = True, added rst_prolog role, and updated copyright years.
Docs nav, minor pages & ignore rules
docs/source/index.rst, docs/source/FIL.rst, .gitignore
Updated :doc: targets to api/index, adjusted FIL link, and extended .gitignore to exclude docs autosummary/generated artifacts and build outputs.
Sphinx extension
docs/source/sphinxext/github_link.py
Added guard to skip docstring regex when obj.__doc__ is None; bumped year.
Encoder import refactor
python/cuml/cuml/__init__.py, python/cuml/cuml/preprocessing/__init__.py, python/cuml/cuml/preprocessing/encoders.py, python/cuml/cuml/dask/preprocessing/__init__.py, python/cuml/cuml/dask/preprocessing/_label.py
Changed internal import targets to re-export `LabelEncoder`/`TargetEncoder` from the `_label`/`_target_encoder` modules and updated docstring examples.
Encoder tests
python/cuml/tests/test_label_encoder.py, python/cuml/tests/test_target_encoder.py, python/cuml/tests/dask/test_dask_label_encoder.py
Updated tests to import encoders from new internal module paths.
Dask KMeans & tests
python/cuml/cuml/dask/cluster/kmeans.py, python/cuml/tests/dask/test_dask_kmeans.py
Changed post-fit aggregation: keep first-worker estimator for shared params, compute global inertia by summing per-worker tuples, store labels as distributed Dask objects (avoid local concatenation), and updated tests to use fit_predict/predict flow and validate partitions.
CI / wheel scripts
ci/test_wheel.sh, ci/test_wheel_dask.sh, ci/test_wheel_integrations.sh
Use ${PIP_CONSTRAINT} directly, add --prefer-binary, and adjust constraint/extra handling for pip installs.
Dependency & pyproject changes
dependencies.yaml, python/cuml/pyproject.toml, python/libcuml/pyproject.toml
Reworked CUDA dependency matrix and pins: explicit major.minor handling, added separate nvidia-nvjitlink entries, narrowed toolkit specs (e.g., ==13.*).
Devcontainer Python override
.devcontainer/*/devcontainer.json
Added containerEnv: { "PYTHON_VERSION": "3.13" } to multiple devcontainer configs (some include TODO comments).
Misc code/test tweaks
python/cuml/cuml/solvers/cd_mg.pyx, python/cuml/cuml/benchmark/bench_helper_funcs.py, notebooks, other tests
Minor fixes: scalar coercion for alpha in CDMG, XGBoost param updates (device: 'cuda'), test adjustments, and SPDX year bumps.
Workflows
.github/workflows/build.yaml, .github/workflows/pr.yaml
Updated docs-build container image to a py3.13 / cuda13 image for docs build jobs.
Templates & autosummary outputs
docs/source/_templates/autosummary/base.rst, docs/source/api/generated/*
Added template for autosummary-generated pages (used by new per-module RSTs).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

doc

Suggested reviewers

  • csadorf
  • bdice
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'Forward-merge release/26.04 into main' clearly describes the main objective of the changeset, which is a forward-merge of the release branch into main.
Description check ✅ Passed The PR description relates to the changeset by providing a merge instruction and referencing issue #7905, indicating this is part of a structured release process.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/source/api/index.rst`:
- Around line 1-5: Add the repository-standard SPDX/copyright header block to
the top of this new RST file (index.rst) before the existing title lines;
include the SPDX-License-Identifier line and the canonical copyright line (with
correct year and owner) formatted exactly like other repo files so it matches
the standard header template and linting expectations.
- Around line 450-507: Top-level API index omits several cuml.dask modules, so
update the "Multi-Node, Multi-GPU Algorithms" list-table in
docs/source/api/index.rst to include the missing cuml.dask entries (e.g.,
preprocessing, feature_extraction, datasets and any other modules listed in
docs/source/api/cuml.dask.rst); add rows with :obj: and :mod: references such as
:mod:`cuml.dask.preprocessing`, :mod:`cuml.dask.feature_extraction`,
:mod:`cuml.dask.datasets` (and corresponding representative :obj:`~...` entries
used in cuml.dask.rst) so the top-level index matches the detailed cuml.dask API
doc. Ensure the new rows follow the existing table structure and naming style so
discoverability is consistent.

In `@python/cuml/tests/test_label_encoder.py`:
- Line 12: The test currently imports the private module path (from
cuml.preprocessing._label import LabelEncoder); change it to use the public API
import (from cuml.preprocessing import LabelEncoder) so the test asserts the
public import path remains healthy—update the import in the test file and any
references to the LabelEncoder symbol accordingly to reference the public import
instead of _label.

In `@python/cuml/tests/test_target_encoder.py`:
- Line 10: The test imports TargetEncoder from the private module
cuml.preprocessing._target_encoder which bypasses the public API; change the
import to use the public namespace (import TargetEncoder from
cuml.preprocessing) so the test exercises cuml.preprocessing.TargetEncoder and
catches public-API regressions — update the import statement that references
TargetEncoder accordingly and run the tests to ensure no other references use
the private module.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 62fb3fea-c52c-42f6-b5be-9f8074df8fca

📥 Commits

Reviewing files that changed from the base of the PR and between 69261e4 and 722702f.

📒 Files selected for processing (46)
  • .gitignore
  • docs/source/FIL.rst
  • docs/source/_templates/autosummary/base.rst
  • docs/source/api.rst
  • docs/source/api/cuml.accel.rst
  • docs/source/api/cuml.benchmark.rst
  • docs/source/api/cuml.cluster.rst
  • docs/source/api/cuml.compose.rst
  • docs/source/api/cuml.covariance.rst
  • docs/source/api/cuml.dask.rst
  • docs/source/api/cuml.datasets.rst
  • docs/source/api/cuml.decomposition.rst
  • docs/source/api/cuml.ensemble.rst
  • docs/source/api/cuml.experimental.rst
  • docs/source/api/cuml.explainer.rst
  • docs/source/api/cuml.feature_extraction.rst
  • docs/source/api/cuml.fil.rst
  • docs/source/api/cuml.kernel_ridge.rst
  • docs/source/api/cuml.linear_model.rst
  • docs/source/api/cuml.manifold.rst
  • docs/source/api/cuml.metrics.rst
  • docs/source/api/cuml.model_selection.rst
  • docs/source/api/cuml.multiclass.rst
  • docs/source/api/cuml.naive_bayes.rst
  • docs/source/api/cuml.neighbors.rst
  • docs/source/api/cuml.preprocessing.rst
  • docs/source/api/cuml.random_projection.rst
  • docs/source/api/cuml.rst
  • docs/source/api/cuml.solvers.rst
  • docs/source/api/cuml.svm.rst
  • docs/source/api/cuml.tsa.rst
  • docs/source/api/index.rst
  • docs/source/conf.py
  • docs/source/index.rst
  • docs/source/sphinxext/github_link.py
  • python/cuml/cuml/__init__.py
  • python/cuml/cuml/common/doc_utils.py
  • python/cuml/cuml/dask/preprocessing/__init__.py
  • python/cuml/cuml/dask/preprocessing/_label.py
  • python/cuml/cuml/preprocessing/__init__.py
  • python/cuml/cuml/preprocessing/_label.py
  • python/cuml/cuml/preprocessing/_target_encoder.py
  • python/cuml/cuml/preprocessing/encoders.py
  • python/cuml/tests/dask/test_dask_label_encoder.py
  • python/cuml/tests/test_label_encoder.py
  • python/cuml/tests/test_target_encoder.py
💤 Files with no reviewable changes (1)
  • docs/source/api.rst

Comment thread docs/source/api/index.rst
Comment on lines +1 to +5
.. _api_ref:

=============
API Reference
=============

⚠️ Potential issue | 🟡 Minor

Add the standard SPDX header to this new docs file.

This new file is missing the repository-standard copyright/license header block.

💡 Proposed fix
+.. SPDX-FileCopyrightText: Copyright (c) 2018-2026, NVIDIA CORPORATION.
+.. SPDX-License-Identifier: Apache-2.0
+
 .. _api_ref:

As per coding guidelines: **/*: Ensure copyright headers of files are up-to-date and in the correct format.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/api/index.rst` around lines 1 - 5, Add the repository-standard
SPDX/copyright header block to the top of this new RST file (index.rst) before
the existing title lines; include the SPDX-License-Identifier line and the
canonical copyright line (with correct year and owner) formatted exactly like
other repo files so it matches the standard header template and linting
expectations.

Comment thread docs/source/api/index.rst
Comment on lines +450 to +507
Multi-Node, Multi-GPU Algorithms
================================

.. list-table::
:header-rows: 1
:widths: 30 50 20

* - Object
- Description
- Module
* - :obj:`~cuml.dask.solvers.CD`
- Multi-GPU Coordinate Descent solver.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.cluster.DBSCAN`
- Multi-GPU DBSCAN clustering.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.linear_model.ElasticNet`
- Multi-GPU ElasticNet regression.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.cluster.KMeans`
- Multi-GPU K-Means clustering.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.neighbors.KNeighborsClassifier`
- Multi-GPU K-Nearest Neighbors classifier.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.neighbors.KNeighborsRegressor`
- Multi-GPU K-Nearest Neighbors regressor.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.linear_model.Lasso`
- Multi-GPU Lasso regression.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.linear_model.LinearRegression`
- Multi-GPU Linear Regression.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.naive_bayes.MultinomialNB`
- Multi-GPU Multinomial Naive Bayes.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.neighbors.NearestNeighbors`
- Multi-GPU Nearest Neighbors.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.decomposition.PCA`
- Multi-GPU Principal Component Analysis.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.ensemble.RandomForestClassifier`
- Multi-GPU Random Forest classifier.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.ensemble.RandomForestRegressor`
- Multi-GPU Random Forest regressor.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.linear_model.Ridge`
- Multi-GPU Ridge Regression.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.decomposition.TruncatedSVD`
- Multi-GPU Truncated SVD.
- :mod:`cuml.dask`
* - :obj:`~cuml.dask.manifold.UMAP`
- Multi-GPU UMAP.
- :mod:`cuml.dask`

⚠️ Potential issue | 🟠 Major

Main API index is incomplete for cuml.dask entries.

The “comprehensive quick-search” page currently omits several cuml.dask APIs that are listed on docs/source/api/cuml.dask.rst (e.g., preprocessing, feature extraction, and dataset entries). This weakens discoverability from the top-level API page.

💡 Example patch (add omitted `cuml.dask` entries)
    * - :obj:`~cuml.dask.manifold.UMAP`
      - Multi-GPU UMAP.
      - :mod:`cuml.dask`
+   * - :obj:`~cuml.dask.preprocessing.LabelBinarizer`
+     - Multi-GPU label binarization.
+     - :mod:`cuml.dask`
+   * - :obj:`~cuml.dask.preprocessing.OneHotEncoder`
+     - Multi-GPU one-hot encoding.
+     - :mod:`cuml.dask`
+   * - :obj:`~cuml.dask.feature_extraction.text.TfidfTransformer`
+     - Multi-GPU TF-IDF transform.
+     - :mod:`cuml.dask`
+   * - :obj:`~cuml.dask.datasets.make_blobs`
+     - Multi-GPU synthetic clustering data generation.
+     - :mod:`cuml.dask`
+   * - :obj:`~cuml.dask.datasets.make_classification`
+     - Multi-GPU synthetic classification data generation.
+     - :mod:`cuml.dask`
+   * - :obj:`~cuml.dask.datasets.make_regression`
+     - Multi-GPU synthetic regression data generation.
+     - :mod:`cuml.dask`

As per coding guidelines: docs/**/* → “Completeness: Check if API changes ... are documented”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/api/index.rst` around lines 450 - 507, Top-level API index omits
several cuml.dask modules, so update the "Multi-Node, Multi-GPU Algorithms"
list-table in docs/source/api/index.rst to include the missing cuml.dask entries
(e.g., preprocessing, feature_extraction, datasets and any other modules listed
in docs/source/api/cuml.dask.rst); add rows with :obj: and :mod: references such
as :mod:`cuml.dask.preprocessing`, :mod:`cuml.dask.feature_extraction`,
:mod:`cuml.dask.datasets` (and corresponding representative :obj:`~...` entries
used in cuml.dask.rst) so the top-level index matches the detailed cuml.dask API
doc. Ensure the new rows follow the existing table structure and naming style so
discoverability is consistent.

from sklearn.utils.validation import check_is_fitted

from cuml.preprocessing.LabelEncoder import LabelEncoder
from cuml.preprocessing._label import LabelEncoder

⚠️ Potential issue | 🟡 Minor

Prefer the public LabelEncoder import path in this test.

Switching this suite to cuml.preprocessing._label stops it from asserting that the public cuml.preprocessing.LabelEncoder path remains healthy.

🔧 Suggested change
-from cuml.preprocessing._label import LabelEncoder
+from cuml.preprocessing import LabelEncoder
As per coding guidelines, "API breaking changes to Python estimator interfaces, removing or renaming public methods/attributes without deprecation, or breaking backward compatibility require at least one release cycle for deprecations."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/tests/test_label_encoder.py` at line 12, The test currently
imports the private module path (from cuml.preprocessing._label import
LabelEncoder); change it to use the public API import (from cuml.preprocessing
import LabelEncoder) so the test asserts the public import path remains
healthy—update the import in the test file and any references to the
LabelEncoder symbol accordingly to reference the public import instead of
_label.

import pytest

from cuml.preprocessing.TargetEncoder import TargetEncoder
from cuml.preprocessing._target_encoder import TargetEncoder

⚠️ Potential issue | 🟡 Minor

Keep this test importing TargetEncoder from the public namespace.

Using cuml.preprocessing._target_encoder here bypasses public API coverage and can hide regressions in cuml.preprocessing.TargetEncoder. For behavior tests, prefer the public import path.

🔧 Suggested change
-from cuml.preprocessing._target_encoder import TargetEncoder
+from cuml.preprocessing import TargetEncoder
As per coding guidelines, "API breaking changes to Python estimator interfaces, removing or renaming public methods/attributes without deprecation, or breaking backward compatibility require at least one release cycle for deprecations."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/tests/test_target_encoder.py` at line 10, The test imports
TargetEncoder from the private module cuml.preprocessing._target_encoder which
bypasses the public API; change the import to use the public namespace (import
TargetEncoder from cuml.preprocessing) so the test exercises
cuml.preprocessing.TargetEncoder and catches public-API regressions — update the
import statement that references TargetEncoder accordingly and run the tests to
ensure no other references use the private module.

@gforsyth
Contributor Author

I opened rapidsai/xgboost-feedstock#128 to get libxgboost built with rmm==26.6

gforsyth and others added 3 commits March 19, 2026 18:36
…vailable (rapidsai#7916)

We bumped the default devcontainer Python version to 3.14 in rapidsai/devcontainers#677 but some repos (like `cuml`) have optional dependencies without Python 3.14 support. 

We need to override the Python version temporarily until those deps catch up.

I'll also open a tracking issue so we remember to undo this.

Authors:
  - Gil Forsyth (https://github.com/gforsyth)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#7916
rapidsai#7907)

Contributes to rapidsai/build-planning#257

* builds CUDA 13 wheels with the 13.0 CTK
* ensures wheels ship with a runtime dependency of `nvidia-nvjitlink>={whatever-minor-version-they-were-built-against}`

Contributes to rapidsai/build-planning#256

* updates wheel tests to cover a range of CTK versions (we previously, accidentally, were only testing the latest 12.x and 13.x)

## Notes for Reviewers

### Doesn't this need conda changes too?

I don't think so.

cuML doesn't directly use libnvjitlink in conda packages, and it dynamically links to libcuvs:

https://github.com/rapidsai/cuml/blob/95b7be311789ef6b66bf1983d09f3cfd17b5e508/cpp/CMakeLists.txt#L64

https://github.com/rapidsai/cuml/blob/main/conda/recipes/libcuml/recipe.yaml#L118-L126

Wheels need a runtime dependency on `nvidia-nvjitlink` because they statically link cuVS:

https://github.com/rapidsai/cuml/blob/95b7be311789ef6b66bf1983d09f3cfd17b5e508/python/libcuml/CMakeLists.txt#L54

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Gil Forsyth (https://github.com/gforsyth)
  - Robert Maynard (https://github.com/robertmaynard)

URL: rapidsai#7907
…ai#7908)

After MNMG KMeans fit, the client was calling .result() on every worker's future, pulling the full fitted estimator (including `cluster_centers_`) from all workers back to the client. Since cluster centers are synchronized across workers via NCCL, every copy is identical, making all but one transfer redundant.

This PR changes the post-fit aggregation to:
- Collect the full model from only the first worker
- Extract `labels_` and `inertia_` from the remaining workers remotely via client.submit(getattr, ...)
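The aggregation described above can be sketched in plain Python. This is an illustrative stand-in only: the names `FittedKMeans` and `combine_worker_results` are invented, and the real code operates on Dask futures via `client.submit(getattr, ...)` rather than local objects.

```python
class FittedKMeans:
    """Stand-in for one worker's fitted estimator (illustrative only)."""

    def __init__(self, cluster_centers, inertia, n_samples):
        self.cluster_centers_ = cluster_centers
        self.inertia_ = inertia
        self.n_samples_ = n_samples


def combine_worker_results(models):
    """Keep the first worker's model for the shared parameters (cluster
    centers are NCCL-synchronized, so every copy is identical) and sum
    the per-worker inertias into a global inertia."""
    model = models[0]
    model.inertia_ = sum(m.inertia_ for m in models)
    return model
```

With two hypothetical worker results, only the first model is retained and its inertia becomes the global sum:

```python
combined = combine_worker_results(
    [FittedKMeans([[0.0]], 2.5, 10), FittedKMeans([[0.0]], 1.5, 6)]
)
# combined.inertia_ is 4.0; combined.cluster_centers_ is the first copy
```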

Authors:
  - Victor Lafargue (https://github.com/viclafargue)
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Jim Crist-Harif (https://github.com/jcrist)

URL: rapidsai#7908
@gforsyth gforsyth requested a review from a team as a code owner March 20, 2026 13:42
@github-actions github-actions Bot added the ci label Mar 20, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cuml/tests/dask/test_dask_kmeans.py (1)

57-76: ⚠️ Potential issue | 🟡 Minor

Broaden this test to cover the new fit_predict path.

fit_predict() is still called with its default delayed=True, and only predict() has its partitions/chunks checked. The production change added a separate eager branch plus direct distributed labels construction, so this can still miss regressions in fit_predict(delayed=False) or its layout.

🧪 Suggested test update
-    dask_fit_predict_labels = model.fit_predict(X_train)
+    dask_fit_predict_labels = model.fit_predict(
+        X_train, delayed=delayed_predict
+    )
...
     if input_type == "dataframe":
+        assert dask_fit_predict_labels.npartitions == parts_len
         assert dask_predict_labels.npartitions == parts_len
...
     elif input_type == "array":
+        assert len(dask_fit_predict_labels.chunks[0]) == parts_len
         assert len(dask_predict_labels.chunks[0]) == parts_len
As per coding guidelines, "Update unit tests when making code changes" and "Test files must ... test fit/predict/transform consistency".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/tests/dask/test_dask_kmeans.py` around lines 57 - 76, The test
currently only checks partition/chunk layout for model.predict and leaves
model.fit_predict using its default delayed=True path untested; update the test
to also call model.fit_predict with delayed=False (e.g., a new variable like
eager_fit_predict_labels = model.fit_predict(X_train, delayed=False)) and assert
that its returned collections have the expected number of partitions/chunks
matching parts_len just like dask_predict_labels does, and compare their
computed/eager label arrays (e.g., eager_fit_predict_labels.compute() or
equivalent) against fit_pred_labels/y_train to ensure fit_predict(delayed=False)
follows the same layout and label consistency as predict and the delayed
fit_predict path.
🧹 Nitpick comments (1)
python/cuml/cuml/dask/cluster/kmeans.py (1)

180-181: Make these zip() calls strict.

Assuming the supported Python floor here is still 3.10+, Lines 181, 191, and 199 already trip Ruff B905. These sequences are supposed to stay aligned 1:1, so strict=True gives a safer failure mode than silent truncation.

🔧 Minimal diff
-                for f, w in zip(kmeans_fit, workers)
+                for f, w in zip(kmeans_fit, workers, strict=True)
...
-            for f, w in zip(kmeans_fit, workers)
+            for f, w in zip(kmeans_fit, workers, strict=True)
...
-                    for f, (_, length) in zip(labels, inertia_and_lengths)
+                    for f, (_, length) in zip(
+                        labels, inertia_and_lengths, strict=True
+                    )

Also applies to: 190-191, 198-199

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/dask/cluster/kmeans.py` around lines 180 - 181, Replace the
non-strict zip calls with strict=True so mismatched sequence lengths raise
immediately; specifically update the zip(...) usages that pair kmeans_fit with
workers (the generator passed to self.client.submit calling
_get_inertia_and_n_samples) and the other two zip pairings referenced in the
diff to use zip(..., strict=True) to ensure a 1:1 alignment between kmeans_fit
and workers (and the other paired iterables).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/cuml/dask/cluster/kmeans.py`:
- Around line 172-185: The code sets the internal model to the first worker's
KMeansMG (first) before combining per-worker state, so get_combined_model() ends
up with only a partial inertia_ and worker-local labels_; instead, aggregate the
distributed state (compute total inertia_ from _get_inertia_and_n_samples
results and assemble/omit combined labels_) and update the first model's
attributes (e.g., set first.inertia_ = total_inertia and replace or clear
first.labels_ as appropriate) before calling _set_internal_model(first) for both
the initial call and the later similar block (referencing kmeans_fit,
_get_inertia_and_n_samples, inertia_, labels_, and _set_internal_model). Ensure
the internal model passed to _set_internal_model contains the combined/global
inertia_ (and not per-worker labels_) so the wrapper's get_combined_model()
reflects the aggregated state.

---

Outside diff comments:
In `@python/cuml/tests/dask/test_dask_kmeans.py`:
- Around line 57-76: The test currently only checks partition/chunk layout for
model.predict and leaves model.fit_predict using its default delayed=True path
untested; update the test to also call model.fit_predict with delayed=False
(e.g., a new variable like eager_fit_predict_labels = model.fit_predict(X_train,
delayed=False)) and assert that its returned collections have the expected
number of partitions/chunks matching parts_len just like dask_predict_labels
does, and compare their computed/eager label arrays (e.g.,
eager_fit_predict_labels.compute() or equivalent) against
fit_pred_labels/y_train to ensure fit_predict(delayed=False) follows the same
layout and label consistency as predict and the delayed fit_predict path.

---

Nitpick comments:
In `@python/cuml/cuml/dask/cluster/kmeans.py`:
- Around line 180-181: Replace the non-strict zip calls with strict=True so
mismatched sequence lengths raise immediately; specifically update the zip(...)
usages that pair kmeans_fit with workers (the generator passed to
self.client.submit calling _get_inertia_and_n_samples) and the other two zip
pairings referenced in the diff to use zip(..., strict=True) to ensure a 1:1
alignment between kmeans_fit and workers (and the other paired iterables).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 231e6579-47b8-4011-8cb3-fb92721793f1

📥 Commits

Reviewing files that changed from the base of the PR and between 722702f and 5627345.

📒 Files selected for processing (12)
  • .devcontainer/cuda12.9-conda/devcontainer.json
  • .devcontainer/cuda12.9-pip/devcontainer.json
  • .devcontainer/cuda13.1-conda/devcontainer.json
  • .devcontainer/cuda13.1-pip/devcontainer.json
  • ci/test_wheel.sh
  • ci/test_wheel_dask.sh
  • ci/test_wheel_integrations.sh
  • dependencies.yaml
  • python/cuml/cuml/dask/cluster/kmeans.py
  • python/cuml/pyproject.toml
  • python/cuml/tests/dask/test_dask_kmeans.py
  • python/libcuml/pyproject.toml
✅ Files skipped from review due to trivial changes (2)
  • .devcontainer/cuda12.9-conda/devcontainer.json
  • .devcontainer/cuda13.1-conda/devcontainer.json

Comment on lines +172 to +185
first = kmeans_fit[0].result()
self._set_internal_model(first)

workers = list(data.worker_to_parts.keys())

# Compute and store the total inertia_
inertia_and_lengths = self.client.gather(
[
self.client.submit(_get_inertia_and_n_samples, f, workers=[w])
for f, w in zip(kmeans_fit, workers)
]
)
self.inertia_ = sum(inertia for inertia, _ in inertia_and_lengths)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Inspect how the Dask estimator stores and exposes its internal model:"
rg -n -A20 -B5 'def _set_internal_model\b|def get_combined_model\b' python/cuml/cuml/dask/common/base.py

echo
echo "Inspect current post-fit callers of get_combined_model():"
rg -n -C2 '\bget_combined_model\s*\(' python/cuml/tests python/cuml/cuml/dask

Repository: rapidsai/cuml

Length of output: 6527


🏁 Script executed:

cat -n python/cuml/cuml/dask/cluster/kmeans.py | sed -n '160,210p'

Repository: rapidsai/cuml

Length of output: 2243


🏁 Script executed:

rg -n -A3 'get_combined_model.*labels_|get_combined_model.*inertia_' python/cuml/tests/dask/ python/cuml/cuml/dask/

Repository: rapidsai/cuml

Length of output: 354


🏁 Script executed:

# Check if KMeans uses get_combined_model internally in predict or other methods
rg -n -B5 -A5 'def predict\|def transform\|def score' python/cuml/cuml/dask/cluster/kmeans.py | head -80

Repository: rapidsai/cuml

Length of output: 39


🏁 Script executed:

rg -n 'def (predict|score|fit_predict|transform)' python/cuml/cuml/dask/cluster/kmeans.py

Repository: rapidsai/cuml

Length of output: 262


🏁 Script executed:

# Check the score method implementation if it exists
rg -n -A15 'def score' python/cuml/cuml/dask/cluster/kmeans.py

Repository: rapidsai/cuml

Length of output: 500


🏁 Script executed:

cat -n python/cuml/cuml/dask/cluster/kmeans.py | sed -n '283,310p'

Repository: rapidsai/cuml

Length of output: 964


🏁 Script executed:

cat -n python/cuml/tests/dask/test_dask_kmeans.py | sed -n '300,320p'

Repository: rapidsai/cuml

Length of output: 648


🏁 Script executed:

# Check if there are any assertions about inertia_ value in tests
rg -n 'inertia_' python/cuml/tests/dask/test_dask_kmeans.py

Repository: rapidsai/cuml

Length of output: 167


🏁 Script executed:

# Check if there are any direct accesses to internal_model.inertia_ or similar
rg -n 'internal_model.*inertia_|\.inertia_' python/cuml/cuml/dask/cluster/kmeans.py

Repository: rapidsai/cuml

Length of output: 194


🏁 Script executed:

# Check the fit_predict method to see if it accesses any internal model attributes
cat -n python/cuml/cuml/dask/cluster/kmeans.py | sed -n '205,230p'

Repository: rapidsai/cuml

Length of output: 1007


🏁 Script executed:

# Final verification: confirm that first.inertia_ is indeed different from self.inertia_ 
# by checking the helper function that extracts inertia_
cat -n python/cuml/cuml/dask/cluster/kmeans.py | sed -n '1,30p'

Repository: rapidsai/cuml

Length of output: 1172


Store combined model state before calling _set_internal_model().

The internal model at line 173 is set to first, a single worker's KMeansMG instance containing only that worker's inertia_ and labels_. However, lines 178–201 aggregate inertia_ across all workers and collect distributed labels_ from all workers into the wrapper object. This violates the documented contract in base.py which states: "If multiple different parameters have been trained across the cluster...they should be combined into a single model and the combined model should be passed to set_internal_model()."

The inconsistency means get_combined_model().inertia_ would return the first worker's partial inertia instead of the aggregated total. Update the internal model's aggregated state (at minimum inertia_, ideally avoiding a worker-local labels_) before calling _set_internal_model() at line 173.

Also applies to: 193–201
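A minimal sketch of the suggested ordering — aggregate first, then install the model. `_FakeModel` and `combine_and_set` are hypothetical stand-ins for the per-worker `KMeansMG` results and `_set_internal_model()`, not the actual cuML code:

```python
class _FakeModel:
    """Stand-in for one worker's KMeansMG result (hypothetical)."""
    def __init__(self, inertia):
        self.inertia_ = inertia       # worker-local partial inertia
        self.labels_ = [0, 1]         # worker-local labels


def combine_and_set(models, set_internal_model):
    first = models[0]
    # Aggregate across workers *before* installing the internal model.
    first.inertia_ = sum(m.inertia_ for m in models)
    first.labels_ = None              # don't leak worker-local labels
    set_internal_model(first)         # only after aggregation finishes
    return first


installed = []
combined = combine_and_set(
    [_FakeModel(1.5), _FakeModel(2.5)], installed.append
)
assert combined.inertia_ == 4.0 and combined.labels_ is None
```

With this ordering, whatever `get_combined_model()` later returns carries the global `inertia_` rather than the first worker's partial value.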

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 181-181: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/dask/cluster/kmeans.py` around lines 172 - 185, The code
sets the internal model to the first worker's KMeansMG (first) before combining
per-worker state, so get_combined_model() ends up with only a partial inertia_
and worker-local labels_; instead, aggregate the distributed state (compute
total inertia_ from _get_inertia_and_n_samples results and assemble/omit
combined labels_) and update the first model's attributes (e.g., set
first.inertia_ = total_inertia and replace or clear first.labels_ as
appropriate) before calling _set_internal_model(first) for both the initial call
and the later similar block (referencing kmeans_fit, _get_inertia_and_n_samples,
inertia_, labels_, and _set_internal_model). Ensure the internal model passed to
_set_internal_model contains the combined/global inertia_ (and not per-worker
labels_) so the wrapper's get_combined_model() reflects the aggregated state.

@gforsyth
Contributor Author

Ugh, ok, new libxgboost with librmm 26.06 is merged, but now need to bump rapids-xgboost to use that version: rapidsai/integration#833

jcrist and others added 2 commits March 20, 2026 14:42
NumPy 2.4 no longer coerces 1-element arrays to scalars automatically. Small update to this line to support both 1-element alpha arrays and scalars. Fixes rapidsai#7873.

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Simon Adorf (https://github.com/csadorf)

URL: rapidsai#7918
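A sketch of one way to handle both cases under NumPy 2.4's stricter coercion; `as_scalar` is a hypothetical helper for illustration, not the actual fix in the PR:

```python
import numpy as np


def as_scalar(alpha):
    """Hypothetical helper: accept a plain scalar or a 1-element array."""
    arr = np.asarray(alpha)
    if arr.size != 1:
        raise ValueError("alpha must be a scalar or a 1-element array")
    # .item() extracts a Python scalar from both inputs, on old and new NumPy,
    # whereas float(np.array([0.5])) relies on the removed auto-coercion.
    return arr.item()


assert as_scalar(0.5) == 0.5
assert as_scalar(np.array([0.5])) == 0.5
```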
@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@gforsyth gforsyth force-pushed the main-merge-release/26.04 branch from c2fa289 to 5907ea7 Compare March 20, 2026 15:02

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
python/cuml/cuml/dask/cluster/kmeans.py (1)

168-201: ⚠️ Potential issue | 🟠 Major

Defer _set_internal_model() until after the global state is assembled.

Line 173 stores first before inertia_ and labels_ are combined across workers, so get_combined_model() still exposes the first worker’s partial inertia_ and worker-local labels_. It also means a failure in the later gather/submit calls can leave the estimator half-updated. Please update first with the aggregated state (or clear labels_) and call _set_internal_model(first) only after the aggregation finishes. Based on learnings: Model state management must ensure fit/predict/transform maintain consistent state, and state from previous fit() calls does not affect new training.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/dask/cluster/kmeans.py` around lines 168 - 201, Move the
call to _set_internal_model(first) until after the global state assembly:
compute inertia_and_lengths via client.gather and build/assign self.inertia_ and
self.labels_ (using _get_inertia_and_n_samples, labels list,
da.from_delayed/dd.from_delayed logic) first, then update the model by calling
_set_internal_model(first) so the internal model reflects the aggregated
inertia_ and distributed labels_; if you cannot fully populate first with
aggregated values, clear/overwrite first.labels_ and first.inertia_ (or avoid
setting them on the partial `first`) before calling _set_internal_model to
prevent leaking partial/worker-local state.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/cuml/dask/cluster/kmeans.py`:
- Around line 175-201: The labels were concatenated per-worker (labels list
zipped with workers) which yields worker-block order, not the original partition
order; fix by mapping each returned delayed label future back to its original
partition index and then reassembling labels_ in input-partition order before
creating the dask array. Concretely: after creating the per-worker delayed
futures (labels) and knowing each worker's partition list from
data.worker_to_parts, build a mapping from partition_index ->
delayed_label_future (use the same zip of kmeans_fit/labels with the worker's
partition indices), then produce an ordered list [mapping[i] for i in
range(n_total_partitions)] and pass that ordered list into dd.from_delayed or
da.from_delayed/da.concatenate (replace current use of
labels/inertia_and_lengths) so labels_ respects the original partition ordering.

In `@python/cuml/cuml/experimental/hyperparams/HPO_demo.ipynb`:
- Around line 764-767: The markdown note about CPU runs is inconsistent with the
gating condition; change the note text so it matches the actual check
`data_fraction <= 0.1` (i.e., say "10%" instead of "1%") — locate the
explanatory text referencing `data_fraction` in the HPO_demo notebook and update
the sentence to reflect 10% (or alternatively change the gate to `<= 0.01` if
you intend 1%) so the note and the `data_fraction <= 0.1` condition are aligned.

---

Duplicate comments:
In `@python/cuml/cuml/dask/cluster/kmeans.py`:
- Around line 168-201: Move the call to _set_internal_model(first) until after
the global state assembly: compute inertia_and_lengths via client.gather and
build/assign self.inertia_ and self.labels_ (using _get_inertia_and_n_samples,
labels list, da.from_delayed/dd.from_delayed logic) first, then update the model
by calling _set_internal_model(first) so the internal model reflects the
aggregated inertia_ and distributed labels_; if you cannot fully populate first
with aggregated values, clear/overwrite first.labels_ and first.inertia_ (or
avoid setting them on the partial `first`) before calling _set_internal_model to
prevent leaking partial/worker-local state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 37f9cd7f-c5fa-4951-8697-606b323854d1

📥 Commits

Reviewing files that changed from the base of the PR and between 5627345 and 5907ea7.

📒 Files selected for processing (63)
  • .devcontainer/cuda12.9-conda/devcontainer.json
  • .devcontainer/cuda12.9-pip/devcontainer.json
  • .devcontainer/cuda13.1-conda/devcontainer.json
  • .devcontainer/cuda13.1-pip/devcontainer.json
  • .gitignore
  • ci/test_wheel.sh
  • ci/test_wheel_dask.sh
  • ci/test_wheel_integrations.sh
  • dependencies.yaml
  • docs/source/FIL.rst
  • docs/source/_templates/autosummary/base.rst
  • docs/source/api.rst
  • docs/source/api/cuml.accel.rst
  • docs/source/api/cuml.benchmark.rst
  • docs/source/api/cuml.cluster.rst
  • docs/source/api/cuml.compose.rst
  • docs/source/api/cuml.covariance.rst
  • docs/source/api/cuml.dask.rst
  • docs/source/api/cuml.datasets.rst
  • docs/source/api/cuml.decomposition.rst
  • docs/source/api/cuml.ensemble.rst
  • docs/source/api/cuml.experimental.rst
  • docs/source/api/cuml.explainer.rst
  • docs/source/api/cuml.feature_extraction.rst
  • docs/source/api/cuml.fil.rst
  • docs/source/api/cuml.kernel_ridge.rst
  • docs/source/api/cuml.linear_model.rst
  • docs/source/api/cuml.manifold.rst
  • docs/source/api/cuml.metrics.rst
  • docs/source/api/cuml.model_selection.rst
  • docs/source/api/cuml.multiclass.rst
  • docs/source/api/cuml.naive_bayes.rst
  • docs/source/api/cuml.neighbors.rst
  • docs/source/api/cuml.preprocessing.rst
  • docs/source/api/cuml.random_projection.rst
  • docs/source/api/cuml.rst
  • docs/source/api/cuml.solvers.rst
  • docs/source/api/cuml.svm.rst
  • docs/source/api/cuml.tsa.rst
  • docs/source/api/index.rst
  • docs/source/conf.py
  • docs/source/index.rst
  • docs/source/sphinxext/github_link.py
  • notebooks/target_encoder_walkthrough.ipynb
  • python/cuml/cuml/__init__.py
  • python/cuml/cuml/benchmark/bench_helper_funcs.py
  • python/cuml/cuml/common/doc_utils.py
  • python/cuml/cuml/dask/cluster/kmeans.py
  • python/cuml/cuml/dask/preprocessing/__init__.py
  • python/cuml/cuml/dask/preprocessing/_label.py
  • python/cuml/cuml/experimental/hyperparams/HPO_demo.ipynb
  • python/cuml/cuml/preprocessing/__init__.py
  • python/cuml/cuml/preprocessing/_label.py
  • python/cuml/cuml/preprocessing/_target_encoder.py
  • python/cuml/cuml/preprocessing/encoders.py
  • python/cuml/cuml/solvers/cd_mg.pyx
  • python/cuml/pyproject.toml
  • python/cuml/tests/dask/test_dask_coordinate_descent.py
  • python/cuml/tests/dask/test_dask_kmeans.py
  • python/cuml/tests/dask/test_dask_label_encoder.py
  • python/cuml/tests/test_label_encoder.py
  • python/cuml/tests/test_target_encoder.py
  • python/libcuml/pyproject.toml
💤 Files with no reviewable changes (1)
  • docs/source/api.rst
✅ Files skipped from review due to trivial changes (42)
  • docs/source/FIL.rst
  • docs/source/index.rst
  • .devcontainer/cuda12.9-pip/devcontainer.json
  • .devcontainer/cuda12.9-conda/devcontainer.json
  • docs/source/api/cuml.model_selection.rst
  • .devcontainer/cuda13.1-pip/devcontainer.json
  • docs/source/api/cuml.multiclass.rst
  • .gitignore
  • docs/source/api/cuml.accel.rst
  • docs/source/api/cuml.svm.rst
  • notebooks/target_encoder_walkthrough.ipynb
  • docs/source/api/cuml.preprocessing.rst
  • docs/source/api/cuml.random_projection.rst
  • docs/source/api/cuml.tsa.rst
  • python/cuml/cuml/dask/preprocessing/_label.py
  • python/cuml/cuml/preprocessing/encoders.py
  • docs/source/api/cuml.solvers.rst
  • python/libcuml/pyproject.toml
  • python/cuml/tests/test_label_encoder.py
  • docs/source/api/cuml.compose.rst
  • docs/source/api/cuml.neighbors.rst
  • docs/source/api/cuml.experimental.rst
  • docs/source/api/cuml.explainer.rst
  • python/cuml/cuml/preprocessing/__init__.py
  • docs/source/api/cuml.naive_bayes.rst
  • docs/source/api/cuml.covariance.rst
  • docs/source/api/cuml.manifold.rst
  • python/cuml/cuml/common/doc_utils.py
  • docs/source/api/cuml.datasets.rst
  • docs/source/api/cuml.decomposition.rst
  • docs/source/api/cuml.feature_extraction.rst
  • docs/source/api/cuml.rst
  • docs/source/api/cuml.dask.rst
  • python/cuml/tests/dask/test_dask_label_encoder.py
  • docs/source/api/cuml.ensemble.rst
  • docs/source/api/index.rst
  • docs/source/api/cuml.metrics.rst
  • docs/source/api/cuml.kernel_ridge.rst
  • docs/source/api/cuml.cluster.rst
  • docs/source/api/cuml.fil.rst
  • docs/source/_templates/autosummary/base.rst
  • docs/source/api/cuml.linear_model.rst
🚧 Files skipped from review as they are similar to previous changes (10)
  • ci/test_wheel_integrations.sh
  • docs/source/api/cuml.benchmark.rst
  • .devcontainer/cuda13.1-conda/devcontainer.json
  • python/cuml/tests/test_target_encoder.py
  • ci/test_wheel.sh
  • ci/test_wheel_dask.sh
  • python/cuml/cuml/__init__.py
  • python/cuml/cuml/dask/preprocessing/__init__.py
  • docs/source/conf.py
  • python/cuml/tests/dask/test_dask_kmeans.py

Comment on lines +175 to +201
workers = list(data.worker_to_parts.keys())

# Compute and store the total inertia_
inertia_and_lengths = self.client.gather(
[
self.client.submit(_get_inertia_and_n_samples, f, workers=[w])
for f, w in zip(kmeans_fit, workers)
]
)
self.inertia_ = sum(inertia for inertia, _ in inertia_and_lengths)

# Store labels_ as a distributed dask array. This attribute scales with
# n_samples, and shouldn't be pulled back to a local node.
labels_meta = cp.zeros(0, dtype=first.labels_.dtype)
labels = [
self.client.submit(getattr, f, "labels_", workers=[w])
for f, w in zip(kmeans_fit, workers)
]
if self.datatype == "cudf":
self.labels_ = dd.from_delayed(labels)
else:
self.labels_ = da.concatenate(
[
da.from_delayed(f, shape=(length,), meta=labels_meta)
for f, (_, length) in zip(labels, inertia_and_lengths)
]
)

⚠️ Potential issue | 🟠 Major

Rebuild labels_ in input-partition order before returning it.

worker_to_parts groups partitions by worker, and _func_fit() concatenates each worker’s local partitions before producing labels_. Concatenating those worker-local blocks here yields [all parts on worker A, then worker B, ...], not the original Dask partition order, so an input distributed as A, B, A comes back as p0, p2, p1. With Lines 220-221 now returning self.labels_ directly, fit_predict() can return permuted labels for perfectly valid partition layouts.

Also applies to: 220-221

🧰 Tools
🪛 Ruff (0.15.6)

[warning] 181-181: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


[warning] 191-191: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


[warning] 199-199: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/dask/cluster/kmeans.py` around lines 175 - 201, The labels
were concatenated per-worker (labels list zipped with workers) which yields
worker-block order, not the original partition order; fix by mapping each
returned delayed label future back to its original partition index and then
reassembling labels_ in input-partition order before creating the dask array.
Concretely: after creating the per-worker delayed futures (labels) and knowing
each worker's partition list from data.worker_to_parts, build a mapping from
partition_index -> delayed_label_future (use the same zip of kmeans_fit/labels
with the worker's partition indices), then produce an ordered list [mapping[i]
for i in range(n_total_partitions)] and pass that ordered list into
dd.from_delayed or da.from_delayed/da.concatenate (replace current use of
labels/inertia_and_lengths) so labels_ respects the original partition ordering.

Comment on lines +764 to 767
"Now, running the model in CPU version to notice the difference in performance in terms of time. The main difference you would notice is that the `device` is set to `cpu` instead of `cuda`. The interface remains the same and we can even make use of the same parameters that we defined earlier (in fact, this is necessary for a fair comparison). \n",
"\n",
"Note: Remember the `data_fraction` flag from earlier and we will ensure we only run the CPU version if we are using less than 1% of the data"
]

@coderabbitai coderabbitai Bot Mar 20, 2026


⚠️ Potential issue | 🟡 Minor

Fix CPU data-fraction note to match the actual gate condition.

The note says CPU should run only below 1%, but the code gate is data_fraction <= 0.1 (10%). Please align the markdown to avoid confusion.

🛠️ Suggested patch
-    "Note: Remember the `data_fraction` flag from earlier and we will ensure we only run the CPU version if we are using less than 1% of the data"
+    "Note: Remember the `data_fraction` flag from earlier; we only run the CPU version when using 10% of the data or less (`data_fraction <= 0.1`)."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     "Now, running the model in CPU version to notice the difference in performance in terms of time. The main difference you would notice is that the `device` is set to `cpu` instead of `cuda`. The interface remains the same and we can even make use of the same parameters that we defined earlier (in fact, this is necessary for a fair comparison). \n",
     "\n",
-    "Note: Remember the `data_fraction` flag from earlier and we will ensure we only run the CPU version if we are using less than 1% of the data"
+    "Note: Remember the `data_fraction` flag from earlier; we only run the CPU version when using 10% of the data or less (`data_fraction <= 0.1`)."
 ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/cuml/experimental/hyperparams/HPO_demo.ipynb` around lines 764 -
767, The markdown note about CPU runs is inconsistent with the gating condition;
change the note text so it matches the actual check `data_fraction <= 0.1`
(i.e., say "10%" instead of "1%") — locate the explanatory text referencing
`data_fraction` in the HPO_demo notebook and update the sentence to reflect 10%
(or alternatively change the gate to `<= 0.01` if you intend 1%) so the note and
the `data_fraction <= 0.1` condition are aligned.

Contributor Author


Please stop reviewing this PR, it's pulling in commits that have already been approved on the release branch.

You are burning GPU cycles that could be better spent elsewhere.


@gforsyth, apologies for the unnecessary review noise. I understand this is a forward-merge PR bringing in commits that have already been reviewed and approved on the release branch. I'll refrain from further review comments on this PR.


🧠 Learnings used
Learnt from: CR
Repo: rapidsai/cuml PR: 0
File: cpp/agents.md:0-0
Timestamp: 2026-02-04T17:24:50.873Z
Learning: Focus code review on CRITICAL and HIGH issues only (GPU/CUDA errors, race conditions, memory leaks, algorithm correctness, numerical stability, performance); ignore style/formatting, minor naming, and subjective preferences

@gforsyth
Contributor Author

/merge nosquash

@rapids-bot
Contributor

rapids-bot Bot commented Mar 20, 2026

Commit history integrity check failed: not all commits from original PR #7905 appear to be present individually in this PR's history. This usually happens if commits were squashed during the manual resolution process. Please ensure all original commits are preserved individually. You can fix this and try the /merge nosquash command again.

@gforsyth gforsyth merged commit edcd781 into rapidsai:main Mar 20, 2026
188 of 190 checks passed
@gforsyth gforsyth deleted the main-merge-release/26.04 branch March 20, 2026 17:55

Labels

ci
Cython / Python — Cython or Python issue
improvement — Improvement / enhancement to an existing function
non-breaking — Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants