Implement health checks. by csadorf · Pull Request #7854 · rapidsai/cuml

csadorf · 2026-03-04T21:07:13Z

Adds a cuml.health_checks module with smoke tests that can be run standalone or via rapids doctor.

Required for #7851

Closes rapidsai#7851

Depends on rapidsai/cuml#7854 Closes rapidsai/cuml#7851

coderabbitai · 2026-03-04T21:09:48Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d4043693-d9b3-464c-b7d5-9395fffd9483

📥 Commits

Reviewing files that changed from the base of the PR and between bc883a6 and 672cd4d.

📒 Files selected for processing (1)

python/cuml/pyproject.toml

📝 Walkthrough

Summary by CodeRabbit

Release Notes

New Features
- Added health checks to verify cuML installation and functionality. Run via python -m cuml.health_checks with optional --verbose flag, or integrate with RAPIDS CLI (rapids doctor).
Documentation
- Added comprehensive health checks documentation covering standalone and RAPIDS CLI usage.
Tests
- Added test coverage for health check functionality and registration.
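As a rough illustration of the standalone runner described above, a minimal command-line entry point might look like the sketch below. Only the `--verbose` flag and the existence of name filtering come from the summary; the `CHECKS` registry, the `import_check` placeholder, and the positional `names` argument are hypothetical and are not taken from `cuml.health_checks`.

```python
import argparse
import sys


def import_check(verbose=False, **kwargs):
    """Hypothetical placeholder check: succeeds if a module can be imported."""
    import json  # stand-in for importing the library under test

    return f"imported {json.__name__}" if verbose else None


# Hypothetical registry mapping check names to callables.
CHECKS = {"import": import_check}


def main(argv=None):
    """Run the selected checks and return a process exit code."""
    parser = argparse.ArgumentParser(description="Run health checks.")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("names", nargs="*", help="Run only these checks.")
    args = parser.parse_args(argv)

    unknown = [name for name in args.names if name not in CHECKS]
    if unknown:
        print(f"unknown checks: {', '.join(unknown)}", file=sys.stderr)
        return 2

    failures = 0
    for name in args.names or list(CHECKS):
        try:
            message = CHECKS[name](verbose=args.verbose)
        except Exception as exc:  # a failing check must not stop the others
            failures += 1
            print(f"FAIL {name}: {exc}")
        else:
            print(f"OK   {name}" + (f": {message}" if message else ""))
    return 1 if failures else 0
```

Calling `main(["--verbose"])` runs every registered check, while `main(["import"])` restricts the run to one check by name.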

Walkthrough

Adds a cuML health-checks feature: new health check functions, a CLI entry point for python -m cuml.health_checks, RAPIDS CLI plugin entry-points, documentation, and tests validating registration and function signatures.
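The signature requirement mentioned in the walkthrough (a `verbose` parameter plus `**kwargs`) can be checked with the standard `inspect` module. The sketch below is an assumption about how such a test could be written, not the test code from this PR; `has_check_signature`, `good_check`, and `bad_check` are hypothetical names.

```python
import inspect


def has_check_signature(fn):
    """Return True if fn accepts a `verbose` parameter and **kwargs."""
    params = inspect.signature(fn).parameters
    has_verbose = "verbose" in params
    has_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    return has_verbose and has_var_kwargs


def good_check(verbose=False, **kwargs):
    """Hypothetical check that conforms to the required signature."""
    return "ok" if verbose else None


def bad_check(verbose=False):
    """Hypothetical check that is missing **kwargs."""
    return None
```

Accepting `**kwargs` lets a caller pass extra options later without breaking existing checks, which is presumably why the tests enforce it.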

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation: `docs/source/health_checks.rst`, `docs/source/user_guide.rst` | New docs page describing standalone and RAPIDS CLI usage; user guide toctree updated to include the page. |
| Health checks package: `python/cuml/cuml/health_checks/__init__.py`, `python/cuml/cuml/health_checks/__main__.py`, `python/cuml/cuml/health_checks/_checks.py` | New package exposing four checks (import, functional, accel-basic, accel-cli), CLI runner (`main`) supporting verbose and name filtering, subprocess-based accel checks with timeout and detailed error handling. |
| Tests: `python/cuml/tests/test_health_checks.py` | New parametrized tests ensuring all public checks are registered, runnable with verbose=True, and conform to the required signature (verbose param + **kwargs). |
| Packaging / Entry points: `python/cuml/pyproject.toml` | Adds `project.entry-points.rapids_doctor_check` entries registering the four checks as rapids doctor plugins. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Add cuML health checks to rapids doctor #7851 — Implements the same cuML health-check functions and registers them under the rapids_doctor_check entry-point group; appears directly related and could be linked.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Implement health checks' directly summarizes the main change: adding a cuml.health_checks module with smoke tests. |
| Description check | ✅ Passed | The description is related to the changeset, explaining that the PR adds a cuml.health_checks module with smoke tests for standalone and rapids doctor usage. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

python/cuml/tests/test_health_checks.py (1)

24-27: Rename unused parametrized argument for clarity.

name is only used for parametrization setup and not inside the test body. Rename it to _name (or parameterize only check_fn) to make intent explicit and keep lint clean.

♻️ Small cleanup

-@pytest.mark.parametrize("name,check_fn", _CHECKS, ids=_CHECK_IDS)
-def test_health_check(name, check_fn):
+@pytest.mark.parametrize("_name,check_fn", _CHECKS, ids=_CHECK_IDS)
+def test_health_check(_name, check_fn):
     """Each registered health check should pass."""
     check_fn(verbose=True)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@python/cuml/tests/test_health_checks.py` around lines 24 - 27, Rename the
unused parametrized argument in the test function to make intent explicit: in
test_health_check change the parameter list from (name, check_fn) to (_name,
check_fn) (or alternatively remove name from the param list and adjust the
parametrize to only supply check_fn) so that the unused "name" symbol is clearly
marked and lint warnings are resolved; update the test signature where
test_health_check and the _CHECKS/_CHECK_IDS parametrize decorator are defined.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/cuml/health_checks/_checks.py`:
- Around line 54-147: Add a new check function (e.g.,
accel_memory_check(verbose=False, **kwargs)) alongside accel_basic_check and
accel_cli_check that explicitly queries the GPU free memory (use NVML via pynvml
or CUDA API), compares it to a configurable minimum threshold (e.g.,
_MIN_GPU_MEMORY_BYTES or a min_memory kwarg), and fails with a clear
RuntimeError/AssertionError including available vs required bytes and
remediation guidance when memory is insufficient; ensure the function returns a
concise success message when verbose is True and register this new
accel_memory_check in the module's checks registry so it runs with the other
accel checks (accel_basic_check, accel_cli_check).
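A memory check along the lines suggested in this comment could be sketched as follows. The thread does not show whether the suggestion was adopted, so this is purely illustrative: the `get_free_bytes` parameter and the 1 GiB default are hypothetical, and the NVML path assumes the optional `pynvml` module is installed.

```python
def _nvml_free_bytes(device_index=0):
    """Query free GPU memory in bytes through NVML."""
    import pynvml  # optional dependency, assumed to be installed

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).free
    finally:
        pynvml.nvmlShutdown()


def accel_memory_check(
    verbose=False, min_memory=1 << 30, get_free_bytes=None, **kwargs
):
    """Fail if free GPU memory is below `min_memory` bytes.

    `get_free_bytes` is a hypothetical hook so the check can be exercised
    without a GPU; by default it queries NVML.
    """
    if get_free_bytes is None:
        get_free_bytes = _nvml_free_bytes
    free = get_free_bytes()
    if free < min_memory:
        raise RuntimeError(
            f"Insufficient GPU memory: {free} bytes free, "
            f"{min_memory} bytes required. Free memory held by other "
            "processes or use a larger GPU."
        )
    if verbose:
        return f"GPU memory OK: {free} bytes free (minimum {min_memory})"
    return None
```

Injecting the memory query keeps the threshold logic testable on machines without a GPU, for example `accel_memory_check(get_free_bytes=lambda: 2 << 30)`.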


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3578926c-4f63-4773-ac61-a328d32d57b3

📥 Commits

Reviewing files that changed from the base of the PR and between 631c8b9 and bc883a6.

📒 Files selected for processing (6)

docs/source/health_checks.rst
docs/source/user_guide.rst
python/cuml/cuml/health_checks/__init__.py
python/cuml/cuml/health_checks/__main__.py
python/cuml/cuml/health_checks/_checks.py
python/cuml/tests/test_health_checks.py

jacobtomlinson

These checks look great thanks @csadorf. They also need to be registered in the pyproject.toml in this repo, not the rapids-cli one.

csadorf · 2026-03-05T15:22:07Z

> These checks look great thanks @csadorf. They also need to be registered in the pyproject.toml in this repo, not the rapids-cli one.

@jacobtomlinson Fixed in 672cd4d .

jacobtomlinson

This looks great thanks @csadorf

jameslamb

oh cool idea!!

betatim

One language tweak.

Can we also check that the versions of the various CUDA components and cuml are compatible with each other? This goes beyond just "cuml can't be imported" and gives users an idea why it can't be imported.

Maybe not for this PR but a future one: having a tool that collects information about your environment and presents it in a standard way would be useful to have. We could replace

**Environment details (please complete the following information):**
 - Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
 - Linux Distro/Architecture: [Ubuntu 16.04 amd64]
 - GPU Model/Driver: [V100 and driver 396.44]
 - CUDA: [9.2]
 - Method of cuDF & cuML install: [conda, Docker, or from source]
   - If method of install is [conda], run `conda list` and include results here
   - If method of install is [Docker], provide `docker pull` & `docker run` commands used
   - If method of install is [from source], provide versions of `cmake` & `gcc/g++` and commit hash of build

in our bug template with "Paste result of python -m cuml.health_checks" and get a well formatted, standardised output. I work on cuml and when I look at this part of the bug template I'm turned off, and think "just give me a command to run".

The reason to include this in the health checks would be that (a) running the health checks might already resolve people's problem and (b) we could probably concoct more health checks based on the information about the environment (eg could point out known to be broken/impossible combinations)
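To make the idea concrete, an environment collector of the kind proposed here could start from the standard library alone. This is only a sketch under stated assumptions: `collect_environment` and `format_environment` are hypothetical names, the default package list is a guess, and installed distribution names may differ (for example, wheels that carry a CUDA suffix). GPU and driver details are deliberately left out.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def collect_environment(packages=("cuml", "cudf", "numpy")):
    """Gather basic platform details and installed package versions."""
    info = {
        "python": platform.python_version(),
        "system": platform.system(),
        "machine": platform.machine(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = version(name)
        except PackageNotFoundError:
            info[name] = "not installed"
    return info


def format_environment(info):
    """Render the collected details as aligned `key : value` lines."""
    width = max(len(key) for key in info)
    return "\n".join(f"{key:<{width}} : {value}" for key, value in info.items())
```

Printing `format_environment(collect_environment())` yields a block that could be pasted into a bug report in place of the manual checklist.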

csadorf · 2026-03-06T15:01:50Z

> One language tweak.
>
> Can we also check that the versions of the various CUDA components and cuml are compatible with each other? This goes beyond just "cuml can't be imported" and gives users an idea why it can't be imported.
>
> Maybe not for this PR but a future one: having a tool that collects information about your environment and presents it in a standard way would be useful to have. We could replace
>
> **Environment details (please complete the following information):**
>  - Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
>  - Linux Distro/Architecture: [Ubuntu 16.04 amd64]
>  - GPU Model/Driver: [V100 and driver 396.44]
>  - CUDA: [9.2]
>  - Method of cuDF & cuML install: [conda, Docker, or from source]
>    - If method of install is [conda], run `conda list` and include results here
>    - If method of install is [Docker], provide `docker pull` & `docker run` commands used
>    - If method of install is [from source], provide versions of `cmake` & `gcc/g++` and commit hash of build
>
> in our bug template with "Paste result of python -m cuml.health_checks" and get a well formatted, standardised output. I work on cuml and when I look at this part of the bug template I'm turned off, and think "just give me a command to run".
>
> The reason to include this in the health checks would be that (a) running the health checks might already resolve people's problem and (b) we could probably concoct more health checks based on the information about the environment (eg could point out known to be broken/impossible combinations)

I really like that idea. I might push it into a follow-up though.

csadorf · 2026-03-06T15:16:51Z

@coderabbitai create an issue for the follow-up work to gather environment information from healthchecks that could be used for bug reports.

coderabbitai · 2026-03-06T15:17:35Z


🧩 Analysis chain

@csadorf: I've created the issue for the follow-up work. Here's the link once the script runs — I'll share it below.

🏁 Script executed:

#!/bin/bash
gh issue create \
  --repo rapidsai/cuml \
  --title "Extend health checks to gather and report environment information for bug reports" \
  --assignee csadorf \
  --body "## Overview

As a follow-up to `#7854`, extend the \`cuml.health_checks\` module to collect and present standardized environment information. This would allow users to simply run \`python -m cuml.health_checks\` and include the output in bug reports, replacing the verbose manual environment checklist.

## Motivation

Currently, the bug report template asks contributors to manually collect:
- Environment location (bare-metal, Docker, Cloud)
- Linux distro/architecture
- GPU model and driver version
- CUDA version
- Method of cuML install (conda, Docker, source) with version details

This is a barrier to filing good bug reports. A single command that gathers this information in a standardized, well-formatted way would lower that barrier significantly.

## Proposed Work

- Add an environment information collector to \`cuml.health_checks\` that gathers:
  - CUDA version and compatibility with installed cuML
  - GPU model and driver version
  - Linux distro and architecture
  - cuML version and installation method
  - Other relevant RAPIDS component versions
- Surface known incompatible or broken version combinations as actionable warnings
- Update the bug report template to reference \`python -m cuml.health_checks\` instead of the manual checklist

## References

- PR `#7854` (initial health checks implementation)
- Issue `#7851`

Requested by `@csadorf` in https://github.com/rapidsai/cuml/pull/7854"

Length of output: 44

🏁 Script executed:

#!/bin/bash
gh issue list --repo rapidsai/cuml --author coderabbitai --limit 5 --json number,title,url

Length of output: 2

@csadorf: I've created the follow-up issue: #7862

It covers:

- Adding an environment information collector to cuml.health_checks (CUDA/cuML version compatibility, GPU model & driver, Linux distro/arch, install method, other RAPIDS component versions)
- Surfacing known incompatible version combinations as actionable warnings
- Updating the bug report template to replace the manual checklist with a reference to python -m cuml.health_checks

🧠 Learnings used

Learnt from: CR
Repo: rapidsai/cuml PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2026-02-13T14:37:44.019Z
Learning: Report bugs by filing an issue with output from cuml/print_env.sh script to gather environment details

csadorf · 2026-03-06T15:28:09Z

/merge

jacobtomlinson · 2026-03-06T15:52:28Z

I would love to try and encourage RAPIDS projects to do this in a consistent way. Would you be open to asking users to run rapids doctor cuml instead of python -m cuml.health_checks?

We also have rapids debug in the rapids-cli to let users easily give you a full environment dump. I would suggest asking users to run rapids doctor cuml and rapids debug and paste the results.

csadorf · 2026-03-06T15:54:39Z

> I would love to try and encourage RAPIDS projects to do this in a consistent way. Would you be open to asking users to run rapids doctor cuml instead of python -m cuml.health_checks?
>
> We also have rapids debug in the rapids-cli to let users easily give you a full environment dump. I would suggest asking users to run rapids doctor cuml and rapids debug and paste the results.

I don't want there to be an extra barrier for users to create bug reports. Unless we can make it so that the RAPIDS cli can be easily pip/conda installed, I'd rather not redirect them to a different tool.

jacobtomlinson · 2026-03-06T16:53:04Z

> Unless we can make it so that the RAPIDS cli can be easily pip/conda installed

This is already the case. It's pip install rapids-cli or conda install -c rapidsai rapids-cli.

csadorf · 2026-03-06T17:07:33Z

> > Unless we can make it so that the RAPIDS cli can be easily pip/conda installed
>
> This is already the case. It's pip install rapids-cli or conda install -c rapidsai rapids-cli.

That's good! I didn't see that documented on https://github.com/rapidsai/rapids-cli, but maybe I missed it.

betatim · 2026-03-09T09:27:56Z

I like the idea of having a uniform way of doing this (get info for an issue). It would be uniform for users and uniform to read the output.

A downside of using rapids doctor is that it is two commands (install and then run it) which is 100% more than using something like python -m cuml.health. This sounds like a trivial nitpicking complaint, but I think it is worth considering. As a user who has some failing thing I am already in a bad mood, filing an issue is work, installing yet another damn tool that might hose my environment just to get debug info, etc, etc - I can totally see the barriers to adoption.

I'm not sure what a good solution would look like. Some ideas:

- rapids-cli is a dependency of all RAPIDS packages (it is already installed when users need it),
- vendor rapids-cli in RAPIDS packages (no new dependency, but engineering challenge to sort out many vendored copies),
- don't expose this as a CLI command and instead instruct users to use cuml.get_info() (works even without a terminal and without knowing how to run terminal commands from notebooks, can discover health checks from across RAPIDS, no new dependency?)

Making this ridiculously easy to use is important, because the competition is deleting text from the issue template/leaving it blank.

jacobtomlinson · 2026-03-09T12:08:24Z

I totally understand the reasoning behind keeping things simple. The key things I'm advocating for are:

- A consistent way to debug your environment across all libraries
- Leveraging common checks (driver, CUDA versions, Python, etc) that we handle centrally in rapids-cli. Implementing these things over and over in each library feels wasteful.
- Additional tools like rapids debug to dump your environment in a consistent way.

I'm also -1 on making rapids-cli a dependency in all the libraries because it has dependencies like rich and click which would add bloat to library dependencies. We will include rapids-cli in the docker images and other distributions like DLFW.

Perhaps a compromise is to advise "Run python -m cuml.health_checks for minimal checks or pip install rapids-cli && rapids doctor for a full check of your environment". It's also pretty easy to do uv run --with rapids-cli rapids doctor.

csadorf · 2026-03-09T14:16:12Z

Yes, providing two-stage guidance would be the way to go: first, something that should "just work" as long as cuML is installed at all, and then a second set of commands that can be run in addition or instead, especially if the first command fails.

csadorf requested a review from a team as a code owner March 4, 2026 21:07

csadorf requested a review from betatim March 4, 2026 21:07

github-actions Bot assigned csadorf Mar 4, 2026

Implement health checks.

bc883a6

Closes rapidsai#7851

github-actions Bot added the Cython / Python Cython or Python issue label Mar 4, 2026

csadorf force-pushed the add-healthecks branch from d9cc0a6 to bc883a6 Compare March 4, 2026 21:07

csadorf added feature request New feature or request non-breaking Non-breaking change labels Mar 4, 2026

csadorf added a commit to csadorf/rapids-cli that referenced this pull request Mar 4, 2026

Add the cuML healthcheck entrypoints.

1456ce9

Depends on rapidsai/cuml#7854 Closes rapidsai/cuml#7851

csadorf mentioned this pull request Mar 4, 2026

Add the cuML healthcheck entrypoints. rapidsai/rapids-cli#138

Closed

coderabbitai Bot reviewed Mar 4, 2026


Comment thread python/cuml/cuml/health_checks/_checks.py

jacobtomlinson reviewed Mar 5, 2026


Add health check entrypoints to pyproject.toml

672cd4d

csadorf requested a review from a team as a code owner March 5, 2026 15:21

csadorf requested a review from KyleFromNVIDIA March 5, 2026 15:21

csadorf requested a review from jacobtomlinson March 5, 2026 15:22

jacobtomlinson approved these changes Mar 5, 2026


jameslamb approved these changes Mar 5, 2026


csadorf removed the request for review from KyleFromNVIDIA March 5, 2026 22:57

betatim reviewed Mar 6, 2026


Comment thread python/cuml/cuml/health_checks/_checks.py

csadorf requested a review from betatim March 6, 2026 15:13

coderabbitai Bot mentioned this pull request Mar 6, 2026

Extend health checks to gather and report environment information for bug reports #7862

Open

betatim approved these changes Mar 6, 2026


rapids-bot Bot merged commit d28fd9d into rapidsai:main Mar 6, 2026
167 of 170 checks passed

csadorf deleted the add-healthecks branch March 6, 2026 15:28

csadorf mentioned this pull request Mar 12, 2026

Add cuML health checks to rapids doctor #7851

Closed

5 tasks

ncclementi mentioned this pull request Apr 6, 2026

Find more rapids doctor checks from other RAPIDS projects rapidsai/rapids-cli#124

Open
