Implement health checks. #7854

Merged
rapids-bot[bot] merged 2 commits into rapidsai:main from csadorf:add-healthecks
Mar 6, 2026

@csadorf
Contributor

@csadorf csadorf commented Mar 4, 2026

Adds a cuml.health_checks module with smoke tests verifiable standalone and via rapids doctor.

Required for #7851

@csadorf csadorf requested a review from a team as a code owner March 4, 2026 21:07
@csadorf csadorf requested a review from betatim March 4, 2026 21:07
@github-actions github-actions Bot added the "Cython / Python" (Cython or Python issue) label Mar 4, 2026
@csadorf csadorf added the "feature request" (New feature or request) and "non-breaking" (Non-breaking change) labels Mar 4, 2026
csadorf added a commit to csadorf/rapids-cli that referenced this pull request Mar 4, 2026
@coderabbitai

coderabbitai Bot commented Mar 4, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d4043693-d9b3-464c-b7d5-9395fffd9483

📥 Commits

Reviewing files that changed from the base of the PR and between bc883a6 and 672cd4d.

📒 Files selected for processing (1)
  • python/cuml/pyproject.toml

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features

    • Added health checks to verify cuML installation and functionality. Run via python -m cuml.health_checks with optional --verbose flag, or integrate with RAPIDS CLI (rapids doctor).
  • Documentation

    • Added comprehensive health checks documentation covering standalone and RAPIDS CLI usage.
  • Tests

    • Added test coverage for health check functionality and registration.

Walkthrough

Adds a cuML health-checks feature: new health check functions, a CLI entry point for python -m cuml.health_checks, RAPIDS CLI plugin entry-points, documentation, and tests validating registration and function signatures.
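
As a rough illustration of what a runner like this could look like, here is a minimal `argparse` sketch. Only the `--verbose` flag and the idea of filtering checks by name come from the summary above. The registry contents, the placeholder check bodies, and the choice to pass names as positional arguments are assumptions for illustration, not the code merged in this PR.

```python
import argparse


def import_check(verbose=False, **kwargs):
    """Placeholder check (hypothetical body): succeeds and returns a message."""
    return "import ok" if verbose else None


def functional_check(verbose=False, **kwargs):
    """Placeholder check (hypothetical body): succeeds and returns a message."""
    return "functional ok" if verbose else None


# Hypothetical registry mapping check names to callables.
CHECKS = {"import": import_check, "functional": functional_check}


def main(argv=None):
    """Run the selected checks and return a process exit code."""
    parser = argparse.ArgumentParser(prog="python -m cuml.health_checks")
    parser.add_argument("--verbose", action="store_true")
    # Assumption: checks are selected by positional names; the real CLI may differ.
    parser.add_argument("names", nargs="*", help="checks to run (default: all)")
    args = parser.parse_args(argv)

    unknown = [n for n in args.names if n not in CHECKS]
    if unknown:
        parser.error(f"unknown check(s): {', '.join(unknown)}")

    failures = 0
    for name in args.names or list(CHECKS):
        try:
            message = CHECKS[name](verbose=args.verbose)
        except Exception as exc:  # a failing check signals failure by raising
            failures += 1
            print(f"[FAIL] {name}: {exc}")
        else:
            print(f"[ OK ] {name}" + (f": {message}" if message else ""))
    return 1 if failures else 0
```

In a `__main__.py`, such a function would typically be invoked as `raise SystemExit(main())` so that a failing check produces a non-zero exit status.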

Changes

| Cohort / File(s) | Summary |
|---|---|
| Documentation: `docs/source/health_checks.rst`, `docs/source/user_guide.rst` | New docs page describing standalone and RAPIDS CLI usage; user guide toctree updated to include the page. |
| Health checks package: `python/cuml/cuml/health_checks/__init__.py`, `python/cuml/cuml/health_checks/__main__.py`, `python/cuml/cuml/health_checks/_checks.py` | New package exposing four checks (import, functional, accel-basic, accel-cli), CLI runner (main) supporting verbose and name filtering, subprocess-based accel checks with timeout and detailed error handling. |
| Tests: `python/cuml/tests/test_health_checks.py` | New parametrized tests ensuring all public checks are registered, runnable with verbose=True, and conform to the required signature (verbose param + `**kwargs`). |
| Packaging / Entry points: `python/cuml/pyproject.toml` | Adds `project.entry-points.rapids_doctor_check` entries registering the four checks as rapids doctor plugins. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Implement health checks' directly summarizes the main change: adding a cuml.health_checks module with smoke tests. |
| Description check | ✅ Passed | The description is related to the changeset, explaining that the PR adds a cuml.health_checks module with smoke tests for standalone and rapids doctor usage. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.89% which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
python/cuml/tests/test_health_checks.py (1)

24-27: Rename unused parametrized argument for clarity.

name is only used for parametrization setup and not inside the test body. Rename it to _name (or parameterize only check_fn) to make intent explicit and keep lint clean.

♻️ Small cleanup
-@pytest.mark.parametrize("name,check_fn", _CHECKS, ids=_CHECK_IDS)
-def test_health_check(name, check_fn):
+@pytest.mark.parametrize("_name,check_fn", _CHECKS, ids=_CHECK_IDS)
+def test_health_check(_name, check_fn):
     """Each registered health check should pass."""
     check_fn(verbose=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuml/tests/test_health_checks.py` around lines 24 - 27, Rename the
unused parametrized argument in the test function to make intent explicit: in
test_health_check change the parameter list from (name, check_fn) to (_name,
check_fn) (or alternatively remove name from the param list and adjust the
parametrize to only supply check_fn) so that the unused "name" symbol is clearly
marked and lint warnings are resolved; update the test signature where
test_health_check and the _CHECKS/_CHECK_IDS parametrize decorator are defined.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@python/cuml/cuml/health_checks/_checks.py`:
- Around line 54-147: Add a new check function (e.g.,
accel_memory_check(verbose=False, **kwargs)) alongside accel_basic_check and
accel_cli_check that explicitly queries the GPU free memory (use NVML via pynvml
or CUDA API), compares it to a configurable minimum threshold (e.g.,
_MIN_GPU_MEMORY_BYTES or a min_memory kwarg), and fails with a clear
RuntimeError/AssertionError including available vs required bytes and
remediation guidance when memory is insufficient; ensure the function returns a
concise success message when verbose is True and register this new
accel_memory_check in the module's checks registry so it runs with the other
accel checks (accel_basic_check, accel_cli_check).

---

Nitpick comments:
In `@python/cuml/tests/test_health_checks.py`:
- Around line 24-27: Rename the unused parametrized argument in the test
function to make intent explicit: in test_health_check change the parameter list
from (name, check_fn) to (_name, check_fn) (or alternatively remove name from
the param list and adjust the parametrize to only supply check_fn) so that the
unused "name" symbol is clearly marked and lint warnings are resolved; update
the test signature where test_health_check and the _CHECKS/_CHECK_IDS
parametrize decorator are defined.
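
For reference, the `accel_memory_check` proposed in the inline comment above could be sketched roughly as follows. This is an illustration of the suggestion only: the summary of the package in this thread lists four checks and no memory check, and the threshold value, the `min_memory` and `query_free` keyword names, and the remediation wording are all assumptions. The NVML query uses `pynvml` and is imported lazily, so the function can be exercised without a GPU by injecting a different `query_free` callable.

```python
_MIN_GPU_MEMORY_BYTES = 1 << 30  # hypothetical default threshold: 1 GiB


def _query_free_gpu_memory(device_index=0):
    """Return free memory in bytes on one GPU via NVML (needs pynvml and a GPU)."""
    import pynvml  # imported lazily so this module loads without pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).free
    finally:
        pynvml.nvmlShutdown()


def accel_memory_check(verbose=False, *, min_memory=_MIN_GPU_MEMORY_BYTES,
                       query_free=_query_free_gpu_memory, **kwargs):
    """Fail if free GPU memory is below min_memory (a sketch, not cuML API)."""
    free = query_free()
    if free < min_memory:
        raise RuntimeError(
            f"Insufficient free GPU memory: {free} bytes available, "
            f"{min_memory} bytes required. Free up GPU memory (for example by "
            "stopping other GPU processes) and retry."
        )
    if verbose:
        return f"GPU memory OK: {free} bytes free (minimum {min_memory})"
    return None
```

Passing the query function as a parameter keeps the threshold logic testable on machines without a GPU, at the cost of a slightly wider signature than the other checks.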

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3578926c-4f63-4773-ac61-a328d32d57b3

📥 Commits

Reviewing files that changed from the base of the PR and between 631c8b9 and bc883a6.

📒 Files selected for processing (6)
  • docs/source/health_checks.rst
  • docs/source/user_guide.rst
  • python/cuml/cuml/health_checks/__init__.py
  • python/cuml/cuml/health_checks/__main__.py
  • python/cuml/cuml/health_checks/_checks.py
  • python/cuml/tests/test_health_checks.py

Comment thread python/cuml/cuml/health_checks/_checks.py
Member

@jacobtomlinson jacobtomlinson left a comment

These checks look great thanks @csadorf. They also need to be registered in the pyproject.toml in this repo, not the rapids-cli one.

@csadorf csadorf requested a review from a team as a code owner March 5, 2026 15:21
@csadorf csadorf requested a review from KyleFromNVIDIA March 5, 2026 15:21
@csadorf
Contributor Author

csadorf commented Mar 5, 2026

These checks look great thanks @csadorf. They also need to be registered in the pyproject.toml in this repo, not the rapids-cli one.

@jacobtomlinson Fixed in 672cd4d.

@csadorf csadorf requested a review from jacobtomlinson March 5, 2026 15:22
Member

@jacobtomlinson jacobtomlinson left a comment

This looks great thanks @csadorf

Member

@jameslamb jameslamb left a comment

oh cool idea!!

@csadorf csadorf removed the request for review from KyleFromNVIDIA March 5, 2026 22:57
Member

@betatim betatim left a comment

One language tweak.

Can we also check that the versions of the various CUDA components and cuml are compatible with each other? This goes beyond just "cuml can't be imported" and gives users an idea why it can't be imported.


Maybe not for this PR but a future one: having a tool that collects information about your environment and presents it in a standard way would be useful to have. We could replace

**Environment details (please complete the following information):**
 - Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
 - Linux Distro/Architecture: [Ubuntu 16.04 amd64]
 - GPU Model/Driver: [V100 and driver 396.44]
 - CUDA: [9.2]
 - Method of cuDF & cuML install: [conda, Docker, or from source]
   - If method of install is [conda], run `conda list` and include results here
   - If method of install is [Docker], provide `docker pull` & `docker run` commands used
   - If method of install is [from source], provide versions of `cmake` & `gcc/g++` and commit hash of build

in our bug template with "Paste result of python -m cuml.health_checks" and get a well formatted, standardised output. I work on cuml and when I look at this part of the bug template I'm turned off, and think "just give me a command to run".

The reason to include this in the health checks would be that (a) running the health checks might already resolve people's problem and (b) we could probably concoct more health checks based on the information about the environment (eg could point out known to be broken/impossible combinations)

Comment thread python/cuml/cuml/health_checks/_checks.py
@csadorf
Contributor Author

csadorf commented Mar 6, 2026

One language tweak.

Can we also check that the versions of the various CUDA components and cuml are compatible with each other? This goes beyond just "cuml can't be imported" and gives users an idea why it can't be imported.

Maybe not for this PR but a future one: having a tool that collects information about your environment and presents it in a standard way would be useful to have. We could replace

**Environment details (please complete the following information):**
 - Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
 - Linux Distro/Architecture: [Ubuntu 16.04 amd64]
 - GPU Model/Driver: [V100 and driver 396.44]
 - CUDA: [9.2]
 - Method of cuDF & cuML install: [conda, Docker, or from source]
   - If method of install is [conda], run `conda list` and include results here
   - If method of install is [Docker], provide `docker pull` & `docker run` commands used
   - If method of install is [from source], provide versions of `cmake` & `gcc/g++` and commit hash of build

in our bug template with "Paste result of python -m cuml.health_checks" and get a well formatted, standardised output. I work on cuml and when I look at this part of the bug template I'm turned off, and think "just give me a command to run".

The reason to include this in the health checks would be that (a) running the health checks might already resolve people's problem and (b) we could probably concoct more health checks based on the information about the environment (eg could point out known to be broken/impossible combinations)

I really like that idea. I might push it into a follow-up though.

@csadorf csadorf requested a review from betatim March 6, 2026 15:13
@csadorf
Contributor Author

csadorf commented Mar 6, 2026

@coderabbitai create an issue for the follow-up work to gather environment information from healthchecks that could be used for bug reports.

@coderabbitai

coderabbitai Bot commented Mar 6, 2026

🧩 Analysis chain

@csadorf: I've created the issue for the follow-up work. Here's the link once the script runs — I'll share it below.


🏁 Script executed:

#!/bin/bash
gh issue create \
  --repo rapidsai/cuml \
  --title "Extend health checks to gather and report environment information for bug reports" \
  --assignee csadorf \
  --body "## Overview

As a follow-up to `#7854`, extend the \`cuml.health_checks\` module to collect and present standardized environment information. This would allow users to simply run \`python -m cuml.health_checks\` and include the output in bug reports, replacing the verbose manual environment checklist.

## Motivation

Currently, the bug report template asks contributors to manually collect:
- Environment location (bare-metal, Docker, Cloud)
- Linux distro/architecture
- GPU model and driver version
- CUDA version
- Method of cuML install (conda, Docker, source) with version details

This is a barrier to filing good bug reports. A single command that gathers this information in a standardized, well-formatted way would lower that barrier significantly.

## Proposed Work

- Add an environment information collector to \`cuml.health_checks\` that gathers:
  - CUDA version and compatibility with installed cuML
  - GPU model and driver version
  - Linux distro and architecture
  - cuML version and installation method
  - Other relevant RAPIDS component versions
- Surface known incompatible or broken version combinations as actionable warnings
- Update the bug report template to reference \`python -m cuml.health_checks\` instead of the manual checklist

## References

- PR `#7854` (initial health checks implementation)
- Issue `#7851`

Requested by `@csadorf` in https://github.com/rapidsai/cuml/pull/7854"

Length of output: 44


🏁 Script executed:

#!/bin/bash
gh issue list --repo rapidsai/cuml --author coderabbitai --limit 5 --json number,title,url

Length of output: 2


@csadorf: I've created the follow-up issue: #7862

It covers:

  • Adding an environment information collector to cuml.health_checks (CUDA/cuML version compatibility, GPU model & driver, Linux distro/arch, install method, other RAPIDS component versions)
  • Surfacing known incompatible version combinations as actionable warnings
  • Updating the bug report template to replace the manual checklist with a reference to python -m cuml.health_checks

🧠 Learnings used
Learnt from: CR
Repo: rapidsai/cuml PR: 0
File: CONTRIBUTING.md:0-0
Timestamp: 2026-02-13T14:37:44.019Z
Learning: Report bugs by filing an issue with output from cuml/print_env.sh script to gather environment details

@csadorf
Contributor Author

csadorf commented Mar 6, 2026

/merge

@rapids-bot rapids-bot Bot merged commit d28fd9d into rapidsai:main Mar 6, 2026
167 of 170 checks passed
@csadorf csadorf deleted the add-healthecks branch March 6, 2026 15:28
@jacobtomlinson
Member

I would love to try and encourage RAPIDS projects to do this in a consistent way. Would you be open to asking users to run rapids doctor cuml instead of python -m cuml.health_checks?

We also have rapids debug in the rapids-cli to let users easily give you a full environment dump. I would suggest asking users to run rapids doctor cuml and rapids debug and paste the results.

@csadorf
Contributor Author

csadorf commented Mar 6, 2026

I would love to try and encourage RAPIDS projects to do this in a consistent way. Would you be open to asking users to run rapids doctor cuml instead of python -m cuml.health_checks?

We also have rapids debug in the rapids-cli to let users easily give you a full environment dump. I would suggest asking users to run rapids doctor cuml and rapids debug and paste the results.

I don't want there to be an extra barrier for users to create bug reports. Unless we can make it so that the RAPIDS cli can be easily pip/conda installed, I'd rather not redirect them to a different tool.

@jacobtomlinson
Member

Unless we can make it so that the RAPIDS cli can be easily pip/conda installed

This is already the case. It's pip install rapids-cli or conda install -c rapidsai rapids-cli.

@csadorf
Contributor Author

csadorf commented Mar 6, 2026

Unless we can make it so that the RAPIDS cli can be easily pip/conda installed

This is already the case. It's pip install rapids-cli or conda install -c rapidsai rapids-cli.

That's good! I didn't see that documented on https://github.com/rapidsai/rapids-cli, but maybe I missed it.

@betatim
Member

betatim commented Mar 9, 2026

I like the idea of having a uniform way of doing this (get info for an issue). It would be uniform for users and uniform to read the output.

A downside of using rapids doctor is that it is two commands (install and then run it) which is 100% more than using something like python -m cuml.health. This sounds like a trivial nitpicking complaint, but I think it is worth considering. As a user who has some failing thing I am already in a bad mood, filing an issue is work, installing yet another damn tool that might hose my environment just to get debug info, etc, etc - I can totally see the barriers to adoption.

I'm not sure what a good solution would look like. Some ideas:

  • rapids-cli is a dependency of all RAPIDS packages (it is already installed when users need it),
  • vendor rapids-cli in RAPIDS packages (no new dependency, but engineering challenge to sort out many vendored copies),
  • don't expose this as a CLI command and instead instruct users to use cuml.get_info() (works even without a terminal and without knowing how to run terminal commands from notebooks, can discover health checks from across RAPIDS, no new dependency?)

Making this ridiculously easy to use is important, because the competition is deleting text from the issue template/leaving it blank.

@jacobtomlinson
Member

I totally understand the reasoning behind keeping things simple. The key things I'm advocating for are:

  • A consistent way to debug your environment across all libraries
  • Leveraging common checks (driver, CUDA versions, Python, etc) that we handle centrally in rapids-cli. Implementing these things over and over in each library feels wasteful.
  • Additional tools like rapids debug to dump your environment in a consistent way.

I'm also -1 on making rapids-cli a dependency in all the libraries because it has dependencies like rich and click which would add bloat to library dependencies. We will include rapids-cli in the docker images and other distributions like DLFW.

Perhaps a compromise is to advise "Run python -m cuml.health_checks for minimal checks or pip install rapids-cli && rapids doctor for a full check of your environment". It's also pretty easy to do uv run --with rapids-cli rapids doctor.

@csadorf
Contributor Author

csadorf commented Mar 9, 2026

Yes, providing two-stage guidance would be the way to go: something that should "just work" assuming that cuML is installed at all, and then a second set of commands that can be run either additionally or alternatively, especially if the first command fails.
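
The two-stage guidance discussed here could be expressed as a small helper that picks which commands to recommend based on what is available in the user's environment. This is only a sketch: the function name `suggest_diagnostic_commands` is hypothetical, and the commands it returns are the ones mentioned in this thread.

```python
import importlib.util
import shutil


def suggest_diagnostic_commands():
    """Return the diagnostic commands to recommend, in the order to try them."""
    commands = []
    # Stage 1: the built-in checks only need cuML itself to be installed.
    if importlib.util.find_spec("cuml") is not None:
        commands.append("python -m cuml.health_checks --verbose")
    # Stage 2: the fuller checks need the separate rapids-cli tool.
    if shutil.which("rapids") is not None:
        commands.append("rapids doctor")
        commands.append("rapids debug")
    else:
        commands.append("pip install rapids-cli && rapids doctor")
    return commands
```

The lookup uses `find_spec` rather than importing cuML, so the helper still works when the import itself is what is broken.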

Labels

Cython / Python (Cython or Python issue), feature request (New feature or request), non-breaking (Non-breaking change)

5 participants