Implement health checks.#7854
Conversation
Depends on rapidsai/cuml#7854 Closes rapidsai/cuml#7851
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughSummary by CodeRabbitRelease Notes
WalkthroughAdds a cuML health-checks feature: new health check functions, a CLI entry point for Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
python/cuml/tests/test_health_checks.py (1)
24-27: Rename unused parametrized argument for clarity.
nameis only used for parametrization setup and not inside the test body. Rename it to_name(or parameterize onlycheck_fn) to make intent explicit and keep lint clean.♻️ Small cleanup
-@pytest.mark.parametrize("name,check_fn", _CHECKS, ids=_CHECK_IDS) -def test_health_check(name, check_fn): +@pytest.mark.parametrize("name,check_fn", _CHECKS, ids=_CHECK_IDS) +def test_health_check(_name, check_fn): """Each registered health check should pass.""" check_fn(verbose=True)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuml/tests/test_health_checks.py` around lines 24 - 27, Rename the unused parametrized argument in the test function to make intent explicit: in test_health_check change the parameter list from (name, check_fn) to (_name, check_fn) (or alternatively remove name from the param list and adjust the parametrize to only supply check_fn) so that the unused "name" symbol is clearly marked and lint warnings are resolved; update the test signature where test_health_check and the _CHECKS/_CHECK_IDS parametrize decorator are defined.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/cuml/health_checks/_checks.py`:
- Around line 54-147: Add a new check function (e.g.,
accel_memory_check(verbose=False, **kwargs)) alongside accel_basic_check and
accel_cli_check that explicitly queries the GPU free memory (use NVML via pynvml
or CUDA API), compares it to a configurable minimum threshold (e.g.,
_MIN_GPU_MEMORY_BYTES or a min_memory kwarg), and fails with a clear
RuntimeError/AssertionError including available vs required bytes and
remediation guidance when memory is insufficient; ensure the function returns a
concise success message when verbose is True and register this new
accel_memory_check in the module's checks registry so it runs with the other
accel checks (accel_basic_check, accel_cli_check).
---
Nitpick comments:
In `@python/cuml/tests/test_health_checks.py`:
- Around line 24-27: Rename the unused parametrized argument in the test
function to make intent explicit: in test_health_check change the parameter list
from (name, check_fn) to (_name, check_fn) (or alternatively remove name from
the param list and adjust the parametrize to only supply check_fn) so that the
unused "name" symbol is clearly marked and lint warnings are resolved; update
the test signature where test_health_check and the _CHECKS/_CHECK_IDS
parametrize decorator are defined.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3578926c-4f63-4773-ac61-a328d32d57b3
📒 Files selected for processing (6)
docs/source/health_checks.rstdocs/source/user_guide.rstpython/cuml/cuml/health_checks/__init__.pypython/cuml/cuml/health_checks/__main__.pypython/cuml/cuml/health_checks/_checks.pypython/cuml/tests/test_health_checks.py
jacobtomlinson
left a comment
There was a problem hiding this comment.
These checks look great thanks @csadorf. They also need to be registered in the pyproject.toml in this repo, not the rapids-cli one.
@jacobtomlinson Fixed in 672cd4d . |
jacobtomlinson
left a comment
There was a problem hiding this comment.
This looks great thanks @csadorf
betatim
left a comment
There was a problem hiding this comment.
One language tweak.
Can we also check that the versions of the various CUDA components and cuml are compatible with each other? This goes beyond just "cuml can't be imported" and gives users an idea why it can't be imported.
Maybe not for this PR but a future one: having a tool that collects information about your environment and presents it in a standard way would be useful to have. We could replace
**Environment details (please complete the following information):**
- Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
- Linux Distro/Architecture: [Ubuntu 16.04 amd64]
- GPU Model/Driver: [V100 and driver 396.44]
- CUDA: [9.2]
- Method of cuDF & cuML install: [conda, Docker, or from source]
- If method of install is [conda], run `conda list` and include results here
- If method of install is [Docker], provide `docker pull` & `docker run` commands used
- If method of install is [from source], provide versions of `cmake` & `gcc/g++` and commit hash of build
in our bug template with "Paste result of python -m cuml.health_checks" and get a well formatted, standardised output. I work on cuml and when I look at this part of the bug template I'm turned off, and think "just give me a command to run".
The reason to include this in the health checks would be that (a) running the health checks might already resolve people's problem and (b) we could probably concoct more health checks based on the information about the environment (eg could point out known to be broken/impossible combinations)
I really like that idea. I might push it into a follow-up though. |
|
@coderabbitai create an issue for the follow-up work to gather environment information from healthchecks that could be used for bug reports. |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain
🏁 Script executed: Length of output: 44 🏁 Script executed: Length of output: 2
It covers:
🧠 Learnings used |
|
/merge |
|
I would love to try and encourage RAPIDS projects to do this in a consistent way. Would you be open to asking users to run We also have |
I don't want there to be an extra barrier for users to create bug reports. Unless we can make it so that the RAPIDS cli can be easily pip/conda installed, I'd rather not redirect them to a different tool. |
This is already the case. It's |
That's good! I didn't see that documented on https://github.com/rapidsai/rapids-cli, but maybe I missed it. |
|
I like the idea of having a uniform way of doing this (get info for an issue). It would be uniform for users and uniform to read the output. A downside of using I'm not sure what a good solution would look like. Some ideas:
Making this ridiculously easy to use is important, because the competition is deleting text from the issue template/leaving it blank. |
|
I totally understand the reasoning behind keeping things simple. The key things I'm advocating for are:
I'm also -1 on making Perhaps a compromise is to advise "Run |
|
Yes, providing two-staged guidance would be the way to go. Something that should "just work" assuming that cuML is installed at all and then a second set of commands that can be run either additionally or alternatively especially in case that the first command fails. |
Adds a
cuml.health_checksmodule with smoke tests verifiable standalone and viarapids doctor.Required for #7851