Implement health checks. #7854
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| .. SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION. | ||
| .. SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| Health Checks | ||
| ============= | ||
|
|
||
| cuML provides a small set of health checks (smoke tests) to verify that cuML is | ||
| working correctly after installation or as part of automated processes such as | ||
| CI. These checks are also used by the RAPIDS CLI's ``rapids doctor`` command | ||
| when the CLI is installed. | ||
|
|
||
| Run standalone | ||
| -------------- | ||
|
|
||
| You can run all cuML health checks from the command line: | ||
|
|
||
| .. code-block:: console | ||
|
|
||
| python -m cuml.health_checks | ||
|
|
||
| Use ``--verbose`` or ``-v`` for extra output when a check passes. The command | ||
| exits with 0 if all checks pass, or 1 if any check fails. | ||
|
|
||
| Run via RAPIDS CLI | ||
| ------------------ | ||
|
|
||
| When `rapids-cli <https://github.com/rapidsai/rapids-cli>`_ is installed, the | ||
| same cuML checks are registered as plugins and run as part of: | ||
|
|
||
| .. code-block:: console | ||
|
|
||
| rapids doctor | ||
|
|
||
| See the `rapids-cli documentation | ||
| <https://github.com/rapidsai/rapids-cli#check-plugins>`_ for how checks are | ||
| discovered and how to run with ``--verbose`` or filter by name. |
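
Because the documented command exits with 0 on success and 1 on any failure, a CI job can gate on the return code alone. The following is a minimal sketch of that idea, not part of the PR: the helper name `health_checks_pass` and the 600 second timeout are assumptions made for illustration, and the default command only works where cuML is installed.

```python
import subprocess
import sys

# The command documented above; running it requires cuML to be installed.
DEFAULT_COMMAND = [sys.executable, "-m", "cuml.health_checks"]


def health_checks_pass(command=None, timeout=600):
    """Return True if the health-check command exits with status 0.

    Hypothetical helper for this sketch; the timeout value is arbitrary.
    """
    completed = subprocess.run(
        command or DEFAULT_COMMAND,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # Echo the per-check "name: OK" / "name: FAIL - ..." lines into the CI log.
    print(completed.stdout, end="")
    return completed.returncode == 0
```

A CI step could then call `sys.exit(0 if health_checks_pass() else 1)`. Passing a different `command` makes the helper testable on machines without a GPU.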
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| # | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
|
|
||
| """Health checks for cuML, used by ``rapids doctor`` and runnable via ``python -m cuml.health_checks``.""" | ||
|
|
||
| from cuml.health_checks._checks import ( | ||
| accel_basic_check, | ||
| accel_cli_check, | ||
| functional_check, | ||
| import_check, | ||
| ) | ||
|
|
||
| __all__ = ( | ||
| "accel_basic_check", | ||
| "accel_cli_check", | ||
| "functional_check", | ||
| "import_check", | ||
| ) |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| # | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
|
|
||
| """Run cuML health checks when invoked as ``python -m cuml.health_checks``.""" | ||
|
|
||
| import argparse | ||
| import sys | ||
|
|
||
| from cuml.health_checks import ( | ||
| accel_basic_check, | ||
| accel_cli_check, | ||
| functional_check, | ||
| import_check, | ||
| ) | ||
|
|
||
| _CHECKS = [ | ||
| ("import", import_check), | ||
| ("functional", functional_check), | ||
| ("accel-basic", accel_basic_check), | ||
| ("accel-cli", accel_cli_check), | ||
| ] | ||
|
|
||
|
|
||
| _CHECK_NAMES = [name for name, _ in _CHECKS] | ||
|
|
||
|
|
||
| def main(argv=None): | ||
| parser = argparse.ArgumentParser( | ||
| prog="python -m cuml.health_checks", | ||
| description="Run cuML health checks.", | ||
| ) | ||
| parser.add_argument( | ||
| "-v", | ||
| "--verbose", | ||
| action="store_true", | ||
| default=False, | ||
| help="Print extra output when a check passes.", | ||
| ) | ||
| parser.add_argument( | ||
| "checks", | ||
| nargs="*", | ||
| metavar="CHECK", | ||
| choices=_CHECK_NAMES, | ||
| help=( | ||
| f"Names of checks to run (default: all). " | ||
| f"Available: {', '.join(_CHECK_NAMES)}" | ||
| ), | ||
| ) | ||
| args = parser.parse_args(argv) | ||
|
|
||
| selected = set(args.checks) if args.checks else None | ||
| failed = False | ||
| for name, check_fn in _CHECKS: | ||
| if selected is not None and name not in selected: | ||
| continue | ||
| try: | ||
| result = check_fn(verbose=args.verbose) | ||
| print(f"{name}: OK") | ||
| if args.verbose and result: | ||
| print(f" {result}") | ||
| except Exception as e: | ||
| print(f"{name}: FAIL - {e}") | ||
| failed = True | ||
| return 1 if failed else 0 | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| sys.exit(main()) |
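
One consequence of passing `choices=_CHECK_NAMES` to the positional argument is that an unknown check name is rejected by `argparse` before any check runs. `argparse` reports such errors by exiting with status 2, which is distinct from the 0/1 results of the checks themselves. The sketch below rebuilds only the parser from the diff so that it runs without cuML; `build_parser` and `parse_exit_status` are hypothetical names used for illustration.

```python
import argparse

# Same names as the first elements of _CHECKS in the diff above.
CHECK_NAMES = ["import", "functional", "accel-basic", "accel-cli"]


def build_parser():
    """Rebuild the argument parser from the diff, without importing cuML."""
    parser = argparse.ArgumentParser(prog="python -m cuml.health_checks")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument(
        "checks", nargs="*", metavar="CHECK", choices=CHECK_NAMES
    )
    return parser


def parse_exit_status(argv):
    """Return None if argv parses, else the status argparse exits with."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        return exc.code
    return None
```

Under this sketch, `parse_exit_status(["-v", "import"])` returns `None`, while `parse_exit_status(["bogus"])` returns 2 after `argparse` prints a usage message to stderr.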
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,147 @@ | ||
| # | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
|
|
||
| """Implementation of cuML health checks for rapids doctor and standalone use.""" | ||
|
|
||
|
|
||
| def import_check(verbose=False, **kwargs): | ||
| """Check that cuML can be imported. | ||
|
|
||
| Mainly useful when invoked programmatically; when run via rapids doctor, | ||
| cuml is typically already loaded. On failure, use the RAPIDS install docs. | ||
| """ | ||
| try: | ||
| import cuml | ||
| except ImportError as e: | ||
| raise ImportError( | ||
| "cuML could not be imported. Install cuML with conda or pip as " | ||
| "described at https://docs.rapids.ai/install/" | ||
| ) from e | ||
| if verbose: | ||
| return f"cuML {cuml.__version__} is available" | ||
|
|
||
|
|
||
| def functional_check(verbose=False, **kwargs): | ||
| """Check that a basic cuML estimator can fit and predict.""" | ||
| import numpy as np | ||
|
|
||
| from cuml.linear_model import LinearRegression | ||
|
|
||
| X = np.array([[1], [2], [3], [4]], dtype=np.float32) | ||
| y = np.array([1, 2, 3, 4], dtype=np.float32) | ||
| model = LinearRegression() | ||
| model.fit(X, y) | ||
| pred = model.predict(X) | ||
| if pred.shape != (4,): | ||
| raise AssertionError( | ||
| f"Expected predictions of shape (4,), got {pred.shape}" | ||
| ) | ||
| pred = np.asarray(pred, dtype=np.float32) | ||
| if not np.allclose(pred, y, atol=0.1): | ||
| raise AssertionError( | ||
| f"LinearRegression predictions differ from expected: " | ||
| f"got {pred.tolist()}, expected {y.tolist()}" | ||
| ) | ||
| if verbose: | ||
| return "LinearRegression fit/predict succeeded" | ||
|
|
||
|
|
||
| _SUBPROCESS_TIMEOUT = 120 | ||
|
|
||
|
|
||
| def accel_basic_check(verbose=False, **kwargs): | ||
| """Check that cuml.accel can be installed and intercepts sklearn.""" | ||
| import subprocess | ||
| import sys | ||
|
|
||
| script = ( | ||
| "import cuml.accel; cuml.accel.install(); " | ||
| "from sklearn.ensemble import RandomForestClassifier; " | ||
| "assert cuml.accel.is_proxy(RandomForestClassifier), " | ||
| "'RandomForestClassifier is not a cuml.accel proxy'; " | ||
| "from sklearn.datasets import make_classification; " | ||
| "X, y = make_classification(n_samples=100, random_state=0); " | ||
| "RandomForestClassifier(n_estimators=10).fit(X, y)" | ||
| ) | ||
| try: | ||
| result = subprocess.run( | ||
| [sys.executable, "-c", script], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=_SUBPROCESS_TIMEOUT, | ||
| ) | ||
| except subprocess.TimeoutExpired: | ||
| raise RuntimeError( | ||
| f"cuml.accel subprocess check timed out after " | ||
| f"{_SUBPROCESS_TIMEOUT}s" | ||
| ) | ||
| if result.returncode != 0: | ||
| stderr = result.stderr.strip() | ||
| detail = ( | ||
| "\n".join(stderr.splitlines()[-5:]) if stderr else "unknown error" | ||
| ) | ||
| raise RuntimeError(f"cuml.accel subprocess check failed:\n{detail}") | ||
| if verbose: | ||
| return ( | ||
| "cuml.accel intercepted sklearn and fit a RandomForestClassifier" | ||
| ) | ||
|
|
||
|
|
||
| def accel_cli_check(verbose=False, **kwargs): | ||
| """Check that python -m cuml.accel runs sklearn code on the GPU.""" | ||
| import os | ||
| import subprocess | ||
| import sys | ||
| import tempfile | ||
|
|
||
| script_content = ( | ||
| "from sklearn.datasets import make_classification\n" | ||
| "from sklearn.ensemble import RandomForestClassifier\n" | ||
| "X, y = make_classification(n_samples=200, random_state=0)\n" | ||
| "clf = RandomForestClassifier(n_estimators=10, random_state=0)\n" | ||
| "clf.fit(X, y)\n" | ||
| "clf.predict(X)\n" | ||
| ) | ||
| fd, script_path = tempfile.mkstemp(suffix=".py") | ||
| try: | ||
| with os.fdopen(fd, "w") as f: | ||
| f.write(script_content) | ||
|
|
||
| try: | ||
| result = subprocess.run( | ||
| [sys.executable, "-m", "cuml.accel", "--verbose", script_path], | ||
| capture_output=True, | ||
| text=True, | ||
| timeout=_SUBPROCESS_TIMEOUT, | ||
| ) | ||
| except subprocess.TimeoutExpired: | ||
| raise RuntimeError( | ||
| f"python -m cuml.accel --verbose timed out after " | ||
| f"{_SUBPROCESS_TIMEOUT}s" | ||
| ) | ||
| finally: | ||
| os.unlink(script_path) | ||
|
|
||
| if result.returncode != 0: | ||
| stderr = result.stderr.strip() | ||
| detail = ( | ||
| "\n".join(stderr.splitlines()[-5:]) if stderr else "unknown error" | ||
| ) | ||
| raise RuntimeError(f"python -m cuml.accel --verbose failed:\n{detail}") | ||
|
|
||
| output = result.stdout | ||
| if "ran on GPU" not in output: | ||
| raise AssertionError( | ||
| "cuml.accel --verbose output missing 'ran on GPU':\n" + output | ||
| ) | ||
| if "falling back to CPU" in output or "ran on CPU" in output: | ||
| raise AssertionError( | ||
| "cuml.accel --verbose reported CPU fallbacks:\n" + output | ||
| ) | ||
| if verbose: | ||
| return ( | ||
| "python -m cuml.accel --verbose ran sklearn code on GPU " | ||
| "with no fallbacks" | ||
| ) | ||
|
|
||
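
Both `accel_*` checks follow the same pattern: run the code in a fresh interpreter, enforce a timeout, and report only the last five lines of stderr on failure. Running in a subprocess presumably keeps the effects of `cuml.accel.install()` out of the calling process, though the PR does not state this. The pattern can be sketched without cuML as follows; `run_isolated` is a hypothetical helper, not a function from the PR.

```python
import subprocess
import sys


def run_isolated(script, timeout=120, tail=5):
    """Run `script` in a fresh interpreter; raise RuntimeError on failure.

    Hypothetical helper mirroring the subprocess handling in the diff above.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"subprocess timed out after {timeout}s") from exc
    if result.returncode != 0:
        stderr = result.stderr.strip()
        # Keep only the final lines, where a traceback states the actual error.
        detail = (
            "\n".join(stderr.splitlines()[-tail:]) if stderr else "unknown error"
        )
        raise RuntimeError(f"subprocess check failed:\n{detail}")
    return result.stdout
```

Truncating stderr keeps the `name: FAIL - ...` line printed by the runner short while still showing the exception type and message from the child process.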
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,62 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| import inspect | ||
|
|
||
| import pytest | ||
|
|
||
| from cuml.health_checks import _checks | ||
| from cuml.health_checks.__main__ import _CHECKS | ||
|
|
||
|
|
||
| def _get_public_check_functions(): | ||
| """Return all public functions defined in _checks module.""" | ||
| return { | ||
| name: obj | ||
| for name, obj in inspect.getmembers(_checks, inspect.isfunction) | ||
| if not name.startswith("_") and obj.__module__ == _checks.__name__ | ||
| } | ||
|
|
||
|
|
||
| _CHECK_IDS = [name for name, _ in _CHECKS] | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("name,check_fn", _CHECKS, ids=_CHECK_IDS) | ||
| def test_health_check(name, check_fn): | ||
| """Each registered health check should pass.""" | ||
| check_fn(verbose=True) | ||
|
|
||
|
|
||
| def test_all_checks_registered(): | ||
| """Every public function in _checks must appear in _CHECKS.""" | ||
| registered_fns = {fn for _, fn in _CHECKS} | ||
| public_fns = _get_public_check_functions() | ||
| missing = { | ||
| name for name, fn in public_fns.items() if fn not in registered_fns | ||
| } | ||
| assert not missing, ( | ||
| f"Public check functions not registered in _CHECKS: {missing}" | ||
| ) | ||
|
|
||
|
|
||
| def test_check_function_signatures(): | ||
| """All check functions must accept (verbose, **kwargs) per the rapids doctor contract.""" | ||
| for name, check_fn in _CHECKS: | ||
| sig = inspect.signature(check_fn) | ||
| params = list(sig.parameters.values()) | ||
|
|
||
| assert len(params) >= 2, ( | ||
| f"{name}: expected at least 2 parameters (verbose, **kwargs), " | ||
| f"got {len(params)}" | ||
| ) | ||
| assert params[0].name == "verbose", ( | ||
| f"{name}: first parameter should be 'verbose', " | ||
| f"got '{params[0].name}'" | ||
| ) | ||
| assert params[0].default is False, ( | ||
| f"{name}: 'verbose' should default to False, " | ||
| f"got {params[0].default!r}" | ||
| ) | ||
| assert params[-1].kind == inspect.Parameter.VAR_KEYWORD, ( | ||
| f"{name}: last parameter should be **kwargs, got {params[-1]}" | ||
| ) |
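
The signature test above encodes the shape that every check must have: `verbose=False` first and `**kwargs` last. As an illustration of that contract outside the test suite, the sketch below collects violations for an arbitrary callable. `contract_violations`, `good_check`, and `bad_check` are hypothetical names, not part of the PR.

```python
import inspect


def contract_violations(check_fn):
    """List how check_fn deviates from the (verbose=False, **kwargs) shape."""
    params = list(inspect.signature(check_fn).parameters.values())
    problems = []
    if not params or params[0].name != "verbose":
        problems.append("first parameter should be 'verbose'")
    elif params[0].default is not False:
        problems.append("'verbose' should default to False")
    if not params or params[-1].kind is not inspect.Parameter.VAR_KEYWORD:
        problems.append("last parameter should be **kwargs")
    return problems


# Hypothetical check functions, for illustration only.
def good_check(verbose=False, **kwargs):
    # Like the checks in the diff, return a message only when verbose.
    return "details" if verbose else None


def bad_check(verbose=True):
    return None
```

Accepting `**kwargs` lets the caller pass additional keyword arguments later without breaking existing checks, which is presumably why the test insists on it.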