ensure 'torch' CUDA wheels are installed in CI, test that 'torch' is an optional dependency #425
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
dependencies.yaml (outdated)

```yaml
- matrix:
  packages:
    - ogb
    - sentence-transformers
```
Moving all these (`torchdata`, `pydantic`, `ogb`, `sentence-transformers`) so that `depends_on_pytorch` can be limited to only PyTorch packages. That makes its use in wheel installs a bit easier (we can `pip download --no-deps` the output of the list and know the only package will be `torch`).
Contributes to rapidsai/build-planning#256, by splitting the conda-only parts of #425 out to hopefully simplify it.

* limits `depends_on_pytorch` in `dependencies.yaml` to only `torch` / `pytorch` - *(this will be helpful in #425 to generate a requirement for `pip download --no-deps`)*
* declares `pylibwholegraph`'s dependency on `packaging`
* adds some `.gitignore` rules for Python package file types (was creating these in local testing)

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Bradley Dice (https://github.com/bdice)

URL: #427
…#429) Splitting some changes off of #425

* re-enables arm64 conda tests
* adds a small test with `torch.cuda.is_available()` in conda CI tests
* updates `ci/test_*` scripts to re-use `ci/run_*` scripts

## Notes for Reviewers

### How I tested this

Ran the full PR and nightly matrices of `conda-python-tests` here: https://github.com/rapidsai/cugraph-gnn/actions/runs/22935844055/job/66568277622?pr=429

All except arm64 + CUDA 12.2 passed (that combination doesn't have `pytorch` conda packages), so I added matrix filters skipping that one combination: #429 (comment)

### How can we enable arm tests now?

Since they were originally filtered out, `pytorch` packages with GPU support have made it to `conda-forge`. So the reasoning that there aren't available packages (#167 (comment)) no longer holds. We should now be able to test on arm64 😁

### Why add this import test?

`torch` is an optional dependency of both `cugraph-pyg` and `pylibwholegraph`. The things done to keep it optional introduce the risk of it accidentally not being installed in CI, or of a CPU-only version of it being installed. Right now, when that happens you're confronted with a hard-to-understand error like

```text
E   AttributeError: module 'torch._C' has no attribute '_cuda_changeCurrentAllocator'
```

Hopefully we never accidentally install CPU-only `torch` in CI again, but if it does happen, the test introduced in this PR should save some debugging time.

### Why re-use the `run_*` scripts from `test_*` scripts?

Reduces duplication. And more importantly, ensures this project's tests are runnable in DLFW builds (where only the `run_*` scripts are used, because they are not tightly coupled to RAPIDS CI setup).

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Alex Barghi (https://github.com/alexbarghi-nv)

URL: #429
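The kind of import test described above can be sketched roughly like this. This is a hypothetical stand-in, not the actual test added in the PR; the helper name `torch_cuda_status` is made up for illustration. The point is to distinguish "torch missing" from "torch present but CPU-only" so the failure is obvious instead of a cryptic `AttributeError`:

```python
# Rough sketch (hypothetical helper, not the actual CI test): report whether
# a CUDA-capable torch is installed, without requiring torch at import time.
import importlib.util


def torch_cuda_status():
    """Return 'missing', 'cpu-only', or 'cuda' for the installed torch."""
    if importlib.util.find_spec("torch") is None:
        return "missing"
    import torch  # safe here: we just confirmed it's installed

    return "cuda" if torch.cuda.is_available() else "cpu-only"


print(torch_cuda_status())
```

A CI assertion would then be `assert torch_cuda_status() == "cuda"`, which fails with a readable message instead of a deep `torch._C` attribute error.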
alexbarghi-nv left a comment

Just one minor comment; otherwise should be ok.
```diff
@@ -171,8 +173,6 @@ def get_cpp_extension_src_path():


 def compile_cpp_extension():
```
Should we be doing an `import_optional` for `torch.utils`?
I was thinking we wouldn't need to, because this usage will just hit the `torch` we already imported with `torch = import_optional("torch")` at the top of the module.

However, it could be useful to do that more specific import for those annoying "uninstall didn't fully work" situations like this: https://github.com/rapidsai/cugraph-gnn/pull/425/changes#r2921258039

Because then you'd get a more informative error than something like "module 'torch' has no attribute 'utils'".

I'll add that here.
```toml
]
select = [
    # (flake8-tidy-imports) banned-api
    "TID251",
]
```
Thanks for adding this check.
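For context, a `TID251` rule alone only enables the check; the banned modules are configured separately under `flake8-tidy-imports.banned-api`. A sketch of what such a configuration could look like in `pyproject.toml` (the exact settings and message in this repo may differ):

```toml
[tool.ruff.lint]
select = [
    # (flake8-tidy-imports) banned-api
    "TID251",
]

[tool.ruff.lint.flake8-tidy-imports.banned-api]
"torch".msg = "Use import_optional('torch') so 'torch' stays an optional dependency."
```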
…ncies (#5453) Contributes to rapidsai/build-planning#257

* ensures that when `torch` is installed in wheels CI, it's always a CUDA variant
* removes unused test dependencies `ogb`, `pydantic`, `torchdata`, and `torchmetrics` - *I suspect these are left over from before GNN packages moved out to https://github.com/rapidsai/cugraph-gnn*

## Notes for Reviewers

### How does this help with testing against a mix of CTKs?

`torch` CUDA wheels tightly pin to `nvidia-{thing}` wheels and, soon, `cuda-toolkit` (see rapidsai/build-planning#255). Forcing the use of CUDA `torch` wheels ensures we'll catch dependency conflicts in CI. Without this, `pip` could silently fall back to CPU-only `torch` from pypi.org.

### How I tested this

This PR uses some patterns I've tested elsewhere too:

* rapidsai/rmm#2270 (comment)
* rapidsai/cugraph-gnn#425

I also tested a commit here with both the PR and nightly test matrices, to be sure we covered everything: https://github.com/rapidsai/cugraph/actions/runs/23062364778/job/66996357196?pr=5453

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Alex Barghi (https://github.com/alexbarghi-nv)
- Gil Forsyth (https://github.com/gforsyth)

URL: #5453
```diff
-def test_dlpack():
+def test_dlpack(torch):
```
Just picking a place on the diff so this can be threaded.
pylibwholegraph tests are failing like this:
```text
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

/pyenv/versions/3.11.15/lib/python3.11/site-packages/torch/cuda/__init__.py:569: in set_device
    torch._C._cuda_setDevice(device)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def _lazy_init():
        global _initialized, _queued_calls
        if is_initialized() or hasattr(_tls, "is_initializing"):
            return
        with _initialization_lock:
            # We be double-checked locking, boys! This is OK because
            # the above test was GIL protected anyway. The inner test
            # is for when a thread blocked on some other thread which was
            # doing the initialization; when they get the lock, they will
            # find there is nothing left to do.
            if is_initialized():
                return
            # It is important to prevent other threads from entering _lazy_init
            # immediately, while we are still guaranteed to have the GIL, because some
            # of the C calls we make below will release the GIL
            if _is_in_bad_fork():
>               raise RuntimeError(
                    "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
                    "multiprocessing, you must use the 'spawn' start method"
                )
E               RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

/pyenv/versions/3.11.15/lib/python3.11/site-packages/torch/cuda/__init__.py:398: RuntimeError
```
and this:
```text
    def test_wholegraph_unweighted_sample(
        graph_node_count,
        ...
        torch,
    ):
        gpu_count = wmb.fork_get_gpu_count()
>       assert gpu_count > 0
E       assert -1 > 0

tests/wholegraph_torch/ops/test_wholegraph_unweighted_sample_without_replacement.py:376: AssertionError
```
I'm not sure what the cause is. Nightly tests did not fail this way (https://github.com/rapidsai/cugraph-gnn/actions/runs/23132494004/job/67188677016), so it's likely this is some issue introduced by this PR :/
If I run just this test using the same `pytest` invocation as in `ci/run_pylibwholegraph_pytests.sh`, it fails:

```shell
$ pytest -rs --cache-clear --forked --import-mode=append tests/pylibwholegraph/test_wholememory_binding.py
```

```text
torch = <module 'torch' from '/pyenv/versions/3.11.15/lib/python3.11/site-packages/torch/__init__.py'>

    def test_dlpack(torch):
        gpu_count = wmb.fork_get_gpu_count()
>       assert gpu_count > 0
E       assert -1 > 0

tests/pylibwholegraph/test_wholememory_binding.py:110: AssertionError
--------------------------- Captured stdout call ---------------------------
fork_get_device_count failed.
```

But running the same code in a plain Python process? No problem:

```python
>>> import pylibwholegraph.binding.wholememory_binding as wmb
>>> wmb.fork_get_gpu_count()
1
```

🤔
I've replicated the failure locally. My initial thought was to run the tests more like cuGraph-PyG does, but I think what's going on in pylibwholegraph should be more or less functionally equivalent. I'm pretty busy today but I will take another look as soon as I get a chance.
Thank you! I have a local repro too and will try to look more when I have time.
I can say I tried just removing the use of `pytest-forked` (unsure why that's used here), and unfortunately it wasn't that simple 😂
Yeah, I forgot to say I also just used regular pytest and it failed with the same error.
Ok, with some help I managed to figure out what's going wrong, I think. It looks like the tests in pylibwholegraph don't like it when torch constructs the CUDA context, resulting in this error when the forked processes try to create their own contexts. The problem is that importing torch creates a CUDA context...
I think I might have a solution. Give me a minute.
@jameslamb let me see if I can replicate this locally. I think I might have some ideas on how to fix it.
```python
    raise RuntimeError(f"This feature requires the '{self.name}' package/module")


class FoundModule:
```
@jameslamb this is the solution I came up with - AI threw a bunch of different ideas at me, all of which failed miserably, so I guess human ingenuity still isn't obsolete. Basically, this makes all optional imports lazy so the actual import only occurs upon use. This prevents creating a CUDA context, and the tests pass, hopefully without any other weird implications...
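A minimal sketch of the lazy-import idea described in this comment, assuming illustrative names (this is not the actual `pylibwholegraph` implementation): `import_optional()` hands back a proxy object, and the real import, along with any side effects like CUDA context creation, only happens on first attribute access.

```python
# Sketch only: a proxy that stands in for a module and defers the actual
# import (and its side effects) until an attribute is first accessed.
import importlib


class LazyModule:
    def __init__(self, name):
        self._name = name
        self._mod = None  # nothing imported yet

    def __getattr__(self, attr):
        # __getattr__ is only called for attributes not found on the
        # instance, so _name/_mod lookups don't trigger an import.
        if self._mod is None:
            self._mod = importlib.import_module(self._name)
        return getattr(self._mod, attr)


lazy_json = LazyModule("json")
# the import of 'json' happens here, on first attribute access:
print(lazy_json.dumps({"gpus": 1}))
```

With a pattern like this, `torch = import_optional("torch")` at module top level no longer touches `torch` at all until some function actually uses it, which is what keeps the CUDA context from being created at import time.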
Ok thanks! Interesting, makes sense to me that that could work.
This doesn't work in the case where torch is not installed though:
```text
ImportError while importing test module '/__w/cugraph-gnn/cugraph-gnn/python/pylibwholegraph/pylibwholegraph/tests/pylibwholegraph/test_wholememory_tensor.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/pyenv/versions/3.11.15/lib/python3.11/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/pylibwholegraph/test_wholememory_tensor.py:6: in <module>
    from pylibwholegraph.torch.initialize import init_torch_env_and_create_wm_comm
/pyenv/versions/3.11.15/lib/python3.11/site-packages/pylibwholegraph/torch/__init__.py:17: in <module>
    from .embedding import (
/pyenv/versions/3.11.15/lib/python3.11/site-packages/pylibwholegraph/torch/embedding.py:19: in <module>
    from .tensor import WholeMemoryTensor
/pyenv/versions/3.11.15/lib/python3.11/site-packages/pylibwholegraph/torch/tensor.py:16: in <module>
    from .wholegraph_env import wrap_torch_tensor, get_wholegraph_env_fns, get_stream
/pyenv/versions/3.11.15/lib/python3.11/site-packages/pylibwholegraph/torch/wholegraph_env.py:13: in <module>
    torch_utils_cpp_ext = import_optional("torch.utils.cpp_extension")
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/pyenv/versions/3.11.15/lib/python3.11/site-packages/pylibwholegraph/utils/imports.py:56: in import_optional
    if find_spec(mod) is not None:
    ^^^^^^^^^^^^^^
E   ModuleNotFoundError: No module named 'torch'
```
There might be a clue in that... all of the unit tests were passing in CI on this PR, with and without `torch` installed, before I added that `import_optional("torch.utils.cpp_extension")` in response to 79a6efe#r2933127641
i.e. before these commits:
Maybe going 2 levels deep in a single import_optional() just does not work well with this machinery?
ahhh yeah ok ... `importlib.util.find_spec()` cannot process a dotted name when the parent package isn't installed:

```python
from importlib.util import find_spec

find_spec("torch")
find_spec("torch.autograd")
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "<frozen importlib.util>", line 95, in find_spec
# ModuleNotFoundError: No module named 'torch'
```

I'll look for a different approach here.
Ok let's see if 456857a fixes this
There, I'm catching the ModuleNotFoundError / ImportError that could happen with dotted imports and treating that as another example of "thing not installed".
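The fix described here can be sketched like this (the helper name `is_available` is hypothetical; the real code lives in the project's `import_optional` machinery):

```python
# Sketch of the fix described above: treat the ModuleNotFoundError that
# find_spec() raises for a dotted name whose parent package isn't installed
# as just another case of "thing not installed".
from importlib.util import find_spec


def is_available(name):
    try:
        return find_spec(name) is not None
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return False


print(is_available("json"))                      # stdlib module, available
print(is_available("not_a_real_pkg_xyz.utils"))  # dotted name, parent missing
```

Catching `ImportError` covers both exception types mentioned in the comment, since `ModuleNotFoundError` subclasses it.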
Ok, that fixed `pylibwholegraph`. There's one more issue in `cugraph-pyg`:
```text
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import cugraph_pyg; print(f'cugraph-pyg version: {cugraph_pyg.__version__}')
    ^^^^^^^^^^^^^^^^^^
  File "/pyenv/versions/3.13.12/lib/python3.13/site-packages/cugraph_pyg/__init__.py", line 6, in <module>
    import cugraph_pyg.data
  File "/pyenv/versions/3.13.12/lib/python3.13/site-packages/cugraph_pyg/data/__init__.py", line 6, in <module>
    from cugraph_pyg.data.graph_store import (
        GraphStore,
    )
  File "/pyenv/versions/3.13.12/lib/python3.13/site-packages/cugraph_pyg/data/graph_store.py", line 36, in <module>
    else torch_geometric.data.GraphStore
         ^^^^^^^^^^^^^^^^^^^^
  File "/pyenv/versions/3.13.12/lib/python3.13/site-packages/cugraph_pyg/utils/imports.py", line 50, in __getattr__
    self.mod = import_module(self.mod)
               ~~~~~~~~~~~~~^^^^^^^^^^
  File "/pyenv/versions/3.13.12/lib/python3.13/importlib/__init__.py", line 88, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pyenv/versions/3.13.12/lib/python3.13/site-packages/torch_geometric/__init__.py", line 3, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
```
When `torch_geometric` is available but `torch` is not, this check falls back to trying to import `torch_geometric` and fails (see `cugraph-gnn/python/cugraph-pyg/cugraph_pyg/data/graph_store.py`, lines 33 to 37 in 4dcf7eb).

I think that's because `find_spec("torch_geometric")` works, but `import_module("torch_geometric")` fails (because `torch_geometric` unconditionally imports `torch`).
I've pushed 0afcdd7 hopefully resolving that.
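The mismatch described above can be demonstrated with a synthetic module (`fake_geometric` is a made-up name standing in for `torch_geometric`): `find_spec()` only locates a module on disk, while `import_module()` actually executes it, so a package whose top level imports a missing dependency passes the first check but fails the real import.

```python
# Demonstration: a module that exists on disk but whose top-level import of
# a missing dependency makes it un-importable.
import importlib
import os
import sys
import tempfile
from importlib.util import find_spec

tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "fake_geometric.py"), "w") as f:
    f.write("import definitely_missing_dep_xyz\n")
sys.path.insert(0, tmpdir)

print(find_spec("fake_geometric") is not None)  # True: the file is found

try:
    importlib.import_module("fake_geometric")
    imported = True
except ModuleNotFoundError:
    imported = False
print(imported)  # False: executing the module hits the failing import
```

So any check of the form "if `find_spec()` succeeds, fall back to importing the module" has to be prepared for the import itself to raise.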
(force-pushed from d03a9bf to 456857a)
KyleFromNVIDIA left a comment

Approved with a couple of small comments
```shell
# Use '|| true' so grep not finding any matches (exit 1) does not kill the script under set -e
unzip -p "${WHEEL_FILE}" '*.dist-info/METADATA' \
  | grep -E '^Requires-Dist:.*torch[><=!~ ]+.*' \
  | tee matches.txt || true
```
Why the `tee`? You could pipe the output into a variable and then `echo` it instead.
`tee` specifically, to avoid the complexity of storing the output in a variable just to echo it.
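The `|| true` behavior being discussed can be demonstrated with a self-contained snippet (sample METADATA content inlined here; the real script extracts it from the wheel with `unzip -p`):

```shell
#!/usr/bin/env bash
# Demonstrates the 'grep ... | tee ... || true' pattern under 'set -e' using
# inlined sample metadata instead of a real wheel file.
set -euo pipefail

metadata='Name: cugraph-pyg
Requires-Dist: numpy
Requires-Dist: torch>=2.3; extra == "torch"'

# A match: tee both prints it and records it in matches.txt.
printf '%s\n' "${metadata}" \
  | grep -E '^Requires-Dist:.*torch[><=!~ ]+.*' \
  | tee matches.txt || true

# No matches: grep exits 1, but '|| true' keeps the script alive under set -e.
printf '%s\n' "${metadata}" \
  | grep -E '^Requires-Dist:.*tensorflow[><=!~ ]+.*' \
  | tee empty.txt || true

echo "still running"
```

Note that with `set -o pipefail` the pipeline's status is grep's exit 1 even though `tee` succeeds, so the `|| true` is doing real work in the no-match case.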
```shell
fi

rapids-logger "import cugraph-pyg (no 'torch')"
./ci/uninstall-torch-wheels.sh
```
Is there any reason you couldn't do the no-`torch` tests before the `torch` tests? That would save you the trouble of uninstalling `torch`.
Because to run the tests without `torch`, we still need to install `cugraph-pyg` and its dependencies, any of which might pull in `torch` as a requirement.

Imperative code that force-uninstalls `torch` right here afterwards is easier and less error-prone than trying to be careful about not letting `torch` into the environment at the beginning, in my opinion.
/merge
Contributes to rapidsai/build-planning#256 and #410

## Ensures that `torch` CUDA wheels are always installed in CI

"Unintentionally installed a CPU-only `torch` from PyPI" is a failure mode I've seen a few times for CI in this project, and each time it's happened it has taken a bit of time to figure out that that was the root cause. This PR tries to fix that by ensuring:

* a CUDA `torch` wheel is installed if a compatible one exists
* `+cu130`, to help prevent pulling in packages from PyPI
* `dependencies.yaml` items for "oldest" dependencies, so CI covers a range of supported versions

I tested similar patterns in rapidsai/rmm#2279 and saw them work well there.

## Makes `torch` truly optional

We want these packages to be installable and importable without `torch`, for use in RAPIDS DLFW builds (where we don't install `torch` alongside RAPIDS because it's built in other processes). I'd started relying on the assumption that they worked that way in this PR, but quickly realized that that isn't true... `torch` is used unconditionally in many ways in these libraries.

This PR fixes that. It makes `torch` optional and adds testing to ensure it stays that way:

* moves the `import_optional` machinery from `cugraph-pyg` into `pylibwholegraph`
* uses `import_optional()` for all `torch` imports in the libraries
* uses `pytest.importorskip("torch")` in tests
* defers `torch` inputs from import time to run time
* adds the `flake8-tidy-imports`: `banned-api` check from `ruff` to enforce that `import torch` isn't used anywhere in library code or test code
* adds `pip uninstall torch` and a check in `ci/validate_wheel.sh` confirming that `torch` doesn't make it into any wheel metadata (which could happen due to mistakes in `dependencies.yaml` / `pyproject.toml`)

## Notes for Reviewers

Pulling these changes out of #413, so CI in this repo can immediately benefit from them and so #419 can be reverted.

When this is merged, #413 will have a smaller diff and just be focused on testing against a range of CTKs.