
ensure 'torch' CUDA wheels are installed in CI, test that 'torch' is an optional dependency #425

Merged
rapids-bot[bot] merged 49 commits into rapidsai:release/26.04 from jameslamb:torch-testing
Mar 18, 2026
Conversation

@jameslamb (Member) commented Mar 9, 2026

Contributes to rapidsai/build-planning#256 and #410

Ensures that torch CUDA wheels are always installed in CI

"unintentionally installed a CPU-only torch from PyPI" is a failure mode I've seen a few times for CI in this project, and each time it's happened it has take a bit of time to figure out that that was the root cause.

This PR tries to fix that by:

  • using a stricter install pattern to guarantee a CUDA torch wheel is installed if a compatible one exists
  • using local versions like +cu130 to help prevent pulling in packages from PyPI
  • adding dependencies.yaml items for "oldest" dependencies so CI covers a range of supported versions

I tested similar patterns in rapidsai/rmm#2279 and saw them work well there.
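
As a rough illustration of why those local versions matter, a check along these lines can catch the failure mode early (this snippet is illustrative, not code added by this PR; it assumes CUDA torch wheels carry a `+cuNNN` local version label):

```python
# Hypothetical sanity check, not part of this PR: fail fast if the installed
# 'torch' wheel lacks a CUDA local version label like '2.6.0+cu130'.
import importlib.metadata

torch_version = importlib.metadata.version("torch")
if "+cu" not in torch_version:
    raise RuntimeError(
        f"Expected a CUDA 'torch' wheel with a '+cuNNN' local version, found {torch_version!r}. "
        "A CPU-only wheel may have been pulled in from PyPI."
    )
```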

Makes torch truly optional

We want these packages to be installable and importable without torch, for use in RAPIDS DLFW builds (where we don't install torch alongside RAPIDS because it's built in other processes).

I'd started relying on the assumption that they worked that way in this PR, but quickly realized that isn't true... torch is used unconditionally in many places in these libraries.

This PR fixes that. It makes torch optional and adds testing to ensure it stays that way:

  • copying the import_optional machinery from cugraph-pyg into pylibwholegraph (a rough sketch of that pattern follows this list)
  • using import_optional() for all torch imports in the libraries
  • using pytest.importorskip("torch") in tests
  • deferring some torch imports from import time to run time
  • adding the flake8-tidy-imports:banned-api check from ruff to enforce that import torch isn't used anywhere in library code or test code
  • explicitly testing that the libraries are still installable and at least 1 unit test can run successfully after a pip uninstall torch
  • adding a check in ci/validate_wheel.sh confirming that torch doesn't make it into any wheel metadata (which could happen due to mistakes in dependencies.yaml / pyproject.toml)
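
For reference, the import_optional pattern referred to above looks roughly like this. This is a minimal sketch based on details visible later in this thread (the `find_spec()` call and the `MissingModule` error message quoted in review comments); the real machinery lives in the libraries' `utils/imports.py`:

```python
# Minimal sketch of the import_optional machinery; illustrative, based on the
# find_spec() usage and MissingModule error message quoted later in this thread.
from importlib import import_module
from importlib.util import find_spec


class MissingModule:
    """Placeholder that raises an informative error on first use."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, attr):
        raise RuntimeError(f"This feature requires the '{self.name}' package/module")


def import_optional(mod):
    """Import 'mod' if it's installed, else return a MissingModule placeholder."""
    if find_spec(mod) is not None:
        return import_module(mod)
    return MissingModule(mod)


torch = import_optional("torch")  # safe at import time; errors only when torch is used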

Notes for Reviewers

Pulling these changes out of #413, so CI in this repo can immediately benefit from them and so #419 can be reverted.

When this is merged, #413 will have a smaller diff and just be focused on testing against a range of CTKs.

@jameslamb added the `breaking` (Introduces a breaking change) and `improvement` (Improves an existing functionality) labels Mar 9, 2026
copy-pr-bot bot commented Mar 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

```yaml
- matrix:
  packages:
    - ogb
    - sentence-transformers
```
Member Author

Moving all these (torchdata, pydantic, ogb, sentence-transformers) so that depends_on_pytorch can be limited to only PyTorch packages. That makes its use in wheel installs a bit easier (we can pip download --no-deps the output of the list and know the only package will be torch).

rapids-bot bot pushed a commit that referenced this pull request Mar 10, 2026
Contributes to rapidsai/build-planning#256, by splitting the conda-only parts of #425 out to hopefully simplify it.

* limits `depends_on_pytorch` in `dependencies.yaml` to only `torch` / `pytorch`
  - *(this will be helpful in #425 to generate a requirement for `pip download --no-deps`)*
* declares `pylibwholegraph`'s dependency on `packaging`
* adds some `.gitignore` rules for Python package file types (was creating these in local testing)

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #427
rapids-bot bot pushed a commit that referenced this pull request Mar 11, 2026
…#429)

Splitting some changes off of #425

* re-enables arm64 conda tests
* adds a small test with `torch.cuda.is_available()` in conda CI tests
* updates `ci/test_*` scripts to re-use `ci/run_*` scripts

## Notes for Reviewers

### How I tested this

Ran the full PR and nightly matrices of `conda-python-tests` here.

https://github.com/rapidsai/cugraph-gnn/actions/runs/22935844055/job/66568277622?pr=429

All except arm64 + CUDA 12.2 passed (that combination doesn't have `pytorch` conda packages), so I added matrix filters skipping that one combination: #429 (comment)

### How can we enable arm tests now?

In the time since arm64 tests were originally filtered out, `pytorch` packages with GPU support have made it to `conda-forge`.

So the reasoning that there aren't available packages (#167 (comment)) no longer holds.

We should now be able to test on arm64 😁 

### Why add this import test?

`torch` is an optional dependency of both `cugraph-pyg` and `pylibwholegraph`. The work done to keep it optional introduces a risk of `torch` accidentally not being installed in CI, or of a CPU-only version being installed.

Right now, when that happens you're confronted with a hard-to-understand error like

```text
E   AttributeError: module 'torch._C' has no attribute '_cuda_changeCurrentAllocator'
```

Hopefully we never accidentally install CPU-only `torch` in CI again, but if it does happen the test introduced in this PR should save some debugging time.
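
A minimal sketch of what such an import test can look like (illustrative; the exact test added in #429 may differ):

```python
# Illustrative version of the import test described above: skip if torch is
# absent (it's optional), but fail loudly if a CPU-only build was installed.
import pytest

torch = pytest.importorskip("torch")


def test_torch_has_cuda_support():
    # CPU-only torch builds report torch.version.cuda as None and
    # torch.cuda.is_available() as False, even on a machine with a GPU.
    assert torch.version.cuda is not None
    assert torch.cuda.is_available()
```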

### Why re-use the `run_*` scripts from `test_*` scripts?

Reduces duplication.

And more importantly, ensures this project's tests are runnable in DLFW builds (where only the `run_*` scripts are used because they are not tightly coupled to RAPIDS CI setup).

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Alex Barghi (https://github.com/alexbarghi-nv)

URL: #429
@alexbarghi-nv (Member) left a comment

Just one minor comment; otherwise should be ok.

```diff
@@ -171,8 +173,6 @@ def get_cpp_extension_src_path():


 def compile_cpp_extension():
```
Member

Should we be doing an import_optional for torch.utils?

Member Author

I was thinking we wouldn't need to because this usage will just try to hit the torch we already imported with torch = import_optional("torch") at the top of the module.

However it could be useful to do that more specific import for those annoying "uninstall didn't fully work" situations like this: https://github.com/rapidsai/cugraph-gnn/pull/425/changes#r2921258039

Because then you'd get a more informative error than "module 'torch' has no attribute 'utils'".

I'll add that here.

Member Author

Did this in eed447c

```toml
]
select = [
  # (flake8-tidy-imports) banned-api
  "TID251"
```
Member

Thanks for adding this check.

rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this pull request Mar 13, 2026
…ncies (#5453)

Contributes to rapidsai/build-planning#257

* ensures that when `torch` is installed in wheels CI, it's always a CUDA variant
* removes unused test dependencies `ogb`, `pydantic`, `torchdata`, and `torchmetrics`
  - *I suspect these are left over from before GNN packages moved out to https://github.com/rapidsai/cugraph-gnn*

## Notes for Reviewers

### How does this help with testing against a mix of CTKs?

`torch` CUDA wheels pin tightly to `nvidia-{thing}` wheels and, soon, to `cuda-toolkit` (see rapidsai/build-planning#255).

Forcing the use of CUDA `torch` wheels ensures we'll catch dependency conflicts in CI. Without this, `pip` could silently fall back to CPU-only `torch` from pypi.org.

### How I tested this

This PR uses some patterns I've tested elsewhere too:

* rapidsai/rmm#2270 (comment)
* rapidsai/cugraph-gnn#425

I also tested a commit here with both the PR and nightly test matrices, to be sure we covered everything: https://github.com/rapidsai/cugraph/actions/runs/23062364778/job/66996357196?pr=5453

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Alex Barghi (https://github.com/alexbarghi-nv)
  - Gil Forsyth (https://github.com/gforsyth)

URL: #5453


```diff
-def test_dlpack():
+def test_dlpack(torch):
```
Member Author

Just picking a place on the diff so this can be threaded.

pylibwholegraph tests are failing like this:

```text
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/pyenv/versions/3.11.15/lib/python3.11/site-packages/torch/cuda/__init__.py:569: in set_device
    torch._C._cuda_setDevice(device)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def _lazy_init():
        global _initialized, _queued_calls
        if is_initialized() or hasattr(_tls, "is_initializing"):
            return
        with _initialization_lock:
            # We be double-checked locking, boys!  This is OK because
            # the above test was GIL protected anyway.  The inner test
            # is for when a thread blocked on some other thread which was
            # doing the initialization; when they get the lock, they will
            # find there is nothing left to do.
            if is_initialized():
                return
            # It is important to prevent other threads from entering _lazy_init
            # immediately, while we are still guaranteed to have the GIL, because some
            # of the C calls we make below will release the GIL
            if _is_in_bad_fork():
>               raise RuntimeError(
                    "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
                    "multiprocessing, you must use the 'spawn' start method"
                )
E               RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

/pyenv/versions/3.11.15/lib/python3.11/site-packages/torch/cuda/__init__.py:398: RuntimeError
```

and this:

```text
    def test_wholegraph_unweighted_sample(
        graph_node_count,
        ...
        torch,
    ):
        gpu_count = wmb.fork_get_gpu_count()
>       assert gpu_count > 0
E       assert -1 > 0

tests/wholegraph_torch/ops/test_wholegraph_unweighted_sample_without_replacement.py:376: AssertionError
```

(build link)

I'm not sure what the cause is. Nightly tests did not fail this way (https://github.com/rapidsai/cugraph-gnn/actions/runs/23132494004/job/67188677016), so it's likely this is some issue introduced by this PR :/

Member Author

If I run just this test using the same pytest invocation as in ci/run_pylibwholegraph_pytests.sh, it fails

```text
$ pytest -rs --cache-clear --forked --import-mode=append tests/pylibwholegraph/test_wholememory_binding.py
torch = <module 'torch' from '/pyenv/versions/3.11.15/lib/python3.11/site-packages/torch/__init__.py'>

    def test_dlpack(torch):
        gpu_count = wmb.fork_get_gpu_count()
>       assert gpu_count > 0
E       assert -1 > 0

tests/pylibwholegraph/test_wholememory_binding.py:110: AssertionError
--------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------
fork_get_device_count failed.
```

But running the same code directly in a plain Python session? No problem.

```python
import pylibwholegraph.binding.wholememory_binding as wmb
wmb.fork_get_gpu_count()
# 1
```

🤔

Member

I've replicated the failure locally. My initial thought was to run the tests more like cuGraph-PyG does, but I think what's going on in pylibwholegraph should be more or less functionally equivalent. I'm pretty busy today but I will take another look as soon as I get a chance.

Member Author

Thank you! I have a local repro too and will try to look more when I have time.

I can say I tried just removing the use of pytest-forked (unsure why that's used here), and unfortunately it wasn't that simple 😂

Member

Yeah, I forgot to say I also just used regular pytest and it failed with the same error.

Member

Ok, with some help I managed to figure out what's going wrong, I think. It looks like the tests in pylibwholegraph don't like it when torch constructs the CUDA context, resulting in this error when the forked processes try to create their own contexts. The problem is that importing torch creates a CUDA context...

Member

I think I might have a solution. Give me a minute.

@alexbarghi-nv (Member)

@jameslamb let me see if I can replicate this locally. I think I might have some ideas on how to fix it.

```python
    raise RuntimeError(f"This feature requires the '{self.name}' package/module")


class FoundModule:
```
@alexbarghi-nv (Member) commented Mar 16, 2026

@jameslamb this is the solution I came up with - AI threw a bunch of different ideas at me, all of which failed miserably, so I guess human ingenuity still isn't obsolete. Basically, this makes all optional imports lazy so the actual import only occurs upon use. This prevents creating a CUDA context, and the tests pass, hopefully without any other weird implications...
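
For illustration, the lazy pattern described here looks roughly like this. It is a sketch reconstructed from the FoundModule proxy and the `self.mod = import_module(self.mod)` line quoted in a traceback later in this thread; details are illustrative:

```python
# Sketch of the lazy-import proxy described above: the wrapped module is only
# imported on first attribute access, so merely importing the library can no
# longer create a CUDA context as a side effect.
from importlib import import_module


class FoundModule:
    """Lazy proxy for a module that is known to be installed."""

    def __init__(self, name):
        self.mod = name  # stored as a string until first use

    def __getattr__(self, attr):
        if isinstance(self.mod, str):
            # First real use: perform the deferred import now.
            self.mod = import_module(self.mod)
        return getattr(self.mod, attr)
```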

@jameslamb (Member Author) commented Mar 17, 2026

Ok thanks! Interesting, makes sense to me that that could work.

This doesn't work in the case where torch is not installed though:

```text
ImportError while importing test module '/__w/cugraph-gnn/cugraph-gnn/python/pylibwholegraph/pylibwholegraph/tests/pylibwholegraph/test_wholememory_tensor.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/pyenv/versions/3.11.15/lib/python3.11/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tests/pylibwholegraph/test_wholememory_tensor.py:6: in <module>
    from pylibwholegraph.torch.initialize import init_torch_env_and_create_wm_comm
/pyenv/versions/3.11.15/lib/python3.11/site-packages/pylibwholegraph/torch/__init__.py:17: in <module>
    from .embedding import (
/pyenv/versions/3.11.15/lib/python3.11/site-packages/pylibwholegraph/torch/embedding.py:19: in <module>
    from .tensor import WholeMemoryTensor
/pyenv/versions/3.11.15/lib/python3.11/site-packages/pylibwholegraph/torch/tensor.py:16: in <module>
    from .wholegraph_env import wrap_torch_tensor, get_wholegraph_env_fns, get_stream
/pyenv/versions/3.11.15/lib/python3.11/site-packages/pylibwholegraph/torch/wholegraph_env.py:13: in <module>
    torch_utils_cpp_ext = import_optional("torch.utils.cpp_extension")
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/pyenv/versions/3.11.15/lib/python3.11/site-packages/pylibwholegraph/utils/imports.py:56: in import_optional
    if find_spec(mod) is not None:
       ^^^^^^^^^^^^^^
E   ModuleNotFoundError: No module named 'torch'
```

(build link)

There might be a clue in that... all of the unit tests were passing in CI on this PR, with and without torch installed, before I added that import_optional("torch.utils.cpp_extension") in response to 79a6efe#r2933127641

i.e. before these commits:

Maybe going 2 levels deep in a single import_optional() just does not work well with this machinery?

Member Author

ahhh yeah ok ... importlib.util.find_spec() raises (instead of returning None) when given a dotted name whose parent package isn't installed.

```python
from importlib.util import find_spec

find_spec("torch")           # returns None when torch is not installed
find_spec("torch.autograd")  # raises, because resolving the dotted name imports 'torch' first

# Traceback (most recent call last):
#  File "<stdin>", line 1, in <module>
#  File "<frozen importlib.util>", line 95, in find_spec
# ModuleNotFoundError: No module named 'torch'
```

I'll look for a different approach here.

Member Author

I think I have a fix.

Member Author

Ok let's see if 456857a fixes this

There, I'm catching the ModuleNotFoundError / ImportError that could happen with dotted imports and treating that as another example of "thing not installed".
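
Roughly, the described fix looks like this (a sketch; the actual change is in 456857a, and the MissingModule / FoundModule names are reused from the sketches earlier in this thread):

```python
# Sketch of the described fix: a raising find_spec() on a dotted name (e.g.
# "torch.utils.cpp_extension" with torch absent) is treated as "not installed".
from importlib.util import find_spec


def import_optional(mod):
    try:
        spec = find_spec(mod)
    except ImportError:
        # find_spec("a.b.c") imports the parent packages, so it raises
        # ModuleNotFoundError (a subclass of ImportError) when 'a' is absent;
        # treat that the same as the module not being installed.
        spec = None
    if spec is None:
        return MissingModule(mod)
    return FoundModule(mod)
```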

Member Author

Ok that fixed pylibwholegraph. There's one more issue in cugraph-pyg:

```text
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import cugraph_pyg; print(f'cugraph-pyg version: {cugraph_pyg.__version__}')
    ^^^^^^^^^^^^^^^^^^
  File "/pyenv/versions/3.13.12/lib/python3.13/site-packages/cugraph_pyg/__init__.py", line 6, in <module>
    import cugraph_pyg.data
  File "/pyenv/versions/3.13.12/lib/python3.13/site-packages/cugraph_pyg/data/__init__.py", line 6, in <module>
    from cugraph_pyg.data.graph_store import (
        GraphStore,
    )
  File "/pyenv/versions/3.13.12/lib/python3.13/site-packages/cugraph_pyg/data/graph_store.py", line 36, in <module>
    else torch_geometric.data.GraphStore
         ^^^^^^^^^^^^^^^^^^^^
  File "/pyenv/versions/3.13.12/lib/python3.13/site-packages/cugraph_pyg/utils/imports.py", line 50, in __getattr__
    self.mod = import_module(self.mod)
               ~~~~~~~~~~~~~^^^^^^^^^^
  File "/pyenv/versions/3.13.12/lib/python3.13/importlib/__init__.py", line 88, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pyenv/versions/3.13.12/lib/python3.13/site-packages/torch_geometric/__init__.py", line 3, in <module>
    import torch
ModuleNotFoundError: No module named 'torch'
```

(build link)

When torch_geometric is available but torch is not, this check falls back to trying to import torch_geometric and fails:

```python
class GraphStore(
    object
    if isinstance(torch_geometric, MissingModule)
    else torch_geometric.data.GraphStore
):
```

I think that's because find_spec("torch_geometric") succeeds, but import_module("torch_geometric") fails (because torch_geometric unconditionally imports torch).
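
The mismatch can be demonstrated directly (in an environment where torch_geometric is installed but torch is not):

```python
# Demonstrates the asymmetry described above, assuming torch_geometric is
# installed but torch is not.
from importlib import import_module
from importlib.util import find_spec

print(find_spec("torch_geometric"))  # a ModuleSpec: the package is on disk
import_module("torch_geometric")     # ModuleNotFoundError: No module named 'torch'
```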

I've pushed 0afcdd7 hopefully resolving that.

@KyleFromNVIDIA (Member) left a comment

Approved with a couple of small comments

```bash
# Use '|| true' so grep not finding any matches (exit 1) does not kill the script under set -e
unzip -p "${WHEEL_FILE}" '*.dist-info/METADATA' \
  | grep -E '^Requires-Dist:.*torch[><=!~ ]+.*' \
  | tee matches.txt || true
```
Member

Why the tee? You could pipe the output into a variable and then echo it instead.

Member Author

tee specifically to avoid the complexity of storing the output in a variable just to echo it.

```bash
fi

rapids-logger "import cugraph-pyg (no 'torch')"
./ci/uninstall-torch-wheels.sh
```
Member

Is there any reason you couldn't do the no-torch tests before the torch tests? That would save you the trouble of uninstalling torch.

Member Author

Because to run the tests without torch, we still need to install cugraph-pyg and its dependencies, any of which might pull in torch as a requirement.

Imperative code that force-uninstalls torch right here afterwards is easier and less error-prone than trying to be careful about not letting torch into the environment at the beginning, in my opinion.

@alexbarghi-nv (Member) left a comment

👍

@alexbarghi-nv (Member)

/merge

@rapids-bot rapids-bot bot merged commit 330b135 into rapidsai:release/26.04 Mar 18, 2026
62 checks passed
@jameslamb jameslamb deleted the torch-testing branch March 18, 2026 19:30

Labels

`improvement` (Improves an existing functionality), `non-breaking` (Introduces a non-breaking change)
