wheels: build with CUDA 13.0, test against mix of CTK versions, make `torch-geometric` fully optional for `cugraph-pyg` (#434)
Conversation
.github/workflows/pr.yaml
```yaml
- wheel-build-libwholegraph
- wheel-build-pylibwholegraph
- wheel-tests-pylibwholegraph
- wheel-tests-nightly-pylibwholegraph
```
TODO before merging: remove all this
We cover a different mix of environments in nightlies, and this project's dependency tree is very sensitive to that mix, so I want to be sure we've accounted for everything.
.github/workflows/pr.yaml
```yaml
- wheel-tests-nightly-pylibwholegraph
- wheel-build-cugraph-pyg
- wheel-tests-cugraph-pyg
- wheel-tests-nightly-cugraph-pyg
```
This is pretty close!
- ✔️ all PR CI wheel jobs passing
- ✔️ all nightly `pylibwholegraph` wheel jobs passing
- 😬 1 nightly `cugraph-pyg` wheel job failing
```text
Collecting ucxx-cu12==0.49.*,>=0.0.0a0 (from cugraph-cu12==26.4.*,>=0.0.0a0->cugraph-pyg-cu12==26.4.0a40->cugraph-pyg-cu12==26.4.0a40)
  Downloading https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/ucxx-cu12/0.49.0a32/ucxx_cu12-0.49.0a32-cp311-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (503 kB)
  Downloading https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/ucxx-cu12/0.49.0a31/ucxx_cu12-0.49.0a31-cp311-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (503 kB)
  Downloading https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/ucxx-cu12/0.49.0a30/ucxx_cu12-0.49.0a30-cp311-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (503 kB)
  Downloading https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/ucxx-cu12/0.49.0a29/ucxx_cu12-0.49.0a29-cp311-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (503 kB)
  Downloading https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/ucxx-cu12/0.49.0a28/ucxx_cu12-0.49.0a28-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (515 kB)
  Downloading https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/ucxx-cu12/0.49.0a27/ucxx_cu12-0.49.0a27-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (515 kB)
Collecting torch-geometric<2.8,>=2.5 (from cugraph-pyg-cu12==26.4.0a40->cugraph-pyg-cu12==26.4.0a40)
  Downloading http://pip-cache.local.gha-runners.nvidia.com/packages/03/9f/157e913626c1acfb3b19ce000b1a6e4e4fb177c0bc0ea0c67ca5bd714b5a/torch_geometric-2.6.1-py3-none-any.whl.metadata (63 kB)
error: resolution-too-deep
× Dependency resolution exceeded maximum depth
╰─> Pip cannot resolve the current dependencies as the dependency graph is too complex for pip to solve efficiently.
hint: Try adding lower bounds to constrain your dependencies, for example: 'package>=2.0.0' instead of just 'package'.
```
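That hint is worth unpacking: with no lower bounds, every historical release of each unconstrained package is a candidate, and the combination space the resolver may backtrack through multiplies out. A toy Python sketch (the release counts are made up, and this is not pip's actual algorithm, just the combinatorics):

```python
import math

# Hypothetical release counts for three packages with no lower bounds:
# every historical version is a candidate the resolver may try.
candidates_unbounded = {"torch": 40, "torch-geometric": 15, "ogb": 12}

# After adding floors like 'torch>=2.0', far fewer versions qualify.
candidates_bounded = {"torch": 8, "torch-geometric": 4, "ogb": 3}

def combination_space(candidates):
    """Worst-case number of version combinations a resolver could explore."""
    return math.prod(candidates.values())

print(combination_space(candidates_unbounded))  # 7200
print(combination_space(candidates_bounded))    # 96
```

Real resolvers prune much of this space, but the worst-case growth is why unbounded requirements can tip pip into `resolution-too-deep`.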
I'll try to reproduce that locally and see if I can get a better solver error.
I'm able to reproduce this locally
Code to do that:
```shell
docker run \
  --rm \
  --pull always \
  --env RAPIDS_REPOSITORY=rapidsai/cugraph-gnn \
  --env RAPIDS_SHA=13ef184fcfbeab41e096fa643f1ff082a3127ccd \
  --env RAPIDS_REF_NAME=pull-request/434 \
  --env RAPIDS_BUILD_TYPE=pull-request \
  -v $(pwd):/opt/work \
  -w /opt/work \
  -it rapidsai/citestwheel:26.04-cuda12.2.2-ubuntu22.04-py3.11 \
  bash
```
```shell
source rapids-init-pip

package_name="cugraph-pyg"
RAPIDS_PY_CUDA_SUFFIX="$(rapids-wheel-ctk-name-gen ${RAPIDS_CUDA_VERSION})"

# Download the libwholegraph, pylibwholegraph, and cugraph-pyg wheels built in the previous step
LIBWHOLEGRAPH_WHEELHOUSE=$(RAPIDS_PY_WHEEL_NAME="libwholegraph_${RAPIDS_PY_CUDA_SUFFIX}" rapids-download-wheels-from-github cpp)
PYLIBWHOLEGRAPH_WHEELHOUSE=$(rapids-download-from-github "$(rapids-package-name "wheel_python" pylibwholegraph --stable --cuda "$RAPIDS_CUDA_VERSION")")
CUGRAPH_PYG_WHEELHOUSE=$(RAPIDS_PY_WHEEL_NAME="${package_name}_${RAPIDS_PY_CUDA_SUFFIX}" RAPIDS_PY_WHEEL_PURE="1" rapids-download-wheels-from-github python)

# generate constraints (possibly pinning to oldest supported versions of dependencies)
rapids-generate-pip-constraints test_cugraph_pyg "${PIP_CONSTRAINT}"
rapids-generate-pip-constraints torch_only "${PIP_CONSTRAINT}"

rapids-pip-retry install \
  --prefer-binary \
  --constraint "${PIP_CONSTRAINT}" \
  --extra-index-url 'https://pypi.nvidia.com' \
  "${LIBWHOLEGRAPH_WHEELHOUSE}"/*.whl \
  "$(echo "${PYLIBWHOLEGRAPH_WHEELHOUSE}"/pylibwholegraph_"${RAPIDS_PY_CUDA_SUFFIX}"*.whl)" \
  "$(echo "${CUGRAPH_PYG_WHEELHOUSE}"/cugraph_pyg_"${RAPIDS_PY_CUDA_SUFFIX}"*.whl)[test]"
```

I think I see what's happening.
- `torch-geometric` and `ogb` require `torch`
- `ogb` requires some `nvidia-{project}` CTK packages like `nvidia-cuda-nvrtc`
- when we don't install a CUDA build of `torch`, the version of `torch` in the environment is only constrained by `ogb` and `torch-geometric`'s requirements, which allow all the way back to `torch>=1.6.0`
Taken together, you end up in this "resolution-too-deep" situation, where pip tries varying combinations of `ogb`, `torch-geometric`, and CPU-only `torch`. CUDA-suffixed packages make the resolution graph larger... go back far enough and `ogb` flips from depending on `nvidia-cuda-nvrtc-cu12` to `nvidia-cuda-nvrtc-cu11`.
Unfortunately, I think the best long-term fix here is to treat `ogb` and `torch-geometric` as fully optional for wheels, just as we do `torch`... keeping them out of wheel metadata and installing them separately (ref: #425). If `torch` has to be truly optional, then anything that pulls it in needs to be optional too. I'll work on that.
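For reference, a minimal sketch of the optional-import pattern this kind of change usually pairs with. This is illustrative, not code from this repo, and the helper name is hypothetical:

```python
# When a dependency is dropped from required wheel metadata, guard its
# import and fail clearly only when a feature actually needs it.
try:
    import torch_geometric  # optional; the user installs it separately
    HAS_TORCH_GEOMETRIC = True
except ImportError:
    torch_geometric = None
    HAS_TORCH_GEOMETRIC = False

def require_torch_geometric():
    """Return the module, or raise a helpful error if it's missing."""
    if not HAS_TORCH_GEOMETRIC:
        raise ImportError(
            "this feature requires torch-geometric; "
            "install it with: pip install torch-geometric"
        )
    return torch_geometric
```

This keeps the wheel installable in any environment while still giving users an actionable error the moment they touch functionality that needs the optional package.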
Interestingly, still hitting a "resolution-too-deep" error even without `torch`, `ogb`, or `torch-geometric` in the solve: https://github.com/rapidsai/cugraph-gnn/actions/runs/23318288336/job/67824788071?pr=434
Will look more into this tomorrow. Maybe it's actually RAPIDS libraries that are causing the conflicts?
This is with pip, right?
If so, maybe it is worth trying with `uv`. That might give us more insight into the nature of the conflict.
Thanks, I'll consider it.
I found the root cause... `cugraph-pyg[test]` had `sentence-transformers` in it, which pulls in `torch` as a required dependency. That took us back down the road of pip considering many different `torch` versions and other libraries with competing dependencies (including building some from source during backtracking!), which led to these issues.
We really do not want `torch` in the environment at all unless it's a CUDA build of `torch`, and that means making `sentence-transformers` optional, just as we did with `torch` itself in #425.
Pushed that change and it looks like all CI jobs (including all nightly wheels jobs!) are now passing: https://github.com/rapidsai/cugraph-gnn/actions/runs/23348691254/job/67923786458?pr=434
I'll revert the nightly stuff and go ask for a review.
```diff
  matrix:
    packages:
-     - sentence-transformers
+     - sentence-transformers>=3.0.1
```
Ran into issues in this PR that were like "pip is processing a graph of possibilities that's too large".
I don't think this floor would have helped that (in this specific case, the entire dependency just needed to be skipped), but in general having floors for test-only requirements like this reduces the risk of this type of problem.
This choice is pretty arbitrary... sentence-transformers 3.0.0 came out about 2 years ago (May 2024) and 3.0.1 came out a few days later so probably fixed some bug(s).
Chose this just to go from "no floor" to "some floor", and "version from 2 years ago" seemed like a safe choice 🤷🏻
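As a rough illustration of why floors reduce that risk: a lower bound cuts the candidate set the resolver can backtrack through. A pure-Python sketch (the release list is hypothetical, and real resolvers compare versions per PEP 440 rather than with naive tuples):

```python
def parse(v):
    """Naive version parse, good enough for plain X.Y.Z strings."""
    return tuple(int(x) for x in v.split("."))

# Hypothetical release history for a test-only dependency
releases = ["0.4.1", "2.7.0", "3.0.0", "3.0.1", "3.2.0"]

# No floor: every historical release is a candidate pip may try
no_floor = list(releases)

# With a floor like 'sentence-transformers>=3.0.1': only recent releases qualify
with_floor = [v for v in releases if parse(v) >= parse("3.0.1")]

print(len(no_floor), len(with_floor))  # 5 2
```

The floor won't save you when the dependency itself has to go, as happened here, but it shrinks the graph pip has to explore in the common case.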
Description
Fixes #410
Contributes to rapidsai/build-planning#257
Contributes to rapidsai/build-planning#256
- Makes `torch` even more optional for wheels (follow-up to #425)
- Removes `torch-geometric` from `cugraph-pyg` wheels' runtime dependencies (leaves it for conda)
- Removes `ogb` and `sentence-transformers` from `cugraph-pyg[test]` (they're only used for examples that aren't run in wheels CI)

Notes for Reviewers
How I tested this
Tested the full set of nightly and PR CI jobs for wheels, saw them all pass: https://github.com/rapidsai/cugraph-gnn/actions/runs/23348691254
This should fix #410 😁