Constrain versions of PyTorch and CI artifacts in CI runs, upgrade to dgl 2.4 (#4690)
Conversation
(Summarizing some offline conversations, to get this into the public record here on GitHub) For the last few days (unsure how long), CI jobs here targeting `cugraph-pyg` have been failing with conda solve errors.

full conda solve error trace (click me)

how to reproduce this (click me)

```shell
docker run \
    --rm \
    --gpus 1 \
    --env CI=false \
    --env RAPIDS_BUILD_TYPE="pull-request" \
    --env RAPIDS_REPOSITORY="rapidsai/cugraph" \
    --env RAPIDS_REF_NAME=pull-request/4690 \
    --env RAPIDS_SHA=922571b6db5f721a287897b3c5acc81b3fe07f69 \
    -v $(pwd):/opt/work \
    -w /opt/work \
    --network host \
    -it rapidsai/ci-conda:cuda11.8.0-rockylinux8-py3.10 \
    bash
```

Then, inside that container:

```shell
RAPIDS_VERSION_MAJOR_MINOR="$(rapids-version-major-minor)"

rapids-logger "Downloading artifacts from previous jobs"
CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

rapids-logger "Generate Python testing dependencies"
rapids-dependency-file-generator \
  --output conda \
  --file-key test_python \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

rapids-mamba-retry env create --yes -f env.yaml -n test_cugraph_pyg
conda activate test_cugraph_pyg

CONDA_CUDA_VERSION="11.8"
PYG_URL="https://data.pyg.org/whl/torch-2.3.0+cu118.html"

rapids-mamba-retry install \
  --channel "${CPP_CHANNEL}" \
  --channel "${PYTHON_CHANNEL}" \
  --channel pyg \
  "cugraph-pyg=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "pytorch>=2.3,<2.4" \
  "ogb"
```

This only shows up in the … Lines 187 to 189 in 5fad435 … The PyTorch floor here was raised to …
So what can we do? Ideally, there would be … But there are no PyTorch 2.3 conda packages up at https://anaconda.org/pyg/pyg/files?page=3&version=2.5.2&sort=basename&sort_order=desc. The options I can think of:
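To make the failure mode concrete: the `"pytorch>=2.3,<2.4"` matchspec in the install command above only admits 2.3.x builds, so a channel that only ships builds against other PyTorch versions cannot satisfy the solve. A minimal bash sketch of that version check (a hypothetical helper for illustration, not part of the CI scripts):

```shell
# Hypothetical helper mirroring the "pytorch>=2.3,<2.4" pin used above:
# accept only 2.3.x version strings.
pytorch_version_ok() {
  local v="$1"
  local major="${v%%.*}"
  local rest="${v#*.}"
  local minor="${rest%%.*}"
  [ "${major}" -eq 2 ] && [ "${minor}" -eq 3 ]
}

pytorch_version_ok "2.3.1" && echo "2.3.1: accepted"
pytorch_version_ok "2.4.0" || echo "2.4.0: rejected"
pytorch_version_ok "2.2.2" || echo "2.2.2: rejected"
```

The conda solver applies the real matchspec, of course; this just shows why a channel with only 2.4 (or only 2.2) builds cannot participate in the solve.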
update on #4690 (comment)

After offline discussion with @alexbarghi-nv, @jakirkham, and @tingyu66, we decided to replace uses of … (commit: f267c77). They're built from the same sources, and …
jakirkham
left a comment
Thanks James! 🙏

AIUI this matches what we discussed.

Also grepped for any remaining `pyg` dependency lines to fix and didn't find any.

Included one informational note below, but no action needed.

Approving to unblock.
All of the build and test jobs are now passing, and spot-checking the logs, it looks to me like they're using the correct, expected versions of dependencies 🎉
The most recent docs build (yesterday) did "succeed" … but only by using 24.08 packages 😱 It's showing up as a failure now because this PR prevents conda from using non-24.10 RAPIDS packages. In my experience with …
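A cheap guard against this kind of silent fallback would be to fail the job if any RAPIDS package resolved to a stale version. This is only a sketch (a hypothetical `check_rapids_versions` helper, not something in the CI scripts): it assumes `conda list`-style `name version …` lines, pre-filtered to RAPIDS packages, arrive on stdin, and that every one of them should be 24.10:

```shell
# Hypothetical check: fail if any line piped in reports a version other
# than 24.10.* (input is assumed to be pre-filtered to RAPIDS packages).
check_rapids_versions() {
  ! grep -qv '24\.10\.'
}

printf 'cugraph 24.10.00a\npylibcugraph 24.10.00a\n' | check_rapids_versions \
  && echo "env ok"
printf 'cugraph 24.08.02\n' | check_rapids_versions \
  || echo "stale RAPIDS package found"
```

In a real job the input would come from something like `conda list | grep cugraph`, turning the "docs build quietly used 24.08" case into a hard failure.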
There absolutely is a … I was able to reproduce this locally on an x86_64 machine with CUDA 12.2, and that revealed the real issue.

code to do that (click me)

```shell
docker run \
    --rm \
    --gpus 1 \
    --env CI=false \
    --env RAPIDS_BUILD_TYPE="pull-request" \
    --env RAPIDS_REPOSITORY="rapidsai/cugraph" \
    --env RAPIDS_REF_NAME=pull-request/4690 \
    --env RAPIDS_SHA=f267c771707d4007c6869b4a0a79feb3e0c27700 \
    -v $(pwd):/opt/work \
    -w /opt/work \
    --network host \
    -it rapidsai/ci-conda:cuda11.8.0-ubuntu22.04-py3.10 \
    bash
```

Then, inside that container:

```shell
RAPIDS_VERSION_MAJOR_MINOR="$(rapids-version-major-minor)"

CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

rapids-dependency-file-generator \
  --output conda \
  --file-key docs \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

rapids-mamba-retry env create --yes -f env.yaml -n docs
conda activate docs

if [[ "${RAPIDS_CUDA_VERSION}" == "11.8.0" ]]; then
  CONDA_CUDA_VERSION="11.8"
  DGL_CHANNEL="dglteam/label/cu118"
else
  CONDA_CUDA_VERSION="12.1"
  DGL_CHANNEL="dglteam/label/cu121"
fi

rapids-mamba-retry install \
  --channel "${CPP_CHANNEL}" \
  --channel "${PYTHON_CHANNEL}" \
  --channel conda-forge \
  --channel nvidia \
  --channel "${DGL_CHANNEL}" \
  "libcugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "pylibcugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-pyg=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-dgl=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-service-server=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "cugraph-service-client=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "libcugraph_etl=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "pylibcugraphops=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  "pylibwholegraph=${RAPIDS_VERSION_MAJOR_MINOR}.*" \
  pytorch \
  "cuda-version=${CONDA_CUDA_VERSION}"

python -c "import cugraph_dgl.convert"
```
Following the code shared above, this can be reproduced without actually invoking `python -c "import cugraph_dgl.convert"`. Walking down the trace:

```shell
python -c "import dgl"
conda install -c conda-forge torchdata
python -c "import dgl"
conda install -c conda-forge pydantic
python -c "import dgl"
```
So what do we do? I'm not sure. Looks like …
Those seem to have not made it in until … Here in … I'm not sure how to fix this. The … https://anaconda.org/dglteam/dgl/files?version=&channel=cu118

Maybe we want the … https://anaconda.org/dglteam/dgl/files?version=2.4.0.th23.cu118
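If we go the `2.4.0.th23.cu118` route, the matchspec could be derived from the `CONDA_CUDA_VERSION` already computed in the docs script (a sketch only; the `th23`/`cu118` build-string naming is an assumption based on the dglteam file listing linked above):

```shell
# Hypothetical: build a DGL matchspec pinned to the PyTorch-2.3 ("th23")
# builds for the active CUDA version, e.g. "dgl=2.4.0.th23.cu118".
CONDA_CUDA_VERSION="11.8"
DGL_SPEC="dgl=2.4.0.th23.cu${CONDA_CUDA_VERSION//./}"
echo "${DGL_SPEC}"

# which would then be passed to the install, e.g.:
#   rapids-mamba-retry install --channel dglteam "${DGL_SPEC}" ...
```

Pinning the build like this would keep conda from pulling a DGL compiled against a different PyTorch than the one the rest of the environment constrains.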
Summarizing recent commits:
Here in … the 24.10 release of … and requiring this label on the … As @alexbarghi-nv pointed out to me, something similar is being done in … For wheels, I've updated the …
I'm going to merge this. It has a lot of approvals, CI is all passing, and I spot-checked CI logs for builds and tests and saw all the things we're expecting … latest nightlies of … Thanks for the help everyone!
/merge
Thanks James! 🙏
## Summary

Follow-up to #4690. Proposes consolidating stuff like this in CI scripts:

```shell
pip install A
pip install B
pip install C
```

Into this:

```shell
pip install A B C
```

## Benefits of these changes

Reduces the risk of creating a broken environment with incompatible packages. Unlike `conda`, `pip` does not evaluate the requirements of all installed packages when you run `pip install`. Installing `torch` and `cugraph-dgl` at the same time, for example, gives us a chance to find out about packaging issues like *"`cugraph-dgl` and `torch` have conflicting requirements on `{other_package}`"* at CI time.

Similar change from `cudf`: rapidsai/cudf#16575

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Kyle Edwards (https://github.com/KyleFromNVIDIA)
- Alex Barghi (https://github.com/alexbarghi-nv)

URL: #4701
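One way to keep such scripts consolidated as they grow is to accumulate specs in an array and make a single `pip install` call, so the resolver sees every constraint at once. A sketch with hypothetical package specs (not the actual CI change), printing the command instead of running it so the example is side-effect free:

```shell
# Hypothetical spec list; the point is the single pip invocation at the end.
PIP_SPECS=(
  "torch>=2.3,<2.4"
  "cugraph-dgl==24.10.*"
  "ogb"
)

# Print the command rather than executing it, to keep the sketch offline.
echo pip install "${PIP_SPECS[@]}"
```

New requirements then get appended to the array rather than added as another standalone `pip install` line, preserving the one-resolver-pass property.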
Another step towards completing the work started in #53

Fixes #15
Contributes to rapidsai/build-planning#111

Proposes changes to get CI running on pull requests for `cugraph-pyg` and `cugraph-dgl`.

## Notes for Reviewers

Workflows for nightly builds and publishing nightly packages are intentionally not included here. See #58 (comment).

Notebook tests are intentionally not added here... they'll be added in the next PR.

Pulls in changes from these other upstream PRs that had not been ported over to this repo:

* rapidsai/cugraph#4690
* rapidsai/cugraph#4393

Authors:
- James Lamb (https://github.com/jameslamb)
- Alex Barghi (https://github.com/alexbarghi-nv)

Approvers:
- Alex Barghi (https://github.com/alexbarghi-nv)
- Bradley Dice (https://github.com/bdice)

URL: #59

We were pulling the wrong packages because the PyTorch version constraint wasn't tight enough. Hopefully these sorts of issues will be resolved in the `cugraph-gnn` repository going forward, where we can pin a specific PyTorch version for testing.