@casparvl
Contributor

Set CMAKE_SKIP_RPATH=ON for all PyTorch builds. This avoids the issue described at easybuilders/easybuild-easyconfigs#14359

Note that the cmakemake.py EasyBlock sets CMAKE_SKIP_RPATH=ON only when --rpath is used. We don't use the same condition here, but always set it, since regardless of whether --rpath is used, the PyTorch build will get an RPATH set due to the CMake configuration set here: https://github.com/pytorch/pytorch/blob/36449ea93134574c2a22b87baad3de0bf8d64d42/cmake/Dependencies.cmake#L16
This results in libcaffe2_nvrtc.so picking up the CUDA stubs library rather than the actual driver library.
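
As a rough illustration of the difference between the two conditions, here is a minimal, self-contained sketch. It does not use EasyBuild's API: `configure_rpath_env` and its parameters are hypothetical names, and a plain dict stands in for the build environment.

```python
import os


def configure_rpath_env(rpath_enabled, always_skip=True, environ=None):
    """Return a copy of the environment with CMAKE_SKIP_RPATH set.

    rpath_enabled: stands in for "EasyBuild was run with --rpath".
    always_skip:   True models this PR (set it unconditionally);
                   False models the condition described for cmakemake.py.
    environ:       mapping to start from (defaults to os.environ).
    """
    env = dict(os.environ if environ is None else environ)
    if always_skip or rpath_enabled:
        env["CMAKE_SKIP_RPATH"] = "ON"
    return env


# Behaviour proposed in this PR: set regardless of --rpath.
always = configure_rpath_env(rpath_enabled=False, environ={})
# Behaviour of the generic condition: only set when --rpath is used.
conditional = configure_rpath_env(rpath_enabled=False, always_skip=False, environ={})
```

With `--rpath` off, only the first variant ends up with `CMAKE_SKIP_RPATH` in the environment, which is the case this PR is concerned with.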

Caspar van Leeuwen added 3 commits November 16, 2021 20:17
…syBuild's --rpath option is used. This is copied from what the generic cmakemake.py easyblock does
…used. It should ALWAYS be set, since without --rpath PyTorch will try to set a RUNPATH that includes the CUDA stubs directory - and this causes obvious problems for any build, regardless of whether --rpath is set
@casparvl casparvl added this to the next release (4.5.1?) milestone Nov 16, 2021
@casparvl
Contributor Author

Ok, this causes problems during testing:

Running test_tensorboard ... [2021-11-16 22:47:40.340613]
Executing ['/tmp/sw_stack_gpu/software/Python/3.9.5-GCCcore-10.3.0/bin/python', 'test_tensorboard.py', '-v'] ... [2021-11-16 22:47:40.340675]
WARNING:root:This caffe2 python run failed to load cuda module:libtorch.so: cannot open shared object file: No such file or directory,and AMD hip module:No module named 'caffe2.python.caffe2_pybind11_state_hip'.Will run in CPU only mode.
CRITICAL:root:Cannot load caffe2.python. Error: libtorch.so: cannot open shared object file: No such file or directory
test_tensorboard failed!

@casparvl
Contributor Author

Now trying to see if setting only env.setvar("CMAKE_INSTALL_RPATH_USE_LINK_PATH", "FALSE") is enough to fix the original issue without breaking the tests... In that case, CMAKE should RPATH during the build, but not during the install. If that doesn't work, I guess we have to look for a more selective patch as proposed in pytorch/pytorch#35418

@casparvl
Contributor Author

Alternatively, as a targeted patch, I'm wondering if we can't just do set_target_properties(caffe2_nvrtc PROPERTIES INSTALL_RPATH_USE_LINK_PATH FALSE). This would be as 'targeted' a patch as one could have. The disadvantage of such a targeted solution is that we'd probably have to do the same thing for all the failing test targets (see the list at easybuilders/easybuild-easyconfigs#14359) and that this might pop up again if new tests are added in the future...

@casparvl
Contributor Author

casparvl commented Nov 17, 2021

A more generic patch could be (inspired by the one in pytorch/pytorch#35418):

(probably put in cmake/public/cuda.cmake):

+if(CUDA_CUDA_LIB MATCHES "stubs")
+  cmake_path(GET ${CUDA_CUDA_LIB} ROOT_PATH LIBCUDA_STUB_DIR)
+  list(APPEND CMAKE_PLATFORM_IMPLICIT_LINK_DIRECTORIES ${LIBCUDA_STUB_DIR})
+endif()

This relies on the fact that CMAKE won't RPATH IMPLICIT_LINK_DIRECTORIES, regardless of settings. Of course, this relies on internal and undocumented CMAKE behavior. Not so nice, but still: it might be a longer-lasting solution than trying to use set_target_properties on each library that breaks...

@casparvl
Contributor Author

casparvl commented Nov 17, 2021

Ok, so the above patch doesn't resolve the issue. I'm not sure if the syntax is somehow wrong, or whether implicit link directories are RPATH-ed after all, but it ends up with the stubs in the RPATH again

ldd /tmp/sw_stack_gpu/software/PyTorch/1.10.0-foss-2021a-CUDA-11.3.1-rpath/lib/python3.9/site-packages/torch/lib/libcaffe2_nvrtc.so | grep stubs
        libcuda.so.1 => /tmp/sw_stack_gpu/software/CUDA/11.3.1/lib/stubs/libcuda.so.1 (0x000014a71fde8000)
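
On whether the syntax is wrong: going by the CMake documentation, `cmake_path(GET ...)` expects the name of a path variable rather than its expanded value, and `ROOT_PATH` yields the filesystem root, not the containing directory (that would be `PARENT_PATH`). This is only a possible explanation and was not verified in this thread. The distinction can be illustrated with Python's `pathlib`, used here purely as an analogy; the path below is an assumed example value for `CUDA_CUDA_LIB`.

```python
from pathlib import PurePosixPath

# Assumed example value; the real CUDA_CUDA_LIB is whatever CMake detected.
cuda_cuda_lib = PurePosixPath(
    "/tmp/sw_stack_gpu/software/CUDA/11.3.1/lib/stubs/libcuda.so"
)

# Analogue of ROOT_PATH: only the root of the path.
root_path = cuda_cuda_lib.anchor

# Analogue of PARENT_PATH: the directory that contains the library,
# which is what one would want to treat as an implicit link directory.
parent_path = str(cuda_cuda_lib.parent)

print(root_path)
print(parent_path)
```

If this reading is right, the snippet would have appended the root directory (or nothing useful) rather than the stubs directory, so the test of whether implicit link directories are RPATH-ed would be inconclusive.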

[EDIT] Correction, setting env.setvar("CMAKE_INSTALL_RPATH_USE_LINK_PATH", "FALSE") does not resolve the original issue:

ldd /tmp/sw_stack_gpu/software/PyTorch/1.10.0-foss-2021a-CUDA-11.3.1/lib/python3.9/site-packages/torch/lib/libcaffe2_nvrtc.so | grep stubs
        libcuda.so.1 => /tmp/sw_stack_gpu/software/CUDA/11.3.1/lib/stubs/libcuda.so.1 (0x0000146d3a9b7000)
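
The `ldd` checks above show which file the loader resolves, not the RPATH/RUNPATH entries themselves. To inspect those, one option is to parse the output of `readelf -d`. The following is a minimal sketch: `stub_dirs_in_dynamic_section` is a hypothetical helper, and the sample text is hand-written in the usual `readelf -d` layout, not output captured from this build.

```python
import re

# Matches the RPATH / RUNPATH lines of `readelf -d` output.
_DYN_PATH = re.compile(
    r"\((RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*?)\]"
)


def stub_dirs_in_dynamic_section(readelf_output):
    """Return RPATH/RUNPATH directories that have a 'stubs' path component."""
    hits = []
    for _tag, value in _DYN_PATH.findall(readelf_output):
        for directory in value.split(":"):
            if "stubs" in directory.split("/"):
                hits.append(directory)
    return hits


# Hand-written sample (an assumption), not real output from the build above.
sample = (
    " 0x0000000000000001 (NEEDED)             Shared library: [libcuda.so.1]\n"
    " 0x000000000000001d (RUNPATH)            Library runpath: "
    "[$ORIGIN:/tmp/sw_stack_gpu/software/CUDA/11.3.1/lib/stubs]\n"
)

print(stub_dirs_in_dynamic_section(sample))
```

In practice the input would come from running `readelf -d` on `libcaffe2_nvrtc.so`; an empty result would mean no stubs directory is recorded in the library.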

So, two options left:

  • set CMAKE_SKIP_RPATH=ON, skip the tensorboard test and hope that everything works after installation
  • Do more targeted CMAKE patches setting SKIP_BUILD_RPATH to FALSE on the individual build targets as properties (using set_target_properties).

…tion. It does still set it at compile time, which is needed for the tensorboard test from the test suite to pass
@casparvl
Contributor Author

casparvl commented Nov 17, 2021

Test report was successful, but erroneously ended up in a different PR (the one for an unrelated EasyConfig).

[casparl@gcn1 tmp]$ eb_github PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb --include-easyblocks-from-pr=2622 --from-pr 14106 -f --upload-test-report
== Temporary log file in case of crash /tmp/casparl/eb_tmp/eb-80c7yifq/easybuild-2ism66ic.log
== easyblock pytorch.py included from PR #2622
== found valid index for /sw/noarch/Centos8/2021/software/EasyBuild/4.5.0/easybuild/easyconfigs, so using it...
== found valid index for /sw/noarch/Centos8/2021/software/EasyBuild/4.5.0/easybuild/easyconfigs, so using it...
== processing EasyBuild easyconfig /tmp/casparl/eb_tmp/eb-80c7yifq/files_pr14106/p/PyTorch/PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb
== building and installing PyTorch/1.10.0-foss-2021a-CUDA-11.3.1...
== fetching files...
== creating build dir, resetting environment...
== ... (took 2 secs)
== unpacking...
== ... (took 2 secs)
== patching...
== preparing...
== ... (took 3 secs)
== configuring...
== building...
== ... (took 15 mins 7 secs)
== testing...
== ... (took 1 hour 47 mins 58 secs)
== installing...
== ... (took 10 secs)
== taking care of extensions...
== restore after iterating...
== postprocessing...
== sanity checking...
== ... (took 6 secs)
== cleaning up...
== ... (took 1 secs)
== creating module...
== ... (took 1 secs)
== permissions...
== packaging...
== running test cases...
== ... (took 12 secs)
== COMPLETED: Installation ended successfully (took 2 hours 3 mins 46 secs)
== Results of the build can be found in the log file(s) /tmp/sw_stack_gpu/software/PyTorch/1.10.0-foss-2021a-CUDA-11.3.1/easybuild/easybuild-PyTorch-1.10.0-20211117.170400.log
Adding comment to easybuild-easyconfigs issue #14106: 'Test report by @casparvl
Using easyblocks from PR(s) https://github.com/easybuilders/easybuild-easyblocks/pull/2622
**SUCCESS**
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gcn1 - Linux centos linux 8.4.2105, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 3.6.8
See https://gist.github.com/2ffa26ffd965c13934bdcc72661382f1 for a full test report.'
== Test report uploaded to https://gist.github.com/2ffa26ffd965c13934bdcc72661382f1 and mentioned in a comment in easybuild-easyconfigs PR(s) #14106
== Build succeeded for 1 out of 1
== Temporary log file(s) /tmp/casparl/eb_tmp/eb-80c7yifq/easybuild-2ism66ic.log* have been removed.
== Temporary directory /tmp/casparl/eb_tmp/eb-80c7yifq has been removed.

I'll try to rerun it and reference the right PR this time... But anyway, a gist of a successful build is here: https://gist.github.com/2ffa26ffd965c13934bdcc72661382f1

@casparvl
Contributor Author

Ok, so I misunderstood what CMAKE_INSTALL_RPATH_USE_LINK_PATH does. It does not enable/disable RPATH-ing in any way, it just changes whether the link path is used to set the RPATH. The alternative is to manually specify the RPATH in the CMakeLists.txt. I.e. setting this to FALSE is clearly not a solution. Some more info on CMAKE RPATH handling: https://gitlab.kitware.com/cmake/community/-/wikis/doc/cmake/RPATH-handling

@boegel boegel changed the title from "Updated PyTorch EasyBlock to avoid RPATH" to "update PyTorch easyblock to avoid RPATH linking to CUDA stubs library in libcaffe2_nvrtc.so" Nov 24, 2021
@casparvl
Contributor Author

Ok, so:

  • Setting CMAKE_SKIP_RPATH=ON causes the build to succeed, but the test_tensorboard to fail
  • Setting CMAKE_INSTALL_RPATH_USE_LINK_PATH=FALSE doesn't help (build fails)

Thus, a more targeted patch was required, and I implemented this as a patch for the EasyConfig in easybuilders/easybuild-easyconfigs#14382. It doesn't make sense to do that at the EasyBlock level after all. If I find the time, I will try to push the patch upstream and see if it also works for them (upstream PyTorch had some CI problems with an earlier patch that was proposed for this issue, pytorch/pytorch#37737).

Closing this PR.

@casparvl casparvl closed this Nov 24, 2021