@Flamefire (Contributor) commented May 13, 2024

(created using eb --new-pr)
This is meant as an alternative to #20155. It uses a newer NCCL version, because the older one currently included in foss/2022b doesn't seem to work with PyTorch 2.1.2.

Update: #20155 seems to work now, so I'm putting this one on hold.

Requires:

@SebastianAchilles SebastianAchilles added this to the 4.x milestone May 14, 2024
@SebastianAchilles (Member)

Test report by @SebastianAchilles
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
See https://gist.github.com/SebastianAchilles/7ddc2f02e198c9e93730651648ea6a65 for a full test report.

@SebastianAchilles (Member)

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 550.54.15, Python 3.9.18
See https://gist.github.com/SebastianAchilles/caa73902c24edfc4a9f09a1104e38750 for a full test report.

@SebastianAchilles (Member)

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
See https://gist.github.com/SebastianAchilles/c2693ff5dacd31a35769e1bca1515fc6 for a full test report.

@Flamefire (Contributor Author)

> Test report by @SebastianAchilles
> FAILED
> Build succeeded for 1 out of 2 (2 easyconfigs in total)
> skl-rockylinux-89 - Linux Rocky Linux 8.9, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 550.54.15, Python 3.6.8
> See https://gist.github.com/SebastianAchilles/7ddc2f02e198c9e93730651648ea6a65 for a full test report.

That first one failed with:

distributed/_tensor/test_dtensor_ops 1/1 failed! Received signal: SIGSEGV

I see that every now and then in various tests, especially test_jit*. It seems to happen randomly and I'm not sure why.

I'll do a larger repeated run for both PRs over the weekend, so I'll have results to compare on Tuesday (Monday is a public holiday here).
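
For the repeated runs, one way to compare the two PRs is to tally crashes per test across the collected logs. A minimal sketch, assuming failure lines always look like the one quoted above; `parse_failure` and `tally_crashes` are hypothetical helpers, not part of EasyBuild or PyTorch:

```python
import re
from collections import Counter

# Pattern modelled on the single failure line quoted above (assumption);
# other message formats printed by run_test.py are not covered.
FAILURE_RE = re.compile(
    r"^(?P<test>\S+) (?P<shard>\d+)/(?P<total>\d+) failed!"
    r"(?: Received signal: (?P<signal>\w+))?$"
)


def parse_failure(line):
    """Return (test, signal) for a failure line, or None if it does not match."""
    match = FAILURE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("test"), match.group("signal")


def tally_crashes(log_lines):
    """Count how often each (test, signal) pair appears in the given lines."""
    counts = Counter()
    for line in log_lines:
        parsed = parse_failure(line)
        if parsed is not None:
            counts[parsed] += 1
    return counts
```

Feeding the concatenated logs of all runs for one PR into `tally_crashes` gives a per-test crash count that can be compared against the other PR. A test that crashes in only a few runs is more likely flaky than deterministically broken.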

github-actions bot commented Nov 22, 2024

Updated software NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb

Diff against NCCL-2.27.7-GCCcore-14.3.0-CUDA-12.9.1.eb

easybuild/easyconfigs/n/NCCL/NCCL-2.27.7-GCCcore-14.3.0-CUDA-12.9.1.eb

diff --git a/easybuild/easyconfigs/n/NCCL/NCCL-2.27.7-GCCcore-14.3.0-CUDA-12.9.1.eb b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
index d573781fef..941e350fe4 100644
--- a/easybuild/easyconfigs/n/NCCL/NCCL-2.27.7-GCCcore-14.3.0-CUDA-12.9.1.eb
+++ b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
@@ -1,23 +1,32 @@
 name = 'NCCL'
-version = '2.27.7'
+version = '2.18.3'
 versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://developer.nvidia.com/nccl'
 description = """The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective
 communication primitives that are performance optimized for NVIDIA GPUs."""
 
-toolchain = {'name': 'GCCcore', 'version': '14.3.0'}
+toolchain = {'name': 'GCCcore', 'version': '12.2.0'}
 
 github_account = 'NVIDIA'
 source_urls = [GITHUB_SOURCE]
 sources = ['v%(version)s-1.tar.gz']
-checksums = ['98e6262bd55932c51e7c8ffc50cc764f019e4b94a8fd6694d839ae828ec8d128']
+patches = [
+    'NCCL-2.16.2_fix-cpuid.patch',
+    'NCCL-2.18.3_fix-cudaMemcpyAsync.patch',
+]
+checksums = [
+    ('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9',
+     'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96'),
+    {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'},
+    {'NCCL-2.18.3_fix-cudaMemcpyAsync.patch': '7dc8d0d1b78e4f8acefbc400860f47432ef67c225b50d73c732999c23483de90'},
+]
 
-builddependencies = [('binutils', '2.44')]
+builddependencies = [('binutils', '2.39')]
 
 dependencies = [
-    ('CUDA', '12.9.1', '', SYSTEM),
-    ('UCX-CUDA', '1.19.0', versionsuffix),
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('UCX-CUDA', '1.13.1', versionsuffix),
 ]
 
 # default CUDA compute capabilities to use (override via --cuda-compute-capabilities)
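
The `checksums` list in this diff mixes two entry shapes: a tuple of two SHA256 values for the source tarball and one `{filename: checksum}` dict per patch. As far as I understand EasyBuild's conventions, a tuple of checksums lists alternatives, of which only one has to match. A rough sketch of that matching logic follows; `checksum_matches` is a hypothetical helper, not EasyBuild's implementation, and it ignores other entry forms such as `(type, value)` pairs:

```python
import hashlib


def sha256_of(data):
    """Hex-encoded SHA256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def checksum_matches(spec, filename, data):
    """Check file contents against one checksums entry.

    str   -> the digest must equal it
    tuple -> alternatives, any one may match (assumed semantics)
    dict  -> look up the entry for this filename, then recurse
    """
    if isinstance(spec, str):
        return sha256_of(data) == spec
    if isinstance(spec, tuple):
        return any(checksum_matches(alt, filename, data) for alt in spec)
    if isinstance(spec, dict):
        return filename in spec and checksum_matches(spec[filename], filename, data)
    raise TypeError("unsupported checksum entry: %r" % (spec,))
```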

Diff against NCCL-2.26.6-GCCcore-14.2.0-CUDA-12.8.0.eb

easybuild/easyconfigs/n/NCCL/NCCL-2.26.6-GCCcore-14.2.0-CUDA-12.8.0.eb

diff --git a/easybuild/easyconfigs/n/NCCL/NCCL-2.26.6-GCCcore-14.2.0-CUDA-12.8.0.eb b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
index 31a072cbdd..941e350fe4 100644
--- a/easybuild/easyconfigs/n/NCCL/NCCL-2.26.6-GCCcore-14.2.0-CUDA-12.8.0.eb
+++ b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
@@ -1,23 +1,32 @@
 name = 'NCCL'
-version = '2.26.6'
+version = '2.18.3'
 versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://developer.nvidia.com/nccl'
 description = """The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective
 communication primitives that are performance optimized for NVIDIA GPUs."""
 
-toolchain = {'name': 'GCCcore', 'version': '14.2.0'}
+toolchain = {'name': 'GCCcore', 'version': '12.2.0'}
 
 github_account = 'NVIDIA'
 source_urls = [GITHUB_SOURCE]
 sources = ['v%(version)s-1.tar.gz']
-checksums = ['2a4f86198510e1f0764c116b33ff70e082240f87d158b2017d7f34c7c3768ac6']
+patches = [
+    'NCCL-2.16.2_fix-cpuid.patch',
+    'NCCL-2.18.3_fix-cudaMemcpyAsync.patch',
+]
+checksums = [
+    ('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9',
+     'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96'),
+    {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'},
+    {'NCCL-2.18.3_fix-cudaMemcpyAsync.patch': '7dc8d0d1b78e4f8acefbc400860f47432ef67c225b50d73c732999c23483de90'},
+]
 
-builddependencies = [('binutils', '2.42')]
+builddependencies = [('binutils', '2.39')]
 
 dependencies = [
-    ('CUDA', '12.8.0', '', SYSTEM),
-    ('UCX-CUDA', '1.18.0', versionsuffix),
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('UCX-CUDA', '1.13.1', versionsuffix),
 ]
 
 # default CUDA compute capabilities to use (override via --cuda-compute-capabilities)
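
The file names in these diffs follow from the easyconfig parameters: `versionsuffix = '-CUDA-%(cudaver)s'` is a template, and to my understanding `%(cudaver)s` is filled in from the version of the CUDA dependency. The sketch below only mimics that substitution with plain Python `%` formatting; `easyconfig_filename` is a hypothetical helper and does not cover special cases such as the system toolchain:

```python
def easyconfig_filename(name, version, toolchain, versionsuffix_template, template_values):
    """Build '<name>-<version>-<tcname>-<tcversion><versionsuffix>.eb'."""
    # Resolve the template, e.g. '-CUDA-%(cudaver)s' -> '-CUDA-12.0.0'
    versionsuffix = versionsuffix_template % template_values
    return "%s-%s-%s-%s%s.eb" % (
        name, version, toolchain['name'], toolchain['version'], versionsuffix,
    )


nccl = easyconfig_filename(
    'NCCL', '2.18.3',
    {'name': 'GCCcore', 'version': '12.2.0'},
    '-CUDA-%(cudaver)s', {'cudaver': '12.0.0'},
)
print(nccl)  # NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
```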

Diff against NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb

easybuild/easyconfigs/n/NCCL/NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb

diff --git a/easybuild/easyconfigs/n/NCCL/NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
index 0534e538fa..941e350fe4 100644
--- a/easybuild/easyconfigs/n/NCCL/NCCL-2.22.3-GCCcore-13.3.0-CUDA-12.6.0.eb
+++ b/easybuild/easyconfigs/n/NCCL/NCCL-2.18.3-GCCcore-12.2.0-CUDA-12.0.0.eb
@@ -1,23 +1,32 @@
 name = 'NCCL'
-version = '2.22.3'
+version = '2.18.3'
 versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://developer.nvidia.com/nccl'
 description = """The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective
 communication primitives that are performance optimized for NVIDIA GPUs."""
 
-toolchain = {'name': 'GCCcore', 'version': '13.3.0'}
+toolchain = {'name': 'GCCcore', 'version': '12.2.0'}
 
 github_account = 'NVIDIA'
 source_urls = [GITHUB_SOURCE]
 sources = ['v%(version)s-1.tar.gz']
-checksums = ['45151629a9494460e73375281e8b0fe379141528879301899ece9b776faca024']
+patches = [
+    'NCCL-2.16.2_fix-cpuid.patch',
+    'NCCL-2.18.3_fix-cudaMemcpyAsync.patch',
+]
+checksums = [
+    ('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9',
+     'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96'),
+    {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'},
+    {'NCCL-2.18.3_fix-cudaMemcpyAsync.patch': '7dc8d0d1b78e4f8acefbc400860f47432ef67c225b50d73c732999c23483de90'},
+]
 
-builddependencies = [('binutils', '2.42')]
+builddependencies = [('binutils', '2.39')]
 
 dependencies = [
-    ('CUDA', '12.6.0', '', SYSTEM),
-    ('UCX-CUDA', '1.16.0', versionsuffix),
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('UCX-CUDA', '1.13.1', versionsuffix),
 ]
 
 # default CUDA compute capabilities to use (override via --cuda-compute-capabilities)
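
The dependency entries in these easyconfigs are tuples of two to four elements: name, version, an optional versionsuffix and an optional toolchain. `('CUDA', '12.0.0', '', SYSTEM)` pins CUDA to the system toolchain, while `('UCX-CUDA', '1.13.1', versionsuffix)` leaves the toolchain to be resolved from the easyconfig's own. A small sketch that expands such tuples into named fields; `Dependency`, `normalize` and the local `SYSTEM` stand-in are assumptions for illustration, not EasyBuild's internal representation:

```python
from typing import NamedTuple, Optional

# Local stand-in for EasyBuild's SYSTEM constant (assumed to denote the system toolchain).
SYSTEM = {'name': 'system', 'version': 'system'}


class Dependency(NamedTuple):
    name: str
    version: str
    versionsuffix: str = ''
    toolchain: Optional[dict] = None  # None: resolved from the easyconfig's toolchain


def normalize(dep):
    """Expand a 2- to 4-element dependency tuple into named fields."""
    if not 2 <= len(dep) <= 4:
        raise ValueError("expected (name, version[, versionsuffix[, toolchain]])")
    return Dependency(*dep)


versionsuffix = '-CUDA-12.0.0'
deps = [normalize(d) for d in [
    ('CUDA', '12.0.0', '', SYSTEM),
    ('UCX-CUDA', '1.13.1', versionsuffix),
]]
```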

Updated software PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb

Diff against PyTorch-2.6.0-foss-2024a.eb

easybuild/easyconfigs/p/PyTorch/PyTorch-2.6.0-foss-2024a.eb

diff --git a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.6.0-foss-2024a.eb b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
index 975a779408..d8b2968689 100644
--- a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.6.0-foss-2024a.eb
+++ b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
@@ -1,195 +1,200 @@
 name = 'PyTorch'
-version = '2.6.0'
+version = '2.1.2'
+versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://pytorch.org/'
 description = """Tensors and Dynamic neural networks in Python with strong GPU acceleration.
 PyTorch is a deep learning framework that puts Python first."""
 
-toolchain = {'name': 'foss', 'version': '2024a'}
+toolchain = {'name': 'foss', 'version': '2022b'}
 
 source_urls = [GITHUB_RELEASE]
 sources = ['%(namelower)s-v%(version)s.tar.gz']
 patches = [
     'PyTorch-1.7.0_disable-dev-shm-test.patch',
+    'PyTorch-1.11.1_skip-test_init_from_local_shards.patch',
     'PyTorch-1.12.1_add-hypothesis-suppression.patch',
+    'PyTorch-1.12.1_fix-test_cpp_extensions_jit.patch',
     'PyTorch-1.12.1_fix-TestTorch.test_to.patch',
+    'PyTorch-1.12.1_skip-test_round_robin.patch',
     'PyTorch-1.13.1_fix-gcc-12-warning-in-fbgemm.patch',
+    'PyTorch-1.13.1_fix-protobuf-dependency.patch',
+    'PyTorch-1.13.1_fix-warning-in-test-cpp-api.patch',
     'PyTorch-1.13.1_skip-failing-singular-grad-test.patch',
+    'PyTorch-1.13.1_skip-tests-without-fbgemm.patch',
     'PyTorch-2.0.1_avoid-test_quantization-failures.patch',
+    'PyTorch-2.0.1_fix-skip-decorators.patch',
+    'PyTorch-2.0.1_fix-ub-in-inductor-codegen.patch',
+    'PyTorch-2.0.1_fix-vsx-loadu.patch',
+    'PyTorch-2.0.1_no-cuda-stubs-rpath.patch',
     'PyTorch-2.0.1_skip-failing-gradtest.patch',
     'PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch',
     'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch',
+    'PyTorch-2.1.0_disable-gcc12-warning.patch',
+    'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch',
+    'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch',
+    'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch',
+    'PyTorch-2.1.0_fix-validationError-output-test.patch',
+    'PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch',
+    'PyTorch-2.1.0_increase-tolerance-functorch-test_vmapvjpvjp.patch',
+    'PyTorch-2.1.0_remove-sparse-csr-nnz-overflow-test.patch',
     'PyTorch-2.1.0_remove-test-requiring-online-access.patch',
+    'PyTorch-2.1.0_skip-diff-test-on-ppc.patch',
     'PyTorch-2.1.0_skip-dynamo-test_predispatch.patch',
+    'PyTorch-2.1.0_skip-test_jvp_linalg_det_singular.patch',
+    'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch',
+    'PyTorch-2.1.0_skip-test_wrap_bad.patch',
+    'PyTorch-2.1.2_add-cuda-skip-markers.patch',
+    'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch',
+    'PyTorch-2.1.2_fix-device-mesh-check.patch',
+    'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch',
+    'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch',
+    'PyTorch-2.1.2_fix-test_cuda-non-x86.patch',
+    'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch',
+    'PyTorch-2.1.2_fix-test_memory_profiler.patch',
+    'PyTorch-2.1.2_fix-test_parallelize_api.patch',
+    'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch',
+    'PyTorch-2.1.2_fix-vsx-vector-abs.patch',
+    'PyTorch-2.1.2_fix-vsx-vector-div.patch',
+    'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch',
+    'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch',
+    'PyTorch-2.1.2_relax-cuda-tolerances.patch',
+    'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch',
+    'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch',
+    'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch',
+    'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch',
+    'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch',
     'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch',
-    'PyTorch-2.3.0_disable_test_linear_package_if_no_half_types_are_available.patch',
-    'PyTorch-2.3.0_fix-mkldnn-avx512-f32-bias.patch',
-    'PyTorch-2.3.0_skip_test_var_mean_differentiable.patch',
-    'PyTorch-2.6.0_add-checkfunctionexists-include.patch',
-    'PyTorch-2.6.0_allow-sympy-1.13.3.patch',
-    'PyTorch-2.6.0_avoid_caffe2_test_cpp_jit.patch',
-    'PyTorch-2.6.0_disable_DataType_dependent_test_if_tensorboard_is_not_available.patch',
-    'PyTorch-2.6.0_disable_tests_which_need_network_download.patch',
-    'PyTorch-2.6.0_disable-gcc12-warnings.patch',
-    'PyTorch-2.6.0_fix-accuracy-issues-in-linalg_solve.patch',
-    'PyTorch-2.6.0_fix-cpuinfo-bug-with-smt.patch',
-    'PyTorch-2.6.0_fix-distributed-tests-without-gpus.patch',
-    'PyTorch-2.6.0_fix-edge-case-causing-test_trigger_bisect_on_error-failure.patch',
-    'PyTorch-2.6.0_fix-ExcTests.test_trigger_on_error.patch',
-    'PyTorch-2.6.0_fix-flaky-test_aot_export_with_torch_cond.patch',
-    'PyTorch-2.6.0_fix-inductor-device-interface.patch',
-    'PyTorch-2.6.0_fix-server-in-test_control_plane.patch',
-    'PyTorch-2.6.0_fix-skip-decorators.patch',
-    'PyTorch-2.6.0_fix-sympy-1.13-compat.patch',
-    'PyTorch-2.6.0_fix-test_autograd_cpp_node_saved_float.patch',
-    'PyTorch-2.6.0_fix-test_linear_with_embedding.patch',
-    'PyTorch-2.6.0_fix-test_linear_with_in_out_buffer-without-mkl.patch',
-    'PyTorch-2.6.0_fix-test_public_bindings.patch',
-    'PyTorch-2.6.0_fix-test_unbacked_bindings_for_divisible_u_symint.patch',
-    'PyTorch-2.6.0_fix-vsx-vector-shift-functions.patch',
-    'PyTorch-2.6.0_fix-xnnpack-float16-convert.patch',
-    'PyTorch-2.6.0_increase-tolerance-test_aotdispatch-matmul.patch',
-    'PyTorch-2.6.0_increase-tolerance-test_quick-baddbmm.patch',
-    'PyTorch-2.6.0_increase-tolerance-test_vmap_autograd_grad.patch',
-    'PyTorch-2.6.0_remove-test_slice_with_floordiv.patch',
-    'PyTorch-2.6.0_show-test-duration.patch',
-    'PyTorch-2.6.0_skip-diff-test-on-ppc.patch',
-    'PyTorch-2.6.0_skip-test_checkpoint_wrapper_parity-on-cpu.patch',
-    'PyTorch-2.6.0_skip-test_init_from_local_shards.patch',
-    'PyTorch-2.6.0_skip-test_jvp_linalg_det_singular.patch',
-    'PyTorch-2.6.0_skip-test-requiring-MKL.patch',
-    'PyTorch-2.6.0_skip-test_segfault.patch',
-    'PyTorch-2.6.0_skip-tests-without-fbgemm.patch',
 ]
 checksums = [
-    {'pytorch-v2.6.0.tar.gz': '3005690eb7b083c443a38c7657938af63902f524ad87a6c83f1aca38c77e3b57'},
+    {'pytorch-v2.1.2.tar.gz': '85effbcce037bffa290aea775c9a4bad5f769cb229583450c40055501ee1acd7'},
     {'PyTorch-1.7.0_disable-dev-shm-test.patch': '622cb1eaeadc06e13128a862d9946bcc1f1edd3d02b259c56a9aecc4d5406b8a'},
+    {'PyTorch-1.11.1_skip-test_init_from_local_shards.patch':
+     '4aeb1b0bc863d4801b0095cbce69f8794066748f0df27c6aaaf729c5ecba04b7'},
     {'PyTorch-1.12.1_add-hypothesis-suppression.patch':
      'e71ffb94ebe69f580fa70e0de84017058325fdff944866d6bd03463626edc32c'},
+    {'PyTorch-1.12.1_fix-test_cpp_extensions_jit.patch':
+     '1efc9850c431d702e9117d4766277d3f88c5c8b3870997c9974971bce7f2ab83'},
     {'PyTorch-1.12.1_fix-TestTorch.test_to.patch': '75f27987c3f25c501e719bd2b1c70a029ae0ee28514a97fe447516aee02b1535'},
+    {'PyTorch-1.12.1_skip-test_round_robin.patch': '63d4849b78605aa088fdff695637d9473ea60dee603a3ff7f788690d70c55349'},
     {'PyTorch-1.13.1_fix-gcc-12-warning-in-fbgemm.patch':
      '5c7be91a6096083a0b1315efe0001537499c600f1f569953c6a2c7f4cc1d0910'},
+    {'PyTorch-1.13.1_fix-protobuf-dependency.patch':
+     '8bd755a0cab7233a243bc65ca57c9630dfccdc9bf8c9792f0de4e07a644fcb00'},
+    {'PyTorch-1.13.1_fix-warning-in-test-cpp-api.patch':
+     'bdde0f2105215c95a54de64ec4b1a4520528510663174fef6d5b900eb1db3937'},
     {'PyTorch-1.13.1_skip-failing-singular-grad-test.patch':
      '72688a57b2bb617665ad1a1d5e362c5111ae912c10936bb38a089c0204729f48'},
+    {'PyTorch-1.13.1_skip-tests-without-fbgemm.patch':
+     '481e595f673baf8ae58b41697a6792b83048b0264aa79b422f48cd8c22948bb7'},
     {'PyTorch-2.0.1_avoid-test_quantization-failures.patch':
      '02e3f47e4ed1d7d6077e26f1ae50073dc2b20426269930b505f4aefe5d2f33cd'},
+    {'PyTorch-2.0.1_fix-skip-decorators.patch': '2039012cef45446065e1a2097839fe20bb29fe3c1dcc926c3695ebf29832e920'},
+    {'PyTorch-2.0.1_fix-ub-in-inductor-codegen.patch':
+     '1b37194f55ae678f3657b8728dfb896c18ffe8babe90987ce468c4fa9274f357'},
+    {'PyTorch-2.0.1_fix-vsx-loadu.patch': 'a0ffa61da2d47c6acd09aaf6d4791e527d8919a6f4f1aa7ed38454cdcadb1f72'},
+    {'PyTorch-2.0.1_no-cuda-stubs-rpath.patch': '8902e58a762240f24cdbf0182e99ccdfc2a93492869352fcb4ca0ec7e407f83a'},
     {'PyTorch-2.0.1_skip-failing-gradtest.patch': '8030bdec6ba49b057ab232d19a7f1a5e542e47e2ec340653a246ec9ed59f8bc1'},
     {'PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch':
      '7047862abc1abaff62954da59700f36d4f39fcf83167a638183b1b7f8fec78ae'},
     {'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch':
      '166c134573a95230e39b9ea09ece3ad8072f39d370c9a88fb2a1e24f6aaac2b5'},
+    {'PyTorch-2.1.0_disable-gcc12-warning.patch': 'c858b8db0010f41005dc06f9a50768d0d3dc2d2d499ccbdd5faf8a518869a421'},
+    {'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch':
+     'd895018ebdfd46e65d9f7645444a3b4c5bbfe3d533a08db559a04be34e01e478'},
+    {'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch':
+     'b15b1291a3c37bf6a4982cfbb3483f693acb46a67bc0912b383fd98baf540ccf'},
+    {'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch':
+     '84bb51a719abc677031a7a3dfe4382ff098b0cbd8b39b8bed2a7fa03f80ac1e9'},
+    {'PyTorch-2.1.0_fix-validationError-output-test.patch':
+     '7eba0942afb121ed92fac30d1529447d892a89eb3d53c565f8e9d480e95f692b'},
+    {'PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch':
+     '3793b4b878be1abe7791efcbd534774b87862cfe7dc4774ca8729b6cabb39e7e'},
+    {'PyTorch-2.1.0_increase-tolerance-functorch-test_vmapvjpvjp.patch':
+     'aef38adf1210d0c5455e91d7c7a9d9e5caad3ae568301e0ba9fc204309438e7b'},
+    {'PyTorch-2.1.0_remove-sparse-csr-nnz-overflow-test.patch':
+     '0ac36411e76506b3354c85a8a1260987f66af947ee52ffc64230aee1fa02ea8b'},
     {'PyTorch-2.1.0_remove-test-requiring-online-access.patch':
      '35184b8c5a1b10f79e511cc25db3b8a5585a5d58b5d1aa25dd3d250200b14fd7'},
+    {'PyTorch-2.1.0_skip-diff-test-on-ppc.patch': '394157dbe565ffcbc1821cd63d05930957412156cc01e949ef3d3524176a1dda'},
     {'PyTorch-2.1.0_skip-dynamo-test_predispatch.patch':
      '6298daf9ddaa8542850eee9ea005f28594ab65b1f87af43d8aeca1579a8c4354'},
+    {'PyTorch-2.1.0_skip-test_jvp_linalg_det_singular.patch':
+     '5229ca88a71db7667a90ddc0b809b2c817698bd6e9c5aaabd73d3173cf9b99fe'},
+    {'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch':
+     '5dcc79883b6e3ec0a281a8e110db5e0a5880de843bb05653589891f16473ead5'},
+    {'PyTorch-2.1.0_skip-test_wrap_bad.patch': 'b8583125ee94e553b6f77c4ab4bfa812b89416175dc7e9b7390919f3b485cb63'},
+    {'PyTorch-2.1.2_add-cuda-skip-markers.patch': 'd007d6d0cdb533e7d01f503e9055218760123a67c1841c57585385144be18c9a'},
+    {'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch':
+     'c164357efa4ce88095376e590ba508fc1daa87161e1e59544eda56daac7f2847'},
+    {'PyTorch-2.1.2_fix-device-mesh-check.patch': 'c0efc288bf3d9a9a3c8bbd2691348a589a2677ea43880a8c987db91c8de4806b'},
+    {'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch':
+     'f583532c59f35f36998851957d501b3ac8c883884efd61bbaa308db55cb6bdcd'},
+    {'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch':
+     'f7adafb4e4d3b724b93237a259797b6ed6f535f83be0e34a7b759c71c6a8ddf2'},
+    {'PyTorch-2.1.2_fix-test_cuda-non-x86.patch': '1ed76fcc87e6c50606ac286487292a3d534707068c94af74c3a5de8153fa2c2c'},
+    {'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch':
+     'cd1455495886a7d6b2d30d48736eb0103fded21e2e36de6baac719b9c52a1c92'},
+    {'PyTorch-2.1.2_fix-test_memory_profiler.patch':
+     '30b0c9355636c0ab3dedae02399789053825dc3835b4d7dac6e696767772b1ce'},
+    {'PyTorch-2.1.2_fix-test_parallelize_api.patch':
+     'f8387a1693af344099c806981ca38df1306d7f4847d7d44713306338384b1cfd'},
+    {'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch':
+     'a0ef99192ee2ad1509c78a8377023d5be2b5fddb16f84063b7c9a0b53d979090'},
+    {'PyTorch-2.1.2_fix-vsx-vector-abs.patch': 'd67d32407faed7dc1dbab4bba0e2f7de36c3db04560ced35c94caf8d84ade886'},
+    {'PyTorch-2.1.2_fix-vsx-vector-div.patch': '11f497a6892eb49b249a15320e4218e0d7ac8ae4ce67de39e4a018a064ca1acc'},
+    {'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch':
+     '90bd001e034095329277d70c6facc4026b4ce6d7f8b8d6aa81c0176eeb462eb1'},
+    {'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch':
+     '07a5e4233d02fb6348872838f4d69573c777899c6f0ea4e39ae23c08660d41e5'},
+    {'PyTorch-2.1.2_relax-cuda-tolerances.patch': '554ad09787f61080fafdb84216e711e32327aa357e2a9c40bb428eb6503dee6e'},
+    {'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch':
+     'e6a1efe3d127fcbf4723476a7a1c01cfcf2ccb16d1fb250f478192623e8b6a15'},
+    {'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch':
+     '7ace835af60c58d9e0754a34c19d4b9a0c3a531f19e5d0eba8e2e49206eaa7eb'},
+    {'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch':
+     '6cf711bf26518550903b09ed4431de9319791e79d61aab065785d6608fd5cc88'},
+    {'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch':
+     '943ee92f5fd518f608a59e43fe426b9bb45d7e7ad0ba04639e516db2d61fa57d'},
+    {'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch':
+     '7f5befddcb006b6ab5377de6ee3c29df375c5f8ef5e42b998d35113585b983f3'},
     {'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch':
      'fb96eefabf394617bbb3fbd3a7a7c1aa5991b3836edc2e5d2a30e708bfe49ba1'},
-    {'PyTorch-2.3.0_disable_test_linear_package_if_no_half_types_are_available.patch':
-     '23416f2d9d5226695ec3fbea0671e3650c655c19deefd3f0f8ddab5afa50f485'},
-    {'PyTorch-2.3.0_fix-mkldnn-avx512-f32-bias.patch':
-     'ee07d21c3ac7aeb0bd0e39507b18a417b9125284a529102929c4b5c6727c2976'},
-    {'PyTorch-2.3.0_skip_test_var_mean_differentiable.patch':
-     '9703fd0f1fca8916f6d79d83e9a7efe8e3f717362a5fdaa8f5d9da90d0c75018'},
-    {'PyTorch-2.6.0_add-checkfunctionexists-include.patch':
-     '93579e35e946fb06025a50c42f3625ed8b8ac9f503a963cc23767e2c8869f0ea'},
-    {'PyTorch-2.6.0_allow-sympy-1.13.3.patch': 'd17f5c528f64fe5e905c9154e90654e8ed2b7f0c16418ffd84ed3913aeb57eea'},
-    {'PyTorch-2.6.0_avoid_caffe2_test_cpp_jit.patch':
-     '88d03d90359bc1fe3cfa3562624d4fbfd4c6654c9199c556ca912ac55289ce55'},
-    {'PyTorch-2.6.0_disable_DataType_dependent_test_if_tensorboard_is_not_available.patch':
-     '74db866787f1e666ed3b35db5204f05a0ba8d989fb23057a72dd07928388dc46'},
-    {'PyTorch-2.6.0_disable_tests_which_need_network_download.patch':
-     'fe76129811e4eb24d0e12c397335a4c7971b0c4e48ce9cdb9169f3ef9de7aac4'},
-    {'PyTorch-2.6.0_disable-gcc12-warnings.patch': '892643650788b743106ebe4e70c68be42a756eba797f0f79e31708d6e008a620'},
-    {'PyTorch-2.6.0_fix-accuracy-issues-in-linalg_solve.patch':
-     'a6b1cfe8f03ad5b17437e04e6a0369a25fcc79eed939ce6912ceca1c0ab0f444'},
-    {'PyTorch-2.6.0_fix-cpuinfo-bug-with-smt.patch':
-     '2ecb182802e795ed79b7a5f2ce9459780290b4097e981a737a98d4b47d3e2555'},
-    {'PyTorch-2.6.0_fix-distributed-tests-without-gpus.patch':
-     '011cffc098b6818eb160b6bec2e671dec46cb2a8457ce32144ea01cc9ed4290a'},
-    {'PyTorch-2.6.0_fix-edge-case-causing-test_trigger_bisect_on_error-failure.patch':
-     'fd918fa510bf04c95f3bcc2f4abea417632a0fefb278154ec95207ca0d1719ed'},
-    {'PyTorch-2.6.0_fix-ExcTests.test_trigger_on_error.patch':
-     '445472d43a61523b2ed169023f5f6db197bc2df8408f59e6254e55f5cb1d3a11'},
-    {'PyTorch-2.6.0_fix-flaky-test_aot_export_with_torch_cond.patch':
-     '79cf77a795e06c4c3206a998ce8f4a92072f79736803008ede65e5ec2f204bfc'},
-    {'PyTorch-2.6.0_fix-inductor-device-interface.patch':
-     'e8e6af1ea5f01568c23127d4f83aacb482ec9005ba558b68763748a581bcc5bc'},
-    {'PyTorch-2.6.0_fix-server-in-test_control_plane.patch':
-     '1337689ff28ecaa8d1d0edf60d322bcdd7846fec040925325d357b19eb6e4342'},
-    {'PyTorch-2.6.0_fix-skip-decorators.patch': 'ec1ba1ef2a2b2c6753a0b35d10c6af0457fc90fe98e2f77979745d9f79d79c86'},
-    {'PyTorch-2.6.0_fix-sympy-1.13-compat.patch': 'b801690a5b79ba6e4916ac6f719c36682b2a197582aee5e6f385e808f776920e'},
-    {'PyTorch-2.6.0_fix-test_autograd_cpp_node_saved_float.patch':
-     '928c4b1dc16f3d4a7bec29d8749b89ebd41488845938e2514c7fa8c048950e33'},
-    {'PyTorch-2.6.0_fix-test_linear_with_embedding.patch':
-     '56c053de7cfaa2f9898c3b036a185b499f5d44a7b4cd0442c45a8c94928322bf'},
-    {'PyTorch-2.6.0_fix-test_linear_with_in_out_buffer-without-mkl.patch':
-     '8cf9e5d434eb8d3b81400622ca23714c7002a0b835e7e08b384b84408c7ed085'},
-    {'PyTorch-2.6.0_fix-test_public_bindings.patch':
-     '066d88acd8156ed3f91b6a8e924de57f8aef944aa1bf67dc453b830ee1c26094'},
-    {'PyTorch-2.6.0_fix-test_unbacked_bindings_for_divisible_u_symint.patch':
-     '5f5ce1e275888cd6a057a0769fffaa9e49dde003ba191fd70b0265d8c6259a9b'},
-    {'PyTorch-2.6.0_fix-vsx-vector-shift-functions.patch':
-     '82ce0b48e3b7c3dfd3a2ba915f4675d5c3a6d149646e1e0d6a29eedbbaecc8bd'},
-    {'PyTorch-2.6.0_fix-xnnpack-float16-convert.patch':
-     'a6fcb475040c6fed2c0ec8b3f9c1e9fb964220413e84c8f2ee4092770ee6ac7d'},
-    {'PyTorch-2.6.0_increase-tolerance-test_aotdispatch-matmul.patch':
-     'c1c6ea41504e4479d258225ecefc7e9c5726934601610904ae555501a11e9109'},
-    {'PyTorch-2.6.0_increase-tolerance-test_quick-baddbmm.patch':
-     '9850facdfb5d98451249570788217ede07466cae9ba52cd03afd3ec803ba33c9'},
-    {'PyTorch-2.6.0_increase-tolerance-test_vmap_autograd_grad.patch':
-     '8d5eb53bb0a1456af333ae646c860033d6dd037bd9152601a200ca5c10ebf3cb'},
-    {'PyTorch-2.6.0_remove-test_slice_with_floordiv.patch':
-     '1b7ff59a595b9ebbc042d8ff53e3f6c72a1d3b04fb82228f4433473f28623f9b'},
-    {'PyTorch-2.6.0_show-test-duration.patch': '5508f2f9619204d9f3c356dbd4000a00d58f452ab2d64ae920eb8bc8b5484d75'},
-    {'PyTorch-2.6.0_skip-diff-test-on-ppc.patch': '6f2f87cad1b0ab8c5a0c7b3f7fbc14e4bdfbe61da26a3934ded9dda7fe368c74'},
-    {'PyTorch-2.6.0_skip-test_checkpoint_wrapper_parity-on-cpu.patch':
-     '600f74de167b6fea4d849229de6d653dc616093b456962729222d6bfa767a8e8'},
-    {'PyTorch-2.6.0_skip-test_init_from_local_shards.patch':
-     '222383195f6a3b7c545ffeadb4dd469b9f3361b42c0866de3d3f0f91f8fbe777'},
-    {'PyTorch-2.6.0_skip-test_jvp_linalg_det_singular.patch':
-     '3bbe8e585765d6db2a77ed0f751eadf924fbbedc95bbd88f447538ceede273fd'},
-    {'PyTorch-2.6.0_skip-test-requiring-MKL.patch':
-     'f1c9b1c77b09d59317fd52d390e7d948a147325b927ad6373c1fa1d1d6ea1ea8'},
-    {'PyTorch-2.6.0_skip-test_segfault.patch': '26806bd62e6b61b56ebaa52d68ca44c415a28124f684bd2fb373557ada68ef52'},
-    {'PyTorch-2.6.0_skip-tests-without-fbgemm.patch':
-     'ed35099de94a14322a879066da048ec9bc565dc81287b4adc4fec46f9afe90cf'},
 ]
 
 osdependencies = [OS_PKG_IBVERBS_DEV]
 
 builddependencies = [
-    ('CMake', '3.29.3'),
-    ('hypothesis', '6.103.1'),
+    ('CMake', '3.24.3'),
+    ('hypothesis', '6.68.2'),
     # For tests
-    ('parameterized', '0.9.0'),
     ('pytest-flakefinder', '1.1.0'),
-    ('pytest-rerunfailures', '15.0'),
+    ('pytest-rerunfailures', '12.0'),
     ('pytest-shard', '0.1.2'),
-    ('pytest-subtests', '0.13.1'),
-    ('tlparse', '0.3.37'),
-    ('optree', '0.14.1'),
     ('unittest-xml-reporting', '3.1.0'),
 ]
 
 dependencies = [
-    ('Ninja', '1.12.1'),  # Required for JIT compilation of C++ extensions
-    ('Python', '3.12.3'),
-    ('Python-bundle-PyPI', '2024.06'),
-    ('protobuf', '28.0'),
-    ('protobuf-python', '5.28.0'),
-    ('pybind11', '2.12.0'),
-    ('PuLP', '2.8.0'),
-    ('SciPy-bundle', '2024.05'),
-    ('PyYAML', '6.0.2'),
-    ('MPFR', '4.2.1'),
-    ('GMP', '6.3.0'),
-    ('numactl', '2.0.18'),
-    ('FFmpeg', '7.0.2'),
-    ('Pillow', '10.4.0'),
-    ('expecttest', '0.2.1'),
-    ('networkx', '3.4.2'),
-    ('sympy', '1.13.3'),
-    ('Z3', '4.13.0',),
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('cuDNN', '8.8.0.121', '-CUDA-%(cudaver)s', SYSTEM),
+    ('magma', '2.7.1', '-CUDA-%(cudaver)s'),
+    ('NCCL', '2.18.3', '-CUDA-%(cudaver)s'),
+    ('Ninja', '1.11.1'),  # Required for JIT compilation of C++ extensions
+    ('Python', '3.10.8'),
+    ('protobuf', '23.0'),
+    ('protobuf-python', '4.23.0'),
+    ('pybind11', '2.10.3'),
+    ('SciPy-bundle', '2023.02'),
+    ('PyYAML', '6.0'),
+    ('MPFR', '4.2.0'),
+    ('GMP', '6.2.1'),
+    ('numactl', '2.0.16'),
+    ('FFmpeg', '5.1.2'),
+    ('Pillow', '9.4.0'),
+    ('expecttest', '0.1.3'),
+    ('networkx', '3.0'),
+    ('sympy', '1.12'),
+    ('Z3', '4.12.2', '-Python-%(pyver)s'),
 ]
 
 buildcmd = '%(python)s setup.py build'  # Run the (long) build in the build step
@@ -198,24 +203,43 @@ excluded_tests = {
     '': [
         # This test seems to take too long on NVIDIA Ampere at least.
         'distributed/test_distributed_spawn',
+        # Broken on CUDA 11.6/11.7: https://github.com/pytorch/pytorch/issues/75375
+        'distributions/test_constraints',
         # no xdoctest
         'doctests',
+        # failing on broadwell
+        # See https://github.com/easybuilders/easybuild-easyconfigs/issues/17712
+        'test_native_mha',
         # intermittent failures on various systems
         # See https://github.com/easybuilders/easybuild-easyconfigs/issues/17712
         'distributed/rpc/test_tensorpipe_agent',
-        # This test is expected to fail when run in their CI, but won't in our case.
-        # It just checks for a "CI" env variable
-        'test_ci_sanity_check_fail',
+        # Broken test, can't ever succeed, see https://github.com/pytorch/pytorch/issues/122184
+        'distributed/tensor/parallel/test_tp_random_state',
+        # failures on OmniPath systems, which don't support some optional InfiniBand features
+        # See https://github.com/pytorch/tensorpipe/issues/413
+        'distributed/pipeline/sync/skip/test_gpipe',
+        'distributed/pipeline/sync/skip/test_leak',
+        'distributed/pipeline/sync/test_bugs',
+        'distributed/pipeline/sync/test_inplace',
+        'distributed/pipeline/sync/test_pipe',
+        'distributed/pipeline/sync/test_transparency',
     ]
 }
 
-local_test_opts = '--continue-through-error --pipe-logs --verbose %(excluded_tests)s'
-runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py ' + local_test_opts
+runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py --continue-through-error  --verbose %(excluded_tests)s'
 
 # Especially test_quantization has a few corner cases that are triggered by the random input values,
 # those cannot be easily avoided, see https://github.com/pytorch/pytorch/issues/107030
+# test_nn is also prone to spurious failures: https://github.com/pytorch/pytorch/issues/118294
 # So allow a low number of tests to fail as the tests "usually" succeed
-max_failed_tests = 16
+max_failed_tests = 10
+
+# The readelf sanity check command can be taken out once the TestRPATH test from
+# https://github.com/pytorch/pytorch/pull/122318 is accepted, since it is then checked as part of the PyTorch test suite
+local_libcaffe2 = "$EBROOTPYTORCH/lib/python%%(pyshortver)s/site-packages/torch/lib/libcaffe2_nvrtc.%s" % SHLIB_EXT
+sanity_check_commands = [
+    "readelf -d %s | egrep 'RPATH|RUNPATH' | grep -v stubs" % local_libcaffe2,
+]
 
 tests = ['PyTorch-check-cpp-extension.py']
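
The sanity check command added in this diff pipes `readelf -d` output through two filters. A shell pipeline returns the exit status of its last command, and `grep -v` succeeds only if it prints at least one line, so the check passes only when some RPATH/RUNPATH line does not mention `stubs`. A small Python emulation of that filter logic, run on made-up `readelf` output (the sample lines and paths are illustrative, not taken from a real build):

```python
import re


def rpath_lines_without_stubs(readelf_output):
    """Emulate: egrep 'RPATH|RUNPATH' | grep -v stubs"""
    selected = [
        line for line in readelf_output.splitlines()
        if re.search(r"RPATH|RUNPATH", line)
    ]
    return [line for line in selected if "stubs" not in line]


def sanity_check_passes(readelf_output):
    """The pipeline exits with 0 only if grep -v printed at least one line."""
    return bool(rpath_lines_without_stubs(readelf_output))


# Illustrative inputs only (hypothetical paths)
clean = " 0x000000000000001d (RUNPATH)  Library runpath: [/opt/cuda/lib64]\n"
with_stubs = " 0x000000000000001d (RUNPATH)  Library runpath: [/opt/cuda/lib64/stubs:/opt/cuda/lib64]\n"
no_rpath = " 0x0000000000000001 (NEEDED)  Shared library: [libc.so.6]\n"
```

Note that `readelf` prints all run path entries on one line, so a single `stubs` entry makes the whole line disappear and the check fail. A library without any RPATH/RUNPATH entry fails as well.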
 
Diff against PyTorch-2.3.0-foss-2023b.eb

easybuild/easyconfigs/p/PyTorch/PyTorch-2.3.0-foss-2023b.eb

diff --git a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.3.0-foss-2023b.eb b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
index 2b47bc81b4..d8b2968689 100644
--- a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.3.0-foss-2023b.eb
+++ b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
@@ -1,16 +1,18 @@
 name = 'PyTorch'
-version = '2.3.0'
+version = '2.1.2'
+versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://pytorch.org/'
 description = """Tensors and Dynamic neural networks in Python with strong GPU acceleration.
 PyTorch is a deep learning framework that puts Python first."""
 
-toolchain = {'name': 'foss', 'version': '2023b'}
+toolchain = {'name': 'foss', 'version': '2022b'}
 
 source_urls = [GITHUB_RELEASE]
 sources = ['%(namelower)s-v%(version)s.tar.gz']
 patches = [
     'PyTorch-1.7.0_disable-dev-shm-test.patch',
+    'PyTorch-1.11.1_skip-test_init_from_local_shards.patch',
     'PyTorch-1.12.1_add-hypothesis-suppression.patch',
     'PyTorch-1.12.1_fix-test_cpp_extensions_jit.patch',
     'PyTorch-1.12.1_fix-TestTorch.test_to.patch',
@@ -22,35 +24,53 @@ patches = [
     'PyTorch-1.13.1_skip-tests-without-fbgemm.patch',
     'PyTorch-2.0.1_avoid-test_quantization-failures.patch',
     'PyTorch-2.0.1_fix-skip-decorators.patch',
+    'PyTorch-2.0.1_fix-ub-in-inductor-codegen.patch',
     'PyTorch-2.0.1_fix-vsx-loadu.patch',
+    'PyTorch-2.0.1_no-cuda-stubs-rpath.patch',
     'PyTorch-2.0.1_skip-failing-gradtest.patch',
     'PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch',
     'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch',
+    'PyTorch-2.1.0_disable-gcc12-warning.patch',
+    'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch',
+    'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch',
+    'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch',
+    'PyTorch-2.1.0_fix-validationError-output-test.patch',
     'PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch',
     'PyTorch-2.1.0_increase-tolerance-functorch-test_vmapvjpvjp.patch',
+    'PyTorch-2.1.0_remove-sparse-csr-nnz-overflow-test.patch',
     'PyTorch-2.1.0_remove-test-requiring-online-access.patch',
     'PyTorch-2.1.0_skip-diff-test-on-ppc.patch',
     'PyTorch-2.1.0_skip-dynamo-test_predispatch.patch',
     'PyTorch-2.1.0_skip-test_jvp_linalg_det_singular.patch',
+    'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch',
+    'PyTorch-2.1.0_skip-test_wrap_bad.patch',
+    'PyTorch-2.1.2_add-cuda-skip-markers.patch',
+    'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch',
+    'PyTorch-2.1.2_fix-device-mesh-check.patch',
+    'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch',
+    'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch',
+    'PyTorch-2.1.2_fix-test_cuda-non-x86.patch',
+    'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch',
+    'PyTorch-2.1.2_fix-test_memory_profiler.patch',
+    'PyTorch-2.1.2_fix-test_parallelize_api.patch',
+    'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch',
+    'PyTorch-2.1.2_fix-vsx-vector-abs.patch',
+    'PyTorch-2.1.2_fix-vsx-vector-div.patch',
+    'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch',
+    'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch',
+    'PyTorch-2.1.2_relax-cuda-tolerances.patch',
+    'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch',
     'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch',
+    'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch',
+    'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch',
+    'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch',
     'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch',
-    'PyTorch-2.3.0_disable_test_linear_package_if_no_half_types_are_available.patch',
-    'PyTorch-2.3.0_disable_DataType_dependent_test_if_tensorboard_is_not_available.patch',
-    'PyTorch-2.3.0_fix-cpuinfo-bug-with-smt.patch',
-    'PyTorch-2.3.0_skip-test_init_from_local_shards.patch',
-    'PyTorch-2.3.0_no-cuda-stubs-rpath.patch',
-    'PyTorch-2.3.0_disable-gcc12-warning.patch',
-    'PyTorch-2.3.0_fix-test_extension_backend-without-vectorization.patch',
-    'PyTorch-2.3.0_disable_tests_which_need_network_download.patch',
-    'PyTorch-2.3.0_avoid_caffe2_test_cpp_jit.patch',
-    'PyTorch-2.3.0_fix_missing_masked_load_for_int_type.patch',
-    'PyTorch-2.3.0_skip_test_var_mean_differentiable.patch',
-    'PyTorch-2.3.0_skip_test_sdpa_nn_functional_scaled_dot_product_attention_cpu.patch',
-    'PyTorch-2.3.0_fix-mkldnn-avx512-f32-bias.patch',
 ]
 checksums = [
-    {'pytorch-v2.3.0.tar.gz': '69579513b26261bbab32e13b7efc99ad287fcf3103087f2d4fdf1adacd25316f'},
+    {'pytorch-v2.1.2.tar.gz': '85effbcce037bffa290aea775c9a4bad5f769cb229583450c40055501ee1acd7'},
     {'PyTorch-1.7.0_disable-dev-shm-test.patch': '622cb1eaeadc06e13128a862d9946bcc1f1edd3d02b259c56a9aecc4d5406b8a'},
+    {'PyTorch-1.11.1_skip-test_init_from_local_shards.patch':
+     '4aeb1b0bc863d4801b0095cbce69f8794066748f0df27c6aaaf729c5ecba04b7'},
     {'PyTorch-1.12.1_add-hypothesis-suppression.patch':
      'e71ffb94ebe69f580fa70e0de84017058325fdff944866d6bd03463626edc32c'},
     {'PyTorch-1.12.1_fix-test_cpp_extensions_jit.patch':
@@ -70,16 +90,30 @@ checksums = [
     {'PyTorch-2.0.1_avoid-test_quantization-failures.patch':
      '02e3f47e4ed1d7d6077e26f1ae50073dc2b20426269930b505f4aefe5d2f33cd'},
     {'PyTorch-2.0.1_fix-skip-decorators.patch': '2039012cef45446065e1a2097839fe20bb29fe3c1dcc926c3695ebf29832e920'},
+    {'PyTorch-2.0.1_fix-ub-in-inductor-codegen.patch':
+     '1b37194f55ae678f3657b8728dfb896c18ffe8babe90987ce468c4fa9274f357'},
     {'PyTorch-2.0.1_fix-vsx-loadu.patch': 'a0ffa61da2d47c6acd09aaf6d4791e527d8919a6f4f1aa7ed38454cdcadb1f72'},
+    {'PyTorch-2.0.1_no-cuda-stubs-rpath.patch': '8902e58a762240f24cdbf0182e99ccdfc2a93492869352fcb4ca0ec7e407f83a'},
     {'PyTorch-2.0.1_skip-failing-gradtest.patch': '8030bdec6ba49b057ab232d19a7f1a5e542e47e2ec340653a246ec9ed59f8bc1'},
     {'PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch':
      '7047862abc1abaff62954da59700f36d4f39fcf83167a638183b1b7f8fec78ae'},
     {'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch':
      '166c134573a95230e39b9ea09ece3ad8072f39d370c9a88fb2a1e24f6aaac2b5'},
+    {'PyTorch-2.1.0_disable-gcc12-warning.patch': 'c858b8db0010f41005dc06f9a50768d0d3dc2d2d499ccbdd5faf8a518869a421'},
+    {'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch':
+     'd895018ebdfd46e65d9f7645444a3b4c5bbfe3d533a08db559a04be34e01e478'},
+    {'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch':
+     'b15b1291a3c37bf6a4982cfbb3483f693acb46a67bc0912b383fd98baf540ccf'},
+    {'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch':
+     '84bb51a719abc677031a7a3dfe4382ff098b0cbd8b39b8bed2a7fa03f80ac1e9'},
+    {'PyTorch-2.1.0_fix-validationError-output-test.patch':
+     '7eba0942afb121ed92fac30d1529447d892a89eb3d53c565f8e9d480e95f692b'},
     {'PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch':
      '3793b4b878be1abe7791efcbd534774b87862cfe7dc4774ca8729b6cabb39e7e'},
     {'PyTorch-2.1.0_increase-tolerance-functorch-test_vmapvjpvjp.patch':
      'aef38adf1210d0c5455e91d7c7a9d9e5caad3ae568301e0ba9fc204309438e7b'},
+    {'PyTorch-2.1.0_remove-sparse-csr-nnz-overflow-test.patch':
+     '0ac36411e76506b3354c85a8a1260987f66af947ee52ffc64230aee1fa02ea8b'},
     {'PyTorch-2.1.0_remove-test-requiring-online-access.patch':
      '35184b8c5a1b10f79e511cc25db3b8a5585a5d58b5d1aa25dd3d250200b14fd7'},
     {'PyTorch-2.1.0_skip-diff-test-on-ppc.patch': '394157dbe565ffcbc1821cd63d05930957412156cc01e949ef3d3524176a1dda'},
@@ -87,70 +121,80 @@ checksums = [
      '6298daf9ddaa8542850eee9ea005f28594ab65b1f87af43d8aeca1579a8c4354'},
     {'PyTorch-2.1.0_skip-test_jvp_linalg_det_singular.patch':
      '5229ca88a71db7667a90ddc0b809b2c817698bd6e9c5aaabd73d3173cf9b99fe'},
+    {'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch':
+     '5dcc79883b6e3ec0a281a8e110db5e0a5880de843bb05653589891f16473ead5'},
+    {'PyTorch-2.1.0_skip-test_wrap_bad.patch': 'b8583125ee94e553b6f77c4ab4bfa812b89416175dc7e9b7390919f3b485cb63'},
+    {'PyTorch-2.1.2_add-cuda-skip-markers.patch': 'd007d6d0cdb533e7d01f503e9055218760123a67c1841c57585385144be18c9a'},
+    {'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch':
+     'c164357efa4ce88095376e590ba508fc1daa87161e1e59544eda56daac7f2847'},
+    {'PyTorch-2.1.2_fix-device-mesh-check.patch': 'c0efc288bf3d9a9a3c8bbd2691348a589a2677ea43880a8c987db91c8de4806b'},
+    {'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch':
+     'f583532c59f35f36998851957d501b3ac8c883884efd61bbaa308db55cb6bdcd'},
+    {'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch':
+     'f7adafb4e4d3b724b93237a259797b6ed6f535f83be0e34a7b759c71c6a8ddf2'},
+    {'PyTorch-2.1.2_fix-test_cuda-non-x86.patch': '1ed76fcc87e6c50606ac286487292a3d534707068c94af74c3a5de8153fa2c2c'},
+    {'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch':
+     'cd1455495886a7d6b2d30d48736eb0103fded21e2e36de6baac719b9c52a1c92'},
+    {'PyTorch-2.1.2_fix-test_memory_profiler.patch':
+     '30b0c9355636c0ab3dedae02399789053825dc3835b4d7dac6e696767772b1ce'},
+    {'PyTorch-2.1.2_fix-test_parallelize_api.patch':
+     'f8387a1693af344099c806981ca38df1306d7f4847d7d44713306338384b1cfd'},
+    {'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch':
+     'a0ef99192ee2ad1509c78a8377023d5be2b5fddb16f84063b7c9a0b53d979090'},
+    {'PyTorch-2.1.2_fix-vsx-vector-abs.patch': 'd67d32407faed7dc1dbab4bba0e2f7de36c3db04560ced35c94caf8d84ade886'},
+    {'PyTorch-2.1.2_fix-vsx-vector-div.patch': '11f497a6892eb49b249a15320e4218e0d7ac8ae4ce67de39e4a018a064ca1acc'},
+    {'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch':
+     '90bd001e034095329277d70c6facc4026b4ce6d7f8b8d6aa81c0176eeb462eb1'},
+    {'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch':
+     '07a5e4233d02fb6348872838f4d69573c777899c6f0ea4e39ae23c08660d41e5'},
+    {'PyTorch-2.1.2_relax-cuda-tolerances.patch': '554ad09787f61080fafdb84216e711e32327aa357e2a9c40bb428eb6503dee6e'},
+    {'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch':
+     'e6a1efe3d127fcbf4723476a7a1c01cfcf2ccb16d1fb250f478192623e8b6a15'},
     {'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch':
      '7ace835af60c58d9e0754a34c19d4b9a0c3a531f19e5d0eba8e2e49206eaa7eb'},
+    {'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch':
+     '6cf711bf26518550903b09ed4431de9319791e79d61aab065785d6608fd5cc88'},
+    {'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch':
+     '943ee92f5fd518f608a59e43fe426b9bb45d7e7ad0ba04639e516db2d61fa57d'},
+    {'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch':
+     '7f5befddcb006b6ab5377de6ee3c29df375c5f8ef5e42b998d35113585b983f3'},
     {'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch':
      'fb96eefabf394617bbb3fbd3a7a7c1aa5991b3836edc2e5d2a30e708bfe49ba1'},
-    {'PyTorch-2.3.0_disable_test_linear_package_if_no_half_types_are_available.patch':
-     '23416f2d9d5226695ec3fbea0671e3650c655c19deefd3f0f8ddab5afa50f485'},
-    {'PyTorch-2.3.0_disable_DataType_dependent_test_if_tensorboard_is_not_available.patch':
-     '0dcbdfde6752c3ff54c5376f521b4a742167669feb7f0f1d4e1d4d55f72b664f'},
-    {'PyTorch-2.3.0_fix-cpuinfo-bug-with-smt.patch':
-     '29fb95d1dba070133b513de050febd328ed36905a73f1ca135dc633f16beafa4'},
-    {'PyTorch-2.3.0_skip-test_init_from_local_shards.patch':
-     '90ed9c2870f57ee6dc032d00873a37e2217a2b92a13035ded1c25ad5306455f2'},
-    {'PyTorch-2.3.0_no-cuda-stubs-rpath.patch':
-     '7ba26824b5def7379cff02ae821a080698e6affea0da45bc846e9ecb89939cb1'},
-    {'PyTorch-2.3.0_disable-gcc12-warning.patch':
-     'a8a624e1a2a5f4c82610173e50bd0f853e49bd5621b432f5aac689f9f6eb1514'},
-    {'PyTorch-2.3.0_fix-test_extension_backend-without-vectorization.patch':
-     '36aa2d5ba175be17f4e996f4fb2d544fe477d4a0bd0644cd59a85063779afc8e'},
-    {'PyTorch-2.3.0_disable_tests_which_need_network_download.patch':
-     'b7fd1a5135dfd4098cdc054182f7bf84a23ac98462a00477712182b5442da855'},
-    {'PyTorch-2.3.0_avoid_caffe2_test_cpp_jit.patch':
-     '041adcd91d994b8c2ab57d227f081cd57e572c157117b37171e1eb8eb576f8fc'},
-    {'PyTorch-2.3.0_fix_missing_masked_load_for_int_type.patch':
-     'aa6ff764f3f7bf84372a8a257fe1b4ae6dc4b9744ad35f0f9015f2696c62a41e'},
-    {'PyTorch-2.3.0_skip_test_var_mean_differentiable.patch':
-     '9703fd0f1fca8916f6d79d83e9a7efe8e3f717362a5fdaa8f5d9da90d0c75018'},
-    {'PyTorch-2.3.0_skip_test_sdpa_nn_functional_scaled_dot_product_attention_cpu.patch':
-     '7955f2655db3da18606574fdcbc5990be24098f49ad1db5e86ea756ea1cc506f'},
-    {'PyTorch-2.3.0_fix-mkldnn-avx512-f32-bias.patch':
-     'ee07d21c3ac7aeb0bd0e39507b18a417b9125284a529102929c4b5c6727c2976'},
 ]
 
 osdependencies = [OS_PKG_IBVERBS_DEV]
 
 builddependencies = [
-    ('CMake', '3.27.6'),
-    ('hypothesis', '6.90.0'),
+    ('CMake', '3.24.3'),
+    ('hypothesis', '6.68.2'),
     # For tests
     ('pytest-flakefinder', '1.1.0'),
-    ('pytest-rerunfailures', '14.0'),
+    ('pytest-rerunfailures', '12.0'),
     ('pytest-shard', '0.1.2'),
-    ('tlparse', '0.3.5'),
-    ('optree', '0.13.0'),
     ('unittest-xml-reporting', '3.1.0'),
 ]
 
 dependencies = [
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('cuDNN', '8.8.0.121', '-CUDA-%(cudaver)s', SYSTEM),
+    ('magma', '2.7.1', '-CUDA-%(cudaver)s'),
+    ('NCCL', '2.18.3', '-CUDA-%(cudaver)s'),
     ('Ninja', '1.11.1'),  # Required for JIT compilation of C++ extensions
-    ('Python', '3.11.5'),
-    ('Python-bundle-PyPI', '2023.10'),
-    ('protobuf', '25.3'),
-    ('protobuf-python', '4.25.3'),
-    ('pybind11', '2.11.1'),
-    ('SciPy-bundle', '2023.11'),
-    ('PyYAML', '6.0.1'),
-    ('MPFR', '4.2.1'),
-    ('GMP', '6.3.0'),
+    ('Python', '3.10.8'),
+    ('protobuf', '23.0'),
+    ('protobuf-python', '4.23.0'),
+    ('pybind11', '2.10.3'),
+    ('SciPy-bundle', '2023.02'),
+    ('PyYAML', '6.0'),
+    ('MPFR', '4.2.0'),
+    ('GMP', '6.2.1'),
     ('numactl', '2.0.16'),
-    ('FFmpeg', '6.0'),
-    ('Pillow', '10.2.0'),
-    ('expecttest', '0.2.1'),
-    ('networkx', '3.2.1'),
+    ('FFmpeg', '5.1.2'),
+    ('Pillow', '9.4.0'),
+    ('expecttest', '0.1.3'),
+    ('networkx', '3.0'),
     ('sympy', '1.12'),
-    ('Z3', '4.13.0',),
+    ('Z3', '4.12.2', '-Python-%(pyver)s'),
 ]
 
 buildcmd = '%(python)s setup.py build'  # Run the (long) build in the build step
@@ -169,19 +213,33 @@ excluded_tests = {
         # intermittent failures on various systems
         # See https://github.com/easybuilders/easybuild-easyconfigs/issues/17712
         'distributed/rpc/test_tensorpipe_agent',
-        # This test is expected to fail when run in their CI, but won't in our case.
-        # It just checks for a "CI" env variable
-        'test_ci_sanity_check_fail',
-        # This fails consistently and is disabled upstream
-        # See https://github.com/pytorch/pytorch/issues/100152 and
-        # https://github.com/pytorch/pytorch/pull/124712
-        'test_cpp_extensions_open_device_registration',
-
+        # Broken test, can't ever succeed, see https://github.com/pytorch/pytorch/issues/122184
+        'distributed/tensor/parallel/test_tp_random_state',
+        # failures on OmniPath systems, which don't support some optional InfiniBand features
+        # See https://github.com/pytorch/tensorpipe/issues/413
+        'distributed/pipeline/sync/skip/test_gpipe',
+        'distributed/pipeline/sync/skip/test_leak',
+        'distributed/pipeline/sync/test_bugs',
+        'distributed/pipeline/sync/test_inplace',
+        'distributed/pipeline/sync/test_pipe',
+        'distributed/pipeline/sync/test_transparency',
     ]
 }
 
-local_test_opts = '--continue-through-error --pipe-logs --verbose %(excluded_tests)s'
-runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py ' + local_test_opts
+runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py --continue-through-error  --verbose %(excluded_tests)s'
+
+# Especially test_quantization has a few corner cases that are triggered by the random input values,
+# those cannot be easily avoided, see https://github.com/pytorch/pytorch/issues/107030
+# test_nn is also prone to spurious failures: https://github.com/pytorch/pytorch/issues/118294
+# So allow a low number of tests to fail as the tests "usually" succeed
+max_failed_tests = 10
+
+# The readelf sanity check command can be taken out once the TestRPATH test from
+# https://github.com/pytorch/pytorch/pull/122318 is accepted, since it is then checked as part of the PyTorch test suite
+local_libcaffe2 = "$EBROOTPYTORCH/lib/python%%(pyshortver)s/site-packages/torch/lib/libcaffe2_nvrtc.%s" % SHLIB_EXT
+sanity_check_commands = [
+    "readelf -d %s | egrep 'RPATH|RUNPATH' | grep -v stubs" % local_libcaffe2,
+]
 
 tests = ['PyTorch-check-cpp-extension.py']
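
The `local_libcaffe2` line in the hunk above relies on two rounds of string formatting: the doubled `%%` survives the immediate `% SHLIB_EXT` substitution as a literal `%(pyshortver)s`, which EasyBuild resolves later as a template. A minimal sketch in plain Python, assuming `SHLIB_EXT` is `'so'` and a `pyshortver` of `'3.10'` (both values are chosen for illustration, and the second step is emulated with ordinary `%` formatting rather than EasyBuild's own template code):

```python
# Illustrative stand-ins; in an easyconfig these are provided by EasyBuild.
SHLIB_EXT = "so"                          # assumed value (Linux)
template_values = {"pyshortver": "3.10"}  # assumed, in line with the Python 3.10.8 dependency

# Stage 1: evaluated when the easyconfig is parsed. '%%' collapses to '%',
# so '%(pyshortver)s' is left in place for the later template step.
local_libcaffe2 = (
    "$EBROOTPYTORCH/lib/python%%(pyshortver)s/site-packages/torch/lib/libcaffe2_nvrtc.%s"
    % SHLIB_EXT
)

# Stage 2: emulates the later template resolution with ordinary % formatting.
resolved = local_libcaffe2 % template_values

print(local_libcaffe2)
print(resolved)
```

With a single `%` in front of `(pyshortver)s`, stage 1 would raise a `TypeError` instead, because a mapping key in the format string requires a dict argument rather than the plain `SHLIB_EXT` string.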
 
Diff against PyTorch-2.1.2-foss-2023b.eb

easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023b.eb

diff --git a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023b.eb b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
index 2206da7c2f..d8b2968689 100644
--- a/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2023b.eb
+++ b/easybuild/easyconfigs/p/PyTorch/PyTorch-2.1.2-foss-2022b-CUDA-12.0.0.eb
@@ -1,11 +1,12 @@
 name = 'PyTorch'
 version = '2.1.2'
+versionsuffix = '-CUDA-%(cudaver)s'
 
 homepage = 'https://pytorch.org/'
 description = """Tensors and Dynamic neural networks in Python with strong GPU acceleration.
 PyTorch is a deep learning framework that puts Python first."""
 
-toolchain = {'name': 'foss', 'version': '2023b'}
+toolchain = {'name': 'foss', 'version': '2022b'}
 
 source_urls = [GITHUB_RELEASE]
 sources = ['%(namelower)s-v%(version)s.tar.gz']
@@ -30,6 +31,7 @@ patches = [
     'PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch',
     'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch',
     'PyTorch-2.1.0_disable-gcc12-warning.patch',
+    'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch',
     'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch',
     'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch',
     'PyTorch-2.1.0_fix-validationError-output-test.patch',
@@ -42,13 +44,26 @@ patches = [
     'PyTorch-2.1.0_skip-test_jvp_linalg_det_singular.patch',
     'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch',
     'PyTorch-2.1.0_skip-test_wrap_bad.patch',
+    'PyTorch-2.1.2_add-cuda-skip-markers.patch',
+    'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch',
+    'PyTorch-2.1.2_fix-device-mesh-check.patch',
+    'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch',
+    'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch',
+    'PyTorch-2.1.2_fix-test_cuda-non-x86.patch',
     'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch',
     'PyTorch-2.1.2_fix-test_memory_profiler.patch',
+    'PyTorch-2.1.2_fix-test_parallelize_api.patch',
     'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch',
     'PyTorch-2.1.2_fix-vsx-vector-abs.patch',
     'PyTorch-2.1.2_fix-vsx-vector-div.patch',
+    'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch',
+    'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch',
+    'PyTorch-2.1.2_relax-cuda-tolerances.patch',
+    'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch',
     'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch',
-    'PyTorch-2.1.2_skip-memory-leak-test.patch',
+    'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch',
+    'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch',
+    'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch',
     'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch',
 ]
 checksums = [
@@ -85,6 +100,8 @@ checksums = [
     {'PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch':
      '166c134573a95230e39b9ea09ece3ad8072f39d370c9a88fb2a1e24f6aaac2b5'},
     {'PyTorch-2.1.0_disable-gcc12-warning.patch': 'c858b8db0010f41005dc06f9a50768d0d3dc2d2d499ccbdd5faf8a518869a421'},
+    {'PyTorch-2.1.0_disable-cudnn-tf32-for-too-strict-tests.patch':
+     'd895018ebdfd46e65d9f7645444a3b4c5bbfe3d533a08db559a04be34e01e478'},
     {'PyTorch-2.1.0_fix-bufferoverflow-in-oneDNN.patch':
      'b15b1291a3c37bf6a4982cfbb3483f693acb46a67bc0912b383fd98baf540ccf'},
     {'PyTorch-2.1.0_fix-test_numpy_torch_operators.patch':
@@ -107,17 +124,40 @@ checksums = [
     {'PyTorch-2.1.0_skip-test_linear_fp32-without-MKL.patch':
      '5dcc79883b6e3ec0a281a8e110db5e0a5880de843bb05653589891f16473ead5'},
     {'PyTorch-2.1.0_skip-test_wrap_bad.patch': 'b8583125ee94e553b6f77c4ab4bfa812b89416175dc7e9b7390919f3b485cb63'},
+    {'PyTorch-2.1.2_add-cuda-skip-markers.patch': 'd007d6d0cdb533e7d01f503e9055218760123a67c1841c57585385144be18c9a'},
+    {'PyTorch-2.1.2_fix-conj-mismatch-test-failures.patch':
+     'c164357efa4ce88095376e590ba508fc1daa87161e1e59544eda56daac7f2847'},
+    {'PyTorch-2.1.2_fix-device-mesh-check.patch': 'c0efc288bf3d9a9a3c8bbd2691348a589a2677ea43880a8c987db91c8de4806b'},
+    {'PyTorch-2.1.2_fix-fsdp-tp-integration-test.patch':
+     'f583532c59f35f36998851957d501b3ac8c883884efd61bbaa308db55cb6bdcd'},
+    {'PyTorch-2.1.2_fix-locale-issue-in-nvrtcCompileProgram.patch':
+     'f7adafb4e4d3b724b93237a259797b6ed6f535f83be0e34a7b759c71c6a8ddf2'},
+    {'PyTorch-2.1.2_fix-test_cuda-non-x86.patch': '1ed76fcc87e6c50606ac286487292a3d534707068c94af74c3a5de8153fa2c2c'},
     {'PyTorch-2.1.2_fix-test_extension_backend-without-vectorization.patch':
      'cd1455495886a7d6b2d30d48736eb0103fded21e2e36de6baac719b9c52a1c92'},
     {'PyTorch-2.1.2_fix-test_memory_profiler.patch':
      '30b0c9355636c0ab3dedae02399789053825dc3835b4d7dac6e696767772b1ce'},
+    {'PyTorch-2.1.2_fix-test_parallelize_api.patch':
+     'f8387a1693af344099c806981ca38df1306d7f4847d7d44713306338384b1cfd'},
     {'PyTorch-2.1.2_fix-test_torchinductor-rounding.patch':
      'a0ef99192ee2ad1509c78a8377023d5be2b5fddb16f84063b7c9a0b53d979090'},
     {'PyTorch-2.1.2_fix-vsx-vector-abs.patch': 'd67d32407faed7dc1dbab4bba0e2f7de36c3db04560ced35c94caf8d84ade886'},
     {'PyTorch-2.1.2_fix-vsx-vector-div.patch': '11f497a6892eb49b249a15320e4218e0d7ac8ae4ce67de39e4a018a064ca1acc'},
+    {'PyTorch-2.1.2_fix-with_temp_dir-decorator.patch':
+     '90bd001e034095329277d70c6facc4026b4ce6d7f8b8d6aa81c0176eeb462eb1'},
+    {'PyTorch-2.1.2_fix-wrong-device-mesh-size-in-tests.patch':
+     '07a5e4233d02fb6348872838f4d69573c777899c6f0ea4e39ae23c08660d41e5'},
+    {'PyTorch-2.1.2_relax-cuda-tolerances.patch': '554ad09787f61080fafdb84216e711e32327aa357e2a9c40bb428eb6503dee6e'},
+    {'PyTorch-2.1.2_remove-nccl-backend-default-without-gpus.patch':
+     'e6a1efe3d127fcbf4723476a7a1c01cfcf2ccb16d1fb250f478192623e8b6a15'},
     {'PyTorch-2.1.2_skip-cpu_repro-test-without-vectorization.patch':
      '7ace835af60c58d9e0754a34c19d4b9a0c3a531f19e5d0eba8e2e49206eaa7eb'},
-    {'PyTorch-2.1.2_skip-memory-leak-test.patch': '8d9841208e8a00a498295018aead380c360cf56e500ef23ca740adb5b36de142'},
+    {'PyTorch-2.1.2_skip-failing-test_dtensor_ops-subtests.patch':
+     '6cf711bf26518550903b09ed4431de9319791e79d61aab065785d6608fd5cc88'},
+    {'PyTorch-2.1.2_skip-test_fsdp_tp_checkpoint_integration.patch':
+     '943ee92f5fd518f608a59e43fe426b9bb45d7e7ad0ba04639e516db2d61fa57d'},
+    {'PyTorch-2.1.2_skip-xfailing-test_dtensor_ops.patch':
+     '7f5befddcb006b6ab5377de6ee3c29df375c5f8ef5e42b998d35113585b983f3'},
     {'PyTorch-2.1.2_workaround_dynamo_failure_without_nnpack.patch':
      'fb96eefabf394617bbb3fbd3a7a7c1aa5991b3836edc2e5d2a30e708bfe49ba1'},
 ]
@@ -125,32 +165,36 @@ checksums = [
 osdependencies = [OS_PKG_IBVERBS_DEV]
 
 builddependencies = [
-    ('CMake', '3.27.6'),
-    ('hypothesis', '6.90.0'),
+    ('CMake', '3.24.3'),
+    ('hypothesis', '6.68.2'),
     # For tests
     ('pytest-flakefinder', '1.1.0'),
-    ('pytest-rerunfailures', '14.0'),
+    ('pytest-rerunfailures', '12.0'),
     ('pytest-shard', '0.1.2'),
+    ('unittest-xml-reporting', '3.1.0'),
 ]
 
 dependencies = [
+    ('CUDA', '12.0.0', '', SYSTEM),
+    ('cuDNN', '8.8.0.121', '-CUDA-%(cudaver)s', SYSTEM),
+    ('magma', '2.7.1', '-CUDA-%(cudaver)s'),
+    ('NCCL', '2.18.3', '-CUDA-%(cudaver)s'),
     ('Ninja', '1.11.1'),  # Required for JIT compilation of C++ extensions
-    ('Python', '3.11.5'),
-    ('Python-bundle-PyPI', '2023.10'),
-    ('protobuf', '25.3'),
-    ('protobuf-python', '4.25.3'),
-    ('pybind11', '2.11.1'),
-    ('SciPy-bundle', '2023.11'),
-    ('PyYAML', '6.0.1'),
-    ('MPFR', '4.2.1'),
-    ('GMP', '6.3.0'),
+    ('Python', '3.10.8'),
+    ('protobuf', '23.0'),
+    ('protobuf-python', '4.23.0'),
+    ('pybind11', '2.10.3'),
+    ('SciPy-bundle', '2023.02'),
+    ('PyYAML', '6.0'),
+    ('MPFR', '4.2.0'),
+    ('GMP', '6.2.1'),
     ('numactl', '2.0.16'),
-    ('FFmpeg', '6.0'),
-    ('Pillow', '10.2.0'),
-    ('expecttest', '0.2.1'),
-    ('networkx', '3.2.1'),
+    ('FFmpeg', '5.1.2'),
+    ('Pillow', '9.4.0'),
+    ('expecttest', '0.1.3'),
+    ('networkx', '3.0'),
     ('sympy', '1.12'),
-    ('Z3', '4.13.0',),
+    ('Z3', '4.12.2', '-Python-%(pyver)s'),
 ]
 
 buildcmd = '%(python)s setup.py build'  # Run the (long) build in the build step
@@ -169,11 +213,34 @@ excluded_tests = {
         # intermittent failures on various systems
         # See https://github.com/easybuilders/easybuild-easyconfigs/issues/17712
         'distributed/rpc/test_tensorpipe_agent',
+        # Broken test, can't ever succeed, see https://github.com/pytorch/pytorch/issues/122184
+        'distributed/tensor/parallel/test_tp_random_state',
+        # failures on OmniPath systems, which don't support some optional InfiniBand features
+        # See https://github.com/pytorch/tensorpipe/issues/413
+        'distributed/pipeline/sync/skip/test_gpipe',
+        'distributed/pipeline/sync/skip/test_leak',
+        'distributed/pipeline/sync/test_bugs',
+        'distributed/pipeline/sync/test_inplace',
+        'distributed/pipeline/sync/test_pipe',
+        'distributed/pipeline/sync/test_transparency',
     ]
 }
 
 runtest = 'cd test && PYTHONUNBUFFERED=1 %(python)s run_test.py --continue-through-error  --verbose %(excluded_tests)s'
 
+# Especially test_quantization has a few corner cases that are triggered by the random input values,
+# those cannot be easily avoided, see https://github.com/pytorch/pytorch/issues/107030
+# test_nn is also prone to spurious failures: https://github.com/pytorch/pytorch/issues/118294
+# So allow a low number of tests to fail as the tests "usually" succeed
+max_failed_tests = 10
+
+# The readelf sanity check command can be taken out once the TestRPATH test from
+# https://github.com/pytorch/pytorch/pull/122318 is accepted, since it is then checked as part of the PyTorch test suite
+local_libcaffe2 = "$EBROOTPYTORCH/lib/python%%(pyshortver)s/site-packages/torch/lib/libcaffe2_nvrtc.%s" % SHLIB_EXT
+sanity_check_commands = [
+    "readelf -d %s | egrep 'RPATH|RUNPATH' | grep -v stubs" % local_libcaffe2,
+]
+
 tests = ['PyTorch-check-cpp-extension.py']
 
 moduleclass = 'ai'
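
The `sanity_check_commands` entry in the diff above is a shell pipeline. Assuming the shell does not run with `pipefail`, its exit status comes from the final `grep -v stubs`, so it succeeds only if `readelf -d` prints at least one `RPATH` or `RUNPATH` line that does not mention `stubs`. A rough Python equivalent of that filter, run on made-up `readelf` output (the sample lines and the function name `rpath_check_passes` are illustrative and not part of the PR):

```python
import re


def rpath_check_passes(readelf_output: str) -> bool:
    """Mimic `egrep 'RPATH|RUNPATH' | grep -v stubs`; True corresponds to exit status 0."""
    # egrep step: keep only dynamic-section lines that mention RPATH or RUNPATH.
    rpath_lines = [line for line in readelf_output.splitlines()
                   if re.search(r"RPATH|RUNPATH", line)]
    # grep -v step: drop every line that contains 'stubs'.
    kept = [line for line in rpath_lines if "stubs" not in line]
    # grep exits with 0 only if it printed at least one line.
    return bool(kept)


# Made-up sample lines in the style of `readelf -d` output.
clean = " 0x000000000000001d (RUNPATH)  Library runpath: [$ORIGIN:/opt/cuda/lib64]"
leaky = " 0x000000000000001d (RUNPATH)  Library runpath: [$ORIGIN:/opt/cuda/lib64/stubs]"

print(rpath_check_passes(clean))
print(rpath_check_passes(leaky))
print(rpath_check_passes(""))
```

Because `readelf` normally prints all search paths on a single line, one `stubs` entry is enough to make the check fail. A library with no `RPATH` or `RUNPATH` entry at all fails it as well, since nothing reaches the final `grep`.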

@Flamefire Flamefire marked this pull request as draft November 28, 2024 14:31
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
ml30 - Linux AlmaLinux 8.7 (Stone Smilodon), POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 530.30.02, Python 3.8.13
See https://gist.github.com/Flamefire/674307e6a21da75203eea9819bec205c for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8034 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/0015043e032f9631948d9db5be864f2c for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8003 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/822b64b6fdcc8ee170fc9bbd65460c02 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8009 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/14efda5603d08a1e9297c1655bc09724 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8033 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/b45fcc90b20ab2756d0455cc6e22bc59 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8008 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/cfc58e06281f535f7bb4515c2ab35c88 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3633
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
i7016 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7702 64-Core Processor (zen2), Python 3.8.17
See https://gist.github.com/Flamefire/c40c0bc474eef2f4a6c9cd22dcb5b588 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3633
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i7092 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7702 64-Core Processor (zen2), Python 3.8.17
See https://gist.github.com/Flamefire/c982e20acdcedcf5aa2de79f27abb8be for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3633
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
i8002 - Linux Rocky Linux 8.9 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 555.42.06, Python 3.8.17
See https://gist.github.com/Flamefire/101ba7fe80afb4aa8eb5595b6c8566f7 for a full test report.

@Flamefire Flamefire force-pushed the 20240513165649_new_pr_PyTorch212 branch from f27d797 to a9a5a6b Compare March 24, 2025 14:51
@Thyre Thyre added the 2022b label Aug 18, 2025
@akesandgren
Contributor

Test report by @akesandgren
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
b-cn1611.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 555.58.02, Python 3.10.12
See https://gist.github.com/akesandgren/2c39ef6c1a9fdc052b6b4f680da25c99 for a full test report.

@Flamefire Flamefire marked this pull request as ready for review October 14, 2025 17:05
github_account = 'NVIDIA'
source_urls = [GITHUB_SOURCE]
sources = ['v%(version)s-1.tar.gz']
patches = ['NCCL-2.16.2_fix-cpuid.patch']
Contributor

Doesn't this one also need NCCL-2.18.3_fix-cudaMemcpyAsync.patch, like NCCL-2.18.3-GCCcore-12.3.0-CUDA-12.1.1.eb?

Contributor Author

Makes sense, I guess. Added.
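
Given the `patches` line quoted in the review context, the change presumably amounts to extending that list. A sketch of what the resulting easyconfig fragment might look like (the ordering is an assumption, and the matching `checksums` entries are left out because they are not shown in this thread):

```python
# Hypothetical fragment of the NCCL 2.18.3 easyconfig after the review comment was addressed.
patches = [
    'NCCL-2.16.2_fix-cpuid.patch',            # already present in the reviewed line
    'NCCL-2.18.3_fix-cudaMemcpyAsync.patch',  # the patch named by the reviewer
]
```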

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
b-cn1611.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 555.58.02, Python 3.10.12
See https://gist.github.com/akesandgren/fcb6591e3a94554c29aaead9d72c9c78 for a full test report.

Contributor

@akesandgren akesandgren left a comment

LGTM

@akesandgren
Contributor

Going in, thanks @Flamefire!

@akesandgren akesandgren merged commit cc35f16 into easybuilders:develop Oct 15, 2025
8 checks passed
@Flamefire Flamefire deleted the 20240513165649_new_pr_PyTorch212 branch October 15, 2025 12:34
@boegel boegel modified the milestones: 5.x, next release (5.2.0?) Oct 15, 2025