{ai}[foss/2024a] PyTorch v2.7.1 w/ CUDA 12.6.0 #23923

Flamefire · 2025-09-19T11:07:41Z

(created using eb --new-pr)

Requires:

I included the easyconfigs here for convenience

github-actions · 2025-09-19T11:08:30Z

Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use --review-pr (and --review-pr-filter / --review-pr-max) locally.

Flamefire · 2025-09-25T08:32:03Z

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
i8005 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/e092ebca5d265d11b91fd67a83f3af73 for a full test report.

Flamefire · 2025-09-28T22:00:12Z

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
c32 - Linux AlmaLinux 9.4, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 560.35.03, Python 3.9.18
See https://gist.github.com/Flamefire/f693e3f4804b88c790344400452a4cec for a full test report.

Flamefire · 2025-10-08T19:40:29Z

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
c144 - Linux AlmaLinux 9.4, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 560.35.03, Python 3.9.18
See https://gist.github.com/Flamefire/26d65e755c8535ecce98bd3fa964d59b for a full test report.

Flamefire · 2025-10-16T13:27:50Z

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
i8032 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/de74d19eb7a953944822b58781747c62 for a full test report.

Flamefire · 2025-10-17T12:59:28Z

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
c23 - Linux AlmaLinux 9.4, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 560.35.03, Python 3.9.18
See https://gist.github.com/Flamefire/ab3477dac032ccef529dab68f21959d7 for a full test report.

boegel · 2025-10-25T10:15:24Z

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
node4307.litleo.os - Linux RHEL 9.6, x86_64, AMD EPYC 9454P 48-Core Processor (zen4), 1 x NVIDIA NVIDIA H100 NVL, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/206999229fc7e00980ac347bc1e717fd for a full test report.

boegel · 2025-10-25T10:17:13Z

@Flamefire

Checksum verification for /tmp/eb-eoudzstd/files_pr23923/p/PyTorch/PyTorch-2.7.0_do-not-checkout-nccl.patch using {'PyTorch-2.7.0_do-not-checkout-nccl.patch':
'ad085a15dd36768ad33a934f53dc595da745e01697b44d431f8b70ae9d0eb567'} failed

boegel · 2025-10-25T10:41:37Z

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
node3308.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/978c2c4a948d5383453e81a82532ff57 for a full test report.

Flamefire · 2025-10-25T10:51:19Z

Seemingly changed by mistake. Fixed

boegel · 2025-10-27T03:52:39Z

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
node4307.litleo.os - Linux RHEL 9.6, x86_64, AMD EPYC 9454P 48-Core Processor (zen4), 1 x NVIDIA NVIDIA H100 NVL, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/e2a8a47106ef7afcf783242dc72cbe5b for a full test report.

boegel · 2025-10-27T05:47:40Z

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
node3308.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/1e769e4191aacbb019cb496146b909d6 for a full test report.

Flamefire · 2025-10-27T08:40:10Z

The H100 failures are mostly from inductor/test_cutlass_backend (39 failed, 1 passed, 2 skipped, 0 errors)
I expect most failures to be caused by "BytesWarning". That needs a rebuild of nvidia-cutlass with the added patch which I had only added to #23606 as I forgot I included the EC here.
But merged that now to this branch.

With the (now) default of 10 allowed failures that should be enough to pass

As for the V100: I already had more failures on A100 suggesting they don't test on "older" GPUs anymore... If you can attach the log of the test step I'll take a look at the failures

Flamefire · 2025-10-30T08:48:03Z

Test report by @Flamefire
~~FAILED~~
Build succeeded for 6 out of 7 (7 easyconfigs in total)
n1450.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21
See https://gist.github.com/Flamefire/bc6c0f8510f18f3f95f0a1eed3eb848d for a full test report.

SUCCESS on rerun but upload failed due to expired token:

== COMPLETED: Installation ended successfully (took 18 hours 38 mins 21 secs)
== Results of the build can be found in the log file(s) /software/PyTorch/2.7.1-foss-2024a-CUDA-12.6.0/easybuild/easybuild-PyTorch-2.7.1-20251107.054206.log

Flamefire · 2025-10-30T18:39:46Z

Test report by @Flamefire
SUCCESS
Build succeeded for 7 out of 7 (7 easyconfigs in total)
c92 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/42fc4314e957ad4b757f8fbd40d064dd for a full test report.

Flamefire · 2025-10-31T09:03:50Z

Test report by @Flamefire
SUCCESS
Build succeeded for 7 out of 7 (7 easyconfigs in total)
i8018 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/cb905230972fee8ccf548435f534eddc for a full test report.

boegel · 2025-12-09T10:22:54Z

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 6 out of 7 (total: 46 hours 5 mins 4 secs) (7 easyconfigs in total)
node3907.accelgor.os - Linux RHEL 9.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/3fa8bd784b0d43d7693a47a1abe0a206 for a full test report.

Flamefire · 2025-12-09T11:07:27Z

4 (of 8) failures are in test_cpu_select_algorithm and test_select_algorithm which I assume have the same cause. However the errors are not in the gist, so can't tell

Is it possibly this one?

OSError: [Errno 9] Bad file descriptor

Then I have a patch for that.

In any case: I remove the allowed failures = 6, which now uses the default of 10 which would make your run pass.

Flamefire · 2026-01-08T19:12:02Z

Issue is this:

WARNING Found 1 individual tests that exited with an error: pytest.internal

Can I see the full log?

boegel · 2026-01-08T20:20:52Z

Issue is this:

WARNING Found 1 individual tests that exited with an error: pytest.internal

Can I see the full log?

Unfortunately the log file wasn't retained...

I can trigger it again and make sure the log file is retained.

boegel · 2026-01-08T20:21:19Z

@boegelbot please test @ jsc-zen3-a100
CORE_CNT=16
EB_ARGS="--failed-install-logs-path=$HOME/pr23923-${SLURM_JOBID}"

boegelbot · 2026-01-08T20:30:09Z

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=23923 EB_ARGS="--failed-install-logs-path=$HOME/pr23923-${SLURM_JOBID}" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_23923 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

exit code: 0
output:

Submitted batch job 9354

Test results coming soon (I hope)...

Details

- notification for comment with ID 3725616778 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

Flamefire · 2026-01-09T07:50:47Z

--failed-install-logs-path is a great option in general, especially for PRs. Together with copying the failed build dirs for PyTorch it allows investigating which issue the parser found by retaining the test XML files.

If you still have access to the XML files (test_reports folder) I can check why it associates pytest.internal with inductor/test_torchinductor_opinfo

boegel · 2026-01-09T07:51:35Z

Test report by @boegel
FAILED
Build succeeded for 4 out of 5 (total: 39 hours 4 mins 12 secs) (5 easyconfigs in total)
node3303.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/bf837f92eff2a389efa4aded8a62a5b0 for a full test report.

== FAILED: Installation ended unsuccessfully: An error was raised during test step: Failing because not all failed tests could be determined. Tests failed to start, crashed or the test accounting in the PyTorch EasyBlock needs updating!
Missing: inductor/test_cpu_select_algorithm
You can check the test failures (in the log) manually and if they are harmless, use --ignore-test-failure to make the test step pass.
53 test failures, 0 test errors (out of 261882):
Failed tests (suites/files):
        distributed/_composable/fsdp/test_fully_shard_state_dict (1 failed, 1 passed, 4 skipped, 0 errors)
        distributed/test_store (1 failed, 32 passed, 0 skipped, 0 errors)
        dynamo/test_compiler_bisector (1 failed, 6 passed, 0 skipped, 0 errors)
        dynamo/test_error_messages (1 failed, 33 passed, 0 skipped, 0 errors)
        higher_order_ops/test_invoke_quant (1 failed, 13 passed, 0 skipped, 0 errors)
        higher_order_ops/test_invoke_subgraph (1 failed, 19 passed, 1 skipped, 0 errors)
        inductor/test_aot_inductor (2 failed, 313 passed, 102 skipped, 0 errors)
        inductor/test_benchmark_fusion (2 failed, 10 passed, 4 skipped, 0 errors)
        inductor/test_compile_subprocess (5 failed, 1345 passed, 152 skipped, 0 errors)
        inductor/test_cuda_repro (2 failed, 61 passed, 7 skipped, 0 errors)
        inductor/test_custom_lowering (1 failed, 3 passed, 0 skipped, 0 errors)
        inductor/test_inplace_padding (2 failed, 7 passed, 0 skipped, 0 errors)
        inductor/test_kernel_benchmark (3 failed, 15 passed, 0 skipped, 0 errors)
        inductor/test_max_autotune (3 failed, 70 passed, 29 skipped, 0 errors)
        inductor/test_online_softmax (17 failed, 13 passed, 0 skipped, 0 errors)
        inductor/test_pad_mm (2 failed, 16 passed, 0 skipped, 0 errors)
        inductor/test_scatter_optimization (1 failed, 7 passed, 0 skipped, 0 errors)
        inductor/test_torchinductor (4 failed, 1519 passed, 151 skipped, 0 errors)
        inductor/test_torchinductor_codegen_dynamic_shapes (1 failed, 1091 passed, 410 skipped, 0 errors)
        inductor/test_torchinductor_dynamic_shapes (2 failed, 1403 passed, 158 skipped, 0 errors)
Could not count failed tests for the following test suites/files:
        inductor/test_cpu_select_algorithm (Undetected or did not run properly) (took 38 hours 58 mins 16 secs)

Flamefire · 2026-01-09T08:32:48Z

Test report by @boegel FAILED Build succeeded for 4 out of 5 (total: 39 hours 4 mins 12 secs) (5 easyconfigs in total) node3303.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21 See https://gist.github.com/boegel/bf837f92eff2a389efa4aded8a62a5b0 for a full test report.

Mostly inductor/* failures, so likely a common cause. And inductor/test_cpu_select_algorithm is missing so might have crashed. Checking the full test log is required.

boegel · 2026-01-09T13:52:41Z

@Flamefire Found this in the log:

Finished inductor/test_cpu_select_algorithm 1/1 in 10 minutes
inductor/test_cpu_select_algorithm 1/1 failed!
Unable to import boto3. Will not be emitting metrics.... Reason: No module named 'boto3'

I'll share the whole log with you (via Slack)

Flamefire · 2026-01-09T15:48:29Z

Weird it looks like that failure was written to the XML file:

stepcurrent: skipping 0 already run items. Running only test/inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_aoti_bmm_unique_identifiers_cpu_float32
[...]
Got exit code 1
Test failed consistently, continuing with the rest of the tests due to continue-through-error being set
Test results will be stored in test-reports/python-pytest/inductor.test_cpu_select_algorithm/inductor.test_cpu_select_algorithm-8bb9267bd246f209.xml

But still:

Missing: inductor/test_cpu_select_algorithm

So it seems the parser didn't pick up the failure.

Almost all the rest seem to be specific to V100, and its non-support of BF16 (I reported it at pytorch/pytorch#172085):

distributed/test_store finds "localhost.localdomain" instead of "localhost". Some setting on the node?
distributed/_composable/fsdp/test_fully_shard_state_dict: Likely incorrect test for V100
dynamo/test_compiler_bisector - "Tesla V100-SXM2-32GB does not support bfloat16 compilation natively, skipping"
inductor/test_aot_inductor "BF16 is not supported" & "FlashAttention only supports Ampere GPUs or newer."
inductor/test_compile_subprocess "BF16 is not supported"
inductor/test_max_autotune "Unsupported conversion from f16 to f16" close to "LLVM ERROR: Unsupported rounding mode for conversion."
inductor/test_online_softmax "Tesla V100-SXM2-32GB does not support bfloat16 compilation natively, skipping"

If we can fix the missed inductor/test_cpu_select_algorithm we could just increase the allowed number of failed tests. Maybe exclude the full inductor/test_online_softmax which has the highest number of failures where 17 out of 30 tests fail.
I backported a patch from 2.9.1 which is simple enough to include here without side effects. This could fix that inductor/test_cpu_select_algorithm fails at all. As that fails with seemingly that issue I found and fixed there:

self._file.write(msg)
OSError: [Errno 9] Bad file descriptor

Would still be good to know why the parser didn't find it to avoid similar issues. Could be related though: The faulty close might cause the writing of the XML file fail

Patching the failing tests is also possible if this does indeed return false for V100s (could use a pip-installed pytorch to test): python -c 'import torch; from torch._dynamo.device_interface import get_interface_for_device; print(get_interface_for_device("cuda").is_dtype_supported(torch.bfloat16))'

Flamefire · 2026-01-10T17:07:03Z

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (total: 27 hours 28 mins 18 secs) (5 easyconfigs in total)
n1502.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21
See https://gist.github.com/Flamefire/57c1caf8d30e7259c3de520544fcc8fc for a full test report.

boegelbot · 2026-01-11T00:26:11Z

Test report by @boegelbot
FAILED
Build succeeded for 4 out of 5 (total: 51 hours 55 mins 26 secs) (5 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 590.44.01, Python 3.9.23
See https://gist.github.com/boegelbot/c3b2b6f37dc52f2085d1921421dd7f03 for a full test report.

Flamefire · 2026-01-11T02:11:59Z

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 5 (total: 37 hours 23 mins 28 secs) (5 easyconfigs in total)
c119 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/abd1429569497eec557584f5f8ab648d for a full test report.

Flamefire · 2026-01-12T16:20:17Z

Failing because there were unexpected failures detected: inductor/test_torchinductor_opinfo

Maybe we should make this a warning only

boegel · 2026-01-13T15:42:39Z

Failing because there were unexpected failures detected: inductor/test_torchinductor_opinfo

Maybe we should make this a warning only

Warning by default, but with a way to make it a hard error perhaps?

@Flamefire In any case, I don't think we need to block this PR any further, what do you think?

Flamefire · 2026-01-13T17:04:34Z

Warning by default, but with a way to make it a hard error perhaps?

An EC option allow_extra_failures with default True?
I'd keep the error if a suite shows up as failed but we didn't find how many of its test failed.

In my report the cause is a timeout after which the test process gets killed without writing an XML entry. However the test has "rerun" entries, so we could use that: If a test only shows up as "rerun" but not as "success" it is an error.
Still has an issue because reruns continue at this test and remaining tests are run only after it failed N times. So if the first test of a suite is rerun N-1 times before terminating the whole suite we won't have any data about the other tests in that suite.
So the safe option is to do as is done now :-/

@Flamefire In any case, I don't think we need to block this PR any further, what do you think?

Do we want to increase the allowed failures to allow your previous build to pass? Or let people see those errors for old-ish GPUs?

boegel · 2026-01-14T07:51:19Z

Warning by default, but with a way to make it a hard error perhaps?

An EC option allow_extra_failures with default True? I'd keep the error if a suite shows up as failed but we didn't find how many of its test failed.

In my report the cause is a timeout after which the test process gets killed without writing an XML entry. However the test has "rerun" entries, so we could use that: If a test only shows up as "rerun" but not as "success" it is an error. Still has an issue because reruns continue at this test and remaining tests are run only after it failed N times. So if the first test of a suite is rerun N-1 times before terminating the whole suite we won't have any data about the other tests in that suite. So the safe option is to do as is done now :-/

@Flamefire In any case, I don't think we need to block this PR any further, what do you think?

Do we want to increase the allowed failures to allow your previous build to pass? Or let people see those errors for old-ish GPUs?

@Flamefire I'm in favor of allowed some more failures by default, maybe even up to 100?
We can easily make that more strict on our end (in the bot, in our site) via a hook.

The issues about not finding the result of a test should be less fatal too, but that's work for the easyblock, so doesn't need to block this PR.

Flamefire · 2026-01-14T12:49:13Z

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (total: 52 hours 25 mins 56 secs) (5 easyconfigs in total)
i8026 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/48580e7a7fbd5daf1d19e5523670b194 for a full test report.

VRehnberg · 2026-01-19T10:49:21Z

My builds are currently on day 7+ of running PyTorch tests. Do you have any suggestions to make them run faster? Should I just always build PyTorch on a full node?

Flamefire · 2026-01-19T11:49:44Z

7 days is certainly too much. With 2.9.1 I identified an issue that caused an infinite hang. But that exact issue is not present in 2.7. Maybe check if any sub-process has been hanging for days or if the tests are just very slow on your machine.

I do indeed use a full node.

VRehnberg · 2026-01-19T12:49:01Z

Oh, now the first one has finished.

== ... (took 167 hours 8 mins 47 secs)
== FAILED: Installation ended unsuccessfully: An error was raised during test step: Failing because not all failed tests could be determined. Tests failed to start, crashed or the test accounting in the PyTorch EasyBlock needs updating!                                                         
Missing: test_optim
You can check the test failures (in the log) manually and if they are harmless, use --ignore-test-failure to make the test step pass.
5 test failures, 0 test errors (out of 259567):
Failed tests (suites/files):
        distributed/_composable/fsdp/test_fully_shard_compile (1 failed, 12 passed, 3 skipped, 0 errors)
        distributed/_composable/fsdp/test_fully_shard_state_dict (1 failed, 1 passed, 4 skipped, 0 errors)
        dynamo/test_error_messages (1 failed, 33 passed, 0 skipped, 0 errors)
        inductor/test_select_algorithm (2 failed, 17 passed, 0 skipped, 0 errors)
Could not count failed tests for the following test suites/files:
        test_optim (Undetected or did not run properly) (took 170 hours 15 mins 13 secs)
== Results of the build can be found in the log file(s) /dev/shm/eb-90gorcp9/easybuild-PyTorch-2.7.1-20260112.094124.tDBmG.log
== Summary:16
   * [FAILED]  PyTorch/2.7.1-foss-2024a-CUDA-12.6.0

ERROR: Installation of PyTorch-2.7.1-foss-2024a-CUDA-12.6.0.eb failed: An error was raised during test step: Failing because not all failed tests could be determined. Tests failed to start, crashed or the test accounting in the PyTorch EasyBlock needs updating!                                
Missing: test_optim
You can check the test failures (in the log) manually and if they are harmless, use --ignore-test-failure to make the test step pass.
5 test failures, 0 test errors (out of 259567):
Failed tests (suites/files):
        distributed/_composable/fsdp/test_fully_shard_compile (1 failed, 12 passed, 3 skipped, 0 errors)
        distributed/_composable/fsdp/test_fully_shard_state_dict (1 failed, 1 passed, 4 skipped, 0 errors)
        dynamo/test_error_messages (1 failed, 33 passed, 0 skipped, 0 errors)
        inductor/test_select_algorithm (2 failed, 17 passed, 0 skipped, 0 errors)
Could not count failed tests for the following test suites/files:
        test_optim (Undetected or did not run properly)

So this is an issue for the EasyBlock I guess.

The slow test is probably

/apps/Test/software/Python/3.12.3-GCCcore-13.3.0/bin/python -bb inductor/test_cooperative_reductions.py -m not serial --shard-id=1 --num-shards=1 -v -vv -rfEX -p no:xdist --use-pytest -x --reruns=2 --sc=inductor/test_cooperative_reductions_1_9dcb754d5b072e76 --print-items

which has been running since Jan 13 in the another build.

Anyway, I'm happy with the state of this PR, so go ahead and merge when you are happy with it.

Flamefire · 2026-01-19T14:44:52Z

This here is an issue worth checking:

test_optim (Undetected or did not run properly)

Can you attach the full log and ideally the test/test_reports folder from the build if it still exists?

VRehnberg · 2026-01-19T15:14:26Z

@Flamefire here
PyTorch-20260112.tar.gz

boegel

lgtm

It's high time that we get this merged.

There will probably be follow-up PRs (especially for the PyTorch easyblock), but this has been proven to be mature across a variety of systems.

@Flamefire Thanks a lot for all the effort on this!

boegel · 2026-01-20T20:47:02Z

Going in, thanks @Flamefire!

github-actions bot added the new label Sep 19, 2025

github-actions bot added the update label Sep 19, 2025

Thyre added the 2024a issues & PRs related to 2024a common toolchains label Sep 19, 2025

pavelToman mentioned this pull request Oct 14, 2025

Transformers vscentrum/vsc-software-stack#576

Open

2 tasks

pavelToman mentioned this pull request Oct 16, 2025

Dorado 1.0.2 or newer vscentrum/vsc-software-stack#584

Open

2 tasks

This was referenced Oct 21, 2025

{bio}[foss/2024a] dorado v1.1.1 w/ CUDA 12.6.0 #24328

Open

{ai}[foss/2024a] accelerate v1.11.0 w/ CUDA 12.6.0 #24343

Open

Flamefire mentioned this pull request Oct 25, 2025

{ai}[foss/2024a] PyTorch v2.9.1 w/ CUDA 12.6.0 #24365

Open

4 tasks

This was referenced Nov 14, 2025

pyannote.audio 4.0.1 or newer + WhisperX vscentrum/vsc-software-stack#612

Open

whisperx vscentrum/vsc-software-stack#613

Open

Flamefire mentioned this pull request Dec 2, 2025

{2025.06}[2024a] PyTorch 2.6.0 EESSI/software-layer#1314

Open

boegel added this to the next release (5.2.0?) milestone Dec 3, 2025

Flamefire added 2 commits January 9, 2026 16:28

Add backported patch to avoid closing file handles

a11696f

Use default number of allowed failures again

786db80

Allow 60 test failures

32a9107

boegel approved these changes Jan 20, 2026

View reviewed changes

boegel merged commit 0eef0cf into easybuilders:develop Jan 20, 2026
8 checks passed

Flamefire deleted the 20250919130737_new_pr_PyTorch271 branch January 21, 2026 08:41

laraPPr mentioned this pull request Jan 23, 2026

{ai}[foss/2024a] DP-GEN v0.13.2, DeePMD-kit v3.1.1, LAMMPS v29Aug2024_update2 w/ CUDA 12.6.0, kokkos deepmd CUDA 12.6.0 #24733

Open

3 tasks

{ai}[foss/2024a] PyTorch v2.7.1 w/ CUDA 12.6.0 #23923

{ai}[foss/2024a] PyTorch v2.7.1 w/ CUDA 12.6.0 #23923

Conversation

Flamefire commented Sep 19, 2025 • edited by boegel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Sep 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Flamefire commented Sep 25, 2025

Uh oh!

Flamefire commented Sep 28, 2025

Uh oh!

Flamefire commented Oct 8, 2025

Uh oh!

Flamefire commented Oct 16, 2025

Uh oh!

Flamefire commented Oct 17, 2025

Uh oh!

boegel commented Oct 25, 2025

Uh oh!

boegel commented Oct 25, 2025

Uh oh!

boegel commented Oct 25, 2025

Uh oh!

Flamefire commented Oct 25, 2025

Uh oh!

boegel commented Oct 27, 2025

Uh oh!

boegel commented Oct 27, 2025

Uh oh!

Flamefire commented Oct 27, 2025

Uh oh!

Flamefire commented Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Flamefire commented Oct 30, 2025

Uh oh!

Flamefire commented Oct 31, 2025

Uh oh!

boegel commented Dec 9, 2025

Uh oh!

Flamefire commented Dec 9, 2025

Uh oh!

Flamefire commented Jan 8, 2026

Uh oh!

boegel commented Jan 8, 2026

Uh oh!

boegel commented Jan 8, 2026

Uh oh!

boegelbot commented Jan 8, 2026

Uh oh!

Flamefire commented Jan 9, 2026

Uh oh!

boegel commented Jan 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Flamefire commented Jan 9, 2026

Uh oh!

boegel commented Jan 9, 2026

Uh oh!

Flamefire commented Jan 9, 2026

Uh oh!

Flamefire commented Jan 10, 2026

Uh oh!

boegelbot commented Jan 11, 2026

Uh oh!

Flamefire commented Jan 11, 2026

Uh oh!

Flamefire commented Jan 12, 2026

Uh oh!

boegel commented Jan 13, 2026

Uh oh!

Flamefire commented Jan 13, 2026

Uh oh!

boegel commented Jan 14, 2026

Uh oh!

Flamefire commented Jan 14, 2026

Uh oh!

VRehnberg commented Jan 19, 2026

Flamefire commented Sep 19, 2025 •

edited by boegel

Loading

github-actions bot commented Sep 19, 2025 •

edited

Loading

Flamefire commented Oct 30, 2025 •

edited

Loading

boegel commented Jan 9, 2026 •

edited

Loading

Flamefire commented Jan 19, 2026 •

edited

Loading