Conversation

@Flamefire (Contributor) commented Sep 19, 2025

@github-actions bot added the 'new' label Sep 19, 2025
github-actions bot commented Sep 19, 2025

Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use --review-pr (and --review-pr-filter / --review-pr-max) locally.

@Thyre added the '2024a' label (issues & PRs related to 2024a common toolchains) Sep 19, 2025
@Flamefire (Contributor Author)

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
i8005 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/e092ebca5d265d11b91fd67a83f3af73 for a full test report.

@Flamefire (Contributor Author)

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
c32 - Linux AlmaLinux 9.4, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 560.35.03, Python 3.9.18
See https://gist.github.com/Flamefire/f693e3f4804b88c790344400452a4cec for a full test report.

@Flamefire (Contributor Author)

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
c144 - Linux AlmaLinux 9.4, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 560.35.03, Python 3.9.18
See https://gist.github.com/Flamefire/26d65e755c8535ecce98bd3fa964d59b for a full test report.

@Flamefire (Contributor Author)

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
i8032 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/de74d19eb7a953944822b58781747c62 for a full test report.

@Flamefire (Contributor Author)

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
c23 - Linux AlmaLinux 9.4, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 560.35.03, Python 3.9.18
See https://gist.github.com/Flamefire/ab3477dac032ccef529dab68f21959d7 for a full test report.

@boegel (Member) commented Oct 25, 2025

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
node4307.litleo.os - Linux RHEL 9.6, x86_64, AMD EPYC 9454P 48-Core Processor (zen4), 1 x NVIDIA NVIDIA H100 NVL, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/206999229fc7e00980ac347bc1e717fd for a full test report.

@boegel (Member) commented Oct 25, 2025

@Flamefire

Checksum verification for /tmp/eb-eoudzstd/files_pr23923/p/PyTorch/PyTorch-2.7.0_do-not-checkout-nccl.patch using {'PyTorch-2.7.0_do-not-checkout-nccl.patch':
'ad085a15dd36768ad33a934f53dc595da745e01697b44d431f8b70ae9d0eb567'} failed
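
A quick way to recompute the checksum of the local patch file and compare it with the value in the easyconfig (a minimal sketch; pass the path to your copy of the file):

```python
# Minimal sketch: recompute the SHA256 checksum of a local patch file so it can
# be compared with the value expected by the easyconfig.
import hashlib
import sys

def sha256sum(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == '__main__':
    # e.g. python sha256sum.py PyTorch-2.7.0_do-not-checkout-nccl.patch
    print(sha256sum(sys.argv[1]))
```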

@boegel (Member) commented Oct 25, 2025

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
node3308.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/978c2c4a948d5383453e81a82532ff57 for a full test report.

@Flamefire (Contributor Author)

Seemingly changed by mistake. Fixed

@boegel (Member) commented Oct 27, 2025

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
node4307.litleo.os - Linux RHEL 9.6, x86_64, AMD EPYC 9454P 48-Core Processor (zen4), 1 x NVIDIA NVIDIA H100 NVL, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/e2a8a47106ef7afcf783242dc72cbe5b for a full test report.

@boegel (Member) commented Oct 27, 2025

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
node3308.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/1e769e4191aacbb019cb496146b909d6 for a full test report.

@Flamefire (Contributor Author)

The H100 failures are mostly from inductor/test_cutlass_backend (39 failed, 1 passed, 2 skipped, 0 errors).
I expect most of those failures to be caused by a "BytesWarning". Fixing that needs a rebuild of nvidia-cutlass with the added patch, which I had only added to #23606 because I forgot the easyconfig is also included here. I have now merged that into this branch.

With the (now) default of 10 allowed failures, that should be enough to pass.

As for the V100: I already had more failures on the A100, suggesting that upstream no longer tests on "older" GPUs... If you can attach the log of the test step, I'll take a look at the failures.

@Flamefire (Contributor Author) commented Oct 30, 2025

Test report by @Flamefire
FAILED
Build succeeded for 6 out of 7 (7 easyconfigs in total)
n1450.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21
See https://gist.github.com/Flamefire/bc6c0f8510f18f3f95f0a1eed3eb848d for a full test report.

SUCCESS on rerun, but the upload failed due to an expired token:

== COMPLETED: Installation ended successfully (took 18 hours 38 mins 21 secs)
== Results of the build can be found in the log file(s) /software/PyTorch/2.7.1-foss-2024a-CUDA-12.6.0/easybuild/easybuild-PyTorch-2.7.1-20251107.054206.log

@Flamefire (Contributor Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 7 out of 7 (7 easyconfigs in total)
c92 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/42fc4314e957ad4b757f8fbd40d064dd for a full test report.

@Flamefire (Contributor Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 7 out of 7 (7 easyconfigs in total)
i8018 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/cb905230972fee8ccf548435f534eddc for a full test report.

@boegel (Member) commented Dec 9, 2025

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 6 out of 7 (total: 46 hours 5 mins 4 secs) (7 easyconfigs in total)
node3907.accelgor.os - Linux RHEL 9.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/3fa8bd784b0d43d7693a47a1abe0a206 for a full test report.

@Flamefire (Contributor Author)

4 (of 8) failures are in test_cpu_select_algorithm and test_select_algorithm, which I assume have the same cause. However, the errors are not in the gist, so I can't tell.

Is it possibly this one?

OSError: [Errno 9] Bad file descriptor

If so, I have a patch for that.

In any case: I removed the "allowed failures = 6" setting, so the default of 10 is now used, which would make your run pass.
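
For illustration only, this is roughly what that setting looks like in a PyTorch easyconfig. The fragment below is hypothetical (not the actual contents of this PR); max_failed_tests is the PyTorch easyblock parameter for this limit, and dropping the line falls back to the easyblock default.

```python
# Hypothetical easyconfig excerpt: cap the number of tolerated failing tests
# in the PyTorch test step. Removing the line means the easyblock default applies.
max_failed_tests = 6
```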

@Flamefire (Contributor Author)

The issue is this:

WARNING Found 1 individual tests that exited with an error: pytest.internal

Can I see the full log?

@boegel (Member) commented Jan 8, 2026

> The issue is this:
>
> WARNING Found 1 individual tests that exited with an error: pytest.internal
>
> Can I see the full log?

Unfortunately the log file wasn't retained...

I can trigger it again and make sure the log file is retained.

@boegel (Member) commented Jan 8, 2026

@boegelbot please test @ jsc-zen3-a100
CORE_CNT=16
EB_ARGS="--failed-install-logs-path=$HOME/pr23923-${SLURM_JOBID}"

@boegelbot (Collaborator)

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=23923 EB_ARGS="--failed-install-logs-path=$HOME/pr23923-${SLURM_JOBID}" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_23923 --ntasks="16" --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 9354

Test results coming soon (I hope)...

Details

- notification for comment with ID 3725616778 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@Flamefire (Contributor Author)

--failed-install-logs-path is a great option in general, especially for PRs. Together with copying the failed build dirs for PyTorch, it allows investigating which issue the parser found, because the test XML files are retained.

If you still have access to the XML files (the test_reports folder), I can check why it associates pytest.internal with inductor/test_torchinductor_opinfo.

@boegel (Member) commented Jan 9, 2026

Test report by @boegel
FAILED
Build succeeded for 4 out of 5 (total: 39 hours 4 mins 12 secs) (5 easyconfigs in total)
node3303.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/bf837f92eff2a389efa4aded8a62a5b0 for a full test report.

== FAILED: Installation ended unsuccessfully: An error was raised during test step: Failing because not all failed tests could be determined. Tests failed to start, crashed or the test accounting in the PyTorch EasyBlock needs updating!
Missing: inductor/test_cpu_select_algorithm
You can check the test failures (in the log) manually and if they are harmless, use --ignore-test-failure to make the test step pass.
53 test failures, 0 test errors (out of 261882):
Failed tests (suites/files):
        distributed/_composable/fsdp/test_fully_shard_state_dict (1 failed, 1 passed, 4 skipped, 0 errors)
        distributed/test_store (1 failed, 32 passed, 0 skipped, 0 errors)
        dynamo/test_compiler_bisector (1 failed, 6 passed, 0 skipped, 0 errors)
        dynamo/test_error_messages (1 failed, 33 passed, 0 skipped, 0 errors)
        higher_order_ops/test_invoke_quant (1 failed, 13 passed, 0 skipped, 0 errors)
        higher_order_ops/test_invoke_subgraph (1 failed, 19 passed, 1 skipped, 0 errors)
        inductor/test_aot_inductor (2 failed, 313 passed, 102 skipped, 0 errors)
        inductor/test_benchmark_fusion (2 failed, 10 passed, 4 skipped, 0 errors)
        inductor/test_compile_subprocess (5 failed, 1345 passed, 152 skipped, 0 errors)
        inductor/test_cuda_repro (2 failed, 61 passed, 7 skipped, 0 errors)
        inductor/test_custom_lowering (1 failed, 3 passed, 0 skipped, 0 errors)
        inductor/test_inplace_padding (2 failed, 7 passed, 0 skipped, 0 errors)
        inductor/test_kernel_benchmark (3 failed, 15 passed, 0 skipped, 0 errors)
        inductor/test_max_autotune (3 failed, 70 passed, 29 skipped, 0 errors)
        inductor/test_online_softmax (17 failed, 13 passed, 0 skipped, 0 errors)
        inductor/test_pad_mm (2 failed, 16 passed, 0 skipped, 0 errors)
        inductor/test_scatter_optimization (1 failed, 7 passed, 0 skipped, 0 errors)
        inductor/test_torchinductor (4 failed, 1519 passed, 151 skipped, 0 errors)
        inductor/test_torchinductor_codegen_dynamic_shapes (1 failed, 1091 passed, 410 skipped, 0 errors)
        inductor/test_torchinductor_dynamic_shapes (2 failed, 1403 passed, 158 skipped, 0 errors)
Could not count failed tests for the following test suites/files:
        inductor/test_cpu_select_algorithm (Undetected or did not run properly) (took 38 hours 58 mins 16 secs)

@Flamefire (Contributor Author)

> Test report by @boegel
> FAILED
> Build succeeded for 4 out of 5 (total: 39 hours 4 mins 12 secs) (5 easyconfigs in total)
> node3303.joltik.os - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA Tesla V100-SXM2-32GB, 580.95.05, Python 3.9.21
> See https://gist.github.com/boegel/bf837f92eff2a389efa4aded8a62a5b0 for a full test report.

Mostly inductor/* failures, so there is likely a common cause. And inductor/test_cpu_select_algorithm is missing, so it might have crashed. Checking the full test log is required.

@boegel (Member) commented Jan 9, 2026

@Flamefire Found this in the log:

Finished inductor/test_cpu_select_algorithm 1/1 in 10 minutes
inductor/test_cpu_select_algorithm 1/1 failed!
Unable to import boto3. Will not be emitting metrics.... Reason: No module named 'boto3'

I'll share the whole log with you (via Slack).

@Flamefire (Contributor Author)

Weird: it looks like that failure was written to the XML file:

stepcurrent: skipping 0 already run items. Running only test/inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_aoti_bmm_unique_identifiers_cpu_float32
[...]
Got exit code 1
Test failed consistently, continuing with the rest of the tests due to continue-through-error being set
Test results will be stored in test-reports/python-pytest/inductor.test_cpu_select_algorithm/inductor.test_cpu_select_algorithm-8bb9267bd246f209.xml

But still:

Missing: inductor/test_cpu_select_algorithm

So it seems the parser didn't pick up the failure.

Almost all the rest seem to be specific to the V100 and its lack of BF16 support (I reported it at pytorch/pytorch#172085):

  • distributed/test_store finds "localhost.localdomain" instead of "localhost". Some setting on the node?
  • distributed/_composable/fsdp/test_fully_shard_state_dict: Likely incorrect test for V100
  • dynamo/test_compiler_bisector - "Tesla V100-SXM2-32GB does not support bfloat16 compilation natively, skipping"
  • inductor/test_aot_inductor "BF16 is not supported" & "FlashAttention only supports Ampere GPUs or newer."
  • inductor/test_compile_subprocess "BF16 is not supported"
  • inductor/test_max_autotune "Unsupported conversion from f16 to f16" close to "LLVM ERROR: Unsupported rounding mode for conversion."
  • inductor/test_online_softmax "Tesla V100-SXM2-32GB does not support bfloat16 compilation natively, skipping"

If we can fix the missed inductor/test_cpu_select_algorithm, we could just increase the allowed number of failed tests. Maybe exclude inductor/test_online_softmax entirely, as it has the highest number of failures (17 out of 30 tests fail).
I backported a patch from 2.9.1 that is simple enough to include here without side effects. It could fix the inductor/test_cpu_select_algorithm failure altogether, since that suite seems to fail with the same issue I found and fixed there:

self._file.write(msg)

OSError: [Errno 9] Bad file descriptor

It would still be good to know why the parser didn't find it, to avoid similar issues. It could be related, though: the faulty close might cause writing of the XML file to fail.

Patching the failing tests is also possible if the following does indeed return False on V100s (a pip-installed PyTorch could be used to test): python -c 'import torch; from torch._dynamo.device_interface import get_interface_for_device; print(get_interface_for_device("cuda").is_dtype_supported(torch.bfloat16))'
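
For convenience, the same check wrapped into a tiny script (a sketch that assumes a CUDA-enabled PyTorch install and uses the exact call from the one-liner above):

```python
# Sketch: report whether the visible CUDA device supports bfloat16 compilation.
# Uses the same call as the one-liner above; assumes a CUDA-enabled PyTorch.
import torch
from torch._dynamo.device_interface import get_interface_for_device

def cuda_bf16_supported():
    if not torch.cuda.is_available():
        return False
    return get_interface_for_device("cuda").is_dtype_supported(torch.bfloat16)

if __name__ == "__main__":
    print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no CUDA device")
    print("bfloat16 supported:", cuda_bf16_supported())
```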

@Flamefire (Contributor Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (total: 27 hours 28 mins 18 secs) (5 easyconfigs in total)
n1502.barnard.hpc.tu-dresden.de - Linux RHEL 9.6, x86_64, Intel(R) Xeon(R) Platinum 8470 (sapphirerapids), Python 3.9.21
See https://gist.github.com/Flamefire/57c1caf8d30e7259c3de520544fcc8fc for a full test report.

@boegelbot (Collaborator)

Test report by @boegelbot
FAILED
Build succeeded for 4 out of 5 (total: 51 hours 55 mins 26 secs) (5 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.7, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 590.44.01, Python 3.9.23
See https://gist.github.com/boegelbot/c3b2b6f37dc52f2085d1921421dd7f03 for a full test report.

@Flamefire (Contributor Author)

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 5 (total: 37 hours 23 mins 28 secs) (5 easyconfigs in total)
c119 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/abd1429569497eec557584f5f8ab648d for a full test report.

@Flamefire (Contributor Author)

Failing because there were unexpected failures detected: inductor/test_torchinductor_opinfo

Maybe we should make this a warning only

@boegel (Member) commented Jan 13, 2026

> Failing because there were unexpected failures detected: inductor/test_torchinductor_opinfo
>
> Maybe we should make this a warning only

Warning by default, but with a way to make it a hard error perhaps?

@Flamefire In any case, I don't think we need to block this PR any further, what do you think?

@Flamefire (Contributor Author)

> Warning by default, but with a way to make it a hard error perhaps?

An EC option allow_extra_failures with a default of True?
I'd keep the error for the case where a suite shows up as failed but we couldn't determine how many of its tests failed.

In my report the cause is a timeout, after which the test process gets killed without writing an XML entry. However, the test has "rerun" entries, so we could use that: if a test only shows up as "rerun" but never as a success, treat it as an error.
That still has an issue, because reruns continue at that test, and the remaining tests only run after it has failed N times. So if the first test of a suite is rerun N-1 times before the whole suite is terminated, we won't have any data about the other tests in that suite.
So the safe option is to do as is done now :-/
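
A rough sketch of that heuristic over the pytest JUnit XML reports (the <rerun> tag name and the report layout are assumptions based on the "rerun" entries mentioned above, not verified easyblock behaviour):

```python
# Sketch of the heuristic: a test that only appears with "rerun" records but
# never with a successful result is treated as failed. The <rerun> tag name and
# the report layout are assumptions about the pytest JUnit XML files.
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path

def rerun_only_tests(report_dir):
    seen = defaultdict(lambda: {"success": False, "rerun": False})
    for xml_file in Path(report_dir).rglob("*.xml"):
        for case in ET.parse(xml_file).getroot().iter("testcase"):
            test_id = f"{case.get('classname')}::{case.get('name')}"
            tags = {child.tag for child in case}
            if "rerun" in tags:
                seen[test_id]["rerun"] = True
            # a testcase with no failure/error/skipped/rerun children counts as a pass
            if not tags & {"failure", "error", "skipped", "rerun"}:
                seen[test_id]["success"] = True
    return sorted(t for t, s in seen.items() if s["rerun"] and not s["success"])
```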

> @Flamefire In any case, I don't think we need to block this PR any further, what do you think?

Do we want to increase the allowed failures to allow your previous build to pass? Or let people see those errors for old-ish GPUs?

@boegel (Member) commented Jan 14, 2026

>> Warning by default, but with a way to make it a hard error perhaps?
>
> An EC option allow_extra_failures with a default of True? I'd keep the error for the case where a suite shows up as failed but we couldn't determine how many of its tests failed.
>
> In my report the cause is a timeout, after which the test process gets killed without writing an XML entry. However, the test has "rerun" entries, so we could use that: if a test only shows up as "rerun" but never as a success, treat it as an error. That still has an issue, because reruns continue at that test, and the remaining tests only run after it has failed N times. So if the first test of a suite is rerun N-1 times before the whole suite is terminated, we won't have any data about the other tests in that suite. So the safe option is to do as is done now :-/
>
>> @Flamefire In any case, I don't think we need to block this PR any further, what do you think?
>
> Do we want to increase the allowed failures to allow your previous build to pass? Or let people see those errors for old-ish GPUs?

@Flamefire I'm in favor of allowing some more failures by default, maybe even up to 100?
We can easily make that stricter on our end (in the bot, at our site) via a hook.

The issue of not finding the result of a test should be less fatal too, but that's work for the easyblock, so it doesn't need to block this PR.
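
For reference, a minimal sketch of what such a site hook could look like (assumptions: the PyTorch easyblock's max_failed_tests parameter and a stricter, purely illustrative site limit; loaded via eb --hooks=site_hooks.py):

```python
# site_hooks.py -- sketch of a site-level hook that tightens the number of
# tolerated PyTorch test failures again. Assumes the PyTorch easyblock's
# max_failed_tests parameter; the limit of 2 is purely illustrative.
def parse_hook(ec, *args, **kwargs):
    """Apply stricter site policy while parsing easyconfigs."""
    if ec.name == 'PyTorch':
        ec['max_failed_tests'] = 2
```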

@Flamefire (Contributor Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (total: 52 hours 25 mins 56 secs) (5 easyconfigs in total)
i8026 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/48580e7a7fbd5daf1d19e5523670b194 for a full test report.

@VRehnberg (Contributor)

My builds are currently on day 7+ of running PyTorch tests. Do you have any suggestions to make them run faster? Should I just always build PyTorch on a full node?

@Flamefire (Contributor Author)

7 days is certainly too much. With 2.9.1 I identified an issue that caused an infinite hang, but that exact issue is not present in 2.7. Maybe check whether any sub-process has been hanging for days, or whether the tests are just very slow on your machine.

I do indeed use a full node.
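
To check for the hanging sub-process case, a rough helper like this can list test processes that have been running for more than a day (a sketch; it only relies on the standard ps utility, and the pattern is adjustable):

```python
# Rough helper (sketch): list processes matching a pattern that have been
# running longer than a threshold, to spot hung PyTorch test sub-processes.
# Relies only on the standard `ps` utility; adjust the pattern as needed.
import subprocess

def long_running(pattern="pytest", min_hours=24):
    out = subprocess.run(["ps", "-eo", "pid,etimes,cmd"],
                         capture_output=True, text=True, check=True).stdout
    hits = []
    for line in out.splitlines()[1:]:
        pid, etimes, cmd = line.split(None, 2)
        if pattern in cmd and int(etimes) >= min_hours * 3600:
            hits.append((int(pid), int(etimes) // 3600, cmd))
    return hits  # (pid, hours elapsed, command line)

if __name__ == "__main__":
    for pid, hours, cmd in long_running():
        print(f"{pid}: running for {hours}h: {cmd}")
```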

@VRehnberg (Contributor)

Oh, now the first one has finished.

== ... (took 167 hours 8 mins 47 secs)
== FAILED: Installation ended unsuccessfully: An error was raised during test step: Failing because not all failed tests could be determined. Tests failed to start, crashed or the test accounting in the PyTorch EasyBlock needs updating!                                                         
Missing: test_optim
You can check the test failures (in the log) manually and if they are harmless, use --ignore-test-failure to make the test step pass.
5 test failures, 0 test errors (out of 259567):
Failed tests (suites/files):
        distributed/_composable/fsdp/test_fully_shard_compile (1 failed, 12 passed, 3 skipped, 0 errors)
        distributed/_composable/fsdp/test_fully_shard_state_dict (1 failed, 1 passed, 4 skipped, 0 errors)
        dynamo/test_error_messages (1 failed, 33 passed, 0 skipped, 0 errors)
        inductor/test_select_algorithm (2 failed, 17 passed, 0 skipped, 0 errors)
Could not count failed tests for the following test suites/files:
        test_optim (Undetected or did not run properly) (took 170 hours 15 mins 13 secs)
== Results of the build can be found in the log file(s) /dev/shm/eb-90gorcp9/easybuild-PyTorch-2.7.1-20260112.094124.tDBmG.log
== Summary:
   * [FAILED]  PyTorch/2.7.1-foss-2024a-CUDA-12.6.0

ERROR: Installation of PyTorch-2.7.1-foss-2024a-CUDA-12.6.0.eb failed: An error was raised during test step: Failing because not all failed tests could be determined. Tests failed to start, crashed or the test accounting in the PyTorch EasyBlock needs updating!                                
Missing: test_optim
You can check the test failures (in the log) manually and if they are harmless, use --ignore-test-failure to make the test step pass.
5 test failures, 0 test errors (out of 259567):
Failed tests (suites/files):
        distributed/_composable/fsdp/test_fully_shard_compile (1 failed, 12 passed, 3 skipped, 0 errors)
        distributed/_composable/fsdp/test_fully_shard_state_dict (1 failed, 1 passed, 4 skipped, 0 errors)
        dynamo/test_error_messages (1 failed, 33 passed, 0 skipped, 0 errors)
        inductor/test_select_algorithm (2 failed, 17 passed, 0 skipped, 0 errors)
Could not count failed tests for the following test suites/files:
        test_optim (Undetected or did not run properly)

So this is an issue for the EasyBlock I guess.

The slow test is probably:

/apps/Test/software/Python/3.12.3-GCCcore-13.3.0/bin/python -bb inductor/test_cooperative_reductions.py -m not serial --shard-id=1 --num-shards=1 -v -vv -rfEX -p no:xdist --use-pytest -x --reruns=2 --sc=inductor/test_cooperative_reductions_1_9dcb754d5b072e76 --print-items

which has been running since Jan 13 in another build.

Anyway, I'm happy with the state of this PR, so go ahead and merge when you are happy with it.

@Flamefire (Contributor Author) commented Jan 19, 2026

This is an issue worth checking:

test_optim (Undetected or did not run properly)

Can you attach the full log and, ideally, the test/test_reports folder from the build, if it still exists?

@VRehnberg (Contributor)

@Flamefire here:
PyTorch-20260112.tar.gz

@boegel (Member) left a review comment

lgtm

It's high time that we get this merged.

There will probably be follow-up PRs (especially for the PyTorch easyblock), but this has proven to be mature across a variety of systems.

@Flamefire Thanks a lot for all the effort on this!

@boegel (Member) commented Jan 20, 2026

Going in, thanks @Flamefire!

@boegel merged commit 0eef0cf into easybuilders:develop on Jan 20, 2026 (8 checks passed)
@Flamefire deleted the 20250919130737_new_pr_PyTorch271 branch on January 21, 2026 at 08:41