Conversation

@Flamefire
Contributor

@Flamefire Flamefire commented Oct 24, 2025

@Flamefire Flamefire marked this pull request as draft October 24, 2025 16:33
@github-actions

github-actions bot commented Oct 24, 2025

Diff of new easyconfig(s) against existing ones is too long for a GitHub comment. Use --review-pr (and --review-pr-filter / --review-pr-max) locally.

@Thyre Thyre added the 2024a issues & PRs related to 2024a common toolchains label Oct 25, 2025
@Flamefire Flamefire marked this pull request as ready for review December 9, 2025 11:31
@Flamefire Flamefire changed the title {ai}[foss/2024a] PyTorch v2.9.0 w/ CUDA 12.6.0 {ai}[foss/2024a] PyTorch v2.9.1 w/ CUDA 12.6.0 Dec 9, 2025
@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from 9849937 to 15b85aa Compare December 9, 2025 11:45
@github-actions github-actions bot added the new label Dec 9, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (total: 27 hours 27 mins 43 secs) (1 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/50cdd13305fd9a33c6140c223aeab6cd for a full test report.

@boegel
Member

boegel commented Dec 13, 2025

Test report by @boegel
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3803
FAILED
Build succeeded for 4 out of 6 (total: 8 mins 45 secs) (6 easyconfigs in total)
node3907.accelgor.os - Linux RHEL 9.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 580.95.05, Python 3.9.21
See https://gist.github.com/boegel/e06f98f956452bcb8a132f816e55927b for a full test report.

postinstallpatches = [('triton_test.py', 'test/triton_test.py')]

checksums = [
{'triton_test.py': '0d8b4556a76268b000d6023a1abaee801d179db3aed51e781c06854858490cc8'},
Member

Checksum of easybuild/easyconfigs/t/Triton/triton_test.py in develop branch is 02a3390a5dbe27385358ab319cf10972cd8b51aca599a6809efea612a90ecdba ?

See also https://github.com/easybuilders/easybuild-easyconfigs/pull/23120/files#diff-281b568c9515da9719c383a4978f7626f1fae4773e9c38132ee7160b18771e6bR141
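A mismatch like this can be checked locally by hashing the file and comparing the result against the value listed in `checksums`. The snippet below is only a minimal sketch using Python's standard `hashlib`; the helper names `sha256_of_file` and `checksum_matches` are made up for illustration and are not part of EasyBuild.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected):
    """Compare the computed digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected.strip().lower()
```

For example, `checksum_matches("triton_test.py", "<value from the easyconfig>")` returns `False` when the file on disk differs from the one the easyconfig was written against.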

Contributor Author

Oh, yes. The upload failed because --update-pr couldn't handle Python files (it tried to parse the file as an easyconfig to find the destination folder).
Will fix on Monday.

Contributor Author

Update added.

@boegel
Member

boegel commented Dec 13, 2025

I'm also seeing a crash with the triton_test.py script:

== FAILED: Installation ended unsuccessfully: Sanity check failed: sanity check command TRITON_HOME=$TMPDIR/eb-triton_home python
/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/test/triton_test.py 8.0 failed with exit code 1 (output: Traceback (most recent call last):
  File "/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/test/triton_test.py", line 13, in <module>
    src = triton.compiler.ASTSource(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/software/Triton/3.5.0-gfbf-2024a-CUDA-12.6.0/lib/python3.12/site-packages/triton/compiler/compiler.py", line 67, in __init__
    for k in self.signature.keys():
             ^^^^^^^^^^^^^^^^^^^
AttributeError: 'str' object has no attribute 'keys'

@Flamefire
Contributor Author

That's why the checksum changed: there is a breaking change in Triton 3.5, and I updated the test script accordingly. The updated script is in #24793.
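The traceback above shows `ASTSource.__init__` calling `self.signature.keys()`, so a plain string signature no longer works there and a mapping is expected. As a rough illustration only, here is a sketch in plain Python (it does not import Triton) of converting a comma-separated type string into a dict keyed by argument name. The helper name `signature_as_dict`, the example argument names, and the assumption that the old signature was a comma-separated string are hypothetical and not taken from the actual triton_test.py.

```python
def signature_as_dict(arg_names, signature):
    """Return a {arg_name: type_string} mapping for a kernel signature.

    Hypothetical helper: accepts either an existing dict (copied as-is)
    or a comma-separated string of type names such as "*fp32, i32".
    """
    if isinstance(signature, dict):
        return dict(signature)
    types = [item.strip() for item in signature.split(",")]
    if len(types) != len(arg_names):
        raise ValueError(
            f"expected {len(arg_names)} types, got {len(types)}: {signature!r}"
        )
    return dict(zip(arg_names, types))


# A str has no .keys(), which is what triggers the AttributeError in the traceback.
old_style = "*fp32, *fp32, i32"
new_style = signature_as_dict(["x_ptr", "y_ptr", "n_elements"], old_style)
```

The exact keys and type strings that Triton 3.5 expects should be checked against the updated test script in #24793 rather than inferred from this sketch.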

@Flamefire
Contributor Author

I had to use tlparse 0.4.0 (also separate PR in #24882) as the older one isn't compatible with PyTorch output, see pytorch/pytorch@92c2dae

The lowest tlparse version that works is 0.3.42.

I'm not sure whether this causes conflicts in EB. The alternative is to drop this dependency, as it is optional.

@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from 2b8bc42 to 64a4d67 Compare December 16, 2025 17:04
@Flamefire
Contributor Author

Rebased to remove EasyConfigs present in develop from this branch.

Also added 2 more patches to avoid remaining failures.

@github-actions github-actions bot removed the new label Dec 16, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 4 out of 5 (total: 4 mins 48 secs) (5 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/6f2e5c7a020ec72dff9d2f5c8220fba5 for a full test report.

@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch 2 times, most recently from 01180ef to 381c028 Compare December 17, 2025 09:09
@boegel boegel added this to the release after 5.2.0 milestone Dec 18, 2025
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 5 out of 5 (total: 24 hours 42 mins 21 secs) (5 easyconfigs in total)
c144 - Linux Rocky Linux 9.6, x86_64, AMD EPYC 9334 32-Core Processor (zen4), 4 x NVIDIA NVIDIA H100, 580.65.06, Python 3.9.21
See https://gist.github.com/Flamefire/a6bd885643124f5fac4864060e0e18cd for a full test report.

…es: PyTorch-2.9.0_fix-nccl-test-env.patch, PyTorch-2.9.0_readd-support-for-nvidia-cutlass-python-package.patch
@Flamefire Flamefire force-pushed the 20251024183337_new_pr_PyTorch290 branch from 07c1976 to 117a394 Compare January 7, 2026 16:54
Labels

2024a issues & PRs related to 2024a common toolchains update

3 participants