-
Notifications
You must be signed in to change notification settings - Fork 772
fix possible error/crash in NCCL on x86 due to cpuid #19231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix possible error/crash in NCCL on x86 due to cpuid #19231
Conversation
|
Test report by @Flamefire |
|
@boegelbot: please test @ generoso |
|
@smoors: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... Details- notification for comment with ID 1812577674 processed Message to humans: this is just bookkeeping information for me, |
|
Test report by @boegelbot |
|
FAIL (build issue) NCCL-2.18.3-GCCcore-12.3.0-CUDA-12.1.1.eb
|
That must be a download error on your site: |
| checksums = [('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9', | ||
| 'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96')] | ||
| patches = ['NCCL-2.16.2_fix-cpuid.patch'] | ||
| checksums = [ | ||
| {'v2.18.3-1.tar.gz': '6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9'}, | ||
| {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'}, | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please restore the alternative checksum. See #18906
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, didn't notice that. Done 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hitting a framework bug. Shall we resolve that first or shall I revert to non-dicts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's get this merged first?
checksumming is quite a mess at the moment. probably better to fix this in EB-5 so breaking changes can be made, see also easybuilders/easybuild-framework#4346
|
A college of mine tried to run GROMACS with NCCL and ran into a Segmentation fault pointing to |
|
|
@boegelbot please test @ generoso |
|
@SebastianAchilles: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... Details- notification for comment with ID 1814160413 processed Message to humans: this is just bookkeeping information for me, |
|
Test report by @SebastianAchilles |
|
Test report by @SebastianAchilles |
|
@boegelbot please test @ jsc-zen2 |
|
@SebastianAchilles: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster PR test command '
Test results coming soon (I hope)... Details- notification for comment with ID 1814215206 processed Message to humans: this is just bookkeeping information for me, |
|
Test report by @boegelbot |
|
Test report by @boegelbot |
smoors
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
|
Going in, thanks @Flamefire! |
(created using
eb --new-pr)I found this to be the cause for multiple failures of PyTorch on x86 (AMD EPYC) like: