{devel}[foss/2020b,fosscuda/2020b] PyTorch v1.9.0 w/ Python 3.8.6 #13237
Conversation
…da-2020b.eb and patches: PyTorch-1.9.0_avoid-failures-in-test_unary_ufuncs.patch, PyTorch-1.9.0_fix-testnn-on-A100.patch, PyTorch-1.9.0_fix-use-after-destruct-in-cudaipctypes.patch, PyTorch-1.9.0_fix-vsx-vector-functions.patch, PyTorch-1.9.0_increase_test_cuda_tolerance.patch, PyTorch-1.9.0_increase-tolerance-for-distributed-tests.patch, PyTorch-1.9.0_limit_world_size_for_zero_redundancy_opt_test.patch, PyTorch-1.9.0_skip-nccl-error-tests.patch
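For reference, patches like these are picked up from the easyconfig's `patches` list; a minimal sketch of how they would typically be referenced (only the patch list is shown, all other easyconfig parameters are omitted here):

```python
# Sketch of the relevant part of a PyTorch 1.9.0 easyconfig: the patch list,
# mirroring the patch files named in the PR description above.
patches = [
    'PyTorch-1.9.0_avoid-failures-in-test_unary_ufuncs.patch',
    'PyTorch-1.9.0_fix-testnn-on-A100.patch',
    'PyTorch-1.9.0_fix-use-after-destruct-in-cudaipctypes.patch',
    'PyTorch-1.9.0_fix-vsx-vector-functions.patch',
    'PyTorch-1.9.0_increase_test_cuda_tolerance.patch',
    'PyTorch-1.9.0_increase-tolerance-for-distributed-tests.patch',
    'PyTorch-1.9.0_limit_world_size_for_zero_redundancy_opt_test.patch',
    'PyTorch-1.9.0_skip-nccl-error-tests.patch',
]
```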
I've set off test reports. I did an initial test of the …
Test report by @terjekv
Test report by @branfosj Edit: Failure in …
Test report by @branfosj Edit: Failure in … and failure in …
@branfosj Thanks for the reminder. I don't really have a Cascade Lake machine easily available, so I added a patch to skip that test. Can you try that? @terjekv Something seems odd with your build machine:
I don't quite understand how. The machine monitoring suggests that free memory never dropped below 30 GB or so. Could a spike have eaten that? :(
@terjekv No idea, but it seems to be related to InfiniBand. It is a call to … Maybe just a fluke?
I hope so. Could it be IB stuff on RHEL8 causing issues? Rerunning to see.
Same error. Kind of iffy. The box does not have any IB hardware, but it has: …
Then maybe that is just unsupported. Please open an issue in the PyTorch repo to get them to clarify that.
Test report by @verdurin
Test report by @terjekv
Test report by @branfosj Edit: Something odd is going on with …
Test report by @branfosj
Test report by @branfosj
@branfosj Can you check what exactly failed there? It still says test_quantization, but that part of the log is missing. And test_lstm should be fixed...
I'm confused - the failure is as before: …
I'm setting off a build so I can do some debugging on why that test is not being skipped.
I didn't actually skip it but marked it as an expected failure: 74c8c32#diff-de8a49a66cde6fe5e687dac5c654f196ebd0110d55a739b1f57095131f5d471cR22 Not sure why this isn't working.
@branfosj Found it. Currently uploading a new patch. The mark must come first -.- |
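For context: Python applies stacked decorators bottom-up, so a mark set by one decorator can get lost if another decorator later replaces the function. A toy illustration of that effect (not the actual PyTorch test code; the decorator and test names below are made up):

```python
def expected_failure_mark(fn):
    # Marker-style decorator: just tags the function with an attribute
    # that a test runner could look for.
    fn.__expected_failure__ = True
    return fn

def wrapping_decorator(fn):
    # Wrapper-style decorator: returns a brand-new function and does not
    # copy attributes over, as some test decorators effectively do.
    def inner(*args, **kwargs):
        return fn(*args, **kwargs)
    return inner

@wrapping_decorator
@expected_failure_mark   # applied first: the mark lands on the original function, which the wrapper then hides
def test_mark_lost():
    pass

@expected_failure_mark   # listed first, applied last: the mark lands on the outermost function and survives
@wrapping_decorator
def test_mark_kept():
    pass

print(getattr(test_mark_lost, '__expected_failure__', False))  # False: mark was lost
print(getattr(test_mark_kept, '__expected_failure__', False))  # True
```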
Thanks. I'll set off a test report.
Test report by @branfosj
Test report by @Flamefire
I give up and will just skip that test.
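If it ends up being skipped at the easyconfig level rather than via a patch, PyTorch easyconfigs can list tests to skip through the PyTorch easyblock's excluded_tests parameter. A rough sketch only; the test entry below is a placeholder, not necessarily the test discussed here:

```python
# Sketch: excluded_tests maps an architecture name (empty string = all
# architectures) to a list of test suites the PyTorch easyblock will skip.
excluded_tests = {
    '': [
        'distributed/test_c10d',  # placeholder entry for illustration
    ],
}
```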
Test report by @branfosj
Test report by @branfosj
Test report by @Flamefire
Test report by @branfosj
Test report by @branfosj
Not sure here... While I do like the option to run all the tests and report failures afterwards, it makes these test reports hard to read, since the actual error happens much earlier in the log. I think I should add some post-processing to the easyblock so that it shows the actual failures.
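Roughly what such post-processing could look like (a sketch only, not the actual easyblock code; the failure patterns and log handling are assumptions):

```python
import re

# Patterns that typically mark a failed test in PyTorch's test output;
# these are illustrative, the real easyblock may need different ones.
FAILURE_PATTERNS = (
    re.compile(r'^FAIL(ED)?[: ]'),
    re.compile(r'^ERROR[: ]'),
)

def summarize_failures(log_text):
    """Collect the lines that look like real test failures so they can be
    shown at the end of the report instead of being buried early in the log."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if any(p.match(line) for p in FAILURE_PATTERNS)
    ]
```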
The failed test was with 4 GPUs. We are back to NCCL errors. Full output from …
Yeah, but this time it's test_broadcast_coalesced_nccl, which I haven't seen fail before. Do you use a different NCCL? Have you rebuilt NCCL with that patch already (not sure if that happens automatically)?
NCCL is …
Hm, I can't see anything wrong with this test, so I'd rather not disable it. Maybe just a fluke?
Test report by @branfosj Edit: Test with 2 GPUs
Test report by @Flamefire
Test report by @branfosj Edit: Test with 1 GPU
Test report by @branfosj Edit: Test with 4 GPUs
I've run it using all 4 GPUs three times:
So the first one was either a fluke or the test is flaky.
The test shouldn't really be flaky from what I can tell; I suspect NCCL is the problem here, hence I'm hesitant to disable this test. FWIW, I reran this test file 5 times without any failure. My proposal: include as-is and keep the potential failure as a kind of warning that PyTorch might fail in the given environment.
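For reference, rerunning a single test file a few times to check for flakiness can be done along these lines (a sketch; the test file path is an assumption for illustration, adjust it to wherever the failing test actually lives):

```python
import subprocess
import sys

# Re-run one PyTorch test file several times and report the exit codes;
# 'test/distributed/test_c10d.py' is a placeholder path.
for attempt in range(1, 6):
    result = subprocess.run(
        [sys.executable, 'test/distributed/test_c10d.py'],
        capture_output=True,
        text=True,
    )
    status = 'OK' if result.returncode == 0 else f'FAILED (rc={result.returncode})'
    print(f'run {attempt}: {status}')
    if result.returncode != 0:
        # Show the tail of the output for the failing run.
        print(result.stdout[-2000:])
```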
Test report by @boegel
branfosj left a comment:
lgtm
Going in, thanks @Flamefire!
(created using eb --new-pr)