{devel}[foss/2021a] PyTorch v1.10.0, torchvision v0.11.1, Horovod v0.23.0 w/ Python 3.9.5 + CUDA-11.3.1 #14233
Conversation
…hes: PyTorch-1.10.0_fix-alias-violation-in-bitwise-ops.patch, PyTorch-1.10.0_fix-faulty-asserts-and-skip-test.patch, PyTorch-1.10.0_fix-test-dataloader-fixed-affinity.patch, PyTorch-1.10.0_skip-nccl-error-tests.patch
|
@Flamefire Care to have a look? I've tried to figure out which patches were still needed (most of them were made by you in the past). Moreover, you seem to know the test suite well, so maybe you can help figure out why some of the tests are failing... |
|
I had need of a PyTorch in 2021a. This was with 1.9.1, as it was before 1.10.0 had been released. I think that list of failures looks like the ones I saw there. My solution was to add those tests to the list of excluded tests; see https://github.com/bear-rsg/easybuild-easyconfigs/blob/2021a/easybuild/easyconfigs/p/PyTorch/PyTorch-1.9.1-foss-2021a-CUDA-11.3.1-imkl.eb for the easyconfig we deployed, which additionally disables a few other things. I've not yet tried that on A100 (or any other A*) GPUs, but I've just got A30s and A100s available, so I expect I'll be able to test on those soon. It might also be worth doing a non-CUDA version, as it can be helpful to debug which errors are GPU-related and which also happen on a CPU-only build. |
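For the record, the kind of change this boils down to (as far as I understand it) is adding the failing tests to the `excluded_tests` parameter that the PyTorch easyblock supports; a minimal sketch, with placeholder test names rather than the exact list from the linked easyconfig:

```python
# Hypothetical easyconfig fragment: the PyTorch easyblock supports an
# 'excluded_tests' parameter mapping an architecture ('' = all archs) to a
# list of test modules to skip during the test step. The test names below
# are placeholders, not the exact list from the easyconfig linked above.
excluded_tests = {
    '': [
        'distributed/optim/test_zero_redundancy_optimizer',
        'distributed/rpc/cuda/test_tensorpipe_agent',
    ],
}
```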
|
distributed/optim/test_zero_redundancy_optimizer: two of the tests in there fail, both with a similar traceback. distributed/rpc/cuda/test_tensorpipe_agent: this one fails as well; there is a patch for it. The original issue at PyTorch for distributed/rpc/test_tensorpipe_agent is at least still open: pytorch/pytorch#59436 (not sure about the others). |
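In case it helps to reproduce these in isolation, a rough sketch of re-running a single test module outside the full test run; the module name is just an example and the run_test.py invocation is an assumption to be checked against the actual source tree:

```python
# Hypothetical helper to re-run one PyTorch test module in isolation, to check
# whether a failure also occurs outside the full test run. Assumes it is run
# from the root of an unpacked PyTorch source tree.
import subprocess
import sys

test_module = "distributed/rpc/cuda/test_tensorpipe_agent"  # placeholder

# run_test.py's -i/--include restricts the run to the given test module(s)
subprocess.run(
    [sys.executable, "test/run_test.py", "-i", test_module, "--verbose"],
    check=False,
)
```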
|
Test report by @casparvl |
|
@branfosj Thanks for the pointers! Very useful to know that it is probably the underlying BLAS indeed, and not so much this particular version of PyTorch. I actually started off with an MKL-based build, but had trouble there: at the start of the test phase it immediately failed on a symbol that turned out to be part of Intel OpenMP. That surprised me, as the configure output (and any compiler flags I could find in a verbose build) suggested it was using GNU OpenMP. So... that Intel OpenMP symbol is not supposed to be there at all, I believe, but I really couldn't find how it ended up there. Anyway, since most of my users will use it on GPUs, I figured that instead of trying to debug this further, I'd attempt a build against the foss toolchain's BLAS instead, i.e. this PR. |
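As an aside, a quick way to cross-check what a given PyTorch build reports about its BLAS/LAPACK backend and OpenMP setup is the standard `torch.__config__` introspection; a minimal sketch:

```python
# Quick sanity check of what this PyTorch build reports about its BLAS/LAPACK
# backend and its OpenMP/threading setup.
import torch

print(torch.__config__.show())           # build config: BLAS_INFO, LAPACK_INFO, compile flags, ...
print(torch.__config__.parallel_info())  # runtime view of the parallel backend and OpenMP
```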
|
Ah, this just reminded me: the easyconfigs I pointed at are deployed on nodes with a single GPU each, so I won't have done any of the PyTorch tests with multiple GPUs on a single node. This may account for several of the test failures. |
…ch-1.10.0_skip-nccl-error-tests.patch from the EasyConfig. This resolves the test failures I got, and I don't hit the original issues that these patches were made for.
|
After the last commit, the number of failing tests is down to four, by the way. I'll report the errors for the latter three (test_linalg, test_ops and test_quantization) more extensively below, since those were still missing from my overview earlier in this ticket. |
|
Some references for the test_ops and test_quantization failures. |
|
Trying to debug the failing test_linalg tests, I made them print some extra info on the input matrix, and now get a different error than the one the test expects, which is funny. |
|
Indeed, running this results in the "SVD did not converge" error. The reason is that NaN propagation is undefined for BLAS calls: it depends on the LAPACK implementation, and the behaviour changed at some point. It is not yet clear which way the 'solution' would go on the numpy side (e.g. a check for NaNs in the input of the relevant calls). Edit: it seems they specifically test singular matrices as part of this test suite. I have reported the bug and suggested skipping this specific case: pytorch/pytorch#67675 |
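For illustration, a simplified stand-in (not the actual test_linalg input) for the NaN-input behaviour described above:

```python
# Simplified stand-in for the test_linalg situation: what happens when the
# input to an SVD contains NaNs depends on the LAPACK implementation (MKL vs
# OpenBLAS), so the exact error (or lack of one) is not portable.
import numpy as np
import torch

a = torch.randn(4, 4, dtype=torch.float64)
a[0, 0] = float('nan')

try:
    torch.linalg.svd(a)
    print('torch: no error raised (result may contain NaNs)')
except RuntimeError as err:
    print('torch:', err)

try:
    np.linalg.svd(a.numpy())
    print('numpy: no error raised (result may contain NaNs)')
except np.linalg.LinAlgError as err:  # typically "SVD did not converge"
    print('numpy:', err)
```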
|
Ok, current status: a few tests are still failing, but the SVD-related failures don't look like a real problem to me. |
I agree with that. It's just that the error message for inputs with NaNs is different between MKL and OpenBLAS builds; if the other tests pass, then everything is working as expected for SVD. |
|
Test report by @casparvl |
…lues_cpu tests, which were failing because the error raised was different from the expected error. This has no impact on users, since these test cases are not expected to complete successfully anyway. Additionally, add a patch that increases the tolerance for the zero_model_parallel tests so that they also pass when using TensorFloat32, e.g. on A100
|
I've traced the cause of the failures to the use of TensorFloat32, which made the losses for the reference case slightly different and caused the test to fail. I've just updated this PR with a patch that increases the test tolerance accordingly. For more info, see pytorch/pytorch#67764 |
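To illustrate the effect (this is not the actual patch, just a sketch of why the reference losses shift on Ampere GPUs):

```python
# Sketch of the TF32 effect on Ampere GPUs (A100/A30): with TF32 enabled
# (the default for CUDA matmul in PyTorch 1.10), float32 matmuls use reduced
# mantissa precision, so results drift slightly from full float32 and can
# exceed tight test tolerances.
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device='cuda')
    y = torch.randn(1024, 1024, device='cuda')

    torch.backends.cuda.matmul.allow_tf32 = True   # default in 1.10
    res_tf32 = x @ y

    torch.backends.cuda.matmul.allow_tf32 = False  # force full float32 matmul
    res_fp32 = x @ y

    print('max abs difference:', (res_tf32 - res_fp32).abs().max().item())
```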
|
Ok, the only ones still failing at this point are two tests. The strange thing is that the second one works fine if I run it individually later on, so I think it's just some messed-up environment left behind by a previous test that spoils this test. I'd like to try and just skip it, but that's not so easy here. |
Here's the patch to skip that test |
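For readers who can't see the attached patch: roughly speaking, such a skip boils down to adding a skip decorator to the offending test. A hypothetical sketch with placeholder names, not the literal patch:

```python
# Hypothetical sketch of what skipping a single flaky test looks like in a
# unittest-based suite like PyTorch's; the class and test names below are
# placeholders, not the ones touched by the actual patch.
import unittest

class TestSomething(unittest.TestCase):

    @unittest.skip("flaky: fails after other tests, passes when run in isolation")
    def test_flaky_case(self):
        self.fail("never executed")

if __name__ == '__main__':
    unittest.main()
```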
…_consistency_jit_contiguous_cpu_float32. Neither seems to point to a fundamental issue with the build; the problem is probably on the side of the test suite.
|
Test report by @casparvl |
|
Test report by @branfosj |
|
@boegelbot please test @ generoso |
|
@casparvl: Request for testing this PR well received on login1. Test results coming soon (I hope)...
|
@boegelbot please test @ generoso |
|
@boegel: Request for testing this PR well received on login1. Test results coming soon (I hope)...
|
Test report by @boegelbot |
|
Test report by @boegel |
|
Test report by @boegel |
|
Test report by @branfosj |
|
Test report by @branfosj |
This test got disabled on Windows (pytorch/pytorch#40485). Maybe the flakiness is more widespread. |
branfosj left a comment:
And see @boegel's comment at #14233 (comment)
easybuild/easyconfigs/h/Horovod/Horovod-0.23.0-foss-2021a-PyTorch-1.10.0.eb
easybuild/easyconfigs/p/PyTorch/PyTorch-1.10.0_fix-test-dataloader-fixed-affinity.patch
|
Ok, so how do we proceed on this? @boegel, can you add a patch to skip that test? |
|
Test report by @boegel |
|
Test report by @branfosj |
|
Test report by @branfosj |
branfosj left a comment:
lgtm
It passed the second time round on both of my new icelake+gpu nodes (one node with 2xA100 and one with 2xA30), so I'm happy to get this merged. |
|
Going in, thanks @casparvl! |
(created using eb --new-pr)
This PR isn't working (yet): it builds, but some tests fail (at least on my machine). Still, I want to share this EasyConfig publicly, since it may help figure out a) whether these failures are system-specific and b) what could be done to resolve them.