Conversation

@casparvl
Contributor

@casparvl casparvl commented Oct 28, 2021

(created using eb --new-pr)

This PR isn't working (yet): it builds, but some tests fail (at least on my machine).

distributed/optim/test_zero_redundancy_optimizer failed!
distributed/rpc/cuda/test_tensorpipe_agent failed!
distributed/rpc/test_faulty_agent failed!
distributed/rpc/test_tensorpipe_agent failed!
distributed/test_c10d_nccl failed!
test_dataloader failed!
test_linalg failed!
test_ops failed!
test_quantization failed!

Still, I want to share this EasyConfig publicly, since it may help figure out a) whether these failures are system-specific and b) what could be done to resolve them.

…hes: PyTorch-1.10.0_fix-alias-violation-in-bitwise-ops.patch, PyTorch-1.10.0_fix-faulty-asserts-and-skip-test.patch, PyTorch-1.10.0_fix-test-dataloader-fixed-affinity.patch, PyTorch-1.10.0_skip-nccl-error-tests.patch
@casparvl
Contributor Author

@Flamefire Care to have a look? I've tried to figure out which patches are still needed (most of them were made by you in the past). It seems that a lot of what the 1.9 patches addressed has been fixed upstream. However, some of the errors you had back then were also specific to POWER, so not something I can easily test.

Moreover, you seem to know the test suite well. Maybe you can help figure out why some of them are failing...

@branfosj
Member

I needed a PyTorch in 2021a. That was with 1.9.1, as it was before 1.10.0 had been released. I think that list of failures looks like the ones I saw there. My solution was to add imkl into the mix, and most of the test failures went away. This makes me suspect that numpy / OpenBLAS, or the PyTorch code that uses them, is at fault.

See https://github.com/bear-rsg/easybuild-easyconfigs/blob/2021a/easybuild/easyconfigs/p/PyTorch/PyTorch-1.9.1-foss-2021a-CUDA-11.3.1-imkl.eb for the easyconfig we deployed. That additionally disables the test_jit_cuda_fuser tests (with the really helpful ?? comment).

I've not yet tried that on A100 (or any other A*) GPUs. I've just got A30s and A100s available, so I expect I'll be able to test on those soon.

It might also be worth doing a non-CUDA version, as it can help to figure out which errors are GPU-related and which also happen in a CPU-only build.

@casparvl
Contributor Author

casparvl commented Oct 28, 2021

distributed/optim/test_zero_redundancy_optimizer

The following two tests fail here:

ERROR: test_zero_model_parallel_with_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)
ERROR: test_zero_model_parallel_without_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)

Both with a similar traceback:

Check that ZeRO works with model parallelism where layers are sharded
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-2vlkz9m2/tmp7chj1gje/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
    self._join_processes(fn)
  File "/tmp/casparl/eb_tmp/eb-2vlkz9m2/tmp7chj1gje/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/casparl/eb_tmp/eb-2vlkz9m2/tmp7chj1gje/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 682, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-2vlkz9m2/tmp7chj1gje/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 536, in run_test
    getattr(self, test_name)()
  File "/tmp/casparl/eb_tmp/eb-2vlkz9m2/tmp7chj1gje/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 420, in wrapper
    fn()
  File "/tmp/casparl/eb_tmp/eb-2vlkz9m2/tmp7chj1gje/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 111, in wrapper
    return func(*args, **kwargs)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 901, in test_zero_model_parallel_with_bucket_view
    self._test_zero_model_parallel(parameters_as_bucket_view=True)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 884, in _test_zero_model_parallel
    assert torch.allclose(
AssertionError: Losses differ between local optim and ZeroRedundancyOptimizer

distributed/rpc/cuda/test_tensorpipe_agent
distributed/rpc/test_faulty_agent
distributed/rpc/test_tensorpipe_agent
distributed/test_c10d_nccl

These fail with

Running distributed/rpc/cuda/test_tensorpipe_agent ... [2021-10-28 19:49:53.020781]
Executing ['/tmp/sw_stack_gpu/software/Python/3.9.5-GCCcore-10.3.0/bin/python', 'distributed/rpc/cuda/test_tensorpipe_agent.py', '-v'] ... [2021-10-28 19:49:53.020847]
Traceback (most recent call last):
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/rpc/cuda/test_tensorpipe_agent.py", line 15, in <module>
    from torch.testing._internal.distributed.rpc_utils import (
  File "/tmp/casparl/eb_tmp/eb-0z829gkf/tmp36cacabv/lib/python3.9/site-packages/torch/testing/_internal/distributed/rpc_utils.py", line 26, in <module>
    from torch.testing._internal.distributed.rpc.dist_autograd_test import (
  File "/tmp/casparl/eb_tmp/eb-0z829gkf/tmp36cacabv/lib/python3.9/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 2647, in <module>
    class TensorPipeCudaDistAutogradTest(RpcAgentTestFixture):
  File "/tmp/casparl/eb_tmp/eb-0z829gkf/tmp36cacabv/lib/python3.9/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 2695, in TensorPipeCudaDistAutogradTest
    @unittest.skip("Test fails")
NameError: name 'unittest' is not defined
distributed/rpc/cuda/test_tensorpipe_agent failed!
Running distributed/rpc/test_faulty_agent ... [2021-10-28 19:49:53.892147]
Executing ['/tmp/sw_stack_gpu/software/Python/3.9.5-GCCcore-10.3.0/bin/python', 'distributed/rpc/test_faulty_agent.py', '-v'] ... [2021-10-28 19:49:53.892193]
Traceback (most recent call last):
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/rpc/test_faulty_agent.py", line 16, in <module>
    from torch.testing._internal.distributed.rpc_utils import (
  File "/tmp/casparl/eb_tmp/eb-0z829gkf/tmp36cacabv/lib/python3.9/site-packages/torch/testing/_internal/distributed/rpc_utils.py", line 26, in <module>
    from torch.testing._internal.distributed.rpc.dist_autograd_test import (
  File "/tmp/casparl/eb_tmp/eb-0z829gkf/tmp36cacabv/lib/python3.9/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 2647, in <module>
    class TensorPipeCudaDistAutogradTest(RpcAgentTestFixture):
  File "/tmp/casparl/eb_tmp/eb-0z829gkf/tmp36cacabv/lib/python3.9/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 2695, in TensorPipeCudaDistAutogradTest
    @unittest.skip("Test fails")
NameError: name 'unittest' is not defined
distributed/rpc/test_faulty_agent failed!
Running distributed/rpc/test_tensorpipe_agent ... [2021-10-28 19:49:54.763103]
Executing ['/tmp/sw_stack_gpu/software/Python/3.9.5-GCCcore-10.3.0/bin/python', 'distributed/rpc/test_tensorpipe_agent.py', '-v'] ... [2021-10-28 19:49:54.763148]
Traceback (most recent call last):
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/rpc/test_tensorpipe_agent.py", line 16, in <module>
    from torch.testing._internal.distributed.rpc_utils import (
  File "/tmp/casparl/eb_tmp/eb-0z829gkf/tmp36cacabv/lib/python3.9/site-packages/torch/testing/_internal/distributed/rpc_utils.py", line 26, in <module>
    from torch.testing._internal.distributed.rpc.dist_autograd_test import (
  File "/tmp/casparl/eb_tmp/eb-0z829gkf/tmp36cacabv/lib/python3.9/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 2647, in <module>
    class TensorPipeCudaDistAutogradTest(RpcAgentTestFixture):
  File "/tmp/casparl/eb_tmp/eb-0z829gkf/tmp36cacabv/lib/python3.9/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 2695, in TensorPipeCudaDistAutogradTest
    @unittest.skip("Test fails")
NameError: name 'unittest' is not defined
distributed/rpc/test_tensorpipe_agent failed!

...

Traceback (most recent call last):
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/test_c10d_nccl.py", line 2175, in <module>
    class NcclErrorHandlingTest(MultiProcessTestCase):
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/test_c10d_nccl.py", line 2280, in NcclErrorHandlingTest
    @unittest.skip("Broken on recent NCCL")
NameError: name 'unittest' is not defined
distributed/test_c10d_nccl failed!

Patch PyTorch-1.8.1_skip_dist_autograd_sync_streams.patch inserts a @unittest.skip("Test fails") to skip a failing test, which causes the first three tests to trip. PyTorch-1.10.0_skip-nccl-error-tests.patch causes the NCCL test to trip in a similar way. I'll check if the tests still fail if we don't skip them. If not, we can take this out. If they do fail, we probably have to add an import unittest (or something similar) to those modules to make the @unittest.skip decorator work.
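
For reference, a minimal sketch (hypothetical class and test names, not the actual PyTorch patch) of why the NameError appears and what an added import would fix:

import unittest  # this is the line the patched modules are missing


class ExampleTest(unittest.TestCase):
    @unittest.skip("Test fails")  # evaluated at class-definition time, hence the NameError without the import
    def test_something(self):
        self.fail("never runs; the decorator skips it")


if __name__ == "__main__":
    unittest.main()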

The original issue at PyTorch for distributed/rpc/test_tensorpipe_agent is at least still open: pytorch/pytorch#59436 (not sure about the others).

@casparvl
Contributor Author

Test report by @casparvl
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
gcn1.local.snellius.surf.nl - Linux centos linux 8.4.2105, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, Python 3.6.8
See https://gist.github.com/12dfa5f9d1f70920a325722b11e0e3e9 for a full test report.

@boegel boegel added this to the next release (4.5.1?) milestone Oct 31, 2021
@casparvl
Contributor Author

casparvl commented Nov 1, 2021

@branfosj Thanks for the pointers! Very useful to know that it is probably the underlying BLAS indeed, and not so much this particular version of PyTorch.

I actually started off with an mkl-based build, but had trouble there: at the start of the test phase it immediately failed with

== 2021-10-22 18:15:19,537 build_log.py:169 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:124 in __init__): cmd "export PYTHONPATH=/tmp/casparl/eb_tmp/eb-p3pmetjl/tmp8r5a4e8c/lib/python3.9/site-packages:$PYTHONPATH &&  cd test && PYTHONUNBUFFERED=1 /tmp/sw_stack_gpu/software/Python/3.9.5-GCCcore-10.3.0/bin/python run_test.py --continue-through-error  --verbose -x distributed/elastic/utils/distributed_test distributed/elastic/multiprocessing/api_test distributed/test_distributed_fork distributed/test_distributed_spawn test_optim distributed/rpc/test_process_group_agent " exited with exit code 1 and output:
Traceback (most recent call last):
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1-imkl/pytorch/test/run_test.py", line 15, in <module>
    import torch
  File "/tmp/casparl/eb_tmp/eb-p3pmetjl/tmp8r5a4e8c/lib/python3.9/site-packages/torch/__init__.py", line 197, in <module>
    from torch._C import *  # noqa: F403
ImportError: /tmp/casparl/eb_tmp/eb-p3pmetjl/tmp8r5a4e8c/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: __kmpc_global_thread_num
 (at easybuild/tools/run.py:577 in parse_cmd_output)

I figured out that this symbol is part of Intel OpenMP, which surprised me, as the configure output (and any compiler flags I could find in a verbose build) suggested it was using GNU OpenMP:

-- MKL OpenMP type: GNU
-- MKL OpenMP library: -fopenmp

So... that Intel OpenMP symbol is not supposed to be there at all I believe... But I really couldn't find how it ended up there.
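
As a side note, here is a minimal sketch of one way to check which OpenMP runtime a given libtorch_cpu.so references (hedged: assumes a Linux system with binutils' nm available; the library path below is a placeholder, adjust it to your install):

import subprocess

# Placeholder path to the library that showed the undefined symbol
lib = "/path/to/site-packages/torch/lib/libtorch_cpu.so"

# Undefined '__kmpc_*' entries mean the library references Intel OpenMP (libiomp5),
# while 'GOMP_*' entries point at GNU OpenMP (libgomp).
symbols = subprocess.run(["nm", "-D", lib], capture_output=True, text=True).stdout
print([line for line in symbols.splitlines() if "__kmpc_" in line or "GOMP_" in line])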

Anyway, since most of my users will use it on GPUs, I figured instead of trying to debug this further, I'd attempt a build against OpenBLAS first, and see if that gave me fewer issues (when is one ever so lucky, right...?).

@branfosj
Member

branfosj commented Nov 1, 2021

Ah, this just reminded me. The easyconfigs I pointed at are deployed on nodes with a single GPU, so I won't have run any of the PyTorch tests with multiple GPUs on a single node. This may account for several test failures.

…ch-1.10.0_skip-nccl-error-tests.patch from the EasyConfig. This resolves the test failures I got, and I don't hit the original issues that these patches were made for.
@casparvl
Contributor Author

casparvl commented Nov 1, 2021

After the last commit, the number of failing tests is down to these four btw:

distributed/optim/test_zero_redundancy_optimizer failed!
test_linalg failed!
test_ops failed!
test_quantization failed!

I'll report the errors for the latter three more extensively below, since those were still missing from my overview earlier in this ticket:

test_linalg

======================================================================
ERROR: test_cond_cpu_complex128 (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 368, in instantiated_test
    result = test(self, **param_kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_linalg.py", line 1635, in test_cond
    run_test_case(a, p)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_linalg.py", line 1600, in run_test_case
    result_numpy = np.linalg.cond(input.cpu().numpy(), p)
  File "<__array_function__ internals>", line 5, in cond
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 1780, in cond
    r = norm(x, p, axis=(-2, -1)) * norm(invx, p, axis=(-2, -1))
  File "<__array_function__ internals>", line 5, in norm
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 2600, in norm
    ret = _multi_svd_norm(x, row_axis, col_axis, sum)
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 2354, in _multi_svd_norm
    result = op(svd(y, compute_uv=False), axis=-1)
  File "<__array_function__ internals>", line 5, in svd
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 1672, in svd
    s = gufunc(a, signature=signature, extobj=extobj)
  File "/tmp/sw_stack_gpu/software/SciPy-bundle/2021.05-foss-2021a/lib/python3.9/site-packages/numpy/linalg/linalg.py", line 97, in _raise_linalgerror_svd_nonconvergence
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge
...

And similar for

ERROR: test_cond_cpu_complex64 (__main__.TestLinalgCPU)
ERROR: test_cond_cpu_float32 (__main__.TestLinalgCPU)
ERROR: test_cond_cpu_float64 (__main__.TestLinalgCPU)
ERROR: test_norm_extreme_values_cpu (__main__.TestLinalgCPU)

Also

======================================================================
ERROR: test_norm_extreme_values_cpu (__main__.TestLinalgCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 373, in instantiated_test
    raise rte
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 368, in instantiated_test
    result = test(self, **param_kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_linalg.py", line 1886, in test_norm_extreme_values
    result = torch.linalg.norm(x, ord=ord)
RuntimeError: falseINTERNAL ASSERT FAILED at "../aten/src/ATen/native/LinearAlgebraUtils.h":243, please report a bug to PyTorch. svd_cpu: Argument 4 has illegal value. Most certainly there is a bug in the
 implementation calling the backend library.

and

======================================================================
FAIL: test_svd_errors_and_warnings_cpu_complex128 (__main__.TestLinalgCPU)
----------------------------------------------------------------------
RuntimeError: falseINTERNAL ASSERT FAILED at "../aten/src/ATen/native/LinearAlgebraUtils.h":243, please report a bug to PyTorch. svd_cpu: Argument 4 has illegal value. Most certainly there is a bug in the
 implementation calling the backend library.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 368, in instantiated_test
    result = test(self, **param_kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_linalg.py", line 2903, in test_svd_errors_and_warnings
    svd(a)
AssertionError: "The algorithm failed to converge" does not match "falseINTERNAL ASSERT FAILED at "../aten/src/ATen/native/LinearAlgebraUtils.h":243, please report a bug to PyTorch. svd_cpu: Argument 4 ha
s illegal value. Most certainly there is a bug in the implementation calling the backend library."

And similar for

FAIL: test_svd_errors_and_warnings_cpu_complex64 (__main__.TestLinalgCPU)
FAIL: test_svd_errors_and_warnings_cpu_float32 (__main__.TestLinalgCPU)
FAIL: test_svd_errors_and_warnings_cpu_float64 (__main__.TestLinalgCPU)

test_ops

======================================================================
ERROR: test_fn_grad_linalg_det_singular_cpu_complex128 (__main__.TestGradientsCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 373, in instantiated_test
    raise rte
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 368, in instantiated_test
    result = test(self, **param_kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 769, in dep_fn
    return fn(slf, *args, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 734, in test_wrapper
    return test(*args, **kwargs)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_ops.py", line 594, in test_fn_grad
    self._grad_test_helper(device, dtype, op, op.get_op())
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_ops.py", line 579, in _grad_test_helper
    return self._check_helper(device, dtype, op, variant, 'gradcheck', check_forward_ad=check_forward_ad)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_ops.py", line 555, in _check_helper
    self.assertTrue(gradcheck(fn, gradcheck_args,
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 2688, in gradcheck
    return torch.autograd.gradcheck(fn, inputs, **kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/autograd/gradcheck.py", line 1263, in gradcheck
    return _gradcheck_helper(**args)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/autograd/gradcheck.py", line 1276, in _gradcheck_helper
    _gradcheck_real_imag(gradcheck_fn, func, func_out, tupled_inputs, outputs, eps,
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/autograd/gradcheck.py", line 937, in _gradcheck_real_imag
    gradcheck_fn(imag_fn, imag_func_out, tupled_inputs, imag_outputs, eps,
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/autograd/gradcheck.py", line 1167, in _fast_gradcheck
    _check_analytical_numerical_equal(analytical_vJu, numerical_vJu, complex_indices,
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/autograd/gradcheck.py", line 1147, in _check_analytical_numerical_equal
    raise GradcheckError(_get_notallclose_msg(a, n, j, i, complex_indices, test_imag, is_forward_ad) + jacobians_str)
torch.autograd.gradcheck.GradcheckError: While considering the imaginary part of complex outputs only, Jacobian mismatch for output 0 with respect to input 0,
numerical:tensor(-0.2793+0.3037j)
analytical:tensor(-0.1395+0.2513j)

The above quantities relating the numerical and analytical jacobians are computed
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[-0.0219+0.7833j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.1495-0.0239j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0309+0.1527j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0278-0.0132j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.6846+0.9262j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.1514-0.1631j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.1742+0.1487j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0202-0.0404j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-1.2716+0.3531j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.1125+0.2288j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.2305+0.1255j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0350+0.0382j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.3390+0.2314j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0318-0.0727j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0761+0.0296j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0021-0.0160j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.1647+1.2499j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.6506+0.3460j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.7885-0.2146j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0098-0.1311j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0424-0.1595j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0903-0.0340j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.1060+0.0146j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0035+0.0169j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0095+0.5351j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.2578+0.1771j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.3223-0.1283j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0021-0.0558j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.4675-0.1524j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.2312+0.1707j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.2082-0.2413j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0478+0.0186j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.1815+0.0774j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.0600-0.1846j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.1540-0.0686j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.2090+0.0981j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.1259+0.0822j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.0916+0.1162j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0218+0.1267j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.1507+0.0909j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.1704+0.5659j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.5396-0.2165j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.4533+0.2226j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.1756+0.6688j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.3749+0.0656j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.0314-0.3731j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.2592-0.1965j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.4357+0.0923j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.4641+0.4017j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.4311-0.3112j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.3113+0.1094j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0516+0.1457j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  1.3594+0.2350j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -1.1909-0.0989j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.4725+0.5717j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.2684+0.2206j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.5450+0.4964j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.5082-0.3866j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.3771+0.1219j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0576+0.1765j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.7871+1.1698j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.7685-0.9493j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.7560+0.0551j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0268+0.3541j]])
Analytical:
tensor([[-0.0219+0.7833j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.1495-0.0239j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0309+0.1527j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0278-0.0132j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.6846+0.9262j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.1514-0.1631j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.1742+0.1487j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0202-0.0404j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-1.2716+0.3531j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.1125+0.2288j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.2305+0.1255j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0350+0.0382j,  0.0000-0.0000j,  0.0000-0.0000j,  0.0000-0.0000j],
        [ 0.3390+0.2314j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0318-0.0727j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0761+0.0296j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0021-0.0160j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.1647+1.2499j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.6506+0.3460j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.7885-0.2146j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0098-0.1311j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0424-0.1595j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0903-0.0340j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.1060+0.0146j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0035+0.0169j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0095+0.5351j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.2578+0.1771j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.3223-0.1283j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0021-0.0558j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.4675-0.1524j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.2312+0.1707j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.2082-0.2413j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0478+0.0186j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.4641+0.4017j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.4311-0.3112j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.3113+0.1094j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0516+0.1457j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  1.3594+0.2350j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -1.1909-0.0989j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.4725+0.5717j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.2684+0.2206j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.5450+0.4964j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.5082-0.3866j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.3771+0.1219j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0576+0.1765j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.7871+1.1698j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.7685-0.9493j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.7560+0.0551j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0268+0.3541j]])

The max per-element difference (slow mode) is: 0.6914664689360874.

and

======================================================================
FAIL: test_variant_consistency_jit_contiguous_cpu_float32 (__main__.TestJitCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 368, in instantiated_test
    result = test(self, **param_kwargs)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_device_type.py", line 734, in test_wrapper
    return test(*args, **kwargs)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_ops.py", line 813, in test_variant_consistency_jit
    self.assertAutodiffNode(traced_fn.last_graph, op.assert_autodiffed, nonfusible_nodes, fusible_nodes)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_jit.py", line 281, in assertAutodiffNode
    self.assertEqual(should_autodiff_node,
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 1948, in assertEqual
    super().assertTrue(x == y, msg=msg)
AssertionError: False is not true :
Failure in testing nodes' autodifferentiation. One or more nodes were not expected to be autodiffed but were found in a DifferentiableGraph or in a FusionGroup of a DifferentiableGraph. Did you intend for these nodes to be autodiffed? If so, change this test to expect autodifferentiation.
Specifically:
  ['aten::contiguous', 'aten::contiguous'] were not expected to be in one of the DifferentiableGraphs but were.

test_quantization

======================================================================
FAIL: test_lstm (quantization.bc.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_quantized.py", line 161, in test_fn
    qfunction(*args, **kwargs)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/quantization/bc/test_backward_compatibility.py", line 273, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/quantization/bc/test_backward_compatibility.py", line 78, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 1945, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/tmp/casparl/eb_tmp/eb-zicdqucj/tmp5xk49djn/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py", line 1877, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).

@casparvl
Contributor Author

casparvl commented Nov 1, 2021

Some references:
test_linalg
My bet is these are related to numpy's SVD issues with the version of OpenBLAS used in this toolchain. See potentially related links:
numpy/numpy#18914
numpy/numpy#19158
OpenMathLib/OpenBLAS#3255

test_ops

test_quantization
This is an error that was already known from 1.9.0:
pytorch/pytorch#59098
#13237 (comment)
I probably accidentally removed the patch from this EasyConfig, I'll put it back.

@casparvl
Contributor Author

casparvl commented Nov 1, 2021

Trying to debug the failing test_linalg tests, I made them print some extra info on the input matrix and on p (the order of the norm). Around line 1613 in test_linalg.py, I changed the code to:

        def run_test_case(input, p):
            result = torch.linalg.cond(input, p)
            try:
                result_numpy = np.linalg.cond(input.cpu().numpy(), p)
            except:
                print(f"Failed to compute result_numpy with input={input}, p={p}, input.cpu().numpy()={input.cpu().numpy()}")
            self.assertEqual(result, result_numpy, rtol=1e-2, atol=self.precision, exact_dtype=False)
            self.assertEqual(result.shape, result_numpy.shape)

And now get

test_cond_cpu_complex128 (__main__.TestLinalgCPU) ... Failed to compute result_numpy with input=tensor([[1.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j]], dtype=torch.complex128), p=nuc, input.cpu().numpy()=[[1.+0.j 0.+0.j 0.+0.j]
 [0.+0.j 1.+0.j 0.+0.j]
 [0.+0.j 0.+0.j 0.+0.j]]
ERROR
test_cond_cpu_complex64 (__main__.TestLinalgCPU) ... Failed to compute result_numpy with input=tensor([[1.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j]]), p=nuc, input.cpu().numpy()=[[1.+0.j 0.+0.j 0.+0.j]
 [0.+0.j 1.+0.j 0.+0.j]
 [0.+0.j 0.+0.j 0.+0.j]]
ERROR
test_cond_cpu_float32 (__main__.TestLinalgCPU) ... Failed to compute result_numpy with input=tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 0.]]), p=nuc, input.cpu().numpy()=[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 0.]]
ERROR
test_cond_cpu_float64 (__main__.TestLinalgCPU) ... Failed to compute result_numpy with input=tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 0.]], dtype=torch.float64), p=nuc, input.cpu().numpy()=[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 0.]]
ERROR

That's funny, because according to the latest docs, numpy.linalg.cond doesn't take 'nuc' as an option for p: https://numpy.org/doc/stable/reference/generated/numpy.linalg.cond.html
Maybe this is the problem...?

@casparvl
Contributor Author

casparvl commented Nov 2, 2021

Indeed, running

import numpy as np

input = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]])
np.linalg.cond(input, 'nuc')

results in the "SVD did not converge" error. The reason is that NaN propagation is undefined for BLAS/LAPACK calls: it depends on the LAPACK implementation, and the behavior changed with OpenBLAS 0.3.15 according to numpy/numpy#18914.
According to that issue, the error is now expected when the input array contains NaN. But since np.linalg.cond also does a matrix inversion in an intermediate step, the same goes for non-invertible input matrices: the inverse of a non-invertible matrix contains NaNs, and np.linalg.norm is then called on this resulting inverse, which triggers the same error.
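
A minimal sketch of the second half of that chain (hedged: the exact outcome depends on the LAPACK backend; with OpenBLAS >= 0.3.15 the norm raises instead of propagating NaN):

import numpy as np

# Stand-in for the NaN-filled intermediate inverse that np.linalg.cond produces internally
m = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 0., np.nan]])
try:
    np.linalg.norm(m, 'nuc')  # the nuclear norm is computed via an SVD of m
except np.linalg.LinAlgError as e:
    print(e)  # "SVD did not converge" with the OpenBLAS in this toolchain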

While it is not yet clear which way the 'solution' will go on the numpy side (a check for NaNs in the input of e.g. np.linalg.norm and np.linalg.cond would be expensive, but could be used to 'force' NaN propagation regardless of the underlying LAPACK implementation), it seems the outcome will be 'document that the behavior is undefined when the input contains NaNs or is non-invertible'. In that case, the easiest solution is probably to make sure the PyTorch test gets an invertible input. That only makes sense if this test wasn't meant to specifically test the behavior on non-invertible matrices...

Edit: it seems they specifically test singular matrices as part of this test suite. I have reported the bug and suggested skipping this specific case in pytorch/pytorch#67675.

@casparvl
Contributor Author

casparvl commented Nov 2, 2021

Ok, current status:

Still failing:

distributed/optim/test_zero_redundancy_optimizer

  • ERROR: test_zero_model_parallel_with_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)
  • ERROR: test_zero_model_parallel_without_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)

test_ops

  • ERROR: test_fn_grad_linalg_det_singular_cpu_complex128 (__main__.TestGradientsCPU)
  • FAIL: test_variant_consistency_jit_contiguous_cpu_float32 (__main__.TestJitCPU)

test_linalg

  • ERROR: test_norm_extreme_values_cpu (__main__.TestLinalgCPU)
  • FAIL: test_svd_errors_and_warnings_cpu_complex64 (__main__.TestLinalgCPU)
  • FAIL: test_svd_errors_and_warnings_cpu_float32 (__main__.TestLinalgCPU)
  • FAIL: test_svd_errors_and_warnings_cpu_float64 (__main__.TestLinalgCPU)

The svd_errors_and_warnings failures from test_linalg are probably not an issue: these tests check what happens when calling torch.svd(a) on a matrix that contains NaNs. The call is supposed to fail; it's just that we get a different error than expected (see the ticket at pytorch/pytorch#67693). I may just implement a patch that skips these tests, as it doesn't matter that much how they fail.
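
For context, a hedged reproduction sketch of what these tests exercise (not taken from the test suite itself); the call is expected to fail either way, only the error text differs per LAPACK backend:

import torch

a = torch.full((3, 3), float('nan'))
try:
    torch.svd(a)
except RuntimeError as e:
    # MKL builds reportedly complain that the algorithm failed to converge;
    # this OpenBLAS build raises the internal assert quoted above instead.
    print(e)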

@IvanYashchuk

The svd_errors_and_warnings failures from test_linalg are probably not an issue: these tests check what happens when calling torch.svd(a) on a matrix that contains NaNs. The call is supposed to fail; it's just that we get a different error than expected (see the ticket at pytorch/pytorch#67693). I may just implement a patch that skips these tests, as it doesn't matter that much how they fail.

I agree with that. It's just that the error message for inputs with NaNs differs between MKL and OpenBLAS builds; if the other tests pass, then everything is working as expected for SVD.

@casparvl
Contributor Author

casparvl commented Nov 2, 2021

Test report by @casparvl
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
gcn1 - Linux centos linux 8.4.2105, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, Python 3.6.8
See https://gist.github.com/c250fc68509cccbd2034d989b95b75ea for a full test report.

…lues_cpu tests, which were failing because a different error than expected was raised. This has no impact on users, since these test cases are not expected to complete successfully anyway. Additionally, add a patch that increases the tolerance for the zero_model_parallel tests so that they also pass when using TensorFloat32, e.g. on A100
@casparvl
Contributor Author

casparvl commented Nov 3, 2021

I've traced down the cause of the

ERROR: test_zero_model_parallel_with_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)
ERROR: test_zero_model_parallel_without_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)

failures to the use of TensorFloat32, which makes the losses differ slightly from the reference case and causes the comparison to fail. I've just updated this PR with a patch that increases the test tolerance (PyTorch-1.10.0_increase_zero_optimizer_test_tolerance.patch).

More info, see pytorch/pytorch#67764
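
For illustration, a small sketch of the TF32 effect described above (hedged: it requires an Ampere-class GPU such as an A100; on older GPUs the two results should be identical, since the TF32 setting has no effect there):

import torch

a = torch.randn(1024, 1024, device='cuda')
b = torch.randn(1024, 1024, device='cuda')

torch.backends.cuda.matmul.allow_tf32 = True   # the default in PyTorch 1.10
tf32 = a @ b
torch.backends.cuda.matmul.allow_tf32 = False  # force full-precision FP32 matmuls
fp32 = a @ b
print((tf32 - fp32).abs().max())  # non-zero on Ampere: enough to trip a tight allclose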

@casparvl
Contributor Author

casparvl commented Nov 3, 2021

Ok, the only ones still failing at this point are:
test_ops

ERROR: test_fn_grad_linalg_det_singular_cpu_complex128 (__main__.TestGradientsCPU)
FAIL: test_variant_consistency_jit_contiguous_cpu_float32 (__main__.TestJitCPU)

The strange thing is that the second test works fine if I run it individually afterwards:

python -m unittest test_ops.TestJitCPU.test_variant_consistency_jit_contiguous_cpu_float32 -v
test_variant_consistency_jit_contiguous_cpu_float32 (test_ops.TestJitCPU) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.144s

OK

So I think it's just some messed-up environment from a previous test that spoils this one. I'd like to just skip it, but that's not so easy in the test_ops test suite: it's very generic and loops over all kinds of possible combinations, and it's unclear to me how to take one out...

For the test_fn_grad_linalg_det_singular_cpu_complex128 failure: let's see what comes back in pytorch/pytorch#67767.

@IvanYashchuk

I'd like to try and just skip it, but it's not so easy in the test_ops test suite: it's very generic and loops over all kinds of possible combinations, but it's unclear to me how to take one out...

Here's the patch to skip that test

diff --git a/torch/testing/_internal/common_methods_invocations.py b/torch/testing/_internal/common_methods_invocations.py
index 41abeb73f2..64060540fd 100644
--- a/torch/testing/_internal/common_methods_invocations.py
+++ b/torch/testing/_internal/common_methods_invocations.py
@@ -6448,7 +6448,10 @@ op_db: List[OpInfo] = [
            supports_forward_ad=True,
            autodiff_fusible_nodes=['aten::contiguous'],
            assert_jit_shape_analysis=True,
-           supports_out=False),
+           supports_out=False,
+           skips=(
+               DecorateInfo(unittest.skip("Skipped!"), 'TestJit', 'test_variant_consistency_jit', device_type='cpu'),
+           )),
     OpInfo('symeig',
            dtypes=floating_and_complex_types(),
            check_batched_gradgrad=False,

…_consistency_jit_contiguous_cpu_float32. Both don't seem to point to fundamental issues with the build, but the problem is probably on the side of the test suite.
@casparvl
Contributor Author

casparvl commented Nov 4, 2021

Test report by @casparvl
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gcn1 - Linux centos linux 8.4.2105, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, Python 3.6.8
See https://gist.github.com/258acdd91db0e564320598fbe689b356 for a full test report.

@branfosj
Member

branfosj commented Nov 4, 2021

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0212u15b.bear.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), 1 x NVIDIA Tesla P100-PCIE-16GB, 460.32.03, Python 3.6.8
See https://gist.github.com/e7a9977e8d03f495b7cf07eb9604d770 for a full test report.

@casparvl
Contributor Author

casparvl commented Nov 6, 2021

@boegelbot please test @ generoso

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=14233 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_14233 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7220

Test results coming soon (I hope)...

- notification for comment with ID 962445004 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel boegel changed the title {devel}[foss/2021a] PyTorch v1.10.0 w/ Python 3.9.5 {devel}[foss/2021a] PyTorch v1.10.0 w/ Python 3.9.5 + CUDA-11.3.1 Nov 10, 2021
@boegel
Member

boegel commented Nov 10, 2021

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=14233 EB_ARGS= /opt/software/slurm/bin/sbatch --job-name test_PR_14233 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 7252

Test results coming soon (I hope)...

- notification for comment with ID 965017888 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel boegel changed the title {devel}[foss/2021a] PyTorch v1.10.0 w/ Python 3.9.5 + CUDA-11.3.1 {devel}[foss/2021a] PyTorch v1.10.0, torchvision v0.11.1, Horovod v0.23.0 w/ Python 3.9.5 + CUDA-11.3.1 Nov 10, 2021
@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
cnx2 - Linux rocky linux 8.4, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/f29846bd426b9947f0cd29428d4a031c for a full test report.

@boegel
Member

boegel commented Nov 10, 2021

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (3 easyconfigs in total)
node3301.joltik.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz (cascadelake), 1 x NVIDIA NVIDIA Tesla V100-SXM2-32GB, 465.19.01, Python 3.6.8
See https://gist.github.com/b81117eab9a72f425f35c44205046214 for a full test report.

@boegel
Member

boegel commented Nov 10, 2021

Test report by @boegel
FAILED
Build succeeded for 26 out of 29 (3 easyconfigs in total)
node3908.accelgor.os - Linux RHEL 8.4, x86_64, AMD EPYC 7413 24-Core Processor (zen2), 1 x NVIDIA NVIDIA A100-SXM-80GB, 470.57.02, Python 3.6.8
See https://gist.github.com/81eddd7f2865e3e02d4999b65c61fec2 for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
bear-pg0103u01a.bear.cluster - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 2 x NVIDIA NVIDIA A100-PCIE-40GB, 470.57.02, Python 3.6.8
See https://gist.github.com/d43d36f86f6c2fd16761ad2f99148032 for a full test report.

@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 0 out of 3 (3 easyconfigs in total)
bear-pg0103u04a.bear.cluster - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 2 x NVIDIA NVIDIA A30, 470.57.02, Python 3.6.8
See https://gist.github.com/31f89b56fe642515088d6b7661b0f6f5 for a full test report.

@branfosj
Member

test_mem_leak (__main__.TestProfilerCUDA)
Checks that there's no memory leak when using profiler with CUDA ... FAIL

======================================================================
FAIL: test_mem_leak (__main__.TestProfilerCUDA)
Checks that there's no memory leak when using profiler with CUDA
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/test_profiler.py", line 57, in test_mem_leak
    self.assertTrue(not (is_increasing and max_diff > 100 * 1024),
AssertionError: False is not true : memory usage is increasing, deque([5323255808, 5323259904, 5323767808, 5323821056, 5323915264], maxlen=5)

----------------------------------------------------------------------
Ran 14 tests in 8.451s

FAILED (failures=1)
test_profiler failed!

This test got disabled on Windows (pytorch/pytorch#40485). Maybe the flakiness is more widespread.

@branfosj
Member

@boegel Your failure was also in test_profiler. Did you see the same issue or a different one?

@boegel
Member

boegel commented Nov 11, 2021

@branfosj I'll need to reproduce the failure to take a closer look; I'll report back.

Perhaps related to #13791, so I'm inclined to skip test_profiler...

Member

@branfosj branfosj left a comment

And see @boegel comment at #14233 (comment)

@casparvl
Contributor Author

casparvl commented Nov 11, 2021

Ok, so how do we proceed on this? @boegel can you add a patch to skip that test?

@boegel
Member

boegel commented Nov 11, 2021

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3908.accelgor.os - Linux RHEL 8.4, x86_64, AMD EPYC 7413 24-Core Processor (zen2), 1 x NVIDIA NVIDIA A100-SXM-80GB, 470.57.02, Python 3.6.8
See https://gist.github.com/052982eed17fdb4674929112fbaeb808 for a full test report.

@easybuilders easybuilders deleted a comment from boegelbot Nov 11, 2021
@boegel
Member

boegel commented Nov 11, 2021

Ok, so how do we proceed on this? @boegel can you add a patch to skip that test?

@casparvl It seems like the failing test was a fluke, at least it was for me; see the latest successful test report...

@branfosj Is the problem persistent for you? If so, which CUDA driver version are you running?

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
bear-pg0103u04a.bear.cluster - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 2 x NVIDIA NVIDIA A30, 470.57.02, Python 3.6.8
See https://gist.github.com/643a0139a5089a645c39d215dd4d7356 for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
bear-pg0103u01a.bear.cluster - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 2 x NVIDIA NVIDIA A100-PCIE-40GB, 470.57.02, Python 3.6.8
See https://gist.github.com/40d7d700d871f2acfc1f23a22721f83e for a full test report.

Member

@branfosj branfosj left a comment

lgtm

@branfosj
Member

Ok, so how do we proceed on this? @boegel can you add a patch to skip that test?

@casparvl It seems like the failing test was a fluke, at least it was for me; see the latest successful test report...

@branfosj Is the problem persistent for you? If so, which CUDA driver version are you running?

It passed the second time round on both of my new icelake+GPU nodes (one node with 2xA100 and one with 2xA30). So I'm happy to get this merged.

@branfosj
Member

Going in, thanks @casparvl!
