Conversation

@Flamefire
Contributor

(created using eb --new-pr)

…da-2020b.eb and patches: PyTorch-1.9.0_avoid-failures-in-test_unary_ufuncs.patch, PyTorch-1.9.0_fix-testnn-on-A100.patch, PyTorch-1.9.0_fix-use-after-destruct-in-cudaipctypes.patch, PyTorch-1.9.0_fix-vsx-vector-functions.patch, PyTorch-1.9.0_increase_test_cuda_tolerance.patch, PyTorch-1.9.0_increase-tolerance-for-distributed-tests.patch, PyTorch-1.9.0_limit_world_size_for_zero_redundancy_opt_test.patch, PyTorch-1.9.0_skip-nccl-error-tests.patch
@branfosj added this to the 4.x milestone Jun 22, 2021
@branfosj
Member

I've set off some test reports.

I did an initial test of the foss version, but forgot the upload-test-report option. In that run I saw the test_lstm failure in test_quantization.py. Originally reported at pytorch/pytorch#43209 and now tracked in pytorch/pytorch#59098.

@terjekv
Collaborator

terjekv commented Jun 22, 2021

Test report by @terjekv
FAILED
Build succeeded for 1 out of 3 (2 easyconfigs in total)
ninhursaga.uio.no - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (cascadelake), Python 3.6.8
See https://gist.github.com/c93ed8c226efc0af6100ce7e466864ba for a full test report.

@branfosj
Member

branfosj commented Jun 22, 2021

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0309u35a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/4ce9516a3230068dccd42e05e7ced77d for a full test report.

Edit: Failure in test_quantization.

======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-u4w1o_z0/tmp1ednt2sm/lib/python3.8/site-packages/torch/testing/_internal/common_quantized.py", line 161, in test_fn
    qfunction(*args, **kwargs)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/PyTorch/1.9.0/foss-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 230, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/PyTorch/1.9.0/foss-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 76, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/scratch-local/branfosj-admin/eb-u4w1o_z0/tmp1ednt2sm/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1388, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/scratch-local/branfosj-admin/eb-u4w1o_z0/tmp1ednt2sm/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).

----------------------------------------------------------------------

@branfosj
Member

branfosj commented Jun 22, 2021

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0308u26a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/7811093d15e69fe464776610699113a2 for a full test report.

Edit: Failure in test_quantization.

======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_quantized.py", line 161, in test_fn
    qfunction(*args, **kwargs)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/PyTorch/1.9.0/fosscuda-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 230, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/PyTorch/1.9.0/fosscuda-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 76, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1388, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).

----------------------------------------------------------------------

Failure in distributed/rpc/cuda/test_tensorpipe_agent.

======================================================================
ERROR: test_devices_option_mismatch (__main__.TensorPipeTensorPipeAgentCudaRpcTestWithSpawn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 391, in wrapper
    self._join_processes(fn)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 583, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 626, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 509, in run_test
    getattr(self, test_name)()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 393, in wrapper
    fn()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5878, in test_devices_option_mismatch
    rpc.init_rpc(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 203, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 237, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 99, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 294, in _tensorpipe_init_backend_handler
    _tensorpipe_check_local_device_maps(name, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 215, in _tensorpipe_check_local_device_maps
    raise ValueError(
ValueError: Invalid device in TensorPipe options on worker0:
device_maps = {'worker1': {device(type='cuda', index=0): device(type='cuda', index=0)}},
devices = [device(type='cuda', index=1)]

Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 509, in run_test
    getattr(self, test_name)()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 393, in wrapper
    fn()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5878, in test_devices_option_mismatch
    rpc.init_rpc(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 203, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 237, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 99, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 294, in _tensorpipe_init_backend_handler
    _tensorpipe_check_local_device_maps(name, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 215, in _tensorpipe_check_local_device_maps
    raise ValueError(
ValueError: Invalid device in TensorPipe options on worker1:
device_maps = {'worker2': {device(type='cuda', index=0): device(type='cuda', index=0)}},
devices = [device(type='cuda', index=1)]



======================================================================
ERROR: test_devices_option_mismatch_reverse (__main__.TensorPipeTensorPipeAgentCudaRpcTestWithSpawn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 391, in wrapper
    self._join_processes(fn)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 583, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 626, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 509, in run_test
    getattr(self, test_name)()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 393, in wrapper
    fn()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5903, in test_devices_option_mismatch_reverse
    rpc.init_rpc(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 203, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 237, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 99, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 316, in _tensorpipe_init_backend_handler
    _tensorpipe_check_remote_device_maps(agent, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 254, in _tensorpipe_check_remote_device_maps
    check_one_worker(worker_name, worker_device_maps, all_device_counts)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 243, in check_one_worker
    raise ValueError(
ValueError: Invalid device_map configuration on worker0 for worker1, remote device out of range:
device_maps = {'worker1': {device(type='cuda', index=0): device(type='cuda', index=1)}}

Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 509, in run_test
    getattr(self, test_name)()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 393, in wrapper
    fn()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5903, in test_devices_option_mismatch_reverse
    rpc.init_rpc(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 203, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 237, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 99, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 316, in _tensorpipe_init_backend_handler
    _tensorpipe_check_remote_device_maps(agent, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 254, in _tensorpipe_check_remote_device_maps
    check_one_worker(worker_name, worker_device_maps, all_device_counts)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 243, in check_one_worker
    raise ValueError(
ValueError: Invalid device_map configuration on worker0 for worker1, remote device out of range:
device_maps = {'worker1': {device(type='cuda', index=0): device(type='cuda', index=1)}}

@Flamefire
Contributor Author

@branfosj Thanks for the reminder. I don't really have a Cascadelake machine easily available. I added a patch to skip that test. Can you try it?

@terjekv Something seems odd with your build machine:

RuntimeError: In operator() at tensorpipe/common/ibv.h:151 "": Cannot allocate memory
distributed/pipeline/sync/test_transparency.py::test_simple_linears libi40iw-i40iw_ucreate_cq: failed to pin memory for CQ

@terjekv
Collaborator

terjekv commented Jun 23, 2021

I don't quite understand how. The machine monitoring suggests that free memory never dropped below 30 GB or so. Could a spike have eaten that? :(

@Flamefire
Contributor Author

I don't quite understand how

@terjekv No idea. But it seems to be related to InfiniBand: it is a call to ibv_create_cq (dynamically loaded from libibverbs.so.1) that fails, at least in one case.

Maybe just a fluke?

@terjekv
Collaborator

terjekv commented Jun 23, 2021

I hope so. Could it be the IB stack on RHEL 8 causing issues? Rerunning to see.

@terjekv
Collaborator

terjekv commented Jun 23, 2021

Same error. Kinda iffy. The box does not have any IB hardware, but has:

$ rpm -qa | grep ibverbs
libibverbs-32.0-4.el8.x86_64

@Flamefire
Contributor Author

Same error. Kinda iffy. The box does not have any IB hardware, but has:

Then that setup is maybe just unsupported. Please open an issue in the PyTorch repo to get them to clarify that.

@verdurin
Member

Test report by @verdurin
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
easybuild-c7.novalocal - Linux centos linux 7.9.2009, x86_64, Intel Xeon Processor (Skylake, IBRS), Python 3.6.8
See https://gist.github.com/ce6e347ad636a9e190afb1d95419d2ce for a full test report.

@terjekv
Collaborator

terjekv commented Jun 23, 2021

Test report by @terjekv
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
ninhursaga.uio.no - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (cascadelake), Python 3.6.8
See https://gist.github.com/8ad20fe90fce1ef2e2054bc9b62513f7 for a full test report.

@branfosj
Member

branfosj commented Jun 23, 2021

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0308u26a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/ae6843b248484549adf3a5782796db46 for a full test report.

Edit: Something odd is going on with /tmp on that system, which is causing the failure.

@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0308u26a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/6c165029fd2fa326af5ea23b03de734b for a full test report.

@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0308u29a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/1ce6af2cdbe1afb582917f2515933542 for a full test report.

@Flamefire
Contributor Author

@branfosj Can you check what exactly failed there? It still says test_quantization, but that part of the log is missing. And test_lstm should be fixed...

@branfosj
Member

I'm confused, as the failure is the same as before:

======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/tmp-branfosj-admin-up/eb-se9g9n5f/tmpgv5jwrj6/lib/python3.8/site-packages/torch/testing/_internal/common_quantized.py", line 161, in test_fn
    qfunction(*args, **kwargs)
  File "/dev/shm/build-branfosj-admin-up/PyTorch/1.9.0/foss-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 232, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/dev/shm/build-branfosj-admin-up/PyTorch/1.9.0/foss-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 77, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/dev/shm/tmp-branfosj-admin-up/eb-se9g9n5f/tmpgv5jwrj6/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1388, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/dev/shm/tmp-branfosj-admin-up/eb-se9g9n5f/tmpgv5jwrj6/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).

----------------------------------------------------------------------

@branfosj
Member

I'm setting off a build so I can do some debugging on why that test is not being skipped.

@Flamefire
Contributor Author

I didn't actually skip it but marked it as an expected failure: 74c8c32#diff-de8a49a66cde6fe5e687dac5c654f196ebd0110d55a739b1f57095131f5d471cR22

Not sure why this isn't working.

@Flamefire
Contributor Author

@branfosj Found it. Currently uploading a new patch. The mark must come first -.-
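
(For anyone hitting the same issue, a minimal sketch of why the order matters; `override_qengines` here is a hypothetical stand-in for PyTorch's own test decorators. unittest records the expected-failure flag as an attribute on the callable it finally sees, so a mark hidden behind a wrapper that does not copy attributes is lost. Listing `@unittest.expectedFailure` first makes it the outermost decorator, so the flag survives.)

```python
import unittest


def override_qengines(fn):
    # Hypothetical stand-in for a PyTorch test decorator that wraps the test
    # without functools.wraps, and therefore drops any attributes set on fn.
    def wrapper(self, *args, **kwargs):
        return fn(self, *args, **kwargs)
    return wrapper


class TestSerialization(unittest.TestCase):
    @unittest.expectedFailure  # listed first => applied last, so the flag ends
    @override_qengines         # up on the callable that unittest inspects
    def test_lstm(self):
        self.assertTrue(False)  # stands in for the failing tolerance check


if __name__ == "__main__":
    unittest.main()  # reports "OK (expected failures=1)" with this ordering
```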

@branfosj
Member

@branfosj Found it. Currently uploading a new patch. The mark must come first -.-

Thanks. I'll set off a test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0206u03a.bear.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/83aabfc0ac2ec8bb05232b61e1646237 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8032 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/34655e94e8bf78f9dafb37f20c51a827 for a full test report.

@Flamefire
Contributor Author

@branfosj

FAILED (skipped=24, unexpected successes=1)
test_quantization failed!

I give up and will just skip that test.
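
For reference, a rough sketch of what skipping it at the easyconfig level could look like; this assumes the PyTorch easyblock's `excluded_tests` parameter, and the exact entry is illustrative rather than the final change in this PR:

```python
# Sketch of an easyconfig snippet: the PyTorch easyblock takes excluded_tests
# as a dict keyed by architecture, with '' applying to all architectures.
excluded_tests = {
    '': [
        # flaky test_lstm tolerance failure, tracked in pytorch/pytorch#59098
        'test_quantization',
    ]
}
```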

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0206u03a.bear.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/e16eb2c0e470f09b733a87fdc0c23db0 for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bber0501u03a.bb2.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/2e48e045dd2cb324878128df27a8cfa3 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8032 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/72eb8c9b63bb015a528adf89a167ce5b for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bask-pg0308u31a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/50dec1b4b4112d43ffeeee7402a2ce27 for a full test report.

@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0308u30a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/a5adcff00ca81a93de9065159d179a86 for a full test report.

@Flamefire
Contributor Author

@branfosj

distributed/test_c10d_nccl failed!

Not sure here... While I do like the option to run all tests and report failures afterwards, it makes these test reports hard to read, since the error happens much earlier in the log. I think I should add some post-processing to the easyblock so that it shows the actual failures.
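
A minimal sketch of that kind of post-processing (not the actual easyblock change; it assumes the standard unittest-style output format with lines of 70 `=` characters before each failure):

```python
import re

# Sketch: pull the "FAIL:"/"ERROR:" blocks out of the captured test-suite
# output so the report shows the real failures instead of only the summary.
FAILURE_BLOCK = re.compile(
    r"^={70}\n(?:FAIL|ERROR): .*?(?=^={70}|\Z)",
    re.MULTILINE | re.DOTALL,
)


def extract_failures(log_text):
    """Return each traceback block that follows a FAIL:/ERROR: header."""
    return FAILURE_BLOCK.findall(log_text)


# Usage (hypothetical log file name):
# for block in extract_failures(open("pytorch-test-output.log").read()):
#     print(block)
```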

@branfosj
Member

The failed test was with 4 GPUs. We are back to NCCL errors:

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: uncorrectable NVLink error detected during the execution

Full output from distributed/test_c10d_nccl: https://gist.github.com/branfosj/fd219f55aca2e9f48c1c4cb92531afc8

@Flamefire
Contributor Author

Yes, but this time it is test_broadcast_coalesced_nccl, which I haven't seen failing before. Do you use a different NCCL? Have you rebuilt NCCL with that patch already (not sure if that happens automatically)?

@branfosj
Member

NCCL is NCCL-2.8.3-GCCcore-10.2.0-CUDA-11.1.1.eb with NCCL-2.8.3_fix-isend-irecv.patch, built from #13071.

@Flamefire
Contributor Author

Hm, I can't see anything wrong with this test, so I'd rather not disable it. Maybe it was just a fluke?
Due to the asynchronous nature of CUDA errors, the stack trace doesn't say anything about the actual source of the error.
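
One way to get a more useful stack trace next time (a debugging aside, not part of this PR) is to force synchronous CUDA launches, so the error is raised at the offending call site instead of at some later synchronization point:

```python
# Debugging sketch: CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous,
# so a CUDA error surfaces where it actually happened.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialised

import torch  # import (and run the test) only after setting the variable
```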

@branfosj
Member

branfosj commented Jun 25, 2021

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bask-pg0309u05a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/f3bbe8f21aaf4e1f9eb5367d78d20907 for a full test report.

Edit: Test with 2 GPUs

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusa7 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/2194f5d7b97d9714ac453f72d0b05a04 for a full test report.

@branfosj
Member

branfosj commented Jun 25, 2021

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bask-pg0309u05a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/737a9bc8ec068d208504e4ddfc7f5573 for a full test report.

Edit: Test with 1 GPU

@branfosj
Member

branfosj commented Jun 26, 2021

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bask-pg0309u10a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/609b088ae4b5075e8f2c086eba1535c6 for a full test report.

Edit: Test with 4 GPUs

@branfosj
Member

Hm, I can't see anything wrong with this test, so I'd rather not disable it. Maybe it was just a fluke?
Due to the asynchronous nature of CUDA errors, the stack trace doesn't say anything about the actual source of the error.

I've run it using all 4 GPUs three times. So the first one was either a fluke failure or the test is flaky.

@Flamefire
Contributor Author

The test shouldn't really be flaky from what I can tell; I guess NCCL is the problem here. Hence I'm hesitant to disable this test. FWIW, I reran this test file 5 times without any failure.
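
For anyone wanting to repeat such a rerun, a minimal sketch (the path and repeat count are illustrative, assuming an unpacked PyTorch source tree):

```python
# Sketch for rerunning one PyTorch test file a few times to check for flakiness.
import subprocess
import sys

for attempt in range(1, 6):
    result = subprocess.run(
        [sys.executable, "test/distributed/test_c10d_nccl.py", "-v"],
        cwd="pytorch",  # assumed location of the unpacked PyTorch sources
    )
    print(f"attempt {attempt}: exit code {result.returncode}")
    if result.returncode != 0:
        break  # stop on the first failure so the log stays readable
```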

My proposal: include the easyconfig as-is and keep the potential failure as a kind of warning that PyTorch might fail in the given environment.

@boegel
Member

boegel commented Jun 28, 2021

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node3159.skitty.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/8eea43094a796743710c9bfb5ad6d581 for a full test report.

@branfosj
Member

@branfosj left a comment


lgtm

@branfosj modified the milestones: 4.x, next release (4.4.1) Jun 30, 2021
@branfosj
Member

Going in, thanks @Flamefire!

@branfosj merged commit 9f74b08 into easybuilders:develop Jun 30, 2021
@Flamefire deleted the 20210622165340_new_pr_PyTorch190 branch July 1, 2021 08:56