Conversation

@Flamefire
Contributor

(created using eb --new-pr)

…da-2020b.eb and patches: PyTorch-1.9.0_avoid-failures-in-test_unary_ufuncs.patch, PyTorch-1.9.0_fix-testnn-on-A100.patch, PyTorch-1.9.0_fix-use-after-destruct-in-cudaipctypes.patch, PyTorch-1.9.0_fix-vsx-vector-functions.patch, PyTorch-1.9.0_increase_test_cuda_tolerance.patch, PyTorch-1.9.0_increase-tolerance-for-distributed-tests.patch, PyTorch-1.9.0_limit_world_size_for_zero_redundancy_opt_test.patch, PyTorch-1.9.0_skip-nccl-error-tests.patch
@branfosj added this to the 4.x milestone Jun 22, 2021
@branfosj
Member

I've set off some test reports.

I did an initial test of the foss version, but forgot the upload-test-report option. In that run I saw the test_lstm failure in test_quantization.py. Originally reported at pytorch/pytorch#43209 and now tracked in pytorch/pytorch#59098.

@terjekv
Collaborator

terjekv commented Jun 22, 2021

Test report by @terjekv
FAILED
Build succeeded for 1 out of 3 (2 easyconfigs in total)
ninhursaga.uio.no - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (cascadelake), Python 3.6.8
See https://gist.github.com/c93ed8c226efc0af6100ce7e466864ba for a full test report.

@branfosj
Member

branfosj commented Jun 22, 2021

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0309u35a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/4ce9516a3230068dccd42e05e7ced77d for a full test report.

Edit: Failure in test_quantization.

======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-u4w1o_z0/tmp1ednt2sm/lib/python3.8/site-packages/torch/testing/_internal/common_quantized.py", line 161, in test_fn
    qfunction(*args, **kwargs)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/PyTorch/1.9.0/foss-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 230, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/PyTorch/1.9.0/foss-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 76, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/scratch-local/branfosj-admin/eb-u4w1o_z0/tmp1ednt2sm/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1388, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/scratch-local/branfosj-admin/eb-u4w1o_z0/tmp1ednt2sm/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).

----------------------------------------------------------------------

@branfosj
Member

branfosj commented Jun 22, 2021

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0308u26a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/7811093d15e69fe464776610699113a2 for a full test report.

Edit: Failure in test_quantization.

======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_quantized.py", line 161, in test_fn
    qfunction(*args, **kwargs)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/PyTorch/1.9.0/fosscuda-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 230, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/dev/shm/build-branfosj-admin/branfosj-admin-up/PyTorch/1.9.0/fosscuda-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 76, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1388, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).

----------------------------------------------------------------------

Failure in distributed/rpc/cuda/test_tensorpipe_agent.

======================================================================
ERROR: test_devices_option_mismatch (__main__.TensorPipeTensorPipeAgentCudaRpcTestWithSpawn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 391, in wrapper
    self._join_processes(fn)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 583, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 626, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 509, in run_test
    getattr(self, test_name)()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 393, in wrapper
    fn()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5878, in test_devices_option_mismatch
    rpc.init_rpc(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 203, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 237, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 99, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 294, in _tensorpipe_init_backend_handler
    _tensorpipe_check_local_device_maps(name, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 215, in _tensorpipe_check_local_device_maps
    raise ValueError(
ValueError: Invalid device in TensorPipe options on worker0:
device_maps = {'worker1': {device(type='cuda', index=0): device(type='cuda', index=0)}},
devices = [device(type='cuda', index=1)]

Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 509, in run_test
    getattr(self, test_name)()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 393, in wrapper
    fn()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5878, in test_devices_option_mismatch
    rpc.init_rpc(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 203, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 237, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 99, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 294, in _tensorpipe_init_backend_handler
    _tensorpipe_check_local_device_maps(name, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 215, in _tensorpipe_check_local_device_maps
    raise ValueError(
ValueError: Invalid device in TensorPipe options on worker1:
device_maps = {'worker2': {device(type='cuda', index=0): device(type='cuda', index=0)}},
devices = [device(type='cuda', index=1)]



======================================================================
ERROR: test_devices_option_mismatch_reverse (__main__.TensorPipeTensorPipeAgentCudaRpcTestWithSpawn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 391, in wrapper
    self._join_processes(fn)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 583, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 626, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 509, in run_test
    getattr(self, test_name)()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 393, in wrapper
    fn()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5903, in test_devices_option_mismatch_reverse
    rpc.init_rpc(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 203, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 237, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 99, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 316, in _tensorpipe_init_backend_handler
    _tensorpipe_check_remote_device_maps(agent, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 254, in _tensorpipe_check_remote_device_maps
    check_one_worker(worker_name, worker_device_maps, all_device_counts)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 243, in check_one_worker
    raise ValueError(
ValueError: Invalid device_map configuration on worker0 for worker1, remote device out of range:
device_maps = {'worker1': {device(type='cuda', index=0): device(type='cuda', index=1)}}

Process 2 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 509, in run_test
    getattr(self, test_name)()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 393, in wrapper
    fn()
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5903, in test_devices_option_mismatch_reverse
    rpc.init_rpc(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 203, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/__init__.py", line 237, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 99, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 316, in _tensorpipe_init_backend_handler
    _tensorpipe_check_remote_device_maps(agent, rpc_backend_options)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 254, in _tensorpipe_check_remote_device_maps
    check_one_worker(worker_name, worker_device_maps, all_device_counts)
  File "/scratch-local/branfosj-admin/eb-r498agf7/tmp9ivfe7tf/lib/python3.8/site-packages/torch/distributed/rpc/backend_registry.py", line 243, in check_one_worker
    raise ValueError(
ValueError: Invalid device_map configuration on worker0 for worker1, remote device out of range:
device_maps = {'worker1': {device(type='cuda', index=0): device(type='cuda', index=1)}}

@Flamefire
Contributor Author

@branfosj Thanks for the reminder. I don't really have a Cascadelake machine easily available. I added a patch to skip that test. Can you try it?

@terjekv Something seems odd with your build machine:

RuntimeError: In operator() at tensorpipe/common/ibv.h:151 "": Cannot allocate memory
distributed/pipeline/sync/test_transparency.py::test_simple_linears libi40iw-i40iw_ucreate_cq: failed to pin memory for CQ

@terjekv
Collaborator

terjekv commented Jun 23, 2021

I don't quite understand how. The machine monitoring suggests that free memory never dropped below 30 GB or so. Could a spike have eaten that? :(

@Flamefire
Contributor Author

I don't quite understand how

@terjekv No idea. But it seems to be related to InfiniBand: it is a call to ibv_create_cq (dynamically loaded from libibverbs.so.1) that fails, at least in one case.

Maybe just a fluke?

@terjekv
Collaborator

terjekv commented Jun 23, 2021

I hope so. Could it be the IB stack on RHEL 8 causing issues? Rerunning to see.

@terjekv
Collaborator

terjekv commented Jun 23, 2021

Same error. Kinda iffy. The box does not have any IB hardware, but has:

$ rpm -qa | grep ibverbs
libibverbs-32.0-4.el8.x86_64

@Flamefire
Contributor Author

Same error. Kinda iffy. The box does not have any IB hardware, but has:

Then that setup is maybe just unsupported. Please open an issue in the PyTorch repo to get them to clarify that.

@verdurin
Member

Test report by @verdurin
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
easybuild-c7.novalocal - Linux centos linux 7.9.2009, x86_64, Intel Xeon Processor (Skylake, IBRS), Python 3.6.8
See https://gist.github.com/ce6e347ad636a9e190afb1d95419d2ce for a full test report.

@terjekv
Collaborator

terjekv commented Jun 23, 2021

Test report by @terjekv
FAILED
Build succeeded for 0 out of 2 (2 easyconfigs in total)
ninhursaga.uio.no - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (cascadelake), Python 3.6.8
See https://gist.github.com/8ad20fe90fce1ef2e2054bc9b62513f7 for a full test report.

@branfosj
Member

branfosj commented Jun 23, 2021

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0308u26a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/ae6843b248484549adf3a5782796db46 for a full test report.

Edit: Something odd is going on with /tmp on that system, which is causing the failure.

@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0308u26a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/6c165029fd2fa326af5ea23b03de734b for a full test report.

@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0308u29a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/1ce6af2cdbe1afb582917f2515933542 for a full test report.

@Flamefire
Contributor Author

@branfosj Can you check what exactly failed there? It still says test_quantization, but that part of the log is missing. And test_lstm should be fixed...

@branfosj
Member

I'm confused, as the failure is the same as before:

======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/tmp-branfosj-admin-up/eb-se9g9n5f/tmpgv5jwrj6/lib/python3.8/site-packages/torch/testing/_internal/common_quantized.py", line 161, in test_fn
    qfunction(*args, **kwargs)
  File "/dev/shm/build-branfosj-admin-up/PyTorch/1.9.0/foss-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 232, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/dev/shm/build-branfosj-admin-up/PyTorch/1.9.0/foss-2020b/pytorch/test/quantization/test_backward_compatibility.py", line 77, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/dev/shm/tmp-branfosj-admin-up/eb-se9g9n5f/tmpgv5jwrj6/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1388, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/dev/shm/tmp-branfosj-admin-up/eb-se9g9n5f/tmpgv5jwrj6/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1355, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).

----------------------------------------------------------------------

@branfosj
Member

I'm setting off a build so I can do some debugging on why that test is not being skipped.

@Flamefire
Contributor Author

I didn't actually skip it but marked it as an expected failure: 74c8c32#diff-de8a49a66cde6fe5e687dac5c654f196ebd0110d55a739b1f57095131f5d471cR22

Not sure why this isn't working.

@Flamefire
Contributor Author

@branfosj Found it. Currently uploading a new patch. The mark must come first -.-
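
(For anyone hitting the same issue, a minimal sketch of why the order matters; `override_qengines` here is a hypothetical stand-in for PyTorch's own test decorators. unittest records the expected-failure flag as an attribute on the callable it finally sees, so a mark hidden behind a wrapper that does not copy attributes is lost. Listing `@unittest.expectedFailure` first makes it the outermost decorator, so the flag survives.)

```python
import unittest


def override_qengines(fn):
    # Hypothetical stand-in for a PyTorch test decorator that wraps the test
    # without functools.wraps, and therefore drops any attributes set on fn.
    def wrapper(self, *args, **kwargs):
        return fn(self, *args, **kwargs)
    return wrapper


class TestSerialization(unittest.TestCase):
    @unittest.expectedFailure  # listed first => applied last, so the flag ends
    @override_qengines         # up on the callable that unittest inspects
    def test_lstm(self):
        self.assertTrue(False)  # stands in for the failing tolerance check


if __name__ == "__main__":
    unittest.main()  # reports "OK (expected failures=1)" with this ordering
```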

@branfosj
Member

@branfosj Found it. Currently uploading a new patch. The mark must come first -.-

Thanks. I'll set off a test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0206u03a.bear.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/83aabfc0ac2ec8bb05232b61e1646237 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8032 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/34655e94e8bf78f9dafb37f20c51a827 for a full test report.

@Flamefire
Contributor Author

@branfosj

FAILED (skipped=24, unexpected successes=1)
test_quantization failed!

I give up and will just skip that test.
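
For reference, a rough sketch of what skipping it at the easyconfig level could look like; this assumes the PyTorch easyblock's `excluded_tests` parameter, and the exact entry is illustrative rather than the final change in this PR:

```python
# Sketch of an easyconfig snippet: the PyTorch easyblock takes excluded_tests
# as a dict keyed by architecture, with '' applying to all architectures.
excluded_tests = {
    '': [
        # flaky test_lstm tolerance failure, tracked in pytorch/pytorch#59098
        'test_quantization',
    ]
}
```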

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0206u03a.bear.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/e16eb2c0e470f09b733a87fdc0c23db0 for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bber0501u03a.bb2.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/2e48e045dd2cb324878128df27a8cfa3 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8032 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/72eb8c9b63bb015a528adf89a167ce5b for a full test report.

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bask-pg0308u31a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/50dec1b4b4112d43ffeeee7402a2ce27 for a full test report.

@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bask-pg0308u30a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/a5adcff00ca81a93de9065159d179a86 for a full test report.

@Flamefire
Contributor Author

@branfosj

distributed/test_c10d_nccl failed!

Not sure here... While I do like the option to run all tests and report failures afterwards, it makes these test reports hard to read, since the error happens much earlier in the log. I think I should add some post-processing to the easyblock so that it shows the actual failures.
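
A minimal sketch of that kind of post-processing (not the actual easyblock change; it assumes the standard unittest-style output format with lines of 70 `=` characters before each failure):

```python
import re

# Sketch: pull the "FAIL:"/"ERROR:" blocks out of the captured test-suite
# output so the report shows the real failures instead of only the summary.
FAILURE_BLOCK = re.compile(
    r"^={70}\n(?:FAIL|ERROR): .*?(?=^={70}|\Z)",
    re.MULTILINE | re.DOTALL,
)


def extract_failures(log_text):
    """Return each traceback block that follows a FAIL:/ERROR: header."""
    return FAILURE_BLOCK.findall(log_text)


# Usage (hypothetical log file name):
# for block in extract_failures(open("pytorch-test-output.log").read()):
#     print(block)
```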

@branfosj
Member

The failed test was with 4 GPUs. We are back to NCCL errors:

terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: uncorrectable NVLink error detected during the execution

Full output from distributed/test_c10d_nccl: https://gist.github.com/branfosj/fd219f55aca2e9f48c1c4cb92531afc8

@Flamefire
Contributor Author

Yes, but this time it is test_broadcast_coalesced_nccl, which I haven't seen failing before. Do you use a different NCCL? Have you rebuilt NCCL with that patch already (not sure if that happens automatically)?

@branfosj
Member

NCCL is NCCL-2.8.3-GCCcore-10.2.0-CUDA-11.1.1.eb with NCCL-2.8.3_fix-isend-irecv.patch, built from #13071.

@Flamefire
Contributor Author

Hm, I can't see anything wrong with this test, so I'd rather not disable it. Maybe it was just a fluke?
Due to the asynchronous nature of CUDA errors, the stack trace doesn't say anything about the actual source of the error.
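
One way to get a more useful stack trace next time (a debugging aside, not part of this PR) is to force synchronous CUDA launches, so the error is raised at the offending call site instead of at some later synchronization point:

```python
# Debugging sketch: CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous,
# so a CUDA error surfaces where it actually happened.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialised

import torch  # import (and run the test) only after setting the variable
```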

@branfosj
Member

branfosj commented Jun 25, 2021

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bask-pg0309u05a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/f3bbe8f21aaf4e1f9eb5367d78d20907 for a full test report.

Edit: Test with 2 GPUs

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusa7 - Linux centos linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), Python 2.7.5
See https://gist.github.com/2194f5d7b97d9714ac453f72d0b05a04 for a full test report.

@branfosj
Member

branfosj commented Jun 25, 2021

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bask-pg0309u05a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/737a9bc8ec068d208504e4ddfc7f5573 for a full test report.

Edit: Test with 1 GPU

@branfosj
Member

branfosj commented Jun 26, 2021

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bask-pg0309u10a - Linux RHEL 8.4, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/609b088ae4b5075e8f2c086eba1535c6 for a full test report.

Edit: Test with 4 GPUs

@branfosj
Member

Hm, I can't see anything wrong with this test, so I'd rather not disable it. Maybe it was just a fluke?
Due to the asynchronous nature of CUDA errors, the stack trace doesn't say anything about the actual source of the error.

I've run it using all 4 GPUs three times. So the first one was either a fluke failure or the test is flaky.

@Flamefire
Contributor Author

The test shouldn't really be flaky from what I can tell; I guess NCCL is the problem here. Hence I'm hesitant to disable this test. FWIW, I reran this test file 5 times without any failure.
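
For anyone wanting to repeat such a rerun, a minimal sketch (the path and repeat count are illustrative, assuming an unpacked PyTorch source tree):

```python
# Sketch for rerunning one PyTorch test file a few times to check for flakiness.
import subprocess
import sys

for attempt in range(1, 6):
    result = subprocess.run(
        [sys.executable, "test/distributed/test_c10d_nccl.py", "-v"],
        cwd="pytorch",  # assumed location of the unpacked PyTorch sources
    )
    print(f"attempt {attempt}: exit code {result.returncode}")
    if result.returncode != 0:
        break  # stop on the first failure so the log stays readable
```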

My proposal: include the easyconfig as-is and keep the potential failure as a kind of warning that PyTorch might fail in the given environment.

@boegel
Member

boegel commented Jun 28, 2021

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node3159.skitty.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/8eea43094a796743710c9bfb5ad6d581 for a full test report.

@branfosj
Member

@branfosj left a comment


lgtm

@branfosj modified the milestones: 4.x, next release (4.4.1) Jun 30, 2021
@branfosj
Member

Going in, thanks @Flamefire!

@branfosj merged commit 9f74b08 into easybuilders:develop Jun 30, 2021
@Flamefire deleted the 20210622165340_new_pr_PyTorch190 branch July 1, 2021 08:56