@Flamefire
Contributor

@Flamefire Flamefire commented Nov 14, 2023

(created using eb --new-pr)

I found this to be the cause of multiple PyTorch test failures on x86 (AMD EPYC), such as:

distributed/_shard/checkpoint/test_checkpoint (8 total tests, failures=6)
distributed/_shard/checkpoint/test_file_system_checkpoint (6 total tests, failures=5)
distributed/_shard/sharded_optim/test_sharded_optim (2 total tests, failures=2)
distributed/_shard/sharded_tensor/ops/test_binary_cmp (4 total tests, failures=4)
distributed/_shard/sharded_tensor/ops/test_chunk (2 total tests, failures=2)
distributed/_shard/sharded_tensor/ops/test_elementwise_ops (4 total tests, failures=4)
distributed/_shard/sharded_tensor/ops/test_embedding (2 total tests, failures=2)
distributed/_shard/sharded_tensor/ops/test_embedding_bag (2 total tests, failures=2)
distributed/_shard/sharded_tensor/ops/test_init (3 total tests, failures=3)
distributed/_shard/sharded_tensor/ops/test_linear (3 total tests, failures=3)
distributed/_shard/sharded_tensor/ops/test_matrix_ops (11 total tests, failures=11)
distributed/_shard/sharded_tensor/ops/test_softmax (2 total tests, failures=2)
distributed/_shard/sharded_tensor/ops/test_tensor_ops (5 total tests, failures=5)
distributed/_shard/sharded_tensor/test_sharded_tensor (64 total tests, failures=50, errors=3, skipped=1)
distributed/_shard/sharded_tensor/test_sharded_tensor_reshard (2 total tests, failures=2)
distributed/_shard/sharding_plan/test_sharding_plan (5 total tests, failures=4, errors=1)
distributed/_shard/sharding_spec/test_sharding_spec (10 total tests, failures=2)
distributed/_shard/test_partial_tensor (5 total tests, failures=5)
distributed/algorithms/ddp_comm_hooks/test_ddp_hooks (6 total tests, failures=6)

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 9 out of 9 (9 easyconfigs in total)
taurusi8022 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/6fd9b3902349f524045e196ef66013d4 for a full test report.

@smoors smoors added the bug fix label Nov 15, 2023
@smoors
Contributor

smoors commented Nov 15, 2023

@boegelbot: please test @ generoso

@boegelbot
Collaborator

@smoors: Request for testing this PR well received on login1

PR test command 'EB_PR=19231 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_19231 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12167

Test results coming soon (I hope)...

Details

- notification for comment with ID 1812577674 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 10 out of 11 (9 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/edda2ea705b3548178674c0fd51aeffc for a full test report.

@smoors
Contributor

smoors commented Nov 15, 2023

FAIL (build issue) NCCL-2.18.3-GCCcore-12.3.0-CUDA-12.1.1.eb

== 2023-11-15 14:29:43,863 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/easybuild-framework/easybuild/base/exceptions.py:126 in __init__): Checksum verification for /project/boegelbot/sources/n/NCCL/v2.18.3-1.tar.gz using {'v2.18.3-1.tar.gz': '6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9'} failed. (at easybuild/easybuild-framework/easybuild/framework/easyblock.py:2457 in checksum_step)

@Flamefire
Contributor Author

FAIL (build issue) NCCL-2.18.3-GCCcore-12.3.0-CUDA-12.1.1.eb

== 2023-11-15 14:29:43,863 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/easybuild-framework/easybuild/base/exceptions.py:126 in __init__): Checksum verification for /project/boegelbot/sources/n/NCCL/v2.18.3-1.tar.gz using {'v2.18.3-1.tar.gz': '6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9'} failed. (at easybuild/easybuild-framework/easybuild/framework/easyblock.py:2457 in checksum_step)

That must be a download error on your site: wget -O- https://github.com/NVIDIA/NCCL/archive/v2.18.3-1.tar.gz | sha256sum shows 6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9
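The same comparison can be made against an already downloaded copy of the tarball. This is a minimal Python sketch, not what EasyBuild runs internally: the helper names `sha256_of` and `matches_expected` are hypothetical, and only the expected digest is taken from the error message above.

```python
import hashlib

# Expected digest, copied from the easyconfig / error message above.
EXPECTED = "6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9"


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tarballs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED):
    """Return True if the file's digest equals the expected hex string."""
    return sha256_of(path) == expected
```

Calling `matches_expected` on the cached source file (the path in the log above is site-specific) should return `False` if the cached download is corrupt, in which case deleting it and fetching it again is the usual next step.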

Comment on lines 14 to 18
-checksums = [('6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9',
-              'b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96')]
+patches = ['NCCL-2.16.2_fix-cpuid.patch']
+checksums = [
+    {'v2.18.3-1.tar.gz': '6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9'},
+    {'NCCL-2.16.2_fix-cpuid.patch': '0459ecadcd32b2a7a000a2ce4f675afba908b2c0afabafde585330ff4f83e277'},
+]
Member

Please restore the alternative checksum. See #18906

Contributor Author

Oh, didn't notice that. Done 👍

Contributor Author

Hitting a framework bug. Shall we resolve that first or shall I revert to non-dicts?

Contributor

let's get this merged first?
checksumming is quite a mess at the moment. probably better to fix this in EB-5 so breaking changes can be made, see also easybuilders/easybuild-framework#4346
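For readers unfamiliar with the "alternative checksum" discussed in this review thread: the tuple in the removed lines lists two digests that are both accepted for the same source file. The Python sketch below only illustrates that idea. It is not the EasyBuild framework's implementation, and the function name `digest_matches` is hypothetical; the two digests are the ones from the diff.

```python
import hashlib


def digest_matches(data, expected):
    """Check the SHA-256 of `data` (bytes) against `expected`.

    `expected` is either a single hex string or a tuple of acceptable
    alternatives, mirroring the tuple form in the removed diff lines.
    """
    actual = hashlib.sha256(data).hexdigest()
    alternatives = expected if isinstance(expected, tuple) else (expected,)
    return actual in alternatives


# Both digests accepted for v2.18.3-1.tar.gz (values taken from the diff).
NCCL_TARBALL = (
    "6477d83c9edbb34a0ebce6d751a1b32962bc6415d75d04972b676c6894ceaef9",
    "b4f5d7d9eea2c12e32e7a06fe138b2cfc75969c6d5c473aa6f819a792db2fc96",
)
```

With this shape, a download passes verification if its digest equals either entry of `NCCL_TARBALL`, which is the behaviour the reviewer asked to keep.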

@SebastianAchilles
Member

A colleague of mine tried to run GROMACS with NCCL and ran into a segmentation fault pointing to ncclTopoGetXmlFromCpu(). We tried to reduce it a bit: without the patch, broadcast_perf from https://github.com/NVIDIA/nccl-tests/tree/master, using NCCL-2.18.3-GCCcore-12.3.0-CUDA-12.2.0.eb (we are using NCCL with a slightly newer CUDA version), also gives a segmentation fault. With the patch the test works.
The patch looks good to me. 👍

@Flamefire
Contributor Author

A colleague of mine tried to run GROMACS with NCCL and ran into a segmentation fault pointing to ncclTopoGetXmlFromCpu(). We tried to reduce it a bit: without the patch, broadcast_perf from https://github.com/NVIDIA/nccl-tests/tree/master, using NCCL-2.18.3-GCCcore-12.3.0-CUDA-12.2.0.eb (we are using NCCL with a slightly newer CUDA version), also gives a segmentation fault. With the patch the test works. The patch looks good to me. 👍

ncclTopoGetXmlFromCpu also was in the stacktrace on my machine and the GCC guys confirmed the issue there. So that fits, thanks for testing!

@SebastianAchilles
Member

@boegelbot please test @ generoso

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on login1

PR test command 'EB_PR=19231 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_19231 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12174

Test results coming soon (I hope)...

Details

- notification for comment with ID 1814160413 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
bwd-rockylinux-92 - Linux Rocky Linux 9.2 (Blue Onyx), x86_64, Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz (broadwell), 2 x NVIDIA NVIDIA GeForce GTX 1060 6GB, 535.129.03, Python 3.9.16
See https://gist.github.com/SebastianAchilles/40004bed8408b49ea44f790ad615be38 for a full test report.

@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
skl-rockylinux-88 - Linux Rocky Linux 8.8, x86_64, Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz (skylake), 1 x NVIDIA NVIDIA RTX A4000, 545.23.06, Python 3.6.8
See https://gist.github.com/SebastianAchilles/12bd064b50f82e7172b715e014fd2a71 for a full test report.

@SebastianAchilles
Member

@boegelbot please test @ jsc-zen2

@boegelbot
Collaborator

@SebastianAchilles: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=19231 EB_ARGS= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_19231 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3739

Test results coming soon (I hope)...

Details

- notification for comment with ID 1814215206 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 9 out of 9 (9 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/e265afd108ac0e3a8923eba90b049069 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 12 out of 12 (9 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/393e72bf7c3cf343f4813bad88e8bcaa for a full test report.

@easybuilders easybuilders deleted a comment from boegelbot Nov 16, 2023
Contributor

@smoors smoors left a comment

lgtm

@smoors
Contributor

smoors commented Nov 16, 2023

Going in, thanks @Flamefire!

@smoors smoors merged commit 4adb8d1 into easybuilders:develop Nov 16, 2023
@Flamefire Flamefire deleted the 20231114131104_new_pr_NCCL2103 branch November 16, 2023 12:33