Skip to content

Conversation

@Micket
Copy link
Contributor

@Micket Micket commented Feb 3, 2023

(created using eb --new-pr)

…patches: UCC-CUDA-1.1.0_cuda_12_mem_ops.patch
@Micket Micket added the update label Feb 3, 2023
@Micket Micket marked this pull request as draft February 3, 2023 17:19
@Micket
Copy link
Contributor Author

Micket commented Feb 3, 2023

I backported the patch that deals with CUDA 12 MEM OPS things. It builds, but, unfortunately just crashes when used.
So, more things are needed apparently. Crap.

@Micket
Copy link
Contributor Author

Micket commented Mar 12, 2023

I tried to isolate the segfault i saw, but it turns out the segfault is present as soon as you just set UCC_COMPONENT_PATH to anything, so, actually, the backported patch might just work.. if we can sort out what's causing ucc_info to segfault

$ ml UCC/1.1.0-GCCcore-12.2.0 
$ export UCC_COMPONENT_PATH=$EBROOTUCC/lib/ucc/
$ ucc_info  -v
[vera-c2:4015973:0:4015973] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid:4015973) ====
 0 0x0000000000012ce0 __funlockfile()  :0
 1 0x00000000000ccc75 __strlen_avx2()  :0
 2 0x0000000000019026 ucc_str_concat()  /dev/shm/UCC/1.1.0/GCCcore-12.2.0/ucc-1.1.0/src/utils/ucc_string.c:123
 3 0x0000000000009ff0 ucc_check_config_file()  /dev/shm/UCC/1.1.0/GCCcore-12.2.0/ucc-1.1.0/src/core/ucc_constructor.c:105
 4 0x0000000000009ff0 ucc_constructor()  /dev/shm/UCC/1.1.0/GCCcore-12.2.0/ucc-1.1.0/src/core/ucc_constructor.c:136
 5 0x00000000000098b8 ucc_lib_config_read()  /dev/shm/UCC/1.1.0/GCCcore-12.2.0/ucc-1.1.0/src/core/ucc_lib.c:368
 6 0x00000000004011a5 main()  /dev/shm/UCC/1.1.0/GCCcore-12.2.0/ucc-1.1.0/tools/info/ucc_info.c:131
 7 0x000000000003acf3 __libc_start_main()  ???:0
 8 0x00000000004013ae _start()  ???:0
=================================
Segmentation fault (core dumped)

Ugh. Trying to see what changes they made in ucc_constructor to tell where it might go wrong, turns out they have also since then completely removed the option to specify UCC_COMPONENT_PATH, and it's now hardcoded.
openucx/ucc#586
so, next UCC release we will just be screwed.

@Micket
Copy link
Contributor Author

Micket commented Mar 13, 2023

I found the new code without the broken UCC_COMPONENT_PATH much easier to understand. The logic isn't as convoluted and combined with install path and the rest, instead, we'd just need to hijack:
https://github.com/openucx/ucc/blob/1cdc03d74edfd9f03c59872c5e9ccf06310df492/src/core/ucc_constructor.c#L86-L87

 status = ucc_sys_path_join(lib_path, UCC_MODULE_SUBDIR, 
                            &ucc_global_config.component_path); 

though, even then, the way we do it, symlinking all default built components into the new path in UCC-CUDA is also a bit ugly (how would you handle a third component thing?)
what we'd rather have is a list of component_paths

@Micket
Copy link
Contributor Author

Micket commented Mar 13, 2023

so what we really would like is to have
https://github.com/openucx/ucc/blob/1cdc03d74edfd9f03c59872c5e9ccf06310df492/src/utils/ucc_component.c#L111
support multiple different paths. Code is mostly noisy mallocs and such, but all it does it format a glob string.

So component path should be able to support and type of string that glob would accept;
{/path/to/ucc/lib/ucc/,/path/to/ucc-cuda/lib/ucc/}

@Micket Micket added this to the next release (4.7.2?) milestone Apr 8, 2023
@Micket Micket changed the title {lib}[GCCcore/12.2.0] UCC-CUDA v1.1.0 {lib}[GCCcore/12.2.0] UCC-CUDA v1.1.0 + UCC patch for component paths Apr 8, 2023
@Micket
Copy link
Contributor Author

Micket commented Apr 8, 2023

Wasn't worth fixing hte old behavior, because it

  1. also required copying over all the original components for it to work, which was messy and wouldn't work if someone wanted to make a third variant, like a "UCC-ROCM" anyway.
  2. next release removed the old component environment variable completely anyway
  3. couldn't figure out how to even get that to work, since it's a bunch of messy assumptions made of how it works.

This new patch will hopefully work going forward (with minimal updates), and as a nice bonus, if someone did forget to rebuild without this patch, since i no longer set UCC_COMPONENT_PATH to anything, at least it won't completely break UCC.
Though, if someone wants to get the UCC-CUDA components, then they need to rebuild UCC with this patch, i see no other option.

@Micket Micket marked this pull request as ready for review April 8, 2023 13:08
@boegelbot

This comment was marked as resolved.

@Micket
Copy link
Contributor Author

Micket commented Apr 8, 2023

Test report by @Micket
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
vera-c1 - Linux Rocky Linux 8.6, x86_64, Intel Xeon Processor (Skylake), Python 3.6.8
See https://gist.github.com/Micket/ac89db0776e3fa72d9b497c14cbeedf5 for a full test report.

@Micket
Copy link
Contributor Author

Micket commented Apr 8, 2023

Test report by @Micket
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
vera-c1 - Linux Rocky Linux 8.6, x86_64, Intel Xeon Processor (Skylake), Python 3.6.8
See https://gist.github.com/Micket/1a113a3c37e921507ef72499032469cf for a full test report.

@boegel
Copy link
Member

boegel commented Apr 8, 2023

@boegelbot please test @ generoso

@boegelbot
Copy link
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=17255 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_17255 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 10611

Test results coming soon (I hope)...

- notification for comment with ID 1500928368 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Copy link
Member

boegel commented Apr 8, 2023

Test report by @boegel
SUCCESS
Build succeeded for 6 out of 6 (2 easyconfigs in total)
node3102.skitty.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/boegel/d58fd436653fd2140b1b570382da5ca2 for a full test report.

@boegelbot
Copy link
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (2 easyconfigs in total)
cns3 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/93176fd5b76bb18b409856fc2413a6f4 for a full test report.

@boegel
Copy link
Member

boegel commented Apr 8, 2023

@boegelbot please test @ jsc-zen2

@boegelbot
Copy link
Collaborator

@boegel: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=17255 EB_ARGS= /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_17255 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 2471

Test results coming soon (I hope)...

- notification for comment with ID 1500937078 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Copy link
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 5 out of 5 (2 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/32018b9558e905ffa1be324e38b9c1ec for a full test report.

Copy link
Member

@boegel boegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@boegel
Copy link
Member

boegel commented Apr 8, 2023

Going in, thanks @Micket!

@boegel boegel merged commit e64966c into easybuilders:develop Apr 8, 2023
@Micket Micket deleted the 20230203171738_new_pr_UCC-CUDA110 branch April 21, 2023 09:18
@boegel boegel changed the title {lib}[GCCcore/12.2.0] UCC-CUDA v1.1.0 + UCC patch for component paths {lib}[GCCcore/12.2.0] UCC-CUDA v1.1.0 + add patch for UCC 1.1.0 for multiple component paths May 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants