@Flamefire
Contributor

@Flamefire Flamefire commented Aug 5, 2021

(created using eb --new-pr)

This is basically a fixed version of #13455, which I initially created because one of the tests of the MPI build failed with --default-opt-level=opt.

The crucial part here is that set(CMAKE_Fortran_FLAGS_RELEASE "-O3 -ip") needs to keep the -ip option for the tests to succeed; without it, a misoptimization seemingly occurs.

So I changed DIRAC-19.0_use_easybuild_opts.patch, which is meant to make the build use the CMake options and remove things like the unconditional -g. The changes are now:

  • set(CMAKE_C_FLAGS_DEBUG "-O0") fully removed --> don't inject an optimization option into the per-config flags (here: DEBUG)
  • set(CMAKE_C_FLAGS_RELEASE "-O2 -Wno-unused") --> moved the warning flag to CMAKE_C_FLAGS and removed the rest
  • set(CMAKE_CXX_FLAGS_DEBUG "-O0 -DDEBUG") -> set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -DDEBUG") --> removed the -O0, kept the default flags and the custom define
  • set(CMAKE_Fortran_FLAGS_RELEASE "-O3 -ip") -> set(CMAKE_Fortran_FLAGS_RELEASE "${CMAKE_Fortran_FLAGS_RELEASE} -ip") --> similar, but keep the -ip!
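
The difference between the old and new set(...) forms in the list above can be sketched in Python. This is only a toy model: the dicts stand in for CMake's variables, the helper names are made up for illustration, and the pre-existing value "-O2" is an arbitrary placeholder, not what CMake or EasyBuild actually supplies.

```python
def set_replace(cache, var, value):
    """Model of set(VAR "value"): the previous content of VAR is discarded."""
    cache[var] = value


def set_append(cache, var, extra):
    """Model of set(VAR "${VAR} extra"): keep whatever VAR held, then add extra."""
    cache[var] = f"{cache.get(var, '')} {extra}".strip()


VAR = "CMAKE_Fortran_FLAGS_RELEASE"

# Placeholder (assumption) for whatever value is in place before DIRAC's
# CMake code runs.
upstream = {VAR: "-O2"}
patched = {VAR: "-O2"}

set_replace(upstream, VAR, "-O3 -ip")  # DIRAC's original line
set_append(patched, VAR, "-ip")        # the line after this PR's patch

print(upstream[VAR])  # -O3 -ip
print(patched[VAR])   # -O2 -ip
```

Either way -ip survives; only the hard-coded optimization level is gone, so the level is left to whatever is configured outside the patch.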

In the EC I removed CMAKE_BUILD_TYPE, separate_build_dir and -G, as those are already the defaults.

I also checked whether we can remove the parallel = 1 and do have an idea, but that change would be too large, and the build "only" takes about 20min (could be cut down to 2-4 with a while ! make -j loop... but yeah)
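
The while ! make -j idea mentioned above amounts to rerunning the parallel build until it exits cleanly. A minimal sketch, assuming each failed run still leaves finished targets behind so that a rerun makes progress; the function name, the attempt cap, and the injectable run parameter are illustrative and not part of this PR or of EasyBuild:

```python
import subprocess


def make_until_success(cmd=("make", "-j"), max_attempts=10, run=subprocess.call):
    """Rerun `cmd` until it exits with status 0.

    Returns the number of attempts that were needed and raises
    RuntimeError if all `max_attempts` runs fail.
    """
    for attempt in range(1, max_attempts + 1):
        # subprocess.call returns the exit status of the command.
        if run(list(cmd)) == 0:
            return attempt
    raise RuntimeError(f"{' '.join(cmd)} still failing after {max_attempts} attempts")


if __name__ == "__main__":
    # Demonstration with a stand-in runner that fails twice, then succeeds,
    # so the sketch can be tried without a real build tree.
    exit_codes = iter([2, 2, 0])
    print(make_until_success(run=lambda cmd: next(exit_codes)))
```

The cap guards against a genuine build error, which an unbounded while ! loop would retry forever.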

@hajgato

@boegel boegel added the bug fix label Aug 5, 2021
@boegel boegel changed the title from "Fix DIRAC build with high compiler optimiztations" to "fix build of DIRAC 19.0 easyconfig with high compiler optimizations" Aug 5, 2021
@boegel boegel added this to the 4.x milestone Aug 5, 2021
@boegel
Member

boegel commented Aug 5, 2021

@boegelbot please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=13613 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_13613 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 17904

Test results coming soon (I hope)...

Details

- notification for comment with ID 893341772 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusi6605.taurus.hrsk.tu-dresden.de - Linux RHEL 7.9, x86_64, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (haswell), Python 2.7.5
See https://gist.github.com/fdc6f49ce9afea615a7f6e727123fd8e for a full test report.

@boegel
Member

boegel commented Aug 5, 2021

Test report by @boegel
SUCCESS
Build succeeded for 4 out of 4 (2 easyconfigs in total)
node3510.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/8ea21cb7d047246edab73f80441bdbd8 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
generoso-c1-s-4 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/afa80fbb2be06f5232a5a3eaeb704eae for a full test report.

@boegel
Member

boegel commented Aug 5, 2021

Test report by @boegel
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
node3109.skitty.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/c9d68af4eb96200e673402d8fb99251a for a full test report.

@boegel
Member

boegel commented Aug 5, 2021

Test report by @boegel
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
node2610.swalot.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/e72e50d7a01a287033c64654bd9b614f for a full test report.

@Flamefire
Contributor Author

@boegel The failure looks like some unrelated issue with UCX?

@boegel
Member

boegel commented Aug 5, 2021

> @boegel The failure looks like some unrelated issue with UCX?

Yeah, I'm a bit puzzled what's going on there, but looks unrelated, so I won't let it block this PR.

For reference, some DIRAC tests for DIRAC-19.0-intel-2020a-Python-2.7.18-mpi-int64.eb are failing on Intel Skylake with errors like:

 35/132 Test  #35: dft_response ...............................***Failed    5.05 sec
ERROR: crash during ['python', '/tmp/vsc40023/easybuild_build/DIRAC/19.0/intel-2020a-Python-2.7.18-mpi-int64/easybuild_obj/pam', '--dirac=/tmp/vsc40023/easybuild_build/DIRAC/19.0/intel-2020a-Python-2.7.18-mpi-int64/easybuild_obj/dirac.x', '--noarch', '--nobackup', '--inp=blyp_sdft.inp', '--mol=he.mol']

 **** dirac-executable stderr console output : ****
[node3109:27059:0:27060]       ud_ep.c:544  Assertion `ep->rx.ooo_pkts.head_sn == neth->psn' failed: iface=0x25b34020 ep=0x25a12430 conn_id=0 ep_id=0, dest_ep_id=0 rx_psn=2 neth_psn=1 ep_flags=0x78 ctl_ops=0x0 rx_creq_count=2
==== backtrace (tid:  27060) ====
 0 0x000000000002151e ucs_debug_print_backtrace()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x00000000000632eb uct_ud_ep_rx_creq()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/ud/base/ud_ep.c:544
 2 0x0000000000063ad5 uct_ud_ep_process_rx()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/ud/base/ud_ep.c:651
 3 0x000000000006ddb1 uct_ud_mlx5_iface_poll_rx()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/ud/accel/ud_mlx5.c:430
 4 0x000000000006ddb1 uct_ud_mlx5_iface_async_progress()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/ud/accel/ud_mlx5.c:493
 5 0x00000000000616d8 uct_ud_iface_async_progress()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/ud/base/ud_iface.c:864
 6 0x00000000000616d8 uct_ud_iface_timer()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/ud/base/ud_iface.c:879
 7 0x0000000000012699 ucs_async_handler_invoke()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/async/async.c:224
 8 0x00000000000128bd ucs_async_handler_dispatch()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/async/async.c:242
 9 0x00000000000129a0 ucs_async_dispatch_handlers()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/async/async.c:271
10 0x0000000000012b0c ucs_async_dispatch_timerq()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/async/async.c:298
11 0x000000000001581d ucs_async_thread_func()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/async/thread.c:138
12 0x0000000000007ea5 start_thread()  pthread_create.c:0
13 0x00000000000fe9fd __clone()  ???:0
=================================
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
dirac.x            00000000038C362B  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B4B4CCEF630  Unknown               Unknown  Unknown
libc-2.17.so       00002B4B4F322387  gsignal               Unknown  Unknown
libc-2.17.so       00002B4B4F323A78  abort                 Unknown  Unknown
libucs.so.0.0.0    00002B4B4C828185  ucs_fatal_error_m     Unknown  Unknown
libucs.so.0.0.0    00002B4B4C8282FE  ucs_fatal_error_f     Unknown  Unknown
libuct_ib.so.0.0.  00002B4B5138B2EB  Unknown               Unknown  Unknown
libuct_ib.so.0.0.  00002B4B5138BAD5  uct_ud_ep_process     Unknown  Unknown
libuct_ib.so.0.0.  00002B4B51395DB1  Unknown               Unknown  Unknown
libuct_ib.so.0.0.  00002B4B513896D8  Unknown               Unknown  Unknown
libucs.so.0.0.0    00002B4B4C81A699  Unknown               Unknown  Unknown
libucs.so.0.0.0    00002B4B4C81A8BD  Unknown               Unknown  Unknown
libucs.so.0.0.0    00002B4B4C81A9A0  ucs_async_dispatc     Unknown  Unknown
libucs.so.0.0.0    00002B4B4C81AB0C  ucs_async_dispatc     Unknown  Unknown
libucs.so.0.0.0    00002B4B4C81D81D  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B4B4CCE7EA5  Unknown               Unknown  Unknown
libc-2.17.so       00002B4B4F3EA9FD  clone                 Unknown  Unknown

directory: /tmp/vsc40023/easybuild_build/DIRAC/19.0/intel-2020a-Python-2.7.18-mpi-int64/easybuild_obj/test/dft_response
   inputs: he.mol  &  blyp_sdft.inp
 96/132 Test  #96: open-shells ................................***Failed    2.40 sec
ERROR: crash during ['python', '/tmp/vsc40023/easybuild_build/DIRAC/19.0/intel-2020a-Python-2.7.18-mpi-int64/easybuild_obj/pam', '--dirac=/tmp/vsc40023/easybuild_build/DIRAC/19.0/intel-2020a-Python-2.7.18-mpi-int64/easybuild_obj/dirac.x', '--noarch', '--nobackup', '--inp=CH.x2c.scf_mj.2Pi12.inp', '--mol=CH.lsym.mol']

 **** dirac-executable stderr console output : ****
[node3109:37813:0:37814]     address.c:859  Assertion `address_count <= 64' failed
==== backtrace (tid:  37814) ====
 0 0x000000000002151e ucs_debug_print_backtrace()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x0000000000053c38 ucp_address_unpack()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucp/wireup/address.c:859
 2 0x000000000005d1ef ucp_wireup_msg_handler()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucp/wireup/wireup.c:646
 3 0x0000000000063a18 uct_iface_invoke_am()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/base/uct_iface.h:628
 4 0x0000000000063a18 uct_ib_iface_invoke_am_desc()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/base/ib_iface.h:309
 5 0x0000000000063a18 uct_ud_ep_process_rx()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/ud/base/ud_ep.c:714
 6 0x000000000006ddb1 uct_ud_mlx5_iface_poll_rx()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/ud/accel/ud_mlx5.c:430
 7 0x000000000006ddb1 uct_ud_mlx5_iface_async_progress()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/ud/accel/ud_mlx5.c:493
 8 0x00000000000616d8 uct_ud_iface_async_progress()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/ud/base/ud_iface.c:864
 9 0x00000000000616d8 uct_ud_iface_timer()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/ud/base/ud_iface.c:879
10 0x0000000000012699 ucs_async_handler_invoke()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/async/async.c:224
11 0x00000000000128bd ucs_async_handler_dispatch()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/async/async.c:242
12 0x00000000000129a0 ucs_async_dispatch_handlers()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/async/async.c:271
13 0x0000000000012b0c ucs_async_dispatch_timerq()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/async/async.c:298
14 0x000000000001581d ucs_async_thread_func()  /tmp/vsc40023/easybuild_build/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/async/thread.c:138
15 0x0000000000007ea5 start_thread()  pthread_create.c:0
16 0x00000000000fe9fd __clone()  ???:0
=================================
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source
dirac.x            00000000038C362B  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002AB247801630  Unknown               Unknown  Unknown
libc-2.17.so       00002AB249E34387  gsignal               Unknown  Unknown
libc-2.17.so       00002AB249E35A78  abort                 Unknown  Unknown
libucs.so.0.0.0    00002AB24733A185  ucs_fatal_error_m     Unknown  Unknown
libucs.so.0.0.0    00002AB24733A2FE  ucs_fatal_error_f     Unknown  Unknown
libucp.so.0.0.0    00002AB24728BC38  ucp_address_unpac     Unknown  Unknown
libucp.so.0.0.0    00002AB2472951EF  Unknown               Unknown  Unknown
libuct_ib.so.0.0.  00002AB24BE9DA18  uct_ud_ep_process     Unknown  Unknown
libuct_ib.so.0.0.  00002AB24BEA7DB1  Unknown               Unknown  Unknown
libuct_ib.so.0.0.  00002AB24BE9B6D8  Unknown               Unknown  Unknown
libucs.so.0.0.0    00002AB24732C699  Unknown               Unknown  Unknown
libucs.so.0.0.0    00002AB24732C8BD  Unknown               Unknown  Unknown
libucs.so.0.0.0    00002AB24732C9A0  ucs_async_dispatc     Unknown  Unknown
libucs.so.0.0.0    00002AB24732CB0C  ucs_async_dispatc     Unknown  Unknown
libucs.so.0.0.0    00002AB24732F81D  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AB2477F9EA5  Unknown               Unknown  Unknown
libc-2.17.so       00002AB249EFC9FD  clone                 Unknown  Unknown

directory: /tmp/vsc40023/easybuild_build/DIRAC/19.0/intel-2020a-Python-2.7.18-mpi-int64/easybuild_obj/test/open-shells
   inputs: CH.lsym.mol  &  CH.x2c.scf_mj.2Pi12.inp

Perhaps related to #13628 ?

Member

@boegel boegel left a comment

lgtm

@boegel
Member

boegel commented Aug 5, 2021

Going in, thanks @Flamefire!

@boegel boegel merged commit 2f97a76 into easybuilders:develop Aug 5, 2021
@boegel boegel modified the milestones: 4.x, next release (4.4.2?) Aug 5, 2021
@bartoldeman
Contributor

Doesn't look related at first sight. You never know, but this seems more like an issue between UCX and the lower-level network libraries/drivers.

@boegel
Member

boegel commented Aug 5, 2021

It's indeed an unrelated issue: the problem remains even after rebuilding GCCcore and UCX using the patch from #13628.

@Flamefire Flamefire deleted the 20210805090435_new_pr_DIRAC190 branch August 6, 2021 07:09