
Conversation

@patrickscholz (Contributor) commented Jan 28, 2025

  • fix some issues where pointers were associated although the target vector/array was not allocated yet, which triggers an error with the extended GNU compile options; this led to a couple of additions in src/associate_mesh_ass.h and src/associate_part_ass.h (see also the combined Fortran sketch after this list), e.g.
if (allocated(partit%remList_elem2D)) then
 ...
  • make the juwels compiler settings work with Stages/2025
  • add additional debug compiler options for GNU to src/CMakeLists.txt (a sketch of typical flags follows after this list)
  • The additional compiler flags occasionally revealed the problem that communication arrays were not recognized as contiguous in memory, so the compiler wanted to make a temporary copy of them before communication; this led to a couple of additions in gen_halo_exchange.F90 (also covered in the sketch after this list), e.g.
! --> old
ELSE
   call MPI_SEND( arr2D, myDim_nod2D, MPI_DOUBLE_PRECISION, 0, 2, MPI_COMM_FESOM, MPIerr )
ENDIF
! --> new
ELSE
   call MPI_SEND( arr2D(1:myDim_nod2D), myDim_nod2D, MPI_DOUBLE_PRECISION, 0, 2, MPI_COMM_FESOM, MPIerr )
ENDIF
  • in ice_thermo_oce.F90, fix an issue with the initialisation of the variable rsf when using linfs; otherwise it leads to problems when all arrays are initialised with NaNs by the corresponding GNU compiler flag

  • fix an index issue in oce_spp.F90 where brine rejection was written into the bottom topography, which also led to NaNs in the bottom topography and triggered the NaN checker

  • Mystery issue checked with LLview on Juwels:
    (mesh A0_40, 11.5M vertices, 69 levels, runs on 4800 CPUs on Juwels)
    [screenshot: LLview memory usage per compute node]
    Occasionally one compute node on Juwels seems to require about 4x more memory than any of the other compute nodes. This issue is not consistently reproducible. I assume it is somehow related to the I/O system, which might be the reason for the OOM (out of memory) errors that Vasco encountered with his setup on Juwels. We have to keep an eye on this, and also on what happens on other machines.
    I think this is how the RAM usage should look if everything works as it should ...
    [screenshot: LLview memory usage evenly distributed across compute nodes]

  • Improve Juwels environment file

  • fix and test juwels GNU and Intel compiler flags for hopefully optimal performance
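
The two Fortran patterns from the list above, guarding pointer association with allocated() and handing an explicit array section to MPI, combined in a minimal, self-contained sketch. Only partit%remList_elem2D, arr2D, and myDim_nod2D follow the snippets above; everything else (the derived type, remPtr_elem2D, the sizes, MPI_COMM_WORLD instead of MPI_COMM_FESOM) is made up for illustration and does not reflect FESOM2's real data structures.

! Minimal sketch (illustrative only): guard pointer association with
! allocated() and pass an explicit, contiguous array section to MPI_SEND.
program assoc_and_send_sketch
   use mpi
   implicit none

   type t_partit
      integer, allocatable :: remList_elem2D(:)
   end type t_partit

   type(t_partit), target    :: partit
   integer, pointer          :: remPtr_elem2D(:) => null()
   real(kind=8), allocatable :: arr2D(:)
   integer :: myDim_nod2D, mype, npes, MPIerr

   call MPI_INIT(MPIerr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, mype, MPIerr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, npes, MPIerr)

   ! (1) Associate the pointer only if the target array has been allocated;
   !     with the extended GNU checks an unguarded association of an
   !     unallocated array aborts at runtime.
   if (allocated(partit%remList_elem2D)) then
      remPtr_elem2D => partit%remList_elem2D
   end if

   ! (2) Send the explicit section arr2D(1:myDim_nod2D) instead of the bare
   !     array, mirroring the gen_halo_exchange.F90 change above, so that a
   !     well-defined contiguous block is handed to MPI.
   myDim_nod2D = 10
   allocate(arr2D(myDim_nod2D + 5))
   arr2D = real(mype, kind=8)
   if (mype == 1) then
      call MPI_SEND(arr2D(1:myDim_nod2D), myDim_nod2D, MPI_DOUBLE_PRECISION, &
                    0, 2, MPI_COMM_WORLD, MPIerr)
   else if (mype == 0 .and. npes > 1) then
      call MPI_RECV(arr2D(1:myDim_nod2D), myDim_nod2D, MPI_DOUBLE_PRECISION, &
                    1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE, MPIerr)
   end if

   call MPI_FINALIZE(MPIerr)
end program assoc_and_send_sketch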
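
For the additional GNU debug options, a typical set of gfortran flags would look like the sketch below. These are standard gfortran options, not necessarily the exact ones added to src/CMakeLists.txt in this PR, and the FESOM_DEBUG guard variable is hypothetical. -finit-real=nan is the flag that makes uninitialised variables such as rsf show up as NaNs, and -fcheck=all is what catches accesses to unallocated arrays.

# Sketch of typical gfortran debug flags (FESOM_DEBUG is a hypothetical
# guard; the exact flags added in this PR may differ):
if(${CMAKE_Fortran_COMPILER_ID} STREQUAL GNU AND FESOM_DEBUG)
   target_compile_options(${PROJECT_NAME} PRIVATE
      -g -fbacktrace                        # symbols and backtrace on abort
      -fcheck=all                           # bounds, pointer, allocation checks
      -ffpe-trap=invalid,zero,overflow      # trap floating point exceptions
      -finit-real=nan -finit-integer=-9999  # poison uninitialised variables
      -Wall)
endif()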

scholz6 and others added 17 commits January 28, 2025 12:28
…if variables are already allocated before an array pointer into that variable is associated, otherwise the extended GNU compiler options trigger an error
…ly where the GNU compiler complained about not recognizing contiguous arrays for the MPI communication
@JanStreffing (Collaborator)

FYI, @ufukozkan and I recently also put together a Stages/2025 environment with the Intel icx compiler and ParaStationMPI on Juwels that works. This was in esm_tools, but it would be easy to add the resulting env.sh to fesom2.

@patrickscholz (Contributor, Author)

@JanStreffing: I tried this as well, but I had problems resolving some MPI dependencies, which led to a compiler error in FESOM2. Did you try to compile FESOM2.6 with this?

patrickscholz marked this pull request as ready for review February 3, 2025 11:11
@JanStreffing (Collaborator)

Yes 2.6.5

@JanStreffing (Collaborator)

Here is what we came up with as an environment file. Obviously some things here are not needed for FESOM and are there for other parts of AWI-CM3:

#!/usr/bin/bash
# ENVIRONMENT used in test_960_v5_checking_for_oasis_compute_18500101-18500101.run
# Use this file to source the environment in your
# preprocessing or postprocessing scripts

module purge
module load Stages/2025
module load Intel/2024.2.0
module load ParaStationMPI/5.10.0-1
module load CMake/3.29.3
module load Python/3.12.3
module load imkl/2024.2.0
module load Perl/5.38.2
module load Perl-bundle-CPAN/5.38.2
module load git/2.45.1
module load libaec FFTW cURL netCDF netCDF-Fortran ecCodes CDO NCO
module list

export LC_ALL=en_US.UTF-8
export TMPDIR=/tmp
export FC=mpifort
export F77=mpifort
export MPIFC=mpifort
export FCFLAGS=-free
export CC=mpicc
export CXX=mpic++
export MPIROOT="$($FC -show | perl -lne 'm{ -I(.*?)/include } and print $1')"
export MPI_LIB="$($FC -show |sed -e 's/^[^ ]*//' -e 's/-[I][^ ]*//g')"
export AEC_ROOT=$EBROOTLIBAEC
export SZIPROOT=$EBROOTLIBAEC
export HDF5ROOT=$EBROOTHDF5
export HDF5_ROOT=$EBROOTHDF5
export NETCDFROOT=$EBROOTNETCDF
export NETCDFFROOT=$EBROOTNETCDFMINFORTRAN
export ECCODESROOT=$EBROOTECCODES
export HDF5_C_INCLUDE_DIRECTORIES=$HDF5_ROOT/include
export NETCDF_Fortran_INCLUDE_DIRECTORIES=$NETCDFFROOT/include
export NETCDF_C_INCLUDE_DIRECTORIES=$NETCDFROOT/include
export NETCDF_CXX_INCLUDE_DIRECTORIES=$NETCDFROOT/include
export OASIS3MCT_FC_LIB="-L$NETCDFFROOT/lib -lnetcdff"
export PERL5LIB=/p/project/chhb19/HPC_libraries/perl5/lib/perl5
export PERL5_PATH=$PERL5LIB
export PERL5OPT=-Mwarnings=FATAL,uninitialized
export MKL_CBWR=AUTO,STRICT
export LD_RUN_PATH=$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/p/scratch/cesmtst/oezkan1/runtime/awicm3-v3.3//test_960_v5_checking_for_oasis/run_18500101-18500101/work//lib/fesom/
export OIFS_FFIXED=""
export GRIB_SAMPLES_PATH="$ECCODESROOT/share/eccodes/ifs_samples/grib1_mlgrib2/"
export DR_HOOK_IGNORE_SIGNALS=-1
export OMP_SCHEDULE=STATIC
export OMP_STACKSIZE=128M
export MAIN_LDFLAGS=-openmp
export USER=oezkan1
export FESOM_USE_CPLNG="active"
export ECE_CPL_NEMO_LIM="false"
export ECE_CPL_FESOM_FESIM="true"
export ECE_AWI_CPL_FESOM="true"
export ENVIRONMENT_SET_BY_ESMTOOLS=TRUE

unset SLURM_DISTRIBUTION
unset SLURM_NTASKS
unset SLURM_NPROCS
unset SLURM_ARBITRARY_NODELIST

@patrickscholz (Contributor, Author) commented Feb 3, 2025

@JanStreffing: You are right, compiling with Intel on JUWELS works, but only when you prescribe the compiler environment variables:
export FC=mpifort
export F77=mpifort
export MPIFC=mpifort
export FCFLAGS=-free
export CC=mpicc
export CXX=mpic++
... I always assumed that they would be set automatically by loading the module environment.
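
For reference, a minimal sketch of the ordering that matters here (assuming a plain out-of-source CMake build; the actual FESOM2 build may go through a wrapper script, and the module list is shortened):

module purge
module load Stages/2025 Intel/2024.2.0 ParaStationMPI/5.10.0-1 CMake/3.29.3

# the compiler variables are not set by the modules and must be exported by hand
export FC=mpifort F77=mpifort MPIFC=mpifort FCFLAGS=-free
export CC=mpicc CXX=mpic++

cmake -S . -B build
cmake --build build -j 8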

@patrickscholz (Contributor, Author) commented Feb 3, 2025

@JanStreffing, @dsidoren: I just made a small test run on juwels for the GCC and Intel compilers (core2 mesh, 192 CPUs, simulating 1 month, with I/O and restart writing). It turns out that a GNU-compiled FESOM2 is faster on JUWELS by a factor of 1.3 than an Intel-compiled FESOM2.

Total runtime (1 month, core2):
- GCC compiler: 533.52 sec
- Intel compiler: 723.8 sec

... not sure if that holds for large meshes as well!

@patrickscholz (Contributor, Author)

@JanStreffing, @dsidoren: Also tried the large AO40 mesh from Vasco (11.5M vertices, 69 levels, 4800 CPUs, 100 compute nodes, simulated 100 steps with one mean I/O write at the end):

GCC Compiler: 150.7 sec
Intel Compiler: 250.4 sec

GCC speedup by a factor of 1.66!!!

@JanStreffing (Collaborator)

Can you try once with ParaStationMPI-mt?

@patrickscholz (Contributor, Author) commented Feb 3, 2025

@JanStreffing: I just noticed that, I think, we don't have any optimization enabled for Intel on Juwels ...

if(${CMAKE_Fortran_COMPILER_ID} STREQUAL  Intel )
   target_compile_options(${PROJECT_NAME} PRIVATE -r8 -i4 -fp-model precise -no-prec-div -no-prec-sqrt -fimf-use-svml -ip -init=zero -no-wrap-margin -fpe0) # add -fpe0 for RAPS environment
   if(${FESOM_PLATFORM_STRATEGY} STREQUAL  levante.dkrz.de )
      target_compile_options(${PROJECT_NAME} PRIVATE -march=core-avx2 -mtune=core-avx2)
   elseif(${FESOM_PLATFORM_STRATEGY} STREQUAL leo-dcgp )
      target_compile_options(${PROJECT_NAME} PRIVATE -O3 -xCORE-AVX512 -qopt-zmm-usage=high -align array64byte -ipo)
   elseif(${FESOM_PLATFORM_STRATEGY} STREQUAL mn5-gpp )
      target_compile_options(${PROJECT_NAME} PRIVATE -O3 -xCORE-AVX512 -qopt-zmm-usage=high -align array64byte -ipo)
   elseif(${FESOM_PLATFORM_STRATEGY} STREQUAL  albedo)
      target_compile_options(${PROJECT_NAME} PRIVATE -march=core-avx2 -O3 -ip -fPIC -qopt-malloc-options=2 -qopt-prefetch=5 -unroll-aggressive) # -g -traceback -check) #NEC mpi option
   elseif(${FESOM_PLATFORM_STRATEGY} STREQUAL atosecmwf )
      target_compile_options(${PROJECT_NAME} PRIVATE -march=core-avx2 -mtune=core-avx2)
   else()
      target_compile_options(${PROJECT_NAME} PRIVATE -xHost)
   endif()
   

@patrickscholz (Contributor, Author)

AO40 mesh, 4800 CPUs, runtime for 100 steps:
- GCC/OpenMPI (-O2, ...): 150.7 sec
- Intel/ParaStationMPI (-xHost): 250.4 sec
- Intel/ParaStationMPI-mt (-xHost): 187.4 sec
- Intel/ParaStationMPI (-O3 -xCORE-AVX512 ...): 249.8 sec
- Intel/ParaStationMPI-mt (-O3 -xCORE-AVX512 ...): 264.3 sec

... really weird behavior, I need to play around a bit more!

@patrickscholz (Contributor, Author) commented Feb 5, 2025

@JanStreffing, @dsidoren, @suvarchal

  • Core2 mesh, 192 CPUs @ juwels, simulated 1 month with one mean I/O write

| Compiler / MPI | Options | Runtime [sec] (core2, 1 month, 192 CPUs @ juwels) |
|---|---|---|
| GCC/OpenMPI | (none) | 390 |
| GCC/OpenMPI | -O2 | 192 |
| GCC/OpenMPI | -O3 -march=skylake-avx512 -mtune=skylake-avx512 -mprefer-vector-width=512 -falign-loops=64 -falign-functions=64 -falign-jumps=64 (ChatGPT recommendation) | 173 |
| Intel/ParaStationMPI | (none) | 309 |
| Intel/ParaStationMPI | -O2 | 114 |
| Intel/ParaStationMPI | -O3 | 115 |
| Intel/ParaStationMPI | -O3 -xCORE-AVX512 | 117 |
| Intel/ParaStationMPI | -O3 -xCORE-AVX512 -qopt-zmm-usage=high -align array64byte | 119 |
| Intel/ParaStationMPI | -O2 -xCORE-AVX2 | 114 |
| Intel/ParaStationMPI | -O3 -xCORE-AVX2 | 112 |
| Intel/ParaStationMPI | -O3 -xCORE-AVX2 -qopt-streaming-stores=always | 126 |
| Intel/ParaStationMPI | -O3 -xCORE-AVX2 -qopt-prefetch=5 | 116 |
| Intel/ParaStationMPI | -O3 -xCORE-AVX2 -funroll-loops | 113 |
| Intel/ParaStationMPI-mt | -O3 -xCORE-AVX2 | 113 |

- Summary for Juwels performance: Intel/ParaStationMPI with -O3 -xCORE-AVX2 is the fastest option

- PS: It looks like we had no -Ox optimization activated for Levante so far. I changed that with this pull request!

- PPS: asynchronous multithreading doesn't work on juwels either
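
Based on this, a juwels branch in the Intel block of src/CMakeLists.txt quoted further up could look roughly like the fragment below; the platform strategy name "juwels" is an assumption, only the flag choice follows from the timing table above.

   # hypothetical addition to the existing elseif chain for Intel
   elseif(${FESOM_PLATFORM_STRATEGY} STREQUAL juwels )
      # fastest combination in the core2 and AO40 tests above
      target_compile_options(${PROJECT_NAME} PRIVATE -O3 -xCORE-AVX2)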

@JanStreffing (Collaborator)

Good work. @ufukozkan, maybe you can try this on juwels with AWI-CM3 v3.3?

patrickscholz merged commit ec14775 into main Feb 5, 2025
4 checks passed