Description
On leconte, SVD has a bad backward error (~1e-3 in the runs below) with 8 ranks / 8 GPUs, for both target host and target device.
Expected backward error is ~1e-15. Runs with 1, 2, or 4 ranks worked.
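To make the symptom concrete, here is a minimal sketch of the two kinds of check involved: a backward error of the form ‖A − U·diag(S)·Vᵀ‖ / ‖A‖ and orthogonality errors ‖UᵀU − I‖ and ‖VᵀV − I‖. The exact normalization the SLATE tester uses is not stated in this report, so these formulas are an assumption, and every name below is hypothetical. The sketch builds a 2×2 matrix from known factors, then perturbs one singular value by a relative 1e-3, which reproduces the pattern seen here: U and V stay orthogonal to machine precision while the backward error jumps to ~1e-3.

```python
import math

def matmul(X, Y):
    # Plain dense matrix product for small list-of-lists matrices.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def fro(X):
    # Frobenius norm.
    return math.sqrt(sum(x * x for row in X for x in row))

def rotation(theta):
    # 2x2 rotation: an exactly orthogonal factor up to rounding.
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def scale_columns(U, S):
    # U * diag(S)
    return [[U[i][j] * S[j] for j in range(len(S))] for i in range(len(U))]

def svd_errors(A, U, S, V):
    """Return (backward, U orth., V orth.) for real factors A ~= U diag(S) V^T.

    Assumed definitions, not necessarily the tester's normalization.
    """
    R = matmul(scale_columns(U, S), transpose(V))
    diff = [[a - r for a, r in zip(ra, rr)] for ra, rr in zip(A, R)]
    backward = fro(diff) / fro(A)

    def orth(Q):
        G = matmul(transpose(Q), Q)
        n = len(G)
        return fro([[G[i][j] - (i == j) for j in range(n)] for i in range(n)])

    return backward, orth(U), orth(V)

U, V, S = rotation(0.3), rotation(1.1), [3.0, 0.5]
A = matmul(scale_columns(U, S), transpose(V))

good = svd_errors(A, U, S, V)
# Same orthogonal factors, but one singular value is off by a relative 1e-3.
bad = svd_errors(A, U, [S[0] * (1 + 1e-3), S[1]], V)
print("good:", good)
print("bad: ", bad)
```

Under these assumed definitions, good orthogonality with a large backward error means U and V are each fine on their own but the triple (U, S, V) does not reconstruct A.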
Steps To Reproduce
mpirun -np 8 ./tester --origin h --target h --jobu s --jobvt s --dim 1234 --dim 1k,2k,4k,8k,16k --ref n --nb 128,192,256,320 svd
% SLATE version 2023.08.25, id 57ea922b
% 2023-09-01 11:41:53, 8 MPI ranks, CPU-only MPI, 7 OpenMP threads per MPI rank
type origin target A jobu jobvt m n nb ib p q la pt S - Sref Backward U orth. V orth. time (s) ref time (s) status
d host host 1 some some 1234 1234 128 32 2 4 1 3 NA 1.91e-03 1.80e-16 1.90e-16 1.659 NA FAILED
d host host 1 some some 1234 1234 192 32 2 4 1 3 NA 1.81e-03 1.85e-16 1.94e-16 1.974 NA FAILED
...
mpirun -np 8 ./bind_gpus.sh ./tester --origin d --target d --jobu s --jobvt s --dim 1234 --dim 1k,2k,4k,8k,16k --ref n --nb 128,192,256,320 svd
% SLATE version 2023.08.25, id 57ea922b
% 2023-09-01 07:30:57, 8 MPI ranks, CPU-only MPI, 7 OpenMP threads, 1 GPU devices per MPI rank
type origin target A jobu jobvt m n nb ib p q la pt S - Sref Backward U orth. V orth. time (s) ref time (s) status
d dev dev 1 some some 1234 1234 128 32 2 4 1 3 NA 1.87e-03 1.89e-16 1.98e-16 1.854 NA FAILED
d dev dev 1 some some 1234 1234 192 32 2 4 1 3 NA 1.82e-03 1.79e-16 1.94e-16 1.992 NA FAILED
...
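When sweeping over rank counts and block sizes, it can help to pull the offending rows out of the tester output automatically. The following is a rough sketch that assumes result rows have exactly the 21 whitespace-separated fields shown above, with `nb`, `p`, `q`, and `Backward` at positions 8, 10, 11, and 15 (0-based). The function name and the `tol` threshold are hypothetical and not part of SLATE.

```python
def large_backward_rows(text, tol=1e-12):
    """Return (nb, p, q, backward, status) for rows whose backward error exceeds tol.

    Assumes the 21-field row layout seen in this report; header and
    '%' comment lines are skipped because they do not match it.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 21 or line.lstrip().startswith("%"):
            continue
        try:
            nb, p, q = int(fields[8]), int(fields[10]), int(fields[11])
            backward = float(fields[15])
        except ValueError:
            continue  # not a result row (or backward error reported as NA)
        if backward > tol:
            hits.append((nb, p, q, backward, fields[20]))
    return hits

sample = """\
% SLATE version 2023.08.25, id 57ea922b
type origin target A jobu jobvt m n nb ib p q la pt S - Sref Backward U orth. V orth. time (s) ref time (s) status
d host host 1 some some 1234 1234 128 32 2 4 1 3 NA 1.91e-03 1.80e-16 1.90e-16 1.659 NA FAILED
d host host 1 some some 1234 1234 192 32 2 4 1 3 NA 1.81e-03 1.85e-16 1.94e-16 1.974 NA FAILED
"""

rows = large_backward_rows(sample)
for row in rows:
    print(row)
```

In practice the tester output could be piped into a script like this, for example by reading `sys.stdin.read()` instead of the embedded sample.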
Environment
The more information that you can provide about your environment, the simpler it is for us to understand and reproduce the issue.
- SLATE version / commit ID (e.g., git log --oneline -n 1): 57ea922 (HEAD -> release, tag: v2023.08.25, github/master) Version 2023.08.25
- How installed:
- git clone
- release tar file
- Spack
- module
- How compiled:
- makefile (include your make.inc)
sh leconte test> cat ../make.inc
CXXFLAGS = -Werror -Dslate_omp_default_none='default(none)'
CXX = mpicxx
CC = mpicc
FC = mpif90
blas = mkl
mkl_blacs = intelmpi
blas_threaded = 1
- CMake (include your command line options)
- Compiler & version (e.g., mpicxx --version):
- BLAS library (e.g., MKL, ESSL, OpenBLAS) & version:
- CUDA / ROCm / oneMKL version (e.g., nvcc --version):
- MPI library & version (MPICH, Open MPI, Intel MPI, IBM Spectrum, Cray MPI, etc. Sometimes mpicxx -v gives info.):
- OS:
- Hardware (CPUs, GPUs, nodes): leconte, DGX, 8x V100
sh leconte test> module -t list
Currently Loaded Modulefiles:
gdbm/1.23/gcc-11.3.1-fhrtav
ncurses/6.4/gcc-11.3.1-vbhesx
sqlite/3.42.0/gcc-11.3.1-5ijskr
python/3.10.10/gcc-11.3.1-adtfss
perl/5.36.0/gcc-11.3.1-lhqic5
git/2.40.0/gcc-11.3.1-zem5da
cmake/3.26.3/gcc-11.3.1-6wjjvi
htop/3.2.2/gcc-11.3.1-rtm7gj
env-basic
gcc/11.4.0/gcc-11.3.1-rony5z
intel-oneapi-mkl/2023.1.0/gcc-11.3.1-qq2eto
intel-oneapi-mpi/2021.9.0/gcc-11.3.1-h77w4m
cuda/12.1.1/gcc-11.3.1-h7ttz4