error with SVD #106

@mgates3

Description

On leconte, SVD has a bad backward error (~1.8e-03 to 1.9e-03) with 8 ranks / 8 GPUs, for both host and device targets.
Expected backward error is ~1e-15. Using 1, 2, or 4 ranks worked.
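
For reference, the error columns in the output below are presumably the usual scaled residuals for an SVD $A = U \Sigma V^H$ with $n$ columns. The exact norm and scaling used by the tester are an assumption here, not taken from the SLATE source:

```math
\text{Backward} = \frac{\lVert A - U \Sigma V^H \rVert}{n \, \lVert A \rVert},
\qquad
\text{U orth.} = \frac{\lVert I - U^H U \rVert}{n},
\qquad
\text{V orth.} = \frac{\lVert I - V^H V \rVert}{n}
```

Under that reading, orthogonality at ~2e-16 alongside a backward error at ~2e-03 would mean U and V are each orthonormal to working precision, but the product $U \Sigma V^H$ does not reconstruct $A$.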

Steps To Reproduce

mpirun -np 8 ./tester --origin h --target h --jobu s --jobvt s --dim 1234 --dim 1k,2k,4k,8k,16k --ref n --nb 128,192,256,320 svd
% SLATE version 2023.08.25, id 57ea922b
% 2023-09-01 11:41:53, 8 MPI ranks, CPU-only MPI, 7 OpenMP threads per MPI rank
type  origin  target   A       jobu      jobvt       m       n    nb  ib    p    q  la  pt   S - Sref   Backward    U orth.    V orth.   time (s)  ref time (s)  status  
   d    host    host   1       some       some    1234    1234   128  32    2    4   1   3         NA   1.91e-03   1.80e-16   1.90e-16      1.659            NA  FAILED  
   d    host    host   1       some       some    1234    1234   192  32    2    4   1   3         NA   1.81e-03   1.85e-16   1.94e-16      1.974            NA  FAILED  
...

mpirun -np 8 ./bind_gpus.sh ./tester --origin d --target d --jobu s --jobvt s --dim 1234 --dim 1k,2k,4k,8k,16k --ref n --nb 128,192,256,320 svd
% SLATE version 2023.08.25, id 57ea922b
% 2023-09-01 07:30:57, 8 MPI ranks, CPU-only MPI, 7 OpenMP threads, 1 GPU devices per MPI rank
type  origin  target   A       jobu      jobvt       m       n    nb  ib    p    q  la  pt   S - Sref   Backward    U orth.    V orth.   time (s)  ref time (s)  status  
   d     dev     dev   1       some       some    1234    1234   128  32    2    4   1   3         NA   1.87e-03   1.89e-16   1.98e-16      1.854            NA  FAILED  
   d     dev     dev   1       some       some    1234    1234   192  32    2    4   1   3         NA   1.82e-03   1.79e-16   1.94e-16      1.992            NA  FAILED  
...
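
To compare runs across rank counts, the rows above can be scanned for a large backward error. This is a minimal sketch, assuming the 21-token row layout shown in this issue. The column labels, the function name `flag_large_backward_error`, and the `1e-12` tolerance are all choices made for this example and are not part of SLATE:

```python
# Labels for the whitespace-separated tokens of one `svd` tester data row,
# as it appears in this issue. These names are this sketch's own.
COLUMNS = [
    "type", "origin", "target", "A", "jobu", "jobvt", "m", "n", "nb", "ib",
    "p", "q", "la", "pt", "s_err", "backward", "u_orth", "v_orth",
    "time", "ref_time", "status",
]


def flag_large_backward_error(text, tol=1e-12):
    """Return (p, q, nb, backward) for each data row with backward > tol.

    tol is an assumed threshold, chosen well above the expected ~1e-15.
    """
    flagged = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) != len(COLUMNS):
            continue  # '%' comment lines, the multi-word header, '...'
        row = dict(zip(COLUMNS, tokens))
        try:
            backward = float(row["backward"])
            p, q, nb = int(row["p"]), int(row["q"]), int(row["nb"])
        except ValueError:
            continue  # not a data row (e.g. a command line), or value is 'NA'
        if backward > tol:
            flagged.append((p, q, nb, backward))
    return flagged


# Two rows copied from the host run in this issue.
SAMPLE = """\
   d    host    host   1       some       some    1234    1234   128  32    2    4   1   3         NA   1.91e-03   1.80e-16   1.90e-16      1.659            NA  FAILED
   d    host    host   1       some       some    1234    1234   192  32    2    4   1   3         NA   1.81e-03   1.85e-16   1.94e-16      1.974            NA  FAILED
"""

for p, q, nb, backward in flag_large_backward_error(SAMPLE):
    print(f"grid {p}x{q}, nb={nb}: backward error {backward:.2e}")
```

Running the same check on saved output from the 1, 2, and 4 rank runs would show which process grids are affected.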

Environment
The more information that you can provide about your environment, the simpler it is for us to understand and reproduce the issue.

  • SLATE version / commit ID (e.g., git log --oneline -n 1):
  • 57ea922 (HEAD -> release, tag: v2023.08.25, github/master) Version 2023.08.25
  • How installed:
    • git clone
    • release tar file
    • Spack
    • module
  • How compiled:
    • makefile (include your make.inc)
leconte test> cat ../make.inc
CXXFLAGS = -Werror -Dslate_omp_default_none='default(none)'
CXX      = mpicxx
CC       = mpicc
FC       = mpif90
blas     = mkl
mkl_blacs = intelmpi
blas_threaded = 1
  • CMake (include your command line options)
  • Compiler & version (e.g., mpicxx --version):
  • BLAS library (e.g., MKL, ESSL, OpenBLAS) & version:
  • CUDA / ROCm / oneMKL version (e.g., nvcc --version):
  • MPI library & version (MPICH, Open MPI, Intel MPI, IBM Spectrum, Cray MPI, etc. Sometimes mpicxx -v gives info.):
  • OS:
  • Hardware (CPUs, GPUs, nodes): leconte, DGX, 8x V100
leconte test> module -t list
Currently Loaded Modulefiles:
gdbm/1.23/gcc-11.3.1-fhrtav
ncurses/6.4/gcc-11.3.1-vbhesx
sqlite/3.42.0/gcc-11.3.1-5ijskr
python/3.10.10/gcc-11.3.1-adtfss
perl/5.36.0/gcc-11.3.1-lhqic5
git/2.40.0/gcc-11.3.1-zem5da
cmake/3.26.3/gcc-11.3.1-6wjjvi
htop/3.2.2/gcc-11.3.1-rtm7gj
env-basic
gcc/11.4.0/gcc-11.3.1-rony5z
intel-oneapi-mkl/2023.1.0/gcc-11.3.1-qq2eto
intel-oneapi-mpi/2021.9.0/gcc-11.3.1-h77w4m
cuda/12.1.1/gcc-11.3.1-h7ttz4
