
## Node validator test suite

Single-node checks to be performed at boot time and upon node re-configuration.

- Tag: `vs-node-validator`
| Category | Description | Duration | Test name |
|---|---|---|---|
| HW check | Run dgemm on all GPUs and CPUs (*) | 30s | node-burn-ce.py |
| UENV | Check if `uenv status` runs | 5s | uenv_status.py |
| | Build C++ hello world | | TODO |
| Integration tests | Check for installed packages (list) | 5s | PackagePresentTest |
| | Check if mount points (main.tf) are present | | MountPointExistsTest |
| | Check if environment vars are correctly set | | EnvVariableConfigTest |
| | Check if proxy config is correctly set | | ProxyConfigTest |
| Slurm | Selected Epilog and Prolog tests | 5s | 002-slowlink, 006-bugs.sh, 020-gpumem.sh |
- Remark: (*) Running tests such as node-burn locally (without Slurm) would require maintaining the same test twice; for the time being, it was therefore agreed to find an alternative solution for these tests.
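The integration checks in the table above (package, mount point, and environment-variable presence) reduce to a few filesystem and environment probes. Below is a minimal standalone sketch of that idea, assuming the expected paths, variables, and package names come from configuration; the real checks are the separate PackagePresentTest, MountPointExistsTest, and EnvVariableConfigTest classes, and this code is illustrative only.

```python
import os
import shutil

def check_mount_points(paths):
    """Return the subset of expected mount points that are not mounted."""
    return [p for p in paths if not os.path.ismount(p)]

def check_env_vars(expected):
    """Return {name: actual_value} for env vars that differ from the expected value."""
    return {k: os.environ.get(k) for k, v in expected.items()
            if os.environ.get(k) != v}

def check_packages(names):
    """Return packages whose executables are not found on PATH."""
    return [n for n in names if shutil.which(n) is None]

if __name__ == "__main__":
    # Hypothetical expectations for demonstration purposes only.
    print("missing mounts:", check_mount_points(["/"]))
    print("missing packages:", check_packages(["sh"]))
```

Each function returns the failing items, so an empty result means the check passed; a wrapper can aggregate these into a single boot-time pass/fail verdict.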

## Node vetting (appscheckout) test suite

- Single and multi-node checks to be performed after interventions
  - To be executed against all nodes in a reservation (such as appscheckout or maintenance)
  - Tag: `appscheckout` (veto mode)
| Category | Description | All nodes | Duration | Test name |
|---|---|---|---|---|
| HW check | Run dgemm on all GPUs and CPUs | Y | 1min | node-burn-ce.py |
| | Stream (memory bandwidth test) | Y | TODO | node-burn-ce.py |
| Network | Simple MPI (CPI) | Y | TODO | mpi_cpi.py |
| | OSU all-to-all | Y | TODO | TODO |
| | NCCL allreduce (2min) | Y | TODO | pytorch_allreduce.py |
| | Network bandwidth between GPUs (per node) | Y | TODO | cxi_gpu_loopback_bw.py |
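The Stream row above measures sustained memory bandwidth. The actual measurement belongs to node-burn-ce.py; the underlying idea can be sketched with a plain NumPy copy kernel, a simplified CPU-only stand-in for the real test:

```python
import time
import numpy as np

def copy_bandwidth_gbs(n=20_000_000, repeats=5):
    """Best-of-N bandwidth of a double-precision array copy, in GB/s."""
    src = np.random.rand(n)
    dst = np.empty_like(src)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.copyto(dst, src)          # streams n doubles in, n doubles out
        best = min(best, time.perf_counter() - t0)
    return 2 * n * 8 / best / 1e9    # bytes read + bytes written

if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_gbs():.1f} GB/s")
```

Taking the best of several repeats filters out warm-up and scheduling noise, the same reason STREAM reports the fastest iteration rather than the mean.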

## Production test suite

Single and multi-node checks to be performed regularly (nightly) in production using a subset of nodes.

| Category | Description | Test name |
|---|---|---|
| Apps | CP2K | cp2k_uenv.py |
| | Gromacs | gromacs_check.py |
| | LAMMPS | lammps.py |
| | PyTorch | pytorch_allreduce.py, pytorch_nvidia.py |
| | QuantumEspresso | quantumespresso_check_uenv.py |
| Microbenchmarks | Memory allocation speed | alloc_speed.py |
| Libraries | dlaf | dlaf.py |
| Programming environment | Build Hello World (C/C++/F) | helloworld.py |
| | CUDA samples (?) | cuda_samples.py |
| | Check that NVML can report GPU information | cuda_nvml.py |
| | Affinity | affinity_check.py |
| | Test if multi-threaded MPI works | mpi.py |
| Config/Integration | Slurm | slurm.py |
| | Slurm: partitions correspond to TF definition | TODO |
| | Slurm: Transparent Hugepages | slurm.py |
| | Slurm: number of nodes available per partition | SlurmQueueStatusCheck |
| | Slurm: check if Gres is properly configured | SlurmGPUGres |
| | Slurm: new features | TODO |
| Containers | Test OSU benchmarks with CE | OMB_MPICH_CE, OMB_OMPI_CE |
| | Stream benchmark with CE | RunNVGPUJobCE - ce_import_run_image.py |
| | Verify that a simple container runs | RunJobCE - ce_import_run_image.py |
| | Test SSH to a container | ssh.py |
| | CUDA nbody with CE | check_cuda_nbody.py |
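A check like SlurmQueueStatusCheck amounts to counting available nodes per partition. Here is a hypothetical sketch of that parsing step, fed with canned `sinfo -h -o "%P %D %T"` output rather than a live query; the real test's logic and thresholds may differ.

```python
def parse_sinfo(output):
    """Sum node counts per (partition, state) from sinfo-style lines."""
    counts = {}
    for line in output.strip().splitlines():
        partition, nodes, state = line.split()
        key = (partition.rstrip("*"), state)  # '*' marks the default partition
        counts[key] = counts.get(key, 0) + int(nodes)
    return counts

# Canned sample standing in for `sinfo -h -o "%P %D %T"` output.
sample = """normal* 120 idle
normal* 8 drained
debug 16 idle"""

if __name__ == "__main__":
    print(parse_sinfo(sample))
```

A production check would then compare the per-partition totals against the expected node counts (e.g. from the TF definition mentioned above) and flag partitions that fall short.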

## Maintenance test suite

Single and multi-node checks to be performed before & after vCluster interventions (using a reservation).

- Remark: Application checks can be the same as in Production, but ideally they should use more nodes.