- Tag: vs-node-validator
| Category | Description | Duration | Test name |
|---|---|---|---|
| UENV | Check if uenv status runs | 5s | uenv_status.py |
| | Build C++ hello world | TODO | |
| Integration tests | Check for installed packages (list) | 5s | PackagePresentTest |
| | Check if mount points (main.tf) are present | | MountPointExistsTest |
| | Check if environment variables are correctly set | | EnvVariableConfigTest |
| | Check if proxy config is correctly set | | ProxyConfigTest |
| Slurm | Selected Epilog and Prolog tests | 5s | 002-slowlink, 006-bugs.sh, 020-gpumem.sh |
- Remark: (*) running tests such as node burn locally (without Slurm) would require maintaining the same test twice; therefore, for the time being, it was agreed that we should find an alternative solution for these tests.
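For illustration, an integration check like EnvVariableConfigTest essentially compares the node's environment against an expected mapping. The sketch below is a hypothetical stand-alone version of that logic; the variable names and values are placeholders, not the actual site configuration:

```python
import os

def check_env_vars(expected: dict[str, str]) -> list[str]:
    """Return a list of mismatches between the current environment and
    the expected variable/value mapping (an empty list means pass)."""
    errors = []
    for name, value in expected.items():
        actual = os.environ.get(name)
        if actual is None:
            errors.append(f"{name} is not set")
        elif actual != value:
            errors.append(f"{name}={actual!r}, expected {value!r}")
    return errors

if __name__ == "__main__":
    # Hypothetical proxy setting; the real expected values would come
    # from the vCluster definition (e.g. main.tf).
    expected = {"https_proxy": "http://proxy.example.com:8080"}
    for err in check_env_vars(expected):
        print("FAIL:", err)
```

The same comparison loop would also cover the proxy check, since proxy settings on the nodes are ordinarily just environment variables.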
Single and multi-node checks to be performed after interventions
- To be executed against all nodes in a reservation (such as appscheckout or maintenance)
- Tag: appscheckout (veto mode)
| Category | Description | All-nodes | Duration | Test name |
|---|---|---|---|---|
| HW check | Run dgemm on all GPUs and CPUs | Y | 1min | node-burn-ce.py |
| | Stream (memory bandwidth test) | Y | TODO | node-burn-ce.py |
| Network | Simple MPI (CPI) | Y | TODO | mpi_cpi.py |
| | OSU all-to-all | Y | TODO | TODO |
| | NCCL allreduce (2 min) | Y | TODO | pytorch_allreduce.py |
| | Network bandwidth between GPUs (per node) | Y | TODO | cxi_gpu_loopback_bw.py |
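Performance checks of this kind (dgemm node burn, Stream, NCCL allreduce, loopback bandwidth) usually reduce to comparing a measured figure of merit against a per-node reference within a tolerance band, in the style of ReFrame's reference tuples. A minimal sketch of that pass/fail logic, with placeholder numbers:

```python
def within_reference(measured: float, reference: float,
                     lower_tol: float, upper_tol: float) -> bool:
    """True if measured lies in [ref*(1+lower_tol), ref*(1+upper_tol)].

    Tolerances follow the ReFrame convention: lower_tol is negative
    (e.g. -0.1 allows a 10% shortfall), upper_tol is positive.
    """
    return (reference * (1 + lower_tol)
            <= measured
            <= reference * (1 + upper_tol))

# Example: a GPU dgemm expected to reach ~40 TF/s, allowing a 10% shortfall.
print(within_reference(measured=38.0, reference=40.0,
                       lower_tol=-0.1, upper_tol=0.5))  # True
```

Keeping the thresholds in one place per node type makes it cheap to reuse the same check across the appscheckout and production tags.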
Single and multi-node checks to be performed regularly (nightly) in production using a subset of nodes
- Tag: production
- See: Test coverage
| Category | Description | Test name |
|---|---|---|
| Apps | CP2K | cp2k_uenv.py |
| | GROMACS | gromacs_check.py |
| | LAMMPS | lammps.py |
| | PyTorch | pytorch_allreduce.py, pytorch_nvidia.py |
| | Quantum ESPRESSO | quantumespresso_check_uenv.py |
| Microbenchmarks | Memory allocation speed | alloc_speed.py |
| Libraries | dlaf | dlaf.py |
| Programming environment | Build Hello World (C/C++/F) | helloworld.py |
| | CUDA Samples (?) | cuda_samples.py |
| | Check that NVML can report GPU information | cuda_nvml.py |
| | Affinity | affinity_check.py |
| | Test if multi-threaded MPI works | mpi.py |
| Config/Integration | Slurm: | slurm.py |
| | Slurm: partitions correspond to TF definition | TODO |
| | Slurm: Transparent Hugepages | slurm.py |
| | Slurm: number of nodes available per partition | SlurmQueueStatusCheck |
| | Slurm: check if GRES is properly configured | SlurmGPUGres |
| | Slurm: new features | TODO |
| Containers | Test OSU benchmarks with CE | OMB_MPICH_CE, OMB_OMPI_CE |
| | Stream benchmark with CE | RunNVGPUJobCE - ce_import_run_image.py |
| | Verify a simple container runs | RunJobCE - ce_import_run_image.py |
| | Test SSH to a container | ssh.py |
| | CUDA nbody with CE | check_cuda_nbody.py |
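A check like SlurmQueueStatusCheck presumably inspects `sinfo` output and compares per-partition node counts against the expected configuration. As a hypothetical sketch, the parsing step can be isolated and unit-tested without a running Slurm; `sinfo -h -o "%P %D"` prints one `partition node-count` pair per line:

```python
def parse_sinfo(output: str) -> dict[str, int]:
    """Parse `sinfo -h -o "%P %D"` output into {partition: node_count}.

    Repeated partition lines (e.g. split by node state) are summed,
    and the trailing '*' marking the default partition is stripped.
    """
    counts: dict[str, int] = {}
    for line in output.strip().splitlines():
        partition, nodes = line.split()
        partition = partition.rstrip("*")
        counts[partition] = counts.get(partition, 0) + int(nodes)
    return counts

if __name__ == "__main__":
    sample = "normal* 120\ndebug 8\nnormal* 4\n"
    print(parse_sinfo(sample))  # {'normal': 124, 'debug': 8}
```

The resulting mapping could then be compared against the partition sizes in the vCluster definition, covering both the "number of nodes per partition" and the "partitions correspond to TF definition" rows above.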
Single and multi-node checks to be performed before & after vCluster interventions (using a reservation)
- Tags: maintenance + production
- See: Test coverage
- Remark: Application checks can be the same as in Production, but ideally they should use more nodes