
## Node validator test suite

Single-node checks to be performed at boot time and upon node re-configuration.

- Tag: `vs-node-validator`
| Category | Description | Duration | Test name |
|---|---|---|---|
| HW check | Run dgemm on all GPUs and CPUs (*) | 30s | node-burn-ce.py |
| UENV | Check if `uenv status` runs | 5s | uenv_status.py |
| | Build C++ hello world | | TODO |
| Integration tests | Check for installed packages (list) | 5s | PackagePresentTest |
| | Check if mount points (main.tf) are present | | MountPointExistsTest |
| | Check if environment vars are correctly set | | EnvVariableConfigTest |
| | Check if proxy config is correctly set | | ProxyConfigTest |
| Slurm | Selected Epilog and Prolog tests | 5s | 002-slowlink, 006-bugs.sh, 020-gpumem.sh |
- Remark: (*) Running tests such as node-burn locally (without Slurm) would require maintaining the same test twice; for the time being, it was therefore agreed to find an alternative solution for these tests.
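The integration checks in the table above (package, mount point, and environment-variable presence) reduce to a few filesystem and environment probes. Below is a minimal standalone sketch of that idea, assuming the expected paths, variables, and package names come from configuration; the real checks are the separate PackagePresentTest, MountPointExistsTest, and EnvVariableConfigTest classes, and this code is illustrative only.

```python
import os
import shutil

def check_mount_points(paths):
    """Return the subset of expected mount points that are not mounted."""
    return [p for p in paths if not os.path.ismount(p)]

def check_env_vars(expected):
    """Return {name: actual_value} for env vars that differ from the expected value."""
    return {k: os.environ.get(k) for k, v in expected.items()
            if os.environ.get(k) != v}

def check_packages(names):
    """Return packages whose executables are not found on PATH."""
    return [n for n in names if shutil.which(n) is None]

if __name__ == "__main__":
    # Hypothetical expectations for demonstration purposes only.
    print("missing mounts:", check_mount_points(["/"]))
    print("missing packages:", check_packages(["sh"]))
```

Each function returns the failing items, so an empty result means the check passed; a wrapper can aggregate these into a single boot-time pass/fail verdict.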

## Node vetting (appscheckout) test suite

- Single and multi-node checks to be performed after interventions
  - To be executed against all nodes in a reservation (such as appscheckout or maintenance)
  - Tag: `appscheckout` (veto mode)
| Category | Description | All nodes | Duration | Test name |
|---|---|---|---|---|
| HW check | Run dgemm on all GPUs and CPUs | Y | 1min | node-burn-ce.py |
| | Stream (memory bandwidth test) | Y | TODO | node-burn-ce.py |
| Network | Simple MPI (CPI) | Y | TODO | mpi_cpi.py |
| | OSU all-to-all | Y | TODO | TODO |
| | NCCL allreduce (2min) | Y | TODO | pytorch_allreduce.py |
| | Network bandwidth between GPUs (per node) | Y | TODO | cxi_gpu_loopback_bw.py |
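The Stream row above measures sustained memory bandwidth. The actual measurement belongs to node-burn-ce.py; the underlying idea can be sketched with a plain NumPy copy kernel, a simplified CPU-only stand-in for the real test:

```python
import time
import numpy as np

def copy_bandwidth_gbs(n=20_000_000, repeats=5):
    """Best-of-N bandwidth of a double-precision array copy, in GB/s."""
    src = np.random.rand(n)
    dst = np.empty_like(src)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.copyto(dst, src)          # streams n doubles in, n doubles out
        best = min(best, time.perf_counter() - t0)
    return 2 * n * 8 / best / 1e9    # bytes read + bytes written

if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_gbs():.1f} GB/s")
```

Taking the best of several repeats filters out warm-up and scheduling noise, the same reason STREAM reports the fastest iteration rather than the mean.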

## Production test suite

Single and multi-node checks to be performed regularly (nightly) in production using a subset of nodes.

| Category | Description | Test name |
|---|---|---|
| Apps | CP2K | cp2k_uenv.py |
| | Gromacs | gromacs_check.py |
| | LAMMPS | lammps.py |
| | PyTorch | pytorch_allreduce.py, pytorch_nvidia.py |
| | QuantumEspresso | quantumespresso_check_uenv.py |
| Microbenchmarks | Memory allocation speed | alloc_speed.py |
| Libraries | dlaf | dlaf.py |
| Programming environment | Build Hello World (C/C++/F) | helloworld.py |
| | CUDA samples (?) | cuda_samples.py |
| | Check that NVML can report GPU information | cuda_nvml.py |
| | Affinity | affinity_check.py |
| | Test if multi-threaded MPI works | mpi.py |
| Config/Integration | Slurm | slurm.py |
| | Slurm: partitions correspond to TF definition | TODO |
| | Slurm: Transparent Hugepages | slurm.py |
| | Slurm: number of nodes available per partition | SlurmQueueStatusCheck |
| | Slurm: check if Gres is properly configured | SlurmGPUGres |
| | Slurm: new features | TODO |
| Containers | Test OSU benchmarks with CE | OMB_MPICH_CE, OMB_OMPI_CE |
| | Stream benchmark with CE | RunNVGPUJobCE - ce_import_run_image.py |
| | Verify that a simple container runs | RunJobCE - ce_import_run_image.py |
| | Test SSH to a container | ssh.py |
| | CUDA nbody with CE | check_cuda_nbody.py |
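A check like SlurmQueueStatusCheck amounts to counting available nodes per partition. Here is a hypothetical sketch of that parsing step, fed with canned `sinfo -h -o "%P %D %T"` output rather than a live query; the real test's logic and thresholds may differ.

```python
def parse_sinfo(output):
    """Sum node counts per (partition, state) from sinfo-style lines."""
    counts = {}
    for line in output.strip().splitlines():
        partition, nodes, state = line.split()
        key = (partition.rstrip("*"), state)  # '*' marks the default partition
        counts[key] = counts.get(key, 0) + int(nodes)
    return counts

# Canned sample standing in for `sinfo -h -o "%P %D %T"` output.
sample = """normal* 120 idle
normal* 8 drained
debug 16 idle"""

if __name__ == "__main__":
    print(parse_sinfo(sample))
```

A production check would then compare the per-partition totals against the expected node counts (e.g. from the TF definition mentioned above) and flag partitions that fall short.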

## Maintenance test suite

Single and multi-node checks to be performed before & after vCluster interventions (using a reservation).

- Remark: Application checks can be the same as in Production, but ideally they should use more nodes.