add a CUDA device code sanity check #4692

jfgrimm · 2024-10-24T16:53:29Z

At the moment, we do not check that the CUDA compute capabilities that EasyBuild is configured to use are actually used in the resulting binaries/libraries.

WIP PR to introduce an extra sanity check when CUDA is present, to check for mismatches between cuda_compute_capabilities and what cuobjdump reports.

ocaisa · 2024-10-24T17:17:47Z

It's great that you looked into this, we've also been discussing it in EESSI: https://gitlab.com/eessi/support/-/issues/92

jfgrimm · 2024-10-24T17:32:38Z

@ocaisa thanks for the link, I'll take a look

Currently, the main things I still plan to add to this PR:

- An EB option to toggle whether this is a warning or an error (akin to the RPATH sanity check strictness)
- whitelisting (e.g. for bundled precompiled stuff)
- handling software that only allows targeting a single CCC

ocaisa · 2024-10-24T17:41:53Z

I think it's a good idea to check for both device code and PTX (with a lack of PTX for the highest compute capability being a warning). The availability of PTX will allow you to run the application on future architectures.

casparvl · 2025-02-19T20:08:12Z

FYI: I checked with @jfgrimm on chat, and he probably has little time to work on this in the near future. Since this is a very valuable feature for EESSI that we'd like to have before we start building a large amount of GPU software, I'll try to work on this myself. Note that @jfgrimm was OK with me pushing to his branch, so I'll do that rather than create my own PR. That way we can at least have the full discussion in one place, namely here.

casparvl · 2025-02-19T20:17:35Z

I tested this as follows:

cloned Jasper's feature branch into $HOME/easybuild/easybuild-framework/
load EESSI and EESSI-extend: module purge && module load EESSI/2023.06 EESSI-extend/2023.06-easybuild
installed an EasyBuild from the current 5.0.x branch using the EasyConfig EasyBuild-5.0.x.eb below, using the EasyBuild-4.9.4 from EESSI: eb EasyBuild-5.0.x.eb. This ensures I have the versions of blocks and easyconfigs from 5.0.x.

#EasyBuild-5.0.x.eb
# Nice way of installing an EasyBuild installation from the develop branch...
# Install with 'eblocalinstall --force-download ...' to make sure you get the latest version
easyblock = 'EB_EasyBuildMeta'
name = 'EasyBuild'
version = '5.0.x'
homepage = 'https://easybuilders.github.io/easybuild'
description = """EasyBuild is a software build and installation framework
 written in Python that allows you to install software in a structured,
 repeatable and robust way."""
toolchain = SYSTEM
sources = [
    {
        'source_urls': ['https://github.com/easybuilders/easybuild-framework/archive/'],
        'download_filename': '5.0.x.tar.gz',
        'filename': 'easybuild-framework-develop.tar.gz',
    },
    {
        'source_urls': ['https://github.com/easybuilders/easybuild-easyblocks/archive/'],
        'download_filename': '5.0.x.tar.gz',
        'filename': 'easybuild-easyblocks-develop.tar.gz',
    },
    {
        'source_urls': ['https://github.com/easybuilders/easybuild-easyconfigs/archive/'],
        'download_filename': '5.0.x.tar.gz',
        'filename': 'easybuild-easyconfigs-develop.tar.gz',
    },
]
# order matters a lot, to avoid having dependencies auto-resolved (--no-deps easy_install option doesn't work?)
# EasyBuild is a (set of) Python packages, so it depends on Python
# usually, we want to use the system Python, so no actual Python dependency is listed
allow_system_deps = [('Python', SYS_PYTHON_VERSION)]
local_pyshortver = '.'.join(SYS_PYTHON_VERSION.split('.')[:2])
sanity_check_paths = {
    'files': ['bin/eb'],
    'dirs': ['lib/python%s/site-packages' % local_pyshortver],
}
moduleclass = 'tools'

Set the following environment variables to pick up the feature branch:

export PATH=$HOME/easybuild/easybuild-framework/:$PATH
export PYTHONPATH=$HOME/easybuild/easybuild-framework/:$PYTHONPATH

Added the following configuration (for some reason, my robot-path was empty, so I now make it use the easyconfigs from the 5.0.x I installed above):

export EASYBUILD_ROBOT_PATHS=/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/EasyBuild/5.0.x/easybuild/easyconfigs
export EASYBUILD_CUDA_COMPUTE_CAPABILITIES=8.0

I tried to install CUDA-Samples:

eb CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb --rebuild

This resulted in

== 2025-02-19 20:55:23,959 build_log.py:226 ERROR EasyBuild encountered an error (at easybuild/easybuild-framework/easybuild/tools/build_log.py:166 in caller_info): Sanity check failed: Mismatch between cuda_compute_capabilities and device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/jitLto. Surplus compute capabilities: 5.2. Missing compute capabilities: 8.0.
Mismatch between cuda_compute_capabilities and device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/inlinePTX_nvrtc. Surplus compute capabilities: 5.2. Missing compute capabilities: 8.0.
Mismatch between cuda_compute_capabilities and device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/conjugateGradientCudaGraphs. Surplus compute capabilities: 5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.6, 8.9, 9.0.

And many more. That's great, it means this PR is actually doing what it should. Indeed, checking manually:

$ cuobjdump /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/jitLto

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
host = linux
compile_size = 64bit

So, yeah... CUDA-Samples is a mess when it comes to its build system. The docs say you can set the CUDA compute capabilities by passing the SMS=<something> argument to it. Just for reference, my build command from the logs was:

rm -r bin/win64 &&  make  -j 16 HOST_COMPILER=g++ SMS='80' FILTER_OUT='Samples/2_Concepts_and_Techniques/EGLStream_CUDA_Interop/Makefile Samples/2_Concepts_and_Techniques/streamOrderedAllocationIPC/Makefile Samples/3_CUDA_Features/tf32TensorCoreGemm/Makefile Samples/3_CUDA_Features/warpAggregatedAtomicsCG/Makefile Samples/4_CUDA_Libraries/boxFilterNPP/Makefile Samples/4_CUDA_Libraries/cannyEdgeDetectorNPP/Makefile Samples/4_CUDA_Libraries/cudaNvSci/Makefile Samples/4_CUDA_Libraries/cudaNvSciNvMedia/Makefile Samples/4_CUDA_Libraries/freeImageInteropNPP/Makefile Samples/4_CUDA_Libraries/histEqualizationNPP/Makefile Samples/4_CUDA_Libraries/FilterBorderControlNPP/Makefile Samples/5_Domain_Specific/simpleGL/Makefile Samples/5_Domain_Specific/simpleVulkan/Makefile Samples/5_Domain_Specific/simpleVulkanMMAP/Makefile Samples/5_Domain_Specific/vulkanImageCUDA/Makefile Samples/0_Introduction/simpleAWBarrier/Makefile Samples/3_CUDA_Features/bf16TensorCoreGemm/Makefile Samples/3_CUDA_Features/dmmaTensorCoreGemm/Makefile Samples/3_CUDA_Features/globalToShmemAsyncCopy/Makefile Samples/4_CUDA_Libraries/simpleCUFFT_callback/Makefile Samples/2_Concepts_and_Techniques/cuHook/Makefile ' && rm bin/*/linux/release/lib*.so.*

Note that there are many executables in CUDA-Samples that were built for the correct CC. E.g.:

$ cuobjdump /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/deviceQuery

Fatbin elf code:
================
arch = sm_80
code version = [1,7]
host = linux
compile_size = 64bit
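The manual comparison above can be sketched in a few lines of Python. This is only an illustration, not the code in this PR: the helper names (`device_code_ccs`, `compare_ccs`) are made up, and it assumes that cuobjdump section headers start with "Fatbin" and that architectures are reported as plain `sm_XY` names (suffixed names such as `sm_90a` are not handled).

```python
import re

# Matches lines such as "arch = sm_80" in cuobjdump output (assumed format).
ARCH_RE = re.compile(r"^\s*arch\s*=\s*sm_(\d+)\s*$")


def sm_to_cc(digits):
    """Convert the digits of an sm_XY name to an 'X.Y' compute capability string."""
    return "%s.%s" % (digits[:-1], digits[-1])


def device_code_ccs(cuobjdump_output):
    """Return the set of compute capabilities that have device (ELF) code."""
    ccs = set()
    in_elf_section = False
    for line in cuobjdump_output.splitlines():
        if line.startswith("Fatbin"):
            # Section header; only "Fatbin elf code:" sections hold device code.
            in_elf_section = line.startswith("Fatbin elf code")
            continue
        match = ARCH_RE.match(line)
        if match and in_elf_section:
            ccs.add(sm_to_cc(match.group(1)))
    return ccs


def cc_key(cc):
    """Sort key so that '10.0' sorts after '9.0'."""
    return tuple(int(part) for part in cc.split("."))


def compare_ccs(configured, found):
    """Return (surplus, missing) as sorted lists of compute capability strings."""
    configured, found = set(configured), set(found)
    return sorted(found - configured, key=cc_key), sorted(configured - found, key=cc_key)


# Sample output copied from the jitLto example in this thread.
SAMPLE = """
Fatbin elf code:
================
arch = sm_52
code version = [1,7]
host = linux
compile_size = 64bit
"""

surplus, missing = compare_ccs(["8.0"], device_code_ccs(SAMPLE))
print("Surplus compute capabilities: %s. Missing compute capabilities: %s."
      % (", ".join(surplus), ", ".join(missing)))
```

For the jitLto output this reports 5.2 as surplus and 8.0 as missing, which matches the sanity check message above.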

casparvl · 2025-02-19T20:44:10Z

Collecting some TODOs:

- add --strict-cuda-sanity-check EB option (default: no): the regular sanity check would fail (raise an error) if not at least the configured CCs are present. It should report surplus CCs (at least with --debug), but not fail. The strict variant will also fail if there are surplus CCs present. N.B. I'm not in favor of converting the error into a warning here: if you're not getting the CCs you're requesting via --cuda-compute-capabilities, that's not what the user is counting on, and that should be a failure. A user can always decide to whitelist to make sure the sanity check passes, but this should be a very conscious decision. Since many of us are building in bulk, in semi-automated pipelines, etc., warnings would too easily be missed.
- whitelisting (e.g. for bundled precompiled stuff). This will cause the sanity check to be skipped (or at most print a warning/info) for software that is whitelisted. It enables a conscious override by a user to say 'yes, I know this binary wasn't built for the requested CC, and I'm OK with that'.
- Also check for PTX code (and which arch that PTX code is for). We currently don't have any way of asking EasyBuild to build for a certain PTX arch, so a question would be: what do we check against? A logical default would be to check for PTX code for the highest CC in --cuda-compute-capabilities, as this would allow forward compatibility of the binary through JIT compilation.
- add --strict-ptx-sanity-check (default: no): the regular sanity check would fail (raise an error) if not at least the configured virtual architectures are present. It should report surplus CCs (at least with --debug), but not fail. The strict variant will also fail if there are surplus CCs present. => EDIT: Won't do, out of scope, see add a CUDA device code sanity check #4692 (comment)
- add --cuda-virtual-architectures option to EasyBuild, which can be used to determine for which virtual architecture to compile PTX code. It won't do anything initially until EB contributors start supporting this in their EasyBlocks and/or we get proper NVCC compiler wrappers that could inject such arguments. => EDIT: Won't do, out of scope, see add a CUDA device code sanity check #4692 (comment)
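The intended pass/fail rules from the first two items can be written down as a small decision function. This is a sketch of the proposed semantics only: `classify_file` and its `strict` and `ignored` parameters are hypothetical names, not the options or functions implemented in this PR.

```python
def classify_file(configured, found, strict=False, ignored=False):
    """Classify one file as 'ok', 'warning' or 'error'.

    Rules sketched from the TODO list:
    - missing configured CCs are always a failure;
    - surplus CCs are a failure only in strict mode;
    - a whitelisted (ignored) file never fails: failures become warnings.
    """
    configured, found = set(configured), set(found)
    missing = configured - found
    surplus = found - configured

    failed = bool(missing) or (strict and bool(surplus))
    if failed:
        return "warning" if ignored else "error"
    # In non-strict mode surplus CCs are reported, but are not fatal.
    return "warning" if surplus else "ok"


# Device code only for 5.2 while 8.0 was requested: always an error...
print(classify_file(["8.0"], ["5.2"]))
# ...unless the file is whitelisted, in which case it is only a warning.
print(classify_file(["8.0"], ["5.2"], ignored=True))
# Requested CC present plus extras: only fatal in strict mode.
print(classify_file(["8.0"], ["5.2", "8.0"]))
print(classify_file(["8.0"], ["5.2", "8.0"], strict=True))
```

Keeping the decision in one pure function like this makes it easy to unit test each combination without needing a real CUDA binary.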

casparvl · 2025-02-20T22:45:49Z

Ignore list seems to work. Adding

cuda_sanity_ignore_files = ['bin/watershedSegmentationNPP', 'bin/simpleTemplates_nvrtc']

to the EasyConfig for CUDA-Samples results in

== 2025-02-20 23:43:16,229 easyblock.py:3350 DEBUG Sanity checking for CUDA device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc
== 2025-02-20 23:43:16,229 run.py:489 INFO Path to bash that will be used to run shell commands: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash
== 2025-02-20 23:43:16,229 run.py:500 INFO Running shell command 'file /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc' in /tmp/casparl/easybuild/build/CUDASamples/12.1/GCC-12.3.0-CUDA-12.1.1/cuda-samples-12.1
== 2025-02-20 23:43:16,235 run.py:598 INFO Output of 'file ...' shell command (stdout + stderr):
/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, not stripped

== 2025-02-20 23:43:16,235 run.py:601 INFO Shell command completed successfully (see output above): file /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc
== 2025-02-20 23:43:16,235 run.py:489 INFO Path to bash that will be used to run shell commands: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash
== 2025-02-20 23:43:16,235 run.py:500 INFO Running shell command 'cuobjdump /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc' in /tmp/casparl/easybuild/build/CUDASamples/12.1/GCC-12.3.0-CUDA-12.1.1/cuda-samples-12.1
== 2025-02-20 23:43:16,240 run.py:598 INFO Output of 'cuobjdump ...' shell command (stdout + stderr):

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
host = linux
compile_size = 64bit

== 2025-02-20 23:43:16,240 run.py:601 INFO Shell command completed successfully (see output above): cuobjdump /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc
== 2025-02-20 23:43:16,241 easyblock.py:3376 WARNING Mismatch between cuda_compute_capabilities and device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc. Surplus compute capabilities: 5.2. Missing compute capabilities: 8.0. This failure will be ignored as /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc is listed in 'ignore_cuda_sanity_failures'.
== 2025-02-20 23:43:16,241 easyblock.py:3393 WARNING Configured highest compute capability was '8.0', but no PTX code for this compute capability was found in '/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc' PTX architectures supported in that file: []

and note that this binary does not get listed in the failure message. So that's the intended behavior: the warning is still printed, but it doesn't result in an error.

casparvl · 2025-02-21T16:21:04Z

Just to test: putting all of these files in the ignore list, the installation of CUDA-Samples now passes.

cuda_sanity_ignore_files = [
    'bin/binomialOptions_nvrtc',
    'bin/jitLto',
    'bin/inlinePTX_nvrtc',
    'bin/conjugateGradientCudaGraphs',
    'bin/simpleVoteIntrinsics_nvrtc',
    'bin/MersenneTwisterGP11213',
    'bin/nvJPEG_encoder',
    'bin/vectorAdd_nvrtc',
    'bin/clock_nvrtc',
    'bin/nvJPEG',
    'bin/BlackScholes_nvrtc',
    'bin/simpleAtomicIntrinsics_nvrtc',
    'bin/batchedLabelMarkersAndLabelCompressionNPP',
    'bin/conjugateGradient',
    'bin/simpleAssert_nvrtc',
    'bin/matrixMul_nvrtc',
    'bin/cuSolverDn_LinearSolver',
    'bin/quasirandomGenerator_nvrtc',
    'bin/watershedSegmentationNPP',
    'bin/simpleTemplates_nvrtc'
]

This provides a nice starting point for further tests: I can simply remove one entry from the ignore list and check that I get the expected result.

casparvl · 2025-02-21T18:09:57Z

So... the whole thing with checking PTX code makes me rethink what EasyBuild should do when a certain --cuda-compute-capabilities is set. Currently, this is ill-defined at best. Our official docs say:

List of CUDA compute capabilities to use when building GPU software;
values should be specified as digits separated by a dot, for example:
3.5,5.0,7.2 (type comma-separated list)

But what does that mean? What do we expect the nvcc compiler to do here? Say we were to compile a simple hello world, and I set --cuda-compute-capabilities=8.0,9.0: what would I expect my nvcc invocation to look like?

nvcc hello.cu --gpu-architecture=compute_80 --gpu-code=sm_80,sm_90 -o hello

i.e. would it only build device code for 80/90, and not include PTX? And build both through the lowest common virtual architecture? Or should it do

nvcc hello.cu --gpu-architecture=compute_80 --gpu-code=sm_80,sm_90,compute_80 -o hello

i.e. also include the PTX code for the --gpu-architecture we specified? Or do we expect it to use the generalized option --generate-code so that it does

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 -o hello

i.e. the stage one compilation is executed once for each CUDA compute capability, so that the generated sm_90 code can actually use the features from the compute_90 architecture? Or do we expect it to do

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 --generate-code=arch=compute_90,code=compute_90 -o hello

so that it actually includes not only the device codes for CC80 and CC90, but also the PTX code for CC90 (for forwards compatibility)?

Honestly, from a performance perspective, I think it would be best if EasyBuild would indeed use the generalized arguments, so that the sm_90 code would use the full capabilities of the compute_90 virtual architecture. Since EasyBuild focusses on performance, I think this makes sense. The only price you pay is longer compilation time, since you also have to build that compute_90 virtual architecture PTX code. Whether to include the PTX code is a different question. As proposed above, I think this should be a separate option in EasyBuild, so that one can decide in the EB config whether to ship PTX code, and which version(s).

I.e. my proposal would be that if EasyBuild is configured with --cuda-compute-capabilities=7.0,8.0,9.0 and --cuda-virtual-architectures=7.0,9.0 that this would trigger:

nvcc hello.cu  --generate-code=arch=compute_70,code=sm_70 --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 --generate-code=arch=compute_70,code=compute_70 --generate-code=arch=compute_90,code=compute_90 -o hello
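To make the proposed mapping concrete, here is a minimal sketch of how the two lists could be turned into nvcc arguments. The function `generate_code_flags` is hypothetical, and --cuda-virtual-architectures is only the option proposed in this comment, not an existing EasyBuild option; only the nvcc --generate-code syntax itself is taken from the examples above.

```python
def cc_to_arch(cc):
    """Turn a compute capability such as '8.0' into the arch digits '80'."""
    return cc.replace(".", "")


def generate_code_flags(compute_capabilities, virtual_architectures=()):
    """Build nvcc --generate-code arguments.

    One flag per real architecture (device code, compiled through its own
    virtual architecture), followed by one flag per virtual architecture
    for which PTX code should be embedded.
    """
    flags = []
    for cc in compute_capabilities:
        arch = cc_to_arch(cc)
        flags.append("--generate-code=arch=compute_%s,code=sm_%s" % (arch, arch))
    for va in virtual_architectures:
        arch = cc_to_arch(va)
        flags.append("--generate-code=arch=compute_%s,code=compute_%s" % (arch, arch))
    return flags


flags = generate_code_flags(["7.0", "8.0", "9.0"], ["7.0", "9.0"])
print("nvcc hello.cu " + " ".join(flags) + " -o hello")
```

With the values from the proposal this prints the same five --generate-code arguments as the command line above.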

Note that it may not always be possible to convince all build systems to actually do this - e.g. some codes might really only compile for a single cuda-compute-capability, or the build system doesn't make this distinction between real and virtual architectures to build for. Eventually the most robust and generic way to get this done might just be to implement nvcc compiler wrappers that inject these --generate-code arguments.

I'm creating a CUDA hello world EasyConfig that we can use to serve as an easy example of 1) how we think --cuda-compute-capabilities and --cuda-virtual-architectures should work in EasyBuild and 2) have an EasyConfig with which we can easily test these things, including the sanity check.

I'm not sure what the best way forward is. If I include everything in this PR, it may be a bit heavy, although honestly at the framework level it's just about defining the options; the real implementation would have to be done in EasyBlocks and EasyConfigs that use this information...

My plan is to include the options in this PR, and make an accompanying PR for my CUDA hello world that uses these options in the way described above. The rest is then up to anyone updating or creating new EasyBlocks/EasyConfigs that somehow use information on the CUDA compute capability.

casparvl · 2025-02-21T18:50:39Z

OK, change of plans. After thinking it over, this would be a massive scope creep that would delay the sanity check part that we primarily care about in this PR. Instead, in this PR, I'll focus on just that: a sanity check for the CUDA device code. We can assume that everyone using EasyBuild expects this to be the meaning of --cuda-compute-capabilities, i.e. if they specify 8.0,9.0, they expect the resulting binaries to contain device code for 8.0 and 9.0. Which virtual architecture was used to get there, or what PTX code is shipped as part of the binary, is not relevant to that expectation, and can be considered a further optimization that we can do in a separate PR.

I will retain the code that prints a warning for the PTX code not matching the highest architecture. Or maybe demote it to an info message. In any case, it's convenient for future reference if EasyBuild extracts this information.
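The retained PTX check boils down to one comparison against the highest configured compute capability. A minimal sketch, assuming the PTX architectures have already been extracted from the binary as 'X.Y' strings; `highest_cc` and `ptx_warning` are illustrative names, and the message text is modelled on the warning shown earlier in this thread rather than copied from the implementation.

```python
def highest_cc(ccs):
    """Return the highest compute capability, comparing numerically ('10.0' > '9.0')."""
    return max(ccs, key=lambda cc: tuple(int(part) for part in cc.split(".")))


def ptx_warning(configured, ptx_ccs, path):
    """Return a warning message if no PTX for the highest configured CC is found, else None."""
    target = highest_cc(configured)
    if target in ptx_ccs:
        return None
    return ("Configured highest compute capability was '%s', but no PTX code for this "
            "compute capability was found in '%s'. PTX architectures found: %s"
            % (target, path, sorted(ptx_ccs)))


# No PTX at all, as for bin/simpleTemplates_nvrtc above: a message is produced.
print(ptx_warning(["8.0"], set(), "bin/simpleTemplates_nvrtc"))
# PTX for the highest configured CC is present: nothing to report.
print(ptx_warning(["8.0", "9.0"], {"9.0"}, "bin/hello"))
```

Returning a message instead of logging directly leaves the caller free to emit it as a warning or to demote it to an info message.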

I will not implement a strict option for the PTX code sanity check in this PR. It does not make sense to be sanity checking for behavior that we haven't clearly defined, i.e. there is no clear definition of what PTX code is expected to be included when someone sets --cuda-compute-capabilities.

casparvl · 2025-02-21T19:36:48Z

Everything not related to the sanity check is now described in this issue (#4770), which can be used to create one or more follow-up PRs.

casparvl · 2025-02-21T20:50:54Z

Tested by adding

cuda_sanity_ignore_files = [
    'bin/binomialOptions_nvrtc',
    'bin/jitLto',
    'bin/inlinePTX_nvrtc',
    'bin/conjugateGradientCudaGraphs',
    'bin/simpleVoteIntrinsics_nvrtc',
    'bin/MersenneTwisterGP11213',
    'bin/nvJPEG_encoder',
    'bin/vectorAdd_nvrtc',
    'bin/clock_nvrtc',
    'bin/nvJPEG',
    'bin/BlackScholes_nvrtc',
    'bin/simpleAtomicIntrinsics_nvrtc',
    'bin/batchedLabelMarkersAndLabelCompressionNPP',
    # 'bin/conjugateGradient',
    'bin/simpleAssert_nvrtc',
    'bin/matrixMul_nvrtc',
    'bin/cuSolverDn_LinearSolver',
    'bin/quasirandomGenerator_nvrtc',
    'bin/watershedSegmentationNPP',
    'bin/simpleTemplates_nvrtc'
]

to CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb. Then, with:

eb CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb --rebuild

my build succeeds, whereas with

eb CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb --rebuild --strict-cuda-sanity-check

it fails with:

== 2025-02-21 21:41:21,349 build_log.py:226 ERROR EasyBuild encountered an error (at easybuild/easybuild-framework/easybuild/tools/build_log.py:166 in caller_info): Sanity check failed: Mismatch between cuda_compute_capabilities and device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/conjugateGradient. Surplus compute capabilities: 5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.6, 8.9, 9.0.  (at easybuild/easybuild-framework/easybuild/framework/easyblock.py:4010 in _sanity_check_step)

as intended.

The only thing left to do for this PR is tests. Not my strong suit to be honest, but let's see. I guess the tricky thing here is that a true test requires a real CUDA binary, and I'm not sure that's even feasible... To build one, I'd need a CUDA module in the test environment, and I'm not sure we have that. I could try to find a CUDA binary that we could just install (maybe just include a hello-world type of CUDA binary) and test with that... Maybe that's the most feasible option. But I have no clue whether we can reasonably include binaries in the repo under the test directory. I have an 800KB hello world binary; that shouldn't be too crazy, I guess.

ocaisa · 2025-02-21T21:09:57Z

What you can do is create a mock cuobjdump script that parrots output; you're only checking that EB can run the code and paste the output.

casparvl · 2025-02-21T21:34:40Z

Damn, you're good. It took me 25 more minutes of looking at other examples to figure out that even if I could ingest a binary, I'd lack the cuobjdump executable. I might indeed as well fake the cuobjdump output on a toy build example.
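The mock approach suggested above could look roughly like this. It is a sketch under assumptions, not the test added in this PR: it writes a fake `cuobjdump` shell script that ignores its arguments and prints canned output, puts its directory first in PATH, and runs it. The helper name `install_fake_cuobjdump` and the path `bin/toy` are made up, and the script assumes a POSIX system with /bin/sh and a temporary directory that allows executing files.

```python
import os
import shutil
import stat
import subprocess
import tempfile

# Canned output in the same format as the real cuobjdump output shown in this thread.
FAKE_CUOBJDUMP = """#!/bin/sh
# Fake cuobjdump for tests: ignores its arguments and prints canned output.
cat <<'EOF'

Fatbin elf code:
================
arch = sm_80
code version = [1,7]
host = linux
compile_size = 64bit
EOF
"""


def install_fake_cuobjdump(directory):
    """Write an executable fake 'cuobjdump' into the given directory and return its path."""
    path = os.path.join(directory, "cuobjdump")
    with open(path, "w") as handle:
        handle.write(FAKE_CUOBJDUMP)
    # Add the owner-executable bit so the script can be run directly.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


tmpdir = tempfile.mkdtemp()
install_fake_cuobjdump(tmpdir)

# Put the fake command first in PATH so it is found before any real cuobjdump.
env = dict(os.environ, PATH=tmpdir + os.pathsep + os.environ.get("PATH", ""))
found = shutil.which("cuobjdump", path=env["PATH"])

result = subprocess.run([found, "bin/toy"], env=env, capture_output=True, text=True, check=True)
print(result.stdout)

shutil.rmtree(tmpdir)
```

Because the fake command returns the same output for every file, a test built this way exercises the parsing and reporting logic rather than CUDA itself.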

ocaisa · 2025-05-16T09:17:24Z

Thanks a lot @jfgrimm for seeding this and @casparvl for getting it over the line, I think this is of high value

boegel · 2025-05-16T10:00:16Z

test_toy_cuda_sanity_check is failing with:

easybuild.tools.build_log.EasyBuildError: "Installation of CUDA-5.5.22.eb failed: 'cuda_5.5.22_linux_64.run has unknown file extension'"

ocaisa · 2025-05-16T11:25:23Z

test_toy_cuda_sanity_check is failing with:

easybuild.tools.build_log.EasyBuildError: "Installation of CUDA-5.5.22.eb failed: 'cuda_5.5.22_linux_64.run has unknown file extension'"

I think this is only failing in some scenarios because an HMNS is being used (which means the easyconfig has to be parsed)

A quick fix is to change the source format in the CUDA easyconfigs (or in that one in particular)

EDIT: HMNS is not it

ocaisa · 2025-05-16T17:10:34Z

Looks like the failing test is yet another issue with rate limits:

easybuild-framework/test/framework/filetools.py

Line 525 in 76e9033

test_url = 'https://github.com/easybuilders/easybuild-framework/raw/develop/'

ocaisa · 2025-05-16T17:45:19Z

@boegel Good to go now

boegel

This is a big change, implementing a big enhancement to the sanity check step performed by EasyBuild for builds of GPU software (anything that involves CUDA as a direct dependency).

It's been extensively tested and reviewed by several people, so time to get this in (still in time for the upcoming EasyBuild v5.1.0 release)!

@jfgrimm Thanks a lot for kickstarting this, and thanks to @casparvl and @ocaisa for all the work that was done on this in recent weeks!

Flamefire · 2025-05-21T08:25:13Z

easybuild/tools/systemtools.py

+    res = run_shell_cmd("file %s" % path, fail_on_error=False, hidden=True, output_file=False, stream_output=False)
+    if res.exit_code != EasyBuildExit.SUCCESS:
+        fail_msg = "Failed to run 'file %s': %s" % (path, res.output)
+        _log.warning(fail_msg)


Shouldn't this exit here?

Flamefire · 2025-05-21T08:34:24Z

easybuild/framework/easyblock.py

+                for entry in os.listdir(dirpath):
+                    path = os.path.join(dirpath, entry)
+                    if os.path.isfile(path):
+                        self.log.debug("Sanity checking file {path} for CUDA device code")


This message is a duplicate, isn't it?

Flamefire · 2025-05-21T08:53:20Z

easybuild/tools/systemtools.py

+    result = None
+    if any(x in res.output for x in ['executable', 'object', 'archive']):
+        # Make sure we have a cuobjdump command
+        if not shutil.which('cuobjdump'):


This is already checked at https://github.com/easybuilders/easybuild-framework/pull/4692/files#diff-00260ae7a519d5825760f53b067b29fb84a3e0d2649e6a27ace99abaca96d7d1R4361. Is this required in both places?

sanity check binaries/libraries for device code matching cuda_compute…

e329d46

…_capabilities when CUDA is used

jfgrimm added enhancement EasyBuild-5.0 EasyBuild 5.0 labels Oct 24, 2024

jfgrimm added this to the 5.0 milestone Oct 24, 2024

Merge branch '5.0.x' into cuda-device-code-sanity-check

c8cece2

ocaisa reviewed Feb 19, 2025

easybuild/tools/systemtools.py (outdated)

ocaisa reviewed Feb 19, 2025

easybuild/framework/easyblock.py (outdated)

casparvl reviewed Feb 19, 2025

easybuild/framework/easyblock.py (outdated)

Caspar van Leeuwen added 4 commits February 20, 2025 02:40

Add check for PTX, more explicit debug logging

ee63b8e

That return should not be there, as it will stop the sanity check aft…

de6d49d

…er the first non-cuda file. that's wrong

Fix some logic in the PTX warning printed

0e97868

Add option for ignoring individual files in the CUDA sanity check

6b6d2c8

casparvl mentioned this pull request Feb 21, 2025

The desired behavior of EasyBuild for --cuda-compute-capabilities is ill defined #4770

Open

Add strict-cuda-sanity-check option and make sure we only fail the sa…

6568909

…nity check on surpluss CUDA archs if this option is set. Otherwise, print warning

Caspar van Leeuwen added 2 commits February 21, 2025 23:55

This is a work in progress for creating a set of tests...

3d07ef6

First test working..

f13fca2

ocaisa enabled auto-merge May 16, 2025 09:09

boegel reviewed May 16, 2025

easybuild/framework/easyblock.py

ocaisa closed this May 16, 2025

auto-merge was automatically disabled May 16, 2025 11:20
Pull request was closed

ocaisa reopened this May 16, 2025

rename --cuda-sanity-check-error-on-fail to --cuda-sanity-check-error…

190156b

…-on-failed-checks + improve help text for --cuda-sanity-check-* configuration options

Add fake modulefile for CUDA in Tcl format as well

b14cceb

casparvl dismissed ocaisa’s stale review via b14cceb May 16, 2025 11:53

Caspar van Leeuwen added 2 commits May 16, 2025 13:56

Spread over two writes

abc108b

Merge branch 'develop' into cuda-device-code-sanity-check

b6eb063

boegel mentioned this pull request May 16, 2025

rename --cuda-sanity-check-error-on-fail to --cuda-sanity-check-error-on-failed-checks + improve help text for --cuda-sanity-check-* configuration options jfgrimm/easybuild-framework#2

Merged

boegel and others added 3 commits May 16, 2025 14:58

also rename to --cuda-sanity-check-error-on-failed-checks in comments…

22858ec

…, trace/log messages, and tests

Merge pull request #2 from boegel/cuda-device-code-sanity-check

2655a07

rename `--cuda-sanity-check-error-on-fail` to `--cuda-sanity-check-error-on-failed-checks` + improve help text for `--cuda-sanity-check-*` configuration options

also consider shared libraries under lib/python*/site-packages in CUD…

ceacffa

…A sanity check

boegel mentioned this pull request May 16, 2025

also consider shared libraries under lib/python*/site-packages in CUDA sanity check jfgrimm/easybuild-framework#3

Merged

jfgrimm and others added 2 commits May 16, 2025 15:24

Merge pull request #3 from boegel/cuda-device-code-sanity-check

e73900c

also consider shared libraries under `lib/python*/site-packages` in CUDA sanity check

extend test_toy_cuda_sanity_check to also check whether shared librar…

7e92cd5

…ies under lib/python*/site-packages are being checked in CUDA sanity check

boegel mentioned this pull request May 16, 2025

extend test_toy_cuda_sanity_check to also check whether shared libraries under lib/python*/site-packages are being checked in CUDA sanity check jfgrimm/easybuild-framework#4

Merged

Merge pull request #4 from boegel/cuda-device-code-sanity-check

5cef2e0

extend `test_toy_cuda_sanity_check` to also check whether shared libraries under `lib/python*/site-packages` are being checked in CUDA sanity check

boegel approved these changes May 20, 2025

boegel merged commit 1118cc0 into easybuilders:develop May 20, 2025
55 of 57 checks passed

Flamefire reviewed May 21, 2025

Flamefire mentioned this pull request May 21, 2025

enhance cuda_sanity_ignore_files to also allow file patterns #4882

Open
