
Conversation

@jfgrimm
Member

@jfgrimm jfgrimm commented Oct 24, 2024

At the moment, we do no checking that the CUDA compute capabilities that EasyBuild is configured to use are actually present in the resulting binaries/libraries.

WIP PR to introduce an extra sanity check when CUDA is present, checking for mismatches between cuda_compute_capabilities and what cuobjdump reports.
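
In essence (a minimal sketch, not the actual implementation in this PR), the check boils down to parsing the 'arch = sm_XX' lines from plain cuobjdump output and comparing them against the configured list:

import re
import subprocess

def device_code_ccs(path):
    """Compute capabilities (e.g. {'5.2'}) for which 'path' embeds device code."""
    out = subprocess.run(['cuobjdump', path], capture_output=True, text=True).stdout
    # NB: 'Fatbin ptx code' sections also print 'arch = sm_XX' lines, so a real
    # implementation should split the output per section and only count the
    # 'Fatbin elf code' (device code) entries
    return {cc[:-1] + '.' + cc[-1] for cc in re.findall(r'arch = sm_(\d+)', out)}

def cc_mismatch(path, configured_ccs):
    found = device_code_ccs(path)
    missing = sorted(set(configured_ccs) - found)
    surplus = sorted(found - set(configured_ccs))
    return missing, surplus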

@jfgrimm jfgrimm added this to the 5.0 milestone Oct 24, 2024
@ocaisa
Member

ocaisa commented Oct 24, 2024

It's great that you looked into this, we've also been discussing it in EESSI: https://gitlab.com/eessi/support/-/issues/92

@jfgrimm
Member Author

jfgrimm commented Oct 24, 2024

@ocaisa thanks for the link, I'll take a look

Currently, the main things I still plan to add to this PR:

  • An EB option to toggle whether this is a warning or error (akin to rpath sanity check strictness)
  • whitelisting (e.g. for bundled precompiled stuff)
  • handling software that only allows targeting a single CUDA compute capability

@ocaisa
Member

ocaisa commented Oct 24, 2024

I think it's a good idea to check for device code and PTX (with lack of PTX for the highest compute capability being a warning). The availability of PTX allows running the application on future architectures.
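
For reference (illustrative output, not from a real build): a fatbinary that ships both device code and PTX for CC 8.0 shows up in plain cuobjdump output with both section types, roughly like:

Fatbin elf code:
================
arch = sm_80
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_80
code version = [8,1]
host = linux
compile_size = 64bit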

@casparvl
Contributor

FYI: I checked with @jfgrimm on chat; he probably has little time to work on this in the near future. Since this is a very valuable feature for EESSI that we'd like to have before we start building a large amount of GPU software, I'll try to work on it myself. Note that @jfgrimm was OK with me pushing to his branch, so I'll do that rather than create my own PR - at least we can have the full discussion in one place, namely here.

@casparvl
Contributor

casparvl commented Feb 19, 2025

I tested this as follows:

  • cloned Jasper's feature branch into $HOME/easybuild/easybuild-framework/
  • load EESSI and EESSI-extend: module purge && module load EESSI/2023.06 EESSI-extend/2023.06-easybuild
  • installed EasyBuild from the current 5.0.x branch using the EasyConfig EasyBuild-5.0.x.eb below, using the EasyBuild-4.9.4 from EESSI: eb EasyBuild-5.0.x.eb. This ensures I have the versions of the easyblocks and easyconfigs from 5.0.x.
#EasyBuild-5.0.x.eb
# Nice way of installing an EasyBuild installation from the develop branch...
# Install with 'eblocalinstall --force-download ...' to make sure you get the latest version
easyblock = 'EB_EasyBuildMeta'
name = 'EasyBuild'
version = '5.0.x'
homepage = 'https://easybuilders.github.io/easybuild'
description = """EasyBuild is a software build and installation framework
 written in Python that allows you to install software in a structured,
 repeatable and robust way."""
toolchain = SYSTEM
sources = [
    {
        'source_urls': ['https://github.com/easybuilders/easybuild-framework/archive/'],
        'download_filename': '5.0.x.tar.gz',
        'filename': 'easybuild-framework-develop.tar.gz',
    },
    {
        'source_urls': ['https://github.com/easybuilders/easybuild-easyblocks/archive/'],
        'download_filename': '5.0.x.tar.gz',
        'filename': 'easybuild-easyblocks-develop.tar.gz',
    },
    {
        'source_urls': ['https://github.com/easybuilders/easybuild-easyconfigs/archive/'],
        'download_filename': '5.0.x.tar.gz',
        'filename': 'easybuild-easyconfigs-develop.tar.gz',
    },
]
# order matters a lot, to avoid having dependencies auto-resolved (--no-deps easy_install option doesn't work?)
# EasyBuild is a (set of) Python packages, so it depends on Python
# usually, we want to use the system Python, so no actual Python dependency is listed
allow_system_deps = [('Python', SYS_PYTHON_VERSION)]
local_pyshortver = '.'.join(SYS_PYTHON_VERSION.split('.')[:2])
sanity_check_paths = {
    'files': ['bin/eb'],
    'dirs': ['lib/python%s/site-packages' % local_pyshortver],
}
moduleclass = 'tools'
  • Set the following environment variables to pick up the feature branch:
export PATH=$HOME/easybuild/easybuild-framework/:$PATH
export PYTHONPATH=$HOME/easybuild/easybuild-framework/:$PYTHONPATH
  • Added the following configuration (for some reason, my robot path was empty; I now make it use the easyconfigs from the 5.0.x installation above):
export EASYBUILD_ROBOT_PATHS=/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/EasyBuild/5.0.x/easybuild/easyconfigs
export EASYBUILD_CUDA_COMPUTE_CAPABILITIES=8.0
  • tried to install CUDA-Samples:
eb CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb --rebuild

This resulted in

== 2025-02-19 20:55:23,959 build_log.py:226 ERROR EasyBuild encountered an error (at easybuild/easybuild-framework/easybuild/tools/build_log.py:166 in caller_info): Sanity check failed: Mismatch between cuda_compute_capabilities and device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/jitLto. Surplus compute capabilities: 5.2. Missing compute capabilities: 8.0.
Mismatch between cuda_compute_capabilities and device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/inlinePTX_nvrtc. Surplus compute capabilities: 5.2. Missing compute capabilities: 8.0.
Mismatch between cuda_compute_capabilities and device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/conjugateGradientCudaGraphs. Surplus compute capabilities: 5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.6, 8.9, 9.0.

And many more. That's great, it means this PR is actually doing what it should. Indeed, checking manually:

$ cuobjdump /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/jitLto

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
host = linux
compile_size = 64bit

So, yeah... CUDA-Samples is a mess when it comes to its build system. The docs say you can set the CUDA compute capabilities by passing the SMS=<something> argument to make. Just for reference, my build command from the logs was:

rm -r bin/win64 && make -j 16 HOST_COMPILER=g++ SMS='80' FILTER_OUT='Samples/2_Concepts_and_Techniques/EGLStream_CUDA_Interop/Makefile Samples/2_Concepts_and_Techniques/streamOrderedAllocationIPC/Makefile Samples/3_CUDA_Features/tf32TensorCoreGemm/Makefile Samples/3_CUDA_Features/warpAggregatedAtomicsCG/Makefile Samples/4_CUDA_Libraries/boxFilterNPP/Makefile Samples/4_CUDA_Libraries/cannyEdgeDetectorNPP/Makefile Samples/4_CUDA_Libraries/cudaNvSci/Makefile Samples/4_CUDA_Libraries/cudaNvSciNvMedia/Makefile Samples/4_CUDA_Libraries/freeImageInteropNPP/Makefile Samples/4_CUDA_Libraries/histEqualizationNPP/Makefile Samples/4_CUDA_Libraries/FilterBorderControlNPP/Makefile Samples/5_Domain_Specific/simpleGL/Makefile Samples/5_Domain_Specific/simpleVulkan/Makefile Samples/5_Domain_Specific/simpleVulkanMMAP/Makefile Samples/5_Domain_Specific/vulkanImageCUDA/Makefile Samples/0_Introduction/simpleAWBarrier/Makefile Samples/3_CUDA_Features/bf16TensorCoreGemm/Makefile Samples/3_CUDA_Features/dmmaTensorCoreGemm/Makefile Samples/3_CUDA_Features/globalToShmemAsyncCopy/Makefile Samples/4_CUDA_Libraries/simpleCUFFT_callback/Makefile Samples/2_Concepts_and_Techniques/cuHook/Makefile ' && rm bin/*/linux/release/lib*.so.*

Note that there are many executables in CUDA-Samples that were built for the correct CC. E.g.:

$ cuobjdump /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/deviceQuery

Fatbin elf code:
================
arch = sm_80
code version = [1,7]
host = linux
compile_size = 64bit

@casparvl
Contributor

casparvl commented Feb 19, 2025

Collecting some to-dos:

  • add --strict-cuda-sanity-check EB option (default: no): the regular sanity check fails (raises an error) if not all configured CCs are present in the device code. It reports surplus CCs (at least with --debug), but does not fail on them. The strict variant also fails on surplus CCs (see the sketch after this list). N.B. I'm not in favor of converting the error into a warning here - if you're not getting the CC you're requesting via --cuda-compute-capabilities, that's not what the user is counting on, and that should be a failure. A user can always decide to whitelist to make sure the sanity check passes, but this should be a very conscious decision. Since many of us are building in bulk, semi-automated pipelines, etc., warnings would too easily be missed.
  • whitelisting (e.g. for bundled precompiled binaries). This causes the sanity check to be skipped (or at most print a warning/info message) for software that is whitelisted. It enables a conscious override by a user to say 'yes, I know this binary wasn't built for the requested CC, and I'm OK with that'.
  • also check for PTX code (and which arch that PTX code targets). We currently don't have any way of asking EasyBuild to build for a certain PTX arch, so a question would be: what do we check against? A logical default would be to check for PTX code for the highest CC in --cuda-compute-capabilities, as this allows forward compatibility of the binary through JIT compilation.
  • add --strict-ptx-sanity-check (default: no): the regular sanity check fails (raises an error) if not all configured virtual architectures are present. It reports surplus CCs (at least with --debug), but does not fail on them. The strict variant also fails on surplus CCs. => EDIT: Won't do, out of scope, see add a CUDA device code sanity check #4692 (comment)
  • add --cuda-virtual-architectures option to EasyBuild, which can be used to determine for which virtual architecture(s) to compile PTX code. It won't do anything initially, until EB contributors start supporting this in their EasyBlocks and/or we get proper NVCC compiler wrappers that could inject such arguments. => EDIT: Won't do, out of scope, see add a CUDA device code sanity check #4692 (comment)
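
To make the semantics of the first bullet concrete, here is a sketch of the intended pass/fail decision per binary (hypothetical helper, names are illustrative):

import logging
log = logging.getLogger('cuda_sanity_sketch')

def cuda_cc_check_passes(missing, surplus, strict=False):
    """Pass/fail decision for one binary.

    missing: configured CCs for which no device code was found
    surplus: CCs with device code that were not configured
    strict:  corresponds to the proposed --strict-cuda-sanity-check option
    """
    if missing:
        return False  # requested CCs absent: always a hard failure
    if surplus:
        if strict:
            return False  # surplus device code only fails in strict mode
        log.debug("Surplus compute capabilities: %s", ', '.join(surplus))
    return True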

@casparvl
Contributor

Ignore list seems to work. Adding

cuda_sanity_ignore_files = ['bin/watershedSegmentationNPP', 'bin/simpleTemplates_nvrtc']

to the EasyConfig for CUDA-Samples results in

== 2025-02-20 23:43:16,229 easyblock.py:3350 DEBUG Sanity checking for CUDA device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc
== 2025-02-20 23:43:16,229 run.py:489 INFO Path to bash that will be used to run shell commands: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash
== 2025-02-20 23:43:16,229 run.py:500 INFO Running shell command 'file /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc' in /tmp/casparl/easybuild/build/CUDASamples/12.1/GCC-12.3.0-CUDA-12.1.1/cuda-samples-12.1
== 2025-02-20 23:43:16,235 run.py:598 INFO Output of 'file ...' shell command (stdout + stderr):
/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, not stripped

== 2025-02-20 23:43:16,235 run.py:601 INFO Shell command completed successfully (see output above): file /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc
== 2025-02-20 23:43:16,235 run.py:489 INFO Path to bash that will be used to run shell commands: /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/bin/bash
== 2025-02-20 23:43:16,235 run.py:500 INFO Running shell command 'cuobjdump /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc' in /tmp/casparl/easybuild/build/CUDASamples/12.1/GCC-12.3.0-CUDA-12.1.1/cuda-samples-12.1
== 2025-02-20 23:43:16,240 run.py:598 INFO Output of 'cuobjdump ...' shell command (stdout + stderr):

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
host = linux
compile_size = 64bit

== 2025-02-20 23:43:16,240 run.py:601 INFO Shell command completed successfully (see output above): cuobjdump /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc
== 2025-02-20 23:43:16,241 easyblock.py:3376 WARNING Mismatch between cuda_compute_capabilities and device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc. Surplus compute capabilities: 5.2. Missing compute capabilities: 8.0. This failure will be ignored as /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc is listed in 'ignore_cuda_sanity_failures'.
== 2025-02-20 23:43:16,241 easyblock.py:3393 WARNING Configured highest compute capability was '8.0', but no PTX code for this compute capability was found in '/home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/simpleTemplates_nvrtc' PTX architectures supported in that file: []

and note that this binary does not get listed in the failure message. So that's the intended behavior: the warning is still printed, but it doesn't result in an error.
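
For reference, the matching itself can be as simple as comparing the path relative to the installation directory against the list (hypothetical sketch; the real check lives in easyblock.py):

import os

def is_cuda_sanity_ignored(path, installdir, ignore_files):
    """True if 'path' matches an entry in cuda_sanity_ignore_files, which
    holds paths relative to the installation directory (e.g. 'bin/jitLto')."""
    return os.path.relpath(path, installdir) in ignore_files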

@casparvl
Contributor

Just to test: putting all of these files in the ignore list, the installation of CUDA-Samples now passes.

cuda_sanity_ignore_files = [
    'bin/binomialOptions_nvrtc',
    'bin/jitLto',
    'bin/inlinePTX_nvrtc',
    'bin/conjugateGradientCudaGraphs',
    'bin/simpleVoteIntrinsics_nvrtc',
    'bin/MersenneTwisterGP11213',
    'bin/nvJPEG_encoder',
    'bin/vectorAdd_nvrtc',
    'bin/clock_nvrtc',
    'bin/nvJPEG',
    'bin/BlackScholes_nvrtc',
    'bin/simpleAtomicIntrinsics_nvrtc',
    'bin/batchedLabelMarkersAndLabelCompressionNPP',
    'bin/conjugateGradient',
    'bin/simpleAssert_nvrtc',
    'bin/matrixMul_nvrtc',
    'bin/cuSolverDn_LinearSolver',
    'bin/quasirandomGenerator_nvrtc',
    'bin/watershedSegmentationNPP',
    'bin/simpleTemplates_nvrtc'
]

This provides a nice starting point for further tests, I can easily just remove one from the exclude list, and check that I get the expected result.

@casparvl
Contributor

So... the whole thing with checking PTX codes makes me rethink what EasyBuild should do when --cuda-compute-capabilities is set. Currently, this is ill-defined at best. Our official docs say:

List of CUDA compute capabilities to use when building GPU software;
values should be specified as digits separated by a dot, for example:
3.5,5.0,7.2 (type comma-separated list)

But what does that mean? What do we expect the nvcc compiler to do here? Say we were to compile a simple hello world with --cuda-compute-capabilities=8.0,9.0: what would I expect my nvcc invocation to look like?

nvcc hello.cu --gpu-architecture=compute_80 --gpu-code=sm_80,sm_90 -o hello

i.e. would it only build device code for 80/90, and not include PTX? And build both through the lowest common virtual architecture? Or should it do

nvcc hello.cu --gpu-architecture=compute_80 --gpu-code=sm_80,sm_90,compute_80 -o hello

i.e. also include the PTX code for the --gpu-architecture we specified? Or do we expect it to use the generalized option --generate-code so that it does

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 -o hello

i.e. the stage one compilation is executed once for each CUDA compute capability, so that the generated sm_90 code can actually use the features from the compute_90 architecture? Or do we expect it to do

nvcc hello.cu  --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 --generate-code=arch=compute_90,code=compute_90 -o hello

so that it actually includes not only the device codes for CC80 and CC90, but also the PTX code for CC90 (for forwards compatibility)?

Honestly, from a performance perspective, I think it would be best if EasyBuild would indeed use the generalized arguments, so that the sm_90 code uses the full capabilities of the compute_90 virtual architecture. Since EasyBuild focuses on performance, I think this makes sense. The only price you pay is longer compilation time, since you also have to build the compute_90 virtual architecture PTX code. Whether to include the PTX code is a different question. As proposed above, I think this should be a separate option in EasyBuild, so that one can decide in the EB config whether to ship PTX code, and which version(s).

I.e. my proposal would be that if EasyBuild is configured with --cuda-compute-capabilities=7.0,8.0,9.0 and --cuda-virtual-architectures=7.0,9.0, this would trigger:

nvcc hello.cu  --generate-code=arch=compute_70,code=sm_70 --generate-code=arch=compute_80,code=sm_80 --generate-code=arch=compute_90,code=sm_90 --generate-code=arch=compute_70,code=compute_70 --generate-code=arch=compute_90,code=compute_90 -o hello

Note that it may not always be possible to convince all build systems to actually do this - e.g. some codes might really only compile for a single cuda-compute-capability, or the build system doesn't make this distinction between real and virtual architectures to build for. Eventually the most robust and generic way to get this done might just be to implement nvcc compiler wrappers that inject these --generate-code arguments.
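
For illustration, deriving the --generate-code arguments from the two proposed options could be as simple as this (hypothetical helper, not part of the framework):

def nvcc_generate_code_flags(compute_capabilities, virtual_architectures=()):
    """Map the proposed EasyBuild options to nvcc --generate-code flags."""
    flags = []
    for cc in compute_capabilities:  # real architectures: device code (SASS)
        arch = cc.replace('.', '')
        flags.append('--generate-code=arch=compute_{0},code=sm_{0}'.format(arch))
    for cc in virtual_architectures:  # virtual architectures: embedded PTX
        arch = cc.replace('.', '')
        flags.append('--generate-code=arch=compute_{0},code=compute_{0}'.format(arch))
    return flags

nvcc_generate_code_flags(['7.0', '8.0', '9.0'], ['7.0', '9.0']) yields exactly the --generate-code arguments of the nvcc invocation above.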

I'm creating a CUDA hello world EasyConfig that can serve 1) as an easy example of how we think --cuda-compute-capabilities and --cuda-virtual-architectures should work in EasyBuild, and 2) as an EasyConfig with which we can easily test these things, including the sanity check.

I'm not sure what the best way forward is. If I include everything in this PR, it may be a bit heavy - although, honestly, at the framework level it's just about defining the options; the real implementation would have to be done in EasyBlocks and EasyConfigs that use this information...

My plan is to include the options in this PR, and make an accompanying PR for my CUDA hello world that uses these options in the way described above. The rest is then up to anyone updating or creating new EasyBlocks/EasyConfigs that somehow use information on the CUDA compute capability.

@casparvl
Contributor

Ok, change of plans. After thinking it over, this would be massive scope creep that would delay the sanity check part that we primarily care about in this PR. Instead, in this PR, I'll focus on just that: a sanity check for the CUDA device code. We can assume that everyone using EasyBuild expects this to be the meaning of --cuda-compute-capabilities, i.e. if they specify 8.0,9.0, they expect the resulting binaries to contain device code for 8.0 and 9.0. Which virtual architecture was used to get there, or which PTX codes are shipped as part of the binary, is not relevant to that expectation, and can be considered a further optimization that we can do in a separate PR.

I will retain the code that prints a warning for the PTX code not matching the highest architecture. Or maybe demote it to an info message. In any case, it's convenient for future reference if EasyBuild extracts this information.

I will not implement a strict option for the PTX code sanity check in this PR. It does not make sense to be sanity checking for behavior that we haven't clearly defined, i.e. there is no clear definition of what PTX code is expected to be included when someone sets --cuda-compute-capabilities.

@casparvl
Contributor

Everything not sanity-check related is now described in this issue, which can be used to create one or more follow-up PRs.

…nity check on surplus CUDA archs if this option is set. Otherwise, print warning
@casparvl
Contributor

casparvl commented Feb 21, 2025

Tested by adding

cuda_sanity_ignore_files = [
    'bin/binomialOptions_nvrtc',
    'bin/jitLto',
    'bin/inlinePTX_nvrtc',
    'bin/conjugateGradientCudaGraphs',
    'bin/simpleVoteIntrinsics_nvrtc',
    'bin/MersenneTwisterGP11213',
    'bin/nvJPEG_encoder',
    'bin/vectorAdd_nvrtc',
    'bin/clock_nvrtc',
    'bin/nvJPEG',
    'bin/BlackScholes_nvrtc',
    'bin/simpleAtomicIntrinsics_nvrtc',
    'bin/batchedLabelMarkersAndLabelCompressionNPP',
    # 'bin/conjugateGradient',
    'bin/simpleAssert_nvrtc',
    'bin/matrixMul_nvrtc',
    'bin/cuSolverDn_LinearSolver',
    'bin/quasirandomGenerator_nvrtc',
    'bin/watershedSegmentationNPP',
    'bin/simpleTemplates_nvrtc'
]

to CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb. Then, with:

eb CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb --rebuild

my build succeeds whereas with

eb CUDA-Samples-12.1-GCC-12.3.0-CUDA-12.1.1.eb --rebuild --strict-cuda-sanity-check

it fails with:

== 2025-02-21 21:41:21,349 build_log.py:226 ERROR EasyBuild encountered an error (at easybuild/easybuild-framework/easybuild/tools/build_log.py:166 in caller_info): Sanity check failed: Mismatch between cuda_compute_capabilities and device code in /home/casparl/eessi/versions/2023.06/software/linux/x86_64/amd/zen2/software/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1/bin/conjugateGradient. Surplus compute capabilities: 5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.6, 8.9, 9.0.  (at easybuild/easybuild-framework/easybuild/framework/easyblock.py:4010 in _sanity_check_step)

as intended.

Only thing left to do for this PR is tests. Not my strong suit, to be honest, but let's see. I guess the tricky thing here is that a true test requires a real CUDA binary, and I'm not sure that's even feasible... To build one, I'd need a CUDA module in the test environment - I'm not sure we have that. I could try to find a CUDA binary that we could just install (maybe include a hello-world type of CUDA binary) and test with that... Maybe that's the most feasible option. But I have no clue whether we can reasonably include binaries in the repo under the test directory. I have an 800 KB hello-world binary; that shouldn't be too crazy, I guess.

@ocaisa
Member

ocaisa commented Feb 21, 2025

What you can do is create a mock cuobjdump script that parrots output; you're only checking that EB can run the command and parse the output.
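
Something like this, say (a minimal sketch of such a mock, assuming the test prepends its location to $PATH so it shadows any real cuobjdump):

#!/usr/bin/env python3
# Mock 'cuobjdump' for the test suite: ignore the file we are given and
# parrot fixed output, so the CUDA sanity check can be exercised without
# a real CUDA binary or CUDA toolkit in the test environment.
print("""
Fatbin elf code:
================
arch = sm_80
code version = [1,7]
host = linux
compile_size = 64bit""")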

@casparvl
Contributor

Damn, you're good. It took me 25 more minutes of looking at other examples to figure out that even if I could include a real binary, I'd lack the cuobjdump executable. Might indeed just as well fake the cuobjdump output on a toy build example.

@ocaisa ocaisa enabled auto-merge May 16, 2025 09:09
@ocaisa
Member

ocaisa commented May 16, 2025

Thanks a lot @jfgrimm for seeding this and @casparvl for getting it over the line, I think this is of high value

@boegel
Member

boegel commented May 16, 2025

test_toy_cuda_sanity_check is failing with:

easybuild.tools.build_log.EasyBuildError: "Installation of CUDA-5.5.22.eb failed: 'cuda_5.5.22_linux_64.run has unknown file extension'"

@ocaisa ocaisa closed this May 16, 2025
auto-merge was automatically disabled May 16, 2025 11:20

Pull request was closed

@ocaisa ocaisa reopened this May 16, 2025
rename `--cuda-sanity-check-error-on-fail` to `--cuda-sanity-check-error-on-failed-checks` + improve help text for `--cuda-sanity-check-*` configuration options
@ocaisa
Member

ocaisa commented May 16, 2025

test_toy_cuda_sanity_check is failing with:

easybuild.tools.build_log.EasyBuildError: "Installation of CUDA-5.5.22.eb failed: 'cuda_5.5.22_linux_64.run has unknown file extension'"

I think this is only failing in some scenarios because an HMNS is being used (which means the easyconfig has to be parsed)

A quick fix is to change the source format in the CUDA easyconfigs (or in that one in particular)

EDIT: HMNS is not it

boegel and others added 3 commits May 16, 2025 14:58
rename `--cuda-sanity-check-error-on-fail` to `--cuda-sanity-check-error-on-failed-checks` + improve help text for `--cuda-sanity-check-*` configuration options
jfgrimm and others added 2 commits May 16, 2025 15:24
also consider shared libraries under `lib/python*/site-packages` in CUDA sanity check
extend `test_toy_cuda_sanity_check` to also check whether shared libraries under `lib/python*/site-packages` are being checked in CUDA sanity check
@ocaisa
Member

ocaisa commented May 16, 2025

Looks like the failing test is yet another issue with rate limits.

@ocaisa
Member

ocaisa commented May 16, 2025

@boegel Good to go now

Member

@boegel boegel left a comment


This is a big change, implementing a big enhancement to the sanity check step performed by EasyBuild for builds of GPU software (anything that involves CUDA as a direct dependency).

It's been extensively tested and reviewed by several people, so time to get this in (still in time for the upcoming EasyBuild v5.1.0 release)!

@jfgrimm Thanks a lot for kickstarting this, and thanks to @casparvl and @ocaisa for all the work that was done on this in recent weeks!

@boegel boegel merged commit 1118cc0 into easybuilders:develop May 20, 2025
55 of 57 checks passed
res = run_shell_cmd("file %s" % path, fail_on_error=False, hidden=True, output_file=False, stream_output=False)
if res.exit_code != EasyBuildExit.SUCCESS:
    fail_msg = "Failed to run 'file %s': %s" % (path, res.output)
    _log.warning(fail_msg)
Contributor


Shouldn't this exit here?

for entry in os.listdir(dirpath):
    path = os.path.join(dirpath, entry)
    if os.path.isfile(path):
        self.log.debug(f"Sanity checking file {path} for CUDA device code")
Contributor


This message is a duplicate, isn't it?

result = None
if any(x in res.output for x in ['executable', 'object', 'archive']):
    # Make sure we have a cuobjdump command
    if not shutil.which('cuobjdump'):
