Merged
Changes from 110 commits
120 commits
e329d46
sanity check binaries/libraries for device code matching cuda_compute…
jfgrimm Oct 24, 2024
c8cece2
Merge branch '5.0.x' into cuda-device-code-sanity-check
casparvl Feb 19, 2025
ee63b8e
Add check for PTX, more explicit debug logging
Feb 20, 2025
de6d49d
That return should not be there, as it will stop the sanity check aft…
Feb 20, 2025
0e97868
Fix some logic in the PTX warning printed
Feb 20, 2025
6b6d2c8
Add option for ignoring individual files in the CUDA sanity check
Feb 20, 2025
6568909
Add strict-cuda-sanity-check option and make sure we only fail the sa…
Feb 21, 2025
3d07ef6
This is a work in progress for creating a set of tests...
Feb 21, 2025
f13fca2
First test working..
Feb 22, 2025
bbe189d
Restructure logging messages a bit
Mar 5, 2025
caff559
quote path in debug log string
Mar 5, 2025
a6408ff
Added unit tests
Mar 5, 2025
ba960aa
Fix hound issues
Mar 5, 2025
f569ba4
flake8 compliance: add extra blank line
Mar 5, 2025
025604b
Resolved conflict
Apr 3, 2025
17dc755
Make sure strict_cuda_sanity_check is in the list for the correct def…
Apr 3, 2025
354f071
Add accept-ptx-as-cc-support and accept-missing-cuda-ptx options
Apr 3, 2025
4634cd4
Fix indentation mistake
Apr 3, 2025
dd5feda
Get rid of early return, which apparently BDFL doesn't like :)
Apr 3, 2025
25fccd1
Fix too long line
Apr 3, 2025
563ba3a
Implement small review comments
Apr 3, 2025
ed49c46
Fixed typo
Apr 3, 2025
f2f252d
Check for presence of cuobjdump. If sanity check isn't run, raise tha…
Apr 3, 2025
1e97753
Raise missing cuobjdump to error: if CUDA root was defined, then this…
Apr 3, 2025
39e5652
downgrade to debug message, as the CUDA root not being defined is pre…
Apr 3, 2025
8e70838
Limit to a single return statement
Apr 3, 2025
dbf7a7e
No more early returns
Apr 3, 2025
7a919f3
Start adding summary data
Apr 3, 2025
dab2042
Add logic to deal with accept-ptx-for-cc-support and --accept-missing…
Apr 8, 2025
be0a990
Further counting and summary collection
Apr 8, 2025
384c17a
Fix some hound issues
Apr 8, 2025
5b0dd43
First attempt at summary report
Apr 8, 2025
039542f
Make formatting more readable
Apr 8, 2025
ec0683d
Fix hound issues
Apr 8, 2025
84a1905
Fix one more hound issue
Apr 8, 2025
7508104
Add clear distinction between ignores & failures. Also, clearly indic…
Apr 8, 2025
213acef
Truncate too long lines
Apr 8, 2025
e56cace
Fix linting errors
Apr 8, 2025
1c230d6
Fix missing f for f-strings:
Apr 8, 2025
24f6b8a
Add option to ignore all CUDA sanity failures to not break current Ea…
Apr 9, 2025
92073b1
Implement an option te report, but ignore _all_ failures
Apr 9, 2025
aecd62d
Fix typo and missing comma
Apr 9, 2025
74d7349
Replaced all num_X with len(files_x), we don't need separate counters
Apr 10, 2025
31dc541
Update easybuild/framework/easyblock.py
casparvl Apr 10, 2025
9266344
Removed some forgotten num_files_X and replaced with len(files_X)
Apr 10, 2025
2a03e2e
Change option names
Apr 10, 2025
931cd8c
Fix cuda-compute-capabilities description to be more specific that fa…
Apr 10, 2025
a466f36
Various changes from code review
Apr 10, 2025
1bbff1b
Replaced more occurences of cc by devcode
Apr 10, 2025
22c3c23
only store relative paths in the files_X variables
Apr 10, 2025
4166d34
Processed various review comments...
Apr 10, 2025
45bfcda
Fix hound issues
Apr 10, 2025
3b9b386
Renamed function
Apr 10, 2025
0688117
Various review comments processed
Apr 10, 2025
050226f
Fixed hound issues:
Apr 10, 2025
c8a448a
Make sure to raise an error if cuobjdump doesnt exist, or if it retur…
Apr 10, 2025
dd2be94
Raise info to warning when we're not erroring on failure
Apr 10, 2025
b0d5d5f
Fix linting issues
Apr 14, 2025
5c0adce
Deduplicate code by replacing get_cuda_device_code_and_ptx_architectu…
Apr 14, 2025
9c2167f
Grammar fix
Apr 14, 2025
4ba7942
Fix whitespace
Apr 14, 2025
8d94d87
Fix undefined name
Apr 14, 2025
0b615e1
Create mock setup for get_cuda_object_dump_raw and get_cuda_architecture
Apr 15, 2025
f9e99a2
Fix naming in config and put in the correct (alphabetical) place in t…
Apr 15, 2025
aca934d
Make sure archives are also checked. Libary does _not_ seem to be an …
Apr 15, 2025
7fde91b
Initial (working) version of a unit test for get_cuda_object_dump_raw
Apr 15, 2025
2258d91
Fix hound issues
Apr 15, 2025
f782df8
More test cases
Apr 15, 2025
3ba1d7b
Remove a stray print statement
Apr 15, 2025
79d7084
Add remaining test cases for get_cuda_architecture and get_cuda_objec…
Apr 15, 2025
316e71f
Change if-elif-else into a nested if-else, with an if-if. This is sin…
Apr 15, 2025
494bd95
Don't keep accumulating the fail_msg after we have logged it
Apr 15, 2025
8a9ea6d
added f to fstring...
Apr 15, 2025
a2960c2
Replace Surplus with Additional in warning
Apr 15, 2025
a62cdaa
Updateing toy builds. 10 test cases defined, of which 3 are implement…
Apr 15, 2025
11cf157
Architectures can be 9.0a or 10.0a now, i.e. sm_90a is a valid optimi…
Apr 16, 2025
8d9720e
Some missing f-strings and small refinements in the summary reporting…
Apr 16, 2025
b426226
Removed old tests and replaced them with new ones. New tests check al…
Apr 16, 2025
e53143e
Fix hound issues
Apr 16, 2025
5f533ea
Fix linting issues
Apr 16, 2025
2ee867b
Remove f-string, as there are no placeholders in this string
Apr 16, 2025
9c32b67
Fix unit test expected result
Apr 16, 2025
4fb884b
Add a test that triggers the if missing_ptx_ccs: if path in ignore_fi…
May 12, 2025
a02d198
Remove blank line
May 12, 2025
45f659c
Now make sure test 1.a. actually fails on the test case @ocaisa found…
May 12, 2025
f9f3050
Fix the test so that it now passes once the issue is fixed... It was …
May 12, 2025
1a5cd5d
Fix the issue with the missing ignore_msg
May 12, 2025
19cfe04
Apply suggestions from code review
casparvl May 12, 2025
7e3a2dd
Remove check_cuobjdump, as it is not needed anymore
May 12, 2025
2442d44
Format file lists on separate lines for better readability of the logs
May 12, 2025
97cef2b
Still need to do some formatting, but things now go to trace output f…
May 12, 2025
c358e27
Print both to trace output (with short version of advice), and to log…
May 13, 2025
588c342
Fix hound issues - and hopefully CI checks
May 13, 2025
09a182f
Fix missing condition
May 13, 2025
1787c47
Added missing f to f-string
May 13, 2025
0f85f13
Modify tests for the new syntax, and to also check the trace output
May 13, 2025
4b78bd7
Fix hound issues
May 13, 2025
581767c
Apply suggestions from code review
ocaisa May 14, 2025
f027c51
Apply suggestions from code review
ocaisa May 15, 2025
8b6c40d
Apply suggestions from code review
ocaisa May 15, 2025
d6620d4
Apply suggestions from code review
ocaisa May 15, 2025
39e5561
Update easybuild/framework/easyblock.py
casparvl May 15, 2025
2bbfff9
Update easybuild/framework/easyblock.py
casparvl May 15, 2025
f009b7d
Fix unit tests for the new setup where we check if CUDA is a dep, ins…
May 15, 2025
a901ba5
Don't set EBROOTCUDA anymore, it's no longer needed
May 15, 2025
e304b17
Keep failure message short: just list the number of files, and refer …
May 15, 2025
4273f16
Make defining a non-empty failure message conditional on an actual fa…
May 15, 2025
c2cdb87
We don't need to track with is_failure, we can just check if any of t…
May 15, 2025
e9ec501
Fix too long line
May 15, 2025
a17a42c
Fix unit tests to accomodate for the difference in the error message …
May 15, 2025
190156b
rename --cuda-sanity-check-error-on-fail to --cuda-sanity-check-error…
boegel May 16, 2025
b14cceb
Add fake modulefile for CUDA in Tcl format as well
May 16, 2025
abc108b
Spread over two writes
May 16, 2025
b6eb063
Merge branch 'develop' into cuda-device-code-sanity-check
May 16, 2025
22858ec
also rename to --cuda-sanity-check-error-on-failed-checks in comments…
boegel May 16, 2025
2655a07
Merge pull request #2 from boegel/cuda-device-code-sanity-check
jfgrimm May 16, 2025
ceacffa
also consider shared libraries under lib/python*/site-packages in CUD…
boegel May 16, 2025
e73900c
Merge pull request #3 from boegel/cuda-device-code-sanity-check
jfgrimm May 16, 2025
7e92cd5
extend test_toy_cuda_sanity_check to also check whether shared librar…
boegel May 16, 2025
5cef2e0
Merge pull request #4 from boegel/cuda-device-code-sanity-check
jfgrimm May 16, 2025
375 changes: 373 additions & 2 deletions easybuild/framework/easyblock.py

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions easybuild/framework/easyconfig/default.py
@@ -124,6 +124,11 @@
'after make (for e.g.,"test" for make test)'), BUILD],
'bin_lib_subdirs': [[], "List of subdirectories for binaries and libraries, which is used during sanity check "
"to check RPATH linking and banned/required libraries", BUILD],
'cuda_sanity_ignore_files': [[], "List of files (relative to the installation prefix) for which failures in "
"the CUDA sanity check step are ignored. Typically used for files where you "
"know the CUDA architectures in those files don't match the "
"--cuda-compute-capabilities configured for EasyBuild AND where you know "
"that this is ok / reasonable (e.g. binary installations)", BUILD],
'sanity_check_commands': [[], ("format: [(name, options)] e.g. [('gzip','-h')]. "
"Using a non-tuple is equivalent to (name, '-h')"), BUILD],
'sanity_check_paths': [{}, ("List of files and directories to check "
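As an illustration of the new `cuda_sanity_ignore_files` easyconfig parameter described in this hunk, a minimal sketch of how it might be set is shown here. The software name, version, and file paths are hypothetical; only the parameter name and the "relative to the installation prefix" convention come from the diff.

```python
# Hypothetical easyconfig fragment; easyconfig files use Python syntax.
# The name, version and paths below are made up for illustration.
name = 'example-cuda-app'
version = '1.0'

# Paths are relative to the installation prefix, per the parameter description.
# Failures of the CUDA sanity check are ignored for these files, e.g. because
# they are prebuilt binaries whose CUDA architectures cannot be changed.
cuda_sanity_ignore_files = [
    'bin/prebuilt_tool',
    'lib/libvendor_kernels.so',
]
```

Listing files individually keeps the check active for everything else in the installation.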
4 changes: 4 additions & 0 deletions easybuild/tools/config.py
@@ -297,6 +297,10 @@ def mk_full_default_path(name, prefix=DEFAULT_PREFIX):
'backup_patched_files',
'consider_archived_easyconfigs',
'container_build_image',
'cuda_sanity_check_accept_ptx_as_devcode',
'cuda_sanity_check_accept_missing_ptx',
'cuda_sanity_check_error_on_fail',
'cuda_sanity_check_strict',
'debug',
'debug_lmod',
'dump_autopep8',
40 changes: 37 additions & 3 deletions easybuild/tools/options.py
@@ -398,7 +398,41 @@ def override_options(self):
int, 'store_or_None', None),
'cuda-compute-capabilities': ("List of CUDA compute capabilities to use when building GPU software; "
"values should be specified as digits separated by a dot, "
-"for example: 3.5,5.0,7.2", 'strlist', 'extend', None),
+"for example: 3.5,5.0,7.2. EasyBuild will (where possible) compile fat "
+"binaries with support for (at least) all requested CUDA compute "
+"capabilities, and PTX code for the highest CUDA compute capability (for "
+"forwards compatibility). The check on this behavior may be relaxed using "
+"--cuda-sanity-check-accept-missing-ptx, "
+"--cuda-sanity-check-accept-ptx-as-devcode, "
+"or made more stringent using --cuda-sanity-check-strict.",
+'strlist', 'extend', None),
'cuda-sanity-check-accept-missing-ptx': ("CUDA sanity check also passes if PTX code for the highest "
"requested CUDA compute capability is not present (but will "
"print a warning)",
None, 'store_true', False),
'cuda-sanity-check-accept-ptx-as-devcode': ("CUDA sanity check also passes if requested device code is "
"not present, as long as PTX code is present that can be "
"JIT-compiled for each target in --cuda-compute-capabilities. "
"E.g. if --cuda-compute-capabilities=8.0 and a binary is "
"found in the installation that does not have device code for "
"8.0, but it does have PTX code for 7.0, the sanity check "
"will pass if, and only if, this option is True. "
"Note that JIT-compiling means the binary will work on the "
"requested architecture, but it is not necessarily as well "
"optimized as when actual device code is present for the "
"requested architecture.",
None, 'store_true', False),
'cuda-sanity-check-error-on-fail': ("If True, failures in the CUDA sanity check will produce an error. "
"If False, the CUDA sanity check will be performed, and failures will "
"be reported, but they will not result in an error",
None, 'store_true', False),
'cuda-sanity-check-strict': ("Perform strict CUDA sanity check. Without this option, the CUDA sanity "
"check will fail if the CUDA binaries don't contain code for (at least) "
"all compute capabilities defined in --cuda-compute-capabilities, but will "
"accept if code for additional compute capabilities is present. "
"With this setting, the sanity check will also fail if code is present for "
"more compute capabilities than defined in --cuda-compute-capabilities.",
None, 'store_true', False),
'debug-lmod': ("Run Lmod modules tool commands in debug module", None, 'store_true', False),
'default-opt-level': ("Specify default optimisation level", 'choice', 'store', DEFAULT_OPT_LEVEL,
Compiler.COMPILER_OPT_OPTIONS),
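The option descriptions above imply a per-file decision procedure. The sketch below is one possible reading of that help text, not the code from `easybuild/framework/easyblock.py` (that diff is not rendered on this page). The function names `parse_cc` and `check_cuda_file` are hypothetical, and so are two assumptions: that the PTX check is applied independently of the device-code check, and that the `a` suffix can be ignored when ordering compute capabilities.

```python
def parse_cc(cc):
    """Turn '9.0' or '9.0a' into a sortable (major, minor) tuple.

    Simplification (assumption): the 'a' suffix is ignored for ordering.
    """
    major, minor = cc.rstrip('a').split('.')
    return int(major), int(minor)


def check_cuda_file(requested, devcode, ptx, accept_ptx_as_devcode=False,
                    accept_missing_ptx=False, strict=False):
    """Return (passed, messages) for one file. Hypothetical helper."""
    passed = True
    messages = []

    # 1. Device code must be present for every requested compute capability.
    missing = [cc for cc in requested if cc not in devcode]
    if missing and accept_ptx_as_devcode:
        # PTX for an equal or lower capability can be JIT-compiled for the target.
        missing = [cc for cc in missing
                   if not any(parse_cc(p) <= parse_cc(cc) for p in ptx)]
    if missing:
        passed = False
        messages.append("missing device code for: " + ', '.join(missing))

    # 2. Additional device code is only a failure in strict mode.
    additional = [cc for cc in devcode if cc not in requested]
    if additional:
        if strict:
            passed = False
        messages.append("additional device code for: " + ', '.join(additional))

    # 3. PTX is expected for the highest requested compute capability.
    highest = max(requested, key=parse_cc)
    if highest not in ptx:
        if not accept_missing_ptx:
            passed = False
        messages.append("missing PTX code for: " + highest)

    return passed, messages


# The example from the help text: 8.0 requested, only 7.0 device code and PTX found.
print(check_cuda_file(['8.0'], ['7.0'], ['7.0']))
print(check_cuda_file(['8.0'], ['7.0'], ['7.0'],
                      accept_ptx_as_devcode=True, accept_missing_ptx=True))
```

Under these assumptions the first call fails and the second passes, although it still reports the additional 7.0 device code and the missing 8.0 PTX as messages.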
@@ -544,7 +578,7 @@ def override_options(self):
"Git commit to use for the target software build (robot capabilities are automatically disabled)",
None, 'store', None),
'sticky-bit': ("Set sticky bit on newly created directories", None, 'store_true', False),
-'strict-rpath-sanity-check': ("Perform strict RPATH sanity check, which involces unsetting "
+'strict-rpath-sanity-check': ("Perform strict RPATH sanity check, which involves unsetting "
"$LD_LIBRARY_PATH before checking whether all required libraries are found",
None, 'store_true', False),
'sysroot': ("Location root directory of system, prefix for standard paths like /usr/lib and /usr/include",
@@ -944,7 +978,7 @@ def validate(self):
# values passed to --cuda-compute-capabilities must be of form X.Y (with both X and Y integers),
# see https://developer.nvidia.com/cuda-gpus
if self.options.cuda_compute_capabilities:
-cuda_cc_regex = re.compile(r'^[0-9]+\.[0-9]+$')
+cuda_cc_regex = re.compile(r'^[0-9]+\.[0-9]+a?$')
faulty_cuda_ccs = [x for x in self.options.cuda_compute_capabilities if not cuda_cc_regex.match(x)]
if faulty_cuda_ccs:
error_msg = "Incorrect values in --cuda-compute-capabilities (expected pattern: '%s'): %s"
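To see what the relaxed pattern accepts, the following sketch runs the new regular expression from this hunk against a few hand-picked values. The candidate list is made up for illustration; the pattern and the list comprehension mirror the diff.

```python
import re

# Pattern as changed in this hunk: an optional trailing 'a' is now allowed.
cuda_cc_regex = re.compile(r'^[0-9]+\.[0-9]+a?$')

# Illustrative inputs only.
candidates = ['8.0', '9.0a', '10.0', '9', '9.0b', 'sm_90']
faulty_cuda_ccs = [x for x in candidates if not cuda_cc_regex.match(x)]

print(faulty_cuda_ccs)  # ['9', '9.0b', 'sm_90']
```

Values such as `9.0a` now pass validation, while other suffixes and the `sm_XY` spelling are still rejected.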
103 changes: 103 additions & 0 deletions easybuild/tools/systemtools.py
@@ -29,6 +29,7 @@

* Jens Timmerman (Ghent University)
* Ward Poelmans (Ghent University)
* Jasper Grimm (UoY)
* Jan Andre Reuter (Forschungszentrum Juelich GmbH)
"""
import csv
@@ -41,6 +42,7 @@
import platform
import pwd
import re
import shutil
import struct
import sys
import termios
@@ -64,6 +66,7 @@
pass

from easybuild.base import fancylogger
from easybuild.tools import LooseVersion
from easybuild.tools.build_log import EasyBuildError, EasyBuildExit, print_warning
from easybuild.tools.config import IGNORE
from easybuild.tools.filetools import is_readable, read_file, which
@@ -998,6 +1001,106 @@ def get_glibc_version():
return glibc_ver


def get_cuda_object_dump_raw(path):
    """
    Get raw output from command which extracts information from CUDA binary files in a human-readable format,
    or None for files containing no CUDA device code.
    See https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#cuobjdump
    """

    res = run_shell_cmd("file %s" % path, fail_on_error=False, hidden=True, output_file=False, stream_output=False)
    if res.exit_code != EasyBuildExit.SUCCESS:
        fail_msg = "Failed to run 'file %s': %s" % (path, res.output)
        _log.warning(fail_msg)

Review comment (Contributor): Shouldn't this exit here?

    # check that the file is an executable or object (shared library) or archive (static library)
    result = None
    if any(x in res.output for x in ['executable', 'object', 'archive']):
        # Make sure we have a cuobjdump command
        if not shutil.which('cuobjdump'):
            raise EasyBuildError("Failed to get object dump from CUDA file: cuobjdump command not found")
        cuda_cmd = f"cuobjdump {path}"
        res = run_shell_cmd(cuda_cmd, fail_on_error=False, hidden=True, output_file=False, stream_output=False)
        if res.exit_code == EasyBuildExit.SUCCESS:
            result = res.output
        else:
            # Check and report for the common case that this is simply not a CUDA binary, i.e. does not
            # contain CUDA device code
            no_device_code_match = re.search(r'does not contain device code', res.output)
            if no_device_code_match is not None:
                # File is a regular executable, object or library, but not a CUDA file
                msg = "'%s' does not appear to be a CUDA binary: cuobjdump failed to find device code in this file"
                _log.debug(msg, path)
            else:
                # This should not happen: there was no string saying this was NOT a CUDA file, yet no device code
                # was found at all
                msg = "Dumping CUDA binary file information for '%s' via '%s' failed! Output: '%s'"
                raise EasyBuildError(msg, path, cuda_cmd, res.output)

    return result
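The pre-filter on the output of `file` in `get_cuda_object_dump_raw` is a plain substring test. The sketch below isolates that test in a hypothetical helper, `may_contain_device_code`, and applies it to hand-written strings that resemble typical `file` descriptions. The sample strings are assumptions and were not captured from a real run.

```python
def may_contain_device_code(file_output):
    """Same substring test as in the diff: executable, object or archive."""
    return any(x in file_output for x in ['executable', 'object', 'archive'])


# Hand-written descriptions resembling `file` output (illustrative only).
samples = {
    'libfoo.so': 'ELF 64-bit LSB shared object, x86-64, dynamically linked',
    'foo': 'ELF 64-bit LSB executable, x86-64, dynamically linked',
    'libfoo.a': 'current ar archive',
    'README': 'ASCII text',
    'run.py': 'Python script, ASCII text executable',
}

for name, description in samples.items():
    print(name, may_contain_device_code(description))
```

Because the test matches on the word "executable", script files described as "text executable" would also be handed to `cuobjdump`. Which branch they then take depends on what `cuobjdump` reports for such files, which is not verified here.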


def get_cuda_architectures(path, section_type):
    """
    Get a sorted list of CUDA architectures supported in the file in 'path'.
    path: full path to a CUDA file
    section_type: the type of section in the cuobjdump output to check for architectures ('elf' or 'ptx')
    Returns None if no CUDA device code is present in the file
    """

    # Note that typical output for a cuobjdump call will look like this for device code:
    #
    # Fatbin elf code:
    # ================
    # arch = sm_90
    # code version = [1,7]
    # host = linux
    # compile_size = 64bit
    #
    # And for ptx code, it will look like this:
    #
    # Fatbin ptx code:
    # ================
    # arch = sm_90
    # code version = [8,1]
    # host = linux
    # compile_size = 64bit

    # Pattern to extract elf code architectures and ptx code architectures respectively
    code_regex = re.compile(f'Fatbin {section_type} code:\n=+\narch = sm_([0-9]+)([0-9]a?)')

    # resolve symlinks
    if os.path.islink(path) and os.path.exists(path):
        path = os.path.realpath(path)

    cc_archs = None
    cuda_raw = get_cuda_object_dump_raw(path)
    if cuda_raw is not None:
        # extract unique device code architectures from raw dump
        code_matches = re.findall(code_regex, cuda_raw)
        if code_matches:
            # convert match tuples into unique list of cuda compute capabilities
            # e.g. [('8', '6'), ('8', '6'), ('9', '0')] -> ['8.6', '9.0']
            cc_archs = sorted(['.'.join(m) for m in set(code_matches)], key=LooseVersion)
        else:
            # Try to be clear in the warning... did we not find elf/ptx code sections at all? or was the arch missing?
            section_regex = re.compile(f'Fatbin {section_type} code')
            section_matches = re.findall(section_regex, cuda_raw)
            if section_matches:
                fail_msg = f"Found Fatbin {section_type} code section(s) in cuobjdump output for {path}, "
                fail_msg += "but failed to extract CUDA architecture"
            else:
                # In this case, the "Fatbin {section_type} code" section is simply missing from the binary
                # It is entirely possible for a CUDA binary to have only device code or only ptx code (and thus the
                # other section could be missing). However, considering --cuda-compute-capabilities is supposed to
                # generate both PTX and device code (at least for the highest CC in that list), it is unexpected
                # in an EasyBuild context and thus we print a warning
                fail_msg = f"Failed to find Fatbin {section_type} code section(s) in cuobjdump output for {path}."
            _log.warning(fail_msg)

    return cc_archs
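The extraction step in `get_cuda_architectures` can be tried without `cuobjdump` by feeding the same regular expression a hand-written dump in the format shown in the code comments. In the sketch below, `SAMPLE` and `extract_archs` are hypothetical. To keep the example self-contained, a simple `(int, str)` sort key stands in for `LooseVersion`, so ordering in edge cases may differ from the real function.

```python
import re

# Hand-written text in the format described in the comments of the diff.
SAMPLE = """
Fatbin elf code:
================
arch = sm_86
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin elf code:
================
arch = sm_90a
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin elf code:
================
arch = sm_100
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_90
code version = [8,1]
host = linux
compile_size = 64bit
"""


def extract_archs(cuda_raw, section_type):
    """Return sorted compute capabilities found in 'elf' or 'ptx' sections."""
    # Same pattern as in the diff: the last digit (plus optional 'a') is the minor version.
    code_regex = re.compile(f'Fatbin {section_type} code:\n=+\narch = sm_([0-9]+)([0-9]a?)')
    matches = set(code_regex.findall(cuda_raw))

    def sort_key(cc):
        # Stand-in for LooseVersion (assumption): numeric major, string minor.
        major, minor = cc.split('.')
        return int(major), minor

    return sorted(('.'.join(m) for m in matches), key=sort_key)


print(extract_archs(SAMPLE, 'elf'))  # ['8.6', '9.0a', '10.0']
print(extract_archs(SAMPLE, 'ptx'))  # ['9.0']
```

Sorting on a numeric major version is what places 10.0 after 9.0a. A plain string sort would put it first.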


def get_linked_libs_raw(path):
"""
Get raw output from command that reports linked libraries for dynamically linked executables/libraries,