-
Notifications
You must be signed in to change notification settings - Fork 217
add a CUDA device code sanity check #4692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add a CUDA device code sanity check #4692
Conversation
…_capabilities when CUDA is used
|
It's great that you looked into this, we've also been discussing it in EESSI: https://gitlab.com/eessi/support/-/issues/92 |
|
@ocaisa thanks for the link, I'll take a look Currently, main things I still plan to add to this pr:
|
|
I think it's a good idea to check for device code and ptx (with lack of ptx for the highest compute capability being a warning). The availability of ptx will allow you to run the application on future arch's. |
|
FYI: I checked with @jfgrimm on chat, he probably has little time to work on it in the near future. Since this is a very valuable feature for EESSI that we'd like to have before we start building a large amount of GPU software, I'll try to work on this myself. Note that @jfgrimm was ok in me pushing to his branch, so I'll do that rather than create my own PR - at least we can have the full discussion in one place, namely here. |
|
I tested this as follows:
This resulted in And many more. That's great, it means this PR is actually doing what it should. Indeed, checking manually: So, yeah... Note that there are many executables in CUDA-Samples that were build for the correct CC. E.g.: |
|
Collecting some todo's:
|
|
Ignore list seems to work. Adding to the EasyConfig for CUDA-Samples results in and note that this binary does not get listed in the failure message. So that's the intended behavior: the warning is still printed, but it doesn't result in an error. |
|
Just to test: putting all of these files in the ignore list, the installation of CUDA-Samples now passes. This provides a nice starting point for further tests, I can easily just remove one from the exclude list, and check that I get the expected result. |
|
So... the whole thing with checking PTX codes makes me rethink what EasyBuild should do when a certain
But what does that mean? What do we expect the nvcc compiler to do here? Say we were to compile a simple hello world, and I would do i.e. would it only build device code for 80/90, and not include PTX? And build both through the lowest common virtual architecture? Or should it do i.e. also include the PTX code for the i.e. the stage one compilation is executed once for each CUDA compute capability, so that the generated so that it actually includes not only the device codes for CC80 and CC90, but also the PTX code for CC90 (for forwards compatibility)? Honestly, from a performance perspective, I think it would be best if EasyBuild would indeed use the generalized arguments, so that the I.e. my proposal would be that if EasyBuild is configured with Note that it may not always be possible to convince all build systems to actually do this - e.g. some codes might really only compile for a single cuda-compute-capability, or the build system doesn't make this distinction between real and virtual architectures to build for. Eventually the most robust and generic way to get this done might just be to implement I'm creating a CUDA hello world EasyConfig that we can use to serve as an easy example of 1) how we think I'm not sure what the best way forward is. If I include everything in this PR, it may be a bit heavy - although honestly at the framwork level it's just about defining the options, the real implementation would have to be done in EasyBlocks and EasyConfigs that use this information... My plan is to include the options in this PR, and make an accompanying PR for my CUDA hello world that uses these options in the way described above. The rest is then up to anyone updating or creating new EasyBlocks/EasyConfigs that somehow use information on the CUDA compute capability. |
|
Ok, change of plans. After thinking it over, this would be a massive scope creep that would delay the sanity check part that we primarily care about in this PR. Instead, in this PR, I'll focus on just that: a sanity check for the CUDA device codes. We can assume that everyone using EasyBuild expect this to be the meaning of the I will retain the code that prints a warning for the PTX code not matching the highest architecture. Or maybe demote it to an info message. In any case, it's convenient for future reference if EasyBuild extracts this information. I will not implement a strict option for the PTX code sanity check in this PR. It does not make sense to be sanity checking for behavior that we haven't clearly defined, i.e. there is no clear definition of what PTX code is expected to be included when someone sets |
|
Everything not sanity-check related is now described in this issue, which can be used to create one or more follow-up PRs. |
…nity check on surpluss CUDA archs if this option is set. Otherwise, print warning
|
Tested by adding To my build succeeds whereas with It fails with: as intended. Only thing left to do for this PR is tests. Not my strong suit to be honest, but let's see. I guess the tricky thing here is that a true test requires a real CUDA binary, and I'm not sure that's even feasible... To build one, I'd need a CUDA module in the test environment - I'm not sure if we have that. I could try to find a CUDA binary that we could just install (maybe just include a hello-world type of CUDA binary) and test with that... Maybe that's the most feasible option. But I have no clue if we can reasonable include binaries in the repo under the test directory. I have an 800KB hello world binary, that shouldn't be too crazy I guess. |
|
What you can do is create a mock |
|
Damn your good, it took me 25 more minutes of looking at other examples to figure out that even if I could ingest a binary, I'd lack the |
|
|
…-on-failed-checks + improve help text for --cuda-sanity-check-* configuration options
I think this is only failing some scenarios because a HMNS is being used (which means the easyconfig has to be parsed) A quick fix is to change the source format in the CUDA easyconfigs (or in that one in particular) EDIT: HMNS is not it |
…, trace/log messages, and tests
rename `--cuda-sanity-check-error-on-fail` to `--cuda-sanity-check-error-on-failed-checks` + improve help text for `--cuda-sanity-check-*` configuration options
also consider shared libraries under `lib/python*/site-packages` in CUDA sanity check
…ies under lib/python*/site-packages are being checked in CUDA sanity check
extend `test_toy_cuda_sanity_check` to also check whether shared libraries under `lib/python*/site-packages` are being checked in CUDA sanity check
|
Looks like failing test is yet another issue with rate limits:
|
|
@boegel Good to go now |
boegel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big change, implementing a big enhancement to the sanity check step performed by EasyBuild for builds of GPU software (anything that involves CUDA as a direct dependency).
It's been extensively tested and reviewed by several people, so time to get this in (still in time for the upcoming EasyBuild v5.1.0 release)!
@jfgrimm Thanks a lot for kickstarting this, and thanks for @casparvl and @ocaisa for all the work that was done on this in recent weeks!
| res = run_shell_cmd("file %s" % path, fail_on_error=False, hidden=True, output_file=False, stream_output=False) | ||
| if res.exit_code != EasyBuildExit.SUCCESS: | ||
| fail_msg = "Failed to run 'file %s': %s" % (path, res.output) | ||
| _log.warning(fail_msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this exit here?
| for entry in os.listdir(dirpath): | ||
| path = os.path.join(dirpath, entry) | ||
| if os.path.isfile(path): | ||
| self.log.debug("Sanity checking file {path} for CUDA device code") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message is duplicate isn't it?
| result = None | ||
| if any(x in res.output for x in ['executable', 'object', 'archive']): | ||
| # Make sure we have a cuobjdump command | ||
| if not shutil.which('cuobjdump'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already checked at https://github.com/easybuilders/easybuild-framework/pull/4692/files#diff-00260ae7a519d5825760f53b067b29fb84a3e0d2649e6a27ace99abaca96d7d1R4361 is this required in both places?
At the moment, we do no checking that the cuda compute capabilities that EasyBuild is configured to use, are actually used in the resultant binaries/libraries
WIP PR to introduce an extra sanity check when CUDA is present to check for mismatches between
cuda_compute_capabilitiesand whatcuobjdumpreports