{system}[system/system] CUDAcompat v11.6, CUDAcompat v11.7 #15892

Conversation
Force-pushed from ca548de to 1bee6aa
Test report by @Flamefire

Test report by @Flamefire

Test report by @ocaisa

---

About the EULA, I suggest we err on the side of caution and require it.
---
Should this be added as a versionsuffix? There are, for example, five releases of the 11.4 CUDAcompat libraries.
---
When going through the official process at https://www.nvidia.com/Download/index.aspx?lang=en-us I only get one. So I wouldn't litter the module name yet, and only do so when and if the need arises. See Python: no suffix as long as we only use one/the default version.
---
Take a look at the compat libraries from https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ and you can see all the driver releases for each minor version. Given that we know there is a uniqueness issue, I still think it is better to include it.
---
The driver versions do seem to map to a patch version though, so you could include that instead... but from what I can see, the mapping from patch version to driver version can only be known by peeping at something like https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ and figuring it out.
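For example, here is a standalone sketch of how one could enumerate that mapping from the repo index. It assumes the `cuda-compat-11-X_<driver>-1_amd64.deb` package naming scheme used in that repo; adjust the pattern if NVIDIA's naming differs:

```python
import re
import urllib.request

URL = 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/'

# Scrape the repo index and list each cuda-compat package with its driver version.
html = urllib.request.urlopen(URL).read().decode('utf-8', errors='replace')
for minor, driver in sorted(set(re.findall(r'cuda-compat-11-(\d+)_([\d.]+)-\d+_amd64\.deb', html))):
    print('CUDA 11.%s -> driver %s' % (minor, driver))
```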
---
It seems that the runfile downloads are only there for the latest version, not the earlier ones. Hence on the runfile download page you only get one version.

It may be worth taking the Java approach and having one alias package for 11.6 and individual packages for the versions, with the 11.6 one pointing to the latest.

IMO we can do that later, when we decide to add another EC for the same CUDA version, but that would be an enhancement and I don't think it is worth the work right now.
---
> - The different driver versions of the compat package do not matter for compatibility (given a fixed major & minor version)
>
> Anything wrong?

Yes, I agree with that, the only point being the driver version of the compat package. I suspect the version does matter if you are not on the driver branches listed in the tables at https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package, in which case whether the compat libraries will work for you is probably arbitrary.

> Hence I think a `CUDAcompat/11` package isn't very useful to have in EasyBuild: you won't know if it is compatible with your driver. So we need to include the minor versions in the ECs so site admins can install the EC which is compatible with the (currently) installed driver.

I think that depends on how clever the underlying easyblock is. To install/update `CUDAcompat/11` (a modulerc) you'd need to (ultimately) install `CUDAcompat/11.X-DRIVER_VERSION`; if the easyblock actually checks compatibility for you, then this will fail gracefully with a helpful error message. I get your point though, but I think you answer it yourself below.
> Of course we could stack this:
>
> - CUDAcompat/11 depends on the latest CUDAcompat/11.x
> - CUDAcompat/11.x depends on the latest CUDAcompat/11.x-y
>
> But we need the "middle layer".

I think that this could work. Most people would prefer to have CUDAcompat/11 that is dynamic and updates to the latest available compat package.
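For illustration, the generated aliases for this stacking could look roughly like the following (Tcl modulerc syntax; the driver versions are the ones from the ECs in this PR, and the exact files the ModuleRC easyblock writes may differ):

```
# CUDAcompat/11.6 resolves to the concrete driver-versioned module
module-version CUDAcompat/11.6-D510.73.08 11.6
# CUDAcompat/11.7 resolves likewise
module-version CUDAcompat/11.7-D515.48.07 11.7
# CUDAcompat/11 resolves to the latest minor version
module-version CUDAcompat/11.7 11
```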
> For CUDAcompat/11.4 you could have 5 different nv_versions

Yes, "could". But currently there is only one EC. And as (according to the above) we will always need a `CUDAcompat/11.4`, it is easy to change the existing EC to have a version suffix and add an alias EC.

We always start with one EC, but it is rare that we stop there. If we already know we may need to change things, I think it is better to start off with the right path rather than change the behaviour after the fact.

So my point is that currently the version suffix is not required, and not adding it doesn't prevent us from adding it later when it is required. Also, not adding it now means having only 2 new ECs instead of 4. I really want to be able to add `CUDAcompat/11.6` as a dependency to CUDA in site-specific hooks.

I think using a modulerc for CUDAcompat/11.6 is a more robust and reproducible solution. I'll test it out today.
---
Ok, I tested using a modulerc to wrap access to the specific CUDAcompat version and this seems to work without issue. This was with a trivial change to the EC in this PR:

```python
name = 'CUDAcompat'
version = '11.6'
versionsuffix = "-D%s" % nv_version
```
and then the creation of two new (trivial) easyconfigs, CUDAcompat-11.6.eb:
```python
easyblock = 'ModuleRC'

name = 'CUDAcompat'
version = '11.6'

homepage = 'https://docs.nvidia.com/deploy/cuda-compatibility/index.html'
description = """Using the CUDA Forward Compatibility package,
system administrators can run applications built using a newer toolkit
even when an older driver that does not satisfy the minimum required driver version
is installed on the system.
This forward compatibility allows the CUDA deployments in data centers and enterprises
to benefit from the faster release cadence and the latest features and performance of CUDA Toolkit.
"""

toolchain = SYSTEM

dependencies = [(name, "11.6", "-D510.73.08")]

moduleclass = 'system'
```
and `CUDAcompat-11.eb`:

```python
easyblock = 'ModuleRC'

name = 'CUDAcompat'
version = '11'

homepage = 'https://docs.nvidia.com/deploy/cuda-compatibility/index.html'
description = """Using the CUDA Forward Compatibility package,
system administrators can run applications built using a newer toolkit
even when an older driver that does not satisfy the minimum required driver version
is installed on the system.
This forward compatibility allows the CUDA deployments in data centers and enterprises
to benefit from the faster release cadence and the latest features and performance of CUDA Toolkit.
"""

toolchain = SYSTEM

dependencies = [(name, "11.6")]

moduleclass = 'system'
```
(since 11.7 is included in this PR, the dep here should really be 11.7)
---
> Most people would prefer to have CUDAcompat/11 that is dynamic and updates to the latest available compat package.

Why do you think that? If someone at our cluster were to install that EC, it would fail right away, as the driver is not compatible with 11.7. So choosing 11.6 directly is the better choice (and the only option).

Anyway, I added the 11.6, 11.7 and 11 ECs as aliases so everyone can choose. I also added the version check to the easyblock. We could even omit the CUDA version from the "lowest-level" ECs, as they are already versioned by the driver version. But keeping it allows the sanity check to work and lets users/admins easily see which one applies. I made them hidden though, as you should rather load CUDAcompat/11.6 than an individual driver version of the CUDA version.
---
> > Most people would prefer to have CUDAcompat/11 that is dynamic and updates to the latest available compat package.
>
> Why do you think that? If someone at our cluster were to install that EC, it would fail right away, as the driver is not compatible with 11.7. So choosing 11.6 directly is the better choice (and the only option).

That's kind of my point: you're making the assumption that the installed driver will never be updated. That might be true in your case, but it's unlikely to be true for everyone. If the driver gets updated to an 11.7+ driver, then all your modules that depend on 11.6 will start failing.

What the modulerc actually points to is a site choice. The two modulerc easyconfigs would likely have to be edited in a situation like yours anyway, based on when your driver is EOL. If the CUDAcompat easyblock were robust enough to actually check whether the compat libs work, then this would be a non-issue, since the primary installation would fail anyway (but as discussed in the easyblock PR, that is easier said than done).

In your particular case you would choose 11 to point to 11.6; if you had a choice, obviously you would want the latest and greatest.
---
> That's kind of my point: you're making the assumption that the installed driver will never be updated. That might be true in your case, but it's unlikely to be true for everyone. If the driver gets updated to an 11.7+ driver, then all your modules that depend on 11.6 will start failing.

Understood. So adding both layers as discussed earlier makes sense. Site admins now have the full choice of whether they want the latest 11 package, the latest 11.x package, or a specific version.

Anything else missing?
---
Can we do something with this information? Perhaps set it as a list and have the easyblock check the "true" driver version against the compatible versions, so we can verify whether the compat libraries will actually work?
---
I was thinking about this but decided against it: getting the "true" driver version isn't trivial to do portably, and you may install on a machine other than the one you'll run on.
---
You can get the driver version with

```
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```

which reports the same value regardless of the compat libraries used (in my testing at least). But of course if you don't have a GPU...
---
I'll add a check to the easyblock. I think the test step is a good place, as that allows skipping it. Similar to the TensorFlow easyblock, I'll skip the check if nvidia-smi isn't there.

I'm not fully sure how the check would be done. I'd say: match the major version of the driver against the list from the ECs, and check that the driver version is at least the matched version from the EC. If not, or if no entry with the same major version is found in the EC, it is an error. So in this case 450.36.01 and 460.0.0 would both fail the test (too low patch level and wrong major version, respectively), or do I misunderstand the NVIDIA doc?
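A minimal standalone sketch of that matching logic (for illustration only; the function names and the example compatibility list are assumptions, not the actual easyblock code):

```python
import subprocess

def get_driver_version():
    """Query the installed driver version via nvidia-smi; return None if there is no GPU/driver."""
    try:
        out = subprocess.run(
            ['nvidia-smi', '--query-gpu=driver_version', '--format=csv,noheader'],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None  # no nvidia-smi available: skip the check
    return out.splitlines()[0] if out else None

def driver_is_compatible(driver, compatible_versions):
    """Match the driver's major version against the list, then require at least that patch level."""
    def as_tuple(version):
        return tuple(int(x) for x in version.split('.'))
    drv = as_tuple(driver)
    for candidate in map(as_tuple, compatible_versions):
        if candidate[0] == drv[0]:   # same driver branch (major version)
            return drv >= candidate  # at least the listed patch level
    return False                     # no entry for this driver branch at all

# Hypothetical list taken from an EC; per the description above, both examples fail:
compat = ['450.80.02', '470.57.02']
print(driver_is_compatible('450.36.01', compat))  # False: patch level too low
print(driver_is_compatible('460.0.0', compat))    # False: no matching major version
```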
---
I'm not sure on that front; it's something I've only been able to see through testing with a CUDA executable. In my case I have access to a system with 460.73.01 drivers; all the 11.7 compat libraries fail to run CUDA executables (even though nvidia-smi reports the CUDA version just fine) with the documented error code:

```
cudaGetDeviceCount returned 803
```

The driver I have doesn't appear in the list at https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package at all.
---
Well, that implies that the above approach would work fine: your 460.x driver would be reported as incompatible with the compat libraries, and your tests seem to confirm that it really is incompatible.
---
Not quite: the 460.73.01 driver is also not listed for the 11.6 compat libraries, but if I test it with those, it works just fine. That just says to me that the only way you can really know is by actually compiling something with nvcc and trying to run it.

EDIT: The 460 drivers are EOL according to https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package, however I suspect they were not EOL when the 11.6 packages were released, hence the discrepancy. Historical compatibility information can probably be derived from comparing https://endoflife.date/nvidia and the release dates of the drivers.
---
I tested the 11.6 compat libs on 460.32.03 and that failed with 804 (cudaErrorCompatNotSupportedOnDevice), similar to your 803 (cudaErrorSystemDriverMismatch).

> That just says to me that the only way you can really know is by actually compiling something with nvcc and trying to run it.

I'm not sure how to do that. I want the compat module as a dependency of (some of) the CUDA modules, so when users load CUDA they get the right compat package. But if you need to load CUDA to test the compat package, you have a cyclic dependency. Hence the idea to do that test in the CUDA easyblock, because that is what you'd like to have working.
Resolved review threads (outdated) on:
- easybuild/easyconfigs/c/CUDAcompat/CUDAcompat-11.6-510.73.08.eb
- easybuild/easyconfigs/c/CUDAcompat/CUDAcompat-11.7-515.48.07.eb
Force-pushed from 5248aec to e557361
Test report by @Flamefire

Tested also on GTX1080 with driver 460.32.03: both CUDAcompat/11.6-510.73.08 & CUDAcompat/11.6-510.85.02 error out with 804: "forward compatibility was attempted on non supported HW", as expected for non-Tesla devices.

Test report by @ocaisa

---

So the previous test report is actually a good thing. It shows that the 11.6 installation works for a 460 driver and fails for 11.7, which is correct.

---

Indeed.

Test report by @ocaisa

Test report by @ocaisa
Force-pushed from e557361 to 352b627
Rebased without changes to files, to clean up the history and retrigger CI after the easyblock got merged.
ocaisa left a comment:
LGTM
---

(created using `eb --new-pr`)

Requires easybuilders/easybuild-easyblocks#2764
The download URLs actually work (similar to the CUDA runtime), but when going through the website you come across an EULA prompt. So we need an EULA acceptance check here.
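A rough sketch of what such a guard could look like (illustrative only; `accept_eula` is an assumed easyconfig parameter name, not necessarily EasyBuild's actual EULA mechanism):

```python
from easybuild.tools.build_log import EasyBuildError

def check_eula_accepted(ec):
    """Refuse to install unless the site has explicitly accepted the NVIDIA EULA.

    'accept_eula' is a hypothetical parameter used for illustration;
    'ec' is assumed to be a dict-like view of the parsed easyconfig.
    """
    if not ec.get('accept_eula', False):
        raise EasyBuildError(
            "Installing %s requires accepting the NVIDIA EULA first; "
            "see https://docs.nvidia.com/cuda/eula/index.html", ec['name'],
        )
```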