Conversation

@Flamefire commented Jul 22, 2022

(created using eb --new-pr)

Requires easybuilders/easybuild-easyblocks#2764

The download URLs actually work (similar to the CUDA runtime), but when going through the website you come across this:

By clicking the "Agree & Download" button below, you are confirming that you have read and agree to be bound by the License For Customer Use of NVIDIA Software for use of the driver. The driver will begin downloading immediately after clicking on the "Agree & Download" button below. NVIDIA recommends users update to the latest driver version. Please review NVIDIA Product Security for more information.

So we need an EULA acceptance check here.


Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml3 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/dd82b1dea5b4f6f1bceeaa5f9a190243 for a full test report.


Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusa11 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/705bb7197f948f6e863c88df5bfb8477 for a full test report.


ocaisa commented Jul 22, 2022

Test report by @ocaisa
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
login1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, Python 3.6.8
See https://gist.github.com/c0080118658ea67d9251f072bc6edcf1 for a full test report.

@ocaisa mentioned this pull request Jul 22, 2022
ocaisa commented Jul 22, 2022

About the EULA, I suggest we err on the side of caution and require it.

@ocaisa (Member):

Should this be added as a versionsuffix? There are, for example, five 11.4 releases of the CUDAcompat libraries.

@Flamefire (Contributor Author):

When going through the official process at https://www.nvidia.com/Download/index.aspx?lang=en-us I only get one.

So I wouldn't litter the module name yet and would do so when and if the need arises. See Python: no suffix as long as we only use one/the default version.

@ocaisa (Member):

Take a look at the compat libraries from https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ and you can see all the driver releases for each minor version.

Given that you know you have a uniqueness issue, I still think it is better to include it.

@ocaisa (Member):

The driver versions do seem to map to a patch version, though, so you could include that instead... but from what I can see, the mapping from patch version to driver version can only be known by peeking at something like https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ and figuring it out.

@Flamefire (Contributor Author):

It seems that the runfile downloads are only there for the latest version, not the earlier ones. Hence on the runfile download page you only get one version.
It may be worth doing the Java approach and having one alias package for 11.6 and individual packages for the versions, with the 11.6 one pointing to the latest.
IMO we can do that later when we decide to add another EC for the same CUDA version, but that would be an enhancement, and I don't think it is worth the work right now.

@ocaisa (Member):

  • The different driver versions of the compat package do not matter for compatibility (given a fixed major&minor version)

Anything wrong?

Yes, I agree with that; the only open point is the driver version of the compat package. I suspect the version does matter if you are not on the driver branches listed in the tables at https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package, in which case whether the compat libraries will work for you is probably arbitrary.

Hence I think a CUDAcompat/11 package isn't very useful to have in EasyBuild: you won't know if it is compatible with your driver. So we need to include the minor versions in the ECs so site admins can install the EC that is compatible with the (currently) installed driver.

I think that depends on how clever the underlying easyblock is. To install/update CUDAcompat/11 (a modulerc) you'd need to (ultimately) install CUDAcompat/11.X-DRIVER_VERSION; if the easyblock actually checks compatibility for you, then this will fail gracefully with a helpful error message. I do get your point, but I think you answer it yourself below.

Of course we could stack this:

  • CUDAcompat/11 depends on the latest CUDAcompat/11.x
  • CUDAcompat/11.x depends on the latest CUDAcompat/11.x-y

But we need the "middle layer".

I think that this could work. Most people would prefer to have CUDAcompat/11 that is dynamic and updates to the latest available compat package.

For CUDAcompat/11.4 you could have 5 different nv_versions

Yes, "could". But currently there is only one EC. And since (according to the above) we will always need a CUDAcompat/11.4, it is easy to change the existing EC to have a version suffix and add an alias EC.

We always start with one EC, but it is rare that we stop there. If we already know we may need to change things, I think it is better to start off with the right path rather than change the behaviour after the fact.

So my point is that currently the version suffix is not required, and not adding it doesn't prevent us from adding it later when it is required. Also, not adding it now means having only 2 new ECs instead of 4. I really want to be able to add CUDAcompat/11.6 as a dependency of CUDA in site-specific hooks.

I think using a modulerc for CUDAcompat/11.6 is a more robust and reproducible solution. I'll test it out today.

@ocaisa (Member) commented Jul 27, 2022:

Ok, I tested using a modulerc to wrap access to the specific CUDAcompat version, and this seems to work without issue. This was with a trivial change to the EC in this PR:

name = 'CUDAcompat'
version = '11.6'
versionsuffix = "-D%s" % nv_version

and then the creation of two new (trivial) easyconfigs, CUDAcompat-11.6.eb:

easyblock = 'ModuleRC'

name = 'CUDAcompat'
version = '11.6'

homepage = 'https://docs.nvidia.com/deploy/cuda-compatibility/index.html'
description = """Using the CUDA Forward Compatibility package,
 system administrators can run applications built using a newer toolkit
 even when an older driver that does not satisfy the minimum required driver version
 is installed on the system.
 This forward compatibility allows the CUDA deployments in data centers and enterprises
 to benefit from the faster release cadence and the latest features and performance of CUDA Toolkit.
"""

toolchain = SYSTEM

dependencies = [(name, "11.6", "-D510.73.08")]

moduleclass = 'system'

and CUDAcompat-11.eb:

easyblock = 'ModuleRC'

name = 'CUDAcompat'
version = '11'

homepage = 'https://docs.nvidia.com/deploy/cuda-compatibility/index.html'
description = """Using the CUDA Forward Compatibility package,
 system administrators can run applications built using a newer toolkit
 even when an older driver that does not satisfy the minimum required driver version
 is installed on the system.
 This forward compatibility allows the CUDA deployments in data centers and enterprises
 to benefit from the faster release cadence and the latest features and performance of CUDA Toolkit.
"""

toolchain = SYSTEM

dependencies = [(name, "11.6")]

moduleclass = 'system'

(since 11.7 is included in this PR, the dep here should really be 11.7)

@Flamefire (Contributor Author):

Most people would prefer to have CUDAcompat/11 that is dynamic and updates to the latest available compat package.

Why do you think that? If someone at our cluster were to install that EC, it would fail right away, as the driver is not compatible with 11.7. So choosing 11.6 directly is the better choice (and the only option).

Anyway, I added the 11.6, 11.7 and 11 ECs as aliases so everyone can choose. I also added the version check to the easyblock. We could even omit the CUDA version from the "lowest level" ECs, as they are already versioned by the driver version. But keeping it allows the sanity check to work and lets users/admins easily see which one applies. I made them hidden, though, as you should rather load CUDAcompat/11.6 than an individual driver version of the CUDA version.

@ocaisa (Member):

Most people would prefer to have CUDAcompat/11 that is dynamic and updates to the latest available compat package.

Why do you think that? If someone at our cluster were to install that EC, it would fail right away, as the driver is not compatible with 11.7. So choosing 11.6 directly is the better choice (and the only option)

That's kind of my point, you're making the assumption that the driver installed will never be updated. That might be true in your case, but it's unlikely to be true for everyone. If the driver gets updated to an 11.7+ driver then all your modules that depend on 11.6 will start failing.

What the modulerc actually points to is a site choice. The two modulerc easyconfigs would likely have to be edited in a situation like yours anyway, based on when your driver is EOL. If the CUDAcompat easyblock were robust enough to actually check whether the compat libs work, then this would be a non-issue, since the primary installation would fail anyway (but as discussed in the easyblock PR, that is easier said than done).

In your particular case, you would choose 11 to point to 11.6; if you had a choice, obviously you would want the latest and greatest.

@Flamefire (Contributor Author):

That's kind of my point, you're making the assumption that the driver installed will never be updated. That might be true in your case, but it's unlikely to be true for everyone. If the driver gets updated to an 11.7+ driver then all your modules that depend on 11.6 will start failing.

Understood. So adding both layers as discussed earlier makes sense. Site admins now have the full choice if they want the latest 11 package, the latest 11.x package or a specific version.

Anything else missing?

@ocaisa (Member):

Can we do something with this information? Perhaps set it as a list and have the easyblock check the "true" driver version against the compatible versions so we can verify if the compat libraries will actually work?

@Flamefire (Contributor Author):

I was thinking about this but decided against it: getting the "true" driver version isn't trivial to do portably, and you may install on another machine than the one you'll run on.

@ocaisa (Member):

You can get the driver version with

nvidia-smi --query-gpu=driver_version --format=csv,noheader

which reports the same value regardless of the compat libraries used (in my testing at least).

But of course if you don't have a GPU...
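A defensive wrapper around the `nvidia-smi` query shown above could look like this (an illustrative sketch, not the actual easyblock code; the function names are made up for this example, and the `None` fallback covers the GPU-less case ocaisa mentions):

```python
import subprocess


def parse_driver_version(output):
    """Parse the output of
    'nvidia-smi --query-gpu=driver_version --format=csv,noheader'.
    With multiple GPUs one line per GPU is printed; they all report
    the same driver, so the first non-empty line is enough."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[0] if lines else None


def get_driver_version():
    """Return the installed NVIDIA driver version as a string, or None if
    nvidia-smi is unavailable or fails (e.g. on a GPU-less build node)."""
    cmd = ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    try:
        res = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None  # nvidia-smi not installed, or it hung
    if res.returncode != 0:
        return None
    return parse_driver_version(res.stdout)
```

Returning `None` instead of raising lets a caller skip the compatibility check gracefully, mirroring the skip-if-no-nvidia-smi behaviour discussed below.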

@Flamefire (Contributor Author) commented Jul 26, 2022:

I'll add a check to the easyblock. I think the test step is a good place, as that allows skipping it.
Similar to the TensorFlow easyblock, I'll skip the check if nvidia-smi isn't there.

I'm not fully sure how the check should be done. I'd say match the major versions of the driver against the list from the ECs and check whether the driver version is at least the matched version from the EC. If not, or if no matching major version was found in the EC, it is an error. So in this case 450.36.01 and 460.0.0 would both fail the test (too low a patch level / wrong major version).
Or do I misunderstand the NVIDIA doc?
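The matching scheme sketched above could be written roughly like this (a minimal sketch of the proposed logic, not the actual easyblock code; the function name and the supported-drivers list, taken from the error message quoted later in this thread, are for illustration):

```python
def check_driver_compat(installed, supported):
    """Check an installed driver version against the supported driver
    versions listed in the EC: find the entry with the same major version
    (driver branch), then require installed >= that entry.
    Returns False if the patch level is too low or no branch matches."""
    def as_tuple(ver):
        return tuple(int(part) for part in ver.split("."))

    inst = as_tuple(installed)
    for sup in supported:
        sup_t = as_tuple(sup)
        if sup_t[0] == inst[0]:   # same driver branch (major version)
            return inst >= sup_t  # at least the listed patch level
    return False                  # no matching branch at all


# Example list, as reported for CUDAcompat 11.7 later in this thread:
SUPPORTED_11_7 = ["450.36.06", "470.57.02", "510.39.01"]
```

With this, both examples from the comment above fail as intended: 450.36.01 (patch level too low) and 460.0.0 (no matching major version).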

@ocaisa (Member):

I'm not sure on that front; it's something I've only been able to see through testing with a CUDA executable. In my case I have access to a system with 460.73.01 drivers, and all the 11.7 compat libraries fail to run CUDA executables (even though nvidia-smi reports the CUDA version just fine) with the documented error code:

cudaGetDeviceCount returned 803

The driver I have doesn't appear in the list at https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package at all.

@Flamefire (Contributor Author):

Well, that implies that the above approach would work fine: your 460.x driver would be reported as incompatible with the compat libraries, and your tests seem to confirm that it really is incompatible.

@ocaisa (Member) commented Jul 28, 2022:

Not quite: the 460.73.01 driver is also not listed for the 11.6 compat libraries, but if I test it with those, it works just fine. That just says to me that the only way you can really know is by actually compiling something with nvcc and trying to run it.

EDIT

The 460 drivers are EOL according to https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package; however, I suspect they were not EOL when the 11.6 packages were released, hence the discrepancy. Historical compatibility information can probably be derived from comparing https://endoflife.date/nvidia and the release dates of the drivers.
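The compile-and-run probe suggested above could be sketched as follows (illustrative only, not part of this PR or the easyblock; assumes `nvcc` on the PATH and working compat libraries on `LD_LIBRARY_PATH`). The probe prints the error code instead of returning it, since Unix exit codes are truncated to 8 bits and would mangle values like 803/804:

```python
import os
import subprocess

# Minimal CUDA program: report the cudaGetDeviceCount() error code on stdout.
CUDA_SRC = """
#include <cuda_runtime.h>
#include <stdio.h>
int main(void) {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    printf("%d\\n", (int)err);  /* exit codes truncate to 8 bits, so print */
    return err == cudaSuccess ? 0 : 1;
}
"""

# Error codes seen in this thread (names as given in the comments below):
COMPAT_ERRORS = {
    803: "cudaErrorSystemDriverMismatch",
    804: "cudaErrorCompatNotSupportedOnDevice (e.g. non-Tesla HW)",
}


def run_compat_check(workdir):
    """Compile and run the probe; return the reported CUDA error code
    (0 means the compat libraries work), or None if nvcc is unavailable
    or compilation fails."""
    src = os.path.join(workdir, "probe.cu")
    exe = os.path.join(workdir, "probe")
    with open(src, "w") as f:
        f.write(CUDA_SRC)
    try:
        subprocess.run(["nvcc", src, "-o", exe], check=True)
    except (OSError, subprocess.CalledProcessError):
        return None  # no nvcc, or compilation failed
    res = subprocess.run([exe], capture_output=True, text=True)
    out = res.stdout.strip()
    return int(out) if out else None
```

As discussed below, such a probe needs CUDA itself to be available, which is exactly the cyclic-dependency problem Flamefire raises.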

@Flamefire (Contributor Author):

I tested the 11.6 compat libs on 460.32.03, and that failed with 804 (cudaErrorCompatNotSupportedOnDevice), similar to your 803 (cudaErrorSystemDriverMismatch).

That just says to me the only way you can really know is by actually compiling something with nvcc and trying to run it.

I'm not sure how to do that. I want the compat module as a dependency of (some) CUDA module, so that when users load CUDA they get the right compat package. But if you need to load CUDA to test the compat package, you have a cyclic dependency. Hence the idea to do that test in the CUDA easyblock, because that is what you'd like to have working.

@easybuilders deleted a comment from boegelbot Jul 25, 2022
@boegel added the new label Aug 4, 2022
@boegel added this to the 4.x milestone Aug 4, 2022
@easybuilders deleted a comment from boegelbot Aug 4, 2022
@Flamefire force-pushed the 20220722133417_new_pr_CUDAcompat116 branch from 5248aec to e557361 on October 5, 2022

Test report by @Flamefire
SUCCESS
Build succeeded (with --ignore-test-failure) for 5 out of 5 (5 easyconfigs in total)
taurusa9 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/c4906c97af46c5606a05d573011e8d46 for a full test report.

@Flamefire:

Tested also on a GTX 1080 with driver 460.32.03: both CUDAcompat/11.6-510.73.08 and CUDAcompat/11.6-510.85.02 error out with 804: "forward compatibility was attempted on non supported HW", as expected for non-Tesla devices.
But as mentioned in #15892 (comment), we can't do much here, so use of those modules is at the discretion of site admins, who should know when this is applicable.

@easybuilders deleted a comment from boegelbot Oct 6, 2022
@easybuilders deleted a comment from boegelbot Oct 6, 2022

ocaisa commented Oct 7, 2022

Test report by @ocaisa
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
FAILED
Build succeeded for 2 out of 5 (5 easyconfigs in total)
gpunode1.int.eessi-gpu.learnhpc.eu - Linux Rocky Linux 8.5 (Green Obsidian), x86_64, AMD EPYC 7742 64-Core Processor (zen2), 1 x NVIDIA GRID V100-4C, 460.73.01, Python 3.9.13
See https://gist.github.com/39b98fb0e497c92c7e46dbafa710e1a9 for a full test report.


ocaisa commented Oct 7, 2022

So the previous test report is actually a good thing. It shows that the 11.6 installation works for a 460 driver and fails for 11.7, which is correct.

@Flamefire:

Indeed:

The installed CUDA driver 460.73.01 is not a supported branch/major version for CUDAcompat 11.7. Supported drivers: 450.36.06, 470.57.02, 510.39.01

@easybuilders deleted a comment from boegelbot Oct 7, 2022

ocaisa commented Oct 17, 2022

Test report by @ocaisa
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gpunode1.int.eessi-gpu.learnhpc.eu - Linux Rocky Linux 8.5 (Green Obsidian), x86_64, AMD EPYC 7742 64-Core Processor (zen2), 1 x NVIDIA GRID V100-4C, 460.73.01, Python 3.9.13
See https://gist.github.com/49705db80b309e87e1ab929eebe0a627 for a full test report.


ocaisa commented Oct 17, 2022

Test report by @ocaisa
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gpunode1.int.eessi-gpu.learnhpc.eu - Linux Rocky Linux 8.5 (Green Obsidian), x86_64, AMD EPYC 7742 64-Core Processor (zen2), 1 x NVIDIA GRID V100-4C, 460.73.01, Python 3.9.13
See https://gist.github.com/0f1921ed6bbd469c529c0d4ca40fa113 for a full test report.

@easybuilders deleted a comment from boegelbot Oct 17, 2022
@Flamefire force-pushed the 20220722133417_new_pr_CUDAcompat116 branch from e557361 to 352b627 on October 17, 2022
@Flamefire:

Rebased without changes to files to clean up the history and retrigger CI after the easyblock got merged.

@ocaisa (Member) left a comment:

LGTM

@ocaisa merged commit 3e9f502 into easybuilders:develop Oct 17, 2022
@Flamefire deleted the 20220722133417_new_pr_CUDAcompat116 branch October 18, 2022
@boegel modified the milestones: 4.x, next release (4.6.2?) Oct 19, 2022