Conversation

@Flamefire commented Jul 22, 2022

(created using eb --new-pr)

Requires easybuilders/easybuild-easyblocks#2764

The download URLs actually work (similar to the CUDA runtime), but when going through the website you come across this:

By clicking the "Agree & Download" button below, you are confirming that you have read and agree to be bound by the License For Customer Use of NVIDIA Software for use of the driver. The driver will begin downloading immediately after clicking on the "Agree & Download" button below. NVIDIA recommends users update to the latest driver version. Please review NVIDIA Product Security for more information.

So we need an EULA acceptance check here.


Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusml3 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/dd82b1dea5b4f6f1bceeaa5f9a190243 for a full test report.


Test report by @Flamefire
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
SUCCESS
Build succeeded for 2 out of 2 (2 easyconfigs in total)
taurusa11 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/705bb7197f948f6e863c88df5bfb8477 for a full test report.


ocaisa commented Jul 22, 2022

Test report by @ocaisa
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
login1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, Python 3.6.8
See https://gist.github.com/c0080118658ea67d9251f072bc6edcf1 for a full test report.

@ocaisa mentioned this pull request Jul 22, 2022
ocaisa commented Jul 22, 2022

About the EULA, I suggest we err on the side of caution and require it.

@ocaisa (Member):

Should this be added as a versionsuffix? There are, for example, five 11.4 releases of the CUDAcompat libraries.

@Flamefire (Contributor Author):

When going through the official process at https://www.nvidia.com/Download/index.aspx?lang=en-us I only get one.

So I wouldn't litter the module name yet and would do so when and if the need arises. See Python: no suffix as long as we only use one/the default version.

@ocaisa (Member):

Take a look at the compat libraries from https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ and you can see all the driver releases for each minor version.

Given that you know you have a uniqueness issue, I still think it is better to include it.

@ocaisa (Member):

The driver versions do seem to map to a patch version, though, so you could include that instead... but from what I can see, the mapping from patch version to driver version can only be known by peeking at something like https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ and figuring it out.

@Flamefire (Contributor Author):

It seems that the runfile downloads are only there for the latest version, not the earlier ones. Hence on the runfile download page you only get one version.
It may be worth doing the Java approach and having one alias package for 11.6 and individual packages for the versions, with the 11.6 one pointing to the latest.
IMO we can do that later when we decide to add another EC for the same CUDA version, but that would be an enhancement, and I don't think it is worth the work right now.

@ocaisa (Member):

  • The different driver versions of the compat package do not matter for compatibility (given a fixed major&minor version)

Anything wrong?

Yes, I agree with that; the only open point is the driver version of the compat package. I suspect the version does matter if you are not on the driver branches listed in the tables at https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package, in which case whether the compat libraries will work for you is probably arbitrary.

Hence I think a CUDAcompat/11 package isn't very useful to have in EasyBuild: you won't know if it is compatible with your driver. So we need to include the minor versions in the ECs so site admins can install the EC that is compatible with the (currently) installed driver.

I think that depends on how clever the underlying easyblock is. To install/update CUDAcompat/11 (a modulerc) you'd need to (ultimately) install CUDAcompat/11.X-DRIVER_VERSION; if the easyblock actually checks compatibility for you, then this will fail gracefully with a helpful error message. I do get your point, but I think you answer it yourself below.

Of course we could stack this:

  • CUDAcompat/11 depends on the latest CUDAcompat/11.x
  • CUDAcompat/11.x depends on the latest CUDAcompat/11.x-y

But we need the "middle layer".

I think that this could work. Most people would prefer to have CUDAcompat/11 that is dynamic and updates to the latest available compat package.

For CUDAcompat/11.4 you could have 5 different nv_versions

Yes, "could". But currently there is only one EC. And since (according to the above) we will always need a CUDAcompat/11.4, it is easy to change the existing EC to have a version suffix and add an alias EC.

We always start with one EC, but it is rare that we stop there. If we already know we may need to change things, I think it is better to start off with the right path rather than change the behaviour after the fact.

So my point is that currently the version suffix is not required, and not adding it doesn't prevent us from adding it later when it is required. Also, not adding it now means having only 2 new ECs instead of 4. I really want to be able to add CUDAcompat/11.6 as a dependency of CUDA in site-specific hooks.

I think using a modulerc for CUDAcompat/11.6 is a more robust and reproducible solution. I'll test it out today.

@ocaisa (Member) commented Jul 27, 2022:

Ok, I tested using a modulerc to wrap access to the specific CUDAcompat version, and this seems to work without issue. This was with a trivial change to the EC in this PR:

name = 'CUDAcompat'
version = '11.6'
versionsuffix = "-D%s" % nv_version

and then the creation of two new (trivial) easyconfigs, CUDAcompat-11.6.eb:

easyblock = 'ModuleRC'

name = 'CUDAcompat'
version = '11.6'

homepage = 'https://docs.nvidia.com/deploy/cuda-compatibility/index.html'
description = """Using the CUDA Forward Compatibility package,
 system administrators can run applications built using a newer toolkit
 even when an older driver that does not satisfy the minimum required driver version
 is installed on the system.
 This forward compatibility allows the CUDA deployments in data centers and enterprises
 to benefit from the faster release cadence and the latest features and performance of CUDA Toolkit.
"""

toolchain = SYSTEM

dependencies = [(name, "11.6", "-D510.73.08")]

moduleclass = 'system'

and CUDAcompat-11.eb:

easyblock = 'ModuleRC'

name = 'CUDAcompat'
version = '11'

homepage = 'https://docs.nvidia.com/deploy/cuda-compatibility/index.html'
description = """Using the CUDA Forward Compatibility package,
 system administrators can run applications built using a newer toolkit
 even when an older driver that does not satisfy the minimum required driver version
 is installed on the system.
 This forward compatibility allows the CUDA deployments in data centers and enterprises
 to benefit from the faster release cadence and the latest features and performance of CUDA Toolkit.
"""

toolchain = SYSTEM

dependencies = [(name, "11.6")]

moduleclass = 'system'

(since 11.7 is included in this PR, the dep here should really be 11.7)

@Flamefire (Contributor Author):

Most people would prefer to have CUDAcompat/11 that is dynamic and updates to the latest available compat package.

Why do you think that? If someone at our cluster were to install that EC, it would fail right away, as the driver is not compatible with 11.7. So choosing 11.6 directly is the better choice (and the only option).

Anyway, I added the 11.6, 11.7 and 11 ECs as aliases so everyone can choose. I also added the version check to the easyblock. We could even omit the CUDA version from the "lowest level" ECs, as they are already versioned by the driver version. But keeping it allows the sanity check to work and lets users/admins easily see which one applies. I made them hidden, though, as you should rather load CUDAcompat/11.6 than an individual driver version of the CUDA version.

@ocaisa (Member):

Most people would prefer to have CUDAcompat/11 that is dynamic and updates to the latest available compat package.

Why do you think that? If someone at our cluster were to install that EC, it would fail right away, as the driver is not compatible with 11.7. So choosing 11.6 directly is the better choice (and the only option)

That's kind of my point, you're making the assumption that the driver installed will never be updated. That might be true in your case, but it's unlikely to be true for everyone. If the driver gets updated to an 11.7+ driver then all your modules that depend on 11.6 will start failing.

What the modulerc actually points to is a site choice. The two modulerc easyconfigs would likely have to be edited in a situation like yours anyway, based on when your driver is EOL. If the CUDAcompat easyblock were robust enough to actually check whether the compat libs work, then this would be a non-issue, since the primary installation would fail anyway (but as discussed in the easyblock PR, that is easier said than done).

In your particular case, you would choose 11 to point to 11.6; if you had a choice, obviously you would want the latest and greatest.

@Flamefire (Contributor Author):

That's kind of my point, you're making the assumption that the driver installed will never be updated. That might be true in your case, but it's unlikely to be true for everyone. If the driver gets updated to an 11.7+ driver then all your modules that depend on 11.6 will start failing.

Understood. So adding both layers as discussed earlier makes sense. Site admins now have the full choice if they want the latest 11 package, the latest 11.x package or a specific version.

Anything else missing?

@ocaisa (Member):

Can we do something with this information? Perhaps set it as a list and have the easyblock check the "true" driver version against the compatible versions so we can verify if the compat libraries will actually work?

@Flamefire (Contributor Author):

I was thinking about this but decided against it: getting the "true" driver version isn't trivial to do portably, and you may install on another machine than the one you'll run on.

@ocaisa (Member):

You can get the driver version with

nvidia-smi --query-gpu=driver_version --format=csv,noheader

which reports the same value regardless of the compat libraries used (in my testing at least).

But of course if you don't have a GPU...
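A defensive wrapper around the `nvidia-smi` query shown above could look like this (an illustrative sketch, not the actual easyblock code; the function names are made up for this example, and the `None` fallback covers the GPU-less case ocaisa mentions):

```python
import subprocess


def parse_driver_version(output):
    """Parse the output of
    'nvidia-smi --query-gpu=driver_version --format=csv,noheader'.
    With multiple GPUs one line per GPU is printed; they all report
    the same driver, so the first non-empty line is enough."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[0] if lines else None


def get_driver_version():
    """Return the installed NVIDIA driver version as a string, or None if
    nvidia-smi is unavailable or fails (e.g. on a GPU-less build node)."""
    cmd = ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
    try:
        res = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None  # nvidia-smi not installed, or it hung
    if res.returncode != 0:
        return None
    return parse_driver_version(res.stdout)
```

Returning `None` instead of raising lets a caller skip the compatibility check gracefully, mirroring the skip-if-no-nvidia-smi behaviour discussed below.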

@Flamefire (Contributor Author) commented Jul 26, 2022:

I'll add a check to the easyblock. I think the test step is a good place, as that allows skipping it.
Similar to the TensorFlow easyblock, I'll skip the check if nvidia-smi isn't there.

I'm not fully sure how the check should be done. I'd say match the major versions of the driver against the list from the ECs and check whether the driver version is at least the matched version from the EC. If not, or if no matching major version was found in the EC, it is an error. So in this case 450.36.01 and 460.0.0 would both fail the test (too low a patch level / wrong major version).
Or do I misunderstand the NVIDIA doc?
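The matching scheme sketched above could be written roughly like this (a minimal sketch of the proposed logic, not the actual easyblock code; the function name and the supported-drivers list, taken from the error message quoted later in this thread, are for illustration):

```python
def check_driver_compat(installed, supported):
    """Check an installed driver version against the supported driver
    versions listed in the EC: find the entry with the same major version
    (driver branch), then require installed >= that entry.
    Returns False if the patch level is too low or no branch matches."""
    def as_tuple(ver):
        return tuple(int(part) for part in ver.split("."))

    inst = as_tuple(installed)
    for sup in supported:
        sup_t = as_tuple(sup)
        if sup_t[0] == inst[0]:   # same driver branch (major version)
            return inst >= sup_t  # at least the listed patch level
    return False                  # no matching branch at all


# Example list, as reported for CUDAcompat 11.7 later in this thread:
SUPPORTED_11_7 = ["450.36.06", "470.57.02", "510.39.01"]
```

With this, both examples from the comment above fail as intended: 450.36.01 (patch level too low) and 460.0.0 (no matching major version).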

@ocaisa (Member):

I'm not sure on that front; it's something I've only been able to see through testing with a CUDA executable. In my case I have access to a system with 460.73.01 drivers, and all the 11.7 compat libraries fail to run CUDA executables (even though nvidia-smi reports the CUDA version just fine) with the documented error code:

cudaGetDeviceCount returned 803

The driver I have doesn't appear in the list at https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package at all.

@Flamefire (Contributor Author):

Well, that implies that the above approach would work fine: your 460.x driver would be reported as incompatible with the compat libraries, and your tests seem to confirm that it really is incompatible.

@ocaisa (Member) commented Jul 28, 2022:

Not quite: the 460.73.01 driver is also not listed for the 11.6 compat libraries, but if I test it with those, it works just fine. That just says to me that the only way you can really know is by actually compiling something with nvcc and trying to run it.

EDIT

The 460 drivers are EOL according to https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package; however, I suspect they were not EOL when the 11.6 packages were released, hence the discrepancy. Historical compatibility information can probably be derived from comparing https://endoflife.date/nvidia and the release dates of the drivers.
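The compile-and-run probe suggested above could be sketched as follows (illustrative only, not part of this PR or the easyblock; assumes `nvcc` on the PATH and working compat libraries on `LD_LIBRARY_PATH`). The probe prints the error code instead of returning it, since Unix exit codes are truncated to 8 bits and would mangle values like 803/804:

```python
import os
import subprocess

# Minimal CUDA program: report the cudaGetDeviceCount() error code on stdout.
CUDA_SRC = """
#include <cuda_runtime.h>
#include <stdio.h>
int main(void) {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    printf("%d\\n", (int)err);  /* exit codes truncate to 8 bits, so print */
    return err == cudaSuccess ? 0 : 1;
}
"""

# Error codes seen in this thread (names as given in the comments below):
COMPAT_ERRORS = {
    803: "cudaErrorSystemDriverMismatch",
    804: "cudaErrorCompatNotSupportedOnDevice (e.g. non-Tesla HW)",
}


def run_compat_check(workdir):
    """Compile and run the probe; return the reported CUDA error code
    (0 means the compat libraries work), or None if nvcc is unavailable
    or compilation fails."""
    src = os.path.join(workdir, "probe.cu")
    exe = os.path.join(workdir, "probe")
    with open(src, "w") as f:
        f.write(CUDA_SRC)
    try:
        subprocess.run(["nvcc", src, "-o", exe], check=True)
    except (OSError, subprocess.CalledProcessError):
        return None  # no nvcc, or compilation failed
    res = subprocess.run([exe], capture_output=True, text=True)
    out = res.stdout.strip()
    return int(out) if out else None
```

As discussed below, such a probe needs CUDA itself to be available, which is exactly the cyclic-dependency problem Flamefire raises.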

@Flamefire (Contributor Author):

I tested the 11.6 compat libs on 460.32.03, and that failed with 804 (cudaErrorCompatNotSupportedOnDevice), similar to your 803 (cudaErrorSystemDriverMismatch).

That just says to me the only way you can really know is by actually compiling something with nvcc and trying to run it.

I'm not sure how to do that. I want the compat module as a dependency of (some) CUDA module, so that when users load CUDA they get the right compat package. But if you need to load CUDA to test the compat package, you have a cyclic dependency. Hence the idea to do that test in the CUDA easyblock, because that is what you'd like to have working.

@easybuilders deleted a comment from boegelbot Jul 25, 2022
@boegel added the new label Aug 4, 2022
@boegel added this to the 4.x milestone Aug 4, 2022
@easybuilders deleted a comment from boegelbot Aug 4, 2022
@Flamefire force-pushed the 20220722133417_new_pr_CUDAcompat116 branch from 5248aec to e557361 on October 5, 2022

Test report by @Flamefire
SUCCESS
Build succeeded (with --ignore-test-failure) for 5 out of 5 (5 easyconfigs in total)
taurusa9 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/c4906c97af46c5606a05d573011e8d46 for a full test report.

@Flamefire:

Tested also on a GTX 1080 with driver 460.32.03: both CUDAcompat/11.6-510.73.08 and CUDAcompat/11.6-510.85.02 error out with 804: "forward compatibility was attempted on non supported HW", as expected for non-Tesla devices.
But as mentioned in #15892 (comment), we can't do much here, so use of those modules is at the discretion of site admins, who should know when this is applicable.

@easybuilders deleted a comment from boegelbot Oct 6, 2022
@easybuilders deleted a comment from boegelbot Oct 6, 2022

ocaisa commented Oct 7, 2022

Test report by @ocaisa
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
FAILED
Build succeeded for 2 out of 5 (5 easyconfigs in total)
gpunode1.int.eessi-gpu.learnhpc.eu - Linux Rocky Linux 8.5 (Green Obsidian), x86_64, AMD EPYC 7742 64-Core Processor (zen2), 1 x NVIDIA GRID V100-4C, 460.73.01, Python 3.9.13
See https://gist.github.com/39b98fb0e497c92c7e46dbafa710e1a9 for a full test report.


ocaisa commented Oct 7, 2022

So the previous test report is actually a good thing. It shows that the 11.6 installation works for a 460 driver and fails for 11.7, which is correct.

@Flamefire:

Indeed:

The installed CUDA driver 460.73.01 is not a supported branch/major version for CUDAcompat 11.7. Supported drivers: 450.36.06, 470.57.02, 510.39.01

@easybuilders deleted a comment from boegelbot Oct 7, 2022

ocaisa commented Oct 17, 2022

Test report by @ocaisa
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gpunode1.int.eessi-gpu.learnhpc.eu - Linux Rocky Linux 8.5 (Green Obsidian), x86_64, AMD EPYC 7742 64-Core Processor (zen2), 1 x NVIDIA GRID V100-4C, 460.73.01, Python 3.9.13
See https://gist.github.com/49705db80b309e87e1ab929eebe0a627 for a full test report.


ocaisa commented Oct 17, 2022

Test report by @ocaisa
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2764
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
gpunode1.int.eessi-gpu.learnhpc.eu - Linux Rocky Linux 8.5 (Green Obsidian), x86_64, AMD EPYC 7742 64-Core Processor (zen2), 1 x NVIDIA GRID V100-4C, 460.73.01, Python 3.9.13
See https://gist.github.com/0f1921ed6bbd469c529c0d4ca40fa113 for a full test report.

@easybuilders deleted a comment from boegelbot Oct 17, 2022
@Flamefire force-pushed the 20220722133417_new_pr_CUDAcompat116 branch from e557361 to 352b627 on October 17, 2022
@Flamefire:

Rebased without changes to files to clean up the history and retrigger CI after the easyblock got merged.

@ocaisa (Member) left a comment:

LGTM

@ocaisa merged commit 3e9f502 into easybuilders:develop Oct 17, 2022
@Flamefire deleted the 20220722133417_new_pr_CUDAcompat116 branch October 18, 2022
@boegel modified the milestones: 4.x, next release (4.6.2?) Oct 19, 2022