-
Notifications
You must be signed in to change notification settings - Fork 72
{2023.06}[foss/2023a] cuDNN 8.9.2.26 w/ CUDA 12.1.1 (part 1) #772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
4c5c5a3
c5c7dcc
da7c1e4
454d2bb
6824d75
0e92d8d
044e168
f983fed
6e95efd
21916fd
bf26846
e1ba74f
1f1eada
fc00d0c
e968608
9c6c3a4
97d5b67
8e7a0e8
57f5a48
21ffc18
a3edc20
7f601dc
e3101b0
e9018fb
54753c3
2b76a54
086ba5a
ecd30ca
18cdaa2
5716424
9f3853c
b9017d4
eea2879
33199d7
06cd2ea
a27683e
d5572ea
b68fdfa
0e7c9d8
d57f8d8
02d3e1e
b28017a
20aacdc
88bdb88
7b8ba8b
a95d546
45fa6b1
c3482b2
db90ca7
6a0223c
affe37b
d2d95e9
77f3bc9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| easyconfigs: | ||
| - cuDNN-8.9.2.26-CUDA-12.1.1.eb | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just for completeness' sake, I would also list It's already installed, so won't make a difference w.r.t. produced installation, but it looks a bit strange not having it listed explicitly in an I was also wondering whether we need both this easystack file and the one under I understand we do because the latter gets shipped with EESSI, but I'm wondering if we should symlink them or something, since they should stay in sync, no?
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Keeping them in sync would be good. Not sure it works with a symlink though. If we keep the file under If we symlink in the other direction, will our
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Made an attempt to arrange for that (only having one file ... or better not duplicates with the same content) in 8e7a0e8
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. After todays GPU tiger meeting, we agreed to separate these after all. The list of what is needed for |
||
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -756,64 +756,141 @@ def post_postproc_cuda(self, *args, **kwargs): | |||||||||||||
| if 'libcudart' not in allowlist: | ||||||||||||||
| raise EasyBuildError("Did not find 'libcudart' in allowlist: %s" % allowlist) | ||||||||||||||
|
|
||||||||||||||
| # iterate over all files in the CUDA installation directory | ||||||||||||||
| for dir_path, _, files in os.walk(self.installdir): | ||||||||||||||
| for filename in files: | ||||||||||||||
| full_path = os.path.join(dir_path, filename) | ||||||||||||||
| # we only really care about real files, i.e. not symlinks | ||||||||||||||
| if not os.path.islink(full_path): | ||||||||||||||
| # check if the current file name stub is part of the allowlist | ||||||||||||||
| basename = filename.split('.')[0] | ||||||||||||||
| if basename in allowlist: | ||||||||||||||
| self.log.debug("%s is found in allowlist, so keeping it: %s", basename, full_path) | ||||||||||||||
| else: | ||||||||||||||
| self.log.debug("%s is not found in allowlist, so replacing it with symlink: %s", | ||||||||||||||
| basename, full_path) | ||||||||||||||
| # if it is not in the allowlist, delete the file and create a symlink to host_injections | ||||||||||||||
|
|
||||||||||||||
| # the host_injections path is under a fixed repo/location for CUDA | ||||||||||||||
| host_inj_path = re.sub(EESSI_INSTALLATION_REGEX, HOST_INJECTIONS_LOCATION, full_path) | ||||||||||||||
| # CUDA itself doesn't care about compute capability so remove this duplication from | ||||||||||||||
| # under host_injections (symlink to a single CUDA installation for all compute | ||||||||||||||
| # capabilities) | ||||||||||||||
| accel_subdir = os.getenv("EESSI_ACCELERATOR_TARGET") | ||||||||||||||
| if accel_subdir: | ||||||||||||||
| host_inj_path = host_inj_path.replace("/accel/%s" % accel_subdir, '') | ||||||||||||||
| # make sure source and target of symlink are not the same | ||||||||||||||
| if full_path == host_inj_path: | ||||||||||||||
| raise EasyBuildError("Source (%s) and target (%s) are the same location, are you sure you " | ||||||||||||||
| "are using this hook for an EESSI installation?", | ||||||||||||||
| full_path, host_inj_path) | ||||||||||||||
| remove_file(full_path) | ||||||||||||||
| symlink(host_inj_path, full_path) | ||||||||||||||
| # replace files that are not distributable with symlinks into | ||||||||||||||
| # host_injections | ||||||||||||||
| replace_non_distributable_files_with_symlinks(self.log, self.installdir, self.name, allowlist) | ||||||||||||||
| else: | ||||||||||||||
| raise EasyBuildError("CUDA-specific hook triggered for non-CUDA easyconfig?!") | ||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||
| def post_postproc_cudnn(self, *args, **kwargs): | ||||||||||||||
| """ | ||||||||||||||
| Remove files from cuDNN installation that we are not allowed to ship, | ||||||||||||||
| and replace them with a symlink to a corresponding installation under host_injections. | ||||||||||||||
| """ | ||||||||||||||
|
|
||||||||||||||
| # We need to check if we are doing an EESSI-distributed installation | ||||||||||||||
| eessi_installation = bool(re.search(EESSI_INSTALLATION_REGEX, self.installdir)) | ||||||||||||||
|
|
||||||||||||||
| if self.name == 'cuDNN' and eessi_installation: | ||||||||||||||
| print_msg("Replacing files in cuDNN installation that we can not ship with symlinks to host_injections...") | ||||||||||||||
|
|
||||||||||||||
| allowlist = ['LICENSE'] | ||||||||||||||
|
|
||||||||||||||
| # read cuDNN LICENSE, construct allowlist based on section "2. Distribution" that specifies list of files that can be shipped | ||||||||||||||
| license_path = os.path.join(self.installdir, 'LICENSE') | ||||||||||||||
| search_string = "2. Distribution. The following portions of the SDK are distributable under the Agreement:" | ||||||||||||||
casparvl marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||
| with open(license_path) as infile: | ||||||||||||||
| for line in infile: | ||||||||||||||
| if line.strip().startswith(search_string): | ||||||||||||||
| # remove search string, split into words, remove trailing | ||||||||||||||
| # dots '.' and only retain words starting with a dot '.' | ||||||||||||||
| distributable = line[len(search_string):] | ||||||||||||||
| for word in distributable.split(): | ||||||||||||||
| if word[0] == '.': | ||||||||||||||
| allowlist.append(word.rstrip('.')) | ||||||||||||||
casparvl marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||
|
|
||||||||||||||
| allowlist = sorted(set(allowlist)) | ||||||||||||||
| self.log.info("Allowlist for files in cuDNN installation that can be redistributed: " + ', '.join(allowlist)) | ||||||||||||||
|
|
||||||||||||||
| # replace files that are not distributable with symlinks into | ||||||||||||||
| # host_injections | ||||||||||||||
| replace_non_distributable_files_with_symlinks(self.log, self.installdir, self.name, allowlist) | ||||||||||||||
| else: | ||||||||||||||
| raise EasyBuildError("cuDNN-specific hook triggered for non-cuDNN easyconfig?!") | ||||||||||||||
|
|
||||||||||||||
|
|
||||||||||||||
| def replace_non_distributable_files_with_symlinks(log, install_dir, package, allowlist): | ||||||||||||||
|
||||||||||||||
| """ | ||||||||||||||
| Replace files that cannot be distributed with symlinks into host_injections | ||||||||||||||
| """ | ||||||||||||||
| extension_based = { "CUDA": False, "cuDNN": True } | ||||||||||||||
|
||||||||||||||
| extension_based = { "CUDA": False, "cuDNN": True } | |
| extension_based = { | |
| "CUDA": False, | |
| "cuDNN": True, | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 044e168
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| if extension_based[package]: | |
| if '.' in filename: | |
| extension = '.' + filename.split('.')[1] | |
| if extension_based[package] and '.' in filename: | |
| extension = '.' + filename.split('.')[1] |
I'm not sure what this does, so probably deserve a comment above, perhaps with an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed in f983fed
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're using the same condition twice, so we should introduce a local variable for this, something like:
| elif extension_based[package] and '.' in filename and extension in allowlist: | |
| check_by_extension = extension_based[package] and '.' in filename | |
| if check_by_extension: | |
| extension = '.' + filename.split('.')[1] | |
| ... | |
| elif check_by_extension and extension in allowlist: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implemented in 6e95efd
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code golf:
| if extension_based[package]: | |
| print_name = filename | |
| else: | |
| print_name = basename | |
| print_name = filename if extension_based[package] else basename |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implemented in 21916fd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing, since filename will never be explicitly in the allowlist (only extensions will be)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The allowlist is created in the post_postproc_{cuda,cudnn} function. For CUDA it contains 'EULA' (note without suffix '.txt'), 'README' and a list of file name "stubs" (only the first component when the full file names are split at '.'). For cuDNN, it contains 'LICENSE', '.so', ...
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # CUDA and cuDNN itself don't care about compute capability so remove this duplication from | |
| # CUDA and cu* libraries themselves don't care about compute capability so remove this duplication from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed in bf26846
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Add 'gpu' property EESSI<PACKAGE>VERSION envvars and drop dependencies to | |
| Add 'gpu' property + EESSI<PACKAGE>VERSION envvars, and drop dependencies to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
improved in e968608
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a constant, which is also used in replace_non_distributable_files_with_symlinks to check whether the mapping in extension_based is complete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be a constant. Not sure if it "should" be?
If we would want to use that to also check if the mapping in extension_based is complete, we would assume that we call this function (replace...) for all (CUDA-dependent) packages we run inject_gpu_property for?
Not sure if either adds much value.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pkg_versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done in 57f5a48
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| for package in packages_list: | |
| for pkg_name in pkg_names: |
(it's a tuple, not a list, so a bit misleading otherwise, and no need to get very wordy with the loop variable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
List of packages stored in a tuple 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 57f5a48
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using iter seems overkill to me, also below, I don't see the point in doing that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it's because we're iterating over a list that we're modifying with .remove.
Maybe we should use deps = ec_dict['dependencies'][:] instead (make a copy), and loop over deps (while still doing the .remove on ec_dict['dependencies'], and add a comment to clarify why that's done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 57f5a48
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this should be ==, not in, we're comparing software names here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed in 57f5a48
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values is a bit confusing here, maybe extra_mod_footer_lines is better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in 57f5a48
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we actually injecting these?
Is that just so we can tell which CUDA or cuDNN we're using (since it's a build dep no module will be loaded for these)?
Should we use EESSI_<pkgname>_VERSION to make this a bit more readable for humans?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why it was added for CUDA. For cuDNN, I just do the same.
Maybe it was inspired by some EBVERSIONCUDA envvar name? If we change the names we would have to rebuild some module files. See list below
$ grep VERSION /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all/*/*.lua | sed -e 's/.*all//'
/CUDA/12.1.1.lua:setenv("EBVERSIONCUDA", "12.1.1")
/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1.lua:setenv("EBVERSIONCUDAMINSAMPLES", "12.1")
/CUDA-Samples/12.1-GCC-12.3.0-CUDA-12.1.1.lua:setenv("EESSICUDAVERSION","12.1.1")
/ESPResSo/4.2.2-foss-2023a-CUDA-12.1.1.lua:setenv("EBVERSIONESPRESSO", "4.2.2")
/ESPResSo/4.2.2-foss-2023a-CUDA-12.1.1.lua:setenv("EESSICUDAVERSION","12.1.1")
/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1.lua:setenv("EBVERSIONLAMMPS", "2Aug2023_update2")
/LAMMPS/2Aug2023_update2-foss-2023a-kokkos-CUDA-12.1.1.lua:setenv("EESSICUDAVERSION","12.1.1")
/NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1.lua:setenv("EBVERSIONNCCL", "2.18.3")
/NCCL/2.18.3-GCCcore-12.3.0-CUDA-12.1.1.lua:setenv("EESSICUDAVERSION","12.1.1")
/OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1.lua:setenv("EBVERSIONOSUMINMICROMINBENCHMARKS", "7.2")
/OSU-Micro-Benchmarks/7.2-gompi-2023a-CUDA-12.1.1.lua:setenv("EESSICUDAVERSION","1.14.1")
/UCC-CUDA/1.2.0-GCCcore-12.3.0-CUDA-12.1.1.lua:setenv("EBVERSIONUCCMINCUDA", "1.2.0")
/UCC-CUDA/1.2.0-GCCcore-12.3.0-CUDA-12.1.1.lua:setenv("EESSICUDAVERSION","12.1.1")
/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua:setenv("EBVERSIONUCXMINCUDA", "1.14.1")
/UCX-CUDA/1.14.1-GCCcore-12.3.0-CUDA-12.1.1.lua:setenv("EESSICUDAVERSION","12.1.1")
If we want to change that, we should probably do the change in a separate PR before adding cuDNN.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
positive logic is easier to follow:
# take into account that modluafooter may already be set
if key in ec_dict:
...
else:
...and maybe we should rename key to modluafooter to avoid confusion (or mod_footer if you want to plan ahead to a time where we may also have Tcl modules, but I doubt that'll happen)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in 57f5a48
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if values is renamed to extra_mod_footer_lines and you iterate here with for line in extra_mod_footer_lines, then new_value can just be value :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed in 57f5a48
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having a hit here seems unlikely to me, but ok, doesn't hurt I guess :)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -122,7 +122,10 @@ copy_files_by_list ${TOPDIR}/scripts ${INSTALL_PREFIX}/scripts "${script_files[@ | |
|
|
||
| # Copy files for the scripts/gpu_support/nvidia directory | ||
| nvidia_files=( | ||
| install_cuda_host_injections.sh link_nvidia_host_libraries.sh | ||
| eessi-2023.06-cuda-and-libraries.yml | ||
| install_cuda_and_libraries.sh | ||
| install_cuda_host_injections.sh | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't we deprecate this script in favor of We should definitely update the documentation at https://www.eessi.io/docs/gpu if we're going forward with
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We could/should do that, but maybe in a separate PR that coordinates the change in the docs?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree, deprecation can be done later. First, this has to be deployed. Then, we can put it in the docs, and people can start using it. Only then should we deprecate the old method.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. follow-up on this via #789 |
||
| link_nvidia_host_libraries.sh | ||
| ) | ||
| copy_files_by_list ${TOPDIR}/scripts/gpu_support/nvidia ${INSTALL_PREFIX}/scripts/gpu_support/nvidia "${nvidia_files[@]}" | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| easyconfigs: | ||
| - CUDA-12.1.1.eb | ||
| - cuDNN-8.9.2.26-CUDA-12.1.1.eb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trz42 This doesn't look like an ideal path to use, even if it's temporary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reusing
TMPDIRvia 97d5b67