Skip to content

Conversation

@AndiH
Copy link
Contributor

@AndiH AndiH commented Sep 28, 2020

PGI beyond 20.4 is not called PGI any more but NVIDIA HPC SDK (NVHPC), with 20.7 being the first version; https://developer.nvidia.com/hpc-sdk.

This pull request is part of a set of three pull requests to introduce NVHPC to EasyBuild (1, 2, 3). The stack has been successfully deployed at JSC and is under use by our HPC users.
The EasyBuild files are created after discussions with the NVHPC product owners, taking into account HPC practices and EasyBuild conventions.

This request is for pulling the NVHPC EasyBlock. It is based on the PGI EasyBlock but cleaned-up and extended for NVHPC.

Additional options are included; for example default_cuda_version, compute_capability and several module_* options. The former two are passed to the NVHPC's install script which configures NVHPC for usage with a certain CUDA or device Compute Capability.
The module_* options refer to options being passed to the LMod module where additional directories beyond the default directories are included into the environment. As an example, nvhpc_nvshmem_basedir adds the directory of NVHPC in which NVSHMEM is included into CPATH and LD_LIBRARY_PATH with their corresponding sub-directories. The options exist to enable HPC centres to rather provide their own installations of the packaged libraries.

This exposes options suggested by NVHPC's install scripts and the created modulefiles
Copy link
Member

@boegel boegel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also take care of the code style suggestions.

Don't hesitate to reach out if you need any help!

@AndiH
Copy link
Contributor Author

AndiH commented Sep 29, 2020

(Snapshot at end of work day today. Will continue after additional input. Note: Not tested on system.)

Default is now None, to be overwritten either via the easyconfig or the command line (--try-amend=default_cuda_version='11.0'). If it is not overwritten manually, it is tried to be guessed by requesting the version of a (potential) CUDA module. If that also fails, EB is error'd out with a descriptive message.
@AndiH
Copy link
Contributor Author

AndiH commented Sep 30, 2020

Some more work done. @boegel can you have a look at the open questions?

@hound hound bot deleted a comment from AndiH Oct 13, 2020
@hound hound bot deleted a comment from AndiH Oct 13, 2020
@hound hound bot deleted a comment from boegel Oct 13, 2020
@hound hound bot deleted a comment from boegel Oct 13, 2020
@hound hound bot deleted a comment from boegel Oct 13, 2020
@hound hound bot deleted a comment from boegel Oct 13, 2020
@hound hound bot deleted a comment from boegel Oct 13, 2020
@hound hound bot deleted a comment from AndiH Oct 13, 2020
@hound hound bot deleted a comment from AndiH Oct 13, 2020
@hound hound bot deleted a comment from AndiH Oct 13, 2020
@boegel boegel changed the title Add NVHPC EasyBlock (aka PGI) add custom easyblock for NVHPC (aka PGI) Oct 13, 2020
after=default_compute_capability
)
)
default_compute_capability.replace(".", "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assignment is missing here, this can't work? .replace returns a new string value...

Also, I hate to be pedantic, but you're now just assuming you have a string value in default_compute_capability, which may not be the case (and if it's not, the crash will be pretty ugly).

So I'd add a small check:

if isinstance(default_compute_capability, str):
    default_compute_capability = default_compute_capability.replace('.', '')
else:
    raise EasyBuildError("Unexpected non-string value encountered for compute capability: %s",
                         default_compute_capability)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in 2c91709

'dirs': [os.path.join(prefix, 'compilers', 'bin'), os.path.join(prefix, 'compilers', 'lib'),
os.path.join(prefix, 'compilers', 'include'), os.path.join(prefix, 'compilers', 'man')]
}
custom_commands = ["%s -v" % custom_paths["files"][0]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would check all commands, and just use the command name (not the path):

compiler_cmds = ['nvc', 'nvc++', 'nvfortran']
custom_paths = {
    'files': [os.path.join(prefix, 'compilers', 'bin', x) for x in compiler_cmds + ['siterc']],
...
custom_commands = ["%s -v" % x for x in compiler_cmds]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in c54b719.

@boegel
Copy link
Member

boegel commented Oct 14, 2020

@AndiH Please also revisit easybuilders/easybuild-easyconfigs#11391, even though the tests can't pass there as long as this easyblock isn't merged, since that's the test case for this easyblock (we'll merge both in quick succession).

We're getting there, I'm confident we can get this merged really soon now...

@AndiH
Copy link
Contributor Author

AndiH commented Oct 15, 2020

@boegel Absolutely! I wanted to get the EasyBlock done first before I implement the easyconfig. I am working on that next!

@hound hound bot deleted a comment from AndiH Oct 16, 2020
@hound hound bot deleted a comment from boegel Oct 16, 2020
@hound hound bot deleted a comment from boegel Oct 16, 2020
@hound hound bot deleted a comment from AndiH Oct 16, 2020
@AndiH
Copy link
Contributor Author

AndiH commented Oct 16, 2020

I think I've tackled all the remaining open issues!
I leave resolution of the open comments to you and will now start with the edits for the easyconfig.

@boegel
Copy link
Member

boegel commented Oct 17, 2020

@AndiH Tested this using a tweaked version of easybuilders/easybuild-easyconfigs#11391 (only changed the CUDA dep), ran into this crash:

== installing...
ERROR: Traceback (most recent call last):
  File "/user/boegel/easybuild-framework/easybuild/main.py", line 115, in build_and_install_software
    (ec_res['success'], app_log, err) = build_and_install_one(ec, init_env)
  File "/user/boegel/easybuild-framework/easybuild/framework/easyblock.py", line 3301, in build_and_install_one
    result = app.run_all_steps(run_test_cases=run_test_cases)
  File "/user/boegel/easybuild-framework/easybuild/framework/easyblock.py", line 3204, in run_all_steps
    self.run_step(step_name, step_methods)
  File "/user/boegel/easybuild/framework/easyblock.py", line 3059, in run_step
    step_method(self)()
  File "/tmp/eb-30tldro0/included-easyblocks-h6qfu1hb/easybuild/easyblocks/nvhpc.py", line 127, in install_step
    default_compute_capability = _before_default_compute_capability[0]
IndexError: list index out of range

@AndiH
Copy link
Contributor Author

AndiH commented Oct 19, 2020

@boegel Did you specify a Compute Capability (via CLI or via easyconfig)?

(I test some combinations meanwhile!)

Plus, proper quotes for hint for easyconfig
@AndiH
Copy link
Contributor Author

AndiH commented Oct 19, 2020

Ok, this is a bit weird, but fixed in e4c6c34.

If cuda_compute_capabilities is not set in the easyconfig file and not via CLI, the value is not None, but rather []; for whatever reason. The added if ec_default_compute_capability checks for empty lists.

@boegel
Copy link
Member

boegel commented Oct 20, 2020

@AndiH Getting [] as default value for cuda_compute_capabilities makes sense, see:

$ eb -a | grep cuda_compute_capabilities
cuda_compute_capabilities   List of CUDA compute capabilities to build with (if supported) [default: []]

Thanks for fixing that issue!

I'm running some final tests now, but I think this PR and easybuilders/easybuild-easyconfigs#11391 are good to go (I'll deal with the failing tests in the latter once this PR is merged).

@AndiH
Copy link
Contributor Author

AndiH commented Oct 20, 2020

Great! 🕺

I try to work soon on NVHPC 20.9, which has been released meanwhile!

@boegel
Copy link
Member

boegel commented Oct 20, 2020

lgtm, tested with easybuilders/easybuild-easyconfigs#11391

Thanks a lot for the effort and patience @AndiH !

@boegel boegel merged commit 400bbd6 into easybuilders:develop Oct 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants