@gkaf89
Contributor

@gkaf89 gkaf89 commented Aug 5, 2024

The files can be built in a selected build path (--buildpath), and the logs of successful builds are then collected in another location for permanent storage (--logfile-format). Logs of failed builds remain in the build path so that they can be inspected.

However, this setup is problematic when building software in HPC jobs. On HPC systems the build path is quite often set to fast storage local to the node, like an NVMe RAID mounted on /tmp, or /dev/shm (as suggested in the documentation: https://docs.easybuild.io/configuration/#buildpath). The node storage is often wiped at the end of a job, so the log files and the build artifacts are no longer available after the job terminates.

This commit adds an option to copy the logs and artifacts of failed builds to a more permanent location, so they can be easily inspected after a failed build.

@gkaf89 gkaf89 marked this pull request as draft August 5, 2024 13:10
@gkaf89 gkaf89 force-pushed the feature/error-logging branch 2 times, most recently from b306ccd to 092fcd0 Compare August 5, 2024 13:26
@gkaf89
Contributor Author

gkaf89 commented Aug 5, 2024

I am not sure what the best way is to select the build directory so that I can copy it to a more permanent location. At the moment I reconstruct the location of the build directory and then copy that directory to the destination path:

source_build_path = os.path.join(buildpath, name, version, toolchain)
dest_build_path = os.path.join(err_log_path, name, version, toolchain)
copy_dir(source_build_path, dest_build_path)
  • Is there some variable holding the build path, or even the relative build path (i.e. os.path.join(name, version, toolchain))?
  • Should we extract this functionality to a module?

@gkaf89 gkaf89 force-pushed the feature/error-logging branch 7 times, most recently from 50d99c3 to 86fe081 Compare August 12, 2024 19:10
@boegel boegel added this to the 4.x milestone Aug 13, 2024
@boegel
Member

boegel commented Aug 14, 2024

@gkaf89 The builddir variable that is set in each easyblock instance holds the path to the build directory for that particular easyconfig.
You can determine the relative path via the build_path() function that is available from easybuild.tools.config; it should report the top directory that corresponds to the buildpath EasyBuild configuration option (see also https://docs.easybuild.io/configuration/#buildpath).

So, for example, for example-1.2.3-GCC-12.3.0.eb, the builddir path would be something like /tmp/myuser/easybuild/build/example/1.2.3/GCC-12.3.0/, with buildpath set to /tmp/myuser/easybuild/build.
Note that the actual build directory in which the compilation is being done would be one level deeper, corresponding to the unpacked source tarball, so something like /tmp/myuser/easybuild/build/example/1.2.3/GCC-12.3.0/example-1.2.3/.

So, I think you could create a subdirectory in the permanent storage location that uses the name of the easyconfig file (to keep it simple), and copy the contents of builddir in there.
You do somehow want to make sure that the target path is unique though, because you could have multiple builds ongoing on different nodes that would all copy to the same permanent location in the end...
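
A minimal sketch of the path mapping described above, using only the Python standard library. The function name failed_build_target and its parameters are hypothetical and not part of EasyBuild; in the framework, the builddir value would come from the easyblock instance and the buildpath value from build_path(). The sketch does not try to make the target unique across concurrent builds.

```python
import os


def failed_build_target(builddir, buildpath, permanent_root):
    """Map a build directory to a location under permanent storage.

    Hypothetical helper for illustration only: all arguments are plain paths.
    """
    # Path of builddir relative to the top-level build path,
    # e.g. 'example/1.2.3/GCC-12.3.0'
    rel_path = os.path.relpath(builddir, start=buildpath)

    # Refuse build directories that are not located under buildpath
    if rel_path == os.pardir or rel_path.startswith(os.pardir + os.sep):
        raise ValueError("%s is not located under %s" % (builddir, buildpath))

    return os.path.join(permanent_root, rel_path)
```

With the example paths from the comment above, builddir /tmp/myuser/easybuild/build/example/1.2.3/GCC-12.3.0/ and buildpath /tmp/myuser/easybuild/build map to the subdirectory example/1.2.3/GCC-12.3.0 of the permanent location.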

@akesandgren
Contributor

Yeah, the thing to copy should be builddir, and the destination should be the permanent storage location plus the difference between buildpath and builddir (i.e. builddir relative to buildpath). Just make sure to remove old remnants of that first :-)
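
The "remove old remnants first" step could look roughly like the sketch below. It relies only on shutil from the standard library; copy_build_dir is a hypothetical name, not an EasyBuild function, and the choice to preserve symlinks rather than follow them is an assumption.

```python
import os
import shutil


def copy_build_dir(builddir, target_dir):
    """Copy builddir to target_dir, replacing leftovers of an earlier failed build.

    Hypothetical helper for illustration only.
    """
    # Remove remnants of a previous copy so stale files do not mix with new ones
    if os.path.isdir(target_dir):
        shutil.rmtree(target_dir)

    # copytree creates target_dir and any missing parent directories;
    # symlinks=True copies symbolic links as links instead of following them
    shutil.copytree(builddir, target_dir, symlinks=True)
    return target_dir
```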

@gkaf89 gkaf89 force-pushed the feature/error-logging branch 4 times, most recently from 1274a2b to b1a9da8 Compare August 23, 2024 08:53
@boegel
Member

boegel commented Aug 27, 2024

@gkaf89 If you need any help with this, do let us know!

@gkaf89 gkaf89 force-pushed the feature/error-logging branch from b1a9da8 to 6bc53e6 Compare September 8, 2024 22:43
@gkaf89
Contributor Author

gkaf89 commented Sep 8, 2024

@boegel The commit is ready. I won't have enough time to familiarize myself with the test framework for the EasyBlockTest class to prepare a test before the next release.

The commit can be tested by modifying the configuration options of some easyconfig that uses the system toolchain to cause a failure. For instance I added the option

configopts = '--some-invalid-option'

in zlib-1.3.1.eb. The result is that the temporary log file in the build directory and the extracted source code are copied to a permanent location.

@gkaf89 gkaf89 marked this pull request as ready for review September 8, 2024 23:38
@boegel
Member

boegel commented Sep 11, 2024

@gkaf89 There's a problem with the tests, looks like test_toy_build was broken by the changes being made here?
See for example https://github.com/easybuilders/easybuild-framework/actions/runs/10805964835/job/29973948603

@gkaf89
Contributor Author

gkaf89 commented Sep 11, 2024

The failure occurs because the target location for permanent storage is the same as the source location. The steps I am following to resolve the issue:

  • add a source/destination check to avoid a hard failure, and
  • determine why the source and the destination paths end up with the same value in the test.
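
A source/destination check of the kind listed above might be sketched as follows. This is an illustration under assumptions, not the code in the PR: copy_if_distinct is a hypothetical name, and skipping silently (rather than printing a warning) is a simplification.

```python
import os
import shutil


def copy_if_distinct(source_dir, dest_dir):
    """Copy source_dir to dest_dir unless both refer to the same location.

    Hypothetical helper: returns True if a copy was made, False if skipped.
    """
    # realpath resolves symlinks and '..' components, so different
    # spellings of the same directory compare as equal
    if os.path.realpath(source_dir) == os.path.realpath(dest_dir):
        return False

    # dest_dir must not exist yet; copytree creates it
    shutil.copytree(source_dir, dest_dir)
    return True
```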

@gkaf89 gkaf89 force-pushed the feature/error-logging branch from b34ff11 to 5cfdfd1 Compare September 11, 2024 10:24
@gkaf89
Contributor Author

gkaf89 commented Sep 11, 2024

@boegel Some edge cases were uncovered by the tests. The latest commit resolves the issue.

I leave it up to you whether to move it to version 5. I am not familiar enough with the tests to test the PR extensively.

@gkaf89 gkaf89 requested a review from boegel September 11, 2024 12:51
@boegel
Member

boegel commented Dec 4, 2024

@gkaf89 As briefly discussed during the conf call today, it would be good if you could add a test (or enhance an existing one, like test_toy_broken) to verify that the added functionality works as intended (and keeps working).

Do let us know if you need any help with that!

@boegel
Member

boegel commented Mar 1, 2025

@gkaf89 Finally found some time to go through this in detail...

I did some code cleanup, and renamed a bunch of things so it makes a bit more sense.
Most of that is internal, but I also renamed the configuration options to --failed-install-build-dirs-path and --failed-install-logs-path.

Thanks a lot for all the effort on this, especially the extensive tests that you added, that helps a lot to be confident in reviewing & refactoring this a bit!

This should be good to go now, unless I messed up something causing some tests to fail (if so, I'll get them fixed)

boegel added 2 commits March 2, 2025 10:09
…nstalls-build-dirs-path and --failed-installs-logs-path are also set with a default subdirectory, they should be used opt-in
@boegel boegel force-pushed the feature/error-logging branch from 480b92b to 6a6eef6 Compare March 2, 2025 09:10
Contributor

@Flamefire Flamefire left a comment

Some final suggestions, maybe @boegel can check which make sense.
Looks good to me with or without.

@boegel boegel force-pushed the feature/error-logging branch from 18e60cf to 1d757db Compare March 2, 2025 14:09
@boegel boegel changed the title Copy build log and artifacts to a permanent location after failures copy build directory and/or log file(s) if installation failed to path specified via --failed-install-build-dirs-path or --failed-install-logs-path Mar 2, 2025
@boegel boegel merged commit ae07512 into easybuilders:5.0.x Mar 2, 2025
39 checks passed
@boegel boegel removed this from EasyBuild v5.0 Mar 2, 2025
@Flamefire
Contributor

@boegel Did my comment about the salt get lost? It is collapsed now, so I'm not sure if the decision was deliberate:
I don't think we should use the salt as it will be there for every folder now.
As implemented by @gkaf89, we would have folders with (only) date-time in virtually all cases, as it includes the seconds. In the very unlikely case that another build failed in the exact same second, create_non_existing_paths already handles that by using a dynamic suffix in only that case. This would look cleaner, wouldn't it?
If I'm not missing anything I can change that in a small followup
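
To illustrate the naming scheme argued for here (plain date-time directory names, with a suffix only on a collision), a rough standalone sketch follows. It does not reproduce create_non_existing_paths; make_unique_dir, the timestamp format, and the numeric suffix style are all assumptions made for the example.

```python
import os
import time


def make_unique_dir(parent, timestamp=None):
    """Create and return a new directory under parent named after a timestamp.

    Hypothetical helper: a numeric suffix is appended only when the plain
    timestamp name is already taken.
    """
    if timestamp is None:
        timestamp = time.strftime('%Y%m%d-%H%M%S')

    candidate = os.path.join(parent, timestamp)
    attempt = 0
    while True:
        try:
            # makedirs raises FileExistsError if the final directory exists,
            # so there is no separate check-then-create step to race on
            os.makedirs(candidate)
            return candidate
        except FileExistsError:
            attempt += 1
            candidate = os.path.join(parent, '%s_%d' % (timestamp, attempt))
```

In the common case the directory name is just the timestamp; only a second failure within the same second would get a name ending in _1.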
