Skip to content

Conversation

@jle-quel
Copy link
Contributor

@jle-quel jle-quel commented Oct 4, 2023

Description

This PR updates the rocSOLVER backend in order to fix the multiple race conditions seen in this backend, as well as some small cleaning and ifdef to support other sycl implementation modifications.

Here's the full list of modifications and what they fixed:

  • [CLEANING]: The add of an ifdef to support the different names of the hip backend given by the different sycl implementations.
  • [CLEANING]: Removal of all the devInfo logic in the library which calls HIP functions which do not accept and set devInfo.
  • [RACE-CONDITION]: This PR [SYCL] Add test for extended deleters intel/llvm#11344 re-introduced the call to the extendedDeleter which fixes the segfault that was happening at the end of the example_lapack_getrs_usm
  • [RACE-CONDITION]: The helper function lapack_info_check was not correctly waiting on the sycl::queue which created a race condition when reading devInfo. The wait has been moved to wait for all overloads, as well as the addition of another wait to let the memcpy finish.
  • [RACE-CONDITION]: All onemkl_rocsolver_host_task now uses ROCSOLVER_ERROR_FUNC_T_SYNC to synchronize the HIPStream. (Similar to [RNG][rocRAND] Update rocRAND backend #376)
  • [RACE-CONDITION]: Added the explicit requirements for parrallel_for on host_task via depends_on.

Checklist

All Submissions

  • Do all unit tests pass locally? Attach a log.
  • Have you formatted the code using clang-format?

example_lapack_getrs_usm.log.txt
test_main_lapack_ct.log.txt
test_main_lapack_rt.log.txt

Copy link
Contributor

@sknepper sknepper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks @jle-quel ! I just had a couple questions, plus one small note.
These changes seem to include the main change from my #388 pull request, plus some extra benefits. However, I had also fixed a few deprecation warnings by including <rocblas/rocblas.h> and <rocsolver/rocsolver.h> instead of just <rocblas.h> and <rocsolver.h> in src/lapack/backends/rocsolver/rocsolver_helper.hpp and src/lapack/backends/rocsolver/rocsolver_task.hpp

It's not required, but you could consider making that change in this PR and I can just close the previous one.

@sknepper
Copy link
Contributor

sknepper commented Oct 5, 2023

/intelci: run

@jle-quel
Copy link
Contributor Author

jle-quel commented Oct 5, 2023

This is great, thanks @jle-quel ! I just had a couple questions, plus one small note. These changes seem to include the main change from my #388 pull request, plus some extra benefits. However, I had also fixed a few deprecation warnings by including <rocblas/rocblas.h> and <rocsolver/rocsolver.h> instead of just <rocblas.h> and <rocsolver.h> in src/lapack/backends/rocsolver/rocsolver_helper.hpp and src/lapack/backends/rocsolver/rocsolver_task.hpp

It's not required, but you could consider making that change in this PR and I can just close the previous one.

Thank you,
Oh, I was unaware of such open work. Of course I'll copy the deprecation warning fixes you made and bring them here 👍

@jle-quel
Copy link
Contributor Author

jle-quel commented Oct 5, 2023

/intelci: run

Not sure what the CI does, but running tests on this PR now, will not return the same results I shared in the description.
This PR is dependent on the ExtendedDeleter which is not yet merged (reference in the description)

@sknepper
Copy link
Contributor

sknepper commented Oct 6, 2023

/intelci: run

Not sure what the CI does, but running tests on this PR now, will not return the same results I shared in the description. This PR is dependent on the ExtendedDeleter which is not yet merged (reference in the description)

Thanks for the clarification! A passing CI is not currently required (and there are other issues that are unrelated to your PR that are also impacting it), but figured I'd run both that and manual testing, but neither of those have the ExtendedDeleter fix. For me, all of the compile-time dispatching tests passed, while some of the run-time dispatching ones segfaulted (until they eventually just stopped running) - which seems to be expected without the ExtendedDeleter fix.

Copy link
Contributor

@sknepper sknepper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @jle-quel ! These changes look good to me.

@jle-quel
Copy link
Contributor Author

I'll mark this PR as Ready for review when the following PRs are merged.

@jle-quel jle-quel marked this pull request as ready for review October 18, 2023 07:03
@sknepper sknepper merged commit 0e5043b into uxlfoundation:develop Oct 18, 2023
@jle-quel jle-quel deleted the jle-quel/update-rocsolver-backend branch October 19, 2023 07:48
normallytangent pushed a commit to normallytangent/oneMKL that referenced this pull request Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants