[SYCL] optimize createSyclObjFromImpl calls to take rvalue-ref to shared_ptr #20859
With this optimization, the shared_ptr is moved into createSyclObjFromImpl instead of being copied, which saves two atomic operations per call: the reference-count increment made by the copy and the matching decrement when the caller's copy is destroyed (see e.g. this SO thread).
I've applied it everywhere it was possible in the code, leaving copies only where they are actually needed (mostly for context_impl use).
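To illustrate the difference, here is a minimal sketch, not the actual runtime code. `EventImpl`, `Event` and `makeFromImpl` are hypothetical stand-ins for an impl class, its public SYCL object and createSyclObjFromImpl; the real signatures may differ.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a runtime impl class (not a real SYCL type).
struct EventImpl {};

// Hypothetical stand-in for a public object that wraps a shared impl.
class Event {
public:
  // Copying constructor: atomically increments the reference count.
  explicit Event(const std::shared_ptr<EventImpl> &Impl) : MImpl(Impl) {}
  // Moving constructor: steals the pointer, no reference-count traffic.
  explicit Event(std::shared_ptr<EventImpl> &&Impl) : MImpl(std::move(Impl)) {}

  long useCount() const { return MImpl.use_count(); }

private:
  std::shared_ptr<EventImpl> MImpl;
};

// Chosen for lvalues: the caller keeps its own reference, so a copy is needed.
Event makeFromImpl(const std::shared_ptr<EventImpl> &Impl) {
  return Event(Impl);
}

// Chosen for rvalues: ownership is transferred, so the increment here and the
// later decrement in the caller are both avoided.
Event makeFromImpl(std::shared_ptr<EventImpl> &&Impl) {
  return Event(std::move(Impl));
}
```

A caller that no longer needs its own reference would write `makeFromImpl(std::move(Impl))`; afterwards `Impl` is empty and the reference count is unchanged. A caller that still uses `Impl` afterwards passes it as an lvalue and pays for the copy.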
Results summary
Overhead over UR is reduced by ~8% in scenarios that use events. Other benchmarks also show visible improvements in many cases, including the new PyTorch multiqueue benchmarks, which improved by 2.7% overall.
Results Examples
In each plot, the new result is the dot on the right-hand side.
old = 140, new = 138.2, UR baseline = 118.1, overhead over UR reduced by 8.1%
old = 122.3, new = 121.3, UR baseline = 108.1, overhead over UR reduced by 7.0%
old time = 13.91, new time = 13.58, whole stack reduced by 2.4%
And finally, the new PyTorch microbenchmarks:
old time = 1.81, new time = 1.76, L0 baseline = 1.44
whole stack reduced by 2.8%, overhead over L0 reduced by 13.5%
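For reference, the percentages above appear to follow the usual definitions below. This is an assumption about how they were computed, shown here with the PyTorch numbers; the symbols $t_\text{old}$, $t_\text{new}$ and $t_\text{base}$ are introduced only for this illustration.

```math
\text{whole-stack reduction} = \frac{t_\text{old} - t_\text{new}}{t_\text{old}} = \frac{1.81 - 1.76}{1.81} = \frac{0.05}{1.81} \approx 2.8\%
```

```math
\text{overhead reduction} = \frac{(t_\text{old} - t_\text{base}) - (t_\text{new} - t_\text{base})}{t_\text{old} - t_\text{base}} = \frac{1.81 - 1.76}{1.81 - 1.44} = \frac{0.05}{0.37} \approx 13.5\%
```

The same time saving looks much larger as an overhead reduction because the baseline (L0 or UR) time, which this change cannot affect, is subtracted from the denominator.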