[XLA:GPU] Fuse shmem write loops for transposes in PackedTranspose #436
Technical Details
Replace the per-transpose shared-memory write loops with a single unified loop that processes all transposes simultaneously, computing the indices once and reusing them across all operations.
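For intuition, here is a minimal hand-written CUDA sketch of the idea. The actual change lives in XLA's MLIR-based PackedTranspose emitter (a single loop carrying multiple iter_args), not in CUDA source; the kernel below, including names such as `kTile`, `TransposeTwoInputs`, and the use of `int8_t` as a stand-in for a sub-32-bit element type, is an assumption made purely for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

constexpr int kTile = 32;

// Two independent transposes of narrow (8-bit) elements fused into one
// kernel. With per-transpose loops, each transpose recomputes the same
// tile indices; here the indices are computed once per iteration and
// reused for both shared-memory writes.
__global__ void TransposeTwoInputs(const int8_t* __restrict__ in_a,
                                   const int8_t* __restrict__ in_b,
                                   int8_t* __restrict__ out_a,
                                   int8_t* __restrict__ out_b,
                                   int rows, int cols) {
  __shared__ int8_t tile_a[kTile][kTile + 1];  // +1 avoids bank conflicts
  __shared__ int8_t tile_b[kTile][kTile + 1];

  int col = blockIdx.x * kTile + threadIdx.x;
  int row_base = blockIdx.y * kTile;

  // Single unified shared-memory write loop: one index computation per
  // iteration, reused for both transposes (conceptually, one loop with
  // multiple iter_args in the emitted IR).
  for (int i = threadIdx.y; i < kTile; i += blockDim.y) {
    int row = row_base + i;
    if (row < rows && col < cols) {
      int64_t src = static_cast<int64_t>(row) * cols + col;
      tile_a[i][threadIdx.x] = in_a[src];
      tile_b[i][threadIdx.x] = in_b[src];
    }
  }
  __syncthreads();

  // Transposed write-back, likewise sharing the index computation.
  int out_col = blockIdx.y * kTile + threadIdx.x;
  int out_row_base = blockIdx.x * kTile;
  for (int i = threadIdx.y; i < kTile; i += blockDim.y) {
    int out_row = out_row_base + i;
    if (out_row < cols && out_col < rows) {
      int64_t dst = static_cast<int64_t>(out_row) * rows + out_col;
      out_a[dst] = tile_a[threadIdx.x][i];
      out_b[dst] = tile_b[threadIdx.x][i];
    }
  }
}
```

The point of the sketch is only the loop structure: fusing the write loops removes redundant index arithmetic and loop overhead when several transposes share the same tiling, which is where the speedup in this PR comes from.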
Update the packed_transpose_multiple_heroes.hlo test to verify the single-loop structure with multiple iter_args.

Motivation
This fixes the second performance regression (the first was addressed in #434) caused by the new implementation of the fused transpose emitter for floating-point types narrower than 32 bits (PackedTranspose).

Test Result
It reduces the execution time of fused_convert_transpose_3.hlo from Llama 3 8B FP8 by ~30%, bringing it almost back to v0.6.0 performance (~4% gap). Together with #434, the performance of the 4 top fused_convert_transpose kernels improves by ~17%, resulting in an end-to-end model performance improvement of ~1% (tokens per second per GPU).