[XLA:GPU] Fuse shmem write loops for transposes in PackedTranspose #436
Technical Details
Replace the per-transpose shared-memory write loops with a single unified loop that processes all transposes simultaneously, computing the indices once and reusing them across all operations.
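For intuition, here is a minimal hand-written CUDA sketch of the idea. The actual change lives in XLA's MLIR-based PackedTranspose emitter (a single loop carrying multiple iter_args), not in CUDA source; the kernel below, including names such as `kTile`, `TransposeTwoInputs`, and the use of `int8_t` as a stand-in for a sub-32-bit element type, is an assumption made purely for illustration.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

constexpr int kTile = 32;

// Two independent transposes of narrow (8-bit) elements fused into one
// kernel. With per-transpose loops, each transpose recomputes the same
// tile indices; here the indices are computed once per iteration and
// reused for both shared-memory writes.
__global__ void TransposeTwoInputs(const int8_t* __restrict__ in_a,
                                   const int8_t* __restrict__ in_b,
                                   int8_t* __restrict__ out_a,
                                   int8_t* __restrict__ out_b,
                                   int rows, int cols) {
  __shared__ int8_t tile_a[kTile][kTile + 1];  // +1 avoids bank conflicts
  __shared__ int8_t tile_b[kTile][kTile + 1];

  int col = blockIdx.x * kTile + threadIdx.x;
  int row_base = blockIdx.y * kTile;

  // Single unified shared-memory write loop: one index computation per
  // iteration, reused for both transposes (conceptually, one loop with
  // multiple iter_args in the emitted IR).
  for (int i = threadIdx.y; i < kTile; i += blockDim.y) {
    int row = row_base + i;
    if (row < rows && col < cols) {
      int64_t src = static_cast<int64_t>(row) * cols + col;
      tile_a[i][threadIdx.x] = in_a[src];
      tile_b[i][threadIdx.x] = in_b[src];
    }
  }
  __syncthreads();

  // Transposed write-back, likewise sharing the index computation.
  int out_col = blockIdx.y * kTile + threadIdx.x;
  int out_row_base = blockIdx.x * kTile;
  for (int i = threadIdx.y; i < kTile; i += blockDim.y) {
    int out_row = out_row_base + i;
    if (out_row < cols && out_col < rows) {
      int64_t dst = static_cast<int64_t>(out_row) * rows + out_col;
      out_a[dst] = tile_a[threadIdx.x][i];
      out_b[dst] = tile_b[threadIdx.x][i];
    }
  }
}
```

The point of the sketch is only the loop structure: fusing the write loops removes redundant index arithmetic and loop overhead when several transposes share the same tiling, which is where the speedup in this PR comes from.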
Update the packed_transpose_multiple_heroes.hlo test to verify the single-loop structure with multiple iter_args.

Motivation
This fixes the second performance regression (the first was addressed in #434) caused by the new implementation of the fused transpose emitter for floating-point types narrower than 32 bits (PackedTranspose).

Test Result
It reduces the execution time of fused_convert_transpose_3.hlo from Llama 3 8B FP8 by ~30%, bringing it almost back to v0.6.0 performance (~4% gap). Together with #434, the performance of the 4 top fused_convert_transpose kernels improves by ~17%, resulting in an end-to-end model performance improvement of ~1% (tokens per second per GPU).