Conversation

@nurmukhametov

[XLA:GPU] Rename warp to shmem_group in PackedTranspose.

Also compute their count as kNumThreadsPerBlock / kNumShmemBanks rather than specifying it manually, to avoid inconsistency.

This change is NFC for non-AMD GPUs. For AMD GPUs, it fixes a performance regression caused by an inconsistency between the shmem_group size, kNumThreadsPerBlock, and kNumShmemBanks, which downstream led to half of the launched threads per block not being used at all. The packed transpose tests are updated to verify correct thread utilization.
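For illustration, a minimal sketch of the derived-count idea (not the actual XLA source; the concrete constant values and the kNumShmemGroupsPerBlock name are assumptions):

```cpp
#include <cstdint>

// Hypothetical values for illustration; the real constants live in the
// PackedTranspose emitter and may differ per target.
inline constexpr int64_t kNumShmemBanks = 32;
inline constexpr int64_t kNumThreadsPerBlock = 128;

// Derive the shmem_group count from the block size and the bank count instead
// of hard-coding it, so the three quantities can never drift apart.
inline constexpr int64_t kNumShmemGroupsPerBlock =
    kNumThreadsPerBlock / kNumShmemBanks;

static_assert(kNumThreadsPerBlock % kNumShmemBanks == 0,
              "Block size must be a multiple of the shmem bank count so every "
              "launched thread belongs to exactly one shmem_group.");
```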

@i-chaochen (Collaborator) left a comment

Nice! I like the renaming to shmem_group.

@nurmukhametov force-pushed the anurmukh/fix-packed-transpose-threads-0.7.1 branch from 1bb6c4d to 7db6f55 on November 17, 2025 at 17:55
@i-chaochen (Collaborator)

Hi @nurmukhametov, all CI checks are green, I guess we can merge it now?

@nurmukhametov (Author)

> Hi @nurmukhametov, all CI checks are green, I guess we can merge it now?

Yes, we can if nobody else wants to review it.

@i-chaochen merged commit 729dcdf into rocm-jaxlib-v0.7.1 on Nov 19, 2025
7 of 8 checks passed