Fix performance regression on transposes on Llama 3 8B FP8 #434
[XLA:GPU] Rename `warp` to `shmem_group` in `PackedTranspose`. Also calculate their count as `kNumThreadsPerBlock / kNumShmemBanks` to avoid inconsistency when manually specified.

This change is NFC for non-AMD GPUs. For AMD GPUs, it fixes the performance regression caused by inconsistency between the `shmem_group` size, `kNumThreadsPerBlock`, and `kNumShmemBanks`, which ended up in a situation downstream where half of the launched threads per block were not utilized at all. Updated the packed transpose tests to verify correct thread utilization.
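The gist of the fix, as a minimal sketch (only `kNumThreadsPerBlock` and `kNumShmemBanks` appear in the PR description; `kNumShmemGroups` and the concrete values are assumptions for illustration, not the actual XLA source):

```cpp
// Hypothetical constants; real values live in the XLA:GPU transpose emitter.
constexpr int kNumThreadsPerBlock = 128;  // assumed value for illustration
constexpr int kNumShmemBanks = 32;        // 32 on NVIDIA; 64 on some AMD GPUs

// Before: the group count was specified manually and could drift out of
// sync with the two constants above, leaving threads idle on AMD GPUs.
// After: derive it, so the three values are consistent by construction.
constexpr int kNumShmemGroups = kNumThreadsPerBlock / kNumShmemBanks;

static_assert(kNumShmemGroups * kNumShmemBanks == kNumThreadsPerBlock,
              "every launched thread belongs to exactly one shmem group");
```

Deriving the count rather than hard-coding it is what makes the change NFC where the manual value already matched, while eliminating the mismatch that idled half the block's threads on AMD.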