
Conversation

@gshtras (Collaborator) commented on Feb 13, 2025

Performance improvement for ROCm, working around a hardware limitation.

On MI300, GEMM can suffer from significant TAGRAM channel hotspot problems when the stride of a matrix is a multiple of 512 bytes. This is especially true for TN transpose cases, where it can increase the latency of VMEM instructions and cause a significant drop in performance. Where possible, stride padding can be applied from the application when allocating memory for the matrices, so that no stride is a multiple of 512 bytes (for example, for TN FP16 GEMM, lda = M + 128 when M % 256 == 0).
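As a rough sketch of the trick in PyTorch (shapes, dtype, and the CPU device here are illustrative, not the PR's actual code, which pads the FP8 weights on the GPU):

```python
import torch
import torch.nn.functional as F

# Hypothetical weight whose row stride is a multiple of 512 bytes:
# 4096 elements * 2 bytes (fp16) = 8192 bytes.
weight = torch.zeros(4096, 4096, dtype=torch.float16)

if (weight.stride(-2) * weight.element_size()) % 512 == 0:
    # Pad the last dimension by 256 bytes worth of elements, then slice the
    # padding off again. Shape and data are unchanged; only the row stride
    # grows to 4096 + 128 = 4224 elements (8448 bytes), which is no longer
    # a multiple of 512.
    num_pad = 256 // weight.element_size()
    weight = F.pad(weight, (0, num_pad), "constant", 0)[..., :-num_pad]

print(weight.shape, weight.stride(), weight.is_contiguous())
# torch.Size([4096, 4096]) (4224, 1) False
```

The padded columns are never read through the sliced view; they exist only to shift the row stride off the 512-byte boundary.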

One requirement for this is that w8a8_block_fp8_matmul supports non-contiguous weights, which it already appears to do, so the leftover assertion is obsolete.
With correctness unchanged, this shows the following latency improvements on ROCm:
amd/Llama-3.1-8B-Instruct-FP8-KV bs=64 in=512 out=512 tp=1:
5.95s -> 5.7s (4%)
amd/Llama-3.1-70B-Instruct-FP8-KV bs=64 in=512 out=512 tp=1:
25.6s -> 24.3s (5%)
deepseek-ai/DeepSeek-R1 bs=64 in=256 out=256 tp=8:
26.1s -> 24.9s (5%)

@github-actions commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gshtras changed the title from "[ROCm] Apply FP8 weights padding to 256 bytes on ROCm" to "[ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm" on Feb 13, 2025
@hongxiayang added the rocm (Related to AMD ROCm) label on Feb 13, 2025
and (weight.stride(-2) * weight.element_size()) % 512 == 0):
num_pad = 256 // weight.element_size()
weight = F.pad(weight, (0, num_pad), "constant", 0)[..., :-num_pad]
torch.cuda.empty_cache()
Collaborator
is empty_cache really necessary here?

Collaborator Author
Without it, there is a possibility of having double the memory allocated, depending on the allocator's behavior.
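A rough way to see this on a CUDA/ROCm device (sizes are illustrative; the exact numbers depend on the caching allocator):

```python
import torch
import torch.nn.functional as F

weight = torch.zeros(8192, 8192, dtype=torch.float16, device="cuda")  # ~128 MiB
print(torch.cuda.memory_reserved() // 2**20, "MiB reserved")

# F.pad materializes a new, slightly larger tensor. The old storage is freed
# by Python, but the caching allocator keeps its block reserved on the device,
# so reserved memory is roughly doubled at this point.
weight = F.pad(weight, (0, 128), "constant", 0)[..., :-128]
print(torch.cuda.memory_reserved() // 2**20, "MiB reserved")

# Return the cached, now-unused block to the driver.
torch.cuda.empty_cache()
print(torch.cuda.memory_reserved() // 2**20, "MiB reserved")
```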

@NickLucche (Collaborator)

Thanks for contributing! 🙏🏻
I only had a few comments to add while the actual review from the code owners is pending.

Co-authored-by: Michael Goin <[email protected]>
Signed-off-by: Gregory Shtrasberg <[email protected]>
@gshtras force-pushed the fp8_padding_upstream branch from 6106325 to f3da192 on February 18, 2025 17:35
M = A.numel() // A.shape[-1]

- assert B.ndim == 2 and B.is_contiguous() and Bs.ndim == 2
+ assert B.ndim == 2 and Bs.ndim == 2
Collaborator
Are we sure this is okay?

Collaborator Author
The kernel works just fine with a padded non-contiguous tensor, and in any scenario other than padding the weights should already be contiguous, so no existing workflow should break.
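For context, a hypothetical illustration of what such a padded B looks like: it is non-contiguous only in the sense that its row stride exceeds its row length, while the innermost dimension keeps unit stride:

```python
import torch
import torch.nn.functional as F

B = torch.zeros(7168, 2048, dtype=torch.float16)
B = F.pad(B, (0, 128), "constant", 0)[..., :-128]

print(B.is_contiguous())  # False -> the old assertion would have rejected it
print(B.stride())         # (2176, 1): padded row stride, unit column stride
print(B.shape)            # torch.Size([7168, 2048]): logical shape unchanged
```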

Collaborator
One other option is just to call weight.contiguous() after we pad it in process_weights_after_loading?

Collaborator
WDYT?

Collaborator Author
This would remove the padding, reverting the effect of F.pad.
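Concretely (continuing the illustrative fp16 example above): .contiguous() copies the padded view into a freshly allocated dense buffer, so the row stride lands right back on the 512-byte boundary:

```python
import torch
import torch.nn.functional as F

B = torch.zeros(7168, 2048, dtype=torch.float16)
B_padded = F.pad(B, (0, 128), "constant", 0)[..., :-128]  # stride (2176, 1)

B_compact = B_padded.contiguous()  # dense copy, padding effect undone
print(B_compact.stride())          # (2048, 1): a 512-byte multiple again
```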

Collaborator
sorry, that was a dumb comment by me

Collaborator
@gshtras I agree the contiguous check here was overly strict. But should we still check that the stride of the last dimension is 1, i.e. B.stride(-1) == 1?
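A sketch of what that relaxed check could look like inside w8a8_block_fp8_matmul (not the merged code):

```python
# Allow a padded, non-contiguous B, but still require unit stride in the
# innermost dimension.
assert B.ndim == 2 and B.stride(-1) == 1 and Bs.ndim == 2
```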

@robertgshaw2-redhat added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Feb 21, 2025
@robertgshaw2-redhat (Collaborator)

Nice work!

@simon-mo merged commit c904fdd into vllm-project:main on Feb 22, 2025
42 of 46 checks passed
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
@gshtras deleted the fp8_padding_upstream branch on April 7, 2025 14:59
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025