[ROCm] [Feature] [Doc] [Dockerfile] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing #12499

tjtanaa · 2025-01-28T05:59:33Z

Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing

Note: This feature requires ROCm 6.3 or later and an MI300 or later GPU architecture.

Description

This PR makes the following enhancements:

This PR adds support for Per-Token-Activation Per-Channel-Weight FP8 (PTPC-FP8) quantized inference.
The model is quantized on the fly from BFloat16 to FP8. Model weights that are stored in Float16 need to be cast to BFloat16.
It uses PyTorch's recent rowwise scaled GEMM support in torch._scaled_mm, introduced in "[ROCm] hipblaslt rowwise f8 gemm" (pytorch/pytorch#144432), which speeds up the previous naive implementation by at least 2 times. For more details, see the Performance section.
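
To make the scaling scheme concrete, here is a minimal NumPy sketch of per-token-activation, per-channel-weight quantization. It is not the code in this PR: the helper names (`fake_fp8`, `quantize_rows`, `ptpc_matmul`) are hypothetical, the FP8 cast is only crudely emulated by rounding to 4 significant bits, and the range constant 448 (the largest finite value of the OCP E4M3 format) is an assumption, since other FP8 variants have a different range.

```python
import numpy as np

# Assumption for this sketch: largest finite value of the OCP E4M3 format.
# Other FP8 variants use a different maximum.
FP8_MAX = 448.0


def fake_fp8(x):
    """Crudely emulate FP8 rounding: keep 4 significant bits (1 implicit + 3 mantissa).

    Exponent range and subnormals are ignored; this is not a bit-exact cast.
    """
    m, e = np.frexp(x)                      # x = m * 2**e with 0.5 <= |m| < 1
    return np.ldexp(np.round(m * 16) / 16, e)


def quantize_rows(t, fp8_max=FP8_MAX):
    """Quantize with one scale per row.

    For activations of shape (tokens, in_features) this is per-token scaling;
    for weights of shape (out_features, in_features) it is per-channel scaling.
    """
    amax = np.abs(t).max(axis=1, keepdims=True)
    scale = np.maximum(amax, 1e-12) / fp8_max   # guard against all-zero rows
    q = fake_fp8(np.clip(t / scale, -fp8_max, fp8_max))
    return q, scale


def ptpc_matmul(x, w):
    """Rowwise-scaled GEMM: multiply quantized values, then apply both scale vectors."""
    xq, xs = quantize_rows(x)               # xs has shape (tokens, 1)
    wq, ws = quantize_rows(w)               # ws has shape (out_features, 1)
    return (xq @ wq.T) * xs * ws.T


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))            # 4 tokens
w = rng.standard_normal((8, 64))            # 8 output channels
ref = x @ w.T
out = ptpc_matmul(x, w)
rel_err = np.abs(out - ref).max() / np.abs(ref).max()
print(f"max error relative to largest output: {rel_err:.4f}")
```

Because every token and every output channel gets its own scale, one outlier row does not reduce the precision available to the other rows, which is the motivation for rowwise scaling over a single per-tensor scale.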

To support this feature, the PyTorch repo commit in Dockerfile.rocm_base has been updated to 3a585126.
Dockerfile.rocm is left untouched because its base image is pulled from the AMD Docker Hub registry. That base image already has PyTorch repo commit 3a585126 installed.

Small enhancement: the documentation has been updated to ROCm 6.3, and the commits in the installation steps have been updated to match those in Dockerfile.rocm_base.

Performance

Perplexity Test

Model: Llama-3.1-8B-Instruct
Dataset: Wikitext
GPU: MI300X

Model	Quantization	KVCacheDtype	Tasks	Metric	Metric Score
Llama-3.1-8B-Instruct/	auto (bf16)	auto (bf16)	wikitext	word_perplexity	9.4281
Llama-3.1-8B-Instruct/	fp8	fp8_e4m3	wikitext	word_perplexity	9.5124
Llama-3.1-8B-Instruct/	ptpc_fp8	fp8_e4m3	wikitext	word_perplexity	9.5093
Llama-3.1-8B-Instruct/	ptpc_fp8 (naive)	fp8_e4m3	wikitext	word_perplexity	9.5095
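
As a quick reading aid, the relative perplexity increase over the bf16 baseline can be computed directly from the scores in the table above. The snippet only restates those numbers; the variable names are illustrative.

```python
# Word perplexity scores copied from the table above.
baseline = 9.4281  # auto (bf16)
scores = {
    "fp8": 9.5124,
    "ptpc_fp8": 9.5093,
    "ptpc_fp8 (naive)": 9.5095,
}

# Relative increase in word_perplexity versus the bf16 baseline, in percent.
increase_pct = {
    name: 100.0 * (ppl - baseline) / baseline for name, ppl in scores.items()
}

for name, pct in increase_pct.items():
    print(f"{name}: +{pct:.2f}% word_perplexity vs bf16")
```

All three FP8 configurations stay within 1% of the bf16 baseline, and ptpc_fp8 is marginally closer to it than per-tensor fp8.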

Speed Test (Old naive implementation vs torch._scaled_mm rowwise scaled GEMM feature)

Model: Llama-3.1-70B-Instruct
Dataset: ShareGPT
GPU: 1xMI300X

Quantization	KVCacheDtype	Req/s	Total tokens/s	Output tokens/s
ptpc_fp8 (naive)	fp8_e4m3	2.43	1003.46	481.28
ptpc_fp8 (torch._scaled_mm rowwise scaled GEMM feature)	fp8_e4m3	6.36	2631.04	1261.91
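
The speedup implied by the table above can be checked with a few lines of arithmetic. The snippet only divides the reported numbers; the dictionary keys are illustrative labels.

```python
# Throughput numbers copied from the table above (Llama-3.1-70B-Instruct, 1xMI300X).
naive = {"req_per_s": 2.43, "total_tok_per_s": 1003.46, "output_tok_per_s": 481.28}
rowwise = {"req_per_s": 6.36, "total_tok_per_s": 2631.04, "output_tok_per_s": 1261.91}

# Speedup of the torch._scaled_mm rowwise path over the naive path, per metric.
speedup = {metric: rowwise[metric] / naive[metric] for metric in naive}

for metric, ratio in speedup.items():
    print(f"{metric}: {ratio:.2f}x")
```

Each metric improves by roughly 2.6x, consistent with the "at least 2 times" figure quoted in the description.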

github-actions · 2025-01-28T05:59:44Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

mergify · 2025-01-28T06:12:33Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tjtanaa · 2025-01-28T06:45:15Z

This PR is closed as the git history is messed up. The PR is replaced by #12501

tjtanaa requested review from mgoin, robertgshaw2-redhat and tlrmchlsmth as code owners January 28, 2025 05:59

mergify bot added the documentation and ci/build labels Jan 28, 2025

kliuae and others added 24 commits January 28, 2025 06:04

add Per-token-activation per-channel-weight on-the-fly quantization fp8

6559a9e

Signed-off-by: kliuae <[email protected]>

add ptpc fp8 unittests

798c07e

Signed-off-by: tjtanaa <[email protected]>

remove is_navi check for now

63f9657

Signed-off-by: tjtanaa <[email protected]>

update rocm gpu installation readme; remove navi check

6dc4604

Signed-off-by: tjtanaa <[email protected]>

update PyTorch version to enable torch._scaled_mm rowwise

30f0ecd

Signed-off-by: tjtanaa <[email protected]>

[Misc] Rename MultiModalInputsV2 -> MultiModalInputs (vllm-project#…

be57b24

…12244) Signed-off-by: DarkLight1337 <[email protected]>

[Misc]Add BNB quantization for PaliGemmaForConditionalGeneration (vll…

66d6dd2

…m-project#12237) Signed-off-by: Jee Jee Li <[email protected]>

[Misc] Remove redundant TypeVar from base model (vllm-project#12248)

e9ddeda

Signed-off-by: DarkLight1337 <[email protected]>

[Bugfix] Fix mm_limits access for merged multi-modal processor (vllm-…

0572080

…project#12252) Signed-off-by: DarkLight1337 <[email protected]>

[torch.compile] transparent compilation with more logging (vllm-proje…

29b95c6

…ct#12246) Signed-off-by: youkaichao <[email protected]>

[V1][Bugfix] Fix data item ordering in mixed-modality inference (vllm…

b559fa6

…-project#12259) Signed-off-by: Roger Wang <[email protected]>

Remove pytorch comments for outlines + compressed-tensors (vllm-proje…

6cfb7ac

…ct#12260) Signed-off-by: Thomas Parnell <[email protected]>

[Platform] improve platforms getattr (vllm-project#12264)

27530bb

Signed-off-by: Mengqing Cao <[email protected]>

[ci/build] update nightly torch for gh200 test (vllm-project#12270)

91b7860

Signed-off-by: youkaichao <[email protected]>

[Bugfix] fix race condition that leads to wrong order of token return…

98b8414

…ed (vllm-project#10802) Signed-off-by: Jannis Schönleber <[email protected]>

[Kernel] fix moe_align_block_size error condition (vllm-project#12239)

e4564cb

Signed-off-by: Jinzhen Lin <[email protected]>

[v1][stats][1/n] Add RequestStatsUpdate and RequestStats types (vllm-…

36077d4

…project#10907) Signed-off-by: rickyx <[email protected]>

[Bugfix] Multi-sequence broken (vllm-project#11898)

049885f

Signed-off-by: Andy Lo <[email protected]>

[Misc] Remove experimental dep from tracing.py (vllm-project#12007)

0db6a75

Signed-off-by: Adrian Cole <[email protected]>

[Misc] Set default backend to SDPA for get_vit_attn_backend (vllm-pro…

cbe2a73

…ject#12235) Signed-off-by: wangxiyuan <[email protected]>

[Core] Free CPU pinned memory on environment cleanup (vllm-project#10477)

7980828

[BUGFIX] When skip_tokenize_init and multistep are set, execution cra…

10611d8

…shes (vllm-project#12277) Signed-off-by: maleksan85 <[email protected]> Co-authored-by: maleksan85 <[email protected]>

[Documentation][AMD] Add information about prebuilt ROCm vLLM docker …

fb43dee

…for perf validation purpose (vllm-project#12281) Signed-off-by: Hongxia Yang <[email protected]>

[VLM] Simplify post-processing of replacement info (vllm-project#12269)

4f2fc00

Signed-off-by: DarkLight1337 <[email protected]>

DarkLight1337 and others added 12 commits January 28, 2025 06:11

[Bugfix] Fix Granite 3.0 MoE model loading (vllm-project#12446)

4176918

Signed-off-by: DarkLight1337 <[email protected]>

[Bugfix] Fix missing seq_start_loc in xformers prefill metadata (vllm…

7a6cded

…-project#12464) Signed-off-by: Isotr0py <[email protected]>

[V1][Minor] Minor optimizations for update_from_output (vllm-project#…

bd69c90

…12454) Signed-off-by: Woosuk Kwon <[email protected]>

[Bugfix] Fix gpt2 GGUF inference (vllm-project#12467)

899cea0

Signed-off-by: Isotr0py <[email protected]>

[Build] Only build 9.0a for scaled_mm and sparse kernels (vllm-projec…

2fa4f8e

…t#12339) Signed-off-by: Lucas Wilkinson <[email protected]>

[V1][Metrics] Add initial Prometheus logger (vllm-project#12416)

1253304

Signed-off-by: Mark McLoughlin <[email protected]>

[V1][CI/Test] Do basic test for top-p & top-k sampling (vllm-project#…

0f2a9ce

…12469) Signed-off-by: Woosuk Kwon <[email protected]>

[FlashInfer] Upgrade to 0.2.0 (vllm-project#11194)

45844a3

Signed-off-by: Bowen Wang <[email protected]> Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]>

[Feature] [Spec decode]: Enable MLPSpeculator/Medusa and `prompt_logp…

411e0d2

…robs` with ChunkedPrefill (vllm-project#10132) Signed-off-by: NickLucche <[email protected]> Signed-off-by: wallashss <[email protected]> Co-authored-by: wallashss <[email protected]>

Update pre-commit hooks (vllm-project#12475)

0ae8f3e

Signed-off-by: Harry Mellor <[email protected]>

[Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache (…

008891b

…vllm-project#11277) Signed-off-by: Liangfu Chen <[email protected]> Co-authored-by: Jiangfei Duan <[email protected]>

Fix bad path in prometheus example (vllm-project#12481)

79151e0

Signed-off-by: mgoin <[email protected]>

tjtanaa force-pushed the ptpc-fp8-rocm branch from d2b5204 to 79151e0 Compare January 28, 2025 06:11

tjtanaa requested review from DarkLight1337, LiuXiaoxuanPKU, WoosukKwon, alexm-redhat, comaniac, njhill, simon-mo, youkaichao, ywang96 and zhuohan123 as code owners January 28, 2025 06:11

mergify bot added frontend needs-rebase labels Jan 28, 2025

tjtanaa marked this pull request as draft January 28, 2025 06:12

tjtanaa closed this Jan 28, 2025

tjtanaa deleted the ptpc-fp8-rocm branch February 25, 2025 13:12
