[Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 #12587

tlrmchlsmth · 2025-01-30T20:29:49Z

Integrates the block-quantized kernels introduced in #11868 for use in linear layers.
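As a rough illustration of what "block-quantized" means for a linear layer, here is a minimal pure-Python reference sketch. It is not the CUTLASS kernel or any vLLM API. The function names, the tiny block size, the choice of one scale per weight tile and one scale per activation group, and the use of integer rounding in place of real FP8 rounding are all assumptions made for the example.

```python
# Reference sketch only: names, block size, and scale layout are hypothetical.
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value
BLOCK = 2             # tiny block size chosen for the demo

def quantize_tile(values):
    """Return (scale, quantized values) for one tile sharing a single scale."""
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / FP8_E4M3_MAX
    # Integer rounding stands in for real FP8 rounding in this sketch.
    return scale, [round(v / scale) for v in values]

def quantize_weight(w, block=BLOCK):
    """Quantize a K x N weight with one scale per (block x block) tile."""
    k_dim, n_dim = len(w), len(w[0])
    q = [[0] * n_dim for _ in range(k_dim)]
    scales = {}
    for kb in range(0, k_dim, block):
        for nb in range(0, n_dim, block):
            idx = [(k, n)
                   for k in range(kb, min(kb + block, k_dim))
                   for n in range(nb, min(nb + block, n_dim))]
            scale, vals = quantize_tile([w[k][n] for k, n in idx])
            scales[(kb // block, nb // block)] = scale
            for (k, n), v in zip(idx, vals):
                q[k][n] = v
    return q, scales

def quantize_activation(x, block=BLOCK):
    """Quantize an M x K activation with one scale per (1 x block) group."""
    q, scales = [], []
    for row in x:
        q_row, s_row = [], []
        for kb in range(0, len(row), block):
            scale, vals = quantize_tile(row[kb:kb + block])
            s_row.append(scale)
            q_row.extend(vals)
        q.append(q_row)
        scales.append(s_row)
    return q, scales

def block_scaled_matmul(qx, sx, qw, sw, block=BLOCK):
    """Accumulate per-K-block partial sums, rescaling each by both scales."""
    m_dim, k_dim, n_dim = len(qx), len(qw), len(qw[0])
    out = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            for kb in range(0, k_dim, block):
                partial = sum(qx[m][k] * qw[k][n]
                              for k in range(kb, min(kb + block, k_dim)))
                out[m][n] += partial * sx[m][kb // block] * sw[(kb // block, n // block)]
    return out
```

The point of the sketch is that scales cannot be applied once at the end of the matmul: each K-block's partial sum is rescaled by its own activation and weight scales before being accumulated.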

github-actions · 2025-01-30T20:30:01Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of the following:

- Add the ready label to the PR.
- Enable auto-merge.

🚀

Signed-off-by: Tyler Michael Smith <[email protected]>

vllm/model_executor/layers/quantization/utils/fp8_utils.py

mgoin · 2025-01-31T22:44:09Z

Confirmed accuracy with a gsm8k eval:

VLLM_MLA_DISABLE=1 lm_eval --model vllm --model_args pretrained=/data/nm/models/DeepSeek-R1,trust_remote_code=True,tensor_parallel_size=8,max_model_len=10000 --tasks gsm8k --num_fewshot 5 --batch_size auto
...
Processed prompts: 100%|██████████| 1319/1319 [05:03<00:00,  4.35it/s, est. speed input: 3791.39 toks/s, output: 446.01 toks/s]
Running generate_until requests: 100%|███████████| 1319/1319 [05:03<00:00,  4.34it/s]
vllm (pretrained=/data/nm/models/DeepSeek-R1,trust_remote_code=True,tensor_parallel_size=8,max_model_len=10000), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9522|±  |0.0059|
|     |       |strict-match    |     5|exact_match|↑  |0.9522|±  |0.0059|
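As a sanity check on the table above, the reported Stderr is consistent with a simple binomial standard error for 1319 graded problems. The sketch below assumes exact_match is a mean of 0/1 outcomes. The exact estimator used by the eval harness is not stated in this thread, so treat the formula as an assumption.

```python
import math

n_problems = 1319      # from the progress bar in the log above
exact_match = 0.9522   # reported value for both filters

# Binomial standard error of a proportion (assumed estimator).
stderr = math.sqrt(exact_match * (1.0 - exact_match) / n_problems)

# Approximate number of correctly answered problems implied by the score.
n_correct = round(exact_match * n_problems)

print(f"stderr ~ {stderr:.4f}, correct ~ {n_correct} of {n_problems}")
```

Rounded to four decimals, this agrees with the ±0.0059 shown in the table.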

simon-mo · 2025-01-31T23:01:57Z

Confirmed on a TP8PP2 setting (tensor-parallel size 8, pipeline-parallel size 2).

Before Run 1:
Throughput: 0.33 requests/s, 1632.43 total tokens/s, 326.49 output tokens/s

Before Run 2:
Throughput: 0.32 requests/s, 1587.24 total tokens/s, 317.45 output tokens/s

This PR:
Throughput: 0.35 requests/s, 1735.48 total tokens/s, 347.10 output tokens/s
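To put the numbers above in relative terms, the following sketch averages the two "before" runs and compares them with the single run on this PR. The variable names are illustrative, and with only three runs in total the result is a rough indication rather than a measured distribution.

```python
# Throughput figures copied from the runs above (tokens/s).
before_total = [1632.43, 1587.24]
before_output = [326.49, 317.45]
pr_total = 1735.48
pr_output = 347.10

def percent_gain(new, baseline_runs):
    """Percent improvement of `new` over the mean of the baseline runs."""
    baseline = sum(baseline_runs) / len(baseline_runs)
    return (new / baseline - 1.0) * 100.0

gain_total = percent_gain(pr_total, before_total)
gain_output = percent_gain(pr_output, before_output)

print(f"total tokens/s gain:  {gain_total:.1f}%")
print(f"output tokens/s gain: {gain_output:.1f}%")
```

Both metrics work out to an improvement of about 7.8% over the mean of the two baseline runs.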

…DeepSeekV3 (vllm-project#12587) Integrates the block-quantized kernels introduced in vllm-project#11868 for use in linear layers. Signed-off-by: Tyler Michael Smith <[email protected]> Signed-off-by: Isotr0py <[email protected]>

mergify bot added the ci/build label Jan 30, 2025

tlrmchlsmth marked this pull request as ready for review January 31, 2025 21:15

tlrmchlsmth requested review from mgoin and robertgshaw2-redhat as code owners January 31, 2025 21:15

tlrmchlsmth changed the title ~~[Kernel][Quantization] Integrate block cutlass~~ [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 Jan 31, 2025

integrate block-quantized cutlass kernels

f9e3256

Signed-off-by: Tyler Michael Smith <[email protected]>

tlrmchlsmth force-pushed the integrate_block_cutlass branch from bbf58e5 to f9e3256 Compare January 31, 2025 21:34

simon-mo approved these changes Jan 31, 2025

simon-mo added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jan 31, 2025

mgoin reviewed Jan 31, 2025

vllm/model_executor/layers/quantization/utils/fp8_utils.py (resolved)

mgoin approved these changes Jan 31, 2025

simon-mo merged commit eb5741a into vllm-project:main Jan 31, 2025
51 of 70 checks passed

houseroad mentioned this pull request Mar 14, 2025

[Bugfix][W8A8] fixed cutlass block fp8 binding #14796

Merged
