
Conversation

@dsikka (Contributor) commented Sep 5, 2024

Summary

  • Add GPTQ Marlin MoE support; the Marlin MoE kernels currently support int4
  • Update/add optional testing for large MoE models for GPTQ and llm-compressor

Co-authored by @ElizaWszola from Neural Magic
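
A minimal usage sketch (the checkpoint is the GPTQ Mixtral model benchmarked below; the prompt and sampling settings are illustrative):

from vllm import LLM, SamplingParams

# Load a 4-bit GPTQ MoE checkpoint; with this PR, the expert layers run
# through the Marlin MoE kernels instead of the slower pre-existing path.
llm = LLM(model="TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ", tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["What is a mixture-of-experts model?"], params)
print(out[0].outputs[0].text)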

@github-actions bot commented Sep 5, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which exercises a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can do one of the following:

  • Add the ready label to the PR
  • Enable auto-merge

🚀

@dsikka dsikka marked this pull request as ready for review September 6, 2024 15:17
@robertgshaw2-redhat robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 6, 2024
@dsikka dsikka requested a review from mgoin September 9, 2024 17:38
@mgoin (Member) left a comment

I need to run a quick test on it myself, but this looks good to land for 4-bit support!

@mgoin (Member) commented Sep 9, 2024

Performance looks great!

python benchmarks/benchmark_latency.py --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --input-len 128 --output-len 512 --batch-size 1 --num-iters-warmup 2 --num-iters 10
Avg latency: 5.879553547129035 seconds

python benchmarks/benchmark_latency.py --model nm-testing/Mixtral-8x7B-Instruct-v0.1-W4A16-quantized --tensor-parallel-size 2 --input-len 128 --output-len 512 --batch-size 1 --num-iters-warmup 2 --num-iters 10
Avg latency: 4.726654114946723 seconds

python benchmarks/benchmark_latency.py --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --tensor-parallel-size 2 --input-len 128 --output-len 512 --batch-size 1 --num-iters-warmup 2 --num-iters 10
Avg latency: 4.787863119132817 seconds

Before this PR, GPTQ Mixtral would be much slower:

python benchmarks/benchmark_latency.py --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --tensor-parallel-size 2 --input-len 128 --output-len 512 --batch-size 1 --num-iters-warmup 2 --num-iters 10
Avg latency: 8.206900223530829 seconds
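
From the numbers above, a quick sanity check of the improvement (latencies copied from the runs reported in this comment):

before = 8.206900223530829  # avg latency (s), GPTQ Mixtral before this PR
after = 4.787863119132817   # avg latency (s), GPTQ Mixtral with Marlin MoE
print(f"speedup: {before / after:.2f}x")  # ~1.71x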

@mgoin mgoin merged commit 6cd5e5b into vllm-project:main Sep 10, 2024
@fengyang95 commented

Does this support deepseek-v2?

@xiaoqi35 commented

Thanks! That's an important feature for deepseek-v2: quantized deepseek-v2 models need a fused MoE that supports int4 quantization. Is AWQ quantization also supported?

dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Sep 12, 2024
@ElizaWszola (Contributor) commented

Sonnet benchmark results (no act order, 4-bit):

# 4-bit quantized MoE without act order
llm = LLM(model="TheBloke/Mixtral-8x7B-v0.1-GPTQ")
[Benchmark plots: vLLM TTFT and TPOT for Mixtral 8x7B, 4-bit, three runs each]
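
For context, a sketch of how TTFT/TPOT numbers like these are typically gathered with vLLM's serving benchmark against the sonnet dataset (exact flags are assumptions and may vary by version):

# terminal 1: start an OpenAI-compatible server with the quantized model
python -m vllm.entrypoints.openai.api_server --model TheBloke/Mixtral-8x7B-v0.1-GPTQ --tensor-parallel-size 2

# terminal 2: run the serving benchmark with the sonnet dataset
python benchmarks/benchmark_serving.py --backend vllm --model TheBloke/Mixtral-8x7B-v0.1-GPTQ --dataset-name sonnet --dataset-path benchmarks/sonnet.txt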

@ElizaWszola (Contributor) commented Sep 18, 2024

Thanks! That's an important feature for deepseek-v2: quantized deepseek-v2 models need a fused MoE that supports int4 quantization. Is AWQ quantization also supported?

AWQ is currently in our future work scope!

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
@mgoin mgoin mentioned this pull request Feb 17, 2025
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025