Integrate fused Mixtral MoE with Marlin kernels #7079
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these: … 🚀
/unready
Refactoring for maintainability
@dsikka I've added some …
```python
    expert_id: int,
    is_gptq: bool = False,
):
    if is_gptq:
```
We'd want to use the weight loading functionality already present.
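A rough sketch of this suggestion, with hypothetical names (this is not the actual vLLM implementation): instead of branching on an `is_gptq` flag inside the MoE layer, the loader could delegate to the `weight_loader` already attached to each parameter by the quantization method.

```python
# Hypothetical sketch of the review suggestion -- not the actual vLLM code.
import torch

def load_expert_weight(param: torch.nn.Parameter,
                       loaded_weight: torch.Tensor,
                       expert_id: int) -> None:
    # Reuse the per-parameter loader installed by the quantization method
    # (e.g. GPTQ/Marlin) instead of special-casing is_gptq here.
    weight_loader = getattr(param, "weight_loader", None)
    if weight_loader is not None:
        weight_loader(param, loaded_weight)
    else:
        # Fallback: plain copy into this expert's slice of the fused parameter.
        param.data[expert_id].copy_(loaded_weight)
```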
| "MistralForCausalLM": ("llama", "LlamaForCausalLM"), | ||
| "MixtralForCausalLM": ("mixtral", "MixtralForCausalLM"), | ||
| "QuantMixtralForCausalLM": ("mixtral_quant", "MixtralForCausalLM"), | ||
| "QuantMixtralForCausalLM": ("mixtral", "MixtralForCausalLM"), |
We'd want `mixtral_quant` by default.
```python
gate_down_up = [
    ckpt_gate_proj_name, ckpt_down_proj_name, ckpt_up_proj_name
]
return ([
```
Can we leverage what already exists?
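For illustration, the quoted snippet appears to build a mapping from per-expert checkpoint weight names (`gate_proj`/`down_proj`/`up_proj`) to the fused MoE parameters. A minimal sketch under assumed parameter names (`w13` for the fused gate-up projection, `w2` for the down projection):

```python
# Illustrative sketch only; the fused parameter names are assumptions.
def make_expert_params_mapping(ckpt_gate_proj_name: str,
                               ckpt_down_proj_name: str,
                               ckpt_up_proj_name: str,
                               num_experts: int):
    """Map each expert's checkpoint weight name to its fused parameter."""
    mapping = []
    for expert_id in range(num_experts):
        for ckpt_name in (ckpt_gate_proj_name, ckpt_down_proj_name,
                          ckpt_up_proj_name):
            # gate_proj and up_proj load into the fused w13 parameter;
            # down_proj loads into w2.
            fused = "w2" if ckpt_name == ckpt_down_proj_name else "w13"
            mapping.append((f"experts.{fused}_weight",
                            f"experts.{expert_id}.{ckpt_name}.",
                            expert_id))
    return mapping
```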
This PR's functionality has been implemented through PRs #8217 and #8032. I'm closing it.
Reimplement quantized Mixtral to combine Marlin kernels with fused MoE.
This PR rewrites the Mixtral model to run a modified Marlin kernel that takes advantage of the `fused_moe` functionality. The C++ code takes in all expert data and the `topk_ids` tensor. It runs a kernel to compute the `sorted_ids` offsets related to each expert, and then feeds them to the Marlin kernels. The Marlin kernels are run multiple times, once per expert, using the current expert number to figure out the current position inside `sorted_ids` and the number of tokens to process in each particular call. The values of `sorted_ids` are then used to indirectly access the rows of the input/output `A`/`C` tensors. If the rows of input `A` are identical for each of the `topk` experts that access them (first MMM of fused MoE), tensor `A` consists of `M x K` elements, with each row being accessed `topk` times by the relevant experts. Otherwise (second MMM of fused MoE), `A` consists of `M x topk x K` elements, with each row being accessed once. (A sketch of the expert-sorting step appears at the end of this description.)

Unit testing:
End-to-end testing:
Run `offline_inference.py` with …

Sonnet benchmark results (no act order, 4-bit):
Sonnet benchmark results (with act order, 8-bit):
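To make the kernel description concrete, here is a minimal Python sketch of the expert-sorting step referenced above, with assumed names (the actual implementation is C++/CUDA): `sorted_ids` groups token rows by expert so each Marlin launch reads a contiguous slice.

```python
import torch

def sort_tokens_by_expert(topk_ids: torch.Tensor, num_experts: int):
    """topk_ids: (M, topk) expert assignments per token.

    Returns flattened row indices grouped by expert, plus per-expert
    offsets, so expert e owns sorted_ids[offsets[e]:offsets[e + 1]].
    """
    flat = topk_ids.flatten()                      # (M * topk,)
    sorted_ids = torch.argsort(flat, stable=True)  # rows grouped by expert
    counts = torch.bincount(flat, minlength=num_experts)
    offsets = torch.zeros(num_experts + 1, dtype=torch.long)
    offsets[1:] = torch.cumsum(counts, dim=0)
    return sorted_ids, offsets

# Example: 2 tokens, topk = 2, 3 experts.
topk_ids = torch.tensor([[0, 2], [1, 0]])
sorted_ids, offsets = sort_tokens_by_expert(topk_ids, num_experts=3)
# flat = [0, 2, 1, 0] -> sorted_ids = [0, 3, 2, 1], offsets = [0, 2, 3, 4]
# First MMM: position i maps to row sorted_ids[i] // topk of the M x K input A.
# Second MMM: position i maps to row sorted_ids[i] of the (M * topk) x K input A.
```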