[Kernel] W8A16 Int8 inside FusedMoE #7415
Merged
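For context, W8A16 here means the MoE expert weights are stored as int8 while activations stay in 16-bit floating point, with the fused MoE kernel dequantizing weights on the fly. Below is a minimal PyTorch sketch of that numeric recipe, not the PR's Triton kernel; the function names and the symmetric per-channel scheme are illustrative assumptions.

```python
import torch

def quantize_w8a16(weight: torch.Tensor):
    # Symmetric per-output-channel quantization: one fp scale per row
    # of a [out, in] weight matrix (assumed scheme, for illustration).
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    qweight = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return qweight, scale

def w8a16_matmul(x: torch.Tensor, qweight: torch.Tensor, scale: torch.Tensor):
    # Dequantize int8 weights to the activation dtype, then matmul in 16-bit.
    w = qweight.to(x.dtype) * scale.to(x.dtype)
    return x @ w.t()

w = torch.randn(64, 32)                       # fp32 reference weight
qw, s = quantize_w8a16(w)
x = torch.randn(4, 32, dtype=torch.bfloat16)  # activations stay 16-bit
y = w8a16_matmul(x, qw, s)
print(y.shape, y.dtype)                       # torch.Size([4, 64]) torch.bfloat16
```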
Commits (33; changes shown from 25)
- 6b834a3 Add experts int8 config (mzusman)
- afddd3b Add support in fusedmoe (mzusman)
- 289367a Add experts int8 to quantization list (mzusman)
- 084405e Remove logger (mzusman)
- 0c690fe Add to optimized quantization (mzusman)
- 3100490 Format (mzusman)
- 413400c Add startup test for experts_int8 (mzusman)
- 9e7bc79 Typo (mzusman)
- 1ebb5d7 Add test (mzusman)
- 44a72d6 Change compute capability to 80 (mzusman)
- 39660ca Format (mzusman)
- a097b6e Disable for CPU (mzusman)
- c12635c Add use_int8 to the moe benchmarks (mzusman)
- 9436034 Use JambaMoE to implement MLP (mzusman)
- 4b712e4 Use MoE to implement MLP (mzusman)
- 3b6967e Format (mzusman)
- 5f5b11e Fix (mzusman)
- e199b17 Move experts_int8 to quantization subdir and add is quant method (mzusman)
- 9c47ad0 Split if else in benchmark moe (mzusman)
- 97f0585 Rename use_int8 to use_int8_w8a16, use_fp8 to use_fp8_w8a8 (mzusman)
- 0025459 Reverse order (mzusman)
- a1d75cb Change dtype in configs filename (mzusman)
- 505e3d3 Single function to get dtype config name (mzusman)
- 80d977c Align experts int8 apply with fp8 (mzusman)
- 1c403be Align with upstream (mzusman)
- 744ecd4 Format (mzusman)
- a5bf0b3 Change fp8 to fp8_w8a8 (mzusman)
- 1c7e689 Correct the args (mzusman)
- e438b84 Remove experts int8 from ignore cpu (mzusman)
- c23a2f4 Fix typo (mzusman)
- 7e619c7 Fix Jamba tests since MLP layer is not aligned with HF (mzusman)
- 70a6598 Merge remote-tracking branch 'github/main' into expert_int8_upstream (mzusman)
- 4d6c546 Merge remote-tracking branch 'github/main' into expert_int8_upstream (mzusman)
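Several commits above (the use_int8 to use_int8_w8a16 rename, aligning the apply path with fp8) concern how int8 expert weights are consumed inside the fused MoE kernel. As a rough mental model only, here is a naive per-expert loop in plain PyTorch, assuming hypothetical per-expert int8 weights with per-channel scales and simple top-1 routing; the real kernel fuses routing and dequantization in Triton.

```python
import torch

def naive_moe_w8a16(x, topk_ids, qweights, scales):
    """x: [tokens, hidden] bf16; topk_ids: [tokens] routed expert per token;
    qweights: [experts, out, hidden] int8; scales: [experts, out, 1] fp32.
    All shapes and names are illustrative assumptions, not the PR's API."""
    out = torch.zeros(x.shape[0], qweights.shape[1], dtype=x.dtype)
    for e in range(qweights.shape[0]):
        mask = topk_ids == e
        if mask.any():
            # Dequantize this expert's weights, then matmul its routed tokens.
            w = qweights[e].to(x.dtype) * scales[e].to(x.dtype)
            out[mask] = x[mask] @ w.t()
    return out

x = torch.randn(8, 32, dtype=torch.bfloat16)
topk_ids = torch.randint(0, 4, (8,))
qweights = torch.randint(-128, 128, (4, 64, 32), dtype=torch.int8)
scales = torch.rand(4, 64, 1)
print(naive_moe_w8a16(x, topk_ids, qweights, scales).shape)  # [8, 64]
```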
New test file added by the PR (+28 lines), a startup test for experts_int8 quantization:

```python
# flake8: noqa
"""Tests experts_int8 quantization startup and generation,
doesn't test correctness
"""
from tests.quantization.utils import is_quant_method_supported

import pytest

MODELS = ["ai21labs/Jamba-tiny-random"]


@pytest.mark.skipif(not is_quant_method_supported("experts_int8"),
                    reason="ExpertsInt8 is not supported on this GPU type.")
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["bfloat16"])
@pytest.mark.parametrize("max_tokens", [10])
def test_model_experts_int8_startup(
    hf_runner,
    vllm_runner,
    example_prompts,
    model: str,
    dtype: str,
    max_tokens: int,
) -> None:

    with vllm_runner(model, dtype=dtype,
                     quantization="experts_int8") as vllm_model:
        vllm_model.generate_greedy(example_prompts, max_tokens)
```
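The same flag should be reachable from vLLM's standard offline entry point. A hedged usage sketch, with the model name and quantization string taken from the test above and everything else being the stock vLLM API:

```python
from vllm import LLM, SamplingParams

# Load the model with MoE expert weights quantized to int8 at load time.
llm = LLM(model="ai21labs/Jamba-tiny-random",
          dtype="bfloat16",
          quantization="experts_int8")
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=10))
print(outputs[0].outputs[0].text)
```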