[Bugfix][Quantization] Fix FP8 + EP #13784
Conversation
Signed-off-by: Tyler Michael Smith <[email protected]>
mgoin left a comment
Nice catch
mgoin left a comment
Lovely, thanks!
Signed-off-by: Tyler Michael Smith <[email protected]>
Signed-off-by: Louis Ulmer <[email protected]>
`Fp8MoEMethod` is reaching into the `layer` to get the number of experts during `process_weights_after_loading`, which isn't right when using expert parallelism. Running

`VLLM_TEST_ENABLE_EP=1 vllm serve neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8 --tensor-parallel-size 2`

would hit the following error:
I changed `num_experts` to `global_num_experts` and added a `local_num_experts` as well, and fixed similar spots in `quark_moe.py` and `compressed_tensors_moe.py`. I also fixed up a couple of other spots where we are looking at the layer's `num_experts` for heuristics:

vllm/vllm/model_executor/layers/quantization/gptq_marlin.py
Line 156 in 1f0ae3e
vllm/vllm/model_executor/layers/quantization/awq_marlin.py
Line 139 in 1f0ae3e