[Bugfix] Allow skipping MoE in NVFP4 (fix for MTP) #25987
Conversation
Code Review
This pull request introduces a bugfix to allow skipping Mixture-of-Experts (MoE) layers during NVFP4 quantization, which is crucial for models like nvidia/DeepSeek-R1-FP4 when using Multi-Token Prediction (MTP).
The main changes are:
- In `vllm/model_executor/layers/quantization/modelopt.py`, `ModelOptNvFp4Config.get_quant_method` now checks whether an MoE layer is in the exclusion list and returns `None` if so.
- In `vllm/model_executor/layers/fused_moe/layer.py`, the `FusedMoE` layer's `__init__` method handles a `None` return from `get_quant_method` by falling back to the unquantized method, effectively skipping quantization for that layer (see the sketch after this summary).
- Several related changes in `deepseek_v2.py`, `deepseek_mtp.py`, and `deepseek_eagle.py` refactor how the model configuration is passed to `DeepseekV2DecoderLayer` to correctly support draft models in speculative decoding scenarios.
The changes are well-structured and correctly address the identified issue. The refactoring for config propagation is clean and necessary. The overall implementation looks solid.
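As a minimal, self-contained sketch of the skip-and-fallback flow described above: the class names, signatures, and glob-style matching here are simplified assumptions for illustration, not the actual vLLM implementation.

```python
# Hedged sketch of the skip-and-fallback flow; the real vLLM classes take
# many more arguments and return real quant-method objects.
import fnmatch
from typing import Optional


class UnquantizedFusedMoEMethod:
    """Stand-in for vLLM's unquantized fused-MoE implementation."""


class NvFp4FusedMoEMethod:
    """Stand-in for the NVFP4-quantized fused-MoE implementation."""


class ModelOptNvFp4Config:
    def __init__(self, exclude_modules: list[str]):
        # Patterns such as "model.layers.61*" come from hf_quant_config.json.
        self.exclude_modules = exclude_modules

    def _is_excluded(self, prefix: str) -> bool:
        # Glob-style match of the layer name against the exclusion patterns.
        return any(fnmatch.fnmatch(prefix, pat) for pat in self.exclude_modules)

    def get_quant_method(self, layer: object, prefix: str) -> Optional[object]:
        # Returning None signals "do not quantize this layer", so the caller
        # can fall back instead of failing on excluded MoE layers.
        if self._is_excluded(prefix):
            return None
        return NvFp4FusedMoEMethod()


class FusedMoE:
    def __init__(self, quant_config: Optional[ModelOptNvFp4Config], prefix: str):
        quant_method = (
            quant_config.get_quant_method(self, prefix) if quant_config else None
        )
        # Fall back to the unquantized method when quantization is skipped.
        self.quant_method = quant_method or UnquantizedFusedMoEMethod()


# The MTP layer of nvidia/DeepSeek-R1-FP4 is excluded and stays unquantized.
cfg = ModelOptNvFp4Config(exclude_modules=["model.layers.61*"])
mtp_moe = FusedMoE(cfg, prefix="model.layers.61.mlp.experts")
assert isinstance(mtp_moe.quant_method, UnquantizedFusedMoEMethod)
```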
mgoin left a comment:
Looks reasonable to me, thanks for the fix
@benchislett The basic model failure seems related.
@benchislett please merge with main to fix the docker.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
There is no fallback in `ModelOptNvFp4Config.get_quant_method` for the case where the quant config should skip an MoE layer. This is a problem for nvidia/DeepSeek-R1-FP4 when running with MTP, since the entire MTP layer is left unquantized and should be skipped by quantization: https://huggingface.co/nvidia/DeepSeek-R1-FP4/blob/main/hf_quant_config.json#L188
"exclude_modules": [
...
"model.layers.61*",
...
]
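For illustration, assuming the exclusion patterns are matched glob-style against module prefixes (and that layer index 61 corresponds to the MTP module, one past the regular decoder layers 0-60), a hypothetical check of which modules the pattern covers might look like this:

```python
# Hypothetical demonstration of glob-style exclusion matching; the module
# prefixes below are illustrative, not copied from the checkpoint.
import fnmatch

pattern = "model.layers.61*"
for prefix in [
    "model.layers.60.mlp.experts",       # last regular decoder layer: quantized
    "model.layers.61.mlp.experts",       # MTP layer MoE: excluded from quantization
    "model.layers.61.self_attn.q_proj",  # MTP layer attention: also excluded
]:
    state = "excluded" if fnmatch.fnmatch(prefix, pattern) else "quantized"
    print(f"{prefix}: {state}")
```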
This PR includes some diff from #25953.
Testing
Evaluated in combination with #25984, see results there.