Notifications
[Bug] R1 Accuracy: Fix routed_scaling_factor Double Mul Issue
#24119
Conversation
Signed-off-by: yewentao256 <[email protected]>
Code Review
This pull request fixes a bug in the DeepseekV2MoE forward pass where routed_scaling_factor was applied twice for non-fp16 data types. The self.experts layer (FusedMoE) already applies this scaling factor internally, so removing the redundant outer multiplication makes the code correct. I have also added a review comment highlighting a potential related scaling issue in the handling of shared_output that may require further investigation.
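The double-application pattern described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual vLLM code: the point is that when the experts layer already folds the factor into the top-k weights, an extra multiplication at the call site scales the output by the factor twice.

```python
import torch

def experts(hidden_states, topk_weights, routed_scaling_factor):
    # Stand-in for FusedMoE: the factor is already folded into the
    # top-k weights internally, as in fused_moe.py.
    topk_weights = topk_weights * routed_scaling_factor
    return hidden_states * topk_weights.sum(dim=-1, keepdim=True)

hidden = torch.ones(2, 4)
weights = torch.full((2, 2), 0.5)
factor = 2.5

out = experts(hidden, weights, factor)
buggy = out * factor  # redundant outer multiply: scaled by `factor` twice
fixed = out           # the fix: rely on the internal scaling only

print(torch.allclose(buggy, fixed * factor))  # the buggy path is `factor`x too large
```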
final_hidden_states = self.experts(hidden_states=hidden_states,
                                   router_logits=router_logits)
This is handled here right?
vllm/vllm/model_executor/layers/fused_moe/fused_moe.py
Lines 1014 to 1015 in 98aee61
if routed_scaling_factor != 1.0:
    topk_weights = topk_weights * routed_scaling_factor
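Because the combine step is linear in the weights, scaling topk_weights (as the snippet above does inside FusedMoE) is mathematically equivalent to scaling the combined expert output, which is why the outer multiply at the call site is redundant. A small numeric check of that equivalence, with hypothetical shapes and names:

```python
import torch

torch.manual_seed(0)
expert_out = torch.randn(4, 8, 2)   # (tokens, hidden_dim, top-k experts)
topk_weights = torch.rand(4, 2)     # per-token weights for the 2 routed experts
factor = 2.5

# Option A: scale the top-k weights before combining (internal scaling).
a = torch.einsum("the,te->th", expert_out, topk_weights * factor)
# Option B: combine first, then scale the output (the removed outer multiply).
b = torch.einsum("the,te->th", expert_out, topk_weights) * factor

print(torch.allclose(a, b))  # equivalent, so doing both scales by `factor` twice
```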
Why does main not apply the routed_scaling_factor in the fp16 case? Is there an accuracy issue we need to handle?
Yes, so I chose a different way to fix it, since DeepSeek V2 has a lot of logic for routed_scaling_factor outside the experts layer.
Signed-off-by: yewentao256 <[email protected]>
sarckk
left a comment
thanks for the fix
I think we also need this fix for dots1 and glm4_moe? Both use use_grouped_topk=True, so they should be affected by the double-mul issue as well:
vllm/vllm/model_executor/models/dots1.py
Line 163 in 98aee61
router_logits=router_logits) * self.routed_scaling_factor
vllm/vllm/model_executor/models/glm4_moe.py
Line 189 in 98aee61
router_logits=router_logits) * self.routed_scaling_factor
Signed-off-by: yewentao256 <[email protected]>
@sarckk Nice find! Feel free to open a PR for this.
houseroad
left a comment
Thanks for the fix.
@yewentao256 Thanks for the fix. BTW, can you please add a test to our CI? I think we've seen similar accuracy issues 5+ times on DeepSeek...
Besides running the full DeepSeek model, is it possible to run a smaller model to detect such issues, like DeepSeek V2, etc.?
* 'main' of https://github.com/845473182/vllm: (457 commits)
  - [BugFix] Fix routed_scaling_factor double mul for dots1 and glm4 MoE models (vllm-project#24132)
  - [Misc] Add check for dual_chunk_attention (vllm-project#24070)
  - [Doc]: fix typos in Python comments (vllm-project#24115)
  - [Doc]: fix typos in Python comments (vllm-project#24093)
  - [Compile] Fix Compile Warning for `w4a8_mm_entry.cu` (vllm-project#23660)
  - fix some typos (vllm-project#24071)
  - [V1] Wrapper which plumbs request-level logits processors into vLLM batch-level logits processing (vllm-project#23656)
  - Upgrade xgrammar to 0.1.23 (vllm-project#22988)
  - Update release pipeline post PyTorch 2.8.0 update (vllm-project#24073)
  - [XPU] Fix the bug of LoRA logits on the XPU platform (vllm-project#24081)
  - [CI/Build] Disable SiluMul NVFP4 quant fusion tests (vllm-project#24121)
  - [Bug] R1 Accuracy: Fix `routed_scaling_factor` Double Mul Issue (vllm-project#24119)
  - [AMD][Kernel][Bugfix] Cast offsets tensor bn to tl.int64 to avoid GPU segfault (vllm-project#23692)
  - [CI] Enable all hf transformers baselines in test_hybrid (vllm-project#23936)
  - [Log] Only Print Profiler Results on Rank 0 (vllm-project#23370)
  - Fix weights loading for Apertus (vllm-project#24100)
  - [Metrics] Deprecate TPOT in favor of ITL (vllm-project#24110)
  - [Bugfix] Fix packed_factor missing attribute error (vllm-project#23902)
  - Run ruff format on a few files. (vllm-project#24075)
  - [Bugfix] Fix transform_config parsing in Compressed Tensors (vllm-project#23945)
  - ...
…-project#24119) Signed-off-by: yewentao256 <[email protected]>
Purpose
Fixes #24118 that is introduced by #23123
Test
lm_eval --model local-completions --model_args "base_url=http://127.0.0.1:9256/v1/completions,model=deepseek-ai/DeepSeek-R1,num_concurrent=256" --tasks gsm8k

Origin

Now