[Bugfix] Fix FP16 overflow for DeepSeek V2 #13232
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀
Force-pushed from b825ae2 to eadf068
An alternative to this would be to just reject and raise an error when the user sets
Customer requirement due to performance on ROCm
Can you please add a comment about why we have this case for `torch.float16`, similar to what is written in the PR description (e.g. we have this special case to avoid overflow on 300X)?
So if I am following this right, DeepSeekV2 has the following structure:

- the first layer uses a dense MLP (`"first_k_dense_replace": 1`)
- all other layers use fused MoE

After the first dense MLP, we do

```python
hidden = hidden / scaling_factor
residual = residual / scaling_factor
```

And then for all the other layers, we have:

Before the change:

```python
hidden = layernorm(hidden)
hidden = self_attn(hidden)
hidden = layernorm(hidden)
router_logits, _ = gate(hidden_states)
final_hidden = experts(hidden, router_logits) * scaling_factor
if shared_experts:
    shared_output = shared_experts(hidden)
    final_hidden = final_hidden + shared_output
final_hidden = all_reduce(final_hidden)
return final_hidden
```

After the change:

```python
hidden = layernorm(hidden)
hidden = self_attn(hidden)
hidden = hidden / scaling_factor
hidden = layernorm(hidden)
router_logits, _ = gate(hidden_states)
final_hidden = experts(hidden, router_logits)
if shared_experts:
    shared_output = shared_experts(hidden)
    final_hidden = final_hidden + shared_output / scaling_factor
final_hidden = all_reduce(final_hidden)
return final_hidden
```

I don't quite follow why these should give equivalent results.
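The key to the equivalence is that RMSNorm is invariant to a uniform rescaling of its input, so dividing the hidden states by the scaling factor before the norm does not change what the attention and expert layers see. A minimal numpy sketch (where `rms_norm` is a hypothetical stand-in for the model's layernorm, omitting the learned weight):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Hypothetical stand-in for the model's RMSNorm, without the learned weight.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal(64) * 100.0  # activations large enough that eps is negligible
s = 16.0  # routed_scaling_factor for DeepSeek V2

# Dividing the input by a constant does not change the RMSNorm output.
assert np.allclose(rms_norm(x), rms_norm(x / s), atol=1e-4)
```

Because of this invariance, moving the division before the layernorm only changes the magnitude of the values flowing through FP16, not the information they carry.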
Comment added. The overflow occurs on NVIDIA as well; it exceeds the max value of FP16.
@Concurrensee - can you just briefly explain why the results should be equivalent?
It makes sense, since scaling_factor=16 for this model
Before this code block, in `DeepseekV2DecoderLayer`, we have

```python
if isinstance(self.mlp, DeepseekV2MoE) and \
        hidden_states.dtype == torch.float16:
    # This is a special case to avoid FP16 overflow
    hidden_states *= 1. / self.routed_scaling_factor
```

We scale `hidden_states` so that the residual and the hidden states stay on the same scale. Because the following layernorm is invariant to a uniform rescaling of its input, the data that goes through the layernorm has the same value as in the original model.
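To see the end-to-end equivalence, a numpy sketch with hypothetical linear stand-ins for the routed and shared experts shows the new ordering equals the old one scaled by 1/s (the names `experts`, `shared`, and `rms_norm` below are illustrative, not the real vLLM modules):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Stand-in for the model's layernorm (RMSNorm without the learned weight).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
dim = 64
s = 16.0  # routed_scaling_factor in DeepSeek V2

# Hypothetical linear stand-ins for the routed experts and shared experts.
W_experts = rng.standard_normal((dim, dim)) * 0.1
W_shared = rng.standard_normal((dim, dim)) * 0.1
experts = lambda x: x @ W_experts
shared = lambda x: x @ W_shared

h = rng.standard_normal(dim) * 1000.0  # large activations, as in the failing case

# Before the change: the routed-expert output is multiplied up by s.
before = experts(rms_norm(h)) * s + shared(rms_norm(h))
# After the change: the hidden state was pre-divided by s,
# and only the shared-expert output is divided by s.
after = experts(rms_norm(h / s)) + shared(rms_norm(h / s)) / s

# Because rms_norm(h / s) == rms_norm(h), the new path equals the old
# path scaled by 1/s, which is exactly what keeps FP16 in range.
assert np.allclose(after, before / s, atol=1e-4)
```

The whole residual stream after the first dense layer is carried at 1/s of its original magnitude, and the scale cancels again at every subsequent layernorm.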
Force-pushed from eadf068 to 862d46b
looks like the comment should be a few lines down
Current:

```python
if hidden_states.dtype != torch.float16:
    # This is a special case to avoid FP16 overflow
    final_hidden_states = final_hidden_states + shared_output
else:
    final_hidden_states = final_hidden_states + shared_output \
        * (1. / self.routed_scaling_factor)
```

Suggested:

```python
if hidden_states.dtype != torch.float16:
    final_hidden_states = final_hidden_states + shared_output
else:
    # This is a special case to avoid FP16 overflow
    final_hidden_states = final_hidden_states + shared_output \
        * (1. / self.routed_scaling_factor)
```
Yes, I have moved the comment to the right place.
Signed-off-by: Yida Wu <[email protected]>
Force-pushed from 862d46b to 09fd67f
mgoin
left a comment
LGTM, thanks for the discussion.
@Concurrensee @mgoin
This PR fixes a DeepSeek V2 FP16 overflow issue.
When dtype is set to fp16 (vLLM casts the model from bf16 to fp16), the model's intermediate values overflow the maximum of FP16, causing the model to produce garbage. The fix is mathematically equivalent to the original model and only applies when dtype is fp16. It does not affect DeepSeek V3/R1, since those are native FP8 models.
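The overflow itself is easy to reproduce in isolation: FP16 tops out at 65504, so multiplying an already-large activation by the routed scaling factor of 16 saturates to infinity. A small numpy illustration (the value 5000 is an arbitrary example, not taken from the model):

```python
import numpy as np

# float16 maxes out at 65504; scaling a large activation by 16 overflows to inf.
x = np.float16(5000.0)
scaled = x * np.float16(16.0)   # 80000 > 65504, so this saturates to inf
assert np.isinf(scaled)

# Dividing first keeps the value well inside the representable range.
safe = x / np.float16(16.0)     # 312.5
assert np.isfinite(safe)
```

This is why the fix moves the division by `routed_scaling_factor` earlier in the layer instead of multiplying the expert output up at the end.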