
Conversation

@Concurrensee (Contributor) commented Feb 13, 2025

This PR fixes DeepSeek V2 fp16 overflow issue.

When dtype is set to fp16 (vLLM casts the model from bf16 to fp16), the model's intermediate values overflow the fp16 maximum, causing the model to produce garbage. This PR fixes the issue in a way that is mathematically equivalent to the original model; the fix only applies when dtype is fp16. It does not affect DeepSeek V3/R1, since those are native fp8 models.
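For context, a minimal numpy sketch of the failure mode (the values here are toy magnitudes, not real activations; DeepSeek V2's routed_scaling_factor is 16):

```python
import numpy as np

# fp16 can only represent magnitudes up to ~65504.
fp16_max = float(np.finfo(np.float16).max)  # 65504.0

# Toy value standing in for a routed-experts output.
hidden = np.array([60000.0], dtype=np.float16)
scaling_factor = np.float16(16.0)

# Original order of operations: multiply the experts' output by the
# scaling factor. 60000 * 16 is far above fp16_max, so it becomes inf.
with np.errstate(over="ignore"):
    overflowed = hidden * scaling_factor
assert np.isinf(overflowed).all()

# The fix instead folds a 1/scaling_factor into the computation, keeping
# intermediates inside fp16 range (the next norm layer absorbs the
# uniform rescale, so the math stays equivalent in exact arithmetic).
rescaled = hidden * (np.float16(1.0) / scaling_factor)
assert np.isfinite(rescaled).all()
```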

@github-actions (bot)

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which starts a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@Concurrensee force-pushed the Deepseek_V2_fp16_fix branch 2 times, most recently from b825ae2 to eadf068, on February 13, 2025 17:19
@robertgshaw2-redhat (Collaborator)

An alternative to this would be to just reject and raise an error when the user sets --dtype float16 for this model. Why do we need to support this case?

@gshtras (Collaborator) commented Feb 21, 2025

> An alternative to this would be to just reject and raise an error when the user sets --dtype float16 for this model. Why do we need to support this case?

Customer requirement due to performance on ROCm

@robertgshaw2-redhat (Collaborator) commented Feb 21, 2025

Can you please add a comment about why we have this special case for torch.float16, similar to what is written in the PR description (e.g., that it avoids overflow on 300X)?

@robertgshaw2-redhat (Collaborator) commented Feb 21, 2025

So if I am following this right, DeepSeekV2 has the following structure:

  • the first layer uses Dense MLP ("first_k_dense_replace": 1)
  • all other layers use fused MoE

After the first dense MLP, we do:

```python
hidden = hidden / scaling_factor
residual = residual / scaling_factor
```

And then for all the other layers, we have:

  • Before the change:

```python
hidden = layernorm(hidden)
hidden = self_attn(hidden)
hidden = layernorm(hidden)
router_logits, _ = gate(hidden)
final_hidden = experts(hidden, router_logits) * scaling_factor
if shared_experts:
    shared_output = shared_experts(hidden)
    final_hidden = final_hidden + shared_output
final_hidden = all_reduce(final_hidden)
return final_hidden
```

  • After the change:

```python
hidden = layernorm(hidden)
hidden = self_attn(hidden)
hidden = hidden / scaling_factor
hidden = layernorm(hidden)
router_logits, _ = gate(hidden)
final_hidden = experts(hidden, router_logits)
if shared_experts:
    shared_output = shared_experts(hidden)
    final_hidden = final_hidden + shared_output / scaling_factor
final_hidden = all_reduce(final_hidden)
return final_hidden
```
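The equivalence can be checked symbolically (a sketch assuming exact arithmetic; $s$ stands for `scaling_factor`, $E$ for the routed experts, $S$ for the shared experts). The key fact is that the RMS-style layernorm is invariant to uniformly rescaling its input:

$$\mathrm{norm}(h/s) = \mathrm{norm}(h)$$

so both versions feed the experts the same activations, and

$$\text{after} = E(\mathrm{norm}(h)) + \tfrac{1}{s}\,S(\mathrm{norm}(h)) = \tfrac{1}{s}\Bigl(s\,E(\mathrm{norm}(h)) + S(\mathrm{norm}(h))\Bigr) = \tfrac{1}{s}\cdot\text{before}$$

Since the residual stream is divided by $s$ at the same point, everything downstream is uniformly scaled by $1/s$, which the next scale-invariant norm absorbs; the outputs match while intermediates stay a factor of $s$ smaller.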

I don't quite follow why these should give equivalent results.

Contributor (Author)

Comment added. The overflow occurs on NVIDIA hardware as well; it exceeds the maximum value of FP16.

Collaborator

@Concurrensee - can you just briefly explain why the results should be equivalent?

Collaborator

It makes sense, since scaling_factor=16 for this model

@Concurrensee (Contributor, Author) commented Feb 21, 2025

Before this code block, in `DeepseekV2DecoderLayer`, we have:

```python
if isinstance(self.mlp, DeepseekV2MoE) and \
        hidden_states.dtype == torch.float16:
    # This is a special case to avoid FP16 overflow
    hidden_states *= 1. / self.routed_scaling_factor
```

We scale hidden_states so that residual and hidden_states stay on the same scale, and then exploit the scale invariance of the following layernorm: once both are on the same (reduced) scale, the values coming out of the layernorm are the same as in the original model.
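That scale-invariance property can be checked with a tiny plain-Python RMSNorm (a sketch for illustration, not vLLM code; `rms_norm` and the toy values are made up, and equality holds only up to the small eps term):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Plain-Python RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv * w for v, w in zip(x, weight)]

x = [3.0, -1.5, 4.0, 0.5]
w = [1.0] * len(x)
scaling_factor = 16.0  # DeepSeek V2's routed_scaling_factor

out_original = rms_norm(x, w)
out_scaled = rms_norm([v / scaling_factor for v in x], w)

# Dividing the input by a constant cancels inside the norm (up to eps),
# so the layer after the norm sees the same activations either way.
assert all(abs(a - b) < 1e-4 for a, b in zip(out_original, out_scaled))
```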

Comment on lines 171 to 176
Member

Looks like the comment should be a few lines down.

Suggested change:

```python
# Current:
if hidden_states.dtype != torch.float16:
    # This is a special case to avoid FP16 overflow
    final_hidden_states = final_hidden_states + shared_output
else:
    final_hidden_states = final_hidden_states + shared_output \
        * (1. / self.routed_scaling_factor)

# Suggested:
if hidden_states.dtype != torch.float16:
    final_hidden_states = final_hidden_states + shared_output
else:
    # This is a special case to avoid FP16 overflow
    final_hidden_states = final_hidden_states + shared_output \
        * (1. / self.routed_scaling_factor)
```

Contributor (Author)

Yes, I have moved the comment to the right place.

@mgoin (Member) left a comment

LGTM, thanks for the discussion.

@mgoin added the bug (Something isn't working) and ready (ONLY add when PR is ready to merge/full CI is needed) labels on Mar 10, 2025
@vllm-bot vllm-bot merged commit c982ac5 into vllm-project:main Mar 11, 2025
37 of 45 checks passed
@jinzhen-lin (Contributor) commented Mar 13, 2025

@Concurrensee @mgoin
It seems that this PR only considers the case where first_k_dense_replace == 1. With models that use first_k_dense_replace > 1 (DeepSeek-R1 uses 3), residual *= 1. / self.routed_scaling_factor would run multiple times, leading the model to generate meaningless text. All cases should be handled, including first_k_dense_replace > 1 and first_k_dense_replace == 0.
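The compounding can be illustrated with a toy model that tracks only the residual magnitude (a hypothetical sketch, not vLLM code; `residual_scale` and the boundary condition are illustrative assumptions about the intended single-rescale behavior):

```python
def residual_scale(num_layers, first_k_dense_replace, scaling_factor,
                   rescale_per_dense_layer=False):
    """Track only the magnitude of the residual stream across decoder layers."""
    residual = 1.0
    for layer_idx in range(num_layers):
        if rescale_per_dense_layer:
            # Buggy pattern: the rescale fires once per dense layer, so with
            # first_k_dense_replace > 1 it compounds.
            if layer_idx < first_k_dense_replace:
                residual /= scaling_factor
        else:
            # Assumed intended pattern: rescale exactly once, at the
            # dense -> MoE boundary. Whether first_k_dense_replace == 0
            # should rescale at all is the separate question raised above.
            if layer_idx == first_k_dense_replace:
                residual /= scaling_factor
    return residual

# A single boundary rescale is independent of how many dense layers exist.
assert residual_scale(5, 1, 16.0) == residual_scale(5, 3, 16.0) == 1 / 16.0
# Rescaling in every dense layer compounds: 16**-3 with three dense layers.
assert residual_scale(5, 3, 16.0, rescale_per_dense_layer=True) == 1 / 16.0**3
```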

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025

Labels

bug (Something isn't working), force-merge, ready (ONLY add when PR is ready to merge/full CI is needed)

7 participants