[Bugfix] Fix FP16 overflow for DeepSeek V2 #13232
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀
Force-pushed from b825ae2 to eadf068
An alternative to this would be to just reject and raise an error when the user sets
Customer requirement due to performance on ROCm
Can you please add a comment about why we have this case for `torch.float16`, similar to what is written in the PR description (e.g. we have this special case to avoid overflow on 300X)?
So if I am following this right, DeepSeekV2 has the following structure:

- the first layer uses a dense MLP (`"first_k_dense_replace": 1`)
- all other layers use fused MoE

After the first dense MLP, we do

```python
hidden = hidden / scaling_factor
residual = residual / scaling_factor
```

And then for all the other layers, we have:

Before the change:

```python
hidden = layernorm(hidden)
hidden = self_attn(hidden)
hidden = layernorm(hidden)
router_logits, _ = gate(hidden_states)
final_hidden = experts(hidden, router_logits) * scaling_factor
if shared_experts:
    shared_output = shared_experts(hidden)
    final_hidden = final_hidden + shared_output
final_hidden = all_reduce(final_hidden)
return final_hidden
```

After the change:

```python
hidden = layernorm(hidden)
hidden = self_attn(hidden)
hidden = hidden / scaling_factor
hidden = layernorm(hidden)
router_logits, _ = gate(hidden_states)
final_hidden = experts(hidden, router_logits)
if shared_experts:
    shared_output = shared_experts(hidden)
    final_hidden = final_hidden + shared_output / scaling_factor
final_hidden = all_reduce(final_hidden)
return final_hidden
```

I don't quite follow why these should give equivalent results.
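The key to the equivalence is that RMSNorm is invariant to a uniform rescaling of its input, so dividing the hidden states by the scaling factor before the norm does not change what the attention and expert layers see. A minimal numpy sketch (where `rms_norm` is a hypothetical stand-in for the model's layernorm, omitting the learned weight):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Hypothetical stand-in for the model's RMSNorm, without the learned weight.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal(64) * 100.0  # activations large enough that eps is negligible
s = 16.0  # routed_scaling_factor for DeepSeek V2

# Dividing the input by a constant does not change the RMSNorm output.
assert np.allclose(rms_norm(x), rms_norm(x / s), atol=1e-4)
```

Because of this invariance, moving the division before the layernorm only changes the magnitude of the values flowing through FP16, not the information they carry.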
Comment added. The overflow occurs on NVIDIA as well; it exceeds the max value of FP16.
@Concurrensee - can you just briefly explain why the results should be equivalent?
It makes sense, since scaling_factor=16 for this model
Before this code block, in `DeepseekV2DecoderLayer`, we have

```python
if isinstance(self.mlp, DeepseekV2MoE) and \
        hidden_states.dtype == torch.float16:
    # This is a special case to avoid FP16 overflow
    hidden_states *= 1. / self.routed_scaling_factor
```

We scale `hidden_states` so that the residual and the hidden states stay on the same scale. Because the following layernorm is invariant to a uniform rescaling of its input, the data that goes through the layernorm has the same value as in the original model.
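To see the end-to-end equivalence, a numpy sketch with hypothetical linear stand-ins for the routed and shared experts shows the new ordering equals the old one scaled by 1/s (the names `experts`, `shared`, and `rms_norm` below are illustrative, not the real vLLM modules):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Stand-in for the model's layernorm (RMSNorm without the learned weight).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
dim = 64
s = 16.0  # routed_scaling_factor in DeepSeek V2

# Hypothetical linear stand-ins for the routed experts and shared experts.
W_experts = rng.standard_normal((dim, dim)) * 0.1
W_shared = rng.standard_normal((dim, dim)) * 0.1
experts = lambda x: x @ W_experts
shared = lambda x: x @ W_shared

h = rng.standard_normal(dim) * 1000.0  # large activations, as in the failing case

# Before the change: the routed-expert output is multiplied up by s.
before = experts(rms_norm(h)) * s + shared(rms_norm(h))
# After the change: the hidden state was pre-divided by s,
# and only the shared-expert output is divided by s.
after = experts(rms_norm(h / s)) + shared(rms_norm(h / s)) / s

# Because rms_norm(h / s) == rms_norm(h), the new path equals the old
# path scaled by 1/s, which is exactly what keeps FP16 in range.
assert np.allclose(after, before / s, atol=1e-4)
```

The whole residual stream after the first dense layer is carried at 1/s of its original magnitude, and the scale cancels again at every subsequent layernorm.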
Force-pushed from eadf068 to 862d46b
looks like the comment should be a few lines down
Current:

```python
if hidden_states.dtype != torch.float16:
    # This is a special case to avoid FP16 overflow
    final_hidden_states = final_hidden_states + shared_output
else:
    final_hidden_states = final_hidden_states + shared_output \
        * (1. / self.routed_scaling_factor)
```

Suggested:

```python
if hidden_states.dtype != torch.float16:
    final_hidden_states = final_hidden_states + shared_output
else:
    # This is a special case to avoid FP16 overflow
    final_hidden_states = final_hidden_states + shared_output \
        * (1. / self.routed_scaling_factor)
```
Yes, I have moved the comment to the right place.
Signed-off-by: Yida Wu <[email protected]>
Force-pushed from 862d46b to 09fd67f
mgoin
left a comment
LGTM, thanks for the discussion.
@Concurrensee @mgoin
This PR fixes a DeepSeek V2 FP16 overflow issue.
When dtype is set to fp16 (vLLM casts the model from bf16 to fp16), the model's intermediate values overflow the maximum of FP16, causing the model to produce garbage. The fix is mathematically equivalent to the original model and only applies when dtype is fp16. It does not affect DeepSeek V3/R1, since those are native FP8 models.
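The overflow itself is easy to reproduce in isolation: FP16 tops out at 65504, so multiplying an already-large activation by the routed scaling factor of 16 saturates to infinity. A small numpy illustration (the value 5000 is an arbitrary example, not taken from the model):

```python
import numpy as np

# float16 maxes out at 65504; scaling a large activation by 16 overflows to inf.
x = np.float16(5000.0)
scaled = x * np.float16(16.0)   # 80000 > 65504, so this saturates to inf
assert np.isinf(scaled)

# Dividing first keeps the value well inside the representable range.
safe = x / np.float16(16.0)     # 312.5
assert np.isfinite(safe)
```

This is why the fix moves the division by `routed_scaling_factor` earlier in the layer instead of multiplying the expert output up at the end.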