Description
This issue tracks initial support for the DeepSeek V3 / R1 models with vllm-ascend:
https://huggingface.co/deepseek-ai/DeepSeek-R1
https://huggingface.co/deepseek-ai/DeepSeek-V3
Support Progress
update (2025.03.07): DeepSeek V3 / R1 is now supported!
Please try v0.7.3-dev and refer to the documentation (a minimal client query sketch follows this update):
https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/tutorials.html#online-serving-on-multi-machine
CANN version dependency resolved by #242
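Once the multi-machine online service from the tutorial above is running, it can be smoke-tested through the OpenAI-compatible API. A minimal sketch, assuming the server listens on localhost:8000 and serves the model under the name below (both are placeholders that depend on how the server was launched):

```python
# Minimal sketch: query a running vllm-ascend OpenAI-compatible server.
# The base URL and served model name are assumptions; adjust them to match
# the actual deployment from the tutorial above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```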
update (2025.03.05): we are still waiting for the CANN 8.1.RC1.alpha001 release: https://www.hiascend.com/zh/developer/download/community/result?module=cann
update (2025.02.22): DeepSeek V3 / R1 support will be ready in the next RC release of vLLM Ascend (v0.7.3rc1) in early March 2025.
Known issues to be fixed in vllm-ascend v0.7.3rc1 (March 2025) with CANN 8.1.RC1.alpha001 (March 2025):
-
AssertionError: Torch not compiled with CUDA enabled
Issue link: DeepSeek-R1 on 0.7.1-dev with Torch not compiled with CUDA enabled #122 (comment)
Workaround: in the upstream vLLM code, in vllm/vllm/model_executor/layers/rotary_embedding.py, the device is hard-coded as 'cuda'. Either manually replace these occurrences of 'cuda' with 'npu', or add "from torch_npu.contrib import transfer_to_npu" at the beginning of the script (see the sketch after this item).
Fixed by:
- vLLM PR (works in v0.7.4): [model][refactor] remove cuda hard code in models and layers vllm#13658
- vLLM Ascend workaround PR (works in v0.7.3-dev): [BugFix] Add transfer_to_npu in worker.py to replace hard-code 'cuda' in vllm. #228
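A minimal sketch of the import-based workaround, assuming torch_npu is installed; the model name and sampling settings are placeholders:

```python
# Minimal sketch of the workaround: import the torch_npu shim *before* vLLM so
# hard-coded 'cuda' device strings (e.g. in rotary_embedding.py) are redirected
# to 'npu'. The model name and sampling settings below are placeholders.
from torch_npu.contrib import transfer_to_npu  # noqa: F401

from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```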
-
w8a8 quantization is not supported yet
ValueError: Unknown quantization method: ascend. Must be one of ['aqlm', 'awq', 'deepspeedfp', 'tpu_int8', 'fp8', 'fbgemm_fp8', 'modelopt', 'marlin', 'gguf', 'gptq_marlin_24', 'gptq_marlin', 'awq_marlin', 'gptq', 'compressed-tensors', 'bitsandbytes', 'qqq', 'hqq', 'experts_int8', 'neuron_quant', 'ipex', 'quark', 'moe_wna16'].
Issue link: Quantization error while running Deepseek-V3-w8a8 #119
Workaround: do not use quantization and wait for the next final release (late March 2025).
-
Quantization is not supported yet
KeyError: 'model.layers.0.self_attn.q_a_proj.weight'
Issue link: DeepSeek-R1 on 0.7.1-dev with Torch not compiled with CUDA enabled #122 (comment)
Workaround: remove https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/config.json#L39-L47 from the local config.json (see the sketch after this item).
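A minimal sketch of that workaround, assuming the linked lines are the quantization_config block and that the checkpoint has been downloaded locally (the path is a placeholder):

```python
# Minimal sketch: drop the quantization_config block (the linked config.json
# lines) from a locally downloaded DeepSeek-R1 checkpoint so the unquantized
# load path is used. The checkpoint path below is a placeholder.
import json
from pathlib import Path

config_path = Path("/path/to/DeepSeek-R1/config.json")
config = json.loads(config_path.read_text())
config.pop("quantization_config", None)
config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))
```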
-
RuntimeError: GroupTopkOperation CreateOperation failed
Cause: this comes from an inner op in CANN and will be fixed in the next RC release of vLLM Ascend (v0.7.3rc1) in early March 2025. Requires bumping the CANN version to CANN 8.1.RC1.alpha001 (public release expected at the end of March 2025).
Will be fixed by: [Misc]: Bump CANN version to CANN 8.1.RC1.alpha001 #142
Workaround: [Fix] Remove npu_group_topk before CANN version update #242
update (2025.02.19): #88 merged into v0.7.1-dev, and the DeepSeek test passed (via DeepSeek-V2-Lite); the V3 architecture is the same as V2, so it should also work. Will backport to main soon.
Here is the note for deploying DeepSeek-V2-Lite: https://vllm-ascend.readthedocs.io/en/latest/tutorials.html#online-serving-on-multi-machine