Deepseek-R1 GGUF support #6847
Conversation
Summary of Changes

Hello @SandroPats, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.
This PR aims to add support for Deepseek-R1 models in the GGUF format, specifically those provided by Unsloth. The author notes that this implementation reuses concepts and code from a similar effort in the vllm project. A key requirement for using these models with this PR is providing the Hugging Face configuration and tokenizer files separately via a new command-line argument, --hf-config-path. Due to the current implementation relying on vllm's GGUF modules, CUDA graphs must be disabled when running these models.
Highlights
- GGUF Support for Deepseek-R1: Adds core functionality to load and run Deepseek-R1 models distributed in the GGUF format.
- Separate HF Config Path: Introduces a new command-line argument `--hf-config-path` and a corresponding configuration field to specify the location of the Hugging Face `config.json`, `tokenizer.json`, `configuration.py`, and `modeling.py` files needed for GGUF models.
- GGUF Weight Handling: Includes modifications to linear and MoE layers to correctly handle GGUF-specific weight loading, dequantization, and materialization of uninitialized parameters, including considerations for tensor parallelism.
- Deepseek V3 Specifics: Adds specific logic in the model loader to map GGUF tensor names to Hugging Face names for Deepseek V3 and adjusts the QKV fusion logic for GGUF models.
- CUDA Graph Compatibility: Notes that the current GGUF implementation is not compatible with CUDA graphs, requiring users to disable them via `--disable-cuda-graph`.
Changelog
- python/sglang/srt/configs/model_config.py
  - Added `hf_config_path` parameter to the `ModelConfig` constructor.
  - Stored `hf_config_path` and `trust_remote_code` as instance attributes.
  - Modified the `get_config` call to use `hf_config_path` if provided.
  - Added `hf_config_path` when creating `ModelConfig` from `ServerArgs`.
- python/sglang/srt/layers/linear.py
  - Added logic in `weight_loader` to handle GGUF weight types and materialize `UninitializedParameter` for GGUF weights.
  - Adjusted materialization logic for GGUF weights to account for tensor parallelism sharding on the output dimension.
- python/sglang/srt/layers/moe/fused_moe_triton/layer.py
  - Imported `UninitializedParameter`.
  - Added `quant_config` to the `FusedMoE` constructor.
  - Modified `weight_loader` to handle GGUF weight types and materialize `UninitializedParameter` for GGUF MoE weights, including tensor parallelism sharding.
  - Conditionally pass `routed_scaling_factor` in the `forward` method based on quantization type.
  - Changed the parameter name from `correction_bias` to `e_score_correction_bias` in the `forward` call.
- python/sglang/srt/model_loader/loader.py
  - Imported `gguf`.
  - Added a specific GGUF-to-HF tensor name mapping for `deepseek_v3` (mapping to `deepseek2` and handling expert weights).
  - Used `trust_remote_code` when loading the dummy model config.
  - Modified `load_weights` to disable QKV fusion (`fuse_qkv_a_proj`) for GGUF quantization.
  - Adjusted the logic for fusing QKV weights to handle GGUF-specific 'type' weights.
  - Removed the `weight_names` argument from the `post_load_weights` call.
- python/sglang/srt/server_args.py
  - Added `hf_config_path` field to the `ServerArgs` dataclass.
  - Added the `--hf-config-path` command-line argument.
- python/sglang/srt/utils.py
  - Imported `GGUFMoEMethod` and `FusedMoE`.
  - Added a monkey patch to `GGUFConfig.get_quant_method` to support `FusedMoE` layers with GGUF.
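The last changelog item patches an existing `get_quant_method` at runtime rather than modifying vLLM itself. The sketch below illustrates that monkey-patch pattern in a self-contained way; the class names here are stand-ins (the actual PR patches vLLM's `GGUFConfig` to return a `GGUFMoEMethod` for sglang's `FusedMoE` layers):

```python
# Sketch of the monkey-patch pattern used in utils.py: wrap the original
# get_quant_method so it also recognizes a new layer type. All classes
# below are stand-ins for the real vLLM/sglang classes.

class LinearMethod:
    name = "gguf-linear"

class MoEMethod:
    name = "gguf-moe"

class FusedMoE:  # stand-in for sglang's FusedMoE layer
    pass

class GGUFConfig:  # stand-in for the quantization config being patched
    def get_quant_method(self, layer, prefix=""):
        return LinearMethod()

# Keep a reference to the original method, then install the patched one.
_original_get_quant_method = GGUFConfig.get_quant_method

def patched_get_quant_method(self, layer, prefix=""):
    # Handle the newly supported layer type first, then defer to the
    # original logic for everything else.
    if isinstance(layer, FusedMoE):
        return MoEMethod()
    return _original_get_quant_method(self, layer, prefix)

GGUFConfig.get_quant_method = patched_get_quant_method
```

After patching, `GGUFConfig().get_quant_method(FusedMoE())` returns the MoE method while all other layers still fall through to the original behavior.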
GGUF weights arrive,
Deepseek models now thrive,
Config path is key.
Code Review
This pull request introduces support for DeepSeek-R1 GGUF models, which is a valuable addition. The changes involve modifications across configuration, model loading, and layer implementations, particularly for handling GGUF-specific weight structures and quantization.
The approach of leveraging and adapting vLLM's GGUF infrastructure is sensible. However, there are a few critical points to address, such as a typo in a keyword argument and a potential logical issue in attention layer normalization. Additionally, the checklist items (tests, documentation, benchmarks) are mostly unaddressed and are important for ensuring the quality and usability of this feature. The requirement to disable CUDA graphs for GGUF models is a notable limitation that should be clearly communicated to users.
Summary of Findings
- Critical Typo: A typo `routed_scaling_facto` instead of `routed_scaling_factor` was found in `python/sglang/srt/layers/moe/fused_moe_triton/layer.py`. This needs to be corrected.
- Potential Double Layer Normalization: In `python/sglang/srt/models/deepseek_v2.py`, `self.q_a_layernorm` appears to be applied twice consecutively under certain conditions for GGUF models. This could be a bug and needs verification.
- GGUF Attention Mechanism: Commented-out code in `python/sglang/srt/models/deepseek_v2.py` suggests GGUF might require MHA. Clarification is needed on whether the current MLA approach is fully compatible and performant for GGUF, especially given CUDA graph limitations.
- Operational Requirements and Documentation: The PR relies on specific files from the Unsloth GGUF repo and requires disabling CUDA graphs. These are important operational details that need to be clearly documented for users. The PR checklist also has several unaddressed items (tests, docs, benchmarks).
Merge Readiness
This pull request makes good progress in adding DeepSeek-R1 GGUF support. However, there are critical and high-severity issues identified (a typo and potential double normalization) that must be addressed before merging.
Additionally, I recommend the author:
- Clarify the commented-out MHA logic for GGUF models.
- Address the unchecked items in the PR checklist, particularly adding unit tests and updating documentation to reflect the new GGUF support, its requirements (like `--hf-config-path` and the specific file dependencies from Unsloth), and limitations (like `--disable-cuda-graph`).
Given these points, I am requesting changes. I am unable to approve pull requests, but once these issues are resolved, the PR will be in a much better state for further review and merging by authorized maintainers.
/gemini review
Code Review
This pull request introduces support for DeepSeek-R1 GGUF models, which is a valuable addition. The changes span model configuration, weight loading, and model-specific adaptations for DeepSeek V2 architecture to handle GGUF's quantized format and specific tensor naming/structures.
Overall, the approach seems reasonable, reusing concepts from vLLM and adapting them. However, there are a few areas that need attention, particularly concerning the hardcoded model-specific mappings, clarity on certain GGUF handling logic, and most importantly, the completion of essential checklist items like testing and documentation.
The introduction of `--hf-config-path` is a good step for flexibility. The disabling of CUDA graphs for GGUF models is a key performance consideration that needs to be clearly communicated to users.
Summary of Findings
- Missing Checklist Items: The PR checklist indicates that formatting, unit tests, documentation, and benchmark results are not yet completed. These are crucial for ensuring the quality, correctness, and usability of this feature.
- Model-Specific Hardcoding: The tensor name mapping for `deepseek_v3` in `model_loader/loader.py` is hardcoded. This could be a maintenance concern for future model variants or GGUF convention changes.
- Clarity on GGUF-Specific Logic: The reasons for disabling `fuse_qkv_a_proj` for GGUF models and the handling of `*.weight_type` tensors in non-GGUF fusion paths could be clarified.
- Code Duplication: Common GGUF weight loading logic is duplicated across `linear.py` and `moe/.../layer.py`, offering an opportunity for refactoring.
- Performance Implications: The requirement to disable CUDA graphs for GGUF models is a significant performance note that needs clear documentation.
- External File Dependencies: The assumption that specific files (`config.json`, `tokenizer.json`, `configuration.py`, `modeling.py`) from the Unsloth GGUF repo are available needs to be clearly documented, including how `--hf-config-path` interacts with the location of these other files.
Merge Readiness
This pull request makes good progress towards GGUF support for DeepSeek-R1. However, before it can be considered ready for merging, several critical aspects need to be addressed:
- Checklist Completion: The PR checklist items, especially unit tests, comprehensive documentation (including the `--disable-cuda-graph` requirement, the `--hf-config-path` usage, and dependencies on Unsloth repo files), and ideally benchmark results, must be completed.
- Clarifications: Addressing the review comments regarding the GGUF-specific logic (e.g., QKV fusion, type tensor handling) would improve code clarity and confidence in its correctness.
- Maintainability: Consider refactoring the duplicated GGUF weight loading logic and exploring more generalized solutions for GGUF tensor name mapping.
Given these points, particularly the incomplete checklist and the need for testing, I recommend that these changes be made before merging. I am unable to approve pull requests, but based on this review, further work is needed.
```python
if model_type == "deepseek_v3":
    model_type = "deepseek2"
    # GGUF layer map assumes that we will have merged expert weights,
    # so we need to map them manually
    for idx in range(config.num_hidden_layers):
        gguf_to_hf_name_map[f"blk.{idx}.exp_probs_b.bias"] = (
            f"model.layers.{idx}.mlp.gate.e_score_correction_bias"
        )
        gguf_to_hf_name_map[f"blk.{idx}.ffn_down_exps.weight"] = (
            f"model.layers.{idx}.mlp.experts.0.down_proj.weight"
        )
        gguf_to_hf_name_map[f"blk.{idx}.ffn_gate_exps.weight"] = (
            f"model.layers.{idx}.mlp.experts.0.gate_proj.weight"
        )
        gguf_to_hf_name_map[f"blk.{idx}.ffn_up_exps.weight"] = (
            f"model.layers.{idx}.mlp.experts.0.up_proj.weight"
        )
```
This explicit mapping for `deepseek_v3` (aliased to `deepseek2`) seems quite model-specific. While it addresses the immediate need, could this approach become a maintenance challenge if more DeepSeek variants or other models with unique GGUF naming conventions are added?
Is there potential for a more generalized mechanism for GGUF to Hugging Face tensor name translation, perhaps through a configurable mapping file or a more robust detection strategy based on GGUF metadata, if available?
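One possible shape for the generalization the reviewer asks about is a data-driven table of per-layer name templates, expanded per model type. This is a hypothetical sketch, not part of the PR; the template strings mirror the hardcoded `deepseek_v3` mapping above, but the helper and its names are illustrative:

```python
# Hypothetical data-driven alternative to the hardcoded mapping:
# per-model-type templates, expanded for every layer index.
EXPERT_TENSOR_TEMPLATES = {
    "deepseek_v3": {
        "blk.{idx}.exp_probs_b.bias": "model.layers.{idx}.mlp.gate.e_score_correction_bias",
        "blk.{idx}.ffn_down_exps.weight": "model.layers.{idx}.mlp.experts.0.down_proj.weight",
        "blk.{idx}.ffn_gate_exps.weight": "model.layers.{idx}.mlp.experts.0.gate_proj.weight",
        "blk.{idx}.ffn_up_exps.weight": "model.layers.{idx}.mlp.experts.0.up_proj.weight",
    },
}

def build_expert_name_map(model_type: str, num_layers: int) -> dict:
    """Expand the per-layer GGUF->HF name templates for a model type.

    Returns an empty dict for model types with no registered templates,
    so callers can fall back to the generic mapping path.
    """
    templates = EXPERT_TENSOR_TEMPLATES.get(model_type, {})
    mapping = {}
    for idx in range(num_layers):
        for gguf_tpl, hf_tpl in templates.items():
            mapping[gguf_tpl.format(idx=idx)] = hf_tpl.format(idx=idx)
    return mapping
```

Adding a new model variant would then mean adding one dictionary entry rather than another hardcoded loop.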
Quantized GGUF weights are stored in a way that is incompatible with `torch.stack` (the dequantization result would be different).
```python
is_gguf_weight = getattr(param, "is_gguf_weight", False)
is_gguf_weight_type = getattr(param, "is_gguf_weight_type", False)
if is_gguf_weight_type:
    param.weight_type = loaded_weight.item()

# Materialize GGUF UninitializedParameter
if is_gguf_weight and isinstance(param, UninitializedParameter):
    param.materialize(loaded_weight.shape, dtype=loaded_weight.dtype)
```
The GGUF-specific weight handling logic (checking `is_gguf_weight`, `is_gguf_weight_type`, and materializing `UninitializedParameter`) is also present in `python/sglang/srt/layers/moe/fused_moe_triton/layer.py`.
To improve maintainability and reduce code duplication, have you considered refactoring this common GGUF weight processing into a shared utility function or perhaps a method in a common base class if these layers share more GGUF loading patterns?
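The shared utility the reviewer suggests could look roughly like the following. This is a sketch, not code from the PR; the function name `process_gguf_weight` and its return convention are assumptions, and it presumes PyTorch's lazy-parameter API:

```python
import torch
from torch.nn.parameter import UninitializedParameter

def process_gguf_weight(param, loaded_weight):
    """Hypothetical shared helper for the duplicated GGUF handling.

    Records the GGUF quantization type on the parameter when the loaded
    tensor is a scalar "type" marker, and materializes lazily-allocated
    parameters to the loaded tensor's shape and dtype.

    Returns True if the caller should stop here (a type marker was
    consumed), False if normal weight copying should proceed.
    """
    is_gguf_weight = getattr(param, "is_gguf_weight", False)
    is_gguf_weight_type = getattr(param, "is_gguf_weight_type", False)

    if is_gguf_weight_type:
        # GGUF "type" tensors hold a single scalar identifying the
        # quantization format of the companion weight tensor.
        param.weight_type = loaded_weight.item()
        return True

    if is_gguf_weight and isinstance(param, UninitializedParameter):
        # Quantized GGUF tensor shapes are unknown until load time,
        # so the parameter is only materialized here.
        param.materialize(loaded_weight.shape, dtype=loaded_weight.dtype)
    return False
```

Both `linear.py` and the fused MoE layer could then call this helper before their layer-specific sharding logic.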
```diff
 num_expert_group=self.num_expert_group,
 custom_routing_function=self.custom_routing_function,
-correction_bias=self.correction_bias,
+e_score_correction_bias=self.correction_bias,
```
Motivation
Support model unsloth/DeepSeek-R1-GGUF (#3973)
Modifications
Adds support for unsloth/DeepSeek-R1-GGUF models. Reused ideas and some code from a similar PR in vLLM (vllm-project/vllm#13167). The current design assumes you provide `config.json` from the Unsloth GGUF repo, and `tokenizer.json`, `configuration.py`, and `modeling.py` from the original DeepSeek-R1 HF repo. The config can be provided through the new server argument: `--hf-config-path`.
The current implementation heavily relies on vLLM's GGUF modules. Unfortunately, those modules are not compatible with CUDA graphs, so CUDA graphs must be disabled with `--disable-cuda-graph`.
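Putting the description together, a launch command might look like the following. The paths and quantization variant are illustrative only; the `--hf-config-path` flag is the one introduced by this PR, and `--model-path` / `--disable-cuda-graph` are existing sglang server arguments:

```shell
# Illustrative launch (paths are placeholders):
# --model-path points at the Unsloth GGUF weights,
# --hf-config-path at a directory holding config.json, tokenizer.json,
# configuration.py, and modeling.py; CUDA graphs must be disabled.
python -m sglang.launch_server \
  --model-path /models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M.gguf \
  --hf-config-path /models/DeepSeek-R1-hf-config \
  --disable-cuda-graph
```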
Checklist