
Conversation


@SandroPats SandroPats commented Jun 3, 2025

Motivation

Support model unsloth/DeepSeek-R1-GGUF (#3973)

Modifications

Adds support for unsloth/DeepSeek-R1-GGUF models, reusing ideas and some code from a similar vLLM PR (vllm-project/vllm#13167). The current design assumes you provide config.json from the Unsloth GGUF repo, plus tokenizer.json, configuration.py, and modeling.py from the original DeepSeek-R1 HF repo. The config can be supplied through the new server argument: --hf-config-path

The current implementation relies heavily on vLLM's GGUF modules. Unfortunately, those modules are not compatible with CUDA graphs, so CUDA graphs must be disabled with --disable-cuda-graph.
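A hypothetical launch invocation under this design might look as follows (the two flags are from this PR; the model path, config directory, and quantization variant are placeholder assumptions, not verified paths):

```shell
# Sketch only: --hf-config-path and --disable-cuda-graph are the flags added
# by this PR; all paths below are illustrative placeholders.
python -m sglang.launch_server \
  --model-path /models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M.gguf \
  --hf-config-path /models/deepseek-r1-hf-config \
  --disable-cuda-graph
```

The directory passed to --hf-config-path is assumed to hold config.json alongside the tokenizer.json, configuration.py, and modeling.py files taken from the original HF repo.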

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Hello @SandroPats, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes


This PR aims to add support for Deepseek-R1 models in the GGUF format, specifically those provided by Unsloth. The author notes that this implementation reuses concepts and code from a similar effort in the vllm project. A key requirement for using these models with this PR is providing the Hugging Face configuration and tokenizer files separately via a new command-line argument, --hf-config-path. Due to the current implementation relying on vllm's GGUF modules, CUDA graphs must be disabled when running these models.

Highlights

  • GGUF Support for Deepseek-R1: Adds core functionality to load and run Deepseek-R1 models distributed in the GGUF format.
  • Separate HF Config Path: Introduces a new command-line argument --hf-config-path and corresponding configuration field to specify the location of the Hugging Face config.json, tokenizer.json, configuration.py, and modeling.py files needed for GGUF models.
  • GGUF Weight Handling: Includes modifications to linear and MoE layers to correctly handle GGUF-specific weight loading, dequantization, and materialization of uninitialized parameters, including considerations for tensor parallelism.
  • Deepseek V3 Specifics: Adds specific logic in the model loader to map GGUF tensor names to Hugging Face names for Deepseek V3 and adjusts the QKV fusion logic for GGUF models.
  • CUDA Graph Compatibility: Notes that the current GGUF implementation is not compatible with CUDA graphs, requiring users to disable them via --disable-cuda-graph.

Changelog

  • python/sglang/srt/configs/model_config.py
    • Added hf_config_path parameter to ModelConfig constructor.
    • Stored hf_config_path and trust_remote_code as instance attributes.
    • Modified get_config call to use hf_config_path if provided.
    • Added hf_config_path when creating ModelConfig from ServerArgs.
  • python/sglang/srt/layers/linear.py
    • Added logic in weight_loader to handle GGUF weight types and materialize UninitializedParameter for GGUF weights.
    • Adjusted materialization logic for GGUF weights to account for tensor parallelism sharding on the output dimension.
  • python/sglang/srt/layers/moe/fused_moe_triton/layer.py
    • Imported UninitializedParameter.
    • Added quant_config to FusedMoE constructor.
    • Modified weight_loader to handle GGUF weight types and materialize UninitializedParameter for GGUF MoE weights, including tensor parallelism sharding.
    • Conditionally pass routed_scaling_factor in the forward method based on quantization type.
    • Changed parameter name from correction_bias to e_score_correction_bias in the forward call.
  • python/sglang/srt/model_loader/loader.py
    • Imported gguf.
    • Added specific GGUF to HF tensor name mapping for deepseek_v3 (mapping to deepseek2 and handling expert weights).
    • Used trust_remote_code when loading the dummy model config.
    • Modified load_weights to disable QKV fusion (fuse_qkv_a_proj) for GGUF quantization.
    • Adjusted the logic for fusing QKV weights to handle GGUF-specific 'type' weights.
    • Removed the weight_names argument from the post_load_weights call.
  • python/sglang/srt/server_args.py
    • Added hf_config_path field to ServerArgs dataclass.
    • Added --hf-config-path command-line argument.
  • python/sglang/srt/utils.py
    • Imported GGUFMoEMethod and FusedMoE.
    • Added monkey patch to GGUFConfig.get_quant_method to support FusedMoE layers with GGUF.
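The tensor-parallel sharding on the output dimension described for linear.py above boils down to simple index arithmetic. A minimal sketch (function name and the even-divisibility assumption are illustrative, not the PR's actual code):

```python
def output_dim_shard(weight_rows: int, tp_size: int, tp_rank: int) -> tuple:
    """Return the (start, length) slice of the output dimension owned by
    this tensor-parallel rank.

    Simplified sketch: assumes weight_rows divides evenly by tp_size,
    which sharded weight materialization typically requires.
    """
    shard_len = weight_rows // tp_size
    return tp_rank * shard_len, shard_len


# Example: an output dimension of 1024 rows split across 4 ranks;
# rank 2 owns rows [512, 768).
start, length = output_dim_shard(1024, 4, 2)
```

When materializing a GGUF UninitializedParameter under tensor parallelism, only this shard of the loaded tensor's output dimension would be kept on each rank.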


GGUF weights arrive,
Deepseek models now thrive,
Config path is key.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for DeepSeek-R1 GGUF models, which is a valuable addition. The changes involve modifications across configuration, model loading, and layer implementations, particularly for handling GGUF-specific weight structures and quantization.

The approach of leveraging and adapting vLLM's GGUF infrastructure is sensible. However, there are a few critical points to address, such as a typo in a keyword argument and a potential logical issue in attention layer normalization. Additionally, the checklist items (tests, documentation, benchmarks) are mostly unaddressed and are important for ensuring the quality and usability of this feature. The requirement to disable CUDA graphs for GGUF models is a notable limitation that should be clearly communicated to users.

Summary of Findings

  • Critical Typo: A typo routed_scaling_facto instead of routed_scaling_factor was found in python/sglang/srt/layers/moe/fused_moe_triton/layer.py. This needs to be corrected.
  • Potential Double Layer Normalization: In python/sglang/srt/models/deepseek_v2.py, self.q_a_layernorm appears to be applied twice consecutively under certain conditions for GGUF models. This could be a bug and needs verification.
  • GGUF Attention Mechanism: Commented-out code in python/sglang/srt/models/deepseek_v2.py suggests GGUF might require MHA. Clarification is needed on whether the current MLA approach is fully compatible and performant for GGUF, especially given CUDA graph limitations.
  • Operational Requirements and Documentation: The PR relies on specific files from the Unsloth GGUF repo and requires disabling CUDA graphs. These are important operational details that need to be clearly documented for users. The PR checklist also has several unaddressed items (tests, docs, benchmarks).

Merge Readiness

This pull request makes good progress in adding DeepSeek-R1 GGUF support. However, there are critical and high-severity issues identified (a typo and potential double normalization) that must be addressed before merging.

Additionally, I recommend the author:

  • Clarify the commented-out MHA logic for GGUF models.
  • Address the unchecked items in the PR checklist, particularly adding unit tests and updating documentation to reflect the new GGUF support, its requirements (like --hf-config-path, specific file dependencies from Unsloth), and limitations (like --disable-cuda-graph).

Given these points, I am requesting changes. I am unable to approve pull requests, but once these issues are resolved, the PR will be in a much better state for further review and merging by authorized maintainers.

SandroPats and others added 3 commits June 3, 2025 18:40
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@SandroPats
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for DeepSeek-R1 GGUF models, which is a valuable addition. The changes span model configuration, weight loading, and model-specific adaptations for DeepSeek V2 architecture to handle GGUF's quantized format and specific tensor naming/structures.

Overall, the approach seems reasonable, reusing concepts from vLLM and adapting them. However, there are a few areas that need attention, particularly concerning the hardcoded model-specific mappings, clarity on certain GGUF handling logic, and most importantly, the completion of essential checklist items like testing and documentation.

The introduction of --hf-config-path is a good step for flexibility. The disabling of CUDA graphs for GGUF models is a key performance consideration that needs to be clearly communicated to users.

Summary of Findings

  • Missing Checklist Items: The PR checklist indicates that formatting, unit tests, documentation, and benchmark results are not yet completed. These are crucial for ensuring the quality, correctness, and usability of this feature.
  • Model-Specific Hardcoding: The tensor name mapping for deepseek_v3 in model_loader/loader.py is hardcoded. This could be a maintenance concern for future model variants or GGUF convention changes.
  • Clarity on GGUF-Specific Logic: The reasons for disabling fuse_qkv_a_proj for GGUF models and the handling of *.weight_type tensors in non-GGUF fusion paths could be clarified.
  • Code Duplication: Common GGUF weight loading logic is duplicated across linear.py and moe/.../layer.py, offering an opportunity for refactoring.
  • Performance Implications: The requirement to disable CUDA graphs for GGUF models is a significant performance note that needs clear documentation.
  • External File Dependencies: The assumption that specific files (config.json, tokenizer.json, configuration.py, modeling.py) from the Unsloth GGUF repo are available needs to be clearly documented, including how --hf-config-path interacts with the location of these other files.

Merge Readiness

This pull request makes good progress towards GGUF support for DeepSeek-R1. However, before it can be considered ready for merging, several critical aspects need to be addressed:

  1. Checklist Completion: The PR checklist items, especially unit tests, comprehensive documentation (including the --disable-cuda-graph requirement, the --hf-config-path usage, and dependencies on Unsloth repo files), and ideally benchmark results, must be completed.
  2. Clarifications: Addressing the review comments regarding the GGUF-specific logic (e.g., QKV fusion, type tensor handling) would improve code clarity and confidence in its correctness.
  3. Maintainability: Consider refactoring the duplicated GGUF weight loading logic and exploring more generalized solutions for GGUF tensor name mapping.

Given these points, particularly the incomplete checklist and the need for testing, I recommend that these changes be made before merging. I am unable to approve pull requests, but based on this review, further work is needed.

Comment on lines +1220 to +1236
```python
if model_type == "deepseek_v3":
    model_type = "deepseek2"
    # GGUF layer map assumes that we will have merged expert weights,
    # so we need to map them manually
    for idx in range(config.num_hidden_layers):
        gguf_to_hf_name_map[f"blk.{idx}.exp_probs_b.bias"] = (
            f"model.layers.{idx}.mlp.gate.e_score_correction_bias"
        )
        gguf_to_hf_name_map[f"blk.{idx}.ffn_down_exps.weight"] = (
            f"model.layers.{idx}.mlp.experts.0.down_proj.weight"
        )
        gguf_to_hf_name_map[f"blk.{idx}.ffn_gate_exps.weight"] = (
            f"model.layers.{idx}.mlp.experts.0.gate_proj.weight"
        )
        gguf_to_hf_name_map[f"blk.{idx}.ffn_up_exps.weight"] = (
            f"model.layers.{idx}.mlp.experts.0.up_proj.weight"
        )
```
Contributor


medium

This explicit mapping for deepseek_v3 (aliased to deepseek2) seems quite model-specific. While it addresses the immediate need, could this approach become a maintenance challenge if more DeepSeek variants or other models with unique GGUF naming conventions are added?

Is there potential for a more generalized mechanism for GGUF to Hugging Face tensor name translation, perhaps through a configurable mapping file or a more robust detection strategy based on GGUF metadata, if available?

Author


Quantized GGUF weights are stored in a way that is incompatible with torch.stack (the dequantization result would be different).

Comment on lines +280 to +287
```python
is_gguf_weight = getattr(param, "is_gguf_weight", False)
is_gguf_weight_type = getattr(param, "is_gguf_weight_type", False)
if is_gguf_weight_type:
    param.weight_type = loaded_weight.item()

# Materialize GGUF UninitializedParameter
if is_gguf_weight and isinstance(param, UninitializedParameter):
    param.materialize(loaded_weight.shape, dtype=loaded_weight.dtype)
```
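The flag-checking pattern here can be mimicked without torch, using stand-in objects (a sketch; DummyParam and DummyScalar are hypothetical stand-ins for the nn.Parameter and the loaded 0-d tensor):

```python
class DummyParam:
    """Stand-in for a parameter carrying GGUF metadata flags."""
    is_gguf_weight = False
    is_gguf_weight_type = True


class DummyScalar:
    """Stand-in for a 0-d tensor holding the GGUF quantization type id."""
    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


param, loaded_weight = DummyParam(), DummyScalar(12)  # 12: arbitrary type id

# Same getattr-with-default pattern the loader uses, so ordinary parameters
# that lack these attributes fall through harmlessly.
if getattr(param, "is_gguf_weight_type", False):
    param.weight_type = loaded_weight.item()
```

The getattr defaults are what keep this path a no-op for non-GGUF checkpoints, which is also why the duplicated logic could be lifted into a shared helper as suggested.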
Contributor


medium

The GGUF-specific weight handling logic (checking is_gguf_weight, is_gguf_weight_type, and materializing UninitializedParameter) is also present in python/sglang/srt/layers/moe/fused_moe_triton/layer.py.

To improve maintainability and reduce code duplication, have you considered refactoring this common GGUF weight processing into a shared utility function or perhaps a method in a common base class if these layers share more GGUF loading patterns?

```diff
 num_expert_group=self.num_expert_group,
 custom_routing_function=self.custom_routing_function,
-correction_bias=self.correction_bias,
+e_score_correction_bias=self.correction_bias,
```
Contributor


medium

Renaming correction_bias to e_score_correction_bias in the arguments to self.quant_method.apply aligns well with the GGUF tensor name exp_probs_b.bias being mapped to e_score_correction_bias in model_loader.py. This improves consistency and clarity.
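A keyword rename like this can be kept backward compatible at the call boundary. A generic sketch (not code from this PR; the function name and deprecation policy are illustrative):

```python
import warnings


def apply(*, e_score_correction_bias=None, correction_bias=None, **kwargs):
    """Accept the new kwarg name while tolerating the deprecated one."""
    if correction_bias is not None:
        warnings.warn(
            "correction_bias is deprecated; use e_score_correction_bias",
            DeprecationWarning,
        )
        if e_score_correction_bias is None:
            e_score_correction_bias = correction_bias
    return e_score_correction_bias
```

Whether a shim is worth carrying depends on whether external callers use the old name; for an internal-only call site, a clean rename as done here is simpler.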
