
Conversation


@SandroPats SandroPats commented Jun 3, 2025

Motivation

Support model unsloth/DeepSeek-R1-GGUF (#3973)

Modifications

Adds support for unsloth/DeepSeek-R1-GGUF models, reusing ideas and some code from a similar vLLM PR (vllm-project/vllm#13167). The current design assumes you provide config.json from the Unsloth GGUF repo, plus tokenizer.json, configuration.py, and modeling.py from the original DeepSeek-R1 HF repo. The config can be supplied through the new server argument: --hf-config-path

The current implementation relies heavily on vLLM's GGUF modules. Unfortunately, those modules are not compatible with CUDA graphs, so CUDA graphs must be disabled with --disable-cuda-graph.
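A hypothetical launch invocation under this design might look as follows (the two flags are from this PR; the model path, config directory, and quantization variant are placeholder assumptions, not verified paths):

```shell
# Sketch only: --hf-config-path and --disable-cuda-graph are the flags added
# by this PR; all paths below are illustrative placeholders.
python -m sglang.launch_server \
  --model-path /models/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M.gguf \
  --hf-config-path /models/deepseek-r1-hf-config \
  --disable-cuda-graph
```

The directory passed to --hf-config-path is assumed to hold config.json alongside the tokenizer.json, configuration.py, and modeling.py files taken from the original HF repo.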

Checklist

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Hello @SandroPats, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes


This PR aims to add support for Deepseek-R1 models in the GGUF format, specifically those provided by Unsloth. The author notes that this implementation reuses concepts and code from a similar effort in the vllm project. A key requirement for using these models with this PR is providing the Hugging Face configuration and tokenizer files separately via a new command-line argument, --hf-config-path. Due to the current implementation relying on vllm's GGUF modules, CUDA graphs must be disabled when running these models.

Highlights

  • GGUF Support for Deepseek-R1: Adds core functionality to load and run Deepseek-R1 models distributed in the GGUF format.
  • Separate HF Config Path: Introduces a new command-line argument --hf-config-path and corresponding configuration field to specify the location of the Hugging Face config.json, tokenizer.json, configuration.py, and modeling.py files needed for GGUF models.
  • GGUF Weight Handling: Includes modifications to linear and MoE layers to correctly handle GGUF-specific weight loading, dequantization, and materialization of uninitialized parameters, including considerations for tensor parallelism.
  • Deepseek V3 Specifics: Adds specific logic in the model loader to map GGUF tensor names to Hugging Face names for Deepseek V3 and adjusts the QKV fusion logic for GGUF models.
  • CUDA Graph Compatibility: Notes that the current GGUF implementation is not compatible with CUDA graphs, requiring users to disable them via --disable-cuda-graph.

Changelog

  • python/sglang/srt/configs/model_config.py
    • Added hf_config_path parameter to ModelConfig constructor.
    • Stored hf_config_path and trust_remote_code as instance attributes.
    • Modified get_config call to use hf_config_path if provided.
    • Added hf_config_path when creating ModelConfig from ServerArgs.
  • python/sglang/srt/layers/linear.py
    • Added logic in weight_loader to handle GGUF weight types and materialize UninitializedParameter for GGUF weights.
    • Adjusted materialization logic for GGUF weights to account for tensor parallelism sharding on the output dimension.
  • python/sglang/srt/layers/moe/fused_moe_triton/layer.py
    • Imported UninitializedParameter.
    • Added quant_config to FusedMoE constructor.
    • Modified weight_loader to handle GGUF weight types and materialize UninitializedParameter for GGUF MoE weights, including tensor parallelism sharding.
    • Conditionally pass routed_scaling_factor in the forward method based on quantization type.
    • Changed parameter name from correction_bias to e_score_correction_bias in the forward call.
  • python/sglang/srt/model_loader/loader.py
    • Imported gguf.
    • Added specific GGUF to HF tensor name mapping for deepseek_v3 (mapping to deepseek2 and handling expert weights).
    • Used trust_remote_code when loading the dummy model config.
    • Modified load_weights to disable QKV fusion (fuse_qkv_a_proj) for GGUF quantization.
    • Adjusted the logic for fusing QKV weights to handle GGUF-specific 'type' weights.
    • Removed the weight_names argument from the post_load_weights call.
  • python/sglang/srt/server_args.py
    • Added hf_config_path field to ServerArgs dataclass.
    • Added --hf-config-path command-line argument.
  • python/sglang/srt/utils.py
    • Imported GGUFMoEMethod and FusedMoE.
    • Added monkey patch to GGUFConfig.get_quant_method to support FusedMoE layers with GGUF.
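The tensor-parallel sharding on the output dimension described for linear.py above boils down to simple index arithmetic. A minimal sketch (function name and the even-divisibility assumption are illustrative, not the PR's actual code):

```python
def output_dim_shard(weight_rows: int, tp_size: int, tp_rank: int) -> tuple:
    """Return the (start, length) slice of the output dimension owned by
    this tensor-parallel rank.

    Simplified sketch: assumes weight_rows divides evenly by tp_size,
    which sharded weight materialization typically requires.
    """
    shard_len = weight_rows // tp_size
    return tp_rank * shard_len, shard_len


# Example: an output dimension of 1024 rows split across 4 ranks;
# rank 2 owns rows [512, 768).
start, length = output_dim_shard(1024, 4, 2)
```

When materializing a GGUF UninitializedParameter under tensor parallelism, only this shard of the loaded tensor's output dimension would be kept on each rank.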


GGUF weights arrive,
Deepseek models now thrive,
Config path is key.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for DeepSeek-R1 GGUF models, which is a valuable addition. The changes involve modifications across configuration, model loading, and layer implementations, particularly for handling GGUF-specific weight structures and quantization.

The approach of leveraging and adapting vLLM's GGUF infrastructure is sensible. However, there are a few critical points to address, such as a typo in a keyword argument and a potential logical issue in attention layer normalization. Additionally, the checklist items (tests, documentation, benchmarks) are mostly unaddressed and are important for ensuring the quality and usability of this feature. The requirement to disable CUDA graphs for GGUF models is a notable limitation that should be clearly communicated to users.

Summary of Findings

  • Critical Typo: A typo routed_scaling_facto instead of routed_scaling_factor was found in python/sglang/srt/layers/moe/fused_moe_triton/layer.py. This needs to be corrected.
  • Potential Double Layer Normalization: In python/sglang/srt/models/deepseek_v2.py, self.q_a_layernorm appears to be applied twice consecutively under certain conditions for GGUF models. This could be a bug and needs verification.
  • GGUF Attention Mechanism: Commented-out code in python/sglang/srt/models/deepseek_v2.py suggests GGUF might require MHA. Clarification is needed on whether the current MLA approach is fully compatible and performant for GGUF, especially given CUDA graph limitations.
  • Operational Requirements and Documentation: The PR relies on specific files from the Unsloth GGUF repo and requires disabling CUDA graphs. These are important operational details that need to be clearly documented for users. The PR checklist also has several unaddressed items (tests, docs, benchmarks).

Merge Readiness

This pull request makes good progress in adding DeepSeek-R1 GGUF support. However, there are critical and high-severity issues identified (a typo and potential double normalization) that must be addressed before merging.

Additionally, I recommend the author:

  • Clarify the commented-out MHA logic for GGUF models.
  • Address the unchecked items in the PR checklist, particularly adding unit tests and updating documentation to reflect the new GGUF support, its requirements (like --hf-config-path, specific file dependencies from Unsloth), and limitations (like --disable-cuda-graph).

Given these points, I am requesting changes. I am unable to approve pull requests, but once these issues are resolved, the PR will be in a much better state for further review and merging by authorized maintainers.

SandroPats and others added 3 commits June 3, 2025 18:40
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@SandroPats
Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for DeepSeek-R1 GGUF models, which is a valuable addition. The changes span model configuration, weight loading, and model-specific adaptations for DeepSeek V2 architecture to handle GGUF's quantized format and specific tensor naming/structures.

Overall, the approach seems reasonable, reusing concepts from vLLM and adapting them. However, there are a few areas that need attention, particularly concerning the hardcoded model-specific mappings, clarity on certain GGUF handling logic, and most importantly, the completion of essential checklist items like testing and documentation.

The introduction of --hf-config-path is a good step for flexibility. The disabling of CUDA graphs for GGUF models is a key performance consideration that needs to be clearly communicated to users.

Summary of Findings

  • Missing Checklist Items: The PR checklist indicates that formatting, unit tests, documentation, and benchmark results are not yet completed. These are crucial for ensuring the quality, correctness, and usability of this feature.
  • Model-Specific Hardcoding: The tensor name mapping for deepseek_v3 in model_loader/loader.py is hardcoded. This could be a maintenance concern for future model variants or GGUF convention changes.
  • Clarity on GGUF-Specific Logic: The reasons for disabling fuse_qkv_a_proj for GGUF models and the handling of *.weight_type tensors in non-GGUF fusion paths could be clarified.
  • Code Duplication: Common GGUF weight loading logic is duplicated across linear.py and moe/.../layer.py, offering an opportunity for refactoring.
  • Performance Implications: The requirement to disable CUDA graphs for GGUF models is a significant performance note that needs clear documentation.
  • External File Dependencies: The assumption that specific files (config.json, tokenizer.json, configuration.py, modeling.py) from the Unsloth GGUF repo are available needs to be clearly documented, including how --hf-config-path interacts with the location of these other files.

Merge Readiness

This pull request makes good progress towards GGUF support for DeepSeek-R1. However, before it can be considered ready for merging, several critical aspects need to be addressed:

  1. Checklist Completion: The PR checklist items, especially unit tests, comprehensive documentation (including the --disable-cuda-graph requirement, the --hf-config-path usage, and dependencies on Unsloth repo files), and ideally benchmark results, must be completed.
  2. Clarifications: Addressing the review comments regarding the GGUF-specific logic (e.g., QKV fusion, type tensor handling) would improve code clarity and confidence in its correctness.
  3. Maintainability: Consider refactoring the duplicated GGUF weight loading logic and exploring more generalized solutions for GGUF tensor name mapping.

Given these points, particularly the incomplete checklist and the need for testing, I recommend that these changes be made before merging. I am unable to approve pull requests, but based on this review, further work is needed.

Comment on lines +1220 to +1236
```python
if model_type == "deepseek_v3":
    model_type = "deepseek2"
    # GGUF layer map assumes that we will have merged expert weights,
    # so we need to map them manually
    for idx in range(config.num_hidden_layers):
        gguf_to_hf_name_map[f"blk.{idx}.exp_probs_b.bias"] = (
            f"model.layers.{idx}.mlp.gate.e_score_correction_bias"
        )
        gguf_to_hf_name_map[f"blk.{idx}.ffn_down_exps.weight"] = (
            f"model.layers.{idx}.mlp.experts.0.down_proj.weight"
        )
        gguf_to_hf_name_map[f"blk.{idx}.ffn_gate_exps.weight"] = (
            f"model.layers.{idx}.mlp.experts.0.gate_proj.weight"
        )
        gguf_to_hf_name_map[f"blk.{idx}.ffn_up_exps.weight"] = (
            f"model.layers.{idx}.mlp.experts.0.up_proj.weight"
        )
```
Contributor


medium

This explicit mapping for deepseek_v3 (aliased to deepseek2) seems quite model-specific. While it addresses the immediate need, could this approach become a maintenance challenge if more DeepSeek variants or other models with unique GGUF naming conventions are added?

Is there potential for a more generalized mechanism for GGUF to Hugging Face tensor name translation, perhaps through a configurable mapping file or a more robust detection strategy based on GGUF metadata, if available?

Author


Quantized GGUF weights are stored in a way that is incompatible with torch.stack (the dequantization result would be different).

Comment on lines +280 to +287
```python
is_gguf_weight = getattr(param, "is_gguf_weight", False)
is_gguf_weight_type = getattr(param, "is_gguf_weight_type", False)
if is_gguf_weight_type:
    param.weight_type = loaded_weight.item()

# Materialize GGUF UninitializedParameter
if is_gguf_weight and isinstance(param, UninitializedParameter):
    param.materialize(loaded_weight.shape, dtype=loaded_weight.dtype)
```
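The flag-checking pattern here can be mimicked without torch, using stand-in objects (a sketch; DummyParam and DummyScalar are hypothetical stand-ins for the nn.Parameter and the loaded 0-d tensor):

```python
class DummyParam:
    """Stand-in for a parameter carrying GGUF metadata flags."""
    is_gguf_weight = False
    is_gguf_weight_type = True


class DummyScalar:
    """Stand-in for a 0-d tensor holding the GGUF quantization type id."""
    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


param, loaded_weight = DummyParam(), DummyScalar(12)  # 12: arbitrary type id

# Same getattr-with-default pattern the loader uses, so ordinary parameters
# that lack these attributes fall through harmlessly.
if getattr(param, "is_gguf_weight_type", False):
    param.weight_type = loaded_weight.item()
```

The getattr defaults are what keep this path a no-op for non-GGUF checkpoints, which is also why the duplicated logic could be lifted into a shared helper as suggested.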
Contributor


medium

The GGUF-specific weight handling logic (checking is_gguf_weight, is_gguf_weight_type, and materializing UninitializedParameter) is also present in python/sglang/srt/layers/moe/fused_moe_triton/layer.py.

To improve maintainability and reduce code duplication, have you considered refactoring this common GGUF weight processing into a shared utility function or perhaps a method in a common base class if these layers share more GGUF loading patterns?

```diff
 num_expert_group=self.num_expert_group,
 custom_routing_function=self.custom_routing_function,
-correction_bias=self.correction_bias,
+e_score_correction_bias=self.correction_bias,
```
Contributor


medium

Renaming correction_bias to e_score_correction_bias in the arguments to self.quant_method.apply aligns well with the GGUF tensor name exp_probs_b.bias being mapped to e_score_correction_bias in model_loader.py. This improves consistency and clarity.
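A keyword rename like this can be kept backward compatible at the call boundary. A generic sketch (not code from this PR; the function name and deprecation policy are illustrative):

```python
import warnings


def apply(*, e_score_correction_bias=None, correction_bias=None, **kwargs):
    """Accept the new kwarg name while tolerating the deprecated one."""
    if correction_bias is not None:
        warnings.warn(
            "correction_bias is deprecated; use e_score_correction_bias",
            DeprecationWarning,
        )
        if e_score_correction_bias is None:
            e_score_correction_bias = correction_bias
    return e_score_correction_bias
```

Whether a shim is worth carrying depends on whether external callers use the old name; for an internal-only call site, a clean rename as done here is simpler.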
