
[feat] Support tp mode for DeepSeek-R1-W4AFP8#8118

Merged
zhyncs merged 11 commits into sgl-project:main from chenxijun1029:feat/w4afp8-tp
Sep 2, 2025
Conversation

Contributor

@chenxijun1029 chenxijun1029 commented Jul 17, 2025

Motivation

Support TP mode for the DeepSeek W4A8 model, which performs better than EP mode.

Modifications

  1. Add W4AFp8MoEMethod with its associated create_weights, process_weights_after_loading, and apply functions. The apply function reuses the same cutlass_w4a8_moe kernel as EP MoE.
  2. Add tile-shape and cluster-shape configs for TP MoE in the cutlass_w4a8_moe kernel.
  3. Add routing logic in the W4AFP8 quant config and method: when "enable_ep_moe" is found in global_server_args_dict, we use EP mode; otherwise TP.
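The routing in item 3 can be sketched as follows. This is a minimal illustration; `get_moe_method` and the string return values are hypothetical names, not the actual classes in the PR (the real code selects between MoE method objects):

```python
# Minimal sketch of the EP/TP routing added in the W4AFP8 quant config.
# Function name and return values are illustrative only.
def get_moe_method(global_server_args_dict: dict) -> str:
    """Pick the MoE parallelism mode for W4AFP8 layers."""
    if global_server_args_dict.get("enable_ep_moe"):
        return "ep"  # existing expert-parallel path
    return "tp"      # new tensor-parallel path added by this PR
```

The default is TP, so existing deployments only take the EP path when the server flag is explicitly set.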

Co-authored-by: @yuhyao (827623970@qq.com)

Benchmark

We run DeepSeek-R1-W4AFP8 on 8×H20 with TP8, compared to running DeepSeek-R1 on 8×H20 with EP8.
Test configuration:

ISL1000, OSL1000

input/output len = 1000/1000, qps=128, max_concurrency=128, num_prompt=256.
The results are shown below:

TP

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    128.0     
Max request concurrency:                 128       
Successful requests:                     256       
Benchmark duration (s):                  159.00    
Total input tokens:                      256000    
Total generated tokens:                  256000    
Total generated tokens (retokenized):    254696    
Request throughput (req/s):              1.61      
Input token throughput (tok/s):          1610.09   
Output token throughput (tok/s):         1610.09   
Total token throughput (tok/s):          3220.18   
Concurrency:                             127.52    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   79201.47  
Median E2E Latency (ms):                 78956.21  
---------------Time to First Token----------------
Mean TTFT (ms):                          6547.98   
Median TTFT (ms):                        6612.11   
P99 TTFT (ms):                           11687.82  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           72.73     
Median ITL (ms):                         68.05     
P95 ITL (ms):                            72.42     
P99 ITL (ms):                            73.05     
Max ITL (ms):                            11148.65  
==================================================

While EP:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    128.0     
Max request concurrency:                 128       
Successful requests:                     256       
Benchmark duration (s):                  161.37    
Total input tokens:                      256000    
Total generated tokens:                  256000    
Total generated tokens (retokenized):    255343    
Request throughput (req/s):              1.59      
Input token throughput (tok/s):          1586.41   
Output token throughput (tok/s):         1586.41   
Total token throughput (tok/s):          3172.81   
Concurrency:                             127.53    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   80390.85  
Median E2E Latency (ms):                 80644.43  
---------------Time to First Token----------------
Mean TTFT (ms):                          8143.41   
Median TTFT (ms):                        8144.56   
P99 TTFT (ms):                           14833.05  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           72.32     
Median ITL (ms):                         66.38     
P95 ITL (ms):                            71.43     
P99 ITL (ms):                            72.08     
Max ITL (ms):                            14113.65  
==================================================

ISL6000, OSL1000

input/output len = 6000/1000, qps=128, max_concurrency=128, num_prompt=256.
The results are shown below:

TP

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    128.0     
Max request concurrency:                 128       
Successful requests:                     256       
Benchmark duration (s):                  525.30    
Total input tokens:                      1536000   
Total generated tokens:                  256000    
Total generated tokens (retokenized):    254578    
Request throughput (req/s):              0.49      
Input token throughput (tok/s):          2924.03   
Output token throughput (tok/s):         487.34    
Total token throughput (tok/s):          3411.37   
Concurrency:                             109.09    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   223849.73 
Median E2E Latency (ms):                 251328.94 
---------------Time to First Token----------------
Mean TTFT (ms):                          118601.26 
Median TTFT (ms):                        149225.11 
P99 TTFT (ms):                           183862.13 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           105.36    
Median ITL (ms):                         83.10     
P95 ITL (ms):                            85.79     
P99 ITL (ms):                            86.27     
Max ITL (ms):                            52034.42  
==================================================

While EP:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    128.0     
Max request concurrency:                 128       
Successful requests:                     256       
Benchmark duration (s):                  592.58    
Total input tokens:                      1536000   
Total generated tokens:                  256000    
Total generated tokens (retokenized):    255781    
Request throughput (req/s):              0.43      
Input token throughput (tok/s):          2592.07   
Output token throughput (tok/s):         432.01    
Total token throughput (tok/s):          3024.08   
Concurrency:                             108.41    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   250942.46 
Median E2E Latency (ms):                 282080.60 
---------------Time to First Token----------------
Mean TTFT (ms):                          137965.32 
Median TTFT (ms):                        171540.14 
P99 TTFT (ms):                           217855.50 
---------------Inter-Token Latency----------------
Mean ITL (ms):                           113.09    
Median ITL (ms):                         81.36     
P95 ITL (ms):                            84.83     
P99 ITL (ms):                            85.55     
Max ITL (ms):                            71910.62  
==================================================

The results are summarized in the table below:

| Scenario | Version | ITL (ms) | TTFT (ms) | Request throughput (req/s) |
| --- | --- | --- | --- | --- |
| 1K input, 1K output, 128 request rate | EP | 72.32 | 8143.41 | 1.59 |
| 1K input, 1K output, 128 request rate | TP | 72.73 (+0.6%) | 6547.98 (−19.60%) | 1.61 (+1.2%) |
| 6K input, 1K output, 128 request rate | EP | 113.09 | 137965.32 | 0.43 |
| 6K input, 1K output, 128 request rate | TP | 105.36 (−6.83%) | 118601.26 (−14.03%) | 0.49 (+14.0%) |
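As a quick sanity check, the relative deltas in the summary can be recomputed from the raw benchmark logs above (the helper function is ours; the numbers are from the logs):

```python
def delta_pct(tp: float, ep: float) -> float:
    """Relative change of the TP result versus the EP baseline, in percent."""
    return (tp - ep) / ep * 100

# 1K input / 1K output
print(f"ITL   {delta_pct(72.73, 72.32):+.2f}%")        # ~ +0.6%
print(f"TTFT  {delta_pct(6547.98, 8143.41):+.2f}%")    # ~ -19.6%
print(f"req/s {delta_pct(1.61, 1.59):+.2f}%")          # ~ +1.3%

# 6K input / 1K output
print(f"ITL   {delta_pct(105.36, 113.09):+.2f}%")      # ~ -6.8%
print(f"TTFT  {delta_pct(118601.26, 137965.32):+.2f}%")# ~ -14.0%
print(f"req/s {delta_pct(0.49, 0.43):+.2f}%")          # ~ +14.0%
```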

Accuracy

mmlu: 86.9 (by @yuhyao)
aime24: 80.0
gpqa: 71.2
math500: 95.6

Checklist


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @chenxijun1029, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the DeepSeek-R1-W4AFP8 model's performance by introducing support for Tensor Parallelism (TP) mode. It involves a comprehensive update to the quantization and MoE layer implementations, including new weight handling, dynamic method selection, and optimized low-level kernel configurations, ultimately aiming for more efficient model serving.

Highlights

  • Tensor Parallelism for DeepSeek-R1-W4AFP8: Introduced a new W4AFp8TPMoEMethod to enable Tensor Parallelism (TP) for DeepSeek-R1-W4AFP8 models, demonstrating improved performance compared to Expert Parallelism (EP) mode.
  • Dynamic MoE Quantization Method Selection: Implemented a routing mechanism that dynamically selects between Expert Parallelism (EP) and Tensor Parallelism (TP) quantization methods for MoE layers based on the enable_ep_moe global server argument.
  • Optimized CUTLASS Kernel Configurations: Added new tile and cluster shape configurations within the cutlass_w4a8_moe kernel and extended its dispatch logic to provide optimized performance for the specific matrix dimensions encountered in Tensor Parallelism mode.
  • Enhanced Weight Processing and Loading: Updated the weight creation and processing logic for TP MoE, including refactoring scale interleaving into a shared utility function and adding support for special naming rules for input scales in mixed-precision models.
  • Shared Experts Fusion Disablement: Explicitly disabled shared experts fusion optimization for W4A8 TP MoE models, as their quantization methods differ between routed and shared experts.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for tensor parallelism (TP) mode for the DeepSeek-R1-W4AFP8 model, which shows performance improvements over the existing expert parallelism (EP) mode. The changes include adding a new W4AFp8TPMoEMethod for quantization, updating the MoE kernel configurations, and adding logic to switch between TP and EP modes. The implementation looks solid and the benchmark results are promising. I've added a couple of comments to improve code maintainability by reducing duplication in both the Python and CUDA C++ code. These changes should not affect performance but will make the code easier to read and maintain.

Comment on lines +195 to +320
} else if (n == 512 && k == 7168) {
  // group gemm 1 for tp
  if (m <= 4) {
    using Cutlass3xW4A8GemmSelected = typename JOIN_STRUCT_NAME(64, 32, 512, 2, 1, 1)::Cutlass3xW4A8Gemm;
    cutlass_w4a8_group_gemm_caller<Cutlass3xW4A8GemmSelected>(
        d_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes,
        a_strides, b_strides, d_strides, s_strides, chunk_size);
  } else if (m <= 16) {
    using Cutlass3xW4A8GemmSelected = typename JOIN_STRUCT_NAME_CO(128, 16, 512, 2, 1, 1)::Cutlass3xW4A8Gemm;
    cutlass_w4a8_group_gemm_caller<Cutlass3xW4A8GemmSelected>(
        d_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes,
        a_strides, b_strides, d_strides, s_strides, chunk_size);
  } else if (m <= 256) {
    using Cutlass3xW4A8GemmSelected = typename JOIN_STRUCT_NAME_CO(128, 16, 512, 2, 1, 1)::Cutlass3xW4A8Gemm;
    cutlass_w4a8_group_gemm_caller<Cutlass3xW4A8GemmSelected>(
        d_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes,
        a_strides, b_strides, d_strides, s_strides, chunk_size);
  } else if (m <= 1024) {
    using Cutlass3xW4A8GemmSelected = typename JOIN_STRUCT_NAME_CO(128, 32, 512, 2, 1, 1)::Cutlass3xW4A8Gemm;
    cutlass_w4a8_group_gemm_caller<Cutlass3xW4A8GemmSelected>(
        d_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes,
        a_strides, b_strides, d_strides, s_strides, chunk_size);
  } else {
    using Cutlass3xW4A8GemmSelected = typename JOIN_STRUCT_NAME_CO(128, 64, 512, 1, 1, 1)::Cutlass3xW4A8Gemm;
    cutlass_w4a8_group_gemm_caller<Cutlass3xW4A8GemmSelected>(
        d_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes,
        a_strides, b_strides, d_strides, s_strides, chunk_size);
  }
} else if (n == 7168 && k == 256) {
  // group gemm 2 for tp
  if (m <= 8) {
    using Cutlass3xW4A8GemmSelected = typename JOIN_STRUCT_NAME(64, 16, 256, 1, 1, 1)::Cutlass3xW4A8Gemm;
    cutlass_w4a8_group_gemm_caller<Cutlass3xW4A8GemmSelected>(
        d_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes,
        a_strides, b_strides, d_strides, s_strides, chunk_size);
  } else if (m <= 512) {
    using Cutlass3xW4A8GemmSelected = typename JOIN_STRUCT_NAME(128, 32, 256, 2, 1, 1)::Cutlass3xW4A8Gemm;
    cutlass_w4a8_group_gemm_caller<Cutlass3xW4A8GemmSelected>(
        d_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes,
        a_strides, b_strides, d_strides, s_strides, chunk_size);
  } else {
    using Cutlass3xW4A8GemmSelected = typename JOIN_STRUCT_NAME(128, 64, 256, 1, 1, 1)::Cutlass3xW4A8Gemm;
    cutlass_w4a8_group_gemm_caller<Cutlass3xW4A8GemmSelected>(
        d_tensors, a_tensors, b_tensors, a_scales, b_scales, expert_offsets, problem_sizes,
        a_strides, b_strides, d_strides, s_strides, chunk_size);
  }

medium

There's significant code duplication in the new else if blocks for n == 512 && k == 7168 and n == 7168 && k == 256. The only difference between the if (m <= ...) blocks is the Cutlass3xW4A8GemmSelected type.

To improve maintainability, you could refactor this using a helper template function or a macro to reduce the repeated calls to cutlass_w4a8_group_gemm_caller.

For example:

template <typename Gemm>
void invoke_cutlass_caller(...) {
    cutlass_w4a8_group_gemm_caller<Gemm>(...);
}

// In dispatch_w4a8_moe_mm_sm90:
if (m <= 4) {
    using Gemm = typename JOIN_STRUCT_NAME(64, 32, 512, 2, 1, 1)::Cutlass3xW4A8Gemm;
    invoke_cutlass_caller<Gemm>(...);
} else if (m <= 16) {
    using Gemm = typename JOIN_STRUCT_NAME_CO(128, 16, 512, 2, 1, 1)::Cutlass3xW4A8Gemm;
    invoke_cutlass_caller<Gemm>(...);
}
// ...

While I understand this pattern is common for performance-critical CUDA code to help the compiler generate specialized code, this refactoring would make the code much cleaner without a performance penalty.
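The table-driven shape of this refactor can also be sketched in Python: map each (n, k) problem shape to a list of m-thresholds and tile/cluster tuples, and pick the first threshold that covers m. The tuples below mirror the diff above, but this is an illustrative sketch, not code from the PR; the JOIN_STRUCT_NAME vs JOIN_STRUCT_NAME_CO (cooperative schedule) distinction is elided.

```python
# Threshold table: (n, k) -> list of (max_m, tile/cluster config).
# float("inf") serves as the catch-all final branch.
TILE_CONFIGS = {
    (512, 7168): [  # group gemm 1 for tp
        (4, (64, 32, 512, 2, 1, 1)),
        (16, (128, 16, 512, 2, 1, 1)),
        (256, (128, 16, 512, 2, 1, 1)),
        (1024, (128, 32, 512, 2, 1, 1)),
        (float("inf"), (128, 64, 512, 1, 1, 1)),
    ],
    (7168, 256): [  # group gemm 2 for tp
        (8, (64, 16, 256, 1, 1, 1)),
        (512, (128, 32, 256, 2, 1, 1)),
        (float("inf"), (128, 64, 256, 1, 1, 1)),
    ],
}

def select_config(m: int, n: int, k: int) -> tuple:
    """Return the first tile/cluster config whose m-threshold covers m."""
    for max_m, cfg in TILE_CONFIGS[(n, k)]:
        if m <= max_m:
            return cfg
    raise AssertionError("unreachable: the last threshold is inf")
```

In C++ the same idea needs templates or a macro because the config is a compile-time type, but the dispatch structure is identical.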

@AniZpZ
Collaborator

AniZpZ commented Jul 17, 2025

Thanks for your great work on this! To help us evaluate the impact of this PR, could you please provide the evaluation results (like GSM8K, MMLU, and HellaSwag)?

layer.register_parameter("w2_weight_scale_inv", w2_scales)
set_weight_attrs(w2_scales, extra_weight_attrs)

# The input scale for w1 and w3 should be the same

Just want to confirm: have you checked the contents of the act_scales.safetensors file? Are the input scales for w1 and w3 all consistent?

@chenxijun1029 chenxijun1029 requested a review from kushanam as a code owner July 31, 2025 02:41
@yuhyao
Contributor

yuhyao commented Aug 5, 2025

@chenxijun1029 Nice work! Just wondering if there will be any further updates?
Also, should the file sgl-kernel/tests/test_cutlass_w4a8_moe_mm.py be updated as well?

@zhilingjiang

Nice work!

Co-authored-by: yuhyao <827623970@qq.com>
@chenxijun1029
Contributor Author

Thanks for your great work on this! To help us evaluate the impact of this PR, could you please provide the performance results (like GSM8K, MMLU, and Hellaswag)

Hi, I have updated the accuracy results in the PR, please check.

@AniZpZ
Collaborator

AniZpZ commented Aug 20, 2025

LGTM. Do you have any more suggestions? @yangsijia-serena

@yangsijia-serena
Collaborator

LGTM. Do you have any more suggestions? @yangsijia-serena

LGTM!

@yuhyao yuhyao mentioned this pull request Aug 20, 2025
@zhyncs zhyncs merged commit d4a9384 into sgl-project:main Sep 2, 2025
105 of 113 checks passed
MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025
"compressed" in self.quant_method.__class__.__name__.lower()
and param.data[expert_id] != 1
and (param.data[expert_id] - loaded_weight).abs() > 1e-5
or "w4afp8" in self.quant_config.get_name()

It broke compressed-tensors format MoE models like neuralmagic/Mixtral-8x7B-Instruct-v0.1-FP8; fixed in #10299.
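As background on how a mixed `and`/`or` condition like the one quoted above can misfire: in Python, `and` binds tighter than `or`, so appending `or <new check>` to an existing `and` chain makes the new check fire even when the original guards are false. A made-up illustration of the pitfall (all names are hypothetical; this is not a claim about the exact fix in #10299):

```python
# `A and B or C` parses as `(A and B) or C`, so the appended `or` clause
# triggers regardless of the earlier guards.
def should_take_branch(is_compressed: bool, scale_mismatch: bool, is_w4afp8: bool) -> bool:
    return is_compressed and scale_mismatch or is_w4afp8

print(should_take_branch(False, False, True))  # True: the `or` clause alone decides
```

Parenthesizing the condition explicitly is the usual way to make the intended grouping unambiguous.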

@Bruce-x-1997
Contributor

Hello, thanks for the PR. When I use W4AFP8 with DeepSeek-R1 (https://huggingface.co/Barrrrry/DeepSeek-R1-W4AFP8) and test it on AIME24, it only gets a 10% score. What's your test model that reaches 80% on AIME24? @chenxijun1029

@chenxijun1029
Contributor Author

hello, thanks for the pr, but when I use w4afp8 in deepseek-r1(https://huggingface.co/Barrrrry/DeepSeek-R1-W4AFP8) and test it in aime24, it could only get 10% score so what's your test model, which reach 80% in aime24 @chenxijun1029

I tested it with evalscope, and you should set "max_num_tokens" larger, e.g. 20000, for better results.

@Bruce-x-1997
Contributor

Bruce-x-1997 commented Sep 17, 2025

hello, thanks for the pr, but when I use w4afp8 in deepseek-r1(https://huggingface.co/Barrrrry/DeepSeek-R1-W4AFP8) and test it in aime24, it could only get 10% score so what's your test model, which reach 80% in aime24 @chenxijun1029

I tested it with evalscope, and you should set "max_num_tokens" larger, i.e. 20000, for better performance.

Thanks, I have reached 80% on R1 & R1-0528 (quantized using ModelOpt).
But when I apply the same method to V3.1, the accuracy is unacceptable (10% on AIME24); even for a single AIME24 question tested on the quantized V3.1 at batch size 1, the answer is wrong.
Did you find similar problems? Is there anything we can try? @chenxijun1029

@chenxijun1029
Contributor Author

hello, thanks for the pr, but when I use w4afp8 in deepseek-r1(https://huggingface.co/Barrrrry/DeepSeek-R1-W4AFP8) and test it in aime24, it could only get 10% score so what's your test model, which reach 80% in aime24 @chenxijun1029

I tested it with evalscope, and you should set "max_num_tokens" larger, i.e. 20000, for better performance.

thanks, I have reached 80% in r1 & r1-0528(quant use modelopt) but I found if I use the same method to v3.1 , v3.1 accuracy could not be accepted, in v3.1(10% in aime24), and if we use a question in aime24 to test v3.1(after quant, and batch is 1), its answer is wrong. do you find similar problems?is there anything we can try? @chenxijun1029

Can we share more information via WeChat?

@llc-kc
Contributor

llc-kc commented Nov 13, 2025

@chenxijun1029 Hi, are the TP and EP results evaluated with W4AFP8 and FP8 respectively, or do you compare W4AFP8 against the original FP8 with both in TP mode? Why does W4AFP8 significantly reduce the weight size but not seem to improve ITL? Thanks.

@llc-kc
Contributor

llc-kc commented Nov 13, 2025

@chenxijun1029 do we need an extra launch parameter --quantization w4afp8?

@junliu-mde
Contributor

@chenxijun1029 Hi, are the TP and EP results evaluated by W4AFP8 and FP8 respectively? Do you compare W4AFP8 with the original FP8 both in TP mode? Why does W4AFP8 significantly reduce the weight size but seem not to improve ITL? Thanks.

I guess the dequant process has non-negligible overhead. There should be room for improvement in the kernels.
