
Add DeepSeek V3/R1 shared experts fusion #4918

Merged
zhyncs merged 39 commits into main from support_r1_shared_expers_fusion on Apr 4, 2025

Conversation

@BBuf (Collaborator) commented Mar 30, 2025

Motivation

The idea mainly comes from vllm-project/vllm#15502; thanks to the author for that work. I will add references in the modifications to the grouped_topk function and the DeepSeek V2 model weight_loader. sgl-kernel's moe_align_kernel already supports num_experts > 256 (added in an earlier PR), so it is now easy to fuse the shared experts into the 256 routed experts.
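The gist of the fusion, as a minimal sketch (hypothetical helper and shapes, not the PR's actual code): the shared expert's weights are replicated and appended after the routed experts, so the fused MoE kernel can treat the replicas like ordinary experts.

import torch

def fuse_shared_expert(routed_w: torch.Tensor,  # [n_routed, ...] stacked expert weights
                       shared_w: torch.Tensor,  # [...] one shared expert's weights
                       n_replicas: int) -> torch.Tensor:
    # Append n_replicas copies of the shared expert after the routed experts,
    # e.g. 256 routed experts + tp_size replicas for DeepSeek V3/R1.
    replicas = shared_w.unsqueeze(0).expand(n_replicas, *shared_w.shape)
    return torch.cat([routed_w, replicas], dim=0)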

Conclusion

parser.add_argument(
    "--n-share-experts-fusion",
    type=int,
    default=None,
    help="The number of shared_experts replicas to fuse with the normal experts "
    "in DeepSeek V3/R1; we use tp_size by default.",
)

Setting ServerArgs.n_share_experts_fusion = tp_size yields the maximum benefit. In the test data below, at QPS 4 the throughput improved by 4%, while both TTFT and ITL decreased by approximately 15% to 20%.
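For example, a hypothetical launch that sets the fusion width explicitly (leaving the flag unset falls back to tp_size):

python3 -m sglang.launch_server --model /DeepSeek-V3 --tp 8 --trust-remote-code --n-share-experts-fusion 8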

Acc in H200

➜  sglang git:(support_r1_shared_expers_fusion) ✗ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8                                            
100%|████████████████████████████████████████████████████████████████████████| 1319/1319 [01:08<00:00, 19.14it/s]
Accuracy: 0.952
Invalid: 0.000
Latency: 69.547 s
Output throughput: 1998.856 token/s

Benchmark in H200

random

(benchmark chart image for the random dataset; not reproduced)

share-gpt

QPS | Metric | Baseline (--disable-shared-experts-fusion) | Optimized | Improvement
----|--------|--------------------------------------------|-----------|------------
1 | Total throughput (tok/s) | 483.47 | 485.72 | +0.5%
1 | Mean TTFT (ms) | 949.18 | 664.25 | +30.0%
1 | Mean ITL (ms) | 54.69 | 50.20 | +8.2%
4 | Total throughput (tok/s) | 1088.59 | 1132.73 | +4.0%
4 | Mean TTFT (ms) | 2630.26 | 2144.08 | +18.5%
4 | Mean ITL (ms) | 156.21 | 132.75 | +15.0%
8 | Total throughput (tok/s) | 1188.77 | 1235.63 | +3.9%
8 | Mean TTFT (ms) | 6320.67 | 3443.59 | +45.5%
8 | Mean ITL (ms) | 188.29 | 178.94 | +5.0%

qps=1

python3 -m sglang.launch_server --model /DeepSeek-V3 --tp 8 --trust-remote-code --port 30001 --disable-shared-experts-fusion
python3 -m sglang.bench_serving --backend sglang --num-prompts 300 --request-rate 1 --port 30001

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    1.0       
Max request concurrency:                 not set   
Successful requests:                     300       
Benchmark duration (s):                  322.06    
Total input tokens:                      95293     
Total generated tokens:                  60411     
Total generated tokens (retokenized):    60147     
Request throughput (req/s):              0.93      
Input token throughput (tok/s):          295.89    
Output token throughput (tok/s):         187.58    
Total token throughput (tok/s):          483.47    
Concurrency:                             11.07     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   11879.96  
Median E2E Latency (ms):                 7582.23   
---------------Time to First Token----------------
Mean TTFT (ms):                          949.18    
Median TTFT (ms):                        351.17    
P99 TTFT (ms):                           8154.20   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           54.69     
Median ITL (ms):                         40.97     
P95 ITL (ms):                            53.34     
P99 ITL (ms):                            297.26    
Max ITL (ms):                            7614.16   
==================================================

python3 -m sglang.launch_server --model /DeepSeek-V3 --tp 8 --trust-remote-code --port 30001 
python3 -m sglang.bench_serving --backend sglang --num-prompts 300 --request-rate 1 --port 30001

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    1.0       
Max request concurrency:                 not set   
Successful requests:                     300       
Benchmark duration (s):                  320.56    
Total input tokens:                      95293     
Total generated tokens:                  60411     
Total generated tokens (retokenized):    60132     
Request throughput (req/s):              0.94      
Input token throughput (tok/s):          297.27    
Output token throughput (tok/s):         188.45    
Total token throughput (tok/s):          485.72    
Concurrency:                             10.01     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   10695.58  
Median E2E Latency (ms):                 6544.49   
---------------Time to First Token----------------
Mean TTFT (ms):                          664.25    
Median TTFT (ms):                        313.28    
P99 TTFT (ms):                           5128.61   
---------------Inter-Token Latency----------------
Mean ITL (ms):                           50.20     
Median ITL (ms):                         39.50     
P95 ITL (ms):                            47.16     
P99 ITL (ms):                            268.81    
Max ITL (ms):                            6598.42   
==================================================
  • Total token throughput (tok/s)
    • main: 483.47
    • pr: 485.72 (+0.5%)
  • TTFT
    • main: 949.18 ms
    • pr: 664.25 ms (-30.0%)
  • ITL
    • main: 54.69 ms
    • pr: 50.20 ms (-8.2%)

qps=4

python3 -m sglang.launch_server --model /DeepSeek-V3 --tp 8 --trust-remote-code --port 30001 --disable-shared-experts-fusion
 python3 -m sglang.bench_serving --backend sglang --num-prompts 300 --request-rate 4 --port 30001

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    4.0       
Max request concurrency:                 not set   
Successful requests:                     300       
Benchmark duration (s):                  143.03    
Total input tokens:                      95293     
Total generated tokens:                  60411     
Total generated tokens (retokenized):    60162     
Request throughput (req/s):              2.10      
Input token throughput (tok/s):          666.23    
Output token throughput (tok/s):         422.36    
Total token throughput (tok/s):          1088.59   
Concurrency:                             70.99     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   33845.89  
Median E2E Latency (ms):                 26233.23  
---------------Time to First Token----------------
Mean TTFT (ms):                          2630.26   
Median TTFT (ms):                        997.26    
P99 TTFT (ms):                           13312.57  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           156.21    
Median ITL (ms):                         68.18     
P95 ITL (ms):                            570.93    
P99 ITL (ms):                            908.09    
Max ITL (ms):                            11241.39  
==================================================

python3 -m sglang.launch_server --model /DeepSeek-V3 --tp 8 --trust-remote-code --port 30001 
python3 -m sglang.bench_serving --backend sglang --num-prompts 300 --request-rate 4 --port 30001
============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    4.0       
Max request concurrency:                 not set   
Successful requests:                     300       
Benchmark duration (s):                  137.46    
Total input tokens:                      95293     
Total generated tokens:                  60411     
Total generated tokens (retokenized):    60056     
Request throughput (req/s):              2.18      
Input token throughput (tok/s):          693.25    
Output token throughput (tok/s):         439.48    
Total token throughput (tok/s):          1132.73   
Concurrency:                             62.55     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   28660.35  
Median E2E Latency (ms):                 21592.16  
---------------Time to First Token----------------
Mean TTFT (ms):                          2144.08   
Median TTFT (ms):                        641.09    
P99 TTFT (ms):                           16040.76  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           132.75    
Median ITL (ms):                         66.72     
P95 ITL (ms):                            506.70    
P99 ITL (ms):                            914.90    
Max ITL (ms):                            8829.84   
==================================================
  • Total token throughput (tok/s)
    • main: 1088.59
    • pr: 1132.73 (+4.0%)
  • TTFT
    • main: 2630.26 ms
    • pr: 2144.08 ms (-18.5%)
  • ITL
    • main: 156.21 ms
    • pr: 132.75 ms (-15.0%)

qps=8

python3 -m sglang.launch_server --model /DeepSeek-V3 --tp 8 --trust-remote-code --port 30001 --disable-shared-experts-fusion
python3 -m sglang.bench_serving --backend sglang --num-prompts 300 --request-rate 8 --port 30001

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    8.0       
Max request concurrency:                 not set   
Successful requests:                     300       
Benchmark duration (s):                  130.98    
Total input tokens:                      95293     
Total generated tokens:                  60411     
Total generated tokens (retokenized):    60151     
Request throughput (req/s):              2.29      
Input token throughput (tok/s):          727.54    
Output token throughput (tok/s):         461.23    
Total token throughput (tok/s):          1188.77   
Concurrency:                             100.71    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   43969.47  
Median E2E Latency (ms):                 41886.32  
---------------Time to First Token----------------
Mean TTFT (ms):                          6320.67   
Median TTFT (ms):                        5567.51   
P99 TTFT (ms):                           17773.40  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           188.29    
Median ITL (ms):                         67.75     
P95 ITL (ms):                            490.78    
P99 ITL (ms):                            1261.62   
Max ITL (ms):                            11864.62  
==================================================

python3 -m sglang.launch_server --model /DeepSeek-V3 --tp 8 --trust-remote-code --port 30001 
python3 -m sglang.bench_serving --backend sglang --num-prompts 300 --request-rate 8 --port 30001

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    8.0       
Max request concurrency:                 not set   
Successful requests:                     300       
Benchmark duration (s):                  126.01    
Total input tokens:                      95293     
Total generated tokens:                  60411     
Total generated tokens (retokenized):    60139     
Request throughput (req/s):              2.38      
Input token throughput (tok/s):          756.23    
Output token throughput (tok/s):         479.41    
Total token throughput (tok/s):          1235.63   
Concurrency:                             93.33     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   39204.10  
Median E2E Latency (ms):                 37838.59  
---------------Time to First Token----------------
Mean TTFT (ms):                          3443.59   
Median TTFT (ms):                        2325.78   
P99 TTFT (ms):                           14249.98  
---------------Inter-Token Latency----------------
Mean ITL (ms):                           178.94    
Median ITL (ms):                         67.67     
P95 ITL (ms):                            581.96    
P99 ITL (ms):                            1626.09   
Max ITL (ms):                            7362.30   
==================================================
  • Total token throughput (tok/s)
    • main: 1188.77
    • pr: 1235.63 (+3.9%)
  • TTFT
    • main: 6320.67 ms
    • pr: 3443.59 ms (-45.5%)
  • ITL
    • main: 188.29 ms
    • pr: 178.94 ms (-5.0%)

@fzyzcjy changed the title from "add deepseek v3/r1 shared expers fusion" to "add deepseek v3/r1 shared experts fusion" on Mar 31, 2025
@fzyzcjy changed the title from "add deepseek v3/r1 shared experts fusion" to "Add DeepSeek V3/R1 shared experts fusion" on Mar 31, 2025
@fzyzcjy (Collaborator) left a comment:

Just some nits

tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)  # [n, e]
topk_weights, topk_ids = torch.topk(tmp_scores, k=topk, dim=-1, sorted=False)
if share_fusion:
    topk_ids[:, -1] = torch.randint(low=num_experts,
nit: wondering whether randint will be a little bit slower; shall we use something like round-robin?
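For comparison, a round-robin version could look roughly like this (a hedged sketch with hypothetical names, not the PR's code); a fixed stride avoids the RNG launch, at the cost of a probably negligible correlation between token position and replica choice:

import torch

def assign_fused_slot_round_robin_(topk_ids: torch.Tensor, num_experts: int,
                                   share_fusion: int) -> None:
    # Deterministically cycle tokens over the fused shared-expert replicas
    # (ids >= num_experts) instead of sampling with torch.randint.
    n_tokens = topk_ids.shape[0]
    slots = torch.arange(n_tokens, device=topk_ids.device) % share_fusion
    topk_ids[:, -1] = num_experts + slots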

f"model.layers.{moe_layer}."
f"mlp.experts."
f"{self.config.n_routed_experts + num_repeat}"
f".{suffix}", weights_dict[
nit: is it possible to remove the original shared_experts weights after this? Then we save a bit of memory by never loading that shared_expert, and we can remove the logic of

if self.n_shared_experts is not None and self.n_share_fusion_experts == 0:
    shared_output = self.shared_experts(...)

above, since it is now directly None.
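Under that suggestion, the call site could collapse to something like the following (a hedged sketch, assuming self.shared_experts is simply never constructed when fusion is enabled):

# Hypothetical simplification: with fusion on, self.shared_experts is None,
# so no extra flag check is needed.
shared_output = (
    self.shared_experts(hidden_states)
    if self.shared_experts is not None
    else None
)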

renormalize: bool,
num_expert_group: int = 0,
topk_group: int = 0,
share_fusion: int = 0,

The naming is a little bit confusing. Is it identical to n_share_fusion_experts in the previous files? Why do we need a different name?

@BBuf (Collaborator, Author) replied:
Yeah, I'll handle it.

E = config.n_routed_experts
n_share_fusion_experts = int(os.getenv("SHARE_EXPERTS_FUSION_REPLICA", "0"))
if n_share_fusion_experts > 0:
    E = E + n_share_fusion_experts
DeepSeek-V2 has 2 shared experts. Should we multiply the number of replicas by the number of shared experts?
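In other words, the quoted snippet's sizing would become (a hypothetical sketch of the proposal; config.n_shared_experts is 2 for DeepSeek-V2):

# Hypothetical: scale the replica count by the number of shared experts.
E = config.n_routed_experts + config.n_shared_experts * n_share_fusion_experts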

tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)  # [n, e]
topk_weights, topk_ids = torch.topk(tmp_scores, k=topk, dim=-1, sorted=False)
if share_fusion:
    topk_ids[:, -1] = torch.randint(low=num_experts,
We can call torch.randint(..., out=topk_ids[...]) to save a data copy.
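A minimal sketch of that suggestion (assuming topk_ids is an int64 tensor and the fused replicas occupy ids in [num_experts, num_experts + share_fusion)):

import torch

def fill_fused_slot_(topk_ids: torch.Tensor, num_experts: int,
                     share_fusion: int) -> None:
    # Generate the replica ids directly into the last top-k column; out=
    # reuses the existing storage instead of allocating a temporary tensor.
    torch.randint(
        low=num_experts,
        high=num_experts + share_fusion,
        size=(topk_ids.shape[0],),
        out=topk_ids[:, -1],
    )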

@ch-wan (Collaborator) commented Mar 31, 2025

@BBuf I have finished my review. My major concern is how it is compatible with DeepSeek-V2, where the model has 2 shared experts.

@BBuf (Collaborator, Author) commented Mar 31, 2025

> @BBuf I have finished my review. My major concern is how it is compatible with DeepSeek-V2, where the model has 2 shared experts.

Thank you. Currently, I am facing an issue with garbled output with this PR. I will address your feedback once I debug the issue and identify the cause.

@fzyzcjy (Collaborator) commented Mar 31, 2025

> Currently, I am facing an issue with garbled output with this PR.

Wondering whether the "gets a bit confused about the numbers" comment above may be a little bit related (again, I just eyeballed it, so I could be wildly wrong).

@BBuf (Collaborator, Author) commented Mar 31, 2025

> Currently, I am facing an issue with garbled output with this PR.
>
> Wondering whether the "gets a bit confused about the numbers" comment above may be a little bit related (again, I just eyeballed it, so I could be wildly wrong).

No, it's because the share_fusion logic is not handled in the biased_grouped_topk function; I had assumed the grouped_topk function was being used rather than biased_grouped_topk. I will check later when the H200 is free.
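For reference, a minimal sketch of the missing branch (assuming biased_grouped_topk mirrors grouped_topk's share_fusion handling; this is not the PR's actual patch):

import torch

def apply_share_fusion_(topk_ids: torch.Tensor, num_experts: int,
                        share_fusion: int) -> None:
    # Mirror the grouped_topk handling: point each token's last top-k slot at
    # one of the fused shared-expert replicas (ids >= num_experts).
    if share_fusion:
        topk_ids[:, -1] = torch.randint(
            low=num_experts,
            high=num_experts + share_fusion,
            size=(topk_ids.shape[0],),
            dtype=topk_ids.dtype,
            device=topk_ids.device,
        )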


@fzyzcjy (Collaborator) commented Mar 31, 2025

Ah I see

@zhyncs merged commit 924ca7c into main on Apr 4, 2025
12 of 25 checks passed
@zhyncs deleted the support_r1_shared_expers_fusion branch on April 4, 2025 at 08:59
xihuai18 pushed a commit to xihuai18/sglang that referenced this pull request Apr 4, 2025
@xihuai18 (Contributor) commented Apr 4, 2025

Is this PR compatible with nextn?

@xihuai18 (Contributor) commented Apr 5, 2025

> Is this PR compatible with nextn?

It seems not:

[2025-04-05 14:23:25 TP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/path/to/your/sglang/python/sglang/srt/managers/scheduler.py", line 1993, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/your/sglang/python/sglang/srt/managers/scheduler.py", line 261, in __init__
    self.draft_worker = EAGLEWorker(
                        ^^^^^^^^^^^^
  File "/path/to/your/sglang/python/sglang/srt/speculative/eagle_worker.py", line 104, in __init__
    super().__init__(
  File "/path/to/your/sglang/python/sglang/srt/managers/tp_worker.py", line 74, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/path/to/your/sglang/python/sglang/srt/model_executor/model_runner.py", line 170, in __init__
    self.initialize(min_per_gpu_memory)
  File "/path/to/your/sglang/python/sglang/srt/model_executor/model_runner.py", line 180, in initialize
    self.load_model()
  File "/path/to/your/sglang/python/sglang/srt/model_executor/model_runner.py", line 392, in load_model
    self.model = get_model(
                 ^^^^^^^^^^
  File "/path/to/your/sglang/python/sglang/srt/model_loader/__init__.py", line 22, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/path/to/your/sglang/python/sglang/srt/model_loader/loader.py", line 371, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/path/to/your/sglang/python/sglang/srt/models/deepseek_nextn.py", line 259, in load_weights
    param = params_dict[name]
            ~~~~~~~~~~~^^^^^^
KeyError: 'model.decoder.mlp.shared_experts.down_proj.weight'

gpu_mem = None

if is_hip():
    self.disable_shared_experts_fusion = True
@yiakwy-xpu-ml-framework-team (Contributor) commented Apr 5, 2025

We should add this to the documentation and the to-do list (tuning for MI300X is easy to forget).

@BBuf

@yiakwy-xpu-ml-framework-team (Contributor) commented:

> Is this PR compatible with nextn?
>
> It seems not (traceback above, ending in KeyError: 'model.decoder.mlp.shared_experts.down_proj.weight').

@xihuai18 yep, the weight names have been updated from model.layers.{moe_layer}.mlp.shared_experts.{suffix} to model.layers.{moe_layer}.mlp.experts.{n_routed_experts + shared_expert_id}.{suffix}.

It can be fixed very quickly.

Better to set up a tracking list and dependency list:

weight name mangling: model (deepseekv2), nextn (optimization module), ... should be checked.
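For illustration, a hedged sketch of the rename a nextn loader would need to perform (helper name and arguments are hypothetical, based on the mapping described above, not the actual fix in #5143):

def remap_shared_expert_name(name: str, moe_layer: int,
                             n_routed_experts: int, shared_expert_id: int) -> str:
    # Map a pre-fusion shared-experts weight name onto its fused expert slot.
    old_prefix = f"model.layers.{moe_layer}.mlp.shared_experts."
    new_prefix = (f"model.layers.{moe_layer}.mlp.experts."
                  f"{n_routed_experts + shared_expert_id}.")
    return name.replace(old_prefix, new_prefix, 1) if name.startswith(old_prefix) else name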

@lambert0312 (Contributor) commented:

> @xihuai18 yep, the weight names have been updated from model.layers.{moe_layer}.mlp.shared_experts.{suffix} to model.layers.{moe_layer}.mlp.experts.{n_routed_experts + shared_expert_id}.{suffix}.
>
> It can be fixed very quickly.
>
> Better to set up a tracking list and dependency list: weight name mangling: model (deepseekv2), nextn (optimization module), ... should be checked.

@xihuai18 @yiakwy-xpu-ml-framework-team I also encountered this problem. I modified a version according to the same solution and submitted a PR: #5143

finger92 pushed a commit to protagolabs/sglang that referenced this pull request Apr 10, 2025
thyecust pushed a commit to thyecust/sglang that referenced this pull request Apr 11, 2025
jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025
pi314ever pushed a commit to pi314ever/sglang that referenced this pull request Apr 23, 2025
@dongyibo commented:

> @BBuf I have finished my review. My major concern is how it is compatible with DeepSeek-V2, where the model has 2 shared experts.

Hello, how can I integrate the two shared experts? For example, DeepSeek-V2 has two shared experts. @ch-wan @BBuf
