Support DeepSeek-R1 w4a8 low latency deepep #8311
Closed
ayrnb wants to merge 95 commits into sgl-project:main from bytedance-iaas:feat/w4a8_support_ll_deepep
Commits (95; changes shown from 3 commits)
1e53e23 support w4a8 low latency deepep (ayrnb)
93cb396 clean code (ayrnb)
31d01f9 clean code (ayrnb)
157b979 clean code (ayrnb)
5dd0f87 [bug] fix pd completion protocol for batching support (#8317) (slin1237)
f6e07f2 [router] fix pd model completion request (#8303) (slin1237)
bfb118c fix bug when eos_ids==0 (#8315) (bzantium)
2f86f3a [router] add endpoint unit test (#8298) (slin1237)
a167fd0 [code style] Clean dead triton kernel code in fused_moe and useless v… (BBuf)
96c5d85 fix (ayrnb)
0090240 fix (ayrnb)
8d1c5b9 chore: upgrade flashinfer v0.2.9rc1 (#8301) (Swipe4057)
33c4b4d [router] add streaming unit test (#8299) (slin1237)
39fe1e8 [router] add request format unit test (#8300) (slin1237)
145482f HiCache Storage TP Refinement (#8307) (xiezhq-hermann)
d40846d breakdown kernel update (#8334) (xiezhq-hermann)
f4674df support idle batch for TBO (#8233) (sherry-1001)
28d4d47 [Feature] Integrate quick allreduce and select the best allreduce imp… (lihaoyang-amd)
c0fb25e DP Enhancement (#8280) (ch-wan)
7ad6b76 fix: Fix failed functional tests https://github.com/meta-llama/llama-… (ynwang007)
af4b9ba [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_qui… (hubertlu-tw)
15d2759 [CPU] Add tutorial docs for SGL on CPU (#8000) (ZailiWang)
70e37b9 chore: upgrade mooncake 0.3.5 (#8341) (ShangmingCai)
9045cc1 [torch.compile bug] avoid biased_grouped_topk_impl func repeatedly tr… (BBuf)
1b9cea5 [P/D] Support ipv6 in P/D scenario (#7858) (thefacetakt)
12cb760 Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-… (Xu-Wenqing)
f8260f2 [Bugfix][Feat] Add XML-ish grammar in EBNFComposer and fix misc bugs … (CatherineSue)
ed2e313 Clean up server_args, triton cache manager (#8332) (merrymercy)
7181ec8 fix: upgrade nccl version (#8359) (zhyncs)
d8ee156 [Feat] Add reasoning parser for Qwen/Qwen3-235B-A22B-Thinking-2507 (#… (CatherineSue)
f8ca236 fix: kimi k2 xgrammar crash (#8367) (zhyncs)
58c468f Fix FP4 MoE accuracy from missing routed_scaling_factor (#8333) (trevor-m)
3ec0b21 [CI] Fix flaky threshold (#8370) (merrymercy)
2272c2a chore: bump v0.4.9.post4 (#8305) (zhyncs)
8af145b Fix test_moe_fused_gate_combined sgl-kernel ci test (#8374) (ispobock)
e6312d2 Uodate Dockerfile.gb200 to latest sglang (#8356) (kyleliang-nv)
4fa44d6 chore: improve mmmu benchmark (#7000) (mickqian)
e236d8f Save peak memory in logits processor (#8343) (ch-wan)
ce32bc2 Extract update_weights from RL Engine to SGLang to keep simplicity an… (hebiao064)
5347567 chore: improvements on mm_utils (#7737) (mickqian)
3212c2a vlm: optimize tensor transport (#6003) (mickqian)
da0c026 Tiny assert EPLB is used together with expert parallel (#8381) (fzyzcjy)
b7094a5 model: support intern-s1 (#8350) (RunningLeon)
5c705b1 Add perf tests for LoRA (#8314) (lifuhuang)
7615463 Remove slot usage in code to be backward-compatible with python 3.9 (… (lifuhuang)
62a6b7c Add docker release flow for gb200 (#8394) (kyleliang-nv)
528bd1e HiCache, check before terminate prefetching (#8372) (xiezhq-hermann)
426b749 Add nvfp4 scaled mm benchmark. (#8401) (HydraQYH)
b602f42 Urgent Fix: intern-s1 chat-template matching (#8403) (JustinTong0323)
ed0fdbf Tool to dump and compare internal activation tensors (#7976) (fzyzcjy)
62222bd Minor tool for comparison of benchmark results (#7974) (fzyzcjy)
e34cf6a Fix bench script making input data on L2 cache (#7739) (fzyzcjy)
85486b6 [NVIDIA] Add Flashinfer MoE blockscale fp8 backend (#8036) (kaixih)
91e3d15 Update Cutlass in sgl-kernel to v4.1 (#8392) (Fridge003)
0bcc195 fix: minor fix TransportProxyTensor under tp (#8382) (mickqian)
2ab9702 [router] add different policies for p node and d node (#8395) (slin1237)
2a1936d Add A800 fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-In… (lambert0312)
36d6f0b fix: fix the missing metrics on non-rank0 nodes (#7720) (acelyc111)
bf0f448 [2/N] MoE Refactor: Unify weight loader and quant methods (#8397) (ch-wan)
5c9c275 Use FlashInfer FP4 gemm. (#8241) (elfiegg)
44d600c Support precomputed_embeddings for Llama 4 (#8156) (AlienKevin)
4d921f2 [hotfix] fix merge conflicts in FlashInferEPMoE (#8405) (ch-wan)
bf3352c chore: update CODEOWNERS (#8407) (zhyncs)
10ee895 chore: upgrade flashinfer v0.2.9rc2 (#8406) (zhyncs)
b3eac16 Support triton kernels v3.4.0 for fused_moe (#8258) (yuan-luo)
22e00ee [Bugfix] Prevent PD server crash from invalid grammar (#8062) (ShangmingCai)
95217a9 Change to use native arm runner (#8414) (kyleliang-nv)
df90645 Support overlapped lora updates (#8213) (lifuhuang)
b58c3c2 Support ue8m0 for triton quant kernel (#7603) (fzyzcjy)
e983d66 Fix: Improve test_openai_function_calling unit test and fix reasoning… (byjiang1996)
b47eda3 bugfix: Fix multiple finish_reason chunks and tool_calls finish reaso… (CatherineSue)
58dd95f Fix test_openai_server (#8419) (CatherineSue)
bb81dae Fix docker buildx push error (#8425) (kyleliang-nv)
dd487e5 bugfix: Fix XGrammar backend to use model's EOS tokens for constraine… (CatherineSue)
fe6a445 [router] improve router logs and request id header (#8415) (slin1237)
2810338 [feat] Support different attention backends for prefill and decode (… (Qiaolin-Yu)
4ad9737 chore: bump transformer to 4.54.0 (#8416) (hebiao064)
2fd5c70 [PD] Fix abort_request for PD disaggregation (#8352) (ShangmingCai)
6d6a8bc GLM-4.5 Model Support (#8224) (zRzRzRzRzRzRzR)
5922c0c Remove zstd compression for building Dockerfile.gb200 (#8442) (kyleliang-nv)
484d0e0 doc: add bench_one_batch_server in the benchmark doc (#8441) (Qiaolin-Yu)
581e7dc GLM-4.5 Model Support Follow-up (#8445) (byjiang1996)
25f73c6 fix GLM4_MOE launch with compressed_tensor quant model (#8456) (zminglei)
fb4ce17 Fix per_token_group_quant_8bit when hidden_dim // group_size is not d… (strgrb)
2262369 Revert "[kernel] opt moe align block kernel by block/warp scan algori… (BBuf)
45bc170 chore: bump v0.4.9.post5 (#8458) (zhyncs)
a9dd3ec fix:reorder topk experts to ensure shared expert replaces minimal sco… (erictanjn)
712877a support w4a8 low latency deepep (ayrnb)
77351b7 clean code (ayrnb)
c15e34a clean code (ayrnb)
f770ea6 clean code (ayrnb)
cfe7d62 fix (ayrnb)
d2afdb4 fix (ayrnb)
eb39568 Merge branch 'feat/w4a8_support_ll_deepep' of github.com:bytedance-ia… (ayrnb)
1e721d4 support cudagraph (ayrnb)
Diff view

```diff
@@ -11,6 +11,7 @@
 )

 from sglang.srt.layers.moe.ep_moe.kernels import (
+    deepep_ll_get_cutlass_w4a8_moe_mm_data,
     post_reorder_triton_kernel,
     pre_reorder_triton_kernel_for_cutlass_moe,
     run_cutlass_moe_ep_preproess,
@@ -43,6 +44,7 @@ def cutlass_w4a8_moe(
     a1_scale: Optional[torch.Tensor] = None,
     a2_scale: Optional[torch.Tensor] = None,
     apply_router_weight_on_input: bool = False,
+    ep_mode: str = "ep",
 ) -> torch.Tensor:
     """
     This function computes a w4a8-quantized Mixture of Experts (MoE) layer
@@ -83,10 +85,14 @@ def cutlass_w4a8_moe(
     Returns:
     - torch.Tensor: The fp8 output tensor after applying the MoE layer.
     """
-    assert topk_weights.shape == topk_ids_.shape, "topk shape mismatch"
+    assert (
+        topk_weights.shape == topk_ids_.shape if topk_weights is not None else True
+    ), "topk shape mismatch"
     assert w1_q.dtype == torch.int8
     assert w2_q.dtype == torch.int8
-    assert a.shape[1] // 2 == w1_q.shape[2], "Hidden size mismatch w1"
+    assert (
+        a.shape[1] // 2 == w1_q.shape[2] if ep_mode != "deepep_ll" else True
+    ), "Hidden size mismatch w1"
     assert w1_q.shape[2] * 2 == w2_q.shape[1], "Hidden size mismatch w2"
     assert w1_q.shape[0] == w2_q.shape[0], "Expert number mismatch"
     assert w1_q.shape[0] == w1_scale.shape[0], "w1 scales expert number mismatch"
@@ -108,52 +114,79 @@ def cutlass_w4a8_moe(
     m = a.size(0)
     k = w1_q.size(2) * 2  # w1_q is transposed and packed
     n = w2_q.size(2) * 2  # w2_q is transposed and packed
-    topk = topk_ids_.size(1)
+    topk = topk_ids_.size(1) if ep_mode == "ep" else 1

     if apply_router_weight_on_input:
         assert topk == 1, "apply_router_weight_on_input is only implemented for topk=1"

     device = a.device

-    _, src2dst, _ = run_cutlass_moe_ep_preproess(
-        local_topk_ids,
-        num_experts,
-    )
+    if ep_mode == "ep":
+        _, src2dst, _ = run_cutlass_moe_ep_preproess(
+            local_topk_ids,
+            num_experts,
+        )

-    gateup_input = torch.empty(
-        (m * topk, k),
-        device=device,
-        dtype=torch.float8_e4m3fn,
-    )
+        gateup_input = torch.empty(
+            (m * topk, k),
+            device=device,
+            dtype=torch.float8_e4m3fn,
+        )

-    pre_reorder_triton_kernel_for_cutlass_moe[(m,)](
-        a,
-        gateup_input,
-        src2dst,
-        local_topk_ids,
-        a1_scale,
-        total_num_experts,
-        topk,
-        k,
-        BLOCK_SIZE=512,
-    )
+        pre_reorder_triton_kernel_for_cutlass_moe[(m,)](
+            a,
+            gateup_input,
+            src2dst,
+            local_topk_ids,
+            a1_scale,
+            total_num_experts,
+            topk,
+            k,
+            BLOCK_SIZE=512,
+        )
+    elif ep_mode == "deepep_ll":
+        num_tokens = a.size(1)
+    else:
+        raise ValueError(f"Invalid ep_mode: {ep_mode}")

     # NOTE: a_map and c_map are not used in the get_cutlass_w4a8_moe_mm_data kernel,
     # they are kept to allow for a quick switch of the permutation logic
     # from the current triton kernel implementation to the cutlass-based one if needed.
-    a_map = torch.empty((local_topk_ids.numel()), dtype=torch.int32, device=device)
-    c_map = torch.empty((local_topk_ids.numel()), dtype=torch.int32, device=device)
-    get_cutlass_w4a8_moe_mm_data(
-        local_topk_ids,
-        expert_offsets,
-        problem_sizes1,
-        problem_sizes2,
-        a_map,
-        c_map,
-        num_experts,
-        n,
-        k,
-    )
+    if ep_mode == "deepep_ll":
+        gateup_input_origin, expert_offsets, problem_sizes1, problem_sizes2 = (
+            deepep_ll_get_cutlass_w4a8_moe_mm_data(
+                a,
+                local_topk_ids,
+                expert_offsets,
+                problem_sizes1,
+                problem_sizes2,
+                num_experts,
+                n,
+                k,
+            )
+        )
+        gateup_input = torch.empty(
+            gateup_input_origin.shape, dtype=torch.float8_e4m3fn, device=device
+        )
+        sgl_per_tensor_quant_fp8(
+            gateup_input_origin, gateup_input, a1_scale.float(), True
+        )
+    else:
+        a_map = torch.empty((local_topk_ids.numel()), dtype=torch.int32, device=device)
+        c_map = torch.empty((local_topk_ids.numel()), dtype=torch.int32, device=device)
+        get_cutlass_w4a8_moe_mm_data(
+            local_topk_ids,
+            expert_offsets,
+            problem_sizes1,
+            problem_sizes2,
+            a_map,
+            c_map,
+            num_experts,
+            n,
+            k,
+        )

     c1 = torch.empty((m * topk, n * 2), device=device, dtype=torch.half)
     c2 = torch.zeros((m * topk, k), device=device, dtype=torch.half)
@@ -197,19 +230,33 @@ def cutlass_w4a8_moe(
         128,
         topk,
     )

-    output = torch.empty_like(a)
-    post_reorder_triton_kernel[(m,)](
-        c2,
-        output,
-        src2dst,
-        topk_ids_,
-        topk_weights,
-        start_expert_id,
-        end_expert_id,
-        topk,
-        k,
-        0,
-        BLOCK_SIZE=512,
-    )
+    if ep_mode == "ep":
+        output = torch.empty_like(a)
+        post_reorder_triton_kernel[(m,)](
+            c2,
+            output,
+            src2dst,
+            local_topk_ids,
+            topk_weights,
+            start_expert_id,
+            end_expert_id,
+            topk,
+            k,
+            0,
+            BLOCK_SIZE=512,
+        )
+    elif ep_mode == "deepep_ll":
+        output = torch.zeros(
+            (len(local_topk_ids), num_tokens, k), device=device, dtype=c2.dtype
+        )
+        non_zero_indices = torch.nonzero(local_topk_ids, as_tuple=True)[0]
+        c2_index = 0
+        for expert_idx in non_zero_indices:
+            num_non_zero_rows = local_topk_ids[expert_idx].item()
+            output[expert_idx, :num_non_zero_rows] = c2[
+                c2_index : c2_index + num_non_zero_rows
+            ]
+            c2_index += num_non_zero_rows
+    else:
+        output = c2

     return output
```
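The `deepep_ll` output branch above packs each active expert's rows contiguously in `c2` and scatters them into a dense `[num_local_experts, num_tokens, k]` buffer, where `local_topk_ids[e]` holds the number of rows routed to expert `e` (my reading of the diff). A minimal NumPy stand-in for that scatter, with hypothetical helper name and toy sizes:

```python
import numpy as np

def scatter_expert_outputs(c2, local_topk_ids, num_tokens, k):
    """Spread packed per-expert rows (c2) into a dense
    [num_local_experts, num_tokens, k] buffer, mirroring the
    deepep_ll branch in the diff (NumPy stand-in for torch)."""
    num_experts = len(local_topk_ids)
    output = np.zeros((num_experts, num_tokens, k), dtype=c2.dtype)
    c2_index = 0
    for expert_idx in np.nonzero(local_topk_ids)[0]:
        rows = int(local_topk_ids[expert_idx])  # rows routed to this expert
        output[expert_idx, :rows] = c2[c2_index : c2_index + rows]
        c2_index += rows
    return output

# 3 local experts; expert 0 received 2 rows, expert 1 none, expert 2 one.
counts = np.array([2, 0, 1])
c2 = np.arange(12, dtype=np.float32).reshape(3, 4)  # 3 packed rows, k=4
out = scatter_expert_outputs(c2, counts, num_tokens=2, k=4)
```

Experts with a zero count keep all-zero slots, which is why the torch version allocates `output` with `torch.zeros` rather than `torch.empty`.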
Review comment (Contributor, on lines +264 to +265):

This Python loop iterates over active experts to scatter the results. For performance-critical code running on a GPU, this can be a bottleneck due to the overhead of launching multiple operations from a Python loop. Consider vectorizing this operation or using a custom kernel for a more efficient implementation.
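One way to follow the reviewer's suggestion is to replace the per-expert loop with a single fancy-index assignment, sketched below with NumPy (torch offers the same `repeat_interleave`/`cumsum` primitives). The function name and toy data are assumptions, not the PR's actual fix:

```python
import numpy as np

def scatter_vectorized(c2, counts, num_tokens, k):
    """Vectorized alternative to the per-expert Python loop:
    compute an (expert, row) index pair for every packed row of c2,
    then perform one fancy-index assignment instead of one slice
    assignment per active expert."""
    num_experts = len(counts)
    output = np.zeros((num_experts, num_tokens, k), dtype=c2.dtype)
    total = int(counts.sum())
    # expert id owning each packed row of c2
    expert_ids = np.repeat(np.arange(num_experts), counts)
    # row offset within each expert's slot: global row index minus
    # the start offset of that expert's packed block
    starts = np.cumsum(counts) - counts
    row_ids = np.arange(total) - np.repeat(starts, counts)
    output[expert_ids, row_ids] = c2[:total]
    return output

counts = np.array([2, 0, 1])
c2 = np.arange(12, dtype=np.float32).reshape(3, 4)
out = scatter_vectorized(c2, counts, num_tokens=2, k=4)
```

This launches a constant number of operations regardless of the number of active experts, which is the property the reviewer is asking for; a custom kernel would go further by fusing the index computation and the copy.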