
Enhance w4afp8 performance: implement per-token w4afp8 CUTLASS MoE GEMM for FP8 dispatch, improve performance with w4afp8 moe gemm#18144

Closed
Wangzheee wants to merge 0 commits into sgl-project:main from Wangzheee:w4afp8_per-token-kernel

Conversation


@Wangzheee Wangzheee commented Feb 3, 2026

Motivation

This pull request (PR) refactors the functionality of PR-7762 and significantly enhances performance:

  1. Resolved the limitation that w4afp8 with DeepEP could only dispatch bf16 tokens
  • Reimplemented the GEMM pipeline with per-token quantization granularity, enabling dispatch of fp8 tokens
  2. Improved the performance of the w4afp8 MoE GEMM
  • Added a pipeline for the A scale (per-token)

  • Optimized the pipeline

  • Replaced LDS with LDSM for loading the W4 weights

  • Estimated the shape of the actual activation in MoE to optimize the block tile shape
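The per-token dispatch idea can be illustrated with a minimal Python sketch (the function name `quantize_per_token` is hypothetical, not the actual CUDA kernel): each token row gets its own scale so that activations fit the FP8 e4m3 range (max 448) before dispatch, instead of falling back to bf16.

```python
# Illustrative sketch of per-token FP8 (e4m3) quantization for MoE dispatch.
# Hypothetical helper for explanation only -- the real path is a CUDA kernel.
FP8_E4M3_MAX = 448.0

def quantize_per_token(activations):
    """Quantize each token's row with its own scale.

    activations: list of rows (one per token), each a list of floats.
    Returns (quantized rows, per-token scales); dequantize as q * scale.
    """
    q_rows, scales = [], []
    for row in activations:
        amax = max(abs(x) for x in row) or 1.0
        scale = amax / FP8_E4M3_MAX               # one scale per token
        q_rows.append([x / scale for x in row])   # values now fit the e4m3 range
        scales.append(scale)
    return q_rows, scales
```

The per-token scales travel with the dispatched tokens and are consumed by the A-scale pipeline stage inside the grouped GEMM.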

How to use w4afp8

  1. Merge this PR into your sglang development branch
  2. Use a w4afp8 model

Modifications

  • w4afp8 moe gemm kernel
  • moe layer
  • test

Accuracy Tests

Benchmarking and Profiling

Summary

w4afp8 (optimized) vs. fp8
w4afp8 (optimized) vs. w4afp8 (previous)

1. EP (ep-size=8)

| batch-size | MoE GEMM performance improvement | TPOT performance improvement |
|---|---|---|
| 32 | 90% | 50% ~ 60% |
| 64 | 80% | 50% |

2. TP (tp-size=8)

| batch-size | MoE GEMM performance improvement | TPOT performance improvement |
|---|---|---|
| 32 | 60% | 5% |
| 64 | 50% | 5% |

Detail data

  • Max request concurrency: 32
  • Successful requests: 300
  • Total input tokens: 1,228,500
  • Total input text tokens: 1,228,500
  • Total generated tokens: 307,200
  • ep-size: 8
Wint4Afp8 (optimized) = per-token + pipeline optimized + tile shape optimized.

| Metric | fp8, dp-size=1 | fp8, dp-size=2 | Wint4Afp8 (sglang previous), dp-size=1 | Wint4Afp8 (sglang previous), dp-size=2 | Wint4Afp8 (optimized), dp-size=1 | Wint4Afp8 (optimized), dp-size=2 |
|---|---|---|---|---|---|---|
| Request throughput (req/s) | 0.42 | 0.47 | 0.43 | 0.47 | 0.60 | 0.69 |
| Input token throughput (tok/s) | 1737 | 1906 | 1747 | 1924 | 2453 | 2821 |
| Output token throughput (tok/s) | 434 | 476 | 436 | 481 | 613 | 706 |
| Total token throughput (tok/s) | 2172 | 2382 | 2184 | 2405 | 3066 | 3527 |
| Mean E2E Latency (ms) | 72335 | 65756 | 71608 | 64868 | 51170 | 44294 |
| Mean ITL (ms) | 64.94 | 58.82 | 63.68 | 57.45 | 43.02 | 36.44 |
| Mean TTFT (ms) | 5899 | 5588 | 6460 | 6096 | 7161 | 7020 |
| Mean TPOT (ms) | 64.94 | 58.81 | 63.68 | 57.45 | 43.02 | 36.44 |
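The Summary percentages can be cross-checked against the detail rows. For example, the TPOT gain of the optimized kernel over fp8 at concurrency 32 (ep-size=8, dp-size=1):

```python
# TPOT improvement of the optimized w4afp8 kernel over fp8, using the
# ep-size=8, concurrency-32, dp-size=1 values from the table above.
fp8_tpot = 64.94       # ms
w4afp8_tpot = 43.02    # ms
improvement = fp8_tpot / w4afp8_tpot - 1.0
print(f"{improvement:.0%}")  # ~51%, consistent with the "50% ~ 60%" summary row
```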
  • Max request concurrency: 64
  • Successful requests: 300
  • Total input tokens: 1,228,500
  • Total input text tokens: 1,228,500
  • Total generated tokens: 307,200
  • ep-size: 8
Wint4Afp8 (optimized) = per-token + pipeline optimized + tile shape optimized.

| Metric | fp8, dp-size=1 | fp8, dp-size=2 | Wint4Afp8 (sglang previous), dp-size=1 | Wint4Afp8 (sglang previous), dp-size=2 | Wint4Afp8 (optimized), dp-size=1 | Wint4Afp8 (optimized), dp-size=2 |
|---|---|---|---|---|---|---|
| Request throughput (req/s) | 0.43 | 0.68 | 0.43 | 0.71 | 0.60 | 0.93 |
| Input token throughput (tok/s) | 1769 | 2785 | 1771 | 2894 | 2460 | 3825 |
| Output token throughput (tok/s) | 442 | 696 | 443 | 723 | 615 | 956 |
| Total token throughput (tok/s) | 2212 | 3481 | 2214 | 3618 | 3075 | 4781 |
| Mean E2E Latency (ms) | 136359 | 87896 | 134694 | 84463 | 97293 | 63847 |
| Mean ITL (ms) | 68.70 | 76.53 | 67.26 | 72.44 | 46.26 | 51.15 |
| Mean TTFT (ms) | 66077 | 9603 | 65886 | 10357 | 49970 | 11524 |
| Mean TPOT (ms) | 68.70 | 76.53 | 67.26 | 72.44 | 46.26 | 51.15 |
  • Max request concurrency: 32
  • Successful requests: 300
  • Total input tokens: 1,228,500
  • Total input text tokens: 1,228,500
  • Total generated tokens: 307,200
  • tp-size: 8
Wint4Afp8 (optimized) = per-token + pipeline optimized + tile shape optimized.

| Metric | fp8, dp-size=1 | fp8, dp-size=2 | Wint4Afp8 (sglang previous), dp-size=1 | Wint4Afp8 (sglang previous), dp-size=2 | Wint4Afp8 (optimized), dp-size=1 | Wint4Afp8 (optimized), dp-size=2 |
|---|---|---|---|---|---|---|
| Request throughput (req/s) | 0.64 | 0.68 | 0.52 | 0.54 | 0.67 | 0.71 |
| Input token throughput (tok/s) | 2600 | 2769 | 2118 | 2230 | 2745 | 2906 |
| Output token throughput (tok/s) | 650 | 692 | 530 | 557 | 686 | 727 |
| Total token throughput (tok/s) | 3251 | 3461 | 2648 | 2787 | 3432 | 3633 |
| Mean E2E Latency (ms) | 48530 | 45207 | 59607 | 56296 | 45851 | 42967 |
| Mean ITL (ms) | 42.42 | 39.00 | 52.59 | 49.35 | 39.03 | 36.14 |
| Mean TTFT (ms) | 5131 | 5312 | 5809 | 5809 | 5927 | 5993 |
| Mean TPOT (ms) | 42.42 | 39.00 | 52.59 | 49.35 | 39.03 | 36.14 |
  • Max request concurrency: 64
  • Successful requests: 300
  • Total input tokens: 1,228,500
  • Total input text tokens: 1,228,500
  • Total generated tokens: 307,200
  • tp-size: 8
Wint4Afp8 (optimized) = per-token + pipeline optimized + tile shape optimized.

| Metric | fp8, dp-size=1 | fp8, dp-size=2 | Wint4Afp8 (sglang previous), dp-size=1 | Wint4Afp8 (sglang previous), dp-size=2 | Wint4Afp8 (optimized), dp-size=1 | Wint4Afp8 (optimized), dp-size=2 |
|---|---|---|---|---|---|---|
| Request throughput (req/s) | 0.66 | 0.90 | 0.51 | 0.74 | 0.67 | 0.96 |
| Input token throughput (tok/s) | 2716 | 3673 | 2097 | 3038 | 2722 | 3921 |
| Output token throughput (tok/s) | 679 | 918 | 524 | 759 | 681 | 981 |
| Total token throughput (tok/s) | 3395 | 4591 | 2627 | 3797 | 3403 | 4902 |
| Mean E2E Latency (ms) | 90289 | 66558 | 115518 | 80557 | 88295 | 62168 |
| Mean ITL (ms) | 44.63 | 56.03 | 57.48 | 68.73 | 42.61 | 50.63 |
| Mean TTFT (ms) | 44628 | 9238 | 56717 | 10250 | 44707 | 10375 |
| Mean TPOT (ms) | 44.63 | 56.03 | 57.48 | 68.73 | 42.61 | 50.63 |

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


@huangtingwei9988
Collaborator

@AniZpZ PTAL

@yuyu5333
Contributor

yuyu5333 commented Mar 10, 2026

Thank you for the efforts. Could you please provide a command that can be used to start the deployment?

Also, may I ask which version of sglang this was developed against?

@Wangzheee Wangzheee changed the title Implement w4afp8 cutlass moe gemmm for per-token-per-group Enhance performance for w4afp8: implement w4afp8 cutlass moe gemm(per-token) for fpdispatch Mar 11, 2026
@Wangzheee Wangzheee changed the title Enhance performance for w4afp8: implement w4afp8 cutlass moe gemm(per-token) for fpdispatch Enhance w4afp8 performance: implement per-token w4afp8 CUTLASS MoE GEMM for FP8 dispatch, improve performance with w4afp8 moe gemm Mar 11, 2026
@Wangzheee
Author

> Thank you for the efforts. Could you please provide a command that can be used to start the deployment?
>
> Also, may I ask which version of sglang this was developed against?

Thanks for your reply. A description of how to use it has been added to the PR.

@yuyu5333
Contributor

yuyu5333 commented Mar 11, 2026

Here are some relevant modifications for W4AFP8, aimed at improving its performance. The previous implementation calculated incorrect N and K, resulting in always selecting the same type of group GEMM. Related PRs: #15380, #15315

@Wangzheee
Author

Wangzheee commented Mar 11, 2026

> Here are some relevant modifications for W4AFP8, aimed at improving performance in the W4AFP8 scenario. The previous implementation calculated incorrect N and K, resulting in always selecting the same type of group GEMM. Related PRs: #15380, #15315

This PR re-estimates the shape of A to select the block tile shape, and picks the better of the Pingpong and Cooperative pipelines.
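The selection described above can be sketched in Python. This is a hypothetical heuristic for illustration only (the function name `select_moe_gemm_config` and the threshold of 64 rows are assumptions, not the PR's actual logic): estimate the average number of activation rows each expert receives, then pick a tile M and a CUTLASS kernel schedule accordingly.

```python
# Hypothetical heuristic sketch: choose a block-tile shape and a CUTLASS
# pipeline variant from the estimated per-expert activation shape.
def select_moe_gemm_config(total_tokens, top_k, num_experts):
    """Estimate rows per expert and pick a tile M plus kernel schedule."""
    # Each token is routed to top_k experts, so the grouped GEMM sees
    # roughly total_tokens * top_k rows spread over num_experts groups.
    rows_per_expert = max(1, (total_tokens * top_k) // num_experts)
    if rows_per_expert <= 64:
        # Small M per expert: a smaller tile with the Pingpong schedule
        # avoids wasting work on mostly-empty tiles.
        return {"tile_m": 64, "schedule": "Pingpong"}
    # Large M per expert: a bigger tile with the Cooperative schedule
    # favors throughput.
    return {"tile_m": 128, "schedule": "Cooperative"}
```

For example, at decode batch size 32 with top-k 8 over 256 experts, each expert sees only a handful of rows, so the small-tile Pingpong configuration would be chosen.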

@Wangzheee Wangzheee requested a review from b8zhong as a code owner March 12, 2026 03:36
@HydraQYH HydraQYH self-assigned this Mar 16, 2026
@Wangzheee Wangzheee closed this Mar 20, 2026
@Wangzheee Wangzheee force-pushed the w4afp8_per-token-kernel branch from 64b60c2 to 576e397 Compare March 20, 2026 16:50