
Enhance w4afp8 performance: implement per-token w4afp8 CUTLASS MoE GEMM for FP8 dispatch, improve performance with w4afp8 moe gemm#18144

Closed
Wangzheee wants to merge 0 commits into sgl-project:main from Wangzheee:w4afp8_per-token-kernel

Conversation


@Wangzheee Wangzheee commented Feb 3, 2026

Motivation

This pull request (PR) refactors the functionality of PR-7762 and significantly enhances performance:

  1. Resolved the limitation that w4afp8 with DeepEP could only dispatch bf16 tokens
  • Reimplemented the GEMM pipeline with per-token quantization granularity, enabling dispatch of fp8 tokens
  2. Improved the performance of the w4afp8 MoE GEMM
  • Added a pipeline for the A scale (per-token)

  • Optimized the pipeline

  • Replaced LDS with LDSM for loading the W4 weights

  • Estimated the shape of the actual activation in MoE to optimize the block tile shape
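The per-token dispatch idea can be illustrated with a minimal Python sketch (the function name `quantize_per_token` is hypothetical, not the actual CUDA kernel): each token row gets its own scale so that activations fit the FP8 e4m3 range (max 448) before dispatch, instead of falling back to bf16.

```python
# Illustrative sketch of per-token FP8 (e4m3) quantization for MoE dispatch.
# Hypothetical helper for explanation only -- the real path is a CUDA kernel.
FP8_E4M3_MAX = 448.0

def quantize_per_token(activations):
    """Quantize each token's row with its own scale.

    activations: list of rows (one per token), each a list of floats.
    Returns (quantized rows, per-token scales); dequantize as q * scale.
    """
    q_rows, scales = [], []
    for row in activations:
        amax = max(abs(x) for x in row) or 1.0
        scale = amax / FP8_E4M3_MAX               # one scale per token
        q_rows.append([x / scale for x in row])   # values now fit the e4m3 range
        scales.append(scale)
    return q_rows, scales
```

The per-token scales travel with the dispatched tokens and are consumed by the A-scale pipeline stage inside the grouped GEMM.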

How to use w4afp8

  1. Merge this PR into your sglang development branch
  2. Use a w4afp8 model

Modifications

  • w4afp8 moe gemm kernel
  • moe layer
  • test

Accuracy Tests

Benchmarking and Profiling

Summary

w4afp8 (optimized) vs. fp8
w4afp8 (optimized) vs. w4afp8 (previous)

1. EP (ep-size=8)

| batch-size | MoE GEMM performance improvement | TPOT performance improvement |
|---|---|---|
| 32 | 90% | 50% ~ 60% |
| 64 | 80% | 50% |

2. TP (tp-size=8)

| batch-size | MoE GEMM performance improvement | TPOT performance improvement |
|---|---|---|
| 32 | 60% | 5% |
| 64 | 50% | 5% |

Detail data

  • Max request concurrency: 32
  • Successful requests: 300
  • Total input tokens: 1,228,500
  • Total input text tokens: 1,228,500
  • Total generated tokens: 307,200
  • ep-size: 8
Wint4Afp8 (optimized) = per-token + pipeline optimized + tile shape optimized.

| Metric | fp8, dp-size=1 | fp8, dp-size=2 | Wint4Afp8 (sglang previous), dp-size=1 | Wint4Afp8 (sglang previous), dp-size=2 | Wint4Afp8 (optimized), dp-size=1 | Wint4Afp8 (optimized), dp-size=2 |
|---|---|---|---|---|---|---|
| Request throughput (req/s) | 0.42 | 0.47 | 0.43 | 0.47 | 0.60 | 0.69 |
| Input token throughput (tok/s) | 1737 | 1906 | 1747 | 1924 | 2453 | 2821 |
| Output token throughput (tok/s) | 434 | 476 | 436 | 481 | 613 | 706 |
| Total token throughput (tok/s) | 2172 | 2382 | 2184 | 2405 | 3066 | 3527 |
| Mean E2E Latency (ms) | 72335 | 65756 | 71608 | 64868 | 51170 | 44294 |
| Mean ITL (ms) | 64.94 | 58.82 | 63.68 | 57.45 | 43.02 | 36.44 |
| Mean TTFT (ms) | 5899 | 5588 | 6460 | 6096 | 7161 | 7020 |
| Mean TPOT (ms) | 64.94 | 58.81 | 63.68 | 57.45 | 43.02 | 36.44 |
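The Summary percentages can be cross-checked against the detail rows. For example, the TPOT gain of the optimized kernel over fp8 at concurrency 32 (ep-size=8, dp-size=1):

```python
# TPOT improvement of the optimized w4afp8 kernel over fp8, using the
# ep-size=8, concurrency-32, dp-size=1 values from the table above.
fp8_tpot = 64.94       # ms
w4afp8_tpot = 43.02    # ms
improvement = fp8_tpot / w4afp8_tpot - 1.0
print(f"{improvement:.0%}")  # ~51%, consistent with the "50% ~ 60%" summary row
```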
  • Max request concurrency: 64
  • Successful requests: 300
  • Total input tokens: 1,228,500
  • Total input text tokens: 1,228,500
  • Total generated tokens: 307,200
  • ep-size: 8
Wint4Afp8 (optimized) = per-token + pipeline optimized + tile shape optimized.

| Metric | fp8, dp-size=1 | fp8, dp-size=2 | Wint4Afp8 (sglang previous), dp-size=1 | Wint4Afp8 (sglang previous), dp-size=2 | Wint4Afp8 (optimized), dp-size=1 | Wint4Afp8 (optimized), dp-size=2 |
|---|---|---|---|---|---|---|
| Request throughput (req/s) | 0.43 | 0.68 | 0.43 | 0.71 | 0.60 | 0.93 |
| Input token throughput (tok/s) | 1769 | 2785 | 1771 | 2894 | 2460 | 3825 |
| Output token throughput (tok/s) | 442 | 696 | 443 | 723 | 615 | 956 |
| Total token throughput (tok/s) | 2212 | 3481 | 2214 | 3618 | 3075 | 4781 |
| Mean E2E Latency (ms) | 136359 | 87896 | 134694 | 84463 | 97293 | 63847 |
| Mean ITL (ms) | 68.70 | 76.53 | 67.26 | 72.44 | 46.26 | 51.15 |
| Mean TTFT (ms) | 66077 | 9603 | 65886 | 10357 | 49970 | 11524 |
| Mean TPOT (ms) | 68.70 | 76.53 | 67.26 | 72.44 | 46.26 | 51.15 |
  • Max request concurrency: 32
  • Successful requests: 300
  • Total input tokens: 1,228,500
  • Total input text tokens: 1,228,500
  • Total generated tokens: 307,200
  • tp-size: 8
Wint4Afp8 (optimized) = per-token + pipeline optimized + tile shape optimized.

| Metric | fp8, dp-size=1 | fp8, dp-size=2 | Wint4Afp8 (sglang previous), dp-size=1 | Wint4Afp8 (sglang previous), dp-size=2 | Wint4Afp8 (optimized), dp-size=1 | Wint4Afp8 (optimized), dp-size=2 |
|---|---|---|---|---|---|---|
| Request throughput (req/s) | 0.64 | 0.68 | 0.52 | 0.54 | 0.67 | 0.71 |
| Input token throughput (tok/s) | 2600 | 2769 | 2118 | 2230 | 2745 | 2906 |
| Output token throughput (tok/s) | 650 | 692 | 530 | 557 | 686 | 727 |
| Total token throughput (tok/s) | 3251 | 3461 | 2648 | 2787 | 3432 | 3633 |
| Mean E2E Latency (ms) | 48530 | 45207 | 59607 | 56296 | 45851 | 42967 |
| Mean ITL (ms) | 42.42 | 39.00 | 52.59 | 49.35 | 39.03 | 36.14 |
| Mean TTFT (ms) | 5131 | 5312 | 5809 | 5809 | 5927 | 5993 |
| Mean TPOT (ms) | 42.42 | 39.00 | 52.59 | 49.35 | 39.03 | 36.14 |
  • Max request concurrency: 64
  • Successful requests: 300
  • Total input tokens: 1,228,500
  • Total input text tokens: 1,228,500
  • Total generated tokens: 307,200
  • tp-size: 8
Wint4Afp8 (optimized) = per-token + pipeline optimized + tile shape optimized.

| Metric | fp8, dp-size=1 | fp8, dp-size=2 | Wint4Afp8 (sglang previous), dp-size=1 | Wint4Afp8 (sglang previous), dp-size=2 | Wint4Afp8 (optimized), dp-size=1 | Wint4Afp8 (optimized), dp-size=2 |
|---|---|---|---|---|---|---|
| Request throughput (req/s) | 0.66 | 0.90 | 0.51 | 0.74 | 0.67 | 0.96 |
| Input token throughput (tok/s) | 2716 | 3673 | 2097 | 3038 | 2722 | 3921 |
| Output token throughput (tok/s) | 679 | 918 | 524 | 759 | 681 | 981 |
| Total token throughput (tok/s) | 3395 | 4591 | 2627 | 3797 | 3403 | 4902 |
| Mean E2E Latency (ms) | 90289 | 66558 | 115518 | 80557 | 88295 | 62168 |
| Mean ITL (ms) | 44.63 | 56.03 | 57.48 | 68.73 | 42.61 | 50.63 |
| Mean TTFT (ms) | 44628 | 9238 | 56717 | 10250 | 44707 | 10375 |
| Mean TPOT (ms) | 44.63 | 56.03 | 57.48 | 68.73 | 42.61 | 50.63 |

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.


@huangtingwei9988
Collaborator

@AniZpZ PTAL

@yuyu5333
Contributor

yuyu5333 commented Mar 10, 2026

Thank you for the efforts. Could you please provide a command that can be used to start the deployment?

Also, may I ask which version of sglang this was developed against?

@Wangzheee Wangzheee changed the title Implement w4afp8 cutlass moe gemmm for per-token-per-group Enhance performance for w4afp8: implement w4afp8 cutlass moe gemm(per-token) for fpdispatch Mar 11, 2026
@Wangzheee Wangzheee changed the title Enhance performance for w4afp8: implement w4afp8 cutlass moe gemm(per-token) for fpdispatch Enhance w4afp8 performance: implement per-token w4afp8 CUTLASS MoE GEMM for FP8 dispatch, improve performance with w4afp8 moe gemm Mar 11, 2026
@Wangzheee
Author

> Thank you for the efforts. Could you please provide a command that can be used to start the deployment?
>
> Also, may I ask which version of sglang this was developed against?

Thanks for your reply. A description of how to use it has been added to the PR.

@yuyu5333
Contributor

yuyu5333 commented Mar 11, 2026

Here are some relevant modifications for W4AFP8, aimed at improving its performance. The previous implementation calculated incorrect N and K, resulting in always selecting the same type of group GEMM. Related PRs: #15380, #15315

@Wangzheee
Author

Wangzheee commented Mar 11, 2026

> Here are some relevant modifications for W4AFP8, aimed at improving performance in the W4AFP8 scenario. The previous implementation calculated incorrect N and K, resulting in always selecting the same type of group GEMM. Related PRs: #15380, #15315

This PR re-estimates the shape of A to select the block tile shape, and picks the better of the Pingpong and Cooperative pipelines.
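The selection described above can be sketched in Python. This is a hypothetical heuristic for illustration only (the function name `select_moe_gemm_config` and the threshold of 64 rows are assumptions, not the PR's actual logic): estimate the average number of activation rows each expert receives, then pick a tile M and a CUTLASS kernel schedule accordingly.

```python
# Hypothetical heuristic sketch: choose a block-tile shape and a CUTLASS
# pipeline variant from the estimated per-expert activation shape.
def select_moe_gemm_config(total_tokens, top_k, num_experts):
    """Estimate rows per expert and pick a tile M plus kernel schedule."""
    # Each token is routed to top_k experts, so the grouped GEMM sees
    # roughly total_tokens * top_k rows spread over num_experts groups.
    rows_per_expert = max(1, (total_tokens * top_k) // num_experts)
    if rows_per_expert <= 64:
        # Small M per expert: a smaller tile with the Pingpong schedule
        # avoids wasting work on mostly-empty tiles.
        return {"tile_m": 64, "schedule": "Pingpong"}
    # Large M per expert: a bigger tile with the Cooperative schedule
    # favors throughput.
    return {"tile_m": 128, "schedule": "Cooperative"}
```

For example, at decode batch size 32 with top-k 8 over 256 experts, each expert sees only a handful of rows, so the small-tile Pingpong configuration would be chosen.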

@Wangzheee Wangzheee requested a review from b8zhong as a code owner March 12, 2026 03:36
@HydraQYH HydraQYH self-assigned this Mar 16, 2026
@Wangzheee Wangzheee closed this Mar 20, 2026
@Wangzheee Wangzheee force-pushed the w4afp8_per-token-kernel branch from 64b60c2 to 576e397 Compare March 20, 2026 16:50