Enhance w4afp8 performance: implement per-token w4afp8 CUTLASS MoE GEMM for FP8 dispatch, improve performance with w4afp8 moe gemm #18144
@AniZpZ PTAL
Thank you for the effort. Could you please provide a command that can be used to start the deployment? Also, may I ask which version of sglang this was developed against?
Thanks for your reply. A description of how to use it has been added to the PR.
This PR now estimates the actual shape of A (the activation) and uses it to select the block tile shape and to choose between the Pingpong and Cooperative pipelines.
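To illustrate the idea in the comment above, here is a minimal Python sketch of shape-based selection. It assumes uniform routing, so the expected number of rows of A per expert is `num_tokens * top_k / num_experts`. The names `TileConfig`, `estimate_rows_per_expert`, and `select_tile_config`, as well as every threshold and tile size, are hypothetical placeholders and not the values used in this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    tile_tokens: int  # tile extent along the token (row) dimension of A
    schedule: str     # "pingpong" or "cooperative"


def estimate_rows_per_expert(num_tokens: int, top_k: int, num_experts: int) -> int:
    """Expected rows of A per expert, assuming tokens are routed uniformly."""
    total_rows = num_tokens * top_k
    # Ceiling division, with a floor of one row.
    return max(1, -(-total_rows // num_experts))


def select_tile_config(est_rows: int) -> TileConfig:
    """Pick a tile and schedule from the estimated row count.

    The thresholds below are illustrative assumptions only.
    """
    if est_rows <= 16:
        return TileConfig(tile_tokens=16, schedule="pingpong")
    if est_rows <= 64:
        return TileConfig(tile_tokens=64, schedule="pingpong")
    return TileConfig(tile_tokens=128, schedule="cooperative")


if __name__ == "__main__":
    for tokens in (10, 128, 4096):
        rows = estimate_rows_per_expert(tokens, top_k=8, num_experts=32)
        print(tokens, rows, select_tile_config(rows))
```

The point of the sketch is that the tile is chosen from the rows each expert is expected to see rather than from the total batch size, which can be much larger than any single expert's share.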
Motivation
This pull request (PR) refactors the functionality of PR-7762 and significantly improves performance:
Add a pipeline for A scale (per-token)
Optimize the pipeline
Replace LDS with LDSM for W4 weight
Estimate the actual activation shape in MoE to choose a better block tile shape
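As a rough illustration of what a per-token A scale means, the following pure-Python sketch computes one scale per activation row and applies it to the GEMM result for that row. It assumes symmetric absmax scaling to the FP8 E4M3 range (largest finite value 448) and omits the actual FP8 rounding. The helper names are hypothetical, and where the real kernel applies the scale is not described in this PR text; the sketch simply applies it after accumulation.

```python
FP8_E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format


def per_token_scales(activations):
    """One scale per row (token): the row absmax maps to FP8_E4M3_MAX."""
    scales = []
    for row in activations:
        absmax = max(abs(v) for v in row)
        scales.append(max(absmax, 1e-12) / FP8_E4M3_MAX)  # avoid a zero scale
    return scales


def scale_rows(activations, scales):
    """Divide each row by its scale and clamp; FP8 rounding is omitted."""
    return [
        [min(max(v / s, -FP8_E4M3_MAX), FP8_E4M3_MAX) for v in row]
        for row, s in zip(activations, scales)
    ]


def gemm_with_row_scales(scaled_a, scales, weights):
    """C[i][j] = scales[i] * sum_k scaled_a[i][k] * weights[k][j]."""
    n = len(weights[0])
    return [
        [s * sum(a * weights[k][j] for k, a in enumerate(row)) for j in range(n)]
        for row, s in zip(scaled_a, scales)
    ]


if __name__ == "__main__":
    a = [[1.0, -2.0], [100.0, 50.0]]
    w = [[1.0, 0.5], [2.0, -1.0]]
    s = per_token_scales(a)
    print(gemm_with_row_scales(scale_rows(a, s), s, w))
```

Because the scale is constant along a row, it factors out of the dot product, so a row with small values is not forced to share the dynamic range of a row with large values.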
How to use w4afp8
Use open-source models:
Quantize the model yourself:
Modifications
Accuracy Tests
Benchmarking and Profiling
Summary
w4afp8 (optimized) vs. fp8
w4afp8 (optimized) vs. w4afp8 (previous)
EP(ep-size=8)
TP(tp-size=8)
Detail data
Compared variants: (sglang previous), (per-token), (pipeline optimized), (tile shape optimized)
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci