Implement per-token w4afp8 moe gemm, improve performance with w4afp8 moe gemm#21101
Wangzheee wants to merge 1 commit into sgl-project:main
Conversation
Summary of Changes
This pull request delivers a substantial performance uplift for w4afp8 Mixture-of-Experts (MoE) GEMM operations. It refactors the quantization pipeline to support per-token granularity, introduces distinct scaling mechanisms for the input and weight matrices, and adds low-level optimizations that use the GPU architecture more effectively. The changes aim to improve throughput and reduce latency for quantized models, particularly for MoE layers.
Code Review
This pull request introduces a new Triton kernel for interleaving int4 data and integrates it into the sglang framework. It also modifies the cutlass kernel to support B scales and adds a lookup table for int4-to-fp8 conversion. The review comments suggest removing a redundant condition in the Triton kernel and adding a comment to explain the padding logic in the cutlass kernel.
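The int4-to-fp8 lookup table mentioned above can be illustrated with a 16-entry table mapping each 4-bit code to its signed value. A minimal NumPy sketch (function names hypothetical; plain floats stand in for the fp8 bit patterns a real kernel would store):

```python
import numpy as np

def build_int4_to_float_lut():
    """Map each 4-bit code (0..15) to its signed int4 value as a float.

    A real kernel would store the fp8 (e4m3) bit pattern of each value
    instead; plain floats keep this sketch dependency-free.
    """
    lut = np.empty(16, dtype=np.float32)
    for code in range(16):
        # Two's-complement decode: codes 8..15 map to -8..-1.
        lut[code] = code - 16 if code >= 8 else code
    return lut

def dequantize_int4(codes, lut):
    """Convert an array of 4-bit codes (one per byte) via the LUT."""
    return lut[codes]

lut = build_int4_to_float_lut()
codes = np.array([0, 1, 7, 8, 15], dtype=np.uint8)
print(dequantize_int4(codes, lut))  # [ 0.  1.  7. -8. -1.]
```

On a GPU the 16-entry table fits in registers or shared memory, so the conversion costs one gather per nibble instead of arithmetic decode.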
```python
dst_id = dst_row_id * cols_div4 + dst_col_id

valid = (
    (col_id < cols_div4)
```
The condition (col_id < cols_div4) appears to be redundant. The loop mask mask_partition requires partition_id < (cols // 64), so the maximum value of col_id is (cols // 64 - 1) * 16 + 15 = cols // 4 - 1. Since cols_div4 is cols // 4, col_id is always less than cols_div4 whenever mask_partition is true, and the valid mask already includes mask_partition[:, None], so this check is unnecessary. Removing it could offer a minor performance improvement.
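The reviewer's bound can be checked exhaustively in a few lines. This sketch reproduces the index arithmetic from the comment (the loop structure is an assumption based on the variable names):

```python
def check_redundant(cols):
    """Check that col_id < cols // 4 whenever partition_id < cols // 64,
    reproducing the bound claimed in the review comment (64 must divide cols)."""
    cols_div4 = cols // 4
    for partition_id in range(cols // 64):        # mask_partition is true here
        for lane in range(16):                    # col_id = partition_id * 16 + lane
            col_id = partition_id * 16 + lane
            assert col_id < cols_div4, (cols, col_id)
    return True

# Maximum col_id is (cols // 64 - 1) * 16 + 15 == cols // 4 - 1 when 64 | cols.
assert all(check_redundant(c) for c in (64, 128, 4096))
```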
```cpp
int64_t scale_k = k / 128;
b_scales_offsets[expert_id] =
    b_scales_base_as_int + (per_out_ch ? expert_id * n * scale_k : expert_id);
a_scales_offsets[expert_id] =
    a_scales_base_as_int +
    (per_act_token ? expert_offset * (scale_k % 4 == 0 ? scale_k : scale_k * 4) : 0);
```
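To make the two branches easier to follow, here is a Python transcription of the offset computation above (a sketch: element offsets instead of byte addresses, function name hypothetical):

```python
def scale_offsets(expert_id, expert_offset, n, k, per_out_ch, per_act_token):
    """Mirror of the C++ offset computation above, using element offsets.

    scale_k is the number of K-groups of size 128.
    """
    scale_k = k // 128
    # B (weight) scales: one scale per (expert, out-channel, k-group)
    # when per_out_ch, otherwise a single scale per expert.
    b_off = expert_id * n * scale_k if per_out_ch else expert_id
    # A (activation) scales: one row of scales per token; the stride is
    # scale_k * 4 when scale_k is not a multiple of 4 (the padding the
    # review asks to have documented), otherwise scale_k.
    stride = scale_k if scale_k % 4 == 0 else scale_k * 4
    a_off = expert_offset * stride if per_act_token else 0
    return a_off, b_off

# k = 512 gives scale_k = 4 (no padding needed):
print(scale_offsets(expert_id=2, expert_offset=10, n=256, k=512,
                    per_out_ch=True, per_act_token=True))  # (40, 2048)
```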
Co-authored-by: Jiang Shao <[email protected]>
Motivation
[1/2] Enhance w4afp8 performance: the kernel part of the complete functionality.
This PR refactors the functionality of PR #7762 and significantly improves performance:
- Add a pipeline for the A scale (per-token)
- Optimize the pipeline
- Replace LDS with LDSM for loading the W4 weights
- Estimate the shape of the actual activations in MoE to optimize the block tile shape
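The per-token A-scale pipeline boils down to computing one scale per activation row before the fp8 GEMM. A NumPy sketch of the idea (fp8 is simulated by clipping to the e4m3 maximum of 448; the real kernel fuses this on-GPU and stores true fp8 values):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in fp8 e4m3

def quantize_per_token_fp8(a):
    """Per-token (per-row) fp8 quantization of activations.

    Returns the quantized matrix (simulated fp8 values) and one scale
    per row, as used on the A side of a w4afp8 GEMM.
    """
    row_max = np.abs(a).max(axis=1, keepdims=True)
    scale = row_max / FP8_E4M3_MAX
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(a / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale.squeeze(1)

a = np.array([[1.0, -2.0, 4.0], [0.5, 0.25, 0.125]])
q, s = quantize_per_token_fp8(a)
# Dequantizing with the per-row scale recovers the input (up to fp8 rounding).
assert np.allclose(q * s[:, None], a)
```

Per-token scaling tracks the dynamic range of each row, which is why it is more accurate than a single per-tensor scale for MoE activations.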
How to use w4afp8
- Use an open-source w4afp8 model, or
- Quantize the model yourself
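Quantizing a model yourself amounts to group-wise symmetric int4 quantization of the weights with one scale per group (the cutlass code in this PR assumes K-groups of 128). A minimal NumPy sketch (function name hypothetical):

```python
import numpy as np

def quantize_w4_groupwise(w, group_size=128):
    """Group-wise symmetric int4 quantization along K.

    w: (n, k) weight matrix, k divisible by group_size.
    Returns int4 codes in [-8, 7] and one scale per (row, group).
    """
    n, k = w.shape
    g = w.reshape(n, k // group_size, group_size)
    scale = np.abs(g).max(axis=2, keepdims=True) / 7.0  # map group max to code 7
    scale = np.where(scale == 0, 1.0, scale)
    codes = np.clip(np.round(g / scale), -8, 7).astype(np.int8)
    return codes.reshape(n, k), scale.squeeze(2)

w = np.random.default_rng(0).standard_normal((4, 256))
codes, scales = quantize_w4_groupwise(w)
# Reconstruction error is bounded by one quantization step per element.
err = np.abs(codes.reshape(4, 2, 128) * scales[:, :, None]
             - w.reshape(4, 2, 128)).max()
assert err <= scales.max() / 2 + 1e-9
```

Real exporters additionally pack two codes per byte and interleave them into the layout the kernel expects (the Triton interleave kernel in this PR).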
Modifications
Accuracy Tests
models: https://huggingface.co/deepseek-ai/DeepSeek-R1 and https://huggingface.co/Barrrrry/DeepSeek-R1-W4AFP8
Benchmarking and Profiling
Summary
w4afp8 (per-token, optimized) vs. fp8, and likewise w4afp8 (per-token, optimized) vs. w4afp8 (previous), under both:
- EP (ep-size = 8)
- TP (tp-size = 8)
Detail data
[Benchmark tables: four runs each under EP and TP, labeled (sglang previous), (per-token), (pipeline optimized), and (tile shape optimized)]
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci