[Feature] EPLB on Qwen3VLMoe and CompressedTensorsWNA16MoEMethod #28849
Conversation
Code Review
This pull request adds support for Expert Parallel Load Balancing (EPLB) to the Qwen3VL model and the CompressedTensorsWNA16MoEMethod quantization method. The changes involve adding necessary checks and parameter passing for EPLB in the quantization method, and implementing the MixtureOfExperts interface for the Qwen3VL model. The implementation seems correct and follows existing patterns in the codebase. I have not found any critical or high-severity issues in the changes.
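For context, here is a minimal, self-contained sketch of the wiring pattern the review describes: the model exposes per-layer expert weights and forwards shared EPLB state down to each MoE layer. All class, attribute, and method names below are illustrative stand-ins loosely modeled on vLLM's MixtureOfExperts interface, not the actual PR diff.

```python
import torch


class ToyMoELayer:
    """Stand-in for a fused-MoE layer that participates in EPLB."""

    def __init__(self, num_physical_experts: int, hidden: int) -> None:
        # Expert weights are stacked along the (physical) expert dimension.
        self.w13 = torch.randn(num_physical_experts, 2 * hidden, hidden)
        self.w2 = torch.randn(num_physical_experts, hidden, hidden)
        self.expert_load_view = None

    def set_eplb_state(self, layer_idx, expert_load_view,
                       logical_to_physical_map, logical_replica_count):
        # Keep a view into the shared load tensor so the router can bump
        # per-expert counters that the EPLB rebalancer later reads.
        self.expert_load_view = expert_load_view[layer_idx]
        self.logical_to_physical_map = logical_to_physical_map[layer_idx]
        self.logical_replica_count = logical_replica_count[layer_idx]


class ToyMoEModel:
    """Stand-in for a model implementing a MixtureOfExperts-style interface."""

    def __init__(self, num_layers, num_logical, num_redundant, hidden=16):
        self.num_moe_layers = num_layers
        self.num_logical_experts = num_logical
        self.num_redundant_experts = num_redundant
        self.num_physical_experts = num_logical + num_redundant
        self.moe_layers = [
            ToyMoELayer(self.num_physical_experts, hidden)
            for _ in range(num_layers)
        ]
        # Per-layer iterables of expert weights, which the rebalancer
        # shuffles when it remaps logical experts to physical slots.
        self.expert_weights = [(l.w13, l.w2) for l in self.moe_layers]

    def set_eplb_state(self, expert_load_view, logical_to_physical_map,
                       logical_replica_count):
        for i, layer in enumerate(self.moe_layers):
            layer.set_eplb_state(i, expert_load_view,
                                 logical_to_physical_map,
                                 logical_replica_count)


model = ToyMoEModel(num_layers=2, num_logical=8, num_redundant=2)
load = torch.zeros(model.num_moe_layers, model.num_physical_experts,
                   dtype=torch.int64)
l2p = torch.arange(model.num_physical_experts).repeat(model.num_moe_layers, 1)
reps = torch.ones(model.num_moe_layers, model.num_logical_experts,
                  dtype=torch.int64)
model.set_eplb_state(load, l2p, reps)
print(model.moe_layers[0].expert_load_view.shape)  # torch.Size([10])
```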
💡 Codex Review
Here are some automated review suggestions for this pull request.
@mgoin @yewentao256 Would you be so kind as to run the tests? Many thanks!
yewentao256 left a comment:
Please also add a metrics report, e.g., lm_eval for accuracy and vllm bench serve for performance, to make sure the update is correct.
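For the accuracy half of that request, one way to produce an lm_eval report against a vLLM backend is the harness's Python API; the checkpoint name, parallelism, and task below are placeholders to adapt to the setup under test, and performance can then be measured separately with `vllm bench serve`.

```python
import lm_eval

# Accuracy check via lm-evaluation-harness with the vLLM backend.
# The model path, tensor_parallel_size, and task are placeholders.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=Qwen/Qwen3-VL-235B-A22B-Instruct,tensor_parallel_size=4",
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```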
Benchmarking summaries attached for four configurations: EP + EPLB without cache, EP without cache, EPLB + EP, and EP.
@yewentao256 @tjtanaa Would you be so kind as to take a look at the failing test? I'd say it's not the PR's fault. Thank you very much!
@JartX OK, I will try on gfx942.
Thanks for the input, @tjtanaa! I was also referring to the CI tests that report two failures; a much better graphics card than mine, or a pool of graphics cards, would be needed there. In my case, with Qwen3 VL 235B, I see improvements in latency and throughput, especially under high real-world (not simulated) loads, such as with tools, images, etc.
@tjtanaa @yewentao256 @mgoin All tests passed :)! Can you merge it?
Based on PR #25311, I'm adding EPLB support to Qwen3VL and the CompressedTensorsWNA16MoEMethod quantization method.
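For anyone trying this out, here is a hedged sketch of enabling expert parallelism plus EPLB offline through vLLM's Python API. The engine flags follow recent vLLM releases and may differ on older builds (check `vllm serve --help`), and the checkpoint is a placeholder.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; flag names follow recent vLLM releases and
# may not exist on older builds.
llm = LLM(
    model="Qwen/Qwen3-VL-235B-A22B-Instruct",
    tensor_parallel_size=4,
    enable_expert_parallel=True,  # shard experts across ranks (EP)
    enable_eplb=True,             # rebalance expert placement online (EPLB)
)

out = llm.generate(["Describe this PR in one line."],
                   SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```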