Conversation

@JartX (Contributor) commented Nov 17, 2025

Based on PR #25311, I'm adding EPLB support to Qwen3VL and to the CompressedTensorsWNA16MoEMethod quantization method.
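As a rough illustration of the EPLB idea behind this PR (an illustrative sketch only; the function name and logic are mine, not vLLM's actual implementation): per-expert load is tracked, and the spare "redundant expert" slots are assigned as extra replicas of the hottest experts so that work spreads more evenly across ranks.

```python
# Toy sketch of Expert Parallel Load Balancing (illustrative only; not
# vLLM's actual API or algorithm): spare "redundant expert" slots are
# handed out as extra replicas of the hottest experts.

def rebalance(expert_loads: list[int], num_slots: int) -> list[int]:
    """Map physical slots to logical experts, replicating hot experts."""
    num_experts = len(expert_loads)
    assert num_slots >= num_experts
    replicas = [1] * num_experts  # every logical expert gets one slot
    for _ in range(num_slots - num_experts):
        # next spare slot goes to the expert with the highest load per replica
        hottest = max(range(num_experts),
                      key=lambda e: expert_loads[e] / replicas[e])
        replicas[hottest] += 1
    mapping = []  # slot index -> logical expert id
    for expert, count in enumerate(replicas):
        mapping.extend([expert] * count)
    return mapping

# expert 0 carries most of the load, so it receives both redundant slots
print(rebalance([100, 10, 10, 10], 6))  # -> [0, 0, 0, 1, 2, 3]
```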

@mergify mergify bot added the qwen Related to Qwen models label Nov 17, 2025
@JartX JartX changed the title [Feaature] EPLB on Qwen3VL and CompressedTensorsWNA16MoEMethod [Feature] EPLB on Qwen3VL and CompressedTensorsWNA16MoEMethod Nov 17, 2025
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Expert Parallel Load Balancing (EPLB) to the Qwen3VL model and the CompressedTensorsWNA16MoEMethod quantization method. The changes involve adding necessary checks and parameter passing for EPLB in the quantization method, and implementing the MixtureOfExperts interface for the Qwen3VL model. The implementation seems correct and follows existing patterns in the codebase. I have not found any critical or high-severity issues in the changes.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@JartX JartX changed the title [Feature] EPLB on Qwen3VL and CompressedTensorsWNA16MoEMethod [Feature] EPLB on Qwen3VLMoe and CompressedTensorsWNA16MoEMethod Nov 17, 2025
@JartX (Contributor, Author) commented Nov 17, 2025

@mgoin @yewentao256 Would you be so kind as to run the tests? Many thanks!

@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 17, 2025
@yewentao256 (Member) left a comment


Please also add a metrics report, e.g. lm_eval for accuracy and vllm bench serve for performance, to make sure the update is correct.

@JartX (Contributor, Author) commented Nov 19, 2025

@yewentao256 @mgoin @tjtanaa

evalscope perf --url "http://127.0.0.1/v1/chat/completions" --parallel 200 --model /models/Qwen3-VL-30B-A3B-Instruct-AWQ-W4A16-mse-seq --number 200 --api openai --dataset flickr8k --stream

EP + EPLB Without Cache

vllm serve /models/Qwen3-VL-30B-A3B-Instruct-AWQ-W4A16-mse-seq \
    --gpu-memory-utilization 0.90 \
    --max_model_len 40960 \
    -tp 4 \
    --port 8000 \
    --limit-mm-per-prompt '{"image":6, "video":0}' \
    --mm-encoder-tp-mode data \
    --dtype=float16 \
    --enable-log-requests \
    --chat-template /chat-template-tools.jinja \
    --enable-expert-parallel \
    --enable-eplb \
    --num-redundant-experts 8 \
    --eplb-window-size 3000 \
    --eplb-step-interval 1000 \
    --no-enable-prefix-caching
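For context on the EPLB flags above (`--num-redundant-experts`, `--eplb-window-size`, `--eplb-step-interval`): expert loads are tracked over a sliding window of recent engine steps, and a rebalance is considered every `step_interval` steps. A minimal toy sketch of that scheduling, not vLLM's actual code:

```python
# Illustrative sketch (not vLLM internals) of what --eplb-window-size and
# --eplb-step-interval mean: loads accumulate over a sliding window of the
# last `window_size` steps, and a rebalance is triggered every
# `step_interval` steps based on that window.
from collections import deque

class EplbScheduler:
    def __init__(self, num_experts: int, window_size: int, step_interval: int):
        self.window = deque(maxlen=window_size)  # per-step load vectors
        self.step_interval = step_interval
        self.num_experts = num_experts
        self.step = 0
        self.rebalances = 0

    def record_step(self, step_loads: list[int]) -> None:
        self.window.append(step_loads)
        self.step += 1
        if self.step % self.step_interval == 0:
            windowed = [sum(col) for col in zip(*self.window)]
            self.rebalance(windowed)

    def rebalance(self, windowed_loads: list[int]) -> None:
        # real EPLB would recompute the physical-expert mapping here
        self.rebalances += 1

sched = EplbScheduler(num_experts=2, window_size=3000, step_interval=1000)
for _ in range(3500):
    sched.record_step([1, 2])
print(sched.rebalances)  # rebalanced at steps 1000, 2000, 3000 -> 3
```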

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               | Value     |
+===================================+===========+
| Time taken for tests (s)          | 72.4338   |
+-----------------------------------+-----------+
| Number of concurrency             | 200       |
+-----------------------------------+-----------+
| Total requests                    | 200       |
+-----------------------------------+-----------+
| Succeed requests                  | 200       |
+-----------------------------------+-----------+
| Failed requests                   | 0         |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   | 1324.16   |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 2222.4    |
+-----------------------------------+-----------+
| Request throughput (req/s)        | 3.6286    |
+-----------------------------------+-----------+
| Average latency (s)               | 38.3631   |
+-----------------------------------+-----------+
| Average time to first token (s)   | 6.7313    |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0897    |
+-----------------------------------+-----------+
| Average inter-token latency (s)   | 0.0867    |
+-----------------------------------+-----------+
| Average input tokens per request  | 247.545   |
+-----------------------------------+-----------+
| Average output tokens per request | 364.925   |
+-----------------------------------+-----------+
2025-11-19 10:12:55 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 2.7414 | 0.0677 | 0.0805 | 25.3166 | 219 | 210 | 7.7036 | 14.603 |
| 25% | 4.6434 | 0.0773 | 0.0847 | 33.1969 | 228 | 275 | 8.4811 | 15.1249 |
| 50% | 6.7104 | 0.0816 | 0.09 | 39.4073 | 244 | 368 | 9.1828 | 15.632 |
| 66% | 7.6327 | 0.0834 | 0.0928 | 42.5317 | 259 | 407 | 9.6304 | 16.1142 |
| 75% | 9.2238 | 0.0842 | 0.095 | 43.6025 | 266 | 429 | 9.8827 | 16.5865 |
| 80% | 9.7682 | 0.0847 | 0.0959 | 44.4719 | 271 | 442 | 10.0314 | 16.8554 |
| 90% | 10.8307 | 0.0858 | 0.0996 | 47.1042 | 279 | 491 | 10.5507 | 18.0681 |
| 95% | 11.8842 | 0.0873 | 0.1026 | 51.5517 | 291 | 586 | 10.9332 | 19.474 |
| 98% | 11.8885 | 0.1639 | 0.1064 | 53.9784 | 304 | 606 | 11.9021 | 21.9638 |
| 99% | 12.1114 | 0.5288 | 0.108 | 55.0982 | 332 | 646 | 12.5311 | 24.9188 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

EP Without Cache

vllm serve /models/Qwen3-VL-30B-A3B-Instruct-AWQ-W4A16-mse-seq \
    --gpu-memory-utilization 0.90 \
    --max_model_len 40960 \
    -tp 4 \
    --port 8000 \
    --limit-mm-per-prompt '{"image":6, "video":0}' \
    --mm-encoder-tp-mode data \
    --dtype=float16 \
    --enable-log-requests \
    --chat-template /chat-template-tools.jinja \
    --enable-expert-parallel \
    --no-enable-prefix-caching

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               | Value     |
+===================================+===========+
| Time taken for tests (s)          | 57.4057   |
+-----------------------------------+-----------+
| Number of concurrency             | 200       |
+-----------------------------------+-----------+
| Total requests                    | 200       |
+-----------------------------------+-----------+
| Succeed requests                  | 200       |
+-----------------------------------+-----------+
| Failed requests                   | 0         |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   | 1270.94   |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 2148.46   |
+-----------------------------------+-----------+
| Request throughput (req/s)        | 3.5449    |
+-----------------------------------+-----------+
| Average latency (s)               | 40.7577   |
+-----------------------------------+-----------+
| Average time to first token (s)   | 8.4959    |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0931    |
+-----------------------------------+-----------+
| Average inter-token latency (s)   | 0.09      |
+-----------------------------------+-----------+
| Average input tokens per request  | 247.545   |
+-----------------------------------+-----------+
| Average output tokens per request | 358.525   |
+-----------------------------------+-----------+
2025-11-19 10:23:23 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 3.9765 | 0.0504 | 0.0788 | 31.2179 | 219 | 211 | 6.8652 | 13.355 |
| 25% | 6.3792 | 0.0818 | 0.0858 | 36.2328 | 228 | 288 | 7.7685 | 13.9148 |
| 50% | 8.787 | 0.0862 | 0.0938 | 42.5993 | 244 | 351 | 8.5401 | 14.8389 |
| 66% | 9.8268 | 0.0877 | 0.0978 | 44.4241 | 259 | 402 | 9.0478 | 15.4779 |
| 75% | 11.1185 | 0.0886 | 0.1007 | 46.1566 | 266 | 430 | 9.5543 | 15.8241 |
| 80% | 11.6817 | 0.0891 | 0.1017 | 46.7386 | 271 | 452 | 9.8415 | 16.0808 |
| 90% | 12.8096 | 0.0906 | 0.1061 | 49.3728 | 279 | 501 | 10.5277 | 16.6254 |
| 95% | 13.9391 | 0.0918 | 0.1099 | 51.6401 | 291 | 548 | 11.36 | 17.1159 |
| 98% | 13.9411 | 0.1273 | 0.1186 | 55.4414 | 304 | 635 | 12.021 | 18.6752 |
| 99% | 15.3501 | 0.5797 | 0.1216 | 56.4012 | 332 | 678 | 12.06 | 19.9472 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

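For a quick read of the two benchmark summaries above, the relative change from EP-only to EP+EPLB can be computed directly from the reported numbers (a small helper script, not part of the PR):

```python
# Relative change EP-only -> EP+EPLB, using the figures from the two
# benchmarking summaries above.
eplb = {"out_tok_s": 1324.16, "ttft_s": 6.7313, "latency_s": 38.3631}
ep   = {"out_tok_s": 1270.94, "ttft_s": 8.4959, "latency_s": 40.7577}

def pct_change(new: float, old: float) -> float:
    return 100.0 * (new - old) / old

print(f"output throughput: {pct_change(eplb['out_tok_s'], ep['out_tok_s']):+.1f}%")
print(f"avg TTFT:          {pct_change(eplb['ttft_s'], ep['ttft_s']):+.1f}%")
print(f"avg latency:       {pct_change(eplb['latency_s'], ep['latency_s']):+.1f}%")
# -> output throughput: +4.2%, avg TTFT: -20.8%, avg latency: -5.9%
```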
EPLB + EP

| Tasks   | Version | Filter | n-shot | Metric            | Value  |   | Stderr |
|---------|---------|--------|--------|-------------------|--------|---|--------|
| chartqa | 0       | none   | 0      | anywhere_accuracy | 0.8696 | ± | 0.0067 |
|         |         | none   | 0      | exact_match       | 0.6356 | ± | 0.0096 |
|         |         | none   | 0      | relaxed_accuracy  | 0.8604 | ± | 0.0069 |

EP

| Tasks   | Version | Filter | n-shot | Metric            | Value  |   | Stderr |
|---------|---------|--------|--------|-------------------|--------|---|--------|
| chartqa | 0       | none   | 0      | anywhere_accuracy | 0.8752 | ± | 0.0066 |
|         |         | none   | 0      | exact_match       | 0.6356 | ± | 0.0096 |
|         |         | none   | 0      | relaxed_accuracy  | 0.8644 | ± | 0.0068 |

@JartX (Contributor, Author) commented Nov 19, 2025

@yewentao256 @tjtanaa Would you be so kind as to take a look at the failing tests? I'd say it's not the PR's fault. Thank you very much!

@tjtanaa (Collaborator) commented Nov 19, 2025

@JartX ok. I will try on gfx942

@JartX (Contributor, Author) commented Nov 19, 2025

Thanks for the input, @tjtanaa! I was also referring to the CI run, where it says two tests failed. A much better graphics card than mine, or a pool of graphics cards, would be better for testing Qwen3 235 VL. In my case, I see improvements in latency and throughput, especially under high loads (real-world rather than simulated, such as with tools, images, etc.).

@JartX (Contributor, Author) commented Nov 19, 2025

@tjtanaa @yewentao256 @mgoin all tests passed :)! Can you merge it? ☺️

@mgoin mgoin merged commit 8e38e99 into vllm-project:main Nov 19, 2025
57 checks passed
@JartX JartX deleted the feature/eplb_qwen3_vl_moe_and_compressedtensorswna16moemethod branch November 20, 2025 11:25
LuminolT pushed a commit to LuminolT/vllm that referenced this pull request Nov 21, 2025
bigPYJ1151 pushed a commit that referenced this pull request Nov 25, 2025
