Conversation

@JartX (Contributor) commented Nov 17, 2025

Based on PR #25311, I'm adding EPLB support to Qwen3VL and to the CompressedTensorsWNA16MoEMethod quantization method.
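As a rough illustration of the EPLB idea behind this PR (an illustrative sketch only; the function name and logic are mine, not vLLM's actual implementation): per-expert load is tracked, and the spare "redundant expert" slots are assigned as extra replicas of the hottest experts so that work spreads more evenly across ranks.

```python
# Toy sketch of Expert Parallel Load Balancing (illustrative only; not
# vLLM's actual API or algorithm): spare "redundant expert" slots are
# handed out as extra replicas of the hottest experts.

def rebalance(expert_loads: list[int], num_slots: int) -> list[int]:
    """Map physical slots to logical experts, replicating hot experts."""
    num_experts = len(expert_loads)
    assert num_slots >= num_experts
    replicas = [1] * num_experts  # every logical expert gets one slot
    for _ in range(num_slots - num_experts):
        # next spare slot goes to the expert with the highest load per replica
        hottest = max(range(num_experts),
                      key=lambda e: expert_loads[e] / replicas[e])
        replicas[hottest] += 1
    mapping = []  # slot index -> logical expert id
    for expert, count in enumerate(replicas):
        mapping.extend([expert] * count)
    return mapping

# expert 0 carries most of the load, so it receives both redundant slots
print(rebalance([100, 10, 10, 10], 6))  # -> [0, 0, 0, 1, 2, 3]
```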

@mergify mergify bot added the qwen Related to Qwen models label Nov 17, 2025
@JartX JartX changed the title [Feaature] EPLB on Qwen3VL and CompressedTensorsWNA16MoEMethod [Feature] EPLB on Qwen3VL and CompressedTensorsWNA16MoEMethod Nov 17, 2025
@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Expert Parallel Load Balancing (EPLB) to the Qwen3VL model and the CompressedTensorsWNA16MoEMethod quantization method. The changes involve adding necessary checks and parameter passing for EPLB in the quantization method, and implementing the MixtureOfExperts interface for the Qwen3VL model. The implementation seems correct and follows existing patterns in the codebase. I have not found any critical or high-severity issues in the changes.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@JartX JartX changed the title [Feature] EPLB on Qwen3VL and CompressedTensorsWNA16MoEMethod [Feature] EPLB on Qwen3VLMoe and CompressedTensorsWNA16MoEMethod Nov 17, 2025
@JartX (Contributor, Author) commented Nov 17, 2025

@mgoin @yewentao256 Would you be so kind as to run the tests? Many thanks!

@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 17, 2025
@yewentao256 (Member) left a comment


Please also add a metrics report, e.g. lm_eval for accuracy and vllm bench serve for performance, to make sure the update is correct.

@JartX (Contributor, Author) commented Nov 19, 2025

@yewentao256 @mgoin @tjtanaa

evalscope perf --url "http://127.0.0.1/v1/chat/completions" --parallel 200 --model /models/Qwen3-VL-30B-A3B-Instruct-AWQ-W4A16-mse-seq --number 200 --api openai --dataset flickr8k --stream

EP + EPLB Without Cache

vllm serve /models/Qwen3-VL-30B-A3B-Instruct-AWQ-W4A16-mse-seq \
    --gpu-memory-utilization 0.90 \
    --max_model_len 40960 \
    -tp 4 \
    --port 8000 \
    --limit-mm-per-prompt '{"image":6, "video":0}' \
    --mm-encoder-tp-mode data \
    --dtype=float16 \
    --enable-log-requests \
    --chat-template /chat-template-tools.jinja \
    --enable-expert-parallel \
    --enable-eplb \
    --num-redundant-experts 8 \
    --eplb-window-size 3000 \
    --eplb-step-interval 1000 \
    --no-enable-prefix-caching
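For context on the EPLB flags above (`--num-redundant-experts`, `--eplb-window-size`, `--eplb-step-interval`): expert loads are tracked over a sliding window of recent engine steps, and a rebalance is considered every `step_interval` steps. A minimal toy sketch of that scheduling, not vLLM's actual code:

```python
# Illustrative sketch (not vLLM internals) of what --eplb-window-size and
# --eplb-step-interval mean: loads accumulate over a sliding window of the
# last `window_size` steps, and a rebalance is triggered every
# `step_interval` steps based on that window.
from collections import deque

class EplbScheduler:
    def __init__(self, num_experts: int, window_size: int, step_interval: int):
        self.window = deque(maxlen=window_size)  # per-step load vectors
        self.step_interval = step_interval
        self.num_experts = num_experts
        self.step = 0
        self.rebalances = 0

    def record_step(self, step_loads: list[int]) -> None:
        self.window.append(step_loads)
        self.step += 1
        if self.step % self.step_interval == 0:
            windowed = [sum(col) for col in zip(*self.window)]
            self.rebalance(windowed)

    def rebalance(self, windowed_loads: list[int]) -> None:
        # real EPLB would recompute the physical-expert mapping here
        self.rebalances += 1

sched = EplbScheduler(num_experts=2, window_size=3000, step_interval=1000)
for _ in range(3500):
    sched.record_step([1, 2])
print(sched.rebalances)  # rebalanced at steps 1000, 2000, 3000 -> 3
```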

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               | Value     |
+===================================+===========+
| Time taken for tests (s)          | 72.4338   |
+-----------------------------------+-----------+
| Number of concurrency             | 200       |
+-----------------------------------+-----------+
| Total requests                    | 200       |
+-----------------------------------+-----------+
| Succeed requests                  | 200       |
+-----------------------------------+-----------+
| Failed requests                   | 0         |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   | 1324.16   |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 2222.4    |
+-----------------------------------+-----------+
| Request throughput (req/s)        | 3.6286    |
+-----------------------------------+-----------+
| Average latency (s)               | 38.3631   |
+-----------------------------------+-----------+
| Average time to first token (s)   | 6.7313    |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0897    |
+-----------------------------------+-----------+
| Average inter-token latency (s)   | 0.0867    |
+-----------------------------------+-----------+
| Average input tokens per request  | 247.545   |
+-----------------------------------+-----------+
| Average output tokens per request | 364.925   |
+-----------------------------------+-----------+
2025-11-19 10:12:55 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 2.7414 | 0.0677 | 0.0805 | 25.3166 | 219 | 210 | 7.7036 | 14.603 |
| 25% | 4.6434 | 0.0773 | 0.0847 | 33.1969 | 228 | 275 | 8.4811 | 15.1249 |
| 50% | 6.7104 | 0.0816 | 0.09 | 39.4073 | 244 | 368 | 9.1828 | 15.632 |
| 66% | 7.6327 | 0.0834 | 0.0928 | 42.5317 | 259 | 407 | 9.6304 | 16.1142 |
| 75% | 9.2238 | 0.0842 | 0.095 | 43.6025 | 266 | 429 | 9.8827 | 16.5865 |
| 80% | 9.7682 | 0.0847 | 0.0959 | 44.4719 | 271 | 442 | 10.0314 | 16.8554 |
| 90% | 10.8307 | 0.0858 | 0.0996 | 47.1042 | 279 | 491 | 10.5507 | 18.0681 |
| 95% | 11.8842 | 0.0873 | 0.1026 | 51.5517 | 291 | 586 | 10.9332 | 19.474 |
| 98% | 11.8885 | 0.1639 | 0.1064 | 53.9784 | 304 | 606 | 11.9021 | 21.9638 |
| 99% | 12.1114 | 0.5288 | 0.108 | 55.0982 | 332 | 646 | 12.5311 | 24.9188 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

EP Without Cache

vllm serve /models/Qwen3-VL-30B-A3B-Instruct-AWQ-W4A16-mse-seq \
    --gpu-memory-utilization 0.90 \
    --max_model_len 40960 \
    -tp 4 \
    --port 8000 \
    --limit-mm-per-prompt '{"image":6, "video":0}' \
    --mm-encoder-tp-mode data \
    --dtype=float16 \
    --enable-log-requests \
    --chat-template /chat-template-tools.jinja \
    --enable-expert-parallel \
    --no-enable-prefix-caching

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               | Value     |
+===================================+===========+
| Time taken for tests (s)          | 57.4057   |
+-----------------------------------+-----------+
| Number of concurrency             | 200       |
+-----------------------------------+-----------+
| Total requests                    | 200       |
+-----------------------------------+-----------+
| Succeed requests                  | 200       |
+-----------------------------------+-----------+
| Failed requests                   | 0         |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   | 1270.94   |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 2148.46   |
+-----------------------------------+-----------+
| Request throughput (req/s)        | 3.5449    |
+-----------------------------------+-----------+
| Average latency (s)               | 40.7577   |
+-----------------------------------+-----------+
| Average time to first token (s)   | 8.4959    |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0931    |
+-----------------------------------+-----------+
| Average inter-token latency (s)   | 0.09      |
+-----------------------------------+-----------+
| Average input tokens per request  | 247.545   |
+-----------------------------------+-----------+
| Average output tokens per request | 358.525   |
+-----------------------------------+-----------+
2025-11-19 10:23:23 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 3.9765 | 0.0504 | 0.0788 | 31.2179 | 219 | 211 | 6.8652 | 13.355 |
| 25% | 6.3792 | 0.0818 | 0.0858 | 36.2328 | 228 | 288 | 7.7685 | 13.9148 |
| 50% | 8.787 | 0.0862 | 0.0938 | 42.5993 | 244 | 351 | 8.5401 | 14.8389 |
| 66% | 9.8268 | 0.0877 | 0.0978 | 44.4241 | 259 | 402 | 9.0478 | 15.4779 |
| 75% | 11.1185 | 0.0886 | 0.1007 | 46.1566 | 266 | 430 | 9.5543 | 15.8241 |
| 80% | 11.6817 | 0.0891 | 0.1017 | 46.7386 | 271 | 452 | 9.8415 | 16.0808 |
| 90% | 12.8096 | 0.0906 | 0.1061 | 49.3728 | 279 | 501 | 10.5277 | 16.6254 |
| 95% | 13.9391 | 0.0918 | 0.1099 | 51.6401 | 291 | 548 | 11.36 | 17.1159 |
| 98% | 13.9411 | 0.1273 | 0.1186 | 55.4414 | 304 | 635 | 12.021 | 18.6752 |
| 99% | 15.3501 | 0.5797 | 0.1216 | 56.4012 | 332 | 678 | 12.06 | 19.9472 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

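For a quick read of the two benchmark summaries above, the relative change from EP-only to EP+EPLB can be computed directly from the reported numbers (a small helper script, not part of the PR):

```python
# Relative change EP-only -> EP+EPLB, using the figures from the two
# benchmarking summaries above.
eplb = {"out_tok_s": 1324.16, "ttft_s": 6.7313, "latency_s": 38.3631}
ep   = {"out_tok_s": 1270.94, "ttft_s": 8.4959, "latency_s": 40.7577}

def pct_change(new: float, old: float) -> float:
    return 100.0 * (new - old) / old

print(f"output throughput: {pct_change(eplb['out_tok_s'], ep['out_tok_s']):+.1f}%")
print(f"avg TTFT:          {pct_change(eplb['ttft_s'], ep['ttft_s']):+.1f}%")
print(f"avg latency:       {pct_change(eplb['latency_s'], ep['latency_s']):+.1f}%")
# -> output throughput: +4.2%, avg TTFT: -20.8%, avg latency: -5.9%
```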
EPLB + EP

| Tasks   | Version | Filter | n-shot | Metric            | Value  |   | Stderr |
|---------|---------|--------|--------|-------------------|--------|---|--------|
| chartqa | 0       | none   | 0      | anywhere_accuracy | 0.8696 | ± | 0.0067 |
|         |         | none   | 0      | exact_match       | 0.6356 | ± | 0.0096 |
|         |         | none   | 0      | relaxed_accuracy  | 0.8604 | ± | 0.0069 |

EP

| Tasks   | Version | Filter | n-shot | Metric            | Value  |   | Stderr |
|---------|---------|--------|--------|-------------------|--------|---|--------|
| chartqa | 0       | none   | 0      | anywhere_accuracy | 0.8752 | ± | 0.0066 |
|         |         | none   | 0      | exact_match       | 0.6356 | ± | 0.0096 |
|         |         | none   | 0      | relaxed_accuracy  | 0.8644 | ± | 0.0068 |

@JartX (Contributor, Author) commented Nov 19, 2025

@yewentao256 @tjtanaa Would you be so kind as to take a look at the failing tests? I'd say it's not the PR's fault. Thank you very much!

@tjtanaa (Collaborator) commented Nov 19, 2025

@JartX ok. I will try on gfx942

@JartX (Contributor, Author) commented Nov 19, 2025

Thanks for the input, @tjtanaa! I was also referring to the CI run, where it says two tests failed. A much better graphics card than mine, or a pool of graphics cards, would be better for testing Qwen3 235 VL. In my case, I see improvements in latency and throughput, especially under high loads (real-world rather than simulated, such as with tools, images, etc.).

@JartX (Contributor, Author) commented Nov 19, 2025

@tjtanaa @yewentao256 @mgoin all tests passed :)! Can you merge it? ☺️

@mgoin mgoin merged commit 8e38e99 into vllm-project:main Nov 19, 2025
57 checks passed
@JartX JartX deleted the feature/eplb_qwen3_vl_moe_and_compressedtensorswna16moemethod branch November 20, 2025 11:25
LuminolT pushed a commit to LuminolT/vllm that referenced this pull request Nov 21, 2025
bigPYJ1151 pushed a commit that referenced this pull request Nov 25, 2025
