[BugFix] Support EP/DP + EPLB with MTP #25311
Merged: LucasWilkinson merged 45 commits into vllm-project:main from neuralmagic:imarkov/fix_eplb_mtp on Nov 5, 2025 (+956 −528).
Commits (45):
368ad79 Wip (ilmarkov)
e8aadae Fix precommit (ilmarkov)
98395a6 Fix other mtp models (ilmarkov)
cda869d Add eplb support to Llama4 (ilmarkov)
7a519ee Fix mllama4 (ilmarkov)
ec2b02a Refactor multi model eplb support (ilmarkov)
ca98544 Add test and fix (ilmarkov)
eeaca8f Merge branch 'main' into fix_eplb_mtp (ilmarkov)
c161489 Merge branch 'main' into imarkov/fix_eplb_mtp (ilmarkov)
e713f42 Update spec decode (ilmarkov)
a70a344 init (SageMoore)
3b51ef9 comment (SageMoore)
123c8e6 Update qwen next (ilmarkov)
9149d25 Merge branch 'main' into imarkov/fix_eplb_mtp (ilmarkov)
27b6437 Cleanup (ilmarkov)
ff9f992 Update after review (ilmarkov)
d4532a6 Update buildkite pipeline test time (ilmarkov)
b0c8cd3 Merge branch 'main' into imarkov/fix_eplb_mtp (ilmarkov)
7c5b5b1 Improve sync. Update after review (ilmarkov)
96d4b37 Fix comment (ilmarkov)
43755f6 Merge branch 'main' into imarkov/fix_eplb_mtp (ilmarkov)
477a955 Refactor (ilmarkov)
6880c9f Refactor glm4 (ilmarkov)
4ab42aa Update moemixin (ilmarkov)
bf4dcbc Merge branch 'main' into imarkov/fix_eplb_mtp (ilmarkov)
7d0ee28 Update comment for V1 Test e2e + engine (ilmarkov)
d129097 Update startup logging (ilmarkov)
a77b99f Update test (ilmarkov)
ef3c9a1 Upd test constants (ilmarkov)
7e60b26 Upd test time (ilmarkov)
df918b2 Merge branch 'main' into imarkov/fix_eplb_mtp (ilmarkov)
f4fad37 Upd (ilmarkov)
644c328 Merge branch 'main' into imarkov/fix_eplb_mtp (ilmarkov)
94e3390 Fix glm4moe (ilmarkov)
69786a5 Merge branch 'main' into imarkov/fix_eplb_mtp (tlrmchlsmth)
09f9869 Fix CI (ilmarkov)
74f806b Update gpu_memory_utilization to 0.93 (ilmarkov)
0e8dc73 Fix (ilmarkov)
7f4b831 Merge branch 'main' into imarkov/fix_eplb_mtp (ilmarkov)
b88f680 Fix oom (ilmarkov)
70b66a7 Merge branch 'main' into imarkov/fix_eplb_mtp (LucasWilkinson)
6e17f0f Merge remote-tracking branch 'origin/main' into imarkov/fix_eplb_mtp (ilmarkov)
e4fa241 Update moe_layers. Clean OpenPangu (ilmarkov)
deb21b1 Fix mypy (ilmarkov)
a9938e7 Merge branch 'main' into imarkov/fix_eplb_mtp (ilmarkov)
New test file added by this PR (@@ -0,0 +1,119 @@):

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from __future__ import annotations

import pytest
import torch

from vllm import LLM, SamplingParams
from vllm.distributed import cleanup_dist_env_and_memory


def create_test_prompts() -> list[str]:
    return [
        "A robot may not injure a human being",
        "To be or not to be,",
        "What is the meaning of life?",
    ]


@pytest.fixture
def sampling_config():
    return SamplingParams(temperature=0, max_tokens=10, ignore_eos=False)


@pytest.mark.parametrize(
    "model_setup",
    [
        ("meta-llama/Llama-4-Scout-17B-16E-Instruct", 4),
    ],
    ids=["llama4"],
)
def test_eplb_model(
    monkeypatch: pytest.MonkeyPatch,
    sampling_config: SamplingParams,
    model_setup: tuple[str, int],
):
    with monkeypatch.context() as m:
        m.setenv("VLLM_USE_V1", "1")
        m.setenv("VLLM_MLA_DISABLE", "1")

        model_name, tp_size = model_setup
        test_prompts = create_test_prompts()
        llm = LLM(
            model=model_name,
            tensor_parallel_size=tp_size,
            max_model_len=2048,
            enable_expert_parallel=True,
            num_redundant_experts=tp_size,
            eplb_window_size=4,
            eplb_step_interval=16,
            eplb_log_balancedness=True,
            enable_eplb=True,
            load_format="dummy",
            gpu_memory_utilization=0.95,
        )
        llm.generate(test_prompts, sampling_config)
        del llm
        torch.cuda.empty_cache()
        cleanup_dist_env_and_memory()


@pytest.mark.parametrize(
    "model_setup",
    [
        (
            "eagle",
            "eagle618/deepseek-v3-random",
            "eagle618/eagle-deepseek-v3-random",
            4,
        ),
        ("deepseek_mtp", "eagle618/deepseek-v3-random", None, 4),
        ("qwen3_next_mtp", "Qwen/Qwen3-Next-80B-A3B-Instruct", None, 4),
        pytest.param(
            (
                "eagle",
                "meta-llama/Llama-4-Scout-17B-16E-Instruct",
                "morgendave/EAGLE-Llama-4-Scout-17B-16E-Instruct",
                4,
            ),
            marks=pytest.mark.skip(reason="Skipping due to CI OOM issues"),
        ),
    ],
    ids=["deepseek_eagle", "deepseek_mtp", "qwen3_next_mtp", "llama4_eagle"],
)
def test_eplb_spec_decode(
    monkeypatch: pytest.MonkeyPatch,
    sampling_config: SamplingParams,
    model_setup: tuple[str, str, str, int],
):
    with monkeypatch.context() as m:
        m.setenv("VLLM_USE_V1", "1")
        m.setenv("VLLM_MLA_DISABLE", "1")

        method, model_name, spec_model_name, tp_size = model_setup
        llm = LLM(
            model=model_name,
            trust_remote_code=True,
            tensor_parallel_size=tp_size,
            speculative_config={
                "method": method,
                "model": spec_model_name,
                "num_speculative_tokens": 1,
                "max_model_len": 2048,
            },
            max_model_len=2048,
            enable_expert_parallel=True,
            num_redundant_experts=tp_size,
            eplb_window_size=1000,
            eplb_step_interval=3000,
            eplb_log_balancedness=True,
            enable_eplb=True,
            load_format="dummy",
        )
        test_prompts = create_test_prompts()
        llm.generate(test_prompts, sampling_config)
        del llm
        torch.cuda.empty_cache()
        cleanup_dist_env_and_memory()
```
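The knobs the tests exercise (`num_redundant_experts`, `eplb_window_size`, `eplb_step_interval`) drive periodic rebalancing of MoE experts across expert-parallel ranks. As a rough, self-contained illustration of the idea only — this is a toy sketch, not vLLM's actual EPLB algorithm or API — one can replicate the hottest experts and greedily place all replicas on the least-loaded rank:

```python
# Toy sketch of expert-parallel load balancing (hypothetical helper,
# not part of vLLM): given observed per-expert token counts, split the
# hottest experts into redundant replicas, then assign replicas to
# ranks heaviest-first so per-rank load stays balanced.
import heapq


def rebalance(expert_load: list[int], num_ranks: int, num_redundant: int):
    """Return a list of (total_load, rank_id, [expert_ids]) per rank."""
    replicas = [(float(load), eid) for eid, load in enumerate(expert_load)]
    for _ in range(num_redundant):
        # Split the heaviest replica's load in two (one redundant copy).
        load, eid = max(replicas)
        replicas.remove((load, eid))
        replicas += [(load / 2, eid), (load / 2, eid)]
    # Greedy longest-processing-time placement: heaviest replica first,
    # always onto the currently least-loaded rank.
    ranks = [(0.0, r, []) for r in range(num_ranks)]
    heapq.heapify(ranks)
    for load, eid in sorted(replicas, reverse=True):
        total, r, assigned = heapq.heappop(ranks)
        assigned.append(eid)
        heapq.heappush(ranks, (total + load, r, assigned))
    return sorted(ranks, key=lambda t: t[1])
```

With a skewed load such as `[100, 10, 10, 10]` over 2 ranks and 1 redundant expert, expert 0 ends up replicated on both ranks and the per-rank totals come out close to each other, which is the balancedness the `eplb_log_balancedness=True` option reports on in the real system.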
Review conversation:

"Nit: Does your test trip the timeout?"

"We added a new test that might take up to 15 minutes, so the timeout needs to be increased."

"This does seem a bit excessive for a somewhat niche use case. I'm admittedly not well versed in the CI hierarchy, but would it make more sense to just run one model here and the rest in a nightly?"

"Let's please not add a test that takes 15 minutes."

"I see, it only takes 5 minutes. Good enough. Let's add this to the EPLB execution test instead: vllm/.buildkite/test-pipeline.yaml, lines 217 to 225 in 446912d."