[EPLB] Reduce EPLB Inference Overhead #24573
Conversation
Signed-off-by: Bowen Wang <[email protected]>
Code Review
This pull request significantly improves the performance and maintainability of the Expert Parallelism Load Balancer (EPLB) by replacing the slow and non-compilable torch.rand with a deterministic modulo-based replica selection. The refactoring of EPLB logic into a separate, torch.compile-friendly function eplb_map_to_physical_and_record is a great change that enhances code clarity. I've found one critical issue that could lead to a runtime error, which I've detailed in a specific comment.
Although it's safe to pass in `dtype=None`. Makes Gemini happy. Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Bowen Wang <[email protected]>
LGTM - please fix the pre-commit
Signed-off-by: Bowen Wang <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <[email protected]>
Purpose
PR #18343 introduced the Expert Parallelism Load Balancer (EPLB). By replicating a single logical expert into multiple physical experts, we can achieve better load balancing across experts.
However, this replication introduces some inference-time overhead: after the MoE routing module, we must select among multiple replicas of the same logical expert and also record expert load metrics for the rearrangement algorithm.
Previously, `torch.rand` was used to select expert replicas. Unfortunately, this method is slow and not `torch.compile`-friendly. In this PR, we aim to reduce EPLB overhead by:
- Switching the replica selection from `torch.rand` to a modulo-based pseudo-random selection (a minimal sketch follows this list).
- Extracting the EPLB logic in `select_experts` into a `torch.compile`-friendly function.
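For illustration, here is a minimal sketch of what modulo-based, `torch.compile`-friendly replica selection and load recording can look like. The tensor and function names (`logical_to_physical_map`, `logical_replica_count`, etc.) are assumptions for the sketch, not necessarily the identifiers used in this PR.

```python
import torch


def map_logical_to_physical(
    topk_ids: torch.Tensor,                 # [num_tokens, top_k] logical expert ids
    logical_to_physical_map: torch.Tensor,  # [num_logical, max_replicas] physical ids
    logical_replica_count: torch.Tensor,    # [num_logical] replicas per logical expert (>= 1)
) -> torch.Tensor:
    """Pick one physical replica per routed expert without torch.rand.

    The token's flat routing-slot position acts as a deterministic
    pseudo-random source; taking it modulo the replica count of the routed
    logical expert spreads tokens across replicas.
    """
    num_tokens, top_k = topk_ids.shape
    positions = torch.arange(
        num_tokens * top_k, device=topk_ids.device
    ).view(num_tokens, top_k)
    # Replica index = slot position mod number of replicas of the chosen expert.
    replica_idx = positions % logical_replica_count[topk_ids]
    # Gather the physical expert id of the selected replica.
    return logical_to_physical_map[topk_ids, replica_idx]


def record_expert_load(
    physical_ids: torch.Tensor,      # [num_tokens, top_k] selected physical experts
    expert_load_pass: torch.Tensor,  # [num_physical] running per-expert token counts
) -> None:
    """Accumulate how many routing slots hit each physical expert."""
    counts = torch.bincount(
        physical_ids.flatten(), minlength=expert_load_pass.numel()
    )
    expert_load_pass += counts
```

Because the selection is a pure, deterministic tensor computation, it avoids the random-sampling step that the PR identifies as slow and not `torch.compile`-friendly.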
Test Plan
To isolate EPLB inference overhead, we test with EPLB enabled but with `num_redundant_experts=0` and without rearranging experts. This ensures that any observed differences are solely due to replica-selection and load-recording overhead.
Test Result
We benchmarked 1000 random prompts with 1000 input tokens and 100 output tokens each on DeepSeek-V3-0324 in a DP16 setting. Prefix caching was disabled to measure the raw computational cost.
```bash
vllm bench serve \
    --model $MODEL \
    --dataset-name random \
    --ignore-eos \
    --port ${PORT:-8080} \
    --random-input-len 1000 \
    --random-output-len 100
```
w/o EPLB:
w/ EPLB, main:
w/ EPLB, this PR:
Summary:
Not accounting for the benefits of improved expert load balancing, EPLB on the main branch introduces a ~3.97% throughput drop. With this PR, we recover ~2.41%, narrowing the gap to ~1.66% compared to running without EPLB.
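As a sanity check on the arithmetic, the three figures are mutually consistent if each percentage is taken relative to its own baseline (an assumption about how they were computed):

$$(1 - 0.0397) \times (1 + 0.0241) \approx 0.9834 \approx 1 - 0.0166$$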