[Misc] Add unit tests for MoE ModularKernel combinations + Profiling utility #20449
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @varun-sundar-rabindranath, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the testing infrastructure for Mixture-of-Experts (MoE) ModularKernels. It introduces a comprehensive unit test suite to validate various combinations of PrepareFinalize and FusedExperts implementations in both single-GPU and multi-GPU (single-node) environments. Additionally, it integrates a utility for generating PyTorch profiler traces, aiding performance analysis and debugging of these MoE configurations.
Highlights
- New MoE Combination Test Suite: A new file `test_modular_kernel_combinations.py` is added, providing extensive unit tests for different `PrepareFinalize` and `FusedExperts` combinations, including various quantization settings (FP8, block-quantized) and `topk` values. These tests cover both single-GPU and multi-GPU (single-node) scenarios (a small parametrization sketch follows after this list).
- Distributed Test Utility Enhancement: The `parallel_utils.py` file is updated with `parallel_launch_with_config`, a new helper function that simplifies launching distributed tests with specific vLLM configurations and environment variables, crucial for testing MoE setups.
- MoE Initialization Refactoring: The `FusedMoEMethodBase` class in `layer.py` is refactored to introduce a static method `maybe_make_prepare_finalize`, centralizing the logic for creating `PrepareAndFinalize` objects. This simplifies the initialization process and improves modularity.
- Profiling Utility Integration: The new test script includes an option (`--do-profile`) to generate detailed PyTorch profiler traces for specific MoE kernel executions, enabling in-depth performance analysis.
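As a rough illustration of what this combination coverage looks like, here is a minimal sketch of enumerating PrepareFinalize × FusedExperts × quant-config × topk cases with pytest. This is not the PR's actual code; the collection contents below are placeholders for what `mk_objects.py` provides.

```python
# Sketch only: expressing combination coverage with pytest.mark.parametrize.
# The collection contents are placeholders; the real ones come from mk_objects.py.
import itertools
import pytest

PREPARE_FINALIZE_TYPES = ["PplxPrepareAndFinalize", "DeepEPHTPrepareAndFinalize"]  # placeholders
FUSED_EXPERTS_TYPES = ["BatchedTritonExperts", "TritonExperts"]                    # placeholders
QUANT_CONFIGS = [None, "fp8", "fp8-block"]                                         # placeholders
TOPKS = [2, 6]

CASES = list(itertools.product(
    PREPARE_FINALIZE_TYPES, FUSED_EXPERTS_TYPES, QUANT_CONFIGS, TOPKS))

@pytest.mark.parametrize("pf_type,experts_type,quant_config,topk", CASES)
def test_combination(pf_type, experts_type, quant_config, topk):
    # A real test builds the modular kernel for this combination, runs it,
    # and compares the result against a reference MoE implementation.
    ...
```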
Code Review
This pull request introduces a comprehensive test suite for MoE ModularKernel combinations, which is a valuable addition for ensuring code quality and correctness. The ability to profile different combinations is also a great feature.
I've found a couple of issues that could affect the reliability of the tests. Specifically, a hardcoded port in the parallel utilities could lead to flaky tests, and there's a potential argument swap in the weight generation logic that could cause incorrect behavior. Addressing these points will help solidify this excellent contribution.
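On the hardcoded-port point: a common way to avoid port collisions in distributed test launchers is to ask the OS for a free ephemeral port. A minimal sketch (not this PR's code; `get_open_port` here is just an illustrative helper):

```python
# Usual workaround for hardcoded-port flakiness: bind to port 0 and let the
# OS pick a free ephemeral port, then pass that port to the distributed init.
import socket

def get_open_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))   # port 0 means "any free port"
        return s.getsockname()[1]  # the port the OS actually assigned

master_port = get_open_port()  # e.g. used as MASTER_PORT for the test launch
```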
need some help in verifying if there are more quant configs we should consider. cc @robertgshaw2-redhat @tlrmchlsmth @mgoin Thanks 🙌
We need to leave placeholders for NVFP4 as well
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 1eb40d7 to ba71fd2.
cruft : set below as well
match the batched case
refactor : move prepare-finalize init to a staticmethod that can be invoked from the tests.
Force-pushed from ac43e90 to 9283aa9.
verify if some combination / config is valid.
Mostly looks good once it's working smoothly -- one thing I ran into when running an example from the PR description:
python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type "BatchedTritonExperts"
and hit the following assert:
(EngineCore_1 pid=835) AssertionError: with expert map, -1 id is used for
(EngineCore_1 pid=835) non-local token; this causes error when casting ids to the
(EngineCore_1 pid=835) topk_indices_dtype() uint32
...which looks like a good assert to me. Expert maps + pplx kernels shouldn't be combined IMO
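For readers following along, the failure mode is the usual unsigned wraparound: the -1 sentinel that marks a non-local token becomes a huge index once cast to uint32. A tiny illustration (numpy used for brevity; the kernels operate on torch tensors):

```python
# Why the assert exists: a -1 "non-local token" sentinel wraps around when
# reinterpreted as an unsigned 32-bit expert/topk index.
import numpy as np

topk_ids = np.array([3, -1, 7], dtype=np.int32)  # -1 marks a non-local token
as_uint32 = topk_ids.astype(np.uint32)
print(as_uint32)  # [3 4294967295 7] -> 4294967295 is far out of range for any expert table
```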
Thanks @tlrmchlsmth. #20714 should fix the error.
Force-pushed from 1142bf8 to 670e76a.
Head branch was pushed to by a user without write access
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Force-pushed from 438001f to 6f1bf3e.
Purpose
The ModularKernel framework is very useful for mixing and matching different PrepareFinalize objects with FusedExperts implementations. The catch is that it is hard to test the various combinations of these operations. This PR adds a `test_modular_kernel_combinations` unit test that exercises various combinations in multi-GPU (single-node) and single-GPU settings.

Design
- `tests/kernels/moe/modular_kernel_tools` holds the test tooling.
- `tests/kernels/moe/modular_kernel_tools/mk_objects.py` defines all high-level collections, like all prepare-finalize types, all fused experts types, and all quant configs.
- `tests/kernels/moe/modular_kernel_tools/common.py` defines all high-level utilities, mainly the functions `make_modular_kernel` and `run_modular_kernel` (a rough usage sketch follows after this list).
- `tests/kernels/moe/test_modular_kernel_combinations.py`, the profiling code, and the feature-matrix generator code all leverage the `make_modular_kernel` / `run_modular_kernel` functions.
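To make the design concrete, here is a rough sketch of how a test is expected to use these helpers. The call signatures below are assumptions for illustration only; see `common.py` in the PR for the real ones.

```python
# Rough sketch only -- the argument lists of make_modular_kernel /
# run_modular_kernel are assumed for illustration and may not match common.py.
from tests.kernels.moe.modular_kernel_tools.common import (
    make_modular_kernel,
    run_modular_kernel,
)

def check_one_combination(config, weights, rank_tensors):
    # Build a ModularKernel that pairs the PrepareAndFinalize type and the
    # FusedExperts type named in `config` (plus its quant settings).
    mk = make_modular_kernel(config)
    # Execute the kernel on this rank's activations/weights; the unit test
    # compares this output against a reference MoE implementation.
    return run_modular_kernel(mk, config, weights, rank_tensors)
```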
Restrictions

- The multi-GPU tests cover only the `--data-parallel-size=2` and `--tensor-parallel-size=1` case.
- The tests require the `pplx`, `deep_ep`, and `deep_gemm` packages to run. This is a harsh requirement that can be relaxed (see the sketch below for one way this could be done).
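One way the hard package requirement could be relaxed (an assumption, not something this PR implements) is to skip combinations whose backend package is not importable. The module names below are assumptions; the PR refers to the packages as `pplx`, `deep_ep`, and `deep_gemm`.

```python
# Possible relaxation of the package requirement (not part of this PR):
# skip backend-specific combinations when the package is absent.
import importlib.util
import pytest

def _has(mod: str) -> bool:
    return importlib.util.find_spec(mod) is not None

requires_pplx = pytest.mark.skipif(not _has("pplx_kernels"), reason="pplx kernels not installed")
requires_deep_ep = pytest.mark.skipif(not _has("deep_ep"), reason="DeepEP not installed")
requires_deep_gemm = pytest.mark.skipif(not _has("deep_gemm"), reason="DeepGEMM not installed")

@requires_pplx
def test_pplx_prepare_finalize_combinations():
    ...
```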
Features

- `test_modular_kernel_combinations.py` can be run as a standalone script to test specific PrepareAndFinalize and FusedExperts combinations.

Profiling command example:
`python3 -m tests.kernels.moe.modular_kernel_tools.profile_modular_kernel --pf-type PplxPrepareAndFinalize --experts-type BatchedTritonExperts --torch-trace-dir-path /home/varun/code/vllm/torch_trace_files/`

Feature Matrix Generation command example:
`python3 -m tests.kernels.moe.modular_kernel_tools.make_feature_matrix -f feature_matrices/feature_matrix.csv`

feature_matrix.csv
Test Plan
Machine: H100

pytest:
- `test_modular_kernel_combinations.py` tests pass locally.

e2e tests:
- `VLLM_ALL2ALL_BACKEND="deepep_high_throughput" VLLM_USE_DEEP_GEMM=1 vllm serve Qwen/Qwen3-30B-A3B-FP8 --trust-remote-code --enable-expert-parallel --data-parallel-size 2 --port 9010`
- `VLLM_ALL2ALL_BACKEND="pplx" vllm serve deepseek-ai/DeepSeek-V2-Lite --data-parallel-size 2 --enable-expert-parallel --port 9020 --trust-remote-code`

Test Result
(Optional) Documentation Update