[Intel HPU] enable MoE EP for hpu #5855
Conversation
Thanks for your contribution!
Pull request overview
This PR enables MoE (Mixture of Experts) Expert Parallelism (EP) for Intel HPU by modifying the execution path and weight handling to accommodate HPU-specific requirements.
Key changes:
- Modified MoE forward logic to route HPU through forward_normal regardless of EP/TP configuration
- Converted down_proj_in_scale from a list to a tensor and added padding for HPU's 0x80-byte alignment requirement
- Added up_gate_proj.activation_scale weight loading support for EP mode
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| fastdeploy/model_executor/layers/moe/moe.py | Routes HPU platform to use forward_normal path for both EP and TP modes |
| fastdeploy/model_executor/layers/backends/intel_hpu/moe/fused_moe_hpu_backend.py | Changes down_proj_in_scale handling from list to tensor and renames apply_tp to apply |
| fastdeploy/worker/hpu_model_runner.py | Adds alignment padding function for scales and implements early return for EP mode |
| fastdeploy/model_executor/load_weight_utils.py | Adds up_gate_proj_in_scale_key to weight loading for EP support |
| examples/intel_hpu/offline_demo.py | Enables EP configuration in demo script |
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
## develop #5855 +/- ##
==========================================
Coverage ? 67.04%
==========================================
Files ? 348
Lines ? 44643
Branches ? 6862
==========================================
Hits ? 29932
Misses ? 12507
Partials ? 2204
add @LeoZhao-Intel @fmiao2372
        tensor_model_parallel_all_reduce_custom(out)
    else:
        out = tensor_model_parallel_all_reduce(out, self.tp_group)
Why doesn't this use tensor_model_parallel_all_reduce_custom?
Motivation
Enable MoE EP for HPU with loader_v1.
Modifications
fastdeploy/model_executor/layers/moe/moe.py
HPU calls forward_normal regardless of EP or TP, and never falls into forward_split_allgather or forward_chunked_moe.
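As a rough illustration of this dispatch (the class, platform flag, and method bodies below are stand-ins, not the actual moe.py code):

```python
# Minimal, self-contained sketch of the routing described above: on Intel HPU
# the MoE layer always takes forward_normal, whether EP or TP is enabled,
# instead of forward_split_allgather / forward_chunked_moe.
class FusedMoESketch:
    def __init__(self, is_intel_hpu: bool, ep_size: int = 1):
        self.is_intel_hpu = is_intel_hpu
        self.ep_size = ep_size

    def forward(self, x):
        if self.is_intel_hpu:
            return self.forward_normal(x)            # HPU: single path for EP and TP
        if self.ep_size > 1:
            return self.forward_split_allgather(x)   # non-HPU EP path
        return self.forward_chunked_moe(x)           # non-HPU TP path

    def forward_normal(self, x):
        return "forward_normal"

    def forward_split_allgather(self, x):
        return "forward_split_allgather"

    def forward_chunked_moe(self, x):
        return "forward_chunked_moe"


# With EP enabled on HPU, the call still lands in forward_normal.
assert FusedMoESketch(is_intel_hpu=True, ep_size=8).forward(None) == "forward_normal"
```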
fused_moe_hpu_backend.py
Change down_proj_in_scale from a list to a tensor.
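A hedged sketch of what this list-to-tensor conversion looks like (variable names other than down_proj_in_scale are illustrative):

```python
import paddle

num_experts = 8
# Before: one per-expert scale tensor kept in a Python list.
down_proj_in_scale_list = [paddle.to_tensor([0.05], dtype="float32") for _ in range(num_experts)]
# After: a single contiguous tensor the HPU fused-MoE backend can consume directly.
down_proj_in_scale = paddle.concat(down_proj_in_scale_list)  # shape: [num_experts]
print(down_proj_in_scale.shape)
```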
hpu_model_runner.py
Convert the scale list to a tensor and add a padding dimension to meet the 0x80-byte alignment requirement.
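One possible shape of that alignment step, assuming float32 scales so a 0x80-byte boundary means a multiple of 32 elements; the function name and exact padding scheme are assumptions, not the code in hpu_model_runner.py:

```python
import paddle


def pad_scale_for_hpu(scale: paddle.Tensor, align_bytes: int = 0x80) -> paddle.Tensor:
    """Zero-pad the last dimension so the buffer size is a multiple of align_bytes."""
    elem_bytes = 4  # assumption: float32 scales
    align_elems = align_bytes // elem_bytes
    n = scale.shape[-1]
    padded = ((n + align_elems - 1) // align_elems) * align_elems
    if padded == n:
        return scale
    pad = paddle.zeros(list(scale.shape[:-1]) + [padded - n], dtype=scale.dtype)
    return paddle.concat([scale, pad], axis=-1)


print(pad_scale_for_hpu(paddle.rand([70], dtype="float32")).shape)  # [96]: padded to a multiple of 32
```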
fastdeploy/model_executor/load_weight_utils.py
Loader v0 needs the up_gate_proj.activation_scale weight for EP, so its key is added to weight loading.
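Roughly the kind of key addition this refers to; the helper name, dict layout, prefix, and the down_proj key below are hypothetical, and only up_gate_proj_in_scale_key / up_gate_proj.activation_scale are named by this PR:

```python
# Hypothetical illustration: include the up_gate_proj activation-scale key in
# the per-expert keys gathered during weight loading when EP is enabled.
def moe_scale_keys(expert_prefix: str, enable_expert_parallel: bool) -> dict:
    keys = {
        "down_proj_in_scale_key": f"{expert_prefix}.down_proj.activation_scale",
    }
    if enable_expert_parallel:
        # EP with loader v0 also needs the up_gate_proj activation scale.
        keys["up_gate_proj_in_scale_key"] = f"{expert_prefix}.up_gate_proj.activation_scale"
    return keys


print(moe_scale_keys("ernie.layers.1.mlp.experts.0", enable_expert_parallel=True))
```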
fastdeploy/model_executor/models/ernie4_5_moe.py
Add attention-related activation_scale name conversions.
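For illustration only, the general shape of such a name conversion; the attention parameter names below are made up and the concrete mapping lives in ernie4_5_moe.py:

```python
# Purely illustrative: map checkpoint activation-scale names for attention
# projections onto internal parameter names. Real names may differ.
def convert_attn_scale_name(ckpt_name: str) -> str:
    conversions = {
        "qkv_proj.activation_scale": "qkv_proj.act_scale",
        "o_proj.activation_scale": "o_proj.act_scale",
    }
    for old, new in conversions.items():
        if ckpt_name.endswith(old):
            return ckpt_name[: -len(old)] + new
    return ckpt_name


print(convert_attn_scale_name("ernie.layers.0.self_attn.qkv_proj.activation_scale"))
```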
Usage or Command
Set enable_expert_parallel=True and disable_sequence_parallel_moe=True to enable HPU MoE EP.
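A minimal offline sketch along the lines of examples/intel_hpu/offline_demo.py; the model path, prompt, and arguments other than the two flags above are placeholders, so check the demo script for exact spellings:

```python
from fastdeploy import LLM, SamplingParams

llm = LLM(
    model="/path/to/ERNIE-4.5-MoE",        # placeholder model path
    tensor_parallel_size=8,                # placeholder parallel degree
    enable_expert_parallel=True,           # turn on MoE EP
    disable_sequence_parallel_moe=True,    # required together with EP on HPU per this PR
)

outputs = llm.generate(["Hello from Intel HPU"], SamplingParams(max_tokens=32))
for out in outputs:
    print(out)
```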
Accuracy Tests
Conducted by local tests.
Checklist
- Add at least a tag in the PR title from the tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- Run pre-commit before commit.
- If the PR targets the release branch, make sure it has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.