If no expert is found in a parameter that has "expert" in its name, the loop should continue #7685
base: master
Conversation
… moe Signed-off-by: Luke Friedrichs <[email protected]>
Signed-off-by: Luke Friedrichs <[email protected]>
1. `modal-accelerate` now needs `uv` installed explicitly since the image change to the 2025 one.
2. Moved accelerate repo cloning into the job, since the original way was incorrect as it was caching some accelerate version and not updating it.
3. Annotated how to actually test the CI work when changing the workflow, as `pull_request_target` will not run the updated .py+.yaml files.
--------- Signed-off-by: Stas Bekman <[email protected]> Signed-off-by: Luke Friedrichs <[email protected]>
add Masahiro's explanation of why that code is there. --------- Signed-off-by: Stas Bekman <[email protected]> Signed-off-by: Luke Friedrichs <[email protected]>
as we lost v100s - disable first so that it stops interfering with PRs, then port to modal. Signed-off-by: Luke Friedrichs <[email protected]>
…er (deepspeedai#7658) This PR allows a separate learning rate for the muon and adam parts of the Muon optimizer. Following up deepspeedai#7657 Signed-off-by: Guokai Ma <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Signed-off-by: Luke Friedrichs <[email protected]>
…ntinue Otherwise: `TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'` Signed-off-by: Luke Friedrichs <[email protected]>
…o import moe" This reverts commit 2f232b9. Signed-off-by: Luke Friedrichs <[email protected]>
@stas00, FYI
Converted this to a draft because just `continue` is not sufficient: the parameter is not saved at all in this case, so loading the model again then fails.
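A minimal sketch of that concern, assuming a hypothetical split between expert and shared state dicts (the names and regex below are illustrative, not the actual engine code): simply skipping such a parameter drops it from every state dict, so one option is to route it to the non-expert part instead.

```python
import re

# Hypothetical illustration (not the actual DeepSpeed engine code) of why a
# bare `continue` is not enough: a parameter such as
# "...deepspeed_moe.gate.wg.experts_mask" contains "expert" but carries no
# expert id, and if it is simply skipped it lands in neither state dict and
# is silently missing when the checkpoint is loaded again.
EXPERT_ID_RE = re.compile(r"experts\.deepspeed_experts\.(\d+)\.")  # assumed pattern

def split_moe_state(state_dict):
    expert_state, shared_state = {}, {}
    for key, value in state_dict.items():
        match = EXPERT_ID_RE.search(key) if "expert" in key else None
        if match is not None:
            expert_state.setdefault(int(match.group(1)), {})[key] = value
        else:
            # Keep "expert"-named but id-less parameters with the shared
            # (non-expert) parameters so they are still saved and restored.
            shared_state[key] = value
    return expert_state, shared_state
```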
I haven't gotten to saving checkpoints yet, so I don't have an understanding of this code yet. It's interesting that someone is using this old implementation! @LckyLke, we are working on modernizing the original DS-MoE here: snowflakedb/ArcticTraining#272 - currently qwen3-moe and qwen3-next are supported - but no checkpoint saving yet... will come later.
@stas00 thanks for the info, I will definitely check it out :)
I have implemented some custom logic in the deepspeed_moe classes, and having "expert" in any parameter name breaks the checkpoint saving function.
The warning triggers since the code finds an "expert" (by name) which is not actually one:
`[WARNING] [engine.py:3597:_save_moe_checkpoint] No expert found in key transformer.layers.0.1.deepspeed_moe.gate.wg.experts_mask.`
But as we do not continue the loop, this error still happens:
`TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'`
A simple `continue` fixes this :)
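For illustration, a minimal sketch of the failure mode and the proposed `continue`, assuming a name-based expert lookup (the regex and helper names below are hypothetical, not the actual `_save_moe_checkpoint` code):

```python
import re

# Assumed pattern for extracting the numeric expert id from a parameter name;
# the real engine may use a different naming scheme.
EXPERT_ID_RE = re.compile(r"experts\.deepspeed_experts\.(\d+)\.")

def get_expert_num(key):
    match = EXPERT_ID_RE.search(key)
    return match.group(1) if match else None  # None for keys like "...experts_mask"

def collect_expert_params(state_dict):
    experts = {}
    for key, value in state_dict.items():
        if "expert" not in key:
            continue
        expert_num = get_expert_num(key)
        if expert_num is None:
            # Without this `continue`, int(expert_num) below raises:
            # TypeError: int() argument must be a string, a bytes-like
            # object or a real number, not 'NoneType'
            print(f"No expert found in key {key}.")
            continue
        experts.setdefault(int(expert_num), {})[key] = value
    return experts
```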