
Conversation


@LckyLke LckyLke commented Nov 11, 2025

I have implemented some custom logic in the deepspeed_moe classes, and having "expert" in any parameter name breaks the checkpoint-saving function.

The warning triggers because the code finds an "expert" (by name) that is not actually one:

[WARNING] [engine.py:3597:_save_moe_checkpoint] No expert found in key transformer.layers.0.1.deepspeed_moe.gate.wg.experts_mask.

but since the loop does not `continue`, this error still happens:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

A simple continue fixes this :)
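
For context, here is a minimal, self-contained sketch of the failing pattern and the proposed fix. The dictionary contents, prefix, and rank variables are simplified stand-ins for the engine state inside `_save_moe_checkpoint`, not the exact DeepSpeed internals:

```python
import logging
import re

logger = logging.getLogger(__name__)

# Stand-ins for the state inside _save_moe_checkpoint (names approximate).
moe_state_dict = {
    "transformer.layers.0.1.deepspeed_moe.experts.deepspeed_experts.0.w1.weight": "tensor",
    "transformer.layers.0.1.deepspeed_moe.gate.wg.experts_mask": "tensor",  # custom param, triggers the warning
}
moe_str_prefix = ".deepspeed_moe.experts.deepspeed_experts."
expp_rank, num_local_experts = 0, 1

for key in list(moe_state_dict.keys()):
    local_expert_id = None
    m = re.match(f".*{re.escape(moe_str_prefix)}([0-9]+).*", key)
    if m is None:
        logger.warning(f"No expert found in key {key}.")
        continue  # the proposed fix; without it, local_expert_id stays None
    else:
        local_expert_id = m.group(1)
    # Without the continue, this line is reached with local_expert_id = None
    # and int(None) raises the TypeError quoted above.
    global_expert_id = expp_rank * num_local_experts + int(local_expert_id)
```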

LckyLke and others added 8 commits November 11, 2025 20:35
1. `modal-accelerate` now needs `uv` installed explicitly since the image
changed to the 2025 one.
2. Moved the accelerate repo cloning into the job, since the original approach
was incorrect: it cached some accelerate version and never updated it.
3. Documented how to actually test the CI when changing the workflow, since
`pull_request_target` will not run the updated .py/.yaml files.

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Luke Friedrichs <[email protected]>
Add Masahiro's explanation of why that code is there.

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Luke Friedrichs <[email protected]>
As we lost the V100s, disable it first so that it stops interfering with PRs,
then port it to Modal.

Signed-off-by: Luke Friedrichs <[email protected]>
…er (deepspeedai#7658)

This PR allows separate learning rates for the Muon and Adam parts of the Muon
optimizer. Follow-up to
deepspeedai#7657

Signed-off-by: Guokai Ma <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Luke Friedrichs <[email protected]>
…ntinue

Otherwise: `TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'`
Signed-off-by: Luke Friedrichs <[email protected]>
…o import moe"

This reverts commit 2f232b9.

Signed-off-by: Luke Friedrichs <[email protected]>
@LckyLke LckyLke marked this pull request as draft November 11, 2025 19:55
@sfc-gh-truwase
Collaborator

@stas00, FYI

@LckyLke
Author

LckyLke commented Nov 12, 2025

@stas00, FYI

Converted this to a draft because just `continue` is not sufficient: the parameter is not saved at all in this case, so loading the model again then fails.
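
One possible direction (a hypothetical sketch only, reusing the names from the sketch above plus a made-up `non_moe_state_dict`; not a tested fix) would be to move such keys into the non-expert part of the checkpoint instead of dropping them, so they are still saved and can be restored on load:

```python
# Hypothetical: route falsely-matched "expert" keys into the regular
# (non-expert) state dict instead of skipping them entirely, so the
# parameter still ends up in a checkpoint shard and can be loaded back.
non_moe_state_dict = {}  # made-up name for the non-expert checkpoint dict
for key in list(moe_state_dict.keys()):
    if re.match(f".*{re.escape(moe_str_prefix)}([0-9]+).*", key) is None:
        logger.warning(f"No expert found in key {key}; saving it with the non-expert parameters.")
        non_moe_state_dict[key] = moe_state_dict.pop(key)
```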

@stas00
Collaborator

stas00 commented Nov 12, 2025

I haven't gotten to checkpoint saving yet, so I don't have an understanding of this code yet.

It's interesting someone is using this old implementation! @LckyLke, we are working on modernizing the original DS-MoE here snowflakedb/ArcticTraining#272 - currently qwen3-moe and qwen3-next are supported - but no checkpoint saving yet... will come later.

@LckyLke
Author

LckyLke commented Nov 19, 2025

@stas00 thanks for the info, I will definitely check it out :)
Maybe I can find a fix for my problem here over the weekend.
