
Conversation


@LckyLke LckyLke commented Nov 11, 2025

I have implemented some custom logic in the deepspeed_moe classes, and having "expert" in any parameter name breaks the checkpoint-saving function.

The warning triggers because the code finds an "expert" (by name) that is not actually one:

[WARNING] [engine.py:3597:_save_moe_checkpoint] No expert found in key transformer.layers.0.1.deepspeed_moe.gate.wg.experts_mask.

but since the loop does not `continue`, this error still happens:

TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

A simple continue fixes this :)
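
For context, here is a minimal, self-contained sketch of the failing pattern and the proposed fix. The dictionary contents, prefix, and rank variables are simplified stand-ins for the engine state inside `_save_moe_checkpoint`, not the exact DeepSpeed internals:

```python
import logging
import re

logger = logging.getLogger(__name__)

# Stand-ins for the state inside _save_moe_checkpoint (names approximate).
moe_state_dict = {
    "transformer.layers.0.1.deepspeed_moe.experts.deepspeed_experts.0.w1.weight": "tensor",
    "transformer.layers.0.1.deepspeed_moe.gate.wg.experts_mask": "tensor",  # custom param, triggers the warning
}
moe_str_prefix = ".deepspeed_moe.experts.deepspeed_experts."
expp_rank, num_local_experts = 0, 1

for key in list(moe_state_dict.keys()):
    local_expert_id = None
    m = re.match(f".*{re.escape(moe_str_prefix)}([0-9]+).*", key)
    if m is None:
        logger.warning(f"No expert found in key {key}.")
        continue  # the proposed fix; without it, local_expert_id stays None
    else:
        local_expert_id = m.group(1)
    # Without the continue, this line is reached with local_expert_id = None
    # and int(None) raises the TypeError quoted above.
    global_expert_id = expp_rank * num_local_experts + int(local_expert_id)
```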

LckyLke and others added 8 commits November 11, 2025 20:35
1. `modal-accelerate` now needs `uv` installed explicitly since the image
changed to the 2025 one.
2. Moved the accelerate repo cloning into the job, since the original approach
was incorrect: it cached some accelerate version and never updated it.
3. Documented how to actually test the CI when changing the workflow, since
`pull_request_target` will not run the updated .py/.yaml files.

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Luke Friedrichs <[email protected]>
Add Masahiro's explanation of why that code is there.

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Luke Friedrichs <[email protected]>
As we lost the V100s, disable it first so that it stops interfering with PRs,
then port it to Modal.

Signed-off-by: Luke Friedrichs <[email protected]>
…er (deepspeedai#7658)

This PR allows separate learning rates for the Muon and Adam parts of the Muon
optimizer. Follow-up to
deepspeedai#7657

Signed-off-by: Guokai Ma <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Luke Friedrichs <[email protected]>
…ntinue

Otherwise: `TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'`
Signed-off-by: Luke Friedrichs <[email protected]>
…o import moe"

This reverts commit 2f232b9.

Signed-off-by: Luke Friedrichs <[email protected]>
@LckyLke LckyLke marked this pull request as draft November 11, 2025 19:55
@sfc-gh-truwase
Collaborator

@stas00, FYI

@LckyLke
Author

LckyLke commented Nov 12, 2025

@stas00, FYI

Converted this to a draft because just `continue` is not sufficient: the parameter is not saved at all in this case, so loading the model again then fails.
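
One possible direction (a hypothetical sketch only, reusing the names from the sketch above plus a made-up `non_moe_state_dict`; not a tested fix) would be to move such keys into the non-expert part of the checkpoint instead of dropping them, so they are still saved and can be restored on load:

```python
# Hypothetical: route falsely-matched "expert" keys into the regular
# (non-expert) state dict instead of skipping them entirely, so the
# parameter still ends up in a checkpoint shard and can be loaded back.
non_moe_state_dict = {}  # made-up name for the non-expert checkpoint dict
for key in list(moe_state_dict.keys()):
    if re.match(f".*{re.escape(moe_str_prefix)}([0-9]+).*", key) is None:
        logger.warning(f"No expert found in key {key}; saving it with the non-expert parameters.")
        non_moe_state_dict[key] = moe_state_dict.pop(key)
```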

@stas00
Collaborator

stas00 commented Nov 12, 2025

I haven't gotten to checkpoint saving yet, so I don't have an understanding of this code yet.

It's interesting someone is using this old implementation! @LckyLke, we are working on modernizing the original DS-MoE here snowflakedb/ArcticTraining#272 - currently qwen3-moe and qwen3-next are supported - but no checkpoint saving yet... will come later.

@LckyLke
Author

LckyLke commented Nov 19, 2025

@stas00 thanks for the info, I will definitely check it out :)
Maybe I can find a fix for my problem here over the weekend.
