
[trainer] a few fixes #9993

Merged

stas00 merged 3 commits into huggingface:master from stas00:ds-fixes on Feb 4, 2021

Conversation

stas00 (Contributor) commented Feb 4, 2021

This PR:

  • removes model.to(device): it's not needed for DeepSpeed, and more importantly it allows loading models that otherwise won't fit, e.g. a 45GB (fp32) model onto a 40GB GPU when using DeepSpeed with fp16, since only ~22GB of weights end up loaded (see the rough memory sketch below). Currently we load the full 45GB right away and nothing works.
  • decouples two unrelated pieces of model-parallel logic that were confusingly entangled in the previous if/else incarnation.
  • fixes a bug that left a DeepSpeed model wrapped in DDP when it shouldn't be, like a few other bugs of the same kind I introduced because things just happened to work until they didn't.

This PR enables t5-11b training on 1x 40GB GPU w/ DeepSpeed: #9996
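A back-of-the-envelope sketch of the memory arithmetic behind those numbers (illustrative only; the exact figures depend on the checkpoint and ignore optimizer state, gradients, and activations):

```python
# Rough parameter-memory arithmetic (illustrative; weights only).
def weights_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1024**3

t5_11b = 11e9  # ~11 billion parameters

print(f"fp32 weights: ~{weights_gb(t5_11b, 4):.0f} GB")  # ~41 GB: won't fit on a 40GB GPU
print(f"fp16 weights: ~{weights_gb(t5_11b, 2):.0f} GB")  # ~20 GB: fits once placement is left to DeepSpeed in fp16
```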

@sgugger

sgugger (Collaborator) commented Feb 4, 2021

This is breaking, sadly: with this change, someone using trainer.model after instantiating a Trainer won't have it on the GPU anymore, which will make their code fail. It's also best, IMO, if an OOM error happens sooner rather than later.

Now, for DeepSpeed I understand why this is necessary, so we can skip the model.to in that case. I don't see other cases where this is useful (mixed precision with APEX and AMP keeps a copy of the model in full precision).

stas00 (Contributor, Author) commented Feb 4, 2021

Oh, that's no problem for now. Let's do it just for DeepSpeed then. Fairscale might join down the road.

Actually, DeepSpeed doesn't need the .to() call at all, so it's even simpler.

So basically, skipping .to() is needed for any extension that partitions or otherwise tweaks the model's size: MP and DeepSpeed today, and PP later as well.
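A minimal sketch of the placement guard being discussed, assuming hypothetical names (place_model_on_device, is_model_parallel, deepspeed) that stand in for the Trainer's internals rather than the exact code merged here:

```python
# Minimal sketch only; attribute names are illustrative stand-ins for the
# Trainer's internal state, not the exact implementation merged in this PR.
class TrainerSketch:
    def __init__(self, model, device, is_model_parallel=False, deepspeed=None):
        # Extensions that partition or resize the model (naive model parallelism,
        # DeepSpeed, and later pipeline parallelism) control device placement
        # themselves, so the Trainer should not call model.to() for them.
        place_model_on_device = not (is_model_parallel or deepspeed is not None)
        if place_model_on_device:
            model = model.to(device)
        self.model = model
```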

stas00 changed the title from "[wip] [trainer] a few fixes" to "[trainer] a few fixes" on Feb 4, 2021
sgugger (Collaborator) left a comment

Better this way, thanks for adapting!

stas00 merged commit 8c3b1fc into huggingface:master on Feb 4, 2021
stas00 deleted the ds-fixes branch on February 4, 2021 at 15:45
```python
    model = ShardedDDP(model, self.optimizer)
elif is_sagemaker_distributed_available():
    model = DDP(model, device_ids=[dist.get_local_rank()], broadcast_buffers=False)
if self.deepspeed:
```
sgugger (Collaborator) commented:

FYI this breaks most integrations: it should be an elif, so that we don't fall into the branches after this one when TPU or SageMaker is in use.
Will fix in a commit on master.
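For reference, a small self-contained toy (not the actual Trainer code; wrapper names and flags are stand-ins) showing why the bare if breaks the other integrations:

```python
# Toy illustration of the fall-through bug: if the DeepSpeed branch starts a
# brand-new `if`, the branches after it still run even though an earlier
# branch already wrapped the model.
def wrap(model, sharded=False, sagemaker=False, deepspeed=False, distributed=False):
    if sharded:
        model = f"ShardedDDP({model})"
    elif sagemaker:
        model = f"SageMakerDDP({model})"
    elif deepspeed:  # the fix: elif keeps the chain mutually exclusive
        model = f"DeepSpeed({model})"
    elif distributed:
        model = f"DDP({model})"
    return model

print(wrap("model", sagemaker=True, distributed=True))
# -> SageMakerDDP(model); with the buggy `if deepspeed:` starting a new chain,
#    the `distributed` branch would have re-wrapped this model in DDP as well.
```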

stas00 (Contributor, Author) commented Feb 4, 2021:

Oh boy, my apologies, my branching skills went haywire yesterday.

Just because one puts an if foo right next to an existing set of conditionals doesn't make it part of them. I need a different programming language that is more do-what-I-mean-when-I-am-tired.

stas00 (Contributor, Author) commented:

Thank you for the fix, @sgugger.

sgugger (Collaborator) commented:

No worries, just wanted to alert you :-) Thankfully we found this just before cutting the release candidate!

stas00 (Contributor, Author) commented:

Oh my!

As I said above, this literally happened to me several times yesterday: something went haywire and I started adding new branches with bare ifs adjacent to an existing if/elif/else pile, and my brain decided that since they sit together they must be part of the same chain. So odd. Some new programming language must be percolating through my neurons, or a rogue AI took over and is using my brain for its experiments.
