
[trainer] a few fixes #9993

Merged

stas00 merged 3 commits into huggingface:master from stas00:ds-fixes on Feb 4, 2021

Conversation

stas00 (Contributor) commented Feb 4, 2021

This PR:

  • removes model.to(device): it's not needed for DeepSpeed, and more importantly it allows loading models that otherwise won't fit, e.g. a 45GB (fp32) model onto a 40GB GPU when using DeepSpeed with fp16, since only ~22GB of weights end up loaded (see the rough memory sketch below). Currently we load the full 45GB right away and nothing works.
  • decouples two unrelated pieces of model-parallel logic that were confusingly entangled in the previous if/else incarnation.
  • fixes a bug that left a DeepSpeed model wrapped in DDP when it shouldn't be, like a few other bugs of the same kind I introduced because things just happened to work until they didn't.

This PR enables t5-11b training on 1x 40GB GPU w/ DeepSpeed: #9996
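A back-of-the-envelope sketch of the memory arithmetic behind those numbers (illustrative only; the exact figures depend on the checkpoint and ignore optimizer state, gradients, and activations):

```python
# Rough parameter-memory arithmetic (illustrative; weights only).
def weights_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1024**3

t5_11b = 11e9  # ~11 billion parameters

print(f"fp32 weights: ~{weights_gb(t5_11b, 4):.0f} GB")  # ~41 GB: won't fit on a 40GB GPU
print(f"fp16 weights: ~{weights_gb(t5_11b, 2):.0f} GB")  # ~20 GB: fits once placement is left to DeepSpeed in fp16
```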

@sgugger

sgugger (Collaborator) commented Feb 4, 2021

This is breaking, sadly: with this change, someone using trainer.model after instantiating a Trainer won't have it on the GPU anymore, which will make their code fail. It's also best, IMO, if an OOM error happens sooner rather than later.

Now, for DeepSpeed I understand why this is necessary, so we can skip the model.to in that case. I don't see other cases where this is useful (mixed precision with APEX and AMP keeps a copy of the model in full precision).

stas00 (Contributor, Author) commented Feb 4, 2021

Oh, that's no problem for now. Let's do it just for DeepSpeed then. Fairscale might join down the road.

Actually, DeepSpeed doesn't need the .to() call at all, so it's even simpler.

So basically, skipping .to() is needed for any extension that partitions or otherwise tweaks the model's size: MP and DeepSpeed today, and PP later as well.
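A minimal sketch of the placement guard being discussed, assuming hypothetical names (place_model_on_device, is_model_parallel, deepspeed) that stand in for the Trainer's internals rather than the exact code merged here:

```python
# Minimal sketch only; attribute names are illustrative stand-ins for the
# Trainer's internal state, not the exact implementation merged in this PR.
class TrainerSketch:
    def __init__(self, model, device, is_model_parallel=False, deepspeed=None):
        # Extensions that partition or resize the model (naive model parallelism,
        # DeepSpeed, and later pipeline parallelism) control device placement
        # themselves, so the Trainer should not call model.to() for them.
        place_model_on_device = not (is_model_parallel or deepspeed is not None)
        if place_model_on_device:
            model = model.to(device)
        self.model = model
```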

stas00 changed the title from "[wip] [trainer] a few fixes" to "[trainer] a few fixes" on Feb 4, 2021
sgugger (Collaborator) left a comment

Better this way, thanks for adapting!

stas00 merged commit 8c3b1fc into huggingface:master on Feb 4, 2021
stas00 deleted the ds-fixes branch on February 4, 2021 at 15:45
```python
    model = ShardedDDP(model, self.optimizer)
elif is_sagemaker_distributed_available():
    model = DDP(model, device_ids=[dist.get_local_rank()], broadcast_buffers=False)
if self.deepspeed:
```
sgugger (Collaborator) commented:

FYI this breaks most integrations: it should be an elif, so that we don't fall into the branches after this one when TPU or SageMaker is in use.
Will fix in a commit on master.
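For reference, a small self-contained toy (not the actual Trainer code; wrapper names and flags are stand-ins) showing why the bare if breaks the other integrations:

```python
# Toy illustration of the fall-through bug: if the DeepSpeed branch starts a
# brand-new `if`, the branches after it still run even though an earlier
# branch already wrapped the model.
def wrap(model, sharded=False, sagemaker=False, deepspeed=False, distributed=False):
    if sharded:
        model = f"ShardedDDP({model})"
    elif sagemaker:
        model = f"SageMakerDDP({model})"
    elif deepspeed:  # the fix: elif keeps the chain mutually exclusive
        model = f"DeepSpeed({model})"
    elif distributed:
        model = f"DDP({model})"
    return model

print(wrap("model", sagemaker=True, distributed=True))
# -> SageMakerDDP(model); with the buggy `if deepspeed:` starting a new chain,
#    the `distributed` branch would have re-wrapped this model in DDP as well.
```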

stas00 (Contributor, Author) commented Feb 4, 2021:

Oh boy, my apologies, my branching skills went haywire yesterday.

Just because one puts an if foo right next to an existing set of conditionals doesn't make it part of them. I need a different programming language that is more do-what-I-mean-when-I-am-tired.

stas00 (Contributor, Author) commented:

Thank you for the fix, @sgugger.

sgugger (Collaborator) commented:

No worries, just wanted to alert you :-) Thankfully we found this just before cutting the release candidate!

stas00 (Contributor, Author) commented:

Oh my!

As I said above, this literally happened to me several times yesterday: something went haywire and I started adding new branches with bare ifs adjacent to an existing if/elif/else pile, and my brain decided that since they sit together they must be part of the same chain. So odd. Some new programming language must be percolating through my neurons, or a rogue AI took over and is using my brain for its experiments.
