
[trainer] model argument is not the same depending on n_gpus #8736

@stas00

Description


Extracting the discussion from #8716

Summary of the issue: prediction_step() receives a model argument that is the plain model when n_gpu < 2, but a DataParallel-wrapped model when n_gpu > 1. So the API suffers from ambiguity here.

The user really has to use self.model to access attributes like model.config or call methods like model.generate(), which aren't available on the wrapped model. But it's very likely they will use the model argument instead, since it acts like self.model except under multi-GPU. And why do we even have that model argument then?
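To make the failure mode concrete, here is a minimal sketch. The classes are hypothetical stand-ins (so it runs without torch or GPUs): `DataParallelStandIn` plays the role of torch.nn.DataParallel, which stores the wrapped model as `.module` and, like any nn.Module wrapper, does not forward arbitrary method lookups to it.

```python
# Hypothetical stand-ins for a transformers model and torch.nn.DataParallel.
class Model:
    def generate(self):
        return "text"

class DataParallelStandIn:
    """Stores the wrapped model as .module, like torch.nn.DataParallel."""
    def __init__(self, module):
        self.module = module

model = DataParallelStandIn(Model())

# Calling a model-specific method on the wrapper fails...
try:
    model.generate()
except AttributeError:
    print("generate() is not reachable on the wrapped model")

# ...while the underlying model (what self.model points at) still works.
print(model.module.generate())
```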

Possible solutions discussed:

  1. monkeypatch torch.nn.DataParallel to expand its API so it transparently supports all the methods of the original model, by installing a catch-all __getattr__ that delegates all failed method lookups to self.module.

  2. stop calling the function argument model, since under multi-GPU it isn't the model but a wrapper around it.

  3. remove the model argument completely and document to always use self.model. Currently in seq2seq_trainer.py, once we switch to self.model, prediction_step() no longer needs model as an argument (but is that always the case?)

  4. pass self.model as the model argument, and make the wrapped model available via self.wrapped_model for users who need it.
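Option 1 above can be sketched as follows. To keep it runnable without torch, `Wrapper` is a hypothetical stand-in for torch.nn.DataParallel; the real monkeypatch would install the same catch-all __getattr__ on torch.nn.DataParallel itself.

```python
# Hypothetical stand-ins; the real patch targets torch.nn.DataParallel.
class Model:
    def generate(self):
        return "generated"

class Wrapper:
    """Stand-in for torch.nn.DataParallel: keeps the model as .module."""
    def __init__(self, module):
        self.module = module

def _delegating_getattr(self, name):
    # __getattr__ is only invoked when normal lookup fails, so this
    # catch-all remaps failed lookups on the wrapper to the wrapped model.
    return getattr(self.module, name)

# The monkeypatch: after this, the wrapper transparently exposes the
# wrapped model's methods.
Wrapper.__getattr__ = _delegating_getattr

wrapped = Wrapper(Model())
print(wrapped.generate())  # delegated to the inner model
```

Note that `self.module` inside `__getattr__` does not recurse: `module` lives in the instance `__dict__`, so normal lookup finds it first.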

Summary of discussion around proposed solutions:

  1. too magical

  2. proposed calling it wrapped_model, but that is just as confusing, since most of the time it isn't wrapped.

  3. need to check whether the wrapped model is ever needed inside user functions.

  4. was not discussed yet
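Whichever option wins, subclasses today can sidestep the ambiguity by unwrapping before calling model-specific methods. A minimal sketch of that workaround (and of the model/wrapped-model split in option 4), again using a hypothetical `DataParallelStandIn` in place of torch.nn.DataParallel so it runs without torch; with torch available the isinstance check would target torch.nn.DataParallel:

```python
class Model:
    def generate(self):
        return "ok"

class DataParallelStandIn:
    """Hypothetical stand-in for torch.nn.DataParallel."""
    def __init__(self, module):
        self.module = module

def unwrap(model):
    # Return the underlying model whether or not it arrived wrapped.
    return model.module if isinstance(model, DataParallelStandIn) else model

plain = Model()
wrapped = DataParallelStandIn(plain)

# Both call sites now behave identically, matching the n_gpu < 2 case.
print(unwrap(plain).generate())
print(unwrap(wrapped).generate())
```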

@sgugger, @LysandreJik
