[AMP] Support pure fp16 training mode for dygraph #35521
Merged
zhiqiu merged 31 commits into PaddlePaddle:develop on Sep 17, 2021
Conversation
Thanks for your contribution!
GuoxiaWang reviewed on Sep 14, 2021
  {"box_coder", {"PriorBox", "PriorBoxVar", "TargetBox"}},
- {"momentum", {"Param", "Grad", "Velocity", "LearningRate"}},
+ {"momentum", {"Param", "Grad", "Velocity", "LearningRate", "MasterParam"}},
  {"sparse_momentum", {"Param", "Grad", "Velocity", "Index", "LearningRate"}},
Contributor:
In your next commit, please also add MasterParam to sparse_momentum, thanks.
Contributor (Author):
So far I have not found any optimizer in the framework that uses sparse_momentum, nor any place where dygraph calls sparse_momentum, so it is not added in this pure fp16 PR for now.
zhiqiu reviewed on Sep 16, 2021
Comment on lines +295 to +296:
tracer._enable_amp_l1 = original_enable_amp_l1
tracer._enable_amp_l2 = original_enable_amp_l2
Contributor:
Change tracer.enable_amp to tracer.amp_level.
lanxianghit approved these changes on Sep 17, 2021
raindrops2sea approved these changes on Sep 17, 2021
AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this pull request on Sep 29, 2021
* add pure fp16 major function in auto_cast & tracer
* support master weight in dygraph for pure fp16
* check mix dtype of fp16&fp32 for check_finite_and_unscale op
* change pure fp16 function name
* refine some bug in auto_cast
* refine auto_cast interface logic
* add param _casted_by_pure_fp16 for class Layer
* support state_dict hook for save model by user appointed dtype in pure_fp16_decorator
* refine pure_fp16_decorator as decorator
* add unittest
* add comment
* add comment
* support recompute
* add comment for auto_cast and decorator
* support to_static_state_dict for paddle.jit.save
* unlimite models num and optimizers num
* add lookup_table in black_list
* fix momentum and layer state_dict
* fix bug in layer state_dict
* fix bug in layer state_dict_helper
* refine unittest
* refine test_momentun_op
* refine interface and some code
* refine amp_decorator interface
* refine pure fp16 interface
* refine master weight interface
PR types: New features
PR changes: Others
Describe:
1. Background:
The Paddle static graph training mode already provides a pure fp16 training mode, but dynamic graph lacks it. pure fp16 PR: Support pure fp16 training for AMP API. #29544
It was verified on the GPT-2 117M model that there is a large gap in training speed between Paddle AMP and Megatron (dygraph mode); Megatron uses fp16 training, all model parameters are fp16, and all ops execute in fp16.
To sum up, we developed the dynamic graph pure fp16 training mode.
2. Pure fp16 API (dygraph mode):
2.1. Rewrite network parameters from fp32 to fp16 by decorate:
In amp training mode, the black & white lists control the fp16 computation, so many cast OPs are inserted. In pure fp16 training mode, all OPs are executed in fp16, unless an OP does not support fp16.
Therefore, under dygraph training, it is necessary to rewrite all the network parameters to fp16, so that no cast OP needs to be inserted for data conversion during execution.
2.2. Optimizer updates the fp16 parameters of the network via decorate:
Rewriting the network parameters from fp32 to fp16 in 2.1 is not an inplace operation, so the optimizer needs to be handed the fp16 network parameters to update.
2.3. level parameter in decorate and auto_cast:
The level parameter unifies the training interface of AMP and pure fp16. level accepts the values O1 and O2:
O1 represents amp: the input data type of each operator is cast according to white_list and black_list;
O2 represents pure fp16: all OP parameters and input data are cast to fp16, except OPs in black_list, OPs without an fp16 kernel, and batchnorm parameters.
In decorate, the default value of level is O1. In amp training mode, decorate does nothing, so you do not need to call this API; but in pure fp16, you must call decorate explicitly.
In auto_cast, the default value of level is O1, to stay compatible with Paddle's original amp training mode.
2.4. master_weight in decorate:
Until now, Momentum, Adam and AdamW support float16 computation. All three of them have a multi_precision parameter, which can avoid poor accuracy or slow convergence to some degree.
In decorate, the default master_weight is None. If master_weight is None or True, multi_precision will be set to True in pure fp16; users who do not want this strategy should set it to False.
save_dtypeindecorate:In pure fp16, the model parameters will rewrite from fp32 to fp16. For Inference, it usually need to save the model of fp32 data type, so we provide an interface
save_dtype, whensave_dtypeis not None, we will register a data type conversion function hook forLayer.state_dict(). So that, all parameters instate_dictwill cast tosave_dtype. Finally, the data type of the model parameters saved throughpaddle.saveandpaddle.jit.savewill besave_dtype.save_dtypesupport fp16、fp32、fp64 or None. Ifsave_dtypeis None, we will not register data type conversion function hook.2.6.
2.6. custom_white_list and custom_black_list:
In amp, the input data type of each operator is cast according to white_list and black_list.
In pure fp16, only the black_list is effective.
3. Use example:
4. Performance testing:
4.1. Test GPT2 performance by pure fp16:
4.2. Result:
5. Documentation preview:
Chinese documentation preview:
overview: http://10.136.157.23:8090/documentation/docs/zh/api/paddle/amp/Overview_cn.html?reviewVersion=jenkins-doc-review-py3.8-252
auto_cast: http://10.136.157.23:8090/documentation/docs/zh/api/paddle/amp/auto_cast_cn.html?reviewVersion=jenkins-doc-review-py3.8-252
decorate: http://10.136.157.23:8090/documentation/docs/zh/api/paddle/amp/decorate_cn.html?reviewVersion=jenkins-doc-review-py3.8-252
English documentation preview:
auto_cast: http://10.136.157.23:8090/documentation/docs/en/api/paddle/amp/auto_cast_en.html?reviewVersion=jenkins-doc-review-py3.8-252
decorate: http://10.136.157.23:8090/documentation/docs/en/api/paddle/amp/decorate_en.html?reviewVersion=jenkins-doc-review-py3.8-252