fuse L2Decay and momentum when param.regularizer is set #32845
Conversation
Thanks for your contribution!
Sorry to inform you that e88475d's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
python/paddle/optimizer/momentum.py
Outdated
Could L270 - L297 be written directly as a call to the base class's `_create_regularization_of_grad` function?
done
Xreki
left a comment
LGTM
6b21f7f to
c7fa29e
Compare
zhiqiu
left a comment
LGTM
PR types
Performance optimization

PR changes
Others

Describe
fuse L2Decay and momentum when param.regularizer is set

before
Currently, Paddle supports fusing momentum + L2Decay:
Paddle/python/paddle/optimizer/momentum.py
Lines 108 to 115 in 1ef2327
At `_append_optimize_op` time, the following attributes of the momentum op are set so that both the weight_decay and the momentum computation are performed inside the momentum op, achieving the fusion:
Paddle/python/paddle/optimizer/momentum.py
Lines 209 to 210 in 1ef2327
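The equivalence behind this fusion can be checked with a minimal plain-Python sketch (this is illustrative arithmetic, not Paddle code; the helper names are made up): folding the L2Decay term into the momentum update gives the same result as running a separate scale-and-sum pass on the gradient first.

```python
def momentum_step(param, grad, velocity, lr=0.1, mu=0.9, l2_coeff=0.0):
    """One momentum update; l2_coeff > 0 folds L2Decay into the op (fused)."""
    g = grad + l2_coeff * param   # weight decay applied inside the update
    v = mu * velocity + g         # velocity update
    return param - lr * v, v

def unfused_step(param, grad, velocity, lr=0.1, mu=0.9, l2_coeff=0.01):
    """L2Decay as a separate scale+sum pass, then plain momentum."""
    g = grad + l2_coeff * param   # separate regularization ops
    return momentum_step(param, g, velocity, lr=lr, mu=mu, l2_coeff=0.0)

p, g, v = 1.0, 0.5, 0.0
fused = momentum_step(p, g, v, l2_coeff=0.01)
split = unfused_step(p, g, v, l2_coeff=0.01)
assert fused == split  # same update, but the fused form needs only one op
```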
However, if the model sets a global regularizer=L2Decay through momentum's weight_decay argument, while some layers also set their own specific regularizer via paddle.ParamAttr, the following happens:
1. At `_append_optimize_op` time, the momentum op's attributes are set to enable the fusion.
2. `append_regularization_ops(params_grads, self.regularization)` as well as `self._create_optimization_pass(params_grads)` are executed; `_create_regularization_of_grad` performs the weight_decay and, as the code below shows, executes the param's own regularizer:
Paddle/python/paddle/fluid/regularizer.py
Lines 25 to 40 in 1ef2327
3. At `_append_optimize_op` time, because `self._regularization_method` and `self._regularization_coeff` were set in (1), the momentum op performs weight_decay a second time.

after
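As a sanity check on the bug described above, a minimal plain-Python sketch (illustrative names, not Paddle code) shows that applying L2Decay once in a separate regularization pass and then again inside the momentum op produces a different, wrong update:

```python
def momentum_step(param, grad, velocity, lr=0.1, mu=0.9, l2_coeff=0.0):
    """One momentum update; l2_coeff > 0 folds L2Decay into the op."""
    g = grad + l2_coeff * param
    v = mu * velocity + g
    return param - lr * v, v

coeff = 0.01
p, g, v = 1.0, 0.5, 0.0

# Buggy path: the param's own regularizer already added coeff * param to
# the gradient, yet the momentum op still has its regularization
# attributes set, so the decay is applied a second time.
g_regularized = g + coeff * p
buggy, _ = momentum_step(p, g_regularized, v, l2_coeff=coeff)

# Correct path: the decay is applied exactly once.
correct, _ = momentum_step(p, g, v, l2_coeff=coeff)
assert buggy != correct  # double decay changes the parameter update
```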
Since `append_regularization_ops(params_grads, self.regularization)` iterates over all parameters and applies each parameter's regularization, when momentum is used we need to check, during that iteration, whether a parameter's regularizer is L2Decay; if it is, skip the regularization there, and instead set the momentum op's `regularization_method` attribute at `_append_optimize_op` time. This PR therefore makes the following changes:
1. `append_regularization_ops` and `_create_regularization_of_grad` are deleted from regularizer.py and moved into the optimizer.py file as instance methods of the Optimizer class, so that other optimizers are not affected.
Paddle/python/paddle/fluid/regularizer.py
Lines 25 to 108 in 5fa44c3
2. Momentum overrides the `_create_regularization_of_grad` method; the only difference from the parent class's version is that when a param has L2Decay set, that parameter's regularization is skipped directly. For details, see the changes to the momentum.py file in this PR.

In summary, whenever the regularizer specified on a parameter is L2Decay, that parameter's regularizer replaces the global setting, which avoids performing regularization twice while still achieving the fusion.
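The shape of the override can be sketched as follows (a simplified, hypothetical model of the described change; the class and attribute layout here is illustrative and not the actual Paddle source): the subclass skips per-parameter regularization only for L2Decay, leaving that decay to the fused momentum op.

```python
class L2Decay:
    """Toy L2 regularizer: adds coeff * param to the gradient."""
    def __init__(self, coeff):
        self.coeff = coeff
    def __call__(self, grad, param):
        return grad + self.coeff * param

class Optimizer:
    def _create_regularization_of_grad(self, param, grad):
        # Base behavior: apply the param's own regularizer if it has one.
        reg = param.get("regularizer")
        return reg(grad, param["value"]) if reg else grad

class Momentum(Optimizer):
    def _create_regularization_of_grad(self, param, grad):
        # Only difference from the base class: L2Decay is skipped here,
        # because it will be fused into the momentum op later.
        if isinstance(param.get("regularizer"), L2Decay):
            return grad
        return super()._create_regularization_of_grad(param, grad)

param = {"value": 1.0, "regularizer": L2Decay(0.01)}
assert Momentum()._create_regularization_of_grad(param, 0.5) == 0.5
assert abs(Optimizer()._create_regularization_of_grad(param, 0.5) - 0.51) < 1e-12
```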
performance
We tested with TSM, a model that sets its own regularizer=L2Decay for some parameters. Before the fix, those parameters underwent regularization twice; the profile report shows the resulting repeated scale and sum calls.

The bug may also have affected convergence speed and accuracy. Comparing training logs from before the fix: