[hybrid] remove scale op in insert_scale_loss_grad_ops #35775
Conversation
Thanks for your contribution!
sandyhouse
left a comment
LGTM for the code modification, but I think after this change the program description may be confusing for others.
LGTM. Though this change is equivalent to the original, it breaks the strong assumption in the framework that gradient back-propagation starts from a constant ONE.
We might need a comment to notify later maintainers that the starting point of the gradient backward pass changes according to the DataParallel and ShardingParallel degree.
OK, I'll add it in the next PR.
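A minimal, self-contained sketch of the behavior this thread describes (the `Op` class, `rescale_loss_grad` helper, and degree parameters are all hypothetical stand-ins, not Paddle's actual Program/OpDesc API): instead of appending a separate scale op after the backward pass, the `value` attribute of the existing `fill_constant` that starts back-propagation is rewritten to `1.0 / (dp_degree * sharding_degree)`.

```python
# Hypothetical sketch only; names do not match Paddle's real API.
class Op:
    def __init__(self, type_, attrs):
        self.type = type_
        self._attrs = attrs

    def has_attr(self, name):
        return name in self._attrs

    def attr(self, name):
        return self._attrs[name]

    def _set_attr(self, name, value):
        self._attrs[name] = value


def rescale_loss_grad(ops, dp_degree, sharding_degree):
    """Fold the data-parallel scaling into the backward start constant.

    After this, back-propagation no longer starts from a constant 1.0 but
    from 1.0 / (dp_degree * sharding_degree), so no extra scale op is needed.
    """
    for op in ops:
        if op.type == "fill_constant" and op.has_attr("value"):
            assert float(op.attr("value")) == 1.0, \
                "expected the canonical backward start constant"
            op._set_attr("value", 1.0 / (dp_degree * sharding_degree))
            return op
    raise ValueError("loss_grad_op (fill_constant) not found")


ops = [Op("fill_constant", {"value": 1.0}), Op("sum", {})]
patched = rescale_loss_grad(ops, dp_degree=2, sharding_degree=4)
print(patched.attr("value"))  # 0.125
```

This is why a maintainer comment matters: anyone reading the program later will see a backward pass starting from 0.125 (or similar) rather than the conventional 1.0.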
    "loss_grad_op must be fill_constant op, " \
    "but this op is {}".format(op.type)
assert op.has_attr('value')
loss_scale = float(op.attr('value'))
Mind the potential precision loss here. The fill_constant op casts the value to fp32 and then saves it as a string in its OpDesc; reloading and resetting this value may cause precision loss when the denominator is odd (3, 7, 11, etc.).
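The precision concern above can be demonstrated with a plain fp32 round-trip (a sketch using Python's standard `struct` module to emulate the single-precision cast; this does not invoke Paddle itself): scales with power-of-two denominators survive the round-trip exactly, while odd denominators do not.

```python
import struct

def to_fp32(x):
    # Round-trip a Python float (double) through IEEE-754 single precision,
    # mimicking how fill_constant would store its scalar value attribute.
    return struct.unpack("f", struct.pack("f", x))[0]

# Power-of-two denominators are exact in fp32; odd ones lose bits.
for denom in (2, 3, 4, 7, 8, 11):
    exact = 1.0 / denom
    stored = to_fp32(exact)
    print(denom, stored == exact)
```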
PR types
Performance optimization
PR changes
Others
Describe
Remove the scale op inserted by insert_scale_loss_grad_ops; instead, directly modify the value attribute of loss_grad_op (the fill_constant op). This reduces the number of inserted scale ops and should, in theory, yield a small performance improvement.
Test
Ernie 3.0, base model
Follow-up optimization TODO
In the pipeline, loss_grad_op could be moved into LRSchedule so that it runs only once per step rather than at every micro-step. Or, more aggressively, the variable could be made persistable and initialized in the startup_program, so it runs only once per training job.
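The persistable idea in the TODO above can be illustrated with a small Python sketch (the `PersistentLossGrad` class is hypothetical, standing in for a persistable variable filled once by a startup_program): the constant is filled on first use and then reused across all micro-steps, so the fill runs once per job instead of once per micro-step.

```python
# Hypothetical illustration only; not Paddle's persistable-variable API.
class PersistentLossGrad:
    """Initialize the loss-gradient constant once; reuse across micro-steps."""

    def __init__(self, value):
        self._value = value
        self._buf = None
        self.fill_count = 0  # how many times the fill actually ran

    def get(self):
        if self._buf is None:  # startup_program-style one-time initialization
            self._buf = self._value
            self.fill_count += 1
        return self._buf


lg = PersistentLossGrad(1.0 / 8)
for _ in range(16):  # e.g. 16 micro-steps within one step
    _ = lg.get()
print(lg.fill_count)  # 1
```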