
Conversation

@wangxicoding
Contributor

@wangxicoding wangxicoding commented Sep 15, 2021

PR types

Performance optimization

PR changes

Others

Describe

Remove the scale op that insert_scale_loss_grad_ops inserts; instead, directly modify the value attribute of the loss_grad_op (i.e., the fill_constant op). This reduces the number of inserted scale ops and should, in theory, give a tiny performance improvement.

  • develop loss scale
    (screenshot)
  • PR loss scale: 0.0078125 = 0.5 * 0.015625
    (screenshot)
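The idea can be illustrated with a small pure-Python sketch. The class and function names below are illustrative stand-ins, not the actual Paddle pass: rather than appending a separate scale op after the loss-grad fill_constant, the pass folds the scale factor directly into that op's value attribute.

```python
class MockOp:
    """Stand-in for a framework OpDesc; names are illustrative only."""
    def __init__(self, op_type, attrs):
        self.type = op_type
        self.attrs = attrs

def fold_loss_scale(block_ops, scale):
    """Fold the loss scale into the loss-grad fill_constant in place,
    instead of inserting a separate scale op (the approach of this PR)."""
    for op in block_ops:
        if op.type == "fill_constant":
            op.attrs["value"] = float(op.attrs["value"]) * scale
            return op
    raise ValueError("loss_grad_op (fill_constant) not found")

ops = [MockOp("fill_constant", {"value": 0.015625})]
folded = fold_loss_scale(ops, 0.5)
print(folded.attrs["value"])  # 0.0078125, matching the numbers above
```

Both factors here are powers of two, so the product is exact in binary floating point.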

Test

Ernie 3.0, base model

  • Speed: basically unchanged, a tiny improvement
    develop (tokens/s)  PR (tokens/s)
    44924               44951
  • Accuracy: aligned
    (screenshot)

Follow-up optimization TODO

In pipeline parallelism, loss_grad_op could be moved into the LR schedule so that it runs once per step rather than in every micro-step. Or, more aggressively, it could be made persistable and initialized in startup_program, so it runs only once per training run.
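A toy model of this TODO (no Paddle APIs; the function names are hypothetical) just counts how often the loss-grad fill_constant would execute under each scheme:

```python
def fills_per_step_current(num_micro_steps):
    # Today: fill_constant runs inside every micro-step of a pipeline step.
    return sum(1 for _ in range(num_micro_steps))

def fills_per_step_hoisted(num_micro_steps):
    # Proposed: hoist it next to the LR schedule, so once per step.
    return 1

def fills_per_run_persistable(num_steps, num_micro_steps):
    # More aggressive: persistable, initialized once in startup_program.
    return 1

print(fills_per_step_current(8))           # 8
print(fills_per_step_hoisted(8))           # 1
print(fills_per_run_persistable(1000, 8))  # 1
```

With 8 micro-steps per pipeline step, the hoisted variant saves 7 fill executions per step; the persistable variant saves all but one per run.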

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.


@sandyhouse sandyhouse left a comment


LGTM for the code modification, but I think after this change the program description may be confusing for others.

Contributor

@JZ-LIANG JZ-LIANG left a comment


LGTM. Though this change is equivalent to the original, it breaks the strong assumption in the framework that gradient back-propagation starts from a constant ONE.

@JZ-LIANG
Contributor

LGTM. Though this change is equivalent to the original, it changes the strong assumption in the framework that gradient back-propagation starts from a constant ONE.

Might need a comment to notify later maintainers that the starting point of the gradient backward pass will change according to the DataParallel and ShardingParallel degree.

@wangxicoding
Contributor Author

> LGTM. Though this change is equivalent to the original, it changes the strong assumption in the framework that gradient back-propagation starts from a constant ONE.
>
> Might need a comment to notify later maintainers that the starting point of the gradient backward pass will change according to the DataParallel and ShardingParallel degree.

OK, will add in the next PR.

@wangxicoding wangxicoding merged commit 02b0be0 into PaddlePaddle:develop Sep 16, 2021
"loss_grad_op must be fill_constant op, " \
"but this op is {}".format(op.type)
assert op.has_attr('value')
loss_scale = float(op.attr('value'))
Contributor


Mind the potential precision loss here: the fill_constant op casts the value to fp32 and then saves it as a string in its OpDesc. Reloading and resetting this value may cause precision loss when the denominator is odd (3, 7, 11, etc.).
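This concern can be demonstrated directly: round-tripping a scale through float32 (here simulated with the standard struct module) is exact for power-of-two loss scales but lossy for odd denominators such as 1/3.

```python
import struct

def roundtrip_fp32(x):
    """Cast a Python float (fp64) to fp32 and back, as storing the
    value in an fp32 attribute field would."""
    return struct.unpack("f", struct.pack("f", x))[0]

print(roundtrip_fp32(1 / 64) == 1 / 64)  # True: 1/64 is exact in fp32
print(roundtrip_fp32(1 / 3) == 1 / 3)    # False: 1/3 is rounded in fp32
```

Since data-parallel and sharding degrees are usually powers of two, the common scales (1/2, 1/64, ...) survive the round trip exactly.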

Contributor Author


(screenshots)

loss_grad_op (that is, fill_constant) uses value rather than str_value, and the AttrType of value is float, so it is saved as float32 in protobuf. If we encounter a precision problem, it must be caused by the float AttrType; a double AttrType would be better, but the framework does not provide one.

Contributor Author


(screenshot)

The type of op.attr('value') is already float64 in Python; the float(op.attr('value')) cast is only for explicitness.

@wangxicoding wangxicoding deleted the hybrid_remove_loss_scale_op branch September 16, 2021 03:09
AnnaTrainingG pushed a commit to AnnaTrainingG/Paddle that referenced this pull request Sep 29, 2021