[hybrid performance] Grad fuse for gradient merge under pipeline mode #35004
Merged
wangxicoding merged 32 commits into PaddlePaddle:develop on Aug 20, 2021
Conversation
Thanks for your contribution!
python/paddle/fluid/tests/unittests/test_fleet_sharding_meta_optimizer.py
Contributor wangxicoding left a comment:
Also run one more test with the combination optimize_cast + fp16_allreduce + fuse_grad_merge.
Contributor
Author
Running it now~
Contributor
Author
There is still room for optimization: grouping grads and params by dtype before fusing would reduce both the number of coalesce ops and the number of fused vars. This can be done in a follow-up PR.
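The dtype-grouping idea mentioned above can be sketched roughly as follows. This is an illustrative Python example, not Paddle's actual implementation; the function name `fuse_by_dtype` and the use of NumPy buffers are assumptions for demonstration only.

```python
# Illustrative sketch (NOT Paddle's API): group gradient tensors by
# dtype, then pack each group into one flat buffer. This yields one
# coalesce/fused var per dtype instead of one per gradient.
from collections import defaultdict
import numpy as np

def fuse_by_dtype(grads):
    """Group gradient arrays by dtype and flatten each group into a
    single contiguous buffer (one 'fused var' per dtype)."""
    groups = defaultdict(list)
    for g in grads:
        groups[g.dtype].append(g)
    fused = {}
    for dtype, gs in groups.items():
        # one concatenation (standing in for one coalesce op) per dtype
        fused[dtype] = np.concatenate([g.ravel() for g in gs])
    return fused

# Three gradients of two dtypes fuse into just two buffers.
grads = [np.ones(4, np.float32), np.ones(2, np.float16), np.ones(3, np.float32)]
fused = fuse_by_dtype(grads)
```

With mixed fp16/fp32 gradients (as under AMP), this reduces the fused-var count from the number of tensors to the number of distinct dtypes.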
FeixLiu added a commit to FeixLiu/Paddle that referenced this pull request on Aug 31, 2021
wangxicoding pushed a commit that referenced this pull request on Aug 31, 2021
FeixLiu added a commit to FeixLiu/Paddle that referenced this pull request on Sep 2, 2021
…e under pipeline mode (PaddlePaddle#35004) (PaddlePaddle#35299)" This reverts commit e931cd1.
FeixLiu added a commit to FeixLiu/Paddle that referenced this pull request on Sep 3, 2021
FeixLiu added a commit to FeixLiu/Paddle that referenced this pull request on Sep 3, 2021
PaddlePaddle#35116) (PaddlePaddle#35301)" This reverts commit 2931df5.
Revert "[cherry-pick][hybrid performance] optim npu coalesce set constant (PaddlePaddle#35105) (PaddlePaddle#35302)" This reverts commit 12260bd.
Revert "[cherry-pick][hybrid performance] optim the grad fuse for pipeline mode by sorting the grad by dtype (PaddlePaddle#35070) (PaddlePaddle#35300)" This reverts commit e69cc21.
Revert "[cherry-pick][hybrid performance] Grad fuse for gradient merge under pipeline mode (PaddlePaddle#35004) (PaddlePaddle#35299)" This reverts commit e931cd1.
Revert "Add flags to control whether to check Nan value of hccl_allreduce_sum. (PaddlePaddle#35093) (PaddlePaddle#35298)" This reverts commit d4948bc.
Revert "[hybrid] Fix row parallel linear bias (PaddlePaddle#35186) (PaddlePaddle#35297)" This reverts commit b36fb03.
Revert "[hybrid][npu] fix npu clear float status in pipeline (PaddlePaddle#35165) (PaddlePaddle#35295)" This reverts commit 167685e.
Revert "[hybrid npu] fix npu found_finite in hybrid (PaddlePaddle#35134) (PaddlePaddle#35291)" This reverts commit e64105f.
Revert "[cherry-pick][Hybrid Performance] Move the cast op of AMP which cast fp32 param to fp16 param to the optimizer (PaddlePaddle#34965) (PaddlePaddle#35296)" This reverts commit 6fb58ae.
Revert "[cherry-pick] NPU use squared_l2_norm in GradientClipByGlobalNorm (PaddlePaddle#34836) (PaddlePaddle#35289)" This reverts commit 38c27d5.
FeixLiu added a commit to FeixLiu/Paddle that referenced this pull request on Sep 3, 2021
PR types
Performance optimization
PR changes
Others
Describe
Fused gradient merge under pipeline mode
The following tests were run with the Ernie 3.0 model on 8 V100 GPUs, with PP=2, MP=2, and DP=2.
Throughput comparison in tokens/s (increase relative to baseline)
Loss compared between baseline and fp16 allreduce
Loss compared between baseline and grad fuse
Loss compared between baseline and fp16 allreduce with grad fuse
Loss compared between baseline and fp16 allreduce, optimizer cast with grad fuse
NPU Loss diff (By Peng Liu)
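The core idea this PR implements, fusing gradients into one contiguous buffer so that gradient merge accumulates in place and the data-parallel allreduce launches once instead of once per tensor, can be sketched as follows. This is a minimal illustration under assumed names (`FusedGradMerge`, `reduce_fn`); it is not Paddle's actual pipeline-parallel code.

```python
# Minimal sketch of fused gradient merge (illustrative, not Paddle's
# implementation): all gradients share one flat buffer; each micro-batch
# accumulates into it, and a single collective reduces the whole buffer.
import numpy as np

class FusedGradMerge:
    def __init__(self, shapes, dtype=np.float32):
        self.shapes = shapes
        sizes = [int(np.prod(s)) for s in shapes]
        # offsets of each gradient inside the fused buffer
        self.offsets = np.cumsum([0] + sizes)
        self.buffer = np.zeros(self.offsets[-1], dtype)

    def accumulate(self, grads):
        # gradient merge: add this micro-batch's grads into the buffer
        flat = np.concatenate([g.ravel() for g in grads])
        self.buffer += flat

    def allreduce(self, reduce_fn):
        # one collective call over the fused buffer instead of one per grad
        self.buffer = reduce_fn(self.buffer)

    def unpack(self):
        # views of the fused buffer reshaped back to the original shapes
        return [self.buffer[a:b].reshape(s) for a, b, s in
                zip(self.offsets[:-1], self.offsets[1:], self.shapes)]

# Two micro-batches of unit gradients, then a stand-in "allreduce"
# that doubles the buffer (as summing over 2 ranks would).
fm = FusedGradMerge([(2, 3), (4,)])
fm.accumulate([np.ones((2, 3), np.float32), np.ones(4, np.float32)])
fm.accumulate([np.ones((2, 3), np.float32), np.ones(4, np.float32)])
fm.allreduce(lambda buf: buf * 2)
outs = fm.unpack()
```

The performance gain comes from replacing many small collective launches with one large one, which matters most when gradient merge already delays communication to the end of the accumulation window.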