[hybrid parallel] Optimize pipeline memory#34230
Merged
wangxicoding merged 6 commits into PaddlePaddle:develop on Jul 20, 2021
Conversation
Thanks for your contribution!
sneaxiy
reviewed
Jul 19, 2021
Collaborator
Do we have another way to distinguish whether the input of these ops (send_v2 or partial_send) is the variable to send? Setting a variable name as an attribute is discouraged; I would prefer adding a bool attribute to indicate this case.
Contributor
Author
Done. Removed the pipeline_send_var attr; the backward send var is now found directly.
PR types
Function optimization
PR changes
Others
Describe
Optimize pipeline-parallel GPU memory usage.
Before this optimization, memory usage grew with the global batch size; after it, memory usage stays constant regardless of the global batch size (gbs).
Original PR: #34214. That PR had a bug: when the backward pass uses a forward send var (e.g. with recompute), early garbage collection could release the var, so the backward pass would read a null tensor.
To fix this, this PR uses the nop_op from the original #34086 to hold the forward send var, so that GC releases it promptly once it is truly no longer used. A topological-dependency check could alternatively be added on the C++ side, but that is more complex to implement.
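The nop_op trick can be illustrated with a minimal, plain-Python simulation of reference-count GC (this is not Paddle's actual GC; the op list and variable names are hypothetical). Listing the send var as an input of a trailing nop op moves its GC free point past the backward use, which is exactly the null-tensor hazard the PR fixes:

```python
def free_points(ops):
    """Map each variable to the index of its last reader op, mimicking a
    reference-count GC that frees a tensor right after that op runs."""
    freed = {}
    for idx, (_op_type, inputs) in enumerate(ops):
        for v in inputs:
            freed[v] = idx  # later readers overwrite earlier ones
    return freed

# Hypothetical per-stage program: the backward pass (recompute) reuses
# the forward send var. The nop op holds "fwd_send" as an input, so GC
# keeps the tensor alive until the nop executes, then frees it at once.
pipeline_block = [
    ("send_v2", ["fwd_send"]),
    ("recompute_backward", ["fwd_send"]),
    ("nop", ["fwd_send"]),
]
print(free_points(pipeline_block)["fwd_send"])  # freed after op 2, the nop
```

Without the nop op, an eager release right after send_v2 would leave the recompute_backward op reading a freed tensor.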
Final memory-management scheme for the Pipeline send variables:
See PR [hybrid performance] Optimize pipeline send wait #34086 for details.
The topological dependency is shown in the figure below: once the current stage's forward recv for an FB (1F1B step) completes, the backward sends of the two preceding FBs must also have completed, so their backward-send variables can be released at that point. (TODO: for coding convenience, this PR releases them after Forward finishes; this could be optimized to release right after the Forward recv.)
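The release rule described above can be sketched as a tiny schedule generator (plain Python; the micro-batch count and the two-step gap follow the description, but the function and names are illustrative, not Paddle code):

```python
def release_schedule(num_micro_batches):
    """For each 1F1B step i, list the backward-send buffers that may be
    freed when the forward recv of step i completes: by the dependency
    described in the PR, the backward send of step i-2 is finished."""
    events = []
    for i in range(num_micro_batches):
        freed = [i - 2] if i >= 2 else []  # nothing to free for i < 2
        events.append((i, freed))
    return events

for step, freed in release_schedule(5):
    print(f"forward recv {step}: free backward send buffers {freed}")
```

Running this for 5 micro batches shows steps 0 and 1 freeing nothing and each later step freeing exactly one buffer, which is why the buffer footprint stays bounded as the global batch size grows.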
Test
V100 32GB, single machine with 8 GPUs.
gpt2-medium-en model (345M), pipeline_stage=8, micro_batch=4.
Test results
This PR greatly reduces GPU memory usage: memory no longer grows with the global batch size, which can therefore in theory grow without bound.