[hybrid parallel] Optimize pipeline memory #34214
Closed
PR types
Function optimization
PR changes
Others
Describe
Optimize GPU memory usage in pipeline parallelism.
Before this change, memory usage grew with the global batch size; after it, memory usage stays constant regardless of the global batch size (gbs).
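The effect described above can be sketched with a toy buffer-lifetime model. This is an illustrative simulation only, not Paddle code: `peak_buffers` and its flag are hypothetical names, and the assumption is that before this PR each micro-batch's send/activation buffer stayed live until the whole global batch finished, while after it each buffer is freed once that micro-batch's backward completes.

```python
# Hypothetical toy model of send/activation buffer lifetimes on one
# pipeline stage; names are illustrative, not Paddle APIs.

def peak_buffers(num_micro_batches, free_after_backward):
    """Count the peak number of simultaneously live buffers."""
    live = 0
    peak = 0
    for _ in range(num_micro_batches):
        live += 1                 # forward allocates one buffer
        peak = max(peak, live)
        if free_after_backward:
            live -= 1             # this PR: freed right after backward
    # Without the fix, every buffer stays live until the global batch ends,
    # so peak memory scales with the number of micro-batches (i.e. gbs).
    return peak

print(peak_buffers(8, False))   # 8: grows with global batch size
print(peak_buffers(8, True))    # 1: constant
print(peak_buffers(64, True))   # 1: still constant
```

Under this assumption, peak memory after the PR is bounded by the number of micro-batches in flight rather than the global batch size, which matches the "can in theory grow without bound" claim below.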
Test setup
V100 32GB, single machine with 8 GPUs.
gpt2-medium-en model (345M parameters), pipeline_stage=8, micro_batch=4.
Test results
With this PR, memory usage does not grow with the global batch size, so in theory the global batch size can grow without bound.
As for why card 0 uses 22MB more memory than develop: PR #34086 released the send buffer right after the recv in Backward, whereas this PR releases it only after the whole Backward pass, so the buffer's lifetime is slightly longer. This could also be optimized by moving the release back to after Backward's recv, but in static graph mode that would require substantial changes to the pipeline executor (the dynamic graph implementation is simple). Given the engineering complexity, that optimization is deferred for now.
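The lifetime difference behind the extra 22MB can be sketched as a toy timeline. This is a conceptual illustration, not the actual executor schedule: the step names and the `send_buffer_lifetime` helper are hypothetical, and the assumption is that PR #34086 frees the send buffer at Backward's recv while this PR frees it at the end of Backward.

```python
# Toy timeline comparing the two release points; step labels are
# illustrative, not actual Paddle pipeline executor operations.

STEPS = ["fwd_send", "bwd_recv", "bwd_compute", "bwd_send"]

def send_buffer_lifetime(release_point):
    """Number of schedule steps the forward send buffer stays live."""
    alive_from = STEPS.index("fwd_send")
    if release_point == "after_recv":        # PR #34086 scheme
        alive_to = STEPS.index("bwd_recv")
    else:                                    # "after_backward": this PR
        alive_to = STEPS.index("bwd_send")
    return alive_to - alive_from

print(send_buffer_lifetime("after_recv"))      # 1 step
print(send_buffer_lifetime("after_backward"))  # 3 steps
```

The buffer is live for a few extra steps under this PR's scheme, which is why card 0 holds slightly more memory; peak memory still no longer scales with the global batch size.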