[hybrid parallel] Optimize pipeline memory #34214
Closed
PR types
Function optimization
PR changes
Others
Describe
Optimize GPU memory usage in pipeline parallelism.
Before this change, memory usage grew with the global batch size; after it, memory usage stays constant regardless of the global batch size (gbs).
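The effect described above can be sketched with a toy buffer-lifetime model. This is an illustrative simulation only, not Paddle code: `peak_buffers` and its flag are hypothetical names, and the assumption is that before this PR each micro-batch's send/activation buffer stayed live until the whole global batch finished, while after it each buffer is freed once that micro-batch's backward completes.

```python
# Hypothetical toy model of send/activation buffer lifetimes on one
# pipeline stage; names are illustrative, not Paddle APIs.

def peak_buffers(num_micro_batches, free_after_backward):
    """Count the peak number of simultaneously live buffers."""
    live = 0
    peak = 0
    for _ in range(num_micro_batches):
        live += 1                 # forward allocates one buffer
        peak = max(peak, live)
        if free_after_backward:
            live -= 1             # this PR: freed right after backward
    # Without the fix, every buffer stays live until the global batch ends,
    # so peak memory scales with the number of micro-batches (i.e. gbs).
    return peak

print(peak_buffers(8, False))   # 8: grows with global batch size
print(peak_buffers(8, True))    # 1: constant
print(peak_buffers(64, True))   # 1: still constant
```

Under this assumption, peak memory after the PR is bounded by the number of micro-batches in flight rather than the global batch size, which matches the "can in theory grow without bound" claim below.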
Test setup
V100 32GB, single machine with 8 GPUs.
gpt2-medium-en model (345M parameters), pipeline_stage=8, micro_batch=4.
Test results
With this PR, memory usage does not grow with the global batch size, so in theory the global batch size can grow without bound.
As for why card 0 uses 22MB more memory than develop: PR #34086 released the send buffer right after the recv in Backward, whereas this PR releases it only after the whole Backward pass, so the buffer's lifetime is slightly longer. This could also be optimized by moving the release back to after Backward's recv, but in static graph mode that would require substantial changes to the pipeline executor (the dynamic graph implementation is simple). Given the engineering complexity, that optimization is deferred for now.
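The lifetime difference behind the extra 22MB can be sketched as a toy timeline. This is a conceptual illustration, not the actual executor schedule: the step names and the `send_buffer_lifetime` helper are hypothetical, and the assumption is that PR #34086 frees the send buffer at Backward's recv while this PR frees it at the end of Backward.

```python
# Toy timeline comparing the two release points; step labels are
# illustrative, not actual Paddle pipeline executor operations.

STEPS = ["fwd_send", "bwd_recv", "bwd_compute", "bwd_send"]

def send_buffer_lifetime(release_point):
    """Number of schedule steps the forward send buffer stays live."""
    alive_from = STEPS.index("fwd_send")
    if release_point == "after_recv":        # PR #34086 scheme
        alive_to = STEPS.index("bwd_recv")
    else:                                    # "after_backward": this PR
        alive_to = STEPS.index("bwd_send")
    return alive_to - alive_from

print(send_buffer_lifetime("after_recv"))      # 1 step
print(send_buffer_lifetime("after_backward"))  # 3 steps
```

The buffer is live for a few extra steps under this PR's scheme, which is why card 0 holds slightly more memory; peak memory still no longer scales with the global batch size.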