[hybrid parallel] Optimize pipeline memory#34230
Merged
wangxicoding merged 6 commits into PaddlePaddle:develop on Jul 20, 2021
Conversation
Thanks for your contribution!
sneaxiy
reviewed
Jul 19, 2021
Collaborator
Do we have another way to distinguish whether the input of these ops (send_v2 or partial_send) is the variable to send? Setting a variable name as an attribute is discouraged; I would prefer adding a bool attribute to indicate this case.
Contributor
Author
Done. Removed the pipeline_send_var attr; the backward send var is now found directly.
PR types
Function optimization
PR changes
Others
Describe
Optimize pipeline-parallel GPU memory usage.
Before this optimization, memory usage grew with the global batch size; after it, memory usage stays constant regardless of the global batch size (gbs).
Original PR: #34214. That PR had a bug: when the backward pass uses a forward send var (e.g. with recompute), early garbage collection could release the var, so the backward pass would read a null tensor.
To fix this, this PR uses the nop_op from the original #34086 to hold the forward send var, so that GC releases it promptly once it is truly no longer used. A topological-dependency check could alternatively be added on the C++ side, but that is more complex to implement.
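The nop_op trick can be illustrated with a minimal, plain-Python simulation of reference-count GC (this is not Paddle's actual GC; the op list and variable names are hypothetical). Listing the send var as an input of a trailing nop op moves its GC free point past the backward use, which is exactly the null-tensor hazard the PR fixes:

```python
def free_points(ops):
    """Map each variable to the index of its last reader op, mimicking a
    reference-count GC that frees a tensor right after that op runs."""
    freed = {}
    for idx, (_op_type, inputs) in enumerate(ops):
        for v in inputs:
            freed[v] = idx  # later readers overwrite earlier ones
    return freed

# Hypothetical per-stage program: the backward pass (recompute) reuses
# the forward send var. The nop op holds "fwd_send" as an input, so GC
# keeps the tensor alive until the nop executes, then frees it at once.
pipeline_block = [
    ("send_v2", ["fwd_send"]),
    ("recompute_backward", ["fwd_send"]),
    ("nop", ["fwd_send"]),
]
print(free_points(pipeline_block)["fwd_send"])  # freed after op 2, the nop
```

Without the nop op, an eager release right after send_v2 would leave the recompute_backward op reading a freed tensor.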
Final memory-management scheme for the Pipeline send variables:
See PR [hybrid performance] Optimize pipeline send wait #34086 for details.
The topological dependency is shown in the figure below: once the current stage's forward recv for an FB (1F1B step) completes, the backward sends of the two preceding FBs must also have completed, so their backward-send variables can be released at that point. (TODO: for coding convenience, this PR releases them after Forward finishes; this could be optimized to release right after the Forward recv.)
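The release rule described above can be sketched as a tiny schedule generator (plain Python; the micro-batch count and the two-step gap follow the description, but the function and names are illustrative, not Paddle code):

```python
def release_schedule(num_micro_batches):
    """For each 1F1B step i, list the backward-send buffers that may be
    freed when the forward recv of step i completes: by the dependency
    described in the PR, the backward send of step i-2 is finished."""
    events = []
    for i in range(num_micro_batches):
        freed = [i - 2] if i >= 2 else []  # nothing to free for i < 2
        events.append((i, freed))
    return events

for step, freed in release_schedule(5):
    print(f"forward recv {step}: free backward send buffers {freed}")
```

Running this for 5 micro batches shows steps 0 and 1 freeing nothing and each later step freeing exactly one buffer, which is why the buffer footprint stays bounded as the global batch size grows.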
Test
V100 32GB, single machine with 8 GPUs.
gpt2-medium-en model (345M), pipeline_stage=8, micro_batch=4.
Test results
This PR greatly reduces GPU memory usage: memory no longer grows with the global batch size, which can therefore in theory grow without bound.