@wangxicoding (Contributor) commented Aug 3, 2021

PR types

Others

PR changes

Others

Describe

Optimize ClipGradByGlobalNorm memory usage and performance.
Major changes (a sketch follows the list):
1. Replace reduce_sum(square(x)) with squared_l2_norm(x).
2. Scale the gradients with an in-place elementwise_mul.
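
To make the two changes concrete, below is a minimal dygraph-style sketch of global-norm clipping. It is illustrative only, not the code in this PR: `_squared_l2_norm` here simply falls back to `reduce_sum(square(x))` as a stand-in for the fused op (its exact Python entry point is not shown), and the in-place multiply API name is an assumption.

```python
import paddle

def _squared_l2_norm(x):
    # Stand-in for the fused squared_l2_norm op this PR switches to. The fused
    # kernel reduces x to a scalar directly, without materializing a full-size
    # square(x) temporary; the fallback below is the old develop behaviour.
    return paddle.sum(paddle.square(x))

def clip_grads_by_global_norm(params_grads, clip_norm=1.0):
    # 1) One scalar per gradient, then a single sqrt for the global norm.
    sum_square = [_squared_l2_norm(g) for _, g in params_grads if g is not None]
    global_norm = paddle.sqrt(paddle.add_n(sum_square))

    # 2) Scale factor clip_norm / max(global_norm, clip_norm), applied to every
    #    gradient in place so no second copy of the gradients is allocated.
    clip_var = paddle.to_tensor(clip_norm, dtype=global_norm.dtype)
    scale = clip_var / paddle.maximum(global_norm, clip_var)
    for _, g in params_grads:
        if g is not None:
            g.multiply_(scale)  # in-place elementwise_mul; exact API is an assumption
    return params_grads
```

The memory saving in the results below comes from the fused norm kernel not materializing a full-size squared copy of each gradient, and from reusing each gradient's buffer when scaling instead of writing the scaled result to a new tensor.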

Test

Single card: gpt3-1.3B-en with recompute and AMP, seq_len=1024, batch_size=2.
16-card hybrid: gpt3-13B-en with mp=4, pp=4, recompute and AMP, seq_len=1024, gbs=256, micro_batch_size=2; memory is recorded on cards 0 and 15.
Ernie3.0: hidden_size=4096, num_attention_heads=128, num_hidden_layers=76, num_sharing_layers=64, mp=8, pp=2, AMP, recompute, gbs=32, micro_bs=2.

| | Memory develop (MB) | Memory PR (MB) | Memory saved (MB) | Speed develop (tokens/s) | Speed PR (tokens/s) | Speed change |
|---|---|---|---|---|---|---|
| GPT (PE, 1 card) | 25786 | 25392 | 394 | 3505 | 3605 | 2.85% |
| GPT (Executor, 1 card) | 26114 | 25720 | 394 | 3517 | 3611 | 2.67% |
| GPT (Hybrid, 16 cards) | (19592, 19530) | (19068, 17784) | (524, 1746) | 5854 | 5859 | 0.08% |
| Ernie3.0 (16 cards) | (18724, 25004) | (17276, 23340) | (1448, 1664) | 2445 | 2362 | -3.5% |

For 16-card runs, memory is reported as (card 0, card 15).

Ernie3.0 is slower because squared_l2_norm is currently slower than reduce_sum(square); the kernel still needs to be optimized.


paddle-bot-old bot commented Aug 3, 2021

Thanks for your contribution!
Please wait for the CI result first. See the Paddle CI Manual for details.

@wangxicoding changed the title from "optimize global gradient clip" to "optimize ClipGradByGlobalNorm" Aug 4, 2021
@JZ-LIANG (Contributor) left a comment


LGTM for sharding

@wangxicoding merged commit 4d6f8f2 into PaddlePaddle:develop Aug 5, 2021
@wangxicoding deleted the optimize_global_gradient_clip branch August 5, 2021 05:12