
[3D-Parallel:Sharding] Optimizations for supporting ERNIE 3.0 training#31884

Merged
wangxicoding merged 25 commits into PaddlePaddle:develop from JZ-LIANG:sharding-ERNIE160B-updates
Apr 2, 2021

Conversation

@JZ-LIANG
Contributor

@JZ-LIANG JZ-LIANG commented Mar 26, 2021

PR types

New features

PR changes

APIs

Describe

This PR is a major update to sharding and changes the sharding APIs.
It contains all the new features and performance optimizations developed for 100B-parameter ERNIE 3.0 training.

The major updates are as follows:
Performance optimizations:

  1. fixed a bug in the Recompute optimizer: the Recompute-Sharding related broadcast now uses FP16 instead of FP32
  2. sharding allreduce --> reduce: saves 1/2 of the algorithmic bandwidth needed for sharding gradient synchronization
  3. optimized the sharding initialization procedure
  4. support two sharding segment strategies (by broadcast size or by anchors)
  5. removed unnecessary syncs in the sharding logic supporting AMP and ClipByGlobalNorm

New features:

  1. Megatron-Sharding 2D parallelism
  2. Sharding Gradient Merge

Feature optimizations:

  1. added a sync in the startup program to avoid a potential hang when creating multiple NCCL comms
  2. a uniform switch to select among the different parallelism modes

The new APIs:

  • sharding_segment_strategy: choose from "segment_broadcast_MB" and "segment_anchors"
    • a segment is the unit sharding uses to overlap communication and computation.
    • segment_anchors: segment the program by user-defined anchors
    • segment_broadcast_MB: segment the program by broadcast volume (MB)
  • sharding_degree:
    • the number of ways of sharding parallelism
    • turn off sharding parallelism by setting sharding_degree = 1.
  • mp_degree:
    • the number of ways of mp (Megatron) parallelism
    • turn off mp parallelism by setting mp_degree = 1.
  • hybrid_dp:
    • the data parallelism (distinct from sharding) used as the outermost parallelism to scale up training throughput
    • when hybrid_dp = True, the user should ensure global_world_size = N * mp_degree * sharding_degree (N >= 2), where N is the data parallelism degree.
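The degree arithmetic behind hybrid_dp can be sketched in plain Python (a hypothetical helper for illustration only, not part of the Paddle API):

```python
def infer_dp_degree(world_size, mp_degree, sharding_degree):
    """Derive the outer data-parallel degree N from the global world size.

    With hybrid_dp = True, world_size must equal
    N * mp_degree * sharding_degree with N >= 2.
    """
    assert world_size % (mp_degree * sharding_degree) == 0, \
        "world_size must be divisible by mp_degree * sharding_degree"
    dp_degree = world_size // (mp_degree * sharding_degree)
    assert dp_degree >= 2, "hybrid_dp needs at least 2 data-parallel replicas"
    return dp_degree

# 4 nodes x 8 GPUs, 8-way sharding per node -> 4-way outer data parallelism
print(infer_dp_degree(32, mp_degree=1, sharding_degree=8))  # 4
```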

Examples:

  • assume we have 4 nodes with 8 GPUs on each node:
  • pure sharding across all 32 GPUs:
    dist_strategy.sharding = True
    dist_strategy.sharding_configs = {
        "sharding_segment_strategy": "segment_broadcast_MB",
        "segment_broadcast_MB": 32,
        "segment_anchors": None,
        "sharding_degree": 32,
        "mp_degree": 1,
        "hybrid_dp": False,
        "gradient_merge_acc_step": 1,
    }
  • sharding-hybrid-dp, which shards parameters within the 8 GPUs of each node and uses 4-way data parallelism to scale up training throughput, with gradient merge enabled at 4 accumulation steps:
    dist_strategy.sharding = True
    dist_strategy.sharding_configs = {
        "sharding_segment_strategy": "segment_broadcast_MB",
        "segment_broadcast_MB": 32,
        "segment_anchors": None,
        "sharding_degree": 8,
        "mp_degree": 1,
        "hybrid_dp": True,
        "gradient_merge_acc_step": 4,
    }
  • 2D Megatron-sharding, which splits parameters Megatron-style within the 8 GPUs of each node and uses 4-way sharding parallelism to further distribute the parameters into 4 shards:
    dist_strategy.sharding = True
    dist_strategy.sharding_configs = {
        "sharding_segment_strategy": "segment_broadcast_MB",
        "segment_broadcast_MB": 32,
        "segment_anchors": None,
        "sharding_degree": 4,
        "mp_degree": 8,
        "hybrid_dp": False,
        "gradient_merge_acc_step": 1,
    }
  • Megatron-sharding hybrid-dp mode: 2-way Megatron and 4-way sharding parallelism within each node, with the 4 nodes duplicated for data parallelism:
    dist_strategy.sharding = True
    dist_strategy.sharding_configs = {
        "sharding_segment_strategy": "segment_broadcast_MB",
        "segment_broadcast_MB": 32,
        "segment_anchors": None,
        "sharding_degree": 4,
        "mp_degree": 2,
        "hybrid_dp": True,
        "gradient_merge_acc_step": 1,
    }
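The "segment_broadcast_MB" strategy used in all of the configs above can be illustrated with a simplified greedy pass (a sketch under assumed semantics; the real segmentation lives inside Paddle's sharding optimizer):

```python
def segment_by_broadcast_mb(param_sizes_mb, broadcast_mb=32.0):
    """Greedily cut a list of parameter sizes (MB) into segments whose
    broadcast volume reaches roughly broadcast_mb each; sharding can then
    overlap one segment's broadcast with another segment's computation."""
    segments, current, volume = [], [], 0.0
    for size in param_sizes_mb:
        current.append(size)
        volume += size
        if volume >= broadcast_mb:
            segments.append(current)
            current, volume = [], 0.0
    if current:  # tail segment below the threshold
        segments.append(current)
    return segments

print(segment_by_broadcast_mb([10, 20, 5, 40, 3]))
# [[10, 20, 5], [40], [3]]
```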

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@JZ-LIANG JZ-LIANG changed the title [Sharding] ALL Optimizations for supporting 16B ERNIE 3.0 training [Sharding] Optimizations for supporting ERNIE 3.0 training Mar 26, 2021
@JZ-LIANG JZ-LIANG force-pushed the sharding-ERNIE160B-updates branch from bc35a69 to 0abe6e9 Compare March 29, 2021 03:03
@JZ-LIANG JZ-LIANG force-pushed the sharding-ERNIE160B-updates branch from ac47cea to 7659235 Compare March 29, 2021 07:31
@JZ-LIANG JZ-LIANG force-pushed the sharding-ERNIE160B-updates branch from 7659235 to 726525c Compare March 29, 2021 07:33
@JZ-LIANG JZ-LIANG force-pushed the sharding-ERNIE160B-updates branch from 190a067 to cf5b1c9 Compare March 30, 2021 03:05

message ShardingConfig {
optional float fuse_broadcast_MB = 1 [ default = 32.0 ];
optional float segment_broadcast_MB = 1 [ default = 32.0 ];
Contributor

Add simple comments?

Contributor Author

Those args decide how sharding segments the program, which affects the comm-calc overlap logic in sharding.
For now, we support two segment strategies:
"segment_broadcast_MB": segment by broadcast volume
"segment_anchors": segment by user-defined anchors (an op's outputs)
I will add a detailed explanation in FluidDoc.

Contributor

Also, next PR can add comments or link in this code

Contributor Author

recorded.

# computation by split the check_finite_and_unscale op.
is_distributed = self.role_maker._worker_num() > 1
if self.user_defined_strategy.sharding:
# if self.user_defined_strategy.sharding or self.user_defined_strategy.model_parallel:
Contributor

I don't get it!

Contributor Author

Sharding, as well as sharding-Megatron, does not support the pure_fp16 allreduce logic introduced in the amp & pure_fp16 meta-optimizers, so we need to set "is_distributed" = False while sharding or Megatron is enabled.
This PR adds logic for sharding-Megatron support, but the Megatron meta-optimizer will be added in another PR, so I will remove this line for now.


comm_op_num = insert_sync_comm_op(block, update_loss_scaling_op_idx + 3,
ring_id, [inf_var_fp32])
# comm_op_num = insert_sync_comm_op(block, update_loss_scaling_op_idx + 3,
Contributor

Clean up ?

Contributor Author

done

'elementwise_max', 'elementwise_div', 'elementwise_mul',
'elementwise_mul', 'elementwise_mul', 'momentum', 'momentum',
'momentum'
'c_reduce_sum', 'c_reduce_sum', 'c_reduce_sum', 'c_reduce_sum',
Contributor

👍🏻

# we should create the rename var in subprog, otherwise its VarType will be BOOL
block.create_var(
name=var_name_dict[name],
shape=block.program.global_block().var(name).shape,
Contributor

block.program.global_block().var(name) is called four times.

Contributor Author

revised ~

return True
return False

def is_gradient_merge_vars(var):
Contributor

Have we recorded these kinds of hard-coded rules anywhere?

Contributor Author

Yes.
This naive rule should be updated later.
The problem is that grad@GradientMerge should be persistable in the global scope but not saved; we need to design a method to distinguish persistable-non-savable vars from persistable-savable vars.

if var_name in vars_status:
vars_status[var_name] = 2
elif op.type == "c_allreduce_sum" or op.type == "c_reduce_sum":
if op.all_attrs()["use_calc_stream"] == False:
Contributor

Add some simple comments here?

Contributor Author
@JZ-LIANG JZ-LIANG Mar 31, 2021

done~
We should ensure all sharding-related grad communication (reduce / allreduce) is synced before the grads are used in the optimizers.

But we should ignore and skip the allreduce ops of Megatron, since they are scheduled on the calc stream and cannot have a non-sync problem before their next usage.
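The invariant described here can be sketched as a toy checker over a simplified op list (a hypothetical op representation, not Paddle's real IR; a single sync is assumed to cover all rings):

```python
def grad_comm_synced(ops):
    """Check that every gradient produced by a sharding reduce/allreduce
    on a separate comm stream is synced (c_sync_comm_stream) before any
    later op consumes it. Allreduce ops running on the calc stream
    (Megatron's case) are skipped, as they cannot race with later usage."""
    pending = set()  # grads reduced but not yet synced
    for op in ops:
        if op["type"] in ("c_reduce_sum", "c_allreduce_sum"):
            if not op.get("use_calc_stream", False):
                pending.update(op.get("outputs", []))
            continue
        if op["type"] == "c_sync_comm_stream":
            pending.clear()  # simplification: one sync covers all rings
            continue
        if pending & set(op.get("inputs", [])):
            return False  # grad consumed before its comm finished
    return True

ops = [
    {"type": "c_reduce_sum", "outputs": ["w@GRAD"], "use_calc_stream": False},
    {"type": "c_sync_comm_stream"},
    {"type": "momentum", "inputs": ["w@GRAD"]},
]
print(grad_comm_synced(ops))  # True
```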

ring_id = op.desc.attr("ring_id")
var_name = op.desc.input_arg_names()[0]
param = var_name.split("@")[0]
if op.type == "c_allreduce_sum" or op.type == "c_reduce_sum":
Contributor

Add some simple comments here?

Contributor Author
@JZ-LIANG JZ-LIANG Mar 31, 2021

done~
This problem was introduced because we want to overlap the sharding grad communication with the backward computation.
Sharding uses both allreduce and reduce to sync grads; we should ensure all sharding-related grad communication (reduce / allreduce) is synced before the grads are used in the optimizers.

@JZ-LIANG JZ-LIANG force-pushed the sharding-ERNIE160B-updates branch from ed5e936 to dde7d24 Compare March 31, 2021 06:50
gongweibao previously approved these changes Mar 31, 2021
Contributor
@gongweibao gongweibao left a comment

LGTM

# dp here is the pure dp as the outest parallelism
self.dp_degree = int(self.role_maker._worker_num() // self.mp_degree //
self.sharding_degree)
assert self.role_maker._worker_num(
Contributor

Add an explanation of why the assert can fail?

self._startup_program, self.current_endpoint,
self.mp_group_endpoints, self.mp_rank, self.mp_ring_id, False)
append_naive_sync(startup_block, self.startup_prog_sync_var,
self.global_ring_id)
Contributor

Do we need to process the else condition?

check_broadcast(main_block)
check_allreduce_sum(main_block, self._shard, self.dp_ring_id)
# # check op dependecy
# check_broadcast(main_block)
Contributor

# FIXME

@JZ-LIANG JZ-LIANG force-pushed the sharding-ERNIE160B-updates branch from c0aa6b6 to 60be6ec Compare March 31, 2021 08:09
@JZ-LIANG JZ-LIANG requested a review from gongweibao March 31, 2021 08:26
@JZ-LIANG JZ-LIANG changed the title [Sharding] Optimizations for supporting ERNIE 3.0 training [3D-Parallel:Sharding] Optimizations for supporting ERNIE 3.0 training Mar 31, 2021

message ShardingConfig {
optional float fuse_broadcast_MB = 1 [ default = 32.0 ];
optional float segment_broadcast_MB = 1 [ default = 32.0 ];
Contributor

Also, next PR can add comments or link in this code

optional float segment_broadcast_MB = 1 [ default = 32.0 ];
optional bool hybrid_dp = 2 [ default = false ];
optional int32 sharding_group_size = 3 [ default = 8 ];
optional int32 sharding_degree = 3 [ default = 8 ];
Contributor

Why set default = 8? Maybe set it to -1 or 0, meaning the value is taken from world_size.

Contributor Author

Recorded. Will update in the next PR.

optional int32 sharding_degree = 3 [ default = 8 ];
optional int32 mp_degree = 4 [ default = 1 ];
optional string sharding_segment_strategy = 5
[ default = 'segment_broadcast_MB' ];
Contributor

Add all enum values in the comments.
Also, can the strategy names be simplified? One could be called by_broadcast (or by_param_size, or another better name) and the other by_anchors (or another better name).

Contributor Author

Great suggestion! Recorded.

set([param for param, worker_idx in shard.global_param2device.items() \
if worker_idx == shard.worker_idx]))
assert to_check_param == should_check_param, "amp \
check_finite_and_unscale checking miss [{}] and got unexpected [{}]".format(
Contributor

check_finite_and_unscale --> {op.type}

Contributor Author

Recorded. Will update in the next PR.

def __init__(self, sharding_ring_id):
self.sharding_ring_id = sharding_ring_id
def __init__(self, mp_ring_id):
self.mp_ring_id = mp_ring_id
Contributor

Needs a comment that mp_ring is (sharding + mp).

Contributor Author

recorded

if pure_dp_degree > 1:
block._insert_op_without_sync(
idx + 2,
type='scale',
Contributor

Maybe it's better to scale before c_allreduce_sum, since c_allreduce_sum may produce inf.

Contributor Author

Since grads tend to be small decimal values, I think scaling after allreduce is better to avoid arithmetic underflow.

outputs={'Out': sync_var},
attrs={
'ring_id': ring_id,
'use_calc_stream': True,
Contributor

Maybe need sync_calc_stream

Contributor Author

See the next comment.

False,
global_ring_id=self.global_ring_id,
sync=False)
append_naive_sync(startup_block, self.startup_prog_sync_var,
Contributor

Can move into _init_communicator

Contributor Author

Yes, in the next PR there will be an update that moves all comm-init-related syncs into the _init_communicator function. Recorded.

type='conditional_block',
inputs={
'Cond': cond,
'Input': [],
Contributor

666

Contributor Author

In the gradient merge scenario, there is no need to bring the temp vars in the optimizer block scope back to the global block scope, since those temp vars are only used in the optimize procedure.
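The gradient-merge behavior discussed here can be sketched with a toy accumulator (illustrative only; the buffer plays the role of grad@GradientMerge from the discussion above, and the class itself is hypothetical, not a Paddle API):

```python
class GradientMerge:
    """Toy model of sharding gradient merge: micro-batch gradients are
    accumulated into a persistent buffer, and the optimizer runs only
    once every acc_steps steps, inside a conditional block, after which
    the buffer is cleared."""

    def __init__(self, acc_steps):
        self.acc_steps = acc_steps
        self.step = 0
        self.acc_grad = 0.0  # persistable across micro-batches, not saved

    def backward(self, micro_batch_grad):
        """Accumulate one micro-batch gradient; return the cond flag."""
        self.acc_grad += micro_batch_grad
        self.step += 1
        return self.step % self.acc_steps == 0

    def optimize(self, param, lr):
        """Body of the conditional block: apply the averaged gradient."""
        param -= lr * (self.acc_grad / self.acc_steps)
        self.acc_grad = 0.0  # temp state never leaves the optimize scope
        return param

gm = GradientMerge(acc_steps=4)
flags = [gm.backward(g) for g in [1.0, 2.0, 3.0, 4.0]]
print(flags)                      # [False, False, False, True]
print(gm.optimize(10.0, lr=1.0))  # 7.5
```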


class TestFleetMetaOptimizer(TestFleetMetaOptimizer):
def setUp(self):
os.environ["PADDLE_TRAINER_ID"] = "3"
Contributor

666

Contributor Author

In the mp-sharding or sharding-hybrid-dp settings, we need at least 4 workers to set up the parallelism logic.

Contributor
@wangxicoding wangxicoding left a comment

LGTM

@wangxicoding wangxicoding merged commit 69c874f into PaddlePaddle:develop Apr 2, 2021
@wangxicoding
Contributor

[image]
Need fix.
