Commit 69cedad

sharding: update config DOC
1 parent 90133d2 commit 69cedad

1 file changed (+36, -12 lines)

python/paddle/distributed/fleet/base/distributed_strategy.py

Lines changed: 36 additions & 12 deletions
@@ -744,6 +744,8 @@ def sharding(self):
         idea from [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054).
         Model parameters and Optimizer State are sharded into different ranks allowing to fit larger model.

+        In a hybrid parallelism scenario, the sharding config is used as a uniform API to set each parallelism.
+
         Default value: False

         Examples:
@@ -770,29 +772,51 @@ def sharding_configs(self):
         Set sharding configurations.

         **Note**:
-            fuse_broadcast_MB(float): size of a fused group of broadcasted parameters.
-            This configuration will affect the communication speed in sharding training,
-            and should be an empirical value decided by your model size and network topology.
+            sharding_segment_strategy(string): the strategy used to segment the program (forward & backward operations). Two strategies are
+            available: "segment_broadcast_MB" and "segment_anchors". A segment is a concept used in sharding to overlap computation and
+            communication. Default is "segment_broadcast_MB".
+
+            segment_broadcast_MB(float): segment the program by parameter broadcast volume. Sharding will introduce parameter broadcast operations into the program, and
+            after every segment_broadcast_MB of parameters has been broadcast, the program will be cut into one segment.
+            This configuration will affect the communication speed in sharding training, and should be an empirical value decided by your model size and network topology.
+            Only takes effect when sharding_segment_strategy = segment_broadcast_MB. Default is 32.0.
+
+            segment_anchors(list): list of anchors used to segment the program, which allows finer control of program segmentation.
+            This strategy is experimental for now. Only takes effect when sharding_segment_strategy = segment_anchors.
+
+            sharding_degree(int): specifies the number of GPUs within each sharding parallelism group; sharding will be turned off if sharding_degree=1. Default is 8.
+
+            gradient_merge_acc_step(int): specifies the accumulation steps in gradient merge; gradient merge will be turned off if gradient_merge_acc_step=1. Default is 1.

-            hybrid_dp(bool): enable hybrid data parallelism above the sharding parallelism.
-            you are supposed to have at least double the number of gpu you have in normal sharding
-            training to enable this feature.
+            optimize_offload(bool): enable optimizer offload, which offloads the moment vars to Host memory in order to save GPU memory for fitting a larger model.
+            The moment vars will be prefetched from and offloaded to Host memory during the update stage. It is a strategy that trades off training speed against GPU memory, and is recommended to be turned on only when gradient_merge_acc_step is large, where
+            the number of update stages will be relatively small compared with the forward & backward passes. Default is False.
+
+            dp_degree(int): specifies the number of data parallelism groups; when dp_degree >= 2, it introduces dp_degree-way data parallelism as the outer parallelism for the inner parallelism. The user should ensure global_world_size = mp_degree * sharding_degree * pp_degree * dp_degree. Default is 1.
+
+            mp_degree(int): [Hybrid parallelism ONLY] specifies the number of GPUs within each megatron parallelism group; megatron parallelism will be turned off if mp_degree=1. Default is 1.
+
+            pp_degree(int): [Hybrid parallelism ONLY] specifies the number of GPUs within each pipeline parallelism group; pipeline parallelism will be turned off if pp_degree=1. Default is 1.
+
+            pp_allreduce_in_optimize(bool): [Hybrid parallelism ONLY] move the allreduce operations from the backward stage to the update (optimize) stage when pipeline parallelism is on.
+            This configuration will affect the communication speed of hybrid parallelism training depending on the network topology. This strategy is experimental for now.

-            sharding_group_size(int): attribute of hybrid_dp. specific the the number of gpus within
-            each sharding group; and therefore, the number of hybrid data parallelism ways will be equal
-            to (global_size / sharding_group_size).

         Examples:

           .. code-block:: python

+            # sharding-DP, 2 nodes with 8 gpus per node
             import paddle.distributed.fleet as fleet
             strategy = fleet.DistributedStrategy()
             strategy.sharding = True
             strategy.sharding_configs = {
-                "fuse_broadcast_MB": 32,
-                "hybrid_dp": True,
-                "sharding_group_size": 8}
+                "sharding_segment_strategy": "segment_broadcast_MB",
+                "segment_broadcast_MB": 32,
+                "sharding_degree": 8,
+                "dp_degree": 2,
+                "gradient_merge_acc_step": 4,
+            }
         """
         return get_msg_dict(self.strategy.sharding_configs)

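
Editor's note on the updated keys: the dp_degree description above requires global_world_size = mp_degree * sharding_degree * pp_degree * dp_degree. The sketch below shows one way the new keys could be combined for 2 nodes with 8 GPUs each (16 GPUs total). The specific degree values, the optimize_offload setting, and the final sanity check are illustrative assumptions, not part of this commit.

    # Hypothetical hybrid-parallelism layout for 2 nodes x 8 GPUs (16 GPUs total).
    # The degrees are chosen only to satisfy the constraint from the docstring:
    #   global_world_size == mp_degree * sharding_degree * pp_degree * dp_degree
    import paddle.distributed.fleet as fleet

    strategy = fleet.DistributedStrategy()
    strategy.sharding = True

    sharding_configs = {
        "sharding_segment_strategy": "segment_broadcast_MB",
        "segment_broadcast_MB": 32,       # only used by the segment_broadcast_MB strategy
        "mp_degree": 2,                    # megatron parallelism within each group (assumed)
        "sharding_degree": 2,              # GPUs per sharding group (assumed)
        "pp_degree": 2,                    # pipeline parallelism stages (assumed)
        "dp_degree": 2,                    # outer data parallelism ways (assumed)
        "gradient_merge_acc_step": 4,      # accumulate gradients over 4 steps per update
        "optimize_offload": True,          # offload moment vars to host memory (illustrative)
    }
    strategy.sharding_configs = sharding_configs

    # Sanity check of the constraint stated in the dp_degree note.
    global_world_size = 16
    assert (sharding_configs["mp_degree"] * sharding_configs["sharding_degree"]
            * sharding_configs["pp_degree"] * sharding_configs["dp_degree"]) == global_world_size

The degree values here are placeholders; in practice they should match the actual process layout used when launching the distributed job.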
