
Commit bbe781a

sharding update doc
1 parent ba7ee5e commit bbe781a

File tree

1 file changed: 12 additions, 12 deletions

python/paddle/distributed/fleet/base/distributed_strategy.py

Lines changed: 12 additions & 12 deletions
@@ -772,34 +772,34 @@ def sharding_configs(self):
  Set sharding configurations.

  **Note**:
-     sharding_segment_strategy(string): strategy used to segment the program(forward & backward operations). two strategise are
+     sharding_segment_strategy(string, optional): strategy used to segment the program(forward & backward operations). two strategise are
      available: "segment_broadcast_MB" and "segment_anchors". segment is a concept used in sharding to overlap computation and
      communication. Default is segment_broadcast_MB.

-     segment_broadcast_MB(float): segment by the parameters broadcast volume. sharding will introduce parameter broadcast operations into program, and
+     segment_broadcast_MB(float, optional): segment by the parameters broadcast volume. sharding will introduce parameter broadcast operations into program, and
      after every segment_broadcast_MB size parameter being broadcasted, the program will be cutted into one segment.
      This configuration will affect the communication speed in sharding training, and should be an empirical value decided by your model size and network topology.
-     Only enable sharding_segment_strategy = segment_broadcast_MB. when Default is 32.0 .
+     Only enable when sharding_segment_strategy = segment_broadcast_MB. Default is 32.0 .

      segment_anchors(list): list of anchors used to segment the program, which allows a finner control of program segmentation.
-     this strategy is experimental by now. Only enable sharding_segment_strategy = segment_anchors.
+     this strategy is experimental by now. Only enable when sharding_segment_strategy = segment_anchors.

-     sharding_degree(int): specific the number of gpus within each sharding parallelism group; and sharding will be turn off if sharding_degree=1. Default is 8.
+     sharding_degree(int, optional): specific the number of gpus within each sharding parallelism group; and sharding will be turn off if sharding_degree=1. Default is 8.

-     gradient_merge_acc_step(int): specific the accumulation steps in gradient merge; and gradient merge will be turn off if gradient_merge_acc_step=1. Default is 1.
+     gradient_merge_acc_step(int, optional): specific the accumulation steps in gradient merge; and gradient merge will be turn off if gradient_merge_acc_step=1. Default is 1.

-     optimize_offload(bool): enable the optimizer offload which will offload the moment vars to Host memory in order to saving GPU memory for fitting larger model.
+     optimize_offload(bool, optional): enable the optimizer offload which will offload the moment vars to Host memory in order to saving GPU memory for fitting larger model.
      the moment var will be prefetch from and offloaded to Host memory during update stage. it is a stragtegy that trades off between training speed and GPU memory, and is recommened to be turn on only when gradient_merge_acc_step large, where
      the number of time of update stage will be relatively small compared with forward&backward's. Default is False.

-     dp_degree(int): specific the number of data parallelism group; when dp_degree >= 2, it will introduce dp_degree ways data parallelism as the outer parallelsim for the inner parallelsim. User should ensure global_world_size = mp_degree * sharding_degree * pp_degree * dp_degree. Default is 1.
+     dp_degree(int, optional): specific the number of data parallelism group; when dp_degree >= 2, it will introduce dp_degree ways data parallelism as the outer parallelsim for the inner parallelsim. User is responsible to ensure global_world_size = mp_degree * sharding_degree * pp_degree * dp_degree. Default is 1.

-     mp_degree(int): [Hybrid parallelism ONLY] specific the the number of gpus within each megatron parallelism group; and megatron parallelism will turn be off if mp_degree=1. Default is 1.
+     mp_degree(int, optional): [Hybrid parallelism ONLY] specific the the number of gpus within each megatron parallelism group; and megatron parallelism will turn be off if mp_degree=1. Default is 1.

-     pp_degree(int): [Hybrid parallelism ONLY] specific the the number of gpus within each pipeline parallelism group; and pipeline parallelism will turn be off if pp_degree=1. Default is 1.
+     pp_degree(int, optional): [Hybrid parallelism ONLY] specific the the number of gpus within each pipeline parallelism group; and pipeline parallelism will turn be off if pp_degree=1. Default is 1.

-     pp_allreduce_in_optimize(bool): [Hybrid parallelism ONLY] move the allreduce operations from backward stage to update(optimize) stage when pipeline parallelsim is on.
-     This configuration will affect the communication speed of Hybrid parallelism training depeneded on network topology. this strategy is experimental by now.
+     pp_allreduce_in_optimize(bool, optional): [Hybrid parallelism ONLY] move the allreduce operations from backward stage to update(optimize) stage when pipeline parallelsim is on.
+     This configuration will affect the communication speed of Hybrid parallelism training depeneded on network topology. this strategy is experimental by now.. Default is False.


  Examples:
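
For context, the sketch below (not part of this commit's diff) shows roughly how these sharding_configs keys are passed to fleet.DistributedStrategy, assuming the collective paddle.distributed.fleet API. The model, optimizer, and concrete degree values are illustrative assumptions only, not taken from the commit.

    # Minimal sketch of enabling sharding via DistributedStrategy (illustrative values).
    import paddle
    import paddle.distributed.fleet as fleet

    fleet.init(is_collective=True)

    strategy = fleet.DistributedStrategy()
    strategy.sharding = True
    strategy.sharding_configs = {
        # segment by broadcast volume: cut a new segment every 32 MB of broadcasted parameters
        "sharding_segment_strategy": "segment_broadcast_MB",
        "segment_broadcast_MB": 32,
        # shard optimizer states across 8 GPUs within each sharding group
        "sharding_degree": 8,
        # accumulate gradients for 4 steps before each update
        "gradient_merge_acc_step": 4,
        # outer data parallelism on top of the inner (sharding) parallelism
        "dp_degree": 2,
        "mp_degree": 1,
        "pp_degree": 1,
    }

    # placeholder model/optimizer; the strategy takes effect through fleet.distributed_optimizer
    model = paddle.nn.Linear(1024, 1024)
    optimizer = paddle.optimizer.Momentum(learning_rate=0.01, parameters=model.parameters())
    optimizer = fleet.distributed_optimizer(optimizer, strategy=strategy)

With the illustrative degrees above, the constraint stated in the docstring works out to global_world_size = mp_degree * sharding_degree * pp_degree * dp_degree = 1 * 8 * 1 * 2 = 16 GPUs for the launch.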
