new group #31682
Conversation
Thanks for your contribution!

✅ This PR's description meets the template requirements!
gongweibao
left a comment
Please fine-tune your title and fix your CI first!
    return gp

def wait(tensor, group=None, use_calc_stream=True):
A question:
- What is the use_calc_stream parameter for? What is the effect of setting it to True versus False?
- Might control over other streams be added in the future?

- Paddle makes a logical abstraction over GPU streams, a calculation stream and a communication stream, which correspond to different channels.
- No. Besides this abstraction there are multiple comm streams, distinguished by id and bound to a group.
        attrs={'ring_id': ring_id}, )

def broadcast(tensor, src, group=None, use_calc_stream=True):
Does changing group=0 to group=None affect backward compatibility?
For example, is there any code that sets group=1?

Confirmed with colleagues that group=None has no impact: before this PR groups could not be created, so group=1 never occurs, and explicit calls with group=0 have already been ruled out in the codebase.
def new_group(ranks=None, backend=None):
    """
    Creates a new distributed communication group.

    backend (str): The backend used to create group, only nccl is supported now.

    Returns:
        Group: The group instance. Never returns None.

        import paddle

        paddle.distributed.init_parallel_env()
        tindata = np.random.random([10, 1000]).astype('float32')
If you create the Tensor with paddle.random, you don't need numpy.
Examples:
    .. code-block:: python

        import numpy as np
        import paddle

        paddle.distributed.init_parallel_env()
        tindata = np.random.random([10, 1000]).astype('float32')
        tindata = paddle.to_tensor(tindata)

Args:
    ranks (list): The global ranks of group members, list as sorted.
    backend (str): The backend used to create group, only nccl is supported now.
When backend defaults to None, the current behavior is to use nccl directly. Is there a plan to change this default behavior in the future?
        place = core.CUDAPlace(genv.device_id)
        core.NCCLParallelContext(strategy, place).init_with_ring_id(ring_id)
    else:
        assert False
    Creates a new distributed communication group.

    Args:
        ranks (list): The global ranks of group members, list as sorted.
I don't quite understand "list as sorted". Does it mean there is an ordering requirement on the values in ranks?
Examples:
    .. code-block:: python

        import numpy as np

        paddle.distributed.init_parallel_env()
        tindata = np.random.random([10, 1000]).astype('float32')
        tindata = paddle.to_tensor(tindata)
@@ -163,7 +371,9 @@ def all_reduce(tensor, op=ReduceOp.SUM, group=0):
    tensor (Tensor): The input Tensor. It also works as the output Tensor. Its data type
        should be float16, float32, float64, int32 or int64.
    op (ReduceOp.SUM|ReduceOp.MAX|ReduceOp.Min|ReduceOp.PROD): Optional. The operation used.
@@ -238,7 +454,9 @@ def reduce(tensor, dst, op=ReduceOp.SUM, group=0):
    should be float16, float32, float64, int32 or int64.
    dst (int): The destination rank id.
    op (ReduceOp.SUM|ReduceOp.MAX|ReduceOp.Min|ReduceOp.PROD): Optional. The operation used.
@@ -394,7 +626,9 @@ def scatter(tensor, tensor_list=None, src=0, group=0):
    tensor_list (list): A list of Tensors to scatter. Every element in the list must be a Tensor whose data type
        should be float16, float32, float64, int32 or int64.
    src (int): The source rank id.
TCChenlong
left a comment
LGTM
TODO: fix docs bug
jzhang533
left a comment
lgtm
Will fix the doc problem in a following PR.
PR types
New features
PR changes
APIs
Describe
Unit tests covered by