new group #31682
Conversation
Thanks for your contribution!

✅ This PR's description meets the template requirements!
gongweibao
left a comment
Please fine-tune your title and fix your CI first!
    return gp

def wait(tensor, group=None, use_calc_stream=True):
A question:
- What is the use_calc_stream parameter for? What is the effect of setting it to True versus False?
- Might control over other streams be added in the future?

- Paddle makes a logical abstraction over GPU streams, a calculation stream and a communication stream, which correspond to different channels.
- No. Besides this abstraction there are multiple comm streams, distinguished by id and bound to a group.
        attrs={'ring_id': ring_id}, )

def broadcast(tensor, src, group=None, use_calc_stream=True):
Does changing group=0 to group=None affect backward compatibility?
For example, is there any code that sets group=1?

Confirmed with colleagues that group=None has no impact: before this PR groups could not be created, so group=1 never occurs, and explicit calls with group=0 have already been ruled out in the codebase.
def new_group(ranks=None, backend=None):
    """
    Creates a new distributed communication group.

    backend (str): The backend used to create group, only nccl is supported now.

    Returns:
        Group: The group instance. Never returns None.

        import paddle

        paddle.distributed.init_parallel_env()
        tindata = np.random.random([10, 1000]).astype('float32')
If you create the Tensor with paddle.random, you don't need numpy.
Examples:
    .. code-block:: python

        import numpy as np
        import paddle

        paddle.distributed.init_parallel_env()
        tindata = np.random.random([10, 1000]).astype('float32')
        tindata = paddle.to_tensor(tindata)

Args:
    ranks (list): The global ranks of group members, list as sorted.
    backend (str): The backend used to create group, only nccl is supported now.
When backend defaults to None, the current behavior is to use nccl directly. Is there a plan to change this default behavior in the future?
        place = core.CUDAPlace(genv.device_id)
        core.NCCLParallelContext(strategy, place).init_with_ring_id(ring_id)
    else:
        assert False
    Creates a new distributed communication group.

    Args:
        ranks (list): The global ranks of group members, list as sorted.
I don't quite understand "list as sorted". Does it mean there is an ordering requirement on the values in ranks?
Examples:
    .. code-block:: python

        import numpy as np

        paddle.distributed.init_parallel_env()
        tindata = np.random.random([10, 1000]).astype('float32')
        tindata = paddle.to_tensor(tindata)
@@ -163,7 +371,9 @@ def all_reduce(tensor, op=ReduceOp.SUM, group=0):
    tensor (Tensor): The input Tensor. It also works as the output Tensor. Its data type
        should be float16, float32, float64, int32 or int64.
    op (ReduceOp.SUM|ReduceOp.MAX|ReduceOp.Min|ReduceOp.PROD): Optional. The operation used.
@@ -238,7 +454,9 @@ def reduce(tensor, dst, op=ReduceOp.SUM, group=0):
    should be float16, float32, float64, int32 or int64.
    dst (int): The destination rank id.
    op (ReduceOp.SUM|ReduceOp.MAX|ReduceOp.Min|ReduceOp.PROD): Optional. The operation used.
@@ -394,7 +626,9 @@ def scatter(tensor, tensor_list=None, src=0, group=0):
    tensor_list (list): A list of Tensors to scatter. Every element in the list must be a Tensor whose data type
        should be float16, float32, float64, int32 or int64.
    src (int): The source rank id.
TCChenlong
left a comment
LGTM
TODO: fix docs bug
jzhang533
left a comment
lgtm
Will fix the doc problem in a following PR.
PR types
New features
PR changes
APIs
Describe
Unit tests covered by