Conversation
paddle/framework/multigpu.md
Outdated
The data for model parallelism seems to all be on one GPU? The model is split into n parts, but it seems the input layer will mostly end up on a single GPU?

You are right. Fixed.
paddle/framework/multigpu.md
Outdated
Implement -> Implementation
paddle/framework/multigpu.md
Outdated
These two operators are part of the graph; please draw the dependency more clearly.
If the dependency is clear, the reader should be able to understand what the target for graph initialization is, and what the target for each training step is.
paddle/framework/multigpu.md
Outdated
I don't quite understand what "synchronizing or synchronize style" means :)
paddle/framework/multigpu.md
Outdated
> Need to notice that the Allreduce operator forces the GPUs to synchronize at that point. Every device only needs to run its sub-graph in a loop forever; whether the whole training process is asynchronous or synchronous depends on where the Allreduce point sits in the graph.

> For the simplest implementation, when each GPU computes the gradient of `W`, it is followed by an `AllReduce` operator that accumulates `dW` over the full batch of data; then each device runs the optimize process individually and applies the gradient to its `W`.
Move the "Implementation" section here?
I don't think so. The graph converter is also part of our implementation. To be unified with dist_train.md, we put it in an independent paragraph to make the document clearer.
I changed the first sentence to avoid ambiguity.
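The data-parallel step discussed above (per-GPU gradients, an `AllReduce` sum, then an individual optimize step on each device) can be sketched as a pure-Python simulation. All names here (`all_reduce`, `simulate_step`, `lr`) are illustrative assumptions, not Paddle APIs:

```python
def all_reduce(grads):
    """Sum-AllReduce: every device receives the sum of all local gradients."""
    total = sum(grads)
    return [total] * len(grads)

def simulate_step(weights, local_grads, lr=0.1):
    # AllReduce is the synchronization point: no device proceeds until
    # the summed gradient is available everywhere.
    reduced = all_reduce(local_grads)
    # Each device then applies the optimizer independently to its replica
    # of W; replicas stay identical because all saw the same summed dW.
    return [w - lr * g for w, g in zip(weights, reduced)]

weights = [1.0, 1.0, 1.0, 1.0]        # one replica of W per GPU
local_grads = [0.1, 0.2, 0.3, 0.4]    # per-GPU gradient on its data shard
new_weights = simulate_step(weights, local_grads)
```

Note how the position of the `all_reduce` call determines the synchronization point: everything before it runs independently per device, everything after it sees the full-batch gradient.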
paddle/framework/multigpu.md
Outdated
> *Broadcast, AllReduce in a single machine; and Broadcast, AllReduce, Send, Recv in multiple machines*

> <img src="images/multigpu_before_convert.png" width="300"/>
It's so weird. Fixed.
paddle/framework/multigpu.md
Outdated
> 2. Control operators between GPUs will be inserted into the graph.

> *Broadcast, AllReduce in a single machine; and Broadcast, AllReduce, Send, Recv in multiple machines*
Since you mentioned "Send, Recv", can you please add a reference link to these design docs?
Thanks for the reminder! Done.
paddle/framework/multigpu.md
Outdated
> ### Benefits
>
> - Can easily move the optimize sub-graph to the parameter server; the multi-GPU feature can be compatible with the distributed-support design.
> - Easily plugs in the NCCL2 library.
Reference the NCCL library URL, please.
Same question as @helinwang: we mentioned both GPU data parallelism and model parallelism, and it seems that we are going to implement GPU data parallelism first. Should we point this out?

It should be an NCCL-based design doc only. Thank you for the reviewing, guys!

The confusing part about data parallelism vs. model parallelism has been removed, and a detailed Allreduce section has been added.
doc/design/paddle_nccl.md
Outdated
> As shown in the picture, when each GPU computes the gradient of `W`, it is followed by an `AllReduce` operator that accumulates `dW` over the full batch of data; then each device runs the optimize process individually and applies the gradient to its `W`.

> - **AllReduce2**
I think we need to decide on one all-reduce OP; supporting two different OPs for the same purpose is just too much labor.
I am leaning more towards implementing our own AllReduce, since AllReduce2 adds one more dependency, NCCL2, and NCCL2 is closed source.
AllReduce2 is a composed operator written by hand. We only use the Reduce operator to implement AllReduce2.
We have already changed to NCCL2 in Paddle, so it is not one more dependency.
I see, thanks.
Since there is already AllReduce, do we need another AllReduce2? For the reasons mentioned above.
Yeah, we actually only need AllReduce2. I wrote down AllReduce2 just to avoid people confusing it with NCCL's built-in AllReduce.
Should I remove the AllReduce description and leave AllReduce2 alone?
What do you think about calling it AllReduce? It's a PaddlePaddle OP, and there is no AllReduce1, so we probably should not name it AllReduce2.
doc/design/paddle_nccl.md
Outdated
> - **AllReduce2**
>   If we use the NCCL2 AllReduce primitive, every GPU optimizes the full batch of data, wasting (n-1) GPUs' compute resources. In addition, AllReduce will only utilize the communication resource during synchronization, so updating the gradient becomes a separate phase. In fact, we can amortize the gradient-update time cost into the communication phase.
>   - Every parameter has its root card. That card will call the **Reduce** operator and collect the gradients from the GPUs.
>   - The whole model's parameters will be hashed to different root cards, ensuring load balance between GPUs.
Just a personal question: should we add a device_id field to Var in the protobuf, or would NCCL do this by itself?
It's still a controversial topic in our design; it's not determined by NCCL. So we can leave that discussion to the parallelism-with-multi-device topic.
> - **AllReduce**
>   Note that our AllReduce operator is a ring-based AllReduce implementation. If we use the NCCL2 AllReduce primitive, every GPU optimizes the full batch of data, wasting (n-1) GPUs' compute resources. In addition, the NCCL2 built-in AllReduce will only utilize the communication resource during synchronization, so updating the gradient becomes a subsequent phase. In fact, we can amortize the gradient-update time cost into the communication phase. The process is:
>   1. Every parameter has its root card. That card is responsible for aggregating the gradients from the GPUs.
Maybe we could describe how the parameters are distributed (round-robin, hashed, or user-specified)?
No, that's another problem, coupled with parallel.do; @tonyyang-svail is working on it.
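The hand-composed AllReduce described in the quote (per-parameter root card, Reduce to the root, hashing for load balance, fused update) could be sketched like this. This is a pure-Python simulation with invented names (`root_of`, `reduce_to_root`, `update_and_broadcast`), not actual operator code, and `zlib.crc32` stands in for whatever hashing scheme is eventually chosen:

```python
import zlib

def root_of(param_name, n_gpus):
    # Hash each parameter name to a root card so that reduction work (and
    # the fused gradient update) is load-balanced across GPUs. crc32 is
    # used only because it is stable; the real scheme is undecided.
    return zlib.crc32(param_name.encode()) % n_gpus

def reduce_to_root(per_gpu_grads):
    # Reduce: only the root ends up holding the summed gradient, which lets
    # the update be amortized into the communication phase on that card.
    return sum(per_gpu_grads)

def update_and_broadcast(weight, summed_grad, n_gpus, lr=0.1):
    new_w = weight - lr * summed_grad   # optimize once, on the root card
    return [new_w] * n_gpus             # broadcast the result to every GPU

n_gpus = 4
root = root_of("fc1.w", n_gpus)                   # root card for fc1.w
summed = reduce_to_root([0.1, 0.2, 0.3, 0.4])     # per-GPU gradients
replicas = update_and_broadcast(1.0, summed, n_gpus)
```

The design point is that with Reduce + Broadcast, the optimizer runs once per parameter instead of once per GPU, and different parameters' updates run on different root cards in parallel.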
doc/design/paddle_nccl.md
Outdated
> ## Motivation
>
> [NCCL](https://developer.nvidia.com/nccl) is an NVIDIA library that supports multi-GPU communication and is optimized for NVIDIA GPUs. It provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter that achieve high bandwidth over PCIe and the NVLink high-speed interconnect. With the NCCL library, we can easily accelerate training in parallel.
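For readers unfamiliar with why these collectives achieve high bandwidth, the ring all-reduce idea (reduce-scatter followed by all-gather, moving only 1/n of the data per step per link) can be simulated in a few lines. This is a hypothetical pure-Python sketch; the function name and chunked data layout are invented for illustration and are not NCCL's API:

```python
def ring_all_reduce(data):
    """Simulate ring AllReduce. data[g][c] = value of chunk c on GPU g.
    Returns the chunk-wise sums replicated on every GPU."""
    n = len(data)
    data = [list(d) for d in data]
    # Phase 1, reduce-scatter: in each of n-1 steps, every GPU passes one
    # chunk to its right neighbor, which accumulates it. Only 1/n of the
    # data crosses each link per step, which makes the ring bandwidth-
    # efficient regardless of the number of GPUs.
    for step in range(n - 1):
        for g in range(n):
            c = (g - step) % n
            data[(g + 1) % n][c] += data[g][c]
    # Now GPU g owns the fully-summed chunk (g + 1) % n.
    # Phase 2, all-gather: circulate the finished chunks around the ring
    # so that every GPU ends up with every summed chunk.
    for step in range(n - 1):
        for g in range(n):
            c = (g + 1 - step) % n
            data[(g + 1) % n][c] = data[g][c]
    return data

result = ring_all_reduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```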
doc/design/paddle_nccl.md
Outdated
> ### Graph Converter
>
> To be compatible with the [parameter server design doc](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/ops/dist_train.md), the graph converter converts the user-defined operation graph into sub-graphs to be executed on different devices.
graph converter => transpiler
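The converter/transpiler step described above amounts to a graph rewrite; a minimal sketch might scan the user's operator list and insert a communication op after every gradient op. The dict-based op representation and the `_grad` naming convention are assumptions for illustration, not Paddle's actual IR:

```python
def insert_allreduce(ops):
    # Hypothetical graph-converter pass: copy the op list, appending an
    # allreduce op immediately after each gradient op it encounters.
    converted = []
    for op in ops:
        converted.append(op)
        if op["type"].endswith("_grad"):
            # Communication op placed right after the gradient op, so every
            # device accumulates dW over the full batch before optimizing.
            converted.append({"type": "allreduce",
                              "input": op["output"],
                              "output": op["output"]})
    return converted

user_graph = [
    {"type": "mul", "output": "y"},
    {"type": "softmax", "output": "p"},
    {"type": "softmax_grad", "output": "dy"},
    {"type": "mul_grad", "output": "dW"},
]
converted = insert_allreduce(user_graph)
```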
> As shown in the picture, when each GPU computes the gradient of `W`, it is followed by an `AllReduce` operator that accumulates `dW` over the full batch of data; then each device runs the optimize process individually and applies the gradient to its `W`.

> - **AllReduce**
>   Note that our AllReduce operator is a ring-based AllReduce implementation. If we use the NCCL2 AllReduce primitive, every GPU optimizes the full batch of data, wasting (n-1) GPUs' compute resources. In addition, the NCCL2 built-in AllReduce will only utilize the communication resource during synchronization, so updating the gradient becomes a subsequent phase. In fact, we can amortize the gradient-update time cost into the communication phase. The process is:
NCCL2 also supports ring-based AllReduce; see https://github.com/PaddlePaddle/Paddle/wiki/NCCL2-Survey
These are not the same. What we need is not only a ring-based AllReduce: the NCCL2 AllReduce only supports simple operations such as sum and max, and we need to do optimization inside it.
Yancey0623 left a comment:

LGTM, and maybe @helinwang would review this PR again.

Here is better for review.
fix #3651