Conversation
paddle/framework/operator.h
Outdated
return device_context_;
}

//! Get a input which has multiple variables.
The comment is not accurate; all our inputs/outputs are stored in vectors.
paddle/framework/operator.h
Outdated
const std::vector<std::string>& Inputs(const std::string& name) const {
  return op_.Inputs(name);
}
//! Get an output which has multiple variables.
@@ -0,0 +1,17 @@
/* Copyright (c) 2016 PaddlePaddle Authors. All Rights Reserve.
Build an operator-independent module, which may be used by Multiexecutor or other modules.
paddle/operators/nccl_op.cc
Outdated
auto x_dims = ctx->GetInputsDim("X");

// std::string reduction = ctx->Attrs().Get<std::string>("reduction");
Add reduction.

Done.
paddle/operators/nccl_op.cc
Outdated
AddOutput("Out", "The output of Reduce op");
AddAttr<int>("root",
             "root gpu of the parameter. if not set(-1). hashed by name.")
    .SetDefault(-1);
use a const value to represent -1, such as kInvalidGPUId
auto* comm = ctx.Input<Communicator>("Communicator");

auto stream = reinterpret_cast<const platform::CUDADeviceContext&>(
The returned value is a DeviceContext; the abstract class doesn't contain a stream interface.
paddle/operators/nccl_op.cu
Outdated
auto ins_names = ctx.Inputs("X");
std::hash<std::string> hasher;
for (size_t i = 0; i < ins.size(); ++i) {
  if (root == -1) {
replace -1 with a const value
paddle/framework/operator.h
Outdated
return device_context_;
}

//! Get variables vector with same input name.
I think "Get actual name vector for this input." is better.
std::vector<int> gpus = Attr<std::vector<int>>("gpus");
PADDLE_ENFORCE(!gpus.empty(), "Attr(gpus) should not be empty.");
platform::Communicator *comm =
    scope.FindVar(name)->GetMutable<platform::Communicator>();
Maybe add a check such as if (scope.FindVar(name) == nullptr) {...}, because this op doesn't have an InferShape to ensure the output is there.
paddle/operators/nccl_op.cu
Outdated
See the License for the specific language governing permissions and
limitations under the License. */

#define EIGEN_USE_GPU
auto outs = ctx.MultiOutput<LoDTensor>("Out");

std::string reduction = ctx.Attr<std::string>("reduction");
ncclRedOp_t reduction_op_ = ncclSum;
Where do ncclSum, ncclMax, and ncclProd come from?
NCCL has all four types of reduction operations: http://docs.nvidia.com/deeplearning/sdk/nccl-api/ncclapidoc.html#ncclredop_t
This operator can be used with the "reduction" attribute to indicate the operation.
paddle/operators/nccl_op.cu
Outdated
} else if (reduction == "ncclProd") {
  reduction_op_ = ncclProd;
} else {
  PADDLE_ENFORCE(false, "Invalid reduction. default ncclSum.");
Change PADDLE_ENFORCE(false, ...) to PADDLE_THROW.
paddle/operators/nccl_op_test.cu
Outdated
See the License for the specific language governing permissions and
limitations under the License. */

#define EIGEN_USE_GPU
paddle/operators/nccl_op_test.cu
Outdated
op2->SetInput("X", {"st"});
op2->SetInput("Communicator", {"comm"});
op2->SetOutput("Out", {"rt"});
op2->SetAttr("root", {kRoot});
I think this line should be

op2->SetAttr("root", kRoot);

template <class T>
void PerThreadProgram(int gpu_id, const f::OpDescBind &op_desc,
                      f::Scope *scope) {
  std::unique_lock<std::mutex> lk(mu);
Because we will call the GetMutable interface, which is not a thread-safe function.
In my understanding, each PerThreadProgram will run on an independent scope and place. In this situation, do we still have a thread-safety problem? Just want to make sure~
Yes, we did have.

If I removed this lock guard, I really got a segment fault. In my view,

T* GetMutable() {
  if (!IsType<T>()) {
    holder_.reset(new PlaceholderImpl<T>(new T()));
  }
  return static_cast<T*>(holder_->Ptr());
}

uses the global operator new to allocate memory for type T, and at this stage we do not have any guard on it.
The scope and place only separate the allocated pointers from each other, so our scope hierarchy only takes effect for the user's program built on the scope.
If we want the thread-safe feature, we need a lock around the new T, I think.
paddle/operators/nccl_op_test.cu
Outdated
}
}

// ncclAReduceOp with desc
ncclAReduceOp => ncclReduceOp
NCCL is a library of MPI-like primitives, which can be used to synchronize values between multiple GPU cards.
This is the simplest implementation of #3769; currently, we only support multiple GPU cards running the same topology, with parameters synchronized before parameter optimization.
To minimize the review job, this PR only implements the AllReduce operator, which will be used frequently to synchronize parameters/gradients between GPU cards.
We will leave the other operators, Gather/Bcast, for future work.
To support the NCCL library at the current refactoring stage, here is the brief plan:
Every GPU should run the same graph/blocks, and can only be synchronized at specific parameters/gradients.
This will be supported if the performance is a bottleneck.