overlap rpc op memcpy in distributed training #11221
Yancey0623 merged 14 commits into PaddlePaddle:develop

Conversation
```cpp
std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
    const ProgramDesc &program) const {
  VLOG(3) << "Building ....";
```
Can remove this debugging log
```cpp
bool is_forwarding = true;
int rpc_op_device_id = 0;
auto schedule_rpc_op = [&]() -> void {
```
Can this be changed to use get_appropriate_dev, so that when using the parallel executor strategy "Reduce", the variable to send is the same as the variable reduced on that device? @chengduoZH am I right?
After discussing with @typhoonzero @chengduoZH @panyx0718, we use get_appropriate_dev to schedule the rpc op.
```cpp
}
CreateDistTrainOp(&result, *op, rpc_op_device_id);
}
if (op->Type() == "concat") {
```
else if? Otherwise CreateDistTrainOp will be called again in the else below.
Sorry, this should be else if.
```cpp
auto got = remote_vars_devices_.find(op->InputArgumentNames()[0]);
if (got == remote_vars_devices_.end()) {
  schedule_rpc_op();
} else {
```
When is the else branch triggered? I guess you want to round-robin the send devices?
I don't quite remember: in Reduce mode, is a gradient calculated on one device instead of all devices? Should the send device match that device?
DistributedTranspiler splits a parameter into several parameter blocks, but not every parameter is split: if a parameter is small enough, we don't split it. So the op pipeline looks like (split_byref)->send->recv->(concat).
For the true branch of the if statement: there is no split_byref operator before send, so we need to schedule send to the right device ourselves.
For the false branch: split_byref has already been scheduled to a device, and send should be scheduled to the same device as split_byref.
```cpp
  rpc_op_device_id = got->second;
}
CreateRPCOp(&result, *op, rpc_op_device_id);
} else if (op->Type() == "recv") {
```
Will only one device perform the broadcast in Reduce mode? So recv should be done on that device before the broadcast? Perhaps take a look at get_appropriate_dev; I'm not quite sure of the details.
I'll take a look at get_appropriate_dev and figure out how it relates to this PR.
```cpp
platform::dynload::ncclBcast(buffer, numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr &&
```
Not sure this change is needed. It seems BCastParamsToGPUs is only called once at the beginning to broadcast parameters to each device; it's not used during training?
It is used during training, at the end of each mini-batch, for example in this code snippet:
Paddle/benchmark/fluid/fluid_benchmark.py, lines 363 to 365 at 831909c
```cpp
if (op->Type() == "send_vars") {
  int op_dev_id = GetVarDeviceID(op->InputArgumentNames()[0]);
  if (op_dev_id == -1) {
    op_dev_id = get_appropriate_dev(op->InputArgumentNames());
```
@Yancey1989 as we discussed, one concern: the order of get_appropriate_dev calls must be the same for reduce and split_op, or the device id chosen for a variable may differ.
```cpp
 private:
  BuildStrategy strategy_;
  mutable std::unordered_map<std::string, VarDesc *> all_vars_;
  mutable std::unordered_map<std::string, int> var_name_on_devices_;
```
We should not use an unordered_map to record var_name on devices, because the same var_name may live on different devices.
It may not. This map does not record all variables; it is only used for the Reduce strategy and distributed training.
For the Reduce strategy, we schedule each Reduce op on a different device and record the gradient variable name in var_name_on_devices_, so a name appears on only one device.
For distributed training, likewise, we schedule send_op and recv_op on different devices, so a variable name also never appears on more than one device.
panyx0718 left a comment:
I feel there are some things we could implement more cleanly, but I guess we can do them as a follow-up.
```cpp
std::unordered_map<std::string, VarDesc *> all_vars;
for (auto *var : program.Block(0).AllVars()) {
  all_vars[var->Name()] = var;
  all_vars_.emplace(var->Name(), var);
```
Does emplace have different semantics from []? If it's not necessary, let's keep it the same.
Not much; it just avoids an unnecessary copy, but it makes no difference here.
```cpp
platform::dynload::ncclBcast(buffer, numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr && builder_->GetVarDeviceID(var) != -1) {
```
builder_.get() != nullptr -> builder_
```cpp
platform::dynload::ncclBcast(buffer, numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());

if (builder_.get() != nullptr && builder_->GetVarDeviceID(var) != -1) {
```
I feel that GetVarDeviceID should probably be a method of the built graph or the executor; that avoids making builder_ a private member. But I guess it's OK to leave it as a TODO for now.
```cpp
  return dev_id;
}

std::unique_ptr<SSAGraph> MultiDevSSAGraphBuilder::Build(
```
After this change, Build() can only be called once? Do we want to clear "balanced_vars", "all_vars", etc. at the beginning of Build()?
In general, let's pass variables to methods instead of making them class private members. When variables are private members, we need to be careful about when to clear them.
It's a good idea. I moved them to private members just because I want to expose them, and it can also fix #11593.
some acc test:

```
Pass = 0, Elapsed = 41, Training performance = 146.812973 imgs/s, Train accuracy = 0.026151, Test accuracy = 0.010417
Pass = 1, Elapsed = 39, Training performance = 152.322642 imgs/s, Train accuracy = 0.041118, Test accuracy = 0.007292
Pass = 2, Elapsed = 39, Training performance = 153.547463 imgs/s, Train accuracy = 0.043750, Test accuracy = 0.008333
```

local=0, batch_size=80, trainers=2, pservers=2:

```
Pass = 0, Elapsed = 127, Training performance = 47.504918 imgs/s, Train accuracy = 0.023849, Test accuracy = 0.009375
Pass = 1, Elapsed = 121, Training performance = 50.227834 imgs/s, Train accuracy = 0.024013, Test accuracy = 0.010417
Pass = 2, Elapsed = 120, Training performance = 50.489441 imgs/s, Train accuracy = 0.026974, Test accuracy = 0.010417
Pass = 3, Elapsed = 120, Training performance = 50.426222 imgs/s, Train accuracy = 0.028125, Test accuracy = 0.011458
```
Fixed #11143
After parallel bcast, this PR improves performance by about 15% on vgg + flowers + 2 trainers + 2 pservers.
overlap memcpy branch
develop branch
The improvement should be even better on resnet, because its parameter size is smaller than vgg's.