Fix NCCLBcast hang up bug in Parallel Executor #11377
typhoonzero merged 4 commits into PaddlePaddle:develop from
Conversation
2. Check the memory usage of ALL gpus rather than the first one
…clGroupEnd blocking the exception throwing 2. NOTE the usage of NCCLGroupGuard
member_->use_cuda_ = exec_strategy.use_cuda_;

// Step 1. Bcast the params to devs.
// Step 1. check all memory usages of all places.
So this code is not needed?

I think these messages should still be printed out, so I didn't remove them.
}
}

// NOTE(minqiyang): according to the ncclGroupEnd documentations:
Well, I think we can assume that people who develop PaddlePaddle in this file are familiar with the NCCL primitives.

To avoid this bug happening again, I will leave these comments in.
platform::dynload::ncclBcast(buffers[i], numel, data_type, 0,
                             nccl_ctx.comm_, nccl_ctx.stream());
}
member_->nccl_ctxs_->WaitAll();
This line may not be needed, since ncclGroupEnd will sync all the group calls?

@chengduoZH Can you please help take a look at this?
@typhoonzero line 185 is necessary:
- ncclGroupEnd only ensures that a single thread at a time can invoke NCCL functions; it does not sync the calls.
- The invocation of NCCL functions is asynchronous, so we need WaitAll() to ensure the calls have completed on the GPU side.
// This step will create BuddyAllocator for each place, which will
// 1. enforce that the place is available and NOT used.
// 2. avoid ncclBcast hanging up for NOT enough memory to use.
for (size_t i = 0; i < member_->places_.size(); ++i) {
Are these lines intended to let the process fail fast?

Yes, and to print some useful messages.
@@ -145,9 +156,9 @@ void ParallelExecutor::BCastParamsToGPUs(
auto &dims = main_tensor.dims();
if (paddle::platform::is_gpu_place(main_tensor.place())) {
#ifdef PADDLE_WITH_CUDA
I don't think the modification of lines 159~167 is necessary. If the memory of one Place is insufficient, the program will throw an exception at this line.

Actually, the thrown exception will not be handled properly; this PR was submitted to fix that bug.
size_t usage = memory::memory_usage(member_->places_[i]);
VLOG(4) << "Memory usage of device: " << member_->places_[i] << " is "
        << usage << " bytes";
}
This check should only be done once, so placing it here may not be appropriate.

Because the BuddyAllocator array is a function-local static at this line, the check can run many times in the ParallelExecutor constructor without initializing duplicate BuddyAllocators.
#include "paddle/fluid/framework/details/scope_buffered_ssa_graph_executor.h"
#include "paddle/fluid/framework/details/ssa_graph_builder_factory.h"
#include "paddle/fluid/framework/details/threaded_ssa_graph_executor.h"
#include "paddle/fluid/memory/malloc.h"
member_->use_cuda_ = exec_strategy.use_cuda_;

// Step 1. Bcast the params to devs.
// Step 2. Bcast the params to devs.
This closes #11375.