Add async ssa graph executor communicator #16172
jacquesqiao merged 119 commits into PaddlePaddle:develop
… add-async-ssa-graph-executor test=develop
… add-async-ssa-graph-executor test=develop
… add-async-ssa-graph-executor test=develop
… add-async-ssa-graph-executor test=develop
… add-async-ssa-graph-executor-communicator test=develop
… add-async-ssa-graph-executor-communicator test=develop
| bool DealWithSpecialOp(ir::Graph *result, ir::Node *node) const override { | ||
| if (node->Op()->Type() == "recv") { | ||
| VLOG(1) << "set recv op do_not_run to true"; |
The log level could be higher.
This log is only emitted during the init phase of the async SSA graph executor. It is used to make sure the executor works properly.
| VLOG(1) << "set recv op do_not_run to true"; | ||
| node->Op()->SetAttr("do_not_run", true); | ||
| node->Op()->Flush(); | ||
| } else if (node->Name() == "lookup_table" || node->Name() == "nce" || |
Hard-coded op names are not recommended.
This is not hard-coding: the DealWithSpecialOp method exists to change certain special nodes in the graph, such as send/recv.
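For readers following the thread, here is a minimal sketch of the pattern being discussed: a pass inspects each op and flags special ones so the executor skips them. `ToyOp`, its `bool_attrs` map, and the meaning of the return value are assumptions made for illustration; they are not Paddle's `ir::Node` or `OpDesc` API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for an op node; not Paddle's ir::Node.
struct ToyOp {
  std::string type;
  std::map<std::string, bool> bool_attrs;
};

// Flags "recv" ops with do_not_run = true. Returns true when the op was
// treated as special (an assumed convention), false otherwise.
bool DealWithSpecialOp(ToyOp* op) {
  if (op == nullptr) return false;  // defensive check on the input pointer
  if (op->type == "recv") {
    op->bool_attrs["do_not_run"] = true;
    return true;
  }
  return false;
}
```

Keeping the special-case names inside one method like this localizes them, which is the point the author makes in the reply.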
paddle/fluid/framework/scope.h
Outdated
| /// Mark it to const because that new kid scope cannot change parent scope. | ||
| Scope& NewScope() const; | ||
| Scope* NewTmpScope() const; |
What does NewTmpScope mean?
I have already added some comments.
| } | ||
| } | ||
| // note!! only support sync send now |
| @@ -0,0 +1,203 @@ | |||
| // Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved. | |||
| if (!build_strategy.async_mode_) { | ||
| member_->executor_.reset(new details::ScopeBufferedSSAGraphExecutor( | ||
| exec_strategy, member_->local_scopes_, std::move(var_infos), | ||
| member_->places_, std::move(member_->executor_))); |
Why is ScopeBufferedSSAGraphExecutor not used in async_mode?
ScopeBufferedSSAGraphExecutor would slow down async_ssa_graph_executor and cause many problems.
| size_t num_iteration_per_drop_scope_{1}; | ||
| ExecutorType type_{kDefault}; | ||
| bool dry_run_{false}; | ||
| size_t num_iteration_per_run_{1}; // only use with async_ssa_graph_executor |
Used here: https://github.com/PaddlePaddle/Paddle/pull/16172/files#diff-bcb7058cf667aba60603c4448e6180c8R131
It runs multiple steps per exe.run call to improve performance.
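A rough sketch of the idea in the reply above, assuming that one logical run call simply repeats the step `num_iteration_per_run_` times. `ToyExecutionStrategy` and `RunOnce` are hypothetical names for illustration; the actual executor code is at the link above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical strategy struct; only the field discussed above is modeled.
struct ToyExecutionStrategy {
  std::size_t num_iteration_per_run_{1};
};

// Executes `step` num_iteration_per_run_ times for one logical run call
// and returns how many steps were executed.
std::size_t RunOnce(const ToyExecutionStrategy& strategy,
                    const std::function<void()>& step) {
  std::size_t executed = 0;
  for (std::size_t i = 0; i < strategy.num_iteration_per_run_; ++i) {
    step();
    ++executed;
  }
  return executed;
}
```

Batching several steps behind one call amortizes the per-call overhead, which is presumably where the performance gain comes from.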
| "gpu mode does not support async_mode_ now!"); | ||
| graphs.push_back(graph); | ||
| for (int i = 1; i < places.size(); ++i) { | ||
| auto *tmp_graph = new ir::Graph(graph->OriginProgram()); |
The OriginProgram may differ slightly from the current graph.
| {member_->local_scopes_[i]}, 1, | ||
| member_->use_cuda_, member_->nccl_ctxs_.get()); | ||
| async_graphs[i] = graphs[i]; | ||
| } |
Why are i = 0 and i > 0 handled separately here?
This code can be optimized later.
| member_->local_scopes_, member_->nranks_, | ||
| if (build_strategy.async_mode_) { | ||
| VLOG(3) << "use local async mode"; | ||
| graph = build_strategy.Apply(graph, {member_->places_[0]}, loss_var_name, |
Line 224 says "gpu mode does not support async_mode_ now!", so why add this code under #if defined(PADDLE_WITH_CUDA) && !defined(_WIN32)?
GPU support will be added in the future. This code is there to make sure compilation works.
| copy_memory(); | ||
| } else { | ||
| t->ShareDataWith(main_tensor); | ||
| share_memory(); |
Why did you make the above modification?
For readability and code reuse.
| std::vector<ir::Node *> nodes_to_delete; | ||
| for (auto &node : graphs[i]->Nodes()) { | ||
| VLOG(3) << "node name " << node->Name(); | ||
| if (node && node->IsOp()) { |
Why might node be nullptr here?
I have hit this problem before, so I added an extra check to make sure it works.
paddle/fluid/framework/scope.cc
Outdated
| return *child; | ||
| } | ||
| Scope* Scope::NewTmpScope() const { return new Scope(this); } |
NewTmpScope should at least return a unique_ptr.
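A minimal sketch of what this suggestion could look like, using a toy `Scope` that models only the parent link. This is an assumption about the shape of the change, not the code that was eventually merged.

```cpp
#include <cassert>
#include <memory>

// Toy stand-in for paddle::framework::Scope; only the parent link is modeled.
class Scope {
 public:
  Scope() = default;

  // Returns a temporary child scope. Ownership passes to the caller, so the
  // child is destroyed automatically when the unique_ptr goes out of scope.
  std::unique_ptr<Scope> NewTmpScope() const {
    return std::unique_ptr<Scope>(new Scope(this));
  }

  const Scope* parent() const { return parent_; }

 private:
  explicit Scope(const Scope* parent) : parent_(parent) {}
  const Scope* parent_{nullptr};
};
```

Compared with returning a raw `Scope*`, the return type itself documents that the caller owns the temporary scope, and a forgotten `delete` can no longer leak it.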
chengduoZH left a comment
Approved for the changes to scope.cc.
| std::string name_; | ||
| proto::VarType::Type type_; | ||
| bool persistable_; | ||
| }; |
It is suggested that the code be polished.