Skip to content

Add async ssa graph executor communicator#16172

Merged
jacquesqiao merged 119 commits intoPaddlePaddle:developfrom
jacquesqiao:add-async-ssa-graph-executor-communicator
Apr 2, 2019
Merged

Add async ssa graph executor communicator#16172
jacquesqiao merged 119 commits intoPaddlePaddle:developfrom
jacquesqiao:add-async-ssa-graph-executor-communicator

Conversation

@jacquesqiao
Copy link
Member

No description provided.

@jacquesqiao jacquesqiao requested a review from seiriosPlus March 31, 2019 00:04
@jacquesqiao jacquesqiao requested a review from chengduoZH March 31, 2019 07:57
… add-async-ssa-graph-executor-communicator

test=develop
… add-async-ssa-graph-executor-communicator

test=develop

bool DealWithSpecialOp(ir::Graph *result, ir::Node *node) const override {
if (node->Op()->Type() == "recv") {
VLOG(1) << "set recv op do_not_run to true";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log level may be higher.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log will only be called in the init phase of async ssa graph ssa graph executor. Used to make sure it work properly.

VLOG(1) << "set recv op do_not_run to true";
node->Op()->SetAttr("do_not_run", true);
node->Op()->Flush();
} else if (node->Name() == "lookup_table" || node->Name() == "nce" ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard code are not recommend.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not hard code, this method DealWithSpecialOp is used to change some special node in the graph like send/recv.

/// Mark it to const because that new kid scope cannot change parent scope.
Scope& NewScope() const;

Scope* NewTmpScope() const;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the meaning of NewTmpScope ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already add some comments.

}
}

// note!! only support sync send now
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add Todo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -0,0 +1,203 @@
// Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2018 -> 2019

if (!build_strategy.async_mode_) {
member_->executor_.reset(new details::ScopeBufferedSSAGraphExecutor(
exec_strategy, member_->local_scopes_, std::move(var_infos),
member_->places_, std::move(member_->executor_)));
Copy link
Contributor

@chengduoZH chengduoZH Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you not use ScopeBufferedSSAGraphExecutor in async_mode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ScopeBufferedSSAGraphExecutor will slow done async_ssa_graph_executor and cause many problem.

size_t num_iteration_per_drop_scope_{1};
ExecutorType type_{kDefault};
bool dry_run_{false};
size_t num_iteration_per_run_{1}; // only use with async_ssa_graph_executor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/PaddlePaddle/Paddle/pull/16172/files#diff-bcb7058cf667aba60603c4448e6180c8R131
used here, will run multi steps when call exe.run to improve performance.

"gpu mode does not support async_mode_ now!");
graphs.push_back(graph);
for (int i = 1; i < places.size(); ++i) {
auto *tmp_graph = new ir::Graph(graph->OriginProgram());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OriginProgram may be a bit different from the current graph.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What different?

{member_->local_scopes_[i]}, 1,
member_->use_cuda_, member_->nccl_ctxs_.get());
async_graphs[i] = graphs[i];
}
Copy link
Contributor

@chengduoZH chengduoZH Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you process i = 0 and i > 0 respectively here?

Copy link
Member Author

@jacquesqiao jacquesqiao Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code can be optimized later.

member_->local_scopes_, member_->nranks_,
if (build_strategy.async_mode_) {
VLOG(3) << "use local async mode";
graph = build_strategy.Apply(graph, {member_->places_[0]}, loss_var_name,
Copy link
Contributor

@chengduoZH chengduoZH Apr 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As line 224 says that gpu mode does not support async_mode_ now!, so why did you add code in #if defined(PADDLE_WITH_CUDA) && !defined(_WIN32)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will support in the future. This code is used to make sure the compiling works right.

copy_memory();
} else {
t->ShareDataWith(main_tensor);
share_memory();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you make the above modification?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For better understanding and code share.

seiriosPlus
seiriosPlus previously approved these changes Apr 1, 2019
Copy link
Collaborator

@seiriosPlus seiriosPlus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

std::vector<ir::Node *> nodes_to_delete;
for (auto &node : graphs[i]->Nodes()) {
VLOG(3) << "node name " << node->Name();
if (node && node->IsOp()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the node maybe nullptr here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have met this problem before, so I add more check to ensure it works right.

return *child;
}

Scope* Scope::NewTmpScope() const { return new Scope(this); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NewTmpScope should return a unique_ptr at least.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@chengduoZH chengduoZH requested a review from panyx0718 April 1, 2019 03:45
Copy link
Contributor

@chengduoZH chengduoZH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve for the changing of scope.cc.

std::string name_;
proto::VarType::Type type_;
bool persistable_;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate with here.

@chengduoZH
Copy link
Contributor

It is suggested that the code should be polished.

@jacquesqiao jacquesqiao merged commit 21622ca into PaddlePaddle:develop Apr 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants