
Add Stream for fetch op handle #16600

Merged

chengduoZH merged 2 commits into PaddlePaddle:develop from chengduoZH:add_delay_ops_for_threaded_executor on Apr 2, 2019

Conversation

@chengduoZH (Contributor)

No description provided.

@chengduoZH force-pushed the add_delay_ops_for_threaded_executor branch from 7e0a41c to f2bed8f on April 1, 2019 15:07
@chengduoZH force-pushed the add_delay_ops_for_threaded_executor branch from f2bed8f to 90b3e94 on April 1, 2019 15:32
test=develop
@chengduoZH force-pushed the add_delay_ops_for_threaded_executor branch from 747322e to 1804b19 on April 2, 2019 01:36
 #ifdef PADDLE_WITH_CUDA
-    TensorCopySync(t, cpu, &tensors_[i]);
+    TensorCopy(t, cpu, *dev_ctxes_.at(t.place()), &tensors_[i]);
+    dev_ctxes_.at(t.place())->Wait();
@chengduoZH (Contributor, Author) commented on the diff:
Don't use the default stream here; it may slow the program down. https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
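The point of the diff above, sketched in standalone CUDA C++ (this is not Paddle's code; names and sizes are illustrative): the legacy default stream implicitly synchronizes with other blocking streams, so a fetch copy issued on it can stall kernels that are still running, whereas a copy issued on the op's own stream only needs to wait on that stream.

```cuda
#include <cuda_runtime.h>

int main() {
  const size_t n = 1 << 20;
  float *dev = nullptr, *host = nullptr;
  cudaMalloc(&dev, n * sizeof(float));
  cudaMallocHost(&host, n * sizeof(float));  // pinned memory, required for async copy

  // Slower pattern: a copy on the default stream (stream 0) synchronizes
  // with all other blocking streams on the device, serializing work.
  cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);

  // Pattern this PR adopts: issue the copy on a dedicated stream and wait
  // only on that stream, analogous to TensorCopy(...) followed by
  // dev_ctxes_.at(t.place())->Wait() in the diff.
  cudaStream_t s;
  cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
  cudaMemcpyAsync(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost, s);
  cudaStreamSynchronize(s);  // the equivalent of DeviceContext::Wait()

  cudaStreamDestroy(s);
  cudaFreeHost(host);
  cudaFree(dev);
  return 0;
}
```

This requires a CUDA-capable GPU to run; it is only meant to show why the fetch copy was moved off the default stream.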

 void ReduceSSAGraphBuilder::InsertPostprocessOps(ir::Graph *result) const {
   if (UseGPU()) {
-    if (strategy_.fuse_broadcast_op_) {
+    if (strategy_.fuse_broadcast_ops_) {
@Yancey0623 (Contributor), Apr 2, 2019:
Maybe we need to let users know when to turn this flag on? If it always works better with the Reduce strategy, please keep it on. In addition, we can update the best practices for GPU distributed training.

@chengduoZH (Contributor, Author):
Maybe we need to let users know when to turn this flag on?

I have added the doc for fuse_broadcast_ops in pybind.cc.

In addition, we can update the best practices for GPU distributed training.

I quite agree with you that we should have a best-practices doc.
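For context, a hedged sketch of how a user would turn the renamed flag on from the Python side, assuming the fluid 1.x `BuildStrategy` API that the pybind.cc doc in this PR targets (the `ParallelExecutor` wiring is illustrative, not taken from this PR):

```python
import paddle.fluid as fluid

build_strategy = fluid.BuildStrategy()
# fuse_broadcast_ops pairs with the Reduce strategy, where each device
# broadcasts the parameters it owns after updating them; fusing those
# broadcasts reduces launch overhead.
build_strategy.reduce_strategy = fluid.BuildStrategy.ReduceStrategy.Reduce
build_strategy.fuse_broadcast_ops = True

# parallel_exe = fluid.ParallelExecutor(
#     use_cuda=True, loss_name=loss.name, build_strategy=build_strategy)
```

This is the kind of usage the requested best-practices doc would spell out.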

@Yancey0623 (Contributor) left a review:
lgtm

@chengduoZH chengduoZH merged commit b75a69b into PaddlePaddle:develop Apr 2, 2019

2 participants