
Fuse AllReduce #15921

Merged

chengduoZH merged 16 commits into PaddlePaddle:develop from chengduoZH:fuse_all_reduce on Mar 20, 2019

Conversation

@chengduoZH (Contributor) commented Feb 25, 2019

Code separated from #15497.
Fixes part of #16061.

|                 | ResNet 4 cards (s/batch) | ResNet 8 cards (s/batch) | Transformer 4 cards (step/s) | Transformer 8 cards (step/s) |
|-----------------|--------------------------|--------------------------|------------------------------|------------------------------|
| before          | 0.149                    | 0.1935                   | 3.276035                     | 2.255209                     |
| fuse all reduce | 0.0985                   | 0.177                    | 3.405812                     | 2.346058                     |

For ResNet:

export FLAGS_fuse_parameter_memory_size=131072
export FLAGS_fuse_parameter_groups_size=3

For Transformer:

export FLAGS_fuse_parameter_memory_size=131072
export FLAGS_fuse_parameter_groups_size=10
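The two flags above bound each fusion group by total memory and by gradient count. A minimal sketch of how such a partition could work (the function name and the greedy strategy here are illustrative assumptions, not Paddle's actual implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: partition gradients into fusion groups, closing a
// group once either FLAGS_fuse_parameter_memory_size (bytes) or
// FLAGS_fuse_parameter_groups_size (max gradients per group) is reached.
std::vector<std::vector<std::string>> GroupGrads(
    const std::vector<std::pair<std::string, size_t>> &grads,  // (name, bytes)
    size_t memory_size, size_t groups_size) {
  std::vector<std::vector<std::string>> groups;
  std::vector<std::string> cur;
  size_t cur_bytes = 0;
  for (const auto &g : grads) {
    cur.push_back(g.first);
    cur_bytes += g.second;
    // Close the current group once either limit is hit.
    if (cur_bytes >= memory_size || cur.size() >= groups_size) {
      groups.push_back(cur);
      cur.clear();
      cur_bytes = 0;
    }
  }
  if (!cur.empty()) groups.push_back(cur);
  return groups;
}
```

With `memory_size=131072` and `groups_size=3`, a large gradient closes a group early while many small gradients are capped at three per group, which matches the intent of the per-model flag settings above.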

test=develop
viz_pass->Set<std::string>("graph_viz_path", new std::string(graph_path));
}

if (strategy.fuse_elewise_add_act_ops_) {
Contributor:

Why delete this?

Contributor (Author):

It is not deleted, only moved here.

// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#include "paddle/fluid/framework/details/multi_devices_graph_pass.h"
Contributor:

This should come after the system header files.

}

bool MultiDevSSAGraphBuilderBase::DealWithSpecialOp(ir::Graph *result,
ir::Node *node) const {
Contributor:

Why add an empty function?


outputs.insert(outputs.end(), op_handle.Outputs().begin(),
op_handle.Outputs().end());
// Remove Input
Contributor:

Do we need to remove the pointer in kGraphVars attr?

Contributor (Author):

Only the original all_reduce nodes should be removed from the graph, and that is done at line 132.

test=develop
chengduozh added 2 commits March 5, 2019 16:59
@gongweibao (Contributor):

Do we need a unit test to make sure the fused allreduce on all trainers holds the same order?

@chengduoZH (Author):

@gongweibao all_reduce_deps_pass ensures the order of the allreduce operations.
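The key property the pass must guarantee is that every trainer issues the fused allreduces in an identical order, since collectives deadlock or corrupt data if ranks disagree. A minimal illustration of one way to obtain a trainer-independent order (sorting gradient names is an assumption for illustration, not necessarily what all_reduce_deps_pass does):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: canonicalize the gradient order by name so that every
// trainer builds identical fusion groups regardless of graph-traversal order.
std::vector<std::string> DeterministicOrder(std::vector<std::string> grads) {
  std::sort(grads.begin(), grads.end());
  return grads;
}
```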

@gongweibao (Contributor) commented Mar 14, 2019:

Another question: Where is the address alignment implemented?
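For context on what "address alignment" means here: when gradients are packed into a single fused buffer, each chunk's size is typically rounded up so the next chunk starts on an aligned offset. A hedged sketch of that rounding (the 256-byte alignment value and function name are illustrative assumptions):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch: round a chunk size up to an alignment boundary so the
// next tensor packed into the fused buffer starts at an aligned address.
// 256 bytes is a common device-memory alignment; the real value may differ.
constexpr size_t kAlignment = 256;

size_t AlignUp(size_t size, size_t alignment = kAlignment) {
  return (size + alignment - 1) / alignment * alignment;
}
```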

auto iter = vars.find(p_g.second);
PADDLE_ENFORCE(iter != vars.end());
PADDLE_ENFORCE_NOT_NULL(iter->second->Var());
PADDLE_ENFORCE_EQ(iter->second->Var()->GetType(),
Contributor:

Can we just skip the other types for now?

Contributor (Author):

At this stage the type is required to be LoDTensor; this may be improved in a later stage.

// Run Only Once Programs
for (size_t i = 0; i < local_scopes.size(); ++i) {
for (auto &op_desc : program_desc.Block(0).AllOps()) {
auto op = OpRegistry::CreateOp(*op_desc);
Contributor:

Maybe we can add a new tensor member to avoid the fused tensor being resized.

Contributor (Author):

fused_all_reduce_op_handle has an address check, so that is unnecessary.
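The address check referred to here would catch exactly the resize problem the reviewer raised: if the fused tensor is reallocated, the sub-tensors no longer sit contiguously inside it. A hedged sketch of such a check (the function name and signature are illustrative, not Paddle's actual code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the kind of address check fused_all_reduce_op_handle
// performs: each sub-tensor must start exactly where the previous one ends,
// i.e. the chunks are still contiguous inside the fused buffer. If the fused
// tensor had been resized or reallocated, this invariant would break.
bool AddressesContiguous(const std::vector<const float *> &ptrs,
                         const std::vector<size_t> &lens) {
  for (size_t i = 0; i + 1 < ptrs.size(); ++i) {
    if (ptrs[i] + lens[i] != ptrs[i + 1]) return false;
  }
  return true;
}
```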

chengduozh added 2 commits March 18, 2019 16:24
@gongweibao (Contributor) left a comment:

LGTM+++
Gradient fusion is very useful for multi-machine training communication. Thanks!

@chengduoZH chengduoZH merged commit f26ba5b into PaddlePaddle:develop Mar 20, 2019