Add Fuse AllReduce Pass And Adam Pass #15497

Closed
chengduoZH wants to merge 40 commits into PaddlePaddle:develop from chengduoZH:fuse_gradient_space

Conversation

@chengduoZH (Contributor):

No description provided.

@chengduoZH chengduoZH changed the title [WIP] Add Fuse Gradient Pass Add Fuse Gradient Pass Jan 29, 2019
@chengduoZH chengduoZH changed the title Add Fuse Gradient Pass Add Fuse AllReduce Pass Jan 29, 2019

// Add op fusion.
if (strategy.fuse_relu_depthwise_conv_) {
VLOG(10) << "Add fuse_relu_depthwise_conv_pass";
Contributor:

I'm thinking about putting the build strategy on the Python side, so that we can define strategies for different scenarios.

Contributor:

Can you put them in compiler.py?

Contributor Author:

Maybe all the fuse_xx_pass passes can be placed on the Python side.

auto iter = vars.find(p_g.second);
PADDLE_ENFORCE(iter != vars.end());

// Set Persistable to prevent this var become reusable.
Contributor:

Since grad vars are persistable, the memory allocation could be done at startup, so no additional op would be needed in the main program.

But we would still need to know the memory pieces in order to do the allreduce.

Contributor Author:

I put alloc_continuous_space_for_grad_op in RunOnlyOnceProgram, which is run in multi_device_pass; maybe that is better.
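The idea discussed above (one contiguous buffer holding every gradient, each at a fixed offset) can be sketched as follows. This is a minimal illustration, not Paddle's actual alloc_continuous_space implementation; the function name and alignment handling are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: given each gradient's size in elements, compute the
// offset of every gradient inside one contiguous fused buffer, rounding each
// chunk up so the next gradient starts on an `alignment`-element boundary.
std::vector<std::size_t> ComputeFusedOffsets(
    const std::vector<std::size_t>& sizes, std::size_t alignment) {
  std::vector<std::size_t> offsets;
  std::size_t cursor = 0;
  for (std::size_t s : sizes) {
    offsets.push_back(cursor);
    // Round the chunk size up to a multiple of `alignment`.
    cursor += (s + alignment - 1) / alignment * alignment;
  }
  return offsets;
}
```

With the offsets known at startup, a single allreduce over the fused buffer replaces one allreduce per gradient.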

chengduozh added 3 commits February 13, 2019 17:24
test=develop
test=develop
test=develop

// Add automatically inplace.
if (strategy_.enable_inplace_) {
VLOG(10) << "Add inplace_pass";
Contributor:

We can remove all these VLOGs; the pass names can be obtained and printed from the Python side.

std::vector<ir::Node *> opt_ops;
for (ir::Node *node : result.Nodes()) {
if (node->IsOp()) {
GetSpecifiedOpsAndVars(fuse_op_type, aux_var_names, node, &opt_ops,
Contributor:

We need to check whether all optimizers are the same, and only fuse gradient variables that belong to identical optimizers.

Contributor Author:

Done, thanks.
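The check requested above can be sketched as below. Plain strings stand in for the op types a real pass would read from `ir::Node` (e.g. `node->Op()->Type()`); the function name is illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: gradients are only fused when every optimizer op in
// the graph has the same type; a mix (e.g. adam and sgd) disables fusion.
bool AllOptimizersSameType(const std::vector<std::string>& opt_op_types) {
  if (opt_op_types.empty()) return false;
  for (const std::string& type : opt_op_types) {
    if (type != opt_op_types.front()) return false;
  }
  return true;
}
```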

for (auto &op_desc : program.Block(0).AllOps()) {
auto op = paddle::framework::OpRegistry::CreateOp(*op_desc);
VLOG(4) << op->DebugStringEx(local_scopes_[i]);
op->Run(*local_scopes_[i], places_[i]);
Contributor:

It's better to distinguish compile time from runtime and not run anything inside passes.

static_cast<int>(OpRole::kLoss)) &&
!loss_var_name_.empty(); // If loss_var is empty. This is test mode
static_cast<int>(
OpRole::kLoss)); // If loss_var is empty. This is test mode
Contributor:

We can remove the comment here too.

Contributor Author:

Done

}
}

void FuseAllReduceSSAGraphBuilder::CheckGraph(const ir::Graph &graph) const {
Contributor:

Why do we need to add a new FuseAllReduceSSAGraphBuilder? Maybe putting the fuse passes after the multi-device graph pass could reduce the number of builders here?

chengduozh added 3 commits February 17, 2019 13:52
test=develop
test=develop
@chengduoZH chengduoZH force-pushed the fuse_gradient_space branch 2 times, most recently from 79ad035 to 7b77ef9 Compare February 20, 2019 12:57
}

// for single card training, fuse_all_reduce_ops is unnecessary.
if (strategy.fuse_all_reduce_ops_) {
Contributor:

Does this really need a build_strategy flag? Can it be calculated automatically?

Contributor Author:

I think we should keep strategy.fuse_all_reduce_ops_ for now; the default value can be set to True later.

fuse_gradients = true;
}

if (strategy.fuse_all_optimizer_ops_) {
Contributor:

Same here. This flag should be calculated automatically.

}

if (strategy.fuse_all_optimizer_ops_) {
if (!fuse_gradients) {
Contributor:

This feels strange.

}
// NOTE: fuse_all_xx_ops will count the number of xx operator first,
// if the number is zero, fuse_all_reduce_ops will do nothing.
// Currently, only one type of optimization algorithm can be fused.
Contributor:

Which ones can't be fused? Could you say so in the doc?

Contributor Author:

In theory, all the optimizer ops that update dense parameters can be fused, but I have only implemented adam and sgd so far.
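The restriction described above (only adam and sgd have fused implementations in this PR) can be sketched as a simple whitelist check. The function name and the set's location are illustrative, not the PR's actual code.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical sketch: only optimizer op types that have a fused
// implementation qualify for fusion; per the discussion above, that is
// currently adam and sgd.
bool CanFuseOptimizer(const std::string& op_type) {
  static const std::set<std::string> kFusable = {"adam", "sgd"};
  return kFusable.count(op_type) > 0;
}
```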


bool fuse_elewise_add_act_ops_{false};

bool fuse_all_reduce_ops_{false};
Contributor:

These 3 flags are not needed.

Contributor Author:

We need those flags currently. Once we have tested those passes on more models and confirmed they make the program faster, we can set the default value to True.

#include "paddle/fluid/framework/details/variable_visitor.h"
#include "paddle/fluid/platform/profiler.h"

DEFINE_bool(skip_fused_all_reduce_check, false, "");
Contributor:

Could you add documentation for this flag?

SortParamsAndGrads(vars, &params_grads);
SetGroupGradsAndParams(vars, params_grads, &group_params_grads);

// Set Gradients as Persistable to prevent this var becoming reusable.
Contributor:

Can you put your pass after memory optimize so you don't need to make them persistable?

Contributor Author:

Maybe, but if I make them persistable, I don't need to care about the position of memory_opt_pass.

SetGroupGradsAndParams(vars, params_grads, &group_params_grads);

// Set Gradients as Persistable to prevent this var becoming reusable.
auto dtype = static_cast<proto::VarType::Type>(0);
Contributor:

Why the cast?

Contributor:

What is 0?

Contributor Author:

0 is framework::proto::VarType::Type::VarType_Type_BOOL; since the dtype of the input can never be BOOL, it is a safe initial value.
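The sentinel trick explained above can be sketched as below. Plain ints stand in for proto::VarType::Type values, and the function name is hypothetical; the point is that 0 (BOOL, impossible for a gradient) marks "no dtype seen yet", and all gradients must then agree on one dtype.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: verify all gradients share one dtype, using 0 (BOOL,
// never a valid gradient dtype) as the "not yet seen" sentinel.
// Returns the common dtype, or -1 if the dtypes are mixed.
int CheckSameDtype(const std::vector<int>& dtypes) {
  int seen = 0;  // sentinel: no dtype observed yet
  for (int d : dtypes) {
    if (seen == 0) seen = d;        // first gradient fixes the dtype
    if (d != seen) return -1;       // mixed dtypes: cannot fuse
  }
  return seen;
}
```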

}

// Create the fused variable name.
const std::string prefix(kFusedVarNamePrefix);
Contributor:

Is the prefix needed?

Contributor Author:

I think when we analyze a var's name, we can recognize the fused_var by its prefix. It also prevents name conflicts with existing names.
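The naming scheme described above can be sketched as follows. The prefix value and helper names here are illustrative; Paddle defines its own kFusedVarNamePrefix.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: build the fused variable's name from a reserved
// prefix, and recognize fused variables by that prefix later. The reserved
// characters make collisions with user-defined names unlikely.
const char kFusedVarNamePrefix[] = "@FUSEDVAR@";

std::string FusedVarName(const std::string& grad_var_name) {
  return std::string(kFusedVarNamePrefix) + grad_var_name;
}

bool IsFusedVar(const std::string& name) {
  // rfind with pos 0 only matches at the start of the string.
  return name.rfind(kFusedVarNamePrefix, 0) == 0;
}
```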

}
result.Get<RunOnlyOnceProgram>(kRunOnlyOnceProgram).emplace_back();
auto& program_desc =
result.Get<RunOnlyOnceProgram>(kRunOnlyOnceProgram).back();
Contributor:

Just run the operations in this pass; there's no need to save them for later, right?

Contributor Author:

Done

void ResetAttribute(const std::string& attr_name, ir::Graph* graph) const {
if (graph->Has(attr_name)) {
VLOG(10) << attr_name << " is reset.";
graph->Erase(attr_name);
Contributor:

Should this throw an error instead? It's the job of build_strategy to do this.

Contributor Author:

I think if this pass is run many times, the result should be the same.

OpProtoAndCheckerMaker::OpRoleAttrName())));
}

// NOTE: fused_var only exists in the scope, so the graph doesn't have a fused_var node
Contributor:

why?

// of all the optimizer ops separately.
// And alloc_continuous_space ops are placed in RunOnlyOnceProgram,
// which is executed before running the model with ParallelExecutor.
if (!result.Has(kRunOnlyOnceProgram)) {
Contributor:

Run it in your own pass.

}
}

if (need_collection_ops_) {
Contributor:

Don't use a class member here.

*/
AddOutputToLeafOps(&result);

/*
Contributor:

The following doesn't seem to be related to this pass.

constexpr char kPlaces[] = "places";
constexpr char kLocalScopes[] = "local_scopes";
constexpr char kStrategy[] = "strategy";
constexpr char kNRanks[] = "nranks";
Contributor:

Try to avoid exposing so many global names.


3 participants