Support memory eager deletion on recurrent OP #17710
zhhsplendid merged 27 commits into PaddlePaddle:develop from
Conversation
namespace operators {

const framework::VariableNameMap& OpVariant::Inputs() const {
  return *boost::apply_visitor(InputsVisitor(), op_);
How about moving InputsVisitor, OutputsVisitor, and AttributeMapVisitor to this source file as well?
You can also move RawPointerVisitor to this source file.
include(operators)
register_operators(DEPS naive_executor)
cc_library(while_op_helper SRCS while_op_helper.cc DEPS operator)
cc_library(op_variant SRCS op_variant.cc DEPS operator)
Should the target op_variant depend on proto_desc as well? op_desc.cc is compiled in the proto_desc target.
Do you mean program_desc.cc? I use program_desc.h, not op_desc.h. But there is no difference, because they are both compiled in proto_desc.
namespace paddle {
namespace operators {

using paddle::framework::OperatorBase;
It is not good to expose OperatorBase without a namespace inside a header file.
auto &attrs = const_cast<framework::AttributeMap &>(op.Attrs());
VLOG(2) << "Prepare to skip " << attr.size()
        << " var(s): " << GetDebugString(attr);
        << " var(s): " << paddle::string::join_strings(attr, ' ');
You can simplify this as string::join_strings(attr, ' ').
executor.Prepare(*program, block->ID(),
                 Attr<std::vector<std::string>>(
                     kSkipEagerDeletionVars) /*skip_ref_cnt_vars*/,
                 false /*force_disable_gc*/);
Remove the last parameter false.
using paddle::operators::OpAndGradOpPair;

// Pass class set skip eager deletion vars for recurrent ops
class RecurrentOpEagerDeletionPass : public ir::Pass {
RecurrentOpEagerDeletionPass can be placed inside recurrent_op_eager_deletion_pass.cc; therefore, this header file is unnecessary.
Can I keep it? In general I prefer that every .cc file have an associated .h file, except for some special cases.
for (const std::string &name : output_vars) {
  fwd_skip_vars.insert(name);
}
SetSkipVars(fwd_op, fwd_skip_vars);
Should kInitialStates be skipped too? See here.
Discussed offline; maybe it doesn't have to be skipped.
}
PADDLE_ENFORCE_NOT_NULL(matched_fwd_op, "Cannot find matched forward op");
SetRecurrentOpAndRecurrentGradOpSkipVarAttr(*matched_fwd_op, bwd_op);
recurrent_ops.erase(*matched_fwd_op);
What if there are remaining forward recurrent ops that have no gradient? You should also set skip vars in those ops.
PADDLE_ENFORCE_EQ(
    fwd_input.size(), in_grads.size(),
    "Backward input gradient number does not match forward input number.");
for (size_t i = 0; i < in_grads.size(); ++i) {
This seems wrong. You should review the code of recurrent_op.cc to find out which variables should be skipped.
Discussed offline; it may not be wrong.
@@ -0,0 +1,460 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
As commented above, please add corresponding unit tests:
- Run using ParallelExecutor.
- There are duplicate recurrent ops in the graph, even when one is nested inside another recurrent op. See the nested while op tests inside here.
- There are recurrent ops with gradient and without gradient.
- Add unit tests of the ptb model. See here.
- Any other corner cases that should be covered.
// See the License for the specific language governing permissions and
// limitations under the License.

#pragma once
Recommend removing this header file.
I will reply to you offline.
namespace paddle {
namespace operators {
/*
constexpr char RecurrentBase::kInputs[];
Remove unused code instead of commenting it out.
Sorry, this is something I forgot to remove.
… rnn_op
It aims to handle the CI error that it cannot find a reference to constexpr char[].
Tested PaddingRNN on a V100 GPU device.
Test configuration: large model, padding mode (the mode that uses recurrentOp), one GPU.
GPU memory (MiB): 6414 (this PR) vs 6837 (without this PR)
Speed (steps/s): 10.28 (this PR) vs 9.89 (without this PR)
* Support memory eager deletion on recurrent OP (#17710)
  Tested PaddingRNN on a V100 GPU device. Test configuration: large model, padding mode (the mode that uses recurrentOp), one GPU.
  GPU memory (MiB): 6414 (this PR) vs 6837 (without this PR)
  Speed (steps/s): 10.28 (this PR) vs 9.89 (without this PR)
* Fix random test_recurrent_op failure (#18718)
  The change includes 3 things:
  1. Set CPU_NUM to 1 in the tests, because otherwise the ParallelExecutor prints a warning that CPU_NUM is not set and uses the default of 1.
  2. The old tests compare two RNNs, a hand-written simple RNN and the same RNN built by Paddle, but initialized the RNN weights with numpy random and Paddle random separately. Fixed by setting explicit weight and bias values.
  3. Also set the numpy random seed in the tests. Now the diff between the two RNNs can be smaller (rtol from 0.1 and 0.2 down to 0.01) in the tests.