
Fix training validation convergence - v1.4 bug fix#16698

Merged
tensor-tang merged 1 commit into PaddlePaddle:develop from baojun-nervana:fix_training
Apr 9, 2019

Conversation

@baojun-nervana
Contributor

@baojun-nervana baojun-nervana commented Apr 7, 2019

This is a bug fix for the v1.4 release (the "overfitting" issue).
For inference, the intermediate MKL-DNN layout is cached for a performance boost. For training, this caching must be disabled because the weights are updated every iteration.

This fix disables saving the intermediate layout so that validation is performed correctly, which resolves the "overfitting" issue. I will send the updated convergence curve for review once I have collected more data points.

Another change in this PR is that the "read" op is now included when checking inputs, since py_reader uses a read op instead of a feed op.

CC: @mozga-intel @jianhang-liu

@tensor-tang
Contributor

When will you send the latest convergence curve?

The bug fix needs to be cherry-picked to the release/1.4 branch.

```diff
   bool is_persistable =
       (p_persistables->find(vi) != p_persistables->end()) ? true : false;
-  if (is_test && is_persistable) {
+  if (!is_training && is_test && is_persistable) {
```
Contributor

Aren't `is_training` and `is_test` redundant here?

Contributor Author

This can be simplified, but that would require changes elsewhere as well. I am trying to minimize the change at this late stage of the release, since a larger change would take more time to review and validate. I will follow up after the release to remove the duplication.


```diff
-  while (left < size && ops->at(left)->Type() == framework::kFeedOpType) {
+  while (left < size && (ops->at(left)->Type() == framework::kFeedOpType ||
+                         ops->at(left)->Type() == "read")) {
```
Contributor

Is adding `ops->at(left)->Type() == "read"` sufficient to cover all the input ops?


Contributor Author

We can expand this when we encounter other cases, but we would need to know and understand those cases first. So far this handles the use cases we know of.


Contributor

@tensor-tang tensor-tang left a comment

LGTM

The comments can be addressed in the develop branch.

@tensor-tang tensor-tang merged commit 1c8b34d into PaddlePaddle:develop Apr 9, 2019
@baojun-nervana baojun-nervana deleted the fix_training branch April 9, 2019 04:07
colourful-tree added a commit to colourful-tree/Paddle that referenced this pull request Apr 9, 2019
fix training validation test=develop (PaddlePaddle#16698)