
[MKL-DNN] Tensor modifications revert #16462

Merged
tensor-tang merged 4 commits into PaddlePaddle:develop from jczaja:prv-tensor-modifications-revert
Mar 28, 2019

Conversation

@jczaja
Contributor

@jczaja jczaja commented Mar 26, 2019

This PR removes all tensor modifications. While preparing it, there were conflicts in the Concat MKLDNN op, as that op had started to use the new API introduced by the tensor modifications. Hence @xiaolil1, could you please take a look at the relevant changes?

As for performance results: we tested (so far, only ResNet-50) on CLX, comparing this PR against this PR with an older MKL-DNN. The PR with the older MKL-DNN shows that FPS was restored to a level matching the one before the regressions. As for this PR as-is (including MKL-DNN 0.18), performance is poorer by ~3 FPS (120 -> 117, ResNet-50 FP32).

@xiaolil1 Please check this PR for performance and let us know your findings.

To make this revert, I used `git revert` on the three commits that introduced the tensor modifications.
Manually I had to:

  • resolve a conflict in the Concat MKLDNN op;
  • keep the enabling of the transformer unit test from the third commit (which fixed and enabled the transformer for MKL-DNN), while reverting the fix itself, as it was related to the tensor modifications;
  • update the Transpose MKLDNN op to label its output tensor with the kNCHW layout, to avoid ending up with an mkldnn_blocked format.
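The revert workflow described above can be sketched in a self-contained demo. This is a hedged illustration using a throwaway repository (the file names and commit messages here are hypothetical, not the actual PaddlePaddle commits); `git revert` adds a new commit that undoes a change while preserving history, which is why conflicts such as the Concat op one must be resolved by hand when later commits touched the same code.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

# Base state, then a commit representing the tensor modifications.
echo "original code" > tensor.h
git add tensor.h && git commit -qm "base"
echo "tensor modifications" > tensor.h
git commit -qam "add tensor modifications"

# Revert the offending commit; a conflict would pause here for
# manual resolution if later commits had touched the same lines.
git revert --no-edit HEAD

cat tensor.h   # back to "original code"
```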

@jczaja jczaja force-pushed the prv-tensor-modifications-revert branch 2 times, most recently from ab775af to 4351e04 Compare March 26, 2019 13:02
@jczaja jczaja changed the title Prv tensor modifications revert [MKL-DNN] Tensor modifications revert Mar 26, 2019
@jczaja jczaja requested a review from tensor-tang March 26, 2019 13:40
@jczaja jczaja force-pushed the prv-tensor-modifications-revert branch from 4351e04 to b44cc19 Compare March 27, 2019 10:35
@jczaja jczaja requested review from kbinias and luotao1 March 27, 2019 12:57
@jczaja
Contributor Author

jczaja commented Mar 27, 2019

@luotao1, @tensor-tang One of the tests, test_fsp_op.py, fails on Mac OS (it does not fail on Linux). The rerun was done three times. The test does not use MKL-DNN or any functionality from this PR. Please advise how to proceed.

@luotao1
Contributor

luotao1 commented Mar 28, 2019

One of tests: test_fsp_op.py does fail on Mac OS

It is fixed in #16502

@ghost

ghost commented Mar 28, 2019

@tensor-tang @luotao1 This PR has failed in Mac CI several times but has no issue in PR CI (the current failure is just waiting for approval). The Mac CI failure is quite strange: sometimes it is a build failure, sometimes it fails in test_fsp_op.py, which is not related to this change. Could you help to have a look? Thanks!

@luotao1
Contributor

luotao1 commented Mar 28, 2019

@jianhang-liu Please have @xiaolil1 check this PR for performance first and let us know the findings.

@tensor-tang
Contributor

@jianhang-liu luotao has already commented: #16462 (comment)

@ghost

ghost commented Mar 28, 2019

@xiaolil1 Could you help here? Jacek did measurements on the FP32 side and found performance fully recovered (on ResNet-50). Could you check on your side as well, especially for INT8? Thanks!

tensor-tang
tensor-tang previously approved these changes Mar 28, 2019
Contributor

@tensor-tang tensor-tang left a comment


Approve for tensor.h changes.

jczaja added 4 commits March 28, 2019 05:55
…ed (PaddlePaddle#16233)"

This reverts commit 13816dd.
Apart from enabling transformer for MKL-DNN
This reverts commit c63f6b2.

Conflicts:
	paddle/fluid/operators/mkldnn/concat_mkldnn_op.cc
- lint

test=develop

- Lint fixes

test=develop

- Lint fixes

test=develop

- Fix Transpose MKLDNN op

test=develop
@jczaja jczaja force-pushed the prv-tensor-modifications-revert branch from b44cc19 to e239cb0 Compare March 28, 2019 04:56
@xiaolil1
Contributor

@luotao1 @jianhang-liu We are verifying #16462.

@jczaja
Contributor Author

jczaja commented Mar 28, 2019

@tensor-tang, @luotao1 The changes were rebased on a newer develop (the approval got dismissed).

@jczaja
Contributor Author

jczaja commented Mar 28, 2019

As mentioned at the beginning of this PR: this PR was tested for performance, as was a modified version of it (with an older MKL-DNN), to verify the restoration of performance. With the currently used MKL-DNN (0.18), performance was found to be lower.

@xiaolil1
Contributor

@jczaja @jianhang-liu @tensor-tang @luotao1
Here is the performance we tested:
"latest develop + #16462" vs "reference version 08c96d1".
FP32 latency still has a 6% regression, while we generally allow a 5% drop. @tensor-tang @luotao1 please decide whether to approve it.
@jczaja @jianhang-liu Due to the MKL-DNN update between the latest code and the reference code, we suggest you file a JIRA ticket for MKL-DNN to track the performance regression issue.

| INT8 (BS = 1, Cores = 28) | ResNet-50 (FPS) | MobileNet (FPS) |
| ------------------------- | --------------- | --------------- |
| patched                   | 286             | 691             |
| reference                 | 262             | 538             |
| patched / reference       | 1.09            | 1.28            |

| FP32 (BS = 1, Cores = 28) | ResNet-50 (FPS) | MobileNet (FPS) |
| ------------------------- | --------------- | --------------- |
| patched                   | 94              | 296             |
| reference                 | 100             | 307             |
| patched / reference       | 0.94            | 0.96            |

@jczaja
Contributor Author

jczaja commented Mar 28, 2019

@xiaolil1 Thanks for testing this PR. For the full picture, it would be good to modify this PR to test with the MKL-DNN from before the regression. I can prepare that today, once I'm back at work.

As for the MKL-DNN regression: without profiling the conv operator it is difficult to say, but my current guess is that the creation time of primitive descriptors increased. That does not matter for INT8 conv (there is reuse for that one), but it does matter for FP32. Next week I will make a PR with a fix improving that. MKL-DNN optimizes the execution of primitives rather than their creation time. During primitive descriptor creation, MKL-DNN iterates over the available implementations and picks one suitable for the work. MKL-DNN 0.18 added INT8 functionality, so perhaps choosing an implementation takes a bit longer than it used to (this is a guess).
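The reuse point above can be illustrated with a minimal sketch in plain Python. This is a hedged stand-in, not the real MKL-DNN API: `create_primitive_descriptor` is a hypothetical function whose body simulates the expensive implementation-selection step, and caching shows why the INT8 path (which reuses descriptors) is insensitive to creation cost while the FP32 path is not.

```python
import functools

creation_count = 0  # tracks how often the costly creation path runs

@functools.lru_cache(maxsize=None)
def create_primitive_descriptor(shape, dtype):
    """Hypothetical stand-in for MKL-DNN primitive descriptor creation:
    the real library iterates over available implementations and picks
    a suitable one, which is the expensive step discussed above."""
    global creation_count
    creation_count += 1
    return ("conv_pd", shape, dtype)

# First call pays the creation cost; the second reuses the cached result,
# mirroring the descriptor reuse on the INT8 conv path.
pd1 = create_primitive_descriptor((1, 3, 224, 224), "fp32")
pd2 = create_primitive_descriptor((1, 3, 224, 224), "fp32")
print(creation_count)  # → 1
print(pd1 is pd2)      # → True
```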

@xiaolil1
Contributor

We tested develop, develop + #16325 (fix PR), develop + #16462 (revert PR), and reference (08c96d1), with BS = 1 and Cores = 1 on CLX, for both INT8 and FP32.

| INT8 (BS = 1, Cores = 1)     | ResNet-50 (FPS) | MobileNet (FPS) |
| ---------------------------- | --------------- | --------------- |
| develop                      | 69              | 313             |
| develop + #16325             | 69              | 311             |
| develop + #16462             | 70              | 325             |
| reference                    | 64              | 296             |
| develop / reference          | 1.07            | 1.05            |
| develop + #16325 / reference | 1.07            | 1.05            |
| develop + #16462 / reference | 1.09            | 1.10            |

| FP32 (BS = 1, Cores = 1)     | ResNet-50 (FPS) | MobileNet (FPS) |
| ---------------------------- | --------------- | --------------- |
| develop                      | 18              | 105             |
| develop + #16325             | 18              | 104             |
| develop + #16462             | 18              | 106             |
| reference                    | 18              | 107             |
| develop / reference          | 1.00            | 0.98            |
| develop + #16325 / reference | 0.99            | 0.97            |
| develop + #16462 / reference | 0.99            | 0.99            |

@xiaolil1
Contributor

Here is the latency with 1000 iterations.

| INT8 (BS = 1, Cores = 1) | ResNet-50 (ms) | MobileNet (ms) |
| ------------------------ | -------------- | -------------- |
| develop                  | 14.51          | 3.20           |
| develop + #16325         | 14.47          | 3.22           |
| develop + #16462         | 14.20          | 3.08           |
| reference                | 15.57          | 3.38           |

| FP32 (BS = 1, Cores = 1) | ResNet-50 (ms) | MobileNet (ms) |
| ------------------------ | -------------- | -------------- |
| develop                  | 55.14          | 9.54           |
| develop + #16325         | 55.28          | 9.64           |
| develop + #16462         | 55.51          | 9.45           |
| reference                | 54.91          | 9.37           |

@tensor-tang tensor-tang merged commit 2632327 into PaddlePaddle:develop Mar 28, 2019
@tensor-tang
Contributor

@jczaja @xiaolil1
Thanks again for your efficient work.

