
Enable MKL-DNN INT8 Concat Kernel.#16156

Merged
luotao1 merged 8 commits into PaddlePaddle:develop from xiaolil1:concat
Mar 22, 2019

Conversation


@xiaolil1 xiaolil1 commented Mar 11, 2019

Enable the INT8 Concat kernel to improve the performance of MobileNet-SSD; the theoretical improvement is 4X, from memory bandwidth savings.

test=develop

@xiaolil1
Contributor Author

[06:02:47] [Step 1/1] 99% tests passed, 1 tests failed out of 577
[06:02:47] [Step 1/1]
[06:02:47] [Step 1/1] Total Test time (real) = 3055.28 sec
[06:02:47] [Step 1/1]
[06:02:47] [Step 1/1] The following tests FAILED:
[06:02:47] [Step 1/1] 492 - test_layers (Failed)
[06:02:47] [Step 1/1] Errors while running CTest
[06:02:57] [Step 1/1] Process exited with code 8
[06:02:57] [Step 1/1] Process exited with code 8 (Step: Build and test (Command Line))
[06:02:57] [Step 1/1] Step Build and test (Command Line) failed

This UT always fails in PR_CI, but it seems unrelated to my commit. Is it a known issue?
@luotao1 Could you please help to check it?


luotao1 commented Mar 12, 2019

fixed by #16158. Please re-run again.


@xiaolil1
Contributor Author

fixed by #16158. Please re-run again.

Got it, thanks for your quick reply!

Contributor

@Sand3r- Sand3r- left a comment

Thank you @xiaolil1 for this PR! Please consider introducing the changes I've posted underneath.

public:
  void Compute(const paddle::framework::ExecutionContext& ctx) const override {
    bool is_INT8 =
        std::is_same<T, int8_t>::value || std::is_same<T, uint8_t>::value;
Contributor

Wouldn't it be better to use template specialisation of the Compute method for int8_t and uint8_t? That way the method to call would be known at compile time, and we'd avoid both the type check and a branch, which are a bit costly performance-wise.

Contributor

@Sand3r- No, the data type is not known at compile time. You may refer to the INT8 kernel in conv_mkldnn_op.cc as well.

Contributor

It actually is, because it happens during the registration of the op. Then, at runtime, the appropriate instance of the templated method would be called.

Contributor

Thanks @Sand3r-. How about adding a JIRA ticket internally to experiment with your suggestion on the other INT8 kernels, like Conv? That way we can have a unified solution.


output->set_mkldnn_prim_desc(concat_pd.dst_primitive_desc());
output->set_layout(DataLayout::kMKLDNN);
output->set_format(GetDstMemFormat(concat_pd));
Contributor

It is advised to stick to the new set_mkldnn_prim_desc API created by @jczaja, since the old set_format and set_layout are going to be deprecated.

Contributor Author

Thanks for your suggestion, I have refined it.

Contributor

since the old set_format and set_layout are going to be deprecated

@Sand3r- @jczaja Do we need to replace all the set_format and set_layout calls in the code with set_mkldnn_prim_desc?

Contributor

That's the plan, yes.

memory::data_type dt =
    paddle::framework::ToMKLDNNDataType(multi_input[0]->type());

ConcatPrimitiveFactory<T> prim_creator;
std::string key = CreateKey(ctx, multi_input, concat_axis, dt);
Contributor

Regarding the re-use mechanism in general: it'd be much easier to just store the ConcatPrimitiveFactory in the context, instead of creating all the keys and storing the primitives separately while logically they form a single whole. It'd also make the code much shorter.

Contributor

Thanks for your proposal. The reason we put the key generation outside ConcatPrimitiveFactory is to follow the design and implementation pattern of the other MKL-DNN FP32/INT8 kernels, e.g. Conv, Pooling, Quantize, and Dequantize.


Sand3r- commented Mar 18, 2019

I have thought about this recently and wondered whether this branch and the division into FP32 and INT8 functions make sense in the case of concat. Typically there are scales to be used in other operators, and there an approach of dividing the code into INT8 and FP32 counterparts is acceptable, but concat doesn't have any scales or other parameters typical of INT8. Perhaps it'd be enough to simplify the code by adding memory::data_type dt = paddle::framework::ToMKLDNNDataType(multi_input[0]->type()); to the old Compute function and creating the ConcatPrimitiveFactory with the constructor that takes dt. Correct me if I'm wrong, but the rest should pretty much stay the same; the code would just look simpler.


hshen14 commented Mar 18, 2019

Thanks @Sand3r- for your comments.

Hi @luotao1, what do you think? This PR aims to newly introduce an INT8 kernel for concat while keeping the FP32 kernel the same as before. As suggested, it is better to keep the kernels separate; the major reason is to avoid any potential issues for deployed FP32 models using the MKL-DNN kernel.

@luotao1 luotao1 requested a review from tensor-tang March 18, 2019 08:50

luotao1 commented Mar 18, 2019

Perhaps it'd be enough to simplify the code by modifying the old code with adding memory::data_type dt = paddle::framework::ToMKLDNNDataType(multi_input[0]->type()); to the old Compute function and just creating the ConcatPrimitiveFactory using the constructor which uses dt.

If this way is simpler, produces less code, does not involve a lot of if-else branches like conv_mkldnn_op, and keeps the FP32 code easy to debug as well, then I prefer it.


hshen14 commented Mar 21, 2019

@Sand3r- @tensor-tang Please help review the PR, thanks.

std::shared_ptr<memory> dst_mem;
auto concat_p = std::static_pointer_cast<concat>(dev_ctx.GetBlob(key_prim));

if (concat_p == nullptr) {
Contributor

@Sand3r- Sand3r- Mar 21, 2019

What is the benefit of using this re-use mechanism instead of just caching the PrimitiveFactory?

Contributor

@hshen14 hshen14 Mar 21, 2019

Re-use helps improve latency with batch size 1 in general. The benefit for a single op is not big, but at the topology level it is considerable.

Contributor

@Sand3r- Sand3r- left a comment

It's been agreed with @hshen14 that the primitive-reuse will be limited to just reusing PrimitiveFactory in another refactoring PR.


luotao1 commented Mar 22, 2019

in another refactoring PR.

@Sand3r- @hshen14 When and where is the refactoring PR?

@luotao1 luotao1 merged commit e235882 into PaddlePaddle:develop Mar 22, 2019

hshen14 commented Mar 22, 2019

Hi @luotao1, I talked with @Sand3r-. Since the complete primitive-factory idea was introduced in another PR (FC) #15226, we would consider having @Sand3r- adapt the primitive factory for the FP32 Concat first, after #15226 is merged, and then we can refine it for the INT8 kernel. With that, we can have a better unified primitive re-use solution. @xiaolil1


Sand3r- commented Mar 22, 2019

Well, the concept of the factory is unfortunately different for each op, due to the high variety of memory primitives needed. Hence these can hardly be unified, but it would certainly be worth considering rewriting the other MKL-DNN operators to use this scheme.


luotao1 commented Mar 26, 2019

it'd certainly be worth to consider rewriting other mkldnn operators to use this scheme.

Do you have a schedule for rewriting? @Sand3r- @kbinias @jianhang-liu
