
Enable MKL-DNN INT8 Concat Kernel.#16156

Merged
luotao1 merged 8 commits into PaddlePaddle:develop from xiaolil1:concat
Mar 22, 2019

Conversation


@xiaolil1 xiaolil1 commented Mar 11, 2019

Enable the INT8 Concat kernel to improve the performance of MobileNet-SSD; the theoretical improvement is 4X, from memory bandwidth savings.

test=develop

@xiaolil1
Contributor Author

[06:02:47] [Step 1/1] 99% tests passed, 1 tests failed out of 577
[06:02:47] [Step 1/1]
[06:02:47] [Step 1/1] Total Test time (real) = 3055.28 sec
[06:02:47] [Step 1/1]
[06:02:47] [Step 1/1] The following tests FAILED:
[06:02:47] [Step 1/1] 492 - test_layers (Failed)
[06:02:47] [Step 1/1] Errors while running CTest
[06:02:57] [Step 1/1] Process exited with code 8
[06:02:57] [Step 1/1] Process exited with code 8 (Step: Build and test (Command Line))
[06:02:57] [Step 1/1] Step Build and test (Command Line) failed

This UT always fails in PR_CI, but it seems unrelated to my commit. Is it a known issue?
@luotao1 Could you please help to check it?


luotao1 commented Mar 12, 2019

fixed by #16158. Please re-run again.


@xiaolil1
Contributor Author

fixed by #16158. Please re-run again.

Got it, thanks for your quick reply!

Contributor

@Sand3r- Sand3r- left a comment

Thank you @xiaolil1 for this PR! Please consider introducing the changes I've posted underneath.

public:
  void Compute(const paddle::framework::ExecutionContext& ctx) const override {
    bool is_INT8 =
        std::is_same<T, int8_t>::value || std::is_same<T, uint8_t>::value;
Contributor

Wouldn't it be better to use template specialisation of the Compute method for int8_t and uint8_t? That way the method to call would be known at compile time, and we'd avoid both the type check and a branch, which are a bit costly performance-wise.

Contributor

@Sand3r- No, the data type is not known at compile time. You may refer to the INT8 kernel in conv_mkldnn_op.cc as well.

Contributor

It actually is, because it happens during the registration of the op. Then, at runtime, the appropriate instance of the templated method would be called.

Contributor

Thanks @Sand3r-. How about adding a JIRA ticket internally to experiment with your suggestion on the other INT8 kernels, like Conv? That way we can have a unified solution.


output->set_mkldnn_prim_desc(concat_pd.dst_primitive_desc());
output->set_layout(DataLayout::kMKLDNN);
output->set_format(GetDstMemFormat(concat_pd));
Contributor

It is advised to stick to the new set_mkldnn_prim_desc API created by @jczaja, since the old set_format and set_layout are going to be deprecated.

Contributor Author

Thanks for your suggestion, I have refined it.

Contributor

since the old set_format and set_layout are going to be deprecated

@Sand3r- @jczaja Do we need to replace all the set_format and set_layout calls in the code with set_mkldnn_prim_desc?

Contributor

That's the plan, yes.

memory::data_type dt =
    paddle::framework::ToMKLDNNDataType(multi_input[0]->type());

ConcatPrimitiveFactory<T> prim_creator;
std::string key = CreateKey(ctx, multi_input, concat_axis, dt);
Contributor

Regarding the re-use mechanism in general: it'd be much easier to just store the ConcatPrimitiveFactory in the context, instead of creating all the keys and storing the primitives separately while logically they form a single whole. It'd also make the code much shorter.

Contributor

Thanks for your proposal. The reason we put the key generation outside ConcatPrimitiveFactory is to follow the design and implementation pattern of the other MKL-DNN FP32/INT8 kernels, e.g. Conv, Pooling, Quantize, and Dequantize.


Sand3r- commented Mar 18, 2019

I have thought about this recently and wondered whether this branch and the division into FP32 and INT8 functions make sense in the case of concat. Typically there are scales to be used in other operators, and there an approach of dividing the code into INT8 and FP32 counterparts is acceptable, but concat doesn't have any scales or other parameters typical of INT8. Perhaps it'd be enough to simplify the code by adding memory::data_type dt = paddle::framework::ToMKLDNNDataType(multi_input[0]->type()); to the old Compute function and creating the ConcatPrimitiveFactory with the constructor that takes dt. Correct me if I'm wrong, but the rest should pretty much stay the same; the code would just look simpler.


hshen14 commented Mar 18, 2019

Thanks @Sand3r- for your comments.

Hi @luotao1, what do you think? This PR aims to newly introduce an INT8 kernel for concat while keeping the FP32 kernel the same as before. As suggested, it is better to keep the kernels separate; the major reason is to avoid any potential issues for deployed FP32 models using the MKL-DNN kernel.

@luotao1 luotao1 requested a review from tensor-tang March 18, 2019 08:50

luotao1 commented Mar 18, 2019

Perhaps it'd be enough to simplify the code by modifying the old code with adding memory::data_type dt = paddle::framework::ToMKLDNNDataType(multi_input[0]->type()); to the old Compute function and just creating the ConcatPrimitiveFactory using the constructor which uses dt.

If this way is simpler, produces less code, does not involve a lot of if-else branches like conv_mkldnn_op, and keeps the FP32 code easy to debug as well, then I prefer it.


hshen14 commented Mar 21, 2019

@Sand3r- @tensor-tang Please help review the PR, thanks.

std::shared_ptr<memory> dst_mem;
auto concat_p = std::static_pointer_cast<concat>(dev_ctx.GetBlob(key_prim));

if (concat_p == nullptr) {
Contributor

@Sand3r- Sand3r- Mar 21, 2019

What is the benefit of using this re-use mechanism instead of just caching the PrimitiveFactory?

Contributor

@hshen14 hshen14 Mar 21, 2019

Re-use helps improve latency with batch size 1 in general. The benefit for a single op is not big, but at the topology level it is considerable.

Contributor

@Sand3r- Sand3r- left a comment

It's been agreed with @hshen14 that the primitive-reuse will be limited to just reusing PrimitiveFactory in another refactoring PR.


luotao1 commented Mar 22, 2019

in another refactoring PR.

@Sand3r- @hshen14 When and where is the refactoring PR?

@luotao1 luotao1 merged commit e235882 into PaddlePaddle:develop Mar 22, 2019

hshen14 commented Mar 22, 2019

Hi @luotao1, I talked with @Sand3r-. Since the complete primitive-factory idea was introduced in another PR (FC) #15226, we would consider having @Sand3r- adapt the primitive factory for the FP32 Concat first, after #15226 is merged, and then we can refine it for the INT8 kernel. With that, we can have a better unified primitive re-use solution. @xiaolil1


Sand3r- commented Mar 22, 2019

Well, the concept of the factory is unfortunately different for each op, due to the high variety of memory primitives needed. Hence these can hardly be unified, but it would certainly be worth considering rewriting the other MKL-DNN operators to use this scheme.


luotao1 commented Mar 26, 2019

it'd certainly be worth to consider rewriting other mkldnn operators to use this scheme.

Do you have a schedule for rewriting? @Sand3r- @kbinias @jianhang-liu
