Enable MKL-DNN INT8 Concat Kernel #16156
Conversation
test=develop
[06:02:47] [Step 1/1] 99% tests passed, 1 test failed out of 577. This UT always fails in PR_CI, but it seems unrelated to my commit. Is it a known issue?
… concat test=develop
Fixed by #16158. Please re-run again.
Got it, thanks for your quick reply!
```cpp
 public:
  void Compute(const paddle::framework::ExecutionContext& ctx) const override {
    bool is_INT8 =
        std::is_same<T, int8_t>::value || std::is_same<T, uint8_t>::value;
```
Wouldn't it be better to use template specialisation of the Compute method for int8_t and uint8_t? That way the method to call would be known at compile time, and we'd avoid both the type check and a branch, which are a bit costly performance-wise.
@Sand3r- No, the data type is not known at compile time. You may refer to the INT8 kernel in conv_mkldnn_op.cc as well.
It actually is, because it is fixed during op registration. At runtime, the appropriate instance of the templated method would then be called.
Thanks @Sand3r-. How about adding a JIRA ticket internally to experiment with your suggestion on the other INT8 kernels, like Conv, as well? With that, we can have a unified solution.
```cpp
output->set_mkldnn_prim_desc(concat_pd.dst_primitive_desc());
output->set_layout(DataLayout::kMKLDNN);
output->set_format(GetDstMemFormat(concat_pd));
```
It is advised to stick to the new set_mkldnn_prim_desc API created by @jczaja, since the old set_format and set_layout are going to be deprecated.
Thanks for your suggestion, I have refined it.
```cpp
    paddle::framework::ToMKLDNNDataType(multi_input[0]->type());
// ...
ConcatPrimitiveFactory<T> prim_creator;
std::string key = CreateKey(ctx, multi_input, concat_axis, dt);
```
Regarding the re-use mechanism in general: it'd be much easier to just store the ConcatPrimitiveFactory in the context instead of creating all the keys and storing primitives separately, while logically they form a single whole. It'd also make the code much shorter.
Thanks for your proposal. The reason why we put the key generation outside ConcatPrimitiveFactory is to follow the design and implementation pattern of the other MKL-DNN FP32/INT8 kernels, e.g., Conv, Pooling, Quantize, and Dequantize.
I have thought about this recently, and wondered whether this branch and the division into fp32 and int8 functions make sense in the case of concat. Typically there are scales to be used in other operators, and there an approach of dividing the code into int8 and fp32 counterparts is acceptable, but concat doesn't have any scales or other parameters typical for int8. Perhaps it would be enough to simplify the code by modifying the old code with adding …
Thanks @Sand3r- for your comments. Hi @luotao1, what do you think? This PR aims to introduce a new INT8 kernel for concat while keeping the FP32 kernel the same as before. As suggested, it is better to keep the kernels separate; the major reason is to avoid any potential issues for deployed FP32 models using the MKL-DNN kernel.
If this way is simpler, makes the code shorter, and does not include a lot of …
@Sand3r- @tensor-tang Please help review the PR, thanks.
```cpp
std::shared_ptr<memory> dst_mem;
auto concat_p = std::static_pointer_cast<concat>(dev_ctx.GetBlob(key_prim));
// ...
if (concat_p == nullptr) {
```
What is the benefit of using this re-use mechanism instead of just caching the PrimitiveFactory?
Re-use helps improve latency with batch size 1 in general. The benefit for a single op is not large, but at the topology level it is considerable.
Hi @luotao1, I talked with @Sand3r-. Since the complete primitive-factory idea was introduced in another PR (FC, #15226), we will consider whether @Sand3r- will adapt the primitive factory for the FP32 Concat first after #15226 is merged, and then we can refine it for the INT8 kernel. With that, we can have a better unified primitive re-use solution. @xiaolil1
Well, the concept of the factory is unfortunately different for each op, due to the high variety of memory primitives needed. Hence these can hardly be unified, but it would certainly be worth considering rewriting the other mkldnn operators to use this scheme.
Enable the INT8 Concat kernel to improve the performance of MobileNet-SSD; the theoretical improvement is 4X, from memory-bandwidth saving.