clear cache when tid == -1 and cache size exceeds max capacity by LeoZhao-Intel · Pull Request #18285 · PaddlePaddle/Paddle

LeoZhao-Intel · 2019-06-24T04:18:42Z

1. Enable cache clearing mechanism
platform::get_cur_thread_id() == -1 means that cache clearing mode is active.
In this mode, mkldnn keys are generated in plain format, without the real thread id. When the blob
size (the mkldnn blob with first-level key = -1, see line ) exceeds the defined max capacity (see line), cache clearing is triggered and one entry is removed from the head of this blob. The blob data structure is changed to a vector type so that entries can be removed from the head.
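A minimal sketch of the eviction rule described above, not the code in this PR. The names `BlobCache`, `kCacheClearingTid` and `kMaxCapacity` are hypothetical, and the cap value is arbitrary. The sketch evicts just before an insert would push the blob over the cap, so the capped blob never grows beyond it.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not Paddle's device context code.
constexpr int kCacheClearingTid = -1;    // thread id that selects cache clearing mode
constexpr std::size_t kMaxCapacity = 2;  // illustrative cap, not the PR's value

class BlobCache {
 public:
  using Entry = std::pair<std::string, std::shared_ptr<void>>;

  // Stores `data` under `key` in the blob that belongs to `tid`.
  void Set(int tid, const std::string& key, std::shared_ptr<void> data) {
    std::vector<Entry>& blob = blobs_[tid];
    for (Entry& e : blob) {
      if (e.first == key) {  // key already cached: overwrite in place
        e.second = std::move(data);
        return;
      }
    }
    // Only the tid == -1 blob is capped: drop the oldest entry (the head).
    if (tid == kCacheClearingTid && blob.size() >= kMaxCapacity) {
      blob.erase(blob.begin());
    }
    blob.emplace_back(key, std::move(data));
  }

  // Returns the cached object, or nullptr if it is absent or was evicted.
  std::shared_ptr<void> Get(int tid, const std::string& key) const {
    auto it = blobs_.find(tid);
    if (it == blobs_.end()) return nullptr;
    for (const Entry& e : it->second) {
      if (e.first == key) return e.second;
    }
    return nullptr;
  }

  std::size_t Size(int tid) const {
    auto it = blobs_.find(tid);
    return it == blobs_.end() ? 0 : it->second.size();
  }

 private:
  // First-level key: thread id. Second level: ordered (key, object) pairs.
  std::unordered_map<int, std::vector<Entry>> blobs_;
};
```

A vector keeps insertion order, which is what makes "remove from head" mean "remove the oldest entry"; the linear key lookup is acceptable only because the capped blob stays small.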

2. Add new interface SetMKLDNNThreadId(int id) in AnalysisConfig
Use this interface to indicate that the user wants to set the mkldnn thread id manually; the original
AnalysisPredictor::SetMkldnnthreadid() API is not exposed to users directly. id=-1 is used
to trigger cache clearing mode.
Cache clearing mode is a specific mode that fixes the frequently changing thread id issue and the dynamic
shape issue. It is rarely used and should not be inherited by other AnalysisPredictor instances, so the
value has to be set and cleared on every iteration, which means adding hook points in
AnalysisPredictor::Run() and ZeroCopyRun().
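One way to picture the "set and clear for each iteration" hooks is a scope guard, sketched below. This is an illustration only: the `sketch` namespace, `set_cur_thread_id`, `ThreadIdGuard` and `RunOnce` are hypothetical stand-ins, and the PR itself may place the hooks differently.

```cpp
namespace sketch {

// Hypothetical stand-ins for the platform thread-id helpers.
thread_local int cur_thread_id = 0;
inline void set_cur_thread_id(int id) { cur_thread_id = id; }
inline int get_cur_thread_id() { return cur_thread_id; }

// Sets the thread id on entry and restores the previous value on exit,
// so the setting does not leak into other predictors on the same thread.
class ThreadIdGuard {
 public:
  explicit ThreadIdGuard(int id) : saved_(get_cur_thread_id()) {
    set_cur_thread_id(id);
  }
  ~ThreadIdGuard() { set_cur_thread_id(saved_); }
  ThreadIdGuard(const ThreadIdGuard&) = delete;
  ThreadIdGuard& operator=(const ThreadIdGuard&) = delete;

 private:
  int saved_;
};

// Stand-in for one Run() / ZeroCopyRun() iteration.
inline int RunOnce(int configured_id) {
  ThreadIdGuard guard(configured_id);
  // Kernels executed here would observe the configured id (-1 in cache
  // clearing mode) and build their cache keys accordingly.
  return get_cur_thread_id();
}

}  // namespace sketch
```

Because the guard restores the id in its destructor, the value is also cleared when the iteration exits early or throws.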

3. A few fixes in mkldnn concat/pool/conv kernels
In these 3 kernels the key generation was not aligned with the new method (PR #17965), so there
are a few changes to key generation. This also fixes potential crashes when the mkldnn cache does not behave as
the kernels expect (they assume caching always succeeds).
This part is merged by PR #18393
test=develop

LeoZhao-Intel · 2019-06-25T05:51:38Z

@jczaja @jianhang-liu for code review

test=develop

jczaja · 2019-06-25T11:39:39Z

@LeoZhao-Intel

Why is clearing the cache not the default behaviour? E.g. is the size of the Map capped only when tid == -1 is enabled?
Could you please explain in more detail why the changes to pooling and concat are needed, e.g. describe when a crash could happen?

LeoZhao-Intel · 2019-06-25T11:46:56Z

> @LeoZhao-Intel
>
> Why is clearing the cache not the default behaviour? E.g. is the size of the Map capped only when tid == -1 is enabled?

It has a performance cost: clearing the cache makes mkldnn performance drop because primitives have to be recreated, so it is not recommended in the normal case.

> Could you please explain in more detail why the changes to pooling and concat are needed, e.g. describe when a crash could happen?

Clearing the cache releases memory internally. In pool/concat, a variable declared inside an "if" block holds the object; once the code leaves that "if", the pointer is no longer valid, which causes a crash during mkldnn execution.

e.g. this line

luotao1 · 2019-06-25T12:06:00Z

Could we use LRU for cache?

jczaja · 2019-06-25T12:16:02Z

@LeoZhao-Intel Ok,
3. Why are the changes to convolution needed? When will it crash without those changes?

LeoZhao-Intel · 2019-06-25T12:26:09Z

> @LeoZhao-Intel Ok,
> 3. Why are the changes to convolution needed? When will it crash without those changes?

Same issue: if we don't cache at all, i.e. set CAP=0, it will crash in the conv mkldnn kernel. I may need to update the description.

LeoZhao-Intel · 2019-06-25T12:26:35Z

> Could we use LRU for cache?

What do you mean by LRU?

jczaja · 2019-06-25T12:33:34Z

@LeoZhao-Intel Ok. So my understanding is that the pointer has to remain valid, because cache clearing could free the stored data that this pointer refers to?

LeoZhao-Intel · 2019-06-25T12:36:11Z

> @LeoZhao-Intel Ok. So my understanding is that the pointer has to remain valid, because cache clearing could free the stored data that this pointer refers to?

Correct! The idea is to let the shared_ptr keep the memory alive until the mkldnn pipeline execution is done.
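The lifetime point can be shown with a small sketch, assuming a cache that may drop its reference at any time. `Primitive`, `TinyCache` and `RunKernel` are hypothetical stand-ins, not the mkldnn kernels touched by this PR.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct Primitive { int value; };  // hypothetical stand-in for a cached mkldnn object

// Hypothetical cache that evicts from the head; cap == 0 keeps nothing.
struct TinyCache {
  std::vector<std::shared_ptr<Primitive>> entries;
  std::size_t cap = 0;

  void Set(std::shared_ptr<Primitive> p) {
    entries.push_back(std::move(p));
    while (entries.size() > cap) entries.erase(entries.begin());
  }
};

// The kernel holds its own shared_ptr for the whole execution, so an
// eviction only drops the cache's reference and never frees the object
// while the kernel still uses it.
inline int RunKernel(TinyCache* cache) {
  std::shared_ptr<Primitive> prim = std::make_shared<Primitive>(Primitive{42});
  cache->Set(prim);    // with cap == 0 the cache releases its copy at once
  return prim->value;  // still valid: `prim` owns the object until return
}
```

With a raw pointer fetched from the cache instead of a `shared_ptr` held in the kernel's scope, the same sequence would read freed memory.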

luotao1 · 2019-06-25T12:44:59Z

LRU: Least recently used

jczaja · 2019-06-25T12:52:40Z

@luotao1 We had a long-term plan to improve this cache clearing. Removing the oldest entry is just the first step.
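For comparison with "remove the oldest entry", here is a minimal sketch of what an LRU policy could look like. It is not part of this PR; `LruCache` is a hypothetical name and the design (a list for recency order plus a hash index) is just the common textbook arrangement.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical LRU cache: the front of the list is the most recently used
// entry, and the back is the eviction candidate.
class LruCache {
  using Entry = std::pair<std::string, std::shared_ptr<void>>;

 public:
  explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

  // Returns the cached object and marks it as most recently used.
  std::shared_ptr<void> Get(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return nullptr;
    order_.splice(order_.begin(), order_, it->second);
    return it->second->second;
  }

  void Set(const std::string& key, std::shared_ptr<void> data) {
    auto it = index_.find(key);
    if (it != index_.end()) {  // existing key: update and refresh recency
      it->second->second = std::move(data);
      order_.splice(order_.begin(), order_, it->second);
      return;
    }
    if (capacity_ == 0) return;  // nothing can be stored
    if (order_.size() >= capacity_) {  // evict the least recently used entry
      index_.erase(order_.back().first);
      order_.pop_back();
    }
    order_.emplace_front(key, std::move(data));
    index_[key] = order_.begin();
  }

  std::size_t Size() const { return order_.size(); }

 private:
  std::size_t capacity_;
  std::list<Entry> order_;
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

Unlike head removal, a `Get` here changes which entry is evicted next, so frequently reused primitives survive longer at the cost of extra bookkeeping on every lookup.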

luotao1 · 2019-06-27T23:54:38Z

Could you create a separate PR for the xxx_mkldnn_op changes so that they can be merged first?

luotao1 · 2019-06-28T00:00:42Z

Could we use a similar interface like config.EnableMKLDNN(int mkldnn_cache_size)?

use mkldnn_cache_size to replace mkldnn_thread_id?
use mkldnn_cache_size to record MKLDNN_CAP?

LeoZhao-Intel · 2019-07-05T09:45:11Z

New PRs (#18513, #18428, #18453, #18428) fix this issue, so I am closing this one.

LeoZhao-Intel added 4 commits June 24, 2019 12:22

clear cache when tid == 1 and cache size exceeds max capacity

4ea200e

test=develop

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

2088f4c

… cache_clearing

add more logs to print blob status

3d2e563

test=develop

Merge commit 'refs/pull/18284/head' of https://github.com/PaddlePaddl…

20ef647

…e/Paddle into cache_clearing

LeoZhao-Intel changed the title ~~clear cache when tid == 1 and cache size exceeds max capacity~~ clear cache when tid == -1 and cache size exceeds max capacity Jun 25, 2019

1. Add new interface in AnalysisConfig to set mkldnn thread id

14c5b2e

2. Few fix in concat/pool mkldnn kernel for key generation 3. Enable cache clearing mechanism test=develop

change to use VLOG(2)

29ca760

test=develop

LeoZhao-Intel and others added 5 commits June 25, 2019 10:21

Merge commit 'refs/pull/18283/head' of https://github.com/PaddlePaddl…

437ef14

…e/Paddle into cache_clearing

Merge commit 'refs/pull/18284/head' of https://github.com/PaddlePaddl…

b4072a5

…e/Paddle into cache_clearing

detect model test for dynamic shape

76db898

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

cbe871a

… cache_clearing

Merge commit 'refs/pull/18331/head' of https://github.com/PaddlePaddl…

5ea831e

…e/Paddle into cache_clearing

luotao1 added the Intel label Jun 26, 2019

luotao1 mentioned this pull request Jun 27, 2019

[DO NOT MERGE] detect model test2 for dynamic shape #18372

Closed

LeoZhao-Intel closed this Jul 5, 2019
