Skip to content

clear cache when tid == -1 and cache size exceeds max capacity#18285

Closed
LeoZhao-Intel wants to merge 11 commits intoPaddlePaddle:developfrom
LeoZhao-Intel:cache_clearing
Closed

clear cache when tid == -1 and cache size exceeds max capacity#18285
LeoZhao-Intel wants to merge 11 commits intoPaddlePaddle:developfrom
LeoZhao-Intel:cache_clearing

Conversation

@LeoZhao-Intel
Copy link
Contributor

@LeoZhao-Intel LeoZhao-Intel commented Jun 24, 2019

1. Enable cache clearing mechanism
platform::get_cur_thread_id() == -1 means it is in cache clearing mode.
In this mode, mkldnn key generation is plain format, without including real thread id, and when blob
size (mkldnn blob with first level key = -1, see line ) exceeds the defined max capacity (see line), it will trigger cache clearing, and remove one from head of this blob, the blob data structure is changed to vector type to meet requirement for removing from head.

2. Add new interface SetMKLDNNThreadId(int id) in AnalysisConfig
Use this interface to indicate that users want to set mkldnn thread id manually, original
AnalysisPredictor::SetMkldnnthreadid() API is not exposed to user directly. Meanwhile we use id=-1
to trigger cache clearing mode.
Given cache clearing mode is a specific mode to fix thread id frequent changing issue and dynamic
shape issue, it is rarely used, and should not be inherited by other AnalysisPredictor instances, we
need to set and clear value for each iteration, that means we need add hook points in
AnalysisPredictor::Run() and ZeroCopyRun().

3. Few fixes in mkldnn concat/pool/conv kernels
In these 3 kernels, due to key generation method is not aligned with new method (PR #17965), there
are few changes in key generation, and also fix potential crash issues if mkldnn cache doesn't work as
expected result (always cache successfully)
This part is merged by PR #18393
test=develop

@LeoZhao-Intel LeoZhao-Intel changed the title clear cache when tid == 1 and cache size exceeds max capacity clear cache when tid == -1 and cache size exceeds max capacity Jun 25, 2019
2. Few fix in concat/pool mkldnn kernel for key generation
3. Enable cache clearing mechanism

test=develop
@LeoZhao-Intel
Copy link
Contributor Author

@jczaja @jianhang-liu for code review

@jczaja
Copy link
Contributor

jczaja commented Jun 25, 2019

@LeoZhao-Intel

  1. Why clearing cache is not default behaviour? Eg. capping the limit of Map happens only when enabling tid == -1 ?
  2. Could you please explain more why changes to pooling and concat are needed eg. please describe when there could be a crash?

@LeoZhao-Intel
Copy link
Contributor Author

@LeoZhao-Intel

  1. Why clearing cache is not default behaviour? Eg. capping the limit of Map happens only when enabling tid == -1 ?
    This takes cost on performance, clearing cache will make mkldnn performance drop due to recreate primitive, so not recommended to use in normal case.
  2. Could you please explain more why changes to pooling and concat are needed eg. please describe when there could be a crash?
    When clearing cache, it will release memory internally, while in pool/concat, it uses variable to store obj in a "if", when code returns from "if", the ptr is invalid then which will make crash in mkldnn execution.

e.g. this line

@luotao1
Copy link
Contributor

luotao1 commented Jun 25, 2019

Could we use LRU for cache?

@jczaja
Copy link
Contributor

jczaja commented Jun 25, 2019

@LeoZhao-Intel Ok,
3. Why changes to convolution are needed ? When will it crash without those changes?

@LeoZhao-Intel
Copy link
Contributor Author

LeoZhao-Intel commented Jun 25, 2019

@LeoZhao-Intel Ok,
3. Why changes to convolution are needed ? When will it crash without those changes?
same issue, if we don't do cache, I mean set CAP=0, it will crash in conv mkldnn kenel, I may need to update description.

@LeoZhao-Intel
Copy link
Contributor Author

Could we use LRU for cache?

What you mean LRU?

@jczaja
Copy link
Contributor

jczaja commented Jun 25, 2019

@LeoZhao-Intel Ok, So my understanding is that pointer has to remain valid , as it could be that cache clearing cleared stored data that this pointer is holding ?

@LeoZhao-Intel
Copy link
Contributor Author

LeoZhao-Intel commented Jun 25, 2019

@LeoZhao-Intel Ok, So my understanding is that pointer has to remain valid , as it could be that cache clearing cleared stored data that this pointer is holding ?

Correct! That's the idea to let shared_ptr keep memory till mkldnn pipeline execution done.

@luotao1
Copy link
Contributor

luotao1 commented Jun 25, 2019

LRU: Least recently used

@jczaja
Copy link
Contributor

jczaja commented Jun 25, 2019

@luotao1 We had a long term plan to improve this cache clearing. Removing oldest entry is just first step.

@luotao1
Copy link
Contributor

luotao1 commented Jun 27, 2019

Could you create a new PR for xxx_mkldnn_op to merge at first?

@luotao1
Copy link
Contributor

luotao1 commented Jun 28, 2019

Could we use a similar interface like config.EnableMKLDNN(int mkldnn_cache_size)?

  • use mkldnn_cache_size to replace mkldnn_thread_id?
  • use mkldnn_cache_size to record MKLDNN_CAP?

@LeoZhao-Intel
Copy link
Contributor Author

new PRs (#18513, #18428, #18453, #18428) to fix this issue. so close it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants