
add mkldnn shapeblob cache clear strategy#18513

Merged
luotao1 merged 4 commits into PaddlePaddle:develop from luotao1:shape_blob_clear_cache
Jul 8, 2019
Conversation

@luotao1 (Contributor) commented Jul 5, 2019

This PR:

  1. adds an MKLDNN shapeblob cache clear strategy: when shapeblob.size() == mkldnn_input_shape_cache_size, erase the first entry of shapeblob.
  2. adds a unit test TestMkldnnCacheClear to verify this cache clear strategy:
    • Without the cache clear strategy, shapeblob.size() is always 1.
    • With the cache clear strategy, shapeblob.size() equals mkldnn_input_shape_cache_size.
  3. removes the unused platform::get_cur_input_shape_str function, and adds size_t MKLDNNDeviceContext::GetShapeBlobSize(int mkldnn_session_id) to get the ShapeBlob size for a given mkldnn_session_id.

TODO

  1. Explicit settings such as platform::set_cur_input_shape_str(ss.str()); and platform::set_cur_mkldnn_session_id(platform::kMKLDNNSessionID_CacheClearing); will be deprecated after the config.EnableMKLDNN() interface is enhanced.
  2. Global functions such as platform::set_cur_input_shape_str(ss.str()); and platform::set_cur_mkldnn_session_id(platform::kMKLDNNSessionID_CacheClearing); will become member functions of MKLDNNDeviceContext.

@luotao1 luotao1 requested review from a user and jczaja July 5, 2019 04:16
@luotao1 (Contributor, Author) commented Jul 5, 2019

@LeoZhao-Intel Please review!

thread_local std::string cur_input_shape_str = "";
// the cache size of different input shapes for MKLDNN.
// Default 1 means fixed input shape, not dynamic shape.
thread_local int cur_input_shape_cache_size = 1;
Contributor:

How about cur_input_shape_cache_capacity instead of size? "size" here sounds like the actual data size, while in my understanding it means the maximum cache size.

Contributor Author:

Got it. I will change.

cur_input_shape_str = input_shape_str;
}
std::string get_cur_input_shape_str(void) { return cur_input_shape_str; }
void set_cur_input_shape_cache_size(int input_shape_cache_size) {
Contributor:

set_cur_input_shape_cache_capacity?

Contributor Author:

Got it. I will change.


void MKLDNNDeviceContext::ResetBlobMap() const { p_blobmap_->clear(); }

size_t MKLDNNDeviceContext::GetShapeBlobSize(int mkldnn_session_id) const {
Contributor:

Why do we need an explicit input parameter mkldnn_session_id? Can we just get it from thread_local? There seems to be no case where a user would query the shape blob size for a session id different from the one stored in thread_local.

Contributor Author:

can we just get it from thread_local

Could you give more details on how to get it from thread_local?

why need a explicit input parameter mkldnn_session_id?

If we don't have the mkldnn_session_id parameter, the pMap->find(mkldnn_session_id) at line 428 could not find it. Or do you mean there is only one session_id in pMap?

Contributor:

mkldnn_session_id is already stored in cur_mkldnn_session_id, which is a thread_local variable, so if the user calls it in the same thread as predictor.run, it can easily get the id from cur_mkldnn_session_id.

@LeoZhao-Intel (Contributor) commented Jul 5, 2019:

If the user calls GetShapeBlobSize in the same thread as predictor.run, then the code would look like:

  BlobMap* pMap = p_blobmap_.get();
  auto map_it = pMap->find(cur_mkldnn_session_id);
  if (map_it == pMap->end()) {
    LOG(FATAL) << "MKLDNNDeviceContext don't find mkldnn_session_id : "
               << cur_mkldnn_session_id;
  }
  return map_it->second->size();
}

If we still want the user to call this function from other threads, then this parameter is necessary.

Contributor Author:

Got it. I will change it.

if (key_it == sBlob->end()) {
// In cache clearing mode, cur_input_shape_cache_size defines max pblob
// capacity
if ((tid == kMKLDNNSessionID_CacheClearing) &&
Contributor:

Better to rename tid to sid (session id) to align with kMKLDNNSessionID_xxx.

Contributor Author:

Got it. I will change.


// default mkldnn session id
- constexpr size_t kMKLDNNSessionID_Default = 0;
+ constexpr int kMKLDNNSessionID_Default = 0;
Contributor:

Why change it to int? size_t is used to align with std::hash(this_thread::gettid()); otherwise it may need a static_cast.

Contributor Author:

Got it. I will change.

(sBlob->size() == static_cast<size_t>(cur_input_shape_cache_size))) {
VLOG(2) << "tid=" << tid
<< ", remove all head blob of shape: " << sBlob->begin()->first;
sBlob->erase(sBlob->begin()->first);
Contributor:

Why remove sBlob->begin()->first here? It seems wrong; it should be erase(sBlob->begin()), since sBlob->begin()->first is a string.
BTW, it doesn't really mean removing the head or first one, because sBlob is a std::unordered_map, whose ordering is not the same as a vector or queue.

Contributor Author:

A map can erase by key, see: https://www.geeksforgeeks.org/map-erase-function-in-c-stl/

it doesn't really mean removing the head or first one because sBlob is

Got it. I will change the LOG message. As for the code, since we can erase any entry of sBlob, it runs correctly.

Contributor:

You are right, there are 3 erase overloads.

@luotao1 (Contributor, Author) commented Jul 5, 2019

All done, please review again @LeoZhao-Intel

// In cache clearing mode, cur_input_shape_cache_capacity defines
// max pblob capacity
if ((sid == kMKLDNNSessionID_CacheClearing) &&
(sBlob->size() ==
Contributor:

Do you think ">=" is better here?

Contributor Author:

When the size reaches ==, it cleans, so > can never happen.
In what case would > occur?

Contributor:

So far I don't see a case for >, but at the code level it looks safer to use >=.

Contributor Author:

Got it.

@LeoZhao-Intel (Contributor):

LGTM

@luotao1 (Contributor, Author) commented Jul 5, 2019

@jczaja Please review!


size_t MKLDNNDeviceContext::GetShapeBlobSize() const {
BlobMap* pMap = p_blobmap_.get();
auto map_it = pMap->find(cur_mkldnn_session_id);
@jczaja (Contributor) commented Jul 5, 2019:

Currently it is only for UT purposes? Perhaps it would be safer to guard the whole function with a critical section, using the same mutex as SetBlob and GetBlob, in particular if there is a plan to use GetShapeBlobSize with the parallel executor. Apart from that, LGTM.

Contributor Author:

Got it. You are right.

@luotao1 (Contributor, Author) commented Jul 6, 2019

@LeoZhao-Intel @jianhang-liu analyzer_mm_dnn hangs more often in this PR after the unit test was added here.
http://ci.paddlepaddle.org/viewType.html?buildTypeId=Paddle_PrCi&branch_Paddle=pull%2F18513&tab=buildTypeStatusDiv
Is #18513 (comment) the reason?

@luotao1 (Contributor, Author) commented Jul 6, 2019

@jczaja @LeoZhao-Intel I added the mutex, and analyzer_mm_dnn doesn't hang when I repeat the test 3 times.
http://ci.paddlepaddle.org/viewType.html?buildTypeId=Paddle_PrCi&branch_Paddle=pull%2F18513&tab=buildTypeStatusDiv
Please review again!

@LeoZhao-Intel (Contributor):

LGTM

@ghost left a comment:

Please consider whether we really need VLOG(2), considering those logs are just detailed information about cache usage.

@luotao1 luotao1 merged commit fe32879 into PaddlePaddle:develop Jul 8, 2019
@luotao1 luotao1 deleted the shape_blob_clear_cache branch July 8, 2019 01:55
@luotao1 (Contributor, Author) commented Jul 8, 2019

Please consider whether we really need VLOG(2) considering those logs are just detail information of cache usage.

Since those logs are only used by MKLDNN, what's your opinion? @LeoZhao-Intel @jczaja

@LeoZhao-Intel (Contributor):

I am fine with VLOG(2), but I'm not sure there is a clear rule for log level definitions.

