
[DO NOT MERGE] detect model test2 for dynamic shape#18372

Closed
luotao1 wants to merge 36 commits into PaddlePaddle:develop from luotao1:detect_model_test

Conversation


@luotao1 luotao1 commented Jun 27, 2019

combine #18331 and #18285
@jianhang-liu @LeoZhao-Intel


luotao1 commented Jun 27, 2019

634d8c6 is used to speed up lookup, since a map's find is faster than a linear search over a vector.

} // namespace inference
} // namespace paddle

// following lines are used for pprof

You can remove the following main function if you do not want to use pprof.


luotao1 commented Jun 29, 2019

6d5a841 supports pprof to find memory leaks.

  • install
  • Heap-checking: HEAPCHECK=local ./paddle/fluid/inference/tests/api/test_analyzer_detect
  • Heap-profiling: pprof --pdf ./paddle/fluid/inference/tests/api/test_analyzer_detect /tmp/test_analyzer_detect.11261.test_foo-end.heap > test_sample10.pdf


luotao1 commented Jun 29, 2019

pprof_sample100.pdf
pprof_sample1000.pdf
Are global_transfer_data_cache and global_transfer_scope_cache leaking memory?
image

Same problem in #15032 (comment)


LeoZhao-Intel commented Jun 29, 2019

Analysis of the mkldnn multi-threaded memory leak (each iteration is executed on a new thread):

  1. This usage scenario is special: the user keeps creating new threads to execute predictor.run, and executes only that function.
  2. It is only triggered when data transform is involved, because that creates a transfer scope. mkldnn is one such case; mkl does not use it, so mkl is fine. See OperatorWithKernel::PrepareData().
  3. The leak is inside the framework, in global_transfer_data_cache() in transfer_scope_cache.cc. The cause is that this function allocates memory through a thread_local pointer; because execution keeps switching to new threads, the memory never gets a chance to be released after a thread exits. The intent of the function is good: cache and reuse the transfer scope to improve performance. But the multi-threaded case in this unit test was probably not considered when this code was designed.
std::unordered_map<size_t, Scope*>& global_transfer_data_cache() {
  thread_local auto* x = new std::unordered_map<size_t, Scope*>;
  return *x;
}

Discussion of solutions:
This framework issue is very similar to the issue with mkldnn's own cache mechanism: both reuse earlier results as much as possible for performance, while also having to support single instance vs. multiple instances, sequential vs. parallel runs, and a fixed single thread vs. switching across threads, so the cases are complex. Solving this at the bottom layer with a single solution, transparently to the user, would be very hard and very complex. We suggest considering a dedicated API that lets the user declare the running mode; this both simplifies the design and allows clearer optimization for specific scenarios.


jczaja commented Jul 4, 2019

@luotao1 , I'm writing here because I have a problem reproducing the potential memory leak in single-threaded execution that, as @LeoZhao-Intel told me, you are seeing, e.g.:
Test_analyzer_detect for CAP=50: for samples=1000 max memory consumption is 2.6GB, and for samples=5000 it is around 4GB.

I tried to reproduce this problem, but had difficulty observing it. So perhaps you can advise on how to make it manifest.

I tested this PR (the most recent update and earlier ones), and for CAP=50 (cfg.EnableMKLDNN(50)):

  • samples=1000 , maximal memory consumption is ~2.1 GB
  • samples=5000 , maximal memory consumption is ~2.2 GB

I have disabled the in-loop thread start for prediction by commenting out thread.emplace_back..., i.e. test_analyzer_detect is single-threaded (predictor execution starts from the same thread as the parent thread).

  1. How do you check max memory usage?

On my side I check maximal memory consumption via the Max RSS value reported by the time program, e.g. /usr/bin/time -v <cmd line>

typical output:

Command being timed: "./paddle/fluid/inference/tests/api/test_analyzer_detect --infer_model=/home/jczaja/DETECT/fluid/ --infer_data=/home/jczaja/DETECT/detect_input.txt --infer_shape=/home/jczaja/DETECT/shape.txt --gtest_filter=Analyzer_vis.profile_mkldnn --paddle_num_threads=1 --repeat=2 --batch_size=1 --sample=5000"
User time (seconds): 1019.35
System time (seconds): 156.12
Percent of CPU this job got: 98%
Elapsed (wall clock) time (h:mm:ss or m:ss): 19:55.67
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2313368 # 2.3 GB
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 102711363
Voluntary context switches: 10037
Involuntary context switches: 1763
Swaps: 0
File system inputs: 0
File system outputs: 6312
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

  2. When observing the problem, have you disabled pprof and sanitizers (they tend to increase memory consumption)? If fully disabled (no linking to tcmalloc and no sanitizer build), do you still see the increased memory consumption?

  3. What platform did you use when observing the memory leak?

  4. Any other advice on how to observe this memory leak?


luotao1 commented Jul 4, 2019

@jczaja Thanks for trying to reproduce it.

Test_analyzer_detect for CAP=50 e.g. for samples=1000 max memory consumption is 2.6GB and for samples=5000 it is around 4GB.

This is a result from @jianhang-liu. My test result today, with config.EnableMKLDNN(10) and CAP=10:

  • samples=1000 (from 1k dataset) , maximal memory consumption is ~1.5 GB
  • samples=1000 (from the first 1k of 5k dataset) , maximal memory consumption is ~1.9 GB
  • samples=5000 (from 5k dataset) , maximal memory consumption is ~2.4 GB

I have disabled in-loop threads starting to do prediction, by commenting out thread.emplace_back

I didn't disable this, but with #18428, I think the multi-instance memory leak is fixed now.

--paddle_num_threads=1

I use --paddle_num_threads=4. @jianhang-liu finds that OMP_NUM_THREADS may affect it, but I haven't tested different paddle_num_threads values.

How to you check for max memory usage

I use the top command to observe it directly.

If fully disabled (no linking to tcmalloc and building for sanitizers) do you still see this increased memory consumption

Yes, I do.

What platform you used for observing memory leak

I use an E5-2620 v3.

Any other advice on how to observe this memory leak

Maybe --paddle_num_threads=4 is the key point?

@luotao1 luotao1 reopened this Jul 8, 2019

luotao1 commented Oct 17, 2019

command:

./paddle/fluid/inference/tests/api/test_analyzer_detect --gtest_filter=Analyzer_vis.profile_mkldnn --batch_size=1 --warmup --repeat=1 --paddle_num_threads=4 --infer_model=third_party/inference_demo/face_model/densebox --infer_data=third_party/inference_demo/face_model/detect_input.txt --infer_shape=third_party/inference_demo/face_model/shape.txt  --sample=1000

E5-2620 v3:

| Threads | %CPU | Max %Mem | sample latency (ms) | date | commit |
| --- | --- | --- | --- | --- | --- |
| 4 | ~320% | 1.1% | 27.8061 | 10/17 | 1d92544 |

@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Apr 2, 2020
@PaddlePaddle PaddlePaddle unlocked this conversation Apr 2, 2020

luotao1 commented May 8, 2020

Closed due to #24336.

