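Version and environment information:
1) PaddlePaddle version: 1.4.1
2) CPU: Skylake CPU
3) System environment: paddlepaddle:latest docker (Ubuntu 16.04), Python 2.7
Training information:
1) Multi-machine CPU training, 2 PS, 4 workers
Reproduction: distributed training using the fluid.incubate.fleet interface, following https://github.com/PaddlePaddle/Paddle/tree/develop/python/paddle/fluid/incubate/fleet/tests
Problem description:
After training for a few epochs, the following core dump occurs: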
I0531 16:14:44.043287 461 communicator.cc:208] communicator stopped, recv thread exit
I0531 16:14:44.223714 460 communicator.cc:169] communicator stopped, send thread exit
I0531 16:14:44.223843 443 communicator.cc:307] Communicator stop done
*** Aborted at 1559290484 (unix time) try "date -d @1559290484" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 443 (TID 0x7fd54155c700) from PID 0; stack trace: ***
@ 0x7fd541139390 (unknown)
@ 0x7fd523e403eb paddle::memory::allocation::Allocator::FreeImpl()
@ 0x7fd522b4a9b9 std::_Sp_counted_base<>::_M_release()
@ 0x7fd522b4b588 paddle::framework::Variable::PlaceholderImpl<>::~PlaceholderImpl()
@ 0x7fd523defc4d paddle::framework::Scope::~Scope()
@ 0x7fd522d4bdb4 paddle::operators::distributed::Communicator::~Communicator()
@ 0x7fd522c7bd8a std::_Sp_counted_ptr<>::_M_dispose()
@ 0x7fd522b4a9b9 std::_Sp_counted_base<>::_M_release()
@ 0x7fd540d97ff8 (unknown)
@ 0x7fd540d98045 exit
@ 0x7fd540d7e837 __libc_start_main
@ 0x493299 _start
@ 0x0 (unknown)
Segmentation fault
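
For anyone without GNU date at hand, the "Aborted at" timestamp in the trace can be decoded with a few lines of Python 3. This is only a small sketch: `decode_abort_time` is a helper name made up for illustration, and the UTC+8 offset is an assumption inferred from the glog prefix (`I0531 16:14:44`), not something the log states.

```python
from datetime import datetime, timedelta, timezone

# Value copied from the "*** Aborted at 1559290484 (unix time) ..." line.
ABORT_UNIX_TIME = 1559290484


def decode_abort_time(unix_seconds, utc_offset_hours=0):
    """Convert a unix timestamp to an aware datetime at a fixed UTC offset.

    Hypothetical helper for illustration; the offset is supplied by the
    caller because the crash log itself does not record a time zone.
    """
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(unix_seconds, tz=tz)


if __name__ == "__main__":
    # Abort time in UTC.
    print(decode_abort_time(ABORT_UNIX_TIME).isoformat())
    # Abort time at UTC+8 (assumed zone of the machine), formatted like
    # the "MMDD HH:MM:SS" part of a glog line prefix for easy comparison.
    print(decode_abort_time(ABORT_UNIX_TIME, 8).strftime("%m%d %H:%M:%S"))
```

Under the UTC+8 assumption the abort time falls in the same second as the "Communicator stop done" line, which fits the stack trace showing the segfault inside `exit`, after the communicator threads had already stopped.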