Closed
Paddle was built with the following settings:
WITH_TESTING=OFF
WITH_GOLANG=OFF
CMAKE_BUILD_TYPE=Release
WITH_GPU=ON
WITH_STYLE_CHECK=OFF
WITH_FLUID_ONLY=ON
WITH_MKLDNN=OFF
WITH_DISTRIBUTE=ON
I packaged the build result with the Dockerfile and Python script from this repo:
https://github.com/putcn/vgg16_dist_test
and tagged the image as paddlepaddlece/vgg16_dist:latest.
The training script is basically the same as https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/cluster/vgg16/vgg16_fluid.py; the only change is that the paddle.v2 dependency was removed by replacing import paddle.v2 as paddle with import paddle.
I then tried to run the cluster with the following commands.
pserver:
docker run --network="host" -i \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
-e "SERVER_ENDPOINT=172.19.56.198:5436" \
-e "MASTER_ENDPOINT=172.19.56.198:5436" \
-e "TASK_NAME=nostalgic_raman" \
-e "TRAINER_INDEX=0" \
-e "TRAINING_ROLE=PSERVER" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "PSERVER_HOSTS=172.19.56.198:5436" \
-e "PSERVERS=172.19.56.198:5436" \
paddlepaddlece/vgg16_dist:latest --device CPU --local no --num_passes 1 --batch_size 128
trainer:
nvidia-docker run --network="host" -i \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
-e "MASTER_ENDPOINT=172.31.48.60:5436" \
-e "TASK_NAME=kind_colden" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "TRAINER_INDEX=0" \
-e "PADDLE_INIT_TRAINER_ID=0" \
-e "TRAINING_ROLE=TRAINER" \
-e "PSERVER_HOSTS=172.19.56.198:5436" \
-e "PSERVERS=172.19.56.198:5436" \
paddlepaddlece/vgg16_dist:latest --device GPU --local no --num_passes 1 --batch_size 128
The pserver started with no issue, but the trainer failed while trying to execute the trainer program. The error is:
*** Aborted at 1526510318 (unix time) try "date -d @1526510318" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 591 (TID 0x7f47667a2700) from PID 0; stack trace: ***
@ 0x7f4766380390 (unknown)
@ 0x0 (unknown)
Segmentation fault (core dumped)
Any ideas?
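For anyone comparing the two commands, here is a small sanity-check sketch in Python that diffs the environment variables passed to the pserver and trainer containers. The values are copied from the commands in this report. The helper name diff_env and the choice of which keys "should" match (SHARED_KEYS) are assumptions made for illustration; I have not confirmed which of these variables the training script actually reads, or that any mismatch is related to the segfault.

```python
# Environment variables copied from the pserver and trainer docker commands
# in this report (GLOG_* settings omitted, they are identical on both sides).
PSERVER_ENV = {
    "SERVER_ENDPOINT": "172.19.56.198:5436",
    "MASTER_ENDPOINT": "172.19.56.198:5436",
    "TASK_NAME": "nostalgic_raman",
    "TRAINER_INDEX": "0",
    "TRAINING_ROLE": "PSERVER",
    "TRAINER_COUNT": "1",
    "TRAINERS": "1",
    "PSERVER_HOSTS": "172.19.56.198:5436",
    "PSERVERS": "172.19.56.198:5436",
}

TRAINER_ENV = {
    "MASTER_ENDPOINT": "172.31.48.60:5436",
    "TASK_NAME": "kind_colden",
    "TRAINER_COUNT": "1",
    "TRAINERS": "1",
    "TRAINER_INDEX": "0",
    "PADDLE_INIT_TRAINER_ID": "0",
    "TRAINING_ROLE": "TRAINER",
    "PSERVER_HOSTS": "172.19.56.198:5436",
    "PSERVERS": "172.19.56.198:5436",
}

# ASSUMPTION: these keys describe the cluster as a whole, so both roles are
# expected to agree on them. TASK_NAME and TRAINING_ROLE are left out on purpose.
SHARED_KEYS = ["MASTER_ENDPOINT", "TRAINER_COUNT", "TRAINERS",
               "PSERVER_HOSTS", "PSERVERS"]


def diff_env(a, b, keys):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


if __name__ == "__main__":
    mismatches = diff_env(PSERVER_ENV, TRAINER_ENV, SHARED_KEYS)
    for key in sorted(mismatches):
        ps_value, tr_value = mismatches[key]
        print("%s differs: pserver=%s trainer=%s" % (key, ps_value, tr_value))
```

With the values above, the only shared key that differs is MASTER_ENDPOINT (172.19.56.198:5436 on the pserver, 172.31.48.60:5436 on the trainer). That may be harmless if the script ignores it, but it is the one inconsistency between the two commands.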