vgg16 dist train fails with latest paddle #10720

@putcn

Paddle was built with the following settings:

WITH_TESTING=OFF
WITH_GOLANG=OFF 
CMAKE_BUILD_TYPE=Release 
WITH_GPU=ON 
WITH_STYLE_CHECK=OFF 
WITH_FLUID_ONLY=ON 
WITH_MKLDNN=OFF
WITH_DISTRIBUTE=ON 

I packaged the build output together with the Dockerfile and Python script from this repo:
https://github.com/putcn/vgg16_dist_test
and tagged the image as paddlepaddlece/vgg16_dist:latest.

The training script is essentially the same as https://github.com/PaddlePaddle/Paddle/blob/develop/benchmark/cluster/vgg16/vgg16_fluid.py; the only change is replacing import paddle.v2 as paddle with import paddle, which removes the dependency on paddle.v2.

I then tried to run the cluster with the following commands.
pserver:

docker run --network="host" -i \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
-e "SERVER_ENDPOINT=172.19.56.198:5436" \
-e "MASTER_ENDPOINT=172.19.56.198:5436" \
-e "TASK_NAME=nostalgic_raman" \
-e "TRAINER_INDEX=0" \
-e "TRAINING_ROLE=PSERVER" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "PSERVER_HOSTS=172.19.56.198:5436" \
-e "PSERVERS=172.19.56.198:5436" \
paddlepaddlece/vgg16_dist:latest --device CPU --local no --num_passes 1 --batch_size 128

trainer:

nvidia-docker run --network="host" -i  \
-e "GLOG_logtostderr=1" \
-e "GLOG_vmodule=executor=3" \
-e "MASTER_ENDPOINT=172.31.48.60:5436" \
-e "TASK_NAME=kind_colden" \
-e "TRAINER_COUNT=1" \
-e "TRAINERS=1" \
-e "TRAINER_INDEX=0"  \
-e "PADDLE_INIT_TRAINER_ID=0" \
-e "TRAINING_ROLE=TRAINER"  \
-e "PSERVER_HOSTS=172.19.56.198:5436"  \
-e "PSERVERS=172.19.56.198:5436" \
paddlepaddlece/vgg16_dist:latest --device GPU --local no --num_passes 1 --batch_size 128
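
The pserver and trainer commands above set several environment variables differently, and a quick diff makes the differences easy to see. This is a minimal sketch in Python with the values copied from the two commands; `diff_env` is a hypothetical helper written for this post, not part of Paddle, and nothing here establishes that any of these differences is related to the crash.

```python
# Environment passed to each container, copied from the docker commands above.
pserver_env = {
    "GLOG_logtostderr": "1",
    "GLOG_vmodule": "executor=3",
    "SERVER_ENDPOINT": "172.19.56.198:5436",
    "MASTER_ENDPOINT": "172.19.56.198:5436",
    "TASK_NAME": "nostalgic_raman",
    "TRAINER_INDEX": "0",
    "TRAINING_ROLE": "PSERVER",
    "TRAINER_COUNT": "1",
    "TRAINERS": "1",
    "PSERVER_HOSTS": "172.19.56.198:5436",
    "PSERVERS": "172.19.56.198:5436",
}

trainer_env = {
    "GLOG_logtostderr": "1",
    "GLOG_vmodule": "executor=3",
    "MASTER_ENDPOINT": "172.31.48.60:5436",
    "TASK_NAME": "kind_colden",
    "TRAINER_COUNT": "1",
    "TRAINERS": "1",
    "TRAINER_INDEX": "0",
    "PADDLE_INIT_TRAINER_ID": "0",
    "TRAINING_ROLE": "TRAINER",
    "PSERVER_HOSTS": "172.19.56.198:5436",
    "PSERVERS": "172.19.56.198:5436",
}


def diff_env(left, right):
    """Return {name: (left_value, right_value)} for every variable that differs.

    A variable missing on one side is reported with None on that side.
    (Hypothetical helper for illustration only.)
    """
    names = sorted(set(left) | set(right))
    return {
        name: (left.get(name), right.get(name))
        for name in names
        if left.get(name) != right.get(name)
    }


if __name__ == "__main__":
    for name, (ps_value, tr_value) in diff_env(pserver_env, trainer_env).items():
        print(f"{name}: pserver={ps_value!r} trainer={tr_value!r}")
```

TASK_NAME and TRAINING_ROLE differ as expected. The diff also reports that MASTER_ENDPOINT points at a different address in the two commands, that SERVER_ENDPOINT is set only for the pserver, and that PADDLE_INIT_TRAINER_ID is set only for the trainer.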

The pserver started without issue, but the trainer failed while trying to execute the trainer program. The error is:

*** Aborted at 1526510318 (unix time) try "date -d @1526510318" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 591 (TID 0x7f47667a2700) from PID 0; stack trace: ***
    @     0x7f4766380390 (unknown)
    @                0x0 (unknown)
Segmentation fault (core dumped)
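
The trace suggests GNU date for decoding the abort time. Where that is not available, Python's standard library does the same; `abort_time_utc` is just a name made up for this sketch.

```python
from datetime import datetime, timezone


def abort_time_utc(unix_seconds):
    """Convert the Unix time from the glog abort banner to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    # Value taken from the "*** Aborted at 1526510318 (unix time)" line above.
    print(abort_time_utc(1526510318))  # 2018-05-16T22:38:38+00:00
```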

Any ideas?
