Fluid distributed training benchmark #7410
typhoonzero left a comment
Should we put this doc under design, or in a separate repo?
benchmark/cluster/README.md
Outdated
- Docker Image

We use different base Docker Image to run the benchmark on Kubernetes:
- PaddlePaddle v2: paddlepaddle/paddle:latest
We should use a static tag, so that this benchmark can still be reproduced after the latest tag is updated.
Sure, but we don't have a static tag for Fluid distributed training. How about a commit ID?
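To make the reproducibility concern concrete, here is a minimal sketch of a check that flags image references that are not pinned. The helper names `image_tag` and `is_pinned` are hypothetical and are not part of Paddle or Docker tooling, and the parsing assumes the common `name[:tag][@digest]` form only.

```python
def image_tag(ref):
    """Return the tag of a Docker image reference, or None if it has none."""
    # Drop an optional digest suffix such as "@sha256:...".
    name = ref.split("@", 1)[0]
    # Only inspect the part after the last "/" so that a registry port
    # ("host:5000/img") is not mistaken for a tag.
    last_component = name.rsplit("/", 1)[-1]
    if ":" in last_component:
        return last_component.split(":", 1)[1]
    return None


def is_pinned(ref):
    """Return True if the reference uses a digest or a tag other than "latest"."""
    if "@" in ref:
        return True
    tag = image_tag(ref)
    # A missing tag resolves to "latest" in Docker, so both count as unpinned.
    return tag is not None and tag != "latest"
```

With this sketch, `is_pinned("paddlepaddle/paddle:latest")` is false and `is_pinned("paddlepaddle/paddle:0.11.0")` is true. A named tag can still be re-pushed, so a digest or an image built from a known commit ID is the stricter option.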
benchmark/cluster/README.md
Outdated
- TensorFlow: tensorflow/tensorflow:latest

- Model
A digits recognize model and MNIST dataset is used in this benchmark.
I think this model is too small. Maybe VGG-16 (probably around 500MB) is closer to real usage.
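As a rough check of the "around 500MB" estimate, the sketch below counts the parameters of the standard VGG-16 layout. It assumes the usual ImageNet configuration (224x224 RGB input, 1000 output classes) and float32 weights; the model actually used in the benchmark may differ, and the function name is illustrative only.

```python
# VGG-16 convolution stack: integers are output channels of 3x3 convolutions,
# "M" marks a 2x2 max-pooling layer. Assumes the standard ImageNet configuration.
CONV_LAYERS = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
               512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]


def vgg16_param_count(in_channels=3, input_size=224):
    """Count weights and biases of VGG-16 for the given input shape."""
    total = 0
    channels, size = in_channels, input_size
    for layer in CONV_LAYERS:
        if layer == "M":
            size //= 2  # pooling halves the spatial size and has no parameters
        else:
            total += 3 * 3 * channels * layer + layer  # kernel weights + biases
            channels = layer
    features = channels * size * size  # flattened input to the first FC layer
    for out_features in FC_SIZES:
        total += features * out_features + out_features
        features = out_features
    return total


params = vgg16_param_count()
size_mb = params * 4 / 1e6  # float32 uses 4 bytes per parameter
print(f"{params:,} parameters, about {size_mb:.0f} MB as float32")
```

Under these assumptions the count is roughly 138 million parameters, or about 550 MB, which is consistent with the estimate in the comment.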
benchmark/cluster/README.md
Outdated
- PServer count of the training job.

- Invariant
- The number of trainers.
What trainer counts do we plan to try?
Done.
Also, @typhoonzero reminded me that we need to measure parallel efficiency by increasing the trainer count.
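One common way to express parallel efficiency is throughput per trainer relative to the smallest run. The sketch below assumes that definition, which the thread does not specify, and uses a hypothetical `parallel_efficiency` helper with placeholder throughput numbers rather than measured results.

```python
def parallel_efficiency(throughputs):
    """Map trainer count -> efficiency relative to the smallest trainer count.

    `throughputs` maps trainer count to total throughput (e.g. samples/sec).
    Efficiency is per-trainer throughput divided by the baseline's
    per-trainer throughput, so ideal linear scaling gives 1.0.
    """
    base_n = min(throughputs)
    base_per_trainer = throughputs[base_n] / base_n
    return {n: (tp / n) / base_per_trainer
            for n, tp in sorted(throughputs.items())}


# Placeholder numbers for illustration only, NOT benchmark results.
example = {1: 100.0, 2: 190.0, 4: 340.0}
for n, eff in parallel_efficiency(example).items():
    print(f"{n} trainers: {eff:.0%} efficiency")
```

Keeping the PServer count and batch size per trainer fixed while varying only the trainer count would make these ratios easier to compare across runs.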
From @typhoonzero
Maybe not. I saw that https://github.com/dzhwinter/benchmark is already working on the Fluid benchmark, and I heard from @dzhwinter that it will be merged into the Paddle repo this week.
benchmark/cluster/README.md
Outdated
- Docker Image

We use different base Docker Image to run the benchmark on Kubernetes:
- PaddlePaddle v2: paddlepaddle/paddle:[commit-id]
v2 should use the 0.10.0 tag, and Fluid should use a commit ID.
Done. Since 0.10.0 does not support v2 distributed training, I used 0.11.0 instead.
Fixed #7409