Fluid distributed training benchmark#7410

Merged
Yancey0623 merged 3 commits into PaddlePaddle:develop from Yancey0623:cluster_benchmark_design
Jan 12, 2018
Conversation

@Yancey0623
Contributor

Fixed #7409

@Yancey0623 Yancey0623 changed the title from "add cluster training bencharmk design" to "Fluid distributed training benchmark" on Jan 10, 2018
Contributor

@typhoonzero left a comment

Should we put this doc into design or a separate repo?

- Docker Image

We use different base Docker Images to run the benchmark on Kubernetes:
- PaddlePaddle v2: paddlepaddle/paddle:latest
Contributor

We should use a static tag, so that when the latest tag updates, this benchmark can still be reproduced.

Contributor Author

Sure, but we don't have a static tag for fluid distributed training; how about a commit ID?

- TensorFlow: tensorflow/tensorflow:latest

- Model
A digit-recognition model and the MNIST dataset are used in this benchmark.
Contributor

@helinwang Jan 11, 2018

I think this model is too small. Maybe vgg-16 (probably around 500MB) is closer to real usage.

Contributor Author

Done.
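As a quick sanity check on the "around 500MB" figure, the parameter count of the standard VGG-16 architecture can be tallied directly. This is an illustrative back-of-the-envelope calculation, not part of the benchmark code:

```python
# Back-of-the-envelope parameter count for the standard VGG-16 architecture
# (13 conv layers + 3 fully connected layers), to sanity-check the
# "around 500MB" estimate for fp32 weights.

conv_channels = [
    (3, 64), (64, 64),                   # block 1
    (64, 128), (128, 128),               # block 2
    (128, 256), (256, 256), (256, 256),  # block 3
    (256, 512), (512, 512), (512, 512),  # block 4
    (512, 512), (512, 512), (512, 512),  # block 5
]

# 3x3 kernels: weights = 3*3*c_in*c_out, plus one bias per output channel.
params = sum(3 * 3 * cin * cout + cout for cin, cout in conv_channels)
params += 7 * 7 * 512 * 4096 + 4096      # fc6 (224 input halved by 5 poolings -> 7x7)
params += 4096 * 4096 + 4096             # fc7
params += 4096 * 1000 + 1000             # fc8 (1000 ImageNet classes)

size_mib = params * 4 / 2**20            # fp32: 4 bytes per parameter
print(params, round(size_mib))           # ~138.4M parameters, ~528 MiB
```

So an fp32 VGG-16 model is roughly 528 MiB, consistent with the estimate above.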

- PServer count of the training job.

- Invariant
- The number of trainers.
Contributor

What is the trainer count we plan to try?

Contributor Author

Done.
And @typhoonzero reminded me that we also need to measure parallel efficiency by increasing the trainer count.
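To make the parallel-efficiency measurement concrete: efficiency at N trainers is the speedup over the single-trainer throughput divided by N. A minimal sketch, with hypothetical placeholder throughput numbers rather than real benchmark results:

```python
# Sketch: scaling efficiency from measured training throughput.
#   efficiency(n) = (throughput(n) / throughput(base)) / (n / base)
# The measurements below are hypothetical placeholders.

def parallel_efficiency(throughput_by_trainers):
    base_n, base_tp = min(throughput_by_trainers.items())
    return {
        n: (tp / base_tp) / (n / base_n)
        for n, tp in sorted(throughput_by_trainers.items())
    }

# Hypothetical images/sec for 1, 2, 4, 8 trainers:
measured = {1: 100.0, 2: 190.0, 4: 360.0, 8: 640.0}
efficiency = parallel_efficiency(measured)
for n, eff in efficiency.items():
    print(f"{n} trainers: efficiency {eff:.2f}")
```

An efficiency near 1.0 means near-linear scaling; the gap from 1.0 as trainers are added quantifies the communication overhead.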

@Yancey0623
Contributor Author

From @typhoonzero

Should we put this doc into design or a separate repo?

Maybe not. I saw that https://github.com/dzhwinter/benchmark is working on the Fluid benchmark, and I learned from @dzhwinter that it will be merged into the Paddle repo this week.

- Docker Image

We use different base Docker Images to run the benchmark on Kubernetes:
- PaddlePaddle v2: paddlepaddle/paddle:[commit-id]
Contributor

v2 should use the 0.10.0 tag, and fluid should use a commit ID.

Contributor Author

@Yancey0623 Jan 11, 2018

Done. Since 0.10.0 does not support v2 distributed training, I used 0.11.0 instead.
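A minimal sketch of how the pinned images might appear in a Kubernetes pod spec; the container names are illustrative placeholders, and `[commit-id]` stands for the exact commit under test:

```yaml
# Illustrative pod-spec fragment: pin each benchmark image to a fixed tag
# so the run stays reproducible even after :latest moves on.
containers:
  - name: paddle-v2-trainer
    image: paddlepaddle/paddle:0.11.0       # released tag for v2
  - name: fluid-trainer
    image: paddlepaddle/paddle:[commit-id]  # replace with the commit ID under test
```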

Contributor

@typhoonzero left a comment

LGTM++

@Yancey0623 Yancey0623 merged commit 5dbd537 into PaddlePaddle:develop Jan 12, 2018
@Yancey0623 Yancey0623 deleted the cluster_benchmark_design branch January 12, 2018 03:25
