Skip to content

Conversation

@kuizhiqing
Copy link
Member

PR types

New features

PR changes

Others

Describe

Adding fault tolerance feature in distributed mode ( collective mode here),

  • ElasticManager in elastic module handle life cycle of training process, be notified of node failure;
  • define LauncherInterface wrap the real task workflow which is the same as before;

This pr is compatible upgrade which means it will keep silence and change nothing if it is not enabled.

@paddle-bot-old
Copy link

paddle-bot-old bot commented Jun 7, 2021

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

Copy link
Contributor

@Baibaifan Baibaifan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for ascend launch

Copy link
Contributor

@xymyeah xymyeah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@danleifeng danleifeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@danleifeng danleifeng merged commit 79cbc8e into PaddlePaddle:develop Jun 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants