Skip to content

Conversation

@gongweibao
Copy link
Collaborator

Fix #462

controller parse "TrainingJob" resource to paddle job, including master, pserver and trainer.

@gongweibao gongweibao changed the title Submittrainingjobs Add trainingjober. Dec 12, 2017
@gongweibao gongweibao changed the title Add trainingjober. Add TrainingJober. Dec 12, 2017

// Destroy destroys resource and pods.
func (c *TrainingJober) Destroy(job *paddlejob.TrainingJob) {
c.Complete(job)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为何cleanupPservercleanupMaster Complete函数里面,而cleanupTrainerDestroy函数里面,和Complete是并列关系?

3个cleanup看起来应该是并列关系?考虑删掉Complete,把3个cleanup都放在Destroy

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我的思路是这样的,当TrainerJob完成之后,Master和Pserver应该被清除掉。因为他们占用了Pod资源,但是TrainerJob还不能被清除掉,用户还要看日志。而且TrainerJob完成之后就不占用了资源了。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

懂了,有道理!

err := fmt.Errorf("trainerjob_err:%v master_err:%v pserver_err:%v",
terr, merr, perr)
log.Error(err.Error())
return err
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

某一个出错,但是另外的可能创建成功了,需要cleanup。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return err
}

if m == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

可能出现err == nil而同时m == nil的情况吗?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这种的也是应该创建吧?

Copy link
Collaborator

@helinwang helinwang Dec 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

哦,我的意思是要是不会出现就不需要处理这个情况了。
明白了,是nil就代表还不存在,所以需要创建。

log.Error(err.Error())
return err
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里可能在create之前判断下job是否存在:

  • 对于重复提交的情况,这里是不是可以判断下,trainer/pserver/master只要其中一个存在就报异常呢?
  • 使用户使用paddlectl kill {jobname}也有可能存在pod删不掉的情况,例如Contrainer删除超时等,所以这里可能还需要判断下Pod的存在情况。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

对于重复提交的情况,这里是不是可以判断下

后边判断了

trainer/pserver/master只要其中一个存在就报异常呢?使用户使用paddlectl kill {jobname}也有可能存在pod删不掉的情况,例如Contrainer删除超时等,所以这里可能还需要判断下Pod的存在情况。

这个判断应该在提交TrainingJob资源的地方做。controller的对用户来说是异步执行的。
我准备在paddlecloud里边加上类似的判断。

if t == nil {
if err := c.createTrainer(job); err != nil {
if m == nil {
c.cleanupMaster(namespace, mname)
Copy link
Collaborator

@Yancey0623 Yancey0623 Dec 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里没有很明白,为什么m==nil的时候需要clearnup master呢?m==nil不是表示master不存在,是不是不需要再cleanup了?

Copy link
Collaborator Author

@gongweibao gongweibao Dec 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

m==nil 表示 这轮创建之前Master不存在, 但是能走到这里代表Master创建且成功了,所以当createTrainer出错的时候要清理。
隐含的是把 Tainer,Pserver,Master当做一个整体。并且,只清理自己创建的。不清理其他的。

Copy link
Collaborator

@Yancey0623 Yancey0623 Dec 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

既然master已经在上一个if中创建成功了,那么么应该不会进入到m == nil分支中了吧? 所以看起来这里永远不会被执行

c.cleanupTrainer(namespace, tname)
}

if err := c.createPserver(job); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果创建PServer失败,应该对之前创建trainer和创建master的操作进行回滚。

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry,放错地方了。还需要继续仔细测试。

Copy link
Collaborator

@Yancey0623 Yancey0623 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM++

@gongweibao gongweibao merged commit a4292d5 into PaddlePaddle:develop Dec 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants