add logs and fix a bug #5074
go/master/client.go
Outdated
| if err.Error() == ErrPassAfter.Error() {
| // to prevent too many logs
| if i%60 == 0 {
| log.Debug(fmt.Sprintf("getTask passID:%d error.", passID), log.Ctx{"error": err})
There was a problem hiding this comment.
log.Debugf or just log.Debug(a, b) will do.
go/master/client.go
Outdated
| if i%60 == 0 {
| log.Debug("getTask of passID error.",
| log.Ctx{"error": err, "passID": passID})
| i = 3
There was a problem hiding this comment.
Why is this not i = 0?
| }
|
| i := 0
| if err.Error() == ErrPassAfter.Error() {
There was a problem hiding this comment.
I think this line is still needed; otherwise any other unknown error would also cause a sleep and wait.
There was a problem hiding this comment.
// if err.Error() == ErrPassAfter.Error():
// wait until the last pass finishes
// if it is another error, such as a network error:
// wait to reconnect, or for the task to time out
There was a problem hiding this comment.
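My understanding is that ErrPassAfter explicitly means we should wait. If it is some other unrecoverable error, we should log it and panic (or similar) here; otherwise the job may hang?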
There was a problem hiding this comment.
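There are a few angles to consider:
- From the interface side, pass_num is a global value. If we exit whenever an error occurs, the client will move on and try the next pass_num. https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/reader/creator.py#L113
- From the workflow side, the task should get the data for this pass_num. If an error occurs and it is not ErrPassBefore, ErrNoMoreAvailable, or ErrAllTaskFailed, it should keep waiting. If it does not hang, what should the client do instead? I think it should still retry.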
| time.Sleep(time.Second * 3)
| continue
|
| if i%60 == 0 {
There was a problem hiding this comment.
Maybe drop i and the magic value 60, and just log the error every time it happens?
There was a problem hiding this comment.
That would produce far too many logs. The whole screen would be filled with the same log line.
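To make the trade-off concrete, here is a minimal sketch of the `i%60` throttle with the magic value given a name. The names `logEvery`, `shouldLog`, and `countLogs` are hypothetical. Assuming one attempt every 3 seconds, as the `time.Sleep(time.Second * 3)` in the diff suggests, logging every 60th attempt works out to about one line every 3 minutes instead of one line every 3 seconds.

```go
package main

import "fmt"

// logEvery is the magic value 60 from the diff, given a name.
const logEvery = 60

// shouldLog reports whether the attempt-th consecutive failure (0-based)
// should be logged: the first one, and every logEvery-th one after it.
func shouldLog(attempt int) bool {
	return attempt%logEvery == 0
}

// countLogs returns how many of n consecutive failures would be logged.
func countLogs(n int) int {
	logged := 0
	for i := 0; i < n; i++ {
		if shouldLog(i) {
			logged++
		}
	}
	return logged
}

func main() {
	// 600 failed attempts at 3 seconds each is 30 minutes of waiting.
	fmt.Println(countLogs(600)) // prints 10
}
```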
go/master/client_test.go
Outdated
| panic(e)
| }
|
| c.StartGetRecords(100)
There was a problem hiding this comment.
What is this for? Could you add a comment in the file to explain it?
There was a problem hiding this comment.
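The idea at the time was to call c.StartGetRecords(100) and observe that the log gets printed, and that it is not printed too often. But I now realize that putting this in a unit test cannot catch a violation of that requirement; it can only be observed by eye, so let's remove it.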
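For reference, one hypothetical way to turn "observe the log" into an assertion is to inject the logger and count the lines it writes. The sketch below uses the standard library `log` package with an in-memory buffer; `retryWithThrottledLog` and `countLogLines` are illustrative names and a simplified stand-in loop, not the client code in this PR, which would need its logger to be injectable for this to apply.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
)

// retryWithThrottledLog (hypothetical) simulates `failures` consecutive
// failed attempts and logs the first one and every `every`-th one after it.
func retryWithThrottledLog(logger *log.Logger, failures, every int) {
	for i := 0; i < failures; i++ {
		if i%every == 0 {
			logger.Printf("getTask failed, attempt %d", i)
		}
	}
}

// countLogLines (hypothetical) runs the loop against an in-memory logger and
// returns the number of lines written.
func countLogLines(failures, every int) int {
	var buf bytes.Buffer
	retryWithThrottledLog(log.New(&buf, "", 0), failures, every)
	return bytes.Count(buf.Bytes(), []byte("\n"))
}

func main() {
	fmt.Println(countLogLines(120, 60)) // prints 2
}
```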
| c = master.client(etcd_endpoints, timeout_sec, buf_size)
| c.set_dataset(paths)
|
| if isinstance(paths, basestring):
|
| if __name__ == "__main__":
| unittest.main()
| # TODO(gongwb):fix CI error
There was a problem hiding this comment.
This change should not exist in this PR.
Fix https://github.com/PaddlePaddle/cloud/wiki/Demos'-common-issues#whole-job-is-waiting-but-doesnt-print-anything
Fix #5075
Fix #5078