DrJosh9000 left a comment
Some initial thoughts...
Force-pushed from 90fc67a to 819a190
    // In the scheduler worker, these failures are often technically user errors in the YAML.
    // Using "agent refused" seems reasonable, as it represents that the agent won't execute this YAML.
    // This deserves more discussion, though.
    Reason: agent.SignalReasonAgentRefused,
@DrJosh9000 I think this change is debatable, depending on our confidence level. I just wanted to highlight it in case you have an opinion here.
Hmm. I think the cases here where failJob is called are all "pre-agent", so using "agent refused" feels wrong for that reason. We also can't rule out the Kubernetes cluster having a bad day and rejecting a valid job. "Stack error" seems like the best choice for now.
Maybe the agent change should have included a "stack rejected" reason, as well as "stack error"?
Ah true, the controller isn't an agent. I changed it to stack error, with some comments explaining the caveats. 🙏🏿
This PR slightly changes how we propagate non-command container errors to the Buildkite API.
StackError error reason -> this can help indicate that the error is not retryable.
Solves #575
Solves PB-7 and PB-5