Skip to content

PB-5: improve exit code propogation for non-command containers#628

Merged
zhming0 merged 1 commit into
mainfrom
ming/pb-5
Jun 19, 2025
Merged

PB-5: improve exit code propogation for non-command containers#628
zhming0 merged 1 commit into
mainfrom
ming/pb-5

Conversation

@zhming0
Copy link
Copy Markdown
Contributor

@zhming0 zhming0 commented Jun 18, 2025

This PR slightly change how we propagate non-command containers error to bk api.

  • In initContainer failure, we will pass the first non-0 exit code from init containers back to backend.
  • When scheduler yaml failure, we will raise a StackError error reason -> this can be helpful to indicate that this is not an retryable error.

solves #575
solves PB-7 and PB-5

@zhming0 zhming0 requested a review from a team June 18, 2025 05:37
@zhming0 zhming0 requested a review from a team as a code owner June 18, 2025 05:37
Copy link
Copy Markdown
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial thoughts...

Comment thread internal/controller/scheduler/fail_job.go Outdated
Comment thread internal/controller/scheduler/fail_job.go Outdated
Comment thread internal/controller/scheduler/fail_job.go Outdated
@zhming0 zhming0 force-pushed the ming/pb-5 branch 2 times, most recently from 90fc67a to 819a190 Compare June 19, 2025 00:59
@zhming0 zhming0 requested a review from DrJosh9000 June 19, 2025 00:59
Copy link
Copy Markdown
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty good

Comment thread internal/controller/scheduler/fail_job.go Outdated
Comment thread internal/controller/scheduler/fail_job.go Outdated
Comment thread internal/controller/scheduler/fail_job.go Outdated
Comment thread internal/controller/scheduler/fail_job.go Outdated
Comment thread internal/controller/scheduler/job_watcher.go Outdated
Comment thread internal/controller/scheduler/pod_watcher.go Outdated
Comment thread internal/controller/scheduler/pod_watcher.go Outdated
Comment thread internal/controller/scheduler/scheduler.go Outdated
// In scheduler worker, often these failure are technically users error in YAML.
// Using agent refuse seems to be reasonable as that represent that agent won't execute these yaml.
// Deserve more discussion though..
Reason: agent.SignalReasonAgentRefused,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DrJosh9000 I think this change is debatable, subject to our confidence level. Just wanting to highlight this in case in you have opinion here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I think the cases here where failJob is called are all "pre-agent", so using "agent refused" feels wrong for that reason. We also can't rule out the Kubernetes cluster having a bad day and rejecting a valid job. "Stack error" seems like the best choice for now.

Maybe the agent change should have included a "stack rejected" reason, as well as "stack error"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah true, the controller isn't an agent. I changed it to stack error but with a bit comments explaining the caveats. 🙏🏿

Copy link
Copy Markdown
Contributor

@DrJosh9000 DrJosh9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@zhming0 zhming0 merged commit 35eabbd into main Jun 19, 2025
1 check passed
@zhming0 zhming0 deleted the ming/pb-5 branch June 19, 2025 05:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants