improve handling of SIGTERM grace period #261
Merged
This PR makes this plugin behave more like non-Docker jobs with regard to cancellation and timing out.
Consider the following scenario: you have a job that can time out (or be cancelled by the user), but you want it to shut down gracefully and conditionally mark the job as passed if the key parts of the job succeeded before the cancellation.
A typical example would be a benchmark job. When the timeout is hit, it is possible that enough iterations of the benchmark have already run to give a pass.
When creating a normal (non-Docker) Buildkite job, this is easy. In your test process you add a signal handler for SIGTERM (which is what Buildkite sends on timeout or user cancellation) which conditionally returns exit code 0 (instead of the typical 143). If 0 is returned, Buildkite interprets that as a pass and marks the job with a green tick.
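A minimal sketch of such a handler for the benchmark case, assuming the job is a bash script. `MIN_ITERATIONS`, `exit_code_for` and `run_benchmark` are hypothetical names chosen for illustration, and the `sleep 1` is a placeholder for one real benchmark iteration:

```shell
#!/usr/bin/env bash
# Sketch only. MIN_ITERATIONS, exit_code_for and run_benchmark are
# hypothetical names, and `sleep 1` stands in for one benchmark iteration.

MIN_ITERATIONS="${MIN_ITERATIONS:-3}"  # assumed pass threshold
completed=0

# Print the exit code for a given number of finished iterations:
# 0 (pass) once the threshold is met, otherwise 143 (128 + SIGTERM's 15).
exit_code_for() {
  if [ "$1" -ge "$MIN_ITERATIONS" ]; then
    echo 0
  else
    echo 143
  fi
}

run_benchmark() {
  # bash runs the handler once the current foreground command returns,
  # so each iteration should be short relative to the grace period.
  trap 'exit "$(exit_code_for "$completed")"' TERM
  while true; do
    sleep 1
    completed=$((completed + 1))
  done
}
```

The job's command would end with a call to `run_benchmark`; on `SIGTERM` the script exits 0 only if at least `MIN_ITERATIONS` iterations have finished.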
Unfortunately, this doesn't currently work with the Docker plugin. So I've made the following changes:
- Added `trap` calls to the script to ensure that `SIGTERM` only affects the Docker container, not the wrapper bash script. Without this, `SIGTERM` was causing the bash script to terminate and return 143, which meant the exit code from the Docker container was being ignored. Now, if the process inside the Docker container chooses to intercept the `SIGTERM` and, for example, return 0, the exit code is propagated all the way up to the Buildkite UI (which displays a green tick if the exit code is 0).
- `BUILDKITE_CANCEL_GRACE_PERIOD` is now passed to the Docker container. When a user presses cancel (or a job hits its timeout), Buildkite sends `SIGTERM` to the job. The process is given this grace period to terminate gracefully before Buildkite resorts to sending a `SIGKILL`. Docker itself has a similar option, `--stop-timeout`, which previously defaulted to 10 seconds. With that Docker default in effect, increasing `BUILDKITE_CANCEL_GRACE_PERIOD` had no effect.
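The two changes can be sketched roughly as follows. This is not the plugin's actual code: `run_shielded` and `stop_timeout_args` are hypothetical helpers, and forwarding the grace period through `--stop-timeout` is an assumption based on the description above.

```shell
#!/usr/bin/env bash
# Sketch only, not the plugin's actual code. run_shielded and
# stop_timeout_args are hypothetical helper names.

# Run a command while the wrapper shell survives SIGTERM, then return the
# command's own exit status instead of 143.
run_shielded() {
  # Installing a handler replaces the default "terminate" action for this
  # shell. Trapped (as opposed to ignored) signals are reset to their
  # defaults in child processes, so the child's own SIGTERM handling is
  # unaffected.
  trap ':' TERM
  "$@"
  local status=$?
  trap - TERM
  return "$status"
}

# Print the extra `docker run` flags. Assumption: the grace period is
# forwarded via --stop-timeout, falling back to Docker's 10-second default.
stop_timeout_args() {
  printf '%s %s\n' "--stop-timeout" "${BUILDKITE_CANCEL_GRACE_PERIOD:-10}"
}
```

A wrapper could then run something like `run_shielded docker run --rm $(stop_timeout_args) example/image:latest ./benchmark.sh`, where the image name and script are placeholders, and exit with the returned status.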