Skip to content

Apply default-image-pull-policy to agent and checkout containers#841

Closed
seemethere wants to merge 1 commit into
buildkite:mainfrom
seemethere:seemethere/default-pull-policy-all-containers
Closed

Apply default-image-pull-policy to agent and checkout containers#841
seemethere wants to merge 1 commit into
buildkite:mainfrom
seemethere:seemethere/default-pull-policy-all-containers

Conversation

@seemethere
Copy link
Copy Markdown
Contributor

Summary

The default-image-pull-policy config option is currently only applied to the copy-agent init container and imagecheck-* containers. The agent and checkout containers have no ImagePullPolicy set, so Kubernetes defaults to Always for tagged (non-digest) images.

This means every job pod contacts the container registry to verify the image manifest, even when images are pre-cached on the node (e.g. via a DaemonSet). If DNS to the registry is unavailable, the manifest check fails and the pod enters ImagePullBackOff — despite the image being locally cached.

This PR applies the same cmp.Or(DefaultImagePullPolicy, defaultPullPolicyForImage) pattern to the agent and checkout containers, matching the existing copy-agent behavior.

Motivation

We run ~300 nodes with an image-prepull DaemonSet that keeps the agent image cached. When a handful of nodes lost DNS resolution (due to stale Cilium BPF conntrack entries pointing at dead CoreDNS backends), every CI job landing on those nodes failed with ImagePullBackOff even though the image was locally available. Setting default-image-pull-policy: IfNotPresent fixed the copy-agent init container, but the agent and checkout containers still contacted the registry and failed.

Changes

  • scheduler.go: Add ImagePullPolicy to the agent container (line 547)
  • scheduler.go: Add ImagePullPolicy to the checkout container (line 1015)
  • Both use the same pattern as the existing copy-agent container

Behavior

Config Before After
default-image-pull-policy not set copy-agent: Always (tagged) / IfNotPresent (digest). agent/checkout: Always (K8s default) Same — no change
default-image-pull-policy: IfNotPresent copy-agent: IfNotPresent. agent/checkout: still Always All three: IfNotPresent

Default behavior is preserved. The change only takes effect when default-image-pull-policy is explicitly configured.

Test plan

  • Verified the patch compiles
  • Existing unit tests pass
  • Deploy to staging cluster with default-image-pull-policy: IfNotPresent and verify job pods use IfNotPresent on all 3 containers

The `default-image-pull-policy` config option was only applied to the
`copy-agent` init container and `imagecheck-*` containers. The `agent`
and `checkout` containers had no `ImagePullPolicy` set, causing
Kubernetes to default to `Always` for tagged (non-digest) images.

This means every job pod contacts the container registry even when
images are pre-cached by a DaemonSet. If DNS to the registry is
unavailable (e.g. due to stale Cilium BPF conntrack entries pointing
at dead CoreDNS backends), the manifest check fails and the pod
enters ImagePullBackOff — despite the image being locally cached.

Apply the same `cmp.Or(DefaultImagePullPolicy, defaultPullPolicyForImage)`
pattern to the `agent` and `checkout` containers, matching the existing
`copy-agent` behavior. When `default-image-pull-policy` is set to
`IfNotPresent`, all controller-injected containers now skip the
registry check if the image is cached, making job pods resilient to
transient DNS/registry outages.

Default behavior is unchanged: when the config is empty, tagged images
still use `PullAlways` and digest-pinned images use `PullIfNotPresent`.
@DrJosh9000
Copy link
Copy Markdown
Contributor

Hi @seemethere, thanks for the PR! I think this is a reasonably good idea. When a default is configured, it should be applied.

The default agent and checkout image refs should have both a non-latest tag and a digest, so Kubernetes is supposed to use IfNotPresent (https://kubernetes.io/docs/concepts/containers/images/#imagepullpolicy-defaulting). Which likely explains why nobody has pointed this out until now.

I noticed the tests fail, so I've gone ahead and updated them in #842.

@DrJosh9000
Copy link
Copy Markdown
Contributor

This has now been released in v0.40.0. Please let us know how it goes!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants