Commit 5d7e065

wait for a given duration in case of imagePullBackOff
We have implemented imagePullBackOff as fail fast. The issue with this approach is that the node where the pod is scheduled often hits a registry rate limit. An image pull failure caused by the rate limit returns the same warning (reason: Failed and message: ImagePullBackOff). The pod can potentially recover after waiting long enough for the rate limit cap to expire. Kubernetes can then successfully pull the image and bring the pod up.

Introduce a default configuration that specifies a cluster-level timeout, allowing the imagePullBackOff to retry for a given duration. Once that duration has passed, return a permanent failure.

#5987 #7184

Signed-off-by: Priti Desai <[email protected]>
1 parent 1568ed1 commit 5d7e065

4 files changed, +84 -3 lines changed

config/config-defaults.yaml

Lines changed: 4 additions & 0 deletions
@@ -87,6 +87,10 @@ data:
     # no default-resolver-type is specified by default
     default-resolver-type:
 
+    # default-imagepullbackoff-timeout contains the default number of minutes to wait
+    # before requeuing the TaskRun to retry
+    # default-imagepullbackoff-timeout: "5"
+
     # default-container-resource-requirements allow users to update default resource requirements
     # to a init-containers and containers of a pods create by the controller
     # Onet: All the resource requirements are applied to init-containers and containers

docs/additional-configs.md

Lines changed: 21 additions & 0 deletions
@@ -31,6 +31,7 @@ installation.
 - [Verify the transparency logs using `rekor-cli`](#verify-the-transparency-logs-using-rekor-cli)
 - [Verify Tekton Resources](#verify-tekton-resources)
 - [Pipelinerun with Affinity Assistant](#pipelineruns-with-affinity-assistant)
+- [TaskRuns with `imagePullBackOff` Timeout](#taskruns-with-imagepullbackoff-timeout)
 - [Next steps](#next-steps)
 
 
@@ -672,6 +673,26 @@ please take a look at [Trusted Resources](./trusted-resources.md).
 The cluster operators can review the [guidelines](developers/affinity-assistant.md) to `cordon` a node in the cluster
 with the tekton controller and the affinity assistant is enabled.
 
+## TaskRuns with `imagePullBackOff` Timeout
+
+Tekton pipelines has adopted a fail fast strategy with a taskRun failing with `TaskRunImagePullFailed` in case of an
+`imagePullBackOff`. This can be limited in some cases, and it generally depends on the infrastructure. To allow the
+cluster operators to decide whether to wait in case of an `imagePullBackOff`, a setting is available to configure
+the wait time in minutes such that the controller will wait for the specified duration before declaring a failure.
+For example, with the following `config-defaults`, the controller does not mark the taskRun as failure for 5 minutes since
+the pod is scheduled in case the image pull fails with `imagePullBackOff`.
+See issue https://github.com/tektoncd/pipeline/issues/5987 for more details.
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: config-defaults
+  namespace: tekton-pipelines
+data:
+  default-imagepullbackoff-timeout: "5"
+```
+
 ## Next steps
 
 To get started with Tekton check the [Introductory tutorials][quickstarts],

pkg/apis/config/default.go

Lines changed: 14 additions & 0 deletions
@@ -49,6 +49,8 @@ const (
 	// default resource requirements, will be applied to all the containers, which has empty resource requirements
 	ResourceRequirementDefaultContainerKey = "default"
 
+	DefaultImagePullBackOffTimeout = 0
+
 	defaultTimeoutMinutesKey = "default-timeout-minutes"
 	defaultServiceAccountKey = "default-service-account"
 	defaultManagedByLabelValueKey = "default-managed-by-label-value"
@@ -60,6 +62,7 @@ const (
 	defaultForbiddenEnv = "default-forbidden-env"
 	defaultResolverTypeKey = "default-resolver-type"
 	defaultContainerResourceRequirementsKey = "default-container-resource-requirements"
+	defaultImagePullBackOffTimeout = "default-imagepullbackoff-timeout"
 )
 
 // DefaultConfig holds all the default configurations for the config.
@@ -79,6 +82,7 @@ type Defaults struct {
 	DefaultForbiddenEnv []string
 	DefaultResolverType string
 	DefaultContainerResourceRequirements map[string]corev1.ResourceRequirements
+	DefaultImagePullBackOffTimeout int
 }
 
 // GetDefaultsConfigName returns the name of the configmap containing all
@@ -109,6 +113,7 @@ func (cfg *Defaults) Equals(other *Defaults) bool {
 		other.DefaultTaskRunWorkspaceBinding == cfg.DefaultTaskRunWorkspaceBinding &&
 		other.DefaultMaxMatrixCombinationsCount == cfg.DefaultMaxMatrixCombinationsCount &&
 		other.DefaultResolverType == cfg.DefaultResolverType &&
+		other.DefaultImagePullBackOffTimeout == cfg.DefaultImagePullBackOffTimeout &&
 		reflect.DeepEqual(other.DefaultForbiddenEnv, cfg.DefaultForbiddenEnv)
 }
 
@@ -121,6 +126,7 @@ func NewDefaultsFromMap(cfgMap map[string]string) (*Defaults, error) {
 		DefaultCloudEventsSink: DefaultCloudEventSinkValue,
 		DefaultMaxMatrixCombinationsCount: DefaultMaxMatrixCombinationsCount,
 		DefaultResolverType: DefaultResolverTypeValue,
+		DefaultImagePullBackOffTimeout: DefaultImagePullBackOffTimeout,
 	}
 
 	if defaultTimeoutMin, ok := cfgMap[defaultTimeoutMinutesKey]; ok {
@@ -191,6 +197,14 @@ func NewDefaultsFromMap(cfgMap map[string]string) (*Defaults, error) {
 		tc.DefaultContainerResourceRequirements = resourceRequirementsValue
 	}
 
+	if defaultImagePullBackOff, ok := cfgMap[defaultImagePullBackOffTimeout]; ok {
+		timeout, err := strconv.ParseInt(defaultImagePullBackOff, 10, 0)
+		if err != nil {
+			return nil, fmt.Errorf("failed parsing tracing config %q", defaultImagePullBackOffTimeout)
+		}
+		tc.DefaultImagePullBackOffTimeout = int(timeout)
+	}
+
 	return &tc, nil
 }
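
The parsing step added to `NewDefaultsFromMap` can be tried in isolation with a sketch like the one below. It is an illustration only: `parseImagePullBackOffTimeout` and `imagePullBackOffTimeoutKey` are hypothetical names, `strconv.Atoi` stands in for the `strconv.ParseInt` call in the diff, and the error wording here names the key and wraps the underlying parse error, which is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// imagePullBackOffTimeoutKey mirrors the ConfigMap key introduced by the commit.
const imagePullBackOffTimeoutKey = "default-imagepullbackoff-timeout"

// parseImagePullBackOffTimeout is a hypothetical standalone version of the
// parsing step. A missing key yields 0, which leaves waiting disabled.
func parseImagePullBackOffTimeout(cfgMap map[string]string) (int, error) {
	raw, ok := cfgMap[imagePullBackOffTimeoutKey]
	if !ok {
		return 0, nil
	}
	timeout, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("failed parsing default config %q: %w", imagePullBackOffTimeoutKey, err)
	}
	return timeout, nil
}

func main() {
	for _, cfg := range []map[string]string{
		{},                                   // key absent
		{imagePullBackOffTimeoutKey: "5"},    // valid number of minutes
		{imagePullBackOffTimeoutKey: "five"}, // not an integer
	} {
		timeout, err := parseImagePullBackOffTimeout(cfg)
		fmt.Println(timeout, err)
	}
}
```

Because the reconciler only waits when the value is greater than zero, a negative value parsed this way would behave the same as the default of 0.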

pkg/reconciler/taskrun/taskrun.go

Lines changed: 45 additions & 3 deletions
@@ -94,13 +94,15 @@ type Reconciler struct {
 	tracerProvider trace.TracerProvider
 }
 
+const ImagePullBackOff = "ImagePullBackOff"
+
 var (
 	// Check that our Reconciler implements taskrunreconciler.Interface
 	_ taskrunreconciler.Interface = (*Reconciler)(nil)
 
 	// Pod failure reasons that trigger failure of the TaskRun
 	podFailureReasons = map[string]struct{}{
-		"ImagePullBackOff": {},
+		ImagePullBackOff: {},
 		"InvalidImageName": {},
 	}
 )
@@ -171,7 +173,7 @@ func (c *Reconciler) ReconcileKind(ctx context.Context, tr *v1.TaskRun) pkgrecon
 	}
 
 	// Check for Pod Failures
-	if failed, reason, message := c.checkPodFailed(tr); failed {
+	if failed, reason, message := c.checkPodFailed(tr, ctx); failed {
 		err := c.failTaskRun(ctx, tr, reason, message)
 		return c.finishReconcileUpdateEmitEvents(ctx, tr, before, err)
 	}
@@ -222,10 +224,30 @@ func (c *Reconciler) ReconcileKind(ctx context.Context, tr *v1.TaskRun) pkgrecon
 	return nil
 }
 
-func (c *Reconciler) checkPodFailed(tr *v1.TaskRun) (bool, v1.TaskRunReason, string) {
+func (c *Reconciler) checkPodFailed(tr *v1.TaskRun, ctx context.Context) (bool, v1.TaskRunReason, string) {
 	for _, step := range tr.Status.Steps {
 		if step.Waiting != nil {
 			if _, found := podFailureReasons[step.Waiting.Reason]; found {
+				if step.Waiting.Reason == ImagePullBackOff {
+					imagePullBackOffTimeOut := config.FromContextOrDefaults(ctx).Defaults.DefaultImagePullBackOffTimeout
+					// only attempt to recover from the imagePullBackOff if specified
+					if imagePullBackOffTimeOut > 0 {
+						p, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
+						if err != nil {
+							message := fmt.Sprintf(`The step %q in TaskRun %q failed to pull the image %q. The pod could not be retrieved with error: "%s."`, step.Name, tr.Name, step.ImageID, err)
+							return true, v1.TaskRunReasonImagePullFailed, message
+						}
+						for _, condition := range p.Status.Conditions {
+							// check the pod condition to get the time when the pod was scheduled
+							// keep trying until the pod schedule time has exceeded the specified imagePullBackOff timeout duration
+							if condition.Type == corev1.PodScheduled {
+								if c.Clock.Since(condition.LastTransitionTime.Time) < time.Duration(imagePullBackOffTimeOut)*time.Minute {
+									return false, "", ""
+								}
+							}
+						}
+					}
+				}
 				image := step.ImageID
 				message := fmt.Sprintf(`The step %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, step.Name, tr.Name, image, step.Waiting.Message)
 				return true, v1.TaskRunReasonImagePullFailed, message
@@ -235,6 +257,26 @@ func (c *Reconciler) checkPodFailed(tr *v1.TaskRun) (bool, v1.TaskRunReason, str
 	for _, sidecar := range tr.Status.Sidecars {
 		if sidecar.Waiting != nil {
 			if _, found := podFailureReasons[sidecar.Waiting.Reason]; found {
+				if sidecar.Waiting.Reason == ImagePullBackOff {
+					imagePullBackOffTimeOut := config.FromContextOrDefaults(ctx).Defaults.DefaultImagePullBackOffTimeout
+					// only attempt to recover from the imagePullBackOff if specified
+					if imagePullBackOffTimeOut > 0 {
+						p, err := c.KubeClientSet.CoreV1().Pods(tr.Namespace).Get(ctx, tr.Status.PodName, metav1.GetOptions{})
+						if err != nil {
+							message := fmt.Sprintf(`The step %q in TaskRun %q failed to pull the image %q. The pod could not be retrieved with error: "%s."`, sidecar.Name, tr.Name, sidecar.ImageID, err)
+							return true, v1.TaskRunReasonImagePullFailed, message
+						}
+						for _, condition := range p.Status.Conditions {
+							// check the pod condition to get the time when the pod was scheduled
+							// keep trying until the pod schedule time has exceeded the specified imagePullBackOff timeout duration
+							if condition.Type == corev1.PodScheduled {
+								if c.Clock.Since(condition.LastTransitionTime.Time) < time.Duration(imagePullBackOffTimeOut)*time.Minute {
+									return false, "", ""
+								}
+							}
+						}
+					}
+				}
 				image := sidecar.ImageID
 				message := fmt.Sprintf(`The sidecar %q in TaskRun %q failed to pull the image %q. The pod errored with the message: "%s."`, sidecar.Name, tr.Name, image, sidecar.Waiting.Message)
 				return true, v1.TaskRunReasonImagePullFailed, message
