Describe the bug
Cannot override startupProbe when using exec instead of tcpSocket
To Reproduce
Steps to reproduce the behavior:
- Set startupProbe to:

  ```yaml
  startupProbe:
    exec:
      command:
        - sh
        - -c
        - "exit 0" # this is just for reproducing the error
    initialDelaySeconds: 60
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 30
  ```
- Install with helm install (or, in our case, via Flux CD)
- Check the events: validation error because two kinds of probe handlers are set
- Check the YAML of the StatefulSet: the startupProbe has both exec.command and tcpSocket
- A similar error occurs when explicitly setting tcpSocket to {}, null, or false
- A similar error occurs when setting tcpSocket.port to null, except that port then has the value 0
If port is set under tcpSocket in the chart's values.yaml, my understanding is that Helm merges it with the supplied values, so tcpSocket will always be present in the rendered probe.
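A minimal sketch of what I believe happens during the merge (the chart default shown here is an assumption; the actual default may differ, but it contains a tcpSocket handler):

```yaml
# Assumed chart default (values.yaml):
startupProbe:
  tcpSocket:
    port: 9200

# User override (from the repro above):
startupProbe:
  exec:
    command: ["sh", "-c", "exit 0"]

# Result of Helm's recursive map merge: two handler types,
# which the Kubernetes API rejects:
startupProbe:
  tcpSocket:
    port: 9200
  exec:
    command: ["sh", "-c", "exit 0"]
```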
Chart Name
opensearch, opensearch-dashboards, and data-prepper (though for data-prepper an open PR already exists)
Host/Environment (please complete the following information):
- Helm Version: 3.19.0
- Kubernetes Version: 1.31
Additional context
Easy fix: remove the startupProbe default from values.yaml, because merely checking whether the API answers is, in my opinion, not enough. It is a breaking change, but keeping the current default will lead to bigger problems for users.
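One possible shape for that fix (a sketch, not the chart's actual template): ship no default handler and render whatever the user supplies verbatim:

```yaml
# In the StatefulSet template — renders the probe only if the user sets one:
{{- with .Values.startupProbe }}
startupProbe:
  {{- toYaml . | nindent 2 }}
{{- end }}
```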
Example: step 11 of the OpenSearch documentation for rolling upgrades asks you to "Confirm that the cluster is healthy".
Merely getting an answer on port 9200 tells us nothing about the cluster's health.
With maxUnavailable: 1, this startupProbe/readinessProbe combination could let the next node be restarted before the node that was just restarted is actually ready.
This issue touches on the problem.
But I think doing extensive checks in the readinessProbe could lead to cascading failures (the cluster goes yellow and everything gets restarted).
The startup probe is the right place for such checks.
Hence we need a freely configurable startupProbe that can run more extensive checks and ensure a restarted node is actually ready.
I also think there are no good one-size-fits-all default values for startup, because the speed of a restart depends entirely on the amount of data and how it is sharded.
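For illustration, a hedged sketch of the kind of extensive startup check I have in mind. It polls the _cluster/health API and succeeds only once the cluster reaches at least yellow; the credentials, TLS flags, and thresholds are assumptions that depend on the deployment:

```yaml
startupProbe:
  exec:
    command:
      - sh
      - -c
      # Assumed endpoint and credentials — adapt to your security setup.
      - >
        curl -ksf -u "admin:${OPENSEARCH_PASSWORD}"
        "https://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=5s"
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 10
  failureThreshold: 60
```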