
[BUG][Opensearch] Cannot override startupProbe / incorrect check in startupProbe #703

@budickdaDE

Description

Describe the bug
Cannot override startupProbe when using exec instead of tcpSocket

To Reproduce
Steps to reproduce the behavior:

  1. Set startupProbe to:

```yaml
startupProbe:
  exec:
    command:
      - sh
      - -c
      - "exit 0" # this is just for reproducing the error
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 30
```

  2. helm install to k8s, or in our case via fluxcd
  3. Check the events: validation error because of two kinds of probes
  4. Check the YAML of the statefulset: startupProbe has both exec.command and tcpSocket
  5. Similar error when explicitly setting tcpSocket to {}, null, or false
  6. Similar error when setting tcpSocket.port to null, except that port then has the value 0

If a port is set under tcpSocket in the chart's values.yaml, my understanding is that Helm merges the chart defaults with the user-supplied values, so tcpSocket will always be present.
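As a sketch of that merge behavior (the chart's actual default values may differ; the tcpSocket default below is assumed for illustration):

```yaml
# Assumed chart default in values.yaml (illustrative):
startupProbe:
  tcpSocket:
    port: 9200
  failureThreshold: 30

# User override, exec only:
startupProbe:
  exec:
    command: ["sh", "-c", "exit 0"]

# Helm merges maps key by key, so the rendered probe ends up
# with both handlers, which the Kubernetes API rejects:
startupProbe:
  exec:
    command: ["sh", "-c", "exit 0"]
  tcpSocket:
    port: 9200
  failureThreshold: 30
```

A probe with more than one handler fails API validation, which matches the event seen in step 3.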

Chart Name
opensearch, opensearch-dashboards, and data-prepper (for data-prepper an open PR already exists)


Host/Environment (please complete the following information):

  • Helm Version: 3.19.0
  • Kubernetes Version: 1.31

Additional context
Easy fix: remove the default startupProbe from values.yaml, because just checking whether the API is reachable is imo not enough. It is a breaking change, but keeping the current default will lead to bigger problems for users.

Example: The opensearch documentation for rolling upgrades asks in step 11:

Confirm that the cluster is healthy

Only checking that we get an answer on port 9200 tells us nothing about the cluster health.
With maxUnavailable: 1, this startupProbe/readinessProbe combination could restart the next node before the just-restarted node is actually ready.
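With a freely overridable startupProbe, startup could instead be gated on actual cluster health, e.g. (a sketch; it assumes plain HTTP on localhost:9200 without the security plugin, and the timings are placeholders to tune per cluster):

```yaml
startupProbe:
  exec:
    command:
      - sh
      - -c
      - >-
        curl -sf "http://localhost:9200/_cluster/health?local=true"
        | grep -qE '"status":"(green|yellow)"'
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 30
```

The probe only succeeds once the node reports a green or yellow status, instead of merely accepting TCP connections.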

This issue touches on the problem.
But I think doing extensive checks in the readinessProbe could lead to cascading failures (the cluster goes yellow and everything gets restarted).
The startup probe is the perfect place for them.
Hence we need a freely configurable startupProbe that can run more extensive checks, to ensure that a restarted node is actually ready.
I also think there are no good one-size-fits-all default values for startup, because the speed of a restart depends entirely on the amount of data and how it is sharded.
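One possible shape for such a fix (illustrative only, not the chart's actual template): ship an empty default and render whatever the user supplies verbatim, instead of merging it onto a tcpSocket default:

```yaml
# values.yaml: no probe handler baked in
startupProbe: {}
```

```yaml
# statefulset.yaml (template fragment)
{{- with .Values.startupProbe }}
startupProbe:
  {{- toYaml . | nindent 2 }}
{{- end }}
```

Because an empty map is falsy in Go templates, `with` renders nothing by default, so users get exactly the probe they configured or none at all.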

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), untriaged (Issues that have not yet been triaged)
