ArgoCD Application Controller replica hangs/stuck during initialization of the cluster cache (v2.4.12) #10842

@rahul-mourya

Checklist:

  • I've searched in the docs and FAQ for my answer: https://bit.ly/argocd-faq.
  • I've included steps to reproduce the bug.
  • I've pasted the output of argocd version.

Describe the bug

Some applications get stuck in the refresh operation indefinitely, until the application controller is restarted. Our Argo CD deployment runs 3 application controller replicas and 2 argocd-server replicas.
After some debugging, it looks like one of the application controller replicas does not complete the invalidation of the live state cache and its reinitialization. Once a live state cache invalidation is triggered, the problematic replica stops the automatic reconciliation of the applications it is responsible for. From then on that replica logs very little, its memory consumption stays constant, and its CPU usage drops to almost zero. The symptoms point to a possible deadlock during reinitialization of the cluster cache.
The problem affects only the applications on clusters handled by the problematic application controller replica.

Restarting the application controller StatefulSet seems to resolve the issue.
A similar issue is reported in #8116.

To Reproduce

Restarting or deleting one of the two argocd-server pods seems to trigger invalidation of the cluster cache in all application controller replicas, after which one of the replicas shows the hang symptoms described above.

Expected behavior

Invalidation of the cluster cache and its reinitialization should complete without issues on every application controller replica. No application should be stuck in the refresh operation, and automatic reconciliation of applications should keep running.

Screenshots

Version
v2.4.12

argocd-server: v2.4.12+28b8fea.dirty
  BuildDate: 2022-09-19T03:28:31Z
  GitCommit: 28b8fea2e68a931543b05e988e78241eb9487058
  GitTreeState: dirty
  GoVersion: go1.18.6
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: unknown 1970-01-01T00:00:00Z
  Helm Version: v3.9.0+g7ceeda6
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0

Logs

Here are the logs from the time of the cache invalidation, with cluster URLs redacted for privacy reasons. The entries appear to be listed newest first.

Logs from another replica, where invalidation and reinitialization of the cluster cache completed properly
Note: To validate reinitialization of the cache, look for the "live state cache invalidated" log entry, followed by "Start syncing cluster" entries for all the clusters assigned to this replica.


  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Start syncing cluster" server="https://redacted"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="live state cache invalidated"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://kubernetes.default.svc"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted 3"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="invalidating live state cache"

Logs from the problematic replica
Note: After the live state cache invalidation is triggered, there are no "live state cache invalidated" or "Start syncing cluster" log entries.


  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"
  |   | 2022-09-21 08:57:38.5738 | (no unique labels) | time="2022-09-21T03:27:38Z" level=info msg="invalidating live state cache"
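
The check described in the two notes above can be automated. The following is a minimal sketch, not part of Argo CD: the function name invalidation_status and the returned keys are hypothetical, and it assumes the log lines carry a msg="..." field in the format shown in this issue. It flags a replica that logged "invalidating live state cache" without a matching "live state cache invalidated".

```python
import re

# Matches the msg="..." field of a controller log line as shown in this issue.
MSG_RE = re.compile(r'msg="([^"]*)"')


def invalidation_status(log_lines):
    """Summarise one replica's log lines around a cache invalidation.

    Hypothetical helper: order of the lines does not matter, so it works
    for newest-first and oldest-first exports alike.
    """
    messages = [m.group(1) for line in log_lines if (m := MSG_RE.search(line))]
    started = "invalidating live state cache" in messages
    finished = "live state cache invalidated" in messages
    return {
        "started": started,
        "finished": finished,
        "invalidated_clusters": messages.count("Invalidated cluster"),
        "syncing_clusters": messages.count("Start syncing cluster"),
        # Invalidation began but never reported completion.
        "suspect_hang": started and not finished,
    }


if __name__ == "__main__":
    # Sample lines modelled on the problematic replica's logs above.
    stuck = [
        'time="2022-09-21T03:27:38Z" level=info msg="Invalidated cluster" server="https://redacted"',
        'time="2022-09-21T03:27:38Z" level=info msg="invalidating live state cache"',
    ]
    print(invalidation_status(stuck))
```

A log window that ends too early would also be reported as a suspected hang, so the export should extend well past the invalidation timestamp before the result is trusted.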

Labels: bug, component:core
